In the past I have mostly written about ‘classical’ Machine Learning, like Naive Bayes classification, Logistic Regression, and the Perceptron algorithm. In the past year I have also worked with Deep Learning techniques, and I would like to share with you how to make and train a Convolutional Neural Network from scratch, using tensorflow. Later on we can use this knowledge as a building block to make interesting Deep Learning applications.

For this you will need to have tensorflow installed (see installation instructions) and you should also have a basic understanding of Python programming and the theory behind Convolutional Neural Networks. After you have installed tensorflow, you can run the smaller Neural Networks without a GPU, but for the deeper networks you will definitely need some GPU power.

The Internet is full of awesome websites and courses which explain how a convolutional neural network works. Some of them have good visualisations which make it easy to understand. I don’t feel the need to explain the same things again, so before you continue, make sure you understand how a convolutional neural network works. For example:

- What is a convolutional layer, and what is the filter of this convolutional layer?
- What is an activation layer (ReLU (the most widely used), sigmoid or tanh activation)?
- What is a pooling layer (max pooling / average pooling), dropout?
- How does Stochastic Gradient Descent work?

The contents of this blog post are as follows:

- Tensorflow basics:
- 1.1 Constants and Variables
- 1.2 Tensorflow Graphs and Sessions
- 1.3 Placeholders and feed_dicts

- Neural Networks in Tensorflow
- 2.1 Introduction
- 2.2 Loading in the data
- 2.3 Creating a (simple) 1-layer Neural Network
- 2.4 The many faces of Tensorflow
- 2.5 Creating the LeNet5 CNN
- 2.6 How the parameters affect the output size of a layer
- 2.7 Adjusting the LeNet5 architecture
- 2.8 Impact of Learning Rate and Optimizer

- Deep Neural Networks in Tensorflow
- 3.1 AlexNet
- 3.2 VGG Net-16
- 3.3 AlexNet Performance

- Final words

Here I will give a short introduction to Tensorflow for people who have never worked with it before. If you want to start building Neural Networks immediately, or you are already familiar with Tensorflow, you can go ahead and skip to section 2. If you would like to know more about Tensorflow, you can also have a look at this repository, or the notes of lecture 1 and lecture 2 of Stanford’s CS20SI course.

The most basic units within tensorflow are Constants, Variables and Placeholders.

The difference between a tf.constant() and a tf.Variable() should be clear: a constant has a constant value, and once you have set it, it cannot be changed. The value of a Variable can be changed after it has been set, but its type and shape cannot be changed.

```python
#We can create constants and variables of different types.
#However, the different types do not mix well together.
a = tf.constant(2, tf.int16)
b = tf.constant(4, tf.float32)
c = tf.constant(8, tf.float32)

d = tf.Variable(2, tf.int16)
e = tf.Variable(4, tf.float32)
f = tf.Variable(8, tf.float32)

#we can perform computations on variables of the same type:
e + f
#but the following can not be done:
#d + e

#everything in tensorflow is a tensor; these can have different dimensions:
#0D, 1D, 2D, 3D, 4D, or nD-tensors
g = tf.constant(np.zeros(shape=(2, 2), dtype=np.float32))  #does work

h = tf.zeros([11], tf.int16)
i = tf.ones([2, 2], tf.float32)
j = tf.zeros([1000, 4, 3], tf.float64)

k = tf.Variable(tf.zeros([2, 2], tf.float32))
l = tf.Variable(tf.zeros([5, 6, 5], tf.float32))
```

Besides tf.zeros() and tf.ones(), which create a Tensor initialized to zeros or ones (see here), there is also the tf.random_normal() function, which creates a tensor filled with values picked randomly from a normal distribution (the default distribution has a mean of 0.0 and a stddev of 1.0).

There is also the tf.truncated_normal() function, which creates a Tensor with values randomly picked from a normal distribution, where values further than two standard deviations from the mean are dropped and re-picked.
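To make the difference concrete, here is a NumPy sketch of the truncated-normal idea (a toy re-implementation for illustration, not TensorFlow’s actual code): any value that lands more than two standard deviations from the mean is re-drawn.

```python
import numpy as np

#A toy re-implementation of the truncated-normal idea:
#draw from a normal distribution and re-draw every value that falls
#more than two standard deviations away from the mean.
def truncated_normal(shape, mean=0.0, stddev=1.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = rng.normal(mean, stddev, size=shape)
    mask = np.abs(samples - mean) > 2 * stddev
    while mask.any():
        samples[mask] = rng.normal(mean, stddev, size=int(mask.sum()))
        mask = np.abs(samples - mean) > 2 * stddev
    return samples

w = truncated_normal((256, 10), stddev=0.1)
print(np.abs(w).max() <= 0.2)  #True: every value lies within two stddevs
```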

With this knowledge, we can already create weight matrices and bias vectors which can be used in a neural network.

```python
weights = tf.Variable(tf.truncated_normal([256 * 256, 10]))
biases = tf.Variable(tf.zeros([10]))

print(weights.get_shape().as_list())
print(biases.get_shape().as_list())

>>> [65536, 10]
>>> [10]
```

In Tensorflow, all of the different Variables and the operations done on these Variables are saved in a Graph. After you have built a Graph which contains all of the computational steps necessary for your model, you can run this Graph within a Session. This Session then distributes all of the computations across the available CPU and GPU resources.

```python
graph = tf.Graph()
with graph.as_default():
    a = tf.Variable(8, tf.float32)
    b = tf.Variable(tf.zeros([2, 2], tf.float32))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print(a)
    print(session.run(a))
    print(session.run(b))

>>> <tf.Variable 'Variable:0' shape=() dtype=int32_ref>
>>> 8
>>> [[ 0.  0.]
>>>  [ 0.  0.]]
```

We have seen the various forms in which we can create constants and variables. Tensorflow also has placeholders; these do not require an initial value and only serve to allocate the necessary amount of memory. During a session, these placeholders can be filled in with (external) data via a *feed_dict*.

Below is an example of the usage of a placeholder.

```python
list_of_points1_ = [[1, 2], [3, 4], [5, 6], [7, 8]]
list_of_points2_ = [[15, 16], [13, 14], [11, 12], [9, 10]]

list_of_points1 = np.array([np.array(elem).reshape(1, 2) for elem in list_of_points1_])
list_of_points2 = np.array([np.array(elem).reshape(1, 2) for elem in list_of_points2_])

graph = tf.Graph()
with graph.as_default():
    #we use a tf.placeholder() to create a variable whose value we will fill in later (during session.run()).
    #this can be done by 'feeding' the data into the placeholder.
    #below we see an example of a method which uses two placeholders of shape (1, 2) to calculate the Euclidean distance
    point1 = tf.placeholder(tf.float32, shape=(1, 2))
    point2 = tf.placeholder(tf.float32, shape=(1, 2))

    def calculate_euclidean_distance(point1, point2):
        difference = tf.subtract(point1, point2)
        power2 = tf.pow(difference, tf.constant(2.0, shape=(1, 2)))
        add = tf.reduce_sum(power2)
        euclidean_distance = tf.sqrt(add)
        return euclidean_distance

    dist = calculate_euclidean_distance(point1, point2)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    for ii in range(len(list_of_points1)):
        point1_ = list_of_points1[ii]
        point2_ = list_of_points2[ii]
        feed_dict = {point1: point1_, point2: point2_}
        distance = session.run([dist], feed_dict=feed_dict)
        print("the distance between {} and {} -> {}".format(point1_, point2_, distance))

>>> the distance between [[1 2]] and [[15 16]] -> [19.79899]
>>> the distance between [[3 4]] and [[13 14]] -> [14.142136]
>>> the distance between [[5 6]] and [[11 12]] -> [8.485281]
>>> the distance between [[7 8]] and [[ 9 10]] -> [2.8284271]
```

The graph containing the Neural Network (illustrated in the image above) should contain the following steps:

- The **input datasets**: the training dataset and labels, the test dataset and labels (and the validation dataset and labels). The test and validation datasets can be placed inside a tf.constant(), while the training dataset is placed in a tf.placeholder() so that it can be fed in batches during the training (stochastic gradient descent).
- The Neural Network **model** with all of its layers. This can be a simple fully connected neural network consisting of only 1 layer, or a more complicated neural network consisting of 5, 9, 16 etc. layers.
- The **weight** matrices and **bias** vectors, defined in the proper shape and initialized to their initial values. (One weight matrix and bias vector per layer.)
- The **loss** value: the model outputs the logit vector (estimated training labels), and by comparing the logits with the actual labels we can calculate the loss value (with the softmax with cross-entropy function). The loss value is an indication of how close the estimated training labels are to the actual training labels and will be used to update the weight values.
- An **optimizer**, which will use the calculated loss value to update the weights and biases with backpropagation.
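The softmax-with-cross-entropy loss mentioned above can be sketched in plain NumPy (a toy re-implementation for illustration, not the actual tf.nn.softmax_cross_entropy_with_logits code):

```python
import numpy as np

#A toy sketch of the softmax-with-cross-entropy loss:
#turn the logits into probabilities, then average the negative
#log-likelihood of the correct class over the batch.
def softmax(logits):
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  #stabilized
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, one_hot_labels):
    probs = softmax(logits)
    return -np.mean(np.sum(one_hot_labels * np.log(probs), axis=1))

logits = np.array([[2.0, 1.0, 0.1]])
labels = np.array([[1.0, 0.0, 0.0]])  #the correct class is class 0
print(round(cross_entropy(logits, labels), 3))  #0.417
```

The closer the logits are to the one-hot labels, the smaller this loss value becomes, which is exactly what the optimizer exploits.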

Let’s load the datasets which are going to be used to train and test the Neural Networks. For this we will download the MNIST and CIFAR-10 datasets. The MNIST dataset contains 60,000 images of handwritten digits, where each image is of size 28 x 28 x 1 (grayscale). The CIFAR-10 dataset contains 60,000 colour images (3 channels) of size 32 x 32 x 3, of 10 different objects (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Since there are 10 different objects in each dataset, both datasets contain 10 labels.

First, let’s define some methods which are convenient for loading and reshaping the data into the necessary format.

```python
def randomize(dataset, labels):
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation, :, :]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels

def one_hot_encode(np_array):
    return (np.arange(10) == np_array[:, None]).astype(np.float32)

def reformat_data(dataset, labels, image_width, image_height, image_depth):
    np_dataset_ = np.array([np.array(image_data).reshape(image_width, image_height, image_depth)
                            for image_data in dataset])
    np_labels_ = one_hot_encode(np.array(labels, dtype=np.float32))
    np_dataset, np_labels = randomize(np_dataset_, np_labels_)
    return np_dataset, np_labels

def flatten_tf_array(array):
    shape = array.get_shape().as_list()
    return tf.reshape(array, [shape[0], shape[1] * shape[2] * shape[3]])

def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0])
```

These are methods for one-hot encoding the labels, loading the data into a randomized array, and flattening an array (since a fully connected network needs a flat array as its input).
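In isolation, the one-hot encoding helper from above behaves as follows:

```python
import numpy as np

#The one-hot encoding helper from above, shown in isolation:
#comparing each label against np.arange(10) produces a row with a
#single 1.0 at the position of the label.
def one_hot_encode(np_array):
    return (np.arange(10) == np_array[:, None]).astype(np.float32)

labels = np.array([0, 3, 9])
encoded = one_hot_encode(labels)
print(encoded.shape)           #(3, 10)
print(np.argmax(encoded, 1))   #[0 3 9]: argmax recovers the original labels
```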

After we have defined these necessary functions, we can load the MNIST and CIFAR-10 datasets with:

```python
mnist_folder = './data/mnist/'
mnist_image_width = 28
mnist_image_height = 28
mnist_image_depth = 1
mnist_num_labels = 10

mndata = MNIST(mnist_folder)
mnist_train_dataset_, mnist_train_labels_ = mndata.load_training()
mnist_test_dataset_, mnist_test_labels_ = mndata.load_testing()

mnist_train_dataset, mnist_train_labels = reformat_data(mnist_train_dataset_, mnist_train_labels_,
                                                        mnist_image_width, mnist_image_height, mnist_image_depth)
mnist_test_dataset, mnist_test_labels = reformat_data(mnist_test_dataset_, mnist_test_labels_,
                                                      mnist_image_width, mnist_image_height, mnist_image_depth)

print("There are {} images, each of size {}".format(len(mnist_train_dataset), len(mnist_train_dataset[0])))
print("Meaning each image has the size of 28*28*1 = {}".format(mnist_image_width * mnist_image_height * 1))
print("The training set contains the following {} labels: {}".format(len(np.unique(mnist_train_labels_)), np.unique(mnist_train_labels_)))

print('Training set shape', mnist_train_dataset.shape, mnist_train_labels.shape)
print('Test set shape', mnist_test_dataset.shape, mnist_test_labels.shape)

train_dataset_mnist, train_labels_mnist = mnist_train_dataset, mnist_train_labels
test_dataset_mnist, test_labels_mnist = mnist_test_dataset, mnist_test_labels

######################################################################################

cifar10_folder = './data/cifar10/'
train_datasets = ['data_batch_1', 'data_batch_2', 'data_batch_3', 'data_batch_4', 'data_batch_5']
test_dataset = ['test_batch']

c10_image_height = 32
c10_image_width = 32
c10_image_depth = 3
c10_num_labels = 10

with open(cifar10_folder + test_dataset[0], 'rb') as f0:
    c10_test_dict = pickle.load(f0, encoding='bytes')

c10_test_dataset, c10_test_labels = c10_test_dict[b'data'], c10_test_dict[b'labels']
test_dataset_cifar10, test_labels_cifar10 = reformat_data(c10_test_dataset, c10_test_labels,
                                                          c10_image_width, c10_image_height, c10_image_depth)

c10_train_dataset, c10_train_labels = [], []
for train_dataset in train_datasets:
    with open(cifar10_folder + train_dataset, 'rb') as f0:
        c10_train_dict = pickle.load(f0, encoding='bytes')
        c10_train_dataset_, c10_train_labels_ = c10_train_dict[b'data'], c10_train_dict[b'labels']

        c10_train_dataset.append(c10_train_dataset_)
        c10_train_labels += c10_train_labels_

c10_train_dataset = np.concatenate(c10_train_dataset, axis=0)
train_dataset_cifar10, train_labels_cifar10 = reformat_data(c10_train_dataset, c10_train_labels,
                                                            c10_image_width, c10_image_height, c10_image_depth)
del c10_train_dataset
del c10_train_labels

print("The training set contains the following labels: {}".format(np.unique(c10_train_dict[b'labels'])))
print('Training set shape', train_dataset_cifar10.shape, train_labels_cifar10.shape)
print('Test set shape', test_dataset_cifar10.shape, test_labels_cifar10.shape)
```

You can download the MNIST dataset from Yann LeCun’s website. After you have downloaded and unzipped the files, you can load the data with the python-mnist tool. CIFAR-10 can be downloaded from here.

The simplest form of a Neural Network is a 1-layer linear Fully Connected Neural Network (FCNN). Mathematically it consists of a matrix multiplication.

It is best to start with such a simple NN in tensorflow, and later on look at the more complicated Neural Networks. When we start looking at these more complicated Neural Networks, only the model (step 2) and weights (step 3) part of the Graph will change and the other steps will remain the same.
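Stripped of TensorFlow, the forward pass of such a 1-layer FCNN is a single matrix multiplication. Here is a NumPy sketch with illustrative MNIST-like shapes (the sizes and random data are assumptions, chosen only to show the shapes involved):

```python
import numpy as np

#A minimal NumPy sketch of the 1-layer FCNN forward pass:
#flatten each image and multiply with a weight matrix, then add a bias.
#Shapes are illustrative: a batch of 32 MNIST-like 28 x 28 x 1 images, 10 labels.
rng = np.random.default_rng(0)
batch = rng.random((32, 28, 28, 1)).astype(np.float32)
weights = rng.standard_normal((28 * 28 * 1, 10)).astype(np.float32)
bias = np.zeros(10, dtype=np.float32)

flat = batch.reshape(batch.shape[0], -1)  #flatten to shape (32, 784)
logits = flat @ weights + bias            #the matrix multiplication: (32, 10)
print(logits.shape)  #(32, 10): one logit vector per image
```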

We can make such a 1-layer FCNN as follows:

```python
#parameters determining the model size
image_width = mnist_image_width
image_height = mnist_image_height
image_depth = mnist_image_depth
num_labels = mnist_num_labels

#the dataset
train_dataset = mnist_train_dataset
train_labels = mnist_train_labels
test_dataset = mnist_test_dataset
test_labels = mnist_test_labels

#number of iterations, batch size and learning rate
num_steps = 10001
display_step = 1000
batch_size = 64
learning_rate = 0.5

graph = tf.Graph()
with graph.as_default():
    #1) First we put the input data in a tensorflow friendly form.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_test_dataset = tf.constant(test_dataset, tf.float32)

    #2) Then, the weight matrices and bias vectors are initialized.
    #By default, tf.truncated_normal() is used for the weight matrix and tf.zeros() for the bias vector.
    weights = tf.Variable(tf.truncated_normal([image_width * image_height * image_depth, num_labels]), tf.float32)
    bias = tf.Variable(tf.zeros([num_labels]), tf.float32)

    #3) define the model:
    #a one-layer FCNN simply consists of a matrix multiplication
    def model(data, weights, bias):
        return tf.matmul(flatten_tf_array(data), weights) + bias

    logits = model(tf_train_dataset, weights, bias)

    #4) calculate the loss, which will be used in the optimization of the weights
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

    #5) Choose an optimizer. Many are available.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    #6) The predicted values for the images in the train and test dataset are assigned to train_prediction and test_prediction.
    #This is only necessary if you want to know the accuracy by comparing them with the actual values.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, weights, bias))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        #select a batch from the training dataset and feed it into the placeholders
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if step % display_step == 0:
            train_accuracy = accuracy(predictions, batch_labels)
            test_accuracy = accuracy(test_prediction.eval(), test_labels)
            message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
            print(message)
```

```
>>> Initialized
>>> step 0000 : loss is 2349.55, accuracy on training set 10.43 %, accuracy on test set 34.12 %
>>> step 0100 : loss is 3612.48, accuracy on training set 89.26 %, accuracy on test set 90.15 %
>>> step 0200 : loss is 2634.40, accuracy on training set 91.10 %, accuracy on test set 91.26 %
>>> step 0300 : loss is 2109.42, accuracy on training set 91.62 %, accuracy on test set 91.56 %
>>> step 0400 : loss is 2093.56, accuracy on training set 91.85 %, accuracy on test set 91.67 %
>>> step 0500 : loss is 2325.58, accuracy on training set 91.83 %, accuracy on test set 91.67 %
>>> step 0600 : loss is 22140.44, accuracy on training set 68.39 %, accuracy on test set 75.06 %
>>> step 0700 : loss is 5920.29, accuracy on training set 83.73 %, accuracy on test set 87.76 %
>>> step 0800 : loss is 9137.66, accuracy on training set 79.72 %, accuracy on test set 83.33 %
>>> step 0900 : loss is 15949.15, accuracy on training set 69.33 %, accuracy on test set 77.05 %
>>> step 1000 : loss is 1758.80, accuracy on training set 92.45 %, accuracy on test set 91.79 %
```

This is all there is to it! Inside the Graph, we load the data, define the weight matrices and the model, calculate the loss value from the logit vector, and pass this to the optimizer, which will update the weights for ‘num_steps’ iterations.

In the above fully connected NN, we have used the Gradient Descent Optimizer for optimizing the weights. However, there are many different optimizers available in tensorflow. The most commonly used optimizers are the GradientDescentOptimizer, AdamOptimizer and AdaGradOptimizer, so I would suggest starting with these if you’re building a CNN.

Sebastian Ruder has a nice blog post explaining the differences between the different optimizers which you can read if you want to know more about them.
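The core idea shared by all of these optimizers can be sketched in a few lines of NumPy: repeatedly nudge the weights against the gradient of the loss. The quadratic loss below is a toy stand-in for a real network loss, used only to show one plain gradient descent loop:

```python
import numpy as np

#A toy sketch of what a gradient descent optimizer does in each step:
#move every weight a small distance against the gradient of the loss.
#The quadratic loss is a stand-in for a real network's loss function.
def gradient(w):
    return 2.0 * (w - 3.0)  #gradient of sum((w - 3)^2), minimized at w = 3

w = np.zeros(4)
learning_rate = 0.1
for _ in range(100):
    w = w - learning_rate * gradient(w)  #the gradient descent update rule

print(np.allclose(w, 3.0))  #True: the weights converged to the minimum
```

Adaptive optimizers like Adam and AdaGrad replace the fixed learning rate in this update rule with a per-weight, history-dependent step size.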

Tensorflow comes with several levels of abstraction, meaning the same operation can often be done in more than one way. To give a simple example, the operation

`logits = tf.matmul(tf_train_dataset, weights) + biases`

can also be achieved with

`logits = tf.nn.xw_plus_b(train_dataset, weights, biases)`.

This is most visible in the layers API, which is an API with a high level of abstraction and makes it very easy to create Neural Networks consisting of many different layers. For example, the conv_2d() or fully_connected() functions create convolutional and fully connected layers. With these functions, the number of layers, filter sizes / depths, type of activation function, etc. can be specified as parameters. The weight and bias matrices are then created automatically, as well as the additional activation functions and dropout regularization layers.

For example, with the layers API, the following lines:

```python
import tensorflow as tf

w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
b1 = tf.Variable(tf.zeros([filter_depth]))

layer1_conv = tf.nn.conv2d(data, w1, [1, 1, 1, 1], padding='SAME')
layer1_relu = tf.nn.relu(layer1_conv + b1)
layer1_pool = tf.nn.max_pool(layer1_relu, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
```

can be replaced with

```python
from tflearn.layers.conv import conv_2d, max_pool_2d

layer1_conv = conv_2d(data, filter_depth, filter_size, activation='relu')
layer1_pool = max_pool_2d(layer1_conv, 2, strides=2)
```

As you can see, we don’t need to define the weights, biases or activation functions. Especially when you’re building a neural network with many layers, this keeps the code succinct and clean.

However, if you’re just starting out with tensorflow and want to learn how to build different kinds of Neural Networks, it is not ideal, since we’re letting tflearn do all the work.

Therefore we will not use the layers API in this blog post, but I do recommend using it once you have a full understanding of how a neural network should be built in tensorflow.

Let’s continue with building a Neural Network with more layers: the LeNet5 Convolutional Neural Network.

The LeNet5 CNN architecture was conceived by Yann LeCun as early as 1998 (see paper). It is one of the earliest CNNs (maybe even the first?) and was specifically designed to classify handwritten digits. Although it performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, its performance drops on other datasets with more images, a larger resolution (larger image size) and more classes. For these larger datasets, deeper ConvNets (like AlexNet, VGGNet or ResNet) will perform better.

But since the LeNet5 architecture only consists of 5 layers, it is a good starting point for learning how to build CNNs.

The Lenet5 architecture looks as follows:

As we can see, it consists of 5 layers:

- **layer 1**: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
- **layer 2**: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
- **layer 3**: a fully connected network (sigmoid activation)
- **layer 4**: a fully connected network (sigmoid activation)
- **layer 5**: the output layer

This means that we need to create 5 weight and bias matrices, and our model will consist of 12 lines of code (5 layers + 2 pooling + 4 activation functions + 1 flatten layer).

Since this is quite a lot of code, it is best to define these in a separate function outside of the graph.

```python
LENET5_BATCH_SIZE = 32
LENET5_PATCH_SIZE = 5
LENET5_PATCH_DEPTH_1 = 6
LENET5_PATCH_DEPTH_2 = 16
LENET5_NUM_HIDDEN_1 = 120
LENET5_NUM_HIDDEN_2 = 84

def variables_lenet5(patch_size = LENET5_PATCH_SIZE, patch_depth1 = LENET5_PATCH_DEPTH_1,
                     patch_depth2 = LENET5_PATCH_DEPTH_2,
                     num_hidden1 = LENET5_NUM_HIDDEN_1, num_hidden2 = LENET5_NUM_HIDDEN_2,
                     image_depth = 1, num_labels = 10):

    w1 = tf.Variable(tf.truncated_normal([patch_size, patch_size, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))

    w2 = tf.Variable(tf.truncated_normal([patch_size, patch_size, patch_depth1, patch_depth2], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))

    w3 = tf.Variable(tf.truncated_normal([5*5*patch_depth2, num_hidden1], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden1]))

    w4 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[num_hidden2]))

    w5 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape=[num_labels]))

    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
    }
    return variables

def model_lenet5(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.sigmoid(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='VALID')
    layer2_actv = tf.sigmoid(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    flat_layer = flatten_tf_array(layer2_pool)
    layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
    layer3_actv = tf.nn.sigmoid(layer3_fccd)

    layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
    layer4_actv = tf.nn.sigmoid(layer4_fccd)

    logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
    return logits
```

With the variables and model defined separately, we can adjust the graph a little bit so that it uses these weights and this model instead of the previous Fully Connected NN:

```python
#parameters determining the model size
image_width = mnist_image_width
image_height = mnist_image_height
image_depth = mnist_image_depth
num_labels = mnist_num_labels

#the datasets
train_dataset = mnist_train_dataset
train_labels = mnist_train_labels
test_dataset = mnist_test_dataset
test_labels = mnist_test_labels

#number of iterations, batch size and learning rate
num_steps = 10001
display_step = 1000
batch_size = LENET5_BATCH_SIZE
learning_rate = 0.001

graph = tf.Graph()
with graph.as_default():
    #1) First we put the input data in a tensorflow friendly form.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_test_dataset = tf.constant(test_dataset, tf.float32)

    #2) Then, the weight matrices and bias vectors are initialized
    variables = variables_lenet5(image_depth = image_depth, num_labels = num_labels)

    #3) The model used to calculate the logits (predicted labels)
    model = model_lenet5
    logits = model(tf_train_dataset, variables)

    #4) Then we compute the softmax cross entropy between the logits and the (actual) labels
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

    #5) The optimizer is used to calculate the gradients of the loss function
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Predictions for the training and test data.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, variables))
```

```python
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized with learning_rate', learning_rate)
    for step in range(num_steps):
        #Since we are using stochastic gradient descent, we select small batches from the training dataset,
        #and train the convolutional neural network with one batch at a time.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)

        if step % display_step == 0:
            train_accuracy = accuracy(predictions, batch_labels)
            test_accuracy = accuracy(test_prediction.eval(), test_labels)
            message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
            print(message)
```

```
>>> Initialized with learning_rate 0.1
>>> step 0000 : loss is 002.49, accuracy on training set 3.12 %, accuracy on test set 10.09 %
>>> step 1000 : loss is 002.29, accuracy on training set 21.88 %, accuracy on test set 9.58 %
>>> step 2000 : loss is 000.73, accuracy on training set 75.00 %, accuracy on test set 78.20 %
>>> step 3000 : loss is 000.41, accuracy on training set 81.25 %, accuracy on test set 86.87 %
>>> step 4000 : loss is 000.26, accuracy on training set 93.75 %, accuracy on test set 90.49 %
>>> step 5000 : loss is 000.28, accuracy on training set 87.50 %, accuracy on test set 92.79 %
>>> step 6000 : loss is 000.23, accuracy on training set 96.88 %, accuracy on test set 93.64 %
>>> step 7000 : loss is 000.18, accuracy on training set 90.62 %, accuracy on test set 95.14 %
>>> step 8000 : loss is 000.14, accuracy on training set 96.88 %, accuracy on test set 95.80 %
>>> step 9000 : loss is 000.35, accuracy on training set 90.62 %, accuracy on test set 96.33 %
>>> step 10000 : loss is 000.12, accuracy on training set 93.75 %, accuracy on test set 96.76 %
```

As we can see the LeNet5 architecture performs better on the MNIST dataset than a simple fully connected NN.

Generally it is true that the more layers a Neural Network has, the better it performs. We can add more layers, change activation functions and pooling layers, and change the learning rate, to see how each step affects the performance. Since the input of each layer is the output of the previous layer, we need to know how the output size of a layer is affected by its different parameters.

To understand this, let’s have a look at the conv2d() function.

It has four parameters:

- The input image, a 4D Tensor with dimensions [batch size, image_width, image_height, image_depth]
- A weight matrix, a 4-D Tensor with dimensions [filter_size, filter_size, image_depth, filter_depth]
- The number of strides in each dimension.
- Padding (= ‘SAME’ / ‘VALID’)

These four parameters determine the size of the output image.

The first two parameters are the 4-D Tensor containing the batch of input images and the 4-D Tensor containing the weights of the convolutional filter.

The third parameter is the stride of the convolution, i.e. how many positions the convolutional filter should skip in each of the four dimensions. The first of these four dimensions indicates the image number in the batch of images, and since we don’t want to skip over any image, this is always 1. The last dimension indicates the image depth (the number of color channels; 1 for grayscale and 3 for RGB), and since we don’t want to skip over any color channels, this is also always 1. The second and third dimensions indicate the stride in the X and Y directions (image width and height). If we want to apply a stride, these are the dimensions in which the filter should skip positions. So for a stride of 1 we set the stride parameter to [1, 1, 1, 1], and for a stride of 2 we set it to [1, 2, 2, 1], etc.

The last parameter indicates whether or not tensorflow should zero-pad the image in order to make sure the output size does not change for a stride of 1. With padding = ‘SAME’ the image gets zero-padded (and the output size does not change); with padding = ‘VALID’ it does not.

Below we can see two examples of a convolutional filter (with filter size 5 x 5) scanning through an image (of size 28 x 28).

On the left the padding parameter is set to ‘SAME’, the image is zero-padded and the last 4 rows / columns are included in the output image.

On the right padding is set to ‘VALID’, the image does not get zero-padded and the last 4 rows/columns are not included.


As we can see, without zero-padding the last four cells are not included, because the convolutional filter has reached the end of the (non-zero padded) image. This means that, for an input size of 28 x 28, the output size becomes 24 x 24. If padding = ‘SAME’, the output size is 28 x 28.

This becomes clearer if we write down the positions of the filter on the image while it is scanning through it (for simplicity, only in the X-direction). With a stride of 1, the X-positions are 0-4, 1-5, 2-6, etc. With a stride of 2, the X-positions are 0-4, 2-6, 4-8, etc.

If we do this for an image size of 28 x 28, filter size of 5 x 5 and strides 1 to 4, we will get the following table:

As you can see, for a stride of 1, and zero-padding the output image size is 28 x 28. Without zero-padding the output image size becomes 24 x 24. For a filter with a stride of 2, these numbers are 14 x 14 and 12 x 12, and for a filter with stride 3 it is 10 x 10 and 8 x 8. etc

For any arbitrarily chosen stride S, filter size K, image size W and padding size P, the output size will be O = 1 + (W - K + 2P) / S.

If padding = ‘SAME’ in tensorflow, just enough zero-padding is added that the output size is determined only by the stride S: the output size becomes the ceiling of W / S.
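As a sanity check, the output sizes discussed above can be reproduced with a few lines of Python (a sketch for verification; in practice conv2d() computes this for you):

```python
import math

#Output size of a convolution:
#'VALID': O = (W - K) / S + 1, rounded down (no zero-padding).
#'SAME' : tensorflow pads so that only the stride matters: O = ceil(W / S).
def output_size(W, K, S, padding):
    if padding == 'SAME':
        return math.ceil(W / S)
    return (W - K) // S + 1

#Reproduce the 28 x 28 image / 5 x 5 filter numbers from the text:
for stride in (1, 2, 3):
    print(stride, output_size(28, 5, stride, 'SAME'), output_size(28, 5, stride, 'VALID'))
# 1 28 24
# 2 14 12
# 3 10 8
```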

In the original paper, a sigmoid activation function and average pooling were used in the LeNet5 architecture. Nowadays, however, it is much more common to use a ReLU activation function. So let’s change the LeNet5 CNN a little bit to see if we can improve its accuracy. We will call this the LeNet5-like architecture:

```python
LENET5_LIKE_BATCH_SIZE = 32
LENET5_LIKE_FILTER_SIZE = 5
LENET5_LIKE_FILTER_DEPTH = 16
LENET5_LIKE_NUM_HIDDEN = 120

def variables_lenet5_like(filter_size = LENET5_LIKE_FILTER_SIZE,
                          filter_depth = LENET5_LIKE_FILTER_DEPTH,
                          num_hidden = LENET5_LIKE_NUM_HIDDEN,
                          image_width = 28, image_depth = 1, num_labels = 10):

    w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
    b1 = tf.Variable(tf.zeros([filter_depth]))

    w2 = tf.Variable(tf.truncated_normal([filter_size, filter_size, filter_depth, filter_depth], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[filter_depth]))

    w3 = tf.Variable(tf.truncated_normal([(image_width // 4)*(image_width // 4)*filter_depth, num_hidden], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))

    w4 = tf.Variable(tf.truncated_normal([num_hidden, num_hidden], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))

    w5 = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape=[num_labels]))

    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
    }
    return variables

def model_lenet5_like(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.nn.relu(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_actv = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    flat_layer = flatten_tf_array(layer2_pool)
    layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
    layer3_actv = tf.nn.relu(layer3_fccd)
    #layer3_drop = tf.nn.dropout(layer3_actv, 0.5)

    layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
    layer4_actv = tf.nn.relu(layer4_fccd)
    #layer4_drop = tf.nn.dropout(layer4_actv, 0.5)

    logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
    return logits
```

The main difference is that we use a relu activation function instead of a sigmoid activation.
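As a quick refresher on what this change means, here is a minimal pure-Python sketch (separate from the network code above) of the two activation functions:

```python
import math

def sigmoid(x):
    # squashes any input into (0, 1); gradients vanish for large |x|
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # passes positive values through unchanged; zero otherwise
    return max(0.0, x)

for x in [-4.0, 0.0, 4.0]:
    print(x, sigmoid(x), relu(x))
```

Because relu does not saturate for positive inputs, its gradient does not vanish there, which usually makes deep networks easier to train.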

Besides the activation function, we can also change the used optimizers to see what the effect is of the different optimizers on accuracy.

Let's see how these CNNs perform on the MNIST and CIFAR-10 datasets.

In the figures above, the accuracy on the test set is given as a function of the number of iterations. On the left for the one layer fully connected NN, in the middle for the LeNet5 NN and on the right for the LeNet5-like NN.

As we can see, the LeNet5 CNN works pretty well on the MNIST dataset. This should not be a big surprise, since it was specifically designed to classify handwritten digits. The MNIST dataset is quite small and does not provide a big challenge, so even a one-layer fully connected network performs quite well.

On the CIFAR-10 Dataset however, the performance for the LeNet5 NN drops significantly to accuracy values around 40%.

To increase the accuracy, we can change the optimizer, or fine-tune the Neural Network by applying regularization or learning rate decay.

As we can see, the AdagradOptimizer, AdamOptimizer and the RMSPropOptimizer have a better performance than the GradientDescentOptimizer. These are adaptive optimizers which in general perform better than the (simple) GradientDescentOptimizer but need more computational power.
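To illustrate what "adaptive" means, here is a minimal pure-Python sketch of the Adam update rule applied to a toy one-parameter problem. The hyperparameter values are the commonly used defaults and the toy function is made up for illustration; in the actual Tensorflow code you would simply swap the GradientDescentOptimizer for e.g. tf.train.AdamOptimizer:

```python
import math

def adam_minimize(grad, w, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    m, v = 0.0, 0.0                              # first and second moment estimates
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g          # momentum-like average of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running average of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero-init
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)   # per-parameter adaptive step size
    return w

# minimize the toy function f(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w_opt = adam_minimize(lambda w: 2 * (w - 3), w=0.0)
print(w_opt)
```

The division by the running gradient magnitude is what makes the step size adapt per parameter; it is also why these optimizers keep extra state and therefore cost more memory and compute.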

With L2-regularization or exponential rate decay we can probably gain a bit more accuracy, but for much better results we need to go deeper.
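As a sketch of what exponential rate decay does: the learning rate is shrunk by a fixed factor as training progresses. The formula below mirrors the one used by tf.train.exponential_decay in its non-staircase mode; the numbers here are purely illustrative:

```python
def exponential_decay(lr0, global_step, decay_steps, decay_rate):
    # decayed_lr = lr0 * decay_rate ^ (global_step / decay_steps)
    return lr0 * decay_rate ** (global_step / decay_steps)

for step in [0, 1000, 2000, 4000]:
    print(step, exponential_decay(0.1, step, decay_steps=1000, decay_rate=0.9))
```

Large steps early on speed up convergence, while the smaller steps later help the optimizer settle into a minimum instead of bouncing around it.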

So far we have seen the LeNet5 CNN architecture. LeNet5 contains two convolutional layers followed by fully connected layers, and therefore could be called a shallow Neural Network. At that time (in 1998) GPUs were not used for numerical computations, and CPUs were far less powerful, so for that time the two convolutional layers were already quite innovative.

Later on, many other types of Convolutional Neural Networks have been designed, most of them much deeper [click here for more info].

There is the famous AlexNet architecture (2012) by Alex Krizhevsky et al., the 7-layered ZF Net (2013), and the 16-layered VGGNet (2014).

In 2015 Google came with 22-layered CNN with an inception module (GoogLeNet), and Microsoft Research Asia created the 152-layered CNN called ResNet.

Now, with the things we have learned so far, let's see how we can create the AlexNet and VGGNet16 architectures in Tensorflow.

Although LeNet5 was the first ConvNet, it is considered to be a shallow neural network. It performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, but the performance drops when we try to classify larger images, with more resolution and more classes.

The first Deep CNN came out in 2012 and is called AlexNet after its creators Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Compared to the most recent architectures AlexNet can be considered simple, but at that time it was really successful. It won the ImageNet competition with an incredible test error rate of 15.4% (while the runner-up had an error of 26.2%) and started a revolution (also see this video) in the world of Deep Learning and AI.

It consists of 5 convolutional layers (with relu activation), 3 max pooling layers, 3 fully connected layers and 2 dropout layers. The overall architecture looks as follows:

**layer 0**: input image of size 224 x 224 x 3

**layer 1**: A convolutional layer with 96 filters (filter_depth_1 = 96) of size 11 x 11 (filter_size_1 = 11) and a stride of 4. It has a relu activation function. This is followed by max pooling and local response normalization layers.

**layer 2**: A convolutional layer with 256 filters (filter_depth_2 = 256) of size 5 x 5 (filter_size_2 = 5) and a stride of 1. It has a relu activation function. This layer is also followed by max pooling and local response normalization layers.

**layer 3**: A convolutional layer with 384 filters (filter_depth_3 = 384) of size 3 x 3 (filter_size_3 = 3) and a stride of 1. It has a relu activation function.

**layer 4**: Same as layer 3.

**layer 5**: A convolutional layer with 256 filters (filter_depth_4 = 256) of size 3 x 3 (filter_size_4 = 3) and a stride of 1. It has a relu activation function.

**layer 6-8**: The convolutional layers are followed by fully connected layers with 4096 neurons each. In the original paper they classify a dataset with 1000 classes, but we will use the oxflower17 dataset, which has 17 different classes (of flowers).

Note that this CNN (or other deep CNNs) cannot be used on the MNIST or the CIFAR-10 dataset, because the images in these datasets are too small. As we have seen before, a pooling layer (or a convolutional layer with a stride of 2) reduces the image size by a factor of 2. AlexNet has 3 max pooling layers and one convolutional layer with a stride of 4. This means that the original image size gets reduced by a factor of 2 x 2 x 2 x 4 = 32. The images in the MNIST dataset would simply get reduced to a size smaller than 1 pixel.
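The arithmetic can be checked directly with a trivial sketch:

```python
# three max pooling layers halve the size, one conv layer has stride 4
reduction = (2 ** 3) * 4
print(reduction)         # 32: total downsampling factor of AlexNet

print(224 // reduction)  # 7: AlexNet input still leaves a 7 x 7 feature map
print(28 // reduction)   # 0: a MNIST image has nothing left after downsampling
```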

Therefore we need to load a dataset with larger images, preferably 224 x 224 x 3 (as the original paper indicates). The 17 category flower dataset, aka oxflower17 dataset is ideal since it contains images of exactly this size:

```python
ox17_image_width = 224
ox17_image_height = 224
ox17_image_depth = 3
ox17_num_labels = 17

import tflearn.datasets.oxflower17 as oxflower17
train_dataset_, train_labels_ = oxflower17.load_data(one_hot=True)
train_dataset_ox17, train_labels_ox17 = train_dataset_[:1000,:,:,:], train_labels_[:1000,:]
test_dataset_ox17, test_labels_ox17 = train_dataset_[1000:,:,:,:], train_labels_[1000:,:]

print('Training set', train_dataset_ox17.shape, train_labels_ox17.shape)
print('Test set', test_dataset_ox17.shape, test_labels_ox17.shape)
```

Let's try to create the weight matrices and the different layers present in AlexNet. As we have seen before, we need as many weight matrices and bias vectors as there are layers, and each weight matrix should have a size corresponding to the filter size of the layer it belongs to.

```python
ALEX_PATCH_DEPTH_1, ALEX_PATCH_DEPTH_2, ALEX_PATCH_DEPTH_3, ALEX_PATCH_DEPTH_4 = 96, 256, 384, 256
ALEX_PATCH_SIZE_1, ALEX_PATCH_SIZE_2, ALEX_PATCH_SIZE_3, ALEX_PATCH_SIZE_4 = 11, 5, 3, 3
ALEX_NUM_HIDDEN_1, ALEX_NUM_HIDDEN_2 = 4096, 4096

def variables_alexnet(patch_size1 = ALEX_PATCH_SIZE_1, patch_size2 = ALEX_PATCH_SIZE_2,
                      patch_size3 = ALEX_PATCH_SIZE_3, patch_size4 = ALEX_PATCH_SIZE_4,
                      patch_depth1 = ALEX_PATCH_DEPTH_1, patch_depth2 = ALEX_PATCH_DEPTH_2,
                      patch_depth3 = ALEX_PATCH_DEPTH_3, patch_depth4 = ALEX_PATCH_DEPTH_4,
                      num_hidden1 = ALEX_NUM_HIDDEN_1, num_hidden2 = ALEX_NUM_HIDDEN_2,
                      image_width = 224, image_height = 224, image_depth = 3, num_labels = 17):
    w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))
    w2 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))
    w3 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1))
    b3 = tf.Variable(tf.zeros([patch_depth3]))
    w4 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth3], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w5 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth3], stddev=0.1))
    b5 = tf.Variable(tf.zeros([patch_depth3]))
    pool_reductions = 3
    conv_reductions = 2
    no_reductions = pool_reductions + conv_reductions
    w6 = tf.Variable(tf.truncated_normal([(image_width // 2**no_reductions)*(image_height // 2**no_reductions)*patch_depth3, num_hidden1], stddev=0.1))
    b6 = tf.Variable(tf.constant(1.0, shape=[num_hidden1]))
    w7 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b7 = tf.Variable(tf.constant(1.0, shape=[num_hidden2]))
    w8 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b8 = tf.Variable(tf.constant(1.0, shape=[num_labels]))
    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8
    }
    return variables

def model_alexnet(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 4, 4, 1], padding='SAME')
    layer1_relu = tf.nn.relu(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.max_pool(layer1_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer1_norm = tf.nn.local_response_normalization(layer1_pool)
    layer2_conv = tf.nn.conv2d(layer1_norm, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_relu = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.max_pool(layer2_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer2_norm = tf.nn.local_response_normalization(layer2_pool)
    layer3_conv = tf.nn.conv2d(layer2_norm, variables['w3'], [1, 1, 1, 1], padding='SAME')
    layer3_relu = tf.nn.relu(layer3_conv + variables['b3'])
    layer4_conv = tf.nn.conv2d(layer3_relu, variables['w4'], [1, 1, 1, 1], padding='SAME')
    layer4_relu = tf.nn.relu(layer4_conv + variables['b4'])
    layer5_conv = tf.nn.conv2d(layer4_relu, variables['w5'], [1, 1, 1, 1], padding='SAME')
    layer5_relu = tf.nn.relu(layer5_conv + variables['b5'])
    # pool the fifth relu layer (not the fourth)
    layer5_pool = tf.nn.max_pool(layer5_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer5_norm = tf.nn.local_response_normalization(layer5_pool)
    flat_layer = flatten_tf_array(layer5_norm)
    layer6_fccd = tf.matmul(flat_layer, variables['w6']) + variables['b6']
    layer6_tanh = tf.tanh(layer6_fccd)
    layer6_drop = tf.nn.dropout(layer6_tanh, 0.5)
    layer7_fccd = tf.matmul(layer6_drop, variables['w7']) + variables['b7']
    layer7_tanh = tf.tanh(layer7_fccd)
    layer7_drop = tf.nn.dropout(layer7_tanh, 0.5)
    logits = tf.matmul(layer7_drop, variables['w8']) + variables['b8']
    return logits
```

Now we can modify the CNN model to use the weights and layers of the AlexNet model in order to classify images.

VGG Net was created in 2014 by Karen Simonyan and Andrew Zisserman of the University of Oxford. It contains many more layers (16-19 layers), but each layer is simpler in its design; all of the convolutional layers have filters of size 3 x 3 and a stride of 1, and all max pooling layers have a stride of 2.

So it is a deeper CNN but simpler.

It comes in different configurations, with either 16 or 19 layers. The difference between these two different configurations is the usage of either 3 or 4 convolutional layers after the second, third and fourth max pooling layer (see below).

The configuration with 16 layers (configuration D) seems to produce the best results, so let's try to create that in Tensorflow.

```python
# The VGGNET Neural Network
VGG16_PATCH_SIZE_1, VGG16_PATCH_SIZE_2, VGG16_PATCH_SIZE_3, VGG16_PATCH_SIZE_4 = 3, 3, 3, 3
VGG16_PATCH_DEPTH_1, VGG16_PATCH_DEPTH_2, VGG16_PATCH_DEPTH_3, VGG16_PATCH_DEPTH_4 = 64, 128, 256, 512
VGG16_NUM_HIDDEN_1, VGG16_NUM_HIDDEN_2 = 4096, 1000

def variables_vggnet16(patch_size1 = VGG16_PATCH_SIZE_1, patch_size2 = VGG16_PATCH_SIZE_2,
                       patch_size3 = VGG16_PATCH_SIZE_3, patch_size4 = VGG16_PATCH_SIZE_4,
                       patch_depth1 = VGG16_PATCH_DEPTH_1, patch_depth2 = VGG16_PATCH_DEPTH_2,
                       patch_depth3 = VGG16_PATCH_DEPTH_3, patch_depth4 = VGG16_PATCH_DEPTH_4,
                       num_hidden1 = VGG16_NUM_HIDDEN_1, num_hidden2 = VGG16_NUM_HIDDEN_2,
                       image_width = 224, image_height = 224, image_depth = 3, num_labels = 17):
    w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))
    w2 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, patch_depth1, patch_depth1], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth1]))
    w3 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))
    w4 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth2, patch_depth2], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))
    w5 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w6 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1))
    b6 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w7 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1))
    b7 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w8 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth4], stddev=0.1))
    b8 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    w9 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b9 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    w10 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b10 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    w11 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b11 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    w12 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b12 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    w13 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b13 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    no_pooling_layers = 5
    w14 = tf.Variable(tf.truncated_normal([(image_width // (2**no_pooling_layers))*(image_height // (2**no_pooling_layers))*patch_depth4, num_hidden1], stddev=0.1))
    b14 = tf.Variable(tf.constant(1.0, shape=[num_hidden1]))
    w15 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b15 = tf.Variable(tf.constant(1.0, shape=[num_hidden2]))
    w16 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b16 = tf.Variable(tf.constant(1.0, shape=[num_labels]))
    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8,
        'w9': w9, 'w10': w10, 'w11': w11, 'w12': w12, 'w13': w13, 'w14': w14, 'w15': w15, 'w16': w16,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8,
        'b9': b9, 'b10': b10, 'b11': b11, 'b12': b12, 'b13': b13, 'b14': b14, 'b15': b15, 'b16': b16
    }
    return variables

def model_vggnet16(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.nn.relu(layer1_conv + variables['b1'])
    layer2_conv = tf.nn.conv2d(layer1_actv, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_actv = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.max_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    layer3_conv = tf.nn.conv2d(layer2_pool, variables['w3'], [1, 1, 1, 1], padding='SAME')
    layer3_actv = tf.nn.relu(layer3_conv + variables['b3'])
    layer4_conv = tf.nn.conv2d(layer3_actv, variables['w4'], [1, 1, 1, 1], padding='SAME')
    layer4_actv = tf.nn.relu(layer4_conv + variables['b4'])
    # pool layer4_actv (the original code pooled a not-yet-defined variable)
    layer4_pool = tf.nn.max_pool(layer4_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    layer5_conv = tf.nn.conv2d(layer4_pool, variables['w5'], [1, 1, 1, 1], padding='SAME')
    layer5_actv = tf.nn.relu(layer5_conv + variables['b5'])
    layer6_conv = tf.nn.conv2d(layer5_actv, variables['w6'], [1, 1, 1, 1], padding='SAME')
    layer6_actv = tf.nn.relu(layer6_conv + variables['b6'])
    layer7_conv = tf.nn.conv2d(layer6_actv, variables['w7'], [1, 1, 1, 1], padding='SAME')
    layer7_actv = tf.nn.relu(layer7_conv + variables['b7'])
    layer7_pool = tf.nn.max_pool(layer7_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    layer8_conv = tf.nn.conv2d(layer7_pool, variables['w8'], [1, 1, 1, 1], padding='SAME')
    layer8_actv = tf.nn.relu(layer8_conv + variables['b8'])
    layer9_conv = tf.nn.conv2d(layer8_actv, variables['w9'], [1, 1, 1, 1], padding='SAME')
    layer9_actv = tf.nn.relu(layer9_conv + variables['b9'])
    layer10_conv = tf.nn.conv2d(layer9_actv, variables['w10'], [1, 1, 1, 1], padding='SAME')
    layer10_actv = tf.nn.relu(layer10_conv + variables['b10'])
    layer10_pool = tf.nn.max_pool(layer10_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    layer11_conv = tf.nn.conv2d(layer10_pool, variables['w11'], [1, 1, 1, 1], padding='SAME')
    layer11_actv = tf.nn.relu(layer11_conv + variables['b11'])
    layer12_conv = tf.nn.conv2d(layer11_actv, variables['w12'], [1, 1, 1, 1], padding='SAME')
    layer12_actv = tf.nn.relu(layer12_conv + variables['b12'])
    layer13_conv = tf.nn.conv2d(layer12_actv, variables['w13'], [1, 1, 1, 1], padding='SAME')
    layer13_actv = tf.nn.relu(layer13_conv + variables['b13'])
    layer13_pool = tf.nn.max_pool(layer13_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    flat_layer = flatten_tf_array(layer13_pool)
    layer14_fccd = tf.matmul(flat_layer, variables['w14']) + variables['b14']
    layer14_actv = tf.nn.relu(layer14_fccd)
    layer14_drop = tf.nn.dropout(layer14_actv, 0.5)
    layer15_fccd = tf.matmul(layer14_drop, variables['w15']) + variables['b15']
    layer15_actv = tf.nn.relu(layer15_fccd)
    layer15_drop = tf.nn.dropout(layer15_actv, 0.5)
    logits = tf.matmul(layer15_drop, variables['w16']) + variables['b16']
    return logits
```

As a comparison, have a look at the LeNet5 CNN performance on the larger oxflower17 dataset:

The code is also available in my GitHub repository, so feel free to use it on your own dataset(s).

There is much more to explore in the world of Deep Learning: Recurrent Neural Networks, Region-Based CNNs, GANs, Reinforcement Learning, etc. In future blog posts I'll build these types of Neural Networks, and also build awesome applications with what we have already learned.

So subscribe and stay tuned!

[1] If you feel like you need to refresh your understanding of CNN’s, here are some good starting points to get you up to speed:

- Machine Learning is fun!
- An Intuitive Explanation of Convolutional Neural Networks
- CS231n Convolutional Neural Networks for Visual Recognition
- Udacity’s Deep Learning course
- Neural Networks and Deep Learning, Ch. 6

[2] If you want more information about the theory behind these different Neural Networks, Adit Deshpande’s blog post provides a good comparison of them with links to the original papers. Eugenio Culurciello has a nice blog and article worth a read. In addition to that, also have a look at this github repository containing awesome deep learning papers, and this github repository where deep learning papers are ordered by task and date.

For python programmers, scikit-learn is one of the best libraries to build Machine Learning applications with. It is ideal for beginners because it has a simple interface and is well documented, with many examples and tutorials.

Besides supervised machine learning (classification and regression), it can also be used for clustering, dimensionality reduction, feature extraction and engineering, and pre-processing the data. The interface is consistent over all of these methods, so it is not only easy to use, but it is also easy to construct a large ensemble of classifiers/regression models and train them with the same commands.
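A minimal sketch of that consistency: two very different classifiers are trained and scored with exactly the same calls (the toy data here is generated purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# a small synthetic classification problem, just for demonstration
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for clf in [LogisticRegression(), DecisionTreeClassifier()]:
    clf.fit(X, y)                                 # same call for every estimator
    print(type(clf).__name__, clf.score(X, y))    # same call for every estimator
```

Every estimator exposes the same `fit` / `predict` / `score` interface, which is what makes it easy to swap one model for another.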

In this blog let's have a look at how to build, train, evaluate and validate a classifier with scikit-learn, and in this way get familiar with the scikit-learn library.

Let’s look at the process of classification with scikit-learn with two example datasets: the glass dataset and the mushroom dataset.

The glass dataset contains data on six types of glass (from building windows, containers, tableware, headlamps, etc), and each type of glass can be identified by the content of several minerals (for example Na, Fe, K, etc). This dataset only contains numerical data and is therefore a good dataset to get started with.

The second dataset contains non-numerical data and we will need an additional step where we encode the categorical data to numerical data.

Let's start with classifying the types of glass!

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
```

First we need to import the necessary modules and libraries which we will use.

- The pandas module is used to load, inspect and process the data, and get it in the shape necessary for classification.
- Seaborn is a library based on matplotlib with nice functionalities for drawing graphs.
- StandardScaler is a class for standardizing and normalizing the dataset, and
- the LabelEncoder class can be used to encode the categorical features (in the mushroom dataset) to numerical values.
- All of the other modules are classifiers which are used for classification of the dataset.

When loading a dataset for the first time, there are several questions we need to ask ourselves:

- What kind of data does the dataset contain? Numerical data, categorical data, geographic information, etc…
- Does the dataset contain any missing data?
- Does the dataset contain any redundant data (noise)?
- Do the values of the features differ over many orders of magnitude? Do we need to standardize or normalize the dataset?
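For the last question, here is a small sketch of what standardizing does to features that differ by orders of magnitude (the toy matrix is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature matrix: the two columns live on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))  # each column now has mean ~0
print(X_std.std(axis=0))   # and standard deviation 1
```

After standardization both columns contribute on the same scale, which matters for distance- and gradient-based classifiers.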

```python
filename_glass = './data/glass.csv'
df_glass = pd.read_csv(filename_glass)
print(df_glass.shape)
display(df_glass.head())
display(df_glass.describe())
```

We can see that the dataset consists of 214 rows and 10 columns. All of the columns contain numerical data, and there are no rows with missing information (check this for yourself). Also most of the features have values in the same order of magnitude.

So for this dataset we do not need to remove any rows (with `.dropna()`), apply one hot encoding (to transform categorical data into numerical data), or standardize the data (with `StandardScaler().fit_transform(X)`).

The `.describe()` method of pandas is useful for giving a quick overview of the dataset:

- How many rows of data are there?
- What are some characteristic values like the mean, standard deviation, minimum and maximum value, the 25th percentile etc.

To get more insight in how (strongly) each feature is correlated with the Type of glass, we can calculate and plot the correlation matrix for this dataset.

```python
correlation_matrix = df_glass.corr()
plt.figure(figsize=(10,8))
ax = sns.heatmap(correlation_matrix, vmax=1, square=True, annot=True, cmap='RdYlGn')
plt.title('Correlation matrix between the features')
plt.show()
```

The correlation matrix shows us, for example, that the oxides ‘Mg’ and ‘Al’ are most strongly correlated with the type of glass, while the content of ‘Ca’ is least strongly correlated with it. For some datasets there could be features with no correlation at all; then it might be a good idea to remove these, since they will only function as noise.

The next step is building and training the actual classifier, which hopefully can accurately classify the data. With this we will be able to tell which type of glass an entry in the dataset belongs to, based on the features.

For this we need to split the dataset into a training set and a test set. With the training set we will train the classifier, and with the test set we will validate the accuracy of the classifier. Usually a 70 % / 30 % ratio is used when splitting into a training and test set, but this ratio should be chosen based on the size of the dataset. For example, if the dataset does not have enough entries, 30% of it might not contain all of the classes or enough information to properly function as a validation set.

Another important note is that the distribution of the different classes in both the training and the test set should be equal to the distribution in the actual dataset. For example, if you have a dataset with review-texts which contains 20% negative and 80% positive reviews, both the training and the test set should have this 20% / 80% ratio. The best way to do this is to split the dataset into a training and test set **randomly**.
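A random split usually preserves the class distribution well, and scikit-learn's train_test_split can even enforce it exactly via its stratify parameter. A small sketch with made-up labels (an alternative to rolling your own split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels with a 20% / 80% class distribution
y = np.array([0] * 20 + [1] * 80)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# both splits keep the original 20% share of class 0
print((y_tr == 0).mean(), (y_te == 0).mean())
```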

```python
def get_train_test(df, y_col, ratio):
    mask = np.random.rand(len(df)) < ratio
    df_train = df[mask].copy()   # .copy() avoids modifying a view of the original dataframe
    df_test = df[~mask].copy()
    Y_train = df_train[y_col].values
    Y_test = df_test[y_col].values
    del df_train[y_col]
    del df_test[y_col]
    X_train = df_train.values
    X_test = df_test.values
    return X_train, Y_train, X_test, Y_test

y_col = 'Type'
train_test_ratio = 0.7
X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col, train_test_ratio)
```

With the dataset split into training and test sets, we can start building a classification model. I will do this in a slightly different way than usual. The idea behind this is that, when we start with a new dataset, we don’t know which (type of) classifier will perform best on it. Will it be a classifier like Decision Tree or Random Forest, a classifier which uses a functional approach like Logistic Regression, a classifier which uses a statistical approach like Naive Bayes, etc.?

Because we don't know this, we will try all types of classifiers first, and later we can continue to optimize the best performing classifier of this initial batch. For this we have to make a dictionary, which contains as *keys* the names of the classifiers and as *values* an instance of each classifier.

```python
dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators = 18),
    "Neural Net": MLPClassifier(alpha = 1),
    "Naive Bayes": GaussianNB()
}
```

Then we can iterate over this dictionary, and for each classifier:

- train the classifier with `.fit(X_train, Y_train)`,
- evaluate how the classifier performs on the training set with `.score(X_train, Y_train)`,
- evaluate how the classifier performs on the test set with `.score(X_test, Y_test)`,
- keep track of how much time it takes to train the classifier with the time module,
- save the training score, the test score, and the training time into a dataframe called ‘df_results’.

```python
no_classifiers = len(dict_classifiers.keys())

def batch_classify(X_train, Y_train, X_test, Y_test, verbose = True):
    df_results = pd.DataFrame(data=np.zeros(shape=(no_classifiers, 4)),
                              columns = ['classifier', 'train_score', 'test_score', 'training_time'])
    count = 0
    for key, classifier in dict_classifiers.items():
        t_start = time.time()   # time.clock() was removed in Python 3.8
        classifier.fit(X_train, Y_train)
        t_end = time.time()
        t_diff = t_end - t_start
        train_score = classifier.score(X_train, Y_train)
        test_score = classifier.score(X_test, Y_test)
        df_results.loc[count, 'classifier'] = key
        df_results.loc[count, 'train_score'] = train_score
        df_results.loc[count, 'test_score'] = test_score
        df_results.loc[count, 'training_time'] = t_diff
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=key, f=t_diff))
        count += 1
    return df_results
```

The reason why we keep track of the time it takes to train a classifier is that, in practice, this is also an important indicator of whether or not you would like to use a specific classifier. If two classifiers give similar results, but one of them takes much less time to train, you probably want to use that one.

The `score()` method simply returns the result of the `accuracy_score()` method from the metrics module. This module contains many methods for evaluating classification or regression models, and I can recommend spending some time learning which metrics you can use to evaluate your model.

The `classification_report` method, for example, calculates the precision, recall and f1-score for all of the classes in your dataset. If you are looking for ways to improve the accuracy of your classifier, or if you want to know why the accuracy is lower than expected, such detailed information about the performance of the classifier on the dataset can point you in the right direction.
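A minimal sketch of both metrics with made-up labels and predictions:

```python
from sklearn.metrics import accuracy_score, classification_report

# made-up true labels and predictions, purely for illustration
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 1, 2, 2, 0]

print(accuracy_score(y_true, y_pred))         # fraction of correct predictions: 0.75
print(classification_report(y_true, y_pred))  # per-class precision, recall and f1-score
```

The per-class breakdown immediately shows which classes the classifier struggles with, which a single accuracy number hides.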

The accuracy on the training set, accuracy on the test set, and the duration of the training is saved into the ‘df_results’ dataframe.

```python
df_results = batch_classify(X_train, Y_train, X_test, Y_test)
display(df_results.sort_values(by='test_score', ascending=False))
```

What we are doing feels like a brute force approach, where a large number of classifiers are built to see which one performs best. Although this is not particularly educational, it gives an idea which classifiers will perform better for a particular dataset and which will not. After that you can continue with the best (or top 3) classifier, and try to improve the results by tweaking the parameters of the classifier, or by adding more features to the dataset.

As we can see, the Gradient Boosting classifier performs best for this dataset. In fact, classifiers like Random Forest and Gradient Boosting perform best for most datasets and challenges on Kaggle (that does not mean you should rule out all other classifiers).

For those who are interested in the theory behind these classifiers, scikit-learn has a pretty well written user guide. Some of these classifiers were also explained in previous posts, like the naive bayes classifier and logistic regression, and support vector machines were partially explained in the perceptron blog.

The second dataset we will have a look at is the mushroom dataset, which contains data on edible vs poisonous mushrooms. In the dataset there are 8124 mushrooms in total (4208 edible and 3916 poisonous) described by 22 features each.

The big difference with the glass dataset is that these features don’t have a numerical, but a categorical value. Because this dataset contains categorical values, we need one extra step in the classification process, which is the encoding of these values.

```python
filename_mushrooms = './data/mushrooms.csv'
df_mushrooms = pd.read_csv(filename_mushrooms)
display(df_mushrooms.head())
```

A fast way to find out what type of categorical data a dataset contains, is to print out the unique values of each column in this dataframe. In this way we can also see whether the dataset contains any missing values or redundant columns.

```python
for col in df_mushrooms.columns.values:
    print(col, df_mushrooms[col].unique())
```

As we can see, there are 22 categorical features. Of these, the feature ‘veil-type’ only contains one value ‘p’ and therefore does not provide any added value for any classifier. The best thing to do is to remove this feature.

```python
del df_mushrooms['veil-type']
```
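More generally, every column with a single unique value can be dropped programmatically; a small sketch with a made-up frame (column 'b' plays the role of 'veil-type'):

```python
import pandas as pd

# toy frame: column 'b' holds only one value, like 'veil-type' does
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['p', 'p', 'p'], 'c': [4, 5, 6]})

# keep only the columns that have more than one unique value
df = df.loc[:, df.nunique() > 1]
print(list(df.columns))   # ['a', 'c']
```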

Most classifiers can only work with numerical data, and will raise an error when categorical values in the form of strings are used as input. Luckily scikit-learn contains the LabelEncoder class, which can be used to transform non-numerical values to numerical values. This is done by first fitting the LabelEncoder with all possible (unique) values and then transforming all values to numerical values.

(Both steps can also be done in one go with the `fit_transform()` method.)
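A minimal sketch of the one-call route (the values here are made up):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
values = ['p', 'e', 'e', 'p', 'e']

# fit and transform in one call; the learned classes are sorted alphabetically
encoded = le.fit_transform(values)
print(encoded.tolist())       # [1, 0, 0, 1, 0]
print(le.classes_.tolist())   # ['e', 'p']
```

The `classes_` attribute keeps the mapping, so the numerical labels can always be translated back with `inverse_transform()`.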

```python
def label_encode(df, columns):
    for col in columns:
        le = LabelEncoder()
        col_values_unique = list(df[col].unique())
        le_fitted = le.fit(col_values_unique)
        col_values = list(df[col].values)
        col_values_transformed = le.transform(col_values)
        df[col] = col_values_transformed

to_be_encoded_cols = df_mushrooms.columns.values
label_encode(df_mushrooms, to_be_encoded_cols)
display(df_mushrooms.head())
```

As we can see, the columns previously containing non-numerical values now contain numerical values and the dataset is ready for classification. Again, we will split the dataset into a 70% training set and a 30% test set and start training and validating a batch of the eight most used classifiers.

```python
y_col = 'class'
ratio = 0.7
X_train, Y_train, X_test, Y_test = get_train_test(df_mushrooms, y_col, ratio)
df_results = batch_classify(X_train, Y_train, X_test, Y_test)
display(df_results.sort_values(by='test_score', ascending=False))
```

As we can see, the accuracy of the classifiers for this dataset is actually also quite high. For datasets where this is not the case, we can play around with the features in the dataset, add extra features from additional datasets, or change the parameters of the classifiers in order to improve the accuracy.

In my opinion, the best way to master the scikit-learn library is to simply start coding with it. I hope this blog-post gave some insight into the workings of the scikit-learn library, but for those who need some more information, here are some useful links:

dataschool – machine learning with scikit-learn video series

Classification example using the iris dataset

Official scikit-learn documentation


Most tasks in Machine Learning can be reduced to classification tasks. For example, we have a medical dataset and we want to classify who has diabetes (positive class) and who doesn’t (negative class). We have a dataset from the financial world and want to know which customers will default on their credit (positive class) and which customers will not (negative class).

To do this, we can train a Classifier with a ‘training dataset’ and after such a Classifier is trained (we have determined its model parameters) and can accurately classify the training set, we can use it to classify new data (test set). If the training is done properly, the Classifier should predict the class probabilities of the new data with a similar accuracy.

There are three popular Classifiers which use three different mathematical approaches to classify data. Previously we have looked at the first two of these; Logistic Regression and the Naive Bayes classifier. Logistic Regression uses a functional approach to classify data, and the Naive Bayes classifier uses a statistical (Bayesian) approach to classify data.

Logistic Regression assumes there is some function which forms a correct model of the dataset (i.e. it maps the input values correctly to the output values). This function is defined by its parameters θ. We can use the gradient descent method to find the optimum values of these parameters.

The Naive Bayes method is much simpler than that; we do not have to optimize a function, but can calculate the Bayesian (conditional) probabilities directly from the training dataset. This can be done quite fast (by creating a hash table containing the probability distributions of the features) but is generally less accurate.

Classification of data can also be done via a third way, by using a geometrical approach. The main idea is to find a line, or a plane, which can separate the two classes in their feature space. Classifiers which are using a geometrical approach are the Perceptron and the SVM (Support Vector Machines) methods.

Below we will discuss the Perceptron classification algorithm. Although Support Vector Machines is used more often, I think a good understanding of the Perceptron algorithm is essential to understanding Support Vector Machines and Neural Networks.

The Perceptron is a lightweight algorithm, which can classify data quite fast. But it only works in the limited case of a linearly separable, binary dataset. If you have a dataset consisting of only two classes, the Perceptron classifier can be trained to find a linear hyperplane which separates the two. If the dataset is not linearly separable, the perceptron will fail to find a separating hyperplane.

If the dataset consists of more than two classes we can use the standard approaches in multiclass classification (one-vs-all and one-vs-one) to transform the multiclass dataset to a binary dataset. For example, if we have a dataset, which consists of three different classes:

- In **one-vs-all**, class I is considered as the positive class and the rest of the classes are considered as the negative class. We can then look for a separating hyperplane between class I and the rest of the dataset (class II and III). This process is repeated for class II and then for class III. So we are trying to find three separating hyperplanes: between class I and the rest of the data, between class II and the rest of the data, etc. If the dataset consists of K classes, we end up with K separating hyperplanes.
- In **one-vs-one**, class I is considered as the positive class and each of the other classes is in turn considered as the negative class; so first class II is considered as the negative class and then class III. Then this process is repeated with the other classes as the positive class. So if the dataset consists of K classes, we are looking for K(K-1)/2 separating hyperplanes.

Although the one-vs-one approach can be a bit slower (there is one more iteration layer), it is not difficult to imagine it will be more advantageous in situations where a (linear) separating hyperplane does not exist between one class and the rest of the data, while it does exist between one class and the other classes when they are considered individually. In the image below there is no separating line between the pear-class and the other two classes.
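The relabelling step behind both schemes can be sketched in a few lines (the helper below is my own illustration, not code from the post): for each binary sub-problem we keep the chosen class as +1 and mark everything else as -1, and then train a binary classifier on the relabelled data.

```python
# Sketch of the relabelling used in one-vs-all (hypothetical helper, not
# from the original post). For a dataset with K classes we would train K
# binary classifiers; each one treats a single class as positive (+1)
# and all other classes as negative (-1).
import numpy as np

def one_vs_all_labels(Y, positive_class):
    """Relabel a multiclass vector Y into a binary +1/-1 vector."""
    return np.where(Y == positive_class, 1, -1)

Y = np.array([0, 1, 2, 1, 0, 2])
print(one_vs_all_labels(Y, 1))  # class 1 vs the rest -> [-1  1 -1  1 -1 -1]
```

For one-vs-one, the same helper would be applied to only the subset of examples belonging to the two classes under consideration.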

The algorithm of the Perceptron is similar to that of Support Vector Machines (SVM). Both algorithms find a (linear) hyperplane separating the two classes. The biggest difference is that the Perceptron algorithm will find **any** such hyperplane, while the SVM algorithm uses a Lagrangian constraint to find the hyperplane which is optimized to have the **maximum margin**. That is, the distance between the hyperplane and the points closest to it is maximized. This is illustrated in the figure below. While the Perceptron classifier is satisfied if any of these separating hyperplanes is found, an SVM classifier will find the green one, which has the maximum margin.

Another difference is that if the dataset is not linearly separable [2], the perceptron will fail to find a separating hyperplane; the algorithm simply does not converge during its iteration cycle. The SVM, on the other hand, can still find a maximum margin, minimum cost decision boundary (a separating hyperplane which does not separate 100% of the data, but does so with some small error).

It is often said that the perceptron is modeled after neurons in the brain. It has input values (which correspond with the features of the examples in the training set) and one output value. Each input value is multiplied by a weight-factor. If the sum of the products between the feature values and weight-factors is larger than zero, the perceptron is activated and ‘fires’ a signal (+1). Otherwise it is not activated.

The weighted sum of the input values and the weight values can mathematically be determined with the scalar product w·x. To produce the behaviour of ‘firing’ a signal (+1) we can use the signum function sgn(); it maps the output to +1 if the input is positive, and it maps the output to -1 if the input is negative.

Thus, this Perceptron can mathematically be modeled by the function y = sgn(b + w·x). Here b is the bias, i.e. the default value when all feature values are zero.

The perceptron algorithm looks as follows:

```python
import numpy as np

class Perceptron():
    """
    Class for performing Perceptron classification.
    X is the input array with n rows (no_examples) and m columns (no_features)
    Y is a vector containing elements which indicate the class
    (1 for positive class, -1 for negative class)
    w is the weight-vector (m number of elements)
    b is the bias-value
    """
    def __init__(self, b = 0, max_iter = 1000):
        self.max_iter = max_iter
        self.w = []
        self.b = 0
        self.no_examples = 0
        self.no_features = 0

    def train(self, X, Y):
        self.no_examples, self.no_features = np.shape(X)
        self.w = np.zeros(self.no_features)
        for ii in range(0, self.max_iter):
            w_updated = False
            for jj in range(0, self.no_examples):
                a = self.b + np.dot(self.w, X[jj])
                if np.sign(Y[jj]*a) != 1:
                    w_updated = True
                    self.w += Y[jj] * X[jj]
                    self.b += Y[jj]
            if not w_updated:
                print("Convergence reached in %i iterations." % ii)
                break
        if w_updated:
            print(
                """
                WARNING: convergence not reached in %i iterations.
                Either dataset is not linearly separable,
                or max_iter should be increased
                """ % self.max_iter
            )

    def classify_element(self, x_elem):
        return int(np.sign(self.b + np.dot(self.w, x_elem)))

    def classify(self, X):
        predicted_Y = []
        for ii in range(np.shape(X)[0]):
            y_elem = self.classify_element(X[ii])
            predicted_Y.append(y_elem)
        return predicted_Y
```

As you can see, we set the bias-value and all the elements in the weight-vector to zero. Then we iterate ‘max_iter’ number of times over all the examples in the training set.

Here, Y[jj] is the actual output value of each training example. This is either +1 (if it belongs to the positive class) or -1 (if it does not belong to the positive class).

The activation function value a is the predicted output value. The product Y[jj]*a will be positive if the prediction is correct and negative if the prediction is incorrect. Therefore, if the prediction made (with the weight vector from the previous training example) is incorrect, np.sign(Y[jj]*a) will be -1, and the weight vector is updated.

If the weight vector is not updated after some iteration, it means we have reached convergence and we can break out of the loop.

If the weight vector was updated in the last iteration, it means we still didn’t reach convergence and either the dataset is not linearly separable, or we need to increase ‘max_iter’.

We can see that the Perceptron is an online algorithm; it iterates through the examples in the training set, and for each example in the training set it calculates the value of the activation function and updates the values of the weight-vector.

Now let’s examine the Perceptron algorithm for a linearly separable dataset which exists in 2 dimensions. For this we first have to create this dataset:

```python
import random
import numpy as np

def generate_data(no_points):
    X = np.zeros(shape=(no_points, 2))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = random.randint(1,9)+0.5
        X[ii][1] = random.randint(1,9)+0.5
        Y[ii] = 1 if X[ii][0]+X[ii][1] >= 13 else -1
    return X, Y
```

In the 2D case, the perceptron algorithm looks like:

```python
X, Y = generate_data(100)
p = Perceptron()
p.train(X, Y)
X_test, Y_test = generate_data(50)
predicted_Y_test = p.classify(X_test)
```

As we can see, the weight vector and the bias (which together determine the separating hyperplane) are updated when Y[jj]*a is not positive.

The result is nicely illustrated in this gif:

[animated GIF: the separating line being updated at each iteration]

We can extend this to a dataset in any number of dimensions, and as long as it is linearly separable, the Perceptron algorithm will converge.

One of the benefits of this Perceptron is that it is a very ‘lightweight’ algorithm; it is computationally very fast and easy to implement for datasets which are linearly separable. But if the dataset is not linearly separable, it will not converge.

For such datasets, the Perceptron can still be used if the correct kernel is applied. In practice this is never done, and Support Vector Machines are used whenever a Kernel needs to be applied. Some of these Kernels are:

- Linear: k(x, y) = xᵀy + c
- Polynomial: k(x, y) = (α·xᵀy + c)^d with slope α, constant c and polynomial degree d
- Laplacian RBF: k(x, y) = exp(−‖x − y‖ / σ)
- Gaussian RBF: k(x, y) = exp(−‖x − y‖² / (2σ²))

At this point it would be too much to also implement kernel functions, but I hope to do so in a future post about SVM. For more information about kernel functions, a comprehensive list of kernels, and their source code, please click here.
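As an illustration (my own sketch, not code from the post), two of the kernels listed above could be implemented as follows, with sigma as a free parameter:

```python
# Minimal sketch of the linear and Gaussian RBF kernels listed above
# (helper names are my own, not from the post).
import numpy as np

def linear_kernel(x, y):
    # Plain scalar product of the two input vectors.
    return np.dot(x, y)

def gaussian_rbf_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 * sigma^2))
    diff = np.asarray(x) - np.asarray(y)
    return np.exp(-np.dot(diff, diff) / (2 * sigma**2))

print(linear_kernel([1, 2], [3, 4]))        # 11
print(gaussian_rbf_kernel([0, 0], [0, 0]))  # 1.0
```

A kernelized Perceptron would replace every scalar product w·x in the training loop with such a kernel evaluation.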

**PS: The Python code for Logistic Regression can be forked/cloned from GitHub.**

In the previous blog we have seen the theory and mathematics behind the Logistic Regression Classifier.

Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even sports.

Reading all of this, the theory[1] of Logistic Regression Classification might look difficult. In my experience, the average Developer does not believe they can design a proper Logistic Regression Classifier from scratch. I strongly disagree: not only is the mathematics behind it relatively simple, it can also be implemented with a few lines of code.

I have done this in the past month, so I thought I’d show you how to do it. The code is in Python but it should be relatively easy to translate it to other languages. Some of the examples contain self-generated data, while other examples contain real-world (iris) data. As was also done in the blog-posts about the bag-of-words model and the Naive Bayes Classifier, we will also try to automatically classify the sentiments of Amazon.com book reviews.

We have seen that the technique to perform Logistic Regression is similar to regular Regression Analysis.

There is a function which maps the input values to the output and this function is completely determined by its parameters θ. So once we have determined the values of θ with training examples, we can determine the class of any new example.

We are trying to estimate the parameter values θ with the iterative Gradient Descent method. In the Gradient Descent method, the values of the parameters θ in the current iteration are calculated by updating the values of θ from the previous iteration with the gradient of the cost function J(θ).

In (regular) Regression this hypothesis function can be any function which you expect will provide a good model of the dataset. In Logistic Regression the hypothesis function is always given by the Logistic function:

h_θ(x) = 1 / (1 + e^(−θᵀx)).

Different cost functions exist, but most often the log-likelihood function known as binary cross-entropy (see equation 2 of previous post) is used.

One of its benefits is that the gradient of this cost function turns out to be quite simple, and since it is the gradient we use to update the values of θ, this makes our work easier.
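Written out (with hypothesis h_θ and m training examples, in the notation of the previous post), the binary cross-entropy cost and its gradient are:

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[\, y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\,\Big]

\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}
```

Note that the gradient only involves the prediction error h_θ(x) − y multiplied by the input, which is exactly what the gradient function in the code below computes.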

Taking all of this into account, this is how Gradient Descent works:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria*:
- With the current values of θ, calculate the gradient of the cost function ∇J(θ).
- Update the values of the parameters: θ := θ − α · ∇J(θ), where α is the learning rate.
- Fill in these new values in the hypothesis function and calculate again the value of the cost function.

*Usually the iteration stops when either the maximum number of iterations has been reached, or the error (the difference between the cost of this iteration and the cost of the previous iteration) is smaller than some minimum error value (e.g. 0.001).
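The steps above can be sketched as a generic loop (function and parameter names here are my own; the cost and gradient functions are supplied by the model being fitted):

```python
# Generic gradient descent loop following the steps above (a sketch,
# not the post's implementation).
import numpy as np

def gradient_descent_sketch(theta, cost, cost_gradient, learning_rate=0.1,
                            max_iter=1000, min_error=0.001):
    prev_cost = cost(theta)
    for _ in range(max_iter):
        theta = theta - learning_rate * cost_gradient(theta)  # update step
        new_cost = cost(theta)
        if abs(prev_cost - new_cost) < min_error:             # stopping criterion
            break
        prev_cost = new_cost
    return theta

# Minimising the simple 1D cost J(theta) = (theta - 3)^2:
theta = gradient_descent_sketch(np.array([0.0]),
                                cost=lambda t: (t[0] - 3)**2,
                                cost_gradient=lambda t: np.array([2 * (t[0] - 3)]))
print(theta)  # close to [3.]
```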

We have seen the self-generated example of students participating in a Machine Learning course, where their final grade depended on how many hours they had studied.

First, let’s generate the data:

```python
import random
import numpy as np

num_of_datapoints = 100
x_max = 10
initial_theta = [1, 0.07]

def func1(X, theta, add_noise = True):
    if add_noise:
        return theta[0]*X[0] + theta[1]*X[1]**2 + 0.25*X[1]*(random.random()-1)
    else:
        return theta[0]*X[0] + theta[1]*X[1]**2

def generate_data(num_of_datapoints, x_max, theta):
    X = np.zeros(shape=(num_of_datapoints, 2))
    Y = np.zeros(shape=num_of_datapoints)
    for ii in range(num_of_datapoints):
        X[ii][0] = 1
        X[ii][1] = (x_max*ii) / float(num_of_datapoints)
        Y[ii] = func1(X[ii], theta)
    return X, Y

X, Y = generate_data(num_of_datapoints, x_max, initial_theta)
```

We can see that we have generated 100 points uniformly distributed over the x-axis. For each of these x-points the y-value is determined by θ₀ + θ₁·x² minus some random value.

On the left we can see a scatterplot of the datapoints and on the right we can see the same data with a curve fitted through the points. This is the curve we are trying to estimate with the Gradient Descent method. This is done as follows:

```python
def gradient_descent(X, Y, theta, alpha, m, number_of_iterations):
    for ii in range(0, number_of_iterations):
        print("iteration %s : feature-value: %s" % (ii, theta))
        hypothesis = np.dot(X, theta)
        cost = sum([theta[0]*X[iter][0]+theta[1]*X[iter][1]-Y[iter] for iter in range(m)])
        grad0 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][0]**2 for iter in range(m)])
        grad1 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][1]**4 for iter in range(m)])
        theta[0] = theta[0] - alpha * grad0
        theta[1] = theta[1] - alpha * grad1
    return theta

numIterations = 1000
alpha = 0.00000005
m, n = np.shape(X)
theta = np.ones(n)
theta = gradient_descent(X, Y, theta, alpha, m, numIterations)
```

We can see that we have to calculate the gradient of the cost function at every iteration and update both feature values simultaneously! This indeed results in the curve we were looking for:

After this short example of Regression, let’s have a look at a few examples of Logistic Regression. We will start out with the self-generated example of students passing a course or not and then we will look at real-world data.

Let’s generate some data points. There are students participating in the course Machine Learning, and whether a student passes (y = 1) or not (y = 0) depends on two variables:

- x1: how many hours the student has studied for the exam.
- x2: how many hours the student has slept the day before the exam.

```python
import random
import numpy as np

def func2(x_i):
    if x_i[1] <= 4:
        y = 0
    else:
        if x_i[1]+x_i[2] <= 13:
            y = 0
        else:
            y = 1
    return y

def generate_data2(no_points):
    X = np.zeros(shape=(no_points, 3))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = 1
        X[ii][1] = random.random()*9+0.5
        X[ii][2] = random.random()*9+0.5
        Y[ii] = func2(X[ii])
    return X, Y

X, Y = generate_data2(300)
```

In our example the results are pretty binary; everyone who has studied 4 hours or less fails the course, as well as everyone whose studying time + sleeping time is less than or equal to 13 hours (x1 + x2 <= 13). The results look like this (the green dots indicate a pass and the red dots a fail):

We have a LogisticRegression class, which sets the values of the learning rate and the maximum number of iterations at its initialization. The values of X, Y are set when these matrices are passed to the “train()” function, and then the values of no_examples, no_features, and theta are determined.

```python
import numpy as np

class LogisticRegression():
    """
    Class for performing logistic regression.
    """
    def __init__(self, learning_rate = 0.7, max_iter = 1000):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.theta = []
        self.no_examples = 0
        self.no_features = 0
        self.X = None
        self.Y = None

    def add_bias_col(self, X):
        bias_col = np.ones((X.shape[0], 1))
        return np.concatenate([bias_col, X], axis=1)
```

We also have the hypothesis, cost and gradient functions:

```python
    def hypothesis(self, X):
        return 1 / (1 + np.exp(-1.0 * np.dot(X, self.theta)))

    def cost_function(self):
        """
        We will use the binary cross entropy as the cost function.
        https://en.wikipedia.org/wiki/Cross_entropy
        """
        predicted_Y_values = self.hypothesis(self.X)
        cost = (-1.0/self.no_examples) * np.sum(self.Y * np.log(predicted_Y_values) + (1 - self.Y) * (np.log(1-predicted_Y_values)))
        return cost

    def gradient(self):
        predicted_Y_values = self.hypothesis(self.X)
        grad = (-1.0/self.no_examples) * np.dot((self.Y-predicted_Y_values), self.X)
        return grad
```

With these functions, the gradient descent method can be defined as:

```python
    def gradient_descent(self):
        for iter in range(1, self.max_iter):
            cost = self.cost_function()
            delta = self.gradient()
            self.theta = self.theta - self.learning_rate * delta
            print("iteration %s : cost %s " % (iter, cost))
```

These functions are used by the “train()” method, which first sets the values of the matrices X, Y and theta, and then calls the gradient_descent method:

```python
    def train(self, X, Y):
        self.X = self.add_bias_col(X)
        self.Y = Y
        self.no_examples, self.no_features = np.shape(X)
        self.theta = np.ones(self.no_features + 1)
        self.gradient_descent()
```

Once the values of θ have been determined with the gradient descent method, we can use the classifier to classify new examples:

```python
    def classify(self, X):
        X = self.add_bias_col(X)
        predicted_Y = self.hypothesis(X)
        predicted_Y_binary = np.round(predicted_Y)
        return predicted_Y_binary
```

Using this algorithm for gradient descent, we can correctly classify 297 out of 300 datapoints of our self-generated example (wrongly classified points are indicated with a cross).

Now that the concept of Logistic Regression is a bit more clear, let’s classify real-world data!

One of the most famous classification datasets is The Iris Flower Dataset. This dataset consists of three classes, where each example has four numerical features.

```python
import pandas as pd

to_bin_y = {
    1: { 'Iris-setosa': 1, 'Iris-versicolor': 0, 'Iris-virginica': 0 },
    2: { 'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 0 },
    3: { 'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 1 }
}

#loading the dataset
datafile = '../datasets/iris/iris.data'
df = pd.read_csv(datafile, header=None)
df_train = df.sample(frac=0.7)
df_test = df.loc[~df.index.isin(df_train.index)]
X_train = df_train.values[:,0:4].astype(float)
y_train = df_train.values[:,4]
X_test = df_test.values[:,0:4].astype(float)
y_test = df_test.values[:,4]
Y_train = np.array([to_bin_y[3][x] for x in y_train])
Y_test = np.array([to_bin_y[3][x] for x in y_test])

print("training Logistic Regression Classifier")
lr = LogisticRegression()
lr.train(X_train, Y_train)
print("trained")
predicted_Y_test = lr.classify(X_test)
f1 = f1_score(predicted_Y_test, Y_test, 1)
print("F1-score on the test-set for class %s is: %s" % (1, f1))
```

As you can see, our simple LogisticRegression class can classify the iris dataset with quite a high accuracy:

```
training Logistic Regression Classifier
iteration 1 : cost 8.4609605194
iteration 2 : cost 3.50586831057
iteration 3 : cost 3.78903735339
iteration 4 : cost 6.01488933456
iteration 5 : cost 0.458208317153
iteration 6 : cost 2.67703502395
iteration 7 : cost 3.66033580721
(...)
iteration 998 : cost 0.0362384208231
iteration 999 : cost 0.0362289106001
trained
F1-score on the test-set for class 1 is: 0.973225806452
```

For a full overview of the code, please have a look at GitHub.

Logistic Regression using Gradient Descent can also be used for NLP / Text Analysis tasks. There is a wide variety of tasks which are done in the field of NLP: authorship attribution, spam filtering, topic classification and sentiment analysis.

For a task like sentiment analysis we can follow the same procedure. We will have as the input a large collection of labelled text documents. These will be used to train the Logistic Regression classifier. The most important task then, is to select the proper features which will lead to the best sentiment classification. Almost everything in the text document can be used as a feature[2]; you are only limited by your creativity.

For sentiment analysis, usually the occurrence of (specific) words is used, or the relative occurrence of words (the word occurrences divided by the total number of words).
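A minimal sketch of computing the relative occurrences for a single document (the helper name is hypothetical, not from the post):

```python
# Relative occurrence of each word in a document: the word's count
# divided by the total number of words in that document.
def relative_occurrences(words):
    total = float(len(words))
    return {word: words.count(word) / total for word in set(words)}

review = ['a', 'brilliant', 'book', 'a', 'keeper']
print(relative_occurrences(review)['a'])  # 2 / 5 = 0.4
```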

As we have done before, we have to fill in the X matrix and Y vector, which will serve as input for the gradient descent algorithm, and this algorithm will give us the resulting feature vector θ. With this vector we can determine the class of other text documents.

As always, Y is a vector with n elements (where n is the number of text-documents). The matrix X is an n by m matrix; here m is the total number of relevant words in all of the text-documents. I will illustrate how to build up this matrix with three book reviews:

- **pos:** “This is such a beautiful edition of Harry Potter and the Sorcerer’s Stone. I’m so glad I bought it as a keep sake. The illustrations are just stunning.” (28 words in total)
- **pos:** “A brilliant book that helps you to open up your mind as wide as the sky” (16 words in total)
- **neg:** “This publication is virtually unreadable. It doesn’t do this classic justice. Multiple typos, no illustrations, and the most wonky footnotes conceivable. Spend a dollar more and get a decent edition.” (30 words in total)

These three reviews will result in the following X-matrix.

As you can see, each row of the X-matrix contains all of the data per review and each column contains the data per word. If a review does not contain a specific word, the corresponding column will contain a zero. Such an X-matrix containing all the data from the training set can be built up in the following manner:

Assuming that we have a list containing the data from the *training set*:

```python
[
 ([u'downloaded', u'the', u'book', u'to', u'my', ..., u'art'], 'neg'),
 ([u'this', u'novel', u'if', u'bunch', u'of', ..., u'ladies'], 'neg'),
 ([u'forget', u'reading', u'the', u'book', u'and', ..., u'hilarious!'], 'neg'),
 ...
]
```

From this *training_set*, we are going to generate a *words_vector*. This *words_vector* is used to keep track of which column a specific word belongs to. After this *words_vector* has been generated, the X matrix and Y vector can be filled in.

```python
def generate_words_vector(training_set):
    words_vector = []
    for review in training_set:
        for word in review[0]:
            if word not in words_vector:
                words_vector.append(word)
    return words_vector

def generate_Y_vector(training_set, training_class):
    no_reviews = len(training_set)
    Y = np.zeros(shape=no_reviews)
    for ii in range(0, no_reviews):
        review_class = training_set[ii][1]
        Y[ii] = 1 if review_class == training_class else 0
    return Y

def generate_X_matrix(training_set, words_vector):
    no_reviews = len(training_set)
    no_words = len(words_vector)
    X = np.zeros(shape=(no_reviews, no_words+1))
    for ii in range(0, no_reviews):
        X[ii][0] = 1
        review_text = training_set[ii][0]
        total_words_in_review = len(review_text)
        for word in set(review_text):
            word_occurences = review_text.count(word)
            word_index = words_vector.index(word)+1
            X[ii][word_index] = word_occurences / float(total_words_in_review)
    return X

words_vector = generate_words_vector(training_set)
X = generate_X_matrix(training_set, words_vector)
Y_neg = generate_Y_vector(training_set, 'neg')
```

As we have done before, the gradient descent method can be applied to derive the feature vector θ from the X and Y matrices:

```python
numIterations = 100
alpha = 0.55
m, n = np.shape(X)
theta = np.ones(n)
theta_neg = gradient_descent2(X, Y_neg, theta, alpha, m, numIterations)
```

What should we do if a specific review tests positive (Y=1) for more than one class? A review could result in Y=1 for both the *neu* class as well as the *neg* class. In that case we will pick the class with the highest score. This is called multinomial logistic regression.
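This class-picking step can be sketched as follows (the labels and theta vectors below are hypothetical placeholders; each theta would be the result of one binary training run as shown above):

```python
# Multinomial decision rule: evaluate the logistic hypothesis for every
# class's theta vector and pick the class with the highest score.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def classify_multinomial(x, thetas):
    scores = {label: sigmoid(np.dot(theta, x)) for label, theta in thetas.items()}
    return max(scores, key=scores.get)

thetas = {'pos': np.array([1.0, 2.0]), 'neg': np.array([-1.0, 0.5])}
print(classify_multinomial(np.array([1.0, 1.0]), thetas))  # 'pos'
```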

So far, we have seen how to implement a Logistic Regression Classifier in its most basic form. It is true that building such a classifier from scratch is great for learning purposes. It is also true that no one will get to the point of using deeper / more advanced Machine Learning skills without learning the basics first.

For real-world applications, however, often the best solution is not to re-invent the wheel but to re-use tools which are already available; tools which have been tested thoroughly and have been used by plenty of smart programmers before you. One such tool is Python’s NLTK library.

NLTK is Python’s Natural Language Toolkit and it can be used for a wide variety of Text Processing and Analytics jobs like tokenization, part-of-speech tagging and classification. It is easy to use and even includes a lot of text corpora, which can be used to train your model if you have no training set available.

Let us also have a look at how to perform sentiment analysis and text classification with NLTK. As always, we will use a training set to train NLTK’s Maximum Entropy Classifier and a test set to verify the results. Our training set has the following format:

```python
training_set = [
    ([u'this', u'novel', u'if', u'bunch', u'of', u'childish', ..., u'ladies'], 'neg'),
    ([u'where', u'to', u'begin', u'jeez', u'gasping', u'blushing', ..., u'fail????'], 'neg'),
    ...
]
```

As you can see, the training set consists of a list of tuples of two elements. The first element is a list of the words in the text of the document and the second element is the class-label of this specific review (‘neg’, ‘neu’ or ‘pos’). Unfortunately NLTK’s Classifiers only accept the text in a hashable format (dictionaries for example) and that is why we need to convert this list of words into a dictionary of words.

```python
def list_to_dict(words_list):
    return dict([(word, True) for word in words_list])

training_set_formatted = [(list_to_dict(element[0]), element[1]) for element in training_set]
```


Once the training set has been converted into the proper format, it can be fed into the train method of the MaxEnt Classifier:

```python
import nltk

numIterations = 100
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(training_set_formatted, algorithm, max_iter=numIterations)
classifier.show_most_informative_features(10)
```

Once the training of the MaxEntClassifier is done, it can be used to classify the reviews in the test set:

```python
for review in test_set_formatted:
    label = review[1]
    text = review[0]
    determined_label = classifier.classify(text)
    print(determined_label, label)
```

So far we have seen the theory behind the Naive Bayes Classifier and how to implement it (in the context of Text Classification) and in the previous and this blog-post we have seen the theory and implementation of Logistic Regression Classifiers. Although this is done at a basic level, it should give some understanding of the Logistic Regression method (I hope at a level where you can apply it and classify data yourself). There are however still many (advanced) topics which have not been discussed here:

- Which hill-climbing / gradient descent algorithm to use; IIS (Improved Iterative Scaling), GIS (Generalized Iterative Scaling), BFGS, L-BFGS or Coordinate Descent
- Encoding of the feature vector and the use of dummy variables
- Logistic Regression is an inherently sequential algorithm; although it is quite fast, you might need a parallelization strategy if you start using larger datasets.

If you see any errors please do not hesitate to contact me. If you have enjoyed reading, maybe even learned something, do not forget to subscribe to this blog and share it!

—

[1] See the paper of Nigam et al. on Maximum Entropy and the paper of Bo Pang et al. on Sentiment Analysis using Maximum Entropy. Also see Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models (1997), A brief MaxEnt tutorial, and another good MIT article.

[2] See for example Chapter 7 of Speech and Language Processing by (Jurafsky & Martin): For the task of period disambiguation a feature could be whether or not a period is followed by a capital letter unless the previous word is *St.*

One of the most important tasks in Machine Learning is classification (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in a test set (a dataset of which the entries have not yet been labelled) with the model which was constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, or classifying houses in the real-estate sector. Another field in which classification is big is Natural Language Processing (NLP). The goal of this field of science is to make machines (computers) understand written (human) language. You could think of text categorization, sentiment analysis, spam detection and topic categorization.

For classification tasks there are three widely used algorithms: Naive Bayes, Logistic Regression / Maximum Entropy and Support Vector Machines. We have already seen how Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it has the assumption of conditional independence of its features. This is a simplification which makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to a lower accuracy. A direct improvement on the N.B. classifier is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly. This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea, which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let's say we have a dataset containing the datapoints x_1, x_2, ..., x_n. For each of these (input) datapoints there is a corresponding (output) value y_i. Here, the x-datapoints are called the independent variables and y the dependent variable; the value of y depends on the value of x, while the value of x may be freely chosen without any restriction imposed on it by any other variable.

The goal of Regression analysis is to find a function which can best describe the correlation between x and y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h_θ(x).

If we can find such a function, we can say we have successfully built a Regression model. If the input-data lives in a 2D-space, this boils down to finding a curve which fits through the data points. In the 3D case we have to find a plane and in higher dimensions a hyperplane.

To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset which contains the final grade y_i of each student i. The dataset X contains the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time. The variable x_i therefore indicates how many hours student i has studied. The first thing we would do is visualize this data:

If the result looks like the figure on the left, then we are out of luck. It looks like the points are distributed randomly and there is no correlation between x and y at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.

This function could for example be:

h_θ(x) = θ_0 + θ_1 · x

or

h_θ(x) = θ_0 + θ_1 · x²

where θ_0 and θ_1 are the parameters of our model.

In evaluating the results from the previous section, we may find them unsatisfying; the function does not correlate with the datapoints strongly enough. Our initial assumption is probably not complete. Taking only the studying time into account is not enough. The final grade does not only depend on the studying time, but also on how much the students have slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by the pairs (x_i^(1), x_i^(2)) with output y_i. In this dataset x_i^(1) indicates how many hours student i has studied and x_i^(2) indicates how many hours he has slept.

This is an example of multivariate regression. The function has to include both variables. For example:

h_θ(x) = θ_0 + θ_1 · x^(1) + θ_2 · x^(2)

or

h_θ(x) = θ_0 + θ_1 · x^(1) + θ_2 · (x^(2))².

All of the above examples are examples of linear regression. We have seen that in some cases y depends on a linear form of x, but it can also depend on some power of x, or on the log or any other form of x. However, in all cases the parameters θ were linear.

So, what makes linear regression linear is not that y depends in a linear way on x, but that it depends in a linear way on θ: h_θ(x) needs to be linear with respect to the model parameters θ. Mathematically speaking, it needs to satisfy the superposition principle. Examples of nonlinear regression would be:

h_θ(x) = θ_0 + x^(θ_1)

or

h_θ(x) = θ_0 · e^(θ_1 · x)

The reason why the distinction is made between linear and nonlinear regression is that nonlinear regression problems are more difficult to solve, and therefore more computationally intensive algorithms are needed.

Linear regression models can be written as a linear system of equations, which can be solved by finding the closed-form solution with Linear Algebra. See these statistics notes for more on solving linear models with linear algebra.
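As a sketch of this closed-form approach (the data here is made up for illustration), the normal equation (XᵀX)θ = Xᵀy can be solved directly with NumPy:

```python
import numpy as np

# Closed-form (normal equation) solution for a linear model
# y = theta_0 + theta_1 * x; the data is generated with theta_0=1, theta_1=2.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])   # design matrix with a bias column
theta = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y
print(theta)  # [1. 2.]
```

Solving the linear system directly with `np.linalg.solve` is preferred over explicitly inverting XᵀX, but the cubic cost in the matrix size remains.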

As discussed before, such a closed-form solution can only be found for linear regression problems. However, even when the problem is linear in nature, we need to take into account that calculating the inverse of an n-by-n matrix has a time-complexity of O(n³). This means that for large datasets, finding the closed-form solution will take more time than solving the problem iteratively (with the gradient descent method), as is done for nonlinear problems. So solving it iteratively is usually preferred for larger datasets, even if the problem is linear.

The Gradient Descent method is a general optimization technique in which we try to find the values of the parameters θ with an iterative approach.

First, we construct a cost function (also known as loss function or error function) which measures the difference between the values of h_θ(x) (the values you expect y to have with the determined values of θ) and the actual values of y. The better your estimation of θ is, the better the values of h_θ(x) will approach the values of y.

Usually, the cost function is expressed as the squared error of this difference:

J(θ) = (1/2n) · Σ_i ( h_θ(x_i) - y_i )²

At each iteration we choose new values for the parameters θ, and move towards the 'true' values of these parameters, i.e. the values which make this cost function as small as possible. The direction in which we have to move is the negative gradient direction:

θ := θ - η · ∇J(θ).

The reason for this is that a function's value decreases fastest if we move in the direction of the negative gradient (the directional derivative is maximal in the direction of the gradient).

Taking all this into account, this is how gradient descent works:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of θ, calculate the gradient of the cost function: ∇J(θ).
- Update the values of the parameters: θ := θ - η · ∇J(θ).
- Fill in these new values in the hypothesis function and calculate again the value of the cost function J(θ).

Just as important as the initial guess of the parameters is the value you choose for the learning rate η. This learning rate determines how fast you move along the slope of the gradient. If the selected value of this learning rate is too small, it will take too many iterations before you reach your convergence criteria. If it is too large, you might overshoot and not converge.
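To make the procedure concrete, here is a minimal sketch of gradient descent for a one-parameter model h_θ(x) = θ·x; the dataset and the learning rate are made up for illustration:

```python
# Gradient descent for h(x) = theta * x, minimizing
# J(theta) = 1/2 * sum((h(x_i) - y_i)^2).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated with theta = 2

theta = 0.0                   # initial guess
eta = 0.01                    # learning rate
for _ in range(1000):
    # gradient of J with respect to theta
    grad = sum((theta * x - y) * x for x, y in zip(xs, ys))
    theta -= eta * grad       # step in the negative gradient direction
print(round(theta, 4))  # 2.0
```

Setting `eta` to, say, 0.2 on this dataset makes the updates overshoot and diverge, which illustrates the learning-rate trade-off described above.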

Logistic Regression is similar to (linear) regression, but adapted for the purpose of classification. The difference is small; for Logistic Regression we also have to apply gradient descent iteratively to estimate the values of the parameters θ. And again, during the iteration, the values are estimated by taking the gradient of the cost function, which is constructed from the difference between the hypothesis function and y. The major difference, however, is the form of the hypothesis function.

When you want to classify something, there are a limited number of classes it can belong to. And for each of these possible classes there can only be two states for y: either x belongs to the specified class and y = 1, or it does not belong to the class and y = 0. Even though the output values are binary, the independent variables are still continuous. So, we need a function which takes a large set of continuous variables as input and for each of them produces a binary output. This function, the hypothesis function, has the following form:

h_θ(x) = 1 / (1 + e^(-θᵀx)).

This function is also known as the logistic function, which is part of the sigmoid family of functions. These functions are widely used in the natural sciences because they provide the simplest model for population growth. However, the reason why the logistic function is used for classification in Machine Learning is its 'S-shape'.

As you can see, this function is bounded in the y-direction by 0 and 1. If the variable z = θᵀx is very negative, the output will go to zero (it does not belong to the class). If it is very positive, the output will go to one and it does belong to the class. (Such a function is called an indicator function.)
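A quick numerical check of this behaviour, as a sketch using only Python's math module:

```python
import math

def logistic(z):
    # the logistic function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(-10))  # close to 0: does not belong to the class
print(logistic(10))   # close to 1: does belong to the class
print(logistic(0))    # exactly 0.5: the middle of the 'S'
```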

The question then is what will happen to input values which are neither very positive nor very negative, but somewhere 'in the middle'. We have to define a decision boundary which separates the positive from the negative class. Usually this decision boundary is chosen at the middle of the logistic function, namely where the output value is 0.5:

h_θ(x) = 0.5, i.e. θᵀx = 0.   (1)

For those who are wondering where the θ we were talking about before entered the picture: as we can see in the formula of the logistic function, z = θᵀx. Meaning, the parameter θ (also known as the feature) maps the input variable x to a position on the z-axis. With this z-value, we can use the logistic function to calculate the y-value. If this y-value is larger than 0.5, we assume x does belong to the class, and vice versa.

So the features should be chosen such that they predict class membership correctly. It is therefore essential to know which features are useful for the classification task. Once the appropriate features are selected, gradient descent can be used to find the optimal values of these features.

How can we do gradient descent with this logistic function? Except for the hypothesis function having a different form, the gradient descent method is exactly the same. We again have a cost function, of which we iteratively take the gradient w.r.t. the features θ and update the feature values at each iteration.

This cost function is given by:

J(θ) = (1/n) · Σ_i Cost(h_θ(x_i), y_i)   (2)

We know that:

Cost(h_θ(x), y) = -log(h_θ(x)) if y = 1

and

Cost(h_θ(x), y) = -log(1 - h_θ(x)) if y = 0   (3)

Plugging these two equations back into the cost function gives us:

J(θ) = -(1/n) · Σ_i [ y_i · log(h_θ(x_i)) + (1 - y_i) · log(1 - h_θ(x_i)) ]   (4)

The gradient of the cost function with respect to θ_j is given by:

∂J(θ)/∂θ_j = (1/n) · Σ_i ( h_θ(x_i) - y_i ) · x_i,j   (5)

So the gradient of the seemingly difficult cost function turns out to be a much simpler equation. And with this simple equation, gradient descent for Logistic Regression is again performed in the same way:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of θ, calculate the gradient of the cost function: ∇J(θ).
- Update the values of the parameters: θ := θ - η · ∇J(θ).
- Fill in these new values in the hypothesis function and calculate again the value of the cost function J(θ).
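Putting the pieces together, here is a minimal sketch of logistic regression trained with gradient descent on a made-up one-dimensional dataset (all names and numbers are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset: class 0 for small x, class 1 for large x
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

theta0, theta1 = 0.0, 0.0   # initial guess
eta = 0.1                   # learning rate
for _ in range(5000):
    # gradient of the log-loss cost w.r.t. each parameter
    g0 = sum(sigmoid(theta0 + theta1 * x) - y for x, y in zip(xs, ys))
    g1 = sum((sigmoid(theta0 + theta1 * x) - y) * x for x, y in zip(xs, ys))
    theta0 -= eta * g0
    theta1 -= eta * g1

# Predict class 1 when the hypothesis crosses the 0.5 decision boundary
preds = [1 if sigmoid(theta0 + theta1 * x) > 0.5 else 0 for x in xs]
print(preds)  # [0, 0, 0, 1, 1, 1]
```

The learned decision boundary θ₀ + θ₁x = 0 ends up between the two groups of points, exactly as equation (1) describes.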

In the previous section we have seen how we can use Gradient Descent to estimate the feature values θ, which can then be used to determine the class with the logistic function. As stated in the introduction, this can be used for a wide variety of classification tasks. The only thing that will be different for each of these classification tasks is the form the features take on.

Here we will continue to look at the example of Text Classification. Let's assume we are doing Sentiment Analysis and want to know whether a specific review should be classified as positive, neutral or negative.

The first thing we need to know is which and what types of features we need to include.

For NLP we will need a large number of features, often as many as the number of words present in the training set. We could reduce the number of features by excluding stopwords, or by only considering n-gram features.

For example, the 5-gram 'kept me reading and reading' is much less likely to occur in a review document than the unigram 'reading', but if it occurs it is much more indicative of the class (positive) than 'reading'. Since we only need to consider n-grams which are actually present in the training set, there will be far fewer features if we only consider such n-grams instead of all unigrams.

The second thing we need to know is the actual value of these features. The values are learned by initializing all features to zero and applying the gradient descent method using the labeled examples in the training set. Once we know the values of the features, we can compute the probability of each class with the logistic function described above and choose the class with the maximum probability.

In this post we have discussed only the theory of Maximum Entropy and Logistic Regression. Usually such discussions are better understood with examples and the actual code. I will save that for the next blog.

If you have enjoyed reading this post or maybe even learned something from it, subscribe to this blog so you can receive a notification the next time something is posted.

Miles Osborne, Using Maximum Entropy for Sentence Extraction (2002)

Jurafsky and Martin, Speech and Language Processing; Chapter 7

Nigam et al., Using Maximum Entropy for Text Classification


With the bag-of-words model we check which words of the text document appear in a positive-words list or a negative-words list. If a word appears in the positive-words list, the total score of the text is updated with +1, and vice versa. If at the end the total score is positive, the text is classified as positive, and if it is negative, the text is classified as negative. Simple enough!
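Such a scorer can be sketched in a few lines; the mini word lists below are made up for illustration:

```python
# Hypothetical mini word lists, purely for illustration
positive_words = {"good", "great", "engaging"}
negative_words = {"bad", "boring", "predictable"}

def bow_score(text):
    score = 0
    for word in text.lower().split():
        if word in positive_words:
            score += 1    # word from the positive list
        elif word in negative_words:
            score -= 1    # word from the negative list
    return score

print(bow_score("A great and engaging book"))  # 2 -> positive
print(bow_score("This is not a good book"))    # 1 -> positive, although the meaning is negative
```

The second sentence already shows the weakness of the approach: word order is ignored, so "not good" still scores positive.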

With the Naive Bayes model, we do not take only a small set of positive and negative words into account, but all words the NB Classifier was trained with, i.e. all words present in the training set. If a word has not appeared in the training set, we have no data available and apply Laplacian smoothing (use 1 instead of the conditional probability of the word).

The probability that a document d belongs to a class c is given by the class probability P(c) multiplied by the product of the conditional probabilities of each word for that class:

P(c | d) ∝ P(c) · Π_{i=1..m} P(w_i | c) = P(c) · Π_{i=1..m} count(w_i, c) / N_c

Here count(w_i, c) is the number of occurrences of word w_i in class c, N_c is the total number of words in class c, and m is the number of words in the document we are currently classifying.

N_c does not change (unless the training set is expanded), so it can be placed outside of the product:

P(c | d) ∝ P(c) · (1 / N_c^m) · Π_{i=1..m} count(w_i, c)
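To illustrate the formula with made-up numbers (the words, counts and probabilities below are hypothetical), the class score of a short document can be computed as follows; the product of probabilities is computed as a sum of logarithms, a common trick to avoid numerical underflow:

```python
import math

# Hypothetical class and word probabilities, purely for illustration.
# P(c | d) is proportional to P(c) * product_i P(w_i | c)
class_prob = {"pos": 0.5, "neg": 0.5}
word_prob = {
    "pos": {"good": 0.10, "book": 0.05, "boring": 0.01},
    "neg": {"good": 0.02, "book": 0.05, "boring": 0.08},
}

def log_score(doc, label):
    # log P(c) + sum_i log P(w_i | c)
    s = math.log(class_prob[label])
    for w in doc:
        s += math.log(word_prob[label][w])
    return s

doc = ["good", "book"]
best = max(class_prob, key=lambda c: log_score(doc, c))
print(best)  # pos
```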

With this information it is easy to implement a Naive Bayes Text Classifier. (Naive Bayes can also be used to classify non-text / numerical datasets, for an explanation see this notebook).

We have a NaiveBayesText class, which accepts the input values X and Y as parameters for the `train()` method. Here X is a list of lists, where each lower-level list contains all the words in one document. Y is a list containing the label/class of each document.

```python
from collections import Counter, defaultdict
import numpy as np

class NaiveBaseClass:
    def calculate_relative_occurences(self, list1):
        no_examples = len(list1)
        ro_dict = dict(Counter(list1))
        for key in ro_dict.keys():
            ro_dict[key] = ro_dict[key] / float(no_examples)
        return ro_dict

    def get_max_value_key(self, d1):
        values = list(d1.values())
        keys = list(d1.keys())
        max_value_index = values.index(max(values))
        max_key = keys[max_value_index]
        return max_key

    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = defaultdict(list)

class NaiveBayesText(NaiveBaseClass):
    """
    When the goal is classifying text, it is better to give the input X
    in the form of a list of lists containing words:
    X = [['this', 'is', 'a', ...], (...)]
    Y still is a 1D array / list containing the labels of each entry.
    """
    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = []

    def train(self, X, Y):
        self.class_probabilities = self.calculate_relative_occurences(Y)
        self.labels = np.unique(Y)
        self.no_examples = len(Y)
        self.initialize_nb_dict()
        for ii in range(0, len(Y)):
            label = Y[ii]
            self.nb_dict[label] += X[ii]
        # transform the list with all occurences to a dict with relative occurences
        for label in self.labels:
            self.nb_dict[label] = self.calculate_relative_occurences(self.nb_dict[label])
```

As we can see, the training of the Naive Bayes Classifier is done by iterating through all of the documents in the training set. From all of the documents, a hash table (a dictionary, in Python) with the relative occurrence of each word per class is constructed.

This is done in two steps:

1. construct a huge list of all occurring words per class:

```python
for ii in range(0, len(Y)):
    label = Y[ii]
    self.nb_dict[label] += X[ii]
```

2. calculate the relative occurrence of each word in this huge list, with the `calculate_relative_occurences` method. This method simply uses Python's Counter module to count how often each word occurs and then divides this number by the total number of words.
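The idea behind `calculate_relative_occurences` can be seen in isolation with a few lines (the toy word list is made up):

```python
from collections import Counter

# Toy list of words belonging to one class, purely for illustration
words = ["good", "good", "great", "boring"]
counts = Counter(words)
# divide each count by the total number of words
relative = {word: count / float(len(words)) for word, count in counts.items()}
print(relative)  # {'good': 0.5, 'great': 0.25, 'boring': 0.25}
```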

The result is saved in the dictionary *nb_dict*.

As we can see, it is easy to train the Naive Bayes Classifier: we simply calculate the relative occurrence of each word per class and save the result in the `nb_dict` dictionary.

This dictionary can be updated, saved to file, and loaded back from file. It contains the results of Naive Bayes Classifier training.

Classifying new documents is also done quite easily by calculating the class probability for each class and then selecting the class with the highest probability.

```python
def classify_single_elem(self, X_elem):
    Y_dict = {}
    for label in self.labels:
        class_probability = self.class_probabilities[label]
        nb_dict_features = self.nb_dict[label]
        for word in X_elem:
            if word in nb_dict_features:
                relative_word_occurence = nb_dict_features[word]
                class_probability *= relative_word_occurence
            else:
                # Laplacian smoothing: multiply by 1 for words
                # which did not occur in the training set
                class_probability *= 1
        Y_dict[label] = class_probability
    return self.get_max_value_key(Y_dict)

def classify(self, X):
    self.predicted_Y_values = []
    n = len(X)
    for ii in range(0, n):
        X_elem = X[ii]
        prediction = self.classify_single_elem(X_elem)
        self.predicted_Y_values.append(prediction)
    return self.predicted_Y_values
```

In the next blog we will look at the results of this naively implemented algorithm for the Naive Bayes Classifier and see how it performs under various conditions; we will see the influence of varying training set sizes and whether the use of n-gram features will improve the accuracy of the classifier.

- To keep track of the number of occurrences of each word, we tokenize the text and add each word to a single list. Then by using a Counter element we can keep track of the number of occurrences.
- We can make a DataFrame containing the class probabilities of each word by adding each word to the DataFrame as we encounter it, and dividing by the total number of occurrences afterwards.
- Sorting this DataFrame by the values in the columns of the Positive or Negative class, then taking the top 100 / 200 words we can construct a list containing negative or positive words.
- These words in this constructed Sentiment Lexicon can be used to give a value to the subjectivity of the reviews in the test set.

Using the steps described above, we were able to determine the subjectivity of reviews in the test set with an accuracy (F-score) of ~60%.

In this blog we will look into the effectiveness of cross-book sentiment lexicons; how well does a sentiment lexicon made from book A perform at sentiment analysis of book B?

We will also see how we can improve the bag-of-words technique by including n-gram features in the bag-of-words.

In the previous post, we have seen that the sentiment of reviews in the test-set of ‘Gone Girl’ could be predicted with a 60% accuracy. How well does the sentiment lexicon derived from the training set of book A perform at deducing the sentiment of reviews in the test set of book B?

In the table above, we can see that the most effective Sentiment Lexicons are created from books with a large number of both positive and negative reviews. In the previous post we saw that Fifty Shades of Grey has a large number of negative reviews. This makes it a good book to construct an effective Sentiment Lexicon from.

Other books have a lot of positive reviews but only a few negative ones. The Sentiment Lexicon constructed from these books has a high accuracy in determining the sentiment of positive reviews, but a low accuracy for negative reviews… bringing the average down.

In the previous blog-post we constructed a bag-of-words model with unigram features, meaning that we split the entire text into single words and count the occurrence of each word. Such a model does not take the position of each word in the sentence, its context or the grammar into account. That is why the bag-of-words model has a low accuracy in detecting the sentiment of a text document.

For example, with the bag-of-words model the following two sentences will be given the same score:

1. “This is not a good book” –> 0 + 0 + 0 + 0 + 1 + 0 –> positive

2. “This is a very good book” –> 0 + 0 + 0 + 0 + 1 + 0 –> positive

If we include features consisting of two or three words, this problem can be avoided; “not good” and “very good” will be two different features with different subjectivity scores. The biggest reason why bigram or trigram features are not used more often is that the number of possible combinations of words increases exponentially with the number of words. Theoretically, a document with 2.000 distinct words can have 2.000 possible unigram features, 4.000.000 possible bigram features and 8.000.000.000 possible trigram features.
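The exponential growth is easy to verify (2.000 here is just the example vocabulary size from above):

```python
vocabulary_size = 2000
print(vocabulary_size)       # possible unigrams: 2000
print(vocabulary_size ** 2)  # possible bigrams: 4000000
print(vocabulary_size ** 3)  # possible trigrams: 8000000000
```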

However, if we consider this problem from a pragmatic point of view, we can say that most of the combinations of words which can be made are grammatically not possible, or do not occur often enough to be significant, and hence don't need to be taken into account.

Actually, we only need to define a small set of words (prepositions, conjunctions, interjections etc.) of which we know that they change the meaning of the words following them and/or the rest of the sentence. If we encounter such an 'n-gram word', we do not split the sentence at that word but after the next word. In this way we will construct n-gram features consisting of the specified words and the words directly following them. Some examples of such words are:

In the previous post, we had seen that the code to construct a DataFrame containing the class probabilities of words in the training set is:

```python
import pandas as pd

BOW_df = pd.DataFrame(0, columns=scores, index=[''])
words_set = set()
for review in training_set:
    score = review['score']
    text = review['review_text']
    splitted_text = split_text(text)
    for word in splitted_text:
        if word not in words_set:
            words_set.add(word)
            BOW_df.loc[word] = [0, 0, 0, 0, 0]
        BOW_df.loc[word, score] += 1
```

If we also want to include ngrams in this class probability DataFrame, we need to include a function which generates n-grams from the splitted text and the list of specified ngram words:

```python
(...)
splitted_text = split_text(text)
text_with_ngrams = generate_ngrams(splitted_text, ngram_words)
for word in text_with_ngrams:
    (...)
```

There are a few conditions this “generate_ngrams” function needs to fulfill:

- When it iterates through the split text and encounters an n-gram word, it needs to concatenate this word with the next word. So [“I”,”do”,”not”,”recommend”,”this”,”book”] needs to become [“I”, “do”, “not recommend”, “this”, “book”]. At the same time it needs to skip the next iteration so the next word does not appear two times.
- It needs to be recursive: we might encounter multiple n-gram words in a row. Then all of the words need to be concatenated into a single n-gram. So [“This”,”is”,”a”,”very”,”very”,”good”,”book”] needs to be concatenated into [“This”,”is”,”a”,”very very good”,”book”]. If n words are concatenated together into a single n-gram, the next n iterations need to be skipped.
- In addition to concatenating words with the words following them, it might also be interesting to concatenate them with the word preceding them. For example, forming n-grams including the word “book” and its preceding words leads to features like “worst book”, “best book”, “fascinating book” etc…

Now that we know this information, let's have a look at the code:

```python
def generate_ngrams(text, ngram_words):
    new_text = []
    index = 0
    while index < len(text):
        [new_word, new_index] = concatenate_words(index, text, ngram_words)
        new_text.append(new_word)
        index = new_index + 1 if index != new_index else index + 1
    return new_text

def concatenate_words(index, text, ngram_words):
    word = text[index]
    if index == len(text) - 1:
        return word, index
    if word in ngram_words:
        [word_new, new_index] = concatenate_words(index + 1, text, ngram_words)
        word = word + ' ' + word_new
        index = new_index
    return word, index
```

Here `concatenate_words` is a recursive function which either returns the word at the index position in the array, or that word concatenated with the next word(s). It also returns the index, so we know how many iterations need to be skipped.

This function will also work if we want to append words to their previous words. Then we simply need to pass the reversed text to it with `text = list(reversed(text))` and concatenate in reversed order: `word = word_new + ' ' + word`.

We can put this information together in a single function, which can either concatenate with the next word or with the previous word, depending on the value of the parameter ‘forward’:

```python
def generate_ngrams(text, ngram_words, forward=True):
    new_text = []
    index = 0
    if not forward:
        text = list(reversed(text))
    while index < len(text):
        [new_word, new_index] = concatenate_words(index, text, ngram_words, forward)
        new_text.append(new_word)
        index = new_index + 1 if index != new_index else index + 1
    if not forward:
        return list(reversed(new_text))
    return new_text

def concatenate_words(index, text, ngram_words, forward):
    words = text[index]
    if index == len(text) - 1:
        return words, index
    if words.split(' ')[0] in ngram_words:
        [new_word, new_index] = concatenate_words(index + 1, text, ngram_words, forward)
        if forward:
            words = words + ' ' + new_word
        else:
            words = new_word + ' ' + words
        index = new_index
    return words, index
```

Using this simple function to concatenate words in order to form n-grams will lead to features which strongly correlate with a specific (negative/positive) class, like 'highly recommend', 'best book' or even 'couldn't put it down'.
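For illustration, here is a self-contained, forward-only restatement of the function above, applied to the two example sentences:

```python
# Forward-only n-gram concatenation, restated for a standalone demo.
def concatenate_words(index, text, ngram_words):
    word = text[index]
    if index == len(text) - 1:
        return word, index
    if word in ngram_words:
        # recurse: the next word may itself be an n-gram word
        next_word, new_index = concatenate_words(index + 1, text, ngram_words)
        return word + ' ' + next_word, new_index
    return word, index

def generate_ngrams(text, ngram_words):
    new_text, index = [], 0
    while index < len(text):
        word, new_index = concatenate_words(index, text, ngram_words)
        new_text.append(word)
        index = new_index + 1   # skip the words that were concatenated
    return new_text

print(generate_ngrams(["I", "do", "not", "recommend", "this", "book"], {"not"}))
# ['I', 'do', 'not recommend', 'this', 'book']
print(generate_ngrams(["a", "very", "very", "good", "book"], {"very"}))
# ['a', 'very very good', 'book']
```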

Now that we have a better understanding of Text Classification terms like bag-of-words, features and n-grams, we can start using Classifiers for Sentiment Analysis. Think of Naive Bayes, Maximum Entropy and SVM.

In my previous post I have explained the theory behind three of the most popular Text Classification methods (Naive Bayes, Maximum Entropy and Support Vector Machines) and told you that I will use these Classifiers for the automatic classification of the subjectivity of Amazon.com book reviews.

The purpose is to get a better understanding of how these Classifiers work and perform under various conditions, i.e. do a comparative study about Sentiment Analytics.

In this blog-post we will use the bag-of-words model to do Sentiment Analysis. The bag-of-words model can perform quite well at Topic Classification, but is inaccurate when it comes to Sentiment Classification. Bo Pang and Lillian Lee report an accuracy of 69% in their 2002 research about movie review sentiment analysis. With the three Classifiers this percentage goes up to about 80% (depending on the chosen features).

The reason to still make a bag-of-words model is that it gives us a better understanding of the content of the text and we can use this to select the features for the three classifiers. The Naive Bayes model is also based on the bag-of-words model, so the bag-of-words model can be used as an intermediate step.

We can collect book reviews from Amazon.com by scraping them from the website with BeautifulSoup. The process for this was already explained in the context of Twitter.com and it should not be too difficult to do the same for Amazon.com.

In total 213.335 book reviews were collected for eight randomly chosen books:

After making a bar-plot of the distribution of the different star ratings for the chosen books, we can see that there is a strong variation. Books which are considered to be average have almost no 1-star ratings, while books far below average have a more uniform distribution of the different ratings.

We can see that the book ‘Gone Girl’ has a pretty uniform distribution so it seems like a good choice for our training set. Books like ‘Unbroken’ or ‘The Martian’ might not have enough 1-star reviews to train for the Negative class.

As the next step, we are going to divide the corpus of reviews into a training set and a test set. The book 'Gone Girl' has about 40.000 reviews, so we can use *up to* half of them for training purposes and the other half for testing the accuracy of our model. In order to also take into account the effects of the training set size on the accuracy of the model, we will vary the training set size from 1.000 up to 20.000.

The bag-of-words model is one of the simplest language models used in NLP. It makes a unigram model of the text by keeping track of the number of occurrences of each word. This can later be used as features for Text Classifiers. In this bag-of-words model you only take individual words into account and give each word a specific subjectivity score. This subjectivity score can be looked up in a sentiment lexicon[1]. If the total score is negative the text is classified as negative, and if it is positive the text is classified as positive. The model is simple to make, but it is less accurate because it does not take the word order or grammar into account.

A simple improvement on using unigrams would be to use unigrams + bigrams: that is, to not split a sentence after words like "not", "no", "very", "just" etc. It is easy to implement and can give a significant improvement in accuracy. The sentence "This book is not good" will be interpreted as a positive sentence unless such a construct is implemented. Another example is that the sentences "This book is very good" and "This book is good" will have the same score with a unigram model of the text, but not with a unigram + bigram model.

My pseudocode for creating a bag-of-words model is as follows:

- Initialize *list_BOW* = [].
- For each review in the training set:
  - Strip the newline character "\n" at the end of each review.
  - Place a space before and after each of the following characters: .,()[]:;" (This prevents sentences like "I like this book.It is engaging" being interpreted as ["I", "like", "this", "book.It", "is", "engaging"].)
  - Tokenize the text by splitting it on spaces.
  - Remove tokens which consist of only a space, an empty string or punctuation marks.
  - Append the tokens to *list_BOW*. *list_BOW* now contains all words occurring in the training set.
- Place *list_BOW* in a Python Counter element. This Counter now contains all occurring words together with their frequencies. Its entries can be sorted with the most_common() method.
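The pseudocode above can be sketched in a few lines of Python (the example reviews are made up):

```python
from collections import Counter

def tokenize(review):
    review = review.rstrip("\n")                   # strip the trailing newline
    for char in '.,()[]:;"':
        review = review.replace(char, " " + char + " ")
    tokens = review.split(" ")
    # remove empty strings, spaces and bare punctuation marks
    return [t.lower() for t in tokens if t.strip() and t not in '.,()[]:;"']

# Hypothetical mini training set
training_set = ["I like this book.It is engaging\n", "Not an engaging book."]
list_BOW = []
for review in training_set:
    list_BOW.extend(tokenize(review))

bag_of_words = Counter(list_BOW)
print(bag_of_words.most_common(2))
```

Note how "book.It" is correctly split into "book" and "it" because the period gets spaces around it before tokenization.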

The real question is how we should determine the sentiment/subjectivity score of each word in order to determine the total subjectivity score of the text. We can use one of the sentiment lexicons given in [1], but we don't really know under which circumstances and for which purposes these lexicons were created. Furthermore, in most of these lexicons the words are classified in a binary way (either positive or negative). Bing Liu's sentiment lexicon, for example, contains a list of a few thousand positive and a few thousand negative words.

Bo Pang and Lillian Lee used words which were chosen by two students as positive and negative words.

It would be better if we determine the subjectivity score of each word using some simple statistics of the training set. To do this we need to determine the class probability of each word present in the bag-of-words. This can be done by using a pandas DataFrame as a data container (but it can just as easily be done with dictionaries or other data structures). The code for this looks like:

```python
import pandas as pd

BOW_df = pd.DataFrame(0, columns=scores, index=[''])
words_set = set()
for review in training_set:
    score = review['score']
    text = review['review_text']
    splitted_text = split_text(text)
    for word in splitted_text:
        if word not in words_set:
            words_set.add(word)
            BOW_df.loc[word] = [0, 0, 0, 0, 0]
        BOW_df.loc[word, score] += 1
```

Here `split_text` is the method for splitting a text into a list of individual words:

```python
def expand_around_chars(text, characters):
    for char in characters:
        text = text.replace(char, " " + char + " ")
    return text

def split_text(text):
    text = strip_quotations_newline(text)
    text = expand_around_chars(text, '".,()[]{}:;')
    splitted_text = text.split(" ")
    cleaned_text = [x for x in splitted_text if len(x) > 1]
    text_lowercase = [x.lower() for x in cleaned_text]
    return text_lowercase
```

This gives us a DataFrame containing the number of occurrences of each word in each class:

```
       Unnamed: 0      1      2      3      4      5
0               i   4867   5092   9178  14180  17945
1         through    210    232    414    549    627
2             all    499    537    923   1355   1791
3       drawn-out      1      0      1      1      0
4               ,   4227   4779   8750  15069  18334
5        detailed      3      7     15     30     36
...           ...    ...    ...    ...    ...    ...
31800    a+++++++      0      0      0      0      1
31801   nailbiter      0      0      0      0      1
31802     melinda      0      0      0      0      1
31803  reccomend!      0      0      0      0      1
31804  suspense!!      0      0      0      0      1

[31804 rows x 6 columns]
```

As we can see, there are also quite a few words which occur only once. These words will have a class probability of 100% for the class in which they occur.

This distribution, however, does not approximate the real class distribution of that word at all. It is therefore good to define an 'occurrence cut-off value': words which occur less often than this value are not taken into account.

By dividing each element of each row by the sum of the elements of that row, we get a DataFrame containing the relative occurrences of each word in each class, i.e. a DataFrame with the class probabilities of each word. After this is done, the words with the highest probability in class 1 can be taken as negative words, and the words with the highest probability in class 5 as positive words.
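This step can be sketched as follows; the toy counts, the cut-off value and the variable names here are my own assumptions for illustration:

```python
import pandas as pd

# Toy word-count DataFrame: rows are words, columns are the 1-5 star classes.
# These counts are made up for illustration.
BOW_df = pd.DataFrame({1: [1, 100], 2: [0, 120], 3: [1, 150],
                       4: [1, 200], 5: [0, 430]},
                      index=['drawn-out', 'good'])

# Drop words below the occurrence cut-off value.
min_occurrences = 10
frequent = BOW_df[BOW_df.sum(axis=1) >= min_occurrences]

# Divide each row by its row sum to obtain per-word class probabilities.
prob_df = frequent.div(frequent.sum(axis=1), axis=0)
```

Here 'drawn-out' occurs only three times and is filtered out by the cut-off, while 'good' gets a class-5 probability of 0.43.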

We can construct such a sentiment lexicon from the training set and use it to measure the subjectivity of reviews in the test set. The larger the training set, the more accurate the sentiment lexicon becomes for prediction.

By labeling 4 and 5-star reviews as Positive, 1 and 2-star reviews as Negative and 3-star reviews as Neutral, and using the positive and negative words constructed this way, we can determine with the bag-of-words model whether a review is positive or negative with 60% accuracy.

Some remaining questions and possible next steps:

- How accurate is this list of positive and negative words, constructed from the reviews of book A, in determining the subjectivity of the reviews of book B?
- How much more accurate will the bag-of-words model become if we take bigrams or even trigrams into account? There were words with a high negative or positive subjectivity in the word-list which do not have a negative or positive meaning by themselves. This can only be understood if you take the preceding or following words into account.
- Make an overall sentiment lexicon from all the reviews of all the books.
- Use the bag-of-words as features for Classifiers such as Naive Bayes, Maximum Entropy and Support Vector Machines.


If data visualization is done wrong, it can be boring and fail to grab the readers' attention, or even worse, convey the wrong message.

If it is done correctly, it can intrigue even the most indifferent reader (some people can even turn Data Visualizations into an art form).

I personally think Python's matplotlib is a great library for data visualization. Another amazing library is D3, which is just as intuitive and flexible as matplotlib. In addition, it is a JavaScript library, so it works in the browser; this makes it platform independent and you don't have to install any software. Did I already tell you D3 is a.. maa.. zing!?

That is why I will focus on Data Visualizations with D3 in the future. But for now, I will start with something simpler and show you how to make a choropleth map. This is the kind of map you see at every election, where each state is colored in the color of the winning party. Although it might seem difficult to make such a map… it is not.

First thing we need is a map of a country (or the area we want to visualize) in SVG form. Wikipedia has a nice collection of blank maps we can use. Copy the code of this map in a <div> element of a basic html page. As an example, we can take this map of the world.

Another thing we need to include is the jQuery library, so go ahead and link to the latest jQuery version hosted by Google (e.g. `https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js`) in a `<script>` tag.

If we open the page now, we should see the map drawn out.

As we can see in the code, each <path> element has its own id. The code for Australia for example looks like:

Sometimes we might be lucky and this id will actually be equal to the name of the state/country, and sometimes it might be a random number/word. If that is the case, don't lose any sleep over it. It is not very difficult to discover which element belongs to which country with Chrome Developer Tools (right click on the country and then click on 'inspect element').

Now that we know the id of the country we want to color in, we can give it a color with the javascript code:

`$("#path6235").css('fill', 'red');`

Now we need some data to fill in the map. Since the war in Syria / the Syrian refugee crisis is a current issue, it might be interesting to see which countries are donating the most / least to the Syrian crisis. The data for this can be found on this website. We could choose to color based on the absolute amount of money, but it seems fairer to look at the donated amount relative to the country's GDP.

If we divide the donated amount by that country's GDP for that year, we get this data. Now we only need to put it in the correct format, which is JSON.

In our example, the correct data in the correct format looks like:

The complete dataset can be downloaded from here. In this file each number indicates the donated amount as 1/1000th percentage of the annual GDP. Go ahead and place the data in a <script> tag so that it can be accessed by JavaScript.

You can check whether or not the data is recognized by the browser by executing `console.log(data["Switzerland"])` within a `<script>` tag. This should print the data for Switzerland in the console of the browser:

Now the entire map can be filled in with a javascript function which iterates through the variable containing the data:

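A minimal sketch of such a function is given below; the blue-to-green interpolation, the maximum value, and the assumption that each `<path>` id equals the country key in `data` are my own simplifications, not necessarily the original implementation:

```javascript
// Map a donation value onto a blue-to-green gradient; red is applied
// separately for countries without donations.
function valueToColor(value, maxValue) {
  var t = Math.min(value / maxValue, 1); // clamp to [0, 1]
  var green = Math.round(255 * t);
  var blue = 255 - green;
  return "rgb(0," + green + "," + blue + ")";
}

function colorMap(data, maxValue) {
  $("path").css("fill", "red"); // default: no donation
  for (var country in data) {
    // assumes the id of each <path> equals the country name in the data
    $("#" + country).css("fill", valueToColor(data[country], maxValue));
  }
}
```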

With the correct colors filled in, the map looks like:

In the above map all of the countries with no donations for the Syrian crisis (in 2015) are colored red. The countries which have donated money are colored in based on a blue-to-green gradient, where blue indicates a relatively low and green indicates a relatively large donation (Russia ~ 0.5 / 1000 % of GDP and Canada ~ 11 / 1000 % of GDP).

If you are interested, you can download the entire html file from here.

Now that I have covered the basics of making a choropleth map, I want to address the issue that the way you choose to visualize your data can have a huge impact on the message your visualization is conveying.

If the countries with no donated money were left untouched, the first impression of the visualization would be that there is no data available on these countries.

Choosing a gradient scale from red to green instead of blue to green conveys the message that the countries colored red have done something bad (red is associated with danger).

Although I think everybody can donate more, I would not want to give the impression that Brazil has done something bad by donating ‘only’ 5.000.000 USD.


Natural Language Processing (NLP) is a vast area of Computer Science that is concerned with the interaction between Computers and Human Language[1].

Within NLP many tasks are – or can be reformulated as – classification tasks. In classification tasks we try to produce a classification function which maps a certain 'feature' to a class. This Classifier first has to be trained with a training dataset, and then it can be used to actually classify documents. Training means that we have to determine its model parameters. If the set of training examples is chosen correctly, the Classifier should predict the class probabilities of new documents with an accuracy similar to that achieved on the training examples.

After construction, such a Classifier could for example tell us that document containing the words “Bose-Einstein condensate” should be categorized as a Physics article, while documents containing the words “Arbitrage” and “Hedging” should be categorized as a Finance article.

Another Classifier could tell us that mails starting with “Dear Customer/Guest/Sir” (instead of your name) and containing words like “Great opportunity” or “one-time offer” can be classified as spam.

Here we can already see two uses of classification models: *topic classification* and *spam filtering*. For these purposes Classifiers work quite well and perform better than most trained professionals.

A third usage of Classifiers is Sentiment Analysis. Here the purpose is to determine the subjective value of a text-document, i.e. how positive or negative is the content of a text document. Unfortunately, for this purpose these Classifiers fail to achieve the same accuracy. This is due to the subtleties of human language; sarcasm, irony, context interpretation, use of slang, cultural differences and the different ways in which opinion can be expressed (subjective vs comparative, explicit vs implicit).

In this blog I will discuss the theory behind three popular Classifiers (Naive Bayes, Maximum Entropy and Support Vector Machines) in the context of Sentiment Analysis[2]. In the next blog I will apply this gained knowledge to automatically deduce the sentiment of collected Amazon.com book reviews.

The contents of this blog-post are as follows:

- Basic concepts of text classification:
- Tokenization
- Word normalization
- bag-of-words model
- Classifier evaluation

- Naive Bayesian Classifier
- Maximum Entropy Classifier
- Support Vector Machines
- What to Expect

Tokenization is the name given to the process of chopping up sentences into smaller pieces (words or tokens). The segmentation into tokens can be done with decision trees, which contain the information needed to correctly resolve the issues you might encounter. Some of the issues you have to consider are:

- The choice of delimiter will in most cases be a whitespace (“We’re going to Barcelona.” -> [“We’re”, “going”, “to”, “Barcelona.”]), but what should you do when you come across words with a whitespace in them (“We’re going to The Hague.” -> [“We’re”, “going”, “to”, “The”, “Hague”])?
- What should you do with punctuation marks? Although many tokenizers are geared towards throwing punctuation away, for Sentiment Analysis a lot of valuable information can be deduced from it: **!** puts extra emphasis on the negative/positive sentiment of a sentence, while **?** can indicate uncertainty (no sentiment).
- “, ‘, [], () can mean that the words belong together and should be treated as a separate sentence. The same goes for words which are **bold**, *italic*, __underlined__, or inside a link. If you also want to take these last elements into consideration, you should scrape the HTML code and not just the text.

**Word Normalization** is the reduction of each word to its base/stem form (by chopping off the affixes). While doing this, we should consider the following issues:

- Capital letters should be normalized to lowercase, unless they occur in the middle of a sentence; there they could indicate the name of a writer, place, brand etc.
- What should be done with the apostrophe (')? “George’s phone” should obviously be tokenized as “George” and “phone”, but I’m, we’re and they’re should be expanded to I am, we are and they are. To make it even more difficult, the apostrophe can also be used as a quotation mark.
- Ambiguous words like High-tech, The Hague, P.h.D., USA, U.S.A., US and us.

After the text has been segmented into sentences, each sentence has been segmented into words, and the words have been tokenized and normalized, we can make a simple bag-of-words model of the text. In this bag-of-words representation you only take individual words into account and give each word a specific subjectivity score. This subjectivity score can be looked up in a sentiment lexicon[7]. If the total score is negative, the text is classified as negative; if it is positive, the text is classified as positive.
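As a minimal sketch of this scoring scheme (the lexicon here is a made-up toy example, not one of the lexicons from [7]):

```python
# Toy sentiment lexicon: word -> subjectivity score.
lexicon = {"awesome": 1.0, "good": 0.5, "boring": -0.5, "terrible": -1.0}

def subjectivity_score(tokens):
    # Sum the subjectivity scores of all words; unknown words count as 0.
    return sum(lexicon.get(token, 0.0) for token in tokens)

def classify(tokens):
    score = subjectivity_score(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```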

For determining the accuracy of a single Classifier, or for comparing the results of different Classifiers, the F-score is usually used. This F-score is given by

$F_1 = \frac{2 \cdot P \cdot R}{P + R}$

where $P$ is the precision and $R$ is the recall. The precision is the number of correctly classified examples divided by the total number of classified examples. The recall is the number of correctly classified examples divided by the actual number of examples of that class in the dataset.
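In code, with `tp`, `fp` and `fn` denoting the true positive, false positive and false negative counts:

```python
def f1_score(tp, fp, fn):
    # Precision: fraction of classified examples that are correct.
    precision = tp / (tp + fp)
    # Recall: fraction of actual examples that were found.
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```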

Naive Bayes [3] classifiers approach the classification task from a statistical point of view. The starting point is that the probability of a class $c$ is given by the posterior probability $P(c \mid d)$ given a training document $d$. Here $d$ refers to the text of the document, given by $d = (w_1, w_2, \ldots, w_n)$, where $w_i$ is the $i$-th attribute (word) of document $d$.

Using Bayes’ rule, this posterior probability can be rewritten as:

$P(c \mid d) = \frac{P(d \mid c) \, P(c)}{P(d)}$

Since the marginal probability $P(d)$ is equal for all classes, it can be disregarded and the equation becomes:

$P(c \mid d) \propto P(d \mid c) \, P(c)$

The document belongs to the class which maximizes this probability, so:

$c^{*} = \underset{c}{\operatorname{arg\,max}} \; P(d \mid c) \, P(c)$

Assuming conditional independence of the words $w_i$, this equation simplifies to:

$c^{*} = \underset{c}{\operatorname{arg\,max}} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)$

Here $P(w_i \mid c)$ is the conditional probability that word $w_i$ belongs to class $c$. For the purpose of text classification, this probability can simply be calculated as the frequency of word $w_i$ in class $c$ relative to the total number of words in class $c$.

We have seen that we need to multiply the class probability with all of the prior-probabilities of the individual words belonging to that class. The question then is, how do we know what the prior-probabilities of the words are? Here we need to remember that this is a supervised machine learning algorithm: we can estimate the prior-probabilities with a training set with documents that are already labeled with their classes. With this training set we can train the model and obtain values for the prior probabilities. This trained model can then be used for classifying unlabeled documents.

This is relatively easy to understand with an example. Let's say we have counted the number of words in a set of labeled training documents, in which each text document has been labeled as either Positive, Neutral or Negative. The result is a table of word counts per class.

From this table we can already deduce each of the class probabilities, i.e. the relative frequency of each class in the training set.

If we look at the sentence “This blog-post is awesome.”, the probability of this sentence belonging to a specific class is obtained by multiplying that class probability with the conditional probabilities of the words ‘this’, ‘blog-post’, ‘is’ and ‘awesome’ for that class.

This sentence can thus be classified in the positive category.
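This calculation can be sketched in a few lines of Python; the word counts and priors below are made-up toy numbers, and Laplace smoothing is an addition of mine so that unseen words do not zero out the product:

```python
from collections import Counter

# Toy word counts per class (made up for illustration).
word_counts = {
    "positive": Counter({"awesome": 10, "good": 8, "boring": 1}),
    "negative": Counter({"awesome": 1, "good": 2, "boring": 9}),
}
class_priors = {"positive": 0.5, "negative": 0.5}

def nb_classify(words):
    best_class, best_score = None, 0.0
    for c, counts in word_counts.items():
        total = sum(counts.values())
        score = class_priors[c]
        for w in words:
            # Laplace smoothing: add 1 to every count.
            score *= (counts[w] + 1) / (total + len(counts))
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```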

The principle behind Maximum Entropy [4] is that the correct distribution is the one that maximizes the Entropy / uncertainty and still meets the constraints which are set by the ‘evidence’.

Let me explain this a bit more. In Information Theory, the word Entropy is used as a measure of the unpredictability of the content of information. If you throw a fair die, each of the six outcomes has the same probability of occurring (1/6). Therefore you have maximum uncertainty: the entropy is at its maximum ($\log_2 6 \approx 2.58$ bits). If the die is weighted, you already know that one of the six outcomes has a higher probability of occurring and the uncertainty becomes less. If the die is weighted so much that the outcome is always six, there is zero uncertainty in the outcome and hence the information entropy is also zero.
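The dice example can be put into a few lines of code; entropy is measured here in bits:

```python
import math

def entropy(probabilities):
    # Shannon entropy in bits; terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

For a fair die this gives log2(6) ≈ 2.58 bits, for a loaded die a smaller value, and for a die that always lands on six exactly 0.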

The same applies to letters in a word (or words in a sentence): if you assume that every letter has the same probability of occurring, you have maximum uncertainty in predicting the next letter. But if you know that letters like E, A, O or I have a higher probability of occurring, you have less uncertainty.

Knowing this, we can say that complex data has a high entropy, patterns and trends have a lower entropy, and information you know for a fact to be true has zero entropy (and can therefore be excluded).

The idea behind Maximum Entropy is that you want a model which is as unbiased as possible; events which are not excluded by known constraints should be assigned as much uncertainty as possible, meaning the probability distribution should be as uniform as possible. You are looking for the maximum value of the Entropy. If this is not entirely clear, I recommend you to read through this example.

The mathematical formula for Entropy is given by

$H(p) = -\sum_{x} p(x) \log p(x)$

so the most likely probability distribution is the one that maximizes this entropy:

$p^{*} = \underset{p}{\operatorname{arg\,max}} \; H(p)$

It can be shown that this probability distribution has an exponential form and hence is given by:

$P(c \mid d) = \frac{1}{Z(d)} \exp\left( \sum_{i} \lambda_i f_i(c, d) \right),$

where $f_i(c, d)$ is a feature function, $\lambda_i$ is the weight parameter of the feature function and $Z(d)$ is a normalization factor given by

$Z(d) = \sum_{c} \exp\left( \sum_{i} \lambda_i f_i(c, d) \right).$

This feature function is an indicator function which expresses the expected value of the chosen statistics (words) in the training set. These feature functions can then be taken as constraints for the classification of the actual dataset (by eliminating the probability distributions which do not fit these constraints).

Usually, the weight parameters $\lambda_i$ are automatically determined by the Improved Iterative Scaling (IIS) algorithm. This is an iterative ascent procedure which converges to the global maximum (the log-likelihood is concave). The pseudocode for this algorithm is as follows:

- Initialize all weight parameters $\lambda_i$ to zero.
- Repeat until convergence:
- calculate the probability distribution $P(c \mid d)$ with the current weight parameters filled in.
- for each parameter $\lambda_i$ calculate the update $\delta_i$. This is the solution to:

$\sum_{d} P(c \mid d) \, f_i(c, d) \, e^{\delta_i f^{\#}(d)} = \tilde{E}[f_i]$

- update the value of the weight parameter: $\lambda_i \leftarrow \lambda_i + \delta_i$

In step 2b $f^{\#}(d)$ is given by the sum of all features in the training document $d$:

$f^{\#}(d) = \sum_{i} f_i(c, d)$

Maximum Entropy is a general statistical classification algorithm and can be used to estimate any probability distribution. For the specific case of text classification, we can limit its form a bit more by using word counts as features:

$f_{w, c'}(c, d) = \begin{cases} N(d, w) & \text{if } c = c' \\ 0 & \text{otherwise} \end{cases} \qquad (1)$

where $N(d, w)$ is the number of times word $w$ occurs in document $d$.
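Evaluating such a Maximum Entropy model with word-count features can be sketched as follows; the weights below are made up for illustration, and the IIS training step is not shown:

```python
import math

def maxent_prob(word_counts, weights, classes):
    # weights[(w, c)] is the lambda for the feature "word w occurs in class c";
    # the feature value is the word count N(d, w).
    scores = {}
    for c in classes:
        exponent = sum(weights.get((w, c), 0.0) * n
                       for w, n in word_counts.items())
        scores[c] = math.exp(exponent)
    Z = sum(scores.values())  # normalization factor Z(d)
    return {c: s / Z for c, s in scores.items()}
```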

Although it is not immediately obvious from the name, the SVM algorithm is a ‘simple’ linear classification/regression algorithm[6]. It tries to find a hyperplane which separates the data in two classes as optimally as possible.

Here ‘as optimally as possible’ means that as many points as possible of label A should be separated to one side of the hyperplane and as many points of label B to the other side, while maximizing the distance of each point to this hyperplane.

In the image above we can see this illustrated for the example of points plotted in 2D space. The set of points is labeled with two categories (illustrated here with black and white points) and SVM chooses the hyperplane that maximizes the margin between the two classes. This hyperplane is given by

$\vec{w} \cdot \vec{x} + b = 0, \qquad \vec{w} = \sum_{i} \alpha_i y_i \vec{x}_i$

where $\vec{x}_i$ is an n-dimensional input vector, $y_i \in \{-1, +1\}$ is its output value, $\vec{w}$ is the weight vector (the normal vector) defining the hyperplane and the $\alpha_i$ terms are the Lagrangian multipliers.

Once the hyperplane is constructed (the vector $\vec{w}$ is determined) with a training set, the class of any other input vector $\vec{x}$ can be determined:

if $\vec{w} \cdot \vec{x} + b \geq 0$ then it belongs to the positive class (the class we are interested in), otherwise it belongs to the negative class (all of the other classes).
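The decision rule can be sketched in its dual form as follows; the support vectors, labels, multipliers and bias in the usage below are made up, and the training step (the quadratic optimization) is not shown:

```python
def svm_predict(x, support_vectors, labels, alphas, b):
    # w . x + b = sum_i alpha_i * y_i * (x_i . x) + b
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))
    activation = sum(a * y * dot(sv, x)
                     for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if activation >= 0 else -1
```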

We can already see this leads to two interesting questions:

1. SVM only seems to work when the two classes are linearly separable. How can we deal with non-linear datasets? Here I feel the urge to point out that Naive Bayes and Maximum Entropy are linear classifiers as well, and most text classification datasets are (close to) linearly separable. Our training example of Amazon book reviews will be linearly separable as well. But an explanation of the SVM system would not be complete without an explanation of Kernel functions.

2. SVM only seems to be able to separate the dataset into two classes. How can we deal with datasets with more than two classes? For Sentiment Classification we have for example three classes (positive, neutral, negative) and for Topic Classification we can have even more than that.

**Kernel Functions:**
The classical SVM system requires that the dataset is linearly separable, i.e. there is a single hyperplane which can separate the two classes. For non-linear datasets a Kernel function is used to map the data to a higher-dimensional space in which it is linearly separable. This video gives a good illustration of such a mapping. In this higher-dimensional feature space, the classical SVM system can then be used to construct a hyperplane.

**Multiclass classification:**

The classical SVM system is a binary classifier, meaning that it can only separate the dataset into two classes. To deal with datasets with more than two classes usually the dataset is reduced to a binary class dataset with which the SVM can work. There are two approaches for decomposing a multiclass classification problem to a binary classification problem: the one-vs-all and one-vs-one approach.

In the one-vs-all approach one SVM Classifier is built per class. This Classifier takes that one class as the positive class and the rest of the classes as the negative class. A datapoint is then only classified within a specific class if it is accepted by that class’ Classifier and rejected by all other Classifiers. Although this can lead to accurate results (if the dataset is clustered), a lot of datapoints can also be left unclassified (if the dataset is not clustered).

In the one-vs-one approach, you build one SVM Classifier per chosen pair of classes. Since there are N(N-1)/2 possible pair combinations for a set of N classes, this means you have to construct N(N-1)/2 Classifiers. Datapoints are then categorized in the class for which they have received the most votes.

In our example, there are only three classes (positive, neutral, negative) so there is no real difference between these two approaches. In both approaches we have to construct two hyperplanes; positive vs the rest and negative vs the rest.

For the purpose of testing these Classification methods, I have collected >300.000 book reviews of 10 different books from Amazon.com. I will use a part of these book reviews for training purposes and a part as the test dataset. In the next few blogs I will try to automatically classify the sentiment of these reviews with the three models described above.

—————————————-

**[1] Machine Learning Literature:**
Foundations of Statistical Natural Language Processing by Manning and Schütze, Machine Learning: A Probabilistic Perspective by Kevin P. Murphy, and Foundations of Machine Learning by Mehryar Mohri.

**[2] Sentiment Analysis Literature:**

There is already a lot of information available and a lot of research done on Sentiment Analysis. To get a basic understanding and some background information, you can read Pang et al.’s 2002 article. In this article, the different Classifiers are explained and compared for sentiment analysis of movie reviews (IMDB). This research was very close to Turney’s 2002 research on Sentiment Analysis of movie reviews (see article). You can also read Bo Pang and Lillian Lee’s 2009 article, which is more general in nature (about the challenges of SA, the different ML techniques etc.).

There are also two relevant books: Web Data Mining and Sentiment Analysis, both by Bing Liu. And last but not least, the works of Socher are also quite interesting (see paper, website containing live demo); it has even inspired this kaggle competition.

**[3] Naive Bayes Literature:**

Machine Learning by Tom Mitchell, Stanford’s IR-book, Sebastian Raschka’s blog-post, Stanford’s online NLP course.

**[4] Maximum Entropy Literature:**

Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models (1997), A brief MaxEnt tutorial, another good MIT article.

**[6]SVM Literature:**

This youtube video gives a general idea about SVM. For a more technical explanation, this and this article can be read. Here you can find a good explanation as well as a list of the mostly used Kernel functions. one-vs-one and one-vs-all.

**[7] Sentiment Lexicons:**
I have selected a list of sentiment analysis lexicons; most of these were mentioned in the Natural Language Processing course, the rest are from stackoverflow.

- WordStat sentiment Dictionary; this is probably one of the largest lexicons freely available. It contains ~14.000 words (9164 negative and 4847 positive) and gives each word a binary classification (positive or negative) score.
- Bill McDonald’s 2014 Master Dictionary, containing ~85.000 words.
- Harvard Inquirer; contains ~11.780 words and has a more complex way of ‘scoring’ words: each word can be scored in 15+ categories; words can be Positiv-Negativ, Strong-Weak, Active-Passive, Pleasure-Pain, or indicate virtue and vice, etc.
- SentiWordNet; gives the words a positive or negative score between 0 and 1. It contains about 117.660 words, however only ~29.000 of these words have been scored (either positive or negative).
- MPQA; contains about ~8.200 words and binary classifies each word (as either positive or as negative). It also gives additional information such as whether a word is an adjective or a noun and whether a word is ‘strong subjective’ or ‘weak subjective’.
- Bing Liu’s opinion lexicon; contains 4.782 negative and 2.005 positive words.

**Including Emoticons in your dictionary:**

None of the dictionaries described above contain emoticons, which might be an essential part of text if you are analyzing social media. So how can we include emoticons in our subjectivity analysis? Everybody knows :-) is a positive and :-( is a negative emoticon, but what exactly do the less common emoticons mean, and how do they differ from :-/?

There are a few emoticon sentiment dictionaries on the web which you could use, e.g. the Emoticon Sentiment Lexicon created by Hogenboom et al., containing a list of 477 emoticons which are scored either 1 (positive), 0 (neutral) or -1 (negative). You could also make your own emoticon sentiment dictionary by giving the emoticons the same score as their meaning in words.
