Most tasks in Machine Learning can be reduced to classification tasks. For example, we have a medical dataset and we want to classify who has diabetes (positive class) and who doesn’t (negative class). We have a dataset from the financial world and want to know which customers will default on their credit (positive class) and which customers will not (negative class).

To do this, we can train a Classifier with a ‘training dataset’, and after the Classifier is trained (i.e. its model parameters have been determined) and can accurately classify the training set, we can use it to classify new data (the test set). If the training is done properly, the Classifier should predict the classes of the new data with a similar accuracy.

There are three popular Classifiers which use three different mathematical approaches to classify data. Previously we have looked at the first two of these: Logistic Regression and the Naive Bayes classifier. Logistic Regression uses a functional approach to classify data, and the Naive Bayes classifier uses a statistical (Bayesian) approach to classify data.

Logistic Regression assumes there is some function which forms a correct model of the dataset (i.e. it maps the input values correctly to the output values). This function is defined by its parameters θ. We can use the gradient descent method to find the optimum values of these parameters.

The Naive Bayes method is much simpler than that; we do not have to optimize a function, but can calculate the Bayesian (conditional) probabilities directly from the training dataset. This can be done quite fast (by creating a hash table containing the probability distributions of the features) but is generally less accurate.

Classification of data can also be done via a third way, by using a geometrical approach. The main idea is to find a line, or a plane, which can separate the two classes in their feature space. Classifiers which use a geometrical approach are the Perceptron and the SVM (Support Vector Machine) methods.

Below we will discuss the Perceptron classification algorithm. Although Support Vector Machines are used more often, I think a good understanding of the Perceptron algorithm is essential to understanding Support Vector Machines and Neural Networks.

The Perceptron is a lightweight algorithm, which can classify data quite fast. But it only works in the limited case of a linearly separable, binary dataset. If you have a dataset consisting of only two classes, the Perceptron classifier can be trained to find a linear hyperplane which separates the two. If the dataset is not linearly separable, the Perceptron will fail to find a separating hyperplane.

If the dataset consists of more than two classes we can use the standard approaches in multiclass classification (one-vs-all and one-vs-one) to transform the multiclass dataset into a series of binary datasets. For example, if we have a dataset which consists of three different classes:

- In **one-vs-all**, class I is considered as the positive class and the rest of the classes are considered as the negative class. We can then look for a separating hyperplane between class I and the rest of the dataset (classes II and III). This process is repeated for class II and then for class III, so we are trying to find three separating hyperplanes: between class I and the rest of the data, between class II and the rest of the data, etc. If the dataset consists of K classes, we end up with K separating hyperplanes.
- In **one-vs-one**, class I is considered as the positive class and each of the other classes is in turn considered as the negative class: first class II is the negative class, and then class III is the negative class. This process is then repeated with the other classes as the positive class. So if the dataset consists of K classes, we are looking for K(K-1)/2 separating hyperplanes.

Although one-vs-one can be a bit slower (there is one more iteration layer), it is not difficult to imagine that it will be more advantageous in situations where a (linear) separating hyperplane does not exist between one class and the rest of the data, while it does exist between one class and the other classes when they are considered individually. In the image below there is no separating line between the pear-class and the other two classes.
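To make the difference between the two reduction schemes concrete, they can be sketched in a few lines. This is only an illustration; the helper names `one_vs_all_problems` and `one_vs_one_problems` are made up for this sketch.

```python
from itertools import combinations

def one_vs_all_problems(classes):
    # one binary problem per class: that class against all of the others
    return [(c, [other for other in classes if other != c]) for c in classes]

def one_vs_one_problems(classes):
    # one binary problem per unordered pair of classes
    return list(combinations(classes, 2))

classes = ['I', 'II', 'III']
print(one_vs_all_problems(classes))  # K binary problems
print(one_vs_one_problems(classes))  # K*(K-1)/2 binary problems
```

For K = 3 both schemes happen to need three hyperplanes; with K = 4 classes one-vs-all needs 4 while one-vs-one needs 6.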

The algorithm for the Perceptron is similar to the algorithm of Support Vector Machines (SVM). Both algorithms find a (linear) hyperplane separating the two classes. The biggest difference is that the Perceptron algorithm will find **any** such hyperplane, while the SVM algorithm uses Lagrangian constraints to find the hyperplane which is optimized to have the **maximum margin**. That is, the distance between the hyperplane and the closest points of each class is maximized. This is illustrated in the figure below. While the Perceptron classifier is satisfied if any of these separating hyperplanes is found, an SVM classifier will find the green one, which has the maximum margin.

Another difference is that if the dataset is not linearly separable [2], the Perceptron will fail to find a separating hyperplane; the algorithm simply does not converge during its iteration cycle. The SVM, on the other hand, can still find a maximum-margin, minimum-cost decision boundary (a separating hyperplane which does not separate 100% of the data, but does so with some small error).

It is often said that the Perceptron is modeled after neurons in the brain. It has input values (which correspond with the features of the examples in the training set) and one output value. Each input value x_j is multiplied by a weight-factor w_j. If the sum of the products between the feature values and weight-factors is larger than zero, the perceptron is activated and ‘fires’ a signal (+1). Otherwise it is not activated.

The weighted sum of the input-values and the weight-values can mathematically be determined with the scalar-product w · x. To produce the behaviour of ‘firing’ a signal (+1) we can use the signum function sgn(x); it maps the output to +1 if the input is positive, and to -1 if the input is negative.

Thus, this Perceptron can mathematically be modeled by the function y = sgn(b + w · x). Here b is the bias, i.e. the default value when all feature values are zero.

The perceptron algorithm looks as follows:

```python
import numpy as np

def Perceptron(X, Y, b=0, max_iter=10):
    """ b is the bias, X is the input array with n rows
    (training examples) and m columns (features) """
    n, m = np.shape(X)
    # weight-vector
    w = np.zeros(m)
    for ii in range(max_iter):
        for jj in range(n):
            x_i = X[jj]
            y_i = Y[jj]
            a = b + np.dot(w, x_i)
            # update only if the example is misclassified
            if np.sign(y_i * a) != 1:
                w += y_i * x_i
                b += y_i
        print("iteration %s; new weight_vector: %s - new b: %s" % (ii, w, b))
    return w, b
```

As you can see, we set the bias-value b and all the elements in the weight-vector w to zero, and then iterate over all the examples in the training set.

Here, y_i is the actual output value of each training example. This is either +1 (if it belongs to the positive class) or -1 (if it does not belong to the positive class).

The activation function value a = b + w · x_i gives the predicted output value sgn(a). The product y_i · a will be positive if the prediction is correct and negative if the prediction is incorrect. Therefore, if the prediction made (with the weight vector from the previous training example) is incorrect, sgn(y_i · a) will be -1, and the weight vector is updated.

We can see that the Perceptron is an online algorithm; it iterates through the examples in the training set, and for each example it calculates the value of the activation function and, if the example is misclassified, updates the values of the weight-vector.

Now let’s examine the Perceptron algorithm for a linearly separable dataset which exists in 2 dimensions. For this we first have to create the dataset:

```python
import random
import numpy as np

def generate_data(no_points):
    X = np.zeros(shape=(no_points, 2))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = random.randint(1, 9) + 0.5
        X[ii][1] = random.randint(1, 9) + 0.5
        Y[ii] = 1 if X[ii][0] + X[ii][1] >= 13 else -1
    return X, Y
```

In the 2D case, the perceptron algorithm looks like:

```python
X, Y = generate_data(100)
Perceptron(X, Y, max_iter=20)
```

As we can see, the weight vector w and the bias b (which together determine the separating hyperplane) are updated when y_i · a is not positive.

The result is nicely illustrated in this gif:

GIF

We can extend this to a dataset in any number of dimensions, and as long as it is linearly separable, the Perceptron algorithm will converge.

One of the benefits of this Perceptron is that it is a very ‘lightweight’ algorithm; it is computationally very fast and easy to implement for datasets which are linearly separable. But if the dataset is not linearly separable, it will not converge.

For such datasets, the Perceptron can still be used if the correct kernel is applied. In practice this is never done, and Support Vector Machines are used whenever a Kernel needs to be applied. Some of these Kernels are:

- Linear: K(x, y) = xᵀy + c
- Polynomial: K(x, y) = (α xᵀy + c)^d, with degree d
- Laplacian RBF: K(x, y) = exp(−‖x − y‖ / σ)
- Gaussian RBF: K(x, y) = exp(−‖x − y‖² / (2σ²))

At this point it would be too much to also implement Kernel functions, but I hope to do so in a next post about SVM. For more information about Kernel functions, a comprehensive list of kernels, and their source code, please click here.

PS: The Python code for Logistic Regression can be forked/cloned from GitHub.


In the previous blog we have seen the theory and mathematics behind the Maximum Entropy and Logistic Regression Classifiers.

Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even sports.

Reading all of this, the theory[1] of Maximum Entropy Classification might look difficult. In my experience, the average Developer does not believe they can design a proper Maximum Entropy / Logistic Regression Classifier from scratch. I strongly disagree: not only is the mathematics behind it relatively simple, it can also be implemented with a few lines of code.

I have done this in the past month, so I thought I’d show you how to do it. The code is in Python but it should be relatively easy to translate it to other languages. Some of the examples contain self-generated data, while other examples contain real-world (medical) data. As was also done in the blog-posts about the bag-of-words model and the Naive Bayes Classifier, we will also try to automatically classify the sentiments of Amazon.com book reviews.

We have seen that the technique to perform Logistic Regression is similar to regular Regression Analysis: we try to estimate the parameter values iteratively with the Gradient Descent method. In the Gradient Descent method, the values of the parameters θ in the current iteration are calculated by updating the values of θ from the previous iteration with the gradient of the cost function J(θ).

Different cost functions exist, but most often the squared error between the hypothesis function h_θ(x) and y is used. In (regular) Regression this hypothesis function can be any function which you expect will provide a good model of the dataset. In Logistic Regression the hypothesis function is always given by the Logistic function:

h_θ(x) = 1 / (1 + exp(−θᵀx)).

This also means that for Logistic Regression, we no longer have to think about the form of the hypothesis function (while we still have to do that for regular regression). What we do have to think about, is which features will classify our current dataset optimally.

Taking all of this into account, this is how Gradient Descent works:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria:
  - With the current values of θ, calculate the gradient of the cost function (∇J(θ)).
  - Update the values of the parameters: θ := θ − α ∇J(θ).
  - Fill in these new values in the hypothesis function and calculate again the value of the cost function.
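The loop above can be sketched with a toy one-parameter example. This is only an illustration, not the regression code used below; the quadratic cost J(θ) = (θ − 3)², whose gradient is 2(θ − 3), is made up for this sketch.

```python
def gradient_descent_sketch(theta, gradient, alpha=0.1, tolerance=1e-8, max_iterations=10000):
    # keep updating theta with the (scaled) gradient until the step becomes tiny
    for _ in range(max_iterations):
        step = alpha * gradient(theta)
        theta = theta - step
        if abs(step) < tolerance:
            break
    return theta

# toy cost function J(theta) = (theta - 3)**2, whose gradient is 2*(theta - 3)
theta_opt = gradient_descent_sketch(theta=0.0, gradient=lambda t: 2.0 * (t - 3.0))
print(theta_opt)  # converges to ~3.0, the minimum of the cost function
```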

We have seen the self-generated example of students participating in a Machine Learning course, where their final grade depended on how many hours they had studied.

First, let’s generate the data:

```python
import random
import numpy as np

num_of_datapoints = 100
x_max = 10
initial_theta = [1, 0.07]

def func1(X, theta, add_noise=True):
    if add_noise:
        return theta[0]*X[0] + theta[1]*X[1]**2 + 0.25*X[1]*(random.random()-1)
    else:
        return theta[0]*X[0] + theta[1]*X[1]**2

def generate_data(num_of_datapoints, x_max, theta):
    X = np.zeros(shape=(num_of_datapoints, 2))
    Y = np.zeros(shape=num_of_datapoints)
    for ii in range(num_of_datapoints):
        X[ii][0] = 1
        X[ii][1] = (x_max*ii) / float(num_of_datapoints)
        Y[ii] = func1(X[ii], theta)
    return X, Y

X, Y = generate_data(num_of_datapoints, x_max, initial_theta)
```

We can see that we have generated 100 points uniformly distributed over the x-axis. For each of these x-points the y-value is determined by θ₀ + θ₁·x² minus some random value.

On the left we can see a scatterplot of the datapoints and on the right we can see the same data with a curve fitted through the points. This is the curve we are trying to estimate with the Gradient Descent method. This is done as follows:

```python
def gradient_descent(X, Y, theta, alpha, m, number_of_iterations):
    for ii in range(number_of_iterations):
        print("iteration %s : feature-value: %s" % (ii, theta))
        grad0 = (2.0/m)*sum([(func1(X[jj], theta, False) - Y[jj])*X[jj][0]**2 for jj in range(m)])
        grad1 = (2.0/m)*sum([(func1(X[jj], theta, False) - Y[jj])*X[jj][1]**4 for jj in range(m)])
        theta[0] = theta[0] - alpha * grad0
        theta[1] = theta[1] - alpha * grad1
    return theta

numIterations = 1000
alpha = 0.00000005
m, n = np.shape(X)
theta = np.ones(n)
theta = gradient_descent(X, Y, theta, alpha, m, numIterations)
```

We can see that we have to calculate the gradient of the cost function numIterations times, and update both feature values simultaneously! This indeed results in the curve we were looking for:

After this short example of Regression, let’s have a look at a few examples of Logistic Regression. We will start out with the self-generated example of students passing a course or not, and then we will look at real data from the medical world.

Let’s generate some data points. There are 300 students participating in the course Machine Learning, and whether a student passes (y = 1) or not (y = 0) depends on two variables:

- x_1: how many hours student i has studied for the exam.
- x_2: how many hours student i has slept the day before the exam.

```python
import random
import numpy as np

def func2(x_i):
    if x_i[1] <= 4:
        y = 0
    else:
        if x_i[1] + x_i[2] <= 13:
            y = 0
        else:
            y = 1
    return y

def generate_data2(no_points):
    X = np.zeros(shape=(no_points, 3))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = 1
        X[ii][1] = random.random()*9 + 0.5
        X[ii][2] = random.random()*9 + 0.5
        Y[ii] = func2(X[ii])
    return X, Y

X, Y = generate_data2(300)
```

In our example, the results are pretty binary; everyone who has studied 4 hours or less fails the course, as well as everyone whose studying time + sleeping time is less than or equal to 13 hours (x_1 + x_2 <= 13). The results look like this (the green dots indicate a pass and the red dots a fail):

For this example we will again apply Gradient Descent to determine the feature values which can classify the dataset optimally. This is done as follows:

```python
import math
import numpy as np

def to_binary(x_i):
    # this can probably also be done with round()
    return 1 if x_i > 0.5 else 0

def determine_correct_guesses(X, Y, theta, m):
    determined_Y = [np.dot(theta, X[ii]) for ii in range(m)]
    determined_Y_binary = [to_binary(elem) for elem in determined_Y]
    correct = 0
    for ii in range(m):
        if determined_Y_binary[ii] == Y[ii]:
            correct += 1
    return correct

def hypothesis(theta, x_i):
    z = np.dot(theta, x_i)
    sigmoid = 1.0 / (1.0 + math.exp(-1.0*z))
    return sigmoid

def gradient_descent(X, Y, theta, alpha, m, number_of_iterations=1000):
    for it in range(number_of_iterations):
        cost = (-1.0/m)*sum([Y[ii]*math.log(hypothesis(theta, X[ii])) +
                             (1 - Y[ii])*math.log(1 - hypothesis(theta, X[ii])) for ii in range(m)])
        grad = (-1.0/m)*sum([X[ii]*(Y[ii] - hypothesis(theta, X[ii])) for ii in range(m)])
        theta = theta - alpha * grad
        correct = determine_correct_guesses(X, Y, theta, m)
        print("iteration %s : cost %s : correct_guesses %s / %s" % (it, cost, correct, len(Y)))
    return theta

numIterations = 3000
alpha = 0.6
m, n = np.shape(X)
theta = np.ones(n)
theta = gradient_descent(X, Y, theta, alpha, m, numIterations)
```

As we can see, the code of the Gradient Descent method looks very similar to the code in the case of regular regression. The main difference is that the hypothesis function is now equal to the sigmoid function.

Using this algorithm for gradient descent, we can correctly classify 297 out of 300 datapoints (wrongly classified points are indicated with a cross).

Now that the concept of Logistic Regression is a bit more clear, let’s classify real-world data!

The University of Massachusetts provides some datasets which are ideal to perform Logistic Regression on. They are small (so my small laptop can also handle them in a reasonable amount of time) and there are various datasets with different (numbers of) features.

The dataset “myopia.dat” contains the medical data of 618 subjects, and has 15 features describing the characteristics of each subject. We can read this data in as follows:

```python
import numpy as np

no_points = 618  # number of subjects in myopia.dat
datafile = 'myopia.dat'
X = np.zeros(shape=(no_points, 16))
Y = np.zeros(shape=no_points)
rownum = 0
with open(datafile, 'r') as f:
    for line in f:
        line = line.split()
        Y[rownum] = int(line[2])
        X[rownum][0] = 1
        X[rownum][1] = int(line[3]) / 10.0
        X[rownum][2] = int(line[4])
        X[rownum][3] = float(line[5])
        X[rownum][4] = float(line[6]) / 10.0
        X[rownum][5] = float(line[7]) / 10.0
        X[rownum][6] = float(line[8]) / 10.0
        X[rownum][7] = float(line[9]) / 10.0
        X[rownum][8] = int(line[10]) / 10.0
        X[rownum][9] = int(line[11]) / 10.0
        X[rownum][10] = int(line[12]) / 10.0
        X[rownum][11] = int(line[13]) / 10.0
        X[rownum][12] = int(line[14]) / 10.0
        X[rownum][13] = int(line[15]) / 10.0
        X[rownum][14] = int(line[16])
        X[rownum][15] = int(line[17])
        rownum += 1
```

This results in a Y-vector with **618** elements and an X-matrix which is **618** by **16** (the 15 features plus a bias column).

While reading the data in, all values are normalized to a value around 1. This is done in order to speed up the calculations and to ensure that we never take the logarithm of a zero value in the cost function (log functions don’t really like that).
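The hand-written divisions by 10 above can also be sketched generically. This hypothetical helper (`normalize_columns` is made up for this sketch) rescales every feature column by its maximum so that all values end up in [0, 1]; the bias column of ones is unaffected, since its maximum is already 1.

```python
import numpy as np

def normalize_columns(X):
    # divide each column by its maximum value, so every feature lands in [0, 1]
    X = X.astype(float)
    col_max = X.max(axis=0)
    col_max[col_max == 0] = 1.0  # avoid dividing an all-zero column by zero
    return X / col_max

X_raw = np.array([[1, 40, 500],
                  [1, 80, 250]])
print(normalize_columns(X_raw))
# the bias column stays 1; the other columns are scaled to at most 1
```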

Once the data has been read into the X and Y matrices, logistic regression can be applied:

```python
numIterations = 3000
alpha = 0.5
m, n = np.shape(X)
theta = np.ones(n)
theta = gradient_descent2(X, Y, theta, alpha, m, numIterations)
```

This simple algorithm for logistic regression correctly classifies ~550 of the 618 subjects, giving it an accuracy of ~90%.

Logistic Regression using Gradient Descent can also be used for NLP / Text Analysis tasks. There is a wide variety of tasks which are done in the field of NLP: authorship attribution, spam filtering, topic classification and sentiment analysis.

For a task like sentiment analysis we can follow the same procedure. We will have as the input a large collection of labelled text documents. These will be used to train the Logistic Regression classifier. The most important task then, is to select the proper features which will lead to the best sentiment classification. Almost everything in the text document can be used as a feature[2]; you are only limited by your creativity.

For sentiment analysis usually the occurrence of (specific) words is used, or the relative occurrence of words (the word occurrences divided by the total number of words).
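The relative occurrence feature can be sketched in a few lines (the helper name `relative_word_occurrences` is made up for this sketch):

```python
from collections import Counter

def relative_word_occurrences(words):
    # number of occurrences of each word, divided by the total number of words
    counts = Counter(words)
    total = float(len(words))
    return {word: count / total for word, count in counts.items()}

features = relative_word_occurrences(['a', 'brilliant', 'book', 'a'])
print(features)  # {'a': 0.5, 'brilliant': 0.25, 'book': 0.25}
```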

As we have done before, we have to fill in the X and Y matrices, which serve as the input for the gradient descent algorithm, and this algorithm will give us the resulting feature vector θ. With this vector we can determine the class of other text documents.

As always, Y is a vector with n elements (where n is the number of text-documents). The X matrix is an n by m matrix; here m is the total number of relevant words in all of the text-documents. I will illustrate how to build up this matrix with three book reviews:

- **pos:** “This is such a beautiful edition of Harry Potter and the Sorcerer’s Stone. I’m so glad I bought it as a keep sake. The illustrations are just stunning.” (28 words in total)
- **pos:** “A brilliant book that helps you to open up your mind as wide as the sky” (16 words in total)
- **neg:** “This publication is virtually unreadable. It doesn’t do this classic justice. Multiple typos, no illustrations, and the most wonky footnotes conceivable. Spend a dollar more and get a decent edition.” (30 words in total)

These three reviews will result in the following X-matrix.

As you can see, each row of the matrix contains all of the data per review and each column contains the data per word. If a review does not contain a specific word, the corresponding column will contain a zero. Such an X-matrix containing all the data from the training set can be built up in the following manner:

Assuming that we have a list containing the data from the *training set*:

```python
[([u'downloaded', u'the', u'book', u'to', u'my', ..., u'art'], 'neg'),
 ([u'this', u'novel', u'if', u'bunch', u'of', ..., u'ladies'], 'neg'),
 ([u'forget', u'reading', u'the', u'book', u'and', ..., u'hilarious!'], 'neg'),
 ...
]
```

From this *training_set*, we are going to generate a *words_vector*. This *words_vector* is used to keep track of which column a specific word belongs to. After this *words_vector* has been generated, the X matrix and Y vector can be filled in.

```python
import numpy as np

def generate_words_vector(training_set):
    words_vector = []
    for review in training_set:
        for word in review[0]:
            if word not in words_vector:
                words_vector.append(word)
    return words_vector

def generate_Y_vector(training_set, training_class):
    no_reviews = len(training_set)
    Y = np.zeros(shape=no_reviews)
    for ii in range(no_reviews):
        review_class = training_set[ii][1]
        Y[ii] = 1 if review_class == training_class else 0
    return Y

def generate_X_matrix(training_set, words_vector):
    no_reviews = len(training_set)
    no_words = len(words_vector)
    X = np.zeros(shape=(no_reviews, no_words + 1))
    for ii in range(no_reviews):
        X[ii][0] = 1
        review_text = training_set[ii][0]
        total_words_in_review = len(review_text)
        for word in set(review_text):
            word_occurences = review_text.count(word)
            word_index = words_vector.index(word) + 1
            X[ii][word_index] = word_occurences / float(total_words_in_review)
    return X

words_vector = generate_words_vector(training_set)
X = generate_X_matrix(training_set, words_vector)
Y_neg = generate_Y_vector(training_set, 'neg')
```

As we have done before, the gradient descent method can be applied to derive the feature vector from the and matrices:

```python
numIterations = 100
alpha = 0.55
m, n = np.shape(X)
theta = np.ones(n)
theta_neg = gradient_descent2(X, Y_neg, theta, alpha, m, numIterations)
```

What should we do if a specific review tests positive (Y = 1) for more than one class? A review could result in Y = 1 for both the *neu* class and the *neg* class. In that case we pick the class with the highest score. This is called multinomial logistic regression.
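Picking the class with the highest score can be sketched as follows. The weight vectors below are hypothetical, standing in for the theta_neg, theta_neu and theta_pos vectors produced by gradient descent for each one-vs-rest problem.

```python
import math
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, thetas_per_class):
    # evaluate the hypothesis of every one-vs-rest classifier and pick the winner
    scores = {label: sigmoid(np.dot(theta, x)) for label, theta in thetas_per_class.items()}
    return max(scores, key=scores.get)

# hypothetical trained weight vectors (bias weight first) for illustration
thetas = {'neg': np.array([0.2, -1.0]),
          'neu': np.array([0.1, 0.1]),
          'pos': np.array([0.3, 2.0])}
print(classify(np.array([1.0, 0.8]), thetas))  # 'pos' wins for this example
```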

So far, we have seen how to implement a Logistic Regression Classifier in its most basic form. It is true that building such a classifier from scratch, is great for learning purposes. It is also true that no one will get to the point of using deeper / more advanced Machine Learning skills without learning the basics first.

For real-world applications however, often the best solution is not to re-invent the wheel but to re-use tools which are already available; tools which have been tested thoroughly and have been used by plenty of smart programmers before you. One such tool is Python’s NLTK library.

NLTK is Python’s Natural Language Toolkit and it can be used for a wide variety of Text Processing and Analytics jobs like tokenization, part-of-speech tagging and classification. It is easy to use and even includes a lot of text corpora, which can be used to train your model if you have no training set available.

Let us also have a look at how to perform sentiment analysis and text classification with NLTK. As always, we will use a training set to train NLTK’s Maximum Entropy Classifier and a test set to verify the results. Our training set has the following format:

```python
training_set = [
    ([u'this', u'novel', u'if', u'bunch', u'of', u'childish', ..., u'ladies'], 'neg'),
    ([u'where', u'to', u'begin', u'jeez', u'gasping', u'blushing', ..., u'fail????'], 'neg'),
    ...
]
```

As you can see, the training set consists of a list of tuples of two elements: the first element is a list of the words in the text of the document, and the second element is the class-label of this specific review (‘neg’, ‘neu’ or ‘pos’). Unfortunately NLTK’s Classifiers only accept the text in a hashable format (dictionaries for example), and that is why we need to convert this list of words into a dictionary of words.

```python
def list_to_dict(words_list):
    return dict([(word, True) for word in words_list])

training_set_formatted = [(list_to_dict(element[0]), element[1]) for element in training_set]
```


Once the training set has been converted into the proper format, it can be fed into the train method of the MaxEnt Classifier:

```python
import nltk

numIterations = 100
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(training_set_formatted, algorithm, max_iter=numIterations)
classifier.show_most_informative_features(10)
```

Once the training of the MaxentClassifier is done, it can be used to classify the reviews in the test set:

```python
for review in test_set_formatted:
    label = review[1]
    text = review[0]
    determined_label = classifier.classify(text)
    print(determined_label, label)
```

So far we have seen the theory behind the Naive Bayes Classifier and how to implement it (in the context of Text Classification) and in the previous and this blog-post we have seen the theory and implementation of Logistic Regression Classifiers. Although this is done at a basic level, it should give some understanding of the Logistic Regression method (I hope at a level where you can apply it and classify data yourself). There are however still many (advanced) topics which have not been discussed here:

- Which hill-climbing / gradient descent algorithm to use; IIS (Improved Iterative Scaling), GIS (Generalized Iterative Scaling), BFGS, L-BFGS or Coordinate Descent
- Encoding of the feature vector and the use of dummy variables
- Logistic Regression is an inherently sequential algorithm; although it is quite fast, you might need a parallelization strategy if you start using larger datasets.

If you see any errors please do not hesitate to contact me. If you have enjoyed reading, maybe even learned something, do not forget to subscribe to this blog and share it!

—

[1] See the paper of Nigam et al. on Maximum Entropy and the paper of Bo Pang et al. on Sentiment Analysis using Maximum Entropy. Also see Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models (1997), A brief MaxEnt tutorial, and another good MIT article.

[2] See for example Chapter 7 of Speech and Language Processing by (Jurafsky & Martin): For the task of period disambiguation a feature could be whether or not a period is followed by a capital letter unless the previous word is *St.*

One of the most important tasks in Machine Learning is the classification task (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in a test set (a dataset of which the entries have not yet been labelled) with the model which was constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, or classifying houses in the real-estate sector. Another field in which classification is big is Natural Language Processing (NLP). The goal of this field of science is to make machines (computers) understand written (human) language. You could think of text categorization, sentiment analysis, spam detection and topic categorization.

For classification tasks there are three widely used algorithms: Naive Bayes, Logistic Regression / Maximum Entropy and Support Vector Machines. We have already seen how Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it makes the assumption of conditional independence of its features. This is a simplification which makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to a lower accuracy. A direct improvement on the N.B. classifier is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly. This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea, which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let’s say we have a dataset containing n datapoints: X = (x_1, x_2, ..., x_n). For each of these (input) datapoints there is a corresponding (output) y-value. Here, the x-datapoints are called the independent variables and y the dependent variable; the value of y depends on the value of x, while the value of x may be freely chosen without any restriction imposed on it by any other variable.

The goal of Regression analysis is to find a function which can best describe the correlation between x and y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h_θ(x).

If we can find such a function, we can say we have successfully built a Regression model. If the input-data lives in a 2D-space, this boils down to finding a curve which fits through the data points. In the 3D case we have to find a plane and in higher dimensions a hyperplane.

To give an example, let’s say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset which contains the final grade y_i of n students. Dataset X contains the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time. The variable x_i therefore indicates how many hours student i has studied. The first thing we would do is visualize this data:

If the result looks like the figure on the left, then we are out of luck: it looks like the points are distributed randomly and there is no correlation between x and y at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.

This function could for example be:

h_θ(x) = θ_0 + θ_1 · x

or

h_θ(x) = θ_0 + θ_1 · x²

where θ_0, θ_1 are the parameters of our model.

In evaluating the results from the previous section, we may find the results unsatisfying; the function does not correlate with the datapoints strongly enough. Our initial assumption is probably not complete: taking only the studying time into account is not enough. The final grade does not only depend on the studying time, but also on how much the students have slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by X = (x_i1, x_i2). In this dataset x_i1 indicates how many hours student i has studied and x_i2 indicates how many hours he has slept.

This is an example of multivariate regression. The function has to include both variables. For example:

h_θ(x) = θ_0 + θ_1 · x_1 + θ_2 · x_2

or

h_θ(x) = θ_0 + θ_1 · x_1² + θ_2 · x_2².

All of the above examples are examples of linear regression. We have seen that in some cases y depends on a linear form of x, but it can also depend on some power of x, or on the log or any other form of x. However, in all cases the parameters θ entered the function linearly.

So, what makes linear regression linear is not that h_θ(x) depends in a linear way on x, but that it depends in a linear way on θ: h_θ(x) needs to be linear with respect to the model parameters θ. Mathematically speaking, it needs to satisfy the superposition principle. Examples of nonlinear regression would be:

h_θ(x) = θ_0 + x^θ_1

or

h_θ(x) = θ_0·sin(θ_1·x)
The reason why the distinction is made between linear and nonlinear regression is that nonlinear regression problems are more difficult to solve, and therefore more computationally intensive algorithms are needed.

Linear regression models can be written as a linear system of equations, which can be solved by finding the closed-form solution with Linear Algebra. See these statistics notes for more on solving linear models with linear algebra.

As discussed before, such a closed-form solution can only be found for linear regression problems. However, even when the problem is linear in nature, we need to take into account that calculating the inverse of an n-by-n matrix has a time complexity of O(n³). This means that for large datasets, finding the closed-form solution will take more time than solving the problem iteratively (with the gradient descent method), as is done for nonlinear problems. So solving it iteratively is usually preferred for larger datasets, even if the problem is linear.
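As a quick illustration of the closed-form route, the normal equation θ = (XᵀX)⁻¹Xᵀy can be solved directly with NumPy. The tiny dataset below is made up for illustration:

```python
import numpy as np

# Normal-equation sketch for linear regression: theta = (X^T X)^(-1) X^T y.
# Invented toy data: y = 2*x, with a column of ones for the intercept.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# Solving the linear system is preferred over explicitly inverting X^T X.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # close to [0., 2.]
```

For a handful of points this is instant, but the cubic cost of the solve is exactly why gradient descent wins on large problems.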

The Gradient Descent method is a general optimization technique in which we try to find the values of the parameters θ with an iterative approach.

First, we construct a cost function (also known as a loss function or error function) which gives the difference between the values of h_θ(x) (the values you expect y to have with the determined values of θ) and the actual values of y. The better your estimation of θ is, the better the values of h_θ(x) will approach the values of y.

Usually, the cost function is expressed as the squared error of this difference:

J(θ) = 1/(2m) · Σ_i ( h_θ(x^(i)) − y^(i) )²

At each iteration we choose new values for the parameters θ, and move towards the ‘true’ values of these parameters, i.e. the values which make the cost function as small as possible. The direction in which we have to move is the negative gradient direction:

θ_j := θ_j − α·∂J(θ)/∂θ_j

The reason for this is that a function’s value decreases fastest if we move in the direction of the negative gradient (the directional derivative is maximal in the direction of the gradient).

Taking all this into account, this is how gradient descent works:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your convergence criteria:
  - With the current values of θ, calculate the gradient of the cost function J(θ).
  - Update the values of the parameters: θ_j := θ_j − α·∂J(θ)/∂θ_j.
  - Fill in these new values in the hypothesis function and calculate again the value of the cost function J(θ).

Just as important as the initial guess of the parameters is the value you choose for the learning rate α. This learning rate determines how fast you move along the slope of the gradient. If the selected value of the learning rate is too small, it will take too many iterations to reach your convergence criteria. If it is too large, you might overshoot the minimum and fail to converge.
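The steps above can be sketched in a few lines. The dataset, initial guess and learning rate below are illustrative choices, not values from this post:

```python
import numpy as np

# Gradient descent for a simple linear hypothesis h(x) = theta0 + theta1*x.
# The 'true' parameters generating the toy data are theta0=1, theta1=2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
theta = np.zeros(2)   # initial guess
alpha = 0.05          # learning rate

for _ in range(5000):
    h = theta[0] + theta[1] * x               # hypothesis values
    grad = np.array([(h - y).mean(),          # dJ/dtheta0
                     ((h - y) * x).mean()])   # dJ/dtheta1
    theta -= alpha * grad                     # step along the negative gradient

print(theta)  # approaches [1., 2.]
```

With a much larger alpha (say 0.5) this same loop diverges, which is the overshooting behaviour described above.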

Logistic Regression is similar to (linear) regression, but adapted for the purpose of classification. The difference is small; for Logistic Regression we also apply gradient descent iteratively to estimate the values of the parameters θ. And again, during the iteration, the values are estimated by taking the gradient of a cost function, which measures the error between the hypothesis function and y. The major difference, however, is the form of the hypothesis function.

When you want to classify something, there are a limited number of classes it can belong to. And for each of these possible classes there can only be two states for y: either x belongs to the specified class and y = 1, or it does not belong to the class and y = 0. Even though the output values are binary, the independent variables are still continuous. So we need a function which takes a large set of continuous variables as input and produces a binary output. This function, the hypothesis function, has the following form:

h_θ(x) = 1 / ( 1 + exp(−θ^T·x) )

This function is also known as the logistic function, which is part of the sigmoid family of functions. These functions are widely used in the natural sciences because they provide the simplest model for population growth. However, the reason why the logistic function is used for classification in Machine Learning is its ‘S-shape’.

As you can see, this function is bounded in the y-direction by 0 and 1. If the variable z = θ^T·x is very negative, the output will go to zero (x does not belong to the class). If z is very positive, the output will go to one and x does belong to the class. (In the limit, such a function behaves like an indicator function.)

The question then is, what will happen to input values which are neither very positive nor very negative, but somewhere ‘in the middle’. We have to define a decision boundary which separates the positive from the negative class. Usually this decision boundary is chosen at the middle of the logistic function, namely at z = 0, where the output value is 0.5.

y = 1 if h_θ(x) ≥ 0.5, and y = 0 if h_θ(x) < 0.5   (1)

For those who are wondering where the θ we were talking about before enters the picture: as we can see in the formula of the logistic function, z = θ^T·x. Meaning, the parameter θ (also referred to as the feature) maps the input variable x to a position on the z-axis. With this z-value, we can use the logistic function to calculate the y-value. If this y-value is greater than 0.5 we assume x does belong to this class, and vice versa.
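A minimal sketch of the logistic function and the 0.5 decision boundary described above (the values −6 and 6 are just arbitrary ‘very negative’ and ‘very positive’ choices of z):

```python
import math

def logistic(z):
    # The logistic (sigmoid) function: bounded by 0 and 1 in the y-direction.
    return 1.0 / (1.0 + math.exp(-z))

def classify(z, boundary=0.5):
    # Decision boundary at the middle of the S-curve.
    return 1 if logistic(z) >= boundary else 0

print(logistic(0))   # 0.5: exactly on the decision boundary
print(classify(-6))  # 0: very negative z -> not in the class
print(classify(6))   # 1: very positive z -> in the class
```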

So the feature θ should be chosen such that it predicts class membership correctly. It is therefore essential to know which features are useful for the classification task. Once the appropriate features are selected, gradient descent can be used to find their optimal values.

How can we do gradient descent with this logistic function? Except for the hypothesis function having a different form, the gradient descent method is exactly the same. We again have a cost function, of which we iteratively take the gradient w.r.t. the feature θ, and update the feature values at each iteration.

This cost function is given by

J(θ) = −1/m · Σ_i [ y^(i)·log( h_θ(x^(i)) ) + (1 − y^(i))·log( 1 − h_θ(x^(i)) ) ]   (2)

We know that:

h_θ(x) = 1 / ( 1 + exp(−θ^T·x) )

and

∂h_θ(x)/∂θ_j = h_θ(x)·( 1 − h_θ(x) )·x_j   (3)

Plugging these two equations back into the cost function gives us:

J(θ) = −1/m · Σ_i [ y^(i)·θ^T·x^(i) − log( 1 + exp(θ^T·x^(i)) ) ]   (4)

The gradient of the cost function with respect to θ_j is given by

∂J(θ)/∂θ_j = 1/m · Σ_i ( h_θ(x^(i)) − y^(i) )·x_j^(i)   (5)

So the gradient of the seemingly difficult cost function turns out to be a much simpler equation. And with this simple equation, gradient descent for Logistic Regression is performed in exactly the same way:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your convergence criteria:
  - With the current values of θ, calculate the gradient of the cost function J(θ).
  - Update the values of the parameters: θ_j := θ_j − α·∂J(θ)/∂θ_j.
  - Fill in these new values in the hypothesis function and calculate again the value of the cost function J(θ).
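Putting the sigmoid hypothesis and the gradient of equation (5) together, a sketch of logistic regression by gradient descent could look like this. The tiny ‘hours studied → passed’ dataset, the learning rate and the iteration count are all invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: first column is the intercept, second is hours studied.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 6.0],
              [1.0, 8.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])   # 0 = failed, 1 = passed

theta = np.zeros(2)   # initial guess
alpha = 0.1           # learning rate

for _ in range(10000):
    h = sigmoid(X @ theta)
    theta -= alpha * X.T @ (h - y) / len(y)   # the gradient from equation (5)

print(sigmoid(X @ theta))  # probabilities pushed towards [0, 0, 1, 1]
```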

In the previous section we have seen how we can use Gradient Descent to estimate the feature values θ, which can then be used to determine the class with the logistic function. As stated in the introduction, this can be used for a wide variety of classification tasks. The only thing that differs per classification task is the form the features take.

Here we will continue with the example of Text Classification. Let’s assume we are doing Sentiment Analysis and want to know whether a specific review should be classified as positive, neutral or negative.

The first thing we need to know is how many and what types of features we need to include.

For NLP we will need a large number of features; often as large as the number of words present in the training set. We could reduce the number of features by excluding stopwords, or by only considering n-gram features.

For example, the 5-gram ‘kept me reading and reading’ is much less likely to occur in a review document than the unigram ‘reading’, but if it occurs it is much more indicative of the (positive) class than ‘reading’ alone. Since we only need to consider n-grams which are actually present in the training set, there will be far fewer such features than the theoretical number of word combinations suggests.

The second thing we need to know is the actual values of these features. These values are learned by initializing all features to zero and applying the gradient descent method to the labeled examples in the training set. Once we know the values of the features, we can compute the probability for each class and choose the class with the maximum probability, using the logistic function given above.

In this post we have only discussed the theory of Maximum Entropy and Logistic Regression. Usually such discussions are better understood with examples and actual code. I will save that for the next blog post.

If you have enjoyed reading this post or maybe even learned something from it, subscribe to this blog so you can receive a notification the next time something is posted.

Miles Osborne, Using Maximum Entropy for Sentence Extraction (2002)

Jurafsky and Martin, Speech and Language Processing; Chapter 7

Nigam et al., Using Maximum Entropy for Text Classification


With the bag-of-words model we check which word of the text-document appears in a positive-words-list or a negative-words-list. If the word appears in a positive-words-list the total score of the text is updated with +1 and vice versa. If at the end the total score is positive, the text is classified as positive and if it is negative, the text is classified as negative. Simple enough!

With the Naive Bayes model, we do not take only a small set of positive and negative words into account, but all words the NB Classifier was trained with, i.e. all words present in the training set. If a word has not appeared in the training set, we have no data available and we apply Laplacian smoothing (using 1 instead of the conditional probability of the word).

The probability that a document belongs to a class is given by the class probability multiplied by the product of the conditional probabilities of each word for that class:

P(class|document) = P(class) · Π_{i=1..m} ( n_i / N )

Here n_i is the number of occurrences of word i in the given class, N is the total number of words in that class, and m is the number of words in the document we are currently classifying.

The class probability P(class) does not change (unless the training set is expanded), so it can be placed outside of the product.

In theory we want a training set as large as possible, since that will increase the accuracy. In practice, however, this results in very large numbers for n_i and N; this was the case, for example, for our training set with 5000 reviews.

Taking the m-th power of such a large number will definitely result in computational problems, so we should normalize it. We can divide it by a number chosen so that the result becomes close to 1. In our case this number is 100.000, and the normalization results in a factor of roughly 4.59.

However, if the number of words in the document is large, this can still lead to computational problems:

```
>>> 4.59**500
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: (34, 'Result too large')
```

In Python there are a few modules which can handle large numbers (like Decimal), but a better solution is to take the logarithm. This will not affect the outcome of the classification process; since the logarithm is monotonically increasing, if a document has the highest probability for a specific class, the logarithm of that probability will also be the highest for that class.
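The effect can be sketched in a couple of lines; the factor 4.59 is the normalized per-word value from the overflow example above, and 500 an illustrative document length:

```python
import math

ratio = 4.59     # normalized per-word factor from the example above
n_words = 500    # number of words in a (long) document

# ratio ** n_words overflows a Python float (max ~1.8e308), but the
# logarithm of that product is just a sum of logarithms:
log_score = n_words * math.log(ratio)
print(log_score)  # about 762, a perfectly manageable number
```

Because the logarithm is monotonically increasing, comparing these log-scores between classes picks the same winner as comparing the raw products would.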

This results in:

With this information it is easy to implement a Naive Bayes Classifier algorithm.

Our training set is a list of tuples, where each tuple contains two elements: the tokenized text and the label.

```python
training_set = [
    ([u'this', u'is', u'the', u'1st', u'book', u"i've", u'read', (...), u'brain'], 'neg'),
    ([u'it', u'is', u'sometimes', u'hard', u'for', (...), u'omg!', u'lots', u'of', u'twists'], 'pos'),
    ([u'know', u'everyone', u'seemed', u'to', u'like', (...), u'movies', u'ugg!'], 'neg'),
    ...
]
```

The training of the Naive Bayes Classifier is done by iterating through all of the documents in the training set, keeping track of the number of occurrences of each word per class. Of course, we can exclude stopwords and include n-gram features if these options are chosen during the training process (see the previous post). Many different data containers can be used to keep track of the number of occurrences. Within Python, a pandas DataFrame is very useful for this purpose. The advantage of a DataFrame is that it is easy to save (.to_csv) once the training is done; at a later time it can be loaded again with .read_csv. The code to train a NB Classifier looks like:

```python
import pandas as pd

# Note: the Python 2 'sets' module and DataFrame.ix are deprecated;
# the built-in set() and .loc do the same job.
df = pd.DataFrame(0, columns=['neg', 'neu', 'pos'], index=[])
words_set = set()
for item in training_set:
    label = item[1]
    text = item[0]
    for word in text:
        if word not in words_set:
            words_set.add(word)
            df.loc[word] = [0, 0, 0]
        df.loc[word, label] += 1
```

People who are already familiar with pandas DataFrame will know that it is going to look something like this:

```
             neg   neu    pos
kept         114   122    514
reading      315   312   1384
through      166   188    649
drawn-out      1     0      0
detailed       4     9     27
story        386   571   2544
of          1995  2432  10475
seeing        13    25     94
how          240   271   1303
justice       24    26     88
ending       504   672   1891
ridiculous    61    15     19
because      329   317   1149
...          ...   ...    ...
fudgsticle!    0     0      1
ooohhhh        0     0      1
signing        0     0      1
flynn!!!       0     0      1
wow!two        0     0      1
allllllll      0     0      1
chose?         0     0      1

[22703 rows x 3 columns]
```

The DataFrame containing the number of occurrences of each word from the training set is actually all the training our model needs. With this DataFrame `df`, the algorithm for Naive Bayes classification looks like:

```python
import operator  # for sorting the dictionary
import math

processed_words = list(df.index.values)
class_probabilities = {'neg': 0.1566, 'neu': 0.15, 'pos': 0.6934}
labels = list(class_probabilities.keys())
words_per_class = {}
for label in labels:
    words_per_class[label] = df[label].sum()

def nb_classify(document):
    no_words_in_doc = len(document)
    current_class_prob = {}
    for label in labels:
        prob = math.log(class_probabilities[label], 2) \
               - no_words_in_doc * math.log(words_per_class[label], 2)
        for word in document:
            if word in processed_words:
                occurrence = df.loc[word, label]
                if occurrence > 0:
                    prob += math.log(occurrence, 2)
                else:
                    # Laplacian / add-1 smoothing. The log of 1 is zero,
                    # so we are adding zero.
                    prob += math.log(1, 2)
            else:
                prob += math.log(1, 2)
        current_class_prob[label] = prob
    # sort the dictionary by its values, so we can take the key
    # with the maximum value
    sorted_labels = sorted(current_class_prob.items(), key=operator.itemgetter(1))
    most_probable_class = sorted_labels[-1][0]
    return most_probable_class

for item in test_set:
    classification = nb_classify(item)
```

As we can see, the Naive Bayes Classifier is easy to implement; the actual classification algorithm is contained in the `nb_classify` function.

The most probable class is given by the key with the maximum value in the dictionary `current_class_prob`.

In the next blog we will look at the results of this naively implemented algorithm for the Naive Bayes Classifier and see how it performs under various conditions; we will see the influence of varying training set sizes and whether the use of n-gram features will improve the accuracy of the classifier.

- To keep track of the number of occurrences of each word, we tokenize the text and add each word to a single list. Then, by using a Counter element, we can keep track of the number of occurrences.
- We can make a DataFrame containing the class probabilities of each word by adding each word to the DataFrame as we encounter it and dividing by the total number of occurrences afterwards.
- Sorting this DataFrame by the values in the columns of the Positive or Negative class and taking the top 100 / 200 words, we can construct a list containing negative or positive words.
- The words in this constructed Sentiment Lexicon can be used to give a value to the subjectivity of the reviews in the test set.

Using the steps described above, we were able to determine the subjectivity of reviews in the test set with an accuracy (F-score) of ~60%.

In this blog we will look into the effectiveness of cross-book sentiment lexicons; how well does a sentiment lexicon made from book A perform at sentiment analysis of book B?

We will also see how we can improve the bag-of-words technique by including n-gram features in the bag-of-words.

In the previous post, we have seen that the sentiment of reviews in the test-set of ‘Gone Girl’ could be predicted with a 60% accuracy. How well does the sentiment lexicon derived from the training set of book A perform at deducing the sentiment of reviews in the test set of book B?

In the table above, we can see that the most effective Sentiment Lexicons are created from books with a large number of Positive and Negative reviews. In the previous post we saw that Fifty Shades of Grey has a large number of negative reviews. This makes it a good book to construct an effective Sentiment Lexicon from.

Other books have a lot of positive reviews but only a few negative ones. The Sentiment Lexicon constructed from these books has a high accuracy in determining the sentiment of positive reviews, but a low accuracy for negative reviews… bringing the average down.

In the previous blog post we constructed a bag-of-words model with unigram features, meaning that we split the entire text into single words and count the occurrences of each word. Such a model does not take the position of each word in the sentence, its context, or the grammar into account. That is why the bag-of-words model has a low accuracy in detecting the sentiment of a text document.

For example, with the bag-of-words model the following two sentences will be given the same score:

1. “This is not a good book” –> 0 + 0 + 0 + 0 + 1 + 0 –> positive

2. “This is a very good book” –> 0 + 0 + 0 + 0 + 1 + 0 –> positive

If we include features consisting of two or three words, this problem can be avoided; “not good” and “very good” will be two different features with different subjectivity scores. The biggest reason why bigram or trigram features are not used more often is that the number of possible combinations of words increases exponentially with the number of words. Theoretically, a vocabulary of 2.000 words can yield 2.000 possible unigram features, 4.000.000 possible bigram features and 8.000.000.000 possible trigram features.
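To make that combinatorial growth concrete (taking a vocabulary of 2.000 distinct words at face value):

```python
# Number of theoretically possible n-gram features for a 2000-word vocabulary.
vocabulary_size = 2000
print(vocabulary_size ** 1)  # 2000 possible unigrams
print(vocabulary_size ** 2)  # 4000000 possible bigrams
print(vocabulary_size ** 3)  # 8000000000 possible trigrams
```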

However, if we consider this problem from a pragmatic point of view, we can say that most of the combinations of words which can be made are grammatically impossible, or do not occur significantly often, and hence do not need to be taken into account.

Actually, we only need to define a small set of words (prepositions, conjunctions, interjections etc.) of which we know that they change the meaning of the words following them and/or the rest of the sentence. If we encounter such an ‘n-gram word’, we do not split the sentence at that point but after the next word. In this way we construct n-gram features consisting of the specified words and the words directly following them. Some examples of such words are “not”, “no”, “very” and “just”.

In the previous post, we had seen that the code to construct a DataFrame containing the class probabilities of words in the training set is:

```python
import pandas as pd

BOW_df = pd.DataFrame(0, columns=scores, index=[])
words_set = set()
for review in training_set:
    score = review['score']
    text = review['review_text']
    splitted_text = split_text(text)
    for word in splitted_text:
        if word not in words_set:
            words_set.add(word)
            BOW_df.loc[word] = [0, 0, 0, 0, 0]
        BOW_df.loc[word, score] += 1
```

If we also want to include n-grams in this class probability DataFrame, we need to include a function which generates n-grams from the tokenized text and the list of specified n-gram words:

```python
(...)
splitted_text = split_text(text)
text_with_ngrams = generate_ngrams(splitted_text, ngram_words)
for word in text_with_ngrams:
    (...)
```

There are a few conditions this “generate_ngrams” function needs to fulfill:

- When it iterates through the tokenized text and encounters an n-gram word, it needs to concatenate this word with the next word. So ["I", "do", "not", "recommend", "this", "book"] needs to become ["I", "do", "not recommend", "this", "book"]. At the same time it needs to skip the next iteration, so the next word does not appear twice.
- It needs to be recursive: we might encounter multiple n-gram words in a row. Then all of these words need to be concatenated into a single n-gram. So ["This", "is", "a", "very", "very", "good", "book"] needs to become ["This", "is", "a", "very very good", "book"]. If n words are concatenated into a single n-gram, the next n iterations need to be skipped.
- In addition to concatenating words with the words following them, it might also be interesting to concatenate them with the words preceding them. For example, forming n-grams including the word "book" and its preceding words leads to features like "worst book", "best book", "fascinating book" etc.

Now that we know this, let’s have a look at the code:

```python
def generate_ngrams(text, ngram_words):
    new_text = []
    index = 0
    while index < len(text):
        [new_word, new_index] = concatenate_words(index, text, ngram_words)
        new_text.append(new_word)
        # skip the words that were concatenated into new_word
        index = new_index + 1 if index != new_index else index + 1
    return new_text

def concatenate_words(index, text, ngram_words):
    word = text[index]
    if index == len(text) - 1:
        return word, index
    if word in ngram_words:
        [word_new, new_index] = concatenate_words(index + 1, text, ngram_words)
        word = word + ' ' + word_new
        index = new_index
    return word, index
```

Here concatenate_words is a recursive function which either returns the word at the index position in the array, or that word concatenated with the next word(s). It also returns the index, so we know how many iterations need to be skipped.

This function will also work if we want to append words to their preceding words. Then we simply need to pass the reversed text to it with `text = list(reversed(text))` and concatenate in reversed order: `word = word_new + ' ' + word`.

We can put this information together in a single function, which can either concatenate with the next word or with the previous word, depending on the value of the parameter ‘forward’:

```python
def generate_ngrams(text, ngram_words, forward=True):
    new_text = []
    index = 0
    if not forward:
        text = list(reversed(text))
    while index < len(text):
        [new_word, new_index] = concatenate_words(index, text, ngram_words, forward)
        new_text.append(new_word)
        # skip the words that were concatenated into new_word
        index = new_index + 1 if index != new_index else index + 1
    if not forward:
        return list(reversed(new_text))
    return new_text

def concatenate_words(index, text, ngram_words, forward):
    words = text[index]
    if index == len(text) - 1:
        return words, index
    if words.split(' ')[0] in ngram_words:
        [new_word, new_index] = concatenate_words(index + 1, text, ngram_words, forward)
        if forward:
            words = words + ' ' + new_word
        else:
            words = new_word + ' ' + words
        index = new_index
    return words, index
```

Using this simple function to concatenate words in order to form n-grams leads to features which strongly correlate with a specific (Negative/Positive) class, like ‘highly recommend’, ‘best book’ or even ‘couldn’t put it down’.

Now that we have a better understanding of Text Classification terms like bag-of-words, features and n-grams, we can start using Classifiers for Sentiment Analysis. Think of Naive Bayes, Maximum Entropy and SVM.

In my previous post I have explained the Theory behind three of the most popular Text Classification methods (Naive Bayes, Maximum Entropy and Support Vector Machines) and told you that I will use these Classifiers for the automatic classification of the subjectivity of Amazon.com book reviews.

The purpose is to get a better understanding of how these Classifiers work and perform under various conditions, i.e. do a comparative study about Sentiment Analytics.

In this blog post we will use the bag-of-words model to do Sentiment Analysis. The bag-of-words model can perform quite well at Topic Classification, but is inaccurate when it comes to Sentiment Classification. Bo Pang and Lillian Lee report an accuracy of 69% in their 2002 research about Movie review sentiment analysis. With the three Classifiers this percentage goes up to about 80% (depending on the chosen features).

The reason to still make a bag-of-words model is that it gives us a better understanding of the content of the text and we can use this to select the features for the three classifiers. The Naive Bayes model is also based on the bag-of-words model, so the bag-of-words model can be used as an intermediate step.

We can collect book reviews from Amazon.com by scraping them from the website with BeautifulSoup. The process for this was already explained in the context of Twitter.com and it should not be too difficult to do the same for Amazon.com.

In total 213.335 book reviews were collected for eight randomly chosen books:

After making a bar plot of the distribution of the different stars for the chosen books, we can see that there is a strong variation. Books which are considered to be average have almost no 1-star ratings, while books far below average have a more uniform distribution of the different ratings.

We can see that the book ‘Gone Girl’ has a pretty uniform distribution so it seems like a good choice for our training set. Books like ‘Unbroken’ or ‘The Martian’ might not have enough 1-star reviews to train for the Negative class.

As the next step, we are going to divide the corpus of reviews into a training set and a test set. The book ‘Gone Girl’ has about 40.000 reviews, so we can use *up to* half of it for training purposes and the other half for testing the accuracy of our model. In order to also take into account the effects of the training set size on the accuracy of our model, we will vary the training set size from 1.000 up to 20.000.

The bag-of-words model is one of the simplest language models used in NLP. It makes a unigram model of the text by keeping track of the number of occurrences of each word. This can later be used as features for Text Classifiers. In this bag-of-words model you only take individual words into account and give each word a specific subjectivity score. This subjectivity score can be looked up in a sentiment lexicon[1]. If the total score is negative the text will be classified as negative, and if it is positive the text will be classified as positive. It is simple to make, but it is less accurate because it does not take the word order or grammar into account.

A simple improvement on using unigrams would be to use unigrams + bigrams. That is, not splitting a sentence after words like “not”, “no”, “very”, “just” etc. It is easy to implement but can give a significant improvement to the accuracy. The sentence “This book is not good” will be interpreted as a positive sentence, unless such a construct is implemented. Another example is that the sentences “This book is very good” and “This book is good” will have the same score with a unigram model of the text, but not with a unigram + bigram model.

My pseudocode for creating a bag-of-words model is as follows:

- *list_BOW* = []
- For each review in the training set:
  - Strip the newline character "\n" at the end of the review.
  - Place a space before and after each of the following characters: .,()[]:;" (This prevents sentences like “I like this book.It is engaging” being interpreted as [“I”, “like”, “this”, “book.It”, “is”, “engaging”].)
  - Tokenize the text by splitting it on spaces.
  - Remove tokens which consist of only a space, an empty string or punctuation marks.
  - Append the tokens to *list_BOW*.
- *list_BOW* now contains all words occurring in the training set.
- Place *list_BOW* in a Python Counter element. This Counter now contains all occurring words together with their frequencies. Its entries can be sorted with the most_common() method.
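The pseudocode above can be sketched directly in Python; the two mini ‘reviews’ are invented for illustration, and the cleaning step is reduced to a plain rstrip:

```python
from collections import Counter

# Two made-up example reviews.
training_set = ['I like this book.It is engaging\n',
                'This book is not good\n']

list_BOW = []
for review in training_set:
    review = review.rstrip('\n')              # strip the trailing newline
    for char in '.,()[]:;"':                  # pad punctuation with spaces
        review = review.replace(char, ' ' + char + ' ')
    tokens = review.split(' ')
    # drop empty strings and bare punctuation tokens
    tokens = [t for t in tokens if t.strip() and t not in '.,()[]:;"']
    list_BOW.extend(tokens)

counts = Counter(list_BOW)
print(counts.most_common(3))  # e.g. [('book', 2), ('is', 2), ('I', 1)]
```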

The real question is how we should determine the sentiment/subjectivity score of each word in order to determine the total subjectivity score of the text. We can use one of the sentiment lexicons given in [1], but we don’t really know in which circumstances and for which purposes these lexicons were created. Furthermore, in most of these lexicons the words are classified in a binary way (either positive or negative). Bing Liu’s sentiment lexicon, for example, contains a list of a few thousand positive and a few thousand negative words.

Bo Pang and Lillian Lee used words which were chosen by two students as positive and negative words.

It would be better if we determine the subjectivity score of each word using some simple statistics of the training set. To do this we need to determine the class probability of each word present in the bag-of-words. This can be done by using a pandas DataFrame as a data container (but it can just as easily be done with dictionaries or other data structures). The code for this looks like:

```python
import pandas as pd

BOW_df = pd.DataFrame(0, columns=scores, index=[])
words_set = set()
for review in training_set:
    score = review['score']
    text = review['review_text']
    splitted_text = split_text(text)
    for word in splitted_text:
        if word not in words_set:
            words_set.add(word)
            BOW_df.loc[word] = [0, 0, 0, 0, 0]
        BOW_df.loc[word, score] += 1
```

Here `split_text` is the method for splitting a text into a list of individual words:

```python
def expand_around_chars(text, characters):
    # Place a space before and after each of the given characters.
    for char in characters:
        text = text.replace(char, ' ' + char + ' ')
    return text

def split_text(text):
    text = strip_quotations_newline(text)
    text = expand_around_chars(text, '".,()[]{}:;')
    splitted_text = text.split(' ')
    cleaned_text = [x for x in splitted_text if len(x) > 1]
    text_lowercase = [x.lower() for x in cleaned_text]
    return text_lowercase
```

This gives us a DataFrame containing the number of occurrences of each word in each class:

```
       Unnamed: 0     1     2     3      4      5
0               i  4867  5092  9178  14180  17945
1         through   210   232   414    549    627
2             all   499   537   923   1355   1791
3       drawn-out     1     0     1      1      0
4               ,  4227  4779  8750  15069  18334
5        detailed     3     7    15     30     36
...           ...   ...   ...   ...    ...    ...
31800    a+++++++     0     0     0      0      1
31801   nailbiter     0     0     0      0      1
31802     melinda     0     0     0      0      1
31803  reccomend!     0     0     0      0      1
31804  suspense!!     0     0     0      0      1

[31804 rows x 6 columns]
```

As we can see, there are also quite a few words which occur only once. These words will have a class probability of 100% for the class they occur in. This distribution, however, does not approximate the real class distribution of that word at all. It is therefore good to define some ‘occurrence cut-off value’; words which occur less often than this value are not taken into account.

By dividing each element of each row by the sum of the elements of that row, we get a DataFrame containing the relative occurrences of each word in each class, i.e. a DataFrame with the class probabilities of each word. After this is done, the words with the highest probability in class 1 can be taken as negative words, and the words with the highest probability in class 5 as positive words.
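That row-wise division is a one-liner with pandas; the counts below are illustrative, not the real training-set numbers:

```python
import pandas as pd

# Made-up occurrence counts for two words in the 1-star and 5-star classes.
counts = pd.DataFrame({1: [61, 3], 5: [19, 36]},
                      index=['ridiculous', 'detailed'])

# Divide every row by its row sum to get class probabilities per word.
class_probs = counts.div(counts.sum(axis=1), axis=0)
print(class_probs)
```

Here ‘ridiculous’ ends up with a high probability in the 1-star column and ‘detailed’ in the 5-star column, which is exactly the signal the sentiment lexicon is built from.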

We can construct such a sentiment lexicon from the training set and use it to measure the subjectivity of reviews in the test set. The larger the training set, the more accurate the sentiment lexicon becomes for prediction.

By labeling 4 and 5-star reviews as Positive, 1 and 2-star reviews as Negative and 3-star reviews as Neutral, and using the resulting lists of positive and negative words, we can determine with the bag-of-words model whether a review is positive or negative with a 60% accuracy.

- How accurate is this list of positive and negative words, constructed from the reviews of book A, in determining the subjectivity of book B reviews?
- How much more accurate will the bag-of-words model become if we take bigrams or even trigrams into account? There were words with a high negative or positive subjectivity in the word-list which do not have a negative or positive meaning by themselves. This can only be understood if you take the preceding or following words into account.
- Make an overall sentiment lexicon from all the reviews of all the books.
- Use the bag-of-words as features for the Classifiers; Naive Bayes, Maximum Entropy and Support Vector Machines.
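A minimal sketch of how the bigrams and trigrams from the second question can be produced from a token list:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["not", "a", "good", "book"]
print(ngrams(tokens, 2))  # [('not', 'a'), ('a', 'good'), ('good', 'book')]
print(ngrams(tokens, 3))  # [('not', 'a', 'good'), ('a', 'good', 'book')]
```

The bigram ('not', 'a') illustrates the point above: "not" flips the sentiment of the following words, which a unigram model cannot see.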


If it is done wrong, it can be boring, not grabbing the attention of the readers, or even worse, convey the wrong message.

If it is done correctly, it can intrigue even the most indifferent reader (some people can even turn Data Visualizations into an art form).

I personally think Python’s matplotlib is a great library for data visualization. Another amazing library is D3, which is just as intuitive and flexible as matplotlib. In addition, it is a javascript library, so it works in the browser; this makes it platform independent and you don’t have to install any software. Did I already tell you D3 is a.. maa.. zing!?

That is why I will focus on Data Visualizations with D3 in the future. But for now, I will start with something simpler and show you how to make a choropleth map. This is the kind of map you see at every election, where each state is colored in the color of the winning party. Although it might seem difficult to make such a map… it is not.

The first thing we need is a map of a country (or the area we want to visualize) in SVG form. Wikipedia has a nice collection of blank maps we can use. Copy the code of this map into a <div> element of a basic html page. As an example, we can take this map of the world.

Another thing we need to include is the jQuery library, so go ahead and link to the latest jQuery version hosted by google like this:
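The script tag itself did not survive in this copy; a typical include of a Google-hosted jQuery build (the version number here is only an example from that era) would look like:

```
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
```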

If we open the page now, we should see the map drawn out.

As we can see in the code, each <path> element has its own id. The code for Australia for example looks like:
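The snippet itself is missing from this copy; an SVG country path generally has the following shape (the `id` and the coordinates in the `d` attribute here are purely illustrative):

```
<path id="au" d="M 1020.3 580.1 L 1021.7 582.4 …" />
```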

Sometimes we might be lucky and this id will actually be equal to the name of the state/country, and sometimes it might be a random number/word. If that is the case, don’t lose any sleep over it. It is not very difficult to discover which element belongs to which country with Chrome Developer Tools (right click on the country and then click on ‘inspect element’).

Now that we know the id of the country we want to color in, we can give it a color with the javascript code:

$("#path6235").css('fill', 'red');

Now we need some data to fill in the map. Since the war in Syria / the Syrian refugee crisis is a current issue, it might be interesting to see which countries are donating the most / least to the Syrian crisis. The data for this can be found on this website. We could choose to color based on the absolute amount of money, but it seems fairer to look at the donated amount relative to each country’s GDP.

If we divide the donated amount by the GDP of that country for that year, we will get this data. Now we only need to put it in the correct format, which is JSON.

In our example, the correct data in the correct format looks like:
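The JSON itself is missing from this copy; the structure (a country name mapped to the donated amount as 1/1000th of a percent of GDP) would look like the following, where the Russia and Canada values are the approximate figures mentioned further down and the Switzerland value is a made-up placeholder:

```
var data = {
    "Russia": 0.5,
    "Canada": 11.0,
    "Switzerland": 5.0
};
```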

The complete dataset can be downloaded from here. In this file each number indicates the donated amount as 1/1000th percentage of the annual GDP. Go ahead and place the data in a <script> tag so that it can be accessed by JavaScript.

You can check whether or not the data is recognized by the browser by executing `console.log(data["Switzerland"])` within a <script> tag. This should print the data for Switzerland in the console of the browser.

Now the entire map can be filled in with a javascript function which iterates through the variable containing the data:


With the correct colors filled in, the map looks like:

In the above map, all of the countries with no donations for the Syrian crisis (in 2015) are colored red. The countries which have donated money are colored in based on a blue-to-green gradient, where blue indicates a relatively low and green a relatively large donation (Russia ~ 0.5 / 1000 % of GDP and Canada ~ 11 / 1000 % of their GDP).

If you are interested, you can download the entire html file from here.

Now that I have covered the basics of making a choropleth map, I want to address the issue that the way you choose to visualize your data can have a huge impact on the message your visualization is conveying.

If the countries with no donated money were left untouched, the first impression of the visualization would be that there is no data available on these countries.

Choosing a gradient scale from red to green instead of blue to green conveys the message that the countries colored red have done something bad (red is associated with danger).

Although I think everybody can donate more, I would not want to give the impression that Brazil has done something bad by donating ‘only’ 5.000.000 USD.


Natural Language Processing (NLP) is a vast area of Computer Science that is concerned with the interaction between Computers and Human Language[1].

Within NLP many tasks are – or can be reformulated as – classification tasks. In classification tasks we are trying to produce a classification function which can give the correlation between a certain ‘feature’ and a class. This Classifier first has to be trained with a training dataset, and then it can be used to actually classify documents. Training means that we have to determine its model parameters. If the set of training examples is chosen correctly, the Classifier should predict the class probabilities of the actual documents with a similar accuracy (as for the training examples).

After construction, such a Classifier could for example tell us that a document containing the words “Bose-Einstein condensate” should be categorized as a Physics article, while a document containing the words “Arbitrage” and “Hedging” should be categorized as a Finance article.

Another Classifier could tell us that mails starting with “Dear Customer/Guest/Sir” (instead of your name) and containing words like “Great opportunity” or “one-time offer” can be classified as spam.

Here we can already see two uses of classification models: *topic classification* and *spam filtering*. For these purposes Classifiers work quite well and perform better than most trained professionals.

A third usage of Classifiers is Sentiment Analysis. Here the purpose is to determine the subjective value of a text-document, i.e. how positive or negative is the content of a text document. Unfortunately, for this purpose these Classifiers fail to achieve the same accuracy. This is due to the subtleties of human language; sarcasm, irony, context interpretation, use of slang, cultural differences and the different ways in which opinion can be expressed (subjective vs comparative, explicit vs implicit).

In this blog I will discuss the theory behind three popular Classifiers (Naive Bayes, Maximum Entropy and Support Vector Machines) in the context of Sentiment Analysis[2]. In the next blog I will apply this gained knowledge to automatically deduce the sentiment of collected Amazon.com book reviews.

The contents of this blog-post are as follows:

- Basic concepts of text classification:
- Tokenization
- Word normalization
- bag-of-words model
- Classifier evaluation

- Naive Bayesian Classifier
- Maximum Entropy Classifier
- Support Vector Machines
- What to Expect

Tokenization is the name given to the process of chopping up sentences into smaller pieces (words or tokens). The segmentation into tokens can be done with decision trees, which contain the information needed to correctly solve the issues you might encounter. Some of the issues you would have to consider are:

- The choice of delimiter will in most cases be a whitespace (“We’re going to Barcelona.” -> [“We’re”, “going”, “to”, “Barcelona.”]), but what should you do when you come across names with a whitespace in them? A plain whitespace split turns “We’re going to The Hague.” into [“We’re”, “going”, “to”, “The”, “Hague.”], while “The Hague” should remain a single token.
- What should you do with punctuation marks? Although many tokenizers are geared towards throwing punctuation away, for Sentiment analysis a lot of valuable information could be deduced from them.
- **!** puts extra emphasis on the negative/positive sentiment of the sentence, while **?** can mean uncertainty (no sentiment).
- “, ‘, [], () can mean that the words belong together and should be treated as a separate sentence. The same goes for words which are **bold**, *italic*, __underlined__, or inside a link. If you also want to take these last elements into consideration, you should scrape the html code and not just the text.
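A minimal tokenizer along these lines (keeping ! and ? as separate tokens; a real tokenizer would also have to handle the multi-word names and quoting issues above) could look like:

```python
import re

def tokenize(sentence):
    """Split on whitespace, keeping ! and ? as their own tokens."""
    # Put spaces around ! and ? so they survive as separate tokens.
    sentence = re.sub(r"([!?])", r" \1 ", sentence)
    # Strip remaining punctuation at the token edges and lowercase.
    tokens = [t.strip('.,;:"()[]{}').lower() for t in sentence.split()]
    return [t for t in tokens if t]

print(tokenize("This blog-post is awesome!"))
# ['this', 'blog-post', 'is', 'awesome', '!']
```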

**Word Normalization** is the reduction of each word to its base/stem form (by chopping off the affixes). While doing this, we should consider the following issues:

- Capital letters should be normalized to lowercase, unless the capital occurs in the middle of a sentence; there it could indicate the name of a writer, place, brand etc.
- What should be done with the apostrophe (‘); “George’s phone” should obviously be tokenized as “George” and “phone”, but I’m, we’re, they’re should be translated as I am, we are and they are. To make it even more difficult; it can also be used as a quotation mark.
- Ambiguous words like High-tech, The Hague, P.h.D., USA, U.S.A., US and us.

After the text has been segmented into sentences, each sentence has been segmented into words, and the words have been tokenized and normalized, we can make a simple bag-of-words model of the text. In this bag-of-words representation you only take individual words into account and give each word a specific subjectivity score. This subjectivity score can be looked up in a sentiment lexicon[7]. If the total score is negative the text will be classified as negative, and if it is positive the text will be classified as positive.
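A minimal sketch of such a bag-of-words scoring (the lexicon and its scores here are made up for illustration; a real one would come from a sentiment lexicon such as those listed in [7]):

```python
# Toy sentiment lexicon: word -> subjectivity score.
lexicon = {"awesome": 1.0, "good": 0.5, "boring": -0.5, "terrible": -1.0}

def bag_of_words_score(tokens):
    """Sum the subjectivity score of each token; unknown words score 0."""
    return sum(lexicon.get(token, 0.0) for token in tokens)

def classify(tokens):
    score = bag_of_words_score(tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify(["this", "book", "is", "awesome"]))  # positive
```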

For determining the accuracy of a single Classifier, or for comparing the results of different Classifiers, the F-score is usually used. This F-score is given by

F = 2 · P · R / (P + R),

where P is the precision and R is the recall. The precision is the number of correctly classified examples divided by the total number of classified examples. The recall is the number of correctly classified examples divided by the actual number of examples in the training set.
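A sketch of this computation (the counts in the example call are made up):

```python
def f_score(true_positives, predicted_positives, actual_positives):
    """F = 2PR / (P + R): the harmonic mean of precision and recall."""
    precision = true_positives / predicted_positives
    recall = true_positives / actual_positives
    return 2 * precision * recall / (precision + recall)

# 8 correctly classified out of 10 predicted, with 16 actual examples:
print(f_score(8, 10, 16))  # precision 0.8, recall 0.5 -> F ~ 0.615
```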

Naive Bayes [3] classifiers approach the classification task from a statistical point of view. The starting point is that the probability of a class c is given by the posterior probability P(c | d), given a training document d. Here d refers to the text of a document in the training set. It is given by d = (w1, w2, …, wn), where wi is the i-th attribute (word) of document d.

Using Bayes’ rule, this posterior probability can be rewritten as:

P(c | d) = P(d | c) · P(c) / P(d)

Since the marginal probability P(d) is equal for all classes, it can be disregarded and the equation becomes:

P(c | d) ∝ P(d | c) · P(c)

The document belongs to the class which maximizes this probability, so:

c* = argmax_c P(d | c) · P(c)

Assuming conditional independence of the words wi, this equation simplifies to:

c* = argmax_c P(c) · ∏i P(wi | c)

Here P(wi | c) is the conditional probability that word wi belongs to class c. For the purpose of text classification, this probability can simply be calculated as the frequency of word wi in class c, relative to the total number of words in class c.

We have seen that we need to multiply the class probability with all of the prior-probabilities of the individual words belonging to that class. The question then is, how do we know what the prior-probabilities of the words are? Here we need to remember that this is a supervised machine learning algorithm: we can estimate the prior-probabilities with a training set with documents that are already labeled with their classes. With this training set we can train the model and obtain values for the prior probabilities. This trained model can then be used for classifying unlabeled documents.

This is relatively easy to understand with an example. Let’s say we have counted the number of words in a set of labeled training documents. In this set, each text document has been labeled as either Positive, Neutral or Negative. The result will then look like:

From this table we can already deduce each of the class probabilities.

If we look at the sentence “This blog-post is awesome.”, then the probabilities for this sentence belonging to a specific class are:

This sentence can thus be classified in the positive category.
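The original word-count table did not survive in this copy, so here is a sketch with made-up counts and priors that shows how the classification above is computed (class prior times the product of the per-word conditional probabilities):

```python
# Hypothetical word counts per class; the real table is built from the
# labeled training documents.
counts = {
    "Positive": {"this": 10, "blog-post": 5, "is": 20, "awesome": 15},
    "Neutral":  {"this": 10, "blog-post": 5, "is": 20, "awesome": 2},
    "Negative": {"this": 10, "blog-post": 5, "is": 20, "awesome": 1},
}
priors = {"Positive": 0.4, "Neutral": 0.3, "Negative": 0.3}  # illustrative

def classify(words):
    """Return the class c maximizing P(c) * product of P(w | c)."""
    best_class, best_score = None, 0.0
    for c, word_counts in counts.items():
        total = sum(word_counts.values())
        score = priors[c]
        for w in words:
            score *= word_counts.get(w, 0) / total  # P(w | c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["this", "blog-post", "is", "awesome"]))  # Positive
```

With these counts the word "awesome" is far more frequent in the Positive class, which is what tips the product in its favor.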

The principle behind Maximum Entropy [4] is that the correct distribution is the one that maximizes the Entropy / uncertainty and still meets the constraints which are set by the ‘evidence’.

Let me explain this a bit more. In Information Theory, the word Entropy is used as a measure of the unpredictability of the content of information. If you throw a fair die, each of the six outcomes has the same probability of occurring (1/6). Therefore you have maximum uncertainty: the entropy is maximal. If the die is weighted, you already know that one of the six outcomes has a higher probability of occurring, and the uncertainty becomes less. If the die is weighted so much that the outcome is always six, there is zero uncertainty in the outcome and hence the information entropy is also zero.
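This dice example can be checked numerically; a sketch of the Shannon entropy (in bits) for the three dice described above:

```python
import math

def entropy(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_die = [1 / 6] * 6
weighted_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
always_six = [0, 0, 0, 0, 0, 1]

print(entropy(fair_die))      # ~2.585 bits: maximum uncertainty
print(entropy(weighted_die))  # lower: one outcome is more likely
print(entropy(always_six))    # 0.0: no uncertainty at all
```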

The same applies to letters in a word (or words in a sentence): if you assume that every letter has the same probability of occurring, you have maximum uncertainty in predicting the next letter. But if you know that letters like E, A, O or I have a higher probability of occurring, you have less uncertainty.

Knowing this, we can say that complex data has high entropy, patterns and trends have lower entropy, and information you know for a fact to be true has zero entropy (and can therefore be excluded).

The idea behind Maximum Entropy is that you want a model which is as unbiased as possible; events which are not excluded by known constraints should be assigned as much uncertainty as possible, meaning the probability distribution should be as uniform as possible. You are looking for the maximum value of the Entropy. If this is not entirely clear, I recommend you to read through this example.

The mathematical formula for Entropy is given by H(p) = −Σx p(x) log p(x), so the most likely probability distribution is the one that maximizes this entropy:

p* = argmax_p H(p)

It can be shown that this probability distribution has an exponential form and hence is given by:

p(c | d) = (1 / Z(d)) · exp( Σi λi · fi(c, d) ),

where fi is a feature function, λi is the weight parameter of that feature function and Z(d) is a normalization factor given by

Z(d) = Σc exp( Σi λi · fi(c, d) ).

This feature function is an indicator function which expresses the expected value of the chosen statistics (words) in the training set. These feature functions can then be taken as constraints for the classification of the actual dataset (by eliminating the probability distributions which do not fit these constraints).

Usually, the weight parameters are automatically determined by the Improved Iterative Scaling algorithm. This is simply a gradient descent method which is iterated until it converges to the global maximum. The pseudocode for this algorithm is as follows:

- Initialize all weight parameters λi to zero.
- Repeat until convergence:
- calculate the probability distribution p(c | d) with the current weight parameters filled in.
- for each parameter λi calculate the update step δi that solves the constraint equation of this iteration.
- update the value for the weight parameter: λi ← λi + δi.

In step 2b, δi depends on f#(c, d), which is given by the sum of all features in the training document d: f#(c, d) = Σi fi(c, d).

Maximum Entropy is a general statistical classification algorithm and can be used to estimate any probability distribution. For the specific case of text classification, we can limit its form a bit more by using word counts as features.

Although it is not immediately obvious from the name, the SVM algorithm is a ‘simple’ linear classification/regression algorithm[6]. It tries to find a hyperplane which separates the data into two classes as optimally as possible.

Here ‘as optimally as possible’ means that as many points as possible of label A should be separated to one side of the hyperplane and as many points as possible of label B to the other side, while maximizing the distance of each point to this hyperplane.

In the image above we can see this illustrated for the example of points plotted in 2D-space. The set of points is labeled with two categories (illustrated here with black and white points) and SVM chooses the hyperplane that maximizes the margin between the two classes. This hyperplane is given by

f(x) = w · x + b = Σi αi yi (xi · x) + b,

where x is an n-dimensional input vector, yi is the output value of training point xi, w is the weight vector (the normal vector) defining the hyperplane and the αi terms are the Lagrangian multipliers.

Once the hyperplane is constructed (the vector w is defined) with a training set, the class of any other input vector x can be determined:

if f(x) ≥ 0, then it belongs to the positive class (the class we are interested in), otherwise it belongs to the negative class (all of the other classes).
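A sketch of this decision rule (the weight vector and bias below are made-up values standing in for the result of training):

```python
def svm_predict(w, b, x):
    """Classify x by the sign of the decision function f(x) = w . x + b."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "positive" if f >= 0 else "negative"

# Hypothetical trained hyperplane in a 2D feature space:
w, b = [1.0, -2.0], 0.5
print(svm_predict(w, b, [3.0, 1.0]))  # f = 3 - 2 + 0.5 = 1.5 -> positive
print(svm_predict(w, b, [0.0, 2.0]))  # f = -4 + 0.5 = -3.5 -> negative
```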

We can already see this leads to two interesting questions:

1. SVM only seems to work when the two classes are linearly separable. How can we deal with non-linear datasets? Here I feel the urge to point out that Naive Bayes and Maximum Entropy are linear classifiers as well, and most text datasets will be linearly separable. Our training example of Amazon book reviews will be linearly separable as well. But an explanation of the SVM system would not be complete without an explanation of Kernel functions.

2. SVM only seems to be able to separate the dataset into two classes. How can we deal with datasets with more than two classes? For Sentiment Classification we have for example three classes (positive, neutral, negative), and for Topic Classification we can have even more than that.

**Kernel Functions:**
The classical SVM system requires that the dataset is linearly separable, i.e. there is a single hyperplane which can separate the two classes. For non-linear datasets a Kernel function is used to map the data to a higher dimensional space in which it is linearly separable. This video gives a good illustration of such a mapping. In this higher dimensional feature space, the classical SVM system can then be used to construct a hyperplane.

**Multiclass classification:**

The classical SVM system is a binary classifier, meaning that it can only separate the dataset into two classes. To deal with datasets with more than two classes usually the dataset is reduced to a binary class dataset with which the SVM can work. There are two approaches for decomposing a multiclass classification problem to a binary classification problem: the one-vs-all and one-vs-one approach.

In the one-vs-all approach one SVM Classifier is built per class. This Classifier takes that one class as the positive class and the rest of the classes as the negative class. A datapoint is then only classified within a specific class if it is accepted by that class’ Classifier and rejected by all other classifiers. Although this can lead to accurate results (if the dataset is clustered), a lot of datapoints can also be left unclassified (if the dataset is not clustered).

In the one-vs-one approach, you build one SVM Classifier per chosen pair of classes. Since there are N(N−1)/2 possible pair combinations for a set of N classes, this means you have to construct N(N−1)/2 Classifiers. Datapoints are then categorized in the class for which they have received the most votes.
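The pair decomposition itself is a one-liner; for three sentiment classes:

```python
from itertools import combinations

classes = ["positive", "neutral", "negative"]
pairs = list(combinations(classes, 2))  # the one-vs-one classifier pairs
print(pairs)
print(len(pairs))  # N(N-1)/2 = 3 for N = 3
```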

In our example, there are only three classes (positive, neutral, negative) so there is no real difference between these two approaches. In both approaches we have to construct two hyperplanes; positive vs the rest and negative vs the rest.

For the purpose of testing these Classification methods, I have collected >300.000 book reviews of 10 different books from Amazon.com. I will use a part of these book reviews for training purposes and a part as the test dataset. In the next few blogs I will try to automatically classify the sentiment of these reviews with the four models described above.

—————————————-

**[1] Machine Learning Literature:**
Foundations of Statistical Natural Language Processing by Manning and Schutze,

Machine Learning: A probabilistic perspective by Kevin P. Murphy,

Foundations of Machine Learning by Mehryar Mohri

**[2]Sentiment Analysis literature:**

There is already a lot of information available and a lot of research done on Sentiment Analysis. To get a basic understanding and some background information, you can read Pang et al.’s 2002 article. In this article, the different Classifiers are explained and compared for sentiment analysis of Movie reviews (IMDB). This research was very close to Turney’s 2002 research on Sentiment Analysis of movie reviews (see article). You can also read Bo Pang and Lillian Lee’s 2009 article, which is more general in nature (about the challenges of SA, the different ML techniques etc.)

There are also two relevant books: Web Data Mining and Sentiment Analysis, both by Bing Liu. And last but not least, the works of Socher are also quite interesting (see paper, website containing live demo); it even has inspired this kaggle competition.

**[3] Naive Bayes Literature:**

Machine Learning by Tom Mitchel, Stanford’s IR-book, Sebastian Raschka’s blog-post, Stanford’s online NLP course.

**[4]Maximum Entropy Literature:**

Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models (1997), A brief MaxEnt tutorial, another good MIT article.

**[6]SVM Literature:**

This youtube video gives a general idea about SVM. For a more technical explanation, this and this article can be read. Here you can find a good explanation as well as a list of the mostly used Kernel functions. one-vs-one and one-vs-all.

**[7] Sentiment Lexicons:**
I have selected a list of sentiment analysis lexicons; most of these were mentioned in the Natural Language Processing course, the rest are from stackoverflow.

- WordStat sentiment Dictionary; this is probably one of the largest lexicons freely available. It contains ~14.000 words (9164 negative and 4847 positive words) and gives each word a binary classification (positive or negative) score.
- Bill McDonald’s 2014 Master dictionary, containing ~85.000 words
- Harvard Inquirer; Contains about ~11.780 words and has a more complex way of ‘scoring’ words; each word can be scored in 15+ categories; words can be Positiv-Negative, Strong-Weak, Active-Passive, Pleasure-Pain, words can indicate pleasure, pain, virtue and vice etc etc
- SentiWordNet; gives the words a positive or negative score between 0 and 1. It contains about 117.660 words, however only ~29.000 of these words have been scored (either positive or negative).
- MPQA; contains about ~8.200 words and binary classifies each word (as either positive or as negative). It also gives additional information such as whether a word is an adjective or a noun and whether a word is ‘strong subjective’ or ‘weak subjective’.
- Bing Liu’s opinion lexicon; contains 4.782 negative and 2.005 positive words.

**Including Emoticons in your dictionary;**

None of the dictionaries described above contain emoticons, which might be an essential part of text if you are analyzing social media. So how can we include emoticons in our subjectivity analysis? Everybody knows that :) is a positive and :( is a negative emoticon, but what exactly do less obvious ones like :-/ mean?

There are a few emoticon sentiment dictionaries on the web which you could use, e.g. the Emoticon Sentiment Lexicon created by Hogenboom et al., containing a list of 477 emoticons which are scored either 1 (positive), 0 (neutral) or -1 (negative). You could also make your own emoticon sentiment dictionary by giving the emoticons the same score as their meaning in words.


For most people, the most interesting part of the previous post will be the final results. But for the ones who would like to try something similar, or who are also curious about the technical part, I will explain the methods and techniques I used (mostly webscraping with BeautifulSoup4) to collect a few million Tweets.

**Setting up Python and its relevant packages**

I used Python as the programming language to collect all the relevant data, because I have prior experience with it, but the same techniques should be applicable with other languages. In case you do not have any experience with Python, but still would like to use it, I can recommend the Coursera Python Course, Codeacademy Python Course or the Learn Python the Hard Way book. It might also be a good idea to get a basic understanding of how web APIs work.

The Python Packages you will need are: Numpy & SciPy for basic calculations, OAuth2 for authorization towards Twitter, tweepy because it provides a user-friendly wrapper of the Twitter API, and pymongo for interacting with the MongoDB database from Python (if that is the db you will use).

If you do not have Python and/or some of its packages, the easiest way to install them is: on linux, install pip (the Python package manager) first and then install any of the missing packages with `pip install <package>`.

For Windows I recommend to install Anaconda (which has a lot of built-in packages including pip) first and then IPython. The missing packages can then be installed with the same command.

**Getting your twitter credentials;**

Twitter is using OAuth2 for authorization and authentication, so whether you are using tweepy to access the Twitter stream or some other method, make sure you have installed the OAuth2 package. After you have installed OAuth2 it is time to get your log-in credentials from https://apps.twitter.com. Log in and click on *create a new app*, fill in the application details and copy your Consumer Secret and your Access Token Secret to some text file.

**Accessing Twitter with its API**

I recommend you to use tweepy [1], which is an open-source Twitter API wrapper, making it easy to access twitter. If you are using a programming language other than Python or if you don’t feel like using tweepy, you can look at the Twitter API documentation and find other means of accessing Twitter.

There are two ways in which you can mine for tweets; with the **Streaming API** or with the **Search Api**. The main difference (for an overview, click here) between them is that with Search you can mine for tweets posted in the past while Streaming goes forward in time and captures tweets as they are posted.

It is also important to take the rate-limit for both API’s into account:

- With the Search API you can only send 180 Requests in every 15-minute timeframe. With a maximum number of 100 tweets per Request, this means you can mine for 4 x 180 x 100 = 72.000 tweets per hour. One way to increase the number of tweets is to authenticate as an application instead of as a user. This will increase the rate-limit from 180 to 450 Requests, while reducing some of the possibilities you had as a user.
- With the Streaming API you can collect all tweets containing your keyword(s), up to 1 % of the total tweets currently being posted on twitter. So if your keyword is very general and more than 1 % of the tweets contain this term, you will not get all of the tweets containing this term. The obvious solution is to make your query more specific and combining multiple keywords. At the moment 500+ million tweets are posted a day, so 1 % of all tweets still gives you 1+ million tweets a day.

**Which one should you use?**

Obviously any prediction about the future should be based on tweets coming from the Streaming API, but if you need some data to fine-tune your model you can use the Search API to collect tweets from the past seven days – it does not go further back (Twitter documentation). Since there are around 500+ million tweets posted every day, the past seven days should provide you with enough data to get you started. If you need tweets older than 7 days, webscraping might be a good alternative, since a search at twitter.com does return old tweets.

Using the tweepy package for Streaming Twitter messages is pretty straightforward. There even is a code sample on the github page of tweepy. So all you need to do is install tweepy / clone the github repository and fill in the search terms in the relevant part of search.py.

With tweepy you can also search for Twitter messages (not older than 7 days). The code sample below shows how it is done.

import tweepy

access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.search, q="Tayyip%20Erdogan", lang="tr").items():
    print tweet

These two samples of code show again the advantage of tweepy: it makes it really easy to access the Twitter API from Python, and as a result it is probably the most popular Python Twitter package. But because it uses the Twitter API, it is also subject to the limitations posed by Twitter: the rate-limit and the fact that you cannot search for twitter messages older than 7 days. Since I needed data from the previous elections, this posed a serious problem for me and I had to use web-scraping to collect Twitter messages from May.

**Using BeautifulSoup4 to scrape for tweets**

There are some pros and cons to using web scraping for the collection of twitter data (instead of their API). One of the most important pros is that there is no rate-limit on the website, so you can collect more tweets than the limit imposed on the Twitter API. Furthermore, you can also mine for tweets older than seven days :).

If we want to scrape twitter.com with BeautifulSoup we need to send a Request and extract the relevant information from the response. The Search API documentation gives a nice overview of the relevant parameters you can use in your query.

For example, if you want to request all tweets containing ‘akparti’ from 01 May 2015 until 05 June 2015, written in Turkish, you can do that with the following url

`https://twitter.com/search?q=akparti%20since%3A2015-05-01%20until%3A2015-06-05&lang=tr`

The tweets on this page can easily be scraped with the Python module BeautifulSoup.

import urllib2
from bs4 import BeautifulSoup

url = "https://twitter.com/search?q=akparti%20since%3A2015-05-01%20until%3A2015-06-05&lang=tr"
response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)

‘soup’ now contains the entire contents of the html page. Now, let’s look at how we can extract the more specific elements containing only the tweet-text, tweet-timestamp or user. With the developer tools of Chrome (right-click on the tweet and then ‘Inspect element’) you can see which elements contain the desired contents and scrape them by their class-name.

We can see that an **<li>** element with class *js-stream-item* contains the entire contents of the tweet, a **<p>** with class *tweet-text* contains the text, and the user is contained in a **<span>** with class *username*. This gives us enough information to extract these with BeautifulSoup:

tweets = soup.find_all('li', 'js-stream-item')
for tweet in tweets:
    if tweet.find('p', 'tweet-text'):
        tweet_user = tweet.find('span', 'username').text
        tweet_text = tweet.find('p', 'tweet-text').text.encode('utf8')
        tweet_id = tweet['data-item-id']
        timestamp = tweet.find('a', 'tweet-timestamp')['title']
        tweet_timestamp = dt.datetime.strptime(timestamp, '%H:%M - %d %b %Y')
    else:
        continue

**Notes:**

- The `.text` after `tweet.find('span','username')` is necessary to extract only the visible text, excluding all html elements.
- Since the tweets are written in Turkish, they probably contain non-standard characters which are not supported everywhere, so it is necessary to encode them as utf8.
- The date in the twitter message is written in a human-readable format. To convert it to a datetime format which can further be used by Python, we need datetime’s strptime method. For this we additionally need to import the datetime and locale packages and set the locale to Turkish:

import datetime as dt
import locale
locale.setlocale(locale.LC_ALL, 'turkish')

**Scraping pages with infinite scroll:**
In principle this should be enough to scrape all of the Twitter messages containing the keyword ‘akparti’ within the specified dates. However, the Twitter website uses infinite scroll: it initially shows only ~20 tweets and keeps loading more tweets as you scroll down. So a single request will only get you the initial 20 tweets.

One of the most commonly used solutions for scraping pages with infinite scroll is Selenium. Selenium can open a web browser and scroll down to the bottom of the page (see stackoverflow), after which you can scrape the page. I do not recommend this approach. The biggest disadvantage of Selenium is that it physically opens up a browser and loads all of the tweets. Nowadays tweets can also contain videos and images, and loading these in your web browser will be slower than simply loading the source code of the page. If you are planning on scraping thousands or millions of tweets, it will be a very time-consuming and memory-intensive process.

*There must be another way!*

Let’s open up the Chrome developer tools (Ctrl + Shift + I) again to find a solution for this problem. Under the Network tab, you can see the GET and POST requests which are sent the instant you reach the bottom and the page is filled with more tweets.

In our case this is a GET request which looks like:

`https://twitter.com/i/search/timeline?vertical=default&q=Erdogan%20since%3A2015-05-01%20until%3A2015-06-06&include_available_features=1&include_entities=1&lang=tr&last_note_ts=2088&max_position=TWEET-606971359399411712-606973762026803200&reset_error_state=false`

The interesting parameter in this request is

**max_position=TWEET-606971359399411712-606973762026803200**

Here the first number is the tweet-id of the first tweet on the page and the second number is the tweet-id of the last tweet. Scrolling down to the bottom again, we can see that this parameter has become

**max_position=TWEET-606967763807182849-606973762026803200**

So every time new tweets are loaded on the page, the GET request above is sent with the tweet-ids of the first and last tweets on the page.

At this point I hope it has become clear what needs to be done to scrape all tweets from the twitter page:

1. Read the response of the ‘regular’ Twitter URL with BeautifulSoup. Extract the information you need and save it to a file/database. Separately save the tweet-id of the first and last tweet on the page.

2. Construct the above GET request with the tweet-ids of the first and last tweet filled in at their corresponding places. Read the response of this request with BeautifulSoup and update the tweet-id of the first tweet (the part of max_position that changed between the two requests above).

3. Repeat step 2 until you get a response with no more new tweets.
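These steps can be sketched in code. This is a minimal sketch rather than a definitive implementation: `build_timeline_url` assumes the GET-request format observed above is stable, and `fetch` and `parse_tweets` are hypothetical placeholders for the urllib2 and BeautifulSoup code shown earlier (a url → html function and an html → list of (tweet-id, tweet) tuples function, respectively).

```python
# Sketch of the full scraping loop. Assumptions: the GET-request format
# observed in the developer tools is unchanged, and following those requests,
# the first id in max_position is updated with the oldest tweet of every new
# batch while the second id stays fixed.

TIMELINE_URL = ("https://twitter.com/i/search/timeline?vertical=default"
                "&q=%(query)s&include_available_features=1&include_entities=1"
                "&lang=tr&max_position=TWEET-%(first_id)s-%(last_id)s"
                "&reset_error_state=false")

def build_timeline_url(query, first_id, last_id):
    """Fill the tweet-ids of the first and last tweet into the GET request."""
    return TIMELINE_URL % {'query': query,
                           'first_id': first_id,
                           'last_id': last_id}

def scrape_all(query, first_id, last_id, fetch, parse_tweets):
    """Repeat the GET request until no new tweets come back (step 3).

    'fetch' (url -> html) and 'parse_tweets' (html -> list of
    (tweet_id, tweet) tuples, newest first) stand in for the
    urllib2/BeautifulSoup code shown earlier.
    """
    collected = []
    while True:
        html = fetch(build_timeline_url(query, first_id, last_id))
        batch = parse_tweets(html)
        if not batch:              # no more new tweets: we are done
            return collected
        collected.extend(batch)
        first_id = batch[-1][0]    # oldest tweet in this batch (step 2)
```

With a real `fetch` built on urllib2 and a `parse_tweets` built on the BeautifulSoup extraction above, this loop walks backwards through the search results until the response contains no new tweets.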

——–

[1] Here are some good documents to get started with tweepy:

http://docs.tweepy.org/en/latest/getting_started.html

http://adilmoujahid.com/posts/2014/07/twitter-analytics/

http://pythoncentral.io/introduction-to-tweepy-twitter-for-python/

http://pythonprogramming.net/twitter-api-streaming-tweets-python-tutorial/

http://www.dototot.com/how-to-write-a-twitter-bot-with-python-and-tweepy/

Although my predicted voting percentage for AKP was much closer to the actual result than most of the traditional polls, it is also true that my predicted value for MHP is far off, making the overall prediction error bigger than that of most conventional polls (see table below).

| | AKP | CHP | MHP | HDP | Others | prediction error |
|---|---|---|---|---|---|---|
| Election results | 49.4 | 25.4 | 11.9 | 10.7 | 2.5 | |
| My prediction | 47.3 | 22.4 | 18.8 | 11.68 | 0 | 3.1 |
| Traditional polls: | | | | | | |
| Andy-Ar | 43.7 | 27.1 | 14.0 | 13.0 | 2.2 | 2.42 |
| Konda | 41.7 | 27.9 | 14.2 | 13.8 | 2.3 | 3.16 |
| A&G | 47.2 | 25.3 | 13.5 | 12.2 | 1.8 | 1.22 |
| Gezici | 43 | 26.1 | 14.9 | 12.2 | 3.8 | 2.58 |
| Metropoll | 43.3 | 25.9 | 14.8 | 13.4 | 2.6 | 2.46 |
| ORC | 43.3 | 27.4 | 14 | 12.2 | 3.1 | 2.46 |
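The ‘prediction error’ column appears to be the mean absolute difference between a poll’s predicted percentages and the actual results over the five categories; this interpretation is an assumption on my part, but it reproduces the values in the table:

```python
def prediction_error(actual, predicted):
    """Mean absolute difference between predicted and actual percentages."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / float(len(actual))

# actual election results in the order AKP, CHP, MHP, HDP, Others
actual = [49.4, 25.4, 11.9, 10.7, 2.5]

my_prediction = [47.3, 22.4, 18.8, 11.68, 0]
ag_poll = [47.2, 25.3, 13.5, 12.2, 1.8]   # A&G

round(prediction_error(actual, my_prediction), 2)  # -> 3.1
round(prediction_error(actual, ag_poll), 2)        # -> 1.22
```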

So to be honest, I have to conclude that the results of this research do not point towards a clear victory for Twitter Data Analytics. Although it is not a clear victory, it is also not a clear loss.

On the bright side, this research was done with a few Amazon EC2 instances at a total cost of about three dollars, while the cost of traditional polls was in the range of a few million (to put it mildly). For those who are interested, this is an interesting article about the current state of the polling industry.

I still believe that the content of Twitter can be representative of an electorate and political sentiment can be modeled from Twitter messages effectively. However, it is clear that further research is needed and challenges lie ahead.

At the moment I cannot give a clear answer to the question of why there is such a large discrepancy between the predicted and actual results. I hope to provide you with a better explanation later on, but for now I can already tell you that this discrepancy is partly caused by ‘Ahmet Kaya’.

There were two politicians of the MHP party named ‘Ahmet Kaya’ who were also participating in the elections (one for the province of Diyarbakir and one for the province of Erzincan). The problem with these two politicians is that Ahmet Kaya was also the name of a very famous Turkish singer (who happened to be born on 28 October).

Of course the Twitter Data Collector is not smart enough to distinguish between Ahmet Kaya the politician and Ahmet Kaya the singer, and since I did not check the content of the tweets or go through the dictionary containing the names of the ~550 politicians in great detail, MHP got thousands of tweets more than it should have…

In later posts I will go into the more technical part about how to collect data from Twitter, for the ones interested in doing Twitter Data Analytics.
