For python programmers, scikit-learn is one of the best libraries to build Machine Learning applications with. It is ideal for beginners because it has a really simple interface, it is well documented with many examples and tutorials.

Besides supervised machine learning (classification and regression), it can also be used for clustering, dimensionality reduction, feature extraction and engineering, and pre-processing the data. The interface is consistent over all of these methods, so it is not only easy to use, but it is also easy to construct a large ensemble of classifiers/regression models and train them with the same commands.

In this blog lets have a look at how to build, train, evaluate and validate a classifier with scikit-learn and in this way get familiar with the scikit-learn library.

Let’s look at the process of classification with scikit-learn with two example datasets. The glass dataset, and the Mushroom dataset.

The glass dataset contains data on six types of glass (from building windows, containers, tableware, headlamps, etc) and each type of glass can be identified by the content of several minerals (for example Na, Fe, K, etc). This dataset only contains numerical data and therefore is a good dataset to get started with.

The second dataset contains non-numerical data and we will need an additional step where we encode the categorical data to numerical data.

Lets start with classifying the classes of glass!

import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import time from sklearn.decomposition import PCA from sklearn.preprocessing import StandardScaler, LabelEncoder from sklearn.linear_model import LogisticRegression from sklearn.svm import SVC from sklearn.neighbors import KNeighborsClassifier from sklearn import tree from sklearn.neural_network import MLPClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.gaussian_process.kernels import RBF from sklearn.ensemble import RandomForestClassifier from sklearn.naive_bayes import GaussianNB

First we need to import the necessary modules and libraries which we will use.

- The pandas module is used to load, inspect, process the data and get in the shape necessary for classification.
- Seaborn is a library based on matplotlib and has nice functionalities for drawing graphs.
- StandardScaler is a library for standardizing and normalizing dataset and
- the LaberEncoder library can be used to One Hot Encode the categorical features (in the mushroom dataset).
- All of the other modules are classifiers which are used for classification of the dataset.

When loading a dataset for the first time, there are several questions we need to ask ourself:

- What kind of data does the dataset contain? Numerical data, categorical data, geographic information, etc…
- Does the dataset contain any missing data?
- Does the dataset contain any redundant data (noise)?
- Do the values of the features differ over many orders of magnitude? Do we need to standardize or normalize the dataset?

filename_glass = './data/glass.csv' df_glass = pd.read_csv(filename_glass) print(df_glass.shape) display(df_glass.head()) display(df_glass.describe())

We can see that the dataset consists of 214 rows and 10 columns. All of the columns contain numerical data, and there are no rows with missing information (check this for yourself). Also most of the features have values in the same order of magnitude.

So for this dataset we do not need to remove any rows (with `.dropna()`

) or apply one hot encoding (to transform categorical data into numerical data) or standardize the data (with `StandardScaler().fit_transform(X)`

).

The .describe() method of pandas is useful for giving a quick overview of the dataset;

- How many rows of data are there?
- What are some characteristic values like the mean, standard deviation, minimum and maximum value, the 25th percentile etc.

To get more insight in how (strongly) each feature is correlated with the Type of glass, we can calculate and plot the correlation matrix for this dataset.

correlation_matrix = df_glass.corr() plt.figure(figsize=(10,8)) ax = sns.heatmap(correlation_matrix, vmax=1, square=True,annot=True,cmap='RdYlGn') plt.title('Correlation matrix between the features') plt.show()

The correlation matrix shows us for example that the oxides ‘Mg’ and ‘Al’ are most strongly correlated with the Type of glass. The content of ‘Ca’ is least strongly correlated with the type of glass. For some dataset there could be features with no correlation at all; then it might be a good idea to remove these since they will only function as noise.

The next step is building and training the actual classifier, which hopefully can accurately classify the data. With this we will be able to tell which type of glass an entry in the dataset belongs to, based on the features.

For this we need to split the dataset into a training set and a test set. With the training set we will train the classifier, and with the test set we will validate the accuracy of the classifier. Usually a 70 % / 30 % ratio is used when splitting into a training and test set, but this ratio should be chosen based on the size of the dataset. For example, if the dataset does not have enough entries, 30% of it might not contain all of the classes or enough information to properly function as a validation set.

Another important note is that the distribution of the different classes in both the training and the test set should be equal to the distribution in the actual dataset. For example, if you have a dataset with review-texts which contains 20% negative and 80% positive reviews, both the training and the test set should have this 20% / 80% ratio. The best way to do this, is to split the dataset into a training and test set **randomly**.

def get_train_test(df, y_col, ratio): mask = np.random.rand(len(df)) < ratio df_train = df[mask] df_test = df[~mask] Y_train = df_train[y_col].values Y_test = df_test[y_col].values del df_train[y_col] del df_test[y_col] X_train = df_train.values X_test = df_test.values return X_train, Y_train, X_test, Y_test y_col = 'Type' train_test_ratio = 0.7 X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col, train_test_ratio)

With the dataset splitted into training and test sets, we can start building a classification model. I will do this in a slightly different way as usual. The idea behind this is that, when we start with a new dataset, we don’t know which (type of) classifier will perform best on this dataset. Will it be a classifier like Decision Tree or Random Forest, or a classifier which uses a functional approach like Logistic Regression, a classifier which uses a statistical approach like Naive Bayes etc.?

Because we dont know this, we will try all types of classifiers first and later we can continue to optimize the best performing classifier of this inital batch of classifiers. For this we have to make an dictionary, which contains as *keys* the name of the classifiers and as *values *an instance of the classifiers.

dict_classifiers = { "Logistic Regression": LogisticRegression(), "Nearest Neighbors": KNeighborsClassifier(), "Linear SVM": SVC(), "Gradient Boosting Classifier": GradientBoostingClassifier(), "Decision Tree": tree.DecisionTreeClassifier(), "Random Forest": RandomForestClassifier(n_estimators = 18), "Neural Net": MLPClassifier(alpha = 1), "Naive Bayes": GaussianNB() }

Then we can iterate over this dictionary, and for each classifier:

- train the classifier with
`.fit(X_train, Y_train)`

- evaluate how the classifier performs on the training set with
`.score(X_train, Y_train)`

- evaluate how the classifier perform on the test set with
`.score(X_test, Y_test)`

. - keep track of how much time it takes to train the classifier with the time module.
- save the training score, the test score, and the training time into a dataframe called ‘df_results’.

no_classifiers = len(dict_classifiers.keys()) def batch_classify(X_train, Y_train, X_test, Y_test, verbose = True): df_results = pd.DataFrame(data=np.zeros(shape=(no_classifiers,4)), columns = ['classifier', 'train_score', 'test_score', 'training_time']) count = 0 for key, classifier in dict_classifiers.items(): t_start = time.clock() classifier.fit(X_train, Y_train) t_end = time.clock() t_diff = t_end - t_start train_score = classifier.score(X_train, Y_train) test_score = classifier.score(X_test, Y_test) df_results.loc[count,'classifier'] = key df_results.loc[count,'train_score'] = train_score df_results.loc[count,'test_score'] = test_score df_results.loc[count,'training_time'] = t_diff if verbose: print("trained {c} in {f:.2f} s".format(c=key, f=t_diff)) count+=1 return df_results

The reason why we keep track of the time it takes to train a classifier, is because in practice this is also an important indicator of whether or not you would like to use a specific classifier. If there are two classifier with similar results, but one of them takes much less time to train you probably want to use that one.

The `score()`

method simply return the result of the accuracy_score() method in the metrics module. This module, contains many methods for evualating classification or regression models and I can recommend you to spent some time to learn which metrics you can use to evaluate your model.

The classification_report method for example, calculates the precision, recall and f1-score for all of the classes in your dataset. If you are looking for ways to improve the accuracy of your classifier, or if you want to know why the accuracy is lower than expected, such detailed information about the performance of the classifier on the dataset can point you in the right direction.

The accuracy on the training set, accuracy on the test set, and the duration of the training is saved into the ‘df_results’ dataframe.

df_results = batch_classify(X_train, Y_train, X_test, Y_test) display(df_results.sort_values(by='test_score', ascending=False))

What we are doing feels like a brute force approach, where a large number of classifiers are build to see which one performs best. Although this is not particularly educational, it gives an idea which classifier will perform better for a particular dataset and which one will not. After that you can continue with the best (or top 3) classifier, and try to improve the results by tweaking the parameters of the classifier, or by adding more features to the dataset.

As we can see, the Gradient Boosting classifier performs the best for this dataset. Actually, classifiers like Random Forest and Gradient Boosting classification performs best for most datasets and challenges on Kaggle (That does not mean you should rule out all other classifiers).

For the ones who are interested in the theory behind these classifiers, scikit-learn has a pretty well written user guide. Some of these classifiers were also explained in previous posts, like the naive bayes classifier, logistic regression and support vector machines was partially explained in the perceptron blog.

The second dataset we will have a look at is the mushroom dataset, which contains data on edible vs poisonous mushrooms. In the dataset there are 8124 mushrooms in total (4208 edible and 3916 poisonous) described by 22 features each.

The big difference with the glass dataset is that these features don’t have a numerical, but a categorical value. Because this dataset contains categorical values, we need one extra step in the classification process, which is the encoding of these values.

filename_mushrooms = './data/mushrooms.csv' df_mushrooms = pd.read_csv(filename_mushrooms) display(df_mushrooms.head())

A fast way to find out what type of categorical data a dataset contains, is to print out the unique values of each column in this dataframe. In this way we can also see whether the dataset contains any missing values or redundant columns.

for col in df_mushrooms.columns.values: print(col, df_mushrooms[col].unique())

As we can see, there are 22 categorical features. Of these, the feature ‘veil-type’ only contains one value ‘p’ and therefore does not provide any added value for any classifier. The best thing to do is to remove this feature.

del df_mushrooms['veil-type']

Most classifier can only work with numerical data, and will raise an error when categorical values in the form of strings are used as input. Luckily scikit-learn contains the module LabelEncoder, which can be used to transform non-numerical values to numerical values. This is done by first fitting the LabelEncoder with all possible (unique) values and then transforming all values to numerical values.

(Both steps can also be done in one go with the fit_transform() method.)

def label_encode(df, columns): for col in columns: le = LabelEncoder() col_values_unique = list(df[col].unique()) le_fitted = le.fit(col_values_unique) col_values = list(df[col].values) le.classes_ col_values_transformed = le.transform(col_values) df[col] = col_values_transformed to_be_encoded_cols = df_mushrooms.columns.values label_encode(df_mushrooms, to_be_encoded_cols) display(df_mushrooms.head())

As we can see, the columns previously containing non-numerical values now contain numerical values and the dataset is ready for classification. Again, we will split the dataset into a 70% training set and a 30% test set and start training and validating a batch of the eight most used classifiers.

y_col = 'class' ratio = 0.7 X_train, Y_train, X_test, Y_test = get_train_test(df_mushrooms, y_col, ratio) df_results = batch_classify(X_train, Y_train, X_test, Y_test) display(df_results.sort_values(by='test_score', ascending=False))

As we can see, the accuracy of the classifiers for this dataset is actually also quiet high. For datasets, where this is not the case we can play around with the features in the dataset, add extra features from additional datasets or change the parameters of the classifiers in order to improve the accuracy.

In my opinion, the best way to master the scikit-learn library is to simply start coding with it. I hope this blog-post gave some insight into the working of scikit-learn library, but for the ones who need some more information, here are some useful links:

dataschool – machine learning with scikit-learn video series

Classification example using the iris dataset

Official scikit-learn documentation

]]>

Most tasks in Machine Learning can be reduced to classification tasks. For example, we have a medical dataset and we want to classify who has diabetes (positive class) and who doesn’t (negative class). We have a dataset from the financial world and want to know which customers will default on their credit (positive class) and which customers will not (negative class).

To do this, we can train a Classifier with a ‘training dataset’ and after such a Classifier is trained (we have determined its model parameters) and can accurately classify the training set, we can use it to classify new data (test set). If the training is done properly, the Classifier should predict the class probabilities of the new data with a similar accuracy.

There are three popular Classifiers which use three different mathematical approaches to classify data. Previously we have looked at the first two of these; Logistic Regression and the Naive Bayes classifier. Logistic Regression uses a functional approach to classify data, and the Naive Bayes classifier uses a statistical (Bayesian) approach to classify data.

Logistic Regression assumes there is some function which forms a correct model of the dataset (i.e. it maps the input values correctly to the output values). This function is defined by its parameters . We can use the gradient descent method to find the optimum values of these parameters.

The Naive Bayes method is much simpler than that; we do not have to optimize a function, but can calculate the Bayesian (conditional) probabilities directly from the training dataset. This can be done quiet fast (by creating a hash table containing the probability distributions of the features) but is generally less accurate.

Classification of data can also be done via a third way, by using a geometrical approach. The main idea is to find a line, or a plane, which can separate the two classes in their feature space. Classifiers which are using a geometrical approach are the Perceptron and the SVM (Support Vector Machines) methods.

Below we will discuss the Perceptron classification algorithm. Although Support Vector Machines is used more often, I think a good understanding of the Perceptron algorithm is essential to understanding Support Vector Machines and Neural Networks.

The Perceptron is a lightweight algorithm, which can classify data quiet fast. But it only works in the limited case of a linearly separable, binary dataset. If you have a dataset consisting of only two classes, the Perceptron classifier can be trained to find a linear hyperplane which seperates the two. If the dataset is not linearly separable, the perceptron will fail to find a separating hyperplane.

If the dataset consists of more than two classes we can use the standard approaches in multiclass classification (one-vs-all and one-vs-one) to transform the multiclass dataset to a binary dataset. For example, if we have a dataset, which consists of three different classes:

- In
**one-vs-all**, class I is considered as the positive class and the rest of the classes are considered as the negative class. We can then look for a separating hyperplane between class I and the rest of the dataset (class II and III). This process is repeated for class II and then for class III. So we are trying to find three separating hyperplanes; between class I and the rest of the data, between class II and the rest of the data, etc.

If the dataset consists of K classes, we end up with K separating hyperplanes. - In
**one-vs-one**, class I is considered as the positive class and each of the other classes is considered as the negative class; so first class II is considered as the negative class and then class III is is considered as the negative class. Then this process is repeated with the other classes as the positive class.

So if the dataset consists of K classes, we are looking for separating hyperplanes.

Although the one-vs-one can be a bit slower (there is one more iteration layer), it is not difficult to imagine it will be more advantageous in situations where a (linear) separating hyperplane does not exist between one class and the rest of the data, while it does exists between one class and other classes when they are considered individually. In the image below there is no separating line between the pear-class and the other two classes.

The algorithm for the Perceptron is similar to the algorithm of Support Vector Machines (SVM). Both algorithms find a (linear) hyperplane separating the two classes. The biggest difference is that the Perceptron algorithm will find **any** hyperplane, while the SVM algorithm uses a Lagrangian constraint* *to find the hyperplane which is optimized to have the **maximum margin**. That is, the sum of the squared distances of each point to the hyperplane is maximized. This is illustrated in the figure below. While the Perceptron classifier is satisfied if any of these seperating hyperplanes are found, a SVM classifier will find the green one , which has the maximum margin.

Another difference is; If the dataset is not linearly seperable [2] the perceptron will fail to find a separating hyperplane. The algorithm simply does not converge during its iteration cycle. The SVM on the other hand, can still find a maximum margin minimum cost decision boundary (a separating hyperplane which does not separate 100% of the data, but does it with some small error).

It is often said that the perceptron is modeled after neurons in the brain. It has input values (which correspond with the features of the examples in the training set) and one output value. Each input value is multiplied by a weight-factor . If the sum of the products between the feature value and weight-factor is larger than zero, the perceptron is activated and ‘fires’ a signal (+1). Otherwise it is not activated.

The weighted sum between the input-values and the weight-values, can mathematically be determined with the scalar-product . To produce the behaviour of ‘firing’ a signal (+1) we can use the signum function ; it maps the output to +1 if the input is positive, and it maps the output to -1 if the input is negative.

Thus, this Perceptron can mathematically be modeled by the function . Here is the bias, i.e. the default value when all feature values are zero.

The perceptron algorithm looks as follows:

class Perceptron(): """ Class for performing Perceptron. X is the input array with n rows (no_examples) and m columns (no_features) Y is a vector containing elements which indicate the class (1 for positive class, -1 for negative class) w is the weight-vector (m number of elements) b is the bias-value """ def __init__(self, b = 0, max_iter = 1000): self.max_iter = max_iter self.w = [] self.b = 0 self.no_examples = 0 self.no_features = 0 def train(self, X, Y): self.no_examples, self.no_features = np.shape(X) self.w = np.zeros(self.no_features) for ii in range(0, self.max_iter): w_updated = False for jj in range(0, self.no_examples): a = self.b + np.dot(self.w, X[jj]) if np.sign(Y[jj]*a) != 1: w_updated = True self.w += Y[jj] * X[jj] self.b += Y[jj] if not w_updated: print("Convergence reached in %i iterations." % ii) break if w_updated: print( """ WARNING: convergence not reached in %i iterations. Either dataset is not linearly separable, or max_iter should be increased """ % self.max_iter ) def classify_element(self, x_elem): return int(np.sign(self.b + np.dot(self.w, x_elem))) def classify(self, X): predicted_Y = [] for ii in range(np.shape(X)[0]): y_elem = self.classify_element(X[ii]) predicted_Y.append(y_elem) return predicted_Y

As you can see, we set the bias-value and all the elements in the weight-vector to zero. Then we iterate ‘max_iter’ number of times over all the examples in the training set.

Here, is the actual output value of each training example. This is either +1 (if it belongs to the positive class) or -1 (if it does not belong to the positive class).,

The activation function value is the predicted output value. It will be if the prediction is correct and if the prediction is incorrect. Therefore, if the prediction made (with the weight vector from the previous training example) is incorrect, will be -1, and the weight vector is updated.

If the weight vector is not updated after some iteration, it means we have reached convergence and we can break out of the loop.

If the weight vector was updated in the last iteration, it means we still didnt reach convergence and either the dataset is not linearly separable, or we need to increase ‘max_iter’.

We can see that the Perceptron is an online algorithm; it iterates through the examples in the training set, and for each example in the training set it calculates the value of the activation function and updates the values of the weight-vector.

Now lets examine the Perceptron algorithm for a linearly separable dataset which exists in 2 dimensions. For this we first have to create this dataset:

def generate_data(no_points): X = np.zeros(shape=(no_points, 2)) Y = np.zeros(shape=no_points) for ii in range(no_points): X[ii][0] = random.randint(1,9)+0.5 X[ii][1] = random.randint(1,9)+0.5 Y[ii] = 1 if X[ii][0]+X[ii][1] >= 13 else -1 return X, Y

In the 2D case, the perceptron algorithm looks like:

X, Y = generate_data(100) p = Perceptron() p.train(X, Y) X_test, Y_test = generate_data(50) predicted_Y_test = p.classify(X_test)

As we can see, the weight vector and the bias ( which together determine the separating hyperplane ) are updated when is not positive.

The result is nicely illustrated in this gif:

GIF

We can extend this to a dataset in any number of dimensions, and as long as it is linearly separable, the Perceptron algorithm will converge.

One of the benefits of this Perceptron is that it is a very ‘lightweight’ algorithm; it is computationally very fast and easy to implement for datasets which are linearly separable. But if the dataset is not linearly separable, it will not converge.

For such datasets, the Perceptron can still be used if the correct kernel is applied. In practice this is never done, and Support Vector Machines are used whenever a Kernel needs to be applied. Some of these Kernels are:

Linear: | |

Polynomial: | with |

Laplacian RBF: | |

Gaussian RBF: |

At this point, it will be too much to also implement Kernel functions, but I hope to do it at a next post about SVM. For more information about Kernel functions, a comprehensive list of kernels, and their source code, please click here.

**PS: The Python code for Logistic Regression can be forked/cloned from GitHub. **

In the previous blog we have seen the theory and mathematics behind the Logistic Regression Classifier.

Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even sports.

Reading all of this, the theory[1] of Logistic Regression Classification might look difficult. In my experience, the average Developer does not believe they can design a proper Logistic Regression Classifier from scratch. I strongly disagree: not only is the mathematics behind is relatively simple, it can also be implemented with a few lines of code.

I have done this in the past month, so I thought I’d show you how to do it. The code is in Python but it should be relatively easy to translate it to other languages. Some of the examples contain self-generated data, while other examples contain real-world (iris) data. As was also done in the blog-posts about the bag-of-words model and the Naive Bayes Classifier, we will also try to automatically classify the sentiments of Amazon.com book reviews.

We have seen that the technique to perform Logistic Regression is similar to regular Regression Analysis.

There is a function which maps the input values to the output and this function is completely determined by its parameters . So once we have determined the values with training examples, we can determine the class of any new example.

We are trying to estimate the feature values with the iterative Gradient Descent method. In the Gradient Descent method, the values of the parameters in the current iteration are calculated by updating the values of from the previous iteration with the gradient of the cost function .

In (regular) Regression this hypothesis function can be any function which you expect will provide a good model of the dataset. In Logistic Regression the hypothesis function is always given by the Logistic function:

.

Different cost functions exist, but most often the log-likelihood function known as binary cross-entropy (see equation 2 of previous post) is used.

One of its benefits is that the gradient of this cost function, turns out to be quiet simple, and since it is the gradient we use to update the values of this makes our work easier.

Taking all of this into account, this is how Gradient Descent works:

- Make an initial but intelligent guess for the values of the parameters .
- Keep iterating while the value of the cost function has not met your criteria*:
- With the current values of , calculate the gradient of the cost function ( ).
- Update the values for the parameters
- Fill in these new values in the hypothesis function and calculate again the value of the cost function;

*Usually the iteration stops when either the maximum number of iterations has been reached, or the error (the difference between the cost of this iteration and the cost of the previous iteration) is smaller than some minimum error value (0.001).

We have seen the self-generated example of students participating in a Machine Learning course, where their final grade depended on how many hours they had studied.

First, let’s generate the data:

import random import numpy as np num_of_datapoints = 100 x_max = 10 initial_theta = [1, 0.07] def func1(X, theta, add_noise = True): if add_noise: return theta[0]*X[0] + theta[1]*X[1]**2 + 0.25*X[1]*(random.random()-1) else: return theta[0]*X[0] + theta[1]*X[1]**2 def generate_data(num_of_datapoints, x_max, theta): X = np.zeros(shape=(num_of_datapoints, 2)) Y = np.zeros(shape=num_of_datapoints) for ii in range(num_of_datapoints): X[ii][0] = 1 X[ii][1] = (x_max*ii) / float(num_of_datapoints) Y[ii] = func1(X[ii], theta) return X, Y X, Y = generate_data(num_of_datapoints, x_max, initial_theta)

We can see that we have generated 100 points uniformly distributed over the -axis. For each of these – points the -value is determined by minus some random value.

On the left we can see a scatterplot of the datapoints and on the right we can see the same data with a curve fitted through the points. This is the curve we are trying to estimate with the Gradient Descent method. This is done as follows:

numIterations= 1000 alpha = 0.00000005 m, n = np.shape(X) theta = np.ones(n) theta = gradient_descent(X, Y, theta, alpha, m, numIterations) def gradient_descent(X, Y, theta, alpha, m, number_of_iterations): for ii in range(0,number_of_iterations): print "iteration %s : feature-value: %s" % (ii, theta) hypothesis = np.dot(X, theta) cost = sum([theta[0]*X[iter][0]+theta[1]*X[iter][1]-Y[iter] for iter in range(m)]) grad0 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][0]**2 for iter in range(m)]) grad1 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][1]**4 for iter in range(m)]) theta[0] = theta[0] - alpha * grad0 theta[1] = theta[1] - alpha * grad1 return theta

We can see that we have to calculate the gradient of the cost function times and update the feature values simultaneously! This indeed results in the curve we were looking for:

After this short example of Regression, lets have a look at a few examples of Logistic Regression. We will start out with a the self-generated example of students passing a course or not and then we will look at real world data.

Let’s generate some data points. There are students participating in the course Machine Learning and whether a student passes ( ) or not ( ) depends on two variables;

- : how many hours student has studied for the exam.
- : how many hours student has slept the day before the exam.

import random import numpy as np def func2(x_i): if x_i[1] <= 4: y = 0 else: if x_i[1]+x_i[2] <= 13: y = 0 else: y = 1 return y def generate_data2(no_points): X = np.zeros(shape=(no_points, 3)) Y = np.zeros(shape=no_points) for ii in range(no_points): X[ii][0] = 1 X[ii][1] = random.random()*9+0.5 X[ii][2] = random.random()*9+0.5 Y[ii] = func2(X[ii]) return X, Y X, Y = generate_data2(300)

In our example, the results are pretty binary; everyone who has studied less than 4 hours fails the course, as well as everyone whose studying time + sleeping time is less than or equal to 13 hours (). The results looks like this (the green dots indicate a pass and the red dots a fail):

We have a LogisticRegression class, which sets the values of the learning rate and the maximum number of iterations at its initialization. The values of X, Y are set when these matrices are passed to the “train()” function, and then the values of no_examples, no_features, and theta are determined.

import numpy as np class LogisticRegression(): """ Class for performing logistic regression. """ def __init__(self, learning_rate = 0.7, max_iter = 1000): self.learning_rate = learning_rate self.max_iter = max_iter self.theta = [] self.no_examples = 0 self.no_features = 0 self.X = None self.Y = None def add_bias_col(self, X): bias_col = np.ones((X.shape[0], 1)) return np.concatenate([bias_col, X], axis=1)

We also have the hypothesis, cost and gradient functions:

def hypothesis(self, X): return 1 / (1 + np.exp(-1.0 * np.dot(X, self.theta))) def cost_function(self): """ We will use the binary cross entropy as the cost function. https://en.wikipedia.org/wiki/Cross_entropy """ predicted_Y_values = self.hypothesis(self.X) cost = (-1.0/self.no_examples) * np.sum(self.Y * np.log(predicted_Y_values) + (1 - self.Y) * (np.log(1-predicted_Y_values))) return cost def gradient(self): predicted_Y_values = self.hypothesis(self.X) grad = (-1.0/self.no_examples) * np.dot((self.Y-predicted_Y_values), self.X) return grad

With these functions, the gradient descent method can be defined as:

def gradient_descent(self): for iter in range(1,self.max_iter): cost = self.cost_function() delta = self.gradient() self.theta = self.theta - self.learning_rate * delta print("iteration %s : cost %s " % (iter, cost))

These functions are used by the “train()” method, which first sets the values of the matrices X, Y and theta, and then calls the gradient_descent method:

def train(self, X, Y): self.X = self.add_bias_col(X) self.Y = Y self.no_examples, self.no_features = np.shape(X) self.theta = np.ones(self.no_features + 1) self.gradient_descent()

Once the values have been determined with the gradient descent method, we can use it to classify new examples:

def classify(self, X): X = self.add_bias_col(X) predicted_Y = self.hypothesis(X) predicted_Y_binary = np.round(predicted_Y) return predicted_Y_binary

Using this algorithm for gradient descent, we can correctly classify 297 out of 300 datapoints of our self-generated example (wrongly classified points are indicated with a cross).

Now that the concept of Logistic Regression is a bit more clear, let’s classify real-world data!

One of the most famous classification datasets is The Iris Flower Dataset. This dataset consists of three classes, where each example has four numerical features.

import pandas as pd to_bin_y = { 1: { 'Iris-setosa': 1, 'Iris-versicolor': 0, 'Iris-virginica': 0 }, 2: { 'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 0 }, 3: { 'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 1 } } #loading the dataset datafile = '../datasets/iris/iris.data' df = pd.read_csv(datafile, header=None) df_train = df.sample(frac=0.7) df_test = df.loc[~df.index.isin(df_train.index)] X_train = df_train.values[:,0:4].astype(float) y_train = df_train.values[:,4] X_test = df_test.values[:,0:4].astype(float) y_test = df_test.values[:,4] Y_train = np.array([to_bin_y[3][x] for x in y_train]) Y_test = np.array([to_bin_y[3][x] for x in y_test]) print("training Logistic Regression Classifier") lr = LogisticRegression() lr.train(X_train, Y_train) print("trained") predicted_Y_test = lr.classify(X_test) f1 = f1_score(predicted_Y_test, Y_test, 1) print("F1-score on the test-set for class %s is: %s" % (1, f1))

As you can see, our simple LogisticRegression class can classify the iris dataset with quiet a high accuracy:

training Logistic Regression Classifier iteration 1 : cost 8.4609605194 iteration 2 : cost 3.50586831057 iteration 3 : cost 3.78903735339 iteration 4 : cost 6.01488933456 iteration 5 : cost 0.458208317153 iteration 6 : cost 2.67703502395 iteration 7 : cost 3.66033580721 (...) iteration 998 : cost 0.0362384208231 iteration 999 : cost 0.0362289106001 trained F1-score on the test-set for class 1 is: 0.973225806452

For a full overview of the code, please have a look at GitHub.

Logistic Regression by using Gradient Descent can also be used for NLP / Text Analysis tasks. There are a wide variety of tasks which can are done in the field of NLP; autorship attribution, spam filtering, topic classification and sentiment analysis.

For a task like sentiment analysis we can follow the same procedure. We will have as the input a large collection of labelled text documents. These will be used to train the Logistic Regression classifier. The most important task then, is to select the proper features which will lead to the best sentiment classification. Almost everything in the text document can be used as a feature[2]; you are only limited by your creativity.

For sentiment analysis usually the occurence of (specific) words is used, or the relative occurence of words (the word occurences divided by the total number of words).

As we have done before, we have to fill in the and matrices, which will serve as an input for the gradient descent algorithm and this algorithm will give us the resulting feature vector . With this vector we can determine the class of other text documents.

As always is a vector with elements (where is the number of text-documents). The matrix is a by matrix; here is the total number of relevant words in all of the text-documents. I will illustrate how to build up this matrix with three book reviews:

**pos:**“This is such a beautiful edition of Harry Potter and the Sorcerer’s Stone. I’m so glad I bought it as a keep sake. The illustrations are just stunning.” (28 words in total)**pos:**“A brilliant book that helps you to open up your mind as wide as the sky” (16 words in total)**neg:**“This publication is virtually unreadable. It doesn’t do this classic justice. Multiple typos, no illustrations, and the most wonky footnotes conceivable. Spend a dollar more and get a decent edition.” (30 words in total)

These three reviews will result in the following -matrix.

As you can see, each row of the matrix contains all of the data per review and each column contains the data per word. If a review does not contain a specific word, the corresponding column will contain a zero. Such a -matrix containing all the data from the training set can be build up in the following manner:

Assuming that we have a list containing the data from the *training set*:

[ ([u'downloaded', u'the', u'book', u'to', u'my', ..., u'art'], 'neg'), ([u'this', u'novel', u'if', u'bunch', u'of', ..., u'ladies'], 'neg'), ([u'forget', u'reading', u'the', u'book', u'and', ..., u'hilarious!'], 'neg'), ... ]

From this *training_set*, we are going to generate a *words_vector*. This *words_vector* is used to keep track to which column a specific word belongs to. After this *words_vector* has been generated, the matrix and vector can filled in.

def generate_words_vector(training_set): words_vector = [] for review in training_set: for word in review[0]: if word not in words_vector: words_vector.append(word) return words_vector def generate_Y_vector(training_set, training_class): no_reviews = len(training_set) Y = np.zeros(shape=no_reviews) for ii in range(0,no_reviews): review_class = training_set[ii][1] Y[ii] = 1 if review_class == training_class else 0 return Y def generate_X_matrix(training_set, words_vector): no_reviews = len(training_set) no_words = len(words_vector) X = np.zeros(shape=(no_reviews, no_words+1)) for ii in range(0,no_reviews): X[ii][0] = 1 review_text = training_set[ii][0] total_words_in_review = len(review_text) for word in Set(review_text): word_occurences = review_text.count(word) word_index = words_vector.index(word)+1 X[ii][word_index] = word_occurences / float(total_words_in_review) return X words_vector = generate_words_vector(training_set) X = generate_X_matrix(training_set, words_vector) Y_neg = generate_Y_vector(training_set, 'neg')

As we have done before, the gradient descent method can be applied to derive the feature vector from the and matrices:

numIterations = 100 alpha = 0.55 m,n = np.shape(X) theta = np.ones(n) theta_neg = gradient_descent2(X, Y_neg, theta, alpha, m, numIterations)

What should we do if a specific review tests positive (Y=1) for more than one class? A review could result in Y=1 for both the *neu* class as well as the *neg* class. In that case we will pick the class with the highest score. This is called multinomial logistic regression.

So far, we have seen how to implement a Logistic Regression Classifier in its most basic form. It is true that building such a classifier from scratch, is great for learning purposes. It is also true that no one will get to the point of using deeper / more advanced Machine Learning skills without learning the basics first.

For real-world applications however, often the best solution is to not re-invent the wheel but to re-use tools which are already available. Tools which have been tested thorougly and have been used by plenty of smart programmers before you. One of such a tool is Python’s NLTK library.

NLTK is Python’s Natural Language Toolkit and it can be used for a wide variety of Text Processing and Analytics jobs like tokenization, part-of-speech tagging and classification. It is easy to use and even includes a lot of text corpora, which can be used to train your model if you have no training set available.

Let us also have a look at how to perform sentiment analysis and text classification with NLTK. As always, we will use a training set to train NLTK’s Maximum Entropy Classifier and a test set to verify the results. Our training set has the following format:

training_set = [ ([u'this', u'novel', u'if', u'bunch', u'of', u'childish', ..., u'ladies'], 'neg') ([u'where', u'to', u'begin', u'jeez', u'gasping', u'blushing', ..., u'fail????'], 'neg') ... ]

As you can see, the training set consists of a list of tuples of two elements. The first element is a list of the words in the text of the document and the second element is the class-label of this specific review (‘neg’, ‘neu’ or ‘pos’). Unfortunately NLTK’s Classifiers only accepts the text in a hashable format (dictionaries for example) and that is why we need to convert this list of words into a dictionary of words.

def list_to_dict(words_list): return dict([(word, True) for word in words_list]) training_set_formatted = [(list_to_dict(element[0]), element[1]) for element in training_set]

‘

Once the training set has been converted into the proper format, it can be feed into the train method of the MaxEnt Classifier:

import nltk numIterations = 100 algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0] classifier = nltk.MaxentClassifier.train(training_set_formatted, algorithm, max_iter=numIterations) classifier.show_most_informative_features(10)

Once the training of the MaxEntClassifier is done, it can be used to classify the review in the test set:

for review in test_set_formatted: label = review[1] text = review[0] determined_label = classifier.classify(text) print determined_label, label

So far we have seen the theory behind the Naive Bayes Classifier and how to implement it (in the context of Text Classification) and in the previous and this blog-post we have seen the theory and implementation of Logistic Regression Classifiers. Although this is done at a basic level, it should give some understanding of the Logistic Regression method (I hope at a level where you can apply it and classify data yourself). There are however still many (advanced) topics which have not been discussed here:

- Which hill-climbing / gradient descent algorithm to use; IIS (Improved Iterative Scaling), GIS (Generalized Iterative Scaling), BFGS, L-BFGS or Coordinate Descent
- Encoding of the feature vector and the use of dummy variables
- Logistic Regression is an inherently sequential algorithm; although it is quiet fast, you might need a parallelization strategy if you start using larger datasets.

If you see any errors please do not hesitate to contact me. If you have enjoyed reading, maybe even learned something, do not forget to subscribe to this blog and share it!

—

[1] See the paper of Nigam et. al. on Maximum Entropy and the paper of Bo Pang et. al. on Sentiment Analysis using Maximum Entropy. Also see Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models(1997), A brief MaxEnt tutorial, and another good MIT article.

[2] See for example Chapter 7 of Speech and Language Processing by (Jurafsky & Martin): For the task of period disambiguation a feature could be whether or not a period is followed by a capital letter unless the previous word is *St.*

One of the most important tasks in Machine Learning are the Classification tasks (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in a test set (a dataset of which the entries have not yet been labelled) with the model which was constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, classifying houses in the real-estate sector. Another field in which classification is big, is Natural Lanuage Processing (NLP). This goal of this field of science is to makes machines (computers) understand written (human) language. You could think of text categorization, sentiment analysis, spam detection and topic categorization.

For classification tasks there are three widely used algorithms; the Naive Bayes, Logistic Regression / Maximum Entropy and Support Vector Machines. We have already seen how the Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it has the assumption of conditional independence of its features. This is a simplification which makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to a lower accuracy. A direct improvement on the N.B. classifier, is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly. This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea, which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let’s say we have a dataset containing datapoints; . For each of these (input) datapoints there is a corresponding (output) -value. Here, the -datapoints are called the independent variables and the dependent variable; the value of depends on the value of , while the value of may be freely chosen without any restriction imposed on it by any other variable.

The goal of Regression analysis is to find a function which can best describe the correlation between and . In the field of Machine Learning, this function is called the hypothesis function and is denoted as .

If we can find such a function, we can say we have successfully built a Regression model. If the input-data lives in a 2D-space, this boils down to finding a curve which fits through the data points. In the 3D case we have to find a plane and in higher dimensions a hyperplane.

To give an example, let’s say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset which contains the final grade of students. Dataset contains the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time. The variable therefore indicates how many hours student has studied. The first thing we would do is visualize this data:

If the results looks like the figure on the left, then we are out of luck. It looks like the points are distributed randomly and there is no correlation between and at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.

This function could for example be:

or

where are the dependent parameters of our model.

In evaluating the results from the previous section, we may find the results unsatisfying; the function does not correlate with the datapoints strongly enough. Our initial assumption is probably not complete. Taking only the studying time into account is not enough. The final grade does not only depend on the studying time, but also on how much the students have slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by . In this dataset indicates how many hours student has studied and indicates how many hours he has slept.

This is an example of multivariate regression. The function has to include both variables. For example:

or

.

All of the above examples are examples of linear regression. We have seen that in some cases depends on a linear form of , but it can also depend on some power of , or on the log or any other form of . However, in all cases the parameters were linear.

So, what makes linear regression linear is not that depends in a linear way on , but that it depends in a linear way on . needs to be linear with respect to the model-parameters . Mathematically speaking it needs to satisfy the superposition principle. Examples of nonlinear regression would be:

or

The reason why the distinction is made between linear and nonlinear regression is that nonlinear regression problems are more difficult to solve and therefore more computational intensive algorithms are needed.

Linear regression models can be written as a linear system of equations, which can be solved by finding the closed-form solution with Linear Algebra. See these statistics notes for more on solving linear models with linear algebra.

As discussed before, such a closed-form solution can only be found for linear regression problems. However, even when the problem is linear in nature, we need to take into account that calculating the inverse of a by matrix has a time-complexity of . This means that for large datasets ( ) finding the closed-form solution will take more time than solving it iteratively (gradient descent method) as is done for nonlinear problems. So solving it iteratively is usually preferred for larger datasets, even if it is a linear problem.

The Gradient Descent method is a general optimization technique in which we try to find the value of the parameters with an iterative approach.

First, we construct a cost function (also known as loss function or error function) which gives the difference between the values of (the values you expect to have with the determined values of ) and the actual values of . The better your estimation of is, the better the values of will approach the values of .

Usually, the cost function is expressed as the squared error between this difference:

At each iteration we choose new values for the parameters , and move towards the ‘true’ values of these parameters, i.e. the values which make this cost function as small as possible. The direction in which we have to move is the negative gradient direction;

.

The reason for this is that a function’s value decreases the fastest if we move towards the direction of the negative gradient (the directional derivative is maximal in the direction of the gradient).

Taking all this into account, this is how gradient descent works:

- Make an initial but intelligent guess for the values of the parameters .
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of , calculate the gradient of the cost function J ( ).
- Update the values for the parameters
- Fill in these new values in the hypothesis function and calculate again the value of the cost function;

Just as important as the initial guess of the parameters is the value you choose for the learning rate . This learning rate determines how fast you move along the slope of the gradient. If the selected value of this learning rate is too small, it will take too many iterations before you reach your convergence criteria. If this value is too large, you might overshoot and not converge.

Logistic Regression is similar to (linear) regression, but adapted for the purpose of classification. The difference is small; for Logistic Regression we also have to apply gradient descent iteratively to estimate the values of the parameter . And again, during the iteration, the values are estimated by taking the gradient of the cost function. And again, the cost function is given by the squared error of the difference between the hypothesis function and . The major difference however, is the form of the hypothesis function.

When you want to classify something, there are a limited number of classes it can belong to. And for each of these possible classes there can only be two states for ;

either belongs to the specified class and , or it does not belong to the class and . Even though the output values are binary, the independent variables are still continuous. So, we need a function which has as input a large set of continuous variables and for each of these variables produces a binary output. This function, the hypothesis function, has the following form:

.

This function is also known as the logistic function, which is a part of the sigmoid function family. These functions are widely used in the natural sciences because they provide the simplest model for population growth. However, the reason why the logistic function is used for classification in Machine Learning is its ‘S-shape’.

As you can see this function is bounded in the y-direction by 0 and 1. If the variable is very negative, the output function will go to zero (it does not belong to the class). If the variable is very positive, the output will be one and it does belong to the class. (Such a function is called an indicator function.)

The question then is, what will happen to input values which are neither very positive nor very negative, but somewhere ‘in the middle’. We have to define a decision boundary, which separates the positive from the negative class. Usually this decision boundary is chosen at the middle of the logistic function, namely at where the output value is .

(1)

For those who are wondering where entered the picture that we were talking about before. As we can see in the formula of the logistic function, . Meaning, the dependent parameter (also known as the feature), maps the input variable to a position on the -axis. With its -value, we can use the logistic function to calculate the -value. If this -value we assume it does belong in this class and vice versa.

So the feature should be chosen such that it predicts the class membership correctly. It is therefore essential to know which features are useful for the classification task. Once the appropriate features are selected , gradient descent can be used to find the optimal value of these features.

How can we do gradient descent with this logistic function? Except for the hypothesis function having a different form, the gradient descent method is exactly the same. We again have a cost function, of which we have to iteratively take the gradient w.r.t. the feature and update the feature value at each iteration.

This cost function is given by

(2)

We know that:

and

(3)

Plugging these two equations back into the cost function gives us:

(4)

The gradient of the cost function with respect to is given by

(5)

So the gradient of the seemingly difficult cost function, turns out to be a much simpler equation. And with this simple equation, gradient descent for Logistic Regression is again performed in the same way:

- Make an initial but intelligent guess for the values of the parameters .
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of , calculate the gradient of the cost function J ( ).
- Update the values for the parameters
- Fill in these new values in the hypothesis function and calculate again the value of the cost function;

In the previous section we have seen how we can use Gradient Descent to estimate the feature values , which can then be used to determine the class with the Logistic function. As stated in the introduction, this can be used for a wide variety of classification tasks. The only thing that will be different for each of these classification tasks is the form the features take on.

Here we will continue to look at the example of Text Classification; Lets assume we are doing Sentiment Analysis and want to know whether a specific review should be classified as positive, neutral or negative.

The first thing we need to know is which and what types of features we need to include.

For NLP we will need a large number of features; often as large as the number of words present in the training set. We could reduce the number of features by excluding stopwords, or by only considering n-gram features.

For example, the 5-gram ‘kept me reading and reading’ is much less likely to occur in a review-document than the unigram ‘reading’, but if it occurs it is much more indicative of the class (positive) than ‘reading’. Since we only need to consider n-grams which actually are present in the training set, there will be much less features if we only consider n-grams instead of unigrams.

The second thing we need to know is the actual value of these features. The values are learned by initializing all features to zero, and applying the gradient descent method using the labeled examples in the training set. Once we know the values for the features, we can compute the probability for each class and choose the class with the maximum probability. This is done with the following Logistic function.

In this post we have discussed only the theory of Maximum Entropy and Logistic Regression. Usually such discussions are better understood with examples and the actual code. I will save that for the next blog.

If you have enjoyed reading this post or maybe even learned something from it, subscribe to this blog so you can receive a notification the next time something is posted.

Miles Osborne, Using Maximum Entropy for Sentence Extraction (2002)

Jurafsky and Martin, Speech and Language Processing; Chapter 7

Nigam et. al., Using Maximum Entropy for Text Classification

]]>

With the bag-of-words model we check which word of the text-document appears in a positive-words-list or a negative-words-list. If the word appears in a positive-words-list the total score of the text is updated with +1 and vice versa. If at the end the total score is positive, the text is classified as positive and if it is negative, the text is classified as negative. Simple enough!

With the Naive Bayes model, we do not take only a small set of positive and negative words into account, but all words the NB Classifier was trained with, i.e. all words presents in the training set. If a word has not appeared in the training set, we have no data available and apply Laplacian smoothing (use 1 instead of the conditional probability of the word).

The probability a document belongs to a class is given by the class probability multiplied by the products of the conditional probabilities of each word for that class.

Here is the number of occurences of word in class , is the total number of words in class and is the number of words in the document we are currently classifying.

does not change (unless the training set is expanded), so it can be placed outside of the product:

With this information it is easy to implement a Naive Bayes Text Classifier. (Naive Bayes can also be used to classify non-text / numerical datasets, for an explanation see this notebook).

We have a NaiveBayesText class, which accepts the input values for X and Y as parameters for the “train()” method. Here X is a list of lists, where each lower level list contains all the words in the document. Y is a list containing the label/class of each document.

class NaiveBaseClass: def calculate_relative_occurences(self, list1): no_examples = len(list1) ro_dict = dict(Counter(list1)) for key in ro_dict.keys(): ro_dict[key] = ro_dict[key] / float(no_examples) return ro_dict def get_max_value_key(self, d1): values = d1.values() keys = d1.keys() max_value_index = values.index(max(values)) max_key = keys[max_value_index] return max_key def initialize_nb_dict(self): self.nb_dict = {} for label in self.labels: self.nb_dict[label] = defaultdict(list) class NaiveBayesText(NaiveBaseClass): """" When the goal is classifying text, it is better to give the input X in the form of a list of lists containing words. X = [ ['this', 'is', 'a',...], (...) ] Y still is a 1D array / list containing the labels of each entry def initialize_nb_dict(self): self.nb_dict = {} for label in self.labels: self.nb_dict[label] = [] def train(self, X, Y): self.class_probabilities = self.calculate_relative_occurences(Y) self.labels = np.unique(Y) self.no_examples = len(Y) self.initialize_nb_dict() for ii in range(0,len(Y)): label = Y[ii] self.nb_dict[label] += X[ii] #transform the list with all occurences to a dict with relative occurences for label in self.labels: self.nb_dict[label] = self.calculate_relative_occurences(self.nb_dict[label])

As we can see, the training of the Naive Bayes Classifier is done by iterating through all of the documents in the training set. From all of the documents, a Hash table (dictionary in python language) with the relative occurence of each word per class is constructed.

This is done in two steps:

1. construct a huge list of all occuring words per class:

for ii in range(0,len(Y)): label = Y[ii] self.nb_dict[label] += X[ii]

2. calculate the relative occurence of each word in this huge list, with the “calculate_relative_occurences” method. This method simply uses Python’s Counter module to count how much each word occurs and then divides this number with the total number of words.

The result is saved in the dictionary *nb_dict*.

As we can see, it is easy to train the Naive Bayes Classifier. We simply calculate the relative occurence of each word per class, and save the result in the “nb_dict” dictionary.

This dictionary can be updated, saved to file, and loaded back from file. It contains the results of Naive Bayes Classifier training.

Classifying new documents is also done quite easily by calculating the class probability for each class and then selecting the class with the highest probability.

def classify_single_elem(self, X_elem): Y_dict = {} for label in self.labels: class_probability = self.class_probabilities[label] nb_dict_features = self.nb_dict[label] for word in X_elem: if word in nb_dict_features.keys(): relative_word_occurence = nb_dict_features[word] class_probability *= relative_word_occurence else: class_probability *= 0 Y_dict[label] = class_probability return self.get_max_value_key(Y_dict) def classify(self, X): self.predicted_Y_values = [] n = len(X) for ii in range(0,n): X_elem = X[ii] prediction = self.classify_single_elem(X_elem) self.predicted_Y_values.append(prediction) return self.predicted_Y_values

In the next blog we will look at the results of this naively implemented algorithm for the Naive Bayes Classifier and see how it performs under various conditions; we will see the influence of varying training set sizes and whether the use of n-gram features will improve the accuracy of the classifier.

]]>- To keep track of the number of occurences of each word, we tokenize the text and add each word to a single list. Then by using a Counter element we can keep track of the number of occurences.
- We can make a DataFrame containing the class probabilities of each word by adding each word to the DataFrame as we encounter it and dividing it by the total number of occurences afterwards.
- Sorting this DataFrame by the values in the columns of the Positive or Negative class, then taking the top 100 / 200 words we can construct a list containing negative or positive words.
- These words in this constructed Sentiment Lexicon can be used to give a value to the subjectivity of the reviews in the test set.

Using the steps described above, we were able to determine the subjectivity of reviews in the test set with an accuracy (F-score) of ~60%.

In this blog we will look into the effectiveness of cross-book sentiment lexicons; how well does a sentiment lexicon made from book A perform at sentiment analysis of book B?

We will also see how we can improve the bag-of-words technique by including n-gram features in the bag-of-words.

In the previous post, we have seen that the sentiment of reviews in the test-set of ‘Gone Girl’ could be predicted with a 60% accuracy. How well does the sentiment lexicon derived from the training set of book A perform at deducing the sentiment of reviews in the test set of book B?

In the table above, we can see that the most effective Sentiment Lexicons are created from books with a large amount of Positive ánd Negative reviews. In the previous post we saw that Fifty Shades of Grey has a large amount of negative reviews. This makes it a good book to construct an effective Sentiment Lexicon from.

Other books have a lot of positive reviews but only a few negative ones. The Sentiment Lexicon constructed from these books has a high accuracy in determining the sentiment of positive reviews, but a low accuracy for negative reviews… bringing the average down.

In the previous blog-post we had constructed a bag-of-words model with unigram features. Meaning that we split the entire text in single words and count the occurence of each word. Such a model does not take the position of each word in the sentence, its context and the grammar into account. That is why, the bag-of-words model has a low accuracy in detecting the sentiment of a text document.

For example, with the bag-of-words model the following two sentences will be given the same score:

1. “This is not a good book” –> 0 + 0 + 0 + 0 + 1 + 0 –> positive

2. “This is a very good book” –> 0 + 0 +0 +0 +1 + 0 –> positive

If we include features consisting of two or three words, this problem can be avoided; “not good” and “very good” will be two different features with different subjectivity scores. The biggest reason why bigram or trigram features are not used more often is that the number of possible combinations of words increases exponentially with the number of words. Theoretically, a document with 2.000 words can have 2.000 possible unigram features, 40.000 possible bigram features and 8.000.000.000 possible trigram features.

However, if we consider this problem from a pragmatic point of view we can say that most of the combinations of words which can be made, are grammatically not possible, or do not occur with a significant amount and hence don’t need to be taken into account.

Actually, we only need to define a small set of words (prepisitions, conjunctions, interjections etc) of which we know it changes the meaning of the words following it and/or the rest of the sentence. I we encounter such a ‘ngram word’, we do not split the sentence but split it after the next word. In this way we will construct ngram features consisting of the specified words and the words directly following them. Some examples of such words are:

In the previous post, we had seen that the code to construct a DataFrame containing the class probabilities of words in the training set is:

from sets import Set import pandas as pd BOW_df = pd.DataFrame(0, columns=scores, index='') words_set = Set() for review in training_set: score = review['score'] text = review['review_text'] splitted_text = split_text(text) for word in splitted_text: if word not in words_set: words_set.add(word) BOW_df.loc[word] = [0,0,0,0,0] BOW_df.ix[word][score] += 1 else: BOW_df.ix[word][score] += 1

If we also want to include ngrams in this class probability DataFrame, we need to include a function which generates n-grams from the splitted text and the list of specified ngram words:

(...) splitted_text = split_text('text') text_with_ngrams = generate_ngrams(splitted_text, ngram_words) for word in text_with_ngrams: (...)

There are a few conditions this “generate_ngrams” function needs to fulfill:

- When it iterates through the splitted text and encounters a ngram-word, it needs to concatenate this word with the next word. So [“I”,”do”,”not”,”recommend”,”this”,”book”] needs to become [“I”, “do”, “not recommend”, “this”, “book”]. At the same time it needs to skip the next iteration so the next word does not appear two times.
- It needs to be recursive: we might encounter multiple ngram words in a row. Then all of the words needs to be concatenated into a single ngram. So [“This”,”is”,”a”,”very”,”very”, “good”,”book”] needs to be concatenated in [“This”,”is”,”a”,”very very good”, “book”]. If n words are concatenated together into a single n-gram, the next n iterations need to be skipped.
- In addition to concatenating words with the words following it, it might also be interesting if we concatenating it with the word preceding it. For example, forming n-grams including the word “book” and its preceding words leads to features like “worst book”, “best book”, “fascinating book” etc…

Now that we know this information, lets have a look at the code:

def generate_ngrams(text, ngram_words): new_text = [] index = 0 while index < len(text): [new_word, new_index] = concatenate_words(index, text, ngram_words) new_text.append(new_word) index = new_index+1 if index!= new_index else index+1 return new_text def concatenate_words(index, text, ngram_words): word = text[index] if index == len(text)-1: return word, index if word in bigram_array: [word_new, new_index] = concatenate_words(index+1, text, ngram_words) word = word + ' ' + word_new index = new_index return word, index

Here concatenate_words is a recursive function which either returns the word at the index position in the array, or the word concatenated with the next word. It also return the index so we know how many iterations need to be skipped.

This function will also work if we want to append words to its previous words. Then we simply need to pass the reversed text to it `text = list(reversed(text))`

and concatenate it in reversed order: ` word = word_new + ' ' + word`

.

We can put this information together in a single function, which can either concatenate with the next word or with the previous word, depending on the value of the parameter ‘forward’:

def generate_ngrams(text, ngram_words, forward = True): new_text = [] index = 0 if not forward: text = list(reversed(text)) while index < len(text): [new_word, new_index] = concatenate_words(index, text, ngram_words, forward) new_text.append(new_word) index = new_index+1 if index!= new_index else index+1 if not forward: return list(reversed(new_text)) return new_text def concatenate_words(index, text, ngram_words, forward): words = text[index] if index == len(text)-1: return words, index if words.split(' ')[0] in bigram_array: [new_word, new_index] = concatenate_words(index+1, text, ngram_words, forward) if forward: words = words + ' ' + new_word else: words = new_word + ' ' + words index = index_new return words, index

Using this simple function to concatenate words in order to form n-grams, will lead to features which strongly correlate with a specific (Negative/Positive) class like ‘highly recommend’, ‘best book’ or even ‘couldn’t put it down’.

Now that we have a better understanding of Text Classification terms like bag-of-words, features and n-grams, we can start using Classifiers for Sentiment Analysis. Think of Naive Bayes, Maximum Entropy and SVM.

]]>In my previous post I have explained the Theory behind three of the most popular Text Classification methods (Naive Bayes, Maximum Entropy and Support Vector Machines) and told you that I will use these Classifiers for the automatic classification of the subjectivity of Amazon.com book reviews.

The purpose is to get a better understanding of how these Classifiers work and perform under various conditions, i.e. do a comparative study about Sentiment Analytics.

In this blog-post we will use the bag-of-words model to do Sentiment Analysis. The bag-of-words model can perform quiet well at Topic Classification, but is inaccurate when it comes to Sentiment Classification. Bo Pang and Lillian Lee report an accuracy of 69% in their 2002 research about Movie review sentiment analysis. With the three Classifiers this percentage goes up to about 80% (depending on the chosen feature).

The reason to still make a bag-of-words model is that it gives us a better understanding of the content of the text and we can use this to select the features for the three classifiers. The Naive Bayes model is also based on the bag-of-words model, so the bag-of-words model can be used as an intermediate step.

We can collect book reviews from Amazon.com by scraping them from the website with BeautifulSoup. The process for this was already explained in the context of Twitter.com and it should not be too difficult to do the same for Amazon.com.

In total 213.335 book reviews were collected for eight randomly chosen books:

After making a bar-plot of the distribution of the different stars for the chosen books, we can we that there is a strong variation. Books which are considered to be average have almost no 1-star ratings while books far below average have a more uniform distribution of the different ratings.

We can see that the book ‘Gone Girl’ has a pretty uniform distribution so it seems like a good choice for our training set. Books like ‘Unbroken’ or ‘The Martian’ might not have enough 1-star reviews to train for the Negative class.

As the next step, we are going to divide the corpus of reviews into a training set and a test set. The book ‘Gone Girl’ has about 40.000 reviews, so we can use *up to* half of it for training purposes and the other half for testing the accuracy of our model. In order to also take into account the effects of the training set size on the accuracy of our model, we will vary the training set size from 1.000 up to 20.000.

The bag-of-words model is one of the simplest language models used in NLP. It makes an unigram model of the text by keeping track of the number of occurences of each word. This can later be used as a features for Text Classifiers. In this bag-of-words model you only take individual words into account and give each word a specific subjectivity score. This subjectivity score can be looked up in a sentiment lexicon[1]. If the total score is negative the text will be classified as negative and if its positive the text will be classified as positive. It is simple to make, but is less accurate because it does not take the word order or grammar into account.

A simple improvement on using unigrams would be to use unigrams + bigrams. That is, not split a sentence after words like “not”,”no”,”very”, “just” etc. It is easy to implement but can give significant improvement to the accuracy. The sentence “This book is not good” will be interpreted as a positive sentence, unless such a construct is implemented. Another example is that the sentences “This book is very good” and “This book is good” will have the same score with a unigram model of the text, but not with an unigram + bigram model.

My pseudocode for creating a bag-of-words model is as follows:

*list_BOW*= []- For each review in the training set:
- Strip the newline charachter “n” at the end of each review.
- Place a space before and after each of the following characters: .,()[]:;” (This prevents sentences like “I like this book.It is engaging” being interpreted as [“I”, “like”, “this”, “book.It”, “is”, “engaging”].)
- Tokenize the text by splitting it on spaces.
- Remove tokens which consist of only a space, empty string or punctuation marks.
- Append the tokens to list_BOW.

*list_BOW*now contains all words occuring in the training set.- Place
*list_BOW*in a Python Counter element. This counter now contains all occuring words together with their frequencies. Its entries can be sorted with the most_common() method.

The real question is, how we should determine the sentiment/subjectivity score of each word in order to determine the total subjectivity score of the text. We can use one of the sentiment lexicons given in [1], but we dont really know in which circumstances and for which purposes these lexicons are created. Furthermore, in most of these lexicons the words are classified in a binary way (either positive or negative ). Bing Liu’s sentiment lexicon for example contains a list of a few thousands positive and a few thousand negative words.

Bo Pang and Lillian Lee used words which were chosen by two student as positive and negative words.

It would be better if we determine the subjectivity score of each word using some simple statistics of the training set. To do this we need to determine the class probability of each word present in the bag-of-words. This can be done by using pandas dataframe as a datacontainer (but can just as easily be done with dictionaries or other data structures). The code for this looks like:

from sets import Set import pandas as pd BOW_df = pd.DataFrame(0, columns=scores, index='') words_set = Set() for review in training_set: score = review['score'] text = review['review_text'] splitted_text = split_text(text) for word in splitted_text: if word not in words_set: words_set.add(word) BOW_df.loc[word] = [0,0,0,0,0] BOW_df.ix[word][score] += 1 else: BOW_df.ix[word][score] += 1

Here `split_text`

is the method for splitting a text into a list of individual words:

def expand_around_chars(text, characters): for char in characters: text = text.replace(char, &quot; &quot;+char+&quot; &quot;) return text def split_text(text): text = strip_quotations_newline(text) text = expand_around_chars(text, '&quot;.,()[]{}:;') splitted_text = text.split(&quot; &quot;) cleaned_text = [x for x in splitted_text if len(x)&gt;1] text_lowercase = [x.lower() for x in cleaned_text] return text_lowercase

This gives us a DataFrame containing of the number of occurances of each word in each class:

Unnamed: 0 1 2 3 4 5 0 i 4867 5092 9178 14180 17945 1 through 210 232 414 549 627 2 all 499 537 923 1355 1791 3 drawn-out 1 0 1 1 0 4 , 4227 4779 8750 15069 18334 5 detailed 3 7 15 30 36 ... ... ... ... ... ... ... 31800 a+++++++ 0 0 0 0 1 31801 nailbiter 0 0 0 0 1 31802 melinda 0 0 0 0 1 31803 reccomend! 0 0 0 0 1 31804 suspense!! 0 0 0 0 1 [31804 rows x 6 columns]

As we can see there are also quiet a few words which only occur one time. These words will have a class probability of 100% for the class they are occuring in.

This distribution however, does not approximate the real class distribution of that word at all. It is therefore good to define some ‘occurence cut off value’; words which occur less than this value are not taken into account.

By dividing each element of each row by the sum of the elements of that row we will get a DataFrame containing the relative occurences of each word in each class, i.e. a DataFrame with the class probabilities of each word. After this is done, the words with the highest probability in class 1 can be taken as negative words and words with the highest probability in class 5 can be taken as positive words.

We can construct such a sentiment lexicon from the training set and use it to measure the subjectivity of reviews in the test set. Depending on the size of the training set, the sentiment lexicon becomes more accurate for prediciton.

By labeling 4 and 5-star reviews as Positive, 1 and 2-star reviews as Negative and 3 star reviews as Neutral and using the following positive and negative word:

we can determine with the bag-of-words model whether a review is positive or negative with a 60% accuracy .

- How accurate is this list of positive and negative words constructed from the reviews of book A in determening the subjectivity of book B reviews.
- How much more accurate will the bag-of-words model become if we take bigrams or even trigrams into account? There were words with a high negative or positive subjectivity in the word-list which do not have a negative or positive meaning by themselves. This can only be understood if you take the preceding or following words into account.
- Make an overall sentiment lexicon from all the reviews of all the books.
- Use the bag-of-words as features for the Classifiers; Naive Bayes, Maximum Entropy and Support Vector Machines.

]]>

If it is done wrong, it can be boring not grabbing the attention of the readers, or even worse; convey the wrong message.

If it done correctly, it can intrigue even the most indifferent reader (some people can even turn Data Visualizations into an art form).

I personally think Python’s matplotlib is a great library for data visualization. Another amazing library is D3, which is very intuitive and flexible like matplotlib. In addition to that it is a javascript library so it works in the browser, which makes it is platform independent and you dont have to install any software. Did I already tell you D3 is a.. maa.. zing!?

That is why, I will focus on Data Visualizations with D3 in the future. But for now, I will start with something simpler and show you how to make a choropleth map. This is the kind of map you see at every election, where each state is colored in the color of the winning party. Although it might seem difficult to make such a map… it is not.

First thing we need is a map of a country (or the area we want to visualize) in SVG form. Wikipedia has a nice collection of blank maps we can use. Copy the code of this map in a <div> element of a basic html page. As an example, we can take this map of the world.

Another thing we need to include is the jQuery library, so go ahead and link to the latest jQuery version hosted by google like this:

If we open the page now, we should see the map drawn out.

As we can see in the code, each <path> element has its own id. The code for Australia for example looks like:

Sometimes we might be lucky and this id will actually be equal to the name of the state/country and sometimes it might be a random number/word. If that is the case, dont lose any sleep over it. It is not very difficult to discover which element belongs to which country with Chrome Developer Tools (right click on the country and then click on ‘inspect element’).

Now that we know the id of the country we want to color in, we can give it a color with the javascript code:

$(“#path6235”).css(‘fill’,’red’);

Now we need some data to fill in the map. Since the war in Syria / the Syrian refugee crisis is a current issue, it might be interesting to see which countries are donating the most / least to the Syrian crisis. The data for this can be found on this website. We could chose to color based on the absolute amount of money, but it does seem more fair to look at this donated amount relative to the countries’ GDP.

If we divide the donated amount by the GDP of that country of that year, we will get this data. Now we only need to put it in the correct format, which is JSON.

In our example, the correct data in the correct format looks like:

The complete dataset can be downloaded from here. In this file each number indicates the donated amount as 1/1000th percentage of the annual GDP. Go ahead and place the data in a <script> tag so that it can be accessed by JavaScript.

You can check whether or not the data is recognized by the browser, by executing * console.log(data["Switzerland"])* within a <script> tag. This should print the data for Switzerland in the console of the browser:

Now the entire map can be filled in with a javascript function which iterates through the variable containing the data:

` `

With the correct colors filled in, the map looks like:

In the above map all of the countries with no donations for the Syrian crisis (in 2015) are colored red. The countries which have donated money are colored in based on a blue to green gradient, where blue indicates a relative low and green indicates a relative large donation (Russia ~ 0.5 / 1000 % of GDP and Canada ~ 11 / 1000 % of their GDP).

If you are interested, you can download the entire html file from here.

Now that I have covered the basics of making a choropleth map, I want to address the issue that the way you choose to visualize your data can have a huge impact on the message your visualization is conveying.

If the countries with no donated money were left untouched, the first impression of the visualization would be there is no data available on these countries.

Choosing the gradient scale from red to green instead of blue to green, conveys the message that the countries colored with red have done something bad (red is associated with danger).

Although I think everybody can donate more, I would not want to give the impression that Brazil has done something bad by donating ‘only’ 5.000.000 USD.

]]>

Natural Language Processing (NLP) is a vast area of Computer Science that is concerned with the interaction between Computers and Human Language[1].

Within NLP many tasks are – or can be reformulated as – classification tasks. In classification tasks we are trying to produce a classification function which can give the correlation between a certain ‘feature’ and a class . This Classifier first has to be trained with a training dataset, and then it can be used to actually classify documents. Training means that we have to determine its model parameters. If the set of training examples is chosen correctly, the Classifier should predict the class probabilities of the actual documents with a similar accuracy (as the training examples).

After construction, such a Classifier could for example tell us that document containing the words “Bose-Einstein condensate” should be categorized as a Physics article, while documents containing the words “Arbitrage” and “Hedging” should be categorized as a Finance article.

Another Classifier could tell us that mails starting with “Dear Customer/Guest/Sir” (instead of your name) and containing words like “Great opportunity” or “one-time offer” can be classified as spam.

Here we can already see two uses of classification models: *topic classification* and *spam filtering*. For these purposes a Classifiers work quiet well and perform better than most trained professionals.

A third usage of Classifiers is Sentiment Analysis. Here the purpose is to determine the subjective value of a text-document, i.e. how positive or negative is the content of a text document. Unfortunately, for this purpose these Classifiers fail to achieve the same accuracy. This is due to the subtleties of human language; sarcasm, irony, context interpretation, use of slang, cultural differences and the different ways in which opinion can be expressed (subjective vs comparative, explicit vs implicit).

In this blog I will discuss the theory behind three popular Classifiers (Naive Bayes, Maximum Entropy and Support Vector Machines) in the context of Sentiment Analysis[2]. In the next blog I will apply this gained knowledge to automatically deduce the sentiment of collected Amazon.com book reviews.

The contents of this blog-post is as follows:

- Basic concepts of text classification:
- Tokenization
- Word normalization
- bag-of-words model
- Classifier evaluation

- Naieve Bayesian Classifier
- Maximum Entropy Classifier
- Support Vector Machines
- What to Expect

Tokenization is the name given to the process of chopping up sentences into smaller pieces (words or tokens). The segmentation into tokens can be done with decision trees, which contains information to correctly solve the issues you might encounter. Some of these issues you would have to consider are:

- The choice for the delimiter will for most cases be a whitespace (“We’re going to Barcelona” -> [“We’re”, “going”, “to”, “Barcelona.”]), but what should you do when you come across words with a white space in them (“We’re going to The Hague.”->[“We’re”, “going”,”to”,”The”, “Hague”]).
- What should you do with punctuation marks? Although many tokenizers are geared towards throwing punctuation away, for Sentiment analysis a lot of valuable information could be deduced from them.
**!**puts extra emphasis on the negative/positive sentiment of the sentence, while**?**can mean uncertainty (no sentiment). - “, ‘ , [], () can mean that the words belong together and should be treated as a separate sentence. Same goes for words which are
**bold**,*italic*,__underlined__, or inside a link. If you also want to take these last elements into considerating, you should scrape the html code and not just the text.

**Word Normalization **is the reduction of each word to its base/stem form (by chopping of the affixes). While doing this, we should consider the following issues:

- Capital letters should be normalized to lowercase, unless it occurs in the middle of a sentence; this could indicate the name of a writer, place, brand etc.
- What should be done with the apostrophe (‘); “George’s phone” should obviously be tokenized as “George” and “phone”, but I’m, we’re, they’re should be translated as I am, we are and they are. To make it even more difficult; it can also be used as a quotation mark.
- Ambigious words like High-tech, The Hague, P.h.D., USA, U.S.A., US and us.

After the text has been segmented into sentences, each sentence has been segmented into words, the words have been tokenized and normalized, we can make a simple bag-of-words model of the text. In this bag-of-words representation you only take individual words into account and give each word a specific subjectivity score. This subjectivity score can be looked up in a sentiment lexicon[7]. If the total score is negative the text will be classified as negative and if its positive the text will be classified as positive.

For determining the accuracy of a single Classifier, or comparing the results of different Classifier, the F-score is usually used. This F-score is given by

where is the precision and is the recall. The precision is the number of correctly classified examples divided by the total number of classified examples. The recall is the number of correctly classified examples divided by the actual number of examples in the training set.

Naive Bayes [3] classifiers are studying the classification task from a Statistical point of view. The starting point is that the probability of a class is given by the posterior probability given a training document . Here refers to all of the text in the entire training set. It is given by , where is the attribute (word) of document .

Using Bayes’ rule, this posterior probability can be rewritten as:

Since the marginal probability is equal for all classes, it can be disregarded and the equation becomes:

The document belongs to the class which maximizes this probability, so:

Assuming conditional independence of the words , this equation simplifies to:

Here is the conditional probability that word i belongs to class . For the purpose of text classification, this probability can simply be calculated by calculating the frequency of word in class relative to the total number of words in class .

We have seen that we need to multiply the class probability with all of the prior-probabilities of the individual words belonging to that class. The question then is, how do we know what the prior-probabilities of the words are? Here we need to remember that this is a supervised machine learning algorithm: we can estimate the prior-probabilities with a training set with documents that are already labeled with their classes. With this training set we can train the model and obtain values for the prior probabilities. This trained model can then be used for classifying unlabeled documents.

This is relatively easy to understand with an example. Lets say we have counted the number of words in a set of labeled training documents. In this set each text document has been labeled as either Positive, Neutral or as Negative. The result will then look like :

From this table we can already deduce each of the class probabilites:

,

,

.

If we look at the sentence “This blog-post is awesome.”, then the probabilities for this sentence belonging to a specific class are:

This sentence can thus be classified in the positive category.

The principle behind Maximum Entropy [4] is that the correct distribution is the one that maximizes the Entropy / uncertainty and still meets the constraints which are set by the ‘evidence’.

Let me explain this a bit more. In Information Theory, the word Entropy is used as a unit of measure for the unpredictability of the content of information. If you would throw a fair dice, each of the six outcomes have the same probability of occuring (1/6). Therefore you have maximum uncertainty; an entropy of 1. If the dice is weighted you already know one of the six outcomes has a higher probability of occuring and the uncertainty becomes less. If the dice is weighted so much that the outcome is always six, there is zero uncertainty in the outcome and hence the information entropy is also zero.

The same applies to letters in a word (or words in a sentence): if you assume that every letter has the same probability of occuring you have maximum uncertainty in predicting the next letter. But if you know that letters like E, A, O or I have a higher probability of occuring you have less uncertainty.

Knowing this, we can say that complex data has a high entropy, patterns and trends have lower entropy, information you know for a fact to be true has zero entropy (and therefore can be excluded).

The idea behind Maximum Entropy is that you want a model which is as unbiased as possible; events which are not excluded by known constraints should be assigned as much uncertainty as possible, meaning the probability distribution should be as uniform as possible. You are looking for the maximum value of the Entropy. If this is not entirely clear, I recommend you to read through this example.

The mathematical formula for Entropy is given by , so the most likely probability distribution is the one that maximizes this entropy:

It can be shown that the probability distribution has an exponential form and hence is given by:

,

where is a feature function, is the weight parameter of the feature function and is a normalization factor given by

.

This feature function is an indicator function, which is expresses the expected value of the chosen statistics (words) in the training set. These feature functions can then be taken as constraints for the classification of the actual dataset (by eliminating the probability distributions which do not fit with these constraints).

Usually, the weight parameters are automatically determined by the Improved Iterative Scaling algorithm. This is simply a gradient descent function which can be iterated over until it converges to the global maximum. The pseudocode for the this algorithm is as follows:

- Initialize all weight parameters to zero.
- Repeat until convergence:
- calculate the probability distribution with the weight parameters filled in.
- for each parameter calculate . This is the solution to:

- update the value for the weight parameter:

In step 2b is given by the sum of all features in the training dataset d:

Maximum Entropy is a general statistical classification algorithm and can be used to estimate any probability distribution. For the specific case of text classification, we can limit its form a bit more by using word counts as features:

(1)

Although it is not immediatly obvious from the name, the SVM algorithm is a ‘simple’ linear classification/regression algorithm[6]. It tries to find a hyperplane which seperates the data in two classes as optimally as possible.

Here as optimally as possible means that as much points as possible of label A should be seperated to one side of the hyperplane and as points of label B to the other side, while maximizing the distance of each point to this hyperplane.

In the image above we can see this illustrated for the example of points plotted in 2D-space. The set of points are labeled with two categories (illustrated here with black and white points) and SVM chooses the hypeplane that maximizes the margin between the two classes. This hyperplane is given by

where is a n-dimensional input vector, is its output value, is the weight vector (the normal vector) defining the hyperplane and the terms are the Lagrangian multipliers.

Once the hyperplane is constructed (the vector is defined) with a training set, the class of any other input vector can be determined:

if then it belongs to the positive class (the class we are interested in), otherwise it belongs to the negative class (all of the other classes).

We can already see this leads to two interesting questions:

1. SVM only seems to work when the two classes are linearly separable. How can we deal with non-linear datasets? Here I feel the urge to point out that the Naive Bayes and Maximum Entropy are linear classifiers as well and most text documents will be linear. Our training example of Amazon book reviews will be linear as well. But an explanation of the SVM system will not be complete without an explanation of Kernel functions.

2. SVM only seems to be able to separate the dataset into two classes? How can we deal with datasets with more than two classes. For Sentiment Classification we have for example three classes (positive, neutral, negative) and for Topic Classification we can have even more than that.

**Kernel Functions:
**The classical SVM system requires that the dataset is linearly separable, i.e. there is a single hyperplane which can separate the two classes. For non-linear datasets a Kernel function is used to map the data to a higher dimensional space in which it is linearly separable. This video gives a good illustation of such a mapping. In this higher dimensional feature space, the classical SVM system can then be used to construct a hyperplane.

**Multiclass classification:**

The classical SVM system is a binary classifier, meaning that it can only separate the dataset into two classes. To deal with datasets with more than two classes usually the dataset is reduced to a binary class dataset with which the SVM can work. There are two approaches for decomposing a multiclass classification problem to a binary classification problem: the one-vs-all and one-vs-one approach.

In the one-vs-all approach one SVM Classifier is build per class. This Classifier takes that one class as the positive class and the rest of the classes as the negative class. A datapoint is then only classified within a specific class if it is accepted by that Class’ Classifier and rejected by all other classifiers. Although this can lead to accurate results (if the dataset is clustered), a lot of datapoints can also be left unclassified (if the dataset is not clustered).

In the one-vs-one approach, you build one SVM Classifier per chosen pair of classes. Since there are possible pair combinations for a set of N classes, this means you have to construct more Classifiers. Datapoints are then categorized in the class for which they have received the most points.

In our example, there are only three classes (positive, neutral, negative) so there is no real difference between these two approaches. In both approaches we have to construct two hyperplanes; positive vs the rest and negative vs the rest.

For the purpose of testing these Classification methods, I have collected >300.000 book reviews of 10 different books from Amazon.com. I will use a part of these book reviews for training purposes and a part as the test dataset. In the next few blogs I will try to automatically classify the sentiment of these reviews with the four models described above.

—————————————-

**[1] Machine Learning Literature:
**Foundations of Statistical Natural Language Processing by Manning and Schutze,

Machine Learning: A probabilistic perspective by Kevin P. Murphy,

Foundations of Machine Learning by Mehryar Mohri

**[2]Sentiment Analysis literature:**

There is already a lot of information available and a lot of research done on Sentiment Analysis. To get a basic understanding and some background information, you can read Pang et.al.’s 2002 article. In this article, the different Classifiers are explained and compared for sentiment analysis of Movie reviews (IMDB). This research was very close to Turney’s 2002 research on Sentiment Analysis of movie reviews (see article). You can also read Bo Pang and Lillian Lee’s 2009 article , which is more general in nature (about the challenges of SA, the different ML techniques etc.)

There are also two relevant books: Web Data Mining and Sentiment Analysis, both by Bing Liu. And last but not least, works of Socher are also quiet interesting (see paper, website containing live demo); it even has inspired this kaggle competition.

**[3] Naive Bayes Literature:**

Machine Learning by Tom Mitchel, Stanford’s IR-book, Sebastian Raschka’s blog-post, Stanford’s online NLP course.

**[4]Maximum Entropy Literature:**

Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models (1997), A brief MaxEnt tutorial, another good MIT article.

**[6]SVM Literature:**

This youtube video gives a general idea about SVM. For a more technical explanation, this and this article can be read. Here you can find a good explanation as well as a list of the mostly used Kernel functions. one-vs-one and one-vs-all.

**[7] Sentiment Lexicons:
**I have selected a list of sentiment analysis lexicons; most of these were mentioned in the Natural Language Processing course, the rest are from stackoverflow.

- WordStat sentiment Dictionary; This is probably one of the largest lexicons freely available. It contains ~14.000 words ( 9164 negative and 4847 positive words ) and gives words a binary classification (positive or a negative ) score.
- Bill McDonalds 2014 Master dictionary, containing ~85.000 word
- Harvard Inquirer; Contains about ~11.780 words and has a more complex way of ‘scoring’ words; each word can be scored in 15+ categories; words can be Positiv-Negative, Strong-Weak, Active-Passive, Pleasure-Pain, words can indicate pleasure, pain, virtue and vice etc etc
- SentiWordNet; gives the words a positive or negative score between 0 and 1. It contains about 117.660 words, however only ~29.000 of these words have been scored (either positive or negative).
- MPQA; contains about ~8.200 words and binary classifies each word (as either positive or as negative). It also gives additional information such as whether a word is an adjective or a noun and whether a word is ‘strong subjective’ or ‘weak subjective’.
- Bing Liu’s opinion lexicon; contains 4.782 negative and 2.005 positive words.

**Including Emoticons in your dictionary;**

None of the dictionaries described above contain emoticons, which might be an essential part of text if you are analyzing social media. So how can we include emoticons in our subjectivity analysis? Everybody knows is a positive and is a negative emoticon but what exactly does mean and how is it different from :-/?

There are a few emoticon sentiment dictionaries on the web which you could use; Emoticon Sentiment Lexicon created by Hogenboom et. al., containing a list of 477 emoticons which are scored either 1 (positive), 0 (neutral) or -1 (negative). You could also make your own emoticon sentiment dictionary by giving the emoticons the same score as their meaning in words.

]]>———–

For most people, the most interesting part of the previous post, will be the final results. But for the ones who would like to try something similar or the ones who are also curious about the technical part, I will explain the methods and techniques I used (mostly webscraping with Beautifulsoup4) to collect a few million Tweets.

**Setting up Python and its relevant packages**

I used Python as the programming language to collect all the relevant data, because I have prior experience with it, but the same techniques should be applicable with other languages. In case you do not have any experience with Python, but still would like to use it, I can recommend the Coursera Python Course, Codeacademy Python Course or the Learn Python the Hard Way book. It might also be a good idea to get a basic understanding of how web APIs work.

The Python Packages you will need are packages like Numpy & SciPy for basic calculations, OAuth2 for authorization towards the Twitter, tweepy because it provides an user-friendly wrapper of the Twitter API, pymongo for interacting with the MongoDB database from Python (if that is the db you will use).

If you do not have Python and/or some of its packages, the easiest way to install it is; on linux install pip (Python package manager) first and then install any of the missing packages with `pip install <package>.`

For Windows I recommend to install Anaconda (which has a lot of built-in packages including pip) first and then IPython. The missing packages can then be installed with the same command.

**Getting your twitter credentials;**

Twitter is using OAuth2 for authorization and authentication, so whether you are using tweepy to access the Twitter stream or some other method, make sure you have installed the OAuth2 package. After you have installed OAuth2 it is time to get your log-in credentials from https://apps.twitter.com. Log in and click on *create a new app*, fill in the application details and copy your Consumer Secret and your Access Token Secret to some text file.

**Accessing Twitter with its API**

I recommend you to use tweepy [1], which is an open-source Twitter API wrapper, making it easy to access twitter. If you are using a programming language other than Python or if you don’t feel like using tweepy, you can look at the Twitter API documentation and find other means of accessing Twitter.

There are two ways in which you can mine for tweets; with the **Streaming API** or with the **Search Api**. The main difference (for an overview, click here) between them is that with Search you can mine for tweets posted in the past while Streaming goes forward in time and captures tweets as they are posted.

It is also important to take the rate-limit for both API’s into account:

- With the Search API you can only sent 180 Requests every 15 min timeframe. With a maximum number of 100 tweets per Request this means you can mine for 4 x 180 x 100 = 72.000 tweets per hour. One way to increase number of tweets is to authenticate as an application instead of an user. This will increase the rate-limit from 180 Requests to 450 Requests while reducing some of the possibilities you had as an user.
- With the Streaming API you can collect all tweets containing your keyword(s), up to 1 % of the total tweets currently being posted on twitter. So if your keyword is very general and more than 1 % of the tweets contain this term, you will not get all of the tweets containing this term. The obvious solution is to make your query more specific and combining multiple keywords. At the moment 500+ million tweets are posted a day, so 1 % of all tweets still gives you 1+ million tweets a day.

**Which one should you use?**

Obviously any prediction about the future should be based on tweets coming from the Streaming API, but if you need some data to fine-tune your model you can use the Search API to collect tweets from the past seven days – it does not go further back – (Twitter documentation). However, since there are around 500+ million tweets posted every day, the past seven days should provide you with enough data to get you started. However, if you need tweets older than 7 days, webscraping might be a good alternative, since a search at twitter.com does return old tweets.

Using the tweepy package for Streaming Twitter messages is pretty straight forward. There even is an code sample on the github page of tweepy. So all you need to do is install tweepy/clone the github repository and fill in the search terms in the relevant part of search.py.

With tweepy you can also search for Twitter messages (not older than 7 days). the code sample below shows how it is done.

import tweepy access_token = "" access_token_secret = "" consumer_key = "" consumer_secret = "" auth = tweepy.OAuthHandler(consumer_key, consumer_secret) auth.set_acces_token(access_token, access_token_secret) api = tweepy.API(auth) for tweet in tweepy.Cursor(api.search, q="Tayyip%20Erdogan", lang="tr").items(): print tweet

These two samples of code show again the advantages of tweepy; it makes it really easy to access the Twitter API for Python and as a result of this is probably the most popular Python Twitter package. But because it is using the Twitter API, it is also subject to the limitations posed by Twitter; the rate-limit and the fact that you can not search for twitter messages older than 7 days. Since I needed data from the previous elections, this posed a serious problem for me and I had to use web-scraping to collect Twitter messages from May.

**Using BeautifulSoup4 to scrape for tweets**

There are some pro’s and cons with using web scraping for the collection of twitter data (instead of their API). One of most important pro’s are that there is no rate-limit on the website so you can collect more tweets than the limit which is imposed on the Twitter API. Furthermore, you can also mine for tweets older than seven days :).

If we want to scrape twitter.com with BeautifulSoup we need to send a Request and extract the relevant information from the response. The Search API documentation gives a nice overview of the relevant parameters you can use in your query.

For example, if you want to request all tweets containing ‘akparti’ from 01 May 2015 until 05 June 2015, written in Turkish, you can do that with the following url

`https://twitter.com/search?q=akparti%20since%3A2015-05-01%20until%3A2015-06-05&lang=tr`

The tweets on this page can easily be scraped with the Python module BeautifulSoup.

import urllib2 from bs4 import BeautifulSoup url = "https://twitter.com/search?q=akparti%20since%3A2015-05-01%20until%3A2015-06-05&amp;amp;amp;amp;amp;lang=tr" response = urllib2.urlopen(url) html = response.read() soup = BeautifulSoup(html)

‘soup’ now contains the entire contents of the html page. Now, lets look at how we can extract the more specific elements containing only the tweet-text, tweet-timestamp or user. With the developer tools of Chrome (right-click on the tweet and then ‘Inspect element’) you can see which elements contain the desired contents and scrape them by their class-name.

We can see that an **<li>** element with class ‘ *js-stream-item*‘ contains the entire contents of the tweet, a **<p>** with class ‘*tweet-text*‘ contains the text and the user is contained in a **<span>** with class ‘*username*‘. This gives us enough information to extract these with BeautifulSoup:

tweets = soup.find_all('li','js-stream-item') for tweet in tweets: if tweet.find('p','tweet-text'): tweet_user = tweet.find('span','username').text tweet_text = tweet.find('p','tweet-text').text.encode('utf8') tweet_id = tweet['data-item-id'] timestamp = tweet.find('a','tweet-timestamp')['title'] tweet_timestamp = dt.datetime.strptime(timestamp, '%H:%M - %d %b %Y') else: continue

**Notes:**

- The ‘text‘ after
`tweet.find('span','username')`

is necessary to extract only the visible text excluding all html elements. - Since the tweets are written in Turkish they probably contain non-standard characters which are not support, so it is necessary to encode them as utf8.
- The date in the twitter message is written in human readable format. To convert it to a datetime format which can further be used by Python we need to use datetime’s strptime method. To do this we need to additionally import the datetime and locale package[code language=”Python”]import datetime as dt

import locale

locale.setlocale(locale.LC_ALL,’turkish’)[/code]

**Scraping pages with infinite scroll:
**In principle this should be enough to scrape all of the twitter messages containing the keyword ‘akparti’ within the specified dates. However the website of twitter uses infinite scroll, which means it initially shows only ~20 tweets and keeps loading more tweets as you scroll down. So a single Request will only get you the initial 20 tweets.

One of the most commenly used solution for scraping pages with infinite scroll is to use Selenium. Selenium can open a web-browser and scroll down to the bottom of the page (see stackoverflow) after which you can scrape the page. I do not recommend you to use this. The biggest disadvantage of Selenium is that it physically opens up a browser and loads all of the tweets. Nowadays tweets can also contain videos and images, and loading these in your web-browser will be slower than simply loading the source code of the page. If you are planning on scraping thousands or millions of tweets it will be a very time consuming and a memory intensive process.

*There must be another way!*

Lets open up Chrome developer tools again (Ctrl + Shift + I) again to find a solution for this problem. Under the Network tab, you can see the GET and POST requests which are being sent in the instant that you have reached the bottom and the page is being filled with more tweets.

In our case this is a GET request which looks like

`https://twitter.com/i/search/timeline?vertical=default&q=Erdogan%20since%3A2015-05-01%20until%3A2015-06-06&include_available_features=1&include_entities=1&lang=tr&last_note_ts=2088&`

**max_position=TWEET-606971359399411712-606973762026803200**&reset_error_state=false

The interesting parameter in this request is the parameter

**max_position=TWEET-606971359399411712-606973762026803200. **

Here the first digit is the tweet-id of the first tweet on the page and the second digit is of the last tweet. Scrolling down to the bottom again we can see that this parameter has become

**max_position=TWEET-606967763807182849-606973762026803200.
**So every time new tweets a loaded on the page the get request above is sent with the id of the first and last tweet on the page.

At this point I hope it has become clear what needs to be done to scrape all tweets from the twitter page:

1. Read the response of the ‘regular’ Twitter URL with BeautifulSoup. Extract the information you need and save it to a file/database. Separately save the tweet-id of the first and last tweet on the page.

2. Construct the above GET Request where you have filled in the tweet-id of the first and last tweet in their corresponding places. Read the response of this Request with BeautifulSoup and update the tweet-id of the last tweet.

3. Repeat step 2 until you get a response with no more new tweets.

——–

[1] Here are some good documents to get started with tweepy:

http://docs.tweepy.org/en/latest/getting_started.html

http://adilmoujahid.com/posts/2014/07/twitter-analytics/

http://pythoncentral.io/introduction-to-tweepy-twitter-for-python/

http://pythonprogramming.net/twitter-api-streaming-tweets-python-tutorial/

http://www.dototot.com/how-to-write-a-twitter-bot-with-python-and-tweepy/