Regression, Logistic Regression and Maximum Entropy part 2 (code + examples)

Posted on Posted in Classification, Sentiment Analytics

update: The Python code for Logistic Regression can be forked/cloned from my Git repository. It is also available on PyPi. The relevant information in the blog-posts about Linear and Logistic Regression are also available as a Jupyter Notebook on my Git repository.



In the previous blog we have seen the theory and mathematics behind the Maximum Entropy and Logistic Regression Classifiers.
Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even sports.

Reading all of this, the theory[1] of Maximum Entropy Classification might look difficult. In my experience, the average Developer does not believe they can design a proper Maximum Entropy / Logistic Regression Classifier from scratch. I strongly disagree: not only is the mathematics behind is relatively simple, it can also be implemented with a few lines of code.

I have done this in the past month, so I thought I’d show you how to do it. The code is in Python but it should be relatively easy to translate it to other languages. Some of the examples contain self-generated data, while other examples contain real-world (medical) data. As was also done in the blog-posts about the bag-of-words model and the Naive Bayes Classifier, we will also try to automatically classify the sentiments of book reviews.



Short recap of the theory

We have seen that the technique to perform Logistic Regression is similar to regular Regression Analysis. We are trying to estimate the feature values \theta iteratively with the Gradient Descent method. In the Gradient Descent method, the values of the parameters \theta in the current iteration are calculated by updating the values of \theta from the previous iteration with the gradient of the cost function J .

Different cost functions exist, but most often the squared error between the hypothesis function h_{\theta}(x) and Y is used. In (regular) Regression this hypothesis function can be any function which you expect will provide a good model of the dataset. In Logistic Regression the hypothesis function is always given by the Logistic function:

h_{\theta} = \frac{1}{1 + \exp(-z)}} = \frac{1}{1 + \exp(-\theta x)}.


This also means that for Logistic Regression, we no longer have to think about the form of the hypothesis function (while we still have to do that for regular regression). What we do have to think about, is which features \theta will classify our current dataset optimally.

Taking all of this into account, this is how Gradient Descent works:

  • Make an initial but intelligent guess for the values of the parameters \theta.
  • Keep iterating while the value of the cost function has not met your criteria:
    • With the current values of \theta, calculate the gradient of the cost function J  ( \Delta \theta = - \alpa \frac{d}{d\theta} J(x) ).
    • Update the values for the parameters \theta := \theta + \alpha \Delta \theta
    • Fill in these new values in the hypothesis function and calculate again the value of the cost function;



Regression Analysis

We have seen the self-generated example of students participating in a Machine Learning course, where their final grade depended on how many hours they had studied.

First, let’s generate the data:


We can see that we have generated 100 points uniformly distributed over the x-axis. For each of these x – points the y-value is determined by y = 1 + 0.07 x^{2} minus some random value.

fitted_regression_ fitted_regression_A


On the left we can see a scatterplot of the datapoints and on the right we can see the same data with a curve fitted through the points. This is the curve we are trying to estimate with the Gradient Descent method. This is done as follows:


We can see that we have to calculate the gradient of the cost function n times and update the n feature values simultaneously! This indeed results in the curve we were looking for:




Logistic Regression

After this short example of Regression, lets have a look at a few examples of Logistic Regression. We will start out with a the self-generated example of students passing a course or not and then we will look at real data from the medical world.

Let’s generate some data points. There are n = 300 students participating in the course Machine Learning and whether a student i passes ( y_i = 1) or not ( y_i = 0 ) depends on two variables;

  • x_i^{(1)} : how many hours student i has studied for the exam.
  • x_i^{(2)} : how many hours student i has slept the day before the exam.



In our example, the results are pretty binary; everyone who has studied less than 4 hours fails the course, as well as everyone whose studying time + sleeping time is less than or equal to 13 hours (x_i^{(1)} + x_i^{(2)} \leq 13). The results looks like this (the green dots indicate a pass and the red dots a fail):


For this example we will again apply Gradient Descent to determine the feature values which can classify the dataset optimally. This is done as follows:


As we can see, the code of the Gradient Descent method looks very similar to the code in the case of regular regression. The main difference is that the hypothesis function is now equal to the sigmoid function.

Using this algorithm for gradient descent, we can correctly classify 297 out of 300 datapoints (wrongly classified points are indicated with a cross).




Logistic Regression on Medical Data

Now  that the concept of Logistic Regression is a bit more clear, let’s classify real-world data!
The University of Massachusetts provides some datasets which are ideal to perform Logistic Regression on. They are small (so my small laptop can also perform it in a reasonable amount of time) and there are various datasets with different (amount of) features.

The dataset “myopia.dat” contains the medical data of 618 subjects, and has 15 features describing the characteristics of each subject. We can read this data in as follows:


This results in a Y-vector with 618 elements and a X-matrix which is 618 by 15:



While reading in, all values are normalized to a value around 1. This is done in order to speed up the calculations and to ensure that you never take the logarithm of a zero value (log functions don’t really like that).

Once the data has been read into the X and Y matrices, logistic regression can be applied:


This simple algorithm for logistic regression correctly classifies ~550 of the 618 subjects, giving it an accuracy of ~90%.


Maximum Entropy Text Classification

Logistic Regression by using Gradient Descent can also be used for NLP / Text Analysis tasks. There are a wide variety of tasks which can are done in the field of NLP; autorship attribution, spam filtering, topic classification and sentiment analysis. 

For  a task like sentiment analysis we can follow the same procedure. We will have as the input a large collection of labelled text documents. These will be used to train the Logistic Regression classifier. The most important task then, is to select the proper features which will lead to the best sentiment classification. Almost everything in the text document can be used as a feature[2]; you are only limited by your creativity.

For sentiment analysis usually the occurence of (specific) words is used, or the relative occurence of words (the word occurences divided by the total number of words).

As we have done before, we have to fill in the X and Y matrices, which will serve as an input for the gradient descent algorithm and this algorithm will give us the resulting feature vector \theta. With this vector we can determine the class of other text documents.

As always Y is a vector with n elements (where n is the number of text-documents). The matrix X is a n by m matrix; here m is the total number of relevant words in all of the text-documents. I will illustrate how to build up this X matrix with three book reviews:

  • pos: “This is such a beautiful edition of Harry Potter and the Sorcerer’s Stone. I’m so glad I bought it as a keep sake. The illustrations are just stunning.” (28 words in total)
  • pos: “A brilliant book that helps you to open up your mind as wide as the sky” (16 words in total)
  • neg: “This publication is virtually unreadable. It doesn’t do this classic justice. Multiple typos, no illustrations, and the most wonky footnotes conceivable. Spend a dollar more and get a decent edition.” (30 words in total)

These three reviews will result in the following X-matrix.


As you can see, each row of the X matrix contains all of the data per review and each column contains the data per word. If a review does not contain a specific word, the corresponding column will contain a zero. Such a X-matrix containing all the data from the training set can be build up in the following manner:


Assuming that we have a list containing the data from the training set:


From this training_set, we are going to generate a words_vector. This words_vector is used to keep track to which column a specific word belongs to. After this words_vector has been generated, the X matrix and Y vector can filled in.


As we have done before, the gradient descent method can be applied to derive the feature vector from the X and Y matrices:


What should we do if a specific review tests positive (Y=1) for more than one class? A review could result in Y=1 for both the neu class as well as the neg class. In that case we will pick the  class with the highest score. This is called multinomial logistic regression.


Maximum Entropy Text Classification with Python’s NLTK library

So far, we have seen how to implement a Logistic Regression Classifier in its most basic form. It is true that building such a classifier from scratch, is great for learning purposes. It is also true that no one will get to the point of using deeper / more advanced Machine Learning skills without learning the basics first.

For real-world applications however, often the best solution is to not re-invent the wheel but to re-use tools which are already available. Tools which have been tested thorougly and have been used by plenty of smart programmers before you. One of such a tool is Python’s NLTK library.

NLTK is Python’s Natural Language Toolkit and it can be used for a wide variety of Text Processing and Analytics jobs like tokenization, part-of-speech tagging and classification. It is easy to use and even includes a lot of text corpora, which can be used to train your model if you have no training set available.

Let us also have a look at how to perform sentiment analysis and text classification with NLTK. As always, we will use a training set to train NLTK’s Maximum Entropy Classifier and a test set to verify the results. Our training set has  the following format:


As you can see, the training set consists of a list of tuples of two elements. The first element is a list of the words in the text of the document and the second element is the class-label of this specific review (‘neg’, ‘neu’ or ‘pos’). Unfortunately NLTK’s Classifiers only accepts the text in a hashable format (dictionaries for example) and that is why we need to convert this list of words into a dictionary of words.

Once the training set has been converted into the proper format, it can be feed into the train method of the MaxEnt Classifier:


Once the training of the MaxEntClassifier is done, it can be used to classify the review in the test set:



Final Words:

So far we have seen the theory behind the Naive Bayes Classifier and how to implement it (in the context of Text Classification) and in the previous and this blog-post we have seen the theory and implementation of Logistic Regression Classifiers. Although this is done at a basic level, it should give some understanding of the Logistic Regression method (I hope at a level where you can apply it and classify data yourself). There are however still many (advanced) topics which have not been discussed here:

  • Which hill-climbing / gradient descent algorithm  to use; IIS (Improved Iterative Scaling), GIS (Generalized Iterative Scaling), BFGS, L-BFGS or Coordinate Descent
  • Encoding of the feature vector and the use of dummy variables
  • Logistic Regression is an inherently sequential algorithm; although it is quiet fast, you might need a parallelization strategy if you start using larger datasets.

If you see any errors please do not hesitate to contact me. If you have enjoyed reading, maybe even learned something, do not forget to subscribe to this blog and share it!



[1] See the paper of Nigam et. al. on Maximum Entropy and the paper of Bo Pang et. al. on Sentiment Analysis using Maximum Entropy. Also see Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models(1997), A brief MaxEnt tutorial, and another good MIT article.

[2] See for example Chapter 7 of Speech and Language Processing by (Jurafsky & Martin): For the task of period disambiguation a feature could be whether or not a period is followed by a capital letter unless the previous word is St.

Share This:

8 thoughts on “Regression, Logistic Regression and Maximum Entropy part 2 (code + examples)

  1. Pingback: Cıvata
  2. In your function:

    def determine_correct_guesses(X, Y, theta, m):
    determined_z = [, X[ii]) for ii in range(m)]
    determined_Y = [z_to_y(elem) for elem in determined_z]
    correct_guesses = [y_i – det_y_i for y_i, det_y_i in zip(Y, determined_Y)]
    return correct_guesses

    I am struggling to follow it, because if the correct answer is 1 and the guess is 1, then you are deducting both values and then that leaves a value of “0”, which I would think would be an incorrect answer (with 1 being a correct answer). Either all values of “0” are treated as correct implicitly or I have misunderstood something fairly simple. Maybe it is just notational but I I just had this doubt as to “0” would be a positive indicator of a “correct” answer.

    If Y = [0, 0, 1, 1] and determined_Y = [1,0,1,0] then correct_guessed would be:
    [-1, 0, 0, 1]

    Where indices 1 and 2 are the correct ones, but valued at zero.
    That is what is going on, right? Wouldn’t it be better to call this “incorrect_guesses” rather than “correct_guesses” ?

    1. Hi Alex,
      All Values of “0” are indeed treated as correct implicitely. You can see that on line 24, where the number of correct_guesses is set to the number of zero’s.
      (“correct_guesses = determine_correct_guesses(X, Y, theta, m).count(0)”)
      The idea was to create a method which can quickly count the number of correct guesses. But as you have noticed it is not as transparant as I would like it to be. Therefore I had already updated this function on Github, but I see it was not updated here. Thanks for pointing this out 🙂

      PS: The final evaluation of the Classifier model should of course be done by measuring the accuracy on the test-set with the the f1-score (see on github).


Geef een reactie

Het e-mailadres wordt niet gepubliceerd. Verplichte velden zijn gemarkeerd met *