From the introductory blog we know that the Naive Bayes Classifier is based on the bag-of-words model.
With the bag-of-words model we check which words of the text document appear in a positive-words list or a negative-words list. If a word appears in the positive-words list, the total score of the text is increased by 1; if it appears in the negative-words list, the score is decreased by 1. If at the end the total score is positive, the text is classified as positive, and if it is negative, the text is classified as negative. Simple enough!
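In code, this word-list approach looks roughly as follows (a minimal sketch; the word lists and the tie-breaking for a score of exactly 0 are made up for illustration):

positive_words = {'good', 'great', 'love'}
negative_words = {'bad', 'terrible', 'hate'}

def wordlist_score(words):
    score = 0
    for word in words:
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
    # a score of 0 is treated as negative here; this choice is arbitrary
    return 'positive' if score > 0 else 'negative'

print(wordlist_score(['this', 'movie', 'was', 'great']))  # prints 'positive'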
With the Naive Bayes model, we do not take only a small set of positive and negative words into account, but all words the NB Classifier was trained with, i.e. all words present in the training set. If a word has not appeared in the training set, we have no data available and apply Laplacian smoothing (use 1 instead of the conditional probability of the word).
The probability that a document belongs to a class is proportional to the class probability multiplied by the product of the conditional probabilities of each word for that class:

$$P(c_j \mid d) \propto P(c_j) \cdot \prod_{i=1}^{n_d} P(w_i \mid c_j) = P(c_j) \cdot \prod_{i=1}^{n_d} \frac{\text{count}(w_i, c_j)}{N_{c_j}}$$

Here $\text{count}(w_i, c_j)$ is the number of occurrences of word $w_i$ in class $c_j$, $N_{c_j}$ is the total number of words in class $c_j$, and $n_d$ is the number of words in the document we are currently classifying. $N_{c_j}$ does not change (unless the training set is expanded), so it can be placed outside of the product:

$$P(c_j \mid d) \propto \frac{P(c_j)}{N_{c_j}^{n_d}} \cdot \prod_{i=1}^{n_d} \text{count}(w_i, c_j)$$
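As a quick illustration with made-up numbers: suppose $P(\text{positive}) = 0.5$, the positive class contains $N_{\text{positive}} = 100$ training words, and we classify the two-word document "great movie", where "great" occurred 10 times and "movie" 5 times in the positive training documents. The positive score is then $0.5 \cdot \frac{10}{100} \cdot \frac{5}{100} = 0.0025$, which is compared against the score computed in the same way for the negative class.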
Implementing Naive Bayes Text Classification
With this information it is easy to implement a Naive Bayes Text Classifier. (Naive Bayes can also be used to classify non-text / numerical datasets; for an explanation see this notebook.)
We have a NaiveBayesText class, which accepts the input values for X and Y as parameters of the “train()” method. Here X is a list of lists, where each inner list contains all the words of one document. Y is a list containing the label/class of each document.
import numpy as np
from collections import Counter, defaultdict

class NaiveBaseClass:
    def calculate_relative_occurences(self, list1):
        # count each item and divide by the total number of items
        no_examples = len(list1)
        ro_dict = dict(Counter(list1))
        for key in ro_dict.keys():
            ro_dict[key] = ro_dict[key] / float(no_examples)
        return ro_dict

    def get_max_value_key(self, d1):
        # return the key belonging to the highest value in the dictionary
        values = list(d1.values())
        keys = list(d1.keys())
        max_value_index = values.index(max(values))
        max_key = keys[max_value_index]
        return max_key

    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = defaultdict(list)

class NaiveBayesText(NaiveBaseClass):
    """
    When the goal is classifying text, it is better to give the input X
    in the form of a list of lists containing words:
    X = [ ['this', 'is', 'a', ...], (...) ]
    Y still is a 1D array / list containing the labels of each entry.
    """
    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = []

    def train(self, X, Y):
        self.class_probabilities = self.calculate_relative_occurences(Y)
        self.labels = np.unique(Y)
        self.no_examples = len(Y)
        self.initialize_nb_dict()
        # 1. collect all words per class in one big list
        for ii in range(0, len(Y)):
            label = Y[ii]
            self.nb_dict[label] += X[ii]
        # 2. transform the list with all occurences to a dict with relative occurences
        for label in self.labels:
            self.nb_dict[label] = self.calculate_relative_occurences(self.nb_dict[label])
As we can see, the training of the Naive Bayes Classifier is done by iterating through all of the documents in the training set. From these documents, a hash table (a dictionary in Python) with the relative occurrence of each word per class is constructed.
This is done in two steps:
1. construct a huge list of all occurring words per class:
for ii in range(0, len(Y)):
    label = Y[ii]
    self.nb_dict[label] += X[ii]
2. calculate the relative occurrence of each word in this huge list, with the “calculate_relative_occurences” method. This method simply uses Python’s Counter (from the collections module) to count how often each word occurs and then divides this number by the total number of words (a small standalone illustration follows below).
The result is saved in the dictionary nb_dict.
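As a small standalone illustration of that computation (the word list here is made up):

from collections import Counter

words = ['good', 'good', 'bad', 'fun']   # all words of one class, concatenated
counts = Counter(words)                  # Counter({'good': 2, 'bad': 1, 'fun': 1})
relative = {w: c / float(len(words)) for w, c in counts.items()}
print(relative)                          # {'good': 0.5, 'bad': 0.25, 'fun': 0.25}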
As we can see, it is easy to train the Naive Bayes Classifier. We simply calculate the relative occurrence of each word per class, and save the result in the “nb_dict” dictionary.
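A minimal usage sketch of the train() method, with made-up documents and labels:

X = [['this', 'movie', 'was', 'great'],
     ['what', 'a', 'terrible', 'film']]
Y = ['positive', 'negative']

nb = NaiveBayesText()
nb.train(X, Y)
print(nb.class_probabilities)   # {'positive': 0.5, 'negative': 0.5}
print(nb.nb_dict['positive'])   # relative occurrence of each word in the 'positive' documents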
This dictionary can be updated, saved to file, and loaded back from file. It contains the results of Naive Bayes Classifier training.
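One possible way to persist and reload the trained dictionaries (the original post does not show a persistence mechanism; the pickle module is my choice here) is:

import pickle

# save the trained model state to file
with open('nb_model.pkl', 'wb') as f:
    pickle.dump({'nb_dict': nb.nb_dict,
                 'class_probabilities': nb.class_probabilities,
                 'labels': nb.labels}, f)

# load it back later
with open('nb_model.pkl', 'rb') as f:
    state = pickle.load(f)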
Classifying new documents is also done quite easily by calculating the class probability for each class and then selecting the class with the highest probability.
def classify_single_elem(self, X_elem):
    Y_dict = {}
    for label in self.labels:
        class_probability = self.class_probabilities[label]
        nb_dict_features = self.nb_dict[label]
        for word in X_elem:
            if word in nb_dict_features.keys():
                relative_word_occurence = nb_dict_features[word]
                class_probability *= relative_word_occurence
            else:
                # word was not seen during training; see the discussion on smoothing in the comments below
                class_probability *= 0
        Y_dict[label] = class_probability
    return self.get_max_value_key(Y_dict)

def classify(self, X):
    self.predicted_Y_values = []
    n = len(X)
    for ii in range(0, n):
        X_elem = X[ii]
        prediction = self.classify_single_elem(X_elem)
        self.predicted_Y_values.append(prediction)
    return self.predicted_Y_values
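Continuing the made-up example from above, classifying new documents would look like:

X_test = [['great', 'movie'],
          ['terrible', 'plot']]
print(nb.classify(X_test))   # e.g. ['positive', 'negative'], depending on the training data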
Next blog:
In the next blog we will look at the results of this naively implemented algorithm for the Naive Bayes Classifier and see how it performs under various conditions; we will see the influence of varying training set sizes and whether the use of n-gram features will improve the accuracy of the classifier.
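As a preview of the n-gram idea, a bigram version of a document could be constructed with a small helper like this (a hypothetical sketch, not code from this post); each bigram is then treated by the classifier as if it were a single word:

def to_bigrams(words):
    return [words[i] + '_' + words[i + 1] for i in range(len(words) - 1)]

print(to_bigrams(['this', 'movie', 'was', 'great']))
# ['this_movie', 'movie_was', 'was_great']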
13 thoughts on “Sentiment Analysis with the Naive Bayes Classifier”
Great looking site. Think you did a great deal of your very own coding
Could you please teach me how to do the prediction for a new article instead of doing evaluation by using the test data with corresponding categories?
Really Cool. Bednarczyk
Thank you for the posting. But there is one part of your code I don’t understand.
When we try to do the test, if the word is not in our bag of words (training set), you mentioned we will use the Laplacian smoothing (use 1 instead of the conditional probability of the word). But in the code,
else:
class_probability *= 0
it multiplies by 0, not 1. Why do you do this?
Hi Jason,
In earlier parts of the code I was indeed using Laplacian smoothing, but I changed it later on.
Let me change it back when I have some time.
Thanks for the heads up.