Classification with Scikit-Learn


1. Introduction

For Python programmers, scikit-learn is one of the best libraries for building machine learning applications. It is well suited to beginners because it has a simple interface and is well documented, with many examples and tutorials.
Besides supervised machine learning (classification and regression), it can also be used for clustering, dimensionality reduction, feature extraction and engineering, and pre-processing of the data. The interface is consistent across all of these methods, so it is not only easy to use, but it is also easy to construct a large collection of classifiers or regression models and train them with the same commands.

In this blog post, let’s have a look at how to build, train, evaluate and validate a classifier with scikit-learn, and in this way get familiar with the library.

We will walk through the process of classification with two example datasets: the glass dataset and the mushroom dataset.

The glass dataset contains data on six types of glass (from building windows, containers, tableware, headlamps, etc.), and each type of glass can be identified by its content of several chemical elements (for example Na, Fe, K, etc.). This dataset only contains numerical data and is therefore a good dataset to get started with.
The second dataset contains non-numerical data, so we will need an additional step in which we encode the categorical data as numerical data.


2. Classification of the glass dataset:

Let’s start with classifying the types of glass!


First we need to import the necessary modules and libraries which we will use.

  • The pandas module is used to load, inspect and process the data and to get it into the shape necessary for classification.
  • Seaborn is a library based on matplotlib with convenient functionality for drawing graphs.
  • StandardScaler is a scikit-learn class for standardizing a dataset (zero mean and unit variance per feature), and
  • the LabelEncoder class can be used to encode the categorical features (in the mushroom dataset) as integers. Note that this is integer encoding, not one-hot encoding.
  • All of the other modules are classifiers which are used for classification of the dataset.


2.1 Loading, analyzing and processing the dataset:

When loading a dataset for the first time, there are several questions we need to ask ourselves:

  • What kind of data does the dataset contain?  Numerical data, categorical data, geographic information, etc…
  • Does the dataset contain any missing data?
  • Does the dataset contain any redundant data (noise)?
  • Do the values of the features differ over many orders of magnitude? Do we need to standardize or normalize the dataset?
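These questions can be answered with a few pandas calls. The snippet below is only a sketch: the rows are made-up stand-ins rather than the real glass data, and the file name in the comment is hypothetical.

```python
import pandas as pd

# Made-up stand-in rows. In practice you would load your own copy of the data,
# for example with pd.read_csv("glass.csv") (hypothetical file name).
df = pd.DataFrame({
    "Na": [13.6, 13.9, 13.5, 13.2],
    "Mg": [4.5, 3.6, 3.5, 0.0],
    "Fe": [0.0, 0.0, 0.1, 0.0],
    "Type": [1, 1, 2, 7],
})

n_rows, n_cols = df.shape                # how large is the dataset?
dtypes = df.dtypes                       # numerical or object (categorical) columns?
missing_per_column = df.isnull().sum()   # number of missing values per column
value_ranges = df.max() - df.min()       # rough check of the scale of each feature

print(n_rows, n_cols)
print(dtypes)
print(missing_per_column)
print(value_ranges)
```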



We can see that the dataset consists of 214 rows and 10 columns. All of the columns contain numerical data, and there are no rows with missing information (check this for yourself). Also most of the features have values in the same order of magnitude.
So for this dataset we do not need to remove any rows (with .dropna()), apply one-hot encoding (to transform categorical data into numerical data), or standardize the data (with StandardScaler().fit_transform(X)).

The .describe() method of pandas is useful for getting a quick overview of the dataset:

  • How many rows of data are there?
  • What are some characteristic values, like the mean, standard deviation, minimum and maximum value, the 25th percentile, etc.?
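As a small illustration of what .describe() returns, here is a sketch on a toy dataframe (the values are made up, not taken from the glass dataset):

```python
import pandas as pd

# Toy data, only to show the shape of the .describe() output.
df = pd.DataFrame({
    "Na": [13.0, 13.5, 14.0, 14.5],
    "Type": [1, 1, 2, 2],
})

summary = df.describe()   # one column per feature, one row per statistic
print(summary)

# Individual statistics can be looked up by row and column label.
print(summary.loc["mean", "Na"])
```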


To get more insight into how (strongly) each feature is correlated with the Type of glass, we can calculate and plot the correlation matrix for this dataset.

The correlation matrix shows us, for example, that the oxides ‘Mg’ and ‘Al’ are most strongly correlated with the Type of glass, while the content of ‘Ca’ is least strongly correlated with it. For some datasets there may be features that show no correlation with the target at all; it might then be a good idea to remove these, since they are likely to act only as noise.
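A correlation matrix like this can be computed with the .corr() method of pandas. The sketch below uses made-up values for two features and ranks them by the absolute strength of their correlation with ‘Type’; the plotting call is left as a comment because it assumes seaborn is installed.

```python
import pandas as pd

# Made-up values, only to illustrate the calculation.
df = pd.DataFrame({
    "Mg": [4.5, 3.6, 3.5, 0.0, 0.0, 1.2],
    "Ca": [8.8, 7.8, 7.8, 9.6, 8.4, 8.9],
    "Type": [1, 1, 2, 7, 7, 5],
})

corr = df.corr()   # pairwise Pearson correlation between all columns

# Rank the features by how strongly they correlate with the target column.
corr_with_type = corr["Type"].drop("Type").abs().sort_values(ascending=False)
print(corr_with_type)

# To draw the matrix, something like this could be used (assuming seaborn):
# import seaborn as sns
# sns.heatmap(corr, annot=True)
```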


2.3 Classification and validation

The next step is building and training the actual classifier, which hopefully can accurately classify the data. With this we will be able to tell which type of glass an entry in the dataset belongs to, based on the features.

For this we need to split the dataset into a training set and a test set. With the training set we will train the classifier, and with the test set we will validate its accuracy. A 70% / 30% ratio is a common choice when splitting into a training and test set, but this ratio should be chosen based on the size of the dataset. For example, if the dataset does not have enough entries, 30% of it might not contain all of the classes or enough information to properly function as a test set.


Another important note is that the distribution of the different classes in both the training and the test set should match the distribution in the full dataset. For example, if you have a dataset of review texts which contains 20% negative and 80% positive reviews, both the training and the test set should have this 20% / 80% ratio. Splitting the dataset randomly achieves this approximately; for small or imbalanced datasets, a stratified split (the stratify argument of train_test_split) keeps the ratios the same by construction.
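As a sketch of such a split, here is the 20% / 80% example with placeholder data. The stratify argument of train_test_split is one way to make sure both parts keep the class ratio; the fixed random_state is just an assumption to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 20% of class 0 and 80% of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 20 + [1] * 80)

# 70% / 30% split that preserves the class ratio in both parts.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print((Y_train == 0).mean(), (Y_test == 0).mean())
```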



With the dataset split into training and test sets, we can start building a classification model. I will do this in a slightly different way than usual. The idea behind this is that, when we start with a new dataset, we don’t know which (type of) classifier will perform best on it. Will it be a tree-based classifier like Decision Tree or Random Forest, a classifier which uses a functional approach like Logistic Regression, a classifier which uses a statistical approach like Naive Bayes, etc.?

Because we don’t know this, we will try all types of classifiers first, and later we can continue to optimize the best-performing classifier of this initial batch. For this we have to make a dictionary which contains the names of the classifiers as keys and an instance of each classifier as values.
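Such a dictionary could look like the sketch below. The selection of eight classifiers, the dictionary name and the parameter choices (such as max_iter) are assumptions for illustration; any scikit-learn classifier can be added in the same way.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Keys are human-readable names, values are untrained classifier instances.
dict_classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(kernel="linear"),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Neural Net": MLPClassifier(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}
```

Because all of these classes share the same interface, every value in the dictionary can be trained and evaluated with the same two method calls.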


Then we can iterate over this dictionary, and for each classifier:

  • train the classifier with .fit(X_train, Y_train)
  • evaluate how the classifier performs on the training set with .score(X_train, Y_train)
  • evaluate how the classifier performs on the test set with .score(X_test, Y_test)
  • keep track of how much time it takes to train the classifier with the time module
  • save the training score, the test score, and the training time into a dataframe called ‘df_results’
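The steps above can be sketched as follows. The function name batch_classify and the column names are my own choices, the data is synthetic (generated with make_classification), and only two classifiers are used to keep the example fast.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def batch_classify(X_train, Y_train, X_test, Y_test, classifiers):
    """Train every classifier and collect scores and training time."""
    rows = []
    for name, clf in classifiers.items():
        t_start = time.perf_counter()
        clf.fit(X_train, Y_train)            # train the classifier
        t_end = time.perf_counter()
        rows.append({
            "classifier": name,
            "train_score": clf.score(X_train, Y_train),  # accuracy on training set
            "test_score": clf.score(X_test, Y_test),     # accuracy on test set
            "training_time": t_end - t_start,            # seconds spent in fit()
        })
    return pd.DataFrame(rows)


# Synthetic stand-in for a real dataset: 300 samples, 9 features, 3 classes.
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

classifiers = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}
df_results = batch_classify(X_train, Y_train, X_test, Y_test, classifiers)
print(df_results.sort_values(by="test_score", ascending=False))
```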



The reason we keep track of the time it takes to train a classifier is that in practice this is also an important indicator of whether or not you would like to use a specific classifier. If there are two classifiers with similar results, but one of them takes much less time to train, you probably want to use that one.

For classifiers, the score() method simply returns the result of the accuracy_score() function in the metrics module. This module contains many functions for evaluating classification or regression models, and I can recommend spending some time learning which metrics you can use to evaluate your model.
The classification_report function, for example, calculates the precision, recall and f1-score for all of the classes in your dataset. If you are looking for ways to improve the accuracy of your classifier, or if you want to know why the accuracy is lower than expected, such detailed information about the performance of the classifier on the dataset can point you in the right direction.
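Here is a short sketch of both functions on synthetic data (the dataset and the choice of a Decision Tree are assumptions for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with three classes.
X, y = make_classification(
    n_samples=300, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

# Same value as clf.score(X_test, Y_test).
accuracy = accuracy_score(Y_test, Y_pred)
print(accuracy)

# Precision, recall and f1-score for each class separately.
report = classification_report(Y_test, Y_pred)
print(report)
```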


The accuracy on the training set, accuracy on the test set, and the duration of the training is saved into the ‘df_results’ dataframe.

What we are doing feels like a brute-force approach, where a large number of classifiers are built to see which one performs best. Although this is not particularly educational, it gives an idea of which classifiers will perform better on a particular dataset and which will not. After that you can continue with the best (or top 3) classifiers, and try to improve the results by tuning the parameters of the classifier, or by adding more features to the dataset.

As we can see, the Gradient Boosting classifier performs best for this dataset. In fact, tree-based ensemble classifiers like Random Forest and Gradient Boosting often rank among the best performers on datasets and challenges on Kaggle (that does not mean you should rule out all other classifiers).

For those who are interested in the theory behind these classifiers, scikit-learn has a well-written user guide. Some of these classifiers were also explained in previous posts, such as the naive Bayes classifier and logistic regression, and support vector machines were partially explained in the perceptron blog post.


3. Classification of the mushroom dataset:

The second dataset we will have a look at is the mushroom dataset, which contains data on edible vs poisonous mushrooms. In the dataset there are 8124 mushrooms in total (4208 edible and 3916 poisonous) described by 22 features each.
The big difference with the glass dataset is that these features don’t have a numerical, but a categorical value. Because this dataset contains categorical values, we need one extra step in the classification process, which is the encoding of these values.


3.1 Loading, analyzing and pre-processing the data.

A fast way to find out what type of categorical data a dataset contains, is to print out the unique values of each column in this dataframe. In this way we can also see whether the dataset contains any missing values or redundant columns.
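A minimal sketch of this check, using a few made-up rows; of the column names only ‘veil-type’ comes from the text above, the others are placeholders:

```python
import pandas as pd

# Made-up rows in the style of the mushroom dataset.
df = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["a", "f", "n", "f"],
    "veil-type": ["p", "p", "p", "p"],
})

# Collect the unique values of every column.
unique_values = {col: sorted(df[col].unique()) for col in df.columns}
for col, values in unique_values.items():
    print(col, values)
```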


As we can see, there are 22 categorical features. Of these, the feature ‘veil-type’ only contains one value ‘p’ and therefore does not provide any added value for any classifier. The best thing to do is to remove this feature.
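Instead of removing such a column by hand, columns with a single unique value can also be found programmatically. A sketch on made-up rows (only ‘veil-type’ is a name taken from the text, the rest is a placeholder):

```python
import pandas as pd

# Made-up rows; 'veil-type' has only one value, as described above.
df = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["a", "f", "n", "f"],
    "veil-type": ["p", "p", "p", "p"],
})

# A column with one unique value carries no information for a classifier.
constant_columns = [col for col in df.columns if df[col].nunique() == 1]
df_reduced = df.drop(columns=constant_columns)

print(constant_columns)
print(df_reduced.columns.tolist())
```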


3.2 Encoding categorical data

Most classifiers can only work with numerical data, and will raise an error when categorical values in the form of strings are used as input. Luckily scikit-learn contains the LabelEncoder class, which can be used to transform non-numerical values to numerical values. This is done by first fitting the LabelEncoder with all possible (unique) values and then transforming all values to numerical values. (LabelEncoder is designed for target labels; for input features, scikit-learn also offers OrdinalEncoder and OneHotEncoder in the same preprocessing module.)

(Both steps can also be done in one go with the fit_transform() method.)
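A sketch of this encoding step on made-up rows, with one LabelEncoder per column; keeping the fitted encoders in a dictionary is my own addition, so that the numbers can be translated back to the original values later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical rows (placeholder column names).
df = pd.DataFrame({
    "class": ["e", "p", "e", "p"],
    "odor": ["a", "f", "n", "f"],
})

encoders = {}
df_encoded = df.copy()
for col in df.columns:
    le = LabelEncoder()
    # fit_transform() learns the unique values and maps them to integers.
    df_encoded[col] = le.fit_transform(df[col])
    encoders[col] = le   # keep the encoder to decode the values later

print(df_encoded)
print(encoders["odor"].classes_)                      # values in encoded order
print(encoders["odor"].inverse_transform([0, 1, 2]))  # back to the strings
```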


3.3 Classification and Validation

As we can see, the columns previously containing non-numerical values now contain numerical values and the dataset is ready for classification. Again, we will split the dataset into a 70% training set and a 30% test set and start training and validating a batch of the eight most used classifiers.


As we can see, the accuracy of the classifiers for this dataset is also quite high. For datasets where this is not the case, we can play around with the features in the dataset, add extra features from additional datasets, or change the parameters of the classifiers in order to improve the accuracy.

In my opinion, the best way to master the scikit-learn library is to simply start coding with it. I hope this blog post gave some insight into how the scikit-learn library works, but for those who need some more information, here are some useful links:

dataschool – machine learning with scikit-learn video series

Classification example using the iris dataset

Official scikit-learn documentation

Scikit-Learn Cheat Sheet

