Classification with Scikit-Learn

update: The code presented in this blog-post is also available in my GitHub repository.

update2: I have added sections 2.4 , 3.2 , 3.3.2 and 4 to this blog post, updated the code on GitHub and improved upon some methods.


1. Introduction

For python programmers, scikit-learn is one of the best libraries to build Machine Learning applications with. It is ideal for beginners because it has a really simple interface, it is well documented with many examples and tutorials.

Besides supervised machine learning (classification and regression), it can also be used for clustering, dimensionality reduction, feature extraction and engineering,  and pre-processing the data. The interface is consistent over all of these methods, so it is not only easy to use, but it is also easy to construct a large ensemble of classifiers/regression models and train them with the same commands.

In this blog lets have a look at how to build, train, evaluate and validate a classifier with scikit-learn, improve upon the initial classifier with hyper-parameter optimization and look at ways in which we can have a better understanding of complex datasets.

We will do this by going through the of classification of two example datasets. The glass dataset, and the Mushroom dataset.

The glass dataset contains data on six types of glass (from building windows, containers, tableware, headlamps, etc) and each type of glass can be identified by the content of several minerals (for example  Na, Fe, K, etc). This dataset only contains numerical data and therefore is a good dataset to get started with.

The second dataset contains non-numerical data and we will need an additional step where we encode the categorical data to numerical data.


2. Classification of the glass dataset:

Lets start with classifying the classes of glass!

First we need to import the necessary modules and libraries which we will use.


  • The pandas module is used to load, inspect, process the data and get in the shape necessary for classification.
  • Seaborn is a library based on matplotlib and has nice functionalities for drawing graphs.
  • StandardScaler is a library for standardizing and normalizing dataset and
  • the LaberEncoder library can be used to One Hot Encode the categorical features (in the mushroom dataset).
  • All of the other modules are classifiers which are used for classification of the dataset.


2.1 Loading, analyzing and processing the dataset:

When loading a dataset for the first time, there are several questions we need to ask ourself:

  • What kind of data does the dataset contain?  Numerical data, categorical data, geographic information, etc…
  • Does the dataset contain any missing data?
  • Does the dataset contain any redundant data (noise)?
  • Do the values of the features differ over many orders of magnitude? Do we need to standardize or normalize the dataset?



We can see that the dataset consists of 214 rows and 10 columns. All of the columns contain numerical data, and there are no rows with missing information. Also most of the features have values in the same order of magnitude.
So for this dataset we do not need to remove any rows, impute missing values or transform categorical data into numerical.

The .describe() method we used above is useful for giving a quick overview of the dataset;

  • How many rows of data are there?
  • What are some characteristic values like the mean, standard deviation, minimum and maximum value, the 25th percentile etc.



2.3 Classification and validation

The next step is building and training the actual classifier, which hopefully can accurately classify the data. With this we will be able to tell which type of glass an entry in the dataset belongs to, based on the features.

For this we need to split the dataset into a training set and a test set. With the training set we will train the classifier, and with the test set we will validate the accuracy of the classifier. Usually a 70 % / 30 % ratio is used when splitting into a training and test set, but this ratio should be chosen based on the size of the dataset. For example, if the dataset does not have enough entries, 30% of it might not contain all of the classes or enough information to properly function as a validation set.

Figure 1. A schematic overview of the classification process.


Another important note is that the distribution of the different classes in both the training and the test set should be equal to the distribution in the actual dataset. For example, if you have a dataset with review-texts which contains 20% negative and 80% positive reviews, both the training and the test set should have this 20% / 80% ratio. The best way to do this, is to split the dataset into a training and test set randomly.



With the dataset splitted into a training and test set, we can start building a classification model. We will do this in a slightly different way as usual. The idea behind this is that, when we start with a new dataset, we don’t know which (type of) classifier will perform best on this dataset. Will it be a ensemble classifier like Gradient Boosting or Random Forest, or a classifier which uses a functional approach like Logistic Regression, a classifier which uses a statistical approach like Naive Bayes etc.?

Because we dont know this, and nowadays computational power is cheap to get, we will try out all types of classifiers first and later we can continue to optimize the best performing classifier of this inital batch of classifiers. For this we have to make an dictionary, which contains as keys the name of the classifiers and as values an instance of the classifiers.


Then we can iterate over this dictionary, and for each classifier:

  • train the classifier with .fit(X_train, Y_train)
  • evaluate how the classifier performs on the training set with .score(X_train, Y_train)
  • evaluate how the classifier perform on the test set with .score(X_test, Y_test).
  • keep track of how much time it takes to train the classifier with the time module.
  • save the trained model, the training score, the test score, and the training time into a dictionary. If necessary this dictionary can be saved with Python’s pickle module.



The reason why we keep track of the time it takes to train a classifier, is because in practice this is also an important indicator of whether or not you would like to use a specific classifier. If there are two classifiers with similar results, but one of them takes much less time to train you probably want to use that one.

The score() method simply return the result of the accuracy_score() method in the metrics module. This module, contains many methods for evualating classification or regression models and I can recommend you to spent some time to learn which metrics you can use to evaluate your model.
The classification_report method for example, calculates the precision, recall and f1-score for all of the classes in your dataset. If you are looking for ways to improve the accuracy of your classifier, or if you want to know why the accuracy is lower than expected, such detailed information about the performance of the classifier on the dataset can point you in the right direction.


The accuracy on the training set, accuracy on the test set, and the duration of the training was saved into a dictionary, and we can use the display_dict_models() method to visualize the results ordered by the test score.


What we are doing feels like a brute force approach, where a large number of classifiers are build to see which one performs best. It gives us an idea which classifier will perform better for a particular dataset and which one will not. After that you can continue with the best (or top 3) classifier, and try to improve the results by tweaking the parameters of the classifier, or by adding more features to the dataset.

As we can see, the Gradient Boosting classifier performs the best for this dataset. Actually, classifiers like Random Forest and Gradient Boosting classification performs best for most datasets and challenges on Kaggle (That does not mean you should rule out all other classifiers).

For the ones who are interested in the theory behind these classifiers, scikit-learn has a pretty well written user guide. Some of these classifiers were also explained in previous posts, like the naive bayes classifier, logistic regression and support vector machines was partially explained in the perceptron blog.



2.4 Improving upon the Classifier: hyperparameter optimization

After we have determined with a quick and dirty method which classifier performs best for the dataset, we can improve upon the Classifier by optimizing its hyper-parameters.



3. Classification of the mushroom dataset:

The second dataset we will have a look at is the mushroom dataset, which contains data on edible vs poisonous mushrooms. In the dataset there are 8124 mushrooms in total (4208 edible and 3916 poisonous) described by 22 features each.

The big difference with the glass dataset is that these features don’t have a numerical, but a categorical value. Because this dataset contains categorical values, we need one extra step in the classification process, which is the encoding of these values (see section 3.3).


3.1 Loading, analyzing and pre-processing the data.

A fast way to find out what type of categorical data a dataset contains, is to print out the unique values of each column in this dataframe. In this way we can also see whether the dataset contains any missing values or redundant columns.



As we can see, there are 22 categorical features. Of these, the feature ‘veil-type’ only contains one value ‘p’ and therefore does not provide any added value for any classifier. The best thing to do is to remove columns like this which only contain one value.



3.2 Imputing Missing values

Some datasets contain missing values in the form of NaN, null, NULL, ‘?’, ‘??’ etc
It could be that all missing values are of type NaN, or that some columns contain NaN and other columns contain missing data in the form of ‘??’.

It is up to your best judgement to decide what to do with these missing values. What is most effective, really depends on the type of data, the type of missing data and the ratio between missing data and non-missing data.

  • If the number of rows containing missing data is only a few percent of the total dataset, the best option could be to drop those rows. If half of the rows contain missing values, we could lose valuable information by dropping all of them.
  • If there is a row or column which contains almost only missing data, it will not have much added value and it might be best to drop that column.
  • It could be that a value not being filled in also is information which helps with the classification and it is best to leave it like it is.
  • Maybe we really need to improve the accuracy and the only way to do this is by Imputing the missing values.
  • etc, etc

Below we will look at a few ways in which you can either remove the missing values, or impute them.


3.2.1 Drop rows with missing values


3.2.2 Drop column with more than X percent missing values


3.2.3 Fill missing values with zeros


3.2.4 Fill missing values with backward fill


3.2.5 Fill missing values with forward fill


3.3 Encoding categorical data

Most classifier can only work with numerical data, and will raise an error when categorical values in the form of strings is used as input. When it comes to columns with categorical data, you can do two things.

  • 1) One-hot encode the column such that its categorical values are converted to numerical values.
  • 2) Expand the column into N different columns containing binary values.


Example: Let assume that we have a column called ‘FRUIT’ which contains the unique values [‘ORANGE’, ‘APPLE’, PEAR’].

  • In the first case it would be converted to the unique values [0, 1, 2]
  • In the second case it would be converted into three different columns called [‘FRUIT_IS_ORANGE’, ‘FRUIT_IS_APPLE’, ‘FRUIT_IS_PEAR’] and after this the original column ‘FRUIT’ would be deleted. The three new columns would contain the values 1 or 0 depending on the value of the original column.


When using the first method, you should pay attention to the fact that some classifiers will try to make sense of the numerical value of the one-hot encoded column. For example the Nearest Neighbour algorithm assumes that the value 1 is closer to 0 than the value 2. But the numerical values have no meaning in the case of one-hot encoded columns (an APPLE is not closer to an ORANGE than a PEAR is.) and the results therefore can be misleading.


3.3.1 One-Hot encoding the columns with categorical data


3.3.2 Expanding the columns with categorical data



3.4 Classification and Validation

We have seen that there are two different ways to handle columns with categorical data, and many different ways to handle missing values.

Since computation power is cheap, it is easy to try out all of classifiers present on all of the different ways we have imputed missing values.

After we have seen which method and which classifier has the highest accuracy initially we can continue in that direction.

Again, we will split the dataset into a 70% training set and a 30% test set and start training and validating a batch of the eight most used classifiers.


As we can see here, the accuracy of the classifiers for this dataset is actually quiet high.



4. Understanding complex datasets

Some datasets contain a lot of features and it is not immediatly clear which of these features are helping with the Classification / Regression, and which of these features are only adding more noise.

To have a better understanding of how the dataset is made up of its features, we will discuss a few methods which can give more insight in the next few sections.


4.1 The correlation matrix

To get more insight in how (strongly) each feature is correlated with the Type of glass, we can calculate and plot the correlation matrix for this dataset.

Figure 2. A correlation matrix describing the correlation between the different features in the dataset.

The correlation matrix shows us for example that the oxides ‘Mg’ and ‘Al’ are most strongly correlated with the Type of glass. The content of ‘Ca’ is least strongly correlated with the type of glass. For some dataset there could be features with no correlation at all; then it might be a good idea to remove these since they will only function as noise.


4.2 Correlation of a single feature with other features

A correlation matrix is a good way to get a general picture of how all of features in the dataset are correlated with each other. For a dataset with a lot of features it might become very large and the correlation of a single feature with the other features becomes difficult to discern.

If you want to look at the correlations of a single feature, it usually is a better idea to visualize it in the form of a bar-graph:



Figure 3. The correlation of feature ‘Type’ with all other features.



4.3 Cumulative Explained Variance

The Cumulative explained variance shows how much of the variance is captures by the first x features.

Below we can see that the first 4 features (i.e. the four features with the largest correlation) already capture 90% of the variance.


Figure 4. The cumulative explained variance.



If you have low accuracy values for your Regression / Classification model, you could decide to stepwise remove the features with the lowest correlation, (or stepwise add features with the highest correlation).



4.4 Pairwise relationships between the features

In addition to the correlation matrix, you can plot the pairwise relationships between the features, to see how these features are correlated.


Figure 4. The pairwise relationship between the features.



5. Final Words

In my opinion, the best way to master the scikit-learn library is to simply start coding with it. I hope this blog-post gave some insight into the working of scikit-learn library, but for the ones who need some more information, here are some useful links:

dataschool – machine learning with scikit-learn video series

Classification example using the iris dataset

Official scikit-learn documentation

Scikit-Learn Cheat Sheet


Share This:

14 gedachten over “Classification with Scikit-Learn

  1. 2.3 Classification and validation
    def get_train_test(df, y_col, ratio):
    mask = np.random.rand(len(df)) < ratio
    What is < ratio
    When I run this , i get the following error
    NameError: name ‘lt’ is not defined

  2. Hi Ahmet,
    thanks for your examples. We used it for educational purposes, however please revisit the section where you transform non-numerical values to numerical values.
    If I understand you correctly, you run correlations on numerical values which do not have any ‘numerical meaning’ but are strictly categorical (‘glass type’); the results (other than ‘ 0’) therefore are arbitrary and strongly depend only on the ordering of the types.

    1. Hi Simon,
      The correlation matrix was actually made on the glass dataset, which does not contain any categorical data. But I did use this opportunity to update the code and explicitly mention the dangers of one-hot encoding categorical data, and also added a few additional sections.

  3. Great article! I like the useful plots. How could you perform feature subset selection based on groups of subsets that have to be considered together? (Where a single feature is FeatureHashed to a fixed size vector). Do you have any resources for this case?

  4. Merhaba hocam elinize sağlık. Benim sorunum şu. Predict yaparken 3 saat falan bekliyor acaba nedeni RAM in yetmediğinden mi ?

    1. Merhaba Baris, predict yaparkene mi fit yaparkene mi uzun suruyor? Uzun surmesinin bir koc sebebi olabilir ama daha fazla RAM herzaman iyidir

  5. Hi Ahmet,

    Thanks for another very informative post, I think I’ve learned more from your tutorials than my college lecturers! One question that came up here and in one of my projects was how to generate correlation matrices with both categorical & numerical data? How would you approach this in the binary and multi-class case?

    1. Hi Liam,
      Thank you for the compliment.
      What I usually do with categorical data is as follows; If the number of categorical values for a specific column is too large (lets say > 50) I LabelEncode it. Here the number of columns remains the same since each categorical value is encoded into a numerical value.
      If the number of categorical values in a specific column is not too large, I OneHote Encode that column. This is described in section 3.3

    1. That is a good question.
      This is an old post, either I was not aware of train_test_split of sklearn at that time or wanted to code it myself 🙂

Geef een antwoord

Het e-mailadres wordt niet gepubliceerd.