Stochastic Signal Analysis is a field of science concerned with the processing, modification and analysis of (stochastic) signals.

Anyone with a background in Physics or Engineering knows signal analysis techniques to some degree: what these techniques are and how they can be used to analyze, model and classify signals.

Data Scientists coming from different fields, like Computer Science or Statistics, might not be aware of the analytical power these techniques bring with them.

In this blog post, we will have a look at how we can use Stochastic Signal Analysis techniques, in combination with traditional Machine Learning Classifiers for accurate classification and modelling of time-series and signals.

At the end of this blog-post you should be able to understand the various signal-processing techniques which can be used to retrieve features from signals, and use them to classify ECG signals (and even identify a person by their ECG signal), predict seizures from EEG signals, classify and identify targets in radar signals, and identify patients with neuropathy or myopathy from EMG signals, by using the FFT and related techniques.

In this blog-post we’ll discuss the following topics:

- Basics of Signals
- Transformations between time- and frequency-domain by means of FFT, PSD and autocorrelation.
- Statistical parameter estimation and feature extraction
- Example dataset: Classification of human activity
- Extracting features from all signals in the training and test set

- Classification with (traditional) Scikit-learn classifiers
- Final words

You might have come across the terms *time-series* and *signal* describing datasets, and it might not be clear what the exact difference between them is.

In a time-series dataset the to-be-predicted value is a function of time. Such a function can describe anything, from the value of bitcoin or a specific stock over time, to fish population over time. A signal is a more general version of this, where the dependent variable does not have to be a function of time; it can also be a function of spatial coordinates, distance from the source, etc.

Signals can come in many different forms and shapes: you can think of audio signals, pictures, video signals, geophysical signals (seismic data), sonar and radar data and medical signals (EEG, ECG, EMG).

- A picture can be seen as a signal which contains information about the brightness of the three colors (RGB) across the two spatial dimensions.
- Sonar signals give information about an acoustic pressure field as a function of time and the three spatial dimensions.
- Radar signals do the same thing for electromagnetic waves.

In essence, almost anything can be interpreted as a signal as long as it carries information within itself.

If you open up a textbook on Signal Processing it will usually be divided into two parts: the continuous-time domain and the discrete-time domain.

The difference is that continuous signals have an independent variable which is (as the name suggests) continuous in nature, i.e. the signal has a value at every instant within its domain. No matter how far you 'zoom in', there will be a value at that point in time.

Discrete-time signals are only defined at specific time-steps. For example, if the sampling period of a discrete signal is T, it will be defined at 0, T, 2T, 3T, etc., but not at the instants in between.

Most of the signals you come across in nature are analog (continuous); think of the electrical signals in your body, human speech, any other sound you hear, the amount of light measured during the day, barometric pressure, etc etc.

Whenever you want to digitize one of these analog signals in order to analyze and visualize it on a computer, it becomes discrete. Since we will only be concerning ourselves with digital signals in this blog-post, we will only look at the discrete version of the various stochastic signal analysis techniques.

Digitizing an analog signal is usually done by sampling it with a specific sampling rate. In Figure 1 we can see a signal sampled at different frequencies.

As we can see, it is important to choose a good sampling rate; if the sampling rate is chosen too low, the discrete signal will no longer contain the characteristics of the original analog signal and important features defining the signal are lost.

To be more specific, a signal is said to be under-sampled if the sampling rate is smaller than the Nyquist rate. The Nyquist rate is twice the highest frequency present in the signal.

Under-sampling a signal can lead to effects like aliasing [2] and the wagon-wheel effect (see video). To prevent under-sampling, a sampling frequency much higher than the Nyquist rate is usually chosen.
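Aliasing is easy to demonstrate numerically. The sketch below (my own illustration, not from the original post) samples a 9 Hz sine at only 10 Hz, well below its Nyquist rate of 18 Hz; the resulting samples are indistinguishable from those of a (sign-flipped) 1 Hz sine, i.e. the 9 Hz tone aliases to |9 − 10| = 1 Hz:

```python
import numpy as np

f_s = 10                         # sampling rate [Hz], below the Nyquist rate of 2 * 9 = 18 Hz
t = np.arange(0, 1, 1 / f_s)     # the sample instants

# A 9 Hz sine sampled at 10 Hz yields exactly the same samples
# as a 1 Hz sine with flipped sign: the 9 Hz tone aliases to 1 Hz.
samples_9hz = np.sin(2 * np.pi * 9 * t)
samples_1hz_flipped = -np.sin(2 * np.pi * 1 * t)

print(np.allclose(samples_9hz, samples_1hz_flipped))  # True
```

Once the samples are taken, no algorithm can tell the two signals apart, which is why the sampling rate must be chosen before digitization.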

If the signal contains a pattern which repeats itself after a specific period of time, we call it a *periodic* signal.

The time it takes for a periodic signal to repeat itself is called the period T (and the distance it travels during one period is called the wavelength λ).

The frequency f is the inverse of the period: f = 1 / T. If a signal has a period of 1 second, its frequency is 1 Hz, and if the period is 0.5 seconds, the frequency is 2 Hz.

The period, wavelength and frequency are related to each other via formula (1):

(1)   λ = v · T = v / f

where v is the speed of the wave (for audio signals, the speed of sound).

Fourier analysis is a field of study used to analyze the periodicity in (periodic) signals. If a signal contains components which are periodic in nature, Fourier analysis can be used to decompose this signal into its periodic components, and it tells us what the frequencies of these periodic components are.

For example, if we measure your heart beat and at the time of measurement you have a heart rate of 60 beats per minute, the signal will have a frequency of 1 Hz (a period of 1 s corresponds to a frequency of 1 Hz). If at the same time you are doing some repetitive task where you move your fingers every two seconds, the signal going to your hand will have a frequency of 0.5 Hz (a period of 2 s corresponds to a frequency of 0.5 Hz). An electrode placed on your arm will measure the combination of these two signals, and a Fourier analysis performed on the combined signal will show us a peak in the frequency spectrum at 0.5 Hz and one at 1 Hz.

So, two (or more) different signals (with different frequencies, amplitudes, etc) can be mixed together to form a new composite signal. The new signal then consists of all of its component signals.

The reverse is also true: every signal – no matter how complex it looks – can be decomposed into a sum of simpler signals. These simpler signals are trigonometric functions (sine and cosine waves). This was discovered (in 1822) by Joseph Fourier and it is what Fourier analysis is about. The mathematical function which transforms a signal from the time-domain to the frequency-domain is called the Fourier Transform, and the function which does the opposite is called the Inverse Fourier Transform.
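The forward and inverse transforms are exact inverses of each other, which we can verify directly with NumPy's FFT routines. This is a minimal round-trip check of my own, not code from the original post:

```python
import numpy as np

# Build an arbitrary signal as a sum of two sines plus a little noise
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
signal = (np.sin(2 * np.pi * 5 * t)
          + 0.5 * np.sin(2 * np.pi * 12 * t)
          + 0.1 * rng.standard_normal(256))

# Transform to the frequency domain and straight back again
spectrum = np.fft.fft(signal)
reconstructed = np.fft.ifft(spectrum).real

print(np.allclose(signal, reconstructed))  # True
```

No information is lost going back and forth; the frequency domain is just a different representation of the same signal.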

If you want to know how the Fourier transform works, 3blue1brown’s beautifully animated explanation will hopefully give you more insight.

Below, we can see this in action. We have five sine-waves (blue signals) with amplitudes 4, 6, 8, 10 and 14 and frequencies 6.5, 5, 3, 1.5 and 1 Hz. By combining these signals we form a new composite signal (black). The Fourier Transform transforms this signal to the frequency-domain (red signal) and shows us at which frequencies the component signals oscillate.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

t_n = 10
N = 1000
T = t_n / N
f_s = 1 / T

x_value = np.linspace(0, t_n, N)
amplitudes = [4, 6, 8, 10, 14]
frequencies = [6.5, 5, 3, 1.5, 1]
y_values = [amplitudes[ii] * np.sin(2 * np.pi * frequencies[ii] * x_value)
            for ii in range(0, len(amplitudes))]
composite_y_value = np.sum(y_values, axis=0)

# get_fft_values is defined in the next section
f_values, fft_values = get_fft_values(composite_y_value, T, N, f_s)

colors = ['k', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b']

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel("\nTime [s]", fontsize=16)
ax.set_ylabel("\nFrequency [Hz]", fontsize=16)
ax.set_zlabel("\nAmplitude", fontsize=16)

# y-position of each line: the composite signal (index 0) is drawn at 0 Hz,
# each component at its own frequency
y_values_ = [composite_y_value] + list(reversed(y_values))
plot_frequencies = [0, 1, 1.5, 3, 5, 6.5]

for ii in range(0, len(y_values_)):
    signal = y_values_[ii]
    color = colors[ii]
    length = signal.shape[0]
    x = np.linspace(0, 10, 1000)
    y = np.array([plot_frequencies[ii]] * length)
    z = signal
    linewidth = 4 if ii == 0 else 2
    ax.plot(list(x), list(y), zs=list(z), linewidth=linewidth, color=color)

# finally, draw the frequency spectrum (red) at the back wall of the plot
x = [10] * 75
y = f_values[:75]
z = fft_values[:75] * 3
ax.plot(list(x), list(y), zs=list(z), linewidth=2, color='red')

plt.tight_layout()
plt.show()
```

The Fast Fourier Transform (FFT) is an efficient algorithm for calculating the Discrete Fourier Transform (DFT) and is the de facto standard way to calculate a Fourier Transform. It is present in almost every scientific computing library and package, in every programming language.

Nowadays the Fourier transform is an indispensable mathematical tool used in almost every aspect of our daily lives. In the next section we will have a look at how we can use the FFT and other Stochastic Signal analysis techniques to classify time-series and signals.

In Python, the FFT of a signal can be calculated with the SciPy library.

Below, we can see how we can use SciPy to calculate the FFT of the composite signal above, and retrieve the frequency values of its component signals.

```python
from scipy.fftpack import fft

def get_fft_values(y_values, T, N, f_s):
    f_values = np.linspace(0.0, 1.0 / (2.0 * T), N // 2)
    fft_values_ = fft(y_values)
    # keep the first half of the spectrum and normalize the amplitudes
    fft_values = 2.0 / N * np.abs(fft_values_[0:N // 2])
    return f_values, fft_values

t_n = 10
N = 1000
T = t_n / N
f_s = 1 / T

f_values, fft_values = get_fft_values(composite_y_value, T, N, f_s)

plt.plot(f_values, fft_values, linestyle='-', color='blue')
plt.xlabel('Frequency [Hz]', fontsize=16)
plt.ylabel('Amplitude', fontsize=16)
plt.title("Frequency domain of the signal", fontsize=16)
plt.show()
```

Since our signal is sampled at a rate f_s of 100 Hz, the FFT will return the frequency spectrum up to a frequency of 50 Hz. The higher your sampling rate is, the higher the maximum frequency the FFT can calculate.

In the `get_fft_values` function above, the `scipy.fftpack.fft` function returns a vector of complex values. Each complex value encodes both the magnitude and the phase of that frequency component. Since we are only interested in the magnitude of the amplitudes, we use `np.abs()` to take the magnitude (absolute value) of the complex frequency spectrum.

The FFT of an input signal of N points will return a vector of N points. The first half of this vector (N/2 points) contains the useful values of the frequency spectrum, from 0 Hz up to the Nyquist frequency of f_s / 2. The second half contains the complex conjugate of the first half and can be disregarded, since it provides no additional information.
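This conjugate symmetry can be checked directly. The snippet below (a small illustration of my own) takes an arbitrary real-valued signal and verifies that bin k and bin N − k of its FFT are complex conjugates of each other:

```python
import numpy as np

N = 8
x = np.random.default_rng(1).standard_normal(N)  # any real-valued signal
X = np.fft.fft(x)

# For a real input, bin k and bin N - k are complex conjugates,
# so the second half of the spectrum carries no new information.
for k in range(1, N // 2):
    assert np.isclose(X[k], np.conj(X[N - k]))
print("conjugate symmetry holds")
```

This is also why the `get_fft_values` function above only keeps the first `N // 2` values.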

Closely related to the Fourier Transform is the concept of Power Spectral Density.

Similar to the FFT, it describes the frequency spectrum of a signal. But in addition to the FFT, it also takes the power distribution at each frequency (bin) into account. Generally speaking, the locations of the peaks in the frequency spectrum will be the same as in the FFT case, but the height and width of the peaks will differ. The area below a peak corresponds with the power contained at that frequency.

Calculating the power spectral density is a bit easier, since SciPy contains a function which not only returns a vector of amplitudes, but also a vector containing the tick values of the frequency axis.

```python
from scipy.signal import welch

def get_psd_values(y_values, T, N, f_s):
    # Welch's method: estimate the PSD by averaging windowed periodograms
    f_values, psd_values = welch(y_values, fs=f_s)
    return f_values, psd_values

t_n = 10
N = 1000
T = t_n / N
f_s = 1 / T

f_values, psd_values = get_psd_values(composite_y_value, T, N, f_s)

plt.plot(f_values, psd_values, linestyle='-', color='blue')
plt.xlabel('Frequency [Hz]')
plt.ylabel('PSD [V**2 / Hz]')
plt.show()
```

The auto-correlation function calculates the correlation of a signal with a time-delayed version of itself. The idea behind it is that if a signal contains a pattern which repeats itself after a period of T seconds, there will be a high correlation between the signal and a version of itself delayed by T seconds.

Unfortunately there is no standard function to calculate the auto-correlation of a signal in SciPy, but we can easily make one ourselves using the `correlate()` function of NumPy. Our function returns the correlation value as a function of the time-delay. Naturally, this time-delay cannot be larger than the full length of the signal (10 seconds for our composite signal).

```python
def autocorr(x):
    result = np.correlate(x, x, mode='full')
    # keep only the non-negative time-delays
    return result[len(result) // 2:]

def get_autocorr_values(y_values, T, N, f_s):
    autocorr_values = autocorr(y_values)
    x_values = np.array([T * jj for jj in range(0, N)])
    return x_values, autocorr_values

t_n = 10
N = 1000
T = t_n / N
f_s = 1 / T

t_values, autocorr_values = get_autocorr_values(composite_y_value, T, N, f_s)

plt.plot(t_values, autocorr_values, linestyle='-', color='blue')
plt.xlabel('time delay [s]')
plt.ylabel('Autocorrelation amplitude')
plt.show()
```

Converting the positions of the auto-correlation peaks from the time-domain to the frequency-domain should result in the same peaks as the ones calculated by the FFT. The frequency of a signal can thus be found with the auto-correlation as well as with the FFT.

However, because it is more precise, the FFT is almost always used for frequency detection.
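As a small sketch (my own, not from the original post) of how frequency detection via the auto-correlation works: for a 5 Hz sine sampled at 100 Hz, the first peak after lag zero sits at a delay of one full period, and inverting that delay recovers the frequency:

```python
import numpy as np

f_s = 100                        # sampling rate [Hz]
t = np.arange(0, 2, 1 / f_s)
x = np.sin(2 * np.pi * 5 * t)    # 5 Hz sine

# Auto-correlation via np.correlate; keep non-negative lags only
r = np.correlate(x, x, mode='full')[len(x) - 1:]

# The first local maximum after lag 0 sits at one full period (20 samples)
peak_lag = next(k for k in range(1, len(r) - 1) if r[k - 1] < r[k] > r[k + 1])
estimated_freq = f_s / peak_lag

print(estimated_freq)  # 5.0
```

For real, noisy signals the peak is broader, which is one reason the FFT usually gives a more precise estimate.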

Fun fact: the auto-correlation and the PSD are Fourier Transform pairs, i.e. the PSD can be calculated by taking the FFT of the auto-correlation function, and the auto-correlation can be calculated by taking the Inverse Fourier Transform of the PSD function.
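This Fourier-pair relationship (the Wiener–Khinchin theorem) can be verified numerically. The sketch below is my own illustration, using the *circular* autocorrelation so the identity holds exactly for a finite discrete signal:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(128)

# Power spectrum (the periodogram, up to a scale factor)
power_spectrum = np.abs(np.fft.fft(x)) ** 2

# Circular autocorrelation computed directly in the time domain
N = len(x)
circ_autocorr = np.array([np.sum(x * np.roll(x, -k)) for k in range(N)])

# Wiener-Khinchin: the inverse FFT of the power spectrum equals
# the (circular) autocorrelation of the signal
print(np.allclose(np.fft.ifft(power_spectrum).real, circ_autocorr))  # True
```

In other words, the PSD and the autocorrelation carry exactly the same information, just expressed in different domains.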

One transform which we have not mentioned here is the Wavelet transform. It transforms a signal into the frequency domain, just like the Fourier Transform.

The difference is: the Fourier Transform has a very high resolution in the frequency domain, and zero resolution in the time domain; we know at which frequencies the signal oscillates, but not at which times these oscillations occur. The output of a Wavelet transform has a high resolution in the frequency domain while also maintaining information in the time domain.

The wavelet transform is better suited for analyzing signals with a dynamic frequency spectrum, i.e. the frequency spectrum changes over time, or has a component with a different frequency localized in time (the frequency changes abruptly for a short period of time).

Let's have a look at how to use the wavelet transform in Python in the next blog-post. For the ones who cannot wait to get started with it, here are some examples of applications using the wavelet transform.

We have seen three different ways to calculate characteristics of signals using the FFT, PSD and the autocorrelation function. These functions transform a signal from the time-domain to the frequency-domain and give us its frequency spectrum.

After we have transformed a signal to the frequency-domain, we can extract features from each of these transformed signals and use these features as input in standard classifiers like Random Forest, Logistic Regression, Gradient Boosting or Support Vector Machines.

Which features can we extract from these transformations? A good first step is the values of the frequencies at which oscillations occur and the corresponding amplitudes. In other words: the x- and y-positions of the peaks in the frequency spectrum.

SciPy contains some methods to find the relative maxima (argrelmax) and minima (argrelmin) in data, but I found the peak detection method of Marcos Duarte much simpler and easier to use.
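If you prefer a library routine over the standalone `detect_peaks` function, newer SciPy versions (≥ 1.1) also ship `scipy.signal.find_peaks`, which plays a similar role; its `height` argument acts like the minimum-peak-height (`mph`) threshold used in this post. A small illustration of my own:

```python
import numpy as np
from scipy.signal import find_peaks

# A toy spectrum with two clear peaks at indices 1 and 4
y = np.array([0.0, 1.0, 0.2, 0.1, 3.0, 0.2, 0.0])

# 'height' filters out local maxima below the given threshold,
# similar to the mph argument of detect_peaks
indices, properties = find_peaks(y, height=0.5)

print(indices)                      # [1 4]
print(properties['peak_heights'])   # [1. 3.]
```

The returned indices can be used exactly like the output of `detect_peaks` to look up the x- and y-positions of the peaks.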

Armed with this peak-finding function, we can calculate the FFT, PSD and the auto-correlation of each signal and use the x and y coordinates of the peaks as input for our classifier. This is illustrated in Figure 6.

Let's have a look at how we can classify the signals in the Human Activity Recognition Using Smartphones Data Set. This dataset contains measurements of 30 people between the ages of 19 and 48. The measurements are done with a smartphone placed on the waist while doing one of the following six activities:

- walking,
- walking upstairs,
- walking downstairs,
- sitting,
- standing or
- laying.

The measurements are done at a constant rate of 50 Hz. After filtering out the noise, the signals are cut into fixed-width windows of 2.56 sec with an overlap of 1.28 sec. Each signal will therefore have 50 × 2.56 = 128 samples in total. This is illustrated in Figure 7a.

The smartphone measures three-axial linear body acceleration, three-axial linear total acceleration and three-axial angular velocity. So per measurement, the total signal consists of nine components (see Figure 7b).

The dataset is already split into a training and a test part, so we can immediately load the signals into a NumPy ndarray.

```python
def read_signals(filename):
    with open(filename, 'r') as fp:
        data = fp.read().splitlines()
        data = map(lambda x: x.rstrip().lstrip().split(), data)
        data = [list(map(float, line)) for line in data]
        data = np.array(data, dtype=np.float32)
    return data

def read_labels(filename):
    with open(filename, 'r') as fp:
        activities = fp.read().splitlines()
        activities = list(map(int, activities))
    return np.array(activities)

INPUT_FOLDER_TRAIN = './UCI_HAR/train/InertialSignals/'
INPUT_FOLDER_TEST = './UCI_HAR/test/InertialSignals/'

INPUT_FILES_TRAIN = ['body_acc_x_train.txt', 'body_acc_y_train.txt', 'body_acc_z_train.txt',
                     'body_gyro_x_train.txt', 'body_gyro_y_train.txt', 'body_gyro_z_train.txt',
                     'total_acc_x_train.txt', 'total_acc_y_train.txt', 'total_acc_z_train.txt']

INPUT_FILES_TEST = ['body_acc_x_test.txt', 'body_acc_y_test.txt', 'body_acc_z_test.txt',
                    'body_gyro_x_test.txt', 'body_gyro_y_test.txt', 'body_gyro_z_test.txt',
                    'total_acc_x_test.txt', 'total_acc_y_test.txt', 'total_acc_z_test.txt']

train_signals, test_signals = [], []

for input_file in INPUT_FILES_TRAIN:
    signal = read_signals(INPUT_FOLDER_TRAIN + input_file)
    train_signals.append(signal)
# reorder the axes to (no_signals, no_samples, no_components)
train_signals = np.transpose(np.array(train_signals), (1, 2, 0))

for input_file in INPUT_FILES_TEST:
    signal = read_signals(INPUT_FOLDER_TEST + input_file)
    test_signals.append(signal)
test_signals = np.transpose(np.array(test_signals), (1, 2, 0))

LABELFILE_TRAIN = './UCI_HAR/train/y_train.txt'
LABELFILE_TEST = './UCI_HAR/test/y_test.txt'
train_labels = read_labels(LABELFILE_TRAIN)
test_labels = read_labels(LABELFILE_TEST)
```

We have loaded the training set into a ndarray of size (7352, 128, 9) and the test set into a ndarray of size (2947, 128, 9). As you can guess from the dimensions, the number of signals in the training set is 7352 and the number of signals in the test set is 2947. And each signal in the training and test set has a length of 128 samples and 9 different components.

Below, we will visualize the signal itself with its nine components, the FFT, the PSD and auto-correlation of the components, together with the peaks present in each of the three transformations.

```python
import numpy as np
import matplotlib.pyplot as plt

def get_values(y_values, T, N, f_s):
    # time-domain "transformation": return the signal with its time axis in seconds
    x_values = np.array([T * kk for kk in range(0, len(y_values))])
    return x_values, y_values

# mapping from the dataset's label numbers to activity names
activities_description = {1: 'walking', 2: 'walking upstairs', 3: 'walking downstairs',
                          4: 'sitting', 5: 'standing', 6: 'laying'}

labels = ['x-component', 'y-component', 'z-component']
colors = ['r', 'g', 'b']
suptitle = "Different signals for the activity: {}"

xlabels = ['Time [sec]', 'Freq [Hz]', 'Freq [Hz]', 'Time lag [s]']
ylabel = 'Amplitude'
axtitles = [['Acceleration', 'Gyro', 'Total acceleration'],
            ['FFT acc', 'FFT gyro', 'FFT total acc'],
            ['PSD acc', 'PSD gyro', 'PSD total acc'],
            ['Autocorr acc', 'Autocorr gyro', 'Autocorr total acc']]

list_functions = [get_values, get_fft_values, get_psd_values, get_autocorr_values]

N = 128
f_s = 50
t_n = 2.56
T = t_n / N

signal_no = 0
signals = train_signals[signal_no, :, :]
label = train_labels[signal_no]
activity_name = activities_description[label]

f, axarr = plt.subplots(nrows=4, ncols=3, figsize=(12, 12))
f.suptitle(suptitle.format(activity_name), fontsize=16)

for row_no in range(0, 4):
    for comp_no in range(0, 9):
        col_no = comp_no // 3
        plot_no = comp_no % 3
        color = colors[plot_no]
        label = labels[plot_no]
        axtitle = axtitles[row_no][col_no]
        xlabel = xlabels[row_no]
        value_retriever = list_functions[row_no]

        ax = axarr[row_no, col_no]
        ax.set_title(axtitle, fontsize=16)
        ax.set_xlabel(xlabel, fontsize=16)
        if col_no == 0:
            ax.set_ylabel(ylabel, fontsize=16)

        signal_component = signals[:, comp_no]
        x_values, y_values = value_retriever(signal_component, T, N, f_s)
        ax.plot(x_values, y_values, linestyle='-', color=color, label=label)
        if row_no > 0:
            # mark the peaks found in each transformation
            max_peak_height = 0.1 * np.nanmax(y_values)
            indices_peaks = detect_peaks(y_values, mph=max_peak_height)
            ax.scatter(x_values[indices_peaks], y_values[indices_peaks],
                       c=color, marker='*', s=60)
        if col_no == 2:
            ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plt.tight_layout()
plt.subplots_adjust(top=0.90, hspace=0.6)
plt.show()
```

That is a lot of code! But if you strip away the parts which give the plot a nice layout, you can see it basically consists of two for-loops which apply the different transformations to the nine components of the signal and subsequently find the peaks in the resulting frequency spectrum. The transformation is plotted with a line-plot and the found peaks are plotted with a scatter-plot.

The result can be seen in Figure 8.

In Figure 8, we have already shown how to extract features from a signal: transform a signal by means of the FFT, PSD or autocorrelation function and locate the peaks in the transformation with the peak-finding function.

We have already seen how to do that for one signal, so now we simply need to iterate over all signals in the dataset.

```python
def get_first_n_peaks(x, y, no_peaks=5):
    x_, y_ = list(x), list(y)
    if len(x_) >= no_peaks:
        return x_[:no_peaks], y_[:no_peaks]
    else:
        # pad with zeros if fewer than no_peaks peaks are found
        missing_no_peaks = no_peaks - len(x_)
        return x_ + [0] * missing_no_peaks, y_ + [0] * missing_no_peaks

def get_features(x_values, y_values, mph):
    indices_peaks = detect_peaks(y_values, mph=mph)
    peaks_x, peaks_y = get_first_n_peaks(x_values[indices_peaks], y_values[indices_peaks])
    return peaks_x + peaks_y

def extract_features_labels(dataset, labels, T, N, f_s, denominator):
    percentile = 5
    list_of_features = []
    list_of_labels = []
    for signal_no in range(0, len(dataset)):
        features = []
        list_of_labels.append(labels[signal_no])
        for signal_comp in range(0, dataset.shape[2]):
            signal = dataset[signal_no, :, signal_comp]

            # minimum peak height: a fraction of the signal's dynamic range
            signal_min = np.nanpercentile(signal, percentile)
            signal_max = np.nanpercentile(signal, 100 - percentile)
            mph = signal_min + (signal_max - signal_min) / denominator

            features += get_features(*get_psd_values(signal, T, N, f_s), mph)
            features += get_features(*get_fft_values(signal, T, N, f_s), mph)
            features += get_features(*get_autocorr_values(signal, T, N, f_s), mph)
        list_of_features.append(features)
    return np.array(list_of_features), np.array(list_of_labels)

denominator = 10
X_train, Y_train = extract_features_labels(train_signals, train_labels, T, N, f_s, denominator)
X_test, Y_test = extract_features_labels(test_signals, test_labels, T, N, f_s, denominator)
```

This process results in a matrix containing the features of the training set, and a matrix containing the features of the test set. The number of rows of these matrices should be equal to the number of signals in each set (7352 and 2947).

The number of columns in each matrix depends on your choice of features. Each signal has nine components, and for each component you can calculate either just the FFT or all three of the transformations. For each transformation you can decide to look at the first n peaks in the signal. And for each peak you can decide to take only the x value, or both the x and y values. In the example above, we have taken the x and y values of the first 5 peaks of each transform, so we have 270 columns (9 × 3 × 5 × 2) in total.

After constructing the matrices for the training and the test set, together with a list of the correct labels, we can use the scikit-learn package to construct a classifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_train, Y_train)
print("Accuracy on training set is : {}".format(clf.score(X_train, Y_train)))
print("Accuracy on test set is : {}".format(clf.score(X_test, Y_test)))
Y_test_pred = clf.predict(X_test)
print(classification_report(Y_test, Y_test_pred))
```

```
Accuracy on training set is : 1.0
Accuracy on test set is : 0.9097387173396675

             precision    recall  f1-score   support
          1       0.96      0.98      0.97       496
          2       0.94      0.95      0.95       471
          3       0.94      0.90      0.92       420
          4       0.84      0.82      0.83       491
          5       0.86      0.96      0.90       532
          6       0.94      0.85      0.89       537

avg / total       0.91      0.91      0.91      2947
```

As you can see, we were able to classify these signals with quite a high accuracy. The accuracy on the training set is about 1.0 and the accuracy on the test set is about 0.91.

To achieve this accuracy we did not even have to break a sweat. The feature selection was done fully automatically; for each transformation we selected the x and y values of the first five peaks (or used the default value of zero if fewer peaks were present).

It is understandable that some of the 270 features will be more informative than others. It could be that some transformations of some components do not have five peaks, or that the frequency value of a peak is more informative than its amplitude, or that the FFT is always more informative than the auto-correlation.

The accuracy will increase even more if we actively select the features, transformations and components which are important for classification. Maybe we can even choose a different classifier or play around with its parameter values (hyperparameter optimization) to achieve a higher accuracy.

The field of stochastic signal analysis provides us with a set of powerful tools which can be used to analyze, model and classify time-series and signals. I hope this blog-post has provided you with some information on how to use these techniques.

If you found this blog useful feel free to share it with other people and with your fellow Data Scientists. If you think something is missing feel free to leave a comment below!

**PS:** Did you notice we did not use a Recurrent Neural Network? What kind of accuracy can you achieve on this dataset with an RNN?

**PS2:** The code is also available as a Jupyter notebook on my GitHub account.


In the previous blog post we have seen how to build Convolutional Neural Networks (CNN) in Tensorflow, by building various CNN architectures (like LeNet5, AlexNet, VGGNet-16) from scratch and training them on the MNIST, CIFAR-10 and Oxflower17 datasets.

It starts to get interesting when you start thinking about the practical applications of CNNs and other Deep Learning methods. If you have been following the latest technical developments you probably know that CNNs are used for face recognition, object detection, analysis of medical images, automatic inspection in manufacturing processes, natural language processing tasks, and many other applications. You could say that you're only limited by your imagination and creativity (and of course motivation, energy and time) to find practical applications for CNNs.

Inspired by Kaggle’s Satellite Imagery Feature Detection challenge, I would like to find out how easy it is to detect features (roads in this particular case) in satellite and aerial images.

If this is possible, the practical applications will be enormous. In a time of global urbanisation, cities are expanding, growing and changing continuously. This of course comes along with new infrastructure, new buildings and neighbourhoods, and changing landscapes. Monitoring and keeping track of all of these changes has always been a really labour-intensive job. If we could get fresh satellite images every day and use Deep Learning to immediately update all of our maps, it would be a big help for everyone working in this field!

Developments in the field of Deep Learning are happening so fast that ‘simple’ image classification, which was a big hype a few years ago, already seems outdated. Nowadays, also object detection has become mainstream and in the next (few) years we will probably see more and more applications using image segmentation (see figure 1).

In this blog we will use Image classification to detect roads in aerial images.

To do this, we first need to get these aerial images, and get the data containing information on the location of roads (see Section 2.1).

After that we need to map these two layers on top of each other; we will do this in Section 3.1.

After saving the prepared dataset (Section 4.1) in the right format, we can feed it into the convolutional neural network we build (Section 4.3).

We will conclude this blog by looking at the accuracy of this method, and discuss which other methods can be used to improve it.

The contents of this blog-post are as follows:

- Introduction
- Getting the Data
- 2.1 Downloading image tiles with owslib
- 2.2 Reading the shapefile containing the roads of the Netherlands.

- Mapping the two layers of data
- 3.2 Visualizing the Mapping results

- Detecting roads with the convolutional neural network
- 4.1 Preparing a training, test and validation dataset
- 4.2 Saving the datasets as a pickle file
- 4.3 Training a convolutional neural network
- 4.5 Resulting Accuracy

- Final Words

**update**: The code is now also available in a notebook on my GitHub repository

The first (and often the most difficult) step in any Data Science project is always obtaining the data. Luckily there are many open datasets containing satellite images in various forms. There is the Landsat dataset, ESA’s Sentinel dataset, MODIS dataset, the NAIP dataset, etc.

Each dataset has different pros and cons. Some, like the NAIP dataset, offer a high resolution (one-meter resolution), but only cover the US. Others, like Landsat, cover the entire earth but have a lower (30-meter) resolution. Some of them show which type of land-cover (forest, water, grassland) there is, others contain atmospheric and climate data.

Since I am from the Netherlands, I would like to use aerial / satellite images covering the Netherlands, so I will use the aerial images provided by PDOK. These are not only quite up to date, but they also have an incredible resolution of 25 cm.

Dutch governmental organizations have a lot of open data available which could form the second layer on top of these aerial images. With Pdokviewer you can view a lot of these open datasets online:

- think of layers containing the infrastructure (various types of roads, railways and waterways in the Netherlands; the NWB wegenbestand),
- layers containing the boundaries of municipalities and districts,
- physical geographical regions,
- locations of every governmental organizations,
- agricultural areas,
- electricity usage per block,
- type of soil, geomorphological map,
- number of residents per block,
- surface usage, etc etc etc (even the living habitat of the brown long eared bat).

So there are many possible datasets you could use as the second layer, and use it to automatically detect these types of features in satellite images.

PS: Another such site containing a lot of maps is the Atlas Natuurlijk Kapitaal.

The layer that I am interested in is the layer containing the road-types. The map with the road-types (NWB wegenbestand) can be downloaded from the open data portal of the Dutch government. The aerial images are available as a Web Map Service (WMS) and can be downloaded with the Python package owslib.

Below is the code for downloading and saving the image tiles containing aerial photographs with the owslib library.

```python
from owslib.wms import WebMapService

URL = "https://geodata.nationaalgeoregister.nl/luchtfoto/rgb/wms?request=GetCapabilities"
wms = WebMapService(URL, version='1.1.1')

OUTPUT_DIRECTORY = './data/image_tiles/'

x_min = 90000
y_min = 427000
dx, dy = 200, 200
no_tiles_x = 100
no_tiles_y = 100
total_no_tiles = no_tiles_x * no_tiles_y

x_max = x_min + no_tiles_x * dx
y_max = y_min + no_tiles_y * dy
BOUNDING_BOX = [x_min, y_min, x_max, y_max]

for ii in range(0, no_tiles_x):
    print(ii)
    for jj in range(0, no_tiles_y):
        # lower-left corner of this tile
        ll_x_ = x_min + ii * dx
        ll_y_ = y_min + jj * dy
        bbox = (ll_x_, ll_y_, ll_x_ + dx, ll_y_ + dy)
        img = wms.getmap(layers=['Actueel_ortho25'], srs='EPSG:28992',
                         bbox=bbox, size=(256, 256), format='image/jpeg',
                         transparent=True)
        filename = "{}_{}_{}_{}.jpg".format(bbox[0], bbox[1], bbox[2], bbox[3])
        with open(OUTPUT_DIRECTORY + filename, 'wb') as out:
            out.write(img.read())
```

With “dx” and “dy” we can adjust the zoom level of the tiles. A value of 200 corresponds roughly to zoom level 12, and a value of 100 corresponds roughly to zoom level 13.

As you can see, the lower-left x and y, and the upper-right x and y coordinates are used in the filename, so we will always know where each tile is located.
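Since the bounding box is encoded in the filename, we can always recover a tile's position later. A minimal sketch of parsing it back (the helper name `bbox_from_filename` is my own, not part of the original code):

```python
def bbox_from_filename(filename):
    # "90000_427000_90200_427200.jpg" -> (ll_x, ll_y, ur_x, ur_y)
    parts = filename.rsplit('.', 1)[0].split('_')
    return tuple(int(p) for p in parts)

print(bbox_from_filename("90000_427000_90200_427200.jpg"))  # -> (90000, 427000, 90200, 427200)
```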

This process results in 10,000 tiles within the bounding box ((90000, 427000), (110000, 447000)). These coordinates are given in the rijksdriehoekscoordinaten reference system, and they correspond with the coordinates ((51.82781, 4.44428), (52.00954, 4.73177)) in the WGS 84 reference system.

It covers a few square kilometers in the south of Rotterdam (see Figure 3) and contains both urban and non-urban areas, i.e. there are enough roads for our convolutional neural network to be trained with.

Next we will determine the contents of each tile image, using data from the NWB Wegvakken (version September 2017). This is a file containing all of the roads of the Netherlands, which gets updated frequently. It is possible to download it in the form of a shapefile from this location.

Shapefiles contain shapes with geospatial data and are normally opened with GIS software like ArcGIS or QGIS. It is also possible to open it within Python, by using the pyshp library.

```python
import shapefile
import json
from datetime import date, datetime

input_filename = './data/nwb_wegvakken/Wegvakken.shp'
output_filename = './data/nwb_wegvakken/2017_09_wegvakken.json'

reader = shapefile.Reader(input_filename)
fields = reader.fields[1:]
field_names = [field[0] for field in fields]

buffer = []
for sr in reader.shapeRecords():
    atr = dict(zip(field_names, sr.record))
    geom = sr.shape.__geo_interface__
    buffer.append(dict(type="Feature", geometry=geom, properties=atr))

json_file = open(output_filename, "w")
json_file.write(json.dumps({"type": "FeatureCollection", "features": buffer},
                           indent=2, default=JSONencoder) + "\n")
json_file.close()
```

In this code, the list `buffer` contains the contents of the shapefile. Since we don’t want to repeat the same shapefile-reading process every time, we save it in JSON format with `json.dumps()`.

If we try to save the contents of the shapefile as they currently are, this will result in the error `'(...) is not JSON Serializable'`. This is because the shapefile contains datatypes (bytes and datetime objects) which are not natively supported by JSON. So we need to extend the standard JSON serializer so that it converts these datatypes into ones which can be serialized. This is what the method `JSONencoder` does. (For more on this, see here.)

```python
from datetime import date, datetime

def JSONencoder(obj):
    """JSON serializer for objects not serializable by default json code"""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, bytes):
        return {'__class__': 'bytes', '__value__': list(obj)}
    raise TypeError("Type %s not serializable" % type(obj))
```

If we look at the contents of this shapefile, we will see that it contains a list of objects of the following type:

```python
{'properties': {'E_HNR_LNKS': 1, 'WEGBEHSRT': 'G', 'EINDKM': None, 'RIJRICHTNG': '',
                'ROUTELTR4': '', 'ROUTENR3': None, 'WEGTYPE': '', 'ROUTENR': None,
                'GME_ID': 717, 'ADMRICHTNG': '', 'WPSNAAMNEN': 'DOMBURG', 'DISTRCODE': 0,
                'WVK_BEGDAT': '1998-01-21', 'WGTYPE_OMS': '', 'WEGDEELLTR': '#',
                'HNRSTRRHTS': '', 'WEGBEHNAAM': 'Veere', 'JTE_ID_BEG': 47197012,
                'DIENSTNAAM': '', 'WVK_ID': 47197071, 'HECTO_LTTR': '#', 'RPE_CODE': '#',
                'L_HNR_RHTS': None, 'ROUTELTR3': '', 'WEGBEHCODE': '717', 'ROUTELTR': '',
                'GME_NAAM': 'Veere', 'L_HNR_LNKS': 11, 'POS_TV_WOL': '', 'BST_CODE': '',
                'BEGINKM': None, 'ROUTENR2': None, 'DISTRNAAM': '', 'ROUTELTR2': '',
                'WEGNUMMER': '', 'ENDAFSTAND': None, 'E_HNR_RHTS': None, 'ROUTENR4': None,
                'BEGAFSTAND': None, 'DIENSTCODE': '', 'STT_NAAM': 'Van Voorthuijsenstraat',
                'WEGNR_AW': '', 'HNRSTRLNKS': 'O', 'JTE_ID_END': 47197131},
 'type': 'Feature',
 'geometry': {'coordinates': [[23615.0, 398753.0], [23619.0, 398746.0],
                              [23622.0, 398738.0], [23634.0, 398692.0]],
              'type': 'LineString'}}
```

It contains a lot of information (what each field means is specified in the manual), but of immediate importance to us are the values of:

- ‘WEGBEHSRT’ –> This indicates the road-type
- ‘coordinates’ –> These are the coordinates of this specific object given in the ‘rijkscoordinaten’ system.

The different road-types present in NWB Wegvakken are:

Now it is time to determine which tiles contain roads, and which tiles do not. We will do this by mapping the contents of NWB-Wegvakken on top of the downloaded aerial photograph tiles.

We can use a Python dictionary to keep track of the mappings. We will also use a dictionary to keep track of the types of road present in each tile.

```python
# First we define some variables and dictionary keys which are used throughout the rest.
dict_roadtype = {
    "G": 'Gemeente',
    "R": 'Rijk',
    "P": 'Provincie',
    "W": 'Waterschap',
    'T': 'Andere wegbeheerder',
    '': 'leeg'
}

dict_roadtype_to_color = {
    "G": 'red',
    "R": 'blue',
    "P": 'green',
    "W": 'magenta',
    'T': 'yellow',
    '': 'leeg'
}

FEATURES_KEY = 'features'
PROPERTIES_KEY = 'properties'
GEOMETRY_KEY = 'geometry'
COORDINATES_KEY = 'coordinates'
WEGSOORT_KEY = 'WEGBEHSRT'

MINIMUM_NO_POINTS_PER_TILE = 4
POINTS_PER_METER = 0.1

INPUT_FOLDER_TILES = './data/image_tiles/'
filename_wegvakken = './data/nwb_wegvakken/2017_09_wegvakken.json'
dict_nwb_wegvakken = json.load(open(filename_wegvakken))[FEATURES_KEY]

d_tile_contents = defaultdict(list)
d_roadtype_tiles = defaultdict(set)
```

In the code above, the contents of NWB Wegvakken, which we previously converted from a shapefile to JSON format, are loaded into `dict_nwb_wegvakken`.

Furthermore, we initialize two defaultdicts. The first one will be filled with the tiles as keys and a list of their contents as values. The second one will be filled with the road-types as keys and the set of tiles containing each road-type as values. How this is done is shown below:

```python
for elem in dict_nwb_wegvakken:
    coordinates = retrieve_coordinates(elem)
    rtype = retrieve_roadtype(elem)
    coordinates_in_bb = [coord for coord in coordinates if coord_is_in_bb(coord, BOUNDING_BOX)]
    if len(coordinates_in_bb) == 1:
        coord = coordinates_in_bb[0]
        add_to_dict(d_tile_contents, d_roadtype_tiles, coord, rtype)
    if len(coordinates_in_bb) > 1:
        add_to_dict(d_tile_contents, d_roadtype_tiles, coordinates_in_bb[0], rtype)
        for ii in range(1, len(coordinates_in_bb)):
            previous_coord = coordinates_in_bb[ii-1]
            coord = coordinates_in_bb[ii]
            add_to_dict(d_tile_contents, d_roadtype_tiles, coord, rtype)
            dist = eucledian_distance(previous_coord, coord)
            no_intermediate_points = int(dist/10)
            intermediate_coordinates = calculate_intermediate_points(previous_coord, coord,
                                                                     no_intermediate_points)
            for intermediate_coord in intermediate_coordinates:
                add_to_dict(d_tile_contents, d_roadtype_tiles, intermediate_coord, rtype)
```

As you can see, we iterate over the contents of dict_nwb_wegvakken, and for each element, we look up the coordinates and the road-types and check which of these coordinates are inside our bounding box. This is done with the following methods:

```python
def coord_is_in_bb(coord, bb):
    x_min = bb[0]
    y_min = bb[1]
    x_max = bb[2]
    y_max = bb[3]
    return coord[0] > x_min and coord[0] < x_max and coord[1] > y_min and coord[1] < y_max

def retrieve_roadtype(elem):
    return elem[PROPERTIES_KEY][WEGSOORT_KEY]

def retrieve_coordinates(elem):
    return elem[GEOMETRY_KEY][COORDINATES_KEY]

def add_to_dict(d1, d2, coordinates, rtype):
    coordinate_ll_x = int((coordinates[0] // dx) * dx)
    coordinate_ll_y = int((coordinates[1] // dy) * dy)
    coordinate_ur_x = int((coordinates[0] // dx) * dx + dx)
    coordinate_ur_y = int((coordinates[1] // dy) * dy + dy)
    tile = "{}_{}_{}_{}.jpg".format(coordinate_ll_x, coordinate_ll_y,
                                    coordinate_ur_x, coordinate_ur_y)

    rel_coord_x = (coordinates[0] - coordinate_ll_x) / dx
    rel_coord_y = (coordinates[1] - coordinate_ll_y) / dy
    value = (rtype, rel_coord_x, rel_coord_y)

    d1[tile].append(value)
    d2[rtype].add(tile)
```

As you can see, the add_to_dict method first determines which tile a coordinate belongs to, by computing the four coordinates (lower-left x, y and upper-right x, y) each tile is named with.

We also determine the relative position of each coordinate in its tile. The coordinate (99880, 445120) in tile “99800_445000_100000_445200.jpg” will for example have relative coordinates (0.4, 0.6). This is handy later on when you want to plot the contents of a tile.
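The relative-coordinate arithmetic can be checked in isolation. Below is a standalone sketch (the helper name `tile_and_relative_coord` is my own) which reproduces the example above, assuming `dx = dy = 200` as in the download script:

```python
dx, dy = 200, 200  # tile size in meters, as used when downloading the tiles

def tile_and_relative_coord(x, y):
    # Lower-left corner of the tile containing (x, y)
    ll_x = int((x // dx) * dx)
    ll_y = int((y // dy) * dy)
    tile = "{}_{}_{}_{}.jpg".format(ll_x, ll_y, ll_x + dx, ll_y + dy)
    # Position of the point inside the tile, scaled to the interval [0, 1)
    return tile, (x - ll_x) / dx, (y - ll_y) / dy

print(tile_and_relative_coord(99880, 445120))
# -> ('99800_445000_100000_445200.jpg', 0.4, 0.6)
```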

The road-type, together with the relative coordinates, is appended to the list of contents of the tile.

At the same time, we add the tilename to the second dictionary which contains a list of tilenames per road-type.

- If there is only one set of coordinates, and these coordinates are inside our bounding box, we immediately add them to our dictionaries.
- If there is more than one coordinate in an element, we not only add all coordinates to the dictionaries, but also calculate all intermediate points between each pair of subsequent coordinates and add these intermediate points to the dictionaries as well.

This is necessary because two coordinates could form a line describing a road which passes through a tile, while both coordinates happen to lie outside of that tile; without intermediate points we would wrongly conclude that the tile does not contain any road.

This is illustrated in Figure 4. On the left we can see two points describing a road, but they happen to lie outside of the tile, and on the right we also calculate every intermediate point (every 1/POINTS_PER_METER meter) between two points and add the intermediate points to the dictionary.
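The helpers `eucledian_distance` and `calculate_intermediate_points` used above are not shown in the listings. A minimal sketch of how they could be implemented (my own interpretation: `no_points` evenly spaced points strictly between the two coordinates) is:

```python
import numpy as np

def eucledian_distance(p1, p2):
    # Distance in meters (rijksdriehoek coordinates are metric)
    return np.sqrt((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2)

def calculate_intermediate_points(p1, p2, no_points):
    # no_points evenly spaced points strictly between p1 and p2 (endpoints excluded,
    # since those are already added to the dictionaries separately)
    return [(p1[0] + (p2[0] - p1[0]) * ii / (no_points + 1),
             p1[1] + (p2[1] - p1[1]) * ii / (no_points + 1))
            for ii in range(1, no_points + 1)]

print(calculate_intermediate_points((0, 0), (0, 30), 3))
# -> [(0.0, 7.5), (0.0, 15.0), (0.0, 22.5)]
```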

It is always good to make visualizations. It gives us an indication if the mapping of points on the tiles went correctly, whether we have missed any roads, and if we have chosen enough intermediate points between two coordinates to fully cover all parts of the roads.

```python
fig, axarr = plt.subplots(nrows=11, ncols=11, figsize=(16, 16))
for ii in range(0, 11):
    for jj in range(0, 11):
        ll_x = x0 + ii*dx
        ll_y = y0 + jj*dy
        ur_x = ll_x + dx
        ur_y = ll_y + dy
        tile = "{}_{}_{}_{}.jpg".format(ll_x, ll_y, ur_x, ur_y)
        filename = INPUT_FOLDER_TILES + tile
        tile_contents = d_tile_contents[tile]

        ax = axarr[10-jj, ii]
        image = ndimage.imread(filename)
        rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        ax.imshow(rgb_image)
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        for elem in tile_contents:
            color = dict_roadtype_to_color[elem[0]]
            x = elem[1]*256
            y = (1-elem[2])*256
            ax.scatter(x, y, c=color, s=10)
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
```

In Figure 5, we can see two figures, on the left for (x0 = 94400, y0 = 432000) and on the right for (x0 = 93000, y0 = 430000). You can click on them for a larger view.

Next we will load all of the tiles together with their correct labels (road-presence and/or road-type) into a dataset. The dataset is randomized and then split in a training, test and validation part.

```python
image_width = 256
image_height = 256
image_depth = 3
total_no_images = 10000

image_files = os.listdir(INPUT_FOLDER_TILES)
dataset = np.ndarray(shape=(total_no_images, image_width, image_height, image_depth),
                     dtype=np.float32)
labels_roadtype = []
labels_roadpresence = np.ndarray(total_no_images, dtype=np.float32)

for counter, image in enumerate(image_files):
    filename = INPUT_FOLDER_TILES + image
    if image in list(d_tile_contents.keys()):
        tile_contents = d_tile_contents[image]
        roadtypes = sorted(list(set([elem[0] for elem in tile_contents])))
        roadtype = "_".join(roadtypes)
        labels_roadpresence[counter] = 1
    else:
        roadtype = ''
        labels_roadpresence[counter] = 0
    labels_roadtype.append(roadtype)
    image_data = ndimage.imread(filename).astype(np.float32)
    dataset[counter, :, :] = image_data

labels_roadtype_ohe = np.array(list(onehot_encode_labels(labels_roadtype)))
dataset, labels_roadpresence, labels_roadtype_ohe = reformat_data(dataset,
                                                                  labels_roadpresence,
                                                                  labels_roadtype_ohe)
```

We can use the following functions to one-hot encode the labels, and randomize the dataset:

```python
def onehot_encode_labels(labels):
    list_possible_labels = list(np.unique(labels))
    encoded_labels = map(lambda x: list_possible_labels.index(x), labels)
    return encoded_labels

def randomize(dataset, labels1, labels2):
    permutation = np.random.permutation(dataset.shape[0])
    randomized_dataset = dataset[permutation, :, :, :]
    randomized_labels1 = labels1[permutation]
    randomized_labels2 = labels2[permutation]
    return randomized_dataset, randomized_labels1, randomized_labels2

def one_hot_encode(np_array, num_unique_labels):
    return (np.arange(num_unique_labels) == np_array[:, None]).astype(np.float32)

def reformat_data(dataset, labels1, labels2):
    dataset, labels1, labels2 = randomize(dataset, labels1, labels2)
    num_unique_labels1 = len(np.unique(labels1))
    num_unique_labels2 = len(np.unique(labels2))
    labels1 = one_hot_encode(labels1, num_unique_labels1)
    labels2 = one_hot_encode(labels2, num_unique_labels2)
    return dataset, labels1, labels2
```

This whole process of loading the dataset into memory, and especially randomizing the order of the images usually takes a really long time. So after you have done it once, you want to save the result as a pickle file.

```python
start_train_dataset = 0
start_valid_dataset = 1200
start_test_dataset = 1600
total_no_images = 2000

output_pickle_file = './data/sattelite_dataset.pickle'
f = open(output_pickle_file, 'wb')
save = {
    'train_dataset': dataset[start_train_dataset:start_valid_dataset, :, :, :],
    'train_labels_roadtype': labels_roadtype[start_train_dataset:start_valid_dataset],
    'train_labels_roadpresence': labels_roadpresence[start_train_dataset:start_valid_dataset],
    'valid_dataset': dataset[start_valid_dataset:start_test_dataset, :, :, :],
    'valid_labels_roadtype': labels_roadtype[start_valid_dataset:start_test_dataset],
    'valid_labels_roadpresence': labels_roadpresence[start_valid_dataset:start_test_dataset],
    'test_dataset': dataset[start_test_dataset:total_no_images, :, :, :],
    'test_labels_roadtype': labels_roadtype[start_test_dataset:total_no_images],
    'test_labels_roadpresence': labels_roadpresence[start_test_dataset:total_no_images],
}
pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
f.close()
print("\nsaved dataset to {}".format(output_pickle_file))
```

Now that we have saved the training, validation and test set in a pickle file, we are finished with the preparation part.

We can load this pickle file into a convolutional neural network and train it to recognize roads.

To train the convolutional neural network to recognize roads, we are going to reuse code from the previous blog post. So if you want to understand how a convolutional neural network actually works, I advise you to take a few minutes and read it.

**PS:** As you have seen, before we can even start with the CNN, we had to do a lot of work getting and preparing the dataset. This also reflects how data science is in real life; 70 to 80 percent of the time goes into getting, understanding and cleaning data. The actual modelling / training of the data is only a small part of the work.

First we load the dataset from the saved pickle file, import VGGNet from the cnn_models module (see GitHub) and some utility functions (to determine the accuracy) and set the values for the learning rate, batch size, etc.

```python
import pickle
import tensorflow as tf
from cnn_models.vggnet import *
from utils import *

pickle_file = './data/sattelite_dataset.pickle'
f = open(pickle_file, 'rb')
save = pickle.load(f)
train_dataset = save['train_dataset'].astype(dtype=np.float32)
train_labels = save['train_labels_roadpresence'].astype(dtype=np.float32)
test_dataset = save['test_dataset'].astype(dtype=np.float32)
test_labels = save['test_labels_roadpresence'].astype(dtype=np.float32)
valid_dataset = save['valid_dataset'].astype(dtype=np.float32)
valid_labels = save['valid_labels_roadpresence'].astype(dtype=np.float32)
f.close()

num_labels = len(np.unique(train_labels))
num_steps = 501
display_step = 10
batch_size = 16
learning_rate = 0.0001
lambda_loss_amount = 0.0015
```

After that we can construct the Graph containing all of the computational steps of the Convolutional Neural Network and run it. As you can see, we are using the VGGNet-16 Convolutional Neural Network, l2-regularization to minimize the error, and a learning rate of 0.0001.

At each step the training accuracy is appended to train_accuracies, and at every 10th step the test and validation accuracies are appended to similar lists. We will use these later to visualize our accuracies.

```python
train_accuracies, test_accuracies, valid_accuracies = [], [], []

graph = tf.Graph()
with graph.as_default():
    #1) First we put the input data in a tensorflow friendly form.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_test_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_test_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_valid_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))

    #2) Then, the weight matrices and bias vectors are initialized
    variables = variables_vggnet16()

    #3) The model used to calculate the logits (predicted labels)
    model = model_vggnet16
    logits = model(tf_train_dataset, variables)

    #4) Then we compute the softmax cross entropy between the logits and the (actual) labels
    l2 = lambda_loss_amount * sum(tf.nn.l2_loss(tf_var) for tf_var in tf.trainable_variables())
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels)) + l2
    #learning_rate = tf.train.exponential_decay(0.05, global_step, 1000, 0.85, staircase=True)

    #5) The optimizer is used to calculate the gradients of the loss function
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, variables))
    valid_prediction = tf.nn.softmax(model(tf_valid_dataset, variables))

with tf.Session(graph=graph) as session:
    test_counter = 0
    tf.global_variables_initializer().run()
    print('Initialized with learning_rate', learning_rate)
    for step in range(num_steps):
        # Since we are using stochastic gradient descent, we select small batches from the
        # training dataset, and train the convolutional neural network with one batch at a time.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        train_accuracy = accuracy(predictions, batch_labels)
        train_accuracies.append(train_accuracy)

        if step % display_step == 0:
            offset2 = (test_counter * batch_size) % (test_labels.shape[0] - batch_size)
            test_dataset_batch = test_dataset[offset2:(offset2 + batch_size), :, :]
            test_labels_batch = test_labels[offset2:(offset2 + batch_size), :]
            feed_dict2 = {tf_test_dataset: test_dataset_batch, tf_test_labels: test_labels_batch}
            test_prediction_ = session.run(test_prediction, feed_dict=feed_dict2)
            test_accuracy = accuracy(test_prediction_, test_labels_batch)
            test_accuracies.append(test_accuracy)

            valid_dataset_batch = valid_dataset[offset2:(offset2 + batch_size), :, :]
            valid_labels_batch = valid_labels[offset2:(offset2 + batch_size), :]
            feed_dict3 = {tf_valid_dataset: valid_dataset_batch, tf_valid_labels: valid_labels_batch}
            valid_prediction_ = session.run(valid_prediction, feed_dict=feed_dict3)
            valid_accuracy = accuracy(valid_prediction_, valid_labels_batch)
            valid_accuracies.append(valid_accuracy)

            message = ("step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, "
                       "accuracy on test set {:02.2f} %, accuracy on valid set {:02.2f} %"
                       ).format(step, l, train_accuracy, test_accuracy, valid_accuracy)
            print(message)
```

Again, if you do not fully understand what is happening here, you can have a look at the previous blog post in which we looked at building Convolutional Neural Networks in Tensorflow in more detail.

Below we can see the accuracy results of the convolutional neural network.

As you can see, the accuracy for the test as well as the validation set lies around 80 %.

If we look at the tiles which have been classified incorrectly, we can see that most of them are misclassified because it really is difficult to detect the roads in these images.

We have seen how we can detect roads in satellite or aerial images using CNNs. Although it is quite amazing what you can do with Convolutional Neural Networks, the technical developments in the A.I. and Deep Learning world are so fast that using ‘only a CNN’ is already outdated.

- In recent years, Neural Networks such as R-CNN, Fast R-CNN and Faster R-CNN have appeared, as well as single-shot detectors like SSD, YOLO and YOLO9000. These Neural Networks can not only detect the presence of objects in images, but also return the bounding boxes of those objects.
- Nowadays there are also Neural Networks which can perform segmentation tasks (like DeepMask, SharpMask, MultiPath), i.e. they can determine to which object each pixel in the image belongs.

I think these Neural Networks which can perform image segmentation would be ideal to determine the location of roads and other objects inside satellite images. In future blogs, I want to have a look at how we can detect roads (or other features) in satellite and aerial images using these types of Neural Networks.

**PS:** If you are interested in the latest developments in Computer Vision, I can recommend you to read a year in computer vision.


In the past I have mostly written about ‘classical’ Machine Learning, like Naive Bayes classification, Logistic Regression, and the Perceptron algorithm. In the past year I have also worked with Deep Learning techniques, and I would like to share with you how to make and train a Convolutional Neural Network from scratch, using tensorflow. Later on we can use this knowledge as a building block to make interesting Deep Learning applications.

For this you will need to have tensorflow installed (see installation instructions) and you should also have a basic understanding of Python programming and the theory behind Convolutional Neural Networks. After you have installed tensorflow, you can run the smaller Neural Networks without GPU, but for the deeper networks you will definitely need some GPU power.

The Internet is full of awesome websites and courses which explain how a convolutional neural network works. Some of them have good visualisations which make it easy to understand [click here for more info]. I don’t feel the need to explain the same things again, so before you continue, make sure you understand how a convolutional neural network works. For example:

- What is a convolutional layer, and what is the filter of this convolutional layer?
- What is an activation layer (ReLu layer (most widely used), sigmoid activation or tanh)?
- What is a pooling layer (max pooling / average pooling), dropout?
- How does Stochastic Gradient Descent work?

The contents of this blog-post is as follows:

- Tensorflow basics:
- 1.1 Constants and Variables
- 1.2 Tensorflow Graphs and Sessions
- 1.3 Placeholders and feed_dicts

- Neural Networks in Tensorflow
- 2.1 Introduction
- 2.2 Loading in the data
- 2.3 Creating a (simple) 1-layer Neural Network:
- 2.4 The many faces of Tensorflow
- 2.5 Creating the LeNet5 CNN
- 2.6 How the parameters affect the output size of a layer
- 2.7 Adjusting the LeNet5 architecture
- 2.8 Impact of Learning Rate and Optimizer

- Deep Neural Networks in Tensorflow
- 3.1 AlexNet
- 3.2 VGG Net-16
- 3.3 AlexNet Performance

- Final words

Here I will give a short introduction to Tensorflow for people who have never worked with it before. If you want to start building Neural Networks immediately, or you are already familiar with Tensorflow, you can go ahead and skip to section 2. If you would like to know more about Tensorflow, you can also have a look at this repository, or the notes of lecture 1 and lecture 2 of Stanford’s CS20SI course.

The most basic units within tensorflow are Constants, Variables and Placeholders.

The difference between a tf.constant() and a tf.Variable() should be clear: a constant has a fixed value, and once you set it, it cannot be changed. The value of a Variable can be changed after it has been set, but its type and shape cannot be changed.

```python
# We can create constants and variables of different types.
# However, the different types do not mix well together.
a = tf.constant(2, tf.int16)
b = tf.constant(4, tf.float32)
c = tf.constant(8, tf.float32)

d = tf.Variable(2, tf.int16)
e = tf.Variable(4, tf.float32)
f = tf.Variable(8, tf.float32)

# We can perform computations on variables of the same type:
e + f
# But the following can not be done (uncommenting it raises an error):
# d + e

# Everything in tensorflow is a tensor; these can have different dimensions:
# 0D, 1D, 2D, 3D, 4D, or nD-tensors
g = tf.constant(np.zeros(shape=(2, 2), dtype=np.float32))  # does work
h = tf.zeros([11], tf.int16)
i = tf.ones([2, 2], tf.float32)
j = tf.zeros([1000, 4, 3], tf.float64)

k = tf.Variable(tf.zeros([2, 2], tf.float32))
l = tf.Variable(tf.zeros([5, 6, 5], tf.float32))
```

Besides tf.zeros() and tf.ones(), which create a Tensor initialized to zeros or ones (see here), there is also the tf.random_normal() function, which creates a tensor filled with values picked randomly from a normal distribution (the default distribution has a mean of 0.0 and a stddev of 1.0).

There is also the tf.truncated_normal() function, which creates a Tensor with values randomly picked from a normal distribution, where values more than two standard deviations from the mean are discarded and re-picked.

With this knowledge, we can already create weight matrices and bias vectors which can be used in a neural network.

```python
weights = tf.Variable(tf.truncated_normal([256 * 256, 10]))
biases = tf.Variable(tf.zeros([10]))
print(weights.get_shape().as_list())
print(biases.get_shape().as_list())
>>> [65536, 10]
>>> [10]
```

In Tensorflow, all of the different Variables and the operations done on these Variables are saved in a Graph. After you have built a Graph which contains all of the computational steps necessary for your model, you can run this Graph within a Session. This Session then distributes all of the computations across the available CPU and GPU resources.

```python
graph = tf.Graph()
with graph.as_default():
    a = tf.Variable(8, tf.float32)
    b = tf.Variable(tf.zeros([2, 2], tf.float32))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print(a)
    print(session.run(a))
    print(session.run(b))

>>> <tf.Variable 'Variable:0' shape=() dtype=int32_ref>
>>> 8
>>> [[ 0.  0.]
>>>  [ 0.  0.]]
```

We have seen the various forms in which we can create constants and variables. Tensorflow also has placeholders; these do not require an initial value and only serve to allocate the necessary amount of memory. During a session, these placeholders can be filled with (external) data via a *feed_dict*.

Below is an example of the usage of a placeholder.

```python
list_of_points1_ = [[1, 2], [3, 4], [5, 6], [7, 8]]
list_of_points2_ = [[15, 16], [13, 14], [11, 12], [9, 10]]
list_of_points1 = np.array([np.array(elem).reshape(1, 2) for elem in list_of_points1_])
list_of_points2 = np.array([np.array(elem).reshape(1, 2) for elem in list_of_points2_])

graph = tf.Graph()
with graph.as_default():
    # We should use a tf.placeholder() to create a variable whose value we will fill in
    # later (during session.run()). This is done by 'feeding' the data into the placeholder.
    # Below is an example of a method which uses two placeholder arrays of size [1,2]
    # to calculate the eucledian distance.
    point1 = tf.placeholder(tf.float32, shape=(1, 2))
    point2 = tf.placeholder(tf.float32, shape=(1, 2))

    def calculate_eucledian_distance(point1, point2):
        difference = tf.subtract(point1, point2)
        power2 = tf.pow(difference, tf.constant(2.0, shape=(1, 2)))
        add = tf.reduce_sum(power2)
        eucledian_distance = tf.sqrt(add)
        return eucledian_distance

    dist = calculate_eucledian_distance(point1, point2)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    for ii in range(len(list_of_points1)):
        point1_ = list_of_points1[ii]
        point2_ = list_of_points2[ii]
        feed_dict = {point1: point1_, point2: point2_}
        distance = session.run([dist], feed_dict=feed_dict)
        print("the distance between {} and {} -> {}".format(point1_, point2_, distance))

>>> the distance between [[1 2]] and [[15 16]] -> [19.79899]
>>> the distance between [[3 4]] and [[13 14]] -> [14.142136]
>>> the distance between [[5 6]] and [[11 12]] -> [8.485281]
>>> the distance between [[7 8]] and [[ 9 10]] -> [2.8284271]
```

The graph containing the Neural Network (illustrated in the image above) should contain the following steps:

- The **input datasets**: the training dataset and labels, the test dataset and labels (and the validation dataset and labels). The test and validation datasets can be placed inside a tf.constant(), while the training dataset is placed in a tf.placeholder() so that it can be fed in batches during the training (stochastic gradient descent).
- The Neural Network **model** with all of its layers. This can be a simple fully connected neural network consisting of only 1 layer, or a more complicated neural network consisting of 5, 9, 16, etc. layers.
- The **weight** matrices and **bias** vectors, defined in the proper shape and initialized to their initial values. (One weight matrix and bias vector per layer.)
- The **loss** value: the model outputs the logit vector (estimated training labels), and by comparing the logits with the actual labels we can calculate the loss value (with the softmax with cross-entropy function). The loss value is an indication of how close the estimated training labels are to the actual training labels and will be used to update the weight values.
- An **optimizer**, which uses the calculated loss value to update the weights and biases with backpropagation.

Let’s load the datasets which are going to be used to train and test the Neural Networks. For this we will download the MNIST and the CIFAR-10 datasets. The MNIST dataset contains 60,000 images of handwritten digits, where each image is of size 28 x 28 x 1 (grayscale). The CIFAR-10 dataset contains 60,000 colour images (3 channels) – size 32 x 32 x 3 – of 10 different objects (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Since there are 10 different objects in each dataset, both datasets contain 10 labels.

First, let’s define some methods which are convenient for loading and reshaping the data into the necessary format.

```python
def randomize(dataset, labels):
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation, :, :]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels

def one_hot_encode(np_array):
    return (np.arange(10) == np_array[:, None]).astype(np.float32)

def reformat_data(dataset, labels, image_width, image_height, image_depth):
    np_dataset_ = np.array([np.array(image_data).reshape(image_width, image_height, image_depth)
                            for image_data in dataset])
    np_labels_ = one_hot_encode(np.array(labels, dtype=np.float32))
    np_dataset, np_labels = randomize(np_dataset_, np_labels_)
    return np_dataset, np_labels

def flatten_tf_array(array):
    shape = array.get_shape().as_list()
    return tf.reshape(array, [shape[0], shape[1] * shape[2] * shape[3]])

def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])
```

These are methods for one-hot encoding the labels, loading the data into a randomized array, and a method for flattening an array (since a fully connected network needs a flat array as its input).
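For example, `one_hot_encode` maps integer labels to rows of a 10-column indicator matrix. A quick standalone check (the function is repeated here so the snippet runs on its own; the label values are arbitrary):

```python
import numpy as np

def one_hot_encode(np_array):
    # compare each label against 0..9; the matching column becomes 1.0
    return (np.arange(10) == np_array[:, None]).astype(np.float32)

labels = np.array([3, 0, 9])
encoded = one_hot_encode(labels)
# encoded[0] has a 1.0 in column 3, encoded[1] in column 0, encoded[2] in column 9
```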

After we have defined these necessary functions, we can load the MNIST and CIFAR-10 datasets with:

```python
mnist_folder = './data/mnist/'
mnist_image_width = 28
mnist_image_height = 28
mnist_image_depth = 1
mnist_num_labels = 10

mndata = MNIST(mnist_folder)
mnist_train_dataset_, mnist_train_labels_ = mndata.load_training()
mnist_test_dataset_, mnist_test_labels_ = mndata.load_testing()

mnist_train_dataset, mnist_train_labels = reformat_data(mnist_train_dataset_, mnist_train_labels_,
                                                        mnist_image_width, mnist_image_height, mnist_image_depth)
mnist_test_dataset, mnist_test_labels = reformat_data(mnist_test_dataset_, mnist_test_labels_,
                                                      mnist_image_width, mnist_image_height, mnist_image_depth)

print("There are {} images, each of size {}".format(len(mnist_train_dataset), len(mnist_train_dataset[0])))
print("Meaning each image has the size of 28*28*1 = {}".format(mnist_image_width*mnist_image_height*1))
print("The training set contains the following {} labels: {}".format(len(np.unique(mnist_train_labels_)), np.unique(mnist_train_labels_)))
print('Training set shape', mnist_train_dataset.shape, mnist_train_labels.shape)
print('Test set shape', mnist_test_dataset.shape, mnist_test_labels.shape)

train_dataset_mnist, train_labels_mnist = mnist_train_dataset, mnist_train_labels
test_dataset_mnist, test_labels_mnist = mnist_test_dataset, mnist_test_labels

######################################################################################

cifar10_folder = './data/cifar10/'
train_datasets = ['data_batch_1', 'data_batch_2', 'data_batch_3', 'data_batch_4', 'data_batch_5']
test_dataset = ['test_batch']
c10_image_height = 32
c10_image_width = 32
c10_image_depth = 3
c10_num_labels = 10

with open(cifar10_folder + test_dataset[0], 'rb') as f0:
    c10_test_dict = pickle.load(f0, encoding='bytes')

c10_test_dataset, c10_test_labels = c10_test_dict[b'data'], c10_test_dict[b'labels']
test_dataset_cifar10, test_labels_cifar10 = reformat_data(c10_test_dataset, c10_test_labels,
                                                          c10_image_width, c10_image_height, c10_image_depth)

c10_train_dataset, c10_train_labels = [], []
for train_dataset in train_datasets:
    with open(cifar10_folder + train_dataset, 'rb') as f0:
        c10_train_dict = pickle.load(f0, encoding='bytes')
        c10_train_dataset_, c10_train_labels_ = c10_train_dict[b'data'], c10_train_dict[b'labels']
        c10_train_dataset.append(c10_train_dataset_)
        c10_train_labels += c10_train_labels_

c10_train_dataset = np.concatenate(c10_train_dataset, axis=0)
train_dataset_cifar10, train_labels_cifar10 = reformat_data(c10_train_dataset, c10_train_labels,
                                                            c10_image_width, c10_image_height, c10_image_depth)
del c10_train_dataset
del c10_train_labels

print("The training set contains the following labels: {}".format(np.unique(c10_train_dict[b'labels'])))
print('Training set shape', train_dataset_cifar10.shape, train_labels_cifar10.shape)
print('Test set shape', test_dataset_cifar10.shape, test_labels_cifar10.shape)
```

You can download the MNIST dataset from Yann LeCun’s website. After you have downloaded and unzipped the files, you can load the data with the python-mnist tool. CIFAR-10 can be downloaded from here.

The simplest form of a Neural Network is a 1-layer linear Fully Connected Neural Network (FCNN). Mathematically it consists of a single matrix multiplication.

It is best to start with such a simple NN in tensorflow, and later look at the more complicated Neural Networks. When we move on to those more complicated networks, only the model (step 2) and weights (step 3) part of the Graph will change; the other steps remain the same.

We can make such a 1-layer FCNN as follows:

```python
image_width = mnist_image_width
image_height = mnist_image_height
image_depth = mnist_image_depth
num_labels = mnist_num_labels

#the dataset
train_dataset = mnist_train_dataset
train_labels = mnist_train_labels
test_dataset = mnist_test_dataset
test_labels = mnist_test_labels

#number of iterations, batch size and learning rate
num_steps = 10001
display_step = 100
batch_size = 64   #mini-batch size (not defined in the original snippet)
learning_rate = 0.5

graph = tf.Graph()
with graph.as_default():
    #1) First we put the input data in a tensorflow friendly form.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_test_dataset = tf.constant(test_dataset, tf.float32)

    #2) Then, the weight matrices and bias vectors are initialized.
    #As a default, tf.truncated_normal() is used for the weight matrix and tf.zeros() is used for the bias vector.
    weights = tf.Variable(tf.truncated_normal([image_width * image_height * image_depth, num_labels]), tf.float32)
    bias = tf.Variable(tf.zeros([num_labels]), tf.float32)

    #3) define the model:
    #a one layered FCNN simply consists of a matrix multiplication
    def model(data, weights, bias):
        return tf.matmul(flatten_tf_array(data), weights) + bias

    logits = model(tf_train_dataset, weights, bias)

    #4) calculate the loss, which will be used in the optimization of the weights
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

    #5) Choose an optimizer. Many are available.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    #6) The predicted values for the images in the train dataset and test dataset are assigned to the
    #variables train_prediction and test_prediction. This is only necessary if you want to know the
    #accuracy by comparing the predictions with the actual values.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, weights, bias))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        #feed the placeholders with a mini-batch of training data
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % display_step == 0):
            train_accuracy = accuracy(predictions, batch_labels)
            test_accuracy = accuracy(test_prediction.eval(), test_labels)
            message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
            print(message)
```

```
>>> Initialized
>>> step 0000 : loss is 2349.55, accuracy on training set 10.43 %, accuracy on test set 34.12 %
>>> step 0100 : loss is 3612.48, accuracy on training set 89.26 %, accuracy on test set 90.15 %
>>> step 0200 : loss is 2634.40, accuracy on training set 91.10 %, accuracy on test set 91.26 %
>>> step 0300 : loss is 2109.42, accuracy on training set 91.62 %, accuracy on test set 91.56 %
>>> step 0400 : loss is 2093.56, accuracy on training set 91.85 %, accuracy on test set 91.67 %
>>> step 0500 : loss is 2325.58, accuracy on training set 91.83 %, accuracy on test set 91.67 %
>>> step 0600 : loss is 22140.44, accuracy on training set 68.39 %, accuracy on test set 75.06 %
>>> step 0700 : loss is 5920.29, accuracy on training set 83.73 %, accuracy on test set 87.76 %
>>> step 0800 : loss is 9137.66, accuracy on training set 79.72 %, accuracy on test set 83.33 %
>>> step 0900 : loss is 15949.15, accuracy on training set 69.33 %, accuracy on test set 77.05 %
>>> step 1000 : loss is 1758.80, accuracy on training set 92.45 %, accuracy on test set 91.79 %
```

This is all there is to it! Inside the Graph, we load the data, define the weight matrices and the model, calculate the loss value from the logit vector, and pass this to the optimizer, which updates the weights for ‘num_steps’ iterations.

In the above fully connected NN, we have used the Gradient Descent Optimizer for optimizing the weights. However, many different optimizers are available in tensorflow. The most commonly used optimizers are the GradientDescentOptimizer, AdamOptimizer and AdaGradOptimizer, so I would suggest starting with these if you’re building a CNN.

Sebastian Ruder has a nice blog post explaining the differences between the different optimizers which you can read if you want to know more about them.

Tensorflow contains many layers of abstraction, meaning the same operations can be done at different levels. To give a simple example, the operation

`logits = tf.matmul(tf_train_dataset, weights) + biases`

can also be achieved with

`logits = tf.nn.xw_plus_b(train_dataset, weights, biases)`.

This is best visible in the layers API, which is a layer with a high level of abstraction and makes it very easy to create Neural Networks consisting of many different layers. For example, the conv_2d() and fully_connected() functions create convolutional and fully connected layers. With these functions, the number of layers, filter sizes / depths, type of activation function, etc. can be specified as parameters. The weight and bias matrices are then created automatically, as well as the additional activation functions and dropout regularization layers.

For example, with the layers API, the following lines:

```python
import tensorflow as tf

w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
b1 = tf.Variable(tf.zeros([filter_depth]))
layer1_conv = tf.nn.conv2d(data, w1, [1, 1, 1, 1], padding='SAME')
layer1_relu = tf.nn.relu(layer1_conv + b1)
layer1_pool = tf.nn.max_pool(layer1_relu, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
```

can be replaced with

```python
from tflearn.layers.conv import conv_2d, max_pool_2d

layer1_conv = conv_2d(data, filter_depth, filter_size, activation='relu')
layer1_pool = max_pool_2d(layer1_conv, 2, strides=2)
```

As you can see, we don’t need to define the weights, biases or activation functions. Especially when you’re building a neural network with many layers, this keeps the code succinct and clean.

However, if you’re just starting out with tensorflow and want to learn how to build different kinds of Neural Networks, it is not ideal, since we’re letting tflearn do all the work.

Therefore we will not use the layers API in this blog-post, but I do recommend you use it once you have a full understanding of how a neural network should be built in tensorflow.

Let’s start with building a Neural Network with more layers, for example the LeNet5 Convolutional Neural Network.

The LeNet5 CNN architecture was devised by Yann LeCun as early as 1998 (see paper). It is one of the earliest CNNs (maybe even the first?) and was specifically designed to classify handwritten digits. Although it performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, its performance drops on datasets with more images, a larger resolution (larger image size) and more classes. For these larger datasets, deeper ConvNets (like AlexNet, VGGNet or ResNet) will perform better.

But since the LeNet5 architecture only consists of 5 layers, it is a good starting point for learning how to build CNN’s.

The Lenet5 architecture looks as follows:

As we can see, it consists of 5 layers:

- **layer 1**: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
- **layer 2**: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
- **layer 3**: a fully connected network (sigmoid activation)
- **layer 4**: a fully connected network (sigmoid activation)
- **layer 5**: the output layer

This means that we need to create 5 weight and bias matrices, and our model will consist of 12 lines of code (5 layers + 2 pooling + 4 activation functions + 1 flatten layer).

Since this is quite some code, it is best to define it in a separate function outside of the graph.

```python
LENET5_BATCH_SIZE = 32
LENET5_PATCH_SIZE = 5
LENET5_PATCH_DEPTH_1 = 6
LENET5_PATCH_DEPTH_2 = 16
LENET5_NUM_HIDDEN_1 = 120
LENET5_NUM_HIDDEN_2 = 84

def variables_lenet5(patch_size = LENET5_PATCH_SIZE, patch_depth1 = LENET5_PATCH_DEPTH_1,
                     patch_depth2 = LENET5_PATCH_DEPTH_2, num_hidden1 = LENET5_NUM_HIDDEN_1,
                     num_hidden2 = LENET5_NUM_HIDDEN_2, image_depth = 1, num_labels = 10):
    w1 = tf.Variable(tf.truncated_normal([patch_size, patch_size, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))
    w2 = tf.Variable(tf.truncated_normal([patch_size, patch_size, patch_depth1, patch_depth2], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))
    w3 = tf.Variable(tf.truncated_normal([5*5*patch_depth2, num_hidden1], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden1]))
    w4 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[num_hidden2]))
    w5 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape=[num_labels]))
    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
    }
    return variables

def model_lenet5(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.sigmoid(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='VALID')
    layer2_actv = tf.sigmoid(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    flat_layer = flatten_tf_array(layer2_pool)
    layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
    layer3_actv = tf.nn.sigmoid(layer3_fccd)
    layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
    layer4_actv = tf.nn.sigmoid(layer4_fccd)
    logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
    return logits
```
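As a sanity check on the dimensions used above, we can track the image size through the network in plain Python (a sketch, assuming a 28 x 28 x 1 MNIST input; the helper `conv_out` mimics the output-size rules of tensorflow’s ‘SAME’ and ‘VALID’ padding):

```python
import math

def conv_out(size, k, stride, padding):
    # output size of a conv/pool layer, following tensorflow's SAME/VALID rules
    if padding == 'SAME':
        return math.ceil(size / stride)
    return math.ceil((size - k + 1) / stride)

size = 28                             # MNIST image width/height
size = conv_out(size, 5, 1, 'SAME')   # layer 1 conv  -> 28
size = conv_out(size, 2, 2, 'SAME')   # layer 1 pool  -> 14
size = conv_out(size, 5, 1, 'VALID')  # layer 2 conv  -> 10
size = conv_out(size, 2, 2, 'SAME')   # layer 2 pool  -> 5
flat_units = size * size * 16         # 16 = LENET5_PATCH_DEPTH_2
```

This shows why `w3` has `5*5*patch_depth2` (= 400) input units.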

With the variables and model defined separately, we can adjust the graph a little bit so that it uses these weights and this model instead of the previous Fully Connected NN:

```python
#parameters determining the model size
image_width = mnist_image_width
image_height = mnist_image_height
image_depth = mnist_image_depth
num_labels = mnist_num_labels

#the datasets
train_dataset = mnist_train_dataset
train_labels = mnist_train_labels
test_dataset = mnist_test_dataset
test_labels = mnist_test_labels

#number of iterations, batch size and learning rate
num_steps = 10001
display_step = 1000
batch_size = LENET5_BATCH_SIZE
learning_rate = 0.1

graph = tf.Graph()
with graph.as_default():
    #1) First we put the input data in a tensorflow friendly form.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_test_dataset = tf.constant(test_dataset, tf.float32)

    #2) Then, the weight matrices and bias vectors are initialized
    variables = variables_lenet5(image_depth = image_depth, num_labels = num_labels)

    #3) The model used to calculate the logits (predicted labels)
    model = model_lenet5
    logits = model(tf_train_dataset, variables)

    #4) then we compute the softmax cross entropy between the logits and the (actual) labels
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

    #5) The optimizer is used to calculate the gradients of the loss function
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    # Predictions for the training and test data.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, variables))
```

```python
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized with learning_rate', learning_rate)
    for step in range(num_steps):
        #Since we are using stochastic gradient descent, we are selecting small batches from the training dataset,
        #and training the convolutional neural network each time with a batch.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if step % display_step == 0:
            train_accuracy = accuracy(predictions, batch_labels)
            test_accuracy = accuracy(test_prediction.eval(), test_labels)
            message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
            print(message)
```

```
>>> Initialized with learning_rate 0.1
>>> step 0000 : loss is 002.49, accuracy on training set 3.12 %, accuracy on test set 10.09 %
>>> step 1000 : loss is 002.29, accuracy on training set 21.88 %, accuracy on test set 9.58 %
>>> step 2000 : loss is 000.73, accuracy on training set 75.00 %, accuracy on test set 78.20 %
>>> step 3000 : loss is 000.41, accuracy on training set 81.25 %, accuracy on test set 86.87 %
>>> step 4000 : loss is 000.26, accuracy on training set 93.75 %, accuracy on test set 90.49 %
>>> step 5000 : loss is 000.28, accuracy on training set 87.50 %, accuracy on test set 92.79 %
>>> step 6000 : loss is 000.23, accuracy on training set 96.88 %, accuracy on test set 93.64 %
>>> step 7000 : loss is 000.18, accuracy on training set 90.62 %, accuracy on test set 95.14 %
>>> step 8000 : loss is 000.14, accuracy on training set 96.88 %, accuracy on test set 95.80 %
>>> step 9000 : loss is 000.35, accuracy on training set 90.62 %, accuracy on test set 96.33 %
>>> step 10000 : loss is 000.12, accuracy on training set 93.75 %, accuracy on test set 96.76 %
```

As we can see the LeNet5 architecture performs better on the MNIST dataset than a simple fully connected NN.

Generally it is true that the more layers a Neural Network has, the better it performs. We can add more layers, change activation functions and pooling layers, change the learning rate, and see how each step affects the performance. Since the input of layer N is the output of layer N-1, we need to know how the output size of a layer is affected by its different parameters.

To understand this, let’s have a look at the conv2d() function.

It has four parameters:

- The input image, a 4D Tensor with dimensions [batch size, image_width, image_height, image_depth]
- A weight matrix, a 4-D Tensor with dimensions [filter_size, filter_size, image_depth, filter_depth]
- The number of strides in each dimension.
- Padding (= ‘SAME’ / ‘VALID’)

These four parameters determine the size of the output image.

The first two parameters are the 4-D Tensor containing the batch of input images and the 4-D Tensor containing the weights of the convolutional filter.

The third parameter is the stride of the convolution, i.e. how many positions the convolutional filter should skip in each of the four dimensions. The first of these four dimensions indicates the image-number in the batch of images, and since we don’t want to skip over any image, this is always 1. The last dimension indicates the image depth (number of color-channels; 1 for grayscale and 3 for RGB), and since we don’t want to skip over any color-channels, this is also always 1. The second and third dimensions indicate the stride in the X and Y direction (image width and height). If we want to apply a stride, these are the dimensions in which the filter should skip positions. So for a stride of 1 we set the stride-parameter to [1, 1, 1, 1], and for a stride of 2 we set it to [1, 2, 2, 1], etc.

The last parameter indicates whether or not tensorflow should zero-pad the image, to make sure the output size does not change for a stride of 1. With padding = ‘SAME’ the image is zero-padded (and the output size does not change); with padding = ‘VALID’ it is not.

Below we can see two examples of a convolutional filter (with filter size 5 x 5) scanning through an image (of size 28 x 28).

On the left the padding parameter is set to ‘SAME’, the image is zero-padded and the last 4 rows / columns are included in the output image.

On the right padding is set to ‘VALID’, the image does not get zero-padded and the last 4 rows/columns are not included.

As we can see, without zero-padding the last four cells are not included, because the convolutional filter has reached the end of the (non-zero padded) image. This means that, for an input size of 28 x 28, the output size becomes 24 x 24. If padding = ‘SAME’, the output size is 28 x 28.

This becomes more clear if we write down the positions of the filter on the image while it is scanning through the image (For simplicity, only the X-direction). With a stride of 1, the X-positions are 0-5, 1-6, 2-7, etc. If the stride is 2, the X-positions are 0-5, 2-7, 4-9, etc.
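We can enumerate these positions with a small helper (purely illustrative; `filter_positions` is not a tensorflow function, and positions are written as half-open start–end pairs to match the 0-5, 1-6 notation above):

```python
def filter_positions(image_width, filter_size, stride):
    # start-end index pairs of the filter as it slides over one row,
    # without zero-padding ('VALID' behaviour)
    positions = []
    x = 0
    while x + filter_size <= image_width:
        positions.append((x, x + filter_size))
        x += stride
    return positions

pos_s1 = filter_positions(28, 5, 1)   # (0, 5), (1, 6), (2, 7), ...
pos_s2 = filter_positions(28, 5, 2)   # (0, 5), (2, 7), (4, 9), ...
```

The number of positions is exactly the output width: 24 for a stride of 1, 12 for a stride of 2.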

If we do this for an image size of 28 x 28, filter size of 5 x 5 and strides 1 to 4, we will get the following table:

As you can see, for a stride of 1 with zero-padding the output image size is 28 x 28, and without zero-padding it becomes 24 x 24. For a filter with a stride of 2 these numbers are 14 x 14 and 12 x 12, and for a filter with a stride of 3 they are 10 x 10 and 8 x 8, etc.

For any arbitrarily chosen stride S, filter size K, image size W and padding-size P, the output size will be O = 1 + (W - K + 2P) / S (rounded down).

If padding = ‘SAME’ in tensorflow, enough zero-padding is added so that the output size is determined only by the stride S: the output size becomes W / S, rounded up.
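These rules can be written down as a small helper (a sketch following tensorflow’s ‘SAME’ / ‘VALID’ conventions; `output_size` is not a tensorflow function):

```python
import math

def output_size(W, K, S, padding):
    # W: image width, K: filter size, S: stride
    if padding == 'SAME':
        # enough zeros are added so that the output depends only on the stride
        return math.ceil(W / S)
    # 'VALID': no zero-padding (P = 0), so O = 1 + (W - K) / S, rounded down
    return (W - K) // S + 1

# reproduce the table for a 28-wide image and a 5-wide filter
for S in (1, 2, 3, 4):
    print(S, output_size(28, 5, S, 'SAME'), output_size(28, 5, S, 'VALID'))
```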

In the original paper, a sigmoid activation function and average pooling were used in the LeNet5 architecture. However, nowadays, it is much more common to use a relu activation function. So let’s change the LeNet5 CNN a little bit to see if we can improve its accuracy. We will call this the LeNet5-like Architecture:

```python
LENET5_LIKE_BATCH_SIZE = 32
LENET5_LIKE_FILTER_SIZE = 5
LENET5_LIKE_FILTER_DEPTH = 16
LENET5_LIKE_NUM_HIDDEN = 120

def variables_lenet5_like(filter_size = LENET5_LIKE_FILTER_SIZE,
                          filter_depth = LENET5_LIKE_FILTER_DEPTH,
                          num_hidden = LENET5_LIKE_NUM_HIDDEN,
                          image_width = 28, image_depth = 1, num_labels = 10):
    w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
    b1 = tf.Variable(tf.zeros([filter_depth]))
    w2 = tf.Variable(tf.truncated_normal([filter_size, filter_size, filter_depth, filter_depth], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[filter_depth]))
    w3 = tf.Variable(tf.truncated_normal([(image_width // 4)*(image_width // 4)*filter_depth, num_hidden], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
    w4 = tf.Variable(tf.truncated_normal([num_hidden, num_hidden], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
    w5 = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape=[num_labels]))
    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
    }
    return variables

def model_lenet5_like(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.nn.relu(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_actv = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    flat_layer = flatten_tf_array(layer2_pool)
    layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
    layer3_actv = tf.nn.relu(layer3_fccd)
    #layer3_drop = tf.nn.dropout(layer3_actv, 0.5)
    layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
    layer4_actv = tf.nn.relu(layer4_fccd)
    #layer4_drop = tf.nn.dropout(layer4_actv, 0.5)
    logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
    return logits
```

The main difference is that we use a relu activation function instead of a sigmoid activation.

Besides the activation function, we can also change the optimizer, to see what effect the different optimizers have on accuracy.

Let’s see how these CNNs perform on the MNIST and CIFAR-10 datasets.

In the figures above, the accuracy on the test set is given as a function of the number of iterations. On the left for the one layer fully connected NN, in the middle for the LeNet5 NN and on the right for the LeNet5-like NN.

As we can see, the LeNet5 CNN works pretty well for the MNIST dataset. This should not be a big surprise, since it was specially designed to classify handwritten digits. The MNIST dataset is quite small and does not provide a big challenge, so even a one layer fully connected network performs quite well.

On the CIFAR-10 Dataset however, the performance for the LeNet5 NN drops significantly to accuracy values around 40%.

To increase the accuracy, we can change the optimizer, or fine-tune the Neural Network by applying regularization or learning rate decay.

As we can see, the AdagradOptimizer, AdamOptimizer and the RMSPropOptimizer have a better performance than the GradientDescentOptimizer. These are adaptive optimizers which in general perform better than the (simple) GradientDescentOptimizer but need more computational power.

With L2-regularization or exponential rate decay we can probably gain a bit more accuracy, but for much better results we need to go deeper.

So far we have seen the LeNet5 CNN architecture. LeNet5 contains two convolutional layers followed by fully connected layers, and therefore could be called a shallow Neural Network. At that time (in 1998) GPUs were not used for computation, and CPUs were not that powerful, so the two convolutional layers were already quite innovative.

Later on, many other types of Convolutional Neural Networks have been designed, most of them much deeper [click here for more info].

There is the famous AlexNet architecture (2012) by Alex Krizhevsky et al., the 7-layered ZF Net (2013), and the 16-layered VGGNet (2014).

In 2015 Google came out with a 22-layered CNN with an inception module (GoogLeNet), and Microsoft Research Asia created the 152-layered CNN called ResNet.

Now, with the things we have learned so far, let’s see how we can create the AlexNet and VGGNet16 architectures in Tensorflow.

Although LeNet5 was the first ConvNet, it is considered a shallow neural network. It performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, but the performance drops when we try to classify larger images, with more resolution and more classes.

The first Deep CNN came out in 2012 and is called AlexNet, after its creators Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Compared to the most recent architectures AlexNet can be considered simple, but at the time it was really successful. It won the ImageNet competition with an incredible test error rate of 15.4% (while the runner-up had an error of 26.2%) and started a revolution (also see this video) in the world of Deep Learning and AI.

It consists of 5 convolutional layers (with relu activation), 3 max pooling layers, 3 fully connected layers and 2 dropout layers. The overall architecture looks as follows:

- **layer 0**: input image of size 224 x 224 x 3
- **layer 1**: A convolutional layer with 96 filters (filter_depth_1 = 96) of size 11 x 11 (filter_size_1 = 11) and a stride of 4. It has a relu activation function. This is followed by max pooling and local response normalization layers.
- **layer 2**: A convolutional layer with 256 filters (filter_depth_2 = 256) of size 5 x 5 (filter_size_2 = 5) and a stride of 1. It has a relu activation function. This layer is also followed by max pooling and local response normalization layers.
- **layer 3**: A convolutional layer with 384 filters (filter_depth_3 = 384) of size 3 x 3 (filter_size_3 = 3) and a stride of 1. It has a relu activation function.
- **layer 4**: Same as layer 3.
- **layer 5**: A convolutional layer with 256 filters (filter_depth_4 = 256) of size 3 x 3 (filter_size_4 = 3) and a stride of 1. It has a relu activation function.
- **layer 6-8**: These convolutional layers are followed by fully connected layers with 4096 neurons each. In the original paper they classify a dataset with 1000 classes, but we will use the oxford17 dataset, which has 17 different classes (of flowers).

Note that this CNN (or other deep CNNs) cannot be used on the MNIST or the CIFAR-10 dataset, because the images in these datasets are too small. As we have seen before, a pooling layer (or a convolutional layer with a stride of 2) reduces the image size by a factor of 2. AlexNet has 3 max pooling layers and one convolutional layer with a stride of 4. This means that the original image size gets reduced by a factor of 2 x 2 x 2 x 4 = 32; the images in the MNIST dataset would simply get reduced to a size smaller than one pixel.
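A quick back-of-the-envelope check of this reduction factor:

```python
reduction = 4 * 2 * 2 * 2       # one stride-4 conv layer, three stride-2 max pooling layers
alexnet_out = 224 // reduction  # a 224 x 224 input ends up as a 7 x 7 feature map
mnist_out = 28 / reduction      # 0.875: a 28 x 28 MNIST image shrinks below one pixel
```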

Therefore we need to load a dataset with larger images, preferably 224 x 224 x 3 (as the original paper indicates). The 17 category flower dataset, aka oxflower17 dataset is ideal since it contains images of exactly this size:

```python
ox17_image_width = 224
ox17_image_height = 224
ox17_image_depth = 3
ox17_num_labels = 17

import tflearn.datasets.oxflower17 as oxflower17
train_dataset_, train_labels_ = oxflower17.load_data(one_hot=True)
train_dataset_ox17, train_labels_ox17 = train_dataset_[:1000,:,:,:], train_labels_[:1000,:]
test_dataset_ox17, test_labels_ox17 = train_dataset_[1000:,:,:,:], train_labels_[1000:,:]

print('Training set', train_dataset_ox17.shape, train_labels_ox17.shape)
print('Test set', test_dataset_ox17.shape, test_labels_ox17.shape)
```

Let’s try to create the weight matrices and the different layers present in AlexNet. As we have seen before, we need as many weight matrices and bias vectors as there are layers, and each weight matrix should have a size corresponding to the filter size of the layer it belongs to.

```python
ALEX_PATCH_DEPTH_1, ALEX_PATCH_DEPTH_2, ALEX_PATCH_DEPTH_3, ALEX_PATCH_DEPTH_4 = 96, 256, 384, 256
ALEX_PATCH_SIZE_1, ALEX_PATCH_SIZE_2, ALEX_PATCH_SIZE_3, ALEX_PATCH_SIZE_4 = 11, 5, 3, 3
ALEX_NUM_HIDDEN_1, ALEX_NUM_HIDDEN_2 = 4096, 4096

def variables_alexnet(patch_size1 = ALEX_PATCH_SIZE_1, patch_size2 = ALEX_PATCH_SIZE_2,
                      patch_size3 = ALEX_PATCH_SIZE_3, patch_size4 = ALEX_PATCH_SIZE_4,
                      patch_depth1 = ALEX_PATCH_DEPTH_1, patch_depth2 = ALEX_PATCH_DEPTH_2,
                      patch_depth3 = ALEX_PATCH_DEPTH_3, patch_depth4 = ALEX_PATCH_DEPTH_4,
                      num_hidden1 = ALEX_NUM_HIDDEN_1, num_hidden2 = ALEX_NUM_HIDDEN_2,
                      image_width = 224, image_height = 224, image_depth = 3, num_labels = 17):
    w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))
    w2 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))
    w3 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1))
    b3 = tf.Variable(tf.zeros([patch_depth3]))
    w4 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth3], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w5 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth3], stddev=0.1))
    b5 = tf.Variable(tf.zeros([patch_depth3]))

    pool_reductions = 3
    conv_reductions = 2
    no_reductions = pool_reductions + conv_reductions
    w6 = tf.Variable(tf.truncated_normal([(image_width // 2**no_reductions)*(image_height // 2**no_reductions)*patch_depth3, num_hidden1], stddev=0.1))
    b6 = tf.Variable(tf.constant(1.0, shape=[num_hidden1]))
    w7 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b7 = tf.Variable(tf.constant(1.0, shape=[num_hidden2]))
    w8 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b8 = tf.Variable(tf.constant(1.0, shape=[num_labels]))

    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8
    }
    return variables

def model_alexnet(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 4, 4, 1], padding='SAME')
    layer1_relu = tf.nn.relu(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.max_pool(layer1_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer1_norm = tf.nn.local_response_normalization(layer1_pool)

    layer2_conv = tf.nn.conv2d(layer1_norm, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_relu = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.max_pool(layer2_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer2_norm = tf.nn.local_response_normalization(layer2_pool)

    layer3_conv = tf.nn.conv2d(layer2_norm, variables['w3'], [1, 1, 1, 1], padding='SAME')
    layer3_relu = tf.nn.relu(layer3_conv + variables['b3'])
    layer4_conv = tf.nn.conv2d(layer3_relu, variables['w4'], [1, 1, 1, 1], padding='SAME')
    layer4_relu = tf.nn.relu(layer4_conv + variables['b4'])
    layer5_conv = tf.nn.conv2d(layer4_relu, variables['w5'], [1, 1, 1, 1], padding='SAME')
    layer5_relu = tf.nn.relu(layer5_conv + variables['b5'])
    layer5_pool = tf.nn.max_pool(layer5_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer5_norm = tf.nn.local_response_normalization(layer5_pool)

    flat_layer = flatten_tf_array(layer5_norm)
    layer6_fccd = tf.matmul(flat_layer, variables['w6']) + variables['b6']
    layer6_tanh = tf.tanh(layer6_fccd)
    layer6_drop = tf.nn.dropout(layer6_tanh, 0.5)
    layer7_fccd = tf.matmul(layer6_drop, variables['w7']) + variables['b7']
    layer7_tanh = tf.tanh(layer7_fccd)
    layer7_drop = tf.nn.dropout(layer7_tanh, 0.5)
    logits = tf.matmul(layer7_drop, variables['w8']) + variables['b8']
    return logits
```

Now we can modify the CNN model to use the weights and layers of the AlexNet model in order to classify images.

VGG Net was created in 2014 by Karen Simonyan and Andrew Zisserman of the University of Oxford. It contains many more layers (16-19), but each layer is simpler in design: all of the convolutional layers have 3 x 3 filters with a stride of 1, and all max pooling layers have a stride of 2.

So it is a deeper, but architecturally simpler, CNN.

It comes in different configurations, with either 16 or 19 layers. The difference between the two configurations is the usage of either 3 or 4 convolutional layers after the second, third and fourth max pooling layers (see below).

The configuration with 16 layers (configuration D) seems to produce the best results, so let's try to create that one in TensorFlow.

#The VGGNET Neural Network
VGG16_PATCH_SIZE_1, VGG16_PATCH_SIZE_2, VGG16_PATCH_SIZE_3, VGG16_PATCH_SIZE_4 = 3, 3, 3, 3
VGG16_PATCH_DEPTH_1, VGG16_PATCH_DEPTH_2, VGG16_PATCH_DEPTH_3, VGG16_PATCH_DEPTH_4 = 64, 128, 256, 512
VGG16_NUM_HIDDEN_1, VGG16_NUM_HIDDEN_2 = 4096, 1000

def variables_vggnet16(patch_size1 = VGG16_PATCH_SIZE_1, patch_size2 = VGG16_PATCH_SIZE_2,
                       patch_size3 = VGG16_PATCH_SIZE_3, patch_size4 = VGG16_PATCH_SIZE_4,
                       patch_depth1 = VGG16_PATCH_DEPTH_1, patch_depth2 = VGG16_PATCH_DEPTH_2,
                       patch_depth3 = VGG16_PATCH_DEPTH_3, patch_depth4 = VGG16_PATCH_DEPTH_4,
                       num_hidden1 = VGG16_NUM_HIDDEN_1, num_hidden2 = VGG16_NUM_HIDDEN_2,
                       image_width = 224, image_height = 224, image_depth = 3, num_labels = 17):
    w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))
    w2 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, patch_depth1, patch_depth1], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth1]))
    w3 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape = [patch_depth2]))
    w4 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth2, patch_depth2], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape = [patch_depth2]))
    w5 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape = [patch_depth3]))
    w6 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1))
    b6 = tf.Variable(tf.constant(1.0, shape = [patch_depth3]))
    w7 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1))
    b7 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w8 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth4], stddev=0.1))
    b8 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w9 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b9 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w10 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b10 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w11 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b11 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w12 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b12 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    w13 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b13 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))

    # Five max pooling layers, each halving the image size.
    no_pooling_layers = 5
    w14 = tf.Variable(tf.truncated_normal([(image_width // (2**no_pooling_layers))*(image_height // (2**no_pooling_layers))*patch_depth4, num_hidden1], stddev=0.1))
    b14 = tf.Variable(tf.constant(1.0, shape = [num_hidden1]))
    w15 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b15 = tf.Variable(tf.constant(1.0, shape = [num_hidden2]))
    w16 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b16 = tf.Variable(tf.constant(1.0, shape = [num_labels]))

    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8,
        'w9': w9, 'w10': w10, 'w11': w11, 'w12': w12, 'w13': w13, 'w14': w14, 'w15': w15, 'w16': w16,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8,
        'b9': b9, 'b10': b10, 'b11': b11, 'b12': b12, 'b13': b13, 'b14': b14, 'b15': b15, 'b16': b16
    }
    return variables

def model_vggnet16(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.nn.relu(layer1_conv + variables['b1'])
    layer2_conv = tf.nn.conv2d(layer1_actv, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_actv = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.max_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer3_conv = tf.nn.conv2d(layer2_pool, variables['w3'], [1, 1, 1, 1], padding='SAME')
    layer3_actv = tf.nn.relu(layer3_conv + variables['b3'])
    layer4_conv = tf.nn.conv2d(layer3_actv, variables['w4'], [1, 1, 1, 1], padding='SAME')
    layer4_actv = tf.nn.relu(layer4_conv + variables['b4'])
    # note: the pooling layer takes layer4_actv as input
    layer4_pool = tf.nn.max_pool(layer4_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer5_conv = tf.nn.conv2d(layer4_pool, variables['w5'], [1, 1, 1, 1], padding='SAME')
    layer5_actv = tf.nn.relu(layer5_conv + variables['b5'])
    layer6_conv = tf.nn.conv2d(layer5_actv, variables['w6'], [1, 1, 1, 1], padding='SAME')
    layer6_actv = tf.nn.relu(layer6_conv + variables['b6'])
    layer7_conv = tf.nn.conv2d(layer6_actv, variables['w7'], [1, 1, 1, 1], padding='SAME')
    layer7_actv = tf.nn.relu(layer7_conv + variables['b7'])
    layer7_pool = tf.nn.max_pool(layer7_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer8_conv = tf.nn.conv2d(layer7_pool, variables['w8'], [1, 1, 1, 1], padding='SAME')
    layer8_actv = tf.nn.relu(layer8_conv + variables['b8'])
    layer9_conv = tf.nn.conv2d(layer8_actv, variables['w9'], [1, 1, 1, 1], padding='SAME')
    layer9_actv = tf.nn.relu(layer9_conv + variables['b9'])
    layer10_conv = tf.nn.conv2d(layer9_actv, variables['w10'], [1, 1, 1, 1], padding='SAME')
    layer10_actv = tf.nn.relu(layer10_conv + variables['b10'])
    layer10_pool = tf.nn.max_pool(layer10_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer11_conv = tf.nn.conv2d(layer10_pool, variables['w11'], [1, 1, 1, 1], padding='SAME')
    layer11_actv = tf.nn.relu(layer11_conv + variables['b11'])
    layer12_conv = tf.nn.conv2d(layer11_actv, variables['w12'], [1, 1, 1, 1], padding='SAME')
    layer12_actv = tf.nn.relu(layer12_conv + variables['b12'])
    layer13_conv = tf.nn.conv2d(layer12_actv, variables['w13'], [1, 1, 1, 1], padding='SAME')
    layer13_actv = tf.nn.relu(layer13_conv + variables['b13'])
    layer13_pool = tf.nn.max_pool(layer13_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    flat_layer = flatten_tf_array(layer13_pool)
    layer14_fccd = tf.matmul(flat_layer, variables['w14']) + variables['b14']
    layer14_actv = tf.nn.relu(layer14_fccd)
    layer14_drop = tf.nn.dropout(layer14_actv, 0.5)

    layer15_fccd = tf.matmul(layer14_drop, variables['w15']) + variables['b15']
    layer15_actv = tf.nn.relu(layer15_fccd)
    layer15_drop = tf.nn.dropout(layer15_actv, 0.5)

    logits = tf.matmul(layer15_drop, variables['w16']) + variables['b16']
    return logits

As a comparison, have a look at the LeNet5 CNN performance on the larger oxflower17 dataset:

The code is also available in my GitHub repository, so feel free to use it on your own dataset(s).

There is much more to explore in the world of Deep Learning: Recurrent Neural Networks, Region-Based CNNs, GANs, Reinforcement Learning, etc. In future blog posts I'll build these types of Neural Networks, and also build awesome applications with what we have already learned.

So subscribe and stay tuned!

[1] If you feel like you need to refresh your understanding of CNN’s, here are some good starting points to get you up to speed:

- Machine Learning is fun!
- An Intuitive Explanation of Convolutional Neural Networks
- CS231n Convolutional Neural Networks for Visual Recognition
- Udacity's Deep Learning course
- Neural Networks and Deep Learning Ch 6.

[2] If you want more information about the theory behind these different Neural Networks, Adit Deshpande’s blog post provides a good comparison of them with links to the original papers. Eugenio Culurciello has a nice blog and article worth a read. In addition to that, also have a look at this github repository containing awesome deep learning papers, and this github repository where deep learning papers are ordered by task and date.

Update 2: I have added sections 2.4, 3.2, 3.3.2 and 4 to this blog post, updated the code on GitHub and improved upon some methods.

For Python programmers, scikit-learn is one of the best libraries to build Machine Learning applications with. It is ideal for beginners because it has a really simple interface and is well documented, with many examples and tutorials.

Besides supervised machine learning (classification and regression), it can also be used for clustering, dimensionality reduction, feature extraction and engineering, and pre-processing the data. The interface is consistent over all of these methods, so it is not only easy to use, but it is also easy to construct a large ensemble of classifiers/regression models and train them with the same commands.
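This consistency can be illustrated with a minimal sketch (the toy iris dataset and the three classifiers below are arbitrary examples; any scikit-learn estimator could be swapped in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Three very different model families, all trained and evaluated
# with exactly the same two commands:
for clf in [LogisticRegression(max_iter=1000), RandomForestClassifier(), GaussianNB()]:
    clf.fit(X, y)
    print(type(clf).__name__, round(clf.score(X, y), 3))
```

Because every estimator exposes the same `fit` / `score` interface, iterating over a collection of classifiers like this is trivial, which is exactly what we will exploit later in this post.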

In this blog post, let's have a look at how to build, train, evaluate and validate a classifier with scikit-learn, how to improve upon the initial classifier with hyper-parameter optimization, and at ways in which we can get a better understanding of complex datasets.

We will do this by going through the classification of two example datasets: the glass dataset and the mushroom dataset.

The glass dataset contains data on six types of glass (from building windows, containers, tableware, headlamps, etc) and each type of glass can be identified by the content of several minerals (for example Na, Fe, K, etc). This dataset only contains numerical data and therefore is a good dataset to get started with.

The second dataset contains non-numerical data and we will need an additional step where we encode the categorical data to numerical data.

Let's start with classifying the types of glass!

First we need to import the necessary modules and libraries which we will use.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB

- The pandas module is used to load, inspect and process the data, and to get it into the shape necessary for classification.
- Seaborn is a library based on matplotlib with nice functionalities for drawing graphs.
- StandardScaler is a class for standardizing and normalizing the dataset, and
- the LabelEncoder class can be used to encode the categorical features (in the mushroom dataset) as numerical values.
- All of the other modules are classifiers which are used for classification of the dataset.

When loading a dataset for the first time, there are several questions we need to ask ourselves:

- What kind of data does the dataset contain? Numerical data, categorical data, geographic information, etc…
- Does the dataset contain any missing data?
- Does the dataset contain any redundant data (noise)?
- Do the values of the features differ over many orders of magnitude? Do we need to standardize or normalize the dataset?

filename_glass = './data/glass.csv'
df_glass = pd.read_csv(filename_glass)
print(df_glass.shape)
display(df_glass.head())
display(df_glass.describe())

We can see that the dataset consists of 214 rows and 10 columns. All of the columns contain numerical data, and there are no rows with missing information. Also most of the features have values in the same order of magnitude.

So for this dataset we do not need to remove any rows, impute missing values or transform categorical data into numerical.

The .describe() method we used above is useful for giving a quick overview of the dataset;

- How many rows of data are there?
- What are some characteristic values like the mean, standard deviation, minimum and maximum value, the 25th percentile etc.

The next step is building and training the actual classifier, which hopefully can accurately classify the data. With this we will be able to tell which type of glass an entry in the dataset belongs to, based on the features.

For this we need to split the dataset into a training set and a test set. With the training set we will train the classifier, and with the test set we will validate the accuracy of the classifier. Usually a 70 % / 30 % ratio is used when splitting into a training and test set, but this ratio should be chosen based on the size of the dataset. For example, if the dataset does not have enough entries, 30% of it might not contain all of the classes or enough information to properly function as a validation set.

Another important note is that the distribution of the different classes in both the training and the test set should be equal to the distribution in the actual dataset. For example, if you have a dataset with review texts which contains 20% negative and 80% positive reviews, both the training and the test set should have this 20% / 80% ratio. The easiest way to approximate this is to split the dataset into a training and test set **randomly**.
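Note that a plain random split only approximates these ratios; scikit-learn's `train_test_split` can enforce them exactly with its `stratify` parameter. A minimal sketch (the 80% / 20% toy labels below are just an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)    # 10 toy samples, 2 features
y = np.array([0]*8 + [1]*2)         # 80% / 20% class distribution

# stratify=y preserves the 80/20 class ratio in both the train and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(y_train), np.bincount(y_test))
```

In this post we will use a simple hand-written split function instead, which does the random split manually.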

def get_train_test(df, y_col, x_cols, ratio):
    """
    This method transforms a dataframe into a train and test set, for this you need to specify:
    1. the ratio train : test (usually 0.7)
    2. the column with the Y_values
    """
    mask = np.random.rand(len(df)) < ratio
    df_train = df[mask]
    df_test = df[~mask]
    Y_train = df_train[y_col].values
    Y_test = df_test[y_col].values
    X_train = df_train[x_cols].values
    X_test = df_test[x_cols].values
    return df_train, df_test, X_train, Y_train, X_test, Y_test

y_col_glass = 'Type'
x_cols_glass = list(df_glass.columns.values)
x_cols_glass.remove(y_col_glass)

train_test_ratio = 0.7
df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col_glass, x_cols_glass, train_test_ratio)

With the dataset split into a training and a test set, we can start building a classification model. We will do this in a slightly different way than usual. The idea behind this is that, when we start with a new dataset, we don't know which (type of) classifier will perform best on it. Will it be an ensemble classifier like Gradient Boosting or Random Forest, a classifier with a functional approach like Logistic Regression, or a classifier with a statistical approach like Naive Bayes?

Because we don't know this, and nowadays computational power is cheap, we will try out all types of classifiers first, and later we can continue to optimize the best performing classifier of this initial batch. For this we make a dictionary, which contains as *keys* the names of the classifiers and as *values* an instance of each classifier.

dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(n_estimators=1000),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "Neural Net": MLPClassifier(alpha = 1),
    "Naive Bayes": GaussianNB(),
    #"AdaBoost": AdaBoostClassifier(),
    #"QDA": QuadraticDiscriminantAnalysis(),
    #"Gaussian Process": GaussianProcessClassifier()
}

Then we can iterate over this dictionary, and for each classifier:

- train the classifier with
`.fit(X_train, Y_train)`

- evaluate how the classifier performs on the training set with
`.score(X_train, Y_train)`

- evaluate how the classifier perform on the test set with
`.score(X_test, Y_test)`

- keep track of how much time it takes to train the classifier with the time module.
- save the trained model, the training score, the test score, and the training time into a dictionary. If necessary this dictionary can be saved with Python's pickle module.

def batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 5, verbose = True):
    """
    This method takes as input the X, Y matrices of the train and test set,
    and fits them on all of the classifiers specified in dict_classifiers.
    The trained models and accuracies are saved in a dictionary. The reason to
    use a dictionary is because it is very easy to save the whole dictionary
    with the pickle module.

    Usually the SVM, Random Forest and Gradient Boosting classifiers take quite
    some time to train. So it is best to train them on a smaller dataset first and
    decide whether you want to comment them out or not based on the test accuracy score.
    """
    dict_models = {}
    for classifier_name, classifier in list(dict_classifiers.items())[:no_classifiers]:
        t_start = time.time()  # time.clock() was removed in Python 3.8
        classifier.fit(X_train, Y_train)
        t_end = time.time()
        t_diff = t_end - t_start
        train_score = classifier.score(X_train, Y_train)
        test_score = classifier.score(X_test, Y_test)
        dict_models[classifier_name] = {'model': classifier, 'train_score': train_score, 'test_score': test_score, 'train_time': t_diff}
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=classifier_name, f=t_diff))
    return dict_models

def display_dict_models(dict_models, sort_by='test_score'):
    cls = [key for key in dict_models.keys()]
    test_s = [dict_models[key]['test_score'] for key in cls]
    training_s = [dict_models[key]['train_score'] for key in cls]
    training_t = [dict_models[key]['train_time'] for key in cls]

    df_ = pd.DataFrame(data=np.zeros(shape=(len(cls), 4)), columns = ['classifier', 'train_score', 'test_score', 'train_time'])
    for ii in range(0, len(cls)):
        df_.loc[ii, 'classifier'] = cls[ii]
        df_.loc[ii, 'train_score'] = training_s[ii]
        df_.loc[ii, 'test_score'] = test_s[ii]
        df_.loc[ii, 'train_time'] = training_t[ii]
    display(df_.sort_values(by=sort_by, ascending=False))

The reason why we keep track of the time it takes to train a classifier is that, in practice, this is also an important indicator of whether or not you would like to use a specific classifier. If there are two classifiers with similar results, but one of them takes much less time to train, you probably want to use that one.

The `score()` method simply returns the result of the `accuracy_score()` method from the metrics module. This module contains many methods for evaluating classification or regression models, and I can recommend spending some time learning which metrics you can use to evaluate your model.

The classification_report method, for example, calculates the precision, recall and F1-score for all of the classes in your dataset. If you are looking for ways to improve the accuracy of your classifier, or if you want to know why the accuracy is lower than expected, such detailed information about the performance of the classifier on the dataset can point you in the right direction.
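A short sketch of what this looks like (the true and predicted labels below are made up, purely for illustration):

```python
from sklearn.metrics import classification_report

# Hypothetical ground-truth labels and classifier predictions:
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]

# Prints precision, recall, f1-score and support for each of the classes 1, 2 and 3
print(classification_report(y_true, y_pred))
```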

The accuracy on the training set, the accuracy on the test set, and the duration of training were saved into a dictionary, and we can use the `display_dict_models()` method to visualize the results, ordered by the test score.

dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 8)
display_dict_models(dict_models)

What we are doing feels like a brute force approach, where a large number of classifiers are built to see which one performs best. It gives us an idea of which classifiers will perform well on a particular dataset and which will not. After that you can continue with the best (or top 3) classifiers, and try to improve the results by tweaking their parameters, or by adding more features to the dataset.

As we can see, the Gradient Boosting classifier performs best for this dataset. In fact, classifiers like Random Forest and Gradient Boosting perform best for most datasets and challenges on Kaggle (that does not mean you should rule out all other classifiers).

For those who are interested in the theory behind these classifiers, scikit-learn has a pretty well written user guide. Some of these classifiers were also explained in previous posts, like the naive bayes classifier and logistic regression, and support vector machines were partially explained in the perceptron blog post.

After we have determined, with this quick and dirty method, which classifier performs best for the dataset, we can improve upon it by optimizing its hyper-parameters.

GDB_params = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.5, 0.1, 0.01, 0.001],
    'criterion': ['friedman_mse', 'mse', 'mae']
}

df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col_glass, x_cols_glass, 0.6)

for n_est in GDB_params['n_estimators']:
    for lr in GDB_params['learning_rate']:
        for crit in GDB_params['criterion']:
            clf = GradientBoostingClassifier(n_estimators=n_est, learning_rate=lr, criterion=crit)
            clf.fit(X_train, Y_train)
            train_score = clf.score(X_train, Y_train)
            test_score = clf.score(X_test, Y_test)
            print("For ({}, {}, {}) - train, test score: \t {:.5f} \t-\t {:.5f}".format(n_est, lr, crit[:4], train_score, test_score))
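As a side note, scikit-learn can automate such nested loops with `GridSearchCV`, which additionally cross-validates every parameter combination. A minimal sketch (using a synthetic dataset and a smaller grid so it runs quickly; on the glass dataset you would pass X_train, Y_train and a grid like GDB_params instead):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Every combination in the grid is fitted and scored with 3-fold cross-validation:
param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.5, 0.1]}
grid = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_)            # best parameter combination found
print(round(grid.best_score_, 3))   # its mean cross-validated accuracy
```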

The second dataset we will have a look at is the mushroom dataset, which contains data on edible vs poisonous mushrooms. In the dataset there are 8124 mushrooms in total (4208 edible and 3916 poisonous) described by 22 features each.

The big difference from the glass dataset is that these features have categorical instead of numerical values. Because of this, we need one extra step in the classification process: the encoding of these values (see section 3.3).

filename_mushrooms = './data/mushrooms.csv'
df_mushrooms = pd.read_csv(filename_mushrooms)
display(df_mushrooms.head())

A fast way to find out what kind of categorical data a dataset contains is to print the unique values of each column in the dataframe. This way we can also see whether the dataset contains any missing values or redundant columns.

for col in df_mushrooms.columns.values:
    print(col, df_mushrooms[col].unique())

As we can see, there are 22 categorical features. Of these, the feature ‘veil-type’ only contains one value ‘p’ and therefore does not provide any added value for any classifier. The best thing to do is to remove columns like this which only contain one value.

for col in df_mushrooms.columns.values:
    if len(df_mushrooms[col].unique()) <= 1:
        print("Removing column {}, which only contains the value: {}".format(col, df_mushrooms[col].unique()[0]))
        df_mushrooms.drop(col, axis=1, inplace=True)

Some datasets contain missing values in the form of NaN, null, NULL, '?', '??', etc.

It could be that all missing values are of type NaN, or that some columns contain NaN and other columns contain missing data in the form of ‘??’.
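One way to deal with this (a sketch with a small hypothetical dataframe) is to map all of these markers onto a single representation, NaN, so that pandas' missing-data functions can find them:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with missing data in several different forms:
df = pd.DataFrame({'a': ['x', '?', 'y'], 'b': ['??', 'z', np.nan]})

# Replace all known missing-data markers by NaN...
df = df.replace(['?', '??', 'null', 'NULL'], np.nan)

# ...so that they all show up when counting missing values per column:
print(df.isnull().sum())
```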

It is up to your best judgement to decide what to do with these missing values. What is most effective, really depends on the type of data, the type of missing data and the ratio between missing data and non-missing data.

- If the number of rows containing missing data is only a few percent of the total dataset, the best option could be to drop those rows. If half of the rows contain missing values, we could lose valuable information by dropping all of them.
- If there is a row or column which contains almost only missing data, it will not have much added value and it might be best to drop that column.
- It could be that a value not being filled in is itself information which helps with the classification, and it is best to leave it as it is.
- Maybe we really need to improve the accuracy and the only way to do this is by imputing the missing values.
- etc, etc

Below we will look at a few ways in which you can either remove the missing values, or impute them.

print("Number of rows in total: {}".format(df_mushrooms.shape[0]))
print("Number of rows with missing values in column 'stalk-root': {}".format(df_mushrooms[df_mushrooms['stalk-root'] == '?'].shape[0]))
df_mushrooms_dropped_rows = df_mushrooms[df_mushrooms['stalk-root'] != '?']

drop_percentage = 0.8

df_mushrooms_dropped_cols = df_mushrooms.copy(deep=True)
df_mushrooms_dropped_cols.loc[df_mushrooms_dropped_cols['stalk-root'] == '?', 'stalk-root'] = np.nan

for col in df_mushrooms_dropped_cols.columns.values:
    no_rows = df_mushrooms_dropped_cols[col].isnull().sum()
    percentage = no_rows / df_mushrooms_dropped_cols.shape[0]
    if percentage > drop_percentage:
        del df_mushrooms_dropped_cols[col]
        print("Column {} contains {} missing values. This is {} percent. Dropping this column.".format(col, no_rows, percentage))

df_mushrooms_zerofill = df_mushrooms.copy(deep = True)
df_mushrooms_zerofill.loc[df_mushrooms_zerofill['stalk-root'] == '?', 'stalk-root'] = np.nan
df_mushrooms_zerofill.fillna(0, inplace=True)

df_mushrooms_bfill = df_mushrooms.copy(deep = True)
df_mushrooms_bfill.loc[df_mushrooms_bfill['stalk-root'] == '?', 'stalk-root'] = np.nan
df_mushrooms_bfill.fillna(method='bfill', inplace=True)

df_mushrooms_ffill = df_mushrooms.copy(deep = True)
df_mushrooms_ffill.loc[df_mushrooms_ffill['stalk-root'] == '?', 'stalk-root'] = np.nan
df_mushrooms_ffill.fillna(method='ffill', inplace=True)

Most classifiers can only work with numerical data, and will raise an error when categorical values in the form of strings are used as input. When it comes to columns with categorical data, you can do two things.

- 1) Label encode the column, such that its categorical values are converted to integer values.
- 2) One-hot encode the column: expand it into N different columns containing binary values.

**Example:** Let's assume that we have a column called 'FRUIT' which contains the unique values ['ORANGE', 'APPLE', 'PEAR'].

- In the first case it would be converted to the unique integer values [0, 1, 2].
- In the second case it would be expanded into three different columns called ['FRUIT_IS_ORANGE', 'FRUIT_IS_APPLE', 'FRUIT_IS_PEAR'], after which the original column 'FRUIT' would be deleted. The three new columns contain the value 1 or 0, depending on the value of the original column.

When using the first method, you should pay attention to the fact that some classifiers will try to make sense of the numerical value of the label encoded column. For example, the Nearest Neighbour algorithm assumes that the value 1 is closer to 0 than the value 2. But these numerical values carry no such meaning for label encoded columns (an APPLE is not closer to an ORANGE than a PEAR is), so the results can be misleading.

def label_encode(df, columns):
    for col in columns:
        le = LabelEncoder()
        col_values_unique = list(df[col].unique())
        le_fitted = le.fit(col_values_unique)
        col_values = list(df[col].values)
        col_values_transformed = le.transform(col_values)
        df[col] = col_values_transformed

df_mushrooms_ohe = df_mushrooms.copy(deep=True)
to_be_encoded_cols = df_mushrooms_ohe.columns.values
label_encode(df_mushrooms_ohe, to_be_encoded_cols)
display(df_mushrooms_ohe.head())

## Now lets do the same thing for the other dataframes
df_mushrooms_dropped_rows_ohe = df_mushrooms_dropped_rows.copy(deep = True)
df_mushrooms_zerofill_ohe = df_mushrooms_zerofill.copy(deep = True)
df_mushrooms_bfill_ohe = df_mushrooms_bfill.copy(deep = True)
df_mushrooms_ffill_ohe = df_mushrooms_ffill.copy(deep = True)

label_encode(df_mushrooms_dropped_rows_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_zerofill_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_bfill_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_ffill_ohe, to_be_encoded_cols)

def expand_columns(df, list_columns):
    for col in list_columns:
        colvalues = df[col].unique()
        for colvalue in colvalues:
            newcol_name = "{}_is_{}".format(col, colvalue)
            df.loc[df[col] == colvalue, newcol_name] = 1
            df.loc[df[col] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)

y_col = 'class'
to_be_expanded_cols = list(df_mushrooms.columns.values)
to_be_expanded_cols.remove(y_col)

df_mushrooms_expanded = df_mushrooms.copy(deep=True)
label_encode(df_mushrooms_expanded, [y_col])
expand_columns(df_mushrooms_expanded, to_be_expanded_cols)

## Now lets do the same thing for all other dataframes
df_mushrooms_dropped_rows_expanded = df_mushrooms_dropped_rows.copy(deep = True)
df_mushrooms_zerofill_expanded = df_mushrooms_zerofill.copy(deep = True)
df_mushrooms_bfill_expanded = df_mushrooms_bfill.copy(deep = True)
df_mushrooms_ffill_expanded = df_mushrooms_ffill.copy(deep = True)

label_encode(df_mushrooms_dropped_rows_expanded, [y_col])
label_encode(df_mushrooms_zerofill_expanded, [y_col])
label_encode(df_mushrooms_bfill_expanded, [y_col])
label_encode(df_mushrooms_ffill_expanded, [y_col])

expand_columns(df_mushrooms_dropped_rows_expanded, to_be_expanded_cols)
expand_columns(df_mushrooms_zerofill_expanded, to_be_expanded_cols)
expand_columns(df_mushrooms_bfill_expanded, to_be_expanded_cols)
expand_columns(df_mushrooms_ffill_expanded, to_be_expanded_cols)

We have seen that there are two different ways to handle columns with categorical data, and many different ways to handle missing values.

Since computational power is cheap, it is easy to try out all of the classifiers on all of the different ways we have handled the missing values.

After we have seen which method and which classifier initially have the highest accuracy, we can continue in that direction.

Again, we will split the dataset into a 70% training set and a 30% test set and start training and validating a batch of the eight most used classifiers.

dict_dataframes = {
    "df_mushrooms_ohe": df_mushrooms_ohe,
    "df_mushrooms_dropped_rows_ohe": df_mushrooms_dropped_rows_ohe,
    "df_mushrooms_zerofill_ohe": df_mushrooms_zerofill_ohe,
    "df_mushrooms_bfill_ohe": df_mushrooms_bfill_ohe,
    "df_mushrooms_ffill_ohe": df_mushrooms_ffill_ohe,
    "df_mushrooms_expanded": df_mushrooms_expanded,
    "df_mushrooms_dropped_rows_expanded": df_mushrooms_dropped_rows_expanded,
    "df_mushrooms_zerofill_expanded": df_mushrooms_zerofill_expanded,
    "df_mushrooms_bfill_expanded": df_mushrooms_bfill_expanded,
    "df_mushrooms_ffill_expanded": df_mushrooms_ffill_expanded
}

y_col = 'class'
train_test_ratio = 0.7

for df_key, df in dict_dataframes.items():
    x_cols = list(df.columns.values)
    x_cols.remove(y_col)
    df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df, y_col, x_cols, train_test_ratio)
    dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 8, verbose=False)
    print()
    print(df_key)
    display_dict_models(dict_models)
    print("-------------------------------------------------------")

As we can see here, the accuracy of the classifiers for this dataset is actually quite high.

Some datasets contain a lot of features and it is not immediately clear which of these features help with the classification / regression, and which only add noise.

To get a better understanding of how the dataset is made up of its features, in the next few sections we will discuss a few methods which can give more insight.

To get more insight into how (strongly) each feature is correlated with the type of glass, we can calculate and plot the correlation matrix for this dataset.

correlation_matrix = df_glass.corr()
plt.figure(figsize=(10,8))
ax = sns.heatmap(correlation_matrix, vmax=1, square=True, annot=True, fmt='.2f', cmap='GnBu', cbar_kws={"shrink": .5}, robust=True)
plt.title('Correlation matrix between the features', fontsize=20)
plt.show()

The correlation matrix shows us, for example, that the oxides 'Mg' and 'Al' are most strongly correlated with the type of glass, while the content of 'Ca' is least strongly correlated with it. For some datasets there could be features with no correlation at all; then it might be a good idea to remove these, since they will only function as noise.
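Removing such features can be sketched in a few lines (the threshold of 0.05 is an arbitrary example value, and the dataframe here is synthetic; on the glass dataset you would use df_glass and its 'Type' column):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'f1': rng.rand(10000), 'noise': rng.rand(10000)})
df['Type'] = (df['f1'] > 0.5).astype(int)   # 'f1' is informative, 'noise' is not

# Absolute correlation of every column with the target column:
corr_with_target = df.corr()['Type'].abs()

threshold = 0.05                            # arbitrary example cut-off
to_drop = [col for col in corr_with_target.index
           if col != 'Type' and corr_with_target[col] < threshold]
df = df.drop(columns=to_drop)
print(df.columns.tolist())
```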

A correlation matrix is a good way to get a general picture of how all of the features in the dataset are correlated with each other. For a dataset with a lot of features, however, it can become very large, and the correlation of a single feature with the other features becomes difficult to discern.

If you want to look at the correlations of a single feature, it usually is a better idea to visualize it in the form of a bar-graph:

```python
def display_corr_with_col(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0, len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.bar(x_values, y_values)
    ax.set_title('The correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient [absolute value]', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()

display_corr_with_col(df_glass, 'Type')
```

The cumulative explained variance shows how much of the variance is captured by the first x features.

Below we can see that the first 4 features (i.e. the four features with the largest correlation) already capture 90% of the variance.

```python
X = df_glass[x_cols_glass].values
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
var_ratio = pca.explained_variance_ratio_
components = pca.components_
#print(pca.explained_variance_)

plt.plot(np.cumsum(var_ratio))
plt.xlim(0, 9)
plt.xlabel('Number of Features', fontsize=16)
plt.ylabel('Cumulative explained variance', fontsize=16)
plt.show()
```

If you have low accuracy values for your Regression / Classification model, you could decide to stepwise remove the features with the lowest correlation (or stepwise add features with the highest correlation).
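As a sketch of this idea (with synthetic data and a plain scikit-learn classifier, not the glass dataset itself — the column names and target rule below are made up for illustration), we could rank the features by their absolute correlation with the target and drop the weakest remaining one at each step:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only the features 'a' and 'b' actually determine the target.
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randn(300, 5), columns=list('abcde'))
y = (X['a'] + 2 * X['b'] > 0).astype(int)

# Rank the features by absolute correlation with the target (weakest first).
ranking = X.corrwith(y).abs().sort_values().index.tolist()

cols = list(X.columns)
for weakest in ranking[:-1]:
    X_tr, X_te, y_tr, y_te = train_test_split(X[cols], y, random_state=0)
    acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
    print("%i features -> accuracy %.2f" % (len(cols), acc))
    cols.remove(weakest)  # stepwise remove the feature with the lowest correlation
```

Comparing the accuracies over the steps then shows whether the low-correlation features were contributing anything or only adding noise.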

In addition to the correlation matrix, you can plot the pairwise relationships between the features, to see **how** these features are correlated.

```python
ax = sns.pairplot(df_glass, hue='Type')
plt.title('Pairwise relationships between the features')
plt.show()
```

In my opinion, the best way to master the scikit-learn library is to simply start coding with it. I hope this blog-post gave some insight into the working of scikit-learn library, but for the ones who need some more information, here are some useful links:

dataschool – machine learning with scikit-learn video series

Classification example using the iris dataset

Official scikit-learn documentation


Most tasks in Machine Learning can be reduced to classification tasks. For example, we have a medical dataset and we want to classify who has diabetes (positive class) and who doesn’t (negative class). We have a dataset from the financial world and want to know which customers will default on their credit (positive class) and which customers will not (negative class).

To do this, we can train a Classifier with a ‘training dataset’ and after such a Classifier is trained (we have determined its model parameters) and can accurately classify the training set, we can use it to classify new data (test set). If the training is done properly, the Classifier should predict the class probabilities of the new data with a similar accuracy.

There are three popular Classifiers which use three different mathematical approaches to classify data. Previously we have looked at the first two of these; Logistic Regression and the Naive Bayes classifier. Logistic Regression uses a functional approach to classify data, and the Naive Bayes classifier uses a statistical (Bayesian) approach to classify data.

Logistic Regression assumes there is some function which forms a correct model of the dataset (i.e. it maps the input values correctly to the output values). This function is defined by its parameters . We can use the gradient descent method to find the optimum values of these parameters.

The Naive Bayes method is much simpler than that; we do not have to optimize a function, but can calculate the Bayesian (conditional) probabilities directly from the training dataset. This can be done quite fast (by creating a hash table containing the probability distributions of the features) but is generally less accurate.

Classification of data can also be done via a third way, by using a geometrical approach. The main idea is to find a line, or a plane, which can separate the two classes in their feature space. Classifiers which are using a geometrical approach are the Perceptron and the SVM (Support Vector Machines) methods.

Below we will discuss the Perceptron classification algorithm. Although Support Vector Machines is used more often, I think a good understanding of the Perceptron algorithm is essential to understanding Support Vector Machines and Neural Networks.

The Perceptron is a lightweight algorithm which can classify data quite fast. It only works, however, in the limited case of a linearly separable, binary dataset. If you have a dataset consisting of only two classes, the Perceptron classifier can be trained to find a linear hyperplane which separates the two. If the dataset is not linearly separable, the perceptron will fail to find a separating hyperplane.

If the dataset consists of more than two classes we can use the standard approaches in multiclass classification (one-vs-all and one-vs-one) to transform the multiclass dataset into a number of binary datasets. For example, if we have a dataset which consists of three different classes:

- In **one-vs-all**, class I is considered the positive class and the rest of the classes are considered the negative class. We can then look for a separating hyperplane between class I and the rest of the dataset (class II and III). This process is repeated for class II and then for class III, so we are trying to find three separating hyperplanes: between class I and the rest of the data, between class II and the rest of the data, etc. If the dataset consists of K classes, we end up with K separating hyperplanes.
- In **one-vs-one**, class I is considered the positive class and each of the other classes in turn is considered the negative class; so first class II is considered the negative class and then class III is considered the negative class. This process is then repeated with the other classes as the positive class.

So if the dataset consists of K classes, we are looking for $K(K-1)/2$ separating hyperplanes.

Although one-vs-one can be a bit slower (there is one more iteration layer), it is not difficult to imagine that it will be more advantageous in situations where a (linear) separating hyperplane does not exist between one class and the rest of the data, while it does exist between one class and the other classes when they are considered individually. In the image below there is no separating line between the pear-class and the other two classes.
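To make the one-vs-all transformation concrete, here is a minimal sketch (with a made-up label vector) of how a 3-class label vector is turned into the K binary label vectors, each of which can then be given to a binary classifier separately:

```python
import numpy as np

labels = np.array([1, 2, 3, 1, 3, 2])  # hypothetical classes I, II, III

binary_problems = {}
for k in (1, 2, 3):
    # class k becomes the positive class (+1), all other classes negative (-1)
    binary_problems[k] = np.where(labels == k, 1, -1)

print(binary_problems[1])  # [ 1 -1 -1  1 -1 -1]
```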

The algorithm of the Perceptron is similar to that of Support Vector Machines (SVM). Both algorithms find a (linear) hyperplane separating the two classes. The biggest difference is that the Perceptron algorithm will find **any** separating hyperplane, while the SVM algorithm uses a Lagrangian constraint to find the hyperplane which is optimized to have the **maximum margin**. That is, the distance between the hyperplane and the nearest points of each class is maximized. This is illustrated in the figure below: while the Perceptron classifier is satisfied if any of these separating hyperplanes is found, a SVM classifier will find the green one, which has the maximum margin.

Another difference is that if the dataset is not linearly separable [2], the perceptron will fail to find a separating hyperplane; the algorithm simply does not converge during its iteration cycle. The SVM, on the other hand, can still find a maximum margin, minimum cost decision boundary (a separating hyperplane which does not separate 100% of the data, but does so with some small error).

It is often said that the perceptron is modeled after neurons in the brain. It has input values (which correspond with the features of the examples in the training set) and one output value. Each input value $x_i$ is multiplied by a weight-factor $w_i$. If the sum of the products between the feature values and weight-factors is larger than zero, the perceptron is activated and ‘fires’ a signal (+1). Otherwise it is not activated.

The weighted sum of the input-values and the weight-values can mathematically be determined with the scalar-product $\mathbf{w} \cdot \mathbf{x}$. To produce the behaviour of ‘firing’ a signal (+1) we can use the signum function $\text{sgn}(x)$; it maps the output to +1 if the input is positive, and to -1 if the input is negative.

Thus, this Perceptron can mathematically be modeled by the function $y = \text{sgn}(b + \mathbf{w} \cdot \mathbf{x})$. Here $b$ is the bias, i.e. the default value when all feature values are zero.

The perceptron algorithm looks as follows:

```python
import numpy as np

class Perceptron():
    """
    Class for performing Perceptron classification.
    X is the input array with n rows (no_examples) and m columns (no_features).
    Y is a vector containing elements which indicate the class
    (1 for the positive class, -1 for the negative class).
    w is the weight-vector (m number of elements).
    b is the bias-value.
    """
    def __init__(self, b=0, max_iter=1000):
        self.max_iter = max_iter
        self.w = []
        self.b = b
        self.no_examples = 0
        self.no_features = 0

    def train(self, X, Y):
        self.no_examples, self.no_features = np.shape(X)
        self.w = np.zeros(self.no_features)
        for ii in range(0, self.max_iter):
            w_updated = False
            for jj in range(0, self.no_examples):
                a = self.b + np.dot(self.w, X[jj])
                if np.sign(Y[jj] * a) != 1:
                    w_updated = True
                    self.w += Y[jj] * X[jj]
                    self.b += Y[jj]
            if not w_updated:
                print("Convergence reached in %i iterations." % ii)
                break
        if w_updated:
            print(
                """
                WARNING: convergence not reached in %i iterations.
                Either the dataset is not linearly separable,
                or max_iter should be increased.
                """ % self.max_iter
            )

    def classify_element(self, x_elem):
        return int(np.sign(self.b + np.dot(self.w, x_elem)))

    def classify(self, X):
        predicted_Y = []
        for ii in range(np.shape(X)[0]):
            y_elem = self.classify_element(X[ii])
            predicted_Y.append(y_elem)
        return predicted_Y
```

As you can see, we set the bias-value and all the elements in the weight-vector to zero. Then we iterate at most ‘max_iter’ times over all the examples in the training set.

Here, $y_j$ is the actual output value of each training example. This is either +1 (if it belongs to the positive class) or -1 (if it does not belong to the positive class).

The activation function value $a$ is the predicted output value. The product $y_j \cdot a$ will be positive if the prediction is correct and negative if the prediction is incorrect. Therefore, if the prediction made (with the weight vector from the previous training example) is incorrect, $\text{sgn}(y_j \cdot a)$ will be -1, and the weight vector is updated.
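A tiny numeric check of this criterion (the label and activation values below are made up):

```python
import numpy as np

y, a = 1, -0.5          # true label +1 but negative activation: wrong prediction
print(np.sign(y * a))   # -1.0, so the weights would be updated

y, a = -1, -0.5         # label -1 and negative activation: correct prediction
print(np.sign(y * a))   # 1.0, so no update is needed
```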

If the weight vector is not updated after some iteration, it means we have reached convergence and we can break out of the loop.

If the weight vector was updated in the last iteration, it means we still didn’t reach convergence, and either the dataset is not linearly separable, or we need to increase ‘max_iter’.

We can see that the Perceptron is an online algorithm; it iterates through the examples in the training set, and for each example in the training set it calculates the value of the activation function and updates the values of the weight-vector.

Now let’s examine the Perceptron algorithm for a linearly separable dataset which exists in 2 dimensions. For this we first have to create this dataset:

```python
import random
import numpy as np

def generate_data(no_points):
    X = np.zeros(shape=(no_points, 2))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = random.randint(1, 9) + 0.5
        X[ii][1] = random.randint(1, 9) + 0.5
        Y[ii] = 1 if X[ii][0] + X[ii][1] >= 13 else -1
    return X, Y
```

In the 2D case, the perceptron algorithm looks like:

```python
X, Y = generate_data(100)
p = Perceptron()
p.train(X, Y)

X_test, Y_test = generate_data(50)
predicted_Y_test = p.classify(X_test)
```

As we can see, the weight vector and the bias (which together determine the separating hyperplane) are updated when $\text{sgn}(y_j \cdot a)$ is not positive.

The result is nicely illustrated in this gif:

We can extend this to a dataset in any number of dimensions, and as long as it is linearly separable, the Perceptron algorithm will converge.

One of the benefits of this Perceptron is that it is a very ‘lightweight’ algorithm; it is computationally very fast and easy to implement for datasets which are linearly separable. But if the dataset is not linearly separable, it will not converge.

For such datasets, the Perceptron can still be used if the correct kernel is applied. In practice this is never done, and Support Vector Machines are used whenever a Kernel needs to be applied. Some of these Kernels are:

- Linear: $k(\mathbf{x}, \mathbf{y}) = \mathbf{x}^T \mathbf{y} + c$
- Polynomial: $k(\mathbf{x}, \mathbf{y}) = (\alpha \, \mathbf{x}^T \mathbf{y} + c)^d$, with degree $d$
- Laplacian RBF: $k(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\| / \sigma)$
- Gaussian RBF: $k(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / (2\sigma^2))$

At this point it would be too much to also implement kernel functions, but I hope to do so in a next post about SVM. For more information about kernel functions, a comprehensive list of kernels, and their source code, please click here.
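As a small sketch, two of the kernels above written out as plain functions (the parameter names sigma, alpha, c and d are my own choice):

```python
import numpy as np

def gaussian_rbf(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def polynomial(x, y, alpha=1.0, c=1.0, d=2):
    # k(x, y) = (alpha * <x, y> + c)^d
    return (alpha * np.dot(x, y) + c) ** d

x = np.array([1.0, 2.0])
print(gaussian_rbf(x, x))   # identical points give 1.0
print(polynomial(x, x))     # (1*5 + 1)**2 = 36.0
```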

**PS: The Python code for Logistic Regression can be forked/cloned from GitHub.**

In the previous blog we have seen the theory and mathematics behind the Logistic Regression Classifier.

Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even sports.

Reading all of this, the theory[1] of Logistic Regression Classification might look difficult. In my experience, the average Developer does not believe they can design a proper Logistic Regression Classifier from scratch. I strongly disagree: not only is the mathematics behind it relatively simple, it can also be implemented with a few lines of code.

I have done this in the past month, so I thought I’d show you how to do it. The code is in Python but it should be relatively easy to translate it to other languages. Some of the examples contain self-generated data, while other examples contain real-world (iris) data. As was also done in the blog-posts about the bag-of-words model and the Naive Bayes Classifier, we will also try to automatically classify the sentiments of Amazon.com book reviews.

We have seen that the technique to perform Logistic Regression is similar to regular Regression Analysis.

There is a function which maps the input values to the output values, and this function is completely determined by its parameters $\theta$. So once we have determined the $\theta$ values with the training examples, we can determine the class of any new example.

We are trying to estimate these parameter values with the iterative Gradient Descent method. In the Gradient Descent method, the values of the parameters $\theta$ in the current iteration are calculated by updating the values of $\theta$ from the previous iteration with the gradient of the cost function $J(\theta)$.

In (regular) Regression this hypothesis function can be any function which you expect will provide a good model of the dataset. In Logistic Regression the hypothesis function is always given by the Logistic function:

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}.$$

Different cost functions exist, but most often the log-likelihood function, known as the binary cross-entropy (see equation 2 of the previous post), is used.

One of its benefits is that the gradient of this cost function turns out to be quite simple, and since it is the gradient we use to update the values of $\theta$, this makes our work easier.

Taking all of this into account, this is how Gradient Descent works:

- Make an initial but intelligent guess for the values of the parameters $\theta$.
- Keep iterating while the value of the cost function has not met your criteria*:
  - With the current values of $\theta$, calculate the gradient of the cost function ( $\nabla J(\theta)$ ).
  - Update the values of the parameters: $\theta := \theta - \alpha \nabla J(\theta)$, where $\alpha$ is the learning rate.
  - Fill in these new values in the hypothesis function and calculate again the value of the cost function.

*Usually the iteration stops when either the maximum number of iterations has been reached, or the error (the difference between the cost of this iteration and the cost of the previous iteration) is smaller than some minimum error value (e.g. 0.001).
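The loop above can be sketched for a simple one-parameter toy cost function of my own choosing, $J(\theta) = (\theta - 3)^2$ with gradient $2(\theta - 3)$; the learning rate and the two stopping criteria are the ones described in the text:

```python
def gradient_descent(theta, alpha=0.1, max_iter=1000, min_error=0.001):
    prev_cost = (theta - 3) ** 2
    for _ in range(max_iter):
        grad = 2 * (theta - 3)        # gradient of the cost function
        theta = theta - alpha * grad  # update the parameter
        cost = (theta - 3) ** 2       # re-evaluate the cost
        if abs(prev_cost - cost) < min_error:
            break                     # error smaller than the minimum error value
        prev_cost = cost
    return theta

print(gradient_descent(0.0))  # converges towards the minimum at theta = 3
```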

We have seen the self-generated example of students participating in a Machine Learning course, where their final grade depended on how many hours they had studied.

First, let’s generate the data:

```python
import random
import numpy as np

num_of_datapoints = 100
x_max = 10
initial_theta = [1, 0.07]

def func1(X, theta, add_noise=True):
    if add_noise:
        return theta[0]*X[0] + theta[1]*X[1]**2 + 0.25*X[1]*(random.random()-1)
    return theta[0]*X[0] + theta[1]*X[1]**2

def generate_data(num_of_datapoints, x_max, theta):
    X = np.zeros(shape=(num_of_datapoints, 2))
    Y = np.zeros(shape=num_of_datapoints)
    for ii in range(num_of_datapoints):
        X[ii][0] = 1
        X[ii][1] = (x_max*ii) / float(num_of_datapoints)
        Y[ii] = func1(X[ii], theta)
    return X, Y

X, Y = generate_data(num_of_datapoints, x_max, initial_theta)
```

We can see that we have generated 100 points uniformly distributed over the $x$-axis. For each of these $x$-points the $y$-value is determined by $\theta_0 + \theta_1 x^2$ minus some random value.

On the left we can see a scatterplot of the datapoints and on the right we can see the same data with a curve fitted through the points. This is the curve we are trying to estimate with the Gradient Descent method. This is done as follows:

```python
def gradient_descent(X, Y, theta, alpha, m, number_of_iterations):
    for ii in range(0, number_of_iterations):
        print("iteration %s : feature-value: %s" % (ii, theta))
        hypothesis = np.dot(X, theta)
        cost = sum([theta[0]*X[iter][0] + theta[1]*X[iter][1] - Y[iter] for iter in range(m)])
        grad0 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][0]**2 for iter in range(m)])
        grad1 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][1]**4 for iter in range(m)])
        theta[0] = theta[0] - alpha * grad0
        theta[1] = theta[1] - alpha * grad1
    return theta

numIterations = 1000
alpha = 0.00000005
m, n = np.shape(X)
theta = np.ones(n)
theta = gradient_descent(X, Y, theta, alpha, m, numIterations)
```

We can see that we have to calculate the gradient of the cost function at each iteration and update both feature values simultaneously! This indeed results in the curve we were looking for:

After this short example of Regression, let’s have a look at a few examples of Logistic Regression. We will start out with the self-generated example of students passing a course or not, and then we will look at real-world data.

Let’s generate some data points. There are students participating in the course Machine Learning, and whether a student passes ( $y = 1$ ) or not ( $y = 0$ ) depends on two variables:

- $x_1$: how many hours the student has studied for the exam.
- $x_2$: how many hours the student has slept the day before the exam.

```python
import random
import numpy as np

def func2(x_i):
    if x_i[1] <= 4:
        y = 0
    elif x_i[1] + x_i[2] <= 13:
        y = 0
    else:
        y = 1
    return y

def generate_data2(no_points):
    X = np.zeros(shape=(no_points, 3))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = 1
        X[ii][1] = random.random()*9 + 0.5
        X[ii][2] = random.random()*9 + 0.5
        Y[ii] = func2(X[ii])
    return X, Y

X, Y = generate_data2(300)
```

In our example, the results are pretty binary; everyone who has studied 4 hours or less fails the course, as well as everyone whose studying time + sleeping time is less than or equal to 13 hours ( $x_1 + x_2 \leq 13$ ). The results look like this (the green dots indicate a pass and the red dots a fail):

We have a LogisticRegression class, which sets the values of the learning rate and the maximum number of iterations at its initialization. The values of X, Y are set when these matrices are passed to the “train()” function, and then the values of no_examples, no_features, and theta are determined.

```python
import numpy as np

class LogisticRegression():
    """
    Class for performing logistic regression.
    """
    def __init__(self, learning_rate=0.7, max_iter=1000):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.theta = []
        self.no_examples = 0
        self.no_features = 0
        self.X = None
        self.Y = None

    def add_bias_col(self, X):
        bias_col = np.ones((X.shape[0], 1))
        return np.concatenate([bias_col, X], axis=1)
```

We also have the hypothesis, cost and gradient functions:

```python
    def hypothesis(self, X):
        return 1 / (1 + np.exp(-1.0 * np.dot(X, self.theta)))

    def cost_function(self):
        """
        We will use the binary cross entropy as the cost function.
        https://en.wikipedia.org/wiki/Cross_entropy
        """
        predicted_Y_values = self.hypothesis(self.X)
        cost = (-1.0/self.no_examples) * np.sum(self.Y * np.log(predicted_Y_values)
                                                + (1 - self.Y) * np.log(1 - predicted_Y_values))
        return cost

    def gradient(self):
        predicted_Y_values = self.hypothesis(self.X)
        grad = (-1.0/self.no_examples) * np.dot((self.Y - predicted_Y_values), self.X)
        return grad
```

With these functions, the gradient descent method can be defined as:

```python
    def gradient_descent(self):
        for iter in range(1, self.max_iter):
            cost = self.cost_function()
            delta = self.gradient()
            self.theta = self.theta - self.learning_rate * delta
            print("iteration %s : cost %s " % (iter, cost))
```

These functions are used by the “train()” method, which first sets the values of the matrices X, Y and theta, and then calls the gradient_descent method:

```python
    def train(self, X, Y):
        self.X = self.add_bias_col(X)
        self.Y = Y
        self.no_examples, self.no_features = np.shape(X)
        self.theta = np.ones(self.no_features + 1)
        self.gradient_descent()
```

Once the $\theta$ values have been determined with the gradient descent method, we can use them to classify new examples:

```python
    def classify(self, X):
        X = self.add_bias_col(X)
        predicted_Y = self.hypothesis(X)
        predicted_Y_binary = np.round(predicted_Y)
        return predicted_Y_binary
```

Using this algorithm for gradient descent, we can correctly classify 297 out of 300 datapoints of our self-generated example (wrongly classified points are indicated with a cross).

Now that the concept of Logistic Regression is a bit more clear, let’s classify real-world data!

One of the most famous classification datasets is The Iris Flower Dataset. This dataset consists of three classes, where each example has four numerical features.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score

to_bin_y = {
    1: {'Iris-setosa': 1, 'Iris-versicolor': 0, 'Iris-virginica': 0},
    2: {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 0},
    3: {'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 1}
}

# loading the dataset
datafile = '../datasets/iris/iris.data'
df = pd.read_csv(datafile, header=None)
df_train = df.sample(frac=0.7)
df_test = df.loc[~df.index.isin(df_train.index)]

X_train = df_train.values[:, 0:4].astype(float)
y_train = df_train.values[:, 4]
X_test = df_test.values[:, 0:4].astype(float)
y_test = df_test.values[:, 4]

Y_train = np.array([to_bin_y[3][x] for x in y_train])
Y_test = np.array([to_bin_y[3][x] for x in y_test])

print("training Logistic Regression Classifier")
lr = LogisticRegression()
lr.train(X_train, Y_train)
print("trained")

predicted_Y_test = lr.classify(X_test)
f1 = f1_score(Y_test, predicted_Y_test, pos_label=1)
print("F1-score on the test-set for class %s is: %s" % (1, f1))
```

As you can see, our simple LogisticRegression class can classify the iris dataset with quite a high accuracy:

```
training Logistic Regression Classifier
iteration 1 : cost 8.4609605194
iteration 2 : cost 3.50586831057
iteration 3 : cost 3.78903735339
iteration 4 : cost 6.01488933456
iteration 5 : cost 0.458208317153
iteration 6 : cost 2.67703502395
iteration 7 : cost 3.66033580721
(...)
iteration 998 : cost 0.0362384208231
iteration 999 : cost 0.0362289106001
trained
F1-score on the test-set for class 1 is: 0.973225806452
```

For a full overview of the code, please have a look at GitHub.

Logistic Regression by using Gradient Descent can also be used for NLP / Text Analysis tasks. There is a wide variety of tasks which are done in the field of NLP: authorship attribution, spam filtering, topic classification and sentiment analysis.

For a task like sentiment analysis we can follow the same procedure. We will have as the input a large collection of labelled text documents. These will be used to train the Logistic Regression classifier. The most important task then, is to select the proper features which will lead to the best sentiment classification. Almost everything in the text document can be used as a feature[2]; you are only limited by your creativity.

For sentiment analysis, usually the occurrence of (specific) words is used, or the relative occurrence of words (the word occurrences divided by the total number of words).
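As a small illustration of the relative-occurrence feature (the example sentence below is made up):

```python
# Relative word occurrences: the count of each word divided by the
# total number of words in the review.
review = "spend a dollar more and get a decent edition".split()
total_words = len(review)
relative_occurrence = {word: review.count(word) / total_words for word in set(review)}
print(relative_occurrence['a'])  # 'a' occurs 2 times in 9 words
```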

As we have done before, we have to fill in the $X$ and $Y$ matrices, which will serve as input for the gradient descent algorithm, and this algorithm will give us the resulting feature vector $\theta$. With this vector we can determine the class of other text documents.

As always, $Y$ is a vector with $n$ elements (where $n$ is the number of text-documents). The matrix $X$ is an $n$ by $m+1$ matrix; here $m$ is the total number of relevant words in all of the text-documents. I will illustrate how to build up this matrix with three book reviews:

- **pos:** “This is such a beautiful edition of Harry Potter and the Sorcerer’s Stone. I’m so glad I bought it as a keep sake. The illustrations are just stunning.” (28 words in total)
- **pos:** “A brilliant book that helps you to open up your mind as wide as the sky” (16 words in total)
- **neg:** “This publication is virtually unreadable. It doesn’t do this classic justice. Multiple typos, no illustrations, and the most wonky footnotes conceivable. Spend a dollar more and get a decent edition.” (30 words in total)

These three reviews will result in the following $X$-matrix.

As you can see, each row of the matrix contains all of the data per review and each column contains the data per word. If a review does not contain a specific word, the corresponding column will contain a zero. Such an $X$-matrix containing all the data from the training set can be built up in the following manner:

Assuming that we have a list containing the data from the *training set*:

```python
[
    ([u'downloaded', u'the', u'book', u'to', u'my', ..., u'art'], 'neg'),
    ([u'this', u'novel', u'if', u'bunch', u'of', ..., u'ladies'], 'neg'),
    ([u'forget', u'reading', u'the', u'book', u'and', ..., u'hilarious!'], 'neg'),
    ...
]
```

From this *training_set*, we are going to generate a *words_vector*. This *words_vector* is used to keep track of which column a specific word belongs to. After this *words_vector* has been generated, the $X$ matrix and $Y$ vector can be filled in.

```python
import numpy as np

def generate_words_vector(training_set):
    words_vector = []
    for review in training_set:
        for word in review[0]:
            if word not in words_vector:
                words_vector.append(word)
    return words_vector

def generate_Y_vector(training_set, training_class):
    no_reviews = len(training_set)
    Y = np.zeros(shape=no_reviews)
    for ii in range(0, no_reviews):
        review_class = training_set[ii][1]
        Y[ii] = 1 if review_class == training_class else 0
    return Y

def generate_X_matrix(training_set, words_vector):
    no_reviews = len(training_set)
    no_words = len(words_vector)
    X = np.zeros(shape=(no_reviews, no_words + 1))
    for ii in range(0, no_reviews):
        X[ii][0] = 1
        review_text = training_set[ii][0]
        total_words_in_review = len(review_text)
        for word in set(review_text):
            word_occurrences = review_text.count(word)
            word_index = words_vector.index(word) + 1
            X[ii][word_index] = word_occurrences / float(total_words_in_review)
    return X

words_vector = generate_words_vector(training_set)
X = generate_X_matrix(training_set, words_vector)
Y_neg = generate_Y_vector(training_set, 'neg')
```

As we have done before, the gradient descent method can be applied to derive the feature vector $\theta$ from the $X$ and $Y$ matrices:

```python
numIterations = 100
alpha = 0.55
m, n = np.shape(X)
theta = np.ones(n)
theta_neg = gradient_descent2(X, Y_neg, theta, alpha, m, numIterations)
```

What should we do if a specific review tests positive (Y=1) for more than one class? A review could result in Y=1 for both the *neu* class and the *neg* class. In that case we will pick the class with the highest score. This is called multinomial logistic regression.
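A minimal sketch of that tie-breaking step (the scores below are made-up sigmoid outputs of three one-vs-rest classifiers):

```python
import numpy as np

class_names = ['neg', 'neu', 'pos']
scores = np.array([
    [0.82, 0.64, 0.05],   # review 1: both 'neg' and 'neu' fire; 'neg' scores highest
    [0.10, 0.55, 0.91],   # review 2: 'pos' scores highest
])
predicted = [class_names[i] for i in np.argmax(scores, axis=1)]
print(predicted)  # ['neg', 'pos']
```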

So far, we have seen how to implement a Logistic Regression Classifier in its most basic form. It is true that building such a classifier from scratch is great for learning purposes. It is also true that no one will get to the point of using deeper / more advanced Machine Learning skills without learning the basics first.

For real-world applications however, often the best solution is to not re-invent the wheel but to re-use tools which are already available. Tools which have been tested thoroughly and have been used by plenty of smart programmers before you. One such tool is Python’s NLTK library.

NLTK is Python’s Natural Language Toolkit and it can be used for a wide variety of Text Processing and Analytics jobs like tokenization, part-of-speech tagging and classification. It is easy to use and even includes a lot of text corpora, which can be used to train your model if you have no training set available.

Let us also have a look at how to perform sentiment analysis and text classification with NLTK. As always, we will use a training set to train NLTK’s Maximum Entropy Classifier and a test set to verify the results. Our training set has the following format:

```python
training_set = [
    ([u'this', u'novel', u'if', u'bunch', u'of', u'childish', ..., u'ladies'], 'neg'),
    ([u'where', u'to', u'begin', u'jeez', u'gasping', u'blushing', ..., u'fail????'], 'neg'),
    ...
]
```

As you can see, the training set consists of a list of tuples of two elements. The first element is a list of the words in the text of the document and the second element is the class-label of this specific review (‘neg’, ‘neu’ or ‘pos’). Unfortunately NLTK’s classifiers only accept the text in a hashable format (dictionaries, for example), and that is why we need to convert this list of words into a dictionary of words.

```python
def list_to_dict(words_list):
    return dict([(word, True) for word in words_list])

training_set_formatted = [(list_to_dict(element[0]), element[1]) for element in training_set]
```


Once the training set has been converted into the proper format, it can be fed into the train method of the MaxEnt Classifier:

```python
import nltk

numIterations = 100
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(training_set_formatted, algorithm, max_iter=numIterations)
classifier.show_most_informative_features(10)
```

Once the training of the MaxEntClassifier is done, it can be used to classify the review in the test set:

```python
for review in test_set_formatted:
    label = review[1]
    text = review[0]
    determined_label = classifier.classify(text)
    print(determined_label, label)
```

So far we have seen the theory behind the Naive Bayes Classifier and how to implement it (in the context of Text Classification) and in the previous and this blog-post we have seen the theory and implementation of Logistic Regression Classifiers. Although this is done at a basic level, it should give some understanding of the Logistic Regression method (I hope at a level where you can apply it and classify data yourself). There are however still many (advanced) topics which have not been discussed here:

- Which hill-climbing / gradient descent algorithm to use; IIS (Improved Iterative Scaling), GIS (Generalized Iterative Scaling), BFGS, L-BFGS or Coordinate Descent
- Encoding of the feature vector and the use of dummy variables
- Logistic Regression is an inherently sequential algorithm; although it is quite fast, you might need a parallelization strategy if you start using larger datasets.

If you see any errors please do not hesitate to contact me. If you have enjoyed reading, maybe even learned something, do not forget to subscribe to this blog and share it!

—

[1] See the paper of Nigam et al. on Maximum Entropy and the paper of Bo Pang et al. on Sentiment Analysis using Maximum Entropy. Also see Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models (1997), A brief MaxEnt tutorial, and another good MIT article.

[2] See for example Chapter 7 of Speech and Language Processing by Jurafsky & Martin: for the task of period disambiguation a feature could be whether or not a period is followed by a capital letter, unless the previous word is *St.*

One of the most important tasks in Machine Learning is classification (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in a test set (a dataset of which the entries have not yet been labelled), using the model which was constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, or classifying houses in the real-estate sector. Another field in which classification is big is Natural Language Processing (NLP). The goal of this field of science is to make machines (computers) understand written (human) language. You could think of text categorization, sentiment analysis, spam detection and topic categorization.

For classification tasks there are three widely used algorithms: Naive Bayes, Logistic Regression / Maximum Entropy, and Support Vector Machines. We have already seen how Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it makes the assumption of conditional independence of its features. This is a simplification which makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to a lower accuracy. A direct improvement on the N.B. classifier is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly. This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea, which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let's say we have a dataset containing n datapoints x_1, x_2, ..., x_n. For each of these (input) datapoints there is a corresponding (output) y-value. Here, the x-datapoints are called the independent variables and y the dependent variable; the value of y depends on the value of x, while the value of x may be freely chosen without any restriction imposed on it by any other variable.

The goal of Regression analysis is to find a function which can best describe the correlation between x and y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h(x).

If we can find such a function, we can say we have successfully built a Regression model. If the input-data lives in a 2D-space, this boils down to finding a curve which fits through the data points. In the 3D case we have to find a plane and in higher dimensions a hyperplane.

To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset which contains the final grade y_i of each student i; the dataset X contains the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time, so the variable x_i indicates how many hours student i has studied. The first thing we would do is visualize this data:

If the result looks like the figure on the left, then we are out of luck. It looks like the points are distributed randomly and there is no correlation between x and y at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.

This function could for example be:

h(x) = θ_0 + θ_1 · x

or

h(x) = θ_0 + θ_1 · x²

where θ_0 and θ_1 are the parameters of our model.

In evaluating the results from the previous section, we may find the results unsatisfying; the function does not correlate with the datapoints strongly enough. Our initial assumption is probably not complete. Taking only the studying time into account is not enough. The final grade does not only depend on the studying time, but also on how much the students have slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by X = { (x_i, z_i) }, where x_i indicates how many hours student i has studied and z_i how many hours they have slept.

This is an example of multivariate regression. The function has to include both variables. For example:

h(x, z) = θ_0 + θ_1 · x + θ_2 · z

or

h(x, z) = θ_0 + θ_1 · x + θ_2 · z²

All of the above examples are examples of linear regression. We have seen that in some cases y depends on a linear form of x, but it can also depend on some power of x, or on the log or any other form of x. However, in all cases the parameters θ appeared linearly.

So, what makes linear regression linear is not that y depends in a linear way on x, but that it depends in a linear way on θ: h(x) needs to be linear with respect to the model-parameters θ. Mathematically speaking, it needs to satisfy the superposition principle. Examples of nonlinear regression would be:

h(x) = θ_0 + x^θ_1

or

h(x) = θ_0 · cos(θ_1 · x)

The reason why the distinction is made between linear and nonlinear regression is that nonlinear regression problems are more difficult to solve, and therefore more computationally intensive algorithms are needed.

Linear regression models can be written as a linear system of equations, which can be solved by finding the closed-form solution with Linear Algebra. See these statistics notes for more on solving linear models with linear algebra.

As discussed before, such a closed-form solution can only be found for linear regression problems. However, even when the problem is linear in nature, we need to take into account that calculating the inverse of an n-by-n matrix has a time-complexity of O(n³). This means that for large datasets (large n) finding the closed-form solution will take more time than solving it iteratively (the gradient descent method), as is done for nonlinear problems. So solving it iteratively is usually preferred for larger datasets, even if the problem is linear.
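As a sketch of the closed-form approach, here is the normal equation θ = (XᵀX)⁻¹Xᵀy solved with NumPy; the data and variable names are made up for illustration:

```python
import numpy as np

# Made-up data generated from y = 2 + 3x, so we know the answer in advance.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept term.
X = np.column_stack([np.ones_like(x), x])

# Normal equation: solve (X^T X) theta = X^T y rather than inverting explicitly,
# which is cheaper and numerically more stable.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [2. 3.]
```

For large n, this O(n³)-style solve is exactly the step that becomes more expensive than iterating with gradient descent.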

The Gradient Descent method is a general optimization technique in which we try to find the values of the parameters θ with an iterative approach.

First, we construct a cost function (also known as a loss function or error function) which gives the difference between the values of the hypothesis function h(x) (the values you expect y to have with the determined values of θ) and the actual values of y. The better your estimation of θ is, the better the values of h(x) will approach the values of y.

Usually, the cost function is expressed as the squared error of this difference:

J(θ) = (1/2m) · Σᵢ ( h(xᵢ) − yᵢ )²

At each iteration we choose new values for the parameters θ and move towards the 'true' values of these parameters, i.e. the values which make this cost function as small as possible. The direction in which we have to move is the negative gradient direction:

θ := θ − α · ∂J(θ)/∂θ

The reason for this is that a function’s value decreases the fastest if we move towards the direction of the negative gradient (the directional derivative is maximal in the direction of the gradient).

Taking all this into account, this is how gradient descent works:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of θ, calculate the gradient of the cost function J(θ).
- Update the values of the parameters θ in the negative gradient direction.
- Fill these new values into the hypothesis function and calculate the value of the cost function again.

Just as important as the initial guess of the parameters is the value you choose for the learning rate α. This learning rate determines how fast you move along the slope of the gradient. If the selected value of this learning rate is too small, it will take too many iterations before you reach your convergence criteria. If this value is too large, you might overshoot and not converge.
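Putting the steps and the learning rate together, here is a minimal gradient descent sketch for the model h(x) = θ_0 + θ_1 · x; the data is made up so the answer is known in advance:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=2000):
    """Fit h(x) = theta0 + theta1*x by minimizing the squared-error cost."""
    theta0, theta1 = 0.0, 0.0  # initial guess for the parameters
    m = len(x)
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y  # h(x_i) - y_i
        # Move against the gradient of J(theta) = (1/2m) * sum(error^2)
        theta0 -= alpha * error.sum() / m
        theta1 -= alpha * (error * x).sum() / m
    return theta0, theta1

# Made-up data from y = 2 + 3x; gradient descent should recover (2, 3).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x
theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)
```

With a much larger alpha the updates overshoot and diverge; with a much smaller one, 2000 iterations are no longer enough to converge.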

Logistic Regression is similar to (linear) regression, but adapted for the purpose of classification. The difference is small; for Logistic Regression we also apply gradient descent iteratively to estimate the values of the parameters θ. And again, during the iterations, the values are estimated by taking the gradient of the cost function, which expresses the error between the hypothesis function h(x) and the actual values y. The major difference, however, is the form of the hypothesis function.

When you want to classify something, there is a limited number of classes it can belong to. And for each of these possible classes there can only be two states for y: either x belongs to the specified class and y = 1, or it does not belong to the class and y = 0. Even though the output values are binary, the independent variables x are still continuous. So, we need a function which takes a large set of continuous variables as input and produces a binary output for each of them. This function, the hypothesis function, takes the following form:

h(x) = 1 / (1 + e^(−θ·x))

This function is also known as the logistic function, which is a part of the sigmoid function family. These functions are widely used in the natural sciences because they provide the simplest model for population growth. However, the reason why the logistic function is used for classification in Machine Learning is its ‘S-shape’.

As you can see, this function is bounded in the y-direction by 0 and 1. If the input variable is very negative, the output of the function goes to zero (it does not belong to the class). If the input variable is very positive, the output goes to one and it does belong to the class. (Such a function is called an indicator function.)

The question then is what will happen to input values which are neither very positive nor very negative, but somewhere 'in the middle'. We have to define a decision boundary, which separates the positive from the negative class. Usually this decision boundary is chosen at the middle of the logistic function, namely at z = 0, where the output value is 0.5:

y = 1 if h(x) ≥ 0.5, and y = 0 if h(x) < 0.5   (1)

For those who are wondering where the θ we were talking about before enters the picture: as we can see in the formula of the logistic function, z = θ · x. Meaning, the parameter θ (also known as the feature) maps the input variable x to a position z on the z-axis. With this z-value, we can use the logistic function to calculate the y-value. If this y-value is at least 0.5, we assume x does belong to this class, and vice versa.
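A small numeric sketch of this mapping (the value of θ here is arbitrary):

```python
import math

def logistic(z):
    # The logistic (sigmoid) function: output bounded between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

# Very negative z-values map to ~0, very positive ones to ~1,
# and the decision boundary sits at z = 0, where the output is 0.5.
print(logistic(-10))  # ~0.000045
print(logistic(0))    # 0.5
print(logistic(10))   # ~0.999955

def classify(x, theta):
    # theta maps the input x to a position z on the z-axis;
    # a y-value of at least 0.5 means x belongs to the class.
    return 1 if logistic(theta * x) >= 0.5 else 0
```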

So the feature θ should be chosen such that it predicts the class membership correctly. It is therefore essential to know which features are useful for the classification task. Once the appropriate features are selected, gradient descent can be used to find their optimal values.

How can we do gradient descent with this logistic function? Except for the hypothesis function having a different form, the gradient descent method is exactly the same. We again have a cost function, we iteratively take its gradient w.r.t. the feature θ, and we update the feature value at each iteration.

This cost function is given by:

J(θ) = −(1/m) · Σᵢ [ yᵢ · log(h(xᵢ)) + (1 − yᵢ) · log(1 − h(xᵢ)) ]   (2)

We know that:

h(x) = 1 / (1 + e^(−θ·x))

and

1 − h(x) = e^(−θ·x) / (1 + e^(−θ·x))   (3)

Plugging these two equations back into the cost function gives us:

J(θ) = −(1/m) · Σᵢ [ yᵢ · θ·xᵢ − log(1 + e^(θ·xᵢ)) ]   (4)

The gradient of the cost function with respect to θ is given by:

∂J(θ)/∂θ = (1/m) · Σᵢ ( h(xᵢ) − yᵢ ) · xᵢ   (5)

So the gradient of the seemingly difficult cost function turns out to be a much simpler equation. And with this simple equation, gradient descent for Logistic Regression is again performed in the same way:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of θ, calculate the gradient of the cost function J(θ).
- Update the values of the parameters θ in the negative gradient direction.
- Fill these new values into the hypothesis function and calculate the value of the cost function again.
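The same loop can be sketched for Logistic Regression, using the simple gradient of equation (5); the toy dataset below is made up for illustration:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, alpha=0.5, iterations=10000):
    """Gradient descent using the gradient of Eq. (5): mean((h - y) * x)."""
    theta = np.zeros(X.shape[1])  # initial guess for the feature values
    m = len(y)
    for _ in range(iterations):
        h = logistic(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / m
    return theta

# Made-up toy data: an intercept column plus one input variable,
# with class 1 for the larger x-values.
x = np.array([0.0, 1.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta = train_logistic(X, y)
predictions = (logistic(X @ theta) >= 0.5).astype(float)
print(predictions)  # should match y
```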

In the previous section we have seen how we can use Gradient Descent to estimate the feature values θ, which can then be used to determine the class with the Logistic function. As stated in the introduction, this can be used for a wide variety of classification tasks. The only thing that will be different for each of these classification tasks is the form the features take.

Here we will continue to look at the example of Text Classification. Let's assume we are doing Sentiment Analysis and want to know whether a specific review should be classified as positive, neutral or negative.

The first thing we need to know is which and what types of features we need to include.

For NLP we will need a large number of features; often as many as the number of words present in the training set. We could reduce the number of features by excluding stopwords, or by only considering n-gram features.

For example, the 5-gram 'kept me reading and reading' is much less likely to occur in a review-document than the unigram 'reading', but when it does occur it is much more indicative of the class (positive) than 'reading' alone. Since we only need to consider n-grams which are actually present in the training set, the number of features stays manageable.

The second thing we need to know is the actual value of these features. The values are learned by initializing all features to zero and applying the gradient descent method using the labeled examples in the training set. Once we know the values of the features, we can compute the probability of each class and choose the class with the maximum probability. This is done with the Logistic function described above.

In this post we have discussed only the theory of Maximum Entropy and Logistic Regression. Usually such discussions are better understood with examples and the actual code. I will save that for the next blog.

If you have enjoyed reading this post or maybe even learned something from it, subscribe to this blog so you can receive a notification the next time something is posted.

Miles Osborne, Using Maximum Entropy for Sentence Extraction (2002)

Jurafsky and Martin, Speech and Language Processing; Chapter 7

Nigam et al., Using Maximum Entropy for Text Classification


With the bag-of-words model we check which words of the text-document appear in a positive-words-list or a negative-words-list. If a word appears in the positive-words-list, the total score of the text is updated with +1, and vice versa. If at the end the total score is positive the text is classified as positive, and if it is negative the text is classified as negative. Simple enough!

With the Naive Bayes model, we do not take only a small set of positive and negative words into account, but all words the NB Classifier was trained with, i.e. all words present in the training set. If a word has not appeared in the training set, we have no data available for it and apply Laplacian smoothing (add 1 to every word count, so that unseen words get a small non-zero probability instead of zeroing out the product).

The probability that a document belongs to a class c is given by the class probability multiplied by the product of the conditional probabilities of each word in the document for that class:

P(c | d) ∝ P(c) · Π_{i=1..n} P(w_i | c),   with   P(w | c) = count(w, c) / count(c)

Here count(w, c) is the number of occurrences of word w in class c, count(c) is the total number of words in class c, and n is the number of words in the document we are currently classifying.

count(c) does not change (unless the training set is expanded), so it can be placed outside of the product:

P(c | d) ∝ P(c) · (1 / count(c))ⁿ · Π_{i=1..n} count(w_i, c)
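As a small worked sketch of this computation (the word counts are made up, and add-one smoothing is applied so unseen words do not zero out the product):

```python
from collections import Counter

# Made-up training counts for two classes.
word_counts = {
    "pos": Counter({"good": 3, "book": 2, "great": 1}),
    "neg": Counter({"bad": 3, "book": 2, "boring": 1}),
}
class_prob = {"pos": 0.5, "neg": 0.5}
vocabulary = set().union(*word_counts.values())

def score(document, label):
    # P(c) times the product of smoothed P(word | c) for each word in the document.
    total = sum(word_counts[label].values()) + len(vocabulary)
    p = class_prob[label]
    for word in document:
        p *= (word_counts[label].get(word, 0) + 1) / total  # add-one smoothing
    return p

doc = ["good", "book"]
scores = {label: score(doc, label) for label in class_prob}
print(max(scores, key=scores.get))  # pos
```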

With this information it is easy to implement a Naive Bayes Text Classifier. (Naive Bayes can also be used to classify non-text / numerical datasets, for an explanation see this notebook).

We have a NaiveBayesText class, which accepts the input values X and Y as parameters for the train() method. Here X is a list of lists, where each lower-level list contains all the words in a document, and Y is a list containing the label/class of each document.

from collections import Counter, defaultdict
import numpy as np

class NaiveBaseClass:
    def calculate_relative_occurences(self, list1):
        no_examples = len(list1)
        ro_dict = dict(Counter(list1))
        for key in ro_dict.keys():
            ro_dict[key] = ro_dict[key] / float(no_examples)
        return ro_dict

    def get_max_value_key(self, d1):
        # return the key with the maximum value
        return max(d1, key=d1.get)

    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = defaultdict(list)

class NaiveBayesText(NaiveBaseClass):
    """
    When the goal is classifying text, it is better to give the input X
    in the form of a list of lists containing words:
    X = [ ['this', 'is', 'a', ...], (...) ]
    Y still is a 1D array / list containing the labels of each entry.
    """
    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = []

    def train(self, X, Y):
        self.class_probabilities = self.calculate_relative_occurences(Y)
        self.labels = np.unique(Y)
        self.no_examples = len(Y)
        self.initialize_nb_dict()
        for ii in range(0, len(Y)):
            label = Y[ii]
            self.nb_dict[label] += X[ii]
        # transform the list with all occurences to a dict with relative occurences
        for label in self.labels:
            self.nb_dict[label] = self.calculate_relative_occurences(self.nb_dict[label])

As we can see, the training of the Naive Bayes Classifier is done by iterating through all of the documents in the training set. From all of the documents, a hash table (a dictionary, in Python terms) with the relative occurrence of each word per class is constructed.

This is done in two steps:

1. construct one large list of all words occurring per class:

for ii in range(0, len(Y)):
    label = Y[ii]
    self.nb_dict[label] += X[ii]

2. calculate the relative occurrence of each word in this list with the calculate_relative_occurences method. This method simply uses Python's Counter module to count how often each word occurs and then divides that number by the total number of words.

The result is saved in the dictionary *nb_dict*.

As we can see, it is easy to train the Naive Bayes Classifier. We simply calculate the relative occurrence of each word per class and save the result in the nb_dict dictionary.

This dictionary can be updated, saved to file, and loaded back from file; it contains the result of training the Naive Bayes Classifier.
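For example, a trained nb_dict of plain dicts can be persisted with the standard json module (the filename and contents here are hypothetical):

```python
import json
import os
import tempfile

# A trained nb_dict with made-up relative occurrences.
nb_dict = {"pos": {"good": 0.5, "book": 0.25, "great": 0.25},
           "neg": {"bad": 0.5, "book": 0.5}}

# Save the trained model to file and load it back.
path = os.path.join(tempfile.gettempdir(), "nb_dict.json")
with open(path, "w") as f:
    json.dump(nb_dict, f)
with open(path) as f:
    restored = json.load(f)

print(restored == nb_dict)  # True
```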

Classifying new documents is also done quite easily by calculating the class probability for each class and then selecting the class with the highest probability.

def classify_single_elem(self, X_elem):
    Y_dict = {}
    for label in self.labels:
        class_probability = self.class_probabilities[label]
        nb_dict_features = self.nb_dict[label]
        for word in X_elem:
            if word in nb_dict_features:
                class_probability *= nb_dict_features[word]
            else:
                # unseen word: skip it, so it does not zero out the whole product
                continue
        Y_dict[label] = class_probability
    return self.get_max_value_key(Y_dict)

def classify(self, X):
    self.predicted_Y_values = []
    n = len(X)
    for ii in range(0, n):
        X_elem = X[ii]
        prediction = self.classify_single_elem(X_elem)
        self.predicted_Y_values.append(prediction)
    return self.predicted_Y_values

In the next blog we will look at the results of this naively implemented algorithm for the Naive Bayes Classifier and see how it performs under various conditions; we will see the influence of varying training set sizes and whether the use of n-gram features will improve the accuracy of the classifier.

- To keep track of the number of occurrences of each word, we tokenize the text and add each word to a single list. Then, by using a Counter element, we can keep track of the number of occurrences.
- We can make a DataFrame containing the class probabilities of each word by adding each word to the DataFrame as we encounter it and dividing by the total number of occurrences afterwards.
- Sorting this DataFrame by the values in the columns of the Positive or Negative class and then taking the top 100 / 200 words, we can construct a list containing negative or positive words.
- These words in this constructed Sentiment Lexicon can be used to give a value to the subjectivity of the reviews in the test set.

Using the steps described above, we were able to determine the subjectivity of reviews in the test set with an accuracy (F-score) of ~60%.

In this blog we will look into the effectiveness of cross-book sentiment lexicons; how well does a sentiment lexicon made from book A perform at sentiment analysis of book B?

We will also see how we can improve the bag-of-words technique by including n-gram features in the bag-of-words.

In the previous post, we have seen that the sentiment of reviews in the test-set of ‘Gone Girl’ could be predicted with a 60% accuracy. How well does the sentiment lexicon derived from the training set of book A perform at deducing the sentiment of reviews in the test set of book B?

In the table above, we can see that the most effective Sentiment Lexicons are created from books with a large number of both Positive and Negative reviews. In the previous post we saw that Fifty Shades of Grey has a large number of negative reviews. This makes it a good book to construct an effective Sentiment Lexicon from.

Other books have a lot of positive reviews but only a few negative ones. The Sentiment Lexicon constructed from these books has a high accuracy in determining the sentiment of positive reviews, but a low accuracy for negative reviews… bringing the average down.

In the previous blog-post we constructed a bag-of-words model with unigram features, meaning that we split the entire text into single words and count the occurrence of each word. Such a model does not take the position of each word in the sentence, its context, or the grammar into account. That is why the bag-of-words model has a low accuracy in detecting the sentiment of a text document.

For example, with the bag-of-words model the following two sentences will be given the same score:

1. “This is not a good book” –> 0 + 0 + 0 + 0 + 1 + 0 –> positive

2. “This is a very good book” –> 0 + 0 + 0 + 0 + 1 + 0 –> positive
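The scoring above can be sketched in a few lines (the mini-lexicon is made up):

```python
# Tiny unigram bag-of-words scorer with a made-up sentiment lexicon.
lexicon = {"good": 1, "great": 1, "bad": -1, "terrible": -1}

def bow_score(sentence):
    # Sum the lexicon score of each word; word order is ignored entirely.
    return sum(lexicon.get(word, 0) for word in sentence.lower().split())

# Both sentences score +1 and are classified as positive,
# even though the first one is clearly negative.
print(bow_score("This is not a good book"))  # 1
print(bow_score("This is a very good book"))  # 1
```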

If we include features consisting of two or three words, this problem can be avoided; “not good” and “very good” will be two different features with different subjectivity scores. The biggest reason why bigram or trigram features are not used more often is that the number of possible combinations of words increases exponentially with the number of words. Theoretically, a document with 2.000 distinct words can have 2.000 possible unigram features, 4.000.000 possible bigram features and 8.000.000.000 possible trigram features.

However, if we consider this problem from a pragmatic point of view, we can say that most of the combinations of words which can be made are grammatically impossible or do not occur often enough, and hence do not need to be taken into account.

Actually, we only need to define a small set of words (prepositions, conjunctions, interjections, etc.) of which we know they change the meaning of the words following them and/or the rest of the sentence. If we encounter such an ‘n-gram word’, we do not split the sentence there but split it after the next word. In this way we construct n-gram features consisting of the specified words and the words directly following them. Some examples of such words are:

In the previous post, we had seen that the code to construct a DataFrame containing the class probabilities of words in the training set is:

import pandas as pd

BOW_df = pd.DataFrame(0, columns=scores, index=[])
words_set = set()
for review in training_set:
    score = review['score']
    text = review['review_text']
    splitted_text = split_text(text)
    for word in splitted_text:
        if word not in words_set:
            words_set.add(word)
            BOW_df.loc[word] = [0, 0, 0, 0, 0]
        BOW_df.loc[word, score] += 1

If we also want to include n-grams in this class probability DataFrame, we need to include a function which generates n-grams from the splitted text and a list of the specified n-gram words:

(...)
splitted_text = split_text(text)
text_with_ngrams = generate_ngrams(splitted_text, ngram_words)
for word in text_with_ngrams:
    (...)

There are a few conditions this “generate_ngrams” function needs to fulfill:

- When it iterates through the splitted text and encounters an n-gram word, it needs to concatenate this word with the next word. So [“I”,”do”,”not”,”recommend”,”this”,”book”] needs to become [“I”, “do”, “not recommend”, “this”, “book”]. At the same time it needs to skip the next iteration, so that the next word does not appear twice.
- It needs to be recursive: we might encounter multiple n-gram words in a row. In that case, all of these words need to be concatenated into a single n-gram. So [“This”,”is”,”a”,”very”,”very”,“good”,”book”] needs to become [“This”,”is”,”a”,”very very good”, “book”]. If n words are concatenated together into a single n-gram, the next n iterations need to be skipped.
- In addition to concatenating words with the words following them, it might also be interesting to concatenate them with the words preceding them. For example, forming n-grams including the word “book” and its preceding words leads to features like “worst book”, “best book”, “fascinating book”, etc.

Now that we know this information, let’s have a look at the code:

def generate_ngrams(text, ngram_words):
    new_text = []
    index = 0
    while index < len(text):
        [new_word, new_index] = concatenate_words(index, text, ngram_words)
        new_text.append(new_word)
        index = new_index + 1
    return new_text

def concatenate_words(index, text, ngram_words):
    word = text[index]
    if index == len(text) - 1:
        return word, index
    if word in ngram_words:
        [word_new, new_index] = concatenate_words(index + 1, text, ngram_words)
        word = word + ' ' + word_new
        index = new_index
    return word, index

Here concatenate_words is a recursive function which either returns the word at the index position in the array, or that word concatenated with the next word(s). It also returns the index, so we know how many iterations need to be skipped.

This function will also work if we want to append words to their previous words. Then we simply need to pass the reversed text to it, `text = list(reversed(text))`, and concatenate in reversed order: `word = word_new + ' ' + word`.

We can put this information together in a single function, which can either concatenate with the next word or with the previous word, depending on the value of the parameter ‘forward’:

def generate_ngrams(text, ngram_words, forward=True):
    new_text = []
    index = 0
    if not forward:
        text = list(reversed(text))
    while index < len(text):
        [new_word, new_index] = concatenate_words(index, text, ngram_words, forward)
        new_text.append(new_word)
        index = new_index + 1
    if not forward:
        return list(reversed(new_text))
    return new_text

def concatenate_words(index, text, ngram_words, forward):
    words = text[index]
    if index == len(text) - 1:
        return words, index
    if words.split(' ')[0] in ngram_words:
        [new_word, new_index] = concatenate_words(index + 1, text, ngram_words, forward)
        if forward:
            words = words + ' ' + new_word
        else:
            words = new_word + ' ' + words
        index = new_index
    return words, index

Using this simple function to concatenate words in order to form n-grams will lead to features which strongly correlate with a specific (Negative/Positive) class, like ‘highly recommend’, ‘best book’ or even ‘couldn’t put it down’.
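A compact, standalone sketch of the same concatenation idea (a hypothetical iterative variant, not the recursive functions above; the word lists are made up):

```python
def merge_ngram_words(text, ngram_words):
    # Greedily merge an n-gram word (e.g. "not", "very") with the word(s) after it.
    new_text, index = [], 0
    while index < len(text):
        word = text[index]
        # Keep extending the n-gram while its last word is an n-gram word.
        while word.split(' ')[-1] in ngram_words and index < len(text) - 1:
            index += 1
            word = word + ' ' + text[index]
        new_text.append(word)
        index += 1
    return new_text

print(merge_ngram_words(["I", "do", "not", "recommend", "this", "book"], {"not"}))
# ['I', 'do', 'not recommend', 'this', 'book']
print(merge_ngram_words(["a", "very", "very", "good", "book"], {"very"}))
# ['a', 'very very good', 'book']
```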

Now that we have a better understanding of Text Classification terms like bag-of-words, features and n-grams, we can start using Classifiers for Sentiment Analysis. Think of Naive Bayes, Maximum Entropy and SVM.

In my previous post I have explained the theory behind three of the most popular Text Classification methods (Naive Bayes, Maximum Entropy and Support Vector Machines) and told you that I will use these Classifiers for the automatic classification of the subjectivity of Amazon.com book reviews.

The purpose is to get a better understanding of how these Classifiers work and perform under various conditions, i.e. do a comparative study about Sentiment Analytics.

In this blog-post we will use the bag-of-words model to do Sentiment Analysis. The bag-of-words model can perform quite well at Topic Classification, but is inaccurate when it comes to Sentiment Classification. Bo Pang and Lillian Lee report an accuracy of 69% in their 2002 research about Movie review sentiment analysis. With the three Classifiers this percentage goes up to about 80% (depending on the chosen features).

The reason to still build a bag-of-words model is that it gives us a better understanding of the content of the text, and we can use it to select the features for the three classifiers. The Naive Bayes model is also based on the bag-of-words model, so the bag-of-words model can be used as an intermediate step.

We can collect book reviews from Amazon.com by scraping them from the website with BeautifulSoup. The process for this was already explained in the context of Twitter.com and it should not be too difficult to do the same for Amazon.com.

In total 213.335 book reviews were collected for eight randomly chosen books:

After making a bar-plot of the distribution of the different star ratings for the chosen books, we can see that there is a strong variation. Books which are considered to be average have almost no 1-star ratings, while books far below average have a more uniform distribution of the different ratings.

We can see that the book ‘Gone Girl’ has a pretty uniform distribution so it seems like a good choice for our training set. Books like ‘Unbroken’ or ‘The Martian’ might not have enough 1-star reviews to train for the Negative class.

As the next step, we are going to divide the corpus of reviews into a training set and a test set. The book ‘Gone Girl’ has about 40.000 reviews, so we can use *up to* half of it for training purposes and the other half for testing the accuracy of our model. In order to also take into account the effects of the training set size on the accuracy of our model, we will vary the training set size from 1.000 up to 20.000.

The bag-of-words model is one of the simplest language models used in NLP. It makes a unigram model of the text by keeping track of the number of occurrences of each word. This can later be used as features for Text Classifiers. In this bag-of-words model you only take individual words into account and give each word a specific subjectivity score. This subjectivity score can be looked up in a sentiment lexicon [1]. If the total score is negative the text is classified as negative, and if it is positive the text is classified as positive. The model is simple to make, but it is less accurate because it does not take the word order or grammar into account.

A simple improvement over unigrams is to use unigrams + bigrams: that is, not to split a sentence after words like “not”, “no”, “very”, “just”, etc. This is easy to implement and can give a significant improvement in accuracy. The sentence “This book is not good” will be interpreted as a positive sentence unless such a construct is implemented. Another example: the sentences “This book is very good” and “This book is good” have the same score with a unigram model of the text, but not with a unigram + bigram model.

My pseudocode for creating a bag-of-words model is as follows:

- *list_BOW* = []
- For each review in the training set:
- Strip the newline character “\n” at the end of each review.
- Place a space before and after each of the following characters: .,()[]:;” (This prevents sentences like “I like this book.It is engaging” being interpreted as [“I”, “like”, “this”, “book.It”, “is”, “engaging”].)
- Tokenize the text by splitting it on spaces.
- Remove tokens which consist of only a space, an empty string or punctuation marks.
- Append the tokens to *list_BOW*.
- *list_BOW* now contains all words occurring in the training set.
- Place *list_BOW* in a Python Counter element. This Counter now contains all occurring words together with their frequencies. Its entries can be sorted with the most_common() method.
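The pseudocode above could be implemented roughly as follows (the sample review is made up):

```python
from collections import Counter
import string

def tokenize(review):
    # Strip the trailing newline and pad punctuation with spaces,
    # so that "book.It" becomes "book . It".
    review = review.rstrip("\n")
    for char in '.,()[]:;"':
        review = review.replace(char, " " + char + " ")
    tokens = review.lower().split(" ")
    # Drop empty strings and bare punctuation marks.
    return [t for t in tokens if t and t not in string.punctuation]

# Hypothetical one-review training set.
list_BOW = []
for review in ["I like this book.It is engaging\n"]:
    list_BOW += tokenize(review)

counter = Counter(list_BOW)
print(counter.most_common(3))
```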

The real question is how we should determine the sentiment/subjectivity score of each word in order to determine the total subjectivity score of a text. We could use one of the sentiment lexicons given in [1], but we don't really know under which circumstances and for which purposes these lexicons were created. Furthermore, in most of these lexicons the words are classified in a binary way (either positive or negative). Bing Liu’s sentiment lexicon, for example, contains a list of a few thousand positive and a few thousand negative words.

Bo Pang and Lillian Lee used words which were chosen by two students as positive and negative words.

It would be better to determine the subjectivity score of each word from some simple statistics of the training set itself. To do this we need to determine the class probability of each word present in the bag-of-words. This can be done by using a pandas DataFrame as a data container (but it can just as easily be done with dictionaries or other data structures). The code for this looks like:

```python
import pandas as pd

scores = [1, 2, 3, 4, 5]
BOW_df = pd.DataFrame(columns=scores)
words_set = set()
for review in training_set:
    score = review['score']
    text = review['review_text']
    splitted_text = split_text(text)
    for word in splitted_text:
        if word not in words_set:
            words_set.add(word)
            BOW_df.loc[word] = [0, 0, 0, 0, 0]
        BOW_df.loc[word, score] += 1
```

Here `split_text` is the method for splitting a text into a list of individual words:

```python
def expand_around_chars(text, characters):
    # Place a space before and after each punctuation character
    for char in characters:
        text = text.replace(char, " " + char + " ")
    return text

def split_text(text):
    text = strip_quotations_newline(text)
    text = expand_around_chars(text, '".,()[]{}:;')
    splitted_text = text.split(" ")
    cleaned_text = [x for x in splitted_text if len(x) > 1]
    text_lowercase = [x.lower() for x in cleaned_text]
    return text_lowercase
```

This gives us a DataFrame containing the number of occurrences of each word in each class:

```
      Unnamed: 0     1     2     3      4      5
0              i  4867  5092  9178  14180  17945
1        through   210   232   414    549    627
2            all   499   537   923   1355   1791
3      drawn-out     1     0     1      1      0
4              ,  4227  4779  8750  15069  18334
5       detailed     3     7    15     30     36
...          ...   ...   ...   ...    ...    ...
31800   a+++++++     0     0     0      0      1
31801  nailbiter     0     0     0      0      1
31802    melinda     0     0     0      0      1
31803 reccomend!     0     0     0      0      1
31804 suspense!!     0     0     0      0      1

[31804 rows x 6 columns]
```

As we can see, there are also quite a few words which occur only once. These words will have a class probability of 100% for the class they occur in.

This distribution, however, does not approximate the real class distribution of that word at all. It is therefore good to define some ‘occurrence cut-off value’: words which occur less often than this value are not taken into account.
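Such a cut-off is a one-liner on the DataFrame. A sketch with a few rows taken from the table above (the cut-off value of 5 is an arbitrary choice for illustration):

```python
import pandas as pd

# A few occurrence counts per star class (1-5) from the table above
BOW_df = pd.DataFrame(
    {1: [4867, 1, 0], 2: [5092, 0, 0], 3: [9178, 1, 0],
     4: [14180, 1, 0], 5: [17945, 0, 1]},
    index=["i", "drawn-out", "nailbiter"])

cutoff = 5  # hypothetical occurrence cut-off value
# Keep only words whose total occurrence count reaches the cut-off
frequent = BOW_df[BOW_df.sum(axis=1) >= cutoff]
```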

By dividing each element of each row by the sum of the elements of that row, we get a DataFrame containing the relative occurrences of each word in each class, i.e. a DataFrame with the class probabilities of each word. After this is done, the words with the highest probability in class 1 can be taken as negative words and the words with the highest probability in class 5 as positive words.
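In pandas this row-wise normalization can be sketched as follows (the counts are rows from the table above; the top-10 size is an arbitrary choice):

```python
import pandas as pd

# Occurrence counts per star class (1-5), rows from the table above
BOW_df = pd.DataFrame(
    {1: [210, 3, 1], 2: [232, 7, 0], 3: [414, 15, 1],
     4: [549, 30, 1], 5: [627, 36, 0]},
    index=["through", "detailed", "drawn-out"])

# Divide each row by its row sum -> per-word class probabilities
prob_df = BOW_df.div(BOW_df.sum(axis=1), axis=0)

# Words most probable in class 1 -> negative; in class 5 -> positive
negative_words = prob_df[1].nlargest(10).index
positive_words = prob_df[5].nlargest(10).index
```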

We can construct such a sentiment lexicon from the training set and use it to measure the subjectivity of reviews in the test set. As the training set grows, the sentiment lexicon becomes more accurate for prediction.

By labeling 4- and 5-star reviews as positive, 1- and 2-star reviews as negative and 3-star reviews as neutral, and using the following positive and negative words:

we can determine with the bag-of-words model whether a review is positive or negative with an accuracy of 60%.
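A rough sketch of this final classification step, assuming the positive and negative word sets derived above are available (the sets passed in here are placeholders):

```python
def classify_review(text, positive_words, negative_words):
    # Count lexicon hits: +1 per positive word, -1 per negative word
    tokens = text.lower().split()
    score = sum((t in positive_words) - (t in negative_words) for t in tokens)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"
```

The accuracy on the test set is then simply the fraction of reviews for which this label matches the star-based label.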

Some open questions and possible next steps:

- How accurate is a list of positive and negative words constructed from the reviews of book A in determining the subjectivity of the reviews of book B?
- How much more accurate does the bag-of-words model become if we take bigrams or even trigrams into account? There were words with a high negative or positive subjectivity in the word list which have no negative or positive meaning by themselves; this can only be understood by taking the preceding or following words into account.
- Make an overall sentiment lexicon from all the reviews of all the books.
- Use the bag-of-words as features for classifiers such as Naive Bayes, Maximum Entropy and Support Vector Machines.
