In a previous blog-post we have seen how we can use Signal Processing techniques for the classification of time-series and signals.

A very short summary of that post is: we can use the Fourier Transform to transform a signal from the time-domain to the frequency-domain. The peaks in the frequency spectrum indicate the most prevalent frequencies in the signal: the larger and sharper a peak is, the more prevalent that frequency is. The location (frequency value) and height (amplitude) of the peaks in the frequency spectrum can then be used as input for classifiers like Random Forest or Gradient Boosting.

This simple approach works surprisingly well for many classification problems. In that blog post we were able to classify the Human Activity Recognition dataset with ~91% accuracy.

The general rule is that this approach of using the Fourier Transform will work very well when the frequency spectrum is stationary. That is, the frequencies present in the signal are not time-dependent: if a signal contains a frequency x, this frequency should be present equally anywhere in the signal.

The more non-stationary/dynamic a signal is, the worse the results will be. That’s too bad, since most of the signals we see in real life are non-stationary in nature. Whether we are talking about ECG signals, the stock market, or equipment and sensor data, real-life problems start to get interesting when we are dealing with dynamic systems. A much better approach for analyzing dynamic signals is to use the Wavelet Transform instead of the Fourier Transform.

Even though the Wavelet Transform is a very powerful tool for the analysis and classification of time-series and signals, it is unfortunately not well known or popular within the field of Data Science. This is partly because you need some prior knowledge (of signal processing, the Fourier Transform and mathematics) before you can understand the mathematics behind the Wavelet Transform. However, I believe it is also due to the fact that most books, articles and papers are far too theoretical and don’t provide enough practical information on how the Wavelet Transform should and can be used.

In this blog-post we will see the theory behind the Wavelet Transform (without going too much into the mathematics) and also see how it can be used in practical applications. **By providing Python code at every step of the way you should be able to use the Wavelet Transform in your own applications by the end of this post.**

The contents of this blogpost are as follows:

- Introduction
- Theory
- 2.1 From Fourier Transform to Wavelet Transform
- 2.2 How does the Wavelet Transform work?
- 2.3 The different types of Wavelet families
- 2.4 Continuous Wavelet Transform vs Discrete Wavelet Transform
- 2.5 More on the Discrete Wavelet Transform: The DWT as a filter-bank

- Practical Applications
- 3.1 Visualizing the State-Space using the Continuous Wavelet Transform
- 3.2 Using the Continuous Wavelet Transform and a Convolutional Neural Network to classify signals
- 3.2.1 Loading the UCI-HAR time-series dataset
- 3.2.2 Applying the CWT on the dataset and transforming the data to the right format
- 3.2.3 Training the Convolutional Neural Network with the CWT

- 3.3 Deconstructing a signal using the DWT
- 3.4 Removing (high-frequency) noise using the DWT
- 3.5 Using the Discrete Wavelet Transform to classify signals
- 3.5.1 The idea behind Discrete Wavelet classification
- 3.5.2 Generating features per sub-band
- 3.5.3 Using the features and scikit-learn classifiers to classify two datasets

- 3.6 Comparison of the classification accuracies between DWT, Fourier Transform and Recurrent Neural Networks

- Final Words

**PS:** In this blog-post we will mostly use the Python package PyWavelets, so go ahead and install it with `pip install pywavelets`.


In the previous blog-post we have seen how the Fourier Transform works. That is, by multiplying a signal with a series of sine-waves with different frequencies we are able to determine which frequencies are present in a signal. If the dot-product between our signal and a sine wave of a certain frequency results in a large amplitude this means that there is a lot of overlap between the two signals, and our signal contains this specific frequency. This is of course because the dot product is a measure of how much two vectors / signals overlap.
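
This dot-product intuition is easy to verify numerically. In the minimal sketch below (plain NumPy, variable names are my own), a pure 5 Hz signal has a large dot product with a sine of the same frequency and a negligible one with a sine of a different frequency:

```python
import numpy as np

# one second of a pure 5 Hz sine, sampled 1000 times
t = np.linspace(0, 1, 1000, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t)

# dot product with a matching and a non-matching frequency
overlap_match = np.dot(signal, np.sin(2 * np.pi * 5 * t))
overlap_other = np.dot(signal, np.sin(2 * np.pi * 7 * t))

print(overlap_match)  # large: ~500 (= N/2)
print(overlap_other)  # ~0: orthogonal over a whole number of periods
```

This is essentially what the Fourier Transform does, for a whole range of frequencies at once.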

The thing about the Fourier Transform is that it has a high resolution in the frequency-domain but zero resolution in the time-domain. This means that it can tell us exactly which frequencies are present in a signal, but not at which location in time these frequencies have occurred. This can easily be demonstrated as follows:

```python
import numpy as np
import matplotlib.pyplot as plt

t_n = 1
N = 100000
T = t_n / N
f_s = 1 / T

xa = np.linspace(0, t_n, num=N)
xb = np.linspace(0, t_n/4, num=N//4)

frequencies = [4, 30, 60, 90]
y1a, y1b = np.sin(2*np.pi*frequencies[0]*xa), np.sin(2*np.pi*frequencies[0]*xb)
y2a, y2b = np.sin(2*np.pi*frequencies[1]*xa), np.sin(2*np.pi*frequencies[1]*xb)
y3a, y3b = np.sin(2*np.pi*frequencies[2]*xa), np.sin(2*np.pi*frequencies[2]*xb)
y4a, y4b = np.sin(2*np.pi*frequencies[3]*xa), np.sin(2*np.pi*frequencies[3]*xb)

composite_signal1 = y1a + y2a + y3a + y4a
composite_signal2 = np.concatenate([y1b, y2b, y3b, y4b])

# get_fft_values() is defined in section 3.1 below
f_values1, fft_values1 = get_fft_values(composite_signal1, T, N, f_s)
f_values2, fft_values2 = get_fft_values(composite_signal2, T, N, f_s)

fig, axarr = plt.subplots(nrows=2, ncols=2, figsize=(12,8))
axarr[0,0].plot(xa, composite_signal1)
axarr[1,0].plot(xa, composite_signal2)
axarr[0,1].plot(f_values1, fft_values1)
axarr[1,1].plot(f_values2, fft_values2)
(...)
plt.tight_layout()
plt.show()
```

In Figure 1 we can see at the top left a signal containing four different frequencies (4, 30, 60 and 90 Hz) which are present at all times, and on the right its frequency spectrum. In the bottom figure we can see the same four frequencies, only now the first one is present in the first quarter of the signal, the second one in the second quarter, etc. In addition, on the right side we again see its frequency spectrum.

What is important to note here is that the two frequency spectra contain exactly the same four peaks, so the Fourier Transform cannot tell us *where* in the signal these frequencies are present: it cannot distinguish between the first and the second signal.

**PS:** The side lobes we see in the bottom frequency spectrum are due to the discontinuities between the four different frequencies.

In trying to overcome this problem, scientists have come up with the Short-Time Fourier Transform. In this approach the original signal is split into several parts of equal length (which may or may not have an overlap) by using a sliding window before applying the Fourier Transform. The idea is quite simple: if we split our signal into 10 parts, and the Fourier Transform detects a specific frequency in the second part, then we know for sure that this frequency has occurred between the 1/10th and 2/10th point of our original signal.

The main problem with this approach is that you run into the theoretical limits of the Fourier Transform, known as the uncertainty principle. The smaller we make the window, the more we will know about where a frequency has occurred in the signal, but the less about the frequency value itself. The larger we make the window, the more we will know about the frequency value and the less about the time.
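
We can see this trade-off directly in the output shapes of an STFT implementation. The sketch below uses `scipy.signal.stft` (assuming SciPy is available): a longer window yields more frequency bins but fewer time steps, and vice versa.

```python
import numpy as np
from scipy.signal import stft

fs = 1000
x = np.random.randn(4096)  # any signal will do for inspecting the output shapes

f_short, t_short, _ = stft(x, fs=fs, nperseg=64)   # short window
f_long,  t_long,  _ = stft(x, fs=fs, nperseg=512)  # long window

# nperseg//2 + 1 frequency bins; the shorter window gives more time steps
print(len(f_short), len(t_short))
print(len(f_long), len(t_long))
```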

A better approach for analyzing signals with a dynamical frequency spectrum is the Wavelet Transform. The Wavelet Transform has a high resolution in both the frequency- and the time-domain. It does not only tell us which frequencies are present in a signal, but also at which time these frequencies have occurred. This is accomplished by working with different scales. First we look at the signal with a large scale/window and analyze ‘large’ features and then we look at the signal with smaller scales in order to analyze smaller features.

The time- and frequency resolutions of the different methods are illustrated in Figure 2.

In Figure 2 we can see the time and frequency resolutions of the different transformations. The size and orientation of the blocks indicate how small the features are that we can distinguish in the time and frequency domain. The original time-series has a high resolution in the time-domain and zero resolution in the frequency domain. This means that we can distinguish very small features in the time-domain and no features in the frequency domain.

Opposite to that is the Fourier Transform, which has a high resolution in the frequency domain and zero resolution in the time-domain.

The Short Time Fourier Transform has medium sized resolution in both the frequency and time domain.

The Wavelet Transform has:

- for small frequency values, a high resolution in the frequency domain and a low resolution in the time-domain,
- for large frequency values, a low resolution in the frequency domain and a high resolution in the time-domain.

In other words, the Wavelet Transform makes a trade-off: at scales at which time-dependent features are interesting it has a high resolution in the time-domain, and at scales at which frequency-dependent features are interesting it has a high resolution in the frequency domain.

And as you can imagine, this is exactly the kind of trade-off we are looking for!

The Fourier Transform uses a series of sine-waves with different frequencies to analyze a signal. That is, a signal is represented through a linear combination of sine-waves.

The Wavelet Transform uses a series of functions called wavelets, each with a different scale. The word wavelet means a small wave, and this is exactly what a wavelet is.

In Figure 3 we can see the difference between a sine-wave and a wavelet. The main difference is that the sine-wave is not localized in time (it stretches out from -infinity to +infinity) while a wavelet **is** localized in time. This allows the wavelet transform to obtain time-information in addition to frequency information.

Since the Wavelet is localized in time, we can multiply our signal with the wavelet at different locations in time. We start with the beginning of our signal and slowly move the wavelet towards the end of the signal. This procedure is also known as a convolution. After we have done this for the original (mother) wavelet, we can scale it such that it becomes larger and repeat the process. This process is illustrated in the figure below.

As we can see in the figure above, the Wavelet transform of a 1-dimensional signal will have two dimensions. This 2-dimensional output of the Wavelet transform is the time-scale representation of the signal, in the form of a scaleogram.
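
A minimal check with PyWavelets makes this concrete: the CWT of a 1-dimensional signal is a 2-dimensional array, with one row per scale and one column per time step.

```python
import numpy as np
import pywt

signal = np.sin(2 * np.pi * 10 * np.linspace(0, 1, 1024))
scales = np.arange(1, 65)  # 64 scales

coefficients, frequencies = pywt.cwt(signal, scales, 'morl')
print(coefficients.shape)  # (64, 1024): this 2D array is the scaleogram
```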

Above the scaleogram is plotted in a 3D plot in the bottom left figure and in a 2D color plot in the bottom right figure.

**PS:** You can also have a look at this youtube video to see how a Wavelet Transform works.

So what is this dimension called scale? Since the term frequency is reserved for the Fourier Transform, the wavelet transform is usually expressed in scales instead. That is why the two dimensions of a scaleogram are time and scale. For the ones who find frequencies more intuitive than scales, it is possible to convert scales to pseudo-frequencies with the equation

f_pseudo = f_c / s

where f_pseudo is the pseudo-frequency, f_c is the central frequency of the Mother wavelet and s is the scaling factor.

We can see that a higher scale factor (longer wavelet) corresponds to a smaller frequency, so by scaling the wavelet in the time-domain we analyze smaller frequencies (achieve a higher resolution) in the frequency domain. And vice versa: by using a smaller scale we have more detail in the time-domain. So scales are basically the inverse of frequency.

**PS**: PyWavelets contains the function `pywt.scale2frequency` to convert from the scale domain to the frequency domain.
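
As a quick sanity check of this inverse relationship: `pywt.scale2frequency` returns the normalized pseudo-frequency (divide by the sampling period to get a value in Hz), and doubling the scale halves it.

```python
import pywt

f1 = pywt.scale2frequency('morl', 1)
f2 = pywt.scale2frequency('morl', 2)

# doubling the scale halves the pseudo-frequency
print(f1, f2)
```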

Another difference between the Fourier Transform and the Wavelet Transform is that there are many different families (types) of wavelets. The wavelet families differ from each other in the trade-off between how compact and how smooth the wavelet is. This means that we can choose the wavelet family which best fits the features we are looking for in our signal.

The PyWavelets library for example contains 14 mother Wavelets (families of Wavelets):

```python
import pywt
print(pywt.families(short=False))
# ['Haar', 'Daubechies', 'Symlets', 'Coiflets', 'Biorthogonal', 'Reverse biorthogonal',
#  'Discrete Meyer (FIR Approximation)', 'Gaussian', 'Mexican hat wavelet', 'Morlet wavelet',
#  'Complex Gaussian wavelets', 'Shannon wavelets', 'Frequency B-Spline wavelets',
#  'Complex Morlet wavelets']
```

Each type of wavelet has a different shape, smoothness and compactness, and is useful for a different purpose. Since there are only two mathematical conditions a wavelet has to satisfy, it is easy to generate a new type of wavelet.

The two mathematical conditions are the so-called normalization and admissibility constraints:

A wavelet must have 1) finite energy and 2) zero mean.

Finite energy means that it is localized in time and frequency; it is integrable and the inner product between the wavelet and the signal always exists.

The admissibility condition implies that a wavelet has zero mean in the time-domain, i.e. a zero at zero frequency in the frequency-domain. This is necessary to ensure that it is integrable and that the inverse of the wavelet transform can also be calculated.

Furthermore:

- A wavelet can be orthogonal or non-orthogonal.
- A wavelet can be bi-orthogonal or not.
- A wavelet can be symmetric or not.
- A wavelet can be complex or real. If it is complex, it is usually divided into a real part representing the amplitude and an imaginary part representing the phase.
- A wavelet is normalized to have unit energy.
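
The zero-mean and unit-energy conditions are easy to verify numerically for a given wavelet. The sketch below checks them for the Mexican hat wavelet, approximating the integrals with a Riemann sum over the wavelet's support:

```python
import numpy as np
import pywt

wavelet = pywt.ContinuousWavelet('mexh')
psi, x = wavelet.wavefun(10)  # wavelet evaluated on a fine grid

dx = x[1] - x[0]
mean = np.sum(psi) * dx        # admissibility: should be ~0
energy = np.sum(psi**2) * dx   # normalization: should be ~1

print(mean, energy)
```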

Below we can see a plot with several different families of wavelets.

The first row contains four Discrete Wavelets and the second row four Continuous Wavelets.

```python
import pywt
import matplotlib.pyplot as plt

discrete_wavelets = ['db5', 'sym5', 'coif5', 'bior2.4']
continuous_wavelets = ['mexh', 'morl', 'cgau5', 'gaus5']

list_list_wavelets = [discrete_wavelets, continuous_wavelets]
list_funcs = [pywt.Wavelet, pywt.ContinuousWavelet]

fig, axarr = plt.subplots(nrows=2, ncols=4, figsize=(16,8))
for ii, list_wavelets in enumerate(list_list_wavelets):
    func = list_funcs[ii]
    row_no = ii
    for col_no, waveletname in enumerate(list_wavelets):
        wavelet = func(waveletname)
        family_name = wavelet.family_name
        biorthogonal = wavelet.biorthogonal
        orthogonal = wavelet.orthogonal
        symmetry = wavelet.symmetry
        if ii == 0:
            _ = wavelet.wavefun()
            wavelet_function = _[0]
            x_values = _[-1]
        else:
            wavelet_function, x_values = wavelet.wavefun()
        if col_no == 0 and ii == 0:
            axarr[row_no, col_no].set_ylabel("Discrete Wavelets", fontsize=16)
        if col_no == 0 and ii == 1:
            axarr[row_no, col_no].set_ylabel("Continuous Wavelets", fontsize=16)
        axarr[row_no, col_no].set_title("{}".format(family_name), fontsize=16)
        axarr[row_no, col_no].plot(x_values, wavelet_function)
        axarr[row_no, col_no].set_yticks([])
        axarr[row_no, col_no].set_yticklabels([])
plt.tight_layout()
plt.show()
```

**PS:** To see what all of the wavelets look like, you can have a look at the wavelet browser.

Within each wavelet family there can be many wavelet subcategories. You can distinguish the different subcategories of wavelets by the number of coefficients (the number of vanishing moments) and the level of decomposition.

This is illustrated below for one family of wavelets, called ‘Daubechies’.

```python
import pywt
import matplotlib.pyplot as plt

db_wavelets = pywt.wavelist('db')[:5]
print(db_wavelets)
# ['db1', 'db2', 'db3', 'db4', 'db5']

fig, axarr = plt.subplots(ncols=5, nrows=5, figsize=(20,16))
fig.suptitle('Daubechies family of wavelets', fontsize=16)
for col_no, waveletname in enumerate(db_wavelets):
    wavelet = pywt.Wavelet(waveletname)
    no_moments = wavelet.vanishing_moments_psi
    family_name = wavelet.family_name
    for row_no, level in enumerate(range(1, 6)):
        wavelet_function, scaling_function, x_values = wavelet.wavefun(level=level)
        axarr[row_no, col_no].set_title("{} - level {}\n{} vanishing moments\n{} samples".format(
            waveletname, level, no_moments, len(x_values)), loc='left')
        axarr[row_no, col_no].plot(x_values, wavelet_function, 'bD--')
        axarr[row_no, col_no].set_yticks([])
        axarr[row_no, col_no].set_yticklabels([])
plt.tight_layout()
plt.subplots_adjust(top=0.9)
plt.show()
```

In Figure 6 we can see wavelets of the ‘Daubechies’ family (db) of wavelets. In the first column we can see the Daubechies wavelets of the first order (db1), in the second column of the second order (db2), up to the fifth order in the fifth column. PyWavelets contains Daubechies wavelets up to order 20 (db20).

The order indicates the number of vanishing moments: db3 has three vanishing moments and db5 has five. The number of vanishing moments is related to the approximation order and smoothness of the wavelet. If a wavelet has p vanishing moments, it can approximate polynomials of degree p - 1.
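
We can see the effect of the vanishing moments in a small sketch: since db3 has three vanishing moments, its high-pass (detail) filter annihilates quadratic signals. Away from the borders, the detail coefficients of a quadratic are therefore zero up to machine precision:

```python
import numpy as np
import pywt

x = np.arange(128, dtype=float)
quadratic = 0.5 * x**2 - 3 * x + 7  # a degree-2 polynomial signal

cA, cD = pywt.dwt(quadratic, 'db3')
interior = cD[4:-4]  # drop coefficients affected by the border extension

print(np.max(np.abs(interior)))  # ~0: db3 'does not see' polynomials of degree <= 2
```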

When selecting a wavelet, we can also indicate what the level of decomposition has to be. By default, PyWavelets chooses the maximum level of decomposition possible for the input signal. The maximum level of decomposition (see `pywt.dwt_max_level()`) depends on the length of the input signal and the wavelet (more on this later).
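
For example, the maximum level follows from the signal length and the filter length of the chosen wavelet:

```python
import pywt

# the maximum level depends on both the signal length and the wavelet's filter length
print(pywt.dwt_max_level(1024, 'db5'))   # db5 has a length-10 filter
print(pywt.dwt_max_level(128, 'haar'))   # 7, since 2**7 = 128
```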

As we can see, as the number of vanishing moments increases, the polynomial degree of the wavelet increases and it becomes smoother. And as the level of decomposition increases, the number of samples this wavelet is expressed in increases.

As we have seen before (Figure 5), the Wavelet Transform comes in two distinct flavors: the Continuous and the Discrete Wavelet Transform.

Mathematically, a Continuous Wavelet Transform is described by the following equation:

X(a, b) = 1/√a ∫ x(t) ψ*((t - b) / a) dt

where ψ(t) is the continuous mother wavelet, which gets scaled by a factor a and translated by a factor b. The values of the scaling and translation factors are continuous, which means that there can be an infinite number of wavelets. You can scale the mother wavelet with a factor of 1.3, or 1.31, and 1.311, and 1.3111, etc.

When we are talking about the Discrete Wavelet Transform, the main difference is that the DWT uses discrete values for the scale and translation factor. The scale factor increases in powers of two, so a = 1, 2, 4, 8, …, and the translation factor increases in integer values (b = 1, 2, 3, 4, …).

**PS:** The DWT is only discrete in the scale and translation domain, not in the time-domain. To be able to work with digital and discrete signals we also need to discretize our wavelet transforms in the time-domain. These forms of the wavelet transform are called the Discrete-Time Wavelet Transform and the Discrete-Time Continuous Wavelet Transform.

In practice, the DWT is always implemented as a filter-bank. This means that it is implemented as a cascade of high-pass and low-pass filters. This is because filter banks are a very efficient way of splitting a signal into several frequency sub-bands.

Below I will try to explain the concept behind the filter-bank in a simple (and probably oversimplified) way. It is necessary in order to understand how the wavelet transform actually works and can be used in practical applications.

To apply the DWT on a signal, we start with the smallest scale. As we have seen before, small scales correspond with high frequencies. This means that we first analyze high frequency behavior. At the second stage, the scale increases with a factor of two (the frequency decreases with a factor of two), and we are analyzing behavior around half of the maximum frequency. At the third stage, the scale factor is four and we are analyzing frequency behavior around a quarter of the maximum frequency. And this goes on and on, until we have reached the maximum decomposition level.

What do we mean by maximum decomposition level? To understand this, we should also know that at each subsequent stage the number of samples in the signal is reduced by a factor of two. At lower frequency values you need fewer samples to satisfy the Nyquist rate, so there is no need to keep the higher number of samples in the signal; it would only make the transform computationally expensive. Due to this downsampling, at some stage in the process the number of samples in our signal will become smaller than the length of the wavelet filter, and then we have reached the maximum decomposition level.

To give an example, suppose we have a signal with frequencies up to 1000 Hz. In the first stage we split our signal into a low-frequency part and a high-frequency part, i.e. 0-500 Hz and 500-1000 Hz.

At the second stage we take the low-frequency part and again split it into two parts: 0-250 Hz and 250-500 Hz.

At the third stage we split the 0-250 Hz part into a 0-125 Hz part and a 125-250 Hz part.

This goes on until we have reached the level of refinement we need or until we run out of samples.
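
The successive halving of the frequency band can be sketched with a few lines of arithmetic; for a signal with frequencies up to 1000 Hz this reproduces the sub-bands listed above:

```python
# sub-band boundaries of the DWT filter bank for a 0-1000 Hz signal
f_max = 1000.0
bands = []
for level in range(1, 6):
    low = f_max / 2**level            # lower edge of this level's detail band
    high = f_max / 2**(level - 1)     # upper edge of this level's detail band
    bands.append((low, high))
    print("Level {}: approximation 0-{:g} Hz, detail {:g}-{:g} Hz".format(level, low, low, high))
```

This prints 500-1000 Hz for the detail coefficients of the first level, 250-500 Hz for the second, and so on.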

We can easily visualize this idea by plotting what happens when we apply the DWT to a chirp signal. A chirp signal is a signal with a dynamic frequency spectrum: its frequency increases with time. The start of the signal contains low frequencies and the end of the signal contains high frequencies. This makes it easy to visualize which part of the frequency spectrum is filtered out, simply by looking at the time-axis.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

x = np.linspace(0, 1, num=2048)
chirp_signal = np.sin(250 * np.pi * x**2)

fig, ax = plt.subplots(figsize=(6,1))
ax.set_title("Original Chirp Signal: ")
ax.plot(chirp_signal)
plt.show()

data = chirp_signal
waveletname = 'sym5'

fig, axarr = plt.subplots(nrows=5, ncols=2, figsize=(6,6))
for ii in range(5):
    (data, coeff_d) = pywt.dwt(data, waveletname)
    axarr[ii, 0].plot(data, 'r')
    axarr[ii, 1].plot(coeff_d, 'g')
    axarr[ii, 0].set_ylabel("Level {}".format(ii + 1), fontsize=14, rotation=90)
    axarr[ii, 0].set_yticklabels([])
    if ii == 0:
        axarr[ii, 0].set_title("Approximation coefficients", fontsize=14)
        axarr[ii, 1].set_title("Detail coefficients", fontsize=14)
    axarr[ii, 1].set_yticklabels([])
plt.tight_layout()
plt.show()
```

In Figure 7 we can see our chirp signal, and the DWT applied to it subsequently. There are a few things to notice here:

- In PyWavelets the DWT is applied with `pywt.dwt()`.
- The DWT returns two sets of coefficients: the **approximation** coefficients and the **detail** coefficients.
- The **approximation coefficients** represent the output of the low-pass filter (averaging filter) of the DWT.
- The **detail coefficients** represent the output of the high-pass filter (difference filter) of the DWT.
- By applying the DWT again on the approximation coefficients of the previous DWT, we get the wavelet transform of the next level.
- At each next level, the original signal is also downsampled by a factor of 2.
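
The splitting is lossless: the approximation and detail coefficients of one DWT step each contain half as many samples, and together they allow exact reconstruction of the input with `pywt.idwt()`:

```python
import numpy as np
import pywt

signal = np.random.randn(64)

cA, cD = pywt.dwt(signal, 'haar')
print(len(cA), len(cD))  # 32 32: each half the original length

reconstructed = pywt.idwt(cA, cD, 'haar')
print(np.allclose(signal, reconstructed))  # True: no information is lost
```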

So now we have seen what it means that the DWT is implemented as a filter bank: at each subsequent level, the approximation coefficients are split into a coarser low-pass and high-pass part, and the DWT is applied again on the low-pass part.

As we can see, our original signal is now converted into several signals, each corresponding to a different frequency band. Later on we will see how the approximation and detail coefficients of the different frequency sub-bands can be used in applications like removing high-frequency noise from signals, compressing signals, or classifying different types of signals.

**PS:** We can also use `pywt.wavedec()` to immediately calculate the coefficients of a higher level. This function takes as input the original signal and the level n, and returns one set of approximation coefficients (of the n-th level) and n sets of detail coefficients (from the 1st up to the n-th level).
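
For example, a level-3 decomposition of a 512-sample signal returns four arrays, ordered from coarsest to finest:

```python
import numpy as np
import pywt

signal = np.random.randn(512)

coeffs = pywt.wavedec(signal, 'db1', level=3)  # [cA3, cD3, cD2, cD1]
print(len(coeffs))               # 4
print([len(c) for c in coeffs])  # [64, 64, 128, 256]
```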

**PS2:** This idea of analyzing the signal on different scales is also known as multiresolution / multiscale analysis, and decomposing your signal in such a way is also known as multiresolution decomposition, or sub-band coding.

So far we have seen what the wavelet transform is, how it is different from the Fourier Transform, what the difference is between the CWT and the DWT, what types of wavelet families there are, what the impact of the order and level of decomposition is on the mother wavelet, and how and why the DWT is implemented as a filter-bank.

We have also seen that the output of a wavelet transform on a 1D signal results in a 2D scaleogram. Such a scaleogram gives us detailed information about the state-space of the system, i.e. it gives us information about the dynamic behavior of the system.

The el-Nino dataset is a time-series dataset used for tracking the El Nino and contains quarterly measurements of the sea surface temperature from 1871 up to 1997. In order to understand the power of a scaleogram, let us visualize it for el-Nino dataset together with the original time-series data and its Fourier Transform.

```python
import numpy as np
import pandas as pd
import pywt
import matplotlib.pyplot as plt
from scipy.fftpack import fft

def get_ave_values(xvalues, yvalues, n=5):
    # helper from the previous post; a simple block-average implementation (assumption)
    num_blocks = len(xvalues) // n
    x_ave = np.array(xvalues[:num_blocks * n]).reshape(-1, n).mean(axis=1)
    y_ave = np.array(yvalues[:num_blocks * n]).reshape(-1, n).mean(axis=1)
    return x_ave, y_ave

def plot_wavelet(time, signal, scales, waveletname='cmor', cmap=plt.cm.seismic,
                 title='Wavelet Transform (Power Spectrum) of signal',
                 ylabel='Period (years)', xlabel='Time'):
    dt = time[1] - time[0]
    [coefficients, frequencies] = pywt.cwt(signal, scales, waveletname, dt)
    power = (abs(coefficients)) ** 2
    period = 1. / frequencies
    levels = [0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8]
    contourlevels = np.log2(levels)
    fig, ax = plt.subplots(figsize=(15, 10))
    im = ax.contourf(time, np.log2(period), np.log2(power), contourlevels, extend='both', cmap=cmap)
    ax.set_title(title, fontsize=20)
    ax.set_ylabel(ylabel, fontsize=18)
    ax.set_xlabel(xlabel, fontsize=18)
    yticks = 2**np.arange(np.ceil(np.log2(period.min())), np.ceil(np.log2(period.max())))
    ax.set_yticks(np.log2(yticks))
    ax.set_yticklabels(yticks)
    ax.invert_yaxis()
    ylim = ax.get_ylim()
    ax.set_ylim(ylim[0], -1)
    cbar_ax = fig.add_axes([0.95, 0.5, 0.03, 0.25])
    fig.colorbar(im, cax=cbar_ax, orientation="vertical")
    plt.show()

def plot_signal_plus_average(time, signal, average_over=5):
    fig, ax = plt.subplots(figsize=(15, 3))
    time_ave, signal_ave = get_ave_values(time, signal, average_over)
    ax.plot(time, signal, label='signal')
    ax.plot(time_ave, signal_ave, label='time average (n={})'.format(average_over))
    ax.set_xlim([time[0], time[-1]])
    ax.set_ylabel('Signal Amplitude', fontsize=18)
    ax.set_title('Signal + Time Average', fontsize=18)
    ax.set_xlabel('Time', fontsize=18)
    ax.legend()
    plt.show()

def get_fft_values(y_values, T, N, f_s):
    f_values = np.linspace(0.0, 1.0/(2.0*T), N//2)
    fft_values_ = fft(y_values)
    fft_values = 2.0/N * np.abs(fft_values_[0:N//2])
    return f_values, fft_values

def plot_fft_plus_power(time, signal):
    dt = time[1] - time[0]
    N = len(signal)
    fs = 1/dt
    fig, ax = plt.subplots(figsize=(15, 3))
    variance = np.std(signal)**2
    f_values, fft_values = get_fft_values(signal, dt, N, fs)
    fft_power = variance * abs(fft_values) ** 2  # FFT power spectrum
    ax.plot(f_values, fft_values, 'r-', label='Fourier Transform')
    ax.plot(f_values, fft_power, 'k--', linewidth=1, label='FFT Power Spectrum')
    ax.set_xlabel('Frequency [Hz / year]', fontsize=18)
    ax.set_ylabel('Amplitude', fontsize=18)
    ax.legend()
    plt.show()

dataset = "http://paos.colorado.edu/research/wavelets/wave_idl/sst_nino3.dat"
df_nino = pd.read_table(dataset)
N = df_nino.shape[0]
t0 = 1871
dt = 0.25
time = np.arange(0, N) * dt + t0
signal = df_nino.values.squeeze()

scales = np.arange(1, 128)
plot_signal_plus_average(time, signal)
plot_fft_plus_power(time, signal)
plot_wavelet(time, signal, scales)
```

In Figure 8 we can see in the top figure the el-Nino dataset together with its time average, in the middle figure the Fourier Transform and at the bottom figure the scaleogram produced by the Continuous Wavelet Transform.

In the scaleogram we can see that most of the power is concentrated in periods of 2 to 8 years. If we convert this to frequency (f = 1 / T), this corresponds with a frequency of 0.125 – 0.5 cycles per year. The increase in power can also be seen in the Fourier transform around these frequency values. The main difference is that the wavelet transform also gives us temporal information, while the Fourier Transform does not. For example, in the scaleogram we can see that up to 1920 there were many fluctuations, while there were not so many between 1960 and 1990. We can also see that there is a shift from shorter to longer periods as time progresses. This is the kind of dynamic behavior in the signal which can be visualized with the Wavelet Transform but not with the Fourier Transform.

This should already make clear how powerful the wavelet transform can be for machine learning purposes. But to make the story complete, let us also look at how this can be used in combination with a Convolutional Neural Network to classify signals.

In section 3.1 we have seen that the wavelet transform of a 1D signal results in a 2D scaleogram which contains a lot more information than just the time-series or just the Fourier Transform. We have seen that applied on the el-Nino dataset, it can not only tell us what the period is of the largest oscillations, but also when these oscillations were present and when not.

Such a scaleogram can not only be used to better understand the dynamical behavior of a system, but it can also be used to distinguish different types of signals produced by a system from each other.

If you record a signal while you are walking up or down the stairs, the scaleograms will look different. ECG measurements of people with a healthy heart will have different scaleograms than those of people with arrhythmia. The same holds for measurements on a bearing, motor, rotor, ventilator, etc. when it is faulty versus when it is not. The possibilities are limitless!

So by looking at the scaleograms we can distinguish a broken motor from a working one, a healthy person from a sick one, a person walking up the stairs from a person walking down the stairs, etc. But if you are as lazy as me, you probably don’t want to sift through thousands of scaleograms manually. One way to automate this process is to build a Convolutional Neural Network which can automatically detect the class each scaleogram belongs to and classify them accordingly.

What was the deal again with CNNs? In previous blog posts we have seen how we can use TensorFlow to build a convolutional neural network from scratch, and how we can use such a CNN to detect roads in satellite images. If you are not familiar with CNNs, it is a good idea to have a look at those previous blog posts, since the rest of this section assumes you have some knowledge of them.

In the next few sections we will load a dataset (containing measurements of people doing six different activities), visualize the scaleograms using the CWT and then use a Convolutional Neural Network to classify these scaleograms.

Let us try to classify an open dataset containing time-series using the scaleograms and a CNN. The Human Activity Recognition Dataset (UCI-HAR) contains sensor measurements of people performing different types of activities, like walking up or down the stairs, lying down, standing, walking, etc. There are in total more than 10,000 signals, where each signal consists of nine components (x, y and z acceleration, x, y and z gyroscope, etc.). This is the perfect dataset for us to try our use case of CWT + CNN!

**PS:** For more on what the UCI HAR dataset looks like, you can also have a look at the previous blog post, in which it was described in more detail.

After we have downloaded the data, we can load it into a numpy nd-array in the following way:

```python
import os
import numpy as np

def read_signals_ucihar(filename):
    with open(filename, 'r') as fp:
        data = fp.read().splitlines()
        data = map(lambda x: x.rstrip().lstrip().split(), data)
        data = [list(map(float, line)) for line in data]
    return data

def read_labels_ucihar(filename):
    with open(filename, 'r') as fp:
        activities = fp.read().splitlines()
        activities = list(map(int, activities))
    return activities

def load_ucihar_data(folder):
    train_folder = folder + 'train/InertialSignals/'
    test_folder = folder + 'test/InertialSignals/'
    labelfile_train = folder + 'train/y_train.txt'
    labelfile_test = folder + 'test/y_test.txt'
    train_signals, test_signals = [], []
    for input_file in os.listdir(train_folder):
        signal = read_signals_ucihar(train_folder + input_file)
        train_signals.append(signal)
    train_signals = np.transpose(np.array(train_signals), (1, 2, 0))
    for input_file in os.listdir(test_folder):
        signal = read_signals_ucihar(test_folder + input_file)
        test_signals.append(signal)
    test_signals = np.transpose(np.array(test_signals), (1, 2, 0))
    train_labels = read_labels_ucihar(labelfile_train)
    test_labels = read_labels_ucihar(labelfile_test)
    return train_signals, train_labels, test_signals, test_labels

folder_ucihar = './data/UCI_HAR/'
train_signals_ucihar, train_labels_ucihar, test_signals_ucihar, test_labels_ucihar = load_ucihar_data(folder_ucihar)
```

The training set contains 7352 signals, where each signal has 128 measurement samples and 9 components. The signals from the training set are loaded into a numpy ndarray of size (7352, 128, 9) and the signals from the test set into one of size (2947, 128, 9).

Since the signal consists of nine components, we have to apply the CWT nine times for each signal. Below we can see the result of the CWT applied on two different signals from the dataset. The left one is a signal measured while walking up the stairs and the right one is a signal measured while laying down.

Figure 9. The CWT applied on two signals belonging to the UCI HAR dataset. Each signal has nine different components. On the left we can see a signal measured during walking upstairs, and on the right a signal measured during laying.

Since each signal has nine components, each signal will also have nine scaleograms. So the next question to ask is, how do we feed this set of nine scaleograms into a Convolutional Neural Network? There are several options we could follow:

- Train a CNN for each component separately and combine the results of the nine CNN's in some sort of ensembling method. I suspect that this will generally result in poorer performance, since the inter-dependencies between the different components are not taken into account.
- Concatenate the nine different signals into one long signal and apply the CWT on the concatenated signal. This *could* work, but there will be discontinuities at the locations where two signals are concatenated, and this will introduce noise in the scaleogram at the boundary locations of the component signals.
- Calculate the CWT first and thén concatenate the nine different CWT images into one, and feed that into the CNN. This could also work, but here there will also be discontinuities at the boundaries of the CWT images, which will feed noise into the CNN. If the CNN is deep enough, it will be able to distinguish between these noisy parts and the actually useful parts of the image and choose to ignore the noise. But I still prefer option number four:
- Place the nine scaleograms on top of each other and create one single image with nine channels. What does this mean? Well, normally an image has either one channel (grayscale image) or three channels (color image), but our CNN can just as easily handle images with nine channels. The way the CNN works remains exactly the same; the only difference is that each filter in the first convolutional layer now has nine input channels instead of three.

This process is illustrated in Figure 11.

Below we can see the Python code on how to apply the CWT on the signals in the dataset, and reformat it in such a way that it can be used as input for our Convolutional Neural Network. The total dataset contains over 10,000 signals, but we will only use 5,000 signals in our training set and 500 signals in our test set.

```python
import numpy as np
import pywt

scales = range(1, 128)
waveletname = 'morl'

x_train = np.ndarray(shape=(5000, 127, 127, 9))
x_test = np.ndarray(shape=(500, 127, 127, 9))

for ii in range(0, 5000):
    for jj in range(0, 9):
        signal = uci_har_signals_train[ii, :, jj]
        coefficients, frequencies = pywt.cwt(signal, scales, waveletname, 1)
        x_train[ii, :, :, jj] = coefficients[:127, :127]

for ii in range(5000, 5500):
    for jj in range(0, 9):
        signal = uci_har_signals_train[ii, :, jj]
        coefficients, frequencies = pywt.cwt(signal, scales, waveletname, 1)
        x_test[ii - 5000, :, :, jj] = coefficients[:127, :127]

y_train = list(uci_har_labels_train[:5000])
y_test = list(uci_har_labels_train[5000:5500])
```

As you can see above, the CWT of a single signal component (128 samples) results in an image of 127 by 127 pixels. So the scaleograms coming from the 5000 signals of the training dataset are stored in a numpy ndarray of size (5000, 127, 127, 9) and the scaleograms coming from the 500 test signals are stored in one of size (500, 127, 127, 9).

Now that we have the data in the right format, we can start with the most interesting part of this section: training the CNN! For this part you will need the keras library, so please install it first.

```python
import keras
from keras.layers import Dense, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras.models import Sequential

img_x = 127
img_y = 127
img_z = 9
input_shape = (img_x, img_y, img_z)

# The labels run from 1 to 6, so to_categorical needs 7 columns
# (column 0 simply stays unused).
num_classes = 7
batch_size = 16
epochs = 10

x_train = x_train.reshape(x_train.shape[0], img_x, img_y, img_z)
x_test = x_test.reshape(x_test.shape[0], img_x, img_y, img_z)
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255

y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Conv2D(32, kernel_size=(5, 5), strides=(1, 1),
                 activation='relu', input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))
model.add(Conv2D(64, (5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(1000, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

class AccuracyHistory(keras.callbacks.Callback):
    def on_train_begin(self, logs={}):
        self.acc = []
    def on_epoch_end(self, epoch, logs={}):
        self.acc.append(logs.get('acc'))

history = AccuracyHistory()
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test),
          callbacks=[history])

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```

*** Test loss: 0.11555804646015168
*** Test accuracy: 0.9620000009536743

As you can see, combining the Wavelet Transform and a Convolutional Neural Network leads to an awesome and amazing result!

We achieve an accuracy of 96% on the Activity Recognition dataset, much higher than the ~91% we obtained earlier with the Fourier Transform based approach.

In section 3.5 we will use the Discrete Wavelet Transform instead of Continuous Wavelet Transform to classify the same dataset and achieve similarly amazing results!

In Section 2.5 we have seen how the DWT is implemented as a filter-bank which can deconstruct a signal into its frequency sub-bands. In this section, let us see how we can use PyWavelets to deconstruct a signal into its frequency sub-bands and reconstruct the original signal again.

PyWavelets offers two different ways to deconstruct a signal.

- We can apply `pywt.dwt()` on a signal to retrieve the approximation and detail coefficients, then apply the DWT again on the approximation coefficients to get the second-level coefficients, and continue this process until we have reached the desired decomposition level.
- Or we can apply `pywt.wavedec()` directly and retrieve all of the coefficients up to some level n. This function takes as input the original signal and the level n, and returns one set of approximation coefficients (of the n-th level) and n sets of detail coefficients (1st to n-th level).

```python
import pywt
import matplotlib.pyplot as plt

(cA1, cD1) = pywt.dwt(signal, 'db2', 'smooth')
reconstructed_signal = pywt.idwt(cA1, cD1, 'db2', 'smooth')

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(signal, label='signal')
ax.plot(reconstructed_signal, label='reconstructed signal', linestyle='--')
ax.legend(loc='upper left')
plt.show()
```

Above we have deconstructed a signal into its coefficients and reconstructed it again using the inverse DWT.

The second way is to use `pywt.wavedec()` to deconstruct and reconstruct a signal; it is probably the simplest way if you want to get higher-level coefficients.

```python
coeffs = pywt.wavedec(signal, 'db2', level=8)
reconstructed_signal = pywt.waverec(coeffs, 'db2')

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(signal[:1000], label='signal')
ax.plot(reconstructed_signal[:1000], label='reconstructed signal', linestyle='--')
ax.legend(loc='upper left')
ax.set_title('de- and reconstruction using wavedec()')
plt.show()
```

In the previous section we have seen how we can deconstruct a signal into the approximation (low pass) and detail (high pass) coefficients. If we reconstruct the signal using these coefficients we will get the original signal back.

But what happens if we reconstruct while we leave out some detail coefficients? Since the detail coefficients represent the high-frequency part of the signal, we will simply have filtered out that part of the frequency spectrum. If you have a lot of high-frequency noise in your signal, this is one way to filter it out.
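To make this concrete without relying on a particular library, below is a minimal numpy-only sketch with a hand-rolled single-level Haar transform (so not PyWavelets itself): reconstructing with the detail coefficients zeroed acts as a crude low-pass filter.

```python
import numpy as np

def haar_dwt(x):
    # Single-level Haar transform: pairwise averages (approximation)
    # and pairwise differences (detail), with orthonormal scaling.
    x = np.asarray(x, dtype=float)
    cA = (x[0::2] + x[1::2]) / np.sqrt(2)
    cD = (x[0::2] - x[1::2]) / np.sqrt(2)
    return cA, cD

def haar_idwt(cA, cD):
    # Inverse of the transform above.
    x = np.empty(2 * len(cA))
    x[0::2] = (cA + cD) / np.sqrt(2)
    x[1::2] = (cA - cD) / np.sqrt(2)
    return x

toy_signal = np.array([1.0, 3.0, 2.0, 4.0])
cA, cD = haar_dwt(toy_signal)

# Zeroing the detail (high-pass) coefficients before reconstruction
# removes the high-frequency half of the spectrum.
smoothed = haar_idwt(cA, np.zeros_like(cD))
print(smoothed)  # [2. 2. 3. 3.]
```

With the full set of coefficients the reconstruction is exact; with the details zeroed, each pair of samples is replaced by its average.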

Leaving out detail coefficients can be done with pywt.threshold(), which sets coefficients with an absolute value below the given threshold to zero (and, in 'soft' mode, shrinks the remaining coefficients toward zero).
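What soft thresholding does to an individual coefficient can be written out in a few lines of numpy; the function below is my own sketch of the behavior of pywt.threshold(..., mode='soft'), not the PyWavelets source:

```python
import numpy as np

def soft_threshold(coeffs, value):
    # Coefficients with a magnitude below `value` become zero,
    # the remaining ones are shrunk toward zero by `value`.
    coeffs = np.asarray(coeffs, dtype=float)
    # (+ 0.0 normalizes any negative zeros for printing)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - value, 0.0) + 0.0

print(soft_threshold([-2.0, -0.5, 0.5, 2.0], 1.0))  # [-1.  0.  0.  1.]
```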

Let's demonstrate this using NASA's FEMTO Bearing Dataset. This is a dataset containing high-frequency sensor data regarding the accelerated degradation of bearings.

```python
import numpy as np
import pandas as pd
import pywt
import matplotlib.pyplot as plt

DATA_FOLDER = './FEMTO_bearing/Training_set/Bearing1_1/'
filename = 'acc_01210.csv'
df = pd.read_csv(DATA_FOLDER + filename, header=None)
signal = df[4].values

def lowpassfilter(signal, thresh=0.63, wavelet="db4"):
    thresh = thresh * np.nanmax(signal)
    coeff = pywt.wavedec(signal, wavelet, mode="per")
    coeff[1:] = (pywt.threshold(i, value=thresh, mode="soft") for i in coeff[1:])
    reconstructed_signal = pywt.waverec(coeff, wavelet, mode="per")
    return reconstructed_signal

fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(signal, color="b", alpha=0.5, label='original signal')
rec = lowpassfilter(signal, 0.4)
ax.plot(rec, 'k', label='DWT smoothing', linewidth=2)
ax.legend()
ax.set_title('Removing High Frequency Noise with DWT', fontsize=18)
ax.set_ylabel('Signal Amplitude', fontsize=16)
ax.set_xlabel('Sample No', fontsize=16)
plt.show()
```

As we can see, by deconstructing the signal, setting some of the coefficients to zero, and reconstructing it again, we can remove high frequency noise from the signal.

People who are familiar with signal processing techniques might know there are a lot of different ways to remove noise from a signal. For example, the Scipy library contains a lot of smoothing filters (one of them is the famous Savitzky-Golay filter) and they are much simpler to use. Another method for smoothing a signal is to average it over its time-axis.

So why should you use the DWT instead? The advantage of the DWT again comes from the many wavelet shapes there are available. You can choose a wavelet which will have a shape characteristic to the phenomena you expect to see. In this way, less of the phenomena you expect to see will be smoothed out.

In Section 3.2 we have seen how we can use the CWT and a CNN to classify signals. Of course it is also possible to use the DWT to classify signals. Let us have a look at how this could be done.

The idea behind DWT signal classification is as follows: the DWT is used to split a signal into different frequency sub-bands, as many as needed or as many as possible. If the different types of signals exhibit different frequency characteristics, this difference in behavior has to show up in one of the frequency sub-bands. So if we generate features from each of the sub-bands, use the collection of features as input for a classifier (Random Forest, Gradient Boosting, Logistic Regression, etc.) and train it on these features, the classifier should be able to distinguish between the different types of signals.

This is illustrated in the figure below:

So what kind of features can be generated from the set of values for each of the sub-bands? Of course this will highly depend on the type of signal and the application. But in general, below are some features which are most frequently used for signals.

- Auto-regressive model coefficient values
- (Shannon) Entropy values; entropy values can be taken as a measure of the complexity of the signal.
- Statistical features like:
  - variance
  - standard deviation
  - mean
  - median
  - 25th percentile value
  - 75th percentile value
- Root Mean Square value; the square root of the mean of the squared amplitude values
- The mean of the derivative
- Zero crossing rate, i.e. the number of times a signal crosses y = 0
- Mean crossing rate, i.e. the number of times a signal crosses y = mean(y)
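As a quick illustration, a few of the features above can be computed with plain numpy (a toy sketch with a hypothetical helper name; the actual feature-extraction code used in this post follows below):

```python
import numpy as np

def basic_features(x):
    # Sketch of three of the features listed above.
    x = np.asarray(x, dtype=float)
    rms = float(np.sqrt(np.mean(x ** 2)))  # root mean square
    # np.diff on a boolean array marks positions where consecutive values differ,
    # so summing it counts the crossings.
    zero_crossings = int(np.sum(np.diff(x > 0)))           # crossings of y = 0
    mean_crossings = int(np.sum(np.diff(x > np.mean(x))))  # crossings of y = mean(x)
    return rms, zero_crossings, mean_crossings

values = np.array([1.0, -1.0, 1.0, -1.0])
print(basic_features(values))  # (1.0, 3, 3)
```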

These are just some ideas you could use to generate the features out of each sub-band. You could use some of the features described here, or you could use all of them. Most classifiers in the scikit-learn package are powerful enough to handle a large number of input features and distinguish between useful ones and non-useful ones. However, I still recommend you think carefully about which feature would be useful for the type of signal you are trying to classify.

**PS:** There are a hell of a lot more statistical functions in scipy.stats. By using these you can create more features if necessary.

Let’s see how this could be done in Python for a few of the above mentioned features:

```python
import numpy as np
import scipy.stats
from collections import Counter

def calculate_entropy(list_values):
    counter_values = Counter(list_values).most_common()
    probabilities = [elem[1] / len(list_values) for elem in counter_values]
    entropy = scipy.stats.entropy(probabilities)
    return entropy

def calculate_statistics(list_values):
    n5 = np.nanpercentile(list_values, 5)
    n25 = np.nanpercentile(list_values, 25)
    n75 = np.nanpercentile(list_values, 75)
    n95 = np.nanpercentile(list_values, 95)
    median = np.nanpercentile(list_values, 50)
    mean = np.nanmean(list_values)
    std = np.nanstd(list_values)
    var = np.nanvar(list_values)
    # RMS is the square root of the mean of the squared values
    rms = np.sqrt(np.nanmean(np.square(list_values)))
    return [n5, n25, n75, n95, median, mean, std, var, rms]

def calculate_crossings(list_values):
    zero_crossing_indices = np.nonzero(np.diff(np.array(list_values) > 0))[0]
    no_zero_crossings = len(zero_crossing_indices)
    mean_crossing_indices = np.nonzero(np.diff(np.array(list_values) > np.nanmean(list_values)))[0]
    no_mean_crossings = len(mean_crossing_indices)
    return [no_zero_crossings, no_mean_crossings]

def get_features(list_values):
    entropy = calculate_entropy(list_values)
    crossings = calculate_crossings(list_values)
    statistics = calculate_statistics(list_values)
    return [entropy] + crossings + statistics
```

Above we can see

- a function to calculate the entropy value of an input signal,
- a function to calculate some statistics like several percentiles, mean, standard deviation, variance, etc,
- a function to calculate the zero crossings rate and the mean crossings rate,
- and a function to combine the results of these three functions.

The final function returns a set of 12 features for any list of values. So if one signal is decomposed into 10 different sub-bands, and we generate features for each sub-band, we will end up with 10*12 = 120 features per signal.
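How many sub-bands you actually get depends on the signal length and the wavelet's filter length. The sketch below mirrors the formula behind pywt.dwt_max_level for a 128-sample UCI-HAR signal and the db4 wavelet (filter length 8); the resulting feature count is only an illustration for this particular combination.

```python
import numpy as np

# Maximum useful decomposition level for a signal of length n and a
# wavelet filter of length L (same formula as pywt.dwt_max_level):
n, L = 128, 8
max_level = int(np.floor(np.log2(n / (L - 1))))
print(max_level)  # 4

# wavedec at this level returns max_level + 1 coefficient arrays:
# one approximation set plus max_level detail sets.
n_subbands = max_level + 1
print(n_subbands * 12)  # 60 features for this signal/wavelet combination
```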

So far so good. The next step is to actually use the DWT to decompose the signals in the training set into their sub-bands, calculate the features for each sub-band, use the features to train a classifier and use the classifier to predict the signals in the test set.

We will do this for two time-series datasets:

- The UCI-HAR dataset, which we have already seen in section 3.2. This dataset contains smartphone sensor data of humans while doing different types of activities, like sitting, standing, walking, walking upstairs and walking downstairs.
- PhysioNet ECG Dataset (download from here) which contains a set of ECG measurements of healthy persons (indicated as Normal sinus rhythm, NSR) and persons with either an arrhythmia (ARR) or a congestive heart failure (CHF). This dataset contains 96 ARR measurements, 36 NSR measurements and 30 CHF measurements.

After we have downloaded both datasets, and placed them in the right folders, the next step is to load them into memory. We have already seen how we can load the UCI-HAR dataset in section 3.2, and below we can see how to load the ECG dataset.

```python
import scipy.io as sio

def load_ecg_data(filename):
    raw_data = sio.loadmat(filename)
    list_signals = raw_data['ECGData'][0][0][0]
    list_labels = list(map(lambda x: x[0][0], raw_data['ECGData'][0][0][1]))
    return list_signals, list_labels

filename = './data/ECG_data/ECGData.mat'
data_ecg, labels_ecg = load_ecg_data(filename)

training_size = int(0.6 * len(labels_ecg))
train_data_ecg = data_ecg[:training_size]
test_data_ecg = data_ecg[training_size:]
train_labels_ecg = labels_ecg[:training_size]
test_labels_ecg = labels_ecg[training_size:]
```

The ECG dataset is saved as a MATLAB file, so we have to use scipy.io.loadmat() to open this file in Python and retrieve its contents (the ECG measurements and the labels) as two separate lists.

The UCI HAR dataset is saved in a lot of .txt files, and after reading the data we save it into a numpy ndarray of size (no of signals, length of signal, no of components) = (10299, 128, 9).

Now let us have a look at how we can get features out of these two datasets.

```python
def get_uci_har_features(dataset, labels, waveletname):
    uci_har_features = []
    for signal_no in range(0, len(dataset)):
        features = []
        for signal_comp in range(0, dataset.shape[2]):
            signal = dataset[signal_no, :, signal_comp]
            list_coeff = pywt.wavedec(signal, waveletname)
            for coeff in list_coeff:
                features += get_features(coeff)
        uci_har_features.append(features)
    X = np.array(uci_har_features)
    Y = np.array(labels)
    return X, Y

def get_ecg_features(ecg_data, ecg_labels, waveletname):
    list_features = []
    list_unique_labels = list(set(ecg_labels))
    list_labels = [list_unique_labels.index(elem) for elem in ecg_labels]
    for signal in ecg_data:
        list_coeff = pywt.wavedec(signal, waveletname)
        features = []
        for coeff in list_coeff:
            features += get_features(coeff)
        list_features.append(features)
    return list_features, list_labels

X_train_ecg, Y_train_ecg = get_ecg_features(train_data_ecg, train_labels_ecg, 'db4')
X_test_ecg, Y_test_ecg = get_ecg_features(test_data_ecg, test_labels_ecg, 'db4')
X_train_ucihar, Y_train_ucihar = get_uci_har_features(train_signals_ucihar, train_labels_ucihar, 'rbio3.1')
X_test_ucihar, Y_test_ucihar = get_uci_har_features(test_signals_ucihar, test_labels_ucihar, 'rbio3.1')
```

What we have done above is write functions to generate features from the ECG signals and the UCI HAR signals. There is nothing special about these functions, and the only reason we are using two separate functions is that the two datasets are saved in different formats. The ECG dataset is saved in a list, and the UCI HAR dataset is saved in a 3D numpy ndarray.

For the ECG dataset we iterate over the list of signals, and for each signal apply the DWT which returns a list of coefficients. For each of these coefficients, i.e. for each of the frequency sub-bands, we calculate the features with the function we have defined previously. The features calculated from all of the different coefficients belonging to one signal are concatenated together, since they belong to the same signal.

The same is done for the UCI HAR dataset. The only difference is that we now have two for-loops, since each signal consists of nine components. The features generated from each of the sub-bands of each signal component are concatenated together.

Now that we have calculated the features for the two datasets, we can use a GradientBoostingClassifier from the scikit-learn library and train it.

**PS:** If you want to know more about classification with the scikit-learn library, you can have a look at this blog post.

```python
from sklearn.ensemble import GradientBoostingClassifier

cls = GradientBoostingClassifier(n_estimators=2000)
cls.fit(X_train_ecg, Y_train_ecg)
train_score = cls.score(X_train_ecg, Y_train_ecg)
test_score = cls.score(X_test_ecg, Y_test_ecg)
print("Train Score for the ECG dataset is about: {}".format(train_score))
print("Test Score for the ECG dataset is about: {:.2f}".format(test_score))

cls = GradientBoostingClassifier(n_estimators=2000)
cls.fit(X_train_ucihar, Y_train_ucihar)
train_score = cls.score(X_train_ucihar, Y_train_ucihar)
test_score = cls.score(X_test_ucihar, Y_test_ucihar)
print("Train Score for the UCI-HAR dataset is about: {}".format(train_score))
print("Test Score for the UCI-HAR dataset is about: {:.2f}".format(test_score))
```

*** Train Score for the ECG dataset is about: 1.0
*** Test Score for the ECG dataset is about: 0.93
*** Train Score for the UCI-HAR dataset is about: 1.0
*** Test Score for the UCI-HAR dataset is about: 0.95

As we can see, the results when we use the DWT + Gradient Boosting Classifier are equally amazing!

This approach has an accuracy on the UCI-HAR test set of 95% and an accuracy on the ECG test set of 93%.

So far, we have seen throughout the various blog-posts, how we can classify time-series and signals in different ways. In a previous post we have classified the UCI-HAR dataset using Signal Processing techniques like The Fourier Transform. The accuracy on the test-set was ~91%.

In another blog-post, we have classified the same UCI-HAR dataset using Recurrent Neural Networks. The highest achieved accuracy on the test-set was ~86%.

In this blog-post we have achieved an accuracy of ~96% on the test-set with the approach of CWT + CNN and an accuracy of ~95% with the DWT + GradientBoosting approach!

It is clear that on this dataset the Wavelet Transform gives far better results. We could repeat this for other datasets, and I suspect the Wavelet Transform would outperform there as well.

What we can also notice is that, strangely enough, the Recurrent Neural Networks perform the worst. Even the approach of simply using the Fourier Transform and the peaks in the frequency spectrum has a better accuracy than the RNN's.

This really makes me wonder what the hell Recurrent Neural Networks are actually good for. It is said that an RNN can learn 'temporal dependencies in sequential data'. But if a Fourier Transform can fully describe any signal (no matter its complexity) in terms of its Fourier components, then what more can be learned with respect to 'temporal dependencies'?

It is no wonder that people already start talking about the fall of RNN / LSTM.

**PS:** The achieved accuracy using the DWT will depend on the features you decide to calculate, the wavelet and the classifier you decide to use. To give an impression, below are the accuracy values for the test sets of the UCI-HAR and ECG datasets, for all of the wavelets present in PyWavelets and for the five most used classifiers in scikit-learn.

We can see that there will be differences in accuracy depending on the chosen classifier. Generally speaking, the Gradient Boosting classifier performs best. This should come as no surprise, since many Kaggle competitions are won with gradient boosting models.

What is more important is that the chosen wavelet can also have a lot of influence on the achieved accuracy values. Unfortunately I do not have a guideline for choosing the right wavelet. The best way to choose the right wavelet is a lot of trial-and-error and a little bit of literature research.

In this blog post we have seen how we can use the Wavelet Transform for the analysis and classification of time-series and signals (and some other stuff). Not many people know how to use the Wavelet Transform, but this is mostly because the theory is not beginner-friendly and the wavelet transform is not readily available in open source programming languages.

MATLAB is one of the few programming languages with a very complete Wavelet Toolbox. But since MATLAB has an expensive licensing fee, it is mostly used in the academic world and large corporations.

Among Data Scientists, the Wavelet Transform remains an undiscovered jewel and I recommend fellow Data Scientists to use it more often. I am very thankful to the contributors of the PyWavelets package who have implemented a large set of Wavelet families and higher level functions for using these wavelets. Thanks to them, Wavelets can now more easily be used by people using the Python programming language.

I have tried to give a concise but complete description of how wavelets can be used for the analysis of signals. I hope it has motivated you to use it more often and that the Python code provided in this blog-post will point you in the right direction in your quest to use wavelets for data analysis.

A lot will depend on the choices you make; which wavelet transform will you use, CWT or DWT? which wavelet family will you use? Up to which level of decomposition will you go? What is the right range of scales to use?

Like most things in life, the only way to master a new method is by a lot of practice. You can look up some literature to see what the commonly used wavelet type is for your specific problem, but do not make the mistake of thinking that whatever you read in research papers is the holy grail and that you don't need to look any further.

**PS: You can also find the code in this blog-post in five different Jupyter notebooks in my Github repository.**


In the previous blog posts we have seen how we can build Convolutional Neural Networks in Tensorflow and also how we can use Stochastic Signal Analysis techniques to classify signals and time-series. In this blog post, lets have a look and see how we can build Recurrent Neural Networks in Tensorflow and use them to classify Signals.

Recurrent Neural Nets (RNN) detect features in sequential data (e.g. time-series data). Examples of applications which can be made using RNN's are anomaly detection in time-series data, classification of ECG and EEG data, stock market prediction, speech recognition, sentiment analysis, etc.

This is done by unrolling the data into N different copies of itself (if the data consists of N time steps).

In this way, the input data at the previous time steps can be used when the data at the current time step is evaluated. If the data at the previous time steps is somehow correlated to the data at the current time step, these correlations are remembered, and otherwise they are forgotten.

By unrolling the data, the weights of the Neural Network are shared across all of the time steps, and the RNN can generalize beyond the example seen at the current timestep, and beyond sequences seen in the training set.
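The weight sharing described above can be made explicit with a small numpy sketch of a vanilla RNN cell (the dimensions are chosen to match the UCI-HAR signals; the weights are random placeholders, not trained values):

```python
import numpy as np

def rnn_forward(x_seq, W_x, W_h, b):
    # Unrolled vanilla RNN: the SAME weights (W_x, W_h, b) are applied at
    # every time step - this is what weight sharing across time means.
    h = np.zeros(W_h.shape[0])
    for x_t in x_seq:
        # The new state depends on the current input AND the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h  # final hidden state, usable as input for a classifier

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(128, 9))      # one UCI-HAR-like signal: 128 steps, 9 components
W_x = 0.1 * rng.normal(size=(32, 9))   # input-to-hidden weights (32 hidden units)
W_h = 0.1 * rng.normal(size=(32, 32))  # hidden-to-hidden weights, reused each step
b = np.zeros(32)
print(rnn_forward(x_seq, W_x, W_h, b).shape)  # (32,)
```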

This is a very short description of how an RNN works. For people who want to know more, here is some more reading material to get you up to speed. For now, what I would like you to remember is that Recurrent Neural Networks can learn whether there are temporal dependencies in the sequential data, and if there are, which dependencies / features can be used to classify the data. A RNN therefore is ideal for the classification of time-series, signals and text documents.

So, let's start with implementing RNN's in Tensorflow and using them to classify signals.

In this blog post we will work with the CPU-friendly Human Activity Recognition Using Smartphones dataset. This dataset contains measurements done by 30 people between the ages of 19 and 48. These people had a smartphone placed on the waist while doing one of the following six activities:

- walking,
- walking upstairs,
- walking downstairs,
- sitting,
- standing or
- laying.

During these activities, sensor data is recorded at a constant rate of 50 Hz. The signals are cut into fixed-width windows of 2.56 sec with 50% overlap. Since these signals of 2.56 sec have a sampling rate of 50 Hz, they will have 128 samples in total. For an illustration of this, see Figure 1a.
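The windowing arithmetic can be checked with a short sketch (the recording below is a stand-in, not the dataset's own preprocessing code):

```python
import numpy as np

# Cutting a long recording into fixed-width windows of 2.56 s with 50% overlap.
fs = 50                      # sampling rate in Hz
window = int(2.56 * fs)      # 128 samples per window
step = window // 2           # 50% overlap -> hop of 64 samples

recording = np.arange(1000)  # stand-in for one sensor component
windows = np.array([recording[i:i + window]
                    for i in range(0, len(recording) - window + 1, step)])
print(windows.shape)  # (14, 128)
```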

The smartphone measures three-axial linear body acceleration, three-axial linear total acceleration and three-axial angular velocity. So per measurement, the signal has nine components in total (see Figure 1b).

The dataset is already split into a training and a test part, so we can immediately load the signals into two different numpy ndarrays containing the training part and the test part.

```python
import numpy as np

def read_signals(filename):
    with open(filename, 'r') as fp:
        data = fp.read().splitlines()
        data = map(lambda x: x.rstrip().lstrip().split(), data)
        data = [list(map(float, line)) for line in data]
        data = np.array(data, dtype=np.float32)
    return data

def read_labels(filename):
    with open(filename, 'r') as fp:
        activities = fp.read().splitlines()
        activities = list(map(int, activities))
    return np.array(activities)

def randomize(dataset, labels):
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation, :, :]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels

def one_hot_encode(np_array, num_labels):
    return (np.arange(num_labels) == np_array[:, None]).astype(np.float32)

def reformat_data(dataset, labels):
    no_labels = len(np.unique(labels))
    # The labels run from 1 to no_labels, so shift them to 0..no_labels-1
    # before one-hot encoding.
    labels = one_hot_encode(labels - 1, no_labels)
    dataset, labels = randomize(dataset, labels)
    return dataset, labels

INPUT_FOLDER_TRAIN = './UCI_HAR/train/InertialSignals/'
INPUT_FOLDER_TEST = './UCI_HAR/test/InertialSignals/'

INPUT_FILES_TRAIN = ['body_acc_x_train.txt', 'body_acc_y_train.txt', 'body_acc_z_train.txt',
                     'body_gyro_x_train.txt', 'body_gyro_y_train.txt', 'body_gyro_z_train.txt',
                     'total_acc_x_train.txt', 'total_acc_y_train.txt', 'total_acc_z_train.txt']

INPUT_FILES_TEST = ['body_acc_x_test.txt', 'body_acc_y_test.txt', 'body_acc_z_test.txt',
                    'body_gyro_x_test.txt', 'body_gyro_y_test.txt', 'body_gyro_z_test.txt',
                    'total_acc_x_test.txt', 'total_acc_y_test.txt', 'total_acc_z_test.txt']

train_signals, test_signals = [], []
for input_file in INPUT_FILES_TRAIN:
    signal = read_signals(INPUT_FOLDER_TRAIN + input_file)
    train_signals.append(signal)
train_signals = np.transpose(np.array(train_signals), (1, 2, 0))

for input_file in INPUT_FILES_TEST:
    signal = read_signals(INPUT_FOLDER_TEST + input_file)
    test_signals.append(signal)
test_signals = np.transpose(np.array(test_signals), (1, 2, 0))

LABELFILE_TRAIN = './UCI_HAR/train/y_train.txt'
LABELFILE_TEST = './UCI_HAR/test/y_test.txt'
train_labels = read_labels(LABELFILE_TRAIN)
test_labels = read_labels(LABELFILE_TEST)

train_dataset, train_labels = reformat_data(train_signals, train_labels)
test_dataset, test_labels = reformat_data(test_signals, test_labels)
```

The number of signals in the training set is 7352, and the number of signals in the test set is 2947. As we can see in Figure 2, each signal has a length of 128 samples and 9 different components, so numerically it can be considered as an array of size 128 x 9.

As we have also seen in the previous blog posts, our Neural Network consists of a `tf.Graph()` and a `tf.Session()`. The `tf.Graph()` contains all of the computational steps required for the Neural Network, and the `tf.Session` is used to execute these steps.

The computational steps defined in the `tf.Graph` can be divided into four main parts:

- We initialize placeholders which are filled with batches of training data during the run.
- We define the RNN model to calculate the output values (logits).
- The logits are used to calculate a loss value,
- which is then used in an Optimizer to optimize the weights of the RNN.

```python
num_units = 50
signal_length = 128
num_components = 9
num_labels = 6
num_hidden = 32
learning_rate = 0.001
lambda_loss = 0.001
total_steps = 5000
display_step = 500
batch_size = 100

def accuracy(y_predicted, y):
    return (100.0 * np.sum(np.argmax(y_predicted, 1) == np.argmax(y, 1)) / y_predicted.shape[0])

graph = tf.Graph()
with graph.as_default():
    # 1) First we put the input data in a tensorflow friendly form.
    tf_dataset = tf.placeholder(tf.float32, shape=(None, signal_length, num_components))
    tf_labels = tf.placeholder(tf.float32, shape=(None, num_labels))

    # 2) Then we choose the model to calculate the logits (predicted labels).
    # We can choose from several models:
    logits = rnn_model(tf_dataset, num_hidden, num_labels)
    #logits = lstm_rnn_model(tf_dataset, num_hidden, num_labels)
    #logits = bidirectional_lstm_rnn_model(tf_dataset, num_hidden, num_labels)
    #logits = twolayer_lstm_rnn_model(tf_dataset, num_hidden, num_labels)
    #logits = gru_rnn_model(tf_dataset, num_hidden, num_labels)

    # 3) Then we compute the softmax cross entropy between the logits and the (actual) labels.
    l2 = lambda_loss * sum(tf.nn.l2_loss(tf_var) for tf_var in tf.trainable_variables())
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_labels)) + l2

    # 4) The optimizer is used to calculate the gradients of the loss function.
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    #optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    #optimizer = tf.train.AdagradOptimizer(learning_rate).minimize(loss)

    # Predictions for the training, validation, and test data.
    prediction = tf.nn.softmax(logits)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("\nInitialized")
    for step in range(total_steps):
        # Since we are using stochastic gradient descent, we select small batches
        # from the training dataset and train the network on each batch.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_dataset: batch_data, tf_labels: batch_labels}
        _, l, train_predictions = session.run([optimizer, loss, prediction], feed_dict=feed_dict)
        train_accuracy = accuracy(train_predictions, batch_labels)
        if step % display_step == 0:
            feed_dict = {tf_dataset: test_dataset, tf_labels: test_labels}
            _, test_predictions = session.run([loss, prediction], feed_dict=feed_dict)
            test_accuracy = accuracy(test_predictions, test_labels)
            message = "step {:04d} : loss is {:06.2f}, accuracy on training set {} %, accuracy on test set {:02.2f} %".format(
                step, l, train_accuracy, test_accuracy)
            print(message)
```

Initialized step 0000 : loss is 001.96, accuracy on training set 16.0 %, accuracy on test set 27.99 % step 0500 : loss is 000.75, accuracy on training set 60.0 %, accuracy on test set 60.67 % step 1000 : loss is 000.52, accuracy on training set 74.0 %, accuracy on test set 68.68 % step 1500 : loss is 000.66, accuracy on training set 70.0 %, accuracy on test set 69.22 % step 2000 : loss is 001.14, accuracy on training set 60.0 %, accuracy on test set 50.93 % step 2500 : loss is 001.00, accuracy on training set 74.0 %, accuracy on test set 65.97 % step 3000 : loss is 001.25, accuracy on training set 69.0 %, accuracy on test set 65.93 % step 3500 : loss is 001.40, accuracy on training set 79.0 %, accuracy on test set 69.66 % step 4000 : loss is 001.69, accuracy on training set 74.0 %, accuracy on test set 70.44 % step 4500 : loss is 002.00, accuracy on training set 77.0 %, accuracy on test set 70.68 %

As you can see, there are different RNN Models and optimizers to choose from.

GradientDescentOptimizer is a vanilla (simple) implementation of Stochastic Gradient Descent while other implementations like the AdaOptimizer, MomentumOptimizer and AdamOptimizer dynamically adapt the learning rate to the parameters resulting in a more computational intensive process with better results. For a good explanation of the differences between all the different optimizers, have a look at Sebastian Ruders’ blog.

Besides the different types of optimizers, Tensorflow also contains different flavours of RNN’s.

We can choose from different types of cells and wrappers use them to reconstruct different types of Recurrent Neural Networks.

The basic types of cells are a BasicRNNCell, GruCell, LSTMCell, MultiRNNCell, These can be placed inside a static_rnn, dynamic_rnn or a static_bidirectional_rnn container.

In Figure 3 we can see (on the left side) a schematic overview of the process-steps of constructing a RNN Model together with (on the right side) the lines of code accompanying these steps.

As you can see, we first split the data into a list of N different arrays with `tf.unstack()`

. Then the type of cell is chosen and passed into the recurrent neural network together with the splitted data.

Now that we have schematically seen how we can create a RNN model, lets have a look at how we can create the different types of models in more detail.

Above, we have seen what the computational steps of the Neural Network consists of. But we have not yet seen the contents of our rnn_model, lstm_rnn_model, bidirectional_lstm_rnn_model, twolayer_lstm_rnn_model or gru_rnn_model. Lets have a look at how these models are constructed in more detail in the few sections below.

def rnn_model(data, num_hidden, num_labels): splitted_data = tf.unstack(data, axis=1) cell = tf.nn.rnn_cell.BasicRNNCell(num_hidden) outputs, current_state = tf.nn.static_rnn(cell, splitted_data, dtype=tf.float32) output = outputs[-1] w_softmax = tf.Variable(tf.truncated_normal([num_hidden, num_labels])) b_softmax = tf.Variable(tf.random_normal([num_labels])) logit = tf.matmul(output, w_softmax) + b_softmax return logit

As you can see, we first split the Tensor containing the data (size batch_size, 128, 9) into a list of 128 Tensors of size (batch_size, 9) each. This is used, together with BasicRNNCell as an input for the static_rnn, which gives us a list of outputs (also of length 128).

The last output in this list (the last time step) contains information from all previous timesteps, so this is the output we will use to classify this signal.

**BasicRNNCell **is the most basic and vanille cell present in Tensorflow. It is an basic implementation of a RNN cell and does not have an LSTM implementation like **BasicLSTMCell** has. The accuracy you can achieve with BasicLSTMCell therefore is higher than BasicRNNCelll.

Since it does not have LSTM implemented, BasicRNNCell has its limitations. Instead of a BasicRNNCell we can use a BasicLSTMCell or an LSTMCell. Both are comparable, but a **LSTMCell **has some additional options like peephole structures, clipping of values, etc.

def rnn_lstm_model(data, num_hidden, num_labels): splitted_data = tf.unstack(data, axis=1) cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden) outputs, current_state = tf.nn.static_rnn(cell, splitted_data, dtype=tf.float32) output = outputs[-1] w_softmax = tf.Variable(tf.truncated_normal([num_hidden, num_labels])) b_softmax = tf.Variable(tf.random_normal([num_labels])) logit = tf.matmul(output, w_softmax) + b_softmax return logit

Besides BasicRNNCell and BasicLSTMCell, Tensorflow also contains **GruCell**, which is an abstract implementation of the Gated Recurrent Unit, proposed in 2014 by Kyunghyun Cho et al.

def gru_rnn_model(data, num_hidden, num_labels): splitted_data = tf.unstack(data, axis=1) cell = tf.contrib.rnn.GRUCell(num_hidden) outputs, current_state = tf.nn.static_rnn(cell, splitted_data, dtype=tf.float32) output = outputs[-1] w_softmax = tf.Variable(tf.truncated_normal([num_hidden, num_labels])) b_softmax = tf.Variable(tf.random_normal([num_labels])) logit = tf.matmul(output, w_softmax) + b_softmax return logit

The vanille RNN and LSTM RNN models we have seen so far, assume that the data at a step only depend on ‘past’ events. A bidirectional LSTM RNN, assumes that the output at step can also depend on the data at future steps. This is not so strange if you think about applications in text analytics or speech recognition: subjects often precede verbs, adjectives precede nouns and in speech recognition the meaning of current sound may depend on the meaning of the next few sounds.

To implement a bidirectional RNN, two BasicLSTMCell’s are used; the first one looks for temporal dependencies in the backward direction and the second one for dependencies in the forward direction.

def bidirectional_rnn_model(data, num_hidden, num_labels): splitted_data = tf.unstack(data, axis=1) lstm_cell1 = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, forget_bias=1.0, state_is_tuple=True) lstm_cell2 = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, forget_bias=1.0, state_is_tuple=True) outputs, _, _ = tf.nn.static_bidirectional_rnn(lstm_cell1, lstm_cell2, splitted_data, dtype=tf.float32) output = outputs[-1] w_softmax = tf.Variable(tf.truncated_normal([num_hidden*2, num_labels])) b_softmax = tf.Variable(tf.random_normal([num_labels])) logit = tf.matmul(output, w_softmax) + b_softmax return logit

We have seen how we can implement a bi-directional LSTM by stacking two LSTM Cells on top of each other, where the first on looks for sequential dependencies in the forward direction, and the second one in the backward direction. You could also place two LSTM cells on top of each other, simply to increase the neural network strength.

def twolayer_rnn_model(data, num_hidden, num_labels): splitted_data = tf.unstack(data, axis=1) cell1 = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, forget_bias=1.0, state_is_tuple=True) cell2 = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, forget_bias=1.0, state_is_tuple=True) cell = tf.nn.rnn_cell.MultiRNNCell([cell1, cell2], state_is_tuple=True) outputs, state = tf.nn.static_rnn(cell, splitted_data, dtype=tf.float32) output = outputs[-1] w_softmax = tf.Variable(tf.truncated_normal([num_hidden, num_labels])) b_softmax = tf.Variable(tf.random_normal([num_labels])) logit = tf.matmul(output, w_softmax) + b_softmax return logit

In this RNN network, n layers of RNN are stacked on top of each other. The output of each layer is mapped into the input of the next layer, and this allows the RNN to hierarchically looks for temporal dependencies. With each layer the representational power of the Neural Network increases (in theory).

def multi_rnn_model(data, num_hidden, num_labels, num_cells = 4): splitted_data = tf.unstack(data, axis=1) lstm_cells = [] for ii in range(0,num_cells): lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(num_hidden, state_is_tuple=True) lstm_cells.append(lstm_cell) cell = tf.nn.rnn_cell.MultiRNNCell(lstm_cells, state_is_tuple=True) outputs, state = tf.nn.static_rnn(cell, splitted_data, dtype=tf.float32) output = outputs[-1] w_softmax = tf.Variable(tf.truncated_normal([num_hidden, num_labels])) b_softmax = tf.Variable(tf.random_normal([num_labels])) logit = tf.matmul(output, w_softmax) + b_softmax return logit

The num_layer parameter determines how many layers are used to determine the temporal dependencies in the data. The more layers you have, the higher the representation power of the RNN is.

We have seen how we can build several different types of Recurrent Neural Networks. The question then is, how do these RNN’s perform in practice?

Does the accuracy really increase a lot with the number of layers or the number of hidden units?

What is the effect of the chosen optimizer and the learning rate?

In each image you can see the final accuracy in the test set for different learning rates, models, optimizers and hidden units. You can **click on each image for a more detailed graph** of the training and test accuracies.

In this blog-post we have seen how we can build an Recurrent Neural Network in Tensorflow, from a vanille RNN model, to an LSTM RNN, GRU RNN, bi-directional or multi-layered RNN’s. Such Recurrent Neural Networks are (powerful) tools which can be used for the analysis of time-series data or other data which is sequential in nature (like text or speech).

What I have noticed so far is:

- The most important factor to achieve high accuracy values is the chosen learning rate. It should be carefully tuned, first with large steps than with finer steps.
- AdamOptimizer usually performs best.
- More hidden units is not necessarily better. In any case, if you change the number of hidden units, you probably need to find the optimum value for learning rate again.
- For the type of RNN also; more layers is not necessarily better. BasicRNNCell has the worst performance, but except for BasicRNNCell there is no single implementation which outperforms all others in all regards. If you implement a RNN containing a
**BasicLSTMCell**and carefully tune the learning rate and implement some l2-regularization it should be good enough for most applications. - I am not that impressed with RNN’s in general. Same accuracy values can be / are achieved with simple stochastic analysis techniques with much less effort. With Stochastic analysis techniques you also have the benefit of knowing what the characteristic feature of each type of signal is.

** **

[1] If you feel like you need to refresh your understanding of CNN’s, here are some good starting points to get you up to speed:

- Machine Learning is fun!
- Colah’s blog
- WildML on RNN
- DeepLearning4J on RNN
- The deeplearning book
- Some more resources

]]>

Stochastic Signal Analysis is a field of science concerned with the processing, modification and analysis of (stochastic) signals.

Anyone with a background in Physics or Engineering knows to some degree about signal analysis techniques, what these technique are and how they can be used to analyze, model and classify signals.

Data Scientists coming from a different fields, like Computer Science or Statistics, might not be aware of the analytical power these techniques bring with them.

In this blog post, we will have a look at how we can use Stochastic Signal Analysis techniques, in combination with traditional Machine Learning Classifiers for accurate classification and modelling of time-series and signals.

At the end of the blog-post you should be able understand the various signal-processing techniques which can be used to retrieve features from signals and be able to classify ECG signals (and even identify a person by their ECG signal), predict seizures from EEG signals, classify and identify targets in radar signals, identify patients with neuropathy or myopathyetc from EMG signals by using the FFT, etc etc.

In this blog-post we’ll discuss the following topics:

- Basics of Signals
- Transformations between time- and frequency-domain by means of FFT, PSD and autocorrelation.
- Statistical parameter estimation and feature extraction
- Example dataset: Classification of human activity
- Extracting features from all signals in the training and test set

- Classification with (traditional) Scikit-learn classifiers
- Finals words

You might often have come across the words time-series and signals describing datasets and it might not be clear what the exact difference between them is.

In a time-series dataset the to-be-predicted value () is a function of time (). Such a function can describe anything, from the value of bitcoin or a specific stock over time, to fish population over time. A signal is a more general version of this where the dependent variable does not have to a function of time; it can be a function of spatial coordinates (), distance from the source ( ), etc etc.

Signals can come in many different forms and shapes: you can think of audio signals, pictures, video signals, geophysical signals (seismic data), sonar and radar data and medical signals (EEG, ECG, EMG).

- A picture can be seen as a signal which contains information about the brightness of the three colors (RGB) across the two spatial dimensions.
- Sonar signals give information about an acoustic pressure field as a function of time and the three spatial dimensions.
- Radar signals do the same thing for electromagnetic waves.

In essence, almost anything can be interpreted as a signal as long as it carries information within itself.

If you open up a text book on Signal Processing it will usually be divided into two parts: the continuous time-domain and the discrete-time domain.

The difference is that Continuous signals have an independent variable which is (as the name suggests) continuous in nature, i.e. it is present at each time-step within its domain. No matter how far you ‘zoom in’, you will have a value at that time step; at , at , at , etc etc.

Discrete-time signals are discrete and are only defined at specific time-steps. For example, if the period of a discrete signal is , it will be defined at , , , etc … (but not at ).

Most of the signals you come across in nature are analog (continuous); think of the electrical signals in your body, human speech, any other sound you hear, the amount of light measured during the day, barometric pressure, etc etc.

Whenever you want to digitize one of these analog signals in order to analyze and visualize it on a computer, it becomes discrete. Since we will only be concerning ourselves with digital signals in this blog-post, we will only look at the discrete version of the various stochastic signal analysis techniques.

Digitizing an analog signal is usually done by sampling it with a specific sampling rate. In Figure 1 we can see a signal sampled at different frequencies.

As we can see, it is important to choose a good sampling rate; if the sampling rate is chosen too low, the discrete signal will no longer contain the characteristics of the original analog signal and important features defining the signal are lost.

To be more specific, a signal is said to be under-sampled if the sampling rate is smaller than the Nyquist rate. The Nyquist rate is twice the highest frequency present in the signal.

Under-sampling a signal can lead to effects like aliasing [2] and the wagon-wheel effect (see video). To prevent undersampling usually a frequency much higher than the Nyquist rate is chosen as the sampling frequency.

If the signal contains a pattern, which repeats itself after a specific period of time, we call it an *periodic* signal.

The time it takes for an periodic signal to repeat itself is called the period (and the distance it travels in this period is called the wavelength ).

The frequency is the inverse of the Period; if a signal has a Period of , its frequency is , and if the period is , the frequency is .

The period, wavelength and frequency are related to each other via formula (1):

(1)

where is the speed of sound.

Fourier analysis is a field of study used to analyze the periodicity in (periodic) signals. If a signal contains components which are periodic in nature, Fourier analysis can be used to decompose this signal in its periodic components. Fourier analysis tells us at what the frequency of these periodical component are.

For example, if we measure your heart beat and at the time of measurement you have a heart rate of 60 beats / minute, the signal will have a frequency of (Period of = frequency of ). If you are doing at the same time, some repetitive task where you move your fingers every two seconds, the signal going to you hand will have a frequency of (Period of = frequency of ). An electrode placed on your arm, will measure the combination of these two signals. And a Fourier analysis performed on the combined signals, will show us a peak in the frequency spectrum at 0.5 Hz and one at 1 Hz.

So, two (or more) different signals (with different frequencies, amplitudes, etc) can be mixed together to form a new composite signal. The new signal then consists of all of its component signals.

The reverse is also true, every signal – no matter how complex it looks – can be decomposed into a sum of its simpler signals. These simpler signals are trigonometric functions (sine and cosine waves). This was discovered (in 1822) by Joseph Fourier and it is what Fourier analysis is about. The mathematical function which transform a signal from the time-domain to the frequency-domain is called the Fourier Transform, and the function which does the opposite is called the Inverse Fourier Transform.

If you want to know how the Fourier transform works, 3blue1brown’s beautifully animated explanation will hopefully give you more insight.

Below, we can see this in action. We have five sine-waves (blue signals) with amplitudes 4, 6, 8, 10 and 14 and frequencies 6.5, 5, 3, 1.5 and 1 Hz. By combining these signals we form a new composite signal (black). The Fourier Transform transforms this signal to the frequency-domain (red signal) and shows us at which frequencies the component signals oscillate.

from mpl_toolkits.mplot3d import Axes3D t_n = 10 N = 1000 T = t_n / N f_s = 1/T x_value = np.linspace(0,t_n,N) amplitudes = [4, 6, 8, 10, 14] frequencies = [6.5, 5, 3, 1.5, 1] y_values = [amplitudes[ii]*np.sin(2*np.pi*frequencies[ii]*x_value) for ii in range(0,len(amplitudes))] composite_y_value = np.sum(y_values, axis=0) f_values, fft_values = get_fft_values(composite_y_value, T, N, f_s) colors = ['k', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b'] fig = plt.figure(figsize=(8,8)) ax = fig.add_subplot(111, projection='3d') ax.set_xlabel("\nTime [s]", fontsize=16) ax.set_ylabel("\nFrequency [Hz]", fontsize=16) ax.set_zlabel("\nAmplitude", fontsize=16) y_values_ = [composite_y_value] + list(reversed(y_values)) frequencies = [1, 1.5, 3, 5, 6.5] for ii in range(0,len(y_values_)): signal = y_values_[ii] color = colors[ii] length = signal.shape[0] x=np.linspace(0,10,1000) y=np.array([frequencies[ii]]*length) z=signal if ii == 0: linewidth = 4 else: linewidth = 2 ax.plot(list(x), list(y), zs=list(z), linewidth=linewidth, color=color) x=[10]*75 y=f_values[:75] z = fft_values[:75]*3 ax.plot(list(x), list(y), zs=list(z), linewidth=2, color='red') plt.tight_layout() plt.show()

The Fast Fourier Transform (FFT) is an efficient algorithm for calculating the Discrete Fourier Transform (DFT) and is the de facto standard to calculate a Fourier Transform. It is present in almost any scientific computing libraries and packages, in every programming language.

Nowadays the Fourier transform is an indispensable mathematical tool used in almost every aspect of our daily lives. In the next section we will have a look at how we can use the FFT and other Stochastic Signal analysis techniques to classify time-series and signals.

In Python, the FFT of a signal can be calculate with the SciPy library.

Below, we can see how we can use SciPy to calculate the FFT of the composite signal above, and retrieve the frequency values of its component signals.

from scipy.fftpack import fft def get_fft_values(y_values, T, N, f_s): f_values = np.linspace(0.0, 1.0/(2.0*T), N//2) fft_values_ = fft(y_values) fft_values = 2.0/N * np.abs(fft_values_[0:N//2]) return f_values, fft_values t_n = 10 N = 1000 T = t_n / N f_s = 1/T f_values, fft_values = get_fft_values(composite_y_value, T, N, f_s) plt.plot(f_values, fft_values, linestyle='-', color='blue') plt.xlabel('Frequency [Hz]', fontsize=16) plt.ylabel('Amplitude', fontsize=16) plt.title("Frequency domain of the signal", fontsize=16) plt.show()

Since our signal is sampled at a rate of , the FFT will return the frequency spectrum up to a frequency of . The higher your sampling rate is, the higher the maximum frequency is FFT can calculate.

In the `get_fft_values`

function above, the `scipy.fftpack.fft`

function returns a vector of complex valued frequencies. Since they are complex valued, they will contain a real and an imaginary part. The real part of the complex value corresponds with the magnitude, and the imaginary part with the phase of the signal. Since we are only interested in the magnitude of the amplitudes, we use `np.abs()`

to take the real part of the frequency spectrum.

The FFT of an input signal of N points, will return an vector of N points. The first half of this vector (N/2 points) contain the useful values of the frequency spectrum from 0 Hz up to the Nyquist frequency of . The second half contains the complex conjugate and can be disregarded since it does not provide any useful information.

Closely related to the Fourier Transform is the concept of Power Spectral Density.

Similar to the FFT, it describes the frequency spectrum of a signal. But in addition to the FFT it also takes the power distribution at each frequency (bin) into account. Generally speaking the locations of the peaks in the frequency spectrum will be the same as in the FFT-case, but the height and width of the peaks will differ. The surface below the peaks corresponds with the power distribution at that frequency.

Calculation of the Power Spectral density is a bit easier, since SciPy contain a function which not only return a vector of amplitudes, but also a vector containing the tick-values of the frequency-axis.

from scipy.signal import welch def get_psd_values(y_values, T, N, f_s): f_values, psd_values = welch(y_values, fs=f_s) return f_values, psd_values t_n = 10 N = 1000 T = t_n / N f_s = 1/T f_values, psd_values = get_psd_values(composite_y_value, T, N, f_s) plt.plot(f_values, psd_values, linestyle='-', color='blue') plt.xlabel('Frequency [Hz]') plt.ylabel('PSD [V**2 / Hz]') plt.show()

The auto-correlation function calculates the correlation of a signal with a time-delayed version of itself. The idea behind it is that if a signal contain a pattern which repeats itself after a time-period of seconds, there will be a high correlation between the signal and a sec delayed version of the signal.

Unfortunately there is no standard function to calculate the auto-correlation of a function in SciPy. But we can make one ourselves using the `correlate()`

function of numpy. Our function returns the correlation value, as a function of the time-delay . Naturally, this time-delay can not be more than the full length of the signal (which is in our case 2.56 sec).

def autocorr(x): result = np.correlate(x, x, mode='full') return result[len(result)//2:] def get_autocorr_values(y_values, T, N, f_s): autocorr_values = autocorr(y_values) x_values = np.array([T * jj for jj in range(0, N)]) return x_values, autocorr_values t_n = 10 N = 1000 T = t_n / N f_s = 1/T t_values, autocorr_values = get_autocorr_values(composite_y_value, T, N, f_s) plt.plot(t_values, autocorr_values, linestyle='-', color='blue') plt.xlabel('time delay [s]') plt.ylabel('Autocorrelation amplitude') plt.show()

Converting the values of the auto-correlation peaks from the time-domain to the frequency domain should result in the same peaks as the ones calculated by the FFT. The frequency of a signal thus can be found with the auto-correlation as well as with the FFT.

However, because it is more precise, the FFT is almost always used for frequency detection.

Fun fact: the auto-correlation and the PSD are Fourier Transform pairs, i.e. the PSD can be calculated by taking the FFT of the auto-correlation function, and the auto-correlation can be calculated by taking the Inverse Fourier Transform of the PSD function.

One transform which we have not mentioned here is the Wavelet transform. It transform a signal into its frequency domain, just like the Fourier Transform.

The difference is: the Fourier Transform has a very high resolution in the frequency domain, and zero resolution in the time domain; we know at which frequencies the signal oscillates, but not at which time these oscillations occur. The output of a Wavelet transform hash a high resolution in the frequency domain and also in the time domain; it maintains information about the time-domain.

The wavelet transform is better suited for analyzing signals with a dynamic frequency spectrum, i.e. the frequency spectrum changes over time, or has a component with a different frequency localized in time (the frequency changes abruptly for a short period of time).

Lets have a look at how to use the wavelet transform in Python in the next blog-post. For the ones who can not wait to get started with it, here are some examples of applications using the wavelet transform.

We have seen three different ways to calculate characteristics of signals using the FFT, PSD and the autocorrelation function. These functions transform a signal from the time-domain to the frequency-domain and give us its frequency spectrum.

After we have transformed a signal to the frequency-domain, we can extract features from each of these transformed signals and use these features as input in standard classifiers like Random Forest, Logistic Regression, Gradient Boosting or Support Vector Machines.

Which features can we extract from these transformations? A good first step is the value of the frequencies at which oscillations occur and the corresponding amplitudes. In other words; the x and y-position of the peaks in the frequency spectrum.

SciPy contains some methods to find the relative maxima (argrelmax) and minima (argrelmin) in data, but I found the peak detection method of Marcos Duarte much simpler and easier to use.

Armed with this peak-finding function, we can calculate the FFT, PSD and the auto-correlation of each signal and use the x and y coordinates of the peaks as input for our classifier. This is illustrated in Figure 6.

Lets have a look at how we can classify the signals in the Human Activity Recognition Using Smartphones Data Set. This dataset contains measurements done by 30 people between the ages of 19 to 48. The measurements are done with a smartphone placed on the waist while doing one of the following six activities:

- walking,
- walking upstairs,
- walking downstairs,
- sitting,
- standing or
- laying.

The measurements are done at a constant rate of . After filtering out the noise, the signals are cut in fixed-width windows of 2.56 sec with an overlap of 1.28 sec. Each signal will therefore have 50 x 2.56 = 128 samples in total. This is illustrated in Figure 7a.

The smartphone measures three-axial linear body acceleration, three-axial linear total acceleration and three-axial angular velocity. So per measurement, the total signal consists of nine components (see Figure 7b).

The dataset is already splitted into a training and a test part, so we can immediately load the signals into an numpy ndarray.

def read_signals(filename): with open(filename, 'r') as fp: data = fp.read().splitlines() data = map(lambda x: x.rstrip().lstrip().split(), data) data = [list(map(float, line)) for line in data] data = np.array(data, dtype=np.float32) return data def read_labels(filename): with open(filename, 'r') as fp: activities = fp.read().splitlines() activities = list(map(int, activities)) return np.array(activities) INPUT_FOLDER_TRAIN = './UCI_HAR/train/InertialSignals/' INPUT_FOLDER_TEST = './UCI_HAR/test/InertialSignals/' INPUT_FILES_TRAIN = ['body_acc_x_train.txt', 'body_acc_y_train.txt', 'body_acc_z_train.txt', 'body_gyro_x_train.txt', 'body_gyro_y_train.txt', 'body_gyro_z_train.txt', 'total_acc_x_train.txt', 'total_acc_y_train.txt', 'total_acc_z_train.txt'] INPUT_FILES_TEST = ['body_acc_x_test.txt', 'body_acc_y_test.txt', 'body_acc_z_test.txt', 'body_gyro_x_test.txt', 'body_gyro_y_test.txt', 'body_gyro_z_test.txt', 'total_acc_x_test.txt', 'total_acc_y_test.txt', 'total_acc_z_test.txt'] train_signals, test_signals = [], [] for input_file in INPUT_FILES_TRAIN: signal = read_signals(INPUT_FOLDER_TRAIN + input_file) train_signals.append(signal) train_signals = np.transpose(np.array(train_signals), (1, 2, 0)) for input_file in INPUT_FILES_TEST: signal = read_signals(INPUT_FOLDER_TEST + input_file) test_signals.append(signal) test_signals = np.transpose(np.array(test_signals), (1, 2, 0)) LABELFILE_TRAIN = './UCI_HAR/train/y_train.txt' LABELFILE_TEST = './UCI_HAR/test/y_test.txt' train_labels = read_labels(LABELFILE_TRAIN) test_labels = read_labels(LABELFILE_TEST)

We have loaded the training set into a ndarray of size (7352, 128, 9) and the test set into a ndarray of size (2947, 128, 9). As you can guess from the dimensions, the number of signals in the training set is 7352 and the number of signals in the test set is 2947. And each signal in the training and test set has a length of 128 samples and 9 different components.

Below, we will visualize the signal itself with its nine components, the FFT, the PSD and auto-correlation of the components, together with the peaks present in each of the three transformations.

import numpy as np import matplotlib.pyplot as plt def get_values(y_values, T, N, f_s): y_values = y_values x_values = [sample_rate * kk for kk in range(0,len(y_values))] return x_values, y_values #### labels = ['x-component', 'y-component', 'z-component'] colors = ['r', 'g', 'b'] suptitle = "Different signals for the activity: {}" xlabels = ['Time [sec]', 'Freq [Hz]', 'Freq [Hz]', 'Time lag [s]'] ylabel = 'Amplitude' axtitles = [['Acceleration', 'Gyro', 'Total acceleration'], ['FFT acc', 'FFT gyro', 'FFT total acc'], ['PSD acc', 'PSD gyro', 'PSD total acc'], ['Autocorr acc', 'Autocorr gyro', 'Autocorr total acc'] ] list_functions = [get_values, get_fft_values, get_psd_values, get_autocorr_values] N = 128 f_s = 50 t_n = 2.56 T = t_n / N signal_no = 0 signals = train_signals[signal_no, :, :] label = train_labels[signal_no] activity_name = activities_description[label] f, axarr = plt.subplots(nrows=4, ncols=3, figsize=(12,12)) f.suptitle(suptitle.format(activity_name), fontsize=16) for row_no in range(0,4): for comp_no in range(0,9): col_no = comp_no // 3 plot_no = comp_no % 3 color = colors[plot_no] label = labels[plot_no] axtitle = axtitles[row_no][col_no] xlabel = xlabels[row_no] value_retriever = list_functions[row_no] ax = axarr[row_no, col_no] ax.set_title(axtitle, fontsize=16) ax.set_xlabel(xlabel, fontsize=16) if col_no == 0: ax.set_ylabel(ylabel, fontsize=16) signal_component = signals[:, comp_no] x_values, y_values = value_retriever(signal_component, T, N, f_s) ax.plot(x_values, y_values, linestyle='-', color=color, label=label) if row_no > 0: max_peak_height = 0.1 * np.nanmax(y_values) indices_peaks = detect_peaks(y_values, mph=max_peak_height) ax.scatter(x_values[indices_peaks], y_values[indices_peaks], c=color, marker='*', s=60) if col_no == 2: ax.legend(loc='center left', bbox_to_anchor=(1, 0.5)) plt.tight_layout() plt.subplots_adjust(top=0.90, hspace=0.6) plt.show()

That is a lot of code! But if you strip away the parts that give the plot a nice layout, you can see it basically consists of two for-loops which apply the different transformations to the nine components of the signal and subsequently find the peaks in the resulting spectra. Each transformation is plotted as a line-plot and the found peaks are plotted on top of it as a scatter-plot.

The result can be seen in Figure 8.

Figure 8 already shows how to extract features from a signal: transform the signal by means of the FFT, PSD or autocorrelation function, and locate the peaks in the transformation with the peak-finding function.

We have already seen how to do that for one signal, so now we simply need to iterate through all signals in the dataset.

```python
def get_first_n_peaks(x, y, no_peaks=5):
    x_, y_ = list(x), list(y)
    if len(x_) >= no_peaks:
        return x_[:no_peaks], y_[:no_peaks]
    else:
        # pad with zeros if fewer than no_peaks peaks were found
        missing_no_peaks = no_peaks - len(x_)
        return x_ + [0]*missing_no_peaks, y_ + [0]*missing_no_peaks

def get_features(x_values, y_values, mph):
    indices_peaks = detect_peaks(y_values, mph=mph)
    peaks_x, peaks_y = get_first_n_peaks(x_values[indices_peaks], y_values[indices_peaks])
    return peaks_x + peaks_y

def extract_features_labels(dataset, labels, T, N, f_s, denominator):
    percentile = 5
    list_of_features = []
    list_of_labels = []
    for signal_no in range(0, len(dataset)):
        features = []
        list_of_labels.append(labels[signal_no])
        for signal_comp in range(0, dataset.shape[2]):
            signal = dataset[signal_no, :, signal_comp]
            signal_min = np.nanpercentile(signal, percentile)
            signal_max = np.nanpercentile(signal, 100 - percentile)
            # minimum peak height: a fraction of the signal's dynamic range
            mph = signal_min + (signal_max - signal_min)/denominator
            features += get_features(*get_psd_values(signal, T, N, f_s), mph)
            features += get_features(*get_fft_values(signal, T, N, f_s), mph)
            features += get_features(*get_autocorr_values(signal, T, N, f_s), mph)
        list_of_features.append(features)
    return np.array(list_of_features), np.array(list_of_labels)

denominator = 10
X_train, Y_train = extract_features_labels(train_signals, train_labels, T, N, f_s, denominator)
X_test, Y_test = extract_features_labels(test_signals, test_labels, T, N, f_s, denominator)
```

This process results in a matrix containing the features of the training set, and a matrix containing the features of the test set. The number of rows of these matrices should be equal to the number of signals in each set (7352 and 2947).

The number of columns in each matrix depends on your choice of features. Each signal has nine components, and for each component you can calculate either just the FFT or all three of the transformations. For each transformation you can decide to look at the first n peaks, and for each peak you can decide to take only the x-value, or both the x- and y-values. In the example above, we have taken the x- and y-values of the first five peaks of each transformation, so we have 9 * 3 * 5 * 2 = 270 columns in total.
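As a quick sanity check, that column count can be computed explicitly:

```python
no_components = 9        # 3 sensors (acc, gyro, total acc) x 3 axes
no_transformations = 3   # FFT, PSD and autocorrelation
no_peaks = 5             # first five peaks per transformation
values_per_peak = 2      # both the x (frequency) and y (amplitude) value

no_features = no_components * no_transformations * no_peaks * values_per_peak
print(no_features)  # → 270
```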

After constructing the matrices for the training and the test set, together with a list of the correct labels, we can use the scikit-learn package to construct a classifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(n_estimators=1000)
clf.fit(X_train, Y_train)
print("Accuracy on training set is : {}".format(clf.score(X_train, Y_train)))
print("Accuracy on test set is : {}".format(clf.score(X_test, Y_test)))
Y_test_pred = clf.predict(X_test)
print(classification_report(Y_test, Y_test_pred))
```

```
Accuracy on training set is : 1.0
Accuracy on test set is : 0.9097387173396675
             precision    recall  f1-score   support

          1       0.96      0.98      0.97       496
          2       0.94      0.95      0.95       471
          3       0.94      0.90      0.92       420
          4       0.84      0.82      0.83       491
          5       0.86      0.96      0.90       532
          6       0.94      0.85      0.89       537

avg / total       0.91      0.91      0.91      2947
```

As you can see, we were able to classify these signals with quite a high accuracy. The accuracy on the training set is 1.0 and the accuracy on the test set is about 0.91.

To achieve this accuracy we did not even have to break a sweat. The feature selection was done fully automatically; for each transformation we selected the x- and y-values of the first five peaks (or used the default value of zero when fewer than five peaks were present).

It is understandable that some of the 270 features will be more informative than others. It could be that some transformations of some components do not have five peaks, or that the frequency value of the peaks is more informative than the amplitude value, or that the FFT is always more informative than the auto-correlation.

The accuracy will increase even more if we actively select the features, transformations and components which are important for classification. Maybe we can even choose a different classifier or play around with its parameter values (hyperparameter optimization) to achieve a higher accuracy.
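One way to start such a feature selection is to inspect the `feature_importances_` attribute of the fitted Random Forest. A minimal sketch, shown here on synthetic data since it only needs a fitted classifier (in the real setup you would inspect the `clf` fitted on `X_train`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix: 100 signals with 270
# features, where only feature 0 actually carries the class information.
rng = np.random.RandomState(0)
X = rng.randn(100, 270)
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# Rank features by importance; in the real dataset each index maps back to a
# (component, transformation, peak, x-or-y) combination.
top10 = np.argsort(clf.feature_importances_)[::-1][:10]
print(top10)
```

Dropping the features with near-zero importance (or refitting on only the top-ranked ones) is a cheap first experiment before moving on to full hyperparameter optimization.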

The field of stochastic signal analysis provides us with a set of powerful tools which can be used to analyze, model and classify time-series and signals. I hope this blog-post has provided you with some information on how to use these techniques.

If you found this blog useful feel free to share it with other people and with your fellow Data Scientists. If you think something is missing feel free to leave a comment below!

**PS:** Did you notice we did not use a Recurrent Neural Net? What kind of accuracy can you achieve on this dataset with an RNN?

**PS2:** The code is also available as a Jupyter notebook on my GitHub account.


In the previous blog post we have seen how to build Convolutional Neural Networks (CNN) in Tensorflow, by building various CNN architectures (like LeNet5, AlexNet, VGGNet-16) from scratch and training them on the MNIST, CIFAR-10 and Oxflower17 datasets.

It starts to get interesting when you start thinking about the practical applications of CNNs and other Deep Learning methods. If you have been following the latest technical developments you probably know that CNNs are used for face recognition, object detection, analysis of medical images, automatic inspection in manufacturing processes, natural language processing tasks, and many other applications. You could say that you're only limited by your imagination and creativity (and of course motivation, energy and time) to find practical applications for CNNs.

Inspired by Kaggle’s Satellite Imagery Feature Detection challenge, I would like to find out how easy it is to detect features (roads in this particular case) in satellite and aerial images.

If this is possible, the practical applications of it will be enormous. In a time of global urbanisation, cities are expanding, growing and changing continuously. This of course comes along with new infrastructure, new buildings and neighbourhoods, and changing landscapes. Monitoring and keeping track of all of these changes has always been a really labour-intensive job. If we could get fresh satellite images every day and use Deep Learning to immediately update all of our maps, it would be a big help for everyone working in this field!

Developments in the field of Deep Learning are happening so fast that ‘simple’ image classification, which was a big hype a few years ago, already seems outdated. Nowadays object detection has also become mainstream, and in the next (few) years we will probably see more and more applications using image segmentation (see Figure 1).

In this blog we will use image classification to detect roads in aerial images.

To do this, we first need to get these aerial images, and get the data containing information on the location of roads (see Section 2.1).

After that we need to map these two layers on top of each other; we will do this in Section 3.1.

After saving the prepared dataset (Section 4.1) in the right format, we can feed it into our convolutional neural network (Section 4.3).

We will conclude this blog by looking at the accuracy of this method, and discuss which other methods can be used to improve it.

The contents of this blog-post are as follows:

- Introduction
- Getting the Data
- 2.1 Downloading image tiles with owslib
- 2.2 Reading the shapefile containing the roads of the Netherlands.

- Mapping the two layers of data
- 3.2 Visualizing the Mapping results

- Detecting roads with the convolutional neural network
- 4.1 Preparing a training, test and validation dataset
- 4.2 Saving the datasets as a pickle file
- 4.3 Training a convolutional neural network
- 4.5 Resulting Accuracy

- Final Words

**update**: The code is now also available in a notebook on my GitHub repository

The first (and often the most difficult) step in any Data Science project is always obtaining the data. Luckily there are many open datasets containing satellite images in various forms. There is the Landsat dataset, ESA’s Sentinel dataset, MODIS dataset, the NAIP dataset, etc.

Each dataset has different pros and cons. Some, like the NAIP dataset, offer a high resolution (one-meter resolution), but only cover the US. Others, like Landsat, cover the entire earth but have a lower (30-meter) resolution. Some of them show which type of land-cover (forest, water, grassland) there is, others contain atmospheric and climate data.

Since I am from the Netherlands, I would like to use aerial / satellite images covering the Netherlands, so I will use the aerial images provided by PDOK. These are not only quite up to date, but also have an impressive resolution of 25 cm.

Dutch governmental organizations have a lot of open data available which could form the second layer on top of these aerial images. With Pdokviewer you can view a lot of these open datasets online:

- think of layers containing infrastructure: the various types of roads, railways and waterways in the Netherlands (NWB wegenbestand),
- layers containing the boundaries of municipalities and districts,
- physical geographical regions,
- locations of all governmental organizations,
- agricultural areas,
- electricity usage per block,
- type of soil, geomorphological maps,
- number of residents per block,
- surface usage, and so on (even the living habitat of the brown long-eared bat).

So there are many possible datasets you could use as the second layer, in order to automatically detect these types of features in satellite images.

**PS:** Another such site containing a lot of maps is the Atlas Natuurlijk Kapitaal.

The layer that I am interested in is the layer containing the road-types. The map with the road-types (NWB wegenbestand) can be downloaded from the open data portal of the Dutch government. The aerial images are available as a Web Map Service (WMS) and can be downloaded with the Python package owslib. (The NWB wegenbestand data has been moved from its original location, so I have uploaded it to my dropbox. Same goes for the tiles.)

Below is the code for downloading and saving the image tiles containing aerial photographs with the owslib library.

```python
from owslib.wms import WebMapService

URL = "https://geodata.nationaalgeoregister.nl/luchtfoto/rgb/wms?request=GetCapabilities"
wms = WebMapService(URL, version='1.1.1')

OUTPUT_DIRECTORY = './data/image_tiles/'
x_min = 90000
y_min = 427000
dx, dy = 200, 200
no_tiles_x = 100
no_tiles_y = 100
total_no_tiles = no_tiles_x * no_tiles_y
x_max = x_min + no_tiles_x * dx
y_max = y_min + no_tiles_y * dy
BOUNDING_BOX = [x_min, y_min, x_max, y_max]

for ii in range(0, no_tiles_x):
    print(ii)
    for jj in range(0, no_tiles_y):
        # lower-left corner of this tile
        ll_x_ = x_min + ii*dx
        ll_y_ = y_min + jj*dy
        bbox = (ll_x_, ll_y_, ll_x_ + dx, ll_y_ + dy)
        img = wms.getmap(layers=['Actueel_ortho25'],
                         srs='EPSG:28992',
                         bbox=bbox,
                         size=(256, 256),
                         format='image/jpeg',
                         transparent=True)
        filename = "{}_{}_{}_{}.jpg".format(bbox[0], bbox[1], bbox[2], bbox[3])
        out = open(OUTPUT_DIRECTORY + filename, 'wb')
        out.write(img.read())
        out.close()
```

With “dx” and “dy” we can adjust the zoom level of the tiles. A value of 200 corresponds roughly with a zoom level of 12, and a value of 100 corresponds roughly with a zoom level of 13.
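In other words, dx and dy control how many meters of terrain end up in each 256 x 256 pixel tile, and hence the ground resolution of a tile:

```python
tile_size_px = 256  # each tile is requested at 256 x 256 pixels

for dx in (200, 100):
    resolution = dx / tile_size_px  # meters of terrain per pixel
    print("dx = {} m -> {:.3f} m/pixel".format(dx, resolution))
# dx = 200 m -> 0.781 m/pixel
# dx = 100 m -> 0.391 m/pixel
```

Note that the source imagery has a 25 cm resolution, so shrinking dx much further would not add real detail.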

As you can see, the lower-left x and y, and the upper-right x and y coordinates are used in the filename, so we will always know where each tile is located.
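Because the bounding box is encoded in the filename, a tile's location can be recovered later without any lookup table. A small helper for this (hypothetical, not part of the original code) could look like:

```python
def bbox_from_filename(filename):
    # "90000_427000_90200_427200.jpg" -> (90000, 427000, 90200, 427200)
    stem = filename.rsplit('.', 1)[0]
    ll_x, ll_y, ur_x, ur_y = (int(part) for part in stem.split('_'))
    return ll_x, ll_y, ur_x, ur_y

print(bbox_from_filename("90000_427000_90200_427200.jpg"))
# → (90000, 427000, 90200, 427200)
```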

This process results in 10,000 tiles within the bounding box ((90000, 427000), (110000, 447000)). These coordinates are given in the rijksdriehoekscoordinaten reference system, and the box corresponds to the coordinates ((51.82781, 4.44428), (52.00954, 4.73177)) in the WGS 84 reference system.

It covers a few square kilometers in the south of Rotterdam (see Figure 3) and contains both urban and non-urban areas, i.e. there are enough roads for our convolutional neural network to be trained with.

Next we will determine the contents of each tile image, using data from the NWB Wegvakken (version September 2017). This is a file containing all of the roads of the Netherlands, which gets updated frequently. It is possible to download it in the form of a shapefile from this location.

Shapefiles contain shapes with geospatial data and are normally opened with GIS software like ArcGIS or QGIS. It is also possible to open it within Python, by using the pyshp library.

```python
import shapefile
import json

input_filename = './nwb_wegvakken/Wegvakken.shp'
output_filename = './data/nwb_wegvakken/2017_09_wegvakken.json'

reader = shapefile.Reader(input_filename)
fields = reader.fields[1:]
field_names = [field[0] for field in fields]

buffer = []
for sr in reader.shapeRecords():
    atr = dict(zip(field_names, sr.record))
    geom = sr.shape.__geo_interface__
    buffer.append(dict(type="Feature", geometry=geom, properties=atr))

json_file = open(output_filename, "w")
json_file.write(json.dumps({"type": "FeatureCollection", "features": buffer},
                           indent=2, default=JSONencoder) + "\n")
json_file.close()
```

In this code, the list `buffer` contains the contents of the shapefile. Since we don’t want to repeat the same shapefile-reading process every time, we save it in JSON format with `json.dumps()`.

If we try to save the contents of the shapefile as it currently is, it will result in the error `'(...) is not JSON Serializable'`. This is because the shapefile contains datatypes (bytes and datetime objects) which are not natively supported by JSON. So we need to write an extension to the standard JSON serializer which takes the datatypes not supported by JSON and converts them into datatypes which can be serialized. This is what the method `JSONencoder` does. (For more on this see here.)

```python
from datetime import datetime, date

def JSONencoder(obj):
    """JSON serializer for objects not serializable by the default json code."""
    if isinstance(obj, (datetime, date)):
        return obj.isoformat()
    if isinstance(obj, bytes):
        return {'__class__': 'bytes', '__value__': list(obj)}
    raise TypeError("Type %s not serializable" % type(obj))
```

If we look at the contents of this shapefile, we will see that it contains a list of objects of the following type:

```python
{'properties': {'E_HNR_LNKS': 1, 'WEGBEHSRT': 'G', 'EINDKM': None, 'RIJRICHTNG': '',
                'ROUTELTR4': '', 'ROUTENR3': None, 'WEGTYPE': '', 'ROUTENR': None,
                'GME_ID': 717, 'ADMRICHTNG': '', 'WPSNAAMNEN': 'DOMBURG', 'DISTRCODE': 0,
                'WVK_BEGDAT': '1998-01-21', 'WGTYPE_OMS': '', 'WEGDEELLTR': '#',
                'HNRSTRRHTS': '', 'WEGBEHNAAM': 'Veere', 'JTE_ID_BEG': 47197012,
                'DIENSTNAAM': '', 'WVK_ID': 47197071, 'HECTO_LTTR': '#', 'RPE_CODE': '#',
                'L_HNR_RHTS': None, 'ROUTELTR3': '', 'WEGBEHCODE': '717', 'ROUTELTR': '',
                'GME_NAAM': 'Veere', 'L_HNR_LNKS': 11, 'POS_TV_WOL': '', 'BST_CODE': '',
                'BEGINKM': None, 'ROUTENR2': None, 'DISTRNAAM': '', 'ROUTELTR2': '',
                'WEGNUMMER': '', 'ENDAFSTAND': None, 'E_HNR_RHTS': None, 'ROUTENR4': None,
                'BEGAFSTAND': None, 'DIENSTCODE': '', 'STT_NAAM': 'Van Voorthuijsenstraat',
                'WEGNR_AW': '', 'HNRSTRLNKS': 'O', 'JTE_ID_END': 47197131},
 'type': 'Feature',
 'geometry': {'coordinates': [[23615.0, 398753.0], [23619.0, 398746.0],
                              [23622.0, 398738.0], [23634.0, 398692.0]],
              'type': 'LineString'}}
```

It contains a lot of information (what each field means is specified in the manual), but of immediate importance to us are the values of:

- ‘WEGBEHSRT’ –> This indicates the road-type
- ‘coordinates’ –> These are the coordinates of this specific object given in the ‘rijkscoordinaten’ system.

The different road-types present in NWB Wegvakken are:

Now it is time to determine which tiles contain roads, and which tiles do not. We will do this by mapping the contents of NWB-Wegvakken on top of the downloaded aerial photograph tiles.

We can use a Python dictionary to keep track of the mapping. We will use a second dictionary to keep track of the types of road present in each tile.

```python
from collections import defaultdict
import json

# First we define some variables and dictionary keys which are going to be
# used throughout the rest of this section.
dict_roadtype = {
    "G": 'Gemeente',
    "R": 'Rijk',
    "P": 'Provincie',
    "W": 'Waterschap',
    'T': 'Andere wegbeheerder',
    '': 'leeg'
}
dict_roadtype_to_color = {
    "G": 'red',
    "R": 'blue',
    "P": 'green',
    "W": 'magenta',
    'T': 'yellow',
    '': 'leeg'
}
FEATURES_KEY = 'features'
PROPERTIES_KEY = 'properties'
GEOMETRY_KEY = 'geometry'
COORDINATES_KEY = 'coordinates'
WEGSOORT_KEY = 'WEGBEHSRT'
MINIMUM_NO_POINTS_PER_TILE = 4
POINTS_PER_METER = 0.1
INPUT_FOLDER_TILES = './data/image_tiles/'
filename_wegvakken = './data/nwb_wegvakken/2017_09_wegvakken.json'

dict_nwb_wegvakken = json.load(open(filename_wegvakken))[FEATURES_KEY]
d_tile_contents = defaultdict(list)
d_roadtype_tiles = defaultdict(set)
```

In the code above, the contents of NWB Wegvakken, which we previously converted from a shapefile to JSON format, are loaded into `dict_nwb_wegvakken`.

Furthermore, we initialize two defaultdicts. The first one will be filled with the tiles as keys, and a list of their contents as values. The second one will be filled with the road-types as keys, and the tiles containing each road-type as values. How this is done is shown below:

```python
for elem in dict_nwb_wegvakken:
    coordinates = retrieve_coordinates(elem)
    rtype = retrieve_roadtype(elem)
    coordinates_in_bb = [coord for coord in coordinates if coord_is_in_bb(coord, BOUNDING_BOX)]
    if len(coordinates_in_bb) == 1:
        coord = coordinates_in_bb[0]
        add_to_dict(d_tile_contents, d_roadtype_tiles, coord, rtype)
    if len(coordinates_in_bb) > 1:
        add_to_dict(d_tile_contents, d_roadtype_tiles, coordinates_in_bb[0], rtype)
        for ii in range(1, len(coordinates_in_bb)):
            previous_coord = coordinates_in_bb[ii-1]
            coord = coordinates_in_bb[ii]
            add_to_dict(d_tile_contents, d_roadtype_tiles, coord, rtype)
            # also add intermediate points between every pair of subsequent
            # coordinates, one per 10 meters
            dist = eucledian_distance(previous_coord, coord)
            no_intermediate_points = int(dist/10)
            intermediate_coordinates = calculate_intermediate_points(previous_coord, coord,
                                                                     no_intermediate_points)
            for intermediate_coord in intermediate_coordinates:
                add_to_dict(d_tile_contents, d_roadtype_tiles, intermediate_coord, rtype)
```

As you can see, we iterate over the contents of dict_nwb_wegvakken, and for each element, we look up the coordinates and the road-types and check which of these coordinates are inside our bounding box. This is done with the following methods:

```python
def coord_is_in_bb(coord, bb):
    x_min, y_min, x_max, y_max = bb
    return x_min < coord[0] < x_max and y_min < coord[1] < y_max

def retrieve_roadtype(elem):
    return elem[PROPERTIES_KEY][WEGSOORT_KEY]

def retrieve_coordinates(elem):
    return elem[GEOMETRY_KEY][COORDINATES_KEY]

def add_to_dict(d1, d2, coordinates, rtype):
    # Determine the tile a coordinate belongs to by rounding down to the
    # tile's lower-left corner, and construct the tile's filename from it.
    coordinate_ll_x = int((coordinates[0] // dx)*dx)
    coordinate_ll_y = int((coordinates[1] // dy)*dy)
    coordinate_ur_x = int((coordinates[0] // dx)*dx + dx)
    coordinate_ur_y = int((coordinates[1] // dy)*dy + dy)
    tile = "{}_{}_{}_{}.jpg".format(coordinate_ll_x, coordinate_ll_y,
                                    coordinate_ur_x, coordinate_ur_y)
    # relative position of the coordinate inside its tile
    rel_coord_x = (coordinates[0] - coordinate_ll_x) / dx
    rel_coord_y = (coordinates[1] - coordinate_ll_y) / dy
    value = (rtype, rel_coord_x, rel_coord_y)
    d1[tile].append(value)
    d2[rtype].add(tile)
```

As you can see, the add_to_dict method first determines which tile a coordinate belongs to, by computing the four coordinates (lower-left x, y and upper-right x, y) the tile is named with.

We also determine the relative position of each coordinate in its tile. The coordinate (99880, 445120) in tile “99800_445000_100000_445200.jpg” will for example have relative coordinates (0.4, 0.6). This is handy later on when you want to plot the contents of a tile.
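That calculation can be checked in isolation (using the same dx = dy = 200 tile size as before):

```python
dx, dy = 200, 200  # tile size in meters, as used when downloading the tiles

def relative_position(x, y):
    # lower-left corner of the tile containing (x, y)
    ll_x = (x // dx) * dx
    ll_y = (y // dy) * dy
    return (x - ll_x) / dx, (y - ll_y) / dy

print(relative_position(99880, 445120))  # → (0.4, 0.6)
```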

The road-type, together with the relative coordinate are appended to the list of contents of the tile.

At the same time, we add the tilename to the second dictionary which contains a list of tilenames per road-type.

- If there is only one coordinate of an element inside our bounding box, we immediately add this coordinate to our dictionaries.
- If there is more than one coordinate inside the bounding box, we not only add all of these coordinates to the dictionaries, but also calculate all intermediate points between each pair of subsequent coordinates and add those to the dictionaries as well.

This is necessary because two subsequent coordinates could form a line describing a road which crosses a tile, while both coordinates themselves lie outside of that tile. Without the intermediate points we would wrongly conclude that such a tile does not contain any road.

This is illustrated in Figure 4. On the left we see two points describing a road; since both of them lie outside of the tile, the road would go undetected. On the right we also calculate every intermediate point (one every 1/POINTS_PER_METER = 10 meter) between the two points and add the intermediate points to the dictionary.
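The two helper functions used for this, `eucledian_distance` and `calculate_intermediate_points`, are not shown in this post; a minimal sketch of how they could look (linear interpolation between the two points, keeping the original function names, including the spelling of `eucledian_distance`) is:

```python
import math

def eucledian_distance(p1, p2):
    # straight-line distance between two (x, y) coordinates
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def calculate_intermediate_points(p1, p2, no_points):
    # no_points equally spaced coordinates strictly between p1 and p2
    return [(p1[0] + (p2[0] - p1[0]) * ii / (no_points + 1),
             p1[1] + (p2[1] - p1[1]) * ii / (no_points + 1))
            for ii in range(1, no_points + 1)]

print(eucledian_distance((0, 0), (30, 40)))               # → 50.0
print(calculate_intermediate_points((0, 0), (30, 0), 2))  # → [(10.0, 0.0), (20.0, 0.0)]
```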

It is always good to make visualizations. They give us an indication of whether the mapping of points onto the tiles went correctly, whether we have missed any roads, and whether we have chosen enough intermediate points between two coordinates to fully cover all parts of the roads.

```python
import cv2
from scipy import ndimage

fig, axarr = plt.subplots(nrows=11, ncols=11, figsize=(16, 16))
for ii in range(0, 11):
    for jj in range(0, 11):
        ll_x = x0 + ii*dx
        ll_y = y0 + jj*dy
        ur_x = ll_x + dx
        ur_y = ll_y + dy
        tile = "{}_{}_{}_{}.jpg".format(ll_x, ll_y, ur_x, ur_y)
        filename = INPUT_FOLDER_TILES + tile
        tile_contents = d_tile_contents[tile]

        ax = axarr[10-jj, ii]
        image = ndimage.imread(filename)
        rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        ax.imshow(rgb_image)
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        # plot the mapped road points on top of the tile, colored by road-type
        for elem in tile_contents:
            color = dict_roadtype_to_color[elem[0]]
            x = elem[1]*256
            y = (1 - elem[2])*256
            ax.scatter(x, y, c=color, s=10)
plt.subplots_adjust(wspace=0, hspace=0)
plt.show()
```

In Figure 5, we can see two figures, on the left for (x0 = 94400, y0 = 432000) and on the right for (x0 = 93000, y0 = 430000). You can click on them for a larger view.

Next we will load all of the tiles together with their correct labels (road-presence and/or road-type) into a dataset. The dataset is randomized and then split into a training, test and validation part.

```python
import os

image_width = 256
image_height = 256
image_depth = 3
total_no_images = 10000

image_files = os.listdir(INPUT_FOLDER_TILES)
dataset = np.ndarray(shape=(total_no_images, image_width, image_height, image_depth),
                     dtype=np.float32)
labels_roadtype = []
labels_roadpresence = np.ndarray(total_no_images, dtype=np.float32)

for counter, image in enumerate(image_files):
    filename = INPUT_FOLDER_TILES + image
    if image in list(d_tile_contents.keys()):
        tile_contents = d_tile_contents[image]
        roadtypes = sorted(list(set([elem[0] for elem in tile_contents])))
        roadtype = "_".join(roadtypes)
        labels_roadpresence[counter] = 1
    else:
        roadtype = ''
        labels_roadpresence[counter] = 0
    labels_roadtype.append(roadtype)
    image_data = ndimage.imread(filename).astype(np.float32)
    dataset[counter, :, :, :] = image_data

labels_roadtype_ohe = np.array(list(onehot_encode_labels(labels_roadtype)))
dataset, labels_roadpresence, labels_roadtype_ohe = reformat_data(dataset,
                                                                  labels_roadpresence,
                                                                  labels_roadtype_ohe)
```

We can use the following functions to one-hot encode the labels, and randomize the dataset:

```python
def onehot_encode_labels(labels):
    list_possible_labels = list(np.unique(labels))
    encoded_labels = map(lambda x: list_possible_labels.index(x), labels)
    return encoded_labels

def randomize(dataset, labels1, labels2):
    permutation = np.random.permutation(dataset.shape[0])
    randomized_dataset = dataset[permutation, :, :, :]
    randomized_labels1 = labels1[permutation]
    randomized_labels2 = labels2[permutation]
    return randomized_dataset, randomized_labels1, randomized_labels2

def one_hot_encode(np_array, num_unique_labels):
    return (np.arange(num_unique_labels) == np_array[:, None]).astype(np.float32)

def reformat_data(dataset, labels1, labels2):
    dataset, labels1, labels2 = randomize(dataset, labels1, labels2)
    num_unique_labels1 = len(np.unique(labels1))
    num_unique_labels2 = len(np.unique(labels2))
    labels1 = one_hot_encode(labels1, num_unique_labels1)
    labels2 = one_hot_encode(labels2, num_unique_labels2)
    return dataset, labels1, labels2
```
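The `one_hot_encode` function above relies on NumPy broadcasting: a column of labels of shape (n, 1) is compared against the row of possible labels of shape (num_unique_labels,). A quick standalone example of what it produces:

```python
import numpy as np

def one_hot_encode(np_array, num_unique_labels):
    # (n, 1) compared against (num_unique_labels,) broadcasts to (n, num_unique_labels)
    return (np.arange(num_unique_labels) == np_array[:, None]).astype(np.float32)

labels = np.array([0, 1, 1, 0])
print(one_hot_encode(labels, 2))
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]
```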

This whole process of loading the dataset into memory, and especially randomizing the order of the images usually takes a really long time. So after you have done it once, you want to save the result as a pickle file.

```python
import pickle

start_train_dataset = 0
start_valid_dataset = 1200
start_test_dataset = 1600
total_no_images = 2000

output_pickle_file = './data/sattelite_dataset.pickle'
f = open(output_pickle_file, 'wb')
save = {
    'train_dataset': dataset[start_train_dataset:start_valid_dataset, :, :, :],
    'train_labels_roadtype': labels_roadtype[start_train_dataset:start_valid_dataset],
    'train_labels_roadpresence': labels_roadpresence[start_train_dataset:start_valid_dataset],
    'valid_dataset': dataset[start_valid_dataset:start_test_dataset, :, :, :],
    'valid_labels_roadtype': labels_roadtype[start_valid_dataset:start_test_dataset],
    'valid_labels_roadpresence': labels_roadpresence[start_valid_dataset:start_test_dataset],
    'test_dataset': dataset[start_test_dataset:total_no_images, :, :, :],
    'test_labels_roadtype': labels_roadtype[start_test_dataset:total_no_images],
    'test_labels_roadpresence': labels_roadpresence[start_test_dataset:total_no_images],
}
pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
f.close()
print("\nsaved dataset to {}".format(output_pickle_file))
```

Now that we have saved the training, validation and test set in a pickle file, we are finished with the preparation part.

We can load this pickle file into our convolutional neural network and train it to recognize roads.

To train the convolutional neural network to recognize roads, we are going to reuse code from the previous blog post. So if you want to understand how a convolutional neural network actually works, I advise you to take a few minutes and read it.

**PS:** As you have seen, before we can even start with the CNN, we had to do a lot of work getting and preparing the dataset. This also reflects how data science is in real life; 70 to 80 percent of the time goes into getting, understanding and cleaning data. The actual modelling / training of the data is only a small part of the work.

First we load the dataset from the saved pickle file, import VGGNet from the cnn_models module (see GitHub) and some utility functions (to determine the accuracy) and set the values for the learning rate, batch size, etc.

```python
import pickle
import numpy as np
import tensorflow as tf
from cnn_models.vggnet import *
from utils import *

pickle_file = './data/sattelite_dataset.pickle'
f = open(pickle_file, 'rb')
save = pickle.load(f)
train_dataset = save['train_dataset'].astype(dtype=np.float32)
train_labels = save['train_labels_roadpresence'].astype(dtype=np.float32)
test_dataset = save['test_dataset'].astype(dtype=np.float32)
test_labels = save['test_labels_roadpresence'].astype(dtype=np.float32)
valid_dataset = save['valid_dataset'].astype(dtype=np.float32)
valid_labels = save['valid_labels_roadpresence'].astype(dtype=np.float32)
f.close()

num_labels = len(np.unique(train_labels))
num_steps = 501
display_step = 10
batch_size = 16
learning_rate = 0.0001
lambda_loss_amount = 0.0015
```

After that we can construct the Graph containing all of the computational steps of the Convolutional Neural Network and run it. As you can see, we are using the VGGNet-16 Convolutional Neural Network, l2-regularization to minimize the error, and a learning rate of 0.0001.

At each step the training accuracy is appended to train_accuracies, and at every 10th step the test and validation accuracies are appended to similar lists. We will use these later to visualize our accuracies.

```python
train_accuracies, test_accuracies, valid_accuracies = [], [], []

graph = tf.Graph()
with graph.as_default():
    # 1) First we put the input data in a tensorflow-friendly form.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_test_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_test_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_valid_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))

    # 2) Then the weight matrices and bias vectors are initialized.
    variables = variables_vggnet16()

    # 3) The model used to calculate the logits (predicted labels).
    model = model_vggnet16
    logits = model(tf_train_dataset, variables)

    # 4) Then we compute the softmax cross-entropy between the logits and the
    #    (actual) labels, together with an l2-regularization term.
    l2 = lambda_loss_amount * sum(tf.nn.l2_loss(tf_var) for tf_var in tf.trainable_variables())
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels)) + l2

    # 5) The optimizer is used to calculate the gradients of the loss function.
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)

    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, variables))
    valid_prediction = tf.nn.softmax(model(tf_valid_dataset, variables))

with tf.Session(graph=graph) as session:
    test_counter = 0
    tf.global_variables_initializer().run()
    print('Initialized with learning_rate', learning_rate)
    for step in range(num_steps):
        # Since we are using stochastic gradient descent, we select small batches
        # from the training dataset and train the network one batch at a time.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        train_accuracy = accuracy(predictions, batch_labels)
        train_accuracies.append(train_accuracy)

        if step % display_step == 0:
            offset2 = (test_counter * batch_size) % (test_labels.shape[0] - batch_size)
            test_dataset_batch = test_dataset[offset2:(offset2 + batch_size), :, :, :]
            test_labels_batch = test_labels[offset2:(offset2 + batch_size), :]
            feed_dict2 = {tf_test_dataset: test_dataset_batch, tf_test_labels: test_labels_batch}
            test_prediction_ = session.run(test_prediction, feed_dict=feed_dict2)
            test_accuracy = accuracy(test_prediction_, test_labels_batch)
            test_accuracies.append(test_accuracy)

            valid_dataset_batch = valid_dataset[offset2:(offset2 + batch_size), :, :, :]
            valid_labels_batch = valid_labels[offset2:(offset2 + batch_size), :]
            feed_dict3 = {tf_valid_dataset: valid_dataset_batch, tf_valid_labels: valid_labels_batch}
            valid_prediction_ = session.run(valid_prediction, feed_dict=feed_dict3)
            valid_accuracy = accuracy(valid_prediction_, valid_labels_batch)
            valid_accuracies.append(valid_accuracy)

            test_counter += 1  # advance to the next test/validation batch
            message = ("step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, "
                       "accuracy on test set {:02.2f} %, accuracy on valid set {:02.2f} %").format(
                           step, l, train_accuracy, test_accuracy, valid_accuracy)
            print(message)
```

Again, if you do not fully understand what is happening here, you can have a look at the previous blog post in which we looked at building Convolutional Neural Networks in Tensorflow in more detail.

Below we can see the accuracy results of the convolutional neural network.

As you can see, the accuracy for the test as well as the validation set lies around 80 %.

If we look at the tiles which have been classified inaccurately, we can see that most of them are misclassified because it genuinely is difficult to detect the roads in these images.

We have seen how we can detect roads in satellite or aerial images using CNNs. Although it is quite amazing what you can do with Convolutional Neural Networks, the pace of development in the A.I. and Deep Learning world is so fast that using ‘only a CNN’ is already outdated.

- For a few years now there have also been object-detection networks such as R-CNN, Fast R-CNN and Faster R-CNN, as well as single-shot detectors like SSD, YOLO and YOLO9000. These Neural Networks can not only detect the presence of objects in images, but also return the bounding boxes of the objects.
- Nowadays there are also Neural Networks which can perform segmentation tasks (like DeepMask, SharpMask, MultiPath), i.e. they can determine to which object each pixel in the image belongs.

I think these Neural Networks which can perform image segmentation would be ideal to determine the location of roads and other objects inside satellite images. In future blogs, I want to have a look at how we can detect roads (or other features) in satellite and aerial images using these types of Neural Networks.

**PS:** If you are interested in the latest developments in Computer Vision, I can recommend reading ‘A Year in Computer Vision’.


In the past I have mostly written about ‘classical’ Machine Learning, like Naive Bayes classification, Logistic Regression, and the Perceptron algorithm. In the past year I have also worked with Deep Learning techniques, and I would like to share with you how to make and train a Convolutional Neural Network from scratch, using tensorflow. Later on we can use this knowledge as a building block to make interesting Deep Learning applications.

For this you will need to have tensorflow installed (see installation instructions) and you should also have a basic understanding of Python programming and the theory behind Convolutional Neural Networks. After you have installed tensorflow, you can run the smaller Neural Networks without GPU, but for the deeper networks you will definitely need some GPU power.

The Internet is full of awesome websites and courses which explain how a convolutional neural network works. Some of them have good visualisations which make the concepts easy to understand. I don’t feel the need to explain the same things again, so before you continue, make sure you understand how a convolutional neural network works. For example,

- What is a convolutional layer, and what is the filter of this convolutional layer?
- What is an activation layer: ReLU (the most widely used), sigmoid, or tanh?
- What is a pooling layer (max pooling / average pooling), dropout?
- How does Stochastic Gradient Descent work?
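As a refresher on that last point, a single Stochastic Gradient Descent step can be sketched in plain numpy. This is an illustrative helper of my own (for a 1-layer softmax classifier, the kind of network we build below), not code from any library:

```python
import numpy as np

def sgd_step(weights, bias, batch_data, batch_labels, learning_rate=0.5):
    """One SGD step for a 1-layer softmax classifier on a small batch."""
    logits = batch_data @ weights + bias
    #softmax, shifted for numerical stability
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    #gradient of the cross-entropy loss with respect to the logits
    grad_logits = (probs - batch_labels) / len(batch_data)
    #backpropagate to the weights and bias, then take a step downhill
    weights = weights - learning_rate * (batch_data.T @ grad_logits)
    bias = bias - learning_rate * grad_logits.sum(axis=0)
    return weights, bias
```

Repeating this step with a different small batch each time is exactly what the training loops later in this post do, only there tensorflow computes the gradients for us.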

The contents of this blog-post is as follows:

- Tensorflow basics:
- 1.1 Constants and Variables
- 1.2 Tensorflow Graphs and Sessions
- 1.3 Placeholders and feed_dicts

- Neural Networks in Tensorflow
- 2.1 Introduction
- 2.2 Loading in the data
- 2.3 Creating a (simple) 1-layer Neural Network:
- 2.4 The many faces of Tensorflow
- 2.5 Creating the LeNet5 CNN
- 2.6 How the parameters affect the output size of a layer
- 2.7 Adjusting the LeNet5 architecture
- 2.8 Impact of Learning Rate and Optimizer

- Deep Neural Networks in Tensorflow
- 3.1 AlexNet
- 3.2 VGG Net-16
- 3.3 AlexNet Performance

- Final words

Here I will give a short introduction to Tensorflow for people who have never worked with it before. If you want to start building Neural Networks immediately, or you are already familiar with Tensorflow, you can go ahead and skip to section 2. If you would like to know more about Tensorflow, you can also have a look at this repository, or the notes of lecture 1 and lecture 2 of Stanford’s CS20SI course.

The most basic units within tensorflow are Constants, Variables and Placeholders.

The difference between a tf.constant() and a tf.Variable() should be clear: a constant has a constant value and once you set it, it cannot be changed. The value of a Variable can be changed after it has been set, but the type and shape of the Variable cannot be changed.

```python
#We can create constants and variables of different types.
#However, the different types do not mix well together.
a = tf.constant(2, tf.int16)
b = tf.constant(4, tf.float32)
c = tf.constant(8, tf.float32)

d = tf.Variable(2, dtype=tf.int16)
e = tf.Variable(4, dtype=tf.float32)
f = tf.Variable(8, dtype=tf.float32)

#we can perform computations on variables of the same type:
e + f

#but the following can not be done:
# d + e

#everything in tensorflow is a tensor, and tensors can have different dimensions:
#0D, 1D, 2D, 3D, 4D or nD-tensors
g = tf.constant(np.zeros(shape=(2,2), dtype=np.float32)) #does work
h = tf.zeros([11], tf.int16)
i = tf.ones([2,2], tf.float32)
j = tf.zeros([1000,4,3], tf.float64)

k = tf.Variable(tf.zeros([2,2], tf.float32))
l = tf.Variable(tf.zeros([5,6,5], tf.float32))
```

Besides tf.zeros() and tf.ones(), which create a Tensor initialized to zeros or ones (see here), there is also the tf.random_normal() function, which creates a tensor filled with values picked randomly from a normal distribution (the default distribution has a mean of 0.0 and stddev of 1.0).

There is also the tf.truncated_normal() function, which creates a Tensor with values randomly picked from a normal distribution, where values more than two standard deviations from the mean are dropped and re-picked.
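To make the truncation concrete, here is a small numpy sketch of the same idea: values further than two standard deviations from the mean are dropped and re-drawn. The function name and implementation are my own illustration, not tensorflow’s internals:

```python
import numpy as np

def truncated_normal(shape, mean=0.0, stddev=1.0, seed=None):
    """Sample a normal distribution, re-drawing values beyond two standard deviations."""
    rng = np.random.default_rng(seed)
    values = rng.normal(mean, stddev, size=shape)
    out_of_range = np.abs(values - mean) > 2 * stddev
    while out_of_range.any():
        #re-draw only the out-of-range values until all of them fall within the limits
        values[out_of_range] = rng.normal(mean, stddev, size=out_of_range.sum())
        out_of_range = np.abs(values - mean) > 2 * stddev
    return values
```

This re-drawing keeps initial weights close to the mean, which avoids the occasional large initial weight a plain normal distribution would produce.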

With this knowledge, we can already create weight matrices and bias vectors which can be used in a neural network.

```python
weights = tf.Variable(tf.truncated_normal([256 * 256, 10]))
biases = tf.Variable(tf.zeros([10]))

print(weights.get_shape().as_list())
print(biases.get_shape().as_list())

>>> [65536, 10]
>>> [10]
```

In Tensorflow, all of the different Variables and the operations performed on these Variables are saved in a Graph. After you have built a Graph which contains all of the computational steps necessary for your model, you can run this Graph within a Session. This Session then distributes all of the computations across the available CPU and GPU resources.

```python
graph = tf.Graph()
with graph.as_default():
    f = tf.Variable(8, dtype=tf.int32)
    k = tf.Variable(tf.zeros([2,2], tf.float32))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print(f)
    print(session.run(f))
    print(session.run(k))

>>> <tf.Variable 'Variable_2:0' shape=() dtype=int32_ref>
>>> 8
>>> [[ 0.  0.]
>>>  [ 0.  0.]]
```

We have seen the various forms in which we can create constants and variables. Tensorflow also has placeholders; these do not require an initial value and only serve to allocate the necessary amount of memory. During a session, these placeholders can be filled with (external) data via a *feed_dict*.

Below is an example of the usage of a placeholder.

```python
list_of_points1_ = [[1,2], [3,4], [5,6], [7,8]]
list_of_points2_ = [[15,16], [13,14], [11,12], [9,10]]

list_of_points1 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points1_])
list_of_points2 = np.array([np.array(elem).reshape(1,2) for elem in list_of_points2_])

graph = tf.Graph()
with graph.as_default():
    #we should use a tf.placeholder() to create a variable whose value you will fill in later (during session.run()).
    #this can be done by 'feeding' the data into the placeholder.
    #below we see an example of a method which uses two placeholders of shape [1,2] to calculate the euclidean distance
    point1 = tf.placeholder(tf.float32, shape=(1, 2))
    point2 = tf.placeholder(tf.float32, shape=(1, 2))

    def calculate_euclidean_distance(point1, point2):
        difference = tf.subtract(point1, point2)
        power2 = tf.pow(difference, tf.constant(2.0, shape=(1,2)))
        add = tf.reduce_sum(power2)
        euclidean_distance = tf.sqrt(add)
        return euclidean_distance

    dist = calculate_euclidean_distance(point1, point2)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    for ii in range(len(list_of_points1)):
        point1_ = list_of_points1[ii]
        point2_ = list_of_points2[ii]
        feed_dict = {point1 : point1_, point2 : point2_}
        distance = session.run([dist], feed_dict=feed_dict)
        print("the distance between {} and {} -> {}".format(point1_, point2_, distance))

>>> the distance between [[1 2]] and [[15 16]] -> [19.79899]
>>> the distance between [[3 4]] and [[13 14]] -> [14.142136]
>>> the distance between [[5 6]] and [[11 12]] -> [8.485281]
>>> the distance between [[7 8]] and [[ 9 10]] -> [2.8284271]
```

The graph containing the Neural Network (illustrated in the image above) should contain the following steps:

- The **input datasets**: the training dataset and labels, the test dataset and labels (and the validation dataset and labels). The test and validation datasets can be placed inside a tf.constant(), while the training dataset is placed in a tf.placeholder() so that it can be fed in batches during the training (stochastic gradient descent).
- The Neural Network **model** with all of its layers. This can be a simple fully connected neural network consisting of only 1 layer, or a more complicated neural network consisting of 5, 9, 16, etc. layers.
- The **weight** matrices and **bias** vectors, defined in the proper shape and initialized to their initial values. (One weight matrix and bias vector per layer.)
- The **loss** value: the model outputs the logit vector (estimated training labels), and by comparing the logits with the actual labels we can calculate the loss value (with the softmax with cross-entropy function). The loss value is an indication of how close the estimated training labels are to the actual training labels, and will be used to update the weight values.
- An **optimizer**, which will use the calculated loss value to update the weights and biases with backpropagation.
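For intuition about the loss step, the softmax-with-cross-entropy computation can be sketched in plain numpy. This is a simplified stand-in for tf.nn.softmax_cross_entropy_with_logits (assuming one-hot labels), not its actual implementation:

```python
import numpy as np

def softmax_cross_entropy_with_logits(logits, labels):
    """Per-example cross-entropy between the softmax of the logits and the one-hot labels."""
    #subtract the per-row maximum for numerical stability; softmax is unchanged by this shift
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_softmax).sum(axis=1)
```

Taking the mean of these per-example values (tf.reduce_mean in the code below) gives the scalar loss value that the optimizer minimizes.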

Let’s load the datasets which are going to be used to train and test the Neural Networks. For this we will download the MNIST and the CIFAR-10 datasets. The MNIST dataset contains 60,000 training images of handwritten digits, where each image has size 28 x 28 x 1 (grayscale). The CIFAR-10 dataset contains 60,000 colour images (3 channels) – size 32 x 32 x 3 – of 10 different objects (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Since there are 10 different objects in each dataset, both datasets contain 10 labels.

First, let’s define some methods which are convenient for loading and reshaping the data into the necessary format.

```python
def randomize(dataset, labels):
    permutation = np.random.permutation(labels.shape[0])
    shuffled_dataset = dataset[permutation, :, :]
    shuffled_labels = labels[permutation]
    return shuffled_dataset, shuffled_labels

def one_hot_encode(np_array):
    return (np.arange(10) == np_array[:,None]).astype(np.float32)

def reformat_data(dataset, labels, image_width, image_height, image_depth):
    np_dataset_ = np.array([np.array(image_data).reshape(image_width, image_height, image_depth)
                            for image_data in dataset])
    np_labels_ = one_hot_encode(np.array(labels, dtype=np.float32))
    np_dataset, np_labels = randomize(np_dataset_, np_labels_)
    return np_dataset, np_labels

def flatten_tf_array(array):
    shape = array.get_shape().as_list()
    return tf.reshape(array, [shape[0], shape[1] * shape[2] * shape[3]])

def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1)) / predictions.shape[0])
```

These methods one-hot encode the labels, load the data into a randomized array, and flatten an array (since a fully connected network needs a flat array as its input).

After we have defined these necessary functions, we can load the MNIST and CIFAR-10 datasets with:

```python
mnist_folder = './data/mnist/'
mnist_image_width = 28
mnist_image_height = 28
mnist_image_depth = 1
mnist_num_labels = 10

mndata = MNIST(mnist_folder)
mnist_train_dataset_, mnist_train_labels_ = mndata.load_training()
mnist_test_dataset_, mnist_test_labels_ = mndata.load_testing()

mnist_train_dataset, mnist_train_labels = reformat_data(mnist_train_dataset_, mnist_train_labels_, mnist_image_width, mnist_image_height, mnist_image_depth)
mnist_test_dataset, mnist_test_labels = reformat_data(mnist_test_dataset_, mnist_test_labels_, mnist_image_width, mnist_image_height, mnist_image_depth)

print("There are {} images, each of size {}".format(len(mnist_train_dataset), len(mnist_train_dataset[0])))
print("Meaning each image has the size of 28*28*1 = {}".format(mnist_image_width*mnist_image_height*1))
print("The training set contains the following {} labels: {}".format(len(np.unique(mnist_train_labels_)), np.unique(mnist_train_labels_)))

print('Training set shape', mnist_train_dataset.shape, mnist_train_labels.shape)
print('Test set shape', mnist_test_dataset.shape, mnist_test_labels.shape)

train_dataset_mnist, train_labels_mnist = mnist_train_dataset, mnist_train_labels
test_dataset_mnist, test_labels_mnist = mnist_test_dataset, mnist_test_labels

######################################################################################

cifar10_folder = './data/cifar10/'
train_datasets = ['data_batch_1', 'data_batch_2', 'data_batch_3', 'data_batch_4', 'data_batch_5']
test_dataset = ['test_batch']
c10_image_height = 32
c10_image_width = 32
c10_image_depth = 3
c10_num_labels = 10

with open(cifar10_folder + test_dataset[0], 'rb') as f0:
    c10_test_dict = pickle.load(f0, encoding='bytes')

c10_test_dataset, c10_test_labels = c10_test_dict[b'data'], c10_test_dict[b'labels']
test_dataset_cifar10, test_labels_cifar10 = reformat_data(c10_test_dataset, c10_test_labels, c10_image_width, c10_image_height, c10_image_depth)

c10_train_dataset, c10_train_labels = [], []
for train_dataset in train_datasets:
    with open(cifar10_folder + train_dataset, 'rb') as f0:
        c10_train_dict = pickle.load(f0, encoding='bytes')
        c10_train_dataset_, c10_train_labels_ = c10_train_dict[b'data'], c10_train_dict[b'labels']
        c10_train_dataset.append(c10_train_dataset_)
        c10_train_labels += c10_train_labels_

c10_train_dataset = np.concatenate(c10_train_dataset, axis=0)
train_dataset_cifar10, train_labels_cifar10 = reformat_data(c10_train_dataset, c10_train_labels, c10_image_width, c10_image_height, c10_image_depth)
del c10_train_dataset
del c10_train_labels

print("The training set contains the following labels: {}".format(np.unique(c10_train_dict[b'labels'])))
print('Training set shape', train_dataset_cifar10.shape, train_labels_cifar10.shape)
print('Test set shape', test_dataset_cifar10.shape, test_labels_cifar10.shape)
```

You can download the MNIST dataset from Yann LeCun’s website. After you have downloaded and unzipped the files, you can load the data with the python-mnist tool. CIFAR-10 can be downloaded from here.

The simplest form of a Neural Network is a 1-layer linear Fully Connected Neural Network (FCNN). Mathematically it consists of a matrix multiplication.

It is best to start with such a simple NN in tensorflow, and later on look at the more complicated Neural Networks. When we start looking at these more complicated Neural Networks, only the model (step 2) and weights (step 3) part of the Graph will change and the other steps will remain the same.

We can make such a 1-layer FCNN as follows:

```python
image_width = mnist_image_width
image_height = mnist_image_height
image_depth = mnist_image_depth
num_labels = mnist_num_labels

#the dataset
train_dataset = mnist_train_dataset
train_labels = mnist_train_labels
test_dataset = mnist_test_dataset
test_labels = mnist_test_labels

#number of iterations and learning rate
num_steps = 10001
display_step = 1000
learning_rate = 0.5

graph = tf.Graph()
with graph.as_default():
    #1) First we put the input data in a tensorflow friendly form.
    #In this simple example the full training set fits in memory, so we place it in a tf.constant().
    tf_train_dataset = tf.constant(train_dataset, tf.float32)
    tf_train_labels = tf.constant(train_labels, tf.float32)
    tf_test_dataset = tf.constant(test_dataset, tf.float32)

    #2) Then, the weight matrices and bias vectors are initialized
    #as a default, tf.truncated_normal() is used for the weight matrix and tf.zeros() is used for the bias vector.
    weights = tf.Variable(tf.truncated_normal([image_width * image_height * image_depth, num_labels]), tf.float32)
    bias = tf.Variable(tf.zeros([num_labels]), tf.float32)

    #3) define the model:
    #a 1-layer FCNN simply consists of a matrix multiplication
    def model(data, weights, bias):
        return tf.matmul(flatten_tf_array(data), weights) + bias

    logits = model(tf_train_dataset, weights, bias)

    #4) calculate the loss, which will be used in the optimization of the weights
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

    #5) Choose an optimizer. Many are available.
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    #6) The predicted values for the images in the train dataset and test dataset are assigned to
    #the variables train_prediction and test_prediction.
    #It is only necessary if you want to know the accuracy by comparing it with the actual values.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, weights, bias))

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        _, l, predictions = session.run([optimizer, loss, train_prediction])
        if (step % display_step == 0):
            train_accuracy = accuracy(predictions, train_labels[:, :])
            test_accuracy = accuracy(test_prediction.eval(), test_labels)
            message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
            print(message)
```

```
>>> Initialized
>>> step 0000 : loss is 2349.55, accuracy on training set 10.43 %, accuracy on test set 34.12 %
>>> step 0100 : loss is 3612.48, accuracy on training set 89.26 %, accuracy on test set 90.15 %
>>> step 0200 : loss is 2634.40, accuracy on training set 91.10 %, accuracy on test set 91.26 %
>>> step 0300 : loss is 2109.42, accuracy on training set 91.62 %, accuracy on test set 91.56 %
>>> step 0400 : loss is 2093.56, accuracy on training set 91.85 %, accuracy on test set 91.67 %
>>> step 0500 : loss is 2325.58, accuracy on training set 91.83 %, accuracy on test set 91.67 %
>>> step 0600 : loss is 22140.44, accuracy on training set 68.39 %, accuracy on test set 75.06 %
>>> step 0700 : loss is 5920.29, accuracy on training set 83.73 %, accuracy on test set 87.76 %
>>> step 0800 : loss is 9137.66, accuracy on training set 79.72 %, accuracy on test set 83.33 %
>>> step 0900 : loss is 15949.15, accuracy on training set 69.33 %, accuracy on test set 77.05 %
>>> step 1000 : loss is 1758.80, accuracy on training set 92.45 %, accuracy on test set 91.79 %
```

This is all there is to it! Inside the Graph, we load the data, define the weight matrices and the model, calculate the loss value from the logit vector, and pass this to the optimizer, which will update the weights for ‘num_steps’ iterations.

In the above fully connected NN, we have used the Gradient Descent Optimizer for optimizing the weights. However, there are many different optimizers available in tensorflow. The most commonly used optimizers are the GradientDescentOptimizer, AdamOptimizer and AdaGradOptimizer, so I would suggest starting with these if you’re building a CNN.

Sebastian Ruder has a nice blog post explaining the differences between the different optimizers which you can read if you want to know more about them.
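To get a feel for what distinguishes these optimizers, their update rules can be sketched in plain numpy. These are illustrative helpers written for this comparison (not the tensorflow implementations): plain gradient descent takes a fixed-size step against the gradient, while Adam keeps running averages of the gradient and its square to adapt the step size per parameter:

```python
import numpy as np

def gradient_descent_update(w, grad, state, lr=0.1):
    """Plain gradient descent: a fixed-size step against the gradient."""
    return w - lr * grad, state

def adam_update(w, grad, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: running averages of the gradient (m) and the squared gradient (v)."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   #bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)
```

Running either update repeatedly on, say, the gradient of f(w) = w² (which is 2w) drives w towards the minimum at 0; the difference shows up on problems where gradient magnitudes vary wildly between parameters, which is where Adam’s per-parameter scaling helps.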

Tensorflow contains many layers, meaning the same operations can be done at different levels of abstraction. To give a simple example, the operation

`logits = tf.matmul(tf_train_dataset, weights) + biases`

can also be achieved with

`logits = tf.nn.xw_plus_b(train_dataset, weights, biases)`.

This is most visible in the layers API, which offers a high level of abstraction and makes it very easy to create Neural Networks consisting of many different layers. For example, the conv_2d() or fully_connected() functions create convolutional and fully connected layers. With these functions, the number of layers, filter sizes/depths, type of activation function, etc. can be specified as parameters. The weight and bias matrices are then created automatically, as well as the additional activation functions and dropout regularization layers.

For example, with the layers API, the following lines:

```python
import tensorflow as tf

w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
b1 = tf.Variable(tf.zeros([filter_depth]))

layer1_conv = tf.nn.conv2d(data, w1, [1, 1, 1, 1], padding='SAME')
layer1_relu = tf.nn.relu(layer1_conv + b1)
layer1_pool = tf.nn.max_pool(layer1_relu, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
```

can be replaced with

```python
from tflearn.layers.conv import conv_2d, max_pool_2d

layer1_conv = conv_2d(data, filter_depth, filter_size, activation='relu')
layer1_pool = max_pool_2d(layer1_conv, 2, strides=2)
```

As you can see, we don’t need to define the weights, biases or activation functions. Especially when you’re building a neural network with many layers, this keeps the code succinct and clean.

However, if you’re just starting out with tensorflow and want to learn how to build different kinds of Neural Networks, it is not ideal, since we’re letting tflearn do all the work.

Therefore we will not use the layers API in this blog-post, but I do recommend you use it once you have a full understanding of how a neural network should be built in tensorflow.

Let’s start with building a Neural Network with more layers: the LeNet5 Convolutional Neural Network.

The LeNet5 CNN architecture was proposed by Yann LeCun as early as 1998 (see paper). It is one of the earliest CNNs (maybe even the first?) and was specifically designed to classify handwritten digits. Although it performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, the performance drops on other datasets with more images, a larger resolution (larger image size) and more classes. For these larger datasets, deeper ConvNets (like AlexNet, VGGNet or ResNet) will perform better.

But since the LeNet5 architecture only consists of 5 layers, it is a good starting point for learning how to build CNN’s.

The Lenet5 architecture looks as follows:

As we can see, it consists of 5 layers:

- **layer 1**: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
- **layer 2**: a convolutional layer, with a sigmoid activation function, followed by an average pooling layer.
- **layer 3**: a fully connected network (sigmoid activation)
- **layer 4**: a fully connected network (sigmoid activation)
- **layer 5**: the output layer

This means that we need to create 5 weight and bias matrices, and our model will consist of 12 lines of code (5 layers + 2 pooling + 4 activation functions + 1 flatten layer).

Since this is quite some code, it is best to define it in a separate function outside of the graph.

```python
LENET5_BATCH_SIZE = 32
LENET5_PATCH_SIZE = 5
LENET5_PATCH_DEPTH_1 = 6
LENET5_PATCH_DEPTH_2 = 16
LENET5_NUM_HIDDEN_1 = 120
LENET5_NUM_HIDDEN_2 = 84

def variables_lenet5(patch_size = LENET5_PATCH_SIZE, patch_depth1 = LENET5_PATCH_DEPTH_1,
                     patch_depth2 = LENET5_PATCH_DEPTH_2,
                     num_hidden1 = LENET5_NUM_HIDDEN_1, num_hidden2 = LENET5_NUM_HIDDEN_2,
                     image_depth = 1, num_labels = 10):

    w1 = tf.Variable(tf.truncated_normal([patch_size, patch_size, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))

    w2 = tf.Variable(tf.truncated_normal([patch_size, patch_size, patch_depth1, patch_depth2], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))

    w3 = tf.Variable(tf.truncated_normal([5*5*patch_depth2, num_hidden1], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape = [num_hidden1]))

    w4 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape = [num_hidden2]))

    w5 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape = [num_labels]))

    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
    }
    return variables

def model_lenet5(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.sigmoid(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='VALID')
    layer2_actv = tf.sigmoid(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    flat_layer = flatten_tf_array(layer2_pool)
    layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
    layer3_actv = tf.nn.sigmoid(layer3_fccd)

    layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
    layer4_actv = tf.nn.sigmoid(layer4_fccd)

    logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
    return logits
```

With the variables and model defined separately, we can adjust the graph a little bit so that it uses these weights and this model instead of the previous Fully Connected NN:

```python
#parameters determining the model size
image_width = mnist_image_width
image_height = mnist_image_height
image_depth = mnist_image_depth
num_labels = mnist_num_labels

#the datasets
train_dataset = mnist_train_dataset
train_labels = mnist_train_labels
test_dataset = mnist_test_dataset
test_labels = mnist_test_labels

#number of iterations, batch size and learning rate
num_steps = 10001
display_step = 1000
learning_rate = 0.001
batch_size = LENET5_BATCH_SIZE

graph = tf.Graph()
with graph.as_default():
    #1) First we put the input data in a tensorflow friendly form.
    tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_width, image_height, image_depth))
    tf_train_labels = tf.placeholder(tf.float32, shape = (batch_size, num_labels))
    tf_test_dataset = tf.constant(test_dataset, tf.float32)

    #2) Then, the weight matrices and bias vectors are initialized
    variables = variables_lenet5(image_depth = image_depth, num_labels = num_labels)

    #3) The model used to calculate the logits (predicted labels)
    model = model_lenet5
    logits = model(tf_train_dataset, variables)

    #4) then we compute the softmax cross entropy between the logits and the (actual) labels
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=tf_train_labels))

    #5) The optimizer is used to calculate the gradients of the loss function
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)

    #Predictions for the training and test data.
    train_prediction = tf.nn.softmax(logits)
    test_prediction = tf.nn.softmax(model(tf_test_dataset, variables))
```

```python
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized with learning_rate', learning_rate)
    for step in range(num_steps):
        #Since we are using stochastic gradient descent, we are selecting small batches from the training dataset,
        #and training the convolutional neural network each time with a batch.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run([optimizer, loss, train_prediction], feed_dict=feed_dict)
        if step % display_step == 0:
            train_accuracy = accuracy(predictions, batch_labels)
            test_accuracy = accuracy(test_prediction.eval(), test_labels)
            message = "step {:04d} : loss is {:06.2f}, accuracy on training set {:02.2f} %, accuracy on test set {:02.2f} %".format(step, l, train_accuracy, test_accuracy)
            print(message)
```

```
>>> Initialized with learning_rate 0.1
>>> step 0000 : loss is 002.49, accuracy on training set 3.12 %, accuracy on test set 10.09 %
>>> step 1000 : loss is 002.29, accuracy on training set 21.88 %, accuracy on test set 9.58 %
>>> step 2000 : loss is 000.73, accuracy on training set 75.00 %, accuracy on test set 78.20 %
>>> step 3000 : loss is 000.41, accuracy on training set 81.25 %, accuracy on test set 86.87 %
>>> step 4000 : loss is 000.26, accuracy on training set 93.75 %, accuracy on test set 90.49 %
>>> step 5000 : loss is 000.28, accuracy on training set 87.50 %, accuracy on test set 92.79 %
>>> step 6000 : loss is 000.23, accuracy on training set 96.88 %, accuracy on test set 93.64 %
>>> step 7000 : loss is 000.18, accuracy on training set 90.62 %, accuracy on test set 95.14 %
>>> step 8000 : loss is 000.14, accuracy on training set 96.88 %, accuracy on test set 95.80 %
>>> step 9000 : loss is 000.35, accuracy on training set 90.62 %, accuracy on test set 96.33 %
>>> step 10000 : loss is 000.12, accuracy on training set 93.75 %, accuracy on test set 96.76 %
```

As we can see, the LeNet5 architecture performs better on the MNIST dataset than a simple fully connected NN.

Generally it is true that the more layers a Neural Network has, the better it performs. We can add more layers, change activation functions and pooling layers, change the learning rate, and see how each step affects the performance. Since the input of layer *l* is the output of layer *l-1*, we need to know how the output size of a layer is affected by its different parameters.

To understand this, let’s have a look at the conv2d() function.

It has four parameters:

- The input image, a 4D Tensor with dimensions [batch size, image_width, image_height, image_depth]
- A weight matrix, a 4D Tensor with dimensions [filter_size, filter_size, image_depth, filter_depth]
- The number of strides in each dimension.
- Padding (= ‘SAME’ / ‘VALID’)

These four parameters determine the size of the output image.

The first two parameters are the 4-D Tensor containing the batch of input images and the 4-D Tensor containing the weights of the convolutional filter.

The third parameter is the stride of the convolution, i.e. how many positions the convolutional filter should skip in each of the four dimensions. The first of these four dimensions indicates the image number in the batch of images, and since we don’t want to skip over any image, this is always 1. The last dimension indicates the image depth (number of color channels; 1 for grayscale and 3 for RGB), and since we don’t want to skip over any color channels, this is also always 1. The second and third dimensions indicate the stride in the X and Y direction (image width and height). If we want to apply a stride, these are the dimensions in which the filter should skip positions. So for a stride of 1 we set the stride parameter to [1, 1, 1, 1], and for a stride of 2 we set it to [1, 2, 2, 1], etc.

The last parameter indicates whether or not tensorflow should zero-pad the image, so that the output size does not change for a stride of 1. With padding = ‘SAME’ the image does get zero-padded (and the output size does not change); with padding = ‘VALID’ it does not.

Below we can see two examples of a convolutional filter (with filter size 5 x 5) scanning through an image (of size 28 x 28).

On the left the padding parameter is set to ‘SAME’, the image is zero-padded and the last 4 rows / columns are included in the output image.

On the right padding is set to ‘VALID’, the image does not get zero-padded and the last 4 rows/columns are not included.

As we can see, without zero-padding the last four cells are not included, because the convolutional filter has reached the end of the (non-zero padded) image. This means that, for an input size of 28 x 28, the output size becomes 24 x 24. If padding = ‘SAME’, the output size is 28 x 28.

This becomes clearer if we write down the positions of the filter on the image while it is scanning through it (for simplicity, only in the X-direction). With a stride of 1, the X-positions are 0-5, 1-6, 2-7, etc. With a stride of 2, the X-positions are 0-5, 2-7, 4-9, etc.
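These positions can be generated with a short helper function (an illustrative sketch; the function name is my own, not part of TensorFlow):

```python
def filter_positions(image_size, filter_size, stride):
    """Return the (start, end) X-positions of a filter scanning an
    image without zero-padding (i.e. padding = 'VALID')."""
    return [(x, x + filter_size)
            for x in range(0, image_size - filter_size + 1, stride)]

# A 5 x 5 filter on a 28-wide image:
print(filter_positions(28, 5, 1)[:3])  # stride 1: (0, 5), (1, 6), (2, 7), ...
print(filter_positions(28, 5, 2)[:3])  # stride 2: (0, 5), (2, 7), (4, 9), ...
```

Counting the positions also reproduces the output sizes of the table below: 24 positions for a stride of 1, 12 for a stride of 2, and so on.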

If we do this for an image size of 28 x 28, filter size of 5 x 5 and strides 1 to 4, we will get the following table:

As you can see, for a stride of 1 and zero-padding, the output image size is 28 x 28. Without zero-padding the output image size becomes 24 x 24. For a filter with a stride of 2, these numbers are 14 x 14 and 12 x 12, and for a filter with a stride of 3 they are 10 x 10 and 8 x 8, etc.

For an arbitrarily chosen stride S, filter size K, image size W, and padding size P, the output size will be O = 1 + (W - K + 2P) / S.

If padding = ‘SAME’ in TensorFlow, enough zero-padding P is added so that the output size no longer depends on the filter size K; it is determined only by the stride S (the output size equals W / S, rounded up).
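This rule can be sketched as a small helper function (a sketch of the size arithmetic, not TensorFlow's actual implementation):

```python
import math

def conv_output_size(W, K, S, padding):
    """Output size of a convolution with image size W, filter size K,
    stride S, and padding 'SAME' or 'VALID'."""
    if padding == 'SAME':
        # TensorFlow pads with zeros so that only the stride matters
        return math.ceil(W / S)
    elif padding == 'VALID':
        # No padding: the filter stops at the image border
        return (W - K) // S + 1
    raise ValueError("padding must be 'SAME' or 'VALID'")

# Reproduces the table above for W = 28, K = 5:
print(conv_output_size(28, 5, 1, 'SAME'))   # 28
print(conv_output_size(28, 5, 1, 'VALID'))  # 24
print(conv_output_size(28, 5, 2, 'SAME'))   # 14
print(conv_output_size(28, 5, 2, 'VALID'))  # 12
```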

In the original paper, a sigmoid activation function and average pooling were used in the LeNet5 architecture. However, nowadays, it is much more common to use a relu activation function. So let’s change the LeNet5 CNN a little bit to see if we can improve its accuracy. We will call this the LeNet5-like Architecture:

```python
LENET5_LIKE_BATCH_SIZE = 32
LENET5_LIKE_FILTER_SIZE = 5
LENET5_LIKE_FILTER_DEPTH = 16
LENET5_LIKE_NUM_HIDDEN = 120

def variables_lenet5_like(filter_size = LENET5_LIKE_FILTER_SIZE,
                          filter_depth = LENET5_LIKE_FILTER_DEPTH,
                          num_hidden = LENET5_LIKE_NUM_HIDDEN,
                          image_width = 28, image_depth = 1, num_labels = 10):
    w1 = tf.Variable(tf.truncated_normal([filter_size, filter_size, image_depth, filter_depth], stddev=0.1))
    b1 = tf.Variable(tf.zeros([filter_depth]))
    w2 = tf.Variable(tf.truncated_normal([filter_size, filter_size, filter_depth, filter_depth], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[filter_depth]))
    w3 = tf.Variable(tf.truncated_normal([(image_width // 4)*(image_width // 4)*filter_depth, num_hidden], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape = [num_hidden]))
    w4 = tf.Variable(tf.truncated_normal([num_hidden, num_hidden], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape = [num_hidden]))
    w5 = tf.Variable(tf.truncated_normal([num_hidden, num_labels], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape = [num_labels]))
    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5
    }
    return variables

def model_lenet5_like(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.nn.relu(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.avg_pool(layer1_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    layer2_conv = tf.nn.conv2d(layer1_pool, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_actv = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.avg_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')
    flat_layer = flatten_tf_array(layer2_pool)
    layer3_fccd = tf.matmul(flat_layer, variables['w3']) + variables['b3']
    layer3_actv = tf.nn.relu(layer3_fccd)
    #layer3_drop = tf.nn.dropout(layer3_actv, 0.5)
    layer4_fccd = tf.matmul(layer3_actv, variables['w4']) + variables['b4']
    layer4_actv = tf.nn.relu(layer4_fccd)
    #layer4_drop = tf.nn.dropout(layer4_actv, 0.5)
    logits = tf.matmul(layer4_actv, variables['w5']) + variables['b5']
    return logits
```

The main difference is that we are using a relu activation function instead of a sigmoid activation.

Besides the activation function, we can also change the used optimizers to see what the effect is of the different optimizers on accuracy.

Let's see how these CNNs perform on the MNIST and CIFAR-10 datasets.

In the figures above, the accuracy on the test set is given as a function of the number of iterations. On the left for the one layer fully connected NN, in the middle for the LeNet5 NN and on the right for the LeNet5-like NN.

As we can see, the LeNet5 CNN works pretty well for the MNIST dataset. This should not come as a big surprise, since it was specially designed to classify handwritten digits. The MNIST dataset is quite small and does not provide a big challenge, so even a one-layer fully connected network performs quite well.

On the CIFAR-10 Dataset however, the performance for the LeNet5 NN drops significantly to accuracy values around 40%.

To increase the accuracy, we can change the optimizer, or fine-tune the Neural Network by applying regularization or learning rate decay.

As we can see, the AdagradOptimizer, AdamOptimizer and the RMSPropOptimizer have a better performance than the GradientDescentOptimizer. These are adaptive optimizers which in general perform better than the (simple) GradientDescentOptimizer but need more computational power.

With L2-regularization or exponential rate decay we can probably gain a bit more accuracy, but for much better results we need to go deeper.

So far we have seen the LeNet5 CNN architecture. LeNet5 contains two convolutional layers followed by fully connected layers, and therefore could be called a shallow Neural Network. At that time (in 1998) GPUs were not used for numerical computations, and CPUs were not that powerful either, so for its time the two convolutional layers were already quite innovative.

Later on, many other types of Convolutional Neural Networks have been designed, most of them much deeper.

There is the famous AlexNet architecture (2012) by Alex Krizhevsky et al., the 7-layered ZF Net (2013), and the 16-layered VGGNet (2014).

In 2015 Google came out with a 22-layered CNN with an inception module (GoogLeNet), and Microsoft Research Asia created the 152-layered CNN called ResNet.

Now, with the things we have learned so far, let's see how we can create the AlexNet and VGGNet16 architectures in TensorFlow.

Although LeNet5 was the first ConvNet, it is considered a shallow neural network. It performs well on the MNIST dataset, which consists of grayscale images of size 28 x 28, but the performance drops when we try to classify larger images, with more resolution and more classes.

The first deep CNN came out in 2012 and is called AlexNet, after its creators Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Compared to the most recent architectures AlexNet can be considered simple, but at the time it was really successful. It won the ImageNet competition with an incredible test error rate of 15.4% (while the runner-up had an error of 26.2%) and started a revolution (also see this video) in the world of Deep Learning and AI.

It consists of 5 convolutional layers (with relu activation), 3 max pooling layers, 3 fully connected layers and 2 dropout layers. The overall architecture looks as follows:

- **layer 0**: input image of size 224 x 224 x 3
- **layer 1**: a convolutional layer with 96 filters (filter_depth_1 = 96) of size 11 x 11 (filter_size_1 = 11) and a stride of 4, with a relu activation function. It is followed by max pooling and local response normalization layers.
- **layer 2**: a convolutional layer with 256 filters (filter_depth_2 = 256) of size 5 x 5 (filter_size_2 = 5) and a stride of 1, with a relu activation function. This layer is also followed by max pooling and local response normalization layers.
- **layer 3**: a convolutional layer with 384 filters (filter_depth_3 = 384) of size 3 x 3 (filter_size_3 = 3) and a stride of 1, with a relu activation function.
- **layer 4**: same as layer 3.
- **layer 5**: a convolutional layer with 256 filters (filter_depth_4 = 256) of size 3 x 3 (filter_size_4 = 3) and a stride of 1, with a relu activation function.
- **layers 6-8**: the convolutional layers are followed by fully connected layers with 4096 neurons each. In the original paper they classify a dataset with 1000 classes, but we will use the oxflower17 dataset, which has 17 different classes (of flowers).

Note that this CNN (or other deep CNNs) cannot be used on the MNIST or the CIFAR-10 dataset, because the images in these datasets are too small. As we have seen before, a pooling layer (or a convolutional layer with a stride of 2) reduces the image size by a factor of 2. AlexNet has three max pooling layers and one convolutional layer with a stride of 4. This means that the original image size gets reduced by a factor of 4 x 2 x 2 x 2 = 32. The 28 x 28 images in the MNIST dataset would simply be reduced to a size smaller than one pixel.
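A quick back-of-the-envelope calculation makes this concrete:

```python
# AlexNet reduces the spatial size with one stride-4 convolution
# and three max pooling layers with stride 2:
reduction_factor = 4 * 2 * 2 * 2  # = 32

# MNIST images are too small to survive this reduction...
mnist_after = 28 // reduction_factor       # nothing is left of the image
# ...but 224 x 224 images end up as a usable 7 x 7 feature map
oxflower_after = 224 // reduction_factor

print(mnist_after, oxflower_after)
```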

Therefore we need to load a dataset with larger images, preferably 224 x 224 x 3 (as the original paper indicates). The 17 category flower dataset, aka oxflower17 dataset is ideal since it contains images of exactly this size:

```python
ox17_image_width = 224
ox17_image_height = 224
ox17_image_depth = 3
ox17_num_labels = 17

import tflearn.datasets.oxflower17 as oxflower17
train_dataset_, train_labels_ = oxflower17.load_data(one_hot=True)
train_dataset_ox17, train_labels_ox17 = train_dataset_[:1000,:,:,:], train_labels_[:1000,:]
test_dataset_ox17, test_labels_ox17 = train_dataset_[1000:,:,:,:], train_labels_[1000:,:]

print('Training set', train_dataset_ox17.shape, train_labels_ox17.shape)
print('Test set', test_dataset_ox17.shape, test_labels_ox17.shape)
```

Let's try to create the weight matrices and the different layers present in AlexNet. As we have seen before, we need as many weight matrices and bias vectors as there are layers, and each weight matrix should have a size corresponding to the filter size of the layer it belongs to.

```python
ALEX_PATCH_DEPTH_1, ALEX_PATCH_DEPTH_2, ALEX_PATCH_DEPTH_3, ALEX_PATCH_DEPTH_4 = 96, 256, 384, 256
ALEX_PATCH_SIZE_1, ALEX_PATCH_SIZE_2, ALEX_PATCH_SIZE_3, ALEX_PATCH_SIZE_4 = 11, 5, 3, 3
ALEX_NUM_HIDDEN_1, ALEX_NUM_HIDDEN_2 = 4096, 4096

def variables_alexnet(patch_size1 = ALEX_PATCH_SIZE_1, patch_size2 = ALEX_PATCH_SIZE_2,
                      patch_size3 = ALEX_PATCH_SIZE_3, patch_size4 = ALEX_PATCH_SIZE_4,
                      patch_depth1 = ALEX_PATCH_DEPTH_1, patch_depth2 = ALEX_PATCH_DEPTH_2,
                      patch_depth3 = ALEX_PATCH_DEPTH_3, patch_depth4 = ALEX_PATCH_DEPTH_4,
                      num_hidden1 = ALEX_NUM_HIDDEN_1, num_hidden2 = ALEX_NUM_HIDDEN_2,
                      image_width = 224, image_height = 224, image_depth = 3, num_labels = 17):
    w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))
    w2 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth2]))
    w3 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1))
    b3 = tf.Variable(tf.zeros([patch_depth3]))
    w4 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth3], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w5 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth3], stddev=0.1))
    b5 = tf.Variable(tf.zeros([patch_depth3]))

    pool_reductions = 3
    conv_reductions = 2
    no_reductions = pool_reductions + conv_reductions
    w6 = tf.Variable(tf.truncated_normal([(image_width // 2**no_reductions)*(image_height // 2**no_reductions)*patch_depth3, num_hidden1], stddev=0.1))
    b6 = tf.Variable(tf.constant(1.0, shape = [num_hidden1]))
    w7 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b7 = tf.Variable(tf.constant(1.0, shape = [num_hidden2]))
    w8 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b8 = tf.Variable(tf.constant(1.0, shape = [num_labels]))

    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8
    }
    return variables

def model_alexnet(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 4, 4, 1], padding='SAME')
    layer1_relu = tf.nn.relu(layer1_conv + variables['b1'])
    layer1_pool = tf.nn.max_pool(layer1_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer1_norm = tf.nn.local_response_normalization(layer1_pool)

    layer2_conv = tf.nn.conv2d(layer1_norm, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_relu = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.max_pool(layer2_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer2_norm = tf.nn.local_response_normalization(layer2_pool)

    layer3_conv = tf.nn.conv2d(layer2_norm, variables['w3'], [1, 1, 1, 1], padding='SAME')
    layer3_relu = tf.nn.relu(layer3_conv + variables['b3'])
    layer4_conv = tf.nn.conv2d(layer3_relu, variables['w4'], [1, 1, 1, 1], padding='SAME')
    layer4_relu = tf.nn.relu(layer4_conv + variables['b4'])
    layer5_conv = tf.nn.conv2d(layer4_relu, variables['w5'], [1, 1, 1, 1], padding='SAME')
    layer5_relu = tf.nn.relu(layer5_conv + variables['b5'])
    # note: we pool the output of layer 5 (not layer 4)
    layer5_pool = tf.nn.max_pool(layer5_relu, [1, 3, 3, 1], [1, 2, 2, 1], padding='SAME')
    layer5_norm = tf.nn.local_response_normalization(layer5_pool)

    flat_layer = flatten_tf_array(layer5_norm)
    layer6_fccd = tf.matmul(flat_layer, variables['w6']) + variables['b6']
    layer6_tanh = tf.tanh(layer6_fccd)
    layer6_drop = tf.nn.dropout(layer6_tanh, 0.5)
    layer7_fccd = tf.matmul(layer6_drop, variables['w7']) + variables['b7']
    layer7_tanh = tf.tanh(layer7_fccd)
    layer7_drop = tf.nn.dropout(layer7_tanh, 0.5)
    logits = tf.matmul(layer7_drop, variables['w8']) + variables['b8']
    return logits
```

Now we can modify the CNN model to use the weights and layers of the AlexNet model in order to classify images.

VGG Net was created in 2014 by Karen Simonyan and Andrew Zisserman of the University of Oxford. It contains many more layers (16-19), but each layer is simpler in its design: all of the convolutional layers have filters of size 3 x 3 and a stride of 1, and all max pooling layers have a stride of 2.

So it is a deeper, but simpler, CNN.

It comes in different configurations, with either 16 or 19 layers. The difference between these two different configurations is the usage of either 3 or 4 convolutional layers after the second, third and fourth max pooling layer (see below).

The configuration with 16 layers (configuration D) seems to produce the best results, so let's try to create that one in TensorFlow.

```python
# The VGGNET Neural Network
VGG16_PATCH_SIZE_1, VGG16_PATCH_SIZE_2, VGG16_PATCH_SIZE_3, VGG16_PATCH_SIZE_4 = 3, 3, 3, 3
VGG16_PATCH_DEPTH_1, VGG16_PATCH_DEPTH_2, VGG16_PATCH_DEPTH_3, VGG16_PATCH_DEPTH_4 = 64, 128, 256, 512
VGG16_NUM_HIDDEN_1, VGG16_NUM_HIDDEN_2 = 4096, 1000

def variables_vggnet16(patch_size1 = VGG16_PATCH_SIZE_1, patch_size2 = VGG16_PATCH_SIZE_2,
                       patch_size3 = VGG16_PATCH_SIZE_3, patch_size4 = VGG16_PATCH_SIZE_4,
                       patch_depth1 = VGG16_PATCH_DEPTH_1, patch_depth2 = VGG16_PATCH_DEPTH_2,
                       patch_depth3 = VGG16_PATCH_DEPTH_3, patch_depth4 = VGG16_PATCH_DEPTH_4,
                       num_hidden1 = VGG16_NUM_HIDDEN_1, num_hidden2 = VGG16_NUM_HIDDEN_2,
                       image_width = 224, image_height = 224, image_depth = 3, num_labels = 17):
    w1 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, image_depth, patch_depth1], stddev=0.1))
    b1 = tf.Variable(tf.zeros([patch_depth1]))
    w2 = tf.Variable(tf.truncated_normal([patch_size1, patch_size1, patch_depth1, patch_depth1], stddev=0.1))
    b2 = tf.Variable(tf.constant(1.0, shape=[patch_depth1]))
    w3 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth1, patch_depth2], stddev=0.1))
    b3 = tf.Variable(tf.constant(1.0, shape = [patch_depth2]))
    w4 = tf.Variable(tf.truncated_normal([patch_size2, patch_size2, patch_depth2, patch_depth2], stddev=0.1))
    b4 = tf.Variable(tf.constant(1.0, shape = [patch_depth2]))
    w5 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth2, patch_depth3], stddev=0.1))
    b5 = tf.Variable(tf.constant(1.0, shape = [patch_depth3]))
    w6 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1))
    b6 = tf.Variable(tf.constant(1.0, shape = [patch_depth3]))
    w7 = tf.Variable(tf.truncated_normal([patch_size3, patch_size3, patch_depth3, patch_depth3], stddev=0.1))
    b7 = tf.Variable(tf.constant(1.0, shape=[patch_depth3]))
    w8 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth3, patch_depth4], stddev=0.1))
    b8 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w9 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b9 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w10 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b10 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w11 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b11 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))
    w12 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b12 = tf.Variable(tf.constant(1.0, shape=[patch_depth4]))
    w13 = tf.Variable(tf.truncated_normal([patch_size4, patch_size4, patch_depth4, patch_depth4], stddev=0.1))
    b13 = tf.Variable(tf.constant(1.0, shape = [patch_depth4]))

    no_pooling_layers = 5
    w14 = tf.Variable(tf.truncated_normal([(image_width // (2**no_pooling_layers))*(image_height // (2**no_pooling_layers))*patch_depth4, num_hidden1], stddev=0.1))
    b14 = tf.Variable(tf.constant(1.0, shape = [num_hidden1]))
    w15 = tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1))
    b15 = tf.Variable(tf.constant(1.0, shape = [num_hidden2]))
    w16 = tf.Variable(tf.truncated_normal([num_hidden2, num_labels], stddev=0.1))
    b16 = tf.Variable(tf.constant(1.0, shape = [num_labels]))

    variables = {
        'w1': w1, 'w2': w2, 'w3': w3, 'w4': w4, 'w5': w5, 'w6': w6, 'w7': w7, 'w8': w8,
        'w9': w9, 'w10': w10, 'w11': w11, 'w12': w12, 'w13': w13, 'w14': w14, 'w15': w15, 'w16': w16,
        'b1': b1, 'b2': b2, 'b3': b3, 'b4': b4, 'b5': b5, 'b6': b6, 'b7': b7, 'b8': b8,
        'b9': b9, 'b10': b10, 'b11': b11, 'b12': b12, 'b13': b13, 'b14': b14, 'b15': b15, 'b16': b16
    }
    return variables

def model_vggnet16(data, variables):
    layer1_conv = tf.nn.conv2d(data, variables['w1'], [1, 1, 1, 1], padding='SAME')
    layer1_actv = tf.nn.relu(layer1_conv + variables['b1'])
    layer2_conv = tf.nn.conv2d(layer1_actv, variables['w2'], [1, 1, 1, 1], padding='SAME')
    layer2_actv = tf.nn.relu(layer2_conv + variables['b2'])
    layer2_pool = tf.nn.max_pool(layer2_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer3_conv = tf.nn.conv2d(layer2_pool, variables['w3'], [1, 1, 1, 1], padding='SAME')
    layer3_actv = tf.nn.relu(layer3_conv + variables['b3'])
    layer4_conv = tf.nn.conv2d(layer3_actv, variables['w4'], [1, 1, 1, 1], padding='SAME')
    layer4_actv = tf.nn.relu(layer4_conv + variables['b4'])
    # note: pool the activation of layer 4 (the original code pooled an undefined tensor)
    layer4_pool = tf.nn.max_pool(layer4_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer5_conv = tf.nn.conv2d(layer4_pool, variables['w5'], [1, 1, 1, 1], padding='SAME')
    layer5_actv = tf.nn.relu(layer5_conv + variables['b5'])
    layer6_conv = tf.nn.conv2d(layer5_actv, variables['w6'], [1, 1, 1, 1], padding='SAME')
    layer6_actv = tf.nn.relu(layer6_conv + variables['b6'])
    layer7_conv = tf.nn.conv2d(layer6_actv, variables['w7'], [1, 1, 1, 1], padding='SAME')
    layer7_actv = tf.nn.relu(layer7_conv + variables['b7'])
    layer7_pool = tf.nn.max_pool(layer7_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer8_conv = tf.nn.conv2d(layer7_pool, variables['w8'], [1, 1, 1, 1], padding='SAME')
    layer8_actv = tf.nn.relu(layer8_conv + variables['b8'])
    layer9_conv = tf.nn.conv2d(layer8_actv, variables['w9'], [1, 1, 1, 1], padding='SAME')
    layer9_actv = tf.nn.relu(layer9_conv + variables['b9'])
    layer10_conv = tf.nn.conv2d(layer9_actv, variables['w10'], [1, 1, 1, 1], padding='SAME')
    layer10_actv = tf.nn.relu(layer10_conv + variables['b10'])
    layer10_pool = tf.nn.max_pool(layer10_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    layer11_conv = tf.nn.conv2d(layer10_pool, variables['w11'], [1, 1, 1, 1], padding='SAME')
    layer11_actv = tf.nn.relu(layer11_conv + variables['b11'])
    layer12_conv = tf.nn.conv2d(layer11_actv, variables['w12'], [1, 1, 1, 1], padding='SAME')
    layer12_actv = tf.nn.relu(layer12_conv + variables['b12'])
    layer13_conv = tf.nn.conv2d(layer12_actv, variables['w13'], [1, 1, 1, 1], padding='SAME')
    layer13_actv = tf.nn.relu(layer13_conv + variables['b13'])
    layer13_pool = tf.nn.max_pool(layer13_actv, [1, 2, 2, 1], [1, 2, 2, 1], padding='SAME')

    flat_layer = flatten_tf_array(layer13_pool)
    layer14_fccd = tf.matmul(flat_layer, variables['w14']) + variables['b14']
    layer14_actv = tf.nn.relu(layer14_fccd)
    layer14_drop = tf.nn.dropout(layer14_actv, 0.5)
    layer15_fccd = tf.matmul(layer14_drop, variables['w15']) + variables['b15']
    layer15_actv = tf.nn.relu(layer15_fccd)
    layer15_drop = tf.nn.dropout(layer15_actv, 0.5)
    logits = tf.matmul(layer15_drop, variables['w16']) + variables['b16']
    return logits
```

As a comparison, have a look at the LeNet5 CNN performance on the larger oxflower17 dataset:

The code is also available in my GitHub repository, so feel free to use it on your own dataset(s).

There is much more to explore in the world of Deep Learning; Recurrent Neural Networks, Region-Based CNN’s, GAN’s, Reinforcement Learning, etc. In future blog-posts I’ll build these types of Neural Networks, and also build awesome applications with what we have already learned.

So subscribe and stay tuned!

[1] If you feel like you need to refresh your understanding of CNNs, here are some good starting points to get you up to speed:

- Machine Learning is fun!
- An Intuitive Explanation of Convolutional Neural Networks
- CS231n Convolutional Neural Networks for Visual Recognition
- Udacity’s Deep Learning course
- Neural Networks and Deep Learning, Ch. 6

[2] If you want more information about the theory behind these different Neural Networks, Adit Deshpande’s blog post provides a good comparison of them with links to the original papers. Eugenio Culurciello has a nice blog and article worth a read. In addition to that, also have a look at this github repository containing awesome deep learning papers, and this github repository where deep learning papers are ordered by task and date.

update2: I have added sections 2.4, 3.2, 3.3.2 and 4 to this blog post, updated the code on GitHub and improved upon some methods.

For Python programmers, scikit-learn is one of the best libraries to build Machine Learning applications with. It is ideal for beginners because it has a really simple interface, and it is well documented with many examples and tutorials.

Besides supervised machine learning (classification and regression), it can also be used for clustering, dimensionality reduction, feature extraction and engineering, and pre-processing the data. The interface is consistent over all of these methods, so it is not only easy to use, but it is also easy to construct a large ensemble of classifiers/regression models and train them with the same commands.

In this blog post, let's have a look at how to build, train, evaluate and validate a classifier with scikit-learn, improve upon the initial classifier with hyper-parameter optimization, and look at ways in which we can get a better understanding of complex datasets.

We will do this by going through the classification of two example datasets: the Glass dataset and the Mushroom dataset.

The glass dataset contains data on six types of glass (from building windows, containers, tableware, headlamps, etc) and each type of glass can be identified by the content of several minerals (for example Na, Fe, K, etc). This dataset only contains numerical data and therefore is a good dataset to get started with.

The second dataset contains non-numerical data and we will need an additional step where we encode the categorical data to numerical data.

Let's start with classifying the types of glass!

First we need to import the necessary modules and libraries which we will use.

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import time

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn import tree
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.naive_bayes import GaussianNB
```

- The pandas module is used to load, inspect and process the data, and get it in the shape necessary for classification.
- Seaborn is a library based on matplotlib with nice functionalities for drawing graphs.
- StandardScaler is used for standardizing and normalizing the dataset.
- LabelEncoder can be used to encode the categorical features (in the mushroom dataset) as numerical values.
- All of the other modules are classifiers which are used for classification of the dataset.

When loading a dataset for the first time, there are several questions we need to ask ourselves:

- What kind of data does the dataset contain? Numerical data, categorical data, geographic information, etc…
- Does the dataset contain any missing data?
- Does the dataset contain any redundant data (noise)?
- Do the values of the features differ over many orders of magnitude? Do we need to standardize or normalize the dataset?

```python
filename_glass = './data/glass.csv'
df_glass = pd.read_csv(filename_glass)
print(df_glass.shape)
display(df_glass.head())
display(df_glass.describe())
```

We can see that the dataset consists of 214 rows and 10 columns. All of the columns contain numerical data, and there are no rows with missing information. Also most of the features have values in the same order of magnitude.

So for this dataset we do not need to remove any rows, impute missing values or transform categorical data into numerical.

The .describe() method we used above is useful for giving a quick overview of the dataset;

- How many rows of data are there?
- What are some characteristic values like the mean, standard deviation, minimum and maximum value, the 25th percentile etc.

The next step is building and training the actual classifier, which hopefully can accurately classify the data. With this we will be able to tell which type of glass an entry in the dataset belongs to, based on the features.

For this we need to split the dataset into a training set and a test set. With the training set we will train the classifier, and with the test set we will validate the accuracy of the classifier. Usually a 70 % / 30 % ratio is used when splitting into a training and test set, but this ratio should be chosen based on the size of the dataset. For example, if the dataset does not have enough entries, 30% of it might not contain all of the classes or enough information to properly function as a validation set.

Another important note is that the distribution of the different classes in both the training and the test set should be equal to the distribution in the actual dataset. For example, if you have a dataset with review-texts which contains 20% negative and 80% positive reviews, both the training and the test set should have this 20% / 80% ratio. The best way to do this, is to split the dataset into a training and test set **randomly**.
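Note that a purely random split only approximately preserves the class distribution; scikit-learn's `train_test_split` can enforce it exactly with the `stratify` argument. A minimal sketch with toy data (the 20% / 80% labels below are made up to mirror the review example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 20% / 80% class imbalance
y = np.array([0] * 20 + [1] * 80)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 20% / 80% ratio in both the train and the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(np.mean(y_train == 0), np.mean(y_test == 0))  # both 0.2
```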

```python
def get_train_test(df, y_col, x_cols, ratio):
    """
    This method transforms a dataframe into a train and test set, for this you need to specify:
    1. the ratio train : test (usually 0.7)
    2. the column with the Y_values
    """
    mask = np.random.rand(len(df)) < ratio
    df_train = df[mask]
    df_test = df[~mask]
    Y_train = df_train[y_col].values
    Y_test = df_test[y_col].values
    X_train = df_train[x_cols].values
    X_test = df_test[x_cols].values
    return df_train, df_test, X_train, Y_train, X_test, Y_test

y_col_glass = 'Type'
x_cols_glass = list(df_glass.columns.values)
x_cols_glass.remove(y_col_glass)

train_test_ratio = 0.7
df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col_glass, x_cols_glass, train_test_ratio)
```

With the dataset split into a training and test set, we can start building a classification model. We will do this in a slightly different way than usual. The idea behind this is that, when we start with a new dataset, we don't know which (type of) classifier will perform best on it. Will it be an ensemble classifier like Gradient Boosting or Random Forest, a classifier with a functional approach like Logistic Regression, a classifier with a statistical approach like Naive Bayes, etc.?

Because we don't know this, and computational power is cheap nowadays, we will try out all types of classifiers first, and later we can continue to optimize the best-performing classifier of this initial batch. For this we have to make a dictionary, which contains as *keys* the names of the classifiers and as *values* an instance of each classifier.

```python
dict_classifiers = {
    "Logistic Regression": LogisticRegression(),
    "Nearest Neighbors": KNeighborsClassifier(),
    "Linear SVM": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(n_estimators=1000),
    "Decision Tree": tree.DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=1000),
    "Neural Net": MLPClassifier(alpha = 1),
    "Naive Bayes": GaussianNB(),
    #"AdaBoost": AdaBoostClassifier(),
    #"QDA": QuadraticDiscriminantAnalysis(),
    #"Gaussian Process": GaussianProcessClassifier()
}
```

Then we can iterate over this dictionary, and for each classifier:

- train the classifier with `.fit(X_train, Y_train)`
- evaluate how the classifier performs on the training set with `.score(X_train, Y_train)`
- evaluate how the classifier performs on the test set with `.score(X_test, Y_test)`
- keep track of how much time it takes to train the classifier, with the time module
- save the trained model, the training score, the test score, and the training time into a dictionary. If necessary this dictionary can be saved with Python’s pickle module.

```python
def batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers = 5, verbose = True):
    """
    This method takes as input the X, Y matrices of the train and test set,
    and fits them on all of the classifiers specified in dict_classifiers.
    The trained models and accuracies are saved in a dictionary. The reason to
    use a dictionary is that it is very easy to save the whole dictionary with
    the pickle module.

    Usually, the SVM, Random Forest and Gradient Boosting classifiers take quite
    some time to train. So it is best to train them on a smaller dataset first,
    and decide whether you want to comment them out or not based on the test
    accuracy score.
    """
    dict_models = {}
    for classifier_name, classifier in list(dict_classifiers.items())[:no_classifiers]:
        t_start = time.time()  # time.clock() was removed in Python 3.8
        classifier.fit(X_train, Y_train)
        t_diff = time.time() - t_start
        train_score = classifier.score(X_train, Y_train)
        test_score = classifier.score(X_test, Y_test)
        dict_models[classifier_name] = {'model': classifier, 'train_score': train_score,
                                        'test_score': test_score, 'train_time': t_diff}
        if verbose:
            print("trained {c} in {f:.2f} s".format(c=classifier_name, f=t_diff))
    return dict_models

def display_dict_models(dict_models, sort_by='test_score'):
    cls = [key for key in dict_models.keys()]
    test_s = [dict_models[key]['test_score'] for key in cls]
    training_s = [dict_models[key]['train_score'] for key in cls]
    training_t = [dict_models[key]['train_time'] for key in cls]
    df_ = pd.DataFrame(data=np.zeros(shape=(len(cls), 4)),
                       columns = ['classifier', 'train_score', 'test_score', 'train_time'])
    for ii in range(0, len(cls)):
        df_.loc[ii, 'classifier'] = cls[ii]
        df_.loc[ii, 'train_score'] = training_s[ii]
        df_.loc[ii, 'test_score'] = test_s[ii]
        df_.loc[ii, 'train_time'] = training_t[ii]
    display(df_.sort_values(by=sort_by, ascending=False))
```

The reason why we keep track of the time it takes to train a classifier is that in practice this is also an important indicator of whether or not you would like to use a specific classifier. If there are two classifiers with similar results, but one of them takes much less time to train, you probably want to use that one.

The `score()` method simply returns the result of the `accuracy_score()` method in the metrics module. This module contains many methods for evaluating classification or regression models, and I recommend you spend some time learning which metrics you can use to evaluate your model.

The `classification_report` method, for example, calculates the precision, recall and f1-score for all of the classes in your dataset. If you are looking for ways to improve the accuracy of your classifier, or if you want to know why the accuracy is lower than expected, such detailed information about the performance of the classifier on the dataset can point you in the right direction.
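To see what this per-class breakdown looks like, here is a minimal sketch with made-up true and predicted labels (not the glass dataset):

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels for a three-class problem
y_true = [0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 0, 2, 2]

# Prints precision, recall and f1-score per class, plus averages
print(classification_report(y_true, y_pred))
```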

The accuracy on the training set, the accuracy on the test set, and the duration of the training were saved into a dictionary, and we can use the `display_dict_models()` method to visualize the results ordered by the test score.

```python
dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers=8)
display_dict_models(dict_models)
```

What we are doing feels like a brute-force approach, where a large number of classifiers are built to see which one performs best. It gives us an idea of which classifiers will perform well for a particular dataset and which will not. After that you can continue with the best (or top 3) classifiers, and try to improve the results by tweaking the parameters of the classifier, or by adding more features to the dataset.

As we can see, the Gradient Boosting classifier performs best for this dataset. Actually, classifiers like Random Forest and Gradient Boosting perform best for most datasets and challenges on Kaggle (that does not mean you should rule out all other classifiers).

For the ones who are interested in the theory behind these classifiers, scikit-learn has a pretty well-written user guide. Some of these classifiers were also explained in previous posts, like the Naive Bayes classifier and Logistic Regression, and Support Vector Machines were partially explained in the perceptron blog.

After we have determined with a quick and dirty method which classifier performs best for the dataset, we can improve upon the Classifier by optimizing its hyper-parameters.

```python
GDB_params = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.5, 0.1, 0.01, 0.001],
    'criterion': ['friedman_mse', 'mse', 'mae']
}

df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df_glass, y_col_glass, x_cols_glass, 0.6)

for n_est in GDB_params['n_estimators']:
    for lr in GDB_params['learning_rate']:
        for crit in GDB_params['criterion']:
            clf = GradientBoostingClassifier(n_estimators=n_est,
                                             learning_rate=lr,
                                             criterion=crit)
            clf.fit(X_train, Y_train)
            train_score = clf.score(X_train, Y_train)
            test_score = clf.score(X_test, Y_test)
            print("For ({}, {}, {}) - train, test score: \t {:.5f} \t-\t {:.5f}".format(
                n_est, lr, crit[:4], train_score, test_score))
```
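The nested loops above can also be expressed with scikit-learn's `GridSearchCV`, which additionally cross-validates each parameter combination. A minimal sketch on a synthetic stand-in for the glass dataset (the smaller parameter grid here is illustrative, not the one used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the glass dataset (9 features, like the glass data)
X, y = make_classification(n_samples=200, n_features=9, n_informative=5, random_state=0)

param_grid = {'n_estimators': [50, 100], 'learning_rate': [0.1, 0.01]}
grid = GridSearchCV(GradientBoostingClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

`best_params_` then gives the combination with the highest mean cross-validated score, without us having to eyeball the printed grid.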

The second dataset we will have a look at is the mushroom dataset, which contains data on edible vs poisonous mushrooms. In the dataset there are 8124 mushrooms in total (4208 edible and 3916 poisonous) described by 22 features each.

The big difference with the glass dataset is that these features don't have numerical but categorical values. Because this dataset contains categorical values, we need one extra step in the classification process: the encoding of these values (see section 3.3).

```python
filename_mushrooms = './data/mushrooms.csv'
df_mushrooms = pd.read_csv(filename_mushrooms)
display(df_mushrooms.head())
```

A fast way to find out what type of categorical data a dataset contains, is to print out the unique values of each column in this dataframe. In this way we can also see whether the dataset contains any missing values or redundant columns.

```python
for col in df_mushrooms.columns.values:
    print(col, df_mushrooms[col].unique())
```

As we can see, there are 22 categorical features. Of these, the feature ‘veil-type’ only contains one value ‘p’ and therefore does not provide any added value for any classifier. The best thing to do is to remove columns like this which only contain one value.

```python
for col in df_mushrooms.columns.values:
    if len(df_mushrooms[col].unique()) <= 1:
        print("Removing column {}, which only contains the value: {}".format(col, df_mushrooms[col].unique()[0]))
        df_mushrooms.drop(col, axis=1, inplace=True)
```

Some datasets contain missing values in the form of NaN, null, NULL, '?', '??', etc.

It could be that all missing values are of type NaN, or that some columns contain NaN and other columns contain missing data in the form of ‘??’.

It is up to your best judgement to decide what to do with these missing values. What is most effective really depends on the type of data, the type of missing data, and the ratio between missing and non-missing data.

- If the number of rows containing missing data is only a few percent of the total dataset, the best option could be to drop those rows. If half of the rows contain missing values, we could lose valuable information by dropping all of them.
- If there is a row or column which contains almost only missing data, it will not have much added value and it might be best to drop it.
- It could be that a value not being filled in also is information which helps with the classification and it is best to leave it like it is.
- Maybe we really need to improve the accuracy and the only way to do this is by imputing the missing values.
- etc.

Below we will look at a few ways in which you can either remove the missing values, or impute them.

```python
print("Number of rows in total: {}".format(df_mushrooms.shape[0]))
print("Number of rows with missing values in column 'stalk-root': {}".format(
    df_mushrooms[df_mushrooms['stalk-root'] == '?'].shape[0]))
df_mushrooms_dropped_rows = df_mushrooms[df_mushrooms['stalk-root'] != '?']
```

```python
drop_percentage = 0.8

df_mushrooms_dropped_cols = df_mushrooms.copy(deep=True)
df_mushrooms_dropped_cols.loc[df_mushrooms_dropped_cols['stalk-root'] == '?', 'stalk-root'] = np.nan

for col in df_mushrooms_dropped_cols.columns.values:
    no_rows = df_mushrooms_dropped_cols[col].isnull().sum()
    percentage = no_rows / df_mushrooms_dropped_cols.shape[0]
    if percentage > drop_percentage:
        del df_mushrooms_dropped_cols[col]
        print("Column {} contains {} missing values. This is {:.2f} percent. Dropping this column.".format(
            col, no_rows, percentage * 100))
```

```python
df_mushrooms_zerofill = df_mushrooms.copy(deep=True)
df_mushrooms_zerofill.loc[df_mushrooms_zerofill['stalk-root'] == '?', 'stalk-root'] = np.nan
df_mushrooms_zerofill.fillna(0, inplace=True)
```

```python
df_mushrooms_bfill = df_mushrooms.copy(deep=True)
df_mushrooms_bfill.loc[df_mushrooms_bfill['stalk-root'] == '?', 'stalk-root'] = np.nan
df_mushrooms_bfill.fillna(method='bfill', inplace=True)
```

```python
df_mushrooms_ffill = df_mushrooms.copy(deep=True)
df_mushrooms_ffill.loc[df_mushrooms_ffill['stalk-root'] == '?', 'stalk-root'] = np.nan
df_mushrooms_ffill.fillna(method='ffill', inplace=True)
```

Most classifiers can only work with numerical data and will raise an error when categorical values in the form of strings are used as input. When it comes to columns with categorical data, you can do two things.

- 1) Label encode the column such that its categorical values are converted to numerical values.
- 2) Expand the column into N different columns containing binary values (one-hot encoding).

**Example:** Let's assume that we have a column called 'FRUIT' which contains the unique values ['ORANGE', 'APPLE', 'PEAR'].

- In the first case it would be converted to the unique values [0, 1, 2]
- In the second case it would be converted into three different columns called [‘FRUIT_IS_ORANGE’, ‘FRUIT_IS_APPLE’, ‘FRUIT_IS_PEAR’] and after this the original column ‘FRUIT’ would be deleted. The three new columns would contain the values 1 or 0 depending on the value of the original column.

When using the first method, you should pay attention to the fact that some classifiers will try to make sense of the numerical value of the label encoded column. For example, the Nearest Neighbour algorithm assumes that the value 1 is closer to 0 than the value 2. But the numerical values have no meaning in the case of label encoded columns (an APPLE is not closer to an ORANGE than a PEAR is), and the results therefore can be misleading.
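The second (column-expansion) approach can also be done directly with pandas' `get_dummies`. A minimal sketch with the hypothetical 'FRUIT' column from the example above:

```python
import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({'FRUIT': ['ORANGE', 'APPLE', 'PEAR', 'APPLE']})

# Expands 'FRUIT' into one binary column per unique value and drops the original
df_expanded = pd.get_dummies(df, columns=['FRUIT'], prefix='FRUIT_IS')
print(df_expanded.columns.tolist())
```

This produces the columns FRUIT_IS_APPLE, FRUIT_IS_ORANGE and FRUIT_IS_PEAR, so the numerical-ordering problem described above does not occur.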

```python
def label_encode(df, columns):
    for col in columns:
        le = LabelEncoder()
        col_values_unique = list(df[col].unique())
        le.fit(col_values_unique)
        col_values = list(df[col].values)
        col_values_transformed = le.transform(col_values)
        df[col] = col_values_transformed

df_mushrooms_ohe = df_mushrooms.copy(deep=True)
to_be_encoded_cols = df_mushrooms_ohe.columns.values
label_encode(df_mushrooms_ohe, to_be_encoded_cols)
display(df_mushrooms_ohe.head())

## Now lets do the same thing for the other dataframes
df_mushrooms_dropped_rows_ohe = df_mushrooms_dropped_rows.copy(deep=True)
df_mushrooms_zerofill_ohe = df_mushrooms_zerofill.copy(deep=True)
df_mushrooms_bfill_ohe = df_mushrooms_bfill.copy(deep=True)
df_mushrooms_ffill_ohe = df_mushrooms_ffill.copy(deep=True)

label_encode(df_mushrooms_dropped_rows_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_zerofill_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_bfill_ohe, to_be_encoded_cols)
label_encode(df_mushrooms_ffill_ohe, to_be_encoded_cols)
```

```python
def expand_columns(df, list_columns):
    for col in list_columns:
        colvalues = df[col].unique()
        for colvalue in colvalues:
            newcol_name = "{}_is_{}".format(col, colvalue)
            df.loc[df[col] == colvalue, newcol_name] = 1
            df.loc[df[col] != colvalue, newcol_name] = 0
    df.drop(list_columns, inplace=True, axis=1)

y_col = 'class'
to_be_expanded_cols = list(df_mushrooms.columns.values)
to_be_expanded_cols.remove(y_col)

df_mushrooms_expanded = df_mushrooms.copy(deep=True)
label_encode(df_mushrooms_expanded, [y_col])
expand_columns(df_mushrooms_expanded, to_be_expanded_cols)

## Now lets do the same thing for all other dataframes
df_mushrooms_dropped_rows_expanded = df_mushrooms_dropped_rows.copy(deep=True)
df_mushrooms_zerofill_expanded = df_mushrooms_zerofill.copy(deep=True)
df_mushrooms_bfill_expanded = df_mushrooms_bfill.copy(deep=True)
df_mushrooms_ffill_expanded = df_mushrooms_ffill.copy(deep=True)

label_encode(df_mushrooms_dropped_rows_expanded, [y_col])
label_encode(df_mushrooms_zerofill_expanded, [y_col])
label_encode(df_mushrooms_bfill_expanded, [y_col])
label_encode(df_mushrooms_ffill_expanded, [y_col])

expand_columns(df_mushrooms_dropped_rows_expanded, to_be_expanded_cols)
expand_columns(df_mushrooms_zerofill_expanded, to_be_expanded_cols)
expand_columns(df_mushrooms_bfill_expanded, to_be_expanded_cols)
expand_columns(df_mushrooms_ffill_expanded, to_be_expanded_cols)
```

We have seen that there are two different ways to handle columns with categorical data, and many different ways to handle missing values.

Since computation power is cheap, it is easy to try out all of the classifiers on all of the different ways we have handled the missing values.

After we have seen which method and which classifier has the highest accuracy initially we can continue in that direction.

Again, we will split the dataset into a 70% training set and a 30% test set and start training and validating a batch of the eight most used classifiers.

```python
dict_dataframes = {
    "df_mushrooms_ohe": df_mushrooms_ohe,
    "df_mushrooms_dropped_rows_ohe": df_mushrooms_dropped_rows_ohe,
    "df_mushrooms_zerofill_ohe": df_mushrooms_zerofill_ohe,
    "df_mushrooms_bfill_ohe": df_mushrooms_bfill_ohe,
    "df_mushrooms_ffill_ohe": df_mushrooms_ffill_ohe,
    "df_mushrooms_expanded": df_mushrooms_expanded,
    "df_mushrooms_dropped_rows_expanded": df_mushrooms_dropped_rows_expanded,
    "df_mushrooms_zerofill_expanded": df_mushrooms_zerofill_expanded,
    "df_mushrooms_bfill_expanded": df_mushrooms_bfill_expanded,
    "df_mushrooms_ffill_expanded": df_mushrooms_ffill_expanded
}

y_col = 'class'
train_test_ratio = 0.7

for df_key, df in dict_dataframes.items():
    x_cols = list(df.columns.values)
    x_cols.remove(y_col)
    df_train, df_test, X_train, Y_train, X_test, Y_test = get_train_test(df, y_col, x_cols, train_test_ratio)
    dict_models = batch_classify(X_train, Y_train, X_test, Y_test, no_classifiers=8, verbose=False)
    print()
    print(df_key)
    display_dict_models(dict_models)
    print("-------------------------------------------------------")
```

As we can see here, the accuracy of the classifiers for this dataset is actually quite high.

Some datasets contain a lot of features and it is not immediately clear which of these features are helping with the Classification / Regression, and which of these features are only adding more noise.

To get a better understanding of how the dataset is made up of its features, we will discuss a few methods in the next few sections which can give more insight.

To get more insight into how (strongly) each feature is correlated with the Type of glass, we can calculate and plot the correlation matrix for this dataset.

```python
correlation_matrix = df_glass.corr()
plt.figure(figsize=(10, 8))
ax = sns.heatmap(correlation_matrix, vmax=1, square=True, annot=True, fmt='.2f',
                 cmap='GnBu', cbar_kws={"shrink": .5}, robust=True)
plt.title('Correlation matrix between the features', fontsize=20)
plt.show()
```

The correlation matrix shows us for example that the oxides 'Mg' and 'Al' are most strongly correlated with the Type of glass. The content of 'Ca' is least strongly correlated with the type of glass. For some datasets there could be features with no correlation at all; then it might be a good idea to remove these, since they will only function as noise.

A correlation matrix is a good way to get a general picture of how all of the features in the dataset are correlated with each other. For a dataset with a lot of features it might become very large, and the correlation of a single feature with the other features becomes difficult to discern.

If you want to look at the correlations of a single feature, it usually is a better idea to visualize it in the form of a bar-graph:

```python
def display_corr_with_col(df, col):
    correlation_matrix = df.corr()
    correlation_type = correlation_matrix[col].copy()
    abs_correlation_type = correlation_type.apply(lambda x: abs(x))
    desc_corr_values = abs_correlation_type.sort_values(ascending=False)
    y_values = list(desc_corr_values.values)[1:]
    x_values = range(0, len(y_values))
    xlabels = list(desc_corr_values.keys())[1:]
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.bar(x_values, y_values)
    ax.set_title('The correlation of all features with {}'.format(col), fontsize=20)
    ax.set_ylabel('Pearson correlation coefficient [absolute value]', fontsize=16)
    plt.xticks(x_values, xlabels, rotation='vertical')
    plt.show()

display_corr_with_col(df_glass, 'Type')
```

The cumulative explained variance shows how much of the variance is captured by the first x features.

Below we can see that the first 4 features (i.e. the four features with the largest correlation) already capture 90% of the variance.

```python
X = df_glass[x_cols_glass].values
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
var_ratio = pca.explained_variance_ratio_
components = pca.components_
#print(pca.explained_variance_)

plt.plot(np.cumsum(var_ratio))
plt.xlim(0, 9)
plt.xlabel('Number of Features', fontsize=16)
plt.ylabel('Cumulative explained variance', fontsize=16)
plt.show()
```

If you have low accuracy values for your Regression / Classification model, you could decide to stepwise remove the features with the lowest correlation, (or stepwise add features with the highest correlation).
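A stepwise removal like this can be sketched in a few lines. The sketch below uses a synthetic DataFrame (the column names `f1`…`f5` and the target `Type` are made up for illustration): it ranks the features by their absolute correlation with the target, drops the weakest one, and re-scores the model:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset: target depends on f1 and f2, the rest is noise
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(200, 5), columns=['f1', 'f2', 'f3', 'f4', 'f5'])
df['Type'] = (df['f1'] + df['f2'] > 1).astype(int)

# Rank features by absolute correlation with the target, ascending
corr = df.corr()['Type'].drop('Type').abs().sort_values()
weakest = corr.index[0]  # the feature to drop first

X = df.drop(columns=['Type', weakest])
score = cross_val_score(RandomForestClassifier(random_state=0), X, df['Type'], cv=3).mean()
print(weakest, score)
```

Repeating this (drop the next-weakest feature, re-score, stop when the score degrades) gives a simple, if greedy, feature-selection loop.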

In addition to the correlation matrix, you can plot the pairwise relationships between the features, to see **how** these features are correlated.

```python
ax = sns.pairplot(df_glass, hue='Type')
plt.title('Pairwise relationships between the features')
plt.show()
```

In my opinion, the best way to master the scikit-learn library is to simply start coding with it. I hope this blog-post gave some insight into the workings of the scikit-learn library, but for the ones who need some more information, here are some useful links:

dataschool – machine learning with scikit-learn video series

Classification example using the iris dataset

Official scikit-learn documentation


Most tasks in Machine Learning can be reduced to classification tasks. For example, we have a medical dataset and we want to classify who has diabetes (positive class) and who doesn’t (negative class). We have a dataset from the financial world and want to know which customers will default on their credit (positive class) and which customers will not (negative class).

To do this, we can train a Classifier with a ‘training dataset’ and after such a Classifier is trained (we have determined its model parameters) and can accurately classify the training set, we can use it to classify new data (test set). If the training is done properly, the Classifier should predict the class probabilities of the new data with a similar accuracy.

There are three popular Classifiers which use three different mathematical approaches to classify data. Previously we have looked at the first two of these; Logistic Regression and the Naive Bayes classifier. Logistic Regression uses a functional approach to classify data, and the Naive Bayes classifier uses a statistical (Bayesian) approach to classify data.

Logistic Regression assumes there is some function which forms a correct model of the dataset (i.e. it maps the input values correctly to the output values). This function is defined by its parameters θ. We can use the gradient descent method to find the optimum values of these parameters.

The Naive Bayes method is much simpler than that; we do not have to optimize a function, but can calculate the Bayesian (conditional) probabilities directly from the training dataset. This can be done quite fast (by creating a hash table containing the probability distributions of the features) but is generally less accurate.

Classification of data can also be done via a third way, by using a geometrical approach. The main idea is to find a line, or a plane, which can separate the two classes in their feature space. Classifiers which are using a geometrical approach are the Perceptron and the SVM (Support Vector Machines) methods.

Below we will discuss the Perceptron classification algorithm. Although Support Vector Machines are used more often, I think a good understanding of the Perceptron algorithm is essential to understanding Support Vector Machines and Neural Networks.

The Perceptron is a lightweight algorithm, which can classify data quite fast. But it only works in the limited case of a linearly separable, binary dataset. If you have a dataset consisting of only two classes, the Perceptron classifier can be trained to find a linear hyperplane which separates the two. If the dataset is not linearly separable, the perceptron will fail to find a separating hyperplane.

If the dataset consists of more than two classes we can use the standard approaches in multiclass classification (one-vs-all and one-vs-one) to transform the multiclass dataset to a binary dataset. For example, if we have a dataset, which consists of three different classes:

- In **one-vs-all**, class I is considered the positive class and the rest of the classes are considered the negative class. We can then look for a separating hyperplane between class I and the rest of the dataset (classes II and III). This process is repeated for class II and then for class III, so we are trying to find three separating hyperplanes: between class I and the rest of the data, between class II and the rest of the data, etc. If the dataset consists of K classes, we end up with K separating hyperplanes.
- In **one-vs-one**, class I is considered the positive class and each of the other classes is in turn considered the negative class: first class II is the negative class, and then class III is the negative class. This process is then repeated with the other classes as the positive class. So if the dataset consists of K classes, we are looking for K(K−1)/2 separating hyperplanes.

Although the one-vs-one can be a bit slower (there is one more iteration layer), it is not difficult to imagine it will be more advantageous in situations where a (linear) separating hyperplane does not exist between one class and the rest of the data, while it does exist between one class and other classes when they are considered individually. In the image below there is no separating line between the pear-class and the other two classes.
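scikit-learn wraps both strategies around any binary classifier. A minimal sketch on the iris dataset (which has three classes), using scikit-learn's own linear `Perceptron` as the base estimator:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# one-vs-all: one binary classifier (hyperplane) per class -> K = 3
ova = OneVsRestClassifier(Perceptron()).fit(X, y)
# one-vs-one: one binary classifier per pair of classes -> K(K-1)/2 = 3
ovo = OneVsOneClassifier(Perceptron()).fit(X, y)

print(len(ova.estimators_), len(ovo.estimators_))  # 3 3
```

For K = 3 both strategies happen to train three classifiers; for larger K, one-vs-one trains more of them (but each on a smaller subset of the data).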

The algorithm for the Perceptron is similar to the algorithm of Support Vector Machines (SVM). Both algorithms find a (linear) hyperplane separating the two classes. The biggest difference is that the Perceptron algorithm will find **any** separating hyperplane, while the SVM algorithm uses a Lagrangian constraint to find the hyperplane which is optimized to have the **maximum margin**. That is, the distance between the hyperplane and the nearest points of each class is maximized. This is illustrated in the figure below. While the Perceptron classifier is satisfied if any of these separating hyperplanes is found, an SVM classifier will find the green one, which has the maximum margin.

Another difference is that if the dataset is not linearly separable [2], the perceptron will fail to find a separating hyperplane; the algorithm simply does not converge during its iteration cycle. The SVM, on the other hand, can still find a maximum margin, minimum cost decision boundary (a separating hyperplane which does not separate 100% of the data, but does it with some small error).

It is often said that the perceptron is modeled after neurons in the brain. It has input values x_i (which correspond with the features of the examples in the training set) and one output value. Each input value is multiplied by a weight-factor w_i. If the sum of the products between the feature values and weight-factors is larger than zero, the perceptron is activated and 'fires' a signal (+1). Otherwise it is not activated.

The weighted sum between the input-values and the weight-values can mathematically be determined with the scalar product w · x. To produce the behaviour of 'firing' a signal (+1) we can use the signum function sgn(); it maps the output to +1 if the input is positive, and to -1 if the input is negative.

Thus, this Perceptron can mathematically be modeled by the function y = sgn(b + w · x). Here b is the bias, i.e. the default value when all feature values are zero.

The perceptron algorithm looks as follows:

```python
class Perceptron:
    """
    Class for performing Perceptron classification.
    X is the input array with n rows (no_examples) and m columns (no_features).
    Y is a vector containing elements which indicate the class
    (1 for the positive class, -1 for the negative class).
    w is the weight-vector (m number of elements).
    b is the bias-value.
    """
    def __init__(self, b=0, max_iter=1000):
        self.max_iter = max_iter
        self.w = []
        self.b = b
        self.no_examples = 0
        self.no_features = 0

    def train(self, X, Y):
        self.no_examples, self.no_features = np.shape(X)
        self.w = np.zeros(self.no_features)
        for ii in range(0, self.max_iter):
            w_updated = False
            for jj in range(0, self.no_examples):
                a = self.b + np.dot(self.w, X[jj])
                if np.sign(Y[jj] * a) != 1:
                    w_updated = True
                    self.w += Y[jj] * X[jj]
                    self.b += Y[jj]
            if not w_updated:
                print("Convergence reached in %i iterations." % ii)
                break
        if w_updated:
            print(
                """
                WARNING: convergence not reached in %i iterations.
                Either the dataset is not linearly separable,
                or max_iter should be increased
                """ % self.max_iter
            )

    def classify_element(self, x_elem):
        return int(np.sign(self.b + np.dot(self.w, x_elem)))

    def classify(self, X):
        predicted_Y = []
        for ii in range(np.shape(X)[0]):
            y_elem = self.classify_element(X[ii])
            predicted_Y.append(y_elem)
        return predicted_Y
```

As you can see, we set the bias-value and all the elements in the weight-vector to zero. Then we iterate ‘max_iter’ number of times over all the examples in the training set.

Here, y_j is the actual output value of each training example. This is either +1 (if it belongs to the positive class) or -1 (if it does not belong to the positive class).

The activation value a = b + w · x_j determines the predicted output sgn(a). The product y_j · a will be positive if the prediction is correct and negative if it is incorrect. Therefore, if the prediction made (with the weight vector from the previous training example) is incorrect, sgn(y_j · a) will be -1, and the weight vector is updated.

If the weight vector is not updated after some iteration, it means we have reached convergence and we can break out of the loop.

If the weight vector was updated in the last iteration, it means we still didn't reach convergence, and either the dataset is not linearly separable or we need to increase 'max_iter'.

We can see that the Perceptron is an online algorithm; it iterates through the examples in the training set, and for each example in the training set it calculates the value of the activation function and updates the values of the weight-vector.

Now let's examine the Perceptron algorithm for a linearly separable dataset which exists in 2 dimensions. For this we first have to create this dataset:

```python
import random
import numpy as np

def generate_data(no_points):
    X = np.zeros(shape=(no_points, 2))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = random.randint(1, 9) + 0.5
        X[ii][1] = random.randint(1, 9) + 0.5
        Y[ii] = 1 if X[ii][0] + X[ii][1] >= 13 else -1
    return X, Y
```

In the 2D case, the perceptron algorithm looks like:

```python
X, Y = generate_data(100)
p = Perceptron()
p.train(X, Y)

X_test, Y_test = generate_data(50)
predicted_Y_test = p.classify(X_test)
```
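The snippet above classifies the test set but does not score the predictions. A minimal, self-contained way to compute the accuracy (the labels here are made up for illustration; in the snippet above you would compare `predicted_Y_test` against `Y_test`):

```python
import numpy as np

Y_test = [1, -1, 1, 1, -1]             # hypothetical true labels
predicted_Y_test = [1, -1, -1, 1, -1]  # hypothetical perceptron output

# Fraction of predictions that match the true labels
accuracy = np.mean(np.array(Y_test) == np.array(predicted_Y_test))
print(accuracy)  # 0.8
```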

As we can see, the weight vector and the bias (which together determine the separating hyperplane) are updated when sgn(y_j · a) is not +1.

The result is nicely illustrated in this gif:

We can extend this to a dataset in any number of dimensions, and as long as it is linearly separable, the Perceptron algorithm will converge.

One of the benefits of this Perceptron is that it is a very ‘lightweight’ algorithm; it is computationally very fast and easy to implement for datasets which are linearly separable. But if the dataset is not linearly separable, it will not converge.

For such datasets, the Perceptron can still be used if the correct kernel is applied. In practice this is never done, and Support Vector Machines are used whenever a Kernel needs to be applied. Some of these Kernels are:

- Linear: k(x, y) = xᵀy + c
- Polynomial: k(x, y) = (α xᵀy + c)^d, with slope α, constant term c, and polynomial degree d
- Laplacian RBF: k(x, y) = exp(−‖x − y‖ / σ)
- Gaussian RBF: k(x, y) = exp(−‖x − y‖² / (2σ²))
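These kernels are straightforward to sketch in NumPy (σ, α, c and d are free parameters; the values below are just illustrative defaults):

```python
import numpy as np

def linear_kernel(x, y, c=0.0):
    return np.dot(x, y) + c

def polynomial_kernel(x, y, alpha=1.0, c=1.0, d=2):
    return (alpha * np.dot(x, y) + c) ** d

def laplacian_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(y)) / sigma)

def gaussian_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(y)) ** 2 / (2 * sigma ** 2))

x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(linear_kernel(x, y), gaussian_rbf_kernel(x, y))  # 0.0 and exp(-1)
```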

At this point, it would be too much to also implement the Kernel functions in the Perceptron, but I hope to do this in a next post about SVM. For more information about Kernel functions, a comprehensive list of kernels, and their source code, please click here.

**PS: The Python code for Logistic Regression can be forked/cloned from GitHub. **

In the previous blog we have seen the theory and mathematics behind the Logistic Regression Classifier.

Logistic Regression is one of the most powerful classification methods within machine learning and can be used for a wide variety of tasks. Think of pre-policing or predictive analytics in health; it can be used to aid tuberculosis patients, aid breast cancer diagnosis, etc. Think of modeling urban growth, analysing mortgage pre-payments and defaults, forecasting the direction and strength of stock market movement, and even sports.

Reading all of this, the theory[1] of Logistic Regression Classification might look difficult. In my experience, the average Developer does not believe they can design a proper Logistic Regression Classifier from scratch. I strongly disagree: not only is the mathematics behind it relatively simple, it can also be implemented with a few lines of code.

I have done this in the past month, so I thought I’d show you how to do it. The code is in Python but it should be relatively easy to translate it to other languages. Some of the examples contain self-generated data, while other examples contain real-world (iris) data. As was also done in the blog-posts about the bag-of-words model and the Naive Bayes Classifier, we will also try to automatically classify the sentiments of Amazon.com book reviews.

We have seen that the technique to perform Logistic Regression is similar to regular Regression Analysis.

There is a function which maps the input values to the output, and this function is completely determined by its parameters θ. So once we have determined the values of θ with the training examples, we can determine the class of any new example.

We are trying to estimate these parameter values θ with the iterative Gradient Descent method. In the Gradient Descent method, the values of the parameters θ in the current iteration are calculated by updating the values of θ from the previous iteration with the gradient of the cost function J(θ).

In (regular) Regression this hypothesis function can be any function which you expect will provide a good model of the dataset. In Logistic Regression the hypothesis function is always given by the Logistic function:

h_θ(x) = 1 / (1 + exp(−θᵀx)).
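The logistic (sigmoid) function squashes any real input into the interval (0, 1), which is what lets us interpret the hypothesis as a class probability. A quick numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 0 maps to exactly 0.5; large positive/negative inputs saturate towards 1 and 0
print(sigmoid(0), sigmoid(10), sigmoid(-10))
```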

Different cost functions exist, but most often the log-likelihood function, known as binary cross-entropy (see equation 2 of the previous post), is used.

One of its benefits is that the gradient of this cost function turns out to be quite simple, and since it is this gradient we use to update the values of θ, this makes our work easier.

Taking all of this into account, this is how Gradient Descent works:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria*:
- With the current values of θ, calculate the gradient of the cost function ∇J(θ).
- Update the values of the parameters: θ := θ − α · ∇J(θ), where α is the learning rate.
- Fill in these new values in the hypothesis function and calculate again the value of the cost function.

*Usually the iteration stops when either the maximum number of iterations has been reached, or the error (the difference between the cost of this iteration and the cost of the previous iteration) is smaller than some minimum error value (e.g. 0.001).

We have seen the self-generated example of students participating in a Machine Learning course, where their final grade depended on how many hours they had studied.

First, let’s generate the data:

```python
import random
import numpy as np

num_of_datapoints = 100
x_max = 10
initial_theta = [1, 0.07]

def func1(X, theta, add_noise=True):
    if add_noise:
        return theta[0]*X[0] + theta[1]*X[1]**2 + 0.25*X[1]*(random.random()-1)
    else:
        return theta[0]*X[0] + theta[1]*X[1]**2

def generate_data(num_of_datapoints, x_max, theta):
    X = np.zeros(shape=(num_of_datapoints, 2))
    Y = np.zeros(shape=num_of_datapoints)
    for ii in range(num_of_datapoints):
        X[ii][0] = 1
        X[ii][1] = (x_max*ii) / float(num_of_datapoints)
        Y[ii] = func1(X[ii], theta)
    return X, Y

X, Y = generate_data(num_of_datapoints, x_max, initial_theta)
```

We can see that we have generated 100 points uniformly distributed over the x-axis. For each of these x-points the y-value is determined by y = θ₀ + θ₁x² minus some random value.

On the left we can see a scatterplot of the datapoints and on the right we can see the same data with a curve fitted through the points. This is the curve we are trying to estimate with the Gradient Descent method. This is done as follows:

```python
def gradient_descent(X, Y, theta, alpha, m, number_of_iterations):
    for ii in range(0, number_of_iterations):
        print("iteration %s : feature-value: %s" % (ii, theta))
        grad0 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][0]**2 for iter in range(m)])
        grad1 = (2.0/m)*sum([(func1(X[iter], theta, False) - Y[iter])*X[iter][1]**4 for iter in range(m)])
        theta[0] = theta[0] - alpha * grad0
        theta[1] = theta[1] - alpha * grad1
    return theta

numIterations = 1000
alpha = 0.00000005
m, n = np.shape(X)
theta = np.ones(n)
theta = gradient_descent(X, Y, theta, alpha, m, numIterations)
```

We can see that we have to calculate the gradient of the cost function at every one of the 1000 iterations and update the parameter values θ₀ and θ₁ simultaneously! This indeed results in the curve we were looking for:

After this short example of Regression, let's have a look at a few examples of Logistic Regression. We will start out with the self-generated example of students passing a course or not, and then we will look at real world data.

Let’s generate some data points. There are 300 students participating in the course Machine Learning, and whether a student passes (y = 1) or not (y = 0) depends on two variables:

- x_1: how many hours the student has studied for the exam.
- x_2: how many hours the student has slept the day before the exam.

import random
import numpy as np

def func2(x_i):
    if x_i[1] <= 4:
        y = 0
    else:
        if x_i[1]+x_i[2] <= 13:
            y = 0
        else:
            y = 1
    return y

def generate_data2(no_points):
    X = np.zeros(shape=(no_points, 3))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = 1
        X[ii][1] = random.random()*9+0.5
        X[ii][2] = random.random()*9+0.5
        Y[ii] = func2(X[ii])
    return X, Y

X, Y = generate_data2(300)

In our example, the results are pretty binary: everyone who has studied for 4 hours or less fails the course, as does everyone whose studying time + sleeping time is less than or equal to 13 hours (i.e. x_1 + x_2 ≤ 13). The result looks like this (the green dots indicate a pass and the red dots a fail):

We have a LogisticRegression class, which sets the values of the learning rate and the maximum number of iterations at its initialization. The values of X, Y are set when these matrices are passed to the “train()” function, and then the values of no_examples, no_features, and theta are determined.

import numpy as np

class LogisticRegression():
    """
    Class for performing logistic regression.
    """
    def __init__(self, learning_rate = 0.7, max_iter = 1000):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.theta = []
        self.no_examples = 0
        self.no_features = 0
        self.X = None
        self.Y = None

    def add_bias_col(self, X):
        bias_col = np.ones((X.shape[0], 1))
        return np.concatenate([bias_col, X], axis=1)

We also have the hypothesis, cost and gradient functions:

def hypothesis(self, X):
    return 1 / (1 + np.exp(-1.0 * np.dot(X, self.theta)))

def cost_function(self):
    """
    We will use the binary cross entropy as the cost function.
    https://en.wikipedia.org/wiki/Cross_entropy
    """
    predicted_Y_values = self.hypothesis(self.X)
    cost = (-1.0/self.no_examples) * np.sum(self.Y * np.log(predicted_Y_values) + (1 - self.Y) * (np.log(1-predicted_Y_values)))
    return cost

def gradient(self):
    predicted_Y_values = self.hypothesis(self.X)
    grad = (-1.0/self.no_examples) * np.dot((self.Y-predicted_Y_values), self.X)
    return grad

With these functions, the gradient descent method can be defined as:

def gradient_descent(self):
    for iter in range(1, self.max_iter):
        cost = self.cost_function()
        delta = self.gradient()
        self.theta = self.theta - self.learning_rate * delta
        print("iteration %s : cost %s " % (iter, cost))

These functions are used by the “train()” method, which first sets the values of the matrices X, Y and theta, and then calls the gradient_descent method:

def train(self, X, Y):
    self.X = self.add_bias_col(X)
    self.Y = Y
    self.no_examples, self.no_features = np.shape(X)
    self.theta = np.ones(self.no_features + 1)
    self.gradient_descent()

Once the values have been determined with the gradient descent method, we can use it to classify new examples:

def classify(self, X):
    X = self.add_bias_col(X)
    predicted_Y = self.hypothesis(X)
    predicted_Y_binary = np.round(predicted_Y)
    return predicted_Y_binary

Using this algorithm for gradient descent, we can correctly classify 297 out of 300 datapoints of our self-generated example (wrongly classified points are indicated with a cross).

Now that the concept of Logistic Regression is a bit more clear, let’s classify real-world data!

One of the most famous classification datasets is The Iris Flower Dataset. This dataset consists of three classes, where each example has four numerical features.

import pandas as pd

to_bin_y = {
    1: { 'Iris-setosa': 1, 'Iris-versicolor': 0, 'Iris-virginica': 0 },
    2: { 'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 0 },
    3: { 'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 1 }
}

#loading the dataset
datafile = '../datasets/iris/iris.data'
df = pd.read_csv(datafile, header=None)
df_train = df.sample(frac=0.7)
df_test = df.loc[~df.index.isin(df_train.index)]
X_train = df_train.values[:,0:4].astype(float)
y_train = df_train.values[:,4]
X_test = df_test.values[:,0:4].astype(float)
y_test = df_test.values[:,4]
Y_train = np.array([to_bin_y[3][x] for x in y_train])
Y_test = np.array([to_bin_y[3][x] for x in y_test])

print("training Logistic Regression Classifier")
lr = LogisticRegression()
lr.train(X_train, Y_train)
print("trained")
predicted_Y_test = lr.classify(X_test)
# f1_score is a helper function defined in the accompanying GitHub repository
f1 = f1_score(predicted_Y_test, Y_test, 1)
print("F1-score on the test-set for class %s is: %s" % (1, f1))

As you can see, our simple LogisticRegression class can classify the iris dataset with quite a high accuracy:

training Logistic Regression Classifier
iteration 1 : cost 8.4609605194
iteration 2 : cost 3.50586831057
iteration 3 : cost 3.78903735339
iteration 4 : cost 6.01488933456
iteration 5 : cost 0.458208317153
iteration 6 : cost 2.67703502395
iteration 7 : cost 3.66033580721
(...)
iteration 998 : cost 0.0362384208231
iteration 999 : cost 0.0362289106001
trained
F1-score on the test-set for class 1 is: 0.973225806452

For a full overview of the code, please have a look at GitHub.

Logistic Regression by using Gradient Descent can also be used for NLP / Text Analysis tasks. There is a wide variety of tasks which are done in the field of NLP: authorship attribution, spam filtering, topic classification and sentiment analysis.

For a task like sentiment analysis we can follow the same procedure. We will have as the input a large collection of labelled text documents. These will be used to train the Logistic Regression classifier. The most important task then, is to select the proper features which will lead to the best sentiment classification. Almost everything in the text document can be used as a feature[2]; you are only limited by your creativity.

For sentiment analysis, usually the occurrence of (specific) words is used, or the relative occurrence of words (the word occurrences divided by the total number of words).
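Computing relative word occurrences for a single tokenized review can be sketched as follows (the function name is my own; the document's own `generate_X_matrix` below does the same thing for the whole training set at once):

```python
from collections import Counter

def relative_occurrences(tokens):
    """Map each word to its count divided by the total number of words."""
    counts = Counter(tokens)
    total = float(len(tokens))
    return {word: count / total for word, count in counts.items()}

features = relative_occurrences(['a', 'brilliant', 'book', 'a', 'brilliant', 'read'])
print(features['brilliant'])  # 2 occurrences out of 6 words
```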

As we have done before, we have to fill in the X and Y matrices, which will serve as input for the gradient descent algorithm, and this algorithm will give us the resulting feature vector θ. With this vector we can determine the class of other text documents.

As always, Y is a vector with n elements (where n is the number of text-documents). The matrix X is an n by m matrix (plus one bias column); here m is the total number of relevant words in all of the text-documents. I will illustrate how to build up this matrix with three book reviews:

- **pos:** “This is such a beautiful edition of Harry Potter and the Sorcerer’s Stone. I’m so glad I bought it as a keep sake. The illustrations are just stunning.” (28 words in total)
- **pos:** “A brilliant book that helps you to open up your mind as wide as the sky” (16 words in total)
- **neg:** “This publication is virtually unreadable. It doesn’t do this classic justice. Multiple typos, no illustrations, and the most wonky footnotes conceivable. Spend a dollar more and get a decent edition.” (30 words in total)

These three reviews will result in the following X-matrix.

As you can see, each row of the matrix contains all of the data per review and each column contains the data per word. If a review does not contain a specific word, the corresponding column will contain a zero. Such an X-matrix containing all the data from the training set can be built up in the following manner:

Assuming that we have a list containing the data from the *training set*:

[
 ([u'downloaded', u'the', u'book', u'to', u'my', ..., u'art'], 'neg'),
 ([u'this', u'novel', u'if', u'bunch', u'of', ..., u'ladies'], 'neg'),
 ([u'forget', u'reading', u'the', u'book', u'and', ..., u'hilarious!'], 'neg'),
 ...
]

From this *training_set*, we are going to generate a *words_vector*. This *words_vector* is used to keep track of which column a specific word belongs to. After this *words_vector* has been generated, the X matrix and Y vector can be filled in.

def generate_words_vector(training_set):
    words_vector = []
    for review in training_set:
        for word in review[0]:
            if word not in words_vector:
                words_vector.append(word)
    return words_vector

def generate_Y_vector(training_set, training_class):
    no_reviews = len(training_set)
    Y = np.zeros(shape=no_reviews)
    for ii in range(0, no_reviews):
        review_class = training_set[ii][1]
        Y[ii] = 1 if review_class == training_class else 0
    return Y

def generate_X_matrix(training_set, words_vector):
    no_reviews = len(training_set)
    no_words = len(words_vector)
    X = np.zeros(shape=(no_reviews, no_words+1))
    for ii in range(0, no_reviews):
        X[ii][0] = 1
        review_text = training_set[ii][0]
        total_words_in_review = len(review_text)
        for word in set(review_text):
            word_occurences = review_text.count(word)
            word_index = words_vector.index(word)+1
            X[ii][word_index] = word_occurences / float(total_words_in_review)
    return X

words_vector = generate_words_vector(training_set)
X = generate_X_matrix(training_set, words_vector)
Y_neg = generate_Y_vector(training_set, 'neg')

As we have done before, the gradient descent method can be applied to derive the feature vector θ from the X and Y matrices:

numIterations = 100
alpha = 0.55
m, n = np.shape(X)
theta = np.ones(n)
theta_neg = gradient_descent2(X, Y_neg, theta, alpha, m, numIterations)

What should we do if a specific review tests positive (Y=1) for more than one class? A review could result in Y=1 for both the *neu* class as well as the *neg* class. In that case we will pick the class with the highest score. This is called multinomial logistic regression.
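Picking the class with the highest score can be sketched as a one-vs-rest argmax over the per-class logistic scores. The theta vectors below are made-up toy values, not trained ones; in practice they would come from running gradient descent once per class, as above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify_multinomial(x, thetas):
    """Score x against one trained theta vector per class and pick the best."""
    scores = {label: sigmoid(np.dot(x, theta)) for label, theta in thetas.items()}
    return max(scores, key=scores.get)

# Hypothetical feature vectors for the 'pos', 'neu' and 'neg' classes
thetas = {
    'pos': np.array([0.1, 2.0]),
    'neu': np.array([0.2, 0.5]),
    'neg': np.array([0.3, -1.5]),
}
print(classify_multinomial(np.array([1.0, 3.0]), thetas))  # prints 'pos'
```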

So far, we have seen how to implement a Logistic Regression Classifier in its most basic form. It is true that building such a classifier from scratch, is great for learning purposes. It is also true that no one will get to the point of using deeper / more advanced Machine Learning skills without learning the basics first.

For real-world applications however, often the best solution is to not re-invent the wheel but to re-use tools which are already available. Tools which have been tested thoroughly and have been used by plenty of smart programmers before you. One such tool is Python’s NLTK library.

NLTK is Python’s Natural Language Toolkit and it can be used for a wide variety of Text Processing and Analytics jobs like tokenization, part-of-speech tagging and classification. It is easy to use and even includes a lot of text corpora, which can be used to train your model if you have no training set available.

Let us also have a look at how to perform sentiment analysis and text classification with NLTK. As always, we will use a training set to train NLTK’s Maximum Entropy Classifier and a test set to verify the results. Our training set has the following format:

training_set = [
 ([u'this', u'novel', u'if', u'bunch', u'of', u'childish', ..., u'ladies'], 'neg'),
 ([u'where', u'to', u'begin', u'jeez', u'gasping', u'blushing', ..., u'fail????'], 'neg'),
 ...
]

As you can see, the training set consists of a list of tuples of two elements. The first element is a list of the words in the text of the document and the second element is the class-label of this specific review (‘neg’, ‘neu’ or ‘pos’). Unfortunately NLTK’s Classifiers only accept the text in a hashable format (dictionaries for example), and that is why we need to convert this list of words into a dictionary of words.

def list_to_dict(words_list):
    return dict([(word, True) for word in words_list])

training_set_formatted = [(list_to_dict(element[0]), element[1]) for element in training_set]


Once the training set has been converted into the proper format, it can be fed into the train method of the MaxEnt Classifier:

import nltk

numIterations = 100
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(training_set_formatted, algorithm, max_iter=numIterations)
classifier.show_most_informative_features(10)

Once the training of the MaxEntClassifier is done, it can be used to classify the reviews in the test set:

for review in test_set_formatted:
    label = review[1]
    text = review[0]
    determined_label = classifier.classify(text)
    print(determined_label, label)

So far we have seen the theory behind the Naive Bayes Classifier and how to implement it (in the context of Text Classification) and in the previous and this blog-post we have seen the theory and implementation of Logistic Regression Classifiers. Although this is done at a basic level, it should give some understanding of the Logistic Regression method (I hope at a level where you can apply it and classify data yourself). There are however still many (advanced) topics which have not been discussed here:

- Which hill-climbing / gradient descent algorithm to use; IIS (Improved Iterative Scaling), GIS (Generalized Iterative Scaling), BFGS, L-BFGS or Coordinate Descent
- Encoding of the feature vector and the use of dummy variables
- Logistic Regression is an inherently sequential algorithm; although it is quite fast, you might need a parallelization strategy if you start using larger datasets.

If you see any errors please do not hesitate to contact me. If you have enjoyed reading, maybe even learned something, do not forget to subscribe to this blog and share it!

—

[1] See the paper of Nigam et al. on Maximum Entropy and the paper of Bo Pang et al. on Sentiment Analysis using Maximum Entropy. Also see Using Maximum Entropy for text classification (1999), A simple introduction to Maximum Entropy models (1997), A brief MaxEnt tutorial, and another good MIT article.

[2] See for example Chapter 7 of Speech and Language Processing by (Jurafsky & Martin): For the task of period disambiguation a feature could be whether or not a period is followed by a capital letter unless the previous word is *St.*

One of the most important tasks in Machine Learning is classification (a.k.a. supervised machine learning). Classification is used to make an accurate prediction of the class of entries in a test set (a dataset of which the entries have not yet been labelled) with the model which was constructed from a training set. You could think of classifying crime in the field of pre-policing, classifying patients in the health sector, classifying houses in the real-estate sector. Another field in which classification is big, is Natural Language Processing (NLP). The goal of this field of science is to make machines (computers) understand written (human) language. You could think of text categorization, sentiment analysis, spam detection and topic categorization.

For classification tasks there are three widely used algorithms; the Naive Bayes, Logistic Regression / Maximum Entropy and Support Vector Machines. We have already seen how the Naive Bayes works in the context of Sentiment Analysis. Although it is more accurate than a bag-of-words model, it has the assumption of conditional independence of its features. This is a simplification which makes the NB classifier easy to implement, but it is also unrealistic in most cases and leads to a lower accuracy. A direct improvement on the N.B. classifier, is an algorithm which does not assume conditional independence but tries to estimate the weight vectors (feature values) directly. This algorithm is called Maximum Entropy in the field of NLP and Logistic Regression in the field of Statistics.

Maximum Entropy might sound like a difficult concept, but actually it is not. It is a simple idea, which can be implemented with a few lines of code. But to fully understand it, we must first go into the basics of Regression and Logistic Regression.

Regression Analysis is the field of mathematics where the goal is to find a function which best correlates with a dataset. Let’s say we have a dataset containing n datapoints: x^(1), x^(2), ..., x^(n). For each of these (input) datapoints there is a corresponding (output) y-value. Here, the x-datapoints are called the independent variables and y the dependent variable; the value of y depends on the value of x, while the value of x may be freely chosen without any restriction imposed on it by any other variable.

The goal of Regression analysis is to find a function which can best describe the correlation between x and y. In the field of Machine Learning, this function is called the hypothesis function and is denoted as h_θ(x).

If we can find such a function, we can say we have successfully built a Regression model. If the input-data lives in a 2D-space, this boils down to finding a curve which fits through the data points. In the 3D case we have to find a plane and in higher dimensions a hyperplane.

To give an example, let’s say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset which contains the final grade of n students, together with the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time. The variable x^(i) therefore indicates how many hours student i has studied. The first thing we would do is visualize this data:

If the result looks like the figure on the left, then we are out of luck. It looks like the points are distributed randomly and there is no correlation between x and y at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.

This function could for example be:

h_θ(x) = θ_0 + θ_1·x

or

h_θ(x) = θ_0 + θ_1·x²

where θ_0, θ_1 are the parameters of our model.

In evaluating the results from the previous section, we may find the results unsatisfying; the function does not correlate with the datapoints strongly enough. Our initial assumption is probably not complete. Taking only the studying time into account is not enough. The final grade does not only depend on the studying time, but also on how much the students have slept the night before the exam. Now the dataset contains an additional variable which represents the sleeping time. Our dataset is then given by the triples (x_1^(i), x_2^(i), y^(i)). In this dataset x_1^(i) indicates how many hours student i has studied and x_2^(i) indicates how many hours he has slept.

This is an example of multivariate regression. The function has to include both variables. For example:

h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2

or

h_θ(x) = θ_0 + θ_1·x_1 + θ_2·x_2².

All of the above examples are examples of linear regression. We have seen that in some cases y depends on a linear form of x, but it can also depend on some power of x, or on the log or any other form of x. However, in all cases the hypothesis function was linear in the parameters θ.

So, what makes linear regression linear is not that y depends in a linear way on x, but that it depends in a linear way on θ. The hypothesis function needs to be linear with respect to the model-parameters θ. Mathematically speaking, it needs to satisfy the superposition principle. Examples of nonlinear regression would be:

h_θ(x) = θ_0 + x^θ_1

or

h_θ(x) = θ_0 · e^(θ_1·x)

The reason why the distinction is made between linear and nonlinear regression is that nonlinear regression problems are more difficult to solve and therefore more computational intensive algorithms are needed.

Linear regression models can be written as a linear system of equations, which can be solved by finding the closed-form solution with Linear Algebra. See these statistics notes for more on solving linear models with linear algebra.

As discussed before, such a closed-form solution can only be found for linear regression problems. However, even when the problem is linear in nature, we need to take into account that calculating the inverse of an n by n matrix has a time-complexity of O(n³). This means that for large datasets, finding the closed-form solution will take more time than solving it iteratively (the gradient descent method), as is done for nonlinear problems. So an iterative solution is usually preferred for larger datasets, even if the problem is linear.
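To make the closed-form route concrete, here is a minimal sketch of solving a noise-free linear fit via the normal equations, θ = (XᵀX)⁻¹Xᵀy. The data is made up for illustration; `np.linalg.solve` is used instead of forming the inverse explicitly, which is both faster and numerically safer.

```python
import numpy as np

# Small dataset lying exactly on the line y = 2 + 3x, with a bias column
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x

# Closed-form solution of the normal equations (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [2., 3.]
```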

The Gradient Descent method is a general optimization technique in which we try to find the values of the parameters θ with an iterative approach.

First, we construct a cost function (also known as loss function or error function) which gives the difference between the values of h_θ(x) (the values you expect y to have with the determined values of θ) and the actual values of y. The better your estimation of θ is, the better the values of h_θ(x) will approach the values of y.

Usually, the cost function is expressed as the squared error of this difference:

J(θ) = (1/m) · ∑_{i=1}^{m} ( h_θ(x^(i)) − y^(i) )²

At each iteration we choose new values for the parameters θ, and move towards the ‘true’ values of these parameters, i.e. the values which make this cost function as small as possible. The direction in which we have to move is the negative gradient direction:

−∇_θ J(θ).

The reason for this is that a function’s value decreases the fastest if we move towards the direction of the negative gradient (the directional derivative is maximal in the direction of the gradient).

Taking all this into account, this is how gradient descent works:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of θ, calculate the gradient of the cost function: ∇_θ J(θ).
- Update the values for the parameters: θ := θ − α · ∇_θ J(θ).
- Fill in these new values in the hypothesis function and calculate again the value of the cost function J(θ).

Just as important as the initial guess of the parameters is the value you choose for the learning rate α. This learning rate determines how fast you move along the slope of the gradient. If the selected value of this learning rate is too small, it will take too many iterations before you reach your convergence criteria. If this value is too large, you might overshoot and not converge.
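The effect of the learning rate is easy to demonstrate on a toy one-dimensional cost function (my own example, not from the text): J(θ) = θ², whose gradient is 2θ and whose minimum is at θ = 0.

```python
def minimize(alpha, iterations=50):
    """Gradient descent on J(theta) = theta**2, whose gradient is 2*theta."""
    theta = 10.0
    for _ in range(iterations):
        theta -= alpha * 2 * theta
    return theta

print(minimize(0.1))    # well-chosen: converges towards the minimum at 0
print(minimize(0.001))  # too small: has barely moved after 50 iterations
print(minimize(1.1))    # too large: overshoots the minimum and diverges
```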

Logistic Regression is similar to (linear) regression, but adapted for the purpose of classification. The difference is small; for Logistic Regression we also have to apply gradient descent iteratively to estimate the values of the parameters θ. And again, during the iteration, the values are estimated by taking the gradient of the cost function. And again, the cost function measures the difference between the hypothesis function and Y (for Logistic Regression we will use the binary cross-entropy instead of the squared error). The major difference however, is the form of the hypothesis function.

When you want to classify something, there are a limited number of classes it can belong to. And for each of these possible classes there can only be two states for y: either the example belongs to the specified class and y = 1, or it does not belong to the class and y = 0. Even though the output values are binary, the independent variables are still continuous. So, we need a function which has as input a large set of continuous variables and for each of these variables produces a binary output. This function, the hypothesis function, has the following form:

h_θ(x) = 1 / (1 + e^(−θᵀx)).

This function is also known as the logistic function, which is a part of the sigmoid function family. These functions are widely used in the natural sciences because they provide the simplest model for population growth. However, the reason why the logistic function is used for classification in Machine Learning is its ‘S-shape’.

As you can see, this function is bounded in the y-direction by 0 and 1. If the input z = θᵀx is very negative, the output will go to zero (the example does not belong to the class). If z is very positive, the output will be one and it does belong to the class. (Such a function is called an indicator function.)

The question then is, what will happen to input values which are neither very positive nor very negative, but somewhere ‘in the middle’. We have to define a decision boundary, which separates the positive from the negative class. Usually this decision boundary is chosen at the middle of the logistic function, namely at z = 0, where the output value is 0.5.

(1)   h_θ(x) ≥ 0.5  ⇒  y = 1,    h_θ(x) < 0.5  ⇒  y = 0
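The behaviour of the logistic function at, below and above this boundary can be sketched in a few lines:

```python
import math

def logistic(z):
    """The logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

print(logistic(-10))  # close to 0: the example does not belong to the class
print(logistic(0))    # exactly 0.5: on the decision boundary
print(logistic(10))   # close to 1: the example belongs to the class
```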

For those who are wondering where the θ we were talking about before entered the picture: as we can see in the formula of the logistic function, z = θᵀx. Meaning, the parameter vector θ (also known as the feature weights) maps the input variables x to a position z on the horizontal axis of the logistic function. With this z-value, we can use the logistic function to calculate the y-value. If this y-value ≥ 0.5 we assume the example does belong to the class, and vice versa.

So the features should be chosen such that they predict the class membership correctly. It is therefore essential to know which features are useful for the classification task. Once the appropriate features are selected, gradient descent can be used to find the optimal values of these features.

How can we do gradient descent with this logistic function? Except for the hypothesis function having a different form, the gradient descent method is exactly the same. We again have a cost function, of which we have to iteratively take the gradient w.r.t. the parameters θ and update the parameter values at each iteration.

This cost function is given by the binary cross-entropy:

(2)   J(θ) = −(1/m) · ∑_{i=1}^{m} [ y^(i) · log(h_θ(x^(i))) + (1 − y^(i)) · log(1 − h_θ(x^(i))) ]

We know that:

h_θ(x^(i)) = 1 / (1 + e^(−θᵀx^(i)))

and

(3)   1 − h_θ(x^(i)) = e^(−θᵀx^(i)) / (1 + e^(−θᵀx^(i)))

Plugging these two equations back into the cost function gives us:

(4)   J(θ) = (1/m) · ∑_{i=1}^{m} [ log(1 + e^(−θᵀx^(i))) + (1 − y^(i)) · θᵀx^(i) ]

The gradient of the cost function with respect to θ is given by

(5)   ∇_θ J(θ) = (1/m) · ∑_{i=1}^{m} ( h_θ(x^(i)) − y^(i) ) · x^(i)

So the gradient of the seemingly difficult cost function, turns out to be a much simpler equation. And with this simple equation, gradient descent for Logistic Regression is again performed in the same way:

- Make an initial but intelligent guess for the values of the parameters θ.
- Keep iterating while the value of the cost function has not met your criteria:
- With the current values of θ, calculate the gradient of the cost function: ∇_θ J(θ).
- Update the values for the parameters: θ := θ − α · ∇_θ J(θ).
- Fill in these new values in the hypothesis function and calculate again the value of the cost function J(θ).

In the previous section we have seen how we can use Gradient Descent to estimate the feature values θ, which can then be used to determine the class with the Logistic function. As stated in the introduction, this can be used for a wide variety of classification tasks. The only thing that will be different for each of these classification tasks is the form the features take on.

Here we will continue to look at the example of Text Classification; Lets assume we are doing Sentiment Analysis and want to know whether a specific review should be classified as positive, neutral or negative.

The first thing we need to know is which and what types of features we need to include.

For NLP we will need a large number of features; often as large as the number of words present in the training set. We could reduce the number of features by excluding stopwords, or by only considering n-gram features.

For example, the 5-gram ‘kept me reading and reading’ is much less likely to occur in a review-document than the unigram ‘reading’, but if it occurs it is much more indicative of the class (positive) than ‘reading’. Since we only need to consider n-grams which actually are present in the training set, there will be far fewer features if we only consider n-grams instead of unigrams.
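Extracting the n-grams present in a tokenized document can be sketched as follows (the helper name is my own):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) occurring in a token list."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]

tokens = ['kept', 'me', 'reading', 'and', 'reading']
print(ngrams(tokens, 1))  # the five unigrams
print(ngrams(tokens, 5))  # the single 5-gram 'kept me reading and reading'
```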

The second thing we need to know is the actual value of these features. The values are learned by initializing all features to zero, and applying the gradient descent method using the labeled examples in the training set. Once we know the values for the features, we can compute the probability for each class and choose the class with the maximum probability. This is done with the logistic function given above.

In this post we have discussed only the theory of Maximum Entropy and Logistic Regression. Usually such discussions are better understood with examples and the actual code. I will save that for the next blog.

If you have enjoyed reading this post or maybe even learned something from it, subscribe to this blog so you can receive a notification the next time something is posted.

Miles Osborne, Using Maximum Entropy for Sentence Extraction (2002)

Jurafsky and Martin, Speech and Language Processing; Chapter 7

Nigam et al., Using Maximum Entropy for Text Classification


With the bag-of-words model we check which word of the text-document appears in a positive-words-list or a negative-words-list. If the word appears in a positive-words-list the total score of the text is updated with +1 and vice versa. If at the end the total score is positive, the text is classified as positive and if it is negative, the text is classified as negative. Simple enough!
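The bag-of-words scheme described above can be sketched in a few lines (the word lists below are toy examples drawn from the reviews earlier in this post; note that this sketch classifies a tie score as negative):

```python
def bag_of_words_score(tokens, positive_words, negative_words):
    """+1 for every word in the positive list, -1 for every negative word."""
    score = 0
    for word in tokens:
        if word in positive_words:
            score += 1
        elif word in negative_words:
            score -= 1
    return 'pos' if score > 0 else 'neg'

positive_words = {'beautiful', 'brilliant', 'stunning'}
negative_words = {'unreadable', 'wonky', 'typos'}
print(bag_of_words_score(['a', 'brilliant', 'stunning', 'edition'],
                         positive_words, negative_words))  # prints 'pos'
```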

With the Naive Bayes model, we do not take only a small set of positive and negative words into account, but all words the NB Classifier was trained with, i.e. all words present in the training set. If a word has not appeared in the training set, we have no data available and apply Laplacian smoothing (use 1 instead of the conditional probability of the word).

The probability a document belongs to a class c is given by the class probability P(c) multiplied by the products of the conditional probabilities of each word for that class:

P(c | doc) ∝ P(c) · ∏_{i=1}^{n} count(w_i, c) / N_c

Here count(w_i, c) is the number of occurrences of word w_i in class c, N_c is the total number of words in class c, and n is the number of words in the document we are currently classifying.

N_c does not change (unless the training set is expanded), so it can be placed outside of the product:

P(c | doc) ∝ ( P(c) / N_c^n ) · ∏_{i=1}^{n} count(w_i, c)

With this information it is easy to implement a Naive Bayes Text Classifier. (Naive Bayes can also be used to classify non-text / numerical datasets, for an explanation see this notebook).

We have a NaiveBayesText class, which accepts the input values for X and Y as parameters for the “train()” method. Here X is a list of lists, where each lower level list contains all the words in the document. Y is a list containing the label/class of each document.

from collections import Counter, defaultdict
import numpy as np

class NaiveBaseClass:
    def calculate_relative_occurences(self, list1):
        no_examples = len(list1)
        ro_dict = dict(Counter(list1))
        for key in ro_dict.keys():
            ro_dict[key] = ro_dict[key] / float(no_examples)
        return ro_dict

    def get_max_value_key(self, d1):
        values = list(d1.values())
        keys = list(d1.keys())
        max_value_index = values.index(max(values))
        max_key = keys[max_value_index]
        return max_key

    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = defaultdict(list)

class NaiveBayesText(NaiveBaseClass):
    """
    When the goal is classifying text, it is better to give the input X
    in the form of a list of lists containing words:
    X = [ ['this', 'is', 'a', ...], (...) ]
    Y still is a 1D array / list containing the labels of each entry.
    """
    def initialize_nb_dict(self):
        self.nb_dict = {}
        for label in self.labels:
            self.nb_dict[label] = []

    def train(self, X, Y):
        self.class_probabilities = self.calculate_relative_occurences(Y)
        self.labels = np.unique(Y)
        self.no_examples = len(Y)
        self.initialize_nb_dict()
        for ii in range(0, len(Y)):
            label = Y[ii]
            self.nb_dict[label] += X[ii]
        # transform the list with all occurences to a dict with relative occurences
        for label in self.labels:
            self.nb_dict[label] = self.calculate_relative_occurences(self.nb_dict[label])

As we can see, the training of the Naive Bayes Classifier is done by iterating through all of the documents in the training set. From all of the documents, a hash table (a dictionary in Python) with the relative occurrence of each word per class is constructed.

This is done in two steps:

1. construct a huge list of all occuring words per class:

for ii in range(0, len(Y)):
    label = Y[ii]
    self.nb_dict[label] += X[ii]

2. calculate the relative occurrence of each word in this huge list, with the “calculate_relative_occurences” method. This method simply uses Python’s Counter module to count how often each word occurs and then divides this number by the total number of words.
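The two steps above can be sketched on a toy per-class word list (the data is made up for illustration; the real class builds these lists from the training set):

```python
from collections import Counter

# Hypothetical word lists collected per class in step 1
nb_dict = {'pos': ['great', 'great', 'book'], 'neg': ['bad', 'book']}

# Step 2: turn absolute counts into relative occurrences per class
for label, words in nb_dict.items():
    counts = Counter(words)
    total = float(len(words))
    nb_dict[label] = {word: count / total for word, count in counts.items()}

print(nb_dict['pos']['great'])  # 2 of the 3 'pos' words
```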

The result is saved in the dictionary *nb_dict*.

As we can see, it is easy to train the Naive Bayes Classifier. We simply calculate the relative occurrence of each word per class, and save the result in the “nb_dict” dictionary.

This dictionary can be updated, saved to file, and loaded back from file. It contains the results of Naive Bayes Classifier training.

Classifying new documents is also done quite easily by calculating the class probability for each class and then selecting the class with the highest probability.

def classify_single_elem(self, X_elem):
    Y_dict = {}
    for label in self.labels:
        class_probability = self.class_probabilities[label]
        nb_dict_features = self.nb_dict[label]
        for word in X_elem:
            if word in nb_dict_features.keys():
                relative_word_occurence = nb_dict_features[word]
                class_probability *= relative_word_occurence
            else:
                class_probability *= 1  # unseen word: use 1 instead of 0 (smoothing)
        Y_dict[label] = class_probability
    return self.get_max_value_key(Y_dict)

def classify(self, X):
    self.predicted_Y_values = []
    n = len(X)
    for ii in range(0, n):
        X_elem = X[ii]
        prediction = self.classify_single_elem(X_elem)
        self.predicted_Y_values.append(prediction)
    return self.predicted_Y_values

In the next blog we will look at the results of this naively implemented algorithm for the Naive Bayes Classifier and see how it performs under various conditions; we will see the influence of varying training set sizes and whether the use of n-gram features will improve the accuracy of the classifier.
