Follow

Jun 9, 2020

·

11 min read

·

## Finally understand the concept behind SVM + Implementation in Python via scikit-learn

*Support Vector Machines (SVM)* are not new but are still a powerful tool for classification due to their tendency not to overfit, but to perform well in many cases. If you are only interested in a certain topic, just scroll over the topics. These are the topics in chronological order:

*What’s the**mathematical concept**behind the Support Vector Machine?**What is a kernel and what are**kernel functions**?**What is the**kernel trick**?**What is the**dual problem**of a SVM?**How does**Multiclass Classification**take place?***Implementation**via Python and scikit-learn

**If you are only interested in how it can be implemented using Python and scikit-learn, scroll down to the end!**

## Let’s get started.

The objective is to find a hyperplane in an n-dimensional space that separates the data points to their potential classes. The hyperplane should be positioned with the maximum distance to the data points. The data points with the minimum distance to the hyperplane are called *Support Vectors*. Due to their close position,* *their influence on the exact position of the hyperplane is bigger than of other data points. In the graphic below the Support Vectors are the 3 points (2 blue, 1 green) laying on the lines.

SVMs are also called *kernelized SVM* due to their kernel that converts the input data space into a higher-dimensional space.

The input space X consists of x and x’.

represents the *kernel function* that turns the input space into a higher-dimensional space, so that not every data point is explicitly mapped.

The *kernel function* can also be written as

How the function is defined and useful for laying hyperplanes depends on the data:

**Kernel Functions**

The most popular kernel functions, that are also available in *scikit-learn *are* linear, polynomial, radial basis function *and* sigmoid*. For more functions visit dataflair. In the following you can see how these four kernel functions look like:

**Linear Function**

**2. Polynomial Function**

**3. Radial Basis Function (RBF)**

**4. Sigmoid Function**

What does the kernel function do?

It takes two datapoints x_n and x_m and calculates a distance score of those. This score is higher for closer datapoints and vice versa. Using this score helps to transform the data points to a higher-dimensional mapping, which reduces the computational effort and time and is especially useful for huge amounts of data. It prevents the need for a more complex transformation.

That’s why this step is often referred to as *Kernel Trick*.

As can be seen in the graphic below, the mapping of the data points is turned from a 2D to a 3D space using a kernel function (φ((*a*, *b*)) = (*a*, *b*, *a*2 + *b*2)). The previously centered red dots are now also vertically lower located when turned into a 3D-space. Not clearly separable data points can now better divided by using a kernel.

Furthermore, different kernels can help to lay hyperplanes of diverse shapes through the cloud of data points. Obviously the limits of linear hyperplanes are quickly exceeded due to their restricted adaptability to different shapes. Based on this fact the different kernel functions have been developed.

For a better understanding of how the separation of different hyperplane can look like, the different kinds of kernel functions are visualized in the graphic below.

The following formula poses the optimization problem that is tackled by SVMs. It is explained further down (scikit-learn, n.d.):

The objective is to classify as many data points correctly as possible by maximizing the margin from the* Support Vectors* to the hyperplane while minimizing the term

In other words, the objective can also be explained as finding optimal **w **and **b **that most samples are predicted correctly.

Mostly not all data points can be allocated perfectly so that a distance to the correct margin is represented by

The *normal vector* creates a line that runs through the coordinate origin. The hyperplanes cut this straight line orthogonal at a distance

from the origin.

For the ideal case (Bishop, p.325 ff., 2006)

would be ≥ 1 and followingly perfectly predicted. Having now data points with distance to their ideal position, lets us correct the ideal case of being ≥ 1 to

Simultaneously a penalty term is introduced in the minimization formula. C acts as a *regularization parameter* and controls how strong the penalty is regarding how many data points have been falsely assigned with a total distance of

**The dual problem**

The optimization task can be referred to as a *dual problem*, trying to minimize the parameters, while maximizing the margin. To solve the dual problem, Lagrange multipliers are utilized (alpha≥0).

This leads to a Lagrangian function of (Bishop, p.325 ff., 2006):

Utilizing the following two conditions (Bishop, p.325 ff., 2006):

w and b can be eliminated from *L(w,b,a)*. This leads to the following Lagrangian function which is maximized as (Bishop, p.325 ff., 2006):

Solving the optimization problem, new data points can be classified by using (Bishop, p.325 ff., 2006):

For the kernel function *k(x_n,x_m)* the previously explained kernel functions (sigmoid, linear, polynomial, rbf) can be filled in.

**And that’s it! **If you could follow the math, you understand now the principle behind a support vector machine. It’s easy to understand how to divide a cloud of data points into two classes, but how is it done for multiple classes?

Let’s see how this works.

In its most simple type SVM are applied on binary classification, dividing data points either in 1 or 0. For multiclass classification, the same principle is utilized. The multiclass problem is broken down to multiple binary classification cases, which is also called *one-vs-one*. In scikit-learn one-vs-one is not default and needs to be selected explicitly (as can be seen further down in the code). *One-vs-rest* is set as default. It basically divides the data points in class x and rest. Consecutively a certain class is distinguished from all other classes.

The number of classifiers necessary for *one-vs-one multiclass classification* can be retrieved with the following formula (with n being the number of classes):

In the one-vs-one approach, each classifier separates points of two different classes and comprising all one-vs-one classifiers leads to a multiclass classifier.

You might already be bored from the iris dataset, but it’s the most simple way to demonstrate it, so let's take a look on some code. Some parts of the code you can also find under scikit-learn.

**#Importing the necessary packages and libaries**

from sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

from sklearn import svm, datasets

import matplotlib.pyplot as plt

import numpy as np

Let’s load the iris dataset as iris and store target and feature variables:

iris = datasets.load_iris()#Store variables as target y and the first two features as X (sepal length and sepal width of the iris flowers)

X = iris.data[:, :2]

y = iris.target

And now let’s split the dataset in train and test-set for the following training and prediction:

`X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)`

In this step, we take a look at the different kernel functions. The penalty term C is set to 1 for all classifiers. For the multiclass classification, the type one-versus-one is specified, as can be seen in decision_function_shape=’ovo’. For the polynomial function, the degree of 3 is selected, this is not necessary for other kernel functions.

All other parameters are set to default. Here you can read more about the SVC-function of scikit-learn.

linear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)

Now let's specify the mesh, in which we will plot the results.

#stepsize in the mesh, it alters the accuracy of the plotprint

#to better understand it, just play with the value, change it and print it

h = .01#create the mesh

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))# create the title that will be shown on the plottitles = ['Linear kernel','RBF kernel','Polynomial kernel','Sigmoid kernel']

Now we’ll use a for loop to plot all 4 kernel functions:

for i, clf in enumerate((linear, rbf, poly, sig)):

#defines how many plots: 2 rows, 2columns=> leading to 4 plots

plt.subplot(2, 2, i + 1)#i+1 is the index

#space between plots

plt.subplots_adjust(wspace=0.4, hspace=0.4) Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])# Put the result into a color plot

Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7)# Plot also the training points

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey') plt.xlabel('Sepal length')

plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())

plt.ylim(yy.min(), yy.max())

plt.xticks(())

plt.yticks(())

plt.title(titles[i]) plt.show()

As you might have probably recognized, the result is the image from above in the article:

In the next step we make predictions on the test data set using our 4 different kernel functions:

`linear_pred = linear.predict(X_test)`

poly_pred = poly.predict(X_test)

rbf_pred = rbf.predict(X_test)

sig_pred = sig.predict(X_test)

To understand how well they perform, we utilize a performance measure — accuracy.

# retrieve the accuracy and print it for all 4 kernel functions

accuracy_lin = linear.score(X_test, y_test)

accuracy_poly = poly.score(X_test, y_test)

accuracy_rbf = rbf.score(X_test, y_test)

accuracy_sig = sig.score(X_test, y_test)print(“Accuracy Linear Kernel:”, accuracy_lin)

print(“Accuracy Polynomial Kernel:”, accuracy_poly)

print(“Accuracy Radial Basis Kernel:”, accuracy_rbf)

print(“Accuracy Sigmoid Kernel:”, accuracy_sig

As the accuracy reveals, some kernel functions are more useful than others, depending on the data. And obviously, more data is also helpful for improving the results (the iris data has a not really big size with 50 samples :-) ).

Let’s head to the last step — printing confusion matrices for the 4 kernel functions to understand how and what has been predicted:

# creating a confusion matrixcm_lin = confusion_matrix(y_test, linear_pred)

cm_poly = confusion_matrix(y_test, poly_pred)

cm_rbf = confusion_matrix(y_test, rbf_pred)

cm_sig = confusion_matrix(y_test, sig_pred)print(cm_lin)

print(cm_poly)

print(cm_rbf)

print(cm_sig)

That’s it. I hope it helped to better understand Support Vector Machines, the mathematical concepts, the Kernel trick and the topic Multiclass Classification via SVM.

**Thanks for reading, I appreciate feedback!**

## Sources

- Bishop, 2006, Pattern recognition and machine learning, p.325 ff.
- Vapnik & Cortes, 1995, Support-vector networks
- Scikit-learn, Support Vector Machines
- Scikit-learn, Plot different SVM classifiers in the iris dataset
- Dataflair, Kernel Functions — Introduction to SVM Kernel & Examples

## The whole code

#Importing the necessary packages and libariesfrom sklearn.metrics import confusion_matrix

from sklearn.model_selection import train_test_split

from sklearn import svm, datasets

import matplotlib.pyplot as plt

import numpy as npiris = datasets.load_iris()#Store variables as target y and the first two features as X (sepal length and sepal width of the iris flowers)

X = iris.data[:, :2]

y = iris.targetlinear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)#stepsize in the mesh, it alters the accuracy of the plotprint

#to better understand it, just play with the value, change it and print it

h = .01#create the mesh

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1

y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1

xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))# create the title that will be shown on the plottitles = ['Linear kernel','RBF kernel','Polynomial kernel','Sigmoid kernel']for i, clf in enumerate((linear, rbf, poly, sig)):

#defines how many plots: 2 rows, 2columns=> leading to 4 plots

plt.subplot(2, 2, i + 1)#i+1 is the index

#space between plots

plt.subplots_adjust(wspace=0.4, hspace=0.4)Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])# Put the result into a color plot

Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7)# Plot also the training points

plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey')plt.xlabel('Sepal length')

plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())

plt.ylim(yy.min(), yy.max())

plt.xticks(())

plt.yticks(())

plt.title(titles[i])plt.show()linear_pred = linear.predict(X_test)

poly_pred = poly.predict(X_test)

rbf_pred = rbf.predict(X_test)

sig_pred = sig.predict(X_test)# retrieve the accuracy and print it for all 4 kernel functions

accuracy_lin = linear.score(X_test, y_test)

accuracy_poly = poly.score(X_test, y_test)

accuracy_rbf = rbf.score(X_test, y_test)

accuracy_sig = sig.score(X_test, y_test)print(“Accuracy Linear Kernel:”, accuracy_lin)

print(“Accuracy Polynomial Kernel:”, accuracy_poly)

print(“Accuracy Radial Basis Kernel:”, accuracy_rbf)

print(“Accuracy Sigmoid Kernel:”, accuracy_sig# creating a confusion matrixcm_lin = confusion_matrix(y_test, linear_pred)

cm_poly = confusion_matrix(y_test, poly_pred)

cm_rbf = confusion_matrix(y_test, rbf_pred)

cm_sig = confusion_matrix(y_test, sig_pred)print(cm_lin)

print(cm_poly)

print(cm_rbf)

print(cm_sig)