Finally understand the concept behind SVM + Implementation in Python via scikit-learn
Support Vector Machines (SVMs) are not new, but they are still a powerful tool for classification because they tend not to overfit and perform well in many cases. If you are only interested in a certain topic, just scroll to it. These are the topics in order:
- What’s the mathematical concept behind the Support Vector Machine?
- What is a kernel and what are kernel functions?
- What is the kernel trick?
- What is the dual problem of an SVM?
- How does Multiclass Classification take place?
- Implementation via Python and scikit-learn
If you are only interested in how it can be implemented using Python and scikit-learn, scroll down to the end!
Let’s get started.
The objective is to find a hyperplane in an n-dimensional space that separates the data points according to their classes. The hyperplane should be positioned with the maximum distance to the data points. The data points with the minimum distance to the hyperplane are called Support Vectors. Due to their close position, their influence on the exact position of the hyperplane is bigger than that of the other data points. In the graphic below the Support Vectors are the 3 points (2 blue, 1 green) lying on the lines.
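As a small preview (not part of the original article), scikit-learn exposes the support vectors of a fitted classifier directly, here sketched with the same iris features that are used in the code further down:

from sklearn import svm, datasets

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# Fit a linear SVC and inspect which training points became support vectors
clf = svm.SVC(kernel='linear', C=1).fit(X, y)
print(clf.support_vectors_)   # coordinates of the support vectors
print(clf.n_support_)         # number of support vectors per class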
SVMs are also called kernelized SVMs because of their kernel, which converts the input data space into a higher-dimensional space.
The input space X consists of the data points x and x′. The kernel k(x, x′) works on the input space as if it were transformed into a higher-dimensional feature space, without every data point having to be mapped explicitly.
With the feature map φ, the kernel function can also be written as k(x, x′) = ⟨φ(x), φ(x′)⟩.
How the function is defined and how useful it is for laying hyperplanes depends on the data:
Kernel Functions
The most popular kernel functions, which are also available in scikit-learn, are linear, polynomial, radial basis function and sigmoid. For more functions, visit dataflair. The four kernel functions are listed below, followed by a sketch of how they can be written in NumPy:
1. Linear Function
2. Polynomial Function
3. Radial Basis Function (RBF)
4. Sigmoid Function
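As a rough sketch (not from the original article), and assuming the parameterization that scikit-learn uses (gamma, coef0 r and degree d), the four kernels can be written in NumPy like this:

import numpy as np

# Sketch of the four kernel functions for two data points x and z
def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=3):
    return (gamma * np.dot(x, z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(x - z) ** 2)

def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    return np.tanh(gamma * np.dot(x, z) + r)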
What does the kernel function do?
It takes two data points x_n and x_m and calculates a similarity score for them. This score is higher for closer data points and lower for data points that are far apart. Using this score gives the same result as mapping the data points into a higher-dimensional space and comparing them there, but without ever performing the mapping explicitly, which reduces the computational effort and time and is especially useful for huge amounts of data. It prevents the need for a more complex transformation.
That’s why this step is often referred to as Kernel Trick.
As can be seen in the graphic below, the mapping of the data points is turned from a 2D into a 3D space using the mapping φ((a, b)) = (a, b, a² + b²). The previously centered red dots now lie lower along the new vertical axis in the 3D space. Data points that are not clearly separable can now be divided much better by using a kernel.
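To make the trick concrete, here is a small check (an illustration added here, not part of the original graphic) with the degree-2 polynomial kernel k(x, z) = (xᵀz)²: the kernel value equals the dot product of the explicitly mapped 3D points, so the mapping never has to be computed in practice.

import numpy as np

# Explicit degree-2 feature map for 2D points: phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # map first, then take the dot product
kernel = np.dot(x, z) ** 2          # kernel trick: no explicit mapping needed

print(explicit, kernel)             # both print 16.0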
Furthermore, different kernels make it possible to lay hyperplanes of diverse shapes through the cloud of data points. The limits of linear hyperplanes are quickly reached due to their restricted adaptability to different shapes, and it is exactly for this reason that the different kernel functions have been developed.
For a better understanding of what the separation by the different hyperplanes can look like, the different kinds of kernel functions are visualized in the graphic below.
The following formulation poses the optimization problem that is tackled by SVMs; it is explained further down (scikit-learn, n.d.):

min over w, b, ζ of (1/2)‖w‖² + C·Σ_n ζ_n, subject to y_n(wᵀφ(x_n) + b) ≥ 1 − ζ_n and ζ_n ≥ 0

The objective is to classify as many data points correctly as possible by maximizing the margin from the Support Vectors to the hyperplane while minimizing the term (1/2)‖w‖².
In other words, the objective can also be described as finding optimal w and b such that most samples are predicted correctly.
Usually not all data points can be allocated perfectly, so the distance to the correct margin is represented by the slack variables ζ_n.
The normal vector w creates a line that runs through the coordinate origin. The hyperplane cuts this straight line orthogonally at a distance of |b|/‖w‖ from the origin.
In the ideal case (Bishop, p.325 ff., 2006), y_n(wᵀφ(x_n) + b) would be ≥ 1 for every sample, and every sample would be perfectly predicted. Since data points usually have some distance ζ_n to their ideal position, the ideal condition of being ≥ 1 is relaxed to y_n(wᵀφ(x_n) + b) ≥ 1 − ζ_n.
Simultaneously the penalty term C·Σ_n ζ_n is introduced in the minimization formula. C acts as a regularization parameter and controls how strong the penalty is for data points that have been falsely assigned or lie inside the margin, with a total distance of Σ_n ζ_n.
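As a small illustration of the role of C (a sketch, assuming the linear kernel and the first two iris features used later in the article): a small C tolerates many margin violations, while a large C penalizes them harder, which usually reduces the number of support vectors.

from sklearn import svm, datasets

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

# The number of support vectors usually shrinks as C grows,
# because fewer margin violations are tolerated
for C in (0.01, 1, 100):
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    print(C, clf.n_support_.sum())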
The dual problem
The optimization task can be referred to as a dual problem: the parameters w and b are minimized while the margin is maximized. To solve the dual problem, Lagrange multipliers a_n ≥ 0 are utilized, one per constraint.
This leads to the Lagrangian function (Bishop, p.325 ff., 2006):

L(w, b, a) = (1/2)‖w‖² − Σ_n a_n (y_n(wᵀφ(x_n) + b) − 1)

Setting the derivatives with respect to w and b to zero yields the following two conditions (Bishop, p.325 ff., 2006):

w = Σ_n a_n y_n φ(x_n) and Σ_n a_n y_n = 0

With these, w and b can be eliminated from L(w, b, a). This leads to the dual representation, which is maximized with respect to a (Bishop, p.325 ff., 2006):

L̃(a) = Σ_n a_n − (1/2)·Σ_n Σ_m a_n a_m y_n y_m k(x_n, x_m), subject to a_n ≥ 0 and Σ_n a_n y_n = 0

After solving the optimization problem, new data points can be classified by using (Bishop, p.325 ff., 2006):

y(x) = Σ_n a_n y_n k(x, x_n) + b

For the kernel function k(x_n, x_m), the previously explained kernel functions (linear, polynomial, RBF, sigmoid) can be filled in.
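A fitted scikit-learn SVC stores exactly the quantities appearing in this formula: dual_coef_ holds the products a_n·y_n of the support vectors, support_vectors_ the corresponding x_n, and intercept_ the bias b. As a sketch (restricted to two iris classes so that there is a single binary decision function), the decision values can be recomputed by hand and compared with decision_function:

import numpy as np
from sklearn import svm, datasets
from sklearn.metrics.pairwise import rbf_kernel

iris = datasets.load_iris()
# Keep only two classes so the model has a single binary decision function
X, y = iris.data[iris.target < 2, :2], iris.target[iris.target < 2]

gamma = 1.0
clf = svm.SVC(kernel='rbf', gamma=gamma, C=1).fit(X, y)

# y(x) = sum_n a_n * y_n * k(x, x_n) + b, summed over the support vectors
K = rbf_kernel(X[:3], clf.support_vectors_, gamma=gamma)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_

print(manual)
print(clf.decision_function(X[:3]))  # should match the manual computation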
And that’s it! If you could follow the math, you now understand the principle behind a Support Vector Machine. It’s easy to understand how to divide a cloud of data points into two classes, but how is it done for multiple classes?
Let’s see how this works.
In its simplest form, an SVM is applied to binary classification, assigning data points either to class 1 or 0. For multiclass classification, the same principle is utilized: the multiclass problem is broken down into multiple binary classification cases. In the one-vs-one approach, one binary classifier is trained for every pair of classes; in the one-vs-rest approach, each class is distinguished from all remaining classes ("class x vs. rest"). Note that scikit-learn's SVC always uses one-vs-one internally for multiclass problems; the decision_function_shape parameter (default 'ovr', set explicitly to 'ovo' in the code further down) only controls the shape of the returned decision function.
The number of classifiers necessary for one-vs-one multiclass classification can be calculated with the following formula (with n being the number of classes): n(n − 1)/2.
In the one-vs-one approach each classifier separates points of two different classes, and combining all one-vs-one classifiers leads to a multiclass classifier.
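For the three iris classes this gives n(n − 1)/2 = 3 binary classifiers. As a quick check (a sketch using the same setup as the code further down), the width of the one-vs-one decision function reflects this directly:

from sklearn import svm, datasets

iris = datasets.load_iris()
X, y = iris.data[:, :2], iris.target

n_classes = len(set(y))
print(n_classes * (n_classes - 1) // 2)        # 3 one-vs-one classifiers

clf = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X, y)
print(clf.decision_function(X[:1]).shape)      # (1, 3): one column per pair of classes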
You might already be bored by the iris dataset, but it’s the simplest way to demonstrate this, so let’s take a look at some code. Some parts of the code can also be found in the scikit-learn documentation.
#Importing the necessary packages and libraries
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import svm, datasets
import matplotlib.pyplot as plt
import numpy as np
Let’s load the iris dataset as iris and store target and feature variables:
iris = datasets.load_iris()

#Store variables as target y and the first two features as X (sepal length and sepal width of the iris flowers)
X = iris.data[:, :2]
y = iris.target
And now let’s split the dataset in train and test-set for the following training and prediction:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)
In this step, we take a look at the different kernel functions. The penalty term C is set to 1 for all classifiers. For the multiclass classification, the type one-versus-one is specified, as can be seen in decision_function_shape='ovo'. For the polynomial function, a degree of 3 is selected; this parameter is not needed for the other kernel functions.
All other parameters are set to default. Here you can read more about the SVC-function of scikit-learn.
linear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)
rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)
poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)
sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)
Now let's specify the mesh, in which we will plot the results.
#stepsize in the mesh, it alters the accuracy of the plot
#to better understand it, just play with the value, change it and print it
h = .01

#create the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# create the titles that will be shown on the plots
titles = ['Linear kernel', 'RBF kernel', 'Polynomial kernel', 'Sigmoid kernel']
Now we’ll use a for loop to plot all 4 kernel functions:
for i, clf in enumerate((linear, rbf, poly, sig)):
    #defines how many plots: 2 rows, 2 columns => leading to 4 plots
    plt.subplot(2, 2, i + 1)  #i+1 is the index
    #space between plots
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey')
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()
As you have probably recognized, the result is the image shown earlier in the article:
In the next step we make predictions on the test data set using our 4 different kernel functions:
linear_pred = linear.predict(X_test)
poly_pred = poly.predict(X_test)
rbf_pred = rbf.predict(X_test)
sig_pred = sig.predict(X_test)
To understand how well they perform, we utilize a performance measure — accuracy.
# retrieve the accuracy and print it for all 4 kernel functions
accuracy_lin = linear.score(X_test, y_test)
accuracy_poly = poly.score(X_test, y_test)
accuracy_rbf = rbf.score(X_test, y_test)
accuracy_sig = sig.score(X_test, y_test)

print("Accuracy Linear Kernel:", accuracy_lin)
print("Accuracy Polynomial Kernel:", accuracy_poly)
print("Accuracy Radial Basis Kernel:", accuracy_rbf)
print("Accuracy Sigmoid Kernel:", accuracy_sig)
As the accuracy reveals, some kernel functions are more useful than others, depending on the data. And obviously, more data also helps to improve the results (the iris dataset is not really big, with only 150 samples, i.e. 50 per class :-) ).
Let’s head to the last step — printing confusion matrices for the 4 kernel functions to understand how and what has been predicted:
# creating a confusion matrix
cm_lin = confusion_matrix(y_test, linear_pred)
cm_poly = confusion_matrix(y_test, poly_pred)
cm_rbf = confusion_matrix(y_test, rbf_pred)
cm_sig = confusion_matrix(y_test, sig_pred)

print(cm_lin)
print(cm_poly)
print(cm_rbf)
print(cm_sig)
That’s it. I hope it helped to better understand Support Vector Machines, the mathematical concepts, the Kernel trick and the topic Multiclass Classification via SVM.
Thanks for reading, I appreciate feedback!
Sources
- Bishop, C. M., 2006, Pattern Recognition and Machine Learning, p. 325 ff.
- Cortes, C. & Vapnik, V., 1995, Support-Vector Networks
- Scikit-learn, Support Vector Machines
- Scikit-learn, Plot different SVM classifiers in the iris dataset
- DataFlair, Kernel Functions — Introduction to SVM Kernel & Examples
The whole code
#Importing the necessary packages and libraries
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import svm, datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()

#Store variables as target y and the first two features as X (sepal length and sepal width of the iris flowers)
X = iris.data[:, :2]
y = iris.target

#split the dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

linear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)
rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)
poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)
sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)

#stepsize in the mesh, it alters the accuracy of the plot
#to better understand it, just play with the value, change it and print it
h = .01

#create the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# create the titles that will be shown on the plots
titles = ['Linear kernel', 'RBF kernel', 'Polynomial kernel', 'Sigmoid kernel']

for i, clf in enumerate((linear, rbf, poly, sig)):
    #defines how many plots: 2 rows, 2 columns => leading to 4 plots
    plt.subplot(2, 2, i + 1)  #i+1 is the index
    #space between plots
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey')
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()

# make predictions on the test set with all 4 kernel functions
linear_pred = linear.predict(X_test)
poly_pred = poly.predict(X_test)
rbf_pred = rbf.predict(X_test)
sig_pred = sig.predict(X_test)

# retrieve the accuracy and print it for all 4 kernel functions
accuracy_lin = linear.score(X_test, y_test)
accuracy_poly = poly.score(X_test, y_test)
accuracy_rbf = rbf.score(X_test, y_test)
accuracy_sig = sig.score(X_test, y_test)

print("Accuracy Linear Kernel:", accuracy_lin)
print("Accuracy Polynomial Kernel:", accuracy_poly)
print("Accuracy Radial Basis Kernel:", accuracy_rbf)
print("Accuracy Sigmoid Kernel:", accuracy_sig)

# creating a confusion matrix
cm_lin = confusion_matrix(y_test, linear_pred)
cm_poly = confusion_matrix(y_test, poly_pred)
cm_rbf = confusion_matrix(y_test, rbf_pred)
cm_sig = confusion_matrix(y_test, sig_pred)

print(cm_lin)
print(cm_poly)
print(cm_rbf)
print(cm_sig)