Multiclass Classification with Support Vector Machines (SVM), Kernel Trick & Kernel Functions


Jun 9, 2020 · 11 min read


Finally understand the concept behind SVM + Implementation in Python via scikit-learn


Support Vector Machines (SVM) are not new, but they remain a powerful tool for classification because they tend not to overfit and perform well in many cases. If you are only interested in a certain topic, just scroll to it. These are the topics in order:

  • What’s the mathematical concept behind the Support Vector Machine?
  • What is a kernel and what are kernel functions?
  • What is the kernel trick?
  • What is the dual problem of a SVM?
  • How does Multiclass Classification take place?
  • Implementation via Python and scikit-learn

If you are only interested in how it can be implemented using Python and scikit-learn, scroll down to the end!

Let’s get started.

The objective is to find a hyperplane in an n-dimensional space that separates the data points according to their potential classes. The hyperplane should be positioned with the maximum distance to the data points. The data points with the minimum distance to the hyperplane are called Support Vectors. Due to their close position, their influence on the exact position of the hyperplane is bigger than that of other data points. In the graphic below the Support Vectors are the 3 points (2 blue, 1 green) lying on the margin lines.

(Figure: maximum-margin hyperplane; the three Support Vectors, two blue and one green, lie on the margin lines)
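As a minimal sketch (with a made-up toy dataset), a fitted scikit-learn SVC exposes exactly these points through its support_vectors_ attribute:

# Sketch: inspect the Support Vectors of a fitted linear SVC (toy data for illustration)
import numpy as np
from sklearn import svm

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [6, 6]])  # two small clusters
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear', C=1).fit(X, y)
print(clf.support_vectors_)  # the points lying closest to the separating hyperplane
print(clf.n_support_)        # number of Support Vectors per class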

SVMs are also called kernelized SVMs because of their kernel, which implicitly converts the input data space into a higher-dimensional space.

The input space X consists of x and x’.

$$k(x, x')$$

represents the kernel function, which operates implicitly in a higher-dimensional feature space, so that no data point has to be mapped there explicitly.

The kernel function can also be written as

$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle = \varphi(x)^T \varphi(x')$$

Which kernel function is most useful for placing the hyperplane depends on the data:

Kernel Functions

The most popular kernel functions, which are also available in scikit-learn, are linear, polynomial, radial basis function (RBF) and sigmoid. For more functions, visit Dataflair. In the following you can see what these four kernel functions look like:

1. Linear Function

$$k(x_n, x_m) = x_n^T x_m$$

2. Polynomial Function

$$k(x_n, x_m) = \left( \gamma \, x_n^T x_m + r \right)^d$$

3. Radial Basis Function (RBF)

$$k(x_n, x_m) = \exp\left( -\gamma \, \lVert x_n - x_m \rVert^2 \right)$$

4. Sigmoid Function

$$k(x_n, x_m) = \tanh\left( \gamma \, x_n^T x_m + r \right)$$
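As a small sketch, these four kernels can also be evaluated directly with scikit-learn's pairwise kernel utilities (the parameter values below are chosen purely for illustration):

# Sketch: evaluate the four kernel functions on two sample points
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel

x_n = np.array([[1.0, 2.0]])
x_m = np.array([[2.0, 0.5]])

print(linear_kernel(x_n, x_m))                                   # x_n^T x_m
print(polynomial_kernel(x_n, x_m, degree=3, gamma=1, coef0=1))   # (gamma * x_n^T x_m + r)^3
print(rbf_kernel(x_n, x_m, gamma=1))                             # exp(-gamma * ||x_n - x_m||^2)
print(sigmoid_kernel(x_n, x_m, gamma=1, coef0=1))                # tanh(gamma * x_n^T x_m + r)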

What does the kernel function do?

It takes two data points x_n and x_m and calculates a similarity score for them. This score is higher for closer data points and lower for points far apart. Crucially, this score equals the inner product of the two points in a higher-dimensional feature space, without the mapping ever being computed explicitly. This reduces computational effort and time, which is especially useful for huge amounts of data, because it avoids an expensive explicit transformation.
That’s why this step is often referred to as the Kernel Trick.
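A minimal sketch of the trick, using the homogeneous degree-2 polynomial kernel k(x, y) = (xᵀy)² as an assumed example: the kernel value equals the inner product of the explicitly mapped points, but is computed entirely in the original low-dimensional space.

# Sketch: kernel trick for the degree-2 polynomial kernel k(x, y) = (x.T @ y)**2
import numpy as np

def phi(v):
    # explicit feature map for a 2D point (a, b) -> (a^2, b^2, sqrt(2)*a*b)
    a, b = v
    return np.array([a**2, b**2, np.sqrt(2) * a * b])

x_n = np.array([1.0, 2.0])
x_m = np.array([3.0, 0.5])

explicit = phi(x_n) @ phi(x_m)  # inner product in the 3D feature space
trick = (x_n @ x_m) ** 2        # same value, computed in the original 2D space
print(explicit, trick)          # both print 16.0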

As can be seen in the graphic below, the data points are mapped from a 2D to a 3D space using the feature map φ((a, b)) = (a, b, a² + b²). The red dots, previously located in the center, now lie lower along the vertical axis in the 3D space. Data points that were not clearly separable before can now be divided much better by using a kernel.

(Figure: data points mapped from 2D to 3D with φ((a, b)) = (a, b, a² + b²); the two classes become linearly separable)
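As a sketch of this mapping (with synthetic, ring-shaped data as an assumption), the third coordinate a² + b² alone already separates the inner from the outer class:

# Sketch: lift 2D points with phi((a, b)) = (a, b, a^2 + b^2)
import numpy as np

angles = np.linspace(0, 2 * np.pi, 40, endpoint=False)
inner = np.c_[0.5 * np.cos(angles), 0.5 * np.sin(angles)]  # class close to the origin
outer = np.c_[3.0 * np.cos(angles), 3.0 * np.sin(angles)]  # class on a surrounding ring

def lift(points):
    # append a^2 + b^2 as the third coordinate
    return np.c_[points, (points ** 2).sum(axis=1)]

# In 3D a horizontal plane (a threshold on the third coordinate) separates the classes,
# although no straight line separates them in the original 2D space.
print(lift(inner)[:, 2].max() < lift(outer)[:, 2].min())  # True (0.25 < 9.0)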

Furthermore, different kernels make it possible to lay decision boundaries of diverse shapes through the cloud of data points. The limits of linear hyperplanes are quickly reached because of their restricted adaptability to different shapes, which is why the different kernel functions have been developed.

For a better understanding of what the separation produced by the different kernels can look like, the four kernel functions are visualized in the graphic below.

(Figure: decision regions of the linear, RBF, polynomial and sigmoid kernels on the iris data)

The following formula poses the optimization problem that is tackled by SVMs. It is explained further down (scikit-learn, n.d.):

$$\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^2 + C \sum_{n=1}^{N} \xi_n \qquad \text{subject to} \quad y_n\left(w^T \varphi(x_n) + b\right) \geq 1 - \xi_n, \quad \xi_n \geq 0$$

The objective is to classify as many data points correctly as possible by maximizing the margin from the Support Vectors to the hyperplane while minimizing the term

$$\frac{1}{2}\lVert w \rVert^2$$

In other words, the objective can also be described as finding optimal w and b such that most samples are predicted correctly.
In most cases not all data points can be allocated perfectly, so the distance of a sample to its correct margin is represented by the slack variable


$$\xi_n$$

The normal vector w defines a straight line that runs through the coordinate origin. The hyperplane cuts this line orthogonally at a distance

$$\frac{|b|}{\lVert w \rVert}$$

from the origin.

For the ideal case (Bishop, p.325 ff., 2006)

$$y_n\left(w^T \varphi(x_n) + b\right)$$

would be ≥ 1 for every sample, meaning every sample is predicted correctly. Since real data points can deviate from this ideal position, the condition of being ≥ 1 is relaxed to

$$y_n\left(w^T \varphi(x_n) + b\right) \geq 1 - \xi_n$$

Simultaneously, a penalty term is introduced into the minimization formula. C acts as a regularization parameter and controls how strongly violations are penalized, i.e. how many data points may lie on the wrong side of their margin with a total slack of

$$\sum_{n=1}^{N} \xi_n$$
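As a small illustrative sketch (the parameter values are assumptions, not recommendations), the effect of C can be observed by counting Support Vectors: a small C tolerates more margin violations and therefore typically keeps more Support Vectors than a large C.

# Sketch: effect of the regularization parameter C on the number of Support Vectors
from sklearn import svm, datasets

X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]  # sepal length and sepal width, as used later in the article

for C in (0.01, 1, 100):
    clf = svm.SVC(kernel='rbf', gamma=1, C=C).fit(X, y)
    print(C, clf.support_vectors_.shape[0])  # typically fewer Support Vectors as C grows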

The dual problem

The optimization task can be reformulated as a dual problem, which is minimized with respect to the parameters w and b and maximized with respect to the margin. To solve the dual problem, Lagrange multipliers a_n ≥ 0 are introduced, one per constraint.

This leads to a Lagrangian function of (Bishop, p.325 ff., 2006):

$$L(w, b, a) = \frac{1}{2}\lVert w \rVert^2 - \sum_{n=1}^{N} a_n \left[ y_n\left(w^T \varphi(x_n) + b\right) - 1 \right]$$

Utilizing the following two conditions (Bishop, p.325 ff., 2006):

$$w = \sum_{n=1}^{N} a_n y_n \varphi(x_n), \qquad \sum_{n=1}^{N} a_n y_n = 0$$

w and b can be eliminated from L(w, b, a). This leads to the following dual Lagrangian function, which is maximized with respect to a (Bishop, p.325 ff., 2006):

$$\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m y_n y_m \, k(x_n, x_m), \qquad a_n \geq 0, \quad \sum_{n=1}^{N} a_n y_n = 0$$

After solving the optimization problem, new data points can be classified using (Bishop, p.325 ff., 2006):

$$y(x) = \sum_{n=1}^{N} a_n y_n \, k(x, x_n) + b$$

For k(x_n, x_m), any of the previously described kernel functions (linear, polynomial, RBF, sigmoid) can be plugged in.
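As a sketch of how this formula maps to scikit-learn (using only documented SVC attributes), a fitted binary SVC stores the products a_n · y_n in dual_coef_, the Support Vectors x_n in support_vectors_ and the bias b in intercept_, so the decision function can be rebuilt by hand:

# Sketch: rebuild y(x) = sum_n a_n * y_n * k(x, x_n) + b for a binary RBF-SVC
import numpy as np
from sklearn import svm, datasets
from sklearn.metrics.pairwise import rbf_kernel

X, y = datasets.load_iris(return_X_y=True)
X, y = X[y < 2, :2], y[y < 2]  # keep two classes for a binary example

clf = svm.SVC(kernel='rbf', gamma=1, C=1).fit(X, y)

K = rbf_kernel(X[:5], clf.support_vectors_, gamma=1)      # k(x, x_n)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_      # sum_n a_n*y_n*k(x, x_n) + b
print(np.allclose(manual, clf.decision_function(X[:5])))  # True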

And that’s it! If you could follow the math, you now understand the principle behind a support vector machine. It’s easy to see how a cloud of data points can be divided into two classes, but how is it done for multiple classes?
Let’s see how this works.

In its simplest form, SVM is applied to binary classification, dividing data points into class 1 or 0. For multiclass classification, the same principle is utilized: the multiclass problem is broken down into multiple binary classification cases, which is also called one-vs-one. Note that scikit-learn's SVC always trains its multiclass models in one-vs-one fashion; the decision_function_shape parameter (set explicitly to 'ovo' further down in the code) only controls whether the decision function is returned in one-vs-one form or in the default one-vs-rest ('ovr') form. One-vs-rest, in contrast, divides the data points into one class and the rest, so that each class in turn is distinguished from all other classes.

The number of classifiers necessary for one-vs-one multiclass classification can be retrieved with the following formula (with n being the number of classes):


$$\frac{n(n-1)}{2}$$

In the one-vs-one approach, each classifier separates the points of two different classes, and combining all one-vs-one classifiers yields a multiclass classifier.
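A quick sketch for a 3-class problem such as the iris data used below, checking the formula against the shape of the one-vs-one decision function:

# Sketch: number of one-vs-one classifiers, n*(n-1)/2, checked on the iris data
from sklearn import svm, datasets

X, y = datasets.load_iris(return_X_y=True)
n = len(set(y))          # 3 classes
print(n * (n - 1) // 2)  # 3 pairwise classifiers

clf = svm.SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)
print(clf.decision_function(X[:1]).shape)  # (1, 3): one column per pairwise classifier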

You might already be bored by the iris dataset, but it’s the simplest way to demonstrate this, so let’s take a look at some code. Some parts of the code can also be found in the scikit-learn documentation.

# Importing the necessary packages and libraries
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import svm, datasets
import matplotlib.pyplot as plt
import numpy as np

Let’s load the iris dataset as iris and store target and feature variables:

iris = datasets.load_iris()

# Store variables as target y and the first two features as X
# (sepal length and sepal width of the iris flowers)
X = iris.data[:, :2]
y = iris.target

And now let’s split the dataset in train and test-set for the following training and prediction:

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)

In this step, we take a look at the different kernel functions. The penalty term C is set to 1 for all classifiers. For the multiclass decision function, one-versus-one is specified via decision_function_shape='ovo'. For the polynomial function, a degree of 3 is selected; this parameter is not needed for the other kernel functions.
All other parameters are set to their defaults. Here you can read more about the SVC function of scikit-learn.

linear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)
rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)
poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)
sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)

Now let's specify the mesh, in which we will plot the results.

# stepsize in the mesh; it alters the resolution of the plot
# to better understand it, just play with the value, change it and replot

h = .01
#create the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
# create the title that will be shown on the plot
titles = ['Linear kernel','RBF kernel','Polynomial kernel','Sigmoid kernel']

Now we’ll use a for loop to plot all 4 kernel functions:

for i, clf in enumerate((linear, rbf, poly, sig)):
    # defines how many plots: 2 rows, 2 columns => leading to 4 plots
    plt.subplot(2, 2, i + 1)  # i+1 is the index
    # space between plots
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey')
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()

As you have probably recognized, the result is the image shown earlier in the article:

(Figure: decision regions of the four kernels, the 2 x 2 plot produced by the code above)

In the next step we make predictions on the test data set using our 4 different kernel functions:

linear_pred = linear.predict(X_test)
poly_pred = poly.predict(X_test)
rbf_pred = rbf.predict(X_test)
sig_pred = sig.predict(X_test)

To understand how well they perform, we utilize a performance measure — accuracy.

# retrieve the accuracy and print it for all 4 kernel functions
accuracy_lin = linear.score(X_test, y_test)
accuracy_poly = poly.score(X_test, y_test)
accuracy_rbf = rbf.score(X_test, y_test)
accuracy_sig = sig.score(X_test, y_test)
print("Accuracy Linear Kernel:", accuracy_lin)
print("Accuracy Polynomial Kernel:", accuracy_poly)
print("Accuracy Radial Basis Kernel:", accuracy_rbf)
print("Accuracy Sigmoid Kernel:", accuracy_sig)
(Output: the printed accuracy for each of the four kernel functions)

As the accuracy reveals, some kernel functions are more useful than others, depending on the data. And obviously, more data also helps to improve the results (the iris dataset is not very big, with only 50 samples per class :-) ).

Let’s head to the last step — printing confusion matrices for the 4 kernel functions to understand how and what has been predicted:

# creating a confusion matrix
cm_lin = confusion_matrix(y_test, linear_pred)
cm_poly = confusion_matrix(y_test, poly_pred)
cm_rbf = confusion_matrix(y_test, rbf_pred)
cm_sig = confusion_matrix(y_test, sig_pred)
print(cm_lin)
print(cm_poly)
print(cm_rbf)
print(cm_sig)
(Output: the four confusion matrices)

That’s it. I hope it helped to better understand Support Vector Machines, the mathematical concepts, the Kernel trick and the topic Multiclass Classification via SVM.

Thanks for reading, I appreciate feedback!

Sources

  1. Bishop, C. M. (2006). Pattern Recognition and Machine Learning, p. 325 ff.
  2. Cortes, C. & Vapnik, V. (1995). Support-Vector Networks.
  3. scikit-learn documentation: Support Vector Machines.
  4. scikit-learn documentation: Plot different SVM classifiers in the iris dataset.
  5. DataFlair: Kernel Functions — Introduction to SVM Kernel & Examples.

The whole code

# Importing the necessary packages and libraries
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn import svm, datasets
import matplotlib.pyplot as plt
import numpy as np
iris = datasets.load_iris()

# Store variables as target y and the first two features as X
# (sepal length and sepal width of the iris flowers)
X = iris.data[:, :2]
y = iris.target

# Split the dataset into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=0)

linear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)
rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)
poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)
sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)

# stepsize in the mesh; it alters the resolution of the plot
# to better understand it, just play with the value, change it and replot

h = .01
#create the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))
# create the title that will be shown on the plot
titles = ['Linear kernel','RBF kernel','Polynomial kernel','Sigmoid kernel']
for i, clf in enumerate((linear, rbf, poly, sig)):
    # defines how many plots: 2 rows, 2 columns => leading to 4 plots
    plt.subplot(2, 2, i + 1)  # i+1 is the index
    # space between plots
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7)
    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey')
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()

# predictions on the test set with the 4 different kernel functions
linear_pred = linear.predict(X_test)
poly_pred = poly.predict(X_test)
rbf_pred = rbf.predict(X_test)
sig_pred = sig.predict(X_test)
# retrieve the accuracy and print it for all 4 kernel functions
accuracy_lin = linear.score(X_test, y_test)
accuracy_poly = poly.score(X_test, y_test)
accuracy_rbf = rbf.score(X_test, y_test)
accuracy_sig = sig.score(X_test, y_test)
print("Accuracy Linear Kernel:", accuracy_lin)
print("Accuracy Polynomial Kernel:", accuracy_poly)
print("Accuracy Radial Basis Kernel:", accuracy_rbf)
print("Accuracy Sigmoid Kernel:", accuracy_sig)
# creating a confusion matrix
cm_lin = confusion_matrix(y_test, linear_pred)
cm_poly = confusion_matrix(y_test, poly_pred)
cm_rbf = confusion_matrix(y_test, rbf_pred)
cm_sig = confusion_matrix(y_test, sig_pred)
print(cm_lin)
print(cm_poly)
print(cm_rbf)
print(cm_sig)
