# Multiclass Classification with Support Vector Machines (SVM), Kernel Trick & Kernel Functions (2023)

(Video) Support Vector Machines Part 3: The Radial (RBF) Kernel (Part 3 of 3)

## Finally understand the concept behind SVM + Implementation in Python via scikit-learn

Support Vector Machines (SVM) are not new but are still a powerful tool for classification due to their tendency not to overfit, but to perform well in many cases. If you are only interested in a certain topic, just scroll over the topics. These are the topics in chronological order:

• What’s the mathematical concept behind the Support Vector Machine?
• What is a kernel and what are kernel functions?
• What is the kernel trick?
• What is the dual problem of a SVM?
• How does Multiclass Classification take place?
• Implementation via Python and scikit-learn

If you are only interested in how it can be implemented using Python and scikit-learn, scroll down to the end!

## Let’s get started.

The objective is to find a hyperplane in an n-dimensional space that separates the data points to their potential classes. The hyperplane should be positioned with the maximum distance to the data points. The data points with the minimum distance to the hyperplane are called Support Vectors. Due to their close position, their influence on the exact position of the hyperplane is bigger than of other data points. In the graphic below the Support Vectors are the 3 points (2 blue, 1 green) laying on the lines.

SVMs are also called kernelized SVM due to their kernel that converts the input data space into a higher-dimensional space.

The input space X consists of x and x’.

represents the kernel function that turns the input space into a higher-dimensional space, so that not every data point is explicitly mapped.

The kernel function can also be written as

How the function is defined and useful for laying hyperplanes depends on the data:

## Kernel Functions

The most popular kernel functions, that are also available in scikit-learn are linear, polynomial, radial basis function and sigmoid. For more functions visit dataflair. In the following you can see how these four kernel functions look like:

1. Linear Function

2. Polynomial Function

4. Sigmoid Function

What does the kernel function do?

It takes two datapoints x_n and x_m and calculates a distance score of those. This score is higher for closer datapoints and vice versa. Using this score helps to transform the data points to a higher-dimensional mapping, which reduces the computational effort and time and is especially useful for huge amounts of data. It prevents the need for a more complex transformation.
That’s why this step is often referred to as Kernel Trick.

As can be seen in the graphic below, the mapping of the data points is turned from a 2D to a 3D space using a kernel function (φ((a, b)) = (a, b, a2 + b2)). The previously centered red dots are now also vertically lower located when turned into a 3D-space. Not clearly separable data points can now better divided by using a kernel.

Furthermore, different kernels can help to lay hyperplanes of diverse shapes through the cloud of data points. Obviously the limits of linear hyperplanes are quickly exceeded due to their restricted adaptability to different shapes. Based on this fact the different kernel functions have been developed.

For a better understanding of how the separation of different hyperplane can look like, the different kinds of kernel functions are visualized in the graphic below.

The following formula poses the optimization problem that is tackled by SVMs. It is explained further down (scikit-learn, n.d.):

The objective is to classify as many data points correctly as possible by maximizing the margin from the Support Vectors to the hyperplane while minimizing the term

In other words, the objective can also be explained as finding optimal w and b that most samples are predicted correctly.
Mostly not all data points can be allocated perfectly so that a distance to the correct margin is represented by

(Video) SVM - Inseparable sets, kernels, and multiclass classification

The normal vector creates a line that runs through the coordinate origin. The hyperplanes cut this straight line orthogonal at a distance

from the origin.

For the ideal case (Bishop, p.325 ff., 2006)

would be ≥ 1 and followingly perfectly predicted. Having now data points with distance to their ideal position, lets us correct the ideal case of being ≥ 1 to

Simultaneously a penalty term is introduced in the minimization formula. C acts as a regularization parameter and controls how strong the penalty is regarding how many data points have been falsely assigned with a total distance of

## The dual problem

The optimization task can be referred to as a dual problem, trying to minimize the parameters, while maximizing the margin. To solve the dual problem, Lagrange multipliers are utilized (alpha≥0).

This leads to a Lagrangian function of (Bishop, p.325 ff., 2006):

Utilizing the following two conditions (Bishop, p.325 ff., 2006):

w and b can be eliminated from L(w,b,a). This leads to the following Lagrangian function which is maximized as (Bishop, p.325 ff., 2006):

Solving the optimization problem, new data points can be classified by using (Bishop, p.325 ff., 2006):

For the kernel function k(x_n,x_m) the previously explained kernel functions (sigmoid, linear, polynomial, rbf) can be filled in.

And that’s it! If you could follow the math, you understand now the principle behind a support vector machine. It’s easy to understand how to divide a cloud of data points into two classes, but how is it done for multiple classes?
Let’s see how this works.

## Get started

In its most simple type SVM are applied on binary classification, dividing data points either in 1 or 0. For multiclass classification, the same principle is utilized. The multiclass problem is broken down to multiple binary classification cases, which is also called one-vs-one. In scikit-learn one-vs-one is not default and needs to be selected explicitly (as can be seen further down in the code). One-vs-rest is set as default. It basically divides the data points in class x and rest. Consecutively a certain class is distinguished from all other classes.

The number of classifiers necessary for one-vs-one multiclass classification can be retrieved with the following formula (with n being the number of classes):

(Video) Multi-Class Classification using SVM : One vs. All

In the one-vs-one approach, each classifier separates points of two different classes and comprising all one-vs-one classifiers leads to a multiclass classifier.

You might already be bored from the iris dataset, but it’s the most simple way to demonstrate it, so let's take a look on some code. Some parts of the code you can also find under scikit-learn.

`#Importing the necessary packages and libariesfrom sklearn.metrics import confusion_matrixfrom sklearn.model_selection import train_test_splitfrom sklearn import svm, datasetsimport matplotlib.pyplot as pltimport numpy as np`

Let’s load the iris dataset as iris and store target and feature variables:

`iris = datasets.load_iris()#Store variables as target y and the first two features as X (sepal length and sepal width of the iris flowers)X = iris.data[:, :2]y = iris.target`

And now let’s split the dataset in train and test-set for the following training and prediction:

`X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state = 0)`

In this step, we take a look at the different kernel functions. The penalty term C is set to 1 for all classifiers. For the multiclass classification, the type one-versus-one is specified, as can be seen in decision_function_shape=’ovo’. For the polynomial function, the degree of 3 is selected, this is not necessary for other kernel functions.
All other parameters are set to default. Here you can read more about the SVC-function of scikit-learn.

`linear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)`

Now let's specify the mesh, in which we will plot the results.

`#stepsize in the mesh, it alters the accuracy of the plotprint#to better understand it, just play with the value, change it and print ith = .01#create the meshx_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))# create the title that will be shown on the plottitles = ['Linear kernel','RBF kernel','Polynomial kernel','Sigmoid kernel']`

Now we’ll use a for loop to plot all 4 kernel functions:

`for i, clf in enumerate((linear, rbf, poly, sig)): #defines how many plots: 2 rows, 2columns=> leading to 4 plots plt.subplot(2, 2, i + 1) #i+1 is the index #space between plots plt.subplots_adjust(wspace=0.4, hspace=0.4)  Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) # Put the result into a color plot Z = Z.reshape(xx.shape) plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7) # Plot also the training points plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey') plt.xlabel('Sepal length') plt.ylabel('Sepal width') plt.xlim(xx.min(), xx.max()) plt.ylim(yy.min(), yy.max()) plt.xticks(()) plt.yticks(()) plt.title(titles[i]) plt.show()`

As you might have probably recognized, the result is the image from above in the article:

In the next step we make predictions on the test data set using our 4 different kernel functions:

`linear_pred = linear.predict(X_test)poly_pred = poly.predict(X_test)rbf_pred = rbf.predict(X_test)sig_pred = sig.predict(X_test)`

To understand how well they perform, we utilize a performance measure — accuracy.

`# retrieve the accuracy and print it for all 4 kernel functionsaccuracy_lin = linear.score(X_test, y_test)accuracy_poly = poly.score(X_test, y_test)accuracy_rbf = rbf.score(X_test, y_test)accuracy_sig = sig.score(X_test, y_test)print(“Accuracy Linear Kernel:”, accuracy_lin)print(“Accuracy Polynomial Kernel:”, accuracy_poly)print(“Accuracy Radial Basis Kernel:”, accuracy_rbf)print(“Accuracy Sigmoid Kernel:”, accuracy_sig`

As the accuracy reveals, some kernel functions are more useful than others, depending on the data. And obviously, more data is also helpful for improving the results (the iris data has a not really big size with 50 samples :-) ).

Let’s head to the last step — printing confusion matrices for the 4 kernel functions to understand how and what has been predicted:

`# creating a confusion matrixcm_lin = confusion_matrix(y_test, linear_pred)cm_poly = confusion_matrix(y_test, poly_pred)cm_rbf = confusion_matrix(y_test, rbf_pred)cm_sig = confusion_matrix(y_test, sig_pred)print(cm_lin)print(cm_poly)print(cm_rbf)print(cm_sig)`

That’s it. I hope it helped to better understand Support Vector Machines, the mathematical concepts, the Kernel trick and the topic Multiclass Classification via SVM.

Thanks for reading, I appreciate feedback!

## Sources

1. Bishop, 2006, Pattern recognition and machine learning, p.325 ff.
2. Vapnik & Cortes, 1995, Support-vector networks
3. Scikit-learn, Support Vector Machines
4. Scikit-learn, Plot different SVM classifiers in the iris dataset
5. Dataflair, Kernel Functions — Introduction to SVM Kernel & Examples
(Video) Support Vector Machine - SVM - Classification Implementation for Beginners (using python) - Detailed

## The whole code

`#Importing the necessary packages and libariesfrom sklearn.metrics import confusion_matrixfrom sklearn.model_selection import train_test_splitfrom sklearn import svm, datasetsimport matplotlib.pyplot as pltimport numpy as npiris = datasets.load_iris()#Store variables as target y and the first two features as X (sepal length and sepal width of the iris flowers)X = iris.data[:, :2]y = iris.targetlinear = svm.SVC(kernel='linear', C=1, decision_function_shape='ovo').fit(X_train, y_train)rbf = svm.SVC(kernel='rbf', gamma=1, C=1, decision_function_shape='ovo').fit(X_train, y_train)poly = svm.SVC(kernel='poly', degree=3, C=1, decision_function_shape='ovo').fit(X_train, y_train)sig = svm.SVC(kernel='sigmoid', C=1, decision_function_shape='ovo').fit(X_train, y_train)#stepsize in the mesh, it alters the accuracy of the plotprint#to better understand it, just play with the value, change it and print ith = .01#create the meshx_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1xx, yy = np.meshgrid(np.arange(x_min, x_max, h),np.arange(y_min, y_max, h))# create the title that will be shown on the plottitles = ['Linear kernel','RBF kernel','Polynomial kernel','Sigmoid kernel']for i, clf in enumerate((linear, rbf, poly, sig)): #defines how many plots: 2 rows, 2columns=> leading to 4 plots plt.subplot(2, 2, i + 1) #i+1 is the index #space between plots plt.subplots_adjust(wspace=0.4, hspace=0.4)Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])# Put the result into a color plot Z = Z.reshape(xx.shape) plt.contourf(xx, yy, Z, cmap=plt.cm.PuBuGn, alpha=0.7)# Plot also the training points plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.PuBuGn, edgecolors='grey')plt.xlabel('Sepal length') plt.ylabel('Sepal width') plt.xlim(xx.min(), xx.max()) plt.ylim(yy.min(), yy.max()) plt.xticks(()) plt.yticks(()) plt.title(titles[i])plt.show()linear_pred = linear.predict(X_test)poly_pred = poly.predict(X_test)rbf_pred = rbf.predict(X_test)sig_pred = sig.predict(X_test)# retrieve the accuracy and print it for all 4 kernel functionsaccuracy_lin = linear.score(X_test, y_test)accuracy_poly = poly.score(X_test, y_test)accuracy_rbf = rbf.score(X_test, y_test)accuracy_sig = sig.score(X_test, y_test)print(“Accuracy Linear Kernel:”, accuracy_lin)print(“Accuracy Polynomial Kernel:”, accuracy_poly)print(“Accuracy Radial Basis Kernel:”, accuracy_rbf)print(“Accuracy Sigmoid Kernel:”, accuracy_sig# creating a confusion matrixcm_lin = confusion_matrix(y_test, linear_pred)cm_poly = confusion_matrix(y_test, poly_pred)cm_rbf = confusion_matrix(y_test, rbf_pred)cm_sig = confusion_matrix(y_test, sig_pred)print(cm_lin)print(cm_poly)print(cm_rbf)print(cm_sig)`

## Videos

1. Lec 10 Support Vector Machine Kernel Trick
(Saptarsi Goswami)
2. Kernel Function in Support Vector Machine SVM || Lesson 83 || Machine Learning || Learning Monkey ||
(Learning Monkey)
3. Support Vector Machine | The Kernel Trick | RBF Function
(Menna A. Moataz)
4. Machine Learning Blink 9.3 (SVM kernel trick for nonlinear classification)
(BASIRA Lab)
5. Support Vector Machines (SVM), Kernel Trick
(Sideridis)
6. Lecture 6 (Part 1) - Support Vector Machine (SVM) and Kernel Trick - Machine Learning Course
(Rafik Bouguelia)
Top Articles
Latest Posts
Article information

Author: Allyn Kozey

Last Updated: 03/24/2023

Views: 6663

Rating: 4.2 / 5 (43 voted)

Author information

Name: Allyn Kozey

Birthday: 1993-12-21

Address: Suite 454 40343 Larson Union, Port Melia, TX 16164

Phone: +2456904400762