How the Lagrangian and Linear Algebra Combine to Produce the Support Vector Machine
I must note that, while I have had better accuracy in practice using algorithms like the XGBoost classifier, gradient boosting classifiers and artificial neural networks (ANNs) compared to the Support Vector Machine, this algorithm excites the mathematician in me. I love the elegance of the math behind SVM.
SVM is primarily a binary classification algorithm that works by constructing three parallel separating lines (hyperplanes, in higher dimensions) between the two classes of data. The middle one is the main separating line, called the hyperplane, while the other two are pushed outward until they touch the support vectors. These support vectors are the data points closest to the hyperplane, and they determine its position and orientation.
During the training phase the idea is to find the set of these three lines for which the distance between the two outer lines is maximized. Any test data point is then assessed using a dot product to see on which side of the hyperplane it falls, and it is classified accordingly.
Let us call the blue arrow running from the origin and meeting the separating line orthogonally b. To assess where a test data point lies, the algorithm employs the dot product: the coordinates of the test point are dotted with b. This uses a concept demonstrated below:
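In symbols, this is the standard projection formula: dotting the test point with the unit vector along b gives the length of its shadow on b,

\text{proj}_{\vec{b}}(\vec{x}) \;=\; \vec{x}\cdot\frac{\vec{b}}{\|\vec{b}\|} \;=\; \frac{\vec{x}\cdot\vec{b}}{\|\vec{b}\|}.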
The result is the length of the projection of the test vector along b. If the length of the projection is less than that of b, then the point falls into the first category. Otherwise it is classified into the second category. This is referred to as the decision rule.
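Written out, and collecting the threshold into a single offset constant c (a symbol I introduce here and keep for the rest of the derivation), the decision rule takes the familiar form

\vec{x}\cdot\vec{b} + c \ge 0 \;\Longrightarrow\; \text{second class}, \qquad \vec{x}\cdot\vec{b} + c < 0 \;\Longrightarrow\; \text{first class},

with the separating hyperplane being exactly the set of points for which the left-hand side equals zero.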
The x-vectors in the equations above represent any generic point. For the support vectors themselves, the ones in bold black circles on the first diagram of the article, the hyperplanes passing through them are defined by special equations. The y variable below simply takes one of two values to indicate the class of each data point.
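A standard way to write those special equations, with y_i = +1 for one class, y_i = -1 for the other, and c the offset from the decision rule, is

\vec{b}\cdot\vec{x}_+ + c = +1, \qquad \vec{b}\cdot\vec{x}_- + c = -1,

and multiplying by the class label lets us fold both into one condition:

y_i\left(\vec{b}\cdot\vec{x}_i + c\right) - 1 = 0 \;\text{ for the support vectors}, \qquad y_i\left(\vec{b}\cdot\vec{x}_i + c\right) - 1 \ge 0 \;\text{ for every data point.}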
As we covered above, the ultimate goal is to find the pair of parallel planes whose separation is maximized. To achieve this we first need an expression that describes the distance between the opposing lines.
In the equation above each v represents a support vector, the subscript denotes its class, and the dot product with the unit vector in the second parentheses converts the difference into the perpendicular (shortest) distance between the lines.
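Concretely, with \vec{v}_+ and \vec{v}_- denoting a support vector from each class, the width expression simplifies, using the support vector equations above, to

\text{width} \;=\; \left(\vec{v}_+ - \vec{v}_-\right)\cdot\frac{\vec{b}}{\|\vec{b}\|} \;=\; \frac{(1 - c) - (-1 - c)}{\|\vec{b}\|} \;=\; \frac{2}{\|\vec{b}\|},

so maximizing the width is equivalent to minimizing \|\vec{b}\|, which is usually posed, for algebraic convenience, as minimizing \tfrac{1}{2}\|\vec{b}\|^2.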
Now we have three important ingredients: two support vector equations providing constraints that need to be satisfied, plus the minimization problem derived above. An optimization problem subject to constraints leaves us with the Lagrangian as the best tool for the job, since it folds the constraints into a single objective and we do not need to worry about them as separate entities.
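In the notation above, with one multiplier \alpha_i \ge 0 per data point, the Lagrangian takes the standard form

\mathcal{L}(\vec{b}, c, \alpha) \;=\; \tfrac{1}{2}\|\vec{b}\|^2 \;-\; \sum_i \alpha_i\left[\,y_i\left(\vec{b}\cdot\vec{x}_i + c\right) - 1\,\right].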
In fact the multipliers of all the data points except the support vectors will be zero. This is because our Lagrangian has two terms, and at the optimum every summand in the second term must vanish: for any point that is not a support vector the bracketed constraint is strictly positive, so its multiplier has to be zero.
Using the first optimality condition we can see that, at the optimum hyperplane, the vector b equals a summation over the data points of the product of alpha, y and x. This means we can replace the vector b in the Lagrangian equation to get:
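Setting the partial derivatives of the Lagrangian to zero gives the two optimality conditions

\frac{\partial\mathcal{L}}{\partial\vec{b}} = \vec{b} - \sum_i \alpha_i y_i \vec{x}_i = 0 \;\Longrightarrow\; \vec{b} = \sum_i \alpha_i y_i \vec{x}_i, \qquad \frac{\partial\mathcal{L}}{\partial c} = -\sum_i \alpha_i y_i = 0 \;\Longrightarrow\; \sum_i \alpha_i y_i = 0,

and substituting the first back into the Lagrangian yields the dual form, which depends on the data only through dot products:

\mathcal{L} \;=\; \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j\, y_i y_j \left(\vec{x}_i\cdot\vec{x}_j\right).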
So the math above is what drives the optimization process during the training phase of our machine learning model. After this the test data gets assessed with the decision rule equation, and the sign of that result determines the class to which the test data point is assigned.
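To make this concrete, here is a minimal sketch in Python using scikit-learn's linear SVC on a tiny made-up dataset (the data, the variable names and the library choice are mine, not from the derivation above). It checks that the fitted normal vector really is the alpha-weighted sum of the support vectors, and classifies a test point by the sign of the decision rule:

```python
import numpy as np
from sklearn.svm import SVC

# A tiny, linearly separable toy dataset (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A large C approximates the hard-margin problem derived above.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# The normal vector (what the article calls b) recovered from the dual:
# b = sum_i alpha_i * y_i * x_i, where dual_coef_ stores alpha_i * y_i
# for the support vectors only.
b_from_dual = clf.dual_coef_ @ clf.support_vectors_
print("normal vector from dual:", b_from_dual)
print("normal vector (coef_):  ", clf.coef_)   # should match the line above

# Decision rule: sign(b . u + c) decides the class of a test point u.
u = np.array([[3.0, 3.5]])
score = clf.decision_function(u)                # computes b . u + c
print("decision value:", score, "-> class", np.sign(score))
print("support vectors:\n", clf.support_vectors_)
```

Running this prints identical normal vectors from the two routes and classifies the test point by the sign of its decision value, which is exactly the role the decision rule plays in the derivation.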
One of the most impressive features of the SVM is its ability to deal with data that is not linearly separable. I will endeavor to write another article explaining how this amazing algorithm handles those problems.