
Support Vector Machines for Binary Classification

Understanding Support Vector Machines

Separable Data

You can use a support vector machine (SVM) when your data has exactly two classes. An SVM classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane for an SVM means the one with the largest margin between the two classes. Margin means the maximal width of the slab parallel to the hyperplane that has no interior data points.

The support vectors are the data points that are closest to the separating hyperplane; these points are on the boundary of the slab. The following figure illustrates these definitions, with + indicating data points of type 1, and – indicating data points of type –1.

Mathematical Formulation: Primal. This discussion follows Hastie, Tibshirani, and Friedman [1] and Christianini and Shawe-Taylor [2].

The data for training is a set of points (vectors) xj along with their categories yj. For some dimension d, the xj ∈ Rd, and the yj = ±1. The equation of a hyperplane is

$$f(x) = x'\beta + b = 0$$

where β ∈ Rd and b is a real number.

The following problem defines the best separating hyperplane (i.e., the decision boundary). Find β and b that minimize ||β|| such that for all data points (xj,yj),

$$y_j f(x_j) \ge 1.$$

The support vectors are the xj on the boundary, those for which yjf(xj) = 1.

For mathematical convenience, the problem is usually given as the equivalent problem of minimizing β′β. This is a quadratic programming problem. The optimal solution (β̂, b̂) enables classification of a vector z as follows:

$$\mathrm{class}(z) = \mathrm{sign}(z'\hat{\beta} + \hat{b}) = \mathrm{sign}(\hat{f}(z)).$$

f̂(z) is the classification score and represents the distance z is from the decision boundary.
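
For a linear kernel, the trained ClassificationSVM model stores β̂ and b̂ in its Beta and Bias properties, so you can reproduce the scores directly. The following is a minimal sketch, assuming a model SVMModel trained with the default linear kernel, default KernelScale, and no standardization, and a matrix Z of new observations (one row per observation).

% Minimal sketch: recover linear SVM classification scores by hand.
% Assumes SVMModel uses a linear kernel with default KernelScale and
% 'Standardize',false; otherwise rescale or standardize Z first.
f = Z*SVMModel.Beta + SVMModel.Bias;  % f(z) = z'*beta + b for each row z
predictedSide = sign(f);              % sign of the score picks the class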

Mathematical Formulation: Dual. It is computationally simpler to solve the dual quadratic programming problem. To obtain the dual, take positive Lagrange multipliers αj multiplied by each constraint, and subtract from the objective function:

$$L_P = \frac{1}{2}\beta'\beta - \sum_j \alpha_j \bigl( y_j (x_j'\beta + b) - 1 \bigr),$$

where you look for a stationary point of LP over β and b. Setting the gradient of LP to 0, you get

$$\beta = \sum_j \alpha_j y_j x_j, \qquad 0 = \sum_j \alpha_j y_j. \tag{1}$$

Substituting into LP, you get the dual LD:

$$L_D = \sum_j \alpha_j - \frac{1}{2} \sum_j \sum_k \alpha_j \alpha_k y_j y_k x_j' x_k,$$

which you maximize over αj ≥ 0. In general, many αj are 0 at the maximum. The nonzero αj in the solution to the dual problem define the hyperplane, as seen in Equation 1, which gives β as the sum of αjyjxj. The data points xj corresponding to nonzero αj are the support vectors.

The derivative of LD with respect to a nonzero αj is 0 at an optimum. This gives

$$y_j f(x_j) - 1 = 0.$$

In particular, this gives the value of b at the solution, by taking any j with nonzero αj.

The dual is a standard quadratic programming problem. For example, the Optimization Toolbox™ quadprog (Optimization Toolbox) solver solves this type of problem.
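
To make the connection concrete, the following sketch sets up the separable-data dual as a quadprog problem. This is a minimal illustration, not the fitcsvm implementation; it assumes a predictor matrix X (one observation per row) and a label vector y with entries ±1.

% Minimal sketch: solve the hard-margin SVM dual with quadprog.
n = size(X,1);
Q = (y*y').*(X*X');             % Q(j,k) = y_j*y_k*x_j'*x_k
f = -ones(n,1);                 % quadprog minimizes, so negate sum(alpha)
Aeq = y'; beq = 0;              % equality constraint: sum_j alpha_j*y_j = 0
lb = zeros(n,1);                % alpha_j >= 0 (no upper bound in the separable case)
alpha = quadprog(Q,f,[],[],Aeq,beq,lb,[]);
beta = X'*(alpha.*y);           % Equation 1: beta = sum_j alpha_j*y_j*x_j
sv = find(alpha > 1e-6);        % support vectors have nonzero alpha
b = mean(y(sv) - X(sv,:)*beta); % from y_j*f(x_j) = 1 on the support vectors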

Nonseparable Data

Your data might not allow for a separating hyperplane. In that case, SVM can use a soft margin, meaning a hyperplane that separates many, but not all, data points.

There are two standard formulations of soft margins. Both involve adding slack variables ξj and a penalty parameter C.

  • The L1-norm problem is:

    $$\min_{\beta,b,\xi} \left( \frac{1}{2}\beta'\beta + C \sum_j \xi_j \right)$$

    such that

    $$y_j f(x_j) \ge 1 - \xi_j, \qquad \xi_j \ge 0.$$

    The L1-norm refers to using ξj as slack variables instead of their squares. The three solver options SMO, ISDA, and L1QP of fitcsvm minimize the L1-norm problem.

  • The L2-norm problem is:

    $$\min_{\beta,b,\xi} \left( \frac{1}{2}\beta'\beta + C \sum_j \xi_j^2 \right)$$

    subject to the same constraints.

In these formulations, you can see that increasing C places more weight on the slack variables ξj, meaning the optimization attempts to make a stricter separation between classes. Equivalently, reducing C towards 0 makes misclassification less important.

Mathematical Formulation: Dual. For easier calculations, consider the L1 dual problem to this soft-margin formulation. Using Lagrange multipliers μj, the function to minimize for the L1-norm problem is:

$$L_P = \frac{1}{2}\beta'\beta + C \sum_j \xi_j - \sum_j \alpha_j \bigl( y_j f(x_j) - (1 - \xi_j) \bigr) - \sum_j \mu_j \xi_j,$$

where you look for a stationary point of LP over β, b, and positive ξj. Setting the gradient of LP to 0, you get

$$\beta = \sum_j \alpha_j y_j x_j, \qquad \sum_j \alpha_j y_j = 0, \qquad \alpha_j = C - \mu_j, \qquad \alpha_j, \mu_j, \xi_j \ge 0.$$

These equations lead directly to the dual formulation:

$$\max_{\alpha} \sum_j \alpha_j - \frac{1}{2} \sum_j \sum_k \alpha_j \alpha_k y_j y_k x_j' x_k$$

subject to the constraints

$$\sum_j y_j \alpha_j = 0, \qquad 0 \le \alpha_j \le C.$$

The final set of inequalities, 0 ≤ αj ≤ C, shows why C is sometimes called a box constraint. C keeps the allowable values of the Lagrange multipliers αj in a “box”, a bounded region.

The gradient equation for b gives the solution b in terms of the set of nonzero αj, which correspond to the support vectors.

You can write and solve the dual of the L2-norm problem in an analogous manner. For details, see Christianini and Shawe-Taylor [2], Chapter 6.

fitcsvm Implementation. Both dual soft-margin problems are quadratic programming problems. Internally, fitcsvm has several different algorithms for solving the problems.

  • For one-class or binary classification, if you do not set a fraction of expected outliers in the data (see OutlierFraction), then the default solver is Sequential Minimal Optimization (SMO). SMO minimizes the one-norm problem by a series of two-point minimizations. During optimization, SMO respects the linear constraint ∑i αiyi = 0, and explicitly includes the bias term in the model. SMO is relatively fast. For more details on SMO, see [3].

  • For binary classification, if you set a fraction of expected outliers in the data, then the default solver is the Iterative Single Data Algorithm (ISDA). Like SMO, ISDA solves the one-norm problem. Unlike SMO, ISDA minimizes by a series of one-point minimizations, does not respect the linear constraint, and does not explicitly include the bias term in the model. For more details on ISDA, see [4].

  • For one-class or binary classification, and if you have an Optimization Toolbox license, you can choose to use quadprog (Optimization Toolbox) to solve the one-norm problem. quadprog uses a good deal of memory, but solves quadratic programs to a high degree of precision. For more details, see Quadratic Programming Definition (Optimization Toolbox).
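
If you want to compare these solvers directly, fitcsvm exposes the choice through the 'Solver' name-value argument. A brief sketch, where X and Y stand for your own predictor matrix and labels:

% Sketch: request a specific solver when training.
MdlSMO  = fitcsvm(X,Y,'Solver','SMO');   % default for most binary problems
MdlISDA = fitcsvm(X,Y,'Solver','ISDA');
MdlL1QP = fitcsvm(X,Y,'Solver','L1QP');  % uses quadprog; requires Optimization Toolbox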

Nonlinear Transformation with Kernels

Some binary classification problems do not have a simple hyperplane as a useful separating criterion. For those problems, there is a variant of the mathematical approach that retains nearly all the simplicity of an SVM separating hyperplane.

This approach uses these results from the theory of reproducing kernels:

  • There is a class of functions G(x1,x2) with the following property. There is a linear space S and a function φ mapping x to S such that

    G(x1,x2) = ⟨φ(x1),φ(x2)⟩.

    The dot product takes place in the space S.

  • This class of functions includes:

    • Polynomials: For some positive integer p,

      G(x1,x2) = (1 + x1′x2)^p.

    • Radial basis function (Gaussian):

      G(x1,x2) = exp(–‖x1 – x2‖^2).

    • Multilayer perceptron or sigmoid (neural network): For a positive number p1 and a negative number p2,

      G(x1,x2) = tanh(p1 x1′x2 + p2).

      Note

The mathematical approach using kernels relies on the computational method of hyperplanes. All the calculations for hyperplane classification use nothing more than dot products. Therefore, nonlinear kernels can use identical calculations and solution algorithms, and obtain classifiers that are nonlinear. The resulting classifiers are hypersurfaces in some space S, but the space S does not have to be identified or examined.
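
As a concrete illustration of replacing dot products with kernel evaluations, the following sketch computes the Gaussian Gram matrix G(xj,xk) for a data matrix X. The code is for illustration only; fitcsvm computes kernel values internally.

% Sketch: Gaussian (RBF) Gram matrix G(j,k) = exp(-||x_j - x_k||^2).
% X is n-by-d with one observation per row. G plays the role of the
% matrix of dot products x_j'*x_k in the dual problem.
sqnorms = sum(X.^2,2);               % ||x_j||^2 for each row
D2 = sqnorms + sqnorms' - 2*(X*X');  % squared pairwise distances
G = exp(-D2);                        % Gaussian kernel with unit scale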

Using Support Vector Machines

As with any supervised learning model, you first train a support vector machine, and then cross validate the classifier. Use the trained machine to classify (predict) new data. In addition, to obtain satisfactory predictive accuracy, you can use various SVM kernel functions, and you must tune the parameters of the kernel functions.

Training an SVM Classifier

Train, and optionally cross validate, an SVM classifier using fitcsvm. The most common syntax is:

SVMModel = fitcsvm(X,Y,'KernelFunction','rbf', ...
    'Standardize',true,'ClassNames',{'negClass','posClass'});

The inputs are:

  • X — Matrix of predictor data, where each row is one observation, and each column is one predictor.

  • Y — Array of class labels with each row corresponding to the value of the corresponding row in X. Y can be a categorical, character, or string array, a logical or numeric vector, or a cell array of character vectors.

  • KernelFunction — The default value is 'linear' for two-class learning, which separates the data by a hyperplane. The value 'gaussian' (or 'rbf') is the default for one-class learning, and specifies to use the Gaussian (or radial basis function) kernel. An important step to successfully train an SVM classifier is to choose an appropriate kernel function.

  • Standardize — Flag indicating whether the software should standardize the predictors before training the classifier.

  • ClassNames — Distinguishes between the negative and positive classes, or specifies which classes to include in the data. The negative class is the first element (or row of a character array), e.g., 'negClass', and the positive class is the second element (or row of a character array), e.g., 'posClass'. ClassNames must be the same data type as Y. It is good practice to specify the class names, especially if you are comparing the performance of different classifiers.

The resulting trained model (SVMModel) contains the optimized parameters from the SVM algorithm, enabling you to classify new data.

For more name-value pairs you can use to control the training, see the fitcsvm reference page.

Classifying New Data with an SVM Classifier

Classify new data using predict. The syntax for classifying new data using a trained SVM classifier (SVMModel) is:

[label,score] = predict(SVMModel,newX);

The resulting vector, label, represents the classification of each row in newX. score is an n-by-2 matrix of soft scores. Each row corresponds to a row in newX, which is a new observation. The first column contains the scores for the observations being classified in the negative class, and the second column contains the scores for the observations being classified in the positive class.

To estimate posterior probabilities rather than scores, first pass the trained SVM classifier (SVMModel) to fitPosterior, which fits a score-to-posterior-probability transformation function to the scores. The syntax is:

ScoreSVMModel = fitPosterior(SVMModel,X,Y);

The property ScoreTransform of the classifier ScoreSVMModel contains the optimal transformation function. Pass ScoreSVMModel to predict. Rather than returning the scores, the output argument score contains the posterior probabilities of an observation being classified in the negative (column 1 of score) or positive (column 2 of score) class.
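
Putting the two steps together, a minimal sketch (SVMModel, X, Y, and newX are assumed to exist as described above):

% Sketch: convert scores to posterior probabilities, then predict on new data.
ScoreSVMModel = fitPosterior(SVMModel,X,Y);       % fit the score-to-posterior map
[label,posterior] = predict(ScoreSVMModel,newX);  % column 1: negative-class posterior
                                                  % column 2: positive-class posterior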

Tuning an SVM Classifier

Use the 'OptimizeHyperparameters' name-value pair argument of fitcsvm to find parameter values that minimize the cross-validation loss. The eligible parameters are 'BoxConstraint', 'KernelFunction', 'KernelScale', 'PolynomialOrder', and 'Standardize'. For an example, see Optimize Classifier Fit Using Bayesian Optimization. Alternatively, you can use the bayesopt function, as shown in Optimize Cross-Validated Classifier Using bayesopt. The bayesopt function allows more flexibility to customize optimization. You can use the bayesopt function to optimize any parameters, including parameters that are not eligible to optimize when you use the fitcsvm function.

You can also try tuning parameters of your classifier manually according to this scheme:

  1. Pass the data to fitcsvm, and set the name-value pair argument 'KernelScale','auto'. Suppose that the trained SVM model is called SVMModel. The software uses a heuristic procedure to select the kernel scale. The heuristic procedure uses subsampling. Therefore, to reproduce results, set a random number seed using rng before training the classifier.

  2. Cross validate the classifier by passing it to crossval. By default, the software conducts 10-fold cross validation.

  3. Pass the cross-validated SVM model to kfoldLoss to estimate and retain the classification error.

  4. Retrain the SVM classifier, but adjust the 'KernelScale' and 'BoxConstraint' name-value pair arguments.

    • BoxConstraint — One strategy is to try a geometric sequence of the box constraint parameter. For example, take 11 values, from 1e-5 to 1e5 by a factor of 10. Increasing BoxConstraint might decrease the number of support vectors, but also might increase training time.

    • KernelScale — One strategy is to try a geometric sequence of the RBF sigma parameter scaled at the original kernel scale. Do this by:

      1. Retrieve the original kernel scale, e.g., ks, using dot notation: ks = SVMModel.KernelParameters.Scale.

      2. Use as new kernel scales factors of the original. For example, multiply ks by the 11 values 1e-5 to 1e5, increasing by a factor of 10.

Choose the model that yields the lowest classification error. You might want to further refine your parameters to obtain better accuracy. Start with your initial parameters and perform another cross-validation step, this time using a factor of 1.2.
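
A hedged sketch of this manual search, assuming a predictor matrix X and labels Y, and using the 11-value geometric grids described above:

% Sketch: manual grid search over BoxConstraint and KernelScale.
rng(1);  % reproduce the subsampling heuristic behind 'KernelScale','auto'
Mdl0 = fitcsvm(X,Y,'KernelFunction','rbf','KernelScale','auto');
ks = Mdl0.KernelParameters.Scale;   % original (heuristic) kernel scale
factors = 10.^(-5:5);               % 1e-5 to 1e5 by factors of 10
bestLoss = Inf;
for bc = factors
    for sc = ks*factors
        Mdl = fitcsvm(X,Y,'KernelFunction','rbf', ...
            'BoxConstraint',bc,'KernelScale',sc);
        L = kfoldLoss(crossval(Mdl));  % 10-fold cross-validation loss
        if L < bestLoss
            bestLoss = L;
            bestBC = bc;
            bestKS = sc;
        end
    end
end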

Train SVM Classifiers Using a Gaussian Kernel

This example shows how to generate a nonlinear classifier with a Gaussian kernel function. First, generate one class of points inside the unit disk in two dimensions, and another class of points in the annulus from radius 1 to radius 2. Then, generate a classifier based on the data with the Gaussian radial basis function kernel. The default linear classifier is obviously unsuitable for this problem, since the model is circularly symmetric. Set the box constraint parameter to Inf to make a strict classification, meaning no misclassified training points. Other kernel functions might not work with this strict box constraint, since they might be unable to provide a strict classification. Even though the RBF classifier can separate the classes, the result can be overtrained.

Generate 100 points uniformly distributed in the unit disk. To do so, generate a radius r as the square root of a uniform random variable, generate an angle t uniformly in (0, 2π), and put the point at (r cos(t), r sin(t)).

rng(1); % For reproducibility
r = sqrt(rand(100,1)); % Radius
t = 2*pi*rand(100,1);  % Angle
data1 = [r.*cos(t), r.*sin(t)]; % Points

Generate 100 points uniformly distributed in the annulus. The radius is again proportional to a square root, this time a square root of the uniform distribution from 1 through 4.

r2 = sqrt(3*rand(100,1)+1); % Radius
t2 = 2*pi*rand(100,1);      % Angle
data2 = [r2.*cos(t2), r2.*sin(t2)]; % Points

Plot the points, and plot circles of radii 1 and 2 for comparison.

figure;
plot(data1(:,1),data1(:,2),'r.','MarkerSize',15)
hold on
plot(data2(:,1),data2(:,2),'b.','MarkerSize',15)
ezpolar(@(x)1); ezpolar(@(x)2);
axis equal
hold off

Figure contains an axes object. The axes object contains 6 objects of type line, text.

Put the data in one matrix, and make a vector of classifications.

data3 = [data1;data2];
theclass = ones(200,1);
theclass(1:100) = -1;

Train an SVM classifier with KernelFunction set to 'rbf' and BoxConstraint set to Inf. Plot the decision boundary and flag the support vectors.

% Train the SVM classifier
cl = fitcsvm(data3,theclass,'KernelFunction','rbf', ...
    'BoxConstraint',Inf,'ClassNames',[-1,1]);

% Predict scores over the grid
d = 0.02;
[x1Grid,x2Grid] = meshgrid(min(data3(:,1)):d:max(data3(:,1)), ...
    min(data3(:,2)):d:max(data3(:,2)));
xGrid = [x1Grid(:),x2Grid(:)];
[~,scores] = predict(cl,xGrid);

% Plot the data and the decision boundary
figure;
h(1:2) = gscatter(data3(:,1),data3(:,2),theclass,'rb','.');
hold on
ezpolar(@(x)1);
h(3) = plot(data3(cl.IsSupportVector,1),data3(cl.IsSupportVector,2),'ko');
contour(x1Grid,x2Grid,reshape(scores(:,2),size(x1Grid)),[0 0],'k');
legend(h,{'-1','+1','Support Vectors'});
axis equal
hold off

Figure contains an axes object. The axes object contains 6 objects of type line, text, contour. These objects represent -1, +1, Support Vectors.

fitcsvm generates a classifier that is close to a circle of radius 1. The difference is due to the random training data.

Training with the default parameters makes a more nearly circular classification boundary, but one that misclassifies some training data. Also, the default value of BoxConstraint is 1, and, therefore, there are more support vectors.

cl2 = fitcsvm(data3,theclass,'KernelFunction','rbf');
[~,scores2] = predict(cl2,xGrid);

figure;
h(1:2) = gscatter(data3(:,1),data3(:,2),theclass,'rb','.');
hold on
ezpolar(@(x)1);
h(3) = plot(data3(cl2.IsSupportVector,1),data3(cl2.IsSupportVector,2),'ko');
contour(x1Grid,x2Grid,reshape(scores2(:,2),size(x1Grid)),[0 0],'k');
legend(h,{'-1','+1','Support Vectors'});
axis equal
hold off

Figure contains an axes object. The axes object contains 6 objects of type line, text, contour. These objects represent -1, +1, Support Vectors.

Train SVM Classifier Using Custom Kernel

This example shows how to use a custom kernel function, such as the sigmoid kernel, to train SVM classifiers, and adjust custom kernel function parameters.

Generate a random set of points within the unit circle. Label points in the first and third quadrants as belonging to the positive class, and those in the second and fourth quadrants as belonging to the negative class.

rng(1);  % For reproducibility
n = 100; % Number of points per quadrant

r1 = sqrt(rand(2*n,1));                            % Random radii
t1 = [pi/2*rand(n,1); (pi/2*rand(n,1)+pi)];        % Random angles for Q1 and Q3
X1 = [r1.*cos(t1) r1.*sin(t1)];                    % Polar-to-Cartesian conversion

r2 = sqrt(rand(2*n,1));
t2 = [pi/2*rand(n,1)+pi/2; (pi/2*rand(n,1)-pi/2)]; % Random angles for Q2 and Q4
X2 = [r2.*cos(t2) r2.*sin(t2)];

X = [X1; X2];        % Predictors
Y = ones(4*n,1);
Y(2*n + 1:end) = -1; % Labels

Plot the data.

figure;
gscatter(X(:,1),X(:,2),Y);
title('Scatter Diagram of Simulated Data')

Write a function that accepts two matrices in the feature space as inputs, and transforms them into a Gram matrix using the sigmoid kernel.

function G = mysigmoid(U,V)
% Sigmoid kernel function with slope gamma and intercept c
gamma = 1;
c = -1;
G = tanh(gamma*U*V' + c);
end

Save this code as a file named mysigmoid on your MATLAB® path.

Train an SVM classifier using the sigmoid kernel function. It is good practice to standardize the data.

Mdl1 = fitcsvm(X,Y,'KernelFunction','mysigmoid','Standardize',true);

Mdl1 is a ClassificationSVM classifier containing the estimated parameters.

Plot the data, and identify the support vectors and the decision boundary.

% Compute the scores over a grid
d = 0.02; % Step size of the grid
[x1Grid,x2Grid] = meshgrid(min(X(:,1)):d:max(X(:,1)), ...
    min(X(:,2)):d:max(X(:,2)));
xGrid = [x1Grid(:),x2Grid(:)];      % The grid
[~,scores1] = predict(Mdl1,xGrid);  % The scores

figure;
h(1:2) = gscatter(X(:,1),X(:,2),Y);
hold on
h(3) = plot(X(Mdl1.IsSupportVector,1), ...
    X(Mdl1.IsSupportVector,2),'ko','MarkerSize',10);                  % Support vectors
contour(x1Grid,x2Grid,reshape(scores1(:,2),size(x1Grid)),[0 0],'k');  % Decision boundary
title('Scatter Diagram with the Decision Boundary')
legend({'-1','1','Support Vectors'},'Location','Best');
hold off

You can adjust the kernel parameters in an attempt to improve the shape of the decision boundary. This might also decrease the within-sample misclassification rate, but you should first determine the out-of-sample misclassification rate.

Determine the out-of-sample misclassification rate by using 10-fold cross validation.

CVMdl1 = crossval(Mdl1);
misclass1 = kfoldLoss(CVMdl1);
misclass1
misclass1 = 0.1350

The out-of-sample misclassification rate is 13.5%.

Write another sigmoid function, but set gamma = 0.5.

function G = mysigmoid2(U,V)
% Sigmoid kernel function with slope gamma and intercept c
gamma = 0.5;
c = -1;
G = tanh(gamma*U*V' + c);
end

Save this code as a file named mysigmoid2 on your MATLAB® path.

Train another SVM classifier using the adjusted sigmoid kernel. Plot the data and the decision region, and determine the out-of-sample misclassification rate.

Mdl2 = fitcsvm(X,Y,'KernelFunction','mysigmoid2','Standardize',true);
[~,scores2] = predict(Mdl2,xGrid);

figure;
h(1:2) = gscatter(X(:,1),X(:,2),Y);
hold on
h(3) = plot(X(Mdl2.IsSupportVector,1), ...
    X(Mdl2.IsSupportVector,2),'ko','MarkerSize',10);
title('Scatter Diagram with the Decision Boundary')
contour(x1Grid,x2Grid,reshape(scores2(:,2),size(x1Grid)),[0 0],'k');
legend({'-1','1','Support Vectors'},'Location','Best');
hold off

CVMdl2 = crossval(Mdl2);
misclass2 = kfoldLoss(CVMdl2);
misclass2
misclass2 = 0.0450

After the sigmoid slope adjustment, the new decision boundary seems to provide a better within-sample fit, and the cross-validation rate contracts by more than 66%.

Optimize Classifier Fit Using Bayesian Optimization

This example shows how to optimize an SVM classification using the fitcsvm function and the OptimizeHyperparameters name-value argument.

Generate Data

The classification works on locations of points from a Gaussian mixture model. Hastie, Tibshirani, and Friedman (2009) describe the model on page 17 of The Elements of Statistical Learning. The model begins with generating 10 base points for a "green" class, distributed as 2-D independent normals with mean (1,0) and unit variance. It also generates 10 base points for a "red" class, distributed as 2-D independent normals with mean (0,1) and unit variance. For each class (green and red), generate 100 random points as follows:

  1. Choose a base point m of the appropriate color uniformly at random.

  2. Generate an independent random point with 2-D normal distribution with mean m and variance I/5, where I is the 2-by-2 identity matrix. In this example, use a variance I/50 to show the advantage of optimization more clearly.

Generate the 10 base points for each class.

rng('default') % For reproducibility
grnpop = mvnrnd([1,0],eye(2),10);
redpop = mvnrnd([0,1],eye(2),10);

View the base points.

plot(grnpop(:,1),grnpop(:,2),'go')
hold on
plot(redpop(:,1),redpop(:,2),'ro')
hold off

Figure contains an axes object. The axes object contains 2 objects of type line.

Since some red base points are close to green base points, it can be difficult to classify the data points based on location alone.

Generate the 100 data points of each class.

redpts = zeros(100,2);
grnpts = redpts;
for i = 1:100
    grnpts(i,:) = mvnrnd(grnpop(randi(10),:),eye(2)*0.02);
    redpts(i,:) = mvnrnd(redpop(randi(10),:),eye(2)*0.02);
end

View the data points.

figure
plot(grnpts(:,1),grnpts(:,2),'go')
hold on
plot(redpts(:,1),redpts(:,2),'ro')
hold off

Figure contains an axes object. The axes object contains 2 objects of type line.

Prepare Data for Classification

Put the data into one matrix, and make a vector grp that labels the class of each point. 1 indicates the green class, and –1 indicates the red class.

cdata = [grnpts;redpts];
grp = ones(200,1);
grp(101:200) = -1;

Prepare Cross-Validation

Set up a partition for cross-validation.

c = cvpartition(200,'KFold',10);

This step is optional. If you specify a partition for the optimization, then you can compute an actual cross-validation loss for the returned model.

Optimize Fit

To find a good fit, meaning one with optimal hyperparameters that minimize the cross-validation loss, use Bayesian optimization. Specify a list of hyperparameters to optimize by using the OptimizeHyperparameters name-value argument, and specify optimization options by using the HyperparameterOptimizationOptions name-value argument.

Specify 'OptimizeHyperparameters' as 'auto'. The 'auto' option includes a typical set of hyperparameters to optimize. fitcsvm finds optimal values of BoxConstraint and KernelScale. Set the hyperparameter optimization options to use the cross-validation partition c and to choose the 'expected-improvement-plus' acquisition function for reproducibility. The default acquisition function depends on run time and, therefore, can give varying results.

opts = struct('CVPartition',c,'AcquisitionFunctionName','expected-improvement-plus');
Mdl = fitcsvm(cdata,grp,'KernelFunction','rbf', ...
    'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',opts)
|=====================================================================================================|
| Iter | Eval    | Objective | Objective | BestSoFar  | BestSoFar | BoxConstraint | KernelScale |
|      | result  |           | runtime   | (observed) | (estim.)  |               |             |
|=====================================================================================================|
|    1 | Best    | 0.345 | 0.29522 | 0.345 | 0.345    | 0.00474   | 306.44    |
|    2 | Best    | 0.115 | 0.18714 | 0.115 | 0.12678  | 430.31    | 1.4864    |
|    3 | Accept  | 0.52  | 0.27751 | 0.115 | 0.1152   | 0.028415  | 0.014369  |
|    4 | Accept  | 0.61  | 0.297   | 0.115 | 0.11504  | 133.94    | 0.0031427 |
|    5 | Accept  | 0.34  | 0.40509 | 0.115 | 0.11504  | 0.010993  | 5.7742    |
|    6 | Best    | 0.085 | 0.2515  | 0.085 | 0.085039 | 885.63    | 0.68403   |
|    7 | Accept  | 0.105 | 0.25066 | 0.085 | 0.085428 | 0.3057    | 0.58118   |
|    8 | Accept  | 0.21  | 0.26477 | 0.085 | 0.09566  | 0.16044   | 0.91824   |
|    9 | Accept  | 0.085 | 0.23688 | 0.085 | 0.08725  | 972.19    | 0.46259   |
|   10 | Accept  | 0.1   | 0.42455 | 0.085 | 0.090952 | 990.29    | 0.491     |
|   11 | Best    | 0.08  | 0.35181 | 0.08  | 0.079362 | 2.5195    | 0.291     |
|   12 | Accept  | 0.09  | 0.2114  | 0.08  | 0.08402  | 14.338    | 0.44386   |
|   13 | Accept  | 0.1   | 0.20009 | 0.08  | 0.08508  | 0.0022577 | 0.23803   |
|   14 | Accept  | 0.11  | 0.49489 | 0.08  | 0.087378 | 0.2115    | 0.32109   |
|   15 | Best    | 0.07  | 0.29763 | 0.07  | 0.081507 | 910.2     | 0.25218   |
|   16 | Best    | 0.065 | 0.36104 | 0.065 | 0.072457 | 953.22    | 0.26253   |
|   17 | Accept  | 0.075 | 0.38855 | 0.065 | 0.072554 | 998.74    | 0.23087   |
|   18 | Accept  | 0.295 | 0.27085 | 0.065 | 0.072647 | 996.18    | 44.626    |
|   19 | Accept  | 0.07  | 0.31933 | 0.065 | 0.06946  | 985.37    | 0.27389   |
|   20 | Accept  | 0.165 | 0.27464 | 0.065 | 0.071622 | 0.065103  | 0.13679   |
|=====================================================================================================|
| Iter | Eval    | Objective | Objective | BestSoFar  | BestSoFar | BoxConstraint | KernelScale |
|      | result  |           | runtime   | (observed) | (estim.)  |               |             |
|=====================================================================================================|
|   21 | Accept  | 0.345 | 0.30824 | 0.065 | 0.071764 | 971.7     | 999.01    |
|   22 | Accept  | 0.61  | 0.26964 | 0.065 | 0.071967 | 0.0010168 | 0.0010005 |
|   23 | Accept  | 0.345 | 0.34764 | 0.065 | 0.071959 | 0.0011459 | 995.89    |
|   24 | Accept  | 0.35  | 0.22277 | 0.065 | 0.071863 | 0.0010003 | 40.628    |
|   25 | Accept  | 0.24  | 0.46237 | 0.065 | 0.072124 | 996.55    | 10.423    |
|   26 | Accept  | 0.61  | 0.48664 | 0.065 | 0.072067 | 994.71    | 0.0010063 |
|   27 | Accept  | 0.47  | 0.20158 | 0.065 | 0.07218  | 993.69    | 0.029723  |
|   28 | Accept  | 0.3   | 0.17353 | 0.065 | 0.072291 | 993.15    | 170.01    |
|   29 | Accept  | 0.16  | 0.41714 | 0.065 | 0.072103 | 992.81    | 3.8594    |
|   30 | Accept  | 0.365 | 0.42269 | 0.065 | 0.072112 | 0.0010017 | 0.044287  |

Figure contains an axes object. The axes object with title Min objective vs. Number of function evaluations contains 2 objects of type line. These objects represent Min observed objective, Estimated min objective.

Figure contains an axes object. The axes object with title Objective function model contains 5 objects of type line, surface, contour. These objects represent Observed points, Model mean, Next point, Model minimum feasible.

__________________________________________________________
Optimization completed.
MaxObjectiveEvaluations of 30 reached.
Total function evaluations: 30
Total elapsed time: 42.6693 seconds
Total objective function evaluation time: 9.3728

Best observed feasible point:
    BoxConstraint    KernelScale
    _____________    ___________
       953.22          0.26253

Observed objective function value = 0.065
Estimated objective function value = 0.073726
Function evaluation time = 0.36104

Best estimated feasible point (according to models):
    BoxConstraint    KernelScale
    _____________    ___________
       985.37          0.27389

Estimated objective function value = 0.072112
Estimated function evaluation time = 0.29981
Mdl = 
  ClassificationSVM
                         ResponseName: 'Y'
                CategoricalPredictors: []
                           ClassNames: [-1 1]
                       ScoreTransform: 'none'
                      NumObservations: 200
    HyperparameterOptimizationResults: [1x1 BayesianOptimization]
                                Alpha: [77x1 double]
                                 Bias: -0.2352
                     KernelParameters: [1x1 struct]
                       BoxConstraints: [200x1 double]
                      ConvergenceInfo: [1x1 struct]
                      IsSupportVector: [200x1 logical]
                               Solver: 'SMO'

  Properties, Methods

fitcsvm returns a ClassificationSVM model object that uses the best estimated feasible point. The best estimated feasible point is the set of hyperparameters that minimizes the upper confidence bound of the cross-validation loss based on the underlying Gaussian process model of the Bayesian optimization process.

The Bayesian optimization process internally maintains a Gaussian process model of the objective function. The objective function is the cross-validated misclassification rate for classification. For each iteration, the optimization process updates the Gaussian process model and uses the model to find a new set of hyperparameters. Each line of the iterative display shows the new set of hyperparameters and these column values:

  • Objective — Objective function value computed at the new set of hyperparameters.

  • Objective runtime — Objective function evaluation time.

  • Eval result — Result report, specified as Accept, Best, or Error. Accept indicates that the objective function returns a finite value, and Error indicates that the objective function returns a value that is not a finite real scalar. Best indicates that the objective function returns a finite value that is lower than previously computed objective function values.

  • BestSoFar(observed) — The minimum objective function value computed so far. This value is either the objective function value of the current iteration (if the Eval result value for the current iteration is Best) or the value of the previous Best iteration.

  • BestSoFar(estim.) — At each iteration, the software estimates the upper confidence bounds of the objective function values, using the updated Gaussian process model, at all the sets of hyperparameters tried so far. Then the software chooses the point with the minimum upper confidence bound. The BestSoFar(estim.) value is the objective function value returned by the predictObjective function at the minimum point.

The plot below the iterative display shows the BestSoFar(observed) and BestSoFar(estim.) values in blue and green, respectively.

The returned object Mdl uses the best estimated feasible point, that is, the set of hyperparameters that produces the BestSoFar(estim.) value in the final iteration based on the final Gaussian process model.

You can obtain the best point from the HyperparameterOptimizationResults property or by using the bestPoint function.

Mdl.HyperparameterOptimizationResults.XAtMinEstimatedObjective
ans=1×2 table
    BoxConstraint    KernelScale
    _____________    ___________
       985.37          0.27389
[x,CriterionValue,iteration] = bestPoint(Mdl.HyperparameterOptimizationResults)
x=1×2 table
    BoxConstraint    KernelScale
    _____________    ___________
       985.37          0.27389
CriterionValue = 0.0888
iteration = 19

By default, the bestPoint function uses the 'min-visited-upper-confidence-interval' criterion. This criterion chooses the hyperparameters obtained from the 19th iteration as the best point. CriterionValue is the upper bound of the cross-validated loss computed by the final Gaussian process model. Compute the actual cross-validated loss by using the partition c.

L_MinEstimated = kfoldLoss(fitcsvm(cdata,grp,'CVPartition',c,'KernelFunction','rbf', ...
    'BoxConstraint',x.BoxConstraint,'KernelScale',x.KernelScale))
L_MinEstimated = 0.0700

The actual cross-validated loss is close to the estimated value. The Estimated objective function value is displayed below the plots of the optimization results.

You can also extract the best observed feasible point (that is, the last Best point in the iterative display) from the HyperparameterOptimizationResults property or by specifying Criterion as 'min-observed'.

Mdl.HyperparameterOptimizationResults.XAtMinObjective
ans=1×2 table
    BoxConstraint    KernelScale
    _____________    ___________
       953.22          0.26253
[x_observed,CriterionValue_observed,iteration_observed] = bestPoint(Mdl.HyperparameterOptimizationResults,'Criterion','min-observed')
x_observed=1×2 table
    BoxConstraint    KernelScale
    _____________    ___________
       953.22          0.26253
CriterionValue_observed = 0.0650
iteration_observed = 16

The 'min-observed' criterion chooses the hyperparameters obtained from the 16th iteration as the best point. CriterionValue_observed is the actual cross-validated loss computed using the selected hyperparameters. For more information, see the Criterion name-value argument of bestPoint.

Visualize the optimized classifier.

d = 0.02;
[x1Grid,x2Grid] = meshgrid(min(cdata(:,1)):d:max(cdata(:,1)), ...
    min(cdata(:,2)):d:max(cdata(:,2)));
xGrid = [x1Grid(:),x2Grid(:)];
[~,scores] = predict(Mdl,xGrid);

figure
h(1:2) = gscatter(cdata(:,1),cdata(:,2),grp,'rg','+*');
hold on
h(3) = plot(cdata(Mdl.IsSupportVector,1), ...
    cdata(Mdl.IsSupportVector,2),'ko');
contour(x1Grid,x2Grid,reshape(scores(:,2),size(x1Grid)),[0 0],'k');
legend(h,{'-1','+1','Support Vectors'},'Location','Southeast');

Figure contains an axes object. The axes object contains 4 objects of type line, contour. These objects represent -1, +1, Support Vectors.

Evaluate Accuracy on New Data

Generate and classify new test data points.

grnobj = gmdistribution(grnpop,.2*eye(2));
redobj = gmdistribution(redpop,.2*eye(2));

newData = random(grnobj,10);
newData = [newData;random(redobj,10)];
grpData = ones(20,1); % green = 1
grpData(11:20) = -1;  % red = -1

v = predict(Mdl,newData);

Compute the misclassification rates on the test data set.

L_Test = loss(Mdl,newData,grpData)
L_Test = 0.3500

Determine which new data points are classified correctly. Format the correctly classified points in red squares and the incorrectly classified points in black squares.

h(4:5) = gscatter(newData(:,1),newData(:,2),v,'mc','**');
mydiff = (v == grpData); % Classified correctly

for ii = mydiff % Plot red squares around correct pts
    h(6) = plot(newData(ii,1),newData(ii,2),'rs','MarkerSize',12);
end

for ii = not(mydiff) % Plot black squares around incorrect pts
    h(7) = plot(newData(ii,1),newData(ii,2),'ks','MarkerSize',12);
end

legend(h,{'-1 (training)','+1 (training)','Support Vectors', ...
    '-1 (classified)','+1 (classified)', ...
    'Correctly Classified','Misclassified'}, ...
    'Location','Southeast');
hold off

Figure contains an axes object. The axes object contains 8 objects of type line, contour. These objects represent -1 (training), +1 (training), Support Vectors, -1 (classified), +1 (classified), Correctly Classified, Misclassified.

Plot Posterior Probability Regions for SVM Classification Models

This example shows how to predict posterior probabilities of SVM models over a grid of observations, and then plot the posterior probabilities over the grid. Plotting posterior probabilities exposes decision boundaries.

Load Fisher's iris data set. Train the classifier using the petal lengths and widths, and remove the virginica species from the data.

load fisheriris
classKeep = ~strcmp(species,'virginica');
X = meas(classKeep,3:4);
y = species(classKeep);

Train an SVM classifier using the data. It is good practice to specify the order of the classes.

SVMModel = fitcsvm(X,y,'ClassNames',{'setosa','versicolor'});

Estimate the optimal score transformation function.

rng(1); % For reproducibility
[SVMModel,ScoreParameters] = fitPosterior(SVMModel);
Warning: Classes are perfectly separated. The optimal score-to-posterior transformation is a step function.
ScoreParameters
ScoreParameters = struct with fields:
                        Type: 'step'
                  LowerBound: -0.8431
                  UpperBound: 0.6897
    PositiveClassProbability: 0.5000

The optimal score transformation function is the step function because the classes are separable. The fields LowerBound and UpperBound of ScoreParameters indicate the lower and upper end points of the interval of scores corresponding to observations within the class-separating hyperplanes (the margin). No training observation falls within the margin. If a new score is in the interval, then the software assigns the corresponding observation a positive class posterior probability, i.e., the value in the PositiveClassProbability field of ScoreParameters.

Define a grid of values in the observed predictor space. Predict the posterior probabilities for each instance in the grid.

xMax = max(X);
xMin = min(X);
d = 0.01;
[x1Grid,x2Grid] = meshgrid(xMin(1):d:xMax(1),xMin(2):d:xMax(2));
[~,PosteriorRegion] = predict(SVMModel,[x1Grid(:),x2Grid(:)]);

Plot the positive class posterior probability region and the training data.

figure;
contourf(x1Grid,x2Grid, ...
    reshape(PosteriorRegion(:,2),size(x1Grid,1),size(x1Grid,2)));
h = colorbar;
h.Label.String = 'P({\it{versicolor}})';
h.YLabel.FontSize = 16;
caxis([0 1]);
colormap jet;

hold on
gscatter(X(:,1),X(:,2),y,'mc','.x',[15,10]);
sv = X(SVMModel.IsSupportVector,:);
plot(sv(:,1),sv(:,2),'yo','MarkerSize',15,'LineWidth',2);
axis tight
hold off

Figure contains an axes object. The axes object contains 4 objects of type contour, line. These objects represent setosa, versicolor.

In two-class learning, if the classes are separable, then there are three regions: one where observations have positive class posterior probability 0, one where it is 1, and the other where it is the positive class prior probability.

Analyze Images Using Linear Support Vector Machines

This example shows how to determine which quadrant of an image a shape occupies by training an error-correcting output codes (ECOC) model composed of linear SVM binary learners. This example also illustrates the disk-space consumption of ECOC models that store support vectors, their labels, and the estimated α coefficients.

Create the Data Set

Randomly place a circle with radius five in a 50-by-50 image. Make 50,000 images. Create a label for each image indicating the quadrant that the circle occupies. Quadrant 1 is in the upper right, quadrant 2 is in the upper left, quadrant 3 is in the lower left, and quadrant 4 is in the lower right. The predictors are the intensities of each pixel.

d = 50;           % Height and width of the images in pixels
n = 5e4;          % Sample size
X = zeros(n,d^2); % Predictor matrix preallocation
Y = zeros(n,1);   % Label preallocation
theta = 0:(1/d):(2*pi);
r = 5;            % Circle radius
rng(1);           % For reproducibility

for j = 1:n
    figmat = zeros(d);                      % Empty image
    c = datasample((r + 1):(d - r - 1),2);  % Random circle center
    x = r*cos(theta) + c(1);                % Make the circle
    y = r*sin(theta) + c(2);
    idx = sub2ind([d d],round(y),round(x)); % Convert to linear indexing
    figmat(idx) = 1;                        % Draw the circle
    X(j,:) = figmat(:);                     % Store the data
    Y(j) = (c(2) >= floor(d/2)) + 2*(c(2) < floor(d/2)) + ...
        (c(1) < floor(d/2)) + ...
        2*((c(1) >= floor(d/2)) & (c(2) < floor(d/2))); % Determine the quadrant
end

Plot an observation.

figure
imagesc(figmat)
h = gca;
h.YDir = 'normal';
title(sprintf('Quadrant %d',Y(end)))

Figure contains an axes object. The axes object with title Quadrant 1 contains an object of type image.

Train the ECOC Model

Use a 25% holdout sample and specify the training and holdout sample indices.

p = 0.25;
CVP = cvpartition(Y,'Holdout',p); % Cross-validation data partition
isIdx = training(CVP);            % Training sample indices
oosIdx = test(CVP);               % Test sample indices

Create an SVM template that specifies storing the support vectors of the binary learners. Pass it and the training data to fitcecoc to train the model. Determine the training sample classification error.

t = templateSVM('SaveSupportVectors',true);
MdlSV = fitcecoc(X(isIdx,:),Y(isIdx),'Learners',t);
isLoss = resubLoss(MdlSV)
isLoss = 0

MdlSV is a trained ClassificationECOC multiclass model. It stores the training data and the support vectors of each binary learner. For large data sets, such as those in image analysis, the model can consume a lot of memory.

Determine the amount of disk space that the ECOC model consumes.

infoMdlSV = whos('MdlSV');
mbMdlSV = infoMdlSV.bytes/1.049e6
mbMdlSV = 763.6150

The model consumes 763.6 MB.

Improve Model Efficiency

You can assess out-of-sample performance. You can also assess whether the model has been overfit with a compacted model that does not contain the support vectors, their related parameters, and the training data.

Discard the support vectors and related parameters from the trained ECOC model. Then, discard the training data from the resulting model by using compact.

Mdl = discardSupportVectors(MdlSV);
CMdl = compact(Mdl);
info = whos('Mdl','CMdl');
[bytesCMdl,bytesMdl] = info.bytes;
memReduction = 1 - [bytesMdl bytesCMdl]/infoMdlSV.bytes
memReduction = 1×2

    0.0626    0.9996

In this case, discarding the support vectors reduces the memory consumption by about 6%. Compacting and discarding support vectors reduces the size by about 99.96%.

An alternative way to manage support vectors is to reduce their numbers during training by specifying a larger box constraint, such as 100. Though SVM models that use fewer support vectors are more desirable and consume less memory, increasing the value of the box constraint tends to increase the training time.
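
For example, a hedged sketch of retraining with a larger box constraint for each binary learner (the value 100 is the one suggested above, and X, Y, and isIdx are as defined earlier in this example):

% Sketch: trade training time for fewer support vectors by raising the
% box constraint of each binary SVM learner.
t100 = templateSVM('SaveSupportVectors',true,'BoxConstraint',100);
MdlSV100 = fitcecoc(X(isIdx,:),Y(isIdx),'Learners',t100);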

Remove MdlSV and Mdl from the workspace.

clear Mdl MdlSV

Assess Holdout Sample Performance

Calculate the classification error of the holdout sample. Plot a sample of the holdout sample predictions.

oosLoss = loss(CMdl,X(oosIdx,:),Y(oosIdx))
oosLoss = 0
yHat = predict(CMdl,X(oosIdx,:));
nVec = 1:size(X,1);
oosIdx = nVec(oosIdx);

figure;
for j = 1:9
    subplot(3,3,j)
    imagesc(reshape(X(oosIdx(j),:),[d d]))
    h = gca;
    h.YDir = 'normal';
    title(sprintf('Quadrant: %d',yHat(j)))
end
text(-1.33*d,4.5*d + 1,'Predictions','FontSize',17)

Figure contains 9 axes objects. Axes object 1 with title Quadrant: 3 contains an object of type image. Axes object 2 with title Quadrant: 2 contains an object of type image. Axes object 3 with title Quadrant: 2 contains an object of type image. Axes object 4 with title Quadrant: 2 contains an object of type image. Axes object 5 with title Quadrant: 1 contains an object of type image. Axes object 6 with title Quadrant: 2 contains an object of type image. Axes object 7 with title Quadrant: 4 contains an object of type image. Axes object 8 with title Quadrant: 4 contains an object of type image. Axes object 9 with title Quadrant: 3 contains 2 objects of type image, text.

The model does not misclassify any holdout sample observations.


References

[1] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, second edition. New York: Springer, 2008.

[2] Christianini, N., and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge, UK: Cambridge University Press, 2000.

[3] Fan, R.-E., P.-H. Chen, and C.-J. Lin. “Working set selection using second order information for training support vector machines.” Journal of Machine Learning Research, Vol. 6, 2005, pp. 1889–1918.

[4] Kecman, V., T.-M. Huang, and M. Vogt. “Iterative Single Data Algorithm for Training Kernel Machines from Huge Data Sets: Theory and Performance.” In Support Vector Machines: Theory and Applications. Edited by Lipo Wang, 255–274. Berlin: Springer-Verlag, 2005.