
crossval

Cross-validate machine learning model

    Description


CVMdl = crossval(Mdl) returns a cross-validated (partitioned) machine learning model (CVMdl) from a trained model (Mdl). By default, crossval uses 10-fold cross-validation on the training data.

CVMdl = crossval(Mdl,Name,Value) sets an additional cross-validation option. You can specify only one name-value argument. For example, you can specify the number of folds or a holdout sample proportion.
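For example, assuming Mdl is any supported trained model (such as the SVM classifier trained in the first example below), the two call forms look like this:

CVMdl = crossval(Mdl);                 % 10-fold cross-validation (default)
CVMdl = crossval(Mdl,'Holdout',0.2);   % hold out 20% of the data instead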

    Examples


Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').

load ionosphere
rng(1); % For reproducibility

    Train a support vector machine (SVM) classifier. Standardize the predictor data and specify the order of the classes.

    SVMModel = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'});

SVMModel is a trained ClassificationSVM classifier. 'b' is the negative class and 'g' is the positive class.

    Cross-validate the classifier using 10-fold cross-validation.

    CVSVMModel = crossval(SVMModel)
CVSVMModel =
  ClassificationPartitionedModel
    CrossValidatedModel: 'SVM'
         PredictorNames: {1x34 cell}
           ResponseName: 'Y'
        NumObservations: 351
                  KFold: 10
              Partition: [1x1 cvpartition]
             ClassNames: {'b'  'g'}
         ScoreTransform: 'none'

  Properties, Methods

CVSVMModel is a ClassificationPartitionedModel cross-validated classifier. During cross-validation, the software completes these steps (see the sketch after this list):

    1. Randomly partition the data into 10 sets of equal size.

    2. Train an SVM classifier on nine of the sets.

3. Repeat step 2 a total of k = 10 times, leaving out one partition each time and training on the other nine partitions.

4. Combine the generalization statistics from each fold.
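Conceptually, the procedure resembles the following sketch, which partitions the data with cvpartition and trains one compact SVM classifier per fold. This is an illustration of the steps above, not the exact internal implementation of crossval.

cvp = cvpartition(Y,'KFold',10);      % stratified 10-fold partition of the labels
trained = cell(cvp.NumTestSets,1);
for i = 1:cvp.NumTestSets
    trainIdx = training(cvp,i);       % logical index of the nine training folds
    trained{i} = compact(fitcsvm(X(trainIdx,:),Y(trainIdx), ...
        'Standardize',true,'ClassNames',{'b','g'}));
end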

Display the first model in CVSVMModel.Trained.

    FirstModel = CVSVMModel.Trained{1}
FirstModel =
  CompactClassificationSVM
             ResponseName: 'Y'
    CategoricalPredictors: []
               ClassNames: {'b'  'g'}
           ScoreTransform: 'none'
                    Alpha: [78x1 double]
                     Bias: -0.2208
         KernelParameters: [1x1 struct]
                       Mu: [0.8888 0 0.6320 0.0406 0.5931 0.1205 0.5361 ...]
                    Sigma: [0.3149 0 0.5033 0.4441 0.5255 0.4663 0.4987 ...]
           SupportVectors: [78x34 double]
      SupportVectorLabels: [78x1 double]

  Properties, Methods

FirstModel is the first of the 10 trained classifiers. It is a CompactClassificationSVM classifier.

You can estimate the generalization error by passing CVSVMModel to kfoldLoss.
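For example:

genError = kfoldLoss(CVSVMModel);   % misclassification rate averaged over the 10 folds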

Specify a holdout sample proportion for cross-validation. By default, crossval uses 10-fold cross-validation to cross-validate a naive Bayes classifier. However, you have several other options for cross-validation. For example, you can specify a different number of folds or a holdout sample proportion.

Load the ionosphere data set. This data set has 34 predictors and 351 binary responses for radar returns, either bad ('b') or good ('g').

load ionosphere

    Remove the first two predictors for stability.

X = X(:,3:end);
rng('default'); % For reproducibility

Train a naive Bayes classifier using the predictors X and class labels Y. A recommended practice is to specify the class names. 'b' is the negative class and 'g' is the positive class. fitcnb assumes that each predictor is conditionally normally distributed, given the class.

    Mdl = fitcnb(X,Y,'ClassNames',{'b','g'});

Mdl is a trained ClassificationNaiveBayes classifier.

    Cross-validate the classifier by specifying a 30% holdout sample.

    CVMdl = crossval(Mdl,'Holdout',0.3)
CVMdl =
  ClassificationPartitionedModel
    CrossValidatedModel: 'NaiveBayes'
         PredictorNames: {1x32 cell}
           ResponseName: 'Y'
        NumObservations: 351
                  KFold: 1
              Partition: [1x1 cvpartition]
             ClassNames: {'b'  'g'}
         ScoreTransform: 'none'

  Properties, Methods

CVMdl is a ClassificationPartitionedModel cross-validated, naive Bayes classifier.

    Display the properties of the classifier trained using 70% of the data.

    TrainedModel = CVMdl.Trained{1}
TrainedModel =
  CompactClassificationNaiveBayes
              ResponseName: 'Y'
     CategoricalPredictors: []
                ClassNames: {'b'  'g'}
            ScoreTransform: 'none'
         DistributionNames: {1x32 cell}
    DistributionParameters: {2x32 cell}

  Properties, Methods

TrainedModel is a CompactClassificationNaiveBayes classifier.

Estimate the generalization error by passing CVMdl to kfoldLoss.

    kfoldLoss(CVMdl)
    ans = 0.2095

    The out-of-sample misclassification error is approximately 21%.

Reduce the generalization error by selecting the five most important predictors.

idx = fscmrmr(X,Y);
Xnew = X(:,idx(1:5));

Train a naive Bayes classifier using the new predictors.

    Mdlnew = fitcnb(Xnew,Y,'ClassNames',{'b','g'});

    Cross-validate the new classifier by specifying a 30% holdout sample, and estimate the generalization error.

CVMdlnew = crossval(Mdlnew,'Holdout',0.3);
kfoldLoss(CVMdlnew)
    ans = 0.1429

    The out-of-sample misclassification error is reduced from approximately 21% to approximately 14%.

Train a regression generalized additive model (GAM) by using fitrgam, and create a cross-validated GAM by using crossval and the holdout option. Then, use kfoldPredict to predict responses for validation-fold observations using a model trained on training-fold observations.

Load the patients data set.

load patients

Create a table that contains the predictor variables (Age, Diastolic, Smoker, Weight, Gender, SelfAssessedHealthStatus) and the response variable (Systolic).

    tbl = table(Age,Diastolic,Smoker,Weight,Gender,SelfAssessedHealthStatus,Systolic);

    Train a GAM that contains linear terms for predictors.

    Mdl = fitrgam(tbl,'Systolic');

Mdl is a RegressionGAM model object.

    Cross-validate the model by specifying a 30% holdout sample.

rng('default') % For reproducibility
CVMdl = crossval(Mdl,'Holdout',0.3)
CVMdl =
  RegressionPartitionedGAM
       CrossValidatedModel: 'GAM'
            PredictorNames: {1x6 cell}
     CategoricalPredictors: [3 5 6]
              ResponseName: 'Systolic'
           NumObservations: 100
                     KFold: 1
                 Partition: [1x1 cvpartition]
         NumTrainedPerFold: [1x1 struct]
         ResponseTransform: 'none'
    IsStandardDeviationFit: 0

  Properties, Methods

The crossval function creates a RegressionPartitionedGAM model object CVMdl with the holdout option. During cross-validation, the software completes these steps:

    1. Randomly select and reserve 30% of the data as validation data, and train the model using the rest of the data.

2. Store the compact, trained model in the Trained property of the cross-validated RegressionPartitionedGAM model object.

You can choose a different cross-validation setting by using the 'CVPartition', 'KFold', or 'Leaveout' name-value argument.

Predict responses for the validation-fold observations by using kfoldPredict. The function predicts responses by using the model trained on the training-fold observations, and assigns NaN to the training-fold observations.

    yFit = kfoldPredict(CVMdl);

    Find the validation-fold observation indexes, and create a table containing the observation index, observed response values, and predicted response values. Display the first eight rows of the table.

idx = find(~isnan(yFit));
t = table(idx,tbl.Systolic(idx),yFit(idx), ...
    'VariableNames',{'Observation Index','Observed Value','Predicted Value'});
head(t)
ans = 8×3 table
    Observation Index    Observed Value    Predicted Value
    _________________    ______________    _______________
            1                 124               130.22
            6                 121               124.38
            7                 130               125.26
           12                 115               117.05
           20                 125               121.82
           22                 123               116.99
           23                 114                  107
           24                 128               122.52

    Compute the regression error (mean squared error) for the validation-fold observations.

    L = kfoldLoss(CVMdl)
    L = 43.8715

    Input Arguments


Mdl

Machine learning model, specified as a full regression or classification model object, as given in the following tables of supported models.

    Regression Model Object

    Model Full Regression Model Object
Gaussian process regression (GPR) model RegressionGP (If you supply a custom 'ActiveSet' in the call to fitrgp, then you cannot cross-validate the GPR model.)
    Generalized additive model (GAM) RegressionGAM
    Neural network model RegressionNeuralNetwork

    Classification Model Object

    Model Full Classification Model Object
    Generalized additive model ClassificationGAM
    k-nearest neighbor model ClassificationKNN
    Naive Bayes model ClassificationNaiveBayes
    Neural network model ClassificationNeuralNetwork
    Support vector machine for one-class and binary classification ClassificationSVM

    Name-Value Arguments

Specify optional pairs of arguments as Name,Value, where Name is the argument name and Value is the corresponding value. Name must appear inside quotes. For crossval, you can specify only one name-value argument.

Example: crossval(Mdl,'KFold',3) specifies using three folds in a cross-validated model.

'CVPartition'

Cross-validation partition, specified as a cvpartition object created by the cvpartition function. The partition object specifies the type of cross-validation and the indexing for the training and validation sets.

You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

Example: Suppose you create a random partition for 5-fold cross-validation on 500 observations by using cvp = cvpartition(500,'KFold',5). Then, you can specify the cross-validated model by using 'CVPartition',cvp.
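In code, that example looks like this sketch (assuming Mdl is a model trained on the same 500 observations):

cvp = cvpartition(500,'KFold',5);          % random 5-fold partition of 500 observations
CVMdl = crossval(Mdl,'CVPartition',cvp);   % cross-validate using that partition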

'Holdout'

Fraction of the data used for holdout validation, specified as a scalar value in the range (0,1). If you specify 'Holdout',p, then the software completes these steps:

1. Randomly select and reserve p*100% of the data as validation data, and train the model using the rest of the data.

2. Store the compact, trained model in the Trained property of the cross-validated model. If Mdl does not have a corresponding compact object, then Trained contains a full object.

You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

Example: 'Holdout',0.1

Data Types: double | single
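For instance, this sketch holds out 10% of the data and retrieves the compact model trained on the remaining 90% (assuming Mdl is a trained model):

CVMdl = crossval(Mdl,'Holdout',0.1);
holdoutModel = CVMdl.Trained{1};   % model trained on the other 90% of the data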

'KFold'

Number of folds to use in a cross-validated model, specified as a positive integer value greater than 1. If you specify 'KFold',k, then the software completes these steps:

1. Randomly partition the data into k sets.

2. For each set, reserve the set as validation data, and train the model using the other k – 1 sets.

3. Store the k compact, trained models in a k-by-1 cell vector in the Trained property of the cross-validated model. If Mdl does not have a corresponding compact object, then Trained contains a full object.

You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

Example: 'KFold',5

Data Types: single | double
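For example, this sketch runs 5-fold cross-validation and confirms that the Trained property holds one compact model per fold (assuming Mdl is a trained model):

CVMdl = crossval(Mdl,'KFold',5);
numel(CVMdl.Trained)   % returns 5, one compact model per fold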

'Leaveout'

Leave-one-out cross-validation flag, specified as 'on' or 'off'. If you specify 'Leaveout','on', then for each of the n observations (where n is the number of observations, excluding missing observations, specified in the NumObservations property of the model), the software completes these steps:

1. Reserve the one observation as validation data, and train the model using the other n – 1 observations.

2. Store the n compact, trained models in an n-by-1 cell vector in the Trained property of the cross-validated model. If Mdl does not have a corresponding compact object, then Trained contains a full object.

You can specify only one of these four name-value arguments: 'CVPartition', 'Holdout', 'KFold', or 'Leaveout'.

Example: 'Leaveout','on'
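For example, this sketch pairs leave-one-out cross-validation with a loss estimate (assuming Mdl is a trained model; training n models can be slow for large data sets):

CVMdl = crossval(Mdl,'Leaveout','on');
L = kfoldLoss(CVMdl);   % leave-one-out estimate of the generalization error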

    Output Arguments


CVMdl

Cross-validated machine learning model, returned as one of the cross-validated (partitioned) model objects in the following tables, depending on the input model Mdl.

    Regression Model Object

    Model Regression Model (Mdl) Cross-Validated Model (CVMdl)
    Gaussian process regression model RegressionGP RegressionPartitionedModel
    Generalized additive model RegressionGAM RegressionPartitionedGAM
    Neural network model RegressionNeuralNetwork RegressionPartitionedModel

    Classification Model Object

    Model Classification Model (Mdl) Cross-Validated Model (CVMdl)
    Generalized additive model ClassificationGAM ClassificationPartitionedGAM
    k-nearest neighbor model ClassificationKNN ClassificationPartitionedModel
    Naive Bayes model ClassificationNaiveBayes ClassificationPartitionedModel
    Neural network model ClassificationNeuralNetwork ClassificationPartitionedModel
    Support vector machine for one-class and binary classification ClassificationSVM ClassificationPartitionedModel

    Tips

• Assess the predictive performance of Mdl on cross-validated data by using the kfold functions and properties of CVMdl, such as kfoldPredict, kfoldLoss, kfoldMargin, and kfoldEdge for classification, and kfoldPredict and kfoldLoss for regression.

• Return a partitioned classifier with stratified partitioning by using the name-value argument 'KFold' or 'Holdout'.

• Create a cvpartition object cvp by using cvp = cvpartition(n,'KFold',k). Return a partitioned classifier with nonstratified partitioning by using the name-value argument 'CVPartition',cvp, as in the sketch below.
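The following sketch ties these tips together: it builds a nonstratified partition, cross-validates, and assesses the result with the kfold functions (assuming Mdl is a trained classifier whose training data has n observations):

n = Mdl.NumObservations;            % observations used to train Mdl
cvp = cvpartition(n,'KFold',5);     % nonstratified: partitions by count, ignoring classes
CVMdl = crossval(Mdl,'CVPartition',cvp);
labels = kfoldPredict(CVMdl);       % cross-validated predicted labels
L = kfoldLoss(CVMdl);               % cross-validated misclassification rate
e = kfoldEdge(CVMdl);               % mean cross-validated classification margin (edge)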

    Alternative Functionality

Instead of training a model and then cross-validating it, you can create a cross-validated model directly by using a fitting function and specifying one of these name-value arguments: 'CrossVal', 'CVPartition', 'Holdout', 'Leaveout', or 'KFold'.
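For example, a minimal sketch with fitcsvm (assuming predictor data X and class labels Y as in the first example):

CVMdl = fitcsvm(X,Y,'Standardize',true,'ClassNames',{'b','g'}, ...
    'CrossVal','on');   % returns a cross-validated (partitioned) model directly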


    Introduced in R2012a