Main Content

ClassificationTree class

Superclasses: CompactClassificationTree

Binary decision tree for multiclass classification

Description

A ClassificationTree object represents a decision tree with binary splits for classification. An object of this class can predict responses for new data using the predict method. The object contains the data used for training, so it can also compute resubstitution predictions.

Construction

Create a ClassificationTree object by using fitctree.

Properties

BinEdges

Bin edges for numeric predictors, specified as a cell array of p numeric vectors, where p is the number of predictors. Each vector includes the bin edges for a numeric predictor. The element in the cell array for a categorical predictor is empty because the software does not bin categorical predictors.

The software bins numeric predictors only if you specify the 'NumBins' name-value argument as a positive integer scalar when training a model with tree learners. The BinEdges property is empty if the 'NumBins' value is empty (default).

You can reproduce the binned predictor data Xbinned by using the BinEdges property of the trained model mdl.

X = mdl.X; % Predictor data
Xbinned = zeros(size(X));
edges = mdl.BinEdges;
% Find indices of binned predictors.
idxNumeric = find(~cellfun(@isempty,edges));
if iscolumn(idxNumeric)
    idxNumeric = idxNumeric';
end
for j = idxNumeric
    x = X(:,j);
    % Convert x to array if x is a table.
    if istable(x)
        x = table2array(x);
    end
    % Group x into bins by using the discretize function.
    xbinned = discretize(x,[-inf; edges{j}; inf]);
    Xbinned(:,j) = xbinned;
end
Xbinned contains the bin indices, ranging from 1 to the number of bins, for numeric predictors. Xbinned values are 0 for categorical predictors. If X contains NaNs, then the corresponding Xbinned values are NaNs.
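As a minimal sketch (assuming the fisheriris example data set that ships with Statistics and Machine Learning Toolbox), you can enable binning at training time and then inspect BinEdges directly:

```matlab
% Train with binning enabled; 'NumBins' must be a positive integer scalar.
load fisheriris                    % example data: 150-by-4 numeric meas, species labels
mdl = fitctree(meas,species,'NumBins',10);
edges = mdl.BinEdges;              % one bin-edge vector per numeric predictor
disp(edges{1})                     % bin edges for the first predictor
```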

CategoricalPredictors

Categorical predictor indices, specified as a vector of positive integers. CategoricalPredictors contains index values indicating that the corresponding predictors are categorical. The index values are between 1 and p, where p is the number of predictors used to train the model. If none of the predictors are categorical, then this property is empty ([]).

CategoricalSplit

An n-by-2 cell array, where n is the number of categorical splits in tree. Each row in CategoricalSplit gives left and right values for a categorical split. For each branch node with categorical split j based on a categorical predictor variable z, the left child is chosen if z is in CategoricalSplit(j,1) and the right child is chosen if z is in CategoricalSplit(j,2). The splits are in the same order as nodes of the tree. Nodes for these splits can be found by running cuttype and selecting 'categorical' cuts from top to bottom.

Children

An n-by-2 array containing the numbers of the child nodes for each node in tree, where n is the number of nodes. Leaf nodes have child node 0.

ClassCount

An n-by-k array of class counts for the nodes in tree, where n is the number of nodes and k is the number of classes. For any node number i, the class counts ClassCount(i,:) are counts of observations (from the data used in fitting the tree) from each class satisfying the conditions for node i.

ClassNames

List of the elements in Y with duplicates removed. ClassNames can be a categorical array, cell array of character vectors, character array, logical vector, or a numeric vector. ClassNames has the same data type as the data in the argument Y. (The software treats string arrays as cell arrays of character vectors.)

ClassProbability

An n-by-k array of class probabilities for the nodes in tree, where n is the number of nodes and k is the number of classes. For any node number i, the class probabilities ClassProbability(i,:) are the estimated probabilities for each class for a point satisfying the conditions for node i.
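For example, the root node (node 1) summarizes the entire training set, so its row of ClassCount and ClassProbability reflects the overall class distribution. A minimal sketch, assuming the fisheriris example data set:

```matlab
load fisheriris
tree = fitctree(meas,species);
tree.ClassCount(1,:)          % counts of each class at the root node
tree.ClassProbability(1,:)    % estimated class probabilities at the root node
tree.ClassNames               % column order of the two arrays above
```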

Cost

Square matrix, where Cost(i,j) is the cost of classifying a point into class j if its true class is i (the rows correspond to the true class and the columns correspond to the predicted class). The order of the rows and columns of Cost corresponds to the order of the classes in ClassNames. The number of rows and columns in Cost is the number of unique classes in the response. This property is read-only.

CutCategories

An n-by-2 cell array of the categories used at branches in tree, where n is the number of nodes. For each branch node i based on a categorical predictor variable X, the left child is chosen if X is among the categories listed in CutCategories{i,1}, and the right child is chosen if X is among those listed in CutCategories{i,2}. Both columns of CutCategories are empty for branch nodes based on continuous predictors and for leaf nodes.

CutPoint contains the cut points for 'continuous' cuts, and CutCategories contains the set of categories.

CutPoint

An n-element vector of the values used as cut points in tree, where n is the number of nodes. For each branch node i based on a continuous predictor variable X, the left child is chosen if X < CutPoint(i) and the right child is chosen if X >= CutPoint(i). CutPoint is NaN for branch nodes based on categorical predictors and for leaf nodes.

CutPoint contains the cut points for 'continuous' cuts, and CutCategories contains the set of categories.

CutType

An n-element cell array indicating the type of cut at each node in tree, where n is the number of nodes. For each node i, CutType{i} is:

  • 'continuous' — If the cut is defined in the form X < v for a variable X and cut point v.

  • 'categorical' — If the cut is defined by whether a variable X takes a value in a set of categories.

  • '' — If i is a leaf node.

CutPoint contains the cut points for 'continuous' cuts, and CutCategories contains the set of categories.
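Together, Children, CutPredictorIndex, CutPoint, and CutType determine how an observation moves through the tree. The following sketch (assuming the fisheriris data set, where all predictors are continuous, so only 'continuous' cuts occur) walks one observation from the root to a leaf by hand; predict performs the equivalent traversal:

```matlab
load fisheriris
tree = fitctree(meas,species);
x = meas(1,:);                        % one observation
node = 1;                             % start at the root
while tree.IsBranchNode(node)
    % This sketch assumes 'continuous' cuts only (true for fisheriris).
    j = tree.CutPredictorIndex(node); % predictor used at this branch
    if x(j) < tree.CutPoint(node)
        node = tree.Children(node,1); % left child
    else
        node = tree.Children(node,2); % right child
    end
end
tree.NodeClass{node}                  % most probable class at the leaf reached
```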

CutPredictor

An n-element cell array of the names of the variables used for branching in each node in tree, where n is the number of nodes. These variables are sometimes known as cut variables. For leaf nodes, CutPredictor contains an empty character vector.

CutPoint contains the cut points for 'continuous' cuts, and CutCategories contains the set of categories.

CutPredictorIndex

An n-element array of numeric indices for the variables used for branching in each node in tree, where n is the number of nodes. For more information, see CutPredictor.

ExpandedPredictorNames

Expanded predictor names, stored as a cell array of character vectors.

If the model uses encoding for categorical variables, then ExpandedPredictorNames includes the names that describe the expanded variables. Otherwise, ExpandedPredictorNames is the same as PredictorNames.

HyperparameterOptimizationResults

Description of the cross-validation optimization of hyperparameters, stored as a BayesianOptimization object or a table of hyperparameters and associated values. This property is nonempty when the OptimizeHyperparameters name-value pair is nonempty at creation. The value depends on the setting of the HyperparameterOptimizationOptions name-value pair at creation:

  • 'bayesopt' (default) — Object of class BayesianOptimization

  • 'gridsearch' or 'randomsearch' — Table of the hyperparameters used, observed objective function values (cross-validation loss), and rank of observations from lowest (best) to highest (worst)
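A minimal sketch of producing this property (assuming the fisheriris data set; the options shown suppress plotting and iteration output during the optimization):

```matlab
load fisheriris
rng(0)   % for reproducibility of the optimization
mdl = fitctree(meas,species,'OptimizeHyperparameters','auto', ...
    'HyperparameterOptimizationOptions', ...
    struct('ShowPlots',false,'Verbose',0));
results = mdl.HyperparameterOptimizationResults  % BayesianOptimization object
results.XAtMinObjective                          % best hyperparameter values found
```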

IsBranchNode

An n-element logical vector that is true for each branch node and false for each leaf node of tree.

ModelParameters

Parameters used in training tree. To display all parameter values, enter tree.ModelParameters. To access a particular parameter, use dot notation.

NumObservations

Number of observations in the training data, a numeric scalar. NumObservations can be less than the number of rows of input data X when there are missing values in X or response Y.

NodeClass

An n-element cell array with the names of the most probable classes in each node of tree, where n is the number of nodes in the tree. Every element of this array is a character vector equal to one of the class names in ClassNames.

NodeError

An n-element vector of the errors of the nodes in tree, where n is the number of nodes. NodeError(i) is the misclassification probability for node i.

NodeProbability

An n-element vector of the probabilities of the nodes in tree, where n is the number of nodes. The probability of a node is computed as the proportion of observations from the original data that satisfy the conditions for the node. This proportion is adjusted for any prior probabilities assigned to each class.

NodeRisk

An n-element vector of the risk of the nodes in the tree, where n is the number of nodes. The risk for each node is the measure of impurity (Gini index or deviance) for this node weighted by the node probability. If the tree is grown by twoing, the risk for each node is zero.

NodeSize

An n-element vector of the sizes of the nodes in tree, where n is the number of nodes. The size of a node is defined as the number of observations from the data used to create the tree that satisfy the conditions for the node.

NumNodes

The number of nodes in tree.

Parent

An n-element vector containing the number of the parent node for each node in tree, where n is the number of nodes. The parent of the root node is 0.

PredictorNames

Cell array of character vectors containing the predictor names, in the order in which they appear in X.

Prior

Numeric vector of prior probabilities for each class. The order of the elements of Prior corresponds to the order of the classes in ClassNames. The number of elements of Prior is the number of unique classes in the response. This property is read-only.

PruneAlpha

Numeric vector with one element per pruning level. If the pruning level ranges from 0 to M, then PruneAlpha has M + 1 elements sorted in ascending order. PruneAlpha(1) is for pruning level 0 (no pruning), PruneAlpha(2) is for pruning level 1, and so on.

PruneList

An n-element numeric vector with the pruning levels in each node of tree, where n is the number of nodes. The pruning levels range from 0 (no pruning) to M, where M is the distance between the deepest leaf and the root node.
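PruneList and PruneAlpha describe the sequence of subtrees that the prune function can produce. A minimal sketch, assuming the fisheriris data set:

```matlab
load fisheriris
tree = fitctree(meas,species);
M = max(tree.PruneList)          % maximum pruning level
numel(tree.PruneAlpha)           % M + 1 alpha values, sorted ascending
pruned = prune(tree,'Level',M);  % prune all the way back toward the root
pruned.NumNodes                  % far fewer nodes than tree.NumNodes
```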

ResponseName

A character vector that specifies the name of the response variable (Y).

RowsUsed

An n-element logical vector indicating which rows of the original predictor data (X) were used in fitting. If the software uses all rows of X, then RowsUsed is an empty array ([]).

ScoreTransform

Function handle for transforming predicted classification scores, or character vector representing a built-in transformation function.

'none' means no transformation; equivalently, @(x)x.

To change the score transformation function to function, for example, use dot notation.

  • For available functions (see fitctree), enter

    Mdl.ScoreTransform = 'function';
  • You can set a function handle for an available function, or for a function that you define yourself, by entering

    tree.ScoreTransform = @function;
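One hedged sketch (assuming the fisheriris data set) that assigns a custom transformation via dot notation and observes its effect on predicted scores:

```matlab
load fisheriris
tree = fitctree(meas,species);
[~,rawScore] = predict(tree,meas(1,:));     % scores with the 'none' transform
tree.ScoreTransform = @(x)1./(1 + exp(-x)); % custom sigmoid-style handle
[~,newScore] = predict(tree,meas(1,:));     % scores now pass through the handle
```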

SurrogateCutCategories

An n-element cell array of the categories used for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrogateCutCategories{k} is a cell array. The length of SurrogateCutCategories{k} is equal to the number of surrogate predictors found at this node. Every element of SurrogateCutCategories{k} is either an empty character vector for a continuous surrogate predictor, or a two-element cell array with categories for a categorical surrogate predictor. The first element of this two-element cell array lists categories assigned to the left child by this surrogate split, and the second element lists categories assigned to the right child. The order of the surrogate split variables at each node is matched to the order of variables in SurrogateCutPredictor. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrogateCutCategories contains an empty cell.

SurrogateCutFlip

An n-element cell array of the numeric cut assignments used for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrogateCutFlip{k} is a numeric vector. The length of SurrogateCutFlip{k} is equal to the number of surrogate predictors found at this node. Every element of SurrogateCutFlip{k} is either zero for a categorical surrogate predictor, or a numeric cut assignment for a continuous surrogate predictor. The numeric cut assignment can be either –1 or +1. For every surrogate split with a numeric cut C based on a continuous predictor variable Z, the left child is chosen if Z < C and the cut assignment for this surrogate split is +1, or if Z >= C and the cut assignment for this surrogate split is –1. Similarly, the right child is chosen if Z >= C and the cut assignment for this surrogate split is +1, or if Z < C and the cut assignment for this surrogate split is –1. The order of the surrogate split variables at each node is matched to the order of variables in SurrogateCutPredictor. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrogateCutFlip contains an empty array.

SurrogateCutPoint

An n-element cell array of the numeric values used for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrogateCutPoint{k} is a numeric vector. The length of SurrogateCutPoint{k} is equal to the number of surrogate predictors found at this node. Every element of SurrogateCutPoint{k} is either NaN for a categorical surrogate predictor, or a numeric cut for a continuous surrogate predictor. For every surrogate split with a numeric cut C based on a continuous predictor variable Z, the left child is chosen if Z < C and SurrogateCutFlip for this surrogate split is +1, or if Z >= C and SurrogateCutFlip for this surrogate split is –1. Similarly, the right child is chosen if Z >= C and SurrogateCutFlip for this surrogate split is +1, or if Z < C and SurrogateCutFlip for this surrogate split is –1. The order of the surrogate split variables at each node is matched to the order of variables returned by SurrogateCutPredictor. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrogateCutPoint contains an empty cell.

SurrogateCutType

An n-element cell array indicating the types of surrogate splits at each node in tree, where n is the number of nodes in tree. For each node k, SurrogateCutType{k} is a cell array with the types of the surrogate split variables at this node. The variables are sorted by the predictive measure of association with the optimal predictor in descending order, and only variables with a positive predictive measure are included. The order of the surrogate split variables at each node is matched to the order of variables in SurrogateCutPredictor. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrogateCutType contains an empty cell. A surrogate split type can be either 'continuous' if the cut is defined in the form Z < V for a variable Z and cut point V, or 'categorical' if the cut is defined by whether Z takes a value in a set of categories.

SurrogateCutPredictor

An n-element cell array of the names of the variables used for surrogate splits in each node in tree, where n is the number of nodes in tree. Every element of SurrogateCutPredictor is a cell array with the names of the surrogate split variables at this node. The variables are sorted by the predictive measure of association with the optimal predictor in descending order, and only variables with a positive predictive measure are included. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrogateCutPredictor contains an empty cell.

SurrogatePredictorAssociation

An n-element cell array of the predictive measures of association for surrogate splits in tree, where n is the number of nodes in tree. For each node k, SurrogatePredictorAssociation{k} is a numeric vector. The length of SurrogatePredictorAssociation{k} is equal to the number of surrogate predictors found at this node. Every element of SurrogatePredictorAssociation{k} gives the predictive measure of association between the optimal split and this surrogate split. The order of the surrogate split variables at each node is the order of variables in SurrogateCutPredictor. The optimal-split variable at this node does not appear. For nonbranch (leaf) nodes, SurrogatePredictorAssociation contains an empty cell.
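The surrogate properties are populated only when you train with surrogate splits enabled. A minimal sketch, assuming the fisheriris data set:

```matlab
load fisheriris
tree = fitctree(meas,species,'Surrogate','on');
k = find(tree.IsBranchNode,1);            % first branch node
tree.SurrogateCutPredictor{k}             % surrogate predictors at this node
tree.SurrogatePredictorAssociation{k}     % association with the optimal split
```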

W

The scaled weights, a vector with length n, the number of rows in X.

X

A matrix or table of predictor values. Each column of X represents one variable, and each row represents one observation.

Y

A categorical array, cell array of character vectors, character array, logical vector, or a numeric vector. Each row of Y represents the classification of the corresponding row of X.

Object Functions

compact Compact tree
compareHoldout Compare accuracies of two classification models using new data
crossval Cross-validated decision tree
cvloss Classification error by cross validation
edge Classification edge
gather Gather properties of Statistics and Machine Learning Toolbox object from GPU
lime Local interpretable model-agnostic explanations (LIME)
loss Classification error
margin Classification margins
nodeVariableRange Retrieve variable range of decision tree node
partialDependence Compute partial dependence
plotPartialDependence Create partial dependence plot (PDP) and individual conditional expectation (ICE) plots
predict Predict labels using classification tree
predictorImportance Estimates of predictor importance for classification tree
prune Produce sequence of classification subtrees by pruning
resubEdge Classification edge by resubstitution
resubLoss Classification error by resubstitution
resubMargin Classification margins by resubstitution
resubPredict Predict resubstitution labels of classification tree
shapley Shapley values
surrogateAssociation Mean predictive measure of association for surrogate splits in classification tree
testckfold Compare accuracies of two classification models by repeated cross-validation
view View classification tree

Copy Semantics

Value. To learn how value classes affect copy operations, see Copying Objects.

Examples


Grow a classification tree using the ionosphere data set.

load ionosphere
tc = fitctree(X,Y)
tc = 
  ClassificationTree
             ResponseName: 'Y'
    CategoricalPredictors: []
               ClassNames: {'b'  'g'}
           ScoreTransform: 'none'
          NumObservations: 351

  Properties, Methods

You can control the depth of the trees using the MaxNumSplits, MinLeafSize, or MinParentSize name-value pair parameters. fitctree grows deep decision trees by default. You can grow shallower trees to reduce model complexity or computation time.

Load theionospheredata set.

load ionosphere

The default values of the tree depth controllers for growing classification trees are:

  • n - 1 for MaxNumSplits. n is the training sample size.

  • 1 for MinLeafSize.

  • 10 for MinParentSize.

These default values tend to grow deep trees for large training sample sizes.

Train a classification tree using the default values for tree depth control. Cross-validate the model by using 10-fold cross-validation.

rng(1); % For reproducibility
MdlDefault = fitctree(X,Y,'CrossVal','on');

Draw a histogram of the number of imposed splits on the trees. Also, view one of the trees.

numBranches = @(x)sum(x.IsBranchNode);
mdlDefaultNumSplits = cellfun(numBranches, MdlDefault.Trained);
figure;
histogram(mdlDefaultNumSplits)

Figure contains an axes object. The axes object contains an object of type histogram.

view(MdlDefault.Trained{1},'Mode','graph')

Figure Classification tree viewer contains an axes object and other objects of type uimenu, uicontrol. The axes object contains 51 objects of type line, text. One or more of the lines displays its values using only markers

The average number of splits is around 15.

Suppose that you want a classification tree that is not as complex (deep) as the ones trained using the default number of splits. Train another classification tree, but set the maximum number of splits at 7, which is about half the mean number of splits from the default classification tree. Cross-validate the model by using 10-fold cross-validation.

Mdl7 = fitctree(X,Y,'MaxNumSplits',7,'CrossVal','on');
view(Mdl7.Trained{1},'Mode','graph')

Figure Classification tree viewer contains an axes object and other objects of type uimenu, uicontrol. The axes object contains 21 objects of type line, text. One or more of the lines displays its values using only markers

Compare the cross-validation classification errors of the models.

classErrorDefault = kfoldLoss(MdlDefault)
classErrorDefault = 0.1168
classError7 = kfoldLoss(Mdl7)
classError7 = 0.1311

Mdl7 is much less complex and performs only slightly worse than MdlDefault.




Version History

Introduced in R2011a