Select Predictors for Random Forests
This example shows how to choose the appropriate split-predictor selection technique for your data set when growing a random forest of regression trees. The example also shows how to decide which predictors are most important to include in the training data.
Load and Preprocess Data
Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Consider Cylinders, Model_Year, and Origin as categorical variables.

load carbig
Cylinders = categorical(Cylinders);
Model_Year = categorical(Model_Year);
Origin = categorical(cellstr(Origin));
X = table(Cylinders,Displacement,Horsepower,Weight,Acceleration,Model_Year,Origin);
Determine Levels in Predictors
The standard CART algorithm tends to split on predictors with many unique values (levels), for example, continuous variables, over those with fewer levels, for example, categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature test or interaction test for split-predictor selection instead of standard CART.
For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:

1. Converts all variables to the categorical data type using categorical
2. Determines all unique categories while ignoring missing values using categories
3. Counts the categories using numel

Then, apply the function to each variable using varfun.

countLevels = @(x)numel(categories(categorical(x)));
numLevels = varfun(countLevels,X,'OutputFormat','uniform');
Compare the number of levels among the predictor variables.
figure
bar(numLevels)
title('Number of Levels Among Predictors')
xlabel('Predictor variable')
ylabel('Number of levels')
h = gca;
h.XTickLabel = X.Properties.VariableNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates. In this case, use the curvature test or interaction test. Specify the algorithm by using the 'PredictorSelection' name-value pair argument. For more details, see Choose Split Predictor Selection Technique.
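To see the bias directly, you can also train a comparison ensemble that uses standard CART. This sketch is an addition to the example; 'allsplits' is the default 'PredictorSelection' value, and the results are only indicative.

```matlab
% Sketch: the same bagged ensemble, but with standard CART
% ('PredictorSelection','allsplits' is the default).
tCART = templateTree('PredictorSelection','allsplits','Surrogate','on');
rng(1); % For reproducibility
MdlCART = fitrensemble(X,MPG,'Method','Bag','NumLearningCycles',200, ...
    'Learners',tCART);
impCART = oobPermutedPredictorImportance(MdlCART);
% Importance estimates from standard CART tend to favor continuous
% predictors (many levels) over categorical ones such as Cylinders.
```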
Train Bagged Ensemble of Regression Trees
Train a bagged ensemble of 200 regression trees to estimate predictor importance values. Define a tree learner using these name-value pair arguments:
'NumVariablesToSample','all' - Use all predictor variables at each node to ensure that each tree uses all predictor variables.
'PredictorSelection','interaction-curvature' - Specify usage of the interaction test to select split predictors.
'Surrogate','on' - Specify usage of surrogate splits to increase accuracy because the data set includes missing values.

t = templateTree('NumVariablesToSample','all',...
    'PredictorSelection','interaction-curvature','Surrogate','on');
rng(1); % For reproducibility
Mdl = fitrensemble(X,MPG,'Method','Bag','NumLearningCycles',200,...
    'Learners',t);
Mdl is a RegressionBaggedEnsemble model.
Estimate the model R² using out-of-bag predictions.

yHat = oobPredict(Mdl);
R2 = corr(Mdl.Y,yHat)^2

R2 = 0.8744
Mdl explains 87% of the variability around the mean.
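As an alternative check (a sketch added here, not part of the original example), the out-of-bag mean squared error is available directly from oobLoss:

```matlab
% Out-of-bag MSE of the ensemble; smaller values indicate a better fit.
mseOOB = oobLoss(Mdl);
```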
Predictor Importance Estimation
Estimate predictor importance values by permuting out-of-bag observations among the trees.
impOOB = oobPermutedPredictorImportance(Mdl);
impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.
Compare the predictor importance estimates.
figure
bar(impOOB)
title('Unbiased Predictor Importance Estimates')
xlabel('Predictor variable')
ylabel('Importance')
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
Greater importance estimates indicate more important predictors. The bar graph suggests that Model_Year is the most important predictor, followed by Cylinders and Weight. The Model_Year and Cylinders variables have only 13 and 5 distinct levels, respectively, whereas the Weight variable has over 300 levels.
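To rank the predictors programmatically rather than reading the bar graph, one possible sketch (an addition to the example, using the impOOB vector from the previous step) is:

```matlab
% Sort the out-of-bag permuted importance estimates in descending order.
[~,idx] = sort(impOOB,'descend');
rankedPredictors = Mdl.PredictorNames(idx) % most important first
```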
Compare the predictor importance estimates obtained by permuting out-of-bag observations with the estimates obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.
[impGain,predAssociation] = predictorImportance(Mdl);
figure
plot(1:numel(Mdl.PredictorNames),[impOOB' impGain'])
title('Predictor Importance Estimation Comparison')
xlabel('Predictor variable')
ylabel('Importance')
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
legend('OOB permuted','MSE improvement')
grid on
According to the values of impGain, the variables Displacement, Horsepower, and Weight appear to be equally important.
predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. The predictive measure of association is a value that indicates the similarity between decision rules that split observations. The best surrogate decision split yields the maximum predictive measure of association. You can infer the strength of the relationship between pairs of predictors by using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.
figure
imagesc(predAssociation)
title('Predictor Association Estimates')
colorbar
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
h.YTickLabel = Mdl.PredictorNames;
predAssociation(1,2)
ans = 0.6871
The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.
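To locate the most associated pair of predictors programmatically instead of inspecting the heat map, a possible sketch (added here, assuming the diagonal of predAssociation equals 1) is:

```matlab
% Zero the diagonal (a predictor's association with itself), then
% find the largest off-diagonal entry.
A = predAssociation - diag(diag(predAssociation));
[~,linIdx] = max(A(:));
[i,j] = ind2sub(size(A),linIdx);
mostAssociatedPair = Mdl.PredictorNames([i j])
```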
Grow Random Forest Using Reduced Set of Predictors
Because prediction time increases with the number of predictors in random forests, a good practice is to create a model using as few predictors as possible.
Grow a random forest of 200 regression trees using the best two predictors only. The default 'NumVariablesToSample' value of templateTree is one third of the number of predictors for regression, so fitrensemble uses the random forest algorithm.

t = templateTree('PredictorSelection','interaction-curvature','Surrogate','on',...
    'Reproducible',true); % For reproducibility of random predictor selections
MdlReduced = fitrensemble(X(:,{'Model_Year' 'Weight'}),MPG,'Method','Bag',...
    'NumLearningCycles',200,'Learners',t);
Compute the R² of the reduced model.

yHatReduced = oobPredict(MdlReduced);
r2Reduced = corr(Mdl.Y,yHatReduced)^2

r2Reduced = 0.8653
The R² of the reduced model is close to the R² of the full model. This result suggests that the reduced model is sufficient for prediction.
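To quantify the prediction-time benefit mentioned earlier, you can time both models with timeit. This sketch is an addition to the original example, and the measured ratio depends on your hardware:

```matlab
% Median execution time of prediction for the full and reduced models.
tFull = timeit(@() predict(Mdl,X));
tReduced = timeit(@() predict(MdlReduced,X(:,{'Model_Year' 'Weight'})));
speedup = tFull/tReduced % ratio of full to reduced prediction time
```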
See Also
templateTree | fitrensemble | oobPredict | oobPermutedPredictorImportance | predictorImportance | corr