Select Predictors for Random Forests
This example shows how to choose the appropriate split-predictor selection technique for your data set when growing a random forest of regression trees. The example also shows how to decide which predictors are most important to include in the training data.
Load and Preprocess Data
Load the carbig data set. Consider a model that predicts the fuel economy of a car given its number of cylinders, engine displacement, horsepower, weight, acceleration, model year, and country of origin. Consider Cylinders, Model_Year, and Origin as categorical variables.

load carbig
Cylinders = categorical(Cylinders);
Model_Year = categorical(Model_Year);
Origin = categorical(cellstr(Origin));
X = table(Cylinders,Displacement,Horsepower,Weight,Acceleration,Model_Year,Origin);
Determine Levels in Predictors
The standard CART algorithm tends to split on predictors with many unique values (levels), for example, continuous variables, over those with fewer levels, for example, categorical variables. If your data is heterogeneous, or your predictor variables vary greatly in their number of levels, then consider using the curvature test or interaction test for split-predictor selection instead of standard CART.
For each predictor, determine the number of levels in the data. One way to do this is to define an anonymous function that:

1. Converts all variables to the categorical data type using categorical
2. Determines all unique categories while ignoring missing values using categories
3. Counts the categories using numel

Then, apply the function to each variable using varfun.

countLevels = @(x)numel(categories(categorical(x)));
numLevels = varfun(countLevels,X,'OutputFormat','uniform');
Compare the number of levels among the predictor variables.
figure
bar(numLevels)
title('Number of Levels Among Predictors')
xlabel('Predictor variable')
ylabel('Number of levels')
h = gca;
h.XTickLabel = X.Properties.VariableNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
The continuous variables have many more levels than the categorical variables. Because the number of levels among the predictors varies so much, using standard CART to select split predictors at each node of the trees in a random forest can yield inaccurate predictor importance estimates. In this case, use the curvature test or interaction test. Specify the algorithm by using the 'PredictorSelection' name-value pair argument. For more details, see Choose Split Predictor Selection Technique.
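To see the bias directly, you can also train a comparison ensemble that uses standard CART. This sketch is an addition to the example; 'allsplits' is the default 'PredictorSelection' value, and the results are only indicative.

```matlab
% Sketch: the same bagged ensemble, but with standard CART
% ('PredictorSelection','allsplits' is the default).
tCART = templateTree('PredictorSelection','allsplits','Surrogate','on');
rng(1); % For reproducibility
MdlCART = fitrensemble(X,MPG,'Method','Bag','NumLearningCycles',200, ...
    'Learners',tCART);
impCART = oobPermutedPredictorImportance(MdlCART);
% Importance estimates from standard CART tend to favor continuous
% predictors (many levels) over categorical ones such as Cylinders.
```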
Train Bagged Ensemble of Regression Trees
Train a bagged ensemble of 200 regression trees to estimate predictor importance values. Define a tree learner using these name-value pair arguments:
'NumVariablesToSample','all' - Use all predictor variables at each node to ensure that each tree uses all predictor variables.
'PredictorSelection','interaction-curvature' - Specify usage of the interaction test to select split predictors.
'Surrogate','on' - Specify usage of surrogate splits to increase accuracy because the data set includes missing values.

t = templateTree('NumVariablesToSample','all',...
    'PredictorSelection','interaction-curvature','Surrogate','on');
rng(1); % For reproducibility
Mdl = fitrensemble(X,MPG,'Method','Bag','NumLearningCycles',200,...
    'Learners',t);
Mdl is a RegressionBaggedEnsemble model.
Estimate the model R² using out-of-bag predictions.

yHat = oobPredict(Mdl);
R2 = corr(Mdl.Y,yHat)^2

R2 = 0.8744
Mdl explains 87% of the variability around the mean.
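As an alternative check (a sketch added here, not part of the original example), the out-of-bag mean squared error is available directly from oobLoss:

```matlab
% Out-of-bag MSE of the ensemble; smaller values indicate a better fit.
mseOOB = oobLoss(Mdl);
```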
Predictor Importance Estimation
Estimate predictor importance values by permuting out-of-bag observations among the trees.
impOOB = oobPermutedPredictorImportance(Mdl);
impOOB is a 1-by-7 vector of predictor importance estimates corresponding to the predictors in Mdl.PredictorNames. The estimates are not biased toward predictors containing many levels.
Compare the predictor importance estimates.
figure
bar(impOOB)
title('Unbiased Predictor Importance Estimates')
xlabel('Predictor variable')
ylabel('Importance')
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
Greater importance estimates indicate more important predictors. The bar graph suggests that Model_Year is the most important predictor, followed by Cylinders and Weight. The Model_Year and Cylinders variables have only 13 and 5 distinct levels, respectively, whereas the Weight variable has over 300 levels.
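To rank the predictors programmatically rather than reading the bar graph, one possible sketch (an addition to the example, using the impOOB vector from the previous step) is:

```matlab
% Sort the out-of-bag permuted importance estimates in descending order.
[~,idx] = sort(impOOB,'descend');
rankedPredictors = Mdl.PredictorNames(idx) % most important first
```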
Compare the predictor importance estimates obtained by permuting out-of-bag observations with the estimates obtained by summing gains in the mean squared error due to splits on each predictor. Also, obtain predictor association measures estimated by surrogate splits.
[impGain,predAssociation] = predictorImportance(Mdl);
figure
plot(1:numel(Mdl.PredictorNames),[impOOB' impGain'])
title('Predictor Importance Estimation Comparison')
xlabel('Predictor variable')
ylabel('Importance')
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
legend('OOB permuted','MSE improvement')
grid on
According to the values of impGain, the variables Displacement, Horsepower, and Weight appear to be equally important.
predAssociation is a 7-by-7 matrix of predictor association measures. Rows and columns correspond to the predictors in Mdl.PredictorNames. The predictive measure of association is a value that indicates the similarity between decision rules that split observations. The best surrogate decision split yields the maximum predictive measure of association. You can infer the strength of the relationship between pairs of predictors by using the elements of predAssociation. Larger values indicate more highly correlated pairs of predictors.
figure
imagesc(predAssociation)
title('Predictor Association Estimates')
colorbar
h = gca;
h.XTickLabel = Mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';
h.YTickLabel = Mdl.PredictorNames;
predAssociation(1,2)
ans = 0.6871
The largest association is between Cylinders and Displacement, but the value is not high enough to indicate a strong relationship between the two predictors.
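To locate the most associated pair of predictors programmatically instead of inspecting the heat map, a possible sketch (added here, assuming the diagonal of predAssociation equals 1) is:

```matlab
% Zero the diagonal (a predictor's association with itself), then
% find the largest off-diagonal entry.
A = predAssociation - diag(diag(predAssociation));
[~,linIdx] = max(A(:));
[i,j] = ind2sub(size(A),linIdx);
mostAssociatedPair = Mdl.PredictorNames([i j])
```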
Grow Random Forest Using Reduced Set of Predictors
Because prediction time increases with the number of predictors in random forests, a good practice is to create a model using as few predictors as possible.
Grow a random forest of 200 regression trees using the best two predictors only. The default 'NumVariablesToSample' value of templateTree is one third of the number of predictors for regression, so fitrensemble uses the random forest algorithm.

t = templateTree('PredictorSelection','interaction-curvature','Surrogate','on',...
    'Reproducible',true); % For reproducibility of random predictor selections
MdlReduced = fitrensemble(X(:,{'Model_Year' 'Weight'}),MPG,'Method','Bag',...
    'NumLearningCycles',200,'Learners',t);
Compute the R² of the reduced model.

yHatReduced = oobPredict(MdlReduced);
r2Reduced = corr(Mdl.Y,yHatReduced)^2

r2Reduced = 0.8653
The R² of the reduced model is close to the R² of the full model. This result suggests that the reduced model is sufficient for prediction.
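To quantify the prediction-time benefit mentioned earlier, you can time both models with timeit. This sketch is an addition to the original example, and the measured ratio depends on your hardware:

```matlab
% Median execution time of prediction for the full and reduced models.
tFull = timeit(@() predict(Mdl,X));
tReduced = timeit(@() predict(MdlReduced,X(:,{'Model_Year' 'Weight'})));
speedup = tFull/tReduced % ratio of full to reduced prediction time
```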
See Also
templateTree | fitrensemble | oobPredict | oobPermutedPredictorImportance | predictorImportance | corr