
Framework for Ensemble Learning

Using various methods, you can meld results from many weak learners into one high-quality ensemble predictor. These methods closely follow the same syntax, so you can try different methods with only minor changes in your commands.

You can create an ensemble for classification by using fitcensemble or for regression by using fitrensemble.

To train a classification ensemble with fitcensemble, use this syntax.

ens = fitcensemble(X,Y,Name,Value)
  • X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

  • Y is the vector of responses, with the same number of observations as the rows of X.

  • Name,Value specify additional options using one or more name-value pair arguments. For example, you can specify the ensemble aggregation method with the 'Method' argument, the number of ensemble learning cycles with the 'NumLearningCycles' argument, and the type of weak learners with the 'Learners' argument. For a complete list of name-value pair arguments, see the fitcensemble function page.
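A minimal end-to-end sketch of this syntax, assuming the ionosphere sample data set that ships with Statistics and Machine Learning Toolbox is available:

```matlab
% Load sample data: X is a 351-by-34 numeric matrix of predictors,
% Y is a cell array of class labels ('b' or 'g').
load ionosphere

% Train a boosted ensemble of 100 classification trees, setting the
% aggregation method, number of learning cycles, and learner type
% through name-value pair arguments.
ens = fitcensemble(X,Y,'Method','AdaBoostM1', ...
    'NumLearningCycles',100,'Learners','tree');

% Resubstitution (training-set) misclassification rate.
trainLoss = resubLoss(ens);
```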

This figure shows the information you need to create a classification ensemble.

Similarly, you can train an ensemble of regression learners by using fitrensemble, which follows the same syntax as fitcensemble. For details on the input arguments and name-value pair arguments, see the fitrensemble function page.

For all classification or nonlinear regression problems, follow these steps to create an ensemble:

Prepare the Predictor Data

All supervised learning methods start with predictor data, usually called X in this documentation. X can be stored in a matrix or a table. Each row of X represents one observation, and each column of X represents one variable or predictor.

Prepare the Response Data

You can use a wide variety of data types for the response data.

  • For regression ensembles, Y must be a numeric vector with the same number of elements as the number of rows of X.

  • For classification ensembles, Y can be a numeric vector, categorical vector, character array, string array, cell array of character vectors, or logical vector.

    For example, suppose your response data consists of three observations in this order: true, false, true. You could express Y as:

    • [1;0;1] (numeric vector)

    • categorical({'true','false','true'}) (categorical vector)

    • [true;false;true] (logical vector)

    • ['true ';'false';'true '] (character array, padded with spaces so each row has the same length)

    • ["true","false","true"] (string array)

    • {'true','false','true'} (cell array of character vectors)

    Use whichever data type is most convenient. Because you cannot represent missing values with logical entries, do not use logical entries when you have missing values in Y.
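The listed representations can be constructed directly; for example:

```matlab
% Equivalent encodings of the responses true, false, true.
Ynum  = [1; 0; 1];                             % numeric vector
Ylog  = [true; false; true];                   % logical vector
Ycat  = categorical({'true';'false';'true'});  % categorical vector
Ychar = ['true '; 'false'; 'true '];           % character array (space padded)
Ystr  = ["true"; "false"; "true"];             % string array
Ycell = {'true'; 'false'; 'true'};             % cell array of character vectors
```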

fitcensemble and fitrensemble ignore missing values in Y when creating an ensemble. This table shows the entries that represent missing values for each data type.

Data Type                        Missing Entry
Numeric vector                   NaN
Categorical vector               <undefined>
Character array                  Row of spaces
String array                     <missing> or ""
Cell array of character vectors  ''
Logical vector                   (not possible to represent)

Choose an Applicable Ensemble Aggregation Method

To create classification and regression ensembles with fitcensemble and fitrensemble, respectively, choose appropriate algorithms from this list.

  • For classification with two classes:

    • 'AdaBoostM1'

    • 'LogitBoost'

    • 'GentleBoost'

    • 'RobustBoost' (requires Optimization Toolbox™)

    • 'LPBoost' (requires Optimization Toolbox)

    • 'TotalBoost' (requires Optimization Toolbox)

    • 'RUSBoost'

    • 'Subspace'

    • 'Bag'

  • For classification with three or more classes:

    • 'AdaBoostM2'

    • 'LPBoost' (requires Optimization Toolbox)

    • 'TotalBoost' (requires Optimization Toolbox)

    • 'RUSBoost'

    • 'Subspace'

    • 'Bag'

  • For regression:

    • 'LSBoost'

    • 'Bag'

For descriptions of the various algorithms, see Ensemble Algorithms.

See Suggestions for Choosing an Appropriate Ensemble Algorithm.

This table lists characteristics of the various algorithms. In the table headings:

  • Imbalance — Good for imbalanced data (one class has many more observations than the other)

  • Stop — Algorithm self-terminates

  • Sparse — Requires fewer weak learners than other ensemble algorithms

Algorithm     Regression   Binary Classification   Multiclass Classification   Class Imbalance   Stop   Sparse
Bag           ×            ×                       ×
AdaBoostM1                 ×
AdaBoostM2                                         ×
LogitBoost                 ×
GentleBoost                ×
RobustBoost                ×
LPBoost                    ×                       ×                                             ×      ×
TotalBoost                 ×                       ×                                             ×      ×
RUSBoost                   ×                       ×                           ×
LSBoost       ×
Subspace                   ×                       ×

RobustBoost, LPBoost, and TotalBoost require an Optimization Toolbox license. Try TotalBoost before LPBoost, as TotalBoost can be more robust.
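Because the fit functions share a common syntax, switching aggregation methods is a one-argument change. A sketch using the fisheriris sample data (three classes, so multiclass methods apply):

```matlab
load fisheriris   % meas: 150-by-4 predictors, species: class labels

% Boosted multiclass ensemble.
ensBoost = fitcensemble(meas,species,'Method','AdaBoostM2');

% Bagged ensemble of the same data; only the 'Method' value changes.
ensBag = fitcensemble(meas,species,'Method','Bag');
```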

Suggestions for Choosing an Appropriate Ensemble Algorithm

  • Regression — Your choices are LSBoost or Bag. See General Characteristics of Ensemble Algorithms for the main differences between boosting and bagging.

  • Binary Classification — Try AdaBoostM1 first, with these modifications:

    Data Characteristic                                    Recommended Algorithm
    Many predictors                                        Subspace
    Skewed data (many more observations of one class)      RUSBoost
    Label noise (some training data has the wrong class)   RobustBoost
    Many observations                                      Avoid LPBoost and TotalBoost
  • Multiclass Classification — Try AdaBoostM2 first, with these modifications:

    Data Characteristic                                    Recommended Algorithm
    Many predictors                                        Subspace
    Skewed data (many more observations of one class)      RUSBoost
    Many observations                                      Avoid LPBoost and TotalBoost

For details of the algorithms, see Ensemble Algorithms.

General Characteristics of Ensemble Algorithms

  • Boost algorithms generally use very shallow trees. This construction uses relatively little time or memory. However, for effective predictions, boosted trees might need more ensemble members than bagged trees. Therefore it is not always clear which class of algorithms is superior.

  • Bag generally constructs deep trees. This construction is both time consuming and memory-intensive. It also leads to relatively slow predictions.

  • Bag can estimate the generalization error without additional cross validation. See oobLoss.

  • Except for Subspace, all boosting and bagging algorithms are based on decision tree learners. Subspace can use either discriminant analysis or k-nearest neighbor learners.

For details on the characteristics of the individual ensemble members, see Characteristics of Classification Algorithms.
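The out-of-bag error estimate mentioned above might be computed like this (a sketch using the carsmall sample data):

```matlab
load carsmall
X = [Horsepower Weight];   % two numeric predictors

% Bagging enables out-of-bag estimates of the generalization error.
ens = fitrensemble(X,MPG,'Method','Bag');

% No separate cross-validation step is needed.
oobErr = oobLoss(ens);
```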

Set the Number of Ensemble Members

Choosing the size of an ensemble involves balancing speed and accuracy.

  • Larger ensembles take longer to train and to generate predictions.

  • Some ensemble algorithms can become overtrained (inaccurate) when too large.

To set an appropriate size, consider starting with several dozen to several hundred members in an ensemble, training the ensemble, and then checking the ensemble quality, as in Test Ensemble Quality. If it seems that you need more members, add them by using the resume method (classification) or the resume method (regression). Repeat until adding more members does not improve ensemble quality.
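A sketch of this grow-and-check loop, assuming the ionosphere sample data:

```matlab
load ionosphere

% Start with a modest ensemble and measure its quality.
ens = fitcensemble(X,Y,'Method','AdaBoostM1','NumLearningCycles',50);
lossBefore = resubLoss(ens);

% Add 50 more weak learners without retraining from scratch,
% then check whether the quality improved.
ens = resume(ens,50);
lossAfter = resubLoss(ens);
```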

Tip

For classification, the LPBoost and TotalBoost algorithms are self-terminating, meaning you do not have to investigate the appropriate ensemble size. Try setting NumLearningCycles to 500. The algorithms usually terminate with fewer members.

Prepare the Weak Learners

Currently the weak learner types are:

  • 'Discriminant' (recommended for Subspace ensembles)

  • 'KNN' (only for Subspace ensembles)

  • 'Tree' (for any ensemble except Subspace)

There are two ways to set the weak learner type in an ensemble.

  • To create an ensemble with default weak learner options, specify the value of the 'Learners' name-value pair argument as a character vector or string scalar of the weak learner name. For example:

    ens = fitcensemble(X,Y,'Method','Subspace', ...
        'NumLearningCycles',50,'Learners','KNN');
    % or
    ens = fitrensemble(X,Y,'Method','Bag', ...
        'NumLearningCycles',50,'Learners','Tree');
  • To create an ensemble with nondefault weak learner options, create a nondefault weak learner using the appropriate template method.

    For example, if you have missing data and want to use classification trees with surrogate splits for better accuracy:

    templ = templateTree('Surrogate','all');
    ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
        'NumLearningCycles',50,'Learners',templ);

    To grow trees with leaves containing a number of observations that is at least 10% of the sample size:

    templ = templateTree('MinLeafSize',size(X,1)/10);
    ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
        'NumLearningCycles',50,'Learners',templ);

    Alternatively, choose the maximal number of splits per tree:

    templ = templateTree('MaxNumSplits',4);
    ens = fitcensemble(X,Y,'Method','AdaBoostM2', ...
        'NumLearningCycles',50,'Learners',templ);

    You can also use nondefault weak learners in fitrensemble.

While you can give fitcensemble and fitrensemble a cell array of learner templates, the most common usage is to give just one weak learner template.

For examples using templates, see Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles and Surrogate Splits.

Decision trees can handle NaN values in X. Such values are called "missing". If you have some missing values in a row of X, a decision tree finds the best split using the nonmissing values only. If an entire row consists of NaN, fitcensemble and fitrensemble ignore that row. If you have data with a large fraction of missing values in X, use surrogate decision splits. For examples of surrogate splits, see Handle Imbalanced Data or Unequal Misclassification Costs in Classification Ensembles and Surrogate Splits.

Common Settings for Tree Weak Learners

  • The depth of a weak learner tree makes a difference for training time, memory usage, and predictive accuracy. You control the depth with these parameters:

    • MaxNumSplits — The maximal number of branch node splits is MaxNumSplits per tree. Set large values of MaxNumSplits to get deep trees. The default for bagging is size(X,1) - 1. The default for boosting is 1.

    • MinLeafSize — Each leaf has at least MinLeafSize observations. Set small values of MinLeafSize to get deep trees. The default is 1 for classification and 5 for regression.

    • MinParentSize — Each branch node in the tree has at least MinParentSize observations. Set small values of MinParentSize to get deep trees. The default is 2 for classification and 10 for regression.

    If you supply both MinParentSize and MinLeafSize, the learner uses the setting that gives larger leaves (shallower trees):

    MinParent = max(MinParent,2*MinLeaf)

    If you additionally supply MaxNumSplits, then the software splits a tree until one of the three splitting criteria is satisfied.

  • Surrogate — Grow decision trees with surrogate splits when Surrogate is 'on'. Use surrogate splits when your data has missing values.

    Note

    Surrogate splits cause slower training and use more memory.

  • PredictorSelection — fitcensemble, fitrensemble, and TreeBagger grow trees using the standard CART algorithm [11] by default. If the predictor variables are heterogeneous, or there are predictors having many levels and others having few levels, then standard CART tends to select predictors having many levels as split predictors. For split-predictor selection that is robust to the number of levels that the predictors have, consider specifying 'curvature' or 'interaction-curvature'. These specifications conduct chi-square tests of association between each predictor and the response or each pair of predictors and the response, respectively. The predictor that yields the minimal p-value is the split predictor for a particular node. For more details, see Choose Split Predictor Selection Technique.

    Note

    When boosting decision trees, selecting split predictors using the curvature or interaction tests is not recommended.
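A sketch of specifying curvature-based split-predictor selection (paired with bagging, in line with the note above), using the carsmall sample data:

```matlab
load carsmall
Tbl = table(Cylinders,Displacement,Horsepower,Weight,MPG);

% Curvature test for split-predictor selection, robust to
% predictors with differing numbers of levels.
t = templateTree('PredictorSelection','curvature');
ens = fitrensemble(Tbl,'MPG','Method','Bag','Learners',t);
```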

Call fitcensemble or fitrensemble

The syntaxes of fitcensemble and fitrensemble are identical. For fitrensemble, the syntax is:

ens = fitrensemble(X,Y,Name,Value)
  • X is the matrix of data. Each row contains one observation, and each column contains one predictor variable.

  • Y is the vector of responses, with the same number of observations as the rows of X.

  • Name,Value specify additional options using one or more name-value pair arguments. For example, you can specify the ensemble aggregation method with the 'Method' argument, the number of ensemble learning cycles with the 'NumLearningCycles' argument, and the type of weak learners with the 'Learners' argument. For a complete list of name-value pair arguments, see the fitrensemble function page.

The result of fitrensemble and fitcensemble is an ensemble object, suitable for making predictions on new data. For a basic example of creating a regression ensemble, see Train Regression Ensemble. For a basic example of creating a classification ensemble, see Train Classification Ensemble.
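Once trained, the ensemble object makes predictions through predict; a sketch with the carsmall sample data:

```matlab
load carsmall
X = [Horsepower Weight];
ens = fitrensemble(X,MPG,'Method','LSBoost','NumLearningCycles',100);

% Predicted MPG for a new car with 150 hp that weighs 3000 lb.
yPred = predict(ens,[150 3000]);
```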

Where to Set Name-Value Pairs

There are several name-value pairs you can pass to fitcensemble or fitrensemble, and several that apply to the weak learners (templateDiscriminant, templateKNN, and templateTree). To determine whether a name-value pair argument belongs to the ensemble or to the weak learner:

  • Use template name-value pairs to control the characteristics of the weak learners.

  • Use fitcensemble or fitrensemble name-value pair arguments to control the ensemble as a whole, either for algorithms or for structure.

For example, for an ensemble of boosted classification trees with each tree deeper than the default, set the templateTree name-value pair arguments MinLeafSize and MinParentSize to smaller values than the defaults, or set MaxNumSplits to a larger value than the default. The trees are then leafier (deeper).
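The adjustment described above might look like this (the specific values are illustrative only), assuming the ionosphere sample data:

```matlab
load ionosphere

% Deeper-than-default boosted trees: raise MaxNumSplits (the boosting
% default is 1) and keep the leaf and parent sizes at their minimums.
t = templateTree('MaxNumSplits',20,'MinLeafSize',1,'MinParentSize',2);
ens = fitcensemble(X,Y,'Method','AdaBoostM1', ...
    'NumLearningCycles',100,'Learners',t);
```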

To name the predictors in a classification ensemble (part of the structure of the ensemble), use the PredictorNames name-value pair in fitcensemble.

See Also


Related Topics