anova

Analysis of variance for linear regression model

Description

tbl = anova(mdl) returns a table with component ANOVA statistics.

tbl = anova(mdl,anovatype) returns ANOVA statistics of the specified type anovatype. For example, specify anovatype as 'component' (default) to return a table with component ANOVA statistics, or specify anovatype as 'summary' to return a table with summary ANOVA statistics.

tbl = anova(mdl,'component',sstype) computes component ANOVA statistics using the specified type of sum of squares.

Examples

Component ANOVA Table

Create a component ANOVA table from a linear regression model of the hospital data set.

Load the hospital data set and create a model of blood pressure as a function of age and gender.

load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')

mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE         tStat       pValue  
                   _________    ________    ________    _________

    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772

Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125

Create an ANOVA table of the model.

tbl = anova(mdl)
tbl=4×5 table
             SumSq     DF    MeanSq       F        pValue 
             ______    __    ______    _______    ________

    Age      18.705     1    18.705    0.40055     0.52831
    Sex      222.09     1    222.09     4.7558    0.031643
    Age^2    30.934     1    30.934    0.66242     0.41772
    Error    4483.1    96    46.699

The table displays the following columns for each term except the constant (intercept) term:

  • SumSq — Sum of squares explained by the term.

  • DF — Degrees of freedom. In this example, DF is 1 for each term in the model and n – p for the error term, where n is the number of observations and p is the number of coefficients (including the intercept) in the model. For example, the DF for the error term in this model is 100 – 4 = 96. If any variable in the model is a categorical variable, the DF for that variable is the number of indicator variables created for its categories (number of categories – 1).

  • MeanSq — Mean square, defined by MeanSq = SumSq/DF. For example, the mean square of the error term, mean squared error (MSE), is 4.4831e+03/96 = 46.6991.

  • F — F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by F = MeanSq/MSE, where MSE is the mean squared error. When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the DF value for the corresponding term, and the denominator degrees of freedom is n – p. In this example, each F-statistic follows an F(1,96)-distribution.

  • pValue — p-value of the F-statistic value. For example, the p-value for Age is 0.5283, implying that Age is not significant at the 5% significance level given the other terms in the model.
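The column definitions above can be checked by hand. The following is a minimal sketch that recomputes the F-statistic and p-value for the Sex term from the rounded values displayed in the table; the variable names are illustrative, and it assumes that fcdf from Statistics and Machine Learning Toolbox is available.

```matlab
% Values copied from the component ANOVA table above (rounded as displayed).
n = 100;                     % number of observations
p = 4;                       % number of coefficients, including the intercept
dfError = n - p;             % error degrees of freedom (96)

mse   = 4483.1/dfError;      % MeanSq of the Error row, about 46.699
msSex = 222.09/1;            % MeanSq of the Sex row (SumSq/DF)

F = msSex/mse;                        % F = MeanSq/MSE
pValue = fcdf(F,1,dfError,'upper');   % upper tail of the F(1,96) distribution

fprintf('F = %.4f, p-value = %.4f\n',F,pValue)
```

Because the inputs are rounded display values, the results agree with the table only to about three or four significant digits.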

Summary ANOVA Table

Create a summary ANOVA table from a linear regression model of the hospital data set.

Load the hospital data set and create a model of blood pressure as a function of age and gender.

load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')

mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE         tStat       pValue  
                   _________    ________    ________    _________

    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772

Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125

Create a summary ANOVA table of the model.

tbl = anova(mdl,'summary')
tbl=7×5 table
                     SumSq     DF    MeanSq       F        pValue 
                     ______    __    ______    _______    ________

    Total            4757.8    99    48.059
    Model            274.73     3    91.577      1.961     0.12501
    . Linear          243.8     2     121.9     2.6103    0.078726
    . Nonlinear      30.934     1    30.934    0.66242     0.41772
    Residual         4483.1    96    46.699
    . Lack of fit    1483.1    39    38.028    0.72253     0.85732
    . Pure error       3000    57    52.632

The table displays tests for groups of terms: Total, Model, and Residual.

  • Total — This row shows the total sum of squares (SumSq), degrees of freedom (DF), and the mean square (MeanSq). Note that MeanSq = SumSq/DF.

  • Model — This row includes SumSq, DF, MeanSq, F-statistic value (F), and p-value (pValue). Because this model includes a nonlinear term (Age^2), anova partitions the sum of squares (SumSq) of Model into two parts: SumSq explained by the linear terms (Age and Sex) and SumSq explained by the nonlinear term (Age^2). The corresponding F-statistic values test the significance of the linear terms and the nonlinear term as separate groups. The nonlinear group consists of the Age^2 term only, so it has the same p-value as the Age^2 term in the Component ANOVA Table.

  • Residual — This row includes SumSq, DF, MeanSq, F, and pValue. Because the data set includes replications, anova partitions the residual SumSq into the part for the replications (Pure error) and the rest (Lack of fit). To test the lack of fit, anova computes the F-statistic value by comparing the model residuals to the model-free variance estimate computed on the replications. The F-statistic value shows no evidence of lack of fit.
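The partitions described above can be verified with simple arithmetic. This sketch uses only the rounded values displayed in the summary table, so the sums match to display precision rather than exactly; the variable names are illustrative, and fcdf is assumed to be available from Statistics and Machine Learning Toolbox.

```matlab
% Values copied from the summary ANOVA table above (rounded as displayed).
ssLinear = 243.8;    ssNonlinear = 30.934;
ssLack   = 1483.1;   ssPure      = 3000;
dfLack   = 39;       dfPure      = 57;

ssModel    = ssLinear + ssNonlinear;   % about 274.73 (Model row)
ssResidual = ssLack + ssPure;          % 4483.1 (Residual row)
ssTotal    = ssModel + ssResidual;     % about 4757.8 (Total row)

% Lack-of-fit test: ratio of lack-of-fit MeanSq to pure error MeanSq.
Flack = (ssLack/dfLack)/(ssPure/dfPure);
pLack = fcdf(Flack,dfLack,dfPure,'upper');

fprintf('Total SumSq = %.1f, lack-of-fit F = %.4f, p-value = %.3f\n', ...
    ssTotal,Flack,pLack)
```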

Fit a linear regression model that contains a categorical predictor. Reorder the categories of the categorical predictor to control the reference level in the model. Then, use anova to test the significance of the categorical variable.

Model with Categorical Predictor

Load the carsmall data set and create a linear regression model of MPG as a function of Model_Year. To treat the numeric vector Model_Year as a categorical variable, identify the predictor using the 'CategoricalVars' name-value pair argument.

load carsmall
mdl = fitlm(Model_Year,MPG,'CategoricalVars',1,'VarNames',{'Model_Year','MPG'})

mdl = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE      tStat       pValue  
                     ________    ______    ______    __________

    (Intercept)        17.69     1.0328    17.127    3.2371e-30
    Model_Year_76     3.8839     1.4059    2.7625     0.0069402
    Model_Year_82      14.02     1.4369    9.7571    8.2164e-16

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15

The model formula in the display, MPG ~ 1 + Model_Year, corresponds to

$$\mathrm{MPG} = \beta_0 + \beta_1 I_{\mathrm{Year}=76} + \beta_2 I_{\mathrm{Year}=82} + \epsilon,$$

where $I_{\mathrm{Year}=76}$ and $I_{\mathrm{Year}=82}$ are indicator variables whose value is one if the value of Model_Year is 76 and 82, respectively. The Model_Year variable includes three distinct values, which you can check by using the unique function.

unique(Model_Year)
ans = 3×1

    70
    76
    82

fitlm chooses the smallest value in Model_Year as a reference level ('70') and creates two indicator variables $I_{\mathrm{Year}=76}$ and $I_{\mathrm{Year}=82}$. The model includes only two indicator variables because the design matrix becomes rank deficient if the model includes three indicator variables (one for each level) and an intercept term.

Model with Full Indicator Variables

You can interpret the model formula of mdl as a model that has three indicator variables without an intercept term:

$$y = \beta_0 I_{x_1=70} + (\beta_0+\beta_1) I_{x_1=76} + (\beta_0+\beta_2) I_{x_1=82} + \epsilon.$$

Alternatively, you can create a model that has three indicator variables without an intercept term by manually creating indicator variables and specifying the model formula.

temp_Year = dummyvar(categorical(Model_Year));
Model_Year_70 = temp_Year(:,1);
Model_Year_76 = temp_Year(:,2);
Model_Year_82 = temp_Year(:,3);
tbl = table(Model_Year_70,Model_Year_76,Model_Year_82,MPG);
mdl = fitlm(tbl,'MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82 - 1')

mdl = 
Linear regression model:
    MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82

Estimated Coefficients:
                     Estimate      SE        tStat       pValue  
                     ________    _______    ______    __________

    Model_Year_70      17.69      1.0328    17.127    3.2371e-30
    Model_Year_76     21.574     0.95387    22.617    4.0156e-39
    Model_Year_82      31.71     0.99896    31.743    5.2234e-51

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
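The two parameterizations describe the same fit: each coefficient of the no-intercept model equals the intercept plus the matching indicator coefficient of the reference-level model (for example, 17.69 + 3.8839 ≈ 21.574), and both equal the sample mean of MPG within each model year. The following sketch checks this numerically; the variable names (mdlRef, levelMeans, sampleMeans) are illustrative and not part of the original example.

```matlab
load carsmall

% Reference-level model: intercept plus indicators for 76 and 82.
mdlRef = fitlm(Model_Year,MPG,'CategoricalVars',1, ...
    'VarNames',{'Model_Year','MPG'});
b = mdlRef.Coefficients.Estimate;

% Level means implied by the reference-level coefficients.
levelMeans = [b(1); b(1) + b(2); b(1) + b(3)];

% Sample means of MPG per model year, skipping missing responses
% (fitlm also excludes rows with missing values).
valid = ~isnan(MPG);
G = findgroups(Model_Year(valid));       % groups in sorted order: 70, 76, 82
sampleMeans = splitapply(@mean,MPG(valid),G);

disp([levelMeans sampleMeans])
```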

Choose Reference Level in Model

You can choose a reference level by modifying the order of categories in a categorical variable. First, create a categorical variable Year.

Year = categorical(Model_Year);

Check the order of categories by using the categories function.

categories(Year)
ans = 3x1 cell
    {'70'}
    {'76'}
    {'82'}

If you use Year as a predictor variable, then fitlm chooses the first category '70' as a reference level. Reorder Year by using the reordercats function.

Year_reordered = reordercats(Year,{'76','70','82'});
categories(Year_reordered)
ans = 3x1 cell
    {'76'}
    {'70'}
    {'82'}

The first category of Year_reordered is '76'. Create a linear regression model of MPG as a function of Year_reordered.

mdl2 = fitlm(Year_reordered,MPG,'VarNames',{'Model_Year','MPG'})

mdl2 = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE        tStat        pValue  
                     ________    _______    _______    __________

    (Intercept)       21.574     0.95387     22.617    4.0156e-39
    Model_Year_70    -3.8839      1.4059    -2.7625     0.0069402
    Model_Year_82     10.136      1.3812     7.3385    8.7634e-11

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15

mdl2 uses '76' as a reference level and includes two indicator variables $I_{\mathrm{Year}=70}$ and $I_{\mathrm{Year}=82}$.

Evaluate Categorical Predictor

The model display of mdl2 includes a p-value for each term to test whether the corresponding coefficient is equal to zero. Each p-value examines one indicator variable. To examine the categorical variable Model_Year as a group of indicator variables, use anova. Use the 'component' (default) option to return a component ANOVA table that includes ANOVA statistics for each variable in the model except the constant term.

anova(mdl2,'component')
ans=2×5 table
                  SumSq     DF    MeanSq      F        pValue  
                  ______    __    ______    _____    __________

    Model_Year    3190.1     2    1595.1    51.56    1.0694e-15
    Error         2815.2    91    30.936

The component ANOVA table includes the p-value of the Model_Year variable, which is smaller than the p-values of the indicator variables.
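Because Model_Year is the only predictor in this model, the group test in the component ANOVA table should coincide with the overall F-test of the model against a constant model (F = 51.6 in the model display). As a hedged cross-check, the sketch below compares the anova result with coefTest, which, when called with only the model, tests that all coefficients except the intercept are zero. The variable names a, pAll, and FAll are illustrative.

```matlab
load carsmall

% Rebuild the model with '76' as the reference level.
Year_reordered = reordercats(categorical(Model_Year),{'76','70','82'});
mdl2 = fitlm(Year_reordered,MPG,'VarNames',{'Model_Year','MPG'});

a = anova(mdl2,'component');     % first row is the Model_Year term
[pAll,FAll] = coefTest(mdl2);    % joint test of both indicator coefficients

fprintf('anova F = %.4f, coefTest F = %.4f\n',a.F(1),FAll)
```

With more than one predictor the two tests generally differ, because anova then tests each variable separately while coefTest (in this calling form) tests all of them jointly.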

Input Arguments

mdl

Linear regression model object, specified as a LinearModel object created by using fitlm or stepwiselm, or a CompactLinearModel object created by using compact.

anovatype

ANOVA type, specified as one of these values:

  • 'component' — anova returns the table tbl with ANOVA statistics for each variable in the model except the constant term.

  • 'summary' — anova returns the table tbl with summary ANOVA statistics for grouped variables and the model as a whole.

For details, see the tbl output argument description.

sstype

Sum of squares type for each term, specified as one of the values in this table.

Value Description
1 Type 1 sum of squares — Reduction in residual sum of squares obtained by adding the term to a fit that already includes the preceding terms
2 Type 2 sum of squares — Reduction in residual sum of squares obtained by adding the term to a model that contains all other terms that do not contain the term in question
3 Type 3 sum of squares — Reduction in residual sum of squares obtained by adding the term to a model that contains all other terms, but with their effects constrained to obey the usual “sigma restrictions” that make models estimable
'h' Hierarchical model — Similar to Type 2, but uses both continuous and categorical factors to determine the hierarchy of terms

The sum of squares for any term is determined by comparing two models. For a model containing main effects but no interactions, the value of sstype influences the computations on unbalanced data only.

Suppose you are fitting a model with two factors and their interaction, and the terms appear in the order A, B, AB. Let R(·) represent the residual sum of squares for the model. So, R(A,B,AB) is the residual sum of squares fitting the whole model, R(A) is the residual sum of squares fitting the main effect of A only, and R(1) is the residual sum of squares fitting the mean only. The three sum of squares types are as follows:

Term    Type 1 Sum of Squares    Type 2 Sum of Squares    Type 3 Sum of Squares
A       R(1) – R(A)              R(B) – R(A,B)            R(B,AB) – R(A,B,AB)
B       R(A) – R(A,B)            R(A) – R(A,B)            R(A,AB) – R(A,B,AB)
AB      R(A,B) – R(A,B,AB)       R(A,B) – R(A,B,AB)       R(A,B) – R(A,B,AB)

The models for Type 3 sum of squares have sigma restrictions imposed. This means, for example, that in fitting R(B,AB), the array of AB effects is constrained to sum to 0 over A for each value of B, and over B for each value of A.

For Type 3 sum of squares:

  • If mdl is a CompactLinearModel object and the regression model is nonhierarchical, anova returns an error.

  • If mdl is a LinearModel object and the regression model is nonhierarchical, anova refits the model using effects coding whenever it needs to compute a Type 3 sum of squares.

  • If the regression model in mdl is hierarchical, anova computes the results without refitting the model.

sstype applies only if anovatype is 'component'.
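The model-comparison definition of the Type 1 sum of squares can be reproduced by fitting nested models and differencing their residual sums of squares. The sketch below does this for a two-term model of the hospital data, with Age playing the role of A and Sex the role of B; the main-effects-only formula and the variable names (m0, mA, ssAge, ssSex) are assumptions chosen for illustration, not part of the original examples.

```matlab
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);

% Full model with terms in the order Age, Sex.
mdl = fitlm(tbl,'BloodPressure ~ Age + Sex');
t1 = anova(mdl,'component',1);            % Type 1 (sequential) sums of squares

% Nested fits: R(1) is the SSE of the constant model, R(A) adds Age.
m0 = fitlm(tbl,'BloodPressure ~ 1');
mA = fitlm(tbl,'BloodPressure ~ Age');

ssAge = m0.SSE - mA.SSE;                  % R(1) - R(A)
ssSex = mA.SSE - mdl.SSE;                 % R(A) - R(A,B)

disp([ssAge t1{'Age','SumSq'}; ssSex t1{'Sex','SumSq'}])
```

A consequence of the sequential definition is that the Type 1 sums of squares of all terms add up to the regression sum of squares, mdl.SSR.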

Output Arguments

tbl

ANOVA summary statistics table, returned as a table.

The contents of tbl depend on the ANOVA type specified in anovatype.

  • If anovatype is 'component', then tbl contains ANOVA statistics for each variable in the model except the constant (intercept) term. The table includes these columns for each variable:

    Column Description
    SumSq

    Sum of squares explained by the term, computed depending on sstype

    DF

    Degrees of freedom

    • DF of a numeric variable is 1.

    • DF of a categorical variable is the number of indicator variables created for the category (number of categories – 1). Note that tbl contains one row for each categorical variable instead of one row for each indicator variable as in the model display. Use anova to test a categorical variable as a group of indicator variables.

    • DF of an error term is n – p, where n is the number of observations and p is the number of coefficients in the model.

    MeanSq

    Mean square, defined by MeanSq = SumSq/DF

    MeanSq for the error term is the mean squared error (MSE).

    F

    F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by F = MeanSq/MSE

    When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the DF value for the corresponding term, and the denominator degrees of freedom is n – p.

    pValue

    p-value of the F-statistic value

    For an example, see Component ANOVA Table.
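For a term with a single degree of freedom whose test matches the coefficient test, the F-statistic in the component table is the square of the t-statistic in the model display. In the hospital example, the Sex term has tStat = 2.1808 and F = 4.7558 ≈ 2.1808², and both displays report the p-value 0.031643. The sketch below checks this relationship; the variable names are illustrative.

```matlab
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2');

a = anova(mdl);                                   % component table (default)
tSex = mdl.Coefficients{'Sex_Male','tStat'};      % t-statistic of the indicator
FSex = a{'Sex','F'};                              % F-statistic of the Sex term

fprintf('tStat^2 = %.4f, F = %.4f\n',tSex^2,FSex)
```

The same identity does not hold for the Age row in that example (coefficient p-value 0.38736 versus 0.52831 in the component table), because the default sum of squares type handles a term differently when a higher-order term containing it (Age^2) is in the model.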

  • If anovatype is 'summary', then tbl contains summary statistics of grouped terms for each row. The table includes the same columns as 'component' and these rows:

    Row Description
    Total

    Total statistics

    • SumSq — Total sum of squares, which is the sum of the squared deviations of the response around its mean

    • DF — Sum of degrees of freedom of Model and Residual

    Model

    Statistics for the model as a whole

    • SumSq — Model sum of squares, which is the sum of the squared deviations of the fitted value around the response mean.

    • F and pValue — These values provide a test of whether the model as a whole fits significantly better than a degenerate model consisting of only a constant term.

    If mdl includes only linear terms, then anova does not decompose Model into Linear and Nonlinear.

    Linear

    Statistics for linear terms

    • SumSq — Sum of squares for linear terms, which is the difference between the model sum of squares and the sum of squares for nonlinear terms.

    • F and pValue — These values provide a test of whether the model with only linear terms fits better than a degenerate model consisting of only a constant term. anova uses the mean squared error that is based on the full model to compute this F-value, so the F-value obtained by dropping the nonlinear terms and repeating the test is not the same as the value in this row.

    Nonlinear

    Statistics for nonlinear terms

    • SumSq — Sum of squares for nonlinear (higher-order or interaction) terms, which is the increase in the residual sum of squares obtained by keeping only the linear terms and dropping all nonlinear terms.

    • F and pValue — These values provide a test of whether the full model fits significantly better than a smaller model consisting of only the linear terms.

    Residual

    Statistics for residuals

    • SumSq — Residual sum of squares, which is the sum of the squared residual values

    • MeanSq — Mean squared error, used to compute the F-statistic values for Model, Linear, and Nonlinear

    If mdl is a full LinearModel object and the sample data contains replications (multiple observations sharing the same predictor values), then anova decomposes the residual sum of squares into a sum of squares for the replicated observations (Pure error) and the remaining sum of squares (Lack of fit).

    Lack of fit

    Lack-of-fit statistics

    • SumSq — Sum of squares due to lack of fit, which is the difference between the residual sum of squares and the replication sum of squares.

    • F and pValue — The F-statistic value is the ratio of lack-of-fit MeanSq to pure error MeanSq. The ratio provides a test of bias by measuring whether the variation of the residuals is larger than the variation of the replications. A low p-value implies that adding additional terms to the model can improve the fit.

    Pure error

    Statistics for pure error

    • SumSq — Replication sum of squares, obtained by finding the sets of points with identical predictor values, computing the sum of squared deviations around the mean within each set, and pooling the computed values

    • MeanSq — Model-free pure error variance estimate of the response

    For an example, see Summary ANOVA Table.
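The pure error computation described above (group observations with identical predictor values, sum the squared deviations within each group, and pool) can be sketched directly. The code below is an illustrative reconstruction under that description, not the internal implementation of anova; it assumes the Pure error row is the last row of the summary table, as in the Summary ANOVA Table example, and the names G, ssWithin, ssPure, dfPure, and ssLack are hypothetical.

```matlab
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2');
s = anova(mdl,'summary');

% Sets of observations that share the same Age and Sex values.
G = findgroups(tbl.Age,tbl.Sex);

% Sum of squared deviations around the mean within each set, then pooled.
ssWithin = splitapply(@(y) sum((y - mean(y)).^2),tbl.BloodPressure,G);
ssPure = sum(ssWithin);
dfPure = height(tbl) - max(G);     % observations minus number of distinct sets

% Lack of fit is what remains of the residual sum of squares.
ssLack = mdl.SSE - ssPure;

fprintf('Pure error: SumSq = %.1f, DF = %d; lack of fit SumSq = %.1f\n', ...
    ssPure,dfPure,ssLack)
```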

Alternative Functionality

More complete ANOVA statistics are available in the anova1, anova2, and anovan functions.

Version History

Introduced in R2012a