anova
Analysis of variance for linear regression model
Description
Examples
Component ANOVA Table
Create a component ANOVA table from a linear regression model of the hospital data set.
Load the hospital data set and create a model of blood pressure as a function of age and gender.
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')

mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE        tStat       pValue  
                   _________    ________    ________    _________
    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772

Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125
Create an ANOVA table of the model.
tbl = anova(mdl)

tbl=4×5 table
             SumSq     DF    MeanSq       F        pValue 
             ______    __    ______    _______    ________
    Age      18.705     1    18.705    0.40055     0.52831
    Sex      222.09     1    222.09     4.7558    0.031643
    Age^2    30.934     1    30.934    0.66242     0.41772
    Error    4483.1    96    46.699                       
The table displays the following columns for each term except the constant (intercept) term:
- SumSq: Sum of squares explained by the term.
- DF: Degrees of freedom. In this example, DF is 1 for each term in the model and n – p for the error term, where n is the number of observations and p is the number of coefficients (including the intercept) in the model. For example, the DF for the error term in this model is 100 – 4 = 96. If any variable in the model is a categorical variable, the DF for that variable is the number of indicator variables created for its categories (number of categories – 1).
- MeanSq: Mean square, defined by MeanSq = SumSq/DF. For example, the mean square of the error term, the mean squared error (MSE), is 4.4831e+03/96 = 46.6991.
- F: F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by F = MeanSq/MSE, where MSE is the mean squared error. When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the DF value for the corresponding term, and the denominator degrees of freedom is n – p. In this example, each F-statistic follows an F(1,96)-distribution.
- pValue: p-value of the F-statistic value. For example, the p-value for Age is 0.5283, implying that Age is not significant at the 5% significance level given the other terms in the model.
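The F and pValue columns can be reproduced from the SumSq, DF, and MeanSq columns. The following is a minimal sketch, assuming Statistics and Machine Learning Toolbox is available; the variable names aov, mse, dfe, F, and p are illustrative and not part of the original example. It evaluates the upper tail of the F-distribution with fcdf.

```matlab
% Refit the model from this example.
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2');

aov = anova(mdl);         % component ANOVA table; the last row is Error
mse = aov.MeanSq(end);    % mean squared error (MSE)
dfe = aov.DF(end);        % error degrees of freedom, n - p

% Recompute F = MeanSq/MSE and the upper-tail p-value for each term.
F = aov.MeanSq(1:end-1) ./ mse;
p = fcdf(F,aov.DF(1:end-1),dfe,'upper');

% Both differences should be at round-off level.
errF = max(abs(F - aov.F(1:end-1)))
errP = max(abs(p - aov.pValue(1:end-1)))
```

If the recomputed values agree with the table, the p-values are simply upper-tail probabilities of an F-distribution with the degrees of freedom described above.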
Summary ANOVA Table
Create a summary ANOVA table from a linear regression model of the hospital data set.
Load the hospital data set and create a model of blood pressure as a function of age and gender.
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2')

mdl = 
Linear regression model:
    BloodPressure ~ 1 + Age + Sex + Age^2

Estimated Coefficients:
                   Estimate        SE        tStat       pValue  
                   _________    ________    ________    _________
    (Intercept)       63.942      19.194      3.3314    0.0012275
    Age              0.90673      1.0442     0.86837      0.38736
    Sex_Male          3.0019      1.3765      2.1808     0.031643
    Age^2          -0.011275    0.013853    -0.81389      0.41772

Number of observations: 100, Error degrees of freedom: 96
Root Mean Squared Error: 6.83
R-squared: 0.0577,  Adjusted R-Squared: 0.0283
F-statistic vs. constant model: 1.96, p-value = 0.125
Create a summary ANOVA table of the model.
tbl = anova(mdl,'summary')

tbl=7×5 table
                     SumSq     DF    MeanSq       F        pValue 
                     ______    __    ______    _______    ________
    Total            4757.8    99    48.059                       
    Model            274.73     3    91.577      1.961     0.12501
    . Linear          243.8     2     121.9     2.6103    0.078726
    . Nonlinear      30.934     1    30.934    0.66242     0.41772
    Residual         4483.1    96    46.699                       
    . Lack of fit    1483.1    39    38.028    0.72253     0.85732
    . Pure error       3000    57    52.632                       
The table displays tests for groups of terms: Total, Model, and Residual.
- Total: This row shows the total sum of squares (SumSq), degrees of freedom (DF), and the mean square (MeanSq). Note that MeanSq = SumSq/DF.
- Model: This row includes SumSq, DF, MeanSq, F-statistic value (F), and p-value (pValue). Because this model includes a nonlinear term (Age^2), anova partitions the sum of squares (SumSq) of Model into two parts: SumSq explained by the linear terms (Age and Sex) and SumSq explained by the nonlinear term (Age^2). The corresponding F-statistic values test the significance of the linear terms and the nonlinear term as separate groups. The nonlinear group consists of the Age^2 term only, so it has the same p-value as the Age^2 term in the Component ANOVA Table.
- Residual: This row includes SumSq, DF, MeanSq, F, and pValue. Because the data set includes replications, anova partitions the residual SumSq into the part for the replications (Pure error) and the rest (Lack of fit). To test the lack of fit, anova computes the F-statistic value by comparing the model residuals to the model-free variance estimate computed on the replications. The F-statistic value shows no evidence of lack of fit.
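The partitions described above can be checked numerically. This is a minimal sketch, assuming the rows of the summary table appear in the order displayed in this example (Total, Model, Linear, Nonlinear, Residual, Lack of fit, Pure error); it indexes rows by position, and the variable names s, gapTotal, gapModel, gapResid, Flof, and plof are illustrative.

```matlab
% Refit the model from this example.
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2');

s = anova(mdl,'summary');

% Each partition should sum to its parent row, up to round-off.
gapTotal = s.SumSq(2) + s.SumSq(5) - s.SumSq(1)   % Model + Residual - Total
gapModel = s.SumSq(3) + s.SumSq(4) - s.SumSq(2)   % Linear + Nonlinear - Model
gapResid = s.SumSq(6) + s.SumSq(7) - s.SumSq(5)   % Lack of fit + Pure error - Residual

% Lack-of-fit F-statistic: lack-of-fit mean square over pure error mean square.
Flof = s.MeanSq(6)/s.MeanSq(7)
plof = fcdf(Flof,s.DF(6),s.DF(7),'upper')
```

Flof and plof should agree with the F and pValue entries in the Lack of fit row of the table.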
Linear Regression with Categorical Predictor
Fit a linear regression model that contains a categorical predictor. Reorder the categories of the categorical predictor to control the reference level in the model. Then, use anova to test the significance of the categorical variable.
Model with Categorical Predictor
Load the carsmall data set and create a linear regression model of MPG as a function of Model_Year. To treat the numeric vector Model_Year as a categorical variable, identify the predictor using the 'CategoricalVars' name-value pair argument.
load carsmall
mdl = fitlm(Model_Year,MPG,'CategoricalVars',1,'VarNames',{'Model_Year','MPG'})

mdl = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE      tStat       pValue  
                     ________    ______    ______    __________
    (Intercept)        17.69     1.0328    17.127    3.2371e-30
    Model_Year_76     3.8839     1.4059    2.7625     0.0069402
    Model_Year_82      14.02     1.4369    9.7571    8.2164e-16

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15
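The model formula in the display, MPG ~ 1 + Model_Year, corresponds to

$$\mathrm{MPG} = \beta_0 + \beta_1 I_{\mathrm{Year}=76} + \beta_2 I_{\mathrm{Year}=82} + \epsilon,$$

where $I_{\mathrm{Year}=76}$ and $I_{\mathrm{Year}=82}$ are indicator variables whose value is one if the value of Model_Year is 76 and 82, respectively. The Model_Year variable includes three distinct values, which you can check by using the unique function.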
unique(Model_Year)

ans = 3×1

    70
    76
    82
fitlm chooses the smallest value in Model_Year as a reference level ('70') and creates two indicator variables, $I_{\mathrm{Year}=76}$ and $I_{\mathrm{Year}=82}$, one for each of the remaining levels. The model includes only two indicator variables because the design matrix becomes rank deficient if the model includes three indicator variables (one for each level) and an intercept term.
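The rank deficiency is easy to see directly: the three indicator columns sum to a column of ones, which duplicates the intercept column. The following is a small sketch, assuming Statistics and Machine Learning Toolbox is available for dummyvar; the names D, Xfull, Xref, rFull, and rRef are illustrative.

```matlab
load carsmall

% One indicator column per level of Model_Year (70, 76, 82).
D = dummyvar(categorical(Model_Year));
n = size(D,1);

% Intercept plus all three indicators: 4 columns.
Xfull = [ones(n,1) D];
rFull = rank(Xfull)

% Intercept plus indicators for 76 and 82 only: 3 columns.
Xref = [ones(n,1) D(:,2:3)];
rRef = rank(Xref)
```

Both matrices have rank 3, so the four-column design Xfull is rank deficient while the three-column design Xref has full column rank.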
Model with Full Indicator Variables
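You can interpret the model formula of mdl as a model that has three indicator variables without an intercept term:

$$\mathrm{MPG} = \beta_0 I_{\mathrm{Year}=70} + (\beta_0 + \beta_1) I_{\mathrm{Year}=76} + (\beta_0 + \beta_2) I_{\mathrm{Year}=82} + \epsilon.$$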
Alternatively, you can create a model that has three indicator variables without an intercept term by manually creating indicator variables and specifying the model formula.
temp_Year = dummyvar(categorical(Model_Year));
Model_Year_70 = temp_Year(:,1);
Model_Year_76 = temp_Year(:,2);
Model_Year_82 = temp_Year(:,3);
tbl = table(Model_Year_70,Model_Year_76,Model_Year_82,MPG);
mdl = fitlm(tbl,'MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82 - 1')

mdl = 
Linear regression model:
    MPG ~ Model_Year_70 + Model_Year_76 + Model_Year_82

Estimated Coefficients:
                     Estimate      SE        tStat       pValue  
                     ________    _______    ______    __________
    Model_Year_70      17.69      1.0328    17.127    3.2371e-30
    Model_Year_76     21.574     0.95387    22.617    4.0156e-39
    Model_Year_82      31.71     0.99896    31.743    5.2234e-51

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
Choose Reference Level in Model
You can choose a reference level by modifying the order of categories in a categorical variable. First, create a categorical variable Year.
Year = categorical(Model_Year);
Check the order of categories by using the categories function.
categories(Year)

ans = 3x1 cell
    {'70'}
    {'76'}
    {'82'}
If you use Year as a predictor variable, then fitlm chooses the first category '70' as a reference level. Reorder Year by using the reordercats function.
Year_reordered = reordercats(Year,{'76','70','82'});
categories(Year_reordered)

ans = 3x1 cell
    {'76'}
    {'70'}
    {'82'}
The first category of Year_reordered is '76'. Create a linear regression model of MPG as a function of Year_reordered.
mdl2 = fitlm(Year_reordered,MPG,'VarNames',{'Model_Year','MPG'})

mdl2 = 
Linear regression model:
    MPG ~ 1 + Model_Year

Estimated Coefficients:
                     Estimate      SE        tStat       pValue  
                     ________    _______    _______    __________
    (Intercept)       21.574     0.95387     22.617    4.0156e-39
    Model_Year_70    -3.8839      1.4059    -2.7625     0.0069402
    Model_Year_82     10.136      1.3812     7.3385    8.7634e-11

Number of observations: 94, Error degrees of freedom: 91
Root Mean Squared Error: 5.56
R-squared: 0.531,  Adjusted R-Squared: 0.521
F-statistic vs. constant model: 51.6, p-value = 1.07e-15
mdl2 uses '76' as a reference level and includes two indicator variables, $I_{\mathrm{Year}=70}$ and $I_{\mathrm{Year}=82}$.
Evaluate Categorical Predictor
The model display of mdl2 includes a p-value of each term to test whether or not the corresponding coefficient is equal to zero. Each p-value examines each indicator variable. To examine the categorical variable Model_Year as a group of indicator variables, use anova. Use the 'components' (default) option to return a component ANOVA table that includes ANOVA statistics for each variable in the model except the constant term.
anova(mdl2,'components')

ans=2×5 table
                  SumSq     DF    MeanSq      F        pValue  
                  ______    __    ______    _____    __________
    Model_Year    3190.1     2    1595.1    51.56    1.0694e-15
    Error         2815.2    91    30.936                       
The component ANOVA table includes the p-value of the Model_Year variable, which is smaller than the p-values of the indicator variables.
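Because Model_Year is the only predictor in this model, the group test in the component ANOVA table coincides with the overall F-test against a constant model shown in the model display (F = 51.6, p = 1.07e-15). A minimal sketch of that comparison follows, assuming coefTest is called with no hypothesis matrix so that it tests all coefficients except the intercept; the variable names aov, p, F, and r are illustrative.

```matlab
load carsmall

% Rebuild the model with '76' as the reference level.
Year_reordered = reordercats(categorical(Model_Year),{'76','70','82'});
mdl2 = fitlm(Year_reordered,MPG,'VarNames',{'Model_Year','MPG'});

% Group test of Model_Year from the component ANOVA table.
aov = anova(mdl2);

% Joint F-test that all coefficients except the intercept are zero.
[p,F,r] = coefTest(mdl2);

% Compare the two tests; r is the numerator degrees of freedom.
[aov.F(1) F]
[aov.pValue(1) p]
[aov.DF(1) r]
```

The two tests should agree here. In a model with additional predictors, the anova row for a categorical variable tests only that variable's indicator coefficients, so it generally differs from the overall F-test.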
Input Arguments
mdl: Linear regression model object
LinearModel object | CompactLinearModel object
Linear regression model object, specified as a LinearModel object created by using fitlm or stepwiselm, or a CompactLinearModel object created by using compact.
anovatype: ANOVA type
'component' (default) | 'summary'
ANOVA type, specified as one of these values:
- 'component': anova returns the table tbl with ANOVA statistics for each variable in the model except the constant term.
- 'summary': anova returns the table tbl with summary ANOVA statistics for grouped variables and the model as a whole.
For details, see the tbl output argument description.
sstype: Sum of squares type
'h' (default) | 1 | 2 | 3
Sum of squares type for each term, specified as one of the values in this table.
Value | Description |
---|---|
1 | Type 1 sum of squares: Reduction in residual sum of squares obtained by adding the term to a fit that already includes the preceding terms |
2 | Type 2 sum of squares: Reduction in residual sum of squares obtained by adding the term to a model that contains all other terms |
3 | Type 3 sum of squares: Reduction in residual sum of squares obtained by adding the term to a model that contains all other terms, but with their effects constrained to obey the usual “sigma restrictions” that make models estimable |
'h' | Hierarchical model: Similar to Type 2, but uses both continuous and categorical factors to determine the hierarchy of terms |
The sum of squares for any term is determined by comparing two models. For a model containing main effects but no interactions, the value of sstype influences the computations on unbalanced data only.
Suppose you are fitting a model with two factors and their interaction, and the terms appear in the order A, B, AB. Let R(·) represent the residual sum of squares for the model. So, R(A,B,AB) is the residual sum of squares fitting the whole model, R(A) is the residual sum of squares fitting the main effect of A only, and R(1) is the residual sum of squares fitting the mean only. The three sum of squares types are as follows:
Term | Type 1 Sum of Squares | Type 2 Sum of Squares | Type 3 Sum of Squares |
---|---|---|---|
A | R(1) – R(A) | R(B) – R(A,B) | R(B,AB) – R(A,B,AB) |
B | R(A) – R(A,B) | R(A) – R(A,B) | R(A,AB) – R(A,B,AB) |
AB | R(A,B) – R(A,B,AB) | R(A,B) – R(A,B,AB) | R(A,B) – R(A,B,AB) |
The models for Type 3 sum of squares have sigma restrictions imposed. This means, for example, that in fitting R(B,AB), the array of AB effects is constrained to sum to 0 over A for each value of B, and over B for each value of A.
For Type 3 sum of squares:
- If mdl is a CompactLinearModel object and the regression model is nonhierarchical, anova returns an error.
- If mdl is a LinearModel object and the regression model is nonhierarchical, anova refits the model using effects coding whenever it needs to compute a Type 3 sum of squares.
- If the regression model in mdl is hierarchical, anova computes the results without refitting the model.
sstype applies only if anovatype is 'component'.
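To see how sstype changes the results, you can request each type for a model with an interaction. The following is a sketch only: the formula 'BloodPressure ~ Sex*Age' is chosen for illustration and does not appear elsewhere on this page, and the code assumes the interaction is the last term before the Error row.

```matlab
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);

% Two main effects and their interaction (a hierarchical model).
mdl = fitlm(tbl,'BloodPressure ~ Sex*Age');

% Component ANOVA tables for each sum of squares type.
t1 = anova(mdl,'component',1)
t2 = anova(mdl,'component',2)
t3 = anova(mdl,'component',3)

% Sum of squares of the interaction term under each type.
ssInteraction = [t1.SumSq(end-1) t2.SumSq(end-1) t3.SumSq(end-1)]
```

Consistent with the table above, the interaction sum of squares should be the same for all three types, while the main-effect rows can differ because the data are not balanced.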
Output Arguments
tbl: ANOVA summary statistics table
table
ANOVA summary statistics table, returned as a table.
The contents of tbl depend on the ANOVA type specified in anovatype.
If anovatype is 'component', then tbl contains ANOVA statistics for each variable in the model except the constant (intercept) term. The table includes these columns for each variable:
Column | Description |
---|---|
SumSq | Sum of squares explained by the term, computed depending on sstype |
DF | Degrees of freedom. DF of a numeric variable is 1. DF of a categorical variable is the number of indicator variables created for the category (number of categories – 1). Note that tbl contains one row for each categorical variable instead of one row for each indicator variable as in the model display. Use anova to test a categorical variable as a group of indicator variables. DF of an error term is n – p, where n is the number of observations and p is the number of coefficients in the model. |
MeanSq | Mean square, defined by MeanSq = SumSq/DF. MeanSq for the error term is the mean squared error (MSE). |
F | F-statistic value to test the null hypothesis that the corresponding coefficient is zero, computed by F = MeanSq/MSE. When the null hypothesis is true, the F-statistic follows the F-distribution. The numerator degrees of freedom is the DF value for the corresponding term, and the denominator degrees of freedom is n – p. |
pValue | p-value of the F-statistic value |
For an example, see Component ANOVA Table.
If anovatype is 'summary', then tbl contains summary statistics of grouped terms for each row. The table includes the same columns as 'component' and these rows:
Row | Description |
---|---|
Total | Total statistics. SumSq: Total sum of squares, which is the sum of the squared deviations of the response around its mean. DF: Sum of degrees of freedom of Model and Residual. |
Model | Statistics for the model as a whole. SumSq: Model sum of squares, which is the sum of the squared deviations of the fitted value around the response mean. F and pValue: These values provide a test of whether the model as a whole fits significantly better than a degenerate model consisting of only a constant term. If mdl includes only linear terms, then anova does not decompose Model into Linear and Nonlinear. |
Linear | Statistics for linear terms. SumSq: Sum of squares for linear terms, which is the difference between the model sum of squares and the sum of squares for nonlinear terms. F and pValue: These values provide a test of whether the model with only linear terms fits better than a degenerate model consisting of only a constant term. anova uses the mean squared error that is based on the full model to compute this F-value, so the F-value obtained by dropping the nonlinear terms and repeating the test is not the same as the value in this row. |
Nonlinear | Statistics for nonlinear terms. SumSq: Sum of squares for nonlinear (higher-order or interaction) terms, which is the increase in the residual sum of squares obtained by keeping only the linear terms and dropping all nonlinear terms. F and pValue: These values provide a test of whether the full model fits significantly better than a smaller model consisting of only the linear terms. |
Residual | Statistics for residuals. SumSq: Residual sum of squares, which is the sum of the squared residual values. MeanSq: Mean squared error, used to compute the F-statistic values for Model, Linear, and Nonlinear. If mdl is a full LinearModel object and the sample data contains replications (multiple observations sharing the same predictor values), then anova decomposes the residual sum of squares into a sum of squares for the replicated observations (Pure error) and the remaining sum of squares (Lack of fit). |
Lack of fit | Lack-of-fit statistics. SumSq: Sum of squares due to lack of fit, which is the difference between the residual sum of squares and the replication sum of squares. F and pValue: The F-statistic value is the ratio of lack-of-fit MeanSq to pure error MeanSq. The ratio provides a test of bias by measuring whether the variation of the residuals is larger than the variation of the replications. A low p-value implies that adding terms to the model can improve the fit. |
Pure error | Statistics for pure error. SumSq: Replication sum of squares, obtained by finding the sets of points with identical predictor values, computing the sum of squared deviations around the mean within each set, and pooling the computed values. MeanSq: Model-free pure error variance estimate of the response. |
For an example, see Summary ANOVA Table.
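The Pure error description can be followed step by step for the hospital model used in the examples. This is an illustrative sketch, not the internal implementation of anova: it groups observations by identical Age and Sex values with findgroups, pools the within-group sums of squared deviations with splitapply, and assumes Pure error is row 7 of the summary table, as in the example display. The names g, ssPE, and dfPE are hypothetical.

```matlab
load hospital
tbl = table(hospital.Age,hospital.Sex,hospital.BloodPressure(:,2), ...
    'VariableNames',{'Age','Sex','BloodPressure'});
tbl.Sex = categorical(tbl.Sex);
mdl = fitlm(tbl,'BloodPressure ~ Sex + Age^2');
s = anova(mdl,'summary');

% Group observations that share the same predictor values.
g = findgroups(tbl.Age,tbl.Sex);

% Pool the squared deviations around each group mean.
ssPE = sum(splitapply(@(y) sum((y - mean(y)).^2),tbl.BloodPressure,g));

% Degrees of freedom: observations minus number of distinct groups.
dfPE = numel(g) - max(g);

% Compare with the Pure error row of the summary table.
[ssPE s.SumSq(7)]
[dfPE s.DF(7)]
```

If the hand computation matches the table, the pure error mean square ssPE/dfPE is the model-free variance estimate that the lack-of-fit test uses as its denominator.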
Alternative Functionality
More complete ANOVA statistics are available in the anova1, anova2, and anovan functions.
Extended Capabilities
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Version History