Joseph V. Casillas, PhD
Rutgers University
Last update: 2025-03-02
assumption (n.)
2. a hypothesis that is taken for granted
3. a thing that is accepted as true or as certain to happen, without proof
We can break the assumptions up into 3 areas:
There should be no specification error
Model specification
- Including an irrelevant variable (sin of commission): 🤷🏽
- Excluding a relevant variable (sin of omission): 👎🏽
What to do
Importance: high
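A quick simulated sketch of the sin of omission (hypothetical data; the variable names `x`, `y`, and `z` are illustrative, not from the slides). When a relevant, correlated predictor is left out, the remaining slope absorbs its effect:

```r
# Simulate data where y depends on both x and z, and x is correlated with z
set.seed(1)
z <- rnorm(100)
x <- 0.5 * z + rnorm(100)        # x and z are correlated
y <- 2 * x + 3 * z + rnorm(100)  # true slope of x is 2

coef(lm(y ~ x + z))  # correctly specified: slope of x is close to 2
coef(lm(y ~ x))      # sin of omission: slope of x is biased (absorbs z's effect)
```

Including an irrelevant variable, by contrast, mostly costs a degree of freedom, which is why it rates only a 🤷🏽.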
The error term should meet the following characteristics…
If the mean of the error term deviates far from 0…
Importance: low… unless you are interested in the intercept.
Importance: medium/high
Importance: high, but uncommon in standard regression
Importance: high
Importance: high
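These characteristics can be eyeballed directly from a fitted model's residuals. A minimal sketch, using the `mtcars` model from these slides as the example fit:

```r
# Quick checks on the error term of a fitted lm model
mod <- lm(mpg ~ wt, data = mtcars)

mean(residuals(mod))               # should be effectively 0 (floating-point noise)
hist(residuals(mod))               # roughly normal?
plot(fitted(mod), residuals(mod))  # constant spread, no visible pattern?
```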
mtcars
The model predicts that a car weighing 6,000 lbs (wt = 6) should get 5.22 mpg.

Diagnostics
Considerations for model diagnostics
1. Model assumptions
2. Outliers
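As a sanity check before diagnosing, the prediction quoted above can be reproduced with `predict()`; a minimal sketch:

```r
# Reproduce the headline prediction for a car with wt = 6 (i.e., 6,000 lbs)
mod <- lm(mpg ~ wt, data = mtcars)
predict(mod, newdata = data.frame(wt = 6))  # ≈ 5.22 mpg
```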
Diagnostics
Homoskedasticity of residuals
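The standard visual check, assuming `mod1` is the fitted model from these slides, is a residuals-versus-fitted plot; the spread of the residuals should be roughly constant across fitted values:

```r
# Visual check for homoskedasticity
plot(fitted(mod1), residuals(mod1))
abline(h = 0, lty = 2)

# A formal alternative (Breusch-Pagan test), if the lmtest package is installed:
# lmtest::bptest(mod1)
```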
Diagnostics
No autocorrelation of residuals
Durbin-Watson test
data: mod1
DW = 1.119, p-value = 0.0002656
alternative hypothesis: true autocorrelation is greater than 0
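The output above matches the format produced by `dwtest()` from the lmtest package (an assumption; the generating code isn't shown in the slides). A DW statistic near 2 suggests no autocorrelation; values toward 0 indicate positive autocorrelation:

```r
# Reproduce a Durbin-Watson test on the fitted model
library(lmtest)
dwtest(mod1)  # DW = 1.119 here, i.e., evidence of positive autocorrelation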
Diagnostics
No autocorrelation of residuals
Durbin-Watson test
data: mod2
DW = 0.26212, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
Diagnostics
No autocorrelation of residuals - Correction
Durbin-Watson test
data: mod2_fix
DW = 2.6825, p-value = 0.9875
alternative hypothesis: true autocorrelation is greater than 0
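The slides don't show how `mod2_fix` was built; one plausible sketch (an assumption, given that `mod2` fits a quadratic outcome with a straight line) is to model the curvature that produced the autocorrelated residuals:

```r
# Hypothetical reconstruction of the correction: add a quadratic term
mod2_fix <- lm(y_quad ~ poly(x, 2), data = assumptions_data)
lmtest::dwtest(mod2_fix)  # DW moves back toward 2 once the curvature is modeled
```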
Diagnostics
Predictors and residuals are uncorrelated
Pearson's product-moment correlation
data: assumptions_data$x and mod1$residuals
t = -5.72e-16, df = 49, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2755837 0.2755837
sample estimates:
cor
-8.171416e-17
Diagnostics
Predictors and residuals are uncorrelated
Pearson's product-moment correlation
data: assumptions_data$x and mod2$residuals
t = -5.0882e-16, df = 49, p-value = 1
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.2755837 0.2755837
sample estimates:
cor
-7.268887e-17
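The output above comes from `cor.test()` applied to the predictor and the model residuals; e.g., for `mod1`:

```r
# Test whether the predictor is correlated with the residuals
cor.test(assumptions_data$x, mod1$residuals)
# The correlation is 0 by construction in OLS, hence the ~1e-17 estimates above
```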
Diagnostics
Leverage
Influential data points
\[y = -0.07 - 1.71x\]
\[y = 0.07 - 2.04x\]
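Leverage and influence can also be quantified numerically; a sketch, assuming `mod1` is the fitted model from these slides:

```r
# Numeric diagnostics for leverage and influence
lev  <- hatvalues(mod1)         # leverage of each observation
cook <- cooks.distance(mod1)    # Cook's distance (influence)
which(cook > 4 / length(cook))  # a common rule-of-thumb flag for influential points
plot(mod1, which = 5)           # residuals vs. leverage, with Cook's distance contours
```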
gvlma

We can use the gvlma package to test model assumptions.

##
## Call:
## lm(formula = y ~ x, data = assumptions_data)
##
## Coefficients:
## (Intercept) x
## -0.06577 -1.71081
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = mod1)
##
## Value p-value Decision
## Global Stat 557.334 0.000e+00 Assumptions NOT satisfied!
## Skewness 73.855 0.000e+00 Assumptions NOT satisfied!
## Kurtosis 457.890 0.000e+00 Assumptions NOT satisfied!
## Link Function 6.284 1.218e-02 Assumptions NOT satisfied!
## Heteroscedasticity 19.304 1.115e-05 Assumptions NOT satisfied!
##
## Call:
## lm(formula = y ~ x, data = assumptions_data[1:50, ])
##
## Coefficients:
## (Intercept) x
## 0.07471 -2.04296
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = noOut)
##
## Value p-value Decision
## Global Stat 2.4877 0.6468 Assumptions acceptable.
## Skewness 0.6541 0.4186 Assumptions acceptable.
## Kurtosis 0.3175 0.5731 Assumptions acceptable.
## Link Function 0.9478 0.3303 Assumptions acceptable.
## Heteroscedasticity 0.5683 0.4509 Assumptions acceptable.
##
## Call:
## lm(formula = y_quad ~ x, data = assumptions_data)
##
## Coefficients:
## (Intercept) x
## 2.838 -1.600
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = mod2)
##
## Value p-value Decision
## Global Stat 87.100 0.000e+00 Assumptions NOT satisfied!
## Skewness 25.674 4.042e-07 Assumptions NOT satisfied!
## Kurtosis 14.082 1.750e-04 Assumptions NOT satisfied!
## Link Function 45.552 1.487e-11 Assumptions NOT satisfied!
## Heteroscedasticity 1.793 1.806e-01 Assumptions acceptable.
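The assessments above can be reproduced by printing a gvlma object (a sketch; `mod1`, `noOut`, and `mod2` are the models named in the calls shown above):

```r
# Global validation of linear model assumptions
# install.packages("gvlma")  # if needed
library(gvlma)
gvlma(mod1)   # all four assumption tests fail
gvlma(noOut)  # refit without the outliers: assumptions acceptable
gvlma(mod2)   # quadratic outcome fit with a line: link function fails
```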
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
lm(formula = mpg ~ wt, data = mtcars)
lm()
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-4.5432 -2.3647 -0.1252 0.0000 1.4096 6.8727
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.285126 1.877627 19.857575 8.241799e-19
wt -5.344472 0.559101 -9.559044 1.293959e-10
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
Interpretation
Statistical Significance
As long as the effect is not statistically equivalent to 0, it is called statistically significant
It may be an effect of trivial magnitude
Basically, it means that the prediction is better than nothing
It doesn’t mean the effect is “significant” in the everyday sense of the word
It doesn’t indicate importance
Interpretation
Significance versus Importance
How do we know if the “significant” effect we found is actually important?
How important are the chosen predictors?
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
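One common way to put an effect on an interpretable scale is to standardize the variables, so the slope is in standard-deviation units (a sketch; with a single predictor this standardized slope equals the correlation):

```r
# Standardized coefficients as a rough gauge of importance
mod_std <- lm(scale(mpg) ~ scale(wt), data = mtcars)
coef(mod_std)  # slope ≈ -0.87: one SD more weight predicts ~0.87 SD fewer mpg
```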
Interpretation
A note about F-ratios
If you have an ANOVA background…

Mean Squared Deviations
\[MS_{Total} = \frac{\sum{(y_i - \bar{y})^2}}{n-1}\]
\[MS_{Predicted} = \frac{\sum{(\hat{y}_i - \bar{y})^2}}{k}\]
\[MS_{Error} = \frac{\sum{(y_i - \hat{y}_i)^2}}{n - k - 1}\]
\[F_{(k),(n-k-1)} = \frac{\sum{(\hat{y}_i - \bar{y})^2} / (k)}{\sum{(y_i - \hat{y}_i)^2} / (n - k - 1)}\]
Interpretation
A note about F-ratios
If you have an ANOVA background…

Mean Squared Deviations
\[MS_{Total} = \frac{SS_{Total}}{df_{Total}}\]
\[MS_{Predicted} = \frac{SS_{Predicted}}{df_{Predicted}}\]
\[MS_{Error} = \frac{SS_{Error}}{df_{Error}}\]
\[F = \frac{MS_{Predicted}}{MS_{Error}}\]
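The ratio can be verified by hand for the `mtcars` model (n = 32, k = 1), using the mean squares from `anova()`:

```r
# Verify F = MS_Predicted / MS_Error for mpg ~ wt
mod <- lm(mpg ~ wt, data = mtcars)
tab <- anova(mod)               # rows: wt (predicted), Residuals (error)
ms_pred  <- tab[["Mean Sq"]][1]
ms_error <- tab[["Mean Sq"]][2]
ms_pred / ms_error              # ≈ 91.38, matching the summary output
```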
Interpretation
Degrees of Freedom
Derived from the number of sample statistics used in your computation:
Interpretation
Degrees of Freedom
When computing the standard deviation, the deviations use the sample mean, a sample statistic rather than a population parameter, so the denominator is df = n - 1:
Usually we have an F-table with associated degrees of freedom to show us whether the F-ratio is “statistically significant”:
Interpretation
Degrees of Freedom
\[df_{Total} = n - 1\]
\[df_{Predicted} = k\]
\[df_{Error} = n - k - 1\]
\[df_{Total} = df_{Predicted} + df_{Error}\]
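These identities check out for the `mtcars` model, where n = 32 and k = 1:

```r
# Degrees of freedom for mpg ~ wt
n <- nrow(mtcars); k <- 1
n - 1      # df_total = 31
k          # df_predicted = 1
n - k - 1  # df_error = 30, as reported in the model summary
```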