Joseph V. Casillas, PhD
Rutgers University
Last update: 2025-03-22
\[Y = a + bX + e\]
MRC
Overview
| A (x) | B (y) | C (x ∨ y) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
MRC
Overview
MRC
Overview
Weighted sum
\[Y = a + bX + e\]
\[Y = a_{0} + b_{1}X_{1} + b_{2}X_{2} + \dots + b_{k}X_{k} + e\]
What does this mean for our parameter estimates and how do we interpret them?
(Semi)partial betas
MRC
MRC
Recall that the b-weight of the bivariate model is \(r(\frac{s_y}{s_x})\)
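A quick sketch verifying this identity in R (using the mtcars data that appear later in these slides):

# slope from lm() equals r * (s_y / s_x)
b_lm  <- coef(lm(mpg ~ wt, data = mtcars))["wt"]
b_rsd <- cor(mtcars$wt, mtcars$mpg) * sd(mtcars$mpg) / sd(mtcars$wt)
all.equal(unname(b_lm), b_rsd)  # TRUE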
MRC
MRC
Cohen & Cohen (1975)
MRC
MRC
Squared semipartial correlation
MRC
Squared partial correlation
MRC
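A minimal sketch of how both quantities can be obtained from nested models in R (the choice of drat as the variable of interest, controlling for wt, is illustrative; the mtcars example is introduced below):

# squared semipartial (sr2) and squared partial (pr2) correlations for drat
r2_full <- summary(lm(mpg ~ wt + drat, data = mtcars))$r.squared
r2_red  <- summary(lm(mpg ~ wt, data = mtcars))$r.squared
r2_full - r2_red                    # sr2: increase in R2 due to drat
(r2_full - r2_red) / (1 - r2_red)   # pr2: increase relative to unexplained variance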
Statistical control
MRC
Conceptual understanding
MRC
Call:
lm(formula = mpg ~ wt + drat, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.4159 -2.0452 0.0136 1.7704 6.7466
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.290 7.318 4.139 0.000274 ***
wt -4.783 0.797 -6.001 1.59e-06 ***
drat 1.442 1.459 0.989 0.330854
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.047 on 29 degrees of freedom
Multiple R-squared: 0.7609, Adjusted R-squared: 0.7444
F-statistic: 46.14 on 2 and 29 DF, p-value: 9.761e-10
MRC
|  | beta | SE | 95% LB | 95% UB | t-ratio | p.value |
|---|---|---|---|---|---|---|
| (Intercept) | 30.29 | 7.32 | 15.32 | 45.26 | 4.14 | 0.00 |
| wt | -4.78 | 0.80 | -6.41 | -3.15 | -6.00 | 0.00 |
| drat | 1.44 | 1.46 | -1.54 | 4.43 | 0.99 | 0.33 |
MRC
MRC
Making predictions
\[Y = a_{0} + b_{1}X_{1} + b_{2}X_{2} + \dots + b_{k}X_{k} + e\]
The mtcars model can be summarized as…\[ \begin{aligned} \widehat{mpg} &= 30.29 - 4.78(wt) + 1.44(drat) \end{aligned} \]
What is the predicted mpg
for a car that weighs 1 unit with a rear axle ratio (drat) of 2?
And one that weighs 1 with a drat of 4?
And one that weighs 3 with a drat of 2.5?
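A sketch of how these predictions can be computed in R with predict(), refitting the additive model shown above (the printed values should match hand calculations up to rounding):

# predicted mpg for the three hypothetical cars
mod <- lm(mpg ~ wt + drat, data = mtcars)
new_cars <- data.frame(wt = c(1, 1, 3), drat = c(2, 4, 2.5))
predict(mod, newdata = new_cars)
# roughly 28.4, 31.3, and 19.5 mpg, respectively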
Interactions
Recall
| A (x) | B (y) | C (x ∨ y) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
If A = 1
OR
B = 1,
then C = 1
Otherwise, C = 0
Interactions
Recall
Non-Additivity: Interaction Terms
| A (x) | B (y) | C (x ∧ y) | (x ∨ y) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 |
If A = 1
AND
B = 1,
then C = 1
Otherwise, C = 0
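Both truth tables can be reproduced directly with R's logical operators (a minimal sketch):

# OR and AND columns from the tables above
A <- c(0, 0, 1, 1)
B <- c(0, 1, 0, 1)
data.frame(A, B, or = as.integer(A | B), and = as.integer(A & B))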
Interactions
Genetics example
Drugs and alcohol
Consider taking the upcoming midterm in one of the following conditions
Interactions
Drugs and alcohol continued
Interactions
Non-additivity of effects
Interactions
The multiple regression formula:
\(Y = a_{0} + b_{1}X_{1} + b_{2}X_{2} + b_{3}(X_{1} \times X_{2}) + e\)
Including interactions in R
Call:
lm(formula = mpg ~ wt + drat + wt:drat, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.8913 -1.8634 -0.3398 1.3247 6.4730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.550 12.631 0.439 0.6637
wt 3.884 3.798 1.023 0.3153
drat 8.494 3.321 2.557 0.0162 *
wt:drat -2.543 1.093 -2.327 0.0274 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.839 on 28 degrees of freedom
Multiple R-squared: 0.7996, Adjusted R-squared: 0.7782
F-statistic: 37.25 on 3 and 28 DF, p-value: 6.567e-10
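One way to unpack this interaction (a sketch, not part of the original slides; fit_int is an illustrative name) is to compute the simple slope of wt at a few values of drat using the coefficients above:

# simple slope of wt depends on drat: b_wt + b_wt:drat * drat
fit_int <- lm(mpg ~ wt + drat + wt:drat, data = mtcars)
b <- coef(fit_int)
drat_vals <- c(3, 3.5, 4)
b["wt"] + b["wt:drat"] * drat_vals  # the effect of wt grows more negative as drat increases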
Interactions
A new assumption
Multicollinearity
How useful is the model output?
Traditional t-test to determine whether b = 0
\[t_{b} = \frac{b}{SE_{b}}\]
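For instance, the t-ratio reported for drat in the additive mtcars model above is just its estimate divided by its standard error (a minimal sketch):

# t-ratio for drat: estimate / standard error
mod <- lm(mpg ~ wt + drat, data = mtcars)
est <- coef(summary(mod))["drat", "Estimate"]
se  <- coef(summary(mod))["drat", "Std. Error"]
est / se  # ~0.99, matching the t value in the summary output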
Problem
How useful is the model output?
Your t-tests might not be informative
How do you deal with this?
Hierarchical partitioning of variance through nested model comparisons
Two models are nested when one is contained inside the other: a more inclusive model contains more parameters, and a less inclusive (restricted) model contains just the subset of variables you would like to test
Nested model comparisons
What is a nested model?
How
Nested model comparisons
Inclusive Regression Model (\(R^2_{I}\)):
\[\hat{y} = a + b_{1}x_{1} + b_{2}x_{2} + b_{3}x_{3} + e\]
Restricted Regression Model (\(R^2_{R}\)):
\[\hat{y} = a + b_{1}x_{1} + b_{2}x_{2} + e\]
Nested Model Comparison:
\[(R^2_{I} - R^2_{R}) = sr^2(y, x_{3} \cdot x_{1}, x_{2})\]
Nested model comparisons: \((k_{I} - k_{R}) = 1\)
Restricted Regression Model 1:
\[\hat{y} = a + b_{1}x_{1} + b_{2}x_{2} + e\]
Restricted Regression Model 2:
\[\hat{y} = a + b_{1}x_{1} + b_{3}x_{3} + e\]
Restricted Regression Model 3:
\[\hat{y} = a + b_{2}x_{2} + b_{3}x_{3} + e\]
Nested model comparisons: \((k_{I} - k_{R}) = 2\)
Restricted Regression Model 4:
\[\hat{y} = a + b_{1}x_{1} + e\]
Restricted Regression Model 5:
\[\hat{y} = a + b_{2}x_{2} + e\]
Restricted Regression Model 6:
\[\hat{y} = a + b_{3}x_{3} + e\]
Nested model comparisons
Nested model comparisons
Stop voicing example (cont)
Stop voicing example (cont)
\[F_{(\color{red}{k_{I}}-\color{blue}{k_{R}}),(n - \color{red}{k_{I}} - 1)} = \frac{(\color{red}{R^2_{I}} - \color{blue}{R^2_{R}}) / (\color{red}{k_{I}} - \color{blue}{k_{R}})}{(1 - \color{red}{R^2_{I}}) / (n - \color{red}{k_{I}} - 1)}\]
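A minimal R sketch of this formula using the additive and interaction mtcars models (the hand computation should match the F reported by anova() later in these slides):

# F for the nested comparison, computed from R2 values
r2_i <- summary(lm(mpg ~ wt + drat + wt:drat, data = mtcars))$r.squared  # inclusive
r2_r <- summary(lm(mpg ~ wt + drat, data = mtcars))$r.squared            # restricted
n <- nrow(mtcars); k_i <- 3; k_r <- 2
((r2_i - r2_r) / (k_i - k_r)) / ((1 - r2_i) / (n - k_i - 1))  # ~5.41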
Nested model comparisons
What about likelihood ratio tests?
y ~ a * b * c
Main effects: a, b, and c
2-way interactions: a:b, a:c, and b:c
3-way interactions: a:b:c
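This expansion can be confirmed in R with a quick sketch:

# the * operator expands to all main effects and interactions
attr(terms(y ~ a * b * c), "term.labels")
# "a" "b" "c" "a:b" "a:c" "b:c" "a:b:c"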
Nested model comparisons
Summarizing
Problem: Multicollinearity
Solution: Hierarchical partitioning of variance
Make an inclusive model and do hierarchical partitioning of variance
Determine which terms (lower order and higher order, additive and interactive), if any, can be eliminated
So which one gets causal priority?
Main effects generally do:
There is no convention for deciding which comes first:
Only test plausible rival hypotheses
Be careful and selective!
Even with a small number of primitive component terms (main effects), once you examine all the possible interactions you will have a complex model
You will exhaust your statistical power very quickly
mod_full <- lm(mpg ~ wt + drat + wt:drat, data = mtcars) # inclusive model
mod_int <- lm(mpg ~ wt + drat , data = mtcars) # restricted model
mod_drat <- lm(mpg ~ wt , data = mtcars) # restricted model
mod_wt <- lm(mpg ~ drat , data = mtcars) # restricted model
mod_null <- lm(mpg ~ 1 , data = mtcars) # null model
In R we carry out these comparisons with the anova() function (not to be confused with the aov() function).
Nested model comparisons
Doing it in R
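The output below comes from passing the restricted and inclusive models defined above to anova():

anova(mod_int, mod_full)  # restricted model first, then the inclusive model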
Analysis of Variance Table
Model 1: mpg ~ wt + drat
Model 2: mpg ~ wt + drat + wt:drat
Res.Df RSS Df Sum of Sq F Pr(>F)
1 29 269.24
2 28 225.62 1 43.624 5.4139 0.02744 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The comparison shows that the wt x drat interaction accounts for a significant improvement in fit (F(1) = 5.41, p < 0.03).
Inflated R2
As k increases, there is more capitalization on chance
As n increases, the overestimation of R2 is less:
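A standard way to express this correction is the adjusted (shrunken) R2 formula, shown here as a sketch since the original figure is not reproduced:

\[R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}\]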
Alpha slippage
\[\color{red}{\alpha_{E}} = 1 - (1 - \color{blue}{\alpha_{T}})^{\color{green}{k}}\]
| αT = | Test-wise alpha |
| k = | Number of significance tests |
| αE = | Experiment-wise alpha |
Alpha slippage
| 0.05 = | Test-wise alpha |
| 10 = | Number of significance tests |
| .40 = | Experiment-wise alpha |

| 0.05 = | Test-wise alpha |
| 20 = | Number of significance tests |
| .64 = | Experiment-wise alpha |
Alpha slippage
| 0.05 = | Test-wise alpha |
| 100 = | Number of significance tests |
| .99 = | Experiment-wise alpha |

| 0.05 = | Test-wise alpha |
| 500 = | Number of significance tests |
| 1.0 = | Experiment-wise alpha |
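These values come straight from the formula above; a short R sketch reproduces them:

# experiment-wise alpha at alpha_T = .05 for several numbers of tests
k <- c(10, 20, 100, 500)
round(1 - (1 - 0.05)^k, 2)
# 0.40 0.64 0.99 1.00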
Alpha slippage
Taking repeated risks
Empirical selection of variables
We should strive to create theoretically specified models that test a priori alternative hypotheses:
We should be discouraged from testing atheoretical ones:
Empirical selection of variables
Empirical selection of variables
Reason 1: Some people are interested in prediction and not causation, and predictions don’t require theory
Empirical selection of variables
Reason 2: There may be an honest lack of theory in the area you are investigating
Empirical selection of variables
How should we do it?
Empirical selection of variables
Empirical selection of variables
Keep doing this until everything left is significant
When removing one more variable results in a statistically significant sR2 F-ratio, put that last variable back in, and that's your final model
You are picking off the weakest variables until you can no longer validly eliminate anything else
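For reference, a minimal sketch of backward elimination in R (the predictors are illustrative, and note that step() uses AIC by default rather than the sR2 F-ratio described above; as these slides argue, use empirical selection with caution):

# start with the inclusive model and drop the weakest term
full <- lm(mpg ~ wt + drat + qsec, data = mtcars)
drop1(full, test = "F")                        # F-tests for dropping each term
reduced <- step(full, direction = "backward")  # automated backward elimination (AIC-based)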
Empirical selection of variables
In true backward elimination, once you eliminate something you can’t go back on that decision
Due to multicollinearity, one variable that was eliminated in an earlier step might now be significant in a new context, but you have no way to know this
Remember significance is often context dependent
Variable A might have not been significant in step 2, but now that variables B, C, and D have been removed it might be significant
Empirical selection of variables
Similar to Backward Elimination:
Empirical selection of variables
Empirical selection of variables
General description:
Results
\(\beta\)
SE
CI
p-val
Llompart & Casillas (2016)
Reporting
Llompart & Casillas (2016)
Reporting
Casillas (2015)
Reporting
Casillas (2015)
Reporting
Casillas et al. (2023)
Reporting
Casillas et al. (2023)
Some data
'data.frame': 10000 obs. of 5 variables:
$ age : num 5 5 5 5 5 5 5 5 5 5 ...
$ height : num 36.8 40.9 41.9 40.9 40 ...
$ score : num 57.4 55.7 53.9 56.8 53.8 ...
$ age_c : num -6.5 -6.5 -6.5 -6.5 -6.5 -6.5 -6.5 -6.5 -6.5 -6.5 ...
$ height_c: num -12.64 -8.56 -7.51 -8.5 -9.42 ...
age height score age_c height_c
1 7.9 40.43 58.87 -3.6 -8.98
2 18.0 56.09 70.63 6.5 6.68
3 5.1 42.76 57.14 -6.4 -6.65
4 17.0 57.87 72.29 5.5 8.45
5 16.8 54.27 68.92 5.3 4.86
6 18.0 54.04 70.00 6.5 4.62
7 11.5 43.96 66.89 0.0 -5.45
8 17.3 58.31 72.03 5.8 8.89
9 16.9 57.62 70.04 5.4 8.21
10 8.8 44.40 60.44 -2.7 -5.01
Call:
lm(formula = height ~ age_c, data = mrc_ex_data)
Residuals:
Min 1Q Median 3Q Max
-11.255 -2.033 0.031 2.013 11.665
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 49.415790 0.029986 1648.0 <2e-16 ***
age_c 1.261273 0.007989 157.9 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.999 on 9998 degrees of freedom
Multiple R-squared: 0.7137, Adjusted R-squared: 0.7137
F-statistic: 2.493e+04 on 1 and 9998 DF, p-value: < 2.2e-16
Call:
lm(formula = score ~ age_c, data = mrc_ex_data)
Residuals:
Min 1Q Median 3Q Max
-7.6328 -1.3835 0.0204 1.3758 7.1214
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.367451 0.019954 3225.8 <2e-16 ***
age_c 1.240785 0.005316 233.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.995 on 9998 degrees of freedom
Multiple R-squared: 0.8449, Adjusted R-squared: 0.8449
F-statistic: 5.447e+04 on 1 and 9998 DF, p-value: < 2.2e-16
Call:
lm(formula = score ~ height_c, data = mrc_ex_data)
Residuals:
Min 1Q Median 3Q Max
-11.5541 -2.1973 -0.0409 2.1267 12.3640
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.367451 0.031959 2014 <2e-16 ***
height_c 0.701625 0.005703 123 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.196 on 9998 degrees of freedom
Multiple R-squared: 0.6022, Adjusted R-squared: 0.6021
F-statistic: 1.513e+04 on 1 and 9998 DF, p-value: < 2.2e-16
What can we do about this?
Call:
lm(formula = score ~ height_c + age_c, data = mrc_ex_data)
Residuals:
Min 1Q Median 3Q Max
-7.6294 -1.3824 0.0191 1.3743 7.1263
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.367451 0.019955 3225.66 <2e-16 ***
height_c -0.001730 0.006655 -0.26 0.795
age_c 1.242967 0.009936 125.09 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.995 on 9997 degrees of freedom
Multiple R-squared: 0.8449, Adjusted R-squared: 0.8449
F-statistic: 2.723e+04 on 2 and 9997 DF, p-value: < 2.2e-16
We partial out the spurious relationship between math score
and height by conditioning on age.