Joseph V. Casillas, PhD
Rutgers University
Last update: 2025-02-23
Some examples
Note: We interpret the ~ as “as a function of”.
\[y = a + bx\]
\[y = 50 + 10x\]
\[y = 50 + 10 \times 2\]
\[y = 50 + 20\]
\[y = 70\]
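The same substitution can be checked in R (the names a, b, and x are just illustrative):

```r
# Evaluate y = a + b * x with a = 50, b = 10, at x = 2
a <- 50
b <- 10
x <- 2
y <- a + b * x
y  # 70
```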
Linear algebra
The linear model
Bivariate regression
\[\color{blue}{response} \sim intercept + (slope * \color{red}{predictor})\]
\[\color{blue}{\hat{y}} = a + b\color{red}{x}\]
Bivariate regression
| x | y |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
If we used the mean of y as our prediction for every observation, we would be right on average, but there would still be error between that prediction and the observed values of y.

Bivariate regression
Using the mean of y isn’t the best solution. We can make better predictions by fitting a line that uses the relationship in the data between x and y.

Bivariate regression
| x_i | y_i | y_hat | pred.error | sqrd.error |
|---|---|---|---|---|
| 1 | 2 | 1.6 | 0.4 | 0.16 |
| 2 | 1 | 2.7 | -1.7 | 2.89 |
| 3 | 3 | 3.8 | -0.8 | 0.64 |
| Sum | | | -2.1 | 3.69 |
Bivariate regression
| x_i | y_i | y_hat | pred.error | sqrd.error |
|---|---|---|---|---|
| 1 | 2 | 0.75 | 1.25 | 1.5625 |
| 2 | 1 | 1.25 | -0.25 | 0.0625 |
| 3 | 3 | 1.75 | 1.25 | 1.5625 |
| Sum | | | 2.25 | 3.1875 |
Bivariate regression
| x_i | y_i | y_hat | pred.error | sqrd.error |
|---|---|---|---|---|
| 1 | 2 | 1.5 | 0.5 | 0.25 |
| 2 | 1 | 2 | -1.0 | 1.00 |
| 3 | 3 | 2.5 | 0.5 | 0.25 |
| Sum | | | 0.0 | 1.50 |
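The three tables compare candidate lines. Judging from the fitted values, the lines appear to be ŷ = 0.5 + 1.1x, ŷ = 0.25 + 0.5x, and ŷ = 1 + 0.5x (the intercept/slope pairs are inferred, not stated in the slides). A quick R sketch reproduces the squared-error totals:

```r
# Toy data from the tables
x <- c(1, 2, 3)
y <- c(2, 1, 3)

# Sum of squared errors for a candidate line y_hat = a + b * x
sse <- function(a, b) sum((y - (a + b * x))^2)

sse(0.5, 1.1)   # 3.69
sse(0.25, 0.5)  # 3.1875
sse(1, 0.5)     # 1.5  (the smallest: the least squares line)
```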
Bivariate regression
\[b = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}\]
\[a = \bar{y} - b\bar{x}\]
Bivariate regression
Least squares estimation
| obs | x | y | x - xbar | y - ybar | (x - xbar)(y - ybar) | (x - xbar)^2 |
|---|---|---|---|---|---|---|
| 1 | 1 | 2 | -1 | 0 | 0 | 1 |
| 2 | 2 | 1 | 0 | -1 | 0 | 0 |
| 3 | 3 | 3 | 1 | 1 | 1 | 1 |
| Sum | 6 | 6 | | | 1 | 2 |
| Mean | 2 | 2 | | | | |
Slope
\[b = \frac{\sum{\color{red}{(x_i - \bar{x})(y_i - \bar{y})}}}{\sum{\color{blue}{(x_i - \bar{x})^2}}}\]
\[b = \frac{\color{red}{1}}{\color{blue}{2}} = 0.5\]
Intercept
\[a = \bar{y} - b\bar{x}\]
\[a = 2 - (0.5 \times 2)\]
\[a = 1\]
\[\hat{y} = 1 + 0.5x\]
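We can check the hand calculation against lm() (a quick sketch; object names are illustrative):

```r
# Toy data from the worked example
x <- c(1, 2, 3)
y <- c(2, 1, 3)

# Slope and intercept via the least squares formulas
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)
c(a = a, b = b)  # a = 1, b = 0.5

# Same estimates from lm()
coef(lm(y ~ x))
```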
Bivariate regression
Technical stuff - Terminology review
Fitting a model
\[y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Using the mtcars dataset, we can fit a model with the following variables:

mpg: the response variable
wt: the predictor

The lm() function will use the ordinary least squares method to obtain parameter estimates, i.e., β0 (intercept) and β1 (slope), which we can then use to predict mpg for a given weight.
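The call that produces the output shown below looks like this (the object name mod is an assumption, not from the original):

```r
# Fit a bivariate regression predicting mpg from weight
# mtcars ships with base R
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
```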
Bivariate regression
Interpreting model output
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.54 -2.36 -0.13 1.41 6.87
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.29 1.88 19.9 <2e-16 ***
wt -5.34 0.56 -9.6 1e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3 on 30 degrees of freedom
Multiple R-squared: 0.75, Adjusted R-squared: 0.74
F-statistic: 91 on 1 and 30 DF, p-value: 1.3e-10
Bivariate regression
Interpreting model output
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 37.29 | 1.88 | 19.86 | 0.00 |
| wt | -5.34 | 0.56 | -9.56 | 0.00 |
Bivariate regression
Understanding slopes and intercepts
Bivariate regression
Same intercept, but different slopes
Bivariate regression
Positive and negative slope
Bivariate regression
Different intercepts, but same slopes
Bivariate regression
Interpreting model output
The coefficient of determination
Bivariate regression
Variance and error deviation

\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\) (Ex. \(y_3\): 3 - 2)
Predicted deviation: \(\color{green}{\hat{y}} - \color{blue}{\bar{y}}\) (Ex. \(y_3\): 2.5 - 2)
Error deviation: \(\color{red}{y_i} - \color{green}{\hat{y}}\) (Ex. \(y_3\): 3 - 2.5)
Bivariate regression
Variance and error deviation
\[SS_{total} = \sum (y_i - \bar{y})^2\] \[SS_{predicted} = \sum (\hat{y}_i - \bar{y})^2\] \[SS_{error} = \sum (y_i - \hat{y}_i)^2\]
\[SS_{total} = SS_{predicted} + SS_{error}\]
\[R^2 = \frac{SS_{predicted}}{SS_{total}}\]
https://www.react-graph-gallery.com/example/scatterplot-r2-playground
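Using the toy data and the least squares line ŷ = 1 + 0.5x from earlier, the decomposition can be verified directly (a sketch; object names are illustrative):

```r
x <- c(1, 2, 3)
y <- c(2, 1, 3)
y_hat <- 1 + 0.5 * x

ss_total     <- sum((y - mean(y))^2)      # 2
ss_predicted <- sum((y_hat - mean(y))^2)  # 0.5
ss_error     <- sum((y - y_hat)^2)        # 1.5

ss_total == ss_predicted + ss_error  # TRUE
r2 <- ss_predicted / ss_total        # 0.25
```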
\[y_i = \beta_0 + \beta_1 wt_i + \epsilon_i\]
Given the estimates, the mtcars model can be summarized as

\[\widehat{mpg} = 37.29 - 5.34 \times wt\]

What is the predicted mpg for a car that weighs 1 unit? And one that weighs 3? And 6?

The data below are taken from QASS 22 (Ch. 1, Table 2).
Rows: 32
Columns: 3
$ respondent <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ education <dbl> 4, 4, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 11, 12, 12, 12, 12, …
$ income <dbl> 6281, 10516, 6898, 8212, 11744, 8618, 10011, 12405, 14664, …
respondent education income
1 1 4 6281
2 2 4 10516
3 3 6 6898
4 4 6 8212
5 5 6 11744
6 6 8 8618
First we’ll fit the model using lm().
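A sketch of the call. Since the full 32-row dataset isn’t reproduced here, the data frame below uses only the six rows printed above, so its estimates won’t match the summary table:

```r
# Illustrative subset: the six rows printed above (full data: 32 rows)
dat <- data.frame(
  education = c(4, 4, 6, 6, 6, 8),
  income = c(6281, 10516, 6898, 8212, 11744, 8618)
)
fit <- lm(income ~ education, data = dat)
coef(fit)
```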
Here is a table of the model summary:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5077.512 | 1497.8302 | 3.389912 | 0.0019753 |
| education | 732.400 | 117.5221 | 6.232021 | 0.0000007 |
What is the predicted income of somebody with 6/10/16 years of education in Riverside?
Do it in R:
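One way to do it, using the estimates from the table above (variable names are illustrative; results from predict() on the full model may differ slightly due to rounding):

```r
# Predicted income = intercept + slope * years of education
b0 <- 5077.512
b1 <- 732.400
b0 + b1 * c(6, 10, 16)
# 9471.912 12401.512 16795.912
```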