
Joseph V. Casillas, PhD
Rutgers University
Last update: 2025-02-23
Some examples
Note: We interpret the ~ as “as a function of”.
\[y = a + bx\]
\[y = 50 + 10x\] \[y = 50 + 10 \times 2\] \[y = 50 + 20\] \[y = 70\]
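A trivial sketch of the same arithmetic in R, plugging in a = 50, b = 10, x = 2:

```r
a <- 50; b <- 10; x <- 2
a + b * x
#> [1] 70
```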
Linear algebra

The linear model
Bivariate regression
\[\color{blue}{response} \sim intercept + (slope * \color{red}{predictor})\]
\[\color{blue}{\hat{y}} = a + b\color{red}{x}\]
Bivariate regression

| x | y |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
If we always guess the mean of y, we would be right once. Our prediction error is the difference between the mean of y and the observed values of y.
Bivariate regression

Always guessing the mean of y isn't the best solution. We can reduce prediction error by modeling the relationship between x and y.
Bivariate regression

| x_i | y_i | y_hat | pred.error | sqrd.error |
|---|---|---|---|---|
| 1 | 2 | 1.6 | 0.4 | 0.16 |
| 2 | 1 | 2.7 | -1.7 | 2.89 |
| 3 | 3 | 3.8 | -0.8 | 0.64 |
| Sum |  |  | -2.1 | 3.69 |
Bivariate regression

| x_i | y_i | y_hat | pred.error | sqrd.error |
|---|---|---|---|---|
| 1 | 2 | 0.75 | 1.25 | 1.5625 |
| 2 | 1 | 1.25 | -0.25 | 0.0625 |
| 3 | 3 | 1.75 | 1.25 | 1.5625 |
| Sum |  |  | 2.25 | 3.1875 |
Bivariate regression

| x_i | y_i | y_hat | pred.error | sqrd.error |
|---|---|---|---|---|
| 1 | 2 | 1.5 | 0.5 | 0.25 |
| 2 | 1 | 2 | -1.0 | 1.00 |
| 3 | 3 | 2.5 | 0.5 | 0.25 |
| Sum |  |  | 0.0 | 1.50 |
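The three candidate lines can be compared directly by their summed squared error; a minimal R sketch (the intercepts and slopes are inferred from the y_hat columns in the tables above):

```r
x <- c(1, 2, 3)
y <- c(2, 1, 3)

# Summed squared error for a line with intercept a and slope b
sse <- function(a, b) sum((y - (a + b * x))^2)

sse(0.5, 1.1)   # 3.69   (first candidate line)
sse(0.25, 0.5)  # 3.1875 (second candidate line)
sse(1.0, 0.5)   # 1.50   (the least squares line)
```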
Bivariate regression
\[b = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}\]
\[a = \bar{y} - b\bar{x}\]
Bivariate regression
Least squares estimation
| obs | x | y | x - xbar | y - ybar | (x - xbar)(y - ybar) | (x - xbar)^2 |
|---|---|---|---|---|---|---|
| 1 | 1 | 2 | -1 | 0 | 0 | 1 |
| 2 | 2 | 1 | 0 | -1 | 0 | 0 |
| 3 | 3 | 3 | 1 | 1 | 1 | 1 |
| Sum | 6 | 6 |  |  | 1 | 2 |
| Mean | 2 | 2 |  |  |  |  |
Slope
\[b = \frac{\sum{\color{red}{(x_i - \bar{x})(y_i - \bar{y})}}}{\sum{\color{blue}{(x_i - \bar{x})^2}}}\]
\[b = \frac{\color{red}{1}}{\color{blue}{2}}\]
Intercept
\[a = \bar{y} - b\bar{x}\]
\[a = 2 - (0.5 \times 2)\]
\[a = 1\]
\[\hat{y} = 1 + 0.5x\]
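As a quick check, a minimal R sketch using the toy x/y values from the table above; the hand-computed estimates match what lm() returns:

```r
x <- c(1, 2, 3)
y <- c(2, 1, 3)

# Least squares estimates from the formulas above
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 0.5
a <- mean(y) - b * mean(x)                                      # 1

# The same estimates via lm()
coef(lm(y ~ x))
#> (Intercept)           x
#>         1.0         0.5
```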
Bivariate regression
Technical stuff - Terminology review
Fitting a model
\[y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
Using the mtcars dataset, we can fit a model with the following variables:
- outcome: mpg
- predictor: wt

We fit the model with the lm() function. lm() will use the ordinary least squares method to obtain parameter estimates, i.e., β0 (intercept) and β1 (slope). The fitted model predicts mpg for a given weight.
Bivariate regression
Interpreting model output
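The output below comes from fitting the model and printing its summary, i.e.:

```r
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
```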
```
Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
 -4.54  -2.36  -0.13   1.41   6.87

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    37.29       1.88    19.9   <2e-16 ***
wt             -5.34       0.56    -9.6    1e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3 on 30 degrees of freedom
Multiple R-squared: 0.75, Adjusted R-squared: 0.74
F-statistic: 91 on 1 and 30 DF, p-value: 1.3e-10
```
Bivariate regression
Interpreting model output

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 37.29 | 1.88 | 19.86 | 0.00 |
| wt | -5.34 | 0.56 | -9.56 | 0.00 |
The intercept (37.29) is the predicted mpg for a car with wt = 0. The slope (-5.34) tells us that for each 1-unit increase in wt (1,000 lbs), mpg is predicted to decrease by 5.34. Both estimates are reliably different from 0 (the p.values of 0.00 are rounded; see the full summary above).
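A table in this shape can be produced from the fitted model; a sketch assuming the broom package is available:

```r
library(broom)
tidy(mod)  # one row per term: estimate, std.error, statistic, p.value
```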
Bivariate regression
Understanding slopes and intercepts
Bivariate regression
Same intercept, but different slopes


Bivariate regression
Positive and negative slope


Bivariate regression
Different intercepts, but same slopes
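Figures like these can be reproduced in base R; a minimal sketch with arbitrarily chosen illustrative values:

```r
op <- par(mfrow = c(1, 2))

# Same intercept (a = 2), different slopes
plot(NULL, xlim = c(0, 10), ylim = c(0, 25), xlab = "x", ylab = "y")
for (b in c(0.5, 1, 2)) abline(a = 2, b = b)

# Different intercepts, same slope (b = 1)
plot(NULL, xlim = c(0, 10), ylim = c(0, 25), xlab = "x", ylab = "y")
for (a in c(2, 5, 8)) abline(a = a, b = 1)

par(op)
```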


Bivariate regression
Interpreting model output
The coefficient of determination
Bivariate regression
Variance and error deviation

\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model
Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\) (Ex. \(y_3\): 3 - 2)
Predicted deviation: \(\color{green}{\hat{y}} - \color{blue}{\bar{y}}\) (Ex. \(y_3\): 2.5 - 2)
Error deviation: \(\color{red}{y_i} - \color{green}{\hat{y}}\) (Ex. \(y_3\): 3 - 2.5)
Bivariate regression
Variance and error deviation

\[SS_{total} = \sum (y_i - \bar{y})^2\] \[SS_{predicted} = \sum (\hat{y}_i - \bar{y})^2\] \[SS_{error} = \sum (y_i - \hat{y}_i)^2\]
\[SS_{total} = SS_{predicted} + SS_{error}\]
\[R^2 = \frac{SS_{predicted}}{SS_{total}}\]
https://www.react-graph-gallery.com/example/scatterplot-r2-playground
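Using the toy data from the least squares example above (where \(\hat{y} = 1 + 0.5x\)), a minimal sketch of the decomposition in R:

```r
x <- c(1, 2, 3)
y <- c(2, 1, 3)
mod <- lm(y ~ x)
y_hat <- fitted(mod)

ss_total     <- sum((y - mean(y))^2)      # 2.0
ss_predicted <- sum((y_hat - mean(y))^2)  # 0.5
ss_error     <- sum((y - y_hat)^2)        # 1.5

ss_predicted / ss_total    # R^2 = 0.25
summary(mod)$r.squared     # same value
```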
\[mpg_i = \beta_0 + \beta_1 wt_i + \epsilon_i\]
The mtcars model can be summarized as

\[\widehat{mpg} = 37.29 - 5.34 \times wt\]

What is the predicted mpg for a car that weighs 1 unit? And one that weighs 3? And 6?
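A sketch of the predictions in R (the hand computation uses the rounded coefficients, so predict() will differ slightly):

```r
mod <- lm(mpg ~ wt, data = mtcars)
predict(mod, newdata = data.frame(wt = c(1, 3, 6)))

# By hand, with the rounded coefficients:
37.29 - 5.34 * c(1, 3, 6)
#> [1] 31.95 21.27  5.25
```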
Taken from QASS 22 (Ch. 1, Table 2)

```
Rows: 32
Columns: 3
$ respondent <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ education  <dbl> 4, 4, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 11, 12, 12, 12, 12, …
$ income     <dbl> 6281, 10516, 6898, 8212, 11744, 8618, 10011, 12405, 14664, …

  respondent education income
1          1         4   6281
2          2         4  10516
3          3         6   6898
4          4         6   8212
5          5         6  11744
6          6         8   8618
```
First we’ll fit the model using lm().
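A minimal sketch, assuming the data frame above is named qass_data (the object name is not given in the source):

```r
mod_inc <- lm(income ~ education, data = qass_data)
```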
Here is a table of the model summary:
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 5077.512 | 1497.8302 | 3.389912 | 0.0019753 |
| education | 732.400 | 117.5221 | 6.232021 | 0.0000007 |
What is the predicted income of somebody with 6/10/16 years of education in Riverside?
Do it in R:
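A sketch using predict() with the model fit above (qass_data and mod_inc are assumed names):

```r
predict(mod_inc, newdata = data.frame(education = c(6, 10, 16)))

# Equivalently, by hand with the estimated coefficients:
5077.512 + 732.400 * c(6, 10, 16)
#> [1]  9471.912 12401.512 16795.912
```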