Data Science for Linguists

The linear model: Bivariate regression

Joseph V. Casillas, PhD

Rutgers University
Spring 2025
Last update: 2025-02-23

The linear model

Overview

  • What it encompasses… a lot. It’s everywhere.
  • The linear model allows us to test for a linear relationship between 2 (or more) variables
  • It can be used to
    1. quantify the strength of a relationship
    2. predict

Some examples

  • weight ~ height
  • IQ ~ age
  • ice cream sold ~ temperature
  • RT ~ group
  • vocab size ~ age
  • vowel duration ~ stress
  • target fixations ~ grammatical gender
  • F1 ~ vowel

Note: We interpret the ~ as “as a function of”.

Linear algebra

Remember middle school?

  • It really is the same thing.
  • You probably saw something like this:

\[y = a + bx\]

  • …or maybe some other variation (e.g., y = mx + b). In y = a + bx, b is the slope and a is the value of y when x = 0, i.e., the y-intercept (the point where the line crosses the y-axis).
  • If you know all but one of the values, you can solve for the one that is missing… assume x = 2 in the equation below and solve for y.

\[y = 50 + 10x\]
\[y = 50 + 10 \times 2\]
\[y = 50 + 20\]
\[y = 70\]

Linear algebra

Cartesian coordinates

The linear model

Bivariate regression

Bivariate regression

  • The linear model is basically the same as linear algebra, with two subtle differences
    1. We fit the line through data points (measurements, observations)
    2. We use slightly different terminology

\[\color{blue}{response} \sim intercept + (slope * \color{red}{predictor})\]

\[\color{blue}{\hat{y}} = a + b\color{red}{x}\]

  • We call our dependent variable the response variable (or criterion)
  • Our independent variables are called predictors
  • The intercept and the slope are coefficients
    • These are the meat and potatoes of the linear model
    • They are also called parameter estimates

How do we know the line of best fit?

Bivariate regression

 x  y
 1  1
 2  2
 3  3
  • If we tried to predict y using just its mean (2), we would be right only once, at (2, 2)
  • We’d miss badly for (1, 1) and (3, 3)
  • The line of best fit is the one that minimizes the distance between the predicted values of y and the observed values of y (a quick check in code follows below)
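
A quick check of this in base R, using the three points from the table above:

y <- c(1, 2, 3)
mean(y)        # 2: correct only for the middle observation (2, 2)
y - mean(y)    # -1  0  1: we miss for (1, 1) and (3, 3)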

Bivariate regression

Measurement error

  • We rarely (never) find a perfectly linear relationship in our data
    • Most relationships are not perfectly linear
    • Our measurements are not perfect (VOT, formants, durations, RT)
    • There is error in everything (we typically assume it is normally distributed)
  • We account for error in our models

\[y_i = a + bx_i + \color{red}{\epsilon_i}\]

Bivariate regression

Measurement error

  • Again, we can see that the mean of y isn’t the best solution.
    • It can summarize the y data
    • It cannot explain the relationship between x and y.
  • We need a method to assess how well a given line fits the data
  • We need a method to determine the optimal intercept and slope (the line of best fit)

Bivariate regression

More on error

  • One way to do this is to calculate the ‘distance’ between the predicted value of y (\(\hat{y}\), “y-hat”) and the observed value of y (\(y_i\)).
  • If we add up this distance for each observation we can compare it to other lines (i.e., other distances).
  • The line that produces the shortest distance is the best.

Bivariate regression

More on error

  • The difference between \(\hat{y}\) and \(y_i\) is called prediction error
  • The total prediction error (TPE) is the sum of the prediction errors of all observations
  • This measure is not ideal because negative values cancel out the positive ones, so we square them
  • We call this the sum of squared errors (SSE); a worked example in code follows below
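
As a concrete example, the candidate line evaluated on the next slide predicts \(\hat{y}\) = (1.6, 2.7, 3.8) for our three observations; in base R:

x <- c(1, 2, 3)
y <- c(2, 1, 3)
y_hat <- c(1.6, 2.7, 3.8)   # predictions from one candidate line
y - y_hat                   # prediction errors:  0.4 -1.7 -0.8
sum(y - y_hat)              # TPE: -2.1
sum((y - y_hat)^2)          # SSE: 3.69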

Bivariate regression

More on error

 x_i   y_i   y_hat   pred.error   sqrd.error
   1     2     1.6          0.4         0.16
   2     1     2.7         -1.7         2.89
   3     3     3.8         -0.8         0.64
 Sum                       -2.1         3.69
  • This line is not the best fit for these data because it does not minimize the sum of the squares of the errors.

Bivariate regression

More on error

 x_i   y_i   y_hat   pred.error   sqrd.error
   1     2    0.75         1.25       1.5625
   2     1    1.25        -0.25       0.0625
   3     3    1.75         1.25       1.5625
 Sum                       2.25       3.1875
  • This line is not the best fit for these data because it does not minimize the sum of the squares of the errors.

Bivariate regression

More on error

 x_i   y_i   y_hat   pred.error   sqrd.error
   1     2     1.5          0.5         0.25
   2     1     2.0         -1.0         1.00
   3     3     2.5          0.5         0.25
 Sum                        0.0         1.50
  • This line represents a much better fit (1.50 is lower than the SSE of the other lines)
  • So how do we calculate the line that minimizes the sum of squared errors?

Bivariate regression

Least squares estimation


\[b = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}\]

\[a = \bar{y} - b\bar{x}\]

Bivariate regression

Least squares estimation

 obs    x   y   x - xbar   y - ybar   (x-xbar)(y-ybar)   (x-xbar)^2
 1      1   2         -1          0                  0            1
 2      2   1          0         -1                  0            0
 3      3   3          1          1                  1            1
 Sum    6   6                                        1            2
 Mean   2   2

Slope

\[b = \frac{\sum{\color{red}{(x_i - \bar{x})(y_i - \bar{y})}}}{\sum{\color{blue}{(x_i - \bar{x})^2}}}\]

\[b = \frac{\color{red}{1}}{\color{blue}{2}}\]

Intercept

\[a = \bar{y} - b\bar{x}\]

\[a = 2 - (0.5 \times 2)\]

\[a = 1\]

\[\hat{y} = 1 + 0.5x\]
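
These hand calculations can be verified in R; the least squares formulas and lm() give the same coefficients:

x <- c(1, 2, 3)
y <- c(2, 1, 3)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # 0.5
a <- mean(y) - b * mean(x)                                       # 1
coef(lm(y ~ x))   # (Intercept) = 1, slope for x = 0.5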

Bivariate regression

Bivariate regression

Technical stuff - Terminology review

  • The best fit line is determined using the ordinary least squares method
  • In other words, we try to find the best line that minimizes the distance between the predicted values (the line, \(\color{red}{\hat{y}}\)) and the observed data, \(\color{blue}{y_{i}}\)
  • The distances representing deviations from the regression line are called residuals
  • The best fit line is the one that minimizes the sum of squares of the errors (a measure of the variability around the best fit line)
  • We never measure anything perfectly… there is always error

Fitting a model

\[y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

  • Coefficients (parameter estimates)
    • \(\beta_0\): intercept
    • \(\beta_1\): slope
  • \(\epsilon\): error

Ex. mtcars

Get to know the dataset

  • Using the mtcars dataset, we can fit a model with the following variables:
    • response variable: mpg
    • predictor: wt

Fitting the model

  • We will fit the model using the linear equation…
    \[mpg_i = \beta_0 + \beta_1 wt_i + \epsilon_i\]
  • To do this in R, we use the lm() function
  • lm() will use the ordinary least squares method to obtain parameter estimates, i.e., β0 (intercept) and β1 (slope)
  • In essence, it will estimate the parameters we need to predict mpg for a given weight
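
In R this is a single call; the output on the next slide comes from summary():

mod <- lm(mpg ~ wt, data = mtcars)   # fit mpg as a function of weight
summary(mod)                         # coefficients, R-squared, etc.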

Bivariate regression

Interpreting model output

  • Intercept
  • Slope
  • R2

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
 -4.54  -2.36  -0.13   1.41   6.87 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    37.29       1.88    19.9   <2e-16 ***
wt             -5.34       0.56    -9.6    1e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3 on 30 degrees of freedom
Multiple R-squared:  0.75,  Adjusted R-squared:  0.74 
F-statistic:  91 on 1 and 30 DF,  p-value: 1.3e-10

Bivariate regression

Interpreting model output

  • Intercept
  • Slope
  • R2

 term          estimate   std.error   statistic   p.value
 (Intercept)      37.29        1.88       19.86      0.00
 wt               -5.34        0.56       -9.56      0.00
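
A coefficient table like the one above can be pulled directly from the fitted model; one way to do it (assuming the broom package, which may or may not be how this particular table was made) is:

library(broom)
mod <- lm(mpg ~ wt, data = mtcars)
tidy(mod)   # term, estimate, std.error, statistic, p.value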

Bivariate regression

Understanding slopes and intercepts

Bivariate regression

Same intercept, but different slopes

Bivariate regression

Positive and negative slope

Bivariate regression

Different intercepts, but same slopes

Bivariate regression

Interpreting model output

  • Intercept
  • Slope
  • R2

The coefficient of determination

  • An overall assessment of model fit
  • R2 is the proportion of the variance in the response explained by your model
  • Ranges from 0 to 1 (1 = 100% of the variance explained)
  • Literally calculated as r \(\times\) r = R2, where r is the correlation coefficient (note the uppercase)
  • But what does it mean to explain variance?
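
A quick check of the r \(\times\) r claim with the mtcars model from earlier:

mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)$r.squared          # ~0.75
cor(mtcars$mpg, mtcars$wt)^2    # the squared correlation gives the same value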

More about error

Bivariate regression

Variance and error deviation

  • To understand variance we have to think about deviance
  • In other words we have to think about the relationship between \(\color{red}{y_i}\), \(\color{blue}{\bar{y}}\), and \(\color{green}{\hat{y}}\)
    • \(\color{red}{y_i}\) = An observed, measured value of y
    • \(\color{blue}{\bar{y}}\) = The mean value of y
    • \(\color{green}{\hat{y}}\) = A value of y predicted by our model (along the regression line)

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\)

Ex. \(y_3\) = 3 - 2

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\)
Predicted deviation: \(\color{green}{\hat{y}} - \color{blue}{\bar{y}}\)

Ex. \(y_3\) = 2.5 - 2

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\)
Predicted deviation: \(\color{green}{\hat{y}} - \color{blue}{\bar{y}}\)
Error deviation: \(\color{red}{y_i} - \color{green}{\hat{y}}\)

Ex. \(y_3\) = 3 - 2.5

Bivariate regression

Variance and error deviation

  • Again we square these values so that positive values are not canceled out by the negative values
  • We calculate these deviations for all observations

\[SS_{total} = \sum (y_i - \bar{y})^2\]
\[SS_{predicted} = \sum (\hat{y}_i - \bar{y})^2\]
\[SS_{error} = \sum (y_i - \hat{y}_i)^2\]

\[SS_{total} = SS_{predicted} + SS_{error}\]

Error and R2

\[R^2 = \frac{SS_{predicted}}{SS_{total}}\]

…or r * r
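
As a sanity check, here is the same decomposition computed in R for the mtcars model:

mod   <- lm(mpg ~ wt, data = mtcars)
y     <- mtcars$mpg
y_hat <- fitted(mod)                            # predicted values along the line

ss_total     <- sum((y - mean(y))^2)
ss_predicted <- sum((y_hat - mean(y))^2)
ss_error     <- sum((y - y_hat)^2)

all.equal(ss_total, ss_predicted + ss_error)    # TRUE
ss_predicted / ss_total                         # ~0.75, the model's R-squared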

Summing up

Making predictions

  • Recall the linear model equation…

\[mpg_i = \beta_0 + \beta_1 wt_i + \epsilon_i\]

  • Our mtcars model can be summarized as…

\[mpg = 37.29 - 5.34 \times wt\]

  • What is the predicted mpg for a car that weighs 1 unit? And one that weighs 3? And 6?

\(mpg = 37.29 - 5.34 \times 1 \approx 31.94\)
\(mpg = 37.29 - 5.34 \times 3 \approx 21.25\)
\(mpg = 37.29 - 5.34 \times 6 \approx 5.22\)
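
The same predictions can be obtained from the fitted model with predict():

mod <- lm(mpg ~ wt, data = mtcars)
predict(mod, newdata = data.frame(wt = c(1, 3, 6)))
# approximately 31.94, 21.25, and 5.22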

References

Johnson, K. (2011). Patterns and tests. In K. Johnson (Ed.), Quantitative methods in linguistics (pp. 34–69). Wiley.
Lewis-Beck, M. (1980). Bivariate regression: Fitting a straight line. In M. Lewis-Beck (Ed.), Applied regression: An introduction (pp. 9–25). Newbury Park, CA: Sage.
Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.

Bonus
Riverside income and education (qass 22)

Explore riverside data

Taken from QASS 22 (Ch. 1, Table 2)

glimpse(riverside)
Rows: 32
Columns: 3
$ respondent <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ education  <dbl> 4, 4, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 11, 12, 12, 12, 12, …
$ income     <dbl> 6281, 10516, 6898, 8212, 11744, 8618, 10011, 12405, 14664, …
head(riverside)
  respondent education income
1          1         4   6281
2          2         4  10516
3          3         6   6898
4          4         6   8212
5          5         6  11744
6          6         8   8618

Fit model

First we’ll fit the model using lm().

mod <- lm(income ~ education, data = riverside)


Here is a table of the model summary:

 term          estimate   std.error   statistic     p.value
 (Intercept)   5077.512   1497.8302    3.389912   0.0019753
 education      732.400    117.5221    6.232021   0.0000007

Making predictions

What is the predicted income of somebody with 6/10/16 years of education in Riverside?

5077 + (732 * 6)
## [1] 9469


Do it in R:
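
A minimal sketch using predict() on the model fit above:

predict(mod, newdata = data.frame(education = c(6, 10, 16)))
# roughly 9472, 12402, and 16796, using the coefficients reported above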