Data Science for Linguists

The linear model: Bivariate regression

Joseph V. Casillas, PhD

Rutgers University
Spring 2025
Last update: 2025-02-23

The linear model

Overview

  • What it encompasses… a lot. It’s everywhere.
  • The linear model allows us to test for a linear relationship between 2 (or more) variables
  • It can be used to
    1. quantify the strength of a relationship
    2. predict

Some examples

  • weight ~ height
  • IQ ~ age
  • ice cream sold ~ temperature
  • RT ~ group
  • vocab size ~ age
  • vowel duration ~ stress
  • target fixations ~ grammatical gender
  • F1 ~ vowel

Note: We interpret the ~ as “as a function of”.

Linear algebra

Remember middle school?

  • It really is the same thing.
  • You probably saw something like this:

\[y = a + bx\]

  • …or maybe some other variation (e.g., y = mx + b). In y = a + bx, b is the slope and a is the value of y when x = 0, i.e., the y-intercept (the point where the line crosses the y-axis).
  • If you know all but one of the values, you can solve for the one that is missing… assume x = 2 in the equation below and solve for y.

\[y = 50 + 10x\]
\[y = 50 + 10 \times 2\]
\[y = 50 + 20\]
\[y = 70\]

Linear algebra

Cartesian coordinates

The linear model

Bivariate regression

Bivariate regression

  • The linear model is basically the same as linear algebra, with two subtle differences
    1. We fit the line through data points (measurements, observations)
    2. We use slightly different terminology

\[\color{blue}{response} \sim intercept + (slope * \color{red}{predictor})\]

\[\color{blue}{\hat{y}} = a + b\color{red}{x}\]

  • We call our dependent variable the response variable (or criterion)
  • Our independent variables are called predictors
  • The intercept and the slope are coefficients
    • These are the meat and potatoes of the linear model
    • They are also called parameter estimates

How do we know the line of best fit?

Bivariate regression

 x  y
 1  1
 2  2
 3  3
  • If we tried to predict y using just its mean (2), we would be right only once, at (2, 2)
  • We’d miss badly for (1, 1) and (3, 3)
  • The line of best fit is the one that minimizes the distance between the predicted values of y and the observed values of y (a quick check in code follows below)
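
A quick check of this in base R, using the three points from the table above:

y <- c(1, 2, 3)
mean(y)        # 2: correct only for the middle observation (2, 2)
y - mean(y)    # -1  0  1: we miss for (1, 1) and (3, 3)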

Bivariate regression

Measurement error

  • We rarely (never) find a perfectly linear relationship in our data
    • Most relationships are not perfectly linear
    • Our measurements are not perfect (VOT, formants, durations, RT)
    • There is error in everything (we typically assume it is normally distributed)
  • We account for error in our models

\[y_i = a + bx_i + \color{red}{\epsilon_i}\]

Bivariate regression

Measurement error

  • Again, we can see that the mean of y isn’t the best solution.
    • It can summarize the y data
    • It cannot explain the relationship between x and y.
  • We need a method to assess how well a given line fits the data
  • We need a method to determine the optimal intercept and slope (the line of best fit)

Bivariate regression

More on error

  • One way to do this is to calculate the ‘distance’ between the predicted value of y (\(\hat{y}\), “y-hat”) and the observed value of y (\(y_i\)).
  • If we add up this distance for each observation we can compare it to other lines (i.e., other distances).
  • The line that produces the shortest distance is the best.

Bivariate regression

More on error

  • The difference between \(\hat{y}\) and \(y_i\) is called prediction error
  • The total prediction error (TPE) is the sum of the prediction errors of all observations
  • This measure is not ideal because negative values cancel out the positive ones, so we square them
  • We call this the sum of squared errors (SSE); a worked example in code follows below
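
As a concrete example, the candidate line evaluated on the next slide predicts \(\hat{y}\) = (1.6, 2.7, 3.8) for our three observations; in base R:

x <- c(1, 2, 3)
y <- c(2, 1, 3)
y_hat <- c(1.6, 2.7, 3.8)   # predictions from one candidate line
y - y_hat                   # prediction errors:  0.4 -1.7 -0.8
sum(y - y_hat)              # TPE: -2.1
sum((y - y_hat)^2)          # SSE: 3.69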

Bivariate regression

More on error

 x_i   y_i   y_hat   pred.error   sqrd.error
   1     2     1.6          0.4         0.16
   2     1     2.7         -1.7         2.89
   3     3     3.8         -0.8         0.64
 Sum                       -2.1         3.69
  • This line is not the best fit for these data because it does not minimize the sum of the squares of the errors.

Bivariate regression

More on error

 x_i   y_i   y_hat   pred.error   sqrd.error
   1     2    0.75         1.25       1.5625
   2     1    1.25        -0.25       0.0625
   3     3    1.75         1.25       1.5625
 Sum                       2.25       3.1875
  • This line is not the best fit for these data because it does not minimize the sum of the squares of the errors.

Bivariate regression

More on error

 x_i   y_i   y_hat   pred.error   sqrd.error
   1     2     1.5          0.5         0.25
   2     1     2.0         -1.0         1.00
   3     3     2.5          0.5         0.25
 Sum                        0.0         1.50
  • This line represents a much better fit (1.50 is lower than the SSE of the other lines)
  • So how do we calculate the line that minimizes the sum of squared errors?

Bivariate regression

Least squares estimation


\[b = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sum{(x_i - \bar{x})^2}}\]

\[a = \bar{y} - b\bar{x}\]

Bivariate regression

Least squares estimation

 obs    x   y   x - xbar   y - ybar   (x-xbar)(y-ybar)   (x-xbar)^2
 1      1   2         -1          0                  0            1
 2      2   1          0         -1                  0            0
 3      3   3          1          1                  1            1
 Sum    6   6                                        1            2
 Mean   2   2

Slope

\[b = \frac{\sum{\color{red}{(x_i - \bar{x})(y_i - \bar{y})}}}{\sum{\color{blue}{(x_i - \bar{x})^2}}}\]

\[b = \frac{\color{red}{1}}{\color{blue}{2}}\]

Intercept

\[a = \bar{y} - b\bar{x}\]

\[a = 2 - (0.5 \times 2)\]

\[a = 1\]

\[\hat{y} = 1 + 0.5x\]
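
These hand calculations can be verified in R; the least squares formulas and lm() give the same coefficients:

x <- c(1, 2, 3)
y <- c(2, 1, 3)
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)   # 0.5
a <- mean(y) - b * mean(x)                                       # 1
coef(lm(y ~ x))   # (Intercept) = 1, slope for x = 0.5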

Bivariate regression

Bivariate regression

Technical stuff - Terminology review

  • The best fit line is determined using the ordinary least squares method
  • In other words, we try to find the best line that minimizes the distance between the predicted values (the line, \(\color{red}{\hat{y}}\)) and the observed data, \(\color{blue}{y_{i}}\)
  • The distances representing deviations from the regression line are called residuals
  • The best fit line is the one that minimizes the sum of squares of the errors (a measure of the variability around the best fit line)
  • We never measure anything perfectly… there is always error

Fitting a model

\[y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]

  • Coefficients (parameter estimates)
    • \(\beta_0\): intercept
    • \(\beta_1\): slope
  • \(\epsilon\): error

Ex. mtcars

Get to know the dataset

  • Using the mtcars dataset, we can fit a model with the following variables:
    • response variable: mpg
    • predictor: wt

Fitting the model

  • We will fit the model using the linear equation…
    \[mpg_i = \beta_0 + \beta_1 wt_i + \epsilon_i\]
  • To do this in R, we use the lm() function
  • lm() will use the ordinary least squares method to obtain parameter estimates, i.e., β0 (intercept) and β1 (slope)
  • In essence, it will estimate the parameters we need to predict mpg for a given weight
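
In R this is a single call; the output on the next slide comes from summary():

mod <- lm(mpg ~ wt, data = mtcars)   # fit mpg as a function of weight
summary(mod)                         # coefficients, R-squared, etc.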

Bivariate regression

Interpreting model output

  • Intercept
  • Slope
  • R2

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
 -4.54  -2.36  -0.13   1.41   6.87 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    37.29       1.88    19.9   <2e-16 ***
wt             -5.34       0.56    -9.6    1e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3 on 30 degrees of freedom
Multiple R-squared:  0.75,  Adjusted R-squared:  0.74 
F-statistic:  91 on 1 and 30 DF,  p-value: 1.3e-10

Bivariate regression

Interpreting model output

  • Intercept
  • Slope
  • R2

 term          estimate   std.error   statistic   p.value
 (Intercept)      37.29        1.88       19.86      0.00
 wt               -5.34        0.56       -9.56      0.00
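
A coefficient table like the one above can be pulled directly from the fitted model; one way to do it (assuming the broom package, which may or may not be how this particular table was made) is:

library(broom)
mod <- lm(mpg ~ wt, data = mtcars)
tidy(mod)   # term, estimate, std.error, statistic, p.value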

Bivariate regression

Understanding slopes and intercepts

Bivariate regression

Same intercept, but different slopes

Bivariate regression

Positive and negative slope

Bivariate regression

Different intercepts, but same slopes

Bivariate regression

Interpreting model output

  • Intercept
  • Slope
  • R2

The coefficient of determination

  • An overall assessment of model fit
  • R2 is the proportion of the variance in the response explained by your model
  • Ranges from 0 to 1 (1 = 100% of the variance explained)
  • Literally calculated as r \(\times\) r = R2, where r is the correlation coefficient (note the uppercase)
  • But what does it mean to explain variance?
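
A quick check of the r \(\times\) r claim with the mtcars model from earlier:

mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)$r.squared          # ~0.75
cor(mtcars$mpg, mtcars$wt)^2    # the squared correlation gives the same value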

More about error

Bivariate regression

Variance and error deviation

  • To understand variance we have to think about deviance
  • In other words we have to think about the relationship between \(\color{red}{y_i}\), \(\color{blue}{\bar{y}}\), and \(\color{green}{\hat{y}}\)
    • \(\color{red}{y_i}\) = An observed, measured value of y
    • \(\color{blue}{\bar{y}}\) = The mean value of y
    • \(\color{green}{\hat{y}}\) = A value of y predicted by our model (along the regression line)

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\)

Ex. \(y_3\) = 3 - 2

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\)
Predicted deviation: \(\color{green}{\hat{y}} - \color{blue}{\bar{y}}\)

Ex. \(y_3\) = 2.5 - 2

Bivariate regression

Variance and error deviation


\(\color{red}{y_i}\) = An observed, measured value of y
\(\color{blue}{\bar{y}}\) = The mean value of y
\(\color{green}{\hat{y}}\) = A value of y predicted by our model

Total deviation: \(\color{red}{y_i} - \color{blue}{\bar{y}}\)
Predicted deviation: \(\color{green}{\hat{y}} - \color{blue}{\bar{y}}\)
Error deviation: \(\color{red}{y_i} - \color{green}{\hat{y}}\)

Ex. \(y_3\) = 3 - 2.5

Bivariate regression

Variance and error deviation

  • Again we square these values so that positive values are not canceled out by the negative values
  • We calculate these deviations for all observations

\[SS_{total} = \sum (y_i - \bar{y})^2\]
\[SS_{predicted} = \sum (\hat{y}_i - \bar{y})^2\]
\[SS_{error} = \sum (y_i - \hat{y}_i)^2\]

\[SS_{total} = SS_{predicted} + SS_{error}\]

Error and R2

\[R^2 = \frac{SS_{predicted}}{SS_{total}}\]

…or r * r
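
As a sanity check, here is the same decomposition computed in R for the mtcars model:

mod   <- lm(mpg ~ wt, data = mtcars)
y     <- mtcars$mpg
y_hat <- fitted(mod)                            # predicted values along the line

ss_total     <- sum((y - mean(y))^2)
ss_predicted <- sum((y_hat - mean(y))^2)
ss_error     <- sum((y - y_hat)^2)

all.equal(ss_total, ss_predicted + ss_error)    # TRUE
ss_predicted / ss_total                         # ~0.75, the model's R-squared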

Summing up

Making predictions

  • Recall the linear model equation…

\[mpg_i = \beta_0 + \beta_1 wt_i + \epsilon_i\]

  • Our mtcars model can be summarized as…

\[mpg = 37.29 - 5.34 \times wt\]

  • What is the predicted mpg for a car that weighs 1 unit? And one that weighs 3? And 6?

\(mpg = 37.29 - 5.34 \times 1 \approx 31.94\)
\(mpg = 37.29 - 5.34 \times 3 \approx 21.25\)
\(mpg = 37.29 - 5.34 \times 6 \approx 5.22\)
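
The same predictions can be obtained from the fitted model with predict():

mod <- lm(mpg ~ wt, data = mtcars)
predict(mod, newdata = data.frame(wt = c(1, 3, 6)))
# approximately 31.94, 21.25, and 5.22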

References

Johnson, K. (2011). Patterns and tests. In K. Johnson (Ed.), Quantitative methods in linguistics (pp. 34–69). Wiley.
Lewis-Beck, M. (1980). Bivariate regression: Fitting a straight line. In M. Lewis-Beck (Ed.), Applied regression: An introduction (pp. 9–25). Newbury Park, CA: Sage.
Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.

Bonus
Riverside income and education (qass 22)

Explore riverside data

Taken from QASS 22 (Ch. 1, Table 2)

glimpse(riverside)
Rows: 32
Columns: 3
$ respondent <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
$ education  <dbl> 4, 4, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 11, 12, 12, 12, 12, …
$ income     <dbl> 6281, 10516, 6898, 8212, 11744, 8618, 10011, 12405, 14664, …
head(riverside)
  respondent education income
1          1         4   6281
2          2         4  10516
3          3         6   6898
4          4         6   8212
5          5         6  11744
6          6         8   8618

Fit model

First we’ll fit the model using lm().

mod <- lm(income ~ education, data = riverside)


Here is a table of the model summary:

 term          estimate   std.error   statistic     p.value
 (Intercept)   5077.512   1497.8302    3.389912   0.0019753
 education      732.400    117.5221    6.232021   0.0000007

Making predictions

What is the predicted income of somebody with 6/10/16 years of education in Riverside?

5077 + (732 * 6)
## [1] 9469


Do it in R:
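
A minimal sketch using predict() on the model fit above:

predict(mod, newdata = data.frame(education = c(6, 10, 16)))
# roughly 9472, 12402, and 16796, using the coefficients reported above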