Data Science for Linguists

Multilevel models

Joseph V. Casillas, PhD

Rutgers University
Spring 2025
Last update: 2025-04-18


Next steps

What we’ve seen

  • MRC
  • Linear regression
  • General linear model
  • Generalized linear model

What about repeated measures designs?

  • Whenever we have more than one data point from the same participant, we are dealing with a type of repeated measures design
    • between-subjects factors
    • within-subjects factors
  • Everything we have done this semester has been under the assumption that we have one data point per participant (i.e., no within-subjects factors)

  • This is because one of the assumptions of our models was that there was no autocorrelation, i.e., that the data were independent

  • Repeated measures designs introduce autocorrelation into the model. Why?

What’s the big deal?

  • Disregarding the lack of independence = pseudoreplication

  • In other words, treating the data as though they were independent replications when they aren’t

  • This inflates your degrees of freedom, biases your parameter estimates and standard errors, and makes your p-values meaningless

What does a repeated measures design look like?

  • This dataset contains the weights of 50 different chicks at 12 different time points
  • Notice that the first 12 rows come from the same chick (chick 1)
  • In other words, Time can be thought of as a continuous time-series variable (within-subjects) that would violate our assumption of independence
  • Diet, on the other hand, is a between-subjects factor (you can’t be on more than one diet at a time)
  • Notice that both Time and Diet appear in this dataset as numeric values.
  • You have to think about your data to make sure you understand what type of variables you have.
weight Time Chick Diet
42 0 1 1
51 2 1 1
59 4 1 1
64 6 1 1
76 8 1 1
93 10 1 1
106 12 1 1
125 14 1 1
149 16 1 1
171 18 1 1
199 20 1 1
205 21 1 1
40 0 2 1
49 2 2 1
58 4 2 1
72 6 2 1
84 8 2 1
103 10 2 1
122 12 2 1
138 14 2 1
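The printout above matches R’s built-in ChickWeight dataset, so you can check the structure yourself. A minimal sketch (note that in the version shipped with R, Chick and Diet are stored as factors, so verify the types rather than assuming them):

str(ChickWeight)                # how is each variable stored?
head(ChickWeight, n = 12)       # the 12 measurements for chick 1
with(ChickWeight, table(Diet))  # rows per diet (a between-subjects factor)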

What can we do?

  • Currently, the most popular solution is to use a multilevel or hierarchical model
  • These models are commonly referred to as mixed effects models
  • They include both “fixed” effects and “random” effects
  • They provide a flexible framework that allows us to account for the hierarchical structure of our data (within-subjects variables, time-series data, etc.) and provide partial pooling estimates
  • Because they build directly on the linear model, the basic idea of mixed effects models follows naturally from what we already know

How does it work?

Standard linear model

  • Recall our formula for fitting a linear model

\[ \begin{align} y_{i} & \sim Normal(\mu_{i}, \sigma) \\ \mu_{i} & = \alpha + \beta_{1} x_{i} \\ \epsilon_{i} & \sim Normal(0, \sigma^{2}) \end{align} \]

  • We fit these models in R using lm() or glm():
    lm(criterion ~ predictor, data = my_data)
  • In a mixed effects model, what we know as predictors are called fixed effects
  • The novel aspect of a mixed effects model is that it also includes random effects
  • Random effects allow us to account for non-independence

How does it work?

Mixed effects model

  • A mixed effects model mixes both fixed effects and random effects

\[ \hat{y} = \alpha + \color{red}{\beta}\color{blue}{X} + \color{green}{u}\color{purple}{Z} + \epsilon \]

\[ response = intercept + \color{red}{slope} \times \color{blue}{FE} + \color{green}{u} \times \color{purple}{RE} + error \]

  • We fit these models in R using lmer() or glmer() from the lme4 package (there are other options as well)
lmer(criterion ~ fixed_effect + (1|random_effect), data = my_data)
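As a concrete sketch, the chick data from earlier could be fit with a by-chick random intercept (this specification is illustrative; choosing the random effects structure is discussed below):

# load lme4 for mixed effects models
library(lme4)

# weight as a function of time, with a separate intercept for each chick
m_chick <- lmer(weight ~ Time + (1 | Chick), data = ChickWeight)
summary(m_chick)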

What is a random effect?

  • This is actually a really nuanced question and nobody has a good answer
  • Some people use descriptions like the following:

fixed                       random
--------------------------  --------------------------
repeatable                  non-repeatable
systematic influence        random influence
exhausts the population     samples the population
generally of interest       often not of interest
continuous or categorical   must be categorical

  • …but you will always find exceptions to this
  • Most of the time you will see subjects and items as random effects (things that are repeated in the design)
  • This typically implies that the model includes random intercepts and/or random slopes
  • Instead of trying to define random effects, let’s try to understand what they do (and come back to the idea later)

subjects time response
p_01 1 20.00
p_01 2 21.18
p_01 3 18.93
p_01 4 8.55
p_01 5 26.20
p_01 6 24.15
p_01 7 25.55
p_01 8 19.66
p_01 9 27.61
p_01 10 34.99
p_01 11 42.63
p_01 12 35.87
p_01 13 40.81
p_01 14 40.16
p_01 15 34.22
p_01 16 41.99
p_01 17 44.12
p_01 18 48.13
p_01 19 57.08
p_01 20 57.06
p_02 1 23.22
p_02 2 24.68
p_02 3 32.39
p_02 4 33.54


  • The dataframe includes values of the response variable at 20 different time points for each participant (n = 10)
  • We could model this as:
lm(response ~ time, data = my_df)

\[ \begin{align} response_{i} & \sim Normal(\mu_{i}, \sigma) \\ \mu_{i} & = \alpha + \beta_{1} time_{i} \\ \epsilon_{i} & \sim Normal(0, \sigma^{2}) \end{align} \]

  • But this would violate our assumption of independence (and would produce a terribly misspecified model)
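For reference, data with this structure can be simulated by giving each participant its own intercept and slope, in the spirit of DeBruine & Barr (2021). All parameter values below are made up for illustration:

set.seed(1)
n_subj <- 10  # participants
n_time <- 20  # time points per participant

# by-subject deviations from a population intercept (20) and slope (2)
int_dev   <- rnorm(n_subj, mean = 0, sd = 5)
slope_dev <- rnorm(n_subj, mean = 0, sd = 0.5)

my_df <- expand.grid(time = 1:n_time,
                     subjects = sprintf("p_%02d", 1:n_subj))
i <- as.integer(factor(my_df$subjects))
my_df$response <- (20 + int_dev[i]) + (2 + slope_dev[i]) * my_df$time +
  rnorm(nrow(my_df), mean = 0, sd = 4)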

Let’s include subjects as a random effect


  • We will do this by giving subjects a random intercept
  • In other words, we will allow the intercepts to vary for each participant (as opposed to modeling just 1 intercept)
lmer(response ~ time + (1|subjects), data = my_df)
  • What would this look like?


\[ \begin{align} response_{it} & \sim Normal(\mu_{it}, \sigma) \\ \mu_{it} & = \alpha + \alpha_{subject_{i}} + \beta_{1} time_{t} \\ \alpha_{subject} & \sim Normal(0, \sigma_{subject}) \end{align} \]
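Fitting this model to the simulated data above and inspecting the per-subject estimates (a sketch; coef() combines the fixed effects with each subject’s random intercept):

library(lme4)

# a random intercept for each subject, one shared slope for time
m_ri <- lmer(response ~ time + (1 | subjects), data = my_df)
coef(m_ri)$subjects  # one intercept per subject, identical slopes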

What did we do?

  • We are taking into account the idiosyncratic differences associated with each individual subject
  • By giving each subject its own intercept we are informing the model that each individual has a different starting point in the time course (when time = 0)
  • In general this makes sense because some people…
    • are faster/slower responders
    • speak faster/slower
    • have higher/lower pitched voices

  • We can also take into account the fact that some people…
    • slow down during an experiment
    • respond more/less accurately over time
    • learn at different rates

What did we do?

  • The model now also allows the effect of time to vary for each individual
  • This means we have included a random slope for each participant
  • By adding a random slope for the effect of time, we take into account the fact that responses change at a different rate for each individual
  • Under the hood, the model uses this information to calculate the best fit line for all of the data
  • This method is called partial pooling and represents one of the most important (and least understood) aspects of mixed effects modeling

lmer(response ~ time + (1 + time|subjects), data = my_df)
  • Again, (1 + time|subjects) represents the random effects structure of the model
  • The grouping factor (here subjects) goes to the right of |
  • Everything to the left of | varies by that grouping factor: the 1 gives each subject a random intercept, and any predictor (here time) gets a random slope
  • Thus, (1 + time|subjects) means by-subject random intercepts and random slopes for the effect of time
  • It is rarely a good idea to only use random intercepts
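Continuing the sketch from above, here is the random slope version fit to the simulated data, with a likelihood ratio comparison against the intercept-only random structure (anova() refits both models with ML before comparing):

# by-subject random intercepts and random slopes for time
m_rs <- lmer(response ~ time + (1 + time | subjects), data = my_df)
ranef(m_rs)$subjects  # per-subject deviations for intercept and slope
anova(m_ri, m_rs)     # does adding random slopes improve fit?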



Great news!

  • We have spent the entire semester building up our knowledge of the linear model so that we would be prepared to understand mixed effects models
  • Why? They are the standard in speech sciences now

  • Everything that you have learned in this class applies to a mixed effects model
    • model interpretation
    • nested model comparisons
    • treatment of categorical factors
    • centering, standardizing, and other transformations

  • What about GLMs?
    • Them too!
    • You can fit mixed effects models for count data and binary outcomes using glmer()
    • All you have to do is specify the distribution and the link function

Multilevel logistic regression model

glmer(
  formula = response ~ fixed_effect + 
    (1 + fixed_effect | participant), 
  family = binomial(link = "logit"), 
  data = DATA
)


Multilevel Poisson regression model

glmer(
  formula = counts   ~ fixed_effect + 
    (1 + fixed_effect | participant), 
  family = poisson(link = "log"), 
  data = DATA
)

Understanding partial pooling

Based on Bates (2011) and Mahr (2017) with some help from McElreath (2015)

What do you mean by ‘multiple levels’?

We can think of multilevel models as “models within models”


Generally:

  • One level estimates observed groups/items/individuals/etc.
  • Another level estimates populations of groups/individuals

For this reason, some researchers (myself included) prefer to refer to grouping-level effects and population-level effects, as opposed to random effects and fixed effects


Under this view, random intercepts and random slopes are conceptualized as varying intercepts and varying slopes

So what is ‘pooling’ all about?

  • Some describe multilevel models as ‘models with memory’
    • The population-level model helps inform/learn about the grouping-level
    • They learn faster, better
    • They are robust to overfitting
  • It has to do with how information is ‘pooled’
    1. Complete pooling
    2. No pooling
    3. Partial pooling

Partial pooling

sleepstudy dataset

  • Part of lme4 package
  • Criterion = reaction time (Reaction)
  • Predictor = # of days of sleep deprivation (Days)
  • 10 observations per participant
  • +2 fake participants w/ incomplete data
tibble [183 × 3] (S3: tbl_df/tbl/data.frame)
 $ Reaction: num [1:183] 250 259 251 321 357 ...
 $ Days    : num [1:183] 0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : chr [1:183] "308" "308" "308" "308" ...
Reaction Days Subject
249.5600 0 308
258.7047 1 308
250.8006 2 308
321.4398 3 308
356.8519 4 308
414.6901 5 308
382.2038 6 308
290.1486 7 308
430.5853 8 308
466.3535 9 308
222.7339 0 309
205.2658 1 309

No pooling

Model Subject Intercept Slope_Days
No pooling 308 244.1927 21.764702
No pooling 309 205.0549 2.261785
No pooling 310 203.4842 6.114899
No pooling 330 289.6851 3.008073
No pooling 331 285.7390 5.266019
No pooling 332 264.2516 9.566768
No pooling 333 275.0191 9.142046
No pooling 334 240.1629 12.253141
No pooling 335 263.0347 -2.881034
No pooling 337 290.1041 19.025974
No pooling 349 215.1118 13.493933
No pooling 350 225.8346 19.504017
No pooling 351 261.1470 6.433498
No pooling 352 276.3721 13.566549
No pooling 369 254.9681 11.348109
No pooling 370 210.4491 18.056151
No pooling 371 253.6360 9.188445
No pooling 372 267.0448 11.298073
No pooling 374 286.0000 2.000000
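Per-subject estimates like these can be reproduced by fitting a separate regression to each subject, e.g., with lme4::lmList (a sketch; df_sleep is the augmented sleepstudy data described above, and the fake subject with a single observation drops out because a slope cannot be estimated from one point):

library(lme4)

# one lm() per subject; no information is shared across subjects
no_pool <- lmList(Reaction ~ Days | Subject, data = df_sleep)
coef(no_pool)  # per-subject intercepts and slopes, as in the table above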

Complete pooling


Call:
lm(formula = Reaction ~ Days, data = df_sleep)

Residuals:
     Min       1Q   Median       3Q      Max 
-110.646  -27.951    1.829   26.388  139.875 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  252.321      6.406  39.389  < 2e-16 ***
Days          10.328      1.210   8.537 5.48e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.43 on 181 degrees of freedom
Multiple R-squared:  0.2871,    Adjusted R-squared:  0.2831 
F-statistic: 72.88 on 1 and 181 DF,  p-value: 5.484e-15

Complete pooling

Model Subject Intercept Slope_Days
Complete pooling 308 252.3207 10.32766
Complete pooling 309 252.3207 10.32766
Complete pooling 310 252.3207 10.32766
Complete pooling 330 252.3207 10.32766
Complete pooling 331 252.3207 10.32766
Complete pooling 332 252.3207 10.32766
Complete pooling 333 252.3207 10.32766
Complete pooling 334 252.3207 10.32766
Complete pooling 335 252.3207 10.32766
Complete pooling 337 252.3207 10.32766
Complete pooling 349 252.3207 10.32766
Complete pooling 350 252.3207 10.32766
Complete pooling 351 252.3207 10.32766
Complete pooling 352 252.3207 10.32766
Complete pooling 369 252.3207 10.32766
Complete pooling 370 252.3207 10.32766
Complete pooling 371 252.3207 10.32766
Complete pooling 372 252.3207 10.32766
Complete pooling 374 252.3207 10.32766
Complete pooling 373 252.3207 10.32766

Can we improve estimates with partial pooling?

Reaction ~ 1 + Days + (1 + Days | Subject)


  • We allow the effect of Days to vary for each Subject
  • By-subject random intercepts with random slope for Days
lmer(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep)

Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
   Data: df_sleep

REML criterion at convergence: 1771.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9707 -0.4703  0.0276  0.4594  5.2009 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 582.72   24.140       
          Days         35.03    5.919   0.07
 Residual             649.36   25.483       
Number of obs: 183, groups:  Subject, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)  252.543      6.433  39.257
Days          10.452      1.542   6.778

Correlation of Fixed Effects:
     (Intr)
Days -0.137
  • Most of this will look familiar
  • Fixed effects estimates (and interpretations) work just like lm, glm, etc.
  • What’s new? The random effects estimates

Partial pooling

Subject Intercept Slope_Days Model
308 253.9478 19.6264337 Partial pooling
309 211.7331 1.7319161 Partial pooling
310 213.1582 4.9061511 Partial pooling
330 275.1425 5.6436007 Partial pooling
331 273.7286 7.3862730 Partial pooling
332 260.6504 10.1632571 Partial pooling
333 268.3683 10.2246059 Partial pooling
334 244.5524 11.4837802 Partial pooling
335 251.3702 -0.3355788 Partial pooling
337 286.2319 19.1090424 Partial pooling
349 226.7663 11.5531844 Partial pooling
350 238.7807 17.0156827 Partial pooling
351 256.2344 7.4119456 Partial pooling
352 272.3511 13.9920878 Partial pooling
369 254.9484 11.2985770 Partial pooling
370 226.3701 15.2027877 Partial pooling
371 252.5051 9.4335409 Partial pooling
372 263.8916 11.7253429 Partial pooling
373 248.9753 10.3915288 Partial pooling
374 271.1450 11.0782516 Partial pooling

  • Multilevel models use the grouping variables to learn about the nested structure of the data
  • The model can use this information to make educated guesses when there is incomplete information
  • The model takes advantage of partial pooling, producing shrinkage
  • This builds skepticism into the model with regard to extreme values
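The partially pooled estimates in the table above can be recovered directly from the fitted model; coef() adds each subject’s random effect deviations to the population-level estimates (a sketch):

m_pp <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep)
fixef(m_pp)         # population-level (fixed) intercept and slope
coef(m_pp)$Subject  # per-subject, partially pooled intercepts and slopes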

References

Bates, D. (2011). Mixed models in R using the lme4 package part 2: Longitudinal data, modeling interactions. http://lme4.r-forge.r-project.org/slides/2011-03-16-Amsterdam/2Longitudinal.pdf
Brown, V. A. (2021). An introduction to linear mixed-effects modeling in R. Advances in Methods and Practices in Psychological Science, 4(1), 1–19. https://doi.org/10.1177/2515245920960351
DeBruine, L. M., & Barr, D. J. (2021). Understanding mixed-effects models through data simulation. Advances in Methods and Practices in Psychological Science, 4(1), 1–13. https://doi.org/10.1177/2515245920965119
Mahr, T. J. (2017). Plotting partial pooling in mixed-effects models. https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/
Mahr, T. J. (2019). Another mixed effects model visualization. https://www.tjmahr.com/another-mixed-effects-model-visualization/
McElreath, R. (2015). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.