Data Science for Linguists

Bayesian inference

Joseph V. Casillas, PhD

Rutgers University
Spring 2025
Last update: 2025-04-25

What is probability?

The world according to frequentists

What is frequentism?

  • A blanket term used to refer to “classical” statistics
  • Population parameters are fixed, actually exist
  • Probability refers to the long-run frequency of a given event
  • A sample of data is the result of one of an infinite number of exactly repeated experiments
  • Data are random, result from sampling a fixed population distribution

What if…

There isn’t one true population parameter…
but an entire distribution of parameters, some more plausible than others

Bayesian inference
is all about the posterior

Applied to regression

Classical: There is a single “true” line of best fit, and I’ll give my best estimate of it.

Bayesian: There is a distribution of lines of best fit… some more plausible than others… and I’ll give you samples from that distribution.

Distributions of plausible outcomes

  • In Bayesian inference we attempt to approximate this (these) distribution(s)… we call it the posterior distribution
  • Bayesian inference is all about the posterior
  • It represents a distribution of estimates that could arise from the data
  • Values that are more common (appear more often) are more plausible
  • For very simple problems we can calculate the posterior analytically by integrating (calculus)
  • For most problems (regression) we cannot calculate the posterior analytically… so we approximate it via sampling (and calculus); a minimal sketch follows
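
To make this concrete, here is a minimal grid-approximation sketch in R (my illustration, not course code): for a model this simple we can score every candidate value of the mean against the data and normalize, which is the job a sampler does for us in harder problems.

# Grid approximation of the posterior for the mean of mtcars$mpg,
# holding sigma fixed at the sample SD and using a flat prior on mu
y       <- mtcars$mpg
mu      <- seq(15, 25, length.out = 1000)    # grid of candidate means
log_lik <- sapply(mu, function(m) sum(dnorm(y, mean = m, sd = sd(y), log = TRUE)))
prior   <- rep(1, length(mu))                # flat prior over the grid
post    <- exp(log_lik - max(log_lik)) * prior
post    <- post / sum(post)                  # normalize so it sums to 1
mu[which.max(post)]                          # most plausible mean, approx. 20.09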



A trivial example

More cars

  • Let’s explore the mpg variable
  • N = 32
  • The range is 10.4–33.9.
  • The mean is 20.09.
  • The SD is 6.03.
  • The 95% quantiles are 10.4 and 32.74.
  • We’ll fit an intercept-only model.
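
These summaries are easy to reproduce with base R (a quick sketch):

# Summary statistics for mtcars$mpg
length(mtcars$mpg)                      # N = 32
range(mtcars$mpg)                       # 10.4 33.9
mean(mtcars$mpg)                        # 20.09
sd(mtcars$mpg)                          # 6.03
quantile(mtcars$mpg, c(0.025, 0.975))   # 10.4, 32.74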


A trivial example

Frequentist model

We’ll fit an intercept-only model:

lm(mpg ~ 1)

\[ \begin{align} mpg_{i} & \sim N(\mu_{i}, \sigma) \\ \mu_{i} & = \alpha \end{align} \]

The mean mpg is 20.09 ±6.03 SD.



lm(mpg ~ 1, data = mtcars) |> summary()

Call:
lm(formula = mpg ~ 1, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6906 -4.6656 -0.8906  2.7094 13.8094 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.091      1.065   18.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.027 on 31 degrees of freedom

A trivial example

Bayesian model

We’ll fit an intercept-only model:

brm(mpg ~ 1)

\[ \begin{align} mpg_{i} & \sim N(\mu_{i}, \sigma) \\ \mu_{i} & = \alpha \end{align} \]

The mean mpg is 20.09 ±6.03 SD.


brm(mpg ~ 1, data = mtcars) |> summary()
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: mpg ~ 1 
   Data: mtcars (Number of observations: 32) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    20.06      1.04    18.06    22.25 1.00     2881     2280

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     6.14      0.79     4.84     7.87 1.00     3106     2113

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

A trivial example

Let’s compare

lm(mpg ~ 1, data = mtcars)
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 20.09062   1.065424 18.85693 1.526151e-18
brm(mpg ~ 1, data = mtcars)
           Estimate Est.Error      Q2.5     Q97.5
Intercept 20.062135  1.042972 18.055954 22.252435
Sigma      6.135967  0.790000  4.839691  7.872904


  • Recall that the mean mpg is 20.09 ±6.03 SD
  • The Bayesian model estimates an additional parameter… which?
  • The Bayesian estimates are slightly different (but really close)… why?

So what’s the difference?

Exploring the posterior

cp <- as_tibble(cars_bayes_int)
b_Intercept    sigma
   21.93103 5.183423
   21.46296 5.282083
   19.08187 6.228794
   20.53519 5.165951
   19.68760 4.864501
   20.81836 8.179158
   19.45889 5.133981
   19.96603 5.367908
   20.35175 6.781768
   19.92893 5.826664
   20.19183 6.904526
   21.06296 6.069348
   19.70825 6.274392
   20.85624 6.445802
   21.56705 5.920678
  • This posterior distribution has 4000 draws of mean and SD values that are compatible with our data
nrow(cp)
[1] 4000
  • The posterior is just like any other distribution of data
  • We can analyze/summarize it however we want
mean(cp$b_Intercept)
[1] 20.06213
quantile(cp$b_Intercept, probs = c(0.025, 0.975), names = F)
[1] 18.05595 22.25243
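
Because the posterior is just data, any question we can phrase as a summary of the draws is fair game. For example (reusing cp from above):

# Posterior probability that the mean mpg exceeds 21
mean(cp$b_Intercept > 21)

# 95% quantile interval for sigma
quantile(cp$sigma, probs = c(0.025, 0.975), names = FALSE)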

How do we get a posterior?

Let’s talk about probability

\[ P(\color{green}{y}) \]

“the probability of y”
“the probability of you getting an A in stats”

\[ P(\color{green}{y} | \color{red}{z} ) \]

“the probability of y given z”
“the probability of you getting an A in stats given you don’t do the readings”

How do we get a posterior?


\[ P(A | B) = \frac{P(B | A) \times P(A)}{P(B)} \]

The probability of A given B is equal to the probability of B given A times the probability of A divided by the probability of B

You’ll probably never actually use this, even though most texts on Bayesian inference start with this and base rate fallacy examples
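
For completeness, this is the kind of base-rate computation those texts have in mind (the rates here are hypothetical):

# P(disease | positive test) via Bayes' theorem
p_d     <- 0.01                     # P(A): prevalence of the disease
p_pos_d <- 0.95                     # P(B|A): test sensitivity
p_pos_h <- 0.05                     # false-positive rate in healthy people
p_pos   <- p_pos_d * p_d + p_pos_h * (1 - p_d)   # P(B), by total probability
p_pos_d * p_d / p_pos               # P(A|B): only about 0.16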

How do we get a posterior?

How do we get a posterior?

The posterior distribution is the product of the likelihood and the prior…
(divided by the normalizing constant)

How do we get a posterior?


\[ Posterior = \frac{Likelihood \times Prior}{Normalizing\ constant} \]

How do we get a posterior?


\[ updated\ knowledge \propto likelihood\ of\ the\ data \times prior\ beliefs \]

We can think of Bayes’ theorem as a principled/systematic method for updating our knowledge in light of new data.

  • Update beliefs in proportion to how well the data fit those beliefs.
  • Because beliefs have probabilities, we can quantify our uncertainty about them.

How do we get a posterior?



\[ P(unknown | observed) = \frac{\color{red}{P(observed | unknown)} \times P(unknown)}{P(observed)} \]

Frequentism vs. Bayesianism

Frequentism vs. Bayesianism

In essence, the frequentist approach assesses the probability of the data given a specific hypothesis:

\[ P(data | H_{x}) \]

Through Bayes’ formula, one is able to assess the probability of a specific hypothesis given the data:

\[ P(H_{x} | data) \]

This is what we pretty much always want in science (and what many people think they are doing with frequentist statistics)

About priors

  • To approximate the posterior distribution we need priors and data
  • Priors = prior beliefs or assumptions
  • They represent a way for us to incorporate prior experience or domain expertise into the model
  • You can set any prior you want (subjective; see the sketch after this list)
  • Many object to Bayesian inference because of the priors
  • Counterpoint: priors force you to be explicit about your assumptions
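
As a sketch of what this looks like in brms (the specific priors here are hypothetical, chosen only for illustration):

library(brms)

# Inspect the default priors brms would use for the intercept-only model
get_prior(mpg ~ 1, data = mtcars)

# Set weakly informative priors explicitly
my_priors <- c(
  prior(normal(20, 10), class = Intercept),  # mean mpg plausibly near 20
  prior(exponential(0.5), class = sigma)     # residual SD is positive, modest
)
brm(mpg ~ 1, data = mtcars, prior = my_priors)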

Why?

Advantages

  • A more intuitive inferential framework
  • Focus on distributions and uncertainty estimation instead of point estimates
  • More natural interpretation of results
  • Ability to handle small samples, with priors providing a guard against overfitting
  • Natural/principled way to combine prior information with data, within a solid decision-theoretic framework
  • Flexible, can fit many models
  • Multiple comparisons?
  • Evidence for the null?
  • Model convergence?

Disadvantages

  • Conceptually difficult
  • Steep learning curve
  • Requires careful consideration of prior assumptions
  • Counterpoint: frequentist probability is based on an imaginary set of experiments that you never actually carry out
  • Manipulating posterior takes practice
  • Less common, less accepted
  • Computationally costly, slow
  • There is no correct way to choose a prior

Frequentism vs. Bayesianism?

Frequentism vs. Bayesianism?

  • A Bayesian estimate with flat priors is generally equivalent to a frequentist MLE

  • Frequentism is not wrong, just different

  • Frequentism is susceptible to QRPs (questionable research practices)

  • It’s about using the right tool for the job

  • Do whatever you need to do to answer your questions

How do you do it?

More sleep (but Bayesian)

sleepstudy dataset

  • Part of lme4 package
  • Criterion = reaction time (Reaction)
  • Predictor = # of days of sleep deprivation (Days)
  • 10 observations per participant
  • +2 fake participants w/ incomplete data
tibble [183 × 4] (S3: tbl_df/tbl/data.frame)
 $ Reaction: num [1:183] 250 259 251 321 357 ...
 $ Days    : num [1:183] 0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : chr [1:183] "308" "308" "308" "308" ...
 $ is_extra: logi [1:183] FALSE FALSE FALSE FALSE FALSE FALSE ...
Reaction Days Subject is_extra
249.5600    0     308    FALSE
258.7047    1     308    FALSE
250.8006    2     308    FALSE
321.4398    3     308    FALSE
356.8519    4     308    FALSE
414.6901    5     308    FALSE
382.2038    6     308    FALSE
290.1486    7     308    FALSE
430.5853    8     308    FALSE
466.3535    9     308    FALSE
222.7339    0     309    FALSE
205.2658    1     309    FALSE
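
For reference, the lme4 call that produces the frequentist output below:

library(lme4)
lmer(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep) |> summary()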

Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
   Data: df_sleep

REML criterion at convergence: 1771.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9707 -0.4703  0.0276  0.4594  5.2009 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 582.72   24.140       
          Days         35.03    5.919   0.07
 Residual             649.36   25.483       
Number of obs: 183, groups:  Subject, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)  252.543      6.433  39.257
Days          10.452      1.542   6.778

Correlation of Fixed Effects:
     (Intr)
Days -0.137

Bayesian version

Fitting the model

\[ \begin{aligned} Reaction_{ij} & \sim normal(\mu_{ij}, \sigma) \\ \mu_{ij} & = \alpha_{j} + \beta_{j} * Days_{ij} \\ \alpha & \sim normal(0, 1) \\ \beta & \sim normal(0, 1) \\ \sigma & \sim normal(0, 1) \end{aligned} \]


brm(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Reaction ~ Days + (Days | Subject) 
   Data: df_sleep (Number of observations: 183) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~Subject (Number of levels: 20) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)          25.88      6.26    15.14    39.91 1.00     1793     2244
sd(Days)                6.58      1.53     4.14    10.09 1.00     1581     2319
cor(Intercept,Days)     0.09      0.30    -0.46     0.66 1.00     1047     1656

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   252.18      6.80   238.81   265.63 1.00     1909     2649
Days         10.34      1.69     6.88    13.61 1.00     1314     1918

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    25.79      1.52    23.03    28.93 1.00     3622     2961

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
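
Once fit, the posterior can be queried directly. A sketch, assuming the model above is saved as fit_sleep (a hypothetical name):

# How much posterior mass supports a positive effect of sleep deprivation?
fit_sleep <- brm(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep)
hypothesis(fit_sleep, "Days > 0")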

“We put a tiny sliver of scientific information into the model and then we bet on the things that can happen a large number of ways”

Takeaways

  • There are primarily two camps in statistics: frequentists and Bayesians
  • The two camps see probability in different ways
  • Bayesian statistics is all about the posterior

References

Bates, D. (2011). Mixed models in R using the lme4 package part 2: Longitudinal data, modeling interactions. http://lme4.r-forge.r-project.org/slides/2011-03-16-Amsterdam/2Longitudinal.pdf
Mahr, T. J. (2017). Plotting partial pooling in mixed-effects models. https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/
Mahr, T. J. (2019). Another mixed effects model visualization. https://www.tjmahr.com/another-mixed-effects-model-visualization/
Mahr, T. J. (2020). Bayes’ theorem in three panels. https://www.tjmahr.com/bayes-theorem-in-three-panels/
McElreath, R. (2015). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.