Data Science for Linguists

Bayesian inference

Joseph V. Casillas, PhD

Rutgers University
Spring 2025
Last update: 2025-04-25

What is probability?

The world according to frequentists

What is frequentism?

  • A blanket term used to refer to “classical” statistics
  • Population parameters are fixed, actually exist
  • Probability refers to the long-run frequency of a given event
  • A sample of data is the result of one of an infinite number of exactly repeated experiments
  • Data are random, result from sampling a fixed population distribution

What if…

There isn’t one true population parameter…
but an entire distribution of parameters, some more plausible than others

Bayesian inference
is all about the posterior

Applied to regression

Classical: There is a single “true” line of best fit, and I’ll give my best estimate of it.

Bayesian: There is a distribution of lines of best fit… some more plausible than others… and I’ll give you samples from that distribution.

Distributions of plausible outcomes

  • In Bayesian inference we attempt to approximate this (these) distribution(s)… we call it the posterior distribution
  • Bayesian inference is all about the posterior
  • It represents a distribution of estimates that could arise from the data
  • Values that are more common (appear more often) are more plausible
  • For very simple problems we can calculate the posterior analytically by integrating (calculus)
  • For most problems (regression) we cannot calculate the posterior analytically… so we approximate it via sampling (and calculus); a minimal sketch follows
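
To make this concrete, here is a minimal grid-approximation sketch in R (my illustration, not course code): for a model this simple we can score every candidate value of the mean against the data and normalize, which is the job a sampler does for us in harder problems.

# Grid approximation of the posterior for the mean of mtcars$mpg,
# holding sigma fixed at the sample SD and using a flat prior on mu
y       <- mtcars$mpg
mu      <- seq(15, 25, length.out = 1000)    # grid of candidate means
log_lik <- sapply(mu, function(m) sum(dnorm(y, mean = m, sd = sd(y), log = TRUE)))
prior   <- rep(1, length(mu))                # flat prior over the grid
post    <- exp(log_lik - max(log_lik)) * prior
post    <- post / sum(post)                  # normalize so it sums to 1
mu[which.max(post)]                          # most plausible mean, approx. 20.09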



A trivial example

More cars

  • Let’s explore the mpg variable
  • N = 32
  • The range is 10.4–33.9.
  • The mean is 20.09.
  • The SD is 6.03.
  • The 95% quantiles are 10.4 and 32.74.
  • We’ll fit an intercept-only model.
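
These summaries are easy to reproduce with base R (a quick sketch):

# Summary statistics for mtcars$mpg
length(mtcars$mpg)                      # N = 32
range(mtcars$mpg)                       # 10.4 33.9
mean(mtcars$mpg)                        # 20.09
sd(mtcars$mpg)                          # 6.03
quantile(mtcars$mpg, c(0.025, 0.975))   # 10.4, 32.74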


A trivial example

Frequentist model

We’ll fit an intercept-only model:

lm(mpg ~ 1)

\[ \begin{align} mpg_{i} & \sim N(\mu_{i}, \sigma) \\ \mu_{i} & = \alpha \end{align} \]

The mean mpg is 20.09 ±6.03 SD.



lm(mpg ~ 1, data = mtcars) |> summary()

Call:
lm(formula = mpg ~ 1, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6906 -4.6656 -0.8906  2.7094 13.8094 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   20.091      1.065   18.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.027 on 31 degrees of freedom

A trivial example

Bayesian model

We’ll fit an intercept-only model:

brm(mpg ~ 1)

\[ \begin{align} mpg_{i} & \sim N(\mu_{i}, \sigma) \\ \mu_{i} & = \alpha \end{align} \]

The mean mpg is 20.09 ±6.03 SD.


brm(mpg ~ 1, data = mtcars) |> summary()
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: mpg ~ 1 
   Data: mtcars (Number of observations: 32) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    20.06      1.04    18.06    22.25 1.00     2881     2280

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     6.14      0.79     4.84     7.87 1.00     3106     2113

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

A trivial example

Let’s compare

lm(mpg ~ 1, data = mtcars)
            Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 20.09062   1.065424 18.85693 1.526151e-18
brm(mpg ~ 1, data = mtcars)
           Estimate Est.Error      Q2.5     Q97.5
Intercept 20.062135  1.042972 18.055954 22.252435
Sigma      6.135967  0.790000  4.839691  7.872904


  • Recall that the mean mpg is 20.09 ±6.03 SD
  • The Bayesian model estimates an additional parameter… which?
  • The Bayesian estimates are slightly different (but really close)… why?

So what’s the difference?

Exploring the posterior

cp <- as_tibble(cars_bayes_int)
b_Intercept    sigma
   21.93103 5.183423
   21.46296 5.282083
   19.08187 6.228794
   20.53519 5.165951
   19.68760 4.864501
   20.81836 8.179158
   19.45889 5.133981
   19.96603 5.367908
   20.35175 6.781768
   19.92893 5.826664
   20.19183 6.904526
   21.06296 6.069348
   19.70825 6.274392
   20.85624 6.445802
   21.56705 5.920678
  • This posterior distribution has 4000 draws of mean and SD values that are compatible with our data
nrow(cp)
[1] 4000
  • The posterior is just like any other distribution of data
  • We can analyze/summarize it however we want
mean(cp$b_Intercept)
[1] 20.06213
quantile(cp$b_Intercept, probs = c(0.025, 0.975), names = F)
[1] 18.05595 22.25243
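
Because the posterior is just data, any question we can phrase as a summary of the draws is fair game. For example (reusing cp from above):

# Posterior probability that the mean mpg exceeds 21
mean(cp$b_Intercept > 21)

# 95% quantile interval for sigma
quantile(cp$sigma, probs = c(0.025, 0.975), names = FALSE)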

How do we get a posterior?

Let’s talk about probability

\[ P(\color{green}{y}) \]

“the probability of y”
“the probability of you getting an A in stats”

\[ P(\color{green}{y} | \color{red}{z} ) \]

“the probability of y given z”
“the probability of you getting an A in stats given you don’t do the readings”

How do we get a posterior?


\[ P(A | B) = \frac{P(B | A) \times P(A)}{P(B)} \]

The probability of A given B is equal to the probability of B given A times the probability of A divided by the probability of B

You’ll probably never actually use this, even though most texts on Bayesian inference start with this and base rate fallacy examples
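
For completeness, this is the kind of base-rate computation those texts have in mind (the rates here are hypothetical):

# P(disease | positive test) via Bayes' theorem
p_d     <- 0.01                     # P(A): prevalence of the disease
p_pos_d <- 0.95                     # P(B|A): test sensitivity
p_pos_h <- 0.05                     # false-positive rate in healthy people
p_pos   <- p_pos_d * p_d + p_pos_h * (1 - p_d)   # P(B), by total probability
p_pos_d * p_d / p_pos               # P(A|B): only about 0.16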

How do we get a posterior?

How do we get a posterior?

The posterior distribution is the product of the likelihood and the prior…
(divided by the normalizing constant)

How do we get a posterior?


\[ Posterior = \frac{Likelihood \times Prior}{Normalizing\ constant} \]

How do we get a posterior?


\[ updated\ knowledge \propto likelihood\ of\ the\ data \times prior\ beliefs \]

We can think of Bayes’ theorem as a principled/systematic method for updating our knowledge in light of new data.

  • Update beliefs in proportion to how well the data fit those beliefs.
  • Because beliefs have probabilities, we can quantify our uncertainty about them.

How do we get a posterior?



\[ P(unknown | observed) = \frac{\color{red}{P(observed | unknown)} \times P(unknown)}{P(observed)} \]

Frequentism vs. Bayesianism

Frequentism vs. Bayesianism

In essence, the frequentist approach assesses the probability of the data given a specific hypothesis:

\[ P(data | H_{x}) \]

Through Bayes’ formula, one is able to assess the probability of a specific hypothesis given the data:

\[ P(H_{x} | data) \]

This is what we pretty much always want in science (and what many people think they are doing with frequentist statistics)

About priors

  • To approximate the posterior distribution we need priors and data
  • Priors = prior beliefs or assumptions
  • They represent a way for us to incorporate prior experience or domain expertise into the model
  • You can set any prior you want (subjective; see the sketch after this list)
  • Many object to Bayesian inference because of the priors
  • Counterpoint: priors force you to be explicit about your assumptions
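
As a sketch of what this looks like in brms (the specific priors here are hypothetical, chosen only for illustration):

library(brms)

# Inspect the default priors brms would use for the intercept-only model
get_prior(mpg ~ 1, data = mtcars)

# Set weakly informative priors explicitly
my_priors <- c(
  prior(normal(20, 10), class = Intercept),  # mean mpg plausibly near 20
  prior(exponential(0.5), class = sigma)     # residual SD is positive, modest
)
brm(mpg ~ 1, data = mtcars, prior = my_priors)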

Why?

Advantages

  • A more intuitive inferential framework
  • Focus on distributions and uncertainty estimation instead of point estimates
  • More natural interpretation of results
  • Ability to handle small samples, with priors providing a guard against overfitting
  • Natural/principled way to combine prior information with data, within a solid decision-theoretic framework
  • Flexible, can fit many models
  • Multiple comparisons?
  • Evidence for the null?
  • Model convergence?

Disadvantages

  • Conceptually difficult
  • Steep learning curve
  • Requires careful consideration of prior assumptions
  • Counterpoint: frequentist probability is based on an imaginary set of experiments that you never actually carry out
  • Manipulating posterior takes practice
  • Less common, less accepted
  • Computationally costly, slow
  • There is no correct way to choose a prior

Frequentism vs. Bayesianism?

Frequentism vs. Bayesianism?

  • A Bayesian estimate with flat priors is generally equivalent to a frequentist MLE

  • Frequentism is not wrong, just different

  • Frequentism is susceptible to QRPs (questionable research practices)

  • It’s about using the right tool for the job

  • Do whatever you need to do to answer your questions

How do you do it?

More sleep (but Bayesian)

sleepstudy dataset

  • Part of lme4 package
  • Criterion = reaction time (Reaction)
  • Predictor = # of days of sleep deprivation (Days)
  • 10 observations per participant
  • +2 fake participants w/ incomplete data
tibble [183 × 4] (S3: tbl_df/tbl/data.frame)
 $ Reaction: num [1:183] 250 259 251 321 357 ...
 $ Days    : num [1:183] 0 1 2 3 4 5 6 7 8 9 ...
 $ Subject : chr [1:183] "308" "308" "308" "308" ...
 $ is_extra: logi [1:183] FALSE FALSE FALSE FALSE FALSE FALSE ...
Reaction Days Subject is_extra
249.5600    0     308    FALSE
258.7047    1     308    FALSE
250.8006    2     308    FALSE
321.4398    3     308    FALSE
356.8519    4     308    FALSE
414.6901    5     308    FALSE
382.2038    6     308    FALSE
290.1486    7     308    FALSE
430.5853    8     308    FALSE
466.3535    9     308    FALSE
222.7339    0     309    FALSE
205.2658    1     309    FALSE
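
For reference, the lme4 call that produces the frequentist output below:

library(lme4)
lmer(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep) |> summary()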

Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
   Data: df_sleep

REML criterion at convergence: 1771.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9707 -0.4703  0.0276  0.4594  5.2009 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 582.72   24.140       
          Days         35.03    5.919   0.07
 Residual             649.36   25.483       
Number of obs: 183, groups:  Subject, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)  252.543      6.433  39.257
Days          10.452      1.542   6.778

Correlation of Fixed Effects:
     (Intr)
Days -0.137

Bayesian version

Fitting the model

\[ \begin{aligned} Reaction_{ij} & \sim normal(\mu_{ij}, \sigma) \\ \mu_{ij} & = \alpha_{j} + \beta_{j} * Days_{ij} \\ \alpha & \sim normal(0, 1) \\ \beta & \sim normal(0, 1) \\ \sigma & \sim normal(0, 1) \end{aligned} \]


brm(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep)

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: Reaction ~ Days + (Days | Subject) 
   Data: df_sleep (Number of observations: 183) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~Subject (Number of levels: 20) 
                    Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)          25.88      6.26    15.14    39.91 1.00     1793     2244
sd(Days)                6.58      1.53     4.14    10.09 1.00     1581     2319
cor(Intercept,Days)     0.09      0.30    -0.46     0.66 1.00     1047     1656

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   252.18      6.80   238.81   265.63 1.00     1909     2649
Days         10.34      1.69     6.88    13.61 1.00     1314     1918

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    25.79      1.52    23.03    28.93 1.00     3622     2961

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
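
Once fit, the posterior can be queried directly. A sketch, assuming the model above is saved as fit_sleep (a hypothetical name):

# How much posterior mass supports a positive effect of sleep deprivation?
fit_sleep <- brm(Reaction ~ 1 + Days + (1 + Days | Subject), data = df_sleep)
hypothesis(fit_sleep, "Days > 0")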

“We put a tiny sliver of scientific information into the model and then we bet on the things that can happen a large number of ways”

Takeaways

  • There are primarily two camps in statistics: frequentists and Bayesians
  • The two camps see probability in different ways
  • Bayesian statistics is all about the posterior

References

Bates, D. (2011). Mixed models in R using the lme4 package part 2: Longitudinal data, modeling interactions. http://lme4.r-forge.r-project.org/slides/2011-03-16-Amsterdam/2Longitudinal.pdf
Mahr, T. J. (2017). Plotting partial pooling in mixed-effects models. https://www.tjmahr.com/plotting-partial-pooling-in-mixed-effects-models/
Mahr, T. J. (2019). Another mixed effects model visualization. https://www.tjmahr.com/another-mixed-effects-model-visualization/
Mahr, T. J. (2020). Bayes’ theorem in three panels. https://www.tjmahr.com/bayes-theorem-in-three-panels/
McElreath, R. (2015). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.