---
<img src="index_files/figure-html/lin-log-comp-1.png" width="100%" />
---
# Logistic regression
### What you need to know
.large[
- Logistic regression is the most appropriate way to model binary response
variables (0/1)
- The model calculates the probability that y = 1, i.e., the probability of a
"success", or presence of something
- Model output from logistic regression is similar to `lm()`
- Model interpretation "works" the same way, i.e., a 1-unit change in
`predictor` is associated with a change of X in the criterion
- But... the parameter estimates represent changes in the log-odds of y = 1
- This is much less intuitive and more difficult to understand without some
math (see the sketch below)
]
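
To see how the log-odds scale relates to probability, here is a quick base-R
illustration (a sketch, not tied to any particular dataset):

```r
# plogis() converts log-odds to probabilities; qlogis() does the reverse
plogis(0)              # log-odds of 0 = probability of 0.5
qlogis(0.5)            # probability of 0.5 = log-odds of 0
# A constant 1-unit change in log-odds is NOT a constant change in probability
plogis(1) - plogis(0)  # ~0.23
plogis(3) - plogis(2)  # ~0.07
```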
---
# Logistic regression
### Example
.large[
- You are interested in understanding the perception of stop voicing in
English bilabials
- You conducted an experiment in which participants heard a range of
bilabial stops that differed in voice-onset time
- The stimuli ranged from -60 ms to 60 ms in 10 ms increments
- Participants were presented stimuli drawn at random from the continuum and
identified the sounds as /b/'s or /p/'s
- A /p/ response is coded as a 1 (a simulated dataset in this format is
sketched below)
]
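
For concreteness, here is one way data with this structure might be simulated;
the dataset name matches the slides that follow, but the coefficients are
rough illustrative assumptions, not values estimated from the experiment:

```r
# Hypothetical sketch of the assumed structure of vot_logistic_data
set.seed(1)
vot_logistic_data <- data.frame(
  vot = rep(seq(-60, 60, by = 10), each = 200)  # continuum in 10 ms steps
)
# P(/p/ response) increases with VOT (coefficients are illustrative)
vot_logistic_data$resp <- rbinom(
  n = nrow(vot_logistic_data), size = 1,
  prob = plogis(-0.85 + 0.057 * vot_logistic_data$vot)
)
```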
---
background-image: url(./assets/img/vot.png)
background-size: contain
---
# Logistic regression
.left-column[
<br><br><br><br>
```r
# Fit a logistic regression: binary response as a function of VOT
mod_log <- glm(
  resp ~ vot,
  data = vot_logistic_data,
  family = "binomial"
)
```
]
.right-column[
```
Call:
glm(formula = resp ~ vot, family = "binomial", data = vot_logistic_data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3085  -0.6583  -0.2198   0.6503   2.9320  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.84614    0.06089   -13.9   <2e-16 ***
vot          0.05731    0.00213    26.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3482.8  on 2599  degrees of freedom
Residual deviance: 2063.6  on 2598  degrees of freedom
AIC: 2067.6

Number of Fisher Scoring iterations: 5
```
]
---
# Logistic regression
### Example
.Large[
- We can convert the log-odds to probabilities by calculating the inverse
logit<sup>1</sup>
]
```r
inv_logit(mod_log) %>% kable(., format = 'html')
```
<table>
<thead>
<tr>
<th style="text-align:left;"> variables </th>
<th style="text-align:right;"> betas </th>
<th style="text-align:right;"> prob </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> (Intercept) </td>
<td style="text-align:right;"> -0.8461443 </td>
<td style="text-align:right;"> 0.3002423 </td>
</tr>
<tr>
<td style="text-align:left;"> vot </td>
<td style="text-align:right;"> 0.0573077 </td>
<td style="text-align:right;"> 0.5143230 </td>
</tr>
</tbody>
</table>
--
.Large[
- This is still difficult to interpret... a plot might help.
]
<br><br>
<sup>1</sup> Inverse logit = `\(\frac{1}{1 + \exp(-x)}\)`
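
The `inv_logit()` helper used above is not part of base R; a minimal sketch of
what such a helper might do, built on `plogis()` (base R's inverse logit),
could look like this (an assumption about the helper, not its actual
definition):

```r
# Return each coefficient of a fitted glm alongside its inverse logit
inv_logit <- function(model) {
  betas <- coef(model)
  data.frame(
    variables = names(betas),
    betas = unname(betas),
    prob = plogis(unname(betas))  # inverse logit: 1 / (1 + exp(-x))
  )
}
```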
---
class: middle
<img src="index_files/figure-html/vot-plot-1.png" width="100%" />
---
class: middle
.pull-left[
```r
inv_logit(mod_log) %>% kable(., format = 'html')
```
<table>
<thead>
<tr>
<th style="text-align:left;"> variables </th>
<th style="text-align:right;"> betas </th>
<th style="text-align:right;"> prob </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> (Intercept) </td>
<td style="text-align:right;"> -0.8461443 </td>
<td style="text-align:right;"> 0.3002423 </td>
</tr>
<tr>
<td style="text-align:left;"> vot </td>
<td style="text-align:right;"> 0.0573077 </td>
<td style="text-align:right;"> 0.5143230 </td>
</tr>
</tbody>
</table>
<br>
- Now the intercept is interpretable
(note that the VOT continuum is
already centered at 0 ms)
- What does the parameter estimate
for VOT mean?
- Can we calculate how the probability
differs from one specific point to
another?
]
.pull-right[
<img src="index_files/figure-html/repeat-glm-vot-plot-1.png" width="100%" />
]
---
class: middle
.pull-left[
#### We can use the model coefficients
<table class="table" style="font-size: 16px; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;"> term </th>
<th style="text-align:right;"> estimate </th>
<th style="text-align:right;"> std.error </th>
<th style="text-align:right;"> statistic </th>
<th style="text-align:right;"> p.value </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;"> (Intercept) </td>
<td style="text-align:right;"> -0.846 </td>
<td style="text-align:right;"> 0.061 </td>
<td style="text-align:right;"> -13.897 </td>
<td style="text-align:right;"> 0 </td>
</tr>
<tr>
<td style="text-align:left;"> vot </td>
<td style="text-align:right;"> 0.057 </td>
<td style="text-align:right;"> 0.002 </td>
<td style="text-align:right;"> 26.899 </td>
<td style="text-align:right;"> 0 </td>
</tr>
</tbody>
</table>
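
A coefficient table like the one above can be produced with the `broom`
package (an assumption about how this table was generated):

```r
library(broom)
tidy(mod_log)  # columns: term, estimate, std.error, statistic, p.value
```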
- Calculate the inverse logit of the
linear equation:
`$$\alpha + \beta_{VOT} \times 10\ \text{ms}$$`
```r
plogis(-0.846 + 0.057 * 10)
```
```
## [1] 0.4314347
```
- What about the change in probability of selecting /p/ when shifting from 10 ms
to 20 ms?
```r
plogis(-0.846 + 0.057 * 20) -
  plogis(-0.846 + 0.057 * 10)
```
```
## [1] 0.1415404
```
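
Equivalently, `predict()` with `type = "response"` returns these
probabilities directly from the fitted model:

```r
# Same probabilities without hand-copying (rounded) coefficients
p <- predict(mod_log,
             newdata = data.frame(vot = c(10, 20)),
             type = "response")
p[2] - p[1]  # change in P(/p/) from 10 ms to 20 ms
```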
]
.pull-right[
<img src="index_files/figure-html/repeat-glm-vot-plot2-1.png" width="100%" />
The shift from 10 ms to 20 ms VOT corresponds
to an increase of approximately 14 percentage
points in the probability of selecting /p/
]
---
# Logistic regression
### Summary
.large[
- Logistic regression is a powerful tool for modeling binary data
- The `glm()` function works similarly to the `lm()` function
- We test for main effects and interactions the same way too, i.e., using
nested model comparisons with the `anova()` function (see the sketch below)
- We specify the exponential family and its link function with
`family = binomial(link = "logit")`
- Interpretation of logistic regression works the same way as classic linear
regression
- Parameter estimates are given in log-odds (and require some work to
interpret accurately)
]
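
As a sketch of the nested comparison mentioned above (`mod_null` here is a
hypothetical intercept-only model, not fit earlier in these slides):

```r
# Does vot improve fit over an intercept-only model?
mod_null <- glm(resp ~ 1, data = vot_logistic_data, family = "binomial")
anova(mod_null, mod_log, test = "Chisq")  # likelihood ratio test
```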
---
layout: false
class: middle