Joseph V. Casillas, PhD
Rutgers University
Last update: 2025-01-04
Mastering quantitative methods is an “[…] increasingly vital component of linguistic training” (Johnson, 2011b).
Main roadblock = lack of resources
Most programs don’t offer this training. Students get it elsewhere (or not at all)
Not a lot of good texts made specifically for linguistics
“All models are wrong, but some are useful”
- George Box
Statistics is hard
You are learning
You will make mistakes
You should come away with a better understanding of the big picture
Ideally this would be the first class of many you take on the subject
Data reduction
Inference
Discovery of relationships
Exploration
Really?
EDA or CDA?
Exploratory data analysis
vs.
Confirmatory data analysis
EDA: used to get to know your data, to generate hypotheses
CDA: used ONLY to test hypotheses
Both EDA and CDA (done correctly) are a necessary part of science
What is an observation anyway?
Johnson (2011b) discusses 4 types…
The types of observations you deal with are related to your area of research and your questions.
The type of observations you analyze will determine the statistical methods you use to answer your questions.
Different types of observations are distributed in different ways.
If we collect a series of observations we have a sample of data
We can create a histogram by selecting a bin size and counting the frequency of the values in our sample that fall inside each bin
A histogram shows us the frequency distribution of our sample
For example, the histogram to the right shows that the value 3 appears in this sample over 200 times
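The binning step can be sketched in a few lines of Python. This is only an illustration: the function name `histogram_counts`, the left-closed bin convention, and the sample values are assumptions, not the data behind the histogram on the slide.

```python
from collections import Counter


def histogram_counts(values, bin_size, start=0.0):
    """Count how many values fall into each bin of width bin_size.

    Bins are left-closed: [start + k * bin_size, start + (k + 1) * bin_size).
    Returns a dict mapping each bin's left edge to its frequency.
    """
    counts = Counter()
    for v in values:
        k = int((v - start) // bin_size)  # index of the bin containing v
        counts[start + k * bin_size] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample, chosen only for illustration.
sample = [1.2, 2.7, 3.0, 3.1, 3.4, 3.9, 4.5, 5.8]
bin_size = 1.0
counts = histogram_counts(sample, bin_size)

# Print a crude text histogram: one '#' per observation in the bin.
for left_edge, freq in counts.items():
    print(f"[{left_edge:.1f}, {left_edge + bin_size:.1f}): {'#' * freq} ({freq})")
```

Changing `bin_size` changes how coarse or fine the frequency distribution looks, while the total count always equals the sample size.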
We can describe the characteristics of our sample to see how the values are distributed.
For this we use…
Measures of central tendency (mid-point)
Measures of dispersion (around mid-point)
[1] 2 5 8
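As a minimal sketch, the measures of central tendency and dispersion for the three values 2, 5, 8 can be computed with Python's standard `statistics` module. It is an assumption here that the standard deviation is the sample version (dividing by n - 1); that choice gives the value 3 used in the z-score examples later in these slides.

```python
import statistics

sample = [2, 5, 8]

# Central tendency (mid-point)
mean = statistics.mean(sample)
median = statistics.median(sample)

# Dispersion (spread around the mid-point)
value_range = max(sample) - min(sample)
sd = statistics.stdev(sample)  # sample SD: divides by n - 1 (assumption)

print(mean, median, value_range, sd)
```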
There are many families of distributions, each of which has important characteristics.
The most important for us (for now) is the normal distribution.
It is easy to recognize because it looks like a bell.
Distributions
The basics
If we keep decreasing the bin size, the rectangular outline of the vertical bars turns into a smooth curve
We can continue until the bin size approaches zero in the limit
The formula that describes this curve is called a probability density function.
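For the normal distribution mentioned earlier, the density is f(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²)). The sketch below writes that formula out by hand and checks it against `statistics.NormalDist` from the Python standard library. The function name `normal_pdf` and the integration range are illustrative choices.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Height of the standard normal curve at its centre.
peak = normal_pdf(0.0)

# The hand-written formula should agree with the standard library.
library_peak = NormalDist(mu=0.0, sigma=1.0).pdf(0.0)

# Summing thin rectangles (very small "bins") under the curve
# from -6 to 6 gives a total area of approximately 1.
step = 0.01
area = sum(normal_pdf(-6.0 + i * step) * step for i in range(1200))

print(peak, library_peak, area)
```

The last computation mirrors the shrinking-bin idea: the narrower the rectangles, the closer their summed area gets to the area under the curve.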
Distributions
Less-than-normal data
QQ plots
For data which follow a normal distribution:
Subtract the mean from each value and divide by the standard deviation
\[\color{red}{z} = \frac{\color{green}{x} - \color{orange}{\mu}}{\color{blue}{\sigma}}\]
(2 - 5) / 3 = -3 / 3 = -1
(5 - 5) / 3 = 0 / 3 = 0
(8 - 5) / 3 = 3 / 3 = 1
(22 - 55) / 33 = -33 / 33 = -1
(55 - 55) / 33 = 0 / 33 = 0
(88 - 55) / 33 = 33 / 33 = 1
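The worked arithmetic above can be reproduced with a small helper. This is a sketch: the name `z_scores` is hypothetical, and it assumes the sample standard deviation (n - 1 denominator), which is what yields 3 and 33 for these two samples.

```python
import statistics


def z_scores(values):
    """Subtract the mean from each value and divide by the sample SD."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample SD (assumption)
    return [(x - mu) / sigma for x in values]


small = z_scores([2, 5, 8])
large = z_scores([22, 55, 88])

print(small)
print(large)
```

Both samples produce the same standardized values even though their raw scales differ, which is the point of converting to z-scores.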
Distributions
Convert to z-score
Subtract the mean from each value and divide by the standard deviation
\[\color{red}{z} = \frac{\color{green}{x} - \color{orange}{\mu}}{\color{blue}{\sigma}}\]
Exam scores
group | Range | Median | Mean score | SD score |
---|---|---|---|---|
Class_A | 20 | 89.5 | 88.7 | 6.412661 |
Class_B | 29 | 72.0 | 73.5 | 9.698224 |
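The summary values in this table can be recomputed from the individual exam scores listed in the standardized tables of this section. A minimal sketch, assuming the sample standard deviation; the helper name `summarize` is illustrative.

```python
import statistics

# Individual scores as listed in the per-class tables of this section.
class_a = [78, 81, 84, 86, 89, 90, 92, 94, 95, 98]
class_b = [60, 64, 65, 69, 71, 73, 76, 82, 86, 89]


def summarize(scores):
    """Return range, median, mean, and sample SD for a list of scores."""
    return {
        "range": max(scores) - min(scores),
        "median": statistics.median(scores),
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample SD (n - 1)
    }


summary_a = summarize(class_a)
summary_b = summarize(class_b)

for name, summary in (("Class_A", summary_a), ("Class_B", summary_b)):
    print(name, summary)
```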
Exam scores - Standardized

Class_A

id | score | Std. score | Dev. score |
---|---|---|---|
s03 | 78.00 | -1.67 | -10.70 |
s02 | 81.00 | -1.20 | -7.70 |
s08 | 84.00 | -0.73 | -4.70 |
s01 | 86.00 | -0.42 | -2.70 |
Friend | 89.00 | 0.05 | 0.30 |
s07 | 90.00 | 0.20 | 1.30 |
s04 | 92.00 | 0.51 | 3.30 |
s10 | 94.00 | 0.83 | 5.30 |
s09 | 95.00 | 0.98 | 6.30 |
s05 | 98.00 | 1.45 | 9.30 |

Class_B

id | score | Std. score | Dev. score |
---|---|---|---|
s17 | 60.00 | -1.39 | -13.50 |
s18 | 64.00 | -0.98 | -9.50 |
s13 | 65.00 | -0.88 | -8.50 |
s15 | 69.00 | -0.46 | -4.50 |
s12 | 71.00 | -0.26 | -2.50 |
s20 | 73.00 | -0.05 | -0.50 |
s11 | 76.00 | 0.26 | 2.50 |
s14 | 82.00 | 0.88 | 8.50 |
s19 | 86.00 | 1.29 | 12.50 |
You | 89.00 | 1.60 | 15.50 |
Your friend received the same score on her exam, but you did better relative to the rest of your class.
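The comparison can be checked by standardizing the shared score of 89 against each class. This sketch takes the scores from the two tables above and assumes the sample standard deviation; the function name `standardize` is hypothetical.

```python
import statistics

# Scores from the two tables above; 89 appears in both classes.
class_a = [78, 81, 84, 86, 89, 90, 92, 94, 95, 98]  # your friend's class
class_b = [60, 64, 65, 69, 71, 73, 76, 82, 86, 89]  # your class


def standardize(score, scores):
    """Return (deviation score, z-score) of one score within its class."""
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (assumption)
    deviation = score - mean
    return deviation, deviation / sd


friend_dev, friend_z = standardize(89, class_a)
you_dev, you_z = standardize(89, class_b)

print(f"Friend: deviation {friend_dev:.2f}, z = {friend_z:.2f}")
print(f"You:    deviation {you_dev:.2f}, z = {you_z:.2f}")
```

The same raw score sits close to the mean in one class and well above it in the other, so the z-scores differ even though the raw scores do not.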