Data Science for Linguists

Bivariate correlation

Joseph V. Casillas, PhD

Rutgers University
Spring 2025
Last update: 2025-02-15

But first…

World views and
statistical Assumptions

Statistical assumptions

Statistics are about more than just mathematics…

Where do statistical assumptions come from?
Who made these statistical assumptions and why?
Are statistical assumptions truly consistent with the nature of reality?

Statistical assumptions

Where do mathematical concepts in general come from?

They are influenced by worldviews/paradigms/research traditions
Choices are informed, and they need to be logically thought out
Statistics can be mathematically valid without answering the original research question (what you really want to know)
We need conceptual as well as mathematical validity

Statistical World-Views

Two major world views (Weltanschauungen) have historically influenced the field of Statistics:

The Newtonian Worldview
The Darwinian Worldview

Plato (380 BC) The Republic

Believed in absolute God-given categories
Things possess an essence of a type (e.g., species)
Individual variations are “imperfect reflections” of ideal types (eidolon, ειδωλον)… therefore not important
We can’t trust our sensory perceptions… we can only arrive at truth using pure reason

Aristotle (350 BC) De Partibus Animalium

Different idea of categories than Plato
Tried to create a taxonomy of animals
Taxonomy based on specific parts of animals that define them as part of a group
Category membership defined by having a specific feature that members of other groups don’t possess
- Birds have wings, but monkeys don’t, so only winged animals could be considered birds

Both philosophers believed in categories:

But their definitions of what categories were and where they came from differed

Plato

Categories are God-given
Things possess an essence of a type
Observable features are not reliable because they are based on sense perceptions

Aristotle

Categories are inherent in the individual
Individuals need to have some identifiable, visible feature to be classified
It is the possession of certain observable features that puts individuals in categories

To Aristotle the individual was ultimate reality, but to Plato the individual was an imperfect reflection of the perfect category or its “ideal type” (= eidolon, ειδωλον)

What about Newton and Darwin?

Alfonso X “El Sabio” of Castile
to the rescue!

Had Arabic scientific and philosophical texts translated into Castilian, restoring long-lost classical knowledge to Christian Europe
The Arabic translation of De Partibus Animalium (Aristotle, 350 BC) comprises treatises 11-14 of the Kitāb al-Hayawān by Yahyà bin al-Bitrīq
“El libro de los animales”
Had a lasting effect on 14th-century (medieval) Scholastic philosophy
- Essentialism
- Nominalism

Newton and Darwin

Essentialists were Platonists:

Believed that categories are real and God-given
The essentialist (platonic) worldview is highly influential: Isaac Newton was an essentialist
The work of Newton directly influences big names responsible for introducing statistical methods to the social sciences
- Laplace: Laplace-Gauss distribution, least squares estimation
- Quételet: L’homme moyen (the “average man”) is characterized by the mean values of measured variables that follow a normal distribution (golden mean)

Newton and Darwin

Nominalists were Aristotelians:

Like Aristotle, but much more radical
They thought categories didn’t really exist in nature
Darwin was a nominalist
We create categories because we see common features and we arbitrarily try to sort individuals into these categories

Newtonian thinking:

The “Golden Mean” as Target?

Newtonian thinking:

Wanted to base the Social Sciences on purely physical principles
The normal distribution was developed to describe measurement error:
- The normal distribution is the shape of error
- Found that many traits were normally distributed
- So was there a “golden mean” that nature aimed at, and was the distribution around this mean just “error”?
Wanted to create a set of universal laws to govern all human behavior using the golden mean:
- Standard deviation was just “error”
- In Structuralist psychology, Wundt and others were looking for universal principles of human behavior
- They also saw individual differences as error

Persistent Influences of the
Newtonian World-View

Later development: the Analysis of Variance (ANOVA)
characterizes all individual differences (variance) as error

This comes from the dominance of the Newtonian worldview within experimental psychology
Many statistical procedures just manipulate people in an attempt to discover some “truth” about some aspect of human behavior
This is still a common approach today

The Nominalists

Charles Robert Darwin, FRS (1809-1882)

On the Origin of Species (1859)
Also addressed the nature of social and behavioral sciences
Clearly a Nominalist
- Species are transitory
- They grade into one another
- They have no objective or permanent existence
Species are mental constructs we invent to conceptualize populations
- Individuals are what are important, not species
- Principles of continuum and contiguity

Nominalist Implications

If the frequencies of heritable traits within populations of individuals change over time, then the entire species, and the definition of the species, changes with them:
- No strict Biblical “bringing forth living creatures after their kind”
- Species are mental constructs that we invent to help us conceptualize “snapshots” of evolving populations at a single point in time

The Darwinian
Worldview

The Darwinians:

Sir Francis Galton

Hereditary Genius (1869)
Strongly influenced by his cousin: Charles Robert Darwin
English Victorian polymath
Founder of:
- Biological Psychology
- Differential Psychology
- Eugenics Movement
Wanted to investigate heritability and realized that Quételet’s methods and statistical models wouldn’t work for this purpose

Attack of the Clones

If we were all clones, when we reproduced there would be no significant difference between my offspring and yours:
- So my kids would be no more like me than they would be like you
Any differences between individuals would be random error:
- i.e., A deviation from the “golden mean”
- It would be deviation from the general traits of humans not unique heritable differences
But we know this is not the case
- My kids are more like me than like you
- How do we show this?

The Darwinians: Sir Francis Galton

Twin and Adoption Studies

Galton invented both twin studies and adoption studies to investigate this:
- Differences between monozygotic (MZ) twins should reflect environmental differences, and differences between adopted siblings should reflect genetic differences
These were purely observational studies

Deviations from the Mean

Galton needed Karl Pearson to develop the mathematical tools to analyze these data
Key was the deviation from the mean:
- Individuality defined as the deviation from the mean
- My deviation and my offspring’s deviation should be similar
- My kids don’t start back at the “ideal human” (“l’homme moyen”) mean and then make deviations from there
- They start from my genetic contributions and make deviations from there
He wanted to compare deviations from the mean among individuals, and then use these to measure the deviations of the group

The Darwinians: Karl Pearson

Established the discipline of mathematical statistics
Founded the world’s first university statistics department at University College London in 1911
Controversial proponent of eugenics
Protégé and biographer of Sir Francis Galton
Not a “Sir”
Galton and Pearson jointly developed the correlation coefficient (r) in 1895
Also invented the chi-squared test:
- Started feud with Sir Ronald Fisher about it

Correlation coefficient (r)

“The Coefficient of Co-relation”

The descriptive statistic that measures the degree of linear association between any two variables:

Ranges in value from -1 to +1
When r = 0 there is no linear correlation
Can be used to compare two people on same variable
Can be used to compare two variables within one person

Correlation coefficient (r)

Individual Differences

The correlation coefficient is (was) about comparing individual differences
Recall that z-scores let us compare two different variables and see how they deviate from the mean
You can compare weight and IQ, or “apples and oranges”, by using standardized scores
The correlation coefficient is about the association between two variables

Z-Scores & Correlation Coefficients

The Z-Scores:

\[z_x = \frac{(x_i - \bar{x})} {s_x} \qquad \qquad z_y = \frac{(y_i - \bar{y})} {s_y}\]

Pearson’s “Coefficient of Co-relation”:

\[r_{xy} = \frac{\sum (z_x) (z_y)}{n - 1}\]
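
The two formulas above can be sketched directly in R. The vectors `x` and `y` below are hypothetical values picked so the arithmetic is easy to follow by hand. The result of the manual calculation should match the built-in `cor()` function.

```r
# Hypothetical data (not from the course materials)
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# z-scores: deviation from the mean in standard deviation units
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)

# Pearson's r: sum of the z-score products divided by n - 1
r_xy <- sum(z_x * z_y) / (length(x) - 1)
r_xy       # 0.8
cor(x, y)  # same value from the built-in function
```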

Pearson’s correlation coefficient

Details

A Perfect Correlation:
r_xy = 1
An Imperfect Correlation:
r_xy < 1
Mismatch between z_x and z_y produces correlation coefficients lower than 1
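
One way to see the "mismatch" idea is to compare z-scores directly. This is a sketch with made-up numbers (`y_perfect` and `y_noisy` are illustrative names). When every z_x equals its paired z_y, r is 1. Any mismatch pulls r below 1.

```r
# Helper that standardizes a vector (illustrative, equivalent to the z-score formula)
z <- function(v) (v - mean(v)) / sd(v)

x         <- c(1, 2, 3, 4, 5)
y_perfect <- 2 * x + 1                     # exact linear function of x
y_noisy   <- c(3.2, 4.7, 7.5, 8.6, 11.4)   # roughly linear, with some scatter

# Paired z-scores are identical in the perfect case...
all.equal(z(x), z(y_perfect))
cor(x, y_perfect)

# ...and differ in the imperfect case, so r falls below 1
round(z(x) - z(y_noisy), 2)
cor(x, y_noisy)
```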

Assumptions

Scale of measurement should be interval or ratio
Variables should be approximately normally distributed
The association should be linear
There should be no outliers in the data
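
To illustrate why outliers matter, here is a small sketch with invented numbers: six points with a weak association, plus one extreme point added afterwards. The rank-based Spearman correlation (`method = "spearman"`) is shown only as one commonly used alternative when the assumptions look doubtful. It is not part of the slides above.

```r
# Hypothetical data with a weak linear association
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 3, 1, 3, 2)
r_without <- cor(x, y)

# Add a single extreme observation
x_out <- c(x, 20)
y_out <- c(y, 15)
r_with <- cor(x_out, y_out)

r_without  # weak
r_with     # inflated by the one outlier

# Rank-based alternative, less sensitive to extreme values
cor(x_out, y_out, method = "spearman")
```

Plotting the data first (for example with `plot(x_out, y_out)`) is usually the quickest way to spot this kind of problem.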

Example: vocabulary size and age

Vocabulary size and age

Bivariate correlation

Average native test-takers of age 4 already know approx. 5,000 words
Average native test-takers of age 8 already know approx. 10,000 words
Average native test-takers of age 15 already know approx. 18,000 words
Most adult native test-takers range from 20,000–35,000 words

Let’s sample the data and calculate Pearson’s correlation coefficient for age and vocabulary size

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample	x	y
1	9.50	12.44
2	4.25	3.58
3	6.45	9.71
4	6.94	7.69
5	5.47	5.59
6	10.71	14.77
7	11.07	10.24
8	8.39	7.95
9	14.73	21.99
10	9.99	13.63
11	12.92	15.69
12	4.29	6.57

\[\frac{\sum_{i}^{n} \left ( \frac{\color{red}{x_i} - \bar{x}}{s_x} \right ) \left ( \frac{\color{blue}{y_i} - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample	x	y
1	9.50	12.44
2	4.25	3.58
3	6.45	9.71
4	6.94	7.69
5	5.47	5.59
6	10.71	14.77
7	11.07	10.24
8	8.39	7.95
9	14.73	21.99
10	9.99	13.63
11	12.92	15.69
12	4.29	6.57

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \color{red}{\bar{x}}}{\color{blue}{s_x}} \right ) \left ( \frac{y_i - \color{red}{\bar{y}}}{\color{blue}{s_y}} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample	x	y
1	9.50	12.44
2	4.25	3.58
3	6.45	9.71
4	6.94	7.69
5	5.47	5.59
6	10.71	14.77
7	11.07	10.24
8	8.39	7.95
9	14.73	21.99
10	9.99	13.63
11	12.92	15.69
12	4.29	6.57
Sum	104.71	129.85
Mean	8.73	10.82
SD	3.36	5.15

\[\frac{\sum_{i}^{n} \color{red}{\left ( \frac{x_i - \bar{x}}{s_x} \right )} \left ( \frac{y_i - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \color{red}{\left ( z_x \right )} \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample	x	y	z_x
1	9.50	12.44	0.23
2	4.25	3.58	-1.33
3	6.45	9.71	-0.68
4	6.94	7.69	-0.53
5	5.47	5.59	-0.97
6	10.71	14.77	0.59
7	11.07	10.24	0.70
8	8.39	7.95	-0.10
9	14.73	21.99	1.79
10	9.99	13.63	0.38
11	12.92	15.69	1.25
12	4.29	6.57	-1.32

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \color{blue}{\left ( \frac{y_i - \bar{y}}{s_y} \right )}}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \color{blue}{\left ( z_y \right )}}{n - 1} = {r_{xy}}\]

Vocab_sample	x	y	z_x	z_y
1	9.50	12.44	0.23	0.31
2	4.25	3.58	-1.33	-1.41
3	6.45	9.71	-0.68	-0.22
4	6.94	7.69	-0.53	-0.61
5	5.47	5.59	-0.97	-1.02
6	10.71	14.77	0.59	0.77
7	11.07	10.24	0.70	-0.11
8	8.39	7.95	-0.10	-0.56
9	14.73	21.99	1.79	2.17
10	9.99	13.63	0.38	0.55
11	12.92	15.69	1.25	0.94
12	4.29	6.57	-1.32	-0.82

\[\frac{\sum_{i}^{n} \color{purple}{\left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}}{n - 1} = \frac{\sum_{i}^{n} \color{purple}{\left ( z_x \right ) \left ( z_y \right )}}{n - 1} = {r_{xy}}\]

Vocab_sample	x	y	z_x	z_y	(z_x)(z_y)
1	9.50	12.44	0.23	0.31	0.07
2	4.25	3.58	-1.33	-1.41	1.88
3	6.45	9.71	-0.68	-0.22	0.15
4	6.94	7.69	-0.53	-0.61	0.32
5	5.47	5.59	-0.97	-1.02	0.99
6	10.71	14.77	0.59	0.77	0.45
7	11.07	10.24	0.70	-0.11	-0.08
8	8.39	7.95	-0.10	-0.56	0.06
9	14.73	21.99	1.79	2.17	3.88
10	9.99	13.63	0.38	0.55	0.21
11	12.92	15.69	1.25	0.94	1.17
12	4.29	6.57	-1.32	-0.82	1.08

\[\frac{\color{green}{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}}{n - 1} = \frac{\color{green}{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}}{n - 1} = {r_{xy}}\]

Vocab_sample	x	y	z_x	z_y	(z_x)(z_y)
1	9.50	12.44	0.23	0.31	0.07
2	4.25	3.58	-1.33	-1.41	1.88
3	6.45	9.71	-0.68	-0.22	0.15
4	6.94	7.69	-0.53	-0.61	0.32
5	5.47	5.59	-0.97	-1.02	0.99
6	10.71	14.77	0.59	0.77	0.45
7	11.07	10.24	0.70	-0.11	-0.08
8	8.39	7.95	-0.10	-0.56	0.06
9	14.73	21.99	1.79	2.17	3.88
10	9.99	13.63	0.38	0.55	0.21
11	12.92	15.69	1.25	0.94	1.17
12	4.29	6.57	-1.32	-0.82	1.08
Sum	104.71	129.85	0	0	10.18
Mean	8.73	10.82	0	0	NA
SD	3.36	5.15	1	1	NA

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {\color{red}{r_{xy}}}\]

# Sum the products of the paired z-scores
my_sum <- sum(vocab_sample$`(z_x)(z_y)`)
# Denominator of the formula: n - 1
my_n <- nrow(vocab_sample) - 1
# Pearson's r
my_r <- my_sum / my_n
my_r

[1] 0.9254545

10.18 / (12 - 1)

[1] 0.9254545

Pearson’s correlation coefficient = 0.9254545.
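
The whole walkthrough can be reproduced from the twelve x and y values in the table. This is a sketch that rebuilds the data by hand. The data frame name `vocab` and the column names are assumptions made here for illustration. Because the table values are rounded to two decimals, the result should be close to, but not necessarily identical to, the value shown above.

```r
# x and y values copied from the table above
vocab <- data.frame(
  x = c(9.50, 4.25, 6.45, 6.94, 5.47, 10.71,
        11.07, 8.39, 14.73, 9.99, 12.92, 4.29),
  y = c(12.44, 3.58, 9.71, 7.69, 5.59, 14.77,
        10.24, 7.95, 21.99, 13.63, 15.69, 6.57)
)

# Step 1: z-scores for each variable
vocab$z_x <- (vocab$x - mean(vocab$x)) / sd(vocab$x)
vocab$z_y <- (vocab$y - mean(vocab$y)) / sd(vocab$y)

# Step 2: sum the products and divide by n - 1
r_manual <- sum(vocab$z_x * vocab$z_y) / (nrow(vocab) - 1)
r_manual

# Step 3: compare with the built-in function
cor(vocab$x, vocab$y)
```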

Doing it in R

There are two useful functions:

cor()
cor.test()

cor(mtcars$mpg, mtcars$disp)

[1] -0.8475514

cor.test(mtcars$mpg, mtcars$disp)


    Pearson's product-moment correlation

data:  mtcars$mpg and mtcars$disp
t = -8.7472, df = 30, p-value = 9.38e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9233594 -0.7081376
sample estimates:
       cor 
-0.8475514
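
The printed output can also be accessed piece by piece, since `cor.test()` returns a list. The sketch below stores the result in `res` (an arbitrary name) and pulls out the main elements. The last lines recompute the t statistic from r and the degrees of freedom using the standard identity t = r * sqrt(df / (1 - r^2)), which should match the value in the output.

```r
res <- cor.test(mtcars$mpg, mtcars$disp)

res$estimate   # the correlation coefficient, r
res$parameter  # degrees of freedom (n - 2)
res$p.value    # p-value for H0: true correlation is 0
res$conf.int   # 95% confidence interval for r

# Recover the t statistic from r and the degrees of freedom
r  <- unname(res$estimate)
df <- unname(res$parameter)
t_manual <- r * sqrt(df / (1 - r^2))
t_manual
```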

A word of warning…

Correlation \(\neq\) causation
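
A small simulation can make the warning concrete. This is a sketch under invented assumptions: `z` is a hypothetical unobserved common cause, and `x` and `y` each depend on `z` but not on each other. They still end up correlated.

```r
set.seed(2025)
n <- 1000

z <- rnorm(n)        # hypothetical common cause (not observed)
x <- z + rnorm(n)    # x depends on z, not on y
y <- z + rnorm(n)    # y depends on z, not on x

# x and y are clearly correlated even though neither causes the other
cor(x, y)
```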

Practice: test scores

Practice - test scores

Load the test_scores_rm dataset from the ds4ling package.

library(ds4ling)
data(test_scores_rm)
test_scores_rm

# A tibble: 16 × 4
   id     spec  test1 test2
   <chr>  <chr> <dbl> <dbl>
 1 span01 g1_lo  64.3  69.2
 2 span02 g1_lo  59.8  63.7
 3 span03 g1_hi  66.1  70.9
 4 span04 g1_hi  72.8  79.2
 5 span05 g2_lo  68.3  75.4
 6 span06 g2_lo  69.2  76.7
 7 span07 g2_hi  71.4  77.2
 8 span08 g2_hi  80.4  88.9
 9 cata01 g1_lo  75.6  83.6
10 cata02 g1_lo  71.2  78.8
11 cata03 g1_hi  69.1  74.6
12 cata04 g1_hi  72.4  80.7
13 cata05 g2_lo  71.7  77.9
14 cata06 g2_lo  69.0  75  
15 cata07 g2_hi  69.9  76  
16 cata08 g2_hi  77.3  85.6

References

Darwin, C. (2016). On the origin of species (1859). Routledge: London.

Ferrari, G. R., & Griffith, T. (2000). Plato: The Republic. Cambridge University Press: Cambridge.

Figueredo, A. J. (2013a). The Darwinian worldview in statistics. Statistical Methods in Psychological Research.

Figueredo, A. J. (2013b). The Newtonian worldview in statistics. Statistical Methods in Psychological Research.

Galton, F. (1869). Hereditary genius: An inquiry into its laws and consequences. Macmillan & Co.: London.

Johnson, K. (2011a). Fundamentals of quantitative analysis. In K. Johnson (Ed.), Quantitative methods in linguistics (pp. 1–33). Wiley.

Johnson, K. (2011b). Patterns and tests. In K. Johnson (Ed.), Quantitative methods in linguistics (pp. 34–69). Wiley.

Newton, I. (1687). Philosophiae naturalis principia mathematica (mathematical principles of natural philosophy). London.

Quetelet, L. A. J. (1869). Sur l’homme et le développement de ses facultés, ou essai de physique sociale (Vol. 2).

Ross, W. D., & Smith, J. A. (1912). The works of Aristotle: De Partibus Animalium; De Motu; De Incessu Animalium; De Generatione Animalium. Clarendon Press.

Takahashi, S. (2009). The manga guide to statistics. Trend-Pro CO., LTD.

Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.
