Data Science for Linguists

Bivariate correlation

Joseph V. Casillas, PhD

Rutgers University, Spring 2025
Last update: 2025-02-15

But first…

World views and statistical assumptions

Statistical assumptions

Statistics are about more than just mathematics…

  • Where do statistical assumptions come from?
  • Who made these statistical assumptions and why?
  • Are statistical assumptions truly consistent with the nature of reality?

Statistical assumptions

Where do mathematical concepts in general come from?

  • They are influenced by worldviews/paradigms/research traditions
  • Choices are informed, and they need to be logically thought out
  • Statistics can be mathematically valid without answering the original research question (what you really want to know)
  • We need conceptual as well as mathematical validity

Statistical World-Views


Two major world views (Weltanschauungen) have historically influenced the field of Statistics:

  1. The Newtonian Worldview
  2. The Darwinian Worldview

Plato (380 BC) The Republic

  • Believed in absolute God-given categories
  • Things possess an essence of a type (e.g., species)
  • Individual variations are “imperfect reflections” of ideal types (eidolon, ειδωλον)… therefore not important
  • We can’t trust our sensory perceptions… we can only arrive at truth using pure reason

Aristotle (350 BC) De Partibus Animalium

  • Different idea of categories than Plato
  • Tried to create a taxonomy of animals
  • Taxonomy based on specific parts of animals that define them as a part of group
  • Category membership defined by having a specific feature that members of other groups don’t possess
    • Birds have wings, but monkeys don’t, so only winged animals could be considered birds

Both philosophers believed in categories:

But they disagreed about what categories were and where they came from

Plato

  • Categories are God-given
  • Things possess an essence of a type
  • Observable features are not reliable because they are based on sense perceptions

Aristotle

  • Categories are inherent in the individual
  • Individuals need to have some identifiable, visible feature to be classified
  • It is the possession of certain observable features that puts individuals in categories

To Aristotle the individual was ultimate reality, but to Plato the individual was an imperfect reflection (eidolon, ειδωλον) of the perfect category, or “ideal type”

What about Newton and Darwin?

Alfonso X “El Sabio” of Castile to the rescue!

  • Had Arabic scientific and philosophical texts translated into Castilian, restoring long-lost classical knowledge to Christian Europe
  • The Arabic translation of De Partibus Animalium (Aristotle, 350 BC) comprises treatises 11-14 of the Kitāb al-Hayawān by Yahyà bin al-Bitrīq
  • “El libro de los animales”
  • Had a lasting effect on 14th-century (medieval) Scholastic philosophy
    • Essentialism
    • Nominalism

Newton and Darwin

Essentialists were Platonists:

  • Believed that categories are real and God-given
  • The essentialist (platonic) worldview is highly influential: Isaac Newton was an essentialist
  • The work of Newton directly influences big names responsible for introducing statistical methods to the social sciences
    • Laplace: Laplace-Gauss distribution, least squares estimation
    • Quételet: L’homme moyen (the “average man”) is characterized by the mean values of measured variables that follow a normal distribution (golden mean)

Newton and Darwin

Nominalists were Aristotelians:

  • Like Aristotle, but much more radical
  • They thought categories didn’t really exist in nature
  • Darwin was a nominalist
  • We create categories because we see common features and we arbitrarily try to sort individuals into these categories

Newtonian thinking:

The “Golden Mean” as Target?

Newtonian thinking:

Applied to Social Science

  • Wanted to base the Social Sciences on purely physical principles
  • The normal distribution was developed to describe measurement error:
    • The normal distribution is the shape of error
    • Found that many traits were normally distributed
    • So was there a “golden mean” that nature aimed at, and was the distribution around this mean just “error”?
  • Wanted to create a set of universal laws to govern all human behavior using the golden mean:
    • Standard deviation was just “error”
    • In Structuralist psychology, Wundt and others were looking for universal principles of human behavior
    • They also saw individual differences as error

Persistent Influences of the Newtonian World-View

Later development: the Analysis of Variance (ANOVA) characterizes all individual differences (variance) as error.

  • This comes from the dominance of the Newtonian worldview within experimental psychology
  • Many statistical procedures just manipulate people in an attempt to discover some “truth” about some aspect of human behavior
  • This is still a common approach today

The Nominalists

Charles Robert Darwin, FRS (1809-1882)

  • On the Origin of Species (1859)
  • Also addressed the nature of social and behavioral sciences
  • Clearly a Nominalist
    • Species are transitory
    • They grade into one another
    • They have no objective or permanent existence
  • Species are mental constructs we invent to conceptualize populations
    • Individuals are what are important, not species
    • Principles of continuum and contiguity

Nominalist Implications

  • If frequencies of heritable traits within populations of individuals change over time then the entire species, and the definition of the species, changes with them:
    • No strict Biblical “bringing forth living creatures after their kind”
    • Species are mental constructs that we invent to help us conceptualize “snapshots” of evolving populations at a single point in time

The Darwinian Worldview

The Darwinians:

Sir Francis Galton

  • Hereditary Genius (1869)
  • Strongly influenced by his cousin: Charles Robert Darwin
  • English Victorian polymath
  • Founder of:
    • Biological Psychology
    • Differential Psychology
    • Eugenics Movement
  • Wanted to investigate heritability and realized that Quételet’s methods and statistical models wouldn’t work for this purpose

Attack of the Clones

  • If we were all clones, when we reproduced there would be no significant difference between my offspring and yours:
    • So my kids would be no more like me than they would be like you
  • Any differences between individuals would be random error:
    • i.e., a deviation from the “golden mean”
    • It would be a deviation from the general traits of humans, not a unique heritable difference
  • But we know this is not the case
    • My kids are more like me than like you
    • How do we show this?

The Darwinians: Sir Francis Galton

Twin and Adoption Studies

  • Galton invented both twin studies and adoption studies to investigate this:
    • Differences between monozygotic (MZ) twins should reflect environmental differences, and differences between adopted siblings should reflect genetic differences
  • These were purely observational studies

Deviations from the Mean

  • Galton needed Karl Pearson to develop the mathematical tools to analyze these data
  • Key was the deviation from the mean:
    • Individuality defined as the deviation from the mean
    • My deviation and my offspring’s deviation should be similar
    • My kids don’t start back at the “ideal human” (“l’homme moyen”) mean and then make deviations from there
    • They start from my genetic contributions and make deviations from there
  • He wanted to compare deviations from the mean among individuals, and then use these to measure the deviations of the group

The Darwinians: Karl Pearson

  • Established the discipline of mathematical statistics
  • Founded the world’s first university statistics department at University College London in 1911
  • Controversial proponent of eugenics
  • Protégé and biographer of Sir Francis Galton
  • Not a “Sir”
  • Galton and Pearson jointly developed the correlation coefficient (r) in 1895
  • Also invented the chi-squared test:
    • Started a feud with Sir Ronald Fisher about it

Correlation coefficient (r)

“The Coefficient of Co-relation”

The descriptive statistic that measures the degree of linear association between any two variables:

  • Ranges in value from -1 to +1
  • When r = 0 there is no linear association
  • Can be used to compare two people on the same variable
  • Can be used to compare two variables within one person

Correlation coefficient (r)

Individual Differences

  • The correlation coefficient is (was) about comparing individual differences
  • Recall that z-scores let us compare two different variables and see how they deviate from the mean
  • You can compare weight and IQ, or “apples and oranges”, by using standardized scores
  • The correlation coefficient is about the association between two variables
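
A minimal sketch of the "apples and oranges" point, assuming the built-in mtcars data as a stand-in (weight in 1000 lbs and miles per gallon are on very different scales). The object names z_wt, z_mpg, r_lbs and r_kg are made up for this example.

```r
# Standardize two variables measured in different units
z_wt  <- (mtcars$wt  - mean(mtcars$wt))  / sd(mtcars$wt)
z_mpg <- (mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)

# After standardizing, both have mean 0 and standard deviation 1
round(c(mean(z_wt), sd(z_wt), mean(z_mpg), sd(z_mpg)), 2)

# Because r is built from z-scores, changing the units of a variable
# (here 1000 lbs to kg) leaves the correlation unchanged
r_lbs <- cor(mtcars$wt, mtcars$mpg)
r_kg  <- cor(mtcars$wt * 453.592, mtcars$mpg)
all.equal(r_lbs, r_kg)
```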

Z-Scores & Correlation Coefficients

The Z-Scores:

\[z_x = \frac{(x_i - \bar{x})} {s_x} \qquad \qquad z_y = \frac{(y_i - \bar{y})} {s_y}\]


Pearson’s “Coefficient of Co-relation”:

\[r_{xy} = \frac{\sum (z_x) (z_y)}{n - 1}\]
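
It may help to see that this is the same quantity as the sample covariance divided by the product of the two standard deviations. The steps below only substitute the z-score definitions into the formula; \(\mathrm{cov}(x, y)\) is notation introduced here for the sample covariance and does not appear elsewhere in these slides.

\[r_{xy} = \frac{\sum (z_x) (z_y)}{n - 1} = \frac{1}{n - 1} \sum \frac{(x_i - \bar{x})}{s_x} \frac{(y_i - \bar{y})}{s_y} = \frac{\frac{1}{n - 1} \sum (x_i - \bar{x}) (y_i - \bar{y})}{s_x s_y} = \frac{\mathrm{cov}(x, y)}{s_x s_y}\]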

Pearson’s correlation coefficient

Details

  • A Perfect Correlation:
    \(r_{xy} = 1\)
  • An Imperfect Correlation:
    \(r_{xy} < 1\)
  • Mismatch between \(z_x\) and \(z_y\) produces correlation coefficients lower than 1
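
The contrast between matching and mismatched z-scores can be sketched as follows. All values are hypothetical and chosen only for illustration: y_perfect is an exact linear function of x, while y_noisy follows the same trend with some scatter.

```r
# Hypothetical scores
x <- c(2, 4, 6, 8, 10)
y_perfect <- 3 * x + 1
y_noisy <- c(8, 12, 21, 24, 33)

# When y is an exact increasing linear function of x, the z-scores are identical
all.equal(as.vector(scale(x)), as.vector(scale(y_perfect)))

# Identical z-scores give r = 1
r_perfect <- cor(x, y_perfect)
r_perfect

# Mismatched z-scores pull r below 1
r_noisy <- cor(x, y_noisy)
r_noisy < 1
```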

Assumptions

  • Scale of measurement should be interval or ratio
  • Variables should be approximately normally distributed
  • The association should be linear
  • There should be no outliers in the data
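
One informal way to look at these assumptions before computing r, sketched with two variables from the built-in mtcars data. The choice of checks here (scatterplot, Q-Q plot, Shapiro-Wilk test, Spearman's rank correlation as a fallback) is an assumption of this example, not a procedure prescribed in the slides.

```r
# Two numeric (ratio-scale) variables from a dataset that ships with R
x <- mtcars$disp
y <- mtcars$mpg

# Linearity and outliers: inspect the scatterplot first
plot(x, y, xlab = "disp", ylab = "mpg")

# Approximate normality: Q-Q plot, optionally a Shapiro-Wilk test
qqnorm(y)
qqline(y)
sw <- shapiro.test(y)
sw$p.value

# If the assumptions look doubtful, a rank-based coefficient is one alternative
r_spearman <- cor(x, y, method = "spearman")
r_spearman
```

Plots usually say more than any single test here: a curved cloud of points or one extreme observation can distort r even when a formal normality test raises no flags.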

Example: vocabulary size and age

Vocabulary size and age

Bivariate correlation

  • Average native test-takers of age 4 already know approx. 5,000 words
  • Average native test-takers of age 8 already know approx. 10,000 words
  • Average native test-takers of age 15 already know approx. 18,000 words
  • Most adult native test-takers range from 20,000–35,000 words

Let’s sample the data and calculate Pearson’s correlation coefficient for age and vocabulary size

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample x y
1 9.50 12.44
2 4.25 3.58
3 6.45 9.71
4 6.94 7.69
5 5.47 5.59
6 10.71 14.77
7 11.07 10.24
8 8.39 7.95
9 14.73 21.99
10 9.99 13.63
11 12.92 15.69
12 4.29 6.57

\[\frac{\sum_{i}^{n} \left ( \frac{\color{red}{x_i} - \bar{x}}{s_x} \right ) \left ( \frac{\color{blue}{y_i} - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample x y
1 9.50 12.44
2 4.25 3.58
3 6.45 9.71
4 6.94 7.69
5 5.47 5.59
6 10.71 14.77
7 11.07 10.24
8 8.39 7.95
9 14.73 21.99
10 9.99 13.63
11 12.92 15.69
12 4.29 6.57

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \color{red}{\bar{x}}}{\color{blue}{s_x}} \right ) \left ( \frac{y_i - \color{red}{\bar{y}}}{\color{blue}{s_y}} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample x y
1 9.50 12.44
2 4.25 3.58
3 6.45 9.71
4 6.94 7.69
5 5.47 5.59
6 10.71 14.77
7 11.07 10.24
8 8.39 7.95
9 14.73 21.99
10 9.99 13.63
11 12.92 15.69
12 4.29 6.57
Sum 104.71 129.85
Mean 8.73 10.82
SD 3.36 5.15

\[\frac{\sum_{i}^{n} \color{red}{\left ( \frac{x_i - \bar{x}}{s_x} \right )} \left ( \frac{y_i - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \color{red}{\left ( z_x \right )} \left ( z_y \right )}{n - 1} = {r_{xy}}\]

Vocab_sample x y z_x
1 9.50 12.44 0.23
2 4.25 3.58 -1.33
3 6.45 9.71 -0.68
4 6.94 7.69 -0.53
5 5.47 5.59 -0.97
6 10.71 14.77 0.59
7 11.07 10.24 0.70
8 8.39 7.95 -0.10
9 14.73 21.99 1.79
10 9.99 13.63 0.38
11 12.92 15.69 1.25
12 4.29 6.57 -1.32

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \color{blue}{\left ( \frac{y_i - \bar{y}}{s_y} \right )}}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \color{blue}{\left ( z_y \right )}}{n - 1} = {r_{xy}}\]

Vocab_sample x y z_x z_y
1 9.50 12.44 0.23 0.31
2 4.25 3.58 -1.33 -1.41
3 6.45 9.71 -0.68 -0.22
4 6.94 7.69 -0.53 -0.61
5 5.47 5.59 -0.97 -1.02
6 10.71 14.77 0.59 0.77
7 11.07 10.24 0.70 -0.11
8 8.39 7.95 -0.10 -0.56
9 14.73 21.99 1.79 2.17
10 9.99 13.63 0.38 0.55
11 12.92 15.69 1.25 0.94
12 4.29 6.57 -1.32 -0.82

\[\frac{\sum_{i}^{n} \color{purple}{\left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}}{n - 1} = \frac{\sum_{i}^{n} \color{purple}{\left ( z_x \right ) \left ( z_y \right )}}{n - 1} = {r_{xy}}\]

Vocab_sample x y z_x z_y (z_x)(z_y)
1 9.50 12.44 0.23 0.31 0.07
2 4.25 3.58 -1.33 -1.41 1.88
3 6.45 9.71 -0.68 -0.22 0.15
4 6.94 7.69 -0.53 -0.61 0.32
5 5.47 5.59 -0.97 -1.02 0.99
6 10.71 14.77 0.59 0.77 0.45
7 11.07 10.24 0.70 -0.11 -0.08
8 8.39 7.95 -0.10 -0.56 0.06
9 14.73 21.99 1.79 2.17 3.88
10 9.99 13.63 0.38 0.55 0.21
11 12.92 15.69 1.25 0.94 1.17
12 4.29 6.57 -1.32 -0.82 1.08

\[\frac{\color{green}{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}}{n - 1} = \frac{\color{green}{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}}{n - 1} = {r_{xy}}\]

Vocab_sample x y z_x z_y (z_x)(z_y)
1 9.50 12.44 0.23 0.31 0.07
2 4.25 3.58 -1.33 -1.41 1.88
3 6.45 9.71 -0.68 -0.22 0.15
4 6.94 7.69 -0.53 -0.61 0.32
5 5.47 5.59 -0.97 -1.02 0.99
6 10.71 14.77 0.59 0.77 0.45
7 11.07 10.24 0.70 -0.11 -0.08
8 8.39 7.95 -0.10 -0.56 0.06
9 14.73 21.99 1.79 2.17 3.88
10 9.99 13.63 0.38 0.55 0.21
11 12.92 15.69 1.25 0.94 1.17
12 4.29 6.57 -1.32 -0.82 1.08
Sum 104.71 129.85 0 0 10.18
Mean 8.73 10.82 0 0 NA
SD 3.36 5.15 1 1 NA

\[\frac{\sum_{i}^{n} \left ( \frac{x_i - \bar{x}}{s_x} \right ) \left ( \frac{y_i - \bar{y}}{s_y} \right )}{n - 1} = \frac{\sum_{i}^{n} \left ( z_x \right ) \left ( z_y \right )}{n - 1} = {\color{red}{r_{xy}}}\]


# Sum the z-score products, then divide by n - 1
my_sum <- sum(vocab_sample$`(z_x)(z_y)`)
my_n <- (nrow(vocab_sample) - 1)  # note: this stores n - 1, not n
my_r <- my_sum / my_n
my_r
[1] 0.9254545

or

10.18 / (12 - 1)
[1] 0.9254545

Pearson’s correlation coefficient = 0.9254545.
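
The whole hand calculation can be reproduced from the raw x and y values in the table. This is a sketch rather than the code used to build the slides: the data frame is retyped from the table, and the column name z_xz_y is a hypothetical stand-in for the (z_x)(z_y) column.

```r
# Raw x and y values copied from the vocab_sample table
vocab_sample <- data.frame(
  x = c(9.50, 4.25, 6.45, 6.94, 5.47, 10.71,
        11.07, 8.39, 14.73, 9.99, 12.92, 4.29),
  y = c(12.44, 3.58, 9.71, 7.69, 5.59, 14.77,
        10.24, 7.95, 21.99, 13.63, 15.69, 6.57)
)

# Standardize each variable, then multiply the z-scores row by row
vocab_sample$z_x <- (vocab_sample$x - mean(vocab_sample$x)) / sd(vocab_sample$x)
vocab_sample$z_y <- (vocab_sample$y - mean(vocab_sample$y)) / sd(vocab_sample$y)
vocab_sample$z_xz_y <- vocab_sample$z_x * vocab_sample$z_y

# Sum the products and divide by n - 1
r_by_hand <- sum(vocab_sample$z_xz_y) / (nrow(vocab_sample) - 1)

# Compare with the built-in function
r_builtin <- cor(vocab_sample$x, vocab_sample$y)
all.equal(r_by_hand, r_builtin)
```

Because this version works with unrounded z-scores, the result can differ slightly in the later decimal places from the value obtained by summing the rounded products in the table.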

Doing it in R

There are two useful functions:

  1. cor()

  2. cor.test()

cor(mtcars$mpg, mtcars$disp)
[1] -0.8475514
cor.test(mtcars$mpg, mtcars$disp)

    Pearson's product-moment correlation

data:  mtcars$mpg and mtcars$disp
t = -8.7472, df = 30, p-value = 9.38e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9233594 -0.7081376
sample estimates:
       cor 
-0.8475514 
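
The printed output of cor.test() is convenient to read, but the individual numbers can also be pulled out of the returned object for reporting. A minimal sketch using the same mtcars example; the object names (ct, r_est, ci, p_val, df) are arbitrary choices for this illustration.

```r
# Store the test result instead of only printing it
ct <- cor.test(mtcars$mpg, mtcars$disp)

# The result is a list; extract the pieces needed for a write-up
r_est <- unname(ct$estimate)   # correlation coefficient
ci    <- ct$conf.int           # 95% confidence interval (lower, upper)
p_val <- ct$p.value            # p-value for H0: true correlation is 0
df    <- unname(ct$parameter)  # degrees of freedom (n - 2)

round(c(r = r_est, lower = ci[1], upper = ci[2], df = df), 3)
```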

A word of warning…

Correlation \(\neq\) causation

Practice: test scores

Practice - test scores

Load the test_scores_rm dataset from the ds4ling package.

library(ds4ling)
data(test_scores_rm)
test_scores_rm
# A tibble: 16 × 4
   id     spec  test1 test2
   <chr>  <chr> <dbl> <dbl>
 1 span01 g1_lo  64.3  69.2
 2 span02 g1_lo  59.8  63.7
 3 span03 g1_hi  66.1  70.9
 4 span04 g1_hi  72.8  79.2
 5 span05 g2_lo  68.3  75.4
 6 span06 g2_lo  69.2  76.7
 7 span07 g2_hi  71.4  77.2
 8 span08 g2_hi  80.4  88.9
 9 cata01 g1_lo  75.6  83.6
10 cata02 g1_lo  71.2  78.8
11 cata03 g1_hi  69.1  74.6
12 cata04 g1_hi  72.4  80.7
13 cata05 g2_lo  71.7  77.9
14 cata06 g2_lo  69.0  75  
15 cata07 g2_hi  69.9  76  
16 cata08 g2_hi  77.3  85.6

References

Darwin, C. (2016). On the origin of species (1859). Routledge: London.
Ferrari, G. R., & Griffith, T. (2000). Plato: The Republic. Cambridge University Press: Cambridge.
Figueredo, A. J. (2013a). The Darwinian worldview in statistics. Statistical Methods in Psychological Research.
Figueredo, A. J. (2013b). The Newtonian worldview in statistics. Statistical Methods in Psychological Research.
Galton, F. (1869). Hereditary genius: An inquiry into its laws and consequences. Macmillan & Co.: London.
Johnson, K. (2011a). Fundamentals of quantitative analysis. In K. Johnson (Ed.), Quantitative methods in linguistics (pp. 1–33). Wiley.
Johnson, K. (2011b). Patterns and tests. In K. Johnson (Ed.), Quantitative methods in linguistics (pp. 34–69). Wiley.
Newton, I. (1687). Philosophiae naturalis principia mathematica (mathematical principles of natural philosophy). London.
Quetelet, L. A. J. (1869). Sur l’homme et le développement de ses facultés, ou essai de physique sociale (Vol. 2).
Ross, W. D., & Smith, J. A. (1912). The works of Aristotle: De Partibus Animalium; De Motu; De Incessu Animalium; De Generatione Animalium. Clarendon Press.
Takahashi, S. (2009). The manga guide to statistics. Trend-Pro CO., LTD.
Wickham, H., & Grolemund, G. (2016). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media.