Data Science for Linguists

Getting started

Joseph V. Casillas, PhD

Rutgers University, Spring 2025
Last update: 2025-01-04

What is Data Science?

This process should be version controlled!




So what is version control?




Don’t forget the stats…







library(ggplot2)

# Scatterplot of mpg against displacement with a fitted regression line
mtcars |>
  ggplot() +
  aes(x = disp, y = mpg) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
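
The straight line in the plot above is a simple linear regression of mpg on disp. As a minimal sketch (the object name `fit` is just an illustrative choice), the same model can be fit directly with base R to see the numbers behind the line:

```r
# Fit the linear model that corresponds to the smoothed line in the plot
fit <- lm(mpg ~ disp, data = mtcars)

# Intercept and slope of the fitted line
coef(fit)

# Full summary: estimates, standard errors, t-values, and R-squared
summary(fit)
```

The slope for disp is negative: cars with larger engines tend to get fewer miles per gallon.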

Literate programming

  • This means we write code in a way that clearly documents what we did.

  • Instead of writing code with the purpose of telling the computer what to do, we write code that tells other humans what we told the computer to do and why.

  • Importantly, we don’t separate our code from the report/essay/manuscript we are writing. Everything is together, in a single document (usually).
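
One way to picture this is an R script in which the prose and the code live side by side. The sketch below assumes knitr's "spin" convention, where lines starting with `#'` are treated as report text and everything else is run as code. The analysis itself is only an illustration and is not taken from the course materials.

```r
#' # Fuel economy by number of cylinders
#'
#' We summarise miles per gallon (`mpg`) for each cylinder count in the
#' built-in `mtcars` data, because we expect larger engines to be less
#' efficient.

# Mean mpg for each value of cyl
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

#' The table below is produced by the code itself, so the report cannot
#' drift out of sync with the analysis.
mpg_by_cyl
```

If this were saved as a file (say, a hypothetical `analysis.R`), `knitr::spin("analysis.R")` would turn it into a report that contains the text, the code, and the output together.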

In this class you will learn to…

  • manage version-controlled research projects in a way that facilitates collaboration and honesty

  • get and tidy data

  • transform and visualize your data

  • fit statistical models to your data and test hypotheses

  • communicate your results using literate programming

This is reproducible research

What you’ll need…

Programs and packages

Programs we will use

  • Slack
  • R
  • RStudio
  • GitHub account

Slack

  • Slack is a communication platform we’ll use to discuss specific topics outside of class.
  • It also serves as a resource for students to ask questions and exchange information relevant to the course.
  • You will receive an email with an invitation link to join the course Slack (www.ru-ds4ling.slack.com).
  • For personal matters (only) you can email the professor.
  • Need help? Instructions

Slack app

  • Get class notifications 24/7, everywhere you go 🤓
  • There is a downloadable app so that you don’t have to use the web interface

R

  • R is the statistical programming language we will learn about in this class.

  • You can download R here: https://cran.r-project.org

  • Need help? Instructions

RStudio (Posit, Positron)

R packages we will use

Obligatory

  • tidyverse: Install and load tidyverse packages

  • ds4ling: Functions and datasets used in this course

  • knitr: Dynamic report generation

  • rmarkdown: Dynamic documents

  • papaja: Reproducible APA manuscripts in RMarkdown

  • xaringan: HTML presentations in RMarkdown

  • here: Reproducible way to set working directory

  • devtools: Install packages from GitHub

You can install a package in R using the following command:

install.packages("packageName")
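
To avoid reinstalling packages you already have, you can first check what is missing. The following is a minimal sketch: `missing_packages()` is a hypothetical helper written for this example, and it assumes the packages listed in `course_pkgs` are available from CRAN. ds4ling is left out because the slides do not say where it is hosted.

```r
# Course packages from the list above that are assumed to be on CRAN
course_pkgs <- c("tidyverse", "knitr", "rmarkdown", "papaja",
                 "xaringan", "here", "devtools")

# Hypothetical helper: return the packages in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Which course packages still need to be installed on this machine?
to_install <- missing_packages(course_pkgs)
to_install

# Uncomment to install whatever is missing (requires an internet connection)
# if (length(to_install) > 0) install.packages(to_install)
```

`install.packages()` accepts a character vector, so all missing packages can be installed in one call.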

R packages we will use

Helpful

  • lme4: Multilevel models

  • brms: Bayesian data analysis

  • patchwork: Combine ggplots

  • broom: Stat models to tidy dataframe

  • learnr: Interactive tutorials

  • stringr: For manipulating strings

  • sjPlot: For making plots and tables from model objects

GitHub

  • GitHub is a web-based hosting service for Git version control repositories.

  • It is mostly used for computer code (like Dropbox for nerds).

  • We will use GitHub for project management and sharing reproducible reports.

  • Need help? Instructions

GitHub Desktop

  • GitHub Desktop can make interacting with Git much easier

  • You can download the app here: https://desktop.github.com

Getting help

If you have problems setting up any of the aforementioned software, ask for help in the Slack channel.


R, RStudio, RMarkdown, GitHub, and Slack here