Welcome

Brief intro to R
🙌

Ivan Castro
VISN2 Center for Integrated Healthcare


What we will cover today

  • Just the tip of the iceberg...

  • There's not enough time to cover everything

  • The content presented today is largely based on the Data Science in a Box materials

  • Given our roles, we will focus on:

    1. How to start interacting with R

    2. How to wrangle data in R

Things we won't cover

...but I'll gladly help you with otherwise:

  • Advanced data wrangling
  • Detailed data visualization
  • Data modelling
  • Handling spatial data

What is Data Wrangling?


The data analysis cycle

Anyone who has ever taken wild-caught data through the full process of analysis knows that statistics, in the strict sense of fitting models and doing inference, is but one small part of the process.

Bryan & Wickham (2017)


Who Am I?

  • Coordinator in Dr. Possemato's lab

  • Fairly new to CIH (less than a year)

  • Learned R (largely on my own) during graduate school

  • Trained in biostats

  • Enjoy the challenge of wrangling messy data


Find me at...

BHOC F204

ivan.castro@va.gov

iecastro@syr.edu

iecastro

iecastro


Meet the toolkit


Toolkit

  • Scriptability: R
  • Literate programming (code, narrative, output in one place): R Markdown
  • Version control: Git / GitHub

Reproducible data wrangling and analysis


Reproducibility checklist

What does it mean for a data analysis to be "reproducible"?

Near-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

From manual tasking...


... to reproducible code


Reproducible plots


R and RStudio


What is R

  • R is a statistical programming language

  • But why learn programming?

You must use a computer to do data science; you cannot do it in your head, or with pencil and paper.

Hadley Wickham

  • Don't be discouraged by the word programming; R is first and foremost for data analysis

➥ Source: R for Data Science


What is RStudio?

  • RStudio is a convenient interface for R (an integrated development environment, IDE)
  • At its simplest:
    • R is like a car’s engine
    • RStudio is like a car’s dashboard

➥ Source: Modern Dive


Let's take a tour - R / RStudio

Follow this link and log in with your Google account:

https://rstudio.cloud/project/395951

Concepts introduced:

  • Console
  • Using R as a calculator
  • Environment
  • Loading and viewing a data frame
  • Accessing a variable in a data frame
  • R functions

R essentials

A short list (for now):

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)
  • Columns (variables) in data frames are accessed with $:
dataframe$var_name
  • Packages are installed once with the install.packages function and loaded with the library function once per session:
install.packages("package_name")
library(package_name)
  • For this project we'll need the following packages:
install.packages(c("tidyverse", "devtools", "datasauRus", "fivethirtyeight", "janitor", "DT"))
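
As a quick illustration of the points above, here is a minimal sketch that uses only base R. The built-in mtcars data frame and the variable names are chosen here for convenience and are not part of the workshop materials.

```r
# Functions are verbs applied to their arguments.
values <- c(4, 8, 15)
avg <- mean(values)                    # do_this(to_this)
rounded <- round(10 / 3, digits = 2)   # do_that(to_this, with_those)

# Columns (variables) in a data frame are accessed with $.
mpg <- mtcars$mpg                      # mtcars ships with base R
n_cars <- length(mpg)

# Packages are loaded with library(); stats comes with every R installation,
# so this call runs without installing anything first.
library(stats)
```

Running these lines in the Console and then looking at the Environment pane shows each object that was created.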

tidyverse

  • The tidyverse is an opinionated collection of R packages designed for data science.

  • All packages share an underlying philosophy and a common grammar.


R Markdown

  • Fully reproducible reports -- each time you knit, the analysis is run from the beginning

  • Simple markdown syntax for text

  • Code goes in chunks delimited by three backticks; narrative goes outside the chunks


Let's take a tour - R Markdown

Go to RStudio Cloud and open the application exercise Bechdel.

~/appex/ae-bechdel.Rmd

Concepts introduced:

  • Knitting documents

  • R Markdown and (some) R syntax


Bechdel Test

What is the Bechdel test?

The Bechdel test asks whether a work of fiction features at least two named women characters who talk to each other about something other than a man.

  • Knit the R Markdown document.

Other things you can make in R Markdown

This presentation was written in R Markdown

HTML resume

Blog / Website

... ok, enough self promotion 👨‍💼


R Markdown help

Markdown Quick Reference
Help -> Markdown Quick Reference


Workspaces

Remember this, and expect it to bite you a few times as you're learning to work with R Markdown: The workspace of your R Markdown document is separate from the Console!

  • Run the following in the console
x <- 2
x * 3

All looks good, eh?

  • Then, add the following chunk in your R Markdown document and knit it
x * 3

What happens? Why the error?
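
One way to picture the answer: knitting evaluates the document in its own workspace, so objects created in the Console are not visible there. The sketch below imitates this with two separate environments. The names console_env and knit_env are illustrative, and this is only an analogy for what happens when the document is knitted, not how knitting is implemented.

```r
# Two separate workspaces, standing in for the Console and the knitted document.
console_env <- new.env()
knit_env <- new.env()

# "Run in the console": x exists only in the console workspace.
assign("x", 2, envir = console_env)
console_result <- eval(quote(x * 3), envir = console_env)

# "Knit the document": the same expression fails because x was never
# defined in the document's own workspace.
knit_result <- tryCatch(
  eval(quote(x * 3), envir = knit_env),
  error = function(e) "error: x not found"
)
```

The fix in a real document is to define x inside a chunk of the R Markdown file itself, before the chunk that uses it.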


Git and GitHub


Version control

  • GitHub as a platform for collaboration

  • It's actually designed for version control


Versioning

with human readable messages


Why do we need version control?


Git and GitHub tips

  • Git is a version control system -- like the “Track Changes” feature in Microsoft Word, on steroids. GitHub is the home for your Git-based projects on the internet (like Dropbox, but much, much better).

  • Git and GitHub workflows are outside the scope of this workshop.

  • There is a great resource for working with git and R: happygitwithr.com.


Tidy data and data wrangling
🔧


Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Characteristics of tidy data: 😄

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Characteristics of untidy data: 😦

!@#$%^&*()

➥ Source: R for Data Science
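
The characteristics of tidy data can be made concrete with a small sketch. The clinic visit counts below are made up for illustration, and pivot_longer from the tidyr package is just one way to reshape them; it assumes tidyr 1.0.0 or later is installed.

```r
library(tidyr)

# Hypothetical untidy table: the variable "year" is spread across column names.
untidy <- data.frame(
  clinic = c("A", "B"),
  `2018` = c(10, 20),
  `2019` = c(12, 18),
  check.names = FALSE
)

# Tidy version: each variable (clinic, year, visits) forms a column,
# and each clinic-year observation forms a row.
tidy <- pivot_longer(
  untidy,
  cols = -clinic,
  names_to = "year",
  values_to = "visits"
)
```

The tidy table has four rows (two clinics times two years) instead of two.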


Like families, tidy datasets are all alike but every messy dataset is messy in its own way.

Hadley Wickham


Summary tables

Is each of the following a dataset or a summary table?

## # A tibble: 87 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # … with 77 more rows
## # A tibble: 5 x 2
## gender avg_height
## <chr> <dbl>
## 1 female 165.
## 2 hermaphrodite 175
## 3 male 179.
## 4 none 200
## 5 <NA> 120
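
The slides do not show the code behind these two printouts. A plausible sketch, assuming they come from the starwars data that ships with dplyr, is given below. The object names are illustrative, and the category labels in gender differ between dplyr versions, so the summary may not match the printout above exactly.

```r
library(dplyr)

# A dataset: one row per character (observation).
characters <- starwars %>%
  select(name, height, mass)

# A summary table: one row per group, computed from the dataset.
avg_heights <- starwars %>%
  group_by(gender) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))
```

The first object keeps the individual observations; the second collapses them to one value per group.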

Pipes


Where does the name come from?

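The pipe operator, %>%, is implemented in the magrittr package; it is pronounced "and then".

The package name is a nod to René Magritte's painting The Treachery of Images, with its caption "Ceci n'est pas une pipe" ("This is not a pipe").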

➥ Vignette: magrittr


Review: How does a pipe work?

  • Consider the following sequence of actions: find keys, start car, drive to campus, park.
  • Expressed as a set of nested functions in R pseudocode this would look like:

park(drive(start_car(find("keys")), to = "campus"))
  • Writing it out using pipes gives it a more natural (and easier to read) structure:
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
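
The car example is pseudocode. A runnable counterpart, using ordinary base R functions picked here only for illustration, might look like the sketch below; the nested and piped versions compute the same value.

```r
library(magrittr)

numbers <- c(1, 4, 9, 16)

# Nested: read from the inside out.
nested <- round(sqrt(sum(numbers)), 1)

# Piped: read top to bottom as "sum, and then square root, and then round".
piped <- numbers %>%
  sum() %>%
  sqrt() %>%
  round(1)
```

Both nested and piped hold the square root of 30 rounded to one decimal place.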

What about other arguments?

To send the result to an argument other than the first one, or to use the previous result for multiple arguments, use the . placeholder:

starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
##
## Call:
## lm(formula = mass ~ height, data = .)
##
## Coefficients:
## (Intercept) height
## -116.58 1.11
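
The same placeholder pattern can be tried without dplyr. The sketch below swaps in the built-in mtcars data and base R's subset, both chosen here for illustration: lm expects a formula as its first argument, so the piped data frame is passed to data through the dot.

```r
library(magrittr)

# Keep the 4-cylinder cars, then fit a model on that subset.
# The dot marks where the piped data frame should go.
fit <- mtcars %>%
  subset(cyl == 4) %>%
  lm(mpg ~ wt, data = .)

n_used <- nobs(fit)  # number of rows the model was fitted on
```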

Data wrangling


Bike crashes in NC 2007 - 2014

The dataset is in the dsbox package:

  • Packages hosted on GitHub require a special install command

  • The remotes package is installed automatically along with devtools

remotes::install_github("rstudio-education/dsbox")
library(dsbox)
ncbikecrash

Variables

View the names of variables via

names(ncbikecrash)
## [1] "object_id" "city" "county"
## [4] "region" "development" "locality"
## [7] "on_road" "rural_urban" "speed_limit"
## [10] "traffic_control" "weather" "workzone"
## [13] "bike_age" "bike_age_group" "bike_alcohol"
## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury"
## [19] "bike_position" "bike_race" "bike_sex"
## [22] "driver_age" "driver_age_group" "driver_alcohol"
## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury"
## [28] "driver_race" "driver_sex" "driver_vehicle_type"
## [31] "crash_alcohol" "crash_date" "crash_day"
## [34] "crash_group" "crash_hour" "crash_location"
## [37] "crash_month" "crash_severity" "crash_time"
## [40] "crash_type" "crash_year" "ambulance_req"
## [43] "hit_run" "light_condition" "road_character"
## [46] "road_class" "road_condition" "road_configuration"
## [49] "road_defects" "road_feature" "road_surface"
## [52] "num_bikes_ai" "num_bikes_bi" "num_bikes_ci"
## [55] "num_bikes_ki" "num_bikes_no" "num_bikes_to"
## [58] "num_bikes_ui" "num_lanes" "num_units"
## [61] "distance_mi_from" "frm_road" "rte_invd_cd"
## [64] "towrd_road" "geo_point" "geo_shape"

and see detailed descriptions with ?ncbikecrash.


Viewing your data

  • After loading the data with data(ncbikecrash), click on the name of the data frame in the Environment pane to view it in the data viewer

  • Use the glimpse function to take a peek

glimpse(ncbikecrash)
## Observations: 7,467
## Variables: 66
## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1…
## $ city <chr> "None - Rural Crash", "Henderson", "None - …
## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N…
## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal…
## $ development <chr> "Farms, Woods, Pastures", "Residential", "F…
## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70…
## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK…
## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban"…
## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 M…
## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double …
## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear",…
## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "…
## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24"…
## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traff…
## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", …
## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane"…
## $ bike_race <chr> "Black", "Black", "White", "Black", "White"…
## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "…
## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50",…
## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-…
## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "M…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1…
## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No In…
## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Bl…
## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Fema…
## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "…
## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D…
## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturd…
## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist…
## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22…
## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-In…
## $ crash_month <chr> "December", "November", "November", "Decemb…
## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", …
## $ crash_time <drtn> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13…
## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist…
## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y…
## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes",…
## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - R…
## $ road_character <chr> "Straight - Level", "Straight - Level", "St…
## $ road_class <chr> "State Secondary Route", "Local Street", "U…
## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi…
## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, …
## $ road_defects <chr> "None", NA, "None", "None", "None", "None",…
## $ road_feature <chr> "No Special Feature", "T-Intersection", "No…
## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth…
## $ num_bikes_ai <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_bi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ci <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ki <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_no <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_to <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ui <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ num_units <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ distance_mi_from <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rte_invd_cd <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ towrd_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.315187…
## $ geo_shape <chr> "{\"type\": \"Point\", \"coordinates\": [-7…

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.

  • filter: pick rows matching criteria
  • slice: pick rows using index(es)
  • select: pick columns by name
  • pull: grab a column as a vector
  • arrange: reorder rows
  • mutate: add new variables
  • distinct: filter for unique rows
  • sample_n / sample_frac: randomly sample rows
  • summarise: reduce variables to values
  • ... (many more)

dplyr rules for functions

  • First argument is always a data frame

  • Subsequent arguments say what to do with that data frame

  • Always return a data frame

  • Don't modify in place
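
A small sketch of these rules in action, using a made-up data frame (visits and its columns are hypothetical): the verb takes a data frame first, returns a new data frame, and leaves the original untouched unless the result is assigned back.

```r
library(dplyr)

# Hypothetical data: one row per clinic.
visits <- tibble(
  clinic = c("A", "B", "C"),
  n_visits = c(10, 25, 40)
)

# filter() returns a new data frame with the matching rows.
busy <- visits %>%
  filter(n_visits > 20)

# visits itself still has all three rows; nothing was modified in place.
rows_before <- nrow(visits)
rows_after <- nrow(busy)
```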


A note on piping and layering

  • The %>% operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.

  • The + operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +.


filter to select a subset of rows

for crashes in Durham County

ncbikecrash %>%
  filter(county == "Durham")
## # A tibble: 340 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban
## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban
## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban
## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban
## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## # … with 330 more rows, and 58 more variables: speed_limit <chr>,
## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## # bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## # driver_age_group <chr>, driver_alcohol <chr>,
## # driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## # crash_day <chr>, crash_group <chr>, crash_hour <int>,
## # crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## # crash_time <drtn>, crash_type <chr>, crash_year <int>,
## # ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## # road_character <chr>, road_class <chr>, road_condition <chr>,
## # road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## # road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>,
## # num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>,
## # num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>,
## # num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>

filter for many conditions at once

for crashes in Durham County where the biker was 0-5 years old

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5")
## # A tibble: 4 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban
## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>

Logical operators in R

operator       definition
<              less than
<=             less than or equal to
>              greater than
>=             greater than or equal to
==             exactly equal to
!=             not equal to
x & y          x AND y
x | y          x OR y
is.na(x)       test if x is NA
!is.na(x)      test if x is not NA
x %in% y       test if x is in y
!(x %in% y)    test if x is not in y
!x             not x
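
A few of these operators applied to small made-up vectors, as a base R sketch. Note how the missing value behaves: %in% never returns NA, while a comparison with NA does, which is why the last line also checks for missingness.

```r
x <- c(1, NA, 5)
y <- c(5, 7)

x_missing <- is.na(x)                  # is each element of x missing?
x_in_y <- x %in% y                     # is each element of x found in y?
x_not_in_y <- !(x %in% y)              # the negation of the line above
big_and_present <- x > 2 & !is.na(x)   # greater than 2 AND not missing
```

These logical vectors are exactly what filter uses to decide which rows to keep.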

select to keep variables

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
## # A tibble: 4 x 2
## locality speed_limit
## <chr> <chr>
## 1 Urban (>70% Developed) 30 - 35 MPH
## 2 Urban (>70% Developed) 5 - 15 MPH
## 3 Urban (>70% Developed) 20 - 25 MPH
## 4 Urban (>70% Developed) 20 - 25 MPH

select to exclude variables

ncbikecrash %>%
  select(-object_id)
## # A tibble: 7,467 x 65
## city county region development locality on_road rural_urban speed_limit
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M…
## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M…
## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M…
## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M…
## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA>
## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M…
## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M…
## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M…
## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M…
## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>

select a range of variables

ncbikecrash %>%
  select(city:locality)
## # A tibble: 7,467 x 5
## city county region development locality
## <chr> <chr> <chr> <chr> <chr>
## 1 None - Rural … Wayne Coastal Farms, Woods, Pa… Rural (<30% Develop…
## 2 Henderson Vance Piedmo… Residential Mixed (30% To 70% D…
## 3 None - Rural … Lincoln Piedmo… Farms, Woods, Pa… Rural (<30% Develop…
## 4 Whiteville Columbus Coastal Commercial Urban (>70% Develop…
## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Develop…
## 6 None - Rural … Robeson Coastal Farms, Woods, Pa… Rural (<30% Develop…
## 7 None - Rural … Richmond Piedmo… Residential Mixed (30% To 70% D…
## 8 Raleigh Wake Piedmo… Commercial Urban (>70% Develop…
## 9 Whiteville Columbus Coastal Residential Rural (<30% Develop…
## 10 New Bern Craven Coastal Residential Urban (>70% Develop…
## # … with 7,457 more rows

slice for certain row numbers

First five

ncbikecrash %>%
  slice(1:5)
## # A tibble: 5 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural
## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban
## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural
## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban
## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>

slice for certain row numbers

Last five

last_row <- nrow(ncbikecrash)
ncbikecrash %>%
  slice((last_row - 4):last_row)
## # A tibble: 5 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban
## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban
## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban
## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban
## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>

pull to extract a column as a vector

ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)
## [1] "Rural (<30% Developed)" "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)" "Urban (>70% Developed)"
## [5] "Urban (>70% Developed)" "Rural (<30% Developed)"

vs.

ncbikecrash %>%
  slice(1:6) %>%
  select(locality)
## # A tibble: 6 x 1
## locality
## <chr>
## 1 Rural (<30% Developed)
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)
## 4 Urban (>70% Developed)
## 5 Urban (>70% Developed)
## 6 Rural (<30% Developed)

sample_n / sample_frac for a random sample

  • sample_n: randomly sample 5 observations
ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
## [1] 5 66
  • sample_frac: randomly sample 20% of observations
ncbikecrash_perc20 <- ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
## [1] 1493 66
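
In dplyr 1.0.0 and later, sample_n and sample_frac still work but are superseded by slice_sample. The sketch below shows the equivalent calls on a made-up data frame (records is hypothetical) and assumes a dplyr version that provides slice_sample.

```r
library(dplyr)

set.seed(2020)  # make the random draws repeatable

# Hypothetical data: 50 records.
records <- tibble(id = 1:50)

# Counterpart of sample_n(5): draw 5 rows without replacement.
sample_5 <- records %>%
  slice_sample(n = 5)

# Counterpart of sample_frac(0.2): draw 20% of the rows.
sample_20pct <- records %>%
  slice_sample(prop = 0.2)
```

Setting a seed first is what makes a random sample reproducible from one run to the next.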

distinct to filter for unique rows

And arrange to order alphabetically

ncbikecrash %>%
  select(county, city) %>%
  distinct() %>%
  arrange(county, city)
## # A tibble: 391 x 2
## county city
## <chr> <chr>
## 1 Alamance Alamance
## 2 Alamance Burlington
## 3 Alamance Elon
## 4 Alamance Elon College
## 5 Alamance Gibsonville
## 6 Alamance Graham
## 7 Alamance Green Level
## 8 Alamance Mebane
## 9 Alamance None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows

summarise to reduce variables to values

ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 1 x 1
## avg_hr
## <dbl>
## 1 14.7

group_by to do calculations on groups

ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 2 x 2
## hit_run avg_hr
## <chr> <dbl>
## 1 No 14.6
## 2 Yes 15.0

count observations in groups

ncbikecrash %>%
  count(driver_alcohol_drugs)
## # A tibble: 6 x 2
## driver_alcohol_drugs n
## <chr> <int>
## 1 Missing 99
## 2 No 695
## 3 Yes-Alcohol, impairment suspected 12
## 4 Yes-Alcohol, no impairment detected 3
## 5 Yes-Drugs, impairment suspected 4
## 6 <NA> 6654
62 / 124

mutate to add new variables

ncbikecrash %>%
mutate(driver_alcohol_drugs_simplified = case_when(
driver_alcohol_drugs == "Missing" ~ NA_character_,
str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
TRUE ~ "No"
))
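
To see how case_when() walks through its conditions in order, here is a sketch on a hypothetical four-value column. It uses NA_character_ so that every right-hand side is a character value, which older dplyr releases require. Note that a missing input matches neither of the first two conditions, so it falls through to the TRUE branch.

```r
library(dplyr)
library(stringr)

# Hypothetical values, including one real NA
toy <- tibble(
  driver_alcohol_drugs = c("Missing", "No", "Yes-Alcohol, impairment suspected", NA)
)

toy <- toy %>%
  mutate(simplified = case_when(
    driver_alcohol_drugs == "Missing" ~ NA_character_,  # checked first
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",    # checked second
    TRUE ~ "No"                                         # everything else
  ))

toy$simplified
```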
63 / 124

"Save" when you mutate

Most often when you define a new variable with mutate you'll also want to save the resulting data frame, often by writing over the original data frame.

ncbikecrash <- ncbikecrash %>%
mutate(driver_alcohol_drugs_simplified = case_when(
str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
TRUE ~ driver_alcohol_drugs
))
64 / 124

Check before you move on

ncbikecrash %>%
count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
## # A tibble: 6 x 3
## driver_alcohol_drugs driver_alcohol_drugs_simplified n
## <chr> <chr> <int>
## 1 Missing Missing 99
## 2 No No 695
## 3 Yes-Alcohol, impairment suspected Yes 12
## 4 Yes-Alcohol, no impairment detected Yes 3
## 5 Yes-Drugs, impairment suspected Yes 4
## 6 <NA> <NA> 6654
ncbikecrash %>%
count(driver_alcohol_drugs_simplified)
## # A tibble: 4 x 2
## driver_alcohol_drugs_simplified n
## <chr> <int>
## 1 Missing 99
## 2 No 695
## 3 Yes 19
## 4 <NA> 6654
65 / 124

AE - NC bike crashes

  • Go to the cloud project and open application exercise NC bike crashes

    ~/appex/ae-ncbikecrashes.Rmd

  • For each question you work on, set the eval chunk option to TRUE and knit

66 / 124

Coding style
🤵

67 / 124

Coding style

68 / 124

Style guide

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

Hadley Wickham

  • Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/

  • There's more to it than what we'll cover today, but we'll mention more as we introduce more functionality, and do a recap later in the semester

69 / 124

File names and code chunk labels

  • Do not use spaces in file names, use - or _ to separate words
  • Use all lowercase letters
# Good
ucb-admit.csv
# Bad
UCB Admit.csv
70 / 124

Object names

  • Use _ to separate words in object names
  • Use informative but short object names
  • Do not reuse object names within an analysis
# Good
acs_employed
# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
71 / 124

Spacing

  • Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls.
  • Always put a space after a comma, and never before (just like in regular English).
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
73 / 124

ggplot

  • Always end a line with +
  • Always indent the next line
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()
# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
74 / 124

Long lines

  • Limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.
  • Take advantage of RStudio editor's auto formatting for indentation at line breaks.
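
One way to stay under the limit is to break a long call after the opening parenthesis and give each argument its own line. A small base R sketch (heights_in and its values are hypothetical):

```r
# Hypothetical measurements, used only to illustrate line breaks
heights_in <- c(60, 65.5, 70, NA, 72)

# One argument per line keeps every line well under 80 characters
summary_stats <- c(
  mean = mean(heights_in, na.rm = TRUE),
  median = median(heights_in, na.rm = TRUE),
  max = max(heights_in, na.rm = TRUE)
)

summary_stats
```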
75 / 124

Assignment

  • Use <- not =
# Good
x <- 2
# Bad
x = 2
76 / 124

Quotes

Use ", not ', for quoting text. The only exception is when the text already contains double quotes and no single quotes.

ggplot(diamonds, mapping = aes(x = price)) +
geom_histogram() +
# Good
labs(title = "`Shine bright like a diamond`",
# Good
x = "Diamond prices",
# Bad
y = 'Frequency')
77 / 124

Data classes and types + Recoding
💽

78 / 124

Data classes and types

79 / 124

Data types in R

  • logical
  • double
  • integer
  • character
  • lists
  • and some more, but we won't be focusing on those
80 / 124

Logical & character

logical - boolean values TRUE and FALSE

typeof(TRUE)
## [1] "logical"

character - character strings

typeof("hello")
## [1] "character"
typeof('world') # but remember, we use double quotations!
## [1] "character"
81 / 124

Double & integer

double - floating point numerical values (default numerical type)

typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)

typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"
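
The two numeric types mostly behave the same, which is why the distinction is easy to miss. A short base R sketch of where they agree and where they differ:

```r
x_int <- 7L
x_dbl <- 7

identical(x_int, x_dbl)  # FALSE: the types differ
x_int == x_dbl           # TRUE: the values are equal after coercion
typeof(x_int + x_dbl)    # "double": mixing the two promotes to double
is.numeric(x_int) && is.numeric(x_dbl)  # TRUE: both count as numeric
```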
82 / 124

Lists

Lists are 1d objects that can contain any combination of R objects

mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
## $ : chr "A"
## $ : int [1:4] 1 2 3 4
## $ : logi [1:2] TRUE FALSE
## $ : num [1:4] 0.5 1 1.5 2
83 / 124

Named lists

Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
## $ A : chr "hello"
## $ B : int [1:4] 1 2 3 4
## $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A" "B" "knock knock"
myotherlist$B
## [1] 1 2 3 4
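
Names that contain spaces need a little extra care when you access them. A base R sketch using the same list, plus a named vector with hypothetical values:

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

myotherlist[["knock knock"]]   # double brackets take the name as a string
myotherlist$`knock knock`      # $ needs backticks for names with spaces

# Named vectors are accessed by name with single brackets
ages <- c(alice = 31, bob = 27)  # hypothetical values
ages["bob"]
```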
84 / 124

Concatenation

Vectors can be constructed using the c() function.

c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello" "World!"
c(1, c(2, c(3)))
## [1] 1 2 3
85 / 124

Coercion

R is a dynamically typed language -- it will happily convert between the various types without complaint.

c(1, "Hello")
## [1] "1" "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
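
The pattern in these results is that implicit coercion always moves toward the most flexible type present: logical, then integer, then double, then character. Going the other way has to be requested explicitly, and it can fail. A base R sketch:

```r
# Implicit coercion: the most flexible type wins
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"

# Explicit coercion: text that is not a number becomes NA (with a warning,
# suppressed here to keep the output clean)
converted <- suppressWarnings(as.numeric(c("3", "three")))
converted
```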
86 / 124

Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)
## [1] "logical"
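
Missing values propagate through most calculations, so it helps to know how to detect and skip them. A base R sketch with a hypothetical vector:

```r
x <- c(4, NA, 6)

is.na(x)                 # FALSE TRUE FALSE
mean(x)                  # NA: the missing value propagates
mean(x, na.rm = TRUE)    # 5: the missing value is dropped first
NA == NA                 # NA, so test with is.na() rather than ==

# NA is logical on its own but takes on the type of the vector it joins
typeof(c("a", NA))       # "character"
typeof(NA_character_)    # "character"
```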
87 / 124

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity


pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
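
Base R provides a test function for each of these values. A short sketch showing how they overlap (note that is.na() is also TRUE for NaN):

```r
vals <- c(1, NA, NaN, Inf, -Inf)

is.na(vals)      # TRUE for both NA and NaN
is.nan(vals)     # TRUE only for NaN
is.finite(vals)  # TRUE only for ordinary numbers
```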
88 / 124

Activity

What is the type of the following vectors? Explain why they have that type.

  • c(1, NA+1L, "C")
  • c(1L / 0, NA)
  • c(1:3, 5)
  • c(3L, NaN+1L)
  • c(NA, TRUE)
89 / 124

Example: Cat lovers

Go to RStudio Cloud and open the application exercise Cat Lovers.

~/appex/ae-catlovers.Rmd

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

cat_lovers <- read_csv("../data/cat-lovers.csv")
90 / 124

Oh why won't you work?!

cat_lovers %>%
summarise(mean = mean(number_of_cats))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 NA
91 / 124

Oh why won't you still work??!!

cat_lovers %>%
summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
## # A tibble: 1 x 1
## mean_cats
## <dbl>
## 1 NA
92 / 124

Take a breath and look at your data

What is the type of the number_of_cats variable?

glimpse(cat_lovers)
## Observations: 60
## Variables: 3
## $ name <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0",…
## $ handedness <chr> "left", "left", "left", "left", "left", "left", "…
93 / 124

Let's take another look
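
One way to take that look, sketched here on hypothetical responses (the names and values below are made up, not the cat-lovers data): convert the column with as.numeric() and list the entries that fail to convert.

```r
# Hypothetical survey responses; the real file is not reproduced here
responses <- data.frame(
  name = c("Respondent A", "Respondent B", "Respondent C", "Respondent D"),
  number_of_cats = c("0", "three", "2", "1.5 - one is a kitten"),
  stringsAsFactors = FALSE
)

# as.numeric() returns NA (with a warning) for text that is not a number
parsed <- suppressWarnings(as.numeric(responses$number_of_cats))

# Rows whose answer was present but could not be read as a number
problem_rows <- responses[is.na(parsed) & !is.na(responses$number_of_cats), ]
problem_rows
```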

94 / 124

Sometimes you need to babysit your respondents

cat_lovers %>%
mutate(number_of_cats = case_when(
name == "Ginger Clark" ~ 2,
name == "Doug Bass" ~ 3,
TRUE ~ as.numeric(number_of_cats)
)) %>%
summarise(mean_cats = mean(number_of_cats))
## # A tibble: 1 x 1
## mean_cats
## <dbl>
## 1 0.817
95 / 124

You always need to respect data types

cat_lovers %>%
mutate(
number_of_cats = case_when(
name == "Ginger Clark" ~ "2",
name == "Doug Bass" ~ "3",
TRUE ~ number_of_cats
),
number_of_cats = as.numeric(number_of_cats)
) %>%
summarise(mean_cats = mean(number_of_cats))
## # A tibble: 1 x 1
## mean_cats
## <dbl>
## 1 0.817
96 / 124

Now that we know what we're doing...

cat_lovers <- cat_lovers %>%
mutate(
number_of_cats = case_when(
name == "Ginger Clark" ~ "2",
name == "Doug Bass" ~ "3",
TRUE ~ number_of_cats
),
number_of_cats = as.numeric(number_of_cats)
)
97 / 124

Moral of the story

  • If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.

  • Go in and investigate your data, apply the fix, save your data, live happily ever after.

98 / 124

Vectors vs. lists

x <- c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4


Note: When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
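
The difference matters as soon as you try to compute with the result: single brackets on a list give back a smaller list, while double brackets give back the element itself. A base R sketch:

```r
y <- list(8, 4, 7)

class(y[2])      # "list": single brackets return a list of length 1
class(y[[2]])    # "numeric": double brackets return the element itself

y[[2]] + 1       # arithmetic works on the extracted element

# Arithmetic on the one-element list is an error; tryCatch() captures it
result <- tryCatch(y[2] + 1, error = function(e) "error")
result
```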

99 / 124

Review on your own

100 / 124

Data "set"

101 / 124

Data "sets" in R

  • "set" is in quotation marks because it is not a formal data class

  • A tidy data "set" can be one of the following types:

    • tibble
    • data.frame
  • We'll often work with tibbles:

    • readr package (e.g. read_csv function) loads data as a tibble by default
    • tibbles are part of the tidyverse, so they work well with other packages we are using
    • they make minimal assumptions about your data, so they are less likely to cause hard-to-track bugs in your code
102 / 124

Data frames

  • A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column, and the elements of the vectors as rows.

  • A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier.

  • Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df <- tibble(x = 1:3, y = c("a", "b", "c"))
class(df)
## [1] "tbl_df" "tbl" "data.frame"
glimpse(df)
## Observations: 3
## Variables: 2
## $ x <int> 1, 2, 3
## $ y <chr> "a", "b", "c"
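
The "list of equal-length vectors" description can be checked directly. The sketch below uses a base data.frame under a separate, hypothetical name (df_base) so that it runs without extra packages; a tibble answers these calls the same way.

```r
df_base <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

is.list(df_base)         # TRUE: a data frame is built on a list
length(df_base)          # 2: one list element per column
df_base[["y"]]           # a column is extracted like any list element
sapply(df_base, typeof)  # each column keeps its own type
```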
103 / 124

Data frames (cont.)

attributes(df)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "character"
104 / 124

Working with tibbles in pipelines

How many respondents have a below-average number of cats?

mean_cats <- cat_lovers %>%
summarise(mean_cats = mean(number_of_cats))
cat_lovers %>%
filter(number_of_cats < mean_cats) %>%
nrow()
## [1] 60

Do you believe this number? Why or why not?

105 / 124

The result of summarise() is always a tibble

mean_cats
## # A tibble: 1 x 1
## mean_cats
## <dbl>
## 1 0.817
class(mean_cats)
## [1] "tbl_df" "tbl" "data.frame"
106 / 124

pull() can be your new best friend

But use it sparingly!

mean_cats <- cat_lovers %>%
summarise(mean_cats = mean(number_of_cats)) %>%
pull()
cat_lovers %>%
filter(number_of_cats < mean_cats) %>%
nrow()
## [1] 33
mean_cats
## [1] 0.8166667
class(mean_cats)
## [1] "numeric"
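
An alternative that avoids the intermediate object altogether is to compute the mean inside filter(), where mean() is applied to the column and returns a single number. A sketch on hypothetical toy data (toy_cats is made up, not the cat-lovers data):

```r
library(dplyr)

# Hypothetical data: four respondents
toy_cats <- tibble(
  name = c("A", "B", "C", "D"),
  number_of_cats = c(0, 1, 3, 0)
)

# mean() is evaluated on the column, so the comparison is number vs. number
n_below <- toy_cats %>%
  filter(number_of_cats < mean(number_of_cats)) %>%
  nrow()

n_below
```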
107 / 124

Factors

108 / 124

Factors

Factor objects are how R stores data for categorical variables (fixed numbers of discrete values).

(x <- factor(c("BS", "MS", "PhD", "MS")))
## [1] BS MS PhD MS
## Levels: BS MS PhD
glimpse(x)
## Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
typeof(x)
## [1] "integer"
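
typeof() reports "integer" because a factor stores integer codes plus a lookup table of levels. A base R sketch that pulls the two parts apart:

```r
x <- factor(c("BS", "MS", "PhD", "MS"))

levels(x)          # the lookup table: "BS" "MS" "PhD"
as.integer(x)      # the integer codes stored underneath: 1 2 3 2
as.character(x)    # the labels, rebuilt from codes and levels
table(x)           # counts per level
```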
109 / 124

Read data in as character strings

glimpse(cat_lovers)
## Observations: 60
## Variables: 3
## $ name <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1…
## $ handedness <chr> "left", "left", "left", "left", "left", "left", "…
110 / 124

But coerce when plotting

p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
geom_bar()
p

111 / 124

Use forcats to manipulate factors

cat_lovers <- cat_lovers %>%
mutate(handedness = fct_relevel(handedness,
"right", "left", "ambidextrous"))
p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
geom_bar()
p
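
For comparison, base R can set the level order by passing levels to factor() directly. This is a sketch with hypothetical values, not the forcats approach shown above; unlike fct_relevel(), it requires every level to be listed.

```r
handedness <- c("left", "right", "ambidextrous", "right")  # hypothetical values

# Default: levels are sorted alphabetically
levels(factor(handedness))

# Explicit order, matching the one requested from fct_relevel() above
ordered_hand <- factor(handedness, levels = c("right", "left", "ambidextrous"))
levels(ordered_hand)
table(ordered_hand)  # counts now appear in the chosen order
```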

112 / 124

Come for the functionality

... stay for the logo

  • R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors. As a result, factors often show up in places where they are not actually helpful.

  • However, factors are still useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors.

113 / 124

Recap

  • Always best to think of data as part of a tibble

    • This plays nicely with the tidyverse as well
    • Rows are observations, columns are variables
  • Be careful about data types / classes

    • Sometimes R makes silly assumptions about your data class
      • Using tibbles helps, but it might not solve all issues
      • Think about your data in context, e.g. a 0/1 variable is most likely a factor
    • If a plot/output is not behaving the way you expect, first investigate the data class
    • If you are absolutely sure of a data class, overwrite it in your tibble so that you don't have to keep track of it
      • mutate the variable with the correct class
  • Check out Alison Hill's "Working with Data in R"
    saved in the R folder of the RStudio project

114 / 124

Resources

115 / 124

Online Books

R for Data Science

This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.

ModernDive: Statistical Inference via Data Science

This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.

Data Visualization: A Practical Introduction

This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot.

Fundamentals of Data Visualization

The book is meant as a guide to making visualizations that accurately reflect the data, tell a story, and look professional. Even though nearly all of the figures in this book were made with R and ggplot2, this is not an R book. It focuses on the concepts and the figures, not on the code.

Open source R-based Courses

Alison Hill - Introduction to Biostatistics for the Basic Sciences - Oregon Health & Science University

Mine Cetinkaya-Rundel - Intro to Data Science - Duke

116 / 124

These links will take you to each relevant section in the presentation

117 / 124

Who wants to make a meme?
☝️

118 / 124

Welcome to the memer package 📦

remotes::install_github("sctyner/memer")
library(memer)

memer is a tidyverse-compatible R package for creating memes

meme_get("OprahGiveaway") %>%
meme_text_bottom("EVERYONE GETS A MEME!", size = 30)

119 / 124

What's in the package?

meme_list()
## [1] "AllTheThings" "AmericanChopper" "AncientAliens"
## [4] "BatmanRobin" "DistractedBf" "EvilKermit"
## [7] "ExpandingBrain" "FirstWorldProbs" "FryNotSure"
## [10] "HotlineDrake" "IsThisAPigeon" "NoneOfMyBusiness"
## [13] "CheersLeo" "OneDoesNotSimply" "DosEquisMan"
## [16] "OffRamp" "OprahGiveaway" "Philosoraptor"
## [19] "PicardFacePalm" "PicardWTH" "Purples"
## [22] "PutItPatrick" "Rainbow" "ShiaJustDoIt"
## [25] "Spongebob" "SuccessKid" "ThatWouldBeGreat"
## [28] "TheRockDriving" "ThinkAboutIt" "TrumpBillSigning"
## [31] "TwoButtonsAnxiety" "WhatIfIToldYou" "CondescendingWonka"
## [34] "YoDawg" "Y-U-NOguy"
120 / 124

Let's make a meme

meme_get("TheRockDriving") %>%
meme_text_rock("Hey, how do I prep for an IRB audit?",
"\nPrint, \n...everything.")

121 / 124

Now you try

meme_get("SuccessKid") %>%
meme_text_bottom("ENTER TEXT HERE")
122 / 124

Coffee Time

Can't touch this

123 / 124

124 / 124
