Welcome
Brief intro to R 🙌
Ivan Castro
VISN2 Center for Integrated Healthcare

What we will cover today

Just the tip of the iceberg...
There's not enough time to cover everything
The content presented today is largely based on the Data Science in a Box materials
Given our roles, we will focus on:
1. How to start interacting with R
2. How to wrangle data in R

Things we won't cover

...but I'll gladly help you with otherwise:

Advanced data wrangling
Detailed data visualization
Data modelling
Handling spatial data

What is Data Wrangling?

The data analysis cycle

Anyone who has ever taken wild-caught data through the full process of analysis knows that statistics, in the strict sense of fitting models and doing inference, is but one small part of the process.

Bryan & Wickham (2017)

Who Am I?

Coordinator in Dr. Possemato's lab
Fairly new to CIH (less than a year)
Learned R (largely on my own) during graduate school
Trained in biostats
Enjoy the challenge of wrangling messy data

Find me at...

BHOC F204

ivan.castro@va.gov

iecastro@syr.edu

iecastro

Meet the toolkit ⚒

Toolkit

Scriptability → R
Literate programming (code, narrative, output in one place) → R Markdown
Version control → Git / GitHub

Reproducible data wrangling and analysis

Reproducibility checklist

What does it mean for a data analysis to be "reproducible"?

Near-term goals:

Are the tables and figures reproducible from the code and data?
Does the code actually do what you think it does?
In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

Long-term goals:

Can the code be used for other data?
Can you extend the code to do other things?

From manual tasking...

... to reproducible code

Reproducible plots

R and RStudio

What is R

R is a statistical programming language
But why learn programming?

You must use a computer to do data science; you cannot do it in your head, or with pencil and paper.

Hadley Wickham

Don't be discouraged by the word programming; R is first and foremost for data analysis

➥ Source: R for Data Science

What is RStudio?

RStudio is a convenient interface for R (an integrated development environment, IDE)
At its simplest:
- R is like a car’s engine
- RStudio is like a car’s dashboard

➥ Source: Modern Dive

Let's take a tour - R / RStudio

Follow this link and log in with your Google account:

https://rstudio.cloud/project/395951

Concepts introduced:

Console
Using R as a calculator
Environment
Loading and viewing a data frame
Accessing a variable in a data frame
R functions

R essentials

A short list (for now):

Functions are (most often) verbs, followed by what they will be applied to in parentheses:

do_this(to_this)
do_that(to_this, to_that, with_those)

Columns (variables) in data frames are accessed with $:

dataframe$var_name

Packages are installed once with the install.packages function and loaded with the library function once per session:

install.packages("package_name")
library(package_name)

For this project we'll need the following packages:

install.packages(c("tidyverse", "devtools", "datasauRus", "fivethirtyeight", "janitor", "DT"))
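
The essentials above can be tried in the Console with a small made-up data frame. This is only a sketch: `heights`, `name` and `height_cm` are hypothetical names, not part of the workshop project.

```r
# Functions are verbs applied to what sits inside the parentheses
sqrt(16)                       # one argument
round(3.14159, digits = 2)     # a value plus a named argument

# A tiny, made-up data frame (names are illustrative only)
heights <- data.frame(
  name      = c("Ana", "Ben"),
  height_cm = c(160, 175)
)

# Columns (variables) are accessed with $
heights$height_cm

# The result of $ is a plain vector, so it can be passed to a function
avg_height <- mean(heights$height_cm)
avg_height
```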

tidyverse

tidyverse.org

The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying philosophy and a common grammar.

R Markdown

R Markdown

Fully reproducible reports -- each time you knit, the analysis is run from the beginning
Simple markdown syntax for text
Code goes in chunks, defined by three backticks; narrative goes outside of chunks

Let's take a tour - R Markdown

Go to RStudio Cloud and open the application exercise Bechdel.

~/appex/ae-bechdel.Rmd

Concepts introduced:

Knitting documents
R Markdown and (some) R syntax

Bechdel Test

What is the Bechdel test?

The Bechdel test asks whether a work of fiction features at least two named women characters who talk to each other about something other than a man.

Knit the R Markdown document.

Other things you can make in R Markdown

This presentation was written in R Markdown

HTML resume

Blog / Website

... ok, enough self promotion 👨‍💼

R Markdown help

R Markdown cheat sheet

Markdown Quick Reference
Help -> Markdown Quick Reference

Workspaces

Remember this, and expect it to bite you a few times as you're learning to work with R Markdown: The workspace of your R Markdown document is separate from the Console!

Run the following in the console

x <- 2
x * 3

All looks good, eh?

Then, add the following chunk in your R Markdown document and knit it

x * 3

What happens? Why the error?

Git and GitHub

Version control

GitHub as a platform for collaboration
It's actually designed for version control

Versioning

with human readable messages

Why do we need version control?

Git and GitHub tips

Git is a version control system -- like the “Track Changes” feature in Microsoft Word, on steroids. GitHub is the home for your Git-based projects on the internet -- like Dropbox, but much, much better.
This is outside the scope of this workshop.
There is a great resource for working with git and R: happygitwithr.com.

Tidy data and data wrangling 🔧

Tidy data

Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Characteristics of tidy data: 😄

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

Characteristics of untidy data: 😦

!@#$%^&*()

➥ Source: R for Data Science
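
To make the three characteristics concrete, here is a minimal sketch that reshapes a made-up untidy table into a tidy one. It assumes the tidyr package (part of the tidyverse) is installed; the `clinic`, `year` and `visits` names are hypothetical.

```r
library(tidyr)

# Made-up untidy table: the variable "year" is hidden in the column names
untidy <- data.frame(
  clinic = c("A", "B"),
  `2018` = c(10, 20),
  `2019` = c(15, 25),
  check.names = FALSE   # keep the year column names as they are
)

# Tidy version: one column per variable, one row per clinic-year observation
tidy <- pivot_longer(
  untidy,
  cols      = c("2018", "2019"),
  names_to  = "year",
  values_to = "visits"
)
tidy
```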

Like families, tidy datasets are all alike but every messy dataset is messy in its own way.

Hadley Wickham

Summary tables

Is each of the following a dataset or a summary table?

## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows

## # A tibble: 5 x 2
##   gender        avg_height
##   <chr>              <dbl>
## 1 female              165.
## 2 hermaphrodite       175 
## 3 male                179.
## 4 none                200 
## 5 <NA>                120

Pipes

Where does the name come from?

The pipe operator is implemented in the package magrittr. It's pronounced "and then".

➥ Vignette: magrittr

Review: How does a pipe work?

You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.
Expressed as a set of nested functions in R pseudocode this would look like:

park(drive(start_car(find("keys")), to = "campus"))

Writing it out using pipes gives it a more natural (and easier to read) structure:

find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
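
The car example is pseudocode, so it won't run. A runnable sketch of the same idea, using only functions that ship with R plus magrittr (the `values` vector is made up):

```r
library(magrittr)

values <- c(1, 4, 9, 16)

# Nested: has to be read from the inside out
nested <- round(sqrt(sum(values)), 1)

# Piped: read top to bottom as "sum, and then sqrt, and then round"
piped <- values %>%
  sum() %>%
  sqrt() %>%
  round(1)

# Both spellings compute the same thing
identical(nested, piped)
```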

What about other arguments?

To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use .:

starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)

## 
## Call:
## lm(formula = mass ~ height, data = .)
## 
## Coefficients:
## (Intercept)       height  
##     -116.58         1.11
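
The same pattern can be tried without dplyr. This sketch assumes only magrittr and the `mtcars` data that ships with R; `lm` takes the formula first, so the data has to be routed to `data =` with the dot.

```r
library(magrittr)

# Without the dot, the piped data frame would land in lm()'s first
# argument (the formula), which is not what we want
fit <- mtcars %>%
  subset(cyl == 4) %>%          # keep four-cylinder cars
  lm(mpg ~ wt, data = .)        # . stands for the result of the previous step

class(fit)
nobs(fit)                       # number of rows the model was fit on
```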

Data wrangling

Bike crashes in NC 2007 - 2014

The dataset is in the dsbox package:

GitHub packages require special install commands
the remotes package is automatically installed with devtools

remotes::install_github("rstudio-education/dsbox")
library(dsbox)
ncbikecrash

Variables

View the names of variables via

names(ncbikecrash)

##  [1] "object_id"            "city"                 "county"              
##  [4] "region"               "development"          "locality"            
##  [7] "on_road"              "rural_urban"          "speed_limit"         
## [10] "traffic_control"      "weather"              "workzone"            
## [13] "bike_age"             "bike_age_group"       "bike_alcohol"        
## [16] "bike_alcohol_drugs"   "bike_direction"       "bike_injury"         
## [19] "bike_position"        "bike_race"            "bike_sex"            
## [22] "driver_age"           "driver_age_group"     "driver_alcohol"      
## [25] "driver_alcohol_drugs" "driver_est_speed"     "driver_injury"       
## [28] "driver_race"          "driver_sex"           "driver_vehicle_type" 
## [31] "crash_alcohol"        "crash_date"           "crash_day"           
## [34] "crash_group"          "crash_hour"           "crash_location"      
## [37] "crash_month"          "crash_severity"       "crash_time"          
## [40] "crash_type"           "crash_year"           "ambulance_req"       
## [43] "hit_run"              "light_condition"      "road_character"      
## [46] "road_class"           "road_condition"       "road_configuration"  
## [49] "road_defects"         "road_feature"         "road_surface"        
## [52] "num_bikes_ai"         "num_bikes_bi"         "num_bikes_ci"        
## [55] "num_bikes_ki"         "num_bikes_no"         "num_bikes_to"        
## [58] "num_bikes_ui"         "num_lanes"            "num_units"           
## [61] "distance_mi_from"     "frm_road"             "rte_invd_cd"         
## [64] "towrd_road"           "geo_point"            "geo_shape"

and see detailed descriptions with ?ncbikecrash.

Viewing your data

After loading with data(ncbikecrash), click on the name of the data frame in the Environment to view it in the data viewer
Use the glimpse function to take a peek

glimpse(ncbikecrash)

## Observations: 7,467
## Variables: 66
## $ object_id            <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1…
## $ city                 <chr> "None - Rural Crash", "Henderson", "None - …
## $ county               <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N…
## $ region               <chr> "Coastal", "Piedmont", "Piedmont", "Coastal…
## $ development          <chr> "Farms, Woods, Pastures", "Residential", "F…
## $ locality             <chr> "Rural (<30% Developed)", "Mixed (30% To 70…
## $ on_road              <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK…
## $ rural_urban          <chr> "Rural", "Urban", "Rural", "Urban", "Urban"…
## $ speed_limit          <chr> "50 - 55  MPH", "30 - 35  MPH", "50 - 55  M…
## $ traffic_control      <chr> "No Control Present", "Stop Sign", "Double …
## $ weather              <chr> "Clear", "Clear", "Clear", "Rain", "Clear",…
## $ workzone             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ bike_age             <chr> "52", "66", "33", "52", "22", "15", "41", "…
## $ bike_age_group       <chr> "50-59", "60-69", "30-39", "50-59", "20-24"…
## $ bike_alcohol         <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ bike_alcohol_drugs   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bike_direction       <chr> "With Traffic", "With Traffic", "With Traff…
## $ bike_injury          <chr> "B: Evident Injury", "C: Possible Injury", …
## $ bike_position        <chr> "Bike Lane / Paved Shoulder", "Travel Lane"…
## $ bike_race            <chr> "Black", "Black", "White", "Black", "White"…
## $ bike_sex             <chr> "Male", "Male", "Male", "Male", "Female", "…
## $ driver_age           <chr> "34", NA, "37", "55", "25", "17", NA, "50",…
## $ driver_age_group     <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-…
## $ driver_alcohol       <chr> "No", "Missing", "No", "No", "No", "No", "M…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ driver_est_speed     <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1…
## $ driver_injury        <chr> "O: No Injury", "Unknown Injury", "O: No In…
## $ driver_race          <chr> "White", "Unknown/Missing", "Hispanic", "Bl…
## $ driver_sex           <chr> "Male", NA, "Female", "Male", "Male", "Fema…
## $ driver_vehicle_type  <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "…
## $ crash_alcohol        <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ crash_date           <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D…
## $ crash_day            <chr> "Wednesday", "Wednesday", "Sunday", "Saturd…
## $ crash_group          <chr> "Motorist Overtaking Bicyclist", "Bicyclist…
## $ crash_hour           <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22…
## $ crash_location       <chr> "Non-Intersection", "Intersection", "Non-In…
## $ crash_month          <chr> "December", "November", "November", "Decemb…
## $ crash_severity       <chr> "B: Evident Injury", "C: Possible Injury", …
## $ crash_time           <drtn> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13…
## $ crash_type           <chr> "Motorist Overtaking - Undetected Bicyclist…
## $ crash_year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ ambulance_req        <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y…
## $ hit_run              <chr> "No", "Yes", "No", "No", "No", "No", "Yes",…
## $ light_condition      <chr> "Dark - Roadway Not Lighted", NA, "Dark - R…
## $ road_character       <chr> "Straight - Level", "Straight - Level", "St…
## $ road_class           <chr> "State Secondary Route", "Local Street", "U…
## $ road_condition       <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi…
## $ road_configuration   <chr> "Two-Way, Not Divided", "Two-Way, Divided, …
## $ road_defects         <chr> "None", NA, "None", "None", "None", "None",…
## $ road_feature         <chr> "No Special Feature", "T-Intersection", "No…
## $ road_surface         <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth…
## $ num_bikes_ai         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_bi         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ci         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ki         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_no         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_to         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ui         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_lanes            <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ num_units            <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ distance_mi_from     <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rte_invd_cd          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ towrd_road           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ geo_point            <chr> "35.3336070056, -77.9955023901", "36.315187…
## $ geo_shape            <chr> "{\"type\": \"Point\", \"coordinates\": [-7…

A Grammar of Data Manipulation

dplyr is based on the concepts of functions as verbs that manipulate data frames.

filter: pick rows matching criteria
slice: pick rows using index(es)
select: pick columns by name
pull: grab a column as a vector
arrange: reorder rows
mutate: add new variables
distinct: filter for unique rows
sample_n / sample_frac: randomly sample rows
summarise: reduce variables to values
... (many more)
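
A quick sketch of several of these verbs chained together. The `crashes` tibble is made up for illustration (it is not `ncbikecrash`), and `group_by` is a dplyr function not listed above, used here so that `summarise` returns one row per county.

```r
library(dplyr)

# Made-up stand-in for a crash table
crashes <- tibble(
  county  = c("Durham", "Durham", "Wake", "Wake", "Orange"),
  hour    = c(8, 17, 9, 22, 14),
  injured = c(1, 0, 2, 1, 0)
)

injured_crashes <- crashes %>%
  filter(injured > 0) %>%          # pick rows matching a condition
  mutate(night = hour >= 20) %>%   # add a new variable
  arrange(desc(hour)) %>%          # reorder rows, latest hour first
  select(county, hour, night)      # pick columns by name
injured_crashes

# summarise reduces many rows to a single value per group
by_county <- crashes %>%
  group_by(county) %>%
  summarise(total_injured = sum(injured))
by_county
```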

dplyr rules for functions

First argument is always a data frame
Subsequent arguments say what to do with that data frame
Always return a data frame
Don't modify in place
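
"Don't modify in place" is the rule that most often surprises newcomers. A minimal sketch with a made-up `scores` tibble:

```r
library(dplyr)

scores <- tibble(id = 1:3, score = c(10, 20, 30))  # made-up data

# mutate() returns a new data frame and prints it; scores is untouched
scores %>% mutate(score = score * 2)
scores$score          # still 10 20 30

# To keep the result, assign it to a name
scores_doubled <- scores %>% mutate(score = score * 2)
scores_doubled$score
```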

A note on piping and layering

The %>% operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
The + operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +.
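
The two operators often appear in the same chain, which is where they get mixed up. A sketch assuming dplyr and ggplot2 are installed, using the `mtcars` data that ships with R:

```r
library(dplyr)
library(ggplot2)

p <- mtcars %>%                               # pipe the data frame ...
  filter(cyl == 4) %>%                        # ... through dplyr verbs with %>%
  ggplot(aes(x = wt, y = mpg)) +              # once inside ggplot, switch to +
  geom_point() +                              # each + adds a layer
  labs(x = "Weight", y = "Miles per gallon")

p   # printing the object draws the plot
```

Using %>% between ggplot2 layers instead of + is a common slip and produces an error.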

`filter` to select a subset of rows

for crashes in Durham County

ncbikecrash %>%
  filter(county == "Durham")

## # A tibble: 340 x 66
##    object_id city  county region development locality on_road rural_urban
##        <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
##  1      2452 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
##  2      2441 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  3      2466 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  4       549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban      
##  5       598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban      
##  6       603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban      
##  7      3974 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  8      7134 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  9      1670 Durh… Durham Piedm… Commercial  Urban (… INFINI… Urban      
## 10      1773 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## # … with 330 more rows, and 58 more variables: speed_limit <chr>,
## #   traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## #   bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## #   bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## #   bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## #   driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## #   driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## #   driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## #   crash_day <chr>, crash_group <chr>, crash_hour <int>,
## #   crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## #   crash_time <drtn>, crash_type <chr>, crash_year <int>,
## #   ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## #   road_character <chr>, road_class <chr>, road_condition <chr>,
## #   road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## #   road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>,
## #   num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>,
## #   num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>,
## #   num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## #   rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>

`filter` for many conditions at once

for crashes in Durham County where biker was 0-5 years old

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5")

## # A tibble: 4 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      4062 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 2       414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban      
## 3      3016 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 4      1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban      
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>
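
Conditions separated by commas inside `filter` are combined with AND. A small sketch on a made-up table (only the column names are borrowed from `ncbikecrash`):

```r
library(dplyr)

crashes <- tibble(
  county         = c("Durham", "Durham", "Wake"),
  bike_age_group = c("0-5", "20-24", "0-5")
)

# Comma-separated conditions ...
with_comma <- crashes %>%
  filter(county == "Durham", bike_age_group == "0-5")

# ... keep the same rows as an explicit &
with_amp <- crashes %>%
  filter(county == "Durham" & bike_age_group == "0-5")

nrow(with_comma)
nrow(with_amp)
```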

Logical operators in R

operator	definition	operator	definition
`<`	less than	`x | y`	`x` OR `y`
`<=`	less than or equal to	`is.na(x)`	test if `x` is `NA`
`>`	greater than	`!is.na(x)`	test if `x` is not `NA`
`>=`	greater than or equal to	`x %in% y`	test if `x` is in `y`
`==`	exactly equal to	`!(x %in% y)`	test if `x` is not in `y`
`!=`	not equal to	`!x`	not `x`
`x & y`	`x` AND `y`
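
These operators work on whole vectors at once, which is what makes them useful inside `filter`. A base R sketch with made-up values:

```r
ages <- c(4, 17, NA, 35, 62)

ages < 18                        # comparisons with NA give NA, not FALSE
missing_age <- is.na(ages)       # TRUE only where the value is missing

# Combine conditions: known age AND at least 18
adult <- !is.na(ages) & ages >= 18

# Either very young OR over 60
young_or_old <- ages < 5 | ages > 60

# Is each value on the left found anywhere on the right?
in_set <- c("Durham", "Wake") %in% c("Wake", "Orange")
```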

`select` to keep variables

ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)

## # A tibble: 4 x 2
##   locality               speed_limit 
##   <chr>                  <chr>       
## 1 Urban (>70% Developed) 30 - 35  MPH
## 2 Urban (>70% Developed) 5 - 15 MPH  
## 3 Urban (>70% Developed) 20 - 25  MPH
## 4 Urban (>70% Developed) 20 - 25  MPH

`select` to exclude variables

ncbikecrash %>%
  select(-object_id)

## # A tibble: 7,467 x 65
##    city  county region development locality on_road rural_urban speed_limit
##    <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>       <chr>      
##  1 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural       50 - 55  M…
##  2 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban       30 - 35  M…
##  3 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural       50 - 55  M…
##  4 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban       30 - 35  M…
##  5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban       <NA>       
##  6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural       50 - 55  M…
##  7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural       30 - 35  M…
##  8 Rale… Wake   Piedm… Commercial  Urban (… PERSON… Urban       30 - 35  M…
##  9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban       30 - 35  M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban       20 - 25  M…
## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>

`select` a range of variables

ncbikecrash %>%
  select(city:locality)

## # A tibble: 7,467 x 5
##    city           county     region  development       locality            
##    <chr>          <chr>      <chr>   <chr>             <chr>               
##  1 None - Rural … Wayne      Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  2 Henderson      Vance      Piedmo… Residential       Mixed (30% To 70% D…
##  3 None - Rural … Lincoln    Piedmo… Farms, Woods, Pa… Rural (<30% Develop…
##  4 Whiteville     Columbus   Coastal Commercial        Urban (>70% Develop…
##  5 Wilmington     New Hanov… Coastal Residential       Urban (>70% Develop…
##  6 None - Rural … Robeson    Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  7 None - Rural … Richmond   Piedmo… Residential       Mixed (30% To 70% D…
##  8 Raleigh        Wake       Piedmo… Commercial        Urban (>70% Develop…
##  9 Whiteville     Columbus   Coastal Residential       Rural (<30% Develop…
## 10 New Bern       Craven     Coastal Residential       Urban (>70% Develop…
## # … with 7,457 more rows
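
The three ways of using `select` shown so far, side by side on a made-up table small enough to check by eye (column names borrowed from `ncbikecrash`, values invented):

```r
library(dplyr)

crashes <- tibble(
  object_id = 1:2,
  city      = c("Durham", "Raleigh"),
  county    = c("Durham", "Wake"),
  region    = c("Piedmont", "Piedmont"),
  weather   = c("Clear", "Rain")
)

kept    <- crashes %>% select(city, weather)   # keep by name
dropped <- crashes %>% select(-object_id)      # exclude with a minus sign
ranged  <- crashes %>% select(city:region)     # everything from city to region

names(kept)
names(dropped)
names(ranged)
```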

`slice` for certain row numbers

First five

ncbikecrash %>%
  slice(1:5)

## # A tibble: 5 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      1686 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural      
## 2      1674 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban      
## 3      1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural      
## 4      1687 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban      
## 5      1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban      
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>

`slice` for certain row numbers

Last five

last_row <- nrow(ncbikecrash)
ncbikecrash %>%
  slice((last_row - 4):last_row)

## # A tibble: 5 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      6989 High… Guilf… Piedm… Residential Urban (… <NA>    Urban      
## 2      6991 Wilm… New H… Coast… Residential Urban (… <NA>    Urban      
## 3      6995 Kins… Lenoir Coast… Commercial  Urban (… <NA>    Urban      
## 4      6998 Faye… Cumbe… Coast… Residential Urban (… <NA>    Urban      
## 5      7000 None… Onslow Coast… Farms, Woo… Rural (… <NA>    Rural      
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>
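
A possible shortcut, assuming dplyr 1.0.0 or later is installed: `slice_tail()` returns the last rows without computing the row count by hand. The sketch below uses a small made-up tibble (`crashes` is hypothetical, not the real `ncbikecrash` data).

```r
library(dplyr)

# Hypothetical stand-in for ncbikecrash: 10 rows, 2 columns
crashes <- tibble(
  object_id = 1:10,
  county = rep(c("Wake", "Durham"), times = 5)
)

# Same idea as the slide: index the last five rows by position
last_five_manual <- crashes %>%
  slice((n() - 4):n())

# Shortcut: let dplyr find the tail for us
last_five_tail <- crashes %>%
  slice_tail(n = 5)

identical(last_five_manual, last_five_tail)
```

Inside `slice()`, `n()` stands for the number of rows, so there is no need to store `nrow()` in a separate object first.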

56 / 124

`pull` to extract a column as a vector

ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)

## [1] "Rural (<30% Developed)"       "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)"       "Urban (>70% Developed)"      
## [5] "Urban (>70% Developed)"       "Rural (<30% Developed)"

vs.

ncbikecrash %>%
  slice(1:6) %>%
  select(locality)

## # A tibble: 6 x 1
##   locality                    
##   <chr>                       
## 1 Rural (<30% Developed)      
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)      
## 4 Urban (>70% Developed)      
## 5 Urban (>70% Developed)      
## 6 Rural (<30% Developed)

57 / 124

`sample_n` / `sample_frac` for a random sample

sample_n: randomly sample 5 observations

ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)

## [1]  5 66

sample_frac: randomly sample 20% of observations

ncbikecrash_perc20 <- ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)

## [1] 1493   66
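
Random samples change every time the code runs, which works against reproducibility. A minimal sketch of one common remedy, `set.seed()`, on a made-up tibble (`crashes` and the seed value 2019 are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical stand-in for ncbikecrash
crashes <- tibble(object_id = 1:100)

set.seed(2019)  # any fixed number works; 2019 is arbitrary
first_draw <- crashes %>%
  sample_n(5, replace = FALSE)

set.seed(2019)  # resetting the seed reproduces the same draw
second_draw <- crashes %>%
  sample_n(5, replace = FALSE)

identical(first_draw, second_draw)
```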

58 / 124

`distinct` to filter for unique rows

And arrange to order alphabetically

ncbikecrash %>% 
  select(county, city) %>% 
  distinct() %>% 
  arrange(county, city)

## # A tibble: 391 x 2
##    county    city              
##    <chr>     <chr>             
##  1 Alamance  Alamance          
##  2 Alamance  Burlington        
##  3 Alamance  Elon              
##  4 Alamance  Elon College      
##  5 Alamance  Gibsonville       
##  6 Alamance  Graham            
##  7 Alamance  Green Level       
##  8 Alamance  Mebane            
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows

59 / 124

`summarise` to reduce variables to values

ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))

## # A tibble: 1 x 1
##   avg_hr
##    <dbl>
## 1   14.7

60 / 124

`group_by` to do calculations on groups

ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))

## # A tibble: 2 x 2
##   hit_run avg_hr
##   <chr>    <dbl>
## 1 No        14.6
## 2 Yes       15.0
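
`summarise()` can return several values per group at once. The sketch below is an illustration on made-up data (`crashes` is hypothetical); it adds a row count with `n()` and uses `na.rm = TRUE` so that a missing hour does not turn the group mean into `NA`.

```r
library(dplyr)

# Hypothetical crashes, one of them with a missing hour
crashes <- tibble(
  hit_run = c("No", "No", "Yes", "Yes", "No"),
  crash_hour = c(8L, 17L, 22L, NA, 14L)
)

by_hit_run <- crashes %>%
  group_by(hit_run) %>%
  summarise(
    n_crashes = n(),                         # rows per group
    avg_hr = mean(crash_hour, na.rm = TRUE)  # ignore missing hours
  )

by_hit_run
```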

61 / 124

`count` observations in groups

ncbikecrash %>%
  count(driver_alcohol_drugs)

## # A tibble: 6 x 2
##   driver_alcohol_drugs                    n
##   <chr>                               <int>
## 1 Missing                                99
## 2 No                                    695
## 3 Yes-Alcohol,  impairment suspected     12
## 4 Yes-Alcohol, no impairment detected     3
## 5 Yes-Drugs, impairment suspected         4
## 6 <NA>                                 6654

62 / 124

`mutate` to add new variables

ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    driver_alcohol_drugs == "Missing"       ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ "No"
  ))
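
One thing to watch with a recode like this: `str_detect()` returns `NA` for a missing input, so rows where the original value is `NA` fall through to the `TRUE` branch and end up labelled "No". A hedged sketch of one way to keep them missing, on made-up data (`drivers` and the column name `simplified` are hypothetical):

```r
library(dplyr)
library(stringr)

# Hypothetical values covering each case, including a true NA
drivers <- tibble(
  driver_alcohol_drugs = c(
    "Missing", "No", "Yes-Drugs, impairment suspected", NA
  )
)

recoded <- drivers %>%
  mutate(simplified = case_when(
    # Treat both real NAs and the "Missing" label as missing
    is.na(driver_alcohol_drugs) |
      driver_alcohol_drugs == "Missing"     ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ "No"
  ))

recoded
```

`NA_character_` keeps every right-hand side the same type, which older versions of dplyr require.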

63 / 124

"Save" when you `mutate`

Most often when you define a new variable with mutate you'll also want to save the resulting data frame, often by writing over the original data frame.

ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))

64 / 124

Check before you move on

ncbikecrash %>% 
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)

## # A tibble: 6 x 3
##   driver_alcohol_drugs                driver_alcohol_drugs_simplified     n
##   <chr>                               <chr>                           <int>
## 1 Missing                             Missing                            99
## 2 No                                  No                                695
## 3 Yes-Alcohol,  impairment suspected  Yes                                12
## 4 Yes-Alcohol, no impairment detected Yes                                 3
## 5 Yes-Drugs, impairment suspected     Yes                                 4
## 6 <NA>                                <NA>                             6654

ncbikecrash %>% 
  count(driver_alcohol_drugs_simplified)

## # A tibble: 4 x 2
##   driver_alcohol_drugs_simplified     n
##   <chr>                           <int>
## 1 Missing                            99
## 2 No                                695
## 3 Yes                                19
## 4 <NA>                             6654

65 / 124

AE - NC bike crashes

Go to the cloud project and open application exercise NC bike crashes

~/appex/ae-ncbikecrashes.Rmd
For each question you work on, set the eval chunk option to TRUE and knit

66 / 124

Coding style 🤵

67 / 124

Coding style

68 / 124

Style guide

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.

Hadley Wickham

Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
There's more to it than what we'll cover today, but we'll mention more as we introduce more functionality, and do a recap later in the semester

69 / 124

File names and code chunk labels

Do not use spaces in file names, use - or _ to separate words
Use all lowercase letters

# Good
ucb-admit.csv
# Bad
UCB Admit.csv

70 / 124

Object names

Use _ to separate words in object names
Use informative but short object names
Do not reuse object names within an analysis

# Good
acs_employed
# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males

71 / 124

72 / 124

Spacing

Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls.
Always put a space after a comma, and never before (just like in regular English).

# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

73 / 124

ggplot

End each line of a multi-line ggplot call with +
Indent every line after the first

# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()
# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()

74 / 124

Long lines

Limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.
Take advantage of RStudio editor's auto formatting for indentation at line breaks.

75 / 124

Assignment

Use <- not =

# Good
x <- 2
# Bad
x = 2

76 / 124

Quotes

Use ", not ', for quoting text. The only exception is when the text already contains double quotes and no single quotes.

ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  # Good
  labs(title = "`Shine bright like a diamond`",
  # Good
       x = "Diamond prices",
  # Bad
       y = 'Frequency')

77 / 124

Data classes and types + Recoding 💽

78 / 124

Data classes and types

79 / 124

Data types in R

logical
double
integer
character
lists
and some more, but we won't be focusing on those

80 / 124

Logical & character

logical - boolean values TRUE and FALSE

typeof(TRUE)

## [1] "logical"

character - character strings

typeof("hello")

## [1] "character"

typeof('world') # but remember, we use double quotation marks!

## [1] "character"

81 / 124

Double & integer

double - floating point numerical values (default numerical type)

typeof(1.335)

## [1] "double"

typeof(7)

## [1] "double"

integer - integer numerical values (indicated with an L)

typeof(7L)

## [1] "integer"

typeof(1:3)

## [1] "integer"
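
The difference between the two numeric types rarely matters for arithmetic, but it can matter when comparing objects. A small base R illustration (the object names are made up for this example):

```r
x_double <- 7
x_integer <- 7L

typeof(x_double)
typeof(x_integer)

# == compares values, so the two are equal...
x_double == x_integer

# ...but identical() also compares types, so they are not identical
identical(x_double, x_integer)

# Arithmetic that mixes the two returns a double
typeof(x_integer + 0.5)
```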

82 / 124

Lists

Lists are 1d objects that can contain any combination of R objects

mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4)/2)
mylist

## [[1]]
## [1] "A"
## 
## [[2]]
## [1] 1 2 3 4
## 
## [[3]]
## [1]  TRUE FALSE
## 
## [[4]]
## [1] 0.5 1.0 1.5 2.0

str(mylist)

## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2

83 / 124

Named lists

Because of their more complex structure, we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.

myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)

## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"

names(myotherlist)

## [1] "A"           "B"           "knock knock"

myotherlist$B

## [1] 1 2 3 4
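
Names that contain spaces, such as "knock knock", need a little extra care. A short base R sketch of two ways to reach that element, reusing the list defined on this slide:

```r
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")

# $ needs backticks when the name contains a space
myotherlist$`knock knock`

# [[ ]] takes the name as an ordinary string
myotherlist[["knock knock"]]

# [ ] keeps the list wrapper; [[ ]] returns the element itself
class(myotherlist["B"])
class(myotherlist[["B"]])
```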

84 / 124

Concatenation

Vectors can be constructed using the c() function.

c(1, 2, 3)

## [1] 1 2 3

c("Hello", "World!")

## [1] "Hello"  "World!"

c(1, c(2, c(3)))

## [1] 1 2 3

85 / 124

Coercion

R is a dynamically typed language -- it will happily convert between the various types without complaint.

c(1, "Hello")

## [1] "1"     "Hello"

c(FALSE, 3L)

## [1] 0 3

c(1.2, 3L)

## [1] 1.2 3.0
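
Coercion can also be requested explicitly with the `as.*()` functions. The base R sketch below is only an illustration: it shows the direction implicit coercion takes, and what happens when an explicit conversion is not possible.

```r
# Implicit coercion moves towards the more general type:
# logical -> integer -> double -> character
typeof(c(TRUE, 1L))
typeof(c(1L, 1.5))
typeof(c(1.5, "a"))

# Explicit coercion goes the other way, when the values allow it
as.numeric("3")
as.logical("TRUE")

# Values that cannot be converted become NA (with a warning,
# silenced here so the example runs quietly)
converted <- suppressWarnings(as.numeric(c("1", "two", "3")))
converted
```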

86 / 124

Missing Values

R uses NA to represent missing values in its data structures.

typeof(NA)

## [1] "logical"
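
A bare `NA` is logical, but R also has typed versions, and missing values behave in ways that are easy to trip over. A minimal base R sketch (the vector `x` is made up for illustration):

```r
# A bare NA is logical, but it is coerced to match its neighbours
typeof(c(1.5, NA))
typeof(c("a", NA))

# Typed versions exist for when the type must be explicit
typeof(NA_integer_)
typeof(NA_real_)
typeof(NA_character_)

# Comparing with NA gives NA, so test for missingness with is.na()
x <- c(4, NA, 7)
x == NA
is.na(x)

# Most summaries return NA unless missing values are dropped
mean(x)
mean(x, na.rm = TRUE)
```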

87 / 124

Other Special Values

NaN - Not a number

Inf - Positive infinity

-Inf - Negative infinity

pi / 0

## [1] Inf

0 / 0

## [1] NaN

1/0 + 1/0

## [1] Inf

1/0 - 1/0

## [1] NaN

NaN / NA

## [1] NaN

NaN * NA

## [1] NaN

88 / 124

Activity

What is the type of the following vectors? Explain why they have that type.

c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)

89 / 124

Example: Cat lovers

Go to RStudio Cloud and open the application exercise Cat Lovers.

~/appex/ae-catlovers.Rmd

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.

cat_lovers <- read_csv("../data/cat-lovers.csv")

	name	number_of_cats	handedness
1	Bernice Warren	0	left
2	Woodrow Stone	0	left
3	Willie Bass	1	left
4	Tyrone Estrada	3	left
5	Alex Daniels	3	left
6	Jane Bates	2	left
7	Latoya Simpson	1	left
8	Darin Woods	1	left
9	Agnes Cobb	0	left
10	Tabitha Grant	0	left

Showing 1 to 10 of 60 entries

90 / 124

Oh why won't you work?!

cat_lovers %>%
  summarise(mean = mean(number_of_cats))

## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1    NA

91 / 124

Oh why won't you still work??!!

cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))

## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA

92 / 124

Take a breath and look at your data

What is the type of the number_of_cats variable?

glimpse(cat_lovers)

## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0",…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…

93 / 124

Let's take another look

	name	number_of_cats	handedness
1	Bernice Warren	0	left
2	Woodrow Stone	0	left
3	Willie Bass	1	left
4	Tyrone Estrada	3	left
5	Alex Daniels	3	left
6	Jane Bates	2	left
7	Latoya Simpson	1	left
8	Darin Woods	1	left
9	Agnes Cobb	0	left
10	Tabitha Grant	0	left

Showing 1 to 10 of 60 entries

94 / 124

Sometimes you need to babysit your respondents

cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))

## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817

95 / 124

You always need to respect data types

cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))

## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817

96 / 124

Now that we know what we're doing...

cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )

97 / 124

Moral of the story

If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
Go in and investigate your data, apply the fix, save your data, live happily ever after.
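
One way to "go in and investigate" is to list the entries that refuse to convert to a number. The sketch below is an illustration only: `responses` and its values are made up, and `as_number` is a hypothetical helper column.

```r
library(dplyr)

# Hypothetical survey responses; the real cat_lovers data has 60 rows
responses <- tibble(
  name = c("A", "B", "C", "D"),
  number_of_cats = c("0", "three", "2", "1.5 - honestly I lost count")
)

# Keep the rows whose entry does not convert cleanly to a number
problem_rows <- responses %>%
  mutate(as_number = suppressWarnings(as.numeric(number_of_cats))) %>%
  filter(is.na(as_number) & !is.na(number_of_cats))

problem_rows
```

The second condition in `filter()` leaves out entries that were already missing, so only the badly typed answers are listed.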

98 / 124

Vectors vs. lists

x <- c(8, 4, 7)

x[1]

## [1] 8

x[[1]]

## [1] 8

y <- list(8, 4, 7)

y[2]

## [[1]]
## [1] 4

y[[2]]

## [1] 4

Note: When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
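
The printed output hints at the difference; `typeof()` makes it explicit. A short base R sketch using the same values as above:

```r
y <- list(8, 4, 7)

# Single brackets return a shorter list
typeof(y[2])

# Double brackets return the element stored inside
typeof(y[[2]])

# Arithmetic therefore needs the double brackets
y[[2]] + 1

# For an atomic vector both forms give a length-one vector
x <- c(8, 4, 7)
identical(x[1], x[[1]])
```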

99 / 124

Review on your own

100 / 124

Data "set"

101 / 124

Data "sets" in R

"set" is in quotation marks because it is not a formal data class
A tidy data "set" can be one of the following types:
- tibble
- data.frame
We'll often work with tibbles:
- readr package (e.g. read_csv function) loads data as a tibble by default
- tibbles are part of the tidyverse, so they work well with other packages we are using
- they make minimal assumptions about your data, so they are less likely to cause hard-to-track bugs in your code

102 / 124

Data frames

A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic vectors, but list columns work as well). Each vector is treated as a column, and the elements of the vectors as rows.
A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier.
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.

df <- tibble(x = 1:3, y = c("a", "b", "c"))
class(df)

## [1] "tbl_df"     "tbl"        "data.frame"

glimpse(df)

## Observations: 3
## Variables: 2
## $ x <int> 1, 2, 3
## $ y <chr> "a", "b", "c"

103 / 124

Data frames (cont.)

attributes(df)

## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

class(df$x)

## [1] "integer"

class(df$y)

## [1] "character"
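
Because a data frame is a list of columns, the list tools from earlier also work on it. A minimal sketch, assuming the tibble package is installed:

```r
library(tibble)

df <- tibble(x = 1:3, y = c("a", "b", "c"))

# A tibble is a list underneath, with one element per column
typeof(df)
is.list(df)
length(df)   # number of columns, not rows

# So list-style extraction works on columns
df[["y"]]
identical(df$x, df[["x"]])
```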

104 / 124

Working with tibbles in pipelines

How many respondents have below average number of cats?

mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()

## [1] 60

Do you believe this number? Why, why not?

105 / 124

The result of summarise() is always a tibble

mean_cats

## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817

class(mean_cats)

## [1] "tbl_df"     "tbl"        "data.frame"

106 / 124

`pull()` can be your new best friend

But use it sparingly!

mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()

## [1] 33

mean_cats

## [1] 0.8166667

class(mean_cats)

## [1] "numeric"
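
When the summary is only needed inside a single `filter()`, one alternative is to compute it in place and skip the intermediate object. A hedged sketch on made-up data (`cats` is a hypothetical stand-in for the cleaned `cat_lovers`):

```r
library(dplyr)

# Hypothetical stand-in for the cleaned cat_lovers data
cats <- tibble(
  name = c("A", "B", "C", "D", "E"),
  number_of_cats = c(0, 0, 1, 3, 1)
)

# mean() is evaluated on the column inside filter(), so no
# intermediate object (and no pull()) is needed
below_average <- cats %>%
  filter(number_of_cats < mean(number_of_cats)) %>%
  nrow()

below_average
```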

107 / 124

Factors

108 / 124

Factors

Factor objects are how R stores data for categorical variables (fixed numbers of discrete values).

(x <- factor(c("BS", "MS", "PhD", "MS")))

## [1] BS  MS  PhD MS 
## Levels: BS MS PhD

glimpse(x)

##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2

typeof(x)

## [1] "integer"
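
The "integer" type reflects how a factor is stored: integer codes plus a set of labels. A base R sketch of what that means in practice, including a common pitfall (`years` is a made-up example):

```r
x <- factor(c("BS", "MS", "PhD", "MS"))

# The labels live in the levels attribute...
levels(x)

# ...and the data are integer codes pointing at those levels
as.integer(x)

# Pitfall: converting a factor of digits gives the codes, not the values
years <- factor(c("2014", "2007", "2010"))
as.integer(years)                  # level codes
as.integer(as.character(years))    # the years themselves
```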

109 / 124

Read data in as character strings

glimpse(cat_lovers)

## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…

110 / 124

But coerce when plotting

p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
p

111 / 124

Use forcats to manipulate factors

cat_lovers <- cat_lovers %>%
  mutate(handedness = fct_relevel(handedness, 
                                  "right", "left", "ambidextrous"))

p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
p
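
The effect of reordering is easiest to see by printing the levels. A minimal sketch on made-up responses (`handedness` here is a hypothetical vector, not the survey column), assuming the forcats package is installed:

```r
library(forcats)

# Hypothetical responses stored as a factor
handedness <- factor(
  c("left", "right", "right", "ambidextrous", "right", "left")
)

# Default factor levels are alphabetical
levels(handedness)

# fct_relevel() puts the levels in an order we choose
relevelled <- fct_relevel(handedness, "right", "left", "ambidextrous")
levels(relevelled)

# fct_infreq() orders levels from most to least frequent
by_frequency <- fct_infreq(handedness)
levels(by_frequency)
```

Bar charts draw their bars in level order, which is why changing the levels changes the plot.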

112 / 124

Come for the functionality

... stay for the logo

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.
However, factors are still useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors.

Source: forcats.tidyverse.org

113 / 124

Recap

Always best to think of data as part of a tibble
- This plays nicely with the tidyverse as well
- Rows are observations, columns are variables
Be careful about data types / classes
- Sometimes R makes silly assumptions about your data class
  - Using tibbles helps, but it might not solve all issues
  - Think about your data in context, e.g. a 0/1 variable is most likely a factor
- If a plot/output is not behaving the way you expect, first investigate the data class
- If you are absolutely sure of a data class, overwrite it in your tibble so that you don't have to keep track of it
  - mutate the variable with the correct class
Check out Alison Hill's "Working with Data in R", saved in the R folder of the RStudio project

114 / 124

Resources

115 / 124

Online Books

R for Data Science

This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.

ModernDive: Statistical Inference via Data Science

This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.

Data Visualization: A Practical Introduction

This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot.

Fundamentals of Data Visualization

The book is meant as a guide to making visualizations that accurately reflect the data, tell a story, and look professional. Even though nearly all of the figures in this book were made with R and ggplot2, this is not an R book. It focuses on the concepts and the figures, not on the code.

Open source R-based Courses

Alison Hill - Introduction to Biostatistics for the Basic Sciences - Oregon Health & Science University

Mine Cetinkaya-Rundel - Intro to Data Science - Duke

116 / 124

These links will take you to each relevant section in the presentation

117 / 124

Who wants to make a meme? ☝️

118 / 124

Welcome to the `memer` package 📦

remotes::install_github("sctyner/memer")
library(memer)

memer is a tidyverse-compatible R package for creating memes

meme_get("OprahGiveaway") %>% 
  meme_text_bottom("EVERYONE GETS A MEME!", size = 30)

119 / 124

What's in the package?

meme_list()

##  [1] "AllTheThings"       "AmericanChopper"    "AncientAliens"     
##  [4] "BatmanRobin"        "DistractedBf"       "EvilKermit"        
##  [7] "ExpandingBrain"     "FirstWorldProbs"    "FryNotSure"        
## [10] "HotlineDrake"       "IsThisAPigeon"      "NoneOfMyBusiness"  
## [13] "CheersLeo"          "OneDoesNotSimply"   "DosEquisMan"       
## [16] "OffRamp"            "OprahGiveaway"      "Philosoraptor"     
## [19] "PicardFacePalm"     "PicardWTH"          "Purples"           
## [22] "PutItPatrick"       "Rainbow"            "ShiaJustDoIt"      
## [25] "Spongebob"          "SuccessKid"         "ThatWouldBeGreat"  
## [28] "TheRockDriving"     "ThinkAboutIt"       "TrumpBillSigning"  
## [31] "TwoButtonsAnxiety"  "WhatIfIToldYou"     "CondescendingWonka"
## [34] "YoDawg"             "Y-U-NOguy"

120 / 124

Let's make a meme

meme_get("TheRockDriving") %>% 
  meme_text_rock("Hey, how do I prep for an IRB audit?", 
                 "\nPrint, \n...everything.")

121 / 124

Now you try

meme_get("SuccessKid") %>% 
  meme_text_bottom("ENTER TEXT HERE")

122 / 124

Coffee Time

Can't touch this

123 / 124

124 / 124
