Just the tip of the iceberg...
There's not enough time to cover everything
The content presented today is largely based on the Data science in a Box materials
Given our roles, we will focus on:
How to start interacting with R
How to wrangle data in R
...but I'll gladly help you with otherwise:
Anyone who has ever taken wild-caught data through the full process of analysis knows that statistics, in the strict sense of fitting models and doing inference, is but one small part of the process.
Bryan & Wickham (2017)

Coordinator in Dr. Possemato's lab
Fairly new to CIH (less than a year)
Learned R (largely on my own) during graduate school
Trained in biostats
Enjoy the challenge of wrangling messy data


What does it mean for a data analysis to be "reproducible"?
Near-term goals:
Long-term goals:


R is a statistical programming language
But why learn programming?
You must use a computer to do data science; you cannot do it in your head, or with pencil and paper.
Hadley Wickham

➥ Source: R for Data Science

➥ Source: Modern Dive
Follow this link and log in with your Google account:
https://rstudio.cloud/project/395951
Concepts introduced:
A short list (for now):
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)
Columns (variables) in data frames are accessed with $:
dataframe$var_name
Packages are installed with the install.packages function and loaded with the library function, once per session:
install.packages("package_name")
library(package_name)
install.packages(c("tidyverse", "devtools", "datasauRus", "fivethirtyeight", "janitor", "DT"))
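Before running a long install, it can help to check which of these packages are actually missing. The following is a minimal sketch in base R; `missing_packages` is a hypothetical helper name introduced here for illustration, not a function from any package.

```r
# Hypothetical helper: return the packages in `pkgs` that cannot be loaded.
# requireNamespace() returns TRUE when a package is installed and loadable.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

workshop_pkgs <- c("tidyverse", "devtools", "datasauRus",
                   "fivethirtyeight", "janitor", "DT")
to_install <- missing_packages(workshop_pkgs)

# Installing needs an internet connection, so it is left commented out:
# if (length(to_install) > 0) install.packages(to_install)
```

If `to_install` comes back empty, everything needed for the workshop is already in place.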
The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying philosophy and a common grammar.
Fully reproducible reports -- each time you knit, the analysis is run from the beginning
Simple markdown syntax for text
Code goes in chunks, defined by three backticks, narrative goes outside of chunks
Go to RStudio Cloud and open the application exercise Bechdel.
~/appex/ae-bechdel.Rmd
Concepts introduced:
Knitting documents
R Markdown and (some) R syntax
What is the Bechdel test?
The Bechdel test asks whether a work of fiction features at least two women who talk to each other about something other than a man; both women must be named characters.
This presentation was written in R Markdown
... ok, enough self promotion 👨💼
Markdown Quick Reference: Help -> Markdown Quick Reference
Remember this, and expect it to bite you a few times as you're learning to work with R Markdown: The workspace of your R Markdown document is separate from the Console!
First, run the following in the Console:
x <- 2
x * 3
All looks good, eh?
Then, add the following to a chunk in your R Markdown document and knit:
x * 3
What happens? Why the error?
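One way to picture the separation is with two environments, one standing in for the Console and one for the knitting session. This is only an analogy sketched in base R (the names `console_env` and `knit_env` are made up for illustration); it is not how knitting is implemented.

```r
console_env <- new.env()  # stands in for the Console workspace
knit_env    <- new.env()  # stands in for the session used when knitting

# Like typing x <- 2 in the Console
assign("x", 2, envir = console_env)

# x exists in one place but not the other
exists("x", envir = console_env, inherits = FALSE)
exists("x", envir = knit_env, inherits = FALSE)

# Works where x was defined
in_console <- get("x", envir = console_env, inherits = FALSE) * 3

# Fails where x was never defined; the error is caught and reported
in_knit <- tryCatch(
  get("x", envir = knit_env, inherits = FALSE) * 3,
  error = function(e) "error: x is not defined here"
)
```

The fix in a real document is the same as in the sketch: define `x` inside the R Markdown document itself, in a chunk that runs before the one that uses it.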
GitHub as a platform for collaboration
It's actually designed for version control

with human readable messages


Git is a version control system -- like the "Track Changes" feature from Microsoft Word, on steroids. GitHub is the home for your Git-based projects on the internet -- like DropBox but much, much better.
This is outside the scope of this workshop.
There is a great resource for working with git and R: happygitwithr.com.
Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy
Characteristics of tidy data: 😄
Characteristics of untidy data: 😦
!@#$%^&*()

➥ Source: R for Data Science
Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
Hadley Wickham
Is each of the following a dataset or a summary table?
## # A tibble: 87 x 3
##    name               height  mass
##    <chr>               <int> <dbl>
##  1 Luke Skywalker        172    77
##  2 C-3PO                 167    75
##  3 R2-D2                  96    32
##  4 Darth Vader           202   136
##  5 Leia Organa           150    49
##  6 Owen Lars             178   120
##  7 Beru Whitesun lars    165    75
##  8 R5-D4                  97    32
##  9 Biggs Darklighter     183    84
## 10 Obi-Wan Kenobi        182    77
## # … with 77 more rows

## # A tibble: 5 x 2
##   gender        avg_height
##   <chr>              <dbl>
## 1 female              165.
## 2 hermaphrodite       175
## 3 male                179.
## 4 none                200
## 5 <NA>                120

The pipe operator is implemented in the package magrittr; it's pronounced "and then".


➥ Vignette: magrittr
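As a small runnable illustration of reading the pipe as "and then", the sketch below computes the same value twice, once with nested calls and once with pipes. It assumes the magrittr package is installed; the vector `values` is made up for the example.

```r
library(magrittr)  # assumption: magrittr is installed; it provides %>%

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# Piped calls are read top to bottom:
# take values, and then take square roots, and then average, and then round
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions do the same work; the piped one simply lists the steps in the order they happen.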
You can think about the following sequence of actions: find keys, unlock car, start car, drive to campus, park.
Expressed as a set of nested functions in R pseudocode this would look like:
park(drive(start_car(find("keys")), to = "campus"))
Written with pipes, the same sequence reads in the order the actions happen:
find("keys") %>%
  start_car() %>%
  drive(to = "campus") %>%
  park()
To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use .:
starwars %>%
  filter(species == "Human") %>%
  lm(mass ~ height, data = .)
## 
## Call:
## lm(formula = mass ~ height, data = .)
## 
## Coefficients:
## (Intercept)       height  
##     -116.58         1.11

The dataset is in the dsbox package:
GitHub packages require special install commands
The remotes package is automatically installed with devtools
remotes::install_github("rstudio-education/dsbox")
library(dsbox)
ncbikecrash
View the names of variables via
names(ncbikecrash)
##  [1] "object_id"            "city"                 "county"              
##  [4] "region"               "development"          "locality"            
##  [7] "on_road"              "rural_urban"          "speed_limit"         
## [10] "traffic_control"      "weather"              "workzone"            
## [13] "bike_age"             "bike_age_group"       "bike_alcohol"        
## [16] "bike_alcohol_drugs"   "bike_direction"       "bike_injury"         
## [19] "bike_position"        "bike_race"            "bike_sex"            
## [22] "driver_age"           "driver_age_group"     "driver_alcohol"      
## [25] "driver_alcohol_drugs" "driver_est_speed"     "driver_injury"       
## [28] "driver_race"          "driver_sex"           "driver_vehicle_type" 
## [31] "crash_alcohol"        "crash_date"           "crash_day"           
## [34] "crash_group"          "crash_hour"           "crash_location"      
## [37] "crash_month"          "crash_severity"       "crash_time"          
## [40] "crash_type"           "crash_year"           "ambulance_req"       
## [43] "hit_run"              "light_condition"      "road_character"      
## [46] "road_class"           "road_condition"       "road_configuration"  
## [49] "road_defects"         "road_feature"         "road_surface"        
## [52] "num_bikes_ai"         "num_bikes_bi"         "num_bikes_ci"        
## [55] "num_bikes_ki"         "num_bikes_no"         "num_bikes_to"        
## [58] "num_bikes_ui"         "num_lanes"            "num_units"           
## [61] "distance_mi_from"     "frm_road"             "rte_invd_cd"         
## [64] "towrd_road"           "geo_point"            "geo_shape"
and see detailed descriptions with ?ncbikecrash.
In the Environment pane, after loading the data with data(ncbikecrash), click on the name of the data frame to view it in the data viewer
Use the glimpse function to take a peek
glimpse(ncbikecrash)
## Observations: 7,467
## Variables: 66
## $ object_id            <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1…
## $ city                 <chr> "None - Rural Crash", "Henderson", "None - …
## $ county               <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N…
## $ region               <chr> "Coastal", "Piedmont", "Piedmont", "Coastal…
## $ development          <chr> "Farms, Woods, Pastures", "Residential", "F…
## $ locality             <chr> "Rural (<30% Developed)", "Mixed (30% To 70…
## $ on_road              <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK…
## $ rural_urban          <chr> "Rural", "Urban", "Rural", "Urban", "Urban"…
## $ speed_limit          <chr> "50 - 55  MPH", "30 - 35  MPH", "50 - 55  M…
## $ traffic_control      <chr> "No Control Present", "Stop Sign", "Double …
## $ weather              <chr> "Clear", "Clear", "Clear", "Rain", "Clear",…
## $ workzone             <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ bike_age             <chr> "52", "66", "33", "52", "22", "15", "41", "…
## $ bike_age_group       <chr> "50-59", "60-69", "30-39", "50-59", "20-24"…
## $ bike_alcohol         <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ bike_alcohol_drugs   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bike_direction       <chr> "With Traffic", "With Traffic", "With Traff…
## $ bike_injury          <chr> "B: Evident Injury", "C: Possible Injury", …
## $ bike_position        <chr> "Bike Lane / Paved Shoulder", "Travel Lane"…
## $ bike_race            <chr> "Black", "Black", "White", "Black", "White"…
## $ bike_sex             <chr> "Male", "Male", "Male", "Male", "Female", "…
## $ driver_age           <chr> "34", NA, "37", "55", "25", "17", NA, "50",…
## $ driver_age_group     <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-…
## $ driver_alcohol       <chr> "No", "Missing", "No", "No", "No", "No", "M…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ driver_est_speed     <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1…
## $ driver_injury        <chr> "O: No Injury", "Unknown Injury", "O: No In…
## $ driver_race          <chr> "White", "Unknown/Missing", "Hispanic", "Bl…
## $ driver_sex           <chr> "Male", NA, "Female", "Male", "Male", "Fema…
## $ driver_vehicle_type  <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "…
## $ crash_alcohol        <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ crash_date           <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D…
## $ crash_day            <chr> "Wednesday", "Wednesday", "Sunday", "Saturd…
## $ crash_group          <chr> "Motorist Overtaking Bicyclist", "Bicyclist…
## $ crash_hour           <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22…
## $ crash_location       <chr> "Non-Intersection", "Intersection", "Non-In…
## $ crash_month          <chr> "December", "November", "November", "Decemb…
## $ crash_severity       <chr> "B: Evident Injury", "C: Possible Injury", …
## $ crash_time           <drtn> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13…
## $ crash_type           <chr> "Motorist Overtaking - Undetected Bicyclist…
## $ crash_year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ ambulance_req        <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y…
## $ hit_run              <chr> "No", "Yes", "No", "No", "No", "No", "Yes",…
## $ light_condition      <chr> "Dark - Roadway Not Lighted", NA, "Dark - R…
## $ road_character       <chr> "Straight - Level", "Straight - Level", "St…
## $ road_class           <chr> "State Secondary Route", "Local Street", "U…
## $ road_condition       <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi…
## $ road_configuration   <chr> "Two-Way, Not Divided", "Two-Way, Divided, …
## $ road_defects         <chr> "None", NA, "None", "None", "None", "None",…
## $ road_feature         <chr> "No Special Feature", "T-Intersection", "No…
## $ road_surface         <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth…
## $ num_bikes_ai         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_bi         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ci         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ki         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_no         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_to         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ui         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_lanes            <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ num_units            <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ distance_mi_from     <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rte_invd_cd          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ towrd_road           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ geo_point            <chr> "35.3336070056, -77.9955023901", "36.315187…
## $ geo_shape            <chr> "{\"type\": \"Point\", \"coordinates\": [-7…

dplyr is based on the concepts of functions as verbs that manipulate data frames.

filter: pick rows matching criteria
slice: pick rows using index(es)
select: pick columns by name
pull: grab a column as a vector
arrange: reorder rows
mutate: add new variables
distinct: filter for unique rows
sample_n / sample_frac: randomly sample rows
summarise: reduce variables to values

Rules of dplyr functions:
First argument is always a data frame
Subsequent arguments say what to do with that data frame
Always return a data frame
Don't modify in place
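These rules can be checked on any data frame. The sketch below uses the built-in mtcars data rather than the workshop dataset, and assumes dplyr is installed; the object names `n_before` and `four_cyl` are made up for the example.

```r
library(dplyr)  # assumption: dplyr is installed

n_before <- nrow(mtcars)

# First argument is a data frame; later arguments say what to do with it
four_cyl <- filter(mtcars, cyl == 4)

# The result is itself a data frame
is.data.frame(four_cyl)

# Only the four-cylinder cars were kept in the result
nrow(four_cyl)

# The original data frame was not modified in place
nrow(mtcars) == n_before
```

Because nothing is modified in place, a result has to be assigned to a name (as with `four_cyl`) if it is needed later.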
The %>% operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
The + operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +.
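Mixing up the two operators is a common stumbling block, so here is a hedged sketch that uses both in one statement. It assumes dplyr and ggplot2 are installed and uses the built-in mtcars data; the plot itself is only an illustration.

```r
library(dplyr)    # assumption: both packages are installed
library(ggplot2)

p <- mtcars %>%
  filter(cyl == 4) %>%                # pipe: pass the data frame along
  ggplot(mapping = aes(x = mpg)) +    # from here on, layers are added with +
  geom_histogram(bins = 10)

# Print the plot to draw it
# p
```

The switch happens at the call to `ggplot()`: everything before it transforms data with `%>%`, and everything after it adds plot layers with `+`.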
filter to select a subset of rows
for crashes in Durham County
ncbikecrash %>%
  filter(county == "Durham")
## # A tibble: 340 x 66
##    object_id city  county region development locality on_road rural_urban
##        <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>
##  1      2452 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
##  2      2441 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  3      2466 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  4       549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban
##  5       598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban
##  6       603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban
##  7      3974 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  8      7134 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban
##  9      1670 Durh… Durham Piedm… Commercial  Urban (… INFINI… Urban
## 10      1773 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
## # … with 330 more rows, and 58 more variables: speed_limit <chr>,
## #   traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## #   bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## #   bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## #   bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## #   driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## #   driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## #   driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## #   crash_day <chr>, crash_group <chr>, crash_hour <int>,
## #   crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## #   crash_time <drtn>, crash_type <chr>, crash_year <int>,
## #   ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## #   road_character <chr>, road_class <chr>, road_condition <chr>,
## #   road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## #   road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>,
## #   num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>,
## #   num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>,
## #   num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## #   rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>

filter for many conditions at once
for crashes in Durham County where biker was 0-5 years old
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5")
## # A tibble: 4 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>
## 1      4062 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
## 2       414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban
## 3      3016 Durh… Durham Piedm… Residential Urban (… <NA>    Urban
## 4      1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>

| operator | definition | operator | definition |
|---|---|---|---|
| < | less than | x \| y | x OR y |
| <= | less than or equal to | is.na(x) | test if x is NA |
| > | greater than | !is.na(x) | test if x is not NA |
| >= | greater than or equal to | x %in% y | test if x is in y |
| == | exactly equal to | !(x %in% y) | test if x is not in y |
| != | not equal to | !x | not x |
| x & y | x AND y | | |
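To see how these operators behave, especially around missing values, here is a small base R sketch. The vector `x` is made up for the example; the comments show the result of each line.

```r
x <- c(3, NA, 7, 10)

x > 5                 # FALSE    NA  TRUE  TRUE  (comparing with NA gives NA)
is.na(x)              # FALSE  TRUE FALSE FALSE
!is.na(x) & x > 5     # FALSE FALSE  TRUE  TRUE
x %in% c(3, 10)       #  TRUE FALSE FALSE  TRUE  (%in% never returns NA)
!(x %in% c(3, 10))    # FALSE  TRUE  TRUE FALSE
x == 3 | x == 10      #  TRUE    NA FALSE  TRUE
```

Inside `filter`, rows where the condition evaluates to `NA` are dropped, which is why `is.na()` is the way to test for missing values rather than `== NA`.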
select to keep variables
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
## # A tibble: 4 x 2
##   locality               speed_limit
##   <chr>                  <chr>
## 1 Urban (>70% Developed) 30 - 35  MPH
## 2 Urban (>70% Developed) 5 - 15 MPH
## 3 Urban (>70% Developed) 20 - 25  MPH
## 4 Urban (>70% Developed) 20 - 25  MPH

select to exclude variables
ncbikecrash %>%
  select(-object_id)
## # A tibble: 7,467 x 65
##    city  county region development locality on_road rural_urban speed_limit
##    <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>       <chr>
##  1 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural       50 - 55  M…
##  2 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban       30 - 35  M…
##  3 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural       50 - 55  M…
##  4 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban       30 - 35  M…
##  5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban       <NA>
##  6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural       50 - 55  M…
##  7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural       30 - 35  M…
##  8 Rale… Wake   Piedm… Commercial  Urban (… PERSON… Urban       30 - 35  M…
##  9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban       30 - 35  M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban       20 - 25  M…
## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>

select a range of variables
ncbikecrash %>%
  select(city:locality)
## # A tibble: 7,467 x 5
##    city           county     region  development       locality
##    <chr>          <chr>      <chr>   <chr>             <chr>
##  1 None - Rural … Wayne      Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  2 Henderson      Vance      Piedmo… Residential       Mixed (30% To 70% D…
##  3 None - Rural … Lincoln    Piedmo… Farms, Woods, Pa… Rural (<30% Develop…
##  4 Whiteville     Columbus   Coastal Commercial        Urban (>70% Develop…
##  5 Wilmington     New Hanov… Coastal Residential       Urban (>70% Develop…
##  6 None - Rural … Robeson    Coastal Farms, Woods, Pa… Rural (<30% Develop…
##  7 None - Rural … Richmond   Piedmo… Residential       Mixed (30% To 70% D…
##  8 Raleigh        Wake       Piedmo… Commercial        Urban (>70% Develop…
##  9 Whiteville     Columbus   Coastal Residential       Rural (<30% Develop…
## 10 New Bern       Craven     Coastal Residential       Urban (>70% Develop…
## # … with 7,457 more rows

slice for certain row numbers
First five
ncbikecrash %>%
  slice(1:5)
## # A tibble: 5 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>
## 1      1686 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural
## 2      1674 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban
## 3      1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural
## 4      1687 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban
## 5      1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>

slice for certain row numbers
Last five
last_row <- nrow(ncbikecrash)
ncbikecrash %>%
  slice((last_row - 4):last_row)
## # A tibble: 5 x 66
##   object_id city  county region development locality on_road rural_urban
##       <int> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>
## 1      6989 High… Guilf… Piedm… Residential Urban (… <NA>    Urban
## 2      6991 Wilm… New H… Coast… Residential Urban (… <NA>    Urban
## 3      6995 Kins… Lenoir Coast… Commercial  Urban (… <NA>    Urban
## 4      6998 Faye… Cumbe… Coast… Residential Urban (… <NA>    Urban
## 5      7000 None… Onslow Coast… Farms, Woo… Rural (… <NA>    Rural
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## #   bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <int>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## #   crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## #   num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## #   num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## #   num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## #   frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## #   geo_shape <chr>

pull to extract a column as a vector
ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)
## [1] "Rural (<30% Developed)"       "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)"       "Urban (>70% Developed)"
## [5] "Urban (>70% Developed)"       "Rural (<30% Developed)"
vs.
ncbikecrash %>%
  slice(1:6) %>%
  select(locality)
## # A tibble: 6 x 1
##   locality
##   <chr>
## 1 Rural (<30% Developed)
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)
## 4 Urban (>70% Developed)
## 5 Urban (>70% Developed)
## 6 Rural (<30% Developed)

sample_n / sample_frac for a random sample
sample_n: randomly sample 5 observations
ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
## [1]  5 66
sample_frac: randomly sample 20% of observations
ncbikecrash_perc20 <- ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
## [1] 1493   66

distinct to filter for unique rows
And arrange to order alphabetically
ncbikecrash %>%
  select(county, city) %>%
  distinct() %>%
  arrange(county, city)
## # A tibble: 391 x 2
##    county    city
##    <chr>     <chr>
##  1 Alamance  Alamance
##  2 Alamance  Burlington
##  3 Alamance  Elon
##  4 Alamance  Elon College
##  5 Alamance  Gibsonville
##  6 Alamance  Graham
##  7 Alamance  Green Level
##  8 Alamance  Mebane
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows

summarise to reduce variables to values
ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 1 x 1
##   avg_hr
##    <dbl>
## 1   14.7

group_by to do calculations on groups
ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 2 x 2
##   hit_run avg_hr
##   <chr>    <dbl>
## 1 No        14.6
## 2 Yes       15.0

count observations in groups
ncbikecrash %>%
  count(driver_alcohol_drugs)
## # A tibble: 6 x 2
##   driver_alcohol_drugs                    n
##   <chr>                               <int>
## 1 Missing                                99
## 2 No                                    695
## 3 Yes-Alcohol,  impairment suspected     12
## 4 Yes-Alcohol, no impairment detected     3
## 5 Yes-Drugs, impairment suspected         4
## 6 <NA>                                 6654

mutate to add new variables
ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    driver_alcohol_drugs == "Missing"       ~ NA_character_,
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ "No"
  ))

mutate
Most often when you define a new variable with mutate you'll also want to save the resulting data frame, often by writing over the original data frame.
ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE                                    ~ driver_alcohol_drugs
  ))
ncbikecrash %>%
  count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
## # A tibble: 6 x 3
##   driver_alcohol_drugs                driver_alcohol_drugs_simplified     n
##   <chr>                               <chr>                           <int>
## 1 Missing                             Missing                            99
## 2 No                                  No                                695
## 3 Yes-Alcohol,  impairment suspected  Yes                                12
## 4 Yes-Alcohol, no impairment detected Yes                                 3
## 5 Yes-Drugs, impairment suspected     Yes                                 4
## 6 <NA>                                <NA>                             6654
ncbikecrash %>%
  count(driver_alcohol_drugs_simplified)
## # A tibble: 4 x 2
##   driver_alcohol_drugs_simplified     n
##   <chr>                           <int>
## 1 Missing                            99
## 2 No                                695
## 3 Yes                                19
## 4 <NA>                             6654

Go to the cloud project and open application exercise NC bike crashes
~/appex/ae-ncbikecrashes.Rmd
For each question you work on, set the eval chunk option to TRUE and knit
Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
Hadley Wickham
Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
There's more to it than what we'll cover today, but we'll mention more as we introduce more functionality, and do a recap later in the semester
Use - or _ to separate words in file names
# Good
ucb-admit.csv
# Bad
UCB Admit.csv

Use _ to separate words in object names
# Good
acs_employed
# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males

Put spaces around operators and after commas
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

In ggplot2 code, end each line with + and indent the next line
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()
# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()

Use <- not = for assignment
# Good
x <- 2
# Bad
x = 2

Use ", not ', for quoting text. The only exception is when the text already contains double quotes and no single quotes.
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  # Good
  labs(title = "`Shine bright like a diamond`",
  # Good
       x = "Diamond prices",
  # Bad
       y = 'Frequency')

logical - boolean values TRUE and FALSE
typeof(TRUE)
## [1] "logical"

character - character strings
typeof("hello")
## [1] "character"
typeof('world') # but remember, we use double quotations!
## [1] "character"

double - floating point numerical values (default numerical type)
typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"

integer - integer numerical values (indicated with an L)
typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"

Lists are 1d objects that can contain any combination of R objects
mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4) / 2)
mylist
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1]  TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
##  $ : chr "A"
##  $ : int [1:4] 1 2 3 4
##  $ : logi [1:2] TRUE FALSE
##  $ : num [1:4] 0.5 1 1.5 2

Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
##  $ A          : chr "hello"
##  $ B          : int [1:4] 1 2 3 4
##  $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A"           "B"           "knock knock"
myotherlist$B
## [1] 1 2 3 4

Vectors can be constructed using the c() function.
c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello"  "World!"
c(1, c(2, c(3)))
## [1] 1 2 3

R is a dynamically typed language -- it will happily convert between the various types without complaint.
c(1, "Hello")
## [1] "1"     "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0

R uses NA to represent missing values in its data structures.
typeof(NA)
## [1] "logical"

NaN - Not a number
Inf - Positive infinity
-Inf - Negative infinity
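The special values listed above, together with NA, can be told apart with base R predicate functions. The sketch below is only an illustration; the vector name `special` is made up for the example.

```r
# One vector holding each special value plus an ordinary number.
# Inside a numeric vector, NA is stored as a missing double.
special <- c(NA, NaN, Inf, -Inf, 1)

is.na(special)        # TRUE for both NA and NaN
is.nan(special)       # TRUE only for NaN
is.infinite(special)  # TRUE for Inf and -Inf
is.finite(special)    # TRUE only for the ordinary number
```

Note that is.na() also flags NaN, so is.nan() is the function to reach for when the two need to be distinguished.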
pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN

What is the type of the following vectors? Explain why they have that type.
c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)

Go to RStudio Cloud and open the application exercise Cat Lovers.
~/appex/ae-catlovers.Rmd

A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.
cat_lovers <- read_csv("../data/cat-lovers.csv")
cat_lovers %>%
  summarise(mean = mean(number_of_cats))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1    NA
cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1        NA

What is the type of the number_of_cats variable?
glimpse(cat_lovers)
## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0",…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass"    ~ 3,
    TRUE                   ~ as.numeric(number_of_cats)
    )) %>%
  summarise(mean_cats = mean(number_of_cats))
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    ) %>%
  summarise(mean_cats = mean(number_of_cats))
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
      ),
    number_of_cats = as.numeric(number_of_cats)
    )

If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
Go in and investigate your data, apply the fix, save your data, live happily ever after.
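One way to do the "investigate" step is to attempt the conversion and look at which entries fail. The sketch below uses base R only; the `number_of_cats` vector and its entries are hypothetical stand-ins, not the actual survey responses.

```r
# Hypothetical responses; two were typed as text instead of digits
number_of_cats <- c("0", "1", "three", "2", "one and a half")

typeof(number_of_cats)  # "character", because one text entry coerces them all

# as.numeric() converts what it can and returns NA (with a warning) otherwise
parsed <- suppressWarnings(as.numeric(number_of_cats))

# Entries that could not be parsed are the ones to fix by hand
bad_entries <- number_of_cats[is.na(parsed)]
bad_entries

# Once the offending entries are recoded, the conversion is clean
fixed <- number_of_cats
fixed[fixed == "three"] <- "3"
fixed[fixed == "one and a half"] <- "1.5"
mean_cats_fixed <- mean(as.numeric(fixed))
mean_cats_fixed
```

This check would also flag any entries that were already missing before conversion, so it is worth inspecting the flagged values rather than recoding them blindly.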
x <- c(8, 4, 7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8, 4, 7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4

Note: When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
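The practical difference between single and double square brackets shows up once you try to use the result. A minimal base R sketch, with the object names `as_list` and `as_value` chosen only for this example:

```r
y <- list(8, 4, 7)

as_list  <- y[2]    # single brackets: a list of length 1 that contains 4
as_value <- y[[2]]  # double brackets: the element itself, the number 4

class(as_list)   # "list"
class(as_value)  # "numeric"

# Arithmetic works on the extracted element
as_value + 1

# The same arithmetic on the one-element list raises an error,
# which try() captures here so the script keeps running
result <- try(as_list + 1, silent = TRUE)
class(result)    # "try-error"
```

For an atomic vector such as x above, both forms return the same value, which is why the distinction only becomes visible with lists.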
"set" is in quotation marks because it is not a formal data class
A tidy data "set" can be one of the following types:
tibble
data.frame

We'll often work with tibbles:
The readr package (e.g. the read_csv function) loads data as a tibble by default
tibbles are part of the tidyverse, so they work well with other packages we are using

A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but you can use generic vectors as well). Each vector is treated as a column and the elements of the vectors as rows.
A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier.
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
df <- tibble(x = 1:3, y = c("a", "b", "c"))
class(df)
## [1] "tbl_df"     "tbl"        "data.frame"
glimpse(df)
## Observations: 3
## Variables: 2
## $ x <int> 1, 2, 3
## $ y <chr> "a", "b", "c"
attributes(df)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "character"

How many respondents have below average number of cats?
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
## [1] 60

Do you believe this number? Why or why not?
mean_cats
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
class(mean_cats)
## [1] "tbl_df"     "tbl"        "data.frame"

pull() can be your new best friend
But use it sparingly!
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
## [1] 33
mean_cats
## [1] 0.8166667
class(mean_cats)
## [1] "numeric"

Factor objects are how R stores data for categorical variables (fixed numbers of discrete values).
(x <- factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS
## Levels: BS MS PhD
glimpse(x)
##  Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
typeof(x)
## [1] "integer"
glimpse(cat_lovers)
## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…
p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
p

cat_lovers <- cat_lovers %>%
  mutate(handedness = fct_relevel(handedness,
                                   "right", "left", "ambidextrous"))
p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
p

... stay for the logo

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.
However, factors are still useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors.
Source: forcats.tidyverse.org
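To see what "overriding the ordering" means in practice, the sketch below compares the default level order of a factor with the order after fct_relevel(). It assumes the forcats package is installed; the `handedness` vector is a made-up example, not the cat_lovers data.

```r
library(forcats)

# Hypothetical character vector of categories
handedness <- c("left", "right", "ambidextrous", "right", "left")

# By default, factor levels are sorted alphabetically
f_default <- factor(handedness)
levels(f_default)

# fct_relevel() moves the named levels to the front, in the order given
f_releveled <- fct_relevel(f_default, "right", "left", "ambidextrous")
levels(f_releveled)

# The values themselves are unchanged; only the level order differs
identical(as.character(f_default), as.character(f_releveled))
```

Plotting functions such as geom_bar() draw categories in level order, which is why releveling changes the order of the bars.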
Always best to think of data as part of a tibble
This works well with the tidyverse as well
Be careful about data types / classes
R makes silly assumptions about your data class; tibbles help, but they might not solve all issues
mutate the variable with the correct class, e.g. factor
Check out Alison Hill's "Working with Data in R", saved in the R folder of the RStudio project
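The recap points above can be sketched as a short check-and-fix routine: inspect the class of every column, overwrite the ones that are wrong, and use pull() when a single value is needed rather than a tibble. The example assumes the dplyr package is installed; the `survey` tibble is hypothetical.

```r
library(dplyr)

# Hypothetical survey data in which a count was read in as text
survey <- tibble(
  name           = c("A", "B", "C"),
  number_of_cats = c("0", "2", "1"),
  handedness     = c("left", "right", "left")
)

# Check the class of every column before analysing anything
sapply(survey, class)

# Overwrite the columns whose class does not suit the analysis
survey <- survey %>%
  mutate(
    number_of_cats = as.numeric(number_of_cats),
    handedness     = factor(handedness)
  )

sapply(survey, class)

# summarise() returns a tibble; pull() extracts the column as a plain vector
mean_cats <- survey %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()
class(mean_cats)
```

Overwriting the column in the tibble once means later steps do not need to repeat the conversion.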
R for Data Science
This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
ModernDive: Statistical Inference via Data Science
This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.
Data Visualization: A Practical Introduction
This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot.
Fundamentals of Data Visualization
The book is meant as a guide to making visualizations that accurately reflect the data, tell a story, and look professional. Even though nearly all of the figures in this book were made with R and ggplot2, this is not an R book. It focuses on the concepts and the figures, not on the code.
Alison Hill - Introduction to Biostatistics for the Basic Sciences - Oregon Health & Science University
Mine Cetinkaya-Rundel - Intro to Data Science - Duke
These links will take you to each relevant section in the presentation
memer package 📦
remotes::install_github("sctyner/memer")
library(memer)
memer is a tidyverse-compatible R package for creating memes
meme_get("OprahGiveaway") %>%
  meme_text_bottom("EVERYONE GETS A MEME!", size = 30)

meme_list()
##  [1] "AllTheThings"       "AmericanChopper"    "AncientAliens"
##  [4] "BatmanRobin"        "DistractedBf"       "EvilKermit"
##  [7] "ExpandingBrain"     "FirstWorldProbs"    "FryNotSure"
## [10] "HotlineDrake"       "IsThisAPigeon"      "NoneOfMyBusiness"
## [13] "CheersLeo"          "OneDoesNotSimply"   "DosEquisMan"
## [16] "OffRamp"            "OprahGiveaway"      "Philosoraptor"
## [19] "PicardFacePalm"     "PicardWTH"          "Purples"
## [22] "PutItPatrick"       "Rainbow"            "ShiaJustDoIt"
## [25] "Spongebob"          "SuccessKid"         "ThatWouldBeGreat"
## [28] "TheRockDriving"     "ThinkAboutIt"       "TrumpBillSigning"
## [31] "TwoButtonsAnxiety"  "WhatIfIToldYou"     "CondescendingWonka"
## [34] "YoDawg"             "Y-U-NOguy"
meme_get("TheRockDriving") %>%
  meme_text_rock("Hey, how do I prep for an IRB audit?",
                 "\nPrint, \n...everything.")

meme_get("SuccessKid") %>%
  meme_text_bottom("ENTER TEXT HERE")
