Just the tip of the iceberg...
There's not enough time to cover everything
The content presented today is largely based on the Data Science in a Box materials
Given our roles, we will focus on:
How to start interacting with R
How to wrangle data in R
...but I'll gladly help you with anything else:
Anyone who has ever taken wild-caught data through the full process of analysis knows that statistics, in the strict sense of fitting models and doing inference, is but one small part of the process.
Bryan & Wickham (2017)
Coordinator in Dr. Possemato's lab
Fairly new to CIH (less than a year)
Learned R (largely on my own) during graduate school
Trained in biostats
Enjoy the challenge of wrangling messy data
What does it mean for a data analysis to be "reproducible"?
Near-term goals:
Long-term goals:
R is a statistical programming language
But why learn programming?
You must use a computer to do data science; you cannot do it in your head, or with pencil and paper.
Hadley Wickham
➥ Source: R for Data Science
➥ Source: Modern Dive
Follow this link and log in with your Google account:
https://rstudio.cloud/project/395951
Concepts introduced:
A short list (for now):
Functions are called by name, with their inputs in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)
Columns of a data frame are accessed with $:
dataframe$var_name
Packages are installed with the install.packages function and loaded with the library function, once per session:
install.packages("package_name")
library(package_name)
install.packages(c("tidyverse", "devtools", "datasauRus", "fivethirtyeight", "janitor", "DT"))
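To make the short list above concrete, here is a minimal sketch that uses only base R. The built-in mtcars data frame is an assumption chosen for illustration; it is not part of the workshop materials.

```r
# Functions are verbs applied to whatever is inside the parentheses.
# `$` pulls the mpg column out of the mtcars data frame.
avg_mpg <- mean(mtcars$mpg)

# Additional arguments are separated by commas.
rounded_mpg <- round(avg_mpg, digits = 1)

# Packages that ship with R are already installed, so only library() is needed.
library(stats)

rounded_mpg
```

For a package that does not ship with R, install.packages("package_name") would need to be run once before library(package_name).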
The tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying philosophy and a common grammar.
Fully reproducible reports -- each time you knit, the analysis is run from the beginning
Simple markdown syntax for text
Code goes in chunks, defined by three backticks, narrative goes outside of chunks
Go to RStudio Cloud and open the application exercise Bechdel.
~/appex/ae-bechdel.Rmd
Concepts introduced:
Knitting documents
R Markdown and (some) R syntax
What is the Bechdel test?
The Bechdel test asks whether a work of fiction features at least two named women characters who talk to each other about something other than a man.
This presentation was written in R Markdown
... ok, enough self promotion 👨💼
Markdown Quick Reference: Help -> Markdown Quick Reference
Remember this, and expect it to bite you a few times as you're learning to work with R Markdown: The workspace of your R Markdown document is separate from the Console!
x <- 2
x * 3
All looks good, eh?
x * 3
What happens? Why the error?
GitHub as a platform for collaboration
It's actually designed for version control
with human readable messages
Git is a version control system -- like the “Track Changes” feature from Microsoft Word, on steroids. GitHub is the home for your Git-based projects on the internet -- like Dropbox but much, much better.
This is outside the scope of this workshop.
There is a great resource for working with git and R: happygitwithr.com.
Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy
Characteristics of tidy data: 😄
Characteristics of untidy data: 😦
!@#$%^&*()
➥ Source: R for Data Science
Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
Hadley Wickham
Is each of the following a dataset or a summary table?
## # A tibble: 87 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # … with 77 more rows
## # A tibble: 5 x 2
## gender avg_height
## <chr> <dbl>
## 1 female 165.
## 2 hermaphrodite 175
## 3 male 179.
## 4 none 200
## 5 <NA> 120
The pipe operator is implemented in the magrittr package; it is pronounced "and then".
➥ Vignette: magrittr
You can think about the following sequence of actions - find key, unlock car, start car, drive to school, park.
Expressed as a set of nested functions in R pseudocode this would look like:
park(drive(start_car(find("keys")), to = "campus"))
find("keys") %>% start_car() %>% drive(to = "campus") %>% park()
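The car example is pseudocode, so it cannot be run. A runnable sketch of the same idea, assuming the magrittr package is installed and using a made-up numeric vector, compares a nested call with its piped equivalent:

```r
library(magrittr)  # provides the %>% operator

values <- c(1, 4, 9)  # hypothetical input, for illustration only

# Nested: has to be read from the inside out
nested <- round(sqrt(sum(values)), digits = 1)

# Piped: reads left to right, "and then" style
piped <- values %>%
  sum() %>%
  sqrt() %>%
  round(digits = 1)

identical(nested, piped)
```

Both versions perform the same three steps; the pipe only changes the order in which they are written.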
To send results to a function argument other than the first one, or to use the previous result for multiple arguments, use . as a placeholder:
starwars %>% filter(species == "Human") %>% lm(mass ~ height, data = .)
##
## Call:
## lm(formula = mass ~ height, data = .)
##
## Coefficients:
## (Intercept) height
## -116.58 1.11
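The same placeholder pattern can be tried without the starwars data. This sketch assumes magrittr is installed and swaps in the built-in mtcars data frame and base R's subset function, neither of which appears in the original slides:

```r
library(magrittr)

# lm() expects a formula first, so the piped data frame has to be routed
# to the `data` argument explicitly with the `.` placeholder.
fit <- mtcars %>%
  subset(cyl == 4) %>%
  lm(mpg ~ wt, data = .)

nobs(fit)  # number of rows that reached lm()
```

Without `data = .`, the pipe would try to pass the data frame as the formula and the call would fail.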
The dataset is in the dsbox package:
GitHub packages require special install commands
the remotes package is automatically installed with devtools
remotes::install_github("rstudio-education/dsbox")
library(dsbox)
ncbikecrash
View the names of variables via
names(ncbikecrash)
## [1] "object_id" "city" "county"
## [4] "region" "development" "locality"
## [7] "on_road" "rural_urban" "speed_limit"
## [10] "traffic_control" "weather" "workzone"
## [13] "bike_age" "bike_age_group" "bike_alcohol"
## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury"
## [19] "bike_position" "bike_race" "bike_sex"
## [22] "driver_age" "driver_age_group" "driver_alcohol"
## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury"
## [28] "driver_race" "driver_sex" "driver_vehicle_type"
## [31] "crash_alcohol" "crash_date" "crash_day"
## [34] "crash_group" "crash_hour" "crash_location"
## [37] "crash_month" "crash_severity" "crash_time"
## [40] "crash_type" "crash_year" "ambulance_req"
## [43] "hit_run" "light_condition" "road_character"
## [46] "road_class" "road_condition" "road_configuration"
## [49] "road_defects" "road_feature" "road_surface"
## [52] "num_bikes_ai" "num_bikes_bi" "num_bikes_ci"
## [55] "num_bikes_ki" "num_bikes_no" "num_bikes_to"
## [58] "num_bikes_ui" "num_lanes" "num_units"
## [61] "distance_mi_from" "frm_road" "rte_invd_cd"
## [64] "towrd_road" "geo_point" "geo_shape"
and see detailed descriptions with ?ncbikecrash.
In the Environment pane, after loading with data(ncbikecrash), click on the name of the data frame to view it in the data viewer
Use the glimpse function to take a peek
glimpse(ncbikecrash)
## Observations: 7,467
## Variables: 66
## $ object_id <int> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1…
## $ city <chr> "None - Rural Crash", "Henderson", "None - …
## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "N…
## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal…
## $ development <chr> "Farms, Woods, Pastures", "Residential", "F…
## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70…
## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURK…
## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban"…
## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 M…
## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double …
## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear",…
## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "…
## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "…
## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24"…
## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traff…
## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", …
## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane"…
## $ bike_race <chr> "Black", "Black", "White", "Black", "White"…
## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "…
## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50",…
## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-…
## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "M…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-1…
## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No In…
## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Bl…
## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Fema…
## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "…
## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", …
## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14D…
## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturd…
## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist…
## $ crash_hour <int> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22…
## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-In…
## $ crash_month <chr> "December", "November", "November", "Decemb…
## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", …
## $ crash_time <drtn> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13…
## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist…
## $ crash_year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Y…
## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes",…
## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - R…
## $ road_character <chr> "Straight - Level", "Straight - Level", "St…
## $ road_class <chr> "State Secondary Route", "Local Street", "U…
## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Movi…
## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, …
## $ road_defects <chr> "None", NA, "None", "None", "None", "None",…
## $ road_feature <chr> "No Special Feature", "T-Intersection", "No…
## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth…
## $ num_bikes_ai <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_bi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ci <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ki <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_no <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_to <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_bikes_ui <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", …
## $ num_units <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ distance_mi_from <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0"…
## $ frm_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ rte_invd_cd <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ towrd_road <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.315187…
## $ geo_shape <chr> "{\"type\": \"Point\", \"coordinates\": [-7…
dplyr is based on the concepts of functions as verbs that manipulate data frames.
filter: pick rows matching criteria
slice: pick rows using index(es)
select: pick columns by name
pull: grab a column as a vector
arrange: reorder rows
mutate: add new variables
distinct: filter for unique rows
sample_n / sample_frac: randomly sample rows
summarise: reduce variables to values
First argument is always a data frame
Subsequent arguments say what to do with that data frame
Always return a data frame
Don't modify in place
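The rules in the list above can be checked on a toy example. This is a minimal sketch assuming dplyr is installed; the three-row crashes tibble is hypothetical and stands in for ncbikecrash:

```r
library(dplyr)

# Hypothetical toy data, not the real ncbikecrash data
crashes <- tibble(
  county = c("Durham", "Wake", "Durham"),
  crash_hour = c(6L, 20L, 18L)
)

# The data frame is the first argument; the rest says what to do with it
durham <- filter(crashes, county == "Durham")

is.data.frame(durham)  # the result is again a data frame
nrow(crashes)          # the original is untouched unless you reassign it
```

Because nothing is modified in place, keeping a result always requires an explicit assignment with <-.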
The %>% operator in dplyr functions is called the pipe operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
The + operator in ggplot2 functions is used for "layering". This means you create the plot in layers, separated by +.
filter to select a subset of rows
for crashes in Durham County
ncbikecrash %>% filter(county == "Durham")
## # A tibble: 340 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban
## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban
## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban
## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban
## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban
## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## # … with 330 more rows, and 58 more variables: speed_limit <chr>,
## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## # bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## # driver_age_group <chr>, driver_alcohol <chr>,
## # driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## # crash_day <chr>, crash_group <chr>, crash_hour <int>,
## # crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## # crash_time <drtn>, crash_type <chr>, crash_year <int>,
## # ambulance_req <chr>, hit_run <chr>, light_condition <chr>,
## # road_character <chr>, road_class <chr>, road_condition <chr>,
## # road_configuration <chr>, road_defects <chr>, road_feature <chr>,
## # road_surface <chr>, num_bikes_ai <int>, num_bikes_bi <int>,
## # num_bikes_ci <int>, num_bikes_ki <int>, num_bikes_no <int>,
## # num_bikes_to <int>, num_bikes_ui <int>, num_lanes <chr>,
## # num_units <int>, distance_mi_from <chr>, frm_road <chr>,
## # rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>, geo_shape <chr>
filter for many conditions at once
for crashes in Durham County where biker was 0-5 years old
ncbikecrash %>% filter(county == "Durham", bike_age_group == "0-5")
## # A tibble: 4 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban
## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban
## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>
operator | definition
---|---
< | less than
<= | less than or equal to
> | greater than
>= | greater than or equal to
== | exactly equal to
!= | not equal to
x & y | x AND y
x \| y | x OR y
is.na(x) | test if x is NA
!is.na(x) | test if x is not NA
x %in% y | test if x is in y
!(x %in% y) | test if x is not in y
!x | not x
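A few of the operators in the table, tried on a small vector in base R. The bike_age values here are made up for illustration; note in particular how a comparison with NA stays NA:

```r
bike_age <- c(4, 17, NA, 52)  # hypothetical ages, with one missing value

under_18   <- bike_age < 18                       # comparing NA gives NA
is_missing <- is.na(bike_age)                     # the reliable test for NA
adult      <- !is.na(bike_age) & bike_age >= 18   # combine tests with AND
in_set     <- bike_age %in% c(4, 52)              # membership test

under_18
adult
```

filter keeps only rows where the condition is TRUE, so rows where it evaluates to NA are dropped along with the FALSE ones.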
select to keep variables
ncbikecrash %>%
  filter(county == "Durham", bike_age_group == "0-5") %>%
  select(locality, speed_limit)
## # A tibble: 4 x 2
## locality speed_limit
## <chr> <chr>
## 1 Urban (>70% Developed) 30 - 35 MPH
## 2 Urban (>70% Developed) 5 - 15 MPH
## 3 Urban (>70% Developed) 20 - 25 MPH
## 4 Urban (>70% Developed) 20 - 25 MPH
select to exclude variables
ncbikecrash %>%
  select(-object_id)
## # A tibble: 7,467 x 65
## city county region development locality on_road rural_urban speed_limit
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M…
## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M…
## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M…
## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M…
## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA>
## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M…
## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M…
## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M…
## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M…
## # … with 7,457 more rows, and 57 more variables: traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>
select a range of variables
ncbikecrash %>%
  select(city:locality)
## # A tibble: 7,467 x 5
## city county region development locality
## <chr> <chr> <chr> <chr> <chr>
## 1 None - Rural … Wayne Coastal Farms, Woods, Pa… Rural (<30% Develop…
## 2 Henderson Vance Piedmo… Residential Mixed (30% To 70% D…
## 3 None - Rural … Lincoln Piedmo… Farms, Woods, Pa… Rural (<30% Develop…
## 4 Whiteville Columbus Coastal Commercial Urban (>70% Develop…
## 5 Wilmington New Hanov… Coastal Residential Urban (>70% Develop…
## 6 None - Rural … Robeson Coastal Farms, Woods, Pa… Rural (<30% Develop…
## 7 None - Rural … Richmond Piedmo… Residential Mixed (30% To 70% D…
## 8 Raleigh Wake Piedmo… Commercial Urban (>70% Develop…
## 9 Whiteville Columbus Coastal Residential Rural (<30% Develop…
## 10 New Bern Craven Coastal Residential Urban (>70% Develop…
## # … with 7,457 more rows
slice for certain row numbers
First five
ncbikecrash %>%
  slice(1:5)
## # A tibble: 5 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural
## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban
## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural
## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban
## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>
slice for certain row numbers
Last five
last_row <- nrow(ncbikecrash)
ncbikecrash %>%
  slice((last_row - 4):last_row)
## # A tibble: 5 x 66
## object_id city county region development locality on_road rural_urban
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban
## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban
## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban
## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban
## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural
## # … with 58 more variables: speed_limit <chr>, traffic_control <chr>,
## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## # bike_injury <chr>, bike_position <chr>, bike_race <chr>,
## # bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## # driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## # crash_date <chr>, crash_day <chr>, crash_group <chr>,
## # crash_hour <int>, crash_location <chr>, crash_month <chr>,
## # crash_severity <chr>, crash_time <drtn>, crash_type <chr>,
## # crash_year <int>, ambulance_req <chr>, hit_run <chr>,
## # light_condition <chr>, road_character <chr>, road_class <chr>,
## # road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## # road_feature <chr>, road_surface <chr>, num_bikes_ai <int>,
## # num_bikes_bi <int>, num_bikes_ci <int>, num_bikes_ki <int>,
## # num_bikes_no <int>, num_bikes_to <int>, num_bikes_ui <int>,
## # num_lanes <chr>, num_units <int>, distance_mi_from <chr>,
## # frm_road <chr>, rte_invd_cd <int>, towrd_road <chr>, geo_point <chr>,
## # geo_shape <chr>
pull to extract a column as a vector
ncbikecrash %>%
  slice(1:6) %>%
  pull(locality)
## [1] "Rural (<30% Developed)" "Mixed (30% To 70% Developed)"
## [3] "Rural (<30% Developed)" "Urban (>70% Developed)"
## [5] "Urban (>70% Developed)" "Rural (<30% Developed)"
vs.
ncbikecrash %>% slice(1:6) %>% select(locality)
## # A tibble: 6 x 1
## locality
## <chr>
## 1 Rural (<30% Developed)
## 2 Mixed (30% To 70% Developed)
## 3 Rural (<30% Developed)
## 4 Urban (>70% Developed)
## 5 Urban (>70% Developed)
## 6 Rural (<30% Developed)
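The practical difference between the two results is their type, which matters when the next function expects a plain vector. A minimal sketch, assuming dplyr is installed and using a hypothetical three-row tibble:

```r
library(dplyr)

toy <- tibble(locality = c("Rural", "Urban", "Urban"))  # hypothetical data

pulled   <- toy %>% pull(locality)    # a character vector
selected <- toy %>% select(locality)  # still a one-column data frame

is.character(pulled)
is.data.frame(selected)
```

Functions such as mean or unique work directly on the pulled vector, while the selected result can continue down a dplyr pipeline.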
sample_n / sample_frac for a random sample
sample_n: randomly sample 5 observations
ncbikecrash_n5 <- ncbikecrash %>%
  sample_n(5, replace = FALSE)
dim(ncbikecrash_n5)
## [1] 5 66
sample_frac: randomly sample 20% of observations
ncbikecrash_perc20 <- ncbikecrash %>%
  sample_frac(0.2, replace = FALSE)
dim(ncbikecrash_perc20)
## [1] 1493 66
distinct to filter for unique rows
And arrange to order alphabetically
ncbikecrash %>%
  select(county, city) %>%
  distinct() %>%
  arrange(county, city)
## # A tibble: 391 x 2
## county city
## <chr> <chr>
## 1 Alamance Alamance
## 2 Alamance Burlington
## 3 Alamance Elon
## 4 Alamance Elon College
## 5 Alamance Gibsonville
## 6 Alamance Graham
## 7 Alamance Green Level
## 8 Alamance Mebane
## 9 Alamance None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
summarise to reduce variables to values
ncbikecrash %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 1 x 1
## avg_hr
## <dbl>
## 1 14.7
group_by to do calculations on groups
ncbikecrash %>%
  group_by(hit_run) %>%
  summarise(avg_hr = mean(crash_hour))
## # A tibble: 2 x 2
## hit_run avg_hr
## <chr> <dbl>
## 1 No 14.6
## 2 Yes 15.0
count observations in groups
ncbikecrash %>%
  count(driver_alcohol_drugs)
## # A tibble: 6 x 2
## driver_alcohol_drugs n
## <chr> <int>
## 1 Missing 99
## 2 No 695
## 3 Yes-Alcohol, impairment suspected 12
## 4 Yes-Alcohol, no impairment detected 3
## 5 Yes-Drugs, impairment suspected 4
## 6 <NA> 6654
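It may help to see that count is shorthand for a group_by followed by a summarise that counts rows. A minimal sketch, assuming dplyr is installed and using a hypothetical four-row tibble:

```r
library(dplyr)

toy <- tibble(hit_run = c("No", "Yes", "No", "No"))  # hypothetical data

counted <- toy %>%
  count(hit_run)

by_hand <- toy %>%
  group_by(hit_run) %>%
  summarise(n = n())   # n() counts the rows in each group

identical(counted$n, by_hand$n)
```

Both approaches give one row per group; count is simply less typing.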
mutate to add new variables
ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    driver_alcohol_drugs == "Missing" ~ NA_character_, # typed NA, so every branch returns character
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE ~ "No"
  ))
mutate
Most often when you define a new variable with mutate you'll also want to save the resulting data frame, often by writing over the original data frame.
ncbikecrash <- ncbikecrash %>%
  mutate(driver_alcohol_drugs_simplified = case_when(
    str_detect(driver_alcohol_drugs, "Yes") ~ "Yes",
    TRUE ~ driver_alcohol_drugs
  ))
ncbikecrash %>% count(driver_alcohol_drugs, driver_alcohol_drugs_simplified)
## # A tibble: 6 x 3
## driver_alcohol_drugs driver_alcohol_drugs_simplified n
## <chr> <chr> <int>
## 1 Missing Missing 99
## 2 No No 695
## 3 Yes-Alcohol, impairment suspected Yes 12
## 4 Yes-Alcohol, no impairment detected Yes 3
## 5 Yes-Drugs, impairment suspected Yes 4
## 6 <NA> <NA> 6654
ncbikecrash %>% count(driver_alcohol_drugs_simplified)
## # A tibble: 4 x 2
## driver_alcohol_drugs_simplified n
## <chr> <int>
## 1 Missing 99
## 2 No 695
## 3 Yes 19
## 4 <NA> 6654
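The recoding logic can be tried without installing dsbox. This sketch assumes dplyr is installed; the five values are hypothetical stand-ins for the real categories, and base R's grepl is swapped in for str_detect so that stringr is not required:

```r
library(dplyr)

# Hypothetical values that mimic the categories in driver_alcohol_drugs
toy <- tibble(
  driver_alcohol_drugs = c(
    "No",
    "Yes-Alcohol, impairment suspected",
    "Yes-Drugs, impairment suspected",
    "Missing",
    NA
  )
)

toy <- toy %>%
  mutate(simplified = case_when(
    grepl("Yes", driver_alcohol_drugs) ~ "Yes",  # first matching rule wins
    TRUE ~ driver_alcohol_drugs                   # everything else is kept as is
  ))

toy %>% count(simplified)
```

Rules in case_when are evaluated in order, so the catch-all TRUE line only applies to rows that no earlier rule matched.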
Go to the cloud project and open application exercise NC bike crashes
~/appex/ae-ncbikecrashes.Rmd
For each question you work on, set the eval chunk option to TRUE and knit
Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread.
Hadley Wickham
Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
There's more to it than what we'll cover today, but we'll mention more as we introduce more functionality, and do a recap later in the semester
Use - or _ to separate words in file names
# Good
ucb-admit.csv
# Bad
UCB Admit.csv
Use _ to separate words in object names
# Good
acs_employed
# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
End each line of a ggplot2 call with +
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()
# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
Use <- not = for assignment
# Good
x <- 2
# Bad
x = 2
Use ", not ', for quoting text. The only exception is when the text already contains double quotes and no single quotes.
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  # Good
  labs(title = "`Shine bright like a diamond`",
  # Good
       x = "Diamond prices",
  # Bad
       y = 'Frequency')
logical - boolean values TRUE and FALSE
typeof(TRUE)
## [1] "logical"
character - character strings
typeof("hello")
## [1] "character"
typeof('world') # but remember, we use double quotations!
## [1] "character"
double - floating point numerical values (default numerical type)
typeof(1.335)
## [1] "double"
typeof(7)
## [1] "double"
integer - integer numerical values (indicated with an L)
typeof(7L)
## [1] "integer"
typeof(1:3)
## [1] "integer"
Lists are 1d objects that can contain any combination of R objects
mylist <- list("A", 1:4, c(TRUE, FALSE), (1:4) / 2)
mylist
## [[1]]
## [1] "A"
##
## [[2]]
## [1] 1 2 3 4
##
## [[3]]
## [1] TRUE FALSE
##
## [[4]]
## [1] 0.5 1.0 1.5 2.0
str(mylist)
## List of 4
## $ : chr "A"
## $ : int [1:4] 1 2 3 4
## $ : logi [1:2] TRUE FALSE
## $ : num [1:4] 0.5 1 1.5 2
Because of their more complex structure we often want to name the elements of a list (we can also do this with vectors). This can make reading and accessing the list more straightforward.
myotherlist <- list(A = "hello", B = 1:4, "knock knock" = "who's there?")
str(myotherlist)
## List of 3
## $ A : chr "hello"
## $ B : int [1:4] 1 2 3 4
## $ knock knock: chr "who's there?"
names(myotherlist)
## [1] "A" "B" "knock knock"
myotherlist$B
## [1] 1 2 3 4
Vectors can be constructed using the c() function.
c(1, 2, 3)
## [1] 1 2 3
c("Hello", "World!")
## [1] "Hello" "World!"
c(1, c(2, c(3)))
## [1] 1 2 3
R is a dynamically typed language -- it will happily convert between the various types without complaint.
c(1, "Hello")
## [1] "1" "Hello"
c(FALSE, 3L)
## [1] 0 3
c(1.2, 3L)
## [1] 1.2 3.0
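The conversions above follow a fixed ordering: when types are mixed inside c(), everything is converted to the most flexible type present (logical, then integer, then double, then character). A short base R sketch of that ordering, plus one explicit conversion:

```r
type_lgl_int <- typeof(c(TRUE, 2L))    # logical + integer  -> integer
type_int_dbl <- typeof(c(2L, 3.5))     # integer + double   -> double
type_dbl_chr <- typeof(c(3.5, "a"))    # double + character -> character

# Explicit conversion goes the other way: text that is not a number
# becomes NA (R also emits a warning, silenced here).
converted <- suppressWarnings(as.numeric(c("1", "2", "three")))

type_lgl_int
type_dbl_chr
converted
```

Implicit coercion inside c() never warns, which is why a single stray text value can quietly turn a numeric column into character.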
R uses NA to represent missing values in its data structures.
typeof(NA)
## [1] "logical"
NaN - Not a number
Inf - Positive infinity
-Inf - Negative infinity
pi / 0
## [1] Inf
0 / 0
## [1] NaN
1/0 + 1/0
## [1] Inf
1/0 - 1/0
## [1] NaN
NaN / NA
## [1] NaN
NaN * NA
## [1] NaN
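Because these special values propagate through arithmetic, it is usually safer to test for them than to compare against them. A small base R sketch with a made-up vector:

```r
special <- c(1, NA, NaN, Inf, -Inf)  # hypothetical values for illustration

na_flags     <- is.na(special)      # TRUE for NA and also for NaN
nan_flags    <- is.nan(special)     # TRUE only for NaN
finite_flags <- is.finite(special)  # TRUE only for ordinary numbers

# Keep only the ordinary numbers before summarising
clean_mean <- mean(special[finite_flags])

na_flags
finite_flags
```

Note that is.na treats NaN as missing, while is.nan does not treat NA as NaN.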
What is the type of the following vectors? Explain why they have that type.
c(1, NA+1L, "C")
c(1L / 0, NA)
c(1:3, 5)
c(3L, NaN+1L)
c(NA, TRUE)
Go to RStudio Cloud and open the application exercise Cat Lovers.
~/appex/ae-catlovers.Rmd
A survey asked respondents their name and number of cats. The instructions said to enter the number of cats as a numerical value.
cat_lovers <- read_csv("../data/cat-lovers.csv")
cat_lovers %>% summarise(mean = mean(number_of_cats))
## # A tibble: 1 x 1
## mean
## <dbl>
## 1 NA
cat_lovers %>% summarise(mean_cats = mean(number_of_cats, na.rm = TRUE))
## # A tibble: 1 x 1
## mean_cats
## <dbl>
## 1 NA
What is the type of the number_of_cats
variable?
glimpse(cat_lovers)
## Observations: 60
## Variables: 3
## $ name <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <chr> "0", "0", "1", "3", "3", "2", "1", "1", "0", "0",…
## $ handedness <chr> "left", "left", "left", "left", "left", "left", "…
cat_lovers %>%
  mutate(number_of_cats = case_when(
    name == "Ginger Clark" ~ 2,
    name == "Doug Bass" ~ 3,
    TRUE ~ as.numeric(number_of_cats)
  )) %>%
  summarise(mean_cats = mean(number_of_cats))
## # A tibble: 1 x 1
## mean_cats
## <dbl>
## 1 0.817
cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
    ),
    number_of_cats = as.numeric(number_of_cats)
  ) %>%
  summarise(mean_cats = mean(number_of_cats))
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
cat_lovers <- cat_lovers %>%
  mutate(
    number_of_cats = case_when(
      name == "Ginger Clark" ~ "2",
      name == "Doug Bass"    ~ "3",
      TRUE                   ~ number_of_cats
    ),
    number_of_cats = as.numeric(number_of_cats)
  )
If your data does not behave how you expect it to, type coercion upon reading in the data might be the reason.
Go in and investigate your data, apply the fix, save your data, live happily ever after.
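One way to "go in and investigate" is to find which entries refuse to convert to numbers. The sketch below uses base R only and a made-up vector of responses (not the actual cat-lovers data):

```r
# Hypothetical survey responses: two were typed as text rather than digits.
raw <- c("0", "1", "two", "3", "1 (I think)")

# as.numeric() turns anything it cannot parse into NA (with a warning,
# silenced here because the NAs are exactly what we are looking for).
parsed <- suppressWarnings(as.numeric(raw))

# Entries that were present in the raw data but failed to parse.
bad_entries <- raw[is.na(parsed) & !is.na(raw)]
bad_entries  # "two" "1 (I think)"
```

Once the offending entries are known, they can be recoded by hand (as with case_when() above) before converting the whole column.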
x <- c(8,4,7)
x[1]
## [1] 8
x[[1]]
## [1] 8
y <- list(8,4,7)
y[2]
## [[1]]
## [1] 4
y[[2]]
## [1] 4
Note: When using tidyverse code you'll rarely need to refer to elements using square brackets, but it's good to be aware of this syntax, especially since you might encounter it when searching for help online.
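The difference between single and double brackets is easiest to see by checking what comes back. A short base R sketch (the named list is made up for illustration):

```r
# A made-up list with elements of different types.
y <- list(a = 8, b = "four", c = 7)

single <- y["b"]    # single brackets keep the container: a list of length 1
double <- y[["b"]]  # double brackets extract the element itself

class(single)  # "list"
class(double)  # "character"

# Single brackets can select several elements at once; double brackets cannot.
subset_ac <- y[c("a", "c")]
length(subset_ac)  # 2
```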
"set" is in quotation marks because it is not a formal data class
A tidy data "set" can be one of the following types:
tibble
data.frame
We'll often work with tibbles:
The readr package (e.g. the read_csv function) loads data as a tibble by default
tibbles are part of the tidyverse, so they work well with other packages we are using
A data frame is the most commonly used data structure in R. It is just a list of equal-length vectors (usually atomic, but generic vectors work as well). Each vector is treated as a column, and the elements of the vectors as rows.
A tibble is a type of data frame that ... makes your life (i.e. data analysis) easier.
Most often a data frame will be constructed by reading in from a file, but we can also create them from scratch.
df <- tibble(x = 1:3, y = c("a", "b", "c"))
class(df)
## [1] "tbl_df" "tbl" "data.frame"
glimpse(df)
## Observations: 3
## Variables: 2
## $ x <int> 1, 2, 3
## $ y <chr> "a", "b", "c"
attributes(df)
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"
class(df$x)
## [1] "integer"
class(df$y)
## [1] "character"
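The "list of equal-length vectors" description can be checked directly. This sketch uses a base data.frame instead of a tibble so it runs without any packages; the object names are illustrative:

```r
# A base R data frame built from scratch.
df_base <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

is_a_list  <- is.list(df_base)  # TRUE: a data frame is a list underneath
n_columns  <- length(df_base)   # 2: list length is the number of columns
n_rows     <- nrow(df_base)     # 3: the shared length of the vectors
col_classes <- sapply(df_base, class)
col_classes                     # x: "integer", y: "character"

# Vectors of incompatible lengths are rejected.
mismatch <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: differing number of rows"
)
mismatch
```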
How many respondents have below average number of cats?
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats))
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
## [1] 60
Do you believe this number? Why, why not?
mean_cats
## # A tibble: 1 x 1
##   mean_cats
##       <dbl>
## 1     0.817
class(mean_cats)
## [1] "tbl_df" "tbl" "data.frame"
pull() can be your new best friend
But use it sparingly!
mean_cats <- cat_lovers %>%
  summarise(mean_cats = mean(number_of_cats)) %>%
  pull()
cat_lovers %>%
  filter(number_of_cats < mean_cats) %>%
  nrow()
## [1] 33
mean_cats
## [1] 0.8166667
class(mean_cats)
## [1] "numeric"
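The underlying issue is the difference between a one-cell data frame and a plain number. A base R sketch of the same idea, using `[[` as a rough stand-in for pull() and made-up values rather than the real survey data:

```r
# Hypothetical stand-in for the one-row summary produced by summarise().
summary_df <- data.frame(mean_cats = 0.8166667)
class(summary_df)  # "data.frame": still a table, not a number

# Extracting the column with [[ ]] gives a plain numeric vector,
# which is roughly what pull() does in a pipeline.
mean_value <- summary_df[["mean_cats"]]
class(mean_value)  # "numeric"

# Made-up cat counts, compared against the extracted number.
cats <- c(0, 0, 1, 3, 2)
n_below <- sum(cats < mean_value)
n_below            # 2
```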
Factor objects are how R stores data for categorical variables (fixed numbers of discrete values).
(x = factor(c("BS", "MS", "PhD", "MS")))
## [1] BS  MS  PhD MS 
## Levels: BS MS PhD
glimpse(x)
## Factor w/ 3 levels "BS","MS","PhD": 1 2 3 2
typeof(x)
## [1] "integer"
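The "integer" result reflects how a factor is stored: integer codes plus a lookup table of levels. A brief base R sketch, including a common pitfall when numbers end up stored as a factor (the second vector is made up):

```r
x <- factor(c("BS", "MS", "PhD", "MS"))

x_levels <- levels(x)      # the lookup table: "BS" "MS" "PhD"
x_codes  <- as.integer(x)  # the stored codes: 1 2 3 2

# Pitfall: converting a factor of digits straight to a number
# returns the codes, not the values.
f <- factor(c("10", "20", "10"))
codes_not_values <- as.integer(f)                # 1 2 1
actual_values    <- as.numeric(as.character(f))  # 10 20 10
```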
glimpse(cat_lovers)
## Observations: 60
## Variables: 3
## $ name           <chr> "Bernice Warren", "Woodrow Stone", "Willie Bass",…
## $ number_of_cats <dbl> 0, 0, 1, 3, 3, 2, 1, 1, 0, 0, 0, 0, 1, 3, 3, 2, 1…
## $ handedness     <chr> "left", "left", "left", "left", "left", "left", "…
p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
p
cat_lovers <- cat_lovers %>% mutate(handedness = fct_relevel(handedness, "right", "left", "ambidextrous"))
p <- ggplot(cat_lovers, mapping = aes(x = handedness)) +
  geom_bar()
p
... stay for the logo
R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Historically, factors were much easier to work with than character vectors, so many base R functions automatically convert character vectors to factors.
However, factors are still useful when you have true categorical data, and when you want to override the ordering of character vectors to improve display. The goal of the forcats package is to provide a suite of useful tools that solve common problems with factors.
Source: forcats.tidyverse.org
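Overriding the display order does not strictly require forcats. As a rough base R counterpart to the fct_relevel() call, the levels can be set when the factor is created (the vector below is made up for illustration):

```r
# Made-up responses, for illustration only.
handedness <- c("left", "right", "ambidextrous", "right")

# By default, levels are sorted alphabetically.
default_order <- factor(handedness)
levels(default_order)  # "ambidextrous" "left" "right"

# Supplying levels explicitly controls the order used in tables and plots.
custom_order <- factor(handedness, levels = c("right", "left", "ambidextrous"))
levels(custom_order)   # "right" "left" "ambidextrous"

counts <- table(custom_order)
counts                 # counts appear in the custom order
```

The forcats functions cover the same ground with less typing, which is the main reason to prefer them inside a tidyverse pipeline.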
Always best to think of data as part of a tibble
tidyverse as well
Be careful about data types / classes
R makes silly assumptions about your data class
tibbles help, but they might not solve all issues
factor
mutate the variable with the correct class
Check out Alison Hill's "Working with Data in R", saved in the R folder of the RStudio project
This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it.
ModernDive: Statistical Inference via Data Science
This is intended to be a gentle introduction to the practice of analyzing data and answering questions using data the way data scientists, statisticians, data journalists, and other researchers would.
Data Visualization: A Practical Introduction
This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot.
Fundamentals of Data Visualization
The book is meant as a guide to making visualizations that accurately reflect the data, tell a story, and look professional. Even though nearly all of the figures in this book were made with R and ggplot2, this is not an R book. It focuses on the concepts and the figures, not on the code.
Alison Hill - Introduction to Biostatistics for the Basic Sciences - Oregon Health & Science University
Mine Cetinkaya-Rundel - Intro to Data Science - Duke
These links will take you to each relevant section in the presentation
memer package 📦
remotes::install_github("sctyner/memer")
library(memer)
memer is a tidyverse-compatible R package for creating memes
meme_get("OprahGiveaway") %>% meme_text_bottom("EVERYONE GETS A MEME!", size = 30)
meme_list()
##  [1] "AllTheThings"       "AmericanChopper"    "AncientAliens"     
##  [4] "BatmanRobin"        "DistractedBf"       "EvilKermit"        
##  [7] "ExpandingBrain"     "FirstWorldProbs"    "FryNotSure"        
## [10] "HotlineDrake"       "IsThisAPigeon"      "NoneOfMyBusiness"  
## [13] "CheersLeo"          "OneDoesNotSimply"   "DosEquisMan"       
## [16] "OffRamp"            "OprahGiveaway"      "Philosoraptor"     
## [19] "PicardFacePalm"     "PicardWTH"          "Purples"           
## [22] "PutItPatrick"       "Rainbow"            "ShiaJustDoIt"      
## [25] "Spongebob"          "SuccessKid"         "ThatWouldBeGreat"  
## [28] "TheRockDriving"     "ThinkAboutIt"       "TrumpBillSigning"  
## [31] "TwoButtonsAnxiety"  "WhatIfIToldYou"     "CondescendingWonka"
## [34] "YoDawg"             "Y-U-NOguy"
meme_get("TheRockDriving") %>% meme_text_rock("Hey, how do I prep for an IRB audit?", "\nPrint, \n...everything.")
meme_get("SuccessKid") %>% meme_text_bottom("ENTER TEXT HERE")