At a high leve, the tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy and a low-level grammar and data structures, so that learning one package makes it easier to learn the next.
What are some packages you have used in previous courses that use tidyverse principles?
ggplot2
dplyr
readr
infer
Design for Humans
Consider sorting the mtcars first by gear, then by mpg. First we consider a base R solution:
Tibbles have slighlty different rules than basic data frames in R. For example, tibbles work with column names that are not syntactically valid variable names.
# Wants valid names:data.frame(`variable 1`=1:2, two =3:4)
variable.1 two
1 1 3
2 2 4
# But can be coerced to use them with an extra optiondf <-data.frame(`variable 1`=1:2, two =3:4, check.names =FALSE)df
variable 1 two
1 1 3
2 2 4
# Tibbles just work!tbbl <-tibble(`variable 1`=1:2, two =3:4)tbbl
# A tibble: 2 × 2
`variable 1` two
<int> <int>
1 1 3
2 2 4
Standard data frames enable partial matching of arguments so that code using only a portion of the column names still works. Tibbles prevent this from happening since it can lead to accidental errors:
df$tw
[1] 3 4
tbbl$tw
Warning: Unknown or uninitialised column: `tw`.
NULL
Tibbles also prevent one of the most common R errors: dropping dimensions. If a standard data frame subsets the columns down to a single column, the object is converted to a vector. Tibbles never do this:
df[, "two"]
[1] 3 4
tbbl[, "two"]
# A tibble: 2 × 1
two
<int>
1 3
2 4
Practical Example
To demonstrate some syntax, let’s use tidyverse functions to read in data that could be used in modeling. The data set comes from the city of Chicago’s data portal and contains daily ridership data for the city’s elevated train stations. The data set has columns for:
The station identifier (numeric)
The station name (character)
The date (character in mm/dd/yyyy format)
The day of the week (character)
The number of riders (numeric)
Our tidyverse pipeline will conduct the following tasks, in order:
Use the tidyverse package readr to read the data from the source website and convert them into a tibble. To do this, the read_csv() function can determine the type of data by reading an initial number of rows. Alternatively, if the column names and types are already known, a column specification can be created in R and passed to read_csv().
Filter the data to eliminate a few columns that are not needed (such as the station ID) and change the column stationname to station. The function select() is used for this. When filtering, use either the column names or a dplyr selector function. When selecting names, a new variable name can be declared using the argument format new_name = old_name.
Convert the date field to the R date format using the mdy() function from the lubridate package. We also convert the ridership numbers to thousands. Both of these computations are executed using the dplyr::mutate() function.
Use the maximum number of rides for each station and day combination. This mitigates the issue of a small number of days that have more than one record of ridership numbers at certain stations. We group the ridership data by station and day, and then summarize within each of the 1999 unique combinations with the maximum statistic.
library(lubridate)url <-"http://bit.ly/raw-train-data-csv"all_stations <-# Step 1: Read in the data.read_csv(url) |># Step 2: filter columns and rename stationname dplyr::select(station = stationname, date, rides) |># Step 3: Convert the character date field to a date encoding.# Also, put the data in units of 1K ridesmutate(date =mdy(date), rides = rides /1000) |># Step 4: Summarize the multiple records using the maximum.filter(date =="2001-01-03") |>group_by(date, station) |>summarize(rides =max(rides), .groups ="drop")
Rows: 1245839 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): stationname, date, daytype
dbl (1): station_id
num (1): rides
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
all_stations |> DT::datatable()
Practice
Read the counties.rds file stored in the Data directory into R and store the results in counties. Examine the structure of counties.