Tidyverse Principles

Author

STT 3860

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.2
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use tidymodels_prefer() to resolve common conflicts.
tidymodels_prefer()
What is the tidyverse?

At a high level, the tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy and a low-level grammar and data structures, so that learning one package makes it easier to learn the next.

What are some packages you have used in previous courses that use tidyverse principles?

Design for Humans

  • Consider sorting the mtcars data set first by gear, then by mpg. We start with a base R solution:
mtcars[order(mtcars$gear, mtcars$mpg), ]
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
mtcars[order(mtcars$gear, mtcars$mpg), ] |> 
  DT::datatable()

Next, consider sorting mtcars using dplyr:

mtcars |> 
  arrange(gear, mpg) |> 
  relocate(gear, .before = mpg) |> 
  DT::datatable()
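
As a quick check, the following sketch compares the two approaches on the sort keys and then reverses the direction of one key with desc(). It assumes only that dplyr is installed; the object names base_sorted and dplyr_sorted are introduced here for illustration.

```r
library(dplyr)

# Base R: order() returns the row indices that sort by gear, then mpg
base_sorted <- mtcars[order(mtcars$gear, mtcars$mpg), ]

# dplyr: arrange() states the same intent with bare column names
dplyr_sorted <- mtcars |>
  arrange(gear, mpg)

# Both approaches put the sort keys in the same order
all.equal(base_sorted$gear, dplyr_sorted$gear)
all.equal(base_sorted$mpg, dplyr_sorted$mpg)

# desc() reverses one key: highest mpg first within each gear
mtcars |>
  arrange(gear, desc(mpg)) |>
  head()
```

The dplyr version names the data set once and reads left to right, which is the "design for humans" point of this section.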

Design for the pipe (|>)

arrange_mtcars <- arrange(mtcars, gear)
small_cars <- slice(arrange_mtcars, 1:10)
small_cars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
# or more compactly
small_cars <- slice(arrange(mtcars, gear), 1:10)
small_cars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4

Notice how we have nested functions inside of functions. Consider using the pipe (|>) to make the code more readable.

mtcars |> 
  arrange(gear) |> 
  slice(1:10) -> small_cars
small_cars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
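
The native pipe is a syntax transformation: x |> f(y) is parsed as f(x, y). A minimal sketch, assuming R 4.1 or later (where |> was introduced), makes this visible with quote(), which captures an expression without evaluating it. The names piped and nested are illustrative only.

```r
# quote() captures the parsed expression without evaluating it,
# so arrange() and slice() do not need to be loaded for this to run
piped  <- quote(mtcars |> arrange(gear) |> slice(1:10))
nested <- quote(slice(arrange(mtcars, gear), 1:10))

# Printing the piped expression shows the nested call the parser built
print(piped)

# The two expressions are the same object after parsing
identical(piped, nested)
```

Because the rewrite happens at parse time, the piped and nested versions run the same code; the pipe only changes how the code reads.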

Tibbles

Tibbles have slightly different rules than basic data frames in R. For example, tibbles work with column names that are not syntactically valid variable names.

# Wants valid names:
data.frame(`variable 1` = 1:2, two = 3:4)
  variable.1 two
1          1   3
2          2   4
# But can be coerced to use them with an extra option
df <- data.frame(`variable 1` = 1:2, two = 3:4, check.names = FALSE)
df
  variable 1 two
1          1   3
2          2   4
# Tibbles just work!
tbbl <- tibble(`variable 1` = 1:2, two = 3:4)
tbbl
# A tibble: 2 × 2
  `variable 1`   two
         <int> <int>
1            1     3
2            2     4
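
To see what the default name repair does, and how to refer to a non-syntactic name afterwards, here is a short sketch. It assumes the tibble package is installed and rebuilds a tibble like the one above so the block stands alone.

```r
library(tibble)

# make.names() shows the repair that data.frame() applies by default
make.names("variable 1")

tbbl <- tibble(`variable 1` = 1:2, two = 3:4)

# The tibble stores the name exactly as typed
names(tbbl)

# Backticks (or a string inside [[ ]]) refer to the non-syntactic name
tbbl$`variable 1`
tbbl[["variable 1"]]
```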

Standard data frames allow partial matching of column names with the $ operator, so that code using only a portion of a column name still works. Tibbles prevent this from happening since it can lead to accidental errors:

df$tw
[1] 3 4
tbbl$tw
Warning: Unknown or uninitialised column: `tw`.
NULL
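
Base R offers a stricter accessor as well. The sketch below, which uses only base R and rebuilds a small data frame so it stands alone, contrasts $ with [[ ]] on a standard data frame:

```r
df <- data.frame(`variable 1` = 1:2, two = 3:4, check.names = FALSE)

# $ on a data frame silently completes the partial name "tw" to "two"
df$tw

# [[ ]] requires an exact match by default, so the same typo returns NULL
df[["tw"]]

# Partial matching with [[ ]] has to be requested explicitly
df[["tw", exact = FALSE]]
```

Using [[ ]] on a data frame gives behavior closer to a tibble, although without the warning that a tibble raises for an unknown column with $.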

Tibbles also prevent one of the most common R errors: dropping dimensions. If a standard data frame is subset down to a single column, the result is converted to a vector. Tibbles never do this:

df[, "two"]
[1] 3 4
tbbl[, "two"]
# A tibble: 2 × 1
    two
  <int>
1     3
2     4
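
Either behavior can be requested explicitly. The following sketch assumes the tibble package is installed; the two-column objects are rebuilt here for illustration so the block runs on its own.

```r
library(tibble)

df   <- data.frame(two = 3:4, three = 5:6)
tbbl <- as_tibble(df)

# Base R: drop = FALSE keeps the single column as a data frame
df[, "two", drop = FALSE]

# Tibble: single-column subsetting stays a tibble
tbbl[, "two"]

# Tibble: [[ ]] extracts the column as a vector on purpose
tbbl[["two"]]
```

The difference is that a tibble returns a vector only when the code asks for one, so the type of the result does not depend on how many columns happen to be selected.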

Practical Example

To demonstrate some syntax, let’s use tidyverse functions to read in data that could be used in modeling. The data set comes from the city of Chicago’s data portal and contains daily ridership data for the city’s elevated train stations. The data set has columns for:

  • The station identifier (numeric)
  • The station name (character)
  • The date (character in mm/dd/yyyy format)
  • The day of the week (character)
  • The number of riders (numeric)

Our tidyverse pipeline will conduct the following tasks, in order:

  1. Use the tidyverse package readr to read the data from the source website and convert them into a tibble. To do this, the read_csv() function can determine the type of data by reading an initial number of rows. Alternatively, if the column names and types are already known, a column specification can be created in R and passed to read_csv().

  2. Filter the data to eliminate a few columns that are not needed (such as the station ID) and change the column stationname to station. The function select() is used for this. When filtering, use either the column names or a dplyr selector function. When selecting names, a new variable name can be declared using the argument format new_name = old_name.

  3. Convert the date field to the R date format using the mdy() function from the lubridate package. We also convert the ridership numbers to thousands. Both of these computations are executed using the dplyr::mutate() function.

  4. Use the maximum number of rides for each station and day combination. This mitigates the issue of a small number of days that have more than one record of ridership numbers at certain stations. We group the ridership data by station and day, and then summarize within each of the 1999 unique combinations with the maximum statistic.

library(lubridate)
url <- "http://bit.ly/raw-train-data-csv"
all_stations <- 
  # Step 1: Read in the data.
  read_csv(url) |>  
  # Step 2: filter columns and rename stationname
  dplyr::select(station = stationname, date, rides) |>  
  # Step 3: Convert the character date field to a date encoding.
  # Also, put the data in units of 1K rides
  mutate(date = mdy(date), rides = rides / 1000) |>  
  # Extra step (not in the list above): keep a single date, 2001-01-03.
  filter(date == as.Date("2001-01-03")) |> 
  # Step 4: Summarize the multiple records using the maximum.
  group_by(date, station) |> 
  summarize(rides = max(rides), .groups = "drop")
Rows: 1245839 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): stationname, date, daytype
dbl (1): station_id
num (1): rides

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
all_stations |> 
  DT::datatable()
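
Step 1 mentions that a column specification can be passed to read_csv() instead of letting it guess. The sketch below illustrates that option on a tiny inline data set so it runs without network access. The rows, the station names, and the object names toy_csv, toy_spec, and toy_stations are all made up for illustration, and col_double() for rides is an assumption that suits this toy data rather than the real file.

```r
library(readr)
library(dplyr)
library(lubridate)

# Made-up rows for illustration only; Station A has two records for the same day
toy_csv <- "station_id,stationname,date,daytype,rides
1,Station A,01/03/2001,W,1500
1,Station A,01/03/2001,W,1800
2,Station B,01/03/2001,W,3000
"

# Explicit column specification, so read_csv() does not have to guess types
toy_spec <- cols(
  station_id  = col_double(),
  stationname = col_character(),
  date        = col_character(),
  daytype     = col_character(),
  rides       = col_double()
)

toy_stations <-
  # I() marks the string as literal data rather than a file path
  read_csv(I(toy_csv), col_types = toy_spec) |>
  # Keep three columns and rename stationname to station
  select(station = stationname, date, rides) |>
  # Parse the date and express rides in thousands
  mutate(date = mdy(date), rides = rides / 1000) |>
  # Collapse duplicate station/day records to their maximum
  group_by(date, station) |>
  summarize(rides = max(rides), .groups = "drop")

toy_stations
```

Supplying col_types also suppresses the column specification message that read_csv() prints when it guesses the types.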

Practice

Read the counties.rds file stored in the Data directory into R and store the results in counties. Examine the structure of counties.

counties <- readRDS("../Data/counties.rds")
str(counties)
spc_tbl_ [3,138 × 40] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ census_id         : chr [1:3138] "1001" "1003" "1005" "1007" ...
 $ state             : chr [1:3138] "Alabama" "Alabama" "Alabama" "Alabama" ...
 $ county            : chr [1:3138] "Autauga" "Baldwin" "Barbour" "Bibb" ...
 $ region            : chr [1:3138] "South" "South" "South" "South" ...
 $ metro             : chr [1:3138] "Metro" "Metro" "Nonmetro" "Metro" ...
 $ population        : num [1:3138] 55221 195121 26932 22604 57710 ...
 $ men               : num [1:3138] 26745 95314 14497 12073 28512 ...
 $ women             : num [1:3138] 28476 99807 12435 10531 29198 ...
 $ hispanic          : num [1:3138] 2.6 4.5 4.6 2.2 8.6 4.4 1.2 3.5 0.4 1.5 ...
 $ white             : num [1:3138] 75.8 83.1 46.2 74.5 87.9 22.2 53.3 73 57.3 91.7 ...
 $ black             : num [1:3138] 18.5 9.5 46.7 21.4 1.5 70.7 43.8 20.3 40.3 4.8 ...
 $ native            : num [1:3138] 0.4 0.6 0.2 0.4 0.3 1.2 0.1 0.2 0.2 0.6 ...
 $ asian             : num [1:3138] 1 0.7 0.4 0.1 0.1 0.2 0.4 0.9 0.8 0.3 ...
 $ pacific           : num [1:3138] 0 0 0 0 0 0 0 0 0 0 ...
 $ citizens          : num [1:3138] 40725 147695 20714 17495 42345 ...
 $ income            : num [1:3138] 51281 50254 32964 38678 45813 ...
 $ income_err        : num [1:3138] 2391 1263 2973 3995 3141 ...
 $ income_per_cap    : num [1:3138] 24974 27317 16824 18431 20532 ...
 $ income_per_cap_err: num [1:3138] 1080 711 798 1618 708 ...
 $ poverty           : num [1:3138] 12.9 13.4 26.7 16.8 16.7 24.6 25.4 20.5 21.6 19.2 ...
 $ child_poverty     : num [1:3138] 18.6 19.2 45.3 27.9 27.2 38.4 39.2 31.6 37.2 30.1 ...
 $ professional      : num [1:3138] 33.2 33.1 26.8 21.5 28.5 18.8 27.5 27.3 23.3 29.3 ...
 $ service           : num [1:3138] 17 17.7 16.1 17.9 14.1 15 16.6 17.7 14.5 16 ...
 $ office            : num [1:3138] 24.2 27.1 23.1 17.8 23.9 19.7 21.9 24.2 26.3 19.5 ...
 $ construction      : num [1:3138] 8.6 10.8 10.8 19 13.5 20.1 10.3 10.5 11.5 13.7 ...
 $ production        : num [1:3138] 17.1 11.2 23.1 23.7 19.9 26.4 23.7 20.4 24.4 21.5 ...
 $ drive             : num [1:3138] 87.5 84.7 83.8 83.2 84.9 74.9 84.5 85.3 85.1 83.9 ...
 $ carpool           : num [1:3138] 8.8 8.8 10.9 13.5 11.2 14.9 12.4 9.4 11.9 12.1 ...
 $ transit           : num [1:3138] 0.1 0.1 0.4 0.5 0.4 0.7 0 0.2 0.2 0.2 ...
 $ walk              : num [1:3138] 0.5 1 1.8 0.6 0.9 5 0.8 1.2 0.3 0.6 ...
 $ other_transp      : num [1:3138] 1.3 1.4 1.5 1.5 0.4 1.7 0.6 1.2 0.4 0.7 ...
 $ work_at_home      : num [1:3138] 1.8 3.9 1.6 0.7 2.3 2.8 1.7 2.7 2.1 2.5 ...
 $ mean_commute      : num [1:3138] 26.5 26.4 24.1 28.8 34.9 27.5 24.6 24.1 25.1 27.4 ...
 $ employed          : num [1:3138] 23986 85953 8597 8294 22189 ...
 $ private_work      : num [1:3138] 73.6 81.5 71.8 76.8 82 79.5 77.4 74.1 85.1 73.1 ...
 $ public_work       : num [1:3138] 20.9 12.3 20.8 16.1 13.5 15.1 16.2 20.8 12.1 18.5 ...
 $ self_employed     : num [1:3138] 5.5 5.8 7.3 6.7 4.2 5.4 6.2 5 2.8 7.9 ...
 $ family_work       : num [1:3138] 0 0.4 0.1 0.4 0.4 0 0.2 0.1 0 0.5 ...
 $ unemployment      : num [1:3138] 7.6 7.5 17.6 8.3 7.7 18 10.9 12.3 8.9 7.9 ...
 $ land_area         : num [1:3138] 594 1590 885 623 645 ...
  • Select the variables state, county, population, private_work, public_work, and self_employed and store the result in counties_selected.
counties_selected <- counties |> 
  select(state, 
         county, 
         population, 
         private_work, 
         public_work, 
         self_employed)
  • Add a verb to sort the observations by the public_work variable in descending order.
counties_selected |> 
  arrange(desc(public_work))
# A tibble: 3,138 × 6
   state        county         population private_work public_work self_employed
   <chr>        <chr>               <dbl>        <dbl>       <dbl>         <dbl>
 1 Hawaii       Kalawao                85         25          64.1          10.9
 2 Alaska       Yukon-Koyukuk…       5644         33.3        61.7           5.1
 3 Wisconsin    Menominee            4451         36.8        59.1           3.7
 4 North Dakota Sioux                4380         32.9        56.8          10.2
 5 South Dakota Todd                 9942         34.4        55             9.8
 6 Alaska       Lake and Peni…       1474         42.2        51.6           6.1
 7 California   Lassen              32645         42.6        50.5           6.8
 8 South Dakota Buffalo              2038         48.4        49.5           1.8
 9 South Dakota Dewey                5579         34.9        49.2          14.7
10 Texas        Kenedy                565         51.9        48.1           0  
# ℹ 3,128 more rows