A Review of R Modeling Fundamentals

Author

STT 3860

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ rsample      1.2.1
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.2
✔ recipes      1.1.0     
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
tidymodels_prefer()

An Example

library(modeldata) # Used for the data crickets
names(crickets)
[1] "species" "temp"    "rate"   
  • Plot the chirp rate on the y-axis and the temperature on the x-axis, colored by species.
ggplot(data = crickets, 
       aes(x = temp, 
           y = rate, 
           color = species)) + 
  geom_point()

ggplot(data = crickets, 
       aes(x = temp, 
           y = rate, 
           color = species,
           pch = species)) + 
  geom_point()

ggplot(data = crickets, 
       aes(x = temp, 
           y = rate, 
           color = species,
           pch = species,
           lty = species)) + 
  geom_point(size = 2) + 
  geom_smooth(method = "lm", se = FALSE, alpha = 0.5) + 
  theme_bw()

ggplot(data = crickets, 
       aes(x = temp, 
           y = rate, 
           color = species,
           pch = species,
           lty = species)) + 
  geom_point(size = 2) + 
  geom_smooth(method = "lm", se = FALSE, alpha = 0.5) + 
  scale_color_brewer(palette = "Paired") + 
  theme_bw() + 
  labs(x = "Temperature (C)", y = "Chirp Rate (per minute)") + 
  theme(legend.position = "top")

  • The variable species is stored in R as a factor. Most models need to encode the species data in a numeric format. The model formula rate ~ temp + species, will automatically encode the species column by adding a new numeric column that has a 0 when the species is O. exclamationis and a 1 when the species is O. niveus.
mod_lm <- lm(rate ~ temp + species, data = crickets)
model.matrix(mod_lm) |> 
  bind_cols(SPECIES = crickets$species)
# A tibble: 31 × 4
   `(Intercept)`  temp `speciesO. niveus` SPECIES         
           <dbl> <dbl>              <dbl> <fct>           
 1             1  20.8                  0 O. exclamationis
 2             1  20.8                  0 O. exclamationis
 3             1  24                    0 O. exclamationis
 4             1  24                    0 O. exclamationis
 5             1  24                    0 O. exclamationis
 6             1  24                    0 O. exclamationis
 7             1  26.2                  0 O. exclamationis
 8             1  26.2                  0 O. exclamationis
 9             1  26.2                  0 O. exclamationis
10             1  26.2                  0 O. exclamationis
# ℹ 21 more rows
Tip

There are four levels to the variable heating. The model formula automatically adds three additional binary columns that are binary indicators for three of the heating types. The reference level of the factor (i.e., the first level) is omitted.

library(PASWR2) # For VIT2005
# heating has four levels
mod_example <- lm(totalprice ~ area + heating, data = VIT2005)
model.matrix(mod_example) |> 
  bind_cols(HEATING = VIT2005$heating)
# A tibble: 218 × 6
   `(Intercept)`  area heating3A heating3B heating4A HEATING
           <dbl> <dbl>     <dbl>     <dbl>     <dbl> <fct>  
 1             1  75.3         1         0         0 3A     
 2             1 101.          0         0         1 4A     
 3             1  88.9         1         0         0 3A     
 4             1  62.6         0         0         0 1A     
 5             1 146.          0         0         1 4A     
 6             1  77.2         1         0         0 3A     
 7             1  77.0         0         0         1 4A     
 8             1  62.9         1         0         0 3A     
 9             1  58.5         1         0         0 3A     
10             1  59.2         1         0         0 3A     
# ℹ 208 more rows
  • The model formula rate ~ temp + species creates a model with different y-intercepts for each species and the same slope.
mod_lm

Call:
lm(formula = rate ~ temp + species, data = crickets)

Coefficients:
     (Intercept)              temp  speciesO. niveus  
          -7.211             3.603           -10.065  
# The following may be easier to understand
mod_lm2 <- lm(rate ~ temp + species + 0, data = crickets)
mod_lm2

Call:
lm(formula = rate ~ temp + species + 0, data = crickets)

Coefficients:
                   temp  speciesO. exclamationis         speciesO. niveus  
                  3.603                   -7.211                  -17.276  
  • To get different slopes for each species, consider adding an interaction with one of the following:
mod_int1 <- lm(rate ~ temp + species + temp:species, 
               data = crickets)
mod_int1

Call:
lm(formula = rate ~ temp + species + temp:species, data = crickets)

Coefficients:
          (Intercept)                   temp       speciesO. niveus  
              -11.041                  3.751                 -4.348  
temp:speciesO. niveus  
               -0.234  
mod_int2 <- lm(rate ~ temp*species, 
               data = crickets)
mod_int2

Call:
lm(formula = rate ~ temp * species, data = crickets)

Coefficients:
          (Intercept)                   temp       speciesO. niveus  
              -11.041                  3.751                 -4.348  
temp:speciesO. niveus  
               -0.234  
mod_int3 <- lm(rate ~ (temp + species)^2, 
               data = crickets)
mod_int3

Call:
lm(formula = rate ~ (temp + species)^2, data = crickets)

Coefficients:
          (Intercept)                   temp       speciesO. niveus  
              -11.041                  3.751                 -4.348  
temp:speciesO. niveus  
               -0.234  

Write out the two separate lines from mod_int1.

\[\widehat{\text{rate}} = (-11.041) + (3.751)\cdot \text{temp}\]

\[\widehat{\text{rate}} = (-11.041 -4.348) + (3.751 - 0.234)\cdot \text{temp}\]

  • The output/models may be easier to understand with:
mod_int4 <- lm(rate ~ (temp + species + temp:species) + 0, 
               data = crickets)
mod_int4

Call:
lm(formula = rate ~ (temp + species + temp:species) + 0, data = crickets)

Coefficients:
                   temp  speciesO. exclamationis         speciesO. niveus  
                  3.751                  -11.041                  -15.389  
  temp:speciesO. niveus  
                 -0.234  

Is the interaction term warranted?

mod_reduced <- lm(rate ~ temp + species,
                  data = crickets)
mod_full <- lm(rate ~ temp + species + temp:species,
               data = crickets)
anova(mod_reduced, mod_full) |> 
  tidy() -> ans
ans |> 
  knitr::kable()
term df.residual rss df sumsq statistic p.value
rate ~ temp + species 28 89.34987 NA NA NA NA
rate ~ temp + species + temp:species 27 85.07409 1 4.275779 1.357006 0.2542464

The p-value of 0.25 suggests \(\beta_3 = 0\). The interaction term is not warranted and should be dropped from the model.

The model is a parallel slopes model and is shown in Figure 1.

ggplot(data = crickets, 
       aes(x = temp, 
           y = rate, 
           color = species,
           pch = species,
           lty = species)) + 
  geom_point(size = 2) + 
  moderndive::geom_parallel_slopes(se = FALSE, alpha = 0.5) + 
  scale_color_brewer(palette = "Paired") + 
  theme_bw() + 
  labs(x = "Temperature (C)", y = "Chirp Rate (per minute)") + 
  theme(legend.position = "top")

Figure 1: Parallel slopes model of chirp rate versus temperature for different species

Or using the coefficients from mod_lm:

mod_lm

Call:
lm(formula = rate ~ temp + species, data = crickets)

Coefficients:
     (Intercept)              temp  speciesO. niveus  
          -7.211             3.603           -10.065  
ggplot(data = crickets, 
       aes(x = temp, 
           y = rate, 
           color = species,
           pch = species,
           lty = species)) + 
  geom_point(size = 2) + 
  geom_abline(intercept = -7.211, slope = 3.603, lty = "dotted", color = "red") + 
  geom_abline(intercept = -17.276, slope = 3.603, lty = "dashed", color = "lightblue") + 
  theme_bw() + 
  xlim(0, 30) + 
  ylim(-20, 100)

mtcars |> select(-mpg)
                    cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4             6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag         6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710            4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive        6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout     8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant               6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360            8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D             4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230              4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280              6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C             6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE            8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL            8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC           8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood    8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial     8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128              4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic           4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla        4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona         4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger      8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin           8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28            8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird      8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9             4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2         4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa          4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L        8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino          6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora         8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E            4 121.0 109 4.11 2.780 18.60  1  1    4    2
purrr::map(mtcars |> select(-mpg), cor.test, y = mtcars$mpg)
$cyl

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = -8.9197, df = 30, p-value = 6.113e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9257694 -0.7163171
sample estimates:
      cor 
-0.852162 


$disp

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = -8.7472, df = 30, p-value = 9.38e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9233594 -0.7081376
sample estimates:
       cor 
-0.8475514 


$hp

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = -6.7424, df = 30, p-value = 1.788e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.8852686 -0.5860994
sample estimates:
       cor 
-0.7761684 


$drat

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = 5.096, df = 30, p-value = 1.776e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4360484 0.8322010
sample estimates:
      cor 
0.6811719 


$wt

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = -9.559, df = 30, p-value = 1.294e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9338264 -0.7440872
sample estimates:
       cor 
-0.8676594 


$qsec

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = 2.5252, df = 30, p-value = 0.01708
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.08195487 0.66961864
sample estimates:
     cor 
0.418684 


$vs

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = 4.8644, df = 30, p-value = 3.416e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4103630 0.8223262
sample estimates:
      cor 
0.6640389 


$am

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = 4.1061, df = 30, p-value = 0.000285
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3175583 0.7844520
sample estimates:
      cor 
0.5998324 


$gear

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = 2.9992, df = 30, p-value = 0.005401
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1580618 0.7100628
sample estimates:
      cor 
0.4802848 


$carb

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = -3.6157, df = 30, p-value = 0.001084
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7546480 -0.2503183
sample estimates:
       cor 
-0.5509251 
corr_res <- purrr::map(mtcars |> select(-mpg), cor.test, y = mtcars$mpg)
corr_res[[10]]

    Pearson's product-moment correlation

data:  .x[[i]] and mtcars$mpg
t = -3.6157, df = 30, p-value = 0.001084
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7546480 -0.2503183
sample estimates:
       cor 
-0.5509251 
corr_res[[10]] |> 
broom::tidy()
# A tibble: 1 × 8
  estimate statistic p.value parameter conf.low conf.high method     alternative
     <dbl>     <dbl>   <dbl>     <int>    <dbl>     <dbl> <chr>      <chr>      
1   -0.551     -3.62 0.00108        30   -0.755    -0.250 Pearson's… two.sided  
corr_res |> 
  map_dfr(tidy, .id = "predictor") -> output 
output |> 
  knitr::kable()
predictor estimate statistic p.value parameter conf.low conf.high method alternative
cyl -0.8521620 -8.919699 0.0000000 30 -0.9257694 -0.7163171 Pearson’s product-moment correlation two.sided
disp -0.8475514 -8.747151 0.0000000 30 -0.9233594 -0.7081376 Pearson’s product-moment correlation two.sided
hp -0.7761684 -6.742388 0.0000002 30 -0.8852686 -0.5860994 Pearson’s product-moment correlation two.sided
drat 0.6811719 5.096042 0.0000178 30 0.4360484 0.8322010 Pearson’s product-moment correlation two.sided
wt -0.8676594 -9.559044 0.0000000 30 -0.9338264 -0.7440872 Pearson’s product-moment correlation two.sided
qsec 0.4186840 2.525213 0.0170820 30 0.0819549 0.6696186 Pearson’s product-moment correlation two.sided
vs 0.6640389 4.864385 0.0000342 30 0.4103630 0.8223262 Pearson’s product-moment correlation two.sided
am 0.5998324 4.106127 0.0002850 30 0.3175583 0.7844520 Pearson’s product-moment correlation two.sided
gear 0.4802848 2.999191 0.0054009 30 0.1580618 0.7100628 Pearson’s product-moment correlation two.sided
carb -0.5509251 -3.615750 0.0010844 30 -0.7546480 -0.2503183 Pearson’s product-moment correlation two.sided

Create a graph with output.

ggplot(data = output, aes(x = fct_reorder(predictor, estimate))) + 
  geom_point(aes(y = estimate)) + 
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) + 
  labs(x = "", y = "Correlation with mpg") + 
  theme_bw()

Figure 2: Correlations (and 95% confidence intervals) between predictors and the outcome in the mtcars data set.

Combing Base R Models and Tidyverse

crickets |> 
  group_nest(species) |> 
  knitr::kable()
species data
O. exclamationis 20.8, 20.8, 24.0, 24.0, 24.0, 24.0, 26.2, 26.2, 26.2, 26.2, 28.4, 29.0, 30.4, 30.4, 67.9, 65.1, 77.3, 78.7, 79.4, 80.4, 85.8, 86.6, 87.5, 89.1, 98.6, 100.8, 99.3, 101.7
O. niveus 17.2, 18.3, 18.3, 18.3, 18.9, 18.9, 20.4, 21.0, 21.0, 22.1, 23.5, 24.2, 25.9, 26.5, 26.5, 26.5, 28.6, 44.3, 47.2, 47.6, 49.6, 50.3, 51.8, 60.0, 58.5, 58.9, 60.7, 69.8, 70.9, 76.2, 76.1, 77.0, 77.7, 84.7
crickets |> 
  group_nest(species) |> 
  mutate(model = map(data, ~lm(rate ~ temp, data = .x)))
# A tibble: 2 × 3
  species                        data model 
  <fct>            <list<tibble[,2]>> <list>
1 O. exclamationis           [14 × 2] <lm>  
2 O. niveus                  [17 × 2] <lm>  
#
crickets |> 
  group_nest(species) |> 
  mutate(model = map(data, ~lm(rate ~ temp, data = .x))) |> 
  mutate(coef = map(model, tidy))
# A tibble: 2 × 4
  species                        data model  coef            
  <fct>            <list<tibble[,2]>> <list> <list>          
1 O. exclamationis           [14 × 2] <lm>   <tibble [2 × 5]>
2 O. niveus                  [17 × 2] <lm>   <tibble [2 × 5]>
#
crickets |> 
  group_nest(species) |> 
  mutate(model = map(data, ~lm(rate ~ temp, data = .x))) |> 
  mutate(coef = map(model, tidy)) |> 
  select(species, coef) |> 
  unnest(cols = c(coef))
# A tibble: 4 × 6
  species          term        estimate std.error statistic  p.value
  <fct>            <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 O. exclamationis (Intercept)   -11.0      4.77      -2.32 3.90e- 2
2 O. exclamationis temp            3.75     0.184     20.4  1.10e-10
3 O. niveus        (Intercept)   -15.4      2.35      -6.56 9.07e- 6
4 O. niveus        temp            3.52     0.105     33.6  1.57e-15