For this exercise, you will need the Myopia Study dataset. Use Google to find the datasets for Applied Logistic Regression. Upload the Myopia dataset to a Data directory on the server. Read the data into an object named myopia and show the first six rows of data.

myopia <- read.table("./Data/MYOPIA.txt", sep = "\t", header = TRUE)
head(myopia)
  ID STUDYYEAR MYOPIC AGE GENDER  SPHEQ    AL   ACD    LT   VCD SPORTHR
1  1      1992      1   6      1 -0.052 21.89 3.690 3.498 14.70      45
2  2      1995      0   6      1  0.608 22.38 3.702 3.392 15.29       4
3  3      1991      0   6      1  1.179 22.49 3.462 3.514 15.52      14
4  4      1990      1   6      1  0.525 22.20 3.862 3.612 14.73      18
5  5      1995      0   5      0  0.697 23.29 3.676 3.454 16.16      14
6  6      1995      0   6      0  1.744 22.14 3.224 3.556 15.36      10
  READHR COMPHR STUDYHR TVHR DIOPTERHR MOMMY DADMY
1      8      0       0   10        34     1     1
2      0      1       1    7        12     1     1
3      0      2       0   10        14     0     0
4     11      0       0    4        37     0     1
5      0      0       0    4         4     1     0
6      6      2       1   19        44     0     1

Write down the equation for the logistic regression model of MYOPIC on SPHEQ. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, MYOPIC, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between MYOPIC and SPHEQ?

According to the code book, MYOPIC is 1 when the subject has myopia and 0 when the subject shows no sign of myopia in the first five years of follow-up. The logistic regression model is

\[\begin{align} \pi(x) &= E(Y|x)=\frac{\exp(\beta_0 + \beta_1x)}{1 + \exp(\beta_0 + \beta_1x)} \\ \pi(\texttt{SPHEQ}) &= E(\texttt{MYOPIC}|\texttt{SPHEQ})=\frac{\exp(\beta_0 + \beta_1\texttt{SPHEQ})}{1 + \exp(\beta_0 + \beta_1\texttt{SPHEQ})}\nonumber \end{align}\]

The logit transformation of this logistic regression model is

\[\begin{equation} \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1x. \end{equation}\]

Ordinary linear regression cannot be used to describe the relationship between MYOPIC and SPHEQ because the response variable (MYOPIC) is binary: a straight-line model can produce fitted values below 0 or above 1, which are not valid probabilities.
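To see the problem concretely, here is a minimal sketch (assuming the myopia data frame read in above): an ordinary linear model fit to the binary outcome produces fitted values that are not constrained to lie in [0, 1].

# Fit an ordinary linear model to the binary outcome and inspect the
# range of its fitted values, which need not stay within [0, 1]
lin <- lm(MYOPIC ~ SPHEQ, data = myopia)
range(fitted(lin))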

Form a scatterplot of MYOPIC vs. SPHEQ.

library(ggplot2)
ggplot(data = myopia, aes(x = SPHEQ, y = MYOPIC)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.5) + 
  theme_bw() + 
  labs(x = "Spherical Equivalent Refraction", 
       y = "Absence (0) or presence (1) of myopia")

Overlaying the fitted logistic curve on the jittered scatterplot shows how the probability of myopia decreases as SPHEQ increases:

ggplot(data = myopia, aes(x = SPHEQ, y = MYOPIC)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.5) + 
  theme_bw() + 
  labs(x = "Spherical Equivalent Refraction", 
       y = "Absence (0) or presence (1) of myopia") +
  stat_smooth(method = "glm", method.args = list(family = "binomial"))

What is the expected probability of myopia for a patient with a SPHEQ value of 0.0?

mod <- glm(MYOPIC ~ SPHEQ, data = myopia, family = "binomial")
summary(mod)

Call:
glm(formula = MYOPIC ~ SPHEQ, family = "binomial", data = myopia)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6435  -0.4533  -0.2681  -0.1029   3.1602  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.05397    0.20675   0.261    0.794    
SPHEQ       -3.83310    0.41837  -9.162   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 480.08  on 617  degrees of freedom
Residual deviance: 337.34  on 616  degrees of freedom
AIC: 341.34

Number of Fisher Scoring iterations: 6
predict(mod, newdata = data.frame(SPHEQ = 0.0), type = "response")
      1 
0.51349 
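Equivalently, plugging the estimated coefficients from the summary above into the logistic function by hand:

\[\begin{equation} \hat{\pi}(0) = \frac{\exp(0.05397 - 3.83310 \times 0)}{1 + \exp(0.05397 - 3.83310 \times 0)} = \frac{\exp(0.05397)}{1 + \exp(0.05397)} \approx 0.513 \end{equation}\]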

Logistic Regression Example from ISLR

Consider the Default data set, where the response default falls into one of two categories, Yes or No. Rather than modeling this response \(Y\) directly, logistic regression models the probability that \(Y\) belongs to a particular category. The predicted probabilities of default using logistic regression are shown in Figure 1.

library(ISLR)
Default$default_bin <- ifelse(Default$default == "Yes", 1, 0)
ggplot(data = Default, aes(x = balance, y = default_bin)) + 
  geom_jitter(width = 0, height = 0.05, alpha = 0.1) + 
  theme_bw() + 
  labs(x = "Balance", 
       y = "Probability of Default") + 
  geom_smooth(method = "glm", se = FALSE, 
              method.args = list(family = binomial(link = "logit")),
              color = "purple") 

Figure 1: Predicted probabilities of default using logistic regression.

The Logistic Model

How should we model the relationship between \(p(X) = \text{Pr}(Y=1|X)\) and \(X\)? In logistic regression, we use the logistic function,

\[\begin{equation} p(X) = \text{Pr}(Y=1|X) = \frac{\exp(\beta_0 + \beta_1X)}{1 + \exp(\beta_0 + \beta_1X)}. \tag{1} \end{equation}\]

Notice that for low balances we predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of \(X\), we will obtain a sensible prediction. With some algebra,

\[\begin{equation} \frac{p(X)}{1 - p(X)}= \exp(\beta_0 + \beta_1 X). \tag{2} \end{equation}\]

The quantity \(p(X)/(1 - p(X))\) is called the odds, and can take on any value between \(0\) and \(\infty\). Values of the odds close to \(0\) and \(\infty\) indicate very low and very high probabilities of default, respectively.

For example, on average 1 in 5 people with an odds of 1/4 will default, since \(p(X)=0.2\) implies an odds of \(\frac{0.2}{1 - 0.2} = 1/4\). Likewise on average nine out of every ten people with an odds of 9 will default, since \(p(X)=0.9\) implies an odds of \(\frac{0.9}{1 - 0.9}=9\).
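A quick check of these two odds calculations in R:

# Convert probabilities of default to odds: p / (1 - p)
p <- c(0.2, 0.9)
p / (1 - p)   # 0.25 and 9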

By taking the logarithm of both sides of (2), we arrive at

\[\begin{equation} \log\left(\frac{p(X)}{1 - p(X)}\right)= \beta_0 + \beta_1 X. \tag{3} \end{equation}\]

The left-hand side of (3) is called the log-odds or logit. In a logistic regression model, increasing \(X\) by one unit changes the log-odds by \(\beta_1\), or equivalently it multiplies the odds by \(\exp(\beta_1)\).

Because the relationship between \(p(X)\) and \(X\) in (1) is not a straight line, \(\beta_1\) does not correspond to the change in \(p(X)\) associated with a one-unit increase in \(X\). The amount that \(p(X)\) changes due to a one-unit change in \(X\) depends on the current value of \(X\). But regardless of the value of \(X\), if \(\beta_1\) is positive then increasing \(X\) will be associated with increasing \(p(X)\), and if \(\beta_1\) is negative then increasing \(X\) will be associated with decreasing \(p(X)\).
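A small numerical illustration of this point, using arbitrary coefficient values chosen here (not estimates from any model in this document):

# The change in p(X) for a one-unit increase in X depends on where X is
b0 <- 0; b1 <- 1                      # illustrative values, not estimates
p  <- function(x) exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x))
p(1) - p(0)   # sizable change near the middle of the S-curve (~0.23)
p(4) - p(3)   # much smaller change in the flat tail (~0.03)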

logmod <- glm(default_bin ~ balance, data = Default, family = "binomial")
summary(logmod)

Call:
glm(formula = default_bin ~ balance, family = "binomial", data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2697  -0.1465  -0.0589  -0.0221   3.7589  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1596.5  on 9998  degrees of freedom
AIC: 1600.5

Number of Fisher Scoring iterations: 8
Table 1: Summary table for logistic regression of default_bin on balance.

                Estimate  Std. Error    z value  Pr(>|z|)
(Intercept) -10.6513306   0.3611574  -29.49221         0
balance       0.0054989   0.0002204   24.95309         0
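From Table 1 and Equation (2), a one-dollar increase in balance multiplies the odds of default by \(\exp(0.0054989) \approx 1.0055\). In R:

# Multiplicative change in the odds of default per one-dollar
# increase in balance; roughly 1.0055
exp(coef(logmod)["balance"])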

Making Predictions

Using the estimated coefficients from Table 1, predict the probability of default for individuals with balances of $1,000 and $2,000.

\[\begin{equation} \hat{p}(X) = \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1X)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1X)}= \frac{\exp(-10.6513 + 0.0055\times 1000)}{1 + \exp(-10.6513 + 0.0055\times 1000)} = 0.00575. \tag{4} \end{equation}\]
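Similarly, for a balance of $2,000:

\[\begin{equation} \hat{p}(X) = \frac{\exp(-10.6513 + 0.0055\times 2000)}{1 + \exp(-10.6513 + 0.0055\times 2000)} \approx 0.586. \end{equation}\]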
predict(logmod, newdata = data.frame(balance = c(1000, 2000)), type = "response")
          1           2 
0.005752145 0.585769370 
We can also fit a logistic regression of default_bin on the qualitative predictor student.

logmodstu <- glm(default_bin ~ student, data = Default, family = "binomial")
summary(logmodstu)

Call:
glm(formula = default_bin ~ student, family = "binomial", data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2970  -0.2970  -0.2434  -0.2434   2.6585  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
studentYes   0.40489    0.11502    3.52 0.000431 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 2908.7  on 9998  degrees of freedom
AIC: 2912.7

Number of Fisher Scoring iterations: 6
Table 2: Summary table for logistic regression of default_bin on student.

              Estimate  Std. Error     z value   Pr(>|z|)
(Intercept) -3.5041278   0.0707130  -49.554219  0.0000000
studentYes   0.4048871   0.1150188    3.520181  0.0004313

The results in Table 2 suggest students tend to have higher default probabilities than non-students.
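By Equation (2), the estimated coefficient for studentYes implies that the odds of default for students are \(\exp(0.4048871) \approx 1.50\) times the odds for non-students. In R:

# Odds ratio comparing students to non-students; roughly 1.50
exp(coef(logmodstu)["studentYes"])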

predict(logmodstu, newdata = data.frame(student = c("Yes", "No")), type = "response")
         1          2 
0.04313859 0.02919501 
\[\begin{equation} \widehat{\text{Pr}}(\text{default = Yes | student = Yes}) = \frac{\exp(-3.50413 + 0.40489 \times 1)}{1 + \exp(-3.50413 + 0.40489 \times 1)} = 0.0431 \end{equation}\]

\[\begin{equation} \widehat{\text{Pr}}(\text{default = Yes | student = No}) = \frac{\exp(-3.50413 + 0.40489 \times 0)}{1 + \exp(-3.50413 + 0.40489 \times 0)} = 0.0292 \end{equation}\]

Multiple Logistic Regression

mlog <- glm(default ~ balance + income + student, data = Default, 
            family = "binomial")
summary(mlog)

Call:
glm(formula = default ~ balance + income + student, family = "binomial", 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8
Table 3: Summary table for logistic regression of default on balance, income, and student.

                Estimate  Std. Error     z value   Pr(>|z|)
(Intercept) -10.8690452   0.4922555  -22.080088  0.0000000
balance       0.0057365   0.0002319   24.737563  0.0000000
income        0.0000030   0.0000082    0.369815  0.7115203
studentYes   -0.6467758   0.2362525   -2.737646  0.0061881

Note that the coefficient for student in the multiple logistic regression is negative: for fixed values of balance and income, a student is less likely to default than a non-student. This appears to contradict Table 2, where students had the higher overall default rate; the discrepancy arises because student status is correlated with balance, and students tend to carry higher balances.
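To illustrate, we can compare predicted default probabilities for a student and a non-student holding balance and income fixed; the balance of $1,500 and income of $40,000 below are illustrative values, not taken from the output above.

# Predicted probability of default for a student and a non-student
# with the same balance ($1,500) and income ($40,000)
predict(mlog, type = "response",
        newdata = data.frame(balance = 1500, income = 40000,
                             student = c("Yes", "No")))

With the coefficients in Table 3, these work out to roughly 0.058 for the student and 0.105 for the non-student.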