For this exercise, you will need the Myopia Study dataset. Use Google to find the datasets that accompany Applied Logistic Regression. Upload the Myopia dataset to a Data directory on the server. Read the data into an object named myopia and show the first six rows of the data.
myopia <- read.table("./Data/MYOPIA.txt", sep = "\t", header = TRUE)
head(myopia)
ID STUDYYEAR MYOPIC AGE GENDER SPHEQ AL ACD LT VCD SPORTHR
1 1 1992 1 6 1 -0.052 21.89 3.690 3.498 14.70 45
2 2 1995 0 6 1 0.608 22.38 3.702 3.392 15.29 4
3 3 1991 0 6 1 1.179 22.49 3.462 3.514 15.52 14
4 4 1990 1 6 1 0.525 22.20 3.862 3.612 14.73 18
5 5 1995 0 5 0 0.697 23.29 3.676 3.454 16.16 14
6 6 1995 0 6 0 1.744 22.14 3.224 3.556 15.36 10
READHR COMPHR STUDYHR TVHR DIOPTERHR MOMMY DADMY
1 8 0 0 10 34 1 1
2 0 1 1 7 12 1 1
3 0 2 0 10 14 0 0
4 11 0 0 4 37 0 1
5 0 0 0 4 4 1 0
6 6 2 1 19 44 0 1
Write down the equation for the logistic regression model of MYOPIC on SPHEQ. Write down the equation for the logit transformation of this logistic regression model. What characteristic of the outcome variable, MYOPIC, leads us to consider the logistic regression model as opposed to the usual linear regression model to describe the relationship between MYOPIC and SPHEQ?
According to the code book, MYOPIC is 1 when the subject has myopia and 0 when the subject shows no sign of myopia in the first five years of follow-up. The logistic regression model is
\[\begin{equation} \pi(x) = \frac{\exp(\beta_0 + \beta_1x)}{1 + \exp(\beta_0 + \beta_1x)}. \end{equation}\]
The logit transformation of this logistic regression model is
\[\begin{equation} \ln\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1x. \end{equation}\]
The usual linear regression model cannot be used to describe the relationship between MYOPIC and SPHEQ since the response variable (MYOPIC) is binary.
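One quick way to see why the linear model is a poor choice for a binary response is to fit it anyway and inspect the fitted values, which are not constrained to the interval \([0, 1]\). This sketch assumes the `myopia` data frame read in above:

```r
# Sketch: a linear probability model on the binary outcome MYOPIC can produce
# fitted "probabilities" outside [0, 1], which is one reason to prefer the
# logistic model. Assumes the myopia data frame from above.
linmod <- lm(MYOPIC ~ SPHEQ, data = myopia)
range(fitted(linmod))   # may extend below 0 or above 1
```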
Form a scatterplot of MYOPIC vs. SPHEQ.
library(ggplot2)
ggplot(data = myopia, aes(x = SPHEQ, y = MYOPIC)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5) +
theme_bw() +
labs(x = "Spherical Equivalent Refraction",
y = "Absence (0) or presence (1) of myopia")
ggplot(data = myopia, aes(x = SPHEQ, y = MYOPIC)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5) +
theme_bw() +
labs(x = "Spherical Equivalent Refraction",
y = "Absence (0) or presence (1) of myopia") +
stat_smooth(method = "glm", method.args = list(family = "binomial"))
What is the expected probability of myopia for a patient with a SPHEQ value of 0.0?
mod <- glm(MYOPIC ~ SPHEQ, data = myopia, family = "binomial")
summary(mod)
Call:
glm(formula = MYOPIC ~ SPHEQ, family = "binomial", data = myopia)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6435 -0.4533 -0.2681 -0.1029 3.1602
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05397 0.20675 0.261 0.794
SPHEQ -3.83310 0.41837 -9.162 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 480.08 on 617 degrees of freedom
Residual deviance: 337.34 on 616 degrees of freedom
AIC: 341.34
Number of Fisher Scoring iterations: 6
predict(mod, newdata = data.frame(SPHEQ = 0.0), type = "response")
1
0.51349
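Since the linear predictor at SPHEQ = 0 reduces to the intercept, this prediction can be checked by hand with the inverse logit function `plogis()`, using the intercept estimate printed above:

```r
# Hand check: at SPHEQ = 0 the linear predictor is just the intercept,
# so the fitted probability is the inverse logit of 0.05397.
plogis(0.05397)   # exp(0.05397) / (1 + exp(0.05397)), approximately 0.5135
```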
Consider the Default data set, where the response default falls into one of two categories, Yes or No. Rather than modeling this response \(Y\) directly, logistic regression models the probability that \(Y\) belongs to a particular category. The predicted probabilities of default using logistic regression are shown in Figure 1.
library(ISLR)
Default$default_bin <- ifelse(Default$default == "Yes", 1, 0)
ggplot(data = Default, aes(x = balance, y = default_bin)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.1) +
theme_bw() +
labs(x = "Balance",
y = "Probability of Default") +
geom_smooth(method = "glm", se = FALSE,
method.args = list(family = binomial(link = "logit")),
color = "purple")
Figure 1: Predicted probabilities of default using logistic regression.
How should we model the relationship between \(p(X) = \text{Pr}(Y=1|X)\) and \(X\)? In logistic regression, we use the logistic function,
\[\begin{equation} p(X) = \text{Pr}(Y=1|X) = \frac{\exp(\beta_0 + \beta_1X)}{1 + \exp(\beta_0 + \beta_1X)}. \tag{1} \end{equation}\]Notice that for low balances we predict the probability of default as close to, but never below, zero. Likewise, for high balances we predict a default probability close to, but never above, one. The logistic function will always produce an S-shaped curve of this form, and so regardless of the value of \(X\), we will obtain a sensible prediction. With some algebra,
\[\begin{equation} \frac{p(X)}{1 - p(X)}= \exp(\beta_0 + \beta_1 X). \tag{2} \end{equation}\]The quantity \(p(X)/(1 - p(X))\) is called the odds, and can take on any value between \(0\) and \(\infty\). Values of the odds close to \(0\) and \(\infty\) indicate very low and very high probabilities of default, respectively.
For example, on average 1 in 5 people with an odds of 1/4 will default, since \(p(X)=0.2\) implies an odds of \(\frac{0.2}{1 - 0.2} = 1/4\). Likewise on average nine out of every ten people with an odds of 9 will default, since \(p(X)=0.9\) implies an odds of \(\frac{0.9}{1 - 0.9}=9\).
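The odds arithmetic in this example can be checked in one line:

```r
# Odds p / (1 - p) for the two probabilities in the example above.
p <- c(0.2, 0.9)
p / (1 - p)   # 0.25 and 9: 1-in-5 and 9-in-10 default rates, respectively
```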
By taking the logarithm of both sides of (2), we arrive at
\[\begin{equation} \log\left(\frac{p(X)}{1 - p(X)}\right)= \beta_0 + \beta_1 X. \tag{3} \end{equation}\]The left-hand side of (3) is called the log-odds or logit. In a logistic regression model, increasing \(X\) by one unit changes the log-odds by \(\beta_1\), or equivalently it multiplies the odds by \(\exp(\beta_1)\).
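The "multiplies the odds by \(\exp(\beta_1)\)" claim is easy to verify numerically. The coefficient values below are illustrative, not estimates from this document:

```r
# Numeric check that a one-unit increase in X multiplies the odds by exp(beta1).
# beta0 and beta1 are illustrative values chosen for this sketch.
beta0 <- -2; beta1 <- 0.5
odds <- function(x) exp(beta0 + beta1 * x)   # p(x) / (1 - p(x)), from (2)
odds(3) / odds(2)   # the same as exp(beta1), regardless of x
exp(beta1)
```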
Because the relationship between \(p(X)\) and \(X\) in (1) is not a straight line, \(\beta_1\) does not correspond to the change in \(p(X)\) associated with a one-unit increase in \(X\). The amount that \(p(X)\) changes due to a one-unit change in \(X\) depends on the current value of \(X\). But regardless of the value of \(X\), if \(\beta_1\) is positive then increasing \(X\) will be associated with increasing \(p(X)\), and if \(\beta_1\) is negative then increasing \(X\) will be associated with decreasing \(p(X)\).
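To see that the change in \(p(X)\) depends on the current value of \(X\), compare a one-unit step in the tail of the S-curve with one near its steep middle. The coefficient values are again illustrative:

```r
# The change in p(X) from a one-unit increase in X depends on where X starts.
# beta0 and beta1 are illustrative values chosen for this sketch.
beta0 <- -2; beta1 <- 0.5
p <- function(x) plogis(beta0 + beta1 * x)   # equation (1), via the inverse logit
p(1) - p(0)   # small change out in the tail of the S-curve
p(5) - p(4)   # larger change near the steep middle, for the same one-unit step
```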
logmod <- glm(default_bin ~ balance, data = Default, family = "binomial")
summary(logmod)
Call:
glm(formula = default_bin ~ balance, family = "binomial", data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2697 -0.1465 -0.0589 -0.0221 3.7589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600.5
Number of Fisher Scoring iterations: 8
|             | Estimate    | Std. Error | z value   | Pr(>\|z\|) |
|-------------|-------------|------------|-----------|------------|
| (Intercept) | -10.6513306 | 0.3611574  | -29.49221 | 0          |
| balance     | 0.0054989   | 0.0002204  | 24.95309  | 0          |

Table 1: Coefficients from the logistic regression of default_bin on balance.
Using the estimated coefficients from Table 1, predict the probability of default for an individual with a balance of $1,000, and of $2,000.
predict(logmod, newdata = data.frame(balance = c(1000, 2000)), type = "response")
1 2
0.005752145 0.585769370
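Both predictions can be reproduced by hand from the coefficients in Table 1, plugging each balance into the inverse logit:

```r
# Hand check of the two predictions: p = plogis(beta0 + beta1 * balance),
# with beta0 and beta1 taken from Table 1.
beta0 <- -10.6513306
beta1 <- 0.0054989
plogis(beta0 + beta1 * c(1000, 2000))   # approximately 0.00575 and 0.58577
```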
logmodstu <- glm(default_bin ~ student, data = Default, family = "binomial")
summary(logmodstu)
Call:
glm(formula = default_bin ~ student, family = "binomial", data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.2970 -0.2970 -0.2434 -0.2434 2.6585
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.50413 0.07071 -49.55 < 2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 2908.7 on 9998 degrees of freedom
AIC: 2912.7
Number of Fisher Scoring iterations: 6
|             | Estimate   | Std. Error | z value    | Pr(>\|z\|) |
|-------------|------------|------------|------------|------------|
| (Intercept) | -3.5041278 | 0.0707130  | -49.554219 | 0.0000000  |
| studentYes  | 0.4048871  | 0.1150188  | 3.520181   | 0.0004313  |

Table 2: Coefficients from the logistic regression of default_bin on student.
The results in Table 2 suggest that students tend to have higher default probabilities than non-students.
predict(logmodstu, newdata = data.frame(student = c("Yes", "No")), type = "response")
1 2
0.04313859 0.02919501
\[\begin{equation}
\widehat{\text{Pr}}(\text{default = Yes | student = Yes}) = \frac{\exp(-3.50413 + 0.40489 \times 1)}{1 + \exp(-3.50413 + 0.40489 \times 1)} = 0.0431
\end{equation}\]
\[\begin{equation}
\widehat{\text{Pr}}(\text{default = Yes | student = No}) = \frac{\exp(-3.50413 + 0.40489 \times 0)}{1 + \exp(-3.50413 + 0.40489 \times 0)} = 0.0292
\end{equation}\]
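The two fitted probabilities above follow directly from the Table 2 coefficients via the inverse logit:

```r
# Hand check of the two fitted probabilities, using the Table 2 estimates.
plogis(-3.50413 + 0.40489 * 1)   # student:     approximately 0.0431
plogis(-3.50413 + 0.40489 * 0)   # non-student: approximately 0.0292
```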
mlog <- glm(default ~ balance + income + student, data = Default,
            family = "binomial")
summary(mlog)
Call:
glm(formula = default ~ balance + income + student, family = "binomial",
data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1579.5
Number of Fisher Scoring iterations: 8
|             | Estimate    | Std. Error | z value    | Pr(>\|z\|) |
|-------------|-------------|------------|------------|------------|
| (Intercept) | -10.8690452 | 0.4922555  | -22.080088 | 0.0000000  |
| balance     | 0.0057365   | 0.0002319  | 24.737563  | 0.0000000  |
| income      | 0.0000030   | 0.0000082  | 0.369815   | 0.7115203  |
| studentYes  | -0.6467758  | 0.2362525  | -2.737646  | 0.0061881  |

Table 3: Coefficients from the multiple logistic regression of default on balance, income, and student.
Note that students are less likely to default than non-students. The negative coefficient for student in the multiple logistic regression indicates that for a fixed value of balance and income, a student is less likely to default than a non-student.
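This adjusted comparison can be illustrated with the coefficients from the multiple regression above, holding balance and income fixed at hypothetical values (a balance of $1,500 and an income of $20,000, chosen only for this sketch):

```r
# Predicted default probability for a student vs. a non-student at the SAME
# hypothetical balance ($1,500) and income ($20,000), using the estimates
# from the multiple logistic regression above.
b <- c(-10.8690452, 0.0057365, 0.0000030, -0.6467758)  # intercept, balance, income, studentYes
eta <- b[1] + b[2] * 1500 + b[3] * 20000               # shared part of the linear predictor
plogis(eta + b[4])   # student
plogis(eta)          # non-student: higher, at the same balance and income
```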