Logistic回归两种变种的R实现

Xianxiong Ma

2020年3月14日

无序多分类Logistic回归

利用 nnet 包中的函数multinom ，建立多元logistic回归模型：

案例：进入高中的学生在做计划选择时，面临着一般计划、职业计划及学术计划（general，vocational，academic）三种选择。他们的选择往往是通过对他们的写作成绩及社会经济地位进行模型化决定。hsbdemo.dta数据集包含200个学生的记录，结果变量为prog(program type)三种项目类型，预测变量为：ses(social economic status)社会经济地位，3个分类变量（id,female,schtyp)及其它连续变量。

library(foreign)
library(nnet)
library(ggplot2)
library(reshape2)

ml <- read.dta("hsbdemo.dta")
with(ml, table(ses, prog))

##         prog
## ses      general academic vocation
##   low         16       19       12
##   middle      20       44       31
##   high         9       42        7

with(ml, do.call(rbind, tapply(write, prog, function(x) c(M = mean(x), SD = sd(x)))))

##                 M       SD
## general  51.33333 9.397775
## academic 56.25714 7.943343
## vocation 46.76000 9.318754

ml$prog2 <- relevel(ml$prog, ref = "academic")
test <- multinom(prog2 ~ ses + write, data = ml)

## # weights:  15 (8 variable)
## initial  value 219.722458 
## iter  10 value 179.982880
## final  value 179.981726 
## converged

summary(test)

## Call:
## multinom(formula = prog2 ~ ses + write, data = ml)
## 
## Coefficients:
##          (Intercept)  sesmiddle    seshigh      write
## general     2.852198 -0.5332810 -1.1628226 -0.0579287
## vocation    5.218260  0.2913859 -0.9826649 -0.1136037
## 
## Std. Errors:
##          (Intercept) sesmiddle   seshigh      write
## general     1.166441 0.4437323 0.5142196 0.02141097
## vocation    1.163552 0.4763739 0.5955665 0.02221996
## 
## Residual Deviance: 359.9635 
## AIC: 375.9635

# 2-tailed z test
z <- summary(test)$coefficients/summary(test)$standard.errors
z

##          (Intercept)  sesmiddle   seshigh     write
## general     2.445214 -1.2018081 -2.261334 -2.705562
## vocation    4.484769  0.6116747 -1.649967 -5.112689

p <- (1 - pnorm(abs(z), 0, 1)) * 2
p

##           (Intercept) sesmiddle    seshigh        write
## general  0.0144766100 0.2294379 0.02373856 6.818902e-03
## vocation 0.0000072993 0.5407530 0.09894976 3.176045e-07

# extract the coefficients from the model and exponentiate
exp(coef(test))

##          (Intercept) sesmiddle   seshigh     write
## general     17.32582 0.5866769 0.3126026 0.9437172
## vocation   184.61262 1.3382809 0.3743123 0.8926116

head(pp <- fitted(test))

##    academic   general  vocation
## 1 0.1482764 0.3382454 0.5134781
## 2 0.1202017 0.1806283 0.6991700
## 3 0.4186747 0.2368082 0.3445171
## 4 0.1726885 0.3508384 0.4764731
## 5 0.1001231 0.1689374 0.7309395
## 6 0.3533566 0.2377976 0.4088458

dses <- data.frame(ses = c("low", "middle", "high"), write = mean(ml$write))
predict(test, newdata = dses, "probs")

##    academic   general  vocation
## 1 0.4396845 0.3581917 0.2021238
## 2 0.4777488 0.2283353 0.2939159
## 3 0.7009007 0.1784939 0.1206054

dwrite <- data.frame(ses = rep(c("low", "middle", "high"), each = 41), write = rep(c(30:70), 
    3))

# store the predicted probabilities for each value of ses and write
pp.write <- cbind(dwrite, predict(test, newdata = dwrite, type = "probs", se = TRUE))

# calculate the mean probabilities within each level of ses
by(pp.write[, 3:5], pp.write$ses, colMeans)

## pp.write$ses: high
##  academic   general  vocation 
## 0.6164315 0.1808037 0.2027648 
## ------------------------------------------------------------------ 
## pp.write$ses: low
##  academic   general  vocation 
## 0.3972977 0.3278174 0.2748849 
## ------------------------------------------------------------------ 
## pp.write$ses: middle
##  academic   general  vocation 
## 0.4256198 0.2010864 0.3732938

# melt data set to long for ggplot2
lpp <- melt(pp.write, id.vars = c("ses", "write"), value.name = "probability")
head(lpp)  # view first few rows

##   ses write variable probability
## 1 low    30 academic  0.09843588
## 2 low    31 academic  0.10716868
## 3 low    32 academic  0.11650390
## 4 low    33 academic  0.12645834
## 5 low    34 academic  0.13704576
## 6 low    35 academic  0.14827643

# plot predicted probabilities across write values for each level of ses facetted by
# program type
ggplot(lpp, aes(x = write, y = probability, colour = ses)) + geom_line() + facet_grid(variable ~ 
    ., scales = "free")

plot of chunk unnamed-chunk-1 尽量将多分类变量变成二分类变量处理。logit回归更加适合解释二分类变量。

程序包mlogit提供了多项logit的模型拟合函数

library(Formula)
library(maxLik)
library(miscTools)
library(mlogit)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, shape = "wide", choice = "mode")
summary(mlogit(mode ~ 0 | income, data = Fish))

## 
## Call:
## mlogit(formula = mode ~ 0 | income, data = Fish, method = "nr")
## 
## Frequencies of alternatives:
##   beach    boat charter    pier 
## 0.11337 0.35364 0.38240 0.15059 
## 
## nr method
## 4 iterations, 0h:0m:0s 
## g'(-H)^-1g = 8.32E-07 
## gradient close to zero 
## 
## Coefficients :
##                        Estimate  Std. Error z-value  Pr(>|z|)    
## boat:(intercept)     7.3892e-01  1.9673e-01  3.7560 0.0001727 ***
## charter:(intercept)  1.3413e+00  1.9452e-01  6.8955 5.367e-12 ***
## pier:(intercept)     8.1415e-01  2.2863e-01  3.5610 0.0003695 ***
## boat:income          9.1906e-05  4.0664e-05  2.2602 0.0238116 *  
## charter:income      -3.1640e-05  4.1846e-05 -0.7561 0.4495908    
## pier:income         -1.4340e-04  5.3288e-05 -2.6911 0.0071223 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Log-Likelihood: -1477.2
## McFadden R^2:  0.013736 
## Likelihood ratio test : chisq = 41.145 (p.value = 6.0931e-09)

formula : mlogit提供了标记logit，多项logit，混合logit多种模型，对于多项logit的估计模型应写为：因变量~0|自变量，如：mode ~ 0|model

data : 使用mlogit.data函数使得数据结构符合mlogit函数要求。

choice : 确定分类变量是什么

shape : 如果每一行是一个观测，我们选择wide，如果每一行是表示一个选择，我们就应该选择long

alt.var : 对于shape为long的数据，需要标明所有的选择名称

有序多分类Logistic回归

程序包MASS提供polr()函数可以进行poly ordered logit regression或probit回归

案例：A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable has three categories. Data on parental educational status, whether the undergraduate institution is public or private, and current GPA is also collected. The researchers have reason to believe that the “distances” between these three points are not equal. For example, the “distance” between 'unlikely' and “somewhat likely” may be shorter than the distance between “somewhat likely” and “very likely”.

require(foreign)
require(ggplot2)
require(MASS)
require(Hmisc)
require(reshape2)

dat <- read.dta("ologit.dta")
head(dat)

##             apply pared public  gpa
## 1     very likely     0      0 3.26
## 2 somewhat likely     1      0 3.21
## 3        unlikely     1      1 3.94
## 4 somewhat likely     0      0 2.81
## 5 somewhat likely     0      0 2.53
## 6        unlikely     0      1 2.59

## one at a time, table apply, pared, and public
lapply(dat[, c("apply", "pared", "public")], table)

## $apply
## 
##        unlikely somewhat likely     very likely 
##             220             140              40 
## 
## $pared
## 
##   0   1 
## 337  63 
## 
## $public
## 
##   0   1 
## 343  57

## three way cross tabs (xtabs) and flatten the table
ftable(xtabs(~public + apply + pared, data = dat))

##                        pared   0   1
## public apply                        
## 0      unlikely              175  14
##        somewhat likely        98  26
##        very likely            20  10
## 1      unlikely               25   6
##        somewhat likely        12   4
##        very likely             7   3

summary(dat$gpa)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   2.720   2.990   2.999   3.270   4.000

sd(dat$gpa)

## [1] 0.3979409

ggplot(dat, aes(x = apply, y = gpa)) + geom_boxplot(size = 0.75) + geom_jitter(alpha = 0.5) + 
    facet_grid(pared ~ public, margins = TRUE) + theme(axis.text.x = element_text(angle = 45, 
    hjust = 1, vjust = 1))

plot of chunk unnamed-chunk-3

## fit ordered logit model and store results 'm'
m <- polr(apply ~ pared + public + gpa, data = dat, Hess = TRUE)
## view a summary of the model
summary(m)

## Call:
## polr(formula = apply ~ pared + public + gpa, data = dat, Hess = TRUE)
## 
## Coefficients:
##           Value Std. Error t value
## pared   1.04769     0.2658  3.9418
## public -0.05879     0.2979 -0.1974
## gpa     0.61594     0.2606  2.3632
## 
## Intercepts:
##                             Value   Std. Error t value
## unlikely|somewhat likely     2.2039  0.7795     2.8272
## somewhat likely|very likely  4.2994  0.8043     5.3453
## 
## Residual Deviance: 717.0249 
## AIC: 727.0249

## store table
(ctable <- coef(summary(m)))

##                                   Value Std. Error    t value
## pared                        1.04769010  0.2657894  3.9418050
## public                      -0.05878572  0.2978614 -0.1973593
## gpa                          0.61594057  0.2606340  2.3632399
## unlikely|somewhat likely     2.20391473  0.7795455  2.8271792
## somewhat likely|very likely  4.29936315  0.8043267  5.3452947

## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable <- cbind(ctable, `p value` = p))

##                                   Value Std. Error    t value      p value
## pared                        1.04769010  0.2657894  3.9418050 8.087072e-05
## public                      -0.05878572  0.2978614 -0.1973593 8.435464e-01
## gpa                          0.61594057  0.2606340  2.3632399 1.811594e-02
## unlikely|somewhat likely     2.20391473  0.7795455  2.8271792 4.696004e-03
## somewhat likely|very likely  4.29936315  0.8043267  5.3452947 9.027008e-08

(ci <- confint(m))  # default method gives profiled CIs

##             2.5 %    97.5 %
## pared   0.5281768 1.5721750
## public -0.6522060 0.5191384
## gpa     0.1076202 1.1309148

confint.default(m)  # CIs assuming normality

##             2.5 %    97.5 %
## pared   0.5267524 1.5686278
## public -0.6425833 0.5250119
## gpa     0.1051074 1.1267737

## odds ratios
exp(coef(m))

##     pared    public       gpa 
## 2.8510579 0.9429088 1.8513972

## OR and CI
exp(cbind(OR = coef(m), ci))

##               OR     2.5 %   97.5 %
## pared  2.8510579 1.6958376 4.817114
## public 0.9429088 0.5208954 1.680579
## gpa    1.8513972 1.1136247 3.098490

sf <- function(y) {
    c(`Y>=1` = qlogis(mean(y >= 1)), `Y>=2` = qlogis(mean(y >= 2)), `Y>=3` = qlogis(mean(y >= 
        3)))
}
(s <- with(dat, summary(as.numeric(apply) ~ pared + public + gpa, fun = sf)))

## as.numeric(apply)     N= 400 
## 
## +-------+-----------+---+----+-----------+---------+
## |       |           |N  |Y>=1|Y>=2       |Y>=3     |
## +-------+-----------+---+----+-----------+---------+
## |pared  |No         |337|Inf |-0.37833644|-2.440735|
## |       |Yes        | 63|Inf | 0.76546784|-1.347074|
## +-------+-----------+---+----+-----------+---------+
## |public |No         |343|Inf |-0.20479441|-2.345006|
## |       |Yes        | 57|Inf |-0.17589067|-1.547563|
## +-------+-----------+---+----+-----------+---------+
## |gpa    |[1.90,2.73)|102|Inf |-0.39730180|-2.772589|
## |       |[2.73,3.00)| 99|Inf |-0.26415158|-2.302585|
## |       |[3.00,3.28)|100|Inf |-0.20067070|-2.090741|
## |       |[3.28,4.00]| 99|Inf | 0.06062462|-1.803594|
## +-------+-----------+---+----+-----------+---------+
## |Overall|           |400|Inf |-0.20067070|-2.197225|
## +-------+-----------+---+----+-----------+---------+

glm(I(as.numeric(apply) >= 2) ~ pared, family = "binomial", data = dat)

## 
## Call:  glm(formula = I(as.numeric(apply) >= 2) ~ pared, family = "binomial", 
##     data = dat)
## 
## Coefficients:
## (Intercept)        pared  
##     -0.3783       1.1438  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       550.5 
## Residual Deviance: 534.1     AIC: 538.1

glm(I(as.numeric(apply) >= 3) ~ pared, family = "binomial", data = dat)

## 
## Call:  glm(formula = I(as.numeric(apply) >= 3) ~ pared, family = "binomial", 
##     data = dat)
## 
## Coefficients:
## (Intercept)        pared  
##      -2.441        1.094  
## 
## Degrees of Freedom: 399 Total (i.e. Null);  398 Residual
## Null Deviance:       260.1 
## Residual Deviance: 252.2     AIC: 256.2

s[, 4] <- s[, 4] - s[, 3]
s[, 3] <- s[, 3] - s[, 3]
s  # print

## as.numeric(apply)     N= 400 
## 
## +-------+-----------+---+----+----+---------+
## |       |           |N  |Y>=1|Y>=2|Y>=3     |
## +-------+-----------+---+----+----+---------+
## |pared  |No         |337|Inf |0   |-2.062399|
## |       |Yes        | 63|Inf |0   |-2.112541|
## +-------+-----------+---+----+----+---------+
## |public |No         |343|Inf |0   |-2.140211|
## |       |Yes        | 57|Inf |0   |-1.371672|
## +-------+-----------+---+----+----+---------+
## |gpa    |[1.90,2.73)|102|Inf |0   |-2.375287|
## |       |[2.73,3.00)| 99|Inf |0   |-2.038434|
## |       |[3.00,3.28)|100|Inf |0   |-1.890070|
## |       |[3.28,4.00]| 99|Inf |0   |-1.864219|
## +-------+-----------+---+----+----+---------+
## |Overall|           |400|Inf |0   |-1.996554|
## +-------+-----------+---+----+----+---------+

plot(s, which = 1:3, pch = 1:3, xlab = "logit", main = " ", xlim = range(s[, 3:4]))

plot of chunk unnamed-chunk-3

newdat <- data.frame(pared = rep(0:1, 200), public = rep(0:1, each = 200), gpa = rep(seq(from = 1.9, 
    to = 4, length.out = 100), 4))
newdat <- cbind(newdat, predict(m, newdat, type = "probs"))

## show first few rows
head(newdat)

##   pared public      gpa  unlikely somewhat likely very likely
## 1     0      0 1.900000 0.7376186       0.2204577  0.04192370
## 2     1      0 1.921212 0.4932185       0.3945673  0.11221424
## 3     0      0 1.942424 0.7325300       0.2244841  0.04298593
## 4     1      0 1.963636 0.4866885       0.3984676  0.11484395
## 5     0      0 1.984848 0.7273792       0.2285470  0.04407383
## 6     1      0 2.006061 0.4801630       0.4023098  0.11752712

lnewdat <- melt(newdat, id.vars = c("pared", "public", "gpa"), variable.name = "Level", value.name = "Probability")
## view first few rows
head(lnewdat)

##   pared public      gpa    Level Probability
## 1     0      0 1.900000 unlikely   0.7376186
## 2     1      0 1.921212 unlikely   0.4932185
## 3     0      0 1.942424 unlikely   0.7325300
## 4     1      0 1.963636 unlikely   0.4866885
## 5     0      0 1.984848 unlikely   0.7273792
## 6     1      0 2.006061 unlikely   0.4801630

ggplot(lnewdat, aes(x = gpa, y = Probability, colour = Level)) + geom_line() + facet_grid(pared ~ 
    public, labeller = "label_both")

plot of chunk unnamed-chunk-3

普通二分类logistic回归用广义线性模型glm()

因变量多分类logistic回归

有序分类因变量：用MASS包里的polr
无序分类因变量：用nnet包里的multinom

条件logistic回归用survival包里的clogit