Xianxiong Ma
March 14, 2020
Example: students entering high school choose among three programs: a general program, a vocational program, and an academic program. Their choice is often modeled using their writing score and their socioeconomic status. The hsbdemo.dta data set contains records for 200 students. The outcome variable is prog (program type), with three categories; the predictor variables are ses (socioeconomic status, a three-level categorical variable), three further categorical variables (id, female, schtyp), and other continuous variables.
library(foreign)
library(nnet)
library(ggplot2)
library(reshape2)
ml <- read.dta("hsbdemo.dta")
with(ml, table(ses, prog))
## prog
## ses general academic vocation
## low 16 19 12
## middle 20 44 31
## high 9 42 7
with(ml, do.call(rbind, tapply(write, prog, function(x) c(M = mean(x), SD = sd(x)))))
## M SD
## general 51.33333 9.397775
## academic 56.25714 7.943343
## vocation 46.76000 9.318754
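# set 'academic' as the reference (baseline) outcome category before fitting the multinomial model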
ml$prog2 <- relevel(ml$prog, ref = "academic")
test <- multinom(prog2 ~ ses + write, data = ml)
## # weights: 15 (8 variable)
## initial value 219.722458
## iter 10 value 179.982880
## final value 179.981726
## converged
summary(test)
## Call:
## multinom(formula = prog2 ~ ses + write, data = ml)
##
## Coefficients:
## (Intercept) sesmiddle seshigh write
## general 2.852198 -0.5332810 -1.1628226 -0.0579287
## vocation 5.218260 0.2913859 -0.9826649 -0.1136037
##
## Std. Errors:
## (Intercept) sesmiddle seshigh write
## general 1.166441 0.4437323 0.5142196 0.02141097
## vocation 1.163552 0.4763739 0.5955665 0.02221996
##
## Residual Deviance: 359.9635
## AIC: 375.9635
# 2-tailed z test
z <- summary(test)$coefficients/summary(test)$standard.errors
z
## (Intercept) sesmiddle seshigh write
## general 2.445214 -1.2018081 -2.261334 -2.705562
## vocation 4.484769 0.6116747 -1.649967 -5.112689
p <- (1 - pnorm(abs(z), 0, 1)) * 2
p
## (Intercept) sesmiddle seshigh write
## general 0.0144766100 0.2294379 0.02373856 6.818902e-03
## vocation 0.0000072993 0.5407530 0.09894976 3.176045e-07
# extract the coefficients from the model and exponentiate
exp(coef(test))
## (Intercept) sesmiddle seshigh write
## general 17.32582 0.5866769 0.3126026 0.9437172
## vocation 184.61262 1.3382809 0.3743123 0.8926116
head(pp <- fitted(test))
## academic general vocation
## 1 0.1482764 0.3382454 0.5134781
## 2 0.1202017 0.1806283 0.6991700
## 3 0.4186747 0.2368082 0.3445171
## 4 0.1726885 0.3508384 0.4764731
## 5 0.1001231 0.1689374 0.7309395
## 6 0.3533566 0.2377976 0.4088458
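# predicted probabilities for each level of ses, holding write at its mean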
dses <- data.frame(ses = c("low", "middle", "high"), write = mean(ml$write))
predict(test, newdata = dses, "probs")
## academic general vocation
## 1 0.4396845 0.3581917 0.2021238
## 2 0.4777488 0.2283353 0.2939159
## 3 0.7009007 0.1784939 0.1206054
dwrite <- data.frame(ses = rep(c("low", "middle", "high"), each = 41), write = rep(c(30:70),
3))
# store the predicted probabilities for each value of ses and write
pp.write <- cbind(dwrite, predict(test, newdata = dwrite, type = "probs", se = TRUE))
# calculate the mean probabilities within each level of ses
by(pp.write[, 3:5], pp.write$ses, colMeans)
## pp.write$ses: high
## academic general vocation
## 0.6164315 0.1808037 0.2027648
## ------------------------------------------------------------------
## pp.write$ses: low
## academic general vocation
## 0.3972977 0.3278174 0.2748849
## ------------------------------------------------------------------
## pp.write$ses: middle
## academic general vocation
## 0.4256198 0.2010864 0.3732938
# melt data set to long for ggplot2
lpp <- melt(pp.write, id.vars = c("ses", "write"), value.name = "probability")
head(lpp) # view first few rows
## ses write variable probability
## 1 low 30 academic 0.09843588
## 2 low 31 academic 0.10716868
## 3 low 32 academic 0.11650390
## 4 low 33 academic 0.12645834
## 5 low 34 academic 0.13704576
## 6 low 35 academic 0.14827643
# plot predicted probabilities across write values for each level of ses facetted by
# program type
ggplot(lpp, aes(x = write, y = probability, colour = ses)) + geom_line() + facet_grid(variable ~
., scales = "free")
Where possible, consider converting a multi-category outcome into a binary one; logit regression is easier to interpret for a binary outcome.
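As an illustration only (not part of the analysis above), prog can be collapsed into an academic versus non-academic indicator and fitted with glm(); the variable name prog_academic is made up for this sketch.
# collapse the three-level outcome into a hypothetical binary indicator
ml$prog_academic <- as.integer(ml$prog == "academic")
# ordinary binary logistic regression on the same predictors
binfit <- glm(prog_academic ~ ses + write, family = binomial, data = ml)
summary(binfit)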
library(Formula)
library(maxLik)
library(miscTools)
library(mlogit)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, shape = "wide", choice = "mode")
summary(mlogit(mode ~ 0 | income, data = Fish))
##
## Call:
## mlogit(formula = mode ~ 0 | income, data = Fish, method = "nr")
##
## Frequencies of alternatives:
## beach boat charter pier
## 0.11337 0.35364 0.38240 0.15059
##
## nr method
## 4 iterations, 0h:0m:0s
## g'(-H)^-1g = 8.32E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## boat:(intercept) 7.3892e-01 1.9673e-01 3.7560 0.0001727 ***
## charter:(intercept) 1.3413e+00 1.9452e-01 6.8955 5.367e-12 ***
## pier:(intercept) 8.1415e-01 2.2863e-01 3.5610 0.0003695 ***
## boat:income 9.1906e-05 4.0664e-05 2.2602 0.0238116 *
## charter:income -3.1640e-05 4.1846e-05 -0.7561 0.4495908
## pier:income -1.4340e-04 5.3288e-05 -2.6911 0.0071223 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1477.2
## McFadden R^2: 0.013736
## Likelihood ratio test : chisq = 41.145 (p.value = 6.0931e-09)
formula : mlogit fits several models, including multinomial logit, conditional logit, and mixed logit. For a multinomial logit the formula is written as outcome ~ 0 | predictors, e.g. mode ~ 0 | income.
data : use the mlogit.data function to put the data into the structure required by mlogit.
choice : the variable that records which alternative was chosen (the outcome).
shape : use "wide" if each row is one observation; use "long" if each row represents one alternative.
alt.var : for data in long shape, the name of the variable holding the alternative names (see the sketch after this list).
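A minimal sketch of the long shape, using a made-up toy data frame (long_df and its column names are illustrative only, not part of the Fishing example):
# toy long-format data: one row per person-alternative pair
long_df <- data.frame(person = rep(1:3, each = 3),
                      alt = rep(c("bus", "car", "train"), times = 3),
                      chosen = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE),
                      cost = c(2, 5, 4, 3, 6, 4, 2, 7, 5))
# shape = "long" because each row is one alternative; alt.var names the column with
# the alternative labels, chid.var the id of the choice situation
long_data <- mlogit.data(long_df, choice = "chosen", shape = "long",
                         alt.var = "alt", chid.var = "person")
# a conditional logit on the alternative-specific variable cost would then be
# mlogit(chosen ~ cost, data = long_data); the toy data are too small for a meaningful fit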
Example: A study looks at factors that influence the decision of whether to apply to graduate school. College juniors are asked if they are unlikely, somewhat likely, or very likely to apply to graduate school. Hence, our outcome variable has three categories. Data on parental educational status, whether the undergraduate institution is public or private, and current GPA are also collected. The researchers have reason to believe that the "distances" between these three points are not equal. For example, the "distance" between "unlikely" and "somewhat likely" may be shorter than the distance between "somewhat likely" and "very likely".
require(foreign)
require(ggplot2)
require(MASS)
require(Hmisc)
require(reshape2)
dat <- read.dta("ologit.dta")
head(dat)
## apply pared public gpa
## 1 very likely 0 0 3.26
## 2 somewhat likely 1 0 3.21
## 3 unlikely 1 1 3.94
## 4 somewhat likely 0 0 2.81
## 5 somewhat likely 0 0 2.53
## 6 unlikely 0 1 2.59
## one at a time, table apply, pared, and public
lapply(dat[, c("apply", "pared", "public")], table)
## $apply
##
## unlikely somewhat likely very likely
## 220 140 40
##
## $pared
##
## 0 1
## 337 63
##
## $public
##
## 0 1
## 343 57
## three way cross tabs (xtabs) and flatten the table
ftable(xtabs(~public + apply + pared, data = dat))
## pared 0 1
## public apply
## 0 unlikely 175 14
## somewhat likely 98 26
## very likely 20 10
## 1 unlikely 25 6
## somewhat likely 12 4
## very likely 7 3
summary(dat$gpa)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.900 2.720 2.990 2.999 3.270 4.000
sd(dat$gpa)
## [1] 0.3979409
ggplot(dat, aes(x = apply, y = gpa)) + geom_boxplot(size = 0.75) + geom_jitter(alpha = 0.5) +
facet_grid(pared ~ public, margins = TRUE) + theme(axis.text.x = element_text(angle = 45,
hjust = 1, vjust = 1))
## fit ordered logit model and store results 'm'
m <- polr(apply ~ pared + public + gpa, data = dat, Hess = TRUE)
## view a summary of the model
summary(m)
## Call:
## polr(formula = apply ~ pared + public + gpa, data = dat, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## pared 1.04769 0.2658 3.9418
## public -0.05879 0.2979 -0.1974
## gpa 0.61594 0.2606 2.3632
##
## Intercepts:
## Value Std. Error t value
## unlikely|somewhat likely 2.2039 0.7795 2.8272
## somewhat likely|very likely 4.2994 0.8043 5.3453
##
## Residual Deviance: 717.0249
## AIC: 727.0249
## store table
(ctable <- coef(summary(m)))
## Value Std. Error t value
## pared 1.04769010 0.2657894 3.9418050
## public -0.05878572 0.2978614 -0.1973593
## gpa 0.61594057 0.2606340 2.3632399
## unlikely|somewhat likely 2.20391473 0.7795455 2.8271792
## somewhat likely|very likely 4.29936315 0.8043267 5.3452947
## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable <- cbind(ctable, `p value` = p))
## Value Std. Error t value p value
## pared 1.04769010 0.2657894 3.9418050 8.087072e-05
## public -0.05878572 0.2978614 -0.1973593 8.435464e-01
## gpa 0.61594057 0.2606340 2.3632399 1.811594e-02
## unlikely|somewhat likely 2.20391473 0.7795455 2.8271792 4.696004e-03
## somewhat likely|very likely 4.29936315 0.8043267 5.3452947 9.027008e-08
(ci <- confint(m)) # default method gives profiled CIs
## 2.5 % 97.5 %
## pared 0.5281768 1.5721750
## public -0.6522060 0.5191384
## gpa 0.1076202 1.1309148
confint.default(m) # CIs assuming normality
## 2.5 % 97.5 %
## pared 0.5267524 1.5686278
## public -0.6425833 0.5250119
## gpa 0.1051074 1.1267737
## odds ratios
exp(coef(m))
## pared public gpa
## 2.8510579 0.9429088 1.8513972
## OR and CI
exp(cbind(OR = coef(m), ci))
## OR 2.5 % 97.5 %
## pared 2.8510579 1.6958376 4.817114
## public 0.9429088 0.5208954 1.680579
## gpa 1.8513972 1.1136247 3.098490
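## check the proportional odds assumption: sf() returns the cumulative log odds (logits)
## of apply being at or above each level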
sf <- function(y) {
c(`Y>=1` = qlogis(mean(y >= 1)), `Y>=2` = qlogis(mean(y >= 2)), `Y>=3` = qlogis(mean(y >=
3)))
}
(s <- with(dat, summary(as.numeric(apply) ~ pared + public + gpa, fun = sf)))
## as.numeric(apply) N= 400
##
## +-------+-----------+---+----+-----------+---------+
## | | |N |Y>=1|Y>=2 |Y>=3 |
## +-------+-----------+---+----+-----------+---------+
## |pared |No |337|Inf |-0.37833644|-2.440735|
## | |Yes | 63|Inf | 0.76546784|-1.347074|
## +-------+-----------+---+----+-----------+---------+
## |public |No |343|Inf |-0.20479441|-2.345006|
## | |Yes | 57|Inf |-0.17589067|-1.547563|
## +-------+-----------+---+----+-----------+---------+
## |gpa |[1.90,2.73)|102|Inf |-0.39730180|-2.772589|
## | |[2.73,3.00)| 99|Inf |-0.26415158|-2.302585|
## | |[3.00,3.28)|100|Inf |-0.20067070|-2.090741|
## | |[3.28,4.00]| 99|Inf | 0.06062462|-1.803594|
## +-------+-----------+---+----+-----------+---------+
## |Overall| |400|Inf |-0.20067070|-2.197225|
## +-------+-----------+---+----+-----------+---------+
glm(I(as.numeric(apply) >= 2) ~ pared, family = "binomial", data = dat)
##
## Call: glm(formula = I(as.numeric(apply) >= 2) ~ pared, family = "binomial",
## data = dat)
##
## Coefficients:
## (Intercept) pared
## -0.3783 1.1438
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 550.5
## Residual Deviance: 534.1 AIC: 538.1
glm(I(as.numeric(apply) >= 3) ~ pared, family = "binomial", data = dat)
##
## Call: glm(formula = I(as.numeric(apply) >= 3) ~ pared, family = "binomial",
## data = dat)
##
## Coefficients:
## (Intercept) pared
## -2.441 1.094
##
## Degrees of Freedom: 399 Total (i.e. Null); 398 Residual
## Null Deviance: 260.1
## Residual Deviance: 252.2 AIC: 256.2
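## subtract the Y>=2 column so the Y>=3 column shows, within each predictor, how far the
## Y>=3 logit lies from the Y>=2 logit; roughly constant values across a predictor's levels
## are consistent with the proportional odds assumption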
s[, 4] <- s[, 4] - s[, 3]
s[, 3] <- s[, 3] - s[, 3]
s # print
## as.numeric(apply) N= 400
##
## +-------+-----------+---+----+----+---------+
## | | |N |Y>=1|Y>=2|Y>=3 |
## +-------+-----------+---+----+----+---------+
## |pared |No |337|Inf |0 |-2.062399|
## | |Yes | 63|Inf |0 |-2.112541|
## +-------+-----------+---+----+----+---------+
## |public |No |343|Inf |0 |-2.140211|
## | |Yes | 57|Inf |0 |-1.371672|
## +-------+-----------+---+----+----+---------+
## |gpa |[1.90,2.73)|102|Inf |0 |-2.375287|
## | |[2.73,3.00)| 99|Inf |0 |-2.038434|
## | |[3.00,3.28)|100|Inf |0 |-1.890070|
## | |[3.28,4.00]| 99|Inf |0 |-1.864219|
## +-------+-----------+---+----+----+---------+
## |Overall| |400|Inf |0 |-1.996554|
## +-------+-----------+---+----+----+---------+
plot(s, which = 1:3, pch = 1:3, xlab = "logit", main = " ", xlim = range(s[, 3:4]))
newdat <- data.frame(pared = rep(0:1, 200), public = rep(0:1, each = 200), gpa = rep(seq(from = 1.9,
to = 4, length.out = 100), 4))
newdat <- cbind(newdat, predict(m, newdat, type = "probs"))
## show first few rows
head(newdat)
## pared public gpa unlikely somewhat likely very likely
## 1 0 0 1.900000 0.7376186 0.2204577 0.04192370
## 2 1 0 1.921212 0.4932185 0.3945673 0.11221424
## 3 0 0 1.942424 0.7325300 0.2244841 0.04298593
## 4 1 0 1.963636 0.4866885 0.3984676 0.11484395
## 5 0 0 1.984848 0.7273792 0.2285470 0.04407383
## 6 1 0 2.006061 0.4801630 0.4023098 0.11752712
lnewdat <- melt(newdat, id.vars = c("pared", "public", "gpa"), variable.name = "Level", value.name = "Probability")
## view first few rows
head(lnewdat)
## pared public gpa Level Probability
## 1 0 0 1.900000 unlikely 0.7376186
## 2 1 0 1.921212 unlikely 0.4932185
## 3 0 0 1.942424 unlikely 0.7325300
## 4 1 0 1.963636 unlikely 0.4866885
## 5 0 0 1.984848 unlikely 0.7273792
## 6 1 0 2.006061 unlikely 0.4801630
ggplot(lnewdat, aes(x = gpa, y = Probability, colour = Level)) + geom_line() + facet_grid(pared ~
public, labeller = "label_both")
Ordinary binary logistic regression: use the generalized linear model function glm().
Logistic regression with a multi-category outcome: use multinom() from the nnet package or mlogit(), as shown above.
Conditional logistic regression: use clogit() from the survival package (a minimal sketch follows).
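As a minimal sketch only (conditional logistic regression is not fitted elsewhere in this document), clogit() is used with strata() marking the matched sets; the example below uses R's built-in infert data set.
library(survival)
# conditional (matched-set) logistic regression; strata(stratum) identifies the matched sets
clogit_fit <- clogit(case ~ spontaneous + induced + strata(stratum), data = infert)
summary(clogit_fit)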