Xianxiong Ma
2020年3月2日
使用场景 :
结果变量是类别型,二值变量和多分类变量,不满足正态分布
结果变量是计数型,并且他们的均值和方差都是相关的
解决方法 :使用广义线性模型,它包含费正太因变量的分析
glm(formula, family=family.generator, data,control = list(...))
family:每一种响应分布(指数分布族)允许各种关联函数将均值和线性预测器关联起来。
常用的family:
binomal(link='logit') ---响应变量服从二项分布,连接函数为logit,即logistic回归
binomal(link='probit') ----响应变量服从二项分布,连接函数为probit
poisson(link='identity') ----响应变量服从泊松分布,即泊松回归
control:控制算法误差和最大迭代次数
glm.control(epsilon = 1e-8, maxit = 25, trace = FALSE)
-----maxit:算法最大迭代次数,改变最大迭代次数:control=list(maxit=100)
requires packages AER, robust, qcc
# get summary statistics
data(Affairs, package = "AER")
summary(Affairs)
## affairs gender age yearsmarried children religiousness
## Min. : 0.000 female:315 Min. :17.50 Min. : 0.125 no :171 Min. :1.000
## 1st Qu.: 0.000 male :286 1st Qu.:27.00 1st Qu.: 4.000 yes:430 1st Qu.:2.000
## Median : 0.000 Median :32.00 Median : 7.000 Median :3.000
## Mean : 1.456 Mean :32.49 Mean : 8.178 Mean :3.116
## 3rd Qu.: 0.000 3rd Qu.:37.00 3rd Qu.:15.000 3rd Qu.:4.000
## Max. :12.000 Max. :57.00 Max. :15.000 Max. :5.000
## education occupation rating
## Min. : 9.00 Min. :1.000 Min. :1.000
## 1st Qu.:14.00 1st Qu.:3.000 1st Qu.:3.000
## Median :16.00 Median :5.000 Median :4.000
## Mean :16.17 Mean :4.195 Mean :3.932
## 3rd Qu.:18.00 3rd Qu.:6.000 3rd Qu.:5.000
## Max. :20.00 Max. :7.000 Max. :5.000
table(Affairs$affairs)
##
## 0 1 2 3 7 12
## 451 34 17 19 42 38
affairs : numeric. How often engaged in extramarital sexual intercourse during the past year? 0 = none, 1 = once, 2 = twice, 3 = 3 times, 7 = 4–10 times, 12 = monthly, 12 = weekly, 12 = daily.
gender : factor indicating gender.
age : numeric variable coding age in years: 17.5 = under 20, 22 = 20–24, 27 = 25–29, 32 = 30–34, 37 = 35–39, 42 = 40–44, 47 = 45–49, 52 = 50–54, 57 = 55 or over.
yearsmarried : numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4–6 months, 0.75 = 6 months–1 year, 1.5 = 1–2 years, 4 = 3–5 years, 7 = 6–8 years, 10 = 9–11 years, 15 = 12 or more years.
children : factor. Are there children in the marriage?
religiousness : numeric variable coding religiousness: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.
education : numeric variable coding level of education: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master's degree, 20 = Ph.D., M.D., or other advanced degree.
occupation : numeric variable coding occupation according to Hollingshead classification (reverse numbering).
rating : numeric variable coding self rating of marriage: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy.
# create binary outcome variable
Affairs$ynaffair[Affairs$affairs > 0] <- 1
Affairs$ynaffair[Affairs$affairs == 0] <- 0
Affairs$ynaffair <- factor(Affairs$ynaffair, levels = c(0, 1), labels = c("No", "Yes"))
table(Affairs$ynaffair)
##
## No Yes
## 451 150
# fit full model
fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation +
rating, data = Affairs, family = binomial())
summary(fit.full)
##
## Call:
## glm(formula = ynaffair ~ gender + age + yearsmarried + children +
## religiousness + education + occupation + rating, family = binomial(),
## data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5713 -0.7499 -0.5690 -0.2539 2.5191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.37726 0.88776 1.551 0.120807
## gendermale 0.28029 0.23909 1.172 0.241083
## age -0.04426 0.01825 -2.425 0.015301 *
## yearsmarried 0.09477 0.03221 2.942 0.003262 **
## childrenyes 0.39767 0.29151 1.364 0.172508
## religiousness -0.32472 0.08975 -3.618 0.000297 ***
## education 0.02105 0.05051 0.417 0.676851
## occupation 0.03092 0.07178 0.431 0.666630
## rating -0.46845 0.09091 -5.153 2.56e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 675.38 on 600 degrees of freedom
## Residual deviance: 609.51 on 592 degrees of freedom
## AIC: 627.51
##
## Number of Fisher Scoring iterations: 4
# fit reduced model
fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, data = Affairs, family = binomial())
summary(fit.reduced)
##
## Call:
## glm(formula = ynaffair ~ age + yearsmarried + religiousness +
## rating, family = binomial(), data = Affairs)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6278 -0.7550 -0.5701 -0.2624 2.3998
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.93083 0.61032 3.164 0.001558 **
## age -0.03527 0.01736 -2.032 0.042127 *
## yearsmarried 0.10062 0.02921 3.445 0.000571 ***
## religiousness -0.32902 0.08945 -3.678 0.000235 ***
## rating -0.46136 0.08884 -5.193 2.06e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 675.38 on 600 degrees of freedom
## Residual deviance: 615.36 on 596 degrees of freedom
## AIC: 625.36
##
## Number of Fisher Scoring iterations: 4
# compare models
anova(fit.reduced, fit.full, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
## Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness +
## education + occupation + rating
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 596 615.36
## 2 592 609.51 4 5.8474 0.2108
# interpret coefficients
coef(fit.reduced)
## (Intercept) age yearsmarried religiousness rating
## 1.93083017 -0.03527112 0.10062274 -0.32902386 -0.46136144
exp(coef(fit.reduced))
## (Intercept) age yearsmarried religiousness rating
## 6.8952321 0.9653437 1.1058594 0.7196258 0.6304248
exp(confint(fit.reduced))
## 2.5 % 97.5 %
## (Intercept) 2.1255764 23.3506030
## age 0.9323342 0.9981470
## yearsmarried 1.0448584 1.1718250
## religiousness 0.6026782 0.8562807
## rating 0.5286586 0.7493370
# calculate probability of extramariatal affair by marital ratings
testdata <- data.frame(rating = c(1, 2, 3, 4, 5), age = mean(Affairs$age), yearsmarried = mean(Affairs$yearsmarried),
religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = "response")
testdata
## rating age yearsmarried religiousness prob
## 1 1 32.48752 8.177696 3.116473 0.5302296
## 2 2 32.48752 8.177696 3.116473 0.4157377
## 3 3 32.48752 8.177696 3.116473 0.3096712
## 4 4 32.48752 8.177696 3.116473 0.2204547
## 5 5 32.48752 8.177696 3.116473 0.1513079
# calculate probabilites of extramariatal affair by age
testdata <- data.frame(rating = mean(Affairs$rating), age = seq(17, 57, 10), yearsmarried = mean(Affairs$yearsmarried),
religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = "response")
testdata
## rating age yearsmarried religiousness prob
## 1 3.93178 17 8.177696 3.116473 0.3350834
## 2 3.93178 27 8.177696 3.116473 0.2615373
## 3 3.93178 37 8.177696 3.116473 0.1992953
## 4 3.93178 47 8.177696 3.116473 0.1488796
## 5 3.93178 57 8.177696 3.116473 0.1094738
# evaluate overdispersion
fit <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = binomial(), data = Affairs)
fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = quasibinomial(), data = Affairs)
pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F)
## [1] 0.340122
Example11_4 <- read.table("example11_4.csv", header = TRUE, sep = ",")
head(Example11_4)
## y x1 x2
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
attach(Example11_4)
fit1 <- glm(y ~ x1 + x2, family = binomial(), data = Example11_4)
summary(fit1)
##
## Call:
## glm(formula = y ~ x1 + x2, family = binomial(), data = Example11_4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3965 -1.0193 -0.8225 0.9730 1.5800
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9099 0.1358 -6.699 2.11e-11 ***
## x1 0.8856 0.1500 5.904 3.54e-09 ***
## x2 0.5261 0.1572 3.348 0.000815 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1228.0 on 885 degrees of freedom
## Residual deviance: 1159.4 on 883 degrees of freedom
## AIC: 1165.4
##
## Number of Fisher Scoring iterations: 4
coefficients(fit1)
## (Intercept) x1 x2
## -0.9099467 0.8855837 0.5261234
exp(coefficients(fit1))
## (Intercept) x1 x2
## 0.4025457 2.4243991 1.6923589
exp(confint(fit1))
## 2.5 % 97.5 %
## (Intercept) 0.3071678 0.5234695
## x1 1.8090525 3.2580064
## x2 1.2441606 2.3048818
fit2 <- glm(y ~ x1 + x2 + x1:x2, family = binomial(), data = Example11_4)
summary(fit2)
##
## Call:
## glm(formula = y ~ x1 + x2 + x1:x2, family = binomial(), data = Example11_4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4237 -0.9623 -0.8725 0.9497 1.5167
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7695 0.1524 -5.049 4.43e-07 ***
## x1 0.5107 0.2520 2.027 0.0427 *
## x2 0.2398 0.2201 1.090 0.2759
## x1:x2 0.5815 0.3148 1.847 0.0647 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1228 on 885 degrees of freedom
## Residual deviance: 1156 on 882 degrees of freedom
## AIC: 1164
##
## Number of Fisher Scoring iterations: 4
coefficients(fit2)
## (Intercept) x1 x2 x1:x2
## -0.7695202 0.5106585 0.2398261 0.5814856
exp(coefficients(fit2))
## (Intercept) x1 x2 x1:x2
## 0.4632353 1.6663882 1.2710280 1.7886937
exp(confint(fit2))
## 2.5 % 97.5 %
## (Intercept) 0.3415446 0.6214419
## x1 1.0159204 2.7326277
## x2 0.8255781 1.9585429
## x1:x2 0.9658438 3.3214114
Example11_4$x11 <- ifelse(x1 == 1 & x2 == 1, 1, 0)
Example11_4$x10 <- ifelse(x1 == 1 & x2 == 0, 1, 0)
Example11_4$x01 <- ifelse(x1 == 0 & x2 == 1, 1, 0)
Example11_4$x00 <- ifelse(x1 == 0 & x2 == 0, 1, 0)
fit3 <- glm(y ~ x11 + x10 + x01, family = binomial(), data = Example11_4)
summary(fit3)
##
## Call:
## glm(formula = y ~ x11 + x10 + x01, family = binomial(), data = Example11_4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4237 -0.9623 -0.8725 0.9497 1.5167
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7695 0.1524 -5.049 4.43e-07 ***
## x11 1.3320 0.1834 7.264 3.76e-13 ***
## x10 0.5107 0.2520 2.027 0.0427 *
## x01 0.2398 0.2201 1.090 0.2759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1228 on 885 degrees of freedom
## Residual deviance: 1156 on 882 degrees of freedom
## AIC: 1164
##
## Number of Fisher Scoring iterations: 4
coefficients(fit3)
## (Intercept) x11 x10 x01
## -0.7695202 1.3319701 0.5106585 0.2398261
exp(coefficients(fit3))
## (Intercept) x11 x10 x01
## 0.4632353 3.7884999 1.6663882 1.2710280
exp(confint(fit3))
## 2.5 % 97.5 %
## (Intercept) 0.3415446 0.6214419
## x11 2.6552919 5.4525611
## x10 1.0159204 2.7326277
## x01 0.8255781 1.9585429
detach(Example11_4)
Example11_5 <- read.table("example11_5.csv", header = TRUE, sep = ",")
head(Example11_5)
## x1 x2 x3 x4 x5 x6 x7 x8 y
## 1 3 1 0 1 0 0 1 1 0
## 2 2 0 1 1 0 0 1 0 0
## 3 2 1 0 1 0 0 1 0 0
## 4 2 0 0 1 0 0 1 0 0
## 5 3 0 0 1 0 1 1 1 0
## 6 3 0 1 1 0 0 2 1 0
attach(Example11_5)
fullfit <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(), data = Example11_5)
summary(fullfit)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(),
## data = Example11_5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3567 -0.4949 -0.1643 0.6351 2.3456
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.8896 1.9721 -2.986 0.00282 **
## x1 0.6443 0.4987 1.292 0.19642
## x2 0.9098 0.8362 1.088 0.27662
## x3 0.9698 0.9058 1.071 0.28433
## x4 0.9948 1.2095 0.823 0.41079
## x5 0.7409 0.8801 0.842 0.39988
## x6 3.4559 1.4152 2.442 0.01461 *
## x7 0.3019 0.5906 0.511 0.60917
## x8 1.9169 0.9189 2.086 0.03697 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 74.786 on 53 degrees of freedom
## Residual deviance: 42.194 on 45 degrees of freedom
## AIC: 60.194
##
## Number of Fisher Scoring iterations: 6
nothing <- glm(y ~ 1, family = binomial(), data = Example11_5)
summary(nothing)
##
## Call:
## glm(formula = y ~ 1, family = binomial(), data = Example11_5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.146 -1.146 -1.146 1.209 1.209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.07411 0.27235 -0.272 0.786
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 74.786 on 53 degrees of freedom
## Residual deviance: 74.786 on 53 degrees of freedom
## AIC: 76.786
##
## Number of Fisher Scoring iterations: 3
bothways <- step(nothing, list(lower = formula(nothing), upper = formula(fullfit)), direction = "both")
## Start: AIC=76.79
## y ~ 1
##
## Df Deviance AIC
## + x6 1 63.467 67.467
## + x5 1 67.137 71.137
## + x8 1 67.727 71.727
## + x2 1 68.705 72.705
## + x1 1 68.706 72.706
## + x7 1 69.335 73.335
## + x3 1 69.966 73.966
## + x4 1 70.272 74.272
## <none> 74.786 76.786
##
## Step: AIC=67.47
## y ~ x6
##
## Df Deviance AIC
## + x5 1 55.480 61.480
## + x8 1 56.751 62.751
## + x2 1 56.890 62.890
## + x3 1 58.265 64.265
## + x1 1 58.335 64.335
## + x7 1 58.753 64.753
## + x4 1 60.617 66.617
## <none> 63.467 67.467
## - x6 1 74.786 76.786
##
## Step: AIC=61.48
## y ~ x6 + x5
##
## Df Deviance AIC
## + x8 1 50.402 58.402
## + x2 1 52.108 60.108
## + x1 1 52.454 60.454
## <none> 55.480 61.480
## + x7 1 53.709 61.709
## + x4 1 54.124 62.124
## + x3 1 54.174 62.174
## - x5 1 63.467 67.467
## - x6 1 67.137 71.137
##
## Step: AIC=58.4
## y ~ x6 + x5 + x8
##
## Df Deviance AIC
## + x1 1 46.224 56.224
## + x2 1 46.717 56.717
## <none> 50.402 58.402
## + x3 1 48.601 58.601
## + x4 1 49.039 59.039
## + x7 1 49.702 59.702
## - x8 1 55.480 61.480
## - x5 1 56.751 62.751
## - x6 1 61.148 67.148
##
## Step: AIC=56.22
## y ~ x6 + x5 + x8 + x1
##
## Df Deviance AIC
## + x2 1 44.163 56.163
## <none> 46.224 56.224
## + x3 1 44.534 56.534
## + x4 1 45.755 57.755
## + x7 1 45.781 57.781
## - x1 1 50.402 58.402
## - x5 1 50.491 58.491
## - x8 1 52.454 60.454
## - x6 1 56.439 64.439
##
## Step: AIC=56.16
## y ~ x6 + x5 + x8 + x1 + x2
##
## Df Deviance AIC
## <none> 44.163 56.163
## - x2 1 46.224 56.224
## - x1 1 46.717 56.717
## + x3 1 42.999 56.999
## - x5 1 47.099 57.099
## + x4 1 43.817 57.817
## + x7 1 43.907 57.907
## - x8 1 50.405 60.405
## - x6 1 54.874 64.874
fit1 <- glm(y ~ x6 + x5 + x8 + x1 + x2, family = binomial(), data = Example11_5)
summary(fit1)
##
## Call:
## glm(formula = y ~ x6 + x5 + x8 + x1 + x2, family = binomial(),
## data = Example11_5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4100 -0.6282 -0.2313 0.6034 2.2802
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.7603 1.5803 -3.012 0.00259 **
## x6 3.3698 1.3311 2.532 0.01136 *
## x5 1.2928 0.7654 1.689 0.09122 .
## x8 2.0004 0.8730 2.291 0.02194 *
## x1 0.7460 0.4841 1.541 0.12335
## x2 1.1453 0.8077 1.418 0.15620
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 74.786 on 53 degrees of freedom
## Residual deviance: 44.163 on 48 degrees of freedom
## AIC: 56.163
##
## Number of Fisher Scoring iterations: 5
fit2 <- glm(y ~ x6 + x5 + x8 + x1, family = binomial(), data = Example11_5)
summary(fit2)
##
## Call:
## glm(formula = y ~ x6 + x5 + x8 + x1, family = binomial(), data = Example11_5)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5264 -0.5445 -0.2733 0.6319 2.0339
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.7050 1.5432 -3.049 0.0023 **
## x6 3.1355 1.2489 2.511 0.0121 *
## x5 1.4959 0.7439 2.011 0.0443 *
## x8 1.9471 0.8466 2.300 0.0215 *
## x1 0.9239 0.4766 1.939 0.0525 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 74.786 on 53 degrees of freedom
## Residual deviance: 46.224 on 49 degrees of freedom
## AIC: 56.224
##
## Number of Fisher Scoring iterations: 5
coefficients(fit2)
## (Intercept) x6 x5 x8 x1
## -4.7050170 3.1354895 1.4959351 1.9470861 0.9238998
exp(coefficients(fit2))
## (Intercept) x6 x5 x8 x1
## 0.009049761 22.999891063 4.463508324 7.008236568 2.519095102
exp(confint(fit2))
## 2.5 % 97.5 %
## (Intercept) 0.0002770793 0.1345707
## x6 2.9251702716 552.8160472
## x5 1.0786693022 20.8559589
## x8 1.4921555369 44.9031777
## x1 1.0375623088 7.0051815
detach(Example11_5)
# install.packages('survival')
library(survival)
Example11_6 <- read.table("example11_6.csv", header = TRUE, sep = ",")
head(Example11_6)
## id outcome exposure
## 1 1 1 1
## 2 1 0 1
## 3 2 1 1
## 4 2 0 1
## 5 3 1 1
## 6 3 0 1
attach(Example11_6)
model <- clogit(outcome ~ exposure + strata(id))
summary(model)
## Call:
## coxph(formula = Surv(rep(1, 168L), outcome) ~ exposure + strata(id),
## method = "exact")
##
## n= 168, number of events= 84
##
## coef exp(coef) se(coef) z Pr(>|z|)
## exposure 1.030 2.800 0.521 1.976 0.0481 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## exposure 2.8 0.3571 1.009 7.774
##
## Concordance= 0.554 (se = 0.036 )
## Likelihood ratio test= 4.44 on 1 df, p=0.04
## Wald test = 3.91 on 1 df, p=0.05
## Score (logrank) test = 4.26 on 1 df, p=0.04
detach(Example11_6)
library(foreign)
library(rms)
mydata <- read.spss("lweight.sav")
mydata <- as.data.frame(mydata)
head(mydata)
## id low age lwt race smoke ptl ht ui ftv bwt
## 1 85 正常体重 19 182 黑种人 不吸烟 0 无妊高症 有 0 2523
## 2 86 正常体重 33 155 其他种族 不吸烟 0 无妊高症 无 3 2551
## 3 87 正常体重 20 105 白种人 吸烟 0 无妊高症 无 1 2557
## 4 88 正常体重 21 108 白种人 吸烟 0 无妊高症 有 2 2594
## 5 89 正常体重 18 107 白种人 吸烟 0 无妊高症 有 0 2600
## 6 91 正常体重 21 124 其他种族 不吸烟 0 无妊高症 无 0 2622
mydata$low <- ifelse(mydata$low == "低出生体重", 1, 0)
mydata$race1 <- ifelse(mydata$race == "白种人", 1, 0)
mydata$race2 <- ifelse(mydata$race == "黑种人", 1, 0)
mydata$race3 <- ifelse(mydata$race == "其他种族", 1, 0)
attach(mydata)
dd <- datadist(mydata)
options(datadist = "dd")
fit1 <- lrm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race1 + race2, data = mydata, x = T, y = T)
fit1
## Logistic Regression Model
##
## lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui +
## race1 + race2, data = mydata, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 189 LR chi2 31.12 R2 0.213 C 0.738
## 0 130 d.f. 9 g 1.122 Dxy 0.476
## 1 59 Pr(> chi2) 0.0003 gr 3.070 gamma 0.477
## max |deriv| 7e-05 gp 0.207 tau-a 0.206
## Brier 0.181
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 1.1427 1.0873 1.05 0.2933
## age -0.0255 0.0366 -0.69 0.4871
## ftv 0.0321 0.1708 0.19 0.8509
## ht=妊高症 1.7631 0.6894 2.56 0.0105
## lwt -0.0137 0.0068 -2.02 0.0431
## ptl 0.5517 0.3446 1.60 0.1094
## smoke=吸烟 0.9275 0.3986 2.33 0.0200
## ui=有 0.6488 0.4676 1.39 0.1653
## race1 -0.9082 0.4367 -2.08 0.0375
## race2 0.3293 0.5339 0.62 0.5374
##
summary(fit1)
## Effects Response : low
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 19 26 7 -0.178250 0.25649 -0.68095 0.324460
## Odds Ratio 19 26 7 0.836740 NA 0.50613 1.383300
## ftv 0 1 1 0.032104 0.17077 -0.30260 0.366810
## Odds Ratio 0 1 1 1.032600 NA 0.73889 1.443100
## lwt 110 141 31 -0.424740 0.21002 -0.83637 -0.013113
## Odds Ratio 110 141 31 0.653940 NA 0.43328 0.986970
## ptl 0 3 3 1.655000 1.03390 -0.37141 3.681400
## Odds Ratio 0 3 3 5.233000 NA 0.68976 39.701000
## race1 0 1 1 -0.908230 0.43667 -1.76410 -0.052382
## Odds Ratio 0 1 1 0.403240 NA 0.17134 0.948970
## race2 0 1 1 0.329270 0.53392 -0.71718 1.375700
## Odds Ratio 0 1 1 1.390000 NA 0.48813 3.958000
## ht - 妊高症:无妊高症 1 2 NA 1.763100 0.68941 0.41191 3.114400
## Odds Ratio 1 2 NA 5.830700 NA 1.50970 22.519000
## smoke - 吸烟:不吸烟 1 2 NA 0.927480 0.39859 0.14626 1.708700
## Odds Ratio 1 2 NA 2.528100 NA 1.15750 5.521800
## ui - 有:无 1 2 NA 0.648810 0.46760 -0.26767 1.565300
## Odds Ratio 1 2 NA 1.913300 NA 0.76516 4.784100
nom1 <- nomogram(fit1, fun = plogis, fun.at = c(0.001, 0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99, 0.999),
lp = F, funlabel = "Low weight rate")
plot(nom1)
cal1 <- calibrate(fit1, method = "boot", B = 1000)
plot(cal1, xlim = c(0, 1), ylim = c(0, 1))
##
## n=189 Mean absolute error=0.038 Mean squared error=0.0019
## 0.9 Quantile of absolute error=0.056
mydata$race <- as.factor(ifelse(mydata$race == "白种人", "白种人", "黑人及其他种族"))
dd <- datadist(mydata)
options(datadist = "dd")
fit2 <- lrm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race, data = mydata, x = T, y = T)
fit2
## Logistic Regression Model
##
## lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui +
## race, data = mydata, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 189 LR chi2 30.74 R2 0.211 C 0.735
## 0 130 d.f. 8 g 1.112 Dxy 0.469
## 1 59 Pr(> chi2) 0.0002 gr 3.039 gamma 0.470
## max |deriv| 7e-05 gp 0.205 tau-a 0.203
## Brier 0.182
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 0.0921 1.1475 0.08 0.9360
## age -0.0272 0.0365 -0.74 0.4563
## ftv 0.0365 0.1692 0.22 0.8293
## ht=妊高症 1.7516 0.6885 2.54 0.0110
## lwt -0.0125 0.0065 -1.94 0.0528
## ptl 0.5479 0.3451 1.59 0.1124
## smoke=吸烟 0.9737 0.3922 2.48 0.0130
## ui=有 0.6372 0.4701 1.36 0.1753
## race=黑人及其他种族 1.0239 0.3940 2.60 0.0094
##
summary(fit2)
## Effects Response : low
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## age 19 26 7 -0.190330 0.25550 -0.69110 0.310440
## Odds Ratio 19 26 7 0.826690 NA 0.50103 1.364000
## ftv 0 1 1 0.036487 0.16923 -0.29519 0.368170
## Odds Ratio 0 1 1 1.037200 NA 0.74439 1.445100
## lwt 110 141 31 -0.387840 0.20030 -0.78042 0.004744
## Odds Ratio 110 141 31 0.678520 NA 0.45821 1.004800
## ptl 0 3 3 1.643800 1.03540 -0.38543 3.673100
## Odds Ratio 0 3 3 5.175000 NA 0.68016 39.374000
## ht - 妊高症:无妊高症 1 2 NA 1.751600 0.68854 0.40206 3.101100
## Odds Ratio 1 2 NA 5.763700 NA 1.49490 22.223000
## smoke - 吸烟:不吸烟 1 2 NA 0.973710 0.39221 0.20499 1.742400
## Odds Ratio 1 2 NA 2.647700 NA 1.22750 5.711100
## ui - 有:无 1 2 NA 0.637190 0.47011 -0.28422 1.558600
## Odds Ratio 1 2 NA 1.891200 NA 0.75260 4.752100
## race - 黑人及其他种族:白种人 1 2 NA 1.023900 0.39404 0.25154 1.796200
## Odds Ratio 1 2 NA 2.783900 NA 1.28600 6.026500
nom2 <- nomogram(fit2, fun = plogis, fun.at = c(0.001, 0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99, 0.999),
lp = F, funlabel = "Low weight rate")
plot(nom2)
cal2 <- calibrate(fit2, method = "boot", B = 1000)
plot(cal2, xlim = c(0, 1), ylim = c(0, 1))
##
## n=189 Mean absolute error=0.035 Mean squared error=0.00172
## 0.9 Quantile of absolute error=0.055
fit3 <- lrm(low ~ ht + lwt + ptl + smoke + race, data = mydata, x = T, y = T)
fit3
## Logistic Regression Model
##
## lrm(formula = low ~ ht + lwt + ptl + smoke + race, data = mydata,
## x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 189 LR chi2 28.19 R2 0.195 C 0.732
## 0 130 d.f. 5 g 1.037 Dxy 0.465
## 1 59 Pr(> chi2) <0.0001 gr 2.820 gamma 0.467
## max |deriv| 1e-05 gp 0.194 tau-a 0.201
## Brier 0.184
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -0.2744 0.8832 -0.31 0.7561
## ht=妊高症 1.6754 0.6863 2.44 0.0146
## lwt -0.0137 0.0064 -2.14 0.0322
## ptl 0.6006 0.3342 1.80 0.0723
## smoke=吸烟 0.9919 0.3869 2.56 0.0104
## race=黑人及其他种族 1.0487 0.3842 2.73 0.0063
##
summary(fit3)
## Effects Response : low
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## lwt 110 141 31 -0.42393 0.19796 -0.81191 -0.035937
## Odds Ratio 110 141 31 0.65447 NA 0.44401 0.964700
## ptl 0 3 3 1.80180 1.00260 -0.16316 3.766900
## Odds Ratio 0 3 3 6.06080 NA 0.84946 43.244000
## ht - 妊高症:无妊高症 1 2 NA 1.67540 0.68632 0.33024 3.020600
## Odds Ratio 1 2 NA 5.34090 NA 1.39130 20.503000
## smoke - 吸烟:不吸烟 1 2 NA 0.99188 0.38694 0.23349 1.750300
## Odds Ratio 1 2 NA 2.69630 NA 1.26300 5.756100
## race - 黑人及其他种族:白种人 1 2 NA 1.04870 0.38418 0.29571 1.801700
## Odds Ratio 1 2 NA 2.85390 NA 1.34410 6.059700
nom3 <- nomogram(fit3, fun = plogis, fun.at = c(0.001, 0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99, 0.999),
lp = F, funlabel = "Low weight rate")
plot(nom3)
cal3 <- calibrate(fit3, method = "boot", B = 1000)
plot(cal3, xlim = c(0, 1), ylim = c(0, 1))
##
## n=189 Mean absolute error=0.02 Mean squared error=0.00065
## 0.9 Quantile of absolute error=0.033
# C-statistics计算
library(foreign)
library(rms)
mydata <- read.spss("lweight.sav")
mydata <- as.data.frame(mydata)
head(mydata)
## id low age lwt race smoke ptl ht ui ftv bwt
## 1 85 正常体重 19 182 黑种人 不吸烟 0 无妊高症 有 0 2523
## 2 86 正常体重 33 155 其他种族 不吸烟 0 无妊高症 无 3 2551
## 3 87 正常体重 20 105 白种人 吸烟 0 无妊高症 无 1 2557
## 4 88 正常体重 21 108 白种人 吸烟 0 无妊高症 有 2 2594
## 5 89 正常体重 18 107 白种人 吸烟 0 无妊高症 有 0 2600
## 6 91 正常体重 21 124 其他种族 不吸烟 0 无妊高症 无 0 2622
NA
## [1] NA
NA
## [1] NA
NA
## [1] NA
NA
## [1] NA
attach(mydata)
dd <- datadist(mydata)
options(datadist = "dd")
fit1 <- lrm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race1 + race2, data = mydata, x = T, y = T)
fit1 #直接读取模型中Rank Discrim.参数 C
## Logistic Regression Model
##
## lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui +
## race1 + race2, data = mydata, x = T, y = T)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 189 LR chi2 31.12 R2 0.213 C 0.738
## 正常体重 130 d.f. 9 g 1.122 Dxy 0.476
## 低出生体重 59 Pr(> chi2) 0.0003 gr 3.070 gamma 0.477
## max |deriv| 7e-05 gp 0.207 tau-a 0.206
## Brier 0.181
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 1.1427 1.0873 1.05 0.2933
## age -0.0255 0.0366 -0.69 0.4871
## ftv 0.0321 0.1708 0.19 0.8509
## ht=妊高症 1.7631 0.6894 2.56 0.0105
## lwt -0.0137 0.0068 -2.02 0.0431
## ptl 0.5517 0.3446 1.60 0.1094
## smoke=吸烟 0.9275 0.3986 2.33 0.0200
## ui=有 0.6488 0.4676 1.39 0.1653
## race1 -0.9082 0.4367 -2.08 0.0375
## race2 0.3293 0.5339 0.62 0.5374
##
mydata$predvalue <- predict(fit1)
library(ROCR)
pred <- prediction(mydata$predvalue, mydata$low)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
abline(0, 1)
auc <- performance(pred, "auc")
auc #auc即是C-statistics
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.2617992
##
##
## Slot "alpha.values":
## list()
# library(Hmisc)
somers2(mydata$predvalue, mydata$low) #somers2 {Hmisc}
## Error in somers2(mydata$predvalue, mydata$low): y must be binary
library(forestplot)
rs_forest <- read.csv('rs_forest.csv',header = FALSE)
# 读入数据的时候大家一定要把header设置成FALSE,保证第一行不被当作列名称。
rs_forest
## V1 V2 V3 V4 V5 V6
## 1 Response No.of patients OR(95% CI) NA NA NA
## 2 Overall 1472 1.66(1.28,2.15) 1.66 1.28 2.15
## 3 Clinical stage NA NA NA
## 4 Advanced 951 1.67(1.23,2.26) 1.67 1.23 2.26
## 5 non-advanced 521 1.67(1.05,2.66) 1.67 1.05 2.66
## 6 ctDNA assays NA NA NA
## 7 qPCR 1280 1.73(1.30,2.29) 1.73 1.30 2.29
## 8 ARMS 192 1.58(0.50,5.04) 1.58 0.50 5.04
## 9 Ethnicity NA NA NA
## 10 Asian 326 1.93(0.84,4.43) 1.93 0.84 4.43
## 11 Caucasian 1146 1.57(1.20,2.06) 1.57 1.20 2.06
# tiff('Figure 1.tiff',height = 1600,width = 2400,res= 300)
forestplot(labeltext = as.matrix(rs_forest[,1:3]),
#设置用于文本展示的列,此处我们用数据的前三列作为文本,在图中展示
mean = rs_forest$V4, #设置均值
lower = rs_forest$V5, #设置均值的lowlimits限
upper = rs_forest$V6, #设置均值的uplimits限
is.summary = c(T,T,T,F,F,T,F,F,T,F,F),
#该参数接受一个逻辑向量,用于定义数据中的每一行是否是汇总值,若是,则在对应位置设置为TRUE,若否,则设置为FALSE;设置为TRUE的行则以粗体出现
zero = 1, #设置参照值,此处我们展示的是OR值,故参照值是1,而不是0
boxsize = 0.4, #设置点估计的方形大小
lineheight = unit(10,'mm'),#设置图形中的行距
colgap = unit(3,'mm'),#设置图形中的列间距
lwd.zero = 2,#设置参考线的粗细
lwd.ci = 1.5,#设置区间估计线的粗细
col=fpColors(box='#458B00', summary= "#8B008B",lines = 'black',zero = '#7AC5CD'),
#使用fpColors()函数定义图形元素的颜色,从左至右分别对应点估计方形,汇总值,区间估计线,参考线
xlab="The estimates",#设置x轴标签
graph.pos = 3)#设置森林图的位置,此处设置为3,则出现在第三列
使用场景 : 通过一系列连续型或类别型预测变量来预测计数型结果变量时采用泊松分布
案例:药物治疗是否能减小癫痫的发病数
# look at dataset
data(breslow.dat, package = "robust")
head(breslow.dat)
## ID Y1 Y2 Y3 Y4 Base Age Trt Ysum sumY Age10 Base4
## 1 104 5 3 3 3 11 31 placebo 14 14 3.1 2.75
## 2 106 3 5 3 3 11 30 placebo 14 14 3.0 2.75
## 3 107 2 4 0 5 6 25 placebo 11 11 2.5 1.50
## 4 114 4 4 1 4 8 36 placebo 13 13 3.6 2.00
## 5 116 7 18 9 21 66 22 placebo 55 55 2.2 16.50
## 6 118 5 2 8 7 27 29 placebo 22 22 2.9 6.75
names(breslow.dat)
## [1] "ID" "Y1" "Y2" "Y3" "Y4" "Base" "Age" "Trt" "Ysum" "sumY" "Age10" "Base4"
summary(breslow.dat[c(6, 7, 8, 10)])
## Base Age Trt sumY
## Min. : 6.00 Min. :18.00 placebo :28 Min. : 0.00
## 1st Qu.: 12.00 1st Qu.:23.00 progabide:31 1st Qu.: 11.50
## Median : 22.00 Median :28.00 Median : 16.00
## Mean : 31.22 Mean :28.34 Mean : 33.05
## 3rd Qu.: 41.00 3rd Qu.:32.00 3rd Qu.: 36.00
## Max. :151.00 Max. :42.00 Max. :302.00
ID : an integer value specifying the patient identification number.
Y1 : an integer value, the number of seizures during the first two week period.
Y2 : an integer value, the number of seizures during the second two week period.
Y3 : an integer value, the number of seizures during the third two week period.
Y4 : an integer value, the number of seizures during the fourth two week period.
Base : an integer value giving the eight-week baseline seizure count.
Age : an integer value giving the age of the parient in years.
Trt : the treatment: a factor with levels placebo and progabide.
Ysum : an integer value, the sum of Y1, Y2, Y3 and Y4.
sumY : an integer value, the sum of Y1, Y2, Y3 and Y4.
Age10 : a numeric value, Age divided by 10.
Base4 : a numeric value, Base divided by 4.
# plot distribution of post-treatment seizure counts
opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
attach(breslow.dat)
hist(sumY, breaks = 20, xlab = "Seizure Count", main = "Distribution of Seizures")
boxplot(sumY ~ Trt, xlab = "Treatment", main = "Group Comparisons")
par(opar)
# fit regression
fit <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, family = poisson())
summary(fit)
##
## Call:
## glm(formula = sumY ~ Base + Age + Trt, family = poisson(), data = breslow.dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.0569 -2.0433 -0.9397 0.7929 11.0061
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.9488259 0.1356191 14.370 < 2e-16 ***
## Base 0.0226517 0.0005093 44.476 < 2e-16 ***
## Age 0.0227401 0.0040240 5.651 1.59e-08 ***
## Trtprogabide -0.1527009 0.0478051 -3.194 0.0014 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2122.73 on 58 degrees of freedom
## Residual deviance: 559.44 on 55 degrees of freedom
## AIC: 850.71
##
## Number of Fisher Scoring iterations: 5
# interpret model parameters
coef(fit)
## (Intercept) Base Age Trtprogabide
## 1.94882593 0.02265174 0.02274013 -0.15270095
exp(coef(fit))
## (Intercept) Base Age Trtprogabide
## 7.0204403 1.0229102 1.0230007 0.8583864
# evaluate overdispersion
deviance(fit)/df.residual(fit)
## [1] 10.1717
library(qcc)
qcc.overdispersion.test(breslow.dat$sumY, type = "poisson")
##
## Overdispersion test Obs.Var/Theor.Var Statistic p-value
## poisson data 62.87013 3646.468 0
# fit model with quasipoisson
fit.od <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, family = quasipoisson())
summary(fit.od)
##
## Call:
## glm(formula = sumY ~ Base + Age + Trt, family = quasipoisson(),
## data = breslow.dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6.0569 -2.0433 -0.9397 0.7929 11.0061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.948826 0.465091 4.190 0.000102 ***
## Base 0.022652 0.001747 12.969 < 2e-16 ***
## Age 0.022740 0.013800 1.648 0.105085
## Trtprogabide -0.152701 0.163943 -0.931 0.355702
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 11.76075)
##
## Null deviance: 2122.73 on 58 degrees of freedom
## Residual deviance: 559.44 on 55 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 5