GLM广义线性模型的R语言实现

Xianxiong Ma

2020年3月2日

使用场景 :

结果变量是类别型,二值变量和多分类变量,不满足正态分布
结果变量是计数型,并且他们的均值和方差都是相关的

解决方法 :使用广义线性模型,它包含费正太因变量的分析

glm函数介绍：

glm(formula, family=family.generator, data,control = list(...))
family：每一种响应分布（指数分布族）允许各种关联函数将均值和线性预测器关联起来。
 常用的family：

binomal(link='logit')         ---响应变量服从二项分布，连接函数为logit，即logistic回归
binomal(link='probit')       ----响应变量服从二项分布，连接函数为probit
poisson(link='identity')     ----响应变量服从泊松分布，即泊松回归

control:控制算法误差和最大迭代次数
glm.control(epsilon = 1e-8, maxit = 25, trace = FALSE)   

     -----maxit:算法最大迭代次数，改变最大迭代次数：control=list(maxit=100)

Logistic Regression (因变量为类别型)

requires packages AER, robust, qcc

Logistic回归案例1 婚外情模型

# get summary statistics
data(Affairs, package = "AER")
summary(Affairs)

##     affairs          gender         age         yearsmarried    children  religiousness  
##  Min.   : 0.000   female:315   Min.   :17.50   Min.   : 0.125   no :171   Min.   :1.000  
##  1st Qu.: 0.000   male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430   1st Qu.:2.000  
##  Median : 0.000                Median :32.00   Median : 7.000             Median :3.000  
##  Mean   : 1.456                Mean   :32.49   Mean   : 8.178             Mean   :3.116  
##  3rd Qu.: 0.000                3rd Qu.:37.00   3rd Qu.:15.000             3rd Qu.:4.000  
##  Max.   :12.000                Max.   :57.00   Max.   :15.000             Max.   :5.000  
##    education       occupation        rating     
##  Min.   : 9.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:14.00   1st Qu.:3.000   1st Qu.:3.000  
##  Median :16.00   Median :5.000   Median :4.000  
##  Mean   :16.17   Mean   :4.195   Mean   :3.932  
##  3rd Qu.:18.00   3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :20.00   Max.   :7.000   Max.   :5.000

table(Affairs$affairs)

## 
##   0   1   2   3   7  12 
## 451  34  17  19  42  38

affairs : numeric. How often engaged in extramarital sexual intercourse during the past year? 0 = none, 1 = once, 2 = twice, 3 = 3 times, 7 = 4–10 times, 12 = monthly, 12 = weekly, 12 = daily.

gender : factor indicating gender.

age : numeric variable coding age in years: 17.5 = under 20, 22 = 20–24, 27 = 25–29, 32 = 30–34, 37 = 35–39, 42 = 40–44, 47 = 45–49, 52 = 50–54, 57 = 55 or over.

yearsmarried : numeric variable coding number of years married: 0.125 = 3 months or less, 0.417 = 4–6 months, 0.75 = 6 months–1 year, 1.5 = 1–2 years, 4 = 3–5 years, 7 = 6–8 years, 10 = 9–11 years, 15 = 12 or more years.

children : factor. Are there children in the marriage?

religiousness : numeric variable coding religiousness: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very.

education : numeric variable coding level of education: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master's degree, 20 = Ph.D., M.D., or other advanced degree.

occupation : numeric variable coding occupation according to Hollingshead classification (reverse numbering).

rating : numeric variable coding self rating of marriage: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy.

# create binary outcome variable
Affairs$ynaffair[Affairs$affairs > 0] <- 1
Affairs$ynaffair[Affairs$affairs == 0] <- 0
Affairs$ynaffair <- factor(Affairs$ynaffair, levels = c(0, 1), labels = c("No", "Yes"))
table(Affairs$ynaffair)

## 
##  No Yes 
## 451 150

# fit full model
fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + 
    rating, data = Affairs, family = binomial())
summary(fit.full)

## 
## Call:
## glm(formula = ynaffair ~ gender + age + yearsmarried + children + 
##     religiousness + education + occupation + rating, family = binomial(), 
##     data = Affairs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5713  -0.7499  -0.5690  -0.2539   2.5191  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.37726    0.88776   1.551 0.120807    
## gendermale     0.28029    0.23909   1.172 0.241083    
## age           -0.04426    0.01825  -2.425 0.015301 *  
## yearsmarried   0.09477    0.03221   2.942 0.003262 ** 
## childrenyes    0.39767    0.29151   1.364 0.172508    
## religiousness -0.32472    0.08975  -3.618 0.000297 ***
## education      0.02105    0.05051   0.417 0.676851    
## occupation     0.03092    0.07178   0.431 0.666630    
## rating        -0.46845    0.09091  -5.153 2.56e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 675.38  on 600  degrees of freedom
## Residual deviance: 609.51  on 592  degrees of freedom
## AIC: 627.51
## 
## Number of Fisher Scoring iterations: 4

# fit reduced model
fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, data = Affairs, family = binomial())
summary(fit.reduced)

## 
## Call:
## glm(formula = ynaffair ~ age + yearsmarried + religiousness + 
##     rating, family = binomial(), data = Affairs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6278  -0.7550  -0.5701  -0.2624   2.3998  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.93083    0.61032   3.164 0.001558 ** 
## age           -0.03527    0.01736  -2.032 0.042127 *  
## yearsmarried   0.10062    0.02921   3.445 0.000571 ***
## religiousness -0.32902    0.08945  -3.678 0.000235 ***
## rating        -0.46136    0.08884  -5.193 2.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 675.38  on 600  degrees of freedom
## Residual deviance: 615.36  on 596  degrees of freedom
## AIC: 625.36
## 
## Number of Fisher Scoring iterations: 4

# compare models
anova(fit.reduced, fit.full, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
## Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness + 
##     education + occupation + rating
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       596     615.36                     
## 2       592     609.51  4   5.8474   0.2108

# interpret coefficients
coef(fit.reduced)

##   (Intercept)           age  yearsmarried religiousness        rating 
##    1.93083017   -0.03527112    0.10062274   -0.32902386   -0.46136144

exp(coef(fit.reduced))

##   (Intercept)           age  yearsmarried religiousness        rating 
##     6.8952321     0.9653437     1.1058594     0.7196258     0.6304248

exp(confint(fit.reduced))

##                   2.5 %     97.5 %
## (Intercept)   2.1255764 23.3506030
## age           0.9323342  0.9981470
## yearsmarried  1.0448584  1.1718250
## religiousness 0.6026782  0.8562807
## rating        0.5286586  0.7493370

# calculate probability of extramariatal affair by marital ratings
testdata <- data.frame(rating = c(1, 2, 3, 4, 5), age = mean(Affairs$age), yearsmarried = mean(Affairs$yearsmarried), 
    religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = "response")
testdata

##   rating      age yearsmarried religiousness      prob
## 1      1 32.48752     8.177696      3.116473 0.5302296
## 2      2 32.48752     8.177696      3.116473 0.4157377
## 3      3 32.48752     8.177696      3.116473 0.3096712
## 4      4 32.48752     8.177696      3.116473 0.2204547
## 5      5 32.48752     8.177696      3.116473 0.1513079

# calculate probabilites of extramariatal affair by age
testdata <- data.frame(rating = mean(Affairs$rating), age = seq(17, 57, 10), yearsmarried = mean(Affairs$yearsmarried), 
    religiousness = mean(Affairs$religiousness))
testdata$prob <- predict(fit.reduced, newdata = testdata, type = "response")
testdata

##    rating age yearsmarried religiousness      prob
## 1 3.93178  17     8.177696      3.116473 0.3350834
## 2 3.93178  27     8.177696      3.116473 0.2615373
## 3 3.93178  37     8.177696      3.116473 0.1992953
## 4 3.93178  47     8.177696      3.116473 0.1488796
## 5 3.93178  57     8.177696      3.116473 0.1094738

# evaluate overdispersion
fit <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = binomial(), data = Affairs)
fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = quasibinomial(), data = Affairs)
pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F)

## [1] 0.340122

Logistic回归案例2

Example11_4 <- read.table("example11_4.csv", header = TRUE, sep = ",")
head(Example11_4)

##   y x1 x2
## 1 1  0  0
## 2 1  0  0
## 3 1  0  0
## 4 1  0  0
## 5 1  0  0
## 6 1  0  0

attach(Example11_4)

fit1 <- glm(y ~ x1 + x2, family = binomial(), data = Example11_4)
summary(fit1)

## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial(), data = Example11_4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3965  -1.0193  -0.8225   0.9730   1.5800  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.9099     0.1358  -6.699 2.11e-11 ***
## x1            0.8856     0.1500   5.904 3.54e-09 ***
## x2            0.5261     0.1572   3.348 0.000815 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1228.0  on 885  degrees of freedom
## Residual deviance: 1159.4  on 883  degrees of freedom
## AIC: 1165.4
## 
## Number of Fisher Scoring iterations: 4

coefficients(fit1)

## (Intercept)          x1          x2 
##  -0.9099467   0.8855837   0.5261234

exp(coefficients(fit1))

## (Intercept)          x1          x2 
##   0.4025457   2.4243991   1.6923589

exp(confint(fit1))

##                 2.5 %    97.5 %
## (Intercept) 0.3071678 0.5234695
## x1          1.8090525 3.2580064
## x2          1.2441606 2.3048818

fit2 <- glm(y ~ x1 + x2 + x1:x2, family = binomial(), data = Example11_4)
summary(fit2)

## 
## Call:
## glm(formula = y ~ x1 + x2 + x1:x2, family = binomial(), data = Example11_4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4237  -0.9623  -0.8725   0.9497   1.5167  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7695     0.1524  -5.049 4.43e-07 ***
## x1            0.5107     0.2520   2.027   0.0427 *  
## x2            0.2398     0.2201   1.090   0.2759    
## x1:x2         0.5815     0.3148   1.847   0.0647 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1228  on 885  degrees of freedom
## Residual deviance: 1156  on 882  degrees of freedom
## AIC: 1164
## 
## Number of Fisher Scoring iterations: 4

coefficients(fit2)

## (Intercept)          x1          x2       x1:x2 
##  -0.7695202   0.5106585   0.2398261   0.5814856

exp(coefficients(fit2))

## (Intercept)          x1          x2       x1:x2 
##   0.4632353   1.6663882   1.2710280   1.7886937

exp(confint(fit2))

##                 2.5 %    97.5 %
## (Intercept) 0.3415446 0.6214419
## x1          1.0159204 2.7326277
## x2          0.8255781 1.9585429
## x1:x2       0.9658438 3.3214114

Example11_4$x11 <- ifelse(x1 == 1 & x2 == 1, 1, 0)
Example11_4$x10 <- ifelse(x1 == 1 & x2 == 0, 1, 0)
Example11_4$x01 <- ifelse(x1 == 0 & x2 == 1, 1, 0)
Example11_4$x00 <- ifelse(x1 == 0 & x2 == 0, 1, 0)

fit3 <- glm(y ~ x11 + x10 + x01, family = binomial(), data = Example11_4)
summary(fit3)

## 
## Call:
## glm(formula = y ~ x11 + x10 + x01, family = binomial(), data = Example11_4)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4237  -0.9623  -0.8725   0.9497   1.5167  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.7695     0.1524  -5.049 4.43e-07 ***
## x11           1.3320     0.1834   7.264 3.76e-13 ***
## x10           0.5107     0.2520   2.027   0.0427 *  
## x01           0.2398     0.2201   1.090   0.2759    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1228  on 885  degrees of freedom
## Residual deviance: 1156  on 882  degrees of freedom
## AIC: 1164
## 
## Number of Fisher Scoring iterations: 4

coefficients(fit3)

## (Intercept)         x11         x10         x01 
##  -0.7695202   1.3319701   0.5106585   0.2398261

exp(coefficients(fit3))

## (Intercept)         x11         x10         x01 
##   0.4632353   3.7884999   1.6663882   1.2710280

exp(confint(fit3))

##                 2.5 %    97.5 %
## (Intercept) 0.3415446 0.6214419
## x11         2.6552919 5.4525611
## x10         1.0159204 2.7326277
## x01         0.8255781 1.9585429

detach(Example11_4)

Logistic回归案例3

Example11_5 <- read.table("example11_5.csv", header = TRUE, sep = ",")
head(Example11_5)

##   x1 x2 x3 x4 x5 x6 x7 x8 y
## 1  3  1  0  1  0  0  1  1 0
## 2  2  0  1  1  0  0  1  0 0
## 3  2  1  0  1  0  0  1  0 0
## 4  2  0  0  1  0  0  1  0 0
## 5  3  0  0  1  0  1  1  1 0
## 6  3  0  1  1  0  0  2  1 0

attach(Example11_5)
fullfit <- glm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(), data = Example11_5)
summary(fullfit)

## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8, family = binomial(), 
##     data = Example11_5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3567  -0.4949  -0.1643   0.6351   2.3456  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -5.8896     1.9721  -2.986  0.00282 **
## x1            0.6443     0.4987   1.292  0.19642   
## x2            0.9098     0.8362   1.088  0.27662   
## x3            0.9698     0.9058   1.071  0.28433   
## x4            0.9948     1.2095   0.823  0.41079   
## x5            0.7409     0.8801   0.842  0.39988   
## x6            3.4559     1.4152   2.442  0.01461 * 
## x7            0.3019     0.5906   0.511  0.60917   
## x8            1.9169     0.9189   2.086  0.03697 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 74.786  on 53  degrees of freedom
## Residual deviance: 42.194  on 45  degrees of freedom
## AIC: 60.194
## 
## Number of Fisher Scoring iterations: 6

nothing <- glm(y ~ 1, family = binomial(), data = Example11_5)
summary(nothing)

## 
## Call:
## glm(formula = y ~ 1, family = binomial(), data = Example11_5)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.146  -1.146  -1.146   1.209   1.209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.07411    0.27235  -0.272    0.786
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 74.786  on 53  degrees of freedom
## Residual deviance: 74.786  on 53  degrees of freedom
## AIC: 76.786
## 
## Number of Fisher Scoring iterations: 3

bothways <- step(nothing, list(lower = formula(nothing), upper = formula(fullfit)), direction = "both")

## Start:  AIC=76.79
## y ~ 1
## 
##        Df Deviance    AIC
## + x6    1   63.467 67.467
## + x5    1   67.137 71.137
## + x8    1   67.727 71.727
## + x2    1   68.705 72.705
## + x1    1   68.706 72.706
## + x7    1   69.335 73.335
## + x3    1   69.966 73.966
## + x4    1   70.272 74.272
## <none>      74.786 76.786
## 
## Step:  AIC=67.47
## y ~ x6
## 
##        Df Deviance    AIC
## + x5    1   55.480 61.480
## + x8    1   56.751 62.751
## + x2    1   56.890 62.890
## + x3    1   58.265 64.265
## + x1    1   58.335 64.335
## + x7    1   58.753 64.753
## + x4    1   60.617 66.617
## <none>      63.467 67.467
## - x6    1   74.786 76.786
## 
## Step:  AIC=61.48
## y ~ x6 + x5
## 
##        Df Deviance    AIC
## + x8    1   50.402 58.402
## + x2    1   52.108 60.108
## + x1    1   52.454 60.454
## <none>      55.480 61.480
## + x7    1   53.709 61.709
## + x4    1   54.124 62.124
## + x3    1   54.174 62.174
## - x5    1   63.467 67.467
## - x6    1   67.137 71.137
## 
## Step:  AIC=58.4
## y ~ x6 + x5 + x8
## 
##        Df Deviance    AIC
## + x1    1   46.224 56.224
## + x2    1   46.717 56.717
## <none>      50.402 58.402
## + x3    1   48.601 58.601
## + x4    1   49.039 59.039
## + x7    1   49.702 59.702
## - x8    1   55.480 61.480
## - x5    1   56.751 62.751
## - x6    1   61.148 67.148
## 
## Step:  AIC=56.22
## y ~ x6 + x5 + x8 + x1
## 
##        Df Deviance    AIC
## + x2    1   44.163 56.163
## <none>      46.224 56.224
## + x3    1   44.534 56.534
## + x4    1   45.755 57.755
## + x7    1   45.781 57.781
## - x1    1   50.402 58.402
## - x5    1   50.491 58.491
## - x8    1   52.454 60.454
## - x6    1   56.439 64.439
## 
## Step:  AIC=56.16
## y ~ x6 + x5 + x8 + x1 + x2
## 
##        Df Deviance    AIC
## <none>      44.163 56.163
## - x2    1   46.224 56.224
## - x1    1   46.717 56.717
## + x3    1   42.999 56.999
## - x5    1   47.099 57.099
## + x4    1   43.817 57.817
## + x7    1   43.907 57.907
## - x8    1   50.405 60.405
## - x6    1   54.874 64.874

fit1 <- glm(y ~ x6 + x5 + x8 + x1 + x2, family = binomial(), data = Example11_5)
summary(fit1)

## 
## Call:
## glm(formula = y ~ x6 + x5 + x8 + x1 + x2, family = binomial(), 
##     data = Example11_5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4100  -0.6282  -0.2313   0.6034   2.2802  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -4.7603     1.5803  -3.012  0.00259 **
## x6            3.3698     1.3311   2.532  0.01136 * 
## x5            1.2928     0.7654   1.689  0.09122 . 
## x8            2.0004     0.8730   2.291  0.02194 * 
## x1            0.7460     0.4841   1.541  0.12335   
## x2            1.1453     0.8077   1.418  0.15620   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 74.786  on 53  degrees of freedom
## Residual deviance: 44.163  on 48  degrees of freedom
## AIC: 56.163
## 
## Number of Fisher Scoring iterations: 5

fit2 <- glm(y ~ x6 + x5 + x8 + x1, family = binomial(), data = Example11_5)
summary(fit2)

## 
## Call:
## glm(formula = y ~ x6 + x5 + x8 + x1, family = binomial(), data = Example11_5)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5264  -0.5445  -0.2733   0.6319   2.0339  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -4.7050     1.5432  -3.049   0.0023 **
## x6            3.1355     1.2489   2.511   0.0121 * 
## x5            1.4959     0.7439   2.011   0.0443 * 
## x8            1.9471     0.8466   2.300   0.0215 * 
## x1            0.9239     0.4766   1.939   0.0525 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 74.786  on 53  degrees of freedom
## Residual deviance: 46.224  on 49  degrees of freedom
## AIC: 56.224
## 
## Number of Fisher Scoring iterations: 5

coefficients(fit2)

## (Intercept)          x6          x5          x8          x1 
##  -4.7050170   3.1354895   1.4959351   1.9470861   0.9238998

exp(coefficients(fit2))

##  (Intercept)           x6           x5           x8           x1 
##  0.009049761 22.999891063  4.463508324  7.008236568  2.519095102

exp(confint(fit2))

##                    2.5 %      97.5 %
## (Intercept) 0.0002770793   0.1345707
## x6          2.9251702716 552.8160472
## x5          1.0786693022  20.8559589
## x8          1.4921555369  44.9031777
## x1          1.0375623088   7.0051815

detach(Example11_5)

条件Logistic回归案例 4

# install.packages('survival')
library(survival)
Example11_6 <- read.table("example11_6.csv", header = TRUE, sep = ",")
head(Example11_6)

##   id outcome exposure
## 1  1       1        1
## 2  1       0        1
## 3  2       1        1
## 4  2       0        1
## 5  3       1        1
## 6  3       0        1

attach(Example11_6)
model <- clogit(outcome ~ exposure + strata(id))
summary(model)

## Call:
## coxph(formula = Surv(rep(1, 168L), outcome) ~ exposure + strata(id), 
##     method = "exact")
## 
##   n= 168, number of events= 84 
## 
##           coef exp(coef) se(coef)     z Pr(>|z|)  
## exposure 1.030     2.800    0.521 1.976   0.0481 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##          exp(coef) exp(-coef) lower .95 upper .95
## exposure       2.8     0.3571     1.009     7.774
## 
## Concordance= 0.554  (se = 0.036 )
## Likelihood ratio test= 4.44  on 1 df,   p=0.04
## Wald test            = 3.91  on 1 df,   p=0.05
## Score (logrank) test = 4.26  on 1 df,   p=0.04

detach(Example11_6)

Logistic回归案例 5 低出生体重儿列线图

library(foreign)
library(rms)

mydata <- read.spss("lweight.sav")
mydata <- as.data.frame(mydata)
head(mydata)

##   id      low age lwt     race  smoke ptl       ht ui ftv  bwt
## 1 85 正常体重  19 182   黑种人 不吸烟   0 无妊高症 有   0 2523
## 2 86 正常体重  33 155 其他种族 不吸烟   0 无妊高症 无   3 2551
## 3 87 正常体重  20 105   白种人   吸烟   0 无妊高症 无   1 2557
## 4 88 正常体重  21 108   白种人   吸烟   0 无妊高症 有   2 2594
## 5 89 正常体重  18 107   白种人   吸烟   0 无妊高症 有   0 2600
## 6 91 正常体重  21 124 其他种族 不吸烟   0 无妊高症 无   0 2622

mydata$low <- ifelse(mydata$low == "低出生体重", 1, 0)
mydata$race1 <- ifelse(mydata$race == "白种人", 1, 0)
mydata$race2 <- ifelse(mydata$race == "黑种人", 1, 0)
mydata$race3 <- ifelse(mydata$race == "其他种族", 1, 0)

attach(mydata)
dd <- datadist(mydata)
options(datadist = "dd")

fit1 <- lrm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race1 + race2, data = mydata, x = T, y = T)
fit1

## Logistic Regression Model
##  
##  lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui + 
##      race1 + race2, data = mydata, x = T, y = T)
##  
##                       Model Likelihood     Discrimination    Rank Discrim.    
##                          Ratio Test           Indexes           Indexes       
##  Obs           189    LR chi2     31.12    R2       0.213    C       0.738    
##   0            130    d.f.            9    g        1.122    Dxy     0.476    
##   1             59    Pr(> chi2) 0.0003    gr       3.070    gamma   0.477    
##  max |deriv| 7e-05                         gp       0.207    tau-a   0.206    
##                                            Brier    0.181                     
##  
##             Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept   1.1427 1.0873  1.05  0.2933  
##  age        -0.0255 0.0366 -0.69  0.4871  
##  ftv         0.0321 0.1708  0.19  0.8509  
##  ht=妊高症   1.7631 0.6894  2.56  0.0105  
##  lwt        -0.0137 0.0068 -2.02  0.0431  
##  ptl         0.5517 0.3446  1.60  0.1094  
##  smoke=吸烟  0.9275 0.3986  2.33  0.0200  
##  ui=有       0.6488 0.4676  1.39  0.1653  
##  race1      -0.9082 0.4367 -2.08  0.0375  
##  race2       0.3293 0.5339  0.62  0.5374  
##

summary(fit1)

##              Effects              Response : low 
## 
##  Factor               Low High Diff. Effect    S.E.    Lower 0.95 Upper 0.95
##  age                   19  26   7    -0.178250 0.25649 -0.68095    0.324460 
##   Odds Ratio           19  26   7     0.836740      NA  0.50613    1.383300 
##  ftv                    0   1   1     0.032104 0.17077 -0.30260    0.366810 
##   Odds Ratio            0   1   1     1.032600      NA  0.73889    1.443100 
##  lwt                  110 141  31    -0.424740 0.21002 -0.83637   -0.013113 
##   Odds Ratio          110 141  31     0.653940      NA  0.43328    0.986970 
##  ptl                    0   3   3     1.655000 1.03390 -0.37141    3.681400 
##   Odds Ratio            0   3   3     5.233000      NA  0.68976   39.701000 
##  race1                  0   1   1    -0.908230 0.43667 -1.76410   -0.052382 
##   Odds Ratio            0   1   1     0.403240      NA  0.17134    0.948970 
##  race2                  0   1   1     0.329270 0.53392 -0.71718    1.375700 
##   Odds Ratio            0   1   1     1.390000      NA  0.48813    3.958000 
##  ht - 妊高症:无妊高症   1   2  NA     1.763100 0.68941  0.41191    3.114400 
##   Odds Ratio            1   2  NA     5.830700      NA  1.50970   22.519000 
##  smoke - 吸烟:不吸烟    1   2  NA     0.927480 0.39859  0.14626    1.708700 
##   Odds Ratio            1   2  NA     2.528100      NA  1.15750    5.521800 
##  ui - 有:无             1   2  NA     0.648810 0.46760 -0.26767    1.565300 
##   Odds Ratio            1   2  NA     1.913300      NA  0.76516    4.784100

nom1 <- nomogram(fit1, fun = plogis, fun.at = c(0.001, 0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99, 0.999), 
    lp = F, funlabel = "Low weight rate")
plot(nom1)

plot of chunk unnamed-chunk-20

cal1 <- calibrate(fit1, method = "boot", B = 1000)
plot(cal1, xlim = c(0, 1), ylim = c(0, 1))

plot of chunk unnamed-chunk-21

## 
## n=189   Mean absolute error=0.038   Mean squared error=0.0019
## 0.9 Quantile of absolute error=0.056

mydata$race <- as.factor(ifelse(mydata$race == "白种人", "白种人", "黑人及其他种族"))

dd <- datadist(mydata)
options(datadist = "dd")

fit2 <- lrm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race, data = mydata, x = T, y = T)
fit2

## Logistic Regression Model
##  
##  lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui + 
##      race, data = mydata, x = T, y = T)
##  
##                       Model Likelihood     Discrimination    Rank Discrim.    
##                          Ratio Test           Indexes           Indexes       
##  Obs           189    LR chi2     30.74    R2       0.211    C       0.735    
##   0            130    d.f.            8    g        1.112    Dxy     0.469    
##   1             59    Pr(> chi2) 0.0002    gr       3.039    gamma   0.470    
##  max |deriv| 7e-05                         gp       0.205    tau-a   0.203    
##                                            Brier    0.182                     
##  
##                      Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept            0.0921 1.1475  0.08  0.9360  
##  age                 -0.0272 0.0365 -0.74  0.4563  
##  ftv                  0.0365 0.1692  0.22  0.8293  
##  ht=妊高症            1.7516 0.6885  2.54  0.0110  
##  lwt                 -0.0125 0.0065 -1.94  0.0528  
##  ptl                  0.5479 0.3451  1.59  0.1124  
##  smoke=吸烟           0.9737 0.3922  2.48  0.0130  
##  ui=有                0.6372 0.4701  1.36  0.1753  
##  race=黑人及其他种族  1.0239 0.3940  2.60  0.0094  
##

summary(fit2)

##              Effects              Response : low 
## 
##  Factor                       Low High Diff. Effect    S.E.    Lower 0.95 Upper 0.95
##  age                           19  26   7    -0.190330 0.25550 -0.69110    0.310440 
##   Odds Ratio                   19  26   7     0.826690      NA  0.50103    1.364000 
##  ftv                            0   1   1     0.036487 0.16923 -0.29519    0.368170 
##   Odds Ratio                    0   1   1     1.037200      NA  0.74439    1.445100 
##  lwt                          110 141  31    -0.387840 0.20030 -0.78042    0.004744 
##   Odds Ratio                  110 141  31     0.678520      NA  0.45821    1.004800 
##  ptl                            0   3   3     1.643800 1.03540 -0.38543    3.673100 
##   Odds Ratio                    0   3   3     5.175000      NA  0.68016   39.374000 
##  ht - 妊高症:无妊高症           1   2  NA     1.751600 0.68854  0.40206    3.101100 
##   Odds Ratio                    1   2  NA     5.763700      NA  1.49490   22.223000 
##  smoke - 吸烟:不吸烟            1   2  NA     0.973710 0.39221  0.20499    1.742400 
##   Odds Ratio                    1   2  NA     2.647700      NA  1.22750    5.711100 
##  ui - 有:无                     1   2  NA     0.637190 0.47011 -0.28422    1.558600 
##   Odds Ratio                    1   2  NA     1.891200      NA  0.75260    4.752100 
##  race - 黑人及其他种族:白种人   1   2  NA     1.023900 0.39404  0.25154    1.796200 
##   Odds Ratio                    1   2  NA     2.783900      NA  1.28600    6.026500

nom2 <- nomogram(fit2, fun = plogis, fun.at = c(0.001, 0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99, 0.999), 
    lp = F, funlabel = "Low weight rate")
plot(nom2)

plot of chunk unnamed-chunk-23

cal2 <- calibrate(fit2, method = "boot", B = 1000)
plot(cal2, xlim = c(0, 1), ylim = c(0, 1))

plot of chunk unnamed-chunk-24

## 
## n=189   Mean absolute error=0.035   Mean squared error=0.00172
## 0.9 Quantile of absolute error=0.055

fit3 <- lrm(low ~ ht + lwt + ptl + smoke + race, data = mydata, x = T, y = T)
fit3

## Logistic Regression Model
##  
##  lrm(formula = low ~ ht + lwt + ptl + smoke + race, data = mydata, 
##      x = T, y = T)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           189    LR chi2      28.19    R2       0.195    C       0.732    
##   0            130    d.f.             5    g        1.037    Dxy     0.465    
##   1             59    Pr(> chi2) <0.0001    gr       2.820    gamma   0.467    
##  max |deriv| 1e-05                          gp       0.194    tau-a   0.201    
##                                             Brier    0.184                     
##  
##                      Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept           -0.2744 0.8832 -0.31  0.7561  
##  ht=妊高症            1.6754 0.6863  2.44  0.0146  
##  lwt                 -0.0137 0.0064 -2.14  0.0322  
##  ptl                  0.6006 0.3342  1.80  0.0723  
##  smoke=吸烟           0.9919 0.3869  2.56  0.0104  
##  race=黑人及其他种族  1.0487 0.3842  2.73  0.0063  
##

summary(fit3)

##              Effects              Response : low 
## 
##  Factor                       Low High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
##  lwt                          110 141  31    -0.42393 0.19796 -0.81191   -0.035937 
##   Odds Ratio                  110 141  31     0.65447      NA  0.44401    0.964700 
##  ptl                            0   3   3     1.80180 1.00260 -0.16316    3.766900 
##   Odds Ratio                    0   3   3     6.06080      NA  0.84946   43.244000 
##  ht - 妊高症:无妊高症           1   2  NA     1.67540 0.68632  0.33024    3.020600 
##   Odds Ratio                    1   2  NA     5.34090      NA  1.39130   20.503000 
##  smoke - 吸烟:不吸烟            1   2  NA     0.99188 0.38694  0.23349    1.750300 
##   Odds Ratio                    1   2  NA     2.69630      NA  1.26300    5.756100 
##  race - 黑人及其他种族:白种人   1   2  NA     1.04870 0.38418  0.29571    1.801700 
##   Odds Ratio                    1   2  NA     2.85390      NA  1.34410    6.059700

nom3 <- nomogram(fit3, fun = plogis, fun.at = c(0.001, 0.01, 0.05, seq(0.1, 0.9, by = 0.1), 0.95, 0.99, 0.999), 
    lp = F, funlabel = "Low weight rate")
plot(nom3)

plot of chunk unnamed-chunk-26

cal3 <- calibrate(fit3, method = "boot", B = 1000)
plot(cal3, xlim = c(0, 1), ylim = c(0, 1))

plot of chunk unnamed-chunk-27

## 
## n=189   Mean absolute error=0.02   Mean squared error=0.00065
## 0.9 Quantile of absolute error=0.033

# C-statistics计算
library(foreign)
library(rms)

mydata <- read.spss("lweight.sav")
mydata <- as.data.frame(mydata)
head(mydata)

##   id      low age lwt     race  smoke ptl       ht ui ftv  bwt
## 1 85 正常体重  19 182   黑种人 不吸烟   0 无妊高症 有   0 2523
## 2 86 正常体重  33 155 其他种族 不吸烟   0 无妊高症 无   3 2551
## 3 87 正常体重  20 105   白种人   吸烟   0 无妊高症 无   1 2557
## 4 88 正常体重  21 108   白种人   吸烟   0 无妊高症 有   2 2594
## 5 89 正常体重  18 107   白种人   吸烟   0 无妊高症 有   0 2600
## 6 91 正常体重  21 124 其他种族 不吸烟   0 无妊高症 无   0 2622

NA

## [1] NA

NA

## [1] NA

NA

## [1] NA

NA

## [1] NA

attach(mydata)
dd <- datadist(mydata)
options(datadist = "dd")

fit1 <- lrm(low ~ age + ftv + ht + lwt + ptl + smoke + ui + race1 + race2, data = mydata, x = T, y = T)
fit1  #直接读取模型中Rank Discrim.参数 C

## Logistic Regression Model
##  
##  lrm(formula = low ~ age + ftv + ht + lwt + ptl + smoke + ui + 
##      race1 + race2, data = mydata, x = T, y = T)
##  
##                            Model Likelihood     Discrimination    Rank Discrim.    
##                               Ratio Test           Indexes           Indexes       
##  Obs           189         LR chi2     31.12    R2       0.213    C       0.738    
##   正常体重         130     d.f.            9    g        1.122    Dxy     0.476    
##   低出生体重         59    Pr(> chi2) 0.0003    gr       3.070    gamma   0.477    
##  max |deriv| 7e-05                              gp       0.207    tau-a   0.206    
##                                                 Brier    0.181                     
##  
##             Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept   1.1427 1.0873  1.05  0.2933  
##  age        -0.0255 0.0366 -0.69  0.4871  
##  ftv         0.0321 0.1708  0.19  0.8509  
##  ht=妊高症   1.7631 0.6894  2.56  0.0105  
##  lwt        -0.0137 0.0068 -2.02  0.0431  
##  ptl         0.5517 0.3446  1.60  0.1094  
##  smoke=吸烟  0.9275 0.3986  2.33  0.0200  
##  ui=有       0.6488 0.4676  1.39  0.1653  
##  race1      -0.9082 0.4367 -2.08  0.0375  
##  race2       0.3293 0.5339  0.62  0.5374  
##

mydata$predvalue <- predict(fit1)
library(ROCR)
pred <- prediction(mydata$predvalue, mydata$low)
perf <- performance(pred, "tpr", "fpr")
plot(perf)
abline(0, 1)

plot of chunk unnamed-chunk-30

auc <- performance(pred, "auc")
auc  #auc即是C-statistics

## An object of class "performance"
## Slot "x.name":
## [1] "None"
## 
## Slot "y.name":
## [1] "Area under the ROC curve"
## 
## Slot "alpha.name":
## [1] "none"
## 
## Slot "x.values":
## list()
## 
## Slot "y.values":
## [[1]]
## [1] 0.2617992
## 
## 
## Slot "alpha.values":
## list()

# library(Hmisc)
somers2(mydata$predvalue, mydata$low)  #somers2 {Hmisc}

## Error in somers2(mydata$predvalue, mydata$low): y must be binary

Logistic回归案例 6 亚组分析森林图

library(forestplot)
rs_forest <- read.csv('rs_forest.csv',header = FALSE)
# 读入数据的时候大家一定要把header设置成FALSE，保证第一行不被当作列名称。
rs_forest

##                V1             V2              V3   V4   V5   V6
## 1        Response No.of patients      OR(95% CI)   NA   NA   NA
## 2         Overall           1472 1.66(1.28,2.15) 1.66 1.28 2.15
## 3  Clinical stage                                  NA   NA   NA
## 4        Advanced            951 1.67(1.23,2.26) 1.67 1.23 2.26
## 5    non-advanced            521 1.67(1.05,2.66) 1.67 1.05 2.66
## 6    ctDNA assays                                  NA   NA   NA
## 7            qPCR           1280 1.73(1.30,2.29) 1.73 1.30 2.29
## 8            ARMS            192 1.58(0.50,5.04) 1.58 0.50 5.04
## 9       Ethnicity                                  NA   NA   NA
## 10          Asian            326 1.93(0.84,4.43) 1.93 0.84 4.43
## 11      Caucasian           1146 1.57(1.20,2.06) 1.57 1.20 2.06

# tiff('Figure 1.tiff',height = 1600,width = 2400,res= 300)
forestplot(labeltext = as.matrix(rs_forest[,1:3]),
           #设置用于文本展示的列，此处我们用数据的前三列作为文本，在图中展示
           mean = rs_forest$V4, #设置均值
           lower = rs_forest$V5, #设置均值的lowlimits限
           upper = rs_forest$V6, #设置均值的uplimits限
           is.summary = c(T,T,T,F,F,T,F,F,T,F,F),
           #该参数接受一个逻辑向量，用于定义数据中的每一行是否是汇总值，若是，则在对应位置设置为TRUE，若否，则设置为FALSE；设置为TRUE的行则以粗体出现
           zero = 1, #设置参照值，此处我们展示的是OR值，故参照值是1，而不是0
           boxsize = 0.4, #设置点估计的方形大小
           lineheight = unit(10,'mm'),#设置图形中的行距
           colgap = unit(3,'mm'),#设置图形中的列间距
           lwd.zero = 2,#设置参考线的粗细
           lwd.ci = 1.5,#设置区间估计线的粗细
           col=fpColors(box='#458B00', summary= "#8B008B",lines = 'black',zero = '#7AC5CD'),
           #使用fpColors()函数定义图形元素的颜色，从左至右分别对应点估计方形，汇总值，区间估计线，参考线
           xlab="The estimates",#设置x轴标签
           graph.pos = 3)#设置森林图的位置，此处设置为3，则出现在第三列

plot of chunk unnamed-chunk-33

Poisson Regression

使用场景 : 通过一系列连续型或类别型预测变量来预测计数型结果变量时采用泊松分布

案例:药物治疗是否能减小癫痫的发病数

# look at dataset
data(breslow.dat, package = "robust")
head(breslow.dat)

##    ID Y1 Y2 Y3 Y4 Base Age     Trt Ysum sumY Age10 Base4
## 1 104  5  3  3  3   11  31 placebo   14   14   3.1  2.75
## 2 106  3  5  3  3   11  30 placebo   14   14   3.0  2.75
## 3 107  2  4  0  5    6  25 placebo   11   11   2.5  1.50
## 4 114  4  4  1  4    8  36 placebo   13   13   3.6  2.00
## 5 116  7 18  9 21   66  22 placebo   55   55   2.2 16.50
## 6 118  5  2  8  7   27  29 placebo   22   22   2.9  6.75

names(breslow.dat)

##  [1] "ID"    "Y1"    "Y2"    "Y3"    "Y4"    "Base"  "Age"   "Trt"   "Ysum"  "sumY"  "Age10" "Base4"

summary(breslow.dat[c(6, 7, 8, 10)])

##       Base             Age               Trt          sumY       
##  Min.   :  6.00   Min.   :18.00   placebo  :28   Min.   :  0.00  
##  1st Qu.: 12.00   1st Qu.:23.00   progabide:31   1st Qu.: 11.50  
##  Median : 22.00   Median :28.00                  Median : 16.00  
##  Mean   : 31.22   Mean   :28.34                  Mean   : 33.05  
##  3rd Qu.: 41.00   3rd Qu.:32.00                  3rd Qu.: 36.00  
##  Max.   :151.00   Max.   :42.00                  Max.   :302.00

ID : an integer value specifying the patient identification number.

Y1 : an integer value, the number of seizures during the first two week period.

Y2 : an integer value, the number of seizures during the second two week period.

Y3 : an integer value, the number of seizures during the third two week period.

Y4 : an integer value, the number of seizures during the fourth two week period.

Base : an integer value giving the eight-week baseline seizure count.

Age : an integer value giving the age of the parient in years.

Trt : the treatment: a factor with levels placebo and progabide.

Ysum : an integer value, the sum of Y1, Y2, Y3 and Y4.

sumY : an integer value, the sum of Y1, Y2, Y3 and Y4.

Age10 : a numeric value, Age divided by 10.

Base4 : a numeric value, Base divided by 4.

# plot distribution of post-treatment seizure counts
opar <- par(no.readonly = TRUE)
par(mfrow = c(1, 2))
attach(breslow.dat)
hist(sumY, breaks = 20, xlab = "Seizure Count", main = "Distribution of Seizures")
boxplot(sumY ~ Trt, xlab = "Treatment", main = "Group Comparisons")

plot of chunk unnamed-chunk-35

par(opar)

# fit regression
fit <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, family = poisson())
summary(fit)

## 
## Call:
## glm(formula = sumY ~ Base + Age + Trt, family = poisson(), data = breslow.dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.0569  -2.0433  -0.9397   0.7929  11.0061  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.9488259  0.1356191  14.370  < 2e-16 ***
## Base          0.0226517  0.0005093  44.476  < 2e-16 ***
## Age           0.0227401  0.0040240   5.651 1.59e-08 ***
## Trtprogabide -0.1527009  0.0478051  -3.194   0.0014 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2122.73  on 58  degrees of freedom
## Residual deviance:  559.44  on 55  degrees of freedom
## AIC: 850.71
## 
## Number of Fisher Scoring iterations: 5

# interpret model parameters
coef(fit)

##  (Intercept)         Base          Age Trtprogabide 
##   1.94882593   0.02265174   0.02274013  -0.15270095

exp(coef(fit))

##  (Intercept)         Base          Age Trtprogabide 
##    7.0204403    1.0229102    1.0230007    0.8583864

# evaluate overdispersion
deviance(fit)/df.residual(fit)

## [1] 10.1717

library(qcc)
qcc.overdispersion.test(breslow.dat$sumY, type = "poisson")

##                    
## Overdispersion test Obs.Var/Theor.Var Statistic p-value
##        poisson data          62.87013  3646.468       0

# fit model with quasipoisson
fit.od <- glm(sumY ~ Base + Age + Trt, data = breslow.dat, family = quasipoisson())
summary(fit.od)

## 
## Call:
## glm(formula = sumY ~ Base + Age + Trt, family = quasipoisson(), 
##     data = breslow.dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.0569  -2.0433  -0.9397   0.7929  11.0061  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.948826   0.465091   4.190 0.000102 ***
## Base          0.022652   0.001747  12.969  < 2e-16 ***
## Age           0.022740   0.013800   1.648 0.105085    
## Trtprogabide -0.152701   0.163943  -0.931 0.355702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 11.76075)
## 
##     Null deviance: 2122.73  on 58  degrees of freedom
## Residual deviance:  559.44  on 55  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 5