生物统计学下的机器学习（1）-阿里云开发者社区

生物统计学下的机器学习（1）

2023-05-25 165

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 生物统计学下的机器学习（1）

引言

本课程主要介绍如何处理非线性问题。通过学习这门课程，你将掌握以下技能：

掌握模拟非线性效应的方法；
实现线性和多项式分段回归；
了解多项式样条、b 样条和自然样条的区别；
用不同样条拟合广义线性模型；
使用平滑样条来逼近非线性效应；
在广义加性模型中集成平滑样条。

数据准备

这里使用 MultiKink 包中的数据集 triceps 和 SA_heart 作为例子。使用下列命令加载数据：

install.packages("MultiKink")
library(MultiKink)
data("triceps")
head(triceps)
#    age lntriceps triceps
# 1 12.05  1.223776     3.4
# 2  9.91  1.386294     4.0
# 3 10.04  1.435084     4.2
# 4 11.49  1.435084     4.2
# 5 10.12  1.481605     4.4
# 6 11.87  1.481605     4.4

这些数据来自对西非三个冈比亚村庄 892 名 50 岁以下女性的人体测量研究。包括以下 3 个变量：

age：调查对象的年龄
intriceps: 三头肌皮褶厚度的对数
triceps：三头肌皮褶厚度

SA_heart数据是南非西开普心脏病高风险地区男性的回顾性样本。每个冠心病病例大约有两个对照样本，包括以下 10 个变量：

sbp：收缩压
tobacco：累积烟草消费量（kg）
ldl：胆固醇
adiposity：一个数字向量
famhist：因子型变量指示有无心脏病家族史
typea：A型行为
obesity：一个数字向量
alcohol：当前酒精消费量
age：发病年龄
chd：1 代表冠心病，0 代表冠心病

1 多项式回归

在线性回归中添加高次多项式，就可以得多项式回归方程：

该模型能更灵活地拟合数据，可以表示因变量与一些连续预测因子之间的关联。分别将视为一个变量，这样多项式回归就简化成了线性回归，那么对于回归系数仍然可以使用最小二乘法来估计。

但是，在实践中，我们很少使用 3 次以上的多项式。如果真实值和预测值之间是高度非线性的，那么最好使用下一章讨论的方法。

练习题

Q1：拟合三次多项式模型

我们想建立数据集triceps中三头肌皮褶厚度（变量：triceps）与年龄变量：age）的关系模型。

首先加载数据:

library(MultiKink) #for the data
library(ggplot2)   #for the plots
set.seed(1974)     #fix the random generator seed 
data("triceps")

绘制 triceps 和 age 的散点图：

tri.age.plot <- ggplot(triceps, aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") + 
                 theme_minimal() 
 tri.age.plot

使用 I() 函数来添加项：

model.cubic <- lm(triceps~age + I(age^2) + I(age^3),
                  data=triceps)
summary(model.cubic)

另一种方法是使用 poly()函数，但是这个函数是通过对线性变换得到。这是为了数值稳定性，因为这三项高度相关。

model.cubic.poly <- lm(triceps~poly(age,3),
                       data=triceps)
summary(model.cubic.poly)

模型model.cubic.poly 和 model.cubic 回归系数不相同，但model.cubic.poly 只是model.cubic的重新参数化，两个模型的预测值完全相同，可以使用散点图来验证：

plot(predict(model.cubic.poly), predict(model.cubic))

我们还可以使用ggplot在初始散点图中添加三次模型的拟合曲线：

#plots
 tri.age.plot + 
   stat_smooth(method = "lm", 
               formula = y~poly(x,3,raw=T), size = 1)

在上述图形中添加线性拟合线：

tri.age.plot + 
   stat_smooth(method = "lm", 
               formula = y~poly(x,3,raw=T), size = 1) +
     stat_smooth(method = "lm", 
               formula = y~poly(x,1,raw=T), lty = 2, col = "red" , size = 1)

Q2：计算二次模型的均方根误差

我们将使用与 Q1 中相同的数据集和变量，但现在需要计算二次模型交叉验证的均方根误差，可以利用caret包实现。我们将进行10 重交叉验证，并重复 10 次:

library(caret)
set.seed(1234)
#repeated CV for the MSE
trC.lm <- trainControl(method = "repeatedcv", 
                       number = 10,         #10-fold cross-validation
                       repeats = 10)        #10 times
#function to fit a polynomial model of degree x
pol.model <- train(triceps ~ poly(age,2),
                       data = triceps, 
                       method = "lm",
                       trControl = trC.lm)
 #this is the root mean squared error
  pol.model$results[2]

计算从 1 阶到 10 阶多项式模型均方根误差

需要自定义一个函数my.pol.f：

set.seed(1001)
   #repeated CV for the MSE
  trC.lm <- trainControl(method = "repeatedcv", 
                         number = 10, 
                         repeats = 10)
  #function to fit a polynomial model of degree x
  my.pol.f <- function(x) {
    xx<-poly(triceps$age, x, raw=T)                   #design matrix with age,
                                                      #age^2, ..., age^10
    new.data  <- cbind(triceps=triceps$triceps, xx)   #dataset with the added
                                                      #poly terms
    pol.model <- train(triceps ~ .,                   #the . uses all the 
                          data = new.data,            #predictors
                          method = "lm",
                          trControl = trC.lm)
    RMSE.cv = pol.model$results[2]
  }
    #RMSE
  t(sapply(1:10, my.pol.f))                        #applies the function
                                                   #to poly degrees 1 to 10

综合题

使用 SA_heart 数据集完成下列任务：

a. 将 tobacco 作为自变量， chd 作为因变量拟合逻辑斯蒂模型，并计算其 AIC；

b. 给 a 中得到的模型绘制拟合曲线；

c. 在 a 的基础上修改为 tobacco 的立方，并计算其 AIC；

d. 给 c 中得到的模型绘制拟合曲线；

e. 使用 caret 软件包，计算模型 a 和 c 交叉验证的 ROC 值。

## 问题 e 的代码
    library(caret)
    set.seed(2001)
    SA_heart <- read.csv("https://www.dropbox.com/s/cwkw3p91zyizcqz/SA_heart.csv?dl=1")
    # caret will give an error for factors coded as 0 and 1
    # because it uses the factors names to create
    # names of internal variables. This way it is better
    #to use an outcome variable with strings as the factor names
    SA_heart$chd.f <- ifelse(SA_heart$chd ==1, 
                             "chd", 
                             "nochd")     
    #sets the control for 10-fold cross-validation, 10 times
    # the classProbs = TRUE and summaryFunction = twoClassSummary
    # store the information to compute the area under the ROC
    trC.lm <- trainControl(method = "repeatedcv", 
                           number = 10, 
                           repeats = 10,
                           classProbs = TRUE,                 #necessary for 
                           summaryFunction = twoClassSummary) #the AUC ROC
     #linear effect
     roc.l <- train(form = chd.f  ~ tobacco,
                    data = SA_heart,
                    method = "glm",
                    family = "binomial",
                    trControl = trC.lm,
                    metric = "ROC") 
     #cubic effect
     roc.c <- train(form = chd.f  ~ poly(tobacco,3),
                    data = SA_heart,
                    method = "glm",
                    family = "binomial",
                    trControl = trC.lm,
                    metric = "ROC") 
     roc.l 
     roc.c

生物统计学下的机器学习（1）

引言

数据准备

1 多项式回归

练习题

Q1：拟合三次多项式模型

Q2：计算二次模型的均方根误差

计算从 1 阶到 10 阶多项式模型均方根误差

综合题

热门文章

最新文章

相关课程

相关电子书

相关实验场景