生物统计学下的机器学习（2）-阿里云开发者社区

生物统计学下的机器学习（2）

2023-05-25 161

版权

本文内容由阿里云实名注册用户自发贡献，版权归原作者所有，阿里云开发者社区不拥有其著作权，亦不承担相应法律责任。具体规则请查看《阿里云开发者社区用户服务协议》和《阿里云开发者社区知识产权保护指引》。如果您发现本社区中有涉嫌抄袭的内容，填写侵权投诉表单进行举报，一经查实，本社区将立刻删除涉嫌侵权内容。

简介： 生物统计学下的机器学习（2）

分段回归和样条曲线

第一章介绍了如何使用多项式回归来拟合非线性数据（生物统计学下的机器学习（1））。本章介绍另外一种拟合方法，即在预定位置设置断点（节点），将数据划分为多个区间，然后对不同区间内的数据分别拟合线段。

接下来，我们可以让各个部分相互连接：

此时，回归方程变为:

其中：

上述方法可以扩展到使用多项式分段。例如，使用三次分段，模型将变成：

上面的模型在由断点限定的区间内将是一条平滑的曲线，但是线段之间的连接不会是平滑的。为了在整个拟合曲线上强制平滑，我们可以将上面的模型限制为仅包括三次分量:

这将保证拟合曲线是可微分的，在方向上没有急剧变化。这被称为三次样条。通过在第一个断点之前和最后一个断点之后使用线性拟合，实现了数据边界中样条拟合的改进。

练习题

Q1：拟合分段线性回归

以 MultiKink 包中的triceps数据集为例，分别使用节点个数为5、10、20、30和 40 的分段线性回归来模拟女性的三头肌皮褶厚度（triceps）与年龄（age）的关系。

#加载数据集
#install.packages("MultiKink")  
library(MultiKink) #for the data
library(ggplot2)   #for the plots
set.seed(1974)     #fix the random generator seed 
data("triceps")

首先，利用散点图来查看两个变量之间的关系：

tri.age.plot <- ggplot(triceps, aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") + 
                 theme_minimal() 
 tri.age.plot

根据节点个数划分区间，在定义的区间内拟合线性模型。

###Piecewise regression
pred1 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age<5,]))
pred2 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age >=5 & triceps$age<10,]))
pred3 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=10 & triceps$age<20,]))
pred4 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=20 & triceps$age<30,]))
pred5 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=30 & triceps$age<40,]))
pred6 <- predict(lm(triceps~age, 
                    data = triceps[triceps$age>=40,]))

把上述分段拟合线添加到散点图中：

tri.age.plot + 
  geom_line(data=triceps[triceps$age<5,], 
            aes(y = pred1, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age >=5 & triceps$age<10,], 
            aes(y = pred2, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=10 & triceps$age<20,], 
            aes(y = pred3, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=20 & triceps$age<30,], 
            aes(y = pred4, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=30 & triceps$age<40,], 
            aes(y = pred5, x=age), size = 1, col="blue") +
  geom_line(data=triceps[triceps$age>=40,], 
            aes(y = pred6, x=age), size = 1, col="blue")

我们可以对各个分段添加约束，使其成为一条完整的连续曲线。添加限制后的模型如下：

其中，

这表明当时，我们应该在回归模型中添加这一项。在代码中添加命令可以实现这一点。注意是一个逻辑语句，其值为和，函数I() 将会计算所有表达式的值。

pred7 <- predict(lm(triceps~ age + I((age-5)*(age>=5)) +
                                   I((age-10)*(age >= 10)) +
                                   I((age-20)*(age >= 20)) +
                                   I((age-30)*(age >= 30)) +
                                  I((age-40)*(age >= 40)),
                    data = triceps))
tri.age.plot +
  geom_line(data=triceps, 
            aes(y = pred7, x=age), size = 1, col="blue")

Q2: 拟合二次分段回归

只需要在上述代码的基础上，添加二次项：

pred.quad <- predict(lm(triceps~ age + I(age^2) + 
                    I((age-5)*(age>=5)) + I((age-5)^2*(age>=5)) +
                    I((age-10)*(age >= 10)) + I((age-10)^2*(age>=10)) +
                    I((age-20)*(age >= 20)) + I((age-20)^2*(age>=20)) +
                    I((age-30)*(age >= 30)) + I((age-30)^2*(age>=30)) +
                    I((age-40)*(age >= 40)) + I((age-40)^2*(age>=40)),
                    data = triceps))
tri.age.plot +
  geom_line(data=triceps, 
            aes(y = pred.quad, x=age), size = 1, col="blue")

Q3: 拟合自然三次样条

我们将继续使用triceps数据集来拟合一个自然三次样条，来模拟女性的三头肌皮褶厚度（triceps）与年龄（age）的关系。

splines包中的bs()函数可以为多项式样条生成 b 样条基矩阵，ns()函数可以为自然三次样条生成 b 样条基矩阵。下面对这两个函数的输出结果进行比较：

bs()函数

library(splines)
library(MultiKink) #for the data
library(ggplot2)   #for the plots
set.seed(1974)     #fix the random generator seed 
data("triceps")   #load the dataset triceps
                  #notice that the variable of interest
                  #it is also called triceps. Don't get 
                  #confused!
#linear model with the natural cubic splines function 
cub.splines.bs <- lm(triceps ~ bs(age, knots = c(5,10,20,30,40)), 
                   data=triceps)
summary(cub.splines.bs)
## 
## Call:
## lm(formula = triceps ~ bs(age, knots = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5234  -1.6912  -0.2917   1.1356  23.0922 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              6.9598     0.9729   7.154 1.77e-12 ***
## bs(age, knots = c(5, 10, 20, 30, 40))1   2.5367     1.7154   1.479   0.1396    
## bs(age, knots = c(5, 10, 20, 30, 40))2  -0.3032     0.9629  -0.315   0.7529    
## bs(age, knots = c(5, 10, 20, 30, 40))3  -1.9092     1.2993  -1.469   0.1421    
## bs(age, knots = c(5, 10, 20, 30, 40))4   7.4056     1.2179   6.081 1.78e-09 ***
## bs(age, knots = c(5, 10, 20, 30, 40))5   6.1050     1.4043   4.347 1.54e-05 ***
## bs(age, knots = c(5, 10, 20, 30, 40))6  10.1770     1.5427   6.597 7.23e-11 ***
## bs(age, knots = c(5, 10, 20, 30, 40))7   3.9428     1.9082   2.066   0.0391 *  
## bs(age, knots = c(5, 10, 20, 30, 40))8  10.1473     1.7545   5.784 1.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.743 on 883 degrees of freedom
## Multiple R-squared:  0.4261, Adjusted R-squared:  0.4209 
## F-statistic: 81.94 on 8 and 883 DF,  p-value: < 2.2e-16

ns()函数

library(splines)
library(MultiKink) #for the data
library(ggplot2)   #for the plots
set.seed(1974)     #fix the random generator seed 
data("triceps")   #load the dataset triceps
                  #notice that the variable of interest
                  #it is also called triceps. Don't get 
                  #confused!
#linear model with the natural cubic splines function 
cub.splines.bs <- lm(triceps ~ bs(age, knots = c(5,10,20,30,40)), 
                   data=triceps)
summary(cub.splines.bs)
## 
## Call:
## lm(formula = triceps ~ bs(age, knots = c(5, 10, 20, 30, 40)), 
##     data = triceps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.5234  -1.6912  -0.2917   1.1356  23.0922 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                              6.9598     0.9729   7.154 1.77e-12 ***
## bs(age, knots = c(5, 10, 20, 30, 40))1   2.5367     1.7154   1.479   0.1396    
## bs(age, knots = c(5, 10, 20, 30, 40))2  -0.3032     0.9629  -0.315   0.7529    
## bs(age, knots = c(5, 10, 20, 30, 40))3  -1.9092     1.2993  -1.469   0.1421    
## bs(age, knots = c(5, 10, 20, 30, 40))4   7.4056     1.2179   6.081 1.78e-09 ***
## bs(age, knots = c(5, 10, 20, 30, 40))5   6.1050     1.4043   4.347 1.54e-05 ***
## bs(age, knots = c(5, 10, 20, 30, 40))6  10.1770     1.5427   6.597 7.23e-11 ***
## bs(age, knots = c(5, 10, 20, 30, 40))7   3.9428     1.9082   2.066   0.0391 *  
## bs(age, knots = c(5, 10, 20, 30, 40))8  10.1473     1.7545   5.784 1.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.743 on 883 degrees of freedom
## Multiple R-squared:  0.4261, Adjusted R-squared:  0.4209 
## F-statistic: 81.94 on 8 and 883 DF,  p-value: < 2.2e-16

比较两个函数的输出结果，发现自然三次样条的回归参数较少，这是因为它在极值处限制了拟合曲线为线性。利用图形可以更直观地展示这一点：

#simple scatter
 tri.age.plot <- ggplot(triceps, aes(x=age, y=triceps)) +
                 geom_point(alpha=0.55, color="black") + 
                 theme_minimal() 
  tri.age.plot +
    stat_smooth(method = "lm", 
               formula = y~bs(x,knots = c(5,10,20,30,40)), 
               lty = 1, col = "blue") + 
    stat_smooth(method = "lm", 
               formula = y~ns(x,knots = c(5,10,20,30,40)), 
               lty = 1, col = "red")