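I. Body measurement data for men
1. Background
The dataset contains body fat percentage estimates for 252 men, determined by underwater weighing and a range of body circumference measurements.
2. Data description
Density is determined by underwater weighing; body fat percentage is derived from it using Siri's (1956) equation.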
Column | Description (unit) |
Age | Age (years) |
Weight | Weight (lbs) |
Height | Height (inches) |
Neck | Neck circumference (cm) |
Chest | Chest circumference (cm) |
Abdomen | Abdomen 2 circumference (cm) |
Hip | Hip circumference (cm) |
Thigh | Thigh circumference (cm) |
Knee | Knee circumference (cm) |
Ankle | Ankle circumference (cm) |
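Siri's (1956) equation, referenced in the data description, is commonly written as body fat % = 495 / density - 450, with density in g/cm³. The following is a minimal sketch that assumes this form of the equation; the function name `siri_body_fat` is illustrative and does not come from the dataset.

```python
def siri_body_fat(density: float) -> float:
    """Estimate body fat percentage from body density (g/cm^3).

    Assumes the commonly quoted form of Siri's equation: 495 / density - 450.
    """
    if density <= 0:
        raise ValueError("density must be positive")
    return 495.0 / density - 450.0


# Densities of the first two records shown in the data-loading step.
for density in (1.0708, 1.0853):
    print(f"density={density:.4f} -> body fat of about {siri_body_fat(density):.1f}%")
```

Under that assumption, the results should land close to the BodyFat values listed for the same records (12.3 and 6.1).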
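II. Health background
1. What is body fat percentage?
Body fat percentage (also written as percent body fat) is the share of total body weight that is fat, and it indicates how much fat the body carries. It can be measured with a body fat scale, by bioelectrical impedance analysis, or with the DIC method. In healthy men, fat typically accounts for 10%-20% of body weight; in women, 15%-25%. A body fat percentage above 25% in men, or above 30% in women, can be considered obese.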
2. Ideal body fat percentage
Group | Ideal body fat percentage |
Men | 6%-14% |
Women | 15%-25% |
Female athletes | 5%-12% |
Male athletes | 4%-8% |
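3. Calculating body fat percentage
Body fat percentage can be estimated with the following two formulas:
- Body mass index (BMI): BMI = weight (kg) / height (m)²
- Body fat percentage: 1.2 × BMI + 0.23 × age - 5.4 - 10.8 × sex (male = 1, female = 0)
Body fat percentage can also be measured directly. The most common methods are skinfold measurement and bioelectrical impedance analysis. In skinfold measurement, the examiner uses calipers to measure skinfold thickness at several specific sites on the body and compares the results with a standard body fat chart. Bioelectrical impedance analysis is the method used by the body fat scales currently on the market: the subject stands on the scale as instructed and reads off the result. Both methods may be off by up to about 8%. Other options include X-ray, CT, and MRI.
The information above comes from 一言好帮手.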
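The two formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names `bmi` and `estimate_body_fat` and the sample person (70 kg, 1.75 m, 30 years old) are made up for the example.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2


def estimate_body_fat(bmi_value: float, age_years: float, is_male: bool) -> float:
    """Body fat % = 1.2 * BMI + 0.23 * age - 5.4 - 10.8 * sex (male = 1, female = 0)."""
    sex = 1 if is_male else 0
    return 1.2 * bmi_value + 0.23 * age_years - 5.4 - 10.8 * sex


# Hypothetical person used only to exercise the formulas.
example_bmi = bmi(weight_kg=70.0, height_m=1.75)
example_fat = estimate_body_fat(example_bmi, age_years=30, is_male=True)
print(f"BMI: {example_bmi:.2f}, estimated body fat: {example_fat:.1f}%")
```

Note that height must be in meters here, so a height stored in centimeters has to be divided by 100 first.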
III. Data analysis
1. Loading the data

    import pandas as pd

    # Load the body fat dataset and preview the first five rows
    df = pd.read_csv('data/data223180/bodyfat.csv')
    df.head()

Index | Density | BodyFat | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist |
0 | 1.0708 | 12.3 | 23 | 154.25 | 67.75 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
1 | 1.0853 | 6.1 | 22 | 173.25 | 72.25 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
2 | 1.0414 | 25.3 | 22 | 154.00 | 66.25 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
3 | 1.0751 | 10.4 | 26 | 184.75 | 72.25 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
4 | 1.0340 | 28.7 | 24 | 184.25 | 71.25 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
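2. Counting missing values with isnull().sum()
isnull().sum() gives a direct overview: it reports the number of missing values in each column.

    df.isnull().sum()

    Density    0
    BodyFat    0
    Age        0
    Weight     0
    Height     0
    Neck       0
    Chest      0
    Abdomen    0
    Hip        0
    Thigh      0
    Knee       0
    Ankle      0
    Biceps     0
    Forearm    0
    Wrist      0
    dtype: int64

The data looks clean: no column has missing values.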
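One caveat is that isnull() only flags NaN/None entries, so a placeholder such as 0 in a column that should never be zero is not counted. The sketch below illustrates this on a small made-up frame (`toy` and its values are hypothetical, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame: one genuinely missing value and one suspicious zero in BodyFat.
toy = pd.DataFrame({
    "BodyFat": [12.3, 0.0, np.nan],
    "Age": [23, 40, 35],
})

missing_per_column = toy.isnull().sum()  # counts NaN/None only
zeros_per_column = (toy == 0).sum()      # counts exact zeros

print(missing_per_column)
print(zeros_per_column)
```

Running both checks side by side makes it easier to spot values that are present but implausible.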
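3. Unit conversion
- Weight is in pounds and height is in inches, while, according to the description, the other measurements are in cm, so the units are converted first
- 1 lb = 0.45359237 kg
- 1 inch = 2.54 cm

    # Convert pounds to kilograms and inches to centimeters
    df['Weight'] = df['Weight'] * 0.45359237
    df['Height'] = df['Height'] * 2.54
    df.head()
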
Index | Density | BodyFat | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist |
0 | 1.0708 | 12.3 | 23 | 69.966623 | 172.085 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
1 | 1.0853 | 6.1 | 22 | 78.584878 | 183.515 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
2 | 1.0414 | 25.3 | 22 | 69.853225 | 168.275 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
3 | 1.0751 | 10.4 | 26 | 83.801190 | 183.515 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
4 | 1.0340 | 28.7 | 24 | 83.574394 | 180.975 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
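Because the conversion overwrites Weight and Height in place, running that cell a second time would multiply the columns again. One way around this, sketched here under the assumption that separate columns are acceptable, is to write the metric values to new columns. The names `add_metric_columns`, `Weight_kg`, and `Height_cm` are hypothetical and not part of the dataset.

```python
import pandas as pd

LB_TO_KG = 0.45359237
INCH_TO_CM = 2.54


def add_metric_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Weight_kg and Height_cm added; the originals stay untouched."""
    out = frame.copy()
    out["Weight_kg"] = out["Weight"] * LB_TO_KG
    out["Height_cm"] = out["Height"] * INCH_TO_CM
    return out


# First two records of the dataset, in the original units (lbs, inches).
raw = pd.DataFrame({"Weight": [154.25, 173.25], "Height": [67.75, 72.25]})

# Calling the helper twice is harmless because it always reads the original columns.
converted = add_metric_columns(add_metric_columns(raw))
print(converted)
```

Keeping the raw columns also makes it easy to double-check the conversion later.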
4. Age analysis

    print('Maximum age: {} years; minimum age: {} years.'.format(df['Age'].max(), df['Age'].min()))

    Maximum age: 81 years; minimum age: 22 years.

    from matplotlib import pyplot as plt
    %matplotlib inline
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 3), dpi=120)
    plt.hist(x=df.Age,           # data to plot
             bins=15,            # number of bins
             color='skyblue',    # fill color
             edgecolor='black')  # edge color
    # Axis labels
    plt.xlabel('age')
    plt.ylabel('num')
    # Title
    plt.title('age distribution')

[Figure: histogram titled 'age distribution' (x axis: age, y axis: num)]
5. Body fat distribution

    print('Maximum body fat: {}%; minimum body fat: {}%.'.format(df['BodyFat'].max(), df['BodyFat'].min()))

    Maximum body fat: 47.5%; minimum body fat: 0.0%.

    from matplotlib import pyplot as plt
    %matplotlib inline
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 3), dpi=120)
    plt.hist(x=df.BodyFat,       # data to plot
             bins=15,            # number of bins
             color='skyblue',    # fill color
             edgecolor='black')  # edge color
    # Axis labels
    plt.xlabel('BodyFat')
    plt.ylabel('num')
    # Title
    plt.title('BodyFat distribution')

[Figure: histogram titled 'BodyFat distribution' (x axis: BodyFat, y axis: num)]
BodyFat contains a value of 0, which cannot be a real measurement, so that record is located and removed first.

    df[df['BodyFat'] == 0.0]

Index | Density | BodyFat | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist |
181 | 1.1089 | 0.0 | 40 | 53.750696 | 172.72 | 33.8 | 79.3 | 69.4 | 85.0 | 47.2 | 33.5 | 20.2 | 27.7 | 24.6 | 16.5 |

    # Drop every record whose BodyFat is 0 (here: the single row with index 181)
    df.drop(df[df['BodyFat'] == 0.0].index, inplace=True)

    from matplotlib import pyplot as plt
    %matplotlib inline
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 3), dpi=120)
    plt.hist(x=df.BodyFat,       # data to plot
             bins=15,            # number of bins
             color='skyblue',    # fill color
             edgecolor='black')  # edge color
    # Axis labels
    plt.xlabel('BodyFat')
    plt.ylabel('num')
    # Title
    plt.title('BodyFat distribution')

[Figure: histogram titled 'BodyFat distribution' after removing the zero record (x axis: BodyFat, y axis: num)]
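6. Correlation analysis

    # Compute the correlation matrix and take absolute values
    cor = df.corr()
    cor = abs(cor)
    print(cor)
    plt.figure(figsize=(20, 20))
    # Draw the heatmap (seaborn draws onto the current matplotlib figure);
    # absolute correlations lie in [0, 1], so the color scale is capped at 1
    sns.heatmap(cor, vmin=0, vmax=1)
    plt.title("heatmap")
    plt.xlabel("x_ticks")
    plt.ylabel("y_ticks")
    plt.show()
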
              Density   BodyFat       Age    Weight    Height      Neck     Chest  \
    Density  1.000000  0.988010  0.277771  0.585170  0.106209  0.463700  0.673337
    BodyFat  0.988010  1.000000  0.291061  0.605048  0.095941  0.482739  0.695607
    Age      0.277771  0.291061  1.000000  0.016075  0.172726  0.111530  0.174837
    Weight   0.585170  0.605048  0.016075  1.000000  0.306263  0.828468  0.892382
    Height   0.106209  0.095941  0.172726  0.306263  1.000000  0.251335  0.130705
    Neck     0.463700  0.482739  0.111530  0.828468  0.251335  1.000000  0.782092
    Chest    0.673337  0.695607  0.174837  0.892382  0.130705  0.782092  1.000000
    Abdomen  0.794712  0.809725  0.229272  0.885997  0.083606  0.750657  0.914255
    Hip      0.600660  0.617980  0.054044  0.939856  0.167079  0.731290  0.826102
    Thigh    0.541394  0.550033  0.206001  0.866264  0.144651  0.691253  0.723367
    Knee     0.483291  0.499019  0.014376  0.850579  0.283834  0.667770  0.713609
    Ankle    0.251126  0.254455  0.108383  0.608315  0.262449  0.471652  0.474481
    Biceps   0.479908  0.486752  0.043723  0.798382  0.205347  0.728375  0.725254
    Forearm  0.337139  0.349466  0.088923  0.624087  0.225919  0.618471  0.571720
    Wrist    0.312004  0.335434  0.212183  0.725655  0.320142  0.741548  0.654275

              Abdomen       Hip     Thigh      Knee     Ankle    Biceps   Forearm  \
    Density  0.794712  0.600660  0.541394  0.483291  0.251126  0.479908  0.337139
    BodyFat  0.809725  0.617980  0.550033  0.499019  0.254455  0.486752  0.349466
    Age      0.229272  0.054044  0.206001  0.014376  0.108383  0.043723  0.088923
    Weight   0.885997  0.939856  0.866264  0.850579  0.608315  0.798382  0.624087
    Height   0.083606  0.167079  0.144651  0.283834  0.262449  0.205347  0.225919
    Neck     0.750657  0.731290  0.691253  0.667770  0.471652  0.728375  0.618471
    Chest    0.914255  0.826102  0.723367  0.713609  0.474481  0.725254  0.571720
    Abdomen  1.000000  0.871783  0.761918  0.732328  0.445234  0.681395  0.494604
    Hip      0.871783  1.000000  0.894477  0.820318  0.552157  0.736432  0.537275
    Thigh    0.761918  0.894477  1.000000  0.795225  0.532763  0.759100  0.558701
    Knee     0.732328  0.820318  0.795225  1.000000  0.606109  0.675043  0.548239
    Ankle    0.445234  0.552157  0.532763  0.606109  1.000000  0.479496  0.411011
    Biceps   0.681395  0.736432  0.759100  0.675043  0.479496  1.000000  0.674631
    Forearm  0.494604  0.537275  0.558701  0.548239  0.411011  0.674631  1.000000
    Wrist    0.613794  0.624363  0.551202  0.659266  0.560641  0.628100  0.579348

                Wrist
    Density  0.312004
    BodyFat  0.335434
    Age      0.212183
    Weight   0.725655
    Height   0.320142
    Neck     0.741548
    Chest    0.654275
    Abdomen  0.613794
    Hip      0.624363
    Thigh    0.551202
    Knee     0.659266
    Ankle    0.560641
    Biceps   0.628100
    Forearm  0.579348
    Wrist    1.000000

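
    print(cor['BodyFat'])

    Density    0.988010
    BodyFat    1.000000
    Age        0.291061
    Weight     0.605048
    Height     0.095941
    Neck       0.482739
    Chest      0.695607
    Abdomen    0.809725
    Hip        0.617980
    Thigh      0.550033
    Knee       0.499019
    Ankle      0.254455
    Biceps     0.486752
    Forearm    0.349466
    Wrist      0.335434
    Name: BodyFat, dtype: float64

These are absolute correlations with BodyFat. Density (0.99) and Abdomen (0.81) are the strongest, weight-related measurements follow, and Height is the weakest, which is broadly in line with what the BMI-based formula suggests.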
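Taking abs() hides the direction of each relationship. As a small sketch, the signed correlations can be kept and merely ranked by magnitude. The frame below reuses only three columns from the five preview rows shown in the data-loading step, so the numbers are illustrative and will not match the full-dataset output above.

```python
import pandas as pd

# Three columns from the five preview rows (Density, BodyFat, Abdomen).
preview = pd.DataFrame({
    "Density": [1.0708, 1.0853, 1.0414, 1.0751, 1.0340],
    "BodyFat": [12.3, 6.1, 25.3, 10.4, 28.7],
    "Abdomen": [85.2, 83.0, 87.9, 86.4, 100.0],
})

# Signed correlation of every other column with BodyFat.
signed = preview.corr()["BodyFat"].drop("BodyFat")

# Rank by magnitude while keeping the sign visible.
ranked = signed.reindex(signed.abs().sort_values(ascending=False).index)
print(ranked)
```

In this small sample Density comes out negatively correlated with BodyFat, while Abdomen is positively correlated.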
IV. Model fitting
1. Splitting the dataset

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # Use Density as the only feature and BodyFat as the target
    X_data = pd.DataFrame(df['Density'])
    y_data = df['BodyFat']
    # Hold out 20% of the rows as a test set
    X_train, x_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=2022)

2. Linear regression model

    from sklearn.metrics import mean_squared_error, r2_score

    lr_model = LinearRegression()
    lr_model.fit(X_train, y_train)
    y_predictions = lr_model.predict(x_test)
    print(r2_score(y_test, y_predictions))  # R² on the test set
    print(lr_model.coef_)                   # slope
    print(lr_model.intercept_)              # intercept

    0.9754079369106297
    [-434.10203913]
    477.38901572061883

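Since BodyFat in this dataset is itself derived from Density (see the data description), the high R² largely reflects a straight line approximating that conversion. A rough check, assuming the commonly quoted form of Siri's equation (495 / density - 450) and reusing the slope and intercept printed above; the function names and the chosen densities are illustrative:

```python
# Slope and intercept copied from the fitted model's printed output.
COEF = -434.10203913
INTERCEPT = 477.38901572061883


def linear_fit(density: float) -> float:
    """Body fat % predicted by the fitted straight line."""
    return COEF * density + INTERCEPT


def siri(density: float) -> float:
    """Body fat % from the assumed form of Siri's equation."""
    return 495.0 / density - 450.0


# A few densities spanning the range seen in the preview rows.
densities = [1.03, 1.05, 1.07, 1.09]
gaps = [abs(linear_fit(d) - siri(d)) for d in densities]

for d, gap in zip(densities, gaps):
    print(f"density={d:.2f}: line={linear_fit(d):.2f}, siri={siri(d):.2f}, gap={gap:.2f}")
```

If the gaps stay small across this range, the model is essentially rediscovering the conversion formula rather than learning something new from the measurements.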
3. Visualizing the fit

    fig, ax = plt.subplots(figsize=(4, 3), dpi=150)
    # Test-set points (ground truth)
    plt.scatter(x_test, y_test, color='skyblue', label='GT')
    # Axis labels
    plt.ylabel('Bodyfat')
    plt.xlabel('Density')
    # Fitted line
    plt.plot(x_test, y_predictions, color='black', label='predict')
    # Legend
    plt.legend(loc='best')
