Comparing Multiple Machine Learning Models for Predicting DEL-Encoded Drug Candidates

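Dataset description

Each molecule in the dataset is assembled from three building blocks. The dataset records whether the molecule built from those three blocks binds to a protein: binds is 1 if it does and 0 otherwise.

The format is as follows:

  • id - A unique example_id used to identify the molecule-binding target pair.
  • buildingblock1_smiles - The structure, in SMILES, of the first building block.
  • buildingblock2_smiles - The structure, in SMILES, of the second building block.
  • buildingblock3_smiles - The structure, in SMILES, of the third building block.
  • molecule_smiles - The structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note that [Dy] is used as the stand-in for the DNA linker.
  • protein_name - The name of the protein target.
  • binds - The target column. A binary class label indicating whether the molecule binds to the protein. Not available for the test set.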
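
To make the record layout above concrete, here is a minimal sketch that models one row as a Python dataclass and checks for the [Dy] linker stand-in. The class name `DelRecord`, the helper `has_dna_linker`, and every field value are hypothetical placeholders chosen for illustration; they are not rows from the actual dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DelRecord:
    """One molecule/target pair, mirroring the columns described above."""
    id: int
    buildingblock1_smiles: str
    buildingblock2_smiles: str
    buildingblock3_smiles: str
    molecule_smiles: str
    protein_name: str
    binds: Optional[int] = None  # 1 = binds, 0 = does not bind, None for test rows


def has_dna_linker(record: DelRecord) -> bool:
    """Return True if the assembled molecule carries the [Dy] stand-in."""
    return "[Dy]" in record.molecule_smiles


# Hypothetical values for illustration only (not real dataset rows).
example = DelRecord(
    id=0,
    buildingblock1_smiles="CCO",
    buildingblock2_smiles="CCN",
    buildingblock3_smiles="c1ccccc1N",
    molecule_smiles="CCNc1nc(Nc2ccccc2)nc(N[Dy])n1",
    protein_name="target_A",
    binds=0,
)
print(has_dna_linker(example))
```

Leaving `binds` optional reflects the note above that the label is not available for the test set.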

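Libraries used

  • rdkit - An open-source cheminformatics toolkit with rich support for drug design, bioactivity prediction, chemical reaction prediction, and chemical data processing. Here it is mainly used to compute molecular fingerprints.
  • duckdb - An open-source, embedded analytical database management system designed for data analysis and online analytical processing (OLAP). Here it is mainly used to query the columnar (Parquet) data.
  • PySMILES - A library for working with molecules represented in SMILES format.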

Algorithm walkthrough

!pip install duckdb
!pip install pysmiles
!pip install rdkit

Data loading

# System libraries
import re
import os
import unicodedata
import itertools

# Data-processing libraries
import pandas as pd
import numpy as np

# Database library
import duckdb

# Visualization and cheminformatics libraries
import pysmiles
import plotly
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Draw, AllChem
from rdkit import RDLogger
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdMolDraw2D

# Render molecules as SVG in the notebook
from IPython.display import SVG
IPythonConsole.ipython_useSVG = True

# Configure the plotting style
sns.set_theme(style='whitegrid')
palette='viridis'

# Paths to the Parquet files
data_train = '/input/train.parquet'
test_path = '/input/test.parquet'

# Open an in-memory DuckDB connection
con = duckdb.connect()

# Select non-binders and binders separately, shuffle each group with random(),
# keep the first 30,000 rows of each, and return the result as a DataFrame
data = con.query(f"""(SELECT * FROM parquet_scan('{data_train}') 
WHERE binds = 0
ORDER BY random()
LIMIT 30000)
UNION ALL
(SELECT * FROM parquet_scan('{data_train}')
WHERE binds = 1
ORDER BY random() 
LIMIT 30000)""").df()

# Close the database connection
con.close()

# Save the sampled data to a CSV file
data.to_csv('/working/dataset.csv')
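
The query above takes the same number of rows from each class. A minimal pure-Python sketch of that balanced-sampling idea follows; the helper name `balanced_sample` and the toy rows are assumptions made up for illustration and are not part of the original notebook.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, per_class, seed=0):
    """Randomly pick up to `per_class` rows for each label value."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample


# Toy data, imbalanced by construction: 5 binders and 95 non-binders.
toy_rows = [{"id": i, "binds": 1 if i < 5 else 0} for i in range(100)]
subset = balanced_sample(toy_rows, "binds", per_class=5)
counts = {label: sum(r["binds"] == label for r in subset) for label in (0, 1)}
print(counts)
```

Sampling per class, rather than from the whole table at once, is what guarantees the two labels end up equally represented.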

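Data preprocessing

In the preprocessing stage we carry out several basic steps to prepare the data for analysis. First, data-cleaning techniques remove duplicates and handle missing values. Then, depending on the nature of the data, categorical variables are converted to numeric ones with a suitable encoding such as one-hot or label encoding. Numeric variables are also normalized or standardized so that they sit on the same scale, which matters for many machine learning algorithms. These steps put the data into a form the models can consume and improve the accuracy and efficiency of the analysis that follows.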
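
As a small illustration of the scaling step mentioned above, here is a sketch of z-score standardization with NumPy. The function name `standardize` and the sample numbers are hypothetical; the notebook itself works with binary features, so this only shows the general technique.

```python
import numpy as np


def standardize(features):
    """Scale each column to zero mean and unit variance (z-score)."""
    features = np.asarray(features, dtype=float)
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero for constant columns
    return (features - mean) / std


# Hypothetical numeric columns on very different scales.
raw = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 3000.0]])
scaled = standardize(raw)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a distance computation simply because of its units.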

# Convert each SMILES string into an RDKit molecule object
data['molecule'] = data['molecule_smiles'].apply(Chem.MolFromSmiles)

# Compute a Morgan fingerprint and return it as a list of bits
def modl(molecule_data, radius=2, bits=1024):
    if molecule_data is None:
        return None
    return list(AllChem.GetMorganFingerprintAsBitVect(molecule_data, radius, nBits=bits))

# Generate a fingerprint for every molecule
data['H1_ecfp'] = data['molecule'].apply(modl)

from sklearn.preprocessing import OneHotEncoder

# One-hot encode the protein target name
encoder_onehot = OneHotEncoder(sparse_output=False)
encoder_onehot_fit = encoder_onehot.fit_transform(data['protein_name'].values.reshape(-1,1))
# Concatenate the fingerprint and the protein one-hot vector into one feature vector per row
X = [ecfp + protein for ecfp, protein in zip(data['H1_ecfp'].tolist(), encoder_onehot_fit.tolist())]
y = data['binds'].tolist()

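Here we split the data into two variables: the features, built from `H1_ecfp` together with the protein encoding, and the target variable `binds`. Feature scale matters for many machine learning algorithms, especially distance-based ones such as k-nearest neighbors (KNN) and clustering methods, because a variable with a larger range can dominate the others. In this case the fingerprint bits and the one-hot protein columns are all 0/1 values, so every feature already sits on the same scale and can contribute comparably to learning. Keeping features and target separate also makes it easier to interpret how patterns in `H1_ecfp` relate to `binds`.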
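
The feature construction described above can be sketched on toy data as follows. The 8-bit fingerprints (the notebook uses 1024 bits), the target names, and the helper `one_hot` are hypothetical and only illustrate how the two parts are joined.

```python
import numpy as np


def one_hot(values):
    """One-hot encode a list of category labels (columns in sorted order)."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1
    return encoded, categories


# Hypothetical 8-bit fingerprints and target names for three rows.
fingerprints = np.array([[1, 0, 0, 1, 0, 1, 0, 0],
                         [0, 1, 1, 0, 0, 0, 1, 0],
                         [1, 1, 0, 0, 1, 0, 0, 1]])
proteins = ["target_B", "target_A", "target_B"]

protein_onehot, categories = one_hot(proteins)
# Each row becomes: fingerprint bits followed by the protein indicator columns.
X_toy = np.hstack([fingerprints, protein_onehot])
print(X_toy.shape, X_toy.min(), X_toy.max())
```

Because the same molecule can appear once per protein target, appending the target indicator lets a single model learn target-specific binding from one feature matrix.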

Model training

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Progress-bar library
from tqdm import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier, plot_importance as plot_importance_xgb
from lightgbm import LGBMClassifier, plot_importance as plot_importance_lgbm


# Metrics and model evaluation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, accuracy_score, classification_report
# Machine learning models
models = {
  # Logistic regression
  "Logistic Regression": LogisticRegression(),

  # Naive Bayes
  "Naive bayes": GaussianNB(),

  # K-nearest neighbors
  "KNN": KNeighborsClassifier(),

  # AdaBoost (builds a strong classifier by iterating over weak classifiers)
  "Ada Boost": AdaBoostClassifier(),

  # Gradient boosting (trains decision trees iteratively to improve accuracy)
  "Gradient Boosting Classifier":GradientBoostingClassifier(),

  # Decision tree
  "Decision Tree Classifier" : DecisionTreeClassifier(max_depth=5,
  min_samples_split=2,
  random_state=105),

  # XGBoost (optimized distributed gradient boosting library)
  # num_class is omitted because binds is a binary label.
  # 'gpu_hist' needs a GPU; newer XGBoost releases use tree_method='hist' with device='cuda'.
  "XGBoost": XGBClassifier(n_estimators=100,
  max_depth=250,
  learning_rate=0.1,
  subsample=0.8,
  colsample_bytree=0.8,
  random_state=42,
  tree_method='gpu_hist'),

  # LightGBM (distributed gradient boosting framework based on decision trees)
  "LGBM": LGBMClassifier(boosting_type='gbdt',
  bagging_freq=5,
  verbose=0,
  device='gpu',
  num_leaves=31,
  max_depth=250,
  learning_rate=0.1,
  n_estimators=100)
}

# Train and evaluate each model
for name, model in tqdm(models.items(), desc="Training models", total=len(models)):
  # Fit the model on the training split
  model.fit(X_train, y_train)

  # Estimate accuracy with 10-fold cross-validation on the training split
  score_training = cross_val_score(model, X_train, y_train, cv=10)

  # Predict on the held-out test split
  pred_mode = model.predict(X_test)

  # Report progress and the cross-validated accuracy
  tqdm.write("Model: {} has Accuracy {:.2f}%".format(model.__class__.__name__,round(score_training.mean(), 2) * 100))

  print()
Training models:  12%|█▎        | 1/8 [01:40<11:44, 100.63s/it]
Model: LogisticRegression has Accuracy 87.00%

Training models:  25%|██▌       | 2/8 [03:00<08:50, 88.36s/it] 
Model: GaussianNB has Accuracy 74.00%

Training models:  38%|███▊      | 3/8 [05:12<09:02, 108.45s/it]
Model: KNeighborsClassifier has Accuracy 80.00%

Training models:  50%|█████     | 4/8 [13:33<17:33, 263.47s/it]
Model: AdaBoostClassifier has Accuracy 79.00%

Training models:  62%|██████▎   | 5/8 [43:18<40:36, 812.11s/it]
Model: GradientBoostingClassifier has Accuracy 84.00%

Training models:  75%|███████▌  | 6/8 [44:35<18:44, 562.10s/it]
Model: DecisionTreeClassifier has Accuracy 75.00%

Training models:  88%|████████▊ | 7/8 [52:01<08:44, 524.23s/it]
Model: XGBClassifier has Accuracy 91.00%

[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training models: 100%|██████████| 8/8 [53:43<00:00, 402.98s/it]
Model: LGBMClassifier has Accuracy 89.00%

CPU times: user 58min 38s, sys: 1min 6s, total: 59min 44s
Wall time: 53min 47s
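
The accuracies in the log come from 10-fold cross-validation. As a rough, dependency-free sketch of what k-fold scoring does, the code below splits toy labels into contiguous folds and scores a trivial majority-class "model". All names and labels here are hypothetical. Note that for classifiers scikit-learn's `cross_val_score` uses stratified folds by default when `cv` is an integer, which this simplified version does not do.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def majority_class_predict(y_train, n_predictions):
    """Toy 'model': always predict the most common training label."""
    most_common = Counter(y_train).most_common(1)[0][0]
    return [most_common] * n_predictions


def cross_val_accuracy(y, k):
    """Mean validation accuracy of the toy model over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        y_train = [y[i] for i in train_idx]
        y_val = [y[i] for i in val_idx]
        preds = majority_class_predict(y_train, len(y_val))
        correct = sum(p == t for p, t in zip(preds, y_val))
        scores.append(correct / len(y_val))
    return sum(scores) / len(scores)


labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]  # hypothetical labels
folds = list(k_fold_indices(len(labels), k=5))
print(len(folds), cross_val_accuracy(labels, k=5))
```

Every sample is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out estimate would.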

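Eight machine learning models were built for this project: logistic regression, naive Bayes, k-nearest neighbors (KNN), decision tree, AdaBoost, gradient boosting, XGBoost, and LightGBM. Each was trained on the same data and scored with 10-fold cross-validation to find the best performer. According to the log above, XGBoost was the most accurate at 91%, followed by LightGBM at 89%, logistic regression at 87%, and gradient boosting at 84%. Because cross-validation scores each model on ten different held-out subsets, these figures also give some check on generalization and overfitting. Accuracy is the only metric printed here; precision, recall, and F1-score would need to be computed (for example with the already imported classification_report) before drawing conclusions about robustness. On accuracy alone, XGBoost is the strongest candidate in this run, with LightGBM a close second that finished its step considerably faster (about 1 min 42 s versus about 7 min 26 s for XGBoost in the progress log), which makes it a reasonable choice when training time matters.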
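
Since the discussion above refers to precision, recall, and F1-score, here is a minimal sketch of how those metrics are derived from predictions for a binary label. The function name `binary_metrics` and the example labels are hypothetical; they are not results from the models in this article.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical true labels and predictions.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 0, 0, 1]
metrics_toy = binary_metrics(y_true_toy, y_pred_toy)
print(metrics_toy)
```

Precision and recall separate the two kinds of error (false positives and false negatives) that a single accuracy figure blends together.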

AUC plots

for name, model in models.items():
  # Fit the model
  model.fit(X_train, y_train)

  # Predict on the test split
  y_pred = model.predict(X_test)
  print("Machine Learning Model:", name)

  # ROC curve, using the predicted probability of the positive class
  fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:,1])
  roc_auc = auc(fpr, tpr)

  plt.figure()
  plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
  plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.05])
  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.title('Receiver Operating Characteristic - {}'.format(name))
  plt.legend(loc="lower right")
  plt.grid()
  plt.show()
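
To show what the ROC curve and its area measure, the sketch below computes both by hand with NumPy on four toy predictions. The function names and scores are hypothetical, and the code assumes all scores are distinct (tied scores would need to be grouped, as library implementations do).

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) arrays from sweeping a threshold over distinct scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    y_sorted = y_true[order]
    tps = np.cumsum(y_sorted == 1)  # true positives as the threshold drops
    fps = np.cumsum(y_sorted == 0)  # false positives as the threshold drops
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# Hypothetical labels and predicted probabilities of the positive class.
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
fpr_toy, tpr_toy = roc_points(y_true_toy, scores_toy)
roc_auc_toy = trapezoid_auc(fpr_toy, tpr_toy)
print(roc_auc_toy)
```

An area of 0.5 corresponds to the dashed diagonal (random guessing) and 1.0 to a perfect ranking of binders above non-binders.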