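Dataset description
Each molecule in the dataset is assembled from three building blocks. The dataset records whether the molecule built from these blocks binds to a protein target: binds is 1 if it does and 0 otherwise.
The columns are: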
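- id - a unique example_id that we use to identify the molecule-binding target pair.
- buildingblock1_smiles - the structure of the first building block, in SMILES.
- buildingblock2_smiles - the structure of the second building block, in SMILES.
- buildingblock3_smiles - the structure of the third building block, in SMILES.
- molecule_smiles - the structure of the fully assembled molecule, in SMILES. This includes the three building blocks and the triazine core. Note that we use a [Dy] as the stand-in for the DNA linker.
- protein_name - the name of the protein target.
- binds - the target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.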
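Tool libraries
- rdkit - an open-source cheminformatics toolkit with broad support for drug design, bioactivity prediction, chemical reaction prediction, and chemical data processing. In this case study it is used mainly to compute molecular fingerprints.
- duckdb - an open-source embedded analytical database management system, designed for data analysis and online analytical processing (OLAP). In this case study it is used mainly to query columnar data.
- PySMILES - a library for working with molecules represented in SMILES format.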
Algorithm walkthrough
!pip install duckdb
!pip install pysmiles
!pip install rdkit
Data loading
# Standard library
import re
import os
import unicodedata
import itertools

# Data handling
import pandas as pd
import numpy as np

# Database
import duckdb

# Visualization
import plotly
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

# Cheminformatics
import pysmiles
from rdkit import Chem
from rdkit import RDLogger
from rdkit.Chem import Draw, AllChem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdMolDraw2D

# Notebook display: render molecules as SVG
from IPython.display import SVG
IPythonConsole.ipython_useSVG = True

# Plot style
sns.set_theme(style='whitegrid')
palette = 'viridis'

# Paths to the Parquet files
data_train = '/input/train.parquet'
test_path = '/input/test.parquet'

# Open a DuckDB connection
con = duckdb.connect()

# Draw 30,000 random rows with binds = 0 and 30,000 random rows with binds = 1,
# and return the result as a DataFrame
data = con.query(f"""(SELECT * FROM parquet_scan('{data_train}')
                      WHERE binds = 0
                      ORDER BY random()
                      LIMIT 30000)
                     UNION ALL
                     (SELECT * FROM parquet_scan('{data_train}')
                      WHERE binds = 1
                      ORDER BY random()
                      LIMIT 30000)""").df()

# Close the connection
con.close()

# Save the sample to a CSV file
data.to_csv('/working/dataset.csv')
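The query draws the same number of rows from each class, so the sample is balanced even if the full training file is not. As a rough illustration of that idea in plain Python (the helper name `balanced_sample` and the toy rows are assumptions made for this sketch, not part of the original notebook):

```python
import random

def balanced_sample(rows, label_key, per_class, seed=0):
    """Return up to `per_class` randomly chosen rows for each label value."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample

# Toy data: 95 non-binders and 5 binders (made-up ids, not from the dataset).
rows = [{"id": i, "binds": 1 if i < 5 else 0} for i in range(100)]
subset = balanced_sample(rows, "binds", per_class=5)
counts = {label: sum(r["binds"] == label for r in subset) for label in (0, 1)}
print(counts)  # {0: 5, 1: 5}
```

Unlike this sketch, the DuckDB query runs the sampling inside the database, so the full Parquet file never has to be loaded into a DataFrame.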
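Data preprocessing
Preprocessing prepares the data for analysis in a few basic steps. First, data cleaning removes duplicates and handles missing values. Next, categorical variables are converted to numeric ones with an encoding suited to the data, such as one-hot or label encoding. Numeric variables are then normalized or standardized so that they share a common scale, which matters for many machine learning algorithms. Together these steps put the data in a form the models can consume and improve the accuracy and efficiency of the later analysis.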
# Convert each SMILES string into an RDKit molecule object
data['molecule'] = data['molecule_smiles'].apply(Chem.MolFromSmiles)

# Return the Morgan (ECFP) fingerprint of a molecule as a list of bits
def modl(molecule_data, radius=2, bits=1024):
    if molecule_data is None:
        return None
    return list(AllChem.GetMorganFingerprintAsBitVect(molecule_data, radius, nBits=bits))

# Compute the fingerprint of every molecule
data['H1_ecfp'] = data['molecule'].apply(modl)
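The fingerprint function returns a fixed-length list of 0s and 1s (1024 bits by default). The sketch below shows only the general idea of folding an arbitrary set of features into such a fixed-length bit vector. It is a toy illustration: the feature strings are made up and CRC32 is an arbitrary hash chosen for this example, so it is not RDKit's Morgan algorithm.

```python
import zlib

def fold_to_bits(feature_ids, n_bits=1024):
    """Set bit (hash(feature) mod n_bits) for every feature string."""
    bits = [0] * n_bits
    for feature in feature_ids:
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits

# Hypothetical substructure labels; a real fingerprint derives its
# identifiers from the atom environments of the molecule.
features = ["C-N", "C=O", "aromatic-ring", "C-N"]
vector = fold_to_bits(features, n_bits=16)
print(len(vector), sum(vector))
```

Two different features can land on the same bit, which is why a longer vector loses less information than a shorter one.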
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the protein target name
encoder_onehot = OneHotEncoder(sparse_output=False)
encoder_onehot_fit = encoder_onehot.fit_transform(data['protein_name'].values.reshape(-1, 1))
# Concatenate each fingerprint with its protein one-hot vector into a single feature vector
X = [ecfp + protein for ecfp, protein in zip(data['H1_ecfp'].tolist(), encoder_onehot_fit.tolist())]
y = data['binds'].tolist()
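Here we separate the data into two parts: the features, built from "H1_ecfp" and the one-hot protein encoding, and the target variable "binds". Both feature blocks consist only of 0s and 1s, so every input is already on the same scale and no further normalization is applied. A common scale matters for many machine learning algorithms, particularly distance-based ones such as k-nearest neighbors (KNN) and clustering methods, because it keeps any single variable from dominating the learning process.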
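To make the feature construction concrete, here is a minimal sketch on toy data. The 4-bit "fingerprints" and the target names P1 and P2 are placeholders invented for this example, and `one_hot` is a hand-written helper that only mimics the sorted-category behaviour of a one-hot encoder.

```python
def one_hot(values):
    """Map each distinct value to a 0/1 vector, with categories in sorted order."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        vector = [0.0] * len(categories)
        vector[position[value]] = 1.0
        vectors.append(vector)
    return categories, vectors

# Placeholder data: 4-bit "fingerprints" and made-up target names.
fingerprints = [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]]
proteins = ["P2", "P1", "P2"]

categories, encoded = one_hot(proteins)
# List concatenation appends the protein columns after the fingerprint bits.
X_toy = [fp + enc for fp, enc in zip(fingerprints, encoded)]
print(categories)  # ['P1', 'P2']
print(X_toy[0])    # [1, 0, 0, 1, 0.0, 1.0]
```

Because the protein is part of the input, a single model can learn different binding patterns for different targets from the same fingerprint.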
Model training
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Progress bar
from tqdm import tqdm

# Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier, plot_importance as plot_importance_xgb
from lightgbm import LGBMClassifier, plot_importance as plot_importance_lgbm

# Metrics and model evaluation
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, accuracy_score, classification_report

# Machine learning models
models = {
    # Logistic regression
    "Logistic Regression": LogisticRegression(),
    # Naive Bayes
    "Naive bayes": GaussianNB(),
    # KNN
    "KNN": KNeighborsClassifier(),
    # AdaBoost (combines weak classifiers iteratively into a strong one)
    "Ada Boost": AdaBoostClassifier(),
    # Gradient boosting (trains decision trees iteratively to improve accuracy)
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    # Decision tree
    "Decision Tree Classifier": DecisionTreeClassifier(max_depth=5, min_samples_split=2, random_state=105),
    # XGBoost (an optimized distributed gradient boosting library).
    # num_class is omitted because binds is a binary label.
    "XGBoost": XGBClassifier(n_estimators=100, max_depth=250, learning_rate=0.1,
                             subsample=0.8, colsample_bytree=0.8,
                             random_state=42, tree_method='gpu_hist'),
    # LGBM (a distributed gradient boosting framework based on decision trees)
    "LGBM": LGBMClassifier(boosting_type='gbdt', bagging_freq=5, verbose=0, device='gpu',
                           num_leaves=31, max_depth=250, learning_rate=0.1, n_estimators=100)
}

# Train and score every model
for name, model in tqdm(models.items(), desc="Training models", total=len(models)):
    # Fit on the training split
    model.fit(X_train, y_train)
    # Score with 10-fold cross-validation on the training split
    score_training = cross_val_score(model, X_train, y_train, cv=10)
    # Predict on the held-out test set
    pred_model = model.predict(X_test)
    # Report the mean cross-validated accuracy
    tqdm.write("Model: {} has Accuracy {:.2f}%".format(model.__class__.__name__, round(score_training.mean(), 2) * 100))
    print()
Training models: 12%|█▎ | 1/8 [01:40<11:44, 100.63s/it]
Model: LogisticRegression has Accuracy 87.00%
Training models: 25%|██▌ | 2/8 [03:00<08:50, 88.36s/it]
Model: GaussianNB has Accuracy 74.00%
Training models: 38%|███▊ | 3/8 [05:12<09:02, 108.45s/it]
Model: KNeighborsClassifier has Accuracy 80.00%
Training models: 50%|█████ | 4/8 [13:33<17:33, 263.47s/it]
Model: AdaBoostClassifier has Accuracy 79.00%
Training models: 62%|██████▎ | 5/8 [43:18<40:36, 812.11s/it]
Model: GradientBoostingClassifier has Accuracy 84.00%
Training models: 75%|███████▌ | 6/8 [44:35<18:44, 562.10s/it]
Model: DecisionTreeClassifier has Accuracy 75.00%
Training models: 88%|████████▊ | 7/8 [52:01<08:44, 524.23s/it]
Model: XGBClassifier has Accuracy 91.00%
[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
1 warning generated.
Training models: 100%|██████████| 8/8 [53:43<00:00, 402.98s/it]
Model: LGBMClassifier has Accuracy 89.00%
CPU times: user 58min 38s, sys: 1min 6s, total: 59min 44s
Wall time: 53min 47s
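Each accuracy in the log is the mean of ten cross-validation scores. The sketch below shows how such a score comes about, under simplifying assumptions: it uses contiguous folds and a trivial majority-class "model", whereas scikit-learn stratifies the folds for classifiers. The helper names and toy labels are invented for this illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def majority_label(labels):
    """Stand-in 'model': always predict the most frequent training label."""
    return max(set(labels), key=labels.count)

def cross_val_accuracy(y, k):
    """Return one accuracy per fold, each measured on data unseen during fitting."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(y), k):
        guess = majority_label([y[i] for i in train_idx])
        correct = sum(1 for i in test_idx if y[i] == guess)
        scores.append(correct / len(test_idx))
    return scores

y_toy = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
scores = cross_val_accuracy(y_toy, k=5)
mean_score = sum(scores) / len(scores)
print(scores)      # [0.5, 1.0, 0.5, 0.5, 1.0]
print(mean_score)  # 0.7
```

The spread between folds is as informative as the mean: a model whose fold scores vary widely is less dependable than one with the same mean and a narrow spread.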
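Eight machine learning models were built for this project: logistic regression, naive Bayes, k-nearest neighbors (KNN), decision tree, AdaBoost, gradient boosting, XGBoost, and LightGBM. Each was trained and evaluated on the same sampled dataset to find the best performer. By mean 10-fold cross-validation accuracy, XGBoost scored highest at 91%, followed by LightGBM at 89%, logistic regression at 87%, and gradient boosting at 84%. The log reports accuracy only, so precision, recall, and F1-score still need to be checked before either leader can be called robust. Scoring over ten folds measures each model on different subsets of the data, which helps verify generalization and guards against overfitting. LightGBM comes within two points of XGBoost while finishing in under two minutes against roughly seven and a half (per the progress bar timestamps), which makes it the recommended choice for future work in this setting when training cost matters.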
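Precision, recall, and F1-score all derive from the four cells of the confusion matrix. A minimal sketch of the arithmetic, on made-up labels and predictions (the function name `binary_metrics` is an assumption for this example):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up example: 3 binders, 5 non-binders, one miss and one false alarm.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

Here accuracy is 0.75 while precision, recall, and F1 are each 2/3, which shows how accuracy alone can look better than the model's performance on the positive class.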
AUC plots
for name, model in models.items():
    # Fit the model
    model.fit(X_train, y_train)
    # Predict on the test set
    y_pred = model.predict(X_test)
    print("Machine Learning Model:", name)

    # ROC curve, using the predicted probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
    roc_auc = auc(fpr, tpr)

    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic - {}'.format(name))
    plt.legend(loc="lower right")
    plt.grid()
    plt.show()
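The area reported in each plot legend can be reproduced by hand: sort the samples by predicted score, sweep the threshold from high to low, and integrate the resulting curve with the trapezoid rule. The following is a minimal sketch of that procedure on four made-up samples. The helper names are assumptions for this example, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    previous = None
    for i in order:
        # Emit a point only when the score changes, so ties share one threshold.
        if previous is not None and scores[i] != previous:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        previous = scores[i]
    fpr.append(fp / negatives)
    tpr.append(tp / positives)
    return fpr, tpr

def trapezoid_area(x, y):
    """Integrate y over x with the trapezoid rule."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2 for i in range(1, len(x)))

y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
fpr_toy, tpr_toy = roc_points(y_true_toy, scores_toy)
area = trapezoid_area(fpr_toy, tpr_toy)
print(area)  # 0.75
```

An area of 0.5 corresponds to the dashed diagonal in the plots (random guessing), and 1.0 to a model that ranks every binder above every non-binder.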