
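Introduction: Each molecule in the dataset is assembled from three building blocks. The dataset records whether the molecule built from those three blocks binds to a protein target: binds is 1 if it binds and 0 otherwise. The columns are:

  • id - a unique example_id that identifies the molecule/binding-target pair.
  • buildingblock1_smiles - the structure of the first building block, in SMILES.
  • buildingblock2_smiles - the structure of the second building block, in SMILES.
  • buildingblock3_smiles - the structure of the third building block, in SMILES.
  • molecule_smiles - the structure of the fully assembled molecule, in SMILES. It includes the three building blocks and the triazine core. Note that [Dy] is used as a stand-in for the DNA linker.
  • protein_name - the name of the protein target.
  • binds - the target column: a binary class label for whether the molecule binds to the protein. Not available for the test set.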
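
To make the column layout concrete, here is a minimal sketch of one row as a typed record. The class name `BindingExample`, the helper `from_row`, the integer type of `id`, and every value in `toy_row` are assumptions made for illustration; the SMILES strings and protein name are placeholders, not real dataset rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BindingExample:
    """One row: a molecule/protein pair plus its optional label."""
    id: int  # assumed to be an integer identifier
    buildingblock1_smiles: str
    buildingblock2_smiles: str
    buildingblock3_smiles: str
    molecule_smiles: str
    protein_name: str
    binds: Optional[int] = None  # None for test rows, which carry no label

    def is_labeled(self) -> bool:
        return self.binds is not None


def from_row(row: dict) -> BindingExample:
    """Build a record from a dict keyed by column name; `binds` may be absent."""
    binds = row.get("binds")
    if binds is not None and binds not in (0, 1):
        raise ValueError(f"binds must be 0 or 1, got {binds!r}")
    return BindingExample(
        id=row["id"],
        buildingblock1_smiles=row["buildingblock1_smiles"],
        buildingblock2_smiles=row["buildingblock2_smiles"],
        buildingblock3_smiles=row["buildingblock3_smiles"],
        molecule_smiles=row["molecule_smiles"],
        protein_name=row["protein_name"],
        binds=binds,
    )


# Placeholder values only; this is not a real dataset row.
toy_row = {
    "id": 0,
    "buildingblock1_smiles": "CCO",
    "buildingblock2_smiles": "CCN",
    "buildingblock3_smiles": "CCC",
    "molecule_smiles": "CCOC(N)C[Dy]",  # [Dy] marks where the DNA linker would attach
    "protein_name": "PROTEIN_A",
    "binds": 1,
}
example = from_row(toy_row)
print(example.is_labeled())  # True
```

A test-set row would simply omit the `binds` key, and `is_labeled()` would then return False.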


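  • rdkit: an open-source cheminformatics toolkit with extensive support for drug design, bioactivity prediction, chemical reaction prediction, and chemical data processing. In this example it is mainly used to compute molecular fingerprints.
  • duckdb: an open-source embedded analytical database management system designed for data analysis and online analytical processing (OLAP). In this example it is mainly used for analytical queries over columnar storage.
  • PySMILES: a library for working with molecules represented in SMILES format.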


!pip install duckdb
!pip install pysmiles
!pip install rdkit


import re
import os
import unicodedata
import itertools

import pandas as pd
import numpy as np

import duckdb

import pysmiles
import plotly
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import Draw, AllChem
from rdkit import RDLogger
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem.Draw import rdMolDraw2D

from IPython.display import SVG


data_train = '/input/train.parquet'
test_path = '/input/test.parquet'

con = duckdb.connect()

# Draw a balanced random sample: 30,000 non-binders plus 30,000 binders.
data = con.query(f"""(SELECT * FROM parquet_scan('{data_train}')
WHERE binds = 0
ORDER BY random()
LIMIT 30000)
UNION ALL
(SELECT * FROM parquet_scan('{data_train}')
WHERE binds = 1
ORDER BY random()
LIMIT 30000)""").df()
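
The query above takes the same number of random rows from each class, so the training sample is balanced whatever the class ratio in the full file. As a rough illustration of that idea outside SQL, the sketch below does the same with the standard library. The function name `balanced_sample` and the toy rows are hypothetical; this is not how DuckDB implements `ORDER BY random() LIMIT n`.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, per_class, seed=None):
    """Return up to `per_class` randomly chosen rows for each label value."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    sample = []
    for label in sorted(by_label):
        group = by_label[label]
        # Same effect as ORDER BY random() LIMIT n within one class:
        # pick n distinct rows uniformly at random (or all rows if fewer exist).
        k = min(per_class, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Toy, made-up rows: 95 non-binders (binds=0) and 5 binders (binds=1).
toy_rows = [{"id": i, "binds": int(i >= 95)} for i in range(100)]
subset = balanced_sample(toy_rows, "binds", per_class=5, seed=0)
counts = {label: sum(r["binds"] == label for r in subset) for label in (0, 1)}
print(counts)  # {0: 5, 1: 5}
```

Undersampling the larger class this way keeps training fast, but accuracy measured on the balanced sample does not carry over directly to data with a different class ratio.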





# Parse each SMILES string into an RDKit molecule (None if parsing fails).
data['molecule'] = data['molecule_smiles'].apply(Chem.MolFromSmiles)

def modl(molecule_data, radius=2, bits=1024):
    # Morgan (ECFP-style) fingerprint as a list of 0/1 bits; None for unparsable molecules.
    if molecule_data is None:
        return None
    return list(AllChem.GetMorganFingerprintAsBitVect(molecule_data, radius, nBits=bits))

data['H1_ecfp'] = data['molecule'].apply(modl)
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the protein name and append it to each fingerprint.
encoder_onehot = OneHotEncoder(sparse_output=False)
encoder_onehot_fit = encoder_onehot.fit_transform(data['protein_name'].values.reshape(-1,1))
X = [ecfp + protein for ecfp, protein in zip(data['H1_ecfp'].tolist(), encoder_onehot_fit.tolist())]
y = data['binds'].tolist()
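
Each feature vector built above is the molecule's fingerprint bits followed by a one-hot encoding of the protein name, so one model can learn from all targets at once. The sketch below shows that concatenation on tiny hand-made inputs. The helper `one_hot`, the 8-bit "fingerprints", and the protein names are made up for illustration; scikit-learn's `OneHotEncoder` returns floats, while this sketch uses integers.

```python
def one_hot(values):
    """Map each value to a 0/1 indicator vector over the sorted distinct values."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors


# Made-up 8-bit "fingerprints" and placeholder protein names.
toy_fingerprints = [
    [1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 1],
]
toy_proteins = ["PROTEIN_B", "PROTEIN_A", "PROTEIN_B"]

categories, protein_vecs = one_hot(toy_proteins)
# List concatenation: fingerprint bits first, then the protein indicator.
X_toy = [fp + pv for fp, pv in zip(toy_fingerprints, protein_vecs)]

print(categories)  # ['PROTEIN_A', 'PROTEIN_B']
print(X_toy[0])    # [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
```

With 1024 fingerprint bits, each real feature vector has 1024 entries plus one entry per distinct protein name.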



from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from tqdm import tqdm

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier, plot_importance as plot_importance_xgb
from lightgbm import LGBMClassifier, plot_importance as plot_importance_lgbm

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, accuracy_score, classification_report

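models = {
  "Logistic Regression": LogisticRegression(),
  "Naive bayes": GaussianNB(),
  "KNN": KNeighborsClassifier(),
  "Ada Boost": AdaBoostClassifier(),
  "Gradient Boosting Classifier": GradientBoostingClassifier(),

  # The argument lists of the next three constructors are cut off in the source;
  # only the arguments that survive are kept here.
  "Decision Tree Classifier": DecisionTreeClassifier(max_depth=5),

  # XGBoost model (an optimized distributed gradient boosting library)
  "XGBoost": XGBClassifier(n_estimators=100),

  # LGBM model (a distributed gradient boosting framework based on decision tree algorithms)
  "LGBM": LGBMClassifier(boosting_type='gbdt'),
}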

for name, model in tqdm(models.items(), desc="Training models", total=len(models)):
  # Fit the model on the training split
  model.fit(X_train, y_train)

  # Mean 10-fold cross-validation accuracy on the training split
  score_training = cross_val_score(model, X_train, y_train, cv=10)

  pred_mode = model.predict(X_test)

  tqdm.write("Model: {} has Accuracy {:.2f}%".format(model.__class__.__name__,round(score_training.mean(), 2) * 100))

Training models:  12%|█▎        | 1/8 [01:40<11:44, 100.63s/it]
Model: LogisticRegression has Accuracy 87.00%

Training models:  25%|██▌       | 2/8 [03:00<08:50, 88.36s/it] 
Model: GaussianNB has Accuracy 74.00%

Training models:  38%|███▊      | 3/8 [05:12<09:02, 108.45s/it]
Model: KNeighborsClassifier has Accuracy 80.00%

Training models:  50%|█████     | 4/8 [13:33<17:33, 263.47s/it]
Model: AdaBoostClassifier has Accuracy 79.00%

Training models:  62%|██████▎   | 5/8 [43:18<40:36, 812.11s/it]
Model: GradientBoostingClassifier has Accuracy 84.00%

Training models:  75%|███████▌  | 6/8 [44:35<18:44, 562.10s/it]
Model: DecisionTreeClassifier has Accuracy 75.00%

Training models:  88%|████████▊ | 7/8 [52:01<08:44, 524.23s/it]
Model: XGBClassifier has Accuracy 91.00%

[LightGBM] [Warning] bagging_freq is set=5, subsample_freq=0 will be ignored. Current value: bagging_freq=5
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Training models: 100%|██████████| 8/8 [53:43<00:00, 402.98s/it]
Model: LGBMClassifier has Accuracy 89.00%

CPU times: user 58min 38s, sys: 1min 6s, total: 59min 44s
Wall time: 53min 47s
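
The accuracies in the log are means of `cross_val_score(..., cv=10)` on the training split, rounded to two decimals before being scaled to a percentage, which is why every figure ends in .00. The sketch below shows the mechanics of k-fold scoring on made-up data with a stand-in "model" that always predicts the most common training label. All names here are hypothetical, and the folds are plain contiguous slices, whereas scikit-learn uses stratified folds for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(fit, predict, X, y, k):
    """Train on k-1 folds, score on the held-out fold, return the k accuracies."""
    fold_scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        fold_scores.append(correct / len(test_idx))
    return fold_scores


# Stand-in "model": remember the most common training label and always predict it.
def fit_majority(X_fit, y_fit):
    return max(set(y_fit), key=y_fit.count)


def predict_majority(model, X_new):
    return [model] * len(X_new)


# Made-up data: 10 samples, 2 of them positive.
X_cv = [[i] for i in range(10)]
y_cv = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
fold_scores = cross_val_accuracy(fit_majority, predict_majority, X_cv, y_cv, k=5)
mean_accuracy = sum(fold_scores) / len(fold_scores)
print(fold_scores)             # [1.0, 0.5, 1.0, 0.5, 1.0]
print(f"{mean_accuracy:.2f}")  # 0.80
```

Because each fold retrains the model, ten folds cost roughly ten fits per model, which is consistent with the long per-model times in the log.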


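
The next cell uses `roc_curve` and `auc` from scikit-learn. As a rough picture of what those two calls compute, the sketch below sweeps a decision threshold over the predicted scores, records the false-positive and true-positive rates at each step, and integrates the curve with the trapezoid rule. The function names and the four toy scores are assumptions for illustration; scikit-learn's implementation differs in detail, for example in which thresholds it keeps.

```python
def roc_points(y_true, y_score):
    """Return (fpr, tpr) lists from sweeping the threshold from high to low."""
    pos = sum(1 for label in y_true if label == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("ROC needs at least one positive and one negative sample")

    pairs = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (handles ties).
        last_of_tie = idx == len(pairs) - 1 or pairs[idx + 1][0] != score
        if last_of_tie:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
    return area


# Made-up labels and positive-class scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_fpr, toy_tpr = roc_points(toy_labels, toy_scores)
print(toy_fpr)                          # [0.0, 0.0, 0.5, 0.5, 1.0]
print(toy_tpr)                          # [0.0, 0.5, 0.5, 1.0, 1.0]
print(trapezoid_auc(toy_fpr, toy_tpr))  # 0.75
```

An area of 0.5 corresponds to the dashed diagonal drawn in the plots (random ranking), and 1.0 to a model that scores every binder above every non-binder.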

for name, model in models.items():
  # Train the model
  model.fit(X_train, y_train)

  y_pred = model.predict(X_test)
  print("Machine Learning Model:", name)

  # ROC curve: compare against the predicted probability of the positive class
  fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:,1])
  roc_auc = auc(fpr, tpr)

  plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
  plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
  plt.xlim([0.0, 1.0])
  plt.ylim([0.0, 1.05])
  plt.xlabel('False Positive Rate')
  plt.ylabel('True Positive Rate')
  plt.title('Receiver Operating Characteristic - {}'.format(name))
  plt.legend(loc="lower right")
