Using Scikit-Learn Pipelines to Reduce Code and Improve Readability in ML Projects (Part 2) - Alibaba Cloud Developer Community

2022-12-21

Summary: Use Scikit-Learn pipelines to reduce the amount of code in ML projects and improve its readability.

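Approach 2: Using a Scikit-learn pipeline

Now let's perform the same operations with a Scikit-learn pipeline. I will apply the same transformations and the same algorithm.

The first step in building a pipeline is to define each transformer. The convention is to create one transformer per variable type. The steps:

1) Numeric transformer: create a numeric transformer that first imputes all missing values and then applies StandardScaler.

2) Categorical transformer: create a categorical transformer that uses OneHotEncoder to convert categorical values into binary (1/0) indicator columns.

3) Column transformer: ColumnTransformer applies the transformers above to the correct columns of the DataFrame. I pass it the two lists of numeric and categorical features defined in the previous section.

4) Pipeline with an estimator (classifier): here I chain the ColumnTransformer with the final step of the pipeline, which is an estimator (I chose logistic regression as the binary classifier).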
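The four steps above can be sketched end to end on a tiny dataset. This is only an illustration: the toy DataFrame and its column names (`age`, `income`, `city`) are assumptions made for the example, not the data used in the article; the step names match those used later in this article.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the article's real dataset is not reproduced here.
X_train = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
    "income": [30.0, 52.0, 47.0, np.nan, 61.0, 38.0],
    "city": ["A", "B", "A", "C", "B", "A"],
})
y_train = np.array([0, 1, 0, 1, 1, 0])
numeric_features = ["age", "income"]
categorical_features = ["city"]

# 1) Numeric transformer: impute missing values, then standardize.
numeric_transformer = Pipeline(steps=[
    ("meanimputer", SimpleImputer(strategy="mean")),
    ("stdscaler", StandardScaler()),
])

# 2) Categorical transformer: one-hot encode, ignoring unseen categories.
categorical_transformer = Pipeline(steps=[
    ("onehotenc", OneHotEncoder(handle_unknown="ignore")),
])

# 3) Column transformer: send each transformer to its own columns.
col_transformer = ColumnTransformer(transformers=[
    ("numeric_processing", numeric_transformer, numeric_features),
    ("categorical_processing", categorical_transformer, categorical_features),
])

# 4) Full pipeline: preprocessing followed by the final estimator.
pipeline = Pipeline(steps=[
    ("transform_column", col_transformer),
    ("logistics", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# A new row with a missing value and an unseen category is handled
# by the fitted preprocessing steps without any extra code.
X_new = pd.DataFrame({"age": [np.nan], "income": [45.0], "city": ["D"]})
prediction = pipeline.predict(X_new)
print(prediction)
```

A single `fit` call trains every transformer and the classifier, and `predict` reapplies the fitted transformations to the new row automatically.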

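This produces the following result.

We get the same accuracy. Here we did not call fit and transform several times: we fit the whole pipeline, transformers and final estimator together, once, and then called the score method to obtain the model's accuracy. To visualize the pipeline we created, use the following commands.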

from sklearn import set_config
set_config(display='diagram')
pipeline

Accessing the elements of the pipeline

We can access each element with the following commands.

pipeline.named_steps

pipeline.named_steps['transform_column'].transformers_[0]

pipeline.named_steps['transform_column'].transformers_[1]
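As a small sketch of what these accessors return, the example below fits a pipeline on hypothetical toy data (the columns `age` and `city` are assumptions for illustration) and then drills down to the fitted imputer and encoder. Each entry of `transformers_` is a `(name, fitted_transformer, columns)` tuple, and `named_transformers_` gives the same fitted objects by name.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, used only to have something to fit.
X_train = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 40.0],
    "city": ["A", "B", "A", "B"],
})
y_train = np.array([0, 1, 0, 1])

col_transformer = ColumnTransformer(transformers=[
    ("numeric_processing", Pipeline(steps=[
        ("meanimputer", SimpleImputer(strategy="mean")),
        ("stdscaler", StandardScaler()),
    ]), ["age"]),
    ("categorical_processing", Pipeline(steps=[
        ("onehotenc", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])
pipeline = Pipeline(steps=[
    ("transform_column", col_transformer),
    ("logistics", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

fitted_columns = pipeline.named_steps["transform_column"]

# Positional access: each entry is (name, fitted transformer, columns).
name, fitted_numeric, columns = fitted_columns.transformers_[0]

# Access by name, then drill into the nested pipeline's own steps.
imputer = (fitted_columns.named_transformers_["numeric_processing"]
           .named_steps["meanimputer"])
encoder = (fitted_columns.named_transformers_["categorical_processing"]
           .named_steps["onehotenc"])

print(name)                 # name given to the first transformer
print(imputer.statistics_)  # mean learned for each numeric column
print(encoder.categories_)  # categories learned for each categorical column
```

Reaching the fitted sub-steps this way is handy for checking what the preprocessing actually learned, for example the imputation value per column.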

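Approach 2, improved: Using a Scikit-learn pipeline (minimal code)

Scikit-learn provides two more functions that do the same job as the classes we used in the implementation above (ColumnTransformer and Pipeline):

*make_column_transformer*
*make_pipeline*

These two functions let us write even less code. How do they differ?

The implementation structure is exactly the same as before. The only difference is that we pass just the objects we need instead of passing tuples. As you can see below, I did not give the SimpleImputer, StandardScaler and OneHotEncoder objects specific names; I fed them directly into the pipeline.

We made no structural change to the pipeline. The only difference is that in this version we do not pass any names to the objects. This shows in the visualized pipeline: the two sub-pipelines for numeric and categorical processing receive the default names pipeline-1 and pipeline-2, whereas in the implementation above we chose the names ourselves.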
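The automatic naming can be checked without fitting anything. The sketch below builds the same structure with `make_pipeline` and `make_column_transformer` and lists the generated step names; the feature lists are hypothetical placeholders. The names are derived from the lowercased class names, with a numeric suffix added when a class appears more than once.

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature lists, only needed to build the objects.
numeric_features = ["age", "income"]
categorical_features = ["city"]

numeric_transformer = make_pipeline(SimpleImputer(strategy="mean"),
                                    StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown="ignore"))
col_transformer = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)
pipeline = make_pipeline(col_transformer, LogisticRegression())

# Collect the names that scikit-learn generated for each level.
numeric_step_names = [name for name, _ in numeric_transformer.steps]
column_step_names = [name for name, _, _ in col_transformer.transformers]
pipeline_step_names = [name for name, _ in pipeline.steps]

print(numeric_step_names)   # ['simpleimputer', 'standardscaler']
print(column_step_names)    # ['pipeline-1', 'pipeline-2']
print(pipeline_step_names)  # ['columntransformer', 'logisticregression']
```

The trade-off is that generated names such as `pipeline-1` say less than hand-picked ones such as `numeric_processing`, which matters when you later access steps by name.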

A quick comparison of the approaches above

Approach 1: the standard, basic ML workflow

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Replaces missing values
imputer = SimpleImputer(strategy="median")
# Scales the numerical features
scaler = StandardScaler()
# One-hot encodes the categorical features
# (the argument is named sparse=False on scikit-learn versions before 1.2)
one_hot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Defines the classifier
lr = LogisticRegression()
# Learn/train/fit from the data
imputer.fit(X_train[numeric_features])
imputed = imputer.transform(X_train[numeric_features])
scaler.fit(imputed)
scaled = scaler.transform(imputed)
one_hot.fit(X_train[categorical_features])
cat = one_hot.transform(X_train[categorical_features])
# Concatenate the scaled and one-hot matrices
Final_train = pd.DataFrame(np.concatenate((scaled, cat), axis=1))
lr.fit(Final_train, y_train)
# Score on the test set with the trained classifier;
# the same transformations still have to be applied by hand
X_test_filled = imputer.transform(X_test[numeric_features])
X_test_scaled = scaler.transform(X_test_filled)
X_test_one_hot = one_hot.transform(X_test[categorical_features])
Final_test = pd.DataFrame(np.concatenate((X_test_scaled, X_test_one_hot), axis=1))
lr.score(Final_test, y_test)

Approach 2: using a Scikit-learn pipeline

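from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric features: impute missing values with the mean, then standardize
numeric_transformer = Pipeline(steps=[
    ('meanimputer', SimpleImputer(strategy='mean')),
    ('stdscaler', StandardScaler()),
])
# Categorical features: one-hot encode, ignoring categories unseen during fit
categorical_transformer = Pipeline(steps=[
    ('onehotenc', OneHotEncoder(handle_unknown='ignore')),
])
# Route each transformer to its list of columns
col_transformer = ColumnTransformer(transformers=[
    ('numeric_processing', numeric_transformer, numeric_features),
    ('categorical_processing', categorical_transformer, categorical_features),
])
# Chain the preprocessing with the final estimator
pipeline = Pipeline([
    ('transform_column', col_transformer),
    ('logistics', LogisticRegression()),
])
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)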

Approach 2, improved

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Step names are generated automatically from the class names
numeric_transformer = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
categorical_transformer = make_pipeline(OneHotEncoder(handle_unknown='ignore'))
col_transformer = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
)
pipeline = make_pipeline(col_transformer, LogisticRegression())
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

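Looking at the snippets above, we can see how adopting pipelines in the workflow gives cleaner, more maintainable code with fewer lines: we went from roughly 30 lines of code down to about 20.

Conclusion

In this article I have tried to show what pipelines can do, in particular those provided by the Scikit-learn library. Once understood, they are very versatile and easy to implement. I have started using Scikit-learn pipelines as a data science best practice.

Becoming proficient with pipelines and better ML workflows does not take much practice, and once you have mastered them they will certainly make your life easier. If you already know and use them, I am glad to have refreshed your memory and skills. Thanks for reading.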
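To make the comparison concrete, here is a minimal sketch that runs the manual workflow and the pipeline side by side and checks that they score the same. The synthetic data (columns `age`, `income`, `city`) is an assumption made for this example, and both versions use the same mean imputation so that they are directly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data, generated only for this comparison.
rng = np.random.default_rng(0)
n = 40
X = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50, 15, n),
    "city": rng.choice(["A", "B", "C"], n),
})
X.iloc[::7, 0] = np.nan  # inject some missing ages
y = (X["income"] > 50).astype(int).to_numpy()
X_train, X_test, y_train, y_test = X.iloc[:30], X.iloc[30:], y[:30], y[30:]
num, cat = ["age", "income"], ["city"]

# Manual workflow: fit and transform every step by hand.
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()
one_hot = OneHotEncoder(handle_unknown="ignore")
num_train = scaler.fit_transform(imputer.fit_transform(X_train[num]))
cat_train = one_hot.fit_transform(X_train[cat]).toarray()
lr = LogisticRegression().fit(np.hstack([num_train, cat_train]), y_train)
num_test = scaler.transform(imputer.transform(X_test[num]))
cat_test = one_hot.transform(X_test[cat]).toarray()
manual_score = lr.score(np.hstack([num_test, cat_test]), y_test)

# Pipeline workflow: the same steps, fitted and applied in one object.
pipeline = make_pipeline(
    make_column_transformer(
        (make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()), num),
        (OneHotEncoder(handle_unknown="ignore"), cat),
    ),
    LogisticRegression(),
)
pipeline_score = pipeline.fit(X_train, y_train).score(X_test, y_test)

print(manual_score, pipeline_score)
```

Both versions perform the same computation; the pipeline simply removes the need to repeat each transformation by hand on the test set, which is where manual workflows tend to go wrong.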
