Using Autoencoders to Anonymize Data and Protect Data Privacy (Part 2) - Alibaba Cloud Developer Community


2022-12-20


Summary: using autoencoders to anonymize data and protect data privacy.

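Data anonymization with autoencoders

We are now ready to anonymize the dataset. First, we build an autoencoder whose bottleneck layer is half the size of the input layer.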

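# build_autoencoder is not defined in this part; presumably it comes from Part 1
dim_layer_input = X.shape[1]
dim_layer_1 = max(int(3 * dim_layer_input / 4), 1)  # first hidden layer: 3/4 of the input width
dim_layer_2 = max(int(dim_layer_input / 2), 1)      # bottleneck: half of the input width
autoencoder, encoder = build_autoencoder(
    dim_input=dim_layer_input,
    dim_layer_1=dim_layer_1,
    dim_layer_2=dim_layer_2,
)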
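To make the sizing rule concrete, here is a minimal sketch of the arithmetic on its own. The helper name `layer_sizes` and the example width of 62 features are assumptions for illustration, not part of the original code.

```python
def layer_sizes(dim_input):
    """Return (first hidden width, bottleneck width) for a given input width."""
    dim_layer_1 = max(int(3 * dim_input / 4), 1)
    dim_layer_2 = max(int(dim_input / 2), 1)
    return dim_layer_1, dim_layer_2


sizes_full = layer_sizes(62)  # hypothetical input width
sizes_tiny = layer_sizes(1)   # the max(..., 1) guard keeps every layer at least one unit wide
```

With 62 inputs this gives a 46-unit hidden layer and a 31-unit bottleneck, and a single-feature input still gets one unit per layer.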

We train it:

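# Train the network to reconstruct its own input (X is both input and target)
autoencoder.fit(
    X, X,
    epochs=100,
    batch_size=256,
    shuffle=True,
    validation_split=0.3,
    callbacks=callbacks,  # the callbacks list is not defined in this part
)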

Then we extract the encoded representation and use it as input to a random forest classifier.

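from numpy import array, mean
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Encode the whole dataset with the trained encoder half of the network
encoded = array(encoder(X))
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=2,
    n_jobs=8,
    random_state=42,
)
dict_performance = cross_validate(
    estimator=rf,
    X=encoded, y=y,
    cv=10,
    n_jobs=4,
    return_train_score=True,
    scoring=[
        "balanced_accuracy",
        "f1_weighted",
        "roc_auc",
        "average_precision",
    ],
)
# Average every entry (timings and scores) over the 10 folds
df_performance["ENCODED"] = [
    mean(dict_performance[k]) for k in dict_performance.keys()
]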
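`cross_validate` returns a dictionary with `fit_time` and `score_time` entries plus one `test_<metric>` entry per scorer and, with `return_train_score=True`, one `train_<metric>` entry per scorer. The list comprehension therefore averages timings as well as scores. The sketch below uses a hand-written dictionary with made-up numbers in place of a real run, and keeps each key next to its mean so that values cannot drift out of line with the index of `df_performance`.

```python
from statistics import mean

# Hypothetical stand-in for the output of cross_validate (2 folds, 1 metric)
dict_performance = {
    "fit_time": [0.5, 1.5],
    "score_time": [0.25, 0.75],
    "test_roc_auc": [0.5, 0.75],
    "train_roc_auc": [0.75, 1.0],
}

# Mean over folds, keyed by name instead of by position
summary = {key: mean(values) for key, values in dict_performance.items()}

# Keep only the score entries, dropping the timing entries
scores_only = {
    key: value
    for key, value in summary.items()
    if key.startswith(("test_", "train_"))
}
```

Keying by name is a design choice: it makes it explicit which rows are timings and which are scores when comparing the anonymized and original data.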

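The results are not bad. However, we cannot yet plot meaningful feature importances, because each latent feature is a combination of the original features rather than a single named input. We could extract the weights from the autoencoder and trace back which input features drive the more important latent features, but that is only feasible when the autoencoder has a simple structure, as it does in our example. In other cases, we can encode the features in groups instead.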
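As a rough illustration of tracing weights back to inputs, the sketch below chains the absolute kernels of the encoder layers into one input-to-latent influence matrix. It ignores biases and activation functions, so it is a heuristic only. The function name `input_influence` and the tiny kernels are hypothetical. If the encoder is built from Keras `Dense` layers (an assumption, since the model definition is not in this part), each kernel could be read with `layer.get_weights()[0]`.

```python
import numpy as np


def input_influence(kernels):
    """Chain absolute layer kernels into an (n_inputs, n_latent) influence matrix.

    Biases and activation functions are ignored, so this is only a rough guide.
    """
    influence = np.abs(kernels[0])
    for kernel in kernels[1:]:
        influence = influence @ np.abs(kernel)
    return influence


# Hypothetical kernels for a tiny 3 -> 2 -> 1 encoder (rows: inputs, columns: units)
kernel_1 = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])
kernel_2 = np.array([[2.0],
                     [0.5]])

influence = input_influence([kernel_1, kernel_2])
strongest_input = influence.argmax(axis=0)  # index of the dominant input per latent feature
```

With more layers or strongly nonlinear activations this product becomes harder to interpret, which is the limitation the paragraph above describes.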

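Group-encoded feature anonymization

To retain some business knowledge in the anonymized data, we can group the original features by business area and anonymize each group with its own autoencoder. In our example, the features can be divided into:

personal information, financial situation, campaign history and previous outcomes, and the overall economic situation.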

feat_pers = ['age', 'marital_divorced', 'marital_married', 'marital_single', 'marital_unknown', 'education_basic.4y', 'education_basic.6y', 'education_basic.9y', 'education_high.school', 'education_illiterate', 'education_professional.course', 'education_university.degree', 'education_unknown', 'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired', 'job_self-employed', 'job_services', 'job_student', 'job_technician', 'job_unemployed', 'job_unknown']
feat_fina = ['default_no', 'default_unknown', 'default_yes', 'housing_no', 'housing_unknown', 'housing_yes', 'loan_no', 'loan_unknown', 'loan_yes']
feat_camp = ['campaign', 'pdays', 'previous', 'contact_cellular', 'contact_telephone', 'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep', 'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu', 'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure', 'poutcome_nonexistent', 'poutcome_success']
feat_econ = ['emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
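Hand-written lists like these are easy to get wrong, so it can help to check that the groups form a clean partition of the dataset's columns before training anything. The helper below is a minimal sketch. The name `check_partition` and the small example are hypothetical, and the example deliberately contains mistakes.

```python
def check_partition(columns, groups):
    """Return (missing, duplicated, unknown) column names for a proposed grouping."""
    seen = set()
    duplicated = set()
    for names in groups.values():
        for name in names:
            if name in seen:
                duplicated.add(name)  # assigned to more than one group
            else:
                seen.add(name)
    missing = set(columns) - seen     # columns that no group covers
    unknown = seen - set(columns)     # names that are not real columns (typos)
    return missing, duplicated, unknown


# Hypothetical, deliberately flawed grouping
columns = ["age", "loan_yes", "campaign", "euribor3m"]
groups = {
    "pers": ["age"],
    "fina": ["loan_yes"],
    "camp": ["campaign", "age", "campain"],
}
missing, duplicated, unknown = check_partition(columns, groups)
```

A clean grouping returns three empty sets.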

We then apply a separate autoencoder to anonymize each group of features.

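from numpy import array, hstack
from tqdm import tqdm

feat_groups = [
    feat_pers,
    feat_fina,
    feat_camp,
    feat_econ,
]
encoded = []
for g in tqdm(feat_groups):
    # Size each autoencoder relative to the width of its own group
    dim_layer_input = len(g)
    dim_layer_1 = max(int(3 * dim_layer_input / 4), 1)
    dim_layer_2 = max(int(dim_layer_input / 2), 1)
    autoencoder, encoder = build_autoencoder(
        dim_input=dim_layer_input,
        dim_layer_1=dim_layer_1,
        dim_layer_2=dim_layer_2,
    )
    # Keep only this group's columns; array_columns holds the column names of X
    # (it is not defined in this part)
    X_tmp = X[:, array_columns.isin(g)]
    autoencoder.fit(
        X_tmp, X_tmp,
        epochs=100,
        batch_size=256,
        shuffle=True,
        validation_split=0.3,
        callbacks=callbacks,
        verbose=0,
    )
    encoded.append(array(encoder(X_tmp)))
# Stack the per-group encodings side by side
X_encoded = hstack(encoded)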
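The column selection inside the loop relies on a boolean mask over the column names. The sketch below shows the same idea on a toy matrix, using NumPy's `np.isin` in place of the `.isin` method called on `array_columns`. The column names and values are made up.

```python
import numpy as np

# Toy data: 2 rows, 5 named columns
columns = np.array(["age", "loan_yes", "campaign", "euribor3m", "loan_no"])
X = np.arange(10).reshape(2, 5)

group = ["loan_no", "loan_yes"]   # order inside the group does not matter
mask = np.isin(columns, group)    # True where the column belongs to the group
X_group = X[:, mask]              # columns come out in the order they have in X
```

Because the mask keeps the column order of `X`, the encoder for a group always sees its inputs in dataset order, regardless of how the group list itself is ordered.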

Because the features were grouped beforehand, we can label each anonymized feature with the area it came from.

array_encoded_features = array(
    ["pers_" + str(j) for j in range(encoded[0].shape[1])]
    + ["fina_" + str(j) for j in range(encoded[1].shape[1])]
    + ["camp_" + str(j) for j in range(encoded[2].shape[1])]
    + ["econ_" + str(j) for j in range(encoded[3].shape[1])]
)
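The same labels can be generated for any number of groups with a small helper. This is a sketch only, and the name `encoded_feature_names` is hypothetical. The group sizes are the lengths of the four feature lists in this article (25, 9, 23 and 5), and the widths follow from the halving rule used for the bottleneck.

```python
def encoded_feature_names(group_sizes):
    """Build 'prefix_index' labels, one per bottleneck unit of each group."""
    names = []
    for prefix, n_features in group_sizes.items():
        width = max(int(n_features / 2), 1)  # same bottleneck rule as the autoencoders
        names.extend(f"{prefix}_{j}" for j in range(width))
    return names


group_sizes = {"pers": 25, "fina": 9, "camp": 23, "econ": 5}
names = encoded_feature_names(group_sizes)
```

Under these assumptions the 62 original columns are reduced to 29 encoded columns: 12 personal, 4 financial, 11 campaign and 2 economic.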

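With these labels, every column of the anonymized dataset carries a group prefix (pers_, fina_, camp_ or econ_) followed by an index.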

Let's test the predictive power of the data after anonymization.

rf = RandomForestClassifier(n_estimators=500, max_depth=2, n_jobs=8, random_state=42)
dict_performance = cross_validate(
    estimator=rf,
    X=X_encoded, y=y,
    cv=10,
    n_jobs=4,
    return_train_score=True,
    scoring=[
        "balanced_accuracy",
        "f1_weighted",
        "roc_auc",
        "average_precision",
    ],
)
# Average every entry (timings and scores) over the 10 folds
df_performance["GROUP_ENCODED"] = [
    mean(dict_performance[k]) for k in dict_performance.keys()
]

In this case we can plot feature importances, because we know which area each anonymized feature was created from.

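from matplotlib.pyplot import barh, figure, show, title, xlabel, ylabel, yticks
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=2,
    n_jobs=8,
    random_state=42,
)
rf.fit(X_encoded, y)
# Mean drop in score when each encoded feature is shuffled, over 10 repeats
fi = permutation_importance(
    estimator=rf,
    X=X_encoded,
    y=y,
    n_repeats=10,
    n_jobs=8,
    random_state=42,
).importances_mean

# Horizontal bar chart of the 10 most important encoded features
figure(figsize=(16, 8))
barh(
    y=range(10, 0, -1),
    width=sorted(fi, reverse=True)[:10],
    alpha=0.9,
)
ylabel("Feature")
yticks(
    range(10, 0, -1),
    array_encoded_features[fi.argsort()[::-1][:10]],
)
xlabel("Importance")
title("Group-encoded features importance")
show()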

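As expected, the encoded features from the campaign and economic groups are the most important, which is consistent with the analysis of the non-anonymized data. For more detail, we can define feature groups at a finer granularity.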
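One way to summarize the importances at the group level is to add up the values of all encoded features that share a prefix. The sketch below does this for labels of the form `prefix_index`. The function name `importance_by_group` is hypothetical and the numbers are made up rather than taken from the model above.

```python
from collections import defaultdict


def importance_by_group(names, importances):
    """Sum feature importances per group, where names look like 'prefix_index'."""
    totals = defaultdict(float)
    for name, value in zip(names, importances):
        group = name.rsplit("_", 1)[0]  # strip the trailing index
        totals[group] += value
    return dict(totals)


# Made-up importances for five encoded features
names = ["pers_0", "pers_1", "camp_0", "econ_0", "econ_1"]
values = [0.0, 0.125, 0.25, 0.5, 0.25]

group_totals = importance_by_group(names, values)
ranking = sorted(group_totals, key=group_totals.get, reverse=True)
```

Summing favors groups with many encoded features, so a mean or maximum per group may be a fairer comparison when group sizes differ a lot.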

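Summary

In this tutorial, we saw how to apply autoencoders to anonymize a dataset so that the encoded data can be passed to downstream machine learning tasks. This can be very useful when data must be handed to an external predictive machine learning platform for testing (for example, when testing a model in the cloud). A well-trained autoencoder preserves most of the predictive power of the original data.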
