[DSW Gallery] Data Exploration and Validation with Alink and TFDV

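Summary: Alink computes statistics efficiently over large-scale data, including counts, missing values, minimum and maximum values, quantiles, and distribution histograms. These statistics let you explore the characteristics of your data and support feature engineering. Alink also integrates seamlessly with TensorFlow Data Validation (TFDV) to provide data schema inference, data skew and drift detection, and related features.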

Quick start

Open "Data Exploration and Validation with Alink and TFDV" and click "Open in DSW" in the upper-right corner.


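Data Exploration and Validation with Alink and TFDV

  Alink's statistics capabilities support data exploration and data validation: you can inspect your data and use the results to guide feature engineering.

  This is similar to what TensorFlow Data Validation offers, but with Alink you do not need to set up a large-scale cluster yourself (Apache Beam plus a Spark or Flink cluster) to run statistical analysis over large-scale data on the PAI platform. Alink's results also plug directly into TensorFlow Data Validation's data visualization, data schema inference, and data skew and drift detection.

  In this example Notebook, Alink's computation is combined with TFDV to reproduce the functionality of the official TFDV example.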
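
To make concrete what kind of per-feature statistics are involved (counts, missing values, minimum and maximum, quantiles, histograms), here is a minimal pure-Python sketch. It is only an illustration on a tiny in-memory list and is not how Alink computes them; the function name `summarize_numeric`, the nearest-rank median, the equal-width binning, and the sample values are all assumptions made for this example. It also assumes the column has at least one non-missing value.

```python
import math
from collections import Counter

def summarize_numeric(values, bins=4):
    """Summarize one numeric column; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    summary = {
        "count": len(values),
        "missing": len(values) - n,
        "min": present[0],
        "max": present[-1],
        # Nearest-rank median: the smallest value with at least half the data at or below it.
        "median": present[math.ceil(0.5 * n) - 1],
    }
    # Equal-width histogram over [min, max]; the maximum is placed in the last bin.
    width = (summary["max"] - summary["min"]) / bins or 1.0
    counts = Counter(min(int((v - summary["min"]) / width), bins - 1) for v in present)
    summary["histogram"] = [counts.get(i, 0) for i in range(bins)]
    return summary

# Hypothetical fare values, two of them missing.
fares = [5.0, 7.5, None, 12.0, 30.0, 7.5, None, 45.0]
print(summarize_numeric(fares))
```

On real data these numbers come from a distributed job rather than a sorted Python list, but the meaning of each field is the same.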

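Runtime requirements

  1. PyAlink is preinstalled in the official PAI-DSW images. At least 4 GB of memory is required.
  2. This Notebook can be run and viewed as-is; no other files need to be prepared.
  3. For the visualizations in this Notebook to render correctly, your network must be able to reach sites such as GitHub and Google. Otherwise the interactive data exploration charts will not be displayed.
  4. In addition to item 3, for the best experience switch the Notebook to the light theme first: Settings -> Theme -> JupyterLab Light.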

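Install dependencies

  Install tensorflow-metadata and tensorflow-data-validation. The seamless integration with tensorflow-data-validation is built on the data structures provided by tensorflow-metadata.

  Note: the installation will install or update the tensorflow package, but this Notebook does not use TensorFlow itself.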

!pip3 install "tensorflow-data-validation==0.23.0" --use-deprecated=legacy-resolver
!pip3 install "tensorflow-metadata==1.2.0" --use-deprecated=legacy-resolver
Requirement already satisfied: tensorflow-data-validation==0.23.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (0.23.0)
Requirement already satisfied: apache-beam[gcp]<3,>=2.23 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (2.40.0)
Requirement already satisfied: joblib<0.15,>=0.12 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.14.1)
Processing /Users/fanhong/Library/Caches/pip/wheels/46/91/e3/0fced4f5fbc0a051a5667096826186c9ff60f2d0e9bf0f1cdc/absl_py-0.8.1-py3-none-any.whl
Requirement already satisfied: pandas<2,>=0.24 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.25.3)
Requirement already satisfied: tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (2.3.1)
Requirement already satisfied: tensorflow-transform<0.24,>=0.23 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.23.0)
Requirement already satisfied: six<2,>=1.12 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (1.16.0)
Requirement already satisfied: protobuf<4,>=3.7 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (3.19.4)
Requirement already satisfied: numpy<2,>=1.16 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (1.18.5)
Requirement already satisfied: tfx-bsl<0.24,>=0.23 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.23.0)
Requirement already satisfied: pyarrow<0.18,>=0.17 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.17.1)
Collecting tensorflow-metadata<0.24,>=0.23
  Using cached tensorflow_metadata-0.23.0-py3-none-any.whl (43 kB)
Requirement already satisfied: httplib2<0.21.0,>=0.8 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.12.0)
Requirement already satisfied: typing-extensions>=3.7.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (4.2.0)
Requirement already satisfied: fastavro<2,>=0.23.6 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.6.0)
Requirement already satisfied: crcmod<2.0,>=1.7 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.7)
Requirement already satisfied: proto-plus<2,>=1.7.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.22.0)
Requirement already satisfied: python-dateutil<3,>=2.8.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.8.0)
Requirement already satisfied: cloudpickle<3,>=2.1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.1.0)
Requirement already satisfied: hdfs<3.0.0,>=2.1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.7.0)
Requirement already satisfied: grpcio<2,>=1.33.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.47.0)
Requirement already satisfied: pytz>=2018.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2022.1)
Requirement already satisfied: pymongo<4.0.0,>=3.8.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.12.3)
Requirement already satisfied: requests<3.0.0,>=2.24.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.27.1)
Requirement already satisfied: orjson<4.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.7.12)
Requirement already satisfied: dill<0.3.2,>=0.3.1.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.3.1.1)
Requirement already satisfied: pydot<2,>=1.2.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.4.2)
Requirement already satisfied: google-cloud-pubsub<3,>=2.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.13.6)
Requirement already satisfied: google-auth<3,>=1.18.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.10.0)
Requirement already satisfied: google-auth-httplib2<0.2.0,>=0.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.1.0)
Requirement already satisfied: google-cloud-videointelligence<2,>=1.8.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.16.3)
Requirement already satisfied: google-cloud-recommendations-ai<=0.2.0,>=0.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.2.0)
Requirement already satisfied: google-apitools<0.5.32,>=0.5.31; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.5.31)
Requirement already satisfied: google-cloud-dlp<4,>=3.0.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.8.0)
Requirement already satisfied: google-cloud-datastore<2,>=1.8.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.15.4)
Requirement already satisfied: google-cloud-vision<2,>=0.38.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.0.2)
Requirement already satisfied: google-cloud-bigtable<2,>=0.31.1; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.7.2)
Requirement already satisfied: google-cloud-spanner<2,>=1.13.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.19.3)
Requirement already satisfied: grpcio-gcp<1,>=0.2.2; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.2.2)
Requirement already satisfied: google-cloud-bigquery-storage>=2.6.3; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.14.2)
Requirement already satisfied: google-cloud-core<2,>=0.28.1; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.7.3)
Requirement already satisfied: google-cloud-bigquery<3,>=1.6.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.34.4)
Requirement already satisfied: google-cloud-language<2,>=1.3.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.3.2)
Requirement already satisfied: google-cloud-pubsublite<2,>=1.2.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.4.3)
Requirement already satisfied: cachetools<5,>=3.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (4.2.4)
Requirement already satisfied: h5py<2.11.0,>=2.10.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.10.0)
Requirement already satisfied: astunparse==1.6.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.6.3)
Requirement already satisfied: termcolor>=1.1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.1.0)
Requirement already satisfied: wheel>=0.26 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.37.1)
Requirement already satisfied: google-pasta>=0.1.8 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.2.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.3.0)
Requirement already satisfied: tensorboard<3,>=2.3.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.10.0)
Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.3.0)
Requirement already satisfied: wrapt>=1.11.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.14.1)
Requirement already satisfied: gast==0.3.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.3.3)
Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.1.2)
Requirement already satisfied: google-api-python-client<2,>=1.7.11 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tfx-bsl<0.24,>=0.23->tensorflow-data-validation==0.23.0) (1.12.11)
Requirement already satisfied: tensorflow-serving-api!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tfx-bsl<0.24,>=0.23->tensorflow-data-validation==0.23.0) (2.9.1)
Requirement already satisfied: googleapis-common-protos in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-metadata<0.24,>=0.23->tensorflow-data-validation==0.23.0) (1.56.2)
Requirement already satisfied: docopt in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.6.2)
Requirement already satisfied: charset-normalizer~=2.0.0; python_version >= "3" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5; python_version >= "3" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.3)
Requirement already satisfied: pyparsing>=2.1.4 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from pydot<2,>=1.2.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.0.8)
Requirement already satisfied: google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.0.0dev,>=1.32.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsub<3,>=2.1.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.8.2)
Requirement already satisfied: grpcio-status>=1.16.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsub<3,>=2.1.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.47.0)
Requirement already satisfied: grpc-google-iam-v1<1.0.0dev,>=0.12.4 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsub<3,>=2.1.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.12.4)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.6" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-auth<3,>=1.18.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (4.8)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-auth<3,>=1.18.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.2.8)
Requirement already satisfied: fasteners>=0.14 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-apitools<0.5.32,>=0.5.31; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.17.3)
Requirement already satisfied: oauth2client>=1.4.12 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-apitools<0.5.32,>=0.5.31; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.0.0)
Requirement already satisfied: packaging<22.0dev,>=14.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-bigquery<3,>=1.6.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (21.3)
Requirement already satisfied: google-resumable-media<3.0dev,>=0.6.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-bigquery<3,>=1.6.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.3.3)
Requirement already satisfied: overrides<7.0.0,>=6.0.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsublite<2,>=1.2.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (6.2.0)
Requirement already satisfied: markdown>=2.6.8 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.4.1)
Requirement already satisfied: setuptools>=41.0.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (61.2.0)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.6.1)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.8.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.4.6)
Requirement already satisfied: werkzeug>=1.0.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.2.2)
Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-api-python-client<2,>=1.7.11->tfx-bsl<0.24,>=0.23->tensorflow-data-validation==0.23.0) (3.0.1)
Requirement already satisfied: pyasn1>=0.1.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from rsa<5,>=3.1.4; python_version >= "3.6"->google-auth<3,>=1.18.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.4.8)
Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-resumable-media<3.0dev,>=0.6.0->google-cloud-bigquery<3,>=1.6.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.3.0)
Requirement already satisfied: importlib-metadata>=4.4; python_version < "3.10" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (4.11.3)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.3.1)
Requirement already satisfied: MarkupSafe>=2.1.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from werkzeug>=1.0.1->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.1.1)
Requirement already satisfied: zipp>=0.5 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from importlib-metadata>=4.4; python_version < "3.10"->markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.8.0)
Requirement already satisfied: oauthlib>=3.0.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.2.0)
Installing collected packages: absl-py, tensorflow-metadata
  Attempting uninstall: absl-py
    Found existing installation: absl-py 0.12.0
    Uninstalling absl-py-0.12.0:
      Successfully uninstalled absl-py-0.12.0
  Attempting uninstall: tensorflow-metadata
    Found existing installation: tensorflow-metadata 1.2.0
    Uninstalling tensorflow-metadata-1.2.0:
      Successfully uninstalled tensorflow-metadata-1.2.0
ERROR: pip's legacy dependency resolver does not consider dependency conflicts when selecting packages. This behaviour is the source of the following dependency conflicts.
tensorflow-serving-api 2.9.1 requires tensorflow<3,>=2.9.1, but you'll have tensorflow 2.3.1 which is incompatible.
Successfully installed absl-py-0.8.1 tensorflow-metadata-0.23.0
Collecting tensorflow-metadata==1.2.0
  Using cached tensorflow_metadata-1.2.0-py3-none-any.whl (48 kB)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-metadata==1.2.0) (1.56.2)
Requirement already satisfied: protobuf<4,>=3.13 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-metadata==1.2.0) (3.19.4)
Collecting absl-py<0.13,>=0.9
  Using cached absl_py-0.12.0-py3-none-any.whl (129 kB)
Requirement already satisfied: six in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from absl-py<0.13,>=0.9->tensorflow-metadata==1.2.0) (1.16.0)
Installing collected packages: absl-py, tensorflow-metadata
  Attempting uninstall: absl-py
    Found existing installation: absl-py 0.8.1
    Uninstalling absl-py-0.8.1:
      Successfully uninstalled absl-py-0.8.1
  Attempting uninstall: tensorflow-metadata
    Found existing installation: tensorflow-metadata 0.23.0
    Uninstalling tensorflow-metadata-0.23.0:
      Successfully uninstalled tensorflow-metadata-0.23.0
ERROR: pip's legacy dependency resolver does not consider dependency conflicts when selecting packages. This behaviour is the source of the following dependency conflicts.
tfx-bsl 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tfx-bsl 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
tensorflow-transform 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tensorflow-transform 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
tensorflow-serving-api 2.9.1 requires tensorflow<3,>=2.9.1, but you'll have tensorflow 2.3.1 which is incompatible.
tensorflow-data-validation 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tensorflow-data-validation 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
Successfully installed absl-py-0.12.0 tensorflow-metadata-1.2.0
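
The log above ends with dependency-conflict warnings from pip's legacy resolver, so it can be useful to confirm which versions actually ended up installed. The following is a small sketch using the standard library's `importlib.metadata`, which requires Python 3.8 or later (on Python 3.7, as in the log, the `importlib_metadata` backport offers the same interface). The helper name `installed_version` is an assumption made for this example.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Print what is actually installed for the packages touched by the commands above.
for name in ("tensorflow-data-validation", "tensorflow-metadata", "absl-py"):
    print(name, installed_version(name))
```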

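Data preparation

  We use the Chicago taxi trips dataset, which is also the dataset used by the TFDV demo. The data has already been uploaded to OSS and Alink can read it directly by URL, so no additional preparation is needed; the file URLs appear in the data source code below.

Data exploration with Alink

Import the pyalink package and enable the local runtime environment.

  In this example, useLocalEnv runs Alink jobs locally (that is, inside the DSW container), using multiple threads to simulate distributed computation.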

from pyalink.alink import *
useLocalEnv(2)
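
The idea of simulating distributed computation with threads can be illustrated without Alink: split the data into partitions, summarize each partition independently, then merge the partial results. The sketch below is only a conceptual illustration using Python's `concurrent.futures`; it does not reflect Alink's or Flink's actual implementation, and the function names and sample values are assumptions made for this example. It assumes a non-empty input list.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partial_stats(chunk):
    """Summarize one partition in a form that can be merged with others."""
    return {"count": len(chunk), "sum": sum(chunk), "min": min(chunk), "max": max(chunk)}

def merge_stats(a, b):
    """Combine two partial summaries into the summary of their union."""
    return {
        "count": a["count"] + b["count"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }

def parallel_stats(values, parallelism=2):
    # Round-robin split into partitions; drop empty ones for short inputs.
    chunks = [values[i::parallelism] for i in range(parallelism)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        partials = list(pool.map(partial_stats, chunks))
    return reduce(merge_stats, partials)

trip_miles = [1.0, 0.5, 3.5, 8.0, 2.0]  # hypothetical values
stats = parallel_stats(trip_miles, parallelism=2)
print(stats, "mean =", stats["sum"] / stats["count"])
```

Count, sum, minimum, and maximum merge exactly across partitions, which is why they scale well; exact quantiles do not merge this simply and usually need approximate sketches.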

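Explore the training dataset: compute statistics for the training set and explore the per-feature results visually.

  In Alink, data is read through data source components; CsvSourceBatchOp reads CSV data. The file path can be a local path, an HTTP/HTTPS URL, or a path on OSS, HDFS, and similar storage. Because the execution engine is strict about data types, you also need to specify the column names and basic data types (schemaStr).

  Like TFDV, Alink uses Facets for visualization. The visualization shows basic statistics and distributions for each feature and can be explored interactively.

  Note: for the visualization page to render correctly, your network must be able to reach content on GitHub.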

# Column names and basic data types
schemaStr = "pickup_community_area bigint,fare double,trip_start_month int,trip_start_hour int,trip_start_day int,trip_start_timestamp long,pickup_latitude double,pickup_longitude double,dropoff_latitude double,dropoff_longitude double,trip_miles double,pickup_census_tract string,dropoff_census_tract string,payment_type string,company string,trip_seconds double,dropoff_community_area bigint,tips double"
# Define the data source
source = CsvSourceBatchOp()\
    .setFilePath("https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_train_data.csv")\
    .setSchemaStr(schemaStr)\
    .setIgnoreFirstLine(True)
# Tell Alink to display a visualization of the data statistics
source.lazyVizStatistics()
# Run the job
BatchOperator.execute()
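
The schemaStr format used above is a comma-separated list of "name type" pairs. A mismatch between it and the CSV header is an easy mistake to make, so the following sketch shows one way to check the two against each other before submitting a job. This is a simplified illustration of the format, not Alink's own parser; the helper names and the shortened sample schema and CSV text are assumptions made for this example.

```python
import csv
import io

def parse_schema_str(schema_str):
    """Split 'name type, name type, ...' into a list of (name, type) tuples."""
    columns = []
    for item in schema_str.split(","):
        parts = item.split()
        if len(parts) != 2:
            raise ValueError(f"expected 'name type', got {item!r}")
        columns.append((parts[0], parts[1]))
    return columns

def header_matches(csv_text, schema_str):
    """Check that the CSV header row lists the same columns, in the same order."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return header == [name for name, _ in parse_schema_str(schema_str)]

# Shortened, hypothetical schema and data for illustration.
sample_schema = "fare double, trip_start_hour int, payment_type string"
sample_csv = "fare,trip_start_hour,payment_type\n12.5,18,Cash\n"

print(parse_schema_str(sample_schema))
print(header_matches(sample_csv, sample_schema))
```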


Scaling to larger data.

  For larger-scale data, use usePAIEnv to submit jobs to a large-scale cluster. See help(usePAIEnv) for detailed usage.

help(usePAIEnv)
Help on function usePAIEnv in module pyalink.alink.env:
usePAIEnv(workers=2, memory_per_worker=4096, cpu_per_worker=1, region_id=None, access_key_id=None, access_key_secret=None, workspace_id=None, workspace_name=None, config=None)
    Submit job to PAIFlow
    :param workers           (int)     optional, the default value is 2,
                                       when workers<=0, PyAlink will automatically estimate its value.
    :param memory_per_worker (int)     optional, the default value is 4096
    :param cpu_per_worker    (int)     optional, the default value is 1
    :param region_id         (string)  
    :param access_key_id     (string)  
    :param access_key_secret (string)
    :param workspace_id      (string)  optional, the id of workspace
    :param workspace_name    (string)  optional, the name of workspace
                                       attention: workspace_id and workspace_name must not be None together.
    :param config            (dict)  custom configuration for PyAlink
        - pop_extra_config (dict)     the extra configuration for pop client
        - paiflow_endpoint (str)      the pop endpoint of PAIFlow
        - workspace_endpoint (str)    the pop endpoint of AIWorkspace
        - compute_resource_type (str) options: MaxCompute, Flink, default is MaxCompute
        - compute_resource_env (str)  options: dev, prod, default is dev
        - compute_resource_name (str) the name of computeResource
        - oss_rolearn (str)           the roleArn of oss
        =============================================
        # customize flink-configuration
        # example:
        #   'FLINK_CONFIG_restart-strategy': 'none'
        - FLINK_CONFIG_[key]: value
        # customize the vvp job labels
        - VVP_LABEL_[key]: value
        - jvm_system_properties: dict
        - jvm_startup_options: list[str] the extra commandline options for jvm
        =============================================
        - storage_type: options 'oss', 'MaxCompute'
        ---------------------------------------------
        # when storage_type='oss'
        - oss_endpoint (str)          required, specify the endpoint of OSS
        - oss_base_uri (str)          required, in format of oss://[bucket]/[path]
        - # when the credentials to access OSS is same with global credentials,
          # the following parameters can be ommited.
          # PyAlink will look in several locations when searching for OSS credentials.
          # the order in which PyAlink searches for credentials is:
          # 1. the following credentials parameters in the usePAIEnv() method
          # 2. System Environment: 'OSS_ACCESS_KEY_ID', 'OSS_ACCESS_KEY_SECRET', 'OSS_SECURITY_TOKEN'
          # 3. the parameters of (access_key_id, access_key_secret) in the usePAIEnv() method
          # 4. System Environment: 'ALINK_PAIFLOW_ACCESS_KEY_ID', 'ALINK_PAIFLOW_ACCESS_KEY_SECRET'
          # 5. System Environment: 'ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'
        - oss_access_key_id (str)     optional, the AccessKeyId for your Aliyun Account to access OSS
        - oss_access_key_secret (str) optional, the AccessKeySecret for your Aliyun Account to access OSS
        - oss_security_token (str)    optional, the SecurityToken for your Aliyun Account to access OSS
        ---------------------------------------------
        # when storage_type='MaxCompute'
        - maxcompute_endpoint (str)          required
        - maxcompute_project  (str)          required
        - maxcompute_table_name_prefix (str) optional, the default value is 'pyalink_tmp_'
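
The help text above lists five locations that are searched, in order, for OSS credentials. The sketch below restates that order as code to make it easier to reason about which credentials win. It is only an illustration of the documented order, not PyAlink's actual implementation: the function name `resolve_oss_credentials` and the rule that a location counts only when both the key ID and the secret are present are assumptions, and the security token is left out for brevity.

```python
import os

def resolve_oss_credentials(config=None, access_key_id=None, access_key_secret=None, env=None):
    """Return (source, key_id, key_secret) from the first location holding both values."""
    config = config or {}
    env = os.environ if env is None else env
    candidates = [
        # 1. OSS-specific entries in the config dict passed to usePAIEnv()
        ("config", config.get("oss_access_key_id"), config.get("oss_access_key_secret")),
        # 2. OSS-specific environment variables
        ("env:OSS_*", env.get("OSS_ACCESS_KEY_ID"), env.get("OSS_ACCESS_KEY_SECRET")),
        # 3. The global access_key_id / access_key_secret arguments of usePAIEnv()
        ("arguments", access_key_id, access_key_secret),
        # 4. and 5. Fallback environment variables
        ("env:ALINK_PAIFLOW_*", env.get("ALINK_PAIFLOW_ACCESS_KEY_ID"),
         env.get("ALINK_PAIFLOW_ACCESS_KEY_SECRET")),
        ("env:ALIBABA_CLOUD_*", env.get("ALIBABA_CLOUD_ACCESS_KEY_ID"),
         env.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET")),
    ]
    for source, key_id, key_secret in candidates:
        if key_id and key_secret:
            return source, key_id, key_secret
    return None

# Placeholder values only; never hard-code real credentials.
fake_env = {"OSS_ACCESS_KEY_ID": "id-from-env", "OSS_ACCESS_KEY_SECRET": "secret-from-env"}
print(resolve_oss_credentials(access_key_id="global-id", access_key_secret="global-secret", env=fake_env))
```

In this example the OSS-specific environment variables take precedence over the global arguments, matching items 2 and 3 of the list in the help text.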

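Data validation with Alink and TFDV

  Alink's statistics integrate seamlessly with the parts of TFDV other than statistics computation, including data visualization, data schema inference, and data skew and drift detection.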

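import tensorflow_data_validation as tfdv
# Compute full statistics with Alink and fetch them as a DatasetFeatureStatisticsList
train_stats = InternalFullStatsBatchOp().linkFrom(source).collectFullStats().getDatasetFeatureStatisticsList()
# Visualize the Alink-computed statistics with TFDV
tfdv.visualize_statistics(train_stats)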

  Using the TFDV API, infer the data schema from the statistics (take care not to confuse this schema with Alink's schemaStr).

schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
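
To give a feel for what an inferred schema contains, the toy sketch below derives a type, a presence flag, and (for string features) a domain of allowed values. It is a conceptual illustration only: TFDV infers the schema from computed statistics with its own heuristics, whereas this sketch looks at raw values directly. The function name `infer_toy_schema`, the `max_domain_size` threshold, and the sample data are assumptions made for this example.

```python
def infer_toy_schema(columns, max_domain_size=20):
    """columns maps feature name -> list of observed values (None = missing)."""
    schema = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if all(isinstance(v, str) for v in present):
            kind = "STRING"
        elif all(isinstance(v, int) for v in present):
            kind = "INT"
        else:
            kind = "FLOAT"
        # A feature is treated as required if no value was missing in the sample.
        entry = {"type": kind, "required": len(present) == len(values)}
        if kind == "STRING":
            distinct = sorted(set(present))
            # Only small value sets are recorded as a closed domain.
            if len(distinct) <= max_domain_size:
                entry["domain"] = distinct
        schema[name] = entry
    return schema

# Hypothetical training sample.
train_columns = {
    "payment_type": ["Cash", "Credit Card", "Cash"],
    "fare": [5.0, 7.5, None],
    "trip_start_hour": [1, 13, 22],
}
for feature, rule in infer_toy_schema(train_columns).items():
    print(feature, rule)
```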

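  Compute statistics for the evaluation set and compare them with the training set.

  Checking the evaluation set's statistics against the data schema reveals anomalies in the data.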

eval_source = CsvSourceBatchOp()\
    .setFilePath("https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_eval_data.csv")\
    .setSchemaStr(schemaStr)\
    .setIgnoreFirstLine(True)
eval_stats = InternalFullStatsBatchOp().linkFrom(eval_source).collectFullStats().getDatasetFeatureStatisticsList()
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
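
Conceptually, validation compares what the schema allows with what the new data contains. The toy sketch below checks presence and string domains against a hand-written schema. It illustrates the idea only and is not TFDV's algorithm, which works on statistics and supports many more checks; the function name `find_anomalies`, the schema layout, and the sample values are assumptions made for this example.

```python
def find_anomalies(schema, columns):
    """Return {feature: description} for features that violate the schema."""
    anomalies = {}
    for name, rule in schema.items():
        if name not in columns:
            anomalies[name] = "column missing"
            continue
        values = columns[name]
        present = [v for v in values if v is not None]
        if rule.get("required") and len(present) < len(values):
            anomalies[name] = "unexpected missing values"
        domain = rule.get("domain")
        if domain is not None:
            # Values seen in the new data but absent from the allowed domain.
            unexpected = sorted(set(present) - set(domain))
            if unexpected:
                anomalies[name] = "unexpected values: " + ", ".join(unexpected)
    return anomalies

# Hypothetical schema and evaluation sample.
toy_schema = {
    "payment_type": {"required": True, "domain": ["Cash", "Credit Card"]},
    "company": {"required": False, "domain": ["Taxi Affiliation Services"]},
}
eval_columns = {
    "payment_type": ["Cash", "Prcard", "Credit Card"],
    "company": ["Taxi Affiliation Services", None],
}
print(find_anomalies(toy_schema, eval_columns))
```

An empty result means the evaluation sample is consistent with the schema; otherwise either the data needs fixing or the schema needs to be relaxed deliberately.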

  For more statistics-based features that the seamless integration of Alink and TFDV makes possible, see the original official TFDV example.
