[DSW Gallery] Data Exploration and Validation with Alink and TFDV

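Summary: Alink computes statistics over large-scale data efficiently, including counts, missing values, minimum and maximum values, quantiles, and distribution histograms. Users can rely on these statistics to explore data characteristics and to support feature engineering. Alink also integrates seamlessly with TensorFlow Data Validation (TFDV) to provide schema inference, data drift detection, and related features.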

Quick start

Open the notebook "Data Exploration and Validation with Alink and TFDV" and click "Open in DSW" in the upper-right corner.


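Data exploration and validation with Alink and TFDV

  Alink's statistics support both data exploration and data validation: you can inspect the data and use the results to guide feature engineering.

  This is similar to TensorFlow Data Validation, but with Alink you do not have to set up a large cluster yourself (Apache Beam plus a Spark or Flink cluster) to run statistical analysis on large-scale data on the PAI platform. Alink's results also plug seamlessly into TensorFlow Data Validation features such as visualization, schema inference, and data drift detection.

  In this example notebook, Alink's computing capability is combined with TFDV to reproduce the functionality of the official TFDV example.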
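
  To make the kinds of statistics mentioned above concrete, here is a minimal pure-Python sketch that summarizes one numeric column: count, missing values, minimum, maximum, a quantile, and an equal-width histogram. It is only an illustration under simple assumptions (nearest-rank quantiles, `None` marks a missing value); the function name `summarize_numeric` and the sample values are hypothetical, and this is not how Alink computes its statistics internally.

```python
import math
from collections import Counter


def summarize_numeric(values, num_bins=4):
    """Summarize one numeric column; None marks a missing value."""
    present = sorted(v for v in values if v is not None)
    missing = len(values) - len(present)
    if not present:
        return {"count": 0, "missing": missing}
    lo, hi = present[0], present[-1]

    def quantile(q):
        # Nearest-rank quantile on the sorted values.
        rank = max(1, math.ceil(q * len(present)))
        return present[rank - 1]

    # Equal-width histogram between min and max; the maximum value
    # falls into the last bin.
    width = (hi - lo) / num_bins or 1.0
    bins = Counter(min(int((v - lo) / width), num_bins - 1) for v in present)
    return {
        "count": len(present),
        "missing": missing,
        "min": lo,
        "max": hi,
        "median": quantile(0.5),
        "histogram": [bins.get(i, 0) for i in range(num_bins)],
    }


# Toy values for illustration only, not taken from the real dataset.
summary = summarize_numeric([5.0, 7.5, None, 12.0, 3.0, 40.0])
print(summary)
```

  A skewed histogram such as the one produced here (most values in the first bin, one outlier in the last) is the sort of signal that interactive exploration is meant to surface.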

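Runtime requirements

  1. PyAlink is preinstalled in the official PAI-DSW images. At least 4 GB of memory is required.
  2. This notebook can be run as is; no additional files need to be prepared.
  3. For the visualizations in this notebook to render correctly, your network must be able to reach sites such as GitHub and Google. Otherwise the interactive data exploration charts will not be displayed.
  4. In addition to item 3, for the best results first switch the notebook theme to light: Settings -> Theme -> JupyterLab Light.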

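Install dependencies

  Install tensorflow-metadata and tensorflow-data-validation. The seamless integration with tensorflow-data-validation is built on the data structures provided by tensorflow-metadata.

  Note: the installation will install or update the tensorflow package, but this notebook does not use TensorFlow itself.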

!pip3 install "tensorflow-data-validation==0.23.0" --use-deprecated=legacy-resolver
!pip3 install "tensorflow-metadata==1.2.0" --use-deprecated=legacy-resolver
Requirement already satisfied: tensorflow-data-validation==0.23.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (0.23.0)
Requirement already satisfied: apache-beam[gcp]<3,>=2.23 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (2.40.0)
Requirement already satisfied: joblib<0.15,>=0.12 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.14.1)
Processing /Users/fanhong/Library/Caches/pip/wheels/46/91/e3/0fced4f5fbc0a051a5667096826186c9ff60f2d0e9bf0f1cdc/absl_py-0.8.1-py3-none-any.whl
Requirement already satisfied: pandas<2,>=0.24 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.25.3)
Requirement already satisfied: tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (2.3.1)
Requirement already satisfied: tensorflow-transform<0.24,>=0.23 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.23.0)
Requirement already satisfied: six<2,>=1.12 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (1.16.0)
Requirement already satisfied: protobuf<4,>=3.7 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (3.19.4)
Requirement already satisfied: numpy<2,>=1.16 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (1.18.5)
Requirement already satisfied: tfx-bsl<0.24,>=0.23 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.23.0)
Requirement already satisfied: pyarrow<0.18,>=0.17 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-data-validation==0.23.0) (0.17.1)
Collecting tensorflow-metadata<0.24,>=0.23
  Using cached tensorflow_metadata-0.23.0-py3-none-any.whl (43 kB)
Requirement already satisfied: httplib2<0.21.0,>=0.8 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.12.0)
Requirement already satisfied: typing-extensions>=3.7.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (4.2.0)
Requirement already satisfied: fastavro<2,>=0.23.6 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.6.0)
Requirement already satisfied: crcmod<2.0,>=1.7 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.7)
Requirement already satisfied: proto-plus<2,>=1.7.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.22.0)
Requirement already satisfied: python-dateutil<3,>=2.8.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.8.0)
Requirement already satisfied: cloudpickle<3,>=2.1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.1.0)
Requirement already satisfied: hdfs<3.0.0,>=2.1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.7.0)
Requirement already satisfied: grpcio<2,>=1.33.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.47.0)
Requirement already satisfied: pytz>=2018.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2022.1)
Requirement already satisfied: pymongo<4.0.0,>=3.8.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.12.3)
Requirement already satisfied: requests<3.0.0,>=2.24.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.27.1)
Requirement already satisfied: orjson<4.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.7.12)
Requirement already satisfied: dill<0.3.2,>=0.3.1.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.3.1.1)
Requirement already satisfied: pydot<2,>=1.2.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.4.2)
Requirement already satisfied: google-cloud-pubsub<3,>=2.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.13.6)
Requirement already satisfied: google-auth<3,>=1.18.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.10.0)
Requirement already satisfied: google-auth-httplib2<0.2.0,>=0.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.1.0)
Requirement already satisfied: google-cloud-videointelligence<2,>=1.8.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.16.3)
Requirement already satisfied: google-cloud-recommendations-ai<=0.2.0,>=0.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.2.0)
Requirement already satisfied: google-apitools<0.5.32,>=0.5.31; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.5.31)
Requirement already satisfied: google-cloud-dlp<4,>=3.0.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.8.0)
Requirement already satisfied: google-cloud-datastore<2,>=1.8.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.15.4)
Requirement already satisfied: google-cloud-vision<2,>=0.38.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.0.2)
Requirement already satisfied: google-cloud-bigtable<2,>=0.31.1; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.7.2)
Requirement already satisfied: google-cloud-spanner<2,>=1.13.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.19.3)
Requirement already satisfied: grpcio-gcp<1,>=0.2.2; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.2.2)
Requirement already satisfied: google-cloud-bigquery-storage>=2.6.3; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.14.2)
Requirement already satisfied: google-cloud-core<2,>=0.28.1; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.7.3)
Requirement already satisfied: google-cloud-bigquery<3,>=1.6.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.34.4)
Requirement already satisfied: google-cloud-language<2,>=1.3.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.3.2)
Requirement already satisfied: google-cloud-pubsublite<2,>=1.2.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.4.3)
Requirement already satisfied: cachetools<5,>=3.1.0; extra == "gcp" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (4.2.4)
Requirement already satisfied: h5py<2.11.0,>=2.10.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.10.0)
Requirement already satisfied: astunparse==1.6.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.6.3)
Requirement already satisfied: termcolor>=1.1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.1.0)
Requirement already satisfied: wheel>=0.26 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.37.1)
Requirement already satisfied: google-pasta>=0.1.8 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.2.0)
Requirement already satisfied: opt-einsum>=2.3.2 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.3.0)
Requirement already satisfied: tensorboard<3,>=2.3.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.10.0)
Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.3.0)
Requirement already satisfied: wrapt>=1.11.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.14.1)
Requirement already satisfied: gast==0.3.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.3.3)
Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.1.2)
Requirement already satisfied: google-api-python-client<2,>=1.7.11 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tfx-bsl<0.24,>=0.23->tensorflow-data-validation==0.23.0) (1.12.11)
Requirement already satisfied: tensorflow-serving-api!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tfx-bsl<0.24,>=0.23->tensorflow-data-validation==0.23.0) (2.9.1)
Requirement already satisfied: googleapis-common-protos in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-metadata<0.24,>=0.23->tensorflow-data-validation==0.23.0) (1.56.2)
Requirement already satisfied: docopt in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from hdfs<3.0.0,>=2.1.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.6.2)
Requirement already satisfied: charset-normalizer~=2.0.0; python_version >= "3" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.0.12)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5; python_version >= "3" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests<3.0.0,>=2.24.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.3)
Requirement already satisfied: pyparsing>=2.1.4 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from pydot<2,>=1.2.0->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.0.8)
Requirement already satisfied: google-api-core[grpc]!=2.0.*,!=2.1.*,!=2.2.*,!=2.3.*,!=2.4.*,!=2.5.*,!=2.6.*,!=2.7.*,<3.0.0dev,>=1.32.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsub<3,>=2.1.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.8.2)
Requirement already satisfied: grpcio-status>=1.16.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsub<3,>=2.1.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.47.0)
Requirement already satisfied: grpc-google-iam-v1<1.0.0dev,>=0.12.4 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsub<3,>=2.1.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.12.4)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.6" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-auth<3,>=1.18.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (4.8)
Requirement already satisfied: pyasn1-modules>=0.2.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-auth<3,>=1.18.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.2.8)
Requirement already satisfied: fasteners>=0.14 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-apitools<0.5.32,>=0.5.31; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.17.3)
Requirement already satisfied: oauth2client>=1.4.12 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-apitools<0.5.32,>=0.5.31; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (3.0.0)
Requirement already satisfied: packaging<22.0dev,>=14.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-bigquery<3,>=1.6.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (21.3)
Requirement already satisfied: google-resumable-media<3.0dev,>=0.6.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-bigquery<3,>=1.6.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (2.3.3)
Requirement already satisfied: overrides<7.0.0,>=6.0.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-cloud-pubsublite<2,>=1.2.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (6.2.0)
Requirement already satisfied: markdown>=2.6.8 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.4.1)
Requirement already satisfied: setuptools>=41.0.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (61.2.0)
Requirement already satisfied: tensorboard-data-server<0.7.0,>=0.6.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.6.1)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.8.1)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (0.4.6)
Requirement already satisfied: werkzeug>=1.0.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.2.2)
Requirement already satisfied: uritemplate<4dev,>=3.0.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-api-python-client<2,>=1.7.11->tfx-bsl<0.24,>=0.23->tensorflow-data-validation==0.23.0) (3.0.1)
Requirement already satisfied: pyasn1>=0.1.3 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from rsa<5,>=3.1.4; python_version >= "3.6"->google-auth<3,>=1.18.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (0.4.8)
Requirement already satisfied: google-crc32c<2.0dev,>=1.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-resumable-media<3.0dev,>=0.6.0->google-cloud-bigquery<3,>=1.6.0; extra == "gcp"->apache-beam[gcp]<3,>=2.23->tensorflow-data-validation==0.23.0) (1.3.0)
Requirement already satisfied: importlib-metadata>=4.4; python_version < "3.10" in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (4.11.3)
Requirement already satisfied: requests-oauthlib>=0.7.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (1.3.1)
Requirement already satisfied: MarkupSafe>=2.1.1 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from werkzeug>=1.0.1->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (2.1.1)
Requirement already satisfied: zipp>=0.5 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from importlib-metadata>=4.4; python_version < "3.10"->markdown>=2.6.8->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.8.0)
Requirement already satisfied: oauthlib>=3.0.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow!=2.0.*,!=2.1.*,!=2.2.*,<3,>=1.15.2->tensorflow-data-validation==0.23.0) (3.2.0)
Installing collected packages: absl-py, tensorflow-metadata
  Attempting uninstall: absl-py
    Found existing installation: absl-py 0.12.0
    Uninstalling absl-py-0.12.0:
      Successfully uninstalled absl-py-0.12.0
  Attempting uninstall: tensorflow-metadata
    Found existing installation: tensorflow-metadata 1.2.0
    Uninstalling tensorflow-metadata-1.2.0:
      Successfully uninstalled tensorflow-metadata-1.2.0
ERROR: pip's legacy dependency resolver does not consider dependency conflicts when selecting packages. This behaviour is the source of the following dependency conflicts.
tensorflow-serving-api 2.9.1 requires tensorflow<3,>=2.9.1, but you'll have tensorflow 2.3.1 which is incompatible.
Successfully installed absl-py-0.8.1 tensorflow-metadata-0.23.0
Collecting tensorflow-metadata==1.2.0
  Using cached tensorflow_metadata-1.2.0-py3-none-any.whl (48 kB)
Requirement already satisfied: googleapis-common-protos<2,>=1.52.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-metadata==1.2.0) (1.56.2)
Requirement already satisfied: protobuf<4,>=3.13 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from tensorflow-metadata==1.2.0) (3.19.4)
Collecting absl-py<0.13,>=0.9
  Using cached absl_py-0.12.0-py3-none-any.whl (129 kB)
Requirement already satisfied: six in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (from absl-py<0.13,>=0.9->tensorflow-metadata==1.2.0) (1.16.0)
Installing collected packages: absl-py, tensorflow-metadata
  Attempting uninstall: absl-py
    Found existing installation: absl-py 0.8.1
    Uninstalling absl-py-0.8.1:
      Successfully uninstalled absl-py-0.8.1
  Attempting uninstall: tensorflow-metadata
    Found existing installation: tensorflow-metadata 0.23.0
    Uninstalling tensorflow-metadata-0.23.0:
      Successfully uninstalled tensorflow-metadata-0.23.0
ERROR: pip's legacy dependency resolver does not consider dependency conflicts when selecting packages. This behaviour is the source of the following dependency conflicts.
tfx-bsl 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tfx-bsl 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
tensorflow-transform 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tensorflow-transform 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
tensorflow-serving-api 2.9.1 requires tensorflow<3,>=2.9.1, but you'll have tensorflow 2.3.1 which is incompatible.
tensorflow-data-validation 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tensorflow-data-validation 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
Successfully installed absl-py-0.12.0 tensorflow-metadata-1.2.0

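Data preparation

  We use the Chicago taxi trips dataset, which is also the dataset used by the TFDV demo. The data has already been uploaded to OSS, and Alink can read it directly from the links used below, so no further preparation is needed.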

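Data exploration with Alink

Import the pyalink package and enable the local runtime environment.

  In this example, we call useLocalEnv to run Alink jobs locally (that is, inside the DSW container), using multiple threads to simulate distributed computation.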

from pyalink.alink import *
useLocalEnv(2)
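
  The idea of simulating distributed computation with threads can be sketched in plain Python: each worker computes partial statistics over its own slice of the data, and the partial results are then merged. This is only an illustrative sketch, not Alink's implementation; the helper names `partial_stats`, `merge_stats`, and `parallel_stats` and the sample values are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(chunk):
    """Statistics for one partition; None marks a missing value."""
    present = [v for v in chunk if v is not None]
    return {
        "count": len(present),
        "missing": len(chunk) - len(present),
        "sum": sum(present),
        "min": min(present, default=None),
        "max": max(present, default=None),
    }


def merge_stats(a, b):
    """Combine the statistics of two partitions."""
    def pick(fn, x, y):
        candidates = [v for v in (x, y) if v is not None]
        return fn(candidates) if candidates else None

    return {
        "count": a["count"] + b["count"],
        "missing": a["missing"] + b["missing"],
        "sum": a["sum"] + b["sum"],
        "min": pick(min, a["min"], b["min"]),
        "max": pick(max, a["max"], b["max"]),
    }


def parallel_stats(values, parallelism=2):
    """Split the data, compute per-partition statistics in threads, merge."""
    size = max(1, -(-len(values) // parallelism))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        partials = list(pool.map(partial_stats, chunks))
    result = partial_stats([])  # empty statistics as the starting point
    for part in partials:
        result = merge_stats(result, part)
    return result


# Toy values for illustration only.
stats = parallel_stats([1.2, None, 0.5, 3.4, 2.0, None, 8.1], parallelism=2)
print(stats)
```

  Because counts, sums, minima, and maxima merge associatively, the same per-partition logic works whether the partitions live in threads of one process or on separate machines.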

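Explore the training dataset: compute statistics for the training set and explore the per-feature results visually.

  In Alink, data is read through data source components; CsvSourceBatchOp reads CSV data. The file path can be a local path, an HTTP/HTTPS link, or a path on OSS, HDFS, and similar storage. Because the execution engine is strict about data types, you also need to specify the column names and basic data types (schemaStr).

  Like TFDV, Alink uses Facets for visualization. The visualization shows the basic statistics and distribution of each feature and can be explored interactively.

  Note: for the visualization page to render correctly, your network must be able to reach content hosted on GitHub.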

# Column names and basic data types of the data
schemaStr = "pickup_community_area bigint,fare double,trip_start_month int,trip_start_hour int,trip_start_day int,trip_start_timestamp long,pickup_latitude double,pickup_longitude double,dropoff_latitude double,dropoff_longitude double,trip_miles double,pickup_census_tract string,dropoff_census_tract string,payment_type string,company string,trip_seconds double,dropoff_community_area bigint,tips double"
# Specify the data source
source = CsvSourceBatchOp()\
    .setFilePath("https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_train_data.csv")\
    .setSchemaStr(schemaStr)\
    .setIgnoreFirstLine(True)
# Tell Alink to display a visualization of the data statistics
source.lazyVizStatistics()
# Run the job
BatchOperator.execute()
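
  Since the CSV header line is skipped (setIgnoreFirstLine(True)), the column names come from schemaStr, so it can be worth checking that the string lists the columns in the same order as the file. The sketch below parses a schemaStr of the form "name type,name type,..." and compares it with a header line. It assumes columns are matched by position; the helper names `parse_schema_str` and `header_matches` and the shortened sample schema are hypothetical.

```python
def parse_schema_str(schema_str):
    """Split 'name type,name type,...' into a list of (name, type) pairs."""
    columns = []
    for item in schema_str.split(","):
        name, col_type = item.strip().split(None, 1)
        columns.append((name, col_type.strip().lower()))
    return columns


def header_matches(header_line, schema_str):
    """Check that the CSV header lists the same columns in the same order."""
    expected = [name for name, _ in parse_schema_str(schema_str)]
    actual = [col.strip() for col in header_line.split(",")]
    return expected == actual


# Shortened schema for illustration; the real one is defined above.
sample_schema = "fare double, payment_type string, trip_miles double"
parsed = parse_schema_str(sample_schema)
print(parsed)
print(header_matches("fare,payment_type,trip_miles", sample_schema))
```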

Scaling to larger data.

  For larger datasets, you can use usePAIEnv to submit jobs to a large-scale cluster. See help(usePAIEnv) for detailed usage.

help(usePAIEnv)
Help on function usePAIEnv in module pyalink.alink.env:
usePAIEnv(workers=2, memory_per_worker=4096, cpu_per_worker=1, region_id=None, access_key_id=None, access_key_secret=None, workspace_id=None, workspace_name=None, config=None)
    Submit job to PAIFlow
    :param workers           (int)     optional, the default value is 2,
                                       when workers<=0, PyAlink will automatically estimate its value.
    :param memory_per_worker (int)     optional, the default value is 4096
    :param cpu_per_worker    (int)     optional, the default value is 1
    :param region_id         (string)  
    :param access_key_id     (string)  
    :param access_key_secret (string)
    :param workspace_id      (string)  optional, the id of workspace
    :param workspace_name    (string)  optional, the name of workspace
                                       attention: workspace_id and workspace_name must not be None together.
    :param config            (dict)  custom configuration for PyAlink
        - pop_extra_config (dict)     the extra configuration for pop client
        - paiflow_endpoint (str)      the pop endpoint of PAIFlow
        - workspace_endpoint (str)    the pop endpoint of AIWorkspace
        - compute_resource_type (str) options: MaxCompute, Flink, default is MaxCompute
        - compute_resource_env (str)  options: dev, prod, default is dev
        - compute_resource_name (str) the name of computeResource
        - oss_rolearn (str)           the roleArn of oss
        =============================================
        # customize flink-configuration
        # example:
        #   'FLINK_CONFIG_restart-strategy': 'none'
        - FLINK_CONFIG_[key]: value
        # customize the vvp job labels
        - VVP_LABEL_[key]: value
        - jvm_system_properties: dict
        - jvm_startup_options: list[str] the extra commandline options for jvm
        =============================================
        - storage_type: options 'oss', 'MaxCompute'
        ---------------------------------------------
        # when storage_type='oss'
        - oss_endpoint (str)          required, specify the endpoint of OSS
        - oss_base_uri (str)          required, in format of oss://[bucket]/[path]
        - # when the credentials to access OSS is same with global credentials,
          # the following parameters can be ommited.
          # PyAlink will look in several locations when searching for OSS credentials.
          # the order in which PyAlink searches for credentials is:
          # 1. the following credentials parameters in the usePAIEnv() method
          # 2. System Environment: 'OSS_ACCESS_KEY_ID', 'OSS_ACCESS_KEY_SECRET', 'OSS_SECURITY_TOKEN'
          # 3. the parameters of (access_key_id, access_key_secret) in the usePAIEnv() method
          # 4. System Environment: 'ALINK_PAIFLOW_ACCESS_KEY_ID', 'ALINK_PAIFLOW_ACCESS_KEY_SECRET'
          # 5. System Environment: 'ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'
        - oss_access_key_id (str)     optional, the AccessKeyId for your Aliyun Account to access OSS
        - oss_access_key_secret (str) optional, the AccessKeySecret for your Aliyun Account to access OSS
        - oss_security_token (str)    optional, the SecurityToken for your Aliyun Account to access OSS
        ---------------------------------------------
        # when storage_type='MaxCompute'
        - maxcompute_endpoint (str)          required
        - maxcompute_project  (str)          required
        - maxcompute_table_name_prefix (str) optional, the default value is 'pyalink_tmp_'
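
  Based on the help text above, the `config` argument needs different keys depending on `storage_type`. The following sketch assembles such a dictionary and checks the keys that the help text marks as required. The helper `build_pai_config` is hypothetical (it is not part of PyAlink), and all values are placeholders that must be replaced with your own.

```python
# Keys marked "required" in the help text, per storage type.
REQUIRED_KEYS = {
    "oss": ["oss_endpoint", "oss_base_uri"],
    "MaxCompute": ["maxcompute_endpoint", "maxcompute_project"],
}


def build_pai_config(storage_type, **options):
    """Hypothetical helper: build a config dict and check required keys."""
    if storage_type not in REQUIRED_KEYS:
        raise ValueError("unknown storage_type: %r" % storage_type)
    missing = [key for key in REQUIRED_KEYS[storage_type] if key not in options]
    if missing:
        raise ValueError("missing required keys: " + ", ".join(missing))
    return dict(options, storage_type=storage_type)


# Placeholder values only; replace them with your own endpoint and bucket.
config = build_pai_config(
    "oss",
    oss_endpoint="<your-oss-endpoint>",
    oss_base_uri="oss://<bucket>/<path>",
)
print(config)

# With PyAlink available and real credentials, the call would then look like:
# usePAIEnv(workers=4, memory_per_worker=8192, region_id="<region-id>",
#           access_key_id="<id>", access_key_secret="<secret>",
#           workspace_id="<workspace-id>", config=config)
```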

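Data validation with Alink and TFDV

  Alink's statistics integrate seamlessly with the TFDV features that go beyond computing statistics, including visualization, schema inference, and data drift detection.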

train_stats = InternalFullStatsBatchOp().linkFrom(source).collectFullStats().getDatasetFeatureStatisticsList()
import tensorflow_data_validation as tfdv
tfdv.visualize_statistics(train_stats)

  Use the TFDV API to infer the data schema from the statistics (take care not to confuse it with schemaStr in Alink).

schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
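
  To give an intuition for what schema inference produces, here is a deliberately simplified sketch: for each feature it records a type, whether the feature is always present, and, for string features, the set of observed values as the domain. It works on raw toy values rather than on statistics, and the function `infer_toy_schema` and the sample data are hypothetical; the real tfdv.infer_schema is considerably more elaborate.

```python
def infer_toy_schema(columns):
    """columns: dict of feature name -> list of values (None = missing)."""
    schema = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        is_string = all(isinstance(v, str) for v in present)
        schema[name] = {
            "type": "STRING" if is_string else "NUMERIC",
            # Required only if no value is missing in the training data.
            "required": len(present) == len(values),
            # String features get a domain of the values seen so far.
            "domain": sorted(set(present)) if is_string else None,
        }
    return schema


# Toy training data for illustration only.
train_columns = {
    "payment_type": ["Cash", "Credit Card", "Cash"],
    "fare": [5.0, None, 12.5],
}
toy_schema = infer_toy_schema(train_columns)
print(toy_schema)
```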


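 Compute statistics for the evaluation set and compare them with the training set.

 Checking the evaluation-set statistics against the schema reveals the anomalies in the data.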

eval_source = CsvSourceBatchOp()\
    .setFilePath("https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_eval_data.csv")\
    .setSchemaStr(schemaStr)\
    .setIgnoreFirstLine(True)
eval_stats = InternalFullStatsBatchOp().linkFrom(eval_source).collectFullStats().getDatasetFeatureStatisticsList()
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
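
  The check performed above can be pictured with a small sketch: given a schema, report features that are absent, required features with missing values, and string values outside the known domain. This is a simplified illustration on raw toy values, not the logic of tfdv.validate_statistics; the function `find_anomalies`, the schema layout, and the sample data are hypothetical.

```python
def find_anomalies(schema, columns):
    """Compare toy data against a toy schema and describe what is off."""
    anomalies = {}
    for name, spec in schema.items():
        values = columns.get(name)
        if values is None:
            anomalies[name] = "column missing"
            continue
        present = [v for v in values if v is not None]
        if spec["required"] and len(present) < len(values):
            anomalies[name] = "missing values in a required feature"
        elif spec["domain"] is not None:
            unexpected = sorted(set(present) - set(spec["domain"]))
            if unexpected:
                anomalies[name] = "unexpected values: " + ", ".join(unexpected)
    return anomalies


# Toy schema and evaluation data for illustration only.
example_schema = {
    "payment_type": {"required": True, "domain": ["Cash", "Credit Card"]},
    "fare": {"required": False, "domain": None},
}
eval_columns = {
    "payment_type": ["Cash", "Voucher", "Credit Card"],
    "fare": [7.0, None, 3.5],
}
found = find_anomalies(example_schema, eval_columns)
print(found)
```

  When an unexpected value turns out to be legitimate, the usual remedy is to extend the schema's domain rather than to change the data.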

  For more statistics-based features that become available by connecting Alink with TFDV, refer to the original official TFDV example.
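
  One such statistics-based feature is drift detection. For categorical features, TFDV compares two datasets with the L-infinity distance between their value distributions, that is, the largest difference in relative frequency over all values. The sketch below computes that distance in plain Python; the function name `l_infinity_distance`, the toy data, and the threshold are hypothetical, and in TFDV the comparison is configured on the schema rather than called like this.

```python
from collections import Counter


def l_infinity_distance(baseline, current):
    """Largest absolute difference in relative frequency over all values."""
    def normalize(values):
        counts = Counter(values)
        total = sum(counts.values())
        return {key: count / total for key, count in counts.items()}

    p, q = normalize(baseline), normalize(current)
    keys = set(p) | set(q)
    return max((abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys), default=0.0)


# Toy data for illustration only.
baseline_values = ["Cash"] * 6 + ["Credit Card"] * 4
current_values = ["Cash"] * 3 + ["Credit Card"] * 7
distance = l_infinity_distance(baseline_values, current_values)

# Hypothetical threshold; a suitable value depends on the data.
threshold = 0.1
print(distance, distance > threshold)
```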
