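Quick start
Open "Data exploration and validation with Alink and TFDV" and click "Open in DSW" in the upper-right corner.
Data exploration and validation with Alink and TFDV
Alink's statistics functions support data exploration and data validation: they let you inspect a dataset and provide input for feature engineering.
This is similar to TensorFlow Data Validation (TFDV), but with Alink you do not have to set up a large-scale cluster yourself (Apache Beam plus a Spark or Flink cluster) to run statistical analysis on large datasets on the PAI platform. Alink's results also plug directly into TFDV's data visualization, schema inference, and data drift detection.
In this example notebook, Alink's computation is combined with TFDV to reproduce what the official TFDV example does.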
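Environment requirements
- The official PAI-DSW images ship with PyAlink preinstalled. At least 4 GB of memory is required.
- This notebook can be run as is; no additional files need to be prepared.
- For the visualizations in this notebook to render correctly, your network must be able to reach sites such as GitHub and Google. Otherwise the interactive data exploration charts will not be displayed.
- In addition to the previous item, for best results switch the notebook theme to light first: Settings -> Theme -> JupyterLab Light.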
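Install dependencies
Install tensorflow-metadata and tensorflow-data-validation. The integration with tensorflow-data-validation is built on the data structures that tensorflow-metadata provides.
Note: the installation installs or updates the tensorflow package, but this notebook does not use TensorFlow itself.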
!pip3 install "tensorflow-data-validation==0.23.0" --use-deprecated=legacy-resolver
!pip3 install "tensorflow-metadata==1.2.0" --use-deprecated=legacy-resolver
Requirement already satisfied: tensorflow-data-validation==0.23.0 in /opt/anaconda3/envs/pyalink_public/lib/python3.7/site-packages (0.23.0)
[...]
Installing collected packages: absl-py, tensorflow-metadata
[...]
ERROR: pip's legacy dependency resolver does not consider dependency conflicts when selecting packages. This behaviour is the source of the following dependency conflicts.
tensorflow-serving-api 2.9.1 requires tensorflow<3,>=2.9.1, but you'll have tensorflow 2.3.1 which is incompatible.
Successfully installed absl-py-0.8.1 tensorflow-metadata-0.23.0
Collecting tensorflow-metadata==1.2.0
[...]
Installing collected packages: absl-py, tensorflow-metadata
[...]
ERROR: pip's legacy dependency resolver does not consider dependency conflicts when selecting packages. This behaviour is the source of the following dependency conflicts.
tfx-bsl 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tfx-bsl 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
tensorflow-transform 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tensorflow-transform 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
tensorflow-serving-api 2.9.1 requires tensorflow<3,>=2.9.1, but you'll have tensorflow 2.3.1 which is incompatible.
tensorflow-data-validation 0.23.0 requires absl-py<0.9,>=0.7, but you'll have absl-py 0.12.0 which is incompatible.
tensorflow-data-validation 0.23.0 requires tensorflow-metadata<0.24,>=0.23, but you'll have tensorflow-metadata 1.2.0 which is incompatible.
Successfully installed absl-py-0.12.0 tensorflow-metadata-1.2.0
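Because the legacy resolver reports conflicts without stopping, it can help to confirm which versions actually ended up installed. The following is a minimal sketch, not part of the original notebook: the helper names installed_version and check_pins are hypothetical, and the pins are simply the versions requested by the pip commands above.

```python
# Hypothetical helpers: compare installed package versions against expected pins.
try:
    from importlib import metadata  # available from Python 3.8
except ImportError:  # Python 3.7 relies on the importlib-metadata backport
    import importlib_metadata as metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins):
    """Map each package to (expected, installed, matches) for a dict of pins."""
    report = {}
    for package, expected in pins.items():
        found = installed_version(package)
        report[package] = (expected, found, found == expected)
    return report


# The pins below are the versions requested by the pip commands above.
PINS = {
    "tensorflow-data-validation": "0.23.0",
    "tensorflow-metadata": "1.2.0",
}

for package, (expected, found, ok) in check_pins(PINS).items():
    print(f"{package}: expected {expected}, found {found}, match={ok}")
```

A package that is not installed is reported with found=None rather than raising, so the check can run before or after the installation cell.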
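Data preparation
We use the Chicago taxi trips dataset, which is also the dataset used by the TFDV demo. The data has already been uploaded to OSS, and Alink can read it directly from the links below, so no further preparation is needed:
- Training set: https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_train_data.csv
- Evaluation set: https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_eval_data.csv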
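To get a feel for the files before handing them to Alink, the header and first rows can be inspected with the Python standard library alone. This is a sketch under assumptions: the helper names preview_csv and preview_url are hypothetical, the files are assumed to be UTF-8 encoded, and the in-memory sample uses made-up values rather than real rows from the dataset.

```python
import csv
import io
import itertools
import urllib.request

TRAIN_URL = "https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_train_data.csv"


def preview_csv(text_stream, n_rows=3):
    """Return (header, first n_rows rows) from an open text stream of CSV data."""
    reader = csv.reader(text_stream)
    header = next(reader)
    rows = list(itertools.islice(reader, n_rows))
    return header, rows


def preview_url(url, n_rows=3):
    """Stream a remote CSV file and preview it without reading it to the end."""
    with urllib.request.urlopen(url) as response:
        # Assumption: the file is UTF-8 encoded.
        text_stream = io.TextIOWrapper(response, encoding="utf-8", newline="")
        return preview_csv(text_stream, n_rows)


# Offline demonstration on an in-memory sample (made-up values, not real data).
sample = io.StringIO("fare,payment_type\n12.5,Cash\n7.0,Credit Card\n")
header, rows = preview_csv(sample, n_rows=1)
print(header, rows)
```

With network access, preview_url(TRAIN_URL) would return the real header, which can be compared against the column names given to Alink later.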
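Data exploration with Alink
Import the pyalink package and enable the local execution environment.
In this example we use useLocalEnv to run Alink jobs locally (that is, inside the DSW container), simulating distributed computation with multiple threads.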
from pyalink.alink import *
useLocalEnv(2)
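Explore the training set: compute its statistics and explore the per-feature results visually.
In Alink, data is read through source components; CsvSourceBatchOp reads CSV data. The file path can be a local path, an HTTP/HTTPS link, or a path on OSS, HDFS, or a similar system. Because the execution engine is strict about data types, you also need to specify the column names and basic data types (schemaStr).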
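Writing a long schemaStr by hand is error-prone, so one option is to assemble it from a list of (name, type) pairs. The sketch below is illustrative only: build_schema_str and parse_schema_str are hypothetical helpers, not part of PyAlink, and the parser assumes that type names contain no commas.

```python
def build_schema_str(columns):
    """Join (name, type) pairs into the 'name type,name type' form used by schemaStr."""
    return ",".join(f"{name} {type_name}" for name, type_name in columns)


def parse_schema_str(schema_str):
    """Split a schemaStr back into a list of (name, type) pairs.

    Assumption: type names contain no commas.
    """
    pairs = []
    for item in schema_str.split(","):
        name, type_name = item.strip().split(None, 1)
        pairs.append((name, type_name))
    return pairs


# A few columns taken from the taxi dataset schema used in this notebook.
columns = [
    ("fare", "double"),
    ("payment_type", "string"),
    ("trip_start_hour", "int"),
]
schema_str = build_schema_str(columns)
print(schema_str)

# Round trip: parsing the string gives back the original pairs.
assert parse_schema_str(schema_str) == columns
```

Keeping the columns as a list also makes it easy to check that the number of declared columns matches the CSV header.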
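As in TFDV, visualization in Alink is rendered with Facets. It shows the basic statistics and distribution of each feature and can be explored interactively.
Note: for the visualization page to render correctly, your network must be able to reach content on GitHub.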
# Column names and basic data types
schemaStr = "pickup_community_area bigint,fare double,trip_start_month int,trip_start_hour int,trip_start_day int,trip_start_timestamp long,pickup_latitude double,pickup_longitude double,dropoff_latitude double,dropoff_longitude double,trip_miles double,pickup_census_tract string,dropoff_census_tract string,payment_type string,company string,trip_seconds double,dropoff_community_area bigint,tips double"

# Specify the data source
source = CsvSourceBatchOp()\
    .setFilePath("https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_train_data.csv")\
    .setSchemaStr(schemaStr)\
    .setIgnoreFirstLine(True)

# Tell Alink to visualize the data statistics
source.lazyVizStatistics()

# Execute the job
BatchOperator.execute()
Scaling to larger datasets
For larger datasets, use usePAIEnv to submit the job to a large-scale cluster. See help(usePAIEnv) for details.
help(usePAIEnv)
Help on function usePAIEnv in module pyalink.alink.env:

usePAIEnv(workers=2, memory_per_worker=4096, cpu_per_worker=1, region_id=None, access_key_id=None, access_key_secret=None, workspace_id=None, workspace_name=None, config=None)
    Submit job to PAIFlow
    :param workers (int) optional, the default value is 2, when workers<=0, PyAlink will automatically estimate its value.
    :param memory_per_worker (int) optional, the default value is 4096
    :param cpu_per_worker (int) optional, the default value is 1
    :param region_id (string)
    :param access_key_id (string)
    :param access_key_secret (string)
    :param workspace_id (string) optional, the id of workspace
    :param workspace_name (string) optional, the name of workspace
        attention: workspace_id and workspace_name must not be None together.
    :param config (dict) custom configuration for PyAlink
        - pop_extra_config (dict) the extra configuration for pop client
        - paiflow_endpoint (str) the pop endpoint of PAIFlow
        - workspace_endpoint (str) the pop endpoint of AIWorkspace
        - compute_resource_type (str) options: MaxCompute, Flink, default is MaxCompute
        - compute_resource_env (str) options: dev, prod, default is dev
        - compute_resource_name (str) the name of computeResource
        - oss_rolearn (str) the roleArn of oss
        =============================================
        # customize flink-configuration
        # example:
        #   'FLINK_CONFIG_restart-strategy': 'none'
        - FLINK_CONFIG_[key]: value
        # customize the vvp job labels
        - VVP_LABEL_[key]: value
        - jvm_system_properties: dict
        - jvm_startup_options: list[str] the extra commandline options for jvm
        =============================================
        - storage_type: options 'oss', 'MaxCompute'
        ---------------------------------------------
        # when storage_type='oss'
        - oss_endpoint (str) required, specify the endpoint of OSS
        - oss_base_uri (str) required, in format of oss://[bucket]/[path]
        -
        # when the credentials to access OSS is same with global credentials,
        # the following parameters can be ommited.
        # PyAlink will look in several locations when searching for OSS credentials.
        # the order in which PyAlink searches for credentials is:
        # 1. the following credentials parameters in the usePAIEnv() method
        # 2. System Environment: 'OSS_ACCESS_KEY_ID', 'OSS_ACCESS_KEY_SECRET', 'OSS_SECURITY_TOKEN'
        # 3. the parameters of (access_key_id, access_key_secret) in the usePAIEnv() method
        # 4. System Environment: 'ALINK_PAIFLOW_ACCESS_KEY_ID', 'ALINK_PAIFLOW_ACCESS_KEY_SECRET'
        # 5. System Environment: 'ALIBABA_CLOUD_ACCESS_KEY_ID', 'ALIBABA_CLOUD_ACCESS_KEY_SECRET'
        - oss_access_key_id (str) optional, the AccessKeyId for your Aliyun Account to access OSS
        - oss_access_key_secret (str) optional, the AccessKeySecret for your Aliyun Account to access OSS
        - oss_security_token (str) optional, the SecurityToken for your Aliyun Account to access OSS
        ---------------------------------------------
        # when storage_type='MaxCompute'
        - maxcompute_endpoint (str) required
        - maxcompute_project (str) required
        - maxcompute_table_name_prefix (str) optional, the default value is 'pyalink_tmp_'
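Based on the parameters listed in the help text, the arguments for usePAIEnv could be assembled as follows. This is only a sketch: build_pai_env_kwargs is a hypothetical helper, every value shown is a placeholder, and reading the credentials from the ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET environment variables is one possible choice among the lookup locations the help text describes.

```python
import os


def build_pai_env_kwargs(region_id, workspace_id, oss_endpoint, oss_base_uri,
                         workers=2, environ=None):
    """Assemble keyword arguments for usePAIEnv from explicit values and env vars."""
    environ = os.environ if environ is None else environ
    return {
        "workers": workers,
        "memory_per_worker": 4096,  # default value from the help text
        "cpu_per_worker": 1,        # default value from the help text
        "region_id": region_id,
        # None is passed through when a variable is unset (same as the default).
        "access_key_id": environ.get("ALIBABA_CLOUD_ACCESS_KEY_ID"),
        "access_key_secret": environ.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET"),
        "workspace_id": workspace_id,
        "config": {
            "storage_type": "oss",
            "oss_endpoint": oss_endpoint,
            "oss_base_uri": oss_base_uri,
        },
    }


# Placeholder values: replace them with your own region, workspace and bucket.
kwargs = build_pai_env_kwargs(
    region_id="cn-beijing",
    workspace_id="<your-workspace-id>",
    oss_endpoint="<your-oss-endpoint>",
    oss_base_uri="oss://<your-bucket>/<your-path>",
)
print(sorted(kwargs))

# With PyAlink installed, the environment would then be selected with:
# from pyalink.alink import usePAIEnv
# usePAIEnv(**kwargs)
```

Keeping the credentials in environment variables avoids storing them in the notebook itself.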
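Data validation with Alink and TFDV
Alink's statistics integrate with everything in TFDV other than statistics computation itself, including data visualization, schema inference, and data drift detection.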
train_stats = InternalFullStatsBatchOp().linkFrom(source).collectFullStats().getDatasetFeatureStatisticsList()
import tensorflow_data_validation as tfdv
tfdv.visualize_statistics(train_stats)
Use the TFDV API to infer the data schema from the statistics (take care not to confuse it with schemaStr in Alink).
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
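Compute the statistics of the evaluation set and compare them with those of the training set.
Checking the evaluation-set statistics against the schema reveals the anomalies in the data.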
eval_source = CsvSourceBatchOp()\
    .setFilePath("https://alink-release.oss-cn-beijing.aliyuncs.com/data-files/chicago_eval_data.csv")\
    .setSchemaStr(schemaStr)\
    .setIgnoreFirstLine(True)
eval_stats = InternalFullStatsBatchOp().linkFrom(eval_source).collectFullStats().getDatasetFeatureStatisticsList()
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
For more statistics-based functionality built on the Alink and TFDV integration, see the original official TFDV example.
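One such function is drift detection for categorical features, which TFDV expresses as an L-infinity distance between the value distributions of two datasets. The sketch below illustrates that distance in plain Python; it is not TFDV's implementation, and the helper names, sample values, and threshold are all hypothetical.

```python
from collections import Counter


def normalize(counts):
    """Turn raw value counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(counts_a, counts_b):
    """Largest absolute difference in relative frequency over all values."""
    freq_a, freq_b = normalize(counts_a), normalize(counts_b)
    values = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(v, 0.0) - freq_b.get(v, 0.0)) for v in values)


# Made-up payment_type samples, for illustration only (not real dataset values).
train = Counter(["Cash"] * 60 + ["Credit Card"] * 40)
evaluation = Counter(["Cash"] * 45 + ["Credit Card"] * 50 + ["Unknown"] * 5)

distance = l_infinity_distance(train, evaluation)
THRESHOLD = 0.1  # hypothetical threshold; choose one that suits your data
print(f"L-infinity distance: {distance:.2f}, drift flagged: {distance > THRESHOLD}")
```

A value that appears in only one of the two datasets contributes its full frequency to the distance, which is why new categories tend to be flagged quickly. The sketch assumes both inputs are non-empty.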