Use directly
Open "How to submit a DLC job with AIMaster fault-tolerance monitoring enabled" and click "Open in DSW" in the upper-right corner.
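Overview
PAI-DLC (Deep Learning Containers) is a deep learning training platform built on Alibaba Cloud Container Service for Kubernetes (ACK). It provides a flexible, stable, easy-to-use, and high-performance environment for deep learning training.
To make large-scale distributed deep learning jobs more robust, PAI-DLC provides fault-tolerance monitoring based on AIMaster. AIMaster is a control component: when a job enables AIMaster fault-tolerance monitoring, an AIMaster instance is launched and runs alongside the job's other instances. It monitors the job, makes fault-tolerance decisions, and controls resources.
This article describes how to use the PAI-DLC Python SDK to submit a DLC job with AIMaster fault-tolerance monitoring enabled.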
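Prerequisites
- PAI-DLC is activated and authorized. For details, see "Cloud product dependencies and authorization: DLC".
- A resource group cluster is available for running training jobs (this article uses the public DLC resource group for demonstration).
- You have obtained the AccessKey ID and AccessKey Secret of your Alibaba Cloud account. For details, see "Obtain an AccessKey".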
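Once you have the AccessKey pair, one way to keep it out of notebook cells is to read it from environment variables. The sketch below is only an illustration: the helper name `load_access_key` and the variable names `ALIBABA_CLOUD_ACCESS_KEY_ID` and `ALIBABA_CLOUD_ACCESS_KEY_SECRET` are assumptions, not something this tutorial prescribes.

```python
import os


def load_access_key(env=None):
    """Return (access_key_id, access_key_secret) read from environment variables.

    The variable names are an assumption made for this sketch; use whatever
    names your environment actually defines.
    """
    env = os.environ if env is None else env
    key_id = env.get("ALIBABA_CLOUD_ACCESS_KEY_ID")
    key_secret = env.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET")
    if not key_id or not key_secret:
        # Fail early with a clear message instead of sending empty credentials.
        raise RuntimeError("AccessKey ID or AccessKey Secret is not set")
    return key_id, key_secret
```

The returned pair can then be passed to `Config(access_key_id=..., access_key_secret=...)` in place of the literal placeholders used in the examples in this article.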
Step 1: Install the Python SDK
Run the following command to install the PAI-DLC Python SDK:
!pip install alibabacloud-pai-dlc20201203 -U -q
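Step 2: Submit a job with AIMaster fault-tolerance monitoring enabled
The key fields of the PAI-DLC Python SDK that relate to AIMaster fault-tolerance monitoring are shown below. In JobSettings, set enable_error_monitoring_in_aimaster to enable AIMaster fault-tolerance monitoring, and set error_monitoring_args to specify the fault-tolerance monitoring parameters.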
from alibabacloud_pai_dlc20201203.models import JobSettings, CreateJobRequest

settings = JobSettings(
    enable_error_monitoring_in_aimaster=True,
    error_monitoring_args="",
)
create_job_req = CreateJobRequest(
    ...
    settings=settings,
)
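Because error_monitoring_args is a single string of `--flag=value` pairs, it can be easier to assemble it from a dictionary than to edit a long literal. The following is a minimal sketch: the helper `build_error_monitoring_args` is hypothetical and not part of the SDK, and the flags are limited to ones that appear in the examples in this article.

```python
def build_error_monitoring_args(options):
    """Join a mapping of flag names to values into a "--flag=value ..." string.

    Hypothetical helper for illustration only; it does not validate flag names.
    """
    return " ".join(
        "--{}={}".format(flag, value) for flag, value in options.items()
    )


# Flags taken from the PyTorch example in this article.
pytorch_args = build_error_monitoring_args({
    "job-execution-mode": "Sync",
    "enable-job-restart": True,
    "enable-job-hang-detection": True,
    "job-hang-interval": 20,
    "max-num-of-same-error": 1,
})
```

The resulting string can be passed as `JobSettings(error_monitoring_args=pytorch_args, ...)`.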
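The following sections give a PyTorch job example and a TensorFlow job example, each with AIMaster fault-tolerance monitoring enabled.
PyTorch job example
This example assumes a PyTorch synchronous training job that unexpectedly hangs while running. The job enables AIMaster fault-tolerance monitoring with two monitoring parameters turned on: job restart and job hang detection. While the job runs, AIMaster restarts it after detecting that it has hung.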
import time

from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSpec, JobSettings
from alibabacloud_tea_openapi.models import Config

workspace_id = "***your existing AI workspace ID***"
region_id = "cn-hangzhou"  # Region, for example cn-hangzhou, cn-shanghai, or cn-shenzhen

config = Config(
    access_key_id="***your access_key_id***",
    access_key_secret="***your access_key_secret***",
    region_id=region_id,
    endpoint="pai-dlc.{}.aliyuncs.com".format(region_id),
)
dlc_client = Client(config)

# Two worker instances for the synchronous PyTorch job.
worker_spec = JobSpec(
    type="Worker",
    pod_count=2,
    image="registry.{}.aliyuncs.com/pai-dlc/pytorch-training:1.8PAI-gpu-py36-cu101-ubuntu18.04".format(region_id),
    ecs_spec="ecs.c6.large",
)

# Enable AIMaster monitoring with job restart and hang detection.
settings = JobSettings(
    enable_error_monitoring_in_aimaster=True,
    error_monitoring_args=(
        "--job-execution-mode=Sync --enable-job-restart=True "
        "--enable-job-hang-detection=True --job-hang-interval=20 "
        "--max-num-of-same-error=1"
    ),
)

create_job_req = CreateJobRequest(
    display_name="TestJobHangWithRetry",
    job_type="PyTorchJob",
    workspace_id=workspace_id,
    job_specs=[worker_spec],
    user_command="sleep 10240",  # Simulates a hung job
    settings=settings,
)
create_job_resp = dlc_client.create_job(create_job_req)
job_id = create_job_resp.body.job_id

# Poll the job status until it reaches a terminal state.
while True:
    job = dlc_client.get_job(job_id).body
    print('job is {}'.format(job.status))
    if job.status in ('Succeeded', 'Failed', 'Stopped'):
        break
    time.sleep(10)
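After the job is submitted, you can view its detailed run information on DLC-Web. The figure below shows how the job ran. Because the job was restarted, duplicate instance names appear.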
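The polling loop in the example above runs until the job reaches a terminal status, which can take a long time for a job that keeps restarting. The sketch below shows one way to add a timeout. The helper `wait_for_job` and its parameters are hypothetical; it takes a callable that returns the current status, so it does not depend on the SDK itself.

```python
import time

# Terminal statuses, as checked by the polling loop in the example.
TERMINAL_STATUSES = ("Succeeded", "Failed", "Stopped")


def wait_for_job(get_status, poll_seconds=10, timeout_seconds=3600,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until a terminal status is returned or the timeout expires.

    Hypothetical helper for illustration; sleep and clock are injectable
    so the function can be exercised without real waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        print("job is {}".format(status))
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError("job did not reach a terminal status in time")
        sleep(poll_seconds)
```

With the client from the example, it could be called as `wait_for_job(lambda: dlc_client.get_job(job_id).body.status)`.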
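TensorFlow job example
This example is a TensorFlow asynchronous training job in which worker-0 deliberately fails after training for a certain number of steps. For the complete test code, see the link (the job below downloads ps_job_test.py in its user_command).
The job enables AIMaster fault-tolerance monitoring with the fault-tolerance policy set to OnFailure: whenever a worker fails, it is restarted unconditionally.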
import time

from alibabacloud_pai_dlc20201203.client import Client
from alibabacloud_pai_dlc20201203.models import CreateJobRequest, JobSpec, JobSettings
from alibabacloud_tea_openapi.models import Config

workspace_id = "***your existing AI workspace ID***"
region_id = "cn-hangzhou"  # Region, for example cn-hangzhou, cn-shanghai, or cn-shenzhen

config = Config(
    access_key_id="***your access_key_id***",
    access_key_secret="***your access_key_secret***",
    region_id=region_id,
    endpoint="pai-dlc.{}.aliyuncs.com".format(region_id),
)
dlc_client = Client(config)

docker_image = "registry.{}.aliyuncs.com/pai-dlc/tensorflow-training:1.15-cpu-py36-ubuntu18.04".format(region_id)

# One parameter server, one chief, and one worker.
ps_spec = JobSpec(
    type="PS",
    pod_count=1,
    image=docker_image,
    ecs_spec="ecs.c6.large",
)
chief_spec = JobSpec(
    type="Chief",  # The original listing had "Worker" here, swapped with worker_spec
    pod_count=1,
    image=docker_image,
    ecs_spec="ecs.c6.large",
)
worker_spec = JobSpec(
    type="Worker",  # The original listing had "Chief" here, swapped with chief_spec
    pod_count=1,
    image=docker_image,
    ecs_spec="ecs.c6.large",
)
job_spec = [ps_spec, chief_spec, worker_spec]

# Enable AIMaster monitoring; restart a worker whenever it fails.
settings = JobSettings(
    enable_error_monitoring_in_aimaster=True,
    error_monitoring_args="--job-execution-mode=Async --fault-tolerant-policy=OnFailure",
)

create_job_req = CreateJobRequest(
    display_name="TestPsJobWithWorkerRetry",
    job_type="TFJob",
    workspace_id=workspace_id,
    job_specs=job_spec,
    user_command="wget https://pai-dlc-regression-test.oss-cn-beijing.aliyuncs.com/fault-tolerance/ps_job_test.py && python ps_job_test.py",
    settings=settings,
)
create_job_resp = dlc_client.create_job(create_job_req)
job_id = create_job_resp.body.job_id

# Poll the job status until it reaches a terminal state.
while True:
    job = dlc_client.get_job(job_id).body
    print('job is {}'.format(job.status))
    if job.status in ('Succeeded', 'Failed', 'Stopped'):
        break
    time.sleep(10)
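After the job is submitted, you can view its detailed run information on DLC-Web. The figure below shows how the job ran. Because the worker-0 instance was restarted, duplicate instance names appear.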