1. Error stack trace
```
[stdout] [2025-04-09 17:15:19] [dlcxxx-worker-1] /mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/PAI-Megatron-Patch/megatron_patch/proto/:/mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/Megatron-LM:/mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/PAI-Megatron-Patch:
[stderr] [2025-04-09 17:15:22] [dlcxxx-worker-1] ls: 无法访问 '/mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/checkpoint/yp-test/pretrain/iter_*': 没有那个文件或目录
[stdout] [2025-04-09 17:15:22] [dlcxxx-worker-1] load from pretrain checkpoint.
[stdout] [2025-04-09 17:15:22] [dlcxxx-worker-1] torchrun --nproc_per_node 8 --nnodes 8 --node_rank 2 --master_addr dlchrvc3sn61pyy4-master-0 --master_port 23456 pretrain_megatron_alpaca_local.py --tensor-model-parallel-size 2 --pipeline-model-parallel-size 1 --distributed-backend nccl --use-distributed-optimizer --sequence-parallel --distributed-timeout-minutes 60 --add-qkv-bias --num-layers 28 --hidden-size 3584 --num-attention-heads 28 --num-query-groups 4 --ffn-hidden-size 18944 --position-embedding-type rope --max-position-embeddings 8192 --make-vocab-size-divisible-by 128 --norm-epsilon 1e-6 --normalization RMSNorm --swiglu --untie-embeddings-and-output-weights --use-flash-attn --use-rotary-position-embeddings --log-timers-to-tensorboard --log-validation-ppl-to-tensorboard --log-params-norm --log-output-info-by-layer --timing-log-level 1 --tensorboard-queue-size 1 --tensorboard-dir /mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/tensorboard/yp-test_2025.04.09-17.15 --log-batch-size-to-tensorboard --log-validation-ppl-to-tensorboard --tensorboard-outputinfo-log-interval 20 --tensorboard-params-log-interval 20 --attention-dropout 0.0 --hidden-dropout 0.0 --weight-decay 0.1 --clip-grad 1.0 --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-8 --micro-batch-size 1 --global-batch-size 96 --train-iters 3798 --epoch-num 3 --log-interval 1 --disable-bias-linear --no-bias-gelu-fusion --optimizer adam --no-bias-dropout-fusion --no-gradient-accumulation-fusion --no-bias-swiglu-fusion --no-manual-gc-eval --finetune --group-query-attention --manual-gc --use-mcore-models --seed 1234 --init-method-std 0.02 --lr 2e-5 --lr-decay-style cosine --min-lr 2e-6 --override-opt_param-scheduler --lr-decay-iters 3798 --lr-warmup-iters 113 --save /mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/checkpoint/yp-test/pretrain --save-interval 500 --exit-signal-handler --load /mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/checkpoint/checkpoint_v205-xverse-7b/pretrain --no-load-optim --no-load-rng --bf16 --attention-softmax-in-fp32 --loss-scale-window 100 --initial-loss-scale 65536 --eval-interval 100000000 --eval-iters 1000000 --data-impl mmap --data-path /mnt/train-code/einsumdai/dataset/sft_data_v3_250402.json --split 98,2,0 --train-data-disk-size 9426386 --seq-length 8192 --num-workers 0 --tokenizer-model /mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/alpaca-tokenizer/llama-7b/tokenizer.model --dataloader-type cyclic --set-size 96 --set-files-per-dir 1000 --extra-vocab-size 0 --max-padding-length 8192 --patch-tokenizer-type SftTokenizer --patch-tokenizer-model /mnt/train-code/einsumdai/qwen_7b/LLaMA-Megatron/alpaca-tokenizer/llama-7b/
[stdout] [2025-04-09 17:15:26] [dlcxxx-worker-1] [2025-04-09 17:15:25,988] torch.distributed.run: [WARNING]
[stdout] [2025-04-09 17:15:26] [dlcxxx-worker-1] [2025-04-09 17:15:25,988] torch.distributed.run: [WARNING] *****************************************
[stdout] [2025-04-09 17:15:26] [dlcxxx-worker-1] [2025-04-09 17:15:25,988] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[stdout] [2025-04-09 17:15:26] [dlcxxx-worker-1] [2025-04-09 17:15:25,988] torch.distributed.run: [WARNING] *****************************************
[stdout] [2025-04-09 17:15:26] [dlcxxx-worker-1] [W socket.cpp:663] [c10d] The IPv6 network addresses of (dlchrvc3sn61pyy4-master-0, 23456) cannot be retrieved (gai error: -2 - Name or service not known).
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] Traceback (most recent call last):
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/bin/torchrun", line 33, in <module>
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] sys.exit(load_entry_point('torch==2.1.0a0+32f93b1', 'console_scripts', 'torchrun')())
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] return f(*args, **kwargs)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] run(args)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] elastic_launch(
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] return launch_agent(self._config, self._entrypoint, list(args))
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] result = agent.run()
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] result = f(*args, **kwargs)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] result = self._invoke_run(role)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] self._initialize_workers(self._worker_group)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] result = f(*args, **kwargs)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] self._rendezvous(worker_group)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] result = f(*args, **kwargs)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 549, in _rendezvous
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] result = f(*args, **kwargs)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 637, in _assign_worker_ranks
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/agent/server/api.py", line 674, in _share_and_gather
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] role_infos_bytes = store_util.synchronize(
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] agent_data = get_all(store, rank, key_prefix, world_size)
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] data = store.get(f"{prefix}{idx}")
[stdout] [2025-04-09 17:20:27] [dlcxxx-worker-1] RuntimeError: Socket Timeout
```
2. Root cause
The customer's job runs on 8 nodes with 8 GPUs each and is launched with torchrun via Megatron. Because it runs on spot instances, the allocated machines differ in local image state: some already have the image cached, others have to pull it on the spot. As a result the workers start 6-7 minutes apart, and the barrier that torch.distributed.run uses to align the workers times out.
2.1. Stack analysis
The torch version in the customer's log is torch==2.1.0a0+32f93b1, presumably a custom build shipped with pai-megatron-patch.
Version used for the source analysis below: pytorch v2.1.0-rc6.
2.1.1. get_all
torch.distributed.elastic.utils.store.get_all
torch/distributed/elastic/utils/store.py:34
data = store.get(f"{prefix}{idx}")
The socket timeout is raised here. We need to find out what kind of object store is, what its get method does, and whether its timeout can be configured, so we walk back up the stack.
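For reference, the relevant part of get_all in the v2.1 line looks roughly like this (paraphrased and trimmed, not copied from the customer's exact build): it loops over all ranks and blocks on store.get for each key.

```python
# torch/distributed/elastic/utils/store.py -- paraphrased excerpt, trimmed
def get_all(store, rank: int, prefix: str, size: int):
    data_arr = []
    for idx in range(size):
        # Blocks until "{prefix}{idx}" appears on the store; raises
        # "Socket Timeout" once the store's current timeout elapses.
        data = store.get(f"{prefix}{idx}")
        data_arr.append(data)
    return data_arr
```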
2.1.2. synchronize
torch.distributed.elastic.utils.store.synchronize
torch/distributed/elastic/utils/store.py:64
agent_data = get_all(store, rank, key_prefix, world_size)
store is passed straight into get_all here; it is a parameter of synchronize itself, so we keep going up the stack.
2.1.3. _share_and_gather
torch.distributed.elastic.agent.server.api.SimpleElasticAgent._share_and_gather
torch/distributed/elastic/agent/server/api.py:674
role_infos_bytes = store_util.synchronize( store, agent_config_enc, group_rank, group_world_size, key_prefix )
_share_and_gather also receives store as one of its own parameters and simply calls the static synchronize method above, so again we move up the stack.
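A paraphrased excerpt (the key literal is approximate): each agent publishes its own role info and then gathers everyone else's through the shared store.

```python
# torch/distributed/elastic/agent/server/api.py -- paraphrased excerpt
def _share_and_gather(self, store, group_rank: int, group_world_size: int, spec) -> list:
    agent_role_info = _RoleInstanceInfo(spec.role, group_rank, spec.local_world_size)
    key_prefix = "torchelastic/role_info"  # approximate; check the exact literal in your build
    role_infos_bytes = store_util.synchronize(
        store, agent_role_info.serialize(), group_rank, group_world_size, key_prefix
    )
    return [_RoleInstanceInfo.deserialize(b) for b in role_infos_bytes]
```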
2.1.4. _assign_worker_ranks
torch.distributed.elastic.agent.server.api.SimpleElasticAgent._assign_worker_ranks
torch/distributed/elastic/agent/server/api.py:637
role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
Same pattern here: store comes in from the caller and is passed straight through.
2.1.5. _rendezvous
torch.distributed.elastic.agent.server.api.SimpleElasticAgent._rendezvous
This is where the store-related logic lives.
store is the object returned by torch.distributed.elastic.rendezvous.api.RendezvousHandler.next_rendezvous.
The docstring of that abstract method says it returns a torch.distributed.Store, but the method is implemented by several subclasses, so we have to look at how the handler is initialized to know which implementation is actually used.
Judging from the Python code, torch.distributed.Store itself is implemented in C++.
Back in the current method, torch.distributed.elastic.agent.server.api.SimpleElasticAgent._rendezvous:
next_rendezvous is called on the handler hanging off spec,
which comes from spec = worker_group.spec,
and worker_group is passed in by whoever calls _rendezvous.
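The relevant lines of _rendezvous (paraphrased) show where the store is produced and then handed down the call chain we just walked:

```python
# torch/distributed/elastic/agent/server/api.py -- paraphrased excerpt
def _rendezvous(self, worker_group: WorkerGroup) -> None:
    spec = worker_group.spec
    # The store comes from the rendezvous handler attached to the spec.
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
    self._store = store
    # ...and is passed straight into the frames we saw in the stack trace.
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
    worker_group.workers = workers
    worker_group.store = store
    worker_group.group_rank = group_rank
    worker_group.group_world_size = group_world_size
```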
2.1.6. _initialize_workers
torch.distributed.elastic.agent.server.api.SimpleElasticAgent._initialize_workers
It makes the call self._rendezvous(worker_group).
Here worker_group carries a type annotation:
def _initialize_workers(self, worker_group: WorkerGroup) -> None:
So worker_group is of type torch.distributed.elastic.agent.server.api.WorkerGroup,
its spec field is a torch.distributed.elastic.agent.server.api.WorkerSpec,
and spec.rdzv_handler is declared as torch.distributed.elastic.rendezvous.api.RendezvousHandler, which has several subclasses, so we still have to find out which one actually gets instantiated.
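For orientation, the two types look roughly like this (paraphrased and heavily trimmed to the fields that matter here; the exact field list varies between versions):

```python
# torch/distributed/elastic/agent/server/api.py -- paraphrased, heavily trimmed
@dataclass
class WorkerSpec:
    role: str
    local_world_size: int
    rdzv_handler: RendezvousHandler      # <- the object whose next_rendezvous() builds the store
    entrypoint: Union[Callable, str, None] = None
    args: Tuple = ()
    max_restarts: int = 3
    monitor_interval: float = 30

class WorkerGroup:
    def __init__(self, spec: WorkerSpec):
        self.spec = spec
        self.workers = [Worker(local_rank=i) for i in range(spec.local_world_size)]
        # store / group_rank / group_world_size are filled in by _rendezvous()
        self.store = None
        self.group_rank = None
        self.group_world_size = None
        self.state = WorkerState.INIT
```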
2.1.7. run / _invoke_run
torch.distributed.elastic.agent.server.api.SimpleElasticAgent.run calls ->
torch.distributed.elastic.agent.server.api.SimpleElasticAgent._invoke_run, which calls -> _initialize_workers (with self._worker_group as the argument).
That argument is not passed in when _invoke_run is called,
nor when run is called one level further up. The enclosing class, torch.distributed.elastic.agent.server.api.SimpleElasticAgent, has subclasses, so self._worker_group must be set up at construction time. According to PyCharm's hints, run and _invoke_run are not overridden anywhere, so the remaining question is how self._worker_group gets initialized.
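self._worker_group is indeed created in SimpleElasticAgent.__init__ from the spec (paraphrased):

```python
# torch/distributed/elastic/agent/server/api.py -- paraphrased excerpt
class SimpleElasticAgent(ElasticAgent):
    def __init__(self, spec: WorkerSpec, exit_barrier_timeout: float = 300):
        self._worker_group = WorkerGroup(spec)   # <- the object later passed to _initialize_workers
        self._remaining_restarts = self._worker_group.spec.max_restarts
        self._store = None
        self._exit_barrier_timeout = exit_barrier_timeout
```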
2.1.8. launch_agent
The initialization happens in torch.distributed.launcher.api.launch_agent.
It connects to the analysis above through this line: result = agent.run()
The agent class is the following one, which inherits from SimpleElasticAgent:
torch.distributed.elastic.agent.server.local_elastic_agent.LocalElasticAgent
The key question is: on the spec carried by this agent instance, what type of store does spec.rdzv_handler.next_rendezvous() return, and what controls the timeout of that store's get method?
When LocalElasticAgent is constructed, spec is passed in as a parameter; it is an instance of torch.distributed.elastic.agent.server.api.WorkerSpec, whose rdzv_handler is created by
rdzv_registry.get_rendezvous_handler(rdzv_parameters).
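Paraphrased from torch/distributed/launcher/api.py (argument lists trimmed): launch_agent packs the rendezvous settings, including any --rdzv-conf key=value pairs, into RendezvousParameters, builds the handler via get_rendezvous_handler, and stores it on the WorkerSpec handed to LocalElasticAgent.

```python
# torch/distributed/launcher/api.py -- paraphrased excerpt, trimmed
def launch_agent(config: LaunchConfig, entrypoint, args):
    rdzv_parameters = RendezvousParameters(
        backend=config.rdzv_backend,
        endpoint=config.rdzv_endpoint,
        run_id=config.run_id,
        min_nodes=config.min_nodes,
        max_nodes=config.max_nodes,
        **config.rdzv_configs,   # includes "timeout" (and "rank" for the static backend)
    )
    spec = WorkerSpec(
        role=config.role,
        local_world_size=config.nproc_per_node,
        entrypoint=entrypoint,
        args=tuple(args),
        rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
        max_restarts=config.max_restarts,
        monitor_interval=config.monitor_interval,
    )
    agent = LocalElasticAgent(spec=spec, start_method=config.start_method)
    result = agent.run()   # the call we connected to above
    # ... error handling and shutdown omitted
```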
2.1.9. get_rendezvous_handler
torch.distributed.elastic.rendezvous.registry.get_rendezvous_handler
It creates the handler via torch.distributed.elastic.rendezvous.api.RendezvousHandlerRegistry.create_handler, selecting it by the rdzv_backend parameter from the torchrun config.
PyTorch ships four built-in handlers, registered in
torch.distributed.elastic.rendezvous.__init__.py.
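The registration happens roughly like this (paraphrased; _register_default_handlers lives in torch/distributed/elastic/rendezvous/registry.py and is invoked from that package's __init__):

```python
# torch/distributed/elastic/rendezvous/registry.py -- paraphrased excerpt
def _register_default_handlers() -> None:
    handler_registry.register("etcd", _create_etcd_handler)
    handler_registry.register("etcd-v2", _create_etcd_v2_handler)
    handler_registry.register("c10d", _create_c10d_handler)
    handler_registry.register("static", _create_static_handler)

def get_rendezvous_handler(params: RendezvousParameters) -> RendezvousHandler:
    # Looks up the factory registered for params.backend (i.e. rdzv_backend)
    # and lets it construct the concrete handler.
    return handler_registry.create_handler(params)
```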
So the store we are after is what the rdzv_handler on torch.distributed.elastic.agent.server.api.WorkerSpec returns from next_rendezvous. For a multi-node job that is not in standalone mode and does not specify a backend on the torchrun command line, the default is rdzv_backend='static', whose implementation class is
torch.distributed.elastic.rendezvous.static_tcp_rendezvous.StaticTCPRendezvous, and its next_rendezvous method produces the store.
It returns a PrefixStore wrapping a TCPStore that every node initializes according to whether it is the master node. The TCPStore docs mention a default 5-minute timeout that also applies to get, wait, etc., which roughly matches the ~5-minute gap before the exception in the DLC worker log.
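Paraphrased from torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py: next_rendezvous creates the TCPStore (with the listener only on rank 0, i.e. the master agent) and wraps it in a PrefixStore; the timeout passed here is the one we need to trace next.

```python
# torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py -- paraphrased excerpt
class StaticTCPRendezvous(RendezvousHandler):
    def next_rendezvous(self):
        is_master = self.rank == 0            # only the master agent hosts the TCP listener
        if not self._store:
            self._store = TCPStore(
                self.master_addr,
                self.master_port,
                self.world_size,
                is_master,
                self.timeout,                 # timedelta built from the rendezvous config's "timeout"
                multi_tenant=True,
            )
        store = PrefixStore(self.run_id, self._store)
        return store, self.rank, self.world_size
```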
So the next step is to trace where this timeout comes from and whether it can be raised at initialization time, to cope with workers whose start times diverge badly because of image pulls.
2.1.10. Tracing the timeout
Walking the timeout back through the construction chain:
- the rendezvous handler is built from params (RendezvousParameters),
- whose config comes from the torchrun arguments,
- parsed with the argparse defaults in torch.distributed.run.config_from_args,
- which builds a torch.distributed.launcher.api.LaunchConfig;
- with these defaults, the timeout used to initialize the TCPStore ends up being 900 s.
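The 900 s figure comes from LaunchConfig: if neither rdzv_timeout nor a timeout entry in --rdzv-conf is given, __post_init__ injects a 900-second default into rdzv_configs, and that value eventually reaches the TCPStore above (paraphrased from the release source; the customer's build may differ).

```python
# torch/distributed/launcher/api.py -- paraphrased excerpt, most fields omitted
@dataclass
class LaunchConfig:
    rdzv_configs: Dict[str, Any] = field(default_factory=dict)  # filled from --rdzv-conf key=value pairs
    rdzv_timeout: int = -1

    def __post_init__(self):
        default_timeout = 900
        if self.rdzv_timeout != -1:
            self.rdzv_configs["timeout"] = self.rdzv_timeout
        elif "timeout" not in self.rdzv_configs:
            self.rdzv_configs["timeout"] = default_timeout
```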
3. Can the parameter be changed?
Tried manually specifying --rdzv-conf timeout=1800. The release version analyzed above defaults to 900 s, which may differ from the customer's build, but the way the timeout parameter is passed should not have changed.
It did not take effect.
A closer look at the code shows why: we had been focused on the initialization config, but the synchronize method carries its own timeout setting.
Before calling get_all it applies a hard-coded barrier timeout, so in this scenario changing the parameter above does not help.
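Concretely (paraphrased from torch/distributed/elastic/utils/store.py): synchronize has a barrier_timeout parameter defaulting to 300 s, and _share_and_gather does not override it, so right before get_all the store timeout is reset to 5 minutes regardless of what the TCPStore was constructed with. This also matches the roughly 5-minute gap in the worker log (17:15:26 to 17:20:27).

```python
# torch/distributed/elastic/utils/store.py -- paraphrased excerpt
def synchronize(
    store,
    data: bytes,
    rank: int,
    world_size: int,
    key_prefix: str,
    barrier_timeout: float = 300,   # hard-coded 5-minute default; callers do not override it
):
    # Overwrites whatever timeout the TCPStore/PrefixStore was created with,
    # so --rdzv-conf timeout=... has no effect on this particular barrier.
    store.set_timeout(timedelta(seconds=barrier_timeout))
    store.set(f"{key_prefix}{rank}", data)
    agent_data = get_all(store, rank, key_prefix, world_size)
    return agent_data
```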
4. Solutions
- The pytorch build in the customer's project is a pre-release version anyway, apparently built in-house or by the community, so the code above can be patched and the wheel rebuilt (see the sketch below).
- Give up spot instances, buy dedicated resources, and pin the job's scheduling to fixed nodes.
- Use the PAI platform's image acceleration feature: see its documentation.
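For the first option, a minimal sketch of the kind of change involved: raise, or make configurable, the hard-coded barrier_timeout default in torch/distributed/elastic/utils/store.py and rebuild the wheel. The environment variable name below is purely hypothetical, not something upstream PyTorch reads.

```python
# torch/distributed/elastic/utils/store.py -- illustrative local patch, not an official option
import os
from datetime import timedelta

# Hypothetical knob for this private build; upstream PyTorch does not read this variable.
_BARRIER_TIMEOUT_SECONDS = float(os.getenv("ELASTIC_STORE_BARRIER_TIMEOUT", "1800"))

def synchronize(
    store,
    data: bytes,
    rank: int,
    world_size: int,
    key_prefix: str,
    barrier_timeout: float = _BARRIER_TIMEOUT_SECONDS,   # was: 300
):
    store.set_timeout(timedelta(seconds=barrier_timeout))
    store.set(f"{key_prefix}{rank}", data)
    return get_all(store, rank, key_prefix, world_size)
```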