PAI Error Collection: How to Resolve the Error Raised by learn_loss_weight When Version 0.7.5 Is Selected - Alibaba Cloud Developer Community

2024-06-11

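Summary: Alibaba Cloud's Platform for Artificial Intelligence (PAI) is a comprehensive, easy-to-use machine learning and deep learning platform that helps enterprises, developers, and data scientists quickly build, train, deploy, and manage AI models. Various kinds of errors can occur when working with PAI. The following lists some common errors together with their possible causes and fixes.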

Question 1: How do I resolve this Machine Learning PAI build error (cuda_runtime.h not found)?

How can I resolve the following Machine Learning PAI error?

INFO: Found applicable config definition build:dynamic_kernels in file /home/pangjun/BladeDISC_GPU/tf_community/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS

Loading:

WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.

Loading:

Loading: 0 packages loaded

INFO: Build options --action_env, --compilation_mode, --copt, and 2 more have changed, discarding analysis cache.

Analyzing: 2 targets (1 packages loaded, 0 targets configured)

INFO: Analyzed 2 targets (195 packages loaded, 13606 targets configured).

checking cached actions

INFO: Found 1 target and 1 test target...

[0 / 4] [Prepa] BazelWorkspaceStatusAction stable-status.txt

WARNING: /home/pangjun/.cache/bazel/_bazel_pangjun/a92cb0e935d0b101686941713fa06780/external/org_disc_compiler/mlir/disc/BUILD:2133:8: input 'mlir/disc/cutlass' to @org_disc_compiler//mlir/disc:cutlass_header_preprocess is a directory; dependency checking of directories is unsound

[6,483 / 8,118] Compiling llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp; 0s local, remote-cache ... (191 actions, 190 running)

[6,483 / 8,118] Compiling llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp; 1s local, remote-cache ... (192 actions running)

[6,485 / 8,118] Compiling llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp; 2s local, remote-cache ... (191 actions, 190 running)

ERROR: /home/pangjun/.cache/bazel/_bazel_pangjun/a92cb0e935d0b101686941713fa06780/external/org_disc_compiler/mlir/disc/BUILD:2133:8: Executing genrule @org_disc_compiler//mlir/disc:cutlass_header_preprocess failed: (Exit 2): bash failed: error executing command (from target @org_disc_compiler//mlir/disc:cutlass_header_preprocess) /bin/bash -c ... (remaining 1 argument skipped)

cc1plus: fatal error: cuda_runtime.h: No such file or directory

How should the environment variables be set so that the build can find the CUDA runtime in the conda environment?

Reference answer:

Check the path that is resolved through nvcc here, or try adding the conda CUDA path directly at L30.
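The suggestion above amounts to finding which directory actually contains cuda_runtime.h. Below is a minimal Python sketch, assuming a conventional toolkit layout where the header sits in an include directory beside the bin directory that holds nvcc, or somewhere under the conda prefix. The helper names candidate_cuda_roots and find_cuda_include and the candidate subdirectories are illustrative assumptions, not part of BladeDISC or its build scripts.

```python
import os
import shutil


def candidate_cuda_roots(env=None):
    """Collect directories that might be a CUDA toolkit root (assumed layout)."""
    env = os.environ if env is None else env
    roots = []
    nvcc = shutil.which("nvcc", path=env.get("PATH"))
    if nvcc:
        # nvcc normally lives in <root>/bin/nvcc, so go two levels up.
        roots.append(os.path.dirname(os.path.dirname(os.path.realpath(nvcc))))
    for var in ("CUDA_HOME", "CONDA_PREFIX"):
        if env.get(var):
            roots.append(env[var])
    return roots


def find_cuda_include(roots, header="cuda_runtime.h"):
    """Return the first include directory under `roots` that holds `header`."""
    # The sub-paths are guesses: a plain toolkit uses include/, while some
    # conda packages place headers under targets/<arch>/include/.
    subdirs = ("include", os.path.join("targets", "x86_64-linux", "include"))
    for root in roots:
        for sub in subdirs:
            inc = os.path.join(root, sub)
            if os.path.isfile(os.path.join(inc, header)):
                return inc
    return None


if __name__ == "__main__":
    inc = find_cuda_include(candidate_cuda_roots())
    print(inc or "cuda_runtime.h not found under the candidate roots")
```

If a directory is found, it can be compared with the path the build derives from nvcc. Exporting it through a header search variable such as CPATH is one option for gcc, although whether a Bazel build picks that up depends on how the build passes its environment to actions.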

For more answers to this question, see the original thread:

https://developer.aliyun.com/ask/582111

Question 2: Machine Learning PAI reports that a TensorFlow job failed because failover is disabled. How can failover be enabled?

Machine Learning PAI reports the error "TensorFlow job failed immediately after re-launching, since failover is disabled by configuration". Is there a way to enable failover?

Reference answer:

-DautoEnablePsTaskFailover=true -DuseSparseClusterSchema=true

For more answers to this question, see the original thread:

https://developer.aliyun.com/ask/582284

Question 3: With version 0.7.5 selected on Machine Learning PAI, learn_loss_weight raises an error. Which version is recommended now?

With version 0.7.5 selected on Machine Learning PAI, using learn_loss_weight reports the following error:

File "/worker/tensorflow_jobs/easy_rec/python/model/multi_task_model.py", line 192, in get_learnt_loss

raise ValueError('Unsupported loss weight strategy: ' + strategy.Name)

AttributeError: 'int' object has no attribute 'Name'

As far as I can tell, 0.6.3 cannot be used on PAI yet, so which version is recommended now?

Reference answer:

Add loss_weight_strategy: Uncertainty at the same level as losses.
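The traceback in the question shows two things at once. The strategy value reached the "unsupported" branch, and building the error message then failed because the value is a plain int, which has no Name attribute, so an AttributeError hid the intended ValueError. Below is a minimal sketch that reproduces this masking in plain Python. The function get_learnt_loss_sketch, the SUPPORTED table, and its numeric values are hypothetical stand-ins, not EasyRec's code.

```python
# Hypothetical enum numbering, for illustration only.
SUPPORTED = {1: "Uncertainty"}


def get_learnt_loss_sketch(strategy):
    """Mimic the failing branch: the message is built from `strategy.Name`."""
    if strategy in SUPPORTED:
        return SUPPORTED[strategy]
    # An int has no `.Name`, so this line raises AttributeError before the
    # ValueError can be constructed.
    raise ValueError("Unsupported loss weight strategy: " + strategy.Name)


def describe_failure(strategy):
    """Return the name of the exception type raised for `strategy`, or None."""
    try:
        get_learnt_loss_sketch(strategy)
    except (AttributeError, ValueError) as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    print(describe_failure(0))  # a value outside the supported table
    print(describe_failure(1))  # a value inside the supported table
```

In this sketch the unsupported value surfaces as an AttributeError rather than a ValueError, which matches the traceback. Setting a supported strategy, as the answer suggests, avoids that branch altogether.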

For more answers to this question, see the original thread:

https://developer.aliyun.com/ask/580114

Question 4: What is the output type of probs for multi-class classification in Machine Learning PAI EasyRec?

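What is the output type of probs for multi-class classification in Machine Learning PAI EasyRec? The official documentation says string, but I keep getting an error saying the type does not match.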

When predicting with -Doutput_cols='probs string,probs_y double,y bigint', the following error is reported: _common_io.UserException: table/table_record_data.cpp(171): UserException: Value type does not match the column type. Column index: 5, Column type: string, and cast error:Unable to cast Python instance to C++ type (compile in debug mode for details)

Reference answer: