Question 1: How can this Machine Learning PAI build error be resolved?
How can this Machine Learning PAI error be resolved?
INFO: Found applicable config definition build:dynamic_kernels in file /home/pangjun/BladeDISC_GPU/tf_community/.bazelrc: --define=dynamic_loaded_kernels=true --copt=-DAUTOLOAD_DYNAMIC_KERNELS
Loading:
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
Loading:
Loading: 0 packages loaded
INFO: Build options --action_env, --compilation_mode, --copt, and 2 more have changed, discarding analysis cache.
Analyzing: 2 targets (1 packages loaded, 0 targets configured)
INFO: Analyzed 2 targets (195 packages loaded, 13606 targets configured).
checking cached actions
INFO: Found 1 target and 1 test target...
[0 / 4] [Prepa] BazelWorkspaceStatusAction stable-status.txt
WARNING: /home/pangjun/.cache/bazel/_bazel_pangjun/a92cb0e935d0b101686941713fa06780/external/org_disc_compiler/mlir/disc/BUILD:2133:8: input 'mlir/disc/cutlass' to @org_disc_compiler//mlir/disc:cutlass_header_preprocess is a directory; dependency checking of directories is unsound
[6,483 / 8,118] Compiling llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp; 0s local, remote-cache ... (191 actions, 190 running)
[6,483 / 8,118] Compiling llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp; 1s local, remote-cache ... (192 actions running)
[6,485 / 8,118] Compiling llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp; 2s local, remote-cache ... (191 actions, 190 running)
ERROR: /home/pangjun/.cache/bazel/_bazel_pangjun/a92cb0e935d0b101686941713fa06780/external/org_disc_compiler/mlir/disc/BUILD:2133:8: Executing genrule @org_disc_compiler//mlir/disc:cutlass_header_preprocess failed: (Exit 2): bash failed: error executing command (from target @org_disc_compiler//mlir/disc:cutlass_header_preprocess) /bin/bash -c ... (remaining 1 argument skipped)
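cc1plus: fatal error: cuda_runtime.h: No such file or directory
How should the environment variables be set so that the build can find the CUDA runtime inside the conda environment?
Reference answer:
Compare against the path that is resolved through nvcc here, or try adding the conda CUDA path directly at L30.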
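Before editing any path, it can help to confirm where cuda_runtime.h actually lives inside the conda environment. The following is a minimal sketch, not part of the original answer: it assumes the environment is activated (so CONDA_PREFIX is set) and that the header was installed somewhere under that prefix. The helper names find_cuda_include_dirs and cpath_with are hypothetical. CPATH is the GCC environment variable for extra include directories; whether the failing genrule sees it depends on how the build forwards environment variables (for example through Bazel's --action_env), which should be verified for this project.

```python
import os
from pathlib import Path


def find_cuda_include_dirs(prefix):
    """Return every directory under `prefix` that contains cuda_runtime.h."""
    root = Path(prefix)
    if not root.is_dir():
        return []
    # A set removes duplicates; sorting keeps the result deterministic.
    return sorted({str(p.parent) for p in root.rglob("cuda_runtime.h")})


def cpath_with(include_dirs, current=""):
    """Prepend include_dirs to a CPATH-style string, dropping duplicates."""
    parts = list(include_dirs) + [p for p in current.split(os.pathsep) if p]
    return os.pathsep.join(dict.fromkeys(parts))  # keeps first occurrence


if __name__ == "__main__":
    # Assumption: the conda environment is activated, so CONDA_PREFIX is set.
    prefix = os.environ.get("CONDA_PREFIX", "")
    dirs = find_cuda_include_dirs(prefix) if prefix else []
    if dirs:
        print("export CPATH=" + cpath_with(dirs, os.environ.get("CPATH", "")))
    else:
        print("cuda_runtime.h not found under CONDA_PREFIX")
```

If the script prints a directory, that is the path to compare with what nvcc resolves, or to add at the location the answer refers to.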
For more answers to this question, see the original post:
https://developer.aliyun.com/ask/582111
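Question 2: Machine Learning PAI reports that a TensorFlow job failed because failover is disabled. How can it be enabled?
Machine Learning PAI reports "TensorFlow job failed immediately after re-launching, since failover is disabled by configuration". Is there a way to enable failover?
Reference answer: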
-DautoEnablePsTaskFailover=true -DuseSparseClusterSchema=true
For more answers to this question, see the original post:
https://developer.aliyun.com/ask/582284
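Question 3: With Machine Learning PAI version 0.7.5 selected, which version is currently recommended?
With Machine Learning PAI version 0.7.5 selected,
using learn_loss_weight raises the following error: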
File "/worker/tensorflow_jobs/easy_rec/python/model/multi_task_model.py", line 192, in get_learnt_loss
raise ValueError('Unsupported loss weight strategy: ' + strategy.Name)
AttributeError: 'int' object has no attribute 'Name'
I see that 0.6.3 cannot be used on PAI yet, so which version is recommended now?
Reference answer:
Add loss_weight_strategy: Uncertainty at the same level as losses in the config.
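One detail of the traceback is worth spelling out: the failing line was trying to raise a ValueError about an unsupported loss weight strategy, but building the message itself failed because the strategy value is a plain int (protobuf enum fields are exposed as ints in Python). The sketch below reproduces only that mechanism. describe_strategy and error_for are hypothetical stand-ins written for illustration, not EasyRec's actual code.

```python
def describe_strategy(strategy):
    # Hypothetical stand-in for the line shown in the traceback: it assumes
    # `strategy` has a .Name attribute, but an enum value arrives as an int.
    raise ValueError('Unsupported loss weight strategy: ' + strategy.Name)


def error_for(strategy):
    """Return (exception type name, message) raised for a strategy value."""
    try:
        describe_strategy(strategy)
    except (ValueError, AttributeError) as exc:
        return type(exc).__name__, str(exc)


kind, message = error_for(1)
print(kind, "-", message)
```

Read this way, the AttributeError masks the intended "Unsupported loss weight strategy" error, which is consistent with the answer's advice to set loss_weight_strategy explicitly.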
For more answers to this question, see the original post:
https://developer.aliyun.com/ask/580114
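Question 4: What is the output type of probs for EasyRec multi-class classification in Machine Learning PAI?
What is the output type of probs for EasyRec multi-class classification in Machine Learning PAI? The official documentation says string, but I keep getting an error saying the type is wrong.
At prediction time, with -Doutput_cols='probs string,probs_y double,y bigint', the job fails with:
_common_io.UserException: table/table_record_data.cpp(171): UserException: Value type does not match the column type. Column index: 5, Column type: string, and cast error:Unable to cast Python instance to C++ type (compile in debug mode for details)
Reference answer:
Try float,
or try double.
If the task has 3 classes, the setting should be num_class: 3. This is rarely used, though.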
https://easyrec.readthedocs.io/en/latest/models/multi_cls.html
For more answers to this question, see the original post:
https://developer.aliyun.com/ask/577018
Question 5: What causes this error in Machine Learning PAI?
What causes this error in Machine Learning PAI?
Exception in thread QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany:
CancelledError: Step was cancelled by an explicit call to Session::Close()
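Reference answer:
This error means that, while the job was running on Machine Learning PAI, a step was cancelled by an explicit call to Session::Close(). This usually happens because the session was closed manually or the training process was stopped while queue runner threads were still active.
To resolve it, you can try the following:
- Check whether your code explicitly calls Session::Close() to stop training. If it does, remove or comment out those lines.
- If you are running the code in Jupyter Notebook or another interactive environment, restart the kernel and run the code again. Restarting the kernel sometimes clears transient problems.
- If the problem persists, consult the official PAI documentation or community support to see whether others have hit a similar issue and found a solution.
For more answers to this question, see the original post: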