[1,7]:Registered kernels:
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,7]:
[1,7]:   [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
[1,7]:
[1,7]:During handling of the above exception, another exception occurred:
[1,7]:
[1,7]:Traceback (most recent call last):
[1,7]:  File "train.py", line 887, in <module>
[1,7]:    main()
[1,7]:  File "train.py", line 642, in main
[1,7]:    train(sess_config, hooks, model, train_init_op, train_steps,
[1,7]:  File "train.py", line 505, in train
[1,7]:    with tf.train.MonitoredTrainingSession(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 655, in MonitoredTrainingSession
[1,7]:    return MonitoredSession(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1085, in __init__
[1,7]:    super(MonitoredSession, self).__init__(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 800, in __init__
[1,7]:    self._sess = _RecoverableSession(self._coordinated_creator)
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1282, in __init__
[1,7]:    _WrappedSession.__init__(self, self._create_session())
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in _create_session
[1,7]:    return self._sess_creator.create_session()
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 953, in create_session
[1,7]:    self.tf_sess = self._session_creator.create_session()
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in create_session
[1,7]:    return self._get_session_manager().prepare_session(
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
[1,7]:    sess.run(init_op, feed_dict=init_feed_dict)
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
[1,7]:    result = self._run(None, fetches, feed_dict, options_ptr,
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
[1,7]:    results = self._do_run(handle, final_targets, final_fetches,
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
[1,7]:    return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,7]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
[1,7]:    raise type(e)(node_def, op, message)
[1,7]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1], rank=7, id_in_local_rank=0]
[1,7]:Registered devices: [CPU, XLA_CPU]
[1,7]:Registered kernels:
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,7]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,7]:
[1,7]:   [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]

Could you take a look? Has this problem come up before on Machine Learning PAI? I am still using the DeepFM model. Last time I got single-node multi-GPU training working; this time I want to try multi-node multi-GPU, scheduled on YARN. SSH is fully configured, and MPI runs successfully across the nodes.
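On the machine, check the physical device list (for example with `tf.config.experimental.list_physical_devices`) to see whether TensorFlow recognizes the GPU devices. The SOK ops currently have GPU implementations only, which matches the "Registered kernels" list in your log (every entry is device='GPU'). However, the log shows that the process detected only CPU devices ("Registered devices: [CPU, XLA_CPU]"), so no kernel can be placed. I suggest entering the container and checking which devices are visible to TensorFlow there. This answer was compiled from the DingTalk group "DeepRec用户群" (DeepRec User Group).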
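The check suggested above can be sketched roughly as follows. This is a minimal diagnostic under stated assumptions, not part of DeepRec or SOK: the helper names (`registered_devices`, `has_gpu`, `tf_local_device_types`) are made up for illustration, `CUDA_VISIBLE_DEVICES` is only one possible reason a container hides GPUs, and the TensorFlow call used is `device_lib.list_local_devices()` from `tensorflow.python.client`, which is skipped if TensorFlow cannot be imported.

```python
import os
import re

# Excerpt of the failing rank's log, copied from the question.
LOG_EXCERPT = "[1,7]:Registered devices: [CPU, XLA_CPU] [1,7]:Registered kernels:"


def registered_devices(log_text):
    """Extract device names from a 'Registered devices: [...]' log line."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", log_text)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


def has_gpu(device_names):
    """Return True if a plain 'GPU' device type is present in the list."""
    return any(name == "GPU" for name in device_names)


def tf_local_device_types():
    """Return device types TensorFlow sees, or None if TF is not importable."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    return [d.device_type for d in device_lib.list_local_devices()]


if __name__ == "__main__":
    # An empty or "-1" value here hides every GPU from CUDA applications.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

    devices = registered_devices(LOG_EXCERPT)
    print("devices in log:", devices, "| GPU present:", has_gpu(devices))

    live = tf_local_device_types()
    if live is None:
        print("TensorFlow is not importable in this environment")
    else:
        print("devices seen by TensorFlow:", live, "| GPU present:", has_gpu(live))
```

Running this inside the container on each node, ideally launched the same way as the training job (through mpirun under YARN), would show whether the ranks on the second machine see a GPU at all. If the live list contains only CPU and XLA_CPU, the problem is in the container or driver setup rather than in the model code.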