Resolved: tensorflow.python.framework.errors_impl.InvalidArgumentError

Resolving the error raised when training a model on multiple GPUs with TensorFlow

life_007

13320 views · 2022-09-30 16:53:43

Problem description: while training a model with TensorFlow 2.2, enabling multi-GPU training produced the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'NcclAllReduce' used by {{node NcclAllReduce}} with these attrs: [reduction="sum", shared_name="c1", T=DT_FLOAT, num_devices=2]
Registered devices: [CPU, GPU, XLA_CPU, XLA_GPU]
Registered kernels:
  <no registered kernels>

	 [[NcclAllReduce]] [Op:__inference_train_function_81214]

The error occurs only when training in parallel on multiple GPUs; training on a single GPU runs without it.

Environment:

Model being trained: YOLOX
OS: Windows 10
tensorflow-gpu version: 2.2.0
Number of GPUs: 2

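Cause of the error:
MirroredStrategy uses NCCL for cross-device communication by default, but NCCL is not available on Windows, so no kernel is registered for the NcclAllReduce op. In other words, the default multi-GPU distributed training mode is not supported on Windows, and the cross-device communication method has to be changed.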
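
To make the cause concrete, here is a minimal sketch of choosing the all-reduce implementation by operating system. The helper names `pick_all_reduce_name` and `make_cross_device_ops` are hypothetical and are not part of the original train.py. `tf.distribute.NcclAllReduce` and `tf.distribute.HierarchicalCopyAllReduce` are real TensorFlow classes. TensorFlow is imported lazily, so the selection logic can be checked on a machine without it.

```python
import platform


def pick_all_reduce_name(system_name=None):
    """Return the name of a tf.distribute cross-device-ops class for this OS.

    Hypothetical helper: the Windows check reflects the cause described above.
    """
    system_name = system_name or platform.system()
    if system_name == "Windows":
        # NCCL is not available on Windows, so use a copy-based all-reduce.
        return "HierarchicalCopyAllReduce"
    # Elsewhere, keep the NCCL-based implementation.
    return "NcclAllReduce"


def make_cross_device_ops(system_name=None):
    """Instantiate the chosen class; requires TensorFlow to be installed."""
    import tensorflow as tf  # imported lazily on purpose

    return getattr(tf.distribute, pick_all_reduce_name(system_name))()
```

The returned object can be passed as the `cross_device_ops` argument of `tf.distribute.MirroredStrategy`.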

Solution:
Search train.py for the keyword 'strategy' (Ctrl+F). You will find the following:

	if ngpus_per_node > 1:
		strategy = tf.distribute.MirroredStrategy( )
	else:
		strategy = None
	print('Number of devices: {}'.format(ngpus_per_node))

Change it to specify the devices and the cross-device communication method explicitly:

	if ngpus_per_node > 1:
		# NCCL is unavailable on Windows, so use a copy-based all-reduce instead
		strategy = tf.distribute.MirroredStrategy(
			devices=["/gpu:0", "/gpu:1"],
			cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
	else:
		strategy = None
	print('Number of devices: {}'.format(ngpus_per_node))
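
The fix above hard-codes two device names. As a rough sketch only, the same idea can be written so that the device list follows `ngpus_per_node`. The helpers `gpu_device_names`, `build_strategy`, and `strategy_scope` are hypothetical names introduced for illustration. How the original train.py consumes `strategy` is not shown in this post, so the scope helper is an assumption, and `contextlib.nullcontext` assumes Python 3.7 or later.

```python
import contextlib


def gpu_device_names(ngpus):
    """Build device strings such as "/gpu:0", "/gpu:1" for the given GPU count."""
    return ["/gpu:{}".format(i) for i in range(ngpus)]


def build_strategy(ngpus):
    """Return a MirroredStrategy that avoids NCCL, or None for a single GPU."""
    if ngpus <= 1:
        return None
    import tensorflow as tf  # imported lazily; only needed for multi-GPU runs

    return tf.distribute.MirroredStrategy(
        devices=gpu_device_names(ngpus),
        cross_device_ops=tf.distribute.HierarchicalCopyAllReduce(),
    )


def strategy_scope(strategy):
    """Return strategy.scope(), or a no-op context when strategy is None."""
    if strategy is None:
        return contextlib.nullcontext()
    return strategy.scope()
```

A training script could then build and compile its model inside `with strategy_scope(build_strategy(ngpus_per_node)):`, which keeps the single-GPU and multi-GPU paths on the same code.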