【MindSpore-GPU-1.1.0】【LeNet5】Training fails with "cudaHostAlloc failed"

Posted: 2022-12-12 20:00

MindSpore-GPU-1.1.0

Runtime environment

Windows 10 Home (Chinese edition), version 2004, build 20279.1

WSL2

Ubuntu 18.04

cudatoolkit 1.

cudnn 

conda 

【Steps to reproduce & symptom】

python train.py --device_target="GPU"

【Log information】(optional: log content or attachment)

============== Starting Training ==============

libnuma: Warning: Cannot read node cpumask from sysfs

numa_sched_setaffinity_v2_int() failed; abort

: Invalid argument

set_mempolicy: Function not implemented

numa_sched_setaffinity_v2_int() failed; abort

: Invalid argument

set_mempolicy: Function not implemented

[ERROR] MD(4311,python):2021-01-01-21:18:15.974.858 [mindspore/ccsrc/minddata/dataset/util/arena.cc:242] Init] cudaHostAlloc failed, ret[2], out of memory

[ERROR] KERNEL(4311,python):2021-01-01-21:23:15.982.517 [mindspore/ccsrc/backend/kernel_compiler/gpu/data/dataset_iterator_kernel.cc:114] ReadDevice] Get data timeout

[ERROR] DEVICE(4311,python):2021-01-01-21:23:15.982.628 [mindspore/ccsrc/runtime/device/gpu/gpu_kernel_runtime.cc:652] LaunchKernelDynamic] Op Error: Launch kernel failed. | Error Number: 0

Traceback (most recent call last):

  File "train.py", line 71, in

    dataset_sink_mode=args.dataset_sink_mode)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train

    sink_size=sink_size)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/train/model.py", line 391, in _train

    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/train/model.py", line 452, in _train_dataset_sink_process

    outputs = self._train_network(*inputs)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/nn/cell.py", line 331, in __call__

    out = self.compile_and_run(*inputs)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/nn/cell.py", line 602, in compile_and_run

    return _executor(self, *new_inputs, phase=self.phase)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 582, in __call__

    return self.run(obj, *args, phase=phase)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 610, in run

    return self._exec_pip(obj, *args, phase=phase_real)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 75, in wrapper

    results = fn(*arg, **kwargs)

  File "/home/bamboo/miniconda3/envs/ms/lib/python3.7/site-packages/mindspore/common/api.py", line 593, in _exec_pip

    return self._executor(args_list, phase)

RuntimeError: mindspore/ccsrc/runtime/device/gpu/gpu_kernel_runtime.cc:652 LaunchKernelDynamic] Op Error: Launch kernel failed. | Error Number: 0

Answer:

The error in the log mentions "out of memory", so the host memory is most likely insufficient:

[ERROR] MD(4311,python):.../arena.cc:242] Init] cudaHostAlloc failed, ret[2], out of memory

Please check how much memory the machine has. Your GPU machine is presumably a NUMA architecture, so inspect it with the command `numastat -cm`; normally you will see output similar to the sample below.

When initialization runs on the GPU, the data module tries to allocate 2 GB of host memory. Use the command above to judge whether the machine has enough memory for that.

In addition, the network itself also allocates memory, so run the same command while training is in progress and monitor how memory usage changes, to see whether it is already exhausted (see the monitoring sketch after the sample output below).

$ numastat -cm

Per-node system memory usage (in MBs):

                  Node 0     Node 1      Total
                 --------   --------   --------
MemTotal            12524      12915      25439
MemFree              ****       ****       ****
MemUsed              ****       ****        ***
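As a rough way to do that monitoring, a small helper along the following lines could be run in a second terminal. This is only a sketch: it assumes the psutil package is installed in the same conda environment, it reports overall host memory rather than the per-node view numastat gives, and the script name monitor_mem.py plus the 2 GB threshold (mirroring the data-module allocation mentioned above) are illustrative, not part of MindSpore.

# monitor_mem.py (hypothetical helper): watch host memory while train.py runs
import time
import psutil  # assumed installed, e.g. pip install psutil

DATA_QUEUE_MB = 2048  # roughly the host buffer the data module requests at init

while True:
    vm = psutil.virtual_memory()
    available_mb = vm.available // (1024 * 1024)
    status = ("OK" if available_mb > DATA_QUEUE_MB
              else "below 2 GB free, cudaHostAlloc is likely to fail")
    print("available: %6d MB | used: %3.0f%% | %s"
          % (available_mb, vm.percent, status))
    time.sleep(5)  # sample every few seconds during training

Start it in one shell, then launch python train.py --device_target="GPU" in another; if the available figure drops to under about 2 GB around initialization, the cudaHostAlloc failure above is to be expected.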
