记录一下最近跑TinaFace代码在原来服务器跑没有问题,新服务器跑遇到的错误

首先,按照官网步骤安装相关包:
本人环境:
显卡驱动版本: NVIDIA-SMI 460.73.01 Driver Version: 460.106.00 CUDA Version: 11.2
CUDA版本:nvcc -V: Cuda compilation tools, release 11.1, V11.1.74

pytorch                   1.8.1 
torchvision               0.9.1  
mmcv-full                 1.4.6
mmdet                     2.22.0
cudatoolkit               11.1.1 

ps:如果没有安装上mmcv或者mmdet,不要怀疑,肯定是你的版本有问题。这个问题博主也遇到了。

检查上面版本,通常来讲,没有任何问题。
cudatoolkit严格按照cuda版本安装的,mmdet也是根据cuda版本和pytorch版本安装的。

不出意外的话出意外了,最后一步执行命令 pip install -v -e . 编辑vedadet报错:

Installing collected packages: vedadet
  Running setup.py develop for vedadet
    Running command /mnt/data/cbm/software/anaconda3/envs/lw/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"'; __file__='"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
    running develop
    running egg_info
    writing vedadet.egg-info/PKG-INFO
    writing dependency_links to vedadet.egg-info/dependency_links.txt
    writing requirements to vedadet.egg-info/requires.txt
    writing top-level names to vedadet.egg-info/top_level.txt
    reading manifest file 'vedadet.egg-info/SOURCES.txt'
    adding license file 'LICENSE'
    writing manifest file 'vedadet.egg-info/SOURCES.txt'
    running build_ext
    building 'vedadet.ops.nms.nms_ext' extension
    Emitting ninja build file /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/build.ninja...
    Compiling objects...
    Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
    [1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o.d -DWITH_CUDA -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/TH -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/data/cbm/software/anaconda3/envs/lw/include/python3.8 -c -c /mnt/data1/lw/vedadet-main/vedadet/ops/nms/src/cuda/nms_kernel.cu -o /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=nms_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
    FAILED: /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o
    /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o.d -DWITH_CUDA -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/TH -I/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/mnt/data/cbm/software/anaconda3/envs/lw/include/python3.8 -c -c /mnt/data1/lw/vedadet-main/vedadet/ops/nms/src/cuda/nms_kernel.cu -o /mnt/data1/lw/vedadet-main/build/temp.linux-x86_64-3.8/vedadet/ops/nms/src/cuda/nms_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=nms_ext -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86 -std=c++14
    nvcc fatal   : Unsupported gpu architecture 'compute_86'
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1667, in _run_ninja_build
        subprocess.run(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/subprocess.py", line 516, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/mnt/data1/lw/vedadet-main/setup.py", line 119, in <module>
        setup(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/develop.py", line 34, in run
        self.install_for_development()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/develop.py", line 114, in install_for_development
        self.run_command('build_ext')
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
        _build_ext.build_ext.run(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 708, in build_extensions
        build_ext.build_extensions(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
        _build_ext.build_ext.build_extensions(self)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 449, in build_extensions
        self._build_extensions_serial()
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 474, in _build_extensions_serial
        self.build_extension(ext)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 202, in build_extension
        _build_ext.build_extension(self, ext)
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/distutils/command/build_ext.py", line 528, in build_extension
        objects = self.compiler.compile(sources,
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 529, in unix_wrap_ninja_compile
        _write_ninja_file_and_compile_objects(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1354, in _write_ninja_file_and_compile_objects
        _run_ninja_build(
      File "/mnt/data/cbm/software/anaconda3/envs/lw/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1683, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error compiling objects for extension
ERROR: Command errored out with exit status 1: /mnt/data/cbm/software/anaconda3/envs/lw/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"'; __file__='"'"'/mnt/data1/lw/vedadet-main/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

总结一下,就是报这个错误:nvcc fatal : Unsupported gpu architecture 'compute_86'
意思就是我的GPU算力架构太高了,不支持computer_86。博主GPU算力是3090显卡

网上很多博主说pytorch暂不支持computer_86,有很多建议说是将算力改为computer_75等。
这种自降身价的事,博主是不会做,万一改不回来了,岂不是亏大了。而且也有很多3090显卡的博主降算力后,推理代码时出现各种问题。

解决方案:
虽然 nvcc -V 显示的版本是11.1,但是cuda有个编译版本:一般在/usr/loacal/cuda/文件夹下:
在这里插入图片描述
这可以可以发现,本地安装了多个cuda版本。cuda文件是编译时采用的cuda版本文件的软件链接。即虽然是nvcc -V版本是11.1,但是编译版本是cuda10.2,导致出错。

临时解决方案:

将conda环境从cuda11.1版本降低到10.2,执行指令pip install -v -e .能够编译完成,但是执行代码时候会报错:

UserWarning:  GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. If you want to use the GeForce RTX 30

pytorch 3090显卡对应的cuda版本过低,可以直接将conda环境中的cuda版本升级到11.1,临时解决

注意:虽然两次都是报错cuda问题,第一次需要重新加载编辑本地代码,因此执行了编辑时的cuda版本,但是代码运行时执行是nvcc -V的版本。又会报版本过低。

但是在使用mmdetection框架时,通常都会需要自己设计网络框架后,需要重新编辑本地代码,同样还是需要执行pip install -v -e .问题不会被解决。

永久解决方案:

将编译时用到的cuda版本升级到11.1

相关参考博客如下:
nvcc fatal : Unsupported gpu architecture ‘compute_86‘

安装CUDA时,nvcc --version和cat /usr/local/cuda/version.txt版本不一致

Logo

为开发者提供学习成长、分享交流、生态实践、资源工具等服务,帮助开发者快速成长。

更多推荐