YOLOv5: trained model predicts no boxes? mAP is zero?

An mAP of zero caused by `lr_scheduler.step()` being called before `optimizer.step()`

该用户被迫注册

11,076 views · 2022-04-15 23:26:25

2022-04-08
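A few days ago (around April 8) I downloaded YOLOv5 to try object detection.
Installing the dependencies, downloading the weights, and running prediction all went smoothly. Training on my own dataset was where things went wrong: the predictions contained no boxes at all.
Your problem may differ from mine, but if you are a beginner like me, this may be a useful reference.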

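My problem

Looking at the output in the runs/train/exp folder, the images were indeed from my own dataset and the boxes I had labeled were all there, but the curves in results.png looked abnormal.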
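If your problem matches mine, keep reading, since this may help. If it does not, there is no need to spend much time on this post; you are better off looking for a different solution.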
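One way to confirm the symptom without eyeballing the plot is to read the numbers behind it. The sketch below assumes that your YOLOv5 version also writes a results.csv next to the plot and that it contains a column named `metrics/mAP_0.5`; both are assumptions, so check the header of your own file first. The helper name `column_is_all_zero` and the sample data are made up for this example.

```python
import csv
import io


def column_is_all_zero(csv_text, column):
    """Return True if `column` has at least one row and every value in it is 0."""
    reader = csv.reader(io.StringIO(csv_text))
    # Header names may be padded with spaces, so strip them before matching.
    header = [name.strip() for name in next(reader)]
    idx = header.index(column)  # raises ValueError if the column is missing
    values = [float(row[idx]) for row in reader if row]
    return len(values) > 0 and all(v == 0.0 for v in values)


# Tiny made-up sample in the assumed layout (not real training output).
sample = (
    "   epoch,   metrics/mAP_0.5\n"
    "       0,                 0\n"
    "       1,                 0\n"
)
print(column_is_all_zero(sample, "metrics/mAP_0.5"))
```

To run it on a real file, read the file's text first and pass it in, for example the text of runs/train/exp/results.csv if that file exists in your run.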

A similar issue on GitHub

The GitHub issue tracker has a similar report (in English) that is worth reading:
yolov5 does not train or detect
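The prediction results posted by the author of that issue look like this:
[Image: prediction output posted in the issue]
Scrolling down through the terminal output in the issue, a UserWarning appears at Epoch = 0. Check whether your own terminal shows the same message. The Output section of the issue contains the following text, which is also what the screenshot shows: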

UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "

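Roughly: since PyTorch 1.1.0, optimizer.step() must be called before lr_scheduler.step(). If the order is wrong, PyTorch skips the first value of the learning rate schedule.
(My guess, and it is only a guess, is that this causes the training at epoch = 0 to be skipped, which then affects everything accumulated afterwards and makes the final training result invalid. The warning itself only mentions the skipped learning rate value.)
The message then links to the PyTorch documentation for details: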
https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
Following the link, the same WARNING does appear on that page. 👊
[Image: the warning as shown in the PyTorch documentation]
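To make "skipping the first value" concrete, here is a toy model in plain Python. It is not PyTorch's scheduler implementation: the schedule is just a list, and the two branches stand in for the two possible call orders of `optimizer.step()` and `lr_scheduler.step()`. The function name `run_epochs` is made up for this sketch.

```python
def run_epochs(schedule, scheduler_first):
    """Return the learning rate actually used for the update in each epoch.

    `schedule` is a plain list of learning rates; this is a toy stand-in,
    not PyTorch's scheduler implementation.
    """
    position = 0  # index of the current value in the schedule
    used = []
    for _ in range(len(schedule) - 1):
        if scheduler_first:
            position += 1                    # "lr_scheduler.step()" called too early
            used.append(schedule[position])  # "optimizer.step()" sees the next value
        else:
            used.append(schedule[position])  # "optimizer.step()" uses the current value
            position += 1                    # "lr_scheduler.step()" advances afterwards
    return used


schedule = [0.1, 0.05, 0.025]
print(run_epochs(schedule, scheduler_first=False))  # [0.1, 0.05]
print(run_epochs(schedule, scheduler_first=True))   # [0.05, 0.025]
```

With the scheduler called first, the initial value 0.1 is never used for an update, which is the effect the warning describes.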

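Solution

Now that the cause is known, let's look for a fix.
Following the warning and moving lr_scheduler.step() after optimizer.step() seems like it should work. My skills were not up to it, though: after I edited the source, a new problem appeared, so I gave up 🤣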

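Let's see how the author of the issue solved it: by changing versions.
Further down the issue page, another user commented that the cause might be a version problem.
[Image: the comment posted in the issue]
Based on that comment, the issue author found a solution,
which was to change the installed versions.
The command is: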

pip3 install torch==1.9.1+cu102 torchvision==0.10.1+cu102 torchaudio===0.9.1

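The original command ends with the download source -f https://download.pytorch.org/whl/torch_stable.html. I left it off because we usually install from a domestic mirror, and adding it can make the download very slow. Note, however, that the +cu102 builds are served from that page, so if pip reports that it cannot find the requested version, add the -f option back.

If nothing unexpected happens, the problem should be solved at this point.

At the time I did not know that GitHub had a similar issue, so I read the terminal output myself and searched the web for commands to downgrade. Some worked, some did not, and after a lot of fiddling I finally got it working by what felt like black magic.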
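After reinstalling, it is worth confirming which versions actually ended up in the environment. The sketch below uses only the standard library's `importlib.metadata` (Python 3.8+). It assumes the packages were installed with metadata that this module can see, which may not hold for every installation method. The helper names `installed_version` and `split_local_version` are made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def split_local_version(version):
    """Split '1.9.1+cu102' into ('1.9.1', 'cu102'); the second item is None without a '+'."""
    release, _, local = version.partition("+")
    return release, (local or None)


for name in ("torch", "torchvision", "torchaudio"):
    version = installed_version(name)
    if version is None:
        print(f"{name}: not installed")
    else:
        print(f"{name}: {split_local_version(version)}")
```

Run it with the same interpreter you use for training, otherwise it reports the packages of a different environment.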

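Summary:
Read the terminal output, find the cause, fix the problem.

Searching Baidu the moment a problem appears can waste a lot of time and turn up many black-magic fixes, which is not always a good thing.
It is probably better to first check whether the result data looks reasonable and whether the terminal shows anything abnormal, then form a hypothesis about the cause, and only after that turn to search engines and the internet.

Below is what I actually did at the time. If the command above has no effect, try the black magic 😂: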
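Checking the terminal output can be partly automated. The sketch below scans saved log text for the scheduler warning and reports the matching line numbers. It assumes you have captured the training output as text. The pattern also allows backticks around the function names because the exact wording can differ between PyTorch versions. The name `find_scheduler_warnings` and the sample log are made up for this example.

```python
import re

# Matches the warning with or without backticks around the function names.
WARNING_PATTERN = re.compile(
    r"Detected call of `?lr_scheduler\.step\(\)`? before `?optimizer\.step\(\)`?"
)


def find_scheduler_warnings(log_text):
    """Return (line_number, line) pairs for lines that contain the warning."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if WARNING_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Made-up log text, only to show the call.
log = "\n".join([
    "Starting training...",
    "UserWarning: Detected call of lr_scheduler.step() before optimizer.step().",
    "Epoch 0 finished",
])
for number, line in find_scheduler_warnings(log):
    print(number, line)
```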

Uninstall PyTorch
conda commands:

conda uninstall pytorch
conda uninstall libtorch

pip command:

pip uninstall torch

Reinstall PyTorch:

conda install pytorch==1.8.1 torchvision==0.9.1 torchaudio==0.8.1 cudatoolkit=10.2 -c pytorch

Run detect first to check that it works, then run train.
A problem you may run into:

OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

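This happens because more than one copy of libiomp5md.dll is being loaded. Cutting the libiomp5md.dll out of the virtual environment and pasting it somewhere else fixed it for me. The link I followed: https://www.cnblogs.com/Flat-White/p/14678858.html

That covers every problem I ran into. If you have a different problem, I may not be able to help.

If the trained model still predicts no boxes, check whether the dataset is too small or the number of epochs too low.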
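Before moving any file, it helps to see where the copies actually are. The sketch below walks a folder and lists every file named libiomp5md.dll. The function name `find_copies` and the folder layout in the demo are made up; point it at your own environment folder instead.

```python
import tempfile
from pathlib import Path


def find_copies(root, filename="libiomp5md.dll"):
    """Return all files under `root` whose name equals `filename` (case-insensitive)."""
    target = filename.lower()
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and p.name.lower() == target
    )


# Demo on a throwaway folder with a made-up layout, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as env:
    for sub in ("first_copy", "nested/second_copy"):
        folder = Path(env) / sub
        folder.mkdir(parents=True)
        (folder / "libiomp5md.dll").write_bytes(b"")
    for path in find_copies(env):
        print(path.relative_to(env))
```

If it reports more than one copy in your environment, that matches the situation the OMP message describes.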
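A quick way to judge dataset size is to count the images and how many of them have a label file. The sketch below assumes one folder of images and one folder of labels, where each image has a label file with the same name and a .txt extension. That layout, the suffix list, and the name `count_dataset` are assumptions for this example, so adapt them to your dataset.

```python
from pathlib import Path

# Assumed image extensions; extend this set if your dataset uses others.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def count_dataset(images_dir, labels_dir):
    """Return (number of images, number of images with a same-named .txt label file)."""
    images = [
        p for p in Path(images_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    ]
    labelled = [p for p in images if (Path(labels_dir) / (p.stem + ".txt")).is_file()]
    return len(images), len(labelled)
```

Call it with your own two folders, for example `count_dataset("my_images", "my_labels")` with hypothetical folder names. A large gap between the two numbers suggests that many images are being trained without labels.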