A case where the PyTorch loss becomes NaN [exp overflow, fixed by exploiting the redundancy of softmax]

PuJiang- · Published 2021-11-17 10:49:20

I. The warning message

FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  torch.nn.utils.clip_grad_norm_(WAP_model.parameters(), clip_c)

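After PyTorch emitted this FutureWarning, both the train and valid loss values were reported as NaN.

II. Debugging process

Before loss.backward() the loss always had a finite value, with no NaN. The NaN only appeared during the gradient computation.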

1. Use autograd.detect_anomaly() to turn on anomaly detection for automatic differentiation.

First import torch.autograd:

import torch.autograd as autograd

Then wrap loss.backward() in autograd.detect_anomaly():

        with autograd.detect_anomaly():
            loss.backward()

This raised an error pointing at ExpBackward, so I went through every exp-related computation in the network and checked it for overflow.

RuntimeError: Function 'ExpBackward' returned nan values in its 0th output.
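
To make the step above concrete, here is a minimal sketch that should reproduce the same kind of error outside the original model. The tensors z0 and z1 are hypothetical one-element stand-ins (float32, PyTorch's default dtype), not the author's data, and the 1e-5 term is copied from the expression examined in the next step. The exact node name in the message may differ between PyTorch versions.

```python
import torch

# Hypothetical stand-ins for the model's tensors (float32 by default).
z0 = torch.tensor([92.0], requires_grad=True)   # exp(92) overflows float32
z1 = torch.tensor([1.0], requires_grad=True)

beta = torch.exp(z1) / (torch.exp(z0) + torch.exp(z1) + 1e-5)
loss = beta.sum()
print("loss is finite:", bool(torch.isfinite(loss)))  # forward pass: e / inf = 0

caught = None
try:
    # Anomaly mode makes backward() raise as soon as a node returns NaN.
    with torch.autograd.detect_anomaly():
        loss.backward()
except RuntimeError as err:
    caught = str(err)

print(caught)
```

The forward value is finite because a finite number divided by inf is 0. The NaN only shows up in the backward pass, where the zero gradient arriving at exp(z0) is multiplied by the inf output of exp.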

2. Use torch.isnan().sum()>0 and torch.isinf().sum()>0 to check whether a given tensor contains abnormal values.

beta = torch.exp(z1) / (torch.exp(z0)[:, None] + torch.exp(z1) + 1e-5)
# Report which exp() result contains NaN or inf
if torch.isnan(torch.exp(z1)).sum() > 0:
    print('expz1_nan')
if torch.isinf(torch.exp(z1)).sum() > 0:
    print('expz1_inf')
if torch.isnan(torch.exp(z0)).sum() > 0:
    print('expz0_nan')
if torch.isinf(torch.exp(z0)).sum() > 0:
    print('expz0_inf')
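
The same checks can be folded into a small helper so that several tensors are inspected at once. This is only a sketch: nonfinite_report is a hypothetical name, not a PyTorch function, and the example tensors are made up for illustration.

```python
import torch

def nonfinite_report(**tensors):
    """Return {name: (nan_count, inf_count)} for every tensor that has NaN or inf."""
    report = {}
    for name, t in tensors.items():
        n_nan = int(torch.isnan(t).sum())
        n_inf = int(torch.isinf(t).sum())
        if n_nan or n_inf:
            report[name] = (n_nan, n_inf)
    return report

# Made-up inputs: one entry of z0 is large enough to overflow float32.
z0 = torch.tensor([1.0, 92.0])
z1 = torch.tensor([[0.5], [2.0]])

report = nonfinite_report(exp_z0=torch.exp(z0), exp_z1=torch.exp(z1))
print(report)
```

Only the tensors that actually contain abnormal values appear in the returned dictionary, which keeps the debug output short.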

Debugging again showed that one evaluation of torch.exp(z0) produced inf, so I looked upstream to check whether z0 itself contained abnormal values:

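When an entry of z0 is 92, computing e to the 92nd power overflows: assuming float32 (PyTorch's default dtype), the largest finite value is about 3.4e38, while $e^{92}$ is about 9e39. The corresponding exp therefore evaluates to inf. The gradient of exp is exp itself, so the backward pass multiplies the incoming gradient by inf at this position, and a zero gradient times inf is NaN. The gradient cannot be evaluated correctly there, hence the error.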
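
The arithmetic can be checked in isolation with NumPy. This sketch assumes the model ran in float32, which the post does not state explicitly but which is consistent with 92 being enough to overflow.

```python
import numpy as np

f32_max = np.finfo(np.float32).max          # largest finite float32, about 3.4e38

with np.errstate(over="ignore", invalid="ignore"):
    e88 = np.exp(np.float32(88.0))          # still representable in float32
    e92 = np.exp(np.float32(92.0))          # too large: becomes inf
    grad = np.float32(0.0) * e92            # what the backward pass of exp computes

print(f32_max, e88, e92, grad)
```

The last value shows why the forward pass looks harmless while the backward pass fails: 0 multiplied by inf is NaN under IEEE 754 arithmetic.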

III. Approach

1. Exploit the redundancy of the softmax function. Consider the following example:

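import math
import numpy as np

def softmax(inp):
    # Naive softmax: exponentiate each item, then divide by the sum
    exps = []
    res = 0
    for item in inp:
        exp = math.exp(item)  # raises OverflowError when item is too large
        res = res + exp
        exps.append(exp)
    exps = np.array(exps)
    return exps / res  # 0 / 0 gives nan when every term underflows to 0

inp = [1000, 500, 500]
inp1 = [-1000, -1000, -1000]
try:
    print("overflow:", softmax(inp))
except OverflowError as err:
    print("overflow:", err)  # math.exp(1000) is outside the float range
print("underflow:", softmax(inp1))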

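Overflow: the individual terms are already too large before any division happens. $e^{1000}$ exceeds even the float64 range, and $e^{500}$ exceeds the float32 range. Python's math.exp raises OverflowError in this case, while NumPy and PyTorch return inf.
Underflow: $e^{-1000}$ is so close to 0 that it cannot be represented and underflows to exactly 0, so every term is 0. The softmax then evaluates $\frac{0}{0+0+0}$, whose denominator is 0, and the whole expression is NaN.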

Derivation for softmax:
$\frac{e^{x_j-a}}{\sum_{i=1}^{k}e^{x_i-a}}=\frac{e^{x_j}e^{-a}}{e^{-a}\sum_{i=1}^{k}e^{x_i}}=\frac{e^{x_j}}{\sum_{i=1}^{k}e^{x_i}}$
So every input can be shifted by a constant, x - a, without changing the result.
How should a be chosen? Set a = max(x).
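
The identity can be checked numerically on small inputs where the naive formula is still safe. In this sketch naive_softmax and the sample values are illustrative choices, not part of the original post.

```python
import numpy as np

def naive_softmax(x):
    # Direct translation of the formula, safe only for moderate inputs.
    exps = np.exp(x)
    return exps / exps.sum()

x = np.array([1.0, 2.0, 3.0])
reference = naive_softmax(x)

# Shifting every input by the same constant a leaves the output unchanged.
for a in (3.0, -7.5, x.max()):
    assert np.allclose(reference, naive_softmax(x - a))

print(reference)
```

Any constant works mathematically. Choosing the maximum matters only for the floating-point range of the intermediate exponentials.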

inp = [1000,500,500]
inp1 = [-1000,-1000,-1000]

After subtracting the maximum from each list:

inp = [0,-500,-500]
inp1 = [0,0,0]

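Overflow: the largest term is now $e^{0}=1$, so overflow can no longer occur.
Underflow: because $e^{0}=1$, the denominator always contains one term equal to 1. However small the other terms are, $e^x$ only approaches 0 and is never negative, so the denominator is at least 1 and can never be 0. This removes the NaN caused by a zero denominator.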
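
Applied to the two inputs used earlier, the shifted computation could look like the following sketch. stable_softmax is a hypothetical helper name, and float64 is chosen here to match NumPy's default.

```python
import numpy as np

def stable_softmax(inp):
    x = np.asarray(inp, dtype=np.float64)
    exps = np.exp(x - x.max())    # largest exponent is 0, so exp() is at most 1
    return exps / exps.sum()      # the sum is at least 1, never 0

out_over = stable_softmax([1000, 500, 500])
out_under = stable_softmax([-1000, -1000, -1000])
print("overflow case:", out_over)
print("underflow case:", out_under)
```

Both results are finite and sum to 1: the first puts essentially all probability on the first entry, and the second is uniform.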

2. The fix

Subtract the maximum of z0 before taking the exponentials.
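
The author's actual code for this step is not reproduced here, so the following is only one possible sketch of the idea, under stated assumptions. It assumes z0 has shape (B,) and z1 has shape (B, N), as the broadcasting in the earlier beta expression suggests. It also differs from the description above in two ways: it shifts by the elementwise maximum of z0 and z1 rather than by the maximum of z0 alone, so that a large z1 is covered too, and it drops the 1e-5 term because the shifted denominator is already at least 1. stable_beta is a hypothetical name.

```python
import torch

def stable_beta(z0, z1):
    """Compute exp(z1) / (exp(z0)[:, None] + exp(z1)) with a shift.

    Assumed shapes (inferred, not confirmed by the post): z0 (B,), z1 (B, N).
    """
    z0 = z0[:, None]                    # (B, 1), broadcasts against z1
    shift = torch.maximum(z0, z1)       # elementwise max: every exponent is <= 0
    num = torch.exp(z1 - shift)
    den = torch.exp(z0 - shift) + num   # one of the two terms is exp(0) = 1
    return num / den

# Made-up inputs, including the value 92 that overflowed before.
z0 = torch.tensor([92.0, 0.0], requires_grad=True)
z1 = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)

beta = stable_beta(z0, z1)
beta.sum().backward()
print(beta)
print(z0.grad, z1.grad)
```

Because numerator and denominator are divided by the same factor, the value of beta is unchanged apart from the removed 1e-5 term, while every intermediate exponential stays between 0 and 1, so neither the forward nor the backward pass meets inf.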

IV. References

[1] https://blog.csdn.net/zx1245773445/article/details/86443099
[2] https://www.xarg.org/2016/06/the-log-sum-exp-trick-in-machine-learning/
[3] https://discuss.pytorch.org/t/getting-nan-values-in-backward-pass/83696
