On the problem of multiple backward() calls in PyTorch: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

When a piece of code contains two backward passes (backward()), the following error may appear:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Under what circumstances does this error appear? Let's first build a scenario.

import torch
from torch import nn as nn
from torch.nn import functional as F
from torch import optim

To keep the problem simple, we build two identical neural networks.

class Net_1(nn.Module):
    def __init__(self):
        super(Net_1, self).__init__()

        self.linear_1 = nn.Linear(1,10)
        self.linear_2 = nn.Linear(10,1)

    def forward(self,x):
        x = self.linear_1(x)
        x = F.relu(x)
        x = self.linear_2(x)
        x = F.softmax(x,dim=1)
        return x

class Net_2(nn.Module):
    def __init__(self):
        super(Net_2,self).__init__()

        self.linear_1 = nn.Linear(1,10)
        self.linear_2 = nn.Linear(10,1)

    def forward(self, x):
        x = self.linear_1(x)
        x = F.relu(x)
        x = self.linear_2(x)
        x = F.softmax(x,dim=1)
        return x

Execution flow
Define the two models Net_1 and Net_2, an optimizer for each of them (optimizer_n1, optimizer_n2), and the loss function criterion:

n_1 = Net_1()
n_2 = Net_2()

optimizer_n1 = optim.Adam(n_1.parameters(),lr=0.001)
optimizer_n2 = optim.Adam(n_2.parameters(),lr=0.001)
criterion = nn.MSELoss()

The training loop is as follows:

for i in range(10):
    x = torch.randn(10,1).float()
    y = 2 * x

    pred_n1 = n_1(x)
    optimizer_n1.zero_grad()
    loss_n1 = criterion(y,pred_n1)
    loss_n1.backward()
    optimizer_n1.step()

    pred_n2 = n_2(pred_n1)
    optimizer_n2.zero_grad()
    loss_n2 = criterion(y,pred_n2)
    loss_n2.backward()
    optimizer_n2.step()

Point to note: the characteristic of this loop is that pred_n1, the output of the first network, is fed into the second network, so it also takes part in the second network's backward pass.

We know that once loss_n1.backward() has run, the computed tensors are kept but the computation graph itself is freed. The second loss therefore fails during its backward pass, because it still needs the graph that was built through the first network.
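As a side note, this freed-graph behavior is easy to see on a tiny self-contained example (the tensors below are made up purely for illustration):

import torch

w = torch.randn(3, requires_grad=True)

loss = (w * 2).sum()
loss.backward()                    # the graph of `loss` is freed after this call
# loss.backward()                  # a second call here would raise:
                                   # "Trying to backward through the graph a second time"

loss = (w * 2).sum()
loss.backward(retain_graph=True)   # keep the graph alive ...
loss.backward()                    # ... so a second backward through it works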

So we pass retain_graph=True to the first backward() to keep its computation graph:
One small detail here:
In fact retain_graph=True could be added to both loss_n1.backward() and loss_n2.backward(). We add it only to the first one because, if the second backward also kept its graph, the graphs built in each iteration of the for loop would pile up in memory and never be released, which is a real memory burden. In general, once the graphs of the current iteration are no longer needed, we let them be freed.

for i in range(10):
    x = torch.randn(10,1).float()
    y = 2 * x

    pred_n1 = n_1(x)
    optimizer_n1.zero_grad()
    loss_n1 = criterion(y,pred_n1)
    loss_n1.backward(retain_graph=True)
    optimizer_n1.step()

    pred_n2 = n_2(pred_n1)
    optimizer_n2.zero_grad()
    loss_n2 = criterion(y,pred_n2)
    loss_n2.backward()
    optimizer_n2.step()

After this change, run the code again:
It still fails.
This time the error is exactly the one in the title.
First, as the hint suggests, call torch.autograd.set_detect_anomaly(True) once:

import torch
from torch import nn as nn
from torch.nn import functional as F
from torch import optim

torch.autograd.set_detect_anomaly(True)

The error output now becomes:

D:\software\anaconda3\envs\pytorch\lib\site-packages\torch\autograd\__init__.py:154: UserWarning: Error detected in AddmmBackward0. Traceback of forward call that caused the error:
  File "D:\code_work\reinforcement_learning\.pytest_cache\bark_test.py", line 49, in <module>
    pred_n1 = n_1(x)
  File "D:\software\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\code_work\reinforcement_learning\.pytest_cache\bark_test.py", line 18, in forward
    x = self.linear_2(x)
  File "D:\software\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\software\anaconda3\envs\pytorch\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
  File "D:\software\anaconda3\envs\pytorch\lib\site-packages\torch\nn\functional.py", line 1848, in linear
    return torch._C._nn.linear(input, weight, bias)
 (Triggered internally at  ..\torch\csrc\autograd\python_anomaly_mode.cpp:104.)
  Variable._execution_engine.run_backward(
Traceback (most recent call last):
  File "D:\code_work\reinforcement_learning\.pytest_cache\bark_test.py", line 58, in <module>
    loss_n2.backward()
  File "D:\software\anaconda3\envs\pytorch\lib\site-packages\torch\_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "D:\software\anaconda3\envs\pytorch\lib\site-packages\torch\autograd\__init__.py", line 154, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [10, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Both the error we got before adding retain_graph=True and this one point to loss_n2.backward(). The message above traces, in detail, the forward computation whose backward fails; the operation it singles out in our own code is:

x = self.linear_2(x)

In this computation pred_n1 participates, so when loss_n2.backward() runs, gradients also try to flow back through n_1's graph, whose parameters have already been modified in place by optimizer_n1.step(); this is exactly the version conflict reported above. We must therefore make sure that, while loss_n2.backward() executes, no gradient is required through that part of the graph (requires_grad=False there), otherwise the conflict occurs.
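To make the conflict concrete, here is a minimal, hypothetical sketch of the same mechanism, with a single Linear layer standing in for n_1: the in-place parameter update performed by step() bumps the version counter of a tensor that the older graph still needs.

import torch
from torch import nn, optim

lin = nn.Linear(1, 1)
opt = optim.SGD(lin.parameters(), lr=0.1)

x = torch.randn(4, 1, requires_grad=True)    # input requires grad, so backward needs the saved weight
out = lin(x)                                 # the graph saves lin.weight at its current version
loss_1 = out.sum()
loss_1.backward(retain_graph=True)
opt.step()                                   # in-place update: lin.weight's version changes

loss_2 = (out * 2).sum()                     # still depends on the old graph through `out`
# loss_2.backward()                          # would raise the same
                                             # "modified by an inplace operation" version error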
To achieve this, we apply a detach() at the initial part of the second network's forward pass to stop requires_grad there, i.e.:

x = self.linear_1(x).detach()

Why add it only at the initial position and nowhere else?
Quite simply, because the link back to the earlier graph enters through the start of the forward pass: once the tensor is detached there, everything computed after it belongs to a fresh graph that no longer references n_1, so no further detach is needed downstream. The loss value itself is just a scalar produced from this new graph, so nothing else has to change.
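For reference, detach() returns a tensor that shares the same data but is cut out of the autograd graph, so nothing computed from it sends gradients back to where it came from (a tiny standalone illustration):

import torch

a = torch.randn(2, 2, requires_grad=True)
b = (a * 3).detach()        # b shares data with a * 3 but has no grad history
print(b.requires_grad)      # False
print(b.grad_fn)            # None: the link back to `a` is cut

c = (b ** 2).sum()
print(c.requires_grad)      # False: nothing downstream reaches `a` either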
With this change, the error above is resolved. The full modified code is given below so you can compare:

import torch
from torch import nn as nn
from torch.nn import functional as F
from torch import optim


class Net_1(nn.Module):
    def __init__(self):
        super(Net_1, self).__init__()

        self.linear_1 = nn.Linear(1,10)
        self.linear_2 = nn.Linear(10,1)

    def forward(self,x):
        x = self.linear_1(x)
        x = F.relu(x)
        x = self.linear_2(x)
        x = F.softmax(x,dim=1)
        return x


class Net_2(nn.Module):
    def __init__(self):
        super(Net_2,self).__init__()

        self.linear_1 = nn.Linear(1,10)
        self.linear_2 = nn.Linear(10,1)

    def forward(self, x):
        x = self.linear_1(x).detach()
        x = F.relu(x)
        x = self.linear_2(x)
        x = F.softmax(x,dim=1)
        return x


n_1 = Net_1()
n_2 = Net_2()

optimizer_n1 = optim.Adam(n_1.parameters(),lr=0.001)
optimizer_n2 = optim.Adam(n_2.parameters(),lr=0.001)
criterion = nn.MSELoss()

for i in range(10):
    x = torch.randn(10,1).float()
    y = 2 * x

    pred_n1 = n_1(x)
    optimizer_n1.zero_grad()
    loss_n1 = criterion(y,pred_n1)
    loss_n1.backward(retain_graph=True)
    optimizer_n1.step()

    pred_n2 = n_2(pred_n1)
    optimizer_n2.zero_grad()
    loss_n2 = criterion(y,pred_n2)
    loss_n2.backward()
    optimizer_n2.step()
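A design note, not part of the original fix: detaching after self.linear_1 inside Net_2 also stops gradients from reaching Net_2's own linear_1, so optimizer_n2 will never update that layer. A commonly used alternative, sketched below under the assumption that Net_2 keeps its original forward (no internal .detach()), is to detach pred_n1 at the boundary before handing it to the second network; the link back to n_1 is cut there, all of Net_2 still trains, and retain_graph=True on the first backward is then no longer needed:

for i in range(10):
    x = torch.randn(10, 1).float()
    y = 2 * x

    pred_n1 = n_1(x)
    optimizer_n1.zero_grad()
    loss_n1 = criterion(y, pred_n1)
    loss_n1.backward()                  # no retain_graph needed any more
    optimizer_n1.step()

    pred_n2 = n_2(pred_n1.detach())     # cut the graph at the boundary instead
    optimizer_n2.zero_grad()
    loss_n2 = criterion(y, pred_n2)
    loss_n2.backward()
    optimizer_n2.step()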
