Understanding backward() in PyTorch (Updated for V0.4)
Update for PyTorch 0.4: Earlier versions used Variable to wrap tensors with different properties. Since version 0.4, Variable has been merged with Tensor; in other words, Variable is NOT needed anymore. The requires_grad flag can be set directly on a tensor. Accordingly, this post has also been updated.
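As a minimal sketch of what this change means in practice (assuming PyTorch 0.4 or later; the tensor names a and b are only illustrative), requires_grad can be given when a tensor is created or switched on afterwards with requires_grad_():

```python
import torch

# A plain tensor does not track gradients by default.
a = torch.ones(2, 2)
print(a.requires_grad)

# Request gradient tracking at creation time...
b = torch.ones(2, 2, requires_grad=True)

# ...or switch it on in place for an existing tensor.
a.requires_grad_()

print(a.requires_grad, b.requires_grad)
# Tensors created directly by the user are leaves and have no grad_fn.
print(b.is_leaf, b.grad_fn)
```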
Having heard the announcement about Theano from Bengio's lab, as a Theano user I am both happy and sad to see the old hero fade, pushed aside by many rising stars. Sad because it is too old to compete with its industrial competitors, and happy to have so many excellent deep learning frameworks to choose from. Recently I started translating some of my old code to PyTorch and have been really impressed by its dynamic nature and clarity. At the very beginning, though, I was quite confused by the backward() function when reading the tutorials and documentation. This motivated me to write this post to make it a bit easier for other PyTorch beginners to understand. I'll assume that you already know the autograd module and what a Variable is, but are a little confused by the definition of backward().
First let's recall how gradients are computed in mathematical terms. For an independent variable \(x\) (scalar or vector), any operation on \(x\) can be written as \(y = f(x)\). Then the gradient of \(y\) w.r.t. the \(x_i\)s is
\[\nabla y=\begin{bmatrix} \frac{\partial y}{\partial x_1}\\ \frac{\partial y}{\partial x_2}\\ \vdots \end{bmatrix}.\]
Then for a specific point \(x=[X_1, X_2, \cdots]\), we get the gradient of \(y\) at that point as a vector. With these notions in mind, the following things are a bit confusing at the beginning:

- Mathematically, we would say “the gradients of a function w.r.t. the independent variables”, whereas .grad is attached to the leaf tensors. In Theano and TensorFlow, the computed gradients are stored separately in a variable. After a moment of adjustment, though, this is fairly easy to accept. In PyTorch it is also possible to get the .grad of intermediate Variables with the help of the register_hook function.
- The purpose of the parameter grad_variables of the function torch.autograd.backward(variables, grad_tensors=None, retain_graph=None, create_graph=None, retain_variables=None, grad_variables=None) is not obvious. Note that grad_variables is deprecated; use grad_tensors instead.
- What is retain_graph doing?
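To make the first point concrete, here is a small sketch of reading an intermediate gradient through register_hook. The dictionary captured and the function save_y_grad are hypothetical helpers introduced only for this illustration, and the input value 0.5 is arbitrary.

```python
import torch

x = torch.tensor([[0.5]], requires_grad=True)  # leaf tensor
y = 2 * x                                      # intermediate (non-leaf) tensor
z = y ** 3

captured = {}  # hypothetical container for the intermediate gradient


def save_y_grad(grad):
    # The hook receives dz/dy when backward() reaches y.
    captured["y"] = grad


y.register_hook(save_y_grad)
z.backward()

print("dz/dy from hook:", captured["y"])  # analytically 3 * y**2
print("dz/dx on leaf:  ", x.grad)         # analytically 24 * x**2
```

The hook only observes the gradient here; y.grad itself still stays None because y is not a leaf.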
import torch as T
import torch.autograd
import numpy as np
Simplicity of using backward()
'''
Define a scalar variable, set requires_grad to be true to add it to backward path for computing gradients
It is actually very simple to use backward()
first define the computation graph, then call backward()
'''
x = T.randn(1, 1, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
print('y', y)
#define one more operation to check the chain rule
z = y ** 3
print('z', z)
x tensor([[0.6955]], requires_grad=True)
y tensor([[1.3911]], grad_fn=<MulBackward>)
z tensor([[2.6918]], grad_fn=<PowBackward0>)
These simple operations define the forward path \(z=(2x)^3\). \(z\) is the final output tensor whose gradient we would like to compute, \(dz=24x^2dx\), and it is what gets passed as the tensors parameter of torch.autograd.backward() when we call z.backward().
#yes, it is just as simple as this to compute gradients:
z.backward()
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad, 'Requires gradient?', x.grad.requires_grad) # note that x.grad is also a tensor
z gradient: None
y gradient: None
x gradient: tensor([[11.6105]]) Requires gradient? False
The gradients of both \(y\) and \(z\) are None, since the function returns the gradient for the leaves, which is \(x\) in this case. At the very beginning, I was assuming something like this:
x gradient: None
y gradient: None
z gradient: tensor([11.6105])
since the gradient is calculated for the final output \(z\).
With a moment of thought, we can see that this would be practical chaos if \(x\) were a multidimensional vector. x.grad should be interpreted as the gradient of \(z\) at \(x\).
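One way to convince yourself of this interpretation is to compare x.grad with the chain rule worked out by hand, \(dz/dx = 3(2x)^2\cdot 2 = 24x^2\). The following is only a sanity-check sketch; the name analytic is introduced here for illustration.

```python
import torch

x = torch.randn(1, 1, requires_grad=True)
z = (2 * x) ** 3
z.backward()

# Chain rule by hand: dz/dx = 3 * (2x)^2 * 2 = 24 * x^2
analytic = 24 * x.detach() ** 2

print("autograd:", x.grad)
print("by hand: ", analytic)
print("match:", torch.allclose(x.grad, analytic))
# x.grad always has the same shape as x itself.
print("shapes:", x.grad.shape, x.shape)
```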
How do we use grad_tensors?
grad_tensors should be a list of torch tensors. By default, backward() is applied to a scalar-valued function, so the default value of grad_tensors is effectively torch.FloatTensor([1]). But why is that? What if we pass some other values?
Keep the same forward path, then call backward() while only setting retain_graph to True.
x = T.randn(1, 1, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#define one more operation to check the chain rule
z = y ** 3
z.backward(retain_graph=True)
print('Keeping the default value of grad_tensors gives')
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad)
x tensor([[0.7207]], requires_grad=True)
Keeping the default value of grad_tensors gives
z gradient: None
y gradient: None
x gradient: tensor([[12.4668]])
Now test the explicit default value, which should give the same result. Since the same graph is retained, DO NOT forget to zero the gradient before recomputing it.
x.grad.data.zero_()
z.backward(T.Tensor([[1]]), retain_graph=True)
print('Set grad_tensors to 1 gives')
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad)
Set grad_tensors to 1 gives
z gradient: None
y gradient: None
x gradient: tensor([[12.4668]])
Then what about other values? Let’s try 0.1 and 0.5.
x.grad.data.zero_()
z.backward(T.Tensor([[0.1]]), retain_graph=True)
print('Set grad_tensors to 0.1 gives')
print('z gradient:', z.grad)
print('y gradient:', y.grad)
print('x gradient:', x.grad)
Set grad_tensors to 0.1 gives
z gradient: None
y gradient: None
x gradient: tensor([[1.2467]])
x.grad.data.zero_()
z.backward(T.FloatTensor([[0.5]]), retain_graph=True)
print('Set grad_tensors to 0.5 gives')
print('z gradient', z.grad)
print('y gradient', y.grad)
print('x gradient', x.grad)
Set grad_tensors to 0.5 gives
z gradient None
y gradient None
x gradient tensor([[6.2334]])
It looks like the elements of grad_tensors act as scaling factors. Now let’s set \(x\) to be a \(2\times 2\) matrix. Note that \(z\) will also be a matrix. (Always use the latest version; backward() has been improved a lot since earlier versions and has become much easier to understand.)
x = T.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#define one more operation to check the chain rule
z = y ** 3
print('z shape:', z.size())
z.backward(T.FloatTensor([[1, 1], [1, 1]]), retain_graph=True)
print('x gradient for its all elements:\n', x.grad)
print()
x.grad.data.zero_() #the gradient for x will be accumulated, it needs to be cleared.
z.backward(T.FloatTensor([[0, 1], [0, 1]]), retain_graph=True)
print('x gradient for the second column:\n', x.grad)
print()
x.grad.data.zero_()
z.backward(T.FloatTensor([[1, 1], [0, 0]]), retain_graph=True)
print('x gradient for the first row:\n', x.grad)
x tensor([[2.5212, 1.2730],
[ 0.0366, 0.0750]], requires_grad=True)
z shape: torch.Size([2, 2])
x gradient for its all elements:
tensor([[152.5527, 38.8946],
[ 0.0322, 0.1349]])
x gradient for the second column:
tensor([[ 0.0000, 38.8946],
[ 0.0000, 0.1349]])
x gradient for the first row:
tensor([[152.5527, 38.8946],
[ 0.0000, 0.0000]])
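We can clearly see that the gradient of each element of \(z\) is computed w.r.t. the corresponding element of \(x\), because the operations are all elementwise. The entries of grad_tensors decide which elements contribute and how strongly.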
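Another way to read the scaling behaviour: calling z.backward(v) gives the same result as differentiating the scalar sum(v * z). The sketch below checks this with torch.autograd.grad, which is not used elsewhere in this post; the weights in v are arbitrary and the variable names are illustrative.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
z = (2 * x) ** 3
v = torch.tensor([[1.0, 0.5], [0.0, 2.0]])  # arbitrary weights, same shape as z

# Route 1: pass v as the gradient argument of backward().
z.backward(v, retain_graph=True)
grad_from_backward = x.grad.clone()

# Route 2: differentiate the scalar sum(v * z) directly.
(grad_from_sum,) = torch.autograd.grad((v * z).sum(), x)

print(torch.allclose(grad_from_backward, grad_from_sum))
# For this elementwise graph both equal v * 24 * x^2.
print(torch.allclose(grad_from_backward, v * 24 * x.detach() ** 2))
```

With v set to all ones this reduces to the "all elements" case above, and zeros in v mask out the corresponding elements.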
Then what if we make the output a scalar while \(x\) is two-dimensional? This is a really simplified version of what happens in neural networks.
\[f(x)=\frac{1}{n}\sum_{i=1}^n(2x_i)^3\]
\[\frac{\partial f}{\partial x_i}=\frac{1}{n}\cdot 24x_i^2\]
x = T.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#print('y', y)
#define one more operation to check the chain rule
z = y ** 3
out = z.mean()
print('out', out)
out.backward(retain_graph=True)
print('x gradient:\n', x.grad)
x tensor([[ 1.8528, 0.2083],
[-1.5296, 0.3136]], requires_grad=True)
out tensor(5.6434, grad_fn=<MeanBackward1>)
x gradient:
tensor([[20.5970, 0.2604],
[14.0375, 0.5903]])
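The mean can also be seen as a particular choice of grad_tensors: differentiating z.mean() should match differentiating the non-scalar z with every weight set to 1/n. This is a sketch using torch.autograd.grad and torch.full_like, neither of which appears in the original code; the names grad_mean and grad_weighted are illustrative.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
z = (2 * x) ** 3
n = z.numel()

# Gradient of the scalar mean.
(grad_mean,) = torch.autograd.grad(z.mean(), x, retain_graph=True)

# Same result from the non-scalar z with every weight equal to 1/n.
weights = torch.full_like(z, 1.0 / n)
(grad_weighted,) = torch.autograd.grad(z, x, grad_outputs=weights)

print(torch.allclose(grad_mean, grad_weighted))
# Both agree with the per-element formula 24 * x_i^2 / n.
print(torch.allclose(grad_mean, 24 * x.detach() ** 2 / n))
```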
We will get a complaint if grad_tensors is given a shape that does not match the scalar output.
x.grad.data.zero_()
out.backward(T.FloatTensor([[1, 1], [1, 1]]), retain_graph=True)
print('x gradient', x.grad)

RuntimeError Traceback (most recent call last)
<ipython-input-78-db7cccdf3863> in <module>()
      1 x.grad.data.zero_()
----> 2 out.backward(T.FloatTensor([[1, 1], [1, 1]]), retain_graph=True)
      3 print('x gradient', x.grad)
/usr/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
     91 products. Defaults to ``False``.
     92 """
---> 93 torch.autograd.backward(self, gradient, retain_graph, create_graph)
     94
     95 def register_hook(self, hook):
/usr/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88 Variable._execution_engine.run_backward(
     89 tensors, grad_tensors, retain_graph, create_graph,
---> 90 allow_unreachable=True) # allow_unreachable flag
     91
     92
RuntimeError: invalid gradient at index 0 - expected shape [] but got [2, 2]
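The error is about the shape, not about passing a gradient at all. As a rough sketch (the exact error text differs between PyTorch versions, and the flag mismatch_rejected is a hypothetical name used only here), a 0-dimensional gradient is accepted for a scalar output while a 2x2 one is rejected:

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
out = ((2 * x) ** 3).mean()  # 0-dimensional output

# A gradient with the same (empty) shape as out is accepted.
out.backward(torch.tensor(1.0), retain_graph=True)
print("x gradient:\n", x.grad)

# A gradient with any other shape is rejected with a RuntimeError.
try:
    out.backward(torch.ones(2, 2))
    mismatch_rejected = False
except RuntimeError as err:
    mismatch_rejected = True
    print("RuntimeError:", err)
```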
What is retain_graph doing?
When training a model, the graph is rebuilt in every iteration, so it does not matter that each call to backward() consumes (frees) the graph when retain_graph is false. If we want to call backward() again on the same graph, we need to keep it by setting retain_graph to true.
x = T.randn(2, 2, requires_grad=True) #x is a leaf created by user, thus grad_fn is none
print('x', x)
#define an operation on x
y = 2 * x
#print('y', y)
#define one more operation to check the chain rule
z = y ** 3
out = z.mean()
print('out', out)
out.backward() #without setting retain_graph to be true, it is alright for first time of backward.
print('x gradient', x.grad)
x.grad.data.zero_()
out.backward() #Now we get complaint saying that no graph is available for tracing back.
print('x gradient', x.grad)
x tensor([[0.7452, 1.5727],
[ 0.1702, 0.7374]], requires_grad=True)
out tensor(7.7630, grad_fn=<MeanBackward1>)
x gradient tensor([[ 3.3323, 14.8394],
[ 0.1738, 3.2623]])

RuntimeError Traceback (most recent call last)
<ipython-input-82-80a8d867d529> in <module>()
     12
     13 x.grad.data.zero_()
---> 14 out.backward() #Now we get complaint saying that no graph is available for tracing back.
     15 print('x gradient', x.grad)
/usr/lib/python3.7/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
     91 products. Defaults to ``False``.
     92 """
---> 93 torch.autograd.backward(self, gradient, retain_graph, create_graph)
     94
     95 def register_hook(self, hook):
/usr/lib/python3.7/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88 Variable._execution_engine.run_backward(
     89 tensors, grad_tensors, retain_graph, create_graph,
---> 90 allow_unreachable=True) # allow_unreachable flag
     91
     92
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
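In a typical training loop this error never shows up, because the forward pass is rerun (and the graph rebuilt) before every backward(). The loop below is a toy sketch of that pattern; the learning rate 0.01, the three steps, and the list losses are arbitrary choices for illustration, not part of any real model.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
losses = []

for step in range(3):
    # Forward pass: a new graph is built on every iteration.
    out = ((2 * x) ** 3).mean()
    # Backward pass: the graph is freed afterwards, which is fine here.
    out.backward()
    # Plain gradient-descent update with a hypothetical learning rate.
    with torch.no_grad():
        x -= 0.01 * x.grad
    # Clear the accumulated gradient before the next iteration.
    x.grad.zero_()
    losses.append(out.item())
    print(step, out.item())
```

Only when the same graph has to be traversed again without a new forward pass is retain_graph=True required.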
Wrap up
- The backward() function makes differentiation very simple.
- For a non-scalar tensor, we need to specify grad_tensors.
- If you need to call backward() twice on a graph or subgraph, you will need to set retain_graph to true.
- Note that grad will accumulate when the graph is executed multiple times.
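The last point is easy to verify. In this sketch (the names first, accumulated, and fresh are illustrative), a second backward() without zeroing doubles the stored gradient, while zeroing first recovers the plain gradient:

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
out = ((2 * x) ** 3).mean()

out.backward(retain_graph=True)
first = x.grad.clone()

# Second backward on the same graph without zeroing: gradients add up.
out.backward(retain_graph=True)
accumulated = x.grad.clone()

# Zeroing first gives the plain gradient again.
x.grad.zero_()
out.backward()
fresh = x.grad.clone()

print(torch.allclose(accumulated, 2 * first))
print(torch.allclose(fresh, first))
```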