# Mathematical interpretation of why Gradient descent works

For minimizing a differentiable multivariable function $f(\mathbf x)$ with initial value of $\mathbf x_0$, the fastest decreasing would be the direction of negative gradient of $f(x)$ since its gradient points to fastest ascending direction. The updating rule

$\begin{equation} \mathbf x_{t+1} = \mathbf x_t - \eta\nabla f(\mathbf x_t), t=1,2~\cdots \end{equation}$

with sufficiently small $\eta$ leads to $f(\mathbf x_{t+1})<f(\mathbf x_t)$. The scalar constant $\eta$ is the step size determining how far the updating moves in the negative gradient direction, which is usually called learning rate in machine learning model training.

### But, why $\eta$ has to be small?

To gain the mathematical interpretation, assume the general context of

$\begin{equation} \min_{\mathbf x}f(\mathbf x), \mathbf x\in \mathcal R^m \end{equation}$

with initial condition $\mathbf x=\mathbf x_0$.

The goal of first updating is to find $\mathbf x_1=\mathbf x_0+\Delta\mathbf x$ such that

$\begin{equation} f(\mathbf x_1) \le f(\mathbf x_0), \end{equation}$

where $\Delta\mathbf x$ is the updating vector.

First we approximate $f(\mathbf x_1)$ with the first-order Taylor expansion as

$\begin{equation} f(\mathbf x_1) = f(\mathbf x_0 + \Delta\mathbf x)=f(\mathbf x_0)+\Delta\mathbf x^T\nabla f(\mathbf x_0), \end{equation}$

$\begin{equation} f(\mathbf x_1) - f(\mathbf x_0)=\Delta\mathbf x^T\nabla f(\mathbf x_0). \end{equation}$

Now to ensure the semi-negative definite of $f(\mathbf x_1) - f(\mathbf x_0)$, we can simply choose

$\begin{equation} \Delta\mathbf x = -\eta\nabla f(\mathbf x_0) \end{equation}$

and then we have

$\begin{equation} f(\mathbf x_1) - f(\mathbf x_0)=-\eta\|\nabla f(\mathbf x_0)\|^2 \le 0, \end{equation}$

for some scalar constant $\eta$, and $\|\nabla f(\mathbf x_0)\|^2=0$ if and only if $f(\mathbf x_0)$ is already the minimum.

To sum up and generalize to step $t$, the updating rule

$\begin{equation} \mathbf x_{t+1} = \mathbf x_t - \eta \nabla f(\mathbf x_t) \end{equation}$

drives the searching towards the minimum given that the proper value of $\eta$ can ensure that $x_{t+1}$ is close enough to $x_t$ such that first-order approximate is a valid approximate with tolerable error.