Why are second-order derivatives useful in convex optimization?

18

I guess this is a basic question and it has to do with the direction of the gradient itself, but I'm looking for examples where second-order methods (e.g. BFGS) are more effective than simple gradient descent.

optimization Bar

3

Is it too simple to just observe that "find the vertex of a paraboloid" is a much better approximation to the "find a minimum" problem than "find the minimum of this linear function" (which, of course, doesn't have a minimum because it's linear)?

20

Here's a common framework for interpreting both gradient descent and Newton's method, which is maybe a useful way to think of the difference as a supplement to @Sycorax's answer. (BFGS approximates Newton's method; I won't talk about it in particular here.)

We're minimizing the function $f$, but we don't know how to do that directly. So, instead, we take a local approximation at our current point $x$ and minimize that.

Newton's method approximates the function using a second-order Taylor expansion:

$f(y) \approx N_x(y) := f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y - x)^T \, \nabla^2 f(x) \, (y - x) ,$

where $\nabla f(x)$ denotes the gradient of $f$ at the point $x$ and $\nabla^2 f(x)$ the Hessian at $x$. It then steps to $\arg\min_y N_x(y)$ and repeats.
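As a short worked step, assuming the Hessian $\nabla^2 f(x)$ is positive definite (so that $N_x$ has a unique minimizer), setting the gradient of $N_x$ with respect to $y$ to zero shows where that step lands:

$\begin{align} \nabla_y N_x(y) = \nabla f(x) + \nabla^2 f(x) \, (y - x) &= 0 \\ \Longrightarrow \quad y &= x - \left[\nabla^2 f(x)\right]^{-1} \nabla f(x) .\end{align}$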

Gradient descent, with a fixed step size $t$, instead steps to $x - t \nabla f(x)$. Note that

$\begin{align} x - t \,\nabla f(x) &= \arg\min_y \left[f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2 t} \lVert y - x \rVert^2\right] \\&= \arg\min_y \left[f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y-x)^T \tfrac{1}{t} I (y - x)\right] .\end{align}$

Thus gradient descent minimizes the function

$G_x(y) := f(x) + \nabla f(x)^T (y - x) + \tfrac12 (y-x)^T \tfrac{1}{t} I (y - x).$

Thus gradient descent is kind of like using Newton's method, but instead of taking the second-order Taylor expansion, we pretend that the Hessian is $\tfrac1t I$. This $G$ is often a substantially worse approximation to $f$ than $N$, and hence gradient descent often takes much worse steps than Newton's method. This is counterbalanced, of course, by each step of gradient descent being so much cheaper to compute than each step of Newton's method. Which is better depends entirely on the nature of the problem, your computational resources, and your accuracy requirements.
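To make the "same model, different Hessian" view concrete, here is a minimal NumPy sketch. The helper `model_step` and the test function are hypothetical, chosen only for illustration. The helper minimizes a quadratic model with curvature matrix `B`, which is assumed positive definite; passing $\tfrac1t I$ reproduces the gradient descent step and passing the true Hessian gives the Newton step.

```python
import numpy as np

def model_step(x, grad, B):
    """Minimize f(x) + grad^T (y - x) + 0.5 (y - x)^T B (y - x) over y.

    Assumes B is positive definite, so the minimizer solves B (y - x) = -grad.
    """
    return x - np.linalg.solve(B, grad)

# Hypothetical smooth convex test function:
# f(x) = exp(x0) + exp(-x0) + 2 * x1**2
def grad_f(x):
    return np.array([np.exp(x[0]) - np.exp(-x[0]), 4.0 * x[1]])

def hess_f(x):
    return np.array([[np.exp(x[0]) + np.exp(-x[0]), 0.0],
                     [0.0, 4.0]])

x = np.array([1.0, 1.0])
t = 0.1  # fixed step size for gradient descent

gd_next = model_step(x, grad_f(x), np.eye(2) / t)   # pretend the Hessian is (1/t) I
newton_next = model_step(x, grad_f(x), hess_f(x))   # use the true Hessian

print("gradient descent step:", gd_next)
print("plain x - t * grad:   ", x - t * grad_f(x))
print("Newton step:          ", newton_next)
```

The first two printed vectors agree, and the Newton step lands exactly on the minimizer along the second coordinate, where the test function is itself quadratic.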

Looking at @Sycorax's example of minimizing a quadratic

$f(x) = \tfrac12 x^T A x + d^T x + c$

for a moment, it's worth noting that this perspective helps with understanding both methods.

With Newton's method, we'll have $N = f$ so that it terminates with the exact answer (up to floating point accuracy issues) in a single step.

Gradient descent, on the other hand, uses

$G_x(y) = f(x) + (A x + d)^T (y - x) + \tfrac12 (y - x)^T \tfrac1t I (y - x)$

whose tangent plane at $x$ is correct, but whose curvature is entirely wrong, and indeed throws away the important differences in different directions when the eigenvalues of $A$ vary substantially.
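A small numerical sketch of this contrast on the quadratic example follows. The matrix `A`, vector `d`, starting point, and tolerance are made-up illustration values, with eigenvalues 1 and 100 so that the curvature differs strongly between directions.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: eigenvalues of A are 1 and 100.
A = np.diag([1.0, 100.0])
d = np.array([-1.0, -100.0])
x_star = np.linalg.solve(A, -d)  # exact minimizer, where A x + d = 0

def grad(x):
    return A @ x + d

x0 = np.zeros(2)

# Newton: the model N equals f, so a single step lands on the minimizer.
x_newton = x0 - np.linalg.solve(A, grad(x0))

# Gradient descent with fixed step size t = 1 / lambda_max.
t = 1.0 / 100.0
x_gd = x0.copy()
gd_steps = 0
while np.linalg.norm(x_gd - x_star) > 1e-6 and gd_steps < 100_000:
    x_gd = x_gd - t * grad(x_gd)
    gd_steps += 1

print("Newton error after 1 step:", np.linalg.norm(x_newton - x_star))
print("Gradient descent steps to reach 1e-6:", gd_steps)
```

With this step size, the error along the eigenvalue-1 direction shrinks by only a factor of 0.99 per iteration, which is why gradient descent needs well over a thousand steps here.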

Dougal

1

This is similar to @Aksakal's answer, but in more depth.

Dougal

1

(+1) This is a great addition!

Sycorax says Reinstate Monica

17

Essentially, the advantage of a second-derivative method like Newton's method is that it has the quality of quadratic termination. This means that it can minimize a quadratic function in a finite number of steps. A method like gradient descent depends heavily on the learning rate, which can cause optimization to either converge slowly because it's bouncing around the optimum, or to diverge entirely. Stable learning rates can be found... but involve computing the Hessian. Even when using a stable learning rate, you can have problems like oscillation around the optimum, i.e. you won't always take a "direct" or "efficient" path towards the minimum. So it can take many iterations to terminate, even if you're relatively close to it. BFGS and Newton's method can converge more quickly even though each step is computationally more expensive.

To your request for examples: Suppose you have the objective function

$F(x)=\frac{1}{2}x^TAx+d^Tx+c$

The gradient is $\nabla F(x)=Ax+d$ and putting it into the steepest descent form with constant learning rate $\alpha$ gives

$x_{k+1}= x_k-\alpha(Ax_k+d) = (I-\alpha A)x_k-\alpha d.$

This will be stable if the magnitudes of the eigenvalues of $I-\alpha A$ are less than 1. We can use this property to show that, for symmetric positive definite $A$, a stable learning rate satisfies

$\alpha<\frac{2}{\lambda_{max}},$

where $\lambda_{max}$ is the largest eigenvalue of $A$. The steepest descent algorithm's convergence rate is limited by the largest eigenvalue, and the routine will converge most quickly in the direction of its corresponding eigenvector. Likewise, it will converge most slowly in the direction of the eigenvector of the smallest eigenvalue. When there is a large disparity between the large and small eigenvalues of $A$, gradient descent will be slow; any $A$ with this property will make gradient descent converge slowly.
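The stability bound can be checked numerically. The sketch below uses a made-up $A$ with eigenvalues 1 and 10 and a hypothetical helper `run_gd`. It relies on the fact that the error $x_k - x^*$ is multiplied by $I - \alpha A$ at every step, so a learning rate just below $2/\lambda_{max}$ should converge and one just above it should diverge.

```python
import numpy as np

# Hypothetical symmetric positive definite example with eigenvalues 1 and 10.
A = np.diag([1.0, 10.0])
d = np.array([-1.0, -10.0])
x_star = np.linalg.solve(A, -d)          # minimizer of F
lam_max = np.linalg.eigvalsh(A).max()    # largest eigenvalue of A

def run_gd(alpha, steps=200):
    """Run fixed-step steepest descent from the origin; return the final error norm."""
    x = np.zeros(2)
    for _ in range(steps):
        x = x - alpha * (A @ x + d)
    return np.linalg.norm(x - x_star)

err_stable = run_gd(0.95 * 2.0 / lam_max)    # alpha just below 2 / lambda_max
err_unstable = run_gd(1.05 * 2.0 / lam_max)  # alpha just above 2 / lambda_max

print("error with alpha below the bound:", err_stable)
print("error with alpha above the bound:", err_unstable)
```

The first error shrinks towards zero while the second grows by a factor of about 1.1 per step along the eigenvector of $\lambda_{max}$.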

In the specific context of neural networks, the book Neural Network Design has quite a bit of information on numerical optimization methods. The above discussion is a condensation of section 9-7.

Sycorax says Reinstate Monica

Great answer! I'm accepting @Dougal 's answer as I think it provides a simpler explanation.

Bar

6

In convex optimization you are approximating the function as a second-degree polynomial; in the one-dimensional case:

$f(x)=c+\beta x + \alpha x^2$

In this case the second derivative is

$\partial^2 f(x)/\partial x^2=2\alpha$

If you know the derivatives, then it's easy to get the next guess for the optimum:

$\text{guess}=-\frac{\beta}{2\alpha}$

The multivariate case is very similar; just use the gradient and the Hessian in place of the first and second derivatives.
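As a quick check of the one-dimensional formula, here is a minimal sketch with made-up coefficients and a hypothetical helper `newton_step_1d`. Starting from any point $x$, the update $x - f'(x)/f''(x)$ lands on $-\beta/(2\alpha)$ when $f$ is exactly quadratic. Note that `alpha` below is the quadratic coefficient from this answer, not a learning rate.

```python
def newton_step_1d(x, first_derivative, second_derivative):
    """One Newton update in one dimension: x - f'(x) / f''(x)."""
    return x - first_derivative / second_derivative

# Hypothetical coefficients for f(x) = c + beta * x + alpha * x**2 (alpha > 0, so f is convex).
c, beta, alpha = 3.0, -4.0, 2.0

x = 10.0                                  # arbitrary starting point
f_prime = beta + 2.0 * alpha * x          # f'(x)
f_second = 2.0 * alpha                    # f''(x)

guess = newton_step_1d(x, f_prime, f_second)
print(guess, -beta / (2.0 * alpha))       # both print 1.0
```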

Aksakal

2

@Dougal already gave a great technical answer.

The no-maths explanation is that while the linear (order 1) approximation provides a “plane” that is tangential to a point on an error surface, the quadratic approximation (order 2) provides a surface that hugs the curvature of the error surface.

The videos on this link do a great job of visualizing this concept. They display order 0, order 1 and order 2 approximations to the function surface, which just intuitively verifies what the other answers present mathematically.

Also, a good blogpost on the topic (applied to neural networks) is here.

Zhubarb
