Newton's method in optimization

[Figure: A comparison of gradient descent (green) and Newton's method (red) for minimizing a function (with small step sizes). Newton's method uses curvature information to take a more direct route.]

In mathematics, Newton's method is an iterative method for finding the roots of a differentiable function. In optimization, Newton's method is applied to the derivative of a twice-differentiable function f in order to find the critical points of f, which are the zeros of the derivative function f'.


Method

Newton's method attempts to construct a sequence x_n from an initial guess x_0 that converges towards some value x_* satisfying f'(x_*)=0. This x_* is called a stationary point of f.

The second-order Taylor expansion f_T of f around x_n, where \Delta x = x - x_n,

\displaystyle f_T(x_n+\Delta x) = f(x_n) + f'(x_n)\,\Delta x + \frac{1}{2} f''(x_n)\,\Delta x^2,

attains its extremum when its derivative with respect to \Delta x is equal to zero, i.e. when \Delta x solves the linear equation:

\displaystyle f'(x_n)+f''(x_n)\,\Delta x=0.

(The right-hand side of the Taylor expansion above is a quadratic in \Delta x with constant coefficients, so differentiating it with respect to \Delta x yields this linear equation.)

Thus, provided that \displaystyle f(x) is a twice-differentiable function well approximated by its second-order Taylor expansion and the initial guess \displaystyle x_0 is chosen close enough to x_*, the Newton step is \Delta x = x - x_n = - \frac{f'(x_n)}{f''(x_n)}, and the sequence (x_n) defined by

x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}, \ n = 0, 1, \dots

will converge towards a root of f', i.e. the point x_* for which f'(x_*) = 0.
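
The following Python sketch illustrates the one-dimensional iteration above; it applies the update x_{n+1} = x_n - f'(x_n)/f''(x_n) until the derivative is nearly zero. The names newton_minimize_1d, df and d2f, as well as the tolerance and iteration cap, are illustrative choices rather than standard library functions.

def newton_minimize_1d(df, d2f, x0, tol=1e-10, max_iter=100):
    """Iterate x_{n+1} = x_n - f'(x_n)/f''(x_n) until f'(x_n) is close to zero."""
    x = x0
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:        # stationary point reached: f'(x) ~ 0
            break
        x = x - g / d2f(x)      # Newton step for the root-finding problem f'(x) = 0
    return x

# Example: f(x) = x**4 - 3*x**3 + 2, with f'(x) = 4x^3 - 9x^2 and f''(x) = 12x^2 - 18x.
# Starting from x0 = 3.0 the iterates converge to the local minimum at x = 9/4.
x_star = newton_minimize_1d(lambda x: 4*x**3 - 9*x**2,
                            lambda x: 12*x**2 - 18*x,
                            x0=3.0)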

Geometric interpretation

The geometric interpretation of Newton's method is that at each iteration one approximates f(\mathbf{x}) by a quadratic function around \mathbf{x}_n, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if f(\mathbf{x}) happens to be a quadratic function, then the exact extremum is found in one step.

Higher dimensions

The above iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, \nabla f(\mathbf{x}), and the reciprocal of the second derivative with the inverse of the Hessian matrix, H f(\mathbf{x}). One obtains the iterative scheme

\mathbf{x}_{n+1} = \mathbf{x}_n - [H f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n), \ n \ge 0.

Usually Newton's method is modified to include a small step size γ > 0 instead of γ = 1:

\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma[H f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n).

This is often done to ensure that the Wolfe conditions are satisfied at each step \mathbf{x}_n \to \mathbf{x}_{n+1} of the iteration.

Where applicable, Newton's method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood N such that, if we start with \mathbf{x}_0 \in N, Newton's method with step size γ = 1 converges quadratically (if the Hessian is invertible and a Lipschitz continuous function of \mathbf{x} in that neighborhood).
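
A minimal NumPy sketch of the damped iteration above, written with an explicit Hessian inverse to mirror the formula (the next paragraph explains why, in practice, one solves a linear system instead). The callables grad_f and hess_f and the fixed step size gamma are assumptions made for illustration.

import numpy as np

def damped_newton(grad_f, hess_f, x0, gamma=1.0, tol=1e-8, max_iter=50):
    """Iterate x_{n+1} = x_n - gamma * Hf(x_n)^{-1} grad f(x_n)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:    # gradient is (almost) zero: stationary point
            break
        x = x - gamma * np.linalg.inv(hess_f(x)) @ g
    return x

# Example: the quadratic f(x, y) = (x - 1)^2 + 2*(y + 2)^2 is minimized at (1, -2);
# with gamma = 1 the exact minimizer is found in a single step, as noted earlier.
x_min = damped_newton(lambda v: np.array([2*(v[0] - 1), 4*(v[1] + 2)]),
                      lambda v: np.array([[2.0, 0.0], [0.0, 4.0]]),
                      x0=[5.0, 5.0])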

Finding the inverse of the Hessian in high dimensions can be an expensive operation. In such cases, instead of directly inverting the Hessian it's better to calculate the vector \mathbf{p}_{n} = [H f(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n) as the solution to the system of linear equations

[H f(\mathbf{x}_n)] \mathbf{p}_{n} = \nabla f(\mathbf{x}_n)

which may be solved by various factorizations or approximately (but to great accuracy) using iterative methods. Many of these methods are only applicable to certain types of problems; for example, the Cholesky factorization and the conjugate gradient method will only work if [H f(\mathbf{x}_n)] is a positive definite matrix. While this may seem like a limitation, it is often a useful indicator that something has gone wrong: for example, if a minimization problem is being approached and [H f(\mathbf{x}_n)] is not positive definite, then the iterations are converging to a saddle point rather than a minimum.
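
As a sketch of this idea, one Newton step could be computed with SciPy's Cholesky routines, assuming the Hessian H at the current iterate is symmetric positive definite; cho_factor raises an error otherwise, which matches the diagnostic use described above. The helper name newton_solve is hypothetical.

from scipy.linalg import cho_factor, cho_solve

def newton_solve(H, grad):
    """Compute p_n from [H f(x_n)] p_n = grad f(x_n) without forming the inverse."""
    return cho_solve(cho_factor(H), grad)   # raises LinAlgError if H is not positive definite

# The update is then x_{n+1} = x_n - gamma * newton_solve(hess_f(x_n), grad_f(x_n)).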

On the other hand, if a constrained optimization is performed (for example, with Lagrange multipliers), the problem may become one of saddle point finding, in which case the Hessian will be symmetric indefinite and the system for \mathbf{p}_n will need to be solved with a method that works for indefinite matrices, such as the LDL^T variant of the Cholesky factorization or the conjugate residual method.

There also exist various quasi-Newton methods, where an approximation for the Hessian (or its inverse directly) is built up from changes in the gradient.

If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the solution may diverge. In this case, certain workarounds have been tried in the past, with varying success on particular problems. One can, for example, modify the Hessian by adding a correction matrix Bn so as to make H_f(\mathbf{x}_n) + B_n positive definite. One approach is to diagonalize Hf and choose Bn so that H_f(\mathbf{x}_n) + B_n has the same eigenvectors as Hf, but with each negative eigenvalue replaced by some \epsilon>0.
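
A sketch of this eigenvalue-based correction, assuming the Hessian is symmetric so that it can be diagonalized with numpy.linalg.eigh; the threshold eps and the function name are illustrative choices.

import numpy as np

def clamp_hessian(H, eps=1e-6):
    """Return Hf + Bn: same eigenvectors as Hf, eigenvalues below eps replaced by eps."""
    eigvals, eigvecs = np.linalg.eigh(H)               # H = V diag(eigvals) V^T
    clamped = np.where(eigvals < eps, eps, eigvals)    # negative (or tiny) eigenvalues -> eps
    return eigvecs @ np.diag(clamped) @ eigvecs.T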

An approach exploited in the Levenberg–Marquardt algorithm (which uses an approximate Hessian) is to add a scaled identity matrix to the Hessian, \mu \mathbf I, with the scale adjusted at every iteration as needed. For large μ and small Hessian, the iterations will behave like gradient descent with step size \frac 1 \mu. This results in slower but more reliable convergence where the Hessian doesn't provide useful information.
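A sketch of this damping strategy is shown below: solve (H f(\mathbf{x}_n) + \mu \mathbf I)\,\mathbf p_n = \nabla f(\mathbf{x}_n) and let the caller enlarge μ whenever a step fails. For large μ the returned direction approaches \nabla f(\mathbf{x}_n)/\mu, i.e. gradient descent with step size 1/μ, as noted above. The policy for adjusting μ is not specified here; only the damped solve is shown, and the function name is hypothetical.

import numpy as np

def damped_newton_direction(H, grad, mu):
    """Solve (H + mu*I) p = grad f; the update is then x_{n+1} = x_n - p."""
    n = H.shape[0]
    return np.linalg.solve(H + mu * np.eye(n), grad)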

Other approximations

Some functions are poorly approximated by quadratics, particularly when far from a maximum or minimum. In these cases, approximations other than quadratic may be more appropriate.[1]

