Newton's method in optimization
In calculus, Newton's method is an iterative method for finding the roots of a differentiable function. In optimization, it is applied to the derivative $f'$ of a twice-differentiable function $f$ to find the roots of the derivative, i.e. the critical (stationary) points of $f$.
Method
Newton's method attempts to construct a sequence $x_n$ from an initial guess $x_0$ that converges towards $x^*$ such that $f'(x^*) = 0$. This $x^*$ is called a stationary point of $f$.
The second-order Taylor expansion $f_T(x)$ of $f$ around $x_n$ (where $\Delta x = x - x_n$) is
$$f_T(x_n + \Delta x) = f(x_n) + f'(x_n)\,\Delta x + \tfrac{1}{2} f''(x_n)\,\Delta x^2.$$
Regarded as a quadratic in $\Delta x$ with constant coefficients, the right-hand side attains its extremum when its derivative with respect to $\Delta x$ is equal to zero, i.e. when $\Delta x$ solves the linear equation
$$f'(x_n) + f''(x_n)\,\Delta x = 0.$$
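Thus, provided that $f$ is a twice-differentiable function well approximated by its second-order Taylor expansion, that $f''(x_n) \neq 0$, and that the initial guess $x_0$ is chosen close enough to $x^*$, the sequence $(x_n)$ defined by
$$\Delta x = x - x_n = -\frac{f'(x_n)}{f''(x_n)}, \qquad x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}, \quad n = 0, 1, \ldots$$
will converge towards a root of $f'$, i.e. $x^*$ for which $f'(x^*) = 0$.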
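The one-dimensional iteration can be sketched in a few lines of Python. This is a minimal illustration, not a robust implementation: the function name `newton_1d`, the tolerance, and the test function $f(x) = e^x - 2x$ (whose only stationary point is $x^* = \ln 2$) are assumptions chosen for the example.

```python
import math

def newton_1d(df, d2f, x0, tol=1e-12, max_iter=50):
    """Iterate x_{n+1} = x_n - f'(x_n) / f''(x_n) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        curvature = d2f(x)
        if curvature == 0.0:
            raise ZeroDivisionError("f''(x_n) is zero; the Newton step is undefined")
        step = -df(x) / curvature
        x += step
        if abs(step) < tol:
            break
    return x

# Illustrative function (an assumption, not from the article):
# f(x) = exp(x) - 2x, so f'(x) = exp(x) - 2 and f''(x) = exp(x).
def df(x):
    return math.exp(x) - 2.0

def d2f(x):
    return math.exp(x)

x_star = newton_1d(df, d2f, x0=1.0)
print(x_star, math.log(2.0))  # the stationary point of f is x* = ln 2
```

Only $f'$ and $f''$ are needed; the values of $f$ itself never enter the iteration.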
Geometric interpretation
The geometric interpretation of Newton's method is that at each iteration one approximates $f(x)$ by a quadratic function around $x_n$, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if $f(x)$ happens to be a quadratic function, then the exact extremum is found in one step.
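The one-step property for quadratics is easy to check numerically. The sketch below uses a hypothetical quadratic, $f(x) = 3x^2 - 12x + 5$, whose minimum lies at $x = 2$; the helper name `newton_step` is likewise an assumption for this example.

```python
def newton_step(df, d2f, x):
    """One Newton step: jump to the stationary point of the local quadratic model."""
    return x - df(x) / d2f(x)

# Illustrative quadratic (an assumption): f(x) = 3x^2 - 12x + 5, minimum at x = 2.
def df(x):
    return 6.0 * x - 12.0

def d2f(x):
    return 6.0

x1 = newton_step(df, d2f, 10.0)
print(x1)  # 2.0
```

Because the quadratic model coincides with $f$ itself, the same result is obtained from any starting point.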
Higher dimensions
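The above iterative scheme can be generalized to several dimensions by replacing the derivative with the gradient, $\nabla f(\mathbf{x})$, and the reciprocal of the second derivative with the inverse of the Hessian matrix, $Hf(\mathbf{x})$. One obtains the iterative scheme
$$\mathbf{x}_{n+1} = \mathbf{x}_n - [Hf(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n), \quad n \ge 0.$$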
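Usually Newton's method is modified to include a small step size $\gamma \in (0, 1)$ instead of $\gamma = 1$:
$$\mathbf{x}_{n+1} = \mathbf{x}_n - \gamma\,[Hf(\mathbf{x}_n)]^{-1} \nabla f(\mathbf{x}_n).$$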
This is often done to ensure that the Wolfe conditions are satisfied at each step of the iteration.
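One common way to choose the step size is a backtracking line search. The sketch below is an illustration under stated assumptions: it enforces only the sufficient-decrease (Armijo) condition, which is the first of the two Wolfe conditions, and it is written for two variables so that the linear system can be solved by Cramer's rule. The names `solve_2x2` and `damped_newton`, the constant `c1`, and the test function $f(x, y) = \sqrt{1 + x^2 + y^2}$ are all choices made for this example. For this particular function the undamped step maps $\mathbf{x}$ to $-\lVert\mathbf{x}\rVert^2 \mathbf{x}$, so pure Newton diverges from any start outside the unit ball, while the damped version converges to the minimum at the origin.

```python
import math

def solve_2x2(a, b):
    """Solve the 2x2 linear system a p = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def damped_newton(f, grad, hess, x, c1=1e-4, tol=1e-10, max_iter=100):
    """Newton's method with a backtracking (Armijo) choice of gamma."""
    for _ in range(max_iter):
        g = grad(x)
        if math.hypot(g[0], g[1]) < tol:
            break
        p = solve_2x2(hess(x), [-g[0], -g[1]])  # Newton direction
        slope = g[0] * p[0] + g[1] * p[1]        # directional derivative along p
        gamma = 1.0
        # Halve gamma until the sufficient-decrease condition holds.
        while (gamma > 1e-12 and
               f([x[0] + gamma * p[0], x[1] + gamma * p[1]]) > f(x) + c1 * gamma * slope):
            gamma *= 0.5
        x = [x[0] + gamma * p[0], x[1] + gamma * p[1]]
    return x

# Illustrative function (an assumption): f(x, y) = sqrt(1 + x^2 + y^2).
def f(x):
    return math.sqrt(1.0 + x[0] ** 2 + x[1] ** 2)

def grad(x):
    r = f(x)
    return [x[0] / r, x[1] / r]

def hess(x):
    r3 = f(x) ** 3
    return [[(1.0 + x[1] ** 2) / r3, -x[0] * x[1] / r3],
            [-x[0] * x[1] / r3, (1.0 + x[0] ** 2) / r3]]

x_min = damped_newton(f, grad, hess, [2.0, 1.0])
print(x_min)  # close to the minimizer (0, 0)
```

Starting the search at $\gamma = 1$ means the full Newton step is taken whenever it is acceptable, so the fast local convergence of the undamped method is retained near the solution.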
Where applicable, Newton's method converges much faster towards a local maximum or minimum than gradient descent. In fact, every local minimum has a neighborhood $N$ such that, if we start with $\mathbf{x}_0 \in N$, Newton's method with step size $\gamma = 1$ converges quadratically (if the Hessian is invertible and a Lipschitz continuous function of $\mathbf{x}$ in that neighborhood).
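Quadratic convergence means that the error is roughly squared at every iteration. The following sketch shows this in the scalar case for brevity, using the hypothetical test function $f(x) = e^x - 2x$ with known stationary point $x^* = \ln 2$; the starting point and the number of iterations are arbitrary choices.

```python
import math

def df(x):
    return math.exp(x) - 2.0   # f'(x) for f(x) = exp(x) - 2x

def d2f(x):
    return math.exp(x)         # f''(x)

x_star = math.log(2.0)         # exact stationary point, used only to measure the error

x = 1.0
errors = [abs(x - x_star)]
for _ in range(4):
    x -= df(x) / d2f(x)        # undamped Newton step (gamma = 1)
    errors.append(abs(x - x_star))

for n, err in enumerate(errors):
    print(n, err)
```

Each printed error is no larger than the square of the previous one, so the number of correct digits roughly doubles per step until floating-point precision is reached.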
Finding the inverse of the Hessian in high dimensions can be an expensive operation. In such cases, instead of directly inverting the Hessian, it is better to calculate the vector $\Delta\mathbf{x} = \mathbf{x}_{n+1} - \mathbf{x}_n$ as the solution to the system of linear equations
$$[Hf(\mathbf{x}_n)]\,\Delta\mathbf{x} = -\nabla f(\mathbf{x}_n),$$
which may be solved by various factorizations or approximately (but to great accuracy) using iterative methods. Many of these methods are only applicable to certain types of equations; for example, the Cholesky factorization and conjugate gradient will only work if $Hf(\mathbf{x}_n)$ is a positive definite matrix. While this may seem like a limitation, it is often a useful indicator of something gone wrong: for example, if a minimization problem is being approached and $Hf(\mathbf{x}_n)$ is not positive definite, then the iterations may be heading towards a saddle point rather than a minimum.
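Solving for the Newton step with a Cholesky factorization, and using a failed factorization as a warning sign, might look like the sketch below. It is a plain-Python illustration with no pivoting or other safeguards; the function names and the two example matrices are assumptions. The second matrix is the Hessian of $f(x, y) = \tfrac{1}{2}x^2 + 2xy + \tfrac{1}{2}y^2$, which has a saddle point at the origin.

```python
import math

def cholesky(a):
    """Return lower-triangular L with a = L L^T; raise ValueError if a is not positive definite."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def cholesky_solve(a, b):
    """Solve a p = b using the factorization a = L L^T."""
    L = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):              # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    p = [0.0] * n
    for i in reversed(range(n)):    # back substitution: L^T p = y
        p[i] = (y[i] - sum(L[k][i] * p[k] for k in range(i + 1, n))) / L[i][i]
    return p

# Positive definite Hessian: solve H p = -g for the Newton step.
hessian = [[4.0, 2.0], [2.0, 3.0]]
gradient = [6.0, 5.0]
p = cholesky_solve(hessian, [-gi for gi in gradient])
print(p)  # approximately [-1.0, -1.0]

# Indefinite Hessian: the factorization fails, flagging a possible saddle point.
try:
    cholesky([[1.0, 2.0], [2.0, 1.0]])
    indefinite_detected = False
except ValueError:
    indefinite_detected = True
print(indefinite_detected)  # True
```

The Hessian is never inverted explicitly; only two triangular systems are solved per step.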
On the other hand, if a constrained optimization is done (for example, with Lagrange multipliers), the problem may become one of saddle point finding, in which case the Hessian will be symmetric indefinite and the linear system for $\mathbf{x}_{n+1}$ will need to be solved with a method suited to such matrices, such as the $LDL^T$ variant of Cholesky factorization or the conjugate residual method.
There also exist various quasi-Newton methods, where an approximation for the Hessian (or its inverse directly) is built up from changes in the gradient.
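In one dimension the simplest instance of this idea is the secant method applied to $f'$: the second derivative is replaced by a finite-difference estimate built from the two most recent gradient values. The sketch below is only an illustration of the principle (practical quasi-Newton methods such as BFGS update a full matrix); the name `secant_newton` and the test function $f(x) = e^x - 2x$ are assumptions.

```python
import math

def secant_newton(df, x_prev, x_curr, tol=1e-12, max_iter=100):
    """Newton-like iteration with f'' estimated from changes in the gradient f'."""
    g_prev, g_curr = df(x_prev), df(x_curr)
    for _ in range(max_iter):
        if g_curr == g_prev:
            break  # no change in gradient, so no curvature estimate is available
        b = (g_curr - g_prev) / (x_curr - x_prev)  # approximates f''(x_curr)
        step = -g_curr / b
        x_prev, g_prev = x_curr, g_curr
        x_curr += step
        g_curr = df(x_curr)
        if abs(step) < tol:
            break
    return x_curr

# Illustrative function (an assumption): f(x) = exp(x) - 2x, stationary point x* = ln 2.
def df(x):
    return math.exp(x) - 2.0

x_star = secant_newton(df, 0.0, 1.0)
print(x_star, math.log(2.0))
```

No second derivative is evaluated at any point, at the cost of somewhat slower (superlinear rather than quadratic) convergence.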
If the Hessian is close to a non-invertible matrix, the inverted Hessian can be numerically unstable and the iteration may diverge. Several workarounds exist, with varying success depending on the problem. One can, for example, modify the Hessian by adding a correction matrix $B_n$ so as to make $Hf(\mathbf{x}_n) + B_n$ positive definite. One approach is to diagonalize $Hf(\mathbf{x}_n)$ and choose $B_n$ so that $Hf(\mathbf{x}_n) + B_n$ has the same eigenvectors as $Hf(\mathbf{x}_n)$, but with each negative eigenvalue replaced by $\epsilon > 0$.
An approach exploited in the Levenberg–Marquardt algorithm (which uses an approximate Hessian) is to add a scaled identity matrix to the Hessian, $\mu I$, with the scale adjusted at every iteration as needed. For large $\mu$ and small Hessian, the iterations will behave like gradient descent with step size $1/\mu$. This results in slower but more reliable convergence where the Hessian doesn't provide useful information.
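The two limiting behaviours of the damped step can be checked directly. The sketch below solves $(Hf + \mu I)\,\mathbf{p} = -\nabla f$ for a $2 \times 2$ example; the function name `lm_damped_step` and the numerical values of the Hessian and gradient are assumptions, and the rule for adjusting $\mu$ between iterations is deliberately left out.

```python
def lm_damped_step(hessian, gradient, mu):
    """Solve (H + mu*I) p = -g for a 2x2 matrix H by Cramer's rule."""
    a00 = hessian[0][0] + mu
    a01 = hessian[0][1]
    a10 = hessian[1][0]
    a11 = hessian[1][1] + mu
    b0, b1 = -gradient[0], -gradient[1]
    det = a00 * a11 - a01 * a10
    return [(b0 * a11 - a01 * b1) / det, (a00 * b1 - a10 * b0) / det]

# Illustrative values (assumptions, not from the article).
H = [[2.0, 1.0], [1.0, 3.0]]
g = [1.0, -2.0]

newton_step = lm_damped_step(H, g, mu=0.0)   # mu = 0 recovers the plain Newton step
mu = 1.0e6
damped_step = lm_damped_step(H, g, mu)       # large mu: close to -(1/mu) * g
gradient_step = [-g[0] / mu, -g[1] / mu]     # gradient descent with step size 1/mu

print(newton_step)
print(damped_step, gradient_step)
```

Intermediate values of $\mu$ interpolate between these two extremes, which is what allows the scale to be tuned according to how trustworthy the Hessian is at the current iterate.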
Other approximations
Some functions are poorly approximated by quadratics, particularly when far from a maximum or minimum. In these cases, approximations other than quadratic may be more appropriate.[1]
See also
- Quasi-Newton method
- Gradient descent
- Gauss–Newton algorithm
- Levenberg–Marquardt algorithm
- Trust region
- Optimization
References
- [1] Minka, Thomas P. (2002-04-17). Beyond Newton's Method (PDF). http://research.microsoft.com/en-us/um/people/minka/papers/minka-newton.pdf. Retrieved 2009-02-20.
- Avriel, Mordecai (2003). Nonlinear Programming: Analysis and Methods. Dover Publishing. ISBN 0-486-43227-0.
- Bonnans, J. Frédéric; Gilbert, J. Charles; Lemaréchal, Claude; Sagastizábal, Claudia A. (2006). Numerical optimization: Theoretical and practical aspects. Universitext (Second revised ed. of translation of 1997 French ed.). Berlin: Springer-Verlag. pp. xiv+490. doi:10.1007/978-3-540-35447-5. ISBN 3-540-35445-X. MR2265882. http://www.springer.com/mathematics/applications/book/978-3-540-35445-1.
- Fletcher, Roger (1987). Practical methods of optimization (2nd ed.). New York: John Wiley & Sons. ISBN 978-0-471-91547-8.
- Nocedal, Jorge & Wright, Stephen J. (1999). Numerical Optimization. Springer-Verlag. ISBN 0-387-98793-2.
Wikimedia Foundation. 2010.