Derivation of the conjugate gradient method

In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system

$\boldsymbol{Ax}=\boldsymbol{b}$

where $\boldsymbol{A}$ is symmetric positive-definite. The conjugate gradient method can be derived from several different perspectives, including specialization of the conjugate direction method for optimization, and variation of the Arnoldi/Lanczos iteration for eigenvalue problems.

The intent of this article is to document the important steps in these derivations.

1 Derivation from the conjugate direction method
- 1.1 The conjugate direction method
2 Derivation from the Arnoldi/Lanczos iteration
3 References

Derivation from the conjugate direction method

The conjugate gradient method can be seen as a special case of the conjugate direction method applied to minimization of the quadratic function

$f(\boldsymbol{x})=\boldsymbol{x}^\mathrm{T}\boldsymbol{A}\boldsymbol{x}-2\boldsymbol{b}^\mathrm{T}\boldsymbol{x}\text{.}$

The conjugate direction method

In the conjugate direction method for minimizing

$f(\boldsymbol{x})=\boldsymbol{x}^\mathrm{T}\boldsymbol{A}\boldsymbol{x}-2\boldsymbol{b}^\mathrm{T}\boldsymbol{x}\text{.}$

one starts with an initial guess $\boldsymbol{x}_0$ and the corresponding residual $\boldsymbol{r}_0=\boldsymbol{b}-\boldsymbol{Ax}_0$ , and computes the iterate and residual by the formulae

$\begin{align} \alpha_i&=\frac{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{r}_i}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\text{,}\\ \boldsymbol{x}_{i+1}&=\boldsymbol{x}_i+\alpha_i\boldsymbol{p}_i\text{,}\\ \boldsymbol{r}_{i+1}&=\boldsymbol{r}_i-\alpha_i\boldsymbol{Ap}_i \end{align}$

where $\boldsymbol{p}_0,\boldsymbol{p}_1,\boldsymbol{p}_2,\ldots$ are a series of mutually conjugate directions, i.e.,

$\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_j=0$

for any $i\neq j$ .

The conjugate direction method is imprecise in the sense that no formulae are given for selection of the directions $\boldsymbol{p}_0,\boldsymbol{p}_1,\boldsymbol{p}_2,\ldots$ . Specific choices lead to various methods including the conjugate gradient method and Gaussian elimination.

Derivation from the Arnoldi/Lanczos iteration

Further information: Arnoldi iteration and Lanczos iteration

The conjugate gradient method can also be seen as a variant of the Arnoldi/Lanczos iteration applied to solving linear systems.

The general Arnoldi method

In the Arnoldi iteration, one starts with a vector $\boldsymbol{r}_0$ and gradually builds an orthonormal basis $\{\boldsymbol{v}_1,\boldsymbol{v}_2,\boldsymbol{v}_3,\ldots\}$ of the Krylov subspace

$\mathcal{K}(\boldsymbol{A},\boldsymbol{r}_0)=\{\boldsymbol{r}_0,\boldsymbol{Ar}_0,\boldsymbol{A}^2\boldsymbol{r}_0,\ldots\}$

by defining $\boldsymbol{v}_i=\boldsymbol{w}_i/\lVert\boldsymbol{w}_i\rVert_2$ where

$\boldsymbol{w}_i=\begin{cases} \boldsymbol{r}_0 & \text{if }i=1\text{,}\\ \boldsymbol{Av}_{i-1}-\sum_{j=1}^{i-1}(\boldsymbol{v}_j^\mathrm{T}\boldsymbol{Av}_{i-1})\boldsymbol{v}_j & \text{if }i>1\text{.} \end{cases}$

In other words, for $i > 1$ , $\boldsymbol{v}_i$ is found by Gram-Schmidt orthogonalizing $\boldsymbol{Av}_{i-1}$ against $\{\boldsymbol{v}_1,\boldsymbol{v}_2,\ldots,\boldsymbol{v}_{i-1}\}$ followed by normalization.

Put in matrix form, the iteration is captured by the equation

$\boldsymbol{AV}_i=\boldsymbol{V}_{i+1}\boldsymbol{\tilde{H}}_i$

where

$\begin{align} \boldsymbol{V}_i&=\begin{bmatrix} \boldsymbol{v}_1 & \boldsymbol{v}_2 & \cdots & \boldsymbol{v}_i \end{bmatrix}\text{,}\\ \boldsymbol{\tilde{H}}_i&=\begin{bmatrix} h_{11} & h_{12} & h_{13} & \cdots & h_{1,i}\\ h_{21} & h_{22} & h_{23} & \cdots & h_{2,i}\\ & h_{32} & h_{33} & \cdots & h_{3,i}\\ & & \ddots & \ddots & \vdots\\ & & & h_{i,i-1} & h_{i,i}\\ & & & & h_{i+1,i} \end{bmatrix}=\begin{bmatrix} \boldsymbol{H}_i\\ h_{i+1,i}\boldsymbol{e}_i^\mathrm{T} \end{bmatrix} \end{align}$

with

$h_{ji}=\begin{cases} \boldsymbol{v}_j^\mathrm{T}\boldsymbol{Av}_i & \text{if }j\leq i\text{,}\\ \lVert\boldsymbol{w}_{i+1}\rVert_2 & \text{if }j=i+1\text{,}\\ 0 & \text{if }j>i+1\text{.} \end{cases}$

When applying the Arnoldi iteration to solving linear systems, one starts with $\boldsymbol{r}_0=\boldsymbol{b}-\boldsymbol{Ax}_0$ , the residual corresponding to an initial guess $\boldsymbol{x}_0$ . After each step of iteration, one computes $\boldsymbol{y}_i=\boldsymbol{H}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)$ and the new iterate $\boldsymbol{x}_i=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{y}_i$ .

The direct Lanzcos method

For the rest of discussion, we assume that $\boldsymbol{A}$ is symmetric positive-definite. With symmetry of $\boldsymbol{A}$ , the upper Hessenberg matrix $\boldsymbol{H}_i=\boldsymbol{V}_i^\mathrm{T}\boldsymbol{AV}_i$ becomes symmetric and thus tridiagonal. It then can be more clearly denoted by

$\boldsymbol{H}_i=\begin{bmatrix} a_1 & b_2\\ b_2 & a_2 & b_3\\ & \ddots & \ddots & \ddots\\ & & b_{i-1} & a_{i-1} & b_i\\ & & & b_i & a_i \end{bmatrix}\text{.}$

This enables a short three-term recurrence for $\boldsymbol{v}_i$ in the iteration, and the Arnoldi iteration is reduced to the Lanczos iteration.

Since $\boldsymbol{A}$ is symmetric positive-definite, so is $\boldsymbol{H}_i$ . Hence, $\boldsymbol{H}_i$ can be LU factorized without partial pivoting into

$\boldsymbol{H}_i=\boldsymbol{L}_i\boldsymbol{U}_i=\begin{bmatrix} 1\\ c_2 & 1\\ & \ddots & \ddots\\ & & c_{i-1} & 1\\ & & & c_i & 1 \end{bmatrix}\begin{bmatrix} d_1 & b_2\\ & d_2 & b_3\\ & & \ddots & \ddots\\ & & & d_{i-1} & b_i\\ & & & & d_i \end{bmatrix}$

with convenient recurrences for $c i$ and $d i$ :

$\begin{align} c_i&=b_i/d_{i-1}\text{,}\\ d_i&=\begin{cases} a_1 & \text{if }i=1\text{,}\\ a_i-c_ib_i & \text{if }i>1\text{.} \end{cases} \end{align}$

Rewrite $\boldsymbol{x}_i=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{y}_i$ as

$\begin{align} \boldsymbol{x}_i&=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{H}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)\\ &=\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{U}_i^{-1}\boldsymbol{L}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)\\ &=\boldsymbol{x}_0+\boldsymbol{P}_i\boldsymbol{z}_i \end{align}$

with

$\begin{align} \boldsymbol{P}_i&=\boldsymbol{V}_{i}\boldsymbol{U}_i^{-1}\text{,}\\ \boldsymbol{z}_i&=\boldsymbol{L}_i^{-1}(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)\text{.} \end{align}$

It is now important to observe that

$\begin{align} \boldsymbol{P}_i&=\begin{bmatrix} \boldsymbol{P}_{i-1} & \boldsymbol{p}_i \end{bmatrix}\text{,}\\ \boldsymbol{z}_i&=\begin{bmatrix} \boldsymbol{z}_{i-1}\\ \zeta_i \end{bmatrix}\text{.} \end{align}$

In fact, there are short recurrences for $\boldsymbol{p}_i$ and $ζ i$ as well:

$\begin{align} \boldsymbol{p}_i&=\frac{1}{d_i}(\boldsymbol{v}_i-b_i\boldsymbol{p}_{i-1})\text{,}\\ \zeta_i&=-c_i\zeta_{i-1}\text{.} \end{align}$

With this formulation, we arrive at a simple recurrence for $\boldsymbol{x}_i$ :

$\begin{align} \boldsymbol{x}_i&=\boldsymbol{x}_0+\boldsymbol{P}_i\boldsymbol{z}_i\\ &=\boldsymbol{x}_0+\boldsymbol{P}_{i-1}\boldsymbol{z}_{i-1}+\zeta_i\boldsymbol{p}_i\\ &=\boldsymbol{x}_{i-1}+\zeta_i\boldsymbol{p}_i\text{.} \end{align}$

The relations above straightforwardly lead to the direct Lanczos method, which turns out to be slightly more complex.

The conjugate gradient method from imposing orthogonality and conjugacy

If we allow $\boldsymbol{p}_i$ to scale and compensate for the scaling in the constant factor, we potentially can have simpler recurrences of the form:

$\begin{align} \boldsymbol{x}_i&=\boldsymbol{x}_{i-1}+\alpha_{i-1}\boldsymbol{p}_{i-1}\text{,}\\ \boldsymbol{r}_i&=\boldsymbol{r}_{i-1}-\alpha_{i-1}\boldsymbol{Ap}_{i-1}\text{,}\\ \boldsymbol{p}_i&=\boldsymbol{r}_i+\beta_{i-1}\boldsymbol{p}_{i-1}\text{.} \end{align}$

As premises for the simplification, we now derive the orthogonality of $\boldsymbol{r}_i$ and conjugacy of $\boldsymbol{p}_i$ , i.e., for $i\neq j$ ,

$\begin{align} \boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_j&=0\text{,}\\ \boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_j&=0\text{.} \end{align}$

The residuals are mutually orthogonal because $\boldsymbol{r}_i$ is essentially a multiple of $\boldsymbol{v}_{i+1}$ since for $i = 0$ , $\boldsymbol{r}_0=\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{v}_1$ , for $i > 0$ ,

$\begin{align} \boldsymbol{r}_i&=\boldsymbol{b}-\boldsymbol{Ax}_i\\ &=\boldsymbol{b}-\boldsymbol{A}(\boldsymbol{x}_0+\boldsymbol{V}_i\boldsymbol{y}_i)\\ &=\boldsymbol{r}_0-\boldsymbol{AV}_i\boldsymbol{y}_i\\ &=\boldsymbol{r}_0-\boldsymbol{V}_{i+1}\boldsymbol{\tilde{H}}_i\boldsymbol{y}_i\\ &=\boldsymbol{r}_0-\boldsymbol{V}_i\boldsymbol{H}_i\boldsymbol{y}_i-h_{i+1,i}(\boldsymbol{e}_i^\mathrm{T}\boldsymbol{y}_i)\boldsymbol{v}_{i+1}\\ &=\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{v}_1-\boldsymbol{V}_i(\lVert\boldsymbol{r}_0\rVert_2\boldsymbol{e}_1)-h_{i+1,i}(\boldsymbol{e}_i^\mathrm{T}\boldsymbol{y}_i)\boldsymbol{v}_{i+1}\\ &=-h_{i+1,i}(\boldsymbol{e}_i^\mathrm{T}\boldsymbol{y}_i)\boldsymbol{v}_{i+1}\text{.}\end{align}$

To see the conjugacy of $\boldsymbol{p}_i$ , it suffices to show that $\boldsymbol{P}_i^\mathrm{T}\boldsymbol{AP}_i$ is diagonal:

$\begin{align} \boldsymbol{P}_i^\mathrm{T}\boldsymbol{AP}_i&=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{V}_i^\mathrm{T}\boldsymbol{AV}_i\boldsymbol{U}_i^{-1}\\ &=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{H}_i\boldsymbol{U}_i^{-1}\\ &=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{L}_i\boldsymbol{U}_i\boldsymbol{U}_i^{-1}\\ &=\boldsymbol{U}_i^{-\mathrm{T}}\boldsymbol{L}_i \end{align}$

is symmetric and lower triangular simultaneously and thus must be diagonal.

Now we can derive the constant factors $α i$ and $β i$ with respect to the scaled $\boldsymbol{p}_i$ by solely imposing the orthogonality of $\boldsymbol{r}_i$ and conjugacy of $\boldsymbol{p}_i$ .

Due to the orthogonality of $\boldsymbol{r}_i$ , it is necessary that $\boldsymbol{r}_{i+1}^\mathrm{T}\boldsymbol{r}_i=(\boldsymbol{r}_i-\alpha_i\boldsymbol{Ap}_i)^\mathrm{T}\boldsymbol{r}_i=0$ . As a result,

$\begin{align} \alpha_i&=\frac{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{Ap}_i}\\ &=\frac{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}{(\boldsymbol{p}_i-\beta_{i-1}\boldsymbol{p}_{i-1})^\mathrm{T}\boldsymbol{Ap}_i}\\ &=\frac{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\text{.} \end{align}$

Similarly, due to the conjugacy of $\boldsymbol{p}_i$ , it is necessary that $\boldsymbol{p}_{i+1}^\mathrm{T}\boldsymbol{Ap}_i=(\boldsymbol{r}_{i+1}+\beta_i\boldsymbol{p}_i)^\mathrm{T}\boldsymbol{Ap}_i=0$ . As a result,

$\begin{align} \beta_i&=-\frac{\boldsymbol{r}_{i+1}^\mathrm{T}\boldsymbol{Ap}_i}{\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\\ &=-\frac{\boldsymbol{r}_{i+1}^\mathrm{T}(\boldsymbol{r}_i-\boldsymbol{r}_{i+1})}{\alpha_i\boldsymbol{p}_i^\mathrm{T}\boldsymbol{Ap}_i}\\ &=\frac{\boldsymbol{r}_{i+1}^\mathrm{T}\boldsymbol{r}_{i+1}}{\boldsymbol{r}_i^\mathrm{T}\boldsymbol{r}_i}\text{.} \end{align}$

This completes the derivation.

References

Hestenes, M. R.; Stiefel, E. (December 1952). "Methods of conjugate gradients for solving linear systems" (PDF). Journal of Research of the National Bureau of Standards 49 (6). http://nvl.nist.gov/pub/nistpubs/jres/049/6/V49.N06.A08.pdf.
Saad, Y. (2003). "Chapter 6: Krylov Subspace Methods, Part I". Iterative methods for sparse linear systems (2nd ed.). SIAM. ISBN 978-0898715347.

v · d · eNumerical linear algebra

Key concepts	Floating point · Numerical stability · BLAS

Problems	Matrix multiplication · Matrix decompositions · Linear equations · Sparse problems

Hardware	CPU cache · TLB · Cache-oblivious algorithm · SIMD · Multiprocessing

Software	Specialized libraries · General purpose software

Categories:

Numerical linear algebra
Optimization methods
Gradient methods

Wikimedia Foundation. 2010.

Игры ⚽ Поможем сделать НИР

Look at other dictionaries:

Conjugate gradient method — A comparison of the convergence of gradient descent with optimal step size (in green) and conjugate vector (in red) for minimizing a quadratic function associated with a given linear system. Conjugate gradient, assuming exact arithmetic,… … Wikipedia
Galerkin method — In mathematics, in the area of numerical analysis, Galerkin methods are a class of methods for converting a continuous operator problem (such as a differential equation) to a discrete problem. In principle, it is the equivalent of applying the… … Wikipedia
Gauss–Newton algorithm — The Gauss–Newton algorithm is a method used to solve non linear least squares problems. It can be seen as a modification of Newton s method for finding a minimum of a function. Unlike Newton s method, the Gauss–Newton algorithm can only be used… … Wikipedia
Least squares — The method of least squares is a standard approach to the approximate solution of overdetermined systems, i.e., sets of equations in which there are more equations than unknowns. Least squares means that the overall solution minimizes the sum of… … Wikipedia
Stress (mechanics) — Continuum mechanics … Wikipedia
Expectation-maximization algorithm — An expectation maximization (EM) algorithm is used in statistics for finding maximum likelihood estimates of parameters in probabilistic models, where the model depends on unobserved latent variables. EM alternates between performing an… … Wikipedia
CMA-ES — stands for Covariance Matrix Adaptation Evolution Strategy. Evolution strategies (ES) are stochastic, derivative free methods for numerical optimization of non linear or non convex continuous optimization problems. They belong to the class of… … Wikipedia
Cuckoo search — (CS) is an optimization algorithm developed by Xin she Yang and Suash Deb in 2009.[1][2] It was inspired by the obligate brood parasitism of some cuckoo species by laying their eggs in the nests of other host birds (of other species). Some host… … Wikipedia
Hamiltonian mechanics — is a re formulation of classical mechanics that was introduced in 1833 by Irish mathematician William Rowan Hamilton. It arose from Lagrangian mechanics, a previous re formulation of classical mechanics introduced by Joseph Louis Lagrange in 1788 … Wikipedia
Calculus of variations — is a field of mathematics that deals with extremizing functionals, as opposed to ordinary calculus which deals with functions. A functional is usually a mapping from a set of functions to the real numbers. Functionals are often formed as definite … Wikipedia

Academic Dictionaries and Encyclopedias

Derivation of the conjugate gradient method

Contents