- Linear least squares
**Linear least squares**is an importantcomputation al problem, that arises primarily in applications when it is desired to fit a linearmathematical model tomeasurement s obtained fromexperiment s. The goals of linear least squares are to extract predictions from the measurements and to reduce the effect of measurement errors. Mathematically, it can be stated as the problem of finding an approximate solution to anoverdetermined system of linear equations.Linear least square problems admit a closed-form solution, in contrast to

non-linear least squares problems, which often have to be solved by an iterative procedure.**Motivational example**As a result of an experiment, four $(x,\; y)$ data points were obtained, $(1,\; 6),$ $(2,\; 5),$ $(3,\; 7),$ and $(4,\; 10)$ (shown in red in the picture on the right). It is desired to find a line $y=alpha+eta\; x$ that fits "best" these four points. In other words, we would like to find the numbers $alpha$ and $eta$ that approximately solve the overdetermined linear system:$egin\{alignat\}\{3\}alpha\; +\; 1eta\; ;\; =\; ;\; 6\; \backslash alpha\; +\; 2eta\; ;\; =\; ;\; 5\; \backslash alpha\; +\; 3eta\; ;\; =\; ;\; 7\; \backslash alpha\; +\; 4eta\; ;\; =\; ;\; 10\; \backslash end\{alignat\}$of four equations in two unknowns in some "best" sense.

The

least squares approach to solving this problem is to try to make as small as possible the sum of squares of "errors" between the right- and left-hand sides of these equations, that is, to find the minimum of the function: $S(alpha,\; eta)=\; left\; [6-(alpha+1eta)\; ight]\; ^2+left\; [5-(alpha+2eta)\; ight]\; ^2+left\; [7-(alpha\; +\; 3eta)\; ight]\; ^2+left\; [10-(alpha\; +\; 4eta)\; ight]\; ^2.$

The minimum is determined by calculating the

partial derivative s of $S(alpha,\; eta)$ in respect to $alpha$ and $eta$ and setting them to zero. This results in a system of two equations in two unknowns, which, when solved, gives the solution:$alpha=3.5$:$eta=1.4$

and the equation $y=3.5+1.4x$ of the line of best fit. The residuals, that is, the discrepancies between the $y$ values from the experiment and the $y$ values calculated using the line of best fit are then found to be $1.1,$ $-1.3,$ $-0.7,$ and $0.9$ (see the picture on the right). The minimum value of the sum of squares is $S(3.5,\; 1.4)=1.1^2+(-1.3)^2+(-0.7)^2+0.9^2=4.2.$

**The general problem**Consider an

overdetermined system :$sum\_\{j=1\}^\{n\}\; X\_\{ij\}eta\_j\; =\; y\_i,\; (i=1,\; 2,\; dots,\; m),$

of $m$

linear equation s in $n$ unknowns, $eta\_1,\; eta\_2,\; dots,\; eta\_n,$ with $m\; >\; n,$ written in matrix form as:$mathbf\{X\}oldsymbol\; eta\; =\; mathbf\; y.$

Such a system usually has no solution, and the goal is then to find the numbers $eta\_j$ which fit the equations "best", in the sense of minimizing the sum of squares of differences between the right- and left-hand sides of the equations. The justification for choosing this criterion is given in properties, below.

The linear least squares problem has a unique solution, provided that the $n$ columns of the matrix $X$ are

linearly independent . The solution is obtained by solving the**normal equations**:$mathbf\{left(X^TX\; ight)hat\; oldsymbol\; eta=X^Ty\}.$

**Uses in data fitting**The primary application of linear least squares is in data fitting. Given a set of "m" data points $y\_1,\; y\_2,dots,\; y\_m,$ consisting of experimentally measured values taken at "m" values $x\_1,\; x\_2,dots,\; x\_m$ of an independent variable ($x\_i$ may be scalar or vector quantities), and given a model function $y=f(x,\; oldsymbol\; eta),$ with $oldsymbol\; eta\; =\; (eta\_1,\; eta\_2,\; dots,\; eta\_n),$ it is desired to find the parameters $eta\_j$ such that the model function fits "best" the data. In linear least squares the model function is assumed to be linear in the parameters $eta\_j,$ so

:$f(x,\; oldsymbol\; eta)\; =\; sum\_\{j=1\}^\{n\}\; eta\_j\; phi\_j(x).$

Here, the functions $phi\_j$ may be nonlinear in the variable

**x**.Ideally, the model function fits the data exactly, so

: $y\_i\; =\; f(x\_i,\; oldsymbol\; eta)$

for all $i=1,\; 2,\; dots,\; m.$ This is usually not possible in practice, as there are more data points than there are parameters to be determined. The approach chosen then is to find the minimal possible value of the sum of squares of the residuals:$r\_i(oldsymbol\; eta)=\; y\_i\; -\; f(x\_i,\; oldsymbol\; eta),\; (i=1,\; 2,\; dots,\; m)$so to minimize the function

:$S(oldsymbol\; eta)=sum\_\{i=1\}^\{m\}r\_i^2(oldsymbol\; eta).$

The problem then reduces to the overdetermined linear system mentioned earlier, with $X\_\{ij\}=phi\_j(x\_i).$

**Derivation of the normal equations**"S" is minimized when its gradient with respect to each parameter is equal to zero. The elements of the gradient vector are the partial derivatives of "S" with respect to the parameters:

:$frac\{partial\; S\}\{partial\; eta\_j\}=2sum\_i\; r\_ifrac\{partial\; r\_i\}\{partial\; eta\_j\}=0\; (j=1,2,dots,\; n).$Since $r\_i=\; y\_i\; -\; sum\_\{j=1\}^\{n\}\; X\_\{ij\}eta\_j$, the derivatives are

:$frac\{partial\; r\_i\}\{partial\; eta\_j\}=-X\_\{ij\}.$

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives

:$frac\{partial\; S\}\{partial\; eta\_j\}=-2sum\_\{i=1\}^\{m\}X\_\{ij\}\; left(\; y\_i-sum\_\{k=1\}^\{n\}\; X\_\{ik\}eta\_k\; ight)=0.$

Upon rearrangement, the

**normal equations**:$sum\_\{i=1\}^\{m\}sum\_\{k=1\}^\{n\}\; X\_\{ij\}X\_\{ik\}hat\; eta\_k=sum\_\{i=1\}^\{m\}\; X\_\{ij\}y\_i\; (j=1,2,dots,\; n),$are obtained. The normal equations are written in matrix notation as

:$mathbf\{left(X^TX\; ight)hat\; oldsymbol\; eta=X^Ty\}.$

The solution of the normal equations yields the vector $hat\; oldsymbol\; eta$ of the optimal parameter values.

**Inverting the normal equations**Although the algebraic solution of the normal equations can be written as:$mathbf\{\; hat\; oldsymboleta=left(X^TX\; ight)^\{-1\}X^Ty\}$it is not good practice to invert the normal equations matrix. An exception occurs in

numerical smoothing and differentiation where an analytical expression is required.If the matrix $mathbf\{X^TX\}$ is well-conditioned and positive definite, that is, it has full rank, the normal equations can be solved directly by using the

Cholesky decomposition $mathbf\{X^TX=R^TR\}$, where**R**is an uppertriangular matrix , giving: $mathbf\{\; R^T\; R\; hat\; oldsymboleta\; =\; X^Ty\}.$The solution is obtained in two stages, a forward substitution, $mathbf\{R^Tz=X^Ty\}$, followed by a backward substitution $mathbf\{Rhat\; oldsymboleta=z\}$. Both subtitutions are facilitated by the triangular nature of**R**.See example of linear regression for a worked-out numerical example with three parameters.

**Orthogonal decomposition methods**Orthogonal decomposition methods of solving the least squares problem are slower than the normal equations method but are more numerically stable.

The extra stability results from not having to form the product $mathbf\{X^TX\}$.The residuals are written in matrix notation as:$mathbf\{r=y-Xoldsymboleta\}.$The matrix

**X**is subjected to an orthogonal decomposition; theQR decomposition will serve to illustrate the process. :$mathbf\{X=QR\}$where**Q**is anorthogonal $m\; imes\; m$ matrix and**R**is an $m\; imes\; n$ matrix which is partitioned into a $n\; imes\; n$ block, $mathbfR\_n$, and a $m-n\; imes\; n$ zero block. $mathbfR\_n$ is upper triangular.:$mathbf\{R\}=\; egin\{bmatrix\}mathbf\{R\}\_n\; \backslash mathbf\{0\}end\{bmatrix\}.$The residual vector is left-multiplied by $mathbf\; \{Q^T\}$. :$mathbf\{Q^Tr=Q^T\; y\; -left(Q^TQ\; ight)R\; oldsymboleta\}=\; egin\{bmatrix\}mathbf\{left(Q^T\; y\; ight)\}\_n\; -mathbf\{R\}\_n\; oldsymboleta\; \backslash mathbf\{left(Q^T\; y\; ight)\}\_\{m-n\}end\{bmatrix\}=\; egin\{bmatrix\}mathbf\{U\}\backslash mathbf\{L\}end\{bmatrix\}$The sum of squares of the transformed residuals, $S=mathbf\{r^T\; Q\; Q^Tr\}$, is the same as before, $S=mathbf\{r^Tr\}$ because**Q**isorthogonal .:$S=mathbf\{U^TU+L^TL\}$ The minimum value of "S" is attained when the upper block,**U**, is zero. Therefore the parameters are found by solving:$mathbf\{R\}\_n\; hatoldsymboleta\; =mathbf\{left(Q^T\; y\; ight)\}\_n$These equations are easily solved as $mathbf\{R\}\_n$ is upper triangular.An alternative decomposition of

**X**is thesingular value decomposition (SVD) [*C.L. Lawson and R.J. Hanson, Solving Least Squares Problems, Prentice-Hall,1974*] :$mathbf\{\; X\; =\; USigma\; V^*\}.$This is effectively another kind of orthogonal decomposition as both**U**and**V**are orthogonal. This method is the most computationally intensive, but is particularly useful if the normal equations matrix, $mathbf\{X^TX\}$, is very ill-conditioned (i.e. if itscondition number multiplied by the machine's relativeround-off error is appreciably large). In that case, including the smallestsingular value s in the inversion merely adds numerical noise to the solution. This can be cured using the truncated SVD approach, giving a more stable and exact answer, by explicitly setting to zero all singular values below a certain threshold and so ignoring them, a process closely related tofactor analysis .**Properties of the least-squares estimators**The gradient equations at the minimum can be written as:$mathbf\{(y-Xhatoldsymboleta)X\}=0.$A geometrical interpretation of these equations is that the vector of residuals, $mathbf\{y-Xhatoldsymboleta\}$ is orthogonal to the

column space of $mathbf\{X\}$, since the dot product $mathbf\{(y-Xhatoldsymboleta)cdot\; Xv\}$ is equal to zero for "any" conformal vector, $mathbf\{v\}$. This means that $mathbf\{y\}-mathbf\{X\}oldsymbol\; hat\; eta$ is the shortest of all possible vectors $mathbf\{y\}-mathbf\{X\}oldsymbol\; eta$, that is, the variance of the residuals is the minimum possible. This is illustrated at the right.If the experimental errors, $epsilon\; ,$, are uncorrelated, have a mean of zero and a constant variance, $sigma$, the

Gauss-Markov theorem states that the least-squares estimator, $hat\; eta$, has the minimum variance of all estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statisticaldistribution function of the errors. In other words, "the distribution function of the errors need not be anormal distribution ".For example, it is easy to show that the

arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss-Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.However, in the case that the experimental errors do belong to a Normal distribuition, the least-squares estimator is also a

maximum likelihood estimator. [*H. Margenau and G.M. Murphy, The Mathematics of Physics and Chemistry, Van Nostrand, 1943, 1956*]These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.

**Limitations**An assumption underlying the treatment given above is that the independent variable, "x", is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case,

total least squares also known as "Errors-in-variables model", or "Rigorous least squares", should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.P. Gans, Data fitting in the Chemical Sciences, Wiley, 1992] [*W.E. Deming, Statistical adjustment of Data, Wiley, 1943*]In some cases the (weighted) normal equations matrix $mathbf\{X^TX\}$ is

ill-conditioned ; this occurs when the measurements have only a marginal effect on one or more of the estimated parameters.When fitting polynomials the normal equations matrix is aVandermonde matrix . Vandermode matrices become increasingly ill-conditioned as the order of the matrix increases.] In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various regularization techniques can be applied in such cases, the most common of which is calledTikhonov regularization . If further information about the parameters is known, for example, a range of possible values of**x**, thenminimax techniques can also be used to increase the stability of the solution.Another drawback of the least squares estimator is the fact that the norm of the residuals, $|mathbf\{y-Xoldsymboleta\}|$ is minimized, whereas in some cases one is truly interested in obtaining small error in the parameter $mathbf\{oldsymboleta\}$, e.g., a small value of $|oldsymboleta-hatoldsymboleta|$. However, since $oldsymboleta$ is unknown, this quantity cannot be directly minimized. If a

prior probability on $oldsymboleta$ is known, then aBayes estimator can be used to minimize themean squared error , $E\; left\{\; |\; oldsymboleta\; -\; hatoldsymboleta\; |^2\; ight\}$. The least squares method is often applied when no prior is known. Surprisingly, however, better estimators can be constructed, an effect known asStein's phenomenon . For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is theJames-Stein estimator .**Weighted linear least squares**When the observations are not equally reliable, a weighted sum of squares:$S=sum\_\{i=1\}^\{m\}W\_\{ii\}r\_i^2$may be minimized.

Each element of the

diagonal weight matrix,**W**should,ideally, be equal to the reciprocal of thevariance of the measurement. [*This implies that the observations are uncorrelated. If the observations are*] The normal equations are then:$mathbf\{left(X^TWX\; ight)hat\; oldsymbol\; eta=X^TWy\}.$correlated , the expression:$S=sum\_k\; sum\_j\; r\_k\; W\_\{kj\}\; r\_j,$applies. In this case the weight matrix should ideally be equal to the inverse of thevariance-covariance matrix of the observations.**Parameter errors, correlation and confidence limits**The parameter values are linear combinations of the observed values:$mathbf\{hat\; eta=(X^TWX)^\{-1\}X^TWy\}$Therefore an expression for the errors on the parameter can be obtained by

error propagation from the errors on the observations. Let thevariance-covariance matrix for the observations be denoted by**M**and that of the parameters by**M**. Then,:$mathbf\{M^eta=(X^TWX)^\{-1\}X^TW\; M\; W^TX(X^TWX)^\{-1$When $mathbf\{W=M^\{-1$, this simplifies to:$mathbf\{M^eta=(X^TWX)^\{-1.$^{$eta$}When unit weights are used ($mathbf\{W=I,\; hat\; eta=(X^TX)^\{-1\}X^Ty\}$) it is implied that the experimental errors are uncorrelated and all equal: $mathbf\{M\}=sigma^2\; mathbf\{I\}$, where $sigma^2,$ is known as the variance of an observation of unit weight, and $mathbf\{I\}$ is an

identity matrix . In this case $sigma^2,$ is approximated by $frac\{S\}\{m-n\}$, where "S" is the minimum value of the objective function:$mathbf\{M^eta=\}frac\{S\}\{m-n\}mathbf\{(X^TX)^\{-1.$In all cases, thevariance of the parameter $eta\_i$ is given by $M^eta\_\{ii\}$ and thecovariance between parameters $eta\_i$ and $eta\_j$ is given by $M^eta\_\{ij\}$.Standard deviation is the square root of variance and the correlation coefficient is given by $ho\_\{ij\}\; =\; M^eta\_\{ij\}/sigma\_i/sigma\_j$. These error estimates reflect onlyrandom errors in the measurements. The true uncertainty in the parameters is larger due to the presence ofsystematic errors which, by definition, cannot be quantified.Note that even though the observations may be un-correlated, the parameters are always correlated.It is often "assumed", for want of any concrete evidence, that the error on a parameter belongs to a

Normal distribution with a mean of zero and standard deviation $sigma$. Under that assumption the followingconfidence limits can be derived.:68% confidence limits, $hat\; eta\; pm\; sigma$:95% confidence limits, $hat\; eta\; pm\; 2sigma$:99% confidence limits, $hat\; eta\; pm\; 2.5sigma$The assumption is not unreasonable when "m>>n". If the experimental errors are normally distributed the parameters will belong to aStudent's t-distribution with "m-n" degrees of freedom. When "m>>n" Student's t-distribution approximates to a Normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject tosampling error . [*J. Mandel, The Statistical Analysis of Experimental Data, Interscience, 1964*]When the number of observations is relatively small,

Chebychev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2 or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.**Residual values and correlation**The residuals are related to the observations by:$mathbf\{hat\; r=y-X\; hat\; eta=y-X\; left(X^TWX\; ight)^\{-1\}X^T\; y\}$The symmetric,

idempotent matrix $mathbf\{X\; left(X^TWX\; ight)^\{-1\}X^T\}$ is known in the statistics literature as thehat matrix , $mathbf\{H\}$. Thus,:$mathbf\{hat\; r=left(I-H\; ight)\; y\}$where**I**is anidentity matrix . The variance-covariance matrice of the residuals,**M**is given by:$mathbf\{M^r=left(I-H\; ight)\; M\; left(I-H\; ight)\}.$This shows that even though the observations may be uncorrelated, the residuals are^{r}always correlated.The sum of residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by $mathbf\{X^T\}$.:$mathbf\{X^That\; r=X^Ty-X^Toldsymbolhateta=X^Ty-(X^TX)(X^TX)^\{-1\}X^Ty=0\}$Say, for example, that the first term of the model is a constant, so that $X\_\{i1\}=1$ for all "i". In that case it follows that:$sum\_i^m\; X\_\{i1\}\; hat\; r\_i=sum\_i^m\; hat\; r\_i=0.$Thus, in the motivational example, above, the fact that the sum of residual values is equal to zero it is not accidental but is a consequence of the presence of the constant term, α, in the model.

If experimental error follows a

normal distribution , then, because of the linear relationship between residuals and observations, so should residuals, [*K.V. Mardia, J.T. Kent and J.M. Bibby, Multivariate analysis, Academic Press, 1979*] but since the observations are only a sample of the population of all possible observations, the residuals should belong to aStudent's t-distribution .Studentized residual s are useful in making a statistical test for anoutlier when a particular residual appears to be excessively large.**Objective function**The objective function can be written as:$S=mathbf\{\; y^T(I-H)^T(I-H)y=y^T(I-H)y\}$since $mathbf\{\; (I-H)\}$ is also symmetric and idempotent. It can be shown from this, [

*W. C. Hamilton, Statistics in Physical Science, The Ronald Press, New York, 1964*] that theexpected value of "S" is "m-n". Note, however, that this is true only if the weights have been assigned correctly. If unit weights are assumed, the expected value of "S" is $(m-n)sigma^2$, where $sigma^2$ is the variance of an observation.If it is assumed that the residuals belong to a Normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a Chi-square ($chi\; ^2$) distribution with "m-n" degrees of freedom. Some illustrative percentile values of $chi\; ^2$ are given in the following table. [

*M.R. Spiegel, Probability and Statistics, Schaum's Outline Series, McGraw-Hill 1982*] :These values can be used for a statistical criterion as to thegoodness-of-fit . When unit weights are used, the numbers should be divided by the variance of an observation.**Typical uses and applications*** Polynomial fitting: models are

polynomial s in an independent variable, "x":

** Straight line: $f(x,\; oldsymbol\; eta)=alpha\; +eta\; x$. [*F.S. Acton, Analysis of Straight-Line Data, Wiley, 1959*]

** Quadratic: $f(x,\; oldsymbol\; eta)=alpha\; +\; eta\; x\; +gamma\; x^2$.

** Cubic, quartic and higher polynomials. For high-order polynomials the use oforthogonal polynomials is recommended. [*P.G. Guest, Numerical Methods of Curve Fitting, Cambridge University Press, 1961.*]

*Numerical smoothing and differentiation — this is an application of polynomial fitting.

*Multinomials in more than one independent variable, including surface fitting

*Curve fitting withB-spline s

*Chemometrics ,Calibration curve ,Standard addition ,Gran plot , analysis of mixtures**Notes****References***Cite book | author=Björck, Åke | authorlink= | coauthors= | title=Numerical methods for least squares problems | date=1996 | publisher=SIAM | location=Philadelphia | isbn=0-89871-360-9 | pages=

*Cite book | author=Bevington, Philip R | coauthors=Robinson, Keith D | title=Data Reduction and Error Analysis for the Physical Sciences | date=2003 | publisher=McGraw Hill | location= | isbn=0072472278 | pages=**External links*** [

*http://mathworld.wolfram.com/LeastSquaresFitting.html Least Squares Fitting – From MathWorld*]

* [*http://mathworld.wolfram.com/LeastSquaresFittingPolynomial.html Least Squares Fitting-Polynomial – From MathWorld*]

*Wikimedia Foundation.
2010.*