# Least mean squares filter

Least mean squares filter

Least mean squares (LMS) algorithms are a class of adaptive filter used to mimic a desired filter by finding the filter coefficients that relate to producing the least mean squares of the error signal (difference between the desired and the actual signal). It is a stochastic gradient descent method in that the filter is only adapted based on the error at the current time. It was invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff.

## Problem formulation

Relationship to the least mean squares filter


The realization of the causal Wiener filter looks a lot like the solution to the least squares estimate, except in the signal processing domain. The least squares solution, for input matrix \scriptstyle\mathbf{X} and output vector \scriptstyle\mathbf{y} is

   \boldsymbol{\hat\beta} = (\mathbf{X} ^\mathbf{T}\mathbf{X})^{-1}\mathbf{X}^{\mathbf{T}}\boldsymbol y .


The FIR Wiener filter is related to the least mean squares filter, but minimizing its error criterion does not rely on cross-correlations or auto-correlations. Its solution converges to the Wiener filter solution. Most linear adaptive filtering problems can be formulated using the block diagram above. That is, an unknown system $\mathbf{h}(n)$ is to be identified and the adaptive filter attempts to adapt the filter $\hat{\mathbf{h}}(n)$ to make it as close as possible to $\mathbf{h}(n)$, while using only observable signals x(n), d(n) and e(n); but y(n), v(n) and h(n) are not directly observable. Its solution is closely related to the Wiener filter.

### definition of symbols

$\mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T$
$\mathbf{h}(n) = \left[h_0(n), h_1(n), \dots, h_{p-1}(n)\right]^T,\quad \mathbf{h}(n) \in \mathbb{C}^p$
$y(n) = \mathbf{h}^H(n) \cdot \mathbf{x}(n)$
d(n) = y(n) + ν(n)
$e(n) = d(n) - \hat{y}(n) = d(n) - \hat{\mathbf{h}}^H(n) \cdot \mathbf{x}(n)$

## Idea

The idea behind LMS filters is to use steepest descent to find filter weights $\mathbf{h}(n)$ which minimize a cost function. We start by defining the cost function as

$C(n) = E\left\{|e(n)|^{2}\right\}$

where e(n) is the error at the current sample 'n' and E{.} denotes the expected value.

This cost function (C(n)) is the mean square error, and it is minimized by the LMS. This is where the LMS gets its name. Applying steepest descent means to take the partial derivatives with respect to the individual entries of the filter coefficient (weight) vector

$\nabla_{\hat{\mathbf{h}}^H} C(n) = \nabla_{\hat{\mathbf{h}}^H} E\left\{e(n) \, e^{*}(n)\right\}=2E\left\{\nabla_{\hat{\mathbf{h}}^H} ( e(n)) \, e^{*}(n) \right\}$

where $\nabla$ is the gradient operator.

$\nabla_{\hat{\mathbf{h}}^H} (e(n))= \nabla_{\hat{\mathbf{h}}^H} \left(d(n) - \hat{\mathbf{h}}^H \cdot \mathbf{x}(n)\right)=-\mathbf{x}(n)$
$\nabla C(n) = -2E\left\{\mathbf{x}(n) \, e^{*}(n)\right\}$

Now, $\nabla C(n)$ is a vector which points towards the steepest ascent of the cost function. To find the minimum of the cost function we need to take a step in the opposite direction of $\nabla C(n)$. To express that in mathematical terms

$\hat{\mathbf{h}}(n+1)=\hat{\mathbf{h}}(n)-\frac{\mu}{2} \nabla C(n)=\hat{\mathbf{h}}(n)+\mu \, E\left\{\mathbf{x}(n) \, e^{*}(n)\right\}$

where $\frac{\mu}{2}$ is the step size(adaptation constant). That means we have found a sequential update algorithm which minimizes the cost function. Unfortunately, this algorithm is not realizable until we know $E\left\{\mathbf{x}(n) \, e^{*}(n)\right\}$.

Generally, the expectation above is not computed. Instead, to run the LMS in an online (updating after each new sample is received) environment, we use an instantaneous estimate of that expectation. See below.

## Simplifications

For most systems the expectation function ${E}\left\{\mathbf{x}(n) \, e^{*}(n)\right\}$ must be approximated. This can be done with the following unbiased estimator

$\hat{E}\left\{\mathbf{x}(n) \, e^{*}(n)\right\}=\frac{1}{N}\sum_{i=0}^{N-1}\mathbf{x}(n-i) \, e^{*}(n-i)$

where N indicates the number of samples we use for that estimate. The simplest case is N = 1

$\hat{E}\left\{\mathbf{x}(n) \, e^{*}(n)\right\}=\mathbf{x}(n) \, e^{*}(n)$

For that simple case the update algorithm follows as

$\hat{\mathbf{h}}(n+1)=\hat{\mathbf{h}}(n)+\mu \mathbf{x}(n) \, e^{*}(n)$

Indeed this constitutes the update algorithm for the LMS filter.

## LMS algorithm summary

The LMS algorithm for a pth order algorithm can be summarized as

 Parameters: p = filter order μ = step size Initialisation: $\hat{\mathbf{h}}(0)=0$ Computation: For n = 0,1,2,... $\mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T$ $e(n) = d(n)-\hat{\mathbf{h}}^{H}(n)\mathbf{x}(n)$ $\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n)+\mu\,e^{*}(n)\mathbf{x}(n)$

where $\hat{\mathbf{h}}^{H}(n)$ denotes the Hermitian transpose of $\hat{\mathbf{h}}(n)$.

## Convergence and stability in the mean

Assume that the true filter $\mathbf{h}(n)=\mathbf{h}$ is constant, and that the input signal x(n) is wide-sense stationary. Then $E\{\hat{\mathbf{h}}(n)\}$ converges to $\mathbf{h}$ as $n\rightarrow \infty$ if and only if

$0<\mu<\frac{2}{\lambda_{\mathrm{max}}},$

where λmax is the greatest eigenvalue of the autocorrelation matrix $R=E\{\mathbf{x}(n)\mathbf{x}^H(n)\}$. If this condition is not fulfilled, the algorithm becomes unstable and $\hat{\mathbf{h}}(n)$ diverges.

Maximum convergence speed is achieved when

$\mu=\frac{2}{\lambda_{\mathrm{max}}+\lambda_{\mathrm{min}}},$

where λmin is the smallest eigenvalue of R. Given that μ is less than or equal to this optimum, the convergence speed is determined by μλmin, with a larger value yielding faster convergence. This means that faster convergence can be achieved when λmax is close to λmin, that is, the maximum achievable convergence speed depends on the eigenvalue spread of R.

A white noise signal has autocorrelation matrix R = σ2I, where σ2 is the variance of the signal. In this case all eigenvalues are equal, and the eigenvalue spread is the minimum over all possible matrices. The common interpretation of this result is therefore that the LMS converges quickly for white input signals, and slowly for colored input signals, such as processes with low-pass or high-pass characteristics.

It is important to note that the above upperbound on μ only enforces stability in the mean, but the coefficients of $\hat{\mathbf{h}}(n)$ can still grow infinitely large, i.e. divergence of the coefficients is still possible. A more practical bound is

$0<\mu<\frac{2}{tr\left[R\right]},$

where $tr\left[R\right]$ denotes the trace of R. This bound guarantees that the coefficients of $\hat{\mathbf{h}}(n)$ do not diverge (in practice, the value of μ should not be chosen close to this upper bound, since it is somewhat optimistic due to approximations and assumptions made in the derivation of the bound).

## Normalised least mean squares filter (NLMS)

The main drawback of the "pure" LMS algorithm is that it is sensitive to the scaling of its input x(n). This makes it very hard (if not impossible) to choose a learning rate μ that guarantees stability of the algorithm (Haykin 2002). The Normalised least mean squares filter (NLMS) is a variant of the LMS algorithm that solves this problem by normalising with the power of the input. The NLMS algorithm can be summarised as:

 Parameters: p = filter order μ = step size Initialization: $\hat{\mathbf{h}}(0)=0$ Computation: For n = 0,1,2,... $\mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T$ $e(n) = d(n)-\hat{\mathbf{h}}^{H}(n)\mathbf{x}(n)$ $\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n)+\frac{\mu\,e^{*}(n)\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)}$

### Optimal learning rate

It can be shown that if there is no interference (v(n) = 0), then the optimal learning rate for the NLMS algorithm is

μopt = 1

and is independent of the input x(n) and the real (unknown) impulse response $\mathbf{h}(n)$. In the general case with interference ($v(n) \ne 0$), the optimal learning rate is

$\mu_{opt}=\frac{E\left[\left|y(n)-\hat{y}(n)\right|^2\right]}{E\left[|e(n)|^2\right]}$

The results above assume that the signals v(n) and x(n) are uncorrelated to each other, which is generally the case in practice.

### Proof

Let the filter misalignment be defined as $\Lambda(n) = \left| \mathbf{h}(n) - \hat{\mathbf{h}}(n) \right|^2$, we can derive the expected misalignment for the next sample as:

$E\left[ \Lambda(n+1) \right] = E\left[ \left| \hat{\mathbf{h}}(n) + \frac{\mu\,e^{*}(n)\mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} - \mathbf{h}(n) \right|^2 \right]$
$E\left[ \Lambda(n+1) \right] = E\left[ \left| \hat{\mathbf{h}}(n) + \frac{\mu\, \left( v^*(n)+y^*(n)-\hat{y}^*(n) \right) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} - \mathbf{h}(n) \right|^2 \right]$

Let $\mathbf{\delta}=\hat{\mathbf{h}}(n)-\mathbf{h}(n)$ and $r(n) = \hat{y}(n)-y(n)$

$E\left[ \Lambda(n+1) \right] = E\left[ \left| \mathbf{\delta}(n) - \frac{\mu\, \left( v(n)+r(n) \right) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} \right|^2 \right]$
$E\left[ \Lambda(n+1) \right] = E\left[ \left( \mathbf{\delta}(n) - \frac{\mu\, \left( v(n)+r(n) \right) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} \right)^H \left( \mathbf{\delta}(n) - \frac{\mu\, \left( v(n)+r(n) \right) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} \right) \right]$

Assuming independence, we have:

$E\left[ \Lambda(n+1) \right] = \Lambda(n) + E\left[ \left( \frac{\mu\, \left( v(n)-r(n) \right) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} \right)^H \left( \frac{\mu\, \left( v(n)-r(n) \right) \mathbf{x}(n)}{\mathbf{x}^H(n)\mathbf{x}(n)} \right) \right] - 2 E\left[\frac{\mu|r(n)|^2}{\mathbf{x}^H(n)\mathbf{x}(n)}\right]$
$E\left[ \Lambda(n+1) \right] = \Lambda(n) + \frac{\mu^2 E\left[|e(n)|^2\right]}{\mathbf{x}^H(n)\mathbf{x}(n)} - \frac{2 \mu E\left[|r(n)|^2\right]}{\mathbf{x}^H(n)\mathbf{x}(n)}$

The optimal learning rate is found at $\frac{dE\left[ \Lambda(n+1) \right]}{d\mu} = 0$, which leads to:

$2 \mu E\left[|e(n)|^2\right] - 2 E\left[|r(n)|^2\right] = 0$
$\mu = \frac{E\left[|r(n)|^2\right]}{E\left[|e(n)|^2\right]}$

## References

• Monson H. Hayes: Statistical Digital Signal Processing and Modeling, Wiley, 1996, ISBN 0-471-59431-8
• Simon Haykin: Adaptive Filter Theory, Prentice Hall, 2002, ISBN 0-13-048434-2
• Simon S. Haykin, Bernard Widrow (Editor): Least-Mean-Square Adaptive Filters, Wiley, 2003, ISBN 0-471-21570-8
• Bernard Widrow, Samuel D. Stearns: Adaptive Signal Processing, Prentice Hall, 1985, ISBN 0-13-004029-0
• Weifeng Liu, Jose Principe and Simon Haykin: Kernel Adaptive Filtering: A Comprehensive Introduction, John Wiley, 2010, ISBN 0470447532
• Paulo S.R. Diniz: Adaptive Filtering: Algorithms and Practical Implementation, Kluwer Academic Publishers, 1997, ISBN 0-7923-9912-9

Wikimedia Foundation. 2010.

### Look at other dictionaries:

• Least Mean Squares — Der LMS Algorithmus (Least Mean Squares Algorithmus) ist ein Algorithmus zur Approximation der Lösung des Least Mean Squares Problems, das z. B. in der digitalen Signalverarbeitung vorkommt. In der Neuroinformatik ist der Algorithmus vor allem… …   Deutsch Wikipedia

• Recursive least squares filter — Recursive least squares (RLS) algorithm is used in adaptive filters to find the filter coefficients that relate to recursively producing the least squares (minimum of the sum of the absolute squared) of the error signal (difference between the… …   Wikipedia

• Adaptive filter — An adaptive filter is a filter that self adjusts its transfer function according to an optimizing algorithm. Because of the complexity of the optimizing algorithms, most adaptive filters are digital filters that perform digital signal processing… …   Wikipedia

• Wiener filter — In signal processing, the Wiener filter is a filter proposed by Norbert Wiener during the 1940s and published in 1949.ref|Wiener1949 Its purpose is to reduce the amount of noise present in a signal by comparison with an estimation of the desired… …   Wikipedia

• Multidelay block frequency domain adaptive filter — The Multidelay block frequency domain adaptive filter (MDF) algorithm is a block based frequency domain implementation of the (normalised) Least mean squares filter (LMS) algorithm. Contents 1 Introduction 2 Variable definitions 3 Algorithm… …   Wikipedia

• LMS filter — noun least mean squares filter …   Wiktionary

• Kalman filter — Roles of the variables in the Kalman filter. (Larger image here) In statistics, the Kalman filter is a mathematical method named after Rudolf E. Kálmán. Its purpose is to use measurements observed over time, containing noise (random variations)… …   Wikipedia

• Ensemble Kalman filter — The ensemble Kalman filter (EnKF) is a recursive filter suitable for problems with a large number of variables, such as discretizations of partial differential equations in geophysical models. The EnKF originated as a version of the Kalman filter …   Wikipedia

• Similarities between Wiener and LMS — The Least mean squares filter solution converges to the Wiener filter solution, assuming that the unknown system is LTI and the noise is stationary. Both filters can be used to identify the impulse response of an unknown system, knowing only the… …   Wikipedia

• Marcian Hoff — Marcian Edward Ted Hoff, Jr. (born October 28, 1937 in Rochester, New York), is one of the inventors of the microprocessor.[1][2] Hoff joined Intel in 1967 as employee number 12, and is credited with coming up with the idea of using a universal… …   Wikipedia