A Tutorial on Least Mean Square (LMS) Algorithm

I was once assigned a problem: there is a black box that maintains a float value; it has several interfaces to access that value

  1. estimate(): return the estimated current value with noise,
  2. update(v): add delta \(v\) to the current value.

And the goal is to set the black box to the target value. The straightforward solution is to use LMS algorithm (here for some mathematical explanation) to approach the target value. The procedure may look like

1
2
3
4
5
6
T: target
mu: step size
while not converged:
    v = estimate()
    correction = mu*(T - v)
    update(correction)

You can play with it to see how it works

If this is the whole story, we will be happy. However, there is an additional constraint about the execution time of the above steps

  1. estimate(): take 1 unit of time (e.g., 1s),
  2. update(v): also take 1 unit of time.

Compared to the above two steps, the time to calculate the correction (one addition and one multiplication) can be ignored.

For the above procedure, each iteration (estimate & update) will take 2 units of time. It is basically too slow for the application (we can increase \(\mu\), but the variance of the black box when converged will be large; or we may fine tune \(\mu\) for the system, e.g.,decrease the \(\mu\) when it approaches the target.). Fortunately, the system also provides another interface

  1. estimateThenUpdate(v): return the estimated value with noise, then update the black box with delta \(v\). It also take 1 unit of time.

By following the same procedure as above, the loop with the new interface can be written as

1
2
3
4
5
6
T: target
mu: step size
correction = 0
while not converged:
    v = estimateThenUpdate(correction)
    correction = mu*(T - v)

The good news is that we can achieve both estimate and update in 1 unit of time. That is, compared to the original one, iterations in a fixed time period is doubled. However, some people quickly pointed out that to make it equivalent to the original design, we actually need updateThenEstimate, instead of estimateThenUpdate. That's a valid concern; what the new design does (with original two steps) is

1
2
3
4
5
6
7
T: target
mu: step size
correction = 0
while not converged:
    v = estimate() # v(n-1)
    update(correction) # v(n) = v(n-1) + correction
    correction = mu*(T - v) # correction to get v(n+1) is based on v(n-1)

In other words, we are not calculating the correction based on the estimate of the current value, but the estimate of the value before applying the last correction (outdated). So there is a delay between estimate and update. You can play with it in the following demo. For example, \(\text{delay}=1\) means there is 1 update after estimate used to calculate the next correction.

Looks like the system still converges (e.g., for some \(\mu\)). Does it mean the delay doesn't impact the system at all? You may have already noticed that the one with delay 10 may have much larger overshoot than the one without delay (original design).

Let's look at these two designs more closely. Without delay, for \(n\)th iteration, first we estimate the value

$$ \hat{v}_0(n) = v_0(n) + n(n), $$

where \(v_0(n)\) is the signal in black box, and \(n(n)\) is the noise (awgn). Then calculate the distance to the target

$$ e_0(n) = T - \hat{v}_0(n). $$

Finally, the error \(e_0(n)\) is applied to the value in update step

$$ v_0(n+1) = v_0(n) + \mu*e_0(n). $$

Combining the above steps, we have

$$ \begin{align} v_0(n+1) &= v_0(n) + \mu (T- (v_0(n) + n(n))) \nonumber \\ &= (1-\mu)v_0(n) + \mu(T - n(n)). \label{eqn:v0} \end{align} $$

When \(v_0(n)\) converges,

$$ \text{var}(v_0(n+1))= (1-\mu)^2 \text{var}(v_0(n)) + \mu^2 \text{var}(n(n)), $$

and as \(\text{var}(v_0(n+1)) = \text{var}(v_0(n))\), so

$$ \text{var}(v_0)= \frac{\mu}{2-\mu} \text{var}(n) $$

So smaller the \(\mu\), smaller the variance of the signal in black box. But can we set \(\mu\) to be negative value? It doesn't make any sense. The above conclusion is only true if the value in black box converges. It is easy to see that Eq. (\ref{eqn:v0}) is a 1st order difference equation, where \(T-n(n)\) is the input, and \(v_0(n+1)\) is the output. Its \(z\) transform can be written as

$$ H_0(z) = \frac{\mu}{1-(1-\mu)z^{-1}}, $$

To make it stable, the pole of \(z\) transform (\(z_p\)) shall be within the unit circle, that is

$$ -1< z_p = 1-\mu < 1, $$

or

$$ \begin{align} 0< \mu <2. \label{eq:v0_mu} \end{align} $$

Another way to see this is to rewrite the equation Eq. (\ref{eqn:v0}) as,

$$ \begin{align} v_0(n+1) &= (1-\mu)v_0(n) + \mu(T - n(n)), \longrightarrow \nonumber \\ v_n(n+1) - T &= (1-\mu)(v_0(n)-T) - \mu n(n). \end{align} $$

Take the expectation of both side, we have

$$ \begin{align} E(v_n(n+1) - T) &= (1-\mu)E((v_0(n)-T)) \nonumber \\ & = (1-\mu)^{n+1}E(v_0(0)-T). \label{eqn:v0_error} \end{align} $$

So to make the error magnitude to decrease with iteration, \(|1-\mu|<1\); that is \(0<\mu<2\). You can play with the original design to see the effect of various \(\mu\). When system converges, use large \(\mu\) may not be a good idea as it will increase the variance of the signal in black box; and similarly, very small \(\mu\) before system converges may also not be a good idea as it will take longer to converge.

For the case with 1 unit delay

$$ \begin{align} v_1(n+1) &= v_1(n) + \mu (T- (v_1(n-1) + n(n))) \nonumber \\ &= v_1(n)-\mu v_1(n-1) + \mu(T - n(n)). \label{eqn:v1} \end{align} $$

As shown in the above equation, the estimate (and the correction) is based on \(v_1(n-1)\), instead of \(v_1(n)\) as in the case without delay; then the correction is applied to the current value \(v_1(n)\) to get the new value \(v_1(n+1)\). The \(z\) transform of this system is

$$ H_1(z) = \frac{\mu}{1-z^{-1} + \mu z^{-2}} $$

Similarly, to make such linear system stable, the poles of the \(z\) transform shall be within the unit circle, that is

$$ \left|\frac{1 \pm \sqrt{1-4\mu}}{2}\right| < 1, $$

or

$$ 0 < \mu < 1. $$

So the range of \(\mu\) is half of the system without delay (Eq. (\ref{eq:v0_mu})). The variance of the signal with 1 delay can be written as (check here for details).

$$ \text{var}(v_1) = \frac{1+\mu}{(1-\mu)[(1+\mu)^2-1]}\mu^2 \text{var}(n) $$

As shown in the following plot, for same \(\mu\), the variance of the signal with 1 delay is larger, especially when \(\mu\) is large. However, when \(\mu\) is small (e.g., \(\mu<0.1\)), the difference between these two is very small.

The next question is how many iteration do we need? Or will the system with 1 delay actually help to make the system converge better (i.e., within the fixed amount of time)? If \(\mu\) is fixed, for the system without delay, we can easily use Eq. (\ref{eqn:v0_error}) to calculate the minimal number of iteration needed, so that \(E(v_0(n) - T)\) is smaller than some value (e.g., from the system spec). For example, if \(\mu=0.2\), \(v_0(0) = 0\), and the spec is to make the bias smaller than \(1\%T\), then it will need \(\log(0.01)/\log(1-0.2) \approx 21\) iterations.

For the case with delay, it is much more complicated to do the similar analysis. One way to simplify it is to view Eq. (\ref{eqn:v0}) and Eq. (\ref{eqn:v1}) as an linear system (e.g., IIR filter) to filter a noisy DC signal (target). So the group delay will give us good idea about the delay between input (target) and output (signal in black box). The general group delay is complicated and frequency dependent. Fortunately, in this application, the input signal (i.e., target) is constant, that is, its frequency is 0; the group delay at frequency 0 can be written as (see here for detail)

$$ \begin{align} \tau_0(0) &= \frac{1-u}{u},\nonumber \\ \tau_1(0) &= \frac{1-2u}{u}. \end{align} $$

As shown in the following plot, the group delay for \(v_0\) and \(v_1\) is very close to each other. For example, when \(\mu=0.1\), the delay between input (target) and output is roughly 10 for both systems, which means that the input will appear at the output after roughly 10 iterations. In our application, as \(v_1\) is two times faster than \(v_0\), we can expect that \(v_1\) will converge better for the fixed amount of time. For example, you can try the demo to compare the results between two systems (e.g., 40 iterations with 1 delay vs 20 iterations without delay).