about smoothing data

A boring post.

Suppose some system can be (partly) described by x(t), e.g., the position of the wrist of a person over time.   When measuring position x(t), one obtains data \overline x(t) = x(t) + \varepsilon(t).  This leads to catastrophes when a derivative must be determined.  A standard way to compute the derivative is given by

    \[x'(t_i) = {x(t_i) - x(t_{i-1}) \over t_i - t_{i-1}}\]

which delivers the actual derivative for \lim  t_i - t_{i-1} \rightarrow 0\dots by definition.

But when the signal has noise \epsilon, the situation is radically different.  If we use the same, numerical, approach to find derivatives, we find that 

    \[\overline x'(t_i) = {(x(t_i)+\epsilon(t_i)) - (x(t_{i-1})+\epsilon(t_{i-1}) \over t_i - t_{i-1}} = x'(t_i) + {\epsilon(t_i) - \epsilon(t_{i-1}) \over t_i - t_{i-1}}.\]

 Now, since we assume that \epsilon is normally distributed noise, it can be as positive as negative, and when \epsilon ranges from \epsilon_\mathrm{min}<0 to \epsilon_\mathrm{max}>0, then the extra term \epsilon(t_i) - \epsilon(t_{i-1}) has twice that range.

So, differentiation of a noisy signal doubles the noise!

The filter

Consider what happens when differentiating such a signal twice, to get the acceleration: double double means four times as much noise.  Clearly, this is not the way to go.

A student, who is (made) aware of this fact, will then go on to smooth or filter the original signal.  A good idea.  Terminology here is discipline-dependent and often leads to confusion.  An engineer, who has enjoyed a control theory class, will try to build a low-pass filter, which is an input-output system passing low signal frequencies, while blocking higher frequencies.  The engineer will look at the data \vec x as a summation of frequencies, thus describing x without its index t.  A computer scientist, however, will rather look at a sequence of single measurements x(t), and thus does not directly consider frequencies as a mode of describing the signal.

The engineer’s view has a strong advantage: by looking at the complete data \vec x, one can analyse which “band” of frequencies, occurring in the signal, are desired, and which can be considered noise.  The correct choice of filter (typically, your engineer will like some version of a Butterworth filter, which is the closest you can get to an ideal filter with amplification 1 at desired frequencies and 0 elsewhere;  Mr. Butterworth was a perfectionist) will take out those (high) frequencies related to noise, and leave the rest in tact.

So far for the theory.  In practice, there is a small snatch with the above: when running a process (let’s say, controlling a robot in real time) the data are not available in a whole, but one data point comes in after the other.  So, indeed I get data points \overline x(t), and I do not want to wait to gather enough data, in order to apply my filter.

delay-free filtering

That is, in fact, the crux: can smooth out a signal delay-free?  Yes, and it is relatively easy.  The idea is to make a model of the data, and use that to predict the next step.  The following approach is very applicable to robotic movement, as movement can be well described using polynomials of low order (in fact, up to 5 for fast movements [why 5?  What is fast?  I will cover that elsewhere.])

A standard approach to smoothing out data is a Moving-Average (MA) filter.  The basic principle is simple: the smoothed value y is set to y(t)={1\over n}\sum_{i=-n/2}^{n/2-1} \overline x(t+i).  It works, but has a disadvantage: it lags.  Even if we only take data from the past, i.e.,  y(t)={1\over n}\sum_{i=-n+1}^{0} \overline x(t+i).

What does this do?  In fact, it fits the data \overline x(t) to a straight line, and—in the second case above, where only past points are taken—extrapolates from that.  A linear model is good, if the underlying data x is linear, and the noise \overline x - x isn’t (with respect to n, the number of points we smooth over).  But otherwise?  It will fail.

The bad news: for the kind of (movement) data we are talking about, the linear assumption does not hold. We need to find a better model. An elegant method was described in 1957, by which orthogonal polynomials fit a set of data until an error level is obtained which is around the variance of the data .  Orthogonal polynomials means that, a higher order can be added without having to recompute the previous parameters.  The solution that is found, in a predictive form, is

    \[y(t) = \sum_{i=0}^{n} \overline a_i t^{i};\]

a_i are parameters found from optimisation.  A very attractive method, obtaining the same, was published in 1964 ; in this paper, it is shown that the parameters C of the equation

    \[y(t) = \sum C_i \overline x_i\]

can be precomputed, independent of the data x, leading to a very efficient implementation.

The standard implementation of this well-known Savitzky-Golay (“SG”) filter introduces a delay.  After all, it predicts the value of the data at the centre point, using data points -n\dots n.  But with some tricks, it can be made predictive, too.  How?  A first, not quite correct, approach is to, well, pretend you’re half blind: match the polynomial only on the first half of the data, and predict the “middle” from there.  A method which, dumb as it is, gives acceptable results.

The right way is to recompute the SG parameters—the C_i above—to predict the last rather than the middle data point for the polynomial.  In fact, an elegant solution was published in .  It computes the parameters differently and more efficiently, and that we can use to make a “predictive” filter.  Here’s an implementation of a piece of code doing the promised work: psgolayp.  (My sincere apologies that it is written in Matlab.  I don’t like Matlab all that much, but for historical reasons it came out this way.  I promise it may be the last piece of Matlab posted here.)

So now never come to me again with a moving average filter.