A tutorial on time series with neural network

1 Introduction

I was once asked to use the accelerometer and gyroscope data (i.e., from a sensor attached to a shoe) to detect the events that whether the shoe is taken off or put on. In other words, you are given a time series (e.g., 50 samples per second),

$$ \begin{align*} v_x[0], v_y[0], v_z[0], &a_x[0], a_y[0], a_z[0]\\ v_x[1], v_y[1], v_z[1], &a_x[1], a_y[1], a_z[1]\\ v_x[2], v_y[2], v_z[2], &a_x[2], a_y[2], a_z[2]\\ &\vdots \end{align*} $$

where \(v_x\), \(v_y\), \(v_z\) are speed for axis \(x\), \(y\) and \(z\) respectively, and \(a_x\), \(a_y\), \(a_z\) are the angle speed for each axis. The problem is to detect the human event (take off/put on shoes) from this data.

Before diving into building models and starting training with the data, I think the first question we shall ask is whether the information we get can solve the problem (maybe this is the second question; the first one should be why we want to detect the event?); that is, does it have information for the problem.

If you torture the data long enough, it will confess.
Ronald H. Coase, Essays on Economics and Economists

My first intuition is that accelerator and gyroscope are not the right data to solve this problem. Someone may argue that there are many neural networks available to solve much more difficult problems very well, e.g., recognize the cat in the following image.


But why? Why do we believe it will work? One reason is that the image (the specific way that RGB values are combined together) obviously has the information about a cat, since our brain can easily recognize it, although it may not be obvious how our brain can build a mapping between these RGB values and its label (e.g., cat). In this case, we may be able to "mimic" the way how brain may work to recognize it.

How about the aforementioned "shoe event" case? What will be the right intuition behind it? Does the event fundamental different from others (e.g., lift a shoe without put it on, walk with a shoe in your bag/car etc.)? In my opinion, trying to solve the problem without understanding is dangerous. Of cause, we can always "train" a system to let it outputs what we "expected". But, eventually we need to deal with all kinds of weird corner cases.

So, what will be the better signal for this problem? One potential choice is capacitive sensor. For example, just like how a capacitive touch sensor works, we can put a capacitive sensor inside the shoe. As human foot is conductive, When it is inside a shoe, it will change the reading from the capacitive sensor (either increase or decrease, depend on the senor design). Thus we can detect the event by identifying the signal change. Such design also has several challenges. For example, the capacitive sensor signal may be significantly reduced when the user wears a thick sock (just like the bad experience you may encounter when touch your phone with a glove). Similarly, when the shoe is wet, the capacitive sensor may stop working properly since water is also conductive.

Another potential choice is to use a force sensor to measure the pressure. For example, imaging a pressure sensor is installed near the heel. When a shoe is on, the weight of your body/foot will press against the sensor, and thus generate a "higher" reading. On the other hand, when the shoe is off, the reading will be much smaller. Thus, it is not difficult to see that we can easily detect the event without complex algorithm (be careful to select a threshold properly to minimize false triggering, e.g., when your cat stays inside your shoe.).

However, in some case, we do have to detect the event when a better signal is not available. For example, use the current and previous blood glucose level to predict the level in next 30 mins. Or use accelerator and gyroscope data from a smart watch to detect a event that the user raises hand (e.g., to check the time). Or for the above "shoe event" case, there is no sensor on your shoe, and only data is from the accelerometer in your smart watch.

2 Preprocessing and features

To build the model, the first step is to pre-processing the data so that it has the following structure

$$ \begin{align} \{\textbf{x}, y\}, \end{align} $$

where \(\textbf{x}\) is the input (features), which is usually a vector or matrix, \(y\) is the expected output or label (e.g., 0 or 1).

For example, this dataset records the accelerometer data when user was doing various tasks (e.g., walk). Let's try a simpler problem: differentiate walk event from climb chair event. First, download the data

# download data file if necessary
if not os.path.isdir(folder):
    dataurl = "https://github.com/wchill/HMP_Dataset/archive/master.zip"
    filename, header = urllib.request.urlretrieve(dataurl, "HMP_Dataset.zip")
    with  zipfile.ZipFile(filename, 'r') as zf:

Fig. (2) and (3) show a sample accelerometer data for walk and climb chair event. Here are some observations

  1. Both data shows some kind of periodicity. It makes sense since each data repeats same events multiple times.
  2. Climb chair data looks different from walk. For example, for climb chair event, \(y\) and \(z\) data look more periodic than walk. It may make sense since the user will move up and down to finish each event. It gives us some confidence/intuition that some classification algorithm may work to differentiate them.
  3. Roughly the first 2 sec data (i.e., 64 samples) looks very different from the remaining. I guess during that time the user was preparing to start the event. So we need to ignore these data for training and test. For simplicity, we ignore first 2 sec data for the whole dataset; however it may not be a good use of the data since the preparing period for some data may be longer (e.g., 3 sec) or shorter. We really should clean each data separately. Same for the last 2 sec data, we cut the last 2 sec data for the whole dataset for simplicity.
Fig.2. Accelerometer reading for walk event
Fig.3. Accelerometer reading for climb chair event

Besides the raw senor data, we may also use some statistics and aggregator. For example, in this paper, \(\textbf{x}\) is a \(31\times 50\) matrix. That is, 31 features are generated for each frame (e.g., mean/std over the last 10/20/50 samples separately (e.g., \(mean(v_x[i-9:i])\)), differential signal between current sample and the samples 20/40/50 steps before (e.g., \(v_x[i]- v_x[i-20]\)), and the absolute values). Then current frame is combined with the features from the last 49 frames (2D image with dimension \(31\times 50\)) is feed into a 1D CNN (along time axis, not very useful to learn across feature axis in this case).

Some useful statistics and features

  • mean, std, max, min, median
  • signal magnitude area, e.g. \(\sum_{k=i-N+1}^{i}{v_x[k]}\)
  • energy, e.g. \(\frac{1}{N}\sum_{k=i-N+1}^{i}{v_x[k]^2}\)
  • logarithm
  • FFT. In some case, FFT is useful. For example for the above \(31\times 50\) input, the 'actual' event can happen at any samples, FFT may be able to mitigate such uncertainty. As shown in Fig. (4), data 1 and data 2 are two samples from same data source (a sine wave). Their time domain representations looks different (time shift); however, their frequency domain representation (magnitude) are same (right plot), which may help neural network to learn fast.
Fig.4. Accelerometer reading for climb chair event
  • normalization: As usual, it may be useful to normalize each feature with the corresponding training data, e.g., zero mean, unit variance, i.e.,

    $$ \begin{align} f_{train} &= (f_{train} - \mu_{f_{train}})/\sigma_{f_{train}},\nonumber \\ f_{test} &= (f_{test} - \mu_{f_{train}})/\sigma_{f_{train}}, \end{align} $$

    where \(f_{train}\) is a feature from training samples, \(f_{test}\) is the same feature from test samples, and

    $$ \begin{align} \mu_{f_{train}} = \frac{1}{N}\sum_{i=0}^{N-1}{f_{train}[i]}, \nonumber \\ \sigma_{f_{train}} = \frac{1}{N}\sum_{i=0}^{N-1}{\left(f_{train}[i]-\mu_{f_{train}}\right)^2}. \end{align} $$
  • differencing: In some case, if a feature is not stationary, we may need to compute the difference between consecutive samples. 1st order differencing

    $$ \begin{align} df_1[n] = f[n] - f[n-1], \end{align} $$

    2nd order differencing

    $$ \begin{align} df_2[n] &= df_1[n] - df_1[n-1]\nonumber\\ &= f[n]-2f[n-1]+f[n-2], \end{align} $$

    Lag-m differences (seasonal differencing)

    $$ \begin{align} d_m[n] &= f[n] - f[n-m]. \end{align} $$

Back to the example, as the raw accelerometer data shows different pattern, let's try to use raw data (within a window) for classification. Let's define

  1. event \(y\):
    • 1 \(\rightarrow\) walk,
    • 0 \(\rightarrow\) climb chair;
  2. features: \(\textbf{x}[N, 3]\). It contains last \(N\) frames of raw accelerometer data, each frame has 3 values (i.e., \(v_x\), \(v_y\), \(v_z\))

Next step is to determine the value of \(N\). Each step of walk can be viewed as roughly same; in this case, the reading from accelerometer shall also roughly periodic (based on how long each step lasts). We may want to choose \(N\) such that the data will cover the whole period. A reasonable walk speed is usually larger than 3600m/hour; that is \(3600m/3600s = 1m/s\). And if the step length (either left or right step) is roughly 0.75m, which will take \(0.75m/(1m/s) = 0.75s\). Thus, a stride (two steps, one left, one right) will take \(0.75s*2=1.5s\) to finish. In this case, we can take \(2s\) accelerator data as features (i.e., \(N=2s*32samples/s=64samples\)).

In reality, the above assumption may not be true. For example, older people may walk more slowly with smaller stride length. We may need to find some way to train each group separately. Or we can treat \(N\) as an hyper-parameter to be determined with cross validation.

Now we have all the information to load the data: \(\textbf{x}[:, 64, 3]\), \(y \in [0, 1]\)

def load_data(folder, steps, label):
    """ load data in folder
        return (x[:, steps, 3], y)
    def walk_through_files(path, ext='.txt'):
        """ walk trhough all files in 'path' with extension 'ext' """
        for filepath in glob.iglob(os.path.join("%s/*%s"%(path, ext))):

    def load_file(filename, steps, label):
        """ load data file, and return (x[steps], y) """
        sample_rate = 32 # 32 samples/sec
        data = pd.read_csv(filename, sep=" ")
        data = data.values
        x, y = list(), list()
        # ignore the first and last 2 sec data, as the user may not start event
        # yet or have already stopped event.
        for i in range(sample_rate*2, len(data)-1-steps-sample_rate*2):
            x.append(data[i:i+steps, :])
        return x, y

    x_all, y_all = list(), list()
    for f in walk_through_files(folder):
        # load all files in folder
        x, y = load_file(f, steps, label)
        x_all += x
        y_all += y
    # x dimension ~ (batch, steps=steps, channels=3)
    return np.moveaxis(np.dstack(x_all), -1, 0), np.array(y_all)

steps = 64
x_walk, y_walk = load_data(folder+'walk', steps, 1)
x_climb_chair, y_climb_chair = load_data(folder+'Climb_stairs', steps, 0)

As usual, to test our model, the data loaded shall be split into two parts: training and test. Here for both walk and climb chair data, \(80\%\) data is used for training and remaining is for test.

walk_train_len = int(len(x_walk)*0.8)
climb_train_len = int(len(x_climb_chair)*0.8)

x_train = np.concatenate((x_walk[:walk_train_len, :, :], x_climb_chair[:climb_train_len, :, :]))
y_train = to_categorical(np.concatenate((y_walk[:walk_train_len], y_climb_chair[:climb_train_len])))

x_test = np.concatenate((x_walk[walk_train_len:, :, :], x_climb_chair[climb_train_len:, :, :]))
y_test = to_categorical(np.concatenate((y_walk[walk_train_len:], y_climb_chair[climb_train_len:])))

3 Neural network model

All data is ready, we can set up our model. The following model is kind of chosen arbitrarily; you can play around with the parameters to compare the performance.

n_timesteps, n_features = x_train.shape[1], x_train.shape[2]
n_outputs = 2
model = Sequential()
model.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(n_timesteps, n_features)))
model.add(Conv1D(filters=64, kernel_size=3, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

Now it's time to train the model

class TestCallback(Callback):
    def __init__(self, test_data):
        self.test_data = test_data
        self.accuracy =  []

    def on_epoch_end(self, epoch, logs={}):
        x, y = self.test_data
        loss, acc = self.model.evaluate(x, y, verbose=0)
        print('\nTesting loss: {}, acc: {}\n'.format(loss, acc))
        self.accuracy.append([logs["accuracy"], acc])
test = TestCallback((x_test, y_test))

h = model.fit(x_train, y_train, epochs=100, batch_size=batch_size, verbose=1, callbacks=[test])

Here we define the callback to evaluate the accuracy on test data after each epoch. Fig. (5) shows that after 10 epochs, the test data achieves \(\sim 97\%\) accuracy.

Fig.5. Accuracy of training and testing dataset

Full code can be download here. The "performance" may be improved by choosing the hyper-parameters or adjusting the model; however, considering that the data may not be labeled correctly (remember we just simply ignore the first/last 2sec data) and accelerometer data may not contain all the information to differentiate the events, the result looks pretty good.

Back to the "shoe event" problem, could we use the similar procedure to achieve the similar performance? I am afraid not. The above simplified problem is just try to differentiate two events from each other (walk, climb chair). To make "shoe event" detector useful, it needs to detect "shoe event" from all other events. Furthermore, compared to other events, "shoe event" may be very rare; that's how often will you put on/take off your shoes? Let's say we have a model that can detect \(99\%\) "shoe event" when it happens; that is \(p(\hat{s}/s) = 0.99\). And \(1\%\) false detection when other events happen(\(p(\hat{s}/\bar{s}) = 0.01\)). Does such model work? For example, for a certain day, the sensor is active for 8 hours (e.g., wear shoes for 8 hours every day), and all "shoe events" take 5 mins. Then \(p(s) = 5 mins/8 hours = 0.01\). In this case, when our model shows it detects the event (\(\hat{s}\)), the probability that it is triggered by a true "shoe event" is

$$ \begin{align} p(s/\hat{s}) &= \frac{p(\hat{s}/s)*p(s)}{p(\hat{s}/s)*p(s) + p(\hat{s}/\bar{s})*p(\bar{s})} \nonumber \\ &= \frac{0.99*0.01}{0.99*0.01 + 0.01*(1-0.01)} \nonumber \\ &= 0.5. \end{align} $$

So when the model detects a event, we have half-half chance that the detection is triggered by the actual "shoe event", which is not particularly assuring. The problem is that the event we are interested in happens rarely. To improve the performance, we need to decrease \(p(\hat{s}/\bar{s})\), for example

  • have more data to cover more cases
  • more features (e.g., pressure sensor)