
I'm considering two strategies for doing "data augmentation" in time-series forecasting.

First, a little bit of background. A predictor $P$ that forecasts the next step of a time series $\lbrace A_i\rbrace$ is a function that typically depends on two things: the time series' past states and the predictor's own past (internal) states:

$$P(\lbrace A_{i\leq t-1}\rbrace,P_{S_{t-1}})$$

If we want to adjust/train our system to obtain a good $P$, then we'll need enough data. Sometimes the available data won't be enough, so we consider data augmentation.

First approach

Suppose we have the time series $\lbrace A_i \rbrace$, with $1 \leq i \leq n$, and suppose also that we have an $\epsilon$ that satisfies the following condition: $0<\epsilon < |A_{i+1} - A_i| \;\; \forall i \in \lbrace 1, \ldots, n-1\rbrace$.

We can construct a new time series $\lbrace B_i = A_i+r_i\rbrace$, where each $r_i$ is an independent realization of the distribution $N(0,\frac{\epsilon}{2})$.

Then, instead of minimizing the loss function only over $\lbrace A_i \rbrace$, we also minimize it over $\lbrace B_i \rbrace$. So, if the optimization process takes $m$ steps, we have to "initialize" the predictor $2m$ times, and we'll compute approximately $2m(n-1)$ predictor internal states.
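To make the bookkeeping concrete, here is a minimal sketch of this first approach, assuming a simple one-step-ahead GRU forecaster in PyTorch; the names (`Forecaster`, `jittered`, `train_first_approach`) and the hyperparameters are illustrative, not part of the question.

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):           # x: (batch, time, 1)
        h, _ = self.rnn(x)          # predictor internal states, driven by x itself
        return self.out(h)          # one-step-ahead predictions

def jittered(a, eps):
    """B_i = A_i + r_i, with noise of standard deviation eps/2 (the N(0, eps/2) of the question)."""
    return a + torch.randn_like(a) * (eps / 2)

def train_first_approach(a, eps, steps=100):
    """a: the original series A as a 1-D tensor of length n."""
    model = Forecaster()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        b = jittered(a, eps)                   # fresh synthetic series each step
        loss = 0.0
        for series in (a, b):                  # two forward passes per step:
            x = series[:-1].view(1, -1, 1)     # -> 2m initialisations,
            y = series[1:].view(1, -1, 1)      #    ~2m(n-1) internal states
            loss = loss + loss_fn(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```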

Second approach

We compute $\lbrace B_i \rbrace$ as before, but we update the predictor's internal state using $\lbrace A_i \rbrace$ only, not $\lbrace B_i \rbrace$. The two series are used together only when computing the loss function, so we'll compute approximately $m(n-1)$ predictor internal states.

Of course, there is less computational work here (although the algorithm is a bit uglier), but that does not matter for now.
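Under the same assumptions, here is a sketch of one possible reading of this second approach, reusing the hypothetical `Forecaster` and `jittered` from the previous sketch: the hidden states are driven only by $\lbrace A_i \rbrace$, and $\lbrace B_i \rbrace$ enters only as an extra target in the loss.

```python
def train_second_approach(a, eps, steps=100):
    model = Forecaster()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    x = a[:-1].view(1, -1, 1)          # inputs (and hence internal states)
    y_a = a[1:].view(1, -1, 1)         # always come from the real series A
    for _ in range(steps):
        b = jittered(a, eps)
        y_b = b[1:].view(1, -1, 1)     # the synthetic series is only a target
        pred = model(x)                # one forward pass -> ~m(n-1) internal states
        loss = loss_fn(pred, y_a) + loss_fn(pred, y_b)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```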

The question

The problem is: from a statistical point of view, which is the "best" option, and why?

My intuition tells me that the first one is better, because it helps to "regularize" the weights related to the internal state, while the second one only helps to regularize the weights related to the observed time series' past.


Extra:

  • Any other ideas to do data augmentation for time series forecasting?
  • How to weight the synthetic data in the training set?

3 Answers

Answer 1 (score 9)

Any other ideas to do data augmentation for time series forecasting?

I'm currently thinking about the same problem. I've found the paper "Data Augmentation for Time Series Classification using Convolutional Neural Networks" by Le Guennec et al., which doesn't cover forecasting, but the augmentation methods described there still look promising. The authors present two methods:

Window Slicing (WS)

A first method that is inspired from the computer vision community [8,10] consists in extracting slices from time series and performing classification at the slice level. This method has been introduced for time series in [6]. At training, each slice extracted from a time series of class y is assigned the same class and a classifier is learned using the slices. The size of the slice is a parameter of this method. At test time, each slice from a test time series is classified using the learned classifier and a majority vote is performed to decide a predicted label. This method is referred to as window slicing (WS) in the following.
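For illustration, here is a rough numpy sketch of window slicing as described in the quoted passage; `slice_len` and the variable names are my own.

```python
import numpy as np

def window_slices(series, label, slice_len):
    """Extract every contiguous slice of length slice_len; each slice keeps the label."""
    slices = [series[i:i + slice_len] for i in range(len(series) - slice_len + 1)]
    return np.stack(slices), np.full(len(slices), label)
```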

Window Warping (WW)

The last data augmentation technique we use is more time-series specific. It consists in warping a randomly selected slice of a time series by speeding it up or down, as shown in Fig. 2. The size of the original slice is a parameter of this method. Fig. 2 shows a time series from the “ECG200” dataset and corresponding transformed data. Note that this method generates input time series of different lengths. To deal with this issue, we perform window slicing on transformed time series for all to have equal length. In this paper, we only consider warping ratios equal to 0.5 or 2, but other ratios could be used and the optimal ratio could even be fine tuned through cross-validation on the training set. In the following, this method will be referred to as window warping (WW).

[Fig. 2 from the paper: a time series from the "ECG200" dataset and its window-warped transformations]

The authors kept 90% of the series unchanged (i.e. WS was set to a 90% slice, and for WW 10% of the series was warped). The methods are reported to reduce classification error on several types of (time-)series data, except on 1D representations of image outlines. The authors took their data from here: http://timeseriesclassification.com
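And a rough sketch of window warping, assuming plain linear interpolation for the resampling and the paper's warping ratios of 0.5 and 2; the function and parameter names are illustrative. Note that the warped output has a different length, which the authors then handle with window slicing.

```python
def window_warp(series, start, length, ratio):
    """Warp series[start:start+length] by `ratio` (2 = slow down, 0.5 = speed up)."""
    segment = series[start:start + length]
    new_len = max(2, int(round(length * ratio)))
    # resample the selected slice to the new length by linear interpolation
    warped = np.interp(np.linspace(0, length - 1, new_len),
                       np.arange(length), segment)
    # the result has a different length than the original series;
    # the authors then apply window slicing so all series end up equally long
    return np.concatenate([series[:start], warped, series[start + length:]])
```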

How to weight the synthetic data in the training set?

In image augmentation, since the augmentation isn't expected to change the class of an image, it's (as far as I know) common to weight it like any real data. Time series forecasting (and even time series classification) might be different:

  1. A time series is not easily perceived as a contiguous object by humans, so depending on how much you tamper with it, is it still the same class? If you only slice and warp a little, and the classes are visually distinct, this might not pose a problem for classification tasks.
  2. For forecasting, I would argue that

    2.1 WS is still a nice method. No matter at which 90%-part of the series you look, you would still expect a forecast based on the same rules => full weight.

    2.2 WW: The closer it happens to the end of the series, the more cautious I would be. Intuitively, I would come up with a weight factor sliding between 0 (warping at the end) and 1 (warping at the beginning), assuming that the most recent features of the curve are the most relevant; a sketch of such a weight is below.
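A purely illustrative sketch of that sliding weight (the function name and the linear form are my own choices):

```python
def warp_weight(warp_start, series_len):
    """1.0 if the warp is at the very beginning, sliding down to 0.0 at the end."""
    return 1.0 - warp_start / series_len
```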

Answer 2 (score 9)

I have recently implemented another approach, inspired by this paper by Bergmeir, Hyndman and Benitez.

The idea is to take a time series and first apply a transformation such as the Box-Cox transformation or the Yeo-Johnson transformation (which solves some problems with Box-Cox) to stabilise the variance of the series, then apply an STL decomposition (for seasonal series) or a loess decomposition to the transformed series to obtain its residuals. These residuals are then resampled with a moving block bootstrap to generate $B$ additional series. The initial trend and seasonality of the starting series are added back to each set of bootstrapped residuals, and finally the power transform applied in the first step is inverted.
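A condensed sketch of this pipeline, assuming a strictly positive, seasonal series with a known period, using Box-Cox, statsmodels' STL and a hand-rolled moving block bootstrap; the block length and other parameter values are illustrative choices, not the paper's settings.

```python
import numpy as np
from scipy import stats, special
from statsmodels.tsa.seasonal import STL

def mbb(resid, block_len, rng):
    """Moving block bootstrap: resample overlapping blocks of the residuals."""
    n = len(resid)
    blocks = [resid[i:i + block_len] for i in range(n - block_len + 1)]
    out = []
    while len(out) < n:
        out.extend(blocks[rng.integers(len(blocks))])
    return np.array(out[:n])

def bootstrap_series(y, period, n_new=10, block_len=24, seed=0):
    rng = np.random.default_rng(seed)
    y_bc, lam = stats.boxcox(y)                    # 1. stabilise the variance
    stl = STL(y_bc, period=period).fit()           # 2. STL decomposition
    new_series = []
    for _ in range(n_new):
        boot = mbb(stl.resid, block_len, rng)      # 3. bootstrap the residuals
        y_new = stl.trend + stl.seasonal + boot    # 4. add trend and seasonality back
        new_series.append(special.inv_boxcox(y_new, lam))  # 5. invert the transform
    return new_series
```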

In this way, as many additional time series as needed can be generated, and they represent the initial time series quite well. Here is an example of applying this to some real data to generate additional, similar time series:

[Figure: the original series together with the bootstrapped augmented series]

Here the augmentation is shown using a Yeo-Johnson transformation rather than the Box-Cox transformation suggested in the original paper.

Comment from lalaland: "I know you posted that a long time ago. However, I was curious to know if you still happen to have your script? Thanks"

Answer 3 (score 8)

Any other ideas to do data augmentation for time series forecasting?

Another answer with a different approach, based on "Dataset Augmentation in Feature Space" by DeVries and Taylor.

In this work, we demonstrate that extrapolating between samples in feature space can be used to augment datasets and improve the performance of supervised learning algorithms. The main benefit of our approach is that it is domain-independent, requiring no specialized knowledge, and can therefore be applied to many different types of problems.

Sounds promising to me. In principle you can take any autoencoder to create representations in the feature space. These features can be interpolated or extrapolated.

The figure below shows, as an example, the interpolation of two feature-space vectors $C_j$ and $C_k$ (note that better results are reported for extrapolating from two vectors; see the paper for details). The resulting augmented vector $C'$ is then decoded back to the input space and fed into the network for training.

Again, the paper only covers sequence classification, but in my opinion the principles are the same for regression analysis: you get new data drawn, presumably, from the same distribution as your real data, which is what you want.

[Figure: architecture of the autoencoder-based augmentation, with interpolation/extrapolation between the feature vectors $C_j$ and $C_k$]
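A minimal sketch of this idea, assuming you already have a trained autoencoder exposing `encode` and `decode` methods; the function name, the $\lambda$ value and the `extrapolate` flag are illustrative.

```python
import torch

def augment_in_feature_space(autoencoder, x_j, x_k, lam=0.5, extrapolate=True):
    """Create a synthetic sample from two real samples via their latent codes."""
    with torch.no_grad():
        c_j = autoencoder.encode(x_j)          # assumed encoder method
        c_k = autoencoder.encode(x_k)
        if extrapolate:
            c_new = c_j + lam * (c_j - c_k)    # push c_j further away from c_k
        else:
            c_new = c_j + lam * (c_k - c_j)    # plain interpolation
        return autoencoder.decode(c_new)       # back to the input space
```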

If we take this principle of generating data with a neural network further, we end up with Generative Adversarial Networks (GANs). They could be used in a similar fashion to generate augmented data, which is probably the most sophisticated, state-of-the-art way to do so.
