
I've been using the caret package in R to build predictive models for classification and regression. caret provides a unified interface to tune model hyper-parameters by cross-validation or bootstrapping. For example, if you are building a simple nearest-neighbors model for classification, how many neighbors should you use? 2? 10? 100? caret helps you answer this question by resampling your data, trying different parameter values, and then aggregating the results to decide which yield the best predictive accuracy.

I like this approach because it provides a robust methodology for choosing model hyper-parameters, and once you've chosen the final hyper-parameters it provides a cross-validated estimate of how "good" the model is, using accuracy for classification models and RMSE for regression models.
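For instance, here is a minimal sketch of that tuning workflow (the data set, the grid of k values, and the resampling settings are illustrative choices, not anything from this post):

library(caret)
data(iris)

# 10-fold cross-validation, repeated to stabilize the resampling estimate
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Ask caret to resample each candidate value of k and compare accuracies
knn_fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 tuneGrid = data.frame(k = c(2, 5, 10, 25, 50)),
                 trControl = ctrl)

knn_fit  # prints cross-validated accuracy per k and the chosen best value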

I now have some time-series data that I want to build a regression model for, probably using a random forest. What is a good technique to assess the predictive accuracy of my model, given the nature of the data? If random forests don't really apply to time series data, what's the best way to build an accurate ensemble model for time series analysis?

  • Caret now supports time-series cross-validation: r-bloggers.com/time-series-cross-validation-5 (see the sketch just below these comments). – Commented Jul 12, 2014 at 0:00
  • @Zach This is an old post, but I wonder if you have any new thoughts. Are you aware of any recent work on sequential model validation? – horaceT, Aug 26, 2016 at 20:44
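For reference, the feature that first comment points to is caret's "timeslice" resampling. A minimal sketch (the window sizes here are arbitrary illustrations):

library(caret)

# Rolling-origin resampling: train on a window of consecutive observations,
# then evaluate on the observations immediately following it
ctrl <- trainControl(method = "timeslice",
                     initialWindow = 36,   # length of each training window
                     horizon       = 12,   # length of each test window
                     fixedWindow   = TRUE) # slide the window rather than grow it

# ctrl can then be passed to train(..., trControl = ctrl) as usual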

5 Answers


The "classical" k-times cross-validation technique is based on the fact that each sample in the available data set is used (k-1)-times to train a model and 1 time to test it. Since it is very important to validate time series models on "future" data, this approach will not contribute to the stability of the model.

One important property of many (most?) time series is the correlation between adjacent values. As pointed out by IrishStat, if you use previous readings as independent variables in your candidate model, this correlation (or lack of independence) plays a significant role and is another reason why k-fold cross-validation isn't a good idea.

One way to overcome this problem is to "oversample" the data and then decorrelate it. If the decorrelation process is successful, using cross-validation on time series becomes less problematic. It will not, however, solve the issue of validating the model on future data.

Clarifications

By validating the model on future data I mean constructing the model; waiting for new data that wasn't available during model construction, testing, and fine-tuning; and then validating the model on that new data.

By oversampling the data I mean collecting time series data at a frequency much higher than practically needed. For example: sampling stock prices every 5 seconds when you are really interested in hourly changes. Here, when I say "sampling" I don't mean "interpolating" or "estimating"; if the data cannot be measured at a higher frequency, this technique is meaningless.
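To illustrate the decorrelation idea with a toy example (simulated data and my own sketch, not the answer's actual procedure): if a series is sampled much faster than its memory decays, keeping only every m-th observation yields nearly uncorrelated values, which the autocorrelation function can confirm:

set.seed(1)

# A strongly autocorrelated AR(1) series, "oversampled" at high frequency
x <- arima.sim(model = list(ar = 0.9), n = 10000)
acf(x, lag.max = 20)          # adjacent values are heavily correlated

# Thin the series: keep every 30th observation
x_thin <- x[seq(1, length(x), by = 30)]
acf(x_thin, lag.max = 20)     # autocorrelation is now near zero (0.9^30 is about 0.04)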

  • What is the "classical" way to validate a model on future data? And what do you mean by "oversampling"? Thank you! – Zach, Mar 27, 2011 at 18:02
  • It should be noted that the statistical properties of real-world time series data (especially financial data) can vary depending on the sampling frequency. For many financial time series there is no straightforward $\sqrt{T}$ relationship between the standard deviations sampled with period $p$ and with period $pT$. In fact, the standard deviation tends to increase as the sampling frequency increases. Similarly, correlation decreases as the sampling frequency increases (this is commonly known as the Epps effect). – Commented Nov 15, 2011 at 13:35
  • @bgbg I'm facing a very similar problem and just found your post. Can you cite some references on oversampling and decorrelating time series data? I'd think that if the memory in the time series is short enough (which one could show by fitting an ARIMA model), one could just take "non-overlapping" samples and do the usual cross-validation. Any thoughts appreciated. – horaceT, Aug 26, 2016 at 20:42

http://robjhyndman.com/researchtips/crossvalidation/ contains a quick tip for cross-validation of time series. Regarding using random forests for time series data: I'm not sure, although it seems like an odd choice, given that the model is fitted using bootstrap samples. There are, of course, classic time series methods that can be used (e.g., ARIMA), as can ML techniques like neural networks. Perhaps some of the time series experts can comment on how well ML techniques work compared to time-series-specific algorithms.
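For what it's worth, the usual way to press a random forest into service on a time series is to recast forecasting as supervised learning on lagged values. A minimal sketch (the data set, the 12-lag embedding, and the holdout size are my own illustrative choices):

library(randomForest)

# Turn the series into a table: predict y[t] from y[t-1], ..., y[t-12]
lagged <- embed(as.numeric(AirPassengers), 13)  # column 1 is y[t], the rest are lags
y <- lagged[, 1]
X <- lagged[, -1]

# Hold out the final 12 observations as a "future" test set, not random folds
train_idx <- 1:(length(y) - 12)
fit   <- randomForest(X[train_idx, ], y[train_idx])
preds <- predict(fit, X[-train_idx, ])
mean(abs(preds - y[-train_idx]))  # MAE on the held-out year

One caveat worth knowing: a forest's predictions are averages of training-set targets, so it cannot extrapolate a trend beyond the range it has seen, which is one reason it can be an odd fit for trending series.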

  • This pretty much hits the nail on the head. I'm trying to figure out how to apply machine learning techniques to time series analysis. – Zach, Mar 28, 2011 at 13:30
  • I have had success using random forests for forecasting before. Check out: biomedcentral.com/1471-2105/15/276 – Commented Sep 17, 2014 at 17:50

Here is some example code for cross-validating time series models. I expanded on this code on my blog, incorporating the foreach package to speed things up and allowing for a possible xreg term in the cross-validation.

Here's a copy of the code from Rob Hyndman's blog:

library(fpp) # loads the forecast package and the a10 data set (antidiabetic drug sales)
plot(a10, ylab="$ million", xlab="Year", main="Antidiabetic drug sales")
plot(log(a10), ylab="", xlab="Year", main="Log Antidiabetic drug sales")

k <- 60 # minimum data length for fitting a model
n <- length(a10)
mae1 <- mae2 <- mae3 <- matrix(NA,n-k,12) # one row per origin, one column per horizon
st <- tsp(a10)[1]+(k-2)/12

# Rolling-origin evaluation: refit each model on an expanding window,
# forecast the next 12 months, and record the absolute errors
for(i in 1:(n-k))
{
  xshort <- window(a10, end=st + i/12)                           # training window
  xnext <- window(a10, start=st + (i+1)/12, end=st + (i+12)/12)  # next 12 months
  fit1 <- tslm(xshort ~ trend + season, lambda=0)                # linear model on log scale
  fcast1 <- forecast(fit1, h=12)
  fit2 <- Arima(xshort, order=c(3,0,1), seasonal=list(order=c(0,1,1), period=12), 
      include.drift=TRUE, lambda=0, method="ML")                 # seasonal ARIMA
  fcast2 <- forecast(fit2, h=12)
  fit3 <- ets(xshort,model="MMM",damped=TRUE)                    # damped multiplicative ETS
  fcast3 <- forecast(fit3, h=12)
  mae1[i,1:length(xnext)] <- abs(fcast1[['mean']]-xnext)
  mae2[i,1:length(xnext)] <- abs(fcast2[['mean']]-xnext)
  mae3[i,1:length(xnext)] <- abs(fcast3[['mean']]-xnext)
}

# Average the absolute errors across origins and plot MAE by forecast horizon
plot(1:12, colMeans(mae1,na.rm=TRUE), type="l", col=2, xlab="horizon", ylab="MAE",
     ylim=c(0.65,1.05))
lines(1:12, colMeans(mae2,na.rm=TRUE), type="l",col=3)
lines(1:12, colMeans(mae3,na.rm=TRUE), type="l",col=4)
legend("topleft",legend=c("LM","ARIMA","ETS"),col=2:4,lty=1)

(Plot: MAE by forecast horizon for the LM, ARIMA, and ETS models.)
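Since the answer mentions speeding this up with foreach: here is a rough sketch of how the ARIMA leg of the loop could be parallelized, reusing n, k, st, and a10 from the code above (my own approximation, not the blog's exact code):

library(foreach)
library(doParallel)
registerDoParallel(cores = 2)

# Each iteration returns one row of 12 absolute errors; rbind stacks them
mae2 <- foreach(i = 1:(n - k), .combine = rbind, .packages = "forecast") %dopar% {
  xshort <- window(a10, end = st + i/12)
  xnext  <- window(a10, start = st + (i+1)/12, end = st + (i+12)/12)
  fit    <- Arima(xshort, order = c(3,0,1),
                  seasonal = list(order = c(0,1,1), period = 12),
                  include.drift = TRUE, lambda = 0, method = "ML")
  fcast  <- forecast(fit, h = 12)
  c(abs(fcast[["mean"]] - xnext), rep(NA, 12 - length(xnext)))  # pad short windows
}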

  • Hi Zach. I implemented a slightly different version of Hyndman's code in order to detect the appropriate number of terms for a regression model with time series. Unfortunately, the CV error plot showed several local minima, which made me question how to select the number of terms correctly. The full problem is described here. Have you faced something similar before? – Commented Dec 15, 2015 at 19:24

If you have time series data, you might have a "degrees of freedom" problem. For example, if you have 4 observations taken at hourly intervals and then decide to use 241 observations at 1-minute intervals, you have 241 observations, but they are not necessarily independent. When you submit these 241 values/measurements to an analytical package, the package may treat them as 241 independent values as it proceeds to perform its particular magic. If you have time series data, you might have to upgrade your analytics. I don't know the program you refer to, but it is a reasonable guess on my part (I could be wrong!) that its tests (F-tests, t-tests, etc.) probably don't apply to your problem set.
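To put a rough number on that point (a textbook back-of-the-envelope calculation, not something from the original answer): for an AR(1)-like series with lag-one autocorrelation rho, the effective number of independent observations is approximately n(1 - rho)/(1 + rho):

# Rough effective sample size for an autocorrelated (AR(1)-like) series
n   <- 241
rho <- 0.95  # strong minute-to-minute correlation (illustrative value)

n_eff <- n * (1 - rho) / (1 + rho)
n_eff  # about 6 -- far closer to a handful of hourly readings than to 241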


I can recommend two interesting papers that are available online:

1. Streamed Learning: One-Pass SVMs, by Piyush Rai, Hal Daumé III, and Suresh Venkatasubramanian

2. Streaming k-means approximation, by Nir Ailon et al.

I hope this helps clarify your ideas a little.
