
I am looking for efficient ways to remove outliers from my data. I have tried several solutions that I found here on Stack Overflow and elsewhere, but none of them worked for me: the four high values of 21637, 19590, 21659 and 200000 in June 1993, August 1994 and March 1995 should be detected and removed from the sample data posted at the end of this post. Any suggestions would be greatly appreciated!

Here is what I tested so far:

Overview of the data

[plot: overview of the raw Value series]

  1. How to remove outliers from a dataset

    Three outliers were still there, and many legitimate high values at the end of the time series were removed.

    y <- dat$Value
    # drop every value that boxplot.stats() flags as an outlier
    y_filter <- y[!y %in% boxplot.stats(y)$out]
    plot(y_filter)
    

[plot of y_filter]

  2. Calculating outliers in R

    Similar issue to the first method.

    FindOutliers <- function(data) {
      data <- data[!is.na(data)]
      lowerq <- quantile(data)[2]  # first quartile
      upperq <- quantile(data)[4]  # third quartile
      iqr <- upperq - lowerq       # or use IQR(data)
      # identify extreme outliers
      extreme.threshold.upper <- (iqr * 1.5) + upperq
      extreme.threshold.lower <- lowerq - (iqr * 1.5)
      which(data > extreme.threshold.upper | data < extreme.threshold.lower)
    }
    
    # use the function to identify outliers
    # (note: the indices refer to the NA-free vector inside FindOutliers,
    # so subsetting the original y, which still contains NAs, can misalign)
    temp <- FindOutliers(y)
    # remove the outliers
    y1 <- y[-temp]
    # show the data with the outliers removed
    y1
    #>   [1]   511   524   310   721   514   318   511   511   318   510 21637     0
    #>  [13]   319   511   305   317     0     0     0     0     0     0     0     0
    #>  [25] 19590     0     0     0     0     0     0     0     0 21659     0     0
    #>  [37]     0     0    19     7     0     5     9    21    50   187   291   321
    #>  [49]   445   462   462   695   694   693  1276  2560  5500 11663 24307 25205
    #>  [61] 21667 18610 16739 14294 13517 12296 11247 11209 10954 11228 11387 11190
    #>  [73] 11193 11562 12279 11994 12073 11965 11386 10512 10677 10115 10118  9530
    #>  [85]  9016  9086  8167  8171  7561  7268  7359  7753  7168  7206  6926  6646
    #>  [97]  6674  6100  6177  5975  6033  5767  5497  5862  5594  5319
    plot(y1)
    

[plot of y1]

  3. Checking outliers with performance

    Same issue as the other two methods

    library(performance)
    outliers_list <- check_outliers(y) # Find outliers
    outliers_list # Show the row index of the outliers
    #> 9 outliers detected: cases 60, 61, 62, 63, 64, 65, 66, 67, 68.
    #> - Based on the following method and threshold: zscore_robust (3.291).
    #> - For variable: y.
    #> 
    #> -----------------------------------------------------------------------------
    #> Outliers per variable (zscore_robust): 
    #> 
    #> $y
    #>    Row Distance_Zscore_robust
    #> 60  60               3.688817
    #> 61  61               4.384524
    #> 62  62               4.491276
    #> 63  63               4.362517
    #> 64  64               4.438994
    #> 65  65               4.525319
    #> 66  66               4.576871
    #> 67  67               4.393886
    #> 68  68               3.804809
    str(outliers_list)
    #>  'check_outliers' logi [1:116] FALSE FALSE FALSE FALSE FALSE FALSE ...
    #>  - attr(*, "data")='data.frame': 116 obs. of  4 variables:
    #>   ..$ Row                   : int [1:116] 1 2 3 4 5 6 7 8 9 10 ...
    #>   ..$ Distance_Zscore_robust: num [1:116] 0.645 0.643 0.669 0.619 0.644 ...
    #>   ..$ Outlier_Zscore_robust : num [1:116] 0 0 0 0 0 0 0 0 0 0 ...
    #>   ..$ Outlier               : num [1:116] 0 0 0 0 0 0 0 0 0 0 ...
    #>  - attr(*, "threshold")=List of 1
    #>   ..$ zscore_robust: num 3.29
    #>  - attr(*, "method")= chr "zscore_robust"
    #>  - attr(*, "text_size")= num 3
    #>  - attr(*, "variables")= chr "y"
    #>  - attr(*, "raw_data")='data.frame': 116 obs. of  1 variable:
    #>   ..$ y: num [1:116] 511 524 310 721 514 318 511 511 318 510 ...
    #>  - attr(*, "outlier_var")=List of 1
    #>   ..$ zscore_robust:List of 1
    #>   .. ..$ y:'data.frame': 9 obs. of  2 variables:
    #>   .. .. ..$ Row                   : int [1:9] 60 61 62 63 64 65 66 67 68
    #>   .. .. ..$ Distance_Zscore_robust: num [1:9] 3.69 4.38 4.49 4.36 4.44 ...
    #>  - attr(*, "outlier_count")=List of 2
    #>   ..$ zscore_robust:'data.frame':    0 obs. of  2 variables:
    #>   .. ..$ Row            : int(0) 
    #>   .. ..$ n_Zscore_robust: int(0) 
    #>   ..$ all          :'data.frame':    0 obs. of  2 variables:
    #>   .. ..$ Row            : num(0) 
    #>   .. ..$ n_Zscore_robust: num(0)
    
    y[!outliers_list]
    #>   [1]   511   524   310   721   514   318   511   511   318   510 21637     0
    #>  [13]   319   511   305   317     0     0     0     0     0     0     0     0
    #>  [25] 19590     0     0     0     0     0     0     0     0 21659     0     0
    #>  [37]     0     0    19     7     0     5     9    21    50   187   291   321
    #>  [49]   445   462   462   695   694   693  1276  2560  5500 11663 24307 30864
    #>  [61] 25205 21667 18610 16739 14294 13517 12296 11247 11209 10954 11228 11387
    #>  [73] 11190 11193 11562 12279 11994 12073 11965 11386 10512 10677 10115 10118
    #>  [85]  9530  9016  9086  8167  8171  7561  7268  7359  7753  7168  7206  6926
    #>  [97]  6646  6674  6100  6177  5975  6033  5767  5497  5862  5594  5319
    
    par(mfrow = c(2,1), oma = c(2,2,0,0) + 0.1, mar = c(0,0,2,1) + 0.2)
    plot(y)
    plot(y[!outliers_list])
    

[two stacked plots: y, and y with the flagged outliers removed]

Test Data

library(tibble)

dat <- tibble::tribble(
  ~DateTime, ~Value,
  "1993-06-06 11:00:00",     NA,
  "1993-06-06 12:00:00",    524,
  "1993-06-06 13:00:00",    310,
  "1993-06-06 14:00:00",    721,
  "1993-06-06 15:00:00",    514,
  "1993-06-06 16:00:00",    318,
  "1993-06-06 17:00:00",    511,
  "1993-06-06 18:00:00",    511,
  "1993-06-06 19:00:00",    318,
  "1993-06-06 20:00:00",    510,
  "1993-06-06 21:00:00",  21637,
  "1993-06-06 22:00:00",     NA,
  "1993-06-06 23:00:00",    319,
  "1993-06-07 24:00:00",    511,
  "1993-06-07 01:00:00",    305,
  "1993-06-07 02:00:00",    317,
  "1994-08-25 06:00:00",      0,
  "1994-08-25 07:00:00",      0,
  "1994-08-25 08:00:00",      0,
  "1994-08-25 09:00:00",     NA,
  "1994-08-25 10:00:00",      0,
  "1994-08-25 11:00:00",      0,
  "1994-08-25 12:00:00",      0,
  "1994-08-25 13:00:00",      0,
  "1994-08-25 14:00:00",  19590,
  "1994-08-26 06:00:00",      0,
  "1994-08-26 07:00:00",      0,
  "1994-08-26 08:00:00",      0,
  "1994-08-26 09:00:00",      0,
  "1994-08-26 10:00:00",     NA,
  "1994-08-26 11:00:00",     NA,
  "1994-08-26 12:00:00",      0,
  "1994-08-26 13:00:00",      0,
  "1994-08-26 14:00:00",  21659,
  "1994-08-26 15:00:00",      0,
  "1994-08-26 16:00:00",      0,
  "1994-08-26 17:00:00",      0,
  "1994-08-26 20:00:00",      0,
  "1994-08-26 21:00:00",     19,
  "1994-08-26 22:00:00",     NA,
  "1995-03-09 18:00:00",     NA,
  "1995-03-09 19:00:00",     NA,
  "1995-03-09 20:00:00",      9,
  "1995-03-09 21:00:00",     21,
  "1995-03-09 22:00:00",     50,
  "1995-03-09 23:00:00",    187,
  "1995-03-10 24:00:00",    291,
  "1995-03-10 01:00:00",    321,
  "1995-03-10 02:00:00",    445,
  "1995-03-10 03:00:00",  2e+05,
  "1995-03-10 04:00:00",    462,
  "1995-03-10 05:00:00",    695,
  "1995-03-10 06:00:00",    694,
  "1995-03-10 07:00:00",    693,
  "1995-03-10 08:00:00",   1276,
  "1995-03-10 09:00:00",   2560,
  "1995-03-10 10:00:00",   5500,
  "1995-03-10 11:00:00",  11663,
  "1995-03-10 12:00:00",  24307,
  "1995-03-10 15:00:00",  36154,
  "1995-03-10 17:00:00",  41876,
  "1995-03-10 18:00:00",  42754,
  "1995-03-10 19:00:00",     NA,
  "1995-03-10 20:00:00",     NA,
  "1995-03-10 21:00:00",  43034,
  "1995-03-10 22:00:00",  43458,
  "1995-03-10 23:00:00",  41953,
  "1995-03-11 24:00:00",  37108,
  "1995-03-11 01:00:00",  30864,
  "1995-03-11 02:00:00",  25205,
  "1995-03-11 03:00:00",     NA,
  "1995-03-11 04:00:00",     NA,
  "1995-03-11 05:00:00",     NA,
  "1995-03-11 06:00:00",     NA,
  "1995-03-11 07:00:00",  13517,
  "1995-03-11 08:00:00",  12296,
  "1995-03-11 09:00:00",  11247,
  "1995-03-11 10:00:00",  11209,
  "1995-03-11 11:00:00",  10954,
  "1995-03-11 12:00:00",  11228,
  "1995-03-11 13:00:00",  11387,
  "1995-03-11 14:00:00",  11190,
  "1995-03-11 15:00:00",  11193,
  "1995-03-11 16:00:00",  11562,
  "1995-03-11 17:00:00",  12279,
  "1995-03-11 18:00:00",  11994,
  "1995-03-11 19:00:00",  12073,
  "1995-03-11 20:00:00",  11965,
  "1995-03-11 21:00:00",  11386,
  "1995-03-11 22:00:00",  10512,
  "1995-03-11 23:00:00",  10677,
  "1995-03-12 24:00:00",  10115,
  "1995-03-12 01:00:00",  10118,
  "1995-03-12 02:00:00",   9530,
  "1995-03-12 03:00:00",   9016,
  "1995-03-12 04:00:00",   9086,
  "1995-03-12 05:00:00",   8167,
  "1995-03-12 06:00:00",   8171,
  "1995-03-12 07:00:00",   7561,
  "1995-03-12 08:00:00",   7268,
  "1995-03-12 09:00:00",   7359,
  "1995-03-12 10:00:00",   7753,
  "1995-03-12 11:00:00",   7168,
  "1995-03-12 12:00:00",   7206,
  "1995-03-12 13:00:00",   6926,
  "1995-03-12 14:00:00",   6646,
  "1995-03-12 15:00:00",   6674,
  "1995-03-12 16:00:00",   6100,
  "1995-03-12 17:00:00",   6177,
  "1995-03-12 18:00:00",   5975,
  "1995-03-12 19:00:00",   6033,
  "1995-03-12 20:00:00",   5767,
  "1995-03-12 21:00:00",   5497,
  "1995-03-12 22:00:00",   5862,
  "1995-03-12 23:00:00",   5594,
  "1995-03-13 24:00:00",     NA
)

Created on 2023-10-05 with reprex v2.0.2

  • Can you fit a spline/line of best fit and then check the deviation? E.g. sp <- smooth.spline(dat$Value); dev <- abs(dat$Value - sp$y); dat$Value[dev > quantile(dev, 0.975)] seems to give some results that make sense? (A runnable sketch follows these comments.)
    – thelatemail
    Commented Oct 5, 2023 at 22:48
  • @thelatemail: can you post it as an answer? Thanks!
    – Tung
    Commented Oct 6, 2023 at 5:45
  • Useful links: Outliers detection in R and Dealing with outliers and missing values
    – Tung
    Commented Oct 19, 2023 at 16:46
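
A runnable sketch of thelatemail's spline idea from the comments above (the NA handling and the 0.975 cutoff are assumptions on my part; smooth.spline() itself does not accept missing values):

# fit a smoothing spline to the non-NA values and flag large deviations
ok  <- !is.na(dat$Value)
sp  <- smooth.spline(which(ok), dat$Value[ok])
dev <- abs(dat$Value[ok] - predict(sp, which(ok))$y)
dat$Value[ok][dev > quantile(dev, 0.975)]  # candidate outliers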

6 Answers


In your time series it may be better to detect local outliers rather than global ones. To find and remove them you could use a rolling time window, for example n = 10:

library(data.table)

setDT(dat)

# rolling 25th/75th percentiles and IQR, each over a centered window of 10 points
dat[, upper := frollapply(Value, n = 10, FUN = quantile, p = 0.75, na.rm = TRUE, align = "center")]
dat[, lower := frollapply(Value, n = 10, FUN = quantile, p = 0.25, na.rm = TRUE, align = "center")]
dat[, iqr := frollapply(Value, n = 10, FUN = IQR, na.rm = TRUE, align = "center")]
# a point is a local outlier if it lies more than 3 local IQRs outside the local quartiles
dat[Value < lower - iqr * 3 | Value > upper + iqr * 3]

              DateTime Value upper  lower    iqr
1: 1993-06-06 21:00:00 21637   511 317.25 193.75
2: 1994-08-25 14:00:00 19590     0   0.00   0.00
3: 1994-08-26 14:00:00 21659     0   0.00   0.00

# set the local outliers to NA ...
dat[Value < lower - iqr * 3 | Value > upper + iqr * 3, Value := NA_integer_]

# ... then impute them (and the pre-existing NAs) with a centered rolling mean
dat[, mean := frollmean(Value, n = 10, na.rm = TRUE, align = "center")]
dat[, Value := fcoalesce(Value, mean)]
# re-check: no local outliers remain
dat[Value < lower - iqr * 3 | Value > upper + iqr * 3]

plot(dat$Value)

[plot of the cleaned Value series]

  • This works great! Can you add some explanation to your code so that people who are new to outlier detection (and data.table) can follow? Thank you!
    – Tung
    Commented Oct 6, 2023 at 5:46

You can use functions from the {forecast} package to identify time series outliers with and without seasonality; for non-seasonal series it uses Friedman's super smoother. It can also replace outliers and missing values in place with expected values interpolated from the other data points. It starts by removing trend and seasonality, so the remaining component is the error term; if an observed error value is more than three IQRs below the first quartile or above the third quartile, it is considered an outlier.

Read more at Hyndman (2021) "Detecting time series outliers" https://robjhyndman.com/hyndsight/tsoutliers/.

library(dplyr)
library(forecast)
dat %>%
  mutate(
    outlier = row_number() %in% tsoutliers(Value)$index,  # flag detected outliers
    replacement = tsclean(Value)  # series with outliers and missing values replaced
  )
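
A quick way to inspect what gets flagged (a sketch; exact results can vary across forecast versions):

out <- tsoutliers(dat$Value)
out$index         # positions of the detected outliers
out$replacements  # suggested replacement values for those positions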
  • your solution removed 3 outliers, but it also removed some high values at the end of the time series. Can you please check? Thank you!
    – Tung
    Commented Oct 17, 2023 at 4:46
  • Yes. I removed the lambda = "auto" argument, and it identified fewer outliers.
    – DrJerryTAO
    Commented Oct 18, 2023 at 0:10
  • I contacted Rob Hyndman and he suggested using Tukey's (Running Median) Smoothing instead
    – Tung
    Commented Oct 19, 2023 at 16:03
  • The choice depends on what you consider outliers and on their occurrence pattern. In your case the four points you consider outliers are sporadic, so a running median will easily identify them. If several outliers are consecutive, representing a transient shift, however, some of the running-median variants may have difficulty recognizing the points in the middle as outliers.
    – DrJerryTAO
    Commented Jan 24 at 6:44

You could use a Hampel filter, which is an efficient signal-processing filter for removing outliers.

This filter is implemented in the pracma package; see pracma::hampel.

With the provided example:

library(pracma)

# indices of the non-NA values
ValInd <- which(!is.na(dat$Value))

# find outlier indices using the Hampel filter
OutlierInd <- ValInd[pracma::hampel(dat$Value[ValInd], k = 6)$ind]

dat$Value[OutlierInd]
#> [1] 21637 19590 21659

# remove the outliers
dat$Value[OutlierInd] <- NA

plot(dat$Value)

Note that the result depends on the observation window length; in this case k = 6 gives the expected result.
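
For intuition, here is a minimal hand-rolled sketch of the Hampel rule (my illustration, not pracma's implementation; points within k positions of either end are simply skipped, and the input is assumed NA-free): a point is an outlier if it lies more than t0 scaled MADs away from the median of its window.

# minimal sketch of the Hampel rule on an NA-free vector
hampel_sketch <- function(x, k = 6, t0 = 3) {
  n <- length(x)
  out <- integer(0)
  for (i in (k + 1):(n - k)) {
    win <- x[(i - k):(i + k)]          # window of k points on each side
    med <- median(win)
    s <- 1.4826 * median(abs(win - med))  # MAD scaled to estimate the sd
    if (abs(x[i] - med) > t0 * s) out <- c(out, i)
  }
  out
}
# e.g. hampel_sketch(dat$Value[!is.na(dat$Value)])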

Performance comparison:

Hampel <- function() {
  ValInd <- which(!is.na(dat$Value))
  OutlierInd <- ValInd[pracma::hampel(dat$Value[ValInd], k = 6)$ind]
  dat$Value[OutlierInd] <- NA
}

dt <- function() {
  setDT(dat)
  dat[, upper := frollapply(Value, n = 10, FUN = quantile, p = 0.75, na.rm = TRUE, align = "center")]
  dat[, lower := frollapply(Value, n = 10, FUN = quantile, p = 0.25, na.rm = TRUE, align = "center")]
  dat[, iqr := frollapply(Value, n = 10, FUN = IQR, na.rm = TRUE, align = "center")]
  # dat[Value < lower - iqr * 3 | Value > upper + iqr * 3]
  dat[Value < lower - iqr * 3 | Value > upper + iqr * 3, Value := NA_integer_]
  dat[, mean := frollmean(Value, n = 10, na.rm = TRUE, align = "center")]
  # dat[, Value := fcoalesce(Value, mean)]
  # dat[Value < lower - iqr * 3 | Value > upper + iqr * 3]
}

Forecast <- function() {
  dat %>%
    mutate(
      outlier = row_number() %in% tsoutliers(
        Value, lambda = "auto")$index, 
      replacement = tsclean(Value, lambda = "auto")
    )
}

microbenchmark::microbenchmark(Hampel(),dt(),Forecast())

# Unit: milliseconds
#        expr      min        lq       mean   median        uq      max neval
#    Hampel()   3.4124   3.62625   4.568761   3.8711   4.35435  14.1761   100
#        dt()  23.6843  24.78875  27.691938  25.5765  31.51740  40.2852   100
#  Forecast() 214.8229 222.16470 230.230439 227.2018 232.95140 340.5815   100
  • simple, straightforward, concise and effective, cool! +1! Commented Oct 12, 2023 at 18:06
  • A fair comparison should avoid the print commands in dt() and the imputation steps in dt() and Forecast()
    – Abraham JA
    Commented Oct 16, 2023 at 19:12
  • @ShadowJA, removed the print commands & imputation in dt()
    – Waldi
    Commented Oct 16, 2023 at 19:42
  • Thanks @Waldi! Your solution is significantly faster than the other solutions when applied to my real dataset (500K+ rows). Btw, for Forecast(), if you run dat$Value = tsclean(dat$Value, lambda = 'auto'), it will take only 1/4 of the time shown in the benchmark table.
    – Tung
    Commented Oct 17, 2023 at 4:57

Dr. Rob Hyndman suggested using Tukey's (Running Median) Smoothing:

y <- ts(na.omit(dat$Value))
plot(y, type = "b")
smoothy <- smooth(y)   # Tukey's running-median smoother
lines(smoothy, col = "red")
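
To turn the smooth into an explicit outlier flag, one option (my assumption, not part of the suggestion) is to threshold the residuals from the smoothed series:

# flag points that deviate strongly from the running-median smooth
resid <- as.numeric(y) - as.numeric(smoothy)
which(abs(resid) > quantile(abs(resid), 0.975))  # the cutoff is a tunable choice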

In the tidyverse framework:

smooth2 <- function(y) {
  miss <- is.na(y)
  smoothy <- y
  smoothy[!miss] <- smooth(y[!miss])  # smooth only the non-NA values
  return(smoothy)
}
dat |>
  mutate(smooth_val = smooth2(Value))

I think the existing solutions by @Waldi (personally, my favorite), @Shadow JA and @DrJerryTAO are all excellent for detecting the outliers.

Idea: MA-based approach

There are actually many ways to detect outliers. A moving-average (MA) approach is another possible solution, where the local mean and standard deviation of the previous n consecutive samples are used to check the current sample.

A few things before running:

  1. There are multiple options for the MA, e.g., simple MA, cumulative MA, weighted MA, exponential MA and so on, which may result in different outlier detection criteria.
  2. The code example below applies a "simple MA", i.e., equi-weighted. This works fine for time series with slow, small variations, but may overreact to sudden, big changes and label them as outliers.

Code Example

Below is one base R implementation as an example (there are likely mature packages that do this more elegantly and efficiently):

DetectOutliers <- function(v, n = 3) {
    helper <- function(x) {
        # rolling mean and rolling mean of squares over the previous n samples
        wts <- rep(1, n) / n
        m <- stats::filter(x, wts, sides = 1)
        m2 <- stats::filter(x^2, wts, sides = 1)
        sd <- sqrt(m2 - m^2)  # rolling standard deviation
        # flag a point if it is 3+ local sds from the preceding window's mean
        idx <- abs(x - c(NA, head(m, -1))) / c(NA, head(sd, -1)) >= 3
        replace(idx, is.na(idx), FALSE)
    }
    x <- na.omit(v)
    # a point survives only if neither a forward nor a backward pass flags it
    k <- !(helper(x) | rev(helper(rev(x))))
    # map the flags back to the positions of the non-NA values in v
    idx <- replace(rep(FALSE, length(v)), which(!is.na(v))[!k], TRUE)
    ifelse(is.na(v), NA, idx)
}

Output

library(dplyr)
library(purrr)

dat %>%
    dplyr::filter(!DetectOutliers(Value)) %>%  # dplyr::, since stats::filter is also in scope
    select(Value) %>%
    pluck(1) %>%
    plot()

shows something like the plot below

[plot of Value with the MA-detected outliers removed]

  • your code removed some high values at the end of the time series. Can you double-check? Thanks!
    – Tung
    Commented Oct 17, 2023 at 4:48
  • @Tung that is because of the use of the mean instead of the median. As I stated in the answer, a moving average may overreact to changes. Commented Oct 17, 2023 at 11:09

There are a couple of steps to follow (a rough code sketch of the regression steps comes after the list):

  1. delete the NaN values
  2. decide how to handle the zeros
  3. explore the data and subset the dataset
  4. construct a baseline (where values start to increase and where they start to decrease)
  5. exclude the baseline
  6. fit a linear regression on the subset of data without the baseline
  7. construct a confidence interval around the linear fit
  8. treat as outliers everything outside the range linear_regression +- confidence_interval
  9. you can always play around with the type of linear regression, how to set up the baseline, and how to set up the confidence interval
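
A rough sketch of steps 6 to 8 under stated assumptions (a plain lm() against the observation index, a 95% prediction interval, NA rows dropped; the baseline construction of steps 4 and 5 is left out):

# fit a linear trend and flag points outside its 95% prediction interval
d <- dat[!is.na(dat$Value), ]
d$idx <- seq_len(nrow(d))
fit <- lm(Value ~ idx, data = d)
pred <- predict(fit, interval = "prediction", level = 0.95)
outside <- d$Value < pred[, "lwr"] | d$Value > pred[, "upr"]
d[outside, ]  # candidate outliers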
