In our blog series on „Time Traveling with Data Science“, we previously introduced different tasks in time series analysis. In this blog post, we now present the task of Outlier Detection.
Outliers are data so different from others that one may suspect that the system that generated them has undergone an interesting event.
On the one hand, outliers may be due to a failure of the measurement process and therefore be considered as noise that needs to be eliminated (these are usually called anomalies). On the other hand, outliers may represent genuine events that need to be analyzed in more detail (these are usually referred to as novelties).
For more about time series analysis, see our blog series:
Not all detection methods can detect the same type of outliers. It is first necessary to specify (i.e., to model) the expected normal behavior of the system and the shape of the data in the „normal“ case. It is also important to specify in which context the normal behavior is defined. Global outliers, also called extreme values, are observations whose values differ from all others on a global scale. Local outliers differ from other observations only in a given (local) context. In the context of time series, the time dimension also plays a role in categorizing outliers. Outliers can be single points in time, a sub-sequence (a contiguous group of observations), or the entire time series. Such a categorization allows refining the type of detection method to be applied.
Figure 1: Left: a signal (blue) with global outliers (red), right: a signal (blue) with local outliers (red)
In principle, once the assumptions about the context and the normal behavior are available, an outlier detection algorithm calculates a so-called outlier score for a given object, which represents the deviation of that object from normal behavior. Based on this score, a decision is made on whether a given observation is considered an outlier or not. When existing examples of outliers (i.e., labels) are available, outlier analysis can be solved using classification algorithms (using 2 classes: outliers, inliers) with the constraint that the number of outliers is relatively small compared to the number of inliers (this is referred to as an unbalanced problem). In the absence of existing labeled data, outlier analysis methods are based on unsupervised learning approaches.
Figure 2: Working principle
A very simple outlier detection algorithm for time series consists of calculating the (normalized) distance of a given point from the mean of the points located in a local time interval (the so-called z-score in a moving window). This algorithm targets 1D single point outliers and assumes that 1) only the local context matters, 2) the normal behavior of the data points should follow a Gaussian distribution, and 3) the order of the data points in the given time interval does not matter (i.e., older data points have the same weight as more recent data points).
Figure 3: Sliding zscore, every point too far away from the mean (orange) is marked as an outlier (red)
This type of algorithm is usually referred to as parametric because the assumption underlying the expected normal behavior is given by a predefined probability distribution. Other algorithms rely on other assumptions; for example, that outliers are likely to be found in empty regions of the data space while normal points are more likely to be found in dense regions, or that outliers may respond differently to data transformations (such as a PCA or autoencoder) and can be detected by examining their behavior when the data is transformed. The model underlying the expected normal behavior does not always have to be specified manually and some algorithms rely on machine learning to extract it from the data.
In addition to dealing with a multitude of outlier detection algorithms, data scientists face other problems when implementing projects that require the use of such methods. First of all, the definition of an outlier itself changes from application to application and the difficulty lies in extracting and using the domain knowledge in order to define the boundary between a normal observation and a suspicious one. Moreover, outliers are rare by definition, which makes them difficult to study in the first place. Finally, all the methods developed in the field of outlier analysis are not adapted to time series, and even less so when applied in a continuous real-time environment.
If you are interested in working on projects involving time series analysis, please feel free to contact our expert Dr. Julien Siebert.
Guest author: Maryam Arabshahi
Researcher at DFKI Kaiserslautern in the „Intelligent Networks“ department.
Maryam obtained her Master’s degree in Computer Science at TU Kaiserslautern with a specialization in Artificial intelligence and Data Science on the topic of interactive model selection methods for outlier detection in time series.
C. C. Aggarwal, Outlier Analysis. Springer International Publishing, 2017.
H. Wang, M. J. Bah, and M. Hammad, Progress in outlier detection techniques: A survey,“ IEEE Access, vol. 7, pp. 107964–108000, 2019.