Time Traveling with Data Science: Pattern Recognition, Motifs Discovery and the Matrix Profile (Part 4)

In Part 4 of our Fraunhofer IESE blog series on „Time Traveling with Data Science“, we continue our journey in the field of time series analysis. In this blog post, our experts from Fraunhofer IESE and our guest author Markus Heidt present the fundamental task of finding similar patterns through pattern recognition. We also provide a short introduction to the most powerful algorithms on the market, the family of algorithms based on the Matrix Profile.

Guest Author: Markus Heidt

Markus Heidt is an award-winning data scientist with experience in applied AI Consulting and Finance. He is highly interested in time series, Natural Language Processing, DevOps, and AI use cases.

Feel free to contact him via LinkedIn.

The task: Finding similar patterns with pattern recognition

Identifying similar patterns is a fundamental task in time series analysis, which can be used for pattern recognition (finding patterns similar to a given predefined pattern), motifs discovery (finding similar patterns that occur frequently and are statistically significant), clustering (finding groups of similar patterns), or outlier analysis (finding patterns that are mostly dissimilar to all others; these are also called discords).

The application domain for such tasks is broad (if not universal). To pick a single example, imagine data coming from electrocardiograms. Here, detecting heart-beat-specific patterns such as ventricular fibrillation might be life-saving (Sansone et al. 2013).

In principle, identifying similar patterns (or subsequences) in a time series, which go by the nice and technical name of „time series subsequences all-pairs-similarity-search“ (TSAPSS), requires two things: first, a measure of similarity between two subsequences (or two patterns), and second, a way to efficiently search for similar subsequences in the time series.

The problem: Pattern recognition is costly

Why does pattern recognition need to be efficient? – Because the problem by itself requires the computation of a lot (an understatement!) of comparisons. Researchers in the field have stated that „as such the lack of progress on TSAPSS stems not from a lack of interest, but from the daunting nature of the problem“ (Yeh et al. 2017). To illustrate this point, consider a time series that is 1000 data points long. We wish to compute the similarity between all subsequences that are 10 points long. Doing so requires computing the similarity between the first subsequence (from point 1 to 10) and the second subsequence (from point 2 to 11), between the first and the third subsequence (from point 3 to 12), etc. until the 991st subsequence (from point 991 to 1000) – these are already 991 comparisons just for comparing the first subsequence with all the others. Then this process needs to be repeated for the second subsequence (comparing it with the third, the fourth, …, the 991st), for the third subsequence, for the fourth, etc., until all possible pairs of subsequences have been compared. For those who like maths, it would be a total of 991*990/2=490,545 comparisons. If the time series is now 10000 points long (and the patterns are still 10 points long), this becomes 49,905,045 comparisons (100 times as many). The increase is exponential.

So why bother? – It turns out that researchers have developed an algorithm based on something called the Matrix Profile, which drastically reduces (also an understatement) the computational time required to perform „time series subsequences all-pairs-similarity-search“.

The solution: The Matrix Profile

The underlying principles behind the Matrix Profile are delightfully simple yet powerful. And there are only two!

The first idea is to recognize that the distance between two subsequences can be transformed into a mathematical operation called a convolution. Why is this important? Because convolutions are operations that can be computed in a very efficient way using the Fast Fourier Transform (FFT) algorithm and its inverse (another ingenious and beautiful algorithm; for a visual introduction see this video).

The second idea is to recognize that it is not necessary to store all distances between all pairs of subsequences (e.g., the 490,545 comparisons as explained above). It is enough to store, for each subsequence, only the distance to the most similar subsequence and where this subsequence is located (for the example above, this boils down to storing the information about 991 subsequences).

The first results related to the Matrix Profile were first published in 2016 and were entitled: „Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets“ (Yeh 2016). As of 2022, the latest article from the same research group is entitled „Matrix Profile XXIV: Scaling Time Series Anomaly Detection to Trillions of Datapoints and Ultra-fast Arriving Data Streams“ (Lu 2022). This research shows that what was deemed impossible a couple of years ago is now within our reach (more about the research itself can be found here).

Talk is cheap, show me the code (Linus Torvalds)

The algorithms based on the Matrix Profile (there are several ones) have been implemented and made open source.

The Matrix Profile Foundation proposes implementations in Python, R, and GoLang (see https://matrixprofile.org/). Stumpy is a Python library dedicated to the efficient computation of Matrix-Profile-based algorithms (see https://stumpy.readthedocs.io/). The sktime library also proposes its own implementation of the Matrix Profile (see https://www.sktime.org/). Finally, the tslearn library proposes choosing between a numpy-based implementation and the one from stumpy, the latter being more efficient (see https://tslearn.readthedocs.io/).

We recommend the following stumpy tutorials:

Tutorial: The Matrix Profile and
Tutorial: Pattern Matching for a quick and hands-on introduction to the Matrix Profile.

For more about Time Series Analysis, see our blog series:

References

[1] Lu, Y., Wu, R., Mueen, A., Zuluaga, M. A., Keogh, E. (2022). Matrix Profile XXIV: Scaling Time Series Anomaly Detection to Trillions of Datapoints and Ultra-fast Arriving Data Streams. ACM SIGKDD 2022

[2] Sansone M, Fusco R, Pepino A, Sansone C. (2013). Electrocardiogram pattern recognition and analysis based on artificial neural networks and support vector machines: a review. J Healthc Eng. 2013;4(4):465-504. doi: 10.1260/2040-2295.4.4.465. PMID: 24287428.

[3] Yeh, C. C. M., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H. A., Furtado Silva, D., Mueen, A., Keogh E. (2016). „Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View That Includes Motifs, Discords and Shapelets“ 2016 IEEE 16th International Conference on Data Mining (ICDM), 2016, pp. 1317-1322, doi: 10.1109/ICDM.2016.0179

[4] Yeh, C. C. M., Zhu, Y., Ulanova, L., Begum, N., Ding, Y., Dau, H. A., Zimmerman, Z., Furtado Silva, D., Mueen, A., Keogh, E. (2018). Time series joins, motifs, discords and shapelets: a unifying view that exploits the matrix profile. Data Mining and Knowledge Discovery, 32(1), 83-123