07-12, 11:25–11:55 (US/Pacific), Ballroom
Traditional time series analysis techniques have found success in a variety of data mining tasks. However, they often require years of experience to master and the recent development of straightforward, easy-to-use analysis tools has been lacking. We address these needs with STUMPY, a scientific Python library that implements a novel yet intuitive approach for discovering patterns, anomalies, and other insights from any time series data. This presentation will cover the necessary background needed to follow the live interactive demo, requires no prior experience, and promises a simple, powerful, and scalable time series analysis package that will complement your current toolset.
Numerous classical methods exist for understanding and analyzing time series data, such as data visualization, summary statistics, ARIMA modeling, Markov modeling, and machine learning. However, when a data practitioner is presented with new or unfamiliar time series data, many of the aforementioned approaches often fail to uncover any significant pattern, anomaly, or conserved behavior since it isn’t known, a priori, whether or not an interesting insight even exists. A naive but straightforward approach (covered in detail here https://stumpy.readthedocs.io/en/latest/Tutorial_The_Matrix_Profile.html) could involve computing the Euclidean distance between every pair of subsequences within the time series in order to identify subsequences that are either significantly common or exceptionally rare. This seems intuitive at first and it provides an exact solution to our problem but, as the size of the dataset increases (>10,000 data points), this brute force search can quickly become computationally intractable, which reveals why approximate or less interpretable solutions (above) have prevailed. Recently, independent research conducted at UC Riverside (https://www.cs.ucr.edu/~eamonn/MatrixProfile.html) has spawned a collection of brand new ideas and developed scalable algorithms that address this problem. However, the knowledge and capabilities that have been transferred to the scientific Python community have been limited.
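To make the naive pairwise comparison described above concrete, here is a minimal pure-Python sketch. It is not STUMPY's implementation: the function name `naive_matrix_profile`, the window length `m`, and the exclusion zone of `m // 2` positions (used to skip trivially overlapping matches) are assumptions made for illustration, and it uses the plain Euclidean distance without any normalization.

```python
import math


def naive_matrix_profile(ts, m):
    """Brute-force nearest-neighbor distance for every length-m subsequence.

    Hypothetical helper for illustration only. The nested loops make the
    cost grow quadratically with the length of the time series.
    """
    n_subs = len(ts) - m + 1
    if m < 1 or n_subs < 2:
        raise ValueError("need at least two subsequences of length m")

    # Assumption: ignore neighbors that overlap heavily with subsequence i,
    # otherwise every subsequence would trivially match itself.
    excl = m // 2

    profile = [math.inf] * n_subs
    for i in range(n_subs):
        for j in range(n_subs):
            if abs(i - j) <= excl:
                continue
            # Plain Euclidean distance between subsequences i and j.
            d = math.sqrt(sum((ts[i + k] - ts[j + k]) ** 2 for k in range(m)))
            if d < profile[i]:
                profile[i] = d
    return profile


# Small example: the shape [0, 1, 0] repeats, while the spike to 7 is unusual.
ts = [0, 1, 0, 0, 0, 7, 0, 0, 1, 0, 0]
profile = naive_matrix_profile(ts, m=3)
```

In the resulting list, small values mark subsequences that recur elsewhere in the series (candidate patterns) and large values mark subsequences with no close match (candidate anomalies). The two nested loops over subsequences are what make this approach impractical for long time series.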
In this talk, we will introduce STUMPY, a powerful, scalable, and modern Python library that can be used for exploratory data analysis, pattern discovery, anomaly detection, finding shapelets, semantic segmentation, time series chains, and much more. The only requirement for understanding this talk is a basic familiarity with the Pythagorean theorem. If you work with time series data then this is the talk for you! We begin by clearly defining our time series analysis objective and reviewing the common approaches typically used for achieving this goal. Next, we’ll delve into the foundational research that underlies STUMPY’s core algorithms and then we’ll highlight its ease of use and high-performance scalability with respect to known benchmarks. This is followed by a quick review of the available online tutorials and other relevant resources and, finally, we’ll conclude with a live interactive demo along with a showcase of illustrative examples.
The STUMPY library, open sourced under the BSD-3 license, was created and developed by Sean M. Law. The source code is hosted on GitHub (https://github.com/TDAmeritrade/stumpy) with 7M+ combined downloads/installs between PyPI and conda-forge and 2,900+ GitHub stars. Developed with scientific Python users in mind, this library only depends on NumPy, SciPy, and Numba. STUMPY is highly performant (https://stumpy.readthedocs.io/en/latest/#performance) and has been tested on time series data with over 100 million data points using 256 distributed CPU cores or scaled to as many as 16 NVIDIA GPUs. We use GitHub Actions for continuous integration testing and ensure user confidence with 100% test coverage. STUMPY’s API documentation and tutorials are hosted by readthedocs.org (https://stumpy.readthedocs.io/en/latest/index.html) and pre-installed computing environments are available for exploration via mybinder.org (https://mybinder.org/v2/gh/TDAmeritrade/stumpy/main?filepath=notebooks). This work was featured at PyData Global (https://www.youtube.com/watch?v=T9_z7EpA8QM&) and published in “S. M. Law, (2019). STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining. Journal of Open Source Software, 4(39), 1504, https://doi.org/10.21105/joss.01504”.
I am a Principal Data Scientist currently working with a multi-talented R&D team at a Fortune 500 finance firm. I have experience producing cutting-edge methodologies, building high-performance predictive models, and developing rapid prototypes, and I am an inventor on several finance-related patents.
Additionally, I co-organize the monthly PyData Ann Arbor data science meetup and I am also the creator and core maintainer of STUMPY, a powerful and scalable open source Python package for modern time series analysis.