From 818512d29a3ae98df4ad1c86726b0d6a1c5e1cfe Mon Sep 17 00:00:00 2001
From: Matthew Middlehurst
Date: Tue, 5 Sep 2023 15:16:44 +0100
Subject: [PATCH 1/6] SS start

---
 aep/01_similarity_search.md | 56 +++++++++++++++++++++++++++++++++++++
 1 file changed, 56 insertions(+)
 create mode 100644 aep/01_similarity_search.md

diff --git a/aep/01_similarity_search.md b/aep/01_similarity_search.md
new file mode 100644
index 0000000..fc89540
--- /dev/null
+++ b/aep/01_similarity_search.md
@@ -0,0 +1,56 @@
+# Time Series Similarity Search Module
+
+## Overview
+
+This AEP introduces a new module and base cass for time series similarity search.
+
+## Problem Statement and Use Cases
+
+At its simplest, similarity search is the task of finding the closest subseries in
+series X to given series q (<= length of X) given a similarity measure.
+
+TODO: use cases and examples
+
+## Implementation
+
+- new package
+- new base class
+- example subclass
+
+## Examples code/structure (if applicable)
+
+Base class:
+
+- BaseSimilaritySearch
+    - methods:
+        - \_\_init\_\_(distance, normalise?):
+            - takes distance: function, default = euclidean.
+        - fit(X):
+            - takes X: a single/multiple univariate/multivariate series (internal type?) tbc
+            - returns self
+        - predict(q):
+            - takes q: a single/multiple univariate/multivariate series (internal type?) tbc
+            - iterate over X, find closest k matches, some abstract method to do the iteration
+            - returns indexes of closest k /distances/series?
+
+    - abstract:
+        - \_fit()
+        - \_predict()
+        - \_iterator()
+
+Subclasses:
+    Optimisations:
+        early abandon?
+        Distance pruning
+
+## Considerations and Alternatives
+
+TODO
+
+## Discussion
+
+TODO
+
+## References
+
+TODO
\ No newline at end of file

From fb3f87b88a96fd066d1c4340f98e2471e3434aae Mon Sep 17 00:00:00 2001
From: Matthew Middlehurst
Date: Tue, 5 Sep 2023 15:24:28 +0100
Subject: [PATCH 2/6] contributors

---
 aep/01_similarity_search.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/aep/01_similarity_search.md b/aep/01_similarity_search.md
index fc89540..d6be6ce 100644
--- a/aep/01_similarity_search.md
+++ b/aep/01_similarity_search.md
@@ -1,5 +1,7 @@
 # Time Series Similarity Search Module
 
+Contributors: @MatthewMiddlehurst @TonyBagnall @baraline @hadifawaz1999
+
 ## Overview
 
 This AEP introduces a new module and base cass for time series similarity search.

From e017500c9dca60352f1abeedad57920974d7d6f8 Mon Sep 17 00:00:00 2001
From: Tony Bagnall
Date: Tue, 24 Oct 2023 13:34:06 +0100
Subject: [PATCH 3/6] Update 01_similarity_search.md

---
 aep/01_similarity_search.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/aep/01_similarity_search.md b/aep/01_similarity_search.md
index d6be6ce..1b47c82 100644
--- a/aep/01_similarity_search.md
+++ b/aep/01_similarity_search.md
@@ -32,13 +32,13 @@ Base class:
             - returns self
         - predict(q):
             - takes q: a single/multiple univariate/multivariate series (internal type?) tbc
-            - iterate over X, find closest k matches, some abstract method to do the iteration
+            - iterate over X, find closest k matches, *some abstract method to do the iteration)
             - returns indexes of closest k /distances/series?
 
     - abstract:
         - \_fit()
         - \_predict()
-        - \_iterator()
+        - \_iterator() maybe?
 
 Subclasses:
     Optimisations:
@@ -47,7 +47,7 @@ Subclasses:
 
 ## Considerations and Alternatives
 
-TODO
+do we even need a base class? Maybe just have a suite of functions?
 
 ## Discussion
 

From 98a85d4059fd6324e894de11f7cbe48942d0664b Mon Sep 17 00:00:00 2001
From: Antoine Guillaume
Date: Wed, 25 Oct 2023 15:32:59 +0200
Subject: [PATCH 4/6] more detail on problem statement

---
 aep/01_similarity_search.md | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/aep/01_similarity_search.md b/aep/01_similarity_search.md
index 1b47c82..721a090 100644
--- a/aep/01_similarity_search.md
+++ b/aep/01_similarity_search.md
@@ -11,7 +11,22 @@ This AEP introduces a new module and base cass for time series similarity search
 At its simplest, similarity search is the task of finding the closest subseries in
 series X to given series q (<= length of X) given a similarity measure.
 
-TODO: use cases and examples
+To obtain the most similar subserie, a distance vector, often called distance
+profile, must be computed. This distance profile will store the similarity
+between the query and all candidate subseries. In this context, the most
+similar subseries will be the one with maximize the similarity measure ( or
+minimize the distance or dissimilarity).
+
+The main research challenge for the distance profiles is the computational
+complexity. Most of the contributions in the litterature are aimed toward
+optimizing the (di)similarity functions (e.g. Mueen algortihm for the
+normalized euclidean distance), proposing lower bounds to prune subseries,
+or approximations methods.
+
+In terms of use cases, distance profiles are the base component of the matrix
+profile.
+
+TODO: add more use cases and references
 
 ## Implementation
 
@@ -55,4 +70,4 @@ TODO
 
 ## References
 
-TODO
\ No newline at end of file
+TODO

From 2b8c8e9d520ea17ead3d205b273d6cd98bd0a7da Mon Sep 17 00:00:00 2001
From: Antoine Guillaume
Date: Wed, 25 Oct 2023 16:29:26 +0200
Subject: [PATCH 5/6] Complete problem statement and implementation

---
 aep/01_similarity_search.md | 73 ++++++++++++++++++++++++++-----------
 1 file changed, 51 insertions(+), 22 deletions(-)

diff --git a/aep/01_similarity_search.md b/aep/01_similarity_search.md
index 721a090..cfab371 100644
--- a/aep/01_similarity_search.md
+++ b/aep/01_similarity_search.md
@@ -24,15 +24,29 @@ normalized euclidean distance), proposing lower bounds to prune subseries,
 or approximations methods.
 
 In terms of use cases, distance profiles are the base component of the matrix
-profile.
+profile methods[1]. They are also used in shapelets methods, where the
+shapelet is viewed as the query), convolutional kernels (e.g. Rocket) or
+more generaly, any method using a sliding window to compute a function between
+a fixed query (a kernel, a shapelet, ...) and a (collection of) time series.
 
-TODO: add more use cases and references
+Altough, the computational optimizations used in the similarity search
+litterature have not been widely adopted or explored in other the less
+"obvious" contexts such as shapelets.
 
 ## Implementation
 
-- new package
-- new base class
-- example subclass
+The current implementation is designed around a new module named
+`similarity_search`, which contains `BaseSimilaritySearch`, a
+base class for all applications that use a distance profile to
+extract a set of subseries. One subclass is `TopKSimilaritySearch`,
+which given a collection of time series and a query, will return
+the most `k` similar subseries given a distance function.
+
+A submodule named `distance_profiles` contains the methods used
+to compute distance profiles for different distance functions.
+One additional goal of this submodule would be to provide optimized
+methods to compute distance profiles for other tasks that do not fit
+the similarity search module (e.g. shapelets).
 
 ## Examples code/structure (if applicable)
 
@@ -40,34 +54,49 @@ Base class:
 
 - BaseSimilaritySearch
     - methods:
-        - \_\_init\_\_(distance, normalise?):
-            - takes distance: function, default = euclidean.
+        - \_\_init\_\_(distance, normalise, store_distance_profile):
+            - takes distance: function, default = euclidean.
+            - wheter to use a z-normalized distance
+            - if the distance profiles should be stored after calling predict
         - fit(X):
-            - takes X: a single/multiple univariate/multivariate series (internal type?) tbc
+            - takes X: a 3D array : collection of multivariate series. How other
+              modules handle the case where we have a 2D array as input ? (i.e.
+              is it a collection of univariate series or a multivariate series)
+            - fetch the distance profile function linked to the distance and normalize parameters.
             - returns self
-        - predict(q):
-            - takes q: a single/multiple univariate/multivariate series (internal type?) tbc
-            - iterate over X, find closest k matches, *some abstract method to do the iteration)
-            - returns indexes of closest k /distances/series?
-
+        - predict(q, q_index=None, exclusion_factor=2.0):
+            - takes q: a single multivariate series (internal type?) tbc
+            - q_index to as a tuple (id_sample, id_timestamp) to specify if q was extracted from X
+            - Initiate the boolean mask of same shape as X given to the child classes and the
+              distance profile function, which store the part of the distance profile that should
+              not be computed. This can be used to indicate where the query was sampled in X, but
+              also during lower bound pruning to indicate which part are prunned.
+            - exlucsion_factor specify the area around q_index (+/- l//exclusion_factor) that should
+              be excluded from the distance profile computation and left to np.inf.
+            - Compute the means and standard deviations of the subseries in X if normalize was True.
+
     - abstract:
-        - \_fit()
+        - \_fit() :
         - \_predict()
         - \_iterator() maybe?
 
-Subclasses:
-    Optimisations:
-        early abandon?
-        Distance pruning
-
+For the `distance_profile` submodule, we can distinguish different type of optimizations:
+    - Direct distance function optimization (e.g. Mueen)
+    - Early abandon of distance computation (e.g. EA-DTW)
+    - Lower bound pruning (e.g. Keogh LB for DTW)
+
 ## Considerations and Alternatives
 
-do we even need a base class? Maybe just have a suite of functions?
+- do we even need a base class? Maybe just have a suite of functions?
 
 ## Discussion
 
-TODO
+I think base class can be useful to define the common code between some similarity search
+use cases. For example, TopK search or threshold search (i.e. all subseries with a distance
+bellow a threshold are returned). We might need to refine it when we extend the scope of
+the module (e.g. matrix profile), as I don't think the current `BaseSimilaritySearch` class
+would fit all applications.
 
 ## References
 
-TODO
+[1] https://www.cs.ucr.edu/~eamonn/MatrixProfile.html

From 57e5a838dd4708f56672fcb5452a4652985157b4 Mon Sep 17 00:00:00 2001
From: Antoine Guillaume
Date: Tue, 4 Jun 2024 20:46:37 +0200
Subject: [PATCH 6/6] add changes from isssue #1243

---
 aep/01_similarity_search.md | 100 +++++++++++++++---------------------
 1 file changed, 40 insertions(+), 60 deletions(-)

diff --git a/aep/01_similarity_search.md b/aep/01_similarity_search.md
index cfab371..ec9e74d 100644
--- a/aep/01_similarity_search.md
+++ b/aep/01_similarity_search.md
@@ -33,69 +33,49 @@ Altough, the computational optimizations used in the similarity search
 litterature have not been widely adopted or explored in other the less
 "obvious" contexts such as shapelets.
 
-## Implementation
-
-The current implementation is designed around a new module named
-`similarity_search`, which contains `BaseSimilaritySearch`, a
-base class for all applications that use a distance profile to
-extract a set of subseries. One subclass is `TopKSimilaritySearch`,
-which given a collection of time series and a query, will return
-the most `k` similar subseries given a distance function.
-
-A submodule named `distance_profiles` contains the methods used
-to compute distance profiles for different distance functions.
-One additional goal of this submodule would be to provide optimized
-methods to compute distance profiles for other tasks that do not fit
-the similarity search module (e.g. shapelets).
-
-## Examples code/structure (if applicable)
-
-Base class:
-
-- BaseSimilaritySearch
-    - methods:
-        - \_\_init\_\_(distance, normalise, store_distance_profile):
-            - takes distance: function, default = euclidean.
-            - wheter to use a z-normalized distance
-            - if the distance profiles should be stored after calling predict
-        - fit(X):
-            - takes X: a 3D array : collection of multivariate series. How other
-              modules handle the case where we have a 2D array as input ? (i.e.
-              is it a collection of univariate series or a multivariate series)
-            - fetch the distance profile function linked to the distance and normalize parameters.
-            - returns self
-        - predict(q, q_index=None, exclusion_factor=2.0):
-            - takes q: a single multivariate series (internal type?) tbc
-            - q_index to as a tuple (id_sample, id_timestamp) to specify if q was extracted from X
-            - Initiate the boolean mask of same shape as X given to the child classes and the
-              distance profile function, which store the part of the distance profile that should
-              not be computed. This can be used to indicate where the query was sampled in X, but
-              also during lower bound pruning to indicate which part are prunned.
-            - exlucsion_factor specify the area around q_index (+/- l//exclusion_factor) that should
-              be excluded from the distance profile computation and left to np.inf.
-            - Compute the means and standard deviations of the subseries in X if normalize was True.
-
-    - abstract:
-        - \_fit() :
-        - \_predict()
-        - \_iterator() maybe?
-
-For the `distance_profile` submodule, we can distinguish different type of optimizations:
-    - Direct distance function optimization (e.g. Mueen)
-    - Early abandon of distance computation (e.g. EA-DTW)
-    - Lower bound pruning (e.g. Keogh LB for DTW)
-
-## Considerations and Alternatives
-
-- do we even need a base class? Maybe just have a suite of functions?
+#
+## Module structure :
+```
+- aeon/
+|---- similarity_search/
+|-------- BaseSimilaritySearch.py
+|-------- query_search/
+|------------ BaseQuerySearch.py
+|-------- series_search/
+|------------ BaseSeriesSearch.py
+|-------- index_search/
+|------------ BaseIndexSearch.py
+```
+
+- Query search : Given a query Q and a series/collection X, evaluate the similarity between Q and each admissible candidate in X.
+- Series search : Given a length parameter (for now, we add techniques that don't require it later) do a query search for all admissible queries in a series/collection X. In the naive case, this is simply a broadcasting of query search, but more optimized algorithms exists for this case (e.g. STUMP/STOMP for Euclidean distance)
+- Index search : Given a series/collection X and a length parameter (again, for now), build an indexing (e.g.what the Faiss library does) of all admissible candidates in X. Then, this indexing can be used as an estimator to answer query search tasks. This is generally used when you have a frozen historical set and want fast answers for new queries / when the inputs do not fit in memory.
+
+## Expected data and internal input conversion :
+
+We could accept series/collection in numpy and series in pd.Series data as input, but we would ideally convert all of it to numpy collection, and make use of the axis argument we introduced in other modules to avoid the channel problem:
+
+For query search, we would implement heavy computation numba functions in a series case and loop over it with the collection. This can for example allow passing down (between series of a collection X) best-so-far values when doing early abandon or pruning with lower bounds.
+
+- The same reasoning apply to series search.
+- The indexing case generally work with out-of-memory data, and require updates after loading new parts of the dataset, which are generally made of collections/big series.
+
+## Interaction with other modules:
+
+We would still use the distance module for the naive search cases, which are the one without speed-ups (which can lead to exact or approximative results). It would also be nice to offer some visualisation through the visualisation module.
+As for the reverse, which is which aeon estimator could benefit from the similarity search module speed-ups, it is still to be explored.
+
+## Documentation and notebooks:
+
+I would like to continue to have 3 type of notebooks for similarity search :
+
+- A benchmark notebook, which would show the effect of different speed-up option, and would help us determine which one to put as "default" speed-ups for a given distance function.
+- An example notebook, which shows how to use the estimators on some toy cases and how to visualize the results
+- A more theoretical one, which would explain the maths behind the different speed-ups and the base tasks.
 
 ## Discussion
 
-I think base class can be useful to define the common code between some similarity search
-use cases. For example, TopK search or threshold search (i.e. all subseries with a distance
-bellow a threshold are returned). We might need to refine it when we extend the scope of
-the module (e.g. matrix profile), as I don't think the current `BaseSimilaritySearch` class
-would fit all applications.
+Open to discussion
 
 ## References
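
The query search task described in these patches (compute a distance profile between a query and every admissible candidate, then keep the closest matches) can be illustrated with a minimal NumPy sketch. This is not the aeon implementation: the function names `naive_distance_profile` and `top_k_matches` are hypothetical, the distance is a plain squared Euclidean distance without z-normalisation, and the `exclusion` argument is a fixed half-width in timepoints, only loosely analogous to the `exclusion_factor` idea from the earlier draft.

```python
import numpy as np


def naive_distance_profile(X, q):
    """Squared Euclidean distance between q and every subseries of X.

    Hypothetical helper, not an aeon function.
    X has shape (n_channels, n_timepoints); q has shape (n_channels, length).
    """
    n_timepoints = X.shape[1]
    length = q.shape[1]
    n_candidates = n_timepoints - length + 1
    profile = np.empty(n_candidates)
    for i in range(n_candidates):
        diff = X[:, i : i + length] - q
        profile[i] = np.sum(diff * diff)
    return profile


def top_k_matches(profile, k, exclusion=0):
    """Greedily pick the k smallest profile entries.

    After each pick, candidates within `exclusion` positions of it are
    masked with np.inf so that near-duplicate matches are skipped.
    """
    profile = profile.astype(float).copy()
    matches = []
    for _ in range(k):
        best = int(np.argmin(profile))
        if not np.isfinite(profile[best]):
            break  # every remaining candidate is masked
        matches.append(best)
        lo = max(0, best - exclusion)
        hi = min(len(profile), best + exclusion + 1)
        profile[lo:hi] = np.inf
    return matches


# Toy univariate series in which the query pattern occurs twice.
X = np.array([[0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 2.0, 3.0]])
q = np.array([[1.0, 2.0, 3.0]])
profile = naive_distance_profile(X, q)
print(top_k_matches(profile, k=2, exclusion=1))  # prints [1, 7]
```

The loop over candidates is the naive case; the speed-ups discussed in the proposal (optimised distance computation, early abandoning, lower-bound pruning) would replace or shortcut that loop while keeping the same profile-then-select structure.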