Pipeline Optimization Strategies #1105

Open · JGreenlee opened this issue Jan 31, 2025 · 40 comments

@JGreenlee commented Jan 31, 2025

reducing DB queries (e-mission/e-mission-server#1014)

The low-hanging fruit is to:

  1. reduce redundant/unnecessary queries
    • I was able to find a couple of instances where data was queried when we already had it in memory
  2. where queries are necessary, refactor small, frequent queries into larger, infrequent queries

The changes implemented so far (as of 1/30) drastically reduce the number of DB queries during trip segmentation.
For a typical day of travel for a user (e.g. shankari_2015-07-22, shankari_2016-07-27), I measured about 4000 calls to _get_entries_for_timeseries during trip segmentation alone!

After the changes, this is cut down to about 50. Local pipeline runs are consistently, but only slightly, faster than before (about 10%).
I am hopeful that we will see more significant effects on stage/prod, where DB calls are more of a bottleneck.

polars

We (@TeachMeTW and I) suspect that another cause of slowness (which is more noticeable locally, where DB queries are not a bottleneck) is pandas.
polars is much faster, but pandas is used extensively in the pipeline, so it is not practical to change all usages at once. Switching to polars would need to happen gradually, or just in a few key places.

I wanted to run some tests to see if the performance benefit is worth the effort. I extended get_data_df to support either pandas or polars, thinking that we can support both for now and differentiate them with different variable names (example_df vs example_pldf):

builtin_timeseries.py

     @staticmethod
-    def to_data_df(key, entry_it, map_fn = None):
+    def to_data_df(key, entry_it, map_fn=None, use_polars=False):
         """
         Converts the specified iterator into a dataframe
         :param key: The key whose entries are in the iterator
@@ -292,9 +314,15 @@ class BuiltinTimeSeries(esta.TimeSeries):
         """
         if map_fn is None:
             map_fn = BuiltinTimeSeries._to_df_entry
+        
         # Dataframe doesn't like to work off an iterator - it wants everything in memory
-        df = pd.DataFrame([map_fn(e) for e in entry_it])
+        if use_polars:
+            import polars as pl
+            df = pl.DataFrame([map_fn(e) for e in entry_it])
+        else:
+            df = pd.DataFrame([map_fn(e) for e in entry_it])
         logging.debug("Found %s results" % len(df))
+
         if len(df) > 0:
             dedup_check_list = [item for item in ecwe.Entry.get_dedup_list(key)
                                 if item in df.columns] + ["metadata_write_ts"]

However, when I tested this on actual OpenPATH data, I ran into countless errors like this:

TypeError: unexpected value while building Series of type Float64; found value of type Int64: 1469620667

When types differ within a column, pandas automatically casts, but polars, in the name of performance, does not. I searched but could not find a flag to enable this behavior in polars, so there is no way to use it without patching the incongruent types in the data.
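As an illustration of the difference, here is a minimal sketch (the exact behavior may vary across polars versions and input types, but this mirrors the error above):

```python
# minimal sketch: a column that holds a Float64 value, then an Int64 value
import pandas as pd
import polars as pl

rows = [{"ts": 1469620667.5}, {"ts": 1469620667}]

print(pd.DataFrame(rows)["ts"].dtype)  # float64 -- pandas silently upcasts the int

try:
    pl.DataFrame(rows)                 # polars infers Float64 from the first value...
except TypeError as e:
    print(e)                           # ...then errors on the Int64 instead of casting
```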

For now, I have this hack in builtin_timeseries.py, which uses a MongoDB aggregation pipeline to coerce the data types of a bunch of fields so that polars will accept them:

             ts_query = self._get_query(key_list, time_query, geo_query,
                                 extra_query_list)
-            ts_db_cursor = tsdb.find(ts_query)
             ts_db_count = tsdb.count_documents(ts_query)
-            if sort_key is None:
-                ts_db_result = ts_db_cursor
+            if aggregation:
+                agg_pipeline = [
+                    {"$match": ts_query},
+                    {"$limit": edb.result_limit},
+                    {"$set": {
+                        "data.ts": {"$toDouble": "$data.ts"},
+                        "data.start_ts": {"$toDouble": "$data.start_ts"},
+                        "data.end_ts": {"$toDouble": "$data.end_ts"},
+                        "metadata.key": {"$toString": "$metadata.key"},
+                        "data.sensed_mode": {"$toInt": "$data.sensed_mode"},
+                        "_id": {"$toString": "$_id"},
+                        "data.start_place": {"$toString": "$data.start_place"},
+                        "data.end_place": {"$toString": "$data.end_place"},
+                        "data.raw_trip": {"$toString": "$data.raw_trip"},
+                        "data.cleaned_trip": {"$toString": "$data.cleaned_trip"},
+                        "data.expected_trip": {"$toString": "$data.expected_trip"},
+                        "data.inferred_trip": {"$toString": "$data.inferred_trip"},
+                        "data.confirmed_trip": {"$toString": "$data.confirmed_trip"},
+                        "data.start_confirmed_place._id": {"$toString": "$data.start_confirmed_place._id"},
+                        "data.end_confirmed_place._id": {"$toString": "$data.end_confirmed_place._id"},
+                        # "user_id": {"$toString": "$user_id"},
+                     }}
+                ]
+                if sort_key is not None:
+                    agg_pipeline.append({"$sort": {sort_key: pymongo.ASCENDING}})
+                ts_db_result = tsdb.aggregate(agg_pipeline)
             else:
-                ts_db_result = ts_db_cursor.sort(sort_key, pymongo.ASCENDING)
+                ts_db_cursor = tsdb.find(ts_query)
+                if sort_key is None:
+                    ts_db_result = ts_db_cursor
+                else:
+                    ts_db_result = ts_db_cursor.sort(sort_key, pymongo.ASCENDING)
+                ts_db_result.limit(edb.result_limit)
             # We send the results from the phone in batches of 10,000
             # And we support reading upto 100 times that amount at a time, so over

It's not a permanent solution, but it's enough for me to continue experimenting with polars.

@shankari (Contributor)

As part of follow-on cleanup for e-mission/e-mission-server#1014, I would like to see:

  • what was the justification for this change? In particular, what were the timing results from our sample programs (stage, ccebike, smart commute, stm-community) that led you to focus on these areas?
  • I see that we are querying for loc_df in trip_segmentation and passing it into the time/distance filters. And then we are querying for transition and motion activity inside segment_into_trips. Why? It seems like we can just read all input data from that time range and pass it into segment_into_trips at the same time, in three different dataframes. This will not have an impact on performance, but it is cleaner and easier to understand.

@JGreenlee (Author)

what was the justification for this change? In particular, what were the timing results from our sample programs (stage, ccebike, smart commute, stm-community) that led you to focus on these areas?

In #1098, segment_into_trips (specifically segment_into_trips/has_trip_ended) was identified as a bottleneck on production.

However, it's not a bottleneck when running the pipeline locally, so I added local instrumentation that measures the number of DB calls instead of measuring the time it takes to execute.
I identified has_trip_ended as a place where a lot of DB calls originate, because it calls is_tracking_restarted_in_range and get_ongoing_motion_in_range, which are the functions that were repeatedly querying background/filtered_location and background/motion_activity.

At the same time, I have been working on implementing polars and have some findings below.

@JGreenlee (Author)

JGreenlee commented Feb 3, 2025

I wrote a notebook that processes a day of example data (e.g. shankari_2016-07-27) through the pipeline on different git branches and measures both execution time and # of DB calls. Below is a comparison of 3 branches:

(this version of master is before segmentation_optimization was merged in)

[image: comparison of execution time and DB calls across the 3 branches]

(note that locally, mode_inference is running slower than trip_segmentation, but we are only focused on optimizing segmentation right now)

[images: per-stage measurements for each branch]
  • The segmentation_optimization branch (which has since been merged) drastically reduces the number of DB calls and cuts the local runtime of TRIP_SEGMENTATION by about 50%
  • The polars branch has little to no improvement over master in trip segmentation

@JGreenlee (Author)

It seems like we can just read all input data from that time range and pass it into segment_into_trips at the same time, in three different dataframes.

Yes, this would be a good refactor

@shankari (Contributor)

shankari commented Feb 3, 2025

Before looking at pandas versus polars, I think we need to look at pandas in the first place. IIRC, a pretty big chunk of the code in trip segmentation iterates over the points in the trajectory one by one to check for the distance from the previous points. Switching to a vectorized operation (such as pandas.diff https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html) should make it much faster.
https://duckduckgo.com/?t=ftsa&q=iterate+versus+vectorized&ia=web

I think that would be more low-hanging fruit than switching from pandas to polars. It will likely not affect DB calls, but should be faster computationally
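For instance, the distance-to-previous-point computation could be vectorized along these lines (a rough sketch; the column names and the haversine_numpy helper are assumptions for illustration, not the existing code):

```python
import numpy as np

def haversine_numpy(lat1, lon1, lat2, lon2):
    """Vectorized great-circle distance in meters."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371000 * 2 * np.arcsin(np.sqrt(a))

# distance and time from each point to its predecessor, computed in one shot
loc_df["dist_to_prev"] = haversine_numpy(
    loc_df["latitude"].shift(), loc_df["longitude"].shift(),
    loc_df["latitude"], loc_df["longitude"])
loc_df["ts_diff"] = loc_df["ts"].diff()
```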

@JGreenlee (Author)

I'll explore this further to see if I can implement the current behavior with a vectorized operation. However, I am not sure if it will be possible to do that without significant changes to the segmentation logic.

I am new to this and learning, but from what I've read it does seem that the reliance on row-wise iteration is likely the reason switching to polars didn't speed anything up. With column-wise operations, polars can multi-thread, but row-wise operations have to be done one at a time regardless of which library is used

@shankari (Contributor)

shankari commented Feb 3, 2025

However, I am not sure if it will be possible to do that without significant changes to the segmentation logic.

Right, this is the reason why I haven't worked on it yet. I do think it is not that complicated, but it is not as simple as changing the arguments to a function and pre-loading data.

@TeachMeTW

I will take a look at this as well and do my own analysis with pandas.

@JGreenlee (Author)

JGreenlee commented Feb 5, 2025

The current segmentation logic is as follows:

  1. Input a dataframe of background/filtered_location entries
  2. Read statemachine/transition and background/motion_activity entries during the time range
  3. Iterate row-wise:
    • a. Compute distances of the last n unprocessed points to the current point
    • b. Compute distances of unprocessed points in the last t minutes to the current point
    • c. If any of the following conditions are true:
      • there is a statemachine/transition between the prev row and the current row
      • there is a gap of 2 * t from the prev row and no background/motion_activity occurs in between
      • there is a gap of 12 hours or more, regardless of motion activity
      • there is a gap of 2 * t from the prev row and the speed is below dist_threshold / time_threshold
      • there are at least n - 1 points going back at least 30 secs, and the maximum distance for those points does not exceed the distance threshold

      Then mark that a trip has ended. Record the last point as the lower median of the last n or last t points.
      Rows up to this point will not be considered in subsequent iterations for computing distances.
  4. Return the marked segmentation points

I am struggling with the fact that

Rows up to this point will not be considered to compute distances in subsequent iterations.

If not for this, we could just perform an upfront computation of the max distances for all locations in last n points and last t minutes. I implemented this and got close to the current behavior but I found edge cases where it doesn't work.
The problem is that we don't know exactly how many points will need to be considered because it depends on whether a segment was identified beforehand (specifically, if a segment was identified in the last t minutes, then our precomputation used the wrong subset of points)
We could try to sniff out cases like this and re-check them, but I think that could lead to a "domino effect". The root problem is that when we identify a segment, it changes how we handle later segmentation.

My next thought is that we can use recursion or a while loop. It would precompute distances but identify only one segment at a time. Then it would do it again, excluding the rows that were already segmented.
However, this seems like it would lead to a lot of wasted computation.

@shankari (Contributor)

shankari commented Feb 5, 2025

If not for this, we could just perform an upfront computation of the max distances for all locations in last n points and last t minutes. I implemented this and got close to the current behavior but I found edge cases where it doesn't work.

Was this faster than the current point by point iteration? If so, by how much? I think that would be a good back-of-the-envelope check before we handle all the corner cases.

Then mark that a trip has ended. Record the last point as the lower median of the last n or last t points.
Rows up to this point will not be considered in subsequent iterations for computing distances

I assume this is

        if (len(last10PointsDistances) < self.point_threshold - 1 or
                    len(last5MinsDistances) == 0 or
                    last5MinTimes.max() < self.time_threshold - 30):
            logging.debug("Too few points to make a decision, continuing")
            return False

We should discuss this further, but let me outline how I think this would work at a high level.

With the restructure, I think we would work in three stages:

  • compute the distance and time metrics for the entire loc_df at one time using geopandas
  • do a first pass where we use basic diff with different periods values on the distance and time to identify potential trip ends (these will be << the number of points)
  • then, use the more complex checks involving motion activity, transitions, and minimum number of points to find the final trip ends from the potential ones

With the approach above, if we have $p$ points that map into $t$ trips, where $p \gg t$, we will use vectorized operations for all the $O(p)$ steps and an iterative for loop only for the $O(t)$ steps
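A minimal sketch of what that first pass could look like, assuming the distance/time columns from the upfront computation (the thresholds and column names here are hypothetical):

```python
import pandas as pd

ts_gap = loc_df["ts"].diff()
speed = loc_df["dist_to_prev"] / ts_gap

# vectorized first pass: flag potential trip ends in O(p) without iterating
potential_ends = loc_df.index[
    (ts_gap > 12 * 60 * 60)                          # 12-hour gap, unconditional
    | ((ts_gap > 2 * time_threshold)
       & (speed < dist_threshold / time_threshold))  # long gap while dwelling
]
```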

@shankari (Contributor)

shankari commented Feb 5, 2025

I can verify that after merging e-mission/e-mission-server#1014 I don't see any errors in the AWS logs. The only ERROR in the past few days is ERROR:root:habitica not configured, game functions not suppor which we should really clean up.

Although... it looks like we have no incoming data on stage (there is no match for segmenting into trips), so it looks like the trip segmentation stage was never run.

@JGreenlee (Author)

It is much faster, and this isn't even with geopandas or polars yet. This is just pandas, doing the distance calculations upfront and more efficiently using numpy.

[images: timing comparison]

However, it segments in the wrong places and generates too many trips (27 instead of 22).

@JGreenlee (Author)

JGreenlee commented Feb 5, 2025

How bad is it if the segmentation behavior is slightly different? What if it identifies the end of some trips one point earlier or later?

For example, in DwellSegmentationTimeFilter, one thing I noticed is that when determining what point gets marked as the last point of the trip (https://github.com/e-mission/e-mission-server/blob/aaedcd3caf7551efa8af0088c8fa56158e1c9725/emission/analysis/intake/segmentation/trip_segmentation_methods/dwell_segmentation_time_filter.py#L314-L316),
we look at the median points of last10Points_df and last5MinsPoints_df and take whichever comes first.

But there is a discrepancy because last10Points_df includes curr_point and last5MinsPoints_df does not, which seems arbitrary.

Usually, last10Points_df and last5MinsPoints_df overlap significantly. If not for that discrepancy, much of this could be simplified; we could keep both in one df and call it something like recent_points_df (sketched below)
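A sketch of what that combined window might look like (recent_points_df and the window sizes are illustrative, not the actual code):

```python
# union of the count-based and time-based windows around the current point i
last10_idx = loc_df.index[max(0, i - 9):i + 1]   # last 10 points, including curr_point
in_last_5min = (loc_df["ts"] >= curr_ts - 5 * 60) & (loc_df["ts"] <= curr_ts)
last5min_idx = loc_df.index[in_last_5min]

recent_points_df = loc_df.loc[last10_idx.union(last5min_idx)]
```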

@JGreenlee (Author)

JGreenlee commented Feb 6, 2025

  • compute the distance and time metrics for the entire loc_df at one time using geopandas
  • do a first pass where we use basic diff with different periods values on the distance and time to identify potential trip ends (these will be << the number of points)
  • then, use the more complex checks involving motion activity, transitions, and minimum number of points to find the final trip ends from the potential ones

This could work for false positives, but what about false negatives?


On shankari_2015-aug-27, the existing implementation detects trip ends at 28, 49, 138, 197, 207, 240, 298, 326.

With the precomputed distances approach, we get 28, 30, 31, 49, 50, 51, 138, 139, 140, 197, 198, 208, 209, 240, 241, 298, 299, 300, 326.
After removing "duplicates", we can get this down to 28, 49, 138, 197, 208, 240, 298, 326.
This is almost the same as expected, except we have 208 instead of 207.

Why wasn't a trip end detected at 207?
On master, the max of last10PointsDistance at idx 207 is 30.15985953803261;
on the new implementation, it is 132.63581139711928.

The reason for the discrepancy is that point 197 was excluded from the max calculation on master because a trip end was detected there.
With the precomputed approach, we don't know ahead of time whether 197 is going to be a trip end, so it is included in the max calculation.

@JGreenlee (Author)

JGreenlee commented Feb 6, 2025

So maybe the move is to compute all the distances ahead of time, but not the maxes of those distances.
Then use a loop or recursion, computing the maxes in each iteration to detect trip ends one-by-one.
This way, when a trip end is detected, we can exclude all those points from later max computations.

This is not as optimal as computing all of it upfront, including the maxes, but I would expect that computing the distances themselves is the more computationally heavy task.

And although we still have to iterate, we should only need to iterate per segment rather than per point (see the sketch below)
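A rough sketch of that per-segment loop, assuming distances are precomputed and detect_next_end is a hypothetical vectorized check (not the actual implementation):

```python
def find_segment_ends(loc_df, detect_next_end):
    """loc_df carries the precomputed distance/time columns.
    detect_next_end(slice_df) runs the vectorized max/threshold checks on a
    slice and returns the positional index of the first trip end, or None."""
    segment_ends = []
    seg_start = 0
    while seg_start < len(loc_df):
        end = detect_next_end(loc_df.iloc[seg_start:])  # vectorized over the slice
        if end is None:
            break
        segment_ends.append(seg_start + end)
        # later max computations start after this trip end, so points from
        # already-detected segments are excluded, as in the row-wise version
        seg_start = seg_start + end + 1
    return segment_ends
```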

@shankari (Contributor)

shankari commented Feb 6, 2025

How bad is it if the segmentation behavior is slightly different? What if it identifies the end of some trips one point earlier or later?

I think it would be fine if segmentation behavior is slightly different, as long as we have explored the differences and are convinced that they are small. We will need to regenerate the ground truth for the unit tests, but I have some code to do that from the time we switched to the gis branch as the master.

As an aside, the current hacks and rules for trip segmentation are from my own data that I collected in the 2016-2017 timeframe. I have noticed that the segmentation rules don't work quite as well with more modern phones; I have a whole collection of "bad segmentation" trips from the pixel phone for example. So I think that we might actually want to have a deeper dive into that (maybe with a summer intern!) but I would like to do that separately from this scalability improvement.

Usually last10Points_df and last5MinsPoints_df overlap significantly. If not for that discrepancy, much of this could be simplified; we could keep both in one df and call it something like recent_points_df

I would be fine with that, as long as recent_points_df actually does include both count-based and distance-based points.

So maybe the move is to compute all the distances ahead of time, but not the maxes of those distances.
Then use a loop or recursion, computing the maxes in each iteration to detect trip ends one-by-one
And although we still have to iterate, we should only need to iterate per segment rather than per point

Yup! You can compute the distances and the diffs, then use the diffs to segment through iteration, excluding previous points and doing the more heavyweight checks with the motion activity and transitions etc.

Note that to compute the max, you wouldn't need to iterate over all the points; you can use the dataframe directly (something like df[start_idx:end_idx].distance.max())

And as you can also see, we end up with ~ 8 segments for ~ 350 points, so switching from $O(p)$ to $O(t)$ will give us a huge improvement in the number of times we iterate

shankari added a commit to shankari/e-mission-server that referenced this issue Feb 11, 2025
…tabase

We have made several changes to cache summary information in the user profile.
This summary information can be used to improve the scalability of the admin
dashboard (e-mission/op-admin-dashboard#145) but will
also be used to determine dormant deployment and potentially in a future
OpenPATH-wide dashboard.

These changes started by making a single call to cache both trip and call stats
e-mission#1005

This resulted in all the composite trips being read every hour, so we split the
stats into pipeline-dependent and pipeline-independent stats, in
88bb35a
(part of e-mission#1005)

The idea was that, since the composite object query was slow because the
composite trips were large, we could run only the queries for the server API
stats every hour, and read the composite trips only when they were updated.

However, after the improvements to the trip segmentation pipeline
(e-mission#1014, results in
e-mission/e-mission-docs#1105 (comment))
reading the server stats is now the bottleneck.

Checking the computation time on the big deployments (e.g. ccebikes), although
the time taken has significantly improved as the database load has gone down,
even in the past two days, we see a median of ~ 10 seconds and a max of over
two minutes.

And we don't really need to query this data to generate the call summary
statistics. Instead of computing them on every run, we can compute them only
when _they_ change, which happens when we receive calls to the API.

So, in the API's after_request hook, in addition to adding a new stat, we can
also just update the `last_call_ts` in the profile. This potentially makes
every API call a teeny bit slower since we are adding one more DB write, but it
significantly lowers the DB load, so should make the system as a whole faster.

Testing done:
- the modified `TestUserStat` test for the `last_call_ts` passes
@JGreenlee (Author)

MODE_INFERENCE optimization

The next bottleneck is the MODE_INFERENCE stage. Calls to the Overpass API account for almost 100% of the execution time: https://github.com/e-mission/e-mission-server/blob/d6d6ca449d78cf1fbc4038f53eb5647772c936ff/emission/net/ext_service/transit_matching/match_stops.py#L29

I have not looked at this code before, so I took some time to familiarize myself. My high-level understanding is that for every IN_VEHICLE cleaned section, we query for bus/train stops at the start and end locations. Then we see if any routes are shared between those stops. If so, we infer that the mode was bus/train

The query looks like

[out:json][timeout:25];
(
  node["highway"="bus_stop"]({bbox});
  node["railway"="station"]({bbox});
  node["public_transport"]({bbox});
  way["railway"="station"]({bbox});
  relation["route"]({bbox});
);
out body;
>;

and is run separately for the start and the end of every section
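Based on that understanding, the route matching presumably boils down to a set intersection along these lines (a hedged sketch; the data shapes and names are assumptions, not the actual match_stops.py code):

```python
def route_ids(stops):
    # each stop is assumed to carry the ids of the route relations that reference it
    return {route["id"] for stop in stops for route in stop.get("routes", [])}

shared = route_ids(start_stops) & route_ids(end_stops)
if shared:
    # some route serves both ends of the section, so infer a transit mode
    # (bus vs. train presumably depends on the route type tags)
    predicted_mode = "BUS_OR_TRAIN"  # placeholder, not an actual mode constant
```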

@JGreenlee (Author)

JGreenlee commented Feb 12, 2025

I have a few ideas:

  1. The end of one section is often (if not always?) the same as the start point of the next. So for any consecutive sections that are both IN_VEHICLE, we are making duplicate queries.
  2. It may be more efficient to combine queries into batches (of 5 or 10, for example)
  3. It may be possible to optimize the queries themselves

Caveat to (1): This will not work if section.startStopRadius and section.endStopRadius differ. It appears that they used to be different but are now both 150 meters; in fact, only startStopRadius is used in this code https://github.com/e-mission/e-mission-server/blob/d6d6ca449d78cf1fbc4038f53eb5647772c936ff/emission/analysis/classification/inference/mode/rule_engine.py#L173-L174

@JGreenlee (Author)

JGreenlee commented Feb 14, 2025

(2) is possible by doing something like this:

[out:json][timeout:25];

(
  node["highway"="bus_stop"]({bbox1});
  node["railway"="station"]({bbox1});
  node["public_transport"]({bbox1});
  way["railway"="station"]({bbox1});
  relation["route"]({bbox1});
);
out body;
out count;

(
  node["highway"="bus_stop"]({bbox2});
  node["railway"="station"]({bbox2});
  node["public_transport"]({bbox2});
  way["railway"="station"]({bbox2});
  relation["route"]({bbox2});
);
out body;
out count;

The returned list of results will be separated by "count" entries, allowing us to distinguish which results were for which locations.

Initially, I tested this by just querying the start and end locations together (2 points per query), and I saw a modest improvement (~25% faster)
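For reference, splitting the batched response back out per location could look something like this (a sketch assuming the standard Overpass JSON shape, where out count emits an element of type "count"; batched_query holds the combined query above; this is not the actual PR code):

```python
import requests

response = requests.post("https://overpass-api.de/api/interpreter",
                         data=batched_query).json()

results_per_bbox = [[]]
for element in response["elements"]:
    if element["type"] == "count":
        results_per_bbox.append([])   # a count entry closes out one bbox's results
    else:
        results_per_bbox[-1].append(element)
results_per_bbox.pop()                # drop the empty list after the final count
```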


I also looked into (3) but did not find much opportunity for optimization.

I checked the API responses and found that “relations”, and their children “members”, make up the vast majority of the results.
I checked the code to see where we use them, thinking that we may be querying more than we need.
I did find that we only use relation members that have a type of node. https://github.com/e-mission/e-mission-server/blob/4fe4e6df50736b050a3c4284bf0c8b5b406e26f4/emission/net/ext_service/transit_matching/match_stops.py#L94-L102
However, there does not appear to be any way to selectively filter members in the relation (it is all or nothing) https://gis.stackexchange.com/a/166378

@JGreenlee (Author)

Draft PR to optimize MODE_INFERENCE:

[image: pipeline stage runtimes for MODE_INFERENCE]

@shankari (Contributor)

@JGreenlee I am not sure why this is not showing up in your instrumentation, but when I run the first round of trip segmentation fixes (pre-reading all data from the DB), the next slow task is section segmentation. I reset the pipelines for ccebikes and started up the intake pipeline with:

  • n_workers = 3
  • docker resources limited to 2 CPU and 2GB RAM
  • skip_if_... = True

From the logs, the timings are:

2025-02-20 13:59:36,465:INFO:8313653312:For stage PipelineStages.TRIP_SEGMENTATION, start_ts = 2024-09-09T16:43:33.687312
2025-02-20 14:43:21,082:INFO:8313653312:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2024-11-26T23:47:34.326266

2025-02-20 14:43:21,178:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-09-09T16:38:33.687312
2025-02-20 16:31:36,955:INFO:8313653312:++++++++++++++++++++Processing trip 67b7a6a6600f509989281872 for user [redacted]++++++++++++++++++++

... not yet complete as of 16:31

In fact, two out of three processes are stuck in SECTION_SEGMENTATION; the third is stuck in TRIP_SEGMENTATION

$ grep "For stage" /var/tmp/intake_0.log | tail -n 1
2025-02-20 13:59:29,480:INFO:8313653312:For stage PipelineStages.TRIP_SEGMENTATION, start_ts = 2024-12-24T04:40:58.037000

$ grep "For stage" /var/tmp/intake_1.log | tail -n 1
2025-02-20 14:03:42,810:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-23T22:28:00.921000

$ grep "For stage" /var/tmp/intake_2.log | tail -n 1
2025-02-20 14:43:21,178:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-09-09T16:38:33.687312

I think that the next step, even before MODE_INFERENCE, would be to rewrite section segmentation also to read all entries from the database upfront, instead of reading on a point-by-point basis.

I would also suggest that you instrument running the pipeline against a bigger dataset (e.g. ccebikes) with constrained docker resources (e.g. 2 CPU/2 GB RAM) to understand where the production bottlenecks are likely to be.

@shankari (Contributor)

Quick check after a few hours; the first process has now moved to SECTION_SEGMENTATION, but the other two are still stuck in the previous SECTION_SEGMENTATION step. So right now, all three processes are stuck in SECTION_SEGMENTATION

$ grep "For stage" /var/tmp/intake_0.log | tail -n 1
2025-02-20 17:25:09,543:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-24T04:37:21.258000
$ grep "For stage" /var/tmp/intake_1.log | tail -n 1
2025-02-20 14:03:42,810:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-23T22:28:00.921000
$ grep "For stage" /var/tmp/intake_2.log | tail -n 1
2025-02-20 14:43:21,178:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-09-09T16:38:33.687312

@shankari (Contributor)

Quick check after a few more hours, all three processes are still stuck on the same SECTION_SEGMENTATION

$ grep "For stage" /var/tmp/intake_0.log | tail -n 1
2025-02-20 17:25:09,543:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-24T04:37:21.258000
$ grep "For stage" /var/tmp/intake_1.log | tail -n 1
2025-02-20 14:03:42,810:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-23T22:28:00.921000
$ grep "For stage" /var/tmp/intake_2.log | tail -n 1
2025-02-20 14:43:21,178:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-09-09T16:38:33.687312

@shankari (Contributor)

Ah, the third has now moved on to JUMP_SMOOTHING. But it took from 14:43 to 20:49, so around 6 hours

$ grep "For stage" /var/tmp/intake_0.log | tail -n 1
2025-02-20 17:25:09,543:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-24T04:37:21.258000
$ grep "For stage" /var/tmp/intake_1.log | tail -n 1
2025-02-20 14:03:42,810:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-23T22:28:00.921000
$ grep "For stage" /var/tmp/intake_2.log | tail -n 1
2025-02-20 20:49:08,261:INFO:8313653312:For stage PipelineStages.JUMP_SMOOTHING, start_ts = 2024-09-09T16:38:33.687312

While the TRIP_SEGMENTATION for the same user took less than an hour

2025-02-20 13:59:36,465:INFO:8313653312:For stage PipelineStages.TRIP_SEGMENTATION, start_ts = 2024-09-09T16:43:33.687312
2025-02-20 14:43:21,082:INFO:8313653312:For stage PipelineStages.TRIP_SEGMENTATION, last_ts_processed = 2024-11-26T23:47:34.326266

2025-02-20 14:43:21,178:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-09-09T16:38:33.687312
2025-02-20 20:49:08,247:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, last_ts_processed = 2024-11-26T23:33:34.999963

2025-02-20 20:49:08,261:INFO:8313653312:For stage PipelineStages.JUMP_SMOOTHING, start_ts = 2024-09-09T16:38:33.687312

@shankari (Contributor)

The user being processed in intake_2 has finally made it past JUMP_SMOOTHING!! The others are still stuck in SECTION_SEGMENTATION though

$ grep "For stage" /var/tmp/intake_0.log | tail -n 1
2025-02-20 17:25:09,543:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-24T04:37:21.258000
$ grep "For stage" /var/tmp/intake_1.log | tail -n 1
2025-02-20 14:03:42,810:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-23T22:28:00.921000
$ grep "For stage" /var/tmp/intake_2.log | tail -n 1
2025-02-20 23:04:30,770:INFO:8313653312:For stage PipelineStages.CLEAN_RESAMPLING, start_ts = 2024-09-09T16:38:33.687312

@shankari (Contributor)

After waiting overnight, the intake_0 user is still stuck in SECTION_SEGMENTATION

$ grep "For stage" /var/tmp/intake_0.log | tail -n 1
2025-02-20 17:25:09,543:INFO:8313653312:For stage PipelineStages.SECTION_SEGMENTATION, start_ts = 2024-12-24T04:37:21.258000
$ grep "For stage" /var/tmp/intake_1.log | tail -n 1
2025-02-21 04:15:52,177:INFO:8313653312:For stage PipelineStages.TRIP_SEGMENTATION, start_ts = 2023-11-07T18:41:41.716000
$ grep "For stage" /var/tmp/intake_2.log | tail -n 1
2025-02-21 01:24:12,743:INFO:8313653312:For stage PipelineStages.TRIP_SEGMENTATION, start_ts = 2024-08-10T02:37:05.585000

The other two have finished the user they were on, and have gotten stuck in the next TRIP_SEGMENTATION stage.

@shankari (Contributor)

The database container is fairly stable, though, without a lot of churn.

[image: database container resource usage]

@JGreenlee (Author)

I see. My instrumentation has been based on real_examples and smaller dumps, simply because I have not been able to successfully run the pipeline on large dumps.

In hindsight, I should have looked closer at the logs before giving up on the large dumps. I didn't realize it was getting stuck on particular users (although that does make sense); I thought there was just too much data to get through in a reasonable time.

Would you be able to send me some of the opcodes/UUIDs that got stuck, so I can reproduce this myself (without having to run it overnight)?

@shankari (Contributor)

@JGreenlee wait a minute - it looks like this behavior is specific to mongo 8

If I reset the pipeline, and then run the same code on mongo 4, all three of the users that were stuck earlier are done in under 15 mins.

Sat Feb 22 11:19:21 PST 2025
Sat Feb 22 11:32:53 PST 2025

There's a huge investigation into query plans in #1109 and I think we finally have a workaround. Regardless, given our current DB characteristics, I think that reading entries upfront is generally a good idea to avoid multiple small DB calls.

TeachMeTW pushed a commit to TeachMeTW/e-mission-server that referenced this issue Feb 26, 2025
@JGreenlee (Author)

JGreenlee commented Feb 26, 2025

We implemented a more robust way to track DB calls, and we see that DB calls are scattered throughout section segmentation in many places.

We have addressed the background/filtered_location queries in segment_trip_into_sections, but there are numerous other places where queries occur:

idx stage name reading branch count
0 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'get_data_df', 'get_location_streams_for_trip', 'segment_into_sections'] master 52
1 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'get_data_df', 'get_location_streams_for_trip', 'segment_into_sections'] section_optimization 52
2 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'get_data_df', 'segment_into_motion_changes', 'segment_into_sections'] master 26
3 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'get_data_df', 'segment_into_motion_changes', 'segment_into_sections'] section_optimization 26
4 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'get_entries', 'segment_current_sections', 'run_intake_pipeline_for_user'] master 1
5 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'get_entries', 'segment_current_sections', 'run_intake_pipeline_for_user'] section_optimization 1
6 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'segment_trip_into_sections', 'segment_current_sections', 'run_intake_pipeline_for_user'] master 26
7 SECTION_SEGMENTATION db_call/aggregate ['_get_entries_for_timeseries', 'find_entries', 'segment_trip_into_sections', 'segment_current_sections', 'run_intake_pipeline_for_user'] section_optimization 52
8 SECTION_SEGMENTATION db_call/find ['<dictcomp>', 'segment_trip_into_sections', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] section_optimization 26
9 SECTION_SEGMENTATION db_call/find ['<listcomp>', 'get_entries', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] master 2
10 SECTION_SEGMENTATION db_call/find ['<listcomp>', 'get_entries', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] section_optimization 1
11 SECTION_SEGMENTATION db_call/find ['<listcomp>', 'to_data_df', 'get_data_df', 'get_location_streams_for_trip', 'segment_into_sections'] master 104
12 SECTION_SEGMENTATION db_call/find ['<listcomp>', 'to_data_df', 'get_data_df', 'get_location_streams_for_trip', 'segment_into_sections'] section_optimization 52
13 SECTION_SEGMENTATION db_call/find ['<listcomp>', 'to_data_df', 'get_data_df', 'segment_into_motion_changes', 'segment_into_sections'] master 52
14 SECTION_SEGMENTATION db_call/find ['<listcomp>', 'to_data_df', 'get_data_df', 'segment_into_motion_changes', 'segment_into_sections'] section_optimization 26
15 SECTION_SEGMENTATION db_call/find ['find_one', 'get_current_state', 'get_time_range_for_stage', 'get_time_range_for_sectioning', 'segment_current_sections'] master 1
16 SECTION_SEGMENTATION db_call/find ['find_one', 'get_current_state', 'get_time_range_for_stage', 'get_time_range_for_sectioning', 'segment_current_sections'] section_optimization 1
17 SECTION_SEGMENTATION db_call/find ['find_one', 'get_current_state', 'mark_stage_done', 'mark_sectioning_done', 'segment_current_sections'] master 1
18 SECTION_SEGMENTATION db_call/find ['find_one', 'get_current_state', 'mark_stage_done', 'mark_sectioning_done', 'segment_current_sections'] section_optimization 1
19 SECTION_SEGMENTATION db_call/find ['find_one', 'get_entry_at_ts', '<lambda>', 'segment_trip_into_sections', 'segment_current_sections'] master 52
20 SECTION_SEGMENTATION db_call/find ['find_one', 'get_entry_from_id', 'df_row_to_entry', '<lambda>', 'segment_trip_into_sections'] master 60
21 SECTION_SEGMENTATION db_call/find ['find_one', 'get_entry_from_id', 'get_entry', 'get_object', '_get_distance_from_start_place_to_end'] master 26
22 SECTION_SEGMENTATION db_call/find ['find_one', 'get_entry_from_id', 'get_entry', 'get_object', '_get_distance_from_start_place_to_end'] section_optimization 26
23 SECTION_SEGMENTATION db_call/find ['find_one', 'get_entry_from_id', 'get_entry', 'get_object', 'get_time_query_for_trip_like'] master 26
24 SECTION_SEGMENTATION db_call/find ['find_one', 'get_entry_from_id', 'get_entry', 'get_object', 'get_time_query_for_trip_like'] section_optimization 26
25 SECTION_SEGMENTATION db_call/find ['get_time_range_for_stage', 'get_time_range_for_sectioning', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] master 1
26 SECTION_SEGMENTATION db_call/find ['get_time_range_for_stage', 'get_time_range_for_sectioning', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] section_optimization 1
27 SECTION_SEGMENTATION db_call/find ['mark_stage_done', 'mark_sectioning_done', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] master 1
28 SECTION_SEGMENTATION db_call/find ['mark_stage_done', 'mark_sectioning_done', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] section_optimization 1
29 SECTION_SEGMENTATION db_call/insert ['insert', 'segment_trip_into_sections', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] master 38
30 SECTION_SEGMENTATION db_call/insert ['insert', 'segment_trip_into_sections', 'segment_current_sections', 'run_intake_pipeline_for_user', 'run_intake_pipeline'] section_optimization 38
31 SECTION_SEGMENTATION db_call/insert ['save', 'get_time_range_for_stage', 'get_time_range_for_sectioning', 'segment_current_sections', 'run_intake_pipeline_for_user'] master 1
32 SECTION_SEGMENTATION db_call/insert ['save', 'get_time_range_for_stage', 'get_time_range_for_sectioning', 'segment_current_sections', 'run_intake_pipeline_for_user'] section_optimization 1
33 SECTION_SEGMENTATION db_call/update ['replace_one', 'save', 'mark_stage_done', 'mark_sectioning_done', 'segment_current_sections'] master 1
34 SECTION_SEGMENTATION db_call/update ['replace_one', 'save', 'mark_stage_done', 'mark_sectioning_done', 'segment_current_sections'] section_optimization 1
35 SECTION_SEGMENTATION db_call/update ['replace_one', 'save', 'update', 'segment_trip_into_sections', 'segment_current_sections'] master 8
36 SECTION_SEGMENTATION db_call/update ['replace_one', 'save', 'update', 'segment_trip_into_sections', 'segment_current_sections'] section_optimization 8

@JGreenlee (Author)

JGreenlee commented Feb 26, 2025

Previously, we just recorded the number of calls to _get_entries_for_timeseries; now we have a more sophisticated approach:

# record a stat every time the DB is queried by monitoring the MongoDB client

import inspect
import time
from pymongo.monitoring import register, CommandListener
import emission.storage.decorations.stats_queries as esds

class QueryMonitor(CommandListener):
    def started(self, event):
        event_cmd = str(event.command)
        if (
            event.command_name in {"find", "aggregate", "insert", "update", "delete"}
            # don't record the writes that store these stats themselves
            and 'stats/pipeline_time' not in event_cmd
        ):
            # skip the first 12 stack frames (pymongo internals) and keep the next 5
            call_stack = [f.function for f in inspect.stack()][12:17]
            esds.store_pipeline_time(None,
                                     f'db_call/{event.command_name}',
                                     time.time(),
                                     str(call_stack))

    def succeeded(self, _): pass
    def failed(self, _): pass

register(QueryMonitor())

This counts every DB call, but also allows us to identify:
i) what type of query it was ("find", "aggregate", "insert", "update", "delete")
ii) the function stack from wherever the query was made (e.g. ['_get_entries_for_timeseries', 'find_entries', 'get_data_df', 'get_location_streams_for_trip', 'segment_into_sections'])

We think it would be beneficial to check this into the repo somewhere, but I am not sure of the correct location for it.

@JGreenlee (Author)

Follow-on to the investigation in e-mission/e-mission-server#1032 (comment):

I am going to leave ccebikes running overnight. I am not sure if it will finish because this one took over an hour and ccebikes has 90x as many trips


I have had ccebikes running for about 11 hours and I don't think it is close to being done

Two users have been stuck in MODE_INFERENCE for the last 1 and 2 hours, respectively

grep "For stage" /var/tmp/intake_1.log | tail -n 1
2025-02-28 09:10:06,127:INFO:8737540160:For stage PipelineStages.MODE_INFERENCE, start_ts is None
grep "For stage" /var/tmp/intake_2.log | tail -n 1
2025-02-28 10:00:08,979:INFO:8737540160:For stage PipelineStages.MODE_INFERENCE, start_ts is None

For intake_0, I can tell it's in TRIP_SEGMENTATION based on recent logs, but I can't tell how long it's been stuck because nothing comes up (maybe due to a limitation of grep or tail?)

grep "For stage" /var/tmp/intake_0.log | tail -n 1

@JGreenlee (Author)

I am considering stopping the run because it would probably take all weekend and I will not be home to keep an eye on it.
It is clear that the bottlenecks are in TRIP_SEGMENTATION and MODE_INFERENCE, which is what e-mission/e-mission-server#1017 and e-mission/e-mission-server#1026 already address. This is consistent with what I have been finding all along.

After those are merged, it may become more practical to do side-by-side comparisons on large dumps.

As for e-mission/e-mission-server#1032, it does drastically reduce the number of DB operations, which significantly improves CLEAN_RESAMPLING, CREATE_CONFIRMED_OBJECTS, and CREATE_COMPOSITE_OBJECTS (at least on test data and stm-community).
However, this is overshadowed by how slow TRIP_SEGMENTATION and MODE_INFERENCE are.

@shankari (Contributor)

For intake_0 I can tell it's in TRIP_SEGMENTATION based on recent logs, but I can't tell how long it's been stuck because nothing comes up (maybe due to limitation of grep or tail?)

It could also be that the logs have rolled over. In that case, I would expect to see an intake_0.log.1 or similar

@JGreenlee (Author)

JGreenlee commented Feb 28, 2025

I stopped it, but these are the results from that run (looks like it got through 49 users out of 108)

[images: per-stage results from the partial ccebikes run]

@shankari (Contributor)

We can do a before-and-after at the end of next week, after e-mission/e-mission-server#1017, e-mission/e-mission-server#1026, and e-mission/e-mission-server#1032 are merged.

@JGreenlee (Author)

JGreenlee commented Mar 3, 2025

While working on e-mission/e-mission-server#1032, I found that create_index is called repeatedly throughout the pipeline. According to the Mongo docs, duplicate calls to create_index are ignored: https://www.mongodb.com/docs/manual/reference/method/db.collection.createIndex/#recreating-an-existing-index

However, removing them significantly reduced the local execution time for CLEAN_RESAMPLING, CREATE_CONFIRMED_OBJECTS, and CREATE_COMPOSITE_OBJECTS, suggesting that the duplicate calls introduce some overhead.

[images: stage runtime comparison with the duplicate create_index calls removed]

The duplicate create_index calls only happen during the pipeline in spots where the get_*_database methods from edb are called directly, rather than using methods from the timeseries (the preferred pattern)

Here are the places where create_index calls originate from:

idx reading count
51 ['create_composite_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 28
53 ['getUserCache', 'moveToLongTerm', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 5
56 ['get_analysis_timeseries_db', 'create_composite_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 70
58 ['get_analysis_timeseries_db', 'get_sections_for_trip', 'get_cleaned_sections_for_trip', '_fix_squished_place_mismatch', 'link_trip_start', 'create_and_link_timeline', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 140
60 ['get_analysis_timeseries_db', 'get_sections_for_trip', 'get_raw_sections_for_trip', 'get_raw_timeline_for_trip', 'get_filtered_trip', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 910
62 ['get_analysis_timeseries_db', 'get_sections_for_trip', 'get_section_summary', 'create_confirmed_entry', 'create_and_link_timeline', 'create_confirmed_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 2310
64 ['get_analysis_timeseries_db', 'get_sections_for_trip', 'get_sections_for_confirmed_trip', 'create_composite_trip', 'create_composite_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 770
66 ['get_analysis_timeseries_db', 'get_stops_for_trip', 'get_raw_stops_for_trip', 'get_raw_timeline_for_trip', 'get_filtered_trip', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 910
68 ['get_cleaned_sections_for_trip', '_fix_squished_place_mismatch', 'link_trip_start', 'create_and_link_timeline', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 4
70 ['get_raw_sections_for_trip', 'get_raw_timeline_for_trip', 'get_filtered_trip', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 26
72 ['get_raw_stops_for_trip', 'get_raw_timeline_for_trip', 'get_filtered_trip', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 26
74 ['get_section_summary', 'create_confirmed_entry', 'create_and_link_timeline', 'create_confirmed_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 66
76 ['get_sections_for_confirmed_trip', 'create_composite_trip', 'create_composite_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 22
78 ['get_sections_for_trip', 'get_cleaned_sections_for_trip', '_fix_squished_place_mismatch', 'link_trip_start', 'create_and_link_timeline', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 56
80 ['get_sections_for_trip', 'get_raw_sections_for_trip', 'get_raw_timeline_for_trip', 'get_filtered_trip', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 364
82 ['get_sections_for_trip', 'get_section_summary', 'create_confirmed_entry', 'create_and_link_timeline', 'create_confirmed_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 924
84 ['get_sections_for_trip', 'get_sections_for_confirmed_trip', 'create_composite_trip', 'create_composite_objects', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 308
86 ['get_stops_for_trip', 'get_raw_stops_for_trip', 'get_raw_timeline_for_trip', 'get_filtered_trip', 'save_cleaned_segments_for_timeline', 'save_cleaned_segments_for_ts', 'clean_and_resample', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 364
88 ['load_model', '_load_stored_trip_model', 'predict_cluster_confidence_discounting', 'compute_and_save_algorithms', 'run_prediction_pipeline', 'infer_labels', 'run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 3
91 ['run_intake_pipeline', '<module>'] 6
93 ['run_intake_pipeline_for_user', 'run_intake_pipeline', '<module>'] 2
  • We should make a PR to comb through these places in the pipeline and replace them with timeseries methods (expanding the capability of the timeseries methods if necessary)

@JGreenlee (Author)

JGreenlee commented Mar 7, 2025

e-mission/e-mission-server#1017 (comment)

As a future fix, it would be good to replace the existing add_speed_dist... function with this vectorized implementation. That should speed up JUMP_SMOOTHING as well

  • add_dist_heading_speed currently iterates point-by-point for each of distances, speeds, and headings. We should be able to do something similar to what we now do in trip segmentation, i.e. use diff() for ts_diff, haversine_numpy for dist_diff, and then speed is dist_diff / ts_diff (see the sketch below)
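A hedged sketch of what that vectorized replacement could look like (column names are assumed; the haversine and initial-bearing formulas here are the standard ones, not necessarily what the existing function computes):

```python
import numpy as np

def add_dist_heading_speed_vectorized(points_df):
    lat1 = np.radians(points_df["latitude"].shift())
    lon1 = np.radians(points_df["longitude"].shift())
    lat2 = np.radians(points_df["latitude"])
    lon2 = np.radians(points_df["longitude"])

    # haversine distance from the previous point, in meters
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    points_df["distance"] = 6371000 * 2 * np.arcsin(np.sqrt(a))

    # initial bearing from the previous point, in degrees
    dlon = lon2 - lon1
    points_df["heading"] = np.degrees(np.arctan2(
        np.sin(dlon) * np.cos(lat2),
        np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(dlon)))

    points_df["speed"] = points_df["distance"] / points_df["ts"].diff()  # m/s
    return points_df
```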

@JGreenlee (Author)

e-mission/e-mission-server#1017 (comment)

  • > I think that an example of how searchsorted generates the indices would be helpful. You could even put that into a unit test, so we are not caught by surprise if np makes subtle changes across versions and we upgrade to the newer version; similar to emission/tests//netTests/TestMetricsConfirmedTripsPandas.py
