2-3× Speedup for `BaseDataset.set_task()` #378

zzachw · 2025-05-02T05:53:16Z

Main: 2× Speedup for BaseDataset.set_task()

This PR addresses a major bottleneck in BaseDataset.set_task()—the repeated calls to Patient.get_events().

Fast time range filtering via binary search on sorted timestamps (O(N) → O(log N))
Efficient event type filtering using pre-built lookups (O(N) → O(1))
Reduced timestamp precision from microseconds to milliseconds for lower processing overhead
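The first two points above can be sketched with the standard library alone. This is a minimal illustration, not the PR's Polars-based code: the `events` records, the `rows_by_type` lookup, the simplified `get_events` signature, and the inclusive `start`/`end` bounds are all assumptions made for the example.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from datetime import datetime

# Hypothetical event records (timestamp, event_type), already sorted by timestamp.
events = [
    (datetime(2025, 1, 1, 8, 0), "admissions"),
    (datetime(2025, 1, 1, 9, 30), "labevents"),
    (datetime(2025, 1, 2, 7, 15), "labevents"),
    (datetime(2025, 1, 3, 12, 0), "diagnoses_icd"),
    (datetime(2025, 1, 4, 18, 45), "labevents"),
]
timestamps = [ts for ts, _ in events]  # extracted once, reused by every query

# Pre-built lookup: event type -> ascending row positions (one O(N) pass).
rows_by_type = defaultdict(list)
for i, (_, event_type) in enumerate(events):
    rows_by_type[event_type].append(i)


def time_range(start=None, end=None):
    """Return the half-open index range [lo, hi) with start <= ts <= end."""
    lo = 0 if start is None else bisect_left(timestamps, start)
    hi = len(timestamps) if end is None else bisect_right(timestamps, end)
    return lo, hi


def get_events(event_type=None, start=None, end=None):
    """Filter by time with binary search, then by type via the lookup."""
    lo, hi = time_range(start, end)
    if event_type is None:
        return events[lo:hi]
    # The lookup itself is O(1); the positions are ascending, so the
    # time window can be cut out of them with two more binary searches.
    positions = rows_by_type.get(event_type, [])
    first = bisect_left(positions, lo)
    last = bisect_left(positions, hi)
    return [events[i] for i in positions[first:last]]
```

Both searches rely on `timestamps` being sorted; the per-type position lists stay sorted for free because they are built in row order.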

Runtime improvements on full MIMIC-IV:
InHospitalMortalityMIMIC4: 40min -> 20min
Readmission30DaysMIMIC4: 30min -> 14min
MIMIC4EDBenchmark (to be introduced in the next PR): 7hr -> 48min -> 20min

Should solve issue #331

Other Changes

Set num_workers=1 by default in set_task() to reduce peak memory usage
Removed propagation of dev flag to child MIMIC-IV dataset classes to ensure consistent patient cohorts across sub-datasets
Refactor InHospitalMortalityMIMIC4 and Readmission30DaysMIMIC4 for improved performance and clarity
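As a rough illustration of the `num_workers=1` default, the sketch below runs a task function sequentially unless more workers are requested. The function name `generate_samples`, the `task_fn` callable, and the use of a thread pool are assumptions for this example; the actual `set_task()` implementation may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_samples(patients, task_fn, num_workers=1):
    """Apply task_fn to each patient and flatten the per-patient sample lists."""
    if num_workers == 1:
        # Sequential path: only one patient's intermediate data is live at a time.
        per_patient = [task_fn(p) for p in patients]
    else:
        # Parallel path: several patients are processed concurrently,
        # which can raise peak memory usage.
        with ThreadPoolExecutor(max_workers=num_workers) as pool:
            per_patient = list(pool.map(task_fn, patients))  # keeps input order
    return [sample for samples in per_patient for sample in samples]
```

Callers who have memory to spare can still opt in to parallelism by passing a larger `num_workers`.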

- Add fast time range filtering via binary search on sorted timestamps
- Add efficient event type filtering using pre-built index lookups
- Reduce timestamp precision from microseconds to milliseconds
- Set default num_workers=1 in set_task for better memory control
- Remove unused dev flag from child MIMIC4 dataset classes


Copilot

Pull Request Overview

This PR significantly improves the performance of the dataset task methods by optimizing event filtering and reducing unnecessary overhead. Key changes include:

Faster event filtering via binary search and pre-built lookups.
Adjustments to multi-threading default settings in sample generation.
Refinements to documentation and code structure across tasks and dataset initialization.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| pyhealth/tasks/readmission_30days_mimic4.py | Updated task docstring and refactored event code extraction using Polars |
| pyhealth/tasks/in_hospital_mortality_mimic4.py | Updated pivot settings and improved docstring clarity |
| pyhealth/datasets/mimic4.py | Removed dev flag propagation to child datasets with added clarifying comments |
| pyhealth/datasets/base_dataset.py | Revised num_workers default and timestamp casting with ms precision |
| pyhealth/data/data.py | Replaced regular filtering with fast filtering using binary search techniques |

Comments suppressed due to low confidence (1)

pyhealth/datasets/base_dataset.py:266

[nitpick] Please confirm that using millisecond precision for timestamps meets all necessary requirements for event ordering, especially in cases where events occur in rapid succession.

timestamp_expr.cast(pl.Datetime(time_unit="ms")).alias("timestamp"),
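To make the ordering concern concrete, here is a small standard-library sketch. It is a stand-in for the Polars cast, not the cast itself, and it assumes the reduction to milliseconds truncates rather than rounds; the helper `truncate_to_ms` is hypothetical.

```python
from datetime import datetime


def truncate_to_ms(ts: datetime) -> datetime:
    """Drop sub-millisecond digits, e.g. 123456 us becomes 123000 us."""
    return ts.replace(microsecond=ts.microsecond // 1000 * 1000)


a = datetime(2025, 5, 2, 5, 53, 16, 123456)
b = datetime(2025, 5, 2, 5, 53, 16, 123789)  # 333 microseconds after a

# Two events stored in reverse chronological order.
events = [("second", b), ("first", a)]

# At microsecond precision, sorting recovers the true order.
by_us = sorted(events, key=lambda e: e[1])

# At millisecond precision the two timestamps tie, so a stable sort
# simply keeps whatever order the rows arrived in.
by_ms = sorted(events, key=lambda e: truncate_to_ms(e[1]))
```

Events less than a millisecond apart become indistinguishable after truncation, so their relative order then depends on input row order rather than on the timestamps.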

Copilot · 2025-05-02T05:54:11Z

pyhealth/data/data.py

@@ -129,37 +169,39 @@ def get_events(
            Union[pl.DataFrame, List[Event]]: Filtered events as a DataFrame 
            or a list of Event objects.


[nitpick] Consider adding a brief note in the function docstring clarifying that the fast filtering approach requires the timestamp column to be pre-sorted, as this requirement is crucial for the binary search optimization to work correctly.

Suggested change

-            or a list of Event objects.
+            or a list of Event objects.
+
+            Note:
+                The fast filtering approach used in this method assumes that the timestamp column
+                is pre-sorted. Ensure that the data source is sorted by timestamp for correct results.

We always sort by timestamp in `__init__`.
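The sortedness requirement discussed here is easy to demonstrate. The sketch below uses plain integers in place of timestamps and a hypothetical helper `count_in_range`; it is not taken from the PR.

```python
from bisect import bisect_left, bisect_right


def count_in_range(timestamps, start, end):
    """Count values with start <= t <= end; correct only if timestamps is sorted."""
    return bisect_right(timestamps, end) - bisect_left(timestamps, start)


unsorted_ts = [5, 1, 4, 2, 3]
sorted_ts = sorted(unsorted_ts)  # what sorting once at construction time guarantees

# Ground truth from a linear scan, which does not care about order.
truth = sum(4 <= t <= 5 for t in unsorted_ts)

on_sorted = count_in_range(sorted_ts, 4, 5)      # matches the linear scan
on_unsorted = count_in_range(unsorted_ts, 4, 5)  # silently disagrees with it
```

Binary search raises no error on unsorted input; it just returns a wrong answer, which is why sorting once up front is a sensible place to enforce the invariant.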

Copilot · 2025-05-02T05:54:11Z

pyhealth/tasks/readmission_30days_mimic4.py

 from .base_task import BaseTask


 class Readmission30DaysMIMIC4(BaseTask):
-    """Task for predicting 30-day readmission using MIMIC-IV data."""
+    """
+    Task for predicting 30-day readmission using MIMIC-IV data.


[nitpick] Consider enhancing the task docstring by including usage examples or linking to external documentation to help future developers understand the expected inputs and outputs.

I think we can later create local copies of the Google Colab notebooks and upload them to GitHub.


jhnwu3

Works for the demos we have so far.

zzachw added 2 commits May 2, 2025 00:42

Refactor InHospitalMortalityMIMIC4 and Readmission30DaysMIMIC4 for improved performance and clarity (80be4f5)

zzachw requested a review from Copilot May 2, 2025 05:53

Copilot AI reviewed May 2, 2025


zzachw added the labels core (Core functionality: Patient API, BaseDataset, event stream format, etc.) and infra (Infrastructure: data loading, caching, pipelines) May 2, 2025

zzachw linked an issue May 2, 2025 that may be closed by this pull request

Improve iter_patients() speed #331

Closed

zzachw requested a review from jhnwu3 May 2, 2025 19:08

Fix bug in get_events to ensure event_type is only asserted when filters are provided (6ee6e72)

jhnwu3 approved these changes May 3, 2025


zzachw merged commit 9305a6f into master May 8, 2025

zzachw deleted the feat/fast_set_task branch May 8, 2025 02:51

zzachw mentioned this pull request May 9, 2025

Fix AssertionError: event_type must be provided if filters are provided. #388

Closed
