2-3× Speedup for BaseDataset.set_task() #378

Merged: zzachw merged 3 commits into master from feat/fast_set_task on May 8, 2025

@zzachw (Collaborator) commented May 2, 2025

Main: 2× Speedup for BaseDataset.set_task()

This PR addresses a major bottleneck in BaseDataset.set_task(): the repeated calls to Patient.get_events().

  • Fast time range filtering via binary search on sorted timestamps (O(N) → O(log N))
  • Efficient event type filtering using pre-built lookups (O(N) → O(1))
  • Reduced timestamp precision from microseconds to milliseconds for lower processing overhead
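
The time range filtering described above can be sketched with the standard library alone. This is a minimal illustration, not the PR's actual implementation: slice_by_time is a hypothetical helper, and the inclusive handling of both bounds is an assumption. The only requirement is that the timestamps are already sorted in ascending order.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def slice_by_time(timestamps, start=None, end=None):
    """Return (lo, hi) so that timestamps[lo:hi] lies within [start, end].

    Hypothetical helper for illustration. Assumes `timestamps` is sorted
    ascending; a bound of None means "unbounded on that side".
    """
    # bisect_left finds the first position whose value is >= start.
    lo = 0 if start is None else bisect_left(timestamps, start)
    # bisect_right finds the first position whose value is > end.
    hi = len(timestamps) if end is None else bisect_right(timestamps, end)
    return lo, hi


timestamps = [
    datetime(2025, 1, 1, 8),
    datetime(2025, 1, 1, 12),
    datetime(2025, 1, 2, 9),
    datetime(2025, 1, 3, 7),
]

lo, hi = slice_by_time(
    timestamps,
    start=datetime(2025, 1, 1, 12),
    end=datetime(2025, 1, 2, 23),
)
window = timestamps[lo:hi]  # the two events on Jan 1 12:00 and Jan 2 09:00
```

Each bound costs O(log N) comparisons, and the matching rows are then a contiguous slice, so no per-row predicate has to be evaluated.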

Runtime improvements on full MIMIC-IV:

  • InHospitalMortalityMIMIC4: 40 min → 20 min
  • Readmission30DaysMIMIC4: 30 min → 14 min
  • MIMIC4EDBenchmark (to be introduced in the next PR): 7 hr → 48 min → 20 min

Should solve issue #331

Other Changes

  1. Set num_workers=1 by default in set_task() to reduce peak memory usage

  2. Removed propagation of dev flag to child MIMIC-IV dataset classes to ensure consistent patient cohorts across sub-datasets

  3. Refactored InHospitalMortalityMIMIC4 and Readmission30DaysMIMIC4 for improved performance and clarity

zzachw added 2 commits May 2, 2025 00:42
- Add fast time range filtering via binary search on sorted timestamps
- Add efficient event type filtering using pre-built index lookups
- Reduce timestamp precision from microseconds to milliseconds
- Set default num_workers=1 in set_task for better memory control
- Remove unused dev flag from child MIMIC4 dataset classes
@zzachw zzachw requested a review from Copilot May 2, 2025 05:53
@Copilot Copilot AI left a comment

Pull Request Overview

This PR significantly improves the performance of the dataset task methods by optimizing event filtering and reducing unnecessary overhead. Key changes include:

  • Faster event filtering via binary search and pre-built lookups.
  • Adjustments to multi-threading default settings in sample generation.
  • Refinements to documentation and code structure across tasks and dataset initialization.
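
As a rough illustration of the pre-built lookup idea, the sketch below groups row positions by event type once, so that later queries become a dictionary lookup instead of a scan. The names build_type_index and event_type, and the dict-per-event layout, are assumptions made for this example and are not taken from the PR.

```python
from collections import defaultdict


def build_type_index(events):
    """Map each event type to the list of row positions holding that type.

    Hypothetical helper: built once in O(N), after which finding the rows
    for a given type is a single dictionary lookup.
    """
    index = defaultdict(list)
    for position, event in enumerate(events):
        index[event["event_type"]].append(position)
    return dict(index)


events = [
    {"event_type": "admissions", "value": "A1"},
    {"event_type": "labevents", "value": "L1"},
    {"event_type": "labevents", "value": "L2"},
    {"event_type": "diagnoses_icd", "value": "D1"},
]

type_index = build_type_index(events)
lab_rows = [events[i] for i in type_index.get("labevents", [])]
missing_rows = type_index.get("prescriptions", [])  # unknown type: no rows
```

Locating the group is O(1); materializing the matching rows still takes time proportional to the number of matches.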

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pyhealth/tasks/readmission_30days_mimic4.py | Updated task docstring and refactored event code extraction using Polars |
| pyhealth/tasks/in_hospital_mortality_mimic4.py | Updated pivot settings and improved docstring clarity |
| pyhealth/datasets/mimic4.py | Removed dev flag propagation to child datasets with added clarifying comments |
| pyhealth/datasets/base_dataset.py | Revised num_workers default and timestamp casting with ms precision |
| pyhealth/data/data.py | Replaced regular filtering with fast filtering using binary search techniques |
Comments suppressed due to low confidence (1)

pyhealth/datasets/base_dataset.py:266

  • [nitpick] Please confirm that using millisecond precision for timestamps meets all necessary requirements for event ordering, especially in cases where events occur in rapid succession.
timestamp_expr.cast(pl.Datetime(time_unit="ms")).alias("timestamp"),
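
The ordering concern raised in this nitpick can be made concrete with a small standard-library sketch. It assumes that reducing precision truncates (rather than rounds) the sub-millisecond part; that behavior, and the truncate_to_ms helper, are assumptions for illustration and are not confirmed by the PR.

```python
from datetime import datetime


def truncate_to_ms(ts):
    """Drop the sub-millisecond part of a datetime (assumed truncation)."""
    return ts.replace(microsecond=ts.microsecond // 1000 * 1000)


# Two events 500 microseconds apart, within the same millisecond.
first = datetime(2025, 5, 2, 0, 42, 0, 123400)
second = datetime(2025, 5, 2, 0, 42, 0, 123900)

first_ms = truncate_to_ms(first)
second_ms = truncate_to_ms(second)
collapsed = first_ms == second_ms  # distinct at us precision, equal at ms

# With equal keys, a stable sort keeps the input order, so the relative
# order of such events depends on how the rows arrived, not on their
# true microsecond timestamps.
rows = [("second", second), ("first", first)]
by_ms = [name for name, ts in sorted(rows, key=lambda r: truncate_to_ms(r[1]))]
by_us = [name for name, ts in sorted(rows, key=lambda r: r[1])]
```

Whether this matters depends on the source data: if the recorded timestamps never carry sub-millisecond detail, nothing is lost.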

@@ -129,37 +169,39 @@ def get_events(
Union[pl.DataFrame, List[Event]]: Filtered events as a DataFrame
or a list of Event objects.

Copilot AI May 2, 2025

[nitpick] Consider adding a brief note in the function docstring clarifying that the fast filtering approach requires the timestamp column to be pre-sorted, as this requirement is crucial for the binary search optimization to work correctly.

Suggested change
- or a list of Event objects.
+ or a list of Event objects.
+ Note:
+     The fast filtering approach used in this method assumes that the timestamp column
+     is pre-sorted. Ensure that the data source is sorted by timestamp for correct results.

@zzachw (Collaborator, Author) replied:

We always sort by timestamp in __init__.
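
The pattern described in this reply, sorting once at construction so that every later query can rely on binary search, might look roughly like the sketch below. SortedEvents is a hypothetical container written for illustration; it is not PyHealth's Patient class, and the (timestamp, payload) tuple layout is an assumption.

```python
from bisect import bisect_left, bisect_right


class SortedEvents:
    """Hypothetical container that enforces the sorted-timestamp invariant."""

    def __init__(self, events):
        # Sort once up front so callers never need to pre-sort their input.
        self._events = sorted(events, key=lambda event: event[0])
        self._timestamps = [event[0] for event in self._events]

    def between(self, start, end):
        """Return events with start <= timestamp <= end via binary search."""
        lo = bisect_left(self._timestamps, start)
        hi = bisect_right(self._timestamps, end)
        return self._events[lo:hi]


# Input arrives unsorted; integers stand in for timestamps here.
store = SortedEvents([(30, "c"), (10, "a"), (20, "b")])
result = store.between(10, 20)
```

Keeping the sort inside the constructor means the precondition of the binary search is established in one place rather than being left to each caller.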

  from .base_task import BaseTask


  class Readmission30DaysMIMIC4(BaseTask):
-     """Task for predicting 30-day readmission using MIMIC-IV data."""
+     """
+     Task for predicting 30-day readmission using MIMIC-IV data.

Copilot AI May 2, 2025

[nitpick] Consider enhancing the task docstring by including usage examples or linking to external documentation to help future developers understand the expected inputs and outputs.

@zzachw (Collaborator, Author) replied:

I think we can later create local copies of the Google Colab notebooks and upload them to GitHub.

@zzachw added the labels core (Core functionality: Patient API, BaseDataset, event stream format, etc.) and infra (Infrastructure: data loading, caching, pipelines) on May 2, 2025
@zzachw zzachw linked an issue May 2, 2025 that may be closed by this pull request
@zzachw zzachw requested a review from jhnwu3 May 2, 2025 19:08
@jhnwu3 (Collaborator) left a comment:

Works for the demos we have so far.

@zzachw zzachw merged commit 9305a6f into master May 8, 2025
@zzachw zzachw deleted the feat/fast_set_task branch May 8, 2025 02:51
Successfully merging this pull request may close these issues.

Improve iter_patients() speed