2-3× Speedup for BaseDataset.set_task()
#378
Conversation
Refactor InHospitalMortalityMIMIC4 and Readmission30DaysMIMIC4 for improved performance and clarity
- Add fast time range filtering via binary search on sorted timestamps
- Add efficient event type filtering using pre-built index lookups
- Reduce timestamp precision from microseconds to milliseconds
- Set default num_workers=1 in set_task for better memory control
- Remove unused dev flag from child MIMIC4 dataset classes
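As a minimal sketch of the binary-search idea in the first bullet, the time range lookup can be done with Python's `bisect` on an already-sorted timestamp list; this is illustrative only, not the PR's actual implementation:

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

def time_range_slice(timestamps, start, end):
    """Return the half-open index slice [lo, hi) of events whose timestamp
    falls in [start, end], assuming `timestamps` is sorted ascending."""
    lo = bisect_left(timestamps, start)   # first index with timestamp >= start
    hi = bisect_right(timestamps, end)    # first index with timestamp > end
    return lo, hi

ts = [datetime(2020, 1, 1), datetime(2020, 1, 2), datetime(2020, 1, 3)]
print(time_range_slice(ts, datetime(2020, 1, 2), datetime(2020, 1, 2)))  # (1, 2)
```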
Pull Request Overview
This PR significantly improves the performance of the dataset task methods by optimizing event filtering and reducing unnecessary overhead. Key changes include:
- Faster event filtering via binary search and pre-built lookups.
- Adjustments to multi-threading default settings in sample generation.
- Refinements to documentation and code structure across tasks and dataset initialization.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pyhealth/tasks/readmission_30days_mimic4.py | Updated task docstring and refactored event code extraction using Polars |
| pyhealth/tasks/in_hospital_mortality_mimic4.py | Updated pivot settings and improved docstring clarity |
| pyhealth/datasets/mimic4.py | Removed dev flag propagation to child datasets with added clarifying comments |
| pyhealth/datasets/base_dataset.py | Revised num_workers default and timestamp casting with ms precision |
| pyhealth/data/data.py | Replaced regular filtering with fast filtering using binary search techniques |
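The pre-built event type lookup mentioned for pyhealth/data/data.py could look roughly like the following sketch; the class and field names here are hypothetical, not the actual pyhealth code:

```python
from collections import defaultdict

class EventTypeIndex:
    """Map each event type to the row indices where it occurs, so that
    per-type filtering is a dictionary lookup instead of a full scan."""

    def __init__(self, event_types):
        self._rows = defaultdict(list)
        for i, etype in enumerate(event_types):
            self._rows[etype].append(i)

    def rows_for(self, etype):
        return self._rows.get(etype, [])

index = EventTypeIndex(["labs", "diagnoses", "labs", "procedures"])
print(index.rows_for("labs"))  # [0, 2]
```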
Comments suppressed due to low confidence (1)
pyhealth/datasets/base_dataset.py:266
- [nitpick] Please confirm that using millisecond precision for timestamps meets all necessary requirements for event ordering, especially in cases where events occur in rapid succession.
timestamp_expr.cast(pl.Datetime(time_unit="ms")).alias("timestamp"),
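For context on the precision question, here is a standalone Polars example of the millisecond cast quoted above, assuming the column starts at the default microsecond precision:

```python
import polars as pl
from datetime import datetime

df = pl.DataFrame({"timestamp": [datetime(2020, 1, 1, 0, 0, 0, 123456)]})
# Polars stores Python datetimes at microsecond precision by default;
# casting to Datetime("ms") drops the sub-millisecond digits.
df = df.with_columns(
    pl.col("timestamp").cast(pl.Datetime(time_unit="ms")).alias("timestamp")
)
print(df.schema)  # {'timestamp': Datetime(time_unit='ms', time_zone=None)}
```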
@@ -129,37 +169,39 @@ def get_events(
        Union[pl.DataFrame, List[Event]]: Filtered events as a DataFrame
            or a list of Event objects.
[nitpick] Consider adding a brief note in the function docstring clarifying that the fast filtering approach requires the timestamp column to be pre-sorted, as this requirement is crucial for the binary search optimization to work correctly.
Suggested change:
-            or a list of Event objects.
+            or a list of Event objects.
+
+        Note:
+            The fast filtering approach used in this method assumes that the timestamp column
+            is pre-sorted. Ensure that the data source is sorted by timestamp for correct results.
We always sort the timestamp in init.
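For reference, "sort in init" can be as simple as sorting the event frame once when it is built; this is a sketch with an illustrative frame, not the actual `__init__` code:

```python
import polars as pl
from datetime import datetime

events = pl.DataFrame({
    "timestamp": [datetime(2020, 1, 3), datetime(2020, 1, 1), datetime(2020, 1, 2)],
    "event_type": ["labs", "diagnoses", "labs"],
})
# Sorting once up front is what lets later range queries use binary search.
events = events.sort("timestamp")
```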
from .base_task import BaseTask


class Readmission30DaysMIMIC4(BaseTask):
-    """Task for predicting 30-day readmission using MIMIC-IV data."""
+    """
+    Task for predicting 30-day readmission using MIMIC-IV data.
[nitpick] Consider enhancing the task docstring by including usage examples or linking to external documentation to help future developers understand the expected inputs and outputs.
I think we can later create local copies of the Google Colab notebooks and upload them to GitHub.
Works for the demos we have so far.
Main: 2× Speedup for `BaseDataset.set_task()`

This PR addresses a major bottleneck in `BaseDataset.set_task()`: the repeated calls to `Patient.get_events()`.

Runtime improvements on full MIMIC-IV:
- `InHospitalMortalityMIMIC4`: 40 min -> 20 min
- `Readmission30DaysMIMIC4`: 30 min -> 14 min
- `MIMIC4EDBenchmark` (to be introduced in the next PR): 7 hr -> 48 min -> 20 min

Should solve issue #331
Other Changes
- Set `num_workers=1` by default in `set_task()` to reduce peak memory usage
- Removed propagation of the `dev` flag to child MIMIC-IV dataset classes to ensure consistent patient cohorts across sub-datasets
- Refactored `InHospitalMortalityMIMIC4` and `Readmission30DaysMIMIC4` for improved performance and clarity
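A rough usage sketch of the new default; the `MIMIC4Dataset` constructor arguments below are placeholders and may not match the exact signature:

```python
from pyhealth.datasets import MIMIC4Dataset
from pyhealth.tasks import Readmission30DaysMIMIC4

dataset = MIMIC4Dataset(root="/path/to/mimic-iv")  # placeholder path/arguments
task = Readmission30DaysMIMIC4()

# num_workers now defaults to 1 to keep peak memory low;
# pass a larger value explicitly if memory allows.
samples = dataset.set_task(task)
samples_parallel = dataset.set_task(task, num_workers=4)
```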