Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Narwhals implementation of from_dataframe and performance benchmark #2661

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

authierj
Copy link
Contributor

@authierj authierj commented Jan 31, 2025

Checklist before merging this PR:

  • Mentioned all issues that this PR fixes or addresses.
  • Summarized the updates of this PR under Summary.
  • Added an entry under Unreleased in the Changelog.

Fixes #2635.

Summary

A first draft of from_dataframe has been adapted to work with any dataframe. This is done using narwhals and the function is called from_narwhals_dataframe. In order to test the performance of the method, a file narwhals_test_time.py has been added to the pull request.
It seems that from_narwhals_dataframe is ~3x slower than from_dataframe.

Other Information

Copy link

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for giving this a go!

I've left a couple of comments

I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look

darts/timeseries.py Outdated Show resolved Hide resolved
darts/timeseries.py Outdated Show resolved Hide resolved
@authierj
Copy link
Contributor Author

Hi @MarcoGorelli ,

Thanks for already looking at this and for your insights!

@authierj
Copy link
Contributor Author

authierj commented Feb 3, 2025

Hi @MarcoGorelli,

I investigated the issue, and it appears that the .to_list() call is not responsible for the slowdown. However, the call series_df.to_numpy()[:, :, np.newaxis] on line 906 is very slow. The investigation is going on!

Copy link
Contributor

@FBruzzesi FBruzzesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂

darts/timeseries.py Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, this is the script you are using to estimate run time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes exactly!

time_index = time_col_vals.to_list()

xa = xr.DataArray(
series_df.to_numpy()[:, :, np.newaxis],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do some additional check when converting from pandas to numpy. However I don't think that's enough to justify a 3x slow down.

Copy link
Contributor

@FBruzzesi FBruzzesi Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: Well, it actually looks like that is the culprit.

@MarcoGorelli, I was able to pin it down to the following code block via py-spy:

       to_convert = [
            key
            for key, val in self.schema.items()
            if val == dtypes.Datetime and val.time_zone is not None  # type: ignore[attr-defined]
        ]

which calls self.schema - which seems to be responsible for the majority of the overhead.

Copy link

@MarcoGorelli MarcoGorelli Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, thanks @FBruzzesi - I presume this is due to the infer_dtype call?

cuDF and Dask don't allow for storing arbirary objects, if they have .dtype 'object' then .schema just reports it as String. Maybe we should just do the same for pandas? .schema should be quick

alternatively, we should just pass fewer values to infer_dtype, that should be enough to sniff the correct dtype in most cases. and default to string if it can't be inferred

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been really useful, thanks @authierj for highlighting this area for improvement!

There's quite a few fastpaths we can use in Narwhals to reduce the overhead here

For a start, I hadn't realised how expensive pandas.DataFrame.__getitem__ when the dataframe is backed by a 2D numpy array

Will make some perf improvements in Narwhals, then we can revisit

Copy link

codecov bot commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 10.00000% with 36 lines in your changes missing coverage. Please review.

Project coverage is 93.96%. Comparing base (c48521c) to head (0041203).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
darts/timeseries.py 10.00% 36 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.23%   93.96%   -0.28%     
==========================================
  Files         141      141              
  Lines       15509    15590      +81     
==========================================
+ Hits        14615    14649      +34     
- Misses        894      941      +47     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Use this label to request a new feature improvement New feature or improvement
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

Add TimeSeries.from_polars
4 participants