Narwhals implementation of `from_dataframe` and performance benchmark #2661

authierj · 2025-01-31T09:25:02Z

Checklist before merging this PR:

Mentioned all issues that this PR fixes or addresses.
Summarized the updates of this PR under Summary.
Added an entry under Unreleased in the Changelog.

Summary

This PR adds a first draft of from_dataframe adapted to work with any dataframe library. It is implemented with narwhals and is called from_narwhals_dataframe. To measure its performance, the script narwhals_test_time.py has been added to the pull request.
Initial timings suggest that from_narwhals_dataframe is ~3x slower than from_dataframe.
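
For readers who want to reproduce this kind of comparison, the following is a minimal timing sketch using only the standard library. It is not the contents of narwhals_test_time.py; `baseline_build` and `candidate_build` are hypothetical stand-ins for the two constructors, and the helper names are assumptions made for illustration.

```python
import timeit


def best_time(fn, repeat=5, number=10):
    """Return the best per-call wall time in seconds over several repeats."""
    timings = timeit.repeat(fn, repeat=repeat, number=number)
    return min(timings) / number


def slowdown(candidate, baseline, **kwargs):
    """Ratio of candidate time to baseline time; above 1 means slower."""
    return best_time(candidate, **kwargs) / best_time(baseline, **kwargs)


# Hypothetical stand-ins: the candidate does about three times the work.
def baseline_build():
    return [i * 2 for i in range(10_000)]


def candidate_build():
    return [[i * 2 for i in range(10_000)] for _ in range(3)]


ratio = slowdown(candidate_build, baseline_build)
print(f"candidate is {ratio:.1f}x the baseline time")
```

Taking the minimum over several repeats reduces noise from other processes, which matters when the reported ratio is as small as 3x.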

Other Information

MarcoGorelli

thanks for giving this a go!

I've left a couple of comments

I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look

darts/timeseries.py

authierj · 2025-01-31T12:56:31Z

Hi @MarcoGorelli,

Thanks for already looking at this and for your insights!

authierj · 2025-02-03T15:54:14Z

Hi @MarcoGorelli,

I investigated the issue, and it appears that the .to_list() call is not responsible for the slowdown. Instead, the call series_df.to_numpy()[:, :, np.newaxis] on line 906 is very slow. I'm still investigating!
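
As a side note on that expression, here is a small NumPy-only sketch of what the `[:, :, np.newaxis]` indexing does. The array contents and the reading of the axes (time steps, columns, and a trailing axis of length one) are assumptions for illustration. The indexing step returns a view rather than a copy, so any cost would have to come from the conversion that produces the 2D array.

```python
import numpy as np

# Hypothetical data: 4 time steps, 3 value columns.
values_2d = np.arange(12, dtype=float).reshape(4, 3)

# Appending np.newaxis adds a trailing axis of length one.
values_3d = values_2d[:, :, np.newaxis]

print(values_2d.shape)  # (4, 3)
print(values_3d.shape)  # (4, 3, 1)

# Basic indexing with np.newaxis returns a view, so no data is copied.
print(np.shares_memory(values_2d, values_3d))  # True
```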

FBruzzesi

Thanks @authierj for the effort on this! We really appreciate it! I left a few very minor comments 😂

darts/timeseries.py

FBruzzesi · 2025-02-03T18:56:54Z

narwhals_test_time.py

Just to make sure, this is the script you are using to estimate run time?

authierj · 2025-02-04

Yes, exactly!

FBruzzesi · 2025-02-03T18:57:54Z

darts/timeseries.py

+            time_index = time_col_vals.to_list()
+
+        xa = xr.DataArray(
+            series_df.to_numpy()[:, :, np.newaxis],


We do some additional checks when converting from pandas to numpy. However, I don't think that's enough to justify a 3x slowdown.

Edit: Well, it actually looks like that is the culprit.

@MarcoGorelli, I was able to pin it down to the following code block via py-spy:

    to_convert = [
        key
        for key, val in self.schema.items()
        if val == dtypes.Datetime and val.time_zone is not None  # type: ignore[attr-defined]
    ]

This block calls self.schema, which seems to be responsible for the majority of the overhead.
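
To illustrate how a profiler can pin overhead on a property access like this, here is a self-contained sketch that uses the standard library's cProfile as a stand-in for py-spy. `ToyFrame` and `profile_report` are hypothetical names, and the schema logic is a deliberately expensive toy, not the Narwhals implementation.

```python
import cProfile
import io
import pstats


class ToyFrame:
    """Hypothetical stand-in for a dataframe wrapper with a costly schema."""

    def __init__(self, columns):
        self._columns = columns

    @property
    def schema(self):
        # Deliberately expensive: scans values to decide each column's kind.
        return {
            name: "str" if all(isinstance(v, str) for v in values) else "other"
            for name, values in self._columns.items()
        }

    def to_rows(self):
        # A cheap-looking comprehension that triggers the schema computation.
        text_cols = [name for name, kind in self.schema.items() if kind == "str"]
        rows = list(zip(*self._columns.values()))
        return text_cols, rows


def profile_report(func, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


frame = ToyFrame({"a": ["x"] * 50_000, "b": list(range(50_000))})
report = profile_report(frame.to_rows)
print(report)
```

In the printed report, the `schema` entry appears directly under `to_rows` when sorted by cumulative time, which is the kind of signal that points at a property doing more work than its call site suggests.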

MarcoGorelli · 2025-02-04

Interesting, thanks @FBruzzesi. I presume this is due to the infer_dtype call?

cuDF and Dask don't allow storing arbitrary objects: if a column has .dtype 'object', then .schema just reports it as String. Maybe we should do the same for pandas? .schema should be quick.

Alternatively, we could just pass fewer values to infer_dtype. That should be enough to sniff the correct dtype in most cases, and we can default to string if it can't be inferred.
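
A rough sketch of that second idea, assuming pandas is installed: `pandas.api.types.infer_dtype` is a real pandas function, while `sniff_object_dtype`, the sample size of 100, and the set of labels treated as ambiguous are assumptions made for illustration and not what Narwhals ended up doing.

```python
import pandas as pd
from pandas.api.types import infer_dtype


def sniff_object_dtype(series, n_values=100):
    """Guess the logical dtype of an object column from its first few values."""
    sample = series.head(n_values)
    inferred = infer_dtype(sample, skipna=True)
    # Hypothetical fallback policy: anything ambiguous is reported as string.
    if inferred in {"mixed", "mixed-integer", "empty", "unknown-array"}:
        return "string"
    return inferred


strings = pd.Series(["a", "b", "c"] * 10_000, dtype=object)
mixed = pd.Series(["a", 1, 2.5] * 10_000, dtype=object)
integers = pd.Series([1, 2, 3] * 10_000, dtype=object)

print(sniff_object_dtype(strings))
print(sniff_object_dtype(mixed))
print(sniff_object_dtype(integers))
```

The trade-off is that a column whose first values are uniform but whose later values differ would be misclassified, which is presumably why the suggestion is framed as correct "in most cases".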

MarcoGorelli · 2025-02-04

This has been really useful, thanks @authierj for highlighting this area for improvement!

There's quite a few fastpaths we can use in Narwhals to reduce the overhead here

For a start, I hadn't realised how expensive pandas.DataFrame.__getitem__ is when the dataframe is backed by a 2D numpy array

Will make some perf improvements in Narwhals, then we can revisit

codecov · 2025-02-04T09:24:28Z

Codecov Report

Attention: Patch coverage is 10.00000% with 36 lines in your changes missing coverage. Please review.

Project coverage is 93.96%. Comparing base (c48521c) to head (0041203).
Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
darts/timeseries.py	10.00%	36 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.23%   93.96%   -0.28%     
==========================================
  Files         141      141              
  Lines       15509    15590      +81     
==========================================
+ Hits        14615    14649      +34     
- Misses        894      941      +47

“authierj” and others added 2 commits January 31, 2025 10:16

narwhals implementation for and test benchmark

28a9298

Merge branch 'master' into feature/add_timeseries_from_polars

6382082

MarcoGorelli reviewed Jan 31, 2025

MarcoGorelli mentioned this pull request Jan 31, 2025

enh: add more dtype methods: is_integer, is_signed_integer, is_unsigned_integer, is_float, is_temporal narwhals-dev/narwhals#1899

Closed

FBruzzesi reviewed Feb 3, 2025

changes from MarcoGorelli incorporated

0041203

MarcoGorelli mentioned this pull request Feb 4, 2025

perf: use fastpath in DataFrame.to_numpy for pandas, improve performance for DataFrame.schema for pandas, use fewer values to sniff dtype for pandas objects narwhals-dev/narwhals#1929

Merged

dennisbader assigned authierj Feb 4, 2025

dennisbader added the feature request and improvement labels Feb 4, 2025

“authierj” and others added 2 commits February 6, 2025 09:27

improvement thanks to reviewers

576e88e

Merge branch 'master' into feature/add_timeseries_from_polars

e013a42
