Narwhals implementation of from_dataframe and performance benchmark #2661
Thanks for giving this a go! I've left a couple of comments.
I suspect the .to_list() calls may be responsible for the slow-down. I'll take a look.
Hi @MarcoGorelli , Thanks for already looking at this and for your insights! |
Hi @MarcoGorelli, I investigated the issue, and it appears that the |
Thanks @authierj for the effort on this! We really appreciate it! I left very non-relevant comments 😂
narwhals_test_time.py
Outdated
Just to make sure, this is the script you are using to estimate run time?
yes exactly!
time_index = time_col_vals.to_list()

xa = xr.DataArray(
    series_df.to_numpy()[:, :, np.newaxis],
We do some additional checks when converting from pandas to numpy. However, I don't think that's enough to justify a 3x slowdown.
Edit: Well, it actually looks like that is the culprit.
@MarcoGorelli, I was able to pin it down to the following code block via py-spy:
to_convert = [
    key
    for key, val in self.schema.items()
    if val == dtypes.Datetime and val.time_zone is not None  # type: ignore[attr-defined]
]
which calls self.schema, which seems to be responsible for the majority of the overhead.
Interesting, thanks @FBruzzesi - I presume this is due to the infer_dtype call?
cuDF and Dask don't allow for storing arbitrary objects; if they have .dtype 'object', then .schema just reports it as String. Maybe we should just do the same for pandas? .schema should be quick.
Alternatively, we should just pass fewer values to infer_dtype; that should be enough to sniff the correct dtype in most cases, and default to String if it can't be inferred.
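The second suggestion (pass fewer values to infer_dtype and default to string) could be sketched roughly as follows. This is only an illustration built on pandas' public pandas.api.types.infer_dtype; the helper name sniff_object_dtype, the n_values parameter, and the set of labels treated as inconclusive are assumptions made for this example, not Narwhals' actual implementation.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Labels from infer_dtype that this sketch treats as "could not be inferred".
# Which labels belong here is an assumption made for illustration.
INCONCLUSIVE = {"mixed", "mixed-integer", "empty"}


def sniff_object_dtype(series: pd.Series, n_values: int = 100) -> str:
    """Hypothetical helper: infer a column's kind from its first n_values entries."""
    # Only the head of the column is scanned, so the cost no longer
    # grows with the length of the column.
    inferred = infer_dtype(series.head(n_values), skipna=True)
    # Default to "string" when the sample is not conclusive.
    return "string" if inferred in INCONCLUSIVE else inferred


strings = pd.Series(["a", "b"] * 1_000, dtype=object)
mixed = pd.Series([1, "a", 2.5], dtype=object)
print(sniff_object_dtype(strings))
print(sniff_object_dtype(mixed))
```

The trade-off is that a column whose first values look uniform but whose later values differ would be misreported, which is why this only covers "most cases".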
This has been really useful, thanks @authierj for highlighting this area for improvement!
There are quite a few fast paths we can use in Narwhals to reduce the overhead here.
For a start, I hadn't realised how expensive pandas.DataFrame.__getitem__ is when the dataframe is backed by a 2D numpy array.
Will make some perf improvements in Narwhals, then we can revisit.
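One way to see the cost of per-column access for yourself is a small timing script like the one below. It is a sketch under assumptions: the frame size, the column names, and the two helper functions are made up for illustration, and it does not reproduce what Narwhals does internally. It only compares pulling every column through DataFrame.__getitem__ with converting the whole frame in one call.

```python
import timeit

import numpy as np
import pandas as pd

# A frame built from a single 2D numpy array (sizes chosen arbitrarily).
values = np.random.default_rng(0).normal(size=(1_000, 200))
df = pd.DataFrame(values, columns=[f"c{i}" for i in range(values.shape[1])])


def per_column() -> np.ndarray:
    # One DataFrame.__getitem__ call per column, then reassemble.
    return np.stack([df[col].to_numpy() for col in df.columns], axis=1)


def whole_frame() -> np.ndarray:
    # A single conversion of the whole frame.
    return df.to_numpy()


# Both routes produce the same array; only the cost differs.
assert np.array_equal(per_column(), whole_frame())

t_cols = timeit.timeit(per_column, number=20)
t_frame = timeit.timeit(whole_frame, number=20)
print(f"per-column: {t_cols:.4f}s, whole frame: {t_frame:.4f}s")
```

The actual numbers depend on the pandas version and the machine, so no expected timings are given here.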
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2661      +/-   ##
==========================================
- Coverage   94.23%   93.96%   -0.28%
==========================================
  Files         141      141
  Lines       15509    15590      +81
==========================================
+ Hits        14615    14649      +34
- Misses        894      941      +47
☔ View full report in Codecov by Sentry.
Checklist before merging this PR:
Fixes #2635.
Summary
A first draft of from_dataframe has been adapted to work with any dataframe. This is done using narwhals, and the function is called from_narwhals_dataframe. In order to test the performance of the method, a file narwhals_test_time.py has been added to the pull request.
It seems that from_narwhals_dataframe is ~3x slower than from_dataframe.
Other Information
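For readers who want to reproduce a comparison like the one in the summary, a timing harness could look roughly like this. It is not the contents of narwhals_test_time.py: the time_call helper and the two stand-in functions are hypothetical, and in a real run they would be replaced by calls to from_dataframe and from_narwhals_dataframe on the same input.

```python
import statistics
import timeit
from typing import Callable


def time_call(fn: Callable[[], object], repeats: int = 5, number: int = 10) -> float:
    """Return the median time in seconds of a single call to fn."""
    runs = timeit.repeat(fn, repeat=repeats, number=number)
    return statistics.median(runs) / number


def baseline() -> int:
    # Hypothetical stand-in for the existing from_dataframe call.
    return sum(range(10_000))


def candidate() -> int:
    # Hypothetical stand-in for the from_narwhals_dataframe call.
    return sum(list(range(10_000)))


ratio = time_call(candidate) / time_call(baseline)
print(f"candidate / baseline: {ratio:.2f}x")
```

Using the median over several repeats makes the ratio less sensitive to a single slow run than a one-off measurement would be.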