ENH: Introduce pandas.col
#62103
When this is added, and then released,
One comment is that I'm not sure it will support some basic arithmetic, such as:
result = df.assign(addcon=pd.col("a") + 10)
Or alignment with other series:
b = df["b"]  # or this could be from a different DF
result = df.assign(add2=pd.col("a") + b)
Also, don't you need to add some tests?
Thanks for taking a look!
Yup, they're both supported:
In [8]: df = pd.DataFrame({'a': [1,2,3]})
In [9]: s = pd.Series([90,100,110], index=[2,1,0])
In [10]: df.assign(
    ...:     b=pd.col('a')+10,
    ...:     c=pd.col('a')+s,
    ...: )
Out[10]:
   a   b    c
0  1  11  111
1  2  12  102
2  3  13   93
😄 Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change
I don't see it as a "change", more like an addition to the API that makes it easier to use. The existing way of using
Is assign the main use case?
Currently it would only work in places that accept
Getting it to work in
I haven't seen any objections, so I'll work on adding docs + user guide + tests. If anyone intends to block this then I'd appreciate it if you could speak out as soon as possible (also going to cc @mroeschke here in case you were against this)
I would be OK adding this API.
Just a few questions - and I caught one typo in the docs
caught one issue in the docs. I'm good with this. Will hope that others will review/approve/merge.
Co-authored-by: Irv Lustig <[email protected]>
Will let someone else review/approve/merge.
Haven't looked at the implementation, but big +1 from me.
pandas/tests/test_col.py
Outdated
@@ -31,6 +32,8 @@
        (pd.col("a") < 1, [False, False], "(col('a') < 1)"),
        (pd.col("a") <= 1, [True, False], "(col('a') <= 1)"),
        (pd.col("a") == 1, [True, False], "(col('a') == 1)"),
        (np.log(pd.col("a")), [0.0, 0.6931471805599453], "log(col('a'))"),
log is dangerous to use in a test because the floating point value could be different on different platforms. Maybe use np.min() instead.
sure, thanks! have replaced with np.power, so then we also test passing in a non-expression argument (2)
Great! LGTM
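For readers following the exchange above, here is a minimal sketch of the concern. It uses only the standard library (math.log and the ** operator as stand-ins for np.log and np.power, which is an assumption made to keep the example dependency-free) and is not the test that was committed. It shows why a log expectation wants a tolerance while an integer power can be compared exactly.

```python
import math

values = [1, 2]

# Results of transcendental functions such as log can differ in the last
# bits between platforms or math libraries, so compare with a tolerance
# instead of using ==.
logs = [math.log(v) for v in values]
expected_logs = [0.0, 0.6931471805599453]
logs_close = all(
    math.isclose(got, want, rel_tol=1e-12, abs_tol=1e-12)
    for got, want in zip(logs, expected_logs)
)

# Integer powers are computed exactly, so an equality check is safe.
squares = [v ** 2 for v in values]
squares_exact = squares == [1, 4]

print(logs_close, squares_exact)  # True True
```

The second argument (2) plays the same role as in the comment above: it is a plain scalar, not an expression.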
pandas/core/col.py
Outdated
max_cols = 10
if len(columns_list) > max_cols:
    columns_hint = columns_list[:max_cols] + ["..."]
else:
    columns_hint = columns_list
Might want to base this on the length of the column names. If the column names were all of length 15 on average, you'd have a very long message.
sure, done, thanks (using a totally arbitrary limit of 90, but i think it looks good)
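A rough sketch of what a length-based limit could look like, for anyone reading along. The helper name build_columns_hint, its signature, and the way the separator is counted are all hypothetical; only the limit of 90 comes from the comment above, and this is not the code that was merged.

```python
def build_columns_hint(columns, max_len=90):
    """Hypothetical helper: keep names until their combined repr exceeds max_len."""
    hint = []
    used = 0
    for name in columns:
        piece = repr(name)
        # Count the ", " separator for every entry after the first.
        cost = len(piece) + (2 if hint else 0)
        if used + cost > max_len:
            # Budget exhausted: mark the truncation and stop.
            hint.append("...")
            break
        hint.append(name)
        used += cost
    return hint


short_hint = build_columns_hint(["a", "b", "c"])
long_names = [f"column_name_{i:03d}" for i in range(20)]  # 15 characters each
long_hint = build_columns_hint(long_names)

print(short_hint)  # ['a', 'b', 'c']
print(long_hint)   # first four names followed by '...'
```

With a fixed count of 10, the twenty 15-character names above would produce a hint well over 150 characters; the character budget cuts it off after four names.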
Thanks @MarcoGorelli
Very nice @MarcoGorelli
xref @jbrockmendel's comment #56499 (comment)
I'd also discussed this with @phofl, @WillAyd, and @jorisvandenbossche (who originally showed us something like this in Basel at EuroSciPy 2023)
Demo:
Output:
NumPy ufuncs are also supported:
Expressions also get pretty-printed, demo:
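As a rough mental model only, a deferred expression can be pictured as a function of the frame paired with a readable description. The toy below is not pandas code: ToyExpr and toy_col are hypothetical names, a plain dict of lists stands in for a DataFrame, and only + and < are implemented. It reproduces the repr style seen in the test cases quoted in the review, such as (col('a') < 1).

```python
import operator


class ToyExpr:
    """Hypothetical, simplified stand-in for a deferred column expression."""

    def __init__(self, func, repr_str):
        self._func = func          # callable: frame -> list of values
        self._repr_str = repr_str  # readable description of the computation

    def __call__(self, frame):
        # Evaluation happens only here, once a frame is supplied.
        return self._func(frame)

    def __repr__(self):
        return self._repr_str

    def _binary(self, op, symbol, other):
        def func(frame):
            left = self(frame)
            # Scalars are broadcast; other expressions are evaluated lazily.
            right = other(frame) if isinstance(other, ToyExpr) else [other] * len(left)
            return [op(x, y) for x, y in zip(left, right)]

        return ToyExpr(func, f"({self!r} {symbol} {other!r})")

    def __add__(self, other):
        return self._binary(operator.add, "+", other)

    def __lt__(self, other):
        return self._binary(operator.lt, "<", other)


def toy_col(name):
    """Hypothetical counterpart of a column reference."""
    return ToyExpr(lambda frame: frame[name], f"col({name!r})")


frame = {"a": [1, 2]}
expr = toy_col("a") + 10
mask = toy_col("a") < 1

print(repr(expr), expr(frame))  # (col('a') + 10) [11, 12]
print(repr(mask), mask(frame))  # (col('a') < 1) [False, False]
```

Because each operation builds its description at the same time as its function, the repr stays readable however the expression is composed, unlike the repr of a lambda.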
What's here should be enough for it to be usable. For the type hints to show up correctly, extra work should be done in pandas-stubs. But, I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (going to cc @Dr-Irv here too then)
As for the "col" name, that's what PySpark, Polars, Daft, and DataFusion use, so I think it'd make sense to follow the convention
I'm opening as a request for comments. Would people want this API to be part of pandas?
This is ready for review
One of my main motivations for introducing it is that it avoids common issues with scoping. For example, if you use assign to increment two columns' values by 10 and try to write
df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')})
then you'll be in for a big surprise
whereas with pd.col, you get what you were probably expecting:
Further advantages:
<function __main__.<lambda>(df)
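The scoping surprise described above is ordinary Python late binding and can be reproduced without pandas. In this sketch a dict of lists stands in for the DataFrame (an assumption made to keep it dependency-free), and the default-argument fix shown is the usual workaround for plain lambdas, not something taken from this PR.

```python
# Toy stand-in for a DataFrame: column name -> list of values.
frame = {"a": [1, 2], "b": [10, 20]}

# Late binding: each lambda looks up `col` when it is called. By then the
# comprehension has finished, so `col` is "b" for every lambda.
late = {col: (lambda df: [v + 10 for v in df[col]]) for col in ("a", "b")}
late_result = {name: fn(frame) for name, fn in late.items()}
print(late_result)  # {'a': [20, 30], 'b': [20, 30]}

# Usual workaround: freeze the current value as a default argument.
bound = {
    col: (lambda df, col=col: [v + 10 for v in df[col]])
    for col in ("a", "b")
}
bound_result = {name: fn(frame) for name, fn in bound.items()}
print(bound_result)  # {'a': [11, 12], 'b': [20, 30]}
```

An expression object built inside the comprehension captures the column name at construction time, so it has no equivalent of the first result.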
Expected objections:
TODO:
Potential follow-ups (if there's interest):