
ENH: Introduce pandas.col #62103


Merged
merged 33 commits into from
Aug 22, 2025

Conversation

MarcoGorelli
Member

@MarcoGorelli MarcoGorelli commented Aug 13, 2025

xref @jbrockmendel 's comment #56499 (comment)

I'd also discussed this with @phofl , @WillAyd , and @jorisvandenbossche (who originally showed us something like this in Basel at euroscipy 2023)

Demo:

import pandas as pd
from datetime import datetime

df = pd.DataFrame(
    {
        "a": [1, -2, 3],
        "b": [4, 5, 6],
        "c": [datetime(2020, 1, 1), datetime(2025, 4, 2), datetime(2026, 12, 3)],
        "d": ["fox", "beluga", "narwhal"],
    }
)

result = df.assign(
    # The usual Series methods are supported
    a_abs=pd.col("a").abs(),
    # And can be combined
    a_centered=pd.col("a") - pd.col("a").mean(),
    a_plus_b=pd.col("a") + pd.col("b"),
    # Namespaces are supported too
    c_year=pd.col("c").dt.year,
    c_month_name=pd.col("c").dt.strftime("%B"),
    d_upper=pd.col("d").str.upper(),
).loc[pd.col("a_abs") > 1]  # This works in `loc` too

print(result)

Output:

   a  b          c        d  a_abs  a_centered  a_plus_b  c_year c_month_name  d_upper
1 -2  5 2025-04-02   beluga      2   -2.666667         3    2025        April   BELUGA
2  3  6 2026-12-03  narwhal      3    2.333333         9    2026     December  NARWHAL

NumPy ufuncs are also supported:

In [6]: df.assign(a_log = np.log(pd.col('a')))
Out[6]: 
   a     a_log
0  1  0.000000
1  2  0.693147
2  3  1.098612

Expressions also get pretty-printed, demo:

In [4]: pd.col('value')
Out[4]: col('value')

In [5]: pd.col('value') * pd.col('weight')
Out[5]: (col('value') * col('weight'))

In [6]: (pd.col('value') - pd.col('value').mean()) / pd.col('value').std()
Out[6]: ((col('value') - col('value').mean()) / col('value').std())

In [7]: pd.col('timestamp').dt.strftime('%B')
Out[7]: col('timestamp').dt.strftime('%B')

What's here should be enough for it to be usable. For the type hints to show up correctly, extra work should be done in pandas-stubs. But, I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (going to cc @Dr-Irv here too then)

As for the "col" name, that's what PySpark, Polars, Daft, and Datafusion use, so I think it'd make sense to follow the convention


I'm opening as a request for comments. Would people want this API to be part of pandas? This is ready for review

One of my main motivations for introducing it is that it avoids common scoping issues. For example, if you use assign to increment two columns' values by 10 and write df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')}), you'll be in for a surprise: each lambda looks up col when it is called, after the loop has finished, so every lambda sees the last value

In [19]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [20]: df.assign(**{col: lambda df: df[col] + 10 for col in ('a', 'b')})
Out[20]:
    a   b
0  14  14
1  15  15
2  16  16

whereas with pd.col, you get what you were probably expecting:

In [4]: df.assign(**{col: pd.col(col) + 10 for col in ('a', 'b')})
Out[4]: 
    a   b
0  11  14
1  12  15
2  13  16
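The difference comes from Python's late binding of closures: each lambda captures the variable col itself, not its value at definition time. A minimal sketch contrasting the surprising version with the classic default-argument workaround (no pd.col needed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Each lambda closes over the loop variable `col`, which is looked up at call
# time; by then the comprehension has finished and `col` is "b" for both.
surprising = df.assign(**{col: lambda df: df[col] + 10 for col in ("a", "b")})

# Classic workaround: bind the current value of `col` as a default argument.
fixed = df.assign(**{col: lambda df, c=col: df[c] + 10 for col in ("a", "b")})
```

pd.col sidesteps this entirely because the column name is captured as data inside the expression object rather than as a free variable in a closure.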

Further advantages:

  • expressions are introspectable so the repr can be made to look nice, whereas an anonymous lambda always renders as something like <function __main__.<lambda>(df)>
  • the syntax looks more modern and more aligned with modern tools

Expected objections:

  • this expands the pandas API even further. Sure, I don't disagree, but I think this is a common enough and longstanding enough request that it's worth expanding it for this

TODO:

  • tests, API docs, user guide. But first, I just wanted to get a feel for people's thoughts, and to see if anyone's opposed to it

Potential follow-ups (if there's interest):

  • serialise / deserialise expressions

@MarcoGorelli MarcoGorelli changed the title ENH: Introduce pandas.col RFC: Introduce pandas.col Aug 13, 2025
@Dr-Irv
Contributor

Dr-Irv commented Aug 13, 2025

For the type hints to show up correctly, extra work should be done in pandas-stubs. But, I think it should be possible to develop tooling to automate the Expr docs and types based on the Series ones (going to cc @Dr-Irv here too then)

When this is added, and then released, pandas-stubs can be updated with proper stubs.

One comment is that I'm not sure it will support some basic arithmetic, such as:

result = df.assign(addcon=pd.col("a") + 10)

Or alignment with other series:

b = df["b"]  # or this could be from a different DF
result = df.assign(add2=pd.col("a") + b)

Also, don't you need to add some tests??

@MarcoGorelli
Member Author

Thanks for taking a look!

One comment is that I'm not sure it will support some basic arithmetic [...] Or alignment with other series:

Yup, they're both supported:

In [8]: df = pd.DataFrame({'a': [1,2,3]})

In [9]: s = pd.Series([90,100,110], index=[2,1,0])

In [10]: df.assign(
    ...:     b=pd.col('a')+10,
    ...:     c=pd.col('a')+s,
    ...: )
Out[10]: 
   a   b    c
0  1  11  111
1  2  12  102
2  3  13   93

Also, don't you need to add some tests??

😄 Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

@Dr-Irv
Contributor

Dr-Irv commented Aug 13, 2025

Definitely, I just wanted to test the waters first, as I think this would be perceived as a significant API change

I don't see it as a "change", more like an addition to the API that makes it easier to use. The existing way of using df.assign(foo=lambda df: df["a"] + df["b"]) would still work, but df.assign(foo=pd.col("a") + pd.col("b")) is cleaner.

@jbrockmendel
Member

Is assign the main use case?

@MarcoGorelli
Member Author

Currently it would only work in places that accept DataFrame -> Series callables, which, as far as I know, is only DataFrame.assign and filtering with DataFrame.loc

Getting it to work in GroupBy.agg is more complex, but it is possible, albeit with some restrictions
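For reference, these are the existing callable-based forms that those two entry points accept today, and which pd.col expressions would stand in for (the frame and column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Filtering with .loc via a DataFrame -> boolean Series callable:
filtered = df.loc[lambda d: d["a"] > 1]

# Column assignment via a DataFrame -> Series callable:
assigned = df.assign(b=lambda d: d["a"] + 10)
```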

@MarcoGorelli MarcoGorelli marked this pull request as ready for review August 14, 2025 10:09
@MarcoGorelli
Member Author

MarcoGorelli commented Aug 15, 2025

I haven't seen any objections, so I'll work on adding docs + user guide + tests

If anyone intends to block this then I'd appreciate it if you could speak out as soon as possible (also going to cc @mroeschke here in case you were against this)

@mroeschke
Member

I would be OK adding this API.

Contributor

@Dr-Irv Dr-Irv left a comment


Just a few questions - and I caught one typo in the docs

Contributor

@Dr-Irv Dr-Irv left a comment


caught one issue in the docs. I'm good with this. Will hope that others will review/approve/merge.

Contributor

@Dr-Irv Dr-Irv left a comment


will let someone else review/approve/merge .

@rhshadrach
Member

Haven't looked at the implementation, but big +1 from me.

@@ -31,6 +32,8 @@
(pd.col("a") < 1, [False, False], "(col('a') < 1)"),
(pd.col("a") <= 1, [True, False], "(col('a') <= 1)"),
(pd.col("a") == 1, [True, False], "(col('a') == 1)"),
(np.log(pd.col("a")), [0.0, 0.6931471805599453], "log(col('a'))"),
Contributor


log is dangerous to use in a test because the floating point value could be different on different platforms.

Maybe use np.min() instead.
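One common way to keep such a test portable is a tolerance-based comparison rather than exact float equality, e.g. with np.testing.assert_allclose. A sketch (this is an alternative, not the approach ultimately taken in the PR, which switched the test to np.power):

```python
import numpy as np

# Exact reprs like 0.6931471805599453 can differ in the last bits across
# platforms and libm versions; compare with a tolerance instead.
result = np.log(np.array([1.0, 2.0]))
np.testing.assert_allclose(result, [0.0, 0.6931471805599453], rtol=1e-12)
```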

Member Author

@MarcoGorelli MarcoGorelli Aug 19, 2025


sure, thanks! have replaced with np.power, so then we also test passing in a non-expression argument (2)

Member

@mroeschke mroeschke left a comment


Great! LGTM

Comment on lines 264 to 268
max_cols = 10
if len(columns_list) > max_cols:
    columns_hint = columns_list[:max_cols] + ["..."]
else:
    columns_hint = columns_list
Contributor


Might want to base this on the length of the column names. If the column names were all of length 15 on average, you'd have a very long message.

Member Author

@MarcoGorelli MarcoGorelli Aug 19, 2025


sure, done, thanks (using a totally arbitrary limit of 90, but i think it looks good)
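A character-budget truncation along those lines could look like this hypothetical sketch (the `_columns_hint` helper and the budget of 90 are illustrative, not the actual pandas code):

```python
def _columns_hint(columns, max_chars=90):
    # Hypothetical sketch: stop adding column names to the hint once the
    # joined names would exceed a character budget, then append "...".
    hint, used = [], 0
    for name in columns:
        used += len(str(name)) + 2  # account for the ", " separator
        if used > max_chars:
            hint.append("...")
            break
        hint.append(str(name))
    return hint


# Short names all fit; long names get truncated with a trailing "...".
short = _columns_hint(["a", "b"])
long = _columns_hint(["x" * 40, "y" * 40, "z" * 40])
```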

@mroeschke mroeschke merged commit 188b2da into pandas-dev:main Aug 22, 2025
38 checks passed
@mroeschke
Member

Thanks @MarcoGorelli

@mroeschke mroeschke added this to the 3.0 milestone Aug 22, 2025
@jbrockmendel
Member

Very nice @MarcoGorelli
