fix(series): arithmetics for Series[Any] #1343

Open: wants to merge 2 commits into base: main

cmp0xff
Contributor

@cmp0xff cmp0xff commented Aug 22, 2025

This PR implements the ideas from #1274 (comment) and #1274 (comment).

  • Tests added: Please use assert_type() to assert the type of any return value

Collaborator

@Dr-Irv Dr-Irv left a comment

Only issues I see are with respect to the code inside of if TYPE_CHECKING_INVALID_USAGE that you added.

The concern here is that some of these lines will execute fine if the Series comes from a DataFrame and holds the correct type, even though statically we only see Series[Any]. And vice versa.

I think I prefer that when an operation like subtraction sometimes works and sometimes does not, and the inferred type of one of the operands is Series[Any], we detect that as a typing problem. But we need to be selective.

For example,

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": pd.to_datetime(["1/1/2025", "2/1/2025", "3/1/2025"])})
sa = df["a"]
sb = df["b"]
sa - pd.Timestamp("1/1/2024")  # fails at runtime
sb - pd.Timestamp("1/1/2024")  # works at runtime

Here sa and sb are Series[Any] (mypy) or Series[Unknown] (pyright). So the typing either has to accept both cases or reject both cases.

I think we have to be selective here, and probably disallow subtraction with untyped Series when the other argument is known to be time related (Timestamp, Timedelta and associated Series) or is a string or Series[str]. I think the current stubs are more permissive, but now I'm not sure that's the right thing to do.

Comment on lines +173 to +177
if TYPE_CHECKING_INVALID_USAGE:
    _0 = left_td - s
check(assert_type(left_ts - a, "TimedeltaSeries"), pd.Series, pd.Timedelta)
if TYPE_CHECKING_INVALID_USAGE:
    _1 = left_td - a
Collaborator

When you have TYPE_CHECKING_INVALID_USAGE, that means we should have # type: ignore and # pyright: ignore statements that demonstrate the type checker can catch those errors.
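
As an illustration of this convention, here is a minimal, self-contained sketch. It assumes TYPE_CHECKING_INVALID_USAGE behaves like an alias for typing.TYPE_CHECKING (the real definition lives in the pandas-stubs test helpers and may differ), and halve is a hypothetical function standing in for a typed Series operation. The error codes in the ignore comments are also assumptions about what each checker would report.

```python
from typing import TYPE_CHECKING

# Assumption: the flag is False at runtime and True for type checkers.
TYPE_CHECKING_INVALID_USAGE = TYPE_CHECKING


def halve(x: float) -> float:
    """Hypothetical stand-in for an operation with a typed argument."""
    return x / 2


if TYPE_CHECKING_INVALID_USAGE:
    # Never executed at runtime, but still analysed by mypy and pyright.
    # Both ignore comments are needed: one per type checker.
    _0 = halve("four")  # type: ignore[arg-type] # pyright: ignore[reportArgumentType]

print(halve(4.0))
```

If the checkers are configured to report unnecessary ignore comments (mypy's warn_unused_ignores, pyright's reportUnnecessaryTypeIgnoreComment), an ignore that suppresses nothing is itself reported, which is what lets such a line demonstrate that the error is really caught.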

check(assert_type(left_ts - a, "TimedeltaSeries"), pd.Series, pd.Timedelta)
if TYPE_CHECKING_INVALID_USAGE:
    _1 = left_ts - a
Collaborator

This is an example of valid code that should be accepted.

Contributor Author

Sorry, it was a typo. 1fb597b

However, I did not easily understand your comment here.

This is an example of valid code that should be accepted.

left_ts and left_td are Series[Any] at type checking. But you wrote

I think we have to be selective here, and probably disallow subtraction with untyped Series when the other argument is known to be time related (Timestamp, Timedelta and associated Series)

It seems to me that in your plan, both left_td - a and left_ts - a should give an error / a Never at type checking. Am I right?

My proposed plan is more permissive and will not detect the problem of left_ts - a, because at runtime, Series[Any] - TimedeltaSeries can either be TimestampSeries or TimedeltaSeries or give an error. In my proposed plan, at type checking, it would give Series[Any].

@cmp0xff
Contributor Author

cmp0xff commented Aug 23, 2025

Hi @Dr-Irv, thank you for drafting the plan.

Current plan

I think I prefer that when an operation like subtraction sometimes works and sometimes does not, and the inferred type of one of the operands is Series[Any], we detect that as a typing problem. But we need to be selective.

I think we have to be selective here, and probably disallow subtraction with untyped Series when the other argument is known to be time related (Timestamp, Timedelta and associated Series) or is a string or Series[str].

I would like to summarise this typing plan as follows:

  1. When the calculation can give a runtime error, typing shows an error or Never
  2. Certain cases are exceptions

Timestamp and Timedelta: permissive or forbidding

With this typing plan, I have the following examples in mind:

  1. Series[Any] (int) - TimestampSeries -> error at type checking, error at runtime
  2. Series[Any] (Timestamp) - TimestampSeries -> error at type checking, TimedeltaSeries at runtime

As a user, I probably do not want the static type checker to aggressively point out every potential problem. The less permissive the stubs are, the more aggressive the static type checker becomes. It seems better to me to allow both cases during static type checking; otherwise the user may need to manually silence the type checker in many cases.

int: exceptions to the plan

"We need to be selective" is important in the plan, because we also have

  1. Series[Any] (int) + Series[int] -> Series[Any] at type checking, Series[int] at runtime
  2. Series[Any] (str) + Series[int] -> Series[Any] at type checking, error at runtime

Currently we are happy with the stub giving us Series[Any] for adding Series[Any] to Series[int]. This is an exception, which may potentially confuse the user.

Proposing a consistent plan

I would like to propose a new typing plan as follows:

  1. When the calculation gives several typing results or a runtime error, typing shows Series[Any]
  2. When the calculation gives one typing result, say Series[R], or a runtime error, typing shows Series[R]
  3. When the calculation always gives a runtime error, typing shows an error or Never

With this typing plan, the previous examples give different results:

  1. Series[Any] (int) - TimestampSeries -> TimedeltaSeries at type checking, error at runtime (TimedeltaSeries is the only possible result that is valid, so unfortunately the type checker does not catch the potential problem here)
  2. Series[Any] (Timestamp) - TimestampSeries -> TimedeltaSeries at type checking, TimedeltaSeries at runtime
  3. Series[Any] (int) + Series[int] -> Series[Any] at type checking, Series[int] at runtime (no exceptional rule in the plan)
  4. Series[Any] (str) + Series[int] -> Series[Any] at type checking, error at runtime (Series[float], Series[int] etc. are possible valid results, so unfortunately the type checker does not catch the potential problem here)

Further examples:

  1. Series[Any] (int) + Series[str] -> Series[str] at type checking, error at runtime (Series[str] is the only possible result that is valid, so unfortunately the type checker does not catch the potential problem here)
  2. Series[Any] (str) + Series[str] -> Series[str] at type checking, Series[str] at runtime
  3. Series[Any] * TimestampSeries -> error / Never at type checking, error at runtime (Timestamp is consistently not multiplicative)
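
The three rules of this proposed plan can be sketched as a small decision function. This is only a model of the reasoning, not how the stubs implement it: static_result is a hypothetical helper, and the sets of valid runtime results are illustrative values taken from the examples above.

```python
def static_result(valid_results: set[str]) -> str:
    """Apply the three proposed rules.

    valid_results holds every result type the operation can produce
    at runtime without raising, across all possible contents of Series[Any].
    """
    if not valid_results:
        return "Never"  # rule 3: the calculation always fails
    if len(valid_results) == 1:
        return next(iter(valid_results))  # rule 2: a single result Series[R]
    return "Series[Any]"  # rule 1: several possible results


# Illustrative inputs, following the examples in this comment.
examples = {
    "Series[Any] - TimestampSeries": {"TimedeltaSeries"},
    "Series[Any] + Series[int]": {"Series[int]", "Series[float]"},
    "Series[Any] + Series[str]": {"Series[str]"},
    "Series[Any] * TimestampSeries": set(),
}

for expression, valid in examples.items():
    print(f"{expression} -> {static_result(valid)}")
```

Note that the function never looks at what Series[Any] actually contains, which is why the runtime errors in the examples above go undetected under this plan.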

Thank you for reading the lengthy explanation. What do you think?

@Dr-Irv
Collaborator

Dr-Irv commented Aug 23, 2025

Thank you for reading the lengthy explanation. What do you think?

The challenge here is the issue of wide vs narrow types. See https://github.com/pandas-dev/pandas-stubs/blob/main/docs/philosophy.md#narrow-vs-wide-arguments for some writeup I did about that.

Let's consider this example from your list:

Series[Any] (int) - TimestampSeries -> TimedeltaSeries at type checking, error at runtime

In what is in main today, the following code works as you describe there, i.e., the type checker infers that result is TimedeltaSeries, but it fails at runtime.

si = pd.DataFrame({"a": pd.Series([1,2,3])})["a"]
st = pd.Series(pd.date_range("1/1/2005", "1/3/2005"))
result = si - st

I think we do a better service to users if we actually catch this via typing, i.e., for Timedelta, TimedeltaSeries, Timestamp, TimestampSeries, str and Series[str], if they are in a binary operation with a Series[Any] (either before the operator or after the operator), the type checker reports an error. That's telling the user "We don't know how to handle a generic series with another operand that has a specified type", but we are limiting the types we do that with to just the ones I mentioned.

This forces the user to cast the variable si above to Series[int] (in which case we catch the failure), and makes them aware that the operation may fail at runtime.

Let's also consider this example:

st = pd.Series(pd.date_range("1/1/2005", "1/3/2005"))
sd = pd.DataFrame({"a": [pd.Timedelta("1 day"), pd.Timedelta("2 days"), pd.Timedelta("3 days")]})["a"]
result = st - sd

In this case, if we adopt my proposal, the type checker would say that st - sd is invalid. But the type of sd is partially unknown, so we are then suggesting that the user do:

result = st - cast("pd.Series[Timedelta]", sd)

which is telling the type checker "I know this is a series of timedeltas"
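
For readers less familiar with cast, a minimal sketch of what it does and does not do. It uses a plain list instead of a Series so that it runs without pandas; load_column is a hypothetical loader whose static return type is unknown.

```python
from typing import Any, cast


def load_column() -> Any:
    """Hypothetical loader: the type checker knows nothing about the result."""
    return [1.5, 2.5, 3.5]


raw = load_column()

# cast() only informs the type checker; it returns its argument unchanged.
typed = cast("list[float]", raw)
print(typed is raw)

# Nothing is validated at runtime, so a wrong cast is silently accepted.
wrong = cast("list[str]", raw)
print(wrong[0])
```

The cast is therefore a promise made by the user, not a check: if the promise is wrong, the failure still only shows up at runtime.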

I'm choosing what I consider to be a happy medium here between your proposal, and something that would be too narrow (e.g., disallowing Series[Any].__sub__(Series[Any])), by suggesting that if we know the types of ONE of the operands, but not the other, we try to catch the error via static typing.

I should say that the current behavior in the stubs dates from 3 years ago, when we first inherited the project from something Microsoft had started. Now that I have more experience with typing, as well as with using the stubs in my own code, I've come around to trying to catch more things with static type checking when we can, rather than fewer.

So the summary of my proposal is (with respect to Series) for binary operators a X b, where X is the operator:

  1. If a and b are fully typed, we figure out the result, and if it is an invalid calculation, we catch it.
  2. If a is Series[Any] and b is fully typed, we say that is an error.
  3. If a is fully typed, and b is Series[Any], we say that is an error.
  4. If a and b are not fully typed (i.e., one is Series[Any] and the other is Any or Series[Any]), we accept the calculation in typing and don't report an error.
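
The four cases above can be sketched as a tiny decision function. This is a simplified model: verdict is a hypothetical helper, "fully typed" is reduced to a boolean per operand, and the earlier restriction to time-related and string types is left out.

```python
def verdict(a_fully_typed: bool, b_fully_typed: bool) -> str:
    """Model the four cases for a binary operation a X b on Series."""
    if a_fully_typed and b_fully_typed:
        # Case 1: compute the result type, rejecting invalid calculations.
        return "resolve"
    if a_fully_typed != b_fully_typed:
        # Cases 2 and 3: exactly one operand is Series[Any].
        return "error"
    # Case 4: neither operand is fully typed.
    return "accept"


for a_typed in (True, False):
    for b_typed in (True, False):
        print(a_typed, b_typed, verdict(a_typed, b_typed))
```

The rule is symmetric in a and b, which matches the statement that the error applies whether Series[Any] appears before or after the operator.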

Let me know your thoughts on that.
