
BUG: Series.combine_first loss of precision #60128

Open
2 of 3 tasks
jessestone7 opened this issue Oct 30, 2024 · 6 comments · May be fixed by #60166
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions

Comments

@jessestone7

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> a = pd.Series([1666880195890293744, 5]).to_frame()
>>> b = pd.Series([6, 7, 8]).to_frame()
>>> a
                     0
0  1666880195890293744
1                    5
>>> b
   0
0  6
1  7
2  8
>>> a.combine_first(b)
                     0
0  1666880195890293760
1                    5
2                    8

Issue Description

I tried this on pandas version 2.2.2 and see a loss of precision. This could be related to issue #51777.

Expected Behavior

1666880195890293744 should not get changed to 1666880195890293760

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.9.final.0
python-bits : 64
OS : Darwin
OS-release : 24.0.0
Version : Darwin Kernel Version 24.0.0: Tue Sep 24 23:37:36 PDT 2024; root:xnu-11215.1.12~1/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.3.2
pip : 24.2
Cython : 0.29.33
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.0
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@jessestone7 jessestone7 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 30, 2024
@rhshadrach
Member

Thanks for the report, confirmed on main. PRs to fix are welcome!

@rhshadrach rhshadrach added Dtype Conversions Unexpected or buggy dtype conversions and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 30, 2024
@cooolheater

cooolheater commented Oct 30, 2024

It's because the dtype changed: int64 -> float64 -> int64.

To combine with 'b', NaN needs to be added to 'a',
and to hold NaN, 'int64' was promoted to 'float64' by the flow below:

combine_first core/frame.py:8785
combine core/frame.py:8644
align core/generic.py:9447
_align_frame core/generic.py:9524
_reindex_with_indexers core/generic.py:5423
reindex_indexer core/internals/managers.py:832
take_nd core/internals/blocks.py:1020
take_nd core/array_algos/take.py:115
_take_nd_ndarray core/array_algos/take.py:131
_take_preprocess_indexer_and_fill_value core/array_algos/take.py:531
maybe_promote core/dtypes/cast.py:589

IMHO this is general practice, and converting the 'float64' back to 'int64' does not seem natural.

So, I think making the result 'float64' could be a solution. WDYT?
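For illustration, the rounding itself is just float64 arithmetic, independent of pandas: float64 has a 53-bit significand, so integers above 2**53 are rounded to the nearest representable multiple (here, a multiple of 256). A minimal sketch:

```python
import numpy as np

x = 1666880195890293744
# int64 -> float64 -> int64 round trip: float64 cannot represent every
# integer of this magnitude, so the value snaps to the nearest multiple of 256
roundtrip = int(np.float64(x))
print(roundtrip)  # 1666880195890293760
```

This is exactly the value reported in the reproducible example above.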

Just for reference: if NaN exists from the start, the column is not promoted through float:

a = pd.Series([1666880195890293744, 5,pd.NA]).to_frame()
b = pd.Series([6, 7, 8]).to_frame()
a.combine_first(b)
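To spell out why the reference example above avoids the problem (my reading, not a statement about internals beyond what the traceback shows): putting pd.NA in the constructor yields an object column, so alignment never needs an int64 -> float64 promotion and the big integer survives:

```python
import pandas as pd

a = pd.Series([1666880195890293744, 5, pd.NA]).to_frame()
print(a.dtypes[0])  # object -- pd.NA forces object dtype at construction

b = pd.Series([6, 7, 8]).to_frame()
res = a.combine_first(b)
# the Python int is carried through unchanged in the object column
print(res[0].iloc[0])
```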

@rhshadrach
Member

So, I think making the result 'float64' could be a solution. WDYT?

It seems to me we should be able to carry out this operation without passing through floats.
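Until such a fix lands, a user-side workaround (not a fix to pandas itself, just a sketch exploiting the object-dtype behavior noted above) is to cast both frames to object before combining, so the align/where steps never touch float64:

```python
import pandas as pd

a = pd.Series([1666880195890293744, 5]).to_frame()
b = pd.Series([6, 7, 8]).to_frame()

# object columns accept NaN as a fill value without promotion to float64,
# so the integers pass through combine_first exactly
res = a.astype(object).combine_first(b.astype(object))
print(res[0].iloc[0])  # 1666880195890293744
```

The trade-off is that the result is object dtype rather than int64 and must be cast back explicitly if needed.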

@cooolheater

The cause of 'passing through floats' is that NaN is inserted while reindexing 'a' from length 2 to length 3. In this case, should we insert another value (like 0) to keep the int64 dtype?

@jessestone7
Author

jessestone7 commented Nov 1, 2024

I am merely a user of pandas, and the underlying code is far over my head, so maybe I should not be commenting here, but would it be possible to reuse here whatever solution was used to fix issue #39051?

cooolheater added a commit to cooolheater/pandas that referenced this issue Nov 1, 2024
- Issue: There was int64->float64->int64 conversions
- Fix: Carry out the operation without passing through float64
@cooolheater cooolheater linked a pull request Nov 1, 2024 that will close this issue
@cooolheater

cooolheater commented Nov 1, 2024

The above patch could solve this issue (when all the columns are 'int64'), but it does not cover a mixed-dtype case like the one below:

a = pd.DataFrame({0:pd.Series([1666880195890293744, 5]),1:pd.Series([1.0,2.0])})
b = pd.Series([6, 7, 8]).to_frame()
a.combine_first(b)

Currently, NDFrame::align takes just one fill_value as a parameter.
IMHO, to solve the case above, we need to pass more context as parameters. Is this the right way?
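One way to picture that "more context": a single scalar fill_value cannot preserve both an int64 and a float64 column at once, but a per-column fill chosen to match each dtype can. A hypothetical sketch of the idea (not the actual align API):

```python
import pandas as pd

a = pd.DataFrame({0: pd.Series([1666880195890293744, 5]),
                  1: pd.Series([1.0, 2.0])})

# hypothetical per-column fill values, each compatible with its column's dtype
fills = {0: 0, 1: float("nan")}
aligned = pd.DataFrame({c: a[c].reindex(range(3), fill_value=fills[c])
                        for c in a.columns})
print(aligned.dtypes.tolist())  # column 0 stays int64, column 1 stays float64
```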
