GH-37598: [Python][Interchange Protocol] Fix the from_dataframe implementation to use the column dtype #37986

Merged
jorisvandenbossche merged 2 commits into apache:main from AlenkaF:gh-37598-from_dataframe-should-read-column-dtype on Oct 5, 2023

@AlenkaF AlenkaF commented Oct 3, 2023

Rationale for this change

The DataFrame Interchange Protocol implementation has been defining the buffer dtypes for string and timestamp types incorrectly. This PR is the first step toward fixing that error and covers the from_dataframe part. The two remaining steps to resolve the connected issue are:

  1. Make sure other libraries have also updated their from_dataframe implementation
  2. Fix the data buffer dtypes for strings and timestamps.

What changes are included in this PR?

Fix the from_dataframe implementation so that it uses the column dtype, rather than the data buffer dtype, to interpret the buffers. The one exception is the indices of a categorical column, where the buffer dtype is still used to convert the indices when constructing the DictionaryArray.
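
As an illustration only (this is not the code from the PR), the idea can be sketched in plain Python. The `DtypeKind` values and the `(kind, bit_width, format_str, endianness)` dtype tuple follow the interchange protocol specification; `decode_column`, its `categories` argument, the little-endian layout, and the use of raw `bytes` in place of protocol buffer objects are assumptions made for this sketch.

```python
import struct
from enum import IntEnum


class DtypeKind(IntEnum):
    # Values as defined by the DataFrame Interchange Protocol specification.
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


# struct format characters for fixed-width integers (little-endian assumed).
_INT_FORMATS = {
    (DtypeKind.INT, 8): "b", (DtypeKind.INT, 16): "h",
    (DtypeKind.INT, 32): "i", (DtypeKind.INT, 64): "q",
    (DtypeKind.UINT, 8): "B", (DtypeKind.UINT, 16): "H",
    (DtypeKind.UINT, 32): "I", (DtypeKind.UINT, 64): "Q",
}


def _unpack_ints(raw, kind, bit_width):
    """Decode a bytes object holding fixed-width integers into a list."""
    fmt = _INT_FORMATS[(kind, bit_width)]
    count = len(raw) // (bit_width // 8)
    return list(struct.unpack(f"<{count}{fmt}", raw))


def decode_column(column_dtype, buffers, categories=None):
    """Hypothetical helper: interpret raw buffers based on the column dtype."""
    kind, bit_width, _format_str, _endianness = column_dtype
    data_raw, data_dtype = buffers["data"]

    if kind in (DtypeKind.INT, DtypeKind.UINT):
        # The column dtype decides how the data bytes are read.
        return _unpack_ints(data_raw, kind, bit_width)

    if kind == DtypeKind.STRING:
        # The declared dtype of the data buffer is ignored on purpose:
        # the column dtype already says these bytes are UTF-8 text.
        offsets_raw, offsets_dtype = buffers["offsets"]
        offsets = _unpack_ints(offsets_raw, offsets_dtype[0], offsets_dtype[1])
        return [
            data_raw[start:stop].decode("utf-8")
            for start, stop in zip(offsets, offsets[1:])
        ]

    if kind == DtypeKind.CATEGORICAL:
        # Exception: the physical type of the indices comes from the buffer dtype.
        codes = _unpack_ints(data_raw, data_dtype[0], data_dtype[1])
        return [categories[code] for code in codes]

    raise NotImplementedError(f"unsupported dtype kind: {kind!r}")


# The same string column decodes identically whether the producer declares the
# data buffer as STRING (the old behaviour) or as UINT8 bytes (the intended one).
string_dtype = (DtypeKind.STRING, 8, "u", "=")
offsets = (struct.pack("<3q", 0, 2, 5), (DtypeKind.INT, 64, "l", "="))
for data_dtype in [(DtypeKind.STRING, 8, "u", "="), (DtypeKind.UINT, 8, "C", "=")]:
    buffers = {"data": (b"abcde", data_dtype), "offsets": offsets, "validity": None}
    print(decode_column(string_dtype, buffers))  # ['ab', 'cde'] both times
```

Because the string branch never looks at the dtype attached to the data buffer, a consumer written this way keeps working once producers correct that dtype, which is why the from_dataframe side can be changed first.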

Are these changes tested?

No new tests are added, but all existing tests should pass, which verifies that the change is stable.

Are there any user-facing changes?

No.

@AlenkaF AlenkaF marked this pull request as ready for review October 3, 2023 09:31

@jorisvandenbossche jorisvandenbossche left a comment

Looks good to me!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting review Awaiting review labels Oct 3, 2023
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche jorisvandenbossche merged commit 334c937 into apache:main Oct 5, 2023
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Oct 5, 2023
@AlenkaF AlenkaF deleted the gh-37598-from_dataframe-should-read-column-dtype branch October 5, 2023 11:31
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 334c937.

There were 5 benchmark results indicating a performance regression:

The full Conbench report has more details.

@AlenkaF AlenkaF added this to the 14.0.0 milestone Oct 6, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
… implementation to use the column dtype (apache#37986)

* Closes: apache#37598

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
Development

Successfully merging this pull request may close these issues.

[Python] Interchange object data buffer has the wrong dtype / from_dataframe incorrect