
Commit f4e054a

AlenkaF authored and jorisvandenbossche committed
apacheGH-37598: [Python][Interchange Protocol] Fix the from_dataframe implementation to use the column dtype (apache#37986)
### Rationale for this change

We have been defining buffer dtypes for string and timestamp types incorrectly in the DataFrame Interchange Protocol implementation. This PR is the first step to fix the error and deals with the `from_dataframe` part. The next two steps to solve the connected issue are:

2. Make sure other libraries have also updated their `from_dataframe` implementation.
3. Fix the data buffer dtypes for strings and timestamps.

### What changes are included in this PR?

Fix the `from_dataframe` implementation to use the column dtype rather than the data buffer dtype to interpret the buffers. Only for the indices of a categorical column do we still use the buffer data type, in order to convert the indices when constructing the `DictionaryArray`.

### Are these changes tested?

No new tests are added, but all the existing tests should pass, which verifies the stability of the change.

### Are there any user-facing changes?

No.

* Closes: apache#37598

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
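For context, the dtype tuples the protocol passes around have the shape `(kind, bit-width, format string, endianness)`. A minimal sketch of why the two dtypes disagree for a string column (plain Python, not pyarrow internals; the names `column_dtype` and `data_buffer_dtype` are illustrative, while the `DtypeKind` values and Arrow format strings follow the interchange protocol spec):

```python
from enum import IntEnum

# DtypeKind values as defined by the DataFrame Interchange Protocol spec.
class DtypeKind(IntEnum):
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23

# For a string column, the logical *column* dtype is STRING
# (Arrow format string "u", i.e. utf-8 string) ...
column_dtype = (DtypeKind.STRING, 8, "u", "=")

# ... while its *data buffer* holds raw UTF-8 bytes, so the buffer's own
# dtype is UINT with bit width 8 (Arrow format string "C").
data_buffer_dtype = (DtypeKind.UINT, 8, "C", "=")
```

The old code read the dtype out of `buffers["data"]` (the second tuple here) to interpret the buffers; the fix threads the column's own dtype (the first tuple, `col.dtype`) into `buffers_to_array` instead.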
1 parent 8d87747 commit f4e054a

1 file changed (18 additions, 5 deletions):

python/pyarrow/interchange/from_dataframe.py
```diff
@@ -19,6 +19,7 @@
 from typing import (
     Any,
+    Tuple,
 )
 
 from pyarrow.interchange.column import (
@@ -204,7 +205,9 @@ def column_to_array(
     pa.Array
     """
     buffers = col.get_buffers()
-    data = buffers_to_array(buffers, col.size(),
+    data_type = col.dtype
+    data = buffers_to_array(buffers, data_type,
+                            col.size(),
                             col.describe_null,
                             col.offset,
                             allow_copy)
@@ -236,7 +239,9 @@ def bool_column_to_array(
     )
 
     buffers = col.get_buffers()
-    data = buffers_to_array(buffers, col.size(),
+    data_type = col.dtype
+    data = buffers_to_array(buffers, data_type,
+                            col.size(),
                             col.describe_null,
                             col.offset)
     data = pc.cast(data, pa.bool_())
@@ -274,11 +279,15 @@ def categorical_column_to_dictionary(
         raise NotImplementedError(
             "Non-dictionary categoricals not supported yet")
 
+    # We need to first convert the dictionary column
     cat_column = categorical["categories"]
     dictionary = column_to_array(cat_column)
-
+    # Then we need to convert the indices
+    # Here we need to use the buffer data type!
     buffers = col.get_buffers()
-    indices = buffers_to_array(buffers, col.size(),
+    _, data_type = buffers["data"]
+    indices = buffers_to_array(buffers, data_type,
+                               col.size(),
                                col.describe_null,
                                col.offset)
 
@@ -326,6 +335,7 @@ def map_date_type(data_type):
 
 def buffers_to_array(
     buffers: ColumnBuffers,
+    data_type: Tuple[DtypeKind, int, str, str],
     length: int,
     describe_null: ColumnNullType,
     offset: int = 0,
@@ -339,6 +349,9 @@ def buffers_to_array(
     buffer : ColumnBuffers
         Dictionary containing tuples of underlying buffers and
         their associated dtype.
+    data_type : Tuple[DtypeKind, int, str, str],
+        Dtype description of the column as a tuple ``(kind, bit-width, format string,
+        endianness)``.
     length : int
         The number of values in the array.
     describe_null: ColumnNullType
@@ -360,7 +373,7 @@ def buffers_to_array(
     is responsible for keeping the memory owner object alive as long as
     the returned PyArrow array is being used.
     """
-    data_buff, data_type = buffers["data"]
+    data_buff, _ = buffers["data"]
     try:
         validity_buff, validity_dtype = buffers["validity"]
     except TypeError:
```
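The categorical exception the diff calls out can be mimicked with plain-Python stand-ins (the dict layout and `interpret_column` are hypothetical illustrations, not pyarrow's actual objects): after the fix, every column is interpreted via its own logical dtype, except that dictionary indices still use the data buffer's dtype, since the producer chooses the index width.

```python
CATEGORICAL = 23  # DtypeKind.CATEGORICAL in the interchange protocol

def interpret_column(col):
    """Pick which dtype to interpret the data buffer with (sketch)."""
    buffers = col["buffers"]
    if col["dtype"][0] == CATEGORICAL:
        # Categorical indices: the producer chooses the index width, so the
        # data buffer's own dtype is the one that describes the bytes.
        _, index_dtype = buffers["data"]
        return ("indices", index_dtype)
    # All other columns: trust the column's logical dtype (the fix).
    return ("values", col["dtype"])

# A string column: column dtype is STRING (kind 21), but the data buffer
# holds raw UTF-8 bytes (kind 1 = UINT, Arrow format "C" = uint8).
string_col = {"dtype": (21, 8, "u", "="),
              "buffers": {"data": (b"abc", (1, 8, "C", "="))}}

# A categorical column whose producer stored int8 indices (format "c").
cat_col = {"dtype": (CATEGORICAL, 8, "c", "="),
           "buffers": {"data": (b"\x00\x01", (0, 8, "c", "="))}}
```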
