Length mismatch when using read_parquet chunked #769

Open
@samjeckert

Describe the bug

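When reading a large Parquet file from S3 with wr.s3.read_parquet in chunked mode, iterating over the result raises ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements. The "Expected axis" count always matches the integer passed as chunked (or 65_536 when chunked=True). Judging by the traceback, the larger count comes from the RangeIndex that _apply_index builds from the file's metadata, which apparently covers more rows than a single chunk.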

Traceback:

Traceback (most recent call last):
  File "refresh.py", line 3, in <module>
    scores.refresh_score_partitions()
  File "/Users/sameckert/aw/project_explorer/app/scores.py", line 152, in refresh_score_partitions
    for df in dfs:
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 400, in _read_parquet_chunked
    path_root=path_root,
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 295, in _arrowtable2df
    df = _apply_index(df=df, metadata=metadata)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 224, in _apply_index
    df.index = pd.RangeIndex(start=col["start"], stop=col["stop"], step=col["step"])
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 227, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements
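
The failure in the last frames can be reproduced with pandas alone, which may help separate it from the S3 read. This is a minimal sketch, not the library's code: the descriptor dict, the helper name apply_full_range_index, and the small row counts are all hypothetical stand-ins, and it assumes the stored range describes more rows than one chunk holds.

```python
import pandas as pd

# Hypothetical stand-in for the RangeIndex descriptor read from the file's
# metadata; here it describes 1000 rows.
index_descriptor = {"start": 0, "stop": 1000, "step": 1}

# Small stand-in for one chunk (the report uses chunked=75_536).
chunk_rows = 100
chunk = pd.DataFrame({"score": range(chunk_rows)})


def apply_full_range_index(df, col):
    """Mimic the index assignment shown in the _apply_index frame."""
    df.index = pd.RangeIndex(start=col["start"], stop=col["stop"], step=col["step"])
    return df


try:
    apply_full_range_index(chunk, index_descriptor)
    error_message = None
except ValueError as exc:
    # pandas rejects an index whose length differs from the row count.
    error_message = str(exc)

print(error_message)
```

If the message printed here has the same shape as the one in the traceback, the mismatch is between the chunk length and the stored index range rather than anything specific to S3.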

Environment

Provide your pip list output, particularly the version of the AWS Data Wrangler library you used. Providing this information may significantly improve resolution times.

asn1crypto==1.4.0; python_version >= "3.6" and python_version < "3.10"
awswrangler==2.9.0; python_version >= "3.6" and python_version < "3.10"
beautifulsoup4==4.9.3; python_version >= "3.6" and python_version < "3.10"
boto3==1.17.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
botocore==1.20.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0")
certifi==2021.5.30; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
click==7.1.2; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
dataclasses==0.8; python_version >= "3.6" and python_version < "3.7" and python_full_version >= "3.6.1"
et-xmlfile==1.1.0; python_version >= "3.6" and python_version < "3.10"
fastapi==0.63.0; python_version >= "3.6"
future==0.18.2; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.3.0"
h11==0.12.0; python_version >= "3.6"
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
jmespath==0.10.0; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
llvmlite==0.36.0; python_version >= "3.6" and python_version < "3.10"
lmdb==1.2.1
lxml==4.6.3; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
mysqlclient==2.0.3; python_version >= "3.5"
nmslib==2.1.1
numba==0.53.1; python_version >= "3.6" and python_version < "3.10"
numpy==1.19.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
openpyxl==3.0.7; python_version >= "3.6" and python_version < "3.10"
pandas==1.1.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pg8000==1.19.5; python_version >= "3.6" and python_version < "3.10"
psutil==5.8.0; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
pyarrow==4.0.1; python_version >= "3.6" and python_version < "3.10"
pyathena==2.3.0; python_full_version >= "3.6.1" and python_full_version < "4.0.0"
pybind11==2.6.1; python_version >= "2.7" and python_version < "3.0" or python_version > "3.0" and python_version < "3.1" or python_version > "3.1" and python_version < "3.2" or python_version > "3.2" and python_version < "3.3" or python_version > "3.3" and python_version < "3.4" or python_version > "3.4"
pydantic==1.8.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pymysql==1.0.2; python_version >= "3.6" and python_version < "3.10"
python-dateutil==2.8.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pytz==2021.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
redis==3.5.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
redshift-connector==2.0.882; python_version >= "3.6" and python_version < "3.10"
requests==2.25.1; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
retrying==1.3.3; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
s3transfer==0.4.2; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
scramp==1.4.0; python_version >= "3.6" and python_version < "3.10"
six==1.16.0; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
soupsieve==2.2.1; python_version >= "3.6" and python_version < "3.10"
standardiser==0.1.12
starlette==0.13.6; python_version >= "3.6"
tenacity==6.3.1; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
typing-extensions==3.10.0.0; python_full_version >= "3.6.1" and python_version >= "3.6" and python_version < "3.8"
urllib3==1.26.6; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_full_version >= "3.6.0" and python_version < "3.10" and python_version >= "3.6"
uvicorn==0.13.4

To Reproduce

Steps to reproduce the behavior.

Failing code:

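import awswrangler as wr
import boto3

boto3_session = boto3.Session()
# chunked=75_536 asks for an iterator of DataFrames of 75,536 rows each
dfs = wr.s3.read_parquet("s3://large_file.parquet.gz", chunked=75_536, ignore_index=True, boto3_session=boto3_session)
for df in dfs:
    print(len(df.index))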
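
For comparison, a chunk-aware index assignment might slice the stored range by the number of rows already yielded, so that each chunk receives an index of its own length. This is only a sketch of the idea under that assumption: chunk_range_index, its parameters, and the in-memory chunks below are hypothetical and not part of awswrangler.

```python
import pandas as pd


def chunk_range_index(col, rows_seen, chunk_len):
    """Return the part of the stored RangeIndex that belongs to one chunk.

    `col` is a dict with "start", "stop" and "step" keys, shaped like the
    one used in the traceback; `rows_seen` counts rows already yielded.
    """
    full = pd.RangeIndex(start=col["start"], stop=col["stop"], step=col["step"])
    return full[rows_seen:rows_seen + chunk_len]


# Hypothetical descriptor and chunk sizes, kept small for illustration.
col = {"start": 0, "stop": 250, "step": 1}
chunk_lengths = [100, 100, 50]

rows_seen = 0
indexed_chunks = []
for n in chunk_lengths:
    df = pd.DataFrame({"score": range(n)})
    # The sliced index has exactly n entries, so the assignment succeeds.
    df.index = chunk_range_index(col, rows_seen, n)
    indexed_chunks.append(df)
    rows_seen += n

print([(len(c), c.index[0], c.index[-1]) for c in indexed_chunks])
```

Slicing keeps the original row labels continuous across chunks; resetting each chunk to a fresh 0-based index would be a simpler alternative when the labels do not matter.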


Labels: bug (Something isn't working)
