Description
Describe the bug
When reading a large parquet file from S3 using read_parquet, I get errors like ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements. The expected axis value matches the integer value of chunked (or 65_536 if chunked=True).
Traceback:
Traceback (most recent call last):
  File "refresh.py", line 3, in <module>
    scores.refresh_score_partitions()
  File "/Users/sameckert/aw/project_explorer/app/scores.py", line 152, in refresh_score_partitions
    for df in dfs:
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 400, in _read_parquet_chunked
    path_root=path_root,
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 295, in _arrowtable2df
    df = _apply_index(df=df, metadata=metadata)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 224, in _apply_index
    df.index = pd.RangeIndex(start=col["start"], stop=col["stop"], step=col["step"])
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 227, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements
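The failing assignment in _apply_index can be imitated with pandas alone, which may help narrow things down. This is only a sketch of my reading of the traceback, not the library's actual code path. I am assuming that the RangeIndex entry in the file's pandas metadata covers the whole file (presumably 6741043 rows), while each chunk holds only `chunked` rows. The tiny sizes, the `score` column, and the `col` values below are made-up stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for one chunk: 5 rows taken from a 12-row "file".
chunk = pd.DataFrame({"score": range(5)})

# Hypothetical stand-in for the RangeIndex entry in the parquet pandas
# metadata; assumed here to describe the whole file, not a single chunk.
col = {"start": 0, "stop": 12, "step": 1}

try:
    # Same shape of assignment as the line shown in the traceback.
    chunk.index = pd.RangeIndex(start=col["start"], stop=col["stop"], step=col["step"])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If that assumption holds, any chunk shorter than the whole file would hit this ValueError, which would fit the "expected axis" count always equalling the chunked value.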
Environment
Provide your pip list output, particularly the version of the AWS Data Wrangler library you used. Providing this information may significantly improve resolution times.
asn1crypto==1.4.0; python_version >= "3.6" and python_version < "3.10"
awswrangler==2.9.0; python_version >= "3.6" and python_version < "3.10"
beautifulsoup4==4.9.3; python_version >= "3.6" and python_version < "3.10"
boto3==1.17.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
botocore==1.20.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0")
certifi==2021.5.30; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
click==7.1.2; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
dataclasses==0.8; python_version >= "3.6" and python_version < "3.7" and python_full_version >= "3.6.1"
et-xmlfile==1.1.0; python_version >= "3.6" and python_version < "3.10"
fastapi==0.63.0; python_version >= "3.6"
future==0.18.2; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.3.0"
h11==0.12.0; python_version >= "3.6"
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
jmespath==0.10.0; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
llvmlite==0.36.0; python_version >= "3.6" and python_version < "3.10"
lmdb==1.2.1
lxml==4.6.3; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
mysqlclient==2.0.3; python_version >= "3.5"
nmslib==2.1.1
numba==0.53.1; python_version >= "3.6" and python_version < "3.10"
numpy==1.19.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
openpyxl==3.0.7; python_version >= "3.6" and python_version < "3.10"
pandas==1.1.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pg8000==1.19.5; python_version >= "3.6" and python_version < "3.10"
psutil==5.8.0; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
pyarrow==4.0.1; python_version >= "3.6" and python_version < "3.10"
pyathena==2.3.0; python_full_version >= "3.6.1" and python_full_version < "4.0.0"
pybind11==2.6.1; python_version >= "2.7" and python_version < "3.0" or python_version > "3.0" and python_version < "3.1" or python_version > "3.1" and python_version < "3.2" or python_version > "3.2" and python_version < "3.3" or python_version > "3.3" and python_version < "3.4" or python_version > "3.4"
pydantic==1.8.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pymysql==1.0.2; python_version >= "3.6" and python_version < "3.10"
python-dateutil==2.8.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pytz==2021.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
redis==3.5.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
redshift-connector==2.0.882; python_version >= "3.6" and python_version < "3.10"
requests==2.25.1; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
retrying==1.3.3; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
s3transfer==0.4.2; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
scramp==1.4.0; python_version >= "3.6" and python_version < "3.10"
six==1.16.0; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
soupsieve==2.2.1; python_version >= "3.6" and python_version < "3.10"
standardiser==0.1.12
starlette==0.13.6; python_version >= "3.6"
tenacity==6.3.1; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
typing-extensions==3.10.0.0; python_full_version >= "3.6.1" and python_version >= "3.6" and python_version < "3.8"
urllib3==1.26.6; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_full_version >= "3.6.0" and python_version < "3.10" and python_version >= "3.6"
uvicorn==0.13.4
To Reproduce
Steps to reproduce the behavior.
Failing code:
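import boto3
import awswrangler as wr

boto3_session = boto3.Session()
# chunked=75_536 asks for an iterator of DataFrames with 75_536 rows each
dfs = wr.s3.read_parquet("s3://large_file.parquet.gz", chunked=75_536, ignore_index=True, boto3_session=boto3_session)
for df in dfs:
    print(len(df.index))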
P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.