Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd

# With a non-null chunksize, read_json returns a JsonReader rather than a DataFrame.
# This call raises: TypeError: initial_value must be str or None, not bytes
df = pd.read_json(path_or_buf="s3://...json", lines=True, chunksize=100)
Issue Description
This issue happens when using pandas read_json with s3fs and a non-null chunksize. There is a similar report for the null chunksize case.
Running the code above results in this error:
TypeError: initial_value must be str or None, not bytes
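For context, here is a minimal, standalone illustration of the failure mode (an assumption based on the traceback: the S3 handle is opened in binary mode, so its read() returns bytes, and StringIO only accepts str):

from io import StringIO

# StringIO rejects bytes, which is exactly the error reported above.
StringIO(b'{"a": 1}\n')
# TypeError: initial_value must be str or None, not bytes
StringIO('{"a": 1}\n')  # works with str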
I tracked it down to this method (JsonReader._preprocess_data):
def _preprocess_data(self, data):
    """
    At this point, the data either has a `read` attribute (e.g. a file
    object or a StringIO) or is a string that is a JSON document.
    If self.chunksize, we prepare the data for the `__next__` method.
    Otherwise, we read it into memory for the `read` method.
    """
    if hasattr(data, "read") and not (self.chunksize or self.nrows):
        with self:
            data = data.read()
    if not hasattr(data, "read") and (self.chunksize or self.nrows):
        data = StringIO(data)  # <-- fails here: data is bytes when it comes from s3fs
    return data
The fix is simple: just change that line to:
data = StringIO(ensure_str(data))
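For illustration only, here is a stand-in with the same behavior as ensure_str for this case (decode bytes as UTF-8, pass str through unchanged); decode_if_bytes below is a hypothetical helper, not pandas code:

from io import StringIO

def decode_if_bytes(data):
    # Hypothetical stand-in mirroring what ensure_str does here:
    # decode bytes as UTF-8, leave str untouched.
    return data.decode("utf-8") if isinstance(data, bytes) else data

StringIO(decode_if_bytes(b'{"a": 1}\n'))  # no longer raises: bytes are decoded first
StringIO(decode_if_bytes('{"a": 1}\n'))   # plain strings still pass through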
I'll put together a PR.
Expected Behavior
Using pandas read_json with an S3 URL and a non-null chunksize should work.
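Until the fix lands, a possible workaround (an untested sketch; the bucket path is hypothetical and it assumes fsspec/s3fs are installed) is to open the S3 object in text mode yourself, so pandas receives str rather than bytes:

import fsspec
import pandas as pd

# Hypothetical path; mode="rt" makes the handle yield str lines instead of bytes.
with fsspec.open("s3://my-bucket/data.json", mode="rt") as f:
    reader = pd.read_json(f, lines=True, chunksize=100)
    for chunk in reader:
        print(chunk.shape)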
Installed Versions
It happens with the current version (1.4.3).