_as_pandas() converts 'NA' to nan - consider adding optional keep_default_na=False argument #154

Closed
NanisTe opened this issue Jul 31, 2020 · 3 comments


NanisTe commented Jul 31, 2020

'NA' strings in a query result are converted to nan by the _as_pandas() function in result_set.py.

Consider adding an optional keep_default_na argument to the pd.read_csv() call inside _as_pandas() to control this behaviour.

This is related to issues #118 and #120.
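
For readers unfamiliar with the pandas behaviour described above, here is a minimal sketch that reproduces it outside PyAthena. The CSV text and the helper names `read_default` and `read_keep_na` are made up for illustration; only `pd.read_csv` and its `keep_default_na` and `na_values` parameters come from pandas itself.

```python
import io

import pandas as pd

# Made-up CSV: the first row holds the literal string "NA",
# the second row has an empty field in the same column.
CSV_TEXT = "code,name\nNA,Namibia\n,unknown\n"


def read_default(text):
    # Default settings: pandas treats "NA" (and the empty string) as missing.
    return pd.read_csv(io.StringIO(text))


def read_keep_na(text):
    # Disable the default NA markers and treat only the empty string as missing.
    return pd.read_csv(io.StringIO(text), keep_default_na=False, na_values=[""])


default_df = read_default(CSV_TEXT)
kept_df = read_keep_na(CSV_TEXT)

print(default_df["code"].isna().tolist())
print(kept_df["code"].tolist())
```

With the default settings both values in the `code` column come back as missing, while the second call keeps the literal "NA" and only marks the empty field as missing.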

Owner

laughingman7743 commented Aug 2, 2020

How about making keep_default_na and na_values options of the execute method, as follows?
c379656

from pyathena import connect
from pyathena.pandas_cursor import PandasCursor

cursor = connect(s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
                 region_name='us-west-2',
                 cursor_class=PandasCursor).cursor()

df = cursor.execute("SELECT * FROM many_rows", keep_default_na=False, na_values=[""]).as_pandas()

Author

NanisTe commented Aug 3, 2020

> How about making keep_default_na and na_values options of the execute method, as follows?
> c379656

Is that already a working solution, or are you suggesting that something like this be implemented?
I would keep it in as_pandas(), since it relates much more to that method than to the query execution itself.

@laughingman7743
Owner

> Is that already a working solution, or are you suggesting that something like this be implemented?

It is already implemented in the following branch:
#120

> I would keep it in as_pandas(), since it relates much more to that method than to the query execution itself.

The current implementation loads the CSV automatically as soon as the query has been executed, so calling as_pandas does not itself load the CSV. The _as_pandas method is called in the constructor of the result_set object.
I don't want to make any major changes to this implementation.
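
To illustrate why the options end up on the execution side, here is a rough sketch of the eager-loading pattern described above. The `EagerResultSet` class, its constructor arguments, and the sample CSV are hypothetical and are not PyAthena's actual code; the sketch only assumes that the parse happens once in the constructor, so any read_csv options have to be known at that point.

```python
import io

import pandas as pd


class EagerResultSet:
    """Hypothetical stand-in for a result set; not PyAthena's actual class."""

    def __init__(self, csv_text, keep_default_na=True, na_values=None):
        # The CSV is parsed here, at construction time, so the parsing
        # options must already be available when the object is created.
        self._df = self._as_pandas(csv_text, keep_default_na, na_values)

    @staticmethod
    def _as_pandas(csv_text, keep_default_na, na_values):
        return pd.read_csv(
            io.StringIO(csv_text),
            keep_default_na=keep_default_na,
            na_values=na_values,
        )

    def as_pandas(self):
        # No parsing happens here; the frame built in __init__ is returned.
        return self._df


result = EagerResultSet("code\nNA\nDE\n", keep_default_na=False, na_values=[""])
df = result.as_pandas()
print(df["code"].tolist())
```

In a design like this, adding the arguments to as_pandas() would mean deferring or repeating the parse, which is the larger change the comment above wants to avoid.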
