ENH: Add read archive function #1440
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #1440      +/-   ##
==========================================
- Coverage   89.07%   83.16%   -5.91%
==========================================
  Files          87       87
  Lines        5374     6440    +1066
==========================================
+ Hits         4787     5356     +569
- Misses        587     1084     +497
@Sabrina-Hassaim kudos on getting docs and tests working ... kindly work on adding tests for the missing parts, as indicated by the CI. from there we can work on the code itself ... i've got a couple of suggestions but we could do it in steps, after coverage is ok for the existing setup

also @Sabrina-Hassaim kindly close the other PR
4c84cd6 to 36caf34
    return dfs if len(dfs) > 1 else dfs[0]

def _select_files_interactively(compatible_files: list[str]) -> list[str]:
i'm not sure we should support this - is there any benefit to this? @ericmjl @pyjanitor-devs/core-devs thoughts?
I think it's worth keeping around just to see what it might do for the library. If it turns out not to be used very widely we can just deprecate it at a later date. On the other hand, if it's very popular, then we have the benefit of having it around.
    extract_to_df: bool = True,
    file_type: str | None = None,
    selected_files: list[str] | None = None,
) -> pd.DataFrame | list[str]:
i think we should allow for more flexibility, via kwargs, where you can pass extra info to read_csv, read_excel, read_parquet, etc
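A minimal sketch of what forwarding keyword arguments could look like, assuming the member's extension picks the pandas reader. The helper name `read_member` and the `_READERS` mapping are hypothetical and not taken from this PR; only `pd.read_csv`, `pd.read_excel`, and `pd.read_parquet` are real pandas functions.

```python
import io
import os
import zipfile

import pandas as pd

# Hypothetical mapping from file extension to a pandas reader (an assumption,
# not the PR's actual dispatch table).
_READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".parquet": pd.read_parquet,
}


def read_member(zip_source, member, **kwargs):
    """Read one member of a zip archive, passing kwargs to the pandas reader."""
    suffix = os.path.splitext(member)[1].lower()
    if suffix not in _READERS:
        raise ValueError(f"Unsupported file type: {member!r}")
    with zipfile.ZipFile(zip_source) as archive:
        # Buffer the member in memory so readers that need seek() also work.
        buffer = io.BytesIO(archive.read(member))
    return _READERS[suffix](buffer, **kwargs)
```

With this shape, a caller could write `read_member("data.zip", "table.csv", sep=";")` and the `sep` argument would reach `pd.read_csv` unchanged.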
    extract_to_df: bool = True,
    file_type: str | None = None,
    selected_files: list[str] | None = None,
) -> pd.DataFrame | list[str]:
we should add an engine argument to support other dataframe libraries, e.g. polars. Have a look at some of the IO functions that support polars
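One way the `engine` idea could be sketched is shown here, assuming CSV bytes have already been extracted from the archive. The function name `read_csv_bytes` and the accepted engine strings are assumptions for illustration, not the signature used elsewhere in the library; polars is imported lazily so it stays an optional dependency.

```python
import io


def read_csv_bytes(data: bytes, engine: str = "pandas", **kwargs):
    """Parse CSV bytes with the requested dataframe library (hypothetical helper)."""
    if engine == "pandas":
        import pandas as pd

        return pd.read_csv(io.BytesIO(data), **kwargs)
    if engine == "polars":
        try:
            import polars as pl  # optional dependency, imported only on demand
        except ImportError as exc:
            raise ImportError(
                "engine='polars' requires polars to be installed"
            ) from exc
        return pl.read_csv(io.BytesIO(data), **kwargs)
    raise ValueError(f"engine must be 'pandas' or 'polars', got {engine!r}")
```

Keeping the import inside each branch means users who only need pandas are not forced to install polars.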
kindly add a line to changelog.md
PR Description
Please describe the changes proposed in the pull request:
1. Implementation of the read_archive Function:
Added a new method to read archive files (.zip, .tar, .tar.gz) and extract their contents as a DataFrame or a list of compatible files.
Supports CSV and Excel file formats within the archives.
2. Unit Tests
**This PR resolves #1171**
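For illustration, the "list of compatible files" behaviour described above could be sketched with the standard library alone. This is not the PR's implementation: the name `list_compatible_files` and the `COMPATIBLE_SUFFIXES` tuple are assumptions, chosen to match the CSV and Excel formats the description mentions.

```python
import tarfile
import zipfile

# Assumed set of supported extensions (CSV and Excel, per the description).
COMPATIBLE_SUFFIXES = (".csv", ".xls", ".xlsx")


def list_compatible_files(archive_path: str) -> list[str]:
    """Return the names of CSV/Excel members inside a zip or tar archive."""
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            names = archive.namelist()
    elif tarfile.is_tarfile(archive_path):
        # "r:*" lets tarfile detect the compression (.tar, .tar.gz, ...).
        with tarfile.open(archive_path, "r:*") as archive:
            names = [m.name for m in archive.getmembers() if m.isfile()]
    else:
        raise ValueError(f"Not a supported archive: {archive_path!r}")
    return [n for n in names if n.lower().endswith(COMPATIBLE_SUFFIXES)]
```

Detecting the archive type from its contents rather than its file extension keeps the sketch working for archives that have been renamed.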