284 MBox Refresher #295
… helix config in accordance with new save file structure
I have created the parse_mbox_latest_date and refresh_mbox functions. The latter deletes the most recent year-and-month mbox file currently downloaded (identified by parse_mbox_latest_date) and redownloads it, along with every later file up to the current year. The naming convention of the downloaded files has also been changed to what we agreed on. Note that download_mod_mbox remains unchanged, since I'm only using download_mod_mbox_per_month.
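The refresh logic described above (find the latest downloaded month, then re-fetch from that month through the current one) can be sketched in base R. This is only an illustration, not the code in this PR: the helper names `latest_year_month` and `year_month_seq` are hypothetical, and the sketch assumes files end in a six-digit `YYYYMM` followed by `.mbox`.

```r
# Hypothetical helper: find the latest YYYYMM among file names that end in
# "<YYYYMM>.mbox". Returns NA if no file name matches.
latest_year_month <- function(file_names) {
  hits <- regexpr("[0-9]{6}(?=\\.mbox$)", file_names, perl = TRUE)
  matches <- regmatches(file_names, hits)
  if (length(matches) == 0) return(NA_character_)
  # Fixed-width digit strings sort correctly as text.
  max(matches)
}

# Hypothetical helper: every YYYYMM from `from` to `to`, inclusive.
year_month_seq <- function(from, to) {
  start <- as.Date(paste0(from, "01"), format = "%Y%m%d")
  end <- as.Date(paste0(to, "01"), format = "%Y%m%d")
  if (start > end) return(character(0))
  format(seq(start, end, by = "month"), "%Y%m")
}

# Usage: the months a refresher would delete (the first) and re-download (all).
files <- c("kaiaulu_202311.mbox", "kaiaulu_202312.mbox")
latest <- latest_year_month(files)
to_refresh <- year_month_seq(latest, format(Sys.Date(), "%Y%m"))
```

Re-downloading the latest month rather than skipping it matters because that file may have been fetched before the month ended and so could be incomplete.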
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #295 +/- ##
==========================================
- Coverage 39.79% 36.42% -3.37%
==========================================
Files 20 20
Lines 3091 3495 +404
==========================================
+ Hits 1230 1273 +43
- Misses 1861 2222 +361
View full report in Codecov by Sentry.
Thank you @ian-lastname. I will try to make a pass before our meeting!
…ted refresh_pipermail, updated news
Found out that the pipermail downloader function already downloads the files by month and year, so all I really needed to do was make it save them as mbox files (change the extension from .txt to .mbox). Created the refresher for pipermail. I did not need to create a separate function to parse the latest pipermail date, since the files are mbox files anyway.
…to ensure it does not download files past current year and month
Added checks in the aforementioned functions so that the refreshers won't download "mail from the future".
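A guard of this kind can be as small as capping the requested end month at the current month. The sketch below is an assumption about how such a check might look, not the PR's implementation; `cap_year_month` is a hypothetical name, and `today` is a parameter only so the behavior can be checked with a fixed date.

```r
# Hypothetical guard: never request a month later than the current one.
cap_year_month <- function(end_year_month, today = Sys.Date()) {
  current <- format(today, "%Y%m")
  # YYYYMM values compare correctly once converted to integers.
  if (as.integer(end_year_month) > as.integer(current)) {
    current
  } else {
    end_year_month
  }
}

# Usage: a request for January 2030 made in October 2024 is capped.
capped <- cap_year_month("203001", today = as.Date("2024-10-15"))
```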
…ountered Done as requested by Carlos
@ian-lastname thanks!
- Remove archive_url and archive_type parameters from download_pipermail().
- Add start_year_month and end_year_month parameters for date filtering.
- Remove convert_pipermail_to_mbox() function, as download_pipermail() now handles file conversion automatically.
- Change file naming convention to 'kaiaulu_YYYYMM.mbox'.
- Attempt to download and decompress files directly without saving .gz to disk, but could not establish a valid connection.
Signed-off-by: Dao McGill <[email protected]>
Hi @carlosparadis, I've refactored the download_pipermail() function.
Proposed Changes
I've added temporary configuration entries in helix.yml for testing purposes:
conf <- yaml::read_yaml("conf/helix.yml")
mailing_list <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["mailing_list"]]
start_year_month <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["start_year_month"]]
end_year_month <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["end_year_month"]]
save_folder_path <- conf[["mailing_list"]][["mod_mbox"]][["pipermail_key"]][["save_folder_path"]]
And this function call:
download_pipermail(
  mailing_list = mailing_list,
  start_year_month = start_year_month,
  end_year_month = end_year_month,
  save_folder_path = save_folder_path
)
Testing Results
…mail()
- Modified helix.yml to use [["mailing_list"]][["pipermail"]][["project_key_1"]]
- Added project_key_2 to helix.yml
- Created /vignettes/download_mail.Rmd to document information about pipermail downloader
- Made function calls explicit for external libraries
- ISSUE: Build -> Check is not passing. Seems to be having issues with utags_path, even though I changed the path to the one for universal-ctags in tools.yml
@daomcgill I made an inline comment to reply to your question about the changes, since there may be a bit of a misunderstanding. Let me know if you can't find it. Two sanity checks:
…process_gz_to_mbox_in_folder()
- download_pipermail: Attempts to download the .txt file first. If it is unavailable, falls back to .gz. When using the .gz file, unzips it and writes the output as .mbox
- Added log messages
- download_pipermail: Added timeout parameter to handle the case where the server takes too long to respond
- Added refresh_pipermail function
- Updated vignettes/download_mail.Rmd to include refresh_pipermail
- Added process_gz_to_mbox_in_folder function
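The .gz handling and the timeout described above can be illustrated with base R alone. This is a minimal sketch under stated assumptions, not the code in this PR: `gz_to_mbox` and `try_download` are hypothetical names, and the real functions may use a different HTTP library and different arguments.

```r
# Hypothetical helper: decompress one .gz archive into a plain-text .mbox file.
gz_to_mbox <- function(gz_path, mbox_path) {
  con <- gzfile(gz_path, open = "rt")
  on.exit(close(con))
  lines <- readLines(con, warn = FALSE)
  # useBytes = TRUE writes the text as read, without re-encoding it.
  writeLines(lines, mbox_path, useBytes = TRUE)
  invisible(mbox_path)
}

# Hypothetical helper: download with a timeout; TRUE on success, FALSE otherwise.
# A caller could try the .txt URL first and the .gz URL only if this returns FALSE.
try_download <- function(url, destination, timeout_seconds = 60) {
  old_options <- options(timeout = timeout_seconds)
  on.exit(options(old_options))
  tryCatch(
    utils::download.file(url, destination, mode = "wb", quiet = TRUE) == 0,
    error = function(e) FALSE,
    warning = function(w) FALSE
  )
}
```

Returning a logical from the download step keeps the fallback decision in the caller, so the same helper serves both the .txt attempt and the .gz attempt.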
@daomcgill I've added some inline comments. For sanity's sake, you can reply to each comment directly inline, since they are more specific.
…il refresher.
- Replaced paste0 with stringi::stri_c
- Removed the step that creates the directory if it does not exist
- Added more verbose descriptions/comments
- Added dividers within functions
- Added verbose parameter
- Added else block for refresher
- Added call to process_gz_to_mbox_in_folder at end of refresher
- parse_mbox: stri_replace_last was not working, changed it to stringi::stri_replace_last_regex
- Tested parse_mbox. Perceval was not returning any output. I will look further into why this is happening.
Signed-off-by: Dao McGill <[email protected]>
…uh/kaiaulu into 284-mbox-download-refresher
Updated parameters for download_mod_mbox to use Apache Pony Mail links, as Apache lists now redirect there
- Modified downloads to use YYYYMM instead of YYYY
- Removed the option for downloading by year for clearer functionality.
- Updated vignettes/download_mail.Rmd
Signed-off-by: Dao McGill <[email protected]>
- Created `refresh_mod_mbox` function to automatically refresh mailing list archives downloaded using Mod Mbox.
- The function checks for the latest downloaded file, deletes it, and redownloads the archive from that month to the current date.
- Added documentation for `refresh_mod_mbox` to the notebook.
Signed-off-by: Dao McGill <[email protected]>
- Updated vignettes/download_mail.Rmd to working version
- Fixed errors in helix.yml
- Minor edits in mail.R
Signed-off-by: Dao McGill <[email protected]>
- Takes file path for mbox file to parse
- No longer need to pass project_conf
Signed-off-by: Dao McGill <[email protected]>
@daomcgill In the new notebook download_mail I included the parse functions to showcase the downloader results through the parser. If you confirm that pointing parse_mbox() to a folder leads to weird results (not all files being picked up), I would recommend adding the loop not only to both parsers, but also to the other notebooks that use parse_mbox() in Kaiaulu, as they will have to abide by the new format. See the loop I added in d3dd232. For the notebooks, I'd say search for parse_mbox() in the Kaiaulu repo on GitHub for hits. The notebooks that had _ are a good start (I already changed one of them).
@carlosparadis Oh! Now I understand. I just tried passing a folder into parse_mbox() from the mail notebook, and it does actually parse all files. I checked the parsed rows against the number of unique reply IDs, and they were equal.
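For reference, the per-file loop and the row-count sanity check discussed in the two comments above can be sketched as follows. This is not the loop from d3dd232: `parse_mbox_folder` is a hypothetical wrapper, the parser is passed in as `parse_fn` because its exact signature is not shown in this conversation, and the `reply_id` column name is an assumption.

```r
# Hypothetical wrapper: parse every .mbox file in a folder and stack the results.
# `parse_fn` is any function that takes one file path and returns a data frame.
parse_mbox_folder <- function(folder_path, parse_fn) {
  mbox_files <- list.files(folder_path, pattern = "\\.mbox$", full.names = TRUE)
  parsed <- lapply(mbox_files, parse_fn)
  do.call(rbind, parsed)
}

# Sanity check in the spirit of the comment above: one row per unique reply ID.
# The column name `reply_id` is assumed for illustration.
rows_match_unique_ids <- function(parsed) {
  nrow(parsed) == length(unique(parsed$reply_id))
}
```

If the check returns FALSE, either some messages were parsed more than once or the files overlap in content, which is worth investigating before downstream analysis.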
Barring the minor changes requested, the notebook is clear in its presentation and runs as expected.
Signed-off-by: Dao McGill <[email protected]>
@daomcgill could you resolve the merge conflicts?
@carlosparadis resolved!
Signed-off-by: Dao McGill <[email protected]>
@carlosparadis fixed a few small issues I overlooked. I think this one should be good?
Finished review.
Created file and added license info as start.
Defined CLI with docopt library, includes CLI usage and options.
Adds condition for "--version" logic.
- Adds alert if download help is passed
- Checks all arguments are not null
- Downloads GH data via github_api_project_issue_events as a starting point
- Now can download and save JSON file to specified path
- Changed logic to download JSON files individually
- Parses all files from input folder into output csv
Includes steps to:
- download and parse data
- generate start and end activities and process tree
Signed-off-by: Connor Narowetz <[email protected]>
@connorn-dev this does not look right. Are you sure you are sending the commits to the right Pull Request?
The exec of github.R you are extending is in #301. I am not sure why you have been sending commits to the mailing list downloader. Dao's code was only to be used as reference. Let me fix this PR once you commit the code to the appropriate one. Do not rewind this branch, I do not want to risk losing her code.
I apologize for the confusion; I completely overlooked the branch reference in my issue. My issue lists this PR as the one to update, but I will change that. I have the code, so feel free to fix the PR; I will update #301 with the CLI code. For the vignette, should I push to the same PR or create a new one?
Thank you
@connorn-dev extend #301. For now, keep github.R separate, but we should discuss later whether it makes sense to combine them into one file. No worries, just ping me on #301 when you have all the code there and I will fix this one.