284 MBox Refresher #295

Open
wants to merge 95 commits into
base: master
8c1021d
Created parse_mbox_latest_date and refresh_mbox functions and updated…
ian-lastname Apr 19, 2024
72238a7
Edited download_pipermail to save pipermail files as mbox files, crea…
ian-lastname Apr 24, 2024
99fb7e3
Changed function name from refresh_mbox to refresh_mod_mbox for consi…
ian-lastname Apr 25, 2024
618f2d0
Added checks in refresh functions and in download_mod_mbox_per_month …
ian-lastname Apr 25, 2024
0751218
fix github checks
carlosparadis Apr 28, 2024
be4ff32
Re-added error message in refresh_pipermail when an http error is enc…
ian-lastname Apr 29, 2024
b5be04e
Added comments to download_pipermail
ian-lastname Apr 30, 2024
d2ce222
Minor documentation update for setup verification.
daomcgill Sep 10, 2024
7c585ae
i #284 Refactor download_pipermail function
daomcgill Sep 15, 2024
69ca163
i #284 Updated documentation and modified function for download_piper…
daomcgill Sep 17, 2024
b9a886b
i #284 Edited download_pipermail() and Added refresh_pipermail() and …
daomcgill Sep 17, 2024
3c88140
i #284 Added more descriptive comments. Made minor changes to piperma…
daomcgill Sep 19, 2024
5de3aa2
i #284 Added more descriptive comments. Made minor changes to piperma…
daomcgill Sep 19, 2024
8a373d6
Merge branch '284-mbox-download-refresher' of https://github.com/sail…
daomcgill Sep 19, 2024
b91389b
i #284 Added download_mod_mbox function and edited notebook
daomcgill Sep 21, 2024
0cc4123
i #284 Added refresh_mod_mbox function for refreshing Mod Mbox archives
daomcgill Sep 22, 2024
0dc6001
i #284 Updated Notebook
daomcgill Oct 1, 2024
f0027dc
i #284 Testing Github Actions
daomcgill Oct 2, 2024
9b9c896
i #284 Renamed save_folder_mail parameter to mbox
daomcgill Oct 2, 2024
7249c9b
i #284 Updated Notebook download_mail.Rmd
daomcgill Oct 3, 2024
2a1ba98
Revert "i #284 Testing Github Actions"
daomcgill Oct 3, 2024
7bf8ba6
i #284 Refactored parse_mbox_latest_date and Fixed Roxygen Errors
daomcgill Oct 3, 2024
aa60648
i #284 Update NEWS.md
daomcgill Oct 3, 2024
64e0646
i #284 Updated Notebook, exec/mailinglist.R and R/mail.R
daomcgill Oct 6, 2024
2b6a963
i #284 Changed Notebook to Use Project Working Directory
daomcgill Oct 6, 2024
dc40dba
i #284 Minor Fix: Folder Paths in helix.yml
daomcgill Oct 6, 2024
d6f3b41
i #284 fixes incorrect call
carlosparadis Oct 9, 2024
f02ecb1
i #284 attempt fix on Actions
carlosparadis Oct 9, 2024
7f38d1c
i #284 incomplete storytelling review
carlosparadis Oct 9, 2024
309fa34
i #284 downgrade version of R for XML
carlosparadis Oct 9, 2024
e04bd31
i #284 gcc not found on Actions
carlosparadis Oct 9, 2024
dbd7092
i #284 Refactored download_mail.Rmd
beydlern Oct 10, 2024
ea109bd
Merge branch 'master' into 284-mbox-download-refresher
daomcgill Oct 10, 2024
c4b9d16
i #284 Testing GitHub Actions after Merge
daomcgill Oct 10, 2024
90b05ed
i #284 GH Actions (changed perceval path)
daomcgill Oct 10, 2024
3e5f8f7
i #284 Change Roxygen version
daomcgill Oct 10, 2024
4af2c21
i #284 Update Notebook and config file
daomcgill Oct 11, 2024
8094402
i #284 Final Updates for Mail Notebook
daomcgill Oct 15, 2024
5fb3af7
i #284 Fixed Relative Paths in a Notebook
beydlern Oct 18, 2024
e56848a
i #230 create config file interface
anthonyjlau Nov 12, 2024
b462ddb
Merge branch 'master' into 284-mbox-download-refresher
carlosparadis Nov 12, 2024
def1660
i #284 minor fixes and XML dependency
carlosparadis Nov 12, 2024
bfc75cb
revert utags
carlosparadis Nov 12, 2024
c1830f6
i #284 More narrative and config fixes
carlosparadis Nov 12, 2024
4842100
i #284 Remove description tags
carlosparadis Nov 12, 2024
0f9769e
i #284 more minor doc formatting fixes
carlosparadis Nov 12, 2024
6f6a59b
i #284 Updates to exec/mailinglist.R and Minor Fixes for Mail Configu…
daomcgill Nov 12, 2024
775b5a6
Merge branch 'master' of https://github.com/sailuh/kaiaulu
daomcgill Nov 12, 2024
93f214e
Merge branch 'master' into 284-mbox-download-refresher
daomcgill Nov 12, 2024
e27a604
i #295 Change argument for exec from 'tabulate' to 'parse'
daomcgill Nov 12, 2024
6a5fed6
i #284 Testing Fix for Actions
daomcgill Nov 13, 2024
ffb5c9c
i #284 Try Adding Debugging
daomcgill Nov 13, 2024
e55b6e2
Revert "i #284 Try Adding Debugging"
daomcgill Nov 13, 2024
c797219
i #284 Revert ctags version
daomcgill Nov 13, 2024
092e2ab
Update commit_message_id_coverage.Rd
daomcgill Nov 13, 2024
56dff9c
i #284 Please work
daomcgill Nov 13, 2024
fd97af0
i #295 Last try
daomcgill Nov 13, 2024
8709b95
Revert "i #295 Last try"
daomcgill Nov 13, 2024
71054f9
Revert "i #284 Please work"
daomcgill Nov 13, 2024
382383d
Revert "Update commit_message_id_coverage.Rd"
daomcgill Nov 13, 2024
f11e452
Revert "i #284 Revert ctags version"
daomcgill Nov 13, 2024
09d00c3
Reapply "i #284 Try Adding Debugging"
daomcgill Nov 13, 2024
216fe07
i #284 R version
daomcgill Nov 13, 2024
99823d7
i #284 another R version change attempt
daomcgill Nov 13, 2024
6cd5e11
i #284 Version that was passing check
daomcgill Nov 13, 2024
132355d
i #295 Small changes from updated config
daomcgill Nov 30, 2024
dceded0
i #284 Updates to Mail Notebook
daomcgill Dec 1, 2024
5515d7c
i #284 Update Mailing List Exec to use "refresh"
daomcgill Dec 6, 2024
a89b983
Reverse github actions to match master
carlosparadis Dec 8, 2024
4aa2af2
Remove git.R print statements
carlosparadis Dec 8, 2024
a366573
Remove prints from mail tests
carlosparadis Dec 8, 2024
2887232
Remove prefix underline
carlosparadis Dec 8, 2024
800fccc
Remove additional git prints
carlosparadis Dec 8, 2024
67de9f8
Internal api functions should not be display
carlosparadis Dec 8, 2024
ae1ba66
Unit tests now pass locally
carlosparadis Dec 8, 2024
3697fe3
Remove more prints..
carlosparadis Dec 8, 2024
7fc9e41
Remove strange mbox file path
carlosparadis Dec 8, 2024
207d0c4
Fix parse_mbox removing stderr = TRUE
carlosparadis Dec 8, 2024
d3dd232
Add loop to parse_mbox on notebook
carlosparadis Dec 8, 2024
f74aff3
Documentation pass
carlosparadis Dec 8, 2024
557ad10
i #284 Update refresh functions
daomcgill Dec 8, 2024
1cf86e5
i #284 Missing file in previous commit
daomcgill Dec 8, 2024
f3048a9
i #284 Edit exec/mailinglist.R parse to take file as arg
daomcgill Dec 9, 2024
5ce5830
i #284 Use pipermail path for parsing pipermail folder
daomcgill Dec 9, 2024
41f0850
Merge branch 'master' into 284-mbox-download-refresher
daomcgill Dec 9, 2024
4df52d9
i #284 Minor fixes
daomcgill Dec 9, 2024
9d08ff7
Create /exec/ghevents.R
connorn-dev Feb 11, 2025
1e0a45f
Define CLI
connorn-dev Feb 11, 2025
874bf7d
ghevents.R testing logic
connorn-dev Feb 11, 2025
8fe551a
Add Help alert, Check arguments
connorn-dev Feb 11, 2025
8f35262
Added first Download Logic
connorn-dev Feb 11, 2025
aba7424
Finished CLI Donwload Logic
connorn-dev Feb 18, 2025
606d3ea
Finished Download and Parser Functions
connorn-dev Feb 18, 2025
07d350f
i #341 Vingette with R and Python together
connorn-dev Feb 25, 2025
c592260
i #341 Updated Vignette
connorn-dev Feb 25, 2025
i #284 Updated Notebook
- Updated vignettes/download_mail.Rmd to working version
- Fixed errors in helix.yml
- Minor edits in mail.R

Signed-off-by: Dao McGill <[email protected]>
daomcgill committed Oct 2, 2024
commit 0dc60013b730b2057b907ed0b14ada241d497702
4 changes: 2 additions & 2 deletions R/mail.R
@@ -254,7 +254,7 @@ process_gz_to_mbox_in_folder <- function(folder_path, verbose = TRUE) {
# If there are no .gz files, print a message (if verbose is TRUE) and return NULL
if (length(gz_files) == 0) {
if (verbose) cat("This folder does not contain any .gz files.\n")
-return(NULL)
+return(invisible(NULL))
}

# Create a vector to store the names of the converted .mbox files
@@ -317,7 +317,7 @@ process_gz_to_mbox_in_folder <- function(folder_path, verbose = TRUE) {
#' @param verbose if TRUE, prints detailed messages during the download process.
#' @return Returns `save_folder_path`, the folder path where the mbox files are stored.
#' @export
-download_mod_mbox <- function(mailing_list, start_year_month, end_year_month, save_folder_path, verbose = FALSE) {
+download_mod_mbox <- function(mailing_list, start_year_month, end_year_month, save_folder_path, verbose = TRUE) {

########## Extract Mailing List Name ##########
# Extract the mailing list name from the given URL. This is because the actual list name is
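Note on the first `R/mail.R` hunk: a minimal sketch of what changing `return(NULL)` to `return(invisible(NULL))` does in base R. The two functions below are hypothetical stand-ins for the early-exit branch, not code from the package.

```r
# Hypothetical stand-ins for the early-exit branch; not the package code.
early_exit_visible <- function() {
  return(NULL)
}

early_exit_invisible <- function() {
  return(invisible(NULL))
}

# withVisible() reports whether a value would be auto-printed at top level.
visible_flag <- withVisible(early_exit_visible())$visible
invisible_flag <- withVisible(early_exit_invisible())$visible

# Both variants still return NULL, so callers testing is.null() are unaffected.
same_value <- is.null(early_exit_visible()) && is.null(early_exit_invisible())
```

The practical effect is that a bare call in a notebook chunk no longer prints `NULL` after the "does not contain any .gz files" message, while the return value itself is unchanged.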
8 changes: 6 additions & 2 deletions conf/helix.yml
@@ -49,29 +49,33 @@ version_control:

mailing_list:
mod_mbox:
-mail_key_1:
+project_key_1:
mailing_list: https://lists.apache.org/[email protected]
start_year_month: 202310
end_year_month: 202405
save_folder_path: "../save_mbox_mail"
-mail_key_2:
+save_parsed_folder_path: "../save_parsed_mail"
+project_key_2:
mailing_list: https://lists.apache.org/[email protected]
start_year_month: 202201
end_year_month: 202401
save_folder_path: "../save_mbox_mail"
+save_parsed_folder_path: "../save_parsed_mail"
pipermail:
project_key_1:
# archive_url: https://mta.openssl.org/mailman/listinfo/
mailing_list: https://mta.openssl.org/pipermail/openssl-users/
start_year_month: 202310
end_year_month: 202405
save_folder_path: "../save_folder_mail"
+save_parsed_folder_path: "../save_parsed_mail"
project_key_2:
# archive_url: https://mta.openssl.org/mailman/listinfo/
mailing_list: https://mta.openssl.org/pipermail/openssl-project/
start_year_month: 202203
end_year_month: 202303
save_folder_path: "../save_folder_mail_2"
+save_parsed_folder_path: "../save_parsed_mail"

issue_tracker:
jira:
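Note on the `conf/helix.yml` key rename (`mail_key_1` to `project_key_1`): in R, chained `[[ ]]` lookups on a nested list return `NULL` when a key is missing instead of raising an error, so any caller still using the old key would fail later and less clearly. The sketch below illustrates this with a toy list that mirrors the nesting in the diff; `get_conf_value()` is a hypothetical helper and is not part of kaiaulu.

```r
# Toy configuration mirroring the nesting shown in the diff (values shortened).
conf <- list(
  mailing_list = list(
    mod_mbox = list(
      project_key_1 = list(
        start_year_month = 202310,
        save_folder_path = "../save_mbox_mail"
      )
    )
  )
)

# The old key no longer exists, yet chained [[ ]] returns NULL without an error.
stale <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["save_folder_path"]]

# Hypothetical helper: walk the keys one by one and stop at the first missing one.
get_conf_value <- function(conf, keys) {
  value <- conf
  for (key in keys) {
    if (!is.list(value) || is.null(value[[key]])) {
      stop("Missing configuration key: ", key)
    }
    value <- value[[key]]
  }
  value
}

fresh <- get_conf_value(
  conf,
  c("mailing_list", "mod_mbox", "project_key_1", "save_folder_path")
)
```

A lookup of this kind reports the stale key by name, which makes renames like this one easier to track down in notebooks and exec scripts.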
49 changes: 40 additions & 9 deletions vignettes/download_mail.Rmd
@@ -51,7 +51,7 @@ Each mailing list maintains archives of past messages, often organized by month
To start, we load the project configuration file, which contains parameters for downloading the mailing list archives.

```{r}
-conf <- yaml::read_yaml("conf/helix.yml")
+conf <- yaml::read_yaml("../conf/helix.yml")
mailing_list <- conf[["mailing_list"]][["pipermail"]][["project_key_1"]][["mailing_list"]]
start_year_month <- conf[["mailing_list"]][["pipermail"]][["project_key_1"]][["start_year_month"]]
end_year_month <- conf[["mailing_list"]][["pipermail"]][["project_key_1"]][["end_year_month"]]
@@ -72,7 +72,8 @@ download_pipermail(
mailing_list = mailing_list,
start_year_month = start_year_month,
end_year_month = end_year_month,
-save_folder_path = save_folder_path
+save_folder_path = save_folder_path,
+verbose = TRUE
)

```
@@ -90,7 +91,8 @@ How refresh_pipermail Works
refresh_pipermail(
mailing_list = mailing_list,
start_year_month = start_year_month,
-save_folder_path = save_folder_path
+save_folder_path = save_folder_path,
+verbose = TRUE
)

```
@@ -105,10 +107,10 @@ Mod Mbox archives also organize mailing lists by topic. The apache mailing list
Similar to Pipermail, we load the configuration for Mod Mbox from the YAML file, which includes the mailing list URL, the date range, and the save folder path.

```{r}
-mod_mbox_list <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["mailing_list"]]
-mod_start_year_month <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["start_year_month"]]
-mod_end_year_month <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["end_year_month"]]
-mod_save_folder_path <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["save_folder_path"]]
+mod_mbox_list <- conf[["mailing_list"]][["mod_mbox"]][["project_key_1"]][["mailing_list"]]
+mod_start_year_month <- conf[["mailing_list"]][["mod_mbox"]][["project_key_1"]][["start_year_month"]]
+mod_end_year_month <- conf[["mailing_list"]][["mod_mbox"]][["project_key_1"]][["end_year_month"]]
+mod_save_folder_path <- conf[["mailing_list"]][["mod_mbox"]][["project_key_1"]][["save_folder_path"]]
```

### Explanation of Configuration Parameters
@@ -117,7 +119,7 @@ mod_save_folder_path <- conf[["mailing_list"]][["mod_mbox"]][["mail_key_1"]][["s
- end_year_month: The last month to download (format: YYYYMM).
- save_folder_path: The directory where the downloaded .mbox files will be saved.

-##Mod Mbox Downloader
+## Mod Mbox Downloader
The download_mod_mbox() function downloads Mod Mbox archives by constructing URLs based on the mailing list and date range, saving them as .mbox files named kaiaulu_YYYYMM.mbox.

```{r}
@@ -127,6 +129,7 @@ download_mod_mbox(
end_year_month = mod_end_year_month,
save_folder_path = mod_save_folder_path,
verbose = TRUE
)
```

After running the function, it constructs URLs like: https://lists.apache.org/api/[email protected]&date=2024-01
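Note on the URL construction described in this hunk: the configuration stores months as `YYYYMM`, while the quoted URL ends in `date=2024-01`. A minimal sketch of expanding a configured range into one `YYYY-MM` string per month follows. `expand_year_months()` is a hypothetical helper written for illustration. It is not taken from `download_mod_mbox()`, whose internals are not shown in this diff.

```r
# Hypothetical helper: expand a YYYYMM range into the "YYYY-MM" strings
# that appear in the date= part of the URL quoted above.
expand_year_months <- function(start_year_month, end_year_month) {
  to_date <- function(year_month) {
    as.Date(paste0(year_month, "01"), format = "%Y%m%d")
  }
  start_date <- to_date(start_year_month)
  end_date <- to_date(end_year_month)
  if (is.na(start_date) || is.na(end_date)) {
    stop("Expected values in YYYYMM format")
  }
  if (start_date > end_date) {
    stop("start_year_month must not be after end_year_month")
  }
  # One entry per calendar month, both endpoints included.
  format(seq(start_date, end_date, by = "month"), "%Y-%m")
}

months <- expand_year_months(202310, 202405)
```

With the `project_key_1` values from `conf/helix.yml` (202310 to 202405), this yields eight months, so a download over that range would be expected to request eight monthly archives.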
@@ -143,9 +146,37 @@ How refresh_mod_mbox Works
refresh_mod_mbox(
mailing_list = mod_mbox_list,
start_year_month = mod_start_year_month,
-save_folder_path = mod_save_folder_path
+save_folder_path = mod_save_folder_path,
+verbose = TRUE
)
```

This ensures your archive is up-to-date, accounting for new data that may have been added to the mailing list since the last download.

+# Parser
+
+After downloading the mailing list archives as .mbox files, the next step is to parse these files to extract meaningful information for analysis. The parse_mbox() function utilizes the Perceval library to parse .mbox files and convert them into structured data tables. This enables easier manipulation and analysis of mailing list data.
+
+## Mbox Parser
+The parse_mbox() function takes an .mbox file and parses it into a structured data.table using the Perceval library.
+
+For the configuration, make sure you have the correct path to the Perceval library in the conf file.
+
+```{r}
+tools_config <- yaml::read_yaml("../tools.yml")
+parse_perceval_path <- tools_config[["perceval"]]
+
+conf <- yaml::read_yaml("../conf/helix.yml")
+parse_mbox_path <- conf[["mailing_list"]][["mod_mbox"]][["project_key_1"]][["save_folder_path"]]
+```
+Run the function using this:
+```{r}
+parsed_mail <- parse_mbox(
+perceval_path = parse_perceval_path,
+mbox_path = parse_mbox_path
+)
+```
+This will store the parsed data into the parsed_mail variable. To view the table, use:
+```{r}
+View(parsed_mail)
+```
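Note on the new Parser section: the added text describes `parse_mbox()` as taking an .mbox file, whereas `parse_mbox_path` is read from `save_folder_path`, which is a folder. If the function does expect a single file, one option is to list the downloaded files and parse them one at a time. The sketch below only covers the listing step with base R; `list_mbox_files()` is a hypothetical helper, and the commented loop assumes the `parse_mbox(perceval_path, mbox_path)` signature used in the notebook.

```r
# Hypothetical helper: collect the .mbox files saved in a download folder.
list_mbox_files <- function(folder_path) {
  list.files(folder_path, pattern = "\\.mbox$", full.names = TRUE)
}

# Demonstration on a throwaway folder with two placeholder archives and one
# unrelated file that should be ignored.
demo_folder <- file.path(tempdir(), "save_mbox_mail_demo")
dir.create(demo_folder, showWarnings = FALSE)
file.create(file.path(
  demo_folder,
  c("kaiaulu_202310.mbox", "kaiaulu_202311.mbox", "notes.txt")
))

mbox_files <- list_mbox_files(demo_folder)

# With kaiaulu loaded, each file could then be parsed separately (assumption:
# parse_mbox() returns one table per file that can be row-bound):
# parsed_list <- lapply(mbox_files, function(mbox_file) {
#   parse_mbox(perceval_path = parse_perceval_path, mbox_path = mbox_file)
# })
# parsed_mail <- data.table::rbindlist(parsed_list, fill = TRUE)
```

The later commit titled "Add loop to parse_mbox on notebook" in this pull request suggests the notebook moved in a similar direction, although its exact form is not shown here.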