284 MBox Refresher #295

Open
wants to merge 95 commits into
base: master
8c1021d
Created parse_mbox_latest_date and refresh_mbox functions and updated…
ian-lastname Apr 19, 2024
72238a7
Edited download_pipermail to save pipermail files as mbox files, crea…
ian-lastname Apr 24, 2024
99fb7e3
Changed function name from refresh_mbox to refresh_mod_mbox for consi…
ian-lastname Apr 25, 2024
618f2d0
Added checks in refresh functions and in download_mod_mbox_per_month …
ian-lastname Apr 25, 2024
0751218
fix github checks
carlosparadis Apr 28, 2024
be4ff32
Re-added error message in refresh_pipermail when an http error is enc…
ian-lastname Apr 29, 2024
b5be04e
Added comments to download_pipermail
ian-lastname Apr 30, 2024
d2ce222
Minor documentation update for setup verification.
daomcgill Sep 10, 2024
7c585ae
i #284 Refactor download_pipermail function
daomcgill Sep 15, 2024
69ca163
i #284 Updated documentation and modified function for download_piper…
daomcgill Sep 17, 2024
b9a886b
i #284 Edited download_pipermail() and Added refresh_pipermail() and …
daomcgill Sep 17, 2024
3c88140
i #284 Added more descriptive comments. Made minor changes to piperma…
daomcgill Sep 19, 2024
5de3aa2
i #284 Added more descriptive comments. Made minor changes to piperma…
daomcgill Sep 19, 2024
8a373d6
Merge branch '284-mbox-download-refresher' of https://github.com/sail…
daomcgill Sep 19, 2024
b91389b
i #284 Added download_mod_mbox function and edited notebook
daomcgill Sep 21, 2024
0cc4123
i #284 Added refresh_mod_mbox function for refreshing Mod Mbox archives
daomcgill Sep 22, 2024
0dc6001
i #284 Updated Notebook
daomcgill Oct 1, 2024
f0027dc
i #284 Testing Github Actions
daomcgill Oct 2, 2024
9b9c896
i #284 Renamed save_folder_mail parameter to mbox
daomcgill Oct 2, 2024
7249c9b
i #284 Updated Notebook download_mail.Rmd
daomcgill Oct 3, 2024
2a1ba98
Revert "i #284 Testing Github Actions"
daomcgill Oct 3, 2024
7bf8ba6
i #284 Refactored parse_mbox_latest_date and Fixed Roxygen Errors
daomcgill Oct 3, 2024
aa60648
i #284 Update NEWS.md
daomcgill Oct 3, 2024
64e0646
i #284 Updated Notebook, exec/mailinglist.R and R/mail.R
daomcgill Oct 6, 2024
2b6a963
i #284 Changed Notebook to Use Project Working Directory
daomcgill Oct 6, 2024
dc40dba
i #284 Minor Fix: Folder Paths in helix.yml
daomcgill Oct 6, 2024
d6f3b41
i #284 fixes incorrect call
carlosparadis Oct 9, 2024
f02ecb1
i #284 attempt fix on Actions
carlosparadis Oct 9, 2024
7f38d1c
i #284 incomplete storytelling review
carlosparadis Oct 9, 2024
309fa34
i #284 downgrade version of R for XML
carlosparadis Oct 9, 2024
e04bd31
i #284 gcc not found on Actions
carlosparadis Oct 9, 2024
dbd7092
i #284 Refactored download_mail.Rmd
beydlern Oct 10, 2024
ea109bd
Merge branch 'master' into 284-mbox-download-refresher
daomcgill Oct 10, 2024
c4b9d16
i #284 Testing GitHub Actions after Merge
daomcgill Oct 10, 2024
90b05ed
i #284 GH Actions (changed perceval path)
daomcgill Oct 10, 2024
3e5f8f7
i #284 Change Roxygen version
daomcgill Oct 10, 2024
4af2c21
i #284 Update Notebook and config file
daomcgill Oct 11, 2024
8094402
i #284 Final Updates for Mail Notebook
daomcgill Oct 15, 2024
5fb3af7
i #284 Fixed Relative Paths in a Notebook
beydlern Oct 18, 2024
e56848a
i #230 create config file interface
anthonyjlau Nov 12, 2024
b462ddb
Merge branch 'master' into 284-mbox-download-refresher
carlosparadis Nov 12, 2024
def1660
i #284 minor fixes and XML dependency
carlosparadis Nov 12, 2024
bfc75cb
revert utags
carlosparadis Nov 12, 2024
c1830f6
i #284 More narrative and config fixes
carlosparadis Nov 12, 2024
4842100
i #284 Remove description tags
carlosparadis Nov 12, 2024
0f9769e
i #284 more minor doc formatting fixes
carlosparadis Nov 12, 2024
6f6a59b
i #284 Updates to exec/mailinglist.R and Minor Fixes for Mail Configu…
daomcgill Nov 12, 2024
775b5a6
Merge branch 'master' of https://github.com/sailuh/kaiaulu
daomcgill Nov 12, 2024
93f214e
Merge branch 'master' into 284-mbox-download-refresher
daomcgill Nov 12, 2024
e27a604
i #295 Change argument for exec from 'tabulate' to 'parse'
daomcgill Nov 12, 2024
6a5fed6
i #284 Testing Fix for Actions
daomcgill Nov 13, 2024
ffb5c9c
i #284 Try Adding Debugging
daomcgill Nov 13, 2024
e55b6e2
Revert "i #284 Try Adding Debugging"
daomcgill Nov 13, 2024
c797219
i #284 Revert ctags version
daomcgill Nov 13, 2024
092e2ab
Update commit_message_id_coverage.Rd
daomcgill Nov 13, 2024
56dff9c
i #284 Please work
daomcgill Nov 13, 2024
fd97af0
i #295 Last try
daomcgill Nov 13, 2024
8709b95
Revert "i #295 Last try"
daomcgill Nov 13, 2024
71054f9
Revert "i #284 Please work"
daomcgill Nov 13, 2024
382383d
Revert "Update commit_message_id_coverage.Rd"
daomcgill Nov 13, 2024
f11e452
Revert "i #284 Revert ctags version"
daomcgill Nov 13, 2024
09d00c3
Reapply "i #284 Try Adding Debugging"
daomcgill Nov 13, 2024
216fe07
i #284 R version
daomcgill Nov 13, 2024
99823d7
i #284 another R version change attempt
daomcgill Nov 13, 2024
6cd5e11
i #284 Version that was passing check
daomcgill Nov 13, 2024
132355d
i #295 Small changes from updated config
daomcgill Nov 30, 2024
dceded0
i #284 Updates to Mail Notebook
daomcgill Dec 1, 2024
5515d7c
i #284 Update Mailing List Exec to use "refresh"
daomcgill Dec 6, 2024
a89b983
Reverse github actions to match master
carlosparadis Dec 8, 2024
4aa2af2
Remove git.R print statements
carlosparadis Dec 8, 2024
a366573
Remove prints from mail tests
carlosparadis Dec 8, 2024
2887232
Remove prefix underline
carlosparadis Dec 8, 2024
800fccc
Remove additional git prints
carlosparadis Dec 8, 2024
67de9f8
Internal api functions should not be display
carlosparadis Dec 8, 2024
ae1ba66
Unit tests now pass locally
carlosparadis Dec 8, 2024
3697fe3
Remove more prints..
carlosparadis Dec 8, 2024
7fc9e41
Remove strange mbox file path
carlosparadis Dec 8, 2024
207d0c4
Fix parse_mbox removing stderr = TRUE
carlosparadis Dec 8, 2024
d3dd232
Add loop to parse_mbox on notebook
carlosparadis Dec 8, 2024
f74aff3
Documentation pass
carlosparadis Dec 8, 2024
557ad10
i #284 Update refresh functions
daomcgill Dec 8, 2024
1cf86e5
i #284 Missing file in previous commit
daomcgill Dec 8, 2024
f3048a9
i #284 Edit exec/mailinglist.R parse to take file as arg
daomcgill Dec 9, 2024
5ce5830
i #284 Use pipermail path for parsing pipermail folder
daomcgill Dec 9, 2024
41f0850
Merge branch 'master' into 284-mbox-download-refresher
daomcgill Dec 9, 2024
4df52d9
i #284 Minor fixes
daomcgill Dec 9, 2024
9d08ff7
Create /exec/ghevents.R
connorn-dev Feb 11, 2025
1e0a45f
Define CLI
connorn-dev Feb 11, 2025
874bf7d
ghevents.R testing logic
connorn-dev Feb 11, 2025
8fe551a
Add Help alert, Check arguments
connorn-dev Feb 11, 2025
8f35262
Added first Download Logic
connorn-dev Feb 11, 2025
aba7424
Finished CLI Donwload Logic
connorn-dev Feb 18, 2025
606d3ea
Finished Download and Parser Functions
connorn-dev Feb 18, 2025
07d350f
i #341 Vingette with R and Python together
connorn-dev Feb 25, 2025
c592260
i #341 Updated Vignette
connorn-dev Feb 25, 2025
i #284 Refactor download_pipermail function
- Remove archive_url and archive_type parameters from download_pipermail().
- Add start_year_month and end_year_month parameters for date filtering.
- Remove convert_pipermail_to_mbox() function, as download_pipermail() now handles file conversion automatically.
- Change file naming convention to 'kaiaulu_YYYYMM.mbox'.
- Attempt to download and decompress files directly without saving .gz to disk, but could not establish a valid connection.

Signed-off-by: Dao McGill <[email protected]>
daomcgill committed Sep 15, 2024
commit 7c585aeda18537044f97e85f8648183bd010f10c
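
The naming and date-range rules described in this commit message can be sketched in base R. This is a minimal illustration, not the code from the PR: `pipermail_to_mbox_name()` and `in_range()` are hypothetical helpers, and the sketch assumes pipermail links look like `2023-October.txt.gz` with English month names. It uses `match(..., month.name)` instead of `as.Date(..., format = "%B")`, because `%B` depends on the session locale.

```r
# Hypothetical helper (not part of the PR): map a pipermail file name such as
# "2023-October.txt.gz" to the 'kaiaulu_YYYYMM.mbox' naming convention.
pipermail_to_mbox_name <- function(link) {
  # Drop the ".txt" or ".txt.gz" extension, leaving e.g. "2023-October"
  base_name <- sub("\\.txt(\\.gz)?$", "", link)
  parts <- strsplit(base_name, "-", fixed = TRUE)[[1]]
  year <- parts[1]
  # month.name holds English month names, so this lookup ignores the locale
  month <- match(parts[2], month.name)
  if (is.na(month)) stop("Unrecognized month in: ", link)
  sprintf("kaiaulu_%s%02d.mbox", year, month)
}

# Hypothetical helper: TRUE where a YYYYMM value lies in [start, end], inclusive
in_range <- function(year_month, start_year_month, end_year_month) {
  ym <- as.integer(year_month)
  ym >= as.integer(start_year_month) & ym <= as.integer(end_year_month)
}

# Example links (assumed format), filtered to October 2023 through May 2024
links <- c("2023-September.txt.gz", "2023-October.txt.gz",
           "2024-May.txt.gz", "2024-June.txt.gz")
mbox_names <- vapply(links, pipermail_to_mbox_name, character(1), USE.NAMES = FALSE)
year_months <- sub("^kaiaulu_([0-9]{6})\\.mbox$", "\\1", mbox_names)
selected <- mbox_names[in_range(year_months, "202310", "202405")]
print(selected)
```

Comparing the values as integers keeps the range check independent of whether the bounds arrive as strings or numbers.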
136 changes: 86 additions & 50 deletions R/mail.R
@@ -7,63 +7,99 @@
 ############## Downloader ##############

 #' Download all pipermail files in an archive as mbox files
-#' @param archive_url An url pointing to a pipermail archive
 #' @param mailing_list The name of the mailing list being downloaded
-#' @param archive_type The name of the type of archive that the mailing list is stored in
+#' @param start_year_month The year and month of the first file to be downloaded
+#' @param end_year_month The year and month of the last file to be downloaded
 #' @param save_folder_path The folder path in which all the downloaded pipermail files will be stored
-#' @return Returns `destination`, a vector of the downloaded files in the current working directory
+#' @return Returns `downloaded_files`, a vector of the downloaded files in the current working directory
 #' @export
-download_pipermail <- function(archive_url, mailing_list, archive_type, save_folder_path) {
-
-  #Get page
-  pagedata <- httr::GET(archive_url)
-
-  #Parse html file into object
-  tbls_xml <- XML::htmlParse(pagedata)
-
-  #Extract href tablenodes from html table
-  tableNodes <- XML::getNodeSet(tbls_xml, "//td/a[@href]")
-
-  #Extract filenames from tablenode content with xmlGetAtrr
-  hrefs <- sapply(tableNodes, XML::xmlGetAttr, 'href')
-
-  #Create Vector
-  files <- vector()
-  file_names <- vector()
-
-  #Compose download urls for both gunzipped and plain text files
-  for (i in hrefs ){
-    if (endsWith(i, ".txt.gz")){
-      # Converts month from text form into a number for the naming convention
-      f_month <- match(sub("[^_]*-","", sub(".txt.gz","",i)), month.name)
-      # Retrieves year number for the naming convention
-      f_year <- sub("-[^_]*", "", i)
-      # txt files are actually mbox files, so this renames the extension
-      file_names <- c(file_names, sprintf("%s%02d.mbox", f_year, f_month))
-      # Saves regular name so that function can access correct url
-      i <- stringi::stri_c(archive_url, i, sep = "/")
-      files <- c(files, i)
-    } else if (endsWith(i, ".txt")) {
-      # Same logic, but with txt
-      f_month <- match(sub("[^_]*-","", sub(".txt","",i)), month.name)
-      f_year <- sub("-[^_]*", "", i)
-      file_names <- c(file_names, sprintf("%s%02d.mbox", f_year, f_month))
-      i <- stringi::stri_c(archive_url, i, sep = "/")
-      files <- c(files, i)
+download_pipermail <- function(mailing_list, start_year_month, end_year_month, save_folder_path) {
+
+  # Create directory if it does not exist
+  if (!dir.exists(save_folder_path)) {
+    dir.create(save_folder_path, recursive = TRUE)
+  }
+
+  # Get mailing list contents
+  response <- GET(mailing_list)
+
+  # Parse the response
+  parsed_response <- content(response, "text")
+  doc_obj <- htmlParse(parsed_response, asText = TRUE)
+
+  # Table rows
+  rows <- getNodeSet(doc_obj, "//tr")
+
+  # Skip header row
+  data_rows <- rows[-1]
+
+  # Vector for link storage
+  links = c()
+
+  # Extract the date and link from each row
+  for (row in data_rows) {
+    # Date in YYYYMM format
+    date_extracted <- xpathSApply(row, ".//td[1]", xmlValue)
+    date_cleaned <- stri_replace_last_regex(date_extracted, pattern = ":$", replacement = "")
+    date_cleaned <- stri_trim_both(date_cleaned)
+    # Parse the date
+    # Add 01 as dummy to make it a valid date
+    date_parsed <- as.Date(paste0("01 ", date_cleaned), format = "%d %B %Y")
+    year_month <- format(date_parsed, "%Y%m")
+
+    # Check if date is within range
+    if (year_month >= start_year_month & year_month <= end_year_month) {
+      # get href from column 3
+      link_nodes <- xpathSApply(row, ".//td[3]/a", xmlGetAttr, 'href')
+      # Store the link in links
+      link <- link_nodes[1]
+      links <- c(links, link)
     }
   }
-  amount <- length(files)
-  # File downloading loop
-  for (i in 1:amount){
-
-    #download file and place it at the destination
-    save_file_name <- stringi::stri_c(mailing_list, archive_type, file_names[[i]], sep = "_")
-    save_file_path <- stringi::stri_c(save_folder_path, save_file_name, sep = "/")
-    httr::GET(files[[i]], httr::write_disk(save_file_path, overwrite=TRUE))
+  # Vector for downloaded files
+  downloaded_files <- c()
+  for (i in seq_along(links)) {
+    link <- links[i]
+
+    # Extract the name without the .txt.gz extension
+    base_name <- gsub("\\.txt\\.gz$", "", link)
+
+    # Parse the date from the base name
+    date_parsed <- as.Date(paste0("01-", base_name), format = "%d-%Y-%B")
+    year_month_clean <- format(date_parsed, "%Y%m")
+
+    # Download URL
+    download_url <- paste0(mailing_list, link)
+
+    # Define the destination file
+    # Rename (also converts to mbox by changing extension to .mbox)
+    dest_gz <- file.path(save_folder_path, paste0('kaiaulu_', year_month_clean, '.mbox.gz'))
+    dest <- file.path(save_folder_path, paste0('kaiaulu_', year_month_clean, '.mbox'))
+
+    # Download the gz mbox file
+    cat("Downloading:", download_url, "\n")
+    GET(download_url, write_disk(dest_gz, overwrite = TRUE))
+
+    # Unzip the file
+    gz_con <- gzfile(dest_gz, open = "rb")
+    out_con <- file(dest, open = "wb")
+    while (TRUE) {
+      bytes <- readBin(gz_con, what = raw(), n = 1024 * 1024)
+      if (length(bytes) == 0) break
+      writeBin(bytes, out_con)
+    }
+    close(gz_con)
+    close(out_con)
+
+    # Remove the gz file
+    file.remove(dest_gz)
+
+    # Add the downloaded file to the list
+    downloaded_files <- c(downloaded_files, dest)
   }

-  #Return filenames
-  return(save_folder_path)
+  # Return downloaded files
+  return(downloaded_files)

 }

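The chunked decompression step in `download_pipermail()` could also be isolated into its own function. The following is a minimal sketch under assumptions, not code from this PR: `gunzip_to_file()` and its `chunk_size` argument are hypothetical names. It uses the same `gzfile()` / `readBin()` / `writeBin()` loop as the diff, and registers `on.exit()` handlers so both connections are closed even if reading fails partway.

```r
# Hypothetical helper (not part of the PR): stream-decompress a .gz file to disk
# in fixed-size chunks, so the whole archive never has to fit in memory.
gunzip_to_file <- function(gz_path, dest_path, chunk_size = 1024 * 1024) {
  gz_con <- gzfile(gz_path, open = "rb")
  on.exit(close(gz_con), add = TRUE)    # closed on normal exit and on error
  out_con <- file(dest_path, open = "wb")
  on.exit(close(out_con), add = TRUE)
  repeat {
    bytes <- readBin(gz_con, what = raw(), n = chunk_size)
    if (length(bytes) == 0) break       # end of the compressed stream
    writeBin(bytes, out_con)
  }
  invisible(dest_path)
}

# Usage: write a tiny gzip-compressed mbox-like file, then decompress it.
gz_path <- tempfile(fileext = ".mbox.gz")
dest_path <- tempfile(fileext = ".mbox")
message_lines <- c("From alice@example.org Mon Oct  2 10:00:00 2023",
                   "Subject: test", "", "Hello")
con <- gzfile(gz_path, open = "wt")
writeLines(message_lines, con)
close(con)

gunzip_to_file(gz_path, dest_path)
round_trip <- readLines(dest_path)
```

Keeping the download and the decompression separate would also make each step testable without network access.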
8 changes: 8 additions & 0 deletions conf/helix.yml
@@ -59,6 +59,14 @@ mailing_list:
 mbox: ../../rawdata/helix/mod_mbox/helix-user/
 mailing_list: helix-user
 archive_type: apache
+# Using for testing R/mail.R/pipermail_downloader()
+pipermail_key:
+archive_url: https://mta.openssl.org/mailman/listinfo/
+mailing_list: https://mta.openssl.org/pipermail/openssl-users/
+# archive_type
+start_year_month: 202310
+end_year_month: 202405
+save_folder_path: "save_folder_mail"

 issue_tracker:
 jira:
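
Because `start_year_month` and `end_year_month` are unquoted in the config, a YAML loader will typically hand them to R as integers, while `download_pipermail()` compares them against character strings produced by `format(date, "%Y%m")`. A minimal sketch of normalizing and validating the values before use, assuming a hypothetical helper `normalize_year_month()` that is not part of this PR:

```r
# Hypothetical helper (not part of the PR): coerce a config value to a
# six-character "YYYYMM" string and reject anything that is not a valid month.
normalize_year_month <- function(x) {
  ym <- as.character(x)
  if (length(ym) != 1 || !grepl("^[0-9]{4}(0[1-9]|1[0-2])$", ym)) {
    stop("Expected a single YYYYMM value, got: ", paste(ym, collapse = ", "))
  }
  ym
}

# The values from conf/helix.yml, once as an integer and once as a string
start_year_month <- normalize_year_month(202310L)
end_year_month <- normalize_year_month("202405")

# Fixed-width digit strings sort in chronological order
stopifnot(start_year_month <= end_year_month)
```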