Frequent Itemset Clustering (Apriori and ECLAT) #210
Open
Wander03 wants to merge 72 commits into tidymodels:main from Wander03:main
c785217
created bsaed function for frequent itemsets and association rules
Wander03 4fed387
testing new functions
Wander03 f70d6f6
clustering for freq itemsets function
Wander03 7fb5ede
progress!
Wander03 f2a51f5
fix conditions
Wander03 f8787f4
fix conditions
Wander03 ea1c695
Update text
Wander03 2d30942
change method to mining_method
Wander03 9459749
create vignette for freq itemsets
Wander03 144036c
bug fixing
Wander03 1c19e9d
fixed name
Wander03 6ae24be
premptive changes
Wander03 94dcf46
bug fixing freq itemsets
Wander03 f5c8dd1
code formatting
Wander03 d01188f
bug fixes
Wander03 1074d74
updating cluster functions
Wander03 fdeac74
save average supports for each cluster (to be used in predict)
Wander03 02dbce1
predict not saving output
Wander03 ecf7486
some change
Wander03 cff078c
fixed predcit! Proba is now put in N/A spots
Wander03 4d45209
remove avg support tracker (unused)
Wander03 1f8f633
change best cluster to prioritize size then support
Wander03 690c385
vignette testing
Wander03 ae9917b
predict output formated & cutoff implemented
Wander03 a6826c3
create holder for extract_predictions function (placeholder name)
Wander03 cf3e82b
hard code cutoff
Wander03 c6acf01
change predict formating
Wander03 d81fc52
something>
Wander03 cfb5d74
extract_predictions complete! (still needs a better name)
Wander03 5fc3153
move detail text
Wander03 a3abf08
add tuning for min_support
Wander03 f2805b2
tune update
Wander03 a304ebd
improved example code
Wander03 0fbcdb7
fix params help doc
Wander03 d23f353
create test files (TODO: create tests)
Wander03 dcea789
adding raw to predict
Wander03 b22bcee
split regualar predict and raw predict.
Wander03 be2f3a9
predict fixed! Output is the same :D
Wander03 3ab1362
change default to eclat
Wander03 a5c8edd
fixing replacing wrong part from earlier commit
Wander03 022c984
updating with correct default
Wander03 a61eb1e
augment code written, add correct header text and move unecessary code
Wander03 ffaa380
standardize column names
Wander03 8d18057
testing
Wander03 f99b0aa
testing2
Wander03 0f9e947
hide predict dataframe from arules::inspect()
Wander03 e12ae62
remove `` from predict output item names
Wander03 3bc1e43
added note
Wander03 ace0ce4
re-roder doesnt matter for fit
Wander03 2ed1a98
hide freq itemset output from auto displaying when extracting cluster…
Wander03 18566d2
rename col name in predict
Wander03 4315559
vignettes update with new info from thesis
Wander03 083cb4d
added header descriptions about functions
Wander03 1923aea
added convergence limit and warning message
Wander03 3bfdba4
update freq_itemsets extract_fit_summary and ? information
Wander03 39a92eb
vignette update
Wander03 d68410d
create test cases for freq_itemsets
Wander03 19c58ae
move min_support tuning to dials
Wander03 b59df0b
Merged upstream/main into main
Wander03 fe2537d
re-ran test cases
Wander03 96ad9f2
remove assoc_rules
Wander03 6e5a28f
Add the following
Wander03 1ae32d8
rename `extract_predictions` to `extract_itemset_predictions`
Wander03 f808e10
rename `extract_predictions` to `extract_itemset_predictions`
Wander03 0629618
Add exported functions to _pkgdown.yml
Wander03 e22c3c6
convert all rlang::abort() calls to use {cli}
Wander03 842f8d7
edit toy_df and toy_pred to use " instead of ' and TRUE/FALSE instead…
Wander03 305b078
add example to `augment_itemset_predict`
Wander03 5860141
add skip_if_not_installed("arules") to all tests that use freq_itemse…
Wander03 32705bf
use base R rather than stringr
Wander03 7b2751c
use the reduce() from compat-purrr.R
Wander03 fbe29b3
stats::setNames
Wander03
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,6 +39,7 @@ Imports: | |
utils, | ||
vctrs (>= 0.5.0) | ||
Suggests: | ||
arules, | ||
cluster, | ||
ClusterR, | ||
clustMixType (>= 0.3-5), | ||
|
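Because `arules` is added under `Suggests` rather than `Imports`, code that relies on it generally has to check for the package at run time (the commit list mentions `skip_if_not_installed("arules")` for the tests). The following is a minimal sketch of such a guard in base R. The helper name `require_suggested()` is hypothetical and is not part of this PR; the package may well use a different mechanism.

```r
# Hypothetical helper (not from the PR): fail early with a clear message
# when an optional (Suggests) dependency is not installed.
require_suggested <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(
      sprintf("Package '%s' is required for this feature but is not installed.", pkg),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# "stats" ships with every R installation, so this call succeeds.
require_suggested("stats")

# A missing package produces an error instead of a confusing failure later on.
res <- try(require_suggested("aPackageThatDoesNotExist123"), silent = TRUE)
inherits(res, "try-error")
```

In a fitting function, the same check would be called with `"arules"` before any `arules::` function is used.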
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,170 @@ | ||
#' Augment Itemset Predictions with Truth Values | ||
#' | ||
#' This function processes the output of a `predict()` call for frequent itemset models | ||
#' and joins it with the corresponding ground truth data. It's designed to prepare | ||
#' the prediction and truth values in a format suitable for calculating evaluation metrics | ||
#' using packages like `yardstick`. | ||
#' | ||
#' @param pred_output A data frame that is the output of `predict()` from a `freq_itemsets` model. | ||
#' It is expected to have a column named `.pred_cluster`, where each cell contains | ||
#' a data frame with prediction details (including `.pred_item`, `.obs_item`, and `item`). | ||
#' @param truth_output A data frame representing the ground truth. It should have a similar | ||
#' structure to the input data used for prediction, where columns represent items | ||
#' and rows represent transactions. | ||
#' | ||
#' @details | ||
#' The function first extracts and combines all individual item prediction data frames | ||
#' nested within the `pred_output`. It then filters for items where a prediction was made | ||
#' (i.e., `!is.na(.pred_item)`) and standardizes item names by removing backticks. | ||
#' The `truth_output` is pivoted to a long format to match the structure of the predictions. | ||
#' Finally, an inner join is performed to ensure that only predicted items are included in | ||
#' the final result, aligning predictions with their corresponding true values. | ||
#' | ||
#' @return A data frame with the following columns: | ||
#' \itemize{ | ||
#' \item `item`: The name of the item. | ||
#' \item `row_id`: An identifier for the transaction (row) from which the prediction came. | ||
#' \item `preds`: The predicted value for the item (either raw probability or binary prediction). | ||
#' \item `truth`: The true value for the item from `truth_output`. | ||
#' } | ||
#' This output is suitable for direct use with `yardstick` metric functions. | ||
#' | ||
#' @examples | ||
#' toy_df <- data.frame( | ||
#' "beer" = c(FALSE, TRUE, TRUE, TRUE, FALSE), | ||
#' "milk" = c(TRUE, FALSE, TRUE, TRUE, TRUE), | ||
#' "bread" = c(TRUE, TRUE, FALSE, TRUE, TRUE), | ||
#' "diapers" = c(TRUE, TRUE, TRUE, TRUE, TRUE), | ||
#' "eggs" = c(FALSE, TRUE, FALSE, FALSE, FALSE) | ||
#' ) | ||
#' | ||
#' new_data <- data.frame( | ||
#' "beer" = NA, | ||
#' "milk" = TRUE, | ||
#' "bread" = TRUE, | ||
#' "diapers" = TRUE, | ||
#' "eggs" = FALSE | ||
#' ) | ||
#' | ||
#' truth_df <- data.frame( | ||
#' "beer" = FALSE, | ||
#' "milk" = TRUE, | ||
#' "bread" = TRUE, | ||
#' "diapers" = TRUE, | ||
#' "eggs" = FALSE | ||
#' ) | ||
#' | ||
#' fi_spec <- freq_itemsets( | ||
#' min_support = 0.05, | ||
#' mining_method = "eclat" | ||
#' ) |> | ||
#' set_engine("arules") |> | ||
#' set_mode("partition") | ||
#' | ||
#' fi_fit <- fi_spec |> | ||
#' fit(~ ., | ||
#' data = toy_df | ||
#' ) | ||
#' | ||
#' aug_pred <- fi_fit |> | ||
#' predict(new_data, type = "raw") |> | ||
#' augment_itemset_predict(truth_output = truth_df) | ||
#' | ||
#' aug_pred | ||
#' | ||
#' # Example use of formatted output | ||
#' aug_pred |> | ||
#' yardstick::rmse(truth, preds) | ||
#' | ||
#' @export | ||
|
||
augment_itemset_predict <- function(pred_output, truth_output) { | ||
# Extract all predictions (bind all .pred_cluster dataframes) | ||
preds_df <- dplyr::bind_rows(pred_output$.pred_cluster, .id = "row_id") %>% | ||
dplyr::filter(!is.na(.pred_item)) %>% # Keep only rows with predictions | ||
dplyr::mutate( | ||
item = gsub("`|TRUE|FALSE", "", item) # Remove backticks, TRUE, and FALSE from item names | ||
) %>% | ||
dplyr::select(row_id, item, preds = .pred_item) # Standardize column names | ||
|
||
# Pivot truth data to long format (to match predictions) | ||
truth_long <- truth_output %>% | ||
tibble::rownames_to_column("row_id") %>% | ||
tidyr::pivot_longer( | ||
cols = -row_id, | ||
names_to = "item", | ||
values_to = "truth_value" | ||
) %>% | ||
dplyr::mutate(truth_value = as.numeric(truth_value)) | ||
|
||
# Join predictions with truth (inner join to keep only predicted items) | ||
result <- preds_df %>% | ||
dplyr::inner_join(truth_long, by = c("row_id", "item")) | ||
|
||
# Return simplified output (preds vs truth) | ||
dplyr::select(result, item, row_id, preds, truth = truth_value) | ||
} | ||
|
||
#' Generate Dataframe with Random NAs and Corresponding Truth | ||
#' | ||
#' @description | ||
#' This helper function creates a new data frame by randomly introducing `NA` values | ||
#' into an input data frame. It also returns the original data frame as a "truth" | ||
#' reference, which can be useful for simulating scenarios with missing data | ||
#' for prediction tasks. | ||
#' | ||
#' @param df The input data frame to which `NA` values will be introduced. | ||
#' It is typically a transactional dataset where columns are items and rows are transactions. | ||
#' @param na_prob The probability (between 0 and 1) that any given cell in the | ||
#' input data frame will be replaced with `NA`. | ||
#' | ||
#' @return A list containing two data frames: | ||
#' \itemize{ | ||
#' \item `na_data`: The data frame with `NA` values randomly introduced. | ||
#' \item `truth`: The original input data frame, serving as the ground truth. | ||
#' } | ||
#' @examples | ||
#' # Create a sample data frame | ||
#' sample_df <- data.frame( | ||
#' itemA = c(1, 0, 1), | ||
#' itemB = c(0, 1, 1), | ||
#' itemC = c(1, 1, 0) | ||
#' ) | ||
#' | ||
#' # Generate NA data and truth with 30% NA probability | ||
#' set.seed(123) | ||
#' na_data_list <- random_na_with_truth(sample_df, na_prob = 0.3) | ||
#' | ||
#' # View the NA data | ||
#' print(na_data_list$na_data) | ||
#' | ||
#' # View the truth data | ||
#' print(na_data_list$truth) | ||
#' | ||
#' @details | ||
#' This function is not exported. It was used for testing and for the vignette | ||
#' examples, and it may be formally introduced in the future. | ||
#' | ||
#' @noRd | ||
random_na_with_truth <- function(df, na_prob = 0.3) { | ||
# Create a copy of the original dataframe to store truth values | ||
truth_df <- df | ||
|
||
# Create a mask of NAs (TRUE = becomes NA) | ||
na_mask <- matrix( | ||
sample( | ||
c(TRUE, FALSE), | ||
size = nrow(df) * ncol(df), | ||
replace = TRUE, | ||
prob = c(na_prob, 1 - na_prob) | ||
), | ||
nrow = nrow(df) | ||
) | ||
|
||
# Apply the mask to create NA values | ||
na_df <- df | ||
na_df[na_mask] <- NA | ||
|
||
# Return both the NA-filled dataframe and the truth | ||
list( | ||
na_data = na_df, | ||
truth = truth_df | ||
) | ||
} |
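The join logic documented for `augment_itemset_predict()` (stack the nested predictions, keep only predicted items, pivot the truth to long format, inner join) can be sketched in base R without `dplyr` or `tidyr`. This is an illustration only, not the PR's implementation: the toy `pred_cluster` list is an assumption that mimics the documented shape of the `.pred_cluster` column (`item`, `.obs_item`, `.pred_item`), and the values are made up.

```r
# Assumed stand-in for the `.pred_cluster` list column: one data frame per
# row of new data, with NA in `.pred_item` where the item was observed.
pred_cluster <- list(
  data.frame(
    item = c("`beer`", "`milk`", "`bread`"),
    .obs_item = c(NA, 1, 1),
    .pred_item = c(0.4, NA, NA)
  ),
  data.frame(
    item = c("`beer`", "`milk`", "`bread`"),
    .obs_item = c(1, NA, NA),
    .pred_item = c(NA, 0.8, 0.6)
  )
)

truth <- data.frame(
  beer = c(FALSE, TRUE),
  milk = c(TRUE, TRUE),
  bread = c(TRUE, FALSE)
)

# 1. Stack the nested data frames, tagging each with its row position.
preds <- do.call(rbind, lapply(seq_along(pred_cluster), function(i) {
  data.frame(row_id = as.character(i), pred_cluster[[i]])
}))

# 2. Keep only items that were actually predicted and strip backticks.
preds <- preds[!is.na(preds$.pred_item), ]
preds$item <- gsub("`", "", preds$item, fixed = TRUE)

# 3. Pivot the truth to long format (column-major, one row per cell).
truth_long <- data.frame(
  row_id = rep(rownames(truth), times = ncol(truth)),
  item = rep(names(truth), each = nrow(truth)),
  truth = as.numeric(unlist(truth, use.names = FALSE))
)

# 4. Inner join: merge() keeps only keys present in both inputs by default.
joined <- merge(
  preds[, c("row_id", "item", ".pred_item")],
  truth_long,
  by = c("row_id", "item")
)
names(joined)[names(joined) == ".pred_item"] <- "preds"
joined
```

Only the three predicted cells survive the join, so observed items never enter the metric calculation.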
All exported functions need examples.
I would also like to see an example to help determine how it is used.