Frequent Itemset Clustering (Apriori and ECLAT) #210
add predict to vignette
Two more things:
- Add the following to `utils::globalVariables()` in `aaa.R`: `.pred_item item preds row_id setNames truth_value`.
- Add exported functions to `_pkgdown.yml`.
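For illustration, the first request could look roughly like the fragment below. This is a sketch only: it assumes `aaa.R` has no existing `utils::globalVariables()` call, in which case the names would be appended to that call instead.

```r
# aaa.R (sketch): declare names used via non-standard evaluation so that
# R CMD check does not report them as undefined global variables.
utils::globalVariables(
  c(
    ".pred_item",
    "item",
    "preds",
    "row_id",
    "setNames",
    "truth_value"
  )
)
```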
I think I would like to chat about these prediction types in #211 before going through with this PR.
R/extract_predictions.R
Outdated
#' @return A data frame with items as columns and non-NA values as rows.
#' @export

extract_predictions <- function(pred_output) {
The idea here is essentially that we wanted to respect the tidyclust `predict()` output structure; namely, a one-column tibble. But the output of predictions in column-based clustering like association rules is not cluster assignments, but matrix completion.
What we arrived at was to return a list-col, where each element of the column represents the matrix completion result for that row of the test data.
However, in most use cases, the user wouldn't really need this list-col and would instead want the completed matrix. So, `extract_predictions()` was created to take the tidyclust output object and reconfigure it as the data matrix with predicted completions inserted.
We definitely have no issue with renaming it. But I believe a helper function like this is much needed for methods of this structure, unless we choose to expand the allowed structures that `predict()` itself returns.
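To make the list-col idea concrete, here is a minimal base R sketch. It is not the PR's implementation: the column name `.pred_cluster`, the per-row layout (`item`, `.pred_item`), the toy values, and the helper name `complete_matrix()` are all assumptions chosen for illustration.

```r
# Hypothetical test data: NA marks the entries to be completed.
new_data <- data.frame(
  beer  = c(1, NA),
  milk  = c(NA, 1),
  bread = c(1, 1)
)

# Hypothetical stand-in for predict() output: one column, one list element
# per test row, each holding the predicted values for that row's missing items.
pred_output <- data.frame(.pred_cluster = I(list(
  data.frame(item = "milk", .pred_item = 1),
  data.frame(item = "beer", .pred_item = 0)
)))

# Insert each row's predictions into the NA cells of the test data.
complete_matrix <- function(new_data, pred_output) {
  completed <- new_data
  for (i in seq_len(nrow(completed))) {
    row_pred <- pred_output$.pred_cluster[[i]]
    for (j in seq_len(nrow(row_pred))) {
      item <- as.character(row_pred$item[j])
      if (is.na(completed[i, item])) {
        completed[i, item] <- row_pred$.pred_item[j]
      }
    }
  }
  completed
}

completed <- complete_matrix(new_data, pred_output)
```

The result has the same shape as the test data, which is the form most users would want to inspect or pass on.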
R/extract_fit_summary.R
Outdated
#' @export
extract_fit_summary.itemsets <- function(object, ...,
                                         call = rlang::caller_env(n = 0)) {
  rlang::abort(
please convert all `rlang::abort()` calls to use {cli}, see 13f30dd for inspiration, or tag me if you need help
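As a rough sketch of what such a conversion might look like (the contents of 13f30dd are not shown here, so the helper name `check_not_itemsets()` and the message text are hypothetical):

```r
# Hypothetical helper: errors with a cli-formatted message when given an
# object of class "itemsets". Message wording is illustrative only.
check_not_itemsets <- function(object, call = rlang::caller_env()) {
  if (inherits(object, "itemsets")) {
    # Before: rlang::abort("<message>", call = call)
    cli::cli_abort(
      "{.fn extract_fit_summary} is not supported for {.cls itemsets} objects.",
      call = call
    )
  }
  invisible(object)
}
```

`cli::cli_abort()` accepts the same `call` argument, so the error is still attributed to the calling function, while the message gains cli's inline markup for functions and classes.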
toy_df <- data.frame(
  'beer' = c(F, T, T, T, F),
  'milk' = c(T, F, T, T, T),
  'bread' = c(T, T, F, T, T),
  'diapers' = c(T, T, T, T, T),
  'eggs' = c(F, T, F, F, F)
)
Suggested change:
toy_df <- data.frame(
  "beer" = c(FALSE, TRUE, TRUE, TRUE, FALSE),
  "milk" = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  "bread" = c(TRUE, TRUE, FALSE, TRUE, TRUE),
  "diapers" = c(TRUE, TRUE, TRUE, TRUE, TRUE),
  "eggs" = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)
This does two things: it uses `"` instead of `'`, and it spells out `TRUE` and `FALSE` in full.
This should be changed everywhere.
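The reviewer does not give a reason in this thread, but one commonly cited motivation for spelling out `TRUE` and `FALSE` is that `T` and `F` are ordinary variables that can be rebound, whereas `TRUE` and `FALSE` are reserved words. A small sketch (the function name `demo_shorthand()` is made up for this example):

```r
# T is only a variable that happens to hold TRUE by default,
# so it can be silently overwritten inside any scope.
demo_shorthand <- function() {
  T <- FALSE
  identical(T, TRUE)
}

shorthand_still_true <- demo_shorthand()

# TRUE itself is a reserved word; trying to assign to it is an error.
assign_to_true <- tryCatch(
  eval(parse(text = "TRUE <- FALSE")),
  error = function(e) "error"
)
```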
#' @export

augment_itemset_predict <- function(pred_output, truth_output) {
All exported functions need examples.
I would also like to see an example, to help determine how it is meant to be used.
@@ -64,3 +72,12 @@ test_that("prefix is passed in extract_centroids()", {
    all(substr(res$.cluster, 1, 2) == "C_")
  )
})

test_that("extract_centroids errors for freq_itemsets", {
  set.seed(1234)
please add `skip_if_not_installed("arules")` to all tests that use `freq_itemsets()`
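A minimal sketch of where the skip would go. The test body below is a placeholder, not the PR's actual test; the real body would fit a `freq_itemsets()` model and check the error.

```r
library(testthat)

test_that("extract_centroids errors for freq_itemsets", {
  # Skip (rather than fail) on machines where the suggested
  # package arules is not available.
  skip_if_not_installed("arules")

  # Placeholder expectation standing in for the real test body.
  expect_true(requireNamespace("arules", quietly = TRUE))
})
```

Placing the skip on the first line of the block means nothing that depends on arules runs before the check.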
R/extract_cluster_assignment.R
Outdated
items <- attr(object, "item_names")
itemsets <- arules::DATAFRAME(object)

itemset_list <- lapply(strsplit(gsub("[{}]", "", itemsets$items), ","), stringr::str_trim)
R/predict_helpers.R
Outdated
# Extract frequent itemsets and their supports
items <- attr(object, "item_names")
itemsets <- arules::DATAFRAME(object)
frequent_itemsets <- lapply(strsplit(gsub("[{}]", "", itemsets$items), ","), stringr::str_trim)
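To show what this parsing step does, here is a sketch on hand-written strings in the `{a,b}` style. The input vector is made up, and base `trimws()` is used as a stand-in for `stringr::str_trim()`, so this is an illustration rather than the PR's code.

```r
# Made-up itemset labels in brace notation.
items_chr <- c("{beer,diapers}", "{milk}", "{bread, diapers, milk}")

# 1. gsub() drops the braces, 2. strsplit() splits on commas,
# 3. trimws() removes any whitespace left around each item.
parsed_itemsets <- lapply(
  strsplit(gsub("[{}]", "", items_chr), ","),
  trimws
)
```

Each element of `parsed_itemsets` is a character vector of item names for one itemset. Note that this approach assumes item names never contain commas or braces themselves.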
R/predict_helpers.R
Outdated

# Create result data frame
data.frame(
  item = stringr::str_remove_all(items, "`"), # Remove backticks from item names
R/extract_predictions.R
Outdated

# Process each observation and combine results using reduce
result_df <- data_frames %>%
  purrr::reduce(.f = ~ {
please use the `reduce()` from `compat-purrr.R`
R/extract_cluster_assignment.R
Outdated
unique_non_zero_clusters <- unique(non_zero_clusters)

# Map each unique non-zero cluster to a new cluster starting from Cluster_1
cluster_map <- setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)
Suggested change:
cluster_map <- stats::setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)
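A small sketch of what this mapping produces, using made-up cluster ids and an assumed prefix of `"Cluster_"`:

```r
# Made-up raw cluster ids (zeros already removed) and an assumed prefix.
non_zero_clusters <- c(7, 3, 7, 12, 3)
prefix <- "Cluster_"

unique_non_zero_clusters <- unique(non_zero_clusters)

# Named vector: names are the original ids, values are the new labels,
# numbered in order of first appearance.
cluster_map <- stats::setNames(
  paste0(prefix, seq_along(unique_non_zero_clusters)),
  unique_non_zero_clusters
)

# Look up each original id by name to relabel the full vector.
relabelled <- unname(cluster_map[as.character(non_zero_clusters)])
```

Inside a package, calling `stats::setNames()` with the namespace prefix avoids the R CMD check note about an undefined global function, since stats is not attached in package code.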
add .pred_item item preds row_id setNames truth_value to utils::globalVariables() in aaa.R
add example to `extract_itemset_predictions`
Hi Emil! I believe that I addressed all your comments, please let me know if I missed something or if there is something else I need to edit.
@kbodwin
Relates to other conversations about column-based clustering, e.g. Consider partition data reduction algorithm #66
- Adds a partition mode with engine arules to tidyclust (`freq_itemsets`)
- Adds custom cluster and predict functions for `freq_itemsets()`
- Adds `extract_predictions()`, which reformats `predict()` output into a more readable format
- Adds `augment_itemset_predict()`, which reformats `predict()` output for metric functions (e.g. in yardstick)

Note: `devtools::check()` resulted in a warning about code dependencies from purrr and stringr.