Skip to content

Conversation

Wander03
Copy link

@Wander03 Wander03 commented Jun 19, 2025

@kbodwin

  • Relates to other conversations about column-based clustering, e.g. Consider partition data reduction algorithm #66

  • Adds a partition mode with engine arules to tidyclust (freq_itemsets)

  • Adds custom cluster and predict functions for freq_itemsets()

  • Adds extract_predictions() which reformates predict() output into a more readable format

  • Adds augment_itemset_predict() which reformates predict() output for metric functions (e.g. in yardstick)

Note:

  • devtools::check() resulted in a warning about code dependencies from purr and stringr

Copy link
Member

@EmilHvitfeldt EmilHvitfeldt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To more things:

  • Add the following .pred_item item preds row_id setNames truth_value to utils::globalVariables() in aaa.R.
  • Add exported functions to _pkgdown.yml`

i think i would like to chat about these prediction types in #211 before going through with this PR

#' @return A data frame with items as columns and non-NA values as rows.
#' @export

extract_predictions <- function(pred_output) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of this function, and i think most of it is in the name.

I'm afraid users would confuse it with collect_predictions().

please enlighten me @kbodwin, @Wander03, does users need access to both formats of this input/output?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is essentially that we wanted to respect the tidyclust predict() output structure; namely, a one-column tibble. But the output of predictions in column-based clustering like association rules is not cluster assignments, but matrix completion.

What we arrived at was to return a list-col, where each element of the column represents the matrix completion result for that row of the test data.

However, in most use cases, the user wouldn't really need this list-col and would instead want the completed matrix. So, extract_predictions() was created to take the tidyclust output object and reconfigure it as the data matrix with predicted completions inserted.

We definitely have no issue with renaming it. But I believe helper function like this is very needed for methods of this structure - unless we choose to expand the allowed structures that predict() itself returns.

#' @export
extract_fit_summary.itemsets <- function(object, ...,
call = rlang::caller_env(n = 0)) {
rlang::abort(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please convert all rlang::abort() calls to use {cli}, see 13f30dd for inspiration, or tag me if you need help

Comment on lines 1 to 7
toy_df <- data.frame(
'beer' = c(F, T, T, T, F),
'milk' = c(T, F, T, T, T),
'bread' = c(T, T, F, T, T),
'diapers' = c(T, T, T, T, T),
'eggs' = c(F, T, F, F, F)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
toy_df <- data.frame(
'beer' = c(F, T, T, T, F),
'milk' = c(T, F, T, T, T),
'bread' = c(T, T, F, T, T),
'diapers' = c(T, T, T, T, T),
'eggs' = c(F, T, F, F, F)
)
toy_df <- data.frame(
"beer" = c(FALSE, TRUE, TRUE, TRUE, FALSE),
"milk" = c(TRUE, FALSE, TRUE, TRUE, TRUE),
"bread" = c(TRUE, TRUE, FALSE, TRUE, TRUE),
"diapers" = c(TRUE, TRUE, TRUE, TRUE, TRUE),
"eggs" = c(FALSE, TRUE, FALSE, FALSE, FALSE)
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does two things, stops the usage of ' over " and uses the full name for TRUE and FALSE

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be changed all places

Comment on lines +31 to +33
#' @export

augment_itemset_predict <- function(pred_output, truth_output) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All exported functions need examples.

I would also like to see the example to help determine the use of it

@@ -64,3 +72,12 @@ test_that("prefix is passed in extract_centroids()", {
all(substr(res$.cluster, 1, 2) == "C_")
)
})

test_that("extract_centroids errors for freq_itemsets", {
set.seed(1234)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add skip_if_not_installed("arules") to all tests that use freq_itemsets()

items <- attr(object, "item_names")
itemsets <- arules::DATAFRAME(object)

itemset_list <- lapply(strsplit(gsub("[{}]", "", itemsets$items), ","), stringr::str_trim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Extract frequent itemsets and their supports
items <- attr(object, "item_names")
itemsets <- arules::DATAFRAME(object)
frequent_itemsets <- lapply(strsplit(gsub("[{}]", "", itemsets$items), ","), stringr::str_trim)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# Create result data frame
data.frame(
item = stringr::str_remove_all(items, "`"), # Remove backticks from item names
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# Process each observation and combine results using reduce
result_df <- data_frames %>%
purrr::reduce(.f = ~ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use the reduce() from compat-purrr.R

unique_non_zero_clusters <- unique(non_zero_clusters)

# Map each unique non-zero cluster to a new cluster starting from Cluster_1
cluster_map <- setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cluster_map <- setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)
cluster_map <- stats::setNames(paste0(prefix, seq_along(unique_non_zero_clusters)), unique_non_zero_clusters)

@Wander03
Copy link
Author

Wander03 commented Jul 3, 2025

Hi Emil! I believe that I addressed all your comments, please let me know if I missed something or if there is something else I need to edit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants