Skip to content

Commit 01b7db9

Browse files
committed
Update to align 'v7.0-dev-feat/performance' with 'v7.0-dev'
2 parents 35d91ff + 08b7b8c commit 01b7db9

File tree

160 files changed

+1097
-331
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

160 files changed

+1097
-331
lines changed

DESCRIPTION

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
Package: dsBaseClient
22
Title: 'DataSHIELD' Client Side Base Functions
3-
Version: 6.3.5
3+
Version: 7.0.0.9000
44
Description: Base 'DataSHIELD' functions for the client side. 'DataSHIELD' is a software package which allows
55
you to do non-disclosive federated analysis on sensitive data. 'DataSHIELD' analytic functions have
66
been designed to only share non disclosive summary statistics, with built in automated output

NAMESPACE

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -73,6 +73,7 @@ export(ds.matrixDimnames)
7373
export(ds.matrixInvert)
7474
export(ds.matrixMult)
7575
export(ds.matrixTranspose)
76+
export(ds.mdPattern)
7677
export(ds.mean)
7778
export(ds.meanByClass)
7879
export(ds.meanSdGp)

R/ds.mdPattern.R

Lines changed: 305 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,305 @@
1+
#'
2+
#' @title Display missing data patterns with disclosure control
3+
#' @description This function is a client-side wrapper for the server-side mdPatternDS
4+
#' function. It generates a missing data pattern matrix similar to mice::md.pattern but
5+
#' with disclosure control applied to prevent revealing small cell counts.
6+
#' @details The function calls the server-side mdPatternDS function which uses
7+
#' mice::md.pattern to analyze missing data patterns. Patterns with counts below the
8+
#' disclosure threshold (default: nfilter.tab = 3) are suppressed to maintain privacy.
9+
#'
10+
#' \strong{Output Format:}
11+
#' - Each row represents a missing data pattern
12+
#' - Pattern counts are shown in row names (e.g., "150", "25")
13+
#' - Columns show 1 if the variable is observed, 0 if missing
14+
#' - Last column shows the total number of missing values per pattern
15+
#' - Last row shows the total number of missing values per variable
16+
#'
17+
#' \strong{Disclosure Control:}
18+
#'
19+
#' Suppressed patterns (count below threshold) are indicated by:
20+
#' - Row name: "suppressed(<N>)" where N is the threshold
21+
#' - All pattern values set to NA
22+
#' - Summary row also suppressed to prevent back-calculation
23+
#'
24+
#' \strong{Pooling Behavior (type='combine'):}
25+
#'
26+
#' When pooling across studies, the function uses a \emph{conservative approach}
27+
#' for disclosure control:
28+
#'
29+
#' 1. Identifies identical missing patterns across studies
30+
#' 2. \strong{EXCLUDES suppressed patterns from pooling} - patterns suppressed in
31+
#' ANY study are not included in the pooled count
32+
#' 3. Sums counts only for non-suppressed identical patterns
33+
#' 4. Re-validates pooled counts against disclosure threshold
34+
#'
35+
#' \strong{Important:} This conservative approach means:
36+
#' - Pooled counts may be \emph{underestimates} if some studies had suppressed patterns
37+
#' - This prevents disclosure through subtraction (e.g., if study A shows count=5
38+
#' and pool shows count=7, one could deduce study B has count=2, violating disclosure)
39+
#' - Different patterns across studies are preserved separately in the pooled result
40+
#'
41+
#' @param x a character string specifying the name of a data frame or matrix on the
42+
#' server-side containing the data to analyze.
43+
#' @param type a character string specifying the output type. If 'split' (default),
44+
#' returns separate patterns for each study. If 'combine', attempts to pool patterns
45+
#' across studies.
46+
#' @param datasources a list of \code{\link[DSI]{DSConnection-class}} objects obtained
47+
#' after login. If the \code{datasources} argument is not specified, the default set of
48+
#' connections will be used: see \code{\link[DSI]{datashield.connections_default}}.
49+
#' @return For type='split': A list with one element per study, each containing:
50+
#' \describe{
51+
#' \item{pattern}{The missing data pattern matrix for that study}
52+
#' \item{valid}{Logical indicating if all patterns meet disclosure requirements}
53+
#' \item{message}{A message describing the validity status}
54+
#' }
55+
#'
56+
#' For type='combine': A list containing:
57+
#' \describe{
58+
#' \item{pattern}{The pooled missing data pattern matrix across all studies}
59+
#' \item{valid}{Logical indicating if all pooled patterns meet disclosure requirements}
60+
#' \item{message}{A message describing the validity status}
61+
#' }
62+
#' @author Xavier Escribà montagut for DataSHIELD Development Team
63+
#' @export
64+
#' @examples
65+
#' \dontrun{
66+
#' ## Version 6, for version 5 see the Wiki
67+
#'
68+
#' # Connecting to the Opal servers
69+
#'
70+
#' require('DSI')
71+
#' require('DSOpal')
72+
#' require('dsBaseClient')
73+
#'
74+
#' builder <- DSI::newDSLoginBuilder()
75+
#' builder$append(server = "study1",
76+
#' url = "http://192.168.56.100:8080/",
77+
#' user = "administrator", password = "datashield_test&",
78+
#' table = "CNSIM.CNSIM1", driver = "OpalDriver")
79+
#' builder$append(server = "study2",
80+
#' url = "http://192.168.56.100:8080/",
81+
#' user = "administrator", password = "datashield_test&",
82+
#' table = "CNSIM.CNSIM2", driver = "OpalDriver")
83+
#' logindata <- builder$build()
84+
#'
85+
#' connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
86+
#'
87+
#' # Get missing data patterns for each study separately
88+
#' patterns_split <- ds.mdPattern(x = "D", type = "split", datasources = connections)
89+
#'
90+
#' # View results for study1
91+
#' print(patterns_split$study1$pattern)
92+
#' # var1 var2 var3
93+
#' # 150 1 1 1 0 <- 150 obs complete
94+
#' # 25 0 1 1 1 <- 25 obs missing var1
95+
#' # 25 0 0 25 <- Summary: 25 missing per variable
96+
#'
97+
#' # Get pooled missing data patterns across studies
98+
#' patterns_pooled <- ds.mdPattern(x = "D", type = "combine", datasources = connections)
99+
#' print(patterns_pooled$pattern)
100+
#'
101+
#' # Example with suppressed patterns:
102+
#' # If study1 has a pattern with count=2 (suppressed) and study2 has same pattern
103+
#' # with count=5 (valid), the pooled result will show count=5 (conservative approach)
104+
#' # A warning will indicate: "Pooled counts may underestimate the true total"
105+
#'
106+
#' # Clear the Datashield R sessions and logout
107+
#' datashield.logout(connections)
108+
#' }
109+
#'
110+
ds.mdPattern <- function(x = NULL, type = 'split', datasources = NULL){
111+
112+
# Look for DS connections
113+
if(is.null(datasources)){
114+
datasources <- datashield.connections_find()
115+
}
116+
117+
# Ensure datasources is a list of DSConnection-class
118+
if(!(is.list(datasources) && all(unlist(lapply(datasources, function(d) {methods::is(d,"DSConnection")}))))){
119+
stop("The 'datasources' were expected to be a list of DSConnection-class objects", call.=FALSE)
120+
}
121+
122+
if(is.null(x)){
123+
stop("Please provide the name of a data frame or matrix!", call.=FALSE)
124+
}
125+
126+
# Get study names
127+
study_names <- names(datasources)
128+
129+
# Call the server side function
130+
cally <- call("mdPatternDS", x)
131+
results <- DSI::datashield.aggregate(datasources, cally)
132+
133+
# Process results based on type
134+
if(type == "split"){
135+
# Return individual study results
136+
return(results)
137+
138+
} else if(type == "combine"){
139+
# Pool results across studies
140+
141+
# First check if any study has invalid patterns
142+
any_invalid <- any(sapply(results, function(r) !r$valid))
143+
invalid_studies <- names(results)[sapply(results, function(r) !r$valid)]
144+
145+
if(any_invalid){
146+
warning(
147+
"Disclosure control: Some studies have suppressed patterns (below threshold).\n",
148+
" Studies with suppressed patterns: ", paste(invalid_studies, collapse=", "), "\n",
149+
" These patterns are EXCLUDED from pooling to prevent disclosure.\n",
150+
" Pooled counts may underestimate the true total.",
151+
call. = FALSE
152+
)
153+
}
154+
155+
# Extract patterns from each study
156+
patterns_list <- lapply(results, function(r) r$pattern)
157+
158+
# Check if all patterns have the same variables (columns)
159+
n_vars <- sapply(patterns_list, ncol)
160+
if(length(unique(n_vars)) > 1){
161+
stop("Cannot pool patterns: studies have different numbers of variables", call.=FALSE)
162+
}
163+
164+
var_names <- colnames(patterns_list[[1]])
165+
if(length(patterns_list) > 1){
166+
for(i in 2:length(patterns_list)){
167+
if(!identical(colnames(patterns_list[[i]]), var_names)){
168+
warning("Variable names differ across studies. Pooling by position.")
169+
break
170+
}
171+
}
172+
}
173+
174+
# Pool the patterns
175+
pooled_pattern <- .pool_md_patterns(patterns_list, study_names)
176+
177+
# Check validity of pooled results
178+
# Get threshold from first study's results or use a default check
179+
nfilter.tab <- getOption("default.nfilter.tab")
180+
if(is.null(nfilter.tab)) nfilter.tab <- 3
181+
182+
n_patterns <- nrow(pooled_pattern) - 1
183+
pooled_valid <- TRUE
184+
185+
if(n_patterns > 0){
186+
# Pattern counts are in row names
187+
pattern_counts <- as.numeric(rownames(pooled_pattern)[1:n_patterns])
188+
pattern_counts <- pattern_counts[!is.na(pattern_counts) & pattern_counts > 0]
189+
190+
if(any(pattern_counts < nfilter.tab)){
191+
pooled_valid <- FALSE
192+
}
193+
}
194+
195+
pooled_message <- ifelse(pooled_valid,
196+
"Valid: all pooled pattern counts meet disclosure requirements",
197+
"Some pooled pattern counts may be below threshold")
198+
199+
return(list(
200+
pattern = pooled_pattern,
201+
valid = pooled_valid,
202+
message = pooled_message,
203+
studies = study_names
204+
))
205+
206+
} else {
207+
stop("Argument 'type' must be either 'split' or 'combine'", call.=FALSE)
208+
}
209+
}
210+
211+
#' @title Pool missing data patterns across studies
212+
#' @description Internal function to pool md.pattern results from multiple studies
213+
#' @param patterns_list List of pattern matrices from each study
214+
#' @param study_names Names of the studies
215+
#' @return Pooled pattern matrix
216+
#' @keywords internal
217+
.pool_md_patterns <- function(patterns_list, study_names){
218+
219+
# Initialize with first study's pattern structure
220+
pooled <- patterns_list[[1]]
221+
n_vars <- ncol(pooled)
222+
n_rows <- nrow(pooled) - 1 # Exclude summary row
223+
224+
# Create a list to store unique patterns
225+
unique_patterns <- list()
226+
pattern_counts <- list()
227+
228+
# Process each study
229+
for(i in seq_along(patterns_list)){
230+
pattern <- patterns_list[[i]]
231+
study_n_patterns <- nrow(pattern) - 1
232+
233+
if(study_n_patterns > 0){
234+
for(j in 1:study_n_patterns){
235+
# Get pattern (columns show 1/0 for observed/missing)
236+
pat_vector <- pattern[j, 1:(n_vars-1)]
237+
# Pattern count is in row name
238+
pat_count_str <- rownames(pattern)[j]
239+
pat_count <- suppressWarnings(as.numeric(pat_count_str))
240+
241+
# Skip if suppressed (non-numeric row name like "suppressed(<3)")
242+
if(is.na(pat_count)){
243+
next
244+
}
245+
246+
# Convert pattern to string for comparison
247+
pat_string <- paste(pat_vector, collapse="_")
248+
249+
# Check if this pattern already exists
250+
if(pat_string %in% names(unique_patterns)){
251+
# Add to existing count
252+
pattern_counts[[pat_string]] <- pattern_counts[[pat_string]] + pat_count
253+
} else {
254+
# New pattern
255+
unique_patterns[[pat_string]] <- pat_vector
256+
pattern_counts[[pat_string]] <- pat_count
257+
}
258+
}
259+
}
260+
}
261+
262+
# Build pooled pattern matrix
263+
if(length(unique_patterns) == 0){
264+
# No valid patterns
265+
pooled[1:n_rows, ] <- NA
266+
} else {
267+
# Sort patterns by count (descending)
268+
sorted_idx <- order(unlist(pattern_counts), decreasing = TRUE)
269+
sorted_patterns <- unique_patterns[sorted_idx]
270+
sorted_counts <- pattern_counts[sorted_idx]
271+
272+
# Create new pooled matrix
273+
n_pooled_patterns <- length(sorted_patterns)
274+
pooled <- matrix(NA, nrow = n_pooled_patterns + 1, ncol = n_vars)
275+
colnames(pooled) <- colnames(patterns_list[[1]])
276+
277+
# Set row names (counts for patterns, empty for summary)
278+
row_names <- c(as.character(unlist(sorted_counts)), "")
279+
rownames(pooled) <- row_names
280+
281+
# Fill in patterns
282+
for(i in 1:n_pooled_patterns){
283+
pooled[i, 1:(n_vars-1)] <- sorted_patterns[[i]]
284+
# Calculate number of missing for this pattern
285+
pooled[i, n_vars] <- sum(sorted_patterns[[i]] == 0)
286+
}
287+
}
288+
289+
# Calculate summary row (total missing per variable)
290+
# Sum across studies
291+
summary_row <- rep(0, n_vars)
292+
for(i in seq_along(patterns_list)){
293+
study_summary <- patterns_list[[i]][nrow(patterns_list[[i]]), ]
294+
# Only add if not suppressed
295+
if(!all(is.na(study_summary))){
296+
summary_row <- summary_row + ifelse(is.na(study_summary), 0, study_summary)
297+
}
298+
}
299+
300+
# Add summary row
301+
pooled[nrow(pooled), ] <- summary_row
302+
303+
return(pooled)
304+
}
305+

docker-compose_armadillo.yml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,7 @@ services:
33
hostname: armadillo
44
ports:
55
- 8080:8080
6-
image: datashield/armadillo_citest:5.9.4
6+
image: datashield/armadillo_citest:5.11.0
77
environment:
88
LOGGING_CONFIG: 'classpath:logback-file.xml'
99
AUDIT_LOG_PATH: '/app/logs/audit.log'
@@ -16,6 +16,7 @@ services:
1616

1717
default:
1818
hostname: default
19-
image: datashield/rock-omicron-karma-permissive:devel
19+
image: datashield/rock-quebrada-lamda:latest
20+
# image: datashield/rserver-panda-lamda:devel
2021
environment:
2122
DEBUG: "FALSE"

docker-compose_opal.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -20,6 +20,6 @@ services:
2020
- MONGO_INITDB_ROOT_USERNAME=root
2121
- MONGO_INITDB_ROOT_PASSWORD=foobar
2222
rock:
23-
image: datashield/rock-lemon-donkey-permissive:draft
23+
image: datashield/rock-quebrada-lamda-permissive:latest
2424
environment:
2525
DEBUG: "FALSE"

docs/404.html

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

docs/LICENSE.html

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

0 commit comments

Comments
 (0)