
Commit 9a97118

Collective changes updating 6.3.5-dev, 7.0-dev, and v7.0-dev-feat/performance, flowed through branches
2 parents: 10cc3af + 533789d

File tree

173 files changed: +1158 −393 lines


.Rbuildignore

Lines changed: 2 additions & 2 deletions
@@ -17,8 +17,8 @@
 ^R/secure.global.ranking.md$
 ^_pkgdown\.yml$
 ^docs$
-^dsBase_6.3.5.tar.gz$
-^dsBase_6.3.5-permissive.tar.gz$
+^dsBase_7.0-dev-feat_performance\.tar\.gz$
+^dsBase_7.0-dev-feat_performance-permissive\.tar\.gz$
 ^dsDanger_6.3.4.tar.gz$
 ^\.circleci$
 ^\.circleci/config\.yml$
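`.Rbuildignore` entries are regular expressions, which is why the replacement lines escape the dots while the removed lines did not: an unescaped `.` matches any single character, so the old patterns were looser than intended. A quick illustration (not part of the commit):

```r
# .Rbuildignore entries are regexes matched against file paths.
# An unescaped '.' matches ANY character, so the old pattern over-matches;
# the new escaped pattern matches only a literal dot.
loose  <- "^dsBase_6.3.5.tar.gz$"
strict <- "^dsBase_7\\.0-dev-feat_performance\\.tar\\.gz$"

grepl(loose, "dsBase_6X3X5XtarXgz")                       # TRUE: over-matches
grepl(strict, "dsBase_7.0-dev-feat_performance.tar.gz")   # TRUE: exact file name
grepl(strict, "dsBase_7X0-dev-feat_performanceXtarXgz")   # FALSE: dots are literal
```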

DESCRIPTION

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 Package: dsBaseClient
 Title: 'DataSHIELD' Client Side Base Functions
-Version: 6.3.5
+Version: 7.0.0.9000
 Description: Base 'DataSHIELD' functions for the client side. 'DataSHIELD' is a software package which allows
     you to do non-disclosive federated analysis on sensitive data. 'DataSHIELD' analytic functions have
     been designed to only share non disclosive summary statistics, with built in automated output
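The new version string follows the common R convention of a large fourth component (".9000") marking an in-development build. Base R's `numeric_version()` compares versions component-wise, so the dev build sorts after the 7.0.0 release and before the next patch:

```r
# "7.0.0.9000" is the usual in-development version convention for R packages;
# numeric_version() compares component-wise.
old <- numeric_version("6.3.5")
dev <- numeric_version("7.0.0.9000")

dev > old                        # TRUE
dev > numeric_version("7.0.0")   # TRUE: dev build sorts after the release
dev < numeric_version("7.0.1")   # TRUE: and before the next patch release
```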

NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ export(ds.matrixDimnames)
 export(ds.matrixInvert)
 export(ds.matrixMult)
 export(ds.matrixTranspose)
+export(ds.mdPattern)
 export(ds.mean)
 export(ds.meanByClass)
 export(ds.meanSdGp)

R/ds.colnames.R

Lines changed: 2 additions & 2 deletions
@@ -6,9 +6,9 @@
 #'
 #' Server function called: \code{colnamesDS}
 #' @param x a character string providing the name of the input data frame or matrix.
-#' @param datasources a list of \code{\link{DSConnection-class}} objects obtained after login.
+#' @param datasources a list of \code{\link[DSI]{DSConnection-class}} objects obtained after login.
 #' If the \code{datasources} argument is not specified
-#' the default set of connections will be used: see \code{\link{datashield.connections_default}}.
+#' the default set of connections will be used: see \code{\link[DSI]{datashield.connections_default}}.
 #' @return \code{ds.colnames} returns the column names of
 #' the specified server-side data frame or matrix.
 #' @author DataSHIELD Development Team

R/ds.mdPattern.R

Lines changed: 305 additions & 0 deletions
@@ -0,0 +1,305 @@ (new file)
#'
#' @title Display missing data patterns with disclosure control
#' @description This function is a client-side wrapper for the server-side mdPatternDS
#' function. It generates a missing data pattern matrix similar to mice::md.pattern but
#' with disclosure control applied to prevent revealing small cell counts.
#' @details The function calls the server-side mdPatternDS function which uses
#' mice::md.pattern to analyze missing data patterns. Patterns with counts below the
#' disclosure threshold (default: nfilter.tab = 3) are suppressed to maintain privacy.
#'
#' \strong{Output Format:}
#' - Each row represents a missing data pattern
#' - Pattern counts are shown in row names (e.g., "150", "25")
#' - Columns show 1 if the variable is observed, 0 if missing
#' - Last column shows the total number of missing values per pattern
#' - Last row shows the total number of missing values per variable
#'
#' \strong{Disclosure Control:}
#'
#' Suppressed patterns (count below threshold) are indicated by:
#' - Row name: "suppressed(<N>)" where N is the threshold
#' - All pattern values set to NA
#' - Summary row also suppressed to prevent back-calculation
#'
#' \strong{Pooling Behavior (type='combine'):}
#'
#' When pooling across studies, the function uses a \emph{conservative approach}
#' for disclosure control:
#'
#' 1. Identifies identical missing patterns across studies
#' 2. \strong{EXCLUDES suppressed patterns from pooling} - patterns suppressed in
#'    ANY study are not included in the pooled count
#' 3. Sums counts only for non-suppressed identical patterns
#' 4. Re-validates pooled counts against the disclosure threshold
#'
#' \strong{Important:} This conservative approach means:
#' - Pooled counts may be \emph{underestimates} if some studies had suppressed patterns
#' - This prevents disclosure through subtraction (e.g., if study A shows count=5
#'   and the pool shows count=7, one could deduce study B has count=2, violating disclosure)
#' - Different patterns across studies are preserved separately in the pooled result
#'
#' @param x a character string specifying the name of a data frame or matrix on the
#' server-side containing the data to analyze.
#' @param type a character string specifying the output type. If 'split' (default),
#' returns separate patterns for each study. If 'combine', attempts to pool patterns
#' across studies.
#' @param datasources a list of \code{\link[DSI]{DSConnection-class}} objects obtained
#' after login. If the \code{datasources} argument is not specified, the default set of
#' connections will be used: see \code{\link[DSI]{datashield.connections_default}}.
#' @return For type='split': A list with one element per study, each containing:
#' \describe{
#'   \item{pattern}{The missing data pattern matrix for that study}
#'   \item{valid}{Logical indicating if all patterns meet disclosure requirements}
#'   \item{message}{A message describing the validity status}
#' }
#'
#' For type='combine': A list containing:
#' \describe{
#'   \item{pattern}{The pooled missing data pattern matrix across all studies}
#'   \item{valid}{Logical indicating if all pooled patterns meet disclosure requirements}
#'   \item{message}{A message describing the validity status}
#' }
#' @author Xavier Escribà montagut for DataSHIELD Development Team
#' @export
#' @examples
#' \dontrun{
#'   ## Version 6, for version 5 see the Wiki
#'
#'   # Connecting to the Opal servers
#'
#'   require('DSI')
#'   require('DSOpal')
#'   require('dsBaseClient')
#'
#'   builder <- DSI::newDSLoginBuilder()
#'   builder$append(server = "study1",
#'                  url = "http://192.168.56.100:8080/",
#'                  user = "administrator", password = "datashield_test&",
#'                  table = "CNSIM.CNSIM1", driver = "OpalDriver")
#'   builder$append(server = "study2",
#'                  url = "http://192.168.56.100:8080/",
#'                  user = "administrator", password = "datashield_test&",
#'                  table = "CNSIM.CNSIM2", driver = "OpalDriver")
#'   logindata <- builder$build()
#'
#'   connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
#'
#'   # Get missing data patterns for each study separately
#'   patterns_split <- ds.mdPattern(x = "D", type = "split", datasources = connections)
#'
#'   # View results for study1
#'   print(patterns_split$study1$pattern)
#'   #     var1 var2 var3
#'   # 150    1    1    1  0   <- 150 obs complete
#'   # 25     0    1    1  1   <- 25 obs missing var1
#'   #       25    0    0 25   <- Summary: 25 missing per variable
#'
#'   # Get pooled missing data patterns across studies
#'   patterns_pooled <- ds.mdPattern(x = "D", type = "combine", datasources = connections)
#'   print(patterns_pooled$pattern)
#'
#'   # Example with suppressed patterns:
#'   # If study1 has a pattern with count=2 (suppressed) and study2 has the same pattern
#'   # with count=5 (valid), the pooled result will show count=5 (conservative approach).
#'   # A warning will indicate: "Pooled counts may underestimate the true total"
#'
#'   # Clear the DataSHIELD R sessions and logout
#'   datashield.logout(connections)
#' }
#'
ds.mdPattern <- function(x = NULL, type = 'split', datasources = NULL){

  # Look for DS connections
  if(is.null(datasources)){
    datasources <- datashield.connections_find()
  }

  # Ensure datasources is a list of DSConnection-class
  if(!(is.list(datasources) && all(unlist(lapply(datasources, function(d) {methods::is(d, "DSConnection")}))))){
    stop("The 'datasources' were expected to be a list of DSConnection-class objects", call.=FALSE)
  }

  if(is.null(x)){
    stop("Please provide the name of a data frame or matrix!", call.=FALSE)
  }

  # Get study names
  study_names <- names(datasources)

  # Call the server side function
  cally <- call("mdPatternDS", x)
  results <- DSI::datashield.aggregate(datasources, cally)

  # Process results based on type
  if(type == "split"){
    # Return individual study results
    return(results)

  } else if(type == "combine"){
    # Pool results across studies

    # First check if any study has invalid patterns
    any_invalid <- any(sapply(results, function(r) !r$valid))
    invalid_studies <- names(results)[sapply(results, function(r) !r$valid)]

    if(any_invalid){
      warning(
        "Disclosure control: Some studies have suppressed patterns (below threshold).\n",
        "  Studies with suppressed patterns: ", paste(invalid_studies, collapse=", "), "\n",
        "  These patterns are EXCLUDED from pooling to prevent disclosure.\n",
        "  Pooled counts may underestimate the true total.",
        call. = FALSE
      )
    }

    # Extract patterns from each study
    patterns_list <- lapply(results, function(r) r$pattern)

    # Check if all patterns have the same variables (columns)
    n_vars <- sapply(patterns_list, ncol)
    if(length(unique(n_vars)) > 1){
      stop("Cannot pool patterns: studies have different numbers of variables", call.=FALSE)
    }

    var_names <- colnames(patterns_list[[1]])
    if(length(patterns_list) > 1){
      for(i in 2:length(patterns_list)){
        if(!identical(colnames(patterns_list[[i]]), var_names)){
          warning("Variable names differ across studies. Pooling by position.")
          break
        }
      }
    }

    # Pool the patterns
    pooled_pattern <- .pool_md_patterns(patterns_list, study_names)

    # Check validity of pooled results
    # Get threshold from first study's results or use a default check
    nfilter.tab <- getOption("default.nfilter.tab")
    if(is.null(nfilter.tab)) nfilter.tab <- 3

    n_patterns <- nrow(pooled_pattern) - 1
    pooled_valid <- TRUE

    if(n_patterns > 0){
      # Pattern counts are in row names
      pattern_counts <- as.numeric(rownames(pooled_pattern)[1:n_patterns])
      pattern_counts <- pattern_counts[!is.na(pattern_counts) & pattern_counts > 0]

      if(any(pattern_counts < nfilter.tab)){
        pooled_valid <- FALSE
      }
    }

    pooled_message <- ifelse(pooled_valid,
                             "Valid: all pooled pattern counts meet disclosure requirements",
                             "Some pooled pattern counts may be below threshold")

    return(list(
      pattern = pooled_pattern,
      valid = pooled_valid,
      message = pooled_message,
      studies = study_names
    ))

  } else {
    stop("Argument 'type' must be either 'split' or 'combine'", call.=FALSE)
  }
}

#' @title Pool missing data patterns across studies
#' @description Internal function to pool md.pattern results from multiple studies
#' @param patterns_list List of pattern matrices from each study
#' @param study_names Names of the studies
#' @return Pooled pattern matrix
#' @keywords internal
.pool_md_patterns <- function(patterns_list, study_names){

  # Initialize with first study's pattern structure
  pooled <- patterns_list[[1]]
  n_vars <- ncol(pooled)
  n_rows <- nrow(pooled) - 1  # Exclude summary row

  # Create a list to store unique patterns
  unique_patterns <- list()
  pattern_counts <- list()

  # Process each study
  for(i in seq_along(patterns_list)){
    pattern <- patterns_list[[i]]
    study_n_patterns <- nrow(pattern) - 1

    if(study_n_patterns > 0){
      for(j in 1:study_n_patterns){
        # Get pattern (columns show 1/0 for observed/missing)
        pat_vector <- pattern[j, 1:(n_vars-1)]
        # Pattern count is in row name
        pat_count_str <- rownames(pattern)[j]
        pat_count <- suppressWarnings(as.numeric(pat_count_str))

        # Skip if suppressed (non-numeric row name like "suppressed(<3)")
        if(is.na(pat_count)){
          next
        }

        # Convert pattern to string for comparison
        pat_string <- paste(pat_vector, collapse="_")

        # Check if this pattern already exists
        if(pat_string %in% names(unique_patterns)){
          # Add to existing count
          pattern_counts[[pat_string]] <- pattern_counts[[pat_string]] + pat_count
        } else {
          # New pattern
          unique_patterns[[pat_string]] <- pat_vector
          pattern_counts[[pat_string]] <- pat_count
        }
      }
    }
  }

  # Build pooled pattern matrix
  if(length(unique_patterns) == 0){
    # No valid patterns
    pooled[1:n_rows, ] <- NA
  } else {
    # Sort patterns by count (descending)
    sorted_idx <- order(unlist(pattern_counts), decreasing = TRUE)
    sorted_patterns <- unique_patterns[sorted_idx]
    sorted_counts <- pattern_counts[sorted_idx]

    # Create new pooled matrix
    n_pooled_patterns <- length(sorted_patterns)
    pooled <- matrix(NA, nrow = n_pooled_patterns + 1, ncol = n_vars)
    colnames(pooled) <- colnames(patterns_list[[1]])

    # Set row names (counts for patterns, empty for summary)
    row_names <- c(as.character(unlist(sorted_counts)), "")
    rownames(pooled) <- row_names

    # Fill in patterns
    for(i in 1:n_pooled_patterns){
      pooled[i, 1:(n_vars-1)] <- sorted_patterns[[i]]
      # Calculate number of missing for this pattern
      pooled[i, n_vars] <- sum(sorted_patterns[[i]] == 0)
    }
  }

  # Calculate summary row (total missing per variable)
  # Sum across studies
  summary_row <- rep(0, n_vars)
  for(i in seq_along(patterns_list)){
    study_summary <- patterns_list[[i]][nrow(patterns_list[[i]]), ]
    # Only add if not suppressed
    if(!all(is.na(study_summary))){
      summary_row <- summary_row + ifelse(is.na(study_summary), 0, study_summary)
    }
  }

  # Add summary row
  pooled[nrow(pooled), ] <- summary_row

  return(pooled)
}
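The conservative pooling rule above can be seen in isolation with a standalone sketch (toy matrices, not real server output; the summary row is omitted for brevity). As in `.pool_md_patterns`, counts live in the row names, and a row whose name is not a plain count, such as "suppressed(<3)", yields NA from `as.numeric()` and is skipped, so a pattern suppressed in one study contributes nothing to the pooled count:

```r
# Minimal sketch of the conservative pooling rule (hypothetical toy data).
pool_counts <- function(mats) {
  out <- list()
  for (m in mats) {
    for (j in seq_len(nrow(m))) {
      n <- suppressWarnings(as.numeric(rownames(m)[j]))
      if (is.na(n)) next  # suppressed pattern: excluded from pooling
      key <- paste(m[j, -ncol(m)], collapse = "_")  # drop the nmis column
      if (is.null(out[[key]])) out[[key]] <- 0
      out[[key]] <- out[[key]] + n
    }
  }
  out
}

study1 <- matrix(c(1, 1, 0,
                   0, 1, 1), nrow = 2, byrow = TRUE,
                 dimnames = list(c("150", "suppressed(<3)"), c("v1", "v2", "nmis")))
study2 <- matrix(c(1, 1, 0,
                   0, 1, 1), nrow = 2, byrow = TRUE,
                 dimnames = list(c("40", "5"), c("v1", "v2", "nmis")))

pool_counts(list(study1, study2))
# $`1_1` = 190 (150 + 40); $`0_1` = 5 only: study1's suppressed row is excluded
```

This is why the roxygen notes call the pooled counts potential underestimates: the excluded suppressed count is never added back, which in turn blocks recovery of a small cell by subtraction.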

armadillo_azure-pipelines.yml

Lines changed: 3 additions & 3 deletions
@@ -58,10 +58,10 @@ schedules:
       - master
     always: true
   - cron: "0 2 * * *"
-    displayName: Nightly build - v6.3.5-dev
+    displayName: Nightly build - v7.0-dev-feat/performance
     branches:
       include:
-        - v6.3.5-dev
+        - v7.0-dev-feat/performance
     always: true

 #########################################################################################
@@ -235,7 +235,7 @@ jobs:

       curl -u admin:admin -X GET http://localhost:8080/packages

-      curl -u admin:admin --max-time 300 -v -H 'Content-Type: multipart/form-data' -F "file=@dsBase_6.3.5-permissive.tar.gz" -X POST http://localhost:8080/install-package
+      curl -u admin:admin --max-time 300 -v -H 'Content-Type: multipart/form-data' -F "file=@dsBase_7.0-dev-feat_performance-permissive.tar.gz" -X POST http://localhost:8080/install-package
       sleep 60

       docker container restart dsbaseclient_armadillo_1

azure-pipelines.yml

Lines changed: 3 additions & 3 deletions
@@ -44,10 +44,10 @@ schedules:
       - master
     always: true
   - cron: "0 2 * * *"
-    displayName: Nightly build - v6.3.5-dev
+    displayName: Nightly build - v7.0-dev-feat/performance
     branches:
       include:
-        - v6.3.5-dev
+        - v7.0-dev-feat/performance
     always: true

 #########################################################################################
@@ -216,7 +216,7 @@ jobs:
   - bash: |
       R -q -e "library(opalr); opal <- opal.login(username = 'administrator', password = 'datashield_test&', url = 'https://localhost:8443', opts = list(ssl_verifyhost=0, ssl_verifypeer=0)); opal.put(opal, 'system', 'conf', 'general', '_rPackage'); opal.logout(o)"

-      R -q -e "library(opalr); opal <- opal.login('administrator','datashield_test&', url='https://localhost:8443/', opts = list(ssl_verifyhost=0, ssl_verifypeer=0)); dsadmin.install_github_package(opal, 'dsBase', username = 'datashield', ref = 'v6.3.5-dev'); opal.logout(opal)"
+      R -q -e "library(opalr); opal <- opal.login('administrator','datashield_test&', url='https://localhost:8443/', opts = list(ssl_verifyhost=0, ssl_verifypeer=0)); dsadmin.install_github_package(opal, 'dsBase', username = 'datashield', ref = '7.0-dev-feat_performance'); opal.logout(opal)"

       sleep 60
