Merged
35 commits
3f39334
feat: ds.mdPattern fct
ESCRI11 Oct 28, 2025
e2c190d
Documentation update
StuartWheater Oct 30, 2025
528f8aa
Merge pull request #623 from StuartWheater/v6.3.5-dev
StuartWheater Oct 30, 2025
43d90e4
Changes required for dsBaseClient to be submitted to CRAN
StuartWheater Oct 31, 2025
42a02ca
Merge pull request #621 from ESCRI11/dev-task-14
StuartWheater Nov 2, 2025
80b69f4
Merge branch 'datashield:v6.3.5-dev' into v6.3.5-dev
StuartWheater Nov 3, 2025
afee361
Updated perf profile
StuartWheater Nov 3, 2025
5311eb0
Merge branch 'v7.0-dev-feat/performance' of github.com:StuartWheater/…
StuartWheater Nov 3, 2025
73b6d78
Merge pull request #625 from StuartWheater/v6.3.5-dev
StuartWheater Nov 7, 2025
19f82d0
Update of 'rock' & 'rsever' images
StuartWheater Nov 13, 2025
d192759
Updated packages
StuartWheater Nov 13, 2025
f17fd8a
Fixed type
StuartWheater Nov 13, 2025
20c12de
Updated for new responses for 'foobar'
StuartWheater Nov 13, 2025
d33b4c4
Fixed quotes
StuartWheater Nov 14, 2025
366e8c5
Fix to 'ds.mdPattern' document and documentation regeneration
StuartWheater Nov 14, 2025
9694b00
Removed options call setting datashield.errors.print to TRUE, for the…
StuartWheater Nov 14, 2025
3a76f87
Fixes for 'ds.colnames'
StuartWheater Nov 14, 2025
4202a41
Merge pull request #626 from StuartWheater/v6.3.5-dev
StuartWheater Nov 14, 2025
020b0f5
Merge pull request #629 from datashield/v6.3.5-dev
StuartWheater Nov 14, 2025
877657e
Merge branch 'v7.0-dev' of github.com:StuartWheater/dsBaseClient into…
StuartWheater Nov 14, 2025
1fd30e1
Update documents
StuartWheater Nov 14, 2025
35d91ff
Merge branch 'datashield:v7.0-dev-feat/performance' into v7.0-dev-fea…
StuartWheater Nov 16, 2025
08b7b8c
Merge pull request #630 from StuartWheater/v7.0-dev
StuartWheater Nov 16, 2025
01b7db9
Update to align 'v7.0-dev-feat/performance' with 'v7.0-dev'
StuartWheater Nov 16, 2025
4608bd7
Comment out 'ds.ranksSecure'
StuartWheater Nov 16, 2025
9d07ed3
Rework 'smk_expt-ds.ranksSecure'
StuartWheater Nov 16, 2025
2555477
Fix typos
StuartWheater Nov 16, 2025
5223c6c
Updated 'ds.colnames' manual
StuartWheater Nov 17, 2025
bcd763b
Reaction to changes to packages
StuartWheater Nov 17, 2025
1dace04
Updated perf profiles
StuartWheater Nov 17, 2025
5a246fb
Updated messages expected
StuartWheater Nov 17, 2025
4b7a1b2
Update pipelines
StuartWheater Nov 17, 2025
4f98392
Update 'tar's
StuartWheater Nov 17, 2025
202f441
Fixed escaping typo
StuartWheater Nov 17, 2025
533789d
Fixed typo in '.Rbuildignore'
StuartWheater Nov 17, 2025
4 changes: 2 additions & 2 deletions .Rbuildignore
@@ -17,8 +17,8 @@
^R/secure.global.ranking.md$
^_pkgdown\.yml$
^docs$
-^dsBase_6.3.5.tar.gz$
-^dsBase_6.3.5-permissive.tar.gz$
+^dsBase_7.0-dev-feat_performance\.tar\.gz$
+^dsBase_7.0-dev-feat_performance-permissive\.tar\.gz$
^dsDanger_6.3.4.tar.gz$
^\.circleci$
^\.circleci/config\.yml$
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: dsBaseClient
Title: 'DataSHIELD' Client Side Base Functions
-Version: 6.3.5
+Version: 7.0.0.9000
Description: Base 'DataSHIELD' functions for the client side. 'DataSHIELD' is a software package which allows
you to do non-disclosive federated analysis on sensitive data. 'DataSHIELD' analytic functions have
been designed to only share non disclosive summary statistics, with built in automated output
1 change: 1 addition & 0 deletions NAMESPACE
@@ -73,6 +73,7 @@ export(ds.matrixDimnames)
export(ds.matrixInvert)
export(ds.matrixMult)
export(ds.matrixTranspose)
+export(ds.mdPattern)
export(ds.mean)
export(ds.meanByClass)
export(ds.meanSdGp)
4 changes: 2 additions & 2 deletions R/ds.colnames.R
@@ -6,9 +6,9 @@
#'
#' Server function called: \code{colnamesDS}
#' @param x a character string providing the name of the input data frame or matrix.
-#' @param datasources a list of \code{\link{DSConnection-class}} objects obtained after login.
+#' @param datasources a list of \code{\link[DSI]{DSConnection-class}} objects obtained after login.
 #' If the \code{datasources} argument is not specified
-#' the default set of connections will be used: see \code{\link{datashield.connections_default}}.
+#' the default set of connections will be used: see \code{\link[DSI]{datashield.connections_default}}.
#' @return \code{ds.colnames} returns the column names of
#' the specified server-side data frame or matrix.
#' @author DataSHIELD Development Team
305 changes: 305 additions & 0 deletions R/ds.mdPattern.R
@@ -0,0 +1,305 @@
#'
#' @title Display missing data patterns with disclosure control
#' @description This function is a client-side wrapper for the server-side mdPatternDS
#' function. It generates a missing data pattern matrix similar to mice::md.pattern but
#' with disclosure control applied to prevent revealing small cell counts.
#' @details The function calls the server-side mdPatternDS function which uses
#' mice::md.pattern to analyze missing data patterns. Patterns with counts below the
#' disclosure threshold (default: nfilter.tab = 3) are suppressed to maintain privacy.
#'
#' \strong{Output Format:}
#' - Each row represents a missing data pattern
#' - Pattern counts are shown in row names (e.g., "150", "25")
#' - Columns show 1 if the variable is observed, 0 if missing
#' - Last column shows the total number of missing values per pattern
#' - Last row shows the total number of missing values per variable
#'
#' \strong{Disclosure Control:}
#'
#' Suppressed patterns (count below threshold) are indicated by:
#' - Row name: "suppressed(<N>)" where N is the threshold
#' - All pattern values set to NA
#' - Summary row also suppressed to prevent back-calculation
#'
#' \strong{Pooling Behavior (type='combine'):}
#'
#' When pooling across studies, the function uses a \emph{conservative approach}
#' for disclosure control:
#'
#' 1. Identifies identical missing patterns across studies
#' 2. \strong{EXCLUDES suppressed patterns from pooling} - patterns suppressed in
#' ANY study are not included in the pooled count
#' 3. Sums counts only for non-suppressed identical patterns
#' 4. Re-validates pooled counts against disclosure threshold
#'
#' \strong{Important:} This conservative approach means:
#' - Pooled counts may be \emph{underestimates} if some studies had suppressed patterns
#' - This prevents disclosure through subtraction (e.g., if study A shows count=5
#' and pool shows count=7, one could deduce study B has count=2, violating disclosure)
#' - Different patterns across studies are preserved separately in the pooled result
#'
#' @param x a character string specifying the name of a data frame or matrix on the
#' server-side containing the data to analyze.
#' @param type a character string specifying the output type. If 'split' (default),
#' returns separate patterns for each study. If 'combine', attempts to pool patterns
#' across studies.
#' @param datasources a list of \code{\link[DSI]{DSConnection-class}} objects obtained
#' after login. If the \code{datasources} argument is not specified, the default set of
#' connections will be used: see \code{\link[DSI]{datashield.connections_default}}.
#' @return For type='split': A list with one element per study, each containing:
#' \describe{
#' \item{pattern}{The missing data pattern matrix for that study}
#' \item{valid}{Logical indicating if all patterns meet disclosure requirements}
#' \item{message}{A message describing the validity status}
#' }
#'
#' For type='combine': A list containing:
#' \describe{
#' \item{pattern}{The pooled missing data pattern matrix across all studies}
#' \item{valid}{Logical indicating if all pooled patterns meet disclosure requirements}
#' \item{message}{A message describing the validity status}
#' }
#' @author Xavier Escribà Montagut for DataSHIELD Development Team
#' @export
#' @examples
#' \dontrun{
#' ## Version 6, for version 5 see the Wiki
#'
#' # Connecting to the Opal servers
#'
#' require('DSI')
#' require('DSOpal')
#' require('dsBaseClient')
#'
#' builder <- DSI::newDSLoginBuilder()
#' builder$append(server = "study1",
#' url = "http://192.168.56.100:8080/",
#' user = "administrator", password = "datashield_test&",
#' table = "CNSIM.CNSIM1", driver = "OpalDriver")
#' builder$append(server = "study2",
#' url = "http://192.168.56.100:8080/",
#' user = "administrator", password = "datashield_test&",
#' table = "CNSIM.CNSIM2", driver = "OpalDriver")
#' logindata <- builder$build()
#'
#' connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")
#'
#' # Get missing data patterns for each study separately
#' patterns_split <- ds.mdPattern(x = "D", type = "split", datasources = connections)
#'
#' # View results for study1
#' print(patterns_split$study1$pattern)
#' # var1 var2 var3
#' # 150 1 1 1 0 <- 150 obs complete
#' # 25 0 1 1 1 <- 25 obs missing var1
#' # 25 0 0 25 <- Summary: 25 missing per variable
#'
#' # Get pooled missing data patterns across studies
#' patterns_pooled <- ds.mdPattern(x = "D", type = "combine", datasources = connections)
#' print(patterns_pooled$pattern)
#'
#' # Example with suppressed patterns:
#' # If study1 has a pattern with count=2 (suppressed) and study2 has same pattern
#' # with count=5 (valid), the pooled result will show count=5 (conservative approach)
#' # A warning will indicate: "Pooled counts may underestimate the true total"
#'
#' # Clear the Datashield R sessions and logout
#' datashield.logout(connections)
#' }
#'
ds.mdPattern <- function(x = NULL, type = 'split', datasources = NULL){

# Look for DS connections
if(is.null(datasources)){
datasources <- datashield.connections_find()
}

# Ensure datasources is a list of DSConnection-class
if(!(is.list(datasources) && all(unlist(lapply(datasources, function(d) {methods::is(d,"DSConnection")}))))){
stop("The 'datasources' were expected to be a list of DSConnection-class objects", call.=FALSE)
}

if(is.null(x)){
stop("Please provide the name of a data frame or matrix!", call.=FALSE)
}

# Get study names
study_names <- names(datasources)

# Call the server side function
cally <- call("mdPatternDS", x)
results <- DSI::datashield.aggregate(datasources, cally)

# Process results based on type
if(type == "split"){
# Return individual study results
return(results)

} else if(type == "combine"){
# Pool results across studies

# First check if any study has invalid patterns
any_invalid <- any(sapply(results, function(r) !r$valid))
invalid_studies <- names(results)[sapply(results, function(r) !r$valid)]

if(any_invalid){
warning(
"Disclosure control: Some studies have suppressed patterns (below threshold).\n",
" Studies with suppressed patterns: ", paste(invalid_studies, collapse=", "), "\n",
" These patterns are EXCLUDED from pooling to prevent disclosure.\n",
" Pooled counts may underestimate the true total.",
call. = FALSE
)
}

# Extract patterns from each study
patterns_list <- lapply(results, function(r) r$pattern)

# Check if all patterns have the same variables (columns)
n_vars <- sapply(patterns_list, ncol)
if(length(unique(n_vars)) > 1){
stop("Cannot pool patterns: studies have different numbers of variables", call.=FALSE)
}

var_names <- colnames(patterns_list[[1]])
if(length(patterns_list) > 1){
for(i in 2:length(patterns_list)){
if(!identical(colnames(patterns_list[[i]]), var_names)){
warning("Variable names differ across studies. Pooling by position.")
break
}
}
}

# Pool the patterns
pooled_pattern <- .pool_md_patterns(patterns_list, study_names)

# Check validity of pooled results
# Get threshold from first study's results or use a default check
nfilter.tab <- getOption("default.nfilter.tab")
if(is.null(nfilter.tab)) nfilter.tab <- 3

n_patterns <- nrow(pooled_pattern) - 1
pooled_valid <- TRUE

if(n_patterns > 0){
# Pattern counts are in row names
pattern_counts <- as.numeric(rownames(pooled_pattern)[1:n_patterns])
pattern_counts <- pattern_counts[!is.na(pattern_counts) & pattern_counts > 0]

if(any(pattern_counts < nfilter.tab)){
pooled_valid <- FALSE
}
}

pooled_message <- ifelse(pooled_valid,
"Valid: all pooled pattern counts meet disclosure requirements",
"Some pooled pattern counts may be below threshold")

return(list(
pattern = pooled_pattern,
valid = pooled_valid,
message = pooled_message,
studies = study_names
))

} else {
stop("Argument 'type' must be either 'split' or 'combine'", call.=FALSE)
}
}

#' @title Pool missing data patterns across studies
#' @description Internal function to pool md.pattern results from multiple studies
#' @param patterns_list List of pattern matrices from each study
#' @param study_names Names of the studies
#' @return Pooled pattern matrix
#' @keywords internal
.pool_md_patterns <- function(patterns_list, study_names){

# Initialize with first study's pattern structure
pooled <- patterns_list[[1]]
n_vars <- ncol(pooled)
n_rows <- nrow(pooled) - 1 # Exclude summary row

# Create a list to store unique patterns
unique_patterns <- list()
pattern_counts <- list()

# Process each study
for(i in seq_along(patterns_list)){
pattern <- patterns_list[[i]]
study_n_patterns <- nrow(pattern) - 1

if(study_n_patterns > 0){
for(j in 1:study_n_patterns){
# Get pattern (columns show 1/0 for observed/missing)
pat_vector <- pattern[j, 1:(n_vars-1)]
# Pattern count is in row name
pat_count_str <- rownames(pattern)[j]
pat_count <- suppressWarnings(as.numeric(pat_count_str))

# Skip if suppressed (non-numeric row name like "suppressed(<3)")
if(is.na(pat_count)){
next
}

# Convert pattern to string for comparison
pat_string <- paste(pat_vector, collapse="_")

# Check if this pattern already exists
if(pat_string %in% names(unique_patterns)){
# Add to existing count
pattern_counts[[pat_string]] <- pattern_counts[[pat_string]] + pat_count
} else {
# New pattern
unique_patterns[[pat_string]] <- pat_vector
pattern_counts[[pat_string]] <- pat_count
}
}
}
}

# Build pooled pattern matrix
if(length(unique_patterns) == 0){
# No valid patterns
pooled[1:n_rows, ] <- NA
} else {
# Sort patterns by count (descending)
sorted_idx <- order(unlist(pattern_counts), decreasing = TRUE)
sorted_patterns <- unique_patterns[sorted_idx]
sorted_counts <- pattern_counts[sorted_idx]

# Create new pooled matrix
n_pooled_patterns <- length(sorted_patterns)
pooled <- matrix(NA, nrow = n_pooled_patterns + 1, ncol = n_vars)
colnames(pooled) <- colnames(patterns_list[[1]])

# Set row names (counts for patterns, empty for summary)
row_names <- c(as.character(unlist(sorted_counts)), "")
rownames(pooled) <- row_names

# Fill in patterns
for(i in 1:n_pooled_patterns){
pooled[i, 1:(n_vars-1)] <- sorted_patterns[[i]]
# Calculate number of missing for this pattern
pooled[i, n_vars] <- sum(sorted_patterns[[i]] == 0)
}
}

# Calculate summary row (total missing per variable)
# Sum across studies
summary_row <- rep(0, n_vars)
for(i in seq_along(patterns_list)){
study_summary <- patterns_list[[i]][nrow(patterns_list[[i]]), ]
# Only add if not suppressed
if(!all(is.na(study_summary))){
summary_row <- summary_row + ifelse(is.na(study_summary), 0, study_summary)
}
}

# Add summary row
pooled[nrow(pooled), ] <- summary_row

return(pooled)
}
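The conservative pooling rule described in the roxygen details above (match identical patterns across studies, exclude any pattern suppressed in a study, sum the remaining counts, then re-check the pooled totals against the threshold) can be sketched language-neutrally as follows. This is an illustrative sketch only: `pool_md_patterns`, the dict-based study representation, and the default threshold of 3 are assumptions for the example, not part of the dsBaseClient API.

```python
def pool_md_patterns(studies, threshold=3):
    """Conservatively pool missing-data pattern counts across studies.

    Each study is a dict mapping a pattern tuple (1 = observed,
    0 = missing) to its count, or to None when the server-side
    disclosure check suppressed that count.
    """
    pooled = {}
    any_suppressed = False
    for study in studies:
        for pattern, count in study.items():
            if count is None:          # suppressed in this study:
                any_suppressed = True  # exclude it from pooling entirely
                continue
            pooled[pattern] = pooled.get(pattern, 0) + count
    # Re-validate the pooled counts against the same threshold
    valid = all(count >= threshold for count in pooled.values())
    return pooled, valid, any_suppressed

# Study A suppressed the (1, 0) pattern; study B reports it as 5.
# The pooled count stays 5 - an underestimate, which prevents
# disclosure by subtraction (pool minus study B would reveal study A).
study_a = {(1, 1): 150, (1, 0): None}
study_b = {(1, 1): 120, (1, 0): 5}
pooled, valid, suppressed = pool_md_patterns([study_a, study_b])
print(pooled, valid, suppressed)  # {(1, 1): 270, (1, 0): 5} True True
```

Note the trade-off this sketch makes explicit: excluding suppressed patterns biases pooled counts downward, but summing them in (or reporting the gap) would let a reader reconstruct a suppressed small cell by subtraction.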

6 changes: 3 additions & 3 deletions armadillo_azure-pipelines.yml
@@ -58,10 +58,10 @@ schedules:
- master
always: true
- cron: "0 2 * * *"
-displayName: Nightly build - v6.3.5-dev
+displayName: Nightly build - v7.0-dev-feat/performance
branches:
include:
-- v6.3.5-dev
+- v7.0-dev-feat/performance
always: true

#########################################################################################
@@ -235,7 +235,7 @@ jobs:

curl -u admin:admin -X GET http://localhost:8080/packages

-curl -u admin:admin --max-time 300 -v -H 'Content-Type: multipart/form-data' -F "file=@dsBase_6.3.5-permissive.tar.gz" -X POST http://localhost:8080/install-package
+curl -u admin:admin --max-time 300 -v -H 'Content-Type: multipart/form-data' -F "file=@dsBase_7.0-dev-feat_performance-permissive.tar.gz" -X POST http://localhost:8080/install-package
sleep 60

docker container restart dsbaseclient_armadillo_1
6 changes: 3 additions & 3 deletions azure-pipelines.yml
@@ -44,10 +44,10 @@ schedules:
- master
always: true
- cron: "0 2 * * *"
-displayName: Nightly build - v6.3.5-dev
+displayName: Nightly build - v7.0-dev-feat/performance
branches:
include:
-- v6.3.5-dev
+- v7.0-dev-feat/performance
always: true

#########################################################################################
@@ -216,7 +216,7 @@ jobs:
- bash: |
R -q -e "library(opalr); opal <- opal.login(username = 'administrator', password = 'datashield_test&', url = 'https://localhost:8443', opts = list(ssl_verifyhost=0, ssl_verifypeer=0)); opal.put(opal, 'system', 'conf', 'general', '_rPackage'); opal.logout(opal)"

-R -q -e "library(opalr); opal <- opal.login('administrator','datashield_test&', url='https://localhost:8443/', opts = list(ssl_verifyhost=0, ssl_verifypeer=0)); dsadmin.install_github_package(opal, 'dsBase', username = 'datashield', ref = 'v6.3.5-dev'); opal.logout(opal)"
+R -q -e "library(opalr); opal <- opal.login('administrator','datashield_test&', url='https://localhost:8443/', opts = list(ssl_verifyhost=0, ssl_verifypeer=0)); dsadmin.install_github_package(opal, 'dsBase', username = 'datashield', ref = '7.0-dev-feat_performance'); opal.logout(opal)"

sleep 60
