Keywords
differential privacy - synthetic data - hypothesis testing - statistical inference - Mann–Whitney U test
Introduction
As the amount of health and medical data collected from individuals has grown, so
has the interest in using it for secondary purposes such as research and innovation.
Many benefits have been proposed to arise from sharing these data,[1] for example, enhancing research reproducibility, building on existing research,
performing meta-analyses, and reducing clinical trial costs by reusing existing data.
However, privacy concerns about the potential harm to individuals that may come from
sharing their sensitive data, along with legislation aimed at addressing these concerns,
restrict the opportunities for sharing individuals' data.
The release of synthetic data, generated using a statistical model derived from an
original sensitive dataset, has been proposed as a potential solution for sharing
biomedical data while preserving individuals' privacy.[2] [3] [4] It has been argued
that since synthetic data consist of synthetic records instead
of actual records, and synthetic records are not associated with any specific identity,
privacy is preserved.[2] However, it has been repeatedly demonstrated that this is
not the case, as synthetic data are not inherently privacy-preserving.[5] [6] [7] [8]
In the worst case, a generative model could create near copies of the original sensitive
data it was trained on. Moreover, there are many more subtle ways that models can
leak information about their training data.[5] [9] At the other extreme, perfect
anonymity is guaranteed only when no useful information
from the original data remains. Therefore, in addition to preserving privacy, the
generated data should have high utility, meaning the degree to which the inferences
obtained from the synthetic data correspond to inferences obtained from the original
data.[5] [10] Consequently, when generating synthetic data, it is essential to find a balance
between the privacy and utility of the data, ensuring that the generated data capture
the primary statistical trends in the original data while also preventing the disclosure
of sensitive information about individuals.[11]
Differential Privacy (DP), a mathematical formulation that provides probabilistic
guarantees on the privacy risk associated with disclosing the output of a computational
task, has been widely accepted as the gold standard of privacy protection.[12] [13] [14] [15]
As a result, methods that ensure DP guarantees have been introduced in a broad range
of settings, including descriptive statistics,[13] [16] inferential statistics,[17] [18] [19] [20]
and machine learning applications.[15] [21] Furthermore, DP offers a theoretically
well-founded approach that provides probabilistic
privacy guarantees also for the release of synthetic data. Therefore, several methods
for releasing DP-synthetic data have been proposed.[22] [23] [24] [25] [26] Some
state-of-the-art methods for generating DP-synthetic data use multi-dimensional
histograms, which are standard tools for estimating the distribution of data with
minimal a priori assumptions about its statistical properties. Other methods are based
on machine learning–based generative models, for example, Bayesian and Generative
Adversarial Network (GAN)-based methods. The aim of DP-synthetic data is to be a privacy-preserving
version of the original data that could be safely used in its place, requiring no
expertise on DP or changes to the workflow from the end user. However, DP-synthetic
data are always a distorted version of the original data and, especially when high
levels of privacy are enforced, the distortion can be considerable.
Even though combining DP with synthetic data guarantees a desired level of privacy,
preservation of the utility remains unclear. In particular, the validity of statistical
significance tests, namely the statistical guarantees of the false-finding probabilities
being at most the significance level, may be lost.
Hypothesis tests for assessing whether two distributions share a certain property
are essential tools in analyzing biomedical data. In this work, we particularly focus
on the Mann–Whitney (MW) U test (a.k.a. Wilcoxon rank-sum test or Mann–Whitney–Wilcoxon
test), as it is the de facto standard for testing whether two groups are drawn from
the same distribution.[27] [28] It is widely applied in medical research,[29] particularly
when comparing a biomarker between nonhealthy and healthy patients
in clinical trials. It is well known that the MW U test is valid for this question,
that is, the probability of falsely rejecting the null hypothesis of the two groups
being drawn from the same distribution is at most the significance level determined
a priori.[30] Alongside the MW U test, we also consider the Student's t-test,[31] median test,[32] and chi-squared test.[33] In general, the choice of statistical test should be guided by the distribution
characteristics of the dataset and the data type under analysis.
In order for DP-synthetic data to be useful for basic use cases in medical research,
such as the MW U test, one would hope tests carried out on the synthetic data to yield
roughly the same results as tests carried out on the original sensitive medical
datasets. Otherwise, there is a risk that discoveries
are missed because of information lost in synthetization, or worse, that false discoveries
are made due to artifacts introduced in the data generation process.
Objectives
DP-synthetic data have been proposed as a solution for publicly releasing anonymized
versions of sensitive data such as medical records. Ideally, this would allow for
performing reliable statistical analyses on the DP-synthetic data without ever needing
to access the original data (see [Fig. 1]). However, there is a risk that DP-synthetic data generation methods distort the
original data in ways that can lead to loss of information and even to false discoveries.
Fig. 1 The overall configuration of the study.
In this study, we empirically evaluate the validity and power of independent sample
tests, such as the MW U test, applied to DP-synthetic data. The Type I and Type II
errors are used to measure the test validity and power, respectively. On one hand,
a test is valid if, for any significance level, the probability that it falsely rejects
a correct null hypothesis is no larger than the significance level.[34] If the test
is not valid, its use can lead to false scientific discoveries, and hence its practical
utility can even be negative. On the other hand, the test's power
refers to the probability of correctly rejecting a false null hypothesis.
In our experiments with the MW U test, we evaluated five different DP-synthetic data
generation methods on bivariate real-world medical datasets, as well as data drawn
from two Gaussian distributions. Additionally, we performed experiments with simulated
multivariate data to explore the behavior of the MW U test, Student's t-test, median test, and chi-squared test on higher dimensional DP-synthetic data consisting
of different variable types. Our study contributes to understanding the reliability
of statistical analysis when DP-synthetic data are used as a proxy for private data
whose public release is challenging or even impossible.
Methods
In this section, we first present the formal definition of DP. Next, we introduce
DP methods for synthetic data generation while describing the five DP-synthesizers
used in this study. Following that, we explain the validity and power of a statistical
test. Finally, we introduce the independent sample tests considered in this study.
Differential Privacy
DP is a mathematical definition that makes it possible to quantify privacy.[12] [13]
A randomized algorithm M satisfies (ϵ, δ)-DP if, for all sets of outputs S of M and
for all possible neighboring datasets D, D′ that differ by only one row,

Pr[M(D) ∈ S] ≤ e^ϵ · Pr[M(D′) ∈ S] + δ,    (1)

where ϵ is an upper bound on the privacy loss, and δ is a small constant corresponding
to a small probability of breaking the DP constraints. For δ = 0 in particular, solving
(1) w.r.t. ϵ results in

|ln(Pr[M(D) ∈ S] / Pr[M(D′) ∈ S])| ≤ ϵ,

indicating that the log-probability of any output can change by no more than ϵ.
Accordingly, an algorithm M that is ϵ-DP guarantees that for every run M(D), the
outcome obtained is almost equally likely to be obtained on any neighboring dataset,
bounded by the value of ϵ. Informally, in DP, privacy is understood to be protected
at a given level of ϵ if the algorithm's output does not overly depend on the input
data of any single contributor; it should yield a similar result whether or not the
individual's information is present in the input.
Typically, DP methods are nonprivate methods that are transformed to fulfill the DP
definition. This is achieved by adding noise using a noise mechanism calibrated based
on ϵ and the algorithm to be privatized.[12] [13] Choosing the appropriate value of
ϵ is context-specific and an open question, but, for example, ϵ ≤ 1 has been proposed
to provide a strong guarantee,[35] while 1 < ϵ ≤ 10 is considered to still produce
useful privacy guarantees,[36] depending on the task and type of data.
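To make the calibration concrete, the following minimal Python sketch shows the Laplace mechanism for a simple count query; the function name and example values are illustrative assumptions of ours, not code from the paper.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a noisy value satisfying epsilon-DP via the Laplace mechanism.

    The noise scale is calibrated as sensitivity / epsilon: a smaller epsilon
    (stronger privacy) means more noise.
    """
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privatize a count query ("how many patients have PSA >= 10?").
# Adding or removing one individual changes a count by at most 1,
# so the sensitivity is 1.
true_count = 137
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0)
```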
DP Methods for Synthetic Data Generation
In recent years, several methods for generating DP-synthetic data have been
proposed.[23] [24] [26] [37] [38] Some of the proposed methods are based on histograms
or marginals. These methods
privatize the cell counts or proportions of a cross-tabulation of the original sensitive
data to generate the DP-synthetic data. Other methods use a parameterized distribution
or a generative model that has been privately derived from the original data. While
DP methods based on histograms or marginals have been found to produce usable DP-synthetic
data with a reasonable level of privacy guarantee, methods based on parameterized
distributions or deep learning–based generative models have presented greater
challenges.[39] [40]
Generative methods based on marginals share a three-step process: initially, a set
of marginals is identified, either manually by a domain expert or through DP-automatic
selection. Next, these chosen marginals are measured using DP. Finally, synthetic
data are generated from the noisy marginals. To address the challenges of high-dimensional
domains, recent methods have been developed to automatically and privately select
a subset of marginals ensuring their preservation in the synthetic data generated,
such as PrivMRF,[41] PrivBayes,[42] MWEM (Multiplicative Weights Exponential Mechanism),[22] and AIM.[43]
PrivMRF employs Markov Random Fields to generate synthetic data under DP, emphasizing
the retention of statistical correlations between selected marginals within the privacy
constraints. PrivBayes constructs a Bayesian network under DP, utilizing a selected
set of marginals to approximate the underlying data distribution for synthetic data
generation. The MWEM algorithm is designed to generate a data distribution that yields
query responses closely resembling those obtained from the actual dataset. AIM on
the other hand is a workload-adaptive algorithm, allowing for the input of a predefined
set of marginals to be specifically preserved in the final DP-approximated distribution.
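As an illustration of step 2 of this marginal-based workflow, the sketch below measures one selected two-way marginal under DP. The helper function and the assumed Laplace scale of 2/ϵ (for neighboring datasets differing in one row) are our own simplifications, not code from any of the cited packages.

```python
import numpy as np

def noisy_two_way_marginal(col_a, col_b, epsilon, rng=None):
    """Measure one selected two-way marginal (cross-tabulation) under DP.

    col_a and col_b are integer-coded categorical columns of equal length.
    A Laplace scale of 2/epsilon is assumed for neighboring datasets
    differing in one row.
    """
    rng = rng or np.random.default_rng()
    table = np.zeros((col_a.max() + 1, col_b.max() + 1))
    np.add.at(table, (col_a, col_b), 1)           # exact cross-tabulation
    noisy = table + rng.laplace(scale=2.0 / epsilon, size=table.shape)
    return np.clip(noisy, 0, None)                # truncate negative cells
```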
There are two approaches to consider when designing a DP workflow: global DP and local
DP.[13] Global DP involves aggregating data in a central location and is managed by a trusted
curator, ensuring privacy at the dataset level. In contrast, local DP decentralizes
the privacy mechanism by applying it directly to the individual's data before it is
shared. Many applications, such as crowdsourced systems, involve data distributed
across multiple individuals who do not trust any other party. These individuals are
only willing to share their information if it has been privatized on their own devices
prior to transmission. In such cases, local privacy methods such as LoPub and LoCop
become applicable, ensuring that each individual's data remain confidential even when
aggregated from diverse sources.[44] [45] [46]
In this study, we focus on five well-known DP methods for generating synthetic data
in a global DP setting. These methods have established algorithms or available packages,
making them accessible to any practitioner. In the following, we provide a brief description
of each of these DP methods.
- DP Perturbed Histogram
This method uses the Laplace mechanism[13] to privatize the original histogram bin counts. The noise added to each bin is sampled
separately from a calibrated Laplace distribution. After adding the noise, all negative
counts are set to zero, and individual-level data are generated from the noisy counts.
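A minimal sketch of this procedure might look as follows. The Laplace scale of 2/ϵ is one possible calibration (for neighboring datasets differing in one row); the binning and scale used in the study's implementation may differ.

```python
import numpy as np

def dp_perturbed_histogram(values, bin_edges, epsilon, n_synthetic, rng=None):
    """Illustrative sketch of the DP Perturbed Histogram synthesizer."""
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(values, bins=bin_edges)
    noisy = counts + rng.laplace(scale=2.0 / epsilon, size=counts.shape)
    noisy = np.clip(noisy, 0, None)               # set negative counts to zero
    if noisy.sum() == 0:                          # degenerate case: all counts zeroed
        noisy = np.ones_like(noisy)
    probs = noisy / noisy.sum()
    centers = (edges[:-1] + edges[1:]) / 2        # represent each bin by its center
    return rng.choice(centers, size=n_synthetic, p=probs)
```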
- DP Smoothed Histogram
This method generates synthetic data by randomly sampling from the probability
distribution determined by a smoothed histogram. The probabilities of the histogram
bins are proportional to c_i + 2m/ϵ, where c_i is the number of original data points
in the i-th histogram bin and m is the size of the synthetic data drawn. The approach
is similar to the one discussed by Wasserman and Zhou.[14] Unlike the other considered
DP methods, the utility of this method is also inversely proportional to the amount
of synthetic data drawn. Therefore, in our experiments,
we use the method only in settings where the size of the synthetic data generated
is considerably smaller than that of the original sensitive data. A proof of the approach
being DP is presented in [Supplementary Material A.1] (available in the online version).
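The sampling step can be sketched directly from the formula above; the function itself is our illustration, not the study's implementation.

```python
import numpy as np

def dp_smoothed_histogram(values, bin_edges, epsilon, m, rng=None):
    """Illustrative sketch of the DP Smoothed Histogram sampler.

    Bin probabilities are proportional to c_i + 2m/epsilon: the additive
    smoothing term grows with the synthetic sample size m and with the
    strength of privacy (small epsilon), pulling the sampling distribution
    toward uniform.
    """
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(values, bins=bin_edges)
    weights = counts + 2.0 * m / epsilon
    probs = weights / weights.sum()
    centers = (edges[:-1] + edges[1:]) / 2        # represent each bin by its center
    return rng.choice(centers, size=m, p=probs)
```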
- Multiplicative Weights Exponential Mechanism
This algorithm proposed by Hardt et al[22] is based on a combination of the multiplicative weights update rule with the exponential
mechanism. The MWEM algorithm estimates the original data distribution using a DP
iterative process: a uniform distribution over the variables of the original data is
updated, in each iteration, by multiplicative weighting based on a query or bin count
that is selected through the exponential mechanism and privatized with the Laplace
mechanism. The privacy budget ϵ is split across the iterations, as the original data
need to be accessed in every iteration.
- Private-PGM
McKenna et al[26] propose this approach for DP-synthetic data generation. It consists
of three basic steps: (1) selecting a set of low-dimensional marginals (i.e., queries)
from the original data; (2) adding calibrated noise to the marginals; and (3) generating
synthetic data that best explain the noisy marginals. In step 3, based on the noisy
marginals, a
Probabilistic Graphical Model (PGM) is used to determine the data distribution that
best captures the variables' relationship and enables synthetic data generation.
- Differentially Private GAN
GANs[47] consist of a generator, denoted by G, and one or more discriminators D. The
goal is for G to learn to produce synthetic data similar to the original data. The two
networks are initialized randomly and trained iteratively in a game-like setup: G is
fed noise to create synthetic data, which D tries to discriminate as being original or
synthetic. The generator uses feedback from the discriminator(s) to update its
parameters via gradient descent (see Goodfellow[48] for a detailed explanation). GANs,
and other deep learning models, can attain privacy guarantees by using a DP version of
an optimization algorithm, most often differentially private stochastic gradient
descent (DP-SGD).[36]
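The core privatization step of DP-SGD is gradient clipping followed by Gaussian noise. The sketch below shows that generic step only; it is not the GS-WGAN implementation used later in this study, and the helper name and tensor layout are our assumptions.

```python
import torch

def dp_sgd_gradient(per_sample_grads, clip_bound=1.0, noise_multiplier=1.07):
    """Illustrative DP-SGD aggregation: clip each per-sample gradient to L2
    norm clip_bound, sum, add Gaussian noise with standard deviation
    noise_multiplier * clip_bound, and average over the batch.

    per_sample_grads: tensor of shape (batch_size, n_params).
    """
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    scale = torch.clamp(clip_bound / (norms + 1e-12), max=1.0)
    clipped = per_sample_grads * scale             # per-sample gradient clipping
    noise = torch.randn(per_sample_grads.shape[1]) * noise_multiplier * clip_bound
    return (clipped.sum(dim=0) + noise) / per_sample_grads.shape[0]
```

The defaults mirror the hyperparameters reported later (clipping bound C = 1, noise multiplier 1.07), but the surrounding training loop and privacy accounting are omitted.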
Validity and Power of Independent Sample Tests
Samples are considered independent when individuals in one group do not influence
or share information with individuals in another group. Each group consists of unique
members, and no pairing or matching occurs between them. To evaluate potential statistical
differences between the two groups, researchers commonly use statistical tests designed
for independent samples. These tests determine whether the samples were drawn independently
from distributions with shared properties. The independent sample tests considered
in this work are the MW U test, Student's t-test, median test, and chi-squared test.
The validity and power of a statistical test can be evaluated in terms of Type I and
Type II errors. Let us recall that Type I is the error incurred when a “True” null
hypothesis is rejected, producing false inference. On the other hand, Type II is the
error of failing to reject a “False” null hypothesis (see [Fig. 2]). Following Casella
and Berger,[34] we say that the p-value corresponding to the observed test statistic
is valid if, under the null hypothesis, the probability of it being at most α is no
larger than α for any level α. Consequently, the significance test is valid if its
p-value is valid.
Fig. 2 Possible outcomes of a hypothesis test that tests whether two distributions are the
same. FN, false negative (Type II error); FP, false positive (Type I error); TN, true
negative; TP, true positive.
The a priori selected significance level α defines a threshold that, for any valid hypothesis test, forms an upper bound on
the probability of committing a Type I error. A typical choice for α is 0.05, indicating a maximum 5% chance of incorrectly rejecting a true null hypothesis.
The probability of making a Type II error is often denoted as β (beta), from which the power of the test can be determined by computing 1 − β. The power of a test can be interpreted as the probability of correctly rejecting
a null hypothesis when it is in fact “False.” The power depends on the analysis task,
being affected by factors such as chosen significance level, the effect size, the
sample size, and the relative sizes of the different groups. In our experiments, we
observed the imbalance between group sizes to have a dominant effect on tests' power
in practice, because of the DP-synthetic data generators' tendency to produce imbalanced
samples for small ϵ values.
Mann–Whitney U Test
The MW U test is a statistical test first proposed by Frank Wilcoxon in 1945 and later,
in 1947, formalized by Henry Mann and Donald Whitney.[49] [50]
[50] While there are many different uses and interpretations of the test (see, e.g.,
Fay and Proschan[30] for a comprehensive review), in this article we focus on the null hypothesis that
two samples or groups are drawn from the same distribution. The test carried out on
two groups produces a value of the MW U statistic and the corresponding p-value. The U statistic measures the difference between the groups as the number of
times an observed member of the first group is smaller than one of the second group,
with ties counted as one half. The p-value indicates the strength of evidence that
the value of the U statistic provides against the null hypothesis, given that the
assumption of the data being independently drawn holds.[34]
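In practice, the test is a one-liner with SciPy (the same mannwhitneyu function used in our experiments); the data in this snippet are synthetic illustrations.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
group_a = rng.normal(51, 1, size=250)  # e.g., biomarker values in one group
group_b = rng.normal(50, 1, size=250)  # e.g., biomarker values in the other group

# Two-sided test of the null hypothesis that both groups are drawn
# from the same distribution.
u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.3g}")
```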
Couch et al[19] proposed a differentially private version of the MW U test (DP-MW). The DP-MW U
test is presented as (ϵ, δ)-DP, where a portion of the privacy budget ϵ and δ are used for privatizing the smallest group size. The privatized size and the rest
of ϵ are then used for privatizing the U statistic using a calibrated Laplace distribution.
In order to calculate the corresponding p-value, the DP-MW U distribution under the null hypothesis is generated based on the
privatized group sizes. Detailed information and algorithms are provided by Couch
et al.[19] The DP-MW U test is not based on analyzing synthetic data, but rather the test is
carried out directly on the original sensitive dataset, and DP guarantees that sensitive
information about individuals is not leaked when releasing the test results.
In this study, the DP-MW U test on the original sensitive data provides us with a
reference point, a valid test with the best-known achievable power when performing
the MW U test under DP. In contrast, the ordinary MW U test is evaluated on the DP-synthetic
data. If the validity of the ordinary test is preserved, comparison to the reference
point indicates how much power is lost when general-purpose DP-synthetic data are
generated as an intermediate step.
Student's t-Test, Chi-Squared Test, and Median Test
The Student's t-test (independent or unpaired t-test) is a widely utilized parametric statistical test that assesses whether the
means of two independent samples are significantly different.[31] The null hypothesis states that the means are equal, while the
alternative hypothesis suggests that they are not. The test is valid for two independent
samples if their distributions are normal and their variances are equal.
The chi-squared test is a nonparametric test used to analyze the association of two
categorical variables by utilizing a contingency table.[33] Under the null hypothesis,
the observed (joint) frequencies should equal the expected (marginal) frequencies,
meaning that the variables are independent. Since, under the null hypothesis, the test
statistic only approximately follows a chi-squared distribution, the test's validity
depends on the sample size. For small n, particularly in 2 × 2 tables, Fisher's exact
test is the appropriate alternative.
The median test is a nonparametric method used to test the null hypothesis of two (or
more) independent samples being drawn from distributions with equal medians.[32] The
test is valid as long as the distributions have equal densities in the neighborhood
of the median (see, e.g., Freidlin and Gastwirth[51] and references therein).
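All of these tests are also available in SciPy; the snippet below shows the corresponding calls on illustrative data (the contingency counts are made up for the example).

```python
import numpy as np
from scipy.stats import ttest_ind, chi2_contingency, fisher_exact, median_test

rng = np.random.default_rng(1)
x = rng.normal(50, 2, size=100)
y = rng.normal(50, 2, size=100)

# Student's t-test for equality of means (the classical version assumes
# normality and equal variances).
t_stat, t_p = ttest_ind(x, y, equal_var=True)

# Median test for the null hypothesis of equal medians.
m_stat, m_p, grand_median, table = median_test(x, y)

# Chi-squared test of independence on a 2 x 2 contingency table, e.g.,
# medication use (rows) by patient group (columns).
contingency = np.array([[20, 30], [25, 25]])
chi2, chi_p, dof, expected = chi2_contingency(contingency)

# For small counts in a 2 x 2 table, Fisher's exact test is preferred.
odds_ratio, fisher_p = fisher_exact(contingency)
```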
Experimental Evaluation
To empirically evaluate the utility of independent sample tests applied to DP-synthetic
datasets, we conducted a set of experiments. In each experiment, either simulated
or real-world data were used to represent the original sensitive dataset. These data
were subsequently used to train DP-synthetic data generation methods. Finally, the
independent sample tests were carried out on synthetic data produced by the generator.
First, we examined the behavior of the MW U test on DP-synthetic data generated based
on bivariate real-world datasets or simulated data drawn from Gaussian distributions.
As depicted by the real distributions in [Fig. 2], we considered two cases for Gaussian data: one where both groups are drawn from
the same distribution (i.e., the null hypothesis is true) and one where they are drawn
from distributions with different means (i.e., the null hypothesis is false). While
in practice synthesizing datasets consisting of only two variables would have quite
limited use cases, these experiments demonstrate the fundamentals of how different
DP synthetization approaches affect the validity and power of statistical tests. In
order to provide a more realistic setup, we further performed experiments on a simulated
multivariate dataset. The validity and power of the MW U test, Student's t-test, median test, and chi-squared test were explored in these experiments.
In the overall study design (see [Fig. 1]), the real-world, Gaussian, and simulated multivariate datasets correspond to the
sensitive data given as input to a DP-synthesizer method that produces a DP-synthetic
dataset as output. In the following subsections, we present the datasets, the implementation
details of the DP-synthetic data generation methods used, and the experiments conducted.
Original Datasets
First, we experimented with a setup where the sensitive original data consist of
only two variables (i.e., a binary variable and a continuous variable). The binary
variable represents group membership (e.g., healthy or nonhealthy), while the continuous
variable is the one used to compare the groups with the MW U test.
To establish a controlled environment where the amount of signal (i.e., the effect
size) in the population is known, we drew two groups of data from two Gaussian
distributions with a known mean (μ) and standard deviation (σ). More precisely, for
non-signal data, which corresponds to a setting where the null hypothesis is true,
the two groups were randomly drawn from the same Gaussian distribution (μ = 50,
σ = 2). For the signal data, which corresponds to a setting where the null hypothesis
is false, two Gaussian distributions with effect size μ1 − μ2 = σ (i.e., μ1 = 51,
σ1 = 1, μ2 = 50, σ2 = 1) were used to sample each group. Additionally, for those DP
methods based on histograms or marginals, the sampled values for each group were
discretized into 100 bins (ranging from 1 to 100).
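The data-generating setup can be reproduced in a few lines; the rounding-and-clipping discretization shown here is our assumption of how values map to the 100 bins, and the exact procedure in the study may differ.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_group = 250

# Null hypothesis true (non-signal data): both groups from N(50, 2).
null_a = rng.normal(50, 2, size=n_per_group)
null_b = rng.normal(50, 2, size=n_per_group)

# Null hypothesis false (signal data): means one standard deviation apart,
# N(51, 1) versus N(50, 1).
signal_a = rng.normal(51, 1, size=n_per_group)
signal_b = rng.normal(50, 1, size=n_per_group)

# For histogram/marginal-based synthesizers: discretize into 100 bins
# spanning the values 1..100.
discretized_a = np.clip(np.round(signal_a), 1, 100).astype(int)
```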
In order to verify our experiment's results on the Gaussian data, we also carried
out experiments using real-world medical data. In this case, we use the following
two datasets:
- The Prostate Cancer Dataset
The data are from two registered clinical trials, IMPROD[52] and MULTI-IMPROD,[53] with trial numbers NCT01864135 and NCT02241122, respectively. These trials were
approved by the Institutional Review Board, and each enrolled patient gave written
informed consent. The dataset consists of 500 prostate cancer (PCa) patients (242
high-risk and 258 benign/low-risk PCa) with clinical variables, blood biomarkers,
and magnetic resonance imaging features. For our experiments, we selected two variables:
a binary label that indicates the condition of the patient and the prostate-specific
antigen (PSA) level. The PSA is a continuous variable that has been associated with
the presence of PCa.[54] [55] Therefore, in this study, we considered the null
hypothesis under test to be “The PSA levels of high-risk and benign/low-risk PCa
patients originate from the same
distribution.” [Fig. 3A] presents the PSA distribution for both groups in this dataset. In those DP methods
based on histograms or marginals, the PSA values were discretized using a 40-bin histogram
(ranging from 1 to 40, where PSAs ≥ 40 are in the last bin).
- Kaggle Cardiovascular Disease Dataset
This dataset is publicly available and consists of 70,000 subjects and 12 variables,
where the target variable is the cardio condition of the subjects, with 34,979 presenting
cardiovascular disease and 35,021 without the disease.[56] For our experiments, we use
each subject's body mass index (BMI), calculated from their weight and height, which
has been related to cardiovascular conditions.[57] Here, the null hypothesis under
test is “The BMI levels of individuals with cardiovascular disease and of those without
the disease originate from the same distribution.” [Fig. 3B] presents the BMI distribution of both groups (i.e., cardio disease vs. no cardio
disease). The BMI variable was discretized into 24 bins, where the first bin contains
BMI < 18 and the last bin BMI ≥ 40, in those DP methods that require it.
Fig. 3 (A) Prostate cancer (PCa) dataset: prostate-specific antigen (PSA) level distribution
for high-risk and benign/low-risk PCa. The difference between the groups is statistically
significant (MW U stat = 22,713, p-value = 1.4e−07). (B) Kaggle Cardiovascular Disease dataset: body mass index (BMI) distribution for subjects
with and without cardio disease. The difference between the groups
is statistically significant (MW U stat = 471,500,929.50, p-value ≅ 0.000).
Finally, we experimented with simulated multivariate datasets. The simulation was
based on the real-world PCa dataset. The included variables were the patient's age,
PSA level, prostate volume, the use of 5-alpha-reductase inhibitor (5-ARI) medication,
prostate imaging reporting and data systems (PIRADS) score, and a class label indicating
low-risk or high-risk PCa. The simulated datasets were generated by a GaussianCopulaSynthesizer
from the Synthetic Data Vault (SDV)[58] trained on the real-world dataset. In the SDV settings, the age variable was configured
to follow a normal distribution, while the remaining numerical variables were set
to follow a beta distribution. In experiments with a false null hypothesis, SDV was
conditioned to generate simulated datasets with an equal number of high-risk and low-risk
patients. For experiments with a true null hypothesis, the condition was to generate
only one class (low-risk) for the simulated dataset, and subsequently, half of the
data were randomly assigned to the high-risk class.
Implementations
In our experiments, for the generated DP-synthetic data, we used the hypothesis tests
provided by the Scipy v1.6.3 package,[59] such as the mannwhitneyu function for the MW U test. We used two-sided tests, with all the test statistics
and p-values computed using the Scipy functions' default values. As a point of reference,
we also computed the DP-MW U statistic and p-value on the corresponding original sensitive dataset. The DP-MW U test was implemented
using Python v3.7 and following the algorithms presented by Couch et al,[19] where 65% of ϵ and δ = 10⁻⁶ are used for estimating the size of the smallest group, and the U statistic is privatized
using the estimated size and the remaining ϵ.
In the case of the DP Perturbed Histogram, Python v3.7 was also used in the implementation.
The noise added to the original histogram was sampled from a discrete Laplacian
distribution[60] with scale calibrated to ϵ; the noisy counts were then normalized by
the original data size. After that, the synthetic data were obtained by transforming
the DP histogram counts to values using the bin center points. For Private-PGM[61] and
MWEM,[62] their corresponding open-source packages were used to generate DP-synthetic data.
The Private-PGM synthetic data were generated by following the demonstration in Python
code presented by McKenna et al,[26] using a Laplace distribution with scale calibrated
to split(ϵ), where split(ϵ) is the privacy budget ϵ divided by the number of marginal
queries selected. MWEM was run with default hyperparameters; only ϵ was changed to show
the effect of different privacy budgets. The resulting DP-synthetic data were sampled
using the noisy histogram weights returned by the MWEM algorithm.
The implementation of DP Smoothed Histogram was also coded in Python v3.7 following
Algorithm 1 provided in the [Supplementary Material] (available in the online version). In all our experiments, the DP-synthetic data
generators were configured to preserve all the one-way and two-way marginals.
The GAN model used is based on the GS-WGAN by Chen et al.[25] The implementation is a modification of the freely available source code,[63] with changes made to suit tabular data generation instead of images. The generator
architecture was changed from a convolutional- to a fully connected three-layer network,
and the gradient perturbation procedure was modified to accommodate these changes
along with making the source code compatible with an up-to-date version of PyTorch
(v1.10.2).[64] Hyperparameter settings were chosen based on the recommendations of
Gulrajani et al[65] on WGAN-GP, of which GS-WGAN is a DP extension. This model uses
privacy amplification by subsampling,[25] a strategy to achieve stronger privacy
guarantees by splitting the training data into mutually exclusive subsets according
to a subsampling rate γ. Each subset is used to train one discriminator, and the
generator randomly queries one discriminator for each update.
Experimental Setup
In the experiments, we investigated the utility of the statistical test at different
levels of privacy ϵ. For the DP-MW U test and all DP-synthetic data generation methods, except the DP
GAN, we used ϵ values of 0.01, 0.1, 1, 5, and 10. For the DP GAN experiments, the ϵ values were 1, 2, 3, 4, 5, and 10. The higher minimum of ϵ = 1 was set due to differences between the DP GAN and the other methods. Every experiment
was repeated 1,000 times, and the proportions of Type I and Type II errors were computed
and evaluated at a significance level α = 0.05.
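The evaluation loop itself is a simple Monte Carlo estimate; the sketch below shows it for Type I error, where `synthesize` is a placeholder standing in for any of the DP generators above (not an interface from the paper).

```python
import numpy as np
from scipy.stats import mannwhitneyu

def estimate_type_i_error(synthesize, n_original, epsilon,
                          n_repeats=1000, alpha=0.05, rng=None):
    """Estimate the Type I error of the MW U test on DP-synthetic data.

    synthesize(values, labels, epsilon, rng) must return synthetic
    (values, labels); it is a hypothetical placeholder.
    """
    rng = rng or np.random.default_rng()
    rejections = 0
    for _ in range(n_repeats):
        # Null hypothesis is true: both groups from the same Gaussian.
        values = rng.normal(50, 2, size=n_original)
        labels = rng.integers(0, 2, size=n_original)
        syn_values, syn_labels = synthesize(values, labels, epsilon, rng)
        a, b = syn_values[syn_labels == 0], syn_values[syn_labels == 1]
        if len(a) == 0 or len(b) == 0:
            continue  # test undefined if one group is empty
        _, p = mannwhitneyu(a, b, alternative="two-sided")
        rejections += p < alpha
    return rejections / n_repeats
```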
Setup for Gaussian Data
In our experiments on Gaussian data using the DP-MW U test, DP Perturbed Histogram,
Private-PGM, and MWEM, each method was applied to original dataset sizes of 50, 100,
500, 1,000, and 20,000 with a group ratio of 50% and at the different values of ϵ. In these methods, the original dataset size was considered to be public knowledge;
thus, the size of the generated DP-synthetic dataset was around or equal to the
original size.
Experiments with DP Smoothed Histogram were performed by randomly sampling an original
Gaussian dataset of large size (i.e., dataset size of 20,000 with a group ratio of
50%). Then, the method was applied using the different values of ϵ, and for every ϵ synthetic data of size 50, 100, 500, and 1,000 were generated using the noisy probabilities
returned by the method.
In all experiments with the GAN discriminator networks, a subsampling rate γ of 1/500 was used, resulting in mutually exclusive subsets of size 40. The sample
size for the GAN training data was 20,000 in all settings, and 1,000 different generators
were trained, with models saved at the chosen values of ϵ (1, 2, 3, 4, 5, and 10). Five synthetic datasets of sizes 50, 100, 500, and 1,000
were sampled from each of the generators, and MW U tests were conducted on each of
these synthetic datasets separately. The DP hyperparameters were set to C = 1 for the gradient clipping bound and 1.07 for the noise multiplier, following
Chen et al.[25]
A summary of the settings for the experiments with original Gaussian data is provided
in [Table 1].
Table 1
Setup for experiments using original Gaussian data

| DP method | Original dataset size, group ratio 50% | DP-synthetic dataset size | Privacy budget |
|---|---|---|---|
| DP-MW U test | 50, 100, 500, 1,000, 20,000 | N/A | ϵ = 0.01, 0.1, 1, 5, 10 |
| DP Perturbed Histogram, Private-PGM, MWEM | 50, 100, 500, 1,000, 20,000 | Similar to the original dataset | ϵ = 0.01, 0.1, 1, 5, 10 |
| DP Smoothed Histogram | 20,000 | 50, 100, 500, 1,000 | ϵ = 0.01, 0.1, 1, 5, 10 |
| DP GAN | 20,000 | 50, 100, 500, 1,000 | ϵ = 1, 2, 3, 4, 5, 10 |

Note: For the DP-MW U test, DP-synthetic dataset size is not applicable (“N/A”), because
this method is computed on the original sensitive data.
Setup for Real-World Data
The size of the PCa dataset constrained some of the experiments. Therefore, DP Smoothed
Histogram and DP GAN experiments with this dataset were excluded, as these methods
require a larger original dataset size (i.e., thousands of observations) to be applied
accurately. On the other hand, the cardiovascular dataset size allowed us to
carry out experiments with all the DP methods.
In the experiments with the PCa dataset, we applied each considered DP method at each
epsilon value 1,000 times. In the cardiovascular dataset experiments, we used the data
to sample 1,000 original datasets for each dataset size (50, 100, 500, 1,000,
and 20,000); then, for each sampled dataset, we applied the DP methods at each epsilon.
The proportion of Type II error was measured over the 1,000 repetitions for each
experiment setting. For DP Smoothed Histogram and DP GAN, due to their nature, the
experiments were performed differently; however, they had a setting similar to that
of the experiments with Gaussian signal data.
Setup for Simulated Multivariate Data
In the experiments with a simulated multivariate dataset, we considered the Private-PGM
and MWEM synthesizers. Using the generated DP-synthetic data, we empirically assessed
the proportion of Type I and Type II errors for the MW U test for an ordinal variable
(PI-RADS score), Student's t-test for a normally distributed continuous variable (age), median test for another
continuous variable (PSA), and chi-squared test of independence for a binary variable
(use of 5-ARIs medication).
For these experiments, we generated 1,000 simulated multivariate datasets for dataset
sizes of 50, 100, 500, 1,000, and 20,000. Subsequently, for each simulated dataset,
we generated DP-synthetic data of the same size at each epsilon value. The proportions
of Type I and Type II errors were measured across the DP-synthetic datasets, with
the condition that the requirements for running the statistical test were met in at
least 50 of the generated DP-synthetic datasets (see [Supplementary Material A.2]
(available in the online version) for further details on cases where tests are
undefined, such as when the DP-synthetic data consist of only a single class).
Results
Gaussian Data
In [Fig. 4A], experiments on Gaussian non-signal data (i.e., both groups originate from the same
Gaussian distribution) show that when the DP-MW U test is applied to the 1,000 datasets,
the proportion of Type I error stays close to α = 0.05 for all dataset sizes at all ϵ. Meanwhile, the MW U test on DP-synthetic data from DP Perturbed Histogram, Private-PGM,
and MWEM has a high proportion of Type I error for ϵ < 5, falsely indicating a significant difference between the two groups. Of these
DP methods, DP Perturbed Histogram and Private-PGM benefit from having a large original
dataset size (i.e., 20,000), as ϵ can be reduced to 1 while still keeping the Type I error close to α = 0.05. MWEM is the method with the worst performance, as its proportion of Type I
error for all sample sizes stays above 0.05 even at ϵ = 10.
Fig. 4 The proportion of Type I and Type II errors for the Mann–Whitney U test using four
differentially private (DP) methods: DP-MW U test, DP Perturbed Histogram, Private-PGM,
and MWEM at different privacy budget (ε). The dataset size indicates the size of the original data used in the experiments
by the DP methods. The proportions of Type I error and Type II error were measured
over 1,000 repetitions of the experiment using Gaussian (A) non-signal data and (B) signal data, respectively.
[Fig. 4B] presents the results for Gaussian signal data where a difference between the two
groups exists (i.e., normally distributed data of two groups with means 1 standard
deviation apart). From these results, we observed that the MW U test Type II error
for all the DP methods, with low ϵ, can be reduced by increasing the dataset size, corroborating the trade-off that
exists between privacy, utility, and dataset size.
Results for the MW U test on DP-synthetic data from DP Smoothed Histogram and DP GAN
are presented in [Fig. 5]. The DP Smoothed Histogram method controls the Type I error reliably. However, the
price for this is that in most of our experiment settings, it has high Type II error,
meaning that the real difference between the groups present in the original data is
lost in the DP-synthetic data generation process. DP GAN shows very high Type I error
that, in interesting contrast to the other methods, grows as the privacy level is reduced.
Fig. 5 The proportion of Type I and Type II errors of MW U test applied to synthetic data
generated from DP Smoothed Histogram and DP GAN. The size of the original dataset
is 20,000 with a group ratio of 50%. DP-synthetic data of sizes 50, 100, 500, and
1,000 were generated from both methods. The proportions of Type I error and Type II
error were measured over 1,000 DP-synthetic datasets using Gaussian (A) non-signal data and (B) signal data, respectively. DP, differentially private.
To summarize, these results show that except for DP Smoothed Histogram, all the DP-synthetic
data generation methods have highly inflated Type I error. This means that they are
prone to generating data from which false discoveries are likely to be made. For the
histogram-based methods, increased Type I error was associated with increased level
of privacy, the effect being especially clear for ϵ < 5. [Fig. 6] presents an example of a false discovery on synthetic data generated with the DP Perturbed
Histogram at ϵ = 0.1, and also demonstrates how DP Smoothed Histogram does not exhibit the same
behavior.
Fig. 6 Example of the two groups' distributions in a non-signal original dataset of size
500 (U stat = 31,460.5, p-value = 0.8953) and the corresponding distributions for synthetic data generated
using DP Perturbed Histogram (MW U stat = 38,191.5, p-value = 0.00001774) and DP Smoothed Histogram (MW U stat = 29,621.5, p-value = 0.3314) with ε = 0.1 as the privacy budget. With such a high level of privacy enforced, neither of
the DP-synthetic datasets preserves the structure of the original data well. The DP
Perturbed Histogram has a tendency to create artificial differences between the
two groups that result in low p-values for the MW U test, whereas with the DP Smoothed Histogram method both the generated
case and control groups follow similar, close to uniform distributions. DP, differentially
private.
Real-World Data
[Fig. 7A] shows the results of experiments conducted with the PCa dataset. The DP-MW U test
performs as expected for an original dataset size of 500 with a group ratio of approximately
50%. The null hypothesis is rejected for ϵ ≥ 1, while for ϵ < 1 it is often not rejected. Similar behavior is present in DP-synthetic data from
DP Perturbed Histogram and MWEM, yet the chance of rejecting the null hypothesis when
ϵ < 1 is higher than in the DP-MW U test. In DP-synthetic data from Private-PGM, the
null hypothesis is rejected for ϵ ≥ 5 more often than for ϵ < 5.
Fig. 7 The proportion of Type II error for the Mann–Whitney U test using four DP methods:
DP-MW U test, DP Perturbed Histogram, Private-PGM, and MWEM applied to (A) the PSA level data in the prostate cancer dataset (dataset size = 500) and (B) the body mass index (BMI) data in the Kaggle Cardiovascular Disease dataset. DP,
differentially private.
The experiment results for the DP-MW U test, DP Perturbed Histogram, Private-PGM,
and MWEM applied to the Cardiovascular Disease dataset are presented in [Fig. 7B]. In this dataset, we observe that MWEM and Private-PGM are the methods that benefit
the most from increasing the original sample size, as stronger privacy guarantees
can be provided without the MW U test losing power. These results agree with the ones
obtained when using Gaussian signal data.
Results for DP Smoothed Histogram and DP GAN applied to the cardiovascular dataset
are presented in [Fig. 8]. With DP Smoothed Histogram, Type II error is at an acceptable level when ϵ ≥ 5 and the sample size is 500 or 1,000, whereas for lower ϵ values the effect is not found. DP GAN results have lower Type II error, but given
the high Type I error the method showed in the non-signal experiments, the approach
is less reliable than the DP Smoothed Histogram method.
Fig. 8 The proportion of Type II error of MW U test applied to synthetic data generated
from DP Smoothed Histogram and DP GAN. The original dataset of size 20,000 with a
group ratio of 50% was drawn from the publicly available Cardiovascular Disease dataset.
DP-synthetic data of sizes 50, 100, 500, and 1,000 were generated using both DP methods
1,000 times. DP, differentially private.
Simulated Multivariate Data
In [Fig. 9], the proportion of Type I errors for various statistical tests (i.e., MW U test,
Student's t-test, median test, and chi-squared test) is presented. From these results, we observe
that false discoveries are also prone to occur, similarly to the previous experiments
with only two variables. The validity of the tests is preserved only for the largest
tested privacy budgets combined with large amounts of original sensitive data. The
same kind of trend was observed for all statistical tests under consideration. For Private-PGM,
a substantial drop in Type I error was observed for ϵ = 0.01 and dataset size <20,000. On closer examination, we observed that with the
smallest privacy budgets, the size of the smaller of the two groups tends to be very
small or even zero. This can be seen from the number of times the test requirements
failed, as presented in [Supplementary Material A.2] (available in the online version), where the tests fail when the size of the smaller
group is zero. The power of all evaluated tests strongly depends on the group size
imbalance in the sample, so that for a fixed sample size they have the highest power
for equal group sizes and the power shrinks to zero when the smaller group size goes
to zero. Therefore, the tendency of the low privacy budgets to produce imbalanced
samples counters the tendency to produce fake group differences to some extent.
Fig. 9 Proportion of Type I error conditioned on the number of DP-synthetic datasets where
the statistical test is feasible; (A) MW U test applied to an ordinal variable (PI-RADS score); (B) Student's t-test on a normally distributed variable (age); (C) median test on continuous variable (PSA); (D) Chi-squared test on a binary variable (5-ARI medication). DP, differentially private.
In the case of Type II error proportions ([Fig. 10]), the results depend on the magnitude of group differences in the original data,
how it is preserved by the GaussianCopulaSynthesizer, and the size of the simulated
dataset. As a baseline or point of reference, we first present the Type II error probabilities
computed over the 1,000 simulated multivariate datasets that represent the original
sensitive data before the synthetic data are generated based on them. Then, we illustrate
the corresponding Type II error probabilities for the DP-synthetic data with different
privacy budgets. For the synthetic data, especially for ϵ < 10, we observe that the Type II errors are often lower than those on the original
data, indicating that true group differences are discovered more often from the synthetic
data than from the original. However, this is explained by the large Type
I error probabilities presented in [Fig. 9]: the fake group differences present in the synthetic data are so
strong that they end up being discovered instead of the true differences, which are
too weak to be discovered from the original data. For ϵ = 0.01, the large Type II error of Private-PGM also mirrors the low Type I error,
caused by the loss of power due to the group size imbalance.
Fig. 10 Proportion of Type II error conditioned on the number of DP-synthetic datasets where
the statistical test is feasible; (A) MW U test applied to an ordinal variable (PI-RADS score); (B) Student's t-test on a normally distributed variable (age); (C) median test on continuous variable (PSA); (D) chi-squared test on a binary variable (5-ARI medication). DP, differentially private.
Discussion
This study investigated to what extent the validity and power of independent sample
tests are preserved in DP-synthetic data. Experimental results on Gaussian, real-world,
and multivariate simulated data demonstrate that the generated DP-synthetic data,
especially with strong privacy guarantees (ϵ ≤ 1), can lead to false discoveries. We empirically show that many state-of-the-art
DP methods for generating synthetic data have highly inflated Type I error when the
privacy level is high. These results indicate that false discoveries or inferences
are likely to be drawn from the DP-synthetic data produced by these DP methods. Our
findings are in line with other studies that have presented or stated that DP-synthetic
data can be invalid for statistical inference and indicated the need for methods that
are noise-aware in order to produce accurate statistical inferences.[17] [66] [67] [68] [69]
Additionally, it is necessary to be cautious when analyzing Type II error results,
as this is only meaningful for valid tests where the Type I error is properly controlled.
The Type II error tends to decrease with the increase of Type I error, as these errors
are inversely related. In our study, the only DP method based on synthetic data generation
that had a valid Type I error over all the privacy budgets tested was the DP Smoothed
Histogram method. However, the method is applicable only when the original dataset
size is fairly large (e.g., n = 20,000 in our experiments) and tended to have high Type II error when the amount
of privacy enforced was high (e.g., ϵ ≤ 1). For the DP Perturbed Histogram and Private-PGM methods, both Type I and Type II
errors remained low for ϵ ≥ 5, whereas MWEM and DP GAN did not provide valid Type I error levels even at the
lowest privacy levels tested.
The main advantage of releasing DP-synthetic data, as opposed to releasing only analysis
results from the original data, is that it can ideally be used to support a wide range
of analyses by different users. Due to the post-processing property of DP, any type or
number of analyses done on the synthetic data is also guaranteed to be DP with no
further privacy budget needed. However, if the only goal is to perform a limited number
of predefined analyses, it makes more sense to do these on the original data with
DP methods. This is illustrated in our experiments by the DP-MW U test baseline that
always outperforms analyses done on DP-synthetic data. As a middle ground between
these approaches, an active area of research is the development of DP synthetization
methods in which the data are optimized to support certain types of analyses well, such
as PrivPfC[70] for classifier training and various Bayesian noise-aware DP synthetic data generation
methods.[69]
There are limitations in our study that could be addressed in future research. One
limitation is that marginal- or histogram-based DP methods require continuous variables
to be discretized. This discretization must be performed in a private manner or based
on literature to avoid leaking private information. Moreover, it is well known that
the number of bins used to discretize the data has a significant impact on the quality
of the resulting data.[16] [43] Therefore, choosing the number of bins is problem- and data-dependent and can affect
the results. In our experiments with Gaussian data, the continuous values were discretized
using 100 bins. This number of bins was selected to show a possible extreme case where
having empty bins or bins with small counts deteriorates the quality of the generated
DP-synthetic data. On the other hand, for our experiments with real-world and multivariate simulated
data, the number of bins used was determined based on domain knowledge and literature.
Finally, testing different hyperparameter values for the DP method implementations
could yield different results for the methods.
Conclusion
Our results suggest caution when releasing DP-synthetic data, as false discoveries
or loss of information are likely to happen, especially when a high level of privacy
is enforced. To an extent, these issues may be mitigated by having large enough original
datasets, selecting methods that are less prone to adding false signal to the data,
and by carefully comparing the quality of the DP-synthetic data to that of the original
data based on various quality metrics (see, e.g., Hernadez et al[4]) before data release. Still, with current methods, DP-synthetic data may be a poor
substitute for real data when performing statistical hypothesis testing, as one cannot
be sure whether the results obtained are based on trends that hold true in the real
data or on artifacts introduced when synthetizing the data.