Methods Inf Med
DOI: 10.1055/a-2540-8346
Letter to the Editor

Response to Commentary by Dehaene et al. on Synthetic Discovery is not only a Problem of Differentially Private Synthetic Data

Ileana Montoya Perez¹, Parisa Movahedi¹, Valtteri Nieminen¹, Antti Airola¹, Tapio Pahikkala¹

¹ Department of Computing, University of Turku, Turku, Finland
Funding This research has received funding from the European Union's Horizon Europe Research and Innovation Programme (grant number 101095384). Views and opinions expressed are, however, those of the authors only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them.

Why Synthetic Discoveries Are Not Only a Problem of Differentially Private Synthetic Data

We thank you for the opportunity to respond to the commentary letter by Dehaene et al. on our recent article, "Does differentially private synthetic data lead to synthetic discoveries?"[1], published in Methods of Information in Medicine. We appreciate the commentators' interest in our work and their contribution to an important and ongoing discussion on the utility of synthetic data and its implications for statistical inference.

The letter from Dehaene et al. raises a concern about two possible interpretations of the results in our article, namely that the risk of unacceptably high false-positive findings from synthetic data can be countered simply by sufficiently increasing the amount of original data, or by stepping away from differentially private (DP) synthetization methods. Referring to the simulation results of Decruyenaere et al.,[2] they note that even for non-DP methods and large original sample sizes, this risk can remain high, especially when deep learning-based generation methods are used. We find that Dehaene et al. raise an important point, and their observations are also compatible with our results. While reducing the amount of DP noise and increasing the original sample size are positively correlated with the utility of the generated synthetic data, these measures alone are not enough if the generator is a misspecified parametric model or suffers from what Decruyenaere et al.[2] refer to as regularization bias.
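To make this concern concrete, the following minimal Python sketch (our own illustration for this response, not the simulation design of either article) shows how naive hypothesis testing on synthetic data can inflate the false-positive rate even with no DP noise and a correctly specified parametric generator, simply because the test ignores the extra variability introduced by the generator:

    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(0)
    n, n_reps, alpha = 100, 5000, 0.05
    rejections = 0

    for _ in range(n_reps):
        # Original data drawn under the null hypothesis: true mean is 0.
        original = rng.normal(loc=0.0, scale=1.0, size=n)
        # Correctly specified parametric generator: fit a Gaussian to the
        # original sample and draw an equally large synthetic sample from it.
        synthetic = rng.normal(loc=original.mean(),
                               scale=original.std(ddof=1), size=n)
        # Naive inference: test the synthetic sample as if it were original.
        if ttest_1samp(synthetic, popmean=0.0).pvalue < alpha:
            rejections += 1

    # The synthetic sample mean carries the sampling variability of both the
    # original sample and the generator, roughly doubling the variance the
    # t-test assumes, so the empirical Type I error is about 0.16 rather than
    # the nominal 0.05.
    print(f"Empirical Type I error: {rejections / n_reps:.3f}")

Because the synthetic sample size grows in step with the original one here, increasing n does not remove the inflation, which is one way to see why larger original samples alone cannot guarantee valid inference on synthetic data.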

As the authors note, citing Chen et al.: "synthetic data are artificial data that (attempt to) mimic the original data in terms of statistical properties, without revealing individual records."[3] Obviously, if privacy were not a concern and reliable prior information on the true distribution of the data were absent, this would be achieved simply by using the original data itself. Indeed, some DP data release methods reconstruct the original data in the limit as epsilon approaches infinity. In our experiments, the DP perturbed and DP smoothed histograms have this property. Accordingly, these methods demonstrate a clear trade-off between similarity to the original data, privacy level, and the amount of original data, with the inferential utility of the synthetic data typically increasing both with the original sample size and with decreasing privacy level.

On the other hand, the synthetic data generated by the Multiplicative Weights Exponential Mechanism (MWEM) and Private-PGM (Private Probabilistic Graphical Model) may diverge from the distribution of the original data even in the limit, because higher-dimensional data are approximated with low-dimensional marginals. Hence, the trade-off may be less clear if the statistical property of interest changes not only with the privacy level but also with the approximation. In some of our results, this is reflected by the utility increasing as a function of decreasing privacy level only up to a certain limit, without ever reaching the utility of the original data. A similar effect can occur if the synthetization method makes incorrect parametric assumptions. At the other extreme of this continuum of methods are synthesizers whose regularization bias serves purposes other than reproducing the original data. For example, in our experiments, the DP GAN method behaved very differently from the other methods, and the risk of false discoveries even increased as a function of decreasing privacy level.
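The limiting behavior of the histogram-based methods can be illustrated with a short sketch of the Laplace mechanism on histogram counts (a simplified stand-in, not the exact implementation used in our experiments; we assume that adding or removing one record changes a single bin count by at most 1, so the sensitivity is 1):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(size=1000)
    counts, edges = np.histogram(data, bins=20)
    sensitivity = 1.0  # one record changes one bin count by at most 1

    for epsilon in [0.1, 1.0, 10.0, 100.0]:
        # Laplace mechanism: noise scale is sensitivity / epsilon, so the
        # perturbation vanishes as epsilon approaches infinity and the
        # released histogram converges to the original one.
        noisy = counts + rng.laplace(scale=sensitivity / epsilon,
                                     size=counts.shape)
        noisy = np.clip(noisy, 0, None)  # post-processing: no negative counts
        l1 = np.abs(noisy - counts).sum()
        print(f"epsilon = {epsilon:6.1f}   L1 distance to original: {l1:9.2f}")

In contrast, no choice of epsilon makes a generator built on low-dimensional marginals, misspecified parametric assumptions, or strong regularization converge to the original data in this way.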

Accordingly, we agree with the main message of Dehaene et al.: the inferential utility level of the original data is not necessarily achieved simply by decreasing the privacy level or by using larger amounts of original data; rather, it is highly method-dependent. Hence, caution is certainly always warranted when performing statistical inference on synthetic data, as different methods have different trade-offs, and some demonstrate systematic biases that are not easy to counter.



Publication History

Received: 12 February 2025

Accepted: 14 February 2025

Article published online:
15 April 2025

© 2025. Thieme. All rights reserved.

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany

 
  • References

  • 1 Montoya Perez I, Movahedi P, Nieminen V, Airola A, Pahikkala T. Does differentially private synthetic data lead to synthetic discoveries? Methods Inf Med 2024; 63 (1-02): 35-51
  • 2 Decruyenaere A, Dehaene H, Rabaey P, et al. The Real Deal Behind the Artificial Appeal: Inferential Utility of Tabular Synthetic Data. Paper presented at: The 40th Conference on Uncertainty in Artificial Intelligence; Barcelona; 2024. Accessed July 18, 2024
  • 3 Chen RJ, Lu MY, Chen TY, Williamson DFK, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng 2021; 5 (6): 493-497