A Simple-to-Use R Package for Mimicking Study Data by Simulations
Funding The authors received no specific funding for this study.
Background Data protection policies might prohibit the transfer of existing study data to interested research groups. To overcome legal restrictions, simulated data can be transferred that mimic the structure but are different from the existing study data.
Objectives The aim of this work is to introduce the simple-to-use R package Mock Data Generation (modgo) that may be used for simulating data from existing study data for continuous, ordinal categorical, and dichotomous variables.
Methods The core of the approach combines rank inverse normal transformation with the calculation of a correlation matrix for all variables. Data can then be simulated from a multivariate normal distribution and transformed back to the original scale of the variables. Unique features of modgo are that it allows users to change the correlation between variables, perform perturbation analysis, handle multicenter data, and change inclusion/exclusion criteria by selecting specific values of one or a set of variables. Simulation studies on real data demonstrate the validity and flexibility of modgo.
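The pipeline described in the Methods can be illustrated with a minimal sketch. It is written in Python with the standard library only and is not the modgo implementation: all function names and the toy variables (age, cholesterol, smoker) are hypothetical. As simplifying assumptions, it uses plain Pearson correlations of the normal scores for every variable type and maps simulated values back through the empirical quantiles of the original data.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for both transform directions

def rank_inverse_normal(values):
    """Map each value to a normal score via its rank (average rank for ties)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [STD_NORMAL.inv_cdf((r - 0.5) / n) for r in ranks]

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long columns."""
    def corr(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)
    p = len(columns)
    return [[1.0 if i == j else corr(columns[i], columns[j])
             for j in range(p)] for i in range(p)]

def cholesky(matrix):
    """Lower-triangular Cholesky factor; raises ValueError if not positive definite."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def simulate(columns, n_samples, rng):
    """Simulate rows that mimic the marginals and correlations of `columns`."""
    scores = [rank_inverse_normal(col) for col in columns]
    lower = cholesky(correlation_matrix(scores))
    sorted_cols = [sorted(col) for col in columns]
    p = len(columns)
    rows = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Correlated standard normal draw: lower @ z
        correlated = [sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(p)]
        row = []
        for value, ordered in zip(correlated, sorted_cols):
            u = STD_NORMAL.cdf(value)  # back to the (0, 1) scale
            idx = min(int(u * len(ordered)), len(ordered) - 1)
            row.append(ordered[idx])   # empirical quantile of the original data
        rows.append(row)
    return rows

# Toy "study data" (hypothetical): two continuous variables and one dichotomous variable.
rng = random.Random(42)
age = [rng.gauss(55, 10) for _ in range(200)]
cholesterol = [2 * a + rng.gauss(100, 20) for a in age]
smoker = [1 if rng.random() < 0.3 else 0 for _ in range(200)]

mock = simulate([age, cholesterol, smoker], n_samples=500, rng=rng)
```

Because the back-transformation in this sketch only returns values observed in the original data, it does not by itself anonymize subjects; the package's perturbation feature addresses that concern and is not reproduced here.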
Results modgo mimicked the structure of the original study data. Results of modgo were similar to those from two other existing packages in standard simulation scenarios. modgo's flexibility was demonstrated in several expansions.
Conclusion The R package modgo is useful when existing study data may not be shared. Its perturbation expansion permits the simulation of truly anonymized subjects. The expansion to multicenter studies can be used for validating prediction models. Additional expansions can support the unraveling of associations even in large study data and can be useful in power calculations.
Keywords data privacy - perturbation analysis - statistical disclosure control - synthetic data - validation study
Data Availability Statement
All relevant data are within the manuscript and its Supporting Information files.
Code Availability (Software Application or Custom Code)
All code including the R package is available as [Supplementary Material] (available in the online version).
G.K. was involved in programming and in writing, editing, and review of the original draft. F.O. was involved in methodology, programming, and writing, editing, and review of the original draft. A.Z. was involved in methodology, supervision, and writing, editing, and review of the original draft.
Authors are listed alphabetically.
Received: 07 July 2022
Accepted: 15 February 2023
Accepted Manuscript online: 07 March 2023
Article published online: 11 April 2023
© 2023. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany