A Simple-to-Use R Package for Mimicking Study Data by Simulations

Giorgos Koliopanos; Francisco Ojeda; Andreas Ziegler

doi:10.1055/a-2048-7692

CC BY-NC-ND 4.0 · Methods Inf Med 2023; 62(03/04): 119-129
DOI: 10.1055/a-2048-7692

Original Article

Authors

Giorgos Koliopanos
¹Cardio-CARE, Medizincampus Davos, Davos, Switzerland

Francisco Ojeda
²Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Germany
³Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Germany

Andreas Ziegler
¹Cardio-CARE, Medizincampus Davos, Davos, Switzerland
²Department of Cardiology, University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Germany
³Centre for Population Health Innovation (POINT), University Heart and Vascular Center Hamburg, University Medical Center Hamburg-Eppendorf, Germany
⁴School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa

Funding The authors received no specific funding for this study.

Abstract

Background Data protection policies might prohibit the transfer of existing study data to interested research groups. To overcome legal restrictions, simulated data that mimic the structure of the existing study data but differ from them can be transferred instead.

Objectives The aim of this work is to introduce the simple-to-use R package Mock Data Generation (modgo), which may be used for simulating data from existing study data with continuous, ordinal categorical, and dichotomous variables.

Methods The core idea is to combine the rank inverse normal transformation with the calculation of a correlation matrix for all variables. Data can then be simulated from a multivariate normal distribution and transformed back to the original scale of the variables. Unique features of modgo are that it allows users to change the correlation between variables, to perform perturbation analysis, to handle multicenter data, and to change inclusion/exclusion criteria by selecting specific values of one or a set of variables. Simulation studies on real data demonstrate the validity and flexibility of modgo.
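The core steps named above (rank inverse normal transformation, correlation matrix, multivariate normal simulation, back-transformation) can be illustrated with a minimal sketch. It is written in plain Python with the standard library only and is not the modgo API: all function names are hypothetical, the toy "original" data are invented for illustration, and the Pearson correlation of the normal scores is an assumed simplification for every variable type, which the package may handle differently.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for quantiles and CDF


def rank_inverse_normal(values):
    """Map values to normal scores via ranks (ties share the average rank)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    # The 0.5 offset keeps every probability strictly inside (0, 1).
    return [STD_NORMAL.inv_cdf((r - 0.5) / n) for r in ranks]


def correlation_matrix(columns):
    """Pearson correlation matrix (assumed simplification for all types)."""
    def corr(x, y):
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)
    return [[1.0 if i == j else corr(ci, cj)
             for j, cj in enumerate(columns)]
            for i, ci in enumerate(columns)]


def cholesky(matrix):
    """Lower-triangular factor; requires a positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def simulate(columns, n_sim, rng):
    """Simulate n_sim rows that mimic the given columns."""
    scores = [rank_inverse_normal(c) for c in columns]
    lower = cholesky(correlation_matrix(scores))
    sorted_cols = [sorted(c) for c in columns]
    n = len(columns[0])
    rows = []
    for _ in range(n_sim):
        noise = [rng.gauss(0.0, 1.0) for _ in columns]
        # Correlated standard normal draws: z = L * noise.
        z = [sum(lower[i][k] * noise[k] for k in range(i + 1))
             for i in range(len(columns))]
        # Back-transform through the empirical quantiles of the original data.
        rows.append([col[min(n - 1, int(STD_NORMAL.cdf(zi) * n))]
                     for zi, col in zip(z, sorted_cols)])
    return rows


# Invented toy data: one continuous and one dichotomous variable.
rng = random.Random(1)
age = [rng.gauss(55, 10) for _ in range(300)]
disease = [1 if a + rng.gauss(0, 10) > 60 else 0 for a in age]

simulated = simulate([age, disease], n_sim=500, rng=rng)
sim_age = [row[0] for row in simulated]
sim_disease = [row[1] for row in simulated]
```

Because the back-transformation uses the empirical quantiles of the original columns, every simulated value is one that occurs in the original data, so dichotomous and ordinal variables keep their categories. Changing the correlation between variables, as described above, would in this sketch amount to editing the correlation matrix before the Cholesky step.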

Results modgo mimicked the structure of the original study data. Results of modgo were similar to those from two other existing packages in standard simulation scenarios. The flexibility of modgo was demonstrated in several expansions.

Conclusion The R package modgo is useful when existing study data may not be shared. Its perturbation expansion permits the simulation of truly anonymized subjects. The expansion to multicenter studies can be used for validating prediction models. Additional expansions can support the unraveling of associations even in large study data and can be useful in power calculations.

Keywords

data privacy - perturbation analysis - statistical disclosure control - synthetic data - validation study

Data Availability Statement

All relevant data are within the manuscript and its Supporting Information files.

Code Availability (Software Application or Custom Code)

All code including the R package is available as [Supplementary Material] (available in the online version).

Authors' Contribution

G.K. was involved in programming and in writing, editing, and review of the original draft. F.O. was involved in methodology, programming, and writing, editing, and review of the original draft. A.Z. was involved in methodology, supervision, and writing, editing, and review of the original draft.

Authors are listed alphabetically.

Publication History

Received: 07 July 2022

Accepted: 15 February 2023

Accepted Manuscript online: 07 March 2023

Article published online: 11 April 2023

© 2023. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon. (https://creativecommons.org/licenses/by-nc-nd/4.0/)

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

