CC BY-NC-ND 4.0 · Methods Inf Med 2022; 61(S 01): e1-e11
DOI: 10.1055/s-0041-1740564
Original Article

A Privacy-Preserving Distributed Analytics Platform for Health Care Data

Sascha Welten
1   Chair of Computer Science 5, RWTH Aachen University, Aachen, Germany
Yongli Mou
1   Chair of Computer Science 5, RWTH Aachen University, Aachen, Germany
Laurenz Neumann
1   Chair of Computer Science 5, RWTH Aachen University, Aachen, Germany
Mehrshad Jaberansary
1   Chair of Computer Science 5, RWTH Aachen University, Aachen, Germany
Yeliz Yediel Ucer
2   Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Sankt Augustin, Germany
Toralf Kirsten
3   Department of Medical Data Science, University Medical Center Leipzig, Leipzig, Germany
Stefan Decker
1   Chair of Computer Science 5, RWTH Aachen University, Aachen, Germany
2   Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Sankt Augustin, Germany
Oya Beyan
2   Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Sankt Augustin, Germany
4   Institute for Medical Informatics, Faculty of Medicine, University Hospital Cologne, University of Cologne, Cologne, Germany
Funding This work was supported by the German Ministry for Research and Education (BMBF) as part of the SMITH consortium (SW, LN, MJ, YUY, TK, SD, and OB, grant no. 01ZZ1803K). This work was conducted jointly by RWTH Aachen University and Fraunhofer FIT as part of the PHT and Go FAIR implementation network, which aims to develop a proof-of-concept information system to address current data reusability challenges occurring in the context of so-called data integration centres that are being established as part of ongoing German Medical Informatics BMBF projects.


Background In recent years, data-driven medicine has gained increasing importance in terms of diagnosis, treatment, and research due to the exponential growth of health care data. However, data protection regulations prohibit data centralisation for analysis purposes because of potential privacy risks like the accidental disclosure of data to third parties. Therefore, alternative data usage policies, which comply with present privacy guidelines, are of particular interest.

Objective We aim to enable analyses on sensitive patient data while simultaneously complying with local data protection regulations. To this end, we use the Personal Health Train (PHT), a paradigm that utilises distributed analytics (DA) methods. The main principle of the PHT is that the analytical task is brought to the data provider, while the data instances remain in their original location.

Methods In this work, we present our implementation of the PHT paradigm, which preserves the sovereignty and autonomy of the data providers and operates with a limited number of communication channels. We further conduct a DA use case on data stored in three different and distributed data providers.

Results We show that our infrastructure enables the training of data models based on distributed data sources.

Conclusion Our work presents the capabilities of DA infrastructures in the health care sector, which lower the regulatory obstacles of sharing patient data. We further demonstrate their ability to fuel medical science by making distributed data sets available to scientists or health care practitioners.
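
The principle stated in the Objective, that the analytical task travels to the data while the data instances stay where they are, can be illustrated with a toy sketch. The code below is not the implementation described in this article: the names `Station`, `Train`, and `run_train`, the three example stations, and the choice of a simple mean as the analysis are all assumptions made for illustration, and the sketch makes no privacy guarantee on its own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Train:
    """Hypothetical analytical task; it carries aggregate results only."""
    total: float = 0.0
    count: int = 0
    route_log: List[str] = field(default_factory=list)

    @property
    def mean(self) -> float:
        # Mean over all records seen so far, NaN if no station was visited.
        return self.total / self.count if self.count else float("nan")


@dataclass
class Station:
    """Hypothetical data provider; its records are only read locally."""
    name: str
    records: List[float] = field(repr=False)

    def execute(self, train: Train) -> None:
        # The analysis runs at the station. Only the updated aggregates
        # (sum, count, station name) are written back to the train.
        train.total += sum(self.records)
        train.count += len(self.records)
        train.route_log.append(self.name)


def run_train(train: Train, route: List[Station]) -> Train:
    # Send the same task to each station in turn (assumed sequential route).
    for station in route:
        station.execute(train)
    return train


# Three illustrative stations with made-up values.
stations = [
    Station("station_a", [1.0, 2.0, 3.0]),
    Station("station_b", [4.0, 5.0]),
    Station("station_c", [6.0]),
]
result = run_train(Train(), stations)
print(result.route_log, result.mean)
```

In this sketch the raw `records` are never copied into the `Train` object; the task accumulates only a running sum and count. Whether such aggregates are sufficiently non-disclosive depends on the data and the analysis, so this should be read as an illustration of the data flow rather than of a privacy mechanism.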


Publication History

Received: 30 March 2021

Accepted: 22 September 2021

Article published online: 17 January 2022

© 2022. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution-NonDerivative-NonCommercial License, permitting copying and reproduction so long as the original work is given appropriate credit. Contents may not be used for commercial purposes, or adapted, remixed, transformed or built upon.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany
