Privacy-Preserving Synthetic Health Data for Research and Education

Workflow used to generate synthetic data securely

The inability to share private health data can severely stifle research and innovation in health informatics. Studies based on unpublished electronic medical record (EMR) data cannot be reproduced, so future researchers cannot build on them or compare new methods against them. This contributes to the reproducibility crisis in biomedical research. Making open data available can spur innovation. The public Medical Information Mart for Intensive Care datasets, MIMIC-II and MIMIC-III, are widely used, with over 2000 citations reported in Google Scholar in March 2020. But because MIMIC-II and MIMIC-III focus on Intensive Care Unit patients in Boston hospitals, the resulting research may be biased and may generalize poorly to other populations. The cost and time required, along with concerns about re-identification risk, make de-identification only a partial solution to this problem.

Recent synthetic data generation methods provide an attractive alternative for making data available for research and education without violating privacy. Deep learning approaches to synthetic data generation show particular promise. In the future, synthetic data generation methods combined with automatic machine learning methods could enable synthetic versions of data to be released when research papers are published. Results could then be reproduced, and novel methods and analyses developed, without compromising patient privacy. To accomplish this, synthetic data assets must be assessed on four criteria:

  1. Privacy: how well does the synthetic data generation method preserve anonymity;
  2. Resemblance: is the distribution of synthetic data indistinguishable from the distribution of real data;
  3. Utility: can research studies be reproduced successfully with synthetic data;
  4. Efficiency: how practical is the training and generation pipeline.
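
To make two of these criteria concrete, here is a minimal Python sketch of simple proxies for resemblance and privacy. It is an illustration under stated assumptions, not the metrics used in the publications or in the GitHub package: the function names (`ks_statistic`, `nearest_real_distance`, `copied_fraction`), the toy records, and the choice of the Kolmogorov-Smirnov statistic and nearest-neighbor distance are all hypothetical choices made for this example.

```python
import math
from bisect import bisect_right


def ks_statistic(real_values, synth_values):
    """Resemblance proxy (assumed metric): two-sample Kolmogorov-Smirnov
    statistic, the largest gap between the two empirical CDFs.
    0.0 means identical empirical distributions, 1.0 means no overlap."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synth_values)
    # The largest gap between two step functions occurs at a data point.
    return max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in real_sorted + synth_sorted
    )


def nearest_real_distance(synth_row, real_rows):
    """Privacy proxy (assumed metric): Euclidean distance from one
    synthetic record to its closest real record."""
    return min(math.dist(synth_row, real_row) for real_row in real_rows)


def copied_fraction(real_rows, synth_rows, tol=1e-9):
    """Fraction of synthetic records that (nearly) duplicate a real record.
    A high value suggests the generator memorized training data."""
    copies = sum(
        1 for row in synth_rows if nearest_real_distance(row, real_rows) <= tol
    )
    return copies / len(synth_rows)


# Toy numeric records, invented purely for illustration.
real = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]
synth = [(1.0, 2.0), (2.5, 3.5), (10.0, 10.0)]

# Resemblance, checked one feature (column) at a time.
first_column_gap = ks_statistic([r[0] for r in real], [s[0] for s in synth])

# Privacy: how many synthetic rows are exact copies of real rows?
leak_rate = copied_fraction(real, synth)

print(f"KS gap on first feature: {first_column_gap:.3f}")
print(f"Fraction of copied records: {leak_rate:.3f}")
```

In this toy example the first synthetic record duplicates a real one, so the copied fraction is nonzero. A per-feature check like this does not capture joint structure across features, and utility and efficiency would need separate measurements, such as rerunning a study's analysis on the synthetic data and timing the training and generation steps.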

In recent publications, we report our experience with a novel pipeline for generating synthetic data securely, now available as a Python package on GitHub. We demonstrate the effectiveness of HealthGAN in producing high-quality synthetic data for three research studies on MIMIC data and two research studies on comorbidities of Autism Spectrum Disorder.