Privacy-Preserving Synthetic Health Data for Research and Education

A novel pipeline for generating synthetic data securely

The inability to share private health data can severely stifle research and innovation in health informatics. Studies based on unpublished electronic medical record (EMR) data cannot be reproduced, so future researchers cannot build on them or compare new work against them. This contributes to the reproducibility crisis in biomedical research. Making open data available for research can spur innovation. The public Medical Information Mart for Intensive Care datasets, MIMIC-II and MIMIC-III, are widely used, with over 2000 citations reported in Google Scholar in March 2020. But because MIMIC-II and MIMIC-III focus on Intensive Care Unit patients in Boston hospitals, the resulting research may be biased and have limited generalizability. The cost and time required for de-identification, along with concerns about re-identification risk, make it only a partial solution to this problem.