New Developments in Synthetic Data Generation
Authors/Creators
- 1. Harvard Medical School
- 2. Vassar College
- 3. Urban Institute
Description
These talks were presented for the Privacy Day Webinar 2022 sponsored by the American Statistical Association's Committee on Privacy and Confidentiality.
Talk 1: "The potential of privacy-preserving generative deep neural networks to support clinical data sharing"
Brett Beaulieu-Jones, Harvard Medical School
Abstract: Data sharing accelerates scientific progress, but sharing individual-level data while preserving patient privacy presents a barrier. Deep generative adversarial networks have the potential to produce synthetic data while maintaining privacy. In some cases, the synthetic data has been shown to maintain statistical properties of the source data and to be indistinguishable from it to human experts. This raises two important questions: 1) How can we do this? And 2) What is the privacy-preserving synthetic data good for?
Brett Beaulieu-Jones is an Instructor of Biomedical Informatics at Harvard Medical School. He obtained his PhD from the Perelman School of Medicine at the University of Pennsylvania under the supervision of Dr. Jason Moore and Dr. Casey Greene. Beaulieu-Jones’ doctoral research focused on using machine learning-based methods to more precisely define phenotypes from large-scale biomedical data repositories, e.g., those contained in clinical records. He joined Dr. Isaac Kohane’s lab to do his postdoc, where he has been focused on using machine learning to better understand the heterogeneity of neurological diseases and conditions, specifically Parkinson’s disease and epilepsy. He is a former general chair of, and serves on the advisory board for, the Machine Learning for Health Workshop at NeurIPS, and is a founding board member of the Association for Health Learning and Inference (parent organization of ML4H and CHIL).
Talk 2: "Incorporating disclosure risk in designing data synthesis models"
Jingchen (Monika) Hu, Vassar College
The generation and release of synthetic data can facilitate microdata dissemination by statistical agencies. Agencies often need to strike a suitable balance between the utility and the disclosure risk of the synthetic data. We propose a novel approach that can incorporate the disclosure risk of each record in designing any Bayesian synthesis model. In this way, records with a higher risk can receive a higher level of protection in the resulting synthetic data. We illustrate our methods with an application to the Consumer Expenditure Surveys of the U.S. Bureau of Labor Statistics.
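The abstract does not say how the record-level risk enters the synthesis model. One possibility, assumed here purely for illustration, is to raise each record's likelihood contribution to a weight between 0 and 1, so that riskier records influence the fitted model less. The sketch below does this for a normal mean with known standard deviation and a flat prior. The risk measure (how few other records lie nearby), the function names, and the toy data are all hypothetical and are not taken from the talk.

```python
import math
import random


def risk_weights(values, radius):
    """Hypothetical risk measure: a record with few neighbours within
    `radius` is treated as riskier and gets a weight closer to 0."""
    n = len(values)
    weights = []
    for v in values:
        neighbours = sum(1 for u in values if abs(u - v) <= radius) - 1
        weights.append(neighbours / (n - 1))
    return weights


def weighted_posterior(values, weights, sigma):
    """Pseudo-posterior of a normal mean (known sigma, flat prior) when
    record i's likelihood is raised to the power weights[i].
    Assumes at least one weight is positive."""
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, sigma / math.sqrt(total)


def synthesize(values, weights, sigma, n, seed=0):
    """Draw n synthetic values from the weighted posterior predictive."""
    rng = random.Random(seed)
    post_mean, post_sd = weighted_posterior(values, weights, sigma)
    draws = []
    for _ in range(n):
        mu = rng.gauss(post_mean, post_sd)   # draw a plausible mean
        draws.append(rng.gauss(mu, sigma))   # then a synthetic record
    return draws


# Toy data: six similar records and one isolated, high-risk record.
values = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 50.0]
weights = risk_weights(values, radius=2.0)
post_mean, post_sd = weighted_posterior(values, weights, sigma=1.0)
synthetic = synthesize(values, weights, sigma=1.0, n=100)
```

In this toy case the isolated record receives weight 0, so the synthetic values are centred on the six similar records rather than being pulled toward the outlier. The cost is lower utility for analyses that depend on that record.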
Jingchen (Monika) Hu is an assistant professor of statistics at Vassar College. Her research focuses on statistical data privacy methods, mainly synthetic data and differential privacy. She teaches a senior seminar on statistical data privacy at Vassar and engages undergraduate students in learning cutting-edge methods and conducting applied research in this area.
Talk 3: "Fully synthetic microdata for public policy analysis"
Aaron R. Williams, Urban Institute
Government agencies possess data that could be invaluable for evaluating public policy but often do not publicly release the data due to disclosure concerns. For instance, the IRS has rich administrative data about tax filers and about Americans with incomes below the income tax filing threshold, but access is restricted to a select few with IRS clearance. This talk gives an overview of the generation of the fully synthetic 2012 IRS Statistics of Income Division supplemental public use file, ongoing work on generating the fully synthetic 2012 IRS SOI PUF, and a formally private validation server for analysis with tax data. We use sequential Classification and Regression Trees (CART) and kernel density smoothing to create useful microlevel data with disclosure protection. We test and evaluate the tradeoffs between data utility and disclosure risks of different parameterizations using a variety of validation metrics. The resulting synthetic data sets have high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
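The abstract names sequential CART and kernel density smoothing but gives no implementation details. The following is a minimal sketch of the general idea for two numeric variables, not the Urban Institute's implementation. It assumes a small hand-rolled regression tree, a Gaussian kernel with a normal-reference rule-of-thumb bandwidth, and invented toy data. All function names are hypothetical.

```python
import random
import statistics


def sse(ys):
    """Sum of squared deviations from the mean."""
    m = statistics.fmean(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_tree(rows, min_leaf=10):
    """Grow a regression tree on (x, y) pairs. Leaves keep observed y values."""
    rows = sorted(rows)
    ys = [y for _, y in rows]
    best = None
    for i in range(min_leaf, len(rows) - min_leaf + 1):
        if rows[i - 1][0] == rows[i][0]:
            continue  # cannot split between equal x values
        cost = sse(ys[:i]) + sse(ys[i:])
        if best is None or cost < best[0]:
            best = (cost, i)
    if best is None or best[0] >= sse(ys):
        return {"leaf": ys}
    i = best[1]
    return {
        "threshold": (rows[i - 1][0] + rows[i][0]) / 2,
        "left": fit_tree(rows[:i], min_leaf),
        "right": fit_tree(rows[i:], min_leaf),
    }


def find_leaf(tree, x):
    """Return the observed y values in the leaf that x falls into."""
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]


def smooth_draw(values, rng):
    """Pick a donor value, then perturb it with a Gaussian kernel so the
    synthetic value is not an exact copy of a confidential one."""
    donor = rng.choice(values)
    bandwidth = 1.06 * statistics.pstdev(values) * len(values) ** -0.2
    return rng.gauss(donor, bandwidth)


def synthesize(data, n, seed=0):
    """Sequential synthesis: first x1 from its smoothed marginal, then x2
    from the smoothed leaf of a tree fitted to x2 given x1."""
    rng = random.Random(seed)
    x1_values = [row[0] for row in data]
    tree = fit_tree(data)
    synthetic = []
    for _ in range(n):
        s1 = smooth_draw(x1_values, rng)
        s2 = smooth_draw(find_leaf(tree, s1), rng)
        synthetic.append((s1, s2))
    return synthetic


# Invented "confidential" data in which x2 rises with x1.
data_rng = random.Random(1)
confidential = [(x, 2.0 * x + data_rng.gauss(0, 1))
                for x in (data_rng.uniform(0, 10) for _ in range(200))]
synthetic = synthesize(confidential, n=200)
```

With more variables, the same pattern would repeat: each new variable is synthesized from a tree fitted on the variables already synthesized. Parameters such as the minimum leaf size and the bandwidth are the kind of settings whose utility and risk tradeoffs the abstract describes evaluating.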
Aaron R. Williams is a senior data scientist in the Income and Benefits Policy Center at the Urban Institute, where he works on retirement policy, microsimulation models, data privacy, and data imputation methods. He has worked on Urban’s Dynamic Simulation of Income (DYNASIM) microsimulation model, the Social Security Administration’s Modeling Income in the Near Term (MINT) microsimulation model, and the Tax Policy Center’s synthesis of individual tax records. Williams is an adjunct professor in the McCourt School of Public Policy at Georgetown University.