New Developments in Synthetic Data Generation

Brett Beaulieu-Jones; Jingchen (Monika) Hu; Aaron R. Williams

doi:10.5281/zenodo.5933201

Published January 28, 2022 | Version v1

Presentation Open

1. Harvard Medical School
2. Vassar College
3. Urban Institute

These talks were presented for the Privacy Day Webinar 2022 sponsored by the American Statistical Association's Committee on Privacy and Confidentiality.

Link to recording.

Talk 1: "The potential of privacy-preserving generative deep neural networks to support clinical data sharing"

Brett Beaulieu-Jones, Harvard Medical School

Abstract: Data sharing accelerates scientific progress, but sharing individual-level data while preserving patient privacy presents a barrier. Deep generative adversarial networks have the potential to produce synthetic data while maintaining privacy. In some cases, the synthetic data has been shown to maintain statistical properties of the source data and to be indistinguishable from it to human experts. This raises two important questions: (1) How can we do this? (2) What is privacy-preserving synthetic data good for?

Brett Beaulieu-Jones is an Instructor of Biomedical Informatics at Harvard Medical School. He obtained his PhD from the Perelman School of Medicine at the University of Pennsylvania under the supervision of Dr. Jason Moore and Dr. Casey Greene. Beaulieu-Jones’ doctoral research focused on using machine learning-based methods to more precisely define phenotypes from large-scale biomedical data repositories, e.g., those contained in clinical records. He joined Dr. Isaac Kohane’s lab to do his postdoc, where he has been focused on using machine learning to better understand the heterogeneity of neurological diseases and conditions, specifically Parkinson’s disease and epilepsy. He is a former general chair of the Machine Learning for Health Workshop at NeurIPS, serves on its advisory board, and is a founding board member of the Association for Health Learning and Inference (parent organization of ML4H and CHIL).

Talk 2: "Incorporating disclosure risk in designing data synthesis models"

Jingchen (Monika) Hu, Vassar College

Abstract: The generation and release of synthetic data can facilitate microdata dissemination by statistical agencies. Agencies often need to strike a balance between the utility and the disclosure risk of the synthetic data. We propose a novel approach that incorporates the disclosure risk of each record into the design of any Bayesian synthesis model. In this way, records with higher risk receive a higher level of protection in the resulting synthetic data. We illustrate our methods with an application to the Consumer Expenditure Surveys of the U.S. Bureau of Labor Statistics.

Jingchen (Monika) Hu is an assistant professor of statistics at Vassar College. Her research focuses on statistical data privacy methods, mainly synthetic data and differential privacy. She teaches a senior seminar on statistical data privacy at Vassar and engages undergraduate students in learning cutting-edge methods and conducting applied research in this area.

Talk 3: "Fully synthetic microdata for public policy analysis"

Aaron R. Williams, Urban Institute

Abstract: Government agencies possess data that could be invaluable for evaluating public policy but often do not publicly release the data due to disclosure concerns. For instance, the IRS has rich administrative data about tax filers and about Americans with incomes below the income tax filing threshold, but access is restricted to a select few with IRS clearance. This talk gives an overview of the generation of the fully synthetic 2012 IRS Statistics of Income (SOI) Division supplemental public use file (PUF), ongoing work on generating the fully synthetic 2012 IRS SOI PUF, and a formally private validation server for analysis with tax data. We use sequential Classification and Regression Trees (CART) and kernel density smoothing to create useful microlevel data with disclosure protection. We test and evaluate the trade-offs between data utility and disclosure risk of different parameterizations using a variety of validation metrics. The resulting synthetic data sets have high utility, particularly for summary statistics and microsimulation, and low disclosure risk.
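To make the idea of sequential CART synthesis with kernel smoothing concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (fit_tree, find_leaf, smooth_draw, synthesize), the two toy variables, the tree settings, and the normal-reference bandwidth rule are all assumptions chosen for illustration. The first variable is synthesized by drawing observed values and perturbing them with Gaussian noise; the second is synthesized by fitting a small regression tree on the confidential data, routing each synthetic first value to a leaf, drawing a donor from that leaf, and smoothing it the same way.

```python
import random
import statistics


def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(xs, ys, min_leaf=5, depth=3):
    """Grow a small regression tree on one predictor; leaves keep observed y values."""
    if depth == 0 or len(xs) < 2 * min_leaf:
        return {"leaf": list(ys)}
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    sx = [xs[i] for i in order]
    sy = [ys[i] for i in order]
    best = None
    for k in range(min_leaf, len(sx) - min_leaf + 1):
        if sx[k] == sx[k - 1]:
            continue  # cannot split between identical predictor values
        sse = sum_sq(sy[:k]) + sum_sq(sy[k:])
        if best is None or sse < best[0]:
            best = (sse, k)
    if best is None:
        return {"leaf": list(ys)}
    k = best[1]
    return {
        "threshold": (sx[k - 1] + sx[k]) / 2,
        "left": fit_tree(sx[:k], sy[:k], min_leaf, depth - 1),
        "right": fit_tree(sx[k:], sy[k:], min_leaf, depth - 1),
    }


def find_leaf(tree, x):
    """Route a predictor value down the tree and return the leaf's y values."""
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["leaf"]


def smooth_draw(values, rng):
    """Draw a donor value and add Gaussian kernel noise (assumed bandwidth rule)."""
    donor = rng.choice(values)
    if len(values) < 2:
        return donor
    bandwidth = 1.06 * statistics.stdev(values) * len(values) ** (-1 / 5)
    return rng.gauss(donor, bandwidth)


def synthesize(first, second, n_synth, seed=0):
    """Synthesize (first, second) pairs sequentially: first alone, then second | first."""
    rng = random.Random(seed)
    tree = fit_tree(first, second)
    synthetic = []
    for _ in range(n_synth):
        s_first = smooth_draw(first, rng)
        s_second = smooth_draw(find_leaf(tree, s_first), rng)
        synthetic.append((s_first, s_second))
    return synthetic


if __name__ == "__main__":
    toy_rng = random.Random(42)
    income = [toy_rng.uniform(10_000, 90_000) for _ in range(200)]  # made-up data
    tax = [0.2 * v + toy_rng.gauss(0, 1_000) for v in income]       # made-up data
    for row in synthesize(income, tax, n_synth=3):
        print(row)
```

A production synthesizer would handle many variables, categorical outcomes, and tuning of tree size and bandwidth, and this sketch carries no formal privacy guarantee; it only shows the order of operations.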

Aaron R. Williams is a senior data scientist in the Income and Benefits Policy Center at the Urban Institute, where he works on retirement policy, microsimulation models, data privacy, and data imputation methods. He has worked on Urban’s Dynamic Simulation of Income (DYNASIM) microsimulation model, the Social Security Administration’s Modeling Income in the Near Term (MINT) microsimulation model, and the Tax Policy Center’s synthesis of individual tax records. Williams is an adjunct professor in the McCourt School of Public Policy at Georgetown University.

Files (10.5 MB)

Name	MD5	Size
Privacy_Day_2022_Aaron_Williams.pdf	382b45df6c2bff3eefec2f78476096c1	2.1 MB
Privacy_Day_2022_Brett_Beaulieu_Jones.pdf	125907cfc15f54ec2658925ee53fb064	7.9 MB
Privacy_Day_2022_Monika_Hu.pdf	3c2cfac9f3fe6c51ffaf2504dff7f211	460.3 kB

