How to Deal with Privacy, Bias & Drift in Synthetic Primary Care Data
Description
Primary care data offers huge value in modelling disease and illness. However, this data holds extremely private information about individuals, and privacy concerns continue to limit the widespread use of such data, both by public research institutions and by the private health-tech sector.
One possible solution is synthetic data, which mimics the underlying correlational structure and distributions of real data while avoiding many of the privacy concerns. Brunel University London has been working in a long-term collaboration with the Medicines and Healthcare products Regulatory Agency (MHRA) in the UK to construct a high-fidelity synthetic data service using probabilistic models with complex underlying latent variable structures.
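The abstract mentions probabilistic models with latent variable structures but gives no implementation detail. As a minimal sketch, assuming a Gaussian mixture as a stand-in latent-variable model (the discrete mixture component plays the role of the latent variable) and hypothetical clinical measurements, generation could look like this:

```python
# Minimal sketch of latent-variable synthetic data generation, assuming a
# Gaussian mixture stands in for the richer models described in the talk.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in "real" data: two correlated clinical measurements (hypothetical).
real = rng.multivariate_normal([120.0, 80.0], [[90, 40], [40, 60]], size=5000)

# Fit the generative model on the real records.
model = GaussianMixture(n_components=3, random_state=0).fit(real)

# Sample synthetic records: only the model's parameters are needed, so the
# real rows never have to leave the secure environment.
synthetic, latent = model.sample(n_samples=5000)

# Fidelity check: compare the correlational structure of real vs. synthetic.
print(np.corrcoef(real.T)[0, 1], np.corrcoef(synthetic.T)[0, 1])
```

The same pattern applies to richer models: fit once inside the secure environment, then release only samples drawn from the fitted model.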
This work has led to multiple releases of synthetic data on a number of diseases, including COVID-19 and cardiovascular disease, which are available for research. Two major issues have arisen from our synthetic data work: bias, which persists even when working with comprehensive national data, and concept drift, where subsequent batches of data move away from current models, raising the question of what impact this may have on regulation.
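The abstract does not specify how the movement of subsequent batches away from current models is detected. One common approach, sketched here under the assumption of a per-feature two-sample Kolmogorov-Smirnov test and hypothetical batch data, is to compare each new batch against the reference data the current model was fitted on:

```python
# Minimal sketch of batch-level drift detection using a two-sample KS test.
# The regulatory setting in the talk would need more than this, but it
# illustrates flagging a batch that has moved away from the reference data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
reference = rng.normal(120.0, 10.0, size=2000)   # batch the model was fitted on
new_batch = rng.normal(124.0, 10.0, size=2000)   # later batch with a shifted mean

stat, p_value = ks_2samp(reference, new_batch)
if p_value < 0.01:
    print(f"Drift flagged (KS={stat:.3f}, p={p_value:.2e}); consider refitting.")
```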
In this talk, Allan Tucker discusses some of the key results of the collaboration: his experiences of synthetic data generation, the detection of bias and how to better represent the true underlying UK population, and how to handle concept drift when building models of healthcare data that evolves over time.
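The abstract likewise does not say how bias against the true underlying UK population is detected. A simple, commonly used starting point, sketched here with hypothetical age-band proportions and counts, is to compare a dataset's demographic marginals against reference population proportions with a goodness-of-fit test:

```python
# Minimal sketch of a marginal bias check. The census-style target
# proportions and observed counts below are hypothetical; the talk's actual
# approach to representing the UK population may differ.
import numpy as np
from scipy.stats import chisquare

# Hypothetical age-band proportions for the target population (assumed).
target = {"0-17": 0.21, "18-44": 0.35, "45-64": 0.25, "65+": 0.19}

# Observed counts in the dataset being audited (hypothetical numbers).
observed = {"0-17": 1500, "18-44": 4200, "45-64": 2600, "65+": 1700}

counts = np.array([observed[band] for band in target])
expected = counts.sum() * np.array(list(target.values()))

stat, p_value = chisquare(counts, f_exp=expected)
print(f"chi2={stat:.1f}, p={p_value:.2e}")  # small p suggests the marginals diverge
```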
Files

Name | Size
---|---
Turing_v1.pdf (md5:03a81dd99f07ec16ccb694f4605e0408) | 4.2 MB