Published October 23, 2024 | Version v1
Dataset Open

Synthetic datasets of the UK Biobank cohort

  • 1. London School of Hygiene & Tropical Medicine
  • 2. Nagasaki University

Description

This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.

The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.

The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:

  • Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [freely available, with code provided in a GitHub repo]
  • Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [freely available, with code provided in a GitHub repo]

Note: while the synthetic datasets resemble the real ones in several respects, users should be aware that these data are fake and must not be used to test or draw inferences about specific research hypotheses. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.

The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).

Content

The synthetic datasets, each stored in both csv and RDS formats, are the following:

  • synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
  • synthbdbasevar: baseline variables, mostly collected at recruitment.
  • synthpmdata: annual average exposure to PM2.5 for each participant, reconstructed using their residential history.
  • synthoutdeath: records of deaths that occurred during follow-up, with date and ICD-10 code.

In addition, this repository provides the following files:

  • codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
  • asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
  • Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].
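
As an illustration of how the csv versions might be read and linked, the sketch below uses only the Python standard library. The column names (`id`, `dob`, `dod`, `icd10`) and the in-memory sample rows are hypothetical placeholders, not the real variables; the actual names are documented in the codebook.

```python
import csv
import io

# Hypothetical stand-ins for the cohort-info and death-record csv files.
# Real column names are documented in the codebook and will differ.
cohort_csv = io.StringIO("id,dob\n1,1950-03-01\n2,1947-11-15\n3,1962-06-30\n")
death_csv = io.StringIO("id,dod,icd10\n2,2015-08-09,I21\n")


def read_rows(handle):
    """Read a csv file-like object into a list of dictionaries."""
    return list(csv.DictReader(handle))


def link_deaths(cohort_rows, death_rows, key="id"):
    """Left-join death records onto cohort rows by a shared identifier."""
    deaths_by_id = {row[key]: row for row in death_rows}
    linked = []
    for row in cohort_rows:
        death = deaths_by_id.get(row[key])
        merged = dict(row)
        merged["died"] = death is not None
        merged["icd10"] = death["icd10"] if death else None
        linked.append(merged)
    return linked


linked = link_deaths(read_rows(cohort_csv), read_rows(death_csv))
for row in linked:
    print(row["id"], row["died"], row["icd10"])
```

With the downloaded files, the `io.StringIO` objects would be replaced by file handles opened with `open(path, newline="")`, and `key` set to whatever identifier column the codebook specifies.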

Generation of the synthetic data

The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode). 

The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with a row for each subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
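
The idea behind the second part can be illustrated with a minimal sketch in Python. This is not the authors' code (the actual R scripts are in the GitHub repo), and it rests on simplifying assumptions: a constant, hypothetical yearly baseline hazard, a single hypothetical log hazard ratio for PM2.5, and a yearly death probability computed as 1 - exp(-hazard). All names and numbers are made up for illustration.

```python
import math
import random


def simulate_death_year(pm25_by_year, baseline_hazard, log_hr_per_unit, rng):
    """Return the index of the follow-up year of a simulated death, or None.

    pm25_by_year: annual PM2.5 exposures for one subject (time-varying).
    baseline_hazard: assumed constant yearly baseline hazard (hypothetical).
    log_hr_per_unit: assumed log hazard ratio per unit of PM2.5 (hypothetical).
    """
    for year, pm25 in enumerate(pm25_by_year):
        # Proportional hazards: the exposure scales the baseline hazard.
        hazard = baseline_hazard * math.exp(log_hr_per_unit * pm25)
        # Probability of dying within the year under a constant hazard.
        p_death = 1.0 - math.exp(-hazard)
        if rng.random() < p_death:
            return year
    return None  # no death simulated: subject is censored at end of follow-up


rng = random.Random(42)  # fixed seed so the sketch is reproducible
exposures = [10.2, 9.8, 9.5, 9.1, 8.9]  # made-up annual PM2.5 values
deaths = [simulate_death_year(exposures, 0.01, 0.02, rng) for _ in range(10000)]
n_deaths = sum(d is not None for d in deaths)
print(f"{n_deaths} simulated deaths out of {len(deaths)} subjects")
```

In a fuller version, the hazard would combine all predictors from the fitted model and a baseline hazard that varies over time, as described in the related article.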

This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are intended only for illustrative purposes, and they must not be used to test other research hypotheses.

Files

Files (335.1 MB)

  • md5:37861b35515e6e9d56abc57425747ab5 (1.6 kB)
  • md5:530bde0caede4a1bf4c20fb79d65dfc5 (125.0 kB)
  • md5:5a71c675203ad21c5abf13b946c79a14 (146.0 kB)
  • md5:691750ac43db977207c7c7ae9dd465c5 (104.3 MB)
  • md5:65fc80c4386193f611868add0f87cc7b (15.0 MB)
  • md5:a5cf12292a69db97c4ef8877cf798065 (22.0 MB)
  • md5:5949177c66dd30aa7bc1ed99f78c0fda (4.1 MB)
  • md5:d7acbeb30252e98c934c858216a5f4f1 (940.6 kB)
  • md5:5dd5d6436e6254b2a76886d1733c2415 (263.9 kB)
  • md5:152f0b504b17b1d82f319c2282fde54b (173.9 MB)
  • md5:56fe974a6737a908f03e65ac1a5f8383 (14.3 MB)
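
The md5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the file path passed to them is whatever local path the download was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use low for the larger files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file's digest with a listed value, with or without 'md5:'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, listed_value)` returns `True` when the file at `local_path` matches the checksum string copied from that file's entry in the list.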

Additional details

Related works

Is supplement to
Journal article: 10.1097/EDE.0000000000001796 (DOI)

Funding

UK Research and Innovation
Investigating health risks of environmental stressors in the UK Biobank cohort (Grant ID: MR/Y003330/1)

Software

Repository URL: https://github.com/gasparrini/UKB-pm25deathlong
Programming language: R