Synthetic datasets of the UK Biobank cohort
Description
This repository stores synthetic datasets derived from the database of the UK Biobank (UKB) cohort.
The datasets were generated for illustrative purposes, in particular for reproducing specific analyses on the health risks associated with long-term exposure to air pollution using the UKB cohort. The code used to create the synthetic datasets is available and documented in a related GitHub repo, with details provided in the section below. These datasets can be freely used for code testing and for illustrating other examples of analyses on the UKB cohort.
The synthetic data have been used so far in two analyses described in related peer-reviewed publications, which also provide information about the original data sources:
- Vanoli J, et al. Long-term associations between time-varying exposure to ambient PM2.5 and mortality: an analysis of the UK Biobank. Epidemiology. 2025;36(1):1-10. DOI: 10.1097/EDE.0000000000001796 [open access, with code provided in a related GitHub repo]
- Vanoli J, et al. Confounding issues in air pollution epidemiology: an empirical assessment with the UK Biobank cohort. International Journal of Epidemiology. 2025;54(5):dyaf163. DOI: 10.1093/ije/dyaf163 [open access, with code provided in a related GitHub repo]
Note: while the synthetic datasets resemble the real ones in several respects, users should be aware that these data are fake and must not be used to test specific research hypotheses or to draw inferences about them. Even more importantly, these data cannot be considered a reliable description of the original UKB data, and they must not be presented as such.
The work was supported by the Medical Research Council-UK (Grant ID: MR/Y003330/1).
Content
The synthetic datasets, each stored in both csv and RDS formats, are the following:
- synthbdcohortinfo: basic cohort information regarding the follow-up period and birth/death dates for 502,360 participants.
- synthbdbasevar: baseline variables, mostly collected at recruitment.
- synthpmdata: annual average exposure to PM2.5 for each participant reconstructed using their residential history.
- synthoutdeath: death records that occurred during the follow-up with date and ICD-10 code.
In addition, this repository provides the following files:
- codebook: a pdf file with a codebook for the variables of the various datasets, including references to the fields of the original UKB database.
- asscentre: a csv file with information on the assessment centres used for recruitment of the UKB participants, including code, names, and location (as northing/easting coordinates of the British National Grid).
- Countries_December_2022_GB_BUC: a zip file including the shapefile defining the boundaries of the countries in Great Britain (England, Wales, and Scotland), used for mapping purposes [source].
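Since each dataset is stored in both formats, a small helper can load whichever version is available locally. The following is a minimal sketch in R: the function name `read_synth`, the folder layout, and the exact file extensions are assumptions, not part of the repository.

```r
# Hypothetical helper: read one synthetic dataset from a local folder,
# preferring the RDS version and falling back to the csv version.
read_synth <- function(name, dir = ".") {
  # Match "<name>.rds" or "<name>.csv", ignoring the case of the extension
  files <- list.files(dir, pattern = paste0("^", name, "\\.(rds|csv)$"),
                      ignore.case = TRUE, full.names = TRUE)
  if (length(files) == 0) {
    stop("No csv or RDS file found for dataset '", name, "' in ", dir)
  }
  is_rds <- grepl("\\.rds$", files, ignore.case = TRUE)
  if (any(is_rds)) {
    readRDS(files[is_rds][1])
  } else {
    read.csv(files[1], stringsAsFactors = FALSE)
  }
}

# Example call, assuming the files were downloaded to a folder named "data":
# cohort <- read_synth("synthbdcohortinfo", dir = "data")
```

The RDS version is preferred here because it preserves R column types (such as dates and factors), whereas the csv version has to be re-parsed.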
Generation of the synthetic data
The datasets resemble the real data used in the analysis, and they were generated using the R package synthpop (www.synthpop.org.uk). The generation process involves two steps, namely the synthesis of the main data (cohort info, baseline variables, annual PM2.5 exposure) and then the sampling of death events. The R scripts for performing the data synthesis are provided in the GitHub repo (subfolder Rcode/synthcode).
The first part merges all the data, including the annual PM2.5 levels, into a single wide-format dataset (with one row per subject), generates a synthetic version, adds fake IDs, and then extracts (and reshapes) the individual datasets. In the second part, a Cox proportional hazards model is fitted on the original data to estimate risks associated with various predictors (including the main exposure represented by PM2.5), and then these relationships are used to simulate death events in each year. Details on the modelling aspects are provided in the article.
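The last operations of the first part (adding fake IDs, then extracting and reshaping the individual datasets) can be illustrated with base R. This is only a sketch on a toy table: the column names (`sex`, `pm25_2006`, and so on) and the ID scheme are assumptions, and the synthesis with synthpop is not reproduced here. The actual scripts are those in the GitHub repo.

```r
set.seed(1)

# Toy stand-in for the synthesised wide dataset: one row per subject,
# one column per year of PM2.5 exposure (hypothetical column names)
wide <- data.frame(
  sex = c("F", "M", "F"),
  pm25_2006 = c(10.2, 9.8, 11.5),
  pm25_2007 = c(10.0, 9.5, 11.1),
  pm25_2008 = c(9.7, 9.4, 10.8)
)

# Add fake IDs that carry no information about any original identifier
wide$id <- sample(seq_len(nrow(wide)))

# Extract the baseline variables as a separate dataset
basevar <- wide[, c("id", "sex")]

# Reshape the exposure columns to long format (one row per subject-year)
pmcols <- grep("^pm25_", names(wide), value = TRUE)
pmlong <- reshape(wide[, c("id", pmcols)], direction = "long",
                  varying = pmcols, v.names = "pm25",
                  timevar = "year",
                  times = as.integer(sub("^pm25_", "", pmcols)),
                  idvar = "id")
pmlong <- pmlong[order(pmlong$id, pmlong$year), ]
rownames(pmlong) <- NULL
```

Keeping the data in wide format during synthesis means each subject is treated as a single record, so the correlation between exposure levels in different years can be retained before the table is split again.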
This process guarantees that the synthetic data do not hold specific information about the original records, thus preserving confidentiality. At the same time, the multivariate distribution and correlation across variables, as well as the mortality risks, resemble those of the original data, so the results of descriptive and inferential analyses are similar to those in the original assessments. However, as noted above, the data are intended only for illustrative purposes, and they must not be used to test other research hypotheses.
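The idea behind the second step, simulating death events year by year from estimated risk relationships, can be sketched as follows. This is a simplified discrete-time illustration and not the procedure used by the authors: the baseline hazard `h0`, the coefficient `beta_pm`, and the simulated exposures are all made-up values, whereas the actual process relies on a Cox model fitted on the original data.

```r
set.seed(42)
n <- 1000
years <- 2006:2010

# Hypothetical yearly PM2.5 exposure for each subject (rows) and year (columns)
pm <- matrix(rnorm(n * length(years), mean = 10, sd = 2), nrow = n,
             dimnames = list(NULL, as.character(years)))

beta_pm <- log(1.05)  # assumed log hazard ratio per 1 ug/m3 (illustrative only)
h0 <- 0.01            # assumed constant baseline hazard per year

death_year <- rep(NA_integer_, n)
alive <- rep(TRUE, n)
for (j in seq_along(years)) {
  # Yearly hazard under a proportional hazards form, centred at 10 ug/m3
  haz <- h0 * exp(beta_pm * (pm[, j] - 10))
  # Probability of dying within the year given survival to its start
  p <- 1 - exp(-haz)
  # Only subjects still alive can die in this year
  dies <- alive & (runif(n) < p)
  death_year[dies] <- years[j]
  alive <- alive & !dies
}
```

Subjects with `NA` in `death_year` survive to the end of the simulated follow-up. Because deaths are drawn from the modelled risks rather than copied from records, no simulated event corresponds to an actual one.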
Files
Total size: 335.1 MB
| MD5 checksum | Size |
|---|---|
| md5:37861b35515e6e9d56abc57425747ab5 | 1.6 kB |
| md5:530bde0caede4a1bf4c20fb79d65dfc5 | 125.0 kB |
| md5:5a71c675203ad21c5abf13b946c79a14 | 146.0 kB |
| md5:691750ac43db977207c7c7ae9dd465c5 | 104.3 MB |
| md5:65fc80c4386193f611868add0f87cc7b | 15.0 MB |
| md5:a5cf12292a69db97c4ef8877cf798065 | 22.0 MB |
| md5:5949177c66dd30aa7bc1ed99f78c0fda | 4.1 MB |
| md5:d7acbeb30252e98c934c858216a5f4f1 | 940.6 kB |
| md5:5dd5d6436e6254b2a76886d1733c2415 | 263.9 kB |
| md5:152f0b504b17b1d82f319c2282fde54b | 173.9 MB |
| md5:56fe974a6737a908f03e65ac1a5f8383 | 14.3 MB |
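The MD5 checksums listed for the files can be used to verify a download. A minimal sketch in R, using `tools::md5sum` from the standard distribution, is given below; the helper name `check_md5` and the example path are assumptions.

```r
# Hypothetical helper: compare the MD5 hash of a local file with an expected
# value, given with or without the "md5:" prefix
check_md5 <- function(path, expected) {
  expected <- tolower(sub("^md5:", "", expected))
  # md5sum() returns a named vector of lower-case hex hashes (NA if unreadable)
  actual <- unname(tools::md5sum(path))
  identical(actual, expected)
}

# Example call, with the checksum copied from the listing for that file:
# check_md5("data/synthbdcohortinfo.csv", "md5:<checksum of the file>")
```

A result of `FALSE` indicates that the file is missing, incomplete, or altered, and that it should be downloaded again.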
Additional details
Related works
- Is supplement to: Journal article, DOI 10.1097/EDE.0000000000001796
Software
- Repository URL: https://github.com/gasparrini/UKB-pm25deathlong
- Programming language: R