Published March 3, 2020 | Version v1
Dataset Open

Optimization of the Mainzelliste Software for Fast Privacy-preserving Record Linkage

  • 1. Database Group, University of Leipzig
  • 2. German Cancer Research Center Heidelberg, Germany

Description

Synthetically generated person-related datasets used to evaluate the linkage quality and runtime of the Mainzelliste. To generate person records, we used the established GeCo data generator, modified with small extensions such as look-up files for German names in addition to English names. Each generated dataset consists of two subsets, org and dup, which are compared with each other. The duplicate records can contain data errors (e.g., different but similar-sounding letters, OCR errors, or typos) to simulate reduced data quality, which makes matching more challenging.
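To illustrate how a dup record might differ from its org counterpart, here is a minimal Python sketch that corrupts one field of a record with either a typo or an OCR-style error. It is not the GeCo generator and does not use its API. The field names, the `OCR_CONFUSIONS` table, and all function names are assumptions made for illustration only.

```python
import random

# Hypothetical OCR confusion pairs, for illustration only
# (these are NOT taken from GeCo's look-up files).
OCR_CONFUSIONS = {"m": "rn", "l": "1", "o": "0", "s": "5"}


def introduce_typo(value, rng):
    """Swap two adjacent characters at a random position."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def introduce_ocr_error(value, rng):
    """Replace one character with a visually similar string, if possible."""
    positions = [i for i, ch in enumerate(value) if ch in OCR_CONFUSIONS]
    if not positions:
        return value
    i = rng.choice(positions)
    return value[:i] + OCR_CONFUSIONS[value[i]] + value[i + 1:]


def make_duplicate(record, rng):
    """Return a copy of `record` with one randomly chosen field corrupted."""
    dup = dict(record)
    field = rng.choice(sorted(dup))
    corrupt = rng.choice([introduce_typo, introduce_ocr_error])
    dup[field] = corrupt(dup[field], rng)
    return dup


# Assumed example record; the real datasets may use different attributes.
org = {"first_name": "thomas", "last_name": "mueller", "city": "leipzig"}
dup = make_duplicate(org, random.Random(42))
```

A seeded `random.Random` instance keeps the corruption reproducible, which is useful when linkage quality is compared across runs.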

Files

mainzelliste_datasets.zip (12.2 MB)
md5:327c7bf2e4a54f4b59da25855adf5c3b
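The MD5 checksum listed for the archive can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names and the local file path are assumptions, and the archive is assumed to have already been downloaded.

```python
import hashlib

# Checksum as listed on this record for mainzelliste_datasets.zip.
EXPECTED_MD5 = "327c7bf2e4a54f4b59da25855adf5c3b"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `verify_download("mainzelliste_datasets.zip")` should return `True` for an uncorrupted copy stored in the current directory.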