Software Open Access

Processing and similarity scoring WHO ICTRP data

van Valkenhoef, Gert

Source code for "Previously Unidentified Duplicate Registrations of Clinical Trials: an Exploratory Analysis of Registry Data Worldwide" (under review).

This code was used to process the WHO International Clinical Trials Registry Platform (ICTRP) dataset retrieved in April 2015 (see related). It imports the XML data into a SQL database and applies a number of standardizations. It also groups records by the primary registry IDs they reference and computes text-based similarity scores on registration fields.
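The description does not say how the grouping or scoring is implemented, so the following is only a minimal sketch of the two ideas in Python, not the archived code itself. It assumes each record carries a set of referenced registry IDs, links records that share any ID using a union-find structure, and scores two text fields with the standard library's `difflib.SequenceMatcher`. The function names, the record layout, and the example IDs are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def group_by_shared_ids(records):
    """Group record keys that are linked through any shared registry ID.

    `records` maps a record key to the set of registry IDs it references
    (assumed layout, including the record's own ID).
    """
    parent = {key: key for key in records}

    def find(key):
        # Follow parent links to the root, shortening the path as we go.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    first_seen = {}  # registry ID -> first record key that referenced it
    for key, ids in records.items():
        for registry_id in ids:
            if registry_id in first_seen:
                # Merge this record's group with the earlier record's group.
                parent[find(key)] = find(first_seen[registry_id])
            else:
                first_seen[registry_id] = key

    groups = defaultdict(set)
    for key in records:
        groups[find(key)].add(key)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def text_similarity(a, b):
    """Return a ratio in [0, 1], ignoring case and repeated whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


# Made-up identifiers, for illustration only.
example_records = {
    "REG-A-001": {"REG-A-001", "REG-B-042"},
    "REG-B-042": {"REG-B-042"},
    "REG-C-007": {"REG-C-007"},
}
example_groups = group_by_shared_ids(example_records)
```

In this sketch, `REG-A-001` and `REG-B-042` end up in the same group because the first record references the second, while `REG-C-007` stays on its own. Pairs of records that fall in different groups but score highly on `text_similarity` for fields such as the title would be the candidates for manual review; which fields and thresholds the actual analysis used is documented in the README, not here.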

The README file included with the code provides detailed instructions on dependencies and running the code.

Files (47.3 MB)

