Published July 30, 2024 | Version v2
Dataset Open

Data cleaning using unstructured data

  • 1. Ghent University

Description

In this project, we work on repairing three datasets:

  • Trials design: This dataset was obtained from the European Union Drug Regulating Authorities Clinical Trials Database (EudraCT) register, and the ground truth was created from external registries. In the dataset, multiple countries, identified by the attribute country_protocol_code, conduct the same clinical trial, which is identified by eudract_number. Each clinical trial has a title that can help find informative details about the design of the trial.
  • Trials population: This dataset describes the demographic origins of participants in clinical trials conducted primarily across European countries. It includes structured attributes indicating whether the trial pertains to a specific gender, age group, or healthy volunteers. Each of these categories is labeled '1' if it is included in the trial and '0' if it is not. Note that the population categories should remain consistent across all countries conducting the same clinical trial, identified by a eudract_number. The ground truth samples in the dataset were established by aligning information about the trial populations provided by external registries, specifically the CT.gov database and the German Trials database. Additionally, the dataset comprises unstructured attributes, such as inclusion, that describe the inclusion criteria for trial participants.
  • Allergens: This dataset contains information about products and their allergens. The data was collected from the German version of 'Alnatura' (access date: 24 November 2020), from 'Open Food Facts', a free database of food products from around the world, and from the websites 'Migipedia', 'Piccantino', and 'Das Ist Drin'. There may be overlapping products across these websites. Each product in the dataset is identified by a unique code. Samples with the same code represent the same product but are extracted from different sources. An allergen is marked '2' if it is present, '1' if there are traces of it, and '0' if it is absent from a product. The dataset also includes information on the ingredients of the products. Overall, the dataset comprises categorical structured data describing the presence, trace, or absence of specific allergens, and unstructured text describing ingredients.
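
The consistency requirement on the trials population data (all countries running the same eudract_number should report the same population categories) can be checked mechanically. The following is a minimal sketch, not part of the published tooling: the function name, the healthy_volunteers column name, and the toy values are assumptions made for illustration, and only eudract_number and country_protocol_code come from the description above.

```python
from collections import defaultdict


def find_inconsistent_keys(rows, key, columns):
    """Return the sorted key values whose rows disagree on any of `columns`.

    `rows` is an iterable of dicts, e.g. as produced by csv.DictReader.
    """
    seen = defaultdict(set)
    for row in rows:
        # Collect the distinct combinations of values observed for each key.
        seen[row[key]].add(tuple(row[c] for c in columns))
    # More than one distinct combination means the group is inconsistent.
    return sorted(k for k, values in seen.items() if len(values) > 1)


# Toy rows with made-up values; "healthy_volunteers" is a hypothetical column name.
rows = [
    {"eudract_number": "2010-000001-10", "country_protocol_code": "A", "healthy_volunteers": "1"},
    {"eudract_number": "2010-000001-10", "country_protocol_code": "B", "healthy_volunteers": "0"},
    {"eudract_number": "2011-000002-20", "country_protocol_code": "A", "healthy_volunteers": "0"},
]

print(find_inconsistent_keys(rows, "eudract_number", ["healthy_volunteers"]))
# prints ['2010-000001-10']
```

Groups flagged this way are candidates for repair; the sketch only detects disagreement and does not decide which value is correct.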

N.B.: Each '.zip' file contains a set of 5 '.csv' files which are part of the aforementioned datasets:

  • "{dataset_name}_train.csv": samples used for training the ML model. (e.g. "allergens_train.csv")
  • "{dataset_name}_test.csv": samples used to test the performance of the ML model. (e.g. "allergens_test.csv")
  • "{dataset_name}_golden_standard.csv": samples that represent the ground truth of the test samples. (e.g. "allergens_golden_standard.csv")
  • "{dataset_name}_parker_train.csv": samples repaired using Parker Engine, used for training the ML model. (e.g. "allergens_parker_train.csv")
  • "{dataset_name}_parker_test.csv": samples repaired using Parker Engine, used to test the performance of the ML model. (e.g. "allergens_parker_test.csv")
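
As a rough illustration of the naming scheme, the sketch below builds the five expected file names for a dataset and reads one of them straight from a '.zip' archive. The helper names are hypothetical, and the code assumes that the '.csv' files sit at the root of the archive, are UTF-8 encoded, and are comma-delimited; none of this is confirmed by the description above. The suffixes follow the example file names in the list.

```python
import csv
import io
import zipfile

# Suffixes taken from the example file names listed above.
SUFFIXES = ("train", "test", "golden_standard", "parker_train", "parker_test")


def expected_files(dataset_name):
    """Return the five '.csv' file names expected for one dataset."""
    return [f"{dataset_name}_{suffix}.csv" for suffix in SUFFIXES]


def read_split(zip_path, member):
    """Read one '.csv' member of a '.zip' archive into a list of dicts."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            # Assumption: UTF-8 text with a header row and comma delimiter.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


print(expected_files("allergens"))
```

For example, `read_split("allergens.zip", "allergens_train.csv")` would return the training samples as dictionaries keyed by column name, provided the archive layout matches the assumptions above.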

Files

allergens.zip

Files (12.7 MB)

Checksum Size
md5:3e9286c7fd0dc8a65e5d3a923760e326 83.4 kB
md5:1707876a2247a695af8318af295115e5 3.7 MB
md5:1da17f7aee823a6702f8aa2aae51f5af 8.9 MB
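
A downloaded archive can be compared against one of the md5 checksums listed above. The sketch below is one way to do this with the Python standard library; the function names are hypothetical, and because the listing does not show which file each checksum belongs to, the caller has to supply the pairing.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex md5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed):
    """Compare a file against a checksum written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

A call such as `matches_listing("allergens.zip", "md5:...")` returns True only when the local file is byte-for-byte identical to the file the checksum was computed from.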