Published December 3, 2024 | Version v1
Presentation Open

Metacurate-ML: Enhanced Data Curation - Automation of Disclosure Control Assessment

Description

Conceptual annotations and provenance can provide contextual information to inform a range of data processing activities. In this workstream we will use the metadata generated in the earlier workstreams (the questions and response domains from the metadata extraction phase, and the concorded variables from the conceptual comparison phase) to identify key variables: those that, although not sensitive in and of themselves, have the potential to be disclosive when used in combination. This identification will be achieved using state-of-the-art text classification methods, which we will also use to identify other kinds of variable, such as identifiers and weight variables. Rule-based classifiers will further interrogate the variable metadata to determine each variable's classification hierarchy and level, e.g., a socio-economic variable may be coded using the ONS NS-SeC classification hierarchy at the 8-class analytic level. This enhanced metadata can then be combined with the data itself to provide an enhanced curation platform, one which allows our data curators to evaluate and mitigate the disclosure risk of a dataset with relative ease. The resulting platform will be powered by metadata and microdata stored using the DDI-CDI schema, utilising features such as its variable cascade.
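To illustrate why key variables matter in combination, the following is a minimal sketch of one common way to flag potentially disclosive records: counting how often each combination of key-variable values occurs and flagging those seen fewer than k times. The risk measure, the function name `risky_records`, the threshold `k`, and the variable names and values are all assumptions made for illustration; the description above does not state which measure the platform uses.

```python
from collections import Counter


def risky_records(records, key_variables, k=3):
    """Return indices of records whose key-variable combination
    occurs fewer than k times in the dataset (hypothetical helper)."""
    # Build one tuple of key-variable values per record.
    combos = [tuple(rec[var] for var in key_variables) for rec in records]
    counts = Counter(combos)
    return [i for i, combo in enumerate(combos) if counts[combo] < k]


# Toy microdata; variable names and values are invented for illustration.
records = [
    {"age_band": "30-39", "region": "North", "nssec8": 2},
    {"age_band": "30-39", "region": "North", "nssec8": 2},
    {"age_band": "30-39", "region": "North", "nssec8": 7},
    {"age_band": "60-69", "region": "South", "nssec8": 2},
]

# Taken alone, age_band isolates only the last record.
single = risky_records(records, ["age_band"], k=2)

# In combination, the three key variables also isolate the third record.
combined = risky_records(records, ["age_band", "region", "nssec8"], k=2)
```

In this toy example a variable that looks harmless on its own contributes to a unique combination once it is crossed with the others, which is the kind of risk a curator would want surfaced before release.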
