Initial findings from the automation of extraction of metadata from questionnaires and its classification
Description
Social science archives have a long history of producing well-documented datasets that include provenance (questionnaires), data descriptions and methodological annotation. Alongside this, recent efforts to create thesauri such as ELSST, which can be used systematically across the social sciences, offer the possibility of enriching these valuable assets created over the last 50 years. However, this information is currently available mostly as PDFs alongside deposited datasets.
The presentation will show preliminary findings from a joint project between CLOSER and the University of Surrey that has used the metadata held in CLOSER Discovery (https://discovery.closer.ac.uk) to explore automating the extraction of provenance data from questionnaire PDFs, and the classification of the questions and associated data to a subset of ELSST.
The project has used four supervised model architectures (multinomial naive Bayes, LSTM, ULMFiT and BERT), together with enhancements of them, to explore the models' strengths for metadata extraction and its utility for classification across a number of social science and health domains. This has provided valuable insights into both the most suitable methods and the composition of training data that would be needed to reliably extract metadata from questionnaires and classify the questions and associated data to a suitable ontology.
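To make the simplest of the four architectures concrete, the following is a minimal sketch of a multinomial naive Bayes classifier that assigns a topic label to a question's text. It is an illustration only, not the project's implementation: the training questions, the topic labels (HEALTH, EMPLOYMENT, EDUCATION) and the helper names `tokenize` and `MultinomialNB` are hypothetical, and the labels are placeholders rather than actual ELSST terms.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text, split on whitespace, strip edge punctuation."""
    tokens = (w.strip(".,?!;:()").lower() for w in text.split())
    return [t for t in tokens if t]


class MultinomialNB:
    """Toy multinomial naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / self.total_docs)
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            for t in tokens:
                score += math.log((self.word_counts[label][t] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: invented questions with placeholder topic labels.
train = [
    ("How would you rate your health in general?", "HEALTH"),
    ("How often do you visit your doctor?", "HEALTH"),
    ("Do you have any long-standing illness?", "HEALTH"),
    ("What is your current job title?", "EMPLOYMENT"),
    ("How many hours do you work each week?", "EMPLOYMENT"),
    ("Are you currently in paid employment?", "EMPLOYMENT"),
    ("What is the highest qualification you hold?", "EDUCATION"),
    ("At what age did you leave school?", "EDUCATION"),
    ("Which school did you attend?", "EDUCATION"),
]
model = MultinomialNB().fit([q for q, _ in train], [t for _, t in train])
print(model.predict("How many hours do you usually work in your job?"))
```

A bag-of-words model like this ignores word order and context, which is one reason a comparison against sequence models such as LSTM, ULMFiT and BERT is informative.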
Files
ESRA 2023 - Johnson - De.pdf
2.5 MB, md5:838c8281341229511aace562ad4f4d38
Additional details
Dates
- Copyrighted: 2023-07-29