Published March 3, 2020 | Version v4
Dataset (Open Access)

Supplementary Material for the paper: Automatic Document Screening of Medical Literature Using Word and Text Embeddings in an Active Learning Setting

  • Pontificia Universidad Católica de Chile

Description

This is the dataset used in the paper: Automatic Document Screening of Medical Literature Using Word and Text Embeddings in an Active Learning Setting. 

It is composed of: 

- Pre-trained models using active learning for document screening on the HealthCLEF and Epistemonikos datasets.

- The Epistemonikos and HealthCLEF datasets, containing medical questions and relevant/non-relevant articles.

- Embeddings and document representations used for experiments on both datasets.

Scripts to run experiments can be found at: https://github.com/afcarvallo/active_learning_document_screening
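The embedding files in this record are distributed as zipped JSON archives (e.g. clef_bert_embeddings.json.zip). A minimal loading sketch, assuming (this structure is not confirmed by the record) that each JSON file maps document IDs to embedding vectors:

```python
import io
import json
import zipfile

import numpy as np


def load_embeddings(zip_path):
    """Read the first JSON file inside a .json.zip archive into a
    {doc_id: vector} dict. Assumes the JSON maps IDs to float lists."""
    with zipfile.ZipFile(zip_path) as zf:
        name = zf.namelist()[0]  # e.g. clef_bert_embeddings.json
        with zf.open(name) as f:
            data = json.load(f)
    return {doc_id: np.asarray(vec, dtype=np.float32)
            for doc_id, vec in data.items()}


# Small in-memory demo archive so the sketch runs end to end.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("demo.json",
                json.dumps({"doc1": [0.1, 0.2], "doc2": [0.3, 0.4]}))
buf.seek(0)

emb = load_embeddings(buf)
print(len(emb), emb["doc1"].shape)
```

The actual field names and nesting inside the archives may differ; consult the scripts in the GitHub repository above for the exact format.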

 

Paper abstract:

Document screening is a fundamental task within Evidence-based Medicine (EBM), a practice that provides scientific evidence to support medical decisions. Several approaches have tried to reduce physicians' workload of screening and labeling vast amounts of documents to answer clinical questions. Previous works tried to semi-automate document screening, reporting promising results, but their evaluation was conducted on small datasets, which hinders generalization. Moreover, recent works in natural language processing have introduced neural language models, but none have compared their performance in EBM. In this paper, we evaluate the impact of several document representations, such as TF-IDF along with neural language models (BioBERT, BERT, Word2vec, and GloVe), on an active learning-based setting for document screening in EBM. Our goal is to reduce the number of documents that physicians need to label to answer clinical questions. We evaluate these methods using both a small, challenging dataset (HealthCLEF 2017) and a larger but easier-to-rank one (Epistemonikos). Our results indicate that both word and textual neural embeddings always outperform the traditional TF-IDF representation. Among the neural embeddings, BERT and BioBERT yielded the best results on the HealthCLEF dataset. On the larger dataset, Epistemonikos, Word2vec and BERT were the most competitive, showing that BERT was the most consistent model across different corpora. In terms of active learning, an uncertainty sampling strategy combined with logistic regression achieved the best performance overall, above the other methods under evaluation, and in fewer iterations.
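The best-performing setup reported in the abstract, uncertainty sampling with logistic regression, can be sketched as follows. This is a generic illustration on synthetic data standing in for document embeddings, not the authors' implementation (their scripts are in the linked GitHub repository):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for document embeddings: two separable clusters
# of "relevant" (1) and "non-relevant" (0) documents.
X = np.vstack([rng.normal(0, 1, (100, 8)), rng.normal(3, 1, (100, 8))])
y = np.array([0] * 100 + [1] * 100)

# Seed the labeled set with a few documents from each class.
labeled = (list(rng.choice(100, 5, replace=False)) +
           list(rng.choice(np.arange(100, 200), 5, replace=False)))
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(5):  # five active learning iterations
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the documents whose predicted
    # probability is closest to 0.5.
    uncertainty = np.abs(proba - 0.5)
    batch = np.argsort(uncertainty)[:5]
    for idx in sorted(batch, reverse=True):
        labeled.append(pool.pop(idx))

acc = clf.score(X, y)
print(len(labeled), round(acc, 3))
```

Each iteration retrains the classifier on the current labels and asks the oracle (here, the known labels; in EBM, a physician) only for the most ambiguous documents, which is what reduces the labeling effort.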

Files (18.2 GB)

clef_bert_embeddings.json.zip

Size      md5 checksum
1.4 GB    md5:f023bffcd47e9e404fcd4a522555afc6
1.4 GB    md5:5e664db9d34ae1287d319387477cf043
63.1 MB   md5:ef0e2a9cea7b7ee759a8f4091d255612
535.9 MB  md5:173c4294cb26352a780d4f807746e53d
150.6 MB  md5:7decceb08ae2088434d18607b00dc76f
534.0 MB  md5:44e8e6fbeb50420d2e066f81720ffd4d
4.3 MB    md5:996458e9d711913cee1c4d915238a40c
1.2 GB    md5:3f2610ca345a886b2ed683015dc8e153
422.0 MB  md5:a9cc3f8d001d0e999c43a535de99ab9d
1.2 GB    md5:13e607327fb9d7d35e5f8adf84e3d6eb
3.1 GB    md5:9f617653600ed6ca4b18f8b564d379f3
3.1 GB    md5:5e656885c8e54164199d7c41c7957081
541.6 MB  md5:b820b2732f3aedbfb35cc020d1d68a4e
4.5 GB    md5:f243f401b72b892a76fdf1aaf82efe41
23.4 MB   md5:eee73d9356ffb1fbdf2360af79650a47
72.3 MB   md5:ad0e5bef63cc32034fcf8b94f0b69dff

Additional details

Related works

Cites: journal article, ISSN 1588-2861