Published July 31, 2023 | Version v2
Preprint
Open
Automatizing biocurators' intuition: filtering scientific papers by analyzing titles and short summaries
Description
We present a text classification task that arises in the biocuration of cellular chemical reactions when searching for curatable literature, and we explore the suitability of various NLP and ML methods for it. In summary, while fine-tuned domain-specific language models achieve the best results, random forests perform nearly as well at a much lower computational cost.
Files
EXP.pdf
(106.1 kB total)
Checksum | Size |
---|---|
md5:0efdb111092a73a959d607043e603561 | 92.2 kB |
md5:baad0bb1347bbc6bb03a25c7c70e5631 | 13.9 kB |