Neoplasm topography and morphology corpus
Authors/Creators
- 1. Center for Mathematical Modeling - CNRS UMI2807, Faculty of Physical and Mathematical Sciences, University of Chile, Chile
- 2. Center for Medical Informatics and Telemedicine, Faculty of Medicine, University of Chile, Chile.
- 3. Unidad de Informática Médica y Data Science, Departamento de Investigación del Cáncer, Instituto Oncológico Fundación Arturo López Pérez, Santiago, Chile
- 4. Department of Computer Sciences, Faculty of Physical and Mathematical Sciences, University of Chile, Chile
Description
Pathology reports provide valuable information for cancer registries to understand, plan and implement strategies to mitigate the impact of cancer. However, coding key information from unstructured reports is a time-consuming manual process carried out by experts. Here we report an automatic deep learning-based system that recognizes tumor morphology and topography mentions in free text and suggests codes from the International Classification of Diseases for Oncology (ICD-O) in Spanish. This task was performed using the morphology guidelines and the Cantemist resource, an open corpus annotated with tumor morphology mentions created by the Barcelona Supercomputing Center, together with topography guidelines that we developed, inspired by the former. In this way we generated an annotated internal corpus of tumor morphology and topography mentions. We applied transfer learning from state-of-the-art pre-trained language models to create a Named Entity Recognition (NER) model. The mentions found with this architecture are subsequently coded using a search engine tailored to the ICD-O codes. Our NER models achieved F1 scores of 0.86 and 0.90 for tumor morphology and topography, respectively. The proposed automatic coding system as a whole achieved an accuracy at five suggestions of 0.72 and 0.65 for tumor morphology and topography, respectively. Our results demonstrate the feasibility of implementing NLP tools in the routine of a cancer center to extract and code valuable information from pathology reports.
The tumor morphology corpus created in Spain, the Cantemist corpus [https://doi.org/10.5281/zenodo.3773228], was developed at the Barcelona Supercomputing Center (funded by the "Plan de Tecnologías del Lenguaje"): "Miranda-Escalada, A., Farré, E., & Krallinger, M. (2020). Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings."
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.
We are releasing the dataset in 2 formats:
- corpus_raw.zip: Contains the raw text file for each document along with its annotation file in standoff format.
- corpus.zip: Contains the corpus already tokenized and annotated using the IOB2 format. The corpus is separated into train, test and development subsets.
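For readers who have not worked with IOB2 before, the following is a minimal Python sketch of how tagged tokens can be grouped back into entity mentions. It assumes a CoNLL-style layout (one token per line, the tag in the last whitespace-separated column, a blank line between sentences); the exact column layout of the files in corpus.zip should be checked before use. The label names `MORPHOLOGY` and `TOPOGRAPHY`, the function names, and the sample sentence are all hypothetical and made up for illustration; they are not taken from the corpus.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, IOB2 tag) pairs


def read_iob2(lines: Iterable[str]) -> List[Sentence]:
    """Group 'token ... tag' lines into sentences; a blank line ends a sentence."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumption: the token is the first column and the tag is the last one.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_mentions(sentence: Sentence) -> List[Tuple[str, str]]:
    """Return (label, mention text) pairs built from B-/I- tagged tokens."""
    mentions: List[Tuple[str, str]] = []
    label: Optional[str] = None
    tokens: List[str] = []
    for token, tag in sentence:
        if tag.startswith("B-"):
            # A B- tag always opens a new mention, closing any open one.
            if tokens:
                mentions.append((label, " ".join(tokens)))
            label, tokens = tag[2:], [token]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            # An I- tag continues the open mention of the same label.
            tokens.append(token)
        else:
            # An O tag (or an I- tag with no matching open mention) closes it.
            if tokens:
                mentions.append((label, " ".join(tokens)))
            label, tokens = None, []
    if tokens:
        mentions.append((label, " ".join(tokens)))
    return mentions


# Made-up sample with hypothetical label names, only to exercise the functions.
sample = """\
Carcinoma B-MORPHOLOGY
ductal I-MORPHOLOGY
infiltrante I-MORPHOLOGY
de O
mama B-TOPOGRAPHY
. O

Se O
observa O
adenocarcinoma B-MORPHOLOGY
. O
"""

sentences = read_iob2(sample.splitlines())
for sentence in sentences:
    print(extract_mentions(sentence))
```

To apply the same functions to a real file, one could pass an open file handle to `read_iob2` instead of `sample.splitlines()`, since it accepts any iterable of lines.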
Files
corpus.zip
Additional details
References
- Miranda-Escalada, A., Farré, E., & Krallinger, M. (2020). Cantemist corpus: gold standard of oncology clinical cases annotated with CIE-O 3 terminology (1.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3978041