Published October 8, 2021 | Version v1
Dataset Open

Neoplasm topography and morphology corpus

  • 1. Center for Mathematical Modeling - CNRS UMI2807, Faculty of Physical and Mathematical Sciences, University of Chile, Chile
  • 2. Center for Medical Informatics and Telemedicine, Faculty of Medicine, University of Chile, Chile
  • 3. Unidad de Informática Médica y Data Science, Departamento de Investigación del Cáncer, Instituto Oncológico Fundación Arturo López Pérez, Santiago, Chile
  • 4. Department of Computer Sciences, Faculty of Physical and Mathematical Sciences, University of Chile, Chile

Description

Pathology reports provide valuable information for cancer registries to understand, plan, and implement strategies to mitigate the impact of cancer. However, coding key information from unstructured reports is a time-consuming manual process performed by experts. Here we report an automatic deep learning-based system that recognizes tumor morphology and topography mentions in free text and suggests codes from the International Classification of Diseases for Oncology (ICD-O) in Spanish. This task was performed using the morphology guidelines and the Cantemist resource, an open corpus annotated with tumor morphology mentions created by the Barcelona Supercomputing Center, together with topography guidelines that we developed, inspired by the former. In this way we generated an annotated internal corpus of tumor morphology and topography mentions. We applied transfer learning from state-of-the-art pre-trained language models to create a Named Entity Recognition (NER) model. The mentions found by this model are subsequently coded using a search engine tailored to the ICD-O codes. Our NER models achieved F1-scores of 0.86 and 0.90 for tumor morphology and topography, respectively. The complete automatic coding system achieved an accuracy at five suggestions of 0.72 and 0.65 for tumor morphology and topography, respectively. Our results demonstrate the feasibility of implementing NLP tools in the routine of a cancer center to extract and code valuable information from pathology reports.
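
To make the "accuracy at five suggestions" figure concrete, here is a minimal Python sketch. It assumes the metric is the fraction of mentions whose gold ICD-O code appears among the first five codes suggested by the system; the description does not spell out the exact definition. The function name and the example codes are illustrative only and are not taken from the authors' implementation.

```python
from typing import Sequence


def accuracy_at_k(gold: Sequence[str],
                  suggestions: Sequence[Sequence[str]],
                  k: int = 5) -> float:
    """Fraction of mentions whose gold code is among the top-k suggestions."""
    if len(gold) != len(suggestions):
        raise ValueError("gold and suggestions must have the same length")
    if not gold:
        return 0.0
    # A mention counts as a hit when its gold code is in the first k suggestions.
    hits = sum(1 for code, ranked in zip(gold, suggestions) if code in ranked[:k])
    return hits / len(gold)


# Illustrative data: two mentions, each with a ranked list of suggested codes.
gold_codes = ["8140/3", "8070/3"]
ranked_suggestions = [
    ["8010/3", "8140/3", "8000/3"],  # gold code is ranked second: hit
    ["8010/3", "8000/3"],            # gold code is missing: miss
]
score = accuracy_at_k(gold_codes, ranked_suggestions, k=5)
```

Under this reading, a reported value of 0.72 would mean that the correct morphology code was among the five suggestions for 72% of the mentions.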

The tumor morphology corpus created in Spain, the Cantemist corpus [https://doi.org/10.5281/zenodo.3773228], was developed at the Barcelona Supercomputing Center (funded by the "Plan de Tecnologías del Lenguaje"): Miranda-Escalada, A., Farré, E., & Krallinger, M. (2020). Named entity recognition, concept normalization and clinical coding: Overview of the Cantemist track for cancer text mining in Spanish, corpus, guidelines, methods and results. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), CEUR Workshop Proceedings.
 

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

We are releasing the dataset in two formats:

  • corpus_raw.zip: Contains the raw text file for each document along with its annotation file in standoff format.
  • corpus.zip: Contains the corpus already tokenized and annotated in the IOB2 format. The corpus is split into train, test, and development subsets.
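
As a rough illustration of how the IOB2 files might be consumed, the sketch below reads IOB2-tagged text and groups tagged tokens into entity mentions. It assumes a CoNLL-style layout (one token per line, the tag in the last whitespace-separated column, blank lines between sentences), which should be checked against the files in corpus.zip. The label names MORPHOLOGY and TOPOGRAPHY and the sample sentence are hypothetical.

```python
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]  # list of (token, tag) pairs


def read_iob2(lines: Iterable[str]) -> List[Sentence]:
    """Split IOB2 lines into sentences; blank lines mark sentence boundaries."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in lines:
        line = line.strip()
        if not line:
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        # Assumed layout: token in the first column, tag in the last one.
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_mentions(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (label, mention text) pairs."""
    mentions: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        if tag.startswith("B-"):
            if tokens:
                mentions.append((label, " ".join(tokens)))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == label:
            tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open mention:
            # close the open mention (if any) and ignore this token.
            if tokens:
                mentions.append((label, " ".join(tokens)))
            tokens, label = [], None
    if tokens:
        mentions.append((label, " ".join(tokens)))
    return mentions


# Hypothetical sample in the assumed layout.
sample = """Carcinoma B-MORPHOLOGY
ductal I-MORPHOLOGY
de O
mama B-TOPOGRAPHY

Sin O
hallazgos O
""".splitlines()

sentences = read_iob2(sample)
mentions = extract_mentions(sentences[0])
```

In practice the lines would come from an opened file in one of the train, test, or development subsets rather than from an inline string.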

Files

corpus.zip

Files (9.8 MB)

md5:1cd9b96eeac6825f02026ac5f3c91e43 (5.0 MB)
md5:ea6c5972da347de7240a4fcf0c5f78ef (4.7 MB)
md5:136c671dba2d2f644b882e31c3e289e8 (20.9 kB)
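
The MD5 checksums listed above can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard hashlib module; the helper names and the file path in the usage comment are hypothetical, and the expected value must be matched to whichever file the checksum belongs to.

```python
import hashlib
import io
from typing import BinaryIO


def md5_hex(stream: BinaryIO, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_file(path: str, expected_md5: str) -> bool:
    """Compare the MD5 digest of a file on disk with the expected checksum."""
    with open(path, "rb") as handle:
        return md5_hex(handle) == expected_md5.lower()


# Self-check on in-memory data, so the sketch runs without any download.
demo_digest = md5_hex(io.BytesIO(b"hello"))

# Hypothetical usage with a downloaded archive (the path is an assumption):
# ok = verify_file("downloads/corpus.zip", "<checksum listed for that file>")
```

Reading in chunks keeps memory use constant regardless of archive size.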

Additional details

References

  • Miranda-Escalada, A., Farré, E., & Krallinger, M. (2020). Cantemist corpus: gold standard of oncology clinical cases annotated with CIE-O 3 terminology (1.6) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3978041