Published February 2026 | Version v1
Dataset Open

CPL-Code: a corpus annotated with bioinformatics tools names in executable code of Nextflow workflows

  • 1. Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
  • 2. Université Paris-Saclay, CEA, Institut LIST, 91191, Gif-sur-Yvette, France

Description

CPL-Code is a corpus annotated with the names of bioinformatics tools found in the executable code of Nextflow workflows. The annotations are provided in the BRAT Rapid Annotation Tool (BRAT) standoff format (https://brat.nlplab.org/standoff.html).
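As an illustration of the standoff format, the following minimal sketch parses the text-bound annotation lines (those whose ID starts with "T") of a BRAT .ann file. The entity label "Tool" and the sample lines are assumptions made for the example; the labels actually used in CPL-Code should be checked in the archive.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    ann_id: str                   # e.g. "T1"
    label: str                    # entity type; "Tool" below is a hypothetical label
    spans: list[tuple[int, int]]  # (start, end) character offsets in the .txt file
    text: str                     # text covered by the annotation


def parse_brat_entities(ann_text: str) -> list[Entity]:
    """Extract text-bound annotations from the content of a BRAT .ann file."""
    entities = []
    for line in ann_text.splitlines():
        # Text-bound annotations have IDs starting with "T"; other lines
        # (relations, notes, attributes, ...) are skipped.
        if not line.startswith("T"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue
        ann_id, info, text = parts
        label, _, offsets = info.partition(" ")
        # Discontinuous annotations list several "start end" pairs separated by ";".
        spans = []
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.append((int(start), int(end)))
        entities.append(Entity(ann_id, label, spans, text))
    return entities


if __name__ == "__main__":
    # Hypothetical sample, not taken from the corpus.
    sample = "T1\tTool 0 6\tfastqc\nT2\tTool 20 28\tsamtools\n"
    for entity in parse_brat_entities(sample):
        print(entity.ann_id, entity.label, entity.spans, entity.text)
```

In BRAT, each .ann file is paired with a .txt file of the same base name, and the offsets refer to character positions in that text file.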

The corpus comprises 797 processes from Nextflow workflows randomly selected from GitHub, with a total of 78,562 tokens and 1,914 annotated tokens corresponding to 1,911 tool occurrences (421 unique tools).
 

Repository organisation

The corpus files are separated into six directories:

  • Five folders (iteration_{i}) are provided, each corresponding to a different split of the training data. This allows experiments to be run on different splits.
  • The last folder contains five articles used for testing.
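To run an experiment over the different splits, one could loop over the iteration folders as in the sketch below. The root directory name "CPL-Code", the glob pattern "iteration_*", and the presence of .ann files somewhere under each folder are assumptions about the extracted archive, not details confirmed by this description.

```python
from pathlib import Path


def count_ann_files(root: Path) -> dict[str, int]:
    """Count BRAT .ann files under each iteration_* folder of `root`."""
    counts: dict[str, int] = {}
    if not root.is_dir():
        return counts
    for split_dir in sorted(root.glob("iteration_*")):
        if split_dir.is_dir():
            # rglob searches recursively, since the internal layout
            # of each split folder is not specified here.
            counts[split_dir.name] = sum(1 for _ in split_dir.rglob("*.ann"))
    return counts


if __name__ == "__main__":
    # "CPL-Code" is an assumed name for the directory extracted from the zip.
    for name, n_files in count_ann_files(Path("CPL-Code")).items():
        print(f"{name}: {n_files} annotation files")
```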

Contact


Clémence Sebe, clemence.sebe@universite-paris-saclay.fr

Funding


This work received support from the French National Research Agency (ANR) under the France 2030 program, reference ANR-22-PESN-0007.

Files

CPL-Code.zip (4.8 MB)
md5:50476181e927913bf6c4c23c0e713d52
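
The MD5 checksum listed above can be used to check that the archive was downloaded intact. The following is a minimal sketch, assuming the file was saved as CPL-Code.zip in the current working directory.

```python
import hashlib
from pathlib import Path

# Checksum published on this record for CPL-Code.zip.
EXPECTED_MD5 = "50476181e927913bf6c4c23c0e713d52"


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("CPL-Code.zip")  # assumed local download location
    if archive.exists():
        status = "OK" if md5_of_file(archive) == EXPECTED_MD5 else "MISMATCH"
        print(f"{archive}: {status}")
    else:
        print(f"{archive} not found")
```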