Published February 2026 | Version v1
Dataset
Open
CPL-Code: a corpus annotated with bioinformatics tools names in executable code of Nextflow workflows
Authors/Creators
- 1. Université Paris-Saclay, CNRS, LISN, 91400, Orsay, France
- 2. Université Paris-Saclay, CEA, Institut LIST, 91191, Gif-sur-Yvette, France
Description
CPL-Code is a corpus annotated with the names of bioinformatics tools found in the executable code of Nextflow workflows. The annotations are provided in the BRAT Rapid Annotation Tool (BRAT) standoff format (https://brat.nlplab.org/standoff.html).
The corpus comprises 797 processes from Nextflow workflows randomly selected from GitHub, for a total of 78,562 tokens, of which 1,914 are annotated. These annotated tokens correspond to 1,911 tool occurrences (421 unique tools).
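As a rough illustration of how the standoff annotations might be read, the sketch below parses the text-bound ("T") lines of a BRAT `.ann` file, which have the form `ID<TAB>TYPE START END<TAB>TEXT`. The `Entity` class, the `parse_ann` function, the example script, and the `Tool` label are all hypothetical; the label names actually used in CPL-Code should be checked in the files themselves.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    ident: str   # annotation ID, e.g. "T1"
    label: str   # entity type as written in the .ann file
    start: int   # character offset of the first character
    end: int     # character offset just past the last character
    text: str    # surface string covered by the annotation


def parse_ann(ann_content: str) -> list[Entity]:
    """Parse the text-bound ("T") lines of a BRAT standoff .ann file."""
    entities = []
    for line in ann_content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, etc.
        ident, type_and_span, text = line.split("\t", 2)
        label, span = type_and_span.split(" ", 1)
        # Discontinuous annotations use ";"-separated fragments;
        # this sketch keeps only the outermost offsets.
        fragments = [frag.split() for frag in span.split(";")]
        start = int(fragments[0][0])
        end = int(fragments[-1][1])
        entities.append(Entity(ident, label, start, end, text))
    return entities


# Hypothetical example: the script and the "Tool" label are made up.
script = "fastqc reads.fq && samtools sort in.bam"
ann = "T1\tTool 0 6\tfastqc\nT2\tTool 19 27\tsamtools\n"
entities = parse_ann(ann)
for e in entities:
    # Offsets index into the companion text file.
    assert script[e.start:e.end] == e.text
```

In BRAT, each `.ann` file normally sits next to a `.txt` file with the same base name, and the offsets refer to characters in that text file.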
Repository organisation
The corpus is organised into six directories:
- Five folders (iteration_{i}) are provided, each corresponding to a different split of the training data. This allows experiments to be run on different splits.
- The last folder contains five articles used for testing.
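To get a quick overview of this layout after extracting the archive, the annotation files in each top-level folder can be counted. This is a minimal sketch: the function name `ann_files_per_folder` is hypothetical, and it assumes only that annotations are stored as `.ann` files somewhere below each folder, without relying on the exact folder names or their internal structure.

```python
from pathlib import Path


def ann_files_per_folder(root: str) -> dict[str, int]:
    """Count .ann files below each top-level folder of the extracted corpus."""
    counts = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir():
            # rglob searches recursively, so nested train/dev folders are included.
            counts[folder.name] = sum(1 for _ in folder.rglob("*.ann"))
    return counts
```

Calling `ann_files_per_folder("CPL-Code")` on the extracted directory (the path is an assumption) returns a dictionary mapping each folder name to its number of annotation files.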
Contact
Clémence Sebe, clemence.sebe@universite-paris-saclay.fr
Funding
This work was supported by the French National Research Agency (ANR) under the France 2030 program, reference ANR-22-PESN-0007.
Files
CPL-Code.zip
Size: 4.8 MB
md5:50476181e927913bf6c4c23c0e713d52
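The MD5 checksum listed for the archive can be used to verify a download. The sketch below uses Python's standard `hashlib` module; the helper name `md5_of_file` and the local path `CPL-Code.zip` are assumptions.

```python
import hashlib
from pathlib import Path

# Checksum as published on this record.
EXPECTED_MD5 = "50476181e927913bf6c4c23c0e713d52"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


archive = Path("CPL-Code.zip")  # assumed download location
if archive.exists():
    status = "OK" if md5_of_file(archive) == EXPECTED_MD5 else "MISMATCH"
    print(f"{archive}: {status}")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.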