Geniac: Automatic Configuration GENerator and Installer for nextflow pipelines
Creators
- 1. Mines Paris Tech, Fontainebleau, F-77305, France
- 2. UMR144, CNRS, Paris, F-75005, France
Description
With the advent of high-throughput biotechnological platforms and their ever-growing capacity, life science has turned into a digitized, computational and data-intensive discipline. As a consequence, running standard analyses with a bioinformatics pipeline in routine production has become a challenge: the data must be processed in real time and delivered to end-users as fast as possible. Workflow management systems, combined with packaging systems and containerization technologies, offer an opportunity to tackle this challenge. While very powerful, these tools can be used and combined in many ways, which may differ from one developer to another. Therefore, promoting homogeneous workflow implementations requires guidelines and protocols that detail how the source code of a bioinformatics pipeline should be written and organized to ensure its usability, maintainability, interoperability, sustainability, portability, reproducibility, scalability and efficiency. Capitalizing on Nextflow, Conda, Docker, Singularity and the nf-core initiative, we propose a set of best practices covering the development life cycle of a bioinformatics pipeline and its deployment for production operations. These practices target different expert communities, including i) bioinformaticians and statisticians, ii) software engineers, and iii) data managers and core facility engineers. We implemented Geniac (Automatic Configuration GENerator and Installer for nextflow pipelines), a toolbox with three components: i) technical documentation, available at https://geniac.readthedocs.io, that details coding guidelines for bioinformatics pipelines written with Nextflow, ii) a command line interface with a linter to check that the code respects the guidelines, and iii) an add-on to generate configuration files, build the containers and deploy the pipeline.
The Geniac toolbox aims to harmonize development practices across developers and to automate the generation of configuration files and containers by parsing the source code of the Nextflow pipeline.
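To make the idea of generating configuration by parsing pipeline source more concrete, here is a minimal Python sketch. It is not Geniac's implementation: the pipeline snippet, the process and label names, the `containers/` directory and the helper functions are all hypothetical, and it assumes that each tool is declared through a Nextflow `label` directive. It scans the source for labels and renders a `withLabel` configuration scope that assigns one container image per label.

```python
import re

# Hypothetical Nextflow source; process and label names are invented for illustration.
PIPELINE_SOURCE = '''
process fastqcRaw {
  label 'fastqc'
  script:
  """
  fastqc raw.fastq
  """
}

process fastqcTrimmed {
  label 'fastqc'
  script:
  """
  fastqc trimmed.fastq
  """
}

process alignment {
  label 'bwa'
  script:
  """
  bwa mem ref.fa trimmed.fastq
  """
}
'''

# Matches lines such as:  label 'fastqc'   or   label "fastqc"
LABEL_RE = re.compile(r"^\s*label\s+['\"]([^'\"]+)['\"]", re.MULTILINE)


def extract_labels(source):
    """Return the distinct label names in order of first appearance."""
    seen = []
    for name in LABEL_RE.findall(source):
        if name not in seen:
            seen.append(name)
    return seen


def render_config(labels, image_dir="containers"):
    """Render a config scope assigning one (assumed) image path per label."""
    lines = ["process {"]
    for name in labels:
        lines.append(
            f"  withLabel:{name} {{ container = '{image_dir}/{name}.sif' }}"
        )
    lines.append("}")
    return "\n".join(lines)


labels = extract_labels(PIPELINE_SOURCE)
config_text = render_config(labels)
print(config_text)
```

A regular expression is enough for a sketch like this, but a real tool would need to handle comments, multi-line strings and labels set dynamically, which is one reason coding guidelines and a linter are useful alongside the generator.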
Files
openreseurope-1-15693.pdf