CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object)

Farah Zaib Khan; Stian Soiland-Reyes

doi:10.17632/xnwncxpw42.1

Published December 4, 2018 | Version 1

Dataset Open

1. Department of Computing and Information Systems, The University of Melbourne, Australia
2. School of Computer Science, The University of Manchester, UK

This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated at the Broad Institute. The workflow has five steps in total:

1. Read alignment using STAR, which produces aligned BAM files including the Genome BAM and the Transcriptome BAM.
2. The Genome BAM file is processed using Picard MarkDuplicates, producing an updated BAM file that contains information on duplicate reads (such reads can lead to biased interpretation).
3. SAMtools index is then used to generate an index for the BAM file, in preparation for the next step.
4. The indexed BAM file is processed further with RNA-SeQC, which takes the BAM file, the human genome reference sequence and a Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
5. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool and on additional RSEM reference sequences.
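
The dependencies between the five steps can be sketched as a small graph. The snippet below is only an illustration: the step labels are hypothetical shorthand for the tools named above, not the step identifiers used in the CWL file, and the standard `tsort` utility is used just to print one execution order that respects the dependencies.

```shell
# Each line reads "prerequisite dependent". The labels are illustrative
# shorthand for the tools described above, not identifiers from the CWL file.
workflow_edges() {
    cat <<'EOF'
STAR MarkDuplicates
MarkDuplicates samtools_index
samtools_index RNA-SeQC
STAR RSEM
EOF
}

# tsort prints one order that respects every dependency. STAR always comes
# first; RSEM may appear anywhere after it because it depends only on STAR.
workflow_edges | tsort
```

Because RSEM has no dependency on the MarkDuplicates branch, a workflow engine is free to run it alongside steps 2 to 4.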

For testing and analysis, the workflow authors provided example data created by down-sampling the read files of a TOPMed public-access dataset. Chromosome 12 was extracted from the Homo sapiens assembly 38 reference sequence and is provided by the workflow authors, together with the required GTF and RSEM reference data files. The workflow is well documented, and a detailed set of instructions for the steps performed to down-sample the data is also provided for transparency. The availability of example input data, the use of containerization for the underlying software and the detailed documentation were important factors in choosing this specific CWL workflow for CWLProv evaluation.

This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance; see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl

Steps to reproduce

To rebuild the research object, use Python 3 on macOS. The original run was built with:

Processor: 2.8GHz Intel Core i7
Memory: 16GB
OS: macOS High Sierra, Version 10.13.3
Storage: 250GB

Install cwltool

pip3 install cwltool==1.0.20180912090223
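
After installation it can be worth confirming that the pinned release is the one on the PATH. The following is a minimal sketch, assuming that the output of `cwltool --version` contains the version string; the helper name `has_pinned_version` is hypothetical.

```shell
# Version pinned in the install command above.
expected="1.0.20180912090223"

# Succeeds when the given text (assumed to be the output of
# `cwltool --version`) contains the pinned version string.
has_pinned_version() {
    case "$1" in
        *"$expected"*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v cwltool >/dev/null 2>&1; then
    if has_pinned_version "$(cwltool --version)"; then
        echo "cwltool matches the pinned version $expected"
    else
        echo "cwltool is installed, but not at version $expected"
    fi
else
    echo "cwltool is not on the PATH"
fi
```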

Install Git LFS

Downloading the data with the git repository requires Git LFS:
https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs
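
A quick way to check whether Git LFS is already available before cloning is sketched below. The helper name `lfs_status` is hypothetical; `git lfs version` and `git lfs install` are the standard Git LFS subcommands.

```shell
# Report whether the git-lfs extension is reachable through git.
lfs_status() {
    if git lfs version >/dev/null 2>&1; then
        echo "available"
    else
        echo "missing"
    fi
}

if [ "$(lfs_status)" = "available" ]; then
    echo "Git LFS found; run 'git lfs install' once per user account to enable its hooks."
else
    echo "Git LFS not found; install it before cloning the repository."
fi
```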

Get the data and make the analysis environment ready:

git clone https://github.com/FarahZKhan/cwl_workflows.git
cd cwl_workflows/
git checkout CWLProvTesting
./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh

Run the following commands to create the CWLProv Research Object:

cwltool --provenance rnaseqwf_0.5.0_mac --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json

zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac
sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac.zip.sha256
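
Before zipping, the provenance folder can be given a quick sanity check. This is a sketch under stated assumptions rather than a validator: it assumes the research object follows the BagIt-based layout we understand CWLProv to use (a `bagit.txt` marker, a `metadata/manifest.json` manifest and a `workflow/packed.cwl` copy of the workflow). The helper name `check_ro_layout` is hypothetical, and the path list should be adjusted if your output differs.

```shell
# Check that a few files expected in a CWLProv research object exist.
# The list of paths is an assumption about the layout, not a full validation.
check_ro_layout() {
    ro_dir="$1"
    missing=0
    for path in bagit.txt metadata/manifest.json workflow/packed.cwl; do
        if [ ! -e "$ro_dir/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    return "$missing"
}

# Only runs when the provenance folder from the cwltool command is present.
if [ -d rnaseqwf_0.5.0_mac ]; then
    check_ro_layout rnaseqwf_0.5.0_mac || echo "research object layout looks incomplete"
fi
```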

The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120

Notes

Mirror of Mendeley Data upload https://data.mendeley.com/datasets/xnwncxpw42/1

Files

Files (1.9 GB)

Name	Size	MD5
rnaseqwf_0.5.0_mac.zip	1.9 GB	6b766e4629ba25cc18517f44a9fb3a6e
rnaseqwf_0.5.0_mac.zip.sha256	89 Bytes	a45b85feb4e483b981f6d54c410db662
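
A downloaded copy of the archive can be checked against the accompanying `.sha256` file. The sketch below assumes both files sit in the current directory and that the checksum file is in the format written by `sha256sum`; the helper name `verify_sha256` is hypothetical. It falls back to `shasum -a 256` where `sha256sum` is not installed, as on a stock macOS system.

```shell
# Verify files listed in a checksum file produced by `sha256sum`.
# Uses `shasum -a 256` when sha256sum is not available.
verify_sha256() {
    checksum_file="$1"
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum -c "$checksum_file"
    else
        shasum -a 256 -c "$checksum_file"
    fi
}

# Only runs when the checksum file from this record has been downloaded.
if [ -f rnaseqwf_0.5.0_mac.zip.sha256 ]; then
    verify_sha256 rnaseqwf_0.5.0_mac.zip.sha256 || echo "checksum verification failed"
fi
```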

Additional details

Funding: European Commission
BioExcel - Centre of Excellence for Biomolecular Research, grant 675728

	All versions	This version
Views	576	575
Downloads	135	135
Data volume	168.1 GB	168.1 GB
