Published December 4, 2018 | Version 1
Dataset Open

CWL run of RNA-seq Analysis Workflow (CWLProv 0.5.0 Research Object)

  • 1. Department of Computing and Information Systems, The University of Melbourne, Australia
  • 2. School of Computer Science, The University of Manchester, UK

Description

This workflow adapts the approach and parameter settings of Trans-Omics for Precision Medicine (TOPMed). The RNA-seq pipeline originated from the Broad Institute. The workflow has five steps:

  1. Read alignment using STAR, which produces aligned BAM files including the Genome BAM and Transcriptome BAM.
  2. The Genome BAM file is processed using Picard MarkDuplicates, producing an updated BAM file containing information on duplicate reads (such reads can lead to biased interpretation).
  3. SAMtools index is then employed to generate an index for the BAM file, in preparation for the next step.
  4. The indexed BAM file is processed further with RNA-SeQC, which takes the BAM file, a human genome reference sequence and a Gene Transfer Format (GTF) file as inputs to generate transcriptome-level expression quantifications and standard quality control metrics.
  5. In parallel with transcript quantification, isoform expression levels are quantified by RSEM. This step depends only on the output of the STAR tool and on additional RSEM reference sequences.
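
The dependencies between the five steps can be sketched as a small graph. This is only an illustration of the description above, not code taken from the CWL file: the step labels are made up for the example, and it assumes that SAMtools indexes the BAM file written by MarkDuplicates. It uses `graphlib`, which is available in Python 3.9 and later.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes.
# The labels are illustrative, not identifiers from the CWL workflow.
DEPENDENCIES = {
    "star_align": set(),
    "picard_markduplicates": {"star_align"},
    "samtools_index": {"picard_markduplicates"},  # assumed input: MarkDuplicates BAM
    "rnaseqc": {"samtools_index"},
    "rsem": {"star_align"},  # depends only on STAR output
}


def execution_order(deps):
    """Return one valid order in which the steps could run."""
    return list(TopologicalSorter(deps).static_order())


def ready_batches(deps):
    """Group steps into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    print(ready_batches(DEPENDENCIES))
```

In this sketch RSEM becomes ready in the same batch as MarkDuplicates, which reflects the statement that it needs only the STAR output.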

For testing and analysis, the workflow authors provided example data created by down-sampling the read files of a TOPMed public access dataset. Chromosome 12 was extracted from the Homo sapiens Assembly 38 reference sequence and is provided by the workflow authors, along with the required GTF and RSEM reference data files. The workflow is well documented, and detailed instructions for the steps performed to down-sample the data are provided for transparency. The availability of example input data, the use of containerization for the underlying software, and the detailed documentation were important factors in choosing this specific CWL workflow for CWLProv evaluation.

This dataset folder is a CWLProv Research Object that captures the Common Workflow Language execution provenance; see https://w3id.org/cwl/prov/0.5.0 or use https://pypi.org/project/cwl

Steps to reproduce

To build the Research Object again, use Python 3 on macOS. The original was built with:

  • Processor: 2.8GHz Intel Core i7
  • Memory: 16GB
  • OS: macOS High Sierra, Version 10.13.3
  • Storage: 250GB
  1. Install cwltool

    pip3 install cwltool==1.0.20180912090223
  2. Install Git LFS
    Downloading the data with the git repository requires Git LFS:
    https://www.atlassian.com/git/tutorials/git-lfs#installing-git-lfs

  3. Get the data and make the analysis environment ready:

    git clone https://github.com/FarahZKhan/cwl_workflows.git
    cd cwl_workflows/
    git checkout CWLProvTesting
    ./topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/download_examples.sh
  4. Run the following commands to create the CWLProv Research Object:

    cwltool --provenance rnaseqwf_0.5.0_mac --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-workflows/TOPMed_RNAseq_pipeline/rnaseq_pipeline_fastq.cwl topmed-workflows/TOPMed_RNAseq_pipeline/input-examples/Dockstore.json
    
    zip -r rnaseqwf_0.5.0_mac.zip rnaseqwf_0.5.0_mac
    sha256sum rnaseqwf_0.5.0_mac.zip > rnaseqwf_0.5.0_mac.zip.sha256
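
The long `cwltool` command in step 4 is easier to read when split into its parts. The following is a minimal sketch, not part of the original instructions: the helper name `build_cwltool_command` and the `PIPELINE_DIR` constant are hypothetical, and the output folder is set to `rnaseqwf_0.5.0_mac`, the name the `zip` command expects. The sketch only assembles and prints the command; it does not run the workflow.

```python
import shlex


def build_cwltool_command(ro_dir, tmp_prefix, workflow, job):
    """Assemble the cwltool argument list used in step 4 (hypothetical helper)."""
    return [
        "cwltool",
        "--provenance", ro_dir,               # folder that receives the Research Object
        f"--tmp-outdir-prefix={tmp_prefix}",  # prefix for intermediate output folders
        f"--tmpdir-prefix={tmp_prefix}",      # prefix for temporary folders
        workflow,                             # CWL workflow description
        job,                                  # JSON file with the input values
    ]


# Path inside the cloned cwl_workflows repository, as used in step 3.
PIPELINE_DIR = "topmed-workflows/TOPMed_RNAseq_pipeline"

command = build_cwltool_command(
    ro_dir="rnaseqwf_0.5.0_mac",
    tmp_prefix="/CWLProv_workflow_testing/intermediate_temp/temp",
    workflow=f"{PIPELINE_DIR}/rnaseq_pipeline_fastq.cwl",
    job=f"{PIPELINE_DIR}/input-examples/Dockstore.json",
)

if __name__ == "__main__":
    # Print a shell-quoted line that can be pasted into a terminal.
    print(shlex.join(command))
```

To execute the command from Python instead of printing it, the list could be passed to `subprocess.run(command, check=True)` from inside the `cwl_workflows/` folder.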

The https://github.com/FarahZKhan/cwl_workflows repository is a frozen snapshot from https://github.com/heliumdatacommons/TOPMed_RNAseq_CWL commit 027e8af41b906173aafdb791351fb29efc044120

Notes

Mirror of Mendeley Data upload https://data.mendeley.com/datasets/xnwncxpw42/1

Files

rnaseqwf_0.5.0_mac.zip

Files (1.9 GB)

  • 1.9 GB, md5:6b766e4629ba25cc18517f44a9fb3a6e
  • 89 Bytes, md5:a45b85feb4e483b981f6d54c410db662
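
A downloaded file can be checked against the md5 values listed above. The sketch below is one possible way to do this with the Python standard library; the function names are illustrative, and it assumes that the md5 listed for the 1.9 GB file belongs to `rnaseqwf_0.5.0_mac.zip`, which the listing suggests but does not state.

```python
import hashlib
import sys


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so a large archive is never loaded into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex, algorithm="md5"):
    """Return True when the file's digest equals the expected hex string."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# md5 listed in this record for the 1.9 GB file (assumed to be the zip archive).
EXPECTED_MD5 = "6b766e4629ba25cc18517f44a9fb3a6e"

if __name__ == "__main__" and len(sys.argv) == 2:
    print("OK" if verify(sys.argv[1], EXPECTED_MD5) else "MISMATCH")
```

Passing `algorithm="sha256"` to `verify` would allow the same function to check the digest written by the `sha256sum` step.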

Additional details

Funding

BioExcel – Centre of Excellence for Biomolecular Research 675728
European Commission