Published December 4, 2018 | Version 1
Dataset Open

CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)

  • 1. Department of Computing and Information Systems, The University of Melbourne, Australia
  • 2. School of Computer Science, The University of Manchester, UK

Description

This dataset folder is a CWLProv Research Object that captures the execution provenance of a Common Workflow Language (CWL) workflow run. See CWLProv 0.6.0 for the format, or use the cwlprov Python tool to explore it.
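
As a rough illustration of that kind of exploration, the sketch below wraps a few cwlprov subcommands in a shell function. The function name explore_ro and its directory argument are hypothetical. It assumes the tool was installed with pip install cwlprov and that the archive has been unpacked locally. The --directory option and the validate, info and runs subcommands are the ones listed in the tool's own help; check cwlprov --help against your installed version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: inspect an unpacked CWLProv Research Object.
# Assumes the cwlprov command-line tool is on PATH (pip install cwlprov).
explore_ro() {
  local ro_dir="$1"
  if ! command -v cwlprov >/dev/null 2>&1; then
    echo "cwlprov not found; try: pip install cwlprov" >&2
    return 1
  fi
  if [ ! -d "$ro_dir" ]; then
    echo "no such directory: $ro_dir" >&2
    return 1
  fi
  cwlprov --directory "$ro_dir" validate   # check the Research Object structure
  cwlprov --directory "$ro_dir" info       # print a summary of the Research Object
  cwlprov --directory "$ro_dir" runs       # list the workflow runs it records
}
```

A typical call, once the zip file has been extracted, would be explore_ro ./alignment_0.6.0_linux.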

 

The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. The workflow is part of the NIH Data Commons initiative and comprises four stages.

The first step, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and a human genome reference sequence as input. Using SAMtools utilities such as view, sort and fixmate, it returns a list of fastq files that serve as input for the next step.

The next step, Align, also accepts the human reference genome as input, along with the output files from Pre-align, and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads, and SAMtools view converts the read files from SAM to BAM format.

The BAM files generated by Align are sorted with SAMtools sort.

Finally, in the Post-align step, these sorted alignment files are merged with SAMtools merge to produce a single sorted BAM file.
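
The four steps described above can be pictured as a chain of command-line invocations. The following is a dry-run sketch only: it prints plausible commands instead of executing them, and every file name and flag in it is an illustrative assumption, not taken from the workflow's CWL definition. In particular, the use of samtools fastq for the final conversion in Pre-align and the label "Sort" for the third step are assumed.

```shell
#!/usr/bin/env bash
# Dry run: prints one illustrative command line per stage, executes nothing.
REF="reference.fa"   # hypothetical reference genome file
CRAM="sample.cram"   # hypothetical input CRAM file

print_pipeline() {
  echo "## 1. Pre-align"
  # Decode the CRAM, name-sort, fix mate information, emit fastq.
  echo "samtools view -u -T $REF $CRAM | samtools sort -n - | samtools fixmate - - | samtools fastq - > reads.fastq"
  echo "## 2. Align"
  # Align with BWA-MEM, mark duplicates, convert SAM to BAM.
  echo "bwa mem -p $REF reads.fastq | samblaster | samtools view -b -o aligned.bam -"
  echo "## 3. Sort"
  # Coordinate-sort each aligned BAM file.
  echo "samtools sort -o aligned.sorted.bam aligned.bam"
  echo "## 4. Post-align"
  # Merge the sorted BAM files into one.
  echo "samtools merge merged.sorted.bam aligned.sorted.bam other.sorted.bam"
}

print_pipeline
```

Printing the commands keeps the sketch runnable on a machine without SAMtools, BWA or SAMBLASTER installed; the actual tool parameters are defined in the workflow's CWL files.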

 

Steps to reproduce

This analysis was run on a 16-core Linux cloud instance with 64 GB RAM and Docker pre-installed.

  1. Install gsutil
     

    export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
    
    echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | \
      sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
    
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
      sudo apt-key add -
    
    sudo apt-get update && sudo apt-get install google-cloud-sdk

     

  2. Get the data and make the analysis environment ready:
     

    git clone https://github.com/FarahZKhan/topmed-workflows.git
    cd topmed-workflows
    git checkout cwlprov_testing
    cd aligner/sbg-alignment-cwl
    
    # this custom script downloads the Google bucket files listed in a JSON file and creates a local JSON file
    # it requires gsutil to be installed
    git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git
    
    # Wait... this should download ~18 GB.
    python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json
    

     

  3. Run the following commands to create the CWLProv Research Object:

    time cwltool --no-match-user --provenance alignment_0.6.0_linux \
      --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp \
      --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp \
      topmed-alignment.cwl topmed-alignment.sample.json.new
    
    zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux
    
    sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha25
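
A checksum file written with sha256sum can later be used to confirm that a downloaded copy of the archive is intact. The sketch below shows one way to write and check such a file. The helper names write_checksum and verify_checksum and the .sha256 suffix are assumptions of this sketch, not names used by the dataset.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for writing and checking a SHA-256 checksum file.

# Write "<hash>  <file>" to <file>.sha256 (the suffix is an assumption).
write_checksum() {
  sha256sum "$1" > "$1.sha256"
}

# Recompute the hash and compare it with the stored one.
# Exits non-zero if the file has changed or is missing.
verify_checksum() {
  sha256sum -c "$1.sha256"
}
```

For example, write_checksum alignment_0.6.0_linux.zip followed by verify_checksum alignment_0.6.0_linux.zip, run from the same directory, reports OK for an unmodified archive and a failure otherwise.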

     

Notes

Mirror of Mendeley Data upload https://data.mendeley.com/datasets/6wtpgr3kbj/1

Files

alignment_0.6.0_linux.zip

Files (6.9 GB)

Name Size
md5:31f25b032f271f76f7918bbcaf809286 6.9 GB
md5:9b868337a8658e03e70180ec1d25b94f 92 Bytes
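
The md5 values listed above can be compared against a local download. The following is a minimal sketch; the function name check_md5 is hypothetical, and pairing the 6.9 GB entry with alignment_0.6.0_linux.zip is an assumption based on the file size.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file's MD5 digest matches the expected value.
check_md5() {
  local file="$1" expected="$2" actual
  actual="$(md5sum "$file" | awk '{print $1}')"
  [ "$actual" = "$expected" ]
}

# Example (assumes the archive is in the current directory):
# check_md5 alignment_0.6.0_linux.zip 31f25b032f271f76f7918bbcaf809286 && echo "md5 matches"
```

MD5 is adequate here for detecting a corrupted transfer, though it is not a safeguard against deliberate tampering.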

Additional details

Funding

BioExcel – Centre of Excellence for Biomolecular Research 675728
European Commission