CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)
Creators
- 1. Department of Computing and Information Systems, The University of Melbourne, Australia
- 2. School of Computer Science, The University of Manchester, UK
Description
This dataset folder is a CWLProv Research Object that captures the provenance of a Common Workflow Language (CWL) workflow execution. See CWLProv 0.6.0 for the format, or use the cwlprov Python tool to explore it.
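Before exploring the Research Object, it can help to confirm that the unpacked folder has the expected shape. The following is a minimal sketch, not part of the original record: `check_ro_layout` is a hypothetical helper name, and the list of entries it checks is an assumption based on the BagIt-style layout that cwltool writes for CWLProv Research Objects.

```shell
# Hypothetical helper: report entries that a CWLProv Research Object is
# expected to contain but that are missing from the given folder.
# The entry list is an assumption, not taken from this record.
check_ro_layout() {
    ro_dir=$1
    missing=0
    for entry in bagit.txt data metadata/manifest.json workflow/packed.cwl; do
        if [ ! -e "$ro_dir/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    # Exit status 0 when everything was found, 1 otherwise.
    return "$missing"
}
```

After unzipping the archive, the helper would be called with the unpacked folder as its only argument, for example `check_ro_layout alignment_0.6.0_linux`.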
The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. This workflow is part of the NIH Data Commons initiative and comprises four stages.
The first stage, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human genome reference sequence as input. Using SAMtools utilities such as view, sort and fixmate, it returns a list of fastq files that serve as input for the next stage.
The next stage, Align, also accepts the human reference genome as input, along with the output files from Pre-align, and uses BWA-mem to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads, and SAMtools view converts the read files from SAM to BAM format.
The BAM files generated by Align are then sorted with SAMtools sort.
Finally, in the Post-align stage, these sorted alignment files are merged with SAMtools merge to produce a single sorted BAM file.
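The four stages can be sketched as shell functions to make the data flow concrete. This is an illustration only: the tool names come from the description above, but the function names, flags, and file arguments are assumptions and are not the command lines the CWL workflow actually runs. In particular, this sketch writes one interleaved fastq file, whereas the workflow returns a list of fastq files.

```shell
# Illustrative sketch; flags and file names are assumptions, not the
# workflow's real command lines. Requires samtools, bwa and samblaster.

# Pre-align: CRAM -> name-sorted reads -> fixed mates -> interleaved fastq.
pre_align() {   # usage: pre_align in.cram ref.fa out.fastq
    samtools view -u -h -T "$2" "$1" \
        | samtools sort -n - \
        | samtools fixmate - - \
        | samtools fastq - > "$3"
}

# Align: BWA-MEM alignment, duplicate marking, SAM -> BAM conversion.
align() {       # usage: align ref.fa in.fastq out.bam
    bwa mem -p "$1" "$2" \
        | samblaster \
        | samtools view -b -o "$3" -
}

# Sort: coordinate-sort one aligned BAM file.
sort_bam() {    # usage: sort_bam in.bam out.sorted.bam
    samtools sort -o "$2" "$1"
}

# Post-align: merge all sorted BAM files into a single BAM file.
post_align() {  # usage: post_align merged.bam sorted1.bam [sorted2.bam ...]
    out=$1
    shift
    samtools merge "$out" "$@"
}
```

Each function consumes the output of the previous one, which mirrors how the workflow stages are chained.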
Steps to reproduce
This analysis was run on a 16-core Linux cloud instance with 64 GB of RAM and Docker pre-installed.
- Install gsutil:
    export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"
    echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | \
        sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
        sudo apt-key add -
    sudo apt-get update && sudo apt-get install google-cloud-sdk
- Get the data and prepare the analysis environment:
    git clone https://github.com/FarahZKhan/topmed-workflows.git
    cd topmed-workflows
    git checkout cwlprov_testing
    cd aligner/sbg-alignment-cwl
    # This custom script downloads the Google bucket files listed in a JSON file and creates a local JSON file.
    # It needs gsutil to be installed.
    git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git
    # Wait... this should download ~18 GB.
    python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json
- Run the following commands to create the CWLProv Research Object:
    time cwltool --no-match-user --provenance alignmnentwf0.6.0 \
        --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp \
        --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp \
        topmed-alignment.cwl topmed-alignment.sample.json.new
    zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux
    sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha25
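The checksum file written by the last command lets anyone who later obtains the archive confirm that it is intact. A minimal sketch of that check, assuming GNU coreutils; `verify_archive` is a hypothetical helper name and is not part of the original steps.

```shell
# Hypothetical helper: verify an archive against a checksum file that was
# produced by `sha256sum archive > checksum_file`.
verify_archive() {
    checksum_file=$1
    # sha256sum -c resolves the recorded file name relative to the current
    # directory, so run the check from the folder holding the checksum file.
    (
        cd "$(dirname "$checksum_file")" &&
        sha256sum -c "$(basename "$checksum_file")"
    )
}
```

The helper takes the path of the checksum file as its only argument. It prints one status line per recorded file and returns a non-zero exit status if any checksum does not match.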
Files (6.9 GB)
Name | Size |
---|---|
alignment_0.6.0_linux.zip (md5:31f25b032f271f76f7918bbcaf809286) | 6.9 GB |
md5:9b868337a8658e03e70180ec1d25b94f | 92 Bytes |
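The md5 values listed for the files can be compared against a downloaded copy. A minimal sketch, assuming GNU coreutils; `check_md5` is a hypothetical helper name, and it accepts the checksum with or without the `md5:` prefix used in the listing.

```shell
# Hypothetical helper: succeed only if the file's MD5 digest equals the
# expected value. A leading "md5:" on the expected value is ignored.
check_md5() {
    file=$1
    expected=${2#md5:}
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
}
```

For the archive, this would be called as `check_md5 alignment_0.6.0_linux.zip md5:31f25b032f271f76f7918bbcaf809286`. An exit status of 0 means the download matches the published checksum.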