CWL run of Alignment Workflow (CWLProv 0.6.0 Research Object)

Farah Zaib Khan; Stian Soiland-Reyes

doi:10.17632/6wtpgr3kbj.1

Published December 4, 2018 | Version 1

Dataset Open

1. Department of Computing and Information Systems, The University of Melbourne, Australia
2. School of Computer Science, The University of Manchester, UK

This dataset folder is a CWLProv Research Object that captures the provenance of a Common Workflow Language (CWL) workflow execution. See CWLProv 0.6.0 for the format, or use the cwlprov Python tool to explore it.
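As a rough illustration of exploring the Research Object with the cwlprov tool, the sketch below wraps a few of its subcommands in a shell function. It assumes the tool was installed with `pip install cwlprov` and that the `validate`, `info` and `runs` subcommands and the `--directory` option behave as in the cwlprov-py command-line help; the function name `explore_ro` is hypothetical.

```shell
#!/bin/sh
# Sketch: inspect an unpacked CWLProv Research Object with the cwlprov CLI.
# Assumption: subcommand names follow the cwlprov-py help text and may
# differ between releases.

explore_ro() (
    ro_dir="$1"

    if [ ! -d "$ro_dir" ]; then
        echo "Not a directory: $ro_dir" >&2
        exit 1
    fi
    if ! command -v cwlprov >/dev/null 2>&1; then
        echo "cwlprov not found on PATH (try: pip install cwlprov)" >&2
        exit 127
    fi

    # Stop early if the Research Object does not validate.
    cwlprov --directory "$ro_dir" validate || exit $?
    # Summarise the Research Object, then list the recorded workflow runs.
    cwlprov --directory "$ro_dir" info
    cwlprov --directory "$ro_dir" runs
)

# Usage, after unzipping the archive:
#   explore_ro alignment_0.6.0_linux
```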

The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. The workflow is part of the NIH Data Commons initiative and comprises four stages.

The first step, Pre-align, accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and a human genome reference sequence as input. Using SAMtools utilities such as view, sort and fixmate, it returns a list of FASTQ files that serve as input for the next step.

The next step, Align, also accepts the human reference genome as input, along with the output files from Pre-align, and uses BWA-MEM to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads, and SAMtools view converts the read files from SAM to BAM format.

The BAM files generated by Align are then sorted with SAMtools sort.

Finally, in the Post-align step, these sorted alignment files are merged into a single sorted BAM file using SAMtools merge.
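The data flow through the four stages can be sketched as a dry run that only prints which tools turn which files into which outputs. Nothing is executed, every file name is a hypothetical placeholder, and the options the real workflow passes to each tool are not shown because they are not stated here.

```shell
#!/bin/sh
# Data-flow sketch of the four stages described above (prints only).
# Assumption: Pre-align emits one FASTQ file per name given on the
# command line; the names rg1, rg2 below are made up for illustration.

plan_alignment() (
    cram="$1"
    ref="$2"
    shift 2

    fastqs=""
    sorted=""
    for chunk in "$@"; do
        fastqs="$fastqs $chunk.fastq"
        sorted="$sorted $chunk.sorted.bam"
    done

    printf 'Pre-align   samtools view, sort, fixmate        %s + %s -> %s\n' \
        "$cram" "$ref" "${fastqs# }"
    for chunk in "$@"; do
        printf 'Align       bwa mem, samblaster, samtools view  %s.fastq + %s -> %s.bam\n' \
            "$chunk" "$ref" "$chunk"
        printf 'Sort        samtools sort                       %s.bam -> %s.sorted.bam\n' \
            "$chunk" "$chunk"
    done
    printf 'Post-align  samtools merge                      %s -> merged.sorted.bam\n' \
        "${sorted# }"
)

plan_alignment sample.cram ref.fa rg1 rg2
```

Each FASTQ file is aligned and sorted independently, so only the final merge needs all intermediate BAM files at once.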

Steps to reproduce

This analysis was run on a 16-core Linux cloud instance with 64 GB RAM and Docker pre-installed.
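Before starting, it may help to compare the local host with the instance described above. The following is a minimal sketch, assuming a Linux host with `nproc` and `/proc/meminfo`; the function names `host_meets` and `missing_tools` are hypothetical helpers, not part of any tool used here.

```shell
#!/bin/sh
# Sketch: compare this host with the machine used for the original run.
# Assumption: Linux, with nproc available and /proc/meminfo readable.

# host_meets MIN_CORES MIN_MEM_GB: succeeds if the host has at least that much.
host_meets() (
    min_cores="$1"
    min_mem_gb="$2"
    cores=$(nproc)
    # MemTotal is reported in kB; a "64 GB" instance usually reports a
    # little less than 64 here, so pick the threshold with some slack.
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    mem_gb=$((mem_kb / 1024 / 1024))
    echo "cores=$cores mem_gb=$mem_gb"
    [ "$cores" -ge "$min_cores" ] && [ "$mem_gb" -ge "$min_mem_gb" ]
)

# missing_tools TOOL...: prints every tool that is not on PATH.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
)

# Usage:
#   host_meets 16 60 || echo "smaller than the original instance"
#   missing_tools docker git python2.7
```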

Install gsutil:

export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"

echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | \
  sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | \
  sudo apt-key add -

sudo apt-get update && sudo apt-get install google-cloud-sdk

Get the data and make the analysis environment ready:

git clone https://github.com/FarahZKhan/topmed-workflows.git
cd topmed-workflows
git checkout cwlprov_testing
cd aligner/sbg-alignment-cwl

# This custom script downloads the Google bucket files listed in a JSON file and creates a local JSON file.
# It requires gsutil to be installed.
git clone https://github.com/DailyDreaming/fetch_gs_frm_json.git

# Wait... this should download ~18 GB.
python2.7 fetch_gs_frm_json/dl_gsfiles_frm_json.py topmed-alignment.sample.json

Run the following commands to create the CWLProv Research Object:

time cwltool --no-match-user --provenance alignment_0.6.0_linux --tmp-outdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp --tmpdir-prefix=/CWLProv_workflow_testing/intermediate_temp/temp topmed-alignment.cwl topmed-alignment.sample.json.new

zip -r alignment_0.6.0_linux.zip alignment_0.6.0_linux

sha256sum alignment_0.6.0_linux.zip > alignment_0.6.0_linux.zip.sha256
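A checksum file written by `sha256sum` can later be checked against its archive with `sha256sum -c`. The sketch below wraps that check; the function name `verify_sha256` is hypothetical, and it assumes the archive sits next to the checksum file.

```shell
#!/bin/sh
# Sketch: confirm that a .sha256 file still matches the archive it names.

verify_sha256() (
    checksum_file="$1"
    # sha256sum -c resolves the recorded file name relative to the current
    # directory, so run it from the directory holding the checksum file.
    cd "$(dirname "$checksum_file")" || exit 1
    sha256sum -c "$(basename "$checksum_file")"
)

# Usage:
#   verify_sha256 alignment_0.6.0_linux.zip.sha256
```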

Notes

Mirror of Mendeley Data upload https://data.mendeley.com/datasets/6wtpgr3kbj/1

Files

Files (6.9 GB)

Name	MD5	Size
alignment_0.6.0_linux.zip	31f25b032f271f76f7918bbcaf809286	6.9 GB
alignment_0.6.0_linux.zip.sha256	9b868337a8658e03e70180ec1d25b94f	92 Bytes
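After downloading, the MD5 values listed above can be compared with locally computed ones. This is a minimal sketch using `md5sum` from GNU coreutils; the helper name `md5_matches` is hypothetical.

```shell
#!/bin/sh
# Sketch: compare a file's MD5 digest with a published value.

md5_matches() (
    file="$1"
    expected="$2"
    # md5sum prints "<digest>  <name>"; keep only the digest.
    actual=$(md5sum "$file" | cut -d ' ' -f 1)
    [ "$actual" = "$expected" ]
)

# Usage:
#   md5_matches alignment_0.6.0_linux.zip 31f25b032f271f76f7918bbcaf809286 && echo OK
#   md5_matches alignment_0.6.0_linux.zip.sha256 9b868337a8658e03e70180ec1d25b94f && echo OK
```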

Additional details

European Commission
BioExcel – Centre of Excellence for Biomolecular Research 675728
