github.com/mgbpm/biofx-workflows/BamsurgeonWorkflow

mgbpm

doi:10.5281/zenodo.16414245

Published April 24, 2024 | Version Bamsurgeon_2.1.1

Running a bamsurgeon workflow

Background

Usage

bamsurgeon is a tool that introduces genomic variants (such as SNV, SV, and indel mutations) into BAM, CRAM, and SAM files. The resulting mutated BAM files can be used to test variant callers.

About the Workflow

Bamsurgeon was originally developed by Adam Ewing. This workflow adapts the original Dockerfile and Python scripts to run as a WDL workflow on Terra. All scripts used in the tasks of this workflow can be found in the MGBPM-IT Azure DevOps repo for bamsurgeon.

bamsurgeon has three main scripts, one for each of the three categories of mutations that can be introduced into BAM files: addsnv.py, addindel.py, and addsv.py. Each mutation type has its own optional workflow inputs, most of which have defaults listed in the tables under the "Input Variables" section below.

Input Parameters for all Mutation Types

Note that BWA assumes all genome reference files are conventionally named. The reference files found under the hg38 reference data in Terra meet this assumption. Apart from the FASTA, these reference files are not used explicitly in the WDL, but they must be present in the container for BWA to run.

Input Parameters for SNV-Specific Mutations

The information to be included in the snv_bed_input array includes:

"chr": chromosome containing mutation
"start": start position of mutation
"end": end position of mutation
"vaf": variant allele frequency
"base": nucleotide to mutate to (optional)
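
As an illustration of the fields above, the following Python sketch builds two snv_bed_input entries and prints them as JSON. It assumes that each array element is an object keyed by the field names listed above, that "vaf" lies in (0, 1], and that "base" is one of A, C, G, or T; the helper name make_snv_entry and the coordinates are hypothetical, and the exact input structure expected by the workflow should be checked against the WDL.

```python
import json

VALID_BASES = {"A", "C", "G", "T"}


def make_snv_entry(chrom, start, end, vaf, base=None):
    """Build one snv_bed_input entry (hypothetical helper, not part of the workflow)."""
    if start > end:
        raise ValueError("start must not exceed end")
    # Assumption: a variant allele frequency is a fraction in (0, 1].
    if not 0.0 < vaf <= 1.0:
        raise ValueError("vaf must be in (0, 1]")
    entry = {"chr": chrom, "start": start, "end": end, "vaf": vaf}
    # "base" is optional; it is only written when the caller supplies it.
    if base is not None:
        if base.upper() not in VALID_BASES:
            raise ValueError("base must be one of A, C, G, T")
        entry["base"] = base.upper()
    return entry


# Hypothetical coordinates, for illustration only.
snv_bed_input = [
    make_snv_entry("chr1", 1000000, 1000000, 0.5, "t"),
    make_snv_entry("chr2", 2500000, 2500000, 0.25),
]

print(json.dumps({"snv_bed_input": snv_bed_input}, indent=2))
```

The second entry omits "base", which matches its optional status in the list above.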

Input Parameters for Indel-Specific Mutations

The information to be included in the indel_bed_input array includes:

"chr": chromosome containing mutation
"start": start position of mutation
"end": end position of mutation
"vaf": variant allele frequency
"mut_type": either "INS" for insertions or "DEL" for deletions
"insert_seq": insertion sequence for "INS" mutations (optional)
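
The indel fields can be sketched the same way. The Python example below builds indel_bed_input entries and flattens one into a tab-delimited line. The helper names are hypothetical, and the column order of the flattened line simply mirrors the order of the list above; it is an assumption, not a statement of what the workflow writes internally.

```python
def make_indel_entry(chrom, start, end, vaf, mut_type, insert_seq=None):
    """Build one indel_bed_input entry (hypothetical helper)."""
    if mut_type not in ("INS", "DEL"):
        raise ValueError('mut_type must be "INS" or "DEL"')
    # insert_seq only makes sense for insertions.
    if insert_seq is not None and mut_type != "INS":
        raise ValueError('insert_seq is only valid for "INS" mutations')
    entry = {"chr": chrom, "start": start, "end": end,
             "vaf": vaf, "mut_type": mut_type}
    if insert_seq is not None:
        entry["insert_seq"] = insert_seq
    return entry


def to_bed_line(entry):
    """Flatten an entry into one tab-delimited line (assumed column order)."""
    fields = [entry["chr"], str(entry["start"]), str(entry["end"]),
              str(entry["vaf"]), entry["mut_type"]]
    if "insert_seq" in entry:
        fields.append(entry["insert_seq"])
    return "\t".join(fields)


# Hypothetical coordinates, for illustration only.
insertion = make_indel_entry("chr1", 1000, 1001, 0.5, "INS", "ACGT")
deletion = make_indel_entry("chr2", 2000, 2010, 0.3, "DEL")

print(to_bed_line(insertion))
print(to_bed_line(deletion))
```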

Note: When running the workflow to generate indels, it is most reliable to run simple insertions using the "indel" inputs and to run deletions using the "sv" inputs.

Input Parameters for SV-Specific Mutations

SV mutation runs can consist of inversions, deletions, duplications, translocations, and insertions. A mutation will not be made if the largest contig obtained from local assembly of the specified regions is less than three times the maximum library size (max_lib_size). Though small insertions and deletions can be done in sv mutation runs, it is recommended to do an indel run instead.
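
The contig-length rule above is simple arithmetic. As a minimal Python sketch, using the max_lib_size default of 600 from the SV inputs table (the function name is hypothetical and this is not the check as implemented in addsv.py):

```python
def contig_long_enough(contig_len, max_lib_size=600):
    """Return True if the largest assembled contig meets the 3x rule."""
    # A mutation is skipped when the contig is shorter than 3 * max_lib_size.
    return contig_len >= 3 * max_lib_size


# With the default max_lib_size of 600, the threshold is 1800 bp.
print(contig_long_enough(1799))
print(contig_long_enough(1800))
```

Raising max_lib_size raises the threshold proportionally, so a region that works with the default may be skipped with a larger library size.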

| Type | Name | Req'd | Description | Default Value |
| :--- | :--- | :---: | :--- | :--- |
| Int | max_lib_size | No | Maximum fragment length of sequence library | 600 |
| Int | kmer_size | No | Kmer size for assembly | 31 |
| Float | sv_frac | No | Allele fraction of variant | 1.0 |
| Int | min_contig_gen | No | Minimum length for contig generation, also used to pad assembly | 4000 |
| File | cnv_file | No | Tabix-indexed list of genome-wide absolute copy number values | |
| File | donor_bam | No | BAM file for donor reads if using "BIGDUP" mutations | |
| File | donor_bai | No | BAM index file for donor reads if using "BIGDUP" mutations | |
| Int | mean_insert_size | No | Mean insert size | 300 |
| Int | insert_size_stdev | No | Insert size standard deviation | 70 |
| Float | sim_err | No | Error rate for wgsim-generated reads | |
| File | insert_library | No | FASTA file containing library of possible insertions; use "INS RND" instead of "INS filename" to pick one | |
| Boolean | tag_reads | No | Add BS tag to altered reads | false |
| Boolean | keep_secondary_reads | No | Keep secondary reads in final BAM | false |
| String | seed | No | Seed for random number generation | |

The information in the sv_bed_input array should include:

"chr": chromosome containing mutation
"start": start position of mutation
"end": end position of mutation
"mut_type":
- "INS" for insertions
- "DEL" for deletions
- "BIGDEL" for deletions greater than 5 kbp in length
- "DUP" for duplications
- "BIGDUP" for duplications greater than 5 kbp
- "INV" for inversions
- "BIGINV" for inversions greater than 5 kbp in length
- "TRN" for translocation
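
The size-dependent labels above can be chosen programmatically. The Python sketch below assumes that the length of a region is end minus start and that 5 kbp means 5000 bp; both are assumptions, and the function name is hypothetical.

```python
BIG_THRESHOLD_BP = 5000  # assumption: "5 kbp" taken as 5000 bp

SIZE_DEPENDENT = {"DEL", "DUP", "INV"}
SIZE_INDEPENDENT = {"INS", "TRN"}


def sv_mut_type(kind, start, end):
    """Pick the mut_type label for a region, adding the BIG prefix when needed."""
    if kind in SIZE_INDEPENDENT:
        return kind
    if kind not in SIZE_DEPENDENT:
        raise ValueError(f"unknown SV kind: {kind}")
    length = end - start  # assumption: region length is end minus start
    # The BIG variants apply only to regions strictly greater than 5 kbp.
    return "BIG" + kind if length > BIG_THRESHOLD_BP else kind


print(sv_mut_type("DEL", 100, 5100))
print(sv_mut_type("DEL", 100, 5101))
```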

Additional information in sv_input array for translocations:

"vaf": variant allele frequency
"trans_chr": chromosome for translocation
"trans_start": start of translocation
"trans_end": end of translocation
"trans_on_chr": strand (+ or -) to translocate to
"trans_from_chr": strand (+ or -) to translocate from

Additional information in sv_input array for insertions:

"insert_seq": insertion sequence
"vaf": variant allele frequency
"target_site_len": length of the target site duplication (TSD) to simulate
"cut_site_motif": a sequence of bases with the syntax NNN^NNN, where the bases after the caret are the motif; for simulating endonuclease motifs

For insertions, the sequence to insert can be specified using: a string of the sequence itself, a FASTA file containing the sequence to insert, or an RND library of potential insertions (using the insert_library input). You can simulate target site duplications by including an integer value that specifies the length. Endonuclease motifs can also be simulated by defining an insertion entry with the syntax NNN^NNN, where NNN denotes a sequence of any length, and the cut site motif is denoted by the caret sequence.
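
Before submitting a run, it can help to confirm that a cut_site_motif value follows the NNN^NNN syntax. The Python sketch below only validates the string and splits it at the caret. It assumes that the bases are drawn from A, C, G, T, and N and that at least one base appears on each side of the caret; neither is confirmed by the workflow, and the function name is hypothetical.

```python
import re

# Assumption: one or more bases (A, C, G, T, N) on each side of a single caret.
_MOTIF_RE = re.compile(r"^([ACGTN]+)\^([ACGTN]+)$")


def split_cut_site_motif(motif):
    """Validate an NNN^NNN string and return the bases before and after the caret."""
    match = _MOTIF_RE.match(motif.upper())
    if match is None:
        raise ValueError(f"not in NNN^NNN form: {motif!r}")
    return match.group(1), match.group(2)


before, after = split_cut_site_motif("AA^TTTT")
print(before, after)
```

How each side of the caret is interpreted during the run is left to bamsurgeon itself; this check only catches malformed strings early.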

Additional information in sv_input array for deletions and inversions:

"vaf": variant allele frequency

Note: When running the workflow to generate indels, it is most reliable to run simple insertions using the "indel" inputs and to run deletions using the "sv" inputs.

Additional information in sv_input for duplications:

"vaf": variant allele frequency
"num_dupes": number of times the sequence should be duplicated

Output Parameters

Files