There is a newer version of the record available.

Published September 7, 2022 | Version 0.1
Journal article | Open

TandemAligner: a new parameter-free framework for fast sequence alignment

Description

Recent advances in “complete genomics” have revealed previously inaccessible genomic regions (such as centromeres) and enabled analysis of their associations with diseases. However, analysis of variations in centromeres, immunoglobulin loci, and other extra-long tandem repeats (ETRs) faces an algorithmic bottleneck, since there are currently no tools for accurate sequence comparison of ETRs. Counterintuitively, classical alignment approaches such as the Smith-Waterman algorithm, which work well for most sequences, fail to construct biologically adequate alignments of ETRs. This limitation was overlooked in previous studies because the sequences of centromeres and other ETRs across multiple genomes became available only in the last year. We present TandemAligner, the first parameter-free algorithm for sequence alignment, which introduces sequence-dependent alignment scoring that automatically changes for any pair of compared sequences. We apply TandemAligner to various centromeres, arrive at the first accurate estimate of the mutation rates in human centromeres, and quantify the extremely high rate of large insertions/deletions in centromeres. The codebase of TandemAligner is available at https://github.com/seryrzu/tandem_aligner.
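One way to see why fixed-score alignment is problematic for tandem repeats is to count how many alignments tie for the optimal score. The sketch below is only an illustration under stated assumptions: it uses plain unit-cost edit distance (not Smith-Waterman and not TandemAligner's scoring), a toy repeat unit "ACGT", and a hypothetical helper name `count_optimal_alignments`. When one repeat copy is deleted, many placements of the deletion score equally well, so the classical score alone cannot say which copy was actually lost.

```python
def count_optimal_alignments(s, t):
    """Return (edit distance, number of distinct optimal alignment paths).

    Uses unit costs for mismatch, insertion, and deletion. This is a
    generic textbook dynamic program, not the TandemAligner algorithm.
    """
    n, m = len(s), len(t)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    ways = [[0] * (m + 1) for _ in range(n + 1)]
    ways[0][0] = 1  # exactly one way to align two empty prefixes

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []  # (cost, number of paths) for each predecessor
            if i > 0:  # delete s[i-1]
                candidates.append((dist[i - 1][j] + 1, ways[i - 1][j]))
            if j > 0:  # insert t[j-1]
                candidates.append((dist[i][j - 1] + 1, ways[i][j - 1]))
            if i > 0 and j > 0:  # match or mismatch
                cost = dist[i - 1][j - 1] + (s[i - 1] != t[j - 1])
                candidates.append((cost, ways[i - 1][j - 1]))
            best = min(cost for cost, _ in candidates)
            dist[i][j] = best
            ways[i][j] = sum(w for cost, w in candidates if cost == best)

    return dist[n][m], ways[n][m]


# Toy tandem repeat: four copies versus three copies of the same unit.
repeat_distance, repeat_ties = count_optimal_alignments("ACGT" * 4, "ACGT" * 3)
print(repeat_distance, repeat_ties)
```

The number of tied alignments grows with the number of repeat copies, which is consistent with the ambiguity the abstract describes for extra-long tandem repeats.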

The uploaded gzip archive includes the sequences of the human centromeres considered in the study and the alignments of these sequences produced by TandemAligner.

Files (21.9 MB)

md5: 441df469e297a115a24be03f6377da60
Size: 21.9 MB
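After downloading, the archive can be checked against the MD5 checksum listed above. The following is a minimal sketch using only the Python standard library. The record does not show the archive's file name, so the path passed to `matches_record` is left to the reader, and the helper names are hypothetical.

```python
import hashlib

# Checksum as listed in this record's file section.
EXPECTED_MD5 = "441df469e297a115a24be03f6377da60"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's MD5 equals the checksum in the record."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling `matches_record` with the local path of the downloaded archive returns `True` only if the file is byte-identical to the one described by the listed checksum.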