Algorithms for determining transposable genes in a genome

Wang, Yue

doi:10.5061/dryad.9zw3r22j3

Published November 28, 2022 | Version v1

Dataset Open

Wang, Yue¹

1. University of California Los Angeles

Transposons are nucleotide sequences in DNA that can change their positions. Many transposons are shorter than a typical gene. When we restrict attention to nucleotide sequences that form complete genes, we can still find genes that change their relative locations in a genome. Thus, for different individuals of the same species, the order of genes might differ. A practical problem is to determine such transposable genes in given gene sequences. Through an intuitive rule, we transform the biological problem of determining transposable genes into a rigorous mathematical problem of determining the longest common subsequence. Depending on whether the gene sequence is linear (each sequence has a fixed head and tail) or circular (we can choose any gene as the head, and the previous one is the tail), and whether genes have multiple copies, we classify the problem of determining transposable genes into four scenarios: (1) linear sequences without duplicated genes; (2) circular sequences without duplicated genes; (3) linear sequences with duplicated genes; (4) circular sequences with duplicated genes. With the help of graph theory, we design fast algorithms for the different scenarios. In particular, we study the situation in which the longest common subsequence is not unique.
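To make the reduction to the longest common subsequence (LCS) concrete, the following is a minimal sketch for the two scenarios without duplicated genes. It is not the code in NewScenario1.py or NewScenario2.py, and it does not use the graph-theoretic algorithms of the paper; it uses the textbook dynamic-programming LCS and a brute-force scan over rotations. The function names and the gene labels `g1`...`g5` are hypothetical, and reading "genes outside one LCS" as candidate transposable genes is an assumption made for illustration.

```python
from typing import List, Sequence, Set


def longest_common_subsequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one LCS of a and b using the classic O(len(a) * len(b)) DP."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to recover one optimal subsequence.
    out: List[str] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def candidate_transposable_genes(a: Sequence[str], b: Sequence[str]) -> Set[str]:
    """Genes outside one LCS of two linear sequences (assumes no duplicated genes)."""
    kept = set(longest_common_subsequence(a, b))
    return (set(a) | set(b)) - kept


def circular_lcs(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Brute-force circular case: fix a, try every rotation of b, keep the best LCS."""
    best: List[str] = []
    for k in range(len(b)):
        rotated = list(b[k:]) + list(b[:k])
        cand = longest_common_subsequence(a, rotated)
        if len(cand) > len(best):
            best = cand
    return best


# Hypothetical gene orders for two individuals (linear, no duplicates).
seq_a = ["g1", "g2", "g3", "g4", "g5"]
seq_b = ["g1", "g3", "g4", "g2", "g5"]
print(candidate_transposable_genes(seq_a, seq_b))  # {'g2'}
```

The sketch returns only one LCS. When the LCS is not unique, which genes fall outside it depends on tie-breaking in the traceback, so the non-unique case needs the more careful treatment described above.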

This dataset contains code files for the corresponding algorithms. It also includes gene sequence data for several Escherichia coli strains (from NCBI), which are used to test the algorithms.

Notes

This repository contains Python code for all algorithms in my paper https://arxiv.org/abs/1506.02424.

NewScenario1.py implements Algorithms 1 and 2 for Scenario 1.

NewScenario2.py implements Algorithms 3 and 4 for Scenario 2.

Scenario3.py implements Algorithm 5 for Scenario 3.

Scenario4.py implements Algorithm 6 for Scenario 4.

NewScenario1 test.py runs Algorithms 1 and 2 on real data.

NewScenario2 test.py runs Algorithms 3 and 4 on real data.

S3test.py tests the performance of Algorithm 5 for Scenario 3 on various random graphs.

S4test.py tests the performance of Algorithm 6 for Scenario 4 on various random graphs.

CPxxxx.txt are processed gene sequences, used in tests of Scenarios 1 and 2.

Escherichia coli xxxx.txt are original annotation files, used to generate CPxxxx.txt.

Process ST540.py converts three Escherichia coli xxxx.txt files into CPxxxx.txt files.

Process ST2747.py converts three Escherichia coli xxxx.txt files into CPxxxx.txt files.

Scenario1.py (outdated!) implements Algorithms 1 and 2 for Scenario 1.

Scenario2.py (outdated!) implements Algorithms 3 and 4 for Scenario 2.

Files

Files (29.2 MB)

Name	MD5	Size
CP007265.1.txt	edcc3396e78b036fbe96906853c4437b	3.4 kB
CP007390.1.txt	548fe36a3faee63a46c90a76b2f68e69	3.4 kB
CP007391.1.txt	a69b2bcc709845d3757b536d387e79c8	3.4 kB
CP007392.1.txt	322b99020851f536fe4de6d9cd783492	3.4 kB
CP007393.1.txt	322b99020851f536fe4de6d9cd783492	3.4 kB
CP007394.1.txt	322b99020851f536fe4de6d9cd783492	3.4 kB
Escherichia_coli_strain_ST2747_GenBank_CP007392.1.txt	89441f1757c0cd3ebd8a9ce423bd24c2	5.0 MB
Escherichia_coli_strain_ST2747_GenBank_CP007393.1.txt	27d8415dfa97cd43924e6a3ef250e27f	4.9 MB
Escherichia_coli_strain_ST2747_GenBank_CP007394.1.txt	034172627623b4ad90108b7fb003f8ab	5.0 MB
Escherichia_coli_strain_ST540_GenBank_CP007265.1.txt	638a4eed6c0cf952a4014cdc10a155e0	4.7 MB
Escherichia_coli_strain_ST540_GenBank_CP007390.1.txt	4de113ec81f5ba09776140ff86d16547	4.8 MB
Escherichia_coli_strain_ST540_GenBank_CP007391.1.txt	b4416c1f9f9389c218e1515a6bb52a31	4.8 MB
README.md	20e23607545e5a3415c0f79fb3fea0ce	1.1 kB
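
The MD5 values listed for the files can be used to check that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_md5` are hypothetical and are not part of the dataset's code.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, assuming CP007265.1.txt has been downloaded into the working directory, `verify_md5("CP007265.1.txt", "edcc3396e78b036fbe96906853c4437b")` should return `True` if the file matches the listed checksum.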

Additional details

Is derived from: 10.5281/zenodo.7336046 (DOI)

	All versions	This version
Views	115	113
Downloads	30	30
Data volume	87.5 MB	87.5 MB
