Dataset Open Access

A phased genome assembly for allele-specific analysis in Trypanosoma brucei

Cosentino, Raúl Oscar; Brink, Benedikt; Siegel, T. Nicolai

This repository contains the data analysis workflows, the supplementary tables, and the genome and annotation versions from the manuscript entitled "A phased genome assembly for allele-specific analysis in Trypanosoma brucei" (https://doi.org/10.1101/2021.04.13.439624).

Because of space limitations on Zenodo, the full datasets could not be uploaded for some workflows. For those workflows, we provide the directory tree of the complete data analysis folder (the *_complete_tree.txt files).

Abstract

Many eukaryotic organisms are diploid or even polyploid, i.e. they harbour two or more independent copies of each chromosome. Yet, to date most reference genome assemblies represent a mosaic consensus sequence in which the homologous chromosomes have been collapsed into one sequence. This procedure generates sequence artefacts and impedes analyses of allele-specific mechanisms. Here, we report the allele-specific genome assembly of the diploid unicellular protozoan parasite Trypanosoma brucei.

As a first step, we called variants on the allele-collapsed assembly of the T. brucei Lister 427 isolate using short-read error-corrected PacBio reads. We identified ~96 thousand heterozygote variants across the genome (average of 4.2 variants / kb), and observed that the variant density along the chromosomes was highly uneven. Several long (>100 kb) regions of loss-of-heterozygosity (LOH) were identified, suggesting recent recombination events between the alleles. By analysing available genomic sequencing data of multiple Lister 427 derived clones, we found that most LOH regions were conserved, except for some that were specific to clones adapted to the insect lifecycle stage. Surprisingly, we also found that some Lister 427 clones were aneuploid. We found evidence of trisomy in chromosome five (Chr5), Chr2, Chr6 and Chr7. Moreover, by analysing RNA-seq data, we showed that the transcript level is proportional to the ploidy, evidencing the lack of a general expression control at the transcript level in T. brucei.
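The idea of uneven variant density and long low-density (candidate LOH) stretches can be illustrated with a minimal sketch. This is not the authors' workflow: the function names, the 10 kb window, and the 0.1 variants / kb cutoff are assumptions chosen for illustration only; the >100 kb minimum length is the only value taken from the text above.

```python
from collections import Counter


def variant_density_per_kb(positions, chrom_length, window=10_000):
    """Variants per kb in consecutive windows (positions are 1-based)."""
    counts = Counter((pos - 1) // window for pos in positions)
    n_windows = -(-chrom_length // window)  # ceiling division
    densities = []
    for i in range(n_windows):
        start = i * window
        end = min(start + window, chrom_length)  # last window may be shorter
        densities.append(counts.get(i, 0) / ((end - start) / 1000))
    return densities


def low_density_runs(densities, chrom_length, window=10_000,
                     max_density=0.1, min_length=100_000):
    """Return 0-based half-open (start, end) spans of consecutive windows
    whose density is <= max_density and that cover at least min_length bp.
    max_density is an illustrative cutoff, not a published threshold."""
    runs = []
    run_start = None
    # A sentinel value closes a run that reaches the chromosome end.
    for i, density in enumerate(list(densities) + [float("inf")]):
        if density <= max_density:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            start = run_start * window
            end = min(i * window, chrom_length)
            if end - start >= min_length:
                runs.append((start, end))
            run_start = None
    return runs


if __name__ == "__main__":
    # Toy chromosome: 4 variants / kb, except a 150 kb variant-free stretch.
    toy = list(range(1, 200_001, 250)) + list(range(350_001, 400_001, 250))
    dens = variant_density_per_kb(toy, chrom_length=400_000)
    print(low_density_runs(dens, chrom_length=400_000))
```

In practice the variant positions would come from a VCF file, and the window size and cutoff would need tuning against the coverage and variant-calling settings actually used.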

As a second step, to generate an allele-specific genome assembly, we used two powerful data types for haplotype reconstruction: raw long reads (PacBio) and chromosome conformation (Hi-C) data. With this approach, we were able to assign 99.5% of all the heterozygote variants to a specific homologous chromosome, building a 66 Mb long T. brucei Lister 427 allele-specific genome assembly. Using this assembly, we identified genes with allele-specific premature termination codons and showed that differences in allele-specific expression at the level of transcription and translation can be accurately monitored with the fully phased genome assembly.

The resulting reference-grade allele-specific genome assembly of T. brucei will enable the analysis of allele-specific phenomena, as well as a better understanding of recombination and evolutionary processes. Furthermore, it will serve as a standard to ‘benchmark’ much-needed automatic genome assembly pipelines for highly heterozygous wild species isolates.

The work was funded by an ERC Starting Grant (3D_Tryps 715466). R.O.C. was supported by a Georg Forster Fellowship (Humboldt Foundation).

Files (1.0 GB)

Name, Size, MD5 checksum
01_Genome_correction_pipeline.tar.gz, 71.9 MB, md5:e77696985f1506f6ede8e04724ae36a1
01_Genome_correction_pipeline_complete_tree.txt, 220.0 kB, md5:5468c0ee0ea61b468f351d24e4907c99
02_Genome_phasing_pipeline.tar.gz, 38.1 MB, md5:57aafcf8e64fc8c3fcb520b68805f673
02_Genome_phasing_pipeline_complete_tree.txt, 804.6 kB, md5:25422d45a4bc8ba87d87bc68f3f15626
03_Annotate_genome_pipeline.tar.gz, 68.6 MB, md5:d3d8755aeef83f0eccd5b7fb30cb3d44
04_Genome_annotation_conversion_v9.9_to_v10.tar.gz, 163.6 MB, md5:3700033efc1f10caffba38f9653f88e7
05_Intron_annotation_correction.tar.gz, 47.0 MB, md5:a7baa33efffe4ba1ab406a8bf5ea5ec1
06_Fully_phased_scaffold.tar.gz, 59.5 MB, md5:c7718d04a838b01c075afad3e8fbf923
07_Pre-processing_Annotation_comparison_among_genome_versions.tar.gz, 200.4 MB, md5:f056df71010d256c304d766f87569d64
08_Compare_annotated_proteome.tar.gz, 5.2 MB, md5:731a9e4e7c24a328bc60ab0f4f900a3c
09_Pre-processing_variant_density_and_ploidy_assessment.tar.gz, 47.3 MB, md5:b2385a982665d2caf43eef939c6750a2
09_Pre-processing_variant_density_and_ploidy_assessment_complete_tree.txt, 1.2 MB, md5:1b9c62431eb7c42874dcd97eac956b9e
10_Variant_density_and_ploidy_analysis.tar.gz, 36.0 MB, md5:244ae000b809e666177b935edadc1e80
11_Transcription_and_translation_in_aneuploid_clones.tar.gz, 19.3 MB, md5:ab3027dbbd93e9b99e5271562e440493
11_Transcription_and_translation_in_aneuploid_clones_complete_tree.txt, 22.4 kB, md5:6f782cc5a1656c195c2700bb9f3dfa0e
12_Pre-processing_expression_in_allele_specific_truncated_genes.tar.gz, 38.3 MB, md5:c0360d4511f8bd85e629fb848cb3c7f1
12_Pre-processing_expression_in_allele_specific_truncated_genes_complete_tree.txt, 4.0 kB, md5:83f744d2ed65a96b32f39be2ac383e39
13_Expression_on_genes_with_allele_specific_premature_termination_codons.tar.gz, 5.7 MB, md5:053d5b5fdba9bb4ff16c7e6e057f4d4c
14_HiC_on_genome_with_phased_core.tar.gz, 92.5 MB, md5:616279c200e6c6d9752ad239a5c3b26d
14_HiC_on_genome_with_phased_core_complete_tree.txt, 18.7 kB, md5:30abcd59d702f508961e49258c0643bc
Genome_versions_and_annotation_files.zip, 131.9 MB, md5:aedbe2cb1c689d0b33b1d7018aa55103
Supplementary_Tables.zip, 759.6 kB, md5:abe5b800f12f126f6daba140785da616
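
The md5 values listed with each file above can be used to check that a download is intact. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_download` are hypothetical, and the example assumes the archive has already been saved to the current directory under its listed name.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks
    so that large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected):
    """Return True if the file's MD5 matches `expected`.
    `expected` may carry the 'md5:' prefix used in the file listing."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    # File name and checksum copied from the listing; the local path is an assumption.
    name = "Supplementary_Tables.zip"
    checksum = "md5:abe5b800f12f126f6daba140785da616"
    if os.path.exists(name):
        print(name, "OK" if verify_download(name, checksum) else "MISMATCH")
    else:
        print(name, "not found in the current directory")
```

A mismatch usually indicates an incomplete or corrupted transfer, in which case the file should be downloaded again before unpacking.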