Automated assembly of centromeres from ultra-long error-prone reads (repository for paper in Nature Biotech, 2020)
Creators
- 1. Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, CA, USA
- 2. Department of Computer Science and Engineering, University of California, San Diego, CA, USA
Description
UPD May 24, 2021. The latest versions of the cenX and cen6 assemblies of the CHM13 cell line are available as part of the complete human genome assembly generated by the Telomere-to-Telomere Consortium: (github).
The following was last edited on Aug 21, 2020.
The same information can be found in the README inside the attached archive (`centroFlye_data_cenXv0_8_3_cen6v0_1_3_20191224-updatedAndUplodaded_20200616.tgz`).
Introduction
This data collection is intended for replicating the results of the centroFlye paper (Bzikadze A.V., Pevzner P.A., Nature Biotechnology, 2020).
It contains assemblies of cenX and cen6 (CHM13 cell line) that are analyzed in the paper.
The centroFlye version that replicates these assemblies (with instructions) can be found on [github](https://github.com/seryrzu/centroFlye/tree/cF_NatBiotech_paper_Xv0.8.3-6v0.1.3).
Note that the branch cF_NatBiotech_paper_Xv0.8.3-6v0.1.3 is dedicated specifically to replicating assemblies in the paper, while the version of centroFlye in the master branch can produce more recent assemblies.
This data collection can be used with the Jupyter notebooks on [github](https://github.com/seryrzu/centroFlye_paper_scripts).
That repository contains instructions on how to use it.
Assemblies
cenX. Assembly of cenX (version 0.8.3) that is presented in the paper can be found at `centroFlye_results/polishing1/final_sequence_4.fasta`.
If you wish to reproduce the results of the paper, please use this version.
The same assembly polished by tandemQUAST (commit 15368941c15681ba9353a97f3542bb3ada149287) can be found at `centroFlye_results/final_assembly.fasta`.
Please note that this version is not discussed in the paper.
cen6. Assembly of cen6 (version 0.1.3) that is presented in the paper can be found at `centroFlyeMono_results_cen6/centroFlyeMono_cen6/polishing/scaffold_0/scaffold_0.fasta`.
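As a quick sanity check after unpacking the archive, the assembly files listed above can be read with a few lines of Python. This is a minimal sketch and not part of centroFlye: the `read_fasta` helper is a hypothetical name, and the paths assume the script is run from the top-level directory of the unpacked archive.

```python
from pathlib import Path
from typing import Dict, List


def read_fasta(path) -> Dict[str, str]:
    """Parse a FASTA file into {record_id: sequence} (hypothetical helper)."""
    records: Dict[str, List[str]] = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first whitespace-separated token as the record id.
                current = line[1:].split()[0] if len(line) > 1 else ""
                records[current] = []
            elif current is not None:
                records[current].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Paths as given in this README, assumed relative to the unpacked archive.
ASSEMBLIES = {
    "cenX v0.8.3": "centroFlye_results/polishing1/final_sequence_4.fasta",
    "cenX v0.8.3 (tandemQUAST-polished)": "centroFlye_results/final_assembly.fasta",
    "cen6 v0.1.3": "centroFlyeMono_results_cen6/centroFlyeMono_cen6/polishing/scaffold_0/scaffold_0.fasta",
}

if __name__ == "__main__":
    for label, rel_path in ASSEMBLIES.items():
        if not Path(rel_path).exists():
            print(f"{label}: {rel_path} not found")
            continue
        for name, seq in read_fasta(rel_path).items():
            print(f"{label}\t{name}\t{len(seq)} bp")
```

The script only reports record names and lengths; it does not validate the assemblies in any way.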
Description of data collection
- abnormal_12_mers --- dot-plots of 12-alpha-mers with non-canonical structure (not like DXZ1*). See `Supplementary Note 3. Abnormal units in centroFlye cenX assembly`
- abnormal_units --- dot-plots of units whose structure is not a 12-alpha-mer. See `Supplementary Note 3. Abnormal units in centroFlye cenX assembly`
- centroFlye_results --- results of centroFlye for centromere X of CHM13 cell line. See main part of the paper and README.md at the centroFlye main github (branch cF_NatBiotech_paper_Xv0.8.3-6v0.1.3)
  - centromeric_reads --- directory containing centromeric (cenX) reads. See `Methods: Recruiting centromeric reads` (Figure 1.1)
  - DXZ1_star --- contains DXZ1* (and its homopolymer-compressed version). See `Supplementary Note 5: Deriving accurate consensus HORs`
  - NCRF --- NCRF report on centromeric reads on DXZ1. See `Methods: Partitioning centromeric reads into units` (Figure 1.2)
  - polishing1 --- polished version of assembly (the assembly itself is `final_sequence_4.fasta`). See `Methods: Polishing the reconstructed centromere sequence` (Figure 1.7)
  - recruited_unique_kmers --- recruited k-mers used as stepping stones of assembly. See `Methods: Identifying rare centromeric k-mers` (Figure 2.4) and `Methods: Constructing the distance graph` (Figure 2.5)
  - tr_resolution --- placement of reads by units in the cenX. See `Methods: Reconstructing the centromere` (Figure 1.6)
- centroFlyeMono_results_cen6 --- results of centroFlyeMono for centromere 6 of CHM13 cell line. See `Supplementary Note 6. Assembly of centromere on chromosome 6` in the paper and `README.md` at the centroFlye main github (branch cF_NatBiotech_paper_Xv0.8.3-6v0.1.3)
  - centroFlyeMono_cen6 --- results of the main algorithm. Contains dot plots of iterative de Bruijn graphs and polished assembly sequence
  - centromeric_reads --- directory containing centromeric reads (cen6). Same method as for cenX
  - string_decomposer_report --- report of String Decomposer on cen6 reads (Dvorkina et al., Bioinformatics, 2020) (commit 83640e3388be7766837e29384884421955ca3126)
  - string_decomposition_assembly --- report of String Decomposer on generated cen6 assembly (same commit)
  - tandemQUAST_report --- report of tandemQUAST on generated cen6 assembly and cen6 reads (Mikheenko et al., Bioinformatics, 2020) (commit bcfbfd374279c4f3401caaf6cefcafdb4bc9003d)
- D6Z1 --- version of D6Z1 used in the study (genbank AB005791.1). Monomers were extracted with Alpha-CENTAURI 0.2 (Sevim et al., Bioinformatics, 2016) and specially processed.
- DXZ1 --- version of DXZ1 used in the study (genbank X02418.1, reverse complement)
- DXZ1_gorilla --- version of DXZ1 from gorilla used in `Variations in HORs provide insights into centromere evolution`, `Supplementary Figure 5. HOR recombination`
- LINE --- LINE element that was found in cenX (genbank GU477636.1)
- nanosim_training --- results of [Nanosim](https://github.com/bcgsc/NanoSim) (commit 77a4393e18c67adddf60427a617499ac352bc9d8) training on T2T CHM13 cell line reads (the first 1 million reads). See `Supplementary Note 1. Benchmarking centroFlye on simulated datasets`.
- rel2 --- necessary files from release 2 of T2T CHM13 (available at [github](https://github.com/nanopore-wgs-consortium/CHM13)). See `Supplementary Figure 2. Non-uniform coverage of chrX in HG38 with ultra-long reads (longer than 50 kb)`
- rel3 --- necessary files from release 3 of T2T CHM13 (available at [github](https://github.com/nanopore-wgs-consortium/CHM13))
- simulations --- results of simulations and centroFlye performance on them. See `Supplementary Note 1. Benchmarking centroFlye on simulated datasets`
- subsampling --- results of downsampling and read trimming. See `Supplementary Note 6. Assembly of centromere on chromosome 6`: `Effects of downsampling and read trimming on cen6 assembly.`
- T2T --- directories with T2T assemblies and NCRF results on them (using DXZ1*). See `Results: cenX assembly`
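Two of the entries above refer to simple sequence transforms: the reverse complement (DXZ1) and homopolymer compression (DXZ1_star). The sketch below shows the standard definitions of both, for readers who want to relate the stored sequences to the GenBank records. It is an illustration only: this README does not state how the transforms were computed, and the function names are hypothetical.

```python
import itertools

# Complement table for DNA; N and lowercase letters are kept as-is in kind.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def homopolymer_compress(seq: str) -> str:
    """Collapse each run of identical bases to a single base."""
    return "".join(base for base, _run in itertools.groupby(seq))


if __name__ == "__main__":
    example = "AAACCGT"
    print(reverse_complement(example))    # ACGGTTT
    print(homopolymer_compress(example))  # ACGT
```

Homopolymer compression is case-sensitive in this sketch, so `aA` would not be collapsed; uppercase the input first if that matters.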
Files (6.8 GB)
Name | Size |
---|---|
md5:592048c5908713e5731a66e99580e023 | 6.8 GB |
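Since the download is large, it may be worth checking it against the md5 checksum listed under Files before unpacking. The sketch below streams the file in chunks with Python's `hashlib`. It assumes that the checksum belongs to the archive named in the description (the record lists a single file) and that the archive sits in the current directory; `md5_of_file` is a hypothetical helper name.

```python
import hashlib
import os

# Checksum as listed under Files; assumed to refer to the archive below.
EXPECTED_MD5 = "592048c5908713e5731a66e99580e023"
ARCHIVE = "centroFlye_data_cenXv0_8_3_cen6v0_1_3_20191224-updatedAndUplodaded_20200616.tgz"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex md5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    if not os.path.exists(ARCHIVE):
        print(f"{ARCHIVE} not found in the current directory")
    else:
        actual = md5_of_file(ARCHIVE)
        status = "OK" if actual == EXPECTED_MD5 else "MISMATCH"
        print(f"{status}: {actual}")
```

Reading in chunks keeps memory use constant regardless of the archive size.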