Published June 30, 2025 | Version 1
Dataset Open

Genome assemblies and gene annotations for the cloudberry (Rubus chamaemorus)

Description

We provide haplotype-resolved chromosome-level genome assemblies and gene annotations for the cloudberry (Rubus chamaemorus). For each haplotype, we include the assembled genome, predicted protein-coding genes (GFF3), and corresponding protein and mRNA transcript FASTA files. These files are included for convenience and completeness, as some functional annotations may be lost or simplified during formatting for ENA submission.

Data generation and assembly: (https://github.com/ebp-nor/GenomeAssembly)
The drRubCham1.1 assembly is based on 59x PacBio HiFi and 62x Arima Hi-C data. The HiFi reads were filtered for adapters using HiFiAdapterFilt (Sim et al. 2022) and assembled with hifiasm (Cheng et al. 2021) together with Hi-C data to generate a pair of haplotype-resolved assemblies. K-mers from the two assemblies were used to filter Hi-C reads to create two sets of reads for scaffolding the assemblies, ensuring that reads for hap1 scaffolding did not contain k-mers unique to hap2, and vice versa. YaHS (Zhou, McCarthy, and Durbin 2023) was used for scaffolding. FCS-adaptor was used to remove putative adapters, and putative contaminants were removed with FCS-GX (Astashyn et al. 2024). The pseudo-chromosomes, as confirmed by Hi-C data, were grouped and oriented based on their homology to the diploid relative Rubus idaeus (ASCWY01 - Price et al. 2023), and then sorted by size in hap1 within each syntenic group of four. Ordering and naming were mirrored in hap2.
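The k-mer-based partitioning of Hi-C reads described above can be illustrated with a toy sketch. It is not the code used in the linked pipeline: the function names are hypothetical, the sequences are made up, and the sketch ignores reverse complements, read pairing, and the memory-efficient k-mer counting a real genome would need.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(own, other, k):
    """K-mers found in the `own` sequences but in none of the `other` ones."""
    own_set = set().union(*(kmers(s, k) for s in own))
    other_set = set().union(*(kmers(s, k) for s in other))
    return own_set - other_set


def filter_reads(reads, forbidden, k):
    """Keep only reads that contain none of the forbidden k-mers."""
    return [r for r in reads if kmers(r, k).isdisjoint(forbidden)]


# Toy data (illustrative only): two haplotypes differing at their 3' end.
K = 4
hap1 = ["ACGTACGTTT"]
hap2 = ["ACGTACGAAA"]
reads = ["TACGTT", "TACGAA", "GTACG"]

hap1_only = unique_kmers(hap1, hap2, K)
hap2_only = unique_kmers(hap2, hap1, K)

# Reads for hap1 scaffolding must not carry hap2-specific k-mers, and vice versa.
reads_for_hap1 = filter_reads(reads, hap2_only, K)
reads_for_hap2 = filter_reads(reads, hap1_only, K)
```

Reads made up only of shared k-mers (here `GTACG`) stay in both sets, so they can support scaffolding of either haplotype.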

Gene annotation: (https://github.com/ebp-nor/GenomeAnnotation)
The assemblies were masked for repeats using RED (Girgis, 2015) via redmask. Miniprot (Li, 2023) was used to align protein evidence to the curated assemblies. Longest-isoform protein sequences from Arabidopsis thaliana (TAIR10.1 - GCF_000001735.4) were extracted using AGAT (agat_sp_keep_longest_isoform.pl and agat_sp_extract_sequences.pl) and aligned to the masked assemblies. Additional protein evidence from UniProtKB/Swiss-Prot (release 2023_02) (UniProt Consortium et al., 2023) and Viridiplantae OrthoDB v11 (Kuznetsov et al., 2022) was aligned separately. Gene models were predicted using GALBA (Brůna et al., 2023; Buchfink et al., 2015; Hoff and Stanke, 2018; Li, 2023; Stanke et al., 2006), run in miniprot mode with the TAIR10.1 proteins on the masked assemblies. All evidence (protein alignments and GALBA predictions) was combined using EVidenceModeler via Funannotate (Haas et al., 2008). The resulting predicted proteins were compared to the repeat protein database distributed with Funannotate using DIAMOND (Buchfink et al., 2015) blastp, and the gene models of proteins with matches were filtered out using AGAT. The filtered proteins were then compared to UniProtKB/Swiss-Prot (release 2024_04) using DIAMOND to assign gene names, and InterProScan (Jones et al., 2014) was used to identify functional domains. AGAT’s agat_sp_manage_functional_annotation.pl was used to integrate gene names and functional annotations into the final GFF3 files.
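As a rough illustration of the repeat-filtering step, the sketch below reads DIAMOND hits in the standard 12-column BLAST tabular layout and drops the proteins that matched. It makes several assumptions: the function names are hypothetical, the e-value cutoff is an arbitrary placeholder (the description does not state one), and the actual pipeline filters gene models in the GFF3 with AGAT rather than a protein dictionary.

```python
import csv
import io


def repeat_hit_ids(tabular, max_evalue=1e-10):
    """Collect query IDs with at least one hit at or below max_evalue.

    `tabular` is an open text handle in BLAST tabular format with columns
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. The default cutoff is an illustrative assumption.
    """
    ids = set()
    reader = csv.reader(tabular, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip comments and malformed lines
        if float(row[10]) <= max_evalue:
            ids.add(row[0])
    return ids


def drop_repeat_proteins(proteins, hit_ids):
    """Return a copy of `proteins` (id -> sequence) without repeat-like entries."""
    return {pid: seq for pid, seq in proteins.items() if pid not in hit_ids}


# Made-up example: prot1 has a strong repeat hit, prot2 only a weak one.
example = io.StringIO(
    "prot1\trepA\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-50\t400\n"
    "prot2\trepB\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25\n"
)
hits = repeat_hit_ids(example)
kept = drop_repeat_proteins({"prot1": "MKV", "prot2": "MAA", "prot3": "MTT"}, hits)
```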

 

List of files provided here and their description:

drRubCham1.1.hap1.fa.gz - genome assembly of cloudberry (hap1)  
drRubCham1.1.hap1.gff.gz - genome annotation of cloudberry (hap1)  
drRubCham1.1.hap1.proteins.fa.gz - predicted proteins of cloudberry (hap1)  
drRubCham1.1.hap1.mrna.fa.gz - predicted mRNA transcripts of cloudberry (hap1)  

drRubCham1.1.hap2.fa.gz - genome assembly of cloudberry (hap2)  
drRubCham1.1.hap2.gff.gz - genome annotation of cloudberry (hap2)  
drRubCham1.1.hap2.proteins.fa.gz - predicted proteins of cloudberry (hap2)  
drRubCham1.1.hap2.mrna.fa.gz - predicted mRNA transcripts of cloudberry (hap2)  
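The annotation files listed above are gzip-compressed GFF3 and can be read with the Python standard library alone. The following is a minimal sketch, assuming the nine-column GFF3 layout with `key=value` attributes. The function names are hypothetical, and the sample lines are invented for illustration rather than taken from the dataset.

```python
import gzip
from collections import Counter
from urllib.parse import unquote


def parse_attributes(field):
    """Split a GFF3 column-9 string into a dict of decoded key/value pairs."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[unquote(key)] = unquote(value)
    return attrs


def iter_features(lines):
    """Yield (seqid, type, start, end, strand, attributes) for each feature."""
    for line in lines:
        if line.startswith("##FASTA"):
            break  # an optional embedded sequence section ends the features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        yield (cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6],
               parse_attributes(cols[8]))


def count_feature_types(path):
    """Count feature types (gene, mRNA, CDS, ...) in a gzipped GFF3 file."""
    with gzip.open(path, "rt") as handle:
        return Counter(feature[1] for feature in iter_features(handle))


# Invented sample lines, used here instead of the real files.
sample = [
    "##gff-version 3\n",
    "chr1\texample\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=ABC1\n",
    "chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1;product=kinase%2C putative\n",
    "chr1\texample\tCDS\t150\t800\t.\t+\t0\tID=cds1;Parent=mrna1\n",
]
features = list(iter_features(sample))
type_counts = Counter(feature[1] for feature in features)
```

With a downloaded file in the working directory, `count_feature_types("drRubCham1.1.hap1.gff.gz")` would give the per-type totals for hap1. Which attribute keys carry the gene names and functional annotations in these files is not stated here, so inspect a few lines before relying on a particular key.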

 

Files

Files (822.2 MB)

Checksum (MD5)  Size  
md5:6fd1e16b629f1d1f2e6b87b8d9db1c6c  350.6 MB  
md5:bd1ef50c3cf4ca9e87680a1ac1253583  17.1 MB  
md5:7af6d409fd85f9f9f7b42c4ea6b27b90  31.3 MB  
md5:b7fe9e5cb629d36997c2762efcfa15fd  18.2 MB  
md5:b035adbc78347720383c0fef7e2cf9cc  339.9 MB  
md5:6ced7d6008cba210b3f3d06f327b6609  16.7 MB  
md5:f2932828ae2cc85c04d78c35ea588387  30.5 MB  
md5:31a290574c4e6d8b3a2babb1595877f2  18.0 MB  
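After downloading, each file can be checked against its published MD5 checksum. The helper below is a minimal sketch using Python's `hashlib`. The function names are hypothetical, and the pairing of checksum to file name has to be taken from the record page, since it is not reproduced in the listing here. The self-check at the end runs on a throwaway temporary file, not on the dataset.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Return True if the file matches `expected` (with or without 'md5:')."""
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[4:]
    return md5_of_file(path) == expected


# Self-check on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"cloudberry")
demo_path = tmp.name
demo_ok = verify_md5(demo_path, "md5:" + hashlib.md5(b"cloudberry").hexdigest())
os.remove(demo_path)
```

For a real file, the call would look like `verify_md5(path_to_download, checksum_from_record_page)`, where both arguments come from the user.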

Additional details

Funding

The Research Council of Norway
326819