ProkBERT datasets
Description
Datasets for ProkBERT
This repository contains the training, validation, and testing datasets used in our research for ProkBERT, optimized for microbiome studies.
There are 4 different datasets:
- ESKAPE genomic features
- Bacterial promoter database
- Phage training, test and evaluation datasets
- ESKAPE masked sequences dataset
Description
The datasets support the development and evaluation of ProkBERT models. They include raw sequence data in compressed TSV format and tokenized datasets in compressed HDF format, using various k-mer sizes and shift values.
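To make the roles of "k-mer size" and "shift" concrete, the following is a minimal sketch of overlapping k-mer tokenization. It is an illustration under stated assumptions, not the ProkBERT tokenizer itself: the function name `kmer_tokenize` is hypothetical, and the real tokenizer may handle padding, special tokens, and sequence ends differently.

```python
def kmer_tokenize(sequence: str, k: int, shift: int) -> list[str]:
    """Split a sequence into k-mers, moving the window by `shift` characters.

    Hypothetical helper for illustration only; trailing characters that do
    not fill a complete k-mer are dropped.
    """
    if k <= 0 or shift <= 0:
        raise ValueError("k and shift must be positive integers")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, shift)]


# A shift of 1 gives maximally overlapping k-mers; a larger shift gives fewer tokens.
tokens_shift1 = kmer_tokenize("ATGCGTAC", k=6, shift=1)
tokens_shift2 = kmer_tokenize("ATGCGTAC", k=6, shift=2)
```

With the same k, a larger shift shortens the token sequence for a given segment, which is why the tokenized files are distinguished by both values.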
ESKAPE genomic features
filename: eskape_genomic_features.tsv.bz2
This dataset includes genomic segments from ESKAPE pathogens, characterized by various genomic features such as coding sequences (CDS), intergenic regions, ncRNA, and pseudogenes. It was analyzed to understand the representations captured by models like ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long.
Data Fields
- contig_id: Identifier of the contig.
- segment_id: Unique identifier for each genomic segment.
- strand: DNA strand of the segment (+ or -).
- seq_start: Starting position of the segment in the contig.
- seq_end: Ending position of the segment in the contig.
- segment_start: Starting position of the segment in the sequence.
- segment_end: Ending position of the segment in the sequence.
- label: Genomic feature category (e.g., CDS, intergenic).
- segment_length: Length of the genomic segment.
- segment: Genomic sequence of the segment.
For a more detailed description, please visit: https://huggingface.co/datasets/neuralbioinfo/ESKAPE-genomic-features
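As a rough sketch of how the compressed TSV could be inspected without decompressing it to disk, the snippet below streams the file with the Python standard library and counts the values of the label column. It assumes the file has a header row whose column names match the data fields listed above; the function name `count_labels` is hypothetical.

```python
import bz2
import csv
from collections import Counter


def count_labels(path: str, label_column: str = "label") -> Counter:
    """Stream a bz2-compressed TSV and count the values in one column.

    Assumes the first row is a header containing `label_column`.
    """
    counts: Counter = Counter()
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            counts[row[label_column]] += 1
    return counts


# Example call (assumes the file has been downloaded to the working directory):
# print(count_labels("eskape_genomic_features.tsv.bz2"))
```

Streaming row by row keeps memory use low; for interactive analysis, a dataframe library that reads bz2-compressed TSV files directly may be more convenient.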
PROMOTER dataset
filename: bacterial_promoter_db.tsv.bz2
Data collection and processing
- Data source: The positive samples, known promoters, are primarily drawn from the Prokaryotic Promoter Database (PPD), containing experimentally validated promoter sequences from 75 organisms. Non-promoter sequences are obtained from the NCBI RefSeq database, sampled specifically from CDS regions.
- Preprocessing: The dataset also includes synthetic non-promoter sequences generated with higher-order and zero-order Markov chains, which mirror the compositional characteristics of known promoters. An independent test set based on E. coli sigma70 promoters is also included.
Dataset structure
- Dataset splits: The dataset is systematically divided into training, validation, and test subsets.
- Data fields:
  - segment_id: Unique identifier for each segment.
  - ppd_original_SpeciesName: Original species name from the PPD.
  - Strand: The strand of the DNA sequence.
  - segment: The DNA sequence of the promoter region.
  - label: The label indicating whether the sequence is a promoter or non-promoter.
  - L: Length of the DNA sequence.
  - prom_class: The class of the promoter.
  - y: Binary label indicating the presence of a promoter.
Dataset splits
- Training set: Primary dataset used for model training.
- Test set (Sigma70): Independent test set focusing on E. coli sigma70 promoters.
- Multispecies set: Additional test set including various species, ensuring generalization across different organisms.
ESKAPE masked sequences dataset
filename: eskape_masking_dataset.tsv.bz2
Dataset description
This dataset was used to evaluate models on a masking task, measuring how well each model recovers the original nucleotide at a masked position.
Dataset overview
The dataset is compiled from the RefSeq database and other sources, focusing on ESKAPE pathogens. Genomic features were sampled randomly and then split into contiguous segments of length 128, 256, 512, or 1024. In each randomly selected segment, one character was replaced by '*' (masked_segment column) to create a masking task. The reference_segment contains the original, unmasked nucleotides. We performed 10,000 maskings per set, with a maximum of 2,000 genomic features. Only the genomic features 'CDS', 'intergenic', 'pseudogene', and 'ncRNA' were considered.
Dataset Structure
- Data Fields:
  - reference_segment_id: A mapping of segment identifiers to their respective reference IDs in the database.
  - masked_segment: The DNA sequence of the segment with certain positions masked (marked with '*') for prediction or testing purposes.
  - position_to_mask: The specific position(s) in the sequence that have been masked, indicated by index numbers.
  - masked_segment_id: Unique identifiers assigned to the masked segments (unique only with respect to length).
  - contig_id: Identifier of the contig to which the segment belongs.
  - segment_id: Unique identifier for each genomic segment (same as reference segment id).
  - strand: The DNA strand of the segment, indicated as '+' (positive) or '-' (negative).
  - seq_start: Starting position of the segment within the contig.
  - seq_end: Ending position of the segment within the contig.
  - segment_start: Starting position of the genomic segment in the sequence.
  - segment_end: Ending position of the genomic segment in the sequence.
  - label: Category label for the genomic segment (e.g., 'CDS', 'intergenic').
  - segment_length: The length of the genomic segment.
  - original_segment: The original genomic sequence without any masking.
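The masking task described above can be scored roughly as follows. This is a sketch under assumptions rather than the evaluation code used in the paper: it assumes a single masked position per segment, a 0-based position index (the indexing convention is not stated here), and hypothetical helper names `true_base` and `masking_accuracy`.

```python
from typing import Iterable, Tuple

MASK_CHAR = "*"  # mask symbol used in the masked_segment column


def true_base(masked_segment: str, reference_segment: str, position: int) -> str:
    """Return the nucleotide hidden behind the mask, after basic sanity checks.

    Assumes `position` is 0-based; adjust if the data uses 1-based indices.
    """
    if len(masked_segment) != len(reference_segment):
        raise ValueError("masked and reference segments differ in length")
    if masked_segment[position] != MASK_CHAR:
        raise ValueError(f"no mask character at position {position}")
    return reference_segment[position]


def masking_accuracy(records: Iterable[Tuple[str, str, int, str]]) -> float:
    """Fraction of records whose predicted base equals the hidden base.

    Each record is (masked_segment, reference_segment, position, predicted_base).
    """
    total = 0
    correct = 0
    for masked, reference, position, predicted in records:
        total += 1
        if true_base(masked, reference, position) == predicted:
            correct += 1
    return correct / total if total else 0.0
```

The predicted base would come from whichever model is being evaluated; the helpers only compare it with the unmasked sequence.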
PHAGE dataset description
We assembled a phage sequence database from RefSeq and other sources, refining it to reduce redundancy and ensure balance between phage and bacterial sequences. The final dataset targets important bacterial genera, aiding in understanding phage-host interactions and their implications for health.
Data file naming conventions
For tokenized datasets:
- Pattern:
RND__balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2 - Matches files indicating the split (test, validation, or training), segment length, k-mer size, and shift value.
For sampled raw data:
- Pattern:
RND__balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2 - Matches files indicating the split (test, validation, or training) and segment length.
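The two patterns above can be applied directly to sort downloaded files by split and tokenization settings. The sketch below uses the patterns as given; the helper name `parse_phage_filename` and the example file names are hypothetical and only illustrate the expected shape.

```python
import re
from typing import Optional

# Patterns copied from the naming conventions above.
TOKENIZED_PATTERN = re.compile(
    r"RND__balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2"
)
RAW_PATTERN = re.compile(r"RND__balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2")


def parse_phage_filename(name: str) -> Optional[dict]:
    """Extract split, segment length and (if present) k-mer size and shift."""
    match = TOKENIZED_PATTERN.fullmatch(name)
    if match:
        split, length, kmer, shift = match.groups()
        return {
            "kind": "tokenized",
            "split": split,
            "segment_length": int(length),
            "kmer": int(kmer),
            "shift": int(shift),
        }
    match = RAW_PATTERN.fullmatch(name)
    if match:
        split, length = match.groups()
        return {"kind": "raw", "split": split, "segment_length": int(length)}
    return None  # not a phage dataset file name


# Hypothetical file names, used only to show the expected shape.
tokenized_info = parse_phage_filename("RND__balanced_train_Ls512_k6s1.h5.bz2")
raw_info = parse_phage_filename("RND__balanced_val_Ls1024.tsv.bz2")
```

Using `fullmatch` rather than `search` rejects names that merely contain the pattern, such as temporary or partially downloaded files.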
Data fields
- segment_id: Unique identifier for each genomic segment.
- contig_id: Identifier for the contig from which the segment is derived.
- segment_start: Start position of the segment in the contig.
- segment_end: End position of the segment in the contig.
- L: Length of the genomic segment (512, 1024, or 2048).
- segment: The genomic sequence of the segment.
- label: Classification label (e.g., 'phage').
- y: Binary label (1 for phage, 0 for non-phage).
Usage
These datasets are intended for academic use. Please cite our paper (see Citation) when using them.
Contact information
For any questions, feedback, or contributions regarding the datasets or ProkBERT, please feel free to reach out:
- Name: Balázs Ligeti
- Email: obalasz@gmail.com
We welcome your input and collaboration to improve our resources and research.
Citation
@Article{ProkBERT2024,
author = {Ligeti, Balázs and Szepesi-Nagy, István and Bodnár, Babett and Ligeti-Nagy, Noémi and Juhász, János},
journal = {Frontiers in Microbiology},
title = {{ProkBERT} family: genomic language models for microbiome applications},
year = {2024},
volume = {14},
URL = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
DOI = {10.3389/fmicb.2023.1331233}
}
Files
(18.3 GB)
| MD5 checksum | Size |
|---|---|
| 1c022316afc28a40720ebb54d122fb7f | 6.4 MB |
| 0b7545dabc2a1774394d6190ba95ed8c | 4.2 MB |
| e4f41ff12a96fc8cb751bc43013098e9 | 8.4 MB |
| ca3752449fd65b0319258b7b730a81d4 | 7.4 kB |
| 1825abe0d998b5cb8a468ff737e9e9b0 | 167.7 MB |
| f56060ec438d8aac0398adf9d85b4253 | 155.1 MB |
| 09dc2cd1a414bb479327acd28f198a09 | 223.8 MB |
| a545094ac52af86e80b4e385e70a14ad | 437.8 MB |
| a75ddeef17b72e1e0b65cacb6126c3c7 | 136.7 MB |
| 1867464eecb855427ce9c1e0e1e04d38 | 359.6 MB |
| dadfcb1361bfaa174229faf15840e815 | 1.7 GB |
| 00c50c5a8768b82400b61d33125c7f44 | 1.5 GB |
| d4b83a55e42027b0cf83e9bccdad6221 | 2.2 GB |
| 9c023934da81a61d1afa521b9bf10f78 | 4.3 GB |
| 283222583e52b2b05e213805218a4fed | 1.5 GB |
| 6e608586d4c3fed4cb4f099822baa6f3 | 4.0 GB |
| bb2ffa833d59bc8d0416f22efafcfe90 | 168.4 MB |
| 774a3c22e586541d283ff4d245987857 | 155.8 MB |
| 964416515806187261621460e0e5d9dd | 224.7 MB |
| 51d3cf1ff9d964c2cef92af2f8a5e1c6 | 439.7 MB |
| f1ea21d8c95a6923f544e680b883c2cd | 136.2 MB |
| dd0a53483cbeb3314c8862778222219c | 358.5 MB |
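Because several of the files are multiple gigabytes, it can be worth checking a download against the MD5 checksum listed above before decompressing it. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `verify_download` are hypothetical, and the file is read in chunks so that large archives do not need to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Example call (assumes the file has been downloaded; take the expected value
# from the checksum listed for that file):
# ok = verify_download("bacterial_promoter_db.tsv.bz2", "<md5 from the file list>")
```

MD5 is used here only as an integrity check against transfer errors, which is the purpose of the listed checksums; it is not a security guarantee.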
Additional details
Related works
- Is published in
- Dataset: 10.3389/fmicb.2023.1331233 (DOI)
Software
- Repository URL: https://github.com/nbrg-ppcu/prokbert