Published October 31, 2024 | Version v1
Dataset Open

ProkBERT datasets

Contributors

Researcher:

  • 1. Pázmány Péter Catholic University

Description

Datasets for ProkBERT

This repository contains the training, validation, and testing datasets used in our research for ProkBERT, optimized for microbiome studies.

The repository contains four datasets:

  1. ESKAPE genomic features
  2. Bacterial promoter database
  3. Phage training, test and evaluation datasets
  4. ESKAPE masked sequences dataset

 

Description

The datasets support the development and evaluation of ProkBERT models. They include raw sequence data in compressed TSV format and tokenized datasets in compressed HDF format, using various k-mer sizes and shift values.
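To make the k-mer size and shift parameters concrete, the following minimal sketch splits a sequence into k-mers whose start positions are a fixed shift apart. It is an illustration only and not the ProkBERT tokenizer: the function name `kmer_tokenize` is hypothetical, and details of the real tokenizer such as special tokens and vocabulary IDs are left out.

```python
def kmer_tokenize(sequence: str, k: int, shift: int) -> list[str]:
    """Split a sequence into k-mers whose start positions are `shift` apart."""
    if k < 1 or shift < 1:
        raise ValueError("k and shift must be positive integers")
    last_start = len(sequence) - k  # negative when the sequence is shorter than k
    return [sequence[i:i + k] for i in range(0, last_start + 1, shift)]


print(kmer_tokenize("ACGTAC", k=3, shift=1))  # ['ACG', 'CGT', 'GTA', 'TAC']
print(kmer_tokenize("ACGTAC", k=3, shift=2))  # ['ACG', 'GTA']
```

A shift of 1 yields maximally overlapping k-mers, while a larger shift produces fewer tokens for the same sequence.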

 

ESKAPE genomic features

filename: eskape_genomic_features.tsv.bz2

This dataset includes genomic segments from ESKAPE pathogens, characterized by various genomic features such as coding sequences (CDS), intergenic regions, ncRNA, and pseudogenes. It was analyzed to understand the representations captured by models like ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long.

 

Data Fields

  • contig_id: Identifier of the contig.
  • segment_id: Unique identifier for each genomic segment.
  • strand: DNA strand of the segment (+ or -).
  • seq_start: Starting position of the segment in the contig.
  • seq_end: Ending position of the segment in the contig.
  • segment_start: Starting position of the segment in the sequence.
  • segment_end: Ending position of the segment in the sequence.
  • label: Genomic feature category (e.g., CDS, intergenic).
  • segment_length: Length of the genomic segment.
  • segment: Genomic sequence of the segment.

For a more detailed description, please visit: https://huggingface.co/datasets/neuralbioinfo/ESKAPE-genomic-features
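As a starting point for working with the file, the sketch below streams a bz2-compressed TSV with the Python standard library and counts segments per `label`. It assumes the file has a header row whose column names match the data fields listed above; the helper names `iter_rows` and `count_labels` are hypothetical.

```python
import bz2
import csv
from collections import Counter


def iter_rows(path):
    """Yield each row of a bz2-compressed TSV file as a dict keyed by column name."""
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # Assumption: the first line is a header with the column names.
        yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def count_labels(path, column="label"):
    """Count how many segments fall into each genomic feature category."""
    return Counter(row[column] for row in iter_rows(path))
```

With the file downloaded locally, `count_labels("eskape_genomic_features.tsv.bz2")` would return the number of segments per feature category. Streaming the rows avoids holding the whole decompressed file in memory.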

 

PROMOTER dataset

filename: bacterial_promoter_db.tsv.bz2

 

Data collection and processing

  • Data source: The positive samples (known promoters) are drawn primarily from the Prokaryotic Promoter Database (PPD), which contains experimentally validated promoter sequences from 75 organisms. Non-promoter sequences are taken from the NCBI RefSeq database, sampled specifically from CDS regions.
  • Preprocessing: The dataset also includes synthetic non-promoter sequences generated with higher-order and zero-order Markov chains that mirror the compositional characteristics of known promoters. An independent test set based on E. coli sigma70 promoters is also included.
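The idea of generating non-promoter sequences from Markov chains can be sketched as follows. This is a generic illustration under stated assumptions, not the procedure used to build the dataset: the chain is estimated from a single template sequence, the function name `sample_markov` is hypothetical, and the example sequence is made up rather than taken from the PPD.

```python
import random
from collections import Counter, defaultdict


def sample_markov(template: str, order: int, rng: random.Random) -> str:
    """Generate a sequence as long as `template` from an order-`order`
    Markov chain estimated on `template` itself (order 0 = base composition)."""
    composition = Counter(template)
    transitions = defaultdict(Counter)
    for i in range(len(template) - order):
        transitions[template[i:i + order]][template[i + order]] += 1

    out = list(template[:order])  # seed with the template's first `order` bases
    while len(out) < len(template):
        context = "".join(out[len(out) - order:])
        # Fall back to the overall composition for contexts with no observed successor.
        counts = transitions.get(context) or composition
        symbols = sorted(counts)
        weights = [counts[s] for s in symbols]
        out.append(rng.choices(symbols, weights=weights)[0])
    return "".join(out)


rng = random.Random(0)
example = "TTGACAATTAATCATCGGCTCGTATAATGTGTGG"  # illustrative sequence only
print(sample_markov(example, order=0, rng=rng))  # preserves base frequencies on average
print(sample_markov(example, order=2, rng=rng))  # also preserves local 3-mer statistics
```

A zero-order chain keeps only the nucleotide composition, whereas a higher-order chain additionally reproduces short-range dependencies, which makes the synthetic negatives harder to separate from real promoters by composition alone.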

 

Dataset structure

  • Dataset splits: The dataset is systematically divided into training, validation, and test subsets.
  • Data fields:
    • segment_id: Unique identifier for each segment.
    • ppd_original_SpeciesName: Original species name from the PPD.
    • Strand: The strand of the DNA sequence.
    • segment: The DNA sequence of the promoter region.
    • label: The label indicating whether the sequence is a promoter or non-promoter.
    • L: Length of the DNA sequence.
    • prom_class: The class of the promoter.
    • y: Binary label indicating the presence of a promoter.

 

Dataset splits

  • Training set: Primary dataset used for model training.
  • Test set (Sigma70): Independent test set focusing on E. coli sigma70 promoters.
  • Multispecies set: Additional test set including various species, ensuring generalization across different organisms.

 

ESKAPE masked sequences dataset

filename: eskape_masking_dataset.tsv.bz2

 

Dataset description

This dataset was used to evaluate models on a masking task, measuring how well each model recovers the original nucleotide at a masked position.

 

Dataset overview

The dataset is compiled from the RefSeq database and other sources, focusing on ESKAPE pathogens. Genomic features were sampled randomly and then split into contiguous segments of length 128, 256, 512, or 1024. In each randomly selected segment, one character was replaced by '*' (masked_segment column) to create a masking task. The reference_segment contains the original, unmasked nucleotides. We performed 10,000 maskings per set, with a maximum of 2,000 genomic features. Only the genomic features 'CDS', 'intergenic', 'pseudogene', and 'ncRNA' were considered.

 

Dataset Structure

  • Data Fields:
    • reference_segment_id: A mapping of segment identifiers to their respective reference IDs in the database.
    • masked_segment: The DNA sequence of the segment with certain positions masked (marked with '*') for prediction or testing purposes.
    • position_to_mask: The specific position(s) in the sequence that have been masked, indicated by index numbers.
    • masked_segment_id: Identifier of the masked segment (unique only among segments of the same length).
    • contig_id: Identifier of the contig to which the segment belongs.
    • segment_id: Unique identifier for each genomic segment (same as the reference segment ID).
    • strand: The DNA strand of the segment, indicated as '+' (positive) or '-' (negative).
    • seq_start: Starting position of the segment within the contig.
    • seq_end: Ending position of the segment within the contig.
    • segment_start: Starting position of the genomic segment in the sequence.
    • segment_end: Ending position of the genomic segment in the sequence.
    • label: Category label for the genomic segment (e.g., 'CDS', 'intergenic').
    • segment_length: The length of the genomic segment.
    • original_segment: The original genomic sequence without any masking.
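A masking evaluation over these fields could be scored roughly as in the sketch below. It assumes exactly one '*' per masked segment, as stated in the dataset overview, and locates the masked position from the sequence itself rather than relying on the indexing convention of position_to_mask. The helper names and the `predict` callable are hypothetical stand-ins for a real model.

```python
def masked_position(masked_segment: str, mask_char: str = "*") -> int:
    """Return the 0-based index of the single masked character."""
    if masked_segment.count(mask_char) != 1:
        raise ValueError("expected exactly one masked position")
    return masked_segment.index(mask_char)


def recovery_accuracy(rows, predict, original_column="original_segment"):
    """Fraction of rows for which `predict(masked_segment)` returns the original base."""
    hits = total = 0
    for row in rows:
        masked = row["masked_segment"]
        truth = row[original_column][masked_position(masked)]
        hits += predict(masked) == truth
        total += 1
    return hits / total if total else 0.0


# Made-up rows and a trivial baseline that always predicts 'A'.
rows = [
    {"masked_segment": "AC*T", "original_segment": "ACGT"},
    {"masked_segment": "*CGT", "original_segment": "ACGT"},
]
print(recovery_accuracy(rows, lambda masked: "A"))  # 0.5
```

The `original_column` parameter is exposed so the same function works whichever column holds the unmasked sequence.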

 

PHAGE dataset description

We assembled a phage sequence database from RefSeq and other sources, refining it to reduce redundancy and ensure balance between phage and bacterial sequences. The final dataset targets important bacterial genera, aiding in understanding phage-host interactions and their implications for health.

 

Data file naming conventions

For tokenized datasets:

  • Pattern: RND__balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2
  • Matches files indicating the split (test, val, or train), segment length, k-mer size, and shift value.

For sampled raw data:

  • Pattern: RND__balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2
  • Matches files indicating the split (test, val, or train) and segment length.
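The two patterns above can be applied directly with Python's `re` module to recover the metadata encoded in a file name. In this sketch the function name `parse_filename` and the example file names are hypothetical; they are constructed only to match the stated patterns and are not taken from the file listing.

```python
import re

TOKENIZED = re.compile(r"RND__balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2")
RAW = re.compile(r"RND__balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2")


def parse_filename(name: str):
    """Return the metadata encoded in a phage data file name, or None if it matches neither pattern."""
    match = TOKENIZED.fullmatch(name)
    if match:
        split, length, k, shift = match.groups()
        return {"kind": "tokenized", "split": split,
                "segment_length": int(length), "kmer": int(k), "shift": int(shift)}
    match = RAW.fullmatch(name)
    if match:
        split, length = match.groups()
        return {"kind": "raw", "split": split, "segment_length": int(length)}
    return None


print(parse_filename("RND__balanced_train_Ls512_k6s1.h5.bz2"))
print(parse_filename("RND__balanced_val_Ls1024.tsv.bz2"))
```

Using `fullmatch` rather than `search` ensures that names with extra prefixes or suffixes are rejected instead of being partially matched.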

 

Data fields

  • segment_id: Unique identifier for each genomic segment.
  • contig_id: Identifier for the contig from which the segment is derived.
  • segment_start: Start position of the segment in the contig.
  • segment_end: End position of the segment in the contig.
  • L: Length of the genomic segment (512, 1024, or 2048).
  • segment: The genomic sequence of the segment.
  • label: Classification label (e.g., 'phage').
  • y: Binary label (1 for phage, 0 for non-phage).
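Before training on the raw files, a quick sanity check of these fields can be useful. The sketch below counts rows per binary class and flags rows whose `L` value differs from the actual sequence length. It assumes rows are dictionaries of strings, as produced by a TSV reader, and that `y` and `L` are stored as plain integers; the function name `summarize_rows` is hypothetical and the example rows are made up.

```python
from collections import Counter


def summarize_rows(rows):
    """Count rows per binary class and collect segment IDs whose `L` disagrees with len(segment)."""
    class_counts = Counter()
    length_mismatches = []
    for row in rows:
        class_counts[int(row["y"])] += 1  # assumption: y is written as 0 or 1
        if int(row["L"]) != len(row["segment"]):
            length_mismatches.append(row["segment_id"])
    return class_counts, length_mismatches


rows = [
    {"segment_id": "s1", "L": "4", "segment": "ACGT", "y": "1"},
    {"segment_id": "s2", "L": "4", "segment": "ACG", "y": "0"},
]
counts, bad = summarize_rows(rows)
print(counts[1], counts[0], bad)  # 1 1 ['s2']
```

Roughly equal counts for the two classes would be consistent with the balanced design described above; a non-empty mismatch list would indicate that `L` is a nominal rather than an exact length.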

 

Usage

These datasets are intended for academic use. Please cite our paper (see Citation) when using them.

 

Contact information

For any questions, feedback, or contributions regarding the datasets or ProkBERT, please feel free to reach out.

We welcome your input and collaboration to improve our resources and research.

 

Citation

@Article{ProkBERT2024,
  author  = {Ligeti, Balázs  and Szepesi-Nagy, István  and Bodnár, Babett  and Ligeti-Nagy, Noémi  and Juhász, János},
  journal = {Frontiers in Microbiology},
  title   = {{ProkBERT} family: genomic language models for microbiome applications},
  year    = {2024},
  volume  = {14},
  URL     = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
  DOI     = {10.3389/fmicb.2023.1331233}
}

Files

Files (18.3 GB)

  • md5:1c022316afc28a40720ebb54d122fb7f (6.4 MB)
  • md5:0b7545dabc2a1774394d6190ba95ed8c (4.2 MB)
  • md5:e4f41ff12a96fc8cb751bc43013098e9 (8.4 MB)
  • md5:ca3752449fd65b0319258b7b730a81d4 (7.4 kB)
  • md5:1825abe0d998b5cb8a468ff737e9e9b0 (167.7 MB)
  • md5:f56060ec438d8aac0398adf9d85b4253 (155.1 MB)
  • md5:09dc2cd1a414bb479327acd28f198a09 (223.8 MB)
  • md5:a545094ac52af86e80b4e385e70a14ad (437.8 MB)
  • md5:a75ddeef17b72e1e0b65cacb6126c3c7 (136.7 MB)
  • md5:1867464eecb855427ce9c1e0e1e04d38 (359.6 MB)
  • md5:dadfcb1361bfaa174229faf15840e815 (1.7 GB)
  • md5:00c50c5a8768b82400b61d33125c7f44 (1.5 GB)
  • md5:d4b83a55e42027b0cf83e9bccdad6221 (2.2 GB)
  • md5:9c023934da81a61d1afa521b9bf10f78 (4.3 GB)
  • md5:283222583e52b2b05e213805218a4fed (1.5 GB)
  • md5:6e608586d4c3fed4cb4f099822baa6f3 (4.0 GB)
  • md5:bb2ffa833d59bc8d0416f22efafcfe90 (168.4 MB)
  • md5:774a3c22e586541d283ff4d245987857 (155.8 MB)
  • md5:964416515806187261621460e0e5d9dd (224.7 MB)
  • md5:51d3cf1ff9d964c2cef92af2f8a5e1c6 (439.7 MB)
  • md5:f1ea21d8c95a6923f544e680b883c2cd (136.2 MB)
  • md5:dd0a53483cbeb3314c8862778222219c (358.5 MB)
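Since each file is listed with an MD5 checksum, a download can be verified before use. The following sketch computes the checksum in chunks with the standard `hashlib` module, so multi-gigabyte files do not need to fit in memory. The helper names `md5_of_file` and `matches_checksum` are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or as a bare hex string."""
    expected = expected.strip().lower().removeprefix("md5:")
    return md5_of_file(path) == expected
```

For a downloaded file, `matches_checksum(local_path, "md5:...")` with the corresponding listed value should return `True`; a mismatch usually indicates an incomplete or corrupted download.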

Additional details

Related works

Is published in
Dataset: 10.3389/fmicb.2023.1331233 (DOI)