ProkBERT datasets

Ligeti, Balázs

doi:10.5281/zenodo.10057832

Published October 31, 2024 | Version v1

Dataset Open

Contributors

Researcher:

Ligeti, Balázs¹

1. Pázmány Péter Catholic University

Datasets for ProkBERT

This repository contains the training, validation, and testing datasets used in our research for ProkBERT, optimized for microbiome studies.

There are 4 different datasets:

ESKAPE genomic features
Bacterial promoter database
Phage training, test and evaluation datasets
ESKAPE masked sequences dataset

Description

The datasets support the development and evaluation of ProkBERT models. They include raw sequence data in bzip2-compressed TSV format (.tsv.bz2) and tokenized datasets in bzip2-compressed HDF5 format (.h5.bz2), produced with various k-mer sizes and shift values.
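The k-mer size and shift value can be pictured as a sliding window over the sequence: each token is k bases long, and the window advances by the shift between tokens. The following is a minimal sketch under that reading only. The function name kmer_tokens is hypothetical, and the actual ProkBERT tokenizer (special tokens, offsets, vocabulary IDs) may differ.

```python
def kmer_tokens(sequence: str, k: int, shift: int) -> list[str]:
    """Split a sequence into k-mers, moving the window `shift` bases at a time.

    Illustrative only: this is not the ProkBERT tokenizer itself.
    """
    if k < 1 or shift < 1:
        raise ValueError("k and shift must be positive integers")
    last_start = len(sequence) - k  # negative when the sequence is shorter than k
    return [sequence[i:i + k] for i in range(0, last_start + 1, shift)]


# Settings that mirror the k6s1, k6s2 and k1s1 suffixes used in the file names.
tokens_k6s1 = kmer_tokens("ATGCGTAC", k=6, shift=1)
tokens_k6s2 = kmer_tokens("ATGCGTAC", k=6, shift=2)
tokens_k1s1 = kmer_tokens("ATGC", k=1, shift=1)
```

With k=1 and shift=1 the tokens are single nucleotides; a larger shift yields fewer tokens for the same sequence.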

ESKAPE genomic features

filename: eskape_genomic_features.tsv.bz2

This dataset includes genomic segments from ESKAPE pathogens, characterized by various genomic features such as coding sequences (CDS), intergenic regions, ncRNA, and pseudogenes. It was analyzed to understand the representations captured by models like ProkBERT-mini, ProkBERT-mini-c, and ProkBERT-mini-long.

Data Fields

contig_id: Identifier of the contig.
segment_id: Unique identifier for each genomic segment.
strand: DNA strand of the segment (+ or -).
seq_start: Starting position of the segment in the contig.
seq_end: Ending position of the segment in the contig.
segment_start: Starting position of the segment in the sequence.
segment_end: Ending position of the segment in the sequence.
label: Genomic feature category (e.g., CDS, intergenic).
segment_length: Length of the genomic segment.
segment: Genomic sequence of the segment.

For a more detailed description, please visit: https://huggingface.co/datasets/neuralbioinfo/ESKAPE-genomic-features
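The raw files can be read with the Python standard library alone. The sketch below assumes that each .tsv.bz2 file is tab-separated and starts with a header row containing the field names listed above; the helper name read_tsv_bz2 is hypothetical.

```python
import bz2
import csv
from typing import Optional


def read_tsv_bz2(path: str, limit: Optional[int] = None) -> list[dict[str, str]]:
    """Read a bzip2-compressed TSV file with a header row into a list of dicts.

    `limit` stops after that many data rows, which is handy for a quick look
    at the larger files without loading them fully.
    """
    rows = []
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for index, row in enumerate(reader):
            if limit is not None and index >= limit:
                break
            rows.append(row)
    return rows
```

For example, read_tsv_bz2("eskape_genomic_features.tsv.bz2", limit=5) would return the first five segments, assuming the file has been downloaded to the working directory. All values are returned as strings, so numeric fields such as segment_length need an explicit int() conversion.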

PROMOTER dataset

filename: bacterial_promoter_db.tsv.bz2

Data collection and processing

Data source: The positive samples, known promoters, are primarily drawn from the Prokaryotic Promoter Database (PPD), containing experimentally validated promoter sequences from 75 organisms. Non-promoter sequences are obtained from the NCBI RefSeq database, sampled specifically from CDS regions.
Preprocessing: The dataset also includes non-promoter sequences generated with higher-order and zero-order Markov chains, which mirror the compositional characteristics of known promoters. An independent test set based on E. coli sigma70 promoters is also included.
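A zero-order Markov chain draws every nucleotide independently from a fixed base composition, so the generated sequence keeps the composition of the promoters but not their motifs. The sketch below illustrates the idea for a single template sequence. It is not the authors' pipeline: the function name, the example sequence, and the choice to estimate the composition from one template are all assumptions made for illustration.

```python
import random
from collections import Counter


def zero_order_sample(template: str, rng: random.Random) -> str:
    """Draw a sequence as long as `template`, sampling each base independently
    from the template's nucleotide frequencies (zero-order Markov chain)."""
    counts = Counter(template)
    bases = sorted(counts)
    weights = [counts[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(template)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
promoter = "TTGACAATTAATCATCGGCTCGTATAATGTGTGG"  # made-up example sequence
decoy = zero_order_sample(promoter, rng)
```

A higher-order chain would instead condition each base on the preceding ones, preserving short-range patterns as well as the overall composition.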

Dataset structure

Dataset splits: The dataset is systematically divided into training, validation, and test subsets.
Data fields:
- segment_id: Unique identifier for each segment.
- ppd_original_SpeciesName: Original species name from the PPD.
- Strand: The strand of the DNA sequence.
- segment: The DNA sequence of the promoter region.
- label: The label indicating whether the sequence is a promoter or non-promoter.
- L: Length of the DNA sequence.
- prom_class: The class of the promoter.
- y: Binary label indicating the presence of a promoter.

Dataset splits

Training set: Primary dataset used for model training.
Test set (Sigma70): Independent test set focusing on E. coli sigma70 promoters.
Multispecies set: Additional test set including various species, ensuring generalization across different organisms.

ESKAPE masked sequences dataset

filename: eskape_masking_dataset.tsv.bz2

Dataset description

This dataset was used to evaluate different models on the masking task, measuring how well each model recovers the original nucleotide at a masked position.

Dataset overview

The dataset is compiled from the RefSeq database and other sources, focusing on ESKAPE pathogens. Genomic features were sampled randomly and then cut into contiguous segments of length 128, 256, 512, or 1024. Segments were selected at random, and in each one a single character was replaced by '*' (masked_segment column) to create a masking task. The reference_segment contains the original, unmasked nucleotides. We performed 10,000 maskings per set, with a maximum of 2,000 genomic features. Only the genomic features 'CDS', 'intergenic', 'pseudogene', and 'ncRNA' were considered.

Dataset Structure

Data Fields:
reference_segment_id: A mapping of segment identifiers to their respective reference IDs in the database.
masked_segment: The DNA sequence of the segment with certain positions masked (marked with '*') for prediction or testing purposes.
position_to_mask: The specific position(s) in the sequence that have been masked, indicated by index numbers.
masked_segment_id: Unique identifier assigned to each masked segment (unique only among segments of the same length).
contig_id: Identifier of the contig to which the segment belongs.
segment_id: Unique identifier for each genomic segment (same as reference segment id).
strand: The DNA strand of the segment, indicated as '+' (positive) or '-' (negative).
seq_start: Starting position of the segment within the contig.
seq_end: Ending position of the segment within the contig.
segment_start: Starting position of the genomic segment in the sequence.
segment_end: Ending position of the genomic segment in the sequence.
label: Category label for the genomic segment (e.g., 'CDS', 'intergenic').
segment_length: The length of the genomic segment.
original_segment: The original genomic sequence without any masking.
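Scoring a model on this task comes down to comparing its predicted nucleotide with the character hidden behind the mask. The sketch below assumes exactly one '*' per masked_segment and equal lengths for masked_segment and original_segment. It locates the mask in the string itself, because the indexing convention of position_to_mask (0-based or 1-based) is not stated here. Both function names are hypothetical.

```python
def masked_index(masked_segment: str, mask_char: str = "*") -> int:
    """Return the 0-based index of the single masked character."""
    if masked_segment.count(mask_char) != 1:
        raise ValueError("expected exactly one masked position")
    return masked_segment.index(mask_char)


def is_recovered(masked_segment: str, original_segment: str, predicted: str) -> bool:
    """Check whether `predicted` equals the nucleotide hidden behind the mask."""
    if len(masked_segment) != len(original_segment):
        raise ValueError("masked and original segments differ in length")
    return original_segment[masked_index(masked_segment)] == predicted


# Toy example: the masked nucleotide is the fourth base, "C".
original = "ATGCGT"
masked = "ATG*GT"
hit = is_recovered(masked, original, "C")
miss = is_recovered(masked, original, "A")
```

Averaging is_recovered over all rows of one segment length gives a per-length accuracy for the model under test.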

PHAGE dataset description

We assembled a phage sequence database from RefSeq and other sources, refining it to reduce redundancy and ensure balance between phage and bacterial sequences. The final dataset targets important bacterial genera, aiding in understanding phage-host interactions and their implications for health.

Data file naming conventions

For tokenized datasets:

Pattern: RND__balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2
Matches files whose names encode the split (test, val, or train), segment length, k-mer size, and shift value.

For sampled raw data:

Pattern: RND__balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2
Matches files whose names encode the split (test, val, or train) and segment length.
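The two patterns can be applied directly with Python's re module to recover the metadata encoded in a file name. This is a minimal sketch; the function name parse_phage_filename and the keys of the returned dictionary are illustrative choices, not part of the dataset.

```python
import re
from typing import Optional

# Patterns copied from the naming conventions above.
TOKENIZED = re.compile(r"RND__balanced_(test|val|train)_Ls(\d+)_k(\d+)s(\d+)\.h5\.bz2")
RAW = re.compile(r"RND__balanced_(test|val|train)_Ls(\d+)\.tsv\.bz2")


def parse_phage_filename(name: str) -> Optional[dict]:
    """Return the split, segment length and, for tokenized files, the k-mer
    size and shift; return None when the name matches neither pattern."""
    match = TOKENIZED.fullmatch(name)
    if match:
        split, length, kmer, shift = match.groups()
        return {"split": split, "segment_length": int(length),
                "kmer": int(kmer), "shift": int(shift), "tokenized": True}
    match = RAW.fullmatch(name)
    if match:
        split, length = match.groups()
        return {"split": split, "segment_length": int(length), "tokenized": False}
    return None


info = parse_phage_filename("RND__balanced_train_Ls1024_k6s2.h5.bz2")
```

Using fullmatch rather than search rejects names with extra leading or trailing characters, such as partially downloaded files.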

Data fields

segment_id: Unique identifier for each genomic segment.
contig_id: Identifier for the contig from which the segment is derived.
segment_start: Start position of the segment in the contig.
segment_end: End position of the segment in the contig.
L: Length of the genomic segment (512, 1024, or 2048).
segment: The genomic sequence of the segment.
label: Classification label (e.g., 'phage').
y: Binary label (1 for phage, 0 for non-phage).

Usage

These datasets are for academic use. Please cite our paper (see Citation) when using them.

Contact information

For any questions, feedback, or contributions regarding the datasets or ProkBERT, please feel free to reach out:

Name: Balázs Ligeti
Email: obalasz@gmail.com

We welcome your input and collaboration to improve our resources and research.

Citation

@Article{ProkBERT2024,
  author  = {Ligeti, Balázs  and Szepesi-Nagy, István  and Bodnár, Babett  and Ligeti-Nagy, Noémi  and Juhász, János},
  journal = {Frontiers in Microbiology},
  title   = {{ProkBERT} family: genomic language models for microbiome applications},
  year    = {2024},
  volume  = {14},
  URL     = {https://www.frontiersin.org/articles/10.3389/fmicb.2023.1331233},
  DOI     = {10.3389/fmicb.2023.1331233}
}

Files

Files (18.3 GB)

Name	MD5	Size
bacterial_promoter_db.tsv.bz2	1c022316afc28a40720ebb54d122fb7f	6.4 MB
eskape_genomic_features.tsv.bz2	0b7545dabc2a1774394d6190ba95ed8c	4.2 MB
eskape_masking_dataset.tsv.bz2	e4f41ff12a96fc8cb751bc43013098e9	8.4 MB
Readme	ca3752449fd65b0319258b7b730a81d4	7.4 kB
RND__balanced_test_Ls1024.tsv.bz2	1825abe0d998b5cb8a468ff737e9e9b0	167.7 MB
RND__balanced_test_Ls1024_k1s1.h5.bz2	f56060ec438d8aac0398adf9d85b4253	155.1 MB
RND__balanced_test_Ls1024_k6s1.h5.bz2	09dc2cd1a414bb479327acd28f198a09	223.8 MB
RND__balanced_test_Ls1024_k6s2.h5.bz2	a545094ac52af86e80b4e385e70a14ad	437.8 MB
RND__balanced_test_Ls2048.tsv.bz2	a75ddeef17b72e1e0b65cacb6126c3c7	136.7 MB
RND__balanced_test_Ls2048_k6s2.h5.bz2	1867464eecb855427ce9c1e0e1e04d38	359.6 MB
RND__balanced_train_Ls1024.tsv.bz2	dadfcb1361bfaa174229faf15840e815	1.7 GB
RND__balanced_train_Ls1024_k1s1.h5.bz2	00c50c5a8768b82400b61d33125c7f44	1.5 GB
RND__balanced_train_Ls1024_k6s1.h5.bz2	d4b83a55e42027b0cf83e9bccdad6221	2.2 GB
RND__balanced_train_Ls1024_k6s2.h5.bz2	9c023934da81a61d1afa521b9bf10f78	4.3 GB
RND__balanced_train_Ls2048.tsv.bz2	283222583e52b2b05e213805218a4fed	1.5 GB
RND__balanced_train_Ls2048_k6s2.h5.bz2	6e608586d4c3fed4cb4f099822baa6f3	4.0 GB
RND__balanced_val_Ls1024.tsv.bz2	bb2ffa833d59bc8d0416f22efafcfe90	168.4 MB
RND__balanced_val_Ls1024_k1s1.h5.bz2	774a3c22e586541d283ff4d245987857	155.8 MB
RND__balanced_val_Ls1024_k6s1.h5.bz2	964416515806187261621460e0e5d9dd	224.7 MB
RND__balanced_val_Ls1024_k6s2.h5.bz2	51d3cf1ff9d964c2cef92af2f8a5e1c6	439.7 MB
RND__balanced_val_Ls2048.tsv.bz2	f1ea21d8c95a6923f544e680b883c2cd	136.2 MB
RND__balanced_val_Ls2048_k6s2.h5.bz2	dd0a53483cbeb3314c8862778222219c	358.5 MB
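The MD5 checksums listed with each file can be used to confirm that a download completed without corruption. The sketch below uses only hashlib from the standard library and reads the file in chunks, since several files are multiple gigabytes. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """Compare a file's MD5 digest with the value listed for that file."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, matches_checksum("eskape_genomic_features.tsv.bz2", "0b7545dabc2a1774394d6190ba95ed8c") should return True for an intact download of that file.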

Additional details

Is published in: Dataset: 10.3389/fmicb.2023.1331233 (DOI)

Repository URL: https://github.com/nbrg-ppcu/prokbert
