miRBench datasets

Sammut, Stephanie; Gresova, Katarina; Tzimotoudis, Dimosthenis; Klimentova, Eva; Cechak, David; Alexiou, Panagiotis

doi:10.5281/zenodo.20540907

Published June 4, 2026 | Version v7

Dataset Open

1. Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
2. Department of Applied Biomedical Science, Faculty of Health Sciences, University of Malta, Msida, Malta
3. National Centre for Biomolecular Research, Faculty of Science, Masaryk University, 61137 Brno, Czech Republic
4. Central European Institute of Technology – Masaryk University

Changelog: 2026-06-04 (v7)

TO BE UPDATED SOON

Changelog (v6)

PhyloP and PhastCons conservation scores for the target gene sequence have been added to the test/train/leftout datasets as two additional columns: 'gene_phyloP' and 'gene_phastCons'.

Both new columns contain a list of conservation scores rounded to 3 decimal places, one score for each nucleotide in the gene sequence.
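As an illustration of how the two conservation columns might be consumed, the sketch below parses one cell into a list of floats and checks that there is one score per nucleotide. The serialization is an assumption: it treats each cell as a bracketed, comma-separated list such as "[0.123, -1.5, 0.0]". The helper names `parse_scores` and `check_row` are hypothetical and not part of the dataset or its pipelines.

```python
import json


def parse_scores(cell: str) -> list[float]:
    """Parse one 'gene_phyloP' or 'gene_phastCons' cell into floats.

    Assumption: the cell is a bracketed, comma-separated list such as
    "[0.123, -1.5, 0.0]". Adjust this if the files use another format.
    """
    return [float(value) for value in json.loads(cell)]


def check_row(gene: str, phylop_cell: str, phastcons_cell: str) -> bool:
    """Return True when both score lists have one entry per nucleotide."""
    phylop = parse_scores(phylop_cell)
    phastcons = parse_scores(phastcons_cell)
    return len(phylop) == len(gene) and len(phastcons) == len(gene)


# Toy example with a 4-nucleotide sequence, shortened for readability.
print(check_row("ACGT", "[0.1, -0.25, 1.0, 0.333]", "[0.0, 0.5, 1.0, 0.999]"))
```

If the real files store missing scores in a form that is not valid JSON, `parse_scores` would need a different parser.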

PhyloP and PhastCons scores were obtained from:

Downloaded on 15 September 2024.

Dataset Summary (v5)

The datasets listed below were recreated with a series of post-processing pipelines (available here) to eliminate a bias between the positive and negative classes (a miRNA family imbalance) discovered in previous versions of the datasets. Each has a 1:1 ratio of positive to negative examples.

AGO2_eCLIP_Manakov2022_leftout.tsv.gz
AGO2_eCLIP_Manakov2022_test.tsv.gz
AGO2_eCLIP_Manakov2022_train.tsv.gz
AGO2_eCLIP_Klimentova2022_test.tsv.gz
AGO2_CLASH_Hejret2023_test.tsv.gz
AGO2_CLASH_Hejret2023_train.tsv.gz
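The 1:1 class ratio and the per-family balance can be checked after download with a short script. The sketch below counts positive and negative examples per miRNA family using the 'label' and 'noncodingRNA_fam' columns. It assumes each file has a header row and that a positive label is written as 1 or True; both are assumptions, and the function names are hypothetical.

```python
import csv
import gzip
from collections import Counter

# Assumption: how a positive label is written in the files.
POSITIVE = {"1", "true"}


def class_counts(path: str) -> tuple[Counter, Counter]:
    """Count positive and negative examples per miRNA family."""
    pos, neg = Counter(), Counter()
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            is_positive = row["label"].strip().lower() in POSITIVE
            target = pos if is_positive else neg
            target[row["noncodingRNA_fam"]] += 1
    return pos, neg


def is_balanced(pos: Counter, neg: Counter) -> bool:
    """Return True when the overall positive to negative ratio is 1:1."""
    return sum(pos.values()) == sum(neg.values())
```

Comparing `pos[family]` with `neg[family]` for each family shows whether the classes are matched per family, not just overall.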

The dataset listed below is the concatenated HybriDetector output of all selected samples from the available Manakov sample files. It therefore contains only a raw version of the positive class of the Manakov dataset, and it is the input to the series of post-processing pipelines for the Manakov dataset.

AGO2_eCLIP_Manakov2022_full_dataset.tsv.gz

The other inputs to the post-processing pipelines, for the Hejret and Klimentova datasets, are found at the following links.

The structure of each dataset is consistent, with the following column order:

gene: A string of length 50 indicating the binding site sequence in the 5’ to 3’ direction.
noncodingRNA: A string of variable length (16–28) indicating the mature miRNA sequence in the 5’ to 3’ direction.
noncodingRNA_name: A string indicating the name of the miRNA.
noncodingRNA_fam: A string indicating the name of the miRNA family the miRNA belongs to.
feature: A string indicating the feature annotation on the genome where the binding site occurs.
label: A boolean value indicating whether the example belongs to the positive or negative class.
chr: A string indicating the chromosome number on the genome where the binding site occurs.
start: An integer indicating the 1-based start position of the binding site on the genome.
end: An integer indicating the 1-based end position of the binding site on the genome.
strand: A string indicating whether the binding site occurs on the ’+’ or ’-’ strand on the genome.
gene_cluster_ID: An integer indicating the cluster ID of the binding site sequence used to generate the negative class.
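The column descriptions above can be turned into a lightweight sanity check. The sketch below reads the first rows of a file and reports rows that contradict the documented schema. It assumes a header row, and it assumes that any later columns, such as the v6 conservation columns, come after the eleven listed here. The function name `validate` and the `max_rows` parameter are hypothetical.

```python
import csv
import gzip

EXPECTED_COLUMNS = [
    "gene", "noncodingRNA", "noncodingRNA_name", "noncodingRNA_fam",
    "feature", "label", "chr", "start", "end", "strand", "gene_cluster_ID",
]


def validate(path: str, max_rows: int = 1000) -> list[str]:
    """Return readable descriptions of schema problems in the first rows."""
    problems = []
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = list(reader.fieldnames or [])
        # Extra trailing columns are tolerated; the documented ones must lead.
        if header[:len(EXPECTED_COLUMNS)] != EXPECTED_COLUMNS:
            return [f"unexpected header: {header}"]
        for index, row in enumerate(reader, start=1):
            if index > max_rows:
                break
            if len(row["gene"]) != 50:
                problems.append(f"row {index}: gene length is not 50")
            if not 16 <= len(row["noncodingRNA"]) <= 28:
                problems.append(f"row {index}: miRNA length outside 16 to 28")
            if row["strand"] not in {"+", "-"}:
                problems.append(f"row {index}: unexpected strand value")
            if int(row["start"]) > int(row["end"]):
                problems.append(f"row {index}: start is greater than end")
    return problems
```

An empty list means no problems were found in the rows inspected. The full Manakov dataset is a raw pipeline input, so it may not satisfy every check.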

Note that the binding sites reported in all datasets are consistent with GRCh38.

Files

Files (739.7 MB)

Name	MD5	Size
AGO2_CLASH_Hejret2023_test.tsv.gz	224fe08e101b6997ebae54090664c9f7	221.7 kB
AGO2_CLASH_Hejret2023_train.tsv.gz	55f1fbc97234cb3747ca7480b2ef70c6	1.8 MB
AGO2_eCLIP_Klimentova2022_test.tsv.gz	49b3ffb7752401a52b70f18a15fec4b6	200.8 kB
AGO2_eCLIP_Manakov2022_full_dataset.tsv.gz	f0271f540d6035e879e6afa86cdf1ac4	107.3 MB
AGO2_eCLIP_Manakov2022_leftout.tsv.gz	0bd263447cd83f5dbdf669a1d421bd56	4.4 MB
AGO2_eCLIP_Manakov2022_test.tsv.gz	bb6140b14f025899b38469d3c78d93f7	71.6 MB
AGO2_eCLIP_Manakov2022_train.tsv.gz	42a7e091f66cf12539d6ae47bc2e0c85	554.2 MB
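The MD5 checksums listed for each file can be used to confirm that a download is intact. The following is a minimal sketch using Python's standard `hashlib` module. The checksum values are copied from the file listing, while the function names and the assumption that files are saved locally under their original names are illustrative.

```python
import hashlib

# Checksums copied from the file listing of this record.
EXPECTED_MD5 = {
    "AGO2_CLASH_Hejret2023_test.tsv.gz": "224fe08e101b6997ebae54090664c9f7",
    "AGO2_CLASH_Hejret2023_train.tsv.gz": "55f1fbc97234cb3747ca7480b2ef70c6",
    "AGO2_eCLIP_Klimentova2022_test.tsv.gz": "49b3ffb7752401a52b70f18a15fec4b6",
    "AGO2_eCLIP_Manakov2022_full_dataset.tsv.gz": "f0271f540d6035e879e6afa86cdf1ac4",
    "AGO2_eCLIP_Manakov2022_leftout.tsv.gz": "0bd263447cd83f5dbdf669a1d421bd56",
    "AGO2_eCLIP_Manakov2022_test.tsv.gz": "bb6140b14f025899b38469d3c78d93f7",
    "AGO2_eCLIP_Manakov2022_train.tsv.gz": "42a7e091f66cf12539d6ae47bc2e0c85",
}


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file without loading it whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, name: str) -> bool:
    """Compare a local file against the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Reading in chunks keeps memory use low for the larger files, such as the 554.2 MB training set.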

Additional details

Is derived from: Preprint: 10.1101/2022.02.13.480296 (DOI)
Is new version of: Publication: 10.3390/genes13122323 (DOI); Publication: 10.1038/s41598-023-49757-z (DOI)

	All versions	This version
Views	1,109	28
Downloads	2,248	50
Data volume	1.4 TB	6.7 GB
