There is a newer version of the record available.

Published December 16, 2024 | Version v5

miRBench datasets

  • 1. Centre for Molecular Medicine and Biobanking, University of Malta, Msida, Malta
  • 2. Department of Applied Biomedical Science, Faculty of Health Sciences, University of Malta, Msida, Malta
  • 3. National Centre for Biomolecular Research, Faculty of Science, Masaryk University, 61137 Brno, Czech Republic
  • 4. Central European Institute of Technology – Masaryk University

Description

Changelog

The datasets listed below were regenerated through a series of post-processing pipelines (available here) to eliminate a bias between the positive and negative classes (miRNA frequency class bias) discovered in previous versions of the datasets. Each dataset has a 1:1 ratio of positive to negative examples.

  • AGO2_eCLIP_Manakov2022_leftout.tsv.gz
  • AGO2_eCLIP_Manakov2022_test.tsv.gz
  • AGO2_eCLIP_Manakov2022_train.tsv.gz
  • AGO2_eCLIP_Klimentova2022.tsv.gz
  • AGO2_CLASH_Hejret2023_test.tsv.gz
  • AGO2_CLASH_Hejret2023_train.tsv.gz

The structure of each dataset is consistent, with the following column order:

  1. gene: A string of length 50 indicating the binding site sequence in the 5’ to 3’ direction.
  2. noncodingRNA: A string of variable length (16–28) indicating the mature miRNA sequence in the 5’ to 3’ direction.
  3. noncodingRNA_name: A string indicating the name of the miRNA.
  4. noncodingRNA_fam: A string indicating the name of the miRNA family the miRNA belongs to.
  5. feature: A string indicating the feature annotation on the genome where the binding site occurs.
  6. label: A boolean value indicating whether the example belongs to the positive or negative class.
  7. chr: A string indicating the chromosome number on the genome where the binding site occurs.
  8. start: An integer indicating the 1-based start position of the binding site on the genome.
  9. end: An integer indicating the 1-based end position of the binding site on the genome.
  10. strand: A string indicating whether the binding site occurs on the ’+’ or ’-’ strand on the genome.
  11. gene_cluster_ID: An integer indicating the cluster ID of the binding site sequence used to generate the negative class.
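
The column layout above can be checked programmatically. The following is a minimal sketch using only the Python standard library; it assumes that each file starts with a header row carrying exactly these column names, which the description does not state explicitly. The function names `read_rows` and `summarize` are hypothetical. Label values are counted as raw strings, since the on-disk encoding of the boolean `label` column is not specified here.

```python
import csv
import gzip
from collections import Counter

# Column names in the order given in the description.
EXPECTED_COLUMNS = [
    "gene", "noncodingRNA", "noncodingRNA_name", "noncodingRNA_fam",
    "feature", "label", "chr", "start", "end", "strand", "gene_cluster_ID",
]


def read_rows(path):
    """Yield one dict per row from a gzipped TSV file.

    Assumption: the first line is a header row with the expected column names.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"unexpected columns: {reader.fieldnames}")
        for row in reader:
            yield row


def summarize(path):
    """Return (label counts, number of rows with out-of-range sequence lengths)."""
    labels = Counter()
    bad_lengths = 0
    for row in read_rows(path):
        labels[row["label"]] += 1
        # Binding sites are described as length 50, miRNAs as length 16-28.
        if len(row["gene"]) != 50 or not 16 <= len(row["noncodingRNA"]) <= 28:
            bad_lengths += 1
    return labels, bad_lengths
```

For a file with the stated 1:1 class ratio, the two counts returned in `labels` would be expected to be equal, and `bad_lengths` would be expected to be zero.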

The dataset listed below is the concatenated HybriDetector output of all selected samples from the available Manakov sample files. It therefore contains only a raw version of the positive class of the Manakov dataset, and it serves as the input to the post-processing pipelines for the Manakov dataset.

  • AGO2_eCLIP_Manakov2022_full_dataset.tsv.gz

The inputs to the post-processing pipelines for the Hejret and Klimentova datasets are available at the following links.

Note that the binding sites reported in all datasets are consistent with GRCh38. 
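
Because `start` and `end` are 1-based, tools that expect 0-based, half-open intervals (such as BED-based tools) need converted coordinates. The sketch below shows one way to do this. It assumes that `end` is inclusive, which the description does not state, so this should be verified (for example by checking that `end - start + 1` equals the binding site length of 50) before relying on it. The helper names are hypothetical, and the `chr` value is passed through unchanged.

```python
def to_bed_interval(start, end):
    """Convert a 1-based interval to 0-based, half-open coordinates.

    Assumption: `end` is inclusive; the dataset description does not say so.
    """
    start, end = int(start), int(end)
    if start < 1 or end < start:
        raise ValueError(f"invalid 1-based interval: {start}-{end}")
    return start - 1, end


def to_bed_line(row):
    """Format one dataset row (a dict keyed by column name) as a six-column BED line."""
    bed_start, bed_end = to_bed_interval(row["start"], row["end"])
    # The BED score column is set to 0 as a placeholder.
    fields = [row["chr"], bed_start, bed_end, row["noncodingRNA_name"], 0, row["strand"]]
    return "\t".join(str(field) for field in fields)
```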

Files (197.0 MB)

  • md5:f0e1705fc633ba46510cdadf91756df5 (28.5 kB)
  • md5:1a107faad936c427f56447ac370afb18 (271.7 kB)
  • md5:6cb09adad0c8f684db3d37a2a9eb26a7 (26.7 kB)
  • md5:f0271f540d6035e879e6afa86cdf1ac4 (107.3 MB)
  • md5:f0ca72ed6b17b1ea73a3ebc122a9877e (667.1 kB)
  • md5:a16ba7b287e68771f1fb24de2c459612 (10.0 MB)
  • md5:0ba7f4af69b706ed3a067eabb9ce7dd1 (78.7 MB)
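
A downloaded file can be compared against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the local file path in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, listed):
    """Compare a local file with a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For example, `matches_record("downloaded_file.tsv.gz", "md5:f0e1705fc633ba46510cdadf91756df5")` would return `True` only if the local file is byte-identical to the file with that checksum.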

Additional details

Related works

Is derived from
Preprint: 10.1101/2022.02.13.480296 (DOI)
Is new version of
Publication: 10.3390/genes13122323 (DOI)
Publication: 10.1038/s41598-023-49757-z (DOI)