Published April 6, 2022 | Version v2
Dataset Open

RegEl Database: text-mined regulatory elements from the literature and their associations to genes and disease

  • 1. Humboldt-Universität zu Berlin
  • 2. Charité-Universitätsmedizin Berlin
  • 3. Berlin Institute of Health

Description

@article{garda2022regel,
  title={RegEl corpus: identifying DNA regulatory elements in the scientific literature},
  author={Garda, Samuele and Lenihan-Geels, Freyda and Proft, Sebastian and Hochmuth, Stefanie and Sch{\"u}lke, Markus and Seelow, Dominik and Leser, Ulf},
  journal={Database},
  volume={2022},
  year={2022},
  publisher={Oxford Academic}
}

# RegEl PubMed Database

This database contains the annotations generated by running [HunFlair](https://github.com/flairNLP/flair/blob/master/resources/docs/HUNFLAIR.md) models trained on the [RegEl corpus](https://zenodo.org/record/5776679) over >20M PubMed abstracts.

Pairing these annotations with those provided by PubTator yields a large text-mining database of regulatory elements associated with genes (normalized to NCBI Gene IDs) and diseases (normalized to either MeSH or OMIM).

The tables composing the database are:

* abstracts.db:
  - pmid = PubMed ID of the abstract
  - sid = ID of the sentence within the abstract (from 0 to # of sentences)
  - text = text of the sentence

* gene.db and disease.db:
  - pmid = PubMed ID of the abstract
  - sid = ID of the sentence within the abstract (from 0 to # of sentences)
  - etype = entity type (enhancer, promoter, TFBS)
  - ann_text = mention of the regulatory element as found in the abstract
  - start = character position at which the mention begins
  - end = character position at which the mention ends
  - score = model's confidence in the annotation
  - cui = gene or disease identifier
  - cui_symbol = official symbol of cui (if available)
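
The column layout above lends itself to a simple lookup by identifier. The following is a minimal sketch, assuming the `.db` files are SQLite databases and that each holds a table named after the file (`gene`, `disease`); neither assumption is confirmed by this description, so check the actual schema first. The helper names and the toy rows are hypothetical and serve only to make the example runnable without the download.

```python
import sqlite3

# Column names are taken from the description; the table names and the
# SQLite format are assumptions. "start" and "end" are quoted because
# END is an SQL keyword.
COLUMNS = 'pmid, sid, etype, ann_text, "start", "end", score, cui, cui_symbol'


def mentions_for_cui(conn, cui, table="gene", min_score=0.0):
    """Return regulatory-element mentions linked to one gene or disease ID."""
    if table not in ("gene", "disease"):
        raise ValueError("table must be 'gene' or 'disease'")
    sql = (
        f"SELECT {COLUMNS} FROM {table} "
        "WHERE cui = ? AND score >= ? ORDER BY score DESC"
    )
    return conn.execute(sql, (cui, min_score)).fetchall()


def build_toy_db():
    """Create an in-memory database with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (pmid INTEGER, sid INTEGER, etype TEXT, "
        'ann_text TEXT, "start" INTEGER, "end" INTEGER, score REAL, '
        "cui TEXT, cui_symbol TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            (1, 0, "promoter", "promoter", 4, 12, 0.95, "0000", "GENE_A"),
            (1, 2, "enhancer", "distal enhancer", 10, 25, 0.40, "0000", "GENE_A"),
            (2, 1, "TFBS", "binding site", 0, 12, 0.80, "1111", "GENE_B"),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = build_toy_db()
    # Keep only mentions the model is reasonably confident about.
    for row in mentions_for_cui(conn, "0000", min_score=0.5):
        print(row)
```

Against the real data, `build_toy_db()` would be replaced by something like `sqlite3.connect("gene.db")`, with the path adjusted to wherever the archive was extracted. Joining on `pmid` and `sid` with `abstracts.db` would then recover the sentence text for each mention.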

Files

regel_db.zip (304.0 MB)
md5:04d86b2c3d11fd9798fd4ef9b553af50
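
The MD5 checksum listed for `regel_db.zip` can be used to confirm that a download is complete and uncorrupted. A minimal sketch using only the Python standard library follows; the function names and the local file path are hypothetical.

```python
import hashlib

# Checksum as listed for regel_db.zip in the file listing.
EXPECTED_MD5 = "04d86b2c3d11fd9798fd4ef9b553af50"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download("regel_db.zip")` on the downloaded archive should return `True`; a `False` result suggests the file should be downloaded again.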