Dataset Open Access

ProGene - A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus

Faessler, Erik; Modersohn, Luise; Lohr, Christina; Hahn, Udo

The Pro(tein)/Gene corpus was developed at the JULIE Lab Jena under the supervision of Prof. Udo Hahn.

The goals of the annotation project were:

  • to construct a consistent and, as far as possible, subdomain-independent and subdomain-comprehensive protein-annotated corpus
  • to differentiate between protein families and groups, protein complexes, protein molecules, protein variants (e.g. alleles) and elliptic enumerations of proteins.

The corpus has the following annotation levels / entity types:

  • protein
  • protein_familiy_or_group
  • protein_complex
  • protein_variant
  • protein_enum

For definitions of the annotation levels, please refer to the Proteins-guidelines-final.doc file included in the download package.

To achieve broad coverage of biological subdomains, documents from multiple other protein / gene corpora were reannotated. For further coverage, new document sets were created. All documents are abstracts from PubMed/MEDLINE. The corpus is the union of all documents in the different subcorpora.
All documents are delivered as MMAX2 (http://mmax2.net/) annotation projects.
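
Because the documents ship as MMAX2 projects, a reader may want to pull the annotations out programmatically. The following is a minimal sketch, assuming the usual MMAX2 conventions: markable files are XML, each markable carries `id`, `span` and `mmax_level` attributes, and spans are written as `word_3..word_5` or as comma-separated fragments. The sample XML, its namespace and the level name `protein` are made up for illustration and are not taken from the download package; the actual file layout and attribute names in progene.zip should be checked against the data.

```python
import xml.etree.ElementTree as ET


def expand_span(span):
    """Expand an MMAX2-style span such as 'word_3..word_5,word_8' into word IDs."""
    ids = []
    for part in span.split(","):
        if ".." in part:
            start, end = part.split("..")
            prefix, first = start.rsplit("_", 1)
            last = end.rsplit("_", 1)[1]
            ids.extend(f"{prefix}_{i}" for i in range(int(first), int(last) + 1))
        else:
            ids.append(part)
    return ids


def read_markables(xml_text):
    """Return (id, level, word IDs) for every markable element in the XML text."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter():
        # Strip a namespace prefix such as '{www.eml.org/NameSpaces/protein}'.
        if elem.tag.rsplit("}", 1)[-1] == "markable":
            result.append(
                (elem.get("id"), elem.get("mmax_level"), expand_span(elem.get("span")))
            )
    return result


# Hypothetical markable file content, for illustration only.
SAMPLE = """<markables xmlns="www.eml.org/NameSpaces/protein">
  <markable id="markable_1" span="word_3..word_5" mmax_level="protein"/>
  <markable id="markable_2" span="word_9" mmax_level="protein"/>
</markables>"""

for markable in read_markables(SAMPLE):
    print(markable)
```

The word IDs returned here would still need to be resolved against the project's word-level base data file to recover the annotated text.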

Files (24.9 GB)
Name          Size
progene.zip   24.9 GB
md5:fa985ca0ef2c8da932db6f235422c9d9
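
Given the size of the archive, it may be worth checking the download against the MD5 checksum listed with progene.zip above. The sketch below assumes the file has been saved locally; the function names and the chunk size are illustrative choices, and the file is read in chunks so that the 24.9 GB archive never has to fit in memory.

```python
import hashlib

# Checksum as given in the file listing for progene.zip.
EXPECTED_MD5 = "fa985ca0ef2c8da932db6f235422c9d9"


def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify("progene.zip")` on the saved archive (the local path is an assumption) should return `True` for an intact download; a `False` result suggests the transfer was incomplete or corrupted.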
  • Faessler et al. (2020). PROGENE—A Large-scale, High-Quality Protein-Gene Annotated Benchmark Corpus, LREC 2020
