Published June 5, 2021 | Version 1.0
Annotation collection Open

Whatizit performance evaluation against CRAFT corpus

  • 1. scientific assistant - working student
  • 2. Team Lead
  • 3. Head

Description

The Colorado Richly Annotated Full-Text (CRAFT) Corpus is a collection of 97 articles from PubMed Central.
I use a subset of CRAFT: the concept annotations to the Gene Ontology (GO).
The annotations are stored separately for each of the GO subdivisions: biological process, cellular component, and molecular function. The approach extracts and combines the annotations from these files to provide a comprehensive overview of the corpus annotations and to allow comparison with the Whatizit annotations, which are not separated by subdivision.
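The combining step might look roughly like the sketch below. It assumes the annotations of each subdivision have already been parsed into a mapping from article ID to a list of GO IDs. The article IDs, the GO IDs, and the function name `combine_subdivisions` are illustrative assumptions, not taken from the corpus or from the published results.

```python
from collections import Counter

# Hypothetical per-subdivision annotations: article ID -> list of GO IDs.
# The IDs are chosen for illustration only, not read from CRAFT.
bp = {"article_1": ["GO:0008150", "GO:0006915"], "article_2": ["GO:0006915"]}
cc = {"article_1": ["GO:0005634"]}
mf = {"article_2": ["GO:0003677", "GO:0003677"]}

def combine_subdivisions(*subdivisions):
    """Merge per-subdivision annotations into one Counter of GO IDs per article."""
    combined = {}
    for annotations in subdivisions:
        for article_id, go_ids in annotations.items():
            # Counter.update adds occurrence counts instead of overwriting them.
            combined.setdefault(article_id, Counter()).update(go_ids)
    return combined

combined = combine_subdivisions(bp, cc, mf)
```

After this step each article has a single set of GO ID counts with no subdivision separation, which is the shape needed for a comparison with Whatizit output.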

Whatizit is a named-entity recognition tool that has been adapted to work with a dictionary based on the Gene Ontology, including the preferred term and synonym terms for each GO ID, as applicable.
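To illustrate the idea of a dictionary that holds preferred and synonym terms per GO ID, here is a minimal sketch of dictionary-based matching. It is not Whatizit's implementation. The dictionary entries, the `annotate` function, and the regex-based matching strategy are all assumptions made for the example.

```python
import re

# Hypothetical dictionary entries: GO ID -> preferred term followed by synonyms.
go_terms = {
    "GO:0006915": ["apoptotic process", "apoptosis"],
    "GO:0005634": ["nucleus", "cell nucleus"],
}

# Invert to a lowercase surface form -> GO ID lookup.
lookup = {term.lower(): go_id for go_id, terms in go_terms.items() for term in terms}

# Longest terms first so that "cell nucleus" wins over "nucleus".
pattern = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in sorted(lookup, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def annotate(text):
    """Return (start, end, GO ID) for every dictionary term found in the text."""
    return [(m.start(), m.end(), lookup[m.group(1).lower()]) for m in pattern.finditer(text)]

hits = annotate("Apoptosis was observed in the cell nucleus.")
```

Because synonyms map to the same GO ID as the preferred term, the output contains only GO IDs and text offsets, regardless of which surface form was matched.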

Whatizit annotations are created for the same 97 articles from the CRAFT corpus, and the two annotation sets are compared:
- general statistics are generated: number of GO terms per article, number of articles per GO term, article IDs per GO term, and number of GO terms for the corpus

- term frequency (TF) is calculated

- inverse document frequency (IDF) is calculated

- a merged dataframe is created for comparison: CRAFT annotations are treated as the "gold standard" and occurrences are aligned.
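The TF, IDF, and alignment steps can be sketched as follows. The record does not state which TF and IDF variants were used, so this example assumes TF = occurrences of a GO ID in an article divided by all annotations in that article, and IDF = log(number of articles / number of articles containing the GO ID). Plain tuples stand in for the merged dataframe, and all data and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical input: article ID -> Counter of GO ID occurrences.
craft = {
    "article_1": Counter({"GO:0006915": 2, "GO:0005634": 1}),
    "article_2": Counter({"GO:0006915": 1}),
}
whatizit = {
    "article_1": Counter({"GO:0006915": 1, "GO:0003677": 1}),
    "article_2": Counter({"GO:0006915": 1}),
}

def term_frequency(corpus):
    """TF per (article, GO ID): occurrences divided by all annotations in the article."""
    tf = {}
    for article_id, counts in corpus.items():
        total = sum(counts.values())
        for go_id, n in counts.items():
            tf[(article_id, go_id)] = n / total
    return tf

def inverse_document_frequency(corpus):
    """IDF per GO ID: log(number of articles / articles containing the GO ID)."""
    n_articles = len(corpus)
    doc_freq = Counter(go_id for counts in corpus.values() for go_id in counts)
    return {go_id: math.log(n_articles / n) for go_id, n in doc_freq.items()}

def align(gold, predicted):
    """One row per (article, GO ID) seen in either source, with both counts."""
    rows = []
    for article_id in sorted(set(gold) | set(predicted)):
        g = gold.get(article_id, Counter())
        p = predicted.get(article_id, Counter())
        for go_id in sorted(set(g) | set(p)):
            # A missing GO ID yields 0, so gaps on either side stay visible.
            rows.append((article_id, go_id, g[go_id], p[go_id]))
    return rows

craft_tf = term_frequency(craft)
craft_idf = inverse_document_frequency(craft)
aligned = align(craft, whatizit)
```

Each aligned row holds the gold-standard count next to the Whatizit count, so a zero in the third column marks a term found only by Whatizit and a zero in the fourth marks a term Whatizit missed.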

Files

results_csv.zip (331.0 kB)
md5:6fba5a13f3a6b1a38f2b13ec732cb609