There is a newer version of the record available.

Published September 9, 2022 | Version 1
Dataset Open

S1000 corpus, large-scale tagging results and other supplementary files

  • 1. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
  • 2. TurkuNLP Group, Department of Computing, University of Turku, Finland
  • 3. Textmi, Tokyo, Japan
  • 4. Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Greece

Description

Data associated with the S1000 corpus

The dictionary files in tagger-organisms-dictionary-S1000.tar.gz are meant for use with the tagger software available here: https://github.com/larsjuhljensen/tagger

The online version of the annotation documentation can be found here: https://katnastou.github.io/s1000-corpus-annotation-guidelines/

The S1000 corpus, split into training, development and test sets, is available in BRAT format in S1000-corpus.tar.gz and in CoNLL format in s1000-conll.tar.gz
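BRAT standoff files store each text-bound annotation as a tab-separated line: an ID, a type followed by character offsets, and the covered text. The following is a minimal sketch for reading such lines, assuming the .ann files in the archive follow the standard BRAT standoff layout. The names `TextBound` and `parse_brat_ann` are hypothetical, and the sample line (including the label `Species`) is made up for illustration rather than taken from the corpus.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str                   # e.g. "T1"
    label: str                    # annotation type
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str                     # covered text as stored in the file


def parse_brat_ann(ann_text: str) -> List[TextBound]:
    """Collect text-bound ("T") annotations; all other line types are skipped."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, type_and_offsets, mention = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ";".
        spans = [tuple(int(x) for x in frag.split()) for frag in offsets.split(";")]
        entities.append(TextBound(ann_id, label, spans, mention))
    return entities


# Illustrative input only, not taken from the S1000 corpus.
SAMPLE_ANN = (
    "T1\tSpecies 0 16\tEscherichia coli\n"
    "N1\tReference T1 Taxonomy:562\tEscherichia coli\n"
)
entities = parse_brat_ann(SAMPLE_ANN)
```

Normalization lines (those starting with `N`) are ignored here; a fuller reader would link them back to their `T` annotation by ID.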

The tagging results of the Jensenlab tagger on the S1000 test set are here: S1000-jensenlab-tagger.tar.gz

The results of the large-scale Jensenlab tagger run over all of PubMed and the PMC Open Access articles are provided here: Jensenlab_tagger_large_scale_matches_with_rank.tsv.gz
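Because the large-scale results are a gzip-compressed TSV, they can be inspected without decompressing the whole file to disk. The sketch below reads only the first few rows. The column layout is not described in this record, so the code makes no assumption about it, and the function name `head_tsv_gz` is hypothetical.

```python
import csv
import gzip
from itertools import islice


def head_tsv_gz(path, n=5):
    """Return the first n rows of a gzip-compressed TSV as lists of strings."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the data as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(islice(reader, n))
```

Once the file has been downloaded, `head_tsv_gz("Jensenlab_tagger_large_scale_matches_with_rank.tsv.gz")` would return its first five rows, which is usually enough to work out what each column holds.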

The model used for the large-scale run of the transformer-based method is here: S1000_Transformer_based_tagger_large_scale_model.tar.gz, and the results of that large-scale tagging are here: Transformer_based_tagger_large_scale_matches_with_rank.tsv.zip

Files

Annotation guidelines for S1000 corpus.pdf

Files (19.3 GB)

Checksum Size
md5:6992a62d91aebeca9100ee6bd46be3da 158.0 kB
md5:8c1135fe8cb216f41f54061c43abc171 16.1 GB
md5:961aa1d006aefa8681c38c192983ccf9 585.3 kB
md5:06088fa7a73fc84f9df9662f6394714e 272.0 kB
md5:7e812ffe2967f04526556e194af68618 275.1 kB
md5:b8cff89db39ea908ed7051d17a931e96 1.3 GB
md5:4e14d7396d7b68f531ea797112d620da 208.2 MB
md5:a495adf984d2f9ce193113c0719bc87d 1.7 GB
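
The MD5 checksums listed for the files can be used to confirm that a download completed intact. The sketch below hashes a file in chunks, so even the multi-gigabyte archives do not need to fit in memory. The function names `md5_of_file` and `matches_listed_checksum` are hypothetical, and which file each checksum belongs to has to be taken from the record itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunk_size-byte pieces."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed):
    """Compare a file against a checksum written as 'md5:<hex>' or '<hex>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

For instance, `matches_listed_checksum(local_path, "md5:8c1135fe8cb216f41f54061c43abc171")` should return `True` for an intact copy of the 16.1 GB file, where `local_path` is a placeholder for wherever that file was saved.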

Additional details

Related works

Is required by
Preprint: 10.1101/2023.02.20.528934 (DOI)

Funding

DeepTextNet – Deep learning-based text mining for interpretation of omics data 101023676
European Commission