================================================================================
Rhea NLP
================================================================================

This directory contains data to support the development of natural language
processing (NLP) methods to mine biochemical reactions from text for Rhea and
UniProt.

List of files:

- EnzChemRED.tar.gz
  Content:
    The EnzChemRED corpus of 1,210 PubMed abstracts with annotations of
    gene/protein and chemical mentions using UniProtKB and ChEBI respectively,
    chemical conversions (relations that link pairs of chemical mentions), and
    the enzymes that catalyze those conversions, when available.
    The development of EnzChemRED is described in Lai, P.-T., et al. (2024).
  File format:
    BioC - see:
    Comeau, D.C., et al. (2013) BioC: a minimalist approach to
    interoperability for biomedical text processing. Database (Oxford), 2013,
    bat064. DOI:10.1093/database/bat064; PMID:24048470; PMCID:PMC3889917

- BioREx_EnzChemRED_PubMed.tsv
  Content:
    Raw results of scanning PubMed abstracts up to December 2023 using a
    BioREx model fine-tuned using EnzChemRED, as described in
    Lai, P.-T., et al. (2024).
  File format:
    TSV with fields:
     1. ChEBI ID 1
     2. Chemical name 1
     3. ChEBI ID 2
     4. Chemical name 2
     5. Relation type
        Values: Indirect_conversion, Conversion, Non_conversion
     6. BioREx score
        Values: Number ranging from 0 to 1
     7. PMID
     8. Sentence
        Sentence from PMID containing the relation.

- BioREx_EnzChemRED_PubMed_normalized.tsv
  Content:
    Processed results of scanning PubMed abstracts up to December 2023 using a
    BioREx model fine-tuned using EnzChemRED, as described in
    Lai, P.-T., et al. (2024).
    ChEBI compounds were first normalized based on the pH 7.3 relationships
    used in Rhea, then by the first two layers of their InChI keys for
    compounds with InChI.
    Pairs with identical ChEBI ID 1 and 2 and pairs where at least one ChEBI ID
    is a participant in >= 100 Rhea reactions were removed from the dataset.
  File format:
    TSV with fields:
     1. ChEBI ID 1
     2. ChEBI name 1
     3. ChEBI ID 2
     4. ChEBI name 2
     5. Relation type
        Values: Indirect_conversion, Conversion, Non_conversion
        For a ChEBI pair found with different relation types in different
        PMIDs, the relation type for the pair is chosen according to the order
        in which the values are listed above.
     6. Max BioREx score
        Values: Number ranging from 0 to 1
        Maximum BioREx score of all relations that include this ChEBI pair.
     7. PMIDs
        Number of PMIDs in which the ChEBI pair appears in the PubMed dataset.
     8. Rhea
        Values: yes, no
        Indicates whether the ChEBI pair is found in Rhea reactions.
     9. Atom conservation (OPTIONAL)
        Percent atom conservation of the ChEBI pair.
    10. Min DRFP distance (OPTIONAL)
        Values: Number ranging from 0 to 2
        Minimum cosine distance between the DRFP of the ChEBI pair and the
        DRFPs of ChEBI pairs from Rhea reactions.
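
The field list for BioREx_EnzChemRED_PubMed_normalized.tsv can be mapped onto a
small record type as in the Python sketch below. It is an illustration, not
part of the dataset tooling. All names in it (NormalizedPair, parse_row,
read_pairs, preferred_relation, novel_conversions) are hypothetical, and it
rests on assumptions this README does not state: that the file has no header
row, that fields 9 and 10 are either empty or absent when not provided, that
atom conservation is a plain number (a trailing "%" is tolerated), and that
"the order in which the values are listed" means the first listed value wins.
The 0.9 score threshold is an arbitrary example value.

```python
import csv
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional

# Relation types in the order listed in this README (assumed: first wins).
RELATION_PRIORITY = ["Indirect_conversion", "Conversion", "Non_conversion"]


@dataclass
class NormalizedPair:
    chebi_id_1: str
    chebi_name_1: str
    chebi_id_2: str
    chebi_name_2: str
    relation_type: str
    max_score: float                           # field 6, 0 to 1
    pmid_count: int                            # field 7
    in_rhea: bool                              # field 8, "yes" / "no"
    atom_conservation: Optional[float] = None  # field 9 (optional)
    min_drfp_distance: Optional[float] = None  # field 10 (optional), 0 to 2


def _optional_float(fields: List[str], index: int) -> Optional[float]:
    """Return fields[index] as a float, or None if the field is absent/empty."""
    if index < len(fields) and fields[index].strip():
        # Assumption: values are plain numbers; a trailing "%" is stripped.
        return float(fields[index].strip().rstrip("%"))
    return None


def parse_row(fields: List[str]) -> NormalizedPair:
    """Map one tab-split line onto the ten documented fields."""
    return NormalizedPair(
        chebi_id_1=fields[0],
        chebi_name_1=fields[1],
        chebi_id_2=fields[2],
        chebi_name_2=fields[3],
        relation_type=fields[4],
        max_score=float(fields[5]),
        pmid_count=int(fields[6]),
        in_rhea=fields[7].strip().lower() == "yes",
        atom_conservation=_optional_float(fields, 8),
        min_drfp_distance=_optional_float(fields, 9),
    )


def read_pairs(path: str) -> Iterator[NormalizedPair]:
    """Yield one record per non-empty line (assumes no header row)."""
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quote characters in chemical names untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if fields:
                yield parse_row(fields)


def preferred_relation(types: Iterable[str]) -> str:
    """Pick the relation type that comes first in RELATION_PRIORITY."""
    return min(types, key=RELATION_PRIORITY.index)


def novel_conversions(
    pairs: Iterable[NormalizedPair], min_score: float = 0.9
) -> List[NormalizedPair]:
    """Keep 'Conversion' pairs not found in Rhea with a score >= min_score."""
    return [
        p
        for p in pairs
        if p.relation_type == "Conversion"
        and not p.in_rhea
        and p.max_score >= min_score
    ]
```

For example, novel_conversions(read_pairs("BioREx_EnzChemRED_PubMed_normalized.tsv"))
would list high-scoring conversion pairs that are not yet found in Rhea
reactions. If the file turns out to start with a header row, skip the first
line before calling parse_row.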