A machine learning framework for extracting information from biological pathway images in the literature
Creators
Description
Training and validation datasets_arrow detection.zip:
Training and validation datasets for arrow detection with a Faster R-CNN model. A total of 6,471 images were prepared: 2,332 images from five different sources and 4,139 augmented images.
Test dataset_arrow detection.zip:
Test dataset for arrow detection with a Faster R-CNN model. A total of 100 images were prepared from 89 papers retrieved from PubMed Central (PMC).
EBPI outputs.txt:
Reaction information extracted using EBPI from 49,846 biological pathway images across 466 target chemicals.
Supplementary Data 1:
Bounding box labels for 6,471 images in the training and validation datasets and 100 images in the test dataset.
Supplementary Data 2:
Dataset for text classification using BioBERT. A total of 59,370 terms were prepared by combining data from MetaCyc with the PaddleOCR results from the papers: 15,101 “gene” terms, 21,417 “protein” terms, and 22,852 “others” terms.
Supplementary Data 3:
Collection and processing of images illustrating biological pathways for 466 target chemicals from the bio-based chemicals map.
Supplementary Data 4:
Target chemicals satisfying criteria for biochemical reactions not covered by MetaNetX and KEGG.
Files
(956.2 MB)
Checksum | Size |
---|---|
md5:5071b63b36e3cfcdd6b178d1ce679565 | 200.7 MB |
md5:352040510c95ce384b7812ea381a3e0a | 2.6 MB |
md5:354718de3a03fa57aaa9f10cd96df011 | 1.3 MB |
md5:34fa6aec3a6542cd177a67c18da8529d | 128.8 kB |
md5:550e209de82f90f8fed150cc37f2dc73 | 15.0 kB |
md5:5979ae013794eaef483aebc92904d938 | 11.3 MB |
md5:a85d2a5057de84a8d7630f0eda87c9f6 | 740.1 MB |
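The MD5 checksums listed with the files can be used to confirm that a download completed without corruption. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `matches_record` are illustrative and are not part of EBPI, and the local file path is whatever name the file was saved under.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for the larger archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, expected):
    """Compare a local file against a checksum string such as 'md5:<hex>'.

    The optional 'md5:' prefix is stripped and case is ignored.
    """
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

For example, `matches_record("downloaded_file.zip", "md5:5071b63b36e3cfcdd6b178d1ce679565")` returns `True` only if the local file (a hypothetical path here) has that checksum. MD5 is adequate for detecting accidental corruption during transfer, but it is not a defense against deliberate tampering.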
Additional details
Software
- Repository URL: https://github.com/kaist-sbml/EBPI