Source Code - Clustering Semantic Predicates in the Open Research Knowledge Graph
Description
This source code and its accompanying materials implement a content-based recommender system for the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates that are semantically relevant to the given paper.
All notebooks depend on this Dataset. Please download the files and upload them to your Google Drive in order to run the notebooks.
Before running them, adapt the notebooks to your Google Drive folder and your Google Cloud Storage bucket name. You can also configure the applied clustering algorithm (agglomerative or kmeans) and the number of clusters "k".
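The settings mentioned above can be gathered in one place before running a notebook. The following is a minimal sketch only: the class name, field names, folder path and bucket name are hypothetical placeholders and are not taken from the notebooks themselves.

```python
from dataclasses import dataclass

# The two clustering algorithms named in the description.
SUPPORTED_ALGORITHMS = ("agglomerative", "kmeans")


@dataclass(frozen=True)
class NotebookConfig:
    """Hypothetical container for the values a user has to adapt."""

    drive_folder: str          # Google Drive folder holding the dataset files
    bucket_name: str           # Google Cloud Storage bucket for trained models
    clustering_algorithm: str  # "agglomerative" or "kmeans"
    n_clusters: int            # the number of clusters "k"

    def __post_init__(self):
        # Fail early on values the notebooks could not work with.
        if self.clustering_algorithm not in SUPPORTED_ALGORITHMS:
            raise ValueError(
                f"clustering_algorithm must be one of {SUPPORTED_ALGORITHMS}"
            )
        if self.n_clusters < 1:
            raise ValueError("n_clusters must be a positive integer")


# Placeholder values; replace them with your own folder and bucket.
config = NotebookConfig(
    drive_folder="/content/drive/MyDrive/orkg-predicates",
    bucket_name="my-orkg-models",
    clustering_algorithm="kmeans",
    n_clusters=1850,
)
```

Validating the values up front avoids discovering a misspelled algorithm name only after a long training run.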
scibert_embeddings.ipynb:
This notebook is responsible for representing the training and test instances as SciBERT embeddings. Its output is the files scibert_training_representations.npz and scibert_test_representations.npz, which are required for running predicates_clustering_scibert.ipynb.
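As an illustration of the .npz format used for these outputs, the sketch below writes and reads back such a file with NumPy. It is an assumption-laden example: the array keys ("embeddings", "paper_ids") and the identifiers are invented, and random vectors stand in for real SciBERT output. Only the file name and the 768-dimensional size of the base SciBERT model are grounded.

```python
import os
import tempfile

import numpy as np

EMBEDDING_DIM = 768  # hidden size of the base SciBERT model

# Random vectors stand in for real SciBERT embeddings of four papers.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, EMBEDDING_DIM)).astype(np.float32)
paper_ids = np.array(["paper_0", "paper_1", "paper_2", "paper_3"])  # invented IDs

# Write to a temporary folder so the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "scibert_training_representations.npz")

# The key names are assumptions; the real files may use different ones.
np.savez_compressed(path, embeddings=embeddings, paper_ids=paper_ids)

# Read the arrays back, as a downstream clustering notebook would.
with np.load(path) as archive:
    stored_keys = sorted(archive.files)
    loaded_embeddings = archive["embeddings"]
    loaded_ids = archive["paper_ids"]
```

Listing `archive.files` on the published files is a quick way to find the actual key names before adapting a notebook.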
predicates_clustering_scibert.ipynb:
This notebook depends on the output of scibert_embeddings.ipynb. It trains different clustering models depending on "N_CLUSTERS" using SciBERT embeddings and uploads the trained models to a specified bucket on Google Cloud Storage.
It is also responsible for downloading the trained models and evaluating them.
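The evaluation procedure is not detailed in this description, but the pre-trained models are reported with a micro-averaged F1-score. The sketch below shows one common way to compute that metric for predicate recommendation, assuming the standard multi-label definition in which true positives, false positives and false negatives are pooled over all test papers. The function name and the toy predicates are hypothetical.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over parallel lists of gold and predicted predicate sets."""
    true_pos = false_pos = false_neg = 0
    for gold_set, predicted_set in zip(gold, predicted):
        true_pos += len(gold_set & predicted_set)   # recommended and correct
        false_pos += len(predicted_set - gold_set)  # recommended but wrong
        false_neg += len(gold_set - predicted_set)  # correct but missed

    if true_pos == 0:
        return 0.0  # avoids division by zero when nothing matches

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy example with invented predicates for two test papers.
gold = [{"method", "dataset"}, {"location"}]
predicted = [{"method"}, {"location", "metric"}]
score = micro_f1(gold, predicted)
```

In the toy example the pooled counts are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and the score is 2/3.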
predicates_clustering_tfidf.ipynb:
This notebook trains different clustering models depending on "N_CLUSTERS" using TF-IDF embeddings and uploads the trained models to a specified bucket on Google Cloud Storage.
It is also responsible for downloading the trained models, evaluating them, and analyzing the constructed clusters.
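To make the clustering-based approach concrete, here is a small sketch using scikit-learn's TfidfVectorizer and KMeans. It is not the notebook's code. The toy papers and predicates are invented, and the recommendation rule (suggest the predicates used by training papers in the cluster the new paper falls into, ranked by frequency) is an assumption about how clusters are turned into recommendations.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented training data: (title and abstract text, predicates used in the graph).
TRAINING_PAPERS = [
    ("deep learning neural network image classification", ["model", "dataset"]),
    ("neural network training deep learning accuracy", ["model", "metric"]),
    ("soil moisture crop yield field experiment", ["location", "crop"]),
    ("crop yield irrigation soil field study", ["crop", "method"]),
]

N_CLUSTERS = 2  # far smaller than the published models, for illustration only

texts = [text for text, _ in TRAINING_PAPERS]
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# random_state and n_init are fixed so repeated runs give the same clusters.
model = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0).fit(features)


def recommend_predicates(text):
    """Return predicates of the cluster nearest to `text`, most frequent first."""
    cluster = model.predict(vectorizer.transform([text]))[0]
    counts = Counter()
    for label, (_, predicates) in zip(model.labels_, TRAINING_PAPERS):
        if label == cluster:
            counts.update(predicates)
    return [predicate for predicate, _ in counts.most_common()]


recommended = recommend_predicates("deep neural network for image recognition")
```

KMeans is used here because it can assign unseen papers with `predict`. scikit-learn's AgglomerativeClustering has no `predict` method, so an agglomerative variant would need an extra step to map new papers to clusters.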
Pre-trained Models:
We hereby publish 3 pre-trained clustering models with the naming format <embedding approach>_<clustering algorithm>_<number of clusters>.pkl:
- scibert_agglomerative_1450.pkl with micro-averaged F1-score 29.8%
- tfidf_kmeans_1850.pkl with micro-averaged F1-score 65.5%
- tfidf_agglomerative_1300.pkl with micro-averaged F1-score 80.4%
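The naming format above can be parsed mechanically, for example to pick the matching embedding pipeline before loading a model. The sketch below assumes none of the three parts contains an underscore, which holds for the file names listed here. The helper and type names are hypothetical.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    embedding: str   # e.g. "scibert" or "tfidf"
    algorithm: str   # e.g. "agglomerative" or "kmeans"
    n_clusters: int  # the number of clusters "k"


def parse_model_name(filename):
    """Split '<embedding approach>_<clustering algorithm>_<number of clusters>.pkl'."""
    stem, dot, extension = filename.rpartition(".")
    if dot != "." or extension != "pkl":
        raise ValueError(f"expected a .pkl file, got {filename!r}")

    parts = stem.split("_")
    if len(parts) != 3 or not parts[2].isdigit():
        raise ValueError(f"unexpected model file name: {filename!r}")

    return ModelName(parts[0], parts[1], int(parts[2]))


parsed = parse_model_name("tfidf_agglomerative_1300.pkl")
```

The .pkl extension suggests the models are Python pickles; as with any pickle, load them only from a source you trust.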
Files (3.9 GB)
MD5 checksum | Size |
---|---|
md5:8cb74f2012b9e336a2728325fa4ad1a9 | 16.4 kB |
md5:78b69facafccdd81bb1289df8114fe74 | 18.9 kB |
md5:78b6b4f5f5a14ea1bbca25f5a312536a | 104.8 kB |
md5:a026d0cd9ac7b862db16fc8fcf47961d | 6.5 kB |
md5:93fb3139eb0d866dfc0fed1eeb2be567 | 3.5 MB |
md5:5bd125f8ca0023c787d0a9c7354f6a5e | 8.9 MB |
md5:370e2465cb4b4ff7b9ebab75efe72a95 | 104.8 kB |
md5:48704714da8434ecbf3da9aa1cdea486 | 3.8 GB |