Published January 14, 2022 | Version 0.2.0
Software Open

Source Code - Clustering Semantic Predicates in the Open Research Knowledge Graph

Description

This source code and its required materials implement a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates that are semantically relevant to the given paper.

All notebooks depend on this Dataset. To run the notebooks, download the files and upload them to your Google Drive.

Also, please adapt the notebooks to your own Google Drive folder and Google Cloud Storage bucket name. You can optionally configure the applied clustering algorithm (agglomerative or kmeans) and the number of clusters "k".
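The settings mentioned above can be collected in one place before running a notebook. The following is a minimal sketch, not the notebooks' actual configuration code: the class name, field names, folder path and bucket name are all hypothetical, and only the two algorithm names and the model file naming format come from this description.

```python
from dataclasses import dataclass

# The two algorithm names come from the description; the rest is illustrative.
SUPPORTED_ALGORITHMS = ("agglomerative", "kmeans")


@dataclass(frozen=True)
class ClusteringConfig:
    drive_folder: str  # hypothetical Google Drive folder holding the dataset
    bucket_name: str   # hypothetical Google Cloud Storage bucket name
    embedding: str     # embedding approach, e.g. "scibert" or "tfidf"
    algorithm: str     # "agglomerative" or "kmeans"
    n_clusters: int    # the number of clusters "k"

    def __post_init__(self):
        # Fail early on settings the notebooks are not described as supporting.
        if self.algorithm not in SUPPORTED_ALGORITHMS:
            raise ValueError(f"unsupported algorithm: {self.algorithm!r}")
        if self.n_clusters < 1:
            raise ValueError("n_clusters must be a positive integer")

    def model_filename(self) -> str:
        # <embedding approach>_<clustering algorithm>_<number of clusters>.pkl
        return f"{self.embedding}_{self.algorithm}_{self.n_clusters}.pkl"


config = ClusteringConfig(
    drive_folder="/content/drive/MyDrive/orkg-dataset",  # placeholder path
    bucket_name="my-orkg-models",                        # placeholder name
    embedding="tfidf",
    algorithm="agglomerative",
    n_clusters=1300,
)
print(config.model_filename())  # prints "tfidf_agglomerative_1300.pkl"
```

Validating the algorithm name and "k" up front avoids discovering a typo only after a long training run.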

scibert_embeddings.ipynb:

This notebook encodes the training and test instances as SciBERT embeddings. Its output files, scibert_training_representations.npz and scibert_test_representations.npz, are required for running predicates_clustering_scibert.ipynb.
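As an illustration of the .npz hand-off between the two notebooks, the sketch below writes and reads back an embedding matrix with NumPy. It does not run SciBERT: random vectors stand in for the real embeddings, and the array key "embeddings" is an assumption, since the key names used in the published files are not stated here.

```python
import os
import tempfile

import numpy as np

# Stand-in for real SciBERT output: 4 instances with 768 dimensions each
# (768 is the hidden size of BERT-base models such as SciBERT).
rng = np.random.default_rng(0)
training_embeddings = rng.normal(size=(4, 768)).astype(np.float32)

out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "scibert_training_representations.npz")

# "embeddings" is an assumed key name, not taken from the published files.
np.savez_compressed(path, embeddings=training_embeddings)

# The clustering notebook would load the matrix back in the same way.
with np.load(path) as data:
    loaded = data["embeddings"]

print(loaded.shape)  # prints "(4, 768)"
```

To inspect the published files, `np.load(path).files` lists the keys they actually contain.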

predicates_clustering_scibert.ipynb:

This notebook depends on the output of scibert_embeddings.ipynb. It trains different clustering models depending on "N_CLUSTERS" using SciBERT embeddings and uploads the trained models to a specified bucket on Google Cloud Storage.

It is also responsible for downloading the trained models and evaluating them.
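The train, store and reload cycle described above could look roughly like the following sketch. It rests on several assumptions: scikit-learn's KMeans is used as the clustering model, "N_CLUSTERS" is treated as a list of candidate values, toy 2-D points replace the SciBERT representations, and the models are kept in an in-memory dictionary instead of being uploaded to a Google Cloud Storage bucket.

```python
import pickle

import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the SciBERT training representations: 3 separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_train = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

N_CLUSTERS = [2, 3]  # illustrative values only

# Train one model per candidate k and serialize it under the naming format
# <embedding approach>_<clustering algorithm>_<number of clusters>.pkl.
serialized_models = {}
for k in N_CLUSTERS:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    serialized_models[f"scibert_kmeans_{k}.pkl"] = pickle.dumps(model)

# Later: restore a model and assign an unseen instance to a cluster.
restored = pickle.loads(serialized_models["scibert_kmeans_3.pkl"])
new_instance = np.array([[9.5, 10.2]])
cluster_id = int(restored.predict(new_instance)[0])
print(cluster_id)
```

Pickled scikit-learn models should be loaded with the same library version they were trained with, which is worth keeping in mind when using the published .pkl files.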

predicates_clustering_tfidf.ipynb:

This notebook trains different clustering models depending on "N_CLUSTERS" using TF-IDF embeddings and uploads the trained models to a specified bucket on Google Cloud Storage.

It is also responsible for downloading the trained models, evaluating them, and analyzing the constructed clusters.
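A minimal sketch of clustering TF-IDF vectors and then inspecting the resulting clusters is shown below. It assumes scikit-learn's TfidfVectorizer and AgglomerativeClustering with default settings; the four texts are invented, and the notebook's actual preprocessing and parameters may differ.

```python
from collections import defaultdict

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy inputs standing in for paper titles and abstracts.
papers = [
    "deep neural network image classification",
    "convolutional neural network image recognition",
    "protein structure folding prediction",
    "protein folding molecular dynamics",
]

# TF-IDF yields a sparse matrix; it is converted to a dense array here
# before being passed to the agglomerative clustering estimator.
vectors = TfidfVectorizer().fit_transform(papers).toarray()
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)

# Group the inputs by cluster label to inspect what each cluster contains.
clusters = defaultdict(list)
for text, label in zip(papers, labels):
    clusters[int(label)].append(text)

for label, members in sorted(clusters.items()):
    print(label, members)
```

On this toy input the two machine learning texts and the two protein texts end up in separate clusters, although the numeric label assigned to each cluster is arbitrary.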

Pre-trained Models:

We hereby publish two pre-trained clustering models with the naming format <embedding approach>_<clustering algorithm>_<number of clusters>.pkl:

  1. scibert_kmeans_2050.pkl with a micro-averaged F1-score of 72.6%
  2. tfidf_agglomerative_1300.pkl with a micro-averaged F1-score of 80.4%
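For readers unfamiliar with the metric, micro-averaging pools true positives, false positives and false negatives over all test instances before computing precision and recall. The sketch below assumes that evaluation compares, per paper, a set of recommended predicates against a set of expected predicates; this is an illustrative setup with invented predicate names and may not match the notebooks' evaluation code exactly.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over paired collections of label sets."""
    tp = fp = fn = 0
    for expected, recommended in zip(gold, predicted):
        tp += len(expected & recommended)   # correctly recommended
        fp += len(recommended - expected)   # recommended but not expected
        fn += len(expected - recommended)   # expected but not recommended
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented predicate names for two hypothetical papers.
gold = [{"method", "dataset"}, {"result"}]
predicted = [{"method", "metric"}, {"result", "dataset"}]

score = micro_f1(gold, predicted)
print(round(score, 4))  # prints 0.5714 (tp=2, fp=2, fn=1, so F1 = 4/7)
```

Because the counts are pooled, papers with many predicates influence the score more than papers with few.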

Files (12.5 MB)

Checksum  Size
md5:8cb74f2012b9e336a2728325fa4ad1a9  16.4 kB
md5:78b69facafccdd81bb1289df8114fe74  18.9 kB
md5:a026d0cd9ac7b862db16fc8fcf47961d  6.5 kB
md5:93fb3139eb0d866dfc0fed1eeb2be567  3.5 MB
md5:5bd125f8ca0023c787d0a9c7354f6a5e  8.9 MB
md5:370e2465cb4b4ff7b9ebab75efe72a95  104.8 kB