Source Code - Clustering Semantic Predicates in the Open Research Knowledge Graph

Arab Oghli, Omar

doi:10.5281/zenodo.6973678

Published January 14, 2022 | Version 0.2.0

Software Open

Contributors

Supervisors:

This source code and its required materials implement a content-based recommender system in the context of the Open Research Knowledge Graph (ORKG). The recommender system accepts a research paper's title and abstract as input and recommends existing ORKG predicates that are semantically relevant to the given paper.
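The following is a minimal sketch of how a cluster-based recommender of this kind could work. It is not the notebooks' implementation. It assumes that each cluster has a centroid vector and a list of predicates collected from its member papers, and that a new paper is assigned to the most similar centroid. The function names, the toy vectors, and the predicate labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend_predicates(paper_vector, centroids, cluster_predicates):
    """Return the predicates of the cluster whose centroid is most similar to the paper.

    Assumption: cluster_predicates[i] holds the predicates attached to cluster i.
    """
    best = max(range(len(centroids)), key=lambda i: cosine(paper_vector, centroids[i]))
    return cluster_predicates[best]


# Toy data: two clusters in a two-dimensional embedding space (hypothetical values).
centroids = [[1.0, 0.0], [0.0, 1.0]]
cluster_predicates = [["has method", "has dataset"], ["has research problem"]]

print(recommend_predicates([0.9, 0.1], centroids, cluster_predicates))
```

In practice the paper vector would come from the embedding step (SciBERT or TF-IDF) applied to the title and abstract.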

All notebooks depend on this Dataset. To run the notebooks, download its files and upload them to your Google Drive.

Before running, adapt the notebooks to your Google Drive folder and your Google Cloud Storage bucket name. You can also configure the clustering algorithm (agglomerative or k-means) and the number of clusters "k".

scibert_embeddings.ipynb:

This notebook represents the training and test instances as SciBERT embeddings. It outputs the files scibert_training_representations.npz and scibert_test_representations.npz, which are required for running predicates_clustering_scibert.ipynb.
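The names of the arrays stored inside the .npz files are not documented here. Since an .npz file is a ZIP archive with one .npy member per array, the names can be inspected with the standard library alone before loading anything. The sketch below assumes the file has been downloaded to the working directory; the helper name list_npz_arrays is hypothetical.

```python
import os
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in an .npz archive without loading the arrays."""
    with zipfile.ZipFile(path) as archive:
        return [
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        ]


if __name__ == "__main__":
    # Assumed local path; adjust to wherever the file was downloaded.
    path = "scibert_training_representations.npz"
    if os.path.exists(path):
        print(list_npz_arrays(path))
```

Once the names are known, the arrays themselves can be read with numpy.load(path)[name].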

predicates_clustering_scibert.ipynb:

This notebook depends on the output of scibert_embeddings.ipynb. It trains different clustering models depending on "N_CLUSTERS" using SciBERT embeddings and uploads the trained models to a specified bucket on Google Cloud Storage.

It is also responsible for downloading the trained models and evaluating them.

predicates_clustering_tfidf.ipynb:

This notebook trains different clustering models depending on "N_CLUSTERS" using TF-IDF embeddings and uploads the trained models to a specified bucket on Google Cloud Storage.

It is also responsible for downloading the trained models, evaluating them, and analyzing the constructed clusters.
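To illustrate what a TF-IDF representation looks like, here is a small standard-library sketch using the textbook weighting tf × log(N / df). It is an assumption for illustration only: the notebook may use a library vectorizer with different tokenization, smoothing, or normalization. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using tf * log(N / df) with whitespace tokenization."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append([
            (counts[term] / total) * math.log(n_docs / doc_freq[term]) if total else 0.0
            for term in vocab
        ])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["graph clustering", "graph embeddings"])
print(vocab)
print(vectors)
```

Terms that occur in every document (here "graph") receive weight zero, so the vectors emphasize the terms that distinguish one paper from another.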

Pre-trained Models:

We hereby publish 2 pre-trained clustering models with the naming format <embedding approach>_<clustering algorithm>_<number of clusters>.pkl:

scibert_kmeans_2050.pkl with a micro-averaged F1-score of 72.6%
tfidf_agglomerative_1300.pkl with a micro-averaged F1-score of 80.4%
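For readers unfamiliar with the metric, the sketch below shows how a micro-averaged F1-score is computed over sets of recommended predicates: true positives, false positives, and false negatives are summed over all papers before precision and recall are calculated. The exact evaluation protocol of the notebooks is not described here, so the function and the toy labels are illustrative assumptions.

```python
def micro_f1(true_sets, predicted_sets):
    """Micro-averaged F1 over paired sets of true and predicted labels."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_sets, predicted_sets):
        tp += len(truth & predicted)   # correctly recommended predicates
        fp += len(predicted - truth)   # recommended but not expected
        fn += len(truth - predicted)   # expected but not recommended
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical predicate labels for two papers.
truth = [{"a", "b"}, {"c"}]
predicted = [{"a"}, {"c", "d"}]
print(micro_f1(truth, predicted))
```

In this toy example there are 2 true positives, 1 false positive, and 1 false negative, so precision and recall are both 2/3.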

Files (12.5 MB)

Name	MD5	Size
predicates_clustering_scibert.ipynb	8cb74f2012b9e336a2728325fa4ad1a9	16.4 kB
predicates_clustering_tfidf.ipynb	78b69facafccdd81bb1289df8114fe74	18.9 kB
scibert_embeddings.ipynb	a026d0cd9ac7b862db16fc8fcf47961d	6.5 kB
scibert_test_representations.npz	93fb3139eb0d866dfc0fed1eeb2be567	3.5 MB
scibert_training_representations.npz	5bd125f8ca0023c787d0a9c7354f6a5e	8.9 MB
tfidf_agglomerative_1300.pkl	370e2465cb4b4ff7b9ebab75efe72a95	104.8 kB
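The MD5 checksums listed above can be used to confirm that a download is complete and unmodified. The sketch below assumes the files were saved under their original names in the working directory; the helper names are hypothetical, and only two of the listed checksums are included for brevity.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Checksums copied from the file listing.
EXPECTED_MD5 = {
    "tfidf_agglomerative_1300.pkl": "370e2465cb4b4ff7b9ebab75efe72a95",
    "scibert_training_representations.npz": "5bd125f8ca0023c787d0a9c7354f6a5e",
}

if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if os.path.exists(name):
            status = "OK" if md5_of_file(name) == expected else "MISMATCH"
            print(f"{name}: {status}")
```

Verifying the .pkl file before unpickling is especially worthwhile, because loading a pickle executes code embedded in the file.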

	All versions	This version
Views	138	51
Downloads	152	62
Data volume	46.4 GB	130.1 MB
