Exploring the Impact of Negative Sampling on Patent Citation Recommendation

Rima Dessi

doi:10.5281/zenodo.7870197

Published April 27, 2023 | Version v1

Dataset Open

Rima Dessi¹

1. FIZ - Karlsruhe

pcr_patents.csv is the main dataset. It was generated by randomly sampling patents from Google Patents with a Python library. The dataset comprises around 250,000 US patents together with their titles, abstracts, and citations. Each patent has about 27 citations on average.

The zip file contains three datasets for training and testing patent citation recommendation systems, all derived from the main dataset. They consist of around 1 million instances, comprising both positive and negative samples.

pcr_cpc_negative_sample_data.csv consists of negative samples that were generated based on CPC subclass codes.
pcr_random_negative_sample_data.csv consists of negative samples that were generated randomly.
pcr_sem_sim_negative_sample_data_2.csv consists of negative samples that were generated based on nearest-neighbor relations.
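The three sampling strategies above can be illustrated with a minimal sketch on toy data. The record does not describe the exact procedures, so everything here is an assumption: the function names, the toy patent IDs, the choice to draw CPC-based negatives from the same subclass as the citing patent, and the use of cosine similarity over embedding vectors for the nearest-neighbor variant are all hypothetical and are not taken from the dataset or from the author's code.

```python
import math
import random


def random_negatives(patent_id, cited, all_ids, k, rng):
    """Sample up to k patents that patent_id does not cite, uniformly at random."""
    candidates = [p for p in all_ids if p != patent_id and p not in cited]
    return rng.sample(candidates, min(k, len(candidates)))


def cpc_negatives(patent_id, cited, cpc_subclass, k, rng):
    """Sample up to k non-cited patents sharing patent_id's CPC subclass.

    Assumption: negatives come from the SAME subclass (harder negatives).
    """
    target = cpc_subclass[patent_id]
    candidates = [
        p for p, code in cpc_subclass.items()
        if code == target and p != patent_id and p not in cited
    ]
    return rng.sample(candidates, min(k, len(candidates)))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor_negatives(patent_id, cited, embeddings, k):
    """Return the k non-cited patents most similar to patent_id.

    Assumption: similarity is cosine similarity between text embeddings.
    """
    query = embeddings[patent_id]
    candidates = [p for p in embeddings if p != patent_id and p not in cited]
    candidates.sort(key=lambda p: cosine(query, embeddings[p]), reverse=True)
    return candidates[:k]


# Hypothetical toy data: five patents, where patent "A" cites patent "B".
cpc_subclass = {"A": "G06F", "B": "G06F", "C": "G06F", "D": "H04L", "E": "A61K"}
embeddings = {
    "A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.8, 0.2],
    "D": [0.0, 1.0], "E": [-1.0, 0.0],
}

if __name__ == "__main__":
    rng = random.Random(0)
    cited_by_a = {"B"}
    print(random_negatives("A", cited_by_a, list(cpc_subclass), 2, rng))
    print(cpc_negatives("A", cited_by_a, cpc_subclass, 2, rng))
    print(nearest_neighbor_negatives("A", cited_by_a, embeddings, 2))
```

In this sketch, random negatives are the easiest to tell apart from true citations, while same-subclass and nearest-neighbor negatives are topically closer to the citing patent and therefore harder.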

Files (6.7 GB)

Name	MD5	Size
patent_citation_recommendation_training_data.zip	168eab8914e9a0cc4d4e15eaf7eac6f1	6.0 GB
pcr_patents.csv	a702d2616476c2d1946c1dfc04c87955	649.9 MB
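Because the downloads are large, it can be worth checking them against the MD5 checksums listed above. The following is a minimal sketch using Python's standard hashlib module. The helper names md5_of_file and verify_download are hypothetical, and the sketch assumes the files were saved to the working directory under their original names.

```python
import hashlib
import os

# Checksums as listed in the file table of this record.
EXPECTED_MD5 = {
    "patent_citation_recommendation_training_data.zip": "168eab8914e9a0cc4d4e15eaf7eac6f1",
    "pcr_patents.csv": "a702d2616476c2d1946c1dfc04c87955",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the listed checksum for its name."""
    name = os.path.basename(path)
    if name not in expected:
        raise KeyError(f"no checksum listed for {name}")
    return md5_of_file(path) == expected[name]


if __name__ == "__main__":
    for name in EXPECTED_MD5:
        if os.path.exists(name):
            print(name, "OK" if verify_download(name) else "MISMATCH")
        else:
            print(name, "not found")
```

Reading in chunks keeps memory use constant, which matters for the 6.0 GB archive.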

	All versions	This version
Views	394	394
Downloads	50	50
Data volume	187.4 GB	187.4 GB
