Published April 27, 2023 | Version v1
Dataset Open

Exploring the Impact of Negative Sampling on Patent Citation Recommendation

Creators

  • 1. FIZ - Karlsruhe

Description

  • pcr_patents.csv is the main dataset, generated by randomly sampling patents from Google Patents using a Python library. It comprises around 250,000 US patents together with their titles, abstracts, and citations. Each patent has roughly 27 citations on average.

The zip file contains three different datasets for training and testing patent citation recommendation systems. These datasets were derived from the main dataset. They consist of around 1 million instances, comprising both positive and negative samples.

  • pcr_cpc_negative_sample_data.csv consists of negative samples that were generated based on CPC subclass codes.

  • pcr_random_negative_sample_data.csv consists of negative samples that were generated randomly. 

  • pcr_sem_sim_negative_sample_data_2.csv consists of negative samples that were generated based on nearest-neighbor relations.
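
The three negative sampling strategies listed above can be illustrated with a minimal sketch on toy data. This is not the procedure used to build the files: the patent identifiers, CPC codes, vectors, and function names are all hypothetical, and two points are assumptions. First, CPC-based negatives are taken here to be non-cited patents that share the query's CPC subclass. Second, nearest-neighbor negatives are taken to be the non-cited patents closest to the query in some vector space.

```python
import math
import random

# Toy data: every identifier, CPC code, vector and citation is made up.
patents = {
    "P1": {"cpc": "G06F", "vec": (0.0, 0.0), "cites": {"P2"}},
    "P2": {"cpc": "G06F", "vec": (0.1, 0.0), "cites": set()},
    "P3": {"cpc": "G06F", "vec": (1.0, 1.0), "cites": set()},
    "P4": {"cpc": "H04L", "vec": (0.2, 0.1), "cites": set()},
    "P5": {"cpc": "A61K", "vec": (5.0, 5.0), "cites": set()},
}

def candidates(query_id):
    """Patents that are neither the query itself nor cited by it."""
    cited = patents[query_id]["cites"]
    return [p for p in patents if p != query_id and p not in cited]

def random_negatives(query_id, k, rng):
    """Draw up to k non-cited patents uniformly at random."""
    pool = candidates(query_id)
    return rng.sample(pool, min(k, len(pool)))

def cpc_negatives(query_id, k, rng):
    """Draw up to k non-cited patents from the query's CPC subclass (assumption)."""
    subclass = patents[query_id]["cpc"]
    pool = [p for p in candidates(query_id) if patents[p]["cpc"] == subclass]
    return rng.sample(pool, min(k, len(pool)))

def nn_negatives(query_id, k):
    """Take the k non-cited patents nearest to the query vector (assumption)."""
    q = patents[query_id]["vec"]
    pool = candidates(query_id)
    return sorted(pool, key=lambda p: math.dist(q, patents[p]["vec"]))[:k]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
rand_neg = random_negatives("P1", 2, rng)
cpc_neg = cpc_negatives("P1", 1, rng)
nn_neg = nn_negatives("P1", 1)
```

In this toy setup the cited patent P2 is never returned as a negative. The CPC-based and nearest-neighbor strategies pick candidates that resemble the query, which generally yields harder negatives than uniform random sampling.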

Files

patent_citation_recommendation_training_data.zip

Files (6.7 GB)

Size and checksum of each file:
6.0 GB, md5:168eab8914e9a0cc4d4e15eaf7eac6f1
649.9 MB, md5:a702d2616476c2d1946c1dfc04c87955
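
Because the files are several gigabytes in size, it can be worth checking a download against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module; the function names `md5_of_file` and `matches_checksum` are illustrative and not part of this record. The file is read in chunks so that it never has to fit in memory.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_checksum(path, expected):
    """Compare a file's MD5 with an expected value such as 'md5:ab12...'."""
    expected_hex = expected.lower().split(":")[-1]
    return md5_of_file(path) == expected_hex
```

For example, `matches_checksum(local_path, "md5:168eab8914e9a0cc4d4e15eaf7eac6f1")` returns `True` only if the file at `local_path` (a placeholder for wherever the download was saved) has that digest.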