Exploring the Impact of Negative Sampling on Patent Citation Recommendation
Description
- pcr_patents.csv is the main dataset, built by randomly sampling patents from Google Patents with a Python library. It comprises around 250,000 US patents together with their titles, abstracts, and citations. Each patent has about 27 citations on average.
The zip file contains three datasets for training and testing patent citation recommendation systems, all generated from the main dataset. They consist of around 1 million instances, both positive and negative samples.
- pcr_cpc_negative_sample_data.csv contains negative samples generated based on CPC subclass codes.
- pcr_random_negative_sample_data.csv contains negative samples generated randomly.
- pcr_sem_sim_negative_sample_data_2.csv contains negative samples generated based on nearest-neighbor relations.
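The three negative sampling strategies named above can be illustrated on a toy example. This is only a minimal sketch, not the procedure used to build the released files. The in-memory `patents` dictionary, its field names (`cpc`, `vec`, `cites`), and the embedding vectors are all hypothetical. It also assumes that CPC-based negatives share the query's CPC subclass and that nearest-neighbor negatives are the most similar non-cited patents by cosine similarity; the description does not confirm either detail.

```python
import math
import random

# Toy stand-in for the main dataset; field names and values are hypothetical.
patents = {
    "P1": {"cpc": "G06F", "vec": (1.0, 0.0), "cites": {"P2"}},
    "P2": {"cpc": "G06F", "vec": (0.9, 0.1), "cites": set()},
    "P3": {"cpc": "G06F", "vec": (0.8, 0.3), "cites": set()},
    "P4": {"cpc": "H04L", "vec": (0.0, 1.0), "cites": set()},
    "P5": {"cpc": "A61K", "vec": (0.7, 0.7), "cites": set()},
}


def candidates(query_id):
    """Patents that are neither the query itself nor cited by it."""
    cited = patents[query_id]["cites"]
    return sorted(p for p in patents if p != query_id and p not in cited)


def random_negatives(query_id, k, rng):
    """Draw up to k negatives uniformly from all non-cited patents."""
    pool = candidates(query_id)
    return rng.sample(pool, min(k, len(pool)))


def cpc_negatives(query_id, k, rng):
    """Draw up to k negatives from non-cited patents in the same CPC subclass
    (assumption: 'based on CPC subclass codes' means same-subclass negatives)."""
    subclass = patents[query_id]["cpc"]
    pool = [p for p in candidates(query_id) if patents[p]["cpc"] == subclass]
    return rng.sample(pool, min(k, len(pool)))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor_negatives(query_id, k):
    """Take the k non-cited patents most similar to the query
    (assumption: similarity is cosine similarity over text embeddings)."""
    query_vec = patents[query_id]["vec"]
    ranked = sorted(
        candidates(query_id),
        key=lambda p: cosine(query_vec, patents[p]["vec"]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    rng = random.Random(0)
    print("random:", random_negatives("P1", 2, rng))
    print("cpc:", cpc_negatives("P1", 2, rng))
    print("nearest neighbor:", nearest_neighbor_negatives("P1", 2))
```

Under these assumptions, random negatives are the easiest to tell apart from true citations, while same-subclass and nearest-neighbor negatives are closer to the query and therefore harder for a recommender to distinguish.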
Files
patent_citation_recommendation_training_data.zip
Total size: 6.7 GB
Checksum | Size |
---|---|
md5:168eab8914e9a0cc4d4e15eaf7eac6f1 | 6.0 GB |
md5:a702d2616476c2d1946c1dfc04c87955 | 649.9 MB |
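Because the files are several gigabytes in size, it can be worth checking a download against the md5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_checksum` are hypothetical, and the local path in the usage comment is a placeholder, since the listing does not show which file name belongs to which checksum.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's md5 against a value such as 'md5:168e...' or '168e...'."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


# Usage (placeholder path, substitute the file you actually downloaded):
# matches_checksum("path/to/downloaded_file", "md5:168eab8914e9a0cc4d4e15eaf7eac6f1")
```

A `False` result usually indicates an incomplete or corrupted download, or that the checksum was compared against the wrong file.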