There is a newer version of the record available.

Published March 14, 2020 | Version 0.15
Dataset Open

PatCit: A Comprehensive Dataset of Patent Citations

  • 1. École Polytechnique Fédérale de Lausanne
  • 2. Collège de France

Description

PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]

Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.

It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non-Patent Literature (NPL) citations and in the patent text itself.

Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long-hidden information. That is the purpose of this project.

IN PRACTICE

A detailed presentation of the current state of the project is available in our March 2020 presentation.

So far, we have:

  1. classified the 40 million NPL citations reported in the DOCDB database into 9 distinct research-oriented classes with 90% accuracy.
  2. parsed and consolidated the 27 million NPL citations classified as bibliographical references.
  3. extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all-time USPTO patents.

The latest version of the dataset is v0.15. It combines v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.

Give it a try! The dataset is publicly available on Google Cloud BigQuery.

FEATURES

Open

  • The code is licensed under MIT-2 and the dataset under CC4, two highly permissive licenses.
  • The project is designed to be continuously improved by and for the community. Anyone should feel free to open discussions, raise issues, request features and contribute to the project.

Comprehensive

  • We address worldwide patents, as long as the data is available.
  • We address all classes of citations, not only bibliographical references.
  • We address front-page and in-text citations.

Highest standards

  • We use and implement state-of-the-art machine learning solutions.
  • We take great care to implement only the most efficient solutions. We believe that computational resources should be used sparingly, for both the environmental sustainability and the long-term financial sustainability of the project.

 

Files

README.md

Files (15.6 GB)

  • md5:627eae2354f6fec89017144443206e85 (8.8 GB)
  • md5:0f500f8f7123e400782a11e340b1df02 (6.9 GB)
  • md5:fc34be3b9ea9cbd791c8107fb08dfa37 (1.4 kB)
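
The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal Python sketch, assuming the files have already been saved locally. The file names do not appear in this listing, so any path passed in is the user's own, and the helper names `md5_of_file` and `matches_record` are hypothetical, introduced only for illustration.

```python
import hashlib

# Checksums copied from the file listing of this record.
# The listing does not show file names, so only the digests are stored.
EXPECTED_MD5 = {
    "627eae2354f6fec89017144443206e85",
    "0f500f8f7123e400782a11e340b1df02",
    "fc34be3b9ea9cbd791c8107fb08dfa37",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use flat even for multi-gigabyte files.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path):
    """Return True if the file's MD5 is one of the digests listed above."""
    return md5_of_file(path) in EXPECTED_MD5
```

Calling `matches_record("path/to/downloaded_file")` on each downloaded file (the path is a placeholder) should return `True` when the download is intact. MD5 is used here only because it is the checksum the record publishes; it guards against accidental corruption, not against deliberate tampering.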