PatCit: A Comprehensive Dataset of Patent Citations
Creators
- 1. École Polytechnique Fédérale de Lausanne
- 2. Collège de France
Description
PATCIT: A Comprehensive Dataset of Patent Citations [Website, Newsletter, GitHub]
Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.
It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non-Patent Literature (NPL) citations and in the patent text itself.
Good news: Natural Language Processing (NLP) tools now enable social scientists to excavate and structure this long-hidden information. That is the purpose of this project.
IN PRACTICE
A detailed presentation of the current state of the project is available in our March 2020 presentation.
So far, we have:
- classified the 40 million NPL citations reported in the DOCDB database into 9 distinct research-oriented classes, with a 90% accuracy rate.
- parsed and consolidated the 27 million NPL citations classified as bibliographical references.
- extracted, parsed and consolidated in-text bibliographical references and patent citations from the body of all-time USPTO patents.
The latest version of the dataset is v0.15. It consists of v0.1 of the US contextual citations dataset and v0.2 of the front-page NPL citations dataset.
Give it a try! The dataset is publicly available on Google Cloud BigQuery.
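As a rough illustration of how the BigQuery copy could be explored from Python, here is a minimal sketch. It assumes the `google-cloud-bigquery` client library is installed and that Google Cloud credentials are configured. The helper names `build_preview_query` and `run_preview` are hypothetical, and the table identifier is a placeholder: the actual project, dataset and table names are not given here and must be looked up in the project documentation.

```python
import re

# Accepts fully qualified identifiers of the form project.dataset.table.
_TABLE_ID = re.compile(r"^[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+$")


def build_preview_query(table_id: str, limit: int = 10) -> str:
    """Build a small preview query for a fully qualified BigQuery table.

    `table_id` is supplied by the caller; no PatCit table name is assumed.
    """
    if not _TABLE_ID.match(table_id):
        raise ValueError(f"expected project.dataset.table, got {table_id!r}")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"SELECT * FROM `{table_id}` LIMIT {limit}"


def run_preview(table_id: str, limit: int = 10) -> list:
    """Run the preview query and return the rows as plain dictionaries."""
    # Imported lazily so that the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.query(build_preview_query(table_id, limit)).result()
    return [dict(row.items()) for row in rows]
```

Calling `run_preview("<project>.<dataset>.<table>")` with the real identifier substituted in would return the first ten rows. Keeping `limit` small avoids scanning more data than needed.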
FEATURES
Open
- The code is licensed under MIT-2 and the dataset under CC4, two highly permissive licenses.
- The project is designed to be continuously improved by and for the community. Anyone should feel free to open discussions, raise issues, request features and contribute to the project.
Comprehensive
- We address worldwide patents, as long as the data is available.
- We address all classes of citations, not only bibliographical references.
- We address front-page and in-text citations.
Highest standards
- We use and implement state-of-the-art machine learning solutions.
- We take great care to implement only the most efficient solutions. We believe that computational resources should be used sparingly, for both the environmental sustainability and the long-term financial sustainability of the project.
Files
README.md
Files (15.6 GB)
Checksum | Size |
---|---|
md5:627eae2354f6fec89017144443206e85 | 8.8 GB |
md5:0f500f8f7123e400782a11e340b1df02 | 6.9 GB |
md5:fc34be3b9ea9cbd791c8107fb08dfa37 | 1.4 kB |
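The MD5 checksums listed above can be used to check that a download completed without corruption. The following is a minimal sketch using only the Python standard library. The function names `md5_of_file` and `matches_listed_checksum` are illustrative, and the local file path is up to the reader, since file names are not listed here. The file is read in chunks so that multi-gigabyte archives do not need to fit in memory.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read chunk by chunk."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path, listed: str) -> bool:
    """Compare a file's MD5 digest with a listed value such as 'md5:627e...'."""
    expected = listed.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum(local_path, "md5:627eae2354f6fec89017144443206e85")` should return `True` for an intact copy of the 8.8 GB file, where `local_path` is wherever that file was saved.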