PatCit: A Comprehensive Dataset of Patent Citations

Cyril Verluise; Gabriele Cristelli; Kyle Higham; Lucas Violon; Gaétan de Rassenfosse

doi:10.5281/zenodo.4391095

Published December 23, 2020 | Version 0.3.1

Dataset Open

1. Collège de France
2. École Polytechnique Fédérale de Lausanne
3. Hitotsubashi University
4. HEC Paris

patCit: A Comprehensive Dataset of Patent Citations [Newsletter, GitHub]

Patents are at the crossroads of many innovation nodes: science, industry, products, competition, etc. Such interactions can be identified through citations in a broad sense.

It is now common to use front-page patent citations to study some aspects of the innovation system. However, there is much more buried in the Non-Patent Literature (NPL) citations and in the patent text itself. patCit extracts and structures these citations.

Want to know more? Read the patCit academic presentation or dive into the usage and technical guides on the patCit documentation website.

IN PRACTICE

At patCit, we are building a comprehensive dataset of patent citations to help the community explore this terra incognita. patCit has the following features:

global coverage
front-page and in-text citations
all categories of NPL documents

Front-page

patCit builds on DOCDB, the largest database of Non-Patent Literature (NPL) citations. First, we deduplicate this corpus and organize it into 10 categories (bibliographical reference, database, norm & standard, etc.). Then, we design and apply category-specific information extraction models using spaCy. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases (e.g. Crossref for bibliographical references).
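The deduplication step can be pictured with a small sketch. The normalization rule used here (accent folding, lowercasing, dropping punctuation) and the function names are assumptions made for illustration; the description above does not say how patCit actually deduplicates DOCDB citations.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_citation(raw):
    """Build a comparison key: fold accents, lowercase, drop punctuation."""
    folded = unicodedata.normalize("NFKD", raw)
    ascii_only = folded.encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with a single space.
    return re.sub(r"[^a-z0-9]+", " ", ascii_only.lower()).strip()


def deduplicate(citations):
    """Group raw citation strings that share the same normalized key."""
    groups = defaultdict(list)
    for raw in citations:
        groups[normalize_citation(raw)].append(raw)
    return dict(groups)


if __name__ == "__main__":
    # Made-up citation strings, for illustration only.
    sample = [
        "Smith J., Nature, vol. 12 (1999)",
        "SMITH J, Nature vol 12, 1999",
        "ISO 9001:2015",
    ]
    for key, variants in deduplicate(sample).items():
        print(key, "<-", len(variants), "variant(s)")
```

Exact-key grouping of this kind only catches formatting variants; near-duplicates with typos or reordered fields would need fuzzy matching.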

In-text

patCit builds on the Google Patents corpus of USPTO full-text patents. First, we extract patent and bibliographical reference citations. Then, we parse the detected in-text citations into a series of category-dependent attributes using grobid. Patent citations are matched with a standard publication number using the Google Patents matching API, and bibliographical references are matched with a DOI using biblio-glutton. Finally, when possible, we enrich the data using external, domain-specific, high-quality databases (e.g. Crossref for bibliographical references).
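To give a feel for what in-text patent citation detection involves, here is a deliberately simplified regular-expression sketch. It is not the patCit pipeline, which relies on grobid; the pattern, the function name and the `US` plus digits output form are assumptions for illustration and do not correspond to the standard publication numbers returned by the Google Patents matching API.

```python
import re

# Matches forms such as "U.S. Pat. No. 5,123,456" or "US Patent No. 6,789,012".
US_PATENT_PATTERN = re.compile(
    r"U\.?\s?S\.?\s+Pat(?:ent)?\.?\s+Nos?\.?\s*(\d{1,2}(?:,\d{3}){2})(?!\d)",
    flags=re.IGNORECASE,
)


def find_us_patent_citations(text):
    """Return cited US patent numbers as 'US' followed by the bare digits."""
    return [
        "US" + match.group(1).replace(",", "")
        for match in US_PATENT_PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    # Made-up sentence, for illustration only.
    sentence = (
        "The method described in U.S. Pat. No. 5,123,456 was later "
        "extended in US Patent No. 6,789,012."
    )
    print(find_us_patent_citations(sentence))
```

A pattern like this misses many real-world variants (foreign patents, application numbers, lists of numbers after a single "Nos."), which is one reason a trained parser is a better fit at scale.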

FAIR

Find - The patCit dataset is available on BigQuery in an interactive environment. For those who have a smattering of SQL, this is the perfect place to explore the data. It can also be downloaded from Zenodo.

Interoperate - Interoperability is at the core of patCit's ambition. We take care to extract unique identifiers whenever possible, to enable data enrichment with domain-specific, high-quality databases. These include the DOI, PMID and PMCID for bibliographical references, the Technical Doc Number for standards, the Accession Number for genetic databases, the publication number for PATSTAT and Claims, etc. See the specific table for more details.

Reproduce - Our GitHub repository is the project factory. You can learn more about data recipes and models on the patCit documentation website.
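The interoperability point above boils down to joining on a shared identifier. The sketch below joins citation records to an external metadata source on the DOI. The record layout (plain dicts with a `doi` key) and the `external` output field are hypothetical; check the patCit table schemas for the actual field names. The only fact relied on is that DOIs are case-insensitive and often appear with a URL or `doi:` prefix.

```python
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip a leading resolver URL or 'doi:' prefix."""
    cleaned = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return cleaned


def enrich_by_doi(citations, external):
    """Left-join citation records with external records on normalized DOI."""
    lookup = {
        normalize_doi(row["doi"]): row for row in external if row.get("doi")
    }
    enriched = []
    for citation in citations:
        doi = citation.get("doi")
        match = lookup.get(normalize_doi(doi)) if doi else None
        # Keep every citation; 'external' is None when nothing matches.
        enriched.append({**citation, "external": match})
    return enriched
```

A left join keeps citations without a match, so coverage of the external source can be measured afterwards by counting the records whose `external` value is `None`.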

Files (16.5 GB)

Name	MD5	Size
frontpage_allmeta.tar	7bc7f660e1e6493ae47fa576c555de20	4.0 GB
frontpage_bibliographicalreference.tar	71b96c8b2f6473cc0827ef2cdc9d7e19	3.1 GB
frontpage_database.tar	5aca8ba24cdcf4283f7b1edcd0d702fa	18.0 MB
frontpage_normstandard.tar	a8592e6c0578887be491f690e9dc8e1b	44.1 MB
frontpage_wiki.tar	91d4d4b2120034b305b074a66333e989	6.3 MB
intext_bibliographicalreference.tar	37792f3cb02a5154f286e9394c75813b	3.3 GB
intext_patent_csv.tar	93641563e5563ba7c80f676145965086	2.8 GB
intext_patent_jsonl.tar	b79d9eb7f5e62fa09cdc2d70ff4c7096	3.3 GB
README.md	5f9dcbaca55f5cbd80872c6f800c54bb	1.9 kB
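After downloading an archive, it is worth checking it against the MD5 checksum listed above before reading it. The sketch below does both using only the Python standard library. The local file path and the function names are assumptions, and so is the internal layout of the archive: the reader assumes that its members are plain UTF-8 JSON Lines files, which the listing does not confirm.

```python
import hashlib
import io
import json
import tarfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True when the file's MD5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


def iter_jsonl_records(tar_path):
    """Yield one parsed object per non-empty line of each archive member.

    Assumes uncompressed UTF-8 JSON Lines members (not confirmed).
    """
    with tarfile.open(tar_path, mode="r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            for line in io.TextIOWrapper(handle, encoding="utf-8"):
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    path = "intext_patent_jsonl.tar"  # assumed local download location
    if verify_download(path, "b79d9eb7f5e62fa09cdc2d70ff4c7096"):
        for record in iter_jsonl_records(path):
            print(record)
            break  # show the first record only
```

Reading in chunks and streaming members straight from the archive keeps memory use flat, which matters for the multi-gigabyte files in this record.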
