Published March 31, 2022 | Version 3.3.0
Dataset Open

patccat: A classifier for patent claims

  • 1. University of Mannheim
  • 2. Wake Forest University
  • 3. Southern Methodist University

Description

Data version: 3.3.0

Authors:
Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
W. Keith Robinson (Wake Forest University, School of Law)
Michael Seeligson (Southern Methodist University, Cox School of Business)


1. Notes on Data Construction
2. Citation and Code
3. Description of the Data Files
3.1. File List
3.2. List of Variables for Files with Claim-Level Information
3.3. List of Variables for Files with Patent-Level Information
4. Coming Soon!


1. Notes on Data Construction

This is version 3.3.0 of the patccat data (patent claim classification by algorithmic text analysis).

Patent claims define an invention. A patent application is required to have one or more claims that distinctly claim the subject matter which the patent applicant regards as her invention or discovery. We construct a classifier of patent claims that identifies three distinct claim types: process claims, product claims, and product-by-process claims.

For this classification, we combine information obtained from both the preamble and the body of a claim. The preamble is a general description of the invention (e.g., a method, an apparatus, or a device), whereas the body identifies the steps and elements that the applicant is claiming as the invention, specifying in detail what the preamble lays out. Combining the preamble type and the body type gives us a more detailed and more accurate classification of claims than other approaches in the literature, and it also accounts for unconventional drafting approaches. Finally, we validate our classification using close to 10,000 manually classified claims.

The data files contain the results of our classification. We provide claim-level information for each independent claim of U.S. utility patents granted between 1836 and 2020. We also provide patent-level information, i.e., the counts of different claim types for a given patent.

For a detailed description of our classification approach, please take a look at the accompanying paper (Ganglmair, Robinson, and Seeligson 2022).

2. Citation and Code

Please cite the following paper when using the data in your own work:

Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.

In the paper, we document the use of process claims in the U.S. over the last century, using the patccat data. We show an increase in the annual share of process claims of about 25 percentage points (from below 10% in 1920). This rise in process intensity of patents is not limited to a few patent classes, but we observe it across a broad spectrum of technologies. Process intensity varies by applicant type: companies file more process-intense patents than individuals, and U.S. applicants file more process-intense patents than foreign applicants. We further show that patents with higher process intensity are more valuable but are not necessarily cited more often. Last, process claims are on average shorter than product claims (with the gap narrowing since the 1970s).

We would love to see how other researchers use the data and eventually learn from it. If you have a discussion paper or a publication in which you use the data, please send us a copy at patccat.data@gmail.com.

We will release the R code used to construct the data on GitHub with the next data version (version 3.4.0). Contact us at patccat.data@gmail.com if you would like to take a look at an earlier version of the code.


3. Description of the Data Files

The data files contain claim-level information for independent claims of 10,140,848 U.S. utility patents granted between 1836 and 2020. The files further contain patent-level information for U.S. utility patents.

3.1. File List

File list (file name: contents)
claims-patccat-v3-3-sample.csv: claim-level information for independent claims of a sample of 1,000 patents issued between 1976 and 2020
claims-patccat-v3-3-1836-1919.csv: claim-level information for independent claims of 1,038,041 patents issued between 1836 and 1919
claims-patccat-v3-3-1920-2020.csv: claim-level information for independent claims of 9,102,807 patents issued between 1920 and 2020
patents-patccat-v3-3-sample.csv: patent-level information for a sample of 1,000 patents issued between 1976 and 2020
patents-patccat-v3-3-1836-1919.csv: patent-level information for 1,038,041 patents issued between 1836 and 1919
patents-patccat-v3-3-1920-2020.csv: patent-level information for 9,102,807 patents issued between 1920 and 2020
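
Because the full claim-level files are large, it can help to read them row by row instead of loading them whole. The following is a minimal sketch using only the Python standard library. It assumes the files are comma-separated with a header row whose column names match the variable lists in Sections 3.2 and 3.3; the helper name tally_column and the demo rows are made up for illustration and are not part of the data release.

```python
import csv
import io
from collections import Counter


def tally_column(lines, column):
    """Count the values of one column while reading the file row by row.

    `lines` is any iterable of text lines (an open file works), so the
    full file never has to be held in memory.
    """
    # Assumption: comma delimiter and a header row with the variable names.
    reader = csv.DictReader(lines)
    return Counter(row[column] for row in reader)


# Tiny made-up stand-in for a claims file; these values are not real data.
demo = io.StringIO(
    "PatentClaim,claimType\n"
    "01234567-0001,1\n"
    "01234567-0005,2\n"
    "01234568-0001,2\n"
)
counts = tally_column(demo, "claimType")
print(counts)
```

For a downloaded file, the same function can be applied to a file opened with open(path, newline=""), for example the sample file claims-patccat-v3-3-sample.csv. Values come back as strings, because the csv module does not convert types.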


3.2. List of Variables for Files with Claim-Level Information

For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).

List of Variables (Claim-Level Information) (variable: description)
PatentClaim: patent claim identifier; 8-digit patent number and 4-digit claim number (Ex: 01234567-0001)
singleLine: =1 if claim is published in single-line format
singleReformat: outcome code of reformatting of single-line claims
Jepson: =1 if claim is a Jepson claim
JepsonReformat: outcome code of reformatting of Jepson claims
inBegin: =1 if claim begins with the word "in"
wordsPreamble: number of words in the claim preamble
wordsBody: number of words in the claim body
dependentClaims: number of dependent claims that refer to this independent claim
isMeansPreamble: =1 if term "means" is used in the preamble
isMeansBody: =1 if term "means" is used in the body
isMeans: =1 if term "means" is used anywhere in the claim (approximately a means-plus-function claim)
processPreamble: =1 if terms "method" or "process" are used in the preamble
processBody: =1 if terms "method" or "process" are used in the body
processSimple: =1 if terms "method" or "process" are used anywhere in the claim (for the simple approach to process claim classification)
claimType: claim type of full classification (1 = process; 2 = product; 3 = product-by-process; 0 = no type)
preambleType: preamble type
preambleTerm: keyword used to classify preamble type
preambleTermAlt: alternative keyword (if preambleTerm were not used)
preambleTextStub: first 15 words of the preamble
bodyType: body type
bodyLinesStep: number of steps in the body
bodyLinesElement: number of elements in the body
bodyLinesTotal: total number of identified lines in the body
label: 2-character label of the preamble-body combination; classification table maps label to claim type
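
The PatentClaim identifier and the claimType codes can be decoded with a few lines of code. The sketch below follows the format given in the variable list (8-digit patent number, hyphen, 4-digit claim number) and the four claimType codes. The names ClaimId, parse_patent_claim, and CLAIM_TYPE_LABELS are hypothetical helpers introduced here for illustration.

```python
from typing import NamedTuple

# Codes as listed for the claimType variable.
CLAIM_TYPE_LABELS = {
    0: "no type",
    1: "process",
    2: "product",
    3: "product-by-process",
}


class ClaimId(NamedTuple):
    patent_number: str  # kept as a string to preserve leading zeros
    claim_number: int


def parse_patent_claim(value: str) -> ClaimId:
    """Split an identifier such as '01234567-0001' into its two parts."""
    patent, sep, claim = value.strip().partition("-")
    digits = patent + claim
    well_formed = (
        sep == "-"
        and len(patent) == 8
        and len(claim) == 4
        and digits.isascii()
        and digits.isdigit()
    )
    if not well_formed:
        raise ValueError(f"unexpected PatentClaim format: {value!r}")
    return ClaimId(patent, int(claim))


cid = parse_patent_claim("01234567-0001")
print(cid.patent_number, cid.claim_number, CLAIM_TYPE_LABELS[1])
# prints: 01234567 1 process
```

Keeping the patent number as a string makes it straightforward to match against patent_id in the patent-level files, assuming both use the same zero-padded 8-digit form.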

 

3.3. List of Variables for Files with Patent-Level Information

For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).

List of Variables (Patent-Level Information) (variable: description)
patent_id: U.S. patent number (8-digit patent number)
claims: number of independent claims (the sum of the four claim types: 0, 1, 2, and 3)
noCategory: number of claims without a classified type
processClaims: number of process claims
productClaims: number of product claims
prodByProcessClaims: number of product-by-process claims
firstClaim: type of the first claim (1 = process; 2 = product; 3 = product-by-process; 0 = no type)
simpleProcessClaims: number of process claims by simple approach (terms "method" or "process" anywhere in the claim)
simpleProcessPreamble: number of process claims by simple approach (terms "method" or "process" in the preamble)
meansClaims: number of means-plus-function claims
meansFirst: =1 if first claim is a means-plus-function claim
JepsonClaims: number of Jepson claims
JepsonFirst: =1 if first claim is a Jepson claim
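
To illustrate how the patent-level counts relate to the claim-level rows, the sketch below aggregates (PatentClaim, claimType) pairs into the count variables listed above. It is one possible reading of the variable descriptions, not necessarily how the files were built. The function name summarize_patents and the demo rows are made up, and the code assumes that the "first claim" is the independent claim with the lowest claim number.

```python
from collections import defaultdict

# Patent-level count variable for each claimType code.
COUNT_FIELD = {
    0: "noCategory",
    1: "processClaims",
    2: "productClaims",
    3: "prodByProcessClaims",
}


def summarize_patents(claim_rows):
    """Aggregate (PatentClaim, claimType) pairs into patent-level counts."""
    by_patent = defaultdict(list)
    for patent_claim, claim_type in claim_rows:
        patent_id, _, claim_no = patent_claim.partition("-")
        by_patent[patent_id].append((int(claim_no), int(claim_type)))

    summary = {}
    for patent_id, claims in by_patent.items():
        record = {"claims": len(claims)}
        record.update({name: 0 for name in COUNT_FIELD.values()})
        for _, claim_type in claims:
            record[COUNT_FIELD[claim_type]] += 1
        # Assumption: "first claim" means the lowest-numbered independent claim.
        record["firstClaim"] = min(claims)[1]
        summary[patent_id] = record
    return summary


# Made-up rows for illustration only.
rows = [
    ("01234567-0001", 2),
    ("01234567-0005", 1),
    ("01234567-0009", 1),
    ("01234568-0001", 0),
]
summary = summarize_patents(rows)
print(summary["01234567"])
```

By construction, claims equals the sum of noCategory, processClaims, productClaims, and prodByProcessClaims for every patent, which matches the description of the claims variable.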


Note: The following variables/fields are currently empty (March 30, 2020); we will populate these variables/fields with data version 3.4.0.

preambleTerm
preambleTermAlt
preambleTextStub
bodyLinesStep
bodyLinesElement
bodyLinesTotal

Note: We will release the data for patents issued in 2021 with data version 3.4.0.


4. Coming Soon!

We are working on a number of extensions of the patccat data.

- With data version 3.4.0, we plan to release data for all published U.S. patent applications (2001 through 2021)
- In late spring/early summer 2022, we will release data for patents issued by the European Patent Office (EPO) [Update: March 28, 2023: see https://doi.org/10.5281/zenodo.7776092]
- In late spring/early summer 2022, we will release data for patents issued by the Canadian Intellectual Property Office (CIPO)

 

Files (2.4 GB)

md5:22a291c54e3789767317e30e0036c4a8 (265.0 MB)
md5:f58b48b4a9b1d184bfdd8930e9532415 (1.7 GB)
md5:ed7ca9b1ec6b8e5ad1698975ec945034 (160.5 kB)
md5:50e44d54e2ee62137cb34b1fc35d8b8e (36.1 MB)
md5:47b472f5189531ce4c73256fe37e460d (318.9 MB)
md5:da80169e70150f2ba118e4bb324f144e (35.2 kB)
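
The md5 values listed with the files can be used to check a download for corruption. The sketch below computes a file's MD5 digest in chunks, so that the larger files do not need to fit in memory, and compares it with a value written in the "md5:..." form used above. The function names are hypothetical, and the listing here does not state which checksum belongs to which file, so that pairing has to be taken from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed):
    """Compare a file against a value written as 'md5:<hex digest>'."""
    expected = listed.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected
```

To use it, call matches_checksum with the path of a downloaded file and the md5 string shown for that file; a result of False suggests the download is incomplete or was altered.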