patccat: A classifier for patent claims
- 1. University of Mannheim
- 2. Wake Forest University
- 3. Southern Methodist University
Description
Data version: 3.3.0
Authors:
Bernhard Ganglmair (University of Mannheim, Department of Economics, and ZEW Mannheim)
W. Keith Robinson (Wake Forest University, School of Law)
Michael Seeligson (Southern Methodist University, Cox School of Business)
1. Notes on Data Construction
2. Citation and Code
3. Description of the Data Files
3.1. File List
3.2. List of Variables for Files with Claim-Level Information
3.3. List of Variables for Files with Patent-Level Information
4. Coming Soon!
1. Notes on Data Construction
This is version 3.3.0 of the patccat data (patent claim classification by algorithmic text analysis).
Patent claims define an invention. A patent application is required to have one or more claims that distinctly claim the subject matter which the patent applicant regards as her invention or discovery. We construct a classifier of patent claims that identifies three distinct claim types: process claims, product claims, and product-by-process claims.
For this classification, we combine information obtained from both the preamble and the body of a claim. The preamble is a general description of the invention (e.g., a method, an apparatus, or a device), whereas the body identifies steps and elements (specifying in detail the invention laid out in the preamble) that the applicant is claiming as the invention. The combination of the preamble type and the body type provides us with a more detailed and more accurate classification of claims than other approaches in the literature. This approach also accounts for unconventional drafting approaches. Finally, we validate our classification using close to 10,000 manually classified claims.
The data files contain the results of our classification. We provide claim-level information for each independent claim of U.S. utility patents granted between 1836 and 2020. We also provide patent-level information, i.e., the counts of different claim types for a given patent.
For a detailed description of our classification approach, please take a look at the accompanying paper (Ganglmair, Robinson, and Seeligson 2022).
2. Citation and Code
Please cite the following paper when using the data in your own work:
Ganglmair, Bernhard, W. Keith Robinson, and Michael Seeligson (2022): "The Rise of Process Claims: Evidence from a Century of U.S. Patents," unpublished manuscript available at https://papers.ssrn.com/abstract=4069994.
In the paper, we document the use of process claims in the U.S. over the last century, using the patccat data. We show an increase in the annual share of process claims of about 25 percentage points (from below 10% in 1920). This rise in process intensity of patents is not limited to a few patent classes, but we observe it across a broad spectrum of technologies. Process intensity varies by applicant type: companies file more process-intense patents than individuals, and U.S. applicants file more process-intense patents than foreign applicants. We further show that patents with higher process intensity are more valuable but are not necessarily cited more often. Last, process claims are on average shorter than product claims (with the gap narrowing since the 1970s).
We would love to see how other researchers use the data and eventually learn from it. If you have a discussion paper or a publication in which you use the data, please send us a copy at patccat.data@gmail.com.
We will release the R code used to construct the data on GitHub with the next data version (version 3.4.0). Contact us at patccat.data@gmail.com if you would like to take a look at an earlier version of the code.
3. Description of the Data Files
The data files contain claim-level information for independent claims of 10,140,848 U.S. utility patents granted between 1836 and 2020. The files further contain patent-level information for U.S. utility patents.
3.1. File List
claims-patccat-v3-3-sample.csv | claim-level information for independent claims of a sample of 1000 patents issued between 1976 and 2020 |
claims-patccat-v3-3-1836-1919.csv | claim-level information for independent claims of 1,038,041 patents issued between 1836 and 1919 |
claims-patccat-v3-3-1920-2020.csv | claim-level information for independent claims of 9,102,807 patents issued between 1920 and 2020 |
patents-patccat-v3-3-sample.csv | patent-level information for a sample of 1000 patents issued between 1976 and 2020 |
patents-patccat-v3-3-1836-1919.csv | patent-level information for 1,038,041 patents issued between 1836 and 1919 |
patents-patccat-v3-3-1920-2020.csv | patent-level information for 9,102,807 patents issued between 1920 and 2020 |
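The naming scheme in the file list can be turned into a small lookup. The following is a minimal sketch in Python; the function name `patccat_file` is hypothetical and not part of the data release. It only encodes the file names and year ranges listed above and does not cover the sample files.

```python
def patccat_file(level, year):
    """Return the name of the v3.3 file that covers a given issue year.

    `level` is "claims" (claim-level) or "patents" (patent-level).
    The helper is illustrative only; it mirrors the file list above.
    """
    if level not in ("claims", "patents"):
        raise ValueError("level must be 'claims' or 'patents'")
    if 1836 <= year <= 1919:
        span = "1836-1919"
    elif 1920 <= year <= 2020:
        span = "1920-2020"
    else:
        raise ValueError("version 3.3.0 covers patents issued 1836 through 2020")
    return f"{level}-patccat-v3-3-{span}.csv"


print(patccat_file("claims", 1900))   # claims-patccat-v3-3-1836-1919.csv
print(patccat_file("patents", 1995))  # patents-patccat-v3-3-1920-2020.csv
```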
3.2. List of Variables for Files with Claim-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
PatentClaim | patent claim identifier; 8-digit patent number and 4-digit claim number (Ex: 01234567-0001) |
singleLine | =1 if claim is published in single-line format |
singleReformat | outcome code of reformatting of single-line claims |
Jepson | =1 if claim is a Jepson claim |
JepsonReformat | outcome code of reformatting of Jepson claims |
inBegin | =1 if claim begins with the word "in" |
wordsPreamble | number of words in the claim preamble |
wordsBody | number of words in the claim body |
dependentClaims | number of dependent claims that refer to this independent claim |
isMeansPreamble | =1 if term "means" is used in the preamble |
isMeansBody | =1 if term "means" is used in the body |
isMeans | =1 if term "means" is used anywhere in the claim (~ means-plus-function claim) |
processPreamble | =1 if terms "method" or "process" are used in the preamble |
processBody | =1 if terms "method" or "process" are used in the body |
processSimple | =1 if terms "method" or "process" are used anywhere in the claim (for simple approach of process claim classification) |
claimType | claim type of full classification (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
preambleType | preamble type |
preambleTerm | keyword used to classify preamble type |
preambleTermAlt | alternative keyword (if preambleTerm were not used) |
preambleTextStub | first 15 words of the preamble |
bodyType | body type |
bodyLinesStep | number of steps in the body |
bodyLinesElement | number of elements in the body |
bodyLinesTotal | total number of identified lines in the body |
label | 2-character label of the preamble-body combination; classification table maps label to claim type |
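As an illustration of how the claim-level variables relate to patent-level counts, the sketch below splits a `PatentClaim` identifier into its patent number and claim number and tallies `claimType` codes per patent. It assumes that patent-level counts are plain tallies of the claim-level codes. The function names and the example rows are made up for illustration; only the variable names and the code-to-type mapping come from the tables in this document.

```python
from collections import Counter, defaultdict

# claimType code -> name of the matching patent-level count variable
CLAIM_TYPE_COLUMNS = {
    0: "noCategory",
    1: "processClaims",
    2: "productClaims",
    3: "prodByProcessClaims",
}


def split_patent_claim(identifier):
    """Split '01234567-0001' into ('01234567', 1)."""
    patent, claim = identifier.split("-")
    if len(patent) != 8 or len(claim) != 4 or not (patent + claim).isdigit():
        raise ValueError(f"unexpected PatentClaim format: {identifier!r}")
    # Keep the patent number as a string so leading zeros survive.
    return patent, int(claim)


def tally_by_patent(rows):
    """Count claim types per patent from (PatentClaim, claimType) pairs."""
    totals = defaultdict(Counter)
    for identifier, claim_type in rows:
        patent, _ = split_patent_claim(identifier)
        totals[patent][CLAIM_TYPE_COLUMNS[int(claim_type)]] += 1
    return {patent: dict(counts) for patent, counts in totals.items()}


# Made-up rows, not taken from the data files.
rows = [
    ("01234567-0001", 1),
    ("01234567-0007", 2),
    ("01234567-0012", 2),
    ("07654321-0001", 3),
]
print(tally_by_patent(rows))
```

Keeping the patent number as a string rather than an integer preserves the 8-digit zero-padded form used in both file types.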
3.3. List of Variables for Files with Patent-Level Information
For detailed descriptions, see the appendix in Ganglmair, Robinson, and Seeligson (2022).
patent_id | U.S. patent number (8-digit patent number) |
claims | number of independent claims (the sum of the four claim types: 0, 1, 2, and 3) |
noCategory | number of claims without a classified type |
processClaims | number of process claims |
productClaims | number of product claims |
prodByProcessClaims | number of product-by-process claims |
firstClaim | type of the first claim (1 = process; 2 = product; 3 = product-by-process; 0 = no type) |
simpleProcessClaims | number of process claims by simple approach (terms "method" or "process" anywhere in the claim) |
simpleProcessPreamble | number of process claims by simple approach (terms "method" or "process" in the preamble) |
meansClaims | number of means-plus-function claims |
meansFirst | =1 if first claim is a means-plus-function claim |
JepsonClaims | number of Jepson claims |
JepsonFirst | =1 if first claim is a Jepson claim |
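The definition of `claims` as the sum of the four claim types suggests a simple consistency check on each patent-level record. The sketch below also computes the share of process claims among a patent's independent claims. This share is an illustrative measure and may differ from the process-intensity measure used in the paper. The function names and the example record are hypothetical.

```python
def check_claim_counts(record):
    """True if `claims` equals the sum of the four claim-type counts."""
    parts = (
        record["noCategory"]
        + record["processClaims"]
        + record["productClaims"]
        + record["prodByProcessClaims"]
    )
    return record["claims"] == parts


def process_share(record):
    """Share of process claims among independent claims (None if no claims)."""
    if record["claims"] == 0:
        return None
    return record["processClaims"] / record["claims"]


# Made-up record, not taken from the data files.
example = {
    "patent_id": "01234567",
    "claims": 4,
    "noCategory": 0,
    "processClaims": 1,
    "productClaims": 3,
    "prodByProcessClaims": 0,
}
print(check_claim_counts(example), process_share(example))  # True 0.25
```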
Note: The following claim-level variables/fields are currently empty (March 30, 2020); we will populate these variables/fields with data version 3.4.0.
preambleTerm
preambleTermAlt
preambleTextStub
bodyLinesStep
bodyLinesElement
bodyLinesTotal
Note: We will release the data for patents issued in 2021 with data version 3.4.0.
4. Coming Soon!
We are working on a number of extensions of the patccat data.
- With data version 3.4.0, we plan to release data for all published U.S. patent applications (2001 through 2021)
- In late spring/early summer 2022, we will release data for patents issued by the European Patent Office (EPO) [Update: March 28, 2023: see https://doi.org/10.5281/zenodo.7776092]
- In late spring/early summer 2022, we will release data for patents issued by the Canadian Intellectual Property Office (CIPO)
Files (2.4 GB)
Checksum | Size |
---|---|
md5:22a291c54e3789767317e30e0036c4a8 | 265.0 MB |
md5:f58b48b4a9b1d184bfdd8930e9532415 | 1.7 GB |
md5:ed7ca9b1ec6b8e5ad1698975ec945034 | 160.5 kB |
md5:50e44d54e2ee62137cb34b1fc35d8b8e | 36.1 MB |
md5:47b472f5189531ce4c73256fe37e460d | 318.9 MB |
md5:da80169e70150f2ba118e4bb324f144e | 35.2 kB |