Published August 15, 2022 | Version v1
Dataset Open

CPC Patent Classification: USPTO-70k-enriched

Authors/Creators

  • 1. Bosch Center for AI

Description

The patent classification task is a hierarchical multi-label classification problem. A patent document contains four textual fields: `title`, `abstract`, `claims` and `description`. Because the full text is long, most previous work has focused on the title, abstract and claims. In the paper, we make use of the description as a more elaborate patent field. For evaluation, we create a new dataset (USPTO-70k-enriched) from the previously released USPTO-70k dataset, which contains title and abstract as patent fields.
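To illustrate what "hierarchical multi-label" means for CPC symbols, the sketch below expands a set of labels into their ancestors (section, class, subclass). It relies only on the general CPC symbol layout (one section letter, two class digits, one subclass letter). The function names are hypothetical, and the label granularity actually stored in this dataset is an assumption, not something stated in the record.

```python
# Illustrative sketch only: the function names are hypothetical and the
# label level used in the dataset is assumed, not documented here.

# Prefix lengths of the CPC levels: section (1), class (3), subclass (4).
LEVEL_LENGTHS = (1, 3, 4)


def cpc_ancestors(symbol):
    """Return the section, class and subclass prefixes of a CPC symbol."""
    symbol = symbol.strip().upper()
    # Only emit the levels that the symbol is long enough to contain.
    return [symbol[:n] for n in LEVEL_LENGTHS if len(symbol) >= n]


def expand_labels(symbols):
    """Expand a multi-label set so that every ancestor is also a label."""
    expanded = set()
    for symbol in symbols:
        expanded.update(cpc_ancestors(symbol))
    return sorted(expanded)


if __name__ == "__main__":
    # A patent carrying two subclass labels.
    print(expand_labels(["H04L", "G06F"]))
```

Expanding labels this way is one common option for making a flat classifier respect the hierarchy; other approaches predict each level separately.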

The dataset is now enriched with four additional text columns: claims, brief-summary, fig-desc and detail-desc, where the latter three columns are subfields of the description. Both datasets are created from the bulk data dump provided by the United States Patent and Trademark Office (USPTO), released under CC-BY-4.0.
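Since the description is split across three columns, a user may want to reassemble it into a single text. The following is a minimal sketch that assumes each record is available as a Python dict keyed by the column names listed above. The on-disk file format inside the archive is not described here, so loading is left out, and `build_description` is a hypothetical helper.

```python
# Illustrative sketch only: assumes a record is a dict keyed by the column
# names from the dataset description; the helper name is hypothetical.

# The three columns that together make up the patent description.
DESCRIPTION_PARTS = ("brief-summary", "fig-desc", "detail-desc")


def build_description(record, sep="\n\n"):
    """Join the non-empty description subfields of one record, in order."""
    parts = []
    for key in DESCRIPTION_PARTS:
        value = record.get(key)
        # Skip missing, non-string or whitespace-only subfields.
        if isinstance(value, str) and value.strip():
            parts.append(value.strip())
    return sep.join(parts)


if __name__ == "__main__":
    example = {
        "title": "Example title",
        "brief-summary": "A short summary.",
        "fig-desc": "",
        "detail-desc": "The detailed description.",
    }
    print(build_description(example))
```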

We also release the dataset under the same license, CC-BY-4.0.

Files

bir_dataset_2022.zip (871.1 MB)
md5:2c3eb45f98a93d42b2f14227b303e81b