CPC Patent Classification: USPTO-70k-enriched
Description
The patent classification task is a hierarchical multi-label classification problem. A patent document has four textual fields: `title`, `abstract`, `claims` and `description`. Because the full text is long, most previous work has used only the title, abstract and claims. In the paper, we additionally use the description, which is a more elaborate patent field. For evaluation, we create a new dataset (USPTO-70k-enriched) from the previously released USPTO-70k dataset, which contains the title and abstract as patent fields.
The new dataset adds four text columns: `claims`, `brief-summary`, `fig-desc` and `detail-desc`, where the latter three are subfields of the description. Both datasets are created from the bulk data dump provided by the United States Patent and Trademark Office (USPTO), released under CC-BY-4.0.
We also release the dataset under the same license, CC-BY-4.0.
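Since the description is split across three columns, a reader who wants a single description text per patent may need to join them. The following is a minimal sketch of one way to do that, assuming each record is available as a Python dict keyed by the column names listed above. The file format inside the archive is not stated here, so loading is left out, and the helper name `join_description` is hypothetical.

```python
# Column names as listed in the dataset description; everything else is illustrative.
DESCRIPTION_SUBFIELDS = ("brief-summary", "fig-desc", "detail-desc")


def join_description(record, sep="\n\n"):
    """Concatenate the non-empty description subfields of one patent record.

    `record` is assumed to be a dict mapping column names to text.
    Missing, non-string, or blank subfields are skipped.
    """
    parts = []
    for name in DESCRIPTION_SUBFIELDS:
        text = record.get(name)
        if isinstance(text, str) and text.strip():
            parts.append(text.strip())
    return sep.join(parts)


# Placeholder record for illustration only; not taken from the dataset.
example = {
    "title": "placeholder title",
    "brief-summary": "placeholder summary text",
    "fig-desc": "",
    "detail-desc": "placeholder detailed description",
}
print(join_description(example))
```

Keeping the subfield order fixed in one tuple makes it easy to drop or reorder a subfield when comparing which parts of the description are used as input.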
Files
| Name | Size |
|---|---|
| bir_dataset_2022.zip (md5:2c3eb45f98a93d42b2f14227b303e81b) | 871.1 MB |
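The MD5 checksum listed above can be used to confirm that the archive downloaded intact. The following is a minimal sketch using Python's standard `hashlib` module. The function names `md5_of_file` and `verify_archive` are hypothetical, and the local path passed to them is an assumption about where the file was saved.

```python
import hashlib

# Checksum as listed for bir_dataset_2022.zip in the file listing.
EXPECTED_MD5 = "2c3eb45f98a93d42b2f14227b303e81b"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected MD5 checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `verify_archive("bir_dataset_2022.zip")` would return `True` for an intact download, assuming the archive was saved under that name in the current directory. Reading in chunks avoids loading the whole 871.1 MB file into memory.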