Published August 16, 2024 | Version v1
Dataset Open

Examining Patented Artefact Knowledge Graphs to understand Linguistic and Structural Basis

  • 1. Singapore University of Technology and Design

Contributors

Supervisor:

  • 1. City University of Hong Kong

Description

Introduction

This resource supports our research, which examines knowledge graphs of patented artefacts to understand the linguistic and structural basis of engineering design knowledge. The research is detailed in the following paper.

https://arxiv.org/abs/2312.06355

The resource is split into multiple pandas dataframes stored in pickle format (".pkl"). To load any dataframe, use the following Python code.

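import pandas as pd

# Replace the placeholder with the path to one of the extracted .pkl files.
data = pd.read_pickle("PATH TO FILE.pkl")
print(data.head())  # preview the first five rows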

The individual datasets are described as follows.

Series information (English)

Patent Data

The original patent data is given in "engineering-design-knowledge/1-patent-data/"

For our analysis, we sampled 33,881 patents from the USPTO using PatentsView, stratified by CPC subclass.

Patent List

The dataframe "1-patent-list.pkl" provides basic information about these patents as follows.

FIELD                   EXAMPLE
patent_id               7745779
patent_date             29/6/2010
patent_title            Color pixel arrays having common color filters for multiple adjacent pixels for use in CMOS imagers
patent_abstract         Image sensors and methods of operating image sensors. An image sensor includes an array of pixels and an array of color filters …
patent_classifications  ['H04N', 'G01J']
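
The fields above lend themselves to simple filtering by CPC subclass. The following is a minimal sketch on a toy dataframe rather than the real file. It assumes that patent_id and patent_date are stored as strings and that patent_classifications holds a Python list; the second row and the helper name in_subclass are invented for illustration.

```python
import pandas as pd

# Toy stand-in for "1-patent-list.pkl" with a subset of its columns.
# Row 1 copies the documented example; row 2 is invented for illustration.
# Assumption: IDs and dates are stored as strings.
patents = pd.DataFrame(
    {
        "patent_id": ["7745779", "0000001"],
        "patent_date": ["29/6/2010", "1/2/2015"],
        "patent_classifications": [["H04N", "G01J"], ["G01J"]],
    }
)

# The example date is day-first, so parse it with an explicit format.
patents["patent_date"] = pd.to_datetime(patents["patent_date"], format="%d/%m/%Y")


def in_subclass(df, subclass):
    """Return the patents whose classification list contains `subclass`."""
    mask = df["patent_classifications"].apply(lambda codes: subclass in codes)
    return df[mask]


# Patents per CPC subclass (a patent with several subclasses counts once in each).
subclass_counts = (
    patents.explode("patent_classifications")["patent_classifications"].value_counts()
)

print(in_subclass(patents, "H04N")["patent_id"].tolist())
print(subclass_counts)
```

If the real dataframe stores dates as datetime values already, the parsing step can be dropped.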

Patent Sentences

For each patent in the sample described above, we acquired the full text and processed its sentences. The dataframe "2-patent-sentences.pkl" lists 7,566,829 formatted sentences, each with its patent ID and its length in tokens.

FIELD        EXAMPLE
patent_id    10716120
sentence_id  10716120_725
sentence     In certain examples, aspects of the operations of block 1915 may be performed by an uplink component as described with reference to FIGS 14 through 17.
length       28
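
As a rough illustration of how the sentence table might be summarised, the sketch below computes per-patent sentence statistics on a toy dataframe. Only the first row reflects the documented example; the other rows are invented. It also assumes that the suffix of sentence_id is a sentence index within the patent, which the example suggests but the description does not state.

```python
import pandas as pd

# Toy stand-in for "2-patent-sentences.pkl" (the sentence text column is omitted).
# Row 1 uses the documented ID and length; the other rows are invented.
sentences = pd.DataFrame(
    {
        "patent_id": ["10716120", "10716120", "10075499"],
        "sentence_id": ["10716120_725", "10716120_726", "10075499_98"],
        "length": [28, 12, 30],
    }
)

# Assumption: sentence_id is "<patent_id>_<index>"; take the part after the last "_".
sentences["position"] = (
    sentences["sentence_id"].str.rsplit("_", n=1).str[-1].astype(int)
)

# Sentence count and token-length statistics per patent.
per_patent = sentences.groupby("patent_id")["length"].agg(["count", "mean", "max"])

print(per_patent)
```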

Patent Knowledge Graphs

From the sentences described above, we extracted 24,537,587 facts of the form "head entity :: relationship :: tail entity" using a method described in our prior work: https://arxiv.org/abs/2307.06985

Combining the facts within a patent forms a patent knowledge graph, which we examined in the current work to understand the basis of design knowledge. The dataframe "3-patent-knowledge-graphs.pkl" provides the individual facts as follows. The sentence ID in each row is the same as the one in the previous dataframe, "2-patent-sentences.pkl".

FIELD        EXAMPLE
patent_id    10075499
sentence_id  10075499_98
head         the host facility
relation     with
tail         the highest average and aggregate weighting value

As described above, we combine the facts within each patent into a knowledge graph, which we examine in our current work. The results of our analysis are compiled into the dataframes explained below.
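
One way to picture the combination step is to group the facts by patent_id and collect them into an adjacency structure. The sketch below is only an illustration under stated assumptions, not the pipeline used in the paper: entities are matched by exact string, no normalisation or deduplication is applied, and the function name build_graph and all rows except the first are invented.

```python
from collections import defaultdict

import pandas as pd

# Toy stand-in for "3-patent-knowledge-graphs.pkl".
# Row 1 copies the documented example; the other rows are invented.
facts = pd.DataFrame(
    {
        "patent_id": ["10075499", "10075499", "10075499"],
        "sentence_id": ["10075499_98", "10075499_99", "10075499_100"],
        "head": ["the host facility", "the host facility", "a sensor"],
        "relation": ["with", "includes", "coupled to"],
        "tail": [
            "the highest average and aggregate weighting value",
            "a sensor",
            "the host facility",
        ],
    }
)


def build_graph(patent_facts):
    """Return {head: [(relation, tail), ...]} for the facts of one patent."""
    graph = defaultdict(list)
    for row in patent_facts.itertuples(index=False):
        graph[row.head].append((row.relation, row.tail))
    return dict(graph)


# One directed, labelled graph per patent.
graphs = {pid: build_graph(group) for pid, group in facts.groupby("patent_id")}

# Graph size: distinct entities (as heads or tails) and number of facts.
nodes = set(facts["head"]) | set(facts["tail"])
print(len(nodes), "nodes,", len(facts), "edges")
```

A graph library could replace the dictionary if path or motif queries are needed; the plain structure is used here only to keep the example dependency-free beyond pandas.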

Linguistic Basis

We analysed the frequencies of entities and relationships in the knowledge graphs populated for each patent in the sample. The dataframes under "engineering-design-knowledge/2-linguistic-basis/" give the frequencies and linguistic syntaxes of 5,015,681 entities, 845,303 relationships, and 165 hierarchical relationships. In our work, we fit the proportions of the syntaxes to a Zipf distribution to visualise these at different percentiles.

Entity Syntaxes

FIELD      EXAMPLE
entity     the upper connecting member
frequency  27
syntax     the JJ VBG NN

Relation Syntaxes

FIELD      EXAMPLE
entity     are suitable for recovering
frequency  2
syntax     are JJ for VBG

Hierarchical Relation Syntaxes

FIELD                  EXAMPLE
hierarchical_relation  comprises determining
count                  1237
syntax                 compris* VBG
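
The Zipf fit mentioned above can be approximated from the frequency and syntax fields. The sketch below is a hedged illustration only: the fitting procedure used in the paper is not described here, so an ordinary least-squares fit on log rank versus log proportion is assumed, which is a common but statistically crude choice. The function name fit_zipf and every row except the first are invented, and the toy counts are chosen to follow an exact 1/rank law.

```python
import math
from collections import Counter

# (phrase, frequency, syntax) rows shaped like the entity-syntax dataframe.
# Row 1 is the documented example; the rest are invented for illustration.
rows = [
    ("the upper connecting member", 27, "the JJ VBG NN"),
    ("a housing", 60, "a NN"),
    ("a sensor", 48, "a NN"),
    ("the shaft", 54, "the NN"),
    ("the first end", 36, "the JJ NN"),
]

# Aggregate phrase frequencies into one total per syntax.
syntax_counts = Counter()
for _phrase, frequency, syntax in rows:
    syntax_counts[syntax] += frequency


def fit_zipf(counts):
    """Fit proportion ~ scale * rank**(-exponent) by least squares in log-log space."""
    total = sum(counts.values())
    ranked = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count / total) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


exponent, scale = fit_zipf(syntax_counts)
print(f"exponent={exponent:.3f}, scale={scale:.3f}")
```

On real data, a maximum-likelihood estimate would usually be preferable to the log-log regression shown here.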

 

Structural Basis

We performed motif analysis on the network structures of the patent knowledge graphs to identify statistically recurrent 3-node and 4-node patterns, which are the building blocks of each patent knowledge graph. The dataframe "1-patent-motifs.pkl" under "engineering-design-knowledge/3-structural-basis/" lists the network size and the motifs for each patent in the sample.

FIELD       EXAMPLE
patent_id   9139018
node_count  85
edge_count  146
motifs      [5, 61, 80]

The pattern numbers in the motifs field above can be looked up in the images included in the folder "pattern-figures".
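
To see which patterns recur across patents, the motifs lists can be flattened and tallied. The following sketch works on a toy dataframe in which only the first row comes from the example above; the second row is invented. It assumes that motifs holds a Python list of integer pattern numbers. The edges-per-node ratio is added purely as an illustrative size measure and is not a quantity defined in this description.

```python
import pandas as pd

# Toy stand-in for "1-patent-motifs.pkl".
# Row 1 copies the documented example; row 2 is invented for illustration.
motifs = pd.DataFrame(
    {
        "patent_id": ["9139018", "0000001"],
        "node_count": [85, 40],
        "edge_count": [146, 52],
        "motifs": [[5, 61, 80], [5, 12]],
    }
)

# One row per (patent, pattern number), then count patents per pattern.
pattern_counts = motifs.explode("motifs")["motifs"].astype(int).value_counts()

# Illustrative size measure: average number of edges per node.
motifs["edges_per_node"] = motifs["edge_count"] / motifs["node_count"]

print(pattern_counts)
print(motifs[["patent_id", "edges_per_node"]])
```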

 

 

Files

engineering-design-knowledge.zip (610.0 MB)
md5:9c53764978ebd607df260cb3777f72a8

Additional details

Identifiers

Related works

Requires
Preprint: arXiv:2307.06985 (arXiv)