Examining Patented Artefact Knowledge Graphs to understand Linguistic and Structural Basis
Description
Introduction
This resource supports our research, which examines knowledge graphs of patented artefacts to understand the linguistic and structural basis of engineering design knowledge. The research is detailed in the following paper.
https://arxiv.org/abs/2312.06355
The resource is split into multiple Pandas dataframes stored in pickled format (".pkl"). To load any dataframe, use the following Python code.
import pandas as pd
data = pd.read_pickle("PATH TO FILE.pkl")
print(data.head())
The individual datasets are described as follows.
Series information (English)
Patent Data
The original patent data is given in "engineering-design-knowledge/1-patent-data/"
For our analysis, we sampled 33,881 patents from the USPTO using PatentsView, stratified according to CPC subclasses.
Patent List
In the dataframe "1-patent-list.pkl", we provide the introductory information about these patents as follows.
FIELD | EXAMPLE |
patent_id | 7745779 |
patent_date | 29/6/2010 |
patent_title | Color pixel arrays having common color filters for multiple adjacent pixels for use in CMOS imagers |
patent_abstract | Image sensors and methods of operating image sensors. An image sensor includes an array of pixels and an array of color filters … |
patent_classifications | ['H04N', 'G01J'] |
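Because "patent_classifications" holds a list of CPC subclasses per patent, selecting patents by subclass needs a membership test rather than a plain equality filter. The sketch below illustrates this on a small toy dataframe that mimics the fields above. The second row, the helper name `patents_in_subclass`, and the choice of string patent IDs are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for "1-patent-list.pkl". The first row follows the example above;
# the second row is invented for illustration. Patent IDs are assumed to be strings.
patents = pd.DataFrame(
    {
        "patent_id": ["7745779", "0000001"],
        "patent_title": [
            "Color pixel arrays having common color filters for multiple "
            "adjacent pixels for use in CMOS imagers",
            "Hypothetical example patent",
        ],
        "patent_classifications": [["H04N", "G01J"], ["B60K"]],
    }
)


def patents_in_subclass(df, subclass):
    """Return the rows whose classification list contains the given CPC subclass."""
    mask = df["patent_classifications"].apply(lambda codes: subclass in codes)
    return df[mask]


h04n = patents_in_subclass(patents, "H04N")
print(h04n["patent_id"].tolist())

# One row per (patent, subclass) pair, convenient for counting patents per subclass.
per_subclass = patents.explode("patent_classifications")
print(per_subclass["patent_classifications"].value_counts())
```

With the real dataframe, replace the toy `patents` with the result of `pd.read_pickle` on "1-patent-list.pkl".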
Patent Sentences
For each patent in the dataset described above, we acquired the full text and processed its sentences. In the dataframe "2-patent-sentences.pkl", we provide the 7,566,829 formatted sentences along with the patent ID and the sentence length in tokens.
FIELD | EXAMPLE |
patent_id | 10716120 |
sentence_id | 10716120_725 |
sentence | In certain examples, aspects of the operations of block 1915 may be performed by an uplink component as described with reference to FIGS 14 through 17. |
length | 28 |
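As a rough illustration of how the sentence-level fields can be summarised per patent, the sketch below groups a toy dataframe by "patent_id". Only the first row follows the example above; the other rows, their lengths, and the use of string patent IDs are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for "2-patent-sentences.pkl". Rows two and three are invented,
# and their "length" values are illustrative rather than real token counts.
sentences = pd.DataFrame(
    {
        "patent_id": ["10716120", "10716120", "10075499"],
        "sentence_id": ["10716120_725", "10716120_726", "10075499_98"],
        "sentence": [
            "In certain examples, aspects of the operations of block 1915 may be "
            "performed by an uplink component as described with reference to "
            "FIGS 14 through 17.",
            "Hypothetical second sentence.",
            "Hypothetical sentence from another patent.",
        ],
        "length": [28, 4, 7],
    }
)

# Number of sentences and mean token count for each patent.
per_patent = sentences.groupby("patent_id")["length"].agg(
    sentence_count="count", mean_length="mean"
)
print(per_patent)
```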
Patent Knowledge Graphs
From the sentences described above, we extracted 24,537,587 facts of the form "head entity :: relationship :: tail entity" using a method described in our prior work: https://arxiv.org/abs/2307.06985
Combining the facts within a patent forms a patent knowledge graph, which we examined in the current work to understand the basis of design knowledge. The dataframe "3-patent-knowledge-graphs.pkl" provides the individual facts as follows. The sentence ID in each row is the same as the one in the previous dataframe, "2-patent-sentences.pkl".
FIELD | EXAMPLE |
patent_id | 10075499 |
sentence_id | 10075499_98 |
head | the host facility |
relation | with |
tail | the highest average and aggregate weighting value |
We combine the facts described above, within each patent, into a knowledge graph that we examine in the current work. The results of our analysis are compiled into the dataframes explained below.
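Combining the facts of one patent into a graph can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper: each fact is treated as a directed edge from head to tail, the graph is stored as a plain adjacency dictionary, and the helper name `build_patent_graph` is hypothetical. Only the first fact follows the example above; the remaining rows are invented.

```python
from collections import defaultdict

import pandas as pd

# Toy stand-in for "3-patent-knowledge-graphs.pkl". Rows after the first are invented.
facts = pd.DataFrame(
    {
        "patent_id": ["10075499", "10075499", "10075499", "0000001"],
        "sentence_id": ["10075499_98", "10075499_98", "10075499_99", "0000001_1"],
        "head": ["the host facility", "the host facility", "the server", "a widget"],
        "relation": ["with", "includes", "selects", "has"],
        "tail": [
            "the highest average and aggregate weighting value",
            "the server",
            "the host facility",
            "a handle",
        ],
    }
)


def build_patent_graph(facts_df, patent_id):
    """Return {head: [(relation, tail), ...]} built from the facts of one patent."""
    rows = facts_df[facts_df["patent_id"] == patent_id]
    graph = defaultdict(list)
    for head, relation, tail in zip(rows["head"], rows["relation"], rows["tail"]):
        graph[head].append((relation, tail))  # assumption: edge points head -> tail
    return dict(graph)


graph = build_patent_graph(facts, "10075499")

# Nodes are all entities that appear as a head or a tail; every fact is one edge.
nodes = set(graph) | {tail for edges in graph.values() for _, tail in edges}
edge_total = sum(len(edges) for edges in graph.values())
print(len(nodes), edge_total)
```

The same head, relation, and tail triples can be loaded into a graph library of choice for further network analysis.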
Linguistic Basis
We analysed the frequencies of entities and relationships in the knowledge graphs populated for each patent in the sample. In the dataframes provided under "engineering-design-knowledge/2-linguistic-basis/", we provide information for 5,015,681 entities, 845,303 relationships, and 165 hierarchical relationships regarding their frequencies and linguistic syntaxes. In our work, we fit the proportions of the syntaxes to a Zipf distribution to visualise these at different percentiles.
Entity Syntaxes
FIELD | EXAMPLE |
entity | the upper connecting member |
frequency | 27 |
syntax | the JJ VBG NN |
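The proportion of each syntax and a simple Zipf-style fit can be sketched as below. The exact fitting procedure of the paper is not described here, so this sketch makes two assumptions: proportions are weighted by the "frequency" field, and the Zipf exponent is estimated by ordinary least squares on the log-log rank versus proportion relationship. Only the first row follows the example above; the other rows are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the entity syntaxes dataframe. Rows after the first are invented.
entities = pd.DataFrame(
    {
        "entity": [
            "the upper connecting member",
            "the lower connecting member",
            "the housing",
            "the sensor",
            "a rotating shaft",
        ],
        "frequency": [27, 13, 40, 20, 10],
        "syntax": ["the JJ VBG NN", "the JJ VBG NN", "the NN", "the NN", "a VBG NN"],
    }
)

# Total frequency per syntax, most frequent first, then normalised to proportions.
syntax_totals = entities.groupby("syntax")["frequency"].sum().sort_values(ascending=False)
proportions = syntax_totals / syntax_totals.sum()

# Zipf's law: proportion is roughly C * rank ** (-s), so
# log(proportion) is roughly log(C) - s * log(rank).
ranks = np.arange(1, len(proportions) + 1)
slope, intercept = np.polyfit(np.log(ranks), np.log(proportions.to_numpy()), deg=1)
zipf_exponent = -slope

print(proportions)
print(zipf_exponent)
```

The same steps apply to the relation and hierarchical relation dataframes, using their respective frequency or count fields.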
Relation Syntaxes
FIELD | EXAMPLE |
entity | are suitable for recovering |
frequency | 2 |
syntax | are JJ for VBG |
Hierarchical Relation Syntaxes
FIELD | EXAMPLE |
hierarchical_relation | comprises determining |
count | 1237 |
syntax | compris* VBG |
Structural Basis
We performed motif analysis on the network structures of the patent knowledge graphs to identify statistically recurrent 3-node and 4-node patterns that are the building blocks of each patent knowledge graph. In the dataframe "1-patent-motifs.pkl" under "engineering-design-knowledge/3-structural-basis/", we list the network size and the motifs for each patent in the sample.
FIELD | EXAMPLE |
patent_id | 9139018 |
node_count | 85 |
edge_count | 146 |
motifs | [5, 61, 80] |
The pattern numbers listed in the motifs field can be looked up in the images included in the folder "pattern-figures".
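Since "motifs" holds a list of pattern numbers per patent, counting how many patents contain each pattern requires flattening the lists first. The sketch below shows one way to do this on a toy dataframe. Only the first row follows the example above; the other rows and the string patent IDs are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for "1-patent-motifs.pkl". Rows after the first are invented.
motifs = pd.DataFrame(
    {
        "patent_id": ["9139018", "0000001", "0000002"],
        "node_count": [85, 40, 120],
        "edge_count": [146, 55, 210],
        "motifs": [[5, 61, 80], [5, 80], [5]],
    }
)

# One row per (patent, pattern number), then count the patents containing each pattern.
exploded = motifs.explode("motifs")
pattern_counts = exploded["motifs"].astype(int).value_counts()

# Share of patents in which each pattern is a motif.
pattern_share = pattern_counts / len(motifs)
print(pattern_share)
```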
Files
Name | Size | Checksum |
engineering-design-knowledge.zip | 610.0 MB | md5:9c53764978ebd607df260cb3777f72a8 |
Additional details
Identifiers
- arXiv
- arXiv:2312.06355
Related works
- Requires
- Preprint: arXiv:2307.06985 (arXiv)