Published June 17, 2024 | Version v1
Computational notebook Open

Training Infrastructure for Engineering Design Knowledge Extraction

  • 1. Singapore University of Technology and Design

Contributors

Supervisor:

  • 1. City University of Hong Kong

Description

We provide the training infrastructure for extracting engineering design knowledge from patent documents.

Consider the following example:

{HEAD ~ This product}, which can be applied by {TAIL ~ spray methods}, contains fungicides, preservatives

where the entities are marked using {HEAD ~ ...} and {TAIL ~ ...} markers. The goal of training is to identify the relation between the two entities, which in this example is "applied by".
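The marker format can be illustrated with a short parsing sketch. The regular expression and the function name `parse_marked_sentence` are assumptions made for illustration; they are not taken from the notebooks in the archive.

```python
import re

# Matches the {HEAD ~ ...} and {TAIL ~ ...} markers shown in the example above.
# The pattern is an illustrative assumption, not the authors' own parser.
MARKER = re.compile(r"\{(HEAD|TAIL)\s*~\s*([^{}]*?)\s*\}")


def parse_marked_sentence(marked):
    """Return the sentence without markers and a dict of the marked entities."""
    entities = {}

    def _unwrap(match):
        # Remember the entity text under its role and keep only the text.
        entities[match.group(1)] = match.group(2)
        return match.group(2)

    plain = MARKER.sub(_unwrap, marked)
    return plain, entities


example = (
    "{HEAD ~ This product}, which can be applied by "
    "{TAIL ~ spray methods}, contains fungicides, preservatives"
)
plain, entities = parse_marked_sentence(example)
print(plain)
print(entities)
```

Running the sketch on the example yields the plain sentence together with the head entity "This product" and the tail entity "spray methods".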

We propose three approaches to accomplish this goal.

  • Relation identification - When the sentence, marked as above, is input to a transformer as a list of tokens, the output should be one label per token: "HEAD", "REL", "TAIL", or "OTH". The tokens labelled "REL" are then retrieved as the relation. In NLP, this task is commonly referred to as "token classification".
  • Relation identification with spaCy - The token classification approach above can also be carried out using the spaCy training module, which handles the various training steps but limits the choice of models.
  • Relation elicitation - In this approach, the sentence is input as is and a transformer directly elicits the relation. This task is referred to as "Seq2Seq" or "Text2Text".
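To make the difference between the targets concrete, the sketch below builds a token-classification target and a Seq2Seq target from the example sentence. It assumes simple whitespace tokenization and hypothetical field names (`input`, `target`); the notebooks in the archive may tokenize and structure the data differently.

```python
import re

# Illustrative pattern for the {HEAD ~ ...} / {TAIL ~ ...} markers (an assumption).
MARKER = re.compile(r"\{(HEAD|TAIL)\s*~\s*([^{}]*?)\s*\}")


def label_tokens(marked, relation):
    """Return (tokens, labels) with one of HEAD, REL, TAIL, OTH per token."""
    tokens, labels = [], []
    cursor = 0
    for match in MARKER.finditer(marked):
        # Text between markers is provisionally labelled OTH.
        for word in marked[cursor:match.start()].split():
            tokens.append(word)
            labels.append("OTH")
        # Words inside a marker take the marker's role as their label.
        for word in match.group(2).split():
            tokens.append(word)
            labels.append(match.group(1))
        cursor = match.end()
    for word in marked[cursor:].split():
        tokens.append(word)
        labels.append("OTH")

    # Relabel the first run of OTH tokens that spells out the relation as REL.
    rel_words = relation.split()
    n = len(rel_words)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == rel_words and all(l == "OTH" for l in labels[i:i + n]):
            labels[i:i + n] = ["REL"] * n
            break
    return tokens, labels


def seq2seq_pair(sentence, relation):
    """Return a text-to-text example; the key names are hypothetical."""
    return {"input": sentence, "target": relation}


example = (
    "{HEAD ~ This product}, which can be applied by "
    "{TAIL ~ spray methods}, contains fungicides, preservatives"
)
tokens, labels = label_tokens(example, "applied by")
pair = seq2seq_pair(example, "applied by")
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

In the token-classification view the model predicts one label per token and the relation is read off the "REL" tokens, whereas in the Seq2Seq view the relation string itself is the training target.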

For each of the above approaches, we created a separate folder containing a "training.ipynb" notebook with embedded instructions. Each folder also includes the datasets processed as required by the corresponding task. The general version of the dataset, with 375,084 examples, is available on the Hugging Face Hub - https://huggingface.co/datasets/siddharthl1293/engineering_design_facts
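As a minimal sketch, the general dataset could be fetched with the Hugging Face `datasets` library as shown below. The split names and column names are not stated here, so the sketch only reports how many rows each split holds; the helper names `load_design_facts` and `summarize` are hypothetical.

```python
# Repository id taken from the Hugging Face Hub URL above.
DATASET_ID = "siddharthl1293/engineering_design_facts"


def load_design_facts(split=None):
    """Download the dataset from the Hub (requires `pip install datasets`)."""
    # Imported lazily so that this module can be imported without the library.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def summarize(dataset_dict):
    """Map each split name to its number of examples."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Example usage (needs network access, so it is left commented out):
# facts = load_design_facts()
# print(summarize(facts))
```

Inspecting the returned object before training is advisable, since the exact fields depend on how the dataset was uploaded.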

The dataset and the training procedure are also described in the following paper.

Siddharth, L., Luo, J., 2024. Retrieval-Augmented Generation using Engineering Design Knowledge. arXiv (cs.CL) https://arxiv.org/abs/2307.06985.

Files

design-knowledge-extraction.zip (179.4 MB)
md5:c8f5af644c109614746a759632ce4782
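The MD5 checksum listed for the archive can be used to confirm that a download completed without corruption. The following is a minimal sketch using Python's standard `hashlib`; the local path passed to the function is an assumption about where the file was saved.

```python
import hashlib

# Checksum as listed for design-knowledge-extraction.zip on this page.
EXPECTED_MD5 = "c8f5af644c109614746a759632ce4782"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so a large archive is never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected


# Example usage, assuming the archive was saved in the working directory:
# print(archive_is_intact("design-knowledge-extraction.zip"))
```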

Additional details

Related works

Is described by
Preprint: arXiv:2307.06985 (arXiv)

Software

Programming language
Python
Development Status
Active