Published April 19, 2023 | Version v1
Presentation Open

MatKG: The largest knowledge graph in Material Science

  • 1. MIT

Description

Information on materials is available through several platforms: online computational data repositories such as the Materials Project, OQMD, and NOMAD; scientific literature databases such as Elsevier, arXiv, and Web of Science; repositories of experimental data such as ICSD and the NIST databases; published textbooks, handbooks, and industrial datasheets; and, more recently, high-throughput databases. This data can be structured (e.g., the indexed, tabular data in online databases) or unstructured, as in textbooks and scientific literature. Further, while some data is available as numerical or categorical attributes ('melting point', 'space group'), most is dispersed within large chunks of text or available in images. The presence of multiple streams and modes of data, along with the sheer amount of information already available and continuously being generated, taxes human cognition and calls for automated systems that identify, catalog, link, and query information on their own.

 

Recently, Knowledge Graphs (KGs) have emerged as a tool for integrating data and relational ontologies through versatile graph databases. KGs are the industry standard for data retrieval and organization, as demonstrated by their use in Google Search, social media sites such as Facebook and LinkedIn, and organizations with large data inventories such as IKEA and NASA. In materials science as well, several domain-specific ontologies and property graphs have been proposed, though the use of knowledge graphs as a relational database tool is not yet common.

 

Here we present a knowledge graph in the field of materials science comprising over 80,000 unique entities and over 5 million statements, where each statement is an (entity, relation, entity) triple. The KG covers several topical fields such as inorganic oxides, functional materials, battery materials, metals and alloys, polymers, cements, high-entropy alloys, biomaterials, and catalysts. The triples are generated autonomously through data-driven natural language processing pipelines and extracted from a corpus of around 4 million published scientific articles. Several types of informational entities, such as materials, properties, application areas, synthesis information, and characterization methods, are integrated through a hierarchical ontological schema: the base relations are extracted through statistical correlations, and higher-level ontologies are then appended to them. The KG is therefore heterogeneous and contains multiple relations between entities. It is shown that a bipartite projection of the base KG leads to comprehensive relational graphs that link materials to their chief attributes and applications and help answer questions such as "what are key attributes of battery materials?" without human intervention. We use a graph-neural-network-based representation learning method to learn embeddings for entities and their relations. These embeddings translate the graph data structure into a high-dimensional mathematical space in which semantic relations between entities can be formulated as algebraic operations. This can be used not only to query the KG but also to predict new linkages between existing entities, thereby providing a versatile, data-informed tool for materials development and discovery. Key aspects of the knowledge graph, including entity extraction and link prediction, are validated and compared with benchmarks where available.
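As a rough illustration of the triple representation and the kind of question quoted above, the following minimal sketch stores a handful of (entity, relation, entity) triples and ranks the properties shared by materials with a given application. Everything in it is an assumption made for illustration: the entity and relation names are toy data rather than MatKG content, the helper names `project` and `key_attributes` are hypothetical, and "bipartite projection" is interpreted here simply as restricting the graph to edges between two entity types, which may differ from the construction used in the presentation.

```python
from collections import Counter, defaultdict

# Toy (head, relation, tail) triples; all names are illustrative, not MatKG data.
TRIPLES = [
    ("LiCoO2", "has_property", "specific capacity"),
    ("LiCoO2", "has_application", "battery cathode"),
    ("LiFePO4", "has_property", "specific capacity"),
    ("LiFePO4", "has_property", "thermal stability"),
    ("LiFePO4", "has_application", "battery cathode"),
    ("TiO2", "has_property", "band gap"),
    ("TiO2", "has_application", "photocatalysis"),
]


def project(triples, relation):
    """Keep only edges of one relation type, giving a bipartite head -> tails map."""
    graph = defaultdict(set)
    for head, rel, tail in triples:
        if rel == relation:
            graph[head].add(tail)
    return graph


def key_attributes(triples, application):
    """Rank properties by how many materials with this application carry them."""
    props = project(triples, "has_property")
    apps = project(triples, "has_application")
    counts = Counter()
    for material, uses in apps.items():
        if application in uses:
            counts.update(props.get(material, ()))
    return counts.most_common()


print(key_attributes(TRIPLES, "battery cathode"))
# [('specific capacity', 2), ('thermal stability', 1)]
```

In a graph of millions of statements the same query would normally run against a graph database rather than in-memory dictionaries; the sketch only shows the shape of the data and of the question.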
Finally, it is shown that the learned embeddings encode physical and chemical information and therefore lend themselves readily to machine learning.
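The idea that semantic relations become algebraic operations in an embedding space can be sketched with a translation-style scoring rule, in which a statement (head, relation, tail) is plausible when head + relation lies close to tail. This is a stand-in chosen for simplicity, not the graph neural network method described above. The vectors below are hand-picked two-dimensional toy values rather than learned embeddings, and the names `score` and `predict_tail` are hypothetical.

```python
import math

# Hand-picked 2-D toy vectors, NOT learned embeddings and NOT MatKG data.
ENTITY = {
    "LiCoO2": (1.0, 0.0),
    "TiO2": (0.0, 1.0),
    "battery cathode": (1.0, 2.0),
    "photocatalysis": (0.0, 3.0),
}
RELATION = {"has_application": (0.0, 2.0)}


def score(head, relation, tail):
    """Translation-style score: negative distance ||h + r - t|| (higher is better)."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    translated = [a + b for a, b in zip(h, r)]
    return -math.dist(translated, t)


def predict_tail(head, relation, candidates):
    """Link prediction as ranking: return the best-scoring candidate tail."""
    return max(candidates, key=lambda tail: score(head, relation, tail))


candidates = ["battery cathode", "photocatalysis"]
print(predict_tail("LiCoO2", "has_application", candidates))  # battery cathode
print(predict_tail("TiO2", "has_application", candidates))    # photocatalysis
```

Under this assumed scoring rule, querying the graph and predicting a missing link are the same operation: both rank candidate entities by how well they complete a triple.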

Files

Files (3.2 MB)

md5:8bbf1a3017076a5988931361c5a95a07