Towards Ontology Reshaping for KG Generation with User-in-the-Loop: Applied to Bosch Welding

Knowledge graphs (KGs) are used in a wide range of applications. Automating KG generation is highly desirable due to the volume and variety of data in industries. One important approach to KG generation is to map the raw data to a given KG schema, namely a domain ontology, and to construct the entities and properties according to this ontology. However, automatically obtaining such an ontology is demanding and existing solutions are often not satisfactory. An important challenge is the trade-off between two principles of ontology engineering: knowledge-orientation and data-orientation. The former prescribes that an ontology should model the general knowledge of a domain, while the latter emphasises reflecting the data specificities to ensure good usability. We address this challenge with our method of ontology reshaping, which automates the process of converting a given domain ontology into a smaller ontology that serves as the KG schema. The domain ontology can thus be designed to be knowledge-oriented, while the KG schema covers the data specificities. In addition, our approach allows the option of including user preferences in the loop. We demonstrate our ongoing research on ontology reshaping and present an evaluation on real industrial data, with promising results.


INTRODUCTION
Knowledge graphs (KGs) allow structuring information in terms of nodes and edges, where the nodes represent entities, e.g., welding machines, and edges either connect entities and thus represent relationships between them, e.g., by assigning software systems to concrete welding machines, or connect entities to their data values, e.g., by assigning the weight and price to welding machines [14]. In the context of Industry 4.0 [18] and the Internet of Things [32], KGs have been successfully used in a wide range of applications and industrial sectors in well-known production companies such as Bosch [19,44,62], Siemens [15,25,26], Festo [10], Equinor [22,23,39], etc.
An important challenge in scaling the use of KGs in industry is to facilitate the automation of KG construction from industrial data, due to its complexity and variety. Indeed, a common approach to KG construction is to construct entities and properties by relying on a given knowledge graph schema, or ontology. The KG schema defines upper-level concepts of the data and consists of classes and properties. A classical domain ontology is a formal specification of a shared conceptualisation of knowledge [13,40] and it reflects experts' knowledge of upper-level concepts, specific domains, or applications.
However, since domain ontologies are knowledge-oriented, they do not focus on the specificities of arbitrary datasets. In industry, data come in a wide variety. Many attributes exist in some datasets but not in others, and many terms in the domain ontologies do not exist in all datasets. Thus, directly using domain ontologies as KG schemata can naturally lead to a number of issues. Indeed, the resulting KG can contain a high number of blank nodes, it may suffer from information loss and incomplete data coverage, and it may not be user-friendly enough for applications.
Figure 1: The domain ontology (partially shown in a) reflects the knowledge; the KG schema (partially shown in b) needs to reflect raw data specificities and usability. Blue boxes: classes that can be mapped to attributes in the raw data; black boxes: classes that cannot be found in the raw data.

Consider an example from automated welding [60,61], an essential manufacturing process for producing the hundreds of thousands of car bodies built in car factories every day. The domain ontology (Figure 1a) shows that, in the welding process, the welding operations are operated under welding software systems, which have measurement modules. These modules measure a sensor signal and save it as an operation curve, namely the current curve. The current curve is stored as an array and has a mean value. This presentation was created in close collaboration with the domain (welding) experts. It is thus intuitive for humans to understand the domain knowledge. Yet, the real datasets differ from this mental model. In most welding datasets, there exists exactly one current mean value and one current array value for each welding operation (highlighted with blue boxes), while one welding software system is responsible for a huge group of welding operations, and there exists no attribute for the measurement module or the operation curve current. If the domain ontology in Figure 1a is used for KG generation directly, the KG will contain many blank or dummy nodes generated by the classes MeasurementModule and OperationCurveCurrent. Furthermore, the user will not be able to find the one-to-one correspondences between WeldingOperation, CurrentMeanValue and CurrentArrayValue. Meanwhile, the deep structure of the domain ontology makes applications over the KG (e.g. query-based analytics) excessively complicated. Instead, the ideal schema for KG generation would be Figure 1b, in which the ontology is significantly simplified. The generated KGs will have zero dummy nodes and a much simpler structure. The KGs will be more efficient to generate, and easier for users to understand and use.
The above-mentioned example exhibits the challenge of a trade-off between two principles of ontology engineering: (1) knowledge-orientation, which reflects the generality of domains, and (2) data-orientation, which aims at KG generation well-suited to the specificities of the data. The former principle focuses on conveying the meaning, or knowledge, of a given domain, while the latter principle emphasises the usability of ontologies for applications [16].
To address this challenge, we propose our ontology reshaping method that computes KG schemata from domain ontologies; these schemata in turn allow for the generation of KGs of high quality. In this way, we circumvent the trade-off problem and can use both ontologies for their different purposes. Our contributions are as follows:
• We propose an algorithm for ontology reshaping, which converts domain ontologies that reflect general knowledge into smaller KG schemata that cover data specificities, addressing the issue of sparse KGs with dummy nodes.
• We design and conduct experiments for a proof-of-concept evaluation. We derive requirements for the use case, and design performance metrics for ontology reshaping.

PRELIMINARIES
Problem Formulation. Intuitively, ontology reshaping is about computing a smaller ontology from a larger one by taking a subset of its classes and possibly redefining some of its axioms based on some external heuristics as well as on some notion of optimality. In this work we do not aim at developing a formal theory of ontology reshaping, but rather at providing intuitions behind this problem and preliminary solutions that account for a particular type of reshaping. More precisely, in this work we consider reshaping where (i) the re-definition of axioms essentially "re-assigns" properties from some classes to others, (ii) the external heuristics is the user's input, and (iii) the notion of optimality is the coverage of the data to which the reshaped ontology should be mapped. In other words, such reshaping can be seen as S = OntoReshape(O, D, M, U), where OntoReshape is an algorithm that takes in the inputs and outputs a KG schema S. The inputs are: O, a larger (domain) ontology; D, the raw data that O is related to, given in the form of relational tables; and M, a set of mappings that relate the attributes and table names in the raw data D to the classes in O. Apart from that, we need some more information U given by the users. In our case, this includes two parts: (1) the users need to point out the most important entity (the most important class in O), named the main class (MC); (2) some more optional information.
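For concreteness, the following is a minimal sketch of how the inputs O, D, M, U and the output S of the formulation above could be represented in Python; all type and field names are illustrative assumptions, not the actual data structures of our implementation.

```python
# Hypothetical containers for S = OntoReshape(O, D, M, U); names are illustrative.
from dataclasses import dataclass, field

@dataclass
class ReshapingInput:
    # O: the domain ontology, reduced here to its class names and relations
    classes: set[str]
    relations: set[tuple[str, str]]            # (subject class, object class)
    # D: the raw data as relational tables {table name: [attribute names]}
    tables: dict[str, list[str]]
    # M: mapping from table/attribute names to class names in O
    mapping: dict[str, str]
    # U: user information; the main class MC is mandatory, the rest is optional
    main_class: str = "WeldingOperation"
    user_hints: dict[str, str] = field(default_factory=dict)

@dataclass
class KGSchema:
    # S: the reshaped, smaller ontology used as the KG schema
    classes: set[str] = field(default_factory=set)
    properties: dict[str, set[str]] = field(default_factory=dict)   # class -> data properties
    relations: set[tuple[str, str]] = field(default_factory=set)
```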
Requirements of KG. We derive the requirements as follows:
• R1 Completeness. The knowledge graph should be able to completely represent the raw data.
• R2 Efficiency. The generation of the KG schema and the KG should be efficient both in computational time and in storage space.
• R3 Simplicity. The KG schema should not be over-complicated for understanding and use: (1) the generated KG should not contain much redundant information, e.g. dummy entities that are generated solely because of the schema but have no correspondence in the raw data; (2) the KG should not have complicated structures.

OUR METHOD: ONTOLOGY RESHAPING
Algorithm Explanation. The intuition of the algorithm in five steps:
• Step 1: Initialise the KG schema S with MC, and two sets from the classes in O: potential classes and potential properties.
• Step 2: Add classes to S that can be mapped from table names in the raw data D.
• Step 3: Identify entities among the potential properties by key words and (optional) user information.
• Step 4: Connect the classes in S and attach properties to classes according to O.
• Step 5: Connect the remaining classes in S to MC, or according to user information (optional).

Step 1. Initialisation. We start our algorithm with initialisation (Line 1). S is initialised with the main class MC. Then we map the classes in O to attributes in D with M: classes that can be mapped to attributes in D are placed in the set of potential properties, and the remaining classes are placed in the set of potential classes.
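A hedged sketch of Step 1, assuming the ontology classes and the mapping M are available as plain Python collections; the function and variable names are illustrative only.

```python
def initialise(classes, mapping, main_class):
    """Step 1 (sketch): split the ontology classes into potential properties
    (classes mapped to raw-data attributes via M) and potential classes."""
    schema_classes = {main_class}                  # S starts with MC
    mapped = set(mapping.values())                 # classes reachable from the raw data
    potential_properties = {c for c in classes if c in mapped}
    potential_classes = {c for c in classes if c not in mapped}
    return schema_classes, potential_classes, potential_properties
```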
Step 2. Class addition with table names. The raw data D is in the form of relational tables. If a table's name can be mapped via M to a class in O, we add the mapped class to S (Lines 2 to 3).
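Step 2 in a nutshell, as an illustrative sketch under the same assumed data structures: table names that M can map to ontology classes are promoted to classes of the KG schema S.

```python
def add_classes_from_table_names(schema_classes, tables, mapping):
    """Step 2 (sketch): add classes that correspond to raw-data table names."""
    for table_name in tables:
        mapped_class = mapping.get(table_name)
        if mapped_class is not None:
            schema_classes.add(mapped_class)
    return schema_classes
```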
Step 3. Entity identification by key words. In Step 3 we identify the classes hidden among the potential properties by key words and (optional) user information. We map the attributes A stored in the raw data via M to the potential properties. Indeed, if potential properties are named with suffixes such as 'ID' or 'NAME', e.g., WeldingProgramID, WeldingMachineName, we add the corresponding classes in O, e.g., WeldingProgram, WeldingMachine, to S. Besides, we add the entities specified by the user information to S (Lines 4 to 12).
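A sketch of Step 3 with assumed keyword heuristics: attributes whose mapped property names end with 'ID' or 'NAME' indicate entities, so the stripped class name (e.g. WeldingProgram for WeldingProgramID) is added to S; the exact keyword list is an assumption.

```python
def identify_entities_by_keywords(schema_classes, attributes, mapping,
                                  keywords=("ID", "NAME"), user_entities=()):
    """Step 3 (sketch): promote keyword-suffixed properties to entity classes."""
    for attribute in attributes:
        prop = mapping.get(attribute, attribute)
        for kw in keywords:
            if prop.upper().endswith(kw):
                schema_classes.add(prop[: -len(kw)])   # WeldingProgramID -> WeldingProgram
                break
    schema_classes.update(user_entities)               # optional user-specified entities
    return schema_classes
```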
Step 4. Classes connection. In Step 4 (in Algorithm 2) we connect the classes in S and attach properties to classes according to O. It takes 4 inputs: the KG schema S, the domain ontology O, the main class MC, and the user information U. We start Algorithm 2 with initialisation (Line 1), where a working set is initialised by copying the classes in S. Then we iterate over all permutations of class pairs (c1, c2) in S (Line 2). For each pair (c1, c2), if the relation (c1, c2) already exists in S, we continue to the next pair (Lines 3 to 4). Otherwise, if a direct relation (c1, c2) exists in O but not in S (Lines 5 to 6), this relation is added to S. In case the ontology contains only an indirect relation between the two classes c1 and c2, the algorithm first checks whether user information exists and, if so, adds a relation between (c1, c2) (Lines 7 to 10); if not, both classes are related to the main class, i.e. (MC, c1) and (MC, c2) (Lines 11 to 12).
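A simplified sketch of Step 4, assuming the relations of O are given as a set of directed class pairs; the fallback behaviour follows the description above (user hints are consulted first, otherwise both classes are connected to the main class MC). This is not the full Algorithm 2, only its core loop.

```python
from itertools import permutations

def connect_classes(schema_classes, schema_relations, onto_relations,
                    main_class, user_relations=frozenset()):
    """Step 4 (sketch): connect class pairs in S using relations from O."""
    for c1, c2 in permutations(schema_classes, 2):
        if (c1, c2) in schema_relations:
            continue                                   # already connected in S
        if (c1, c2) in onto_relations:
            schema_relations.add((c1, c2))             # reuse the direct relation from O
        elif (c1, c2) in user_relations:
            schema_relations.add((c1, c2))             # user-preferred connection
        else:
            # no direct relation: fall back to connecting both classes to MC
            if c1 != main_class:
                schema_relations.add((main_class, c1))
            if c2 != main_class:
                schema_relations.add((main_class, c2))
    return schema_relations
```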
Step 5. Classes connection with UserInfo. In Step 5 (in Algorithm 2) we connect the classes in S that are not yet connected to any other class to MC, or according to UserInfo (optional) (Lines 18 to 25).

EVALUATION
Experiment Setting. The domain ontology O is an OWL 2 ontology and can be expressed in the Description Logic SHI(D). With its 1249 axioms, which contain 147 classes, 145 object properties, and 132 datatype properties, it models the general knowledge of a type of fully automated welding process, resistance spot welding [47,59,62,66]. The industrial dataset D is collected from welding production lines in a factory in Germany. The raw form of D is varied, including txt, csv, RUI (a special format of Bosch time series data), SQL databases, etc. The data are transformed into relational tables. We selected a subset of the huge amount of data to evaluate our method. The transformed data contain one huge table of welding operation records and a series of tables of welding sensor measurements. In total, there are about 4.315 million records. These data account for 1000 welding operations, estimated to relate to 100 cars.
The mapping M consists of three mappings: (1) a meta mapping that gives the correspondence between the table names in D and the class names in O; (2) an operation mapping that relates the attribute names of the welding operation records to the classes in O; (3) a sensor mapping that annotates the attribute names of the welding sensor measurements with the classes in O.
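An illustrative, hypothetical fragment of the three mappings in M, written as plain dictionaries; the real table, attribute and class names at Bosch may differ.

```python
# Hypothetical fragment of M; keys and values are assumptions for illustration.
meta_mapping = {
    "welding_operation_records": "WeldingOperation",   # table name -> class in O
    "current_measurements": "OperationCurveCurrent",
}
operation_mapping = {
    "prog_id": "WeldingProgramID",                      # attribute -> class in O
    "machine_name": "WeldingMachineName",
    "current_mean": "CurrentMeanValue",
}
sensor_mapping = {
    "current_array": "CurrentArrayValue",               # sensor attribute -> class in O
}
```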
The UserInfo contains two parts: (1) mandatory: the users need to point out the main class MC (the main entity), which is the welding operation; (2) optional: other information, e.g. the attribute names corresponding to other possible entity names and their related properties.
Experiment Design. To test whether our ontology reshaping algorithm can perform well on an arbitrary dataset, we randomly sub-sample the dataset into 6 sub-datasets (Sets 1-6 in Table 1). Each set contains a subset of the attributes of D, reflecting a different data complexity. The numbers of attributes in the subsets increase by ten each time, from 10 to 60. We repeat the sub-sampling for each subset size 10 times to reduce the effect of randomness, and report the mean values of the evaluation metrics (introduced in the next paragraph).
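A small sketch of the sub-sampling scheme just described, under the assumption that the attributes are sampled uniformly at random; seeds and function names are illustrative.

```python
import random

def subsample_attributes(all_attributes, sizes=range(10, 61, 10), repeats=10, seed=0):
    """Yield (size, attribute sample) pairs: 10, 20, ..., 60 attributes,
    each setting repeated 10 times; metrics are then averaged per size."""
    rng = random.Random(seed)
    for size in sizes:
        for _ in range(repeats):
            yield size, rng.sample(list(all_attributes), size)
```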
We compare two approaches: the baseline of KG generation without ontology reshaping, and KG generation with our method of ontology reshaping (OntoReshape). The baseline is a naive approach that selects a subset of the domain ontology and uses it as the KG schema. The subset includes (1) classes that have a correspondence to attributes in the raw data, and (2) the classes that connect the classes in (1).
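A hedged sketch of one possible reading of this baseline, assuming the connecting classes are taken from shortest paths in the ontology graph; the actual baseline may select the connecting classes differently.

```python
import networkx as nx

def baseline_schema(onto_relations, mapped_classes):
    """Baseline (sketch): keep classes mapped to raw-data attributes plus all
    classes on shortest ontology paths that connect them."""
    graph = nx.Graph(list(onto_relations))      # undirected view of O's relations
    graph.add_nodes_from(mapped_classes)        # keep isolated classes from crashing path search
    keep = set(mapped_classes)
    anchors = list(mapped_classes)
    for i, src in enumerate(anchors):
        for dst in anchors[i + 1:]:
            if nx.has_path(graph, src, dst):
                keep.update(nx.shortest_path(graph, src, dst))
    return keep
```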
Evaluation Metrics. We use three sets of metrics to evaluate the fulfilment of the three requirements in Sect. 2: completeness, efficiency and simplicity. To evaluate completeness, we use data coverage, the percentage of attributes in the raw data covered by the generated KG. Efficiency measures the ability of the approaches to use the least time for generating the KG from raw data, and to take the least storage space and the fewest entities and properties to represent the same information of the raw data. The efficiency metrics thus include time cost, storage space, number of classes in the KG schema, and number of entities and properties (including object properties and data properties) in the generated KG. Simplicity measures the performance of the approaches in representing the same raw data with the simplest KG. These metrics include the number of dummy entities in the generated KG (namely the entities that cannot be mapped to attributes in the raw data, but are generated solely because of the KG schema), and two depths (a depth characterises the number of edges needed to connect two nodes in the KG via the shortest path): (1) the root-to-leaf depth measures the "depth" to find the furthest entity starting from the main entity; (2) the global depth is the largest depth across the whole KG.
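A sketch of the two depth metrics, assuming the generated KG is available as an undirected networkx graph over its entities: the root-to-leaf depth is the longest shortest path starting from the main entity, and the global depth is the longest shortest path between any two connected nodes.

```python
import networkx as nx

def root_to_leaf_depth(kg: nx.Graph, main_entity) -> int:
    """Longest shortest path from the main entity to any reachable entity."""
    lengths = nx.single_source_shortest_path_length(kg, main_entity)
    return max(lengths.values())

def global_depth(kg: nx.Graph) -> int:
    """Largest shortest-path length between any pair of connected nodes."""
    return max(
        max(lengths.values())
        for _, lengths in nx.all_pairs_shortest_path_length(kg)
    )
```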
Results and Discussion. In Table 1, the six subsets with different numbers of randomly sub-sampled attributes (each repeated 10 times) are listed as columns. The data complexity increases from Set 1 to Set 6. Both approaches represent the raw data 100%, so data coverage is not displayed in the table. We observe that OntoReshape outperforms the baseline in terms of time cost, storage space and the other efficiency metrics; in particular, KG generation with OntoReshape is 7 to 8 times faster than with the baseline. In terms of simplicity, OntoReshape also outperforms the baseline significantly. The baseline generates a huge number of dummy entities, while OntoReshape generates zero dummy entities, drastically reducing information redundancy. The two depths of the KGs generated by OntoReshape are also only a half or a third of those of the baseline. The KGs generated by OntoReshape are thus much simpler and easier for users to understand. With an average root-to-leaf depth of only 1.2, users need fewer query steps to reach the deepest entities in the KG generated by OntoReshape.

RELATED WORK
Knowledge graphs have received much attention in industries [11,15,37,55]. KGs provide semantically structured information that can be interpreted by computing machines [52,67], and an efficient foundation for standardised ways of data retrieval and analytics to support data-driven methods. Data-driven methods have been widely used in industries [33,34,57,58], especially machine learning [48,56,63-65]. The problem of transforming a bigger ontology into a smaller ontology of the same domain is often referred to as ontology modularisation [4-7,29] and ontology summarisation [35]. Most of these works focus on the problem of selecting a subset of the ontology that is interesting for the users [27], but they still cannot avoid dummy entities. Works on ontology reengineering [45,46] also address the reuse and adjustment of ontologies, but they do not focus on automatically creating an ontology that reflects data specificities.

CONCLUSION AND OUTLOOK
This work presents our ongoing research on ontology reshaping, which generates a small ontology covering data specificities from a more complex domain ontology reflecting general knowledge. The current approach cannot fully retain the semantics of the domain ontology. This can be addressed by uniform interpolation, also known as forgetting [28,31], which we plan to study in the future. Furthermore, we plan to compare our work with the body of research, to which we actively contributed, on ontology evolution [12,53,54], knowledge modelling and summarisation [8,9,21,30,49-51], and ontology extraction or bootstrapping [17,36], and to investigate how to extend our work to account for ontology aggregation techniques [3,24], and to develop end-user interfaces for exploration and improvement of reshaped ontologies [1,2,20,38,41-43].