MalayIK: An Ontological Approach to Knowledge Transformation in Malay Unstructured Documents

ABSTRACT


INTRODUCTION
The difficulty of defining knowledge in unstructured documents stems from a paradox: knowledge resides in a person's mind, yet it has to be captured, stored, and reported. Philosophers therefore classify knowledge into knowing-that and knowing-how. Knowing-that is factual: data are stored in databases, and facts can be recalled, processed, and disseminated. Knowing-how is actionable: it is the ability to do something, turning data into information and, in turn, into knowledge [1].
However, structured data represent only a small part of an organization's overall knowledge; the major part of this knowledge is incorporated in textual documents. For example, available business data are captured in unstructured text files, e.g. memoranda and journal articles that are available electronically [2][3][4]. A large portion of the available information does not appear in structured databases but rather in collections of text articles drawn from various sources [5]. Thus, the main concern here is to extract knowledge from the vast amount of available textual documents.
The proposed ontological approach to knowledge transformation is based on the interrogative structure [6] and conceptual modeling [7][8][9][10][11] approaches. To transform the knowledge extracted from an unstructured document, a "deep-level understanding" of complete sentences is obtained by identifying, organizing, and structuring the information into an interrogative structured form. Here, "deep-level understanding" of a complete sentence refers to understanding a group of words that, when written down, begins with a capital letter and ends with a full stop, question mark, or exclamation mark.
The interrogative approach to knowledge extraction relies on data and conceptual modeling, as well as context and knowledge representation. Knowledge extraction supports the creation of (1) knowledge, (2) relationships, (3) contextual information, and (4) a representation of common languages. It aids the transformation of knowledge extracted from an unstructured document into an interrogative structured form. The first issue to address is the need for a mechanism that identifies knowledge in the source document so that it can be extracted. This is the role of interrogative knowledge identification, which identifies the type of document content by separating the text into knowledge, information, or data and unifying it with the personal components of values and beliefs. To identify knowledge, the approach answers interrogative questions from the text of the unstructured document.
The interrogative contextual information is derived by incorporating context and additional information annotation with a context key facility. Context is an abstraction of the context factors, which are represented as concepts [12]. It is further exploited by [13] as contextual information, where information entered into the computer is tagged with context keys that facilitate future retrieval. Context may be any information that can be used to characterize the situation of an entity, such as a person, place, or object [14]. Accordingly, the interrogative contextual information is used to support the process of making sense of information as knowledge while maintaining the meaning of the information. The aim is to obtain a consistent interpretation of the knowledge by classifying the main point of the unstructured document interrogatively.
The rationale for incorporating personal components into interrogative knowledge identification is as follows. According to [15], personal components have a powerful impact on organizational knowledge. [16] assert that knowledge is a fluid mix of framed experience, values, contextual information, and expert insight; it originates in the mind of the knower and determines a large part of what the knower sees, absorbs, and concludes from observations. [17] stated that knowledge is a private and personal thing, which is intuitive and strongly linked to the user's values and beliefs. When documents are transformed manually, values are embedded because humans read the documents, extract the values of existing fields, and then enter them into a user interface [18].

RESEARCH METHOD
This research proposes the MalayIK-Ontology model, which is designed to transform knowledge extracted from Malay unstructured documents into an interrogative structured form, based on interrogative knowledge identification as well as interrogative knowledge organization and structuring. The first step is to convert the unstructured documents into plain-text format. The second step is to invoke the lexicon identifier, which applies the lexicon interrogative analysis matching rules of a specific corpus, in this research the MalayIK-Corpus. The lexicon identifier identifies and extracts knowledge from each complete sentence in the Malay unstructured document. It also extracts interrogative lexical constructs from the individual unstructured document.
Next, the third step is to invoke the object recognizer, which uses the matching rules of object interrogative analysis to extract the ontological constructs from the interrogative lexical constructs. The object recognizer populates and maps the objects using ontology engineering, a knowledge-structuring mechanism that represents the concepts and relationships of an abstract model of how people think about things in the world. Finally, the fourth step is to populate the database schema by transforming the ontological constructs through a connection between the ontology model and the object-relationship model.
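The four steps above can be read as a simple pipeline. The following Python sketch is illustrative only: the function names, the toy two-entry corpus, and the use of a dictionary in place of database tables are all assumptions made for this example and are not taken from the MalayIK-Ontology implementation.

```python
import re

# Hypothetical toy corpus: lexicon -> interrogative element (assumed entries).
TOY_CORPUS = {"penyelidik": "siapa", "semalam": "bila"}

def prepare(raw):
    """Step 1: reduce the document to plain text (here: collapse whitespace)."""
    return " ".join(raw.split())

def identify_lexicon(text):
    """Step 2: tag each token with an interrogative element, if the corpus has one."""
    tokens = re.findall(r"\w+", text.lower())
    return [(tok, TOY_CORPUS.get(tok)) for tok in tokens]

def recognize_objects(constructs):
    """Step 3: keep tagged tokens as (concept, value) ontological constructs."""
    return [(tag, tok) for tok, tag in constructs if tag is not None]

def populate_database(objects):
    """Step 4: group values by concept, standing in for database tables."""
    tables = {}
    for concept, value in objects:
        tables.setdefault(concept, []).append(value)
    return tables

def run_pipeline(raw):
    return populate_database(recognize_objects(identify_lexicon(prepare(raw))))
```

Calling run_pipeline on a short sentence such as "Penyelidik tiba semalam." groups 'penyelidik' under 'siapa' and 'semalam' under 'bila'; untagged words are dropped in this simplified version.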
The MalayIK-Ontology architecture (Figure 1) consists of three main components: IKL-Identifier, IKO-Recognizer, and IKS-OntologyDB. IKL-Identifier attempts to answer the question within the text interrogatively, IKO-Recognizer maps the object interrogative analysis rules to the ontology, and IKS-OntologyDB connects the ontology and the object-relationship model so that the result can be populated into a database.
The Protégé-Frames editor [19] is adopted in this research to structure and capture knowledge. It provides a full-fledged user interface and knowledge server to support users in constructing and storing frame-based domain ontologies, customizing data entry forms, and entering instance data. An object-based recognizer that uses the interrogative knowledge approach, the IKO-Recognizer [20], [21], is also adopted. In this research, the IKO-Recognizer maps the object interrogative analysis rules to the ontology.

Figure 1. MalayIK-Ontology architecture. IKL-Identifier (Interrogative Knowledge Lexicon Identifier): answers the question within the text interrogatively. IKO-Recognizer (Interrogative Knowledge Object Recognizer): maps the object interrogative analysis rule with the ontology. IKS-OntologyDB (Interrogative Knowledge Structure): connects the ontology and the object-relationship model to populate the database. Preparation: conversion of the file.

IKL-Identifier
The Interrogative Knowledge Lexicon Identifier (IKL-Identifier) is a lexicon identifier that uses lexicon interrogative analysis of 'apa' (what), 'siapa' (who), 'bila' (when), 'di mana' (where), 'mengapa' (why), and 'bagaimana' (how) to answer interrogative questions within the text of an unstructured document. The IKL-Identifier converts sentences into interrogative lexical constructs in the form of interrogative annotations.
The IKL-Identifier identifies the type of interrogative lexical construct in each complete sentence of the Malay unstructured document by separating the text into knowledge, information, or data. It is also responsible for tagging the interrogative lexical constructs with interrogative contextual information, which is important for interpreting the information as knowledge and for maintaining the meaning of the information in the Malay unstructured document. The processes of the IKL-Identifier, illustrated in Figure 2, are tokenization, lexicon interrogative analysis, interrogative contextual information tagging, and phrases constructor tagging.
During tokenization, the text of the unstructured document is segmented into sentences and tokenized into lexicons. The case format of each lexicon is then defined: digits, lower, upper, title, or toggle case. Each lexicon is assigned an automated serial number by line, sentence, and token number. Next, during lexicon interrogative analysis, each lexicon is analyzed against the lexicon interrogative analysis matching rules of the MalayIK-Corpus using the standard Data Manipulation Language (DML). DML is used to analyze and check the lexicon and, should it exist in the corpus, to insert it into the interrogative annotation as an interrogative lexical construct. Any new lexicon that is analyzed is inserted and defined in the MalayIK-Corpus.
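The tokenization step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sentence splitter (a split after '.', '?', or '!'), the word pattern, and the names case_format and tokenize are hypothetical and are not the paper's implementation.

```python
import re

def case_format(token):
    """Classify a token as digits, lower, upper, title, or toggle case."""
    if token.isdigit():
        return "digits"
    if token.islower():
        return "lower"
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    return "toggle"  # mixed case such as "iPhone"

def tokenize(text):
    """Return (line_no, sentence_no, token_no, token, case) records."""
    records = []
    sentence_no = 0
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Naive sentence split after ., ? or ! (an assumption for this sketch).
        for sentence in re.split(r"(?<=[.?!])\s+", line.strip()):
            if not sentence:
                continue
            sentence_no += 1
            words = re.findall(r"\w+", sentence)
            for token_no, token in enumerate(words, start=1):
                records.append(
                    (line_no, sentence_no, token_no, token, case_format(token))
                )
    return records
```

In this sketch the sentence number runs across the whole document, while the token number restarts in each sentence; the paper does not specify which numbering scheme is used.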
Finally, the interrogative lexical constructs are used during phrases constructor tagging. In this step, a phrase is constructed by putting words together based on their interrogative annotation. A phrase is a group of words that expresses an idea and forms a unit that is part of a sentence rather than a whole sentence. The words are divided according to their part of speech.
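One simple way to picture phrase construction is to merge adjacent words that carry the same interrogative annotation. The Python sketch below assumes exactly that grouping rule, which the paper does not state explicitly; the function name build_phrases and the sample tags are hypothetical.

```python
from itertools import groupby

def build_phrases(tagged):
    """Merge runs of adjacent tokens that share an interrogative annotation."""
    phrases = []
    # groupby only groups consecutive items, which is the behavior wanted here.
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        words = " ".join(token for token, _ in run)
        phrases.append((words, tag))
    return phrases

# Assumed annotations, for illustration only.
sample = [
    ("Ahmad", "siapa"), ("Abdullah", "siapa"),
    ("meninggal", "apa"), ("dunia", "apa"),
    ("semalam", "bila"),
]
phrases = build_phrases(sample)
```

With these assumed tags, the two name tokens become one 'siapa' phrase and 'meninggal dunia' becomes one 'apa' phrase.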

IKO-Recognizer
The IKO-Recognizer (Interrogative Knowledge Object Recognizer) matches and maps the object interrogative analysis rules of what/who/when/where/why/how to extract ontological constructs [20], [21]. There are two major processes: the object recognizer and the mapping process. First, the object recognizer applies object interrogative analysis rules, utilizing Object-Oriented Programming (OOP) to organize the program conceptually around its data (objects/concepts). In this process, a number of object interrogative analysis rules and the precondition language are predefined, but users may manually define additional rules. Second, the mapping process uses an ontology engineering approach, whereby objects created by the object recognizer are accessible as plug-ins in the ontology system.
The main processes in the IKO-Recognizer are the object interrogative analysis rules and the precondition language. The object interrogative analysis rules capitalize on the Java OOP class encapsulation approach and use the interrogative elements as the topmost classes of the objects. The structure and behaviors of the objects are implemented through (a) Struktur Kata Nama Am (Noun Structure) and (b) Struktur Leksikon Semantik (Semantic Lexicon Structure) in order to construct objects.
In the first structure, the Struktur Kata Nama Am (Noun Structure), the object interrogative analysis rule is defined by combining the structure and behavior of an object with its inheritance and the conceptual modifiers of one or more subclasses in a hierarchical structure. The structure and behavior of the object are defined by the 'kata_masuk' tagged earlier in the interrogative lexical construct. For example, the 'kata_masuk' for 'penyelidik' (investigator) is the grammatical information 'kata nama am'. It is a noun of type 'kata nama am orang', which refers to the concept 'Orang' (People) and has the interrogative element 'siapa' (who). Hence, it inherits the general behavior or properties of its parent 'siapa' (who).
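The inheritance described for 'penyelidik' can be illustrated with a small class hierarchy. The paper states that its rules use Java class encapsulation; Python is used here only for brevity, and the class names, attributes, and the describe method are assumptions made for this sketch.

```python
class Siapa:
    """Topmost class: the interrogative element 'siapa' (who)."""
    interrogative = "siapa"

    def __init__(self, lexicon):
        self.lexicon = lexicon

    def describe(self):
        # Show the chain lexicon -> concept class -> interrogative element.
        return f"{self.lexicon} -> {type(self).__name__} -> {self.interrogative}"


class Orang(Siapa):
    """Concept 'Orang' (People), reached through 'kata nama am orang'."""
    kata_masuk = "kata nama am"


# 'penyelidik' is an instance of Orang and inherits from Siapa.
penyelidik = Orang("penyelidik")
```

Because Orang is a subclass of Siapa, the object for 'penyelidik' answers the 'siapa' (who) question without declaring the interrogative element itself.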
In the second structure, the Struktur Leksikon Semantik (Semantic Lexicon Structure), the object interrogative analysis rule uses the corresponding structure and behavior of the semantic lexicon that defines the interrogative elements 'bila' (when), 'di mana' (where), 'mengapa' (why), and 'bagaimana' (how). The semantic lexicons of 'bila' (when) and 'di mana' (where) correspond to the phrase or proper noun constructed after the semantic lexicon of the interrogative element. The semantic lexicon of 'bila' (when) indicates the time at which an event takes place, whereas the semantic lexicon of 'di mana' (where) indicates the place that something is in, is coming from, or is going to.
In contrast, the semantic lexicons of 'mengapa' (why) and 'bagaimana' (how) correspond to the predicate that follows them. The reason is that the predicate describes the meaning of the semantic lexicon and gives information about the sentence. The semantic lexicon of 'mengapa' (why) concerns the reasons for something and introduces a relative clause, whereas the semantic lexicon of 'bagaimana' (how) explains the way in which something happens or is done and introduces a statement or fact. The objects of 'mengapa' (why) and 'bagaimana' (how) correspond to the definitions of their interrogative elements.
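The idea that each semantic lexicon governs the words that follow it can be sketched as below. This is a simplified illustration: the cue words 'pada', 'di', 'kerana', and 'dengan' are assumed examples of semantic lexicons and are not taken from the MalayIK-Corpus, and the sketch does not distinguish a phrase or proper noun from a predicate.

```python
# Hypothetical cue words (semantic lexicons) mapped to interrogative elements.
CUES = {
    "pada": "bila",         # on/at (time)
    "di": "di mana",        # in/at (place)
    "kerana": "mengapa",    # because
    "dengan": "bagaimana",  # with/by
}

def extract_after_cues(tokens):
    """Collect the words that follow each cue, up to the next cue or the end."""
    found = {}
    current = None
    for token in tokens:
        key = token.lower()
        if key in CUES:
            current = CUES[key]
            found[current] = []
        elif current is not None:
            found[current].append(token)
    return {element: " ".join(words) for element, words in found.items()}

sentence = "Beliau meninggal dunia pada 3 Mac 2010 di Kuala Lumpur kerana sakit jantung"
result = extract_after_cues(sentence.split())
```

A fuller version would apply the Noun Structure rules to the words after 'bila' and 'di mana', and treat the words after 'mengapa' and 'bagaimana' as a predicate.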

IKS-OntologyDB
The IKS-OntologyDB is the process of exporting the ontology structure into a database. The metadata describing the relationships, properties, attributes, and facets of the class structure is created in the ontology system and exported into Microsoft Access. The export uses the facility provided by Protégé, selecting the Export to HTML option. The knowledge-based system created in Protégé is thus transferred to the database management system via the HTML format, and the HTML information is used to create attributes and constraints in Microsoft Access. The table is created according to the definitions and declarations of the SQL schema and the ontology declarations of the Protégé knowledge-based system, as shown in Table 1.
The components of an ontology and a conceptual model are essentially equivalent in terms of concepts and entities, relationships, and attributes. The ontological constructs generated are mapped to the Object-oriented System Model (OSM) established by [7][8][9][10][11]. The OSM is used by the object-relationship model to describe the data of interest, which includes relationships, lexical appearances, and context keywords. It is also used to structure the identified and extracted data and to populate them into the database schema.
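The correspondence between an ontology class with its attributes and a database table with its columns can be sketched as follows. This is not the Protégé HTML export route described above: it is a minimal stand-in that uses SQLite instead of Microsoft Access, and the slot names, column types, surrogate id column, and the function create_table_sql are assumptions for illustration.

```python
import sqlite3

# Hypothetical slot definitions for one ontology class (types are assumed).
ONTOLOGY = {
    "DeceasedPerson": {"DeceasedName": "TEXT", "Age": "INTEGER", "DeathDate": "TEXT"},
}

def create_table_sql(class_name, slots):
    """Render one CREATE TABLE statement from a class and its slots."""
    columns = ", ".join(f"{name} {sql_type}" for name, sql_type in slots.items())
    return f"CREATE TABLE {class_name} (id INTEGER PRIMARY KEY, {columns})"

# Build the schema in an in-memory database.
conn = sqlite3.connect(":memory:")
for class_name, slots in ONTOLOGY.items():
    conn.execute(create_table_sql(class_name, slots))
```

Each class becomes a table and each slot becomes a column, which mirrors the equivalence between ontology components and conceptual-model components.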
In general, a colon (:) after an object-set name denotes that the object set is a specialization. For example, in the lexical object set Death Date: Date, Date describes the string patterns of the interrogative element 'bila' (when). For the lexical object sets Deceased Name: Name and Relative Name: Name, Name is matched by recognizing the string patterns of proper nouns of the interrogative element 'siapa' (who). The context keywords indicate the presence of an object in an object set. For example, 'kematian' (died) and 'meninggal dunia' (passed away) are context keywords for Death Date, and 'pengebumian' (buried) is a context keyword for Interment.
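The interplay of context keywords and string patterns can be sketched for the Death Date object set. The two context keywords come from the text above; the date pattern (day, capitalized month name, four-digit year) and the function name find_death_date are simplifying assumptions and do not reproduce the actual data frame.

```python
import re

# Context keywords from the text; the date pattern is an assumed, simplified one.
DEATH_KEYWORDS = ("kematian", "meninggal dunia")
DATE_PATTERN = re.compile(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b")

def find_death_date(sentence):
    """Return the first date in a sentence that also contains a death keyword."""
    lowered = sentence.lower()
    if not any(keyword in lowered for keyword in DEATH_KEYWORDS):
        return None  # no context keyword, so the date is not a Death Date
    match = DATE_PATTERN.search(sentence)
    return match.group(0) if match else None
```

The keyword check is what separates a Death Date from any other date in the document: a sentence about a birth contains a matching date string but no death keyword, so nothing is returned.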

MalayIK Corpus
This research uses the MalayIK-Ontology, which is based on the interrogative approach. While most text-processing approaches discussed in [22] use NLP or information extraction to select the set of keywords or phrases to be analyzed, an ontology approach can avoid the "vocabulary problem", which leads to spurious results. By establishing a fixed set of general concepts ("People", "Location", "Things") and entering each word according to the interrogative question it answers ("People", "Location", and "Things" refer to "who", "where", and "what", respectively), the vocabulary used in the rule mapping phase can be controlled.
Besides the root word (lexicon), the most important attribute of a lexicon entry is its grammatical information, which is used to answer interrogative questions about the lexicon. The MalayIK-Corpus is a Malay language corpus in which the Malay dictionary Kamus Dewan [23,24] and the dictionary of root words act as important secondary controls on the lexicon entries. It also refers to the dictionaries Kamus Imbuhan Bahasa Melayu [25], Kamus Dwibahasa Oxford Fajar [26], and Kamus Komprehensif Bahasa Melayu [27]. The lexicon entries are inserted manually into the database using the standard DML of the database concerned.
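The check-then-insert use of DML against the corpus can be sketched with Python's built-in sqlite3 module. SQLite stands in for whatever database the MalayIK-Corpus actually uses, and the table name, column names, and the function lookup_or_insert are hypothetical. Note that the CREATE TABLE statement is data definition rather than DML; it is included only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus ("
    "lexicon TEXT PRIMARY KEY, kata_masuk TEXT, interrogative TEXT)"
)

def lookup_or_insert(lexicon, kata_masuk=None, interrogative=None):
    """SELECT the entry for a lexicon; INSERT it first if the lexicon is new."""
    row = conn.execute(
        "SELECT kata_masuk, interrogative FROM corpus WHERE lexicon = ?",
        (lexicon,),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO corpus (lexicon, kata_masuk, interrogative) "
            "VALUES (?, ?, ?)",
            (lexicon, kata_masuk, interrogative),
        )
        row = (kata_masuk, interrogative)
    return row
```

The first call for a lexicon stores its grammatical information and interrogative element; later calls return the stored values without inserting a duplicate.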
In order to create a general-purpose corpus for the Malay language, Ahmad's and Abdullah's stop words [23], [25] are included; these cover pronouns, auxiliary verbs, adverbs, predicates, prepositions, negatives, conjunctions, relatives, and determiners. Table 2 presents examples of word entries extracted from the MalayIK-Corpus in table format (by columns and rows). The header row of Table 2

RESULTS AND ANALYSIS
The first question that arises in designing the MalayIK-Ontology is whether the constant/keyword recognizer that Ontos uses to extract and structure data can be applied to Malay unstructured documents. The next question is whether the knowledge and data that the MalayIK-Ontology identifies, extracts, and structures are equivalent to or better than those of Ontos. Furthermore, the knowledge or data obtained needs to be checked for validity. This is to show that the MalayIK-Ontology extracts and structures data effectively compared with Ontos. The following steps were therefore taken to perform the experiment.
The accuracy of Ontos is measured by the number of data items extracted from the English and Malay obituaries. The accuracy of MalayIK-Ontology is measured by the number of knowledge or data items extracted. When Ontos is applied to English and Malay obituaries, three tables are created based on the obituaries ontology: DeceasedPerson, Viewing, and DeceasedPersonRelationshipRelativeName. For MalayIK-Ontology, the table DeceasedPersonRelationshipRelativeName is used to compare the Malay translation of Relationship. An example of the data extracted by both Ontos and MalayIK-Ontology is listed in Table 3 for DeceasedPerson.

Analysis of Ontos on English and Malay Obituaries
The results of applying Ontos to English and Malay obituaries are shown in Table 4 and Table 5. These tables show the number of facts (attribute values) counted in the test-set documents of English and Malay obituaries. The authors of [7][8][9][10][11] are consistent in their implementation, which extracts only explicit constants. A string is counted as correct if the extracted constant occurs in the text. With this understanding, counting is basically straightforward. Because their name lexicon is incomplete and their name-extraction expressions are not rich, parts of a name are sometimes missed or a single name is split into two. For these cases, they list the count after a + in the Declared Correctly column. Partial names also caused most of the problems behind the large number of incorrectly identified relatives. With a more accurate and complete lexicon coupled with richer name-extraction expressions, they believe they can achieve much higher precision.
As anticipated, the constant/keyword recognizer of Ontos does not produce the same number of facts for Malay obituaries as for English obituaries. However, the results show that DeceasedPerson, DeceasedName, BirthDate, and DeathDate achieve 100% recall and precision. The facts generated are classified into nonlexical and lexical object sets, which are described in what the authors of [7][8][9][10][11] define as data frames. A data frame describes the string patterns for its constants.
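Recall and precision figures of this kind are commonly computed from three counts: the facts present in the source, the facts declared correctly, and the facts declared incorrectly. The Python sketch below assumes these standard definitions, which the text does not spell out; the function and parameter names are hypothetical.

```python
def recall(correct, in_source):
    """Percentage of the facts in the source that were declared correctly."""
    return 100.0 * correct / in_source if in_source else 0.0

def precision(correct, incorrect):
    """Percentage of all declared facts that were declared correctly."""
    declared = correct + incorrect
    return 100.0 * correct / declared if declared else 0.0
```

Under these definitions, an object set with three facts in the source, three declared correctly, and none declared incorrectly has 100% recall and 100% precision, while three additional incorrect declarations would halve the precision.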
Accordingly, lexical object sets such as IntermentDate, IntermentAddress, ViewingDate, ViewingAddress, Relationship, and RelativeName yield 0% recall and precision, as listed in Table 4. This is because the context keywords in the Malay obituaries are translated into Malay and therefore do not match the keywords that Ontos expects. However, nonlexical object sets such as DeceasedPerson are always generated for an obituary record, and consequently the results for those sets are 100% recall and precision. Nonlexical object sets are represented by surrogate identifiers, which are generally easier to identify correctly. This shows that Ontos can be applied to Malay obituaries for the data frames of nonlexical object sets represented by surrogate identifiers.
DeceasedPerson  3  3  3  0  0  100  100  100  100
DeceasedName    3  3  3  0  0  100  100  100  100
Age             3  3  3  0  3  100    0  100   50
BirthDate       3  3  3  0  0  100  100  100  100
DeathDate       3  3  3

CONCLUSION
The main objective of this research is to propose a new approach for transforming knowledge extracted from Malay unstructured documents by identifying, organizing, and structuring it into an interrogative structured form. This objective is pursued through the MalayIK-Ontology approach. Based on the results, the annotation of interrogative contextual information tagged in the interrogative lexical constructs improves data extraction. The interrogative contextual information is annotated with the interrogative and grammatical information of the lexical constructs. For example, the lexical object sets BirthDate, DeathDate, FuneralDate, and ViewingDate achieve 100% precision, as they also do with Ontos. This result is due to the annotation of interrogative contextual information tagged with the interrogative element 'bila' (when), which describes the string patterns of the time at which things happened. The lexical object sets DeceasedName and RelativeName also achieve 100% precision, because names are matched by recognizing the string patterns of proper nouns tagged with the interrogative element 'siapa' (who). In addition, phrases or proper nouns based on the interrogative annotation of the lexicon are also annotated in the lexical constructs.