Published January 17, 2019 | Version v1
Thesis Open

Building Arabic Corpora: Concepts, Methodologies, Tools, and Experiments

  • 1. Faculty of Sciences, Mohammed First University

Contributors

  • 1. Faculty of Sciences, Mohammed First University

Description

The term corpus comes from Latin and means “body”. According to corpus linguists, a corpus can be defined as a collection of machine-readable authentic texts, including transcripts of spoken data. The work of corpus builders falls essentially into three areas: corpus compilation, data processing, and corpus annotation. Each of these tasks requires specialists, takes time, and costs money. A further task is to infer information from corpora, either to provide empirical evidence for linguistic theories or to turn the data into products or services. Corpora are essential resources for computational linguistics and Natural Language Processing (NLP). Specifically, corpora provide empirical data that enable linguists and grammarians to make objective rather than subjective statements. Further, many NLP applications are moving from rule-based systems and knowledge-based methods to data-driven approaches.
The prime motivation for the research in this thesis is the limited research on Arabic corpus linguistics and the lack of available resources, standards, and efficient tools that can meet the needs of Arabic NLP. Furthermore, most Arabic corpus builders have proposed corpora and tools that suit their own objectives without considering standardization and international aspects. Therefore, another purpose of this thesis is to provide an overview of the central criteria and methodology of building corpora and to give a better understanding of Arabic corpus linguistics.
To widen the scope of this thesis, we carried out three tasks:

  1. We conducted a survey covering 100 well-known and influential corpora to learn how relevant corpora have been built, what the procedure requires, and how long it takes to complete. The survey summarises the data sources and the compilation methods used in relation to corpus characteristics such as size and the time consumed during compilation.
  2. There is a lack of appropriate tools that can deal effectively with the rich morphology and syntax of both Classic Arabic and Modern Standard Arabic (MSA). Thus, we developed our own tools and adapted others, namely a stemmer, a lemmatizer, and a part-of-speech tagger. In doing so, we studied the state of the art of available tools and then proposed standard concepts and a tagset that take the features of the Arabic language into account. Furthermore, we carefully collected Arabic linguistic resources to create the dictionaries required to enhance the performance of the developed and adapted tools. Finally, we performed comparative and usability tests.
  3. To enrich our work, we built three different types of corpora: Classic Arabic (i.e., Al-Mus’haf), MSA (i.e., OSIAN), and multilingual (i.e., MulTed). Detailed information about the building procedures and the characteristics of the constructed corpora is presented. Furthermore, the corpora are compared to similar corpora, stressing their significant contribution to the literature. Finally, these corpora will be publicly released to push forward the state of the art in Arabic NLP and corpus linguistics.

Files

Thesis - I.Zeroual.pdf

Files (3.7 MB)

md5:24cfdff64a880186fb9e9a64b19f560c