A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUAL BILSTM NETWORK

Named Entity Recognition (NER) is an important task in many Natural Language Processing (NLP) applications, and improving it improves the overall performance of those applications. In this paper, deep learning techniques are applied to NER on Hindi text, because Hindi NER has received far less attention than English NER. This is a barrier for resource-scarce languages, for which many resources are not readily available. Researchers have used rule-based, machine learning based and hybrid approaches to address this problem, and deep learning based algorithms are now being developed on a large scale for advanced NER models. In this paper we devise a novel architecture that adds residual connections to a Bidirectional Long Short Term Memory (BiLSTM) network with fasttext word embedding layers. We use pre-trained word embeddings to represent the words in the corpus, and the NER tags of the words are those defined in the annotated corpus used. Development of an NER system for Indian languages is a comparatively difficult task. We carry out experiments comparing NER results with normal embedding and fasttext embedding layers, and analyse the performance of the word embeddings when the deep learning models are trained with different batch sizes. We report state-of-the-art results for this approach in terms of F1 score.


INTRODUCTION
Named Entity Recognition (NER) was first introduced in 1995 at the Sixth Message Understanding Conference (MUC-6, 1995) [8], where it was defined as consisting of three subtasks: (a) entity names, such as the names of organizations, persons or locations; (b) temporal expressions, such as times and dates; and (c) number expressions or quantities, such as monetary values and percentages. The terms to be annotated are unique identifiers of these kinds. NER is therefore one of the key tasks in information extraction and Natural Language Processing (NLP). English can boast of a rich NER literature, but the same cannot be said for Hindi. There have been periodic attempts, and there is considerable scope for exploration in the Hindi language domain, especially now that deep learning models are being used to solve several language processing problems. The lack of ready tools, the rich morphology of Hindi and, above all, the scarcity of annotated corpus data (a) make it difficult to reuse existing deep learning architectures designed for English and (b) leave room to explore novel and advanced approaches to the NER task.
Building on the success of machine learning architectures for NER in resource-rich languages like English, in this paper we follow a simple and effective approach of refining previously successful deep neural network models for Hindi. The idea is to combine fasttext embeddings with a residual deep neural network architecture, a novel combination whose parameters are easy to optimise in a low-resource scenario. As increasingly deep networks are designed, it becomes imperative to understand how adding layers increases the complexity and expressiveness of the network. It is even more important to be able to design networks in which adding layers makes the network strictly more expressive rather than just different. The architecture is geared towards low-resource data and limited computing time and power, yet it shows an improvement over existing models for the Hindi NER task. We show experimentally that adding residual connections improves Hindi NER performance over the base BiLSTM model, which is the main contribution of this paper. Deep residual networks have been shown to scale up to thousands of layers while still improving in performance [12]. We believe that modifications of this kind, or the integration of different network models, help improve Hindi NER performance, especially in low-resource conditions.

RELATED WORK
Development of an NER system for Indian languages is a comparatively difficult task.
Hindi and many other Indian languages pose inherent difficulties for many NLP tasks. Consequently, not much work has been done on NER for Indian languages like Hindi. Hindi is the third most spoken language in the world, and yet no accurate Hindi NER system exists. Because features such as capitalization are not available in Hindi, and because of the lack of a large labelled dataset [11], the lack of standardization, and spelling variations, an English NER system cannot be used directly for Hindi.
Furthermore, the structure of the language contains many complexities, such as free word order (which significantly affects n-gram based approaches) and its inflectional nature (which significantly affects hand-engineered approaches). Also, in Indian languages many word constructions (derivational or inflectional) can be classified as named entities, and the constraints on these constructions vary from language to language. Carefully crafted rules therefore need to be written for each language, which is a very time-consuming and expensive task. The scarcity of labelled data also makes many statistical approaches, such as deep learning, difficult to apply. This complexity is a significant challenge. However, Shah et al. have demonstrated promising results by using BiLSTM networks for the NER problem [5]. Our work builds upon theirs and adds residual connections to the network.
There is a need to develop an accurate Hindi NER system for better presence of Hindi on the Internet. It is necessary to understand Hindi language structure and learn new features for building better Hindi NER systems.

Word Embeddings
Word embeddings are an efficient way to represent words: words with similar meanings are given similar representations, which is useful for various NLP tasks. The quality of word embeddings depends on the quality of the input data, so representing the data as words is an essential step, and embedding words into a low-dimensional space is now the most commonly suggested approach. Distributed word representations have recently contributed to competitive performance in language modeling and various other NLP tasks. Among the many neural network embedding approaches, the skip-gram model has achieved significant results in many NLP tasks, including sentence completion, analogy and sentiment analysis. Word2vec is a statistical method for learning word embeddings from a large text corpus. It outputs a high-dimensional vector space in which each word from the corpus is assigned a vector, and words with common contexts are placed close together [1]. We have chosen Fasttext, a pre-trained word embedding developed and open-sourced by Facebook [2], for our task. It provides word embeddings for 157 languages, including Hindi, and is based on the CBOW (Continuous Bag-of-Words) model. The CBOW model learns by predicting the current word from its context, and it was trained on Common Crawl and Wikipedia [3]. The fasttext approach has already given comparatively better results for English NER than the regular methods used for named entity recognition. For regional languages like Hindi, however, the unavailability of a large corpus has meant that experiments have been done with regular deep learning algorithms and traditional approaches. Here, we use a novel architecture to analyse NER performance with a BiLSTM neural network.
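As a minimal sketch of how pre-trained vectors can be turned into an embedding table, the snippet below parses the fastText text (.vec) layout, a header line with the word count and dimension followed by one word and its vector per line, and builds a matrix indexed by word id. The function names, the toy 3-dimensional vectors and the choice of all-zero rows for padding and unseen words are assumptions made for illustration, not details taken from the experiments.

```python
def load_vec(lines):
    """Parse fastText .vec text: a 'count dim' header line, then one
    'word v1 v2 ... v_dim' line per word."""
    lines = iter(lines)
    _count, dim = map(int, next(lines).split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(v) for v in parts[1:dim + 1]]
    return vectors, dim


def embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of the word with id i. Row 0 (padding) and
    words missing from the pre-trained table stay all-zero (assumption)."""
    matrix = [[0.0] * dim for _ in range(len(vocab) + 1)]
    for word, idx in vocab.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy 3-dimensional table standing in for the 300-dimensional Hindi vectors.
toy_vec = ["2 3", "दिल्ली 0.1 0.2 0.3", "राम 0.4 0.5 0.6"]
vectors, dim = load_vec(toy_vec)
matrix = embedding_matrix({"राम": 1, "दिल्ली": 2, "गया": 3}, vectors, dim)
```

With the real vectors, `lines` would be an open file handle over the downloaded Hindi .vec file and `dim` would be 300.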

Dataset
We label named entities on the dataset available at [4], released at IJCNLP 2008 as part of the Workshop on NER for South and South East Asian Languages. It consists of 19822 annotated sentences and 490368 tokens, of which 34193 are unique, with 12 categories of entities and one negative class, other. The 12 categories are given in Table 1.

Pre-Processing Steps
The dataset was in Shakti Standard Format (SSF), which could not be fed directly into a model, so it was parsed with handwritten regex parsers in Python.
The steps involved in pre-processing the data were:
- Parsing the SSF files.
- Removing sentences with no tags, after which 7966 sentences remained.
- Mapping all words to numbers, which are then mapped to their respective embeddings, each of dimension 300 for Fasttext.
- Padding sentences with "0" and truncating them so that all sentences have the same length, i.e. 30.
- Splitting the dataset in a 70:15:15 ratio into training, testing and validation sets respectively.
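The mapping, padding and splitting steps can be sketched in plain Python as follows. This is an illustrative sketch rather than the code used in the experiments: the helper names, the choice to pad on the right, the shuffling before the split and the toy sentences are all assumptions.

```python
import random

MAX_LEN = 30  # sentence length used in the text
PAD = 0       # padding value used in the text


def build_vocab(sentences):
    """Assign each distinct word an integer id, reserving 0 for padding."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab, max_len=MAX_LEN):
    """Map words to ids, truncate to max_len, then right-pad with PAD."""
    ids = [vocab[w] for w in sentence][:max_len]
    return ids + [PAD] * (max_len - len(ids))


def split_70_15_15(items, seed=0):
    """Shuffle (assumption) and split into training, testing and validation."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100
    n_test = len(items) * 15 // 100
    return (items[:n_train],
            items[n_train:n_train + n_test],
            items[n_train + n_test:])


# Toy sentences standing in for the parsed SSF data.
sentences = [["राम", "दिल्ली", "गया"], ["सीता", "दिल्ली", "आई"]]
vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]
train, test, val = split_70_15_15(range(100))
```

Reserving id 0 for padding keeps the padded positions distinguishable from real words when the ids are later looked up in the embedding layer.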

Mathematical Algorithms Used
1) Softmax Activation Function: For activation, our model uses the Softmax function, an activation function used in neural networks to compute a probability distribution from a vector of numbers. Each output lies between 0 and 1, and the outputs sum to 1. For an input vector x with K components, the Softmax activation function is computed using the following relationship:
\[ \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}, \qquad i = 1, \dots, K \]
The Softmax function is used in multi-class models, where it returns the probability of each class, with the target class expected to have the highest probability.
In most cases the Softmax function appears in the output layer of a deep learning architecture, as it does in ours.
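A small numerical illustration of the Softmax function may help. The sketch below exponentiates each score and normalises by the sum; subtracting the maximum score first is a standard numerical-stability step that does not change the result. The example scores are arbitrary.

```python
import math


def softmax(scores):
    """Return exp(x_i) / sum_j exp(x_j) for each score, computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Arbitrary scores for three classes; the largest score gets the
# largest probability.
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))
```

In a tagger, `scores` would be the per-tag outputs for one word and `predicted` the index of the chosen tag.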
2) Recurrent Dropout: Recurrent dropout is a method that preserves memory in an LSTM while still generating a different dropout mask for each input sample. It works by applying dropout selectively to the part of the Recurrent Neural Network that updates the hidden state, rather than to the state itself. Thus, a dropped element does not contribute to the network's memory, but it does not erase the hidden state either. For an LSTM, the equations are the same as for a vanilla LSTM, except that the equation for the cell state C_t changes.
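The modified cell-state equation is not reproduced in the text. One formulation consistent with the description, given here as an assumption, applies a dropout mask d(·) only to the candidate update g_t, so that the previous cell state C_{t-1} is never zeroed. The symbols f_t, i_t and o_t (forget, input and output gates), g_t (candidate update) and h_t (hidden output) follow common LSTM notation and are not taken from the paper.

```latex
C_t = f_t \odot C_{t-1} + i_t \odot d(g_t), \qquad h_t = o_t \odot \tanh(C_t)
```

Setting d to the identity recovers the vanilla LSTM update, which matches the statement that only the equation for C_t changes.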

Proposed Approach
Previous work has used BiLSTM networks for Hindi NER; our approach builds on it and adds residual connections to the model. The input consists of batches of Hindi sentences in which each word has been mapped to a number. These are passed to the embedding (fasttext) layer, where each number is mapped to a specific vector, i.e. each word is mapped to its learned fasttext vector. To obtain a deeper representation of the words, we use a residual architecture of two stacked layers, in which the output of the first layer is added to the output of the second layer stacked on top of it. This residual connection gives the model a deeper understanding of the context of the words and improves performance, raising the precision from 78% in the work of Shah et al. [5] to 81.9%. To counter overfitting, we add a dropout layer after the residual connection and use recurrent dropout in the recurrent layers. At the end of the model, a time-distributed dense layer maps each word representation in the sentence to an output tag probability for each word.
A plot of the model can be seen in Figure 1.
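The architecture described above can be sketched with the Keras functional API as follows. This is a hedged illustration, not the authors' implementation: the use of Keras, the vocabulary size, the number of LSTM units and the dropout rates are assumptions, and the embedding layer is left randomly initialised where the experiments would load fasttext vectors. The sequence length (30), embedding dimension (300) and number of tags (12 categories plus other) come from the text.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (
    LSTM, Bidirectional, Dense, Dropout, Embedding, TimeDistributed, add,
)

MAX_LEN = 30       # sentence length after padding/truncation (from the text)
EMBED_DIM = 300    # fasttext vector size (from the text)
N_TAGS = 13        # 12 entity categories plus the "other" class
VOCAB_SIZE = 1000  # assumption: toy vocabulary size, including padding id 0
UNITS = 64         # assumption: LSTM units per direction
DROP = 0.1         # assumption: dropout and recurrent dropout rate


def build_residual_bilstm():
    tokens = Input(shape=(MAX_LEN,), dtype="int32")
    # In practice these weights would be initialised from fasttext vectors.
    x = Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
    first = Bidirectional(
        LSTM(UNITS, return_sequences=True, recurrent_dropout=DROP))(x)
    second = Bidirectional(
        LSTM(UNITS, return_sequences=True, recurrent_dropout=DROP))(first)
    # Residual connection: element-wise sum of the two layer outputs,
    # which requires both layers to have the same output width.
    residual = add([first, second])
    residual = Dropout(DROP)(residual)
    # One softmax over the tag set for every position in the sentence.
    tags = TimeDistributed(Dense(N_TAGS, activation="softmax"))(residual)
    return Model(tokens, tags)


model = build_residual_bilstm()
```

Because the sum is element-wise, both BiLSTM layers use the same number of units; a projection would be needed if their widths differed.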

Hardware Setup
The models were trained on an MSI laptop with the specifications given in Table 2. Because of the large word embedding dimensions, it is advisable to carry out the training process only on GPUs.

Results Obtained and Their Analysis
The model has 12,464,023 parameters. It was trained with varying batch sizes and tested with each. The best results were obtained with a batch size of 32 at 5 epochs. The metrics were calculated on a single fit; cross-validation was not carried out because the dataset is large enough. The results are tabulated in Table 3. The precision was 3.9 percentage points higher than that of previous work on BiLSTMs for NER [5].
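For reference, precision, recall and F1 score can be computed from gold and predicted tag sequences as in the sketch below. The text does not state whether the reported scores are token-level or entity-level, so the token-level definition, the function name and the toy tags used here are assumptions.

```python
def precision_recall_f1(gold, pred, negative="O"):
    """Token-level scores over all non-negative tags.

    A token counts as a true positive when its predicted tag is an
    entity tag and equals the gold tag.
    """
    tp = sum(1 for g, p in zip(gold, pred) if p != negative and p == g)
    n_pred = sum(1 for p in pred if p != negative)
    n_gold = sum(1 for g in gold if g != negative)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Illustrative tags only; "O" stands for the negative class.
gold = ["NEP", "O", "NEL", "O", "NEO"]
pred = ["NEP", "O", "NEO", "NEL", "NEO"]
p, r, f1 = precision_recall_f1(gold, pred)
```

In this toy case two of the four predicted entity tags are correct and two of the three gold entity tags are found, so precision is 2/4 and recall is 2/3.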

CONCLUSION
Most NLP applications in Computer Science have their first step rooted in Named Entity Recognition. However, there is a lack of collated information on NER methods used for processing Hindi. This is one of the first attempts at applying residual connections to BiLSTM networks for the NER task. It has been shown that rule-based approaches outperform others when expert linguists are available, but with advances in machine learning and deep learning models this situation is likely to change soon for a large set of languages.