Named Entity Recognition using Hidden Markov Model (HMM)

Named Entity Recognition (NER) is a subtask of Natural Language Processing (NLP), which is a branch of artificial intelligence. It has many applications, mainly in machine translation, text-to-speech synthesis, natural language understanding, information extraction, information retrieval, question answering, etc. The aim of NER is to classify words into predefined categories such as location name, person name, organization name, date, time, etc. In this paper we describe in detail a machine learning approach based on the Hidden Markov Model (HMM) for identifying named entities. The main idea behind using an HMM to build an NER system is that it is language independent, so the system can be applied to any language domain. In our NER system the states are not fixed: they are dynamic, and one can choose them according to one's interest. The corpus used by our NER system is also not domain specific.


1. Introduction
Named Entity Recognition is a subtask of information extraction whose aim is to classify text from a document or corpus into predefined categories such as person name, location name, organisation name, month, date, time, etc., and to label as "other" the text that is not a named entity. NER has many applications in NLP. Some of these applications include machine translation, more accurate internet search engines, automatic indexing of documents, automatic question answering, information retrieval, etc. An accurate NER system is needed for these applications. Most NER systems use a rule-based approach, a statistical machine learning approach, or a combination of the two. A rule-based NER system uses hand-written rules framed by linguists; these are language-dependent rules that help to identify named entities in a document. Rule-based systems are usually the best-performing systems, but they suffer from some limitations: they are language dependent and difficult to adapt to changes.
The machine learning (ML) approach learns rules from annotated corpora. Nowadays the machine learning approach is commonly used for NER because training is easy, the same ML system can be used for different domains and languages, and maintenance is less expensive. There are various machine learning approaches for NER, such as CRF (Conditional Random Fields), MEMM (Maximum Entropy Markov Model), SVM (Support Vector Machine) and HMM (Hidden Markov Model), as well as the dictionary-based approach. Among all these, HMM, although the most promising, has not been explored to its full potential for NER. The work that has been reported is domain specific and does not establish it as a general technique.
Researchers mostly use hybrid NER systems, which take advantage of both rule-based and statistical approaches so that the performance of the NER system can be improved.

2. Challenges of NER in Indian Languages
NER is an essential need in technology for Indian languages. NER in Indian languages is a more challenging problem than in languages that use the Roman script, owing to the absence of capitalization, the lack of resources, etc. Because of these issues, an English NER system cannot be used directly to perform NER for an Indian language.
We adapt the Hidden Markov Model machine learning approach to Named Entity Recognition in Indian languages so that it can be used as a general technique. The main challenges are the following:
• For English and other European languages, capitalization plays a very important role in identifying NEs, but Indian languages have no concept of capitalization, which makes NER difficult for these languages.
• A large amount of ambiguity exists in Indian names, and this makes recognition a very difficult task for Indian languages.
• Indian languages are also resource-poor languages. Annotated corpora, name dictionaries, good morphological analyzers, POS taggers, web sources for name lists, etc. are not yet available in the required quantity and quality [2].
• Although Indian languages have a very old and rich literary history, technology development for them is recent [2].
• India is a multilingual country with many different languages, and there is a large amount of variation within each language. Because of these variations, named entity recognition systems built for one language domain do not usually work well in other language domains.

3. Our approach
3.1 Hidden Markov Model based machine learning
HMM stands for Hidden Markov Model. An HMM is a generative model: it assigns a joint probability to paired observation and label sequences [6]. The parameters are then trained to maximize the joint likelihood of the training set [6].
Among all approaches, the evaluation performance of HMM is reported to be higher than that of the others [7].
The main reason may be its better ability to capture the locality of the phenomena that indicate names in text [7].
We can define an HMM formally as follows: λ = (A, B, π). Here, A represents the transition probabilities, B represents the emission probabilities and π represents the start probabilities [4].
$A = \{a_{ij}\}$, where $a_{ij} = \dfrac{\text{number of transitions from state } s_i \text{ to state } s_j}{\text{number of transitions from state } s_i}$ [4].
$B = \{b_j(k)\}$, where $b_j(k) = \dfrac{\text{number of times in state } j \text{ and observing symbol } k}{\text{expected number of times in state } j}$ [4].
The start probability π gives, for each state, the probability that a word in that state occurs first in a sentence.
The Baum-Welch algorithm is used to find the maximum likelihood and posterior mode estimates of the HMM parameters [9]. The forward-backward algorithm is used to find the posterior marginals of all hidden state variables given a sequence of observations/emissions [8].
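When the training corpus is fully annotated, the ratios for A and B above can be read as relative frequencies and obtained by simple counting. The following is a minimal sketch under that assumption, not the implementation used in this work; the function name `estimate_transitions_and_emissions`, the `OTHER` tag and the toy sentences are hypothetical, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_transitions_and_emissions(tagged_sentences):
    """Relative-frequency estimates of A (tag -> next tag) and B (tag -> word).

    `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs.
    """
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)

    for sentence in tagged_sentences:
        for word, tag in sentence:
            emission_counts[tag][word] += 1
        # Count every pair of adjacent tags inside the sentence.
        for (_, prev_tag), (_, next_tag) in zip(sentence, sentence[1:]):
            transition_counts[prev_tag][next_tag] += 1

    def normalise(counts):
        # Divide each count by the total of its row so that every row sums to 1.
        probabilities = {}
        for state, row in counts.items():
            total = sum(row.values())
            probabilities[state] = {key: n / total for key, n in row.items()}
        return probabilities

    return normalise(transition_counts), normalise(emission_counts)


# Hypothetical two-sentence corpus, for illustration only.
corpus = [
    [("Ram", "PER"), ("lives", "OTHER"), ("in", "OTHER"), ("Delhi", "LOC")],
    [("Sita", "PER"), ("visited", "OTHER"), ("Agra", "LOC")],
]
A, B = estimate_transitions_and_emissions(corpus)
print(A["PER"])  # {'OTHER': 1.0}
print(B["LOC"])  # {'Delhi': 0.5, 'Agra': 0.5}
```

A state that never appears before another tag (here `LOC`) has no row in A; a real system would need some form of smoothing for such unseen events.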

Viterbi algorithm
The Viterbi algorithm (Viterbi 1967) is implemented to find the most likely tag sequence in the state space of the possible tag distribution, based on the state transition probabilities [10]. The Viterbi algorithm allows us to find the optimal tags in time that is linear in the length of the sentence. The idea behind the algorithm is that, of all the state sequences leading to a given state, only the most probable one needs to be considered.
Moreover, HMMs are used more and more in NE recognition because of the efficiency of the Viterbi algorithm [Viterbi67] used in decoding the NE-class state sequence [7].
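The decoding step can be illustrated with a short sketch of the Viterbi recursion in log space. This is a generic textbook formulation written under our own assumptions rather than the code of the system described here: the function name `viterbi`, the nested-dictionary layout of the parameters and the `floor` value that replaces missing or zero probabilities are all hypothetical choices.

```python
import math


def viterbi(words, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words`.

    start_p[s], trans_p[s1][s2] and emit_p[s][w] are probabilities; entries
    that are missing or zero are replaced by `floor` (a crude assumption).
    """
    if not words:
        return []

    def log_prob(p):
        return math.log(p if p > 0 else floor)

    # best[t][s]: log-probability of the best path ending in state s at word t.
    best = [{s: log_prob(start_p.get(s, 0)) +
                log_prob(emit_p.get(s, {}).get(words[0], 0)) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            # Keep only the most probable way of reaching state s.
            prev, score = max(
                ((p, best[t - 1][p] + log_prob(trans_p.get(p, {}).get(s, 0)))
                 for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_prob(emit_p.get(s, {}).get(words[t], 0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters, for illustration only.
states = ["PER", "LOC", "OTHER"]
start_p = {"PER": 1.0}
trans_p = {"PER": {"OTHER": 1.0}, "OTHER": {"OTHER": 1 / 3, "LOC": 2 / 3}}
emit_p = {
    "PER": {"Ram": 0.5, "Sita": 0.5},
    "LOC": {"Delhi": 0.5, "Agra": 0.5},
    "OTHER": {"lives": 1 / 3, "in": 1 / 3, "visited": 1 / 3},
}
print(viterbi(["Sita", "lives", "in", "Agra"], states, start_p, trans_p, emit_p))
# ['PER', 'OTHER', 'OTHER', 'LOC']
```

For a sentence of T words and N states the two nested loops over states give O(T·N²) work, which is linear in T when the tag set is fixed.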

Current NER in Indian Language
Current work on NER in Indian languages suffers from the following limitations:
• Language dependent - An NER system built for one language cannot be used for another language, or can be adapted only with too much effort.
• Domain specific - An NER system works best for one domain, but in other domains its performance is not up to the mark.
• The rule-based method gives high accuracy up to a certain extent, but it requires language experts to construct the rules for any language domain.
• The NER process requires much time and effort.
• The accuracy of the gazetteer method is acceptable, but it has problems when the corpus is very large. Since Indian languages are free-format languages and new words are generated rapidly, managing the size of the list is a big task [5].
• The gazetteer method also takes a lot of time to search for a named entity in the list, since for each word the entire list has to be searched from the beginning [5].
• The problem with the maximum entropy model is that it does not solve the label bias problem [1].

HMM based NER
• We can develop an NER system that is language independent. It is not specific to a particular language domain and can be used for any language domain.
• The HMM based NER system is easy to understand, implement and analyse. It can be used for any amount of data, so the system is scalable.
• It solves the sequence labelling problem very efficiently.
• The states used in the model are not fixed. They are dynamic: one can choose them according to one's requirements or interest.
• The HMM based NER system does not require language experts: a person with little knowledge of the language in which he/she wants to find named entities can easily run and operate the system.

Proposed System
The proposed system uses a learning-by-example methodology. It provides an easy-to-use method for Named Entity Recognition in any natural language with minimum effort. The user just has to annotate a corpus and can then test the system on any sentence. The steps to be followed for any language are as follows:
1. Data preparation
2. Parameter estimation (training)
3. Testing the system

Step 1: Data Preparation
We need to convert the raw data into a trainable form so that it is suitable for use in the Hidden Markov Model framework for all languages. The training data may be collected from any source, such as an open source, a tourism corpus or simply a plain-text file containing some sentences. To bring such a file into trainable form, we perform the following steps:

Algorithm
Step 1: Separate each word in the sentence.
Step 5: Tag each word with its named entity tag, using your own knowledge of the language.
Step 6: The corpus is now in trainable form.
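As an illustration of what a trainable form might look like, the sketch below reads one annotated sentence per line. The file format is our own assumption and is not specified in this paper: each word is optionally followed by `/TAG`, and untagged words receive a default tag. The function name `parse_tagged_sentence`, the separator and the `OTHER` tag are hypothetical.

```python
def parse_tagged_sentence(line, separator="/", default_tag="OTHER"):
    """Split one annotated line into a list of (word, tag) pairs.

    Assumed (hypothetical) format: whitespace-separated tokens, each written
    either as `word/TAG` or as a bare `word`, which gets `default_tag`.
    """
    pairs = []
    for token in line.split():
        word, found, tag = token.rpartition(separator)
        if not found:
            # No separator in the token: it is an untagged word.
            word, tag = token, default_tag
        pairs.append((word, tag))
    return pairs


def load_corpus(lines):
    """Parse every non-empty line into one tagged sentence."""
    return [parse_tagged_sentence(line) for line in lines if line.strip()]


# Hypothetical annotated lines, for illustration only.
annotated = ["Ram/PER lives in Delhi/LOC", "", "Sita/PER visited Agra/LOC"]
for sentence in load_corpus(annotated):
    print(sentence)
```

Any tag set can be used with this layout, which matches the idea that the states of the model are chosen by the user rather than fixed in advance.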

Algorithm:
For each starting tag
    Find the frequency of that tag as a starting tag
    Calculate π
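The start-probability algorithm above can be sketched as a frequency count over the first tag of every annotated sentence. This is a minimal illustration under the assumption that π for a tag is the fraction of sentences that begin with it; the function name `start_probabilities` and the example data are hypothetical.

```python
from collections import Counter


def start_probabilities(tagged_sentences, states):
    """Estimate pi: the fraction of sentences that start with each tag.

    `tagged_sentences` is a list of sentences, each a list of (word, tag)
    pairs; `states` lists every tag of interest, so that tags which never
    start a sentence still receive an explicit probability of 0.0.
    """
    first_tags = Counter(s[0][1] for s in tagged_sentences if s)
    total = sum(first_tags.values())
    return {tag: (first_tags[tag] / total if total else 0.0) for tag in states}


# Hypothetical three-sentence corpus, for illustration only.
corpus = [
    [("Ram", "PER"), ("lives", "OTHER"), ("in", "OTHER"), ("Delhi", "LOC")],
    [("Sita", "PER"), ("visited", "OTHER"), ("Agra", "LOC")],
    [("Delhi", "LOC"), ("is", "OTHER"), ("old", "OTHER")],
]
pi = start_probabilities(corpus, ["PER", "LOC", "OTHER"])
print(pi["OTHER"])  # 0.0
```

In this toy corpus two of the three sentences start with `PER` and one with `LOC`, so the estimates are 2/3, 1/3 and 0 respectively.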

FEATURES OF PROPOSED SYSTEM
Our Hidden Markov Model based NER system has been trained and tested with different Indian languages, namely Hindi, Urdu, Punjabi, etc. We have performed training and testing on our tourism corpus, and the system performs well. The work reported in this paper differs from previous work in the following points:
• Language independent - This methodology works for any natural language, including European languages. This work has been tested for Hindi, Urdu, Punjabi and English.
• General approach - This approach is not domain specific. This work has been tested on a tourism corpus, general sentences and stories.
• High accuracy - If a rich corpus is developed, the system performs best. During testing we obtained accuracy of up to 90%.
• Dynamic - All the parameters used by our system are dynamic: one can set them according to one's interest. This work has been tested with Person, Location, River and Country tags in the tourism corpus and with Person, Time, Month, Dry-fruits and Food-items tags in the story corpus.
• Usefulness for other classification tasks - Since the parameters are dynamic, the same NER system can be used for other NLP classification tasks such as part-of-speech tagging.
• Fine-grained tagging - Most systems assign a location tag to the names of places, rivers, palaces, etc. In this system you can define subclasses of the location tag according to your need. The system has been tested with tags such as country, river and tree.
• Use of annotated corpus - To use this system you have to prepare a tagged corpus, either with the help of the proposed system or with other tools. This tagged corpus can also be used in other natural language processing applications.

CONCLUSION
Building an NER system for Hindi using HMM is very helpful for many significant applications. We have studied various approaches to NER and compared them on the basis of their accuracies. India is a multilingual country with 22 officially recognised languages, so there is a lot of scope for NER in Indian languages. Once an NER system with high accuracy is built, it will pave the way for NER in all the Indian languages, and an efficient language-independent approach can then be used to perform NER on a single system for all of them. An NER system based on HMM is therefore very efficient, especially for Indian languages, where large variation occurs.

3.5.2.2. Procedure to find Start probability
States is a vector that contains all the named entity tags in which the user is interested. Start probability is the probability that a sentence starts with a particular tag.
Input: Annotated text file
Output: Start probability vector