Part of Speech Tagger for Hindi Language: a Survey

Introduction

Part-of-speech tagging is an important application of natural language processing. Natural Language Processing (NLP) is a rapidly growing technology that, given queries and keywords, fetches information from large collections of data. Part-of-speech tagging is the practice of assigning language-specific grammar tags, a part of speech such as noun, verb, preposition, pronoun, adverb, adjective, or another lexical class marker, to each word in a sentence of input text, according to the word's appearance in context [1]. NLP is a field at the intersection of computer science, artificial intelligence, and linguistics, concerned with interactions between computers and human language; it extracts information from natural-language input and produces structured output.

The process of POS tagging consists of three stages: tokenization, assigning a tag to each tokenized word, and resolving ambiguous words. Text tagging is a complex task because many words belong to different tag categories depending on the context in which they are used; this phenomenon is termed lexical ambiguity, and it affects most words in a text. For example, 'bank' can be treated as either a noun or a verb. Part-of-speech information enables higher-level analysis, such as recognizing noun phrases and other patterns in text. C-DAC and TDIL at IIT Bombay have played an important role in the research venture 'POS tagger for Hindi'. Common tags include: [N] Nouns, [V] Verbs, [PR] Pronouns, [JJ] Adjectives, [RB] Adverbs, [PP] Postpositions, [PL] Participles, [QT] Quantifiers, [RP] Particles, [PU] Punctuation. There are several approaches to implementing a part-of-speech tagger: the rule-based approach, the statistical approach, and the hybrid approach.
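The three stages above can be sketched in Python with a toy lexicon; the words, tags, and entries here are invented for illustration and are not drawn from any real Hindi corpus:

```python
import re

# Toy lexicon: each word maps to its possible tags (hypothetical entries).
LEXICON = {
    "bank": ["N", "V"],   # lexically ambiguous: noun or verb
    "the": ["DT"],
    "river": ["N"],
    "flows": ["V"],
}

def tokenize(text):
    """Stage 1: split the input text into word tokens."""
    return re.findall(r"\w+", text.lower())

def tag(tokens):
    """Stages 2 and 3: assign a tag to each token; ambiguous words
    would be resolved from context in a real tagger, but this sketch
    simply takes the first listed tag."""
    return [(tok, LEXICON.get(tok, ["UNK"])[0]) for tok in tokens]

print(tag(tokenize("The bank flows")))
# [('the', 'DT'), ('bank', 'N'), ('flows', 'V')]
```

A real system replaces the "first listed tag" shortcut with one of the disambiguation approaches surveyed below.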

Literature Survey

There exist part-of-speech taggers for many languages, built using a variety of approaches, and several implementations exist for Hindi as well. AnnCorra, short for "Annotated Corpora," is a project of Lexical Resources for Indian Languages (LERIL) and a collaborative effort of several groups. It developed a system using a statistical approach that provides syntactic and semantic information.

Pratibha Singh and Aditya Tripathi, "Hindi Language Text Search: A Literature Review" (2017), focus on the major problems of Hindi text searching over the web. The review surveys the techniques and search engines that have been developed to facilitate Hindi text searching. Arnav Sharma and Raveesh Motlani, "POS Tagging For Code-Mixed Indian Social Media Text: Systems from IIIT-H for ICON NLP Tools Contest" (2016), explore POS tagging of code-mixed Indian social media text using machine learning approaches. The text is English mixed with one of three Indian languages: Hindi, Bengali, or Tamil. Their POS tagger trained only on the given dataset (a constrained system) achieved an accuracy of 75.04% when evaluated on an unseen test set.

Rajesh Kumar Sayar and Singh Shekhawat, "Parts of Speech Tagging for Hindi Language Using HMM" (2018), describe POS tagging for the Indian language Hindi. They perform Hindi POS tagging using the Hindi WordNet dictionary and an HMM; HMM approaches for tagging sentences written in Hindi are discussed in their paper. Performance is analyzed in terms of precision, recall, and F1-measure: they report 93.17% precision, 96.46% recall, and 90.13% F-measure. Deepa Modi and Neeta Nain, "Part-of-Speech Tagging of Hindi Corpus Using Rule-Based Method" (2016), aim to increase automaticity and maintain high precision while limiting the size of the human-made corpus.

They used a human-made corpus of around 9,000 words to improve tagging and a rule-based (lexical-feature-based) approach to reduce the size of the trained corpus. The system yields 91.84% average precision and 85.45% average accuracy. Vijeta Khicha and Mantosh Manna, "Part-of-Speech Tagging of Hindi Language Using Hybrid Approach" (2017), used a hybrid approach. Their system has a 32-tag set: 27 tags from the IIIT-Hyderabad tagset (POS tagger, 2007) plus 5 newly added special tags. A pre-tagged corpus of around 13,000 Hindi words is prepared in .xml format, and the system is built in Java. It yields 92.56% average precision and 87.55% average accuracy.

Kanak Mohnot, Neha Bansal, Shashi Pal Singh, and Ajai Kumar, "Hybrid Approach for Part of Speech Tagger for Hindi Language" (2014), evaluated their system over a corpus of 80,000 words with 7 standard part-of-speech tags for Hindi, using a hybrid approach. The system performs well, with an average tagging accuracy of 89.9%. Pravesh Kumar Dwivedi and Pritendra Kumar Malakar, "Hybrid Approach Based POS Tagger for Hindi Language" (2015), present a newer Hindi POS tagger based on a hybrid approach that uses linguistic resources, several rules, and a database with predefined lists of possible prefixes, suffixes, and other related data for the Hindi language.

POS Tagging Techniques

POS tagging is used as an early stage of text analysis in many applications such as subcategory acquisition, text to speech synthesis and alignment of parallel corpora. POS tagging is a necessary pre-module and building block for various NLP tasks like Machine translation, Natural language text processing and summarization, User interfaces, Multilingual and cross language information retrieval, Speech recognition, Artificial intelligence, Parsing, Expert system and so on [12].

POS tagging is a basic step for language processing and can serve as the first phase of other language processing tasks. Work on part-of-speech tagging for natural language began in the early 1960s. For researchers working on Indian languages, writing linguistic rules for rule-based approaches is difficult because of their morphological richness. A POS tagger can be implemented using either a supervised or an unsupervised technique.

Supervised POS taggers

Supervised POS taggers are based on pre-tagged corpora, which are used for training to learn word-tag frequencies, rules, tagsets, etc. The performance of supervised POS tagging models generally increases with the size of these corpora.

This is a method of enabling the system to learn tagging and disambiguation rules from examples.

Unsupervised POS taggers

Unsupervised POS tagging models do not require pre-tagged corpora. Instead, they use methods that assign tags to words automatically, applying advanced computational techniques such as the Baum-Welch algorithm to induce tagsets, transformation rules, etc. Both supervised and unsupervised techniques fall into three subcategories:

  • Rule based
  • Stochastic or Statistical based POS tagger
  • Hybrid

Rule Based Approach / Transformation Based

The earliest POS tagging systems were rule-based, in which a set of rules is manually constructed and then applied to a given text. Probably the first rule-based tagging system is that of Klein and Simmons (1963), based on a large set of handcrafted rules and a small lexicon to handle exceptions. The rule-based POS tagging approach uses a set of hand-written rules and depends on a word list, lexicon, or dictionary to assign the appropriate tag to each word. The tagger works in two stages: first, it looks each word up in the dictionary; second, it assigns a tag by resolving ambiguity using the word's linguistic features. Even with a large rule set, this approach fails on unknown words it has no rule for, so a sufficiently comprehensive set of rules is needed to achieve good accuracy.
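A minimal sketch of the two-stage rule-based scheme described above (dictionary lookup, then hand-written fallback rules); the lexicon entries, suffix rules, and default tag are invented for illustration:

```python
# Hypothetical lexicon and hand-written suffix rules (illustrative only).
RULE_LEXICON = {"dog": "N", "run": "V", "quickly": "RB"}
SUFFIX_RULES = [("ly", "RB"), ("ing", "V"), ("s", "N")]  # checked in order

def rule_based_tag(word):
    """Stage 1: dictionary lookup. Stage 2: fall back to suffix rules;
    unknown words that match no rule get a default tag."""
    if word in RULE_LEXICON:
        return RULE_LEXICON[word]
    for suffix, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            return tag
    return "N"  # default guess for words no rule covers

print([rule_based_tag(w) for w in ["dog", "running", "slowly", "xyz"]])
# ['N', 'V', 'RB', 'N']
```

The final `return "N"` line is exactly the weakness the text notes: words covered by neither the lexicon nor the rules can only be guessed at.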

Statistical Approach / Stochastic Tagger

A stochastic tagger generally resolves ambiguity by computing the probability of a given word (or tag) sequence. Stochastic approaches use a statistical model to tag input text; the model is built by training on previously tagged data. These models can tag both known and unknown words, but the correctness of tagging depends on the size of the tagged training data. The disadvantage of this approach is that it sometimes produces tag sequences that are not valid according to the language's rules. Most such taggers are based on models like the Hidden Markov Model (HMM), Support Vector Machine (SVM), n-grams, decision trees, the Maximum Entropy Markov Model (MEMM), and Conditional Random Fields (CRF). The common machine learning models used for POS tagging are:
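The simplest statistical scheme, a unigram model that learns each word's most frequent tag from tagged data, can be sketched as follows; the tiny training corpus here is invented for illustration:

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of (word, tag) pairs.
TRAIN = [("bank", "N"), ("bank", "N"), ("bank", "V"),
         ("river", "N"), ("flows", "V")]

def train_unigram(pairs):
    """Count word-tag frequencies, then keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

model = train_unigram(TRAIN)
print(model["bank"])   # 'N' -- seen twice as N, once as V
```

Because the model only records per-word frequencies, it can emit tag sequences that violate the grammar, which is the disadvantage noted above; sequence models like HMMs and CRFs address this by scoring whole tag sequences.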

Maximum Entropy Markov Model

MEMM stands for Maximum Entropy Markov Model. It is a conditional probabilistic sequence model that can handle long-term dependencies and represent multiple features of a word.

It is based on the principle of maximum entropy, which states that the least biased model consistent with all known facts is the one that maximizes entropy. This model resolves the HMM's difficulty with long-range dependencies and has higher recall and precision than an HMM. However, an MEMM favors states reached through fewer transitions; this disadvantage is known as the label bias problem.

Conditional Random Field Model

CRF stands for Conditional Random Field, a type of discriminative probabilistic model. It has all the advantages of MEMMs without the label bias problem. CRFs are undirected graphical models used to calculate the conditional probability of values on designated output nodes given the values assigned to designated input nodes.

Hidden Markov Model (HMM)

HMM stands for Hidden Markov Model, a generative model. It assigns a joint probability to paired observation and label sequences, and its parameters are trained to maximize the joint likelihood of the training set. Its basic theory is elegant and easy to understand, so it is easy to implement and analyze, and because it uses only positive data it scales easily. It also has disadvantages. To define a joint probability over observation and label sequences, an HMM must enumerate all possible observation sequences, so it makes strong assumptions about the data, such as the Markov assumption that the current label depends only on the previous label. It is impractical for representing multiple overlapping features and long-term dependencies, and the number of parameters to estimate is huge, so it needs a large training set.
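A compact Viterbi decoder over a toy HMM illustrates the joint model and the Markov assumption described above; the start, transition, and emission probabilities here are made up for illustration:

```python
# Toy HMM parameters (hypothetical probabilities).
STATES = ["N", "V"]
START = {"N": 0.6, "V": 0.4}                 # P(first tag)
TRANS = {"N": {"N": 0.3, "V": 0.7},          # P(tag_i | tag_{i-1})
         "V": {"N": 0.8, "V": 0.2}}
EMIT = {"N": {"bank": 0.5, "river": 0.5},    # P(word | tag)
        "V": {"bank": 0.4, "flows": 0.6}}

def viterbi(words):
    """Find the most likely tag sequence; each step depends only on the
    previous tag (the Markov assumption)."""
    V = [{s: START[s] * EMIT[s].get(words[0], 1e-6) for s in STATES}]
    back = []
    for w in words[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prob, prev = max((V[-1][p] * TRANS[p][s] * EMIT[s].get(w, 1e-6), p)
                             for p in STATES)
            col[s], ptr[s] = prob, prev
        V.append(col)
        back.append(ptr)
    best = max(STATES, key=lambda s: V[-1][s])   # trace back the best path
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["river", "flows"]))   # ['N', 'V']
```

The `1e-6` floor stands in for proper smoothing of unseen words, which a real tagger would need.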

Support Vector Machine

The Support Vector Machine is a machine learning approach used primarily for classification and regression. In NLP, SVMs are applied to text categorization and give high accuracy with a large number of texts taken as features. SVMs are well known for their good generalization performance, which is independent of the dimension of the feature vectors, and are also used for pattern recognition.

Hybrid Models

Hybrid models are basically a combination of rule-based and statistical models. A hybrid system uses both rule-based and ML techniques, building new methods from the strongest points of each: it takes the essential features of ML approaches and applies rules to make tagging more efficient. Words are first tagged probabilistically, and then, as post-processing, linguistic rules are applied to correct the tokens' tags. Taggers based on this approach generally give better accuracy than the other techniques alone.
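The hybrid pipeline described above (probabilistic tagging followed by rule-based post-processing) can be sketched as follows; the most-frequent-tag model and the single correction rule are invented for illustration:

```python
# Stage 1: a toy statistical model -- most frequent tag per word (invented).
MOST_FREQ_TAG = {"the": "DT", "bank": "N", "to": "TO", "run": "N"}

def statistical_tag(words):
    """Tag each word with its most frequent tag; unknowns default to N."""
    return [(w, MOST_FREQ_TAG.get(w, "N")) for w in words]

def apply_rules(tagged):
    """Stage 2: linguistic post-processing (illustrative rule):
    a word tagged N directly after 'to' is retagged as a verb."""
    fixed = []
    for i, (w, t) in enumerate(tagged):
        if t == "N" and i > 0 and tagged[i - 1][0] == "to":
            t = "V"
        fixed.append((w, t))
    return fixed

def hybrid_tag(words):
    return apply_rules(statistical_tag(words))

print(hybrid_tag(["to", "run"]))   # [('to', 'TO'), ('run', 'V')]
```

The rule layer corrects exactly the kind of invalid sequence the statistical stage can produce, which is why hybrid taggers in the surveyed work tend to report higher accuracy.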

Conclusion

In this paper, we discussed several part-of-speech taggers. The majority have been created using machine learning techniques such as Hidden Markov Models, Conditional Random Fields, and Maximum Entropy. Automatic POS tagging makes errors because many high-frequency words are ambiguous in their part of speech. As the surveyed work suggests, hybrid techniques tend to perform better because they achieve high accuracy.
