iStemmatizer – Stemmer based lemmatizer and Abhipray – an Opinion Mining tool for Gujarati language

iStemmatizer – Stemmer based Lemmatizer

In our previous work, the Stemmatizer was developed using a rule-based approach combined with a dictionary-based approach, giving an accuracy of 98.33%. The iStemmatizer discussed below was developed to overcome the limitations of that tool, which are also discussed below.

Lemmatization, the process of retrieving a lemma from a given inflected word by removing its inflections, is an important step in any Information Retrieval system [1]. Stemming also removes inflections and returns a stem, but unlike a lemma, a stem is not always a valid dictionary word. The tool used to lemmatize a word is known as a Lemmatizer, and the tool for stemming is known as a Stemmer. There are two types of Lemmatizer [1]: Inflectional and Derivational. An Inflectional Lemmatizer obtains the lemma from a word of the same Part of Speech (PoS), while a Derivational Lemmatizer also provides a lemma from a different PoS. The lemmatizer discussed in this paper is both inflectional and derivational. In our previous work, the tool Stemmatizer [2] was developed using a rule-based approach combined with a dictionary-based approach, giving an accuracy of 98.33%. The tool has a few limitations, which are discussed below:

• For some words, the inflected word or a part of it is stored in the dictionary as another lemma. For example, „જેમાં‟ (/dʒeːm / – in which) has the stem „જે‟ with the inflection „માં‟. But after removing the inflections „ા‟ + „ં‟, it gives „જેમ‟ (/dʒeːmə/ – like), which is also a stem with a different meaning. In this case, it returns „જેમ‟ (/dʒeːmə/ – like) instead of „જે‟ (/dʒeː/ – which).

• For some inflections, a part of the inflection is itself another inflection, which is tested before the intended one. For example, the word „બળપૂર્વક‟ (/bəɭəpuɾʋəkə/ – forcefully) has the inflection „પૂર્વક‟ (/puɾʋəkə/). But because the inflections are sorted by length, another inflection, „ક‟, is found and removed first. So the actual inflection cannot be found and removed. This gives the output „બળપૂર્વ‟ (/bəɭəpuɾʋə/), which is incorrect.

• Stemmatizer considers 179 inflections for the Gujarati language; if a word carries an inflection outside this list, the tool is not able to remove it.

• The algorithm searches all inflections multiple times until there is no inflection left in the word, which reduces the time performance of the algorithm.

Considering the above limitations, the authors developed the iStemmatizer, which improves the results by handling these limitations. The iStemmatizer is also a stemmer-based lemmatizer; it uses a tree-based approach in combination with a dictionary-based approach. Lemmatization for Malayalam [3] also uses a tree-based method, but the core structure of the tree used there is different, owing to the nature of the languages.
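The second limitation can be illustrated with a small Python sketch. It is not the Stemmatizer code: the two helper functions and the two-item inflection list are assumptions made for illustration only, and the longest-match variant is shown purely for contrast with the first-match behaviour, not as the tree method used by iStemmatizer.

```python
def strip_first_match(word, inflections):
    """Remove the first inflection in the list that the word ends with."""
    for suffix in inflections:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def strip_longest_match(word, inflections):
    """Remove the longest inflection that the word ends with (for contrast)."""
    matches = [s for s in inflections
               if word.endswith(s) and len(word) > len(s)]
    if not matches:
        return word
    longest = max(matches, key=len)
    return word[: -len(longest)]


stem = "બળ"
long_suffix = "પૂર્વક"
short_suffix = long_suffix[-1]          # the one-character inflection
word = stem + long_suffix               # the "forcefully" example

# The shorter inflection is tested first, as in the limitation described.
inflections = [short_suffix, long_suffix]

short_first = strip_first_match(word, inflections)
longest = strip_longest_match(word, inflections)
```

Here `short_first` loses only the final character, leaving a string that is not the intended stem, while `longest` is reduced to the stem.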

Methodology

The basic objective of iStemmatizer is to lemmatize an inflected Gujarati word [1] and retrieve its lemma. The tree-based method uses a rooted tree as an inflection tree, and the dictionary-based method uses a vocabulary as its lexical resource. The following sections discuss the inflection tree, the vocabulary, and the working of the algorithm used in iStemmatizer.

Inflection Tree

iStemmatizer uses a tree-based method in combination with a dictionary-based method. The authors collected the inflections available in Gujarati literature [4,5,6,7], studied them, and generated trees. This section describes the process adopted for preparing the trees used by iStemmatizer. We focus on removing suffix inflections from words. For that, a list of 290 inflections was prepared from books of Gujarati grammar as well as from previous research in this context. The enlarged list of inflections mitigates the third limitation discussed above. The list is divided into 38 sublists, where all inflections of an individual sublist end with the same character.
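The division into sublists can be sketched in Python as a grouping by last character. The function name and the four sample inflections are illustrative assumptions; the full list of 290 inflections is not reproduced here.

```python
from collections import defaultdict


def group_by_last_character(inflections):
    """Split a list of inflections into sublists keyed by their final character."""
    groups = defaultdict(list)
    for inflection in inflections:
        groups[inflection[-1]].append(inflection)
    return dict(groups)


# A tiny illustrative sample, not the real inflection list.
sample = ["દીઠ", "પૂર્વક", "ક", "માં"]
sublists = group_by_last_character(sample)

for last_char, members in sublists.items():
    print(last_char, members)
```

Each key of `sublists` would later become the root of one inflection tree.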

Characteristics of Tree

A tree is generated according to the following characteristics. Each characteristic is explained using the example of the tree prepared for five inflections ending with the character „ઠ‟. Table 1 shows all five inflections in the last column and the individual characters of each inflection in the preceding columns.

Table 1 Five inflections ending with character ‘ઠ’
ન દ ા ઠ દીઠ િા ષ ા ઠ િાષ્ઠ ા ષ ા ઠ ા ષ્ઠ ા ષ ા ઠ ા ષ્ઠ િા ષ ા ઠ િનષ્ઠ

Figure 1 Inflection tree for the root ‘ઠ’

The end character of all the inflections in a single sublist is common, and it is taken as the root of the tree. The children are selected by iterating through the characters of each inflection in reverse order.

Each character appearing at the same position in different inflections is added only once in the tree. For example, in Figure 2, „ા ‟ is the second-last character in four inflections and „ષ’ is the third-last character in four inflections, but each appears only once. While iterating through the characters in reverse order, whenever characters at the same position differ but all the characters after that position are the same, they are treated as siblings in the tree. For example, the fourth-last characters ‘ા ’, „ા ‟ and „િા‟ are siblings, as they all have the characters „ષ’, „ા ‟ and „ઠ’ as their third-last, second-last and last characters respectively. A node of the tree holds one or more characters as its value; it cannot be empty. For example, in the tree shown in Figure 2, the first child of the root is „ા દ*‟, which consists of two characters and the „*‟ symbol, while the second child has just one character, „ા ‟. Figure 1 shows a representation of the tree with only one character per node, which unnecessarily increases the number of levels of the tree.

The value of a node ends with the sign “*” if it contains the first character of an inflection. For example, in Figure 2, there is a „*‟ symbol in „ા દ*‟ because „દ’ is the first character of the inflection „દીઠ’. The node with value „ષ’ does not have the „*‟ symbol because it is not the first character of any of the five inflections. There is no restriction on the number of children or the number of levels in these trees.
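A minimal Python sketch of such an inflection tree follows. It uses one character per node (the uncompressed form) and a boolean `is_start` in place of the „*‟ mark. The class, the function names and the two sample inflections are assumptions made for illustration; they are not taken from the iStemmatizer source.

```python
class Node:
    """One node of an inflection tree, holding a single character."""

    def __init__(self, value):
        self.value = value
        self.is_start = False   # plays the role of the "*" mark
        self.children = {}      # character -> Node


def build_tree(inflections):
    """Build a tree for inflections that all end with the same character."""
    root_char = inflections[0][-1]
    root = Node(root_char)
    for inflection in inflections:
        if inflection[-1] != root_char:
            raise ValueError("all inflections must share the last character")
        node = root
        # Walk the remaining characters in reverse order, reusing shared nodes.
        for ch in reversed(inflection[:-1]):
            node = node.children.setdefault(ch, Node(ch))
        node.is_start = True    # this node holds the inflection's first character
    return root


def marked_paths(node, suffix=""):
    """List every inflection encoded in the tree (paths ending at marked nodes)."""
    suffix = node.value + suffix
    found = [suffix] if node.is_start else []
    for child in node.children.values():
        found.extend(marked_paths(child, suffix))
    return found


long_suffix = "પૂર્વક"
short_suffix = long_suffix[-1]
tree = build_tree([short_suffix, long_suffix])
encoded = marked_paths(tree)
```

Merging each chain of single-child, unmarked nodes into one node would give the more compact form in which a node holds several characters.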

Vocabulary Structure

The vocabulary structure used in iStemmatizer is taken from Stemmatizer [2]. Three cases are handled using two different vocabulary structures, which are described in Figure 3. The “$” structure is used for words with a unique stem-lemma combination, irrespective of whether the stem and lemma are the same or different. A different structure, “$”, is used when the same stem corresponds to more than one lemma. The authors have not come across any word that does not fit one of these vocabulary structures.
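One way to picture the three cases in memory is a Python mapping from each stem to a list of lemmas. This is only an assumed representation for illustration: it does not reproduce the stored format shown in Figure 3, and the entries are hypothetical.

```python
# Hypothetical in-memory vocabulary: stem -> list of lemmas.
vocabulary = {
    "જે": ["જે"],              # stem and lemma are the same
    "જેમ": ["જેમ"],            # another unique stem-lemma pair
    "નવ": ["નવું", "નવ"],      # one stem shared by more than one lemma
}


def lemmas_for(stem, vocabulary):
    """Return all lemmas stored for a stem, or an empty list if it is unknown."""
    return vocabulary.get(stem, [])


shared = lemmas_for("નવ", vocabulary)
unknown = lemmas_for("xyz", vocabulary)
```

A list value keeps the single-lemma and multi-lemma cases uniform, so the caller never has to distinguish between the two structures.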

Algorithm

The basic flow of iStemmatizer is similar to that of Stemmatizer. However, Stemmatizer returns the first matched lemma from the vocabulary, while iStemmatizer continues processing even after receiving one lemma to check for the existence of another lemma, which helps resolve the first limitation discussed above. Also, iStemmatizer uses a tree structure to traverse the inflections instead of a simple list of inflections, which resolves the second limitation discussed above and also reduces the processing time compared to Stemmatizer. iStemmatizer takes as input a word to be lemmatized and gives as output a lemma, or a list of lemmas if more than one lemma is found for the input word. A list of lemmas is returned in two cases:

• When there is more than one lemma corresponding to the received stem. For example, „નવી‟ (/nəʋi/ – new) is an inflected form of the lemma „નવું‟ (/nəʋu/ – new), and „નવમું‟ (/nəʋəm / – ninth) is an inflected form of the lemma „નવ‟ (/nəʋə/ – nine). But both have the same stem „નવ‟ (/nəʋə/). So the algorithm gives both possible lemmas as output.

• When, after removing one inflection, the word matches some stem in the vocabulary while it is actually still an inflected word with a different stem. For example, the inflected word „માનવીમાં‟ has the stem „માન‟ with three inflections, „માં‟, „ી‟ and „વ‟. But the stem obtained after removing two inflections is „માનવ‟, which is also the stem of a different word. So the algorithm gives both possible lemmas as output.

Input: word to be lemmatized; listFlag, used to know whether only the first lemma is to be returned or all possible lemmas are to be retrieved

Output: lemma or list of lemmas

iStemmatize(word, listFlag = False)
    if word found in vocabulary
        return lemma part of vocabulary for this word
    start a loop until word is still having inflections
        if last character of word matched with any of the roots from all trees
            select the tree for traversal
        else
            if listFlag set
                add lemma part of vocabulary for this word to list
            else
                return lemma part of vocabulary for this word
        children = root of selected tree
        start a loop for traversing each child in children
            inflection = merging of all characters on the path from current node to root in tree
            new word = word after removing inflection
            if inflection marked with “*”
                if new word is not same as word
                    if new word is empty
                        if listFlag set
                            add lemma part of vocabulary for this word to list
                        else
                            return lemma part of vocabulary for this word
                    if new word found in vocabulary
                        if listFlag set
                            add lemma part of vocabulary for this word to list
                        else
                            return lemma part of vocabulary for this word
                    else
                        if new word ends with ‘ા ’
                            new word = word after removing ‘ા ’
                            if new word found in vocabulary
                                if listFlag set
                                    add lemma part of vocabulary for this new word to list
                                else
                                    return lemma part of vocabulary for this new word
                    children = children of currently iterating child
                    if children is NULL
                        break loop of children iteration
            else
                children = children of currently iterating child
                if children is NULL
                    break loop of children iteration
        if word is changed in tree traversal
            word = new word
        else
            exit the loop
    if listFlag set
        return list of words
    else
        return word

The algorithm also allows returning only the very first lemma found for the input word. In this case the second point discussed above is not considered. This behavior is controlled by the flag taken as input to the algorithm: if listFlag is set, the algorithm returns all possible lemmas; otherwise it returns the first lemma. The algorithm is implemented in Python 3.x, and PHP 5.6.x is used to provide the user interface. Like Stemmatizer, iStemmatizer also allows the vocabulary to be extended when a new word is found that does not currently exist in the vocabulary. Whether or not a word was found in the dictionary is tracked while searching for the word in the vocabulary.
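The overall flow can be approximated by the simplified Python sketch below. It is not the authors' implementation: it replaces the inflection tree with a plain suffix list, omits the special handling of a trailing sign and of empty results, and uses a hypothetical vocabulary in which each stem maps to a list of lemmas. As in the pseudocode, a set `list_flag` collects every lemma, and an unset flag returns the first one found.

```python
def istemmatize_sketch(word, vocabulary, inflections, list_flag=False):
    """Strip suffixes repeatedly, looking each candidate stem up in the vocabulary."""
    if word in vocabulary:
        return vocabulary[word] if list_flag else vocabulary[word][0]

    found = []            # lemmas collected so far (only used with list_flag)
    pending = [word]      # word forms that still need their suffixes tested
    seen = {word}
    while pending:
        current = pending.pop(0)
        for suffix in inflections:
            # Skip suffixes that do not match or would leave nothing behind.
            if not current.endswith(suffix) or len(current) <= len(suffix):
                continue
            stem = current[: -len(suffix)]
            if stem in seen:
                continue
            seen.add(stem)
            if stem in vocabulary:
                if not list_flag:
                    return vocabulary[stem][0]
                for lemma in vocabulary[stem]:
                    if lemma not in found:
                        found.append(lemma)
            # Keep going: the stem may itself still carry an inflection.
            pending.append(stem)
    return found if list_flag else word


STEM = "માન"
OTHER = STEM + "વ"
vocabulary = {STEM: [STEM], OTHER: [OTHER]}   # hypothetical entries
inflections = ["માં", "ી", "વ"]
word = OTHER + "ી" + "માં"

all_lemmas = istemmatize_sketch(word, vocabulary, inflections, list_flag=True)
first_lemma = istemmatize_sketch(word, vocabulary, inflections)
```

With the flag set, the sketch reports both the intermediate stem and the shorter one, which mirrors the second case described above; without it, processing stops at the first vocabulary match.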
