NEW_PPT

20
A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS Presented by: Joyeeta Bagchi Sneha Sarkar Anasuya Paul Koushik Dutta Under the guidance of- Mr. Alok Ranjan Pal 1

Transcript of NEW_PPT

Page 1: NEW_PPT

1

A RULE BASED APPROACH ON STEMMING OF BENGALI

VERBS

Presented by:Joyeeta Bagchi

Sneha SarkarAnasuya Paul

Koushik Dutta

Under the guidance of-

Mr. Alok Ranjan Pal

Page 2: NEW_PPT

2

CONTENTS Introduction Difficulty of the language Overview of stemming Related Work Pictorial Representation of proposed approach Example of each step Module 1(Suffix Stripping) Module 2(Applying Rules) Explanation of module 2 with example Algorithm Partial view of input file List of suffixes Partial view of output file Efficiency and time complexity Conclusion and future work

Page 3: NEW_PPT

3

INTRODUCTIONStemming is an operation that splits a word into the constituent root part and affix without doing complete morphological analysis.For example, Eating = Eat + -ing

Worked = Work + -ed

The main purpose of stemming is to reduce different grammatical forms / word forms of a word like its noun, adjective, verb, adverb etc. to its root form.

We can say that the goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

Page 4: NEW_PPT

4

DIFFICULTY OF THE LANGUAGE

Bengali is one of the most morphologically rich language and stemming of Bengali verb is the most problematic area for Stemming.

Sometimes, nearly 10x5 forms for a certain verb in Bengali may appear in different contexts.

Page 5: NEW_PPT

5

OVERVIEW OF STEMMING

A typical simple stemmer algorithm involves removing suffixes using a list of frequent suffixes, while a more complex one would use morphological knowledge to derive a stem from the words.

Page 6: NEW_PPT

6

RELATED WORK In 1980 Martin Porter developed the “Porter

Stemmer”.It uses the fact that English language suffixes are mostly a combination of smaller and simpler suffixes.

Ramanathan and Rao (2003) proposed a lightweight stemmer for Hindi which has used a hand crafted suffix list and has performed longest match stripping

Dasgupta and Ng (2006) proposed unsupervised morphological parsing of Bengali. When evaluated on a set of 4,110 human-segmented Bengali words, the algorithm achieves 83% success.

Majgaonker and Siddiqui (2010) developed an unsupervised approach for Marathi stemmer. The maximum accuracy observed is 82.5% for the statistical suffix stripping approach.

Suba et al. (2011) proposed two stemmers for Gujarati, with an average accuracy of about 90.7%.

Page 7: NEW_PPT

7

PICTORIAL REPRESENTATION OF PROPOSED APPROACH

Page 8: NEW_PPT

8

EXPLANATION OF EACH STEP

Page 9: NEW_PPT

9

MODULE 1 (SUFFIX STRPPING)

Page 10: NEW_PPT

10

MODULE 2 (APPLYING RULES)

Page 11: NEW_PPT

11

EXPLANATION OF MODULE 2 USING EXAMPLE

Page 12: NEW_PPT

12

ALGORITHMSTEP1: Start of algorithm.

STEP 2: Create 4 new string[] namely splits1[], splits2[ ] and splits3[ ].

STEP 3: Read the contents of the doc files and split the words by space (‘ ’) separator.

3.1. Store the words of each sentence in splits1[ ].

3.2. Store the inflexions in splits2[ ].

3.3. Store the desired root words in splits3[ ].

STEP 4: Declare and initialize variables l1=length of splits1[ ] , 12=length of splits2[ ] .

STEP 5: Fetch the inflected verb forms in input1[] from splits1[i] if ‘/verb’ is contained by the currently fetched word. This step is repeated 11 times.

5.1. Determine the subroot from input1[i] by repeating the steps 12 times.

5.1.1. if splits2[j] in contained in input1[i] then,

5.1.1.a. Declare variable index which stores the index of last occurrence of splits2[j] in input1[i].

Page 13: NEW_PPT

13

5.1.1.b. If index is greater than equal to 2 then,5.1.1.b.i. Store the substring of input1[i]

from begindex=0 to endindex=index in input1[i].

5.1.1.b.ii. Break the loop.5.2. Determine the actual root input1[i] by repeating the steps l1 times. 5.2.1. Check the ending kar of input1[i].

5.2.1.a. I f input1[i] ends with e-kar(‘ি�’), o-kar(‘ো��’), a-kar(‘ো�’) or aa-kar(‘��’) then, replace it with aa-kar(‘��’).

5.2.1.b. if length of input1[i] is less than 3, concate it with aa-kar(‘��’).

5.2.2. Check the starting kar of input1[i].5.2.2.a. if input1[i] starts with e-kar

(‘ি�’), then replace it with a-kar(‘ো�’).

5.2.2.b. if input1[i] starts with u-kar (‘��’), then replace it with o-kar(‘ো��’).

5.2.2.c. if input1[i] starts with a-kar(‘ো�’), then replace it with aa-kar(‘��’).

Page 14: NEW_PPT

14

STEP 6: Generate the output doc file by copying the contents of splits1[] and concatenating it with their obtained root words from input1[] wherever the word contains ‘/verb’.

STEP 7: Compare the obtained sentences in splits1[ ] with the desired sentences in splits3[ ] and calculate the efficiency.

STEP 8: End of algorithm.

 

Page 15: NEW_PPT

15

PARTIAL VIEW OF INPUT FILE

Page 16: NEW_PPT

16

LIST OF SUFFIXES

Page 17: NEW_PPT

17

PARTIAL VIEW OF OUTPUT FILE

Page 18: NEW_PPT

18

EFFICIENCY:

Dealing with 450 inflections of 14 selected root verbs, the proposed approach gives an efficiency of 99.36%.

TIME COMPLEXITY:

The time complexity of the proposed algorithm in worst case is O(n2).

EFFICIENCY & TIME COMPLEXITY

Page 19: NEW_PPT

19

CONCLUSION AND FUTURE WORK

In this project, we present a lightweight stemmer for 14 selected Bengali Verbs that strips the suffixes using a predefined suffix list, on a “longest match” basis, and then finds root on basis of some rules.

Except a few cases, the result obtained from our algorithm is quite satisfactory according to our expectation.

We argue that a stronger and populated learning set would invariably yield better result. In future , we plan to test our algorithm with more sets of Bengali verbs.

 

Page 20: NEW_PPT

20

THANK YOU