Sanskrit parser Project Report

SANSKRIT PARSER (Parsing a Sanskrit Sentence in Some Recognizable Format)

Project Mentor: Mr. Nikhil Debbarma, Assistant Prof., CSE Dept., NIT, Agartala

Team Members: Akash Bhargava (10UCS002), Ashok Kumar (10UCS010), Laxmi Kant Yadav (10UCS027), Vijay Kumar Gupta (10UCS057)


In this project we try to parse a Sanskrit sentence so that it can later be translated more easily into another language.

Transcript of Sanskrit parser Project Report

Page 1: Sanskrit parser Project Report

SANSKRIT PARSER (Parsing a Sanskrit Sentence in Some Recognizable Format)

Project Mentor: Mr. Nikhil Debbarma, Assistant Prof., CSE Dept., NIT, Agartala

Team Members: Akash Bhargava (10UCS002), Ashok Kumar (10UCS010), Laxmi Kant Yadav (10UCS027), Vijay Kumar Gupta (10UCS057)

Page 2: Sanskrit parser Project Report

A translator must know the grammatical structure of both the input and the output language.

Page 3: Sanskrit parser Project Report

According to many researchers, Sanskrit is a very systematic, rule-governed language.

In that respect Sanskrit behaves much like a programming language.

So if we are able to make a translator that translates Sanskrit into machine code, it would be a significant development in the field of NLP (Natural Language Processing).

Why We Chose This Project

Page 4: Sanskrit parser Project Report

Why We Became Interested

“NASA scientist Rick Briggs had invited 1,000 Sanskrit scholars from India for working at NASA. But scholars refused to allow the language to be put to foreign use”- Dainik

Being understandable to both computers and humans, Sanskrit was considered useful in space research and many other natural language processing applications.

Page 5: Sanskrit parser Project Report

Content

We will first present some concepts and then apply them:

1. Advantages of using Sanskrit

2. Lexical Analysis

3. Parsing

4. Approach

5. Where we are now.

6. Problems

7. References

Page 6: Sanskrit parser Project Report

Linguistically: Sanskrit shares a common base with a large group of Indo-European languages

Limited vocabulary: words represent properties and are built as Prefix + Word + Suffix

Fixed Morphology

Concept of Vibhakti

Advantages of Using Sanskrit (Why Sanskrit)

Page 7: Sanskrit parser Project Report

Words in Sanskrit belong to 3 categories, namely:

Dhatu Roop – root of all verbs

Shabda Roop – root of all nouns

Avyaya – words with no morphology (indeclinables)

Each word belonging to:

Dhatu Roop has 36 morphed versions

Shabda Roop has 21 morphed versions

Avyaya has a single form representing a single meaning

Fixed Morphology

Page 8: Sanskrit parser Project Report

Vibhakti as Pointer

Page 9: Sanskrit parser Project Report

Consider the sentence 'The man saw the girl with the binoculars.'

The man (S) saw (V) the girl (O) with the binoculars (I)

OR

The man (S) saw (V) the girl with the binoculars (O)

नरः द्विनेत्र्या बालाम् अपश्यत्

नरः द्विनेत्रीम् साकम् बालाम् अपश्यत्

In Sanskrit the vibhakti (case ending) on each word marks its role, so each reading gets its own sentence. This is also the reason a Sanskrit sentence is unambiguous: shuffling the words has no effect on the meaning.

Vibhakti as Pointer

Page 10: Sanskrit parser Project Report

Lexical analysis is the process of converting a sequence of characters into a sequence of tokens.

A program or function that performs lexical analysis is called a lexical analyzer, lexer, tokenizer, or scanner.

A lexer often exists as a single function that is called by a parser or another function, or it can be combined with the parser in scannerless parsing.

The lexical analyzer is the first phase of a translator. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.

Lexical Analysis

Page 11: Sanskrit parser Project Report

The role of lexical analyzer

[Diagram: the role of the lexical analyzer. Labels: Source program, Lexical Analyzer, Parser, token, getNextToken, Indexed Database, Output]

Page 12: Sanskrit parser Project Report

The output of lexical analysis is a stream of tokens.

A token is a syntactic category:

◦ In English: noun, verb, adjective, …

◦ In Sanskrit: vibhakti, kriya, visheshana, …

The parser relies on the token distinctions.

What’s a Token?

Page 13: Sanskrit parser Project Report

An implementation must do four things:

1. Recognize substrings corresponding to tokens.

2. Search for the identified token in the database to recognize its context.

3. Depending on the context, the token may be a different part of speech of Sanskrit, e.g. a verb (kriya, a dhatu roop) or a noun form marked by its vibhakti.

4. Tag every token accordingly.

Lexical Analyzer: Implementation
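
The search-and-tag steps above can be sketched in C roughly as follows. This is only an illustration, not the project's code: the Entry structure, the lookup_tag function and the three romanized dictionary entries are hypothetical stand-ins for the indexed database, and the standard bsearch function plays the role of the index.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical dictionary entry: a word form and the tag assigned to it. */
typedef struct {
    const char *form;   /* romanized word form, e.g. "gacchati" */
    const char *tag;    /* part-of-speech tag, e.g. "kriya" */
} Entry;

/* Tiny stand-in for the indexed database. It must stay sorted by form
   (in strcmp order) so that bsearch can be used on it. */
static const Entry DICT[] = {
    { "devah",    "shabda" },
    { "gacchati", "kriya"  },
    { "yatra",    "avyaya" },
};

/* Comparison callback for bsearch: key is a string, elem is an Entry. */
static int cmp_entry(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const Entry *)elem)->form);
}

/* Returns the tag for a token, or "unknown" when it is not in the table. */
const char *lookup_tag(const char *token)
{
    const Entry *hit = bsearch(token, DICT, sizeof DICT / sizeof DICT[0],
                               sizeof DICT[0], cmp_entry);
    return hit ? hit->tag : "unknown";
}
```

Calling lookup_tag("gacchati") returns "kriya", while a word missing from the table returns "unknown", which is the point at which a real system might add it to the database.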

Page 14: Sanskrit parser Project Report

Two important points:

1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time.

2. "Lookahead" may be required to decide where one token ends and the next token begins.

◦ Even our simple example has lookahead issues: i vs. if, = vs. ==

Lookahead
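
The = vs. == case can be illustrated with a short C sketch that peeks one character ahead before deciding which token it has seen. The TokKind names and the next_operator function are hypothetical and exist only for this example.

```c
#include <stddef.h>

typedef enum { TOK_ASSIGN, TOK_EQ, TOK_OTHER } TokKind;

/* Classifies the token starting at s[*pos] and advances *pos past it.
   One character of lookahead decides between "=" and "==". */
TokKind next_operator(const char *s, size_t *pos)
{
    if (s[*pos] == '=') {
        if (s[*pos + 1] == '=') {   /* peek at the next character */
            *pos += 2;
            return TOK_EQ;
        }
        *pos += 1;
        return TOK_ASSIGN;
    }
    if (s[*pos] != '\0')
        *pos += 1;                  /* skip any other single character */
    return TOK_OTHER;
}
```

Without the peek, the scanner would have to commit to = as soon as it saw the first character and would split == into two tokens.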

Page 15: Sanskrit parser Project Report

Sanskrit's property of FIXED MORPHOLOGY lays the basis for analyzing individual verbs and nouns programmatically.

The input word's suffix is analyzed to obtain the following result:

Verbs – tense, number, person

Nouns – gender, number, case

LEXICAL ANALYSIS
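
As a rough illustration of suffix analysis, the C sketch below matches a few present-tense verb endings. It makes several assumptions that are not in the slides: words are romanized in plain ASCII rather than Devanagari, only four endings are listed, persons are named in the Western convention, and the VerbRule structure and analyze_verb function are hypothetical.

```c
#include <string.h>

/* Hypothetical rule: a suffix and the features it signals. */
typedef struct {
    const char *suffix;
    const char *person;
    const char *number;
} VerbRule;

/* A few present-tense endings in romanized form. Longer suffixes come
   first so that "anti" is tried before "ti". */
static const VerbRule RULES[] = {
    { "anti", "third",  "plural"   },
    { "ti",   "third",  "singular" },
    { "si",   "second", "singular" },
    { "mi",   "first",  "singular" },
};

/* Returns the first rule whose suffix ends the word, or NULL if none does. */
const VerbRule *analyze_verb(const char *word)
{
    size_t wlen = strlen(word);
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        size_t slen = strlen(RULES[i].suffix);
        if (wlen > slen && strcmp(word + wlen - slen, RULES[i].suffix) == 0)
            return &RULES[i];
    }
    return NULL;
}
```

With this table, "tapati" is reported as third person singular and "tapanti" as third person plural. A full analyzer would need one such table per tense, plus a similar table of case endings for nouns.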

Page 16: Sanskrit parser Project Report

LEXICAL ANALYSIS

Consider the dhatu (verb root) तप्, meaning 'to heat'. The following inflections are analyzed lexically:

HEATS: तपति, तपतः, तपन्ति | तपसि, तपथः, तपथ | तपामि, तपावः, तपामः

WILL HEAT: तप्स्यति, तप्स्यतः, तप्स्यन्ति | तप्स्यसि, तप्स्यथः, तप्स्यथ | तप्स्यामि, तप्स्यावः, तप्स्यामः

HEATED: अतपत्, अतपताम्, अतपन् | अतपः, अतपतम्, अतपत | अतपम्, अतपाव, अतपाम

HEAT IT (order): तपतु, तपताम्, तपन्तु | तप, तपतम्, तपत | तपानि, तपाव, तपाम

Page 17: Sanskrit parser Project Report

LEXICAL ANALYSIS

Consider the noun देव, representing God. The following inflections are possible:

1. Nominative (subject): देवः देवौ देवाः

2. Accusative (object): देवम् देवौ देवान्

3. Instrumental (by): देवेन देवाभ्याम् देवैः

4. Dative (to): देवाय देवाभ्याम् देवेभ्यः

5. Ablative (from): देवात् देवाभ्याम् देवेभ्यः

6. Genitive (of): देवस्य देवयोः देवानाम्

7. Locative (in): देवे देवयोः देवेषु

Page 18: Sanskrit parser Project Report

LEXICAL ANALYSIS

Input Sentence

Tokenize

Avyaya Analysis

Verb Analysis

Noun Analysis

Unknown word (add to database)

Page 19: Sanskrit parser Project Report

The scanner recognizes words.

The parser recognizes syntactic units.

Parser operations:

◦ Check and verify syntax based on specified syntax rules

◦ Report errors

Automation:

◦ The process can be automated

Parsing

Page 20: Sanskrit parser Project Report

1. Simplicity of design

2. Improving efficiency

3. Enhancing portability

Why Separate Lexical Analysis and Parsing

Page 21: Sanskrit parser Project Report

Parsing Sanskrit Text

Now we move towards translating a Sanskrit sentence into its parsed equivalent.

PARSING: To analyze (a sentence) into its component parts and describe their syntactic roles.

To analyze (a string or text) into logical syntactic components, typically in order to test conformability to a logical grammar.

Page 22: Sanskrit parser Project Report

Parsing Sanskrit Text

Sanskrit Sentence Structure: SOV

English Sentence Structure: SVO

बालः पाठम् पठति (S O V)

Boy reads chapter (S V O)

Page 23: Sanskrit parser Project Report

Example Sanskrit Sentence

Page 24: Sanskrit parser Project Report

Approach(Coding Concept)

We first tokenize the input using strtok(str, " ").

Each token can be one of 3 types: noun, verb, or preposition. The task is to identify each token, which is done by matching it against the indexed database.

Each token is stored in a structure along with its meaning and its morphology.

Then the parser comes into play and forms a tree-like structure from these tokens.
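
The tokenizing step could look roughly like the C sketch below. The Token structure and the tokenize function are hypothetical illustrations of "a structure along with the meaning and its morphology"; the real layout used in the project is not shown in the slides. Because strtok works byte by byte and the delimiter is an ASCII space, the same loop also works on UTF-8 encoded Devanagari.

```c
#include <string.h>

/* Hypothetical token record. Meaning and morphology would be filled in
   later from the database lookup. */
typedef struct {
    const char *text;        /* points into the caller's buffer */
    const char *meaning;     /* NULL until looked up */
    const char *morphology;  /* NULL until analyzed */
} Token;

/* Splits sentence in place on spaces and stores up to max tokens in out.
   strtok overwrites delimiters with '\0', so sentence must be writable.
   Returns the number of tokens stored. */
size_t tokenize(char *sentence, Token *out, size_t max)
{
    size_t n = 0;
    for (char *w = strtok(sentence, " "); w != NULL && n < max;
         w = strtok(NULL, " ")) {
        out[n].text = w;
        out[n].meaning = NULL;
        out[n].morphology = NULL;
        n++;
    }
    return n;
}
```

Note that strtok keeps hidden state between calls, so this sketch is not safe to use from several threads at once.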

Page 25: Sanskrit parser Project Report

Bottom-Up Parser Technique

Bottom-Up LR

◦ Construct the parse tree in a bottom-up manner

◦ Find the rightmost derivation in reverse order

◦ For every potential right-hand side and token, decide when a production is found

More powerful: bottom-up parsers can handle the largest class of grammars that can be parsed deterministically
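
To make the shift and reduce steps concrete, here is a very small C sketch of a bottom-up recognizer. It is a hand-written shift-reduce loop, not a table-driven LR parser, and it assumes a single hypothetical rule S -> N N V (noun, noun, verb, matching the SOV example) that is far simpler than a real Sanskrit grammar. The Symbol names and the parse_sov function are invented for this example.

```c
#include <stddef.h>

/* Symbols: terminals N (noun) and V (verb), nonterminal S (sentence). */
typedef enum { SYM_N, SYM_V, SYM_S } Symbol;

/* Shift-reduce recognizer for the single hypothetical rule S -> N N V.
   Returns 1 if the tag sequence reduces to exactly one S, otherwise 0. */
int parse_sov(const Symbol *tags, size_t n)
{
    Symbol stack[64];
    size_t top = 0;

    for (size_t i = 0; i < n; i++) {
        if (top == 64)
            return 0;                 /* input too long for this sketch */
        stack[top++] = tags[i];       /* shift the next tag */

        /* reduce: replace N N V on top of the stack with S */
        if (top >= 3 && stack[top - 3] == SYM_N &&
            stack[top - 2] == SYM_N && stack[top - 1] == SYM_V) {
            top -= 3;
            stack[top++] = SYM_S;
        }
    }
    return top == 1 && stack[0] == SYM_S;
}
```

The reduction happens only after the whole right-hand side N N V sits on the stack, which is the reverse of a rightmost derivation of that rule.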

Page 26: Sanskrit parser Project Report

Approach

Programming language used: C and C++

Database used: Linux file system, indexed

Data structures: array, linked list, structure, tree, indexing and hashing

INPUT: A Sanskrit sentence or paragraph, e.g. यत्र रामः गच्छति तत्र देवाः बालेन सह नदीम् निकषा तिष्ठन्ति।

OUTPUT: Recognize all the parts of speech and form a tree structure to be able to understand the sentence.

Page 27: Sanskrit parser Project Report

How the Output Will be Shown in Terminal

यत्र::: this is a avyaya.. and the meaning is: where_there ]

रामः::: Nominative,Singular, Gender-Masculine ,noun and the root is: राम and the meaning is Ram

गच्छति::: The root is: गच्छ the meaning is: go present-tense,first-person,singular

तत्र::: this is a avyaya.. and the meaning is: there

देवाः::: Nominative,Plural Gender-Masculine ,noun ,and the root is: देव and the meaning is god

बालेन::: Instrumental,Singular, Gender-Masculine ,noun, and the root is: बाल and the meaning is boy

नदीम्::: Accusative,Singular, Gender-Feminine ,noun and the root is: नदी and the meaning is river

Page 28: Sanskrit parser Project Report

Avyaya's Role in Sanskrit

Avyaya words (indeclinables) are used to connect two or more simple sentences.

Examples:

यदि-तर्हि (if-then)

यत्र-तत्र (where-there)

परन्तु (but)

अथापि (hence)

चेद् (provided, if)

Avyaya not only connect sentences but also affect the structure of a simple sentence.

Page 29: Sanskrit parser Project Report

Challenges in the Code

Every word encountered in the input sentence could be any part of speech of Sanskrit, as there is no fixed word order.

Because of this property of Sanskrit, searching becomes important.

The database and word collection were in Unicode format, so each word takes up more space.

Page 30: Sanskrit parser Project Report

Problems

Grammar of Sanskrit language

How can we represent it in BNF grammar?

Parser techniques

Structure of code

Page 31: Sanskrit parser Project Report

Where We are Now

A big chunk of our time was invested in researching the Sanskrit language and its grammar, which was quite difficult.

So far we have implemented the lexer part and the parser part.

Page 32: Sanskrit parser Project Report

Reference

Sanskrit & Artificial Intelligence, NASA: Knowledge Representation in Sanskrit and Artificial Intelligence, by Rick Briggs

http://www.vedicsciences.net/articles/sanskrit-nasa.html

AI Magazine publishes the importance of Sanskrit

http://www.parankusa.org/SanskritAsProgramming.pdf

http://sanskrit.jnu.ac.in/morph/analyze.jsp

http://en.wikipedia.org/wiki/Sanskrit_verbs

http://en.wikipedia.org/wiki/Sanskrit_grammar

Page 33: Sanskrit parser Project Report

Thank You