Sanskrit parser Project Report
SANSKRIT PARSER (Parsing a Sanskrit Sentence in Some Recognizable Format)
Project Mentor:
Mr. Nikhil Debbarma, Assistant Prof., CSE Dept., NIT, Agartala
Team Members:
Akash Bhargava (10UCS002)
Ashok Kumar (10UCS010)
Laxmi Kant Yadav (10UCS027)
Vijay Kumar Gupta (10UCS057)
Why We Chose This Project
A translator must know the grammatical structure of both the input and the output language.
According to many researchers, Sanskrit is a very scientific language.
Sanskrit behaves much like a programming language.
So if we can build a translator that converts Sanskrit into machine code, it would be a significant development in the field of NLP (Natural Language Processing).
Why We Became Interested
“NASA scientist Rick Briggs had invited 1,000 Sanskrit scholars from India for working at NASA. But scholars refused to allow the language to be put to foreign use” - Dainik
Being understandable to both computers and humans, Sanskrit was considered useful in space research and many other natural language processing applications.
Content
We will first present some concepts and then apply them:
1. Advantages of using Sanskrit
2. Lexical Analysis
3. Parsing
4. Approach
5. Where we are now.
6. Problems
7. References
Advantages of using Sanskrit (Why Sanskrit)
Linguistically: Sanskrit is a common base for a large group of Indo-European languages
Limited Vocabulary: words represent properties (Prefix + Word + Suffix)
Fixed Morphology
Concept of Vibhakti
Fixed Morphology
Words in Sanskrit belong to 3 categories, namely:
Dhatu Roop - root of all verbs
Shabda Roop - root of all nouns
Avyaya - words with no morphology (indeclinables)
Each word belonging to:
Dhatu Roop has 36 morphed versions
Shabda Roop has 21 morphed versions
Avyaya represents a single meaning
Vibhakti as Pointer
Consider the sentence 'The man saw the girl with the binoculars.' It has two readings:
The man(S) saw(V) the girl(O) with the binoculars(I)
OR
The man(S) saw(V) the girl with the binoculars(O)
In Sanskrit each reading is a different sentence:
नरः द्विनेत्र्या बालाम् अपश्यत्
नरः द्विनेत्रीम् साकम् बालाम् अपश्यत्
This is also why a Sanskrit sentence is UNAMBIGUOUS: the vibhakti (case ending) on each word points to its role, so shuffling the words has NO effect on the meaning.
Lexical Analysis
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
A program or function that performs lexical analysis is called a lexical analyzer, lexer, tokenizer, or scanner.
A lexer often exists as a single function which is called by a parser or another function, or it can be combined with the parser in scannerless parsing.
The lexical analyzer is the first phase of a translator. Its main task is to read the input characters and produce as output a sequence of tokens that the parser uses for syntax analysis.
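The role of lexical analyzer
[Figure: Source program -> Lexical Analyzer -> Parser -> Output. The lexical analyzer hands a token to the parser each time the parser calls getNextToken; an Indexed Database is also shown.]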
What’s a Token?
The output of lexical analysis is a stream of tokens. A token is a syntactic category:
◦ In English: noun, verb, adjective, …
◦ In Sanskrit: vibhakti, kriya, visheshana, …
The parser relies on the token distinctions.
Lexical Analyzer: Implementation
An implementation must do the following:
1. Recognize substrings corresponding to tokens.
2. Search for each identified token in the database to recognize its context.
3. Depending on the context, assign the token a part of speech of the Sanskrit language, e.g. a verb (kriya, a dhatu roop) or a noun with its vibhakti (a shabda roop).
4. Tag every token accordingly.
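Steps 2 to 4 above can be sketched as a lookup in a sorted in-memory table. This is only an illustration under assumptions: the table WORD_INDEX, its four entries, the tag strings and the function tag_token are hypothetical, and the slides do not show how the project's indexed database is actually laid out.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char *word;  /* surface form as it appears in the text */
    const char *tag;   /* part-of-speech tag stored with it */
} Entry;

/* Hypothetical index. It must stay sorted in strcmp (byte) order,
 * which for UTF-8 text is the same as code point order. */
static const Entry WORD_INDEX[] = {
    { "गच्छति", "kriya"  },
    { "तत्र",   "avyaya" },
    { "यत्र",   "avyaya" },
    { "रामः",   "noun"   },
};

/* bsearch comparator: key is the token, elem is a table entry. */
static int cmp_entry(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const Entry *)elem)->word);
}

/* Return the tag for a token, or "unknown" when it is not in the index. */
const char *tag_token(const char *token)
{
    const Entry *e = bsearch(token, WORD_INDEX,
                             sizeof WORD_INDEX / sizeof WORD_INDEX[0],
                             sizeof WORD_INDEX[0], cmp_entry);
    return e ? e->tag : "unknown";
}
```

Calling tag_token on each token of a sentence yields one tag per word; a word that returns "unknown" is a candidate for adding to the database.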
Lookahead
Two important points:
1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time.
2. “Lookahead” may be required to decide where one token ends and the next token begins.
◦ Even simple programming-language input has lookahead issues: i vs. if, = vs. ==
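The two lookahead cases named above (i vs. if, = vs. ==) can be sketched in a few lines. The token kinds and the function next_token are hypothetical names chosen for this illustration; they are not taken from the project's code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token kinds for this sketch only. */
typedef enum { TOK_ASSIGN, TOK_EQ, TOK_IF, TOK_IDENT, TOK_END, TOK_ERROR } TokenKind;

/*
 * Scan one token starting at *p and advance *p past it.
 * One character of lookahead decides "=" vs "==";
 * reading to the end of the word decides "i" vs "if".
 */
TokenKind next_token(const char **p)
{
    const char *s = *p;
    while (*s == ' ') s++;                      /* skip blanks */
    if (*s == '\0') { *p = s; return TOK_END; }

    if (*s == '=') {
        if (s[1] == '=') { *p = s + 2; return TOK_EQ; }  /* peek at the next char */
        *p = s + 1;
        return TOK_ASSIGN;
    }
    if (isalpha((unsigned char)*s)) {
        const char *start = s;
        while (isalnum((unsigned char)*s)) s++;  /* take the longest possible word */
        *p = s;
        if ((size_t)(s - start) == 2 && strncmp(start, "if", 2) == 0)
            return TOK_IF;
        return TOK_IDENT;
    }
    *p = s + 1;                                  /* unrecognized character */
    return TOK_ERROR;
}
```

Because the scanner always consumes the longest word before deciding, "iffy" is an identifier and not the keyword if followed by "fy".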
LEXICAL ANALYSIS
Sanskrit's property of FIXED MORPHOLOGY lays the basis for analyzing individual verbs and nouns programmatically.
The input word's suffix is analyzed to obtain the following:
Verbs - tense, number, person
Nouns - gender, number, case
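A minimal sketch of suffix analysis for nouns, assuming UTF-8 input and a longest-match rule. The table ENDINGS is a partial, hypothetical list of endings for masculine a-stem nouns such as देव, and analyze_noun is an illustrative name; a real analyzer would need the full set of endings and a way to handle endings shared by several cases.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *suffix;     /* UTF-8 ending to look for */
    const char *gram_case;  /* grammatical case it signals */
    const char *number;     /* grammatical number it signals */
} Ending;

/* Partial, illustrative table (assumption: masculine a-stem nouns only). */
static const Ending ENDINGS[] = {
    { "ः",   "Nominative",   "Singular" },
    { "ाः",  "Nominative",   "Plural"   },
    { "म्",  "Accusative",   "Singular" },
    { "ान्", "Accusative",   "Plural"   },
    { "ेन",  "Instrumental", "Singular" },
    { "स्य", "Genitive",     "Singular" },
    { "ेषु", "Locative",     "Plural"   },
};

/* Return the entry whose suffix is the longest match at the end of word,
 * or NULL when no ending in the table matches. */
const Ending *analyze_noun(const char *word)
{
    size_t wlen = strlen(word);
    const Ending *best = NULL;
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof ENDINGS / sizeof ENDINGS[0]; i++) {
        size_t slen = strlen(ENDINGS[i].suffix);
        if (slen <= wlen && slen > best_len &&
            memcmp(word + wlen - slen, ENDINGS[i].suffix, slen) == 0) {
            best = &ENDINGS[i];
            best_len = slen;
        }
    }
    return best;
}
```

The longest-match rule matters because some endings contain others: देवाः ends in both ः and ाः, and only the longer one gives the right answer (nominative plural).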
Consider the dhatu (verb root) तप्, meaning ‘to heat’. The following inflections are analyzed lexically (each row is one person; the columns are singular, dual, plural):
HEATS
तपति, तपतः, तपन्ति
तपसि, तपथः, तपथ
तपामि, तपावः, तपामः
WILL HEAT
तप्स्यति, तप्स्यतः, तप्स्यन्ति
तप्स्यसि, तप्स्यथः, तप्स्यथ
तप्स्यामि, तप्स्यावः, तप्स्यामः
HEATED
अतपत्, अतपताम्, अतपन्
अतपः, अतपतम्, अतपत
अतपम्, अतपाव, अतपाम
HEAT IT (order)
तपतु, तपताम्, तपन्तु
तप, तपतम्, तपत
तपानि, तपाव, तपाम
Consider the noun देव, meaning ‘god’. The following inflections are possible (singular, dual, plural):
1. Nominative (subject) देवः देवौ देवाः
2. Accusative (object) देवम् देवौ देवान्
3. Instrumental (by) देवेन देवाभ्याम् देवैः
4. Dative (to) देवाय देवाभ्याम् देवेभ्यः
5. Ablative (from) देवात् देवाभ्याम् देवेभ्यः
6. Genitive (of) देवस्य देवयोः देवानाम्
7. Locative (in) देवे देवयोः देवेषु
Input Sentence
Tokenize
Avyaya Analysis
Verb Analysis
Noun Analysis
Unknown word (add to database)
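One plausible reading of the flow above is a cascade: each token is tried as an avyaya first, then as a verb, then as a noun, and is reported as unknown if nothing fits. The sketch below assumes that reading. The word list, the suffix lists and the function classify are hypothetical and far smaller than a real database.

```c
#include <stddef.h>
#include <string.h>

typedef enum { WORD_AVYAYA, WORD_VERB, WORD_NOUN, WORD_UNKNOWN } WordClass;

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* 1 if word ends with suffix (byte-wise comparison of UTF-8 text). */
static int ends_with(const char *word, const char *suffix)
{
    size_t wl = strlen(word), sl = strlen(suffix);
    return sl <= wl && memcmp(word + wl - sl, suffix, sl) == 0;
}

/* 1 if word ends with any of the n suffixes in list. */
static int any_suffix(const char *word, const char *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ends_with(word, list[i])) return 1;
    return 0;
}

/* Try the analyses in the order of the flow: avyaya, verb, noun, unknown. */
WordClass classify(const char *word)
{
    /* Illustrative data only (assumption, not the project's database). */
    static const char *const avyaya[]       = { "यत्र", "तत्र", "सह", "इति" };
    static const char *const verb_endings[] = { "न्ति", "ति" };
    static const char *const noun_endings[] = { "ेन", "म्", "ः" };

    for (size_t i = 0; i < COUNT(avyaya); i++)
        if (strcmp(word, avyaya[i]) == 0) return WORD_AVYAYA;
    if (any_suffix(word, verb_endings, COUNT(verb_endings))) return WORD_VERB;
    if (any_suffix(word, noun_endings, COUNT(noun_endings))) return WORD_NOUN;
    return WORD_UNKNOWN;  /* caller may add the word to the database */
}
```

Checking avyaya first is a deliberate choice: an indeclinable such as इति happens to end in ति and would otherwise be mistaken for a verb.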
Parsing
The scanner recognizes words.
The parser recognizes syntactic units.
Parser operations:
◦ Check and verify syntax based on specified syntax rules
◦ Report errors
Automation:
◦ The process can be automated
Why Separate Lexical Analysis and Parsing
1. Simplicity of design
2. Improving efficiency
3. Enhancing portability
Parsing Sanskrit Text
Now we move on to translating a Sanskrit sentence into its parsed equivalent.
PARSING:
Analyze (a sentence) into its component parts and describe their syntactic roles.
Analyze (a string or text) into logical syntactic components, typically in order to test conformability to a logical grammar.
Example Sanskrit Sentence
Sanskrit sentence structure: SOV
English sentence structure: SVO
बालः(S) पाठम्(O) पठति(V)
Boy(S) reads(V) chapter(O)
Approach (Coding Concept)
We first tokenize the input using strtok(str, " ");
Each token can be of 3 types: noun, verb, preposition.
The task is to identify these tokens, which is done by matching them against the indexed database.
Each token is stored in a structure along with its meaning and its morphology.
Then the parser comes into play and forms a tree-like structure from these tokens.
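The tokenizing step can be sketched as follows. strtok is the call named above; the Token structure, its field names, MAX_TOKENS and the function tokenize are assumptions made for this illustration, since the slides do not show the project's actual structure.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 64

typedef struct {
    const char *text;        /* points into the caller's buffer */
    const char *meaning;     /* to be filled in by the database lookup */
    const char *morphology;  /* e.g. "Nominative,Singular" */
} Token;

/*
 * Split line on spaces and store one Token per word.
 * strtok() writes '\0' into line, so the caller must pass a
 * modifiable buffer, not a string literal.
 * Returns the number of tokens stored in out (at most max).
 */
size_t tokenize(char *line, Token out[], size_t max)
{
    size_t n = 0;
    for (char *t = strtok(line, " "); t != NULL && n < max; t = strtok(NULL, " ")) {
        out[n].text = t;
        out[n].meaning = NULL;      /* not known until the lookup runs */
        out[n].morphology = NULL;
        n++;
    }
    return n;
}
```

For a buffer holding यत्र रामः गच्छति, tokenize stores three tokens. Note that strtok keeps hidden state between calls, so this sketch is not safe to use from several threads at once.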
Bottom-Up Parser Technique
Bottom-Up LR
◦ Construct parse tree in a bottom-up manner
◦ Find the rightmost derivation in reverse order
◦ For every potential right-hand side and token, decide when a production is found
More powerful
◦ Bottom-up parsers can handle the largest class of grammars that can be parsed deterministically
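The bottom-up idea can be illustrated with a naive shift-reduce recognizer. This is not a table-driven LR parser and not the project's parser; it only shows shifting tokens onto a stack and reducing whenever a right-hand side appears on top. The toy grammar and the one-letter tags (n = nominative noun, a = accusative noun, v = verb) are assumptions that cover only the subject-object-verb order.

```c
#include <stddef.h>
#include <string.h>

typedef struct { char lhs; const char *rhs; } Rule;

/* Toy grammar (assumption): S -> N P, P -> O v, N -> n, O -> a */
static const Rule RULES[] = {
    { 'N', "n"  },  /* subject   <- nominative noun   */
    { 'O', "a"  },  /* object    <- accusative noun   */
    { 'P', "Ov" },  /* predicate <- object verb       */
    { 'S', "NP" },  /* sentence  <- subject predicate */
};

/* If a right-hand side is on top of the stack, replace it with the
 * left-hand side and return 1; otherwise return 0. */
static int reduce_once(char *stack, size_t *top)
{
    for (size_t i = 0; i < sizeof RULES / sizeof RULES[0]; i++) {
        size_t len = strlen(RULES[i].rhs);
        if (len <= *top && memcmp(stack + *top - len, RULES[i].rhs, len) == 0) {
            *top -= len;
            stack[(*top)++] = RULES[i].lhs;
            return 1;
        }
    }
    return 0;
}

/* Return 1 if the tag string reduces to the single start symbol 'S'. */
int accepts(const char *tags)
{
    char stack[64];
    size_t top = 0;
    for (const char *p = tags; *p != '\0'; p++) {
        if (top == sizeof stack) return 0;   /* input too long for this sketch */
        stack[top++] = *p;                   /* shift */
        while (reduce_once(stack, &top)) {}  /* reduce while a handle is on top */
    }
    return top == 1 && stack[0] == 'S';
}
```

For the tag string "nav" the stack goes n, N, Na, NO, NOv, NP, S, which is the rightmost derivation read in reverse. A real LR parser replaces the greedy reduce with a state table that decides between shifting and reducing.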
Approach
Programming language used: C and C++
Database used: Linux file system, indexed
Data structures: array, linked list, structure, tree, indexing and hashing
INPUT: a Sanskrit sentence or paragraph, e.g.: यत्र रामः गच्छति तत्र देवाः बालेन सह नदीम् निकषा तिष्ठन्ति।
OUTPUT: recognize all the parts of speech; form a tree structure to be able to understand the sentence.
How the Output Will be Shown in Terminal
यत्र::: this is a avyaya.. and the meaning is: where_there
रामः::: Nominative,Singular, Gender-Masculine ,noun and the root is: राम and the meaning is Ram
गच्छति::: The root is: गच्छ the meaning is: go present-tense,first-person,singular
तत्र::: this is a avyaya.. and the meaning is: there
देवाः::: Nominative,Plural Gender-Masculine ,noun ,and the root is: देव and the meaning is god
बालेन::: Instrumental,Singular, Gender-Masculine ,noun, and the root is: बाल and the meaning is boy
नदीम्::: Accusative,Singular, Gender-Feminine ,noun and the root is: नदी and the meaning is river
Avyaya's Role in Sanskrit
Avyaya words (indeclinables) are used to connect 2 or more simple sentences. Examples:
यदि-तर्हि (if-then)
यत्र-तत्र (where-there)
परन्तु (but)
अथापि (hence)
चेद् (provided, if)
Not only do avyayas connect sentences, they also affect the structure of a simple sentence.
Challenges in the code
Every word encountered in the input sentence could be any part of speech, as Sanskrit has no fixed word order.
Because of this property, searching the database becomes important.
The database and word collection were in Unicode format, so each word takes up more space.
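The size point can be made concrete under the assumption that the text is stored as UTF-8 (the slides only say "unicode format"). In UTF-8 every Devanagari character in the range U+0900 to U+097F takes 3 bytes, so a word's byte length is three times its character count. The helper utf8_codepoints is a hypothetical name for this sketch.

```c
#include <stddef.h>

/* Count Unicode code points in a valid UTF-8 string by skipping
 * continuation bytes, which always have the bit pattern 10xxxxxx. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

For example, देव is 3 code points but strlen reports 9 bytes, while the romanized "deva" is 4 bytes. Buffers and index keys therefore have to be sized in bytes, not characters.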
Problems
Grammar of the Sanskrit language
How can we represent it as a BNF grammar?
Parser techniques
Structure of code
Where We are Now
A big chunk of our time went into researching the Sanskrit language and its grammar, which was quite difficult.
So far we have implemented the lexer and the parser.
Reference
Sanskrit & Artificial Intelligence - NASA: Knowledge Representation in Sanskrit and Artificial Intelligence, by Rick Briggs
http://www.vedicsciences.net/articles/sanskrit-nasa.html
AI Magazine publishes the importance of Sanskrit
http://www.parankusa.org/SanskritAsProgramming.pdf
http://sanskrit.jnu.ac.in/morph/analyze.jsp
http://en.wikipedia.org/wiki/Sanskrit_verbs
http://en.wikipedia.org/wiki/Sanskrit_grammar
Thank You