
Script to Sentiment: On Future of Language Technology

Jaganadh [email protected]

Different Dimensions of Language Technology
Central Institute of Hindi, Mysore
Feb. 25-26, 2010

Abstract

Human Language Technology (HLT) is no longer confined to being a subject for classroom teaching. Revolutionary developments are occurring in the field of HLT, and these developments are capable of bringing changes to human life. Information and Communication Technology (ICT) has become an inevitable component of our day-to-day life. Directly or indirectly, we are all consumers of ICT-based products. Over the last few years the ICT revolution has begun to appear in our native languages too. As a result, HLT has become a direct or indirect component of ICT products and services. HLT is expected to permeate all areas of our life in the future. Whether you are a doctor, an engineer, a farmer or a lover, irrespective of your profile, we are all going to become dependent on HLT-based ICT products. The present paper discusses developments in the field of HLT and its future.

1 Introduction

Human Language Technology (HLT) is no longer confined to being a subject for classroom teaching. Revolutionary developments are occurring in the field of HLT, and these developments are capable of bringing changes to human life. Information and Communication Technology (ICT) has become an inevitable component of our day-to-day life. Directly or indirectly, we are all consumers of ICT-based products. Over the last few years the ICT revolution has begun to appear in our native languages too. As a result, HLT has become a direct or indirect component of ICT products and services. HLT is expected to permeate all areas of our life in the future. Whether you are a doctor, an engineer, a farmer or a lover, irrespective of your profile, we are all going to become dependent on HLT-based ICT products.


The history of HLT begins with the early days of electronic computing. From the early 1950s, researchers and scientists have been trying to develop computer programs that can handle human languages as a human does. The earliest Research and Development (R&D) in this field was related to Machine Translation (MT) systems. As of now, we can say that significant developments have occurred in the field and working systems are available. Some are readily accepted; others are imperfect but have no alternatives. So we are still not in a position to say 'Yes! We have cracked the language challenge and are now able to provide smart engineering solutions'. Path-breaking R&D activities are happening in this field. In this scenario, it is quite interesting to investigate where we stand in the field of Human Language Technology.

The present paper is a compilation of the developments in the field of HLT. It also discusses some future technologies in HLT. Recent developments in Indian language technology are also discussed, with special focus on the issues involved.

2 Where are we now?

R&D activities in HLT can be broadly classified into two major categories: 1) text processing and 2) speech processing. Activities under text processing range from the development of spell checker systems to discourse analysis systems. Speech processing ranges from text to speech (TTS) conversion to speech-to-speech translation. For almost all tasks in both fields, Free and Open Source Software (FOSS) [1] and proprietary solutions are available. Internet-based solutions also exist, such as Google Translate and other services [2]. FOSS-based solutions, as well as public domain solutions, have played a vital role in the rapid development of HLT, including for Indian languages. This section is a brief survey of the present status of R&D activities in the field.

2.1 Language and Scripts in Computers

In the early days of HLT, representing the vernaculars in computers was a challenge. ASCII [3] was the character encoding scheme [4] available at that time. It was designed to represent the English alphabet and was not sufficient to represent other languages. Some workarounds were devised to achieve this, and most of these workarounds were purely font-based [5] solutions. In India, an encoding standard called ISCII [6] was developed for representing Indian languages. The introduction of Unicode [7] was a remarkable development in this field. Unicode made the task of representing vernaculars in computers much easier, and it became the de facto standard. Suitable font [8] technology was developed alongside it.

[1] http://en.wikipedia.org/wiki/Free_and_open_source_software
[2] http://translate.google.com/#, www.google.com/transliterate/, www.google.com/dictionary etc.
[3] http://en.wikipedia.org/wiki/ASCII - Accessed on 01-01-2010
[4] http://en.wikipedia.org/wiki/Character_encoding - Accessed on 01-01-2010

The arrival of the Unicode standard boosted the growth of local-language content on the internet. Every living language that received encoding space in Unicode got the opportunity to establish itself in the Information Technology (IT) world. This led to an information overflow. As a result, there is an increasing demand for information processing tools, ranging from keyboard drivers to search engines and decision support systems [9].
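The limitation of ASCII described above can be illustrated with a minimal Python sketch. The sample word and the helper name `can_encode` are assumptions introduced only for illustration; they do not come from any of the systems cited in this paper.

```python
# Illustrative sample (an assumption): the word "Hindi" written in Devanagari,
# spelled out as Unicode escapes so the code points are explicit.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"


def can_encode(text, encoding):
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("Hindi", "ascii"))  # plain English letters fit in ASCII
print(can_encode(word, "ascii"))     # Devanagari characters do not
print(can_encode(word, "utf-8"))     # UTF-8, a Unicode encoding, handles them

# Every character has its own code point in the Devanagari block (U+0900 to U+097F).
code_points = ["U+%04X" % ord(ch) for ch in word]
print(code_points)
```

The same check can be run against any other encoding name that Python knows, which makes it easy to see which scripts a legacy encoding can and cannot represent.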

2.2 Developments in Text Processing

This section is a brief survey of developments in text processing technologies. A wide variety of text processing systems is now available, including spell checkers, grammar checkers, MT systems and search engines. People who use computers to prepare documents are familiar with tools such as spell checkers, and they know that life is not easy without them, because human beings tend to make errors and can be lazy too! When computers reached the desks of language professionals such as translators, they were interested in electronic dictionaries and machine translation. When computers entered the lives of business people, they had different needs. Whoever the users are and whatever their profile, their demands are directly or indirectly related to HLT, because everybody uses language and nobody can live without it. Such demands have given rise to new methodologies and technologies within HLT itself. Those developments are discussed here.

Spellcheckers

In computing, a spell checker (spell check) is an application program that flags words in a document that may not be spelled correctly [10]. The technology is now very advanced. Spell checkers are available for almost all major languages in the world, and most popular word processing software includes the feature. Spell checker systems are available for Indian languages too. The language software collection CDs distributed by the TDIL [11] program contain spell checker applications for almost all Indian languages.

[5] http://en.wikipedia.org/wiki/Font - Accessed on 01-01-2010
[6] http://en.wikipedia.org/wiki/Indian_Script_Code_for_Information_Interchange - Accessed on 01-01-2010
[7] http://unicode.org/, http://en.wikipedia.org/wiki/Unicode - Accessed on 01-01-2010
[8] http://en.wikipedia.org/wiki/Font - Accessed on 01-02-10
[9] http://en.wikipedia.org/wiki/Decision_support_system - Accessed on 01-02-10
[10] http://en.wikipedia.org/wiki/Spell_checker

The FOSS movement in India is very active in spell checker dictionary development for Indian languages [12]. The FOSS frameworks [13] available for spell checker systems are widely used by these FOSS communities. The development of Indian language spell checker dictionaries needs more volunteers.
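The basic idea of flagging unknown words and suggesting close dictionary entries can be sketched as follows. This is a minimal illustration using plain edit distance, not the method used by the frameworks cited above; the word list and the names `edit_distance` and `suggest` are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


# Hypothetical toy word list; a real checker would load a full dictionary file.
DICTIONARY = {"language", "technology", "spell", "checker", "speech", "translation"}


def suggest(word, max_distance=2):
    """Return dictionary words within `max_distance` edits, closest first."""
    scored = sorted((edit_distance(word.lower(), w), w) for w in DICTIONARY)
    return [w for d, w in scored if d <= max_distance]


for token in ["spell", "chekcer", "langauge"]:
    if token.lower() not in DICTIONARY:
        print(token, "may be misspelled; suggestions:", suggest(token))
```

Dictionary quality matters more than the matching algorithm in such a design, which is why volunteer-built word lists are so important for Indian languages.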

Machine Translation

MT is one of the oldest and still active tasks in HLT. R&D in this field has been in progress for more than 50 years. Some systems are available for use, but the majority cannot be considered perfect solutions. Diverse methodologies are available for MT, such as statistical, rule-based and hybrid approaches [14]. Fully automated high quality MT, however, remains a target yet to be achieved. Among the available MT systems and services, Google Translate and Babel Fish [15] are the most famous. Google Translate supports translation from English to Hindi and vice versa.

MT research in Indian languages (IL) has been very active since the early 1970s. AnglaBharati [16] and Anusaaraka [17] are two major approaches developed in the early days and are still in active development. Other systems, such as Sampark [18] and UNL-based machine translation systems [19], are also available. The TDIL program of the Govt. of India provides extensive support to MT research in India. Apart from the systems mentioned above, some other IL MT initiatives exist.

Some FOSS-based solutions are also available for MT system development. Two well-known frameworks are Moses [20] and Apertium [21]. Moses follows the statistical paradigm of MT, while Apertium is a rule-based (shallow-transfer) platform. MT researchers in India have also come forward to work with these two frameworks. It is hoped that this will boost MT research in India too.

[11] www.tdil.mit.gov.in
[12] http://indlinux.org/, http://smc.org.in/, http://wiki.services.openoffice.org/wiki/Dictionaries
[13] http://hunspell.sourceforge.net, http://en.wikipedia.org/wiki/MySpell
[14] http://www.hutchinsweb.me.uk/IntroMT-TOC.htm
[15] http://babelfish.yahoo.com/
[16] http://www.cse.iitk.ac.in/users/langtech/anglabharti.htm
[17] http://ltrc.iiit.ac.in/anusaaraka/
[18] http://sampark.iiit.ac.in/
[19] http://www.springerlink.com/content/t1005w166746727l/fulltext.pdf
[20] www.statmt.org/moses
[21] www.apertium.org


Search and IR

Search Engines (SE) and Information Retrieval (IR) systems are the HLT tools most widely used by the general public. Google [22], Yahoo [23] and Bing [24] are the three most famous search engines in the world. Revolutionary developments are occurring in this field. Domain-based search such as 'patent search', content-based search such as 'video search', localized search such as 'movie timing' and cross-lingual search are the recent trends. The latest development in the field is Semantic Search, which is discussed in a later section of the paper. All the major search engines are now capable of handling local-language search requests too. Cross Lingual Information Retrieval (CLIR) systems for Indian languages are in development.
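At the core of most IR systems is an inverted index that maps each term to the documents containing it. The following is a minimal sketch with a simple TF-IDF ranking; the toy documents, the whitespace tokenizer and the function names are assumptions made for illustration and do not describe how any of the search engines named above work internally.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy collection; document ids and texts are illustrative only.
DOCS = {
    "d1": "patent search for speech recognition",
    "d2": "movie timing search in mysore",
    "d3": "speech synthesis for hindi",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


# Inverted index: term -> {doc_id: term frequency in that document}
index = defaultdict(dict)
for doc_id, text in DOCS.items():
    for term, tf in Counter(tokenize(text)).items():
        index[term][doc_id] = tf


def search(query):
    """Rank documents by a TF-IDF score summed over the query terms."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(len(DOCS) / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


print(search("speech recognition"))
```

For local-language or cross-lingual search, the tokenizer is the part that changes most: it would need script-aware segmentation and, for CLIR, a query translation step before the index lookup.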

3 Speech Processing

This section gives a brief survey of developments in speech processing. The main technologies discussed are Text to Speech (TTS) systems and Automatic Speech Recognition (ASR).

TTS

A Text to Speech (TTS) system is software that converts electronic text into the corresponding speech. The field involves both text processing and signal processing techniques. R&D in this direction has produced promising and acceptable solutions, and both FOSS-based and proprietary solutions are now available. The major FOSS-based frameworks for TTS system development are Festival [25] and Festvox [26]. The introduction of these frameworks boosted the development of TTS for various languages, including Indian languages. The most remarkable development in Indian language TTS under FOSS is the Dhvani project [27].

Even though we can say that significant progress has been achieved in TTS development, challenges remain. These include making the synthesized voice sound more natural, and supporting intonation and emotion in TTS.

ASR

Automatic Speech Recognition (ASR) is a technology that allows a computer program to identify and transcribe the words that a person speaks into a microphone. Like TTS, ASR involves both text processing and signal processing techniques. It is one of the most challenging and interesting tasks in HLT, and significant developments have taken place in this field too. ASR systems are available for some Indian languages, such as Hindi [28] and Telugu. The most widely used FOSS-based framework for ASR development is CMU Sphinx [29], whose introduction opened a new direction in ASR R&D. Apart from CMU Sphinx, other FOSS-based and proprietary frameworks are available for ASR development.

[22] http://www.google.co.in/
[23] www.yahoo.co.in
[24] http://www.bing.com/
[25] http://www.cstr.ed.ac.uk/projects/festival/
[26] http://festvox.org/festival/
[27] http://dhvani.sourceforge.net/

4 Future of HLT

Over the past few decades, colossal progress has been made in the field of HLT. Systems ranging from simple ones that understand numbers to text understanding and summarization systems have been developed in this period. Many challenges remain to be addressed in the future. Hopefully, we can build more complex systems on top of the existing HLT systems. These developments are the result of a long journey from lab experiments to deployment in real-world work environments. The wide range of tools and technologies developed as part of R&D in HLT is capable of making a deep impact on human life. These tools have great relevance and impact in a market-oriented society.

What will the future be? Can we imagine it? Yes! Imagine asking your car to show the route from the Mysore bus stand to the Central Institute of Hindi, and the car speaking the directions or giving a detailed printout describing the route. In fact, this is not a dream technology. It is possible by combining speech processing with other technologies such as GPS (Global Positioning System). Suppose that a judge analyzes the arguments related to a case with the help of software before reaching a judgment. Or consider a legislative assembly that publishes draft bills on its website and receives comments on them. After receiving the comments, and before proceeding to further action, the assembly analyzes them to find out how many are positive and how many are negative. This is already possible. The technology that analyzes opinion is called 'Sentiment Analysis'. There is no end to such imaginings, but they will become reality very soon. This section highlights some of the future technologies and R&D areas in HLT.

Semantic Web/Search

Semantics is the branch of linguistics that studies the structure of meaning. The Semantic Web (SeW) is an evolving development of the World Wide Web in which the meaning (semantics) of information and services on the web is defined, making it possible for the web to "understand" and satisfy the requests of people and machines to use the web content [30]. Tim Berners-Lee, the father of the WWW [31], is the inventor of this technology. The World Wide Web Consortium (W3C) is the authority that publishes and maintains standards and recommendations on SeW. Semantic-web-based HLT implementations are going to bring a big revolution in the coming years. Semantic Search is one such technology that HLT people are discussing nowadays. SeW search engines already exist [32], but they are not yet widely accepted. Semantic Search is expected to bring revolutionary changes in online publishing, e-governance, e-commerce and related fields.

[28] http://sourceforge.net/projects/hindiasr/
[29] http://www.speech.cs.cmu.edu/sphinx/

Sentiment Analysis

Sentiment analysis, or opinion mining, refers to a broad (definitionally challenged) area of natural language processing, computational linguistics and text mining [33]. The basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive, negative or neutral [34]. The rise of social media such as blogs, Twitter, Facebook and LinkedIn has fueled great interest in the field. Publishers, movie companies and fast moving consumer goods (FMCG) companies are the main consumers of this technology. The technology is already present in the market, and it is likely to find a place in politics and governance very soon.
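The polarity classification task described above can be sketched with a very small lexicon-based classifier. This is only one of several possible approaches and is shown as an assumption-laden illustration: the word lists, the one-word negation rule and the function name `polarity` are all hypothetical.

```python
# Hypothetical mini-lexicons; real systems use much larger lexicons or trained models.
POSITIVE = {"good", "great", "excellent", "useful", "support"}
NEGATIVE = {"bad", "poor", "terrible", "useless", "oppose"}
NEGATORS = {"not", "no", "never"}


def polarity(text):
    """Classify text as 'positive', 'negative' or 'neutral' by counting lexicon hits."""
    score = 0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATORS:
            negate = True  # flip the polarity of the next word only
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1 or 0
        score += -value if negate else value
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


comments = [
    "This bill is excellent and useful.",
    "The draft is not good!",
    "The assembly published a draft bill",
]
for comment in comments:
    print(comment, "->", polarity(comment))
```

A counter over the returned labels would give the tally of positive and negative comments imagined in the legislative assembly scenario. The negation rule here is deliberately crude: it misses cases such as "not very good", which is one reason practical systems rely on richer models.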

Future of MT

In a previous section we discussed the developments in MT research. Remarkable achievements have been made in this direction, but we still have to address many issues to achieve the goal of Fully Automated High Quality Machine Aided Translation (FAHQMAT). Another expectation is to build efficient speech-to-speech translation systems. I expect that within a few years our researchers will provide revolutionary solutions in this field.

HLT in Education

Computer Assisted Teaching (CAT) is already in practice throughout the globe. It is considered one of the best ways to achieve effective and interactive teaching. HLT techniques such as ASR, TTS, morphological synthesis, parsing and MT can be used for interactive language teaching, especially second language teaching. With the help of HLT we can build online systems that teach a second language and evaluate the progress made by the student without the intervention of a human instructor.

[30] Berners-Lee, Tim; James Hendler and Ora Lassila (May 17, 2001). "The Semantic Web". Scientific American Magazine. http://www.sciam.com/article.cfm?id=the-semanticweb&print=true. Accessed March 26, 2008.
[31] World Wide Web
[32] www.hakia.com
[33] http://en.wikipedia.org/wiki/Sentiment_analysis
[34] http://en.wikipedia.org/wiki/Sentiment_analysis

HLT in Bio-Medical Research

HLT techniques such as Named Entity Recognition [35] (NER), SeW and Text Mining [36] are widely used in bio-medical research. This field of research is now called Bio-medical Natural Language Processing (Bio-NLP).

HLT in Forensic Science

Another vital area in which HLT is going to be applied is forensic science. HLT techniques are very useful for resolving authorship disputes and disputes of meaning and use, identifying the author of anonymous texts, identifying cases of plagiarism [37], reconstructing mobile phone text conversations and similar tasks.

HLT for Business

It is often said that web pages do not exist without search engines, and that businesses do not exist without advertisements. The emergence of new media paved the way for online advertising techniques. The marriage of IR and other HLT techniques with online advertising gave birth to a new field called 'Computational Advertising'. It helps advertisers place their advertisements in appropriate places according to the taste of consumers. Another vital business-oriented area of R&D is 'Collective Intelligence' [38], in which a wide range of HLT techniques is used. It helps service providers such as online stores give product recommendations to a consumer based on his/her purchasing behavior and taste. This is achieved by comparing and analyzing the purchasing behavior of customers who share similar tastes. So remember: whenever you receive context-relevant advertising or a product recommendation, the power of HLT is behind it!
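Recommending products by comparing customers with similar purchasing behavior can be sketched as user-based collaborative filtering with cosine similarity. This is one common technique, assumed here for illustration; the purchase data, customer names and function names are all hypothetical.

```python
import math

# Hypothetical purchase histories: customer -> {product: number of purchases}.
PURCHASES = {
    "asha":  {"dictionary": 2, "grammar book": 1, "novel": 1},
    "ravi":  {"dictionary": 1, "grammar book": 1, "phrasebook": 2},
    "meena": {"cookbook": 3, "novel": 1},
}


def cosine(u, v):
    """Cosine similarity between two sparse purchase vectors."""
    dot = sum(u[p] * v[p] for p in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(customer, top_n=3):
    """Suggest products bought by similar customers but not yet by this one."""
    own = PURCHASES[customer]
    scores = {}
    for other, basket in PURCHASES.items():
        if other == customer:
            continue
        similarity = cosine(own, basket)
        for product, count in basket.items():
            if product not in own:
                scores[product] = scores.get(product, 0.0) + similarity * count
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, score in ranked[:top_n] if score > 0]


print(recommend("asha"))
```

Products bought by the most similar customers receive the highest scores, so a customer whose basket overlaps heavily with the target customer influences the recommendation more than one who shares only a single item.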

[35] http://en.wikipedia.org/wiki/Named_entity_recognition
[36] http://en.wikipedia.org/wiki/Text_mining
[37] http://en.wikipedia.org/wiki/Plagiarism
[38] http://en.wikipedia.org/wiki/Collective_intelligence


5 Issues in HLT

The developments in HLT during the past few years are quite promising, and the future technologies that are slowly coming out of the lab and into practice are quite exciting. Still, many research issues remain. This section is dedicated to a discussion of some selected issues in HLT, with special focus on Indian language technology.

A large number of language technology (LT) based products are coming into the market. How can these technology products be evaluated? Many frameworks have evolved for evaluating LT projects and products, such as EAGLES [39]. But most evaluation methodologies are not well suited to handling the linguistic phenomena of Indian languages. A typical example is MT evaluation. BLEU and METEOR are two major methodologies for evaluating MT, but neither is very effective for evaluating MT between English and Indian languages [40].

Another vital issue in evaluating HLT projects and products is the availability of data for testing the tools. For example, reference translation sets are required to evaluate an MT system. In a way, a reference translation set is simply a parallel corpus, but it must also meet certain quality requirements: it should cover the different syntactico-semantic phenomena of the source language as well as the target language. Publicly available data sets and standards of this kind are lacking for Indian language technology. India also has no defined standards body or policy for evaluating HLT projects and products.

In short, the issues in HLT can be classified into three broad areas: 1) development challenges, which involve algorithm development and baffling issues in language; 2) availability of resources and standards in the public domain; 3) the evaluation problem. A detailed discussion of the topic is beyond the scope of this paper.
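To make the BLEU metric mentioned in this section more concrete, the following is a simplified sentence-level sketch: clipped n-gram precision against a single reference, combined with a brevity penalty. It is an illustration under stated assumptions, not the reference implementation: standard BLEU works at corpus level with n-grams up to length 4 and may use several references, whereas this sketch defaults to bigrams, and the example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with one reference and uniform weights."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    log_average = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_average)


reference = "the cat is on the mat".split()
reordered = "on the mat the cat is".split()
print(bleu(reference, reference))
print(bleu(reordered, reference))
```

Because the score depends on exact n-gram overlap, a candidate that uses the right words in a different order is penalized. This hints at why such metrics can be harsh on languages with relatively free word order and rich morphology, which is the kind of difficulty reported for English to Indian language MT in [40].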

6 Conclusion

Many resources and tools have been developed in HLT over the past few years. The developments in the field are quite promising, and so is the future. As discussed at the beginning of this paper, we can hope that all ICT tools will be powered by HLT in the future. At the same time, we cannot forget the challenges and issues involved in the field. To solve the major issues in HLT, especially in the Indian language scenario, improved policies and standards may need to be introduced in the near future to boost R&D activities in the field.

[39] http://www.issco.unige.ch/en/research/projects/ewg95//ewg95.html
[40] http://www.cse.iitb.ac.in/pb/papers/icon07bleu.pdf
