Natural Language Processing

Ph.D. Coursework Assignment

UNIT 2: SANSKRIT Language, Ancient & Modern

MODULE 5: MODERN TRENDS IN SANSKRIT

General awareness of NATURAL LANGUAGE PROCESSING (NLP)

प्रकृतिभाषासंसाधनम् (प्रभासं) [Sanskrit: "Natural Language Processing (NLP)"]

The processing of language by the human brain is termed language processing. Neuro-linguistic programming, by contrast, is a non-scientific approach to communication and personal development. Natural Language Processing is concerned with language processing by computers.

NLP is a computer activity in which computers analyze, understand, alter, or generate natural

language. This includes the automation of any or all

linguistic forms, activities, or methods of communication,

such as conversation, correspondence, reading, written

composition, dictation, publishing, translation, lip reading,

and so on. Natural language processing is also the name of

the branch of computer science, artificial intelligence, and

linguistics concerned with enabling computers to engage in

communication using natural language(s) in all forms,

including but not limited to speech, print, writing, and

signing.

Natural Language Processing (NLP) is a field of computer

science (संगणकविज्ञानम)्, artificial intelligence (कृत्रिमबदु्धिः), and linguistics (भाषाविज्ञानम)् concerned with the interactions

between computers and human (natural) languages. As

such, NLP is related to the area of human–computer

interaction. Many challenges in NLP involve natural

language understanding, that is, enabling computers to

derive meaning from human or natural language input, and

others involve natural language generation.


1. Definitions
2. Prerequisite technologies
3. History: Pāṇini to NLP
4. NLP using machine learning
5. Major tasks, sub-tasks and related tasks in NLP
6. Statistical NLP
7. Evaluation of natural language processing
7.1. Objectives
7.2. Short history of evaluation in NLP
7.3. Different types of evaluation
8. Natural Language Processing toolkits
9. Named Entity Recognizers
10. Translation software
11. Other software
12. Chatterbots
13. Natural language processing organizations
14. Standardization in NLP
15. The future of NLP
16. Other related fields
17. References


Definitions

Natural language processing can be described as all of the following:

A field of science – systematic enterprise that builds and organizes

knowledge in the form of testable explanations and predictions

about the universe.

An applied science – field that applies human knowledge to build or

design useful things.

A field of computer science – scientific and practical approach to

computation and its applications.

A branch of artificial intelligence – intelligence of machines and

robots and the branch of computer science that aims to create it.

A subfield of computational linguistics – interdisciplinary field

dealing with the statistical or rule-based modelling of natural

language from a computational perspective.

An application of engineering – science, skill, and profession of

acquiring and applying scientific, economic, social, and practical

knowledge, in order to design and also build structures, machines,

devices, systems, materials and processes.

An application of software engineering – application of a systematic,

disciplined, quantifiable approach to the design, development,

operation, and maintenance of software, and the study of these

approaches; that is, the application of engineering to software.

A subfield of computer programming – process of designing, writing,

testing, debugging, and maintaining the source code of computer

programs. This source code is written in one or more programming

languages (such as Java, C++, C#, Python, etc.). The purpose of

programming is to create a set of instructions that computers use to

perform specific operations or to exhibit desired behaviours.


A subfield of Artificial Intelligence programming. AI programming

involves (mainly) manipulating symbols and not numbers. These

symbols might represent objects in the world and relationships

between those objects - complex structures of symbols are needed

to capture our knowledge of the world.

A type of system – set of interacting or interdependent components

forming an integrated whole or a set of elements (often called

'components') and relationships which are different from

relationships of the set or its elements to other elements or sets.

A system that includes software – software is a collection of

computer programs and related data that provides the instructions

for telling a computer what to do and how to do it. Software refers

to one or more computer programs and data held in the storage of

the computer. In other words, software is a set of programs,

procedures, algorithms and its documentation concerned with the

operation of a data processing system.

A type of technology – making, modification, usage, and knowledge

of tools, machines, techniques, crafts, systems, methods of

organization, in order to solve a problem, improve a pre-existing

solution to a problem, achieve a goal, handle an applied

input/output relation or perform a specific function. It can also refer

to the collection of such tools, machinery, modifications,

arrangements and procedures. Technologies significantly affect

human as well as other animal species' ability to control and adapt

to their natural environments.

A form of computer technology – computers and their application.

NLP makes use of computers, image scanners, microphones, and

many types of software programs.


Prerequisite technologies

The following technologies make natural language processing possible:

Communication (the activity of a source sending a message to a

receiver.)

Language

Speech

Writing

Computing

Computers

Computer programming

Information extraction

User interface

Software

Text editing

Word processing

Input devices (pieces of hardware for sending data to a computer

to be processed.)

Computer keyboard (typewriter style input device whose input is

converted into various data depending on the circumstances.)

Image scanners

History of natural language processing

The history of natural language processing describes the advances of

natural language processing. The history of NLP generally starts in the

1950s, although work can be found from earlier periods. There is some

overlap with the history of modern linguistics, the history of machine

translation and the history of artificial intelligence.


Pāṇini

Pāṇini (fl. 4th century BCE) (Sanskrit: पाणिनि, IPA: [paːɳini]; a

patronymic meaning "descendant of Paṇi"), or Panini, was a Sanskrit

grammarian from ancient India. He was born in Pushkalavati, Gandhara, in

the modern-day Charsadda of Khyber Pakhtunkhwa, Pakistan.

Pāṇini is known for his Sanskrit grammar, particularly for his

formulation of the 3,959 rules of Sanskrit morphology, syntax and

semantics in the grammar known as Ashtadhyayi (अष्टाध्यायी Aṣṭādhyāyī,

meaning "eight chapters"), the foundational text of the grammatical

branch of the Vedanga, the auxiliary scholarly disciplines of Vedic religion

[Hinduism].

The Ashtadhyayi is one of the earliest known grammars of Sanskrit,

although Pāṇini refers to previous texts like the Unadisutra, Dhatupatha,

and Ganapatha. It is the earliest known work on descriptive linguistics,

and together with the work of his immediate predecessors (Nirukta,

Nighantu, Pratishakyas) stands at the beginning of the history of

linguistics itself. His theory of morphological analysis was more advanced

than any equivalent Western theory before the mid 20th century, and his

analysis of noun compounds still forms the basis of modern linguistic

theories of compounding, which have borrowed Sanskrit terms such as

bahuvrihi and dvandva.

Pāṇini's comprehensive and scientific theory of grammar is

conventionally taken to mark the end of the period of Vedic Sanskrit,

introducing the period of Classical Sanskrit.

Modern linguistics

Pāṇini's work became known in 19th-century Europe, where it

influenced modern linguistics initially through Franz Bopp, who mainly

looked at Pāṇini. Subsequently, a wider body of work influenced Sanskrit

scholars such as Ferdinand de Saussure, Leonard Bloomfield, and Roman

Jakobson. Frits Staal (1930-2012) discussed the impact of Indian ideas on

language in Europe. After outlining the various aspects of the contact,

Staal notes that the idea of formal rules in language – proposed by


Ferdinand de Saussure in 1894 and developed by Noam Chomsky in 1957

– has origins in the European exposure to the formal rules of Pāṇinian

grammar. In particular, de Saussure, who lectured on Sanskrit for three

decades, may have been influenced by Pāṇini and Bhartrihari; his idea of

the unity of signifier-signified in the sign somewhat resembles the notion

of Sphoṭa. More importantly, the very idea that formal rules can be applied

to areas outside of logic or mathematics may itself have been catalyzed by

Europe's contact with the work of Sanskrit grammarians.

De Saussure

Pāṇini, and the later Indian linguist Bhartrihari, had a significant

influence on many of the foundational ideas proposed by Ferdinand de

Saussure, professor of Sanskrit, who is widely considered the father of

modern structural linguistics. Saussure himself cited Indian grammar as

an influence on some of his ideas. In his Mémoire sur le système primitif des

voyelles dans les langues indo-européennes (Memoir on the Original System

of Vowels in the Indo-European Languages) published in 1879, he

mentions Indian grammar as an influence on his idea that "reduplicated

aorists represent imperfects of a verbal class." In his De l'emploi du génitif

absolu en sanscrit (On the Use of the Genitive Absolute in Sanskrit)

published in 1881, he specifically mentions Pāṇini as an influence on the

work.

Prem Singh, in his foreword to the reprint edition of the German

translation of Pāṇini's Grammar in 1998, concluded that the "effect

Panini's work had on Indo-European linguistics shows itself in various

studies" and that a "number of seminal works come to mind," including

Saussure's works and the analysis that "gave rise to the laryngeal theory,"

further stating: "This type of structural analysis suggests influence from

Panini's analytical teaching." George Cardona, however, warns against

overestimating the influence of Pāṇini on modern linguistics: "Although

Saussure also refers to predecessors who had taken this Paninian rule into

account, it is reasonable to conclude that he had a direct acquaintance

with Panini's work. As far as I am able to discern upon rereading

Saussure's Mémoire, however, it shows no direct influence of Paninian


grammar. Indeed, on occasion, Saussure follows a path that is contrary to

Paninian procedure."

Leonard Bloomfield

The founding father of American structuralism, Leonard Bloomfield,

wrote a 1927 paper titled "On some rules of Pāṇini".

Panini: Comparison with modern formal systems

Pāṇini's grammar is the world's first formal system, developed well

before the 19th century innovations of Gottlob Frege and the subsequent

development of mathematical logic. In designing his grammar, Pāṇini used

the method of "auxiliary symbols", in which new affixes are designated to

mark syntactic categories and the control of grammatical derivations. This

technique, rediscovered by the logician Emil Post, became a standard

method in the design of computer programming languages. Sanskritists

now accept that Pāṇini's linguistic apparatus is well-described as an

"applied" Post system. Considerable evidence shows ancient mastery of

context-sensitive grammars, and a general ability to solve many complex

problems. Frits Staal has written that "Panini is the Indian Euclid."

The history of machine translation dates back to the seventeenth

century, when philosophers such as Leibniz and Descartes put forward

proposals for codes which would relate words between languages. All of

these proposals remained theoretical, and none resulted in the

development of an actual machine.

The first patents for "translating machines" were applied for in the mid-

1930s. One proposal, by Georges Artsrouni was simply an automatic

bilingual dictionary using paper tape. The other proposal, by Peter

Troyanskii, a Russian, was more detailed. It included both the bilingual

dictionary, and a method for dealing with grammatical roles between

languages, based on Esperanto.

In 1950, Alan Turing published an article titled "Computing Machinery

and Intelligence" which proposed what is now called the Turing test as a

criterion of intelligence.


The Georgetown experiment in 1954 involved fully automatic

translation of more than sixty Russian sentences into English. The authors

claimed that within three or five years, machine translation would be a

solved problem. However, real progress was much slower, and after the

ALPAC report in 1966, which found that ten years of research had failed

to fulfill the expectations, funding for machine translation was

dramatically reduced. Little further research in machine translation was

conducted until the late 1980s, when the first statistical machine

translation systems were developed.

Some notably successful NLP systems developed in the 1960s were

SHRDLU, a natural language system working in restricted "blocks worlds"

with restricted vocabularies, and ELIZA, a simulation of a Rogerian

psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.

Using almost no information about human thought or emotion, ELIZA

sometimes provided a startlingly human-like interaction. When the

"patient" exceeded the very small knowledge base, ELIZA might provide a

generic response, for example, responding to "My head hurts" with "Why

do you say your head hurts?".

During the 1970s many programmers began to write 'conceptual

ontologies', which structured real-world information into computer-

understandable data. Examples are MARGIE (Schank, 1975), SAM

(Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976),

QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units

(Lehnert 1981). During this time, many chatterbots were written

including PARRY, Racter, and Jabberwacky.

Up to the 1980s, most NLP systems were based on complex sets of

hand-written rules. Starting in the late 1980s, however, there was a

revolution in NLP with the introduction of machine learning algorithms

for language processing. This was due both to the steady increase in

computational power resulting from Moore's Law and the gradual

lessening of the dominance of Chomskyan [1] theories of linguistics (e.g.

transformational grammar), whose theoretical underpinnings discouraged

the sort of corpus linguistics that underlies the machine-learning

approach to language processing. Some of the earliest-used machine

learning algorithms, such as decision trees, produced systems of hard if-

then rules similar to existing hand-written rules. Increasingly, however,

research has focused on statistical models, which make soft, probabilistic

decisions based on attaching real-valued weights to the features making

up the input data. The cache language models upon which many speech

recognition systems now rely are examples of such statistical models. Such

models are generally more robust when given unfamiliar input, especially

input that contains errors (as is very common for real-world data), and

produce more reliable results when integrated into a larger system

comprising multiple subtasks.

Many of the notable early successes occurred in the field of machine

translation, due especially to work at IBM Research, where successively

more complicated statistical models were developed. These systems were

able to take advantage of existing multilingual textual corpora that had

been produced by the Parliament of Canada and the European Union as a

result of laws calling for the translation of all governmental proceedings

into all official languages of the corresponding systems of government.

However, most other systems depended on corpora specifically developed

for the tasks implemented by these systems, which was (and often

continues to be) a major limitation in the success of these systems. As a

result, a great deal of research has gone into methods of more effectively

learning from limited amounts of data.

[1] Chomskyan linguistics encourages the investigation of "corner cases" that stress the limits of its theoretical models (comparable to pathological phenomena in mathematics), typically created using thought experiments, rather than the systematic investigation of typical phenomena that occur in real-world data, as is the case in corpus linguistics. The creation and use of such corpora of real-world data is a fundamental part of machine-learning algorithms for NLP. In addition, theoretical underpinnings of Chomskyan linguistics such as the so-called "poverty of the stimulus" argument entail that general learning algorithms, as are typically used in machine learning, cannot be successful in language processing. As a result, the Chomskyan paradigm discouraged the application of such models to language processing.


Recent research has increasingly focused on unsupervised and semi-

supervised learning algorithms. Such algorithms are able to learn from

data that has not been hand-annotated with the desired answers, or using

a combination of annotated and non-annotated data. Generally, this task is

much more difficult than supervised learning, and typically produces less

accurate results for a given amount of input data. However, there is an

enormous amount of non-annotated data available (including, among

other things, the entire content of the World Wide Web), which can often

make up for the inferior results.

NLP using machine learning

Modern NLP algorithms are based on machine learning, especially

statistical machine learning. The paradigm of machine learning is different

from that of most prior attempts at language processing. Prior

implementations of language-processing tasks typically involved the

direct hand coding of large sets of rules. The machine-learning paradigm

calls instead for using general learning algorithms — often, although not

always, grounded in statistical inference — to automatically learn such

rules through the analysis of large corpora of typical real-world examples.

A corpus (plural, "corpora") is a set of documents (or sometimes,

individual sentences) that have been hand-annotated with the correct

values to be learned.

Many different classes of machine learning algorithms have been

applied to NLP tasks. These algorithms take as input a large set of

"features" that are generated from the input data. Some of the earliest-

used algorithms, such as decision trees, produced systems of hard if-then

rules similar to the systems of hand-written rules that were then common.

Increasingly, however, research has focused on statistical models, which

make soft, probabilistic decisions based on attaching real-valued weights

to each input feature. Such models have the advantage that they can

express the relative certainty of many different possible answers rather

than only one, producing more reliable results when such a model is

included as a component of a larger system.
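To make this feature-based, probabilistic approach concrete, here is a minimal Python sketch using NLTK's NaiveBayesClassifier (NLTK appears in the toolkit list later in this document). The feature function and the tiny training set are invented for illustration, not taken from any real system.

```python
# A minimal sketch (not from the coursework) of the feature-based,
# probabilistic approach described above, using NLTK's NaiveBayesClassifier.
# The feature function and the training words are invented for illustration.
import nltk

def features(word):
    # "Features" generated from the input data, as described above.
    return {"suffix2": word[-2:], "length": len(word)}

train = [(features(w), tag) for w, tag in [
    ("running", "VERB"), ("jumped", "VERB"), ("walked", "VERB"),
    ("quickly", "ADV"), ("happily", "ADV"),
    ("table", "NOUN"), ("garden", "NOUN"),
]]

classifier = nltk.NaiveBayesClassifier.train(train)

# A soft, probabilistic decision: a real-valued weight for every possible
# answer, not a single hard label.
dist = classifier.prob_classify(features("walking"))
for label in dist.samples():
    print(label, round(dist.prob(label), 3))
```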


Systems based on machine-learning algorithms have many advantages

over hand-produced rules:

The learning procedures used during machine learning automatically

focus on the most common cases, whereas when writing rules by hand it is

often not obvious at all where the effort should be directed.

Automatic learning procedures can make use of statistical inference

algorithms to produce models that are robust to unfamiliar input (e.g.

containing words or structures that have not been seen before) and to

erroneous input (e.g. with misspelled words or words accidentally

omitted). Generally, handling such input gracefully with hand-written

rules — or more generally, creating systems of hand-written rules that

make soft decisions — is extremely difficult, error-prone and time-

consuming.

Systems based on automatically learning the rules can be made more

accurate simply by supplying more input data. However, systems based on

hand-written rules can only be made more accurate by increasing the

complexity of the rules, which is a much more difficult task. In particular,

there is a limit to the complexity of systems based on hand-crafted rules,

beyond which the systems become more and more unmanageable.

However, creating more data to input to machine-learning systems simply

requires a corresponding increase in the number of man-hours worked,

generally without significant increases in the complexity of the annotation

process.

The subfield of NLP devoted to learning approaches is known as Natural Language Learning (NLL); its conference, CoNLL, and its peak body, SIGNLL, are sponsored by the ACL, recognizing also their links with Computational Linguistics and Language Acquisition. When the aim of computational language learning research is to understand more about human language acquisition, or psycholinguistics, NLL overlaps with the related field of Computational Psycholinguistics.


Major tasks in NLP

The following is a list of some of the most commonly researched tasks in

NLP. Note that some of these tasks have direct real-world applications,

while others more commonly serve as sub-tasks that are used to aid in

solving larger tasks. What distinguishes these tasks from other potential

and actual NLP tasks is not only the volume of research devoted to them

but the fact that for each one there is typically a well-defined problem

setting, a standard metric for evaluating the task, standard corpora on

which the task can be evaluated, and competitions devoted to the specific

task.

Automatic summarization

Produce a readable summary of a chunk of text. Often used to provide

summaries of text of a known type, such as articles in the financial section

of a newspaper.

Co-reference resolution

Given a sentence or larger chunk of text, determine which words

("mentions") refer to the same objects ("entities"). Anaphora resolution is

a specific example of this task, and is specifically concerned with matching

up pronouns with the nouns or names that they refer to. The more general

task of co-reference resolution also includes identifying so-called

"bridging relationships" involving referring expressions. For example, in a

sentence such as "He entered John's house through the front door", "the

front door" is a referring expression and the bridging relationship to be

identified is the fact that the door being referred to is the front door of

John's house (rather than of some other structure that might also be

referred to).

Discourse analysis

This rubric includes a number of related tasks. One task is identifying

the discourse structure of connected text, i.e. the nature of the discourse

relationships between sentences (e.g. elaboration, explanation, contrast).

Another possible task is recognizing and classifying the speech acts in a

chunk of text (e.g. yes-no question, content question, statement, assertion,

etc.).


Machine translation (यन्त्रानुवादः)

Automatically translate text from one human language to another. This

is one of the most difficult problems, and is a member of a class of

problems colloquially termed "AI-complete", i.e. requiring all of the

different types of knowledge that humans possess (grammar, semantics,

facts about the real world, etc.) in order to solve properly.

Morphological segmentation

Separate words into individual morphemes and identify the class of the

morphemes. The difficulty of this task depends greatly on the complexity

of the morphology (i.e. the structure of words) of the language being

considered. English has fairly simple morphology, especially inflectional

morphology, and thus it is often possible to ignore this task entirely and

simply model all possible forms of a word (e.g. "open, opens, opened,

opening") as separate words. In languages such as Turkish, however, such

an approach is not possible, as each dictionary entry has thousands of

possible word forms.

Named Entity Recognition (NER)

Given a stream of text, determine which items in the text map to proper

names, such as people or places, and what the type of each such name is

(e.g. person, location, organization). Note that, although capitalization can

aid in recognizing named entities in languages such as English, this

information cannot aid in determining the type of named entity, and in

any case is often inaccurate or insufficient. For example, the first word of a

sentence is also capitalized, and named entities often span several words,

only some of which are capitalized. Furthermore, many other languages in

non-Western scripts (e.g. Chinese or Arabic) do not have any

capitalization at all, and even languages with capitalization may not

consistently use it to distinguish names. For example, German capitalizes

all nouns, regardless of whether they refer to names, and French and

Spanish do not capitalize names that serve as adjectives.
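As a hedged illustration, NLTK ships an off-the-shelf named entity chunker; the sketch below assumes the 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words' data packages have been downloaded via nltk.download(), and the example sentence is invented.

```python
# A minimal sketch of named entity recognition with NLTK's built-in chunker.
import nltk

sentence = "Jim bought 300 shares of Acme Corp. in 2006."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)   # part-of-speech tags feed the chunker
tree = nltk.ne_chunk(tagged)    # labels spans as PERSON, ORGANIZATION, ...

for subtree in tree:
    if isinstance(subtree, nltk.Tree):   # entity spans come back as subtrees
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```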

Natural language generation

Convert information from computer databases into readable human

language.


Natural language understanding

Convert chunks of text into more formal representations such as first-

order logic structures that are easier for computer programs to

manipulate. Natural language understanding involves identifying the intended semantics from the multiple possible semantics that can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. Introducing and creating a language metamodel and an ontology are efficient, though empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed-world assumption (CWA) vs. the open-world assumption, or subjective Yes/No vs. objective True/False, is needed as the basis for formalizing semantics.

Optical Character Recognition (OCR)

Given an image representing printed text, determine the corresponding

text.

Part-of-speech tagging

Given a sentence, determine the part of speech for each word. Many

words, especially common ones, can serve as multiple parts of speech. For

example, "book" can be a noun ("the book on the table") or verb ("to book

a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at

least five different parts of speech. Some languages have more such

ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity. Chinese is also prone to such ambiguity because it inflects partly through tone during verbalization, and such inflection is not readily conveyed via the entities the orthography employs to convey the intended meaning.
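A minimal sketch of this ambiguity using NLTK's default tagger (assuming the 'punkt' and 'averaged_perceptron_tagger' data packages are installed): the same word "book" is typically tagged as a noun in one context and a verb in another.

```python
# A sketch of part-of-speech tagging with NLTK. "book" is typically tagged
# NN (noun) in the first sentence and VB (verb) in the second
# (Penn Treebank tags).
import nltk

for sentence in ["The book is on the table.", "I want to book a flight."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
```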

Parsing

Determine the parse tree (grammatical analysis) of a given sentence.

The grammar for natural languages is ambiguous and typical sentences

have multiple possible analyses. In fact, perhaps surprisingly, for a typical

sentence there may be thousands of potential parses (most of which will

seem completely nonsensical to a human).
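The following minimal NLTK sketch makes this concrete: under a small invented grammar, one classic sentence receives two distinct parse trees, depending on where "with the telescope" attaches.

```python
# A sketch of syntactic ambiguity: this toy grammar (invented for
# illustration) yields two parse trees for the sentence below.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)   # two analyses are printed
```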


Question answering

Given a human-language question, determine its answer. Typical

questions have a specific right answer (such as "What is the capital of

Canada?"), but sometimes open-ended questions are also considered

(such as "What is the meaning of life?").

Relationship extraction

Given a chunk of text, identify the relationships among named entities

(e.g. who is the wife of whom).

Sentence breaking (also known as sentence boundary disambiguation)

Given a chunk of text, find the sentence boundaries. Sentence

boundaries are often marked by periods or other punctuation marks, but

these same characters can serve other purposes (e.g. marking

abbreviations).
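A minimal rule-based sketch in Python; the abbreviation list is invented for illustration, and real systems use much larger lists or statistical models.

```python
# A sketch of sentence boundary disambiguation that skips periods
# belonging to known abbreviations.
import re

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        # The token containing this period may be a known abbreviation.
        token = text[:end].split()[-1].lower()
        if token in ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He met Mr. Jones, e.g. for tea. They left."))
# -> ['Dr. Smith arrived.', 'He met Mr. Jones, e.g. for tea.', 'They left.']
```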

Sentiment analysis

Extract subjective information usually from a set of documents, often

using online reviews to determine "polarity" about specific objects. It is

especially useful for identifying trends of public opinion in the social

media, for the purpose of marketing.
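A minimal lexicon-based sketch; the tiny polarity lexicon and the negation rule are invented for illustration, whereas production systems rely on large lexicons (such as SentiWordNet) or trained classifiers.

```python
# A sketch of lexicon-based polarity scoring with simple negation handling.
POLARITY = {"good": 1, "great": 1, "excellent": 1,
            "bad": -1, "poor": -1, "terrible": -1}
NEGATORS = {"not", "never", "no"}

def polarity(text):
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = 0
    for i, word in enumerate(words):
        value = POLARITY.get(word, 0)
        if i > 0 and words[i - 1] in NEGATORS:
            value = -value      # flip polarity after a negator ("not good")
        score += value
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("The camera is excellent and the battery life is great."))  # positive
print(polarity("The screen is not good."))                                 # negative
```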

Speech recognition

Given a sound clip of a person or people speaking, determine the textual

representation of the speech. This is the opposite of text-to-speech and is

one of the extremely difficult problems colloquially termed "AI-complete"

(see above). In natural speech there are hardly any pauses between

successive words, and thus speech segmentation is a necessary subtask of

speech recognition (see below). Note also that in most spoken languages,

the sounds representing successive letters blend into each other in a

process termed co-articulation, so the conversion of the analog signal to

discrete characters can be a very difficult process.

Speech segmentation

Given a sound clip of a person or people speaking, separate it into

words. A subtask of speech recognition and typically grouped with it.


Topic segmentation and recognition

Given a chunk of text, separate it into segments each of which is devoted

to a topic, and identify the topic of the segment.

Word segmentation

Separate a chunk of continuous text into separate words. For a language

like English, this is fairly trivial, since words are usually separated by

spaces. However, some written languages like Chinese, Japanese and Thai

do not mark word boundaries in such a fashion, and in those languages

text segmentation is a significant task requiring knowledge of the

vocabulary and morphology of words in the language.
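A minimal sketch of the classical greedy "maximum matching" strategy, with an invented toy dictionary standing in for the large lexicons real segmenters use (the example uses unspaced English purely for readability).

```python
# A sketch of greedy maximum-matching word segmentation.
DICTIONARY = {"the", "table", "cat", "sat", "on"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def max_match(text):
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i;
        # fall back to a single character if nothing matches.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in DICTIONARY or length == 1:
                words.append(candidate)
                i += length
                break
    return words

print(max_match("thecatsatonthetable"))
# -> ['the', 'cat', 'sat', 'on', 'the', 'table']
```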

Word sense disambiguation

Many words have more than one meaning; we have to select the

meaning which makes the most sense in context. For this problem, we are

typically given a list of words and associated word senses, e.g. from a

dictionary or from an online resource such as WordNet.
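A minimal sketch using the Lesk overlap algorithm as implemented in NLTK against WordNet (assumes the 'wordnet' data package); Lesk is a simple heuristic, so the sense it picks is not always the right one.

```python
# A sketch of word sense disambiguation with NLTK's Lesk implementation.
from nltk.wsd import lesk

for sentence in ["I went to the bank to deposit my money",
                 "The river overflowed its bank after the storm"]:
    sense = lesk(sentence.split(), "bank")   # returns a WordNet synset
    print(sense.name(), "-", sense.definition())
```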

Subfields of NLP: Related tasks

In some cases, sets of related tasks are grouped into subfields of NLP

that are often considered separately from NLP as a whole. Examples

include:

Information Retrieval (IR)

This is concerned with storing, searching and retrieving information. It

is a separate field within computer science (closer to databases), but IR

relies on some NLP methods (for example, stemming). Some current

research and applications seek to bridge the gap between IR and NLP.

Information Extraction (IE)

This is concerned in general with the extraction of semantic information

from text. This covers tasks such as named entity recognition, Co-

reference resolution, relationship extraction, etc.


Ontology engineering

A field that studies the methods and methodologies for building

ontologies, which are formal representations of a set of concepts within a

domain and the relationships between those concepts.

Speech processing

This covers speech recognition, text-to-speech and related tasks.

Other tasks include:

Stemming

In linguistic morphology and information retrieval, stemming is the

process for reducing inflected (or sometimes derived) words to their stem,

base or root form—generally a written word form. The stem need not be

identical to the morphological root of the word; it is usually sufficient that

related words map to the same stem, even if this stem is not in itself a

valid root. Algorithms for stemming have been studied in computer

science since the 1960s. Many search engines treat words with the same

stem as synonyms as a kind of query expansion, a process called

conflation.

Stemming programs are commonly referred to as stemming algorithms

or stemmers.

A stemmer for English, for example, should identify the string "cats"

(and possibly "catlike", "catty" etc.) as based on the root "cat", and

"stemmer", "stemming", "stemmed" as based on "stem". A stemming

algorithm reduces the words "fishing", "fished", and "fisher" to the root

word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and

"argus" reduce to the stem "argu" (illustrating the case where the stem is

not itself a word or root) but "argument" and "arguments" reduce to the

stem "argument".


Text simplification

Text simplification is an operation used in natural language processing

to modify, enhance, classify or otherwise process an existing corpus of

human-readable text in such a way that the grammar and structure of the

prose is greatly simplified, while the underlying meaning and information

remains the same. Text simplification is an important area of research,

because natural human languages ordinarily contain complex compound

constructions that are not easily processed through automation. In terms

of reducing language diversity, semantic compression can be employed to

limit and simplify a set of words used in given texts.

Text Simplification is illustrated with an example from Siddharthan

(2006). The first sentence contains two relative clauses and one conjoined

verb phrase. A text simplification system aims to simplify the first

sentence to the second sentence.

"The ability to simplify means to eliminate the unnecessary so that the necessary may speak."

Also contributing to the firmness in copper, the analyst noted, was

a report by Chicago purchasing agents, which precedes the full

purchasing agent's report that is due out today and gives an

indication of what the full report might hold.

Also contributing to the firmness in copper, the analyst noted, was

a report by Chicago purchasing agents. The Chicago report

precedes the full purchasing agents report. The Chicago report

gives an indication of what the full report might hold. The full

report is due out today.

Text-to-speech

Speech synthesis is the artificial production of human speech. A

computer system used for this purpose is called a speech synthesizer, and

can be implemented in software or hardware products. A text-to-speech

(TTS) system converts normal language text into speech; other systems


render symbolic linguistic representations like phonetic transcriptions

into speech.

A text-to-speech system (or "engine") is composed of two parts: a front-

end and a back-end. The front-end has two major tasks. First, it converts

raw text containing symbols like numbers and abbreviations into the

equivalent of written-out words. This process is often called text

normalization, pre-processing, or tokenization. The front-end then assigns

phonetic transcriptions to each word, and divides and marks the text into

prosodic units, like phrases, clauses, and sentences. The process of

assigning phonetic transcriptions to words is called text-to-phoneme or

grapheme-to-phoneme conversion. Phonetic transcriptions and prosody

information together make up the symbolic linguistic representation that

is output by the front-end. The back-end—often referred to as the

synthesizer—then converts the symbolic linguistic representation into

sound. In certain systems, this part includes the computation of the target

prosody (pitch contour, phoneme durations), which is then imposed on

the output speech.
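A minimal sketch of the front-end stages only; the number expander and the tiny pronunciation lexicon below are invented for illustration (real front-ends use full normalization grammars and large lexicons such as CMUdict).

```python
# A sketch of a TTS front-end: text normalization followed by
# grapheme-to-phoneme conversion via lexicon lookup.
NUMBERS = {"2": "two", "4": "four", "10": "ten"}
LEXICON = {"the": "DH AH", "meeting": "M IY T IH NG",
           "is": "IH Z", "at": "AE T", "two": "T UW", "ten": "T EH N"}

def normalize(text):
    # Expand digits into written-out words (text normalization).
    return [NUMBERS.get(tok, tok) for tok in text.lower().replace(".", "").split()]

def to_phonemes(words):
    # Grapheme-to-phoneme conversion by dictionary lookup; a real system
    # would fall back to letter-to-sound rules for unknown words.
    return [LEXICON.get(w, "<OOV>") for w in words]

words = normalize("The meeting is at 10.")
print(words)               # ['the', 'meeting', 'is', 'at', 'ten']
print(to_phonemes(words))  # symbolic linguistic representation for the back-end
```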

Text-proofing

Proofreading is the reading of a galley proof or an electronic copy of a

publication to detect and correct production errors of text or art.

Proofreaders are expected to be consistently accurate by default because

they occupy the last stage of typographic production before publication.

Natural language search

A natural language search engine would in theory find targeted answers

to user questions (as opposed to keyword search). For example, when


confronted with a question of the form 'which U.S. state has the highest

income tax?', conventional search engines ignore the question and instead

search on the keywords 'state', 'income' and 'tax'. Natural language search,

on the other hand, attempts to use natural language processing to

understand the nature of the question and then to search and return a

subset of the web that contains the answer to the question. If it works,

results would have a higher relevance than results from a keyword search

engine.

Query expansion

Query expansion (QE) is the process of reformulating a seed query to

improve retrieval performance in information retrieval operations. In the

context of web search engines, query expansion involves evaluating a

user's input (what words were typed into the search query area, and

sometimes other types of data) and expanding the search query to match

additional documents. Query expansion involves techniques such as:

Finding synonyms of words, and searching for the synonyms as

well

Finding all the various morphological forms of words by

stemming each word in the search query

Fixing spelling errors and automatically searching for the

corrected form or suggesting it in the results

Re-weighting the terms in the original query

Query expansion is a methodology studied in the field of computer

science, particularly within the realm of natural language processing and

information retrieval.
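A minimal sketch of the synonym-finding technique from the list above, using WordNet via NLTK (assumes the 'wordnet' data package); real engines combine this with stemming, spelling correction and term re-weighting.

```python
# A sketch of synonym-based query expansion with WordNet.
from nltk.corpus import wordnet as wn

def expand_query(query):
    expanded = set(query.lower().split())
    for term in list(expanded):
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                expanded.add(lemma.replace("_", " ").lower())
    return expanded

print(sorted(expand_query("car insurance")))
# adds synonyms such as "auto", "automobile", "machine", ...
```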

Automated essay scoring

Automated essay scoring (AES), exemplified by Project Essay Grade (PEG), is the use

of specialized computer programs to assign grades to essays written in an


educational setting. It is a method of educational assessment and an

application of natural language processing. Its objective is to classify a

large set of textual entities into a small number of discrete categories,

corresponding to the possible grades—for example, the numbers 1 to 6.

Therefore, it can be considered a problem of statistical classification.

Several factors have contributed to a growing interest in AES. Among

them are cost, accountability, standards, and technology. Rising education

costs have led to pressure to hold the educational system accountable for

results by imposing standards. The advance of information technology

promises to measure educational achievement at reduced cost.

The use of AES for high-stakes testing in education has generated

significant backlash, with opponents pointing to research that computers

cannot yet grade writing accurately and arguing that their use for such

purposes promotes teaching writing in reductive ways (i.e. teaching to the

test).

True-casing

True-casing is the problem in natural language processing (NLP) of

determining the proper capitalization of words where such information is

unavailable. This commonly comes up due to the standard practice (in

English and many other languages) of automatically capitalizing the first

word of a sentence. It can also arise in badly cased or non-cased text (for

example, all-lowercase or all-uppercase text messages). True-casing aids

in many other NLP tasks, such as named entity recognition, machine

translation and Automatic Content Extraction.

True-casing is unnecessary in languages whose scripts do not have a

distinction between uppercase and lowercase letters. This includes all

languages not written in the Latin, Greek, Cyrillic or Armenian alphabets,

such as Sanskrit, Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, etc.
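A minimal statistical sketch: choose the most frequent case form of each word observed in a training corpus (the corpus here is invented for illustration; real true-casers are trained on large text collections).

```python
# A sketch of statistical true-casing by corpus frequency.
from collections import Counter, defaultdict

corpus = "John lives in New York . The new store opened . John met the mayor ."
counts = defaultdict(Counter)
for token in corpus.split():
    counts[token.lower()][token] += 1   # count observed case variants

def truecase(sentence):
    out = []
    for token in sentence.lower().split():
        variants = counts.get(token)
        # Use the most common observed form; otherwise keep lowercase.
        out.append(variants.most_common(1)[0][0] if variants else token)
    return " ".join(out)

print(truecase("john opened a store in new york"))
# -> "John opened a store in New York"
```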


Statistical NLP

Statistical natural-language processing uses stochastic, probabilistic and

statistical methods to resolve some of the difficulties discussed above,

especially those which arise because longer sentences are highly

ambiguous when processed with realistic grammars, yielding thousands

or millions of possible analyses. Methods for disambiguation often involve

the use of corpora and Markov models. Statistical NLP comprises all

quantitative approaches to automated language processing, including

probabilistic modelling, information theory, and linear algebra. The

technology for statistical NLP comes mainly from machine learning and

data mining, both of which are fields of artificial intelligence that involve

learning from data.
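A minimal sketch of the Markov-model idea mentioned above: a bigram model estimated from an invented toy corpus assigns a far higher probability to a well-formed word order than to a scrambled one.

```python
# A sketch of a bigram (first-order Markov) language model.
from collections import Counter

corpus = "the dog barks . the dog bites . the cat purrs .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def sentence_probability(sentence):
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        # Maximum-likelihood estimate of P(word | prev), with a small floor
        # so unseen pairs do not zero out the product.
        p *= bigrams.get((prev, word), 0.01) / unigrams.get(prev, 1)
    return p

print(sentence_probability("the dog barks"))   # relatively high
print(sentence_probability("the barks dog"))   # orders of magnitude lower
```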

Evaluation of natural language processing

Objectives

The goal of NLP evaluation is to measure one or more qualities of an

algorithm or a system, in order to determine whether (or to what extent)

the system answers the goals of its designers, or meets the needs of its

users. Research in NLP evaluation has received considerable attention,

because the definition of proper evaluation criteria is one way to specify

precisely an NLP problem, going thus beyond the vagueness of tasks

defined only as language understanding or language generation. A precise

set of evaluation criteria, which includes mainly evaluation data and

evaluation metrics, enables several teams to compare their solutions to a

given NLP problem.

Short history of evaluation in NLP

The first evaluation campaign on written texts seems to be a campaign

dedicated to message understanding in 1987 (Pallet 1998). Then, the

Parseval/GEIG project compared phrase-structure grammars (Black

1991). A series of campaigns within the Tipster project was realized on tasks

like summarization, translation and searching (Hirschman 1998). In 1994,

in Germany, the Morpholympics compared German taggers. Then, the

Senseval & Romanseval campaigns were conducted with the objectives of


semantic disambiguation. In 1996, the Sparkle campaign compared

syntactic parsers in four different languages (English, French, German and

Italian). In France, the Grace project compared a set of 21 taggers for French in 1997 (Adda 1999). In 2004, during the Technolangue/Easy project, 13 parsers for French were compared. Large-scale evaluations of dependency parsers were performed in the context of the CoNLL shared tasks in 2006 and 2007. In Italy, the EVALITA campaign was conducted in 2007 and 2009 to compare various NLP and speech tools for Italian (see the EVALITA web site); the 2011 campaign was in progress at the time of writing. In France, within the ANR-Passage project (end of 2007), 10 parsers for French were compared (see the Passage web site).

Different types of evaluation

Depending on the evaluation procedures, a number of distinctions are

traditionally made in NLP evaluation.

Intrinsic vs. extrinsic evaluation

Intrinsic evaluation considers an isolated NLP system and characterizes

its performance mainly with respect to a gold standard result, pre-defined

by the evaluators. Extrinsic evaluation, also called evaluation in use,

considers the NLP system in a more complex setting, either as an

embedded system or serving a precise function for a human user. The

extrinsic performance of the system is then characterized in terms of its

utility with respect to the overall task of the complex system or the human

user. For example, consider a syntactic parser that is based on the output

of some new part of speech (POS) tagger. An intrinsic evaluation would

run the POS tagger on some labelled data, and compare the system output

of the POS tagger to the gold standard (correct) output. An extrinsic

evaluation would run the parser with some other POS tagger, and then

with the new POS tagger, and compare the parsing accuracy.
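A minimal sketch of the intrinsic half of this comparison: scoring a tagger's output against gold-standard tags (both tag sequences here are invented for illustration).

```python
# A sketch of intrinsic evaluation: per-token accuracy against a gold standard.
def accuracy(gold, predicted):
    assert len(gold) == len(predicted)
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold      = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "VERB"]
print(f"tagging accuracy: {accuracy(gold, predicted):.0%}")  # 80%
```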


Black-box vs. glass-box evaluation

Black-box evaluation requires one to run an NLP system on a given data

set and to measure a number of parameters related to the quality of the

process (speed, reliability, resource consumption) and, most importantly,

to the quality of the result (e.g. the accuracy of data annotation or the

fidelity of a translation). Glass-box evaluation looks at the design of the

system, the algorithms that are implemented, the linguistic resources it

uses (e.g. vocabulary size), etc. Given the complexity of NLP problems, it is

often difficult to predict performance only on the basis of glass-box

evaluation, but this type of evaluation is more informative with respect to

error analysis or future developments of a system.

Automatic vs. manual evaluation

In many cases, automatic procedures can be defined to evaluate an NLP

system by comparing its output with the gold standard (or desired) one.

Although the cost of producing the gold standard can be quite high,

automatic evaluation can be repeated as often as needed without much

additional costs (on the same input data). However, for many NLP

problems, the definition of a gold standard is a complex task, and can

prove impossible when inter-annotator agreement is insufficient. Manual

evaluation is performed by human judges, who are instructed to

estimate the quality of a system, or most often of a sample of its output,

based on a number of criteria. Although, thanks to their linguistic

competence, human judges can be considered as the reference for a

number of language processing tasks, there is also considerable variation

across their ratings. This is why automatic evaluation is sometimes

referred to as objective evaluation, while the human kind appears to be

more "subjective."

Natural language processing toolkits

The following natural language processing toolkits are popular collections of natural language processing software. They are suites of libraries, frameworks, and applications for symbolic, statistical natural language and speech processing.

Name | Language | License | Creators
Antelope framework | C#, VB.net | Free for research | Proxem
Apertium | C++, Java | GPL | (various)
Cogito | | Commercial | Expert System S.p.A.
Carabao Language Kit | Any COM+ compliant language; customization via data entry | Commercial | LinguaSys
ClearTK | Java | New BSD License | Center for Computational Language and Education Research, University of Colorado Boulder
DELPH-IN | LISP, C++ | LGPL, MIT, ... | Deep Linguistic Processing with HPSG Initiative
Distinguo | C++ | Commercial | Ultralingua Inc.
dkPro | Java | ASL/GPL | tu-darmstadt.de
Ellogon | C/C++ | LGPL | Georgios Petasis
FreeLing | C++ | GPL | Universitat Politècnica de Catalunya
General Architecture for Text Engineering (GATE) | Java | LGPL | GATE open source community
Gensim | Python | LGPL | Radim Řehůřek
Graph Expression | Java | Apache License | Startup huti.ru
IceNLP | Java | LGPL | Icelandic Centre for Language Technology (ICLT)
Learning Based Java | Java | BSD | Cognitive Computation Group, University of Illinois
LingPipe | Java | Royalty-free or commercial | Alias-i
LinguaStream | Java | Free for research | University of Caen, France
Mallet | Java | Common Public License | University of Massachusetts Amherst
MII NLP toolkit | Java | LGPL | UCLA Medical Imaging Informatics (MII) Group
Modular Audio Recognition Framework | Java | BSD | The MARF Research and Development Group, Concordia University
MontyLingua | Python, Java | Free for research | MIT
NLP Engine | Java | Commercial | H-Care
Natural Language Toolkit (NLTK) | Python | Apache 2.0 |
NooJ (based on INTEX) | .NET Framework-based | Free for research | University of Franche-Comté, France
Apache OpenNLP | Java | Apache License 2.0 | Online community
Pattern | Python | BSD | Tom De Smedt, CLiPS, University of Antwerp
PSI-Toolkit | C++ | LGPL | Adam Mickiewicz University in Poznań
Rosette | C, C++, Java, .NET | Commercial | Basis Technology
Rosoka | Java | Commercial | Rosoka Software
ScalaNLP | Scala | Apache License | David Hall and Daniel Ramage
Stanford NLP | Java | GPL | The Stanford Natural Language Processing Group
Rasp | C++ | LGPL | University of Cambridge, University of Sussex
Natural | JavaScript, Node.js | GPL | Chris Umbel
Text Engineering Software Laboratory (Tesla) | Java | Eclipse Public License | University of Cologne
Thinktelligence Delegator | Java | Commercial | Thinktelligence Corporation
Treex | Perl | GPL / Artistic | Charles University in Prague
UIMA | Java / C++ | Apache 2.0 | Apache
VisualText | NLP++ / compiles to C++ | Free or commercial | Text Analysis International, Inc.
WebLab-project | Java / C++ | LGPL | OW2
Unitex/GramLab | C++ (core components) & Java (visual IDE) | LGPL & LGPLLR (linguistic resources) | Laboratoire d'Informatique Gaspard-Monge
The Dragon Toolkit | Java | GPL | Drexel University
Palladian | Java | Commercial | Dresden University of Technology
Factorie | Java | Apache License | University of Massachusetts Amherst
Silpa Indic Language Processing Toolkit | Python | AGPL | Silpa open source community developers
Text Extraction, Annotation and Retrieval Toolkit | Ruby | GPL | Louis Mullie
Zhihuita NLP API | C | Free for research | Zhihuita.org

Named entity recognizers

ABNER (A Biomedical Named Entity Recognizer) – open source text mining program that uses linear-chain conditional random fields. It automatically tags genes, proteins and other entity names in text. Written by Burr Settles of the University of Wisconsin-Madison.

Translation software

Comparison of machine translation applications
Machine translation applications
Google Translate
Linguee – web service that provides an online dictionary for a number of language pairs. Unlike similar services, such as LEO, Linguee incorporates a search engine that provides access to large amounts of bilingual, translated sentence pairs, which come from the World Wide Web. As a translation aid, Linguee therefore differs from machine translation services like Babelfish and is more similar in function to a translation memory.
Hindi-to-Punjabi Machine Translation System
UNL Universal Networking Language
Yahoo! Babel Fish
Bing Translate


Other software

AQUA

BORIS

CTAKES – open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities: drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context (family history of, current, unrelated to patient), and negated/not negated. Also known as Apache cTAKES.

DMAP

ETAP-3 – proprietary linguistic processing system focusing on English and Russian. It is a rule-based system which uses the Meaning-Text Theory as its theoretical foundation.

FASTUS

FRUMP

Iris – personal assistant application for Android. The application uses natural language processing to answer questions based on user voice request.

IPP

JAPE – the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite state transducer that operates over annotations based on regular expressions.

Learning Reader – Forbus et al. 2007

LOLITA – "Large-scale, Object-based, Linguistic Interactor, Translator and Analyzer". LOLITA was developed by Roberto Garigliano and colleagues between 1986 and 2000. It was designed as a general-purpose tool for processing unrestricted text that could be the basis of a wide variety of applications. At its core was a semantic network containing some 90,000 interlinked concepts.

Maluuba – intelligent personal assistant for Android devices, that uses a contextual approach to search which takes into account the user's geographic location, contacts, and language.

METAL MT – machine translation system developed in the 1980s at the University of Texas and at Siemens which ran on Lisp Machines.


Never-Ending Language Learning (NELL) – semantic machine learning system developed by a research team at Carnegie Mellon University and supported by grants from DARPA, Google, and the NSF, with portions of the system running on a supercomputing cluster provided by Yahoo!. NELL was programmed by its developers to identify a basic set of fundamental semantic relationships between a few hundred predefined categories of data, such as cities, companies, emotions, and sports teams. Since the beginning of 2010, the Carnegie Mellon research team has been running NELL around the clock, sifting through hundreds of millions of web pages for connections between the information it already knows and what it finds through its search process, making new connections in a manner intended to mimic the way humans learn new information.
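The loop described above (known facts suggest textual patterns, and patterns yield new candidate facts) is the core of bootstrapping. Below is a deliberately naive sketch; the corpus, seed fact, and substring patterns are invented, and the real NELL couples many such learners with confidence scoring.

```python
# Toy bootstrapping: seed facts -> textual patterns -> new facts.
corpus = [
    "Pittsburgh is a city in Pennsylvania.",
    "Boston is a city in Massachusetts.",
    "Google is a company in California.",
]
seeds = {("Pittsburgh", "Pennsylvania")}  # seed instances of cityInState

def induce_patterns(facts):
    # Keep the text between the two entities as a crude pattern.
    return {s[s.index(x) + len(x):s.index(y)]
            for x, y in facts for s in corpus if x in s and y in s}

def extract_facts(patterns):
    found = set()
    for s in corpus:
        for p in patterns:
            if p in s:
                left, _, right = s.partition(p)
                found.add((left.strip(), right.strip(" .")))
    return found

print(extract_facts(induce_patterns(seeds)))  # also finds Boston's state
```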

NLTK

Online-translator.com

Regulus Grammar Compiler – software system for compiling unification grammars into grammars for speech recognition systems.

S Voice

Siri (software)

Speaktoit

START

TeLQAS

Weka's classification tools

Festival Speech Synthesis System

CMU Sphinx speech recognition system

Chatterbots

Chatterbot – text-based conversation agent that can interact with human users through some medium, such as an instant message service. Some chatterbots are designed for specific purposes, while others converse with human users on a wide range of topics.
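A minimal sketch of the pattern-matching-and-reflection technique behind classic chatterbots such as ELIZA, listed below; the rules and pronoun reflections here are invented stand-ins, not Weizenbaum's originals.

```python
# Toy ELIZA-style responder: match a pattern, reflect pronouns, fill a template.
import re

REFLECT = {"i": "you", "my": "your", "am": "are", "me": "you"}
RULES = [
    (r"i feel (.*)", "Why do you feel {0}?"),
    (r"i am (.*)", "How long have you been {0}?"),
    (r"(.*)", "Please tell me more."),
]

def reflect(text):
    return " ".join(REFLECT.get(w, w) for w in text.split())

def respond(utterance):
    for pattern, template in RULES:
        m = re.match(pattern, utterance.lower())
        if m:
            return template.format(*(reflect(g) for g in m.groups()))

print(respond("I feel ignored by my computer"))
# -> Why do you feel ignored by your computer?
```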

Classic chatterbots

Dr. Sbaitso

ELIZA

PARRY

Racter (or Claude Chatterbot)

Mark V Shaney
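Mark V Shaney, the last entry above, famously generated its Usenet posts with a Markov chain over word sequences. A toy sketch of the technique, with invented training text:

```python
# Toy Markov-chain text generator: learn which word follows each bigram,
# then take a random walk through the learned chain.
import random
from collections import defaultdict

def build_chain(words, order=2):
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, order=2, length=12):
    out = list(random.choice(list(chain)))
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

words = "the cat sat on the mat and the cat ate the fish".split()
print(generate(build_chain(words)))
```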


General chatterbots

Albert One - 1998 and 1999 Loebner Prize winner, by Robby Garner.

A.L.I.C.E. - 2001, 2002, and 2004 Loebner Prize winner developed by Richard Wallace.

Charlix

Cleverbot (winner of the 2010 Mechanical Intelligence Competition)

Elbot - 2008 Loebner Prize winner, by Fred Roberts.

Eugene Goostman - 2012 Turing 100 winner, by Vladimir Veselov.

Fred - an early chatterbot by Robby Garner.

Jabberwacky

Jeeney AI

MegaHAL

SimSimi - A popular artificial intelligence conversation program that was created in 2002 by ISMaker.

Spookitalk - A chatterbot used for NPCs in Douglas Adams' Starship Titanic video game.

Ultra Hal - 2007 Loebner Prize winner, by Robert Medeksza.

Verbot

My Chatterbot - A popular chatterbot that simulates movie characters.

Prelude - Winner of the 2005 Self-Learning Chatbot award

Instant messenger chatterbots

GooglyMinotaur, specializing in Radiohead, the first bot released by ActiveBuddy (June 2001-March 2002)

SmarterChild, developed by ActiveBuddy and released in June 2001

Infobot, an assistant on IRC channels such as #perl, primarily helping to answer frequently asked questions (June 1995-present)

Natural language processing organizations

AFNLP (Asian Federation of Natural Language Processing Associations) – the organization coordinating natural language processing activities and events in the Asia-Pacific region.

Australasian Language Technology Association – promotes language technology research and development in Australia and New Zealand.


Association for Computational Linguistics – international scientific and professional society for people working on problems involving natural language processing.

Standardization in NLP

An ISO sub-committee is working to ease interoperability between lexical resources and NLP programs. The sub-committee is part of ISO/TC37 and is called ISO/TC37/SC4. Some ISO standards have already been published, but most are still under development, mainly concerning lexicon representation (see LMF), annotation, and the data category registry.
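For a sense of what a lexicon-representation standard such as LMF organizes, here is a rough sketch of a lexical entry rendered as a Python dictionary; the actual standard is XML-based and far richer, and the field names below are simplified assumptions, not LMF's real element names.

```python
# Simplified, LMF-inspired lexical entry (illustrative structure only).
lexical_entry = {
    "lemma": "process",
    "part_of_speech": "verb",
    "senses": [
        {"id": "process-v-1", "gloss": "to handle or transform data"},
        {"id": "process-v-2", "gloss": "to deal with through a procedure"},
    ],
}

for sense in lexical_entry["senses"]:
    print(sense["id"], "->", sense["gloss"])
```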

The future of NLP

Human-level natural language processing is an AI-complete problem: it is equivalent to solving the central artificial intelligence problem of making computers as intelligent as people, or strong AI. NLP's future is therefore tied closely to the development of AI in general.

As natural language understanding improves, computers will be able to learn from information online and apply what they have learned in the real world. Combined with natural language generation, computers will become more and more capable of receiving and giving instructions.

In the future, humans may no longer need to write programs at all; instead, they may dictate instructions to a computer in a natural language, and the computer will understand and act upon them.

Other related fields

Biomedical text mining

Compound term processing

Computer-assisted reviewing

Controlled natural language


Deep Linguistic Processing

Foreign language reading aid

Foreign language writing aid

Language technology

Latent semantic indexing

LRE Map

Natural language programming

Reification (linguistics)

Spoken dialogue system

Telligent Systems

Trans-derivational search
