Natural Language Processing
Ph.D. Coursework Assignment
UNIT 2:
SANSKRIT Language
Ancient & Modern
MODULE 5: MODERN TRENDS IN SANSKRIT
General awareness of
NATURAL
LANGUAGE PROCESSING
(NLP)
General awareness of NATURAL LANGUAGE PROCESSING (NLP)
PhD Coursework: UNIT 2 - SANSKRIT ANCIENT & MODERN/module 5
प्रकृतिभाषासंसाधनम् (प्रभासं)
The processing of language by the human brain is termed
language processing. Neuro-linguistic programming, by
contrast, is a non-scientific approach to communication.
Natural Language Processing is concerned with the processing
of language by computers.
NLP is a computer activity in which computers
analyze, understand, alter, or generate natural
language. This includes the automation of any or all
linguistic forms, activities, or methods of communication,
such as conversation, correspondence, reading, written
composition, dictation, publishing, translation, lip reading,
and so on. Natural language processing is also the name of
the branch of computer science, artificial intelligence, and
linguistics concerned with enabling computers to engage in
communication using natural language(s) in all forms,
including but not limited to speech, print, writing, and
signing.
Natural Language Processing (NLP) is a field of computer
science (संगणकविज्ञानम्), artificial intelligence (कृत्रिमबुद्धिः), and
linguistics (भाषाविज्ञानम्) concerned with the interactions
between computers and human (natural) languages. As
such, NLP is related to the area of human–computer
interaction. Many challenges in NLP involve natural
language understanding, that is, enabling computers to
derive meaning from human or natural language input, and
others involve natural language generation.
1. Definitions
2. Prerequisite technologies
3. History: Pāṇini to NLP
4. NLP using machine learning
5. Major tasks, sub-tasks and related tasks in NLP
6. Statistical NLP
7. Evaluation of natural language processing
7.1. Objectives
7.2. Short history of evaluation in NLP
7.3. Different types of evaluation
8. Natural Language Processing toolkits
9. Named Entity Recognizers
10. Translation software
11. Other software
12. Chatterbots
13. Natural language processing organizations
14. Standardization in NLP
15. The future of NLP
16. Other related fields
17. References
Definitions
Natural language processing can be described as all of the following:
A field of science – systematic enterprise that builds and organizes
knowledge in the form of testable explanations and predictions
about the universe.
An applied science – field that applies human knowledge to build or
design useful things.
A field of computer science – scientific and practical approach to
computation and its applications.
A branch of artificial intelligence – intelligence of machines and
robots and the branch of computer science that aims to create it.
A subfield of computational linguistics – interdisciplinary field
dealing with the statistical or rule-based modelling of natural
language from a computational perspective.
An application of engineering – science, skill, and profession of
acquiring and applying scientific, economic, social, and practical
knowledge, in order to design and also build structures, machines,
devices, systems, materials and processes.
An application of software engineering – application of a systematic,
disciplined, quantifiable approach to the design, development,
operation, and maintenance of software, and the study of these
approaches; that is, the application of engineering to software.
A subfield of computer programming – process of designing, writing,
testing, debugging, and maintaining the source code of computer
programs. This source code is written in one or more programming
languages (such as Java, C++, C#, Python, etc.). The purpose of
programming is to create a set of instructions that computers use to
perform specific operations or to exhibit desired behaviours.
A subfield of Artificial Intelligence programming – AI programming
involves (mainly) manipulating symbols and not numbers. These
symbols might represent objects in the world and relationships
between those objects - complex structures of symbols are needed
to capture our knowledge of the world.
A type of system – set of interacting or interdependent components
forming an integrated whole or a set of elements (often called
'components') and relationships which are different from
relationships of the set or its elements to other elements or sets.
A system that includes software – software is a collection of
computer programs and related data that provides the instructions
for telling a computer what to do and how to do it. Software refers
to one or more computer programs and data held in the storage of
the computer. In other words, software is a set of programs,
procedures, algorithms and its documentation concerned with the
operation of a data processing system.
A type of technology – making, modification, usage, and knowledge
of tools, machines, techniques, crafts, systems, methods of
organization, in order to solve a problem, improve a pre-existing
solution to a problem, achieve a goal, handle an applied
input/output relation or perform a specific function. It can also refer
to the collection of such tools, machinery, modifications,
arrangements and procedures. Technologies significantly affect
human as well as other animal species' ability to control and adapt
to their natural environments.
A form of computer technology – computers and their application.
NLP makes use of computers, image scanners, microphones, and
many types of software programs.
Prerequisite technologies
The following technologies make natural language processing possible:
Communication (the activity of a source sending a message to a
receiver.)
Language
Speech
Writing
Computing
Computers
Computer programming
Information extraction
User interface
Software
Text editing
Word processing
Input devices (pieces of hardware for sending data to a computer
to be processed.)
Computer keyboard (typewriter style input device whose input is
converted into various data depending on the circumstances.)
Image scanners
History of natural language processing
The history of natural language processing describes the advances of
natural language processing. The history of NLP generally starts in the
1950s, although work can be found from earlier periods. There is some
overlap with the history of modern linguistics, the history of machine
translation and the history of artificial intelligence.
Pāṇini
Pāṇini (fl. 4th century BCE) (Sanskrit: पाणिनि; a
patronymic meaning "descendant of Paṇi"), or Panini, was a Sanskrit
grammarian from ancient India. He was born in Pushkalavati, Gandhara, in
the modern-day Charsadda of Khyber Pakhtunkhwa, Pakistan.
Pāṇini is known for his Sanskrit grammar, particularly for his
formulation of the 3,959 rules of Sanskrit morphology, syntax and
semantics in the grammar known as Ashtadhyayi (अष्टाध्यायी Aṣṭādhyāyī,
meaning "eight chapters"), the foundational text of the grammatical
branch of the Vedanga, the auxiliary scholarly disciplines of Vedic religion
[Hinduism].
The Ashtadhyayi is one of the earliest known grammars of Sanskrit,
although Pāṇini refers to previous texts like the Unadisutra, Dhatupatha,
and Ganapatha. It is the earliest known work on descriptive linguistics,
and together with the work of his immediate predecessors (Nirukta,
Nighantu, Pratishakyas) stands at the beginning of the history of
linguistics itself. His theory of morphological analysis was more advanced
than any equivalent Western theory before the mid 20th century, and his
analysis of noun compounds still forms the basis of modern linguistic
theories of compounding, which have borrowed Sanskrit terms such as
bahuvrihi and dvandva.
Pāṇini's comprehensive and scientific theory of grammar is
conventionally taken to mark the end of the period of Vedic Sanskrit,
introducing the period of Classical Sanskrit.
Modern linguistics
Pāṇini's work became known in 19th-century Europe, where it
influenced modern linguistics initially through Franz Bopp, who mainly
looked at Pāṇini. Subsequently, a wider body of work influenced Sanskrit
scholars such as Ferdinand de Saussure, Leonard Bloomfield, and Roman
Jakobson. Frits Staal (1930-2012) discussed the impact of Indian ideas on
language in Europe. After outlining the various aspects of the contact,
Staal notes that the idea of formal rules in language – proposed by
Ferdinand de Saussure in 1894 and developed by Noam Chomsky in 1957
– has origins in the European exposure to the formal rules of Pāṇinian
grammar. In particular, de Saussure, who lectured on Sanskrit for three
decades, may have been influenced by Pāṇini and Bhartrihari; his idea of
the unity of signifier-signified in the sign somewhat resembles the notion
of Sphoṭa. More importantly, the very idea that formal rules can be applied
to areas outside of logic or mathematics may itself have been catalyzed by
Europe's contact with the work of Sanskrit grammarians.
De Saussure
Pāṇini, and the later Indian linguist Bhartrihari, had a significant
influence on many of the foundational ideas proposed by Ferdinand de
Saussure, professor of Sanskrit, who is widely considered the father of
modern structural linguistics. Saussure himself cited Indian grammar as
an influence on some of his ideas. In his Mémoire sur le système primitif des
voyelles dans les langues indo-européennes (Memoir on the Original System
of Vowels in the Indo-European Languages) published in 1879, he
mentions Indian grammar as an influence on his idea that "reduplicated
aorists represent imperfects of a verbal class." In his De l'emploi du génitif
absolu en sanscrit (On the Use of the Genitive Absolute in Sanskrit)
published in 1881, he specifically mentions Pāṇini as an influence on the
work.
Prem Singh, in his foreword to the reprint edition of the German
translation of Pāṇini's Grammar in 1998, concluded that the "effect
Panini's work had on Indo-European linguistics shows itself in various
studies" and that a "number of seminal works come to mind," including
Saussure's works and the analysis that "gave rise to the laryngeal theory,"
further stating: "This type of structural analysis suggests influence from
Panini's analytical teaching." George Cardona, however, warns against
overestimating the i flue ce of Pāṇini on modern linguistics: "Although
Saussure also refers to predecessors who had taken this Paninian rule into
account, it is reasonable to conclude that he had a direct acquaintance
with Panini's work. As far as I am able to discern upon rereading
Saussure's Mémoire, however, it shows no direct influence of Paninian
grammar. Indeed, on occasion, Saussure follows a path that is contrary to
Paninian procedure."
Leonard Bloomfield
The founding father of American structuralism, Leonard Bloomfield,
wrote a 1927 paper titled "On some rules of Pāṇini".
Panini: Comparison with modern formal systems
Pāṇini's grammar is the world's first formal system, developed well
before the 19th century innovations of Gottlob Frege and the subsequent
development of mathematical logic. In designing his grammar, Pāṇini used
the method of "auxiliary symbols", in which new affixes are designated to
mark syntactic categories and the control of grammatical derivations. This
technique, rediscovered by the logician Emil Post, became a standard
method in the design of computer programming languages. Sanskritists
now accept that Pāṇini's linguistic apparatus is well-described as an
"applied" Post system. Considerable evidence shows ancient mastery of
context-sensitive grammars, and a general ability to solve many complex
problems. Frits Staal has written that "Panini is the Indian Euclid."
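The "auxiliary symbol" idea can be illustrated with a toy Post-style string-rewriting system. The symbols and rules below are invented purely for illustration; they are not Pāṇini's actual rules, only a sketch of how auxiliary markers can control a derivation:

```python
# A toy Post-style rewriting system: rules are applied repeatedly until
# none matches. The auxiliary symbol '#' marks where derivation may act;
# a final rule erases it once the derivation is complete.
RULES = [
    ("#A", "#aa"),   # expand the category symbol A under the marker
    ("aaB", "ab"),   # contextual rule: B realizes differently after "aa"
    ("#", ""),       # erase the auxiliary marker
]

def rewrite(s, rules, max_steps=100):
    """Apply the first matching rule until a fixed point is reached."""
    for _ in range(max_steps):
        for lhs, rhs in rules:
            if lhs in s:
                s = s.replace(lhs, rhs, 1)
                break
        else:
            return s  # no rule matched: derivation complete
    return s
```

Running `rewrite("#AB", RULES)` derives "ab" in three steps, much as a Pāṇinian derivation applies ordered rules until no further rule is triggered.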
History of machine translation
The history of machine translation dates back to the seventeenth
century, when philosophers such as Leibniz and Descartes put forward
proposals for codes which would relate words between languages. All of
these proposals remained theoretical, and none resulted in the
development of an actual machine.
The first patents for "translating machines" were applied for in the mid-
1930s. One proposal, by Georges Artsrouni, was simply an automatic
bilingual dictionary using paper tape. The other proposal, by Peter
Troyanskii, a Russian, was more detailed. It included both the bilingual
dictionary, and a method for dealing with grammatical roles between
languages, based on Esperanto.
In 1950, Alan Turing published an article titled "Computing Machinery
and Intelligence" which proposed what is now called the Turing test as a
criterion of intelligence.
The Georgetown experiment in 1954 involved fully automatic
translation of more than sixty Russian sentences into English. The authors
claimed that within three to five years, machine translation would be a
solved problem. However, real progress was much slower, and after the
ALPAC report in 1966, which found that ten years of research had failed
to fulfill expectations, funding for machine translation was
dramatically reduced. Little further research in machine translation was
conducted until the late 1980s, when the first statistical machine
translation systems were developed.
Some notably successful NLP systems developed in the 1960s were
SHRDLU, a natural language system working in restricted "blocks worlds"
with restricted vocabularies, and ELIZA, a simulation of a Rogerian
psychotherapist, written by Joseph Weizenbaum between 1964 and 1966.
Using almost no information about human thought or emotion, ELIZA
sometimes provided a startlingly human-like interaction. When the
"patient" exceeded the very small knowledge base, ELIZA might provide a
generic response, for example, responding to "My head hurts" with "Why
do you say your head hurts?".
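ELIZA's behaviour can be sketched in a few lines of pattern-and-response code. The patterns below are invented illustrations in ELIZA's style, not Weizenbaum's actual script:

```python
import re

# A minimal ELIZA-style responder: match a pattern, reflect part of the
# user's input back as a question; fall back to a generic reply when the
# input lies outside the tiny "knowledge base".
PATTERNS = [
    (re.compile(r"my (.+) hurts", re.I), "Why do you say your {0} hurts?"),
    (re.compile(r"i am (.+)", re.I), "How long have you been {0}?"),
]

def respond(utterance):
    for pattern, template in PATTERNS:
        m = pattern.search(utterance)
        if m:
            return template.format(m.group(1))
    return "Please go on."  # generic response, as described above
```

With this sketch, "My head hurts" yields "Why do you say your head hurts?", while anything unmatched gets the generic "Please go on."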
During the 1970s many programmers began to write 'conceptual
ontologies', which structured real-world information into computer-
understandable data. Examples are MARGIE (Schank, 1975), SAM
(Cullingford, 1978), PAM (Wilensky, 1978), TaleSpin (Meehan, 1976),
QUALM (Lehnert, 1977), Politics (Carbonell, 1979), and Plot Units
(Lehnert 1981). During this time, many chatterbots were written
including PARRY, Racter, and Jabberwacky.
Up to the 1980s, most NLP systems were based on complex sets of
hand-written rules. Starting in the late 1980s, however, there was a
revolution in NLP with the introduction of machine learning algorithms
for language processing. This was due both to the steady increase in
computational power resulting from Moore's Law and the gradual
lessening of the dominance of Chomskyan1 theories of linguistics (e.g.
transformational grammar), whose theoretical underpinnings discouraged
the sort of corpus linguistics that underlies the machine-learning
approach to language processing. Some of the earliest-used machine
learning algorithms, such as decision trees, produced systems of hard
if-then rules similar to existing hand-written rules. Increasingly,
however, research has focused on statistical models, which make soft,
probabilistic decisions based on attaching real-valued weights to the
features making up the input data. The cache language models upon which
many speech recognition systems now rely are examples of such statistical
models. Such models are generally more robust when given unfamiliar
input, especially input that contains errors (as is very common for
real-world data), and produce more reliable results when integrated into
a larger system comprising multiple subtasks.
Many of the notable early successes occurred in the field of machine
translation, due especially to work at IBM Research, where successively
more complicated statistical models were developed. These systems were
able to take advantage of existing multilingual textual corpora that had
been produced by the Parliament of Canada and the European Union as a
result of laws calling for the translation of all governmental proceedings
into all official languages of the corresponding systems of government.
However, most other systems depended on corpora specifically developed
for the tasks implemented by these systems, which was (and often
continues to be) a major limitation in the success of these systems. As a
result, a great deal of research has gone into methods of more effectively
learning from limited amounts of data.
1 Chomskyan linguistics encourages the investigation of "corner cases"
that stress the limits of its theoretical models (comparable to pathological
phenomena in mathematics), typically created using thought experiments,
rather than the systematic investigation of typical phenomena that occur
in real-world data, as is the case in corpus linguistics. The creation and use
of such corpora of real-world data is a fundamental part of machine-learning
algorithms for NLP. In addition, theoretical underpinnings of
Chomskyan linguistics such as the so-called "poverty of the stimulus"
argument entail that general learning algorithms, as are typically used in
machine learning, cannot be successful in language processing. As a result,
the Chomskyan paradigm discouraged the application of such models to
language processing.
Recent research has increasingly focused on unsupervised and semi-
supervised learning algorithms. Such algorithms are able to learn from
data that has not been hand-annotated with the desired answers, or using
a combination of annotated and non-annotated data. Generally, this task is
much more difficult than supervised learning, and typically produces less
accurate results for a given amount of input data. However, there is an
enormous amount of non-annotated data available (including, among
other things, the entire content of the World Wide Web), which can often
make up for the inferior results.
NLP using machine learning
Modern NLP algorithms are based on machine learning, especially
statistical machine learning. The paradigm of machine learning is different
from that of most prior attempts at language processing. Prior
implementations of language-processing tasks typically involved the
direct hand coding of large sets of rules. The machine-learning paradigm
calls instead for using general learning algorithms — often, although not
always, grounded in statistical inference — to automatically learn such
rules through the analysis of large corpora of typical real-world examples.
A corpus (plural, "corpora") is a set of documents (or sometimes,
individual sentences) that have been hand-annotated with the correct
values to be learned.
Many different classes of machine learning algorithms have been
applied to NLP tasks. These algorithms take as input a large set of
"features" that are generated from the input data. Some of the earliest-
used algorithms, such as decision trees, produced systems of hard if-then
rules similar to the systems of hand-written rules that were then common.
Increasingly, however, research has focused on statistical models, which
make soft, probabilistic decisions based on attaching real-valued weights
to each input feature. Such models have the advantage that they can
express the relative certainty of many different possible answers rather
than only one, producing more reliable results when such a model is
included as a component of a larger system.
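The "soft, probabilistic decision" idea can be made concrete with a small sketch. The feature names and weights below are invented for illustration; real systems learn thousands of weights from a corpus:

```python
import math

# A sketch of a statistical classifier: real-valued weights attached to
# input features are summed and squashed into a probability, so the model
# expresses relative certainty rather than a hard yes/no decision.
WEIGHTS = {"bias": -0.3, "ends_in_ing": 2.0, "contains_digit": -1.2}

def p_verb(features):
    """Probability that a token is a verb, given its active features."""
    score = WEIGHTS["bias"] + sum(WEIGHTS[f] for f in features if f in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))  # logistic squashing into (0, 1)
```

A token with the feature `ends_in_ing` gets a probability above 0.5, one with `contains_digit` well below it; a larger system can weigh these graded outputs against evidence from other components.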
Systems based on machine-learning algorithms have many advantages
over hand-produced rules:
The learning procedures used during machine learning automatically
focus on the most common cases, whereas when writing rules by hand it is
often not obvious at all where the effort should be directed.
Automatic learning procedures can make use of statistical inference
algorithms to produce models that are robust to unfamiliar input (e.g.
containing words or structures that have not been seen before) and to
erroneous input (e.g. with misspelled words or words accidentally
omitted). Generally, handling such input gracefully with hand-written
rules — or more generally, creating systems of hand-written rules that
make soft decisions — is extremely difficult, error-prone and time-
consuming.
Systems based on automatically learning the rules can be made more
accurate simply by supplying more input data. However, systems based on
hand-written rules can only be made more accurate by increasing the
complexity of the rules, which is a much more difficult task. In particular,
there is a limit to the complexity of systems based on hand-crafted rules,
beyond which the systems become more and more unmanageable.
However, creating more data to input to machine-learning systems simply
requires a corresponding increase in the number of man-hours worked,
generally without significant increases in the complexity of the annotation
process.
The subfield of NLP devoted to learning approaches is known as Natural
Language Learning (NLL); its conference, CoNLL, and its peak body,
SIGNLL, are sponsored by the ACL, in recognition of their links with
computational linguistics and language acquisition. When the aim of
computational language learning research is to understand more about
human language acquisition, or psycholinguistics, NLL overlaps with the
related field of computational psycholinguistics.
Major tasks in NLP
The following is a list of some of the most commonly researched tasks in
NLP. Note that some of these tasks have direct real-world applications,
while others more commonly serve as sub-tasks that are used to aid in
solving larger tasks. What distinguishes these tasks from other potential
and actual NLP tasks is not only the volume of research devoted to them
but the fact that for each one there is typically a well-defined problem
setting, a standard metric for evaluating the task, standard corpora on
which the task can be evaluated, and competitions devoted to the specific
task.
Automatic summarization
Produce a readable summary of a chunk of text. Often used to provide
summaries of text of a known type, such as articles in the financial section
of a newspaper.
Co-reference resolution
Given a sentence or larger chunk of text, determine which words
("mentions") refer to the same objects ("entities"). Anaphora resolution is
a specific example of this task, and is specifically concerned with matching
up pronouns with the nouns or names that they refer to. The more general
task of co-reference resolution also includes identifying so-called
"bridging relationships" involving referring expressions. For example, in a
sentence such as "He entered John's house through the front door", "the
front door" is a referring expression and the bridging relationship to be
identified is the fact that the door being referred to is the front door of
John's house (rather than of some other structure that might also be
referred to).
Discourse analysis
This rubric includes a number of related tasks. One task is identifying
the discourse structure of connected text, i.e. the nature of the discourse
relationships between sentences (e.g. elaboration, explanation, contrast).
Another possible task is recognizing and classifying the speech acts in a
chunk of text (e.g. yes-no question, content question, statement, assertion,
etc.).
Machine translation (यन्त्रानुवादः)
Automatically translate text from one human language to another. This
is one of the most difficult problems, and is a member of a class of
problems colloquially termed "AI-complete", i.e. requiring all of the
different types of knowledge that humans possess (grammar, semantics,
facts about the real world, etc.) in order to solve properly.
Morphological segmentation
Separate words into individual morphemes and identify the class of the
morphemes. The difficulty of this task depends greatly on the complexity
of the morphology (i.e. the structure of words) of the language being
considered. English has fairly simple morphology, especially inflectional
morphology, and thus it is often possible to ignore this task entirely and
simply model all possible forms of a word (e.g. "open, opens, opened,
opening") as separate words. In languages such as Turkish, however, such
an approach is not possible, as each dictionary entry has thousands of
possible word forms.
Named Entity Recognition (NER)
Given a stream of text, determine which items in the text map to proper
names, such as people or places, and what the type of each such name is
(e.g. person, location, organization). Note that, although capitalization can
aid in recognizing named entities in languages such as English, this
information cannot aid in determining the type of named entity, and in
any case is often inaccurate or insufficient. For example, the first word of a
sentence is also capitalized, and named entities often span several words,
only some of which are capitalized. Furthermore, many other languages in
non-Western scripts (e.g. Chinese or Arabic) do not have any
capitalization at all, and even languages with capitalization may not
consistently use it to distinguish names. For example, German capitalizes
all nouns, regardless of whether they refer to names, and French and
Spanish do not capitalize names that serve as adjectives.
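The inadequacy of capitalization alone can be demonstrated with a deliberately naive sketch (the regular expression below is an invented illustration, not a real NER method):

```python
import re

# Naive "NER" that treats every run of capitalized words as one entity.
# It cannot assign types, and a sentence-initial capital is wrongly
# absorbed into the entity span, exactly the failure described above.
def naive_ner(text):
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", text)
```

On "Yesterday John Smith visited Paris." this returns ["Yesterday John Smith", "Paris"]: the sentence-initial "Yesterday" merges into the name span, and nothing tells us which matches are people and which are places.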
Natural language generation
Convert information from computer databases into readable human
language.
Natural language understanding
Convert chunks of text into more formal representations such as first-
order logic structures that are easier for computer programs to
manipulate. Natural language understanding involves identifying the
intended meaning from the many possible meanings that can be derived
from a natural language expression, which usually takes the form of
organized notations of natural-language concepts. Introducing a language
metamodel and ontology is an efficient, though empirical, solution. An
explicit formalization of natural-language semantics, free of confusion
with implicit assumptions such as the closed-world assumption (CWA) vs.
the open-world assumption, or subjective yes/no vs. objective true/false,
is needed as the basis for formalizing semantics.
Optical Character Recognition (OCR)
Given an image representing printed text, determine the corresponding
text.
Part-of-speech tagging
Given a sentence, determine the part of speech for each word. Many
words, especially common ones, can serve as multiple parts of speech. For
example, "book" can be a noun ("the book on the table") or verb ("to book
a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at
least five different parts of speech. Some languages have more such
ambiguity than others. Languages with little inflectional morphology, such
as English, are particularly prone to such ambiguity. Chinese is also
prone to it because it is a tonal language, and inflection expressed in
speech is not readily conveyed by the characters of its orthography.
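A common baseline makes the ambiguity concrete: tag each known word with a single fixed tag, ignoring context. The tiny lexicon below is invented for illustration; real baselines take each word's most frequent tag from a tagged corpus:

```python
# A most-frequent-tag baseline tagger: every known word gets one fixed
# tag regardless of context; unknown words back off to NOUN.
LEXICON = {
    "the": "DET", "a": "DET", "book": "NOUN", "on": "ADP",
    "table": "NOUN", "to": "PART", "flight": "NOUN",
}

def tag(words):
    return [(w, LEXICON.get(w.lower(), "NOUN")) for w in words]
```

Such a tagger labels "book" as NOUN even in "to book a flight", which is exactly why context-sensitive tagging is needed.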
Parsing
Determine the parse tree (grammatical analysis) of a given sentence.
The grammar for natural languages is ambiguous and typical sentences
have multiple possible analyses. In fact, perhaps surprisingly, for a typical
sentence there may be thousands of potential parses (most of which will
seem completely nonsensical to a human).
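The scale of this ambiguity is easy to quantify for the simplest case: the number of distinct binary parse trees over an n-word sentence is the Catalan number C(n-1), which reaches the thousands for ordinary sentence lengths:

```python
# Catalan numbers count the binary bracketings of a sequence: an n-word
# sentence admits C(n-1) distinct binary parse trees.
def catalan(n):
    c = 1
    for k in range(n):
        c = c * 2 * (2 * k + 1) // (k + 2)  # C(k+1) = C(k) * 2(2k+1)/(k+2)
    return c
```

For example, a 4-word sentence admits catalan(3) = 5 bracketings, while a 12-word sentence already admits catalan(11) = 58,786, consistent with the "thousands of potential parses" noted above.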
Question answering
Given a human-language question, determine its answer. Typical
questions have a specific right answer (such as "What is the capital of
Canada?"), but sometimes open-ended questions are also considered
(such as "What is the meaning of life?").
Relationship extraction
Given a chunk of text, identify the relationships among named entities
(e.g. who is the wife of whom).
Sentence breaking (also known as sentence boundary disambiguation)
Given a chunk of text, find the sentence boundaries. Sentence
boundaries are often marked by periods or other punctuation marks, but
these same characters can serve other purposes (e.g. marking
abbreviations).
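A minimal sketch of abbreviation-aware sentence breaking follows; the abbreviation list is an invented toy, far smaller than what a real system would use:

```python
import re

# Split on ., ! or ? followed by whitespace and a capital letter, but
# suppress the split when the candidate sentence ends in a known
# abbreviation, since such periods do not mark sentence boundaries.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:m.end()].strip()
        if candidate.split()[-1].lower() not in ABBREVIATIONS:
            sentences.append(candidate)
            start = m.end()
    sentences.append(text[start:].strip())
    return sentences
```

On "Dr. Smith arrived. He sat down." this yields two sentences, with the period after "Dr." correctly left unsplit.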
Sentiment analysis
Extract subjective information, usually from a set of documents such as
online reviews, to determine the "polarity" of opinion about specific
objects. It is especially useful for identifying trends of public opinion
in social media, for example for marketing purposes.
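The simplest polarity approach counts sentiment-bearing words against small lexicons; the word lists below are invented toys, far smaller than real sentiment lexicons:

```python
# A bare-bones lexicon-based polarity scorer: positive words add one,
# negative words subtract one; the sign of the total is the "polarity".
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def polarity(text):
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
```

So "i love this great phone" scores +2 while "terrible service" scores -1; real systems add negation handling, weighting, and learned models on top of this idea.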
Speech recognition
Given a sound clip of a person or people speaking, determine the textual
representation of the speech. This is the opposite of text-to-speech and is
one of the extremely difficult problems colloquially termed "AI-complete"
(see above). In natural speech there are hardly any pauses between
successive words, and thus speech segmentation is a necessary subtask of
speech recognition (see below). Note also that in most spoken languages,
the sounds representing successive letters blend into each other in a
process termed co-articulation, so the conversion of the analog signal to
discrete characters can be a very difficult process.
Speech segmentation
Given a sound clip of a person or people speaking, separate it into
words. A subtask of speech recognition and typically grouped with it.
Topic segmentation and recognition
Given a chunk of text, separate it into segments each of which is devoted
to a topic, and identify the topic of the segment.
Word segmentation
Separate a chunk of continuous text into separate words. For a language
like English, this is fairly trivial, since words are usually separated by
spaces. However, some written languages like Chinese, Japanese and Thai
do not mark word boundaries in such a fashion, and in those languages
text segmentation is a significant task requiring knowledge of the
vocabulary and morphology of words in the language.
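A classic baseline for such languages is greedy "maximum matching": at each position, take the longest dictionary word that fits. The vocabulary below is an invented toy stand-in for a real lexicon:

```python
# Greedy maximum-matching word segmentation for text written without
# spaces: repeatedly take the longest vocabulary word starting at the
# current position; unknown characters are emitted alone.
VOCAB = {"natural", "language", "processing", "nature", "all"}

def segment(text, vocab, max_len=10):
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])
            i += 1
    return words
```

Given "naturallanguageprocessing", this recovers the three words even though "nature" and "all" would also match shorter prefixes, because the longest match wins.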
Word sense disambiguation
Many words have more than one meaning; we have to select the
meaning which makes the most sense in context. For this problem, we are
typically given a list of words and associated word senses, e.g. from a
dictionary or from an online resource such as WordNet.
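A simplified version of the Lesk approach to this task picks the sense whose dictionary gloss shares the most words with the surrounding context. The word and glosses below are invented toy entries, not taken from WordNet:

```python
# Simplified Lesk-style disambiguation: score each sense by the overlap
# between its gloss and the words of the context, and keep the best.
SENSES = {
    "bank": {
        "financial": "institution that accepts deposits and lends money",
        "river": "sloping land beside a body of water",
    },
}

def disambiguate(word, context):
    ctx = set(context.lower().split())
    return max(SENSES[word], key=lambda s: len(ctx & set(SENSES[word][s].split())))
```

In "he sat on the sloping land by the water", the river gloss overlaps the context in three words and the financial gloss in none, so the river sense is chosen.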
Subfields of NLP: Related tasks
In some cases, sets of related tasks are grouped into subfields of NLP
that are often considered separately from NLP as a whole. Examples
include:
Information Retrieval (IR)
This is concerned with storing, searching and retrieving information. It
is a separate field within computer science (closer to databases), but IR
relies on some NLP methods (for example, stemming). Some current
research and applications seek to bridge the gap between IR and NLP.
Information Extraction (IE)
This is concerned in general with the extraction of semantic information
from text. This covers tasks such as named entity recognition, coreference
resolution, relationship extraction, etc.
Ontology engineering
A field that studies the methods and methodologies for building
ontologies, which are formal representations of a set of concepts within a
domain and the relationships between those concepts.
Speech processing
This covers speech recognition, text-to-speech and related tasks.
Other tasks include:
Stemming
In linguistic morphology and information retrieval, stemming is the
process of reducing inflected (or sometimes derived) words to their stem,
base or root form (generally a written word form). The stem need not be
identical to the morphological root of the word; it is usually sufficient that
related words map to the same stem, even if this stem is not in itself a
valid root. Algorithms for stemming have been studied in computer
science since the 1960s. Many search engines treat words with the same
stem as synonyms as a kind of query expansion, a process called
conflation.
Stemming programs are commonly referred to as stemming algorithms
or stemmers.
A stemmer for English, for example, should identify the string "cats"
(and possibly "catlike", "catty" etc.) as based on the root "cat", and
"stemmer", "stemming", "stemmed" as based on "stem". A stemming
algorithm reduces the words "fishing", "fished", and "fisher" to the root
word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and
"argus" reduce to the stem "argu" (illustrating the case where the stem is
not itself a word or root) but "argument" and "arguments" reduce to the
stem "argument".
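The suffix-stripping idea can be sketched as follows. This toy stemmer handles only the suffixes needed to reproduce the examples above and is far cruder than a real algorithm such as Porter's.

```python
# Tiny suffix-stripping stemmer sketch. It strips the first matching suffix
# while keeping at least a 3-letter stem; order matters ("ing" before "s").
SUFFIXES = ["ing", "ed", "er", "es", "s", "e"]

def stem(word: str) -> str:
    """Strip one matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

Note how "argue" and "argues" both conflate to the non-word stem "argu", while "arguments" keeps the longer stem "argument", exactly as described above.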
Text simplification
Text simplification is an operation used in natural language processing
to modify, enhance, classify or otherwise process an existing corpus of
human-readable text in such a way that the grammar and structure of the
prose is greatly simplified, while the underlying meaning and information
remains the same. Text simplification is an important area of research,
because natural human languages ordinarily contain complex compound
constructions that are not easily processed through automation. In terms
of reducing language diversity, semantic compression can be employed to
limit and simplify a set of words used in given texts.
Text Simplification is illustrated with an example from Siddharthan
(2006). The first sentence contains two relative clauses and one conjoined
verb phrase. A text simplification system aims to simplify the first
sentence to the second sentence.
“The ability to simplify means to eliminate the unnecessary so that the
necessary may speak.”
Also contributing to the firmness in copper, the analyst noted, was
a report by Chicago purchasing agents, which precedes the full
purchasing agent’s report that is due out today and gives an
indication of what the full report might hold.
Also contributing to the firmness in copper, the analyst noted, was
a report by Chicago purchasing agents. The Chicago report
precedes the full purchasing agents report. The Chicago report
gives an indication of what the full report might hold. The full
report is due out today.
Text-to-speech
Speech synthesis is the artificial production of human speech. A
computer system used for this purpose is called a speech synthesizer, and
can be implemented in software or hardware products. A text-to-speech
(TTS) system converts normal language text into speech; other systems
render symbolic linguistic representations like phonetic transcriptions
into speech.
A text-to-speech system (or "engine") is composed of two parts: a front-
end and a back-end. The front-end has two major tasks. First, it converts
raw text containing symbols like numbers and abbreviations into the
equivalent of written-out words. This process is often called text
normalization, pre-processing, or tokenization. The front-end then assigns
phonetic transcriptions to each word, and divides and marks the text into
prosodic units, like phrases, clauses, and sentences. The process of
assigning phonetic transcriptions to words is called text-to-phoneme or
grapheme-to-phoneme conversion. Phonetic transcriptions and prosody
information together make up the symbolic linguistic representation that
is output by the front-end. The back-end—often referred to as the
synthesizer—then converts the symbolic linguistic representation into
sound. In certain systems, this part includes the computation of the target
prosody (pitch contour, phoneme durations), which is then imposed on
the output speech.
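The text-normalization step of the front-end can be sketched as follows; the abbreviation table, the number range handled, and the example sentence are all invented for illustration.

```python
# Sketch of the text-normalization step of a TTS front-end: expand known
# abbreviations and small integers into written-out words.
# The tables below are illustrative only.
ABBREV = {"Dr.": "Doctor", "St.": "Street"}
UNITS = ["zero", "one", "two", "three", "four", "five",
         "six", "seven", "eight", "nine", "ten"]

def normalize(text: str) -> str:
    """Replace abbreviations and digits 0-10 with written-out words."""
    out = []
    for tok in text.split():
        if tok in ABBREV:
            out.append(ABBREV[tok])
        elif tok.isdigit() and int(tok) <= 10:
            out.append(UNITS[int(tok)])
        else:
            out.append(tok)
    return " ".join(out)
```

A production front-end must also disambiguate context ("Dr." as Doctor vs. Drive, "1995" as a year vs. a quantity), which makes real normalization much harder than this table lookup.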
Text-proofing
Proofreading is the reading of a galley proof or an electronic copy of a
publication to detect and correct production errors of text or art.
Proofreaders are expected to be consistently accurate by default because
they occupy the last stage of typographic production before publication.
Natural language search
A natural language search engine would in theory find targeted answers
to user questions (as opposed to keyword search). For example, when
confronted with a question of the form 'which U.S. state has the highest
income tax?', conventional search engines ignore the question and instead
search on the keywords 'state', 'income' and 'tax'. Natural language search,
on the other hand, attempts to use natural language processing to
understand the nature of the question and then to search and return a
subset of the web that contains the answer to the question. If it works,
the results will be more relevant than those from a keyword search
engine.
Query expansion
Query expansion (QE) is the process of reformulating a seed query to
improve retrieval performance in information retrieval operations. In the
context of web search engines, query expansion involves evaluating a
user's input (what words were typed into the search query area, and
sometimes other types of data) and expanding the search query to match
additional documents. Query expansion involves techniques such as:
- finding synonyms of words, and searching for the synonyms as well;
- finding all the various morphological forms of words by stemming each
word in the search query;
- fixing spelling errors and automatically searching for the corrected
form or suggesting it in the results;
- re-weighting the terms in the original query.
Query expansion is a methodology studied in the field of computer
science, particularly within the realm of natural language processing and
information retrieval.
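Two of the techniques listed above, synonym lookup and stemming, can be combined in a small sketch; the synonym table and the crude one-suffix stemmer are invented for illustration.

```python
# Query-expansion sketch: add stems and synonyms of each query term.
# The synonym table is a hypothetical stand-in for a thesaurus.
SYNONYMS = {"car": ["automobile", "vehicle"], "buy": ["purchase"]}

def crude_stem(word: str) -> str:
    """Very crude stemming: drop a final 's' if present."""
    return word[:-1] if word.endswith("s") else word

def expand_query(query: str) -> list:
    """Return the original terms plus their stems and synonyms, no duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, crude_stem(term)] + SYNONYMS.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```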
Automated essay scoring
Automated essay scoring (AES), sometimes called project essay grade
(PEG), is the use of specialized computer programs to assign grades to
essays written in an
educational setting. It is a method of educational assessment and an
application of natural language processing. Its objective is to classify a
large set of textual entities into a small number of discrete categories,
corresponding to the possible grades—for example, the numbers 1 to 6.
Therefore, it can be considered a problem of statistical classification.
Several factors have contributed to a growing interest in AES. Among
them are cost, accountability, standards, and technology. Rising education
costs have led to pressure to hold the educational system accountable for
results by imposing standards. The advance of information technology
promises to measure educational achievement at reduced cost.
The use of AES for high-stakes testing in education has generated
significant backlash, with opponents pointing to research that computers
cannot yet grade writing accurately and arguing that their use for such
purposes promotes teaching writing in reductive ways (i.e. teaching to the
test).
True-casing
True-casing is the problem in natural language processing (NLP) of
determining the proper capitalization of words where such information is
unavailable. This commonly comes up due to the standard practice (in
English and many other languages) of automatically capitalizing the first
word of a sentence. It can also arise in badly cased or non-cased text (for
example, all-lowercase or all-uppercase text messages). True-casing aids
in many other NLP tasks, such as named entity recognition, machine
translation and Automatic Content Extraction.
True-casing is unnecessary in languages whose scripts do not have a
distinction between uppercase and lowercase letters. This includes all
languages not written in the Latin, Greek, Cyrillic or Armenian alphabets,
such as Sanskrit, Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, etc.
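A common baseline for true-casing is a unigram model: restore each word to its most frequent casing observed in training text, skipping the sentence-initial position where capitalization is forced. The two-sentence training corpus below is invented for illustration.

```python
# Unigram true-casing sketch: choose each word's most frequent observed
# casing. The toy training corpus is illustrative only.
from collections import Counter

def train_truecaser(corpus_sentences):
    """Count casing variants of each word in non-initial positions."""
    counts = {}
    for sent in corpus_sentences:
        for w in sent.split()[1:]:   # skip the first word: its case is forced
            counts.setdefault(w.lower(), Counter())[w] += 1
    return counts

def truecase(text, counts):
    """Restore each lowercased word to its most frequent observed casing."""
    out = []
    for w in text.lower().split():
        variants = counts.get(w)
        out.append(variants.most_common(1)[0][0] if variants else w)
    return " ".join(out)

model = train_truecaser(["We flew to Paris in May .",
                         "The Paris office opened ."])
```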
Statistical NLP
Statistical natural-language processing uses stochastic, probabilistic and
statistical methods to resolve some of the difficulties discussed above,
especially those which arise because longer sentences are highly
ambiguous when processed with realistic grammars, yielding thousands
or millions of possible analyses. Methods for disambiguation often involve
the use of corpora and Markov models. Statistical NLP comprises all
quantitative approaches to automated language processing, including
probabilistic modelling, information theory, and linear algebra. The
technology for statistical NLP comes mainly from machine learning and
data mining, both of which are fields of artificial intelligence that involve
learning from data.
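A toy version of Markov-model disambiguation might look like this. The transition and emission probabilities are invented, and a real tagger would run the Viterbi algorithm over whole sentences rather than score one word.

```python
# Sketch of Markov-model disambiguation: after a determiner, is "book"
# a noun or a verb? The tiny probability tables are invented.
TRANS = {("DT", "NN"): 0.6, ("DT", "VB"): 0.05}    # P(tag2 | tag1)
EMIT = {("NN", "book"): 0.3, ("VB", "book"): 0.2}  # P(word | tag)

def score(prev_tag: str, tag: str, word: str) -> float:
    """Transition probability times emission probability for one step."""
    return TRANS.get((prev_tag, tag), 0.0) * EMIT.get((tag, word), 0.0)

# Pick the higher-scoring tag for "book" following a determiner.
best = max(["NN", "VB"], key=lambda t: score("DT", t, "book"))
```

Here the noun reading wins (0.6 x 0.3 = 0.18 against 0.05 x 0.2 = 0.01), illustrating how corpus statistics resolve an ambiguity a grammar alone cannot.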
Evaluation of natural language processing
Objectives
The goal of NLP evaluation is to measure one or more qualities of an
algorithm or a system, in order to determine whether (or to what extent)
the system meets the goals of its designers or the needs of its users.
Research in NLP evaluation has received considerable attention,
because the definition of proper evaluation criteria is one way to specify
precisely an NLP problem, going thus beyond the vagueness of tasks
defined only as language understanding or language generation. A precise
set of evaluation criteria, which includes mainly evaluation data and
evaluation metrics, enables several teams to compare their solutions to a
given NLP problem.
Short history of evaluation in NLP
The first evaluation campaign on written texts seems to be a campaign
dedicated to message understanding in 1987 (Pallet 1998). Then, the
Parseval/GEIG project compared phrase-structure grammars (Black
1991). A series of campaigns within the Tipster project was carried out on tasks
like summarization, translation and searching (Hirschman 1998). In 1994,
in Germany, the Morpholympics compared German taggers. Then, the
Senseval & Romanseval campaigns were conducted with the objectives of
semantic disambiguation. In 1996, the Sparkle campaign compared
syntactic parsers in four different languages (English, French, German and
Italian). In France, the Grace project compared a set of 21 taggers for
French in 1997 (Adda 1999). In 2004, during the Technolangue/Easy
project, 13 parsers for French were compared. Large-scale evaluations of
dependency parsers were performed in the context of the CoNLL shared
tasks in 2006 and 2007. In Italy, the EVALITA campaign was conducted in
2007 and 2009 to compare various NLP and speech tools for Italian, and a
further campaign was organized in 2011 (see the EVALITA web site). In
France, within the ANR-Passage project (end of 2007), 10 parsers for
French were compared (see the Passage web site).
Different types of evaluation
Depending on the evaluation procedures, a number of distinctions are
traditionally made in NLP evaluation.
Intrinsic vs. extrinsic evaluation
Intrinsic evaluation considers an isolated NLP system and characterizes
its performance mainly with respect to a gold standard result, pre-defined
by the evaluators. Extrinsic evaluation, also called evaluation in use,
considers the NLP system in a more complex setting, either as an
embedded system or as one serving a precise function for a human user. The
extrinsic performance of the system is then characterized in terms of its
utility with respect to the overall task of the complex system or the human
user. For example, consider a syntactic parser that is based on the output
of some new part of speech (POS) tagger. An intrinsic evaluation would
run the POS tagger on some labelled data, and compare the system output
of the POS tagger to the gold standard (correct) output. An extrinsic
evaluation would run the parser with some other POS tagger, and then
with the new POS tagger, and compare the parsing accuracy.
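The intrinsic evaluation of the POS tagger described above reduces to a per-token accuracy computation, sketched here with illustrative tags.

```python
# Intrinsic evaluation sketch: per-token accuracy of a tagger's output
# against a gold standard. The tag sequences used are illustrative.
def tagging_accuracy(system_tags, gold_tags):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    assert len(system_tags) == len(gold_tags), "sequences must align"
    correct = sum(s == g for s, g in zip(system_tags, gold_tags))
    return correct / len(gold_tags)
```

The extrinsic counterpart would instead hold the tagger fixed inside a parser and compare the parser's accuracy with the old and new taggers.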
Black-box vs. glass-box evaluation
Black-box evaluation requires one to run an NLP system on a given data
set and to measure a number of parameters related to the quality of the
process (speed, reliability, resource consumption) and, most importantly,
to the quality of the result (e.g. the accuracy of data annotation or the
fidelity of a translation). Glass-box evaluation looks at the design of the
system, the algorithms that are implemented, the linguistic resources it
uses (e.g. vocabulary size), etc. Given the complexity of NLP problems, it is
often difficult to predict performance only on the basis of glass-box
evaluation, but this type of evaluation is more informative with respect to
error analysis or future developments of a system.
Automatic vs. manual evaluation
In many cases, automatic procedures can be defined to evaluate an NLP
system by comparing its output with the gold standard (or desired) one.
Although the cost of producing the gold standard can be quite high,
automatic evaluation can be repeated as often as needed without much
additional costs (on the same input data). However, for many NLP
problems, the definition of a gold standard is a complex task, and can
prove impossible when inter-annotator agreement is insufficient. Manual
evaluation is performed by human judges, who are instructed to
estimate the quality of a system, or most often of a sample of its output,
based on a number of criteria. Although, thanks to their linguistic
competence, human judges can be considered as the reference for a
number of language processing tasks, there is also considerable variation
across their ratings. This is why automatic evaluation is sometimes
referred to as objective evaluation, while the human kind appears to be
more "subjective."
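Inter-annotator agreement is commonly quantified with Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. A minimal sketch for two judges, with invented labels:

```python
# Cohen's kappa sketch for two annotators' label sequences.
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates reliable annotation; values much lower suggest the gold standard itself is in doubt, as the paragraph above notes.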
Natural language processing toolkits
The following natural language processing toolkits are popular collections of natural language processing software. They are suites of libraries, frameworks, and applications for symbolic and statistical natural language and speech processing.
Name | Language | License | Creators
Antelope framework | C#, VB.net | Free for research | Proxem
Apertium | C++, Java | GPL | (various)
Cogito | | Commercial | Expert System S.p.A.
Carabao Language Kit | Any COM+ compliant language (customization is via data entry) | Commercial | LinguaSys
ClearTK | Java | new BSD license | The Center for Computational Language and Education Research at the University of Colorado Boulder
DELPH-IN | LISP, C++ | LGPL, MIT, ... | Deep Linguistic Processing with HPSG Initiative
Distinguo | C++ | Commercial | Ultralingua Inc.
dkPro | Java | ASL/GPL | tu-darmstadt.de
Ellogon | C / C++ | LGPL | Georgios Petasis
FreeLing | C++ | GPL | Universitat Politècnica de Catalunya
General Architecture for Text Engineering (GATE) | Java | LGPL | GATE open source community
Gensim | Python | LGPL | Radim Řehůřek
Graph Expression | Java | Apache License | Startup huti.ru
IceNLP | Java | LGPL | Icelandic Centre for Language Technology (ICLT)
Learning Based Java | Java | BSD | Cognitive Computation Group at the University of Illinois
LingPipe | Java | royalty free or commercial | Alias-i
LinguaStream | Java | Free for research | University of Caen, France
Mallet | Java | Common Public License | University of Massachusetts Amherst
MII nlp toolkit | Java | LGPL | UCLA Medical Imaging Informatics (MII) Group
Modular Audio Recognition Framework | Java | BSD | The MARF Research and Development Group, Concordia University
MontyLingua | Python, Java | Free for research | MIT
NLP Engine | Java | Commercial | H-Care
Natural Language Toolkit (NLTK) | Python | Apache 2.0 |
NooJ (based on INTEX) | .NET Framework-based | Free for research | University of Franche-Comté, France
Apache OpenNLP | Java | Apache License 2.0 | Online community
Pattern | Python | BSD | Tom De Smedt, CLiPS, University of Antwerp
PSI-Toolkit | C++ | LGPL | Adam Mickiewicz University in Poznań
Rosette | C, C++, Java, .NET | Commercial | Basis Technology
Rosoka | Java | Commercial | Rosoka Software
ScalaNLP | Scala | Apache License | David Hall and Daniel Ramage
Stanford NLP | Java | GPL | The Stanford Natural Language Processing Group
Rasp | C++ | LGPL | University of Cambridge, University of Sussex
Natural | JavaScript, NodeJs | GPL | Chris Umbel
Text Engineering Software Laboratory (Tesla) | Java | Eclipse Public License | University of Cologne
Thinktelligence Delegator | Java | Commercial | Thinktelligence Corporation
Treex | Perl | GPL / Artistic | Charles University in Prague
UIMA | Java / C++ | Apache 2.0 | Apache
VisualText | NLP++ (compiles to C++) | Free or Commercial | Text Analysis International, Inc.
WebLab-project | Java / C++ | LGPL | OW2
Unitex/GramLab | C++ (core components) & Java (visual IDE) | LGPL & LGPLLR (linguistic resources) | Laboratoire d'Informatique Gaspard-Monge
The Dragon Toolkit | Java | GPL | Drexel University
Palladian | Java | Commercial | Dresden University of Technology
Factorie | Java | Apache License | University of Massachusetts Amherst
Silpa Indic Language Processing Toolkit | Python | AGPL | Silpa opensource community developers
Text Extraction, Annotation and Retrieval Toolkit | Ruby | GPL | Louis Mullie
Zhihuita NLP API | C | Free for research | Zhihuita.org
Named entity recognizers
ABNER (A Biomedical Named Entity Recognizer) – open source text mining program that uses linear-chain conditional random fields. It automatically tags genes, proteins and other entity names in text. Written by Burr Settles of the University of Wisconsin-Madison.
Translation software
Comparison of machine translation applications
Machine translation applications
Google Translate
Linguee – web service that provides an online dictionary for a number of language pairs. Unlike similar services, such as LEO, Linguee incorporates a search engine that provides access to large amounts of bilingual, translated sentence pairs, which come from the World Wide Web. As a translation aid, Linguee therefore differs from machine translation services like Babelfish and is more similar in function to a translation memory.
Hindi-to-Punjabi Machine Translation System
UNL Universal Networking Language
Yahoo! Babel Fish
Bing Translate
Other software
AQUA
BORIS
CTAKES – open-source natural language processing system for information extraction from electronic medical record clinical free-text. It processes clinical notes, identifying types of clinical named entities: drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures. Each named entity has attributes for the text span, the ontology mapping code, context (family history of, current, unrelated to patient), and negated/not negated. Also known as Apache cTAKES.
DMAP
ETAP-3 – proprietary linguistic processing system focusing on English and Russian. It is a rule-based system which uses the Meaning-Text Theory as its theoretical foundation.
FASTUS
FRUMP
Iris – personal assistant application for Android. The application uses natural language processing to answer questions based on user voice request.
IPP
JAPE – the Java Annotation Patterns Engine, a component of the open-source General Architecture for Text Engineering (GATE) platform. JAPE is a finite state transducer that operates over annotations based on regular expressions.
Learning Reader – Forbus et al. 2007
LOLITA – "Large-scale, Object-based, Linguistic Interactor, Translator and Analyzer". LOLITA was developed by Roberto Garigliano and colleagues between 1986 and 2000. It was designed as a general-purpose tool for processing unrestricted text that could be the basis of a wide variety of applications. At its core was a semantic network containing some 90,000 interlinked concepts.
Maluuba – intelligent personal assistant for Android devices, that uses a contextual approach to search which takes into account the user's geographic location, contacts, and language.
METAL MT – machine translation system developed in the 1980s at the University of Texas and at Siemens which ran on Lisp Machines.
Never-Ending Language Learning – semantic machine learning system developed by a research team at Carnegie Mellon University, and supported by grants from DARPA, Google, and the NSF, with portions of the system running on a supercomputing cluster provided by Yahoo!. NELL was programmed by its developers to be able to identify a basic set of fundamental semantic relationships between a few hundred predefined categories of data, such as cities, companies, emotions and sports teams. Since the beginning of 2010, the Carnegie Mellon research team has been running NELL around the clock, sifting through hundreds of millions of web pages looking for connections between the information it already knows and what it finds through its search process – to make new connections in a manner that is intended to mimic the way humans learn new information.
NLTK
Online-translator.com
Regulus Grammar Compiler – software system for compiling unification grammars into grammars for speech recognition systems.
S Voice
Siri (software)
Speaktoit
START
TeLQAS
Weka's classification tools
Festival Speech Synthesis System
CMU Sphinx speech recognition system
Chatterbots
Chatterbot – text-based conversation agent that can interact with human users through some medium, such as an instant message service. Some chatterbots are designed for specific purposes, while others converse with human users on a wide range of topics.
Classic chatterbots
Dr. Sbaitso
ELIZA
PARRY
Racter (or Claude Chatterbot)
Mark V Shaney
General chatterbots
Albert One - 1998 and 1999 Loebner winner, by Robby Garner.
A.L.I.C.E. - 2001, 2002, and 2004 Loebner Prize winner developed by Richard Wallace.
Charlix
Cleverbot (winner of the 2010 Mechanical Intelligence Competition)
Elbot - 2008 Loebner Prize winner, by Fred Roberts.
Eugene Goostman - 2012 Turing 100 winner, by Vladimir Veselov.
Fred - an early chatterbot by Robby Garner.
Jabberwacky
Jeeney AI
MegaHAL
SimSimi - A popular artificial intelligence conversation program that was created in 2002 by ISMaker.
Spookitalk - A chatterbot used for NPCs in Douglas Adams' Starship Titanic video game.
Ultra Hal - 2007 Loebner Prize winner, by Robert Medeksza.
Verbot
My Chatterbot - A popular chatterbot that simulates movie characters.
Prelude@# - Winner of the 2005 Self-Learning Chatbot award
Instant messenger chatterbots
GooglyMinotaur, specializing in Radiohead, the first bot released by ActiveBuddy (June 2001-March 2002)
SmarterChild, developed by ActiveBuddy and released in June 2001
Infobot, an assistant on IRC channels such as #perl, primarily to help out with answering Frequently Asked Questions (June 1995-today)
Natural language processing organizations
AFNLP (Asian Federation of Natural Language Processing Associations) – the organization for coordinating the natural language processing related activities and events in the Asia-Pacific region.
Australasian Language Technology Association –
Association for Computational Linguistics – international scientific and professional society for people working on problems involving natural language processing.
Standardization in NLP
An ISO sub-committee is working to ease interoperability
between lexical resources and NLP programs. The sub-committee is part
of ISO/TC37 and is called ISO/TC37/SC4. Some ISO standards have
already been published, but most are still under development, mainly
covering lexicon representation (see LMF), annotation, and the data
category registry.
The future of NLP
Human-level natural language processing is an AI-complete problem.
That is, it is equivalent to solving the central artificial intelligence
problem—making computers as intelligent as people, or strong AI. NLP's
future is therefore tied closely to the development of AI in general.
As natural language understanding improves, computers will be able to
learn from the information online and apply what they learned in the real
world. Combined with natural language generation, computers will
become more and more capable of receiving and giving instructions.
In the future, humans may no longer need to write program code but will
instead dictate to a computer in a natural human language, and the
computer will understand and act upon the instructions.
Other related fields
Biomedical text mining
Compound term processing
Computer-assisted reviewing
Controlled natural language
Deep Linguistic Processing
Foreign language reading aid
Foreign language writing aid
Language technology
Latent semantic indexing
LRE Map
Natural language programming
Reification (linguistics)
Spoken dialogue system
Telligent Systems
Trans-derivational search