Language and Speech Technology: Introduction
Jan Odijk
January 2011
LOT Winter School 2011
1
Overview
• What is language and speech technology (LST)? (3-7)
• Major Subfields of LST (8-25)
• Characterization of the last 30 years (26-27)
– 80s (28-36), 90s (37-49), 00s (50-56)
– Current Status (57-69)
• CLARIN infrastructure (70-75)
• This week’s programme (76)
2
Language Technology
• Language Technology is the study of computational systems that process natural language
• Alternative names:
– Human Language Technology (HLT)
– Natural Language Processing (NLP)
3
Speech Technology
• Speech Technology is the study of computational systems that process speech
• Is a part of Language Technology
• Often
– The term “Language Technology” is reserved for the study of computational systems that process written language
4
Computational Linguistics
• Computational Linguistics (CL) is the study of language from a computational perspective
• Often used interchangeably with language technology
• Often grouped under Artificial Intelligence (AI), although CL predates AI
– AI: the study and design of intelligent systems
5
Computational Systems
• Computational systems to process natural language do not exist naturally (except in the human brain)
– They must be designed, implemented, and evaluated
– Therefore it is a kind of engineering
6
Computational Systems
• LST is NOT the study of how humans process natural language, as pursued in
– cognition
– (cognitive) psychology
– (psycho)linguistics
– phonetics
7
Language Technology Subfields
• Orthographic processing
– Text = sequence of characters
– Tokenization
• Text => sequence of tokens
• Token = occurrence of a word form
• Relatively simple for languages that use spaces and punctuation marks (dot, comma, etc.) to separate tokens
• More difficult for languages such as Chinese, Thai, etc.
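A minimal sketch of tokenization for a language that separates tokens with spaces and punctuation. The regular expression and the function name `tokenize` are illustrative assumptions, not part of the lecture; real tokenizers also handle abbreviations, numbers, URLs and similar cases.

```python
import re

# Assumed pattern: a token is either a run of word characters (optionally
# joined by an internal apostrophe or hyphen) or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Map a text (sequence of characters) to a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The man bought a present, didn't he?"))
```

This approach does not carry over to languages such as Chinese or Thai, where word boundaries are not marked by spaces.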
8
Language Technology Subfields
• Orthographic processing
– Orthographic normalization
– Token => (token, normalized token)
– Normalized token = canonical orthographic representation for a set of orthographic variants
– Examples:
• Contemporary spelling variants: aktie => actie
• Older spelling variants: vleesch => vlees
• Typos: actei => actie
• OCR errors: raarn => raam
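The mapping token => (token, normalized token) can be sketched as a table lookup. The table below contains only the four examples from the slide; the name `VARIANTS` and the fallback of returning an unknown token unchanged are assumptions made for illustration.

```python
# Variant -> canonical form, using only the examples given above.
VARIANTS = {
    "aktie": "actie",    # contemporary spelling variant
    "vleesch": "vlees",  # older spelling variant
    "actei": "actie",    # typo
    "raarn": "raam",     # OCR error
}


def normalize(token: str) -> tuple[str, str]:
    """Return (token, normalized token); unknown tokens map to themselves."""
    return (token, VARIANTS.get(token, token))


print(normalize("vleesch"))
print(normalize("raam"))
```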
9
Language Technology Subfields
• Morphological processing
– Lemmatization: token => (token, lemma)
• Lemma = canonical orthographic representation for an inflectional paradigm
• Often ambiguities
• Examples
– lemma(walked) = walk; lemma(men) = man
– lemma(graven) = {graf, graaf, graven} (Dutch)
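Because lemmatization is often ambiguous, a lemmatizer naturally returns a set of candidate lemmas rather than a single one. The following is a sketch under that assumption; the table `LEMMAS` holds only the examples above, and treating an unknown token as its own lemma is a hypothetical fallback.

```python
# Word form -> set of possible lemmas (examples from the slide only).
LEMMAS = {
    "walked": {"walk"},
    "men": {"man"},
    "graven": {"graf", "graaf", "graven"},  # Dutch, three-way ambiguous
}


def lemmatize(token: str) -> tuple[str, set[str]]:
    """Return (token, candidate lemmas) without looking at the context."""
    form = token.lower()
    return (token, LEMMAS.get(form, {form}))


print(lemmatize("men"))
print(sorted(lemmatize("graven")[1]))
```

Choosing among the candidates requires context, which is what tagging provides.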
10
Language Technology Subfields
• Morphological processing
– Inflection analysis/generation
• Word form <=> (lemma, inflectional features)
• Examples:
– graven <=> (graf, PoS=Noun, number=plural)
– graven <=> (graaf, PoS=Noun, number=plural)
– graven <=> (graven, PoS=Verb, form=infinitive)
– graven <=> (graven, PoS=Verb, form=indicative, tense=present, number=plural)
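Inflection analysis and generation are two directions of the same relation between word forms and (lemma, features) pairs. A small sketch, assuming the relation is stored as an explicit table; the names `Analysis`, `analyse` and `generate` are hypothetical, and the data are the four analyses of "graven" listed above.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    features: tuple  # sorted (feature, value) pairs


def feats(**kw: str) -> tuple:
    """Build an order-independent feature bundle."""
    return tuple(sorted(kw.items()))


ANALYSES = {
    "graven": [
        Analysis("graf", feats(PoS="Noun", number="plural")),
        Analysis("graaf", feats(PoS="Noun", number="plural")),
        Analysis("graven", feats(PoS="Verb", form="infinitive")),
        Analysis("graven", feats(PoS="Verb", form="indicative",
                                 tense="present", number="plural")),
    ],
}


def analyse(word_form: str) -> list[Analysis]:
    """Analysis: word form -> all (lemma, features) pairs."""
    return ANALYSES.get(word_form, [])


def generate(lemma: str, **features: str) -> list[str]:
    """Generation: (lemma, features) -> all matching word forms."""
    wanted = feats(**features)
    return [form for form, analyses in ANALYSES.items()
            for a in analyses if a.lemma == lemma and a.features == wanted]


print(len(analyse("graven")))
print(generate("graaf", PoS="Noun", number="plural"))
```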
11
Language Technology Subfields
• Morphological processing
– Compound processing
– word form => ((word form, affix?)+, word form)
– lemma => ((word form, affix?)+, lemma)
– Example:
– vleeskoeienhouders => ([vlees, koeien], houders) ‘meat cow farmers’
– gebiedsbepaling => ([(gebied, s)], bepaling)
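One simple way to obtain such analyses is a recursive, lexicon-driven splitter. The sketch below is an assumption about how this could be done, not the method used in any particular system: `LEXICON` and `LINKERS` are tiny hypothetical resources, and the first analysis found is returned.

```python
LEXICON = {"vlees", "koeien", "houders", "gebied", "bepaling"}
LINKERS = ("", "s")  # assumed linking elements: none, or Dutch "s"


def split_compound(word: str):
    """Return (modifiers, head), where modifiers is a list of
    (word form, affix) pairs, or None if no analysis is found."""
    if word in LEXICON:
        return ([], word)
    for i in range(1, len(word)):
        first = word[:i]
        if first not in LEXICON:
            continue
        rest = word[i:]
        for linker in LINKERS:
            if not rest.startswith(linker):
                continue
            tail = split_compound(rest[len(linker):])
            if tail is not None:
                modifiers, head = tail
                return ([(first, linker)] + modifiers, head)
    return None


print(split_compound("vleeskoeienhouders"))
print(split_compound("gebiedsbepaling"))
```

A realistic splitter needs a large lexicon and a way to rank competing analyses, since many words can be split in more than one way.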
12
Language Technology Subfields
• Morphological processing
– Derivational morphology processing
– word form => (prefix*, lemma, suffix*)
– Example:
• characterization => ([], characterize, [ation])
13
Language Technology Subfields
• (PoS-)tagging
– Assignment of a grammatical tag to a token in context (tag = label for grammatical properties)
– Token => (token, tag) in context
– Usually assignment of PoS-tags
– Often more detailed grammatical (inflectional) tags
14
Language Technology Subfields
• (PoS-)tagging
– Context: usually:
• Some words and/or tags preceding
• Some words following
– Examples:
• (graven, Zij __ een graf) => Vindprespl
• (graven, De __ zijn boos) => Npl
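The role of context can be illustrated with a toy disambiguation rule that looks only at the preceding word. This is a hand-written sketch for the two examples above, not a realistic tagger (which would typically be statistical); the word lists and the function name `tag` are assumptions, while the tags `Vindprespl` and `Npl` are those used on the slide.

```python
CANDIDATE_TAGS = {"graven": ["Vindprespl", "Npl"]}
DETERMINERS = {"de", "het", "een"}        # assumed small word lists
PLURAL_PRONOUNS = {"zij", "wij", "jullie"}


def tag(token: str, previous_word: str):
    """Pick a tag for token given the word that precedes it."""
    candidates = CANDIDATE_TAGS.get(token.lower(), [])
    if not candidates:
        return None
    prev = previous_word.lower()
    if prev in DETERMINERS and "Npl" in candidates:
        return "Npl"          # determiner + graven: plural noun
    if prev in PLURAL_PRONOUNS and "Vindprespl" in candidates:
        return "Vindprespl"   # plural pronoun + graven: finite verb
    return candidates[0]      # arbitrary fallback


print(tag("graven", "Zij"))
print(tag("graven", "De"))
```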
15
Language Technology Subfields
• Chunking
– identifying major phrases in a sentence
– Example
• The man bought a present for his wife =>
• [NP The man] bought [NP a present] [PP for his wife]
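Chunking is often done on top of PoS tags. The sketch below assumes the input is already tagged with a small hypothetical tag set (DET, ADJ, NOUN, PREP, VERB) and recognizes two patterns: NP = DET? ADJ* NOUN, and PP = PREP NP. Words outside any chunk are returned with the label `None`.

```python
def chunk(tagged: list[tuple[str, str]]) -> list[tuple]:
    """Group (word, tag) pairs into NP and PP chunks."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        start, label = i, "NP"
        j = i
        if tagged[j][1] == "PREP":      # a preposition may open a PP
            label, j = "PP", j + 1
        if j < n and tagged[j][1] == "DET":
            j += 1
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        if j < n and tagged[j][1] == "NOUN":
            words = [w for w, _ in tagged[start:j + 1]]
            chunks.append((label, words))
            i = j + 1
        else:                           # no phrase starts here
            chunks.append((None, [tagged[start][0]]))
            i = start + 1
    return chunks


sentence = [("The", "DET"), ("man", "NOUN"), ("bought", "VERB"),
            ("a", "DET"), ("present", "NOUN"),
            ("for", "PREP"), ("his", "DET"), ("wife", "NOUN")]
print(chunk(sentence))
```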
16
Language Technology Subfields
• Parsing
– Assign a syntactic structure to a sentence
– Example: The man bought a present for his wife =>
[S
  [subj/NP The man]
  [pred/VP bought
    [obj/NP a present]
    [pobj/PP for [obj/NP his wife]]
  ]
]
17
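The syntactic structure in the parsing example can be represented as a tree of labelled nodes. The following sketch only stores and prints such a structure, it does not parse; the class name `Node` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                     # e.g. "subj/NP"
    children: list = field(default_factory=list)   # Node or str items

    def leaves(self) -> list[str]:
        """Return the words of the sentence in order."""
        words = []
        for child in self.children:
            words.extend(child.leaves() if isinstance(child, Node) else [child])
        return words

    def bracketed(self) -> str:
        """Render the tree in labelled-bracket notation."""
        parts = [c.bracketed() if isinstance(c, Node) else c
                 for c in self.children]
        return "[" + " ".join([self.label] + parts) + "]"


tree = Node("S", [
    Node("subj/NP", ["The man"]),
    Node("pred/VP", [
        "bought",
        Node("obj/NP", ["a present"]),
        Node("pobj/PP", ["for", Node("obj/NP", ["his wife"])]),
    ]),
])
print(tree.bracketed())
print(" ".join(tree.leaves()))
```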
Language Technology Subfields
• Machine Translation
– Automatic translation of an input text
– Example
• The man bought a present for his wife =>
• L’homme a acheté un cadeau pour sa femme
18
Language Technology Subfields
• Content extraction and processing
– Named entity recognition
– Question-answering
– Information retrieval
– Information extraction
– Sentiment/opinion mining
– Reasoning/Inference on semantic representation
– …
19
Speech Technology Subfields
• Speech Synthesis
– Artificial production of human speech
– Text => speech
– Often called Text-To-Speech (TTS)
– TTS system usually contains two components
• Grapheme to Phoneme (G2P) component
– Text => symbolic speech representation (phonetic representation)
• Speech Synthesis component
– Symbolic speech representation => speech
20
Speech Technology Subfields
• Speech Synthesis (cont.)
– The term Speech Synthesis is often reserved for this second component
– Meaning => speech: usually called Speech Generation, Concept-To-Speech, or Data-To-Speech
21
Speech Technology Subfields
• Speech Recognition
– Recognition of human speech
– Audio containing speech => text
– Often called automatic speech recognition (ASR)
• Speech Understanding
– Understanding of human speech
– Audio containing speech => meaning or action
22
Speech Technology Subfields
• Speaker Recognition
– Recognition of a speaker given a speech signal
– Speech => person identity
• Speaker Verification
– Verification of the identity of a person
– Speech + claimed identity => Boolean
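The difference between the two tasks shows up in their signatures: recognition returns an identity, verification returns a Boolean. The sketch below assumes, purely for illustration, that each utterance has already been reduced to a small feature vector and that speakers are compared by cosine similarity with a fixed threshold; the enrolled speakers, vectors and threshold are invented.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical enrolled speakers with made-up voice feature vectors.
ENROLLED = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.3]}


def recognise_speaker(voice: list[float]) -> str:
    """Speech => person identity (closest enrolled speaker)."""
    return max(ENROLLED, key=lambda name: cosine(voice, ENROLLED[name]))


def verify_speaker(voice: list[float], claimed: str,
                   threshold: float = 0.8) -> bool:
    """Speech + claimed identity => Boolean."""
    return claimed in ENROLLED and cosine(voice, ENROLLED[claimed]) >= threshold


sample = [0.8, 0.2, 0.2]
print(recognise_speaker(sample))
print(verify_speaker(sample, "alice"), verify_speaker(sample, "bob"))
```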
23
Speech Technology Subfields
• Speech Compression
– Reduction of the size of speech representations (speech encoding), or
– Time-compression of speech representations (so that they sound faster to the listener)
24
Related fields
• Speech often used in dialogues
– Study of spoken dialogues (human-human, human-machine)
• Speech often combined with other modalities
– Study of Multimodal Interaction
• Speech part of a man-machine interface
– Study of Human-Machine Interaction
25
Introduction
• Three decades:
– “80s” = 1980-1994
– “90s” = 1990-2005
– “00s” = 2000-2011
26
Overview
• 80s: Language Technology
• 80s: Speech Technology
• 90s Language and Speech Technology
• 90s Commercial Activity
• 90s Importance of Data
• 00s Language and Speech Technology
27
80s: Language Technology
• Focus on MT (in Europe)
– Eurotra (Europe)
– Rosetta (Philips, Netherlands)
– Distributed Translation (BSO, Netherlands)
28
80s: Language Technology
• Linguistic “Research Approach”
• Focus on Research
– not/less on Technology Development
• Knowledge-based approach
– hand-crafted lexicons and rules
– based on a theory / grammatical formalism
• Focus on linguistically interesting complex phenomena
– less on phenomena that occur often
– not strongly data-driven
29
80s: Language Technology
• Focus on an idealized language
– not on actual language use
– no focus on robustness
• Computational approach seen (in research) as a way to gain insight into language, grammar and grammar formalisms
– no focus on developing a working system
– no pragmatic solutions
30
80s: Language Technology
• Little formal (quantitative) evaluation
– only with test suites
• constructed sentences illustrating linguistic phenomena
• E.g. the HP Test Suite (Flickinger et al. 1987)
• computational linguistics rather than language technology
31
80s: Language Technology
Major Problems (from a technology point of view):
• Ambiguity
– Real
– Temporary
• Computational Complexity
– computation-intensive grammar formalisms
• Complexity of language
– handcrafting lexicons and rules
• requires linguistic and computational expertise
• requires a lot of effort and time
32
80s: Language Technology
• Major problems (cont.):
• Idealized language vs. actual language use
• Large and rich lexicons, suited to the application domain, are required: building them and tuning (adapting) them to specific domains takes a large effort
33
80s: Speech Technology
• Automatic Speech Recognition (ASR)
• Statistical “Engineering Approach”
• approach based on Noisy Channel Model
• derive acoustic models from a lot of annotated speech examples
• derive statistical language models from large text corpora (n-gram probabilities)
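As an illustration of the last point, a bigram language model can be estimated from a text corpus by relative frequency. This is a minimal sketch with a three-sentence toy corpus and no smoothing; the function name and the sentence-boundary markers `<s>` and `</s>` are assumptions, and real systems use far larger corpora and smoothed estimates.

```python
from collections import Counter


def train_bigram_model(sentences: list[str]):
    """Return a function prob(prev, word) estimating P(word | prev)."""
    history_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(tokens[:-1])           # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(prev: str, word: str) -> float:
        if history_counts[prev] == 0:
            return 0.0                                # unseen history
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


corpus = ["the man bought a present",
          "the man bought a book",
          "the woman bought a present"]
prob = train_bigram_model(corpus)
print(prob("the", "man"))     # 2 of the 3 occurrences of "the"
print(prob("bought", "a"))
```

In a recognizer built on the noisy channel model, such language-model probabilities are combined with acoustic-model scores to rank candidate word sequences.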
34
80s: Speech Technology
• Focus on making (small) working systems
• Statistical approach: system uses probabilities derived from data
• Focus initially on limited, “simple” tasks (e.g. digit recognition), and increasingly on more complex tasks
35
80s: Speech Technology
• Focus on real language use under realistic conditions
• Progress made by making concrete systems and evaluating them rigorously
36
90s: Language Technology
• Statistical MT
– derive language models from monolingual corpora (probabilities of word (sequence)s)
– align “sentences” with their translations
– derive translation model from parallel corpora:
• estimate translation probabilities for words and word sequences from the aligned “sentences”
• use these probabilities to compute translations for new “sentences”
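The way the two models are combined is usually written in noisy-channel form. The symbols are not from the slides: f stands for the source sentence, e for a candidate translation, and \hat{e} for the translation that is chosen.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

Here P(e) is the language model estimated from monolingual corpora and P(f | e) is the translation model estimated from the aligned parallel corpora; P(f) can be dropped because it does not depend on e.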
37
90s: Language Technology
• Ambiguity: resolved by probabilities based on statistics
• Computational Complexity
– computationally feasible formalisms
– proven in speech recognition
• Complexity of language
– language and translation model automatically derived from data
• Strong focus on actual language use
– Highly data driven
• Lexicons can be simpler and are derived automatically from the data; adaptation to specific domains easy once the data are available
38
90s: Language Technology
• Rise of Internet
• increasing need for information retrieval
• approximated by search for word and word sequence strings
• Information Retrieval
– strongly statistically based
– Limited linguistics
– formal evaluation (recall, precision, F-score)
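The three evaluation measures can be computed from the set of retrieved documents and the set of relevant documents. A short sketch, assuming the balanced F-score (F1) and using made-up document identifiers:

```python
def precision_recall_f1(retrieved: set, relevant: set) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one query."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four documents retrieved, three relevant, two in common.
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5})
print(p, r, f)
```

Precision is the share of retrieved documents that are relevant, recall is the share of relevant documents that were retrieved, and F1 is their harmonic mean.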
39
90s: Language Technology
• Resulted in
– strongly data-driven approach in language technology
– increasing use of machine learning techniques
– explicit focus on formal, esp. quantitative, evaluation
– re-examination of simpler/computationally less intensive formalisms (finite-state) for syntax
40
90s: Speech Technology
• Continued working under the established paradigm
• increasingly improving performance and extending environments and application areas
41
90s: Companies
• many companies active in Speech technology
– IBM, Microsoft, Siemens, Nokia, Philips, Motorola, Matra Nortel, Nortel, ...
– Dragon, Kurzweil, Lernout & Hauspie, SpeechWorks, Nuance, Babel, Loquendo, Rhetorical, Vocalis, Telisma, Elan, ...
42
90s: Companies
• many companies in Language technology
– IBM, Microsoft, INSO, Novell, ...
– GMS, Apptek, Globalink, Lernout & Hauspie, Systran, LANT (Xplanation), ...
43
90s: Companies
• MT systems:
– knowledge-based systems,
– developed under an engineering approach
• simple grammatical formalism, or pruning of the search space
– to reduce ambiguity
– to reduce computational resource requirements
– to reduce hand-crafting of rules
44
90s: Companies
• resulted in low quality MT systems
– still useful in many circumstances
• Differentiating factors
– rapid adaptation to (multi-word) terms / vocabulary of new domain
– good performance on named entity recognition
45
90s: Data
• The knowledge-based NLP community realized that cooperation on lexicons was required
• ASR methodology requires a lot of data:
– “There is no data like more data”
• This led to
– Data creation projects
– Set-up of data distribution centers
– Projects for developing standards for data
46
90s: Data
• Projects
– Lexicon projects
• Multilex
• Genelex
• Acquilex
• Parole
• WordNet, EuroWordNet
– SpeechDat projects
• SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON, Orientel
– National / Local projects
• Spoken Dutch Corpus (Netherlands and Flanders)
47
90s: Data
• Data distribution centers are set up
– LDC (1993)
– ELRA (1995)
• Standards:
– TEI for text corpora
• CES, XCES
– Eagles, ISLE for grammatical properties
48
Automating Data Production
• Usually existing (imperfect) tools are used to create data (semi-)automatically
– G2P for creating phonetic dictionaries
– PoS-tagging for PoS-tagged text corpora
– Parsers for treebanks
• For bootstrapping annotations
– Faster and more consistent results
• Followed by (partial) manual correction
49
00s
• Early 00s
– Many data and research initiatives, nationally
– Netherlands
• IMIX 2001-2008
• STEVIN 2004-2011
• TST-Centrale (HLT Agency) 2005-..
– France
• EVALDA
• Technolangue
50
00s
• Early 00s
– International
• TREC
• CLEF
• TC-STAR 2004-2007
• EuroMatrix 2006-2009
• EUROMATRIXPlus 2009-2012
• ECESS
• PASCAL / PASCAL2
• ACE
51
00s
• Early 00s
– International
• TAC US
• DUC US
• GALE US
• NTCIR Japan
• RTE
• SemEval
• SensEval
52
00s
• Companies offer services via the internet and via mobile (smart) phones
– Search: Google, Bing, Yahoo!, etc.
– Social networks: Facebook, LinkedIn, YouTube
– Cloud Computing: Amazon, Google, Salesforce
• Companies gain access to huge amounts of data (text, pictures, movies, etc.), including user behavior
54
00s
• Data are used
– to improve existing services
– to create new services
– to personalize services and advertisements
55
00s
• New Services relevant for LST
– Google: Translation, search by voice, open platform for mobile devices (Android)
– Amazon: Mechanical Turk
• Allows large scale distribution of work, e.g. on manual annotation of language resources
– Apple: several iPhone Apps
• Dragon Dictate (for SMS, e-mail)
• Jibbigo
– ReCaptcha: transcription of (hand-written) documents (now part of Google)
56
Current Status
• Language and Speech Technology in 2011:
– Exciting area!
• A lot of commercial activity, and expanding
• A large and active research community
• A lot of interesting topics are open for research
57
Commercial Activity
• many companies in Language technology
– Google, Yahoo!, IBM, Microsoft, ...
– Apptek, Linguatec, Systran, Knowledge Concepts, Q-go, ...
• applications
– MT, content management, information retrieval, dealing with customer questions, sentiment and opinion mining, ...
58
Commercial Activity
• many companies in Speech technology
– Google, IBM, Microsoft, Motorola, Nokia, ...
– Nuance, Loquendo, Acapela, SVOX, Telisma, ...
• even more in application development and system integration
59
Commercial Activity
• applications
– Network IVR applications (call centers, banking, information services, ...)
– Embedded applications
• in-car applications, e.g. voice activated dialing, navigation (voice destination entry)
• mobile phone/PDA applications
– multimodal output e.g. for navigation
– command and control
– (SMS) dictation coming soon
60
Commercial Activity
• applications
– Office Applications
• Dictation, horizontal and vertical (medical, legal)
• Language learning
– Audiomining
• information retrieval from recorded speech (possibly incl. other modalities): Radio/TV-broadcasts, parliamentary sessions, ...
61
Research Topics?
• Speech Technology (Recognition)
– new paradigms?
• cf. FLaVoR project http://www.esat.kuleuven.be/psi/spraak/projects/FLaVoR/
– Combination with other modalities
• AMI http://www.amiproject.org
• CHIL http://chil.server.de/servlet/is/101/
• IMIX (Interactive Multimodal Information eXtraction)
62
Research Topics?
• Speech Technology (Recognition)
– robustness against noise and other speakers
• increasing use in car and in public places on PDAs and mobile phones
• MIDAS project
– pronunciation of names
• Autonomata I and TOO (incl. Nuance, Ghent, Nijmegen and Utrecht)
63
Research Topics?
• Speech technology (Text-to-Speech)
– better control over prosody in corpus-based TTS?
– Combination with other modalities
64
Research Topics?
• Language Technology
– Semantic Lexical databases created
– WordNet and EuroWordNet
– Cornetto
65
Research Topics?
• Language Technology
– Focus now on Semantic Annotation of Corpora
• OntoNotes http://www.isi.edu/natural-language/people/hovy/papers/06HLT-NAACL-OntoNotes-short.pdf
• STEVIN D-COI and SONAR
• DutchSemCor
– How to use this semantic annotation in practical systems?
66
Research Topics?
• Language Technology
– (Semi-)automatic lexicon creation/adaptation
– Sophisticated information retrieval
• Information extraction, summarization and merging, opinion and sentiment mining
67
Research Topics?
• Language and Speech Technology
– Speech to Speech Translation
• TC-STAR http://www.tc-star.org/
68
Research Topics?
• Dutch-Flemish STEVIN programme
– running from 2004-2011
– 11.4M€ budget
• resources
• research
• applications
• demonstration projects
– Most projects finished
– some projects are still running
– http://www.taalunieversum.nl/stevin
69
CLARIN
• aims to design, construct, validate, and exploit
– a research infrastructure that is needed to provide a sustainable and persistent eScience working environment
– for researchers in the Social Sciences & Humanities
– who want to make use of language data and tools
70
CLARIN
• Make data and tools at different locations easily accessible
– via web interfaces and services
– CLARIN portal(s) with intelligent searching, browsing, viewing and querying services
• make it possible for non-technical researchers to extract / combine / enrich data (supported by dissemination and training)
71
CLARIN
• Will make available interoperable data and tools based on existing standards and best practices
– Formal interoperability and
– Semantic interoperability
72
CLARIN
• For researchers that work with language data and tools
– Humanities and Social Sciences
• Linguistics (broadly construed)
• Literary and Theatrical Studies
• Media and Culture
• History
• Political Sciences
• …
73
CLARIN
• Preparatory Project (CLARIN-prep)
– Funded by EU
– 2008-2011
– >33 partners from >23 countries
– Goals
• Get commitments from EU countries to contribute to the CLARIN infrastructure after CLARIN-prep
• Investigate needs, requirements
• Make initial specification (and prototype implementations)
74
CLARIN
• Current Status
– Most countries in the process
– CLARIN infrastructure to start in mid 2011
– Netherlands committed and has leading role
• CLARIN-NL
– Funded by NWO
– 2009-2015
– Many subprojects running
– Focus on Humanities
75
This week’s Programme
• Tuesday: Parsing
• Wednesday: Machine Learning
• Thursday: Speech Recognition
– Guest lecturer: Arjan van Hessen
• Friday: Machine Translation
76
Thanks for Your Attention!
77
References
• Flickinger D., Nerbonne J., Sag I., Wasow T., "Toward Evaluation of NLP Systems", Hewlett-Packard Laboratories, Palo Alto, CA, 1987.
78