Language and Speech Technology: Introduction Jan Odijk January 2011 LOT Winter School 2011 1.

Language and Speech Technology: Introduction

Jan Odijk

January 2011

LOT Winter School 2011

1

Overview

• What is language and speech technology (LST)? (3-7)

• Major Subfields of LST (8-25)

• Characterization of the last 30 years (26-27)
  – 80s (28-36), 90s (37-49), 00s (50-56)
  – Current Status (57-69)

• CLARIN infrastructure (70-75)

• This week’s programme (76)

2

Language Technology

• Language Technology is the study of computational systems that process natural language

• Alternative names:
  – Human Language Technology (HLT)
  – Natural Language Processing (NLP)

3

Speech Technology

• Speech Technology is the study of computational systems that process speech

• Is a part of Language Technology

• Often the term “Language Technology” is reserved for the study of computational systems that process written language

4

Computational Linguistics

• Computational Linguistics (CL) is the study of language from a computational perspective

• Often used interchangeably with language technology

• Often grouped under Artificial Intelligence (AI), although CL predates AI
  – AI: the study and design of intelligent systems

5

http://www.aclweb.org/archive/misc/what.html

http://en.wikipedia.org/wiki/Artificial_intelligence

Computational Systems

• Computational systems to process natural language do not exist naturally (except in the human brain)
  – They must be designed, implemented, and evaluated
  – Therefore it is a kind of engineering

6

http://en.wikipedia.org/wiki/Engineering

Computational Systems

• LST is NOT the study of the processing of natural language by humans, as in
  – cognition
  – (cognitive) psychology
  – (psycho)linguistics
  – phonetics

7

Language Technology Subfields

• Orthographic processing
  – Text = sequence of characters
  – Tokenization
    • Text => sequence of tokens
    • Token = occurrence of a word form
    • Relatively simple for languages that use punctuation and spacing (space, dot, comma, etc.) to separate tokens
    • More difficult for languages such as Chinese and Thai, which do not mark word boundaries with spaces
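The tokenization step above can be sketched for a space-and-punctuation language as follows. This is a minimal illustration, not a production tokenizer: the pattern `TOKEN_PATTERN` and the function name `tokenize` are assumptions made for this example, and the approach would not work for languages such as Chinese or Thai.

```python
import re

# Hypothetical pattern: a token is either a run of word characters
# (optionally with internal apostrophes or hyphens) or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")


def tokenize(text):
    """Split a text (sequence of characters) into a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("The man bought a present, didn't he?"))
```

Punctuation marks become tokens of their own, while whitespace only separates tokens and is discarded.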

8


• Orthographic processing
  – Orthographic normalization
  – Token => (token, normalized token)
  – Normalized token = canonical orthographic representation for a set of orthographic variants
  – Examples:
    • Contemporary spelling variants: aktie => actie
    • Older spelling variants: vleesch => vlees
    • Typos: actei => actie
    • OCR errors: raarn => raam
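As a rough sketch, normalization can be pictured as a lookup from variant to canonical form. The table below contains only the four examples from this slide; the names `VARIANTS` and `normalize` are assumptions for illustration, and real systems typically combine such lists with rules or statistical models.

```python
# Variant => canonical form, taken from the slide's examples only.
VARIANTS = {
    "aktie": "actie",    # contemporary spelling variant
    "vleesch": "vlees",  # older spelling variant
    "actei": "actie",    # typo
    "raarn": "raam",     # OCR error
}


def normalize(token):
    """Return (token, normalized token); unknown tokens map to themselves."""
    return (token, VARIANTS.get(token.lower(), token))


print(normalize("vleesch"))
print(normalize("raam"))
```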

9


• Morphological processing
  – Lemmatization: token => (token, lemma)
    • Lemma = canonical orthographic representation for an inflectional paradigm
    • Ambiguities occur often
    • Examples
      – lemma(walked) = walk; lemma(men) = man
      – lemma(graven) = {graf, graaf, graven} (Dutch)
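Because lemmatization is often ambiguous, a lemmatizer is naturally modelled as returning a set of candidate lemmas. The sketch below uses only the examples from this slide; the dictionary `LEMMAS`, the function `lemmatize`, and the fallback for unknown tokens are assumptions for illustration.

```python
# Word form => set of possible lemmas (slide examples only).
LEMMAS = {
    "walked": {"walk"},
    "men": {"man"},
    "graven": {"graf", "graaf", "graven"},  # Dutch, three-way ambiguous
}


def lemmatize(token):
    """Return (token, set of candidate lemmas).

    Assumption: an unknown token is treated as its own lemma.
    """
    form = token.lower()
    return (token, LEMMAS.get(form, {form}))


print(lemmatize("walked"))
print(lemmatize("graven"))
```

Choosing among the candidates requires context, which is what tagging provides.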

10


• Morphological processing
  – Inflection analysis/generation
    • Word form <=> (lemma, inflectional features)
    • Examples:
      – graven <=> (graf, PoS=Noun, number=plural)
      – graven <=> (graaf, PoS=Noun, number=plural)
      – graven <=> (graven, PoS=Verb, form=infinitive)
      – graven <=> (graven, PoS=Verb, form=indicative, tense=present, number=plural)

11


• Morphological processing
  – Compound processing
  – word form => ((word form, affix?)+, word form)
  – lemma => ((word form, affix?)+, lemma)
  – Examples:
    • Vleeskoeienhouders => ([vlees, koeien], houders) ‘meat cow farmers’
    • gebiedsbepaling => ([(gebied, s)], bepaling)

12


• Morphological processing
  – Derivational morphology processing
  – word form => (prefix*, lemma, suffix*)
  – Example:
    • Characterization => ([], characterize, [ation])

13


• (PoS-)tagging
  – Assignment of a grammatical tag to a token in context (tag = label for grammatical properties)
  – Token => (token, tag) in context
  – Usually assignment of PoS-tags
  – Often more detailed grammatical (inflectional) tags

14


• (PoS-)tagging
  – Context, usually:
    • Some words and/or tags preceding
    • Some words following
  – Examples:
    • (graven, Zij __ een graf) => Vindprespl
    • (graven, De __ zijn boos) => Npl
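To make "tag in context" concrete, here is a deliberately tiny sketch that disambiguates the two slide examples by looking only at the preceding word. The cue table `PREVIOUS_WORD_CUES` and the function `tag_in_context` are hypothetical; real taggers learn such preferences from annotated corpora rather than from a hand-written table.

```python
# Candidate tags per word form (tags as used on the slide).
CANDIDATE_TAGS = {"graven": ["Vindprespl", "Npl"]}

# Hypothetical cues: a preceding pronoun suggests a verb (V...),
# a preceding determiner suggests a noun (N...).
PREVIOUS_WORD_CUES = {"zij": "V", "wij": "V", "de": "N", "het": "N"}


def tag_in_context(tokens, index):
    """Pick a tag for tokens[index] using the preceding word as context."""
    candidates = CANDIDATE_TAGS.get(tokens[index].lower(), [])
    if not candidates:
        return None  # word form not in the toy lexicon
    if index > 0:
        cue = PREVIOUS_WORD_CUES.get(tokens[index - 1].lower())
        for tag in candidates:
            if cue and tag.startswith(cue):
                return tag
    return candidates[0]  # no usable context: fall back to the first candidate


print(tag_in_context("Zij graven een graf".split(), 1))
print(tag_in_context("De graven zijn boos".split(), 1))
```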

15


• Chunking
  – Identifying major phrases in a sentence
  – Example
    • The man bought a present for his wife =>
    • [NP The man] bought [NP a present] [PP for his wife]
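A chunker of this kind can be sketched over PoS-tagged input with two simple patterns: an NP is any number of determiners/adjectives followed by a noun, and a PP is a preposition followed by an NP. The tag names (`DET`, `ADJ`, `N`, `V`, `P`) and the function `chunk` are assumptions for this example, and the patterns cover only the slide's sentence type.

```python
def chunk(tagged):
    """Bracket simple NPs and PPs in a list of (word, tag) pairs."""
    out = []
    i = 0
    while i < len(tagged):
        # A PP starts with a preposition; the NP part starts right after it.
        np_start = i + 1 if tagged[i][1] == "P" else i
        k = np_start
        while k < len(tagged) and tagged[k][1] in ("DET", "ADJ"):
            k += 1
        if k < len(tagged) and tagged[k][1] == "N":
            label = "PP" if np_start > i else "NP"
            words = " ".join(word for word, _ in tagged[i:k + 1])
            out.append(f"[{label} {words}]")
            i = k + 1
        else:
            out.append(tagged[i][0])  # not part of a chunk
            i += 1
    return " ".join(out)


sentence = [("The", "DET"), ("man", "N"), ("bought", "V"), ("a", "DET"),
            ("present", "N"), ("for", "P"), ("his", "DET"), ("wife", "N")]
print(chunk(sentence))
```

Unlike a parser, the chunker produces flat, non-nested brackets and says nothing about how the phrases relate to each other.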

16


• Parsing
  – Assign a syntactic structure to a sentence
  – Example: The man bought a present for his wife =>

    [S
      [subj/NP The man]
      [pred/VP bought [obj/NP a present]
        [pobj/PP for [obj/NP his wife]]
      ]
    ]

17


• Machine Translation
  – Automatic translation of an input text
  – Example
    • The man bought a present for his wife =>
    • L’homme a acheté un cadeau pour sa femme

18


• Content extraction and processing
  – Named entity recognition
  – Question-answering
  – Information retrieval
  – Information extraction
  – Sentiment / opinion mining
  – Reasoning/Inference on semantic representation
  – …

19

Speech Technology Subfields

• Speech Synthesis
  – Artificial production of human speech
  – Text => speech
  – Often called Text-To-Speech (TTS)
  – TTS system usually contains two components
    • Grapheme to Phoneme (G2P) component
      – Text => symbolic speech representation (phonetic representation)
    • Speech Synthesis component
      – Symbolic speech representation => speech

20


• Speech Synthesis (cont.)
  – The term Speech Synthesis is often reserved for this second component
  – Meaning => speech
    • Usually called Speech Generation, Concept-To-Speech, or Data-to-Speech

21


• Speech Recognition
  – Recognition of human speech
  – Audio containing speech => text
  – Often called automatic speech recognition (ASR)

• Speech Understanding
  – Understanding of human speech
  – Audio containing speech => meaning or action

22


• Speaker Recognition
  – Recognition of a speaker given a speech signal
  – Speech => person identity

• Speaker Verification
  – Verification of the identity of a person
  – Speech + claimed identity => Boolean

23


• Speech Compression
  – Reduction of the size of speech representations (speech encoding), or
  – Time-compression of speech representations (so that they sound faster to the listener)

24

Related fields

• Speech is often used in dialogues
  – Study of spoken dialogues (human-human, human-machine)

• Speech is often combined with other modalities
  – Study of Multimodal Interaction

• Speech is part of a man-machine interface
  – Study of Human-Machine Interaction

25

Introduction

• Three decades:
  – “80s” = 1980-1994
  – “90s” = 1990-2005
  – “00s” = 2000-2011

26

Overview

• 80s: Language Technology

• 80s: Speech Technology

• 90s Language and Speech Technology

• 90s Commercial Activity

• 90s Importance of Data

• 00s Language and Speech Technology

27

80s: Language Technology

• Focus on MT (in Europe)
  – Eurotra (Europe)
  – Rosetta (Philips, Netherlands)
  – Distributed Translation (BSO, Netherlands)

28


• Linguistic “Research Approach”

• Focus on Research
  – not/less on Technology Development

• Knowledge-based approach
  – hand-crafted lexicons and rules
  – based on a theory / grammatical formalism

• Focus on linguistically interesting complex phenomena
  – less on phenomena that occur often
  – not strongly data-driven

29


• Focus on an idealized language
  – not on actual language use
  – no focus on robustness

• Computational approach seen (in research) as a way to gain insight into language, grammar and grammar formalisms
  – no focus on developing a working system
  – no pragmatic solutions

30


• Little formal (quantitative) evaluation
  – only with test suites
    • constructed sentences illustrating linguistic phenomena
    • e.g. the HP Test Suite (Flickinger et al. 1987)

• Computational linguistics rather than language technology

31


Major problems (from a technology point of view):

• Ambiguity
  – Real
  – Temporary

• Computational Complexity
  – computation-intensive grammar formalisms

• Complexity of language
  – handcrafting lexicons and rules
    • requires linguistic and computational expertise
    • requires a lot of effort and time

32


• Major problems (cont.):

• Idealized Language v. actual Language Use

• Require large and rich lexicons, suited to the application domain: difficult/ large effort to make them, and to tune (adapt) to specific domains

33

80s: Speech Technology

• Automatic Speech Recognition (ASR)

• Statistical “Engineering Approach”

• approach based on Noisy Channel Model

• derive acoustic models from a lot of annotated speech examples

• derive statistical language models from large text corpora (n-gram probabilities)
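Deriving n-gram probabilities from text can be sketched as plain counting. The example below estimates bigram probabilities by maximum likelihood from a two-sentence toy corpus invented for illustration; the function name `bigram_probabilities` and the boundary markers `<s>` and `</s>` are assumptions, and real language models add smoothing for unseen word sequences.

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(next word | word) as count(word, next) / count(word)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])            # every left-hand context
        bigram_counts.update(zip(tokens, tokens[1:])) # adjacent word pairs
    return {
        (first, second): count / context_counts[first]
        for (first, second), count in bigram_counts.items()
    }


corpus = ["the man bought a present", "the man bought a book"]
probabilities = bigram_probabilities(corpus)
print(probabilities[("a", "present")])
print(probabilities[("the", "man")])
```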

34


• Focus on making (small) working systems

• Statistical approach: system uses probabilities derived from data

• Focus initially on limited, “simple” tasks (e.g. digit recognition), and increasingly on more complex tasks

35


• Focus on real language use under realistic conditions

• Progress made by making concrete systems and evaluating them rigorously

36


• Statistical MT
  – derive language models from monolingual corpora (probabilities of words and word sequences)
  – align “sentences” with their translations
  – derive translation model from parallel corpora:
    • estimate translation probabilities for words and word sequences from the aligned “sentences”
    • use these probabilities to compute translations for new “sentences”
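How the two models combine can be sketched in the noisy-channel style mentioned earlier for speech recognition: among candidate translations e of a source f, prefer the one with the highest P(e) × P(f | e). All probabilities below are invented toy numbers, not estimates from any corpus, and the names `LANGUAGE_MODEL`, `TRANSLATION_MODEL` and `best_translation` are assumptions for illustration.

```python
import math

# Toy values, invented for illustration only.
LANGUAGE_MODEL = {"the man": 0.02, "man the": 0.0001}  # P(e)
TRANSLATION_MODEL = {                                  # P(f | e)
    ("l'homme", "the man"): 0.5,
    ("l'homme", "man the"): 0.5,
}


def best_translation(source, candidates):
    """Return the candidate maximizing log P(e) + log P(source | e)."""
    def score(target):
        return (math.log(LANGUAGE_MODEL[target])
                + math.log(TRANSLATION_MODEL[(source, target)]))
    return max(candidates, key=score)


print(best_translation("l'homme", ["man the", "the man"]))
```

Here the translation model cannot distinguish the two word orders, so the language model decides. A real system would also have to search a very large space of candidates instead of scoring a given list.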

37

90s: Language Technology

• Ambiguity: resolved by probabilities based on statistics

• Computational Complexity
  – computationally feasible formalisms
  – proven in speech recognition

• Complexity of language
  – language and translation model automatically derived from data

• Strong focus on actual language use
  – Highly data-driven

• Lexicons can be simpler and are derived automatically from the data; adaptation to specific domains is easy once the data are available

38


• Rise of the Internet

• Increasing need for information retrieval

• Approximated by search for word and word sequence strings

• Information Retrieval
  – strongly statistically based
  – limited linguistics
  – formal evaluation (recall, precision, F-score)
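The three evaluation measures can be computed directly from the set of retrieved items and the set of relevant items. In this sketch the F-score is the balanced F1 (harmonic mean of precision and recall); the function name `precision_recall_f1` and the document identifiers are made up for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a retrieval result."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four documents returned, three documents actually relevant, two in common.
precision, recall, f1 = precision_recall_f1(
    retrieved={"d1", "d2", "d3", "d4"},
    relevant={"d1", "d2", "d5"},
)
print(precision, recall, f1)
```

Precision penalizes returning irrelevant items, recall penalizes missing relevant ones, and the F-score balances the two.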

39


• Resulted in
  – strongly data-driven approach in language technology
  – increasing use of machine learning techniques
  – explicit focus on formal, esp. quantitative, evaluation
  – re-examination of simpler / computationally less intensive formalisms (finite-state) for syntax

40


• Continued working under the established paradigm

• increasingly improving performance and extending environments and application areas

41

90s: Companies

• Many companies active in speech technology
  – IBM, Microsoft, Siemens, Nokia, Philips, Motorola, Matra Nortel, Nortel, ...
  – Dragon, Kurzweil, Lernout & Hauspie, SpeechWorks, Nuance, Babel, Loquendo, Rhetorical, Vocalis, Telisma, Elan, ...

42

90s: Companies

• Many companies in language technology
  – IBM, Microsoft, INSO, Novell, ...
  – GMS, Apptek, Globalink, Lernout & Hauspie, Systran, LANT (Xplanation), ...

43

90s: Companies

• MT systems:
  – knowledge-based systems
  – developed under an engineering approach

• Simple grammatical formalisms, or pruning of the search space
  – to reduce ambiguity
  – to reduce computational resource requirements
  – to reduce hand-crafting of rules

44

90s: Companies

• Resulted in low-quality MT systems
  – still useful in many circumstances

• Differentiating factors
  – rapid adaptation to (multi-word) terms / vocabulary of a new domain
  – good performance on named entity recognition

45

90s: Data

• In knowledge-based NLP it was realized that cooperation on lexicons was required

• ASR methodology requires a lot of data:
  – “There is no data like more data”

• This led to
  – Data creation projects
  – Set-up of data distribution centers
  – Projects for developing standards for data

46

90s: Data

• Projects
  – Lexicon projects
    • Multilex
    • Genelex
    • Acquilex
    • Parole
    • WordNet, EuroWordNet
  – SpeechDat projects
    • SpeechDat, SpeechDat-Car, SpeechDat-East, SPEECON, Orientel
  – National / Local projects
    • Spoken Dutch Corpus (Netherlands and Flanders)

47

http://perso.orange.fr/laurence.zaysser/llc94.html

http://www.cl.cam.ac.uk/Research/NL/acquilex/

http://www.elda.org/catalogue/en/text/doc/parole.html

http://wordnet.princeton.edu/

http://www.illc.uva.nl/EuroWordNet/

http://www.speechdat.org/

http://lands.let.kun.nl/cgn/

90s: Data

• Data distribution centers are set up
  – LDC (1993)
  – ELRA (1995)

• Standards:
  – TEI for text corpora
    • CES, XCES
  – EAGLES, ISLE for grammatical properties

48

http://www.ldc.upenn.edu/

http://www.elra.info/

http://www.tei-c.org/

http://www.xml-ces.org/

http://www.xces.org/

http://www.ilc.cnr.it/EAGLES96/home.html

http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm

Automating Data Production

• Usually existing (imperfect) tools are used to create data (semi-)automatically
  – G2P for creating phonetic dictionaries
  – PoS-tagging for PoS-tagged text corpora
  – Parsers for treebanks

• For bootstrapping annotations
  – Faster and more consistent results

• Followed by (partial) manual correction

49

00s

• Early 00s
  – Many data and research initiatives, nationally
  – Netherlands
    • IMIX 2001-2008
    • STEVIN 2004-2011
    • TST-Centrale (HLT Agency) 2005-..
  – France
    • EVALDA
    • Technolangue

50

http://www.nwo.nl/IMIX

http://taalunieversum.org/taal/technologie/stevin/

http://www.inl.nl/tst-centrale

http://www.evalda.org/rubrique25.html

http://www.technolangue.net/

00s

• Early 00s
  – International
    • TREC
    • CLEF
    • TC-STAR 2004-2007
    • EuroMatrix 2006-2009
    • EuroMatrixPlus 2009-2012
    • ECESS
    • PASCAL / PASCAL2
    • ACE

51

http://trec.nist.gov/

http://www.clef-campaign.org/

http://www.elda.org/en/proj/tcstar-wp4/

http://www.euromatrix.net/

http://www.euromatrixplus.net/

http://www.ecess.eu/

http://www.pascal-network.org/



http://www.itl.nist.gov/iad/mig/tests/ace/ace07/

00s

• Early 00s
  – International
    • TAC (US)
    • DUC (US)
    • GALE (US)
    • NTCIR (Japan)
    • RTE
    • SemEval
    • SensEval

52

http://www.nist.gov/tac/

http://www.darpa.mil/ipto/programs/gale/gale.asp

http://ntcir.nii.ac.jp/

http://pascallin.ecs.soton.ac.uk/Challenges/RTE3/

http://semeval2.fbk.eu/semeval2.php

http://www.senseval.org/

00s

• More recent projects

• FLaReNet

• META-NET

53

http://www.meta-net.eu/

00s

• Companies offer services via the internet and via mobile (smart) phones
  – Search: Google, Bing, Yahoo!, etc.
  – Social networks: Facebook, LinkedIn, YouTube
  – Cloud computing: Amazon, Google, Salesforce

• Companies gain access to huge amounts of data (text, pictures, movies, etc.), including user behavior

54

00s

• Data are used
  – to improve existing services
  – to create new services
  – to personalize services and advertisements

55

00s

• New services relevant for LST
  – Google: translation, search by voice, open platform for mobile devices (Android)
  – Amazon: Mechanical Turk
    • Allows large-scale distribution of work, e.g. on manual annotation of language resources
  – Apple: several iPhone apps
    • Dragon Dictate (for SMS, e-mail)
    • Jibbigo
  – ReCaptcha: transcription of (hand-written) documents (now part of Google)

56

https://www.mturk.com/mturk/welcome

http://en.wikipedia.org/wiki/ReCAPTCHA

Current Status

• Language and Speech Technology in 2011:– Exciting area!

• A lot of commercial activity, and expanding

• A large and active research community

• A lot of interesting topics are open for research

57

Commercial Activity

• Many companies in language technology
  – Google, Yahoo!, IBM, Microsoft, ...
  – Apptek, Linguatec, Systran, Knowledge Concepts, Q-go, ...

• Applications
  – MT, content management, information retrieval, dealing with customer questions, sentiment and opinion mining, ...

58

Commercial Activity

• Many companies in speech technology
  – Google, IBM, Microsoft, Motorola, Nokia, ...
  – Nuance, Loquendo, Acapela, SVOX, Telisma, ...

• even more in application development and system integration

59

Commercial Activity

• Applications
  – Network IVR applications (call centers, banking, information services, ...)
  – Embedded applications
    • in-car applications, e.g. voice-activated dialing, navigation (voice destination entry)
    • mobile phone/PDA applications
      – multimodal output, e.g. for navigation
      – command and control
      – (SMS) dictation coming soon

60

Commercial Activity

• Applications
  – Office applications
    • Dictation, horizontal and vertical (medical, legal)
    • Language learning
  – Audiomining
    • information retrieval from recorded speech (possibly incl. other modalities): radio/TV broadcasts, parliamentary sessions, ...

61

Research Topics?

• Speech Technology (Recognition)
  – New paradigms?
    • cf. FLaVoR project http://www.esat.kuleuven.be/psi/spraak/projects/FLaVoR/
  – Combination with other modalities
    • AMI http://www.amiproject.org
    • CHIL http://chil.server.de/servlet/is/101/
    • IMIX (Interactive Multimodal Information eXtraction)

62

http://www.esat.kuleuven.be/psi/spraak/projects/FLaVoR/

http://www.amiproject.org/

Research Topics?

• Speech Technology (Recognition)
  – Robustness against noise and other speakers
    • increasing use in cars and in public places on PDAs and mobile phones
    • MIDAS project
  – Pronunciation of names
    • Autonomata I and TOO (incl. Nuance, Ghent, Nijmegen and Utrecht)

63

Research Topics?

• Speech Technology (Text-to-Speech)
  – Better control over prosody in corpus-based TTS?
  – Combination with other modalities

64

Research Topics?

• Language Technology
  – Semantic lexical databases created
    • WordNet and EuroWordNet
    • Cornetto

65

Research Topics?

• Language Technology
  – Focus now on semantic annotation of corpora
    • OntoNotes http://www.isi.edu/natural-language/people/hovy/papers/06HLT-NAACL-OntoNotes-short.pdf
    • STEVIN D-COI and SONAR
    • DutchSemCor
  – How to use this semantic annotation in practical systems?

66

Research Topics?

• Language Technology
  – (Semi-)automatic lexicon creation/adaptation
  – Sophisticated information retrieval
    • Information extraction, summarization and merging, opinion and sentiment mining, ...

67

Research Topics?

• Language and Speech Technology
  – Speech-to-Speech Translation
    • TC-STAR http://www.tc-star.org/

68

Research Topics?

• Dutch-Flemish STEVIN programme
  – running from 2004 to 2011
  – 11.4M€ budget
    • resources
    • research
    • applications
    • demonstration projects
  – most projects finished
  – some projects are still running
  – http://www.taalunieversum.nl/stevin

69

CLARIN

• Aims to design, construct, validate, and exploit
  – a research infrastructure that is needed to provide a sustainable and persistent eScience working environment
  – for researchers in the Social Sciences & Humanities
  – who want to make use of language data and tools

70

CLARIN

• Make data and tools at different locations easily accessible
  – via web interfaces and services
  – CLARIN portal(s) with intelligent searching, browsing, viewing and querying services

• Make it possible for non-technical researchers to extract / combine / enrich data (supported by dissemination and training)

71

CLARIN

• Will make available interoperable data and tools based on existing standards and best practices
  – Formal interoperability and
  – Semantic interoperability

72

CLARIN

• For researchers who work with language data and tools
  – Humanities and Social Sciences
    • Linguistics (broadly construed)
    • Literary and Theatrical Studies
    • Media and Culture
    • History
    • Political Sciences
    • …

73

CLARIN

• Preparatory Project (CLARIN-prep)
  – Funded by EU
  – 2008-2011
  – >33 partners from >23 countries
  – Goals
    • Get commitments from EU countries to contribute to the CLARIN infrastructure after CLARIN-prep
    • Investigate needs, requirements
    • Make initial specification (and prototype implementations)

74

CLARIN

• Current Status
  – Most countries in the process
  – CLARIN infrastructure to start in mid 2011
  – Netherlands committed and has leading role

• CLARIN-NL
  – Funded by NWO
  – 2009-2015
  – Many subprojects running
  – Focus on Humanities

75

This week’s Programme

• Tuesday: Parsing

• Wednesday: Machine Learning

• Thursday: Speech Recognition
  – Guest lecturer: Arjan van Hessen

• Friday: Machine Translation

76

Thanks for Your Attention!

77

References

• Flickinger D., Nerbonne J., Sag I., Wasow T., "Toward Evaluation of NLP Systems", Hewlett-Packard Laboratories, Palo Alto, CA, 1987.

78
