Lecture 1, Part I
Albert Gatt
Corpora and Statistical Methods
Slide 2
Tutorial
CSA5011 -- Corpora and Statistical Methods
Next Monday at 11:00.
This will take the form of a discussion of the following paper:
  Jurafsky, D. (2003). Probabilistic knowledge in psycholinguistics. (Available from course web page)
Slide 3
Course goals
Introduce the field of statistical natural language processing (statistical NLP).
Describe the main directions, problems, and algorithms in the field.
Discuss the theoretical foundations.
Involve students in hands-on experiments with real problems.
Slide 4
A general introduction
Slide 5
Language
We can define a language formally as:
  a set of symbols (alphabet)
  a set of rules to combine those symbols
This mathematical definition covers many classes of languages, not just human language.
Slide 6
Java: An artificial (formal) language
fixed set of basic symbols: public, static, for, while, {, }
fixed syntax for symbol combination:

    // The class name is illustrative; a main method must sit inside a class to compile.
    public class Example {
        public static void main(String[] args) {
            for (int i = 0; i < args.length; i++) {
            }
        }
    }
Slide 7
Natural language
Often much more complicated than an artificial language.
NB: Some theorists view NL as a special kind of formal language as well (Montague).
It does conform to the formal definition:
  there are symbols
  there are modes of combination
However, there are many levels at which these symbols and rules are defined.
Slide 8
Levels of analysis in Natural language (I)
Acoustic properties (phonetics)
  defines a basic set of sounds in terms of their features
  studies the combination of these phonemes
Higher-order acoustic features (phonology)
  how combinations of phonemes combine into larger units, with suprasegmental features such as intonation.
Slide 9
Levels of analysis in Natural language (II)
Word formation (morphology)
  combines morphemes into words
Combination into longer units in a structure-dependent way (syntax)
  legal word combinations in a language
  recursive phrasal combination
Interpretation (semantics):
  of words (lexical semantics)
  of longer units (sentential/propositional semantics)
Interpretation in context (pragmatics)
Slide 10
Natural Language Processing
Studies language at all its levels: phonology, morphology, syntax, semantics
  focusses on process (Sparck-Jones '07)
  computational methods to understand and generate human language
Often, the distinction between NLP and computational linguistics is fuzzy.
Slide 11
Kindred disciplines: Linguistics
Theoretical linguistics tends to be less process-oriented than NLP
  Q: how can we characterise knowledge that native speakers have of their language?
  this leads to declarative models of speakers' knowledge of language
  tends to say less about how speakers process language in real time
  NB: This depends on the theoretical orientation!
NLP has strong ties to theoretical linguistics
  it has also been an important contributor: process models can serve as tests for declarative models
Slide 12
Kindred disciplines: Psycholinguistics
Like NLP, psycholinguistics tends to be strongly process-oriented
  studies the online processes of language understanding and language production
NLP has benefited from such models.
NLP has also been a contributor: it is increasingly common to test psycholinguistic theories by building computational models.
Slide 13
Paradigms in NLP (I)
Knowledge-based: system is based on a priori rules and constraints
  e.g. a syntactic parser might have hand-crafted rules such as:
    NP → Det AdjP N
    AdjP → A+
Problem: it is extremely difficult to hand-code all the relevant knowledge.
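The knowledge-based approach on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the two rules NP → Det AdjP N and AdjP → A+; the lexicon entries and the function name `is_np` are made up for the example.

```python
# Hand-written lexicon (toy entries, assumed for illustration).
# Every word the system should handle has to be listed by hand.
LEXICON = {
    "the": "Det", "a": "Det",
    "tall": "A", "strange": "A",
    "woman": "N", "boy": "N",
}

def is_np(words):
    """Return True if words match NP -> Det AdjP N, with AdjP -> A+."""
    tags = [LEXICON.get(w.lower()) for w in words]
    if len(tags) < 3:  # need at least Det, one A, and N
        return False
    # First tag Det, last tag N, everything in between an adjective.
    return (tags[0] == "Det"
            and all(t == "A" for t in tags[1:-1])
            and tags[-1] == "N")

print(is_np("the tall strange boy".split()))
print(is_np("the quivering boy".split()))
```

The second call fails only because "quivering" is missing from the hand-written lexicon, which is a small instance of the hand-coding problem the slide mentions.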
Slide 14
Paradigms in NLP (II)
Statistical: starting point is a large repository of text or speech (a corpus)
  corpus is often annotated with relevant information, e.g.:
    parsed corpora (syntax)
    tagged corpora (part-of-speech)
    word-sense annotated corpora (semantics)
  tries to learn a model from the data
  tries to generalise this model to new data
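As a rough illustration of "learn a model from the data, then generalise", the sketch below learns the most frequent tag for each word from a tagged corpus and applies it to new text. The three tagged sentences are invented for the example, and the fallback tag for unseen words is an assumption; real systems use much larger corpora and richer models.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (invented for illustration).
TAGGED = [
    [("the", "DET"), ("woman", "NOUN"), ("thought", "VERB")],
    [("the", "DET"), ("thought", "NOUN"), ("was", "VERB"), ("strange", "ADJ")],
    [("the", "DET"), ("boy", "NOUN"), ("thought", "VERB")],
]

def train(tagged_sentences):
    """Learn, for each word, the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(model, words, default="NOUN"):
    """Tag new text; unseen words get a default tag (an assumed fallback)."""
    return [(w, model.get(w, default)) for w in words]

MODEL = train(TAGGED)
print(tag(MODEL, ["the", "girl", "thought"]))
```

Nothing about English tags is written into the program itself: the model is whatever the counts in the corpus say.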
Slide 15
The paradigms: a bird's-eye view
We find similar divisions within mainstream linguistics:
  generative linguistics tends to formulate generalisations about internalised speaker knowledge of language (competence, I-Language)
  corpus linguistics tends to formulate generalisations based on patterns observed in corpora
The two paradigms are viewed as having roots in different traditions:
  rationalist tradition (Plato, Descartes)
  empiricist tradition (Locke)
Slide 16
The idea of linguistic knowledge
Traditional linguistic theory (since the 1950s) introduced a dichotomy:
  competence: a person's knowledge of language, formalised as a set of rules
  performance: actual production and perception of language in concrete situations
Much of linguistic theory has focused on characterising competence.
Slide 17
The idea of linguistic knowledge
The use of data (corpora) involves an increased focus on performance.
The idea is that exposure to the regularities found in such data is a crucial part of human language learning.
(Evidence for this is our topic for Monday's tutorial!)
Slide 18
An initial example
Suppose you're a linguist interested in the syntax of verb phrases.
Some verbs are transitive, some intransitive
  I ate the meat pie (transitive)
  I swam (intransitive)
What about:
  quiver
  quake
Most traditional grammars characterise these as intransitive
Corpus data suggests they have transitive uses:
  the insect quivered its wings
  it quaked his bowels (with fear)
Slide 19
Example II: lexical semantics
Quasi-synonymous lexical items exhibit subtle differences in context.
  strong
  powerful
A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.
Slide 20
Example II continued
Some differences between strong and powerful (source: British National Corpus):
  strong:   wind, feeling, accent, flavour
  powerful: tool, weapon, punch, engine
The differences are subtle, but examining their collocates helps.
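Collocate lists like the ones above come from counting which words occur next to the target word. A minimal sketch, assuming we only look at the word immediately to the right; the six sentences are invented for the example and are not taken from the British National Corpus.

```python
from collections import Counter

# Toy sentences, invented for illustration (not BNC data).
CORPUS = [
    "a strong wind blew all night",
    "she has a strong accent",
    "a strong wind is forecast",
    "it is a powerful engine",
    "he threw a powerful punch",
    "a powerful engine needs fuel",
]

def right_collocates(target, sentences):
    """Count the words that immediately follow `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for left, right in zip(tokens, tokens[1:]):
            if left == target:
                counts[right] += 1
    return counts

print(right_collocates("strong", CORPUS).most_common())
print(right_collocates("powerful", CORPUS).most_common())
```

On a real corpus one would normally go beyond raw counts and use an association measure, since frequent words co-occur with almost everything.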
Slide 21
Statistical approaches to language
Do not rely on categorical judgements of grammaticality etc.
Examples:
  1. Degrees of grammaticality: people often do not have categorical judgements of acceptability.
  2. Category blending: We live nearer town than you thought. Is near an adjective or a preposition?
  3. Syntactic ambiguity: She killed the man with the gun. What is the most likely parse?
Slide 22
Statistical NLP vs. Corpus Linguistics (I)
Corpus linguistics became popular with the arrival of large, machine-readable corpora.
  generally viewed as a methodology
  tests hypotheses empirically on data
  aim is to refine a theory of language, or discover novel generalisations
Statistical NLP shares these aims; however:
  it is often corpus-driven rather than corpus-based
  the theory or model learned is often not a priori given
Slide 23
Statistical NLP vs. Corpus Linguistics (II)
The term "corpus" may mean different things to different people:
To a corpus linguist, a corpus is a balanced, representative sample of a particular language variety (e.g. The British National Corpus)
  Representativeness allows generalisations to be made more rigorously.
In statistical NLP, there has traditionally been less emphasis on these properties.
  emphasis on algorithms for learning language models
  we frequently find the tacit assumption that the algorithm can be applied to any set of data, given the right annotations
Slide 24
Some applications of Statistical NLP
Slide 25
[Figure: Language Technology diagram with the labels Text, Speech, Meaning, Natural Language Understanding, Natural Language Generation, Speech Recognition, Speech Synthesis, Machine translation]
Slide 26
A (very) rough division of NLP tasks
understanding: typically take as input free text or speech, and conduct some structural or semantic analysis
  POS Tagging, parsing, semantic role labelling, sentiment/opinion mining, named entity recognition
generation: typically take textual or non-linguistic input, outputting some text/speech
  automatic weather reporting, summarisation, machine translation
How effective are statistical NLP tools to carry out these and other tasks?
Are statistical techniques actually useful to learn things about language?
Slide 27
Example 1: Semantics
Example of an automatically acquired thesaurus of similar words. Target word: goat
  sheep     0.359
  cow       0.345
  pig       0.331
  rabbit    0.305
  cattle    0.304
  deer      0.289
  lamb      0.286
  donkey    0.276
  poultry   0.262
  boar      0.261
  camel     0.259
  elephant  0.258
  calf      0.258
  pony      0.255
Data: 1.5 bn words obtained from the web. (www.sketchengine.co.uk)
How does this work?
Slide 28
Example 1: Semantics (cont/d)
Corpus-based lexical semantic acquisition typically uses vector-space models.
  represent a word as a vector containing information about the contexts in which it is likely to occur
  some models also include grammatical relations (subject-of, object-of etc)
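The vector-space idea can be sketched as follows: each word is represented by counts of the words that appear near it, and two words are compared by the cosine of the angle between their vectors. The corpus is invented for the example, and the window size of two words and the use of raw counts are assumptions; real systems typically weight the counts and may use grammatical relations instead of a plain window.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
CORPUS = [
    "the farmer milked the goat",
    "the farmer milked the cow",
    "the farmer fed the goat",
    "the farmer fed the cow",
    "the mechanic fixed the engine",
]

def context_vector(target, sentences, window=2):
    """Count words occurring within `window` tokens of each use of `target`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

goat = context_vector("goat", CORPUS)
cow = context_vector("cow", CORPUS)
engine = context_vector("engine", CORPUS)
print(cosine(goat, cow), cosine(goat, engine))
```

In this toy data "goat" and "cow" occur in identical contexts, so they come out as more similar to each other than "goat" is to "engine".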
Slide 29
Example 2: POS Tagging
The tall woman and the strange boy thought statistical NLP was pointless.
[Figure: the same sentence with a POS tag on each word]
Output from a statistical POS Tagger, trained on the Brown Corpus (LingPipe demo library)
Uses of POS Tagging:
  pre-parsing
  corpus analysis for linguistics
Slide 30
Example 3: parsing
Parsed using the Stanford Parser.
  Based on probabilistic context-free grammar of English
  trained on a treebank
  CFG rules with probabilities
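To make "CFG rules with probabilities" concrete: in a probabilistic context-free grammar, the probability of a parse is the product of the probabilities of the rules used to build it. The sketch below scores the two readings of "She killed the man with the gun" from the earlier slide. The rule probabilities are invented for the example, not estimated from a treebank, and word-level (lexical) probabilities are ignored.

```python
from math import prod

# Toy PCFG. Probabilities are invented for illustration;
# for each left-hand side they sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("NP", ("Pro",)): 0.2,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("Det", "N", "PP")): 0.2,
    ("PP", ("P", "NP")): 1.0,
}

def tree_prob(tree):
    """Multiply the probabilities of all phrase-structure rules in a tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return 1.0  # tag -> word step; lexical probabilities ignored here
    rhs = tuple(child[0] for child in children)
    return RULES[(label, rhs)] * prod(tree_prob(c) for c in children)

she = ("NP", ("Pro", "she"))
with_gun = ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "gun")))

# Reading 1: the PP attaches to the verb (the gun is the instrument).
vp_attach = ("S", she, ("VP", ("V", "killed"),
                        ("NP", ("Det", "the"), ("N", "man")), with_gun))
# Reading 2: the PP attaches to the noun (the man has the gun).
np_attach = ("S", she, ("VP", ("V", "killed"),
                        ("NP", ("Det", "the"), ("N", "man"), with_gun)))

P_VP = tree_prob(vp_attach)
P_NP = tree_prob(np_attach)
print(P_VP, P_NP)
```

Which reading wins depends entirely on the probabilities, which is why they are estimated from a treebank in practice rather than set by hand.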
Slide 31
Example 4: Machine translation
Input: (Maltese translation of example sentence)
Output: The wife and son long strange nonetheless feels that the statistical NLP is without purpose.
Translated using Maltese-English Google Translate.
  Obvious shortcomings, but robust, i.e. some output returned, even if garbled.
  Based on automatic alignment between parallel text corpora.
Slide 32
Example 5: Generation/Summarisation
[...] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified [...]
Automatically generated article about 3-M syndrome (Sauper and Barzilay 2009)
  Now on Wikipedia!!! (http://en.wikipedia.org/wiki/3-M_syndrome)
  Summarised from multiple documents drawn from the web.
  Uses automatically acquired templates from human-authored texts to ensure coherence.
Slide 33
Features of Statistical NLP systems
Robustness: typically, don't break down with new or unknown input
Portability: statistical learning algorithms can in principle be ported to new domains (given data)
Sensitivity to training data: if (say) a POS tagger is trained on medical text, its performance will decline on a new genre (e.g. news).
Slide 34
Some important concepts
All the systems surveyed rely on regularities in large repositories of training data, expressed as probabilities.
In practice, we distinguish between:
  training/development data: for learning a model and fine-tuning
  test data: for evaluation on unseen but compatible data
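The distinction between training, development and test data can be sketched as a simple split of a corpus. The function name `split_corpus`, the 80/10/10 proportions and the fixed random seed are assumptions made for the example; many published corpora come with a standard split that should be used instead.

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a corpus and split it into training, development and test parts."""
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)
    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test

# Stand-in corpus: 100 numbered "sentences".
train, dev, test = split_corpus(range(100))
print(len(train), len(dev), len(test))
```

The model is learned on the training part, tuned on the development part, and evaluated once on the test part, which it has never seen.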
Slide 35
References
Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33 (3): 437-441.
McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches)