Lecture 1, Part I Albert Gatt Corpora and Statistical Methods

Tutorial
CSA5011 -- Corpora and Statistical Methods
Next Monday at 11:00. This will take the form of a discussion of the following paper:
Jurafsky, D. (2003). Probabilistic knowledge in psycholinguistics. (Available from the course web page.)

Course goals
- Introduce the field of statistical natural language processing (statistical NLP).
- Describe the main directions, problems, and algorithms in the field.
- Discuss the theoretical foundations.
- Involve students in hands-on experiments with real problems.

A general introduction

Language
We can define a language formally as:
- a set of symbols (alphabet)
- a set of rules to combine those symbols
This mathematical definition covers many classes of languages, not just human language.
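
The formal definition can be made concrete with a small sketch. The toy language below is an assumption chosen purely for illustration (it does not come from the lecture): its alphabet is {a, b} and its single combination rule is S -> a b | a S b, so its strings are n copies of "a" followed by n copies of "b".

```python
# Hypothetical toy language, for illustration only.
ALPHABET = {"a", "b"}


def in_language(string: str) -> bool:
    """Return True if `string` is n 'a's followed by n 'b's, with n >= 1."""
    # Every symbol must come from the alphabet, and the string must be non-empty.
    if not string or any(symbol not in ALPHABET for symbol in string):
        return False
    # Combination rule S -> a b | a S b: equal halves of a's and b's.
    half, remainder = divmod(len(string), 2)
    return remainder == 0 and string == "a" * half + "b" * half


if __name__ == "__main__":
    for candidate in ["ab", "aabb", "aab", "ba", "abc"]:
        print(candidate, in_language(candidate))
```

Membership is decided by the alphabet and the rule alone, which is what makes the definition apply equally to programming languages and, on some views, to natural language.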

Java: An artificial (formal) language
- fixed set of basic symbols: public, static, for, while, {, }
- fixed syntax for symbol combination:
    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
        }
    }

Natural language
- Often much more complicated than an artificial language.
- NB: Some theorists view NL as a special kind of formal language as well (Montague).
- It does conform to the formal definition:
  - there are symbols
  - there are modes of combination
- However, there are many levels at which these symbols and rules are defined.

Levels of analysis in Natural language (I)
- Acoustic properties (phonetics)
  - defines a basic set of sounds in terms of their features
  - studies the combination of these phonemes
- Higher-order acoustic features (phonology)
  - how phonemes combine into larger units, with suprasegmental features such as intonation

Levels of analysis in Natural language (II)
- Word formation (morphology)
  - combines morphemes into words
- Combination into longer units in a structure-dependent way (syntax)
  - legal word combinations in a language
  - recursive phrasal combination
- Interpretation (semantics):
  - of words (lexical semantics)
  - of longer units (sentential/propositional semantics)
- Interpretation in context (pragmatics)

Natural Language Processing
- Studies language at all its levels: phonology, morphology, syntax, semantics
- focusses on process (Sparck-Jones '07)
- computational methods to understand and generate human language
- Often, the distinction between NLP and computational linguistics is fuzzy.

Kindred disciplines: Linguistics
- Theoretical linguistics tends to be less process-oriented than NLP
  - Q: how can we characterise the knowledge that native speakers have of their language?
  - this leads to declarative models of speakers' knowledge of language
  - tends to say less about how speakers process language in real time
  - NB: This depends on the theoretical orientation!
- NLP has strong ties to theoretical linguistics
  - it has also been an important contributor: process models can serve as tests for declarative models

Kindred disciplines: Psycholinguistics
- Like NLP, psycholinguistics tends to be strongly process-oriented
  - studies the online processes of language understanding and language production
- NLP has benefited from such models.
- NLP has also been a contributor: it is increasingly common to test psycholinguistic theories by building computational models.

Paradigms in NLP (I)
Knowledge-based:
- system is based on a priori rules and constraints
- e.g. a syntactic parser might have hand-crafted rules such as:
  - NP → Det AdjP N
  - AdjP → A+
- Problem: it is extremely difficult to hand-code all the relevant knowledge.
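
As a rough sketch of the knowledge-based approach, the recogniser below hard-codes the two rules from this slide. The lexicon and the function name `is_np` are assumptions made up for illustration; a real hand-crafted parser would need far more rules and entries, which is exactly the problem the slide points out.

```python
# Hypothetical hand-written lexicon: word -> category.
LEXICON = {
    "the": "Det", "a": "Det",
    "tall": "A", "strange": "A",
    "woman": "N", "boy": "N",
}


def is_np(words):
    """Recognise NP -> Det AdjP N, where AdjP -> A+ (one or more adjectives)."""
    # Words missing from the lexicon map to None and can never match a rule.
    categories = [LEXICON.get(word.lower()) for word in words]
    if len(categories) < 3:
        return False  # Det, at least one A, and N are all required
    first, middle, last = categories[0], categories[1:-1], categories[-1]
    return first == "Det" and last == "N" and all(cat == "A" for cat in middle)


if __name__ == "__main__":
    for phrase in ["the tall woman", "the tall strange boy", "the purple boy"]:
        print(phrase, is_np(phrase.split()))
```

The phrase "the purple boy" is rejected only because "purple" was never entered in the lexicon, a small instance of the coverage problem with hand-coded knowledge.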

Paradigms in NLP (II)
Statistical:
- starting point is a large repository of text or speech (a corpus)
- corpus is often annotated with relevant information, e.g.:
  - parsed corpora (syntax)
  - tagged corpora (part-of-speech)
  - word-sense annotated corpora (semantics)
- tries to learn a model from the data
- tries to generalise this model to new data

The paradigms: a bird's-eye view
We find similar divisions within mainstream linguistics:
- generative linguistics tends to formulate generalisations about internalised speaker knowledge of language (competence, I-Language)
- corpus linguistics tends to formulate generalisations based on patterns observed in corpora
The two paradigms are viewed as having roots in different traditions:
- rationalist tradition (Plato, Descartes)
- empiricist tradition (Locke)

The idea of linguistic knowledge
Traditional linguistic theory (since the 1950s) introduced a dichotomy:
- competence: a person's knowledge of language, formalised as a set of rules
- performance: actual production and perception of language in concrete situations
Much of linguistic theory has focused on characterising competence.

The idea of linguistic knowledge
- The use of data (corpora) involves an increased focus on performance.
- The idea is that exposure to the regularities found in such data is a crucial part of human language learning.
- (Evidence for this is our topic for Monday's tutorial!)

An initial example
Suppose you're a linguist interested in the syntax of verb phrases.
- Some verbs are transitive, some intransitive:
  - I ate the meat pie (transitive)
  - I swam (intransitive)
- What about quiver and quake?
  - Corpus data suggests they have transitive uses:
    - the insect quivered its wings
    - it quaked his bowels (with fear)
  - Most traditional grammars characterise these as intransitive.

Example II: lexical semantics
Quasi-synonymous lexical items exhibit subtle differences in context:
- strong
- powerful
A fine-grained theory of lexical semantics would benefit from data about these contextual cues to meaning.

Example II continued
Some differences between strong and powerful (source: British National Corpus):
- strong: wind, feeling, accent, flavour
- powerful: tool, weapon, punch, engine
The differences are subtle, but examining their collocates helps.
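
Collocates of this kind can be extracted by simple counting. The sketch below is a minimal illustration under stated assumptions: the sentences are invented stand-ins for a real corpus such as the BNC, and `right_collocates` is a hypothetical helper that only looks at the word immediately following the target.

```python
from collections import Counter

# Invented toy sentences, standing in for a real corpus.
CORPUS = [
    "a strong wind blew all night",
    "she has a strong accent",
    "a strong wind damaged the roof",
    "it is a powerful engine",
    "he threw a powerful punch",
    "a powerful engine needs fuel",
]


def right_collocates(target, sentences):
    """Count the words that occur immediately after `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair each token with its right-hand neighbour.
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts


if __name__ == "__main__":
    print(right_collocates("strong", CORPUS).most_common())
    print(right_collocates("powerful", CORPUS).most_common())
```

Real collocation studies usually go beyond raw counts and use an association measure over a much larger corpus; raw adjacency counts are used here only to keep the idea visible.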

Statistical approaches to language
Do not rely on categorical judgements of grammaticality etc. Examples:
1. Degrees of grammaticality: people often do not have categorical judgements of acceptability.
2. Category blending: "We live nearer town than you thought." Is "near" an adjective or a preposition?
3. Syntactic ambiguity: "She killed the man with the gun." What is the most likely parse?

Statistical NLP vs. Corpus Linguistics (I)
- Corpus linguistics became popular with the arrival of large, machine-readable corpora.
  - generally viewed as a methodology
  - tests hypotheses empirically on data
  - aim is to refine a theory of language, or discover novel generalisations
- Statistical NLP shares these aims; however:
  - it is often corpus-driven rather than corpus-based
  - the theory or model learned is often not given a priori

Statistical NLP vs. Corpus Linguistics (II)
The term corpus may mean different things to different people:
- To a corpus linguist, a corpus is a balanced, representative sample of a particular language variety (e.g. the British National Corpus).
  - Representativeness allows generalisations to be made more rigorously.
- In statistical NLP, there has traditionally been less emphasis on these properties.
  - emphasis on algorithms for learning language models
  - we frequently find the tacit assumption that the algorithm can be applied to any set of data, given the right annotations

Some applications of Statistical NLP

[Figure: Language Technology, relating Text, Speech and Meaning through Natural Language Understanding, Natural Language Generation, Speech Recognition, Speech Synthesis and Machine translation]

A (very) rough division of NLP tasks
- understanding: typically take as input free text or speech, and conduct some structural or semantic analysis
  - POS tagging, parsing, semantic role labelling, sentiment/opinion mining, named entity recognition
- generation: typically take textual or non-linguistic input, outputting some text/speech
  - automatic weather reporting, summarisation, machine translation
How effective are statistical NLP tools at carrying out these and other tasks?
Are statistical techniques actually useful to learn things about language?

Example 1: Semantics
Example of an automatically acquired thesaurus of similar words, here for the word "goat". Data: 1.5 bn words obtained from the web (www.sketchengine.co.uk).
Word      Similarity
sheep     0.359
cow       0.345
pig       0.331
rabbit    0.305
cattle    0.304
deer      0.289
lamb      0.286
donkey    0.276
poultry   0.262
boar      0.261
camel     0.259
elephant  0.258
calf      0.258
pony      0.255
How does this work?

Example 1: Semantics (cont/d)
Corpus-based lexical semantic acquisition typically uses vector-space models.
- represent a word as a vector containing information about the contexts in which it is likely to occur
- some models also include grammatical relations (subject-of, object-of etc.)
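
A minimal sketch of a vector-space model is given below. It rests on several assumptions that are not from the lecture: the three-sentence corpus is invented, contexts are raw co-occurrence counts within a window of two words, and similarity is measured by cosine. It is not the method behind the thesaurus shown earlier, whose scoring is not described in these slides.

```python
import math
from collections import Counter, defaultdict

# Invented toy corpus.
CORPUS = [
    "the goat eats grass",
    "the sheep eats grass",
    "the engine burns fuel",
]


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    vectors = context_vectors(CORPUS)
    print("goat ~ sheep :", cosine(vectors["goat"], vectors["sheep"]))
    print("goat ~ engine:", cosine(vectors["goat"], vectors["engine"]))
```

In this toy corpus "goat" and "sheep" occur in identical contexts, so they come out as more similar to each other than either is to "engine".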

Example 2: POS Tagging
The tall woman and the strange boy thought statistical NLP was pointless.
[Figure: the same sentence with its POS tags. Output from a statistical POS tagger, trained on the Brown Corpus (LingPipe demo library)]
Uses of POS Tagging:
- pre-parsing
- corpus analysis for linguistics
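
To give a feel for how a tagger can be learned from a tagged corpus, here is a sketch of the simplest statistical baseline: assign each word its most frequent tag in the training data. This is an assumption-laden illustration, not the LingPipe tagger's method; the tiny tagged corpus, the tag names and the default tag for unknown words are all invented.

```python
from collections import Counter, defaultdict

# Invented toy training data: sentences as lists of (word, tag) pairs.
TAGGED = [
    [("the", "DET"), ("tall", "ADJ"), ("woman", "NOUN"), ("thought", "VERB")],
    [("the", "DET"), ("thought", "NOUN"), ("was", "VERB"), ("strange", "ADJ")],
    [("the", "DET"), ("boy", "NOUN"), ("thought", "VERB")],
]


def train(tagged_sentences):
    """Learn, for every word, the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word with its learned tag; unknown words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]


if __name__ == "__main__":
    model = train(TAGGED)
    print(tag("The strange boy thought".split(), model))
```

Because it ignores context, this baseline always tags "thought" as a verb here, even where it is a noun; sequence models such as hidden Markov models address that by also modelling tag-to-tag transitions.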

Example 3: parsing
Parsed using the Stanford Parser.
- Based on a probabilistic context-free grammar of English
  - trained on a treebank
  - CFG rules with probabilities
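
The idea of "CFG rules with probabilities" can be sketched as follows. Everything concrete here is assumed for illustration: the rule occurrences are an invented toy treebank, probabilities are plain relative frequencies, and the two analyses correspond to the ambiguous sentence "She killed the man with the gun" from earlier in the lecture, with rules shared by both parses left out. This is not a description of how the Stanford Parser is implemented.

```python
from collections import Counter
from math import prod

# Invented rule occurrences from a toy treebank: (left-hand side, right-hand side).
TREEBANK_RULES = [
    ("VP", ("V", "NP")), ("VP", ("V", "NP")), ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")),
    ("NP", ("Det", "N")), ("NP", ("Det", "N")), ("NP", ("Det", "N")),
    ("NP", ("NP", "PP")),
]


def estimate(rules):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(rules)
    lhs_counts = Counter(lhs for lhs, _ in rules)
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


def parse_probability(rules_used, probabilities):
    """Probability of a parse: the product of the probabilities of its rules."""
    return prod(probabilities[rule] for rule in rules_used)


# Rules on which the two analyses differ (shared rules omitted).
VERB_ATTACHMENT = [("VP", ("V", "NP", "PP")), ("NP", ("Det", "N"))]
NOUN_ATTACHMENT = [("VP", ("V", "NP")), ("NP", ("NP", "PP")), ("NP", ("Det", "N"))]

if __name__ == "__main__":
    probabilities = estimate(TREEBANK_RULES)
    print("PP attached to verb:", parse_probability(VERB_ATTACHMENT, probabilities))
    print("PP attached to noun:", parse_probability(NOUN_ATTACHMENT, probabilities))
```

A probabilistic parser would return the analysis with the higher probability; which one wins depends entirely on the counts in the treebank it was trained on.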

Example 4: Machine translation
- Input: (Maltese translation of example sentence)
- Output: The wife and son long strange nonetheless feels that the statistical NLP is without purpose.
Translated using Maltese-English Google Translate.
- Obvious shortcomings, but robust, i.e. some output is returned, even if garbled.
- Based on automatic alignment between parallel text corpora.
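
One classic way to learn word correspondences from parallel text is IBM Model 1, trained with expectation-maximisation. The sketch below is offered only as an illustration of automatic alignment: the slides do not say which method the translation system above uses, and the three German-English sentence pairs, the function names and the iteration count are all assumptions.

```python
from collections import defaultdict

# Invented toy parallel corpus: (source sentence, target sentence).
PARALLEL = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
]


def ibm_model1(pairs, iterations=10):
    """Estimate word translation probabilities t[(target, source)] with EM."""
    pairs = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {word for _, tgt in pairs for word in tgt}
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    delta = t[(e, f)] / norm
                    count[(e, f)] += delta
                    total[f] += delta
        # M-step: renormalise the counts into probabilities.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


def best_translation(source_word, t):
    """Return the target word with the highest probability for `source_word`."""
    candidates = {e: p for (e, f), p in t.items() if f == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    table = ibm_model1(PARALLEL)
    for word in ["das", "haus", "buch", "ein"]:
        print(word, "->", best_translation(word, table))
```

No dictionary is supplied: the correspondences emerge only from which words co-occur across the sentence pairs, which is what makes the approach robust but also sensitive to the amount of parallel data.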

Example 5: Generation/Summarisation
[...] No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified [...]
- Automatically generated article about 3-M syndrome (Sauper and Barzilay 2009)
- Now on Wikipedia!!! (http://en.wikipedia.org/wiki/3-M_syndrome)
- Summarised from multiple documents drawn from the web.
- Uses automatically acquired templates from human-authored texts to ensure coherence.

Features of Statistical NLP systems
- Robustness: typically, they don't break down with new or unknown input
- Portability: statistical learning algorithms can in principle be ported to new domains (given data)
- Sensitivity to training data: if (say) a POS tagger is trained on medical text, its performance will decline on a new genre (e.g. news).

Some important concepts
All the systems surveyed rely on regularities in large repositories of training data, expressed as probabilities.
In practice, we distinguish between:
- training/development data: for learning a model and fine-tuning it
- test data: for evaluation on unseen but compatible data
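
The distinction between these data sets can be sketched as a simple random split. The 80/10/10 proportions, the fixed seed and the function name `split` are assumptions chosen for illustration; the lecture does not prescribe particular proportions.

```python
import random


def split(items, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle `items` and return disjoint (train, dev, test) lists."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_dev = round(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    train_set, dev_set, test_set = split(range(100))
    print(len(train_set), len(dev_set), len(test_set))
```

The test portion should be set aside and used only for the final evaluation; choosing between models or tuning parameters is done on the development portion.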

References
- Sparck-Jones, K. (2007). Computational Linguistics: What about the linguistics? Computational Linguistics 33(3): 437-441.
- McEnery, T., Xiao, R. & Tono, Y. (2006). Corpus-based language studies: An advanced resource book. London: Routledge. (Contains an interesting discussion of corpus-based vs. corpus-driven approaches.)