Using Corpora for Language Research
19.04.23 COGS 523 - Bilge Say 1
COGS 523 - Lecture 4
Using Corpora with Other Resources; Corpus Software

Related Readings
Readings: Buchholz and Green (2006); Miller and Fellbaum (2007); Sampson and McCarthy Ch 29.
Extra – Information sheet for Resources
Optional (can be used in software reviews!): Garretson, G. (2008) Desiderata for Linguistics Software Design. International Journal of English Studies 8(1), 67-74. (The link is available on METU Online)

Lexical and Ontological Resources
Useful for Natural Language Processing, Psycholinguistics, Corpus Annotation (e.g. automating semantic annotation)
A selected review is to follow, but there are others...

WordNet - Preliminaries
Lexeme vs Sense
Homonyms (homophones or homographs): Words that have the same form with unrelated meanings
Polysemy: Multiple related meanings with a single lexeme (e.g. sperm bank)
Hard to distinguish between polysemy and homonymy sometimes.

WordNet - Preliminaries
Synonymy: Different lexemes, same (or nearly same) meanings
Hyponymy: A subclass of: poodle -> dog; car -> vehicle (opposite direction: hypernymy)
Meronymy: A part of: leg -> table
Antonymy: Opposites

WordNet
A lexical database for English (and 30 other languages, see Balkanet and EuroWordnet projects); most extensive use: word sense disambiguation (WordNet book available at the library)
Synsets: A set of synonyms
Each sense entry contains synsets, a dictionary style definition, some example uses (and a frequency number)
Four separate databases: nouns (hyponymy, meronymy), verbs (hyponymy, manner, causation, etc.), adjectives and adverbs
Synsets will be chained together with hyponyms and hypernyms – multiple chains possible

Bass -> musical instrument -> instrument -> device ....-> entity
Bass -> singer, vocalist -> musician -> performer ....-> entity
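
The two chains above can be traced with a small lookup table. What follows is a minimal Python sketch, not the WordNet API: the dictionary holds only the links shown on the slide, the sense labels such as "bass (instrument)" are invented for illustration, and the "...." node is a placeholder for the levels the slide elides.

```python
# Toy hypernym table built only from the two chains on the slide.
# "...." is a placeholder for the intermediate levels the slide leaves out;
# the sense labels in parentheses are hypothetical names, not WordNet synset ids.
HYPERNYMS = {
    "bass (instrument)": ["musical instrument"],
    "musical instrument": ["instrument"],
    "instrument": ["device"],
    "device": ["...."],
    "bass (singer)": ["singer, vocalist"],
    "singer, vocalist": ["musician"],
    "musician": ["performer"],
    "performer": ["...."],
    "....": ["entity"],
}

# One word form can map to several senses, each with its own chain.
SENSES = {"bass": ["bass (instrument)", "bass (singer)"]}


def hypernym_chains(synset):
    """Return every chain from synset up to a node that has no hypernym."""
    parents = HYPERNYMS.get(synset, [])
    if not parents:
        return [[synset]]
    chains = []
    for parent in parents:
        for chain in hypernym_chains(parent):
            chains.append([synset] + chain)
    return chains


def chains_for_word(word):
    """Collect the chains of all senses listed for a word form."""
    return [chain for sense in SENSES.get(word, [])
            for chain in hypernym_chains(sense)]


if __name__ == "__main__":
    for chain in chains_for_word("bass"):
        print(" -> ".join(chain))
```

Because a synset may list more than one parent, the same function also covers the "multiple chains possible" case.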

Extensions
WordNetPlus: Dense Weighted X-database of automatically learned evocation ratings (how much a certain concept brings to mind the second). First human-rated 120,000 pairs from 1000 synsets – most frequent concepts in BNC.
ImageNet: Enhancing WordNet with images and icons.

An example of WordNet Query

Turkish WordNet project
http://www.hlst.sabanciuniv.edu/TL/
Combined with phonetic rendering, morphological analysis, English equivalent etc.
http://www.ceid.upatras.gr/Balkanet/index.htm
Part of Balkanet project for 6 Balkan languages
12,000 synsets

An example of Turkish WordNet Query

An Alternative to Turkish WordNet
60000 hypernyms, 72 layers
Machine learning from TDK dictionary
Ongoing work, needs disambiguation
More coverage than Turkish WordNet
By Tunga Güngör and Onur Güngör, Boğaziçi Univ

Ontologies - Cyc
A knowledge base of human commonsense and associated inference engine.
http://www.opencyc.org/ (Free version)
http://research.cyc.com/ (Academic version)
Doug Lenat’s project – 1984+
300,000 concepts
Nearly 3,000,000 assertions (facts and rules), using 26,000+ relations, that interrelate, constrain, and, in effect, (partially) define the concepts.
Natural Language Query and Information Entry Tools

http://www.cyc.com/cyc/technology/whatiscyc_dir/whatdoescycknow
The graph representation of the Cyc Knowledge Base

An example of a knowledge representation sample coded with CycL

ConceptNet
http://web.media.mit.edu/~hugo/conceptnet/
Part of Open Mind Initiative
A huge wiki type of effort to create a commonsense knowledge base represented as a semantic network
1.6 million edges (assertions) connecting more than 300,000 nodes, where nodes are semi-structured English fragments, interrelated by an ontology of twenty semantic relations such as EffectOf (causality), SubeventOf (event hierarchy), CapableOf (agent’s ability), PropertyOf, LocationOf, and MotivationOf (affect).

An excerpt from ConceptNet’s semantic network
from Liu, H. & Singh, P. (2004) ConceptNet: A Practical Commonsense Reasoning Toolkit. BT Technology Journal

FrameNet
FrameNet is a lexicon-building project for English, based on frame semantics, carried out by the International Computer Science Institute in Berkeley.
Frame: schematic representation of a situation type (eating, spying, removing, classifying, etc.) together with lists of the kinds of participants, props, and other conceptual roles that are seen as components of such situations.
The semantic arguments of a predicating word correspond to what we call the frame elements (FE) of the frame associated with that word.

FrameNet
Uses BNC and ANC
Currently (version 1.3), there are more than 10,000 lexical units, more than 6,000 of which are fully annotated, in more than 800 hierarchically-related semantic frames, exemplified in more than 135,000 annotated sentences in the database.
WordNet – ConceptNet hybrid, with a grammar theory in the background (Fillmore’s Frame Semantics).

Interface of the Frame Grapher

Sample Output From Frame Grapher
input: Crime_Scenario

Software for Working with Corpora
“Corpus Linguistics in its current form cannot work without the help of the computer.” (Mason)
According to function: Corpus Building Software vs Corpus Query Software
According to design: Standard Software for Non-Technical Users vs Specialized Toolkits Providing Standard Functions vs Using Non-Corpus Specific Tools and Programming Languages (e.g. grep, egrep, Perl, Python, Tcl/Tk, Java)

Corpus Software
Standard Software: MonoConcPro, WConcord, Wordsmith, IMS CQP (Corpus Query Processor), Qwick, Xaira, Gsearch
More General Purpose NLP Suites/Toolkits for Programmers: CUE (Corpus Universal Examiner), NLTK, GATE

Corpus Query/Analysis Software
Text Analysis Software -> Corpus Query Software -> Concordancers
Collocations in KWIC format (Keyword in Context)
General Features: Search; Display, Save, Export; Statistics

Features
Search: word, phrase, POS etc. search; regular expression search; context-sensitive search; header info search
Display, save, export: KWIC or sentence format; sorting; saving results or search patterns
Statistics: frequency and various statistics; plotting graphs
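
The KWIC display and sorting features listed above can be illustrated in a few lines. This is a minimal Python sketch under simplifying assumptions (plain untagged text, whole-word case-insensitive matching, context measured in characters); the function names and the sample sentence are made up for illustration and do not reproduce any of the concordancers named in these slides.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left context, match, right context) for each whole-word hit."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(), right))
    return hits


def format_kwic(hits, width=30):
    """Align the keyword in a central column, one hit per line."""
    return [f"{left:>{width}} [{word}] {right}" for left, word, right in hits]


def sort_by_right_context(hits):
    """Sort hits alphabetically by what follows the keyword."""
    return sorted(hits, key=lambda hit: hit[2].lower())


if __name__ == "__main__":
    sample = ("The bank raised its rates. She sat on the river bank "
              "all afternoon. Banks compete for customers.")
    for line in format_kwic(sort_by_right_context(kwic(sample, "bank"))):
        print(line)
```

Sorting by left or right context is what lets recurring collocations line up visually in the output.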

A Comparison Framework
Platform/Operating System
Price
Ease of Installation
User friendliness
Speed
Ease of setting up a corpus/texts
Query syntax
Query search power (collocational, discontinuous constituents)
Statistical Analysis
Standard markup scheme handling
Whole text browsing
Character set handling
Output for presentation

Desiderata – some maxims
Do not build linguistic theory into the program any more than necessary
Do separate markup from annotation
Do not gloss over complexities in data – sensible defaults that can be overridden are fine
Allow users to supply their own analytical categories – e.g. annotation of concordance lines
Make use of standards
Use Unicode

IMS Corpus Workbench (CWB)
http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/
IMS Corpus Query Processor (CQP): query system for CWB
Allowing use of multiple knowledge sources (corpora, machine readable dictionaries etc.)
Allowing the use of stored information and calculating information on-line (from remote corpora)
Both for Human-Machine Use but not really for novice users...
Regular expression based syntax.

From CWB web site: Query language
unrestricted number of attributes per corpus position
regular expressions over attribute values of individual corpus positions (e.g. wild cards for word forms, part-of-speech values)
regular expressions over sequences of corpus positions
(partial) support of structural annotations (e.g. SGML)
incremental concordancing
application of a query to all items of a list
'virtual attributes', i.e. runtime access to external applications (e.g. WordNet)
queries on parallel translated texts

From CWB web site: Display of results
user-definable size of 'keyword in context' display
'keyword in context' lines can be sorted in various ways
frequency counts, e.g. for word combinations
multilingual concordances from aligned corpora
html and latex output supported
query history

From CWB web site
registration of corpora
'encoding' of corpora, i.e. indexing (and compression) (for text sources in one-word-per-line format, using ISO8859/Latin-1 8bit character sets, and maybe others). For example, the BNC corpus with part-of-speech and lemma annotation will need about 1 GB of disk space.
incremental addition of types of corpus annotations ('attributes'). E.g. add part-of-speech values to a corpus once you have access to a POS-tagger.

Regular Expressions
Equivalent to regular languages and finite automaton languages
Take the empty language and the languages with a single string, and apply concatenation, union or Kleene star operations on them. Everything you can generate in this way will be a regular language. (Partee et al., 1993)

Regular Expressions
From CQP Tutorial...
Basic syntax of regular expressions
letters and digits are matched literally (including all non-ASCII characters)
  word -> word; C3PO -> C3PO; déjà -> déjà
. matches any single character ("matchall")
  r.ng -> ring, rung, rang, rkng, r3ng, ...
character set: [...] matches any of the characters listed
  moderni[sz]e -> modernise, modernize
  [a-c5-9] -> a, b, c, 5, 6, 7, 8, 9
  [^aeiou] -> b, c, d, f, ..., 1, 2, 3, ..., ä, à, á, ...
repetition of the preceding element (character or group): ? (0 or 1), * (0 or more), + (1 or more), {n} (exactly n), {n,m} (n to m)
  colou?r -> color, colour
  go{2,4}d -> good, goood, gooood
  [A-Z][a-z]+ -> "regular" capitalised word such as British
grouping with parentheses: (...)
  (bla)+ -> bla, blabla, blablabla, ...
  (school)?bus(es)? -> bus, buses, schoolbus, schoolbuses
| separates alternatives (use parentheses to limit scope)
  mouse|mice -> mouse, mice
  corp(us|ora) -> corpus, corpora
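
The basic syntax above can be tried out without a corpus tool. The sketch below uses Python's standard re module, on the assumption that these particular operators behave the same way there as in CQP; fullmatch is used because the tutorial examples describe whole word forms. The non-matching strings are added here for contrast and are not from the tutorial.

```python
import re

# pattern -> (strings that should match, strings that should not)
EXAMPLES = {
    r"r.ng": (["ring", "rung", "r3ng"], ["rng", "wrong"]),
    r"moderni[sz]e": (["modernise", "modernize"], ["modernice"]),
    r"[^aeiou]": (["b", "3", "ä"], ["a", "bb"]),
    r"colou?r": (["color", "colour"], ["colouur"]),
    r"go{2,4}d": (["good", "goood", "gooood"], ["god", "goooood"]),
    r"[A-Z][a-z]+": (["British"], ["british", "BBC"]),
    r"(bla)+": (["bla", "blabla"], ["bl", "blab"]),
    r"(school)?bus(es)?": (["bus", "buses", "schoolbus", "schoolbuses"], ["school"]),
    r"corp(us|ora)": (["corpus", "corpora"], ["corp"]),
}


def check(examples):
    """Return a list of (pattern, string, expected) for every wrong outcome."""
    failures = []
    for pattern, (should_match, should_fail) in examples.items():
        regex = re.compile(pattern)
        for text in should_match:
            if regex.fullmatch(text) is None:
                failures.append((pattern, text, True))
        for text in should_fail:
            if regex.fullmatch(text) is not None:
                failures.append((pattern, text, False))
    return failures


if __name__ == "__main__":
    print(check(EXAMPLES) or "all examples behave as described")
```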

Regular Expressions
Complex regular expressions can be used to model (regular) inflection:
  ask(s|ed|ing)? -> ask, asks, asked, asking (equivalent to the less compact expression ask|asks|asked|asking)
  sa(y(s|ing)?|id) -> say, says, saying, said
  [a-z]+i[sz](e[sd]?|ing) -> any form of a verb with -ise or -ize suffix
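
As a hedged illustration of how such an inflection pattern behaves on running text, the Python sketch below applies the -ise/-ize expression to a short token list and counts the hits. The token list is invented for the example; only the regular expression comes from the slide.

```python
import re
from collections import Counter

# The -ise/-ize pattern from the slide, applied to whole tokens.
ISE_VERB = re.compile(r"[a-z]+i[sz](e[sd]?|ing)")

# Invented sample tokens, already split on whitespace.
TOKENS = ("They organise events , realized gains , "
          "keep modernising and watch prices rise").split()


def ise_forms(tokens):
    """Count lower-cased tokens that fully match the -ise/-ize pattern."""
    lowered = (token.lower() for token in tokens)
    return Counter(token for token in lowered if ISE_VERB.fullmatch(token))


if __name__ == "__main__":
    for form, count in sorted(ise_forms(TOKENS).items()):
        print(form, count)
```

Note that "rise" is matched as well: the pattern describes a spelling, not a suffix, so hits of this kind still need manual filtering or a POS constraint.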

Some examples from CQP
the specified word is interpreted as a regular expression:
  > "interest(s|(ed|ing)(ly)?)?";
  > [(lemma="under.+") & (pos="V.*")];
a noun, followed by either is or was, followed by a verb ending in ed:
  [pos="N.*"] "is|was" [pos="V.*" & word=".*ed"];
similar, but is or was followed by a past participle (which is described by a special POS tag):
  [pos="N.*"] "is|was" [pos="VBD"];
catch or caught, followed by a determiner, any number of adjectives and a noun, or a noun, followed by was or were, followed by caught:
  "catch|caught" [pos="DT"] [pos="JJ"]* [pos="N.*"] | [pos="N.*"] "was|were" "caught";
look or bring, followed by either up or down with at most 10 non-verbs in between:
  "look|bring" [pos != "VB.*"]{0,10} "up|down";
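
To make the idea of "regular expressions over attribute values at each corpus position" concrete, here is a minimal Python sketch of the second example (noun, then is or was, then a verb ending in ed). It is not CQP and does not use CWB: the token representation, the helper names attr and find_sequences, and the tagged sample sentence are all assumptions made for illustration, and only fixed-length patterns are handled (no * or {0,10} over positions).

```python
import re


def attr(**conditions):
    """Build a token test: every named attribute must fully match its regex."""
    compiled = {name: re.compile(rx) for name, rx in conditions.items()}

    def test(token):
        return all(rx.fullmatch(token[name]) for name, rx in compiled.items())

    return test


def find_sequences(tokens, pattern):
    """Return start positions where consecutive tokens satisfy the pattern."""
    size = len(pattern)
    hits = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(test(token) for test, token in zip(pattern, window)):
            hits.append(start)
    return hits


# Invented, hand-tagged sample: one dict per corpus position.
SENTENCE = [
    {"word": "The", "pos": "DT"}, {"word": "window", "pos": "NN"},
    {"word": "was", "pos": "VBD"}, {"word": "opened", "pos": "VBN"},
    {"word": "and", "pos": "CC"}, {"word": "the", "pos": "DT"},
    {"word": "door", "pos": "NN"}, {"word": "is", "pos": "VBZ"},
    {"word": "closed", "pos": "VBN"},
]

# Rough counterpart of: [pos="N.*"] "is|was" [pos="V.*" & word=".*ed"];
NOUN_BE_ED = [attr(pos=r"N.*"), attr(word=r"is|was"),
              attr(pos=r"V.*", word=r".*ed")]

if __name__ == "__main__":
    for start in find_sequences(SENTENCE, NOUN_BE_ED):
        print(" ".join(tok["word"] for tok in SENTENCE[start:start + 3]))
```

A real query processor works on an indexed corpus rather than scanning a list, which is what the registration and encoding steps on the earlier slide prepare.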

Searching for more complex patterns
Gsearch Corpus Query System
http://www.hcrc.ed.ac.uk/gsearch/
Facilitating the investigation of lexical and syntactic phenomena in unparsed but tagged corpora (can work with external taggers too)
Users specify their own context free grammar
Can take something like 167 minutes for a search on the 100 million word BNC
False positives should be manually eliminated
Visualization tools to display tree structures

Alternative: Using a class library
Mason, O. Programming for Corpus Linguistics: How to do text analysis with Java, Edinburgh University Press, 2000.
CUE (Corpus Universal Examiner): class library in Java that takes care of indexing, compressing large corpora, support for XML and Unicode
Qwick: a concordancing application that is developed using CUE

A Professional Alternative
http://athel.com/
MonoConcPro ($95)
Features: Context Search, Regular Expression search, Part-of-Speech Tag Search, Collocations, and Corpus Comparison.
Not language specific
You can also buy a Chinese (and other languages) concordance T-shirt

From an older version of MonoConc Pro

Quality Control in Corpora
Format: punctuation, delimiters, character encoding, presence and order of all fields, typos in labels and annotation.
Explicit Documentation
Format Checker – Structure Checker
Solution: Versioning and Patching mechanism in Treebanks and Corpora
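
A format checker of the kind mentioned above can be very small. The Python sketch below assumes a one-word-per-line file with tab-separated word, tag and lemma fields and a closed tag list; the field layout, the function name check_format and the sample tags are assumptions for illustration, not a description of any particular corpus.

```python
def check_format(lines, n_fields=3, allowed_tags=None):
    """Return (line number, message) pairs for lines that break the format."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        # Presence of all fields.
        if len(fields) != n_fields:
            problems.append(
                (lineno, f"expected {n_fields} fields, found {len(fields)}"))
            continue
        if any(not field.strip() for field in fields):
            problems.append((lineno, "empty field"))
        # Typos in labels: the tag (second field) must come from the tag list.
        if allowed_tags is not None and fields[1] not in allowed_tags:
            problems.append((lineno, f"unknown tag {fields[1]!r}"))
    return problems


if __name__ == "__main__":
    sample = ["The\tDT\tthe", "dog\tNN", "barks\tVBX\tbark"]
    for lineno, message in check_format(sample, allowed_tags={"DT", "NN", "VBZ"}):
        print(f"line {lineno}: {message}")
```

Running such a check before each release pairs naturally with versioning, since every reported problem can be fixed in a patch and traced.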

Interrater agreements - reliability
Cochran’s Q test – binary values
Kappa – multivalued (Carletta, 1996)
Sensibly chosen unit of agreement
Expert vs naive coders
K > 0.8 good
Generalizability Theory (G-Theory) (Bayerl and Paul, 2007) – finer grained
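
For two coders, kappa corrects the observed agreement for the agreement expected by chance: K = (P(A) - P(E)) / (1 - P(E)). The Python sketch below computes the two-coder Cohen variant, where P(E) is estimated from each coder's own label distribution; that choice, the function name and the sample labels are assumptions for illustration, and other kappa variants estimate P(E) differently.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty item list")
    n = len(coder_a)
    # P(A): proportion of items on which the coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # P(E): chance agreement from each coder's own label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Invented labels for ten items, e.g. noun (N) vs verb (V) decisions.
    a = ["N", "N", "V", "V", "N", "V", "N", "N", "V", "N"]
    b = ["N", "N", "V", "N", "N", "V", "N", "V", "V", "N"]
    print(round(cohen_kappa(a, b), 3))
```

In this sample the coders agree on 8 of 10 items, yet kappa is well below the 0.8 threshold because half of that agreement is expected by chance.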

Lecture 5
See articles on METU Turkish Corpus and METU-Sabanci Treebank under Lecture Notes.