BalricLing Athens VG
8/2/2019 BalricLing Athens VG
1/23
Standards for collection and annotation, processing and advanced HLT applications.
Dissemination of experience in linguistics infrastructure harmonisation with respect to corpora
-
Basic and applied research in the field of Natural Language Processing, focusing on the design of computational models for natural language recognition and "understanding", with application to two interwoven tracks:
- information processing, extraction and retrieval
- multilingual information processing (multilingual applications and translation systems)
What is a corpus
-
With the use of corpora:
corpus-based NLP
Research and development of NLP tools
Testing and evaluation at component and system level
Automatic creation and maintenance of other linguistic resources (lexica, name lists, etc.)
-
design and implementation of annotation environments for training, development and evaluation of components
compilation of linguistic resources
design and implementation of components
Information Processing, Extraction & Retrieval
-
Levels of Annotation
1. Surface Text Analysis (tokenisation and handling)
2. Morphosyntactic Annotation
3. Lemmatisation
4. Named Entity Recognition
5. Surface Syntactic Analysis
6. Computation of Grammatical Functions in sentence parses
7. Coreference Annotation
8. Term detection
-
B. Multilingual Information Processing
parallel text processing
text alignment at three levels:
- sentence
- clause
- word/term
example based translation
- text matching
- template extraction
- template matching
-
Information Processing, Extraction and Retrieval
[Architecture diagram: Input Document -> processing pipeline -> Template]
Processing stages: Surface Text Analysis -> Morphosyntactic Annotation -> Lemmatisation -> Named Entity Recognition -> Surface Syntactic Analysis -> Functional Analysis -> Coreference Resolution -> Template Construction
Components: Text Handler, POS Tagger & Lemmatiser, Name Recogniser, Shallow Parser & Semantic Processor, Discourse Interpreter
Resources: Lexicon, Name Lists, NE Rules, Grammar Rules, Context Rules, Subcategorisation Frames, Domain Model, Inference Rules
-
Surface Text Analysis
Purpose
identification of word and sentence boundaries
detection of dates, numbers, enumeration lists, acronyms, abbreviations, punctuation
according to Multext specifications
Tool: Handler
a set of filters chained together to form the entire segmentation tool: split text -> isolate punctuation -> identify abbreviations, dates, numbers and enumerations -> identify sentences
Resources
Regular Expression Grammars
Abbreviation Lists
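
The filter chain described on this slide can be sketched roughly as follows. This is a minimal illustration, not the Handler itself: the abbreviation list, the regular expressions, the label names (ABBR, DATE, NUM, PUNCT, PTERM, TOK) and all function names are assumptions made for the example, and enumeration detection is left out.

```python
import re
from functools import reduce

# Hypothetical abbreviation list; the real tool loads its own resources.
ABBREVIATIONS = {"Dr.", "e.g.", "etc."}
DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
SENTENCE_END = {".", "!", "?"}

def split_text(text):
    """Filter 1: split raw text on whitespace."""
    return text.split()

def isolate_punctuation(chunks):
    """Filter 2: peel punctuation off each chunk, keeping abbreviations whole."""
    tokens = []
    for chunk in chunks:
        lead, rest = re.fullmatch(r"([^\w\s]*)(.*)", chunk).groups()
        tokens.extend(lead)              # one token per leading mark
        if rest in ABBREVIATIONS:
            tokens.append(rest)
            continue
        core, trail = re.fullmatch(r"(.*?)([^\w\s]*)", rest).groups()
        if core:
            tokens.append(core)
        tokens.extend(trail)             # one token per trailing mark
    return tokens

def label_tokens(tokens):
    """Filter 3: mark abbreviations, dates, numbers and punctuation."""
    labelled = []
    for tok in tokens:
        if tok in ABBREVIATIONS:
            label = "ABBR"
        elif DATE.fullmatch(tok):
            label = "DATE"
        elif NUMBER.fullmatch(tok):
            label = "NUM"
        elif tok in SENTENCE_END:
            label = "PTERM"
        elif re.fullmatch(r"[^\w\s]", tok):
            label = "PUNCT"
        else:
            label = "TOK"
        labelled.append((label, tok))
    return labelled

def identify_sentences(labelled):
    """Filter 4: close a sentence at each sentence-final mark."""
    sentences, current = [], []
    for item in labelled:
        current.append(item)
        if item[0] == "PTERM":
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

FILTERS = [split_text, isolate_punctuation, label_tokens, identify_sentences]

def handle(text):
    """Run the text through every filter in order."""
    return reduce(lambda data, f: f(data), FILTERS, text)
```

Keeping each step as a separate function means a filter can be replaced or reordered without touching the others, which is the main appeal of a chained design.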
-
POS Tagging & Lemmatisation
Purpose
assignment of a unique part of speech & lemma according to the token's local function
according to PAROLE specifications
Tools: POS Tagger & Lemmatiser
Rule Based Tagging
automatic induction of transformation rules from annotated corpus
twofold nature of rules: lexical & contextual
application of rules on local context to assign tags to ambiguous or unknown words (à la Brill)
Resources
manually annotated corpus using a Parole compliant tagset extended with tokenisation tags (~600 tags)
morphological lexicon
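
As an illustration of how lexical and contextual rules are applied in a Brill-style tagger, here is a minimal sketch. It covers rule application only, not rule induction from the annotated corpus. The rule shapes, the toy English lexicon and the two-letter tags are assumptions made for the example and are not the tagset or rules used by the tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexicalRule:
    """Guess a tag for an unknown word from its suffix (hypothetical rule shape)."""
    suffix: str
    tag: str

@dataclass(frozen=True)
class ContextualRule:
    """Change from_tag to to_tag when the previous token carries prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str

def initial_tags(words, lexicon, lexical_rules, default="No"):
    """Known words get their first listed tag; unknown words go through lexical rules."""
    tags = []
    for word in words:
        key = word.lower()
        if key in lexicon:
            tags.append(lexicon[key][0])
        else:
            guess = next((r.tag for r in lexical_rules if key.endswith(r.suffix)), default)
            tags.append(guess)
    return tags

def apply_contextual_rules(tags, rules):
    """Apply each transformation, in order, wherever its local context matches."""
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags

# Toy data, invented for the example.
LEXICON = {"the": ["At"], "can": ["Vb", "No"], "rusts": ["Vb"]}
LEXICAL_RULES = [LexicalRule(suffix="ly", tag="Av")]
CONTEXTUAL_RULES = [ContextualRule(from_tag="Vb", to_tag="No", prev_tag="At")]

words = ["the", "can", "rusts", "slowly"]
start = initial_tags(words, LEXICON, LEXICAL_RULES)
final = apply_contextual_rules(start, CONTEXTUAL_RULES)
```

Here "can" starts with its first listed tag and is then retagged as a noun because it follows an article, while "slowly", absent from the lexicon, receives its tag from the suffix rule.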
-
Named Entity Recognition
Purpose
recognition and classification of named entities
classification in person, organisation, location, time, date, money, percent
according to MUC-7 specifications
Tools: Name Recogniser
cascaded finite-state machines recognising regular expression based grammars
Resources
name lists
trigger word lists
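
A rough sketch of how name lists and trigger word lists can drive a two-stage recogniser is given below. The lists, the capitalisation test and the function name are all hypothetical, and a real cascade of finite-state grammars would handle multi-token names and many more patterns than this.

```python
import re

# Hypothetical resources, invented for the example.
NAME_LIST = {"Athens": "location", "Siemens": "organisation"}
TRIGGERS_BEFORE = {"Mr": "person", "Dr": "person"}           # trigger precedes the name
TRIGGERS_AFTER = {"Ltd": "organisation", "Inc": "organisation"}  # trigger follows the name

CAPITALISED = re.compile(r"[A-Z][a-z]+")
PERCENT = re.compile(r"\d+(?:\.\d+)?%")

def recognise(tokens):
    """Return (token, entity class or None) pairs."""
    classes = [None] * len(tokens)

    # Stage 1: direct lookup in the name list and simple regular expressions.
    for i, tok in enumerate(tokens):
        if tok in NAME_LIST:
            classes[i] = NAME_LIST[tok]
        elif PERCENT.fullmatch(tok):
            classes[i] = "percent"

    # Stage 2: classify remaining capitalised tokens from adjacent trigger words.
    for i, tok in enumerate(tokens):
        if classes[i] is not None or not CAPITALISED.fullmatch(tok):
            continue
        if tok in TRIGGERS_BEFORE or tok in TRIGGERS_AFTER:
            continue  # the trigger itself is not a name
        if i > 0 and tokens[i - 1] in TRIGGERS_BEFORE:
            classes[i] = TRIGGERS_BEFORE[tokens[i - 1]]
        elif i + 1 < len(tokens) and tokens[i + 1] in TRIGGERS_AFTER:
            classes[i] = TRIGGERS_AFTER[tokens[i + 1]]

    return list(zip(tokens, classes))

result = recognise("Mr Papadopoulos of Acme Ltd raised sales in Athens by 5%".split())
```

Running the later stage only on tokens the earlier stage left unclassified is what makes the arrangement a cascade: each level works on the output of the previous one.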
-
Surface Syntactic & Functional Analysis
Purpose
identification of chunk boundaries and phrase structure
computation of grammatical functions
according to EAGLES specifications
Tools
cascaded finite-state machines recognising regular expression based grammars
Resources
regular expression grammars
subcategorisation frames derived from the Parole syntactic layer
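
The idea of a cascade of regular-expression grammars over category labels can be sketched as below. Each level rewrites a run of labels into one phrase node, and later levels see the phrases built by earlier ones. The three-level grammar and the short tags are invented for the example; grammatical function computation and subcategorisation frames are not covered.

```python
import re

# Hypothetical cascade: (phrase label, pattern over space-terminated category labels).
CASCADE = [
    ("np", re.compile(r"\b(?:At )?(?:Aj )*No ")),   # article? adjective* noun
    ("pp", re.compile(r"\bAs np ")),                # adposition + noun phrase
    ("vg", re.compile(r"\b(?:Vb )+")),              # one or more verbs
]

def apply_level(nodes, label, pattern):
    """Group every run of nodes whose categories match the pattern under a new node."""
    encoded = "".join(cat + " " for cat, _ in nodes)
    # bounds[i] is the character offset where node i starts in the encoded string.
    bounds, pos = [0], 0
    for cat, _ in nodes:
        pos += len(cat) + 1
        bounds.append(pos)
    out, done = [], 0
    for match in pattern.finditer(encoded):
        start, end = bounds.index(match.start()), bounds.index(match.end())
        out.extend(nodes[done:start])
        out.append((label, nodes[start:end]))
        done = end
    out.extend(nodes[done:])
    return out

def parse(tagged):
    """tagged: list of (word, tag) pairs. Returns a list of (category, content) nodes."""
    nodes = [(tag, word) for word, tag in tagged]
    for label, pattern in CASCADE:
        nodes = apply_level(nodes, label, pattern)
    return nodes

tagged = [("in", "As"), ("the", "At"), ("old", "Aj"), ("town", "No"),
          ("people", "No"), ("sing", "Vb")]
tree = parse(tagged)
```

The prepositional phrase can only be found at the second level because it depends on the noun phrase built at the first, which is why the order of the levels matters.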
-
Anaphoric Relations
Purpose
detection and classification of candidate antecedents
creation of coreferential chains by linking each anaphor to its antecedent
according to MUC-7 and MATE specifications
Tools: Discourse Interpreter
candidate antecedents detection -> weight estimation of their features ->
anaphoric expression detection -> candidate elimination on the basis of
selectional restrictions -> estimation of salience value for candidate
antecedents -> selection of the most salient candidate -> discourse
model update
Resources
domain model
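
The steps listed under the Discourse Interpreter can be illustrated with a small salience-based sketch. Everything here is an assumption made for the example: the feature set, the role weights, the distance term and the use of gender and number agreement as a stand-in for selectional restrictions. It is not the weighting scheme of the actual tool.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    position: int   # token index in the discourse
    gender: str
    number: str
    role: str       # "subject", "object" or "other"

# Hypothetical feature weights.
ROLE_WEIGHT = {"subject": 3.0, "object": 2.0, "other": 1.0}

def salience(candidate, anaphor):
    """Higher for prominent grammatical roles and for nearby candidates."""
    distance = anaphor.position - candidate.position
    return ROLE_WEIGHT[candidate.role] + 1.0 / distance

def resolve(anaphor, candidates, chains):
    """Pick the most salient compatible antecedent and record the link in chains."""
    # Candidate elimination: keep preceding mentions that agree with the anaphor.
    compatible = [c for c in candidates
                  if c.position < anaphor.position
                  and c.gender == anaphor.gender
                  and c.number == anaphor.number]
    if not compatible:
        return None
    best = max(compatible, key=lambda c: salience(c, anaphor))
    # Discourse model update: extend the coreference chain of the antecedent.
    chains.setdefault(best.text, [best.text]).append(anaphor.text)
    return best

candidates = [
    Mention("Maria", 0, "fe", "sg", "subject"),
    Mention("report", 3, "ne", "sg", "object"),
    Mention("Anna", 6, "fe", "sg", "other"),
]
chains = {}
antecedent = resolve(Mention("she", 9, "fe", "sg", "subject"), candidates, chains)
```

In this toy case the subject outranks the closer candidate because the role weight dominates the distance term; a different weighting would change the outcome.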
-
(SENT
1\59 TOK THN
1\64 TOK
1\71 TOK
1\78 TOK
1\84 TOK
1\93 TOK
1\96 TOK
1\113 TOK
1\117 TOK
1\121 TOK
1\129 TOK
1\133 TOK
1\145 TOK O
1\156 TOK
1\172 OPUNCT (
1\173 TOK BTC
1\176 CPUNCT )
1\178 TOK
1\182 TOK
1\186 TOK O
1\196 TOK
1\212 TOK
1\216 TOK
1\224 OPUNCT (
1\225 TOK O
1\228 CPUNCT )
1\274 PTERM_P .
)SENT
Sample Handled Text in Tipster Format
-
(SENT
1\59 TOK THN AsPpPaFeSgAc
1\64 TOK AjBaFeSgAc
1\71 TOK NoCmFeSgAc
1\78 TOK VbMnIdPr03PlXxIpAvXx
1\84 TOK VbMnNfXxXxXxXxPePvXx
1\93 TOK AtDfFePlNm
1\96 TOK NoCmFePlNm
1\113 TOK AsPpSp
1\117 TOK AtDfFeSgAc
1\121 TOK NoCmFeSgNm
1\129 TOK AtDfMaSgGe
1\133 TOK AjBaMaSgGe
1\145 TOK O NoCmMaSgGe
1\156 TOK NoCmFePlGe
1\172 OPUNCT ( ( OPUNCT
1\173 TOK BTC BTC RgFwOr
1\176 CPUNCT ) ) CPUNCT
1\178 TOK AsPpSp
1\182 TOK AtDfMaSgAc
1\186 TOK O NoCmMaSgAc
1\196 TOK NoCmFePlGe
1\212 TOK AtDfFeSgGe
1\216 TOK NoPrFeSgGe
1\224 OPUNCT ( ( OPUNCT
1\225 TOK O O RgAnXx
1\228 CPUNCT ) ) CPUNCT
1\277 PTERM_P . . PTERM_P
)SENT
Sample POS-Tagged & Lemmatised Text in Tipster Format
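
The tags in the sample appear to be built from two-character codes (for instance NoCmFeSgAc splits into No, Cm, Fe, Sg, Ac). A small helper for taking such tags apart might look like this. The readings in the table are guesses from the usual abbreviations, not taken from the PAROLE tagset documentation, and codes without a guess are passed through unchanged.

```python
# Guessed readings of a few two-letter codes; treat every entry as an assumption.
GUESSED_CODES = {
    "No": "noun", "Vb": "verb", "Aj": "adjective", "At": "article",
    "Fe": "feminine", "Ma": "masculine",
    "Sg": "singular", "Pl": "plural",
    "Nm": "nominative", "Ge": "genitive", "Ac": "accusative",
}

def split_tag(tag):
    """Cut a tag into consecutive two-character codes."""
    return [tag[i:i + 2] for i in range(0, len(tag), 2)]

def describe_tag(tag):
    """Replace each code by its guessed reading, leaving unknown codes as they are."""
    return [GUESSED_CODES.get(code, code) for code in split_tag(tag)]

noun_features = describe_tag("NoCmFeSgAc")
verb_codes = split_tag("VbMnIdPr03PlXxIpAvXx")
```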
-
(SENT
1\59 TOK THN AsPpPaFeSgAc
1\64 TOK AjBaFeSgAc
1\71 TOK NoCmFeSgAc
1\78 TOK VbMnIdPr03PlXxIpAvXx
1\84 TOK VbMnNfXxXxXxXxPePvXx
1\93 TOK AtDfFePlNm
1\96 TOK NoCmFePlNm
1\113 TOK AsPpSp
1\117 TOK AtDfFeSgAc
1\121 TOK NoCmFeSgNm orgproperty
1\129 TOK AtDfMaSgGe
NE [org
1\133 TOK AjBaMaSgGe nationality
1\145 TOK O NoCmMaSgGe company
1\156 TOK NoCmFePlGe
NE /org]
1\178 TOK AsPpSp
1\182 TOK AtDfMaSgAc
NE [org
1\186 TOK O NoCmMaSgAc orgdesign
1\196 TOK NoCmFePlGe
1\212 TOK AtDfFeSgGe
1\216 TOK NoPrFeSgGe
NE /org]
1\277 PTERM_P . . PTERM_P
)SENT
Name Recogniser Output in Tipster Format
-
SYN [cl
SYN [pp
1\59 TOK THN AsPpPaFeSgAc as_other
SYN [np_ac
SYN [adjp_ac
1\64 TOK AjBaFeSgAc ajbasgac
SYN /adjp_ac]
1\71 TOK NoCmFeSgAc
SYN /np_ac]
SYN /pp]
SYN [vg
1\78 TOK VbMnIdPr03PlXxIpAvXx vb_exw
1\84 TOK VbMnNfXxXxXxXxPePvXx
SYN /vg]
SYN [np_nm
1\93 TOK AtDfFePlNm atdfplnm
1\96 TOK NoCmFePlNm
SYN /np_nm]
1\113 TOK AsPpSp as_gia
1\117 TOK AtDfFeSgAc atdfsgac
SYN [np_nm
1\121 TOK NoCmFeSgNm nosgnm
SYN /np_nm]
SYN [np_ge
1\129 TOK AtDfMaSgGe atdfsgge
SYN [adjp_ge
1\133 TOK AjBaMaSgGe ajbasgge
SYN /adjp_ge]
1\145 TOK O NoCmMaSgGe nosgge
SYN /np_ge]
SYN [np_ge
1\156 TOK NoCmFePlGe
SYN /np_ge]
Sample Syntactically Analysed Text in Tipster Format
-
Marker: A unified multi-level annotation tool
-
Term Extraction
Purpose
identification of candidate (indexing) terms in a text
Tools
finite-state machines recognising regular expression based grammars
filters implementing a number of statistical scores
Resources
regular expression grammars describing major term patterns
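
A minimal sketch of the two-step approach, pattern matching followed by a statistical filter, could look like this. The term pattern (adjectives followed by nouns), the short tags and the use of raw frequency as the only score are assumptions made for the example; the slide does not say which patterns or scores are used.

```python
import re
from collections import Counter

# Hypothetical term pattern over space-terminated POS tags: adjective* noun+
TERM_PATTERN = re.compile(r"\b(?:Aj )*(?:No )+")

def candidate_terms(tagged_sentence):
    """Yield lower-cased word sequences whose tags match the term pattern."""
    encoded = "".join(tag + " " for _, tag in tagged_sentence)
    # bounds[i] is the character offset where token i starts in the encoded string.
    bounds, pos = [0], 0
    for _, tag in tagged_sentence:
        pos += len(tag) + 1
        bounds.append(pos)
    for match in TERM_PATTERN.finditer(encoded):
        start, end = bounds.index(match.start()), bounds.index(match.end())
        yield " ".join(word.lower() for word, _ in tagged_sentence[start:end])

def extract_terms(corpus, min_freq=2):
    """Keep candidates that occur at least min_freq times in the corpus."""
    counts = Counter(term for sentence in corpus for term in candidate_terms(sentence))
    return {term: n for term, n in counts.items() if n >= min_freq}

corpus = [
    [("Natural", "Aj"), ("language", "No"), ("is", "Vb"), ("hard", "Aj")],
    [("We", "Pn"), ("study", "Vb"), ("natural", "Aj"), ("language", "No"),
     ("and", "Cj"), ("parsing", "No")],
]
terms = extract_terms(corpus)
```

The frequency threshold can be swapped for any other score computed from the same counts without changing the pattern-matching step.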
-
Term Normalisation
Purpose
robust processing of queries allowing for error-tolerant matching of query terms and index terms
Resources
list of functional words for each language (optional)
Tools
finite state machines
combinatorial processor
-
Robust Querying of Text Memories
Purpose
robust processing of queries allowing for error-tolerant matching of query terms and index terms
retrieval of appropriate document portions
Resources
list of functional words for each language (optional)
Tools
textual database indexing tool
fuzzy matching tool
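
Error-tolerant matching of query terms against index terms can be illustrated with Python's standard difflib module. This is only a sketch under stated assumptions: the function word list, the whitespace tokenisation, the similarity cutoff and the function names are hypothetical, and the actual indexing and fuzzy matching tools are not described on the slide.

```python
import difflib
from collections import defaultdict

# Hypothetical function word list for English.
FUNCTION_WORDS = {"the", "of", "a", "in"}

def build_index(documents):
    """Map each content word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in FUNCTION_WORDS:
                index[word].add(doc_id)
    return index

def query(index, text, cutoff=0.8):
    """Return ids of documents whose index terms are close to any query term."""
    hits = set()
    for word in text.lower().split():
        if word in FUNCTION_WORDS:
            continue
        # get_close_matches tolerates small spelling differences.
        for term in difflib.get_close_matches(word, list(index), n=3, cutoff=cutoff):
            hits |= index[term]
    return hits

documents = {1: "alignment of parallel texts", 2: "tagging the corpus"}
index = build_index(documents)
misspelled = query(index, "paralel alignmnt")
```

Dropping function words before matching keeps very frequent words from pulling in every document, which is presumably why the function word list is offered as an optional resource.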
-
Multilingual Information Processing
parallel text processing
automatically induce translation equivalencies at different levels
Methodological principles
knowledge acquisition through hybrid (probabilistic & linguistic) corpus-based methods
make an effort to devise methods and tools that are as language independent as possible
To this end:
as little linguistic processing as possible
exploit the power of statistical tools
-
Parallel Text Processing
text alignment: three levels of alignment
- sentence: compact translation ambiguities within sentence; use in translation aid tools for translation or translation example retrieval
- word/term: create candidate bilingual equivalencies; use in translation and cross-lingual IR
- clause: create candidate bilingual equivalent chunks; use in draft translation production
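
To illustrate how candidate bilingual equivalencies at the word/term level can be induced with little linguistic processing, here is a sketch that scores word pairs from already sentence-aligned text with the Dice coefficient. The choice of Dice is an assumption, since the slides do not name the statistical measure, and the English and German toy sentences, the threshold and the function name are invented for the example.

```python
from collections import Counter
from itertools import product

def dice_equivalences(sentence_pairs, threshold=0.8):
    """Score source/target word pairs by co-occurrence in aligned sentences."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    scores = {}
    for (s, t), together in pair_count.items():
        # Dice = 2 * joint count / (count of s + count of t)
        dice = 2 * together / (src_count[s] + tgt_count[t])
        if dice >= threshold:
            scores[(s, t)] = dice
    return scores

aligned = [
    ("the house", "das haus"),
    ("the book", "das buch"),
    ("a house", "ein haus"),
]
equivalences = dice_equivalences(aligned)
```

Because the score depends only on counts over aligned sentences, the same code applies to any language pair, in line with the aim of keeping the methods language independent.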