Supporting Annotation Layers for Natural Language Processing
description
Transcript of Supporting Annotation Layers for Natural Language Processing
Supporting Annotation Layers for Natural Language Processing
Archana Ganapathi, Preslav Nakov, Ariel Schwartz, and Marti Hearst
Computer Science Division and SIMSUniversity of California, Berkeley
Motivation
Most natural language processing (NLP) algorithms make use of the results of previous processing steps, e.g.: Tokenizer Part-of-speech tagger Phrase boundary recognizer Syntactic parser Semantic tagger
No standard way to represent, store and retrieve text annotations efficiently.
MEDLINE has close to 13 million abstracts. Full text starts to become available as well.
Text Annotation Framework
Annotations are stored independently of text in an RDBMS
Declarative query language for annotation retrieval
Indexing structure designed for efficient query processing
Object Oriented API for annotations: insertion, deletion and modification
Key Contributions
Support for hierarchical and overlapping layers of annotation
Querying multiple levels of annotations simultaneously First to evaluate different physical database designs Focused on scaling annotation-based queries to very
large corpora with many layers of annotations We propose a query language and demonstrate its
power and the efficiency of the indexing architecture on a wide variety of query types that have been published in the NLP literature.
Outline
Related Work Layered Query Language Database Design API Evaluation Conclusions
Related Work Annotation graphs (AG): directed
acyclic graph; nodes can have time stamps or are constrained via paths to labeled parents and children. (Bird and Liberman, 2001)
Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined for each pair.(Cassidy&Harrington,2001)
The Q4M query language for MATE: directed graph; constraints and ordering of the annotated components. Stored in XML (McKelvie&al., 2001)
TIQL: queries consist of manipulating intervals of text, indicated by XML tags; supports set operations. (Nenadic et al., 2002)
SELECT IWHERE X.[id:I].Y <- db/wrd X.[:hv].[]*.Y <- db/phn;
Annotation Graphs
Find arcs labeled as words, whose phonetic transcription starts with a “hv“:
[[Phonetic=A -> Phonetic=p] ^ Syllable=S]
Emu
Find sentences of phonetic “A” followed by “p“ both dominated by an “S” syllable:
($a word) ($b word); ($a pos ~ "NN") && ($a <> $b) && ($b # ~ "lesser")
Q4M (MATE system)
Find nouns followed by the word “lesser”:
TIQL (TIMS system)
Find sentences containing the noun phrase “COUP-TF II” and the verb “inhibit”:
(<SENTENCE> <TERM nf=‘COUP TF II’>) <V lemma=‘inhibit’>
Outline
Related Work Layered Query Language Database Design API Evaluation Conclusions
Layers of Annotations
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
Layers of Annotations
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
Layers of Annotations
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
Layers of Annotations
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
NN IN NN VBZ IN JJ JJ NN NN NN CC NN IN NN
NP PP NP VP PP NP NP PP NP
D019254 D044465 D001769 D002477 D003643
D001773
D016923
D007962
24224596 28102012043
PO
SS
ha
llow
pa
rse
On
tolo
gy
Ge
ne
/pro
tein
185 8 51112 23017 7 5874 2791 8952 1263 5632 17 8252 8 12523Wo
rd
Ontology
Gene/protein
Word
Part of Speech
Shallow Parse
Overexpression of Bcl-2 results in insufficient white blood cell death and activation of p53.
D016158
39727642722
Full parse, sentence and section layers are not shown.
Layers of Annotation (cont.)
Each annotation represents an interval spanning a sequence of characters absolute start and end positions
Each layer corresponds to a conceptually different kind of annotation i.e., word, gene/protein, shallow parse can have several layers with the same semantics
Layers can be sequential overlapping
e.g., two multiple-word concepts sharing a word hierarchical
spanning, when the intervals are nested as in a parse tree, or
ontologically, when the token itself is derived from a hierarchical ontology
Layer Type Properties
One-to-one correspondence between the Word and the Part-of-speech (POS) layers. The Word, POS and Shallow parse layers are sequential
The Full parse layer is spanning hierarchical The Gene/protein layer assigns IDs from the LocusLink
database of gene names many-to-one in the case of multiple species
The Ontology layer assigns terms from the hierarchical medical ontology MeSH (Medical Subject Headings) Overlapping (share the word cell) and hierarchical:
both spanning, since blood cell (with MeSH ID D001773) spans cell (which is also in MeSH), and
ontologically, since blood cell is a kind of cell and cell death (D016923) is a type of Biological Phenomena.
Layered Query Language
Requirements for the query language on layers of annotations: Intuitive Compact Declarative Expressive power for real world queries Support for hierarchical and overlapping annotations Compatible with SQL
LQL (Layered Query Language) XML-like Can be translated to SQL to run against an RDBMS Tested on real world bioscience NLP applications
LQL by Example
<shallow_parse [tag_type=NP]>{print}<pos [tag_type=\(]><shallow_parse [tag_type=NP]>{print}<pos [tag_type=\)]>
(Pustejovsky et al., 2001)
(d) Acronym-Meaning Extraction
<shallow_parse [tag_type=NP] <pos [tag_type=noun] ^<mesh [label=G07.553*]>{print}$ > <pos [tag_type=noun] ^<mesh [label=D*]>{print}$ >$ >
(c) Descent of Hierarchy:
D24.185.119.490G07.553.481
D12.776.811.300G07.553.481
D12.776.124.125.500G07.553.481
D12.776.124.050.250G07.553.481
MeSH
label
MeSH
label
(Rosario et al., 2002)
<document <sentence <gene_protein> {print tag_type} ... <pos [tag_type=verb] <word [lex=results]> > ... <gene_protein> {print tag_type} > {print}> {print document.id}
(Blaschke et al., 1999)
(a) Protein-Protein Interaction
<sentence <shallow_parse [tag_type=NP]>{$np1=lex} <pos [tag_type=verb] <word [lex=binds]>> <pos [tag_type=prep] <word [lex=to]>> <shallow_parse [tag_type=NP]>{$np2=lex}> {print $np1, $np2, sentence}
(Thomas et al., 2000)
(b) Protein-Protein Interaction
A01 A07
limb:vein
shoulder: artery
LQL Syntax “< >” Defines an arbitrary range over text. A range is typically restricted to a specific layer type using <layer_name>. All layers have a lex (the text spanned by the range) and a tag_type
attribute. Predicates on attribute values are enclosed in square brackets,
i.e. “<layer_name [attribute_name{ = | ! = | > | > = | < | < = } value]>”. The language supports the boolean operators
conjunction (&&), disjunction (||), and negation (!). By default tokens must follow each other immediately. The ellipses (...) indicate that tokens may intervene in between the
specified ranges. A range is optionally followed by an action statement, enclosed in curly
braces, which binds variables or specifies what should be printed, e.g. “<gene_protein> {print tag_type}”.
With no arguments, print outputs the value of the lex attribute. The two special characters ‘ˆ’ and ‘$’ are used to match the range’s
beginning and end positions, respectively. ‘*’ is used as a wildcard; can be used to descend an ontological hierarchy.
Additional LQL Features For spanning hierarchical layers we can have hierarchical queries with
several nested references to the same layer. The following query finds a PP of the form preposition+NP and prints that NP:
<full_parse [tag_type=PP] ˆ<pos [tag_type=prep]> <full_parse [tag_type=NP]>{print} $>
The keyword noorder allows an arbitrary order for the tokens within a range, e.g.:
<sentence [noorder] <gene_protein>
<pos [tag_type=verb]>> {print sentence}
The language allows for a combination of ordered and unordered constraints. For example,
<sentence [noorder] <gene_protein> ( <pos [tag_type=verb] <word [lex=binds]>> <pos [tag_type=prep] <word [lex=to]>> )> {print sentence}
LQL currently does not support a range overlap operator.
LQL and SQL
LQL can be automatically translated into SQL (although this is not yet implemented), as: user-defined function, or a macro
The result of an LQL query is a relation Thus, allowing the use of standard SQL syntax such as GROUP
BY, COUNT, DISTINCT, ORDER BY, UNION etc. An added advantage of LQL over SQL is that the LQL queries
do not need to be modified, if the underlying logical design is changed.
LQL is still a work in progress; We plan to assess it via usability studies with computational
linguistics researchers, modifying it as necessary. However, we feel it is more intuitive and easier to use for
text processing than the existing languages.
LQL Versus SQL
SELECT nn1.pmid, mt1.tree_number, mt2.tree_numberFROM biotext_annotation_1 npJOIN biotext_annotation_1 nn1 ON np.pmid = nn1.pmid AND np.section = nn1.section AND np.start_char_pos <= nn1.start_char_pos AND np.end_char_pos >= nn1.end_char_posJOIN biotext_annotation_1 nn2 ON np.pmid = nn2.pmid AND np.section = nn2.section AND nn2.start_char_pos > nn1.end_char_pos AND np.start_char_pos <= nn2.start_char_pos AND np.end_char_pos = nn2.end_char_posJOIN biotext_annotation_1 mesh1 ON np.pmid = mesh1.pmid AND np.section = mesh1.section AND nn1.start_char_pos = mesh1.start_char_pos AND nn1.end_char_pos = mesh1.end_char_posJOIN biotext_annotation_1 mesh2 ON np.pmid = mesh2.pmid AND np.section = mesh2.section AND nn2.start_char_pos = mesh2.start_char_pos AND nn2.end_char_pos = mesh2.end_char_pos
JOIN biotext_annotation_mesh_tree mt1 ON mt1.descriptor_ui = mesh1.tag_typeJOIN biotext_annotation_mesh_tree mt2 ON mt2.descriptor_ui = mesh2.tag_typeWHERE nn1.layer_id = 1 AND nn2.layer_id = 1 AND np.layer_id = 3 AND nn1.tag_type in (27,30) AND nn2.tag_type in (27,30) AND np.tag_type = 31 AND mesh1.layer_id = 6 AND mesh2.layer_id = 6 AND NOT EXISTS ( SELECT word.pmid FROM biotext_annotation_1 word WHERE word.layer_id = 0 AND word.pmid = nn1.pmid AND word.section = nn1.section AND word.start_char_pos < nn2.start_char_pos AND word.end_char_pos > nn1.end_char_pos ) AND mt1.tree_number like 'C21%' AND mt2.tree_number like 'G10%'
SQL:
<document <shallow_parse [tag_type=NP] <pos [tag_type=noun] ^<mesh [label=C21*]>{print label}$ > <pos [tag_type=noun] ^<mesh [label=G10*]>{print label}$ >$ >>{print document.id}
LQL:
Outline
Related Work Layered Query Language Database Design API Evaluation Conclusions
Database Design
We evaluated 5 different logical and physical database designs. The basic model is similar to the one of TIPSTER (Grishman,
1996). Each annotation is stored as a record in a relation. Architecture 1 contains the following columns:
1. docid: document ID;
2. section: title, abstract or body text;
3. layer_id: a unique identifier of the annotation layer;
4. start_char_pos: starting character position, relative to particular section and docid;
5. end_char_pos: end character position, relative to particular section and docid;
6. tag_type: a layer-specific token unique identifier. There is a separate table mapping token IDs to entities (the string
in case of a word, the MeSH label(s) in case of a MeSH term etc.)
Database Design (cont.)
Architecture 2 introduces one additional column, sequence_pos, thus defining an ordering for each layer. Simplifies some SQL queries as there is no need for
“NOT EXISTS” self joins, which are required under Architecture 1 in cases where tokens from the same layer must follow each other immediately.
Architecture 3 adds sentence_id, which is the number of the current sentence and redefines sequence_pos as relative to both layer_id and sentence_id. Simplifies most queries since they are often limited to
the same sentence.
Database Design (cont.)
Architecture 4 merges the word and POS layers, and adds word_id assuming a one-to-one correspondence between them. Reduces the number of stored annotations and the
number of joins in queries with both word and POS constraints.
Architecture 5 replaces sequence_pos with first_word_pos and last_word_pos, which correspond to the sequence_pos of the first/last word covered by the annotation. Requires all annotation boundaries to coincide with
word boundaries. Copes naturally with adjacency constraints between
different layers. Allows for a simpler indexing structure.
An Example RelationExample: “Kinase inhibits RAG-1.”
231(NP)39343(s.parse)b3345
259(VP)48413b3345
23154503b3345
21665454506b3345
21077039346(mesh)b3345
23954505b3345
239(prt)39345 (gene)b3345
8998522754501b3345
55608253 (VB)48411 b3345
59571227 (NN)39341 (POS)b3345
8998528998554500b3345
5560825560848410b3345
595712595713934b (body)3345
WORDID
SENTENCE
SEQUENCEPOS
TAGTYPE
ENDCHARPOS
STARTCHARPOS
LAYERID
SECTIONPMID
131(NP)39343(s.parse)b3345
259(VP)48413b3345
33154503b3345
21665454506b3345
11077039346(mesh)b3345
23954505b3345
139(prt)39345 (gene)b3345
8998532754501b3345
55608253 (VB)48411 b3345
59571127 (NN)39341 (POS)b3345
8998538998554500b3345
5560825560848410b3345
5957115957139340 (word)b (body)3345
WORDID
SENTENCE
SEQUENCEPOS
TAGTYPE
ENDCHARPOS
STARTCHARPOS
LAYERID
SECTIONPMID
Basic architecture Added, architecture 3
Added, architecture 2 Added, architecture 4
3
2
1
3
2
1
FIRSTWORDPOS
1
2
3
1
3
1
3
3
2
1
3
2
1
LASTWORDPOS
1
2
3
1
3
1
3
Added, architecture 5
Indexing Structure
Two types of composite indexes: forward and inverted. An index lookup can be performed on any column
combination that corresponds to an index prefix. The forward indexes support lookup based on position
in a given document. The inverted indexes support lookup based on
annotation values (i.e., tag type and word id). Most query plans involve both forward and inverted
indexes Joins statistics would have been useful
Detailed statistics are essential. Standard statistics in DB2 are insufficient.
Records are clustered on their primary key
Indexing Structure (cont.)Architecture Type Columns
Arch 1-4 F *DOCID +SECTION +LAYER_ID +START_CHAR_POS +END_CHAR_POS +TAG_TYPE
Arch 1-4 I LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS
Arch 2 F DOCID +SECTION +LAYER_ID +SEQUENCE POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 2 I LAYER_ID +TAG_TYPE +DOCID +SECTION +SEQUENCE POS +START_CHAR_POS +END_CHAR_POS
Arch 3-4 F DOCID +SECTION +LAYER_ID +SENTENCE +SEQUENCE POS +TAG_TYPE +START_CHAR_POS +END_CHAR_POS
Arch 3-4 I LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +SEQUENCE POS +START_CHAR_POS +END_CHAR_POS
Arch 4 I WORD ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +START_CHAR_POS +END_CHAR_POS +SENTENCE +SEQUENCE POS
Arch 5 F *DOCID +SECTION +LAYER_ID +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS +TAG_TYPE
Arch 5 I LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS +LAST_WORD_POS
Arch 5 I WORD ID +LAYER_ID +TAG_TYPE +DOCID +SECTION +SENTENCE +FIRST_WORD_POS
Outline
Related Work Layered Query Language Database Design API Evaluation Conclusions
API
Java based API allows for simple insertion, deletion and modification of annotations. Need to specify document ID, section, layer
ID, and positional information. Supports editing a collection of annotations
and storing them back to the database. We plan to develop a user interface for
viewing, editing and querying annotations. Not a trivial task, since there are many HCI
issues on how to display annotations effectively.
Outline
Related Work Layered Query Language Database Design API Evaluation Conclusions
Experimental Setup
Annotated 13,504 MEDLINE abstracts Stanford Lexicalized Parser (Klein and Manning, 2003) for
sentence splitting, word tokenization, POS tagging and parsing.
We wrote a shallow parser and tools for gene and MeSH term recognition.
This resulted in 10,910,243 records stored in an IBM DB2 Universal Database Server.
Defined 4 workloads based on variants of queries (a-d).
Workload (a) (b) (c) (d)
#Queries 54 11 50 1
#Results/query 303.4 77.5 1.6 16,701
LQL lines 8 6 5 4
Results
Architecture
Space (MB) 1 2 3 4 5
Data Storage 168.5 168.5 168.5 132.5 136.5
Index Storage 617.0 1,397.0 1,441.0 1,182.0 673.5
Total Storage 785.5 1,565.5 1,609.5 1,314.5 810.0
Workload (a) (b)
Architecture 1 2 3 4 5 1 2 3 4 5
SQL lines 37 37 34 29 29 91 77 75 65 50
# Joins 6 6 6 5 5 12 11 11 9 7
Time (sec) 3.98 4.35 3.59 1.69 1.94 3.88 5.68 5.41 3.85 3.55
Workload (c) (d)
Architecture 1 2 3 4 5 1 2 3 4 5
SQL lines 45 38 38 39 41 59 50 53 53 35
# Joins 7 6 6 6 6 7 7 7 7 4
Time (sec) 17.9 23.42 21.49 30.07 4.06 1,879 1,700 2,182 1,682 1,582
Results (cont.)
Different architectures are optimized for different types of queries.
Architecture 5 performs well (if not best) on all query types, while the other architectures perform poorly on at least one query type.
Storage requirement of Architecture 5 is comparable to that of Architecture 1
Architecture 5 results in much simpler queries
We recommend Architecture 5 in most cases, or Architecture 1, if atomic annotation layer cannot be defined.
Scalability Analysis Combined workload of 3 query types Varying buffer pool sizes
Buffer Pool Size (MB) Elapsed Time (ms) Buffer Read Time (ms)
1000 2300 1050
100 2900 1670
10 4600 3340
1 8300 6250 Suggests that the query execution time grows as a sub-
linear function of memory size. We believe a similar ratio will be observed when increasing
the database size and keeping the memory size fixed Parallel query execution can be enabled after partitioning
the annotation on document_id
Conclusions
Provided a mechanism to effectively store and query layers of textual annotations.
Evaluated various structures for data storage and have arrived at an efficient and simple one.
Used variations of queries drawn from published research, to ensure the real-world applicability.
Presented a concise language (LQL) to express queries that span multiple levels of the annotation structure, which captures the user’s intent better as the syntax is more intuitive and closely resembles the annotation structure.
Future Work
Conduct a usability study to assess the query language.
Automate the LQL to SQL translation process.
Test the scalability of this approach on larger document collections.
References Steven Bird and Mark Liberman. 2001. A formal framework for linguistic
annotation. Speech Communication, 33(1–2):23–60. Steve Cassidy and Jonathan Harrington. 2001. Speech annotation and
corpus tools. Speech Communication, 33(1–2):61–77. David McKelvie, Amy Isard, Andreas Mengel, Morten B. Moller, Michael
Grosse and Marion Klein. 2001. Speech annotation and corpus tools. Speech Communication, 33(1–2):97–112.
Goran Nenadic, Hideki Mima, Irena Spasic, Sophia Ananiadou and Jun-ichi Tsujii. 2002. Terminology-Driven Literature Mining and Knowledge Acquisition in Biomedicine. International Journal of Medical Informatics, 67:33–48.
Ralph Grishman. 1996. Building an Architecture: a CAWG Saga. Advances in Text Processing: Tipster Program Phase II, Morgan Kaufmann, 1996.
Steve Cassidy. 1999. Compiling Multi-tiered Speech Databases into the Relational Model: Experiments with the Emu System. 6th European Conference on Speech Communication and Technology Eurospeech 99, 2127–2130, Budapest, Hungary.
Xiaoyi Ma, Haejoong Lee, Steven Bird and Kazuaki Maeda. 2002. Models and Tools for Collaborative Annotation. Third International Conference on Language Resources and Evaluation, 2066–2073.
Thank You
Questions and
constructive comments
are welcomed
http://biotext.berkeley.edu