Ulrich Schäfer - DFKI Language Technology Lab, DELPH-IN Summit Fefor 06/2006: Heart of Gold Tutorial...
Heart of Gold Tutorial
An XML-based middleware for the integration of deep and shallow
natural language processing components
Ulrich Schäfer
DFKI Language Technology Lab
Museu dos Coches, Lisbon
Talk outline
history
middleware
application clients
modules
PET input chart
transformation service
practical tour, configuration
SDL cascades
visualization gadgets
web page
Heart of Gold History
roots in Whiteboard (2000-2003)
  WHAM
  shallow XML standoff annotation, XSLT ("WHAT"), PET extensions, pipeline integration
  API-based, focus on German
yy extensions to PET (~2001)
DeepThought: Heart of Gold (2002-2004)
  multilinguality, RMRS output
  flexible configuration, networking
  fallback to shallow if deep fails
extensions in QUETAL (2003-2005)
  SDL (sub-architectures with loops, parallelism)
  automatic stylesheet generation (NER, RMRS)
  new modules (Sleepy, TreeTagger), ontology interface
Heart of Gold
[Diagram labels: Application; Queries; Results; NLP components (deep parser, tagger, named entity recognizer, ...)]
Heart of Gold
[Diagram labels: Application; Queries; Results; MIDDLEWARE; NLP components]
Heart of Gold
[Diagram labels: Application; Queries; Results; XML-RPC; MIDDLEWARE; Module Communication Manager; Modules; Computed annotations (XML, RMRS); External, persistent annotation database; NLP components]
Heart of Gold
[Diagram labels: Application; Queries; Results; Module Communication Manager; Modules; TransformationService; Computed annotations (XML, RMRS); External, persistent annotation database; External NLP components]
Application Clients
open a session with a configuration of active modules
each query ("analyze") has parameters
  session ID
  input text
  depth of deepest analysis requested (e.g. 10 for tokens, 40 for NER, 100 for PET)
  language code
client gets result of deepest analysis as answer, other analyses on request
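The four query parameters above can be packed into an XML-RPC call. The following is a minimal sketch using Python's standard xmlrpc.client; the method name hog.analyze and the parameter order are assumptions made for illustration and are not taken from the Heart of Gold documentation. The sketch only serializes the request, so it runs without a server.

```python
import xmlrpc.client


def build_analyze_request(session_id, text, depth, language):
    """Serialize one 'analyze' query as an XML-RPC method call.

    The method name 'hog.analyze' and the parameter order are
    hypothetical; check the server documentation for the real API.
    """
    params = (session_id, text, depth, language)
    return xmlrpc.client.dumps(params, methodname="hog.analyze")


# Request the deepest analysis (depth 100, i.e. PET) for an English sentence.
request_xml = build_analyze_request("session0001", "Kim sleeps.", 100, "en")
print(request_xml)
```

Against a running server, the same parameters would be passed through an xmlrpc.client.ServerProxy object instead of being serialized by hand.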
Processing Strategy
Shallowest component first (e.g. tokenizer).
Then other components with increasing depth, up to requested depth.
Fallback to result of previous component if no result from component with requested depth.
Each component gets the output of previous component as input plus the output from other components if configured.
The result of the query is the result of the deepest component in the sequence.
Analysis results from other components are returned on request.
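The strategy above can be sketched as a loop over components sorted by depth. This is a simplified illustration, not the middleware's actual scheduler: each component is modelled as a (name, depth, function) tuple, a function returns None when it produces no result, and only the previous component's output is passed on (additional configured inputs are left out). All component names and functions are hypothetical stand-ins.

```python
def run_pipeline(components, text, requested_depth):
    """Run components in order of increasing depth, up to requested_depth.

    Returns the name of the deepest component that produced a result
    and a dict with the results of all components that succeeded.
    """
    results = {}
    current = text
    deepest = None
    for name, depth, analyze in sorted(components, key=lambda c: c[1]):
        if depth > requested_depth:
            break
        output = analyze(current)
        if output is None:
            # Fallback: keep the result of the previous component.
            continue
        results[name] = output
        current = output
        deepest = name
    return deepest, results


# Hypothetical stand-ins for real NLP components.
def tokenize(text):
    return text.split()


def tag(tokens):
    return [(token, "TAG") for token in tokens]


def failing_parser(tagged):
    return None  # simulates a deep parser that finds no analysis


components = [
    ("parser", 100, failing_parser),
    ("tokenizer", 10, tokenize),
    ("tagger", 20, tag),
]
deepest, results = run_pipeline(components, "Kim sleeps", 100)
print(deepest, results[deepest])
```

Because the parser returns nothing, the query falls back to the tagger result, while the tokenizer result stays available on request through the results dict.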
Annotation Storage
Session
  Annotation collection (1 per input text)
    Standoff annotations (analyses computed by components)
XML standoff annotation and/or RMRS in Main Memory, XML:DB, File System
Modules
modules are adapters to external NLP components (PET, tagger, NER, ...)
connection direct (e.g. process streams) or via XML-RPC
depth, language, name are mandatory configuration properties
input is output from previous module, alternative and additional input configurable
XML output mandatory (RMRS generation optional, e.g. via XSLT stylesheet)
Sample Module Configuration: PETno
# configuration file for PET module
#
module.name=PET
module.depth=100
module.language=no
#
# root element name for XML output
module.rootelement=pet
#
# common modules settings end here -----
#
# path to cheap binary
pet.binary=components/pet/bin/cheap
#
# additional library search path for cheap
pet.libs=components/pet/lib
#
# working directory (where the grammar is)
pet.grammardir=components/pet/norwegian
#
# prefix for grammar file
pet.grammarprefix=norsourcepet
#
# command line options for cheap
pet.options=-mrs=xml -limit=30000 -nsolutions=1
#
# character set encoding for PET input
pet.inputencoding=ISO-8859-1
#
# character set encoding for PET output
pet.outputencoding=ISO-8859-1
#
# input annotation(s), comma-separated
# (for use in conjunction with PIC mode)
# use "rawtext" for raw input text.
# omitting/empty value means take input from
# previous component (XML)
pet.inputannotation=rawtext
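A configuration file of this kind can be read and checked for the three mandatory properties (name, depth, language) with a few lines of code. The sketch below is an illustration only: it handles plain key=value lines and # comments, not the full Java properties syntax, and the function names are hypothetical.

```python
REQUIRED = ("module.name", "module.depth", "module.language")

SAMPLE = """\
# configuration file for PET module
module.name=PET
module.depth=100
module.language=no
# command line options for cheap
pet.options=-mrs=xml -limit=30000 -nsolutions=1
pet.inputannotation=rawtext
"""


def parse_properties(text):
    """Parse simple key=value lines; blank lines and # comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_module_config(props):
    """Raise ValueError if a mandatory property is missing; return the depth."""
    missing = [key for key in REQUIRED if not props.get(key)]
    if missing:
        raise ValueError("missing mandatory properties: " + ", ".join(missing))
    return int(props["module.depth"])


props = parse_properties(SAMPLE)
print(check_module_config(props), props["pet.options"])
```

Splitting at the first equals sign matters here, because option strings such as -mrs=xml contain further equals signs.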
Integrated Components

Name         Purpose                    Depth  Language resources         Implemented in
JTok         tokenizer, SBR             10     en, de, it                 Java
TnT          stat. PoS tagger           20     en, de                     C
ChaSen       tagger, segmentation       20     ja                         C
TreeTagger   tagger                     20     de, en, fr, it, es...      C
Chunkie      stat. chunker              30     en, de                     C
ChunkieRmrs  RMRS of chunks             35     en, de                     XSLT, XTDL, SDL
SProUT       morph, IE/NER              40     en, de, el, fr, es, ja...  Java
LingPipe     NER, coreference resolver  40     en, es, ...                Java
Corcy        coreference resolver       45     en                         Python
RASP         stat. parser               50     en                         Lisp
Sleepy       stat. parser               50     de                         OCaml
PET          deep parser                100    en, de, el, ja, [it, no]   C, C++
SDL          sub-architectures          n      -                          Java
RMRSmerge    merge RMRSes               110    -                          XSLT, SDL
Simple Integration of new Components
1. Subclass Module
2. Implement init(), process() and shutdown()
3. Use e.g. XSL transformation to generate RMRS output (cf. TnT, SProUT integration)
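The three steps above can be illustrated with a small analogue. The middleware's own Module class is not shown on the slides, so the base class below is a hypothetical Python stand-in that only mirrors the three method names init(), process() and shutdown(); the toy tokenizer module and its XML element names are likewise invented for this sketch.

```python
from abc import ABC, abstractmethod
from xml.sax.saxutils import escape


class Module(ABC):
    """Hypothetical stand-in for the middleware's module base class."""

    def __init__(self, name, depth, language):
        # The three mandatory configuration properties of a module.
        self.name = name
        self.depth = depth
        self.language = language

    @abstractmethod
    def init(self):
        """Start or connect to the wrapped component."""

    @abstractmethod
    def process(self, input_annotation):
        """Return the component's analysis as an XML string."""

    @abstractmethod
    def shutdown(self):
        """Release the wrapped component."""


class WhitespaceTokenizerModule(Module):
    """Toy module: splits text at whitespace and emits XML (mandatory output)."""

    def init(self):
        self.ready = True

    def process(self, input_annotation):
        items = "".join(
            "<token>" + escape(token) + "</token>"
            for token in input_annotation.split()
        )
        return "<tokens>" + items + "</tokens>"

    def shutdown(self):
        self.ready = False


module = WhitespaceTokenizerModule("ToyTokenizer", 10, "en")
module.init()
print(module.process("Kim sleeps"))
module.shutdown()
```

A real module would additionally hand its XML output to a stylesheet if RMRS output is wanted, as step 3 suggests.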
Annotation Metadata
<metadata acid="collection0002" component="PET"
          created="Do, 4 Dez 2003 18:25:16 +0100"
          processingtime="00:08,140" sessionid="session0001"
          diagnosis="OK">
  <conf name="pet.cfg">
    <entry name="module.rootelement" value="pet"/>
    <entry name="module.language" value="en"/>
    <entry name="module.depth" value="100"/>
    <entry name="pet.grammarprefix" value="english"/>
    <entry name="pet.options" value="-mrs=xml"/>
    <entry name="pet.inputencoding" value="ISO-8859-1"/>
    <entry name="pet.outputencoding" value="ISO-8859-1"/>
    <entry name="pet.inputannotation" value="rawtext"/>
  </conf>
</metadata>
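A client can read such a metadata record with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree on a shortened copy of the record; it assumes that diagnosis is an attribute of the metadata element, and the function name read_metadata is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the metadata record, with diagnosis as an attribute.
METADATA = """\
<metadata acid="collection0002" component="PET"
          processingtime="00:08,140" sessionid="session0001"
          diagnosis="OK">
  <conf name="pet.cfg">
    <entry name="module.language" value="en"/>
    <entry name="module.depth" value="100"/>
    <entry name="pet.inputannotation" value="rawtext"/>
  </conf>
</metadata>
"""


def read_metadata(xml_text):
    """Return (attributes of <metadata>, configuration entries) as two dicts."""
    root = ET.fromstring(xml_text)
    entries = {entry.get("name"): entry.get("value") for entry in root.iter("entry")}
    return dict(root.attrib), entries


attrs, conf = read_metadata(METADATA)
print(attrs["component"], attrs["diagnosis"], conf["module.depth"])
```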
PET XML Input Chart ('PIC', 'PiXML')
generalisation and extension of yy input mode (cf. example; DTD in HoG doc)
TnT-, ChaSen-, SproutModule adapted to generate PiXML as additional annotation
XML-wise 'concatenation' of n input charts via XSLT stylesheet
PicModule for text input without PoS tagger
PET XML Input Chart
conf/en/pet.cfg:
# input annotations, comma-separated
pet.inputannotation=TnTpiXML,SProUTpiXML
# stylesheet for XML chart combination
pet.combinestylesheet=xsl/pic/combinepixml.xsl
#
# stylesheet for preprocessing the PET input chart (opt.)
pet.preprocstylesheet=xsl/pic/remove-subspan-items.xsl
[Figures: example input charts TnTpiXML and SProUTpiXML]
Heart of Gold
[Diagram labels: Application; Queries; Results; Module Communication Manager; Modules; Computed annotations (XML, RMRS); External, persistent annotation database; External NLP components]
with TransformationService
[Diagram labels: Application; Queries; Results; Module Communication Manager; Modules; TransformationService; Computed annotations (XML, RMRS); External, persistent annotation database; External NLP components]
TransformationService
Central XSLT class with access to computed Heart of Gold annotations via special URI:
URI syntax (in XPath): document(hog://sid/acid/aid)/PATH/TO/ELEMENT
where sid = session ID, acid = annotation collection ID, aid = annotation ID
[Diagram: Session; Annotation collection (1 per input text); Standoff annotations (analyses computed by components)]
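The hog:// address scheme above can be sketched as a lookup in a three-level store: session, annotation collection, annotation. This is an illustration under assumptions, not the TransformationService code: the nested dict stands in for the annotation storage, and the function names and sample content are hypothetical.

```python
def parse_hog_uri(uri):
    """Split 'hog://sid/acid/aid' into (session ID, collection ID, annotation ID)."""
    prefix = "hog://"
    if not uri.startswith(prefix):
        raise ValueError("not a hog:// URI: " + uri)
    parts = uri[len(prefix):].split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected hog://sid/acid/aid, got: " + uri)
    return tuple(parts)


def resolve(store, uri):
    """Look up one annotation in a nested dict: store[sid][acid][aid]."""
    sid, acid, aid = parse_hog_uri(uri)
    return store[sid][acid][aid]


# Hypothetical in-memory store: one session, one collection, two annotations.
store = {
    "session0001": {
        "collection0002": {
            "rawtext": "Kim sleeps.",
            "PET": "<pet>...</pet>",
        }
    }
}
print(resolve(store, "hog://session0001/collection0002/PET"))
```

Inside a stylesheet, the same three identifiers would appear in the argument of the XPath document() call, followed by the path to the element of interest.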
XSLT for Component Integration
post-processing of SProUTput:
  1. PET input chart generation with mapping to generic HPSG NE types
  2. RMRS generation
both stylesheets generated automatically at compile time from TDL type hierarchies of SProUT named entity grammars
[Figure: excerpt of a generated stylesheet (5500 more lines...)]
XSLT for Component Integration
[Figure: three XML examples: SProUTput; only NE span and type information for PET; IE-like structured RMRS output for application]
Sample Module Configuration: PETen
# configuration file for PET module
#
module.name=PET
module.depth=100
module.language=en
#
# root element name for XML output
module.rootelement=pet
#
# common modules settings end here -----
#
# path to cheap binary
pet.binary=components/pet/bin/cheap
#
# additional library search path for cheap
pet.libs=components/pet/lib
#
# working directory (where the grammar is)
pet.grammardir=components/pet/erg
#
# prefix for grammar file
pet.grammarprefix=english
#
# command line options for cheap
pet.options=-xml_counts -mrs=xml -default-les -limit=30000 -nsolutions=2
#
# character set encoding for PET input
pet.inputencoding=UTF-8
#
# character set encoding for PET output
pet.outputencoding=UTF-8
#
# input annotation(s), comma-separated
# (for use in conjunction with yy mode)
# use "rawtext" in conjunction with non-yy mode
# omitting/empty value means take input from
# previous component (XML)
pet.inputannotation=TnTpiXML,SProUTpiXML
#pet.inputannotation=rawtext
#
# stylesheet for XML input chart combination
pet.combinestylesheet=xsl/pic/combinepixml.xsl
#
# stylesheet for preprocessing the input chart
# no transformation if unset
#pet.preprocstylesheet=xsl/pic/remove-subspan-items.xsl
#
# stylesheet for postprocessing fragments
# return only the n longest fragments
# unset=return all (=no stylesheet application)
pet.postprocstylesheet=xsl/rmrs/extract-longest-fragment.xsl
#
# stylesheet parameter: number of fragments to return
# unset=return all (=no stylesheet application)
pet.postprocfragments=5
Heart of Gold tour
download/installation
prerequisites: Python, Java, Mozilla/Firefox
directory structure below hog/
ISO 639 language codes
Logging: log4j configuration
Heart of Gold configuration files in conf/
  XML-RPC server and ant configuration: conf/mocoman.cfg
    -> logging configuration in conf/logging/
  session configuration in conf/en/
    -> module configurations in conf/en/
    -> component configurations in components/XXXX/YYY
Starting and stopping server, using clients
SDLModule
Generic module that plugs SDL (Krieger '03) sub-architectures into the Heart of Gold
Generic SProUT and XSLT SDL modules implemented (SProUT grammars and XSLT stylesheets via configuration)
Access to other (computed) Heart of Gold annotations via TransformationService
Application: RMRS construction from chunks
Can also serve as 'standalone' SProUT wrapper for shallow cascades
Heart of Gold Schema with SDLModule
[Diagram labels: Application; Queries; Results; Module Communication Manager; Modules; SDLModule; Compiled SDL sub-architecture(s); SDL XsltModules; SDL SproutModules; TransformationService; Computed annotations (XML, RMRS); External, persistent annotation database; External NLP components]
ChunkieRMRS cascade within HoG
ChunkieRMRS (SDL-defined module)
Constraint-Based RMRS Construction from Shallow Grammars (Frank et al. 2004)
SDL definition of SProUT-XSLT cascade for ChunkieRMRS
de.dfki.lt.quetal.sdlgen.chunkiermrs_de = ( sprout_rmrs_pos + xslt_morph_filter + sprout_rmrs_lex + xslt_nodeid_cat + sprout_rmrs_comp + sprout_rmrs_final + xslt_fsxml2rmrsxml + xslt_reorder )
sprout_rmrs_pos = de.dfki.lt.sdl.sprout.SproutModulesTextXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-pos/rmrs/rmrs-pos.cfg", "SDLs-RMRS-pos")
xslt_morph_filter = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/morphposfilter.xsl", "SDLx-Morph-filter", "aid", "Chunkie")
sprout_rmrs_lex = de.dfki.lt.sdl.sprout.SproutModulesXmlXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-lex/rmrs/rmrs-lex.cfg", "SDLs-RMRS-lex")
xslt_nodeid_cat = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/nodeinfo.xsl", "SDLx-Node-info", "aid", "Chunkie")
sprout_rmrs_comp = de.dfki.lt.sdl.sprout.SproutModulesXmlXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-cascade/rmrs/rmrs-cascade.cfg", "SDLs-RMRS-casc")
sprout_rmrs_final = de.dfki.lt.sdl.sprout.SproutModulesXmlXmlEncapsulated ("components/sdl/chunkiermrs/rmrs-final/rmrs/rmrs-final.cfg", "SDLs-RMRS-final")
xslt_fsxml2rmrsxml = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/rmrsfs2rmrsxml.xsl", "SDLx-RMRS-2dtd")
xslt_reorder = de.dfki.lt.sdl.xslt.XsltModulesStringStringEncapsulated ("components/sdl/chunkiermrs/reorderrmrsdtrs.xsl", "SDLx-RMRS-reorder")
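The first line of the definition above chains eight modules with +. Assuming that + denotes sequential composition (the output of one module becomes the input of the next), the idea can be sketched in a few lines. The Stage class and the three toy stages are hypothetical and unrelated to the classes the SDL compiler generates.

```python
class Stage:
    """Hypothetical processing stage; '+' chains two stages in sequence."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def __add__(self, other):
        # The combined stage feeds this stage's output into the next stage.
        return Stage(
            "(" + self.name + " + " + other.name + ")",
            lambda data: other.func(self.func(data)),
        )

    def __call__(self, data):
        return self.func(data)


# Toy stages standing in for SProUT and XSLT modules.
strip = Stage("strip", str.strip)
lower = Stage("lower", str.lower)
split = Stage("split", str.split)

cascade = strip + lower + split
print(cascade.name)
print(cascade("  Kim Sleeps "))
```

The sub-architectures with loops and parallelism mentioned on the history slide would need further operators, which this sketch does not attempt to model.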
Configuration of ChunkieRMRS SdlModule
# configuration file for Chunkie RMRS module (SDL)
#
module.name=ChunkieRmrs
module.depth=35
module.language=en
# root element name for XML output
module.rootelement=chunkiermrs
# ----- common modules settings end here -----
# name of input annotation (raw text for first cascade/SProUT)
sdl.inputannotation=rawtext
# class name of compiled SDL definition
# (same as class name at beginning of .sdl file)
# can be compiled using 'ant chunkiermrs'
sdl.classname=de.dfki.lt.hog.sdlgen.chunkiermrs_en
SDL definition of RmrsMerge cascade
XSLT Stylesheets developed by Anette Frank
de.dfki.lt.hog.sdlgen.rmrsmerge = ( rmrs_ep_rargs2rels + adjust_nespans + merge_ne_to_petrasp + rmrs_rels2ep_rargs + reorder_rmrs_dtrs )
rmrs_ep_rargs2rels = de.dfki.lt.sdl.xslt.XsltModulesStringDomEncapsulated ("xsl/sdl/rmrsmerge/rmrs_ep_rargs2rels.xsl", "SDLx_rargs2rels")
adjust_nespans = de.dfki.lt.sdl.xslt.XsltModulesDomDomEncapsulated ("xsl/sdl/rmrsmerge/adjust_nespans.xsl", "SDLx_adjustnespans", "aid", "Sprout")
merge_ne_to_petrasp = de.dfki.lt.sdl.xslt.XsltModulesDomDomEncapsulated ("xsl/sdl/rmrsmerge/merge-ne-to-rasp.xsl", "SDLx_netorasp", "aid", "Sprout")
rmrs_rels2ep_rargs = de.dfki.lt.sdl.xslt.XsltModulesDomDomEncapsulated ("xsl/sdl/rmrsmerge/rmrs_rels2ep_rargs.xsl", "SDLx_rels2rargs")
reorder_rmrs_dtrs = de.dfki.lt.sdl.xslt.XsltModulesDomStringEncapsulated ("xsl/sdl/rmrsmerge/reorderrmrsdtrs.xsl", "SDLx_reorderdtrs", "aid", "xmltext")
Configuration of RmrsMerge module
# configuration file for RmrsMerge module (SDL)
#
module.name=RmrsMerge
module.depth=110
module.language=en
# root element name for XML output
module.rootelement=merged-rmrs
# ----- common modules settings end here -----
# name of input annotation (PET or RASP)
sdl.inputannotation=PET
# class name of compiled SDL definition
# (same as class name at beginning of .sdl file)
# can be compiled using 'ant rmrsmerge'
sdl.classname=de.dfki.lt.hog.sdlgen.rmrsmerge
Visualization Gadgets
HTML (generic XML, RMRS) xsl/html/xml2html.xsl, rmrs2html.xsl
AVM (generic XML, SProUTput): applet part of SProUT runtime
LaTeX (FS-XML, SProUTput, RMRS) fs2latex tool, xsl/latex/rmrs2latex.xsl
Complete PHP-based Webdemo portal is part of Heart of Gold CVS
Documentation, Papers, Downloads
core middleware is LGPL
different licences for (externally developed) components
http://heartofgold.dfki.de
http://lists.delph-in.net