GATE, SWAN and Semantic TV
http://gate.ac.uk/
Hamish Cunningham, Department of Computer Science, University of Sheffield


Contents
1. Language Technology and the Knowledge Economy
2. Information Extraction and Ontology-Based IE
3. GATE: infrastructure and IE in practice
4. Three examples: parallel data mining; digital libraries; video indexing
5. SWAN: OBIE meets the Web
6. Semantic TV

The Knowledge Economy and Human Language

Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language

A contradiction:
• to deal with the information deluge we need formal knowledge in semantics-based systems
• our communication culture is in informal and ambiguous natural language

The challenge: to reconcile these two phenomena

HLT: Closing the Loop

[Diagram: Human Language and Formal Knowledge (ontologies and instance bases), linked by (A)IE, CLIE, OIE, (M)NLG and Controlled Language; further labels: Semantic Web; Semantic Grid; Semantic Web Services.]

Key:
• MNLG: Multilingual Natural Language Generation
• OIE: Ontology-aware Information Extraction
• AIE: Adaptive IE
• CLIE: Controlled Language IE


Information Extraction
• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
• Progress driven by quantitative measures
• MUC: Message Understanding Conferences
• ACE: Automatic Content Extraction
• CoNLL: Conference on Computational Natural Language Learning
• Pascal (2005): ontology-based IE

Conventional IE Example

“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”

• ST (scenario template): rocket launch event with various participants
• NE (named entities): "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
• CO (coreference): "it" = rocket; "Dr. Head" = "Dr. Big Head"
• TE (template elements): the rocket is "shiny red" and Head's "brainchild".
• TR (template relations): Dr. Head works for We Build Rockets Inc.

Ontology-based IE

XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …

[Diagram: Ontology & KB. Classes: Company, Location, City, Country. Properties: type, HQ, establOn, partOf. Instances and values: XYZ, London, UK, Bulgaria, “03/11/1978”.]

Ontology-Based IE (OBIE)
• Conventional IE tags selected segments of text whenever that text represents the name of an entity
• OBIE: view entities as mentions of the underlying instances from the ontology
• Identify which mentions in the text refer to which instances in the ontology
• Add new instances if needed
• Identify instances of attributes and relations
  – take into account what is allowed given the ontology, using domain and range as constraints
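
The domain and range constraint in the last bullet can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy class hierarchy, the two properties and the helper names `is_a` and `allowed` are hypothetical and are not taken from GATE or from any particular ontology.

```python
from dataclasses import dataclass

# Hypothetical toy ontology: class -> parent class (None for the root).
SUBCLASS_OF = {
    "Entity": None,
    "Company": "Entity",
    "Location": "Entity",
    "City": "Location",
    "Country": "Location",
}

# Hypothetical properties with (domain, range) class constraints.
PROPERTIES = {
    "HQ": ("Company", "City"),
    "partOf": ("Location", "Location"),
}


def is_a(cls, ancestor):
    """Return True if cls is ancestor or one of its descendants."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF[cls]
    return False


@dataclass(frozen=True)
class Instance:
    name: str
    cls: str


def allowed(prop, subject, obj):
    """Accept a candidate relation only if it respects domain and range."""
    domain, range_ = PROPERTIES[prop]
    return is_a(subject.cls, domain) and is_a(obj.cls, range_)


xyz = Instance("XYZ", "Company")
london = Instance("London", "City")
uk = Instance("UK", "Country")

print(allowed("HQ", xyz, london))     # a Company with a City as HQ
print(allowed("partOf", london, uk))  # City and Country are both Locations
print(allowed("HQ", london, xyz))     # wrong way round: rejected
```

A filter of this kind would run after candidate relations have been proposed, discarding those the ontology does not permit.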

Classes, instances & metadata

“Gordon Brown met George Bush during his two day visit.”

[Diagram: class hierarchy (Entity; Person; Job-title: president, chancellor, minister) with instances G. Brown and Bush, shown as classes+instances before and after.]

<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
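
As a rough illustration of how metadata of this shape could be produced, the sketch below locates each mention in the text and emits one annotation per mention. The function name `build_metadata` and the `example.org` URIs are hypothetical stand-ins for the elided values on the slide, and the offsets are plain character offsets computed from the sentence, so they need not match the illustrative numbers shown above.

```python
import xml.etree.ElementTree as ET


def build_metadata(doc_id, text, mentions):
    """Build a <metadata> element.

    mentions is a list of (surface string, class URI, instance URI)
    tuples, given in the order in which they occur in the text.
    """
    root = ET.Element("metadata")
    ET.SubElement(root, "DOC-ID").text = doc_id
    search_from = 0
    for surface, cls, inst in mentions:
        start = text.find(surface, search_from)
        if start < 0:
            raise ValueError(f"mention not found: {surface!r}")
        end = start + len(surface)
        ann = ET.SubElement(root, "Annotation")
        ET.SubElement(ann, "s_offset").text = str(start)
        ET.SubElement(ann, "e_offset").text = str(end)
        ET.SubElement(ann, "string").text = surface
        ET.SubElement(ann, "class").text = cls
        ET.SubElement(ann, "inst").text = inst
        search_from = end
    return root


text = "Gordon Brown met George Bush during his two day visit."
# Hypothetical URIs; the slide elides the real ones.
person = "http://example.org/onto#Person"
metadata = build_metadata(
    "http://example.org/1.html",
    text,
    [
        ("Gordon Brown", person, "http://example.org/onto#Person12345"),
        ("George Bush", person, "http://example.org/onto#Person67890"),
    ],
)
offsets = [
    (int(a.findtext("s_offset")), int(a.findtext("e_offset")))
    for a in metadata.findall("Annotation")
]
print(ET.tostring(metadata, encoding="unicode"))
```

Linking both documents' "Gordon Brown" mentions to the same instance URI is what lets the knowledge base accumulate facts about one person across documents.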

Classes, instances & metadata (2)

“Gordon Brown met Tony Blair to discuss the university tuition fees.”

[Diagram: class hierarchy (Entity; Person; Job-title: president, chancellor, minister) with instances G. Brown, G. Bush and T. Blair, shown as classes+instances before and after.]

<metadata>
  <DOC-ID>http://… 2.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 30 </e_offset>
    <string>Tony Blair</string>
    <class>…#Person</class>
    <inst>…#Person26389</inst>
  </Annotation>
</metadata>

Challenges for IE for SemWeb
• Portability – different and changing ontologies
• Different text types – structured, free, etc.
• Utilise ontology information where available
• Train from small amount of annotated text
• Output results wrt the given ontology
  – bridge the gap demonstrated in S-CREAM
• Learn/Model at the right level
  – ontologies are hierarchical and data will get sparser the lower we go
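
One way to read "learn/model at the right level" is as a back-off up the class hierarchy when a class has too few training examples. The sketch below is only an illustration of that idea: the hierarchy, the counts and the `min_examples` threshold are invented, and the slide does not say how the level is actually chosen.

```python
from collections import Counter

# Hypothetical class hierarchy: class -> parent (None for the root).
PARENT = {
    "Entity": None,
    "Person": "Entity",
    "Politician": "Person",
    "Chancellor": "Politician",
    "President": "Politician",
}


def subtree_counts(direct_counts):
    """Sum example counts over each class and all of its descendants."""
    totals = Counter()
    for cls, n in direct_counts.items():
        node = cls
        while node is not None:
            totals[node] += n
            node = PARENT[node]
    return totals


def modelling_level(cls, totals, min_examples):
    """Walk up from cls until a class has enough examples (or the root)."""
    node = cls
    while PARENT[node] is not None and totals[node] < min_examples:
        node = PARENT[node]
    return node


# Invented numbers of annotated examples attached directly to each class.
direct = {"Chancellor": 3, "President": 40, "Politician": 10, "Person": 500}
totals = subtree_counts(direct)

print(modelling_level("Chancellor", totals, min_examples=20))
print(modelling_level("President", totals, min_examples=20))
```

With these numbers the sparse class backs off to its parent, while the well-populated class is modelled directly.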

Deploying IE

[Chart: performance level (30%, 80%, 90%, 100%, with "acceptable accuracy" marked) against task complexity (simple to complex: bag-of-words, entities, relations, events) and specificity (general to domain specific).]

Domain specificity vs. task complexity: a necessary trade-off


Software lifecycle in collaborative research

1. Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.

2. Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.

3. Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.

4. Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").

5. Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the World over (or at least the European business travel industry).

Infrastructure and Science
• Physicists have supercolliders; medics have MRI scanners; HLT researchers have.... Perl?
• Other relevant trends:
  – EU funds multi-site collaborative projects
  – Realisation of role of engineering in scalability, reusability, and portability
  – Support for large data, in multiple media, languages, formats, and locations
  – Promotion of quantitative evaluation metrics
• Hence GATE, a General Architecture for Text Engineering (est. 1995)

GATE, a General Architecture for Text Engineering is...
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, a graphical development environment.

GATE comes with...
• Free components, and wrappers for other people's
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL) at http://gate.ac.uk/download/
• Used by thousands of people at hundreds of sites

A bit of a nuisance (GATE users)

GATE team projects.

Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• ACE / TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging
• EMILLE: S. Asian languages corpus
• hTechSight: chemical eng. K. portal

Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• SEKT Semantic Knowledge Technology
• PrestoSpace MM Preservation/Access
• KnowledgeWeb Semantic Web
• ETCSL Sumerian Digital Library
• ENIRAF, MMKM networks

Future:
• New eContent project LIRICS

Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• DERI, Stanford, Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...

GATE – components and services
• HLT systems composed of components
• GATE versions:
  – v1: dynamic loading of shared object libraries with Tcl wrappers
  – v2, v3: Java beans with URL loading, XML metadata, produce web services externally
  – v4: core web services (both produce and consume), new LIRICS project out of ISO TC37/SC4 (link up with SWS in SDK?)

GATE – infrastructure for semantic metadata extraction
• Combines learning and rule-based methods (new work on mixed-initiative learning)
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
• Supports ontologies as part of IE applications - Ontology-Based IE
• Supports languages from Hindi to Chinese, Italian to German
• Used in OntoText KIM, SDK, Text2Onto, ...


Example 1: Massively Parallel Clustering and Classification

• D2K (Data 2 Knowledge): data mining / machine learning with visual programming development tool

• T2K: library of text processing modules built on D2K
• Integrates data mining methods for prediction, discovery, and deviation detection, with information visualization tools
• Offers a visual programming environment.
• Distributed computing / parallel processing facilities.
• From NCSA: http://alg.ncsa.uiuc.edu/do/tools/t2k

T2K demos; EmailClassification_GATE

Email classification results

GATE_IE_VIZ demo with component information

Entities extracted from news corpus

Document clustering using GATE feature extraction

Dendrogram of the clustering results

Example 2: Digital Libraries
• Greenstone:
  – Digital Library with automated ingestion, structuring and indexing
  – Full text and fielded search (Dublin Core)
  – GATE-based entity tagging
  – From Maori to Arabic, Russian to Chinese
  – UNESCO’s Information for All Programme
• Perseus:
  – One of the oldest and biggest humanities DLs
  – Provides rich interlinking of related resources
  – Models time and space via materials dates and locations
  – GATE-based automated hyperlinking etc.

Greenstone

Perseus
http://www.perseus.tufts.edu/
Time-line and geographic visualisation

Example 3: the MUMIS project
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• An important experimental result: multiple sources for same events can improve extraction quality
  – PrestoSpace applications in news and sports archiving

Semantic Query

Not “goal Beckham” (includes e.g. missed goals, or “this was not a goal”)

Instead: “goal events with scorer David Beckham”
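
The difference between the two queries can be sketched with a few invented event records. The `Event` structure, its field names and the sample commentary are all hypothetical; MUMIS's actual index format is not described on the slide.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str         # event type assigned by IE, e.g. "goal" or "missed_goal"
    player: str       # main participant (the scorer, for goal events)
    source_text: str  # commentary the event was extracted from


# Invented sample data for illustration only.
events = [
    Event("goal", "David Beckham", "Beckham scores from the free kick: goal!"),
    Event("missed_goal", "David Beckham", "Beckham shoots wide, no goal."),
    Event("goal", "Michael Owen", "Owen finishes after a pass from Beckham: goal!"),
]


def keyword_search(events, *words):
    """Return events whose text contains every keyword (case-insensitive)."""
    return [
        e for e in events
        if all(w.lower() in e.source_text.lower() for w in words)
    ]


def semantic_search(events, kind, scorer):
    """Return events of the given type whose main participant is scorer."""
    return [e for e in events if e.kind == kind and e.player == scorer]


keyword_hits = keyword_search(events, "goal", "Beckham")
semantic_hits = semantic_search(events, kind="goal", scorer="David Beckham")

print(len(keyword_hits), "keyword hits")
print(len(semantic_hits), "semantic hits")
```

The keyword query also returns the miss and the goal by another player, because both texts mention the two words; the structured query returns only the goal scored by the named player.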


The results: England win!


SWAN: a Semantic Web Annotator
• Collaboration between DERI/NUIG, OntoText and USFD, hosted at DERI
• Large heap of IBM hardware in your server room
• Objective: make the cooling fans run flat-out

• Conceptual indexing of news or other web fractions
• Quantitative media reporting
• Annotated web workbench service
• Custom knowledge services

• Demo and poster at ESWS

SWAN Scenarios (1)

Financial Analysts
• Indications of how a company is viewed:
  • How many instances predicting strong performance for a particular company are out there? Over the past year how has the profile of predictions for this company changed? How many positive/negative sentiments were expressed for the company?

Marketing Strategists
• Support campaign tuning today based on yesterday's results:
  • In this morning's IT press 7% of articles discussed your company. The average proportion of the article directly relating to your company was 33%. The figures for the other key players in your sector are summarised in the following table....
• Extent of media coverage relative to spend events:
  • Company Y exhibited at Comdex. In the week following the exhibition 20% of the press that covered Comdex mentioned Y.

SWAN Scenarios (2)

PR Workers
• Identify negative reporting events (to issue denials, obfuscations, bribes etc.):
  • The table below summarises 12 negative reporting events concerning your company in the last 24 hours of IT news....

Media Analysts
• A range of media metrics, e.g. the "media distance" between concepts and products/companies:
  • The media distance between your company and the subject of XML is 0.09; for IBM the value is 0.2.

SWAN Scenarios (3)

Sales
• Generate "black books" - lists of contacts in the organisations for sales staff.
• Business structures are continually changing and reported in the news.
• Track works-for and joining and leaving reporting events

Public Interest Services
• In order to generate interest and to prototype the system we may wish to provide a free public service, for example about sport, or theatre and cinema alerts.

KIM

Popov et al. KIM. ISWC’03

• Ontology (KIMO) + 200K instances KB (5m stmts)
• Lookup phase marks mentions from the ontology
• Combined with rule-based IE system to recognise new instances of concepts and relations
• High ambiguity of instances with the same label – uses disambiguation step
• Special KB enrichment stage where some of these new instances are added to the KB
• Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
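
The entity-ranking idea in the last bullet can be sketched as a lookup that orders same-label candidates by how often each was seen in a corpus. This is a simplified stand-in, not KIM's actual algorithm: the candidate names, the counts and the function names `rank_entities` and `disambiguate` are invented for illustration.

```python
# Hypothetical candidates per label, each with an invented corpus frequency.
CANDIDATES = {
    "Paris": [
        ("Paris, Texas", 120),
        ("Paris, France", 9500),
        ("Paris (mythology)", 45),
    ],
}


def rank_entities(label, candidates=CANDIDATES):
    """Return the entities sharing a label, most frequent first."""
    ranked = sorted(candidates.get(label, []), key=lambda pair: pair[1], reverse=True)
    return [entity for entity, _count in ranked]


def disambiguate(label, candidates=CANDIDATES):
    """Pick the top-ranked entity for a label, or None if the label is unknown."""
    ranked = rank_entities(label, candidates)
    return ranked[0] if ranked else None


print(rank_entities("Paris"))
print(disambiguate("Paris"))
print(disambiguate("Atlantis"))
```

A frequency prior of this kind gives a sensible default for ambiguous labels; context-sensitive evidence would be needed to override it in documents about the less common entity.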


OBIE in KIM

Popov et al. KIM. ISWC’03

SWAN Logical Architecture

[Diagram: Web Service Users and UI Users; Web UI, Web services; multiple Focussed crawling nodes; IE (32 bit) and IE (64 bit); Knowledge base (Sesame); Annotation (Oracle).]


Cluster Controller

SWAN: Status, Future

Now
• Hardware working, crawling and annotating news sites
• IE tuning and evaluation in progress

Next steps
• Public demonstration service
• More news, sports domain
• More languages (parallel corpus, align, project markup, learn recogniser for new language)
• Negative reporting events


Trend 1: DRM: end of civilisation as we know it
• Digital Rights Management (DRM) controls how you consume media you buy
• Has the potential to be linked with censorship (and with invasive behaviour logging)
• You can't make digital objects behave like physical objects - unless you totally control the hardware and the operating system
• If someone does gain control, then we may end up finding that someone has given the contract for news and culture to Halliburton, for example

Seconds out, round 5: file sharing is about to go social
• Round 1: Napster's explosion
• Round 2: Napster's demise
• Round 3: P2P, Kazaa, BitTorrent
• Round 4: RIAA sues the punters
• Round 5: OSN + P2P, trust as referral

Trend 2: the Biggest Innovation in Conversation Since the Table
• Social software hits the mainstream:
  • Friendster, LinkedIn, Orkut (On-line Social Networking, OSN)
  • Blogs, Wikis, chat/IM, RSS/ATOM
• How to run a better teleconference: add Wiki and IM

Trend 3: Wintel vs. Consumer Electronics in the Home
• The TV, cable and satellite, DVD, hi-fi, radio and TiVo of several years' time will probably run from a single PC (which will also do web, email, ...)
• There will be a battle between Wintel, offering high-quality gaming and full-blown Windows, and more conventional consumer electronics approaches based on Linux and cheap hardware
• The latter can probably capture some significant market share, having advantages such as: no viruses; better stability; cheap hardware; multi-user functions; fast boot; quiet running...

What if...?
• What if these three trends combine? What if we get widespread open platform consumer electronics + OSN + P2P file sharing?
  • Ubiquitous on-line communities centred on shared content, with a working model of trust as referral
• What if semantic technology provides the means of organising and interlinking the cross-over between TV and the web?
  • Killer application for OSN
  • Bandwidth sales for cable companies
  • Antidote to DRM

Semantic TV: Features
• NGTivo will not record what you ask, but will record a week of 60 channels and allow you to browse! Then we're starting to have DL type problems.
• DIY SInS:
  • structured information spaces
  • technology trickle-down effect: everyone's a systems analyst
  • semantic wiki: the simplest shared semantics that could possibly work
• Coordinated web/desktop search; temporal search
• KIM/SWAN (pre-announced shows; text metrics / shows watched; semantic search)
• SWAN re-defined as web mining for TV-related facts (OBIE for scraping)
• Celebrity-based indexing
• How do you say "record all the programmes with actor X after 2001 not set in Europe"?
• SemTV answer: "record all the programmes with actor X after 2001 not set in Europe"
• (This is done by CLIE and example-based authoring, not dialogue processing, QA or NLDB.)
• Social networking based on TV preferences
• Recommender
• Video OCR
• Retro and multi-player/low graphics games
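
The controlled-language request quoted above can be turned into a structured query with very little machinery, which is the appeal of CLIE over open-ended dialogue processing. The sketch below uses a single regular expression as a stand-in for a controlled-language grammar; the pattern, the `RecordingRequest` fields and the `parse_request` function are assumptions made for illustration and do not reflect how CLIE is implemented in GATE.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled language: a fixed sentence frame with optional slots.
PATTERN = re.compile(
    r"^record all the programmes"
    r"(?: with actor (?P<actor>.+?))?"
    r"(?: after (?P<after>\d{4}))?"
    r"(?: not set in (?P<not_set_in>.+?))?$"
)


@dataclass
class RecordingRequest:
    actor: Optional[str]
    after_year: Optional[int]
    excluded_location: Optional[str]


def parse_request(sentence):
    """Parse a controlled-language sentence; return None if it does not fit."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    year = match.group("after")
    return RecordingRequest(
        actor=match.group("actor"),
        after_year=int(year) if year else None,
        excluded_location=match.group("not_set_in"),
    )


request = parse_request(
    "record all the programmes with actor X after 2001 not set in Europe"
)
print(request)
print(parse_request("tape something nice"))
```

Sentences outside the controlled language are rejected rather than guessed at, so the user always knows what the system understood.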

Summary
• GATE:
  – de facto standard for HLT R&D
  – emerging substrate for HLT in knowledge technologies
  – world-class IE
• SWAN: scaling up OBIE
• Semantic TV: coming soon, act now!

More information: http://gate.ac.uk/