Calais @ the Palo Alto Semantic Web Meetup
Uploaded by krista-thomas
CalaisPAWS
Sep 4, 2008
Calais?
ClearForest
• Founded in 1998 by text analytics pioneers
• A software organization that enables Intelligent Information
• Enterprise and government customers
• Led the market in the establishment of unstructured text as a key corporate asset
• Acquired by Reuters June 2007
• Offices: Boston, Israel
The Text Problem
• People consume text
• Most of it isn’t semantically enabled
• Most of it won’t be semantically enabled
• Why: Latency, cost and short shelf-life
Calais’ Piece of the Puzzle
• A semantic metadata generation service that extracts entities, facts and events from unstructured text
• Two new capabilities: topics & relevance
• Available for commercial or non-commercial use up to 40,000 times per day
Calais
Unstructured Documents (Text / HTML / XML)
• Named Entities: People, Companies, Geographies, Albums, Authors, etc.
• Facts: Position, Alliance, Education, Political Affiliation, etc.
• Events: Management Change, IPO, Labor Action, Sporting, Entertainment, etc.
<Topic>M&A</Topic>
<Acquisition offset="494" length="130"> <Company_Acquirer>Reuters</Company_Acquirer> <Company_Acquired>ClearForest Ltd.</Company_Acquired> <Status>Planned</Status> </Acquisition>
<Company>Reuters</Company>
<Company>ClearForest Ltd.</Company>
<Product>Text Analytic Solution </Product>
<Company>ClearForest Ltd.</Company>
<Company>Reuters</Company>
<Country>United States</Country>
<Country>Israel</Country>
<Company>Reuters</Company>
<Person>Gerry Campbell</Person>
<ManagementChange offset="2789" length="92"> <Person>Gerry Campbell</Person> <Company>Reuters</Company> <Action>Enters</Action> </ManagementChange>
Reuters Announced the Acquisition of ClearForest
New York - April 30, 2007
Reuters, the global information company, has entered into an agreement to acquire all of the outstanding shares of ClearForest Ltd., a privately held provider of Text Analytics solutions, whose tagging platform and analytical products allow clients to derive precise business information from huge amounts of textual content.
ClearForest has received sufficient shareholder approval to complete the transaction, which is expected to close in approximately 30 days, subject to customary closing conditions. The financial terms were not disclosed. Reuters plans to retain and continue to work with the existing management team and their highly skilled workforces in the US and Israel. It also plans to continue to support existing products and customers.
Reuters believes that search will be a pivotal element to the future of how financial information is sourced and consumed. As part of its drive into this space, Reuters has created a new strategic group and appointed Gerry Campbell, who will oversee the integration of ClearForest and drive this innovation.
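The tagged output shown on this slide can be read back into simple data structures. The following is a minimal sketch, assuming the simplified tag layout from the slide rather than the actual response format of the Calais web service, which may differ. The function name `parse_tags` and the wrapping `<root>` element are hypothetical, and the raw `&` in the slide's topic tag is escaped as `&amp;` so that the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment modelled on the slide's simplified tags.
# Assumption: the real service response may be structured differently.
FRAGMENT = """
<Topic>M&amp;A</Topic>
<Acquisition offset="494" length="130">
  <Company_Acquirer>Reuters</Company_Acquirer>
  <Company_Acquired>ClearForest Ltd.</Company_Acquired>
  <Status>Planned</Status>
</Acquisition>
<Company>Reuters</Company>
<Company>ClearForest Ltd.</Company>
<Person>Gerry Campbell</Person>
"""


def parse_tags(fragment):
    """Split a tag fragment into simple entities and structured events/facts."""
    # The fragment has no single root element, so wrap it in a hypothetical one.
    root = ET.fromstring("<root>" + fragment + "</root>")
    entities = {}  # tag name -> set of distinct values
    events = []    # one dict per element that has child fields
    for element in root:
        if len(element) == 0:
            # Leaf element: a named entity or topic.
            value = (element.text or "").strip()
            entities.setdefault(element.tag, set()).add(value)
        else:
            # Element with children: an event or fact with its position in the text.
            events.append({
                "type": element.tag,
                "offset": int(element.get("offset")),
                "length": int(element.get("length")),
                "fields": {c.tag: (c.text or "").strip() for c in element},
            })
    return entities, events


entities, events = parse_tags(FRAGMENT)
print(entities["Company"])
print(events[0]["fields"])
```

The `offset` and `length` attributes presumably locate the event in the source document, so a caller holding the original text could slice it as `text[offset:offset + length]` to recover the supporting passage.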
What’s Behind an Event … An Example
Digital Marketing Services, Inc. (DMS), the leading provider of online marketing research and a division of America Online Inc. (AOL), today announced an alliance with Netcentives Inc. (Nasdaq: NCNT)
Extracted instances:
Company = Digital Marketing Services, Inc.
Company = Netcentives Inc.
Status = announced
DateString = today
Date = 2000-01-31
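The step from DateString = today to Date = 2000-01-31 implies that a relative expression is resolved against the date of the document. The slide does not say how Calais does this, so the following is only a sketch under that assumption; `normalize_date`, the `RELATIVE_OFFSETS` table and the document date passed in are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical lookup of relative date words, in days from the document date.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def normalize_date(date_string, document_date):
    """Resolve a relative date word against the document date.

    Returns an ISO 8601 string, or None if the word is not recognised.
    """
    offset = RELATIVE_OFFSETS.get(date_string.strip().lower())
    if offset is None:
        return None
    return (document_date + timedelta(days=offset)).isoformat()


# The extracted instance from the slide, held as a plain dictionary.
instance = {
    "Company": ["Digital Marketing Services, Inc.", "Netcentives Inc."],
    "Status": "announced",
    "DateString": "today",
}

# Assumption: the press release is dated 2000-01-31, as the slide's Date implies.
instance["Date"] = normalize_date(instance["DateString"], date(2000, 1, 31))
print(instance["Date"])  # 2000-01-31
```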
Live Example
Viewer Demo
Gnosis Demo
Extending Calais’ Reach
More than just a web service – a growing collection of tools and applications to make it valuable in the real world
Calais
Browser Extensions
Gnosis
Content Management Tools
WordPress
Drupal
UIMA
Development Tools & Libraries
PHP
Ruby
JAVA
.NET
Applications
And more…
TopBraid
RSS Tagger
Powerhouse
LinkedFacts
Wirecatch
FeedShaver
How Calais is Being Used Today
• Gist: automatically aggregates multiple news sources and slots them into topics, etc.
The Stack
[Diagram: ClearForest Tags Platform. Labels recoverable from the figure:]
• Content: External Content / Live Feed / Enterprise Content
• Connectors: File Based Connector, Programmatic API (SOAP Web Service), RDBMS Connector, Web Crawlers (Agents)
• Processing: ClearForest Extraction Modules, ClearForest Categorizer
• Tooling: Modeler, Developer, Cat Manager
• Other labels: Console, RichXML, LiveFeed
Detailed Stack
[Diagram: ClearForest Tags Platform, detailed view. Labels recoverable from the figure:]
• Content: Files, Web Agents, Enterprise System, External Feed
• APIs: Tags API, Control API, File Based API, Programmatic API (SOAP Web Service), RDBMS-based API
• Processing: Document Conversion and Normalization, Language ID, Semantic Tagging, Categorizer, Classifier, Language Classifier, Extraction Modules, Headline Generation, Templates
• ClearForest Studio: ClearForest Dvlpr/Modeler, Categorization Manager, Languages Configuration, Key Concepts Configuration
• Configuration & Monitoring: Console, Farm Manager
• Other labels: Control, DB, RichXML
Platform Highlights
• Single run-time platform for all technologies
• Modular architecture
• Additional functional plug-ins can be added anywhere
• Web services interfaces
• SOA ready
• Java based
• Programmatic API to all components
• Farming support for scalability
• Best practices/standards (XML, Unicode, Architectural Patterns, Design patterns …)
[Diagram: Tags Pipeline. Labels recoverable from the figure:]
• Document providers (File/Web/DB based API): File API, Programmatic API (SOAP Web Service), RDBMS-based API, Web, Custom
• Document Injector (flight plan), Profile, Listeners
• Document Conversion (Conversion & Normalization): PDF Conv., XML Conv., Doc Conv.
• Document Tagging (Doc Runner): Language identification, Categorization, Information extraction, Other (Headline Generation)
• Writers: KB Writer, DB Writer, XML Writer
• Queues: CPU Bound, IO Bound
• Control: Console, Control API
• Other labels: RichXML, ANSCollection, DB
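The pipeline diagram labels its queues as CPU Bound and IO Bound, which suggests that tagging work and output writing are decoupled so that slow writes do not stall tagging. The following is a minimal sketch of that general pattern using Python's standard `queue` and `threading` modules. It is not ClearForest's implementation: the worker functions, the word-count stand-in for tagging, and the in-memory `sink` standing in for the XML/DB writers are all hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the document stream


def tagging_worker(inbox, outbox):
    """CPU-bound stage: 'tags' each document (here, just counts its words)."""
    while True:
        doc = inbox.get()
        if doc is SENTINEL:
            outbox.put(SENTINEL)  # pass the shutdown signal downstream
            break
        outbox.put({"id": doc["id"], "word_count": len(doc["text"].split())})


def writer_worker(inbox, sink):
    """IO-bound stage: persists results (here, appends to an in-memory list)."""
    while True:
        result = inbox.get()
        if result is SENTINEL:
            break
        sink.append(result)


def run_pipeline(documents):
    cpu_queue = queue.Queue()  # documents waiting to be tagged
    io_queue = queue.Queue()   # tagged results waiting to be written
    sink = []
    workers = [
        threading.Thread(target=tagging_worker, args=(cpu_queue, io_queue)),
        threading.Thread(target=writer_worker, args=(io_queue, sink)),
    ]
    for worker in workers:
        worker.start()
    for doc in documents:
        cpu_queue.put(doc)
    cpu_queue.put(SENTINEL)
    for worker in workers:
        worker.join()
    return sink


results = run_pipeline([
    {"id": 1, "text": "Reuters acquires ClearForest"},
    {"id": 2, "text": "Calais tags text"},
])
print(results)
```

With one worker per stage, results arrive in submission order. Scaling the CPU-bound stage across several processes or machines, as the "farming support" bullet suggests, would need additional bookkeeping that this sketch leaves out.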
Technology
The NLP Stack (top layer first)
• Events & Facts
• Entities: Candidates, Resolution, Normalization
• Basic NLP: Noun Groups, Verb Groups, Numbers Phrases, Abbreviations
• Metadata Analysis: Title, Date, Body, Paragraph
• Sentence Marking
• Morphological Analyzer: POS Tagging (per word); Stem, Tense, Aspect, Singular/Plural, Gender, Prefix/Suffix Separation
• Tokenization
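The NLP stack reads as a layered pipeline in which each layer builds on the output of the one beneath it, starting from tokenization. The following is a toy sketch of that layering, covering only three of the layers with deliberately naive rules. The stage functions and the capitalisation heuristic are illustrative assumptions and say nothing about how the ClearForest modules actually work.

```python
import re


def tokenize(doc):
    """Bottom layer: split text into word and punctuation tokens."""
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def mark_sentences(doc):
    """Group tokens into sentences, ending each at '.', '!' or '?'."""
    sentences, current = [], []
    for token in doc["tokens"]:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    doc["sentences"] = sentences
    return doc


def entity_candidates(doc):
    """Naive candidate finder: runs of capitalised tokens, skipping the
    first token of each sentence (which is capitalised anyway)."""
    candidates = []
    for sentence in doc["sentences"]:
        run = []
        for index, token in enumerate(sentence):
            if index > 0 and token[:1].isupper():
                run.append(token)
            else:
                if run:
                    candidates.append(" ".join(run))
                run = []
        if run:
            candidates.append(" ".join(run))
    doc["entity_candidates"] = candidates
    return doc


# Stages run bottom-up; each one reads what the earlier stages produced.
STAGES = [tokenize, mark_sentences, entity_candidates]


def run_stack(text):
    doc = {"text": text}
    for stage in STAGES:
        doc = stage(doc)
    return doc


doc = run_stack("Today Reuters appointed Gerry Campbell. He will oversee the integration.")
print(doc["entity_candidates"])  # ['Reuters', 'Gerry Campbell']
```

Keeping each layer as a separate function over a shared document record mirrors the modular architecture the platform slides describe, where additional stages can be slotted in between existing ones.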
Calais, Semantics and the Semantic Web
• Issues, Opportunities
  – Ontologies
    • How do we make this a community effort?
  – Dereferenceable URI’s & Endpoints
    • Engineering
    • Population
      – Basic data
      – Links
      – Proprietary data sources
      – Functions? Code?
What’s in the Pipeline?
• 2008
– The basics of de-referenceable URI’s
– Disambiguation – company & geography
– Hooks
• 2009 (this is a fuzzy list)
– Person disambiguation (social networks?)
– Other disambiguation
– Continued population of endpoints
– Calais as hub
– Exposure of the IDE
– User managed lexicons
– Lots and lots of hooks
• www.opencalais.com
– Gallery – code and applications examples
– Forums
– Documentation