Text Mining & Visualization
Impressions of emerging capabilities
Cynthia Barcelon-Yang (speaker)
Yun Yun Yang (speaker)
Lucy Akers
Bristol-Myers Squibb
2007 PIUG Northeast Conference
New Brunswick, New Jersey
• Introduction – Text Mining & Visualization
• Overview of Text Mining Tools
■ Capabilities
■ Data Sources
■ Results
■ Strengths
• Summary
Why do we need a tool to do text mining?
Welcome to the age of too much information...
Typical questions asked of IP Operations
Often, the IP Operations group within an organization provides centralized support to a wide range of business units, and is responsible for answering questions such as the following:
• How many patents do we have concerning technology ‘x’?
• How does our portfolio compare with that of company ‘ABC’?
• Who is citing our portfolio?
• Which patents does business unit ‘xyz’ own?
• Which patents should we divest as a result of selling division XYZ?
• How do our invention disclosures compare with currently granted patents?
• How do we improve our patent operations?
What is text mining?
(according to Marti Hearst of UC Berkeley School of Information)
■ The discovery of new, previously unknown information by automatically extracting information from different written resources.
■ A variation on the field called data mining, which tries to find interesting patterns in large databases.
■ Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.
■ Related field: computational linguistics (also known as natural language processing).
■ Hearst distinguishes between "real" text mining, which discovers new pieces of knowledge, and approaches that find overall trends in textual data.
Text Mining Process
Courtesy of: Invention Machine Corp.
Common Tasks
• List generation (can be displayed as histograms)
• List cleanup and grouping of concepts
• Co-occurrence matrices and other graphing
• Clustering, categorization, grouping and extraction of text
• Mapping document clusters or concepts
• Adding temporal components to maps
• Citation analysis
• Subject/Action/Object (SAO) functions (a.k.a. NLP)
• Federated searching, e.g. on the Internet or intranets
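As a rough illustration of the first and third tasks above (list generation and co-occurrence matrices), the following minimal sketch counts terms and term pairs across a small document set. The documents and terms are hypothetical, and none of the vendor tools is claimed to work this way.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each document already reduced to a set of cleaned-up terms.
docs = [
    {"stem cell", "culture medium", "growth factor"},
    {"stem cell", "growth factor"},
    {"culture medium", "bioreactor"},
]

def term_list(documents):
    """Count how many documents mention each term (the data behind a histogram)."""
    counts = Counter()
    for terms in documents:
        counts.update(terms)
    return counts

def cooccurrence(documents):
    """Count how many documents mention each unordered pair of terms."""
    pairs = Counter()
    for terms in documents:
        # Sorting makes each pair key independent of set iteration order.
        for a, b in combinations(sorted(terms), 2):
            pairs[(a, b)] += 1
    return pairs

counts = term_list(docs)
pairs = cooccurrence(docs)

# Text histogram: one '#' per document mentioning the term.
for term, n in counts.most_common():
    print(f"{term:15} {'#' * n}")
```

The pair counts can be laid out as a matrix with terms on both axes, which is the form a co-occurrence view usually takes.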
Project Planning
■ Phase I
► Literature searches, key references, brainstorming of text/data mining & visualization
► Identify potential tools to evaluate
► Vendor onsite demonstrations
► Summary of initial tool evaluations
■ Phase II
► Pilot selected tools
► Identify potential client groups and interview representative clients
Investigation & Process Approach
■ Scout the literature/internet sources & brainstorm
■ Benchmark
■ “Patinformatics: Tasks to Tools” by Tony Trippe, World Patent Information 25 (2003) 211–221
■ “Data Visualization Tools - A Perspective from the Pharmaceutical Industry” by Jeannette Eldridge, World Patent Information 28 (2006) 43–49
■ Vendor demos
Tools Initially Identified
AnaVist
Anacubis
Aureka
Bioalma
BizInt
ClearForest
Delphion
Entrieva (Semio)
GoldFire
Inxight
M-CAM
Matheo Patent
OmniViz
PatAnalyst
Quosa
Technology Watch
Temis
VantagePoint
Vivisimo
Wisdomain
Wistract
Vendor Tool Demonstrations
1. Quosa
2. Inxight
3. PatAnalyst
4. OmniViz
5. Temis
6. Aureka
7. Wisdomain
8. GoldFire
9. VantagePoint
10. ClearForest
11. M-CAM
12. RefViz
* Overview of Vendor Tools
• Type of Tool
• Capabilities
• Data Sources
• Results
• Strengths
• Summary
* Text mining tool slides are provided courtesy of the vendors.
Text Mining Capabilities
• Keyword Analysis
■ Extracting nouns or noun phrases from text without understanding their meaning or relationships, or counting the number of times the nouns appear
• Statistical Analysis
■ Frequency-based analysis – counting the number of times a word appears in the text
• Linguistic Analysis
■ Natural language processing (NLP) – “Trained Agent”
■ Semantic analysis
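The frequency-based analysis mentioned above can be sketched in a few lines. This is only a minimal illustration: the stop-word list and the sample sentence are made up for the example, and real tools use far larger dictionaries.

```python
import re
from collections import Counter

# Illustrative stop-word list only; an assumption, not taken from any vendor tool.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def word_frequencies(text):
    """Count how often each non-stop word appears in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

sample = "The pump fails. The pump relay shorts in cold weather."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

Counting says nothing about what "pump" means or how it relates to "relay"; that gap is what the linguistic and semantic approaches try to close.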
Text Mining Data Sources
■ Unstructured text
► Full-text documents, emails
■ Structured text
► Database records, such as records from STN or PubMed
■ Hybrid content
► Patents: the front page is structured, the text is not
Data Sources
I. General Data Sources (Unstructured):
ClearForest
GoldFire Innovator
Inxight
OmniViz
Temis
II. Bibliographic Data Sources (Structured):
Quosa
RefViz
VantagePoint
III. Patent-Focused (Hybrid):
Aureka
M-CAM
PatAnalyst
Wisdomain
Evaluation Template
• Type of Tool
■ Text mining software tool
■ Database content provider
■ Both
• Capabilities
■ Keyword analysis
■ Statistical analysis
■ Linguistic analysis
• Data Sources
■ Structured bibliographic data sources
■ Unstructured sources – full-text web, email, corporate repositories, etc.
■ Hybrid sources – patents, combination of structured/unstructured
• Results
■ Lists of documents
■ Tables
■ Charts/Graphs
■ Maps
• Strengths – Disclaimer: Our Impressions only!
• Summary
GoldFire Innovator
• Type of tool – text mining tool
• Technology – Semantic Analysis
• Data Sources
■ Unstructured information from personal data, corporate data, deep web, content, patents, internet
► 15 MM worldwide patents
► Database of over 8000 scientific effects
► 3000 cross-disciplinary scientific deep web websites
• Results
■ Static categorization of key concepts
■ Accurate answers to questions
■ Dynamic document summarization
GoldFire Innovator - Strengths
• Precision retrieval of targeted R&D content
► Retrieves information from context – semantic indexing
► Automated summaries and categorization
► Relevant filtering and ranking
• Using natural language query to search
► Ask the right questions - How to dry paper? How to balance diets?
• Innovation Trend Analysis
► Competitive analysis
► Technology analysis
► Patent relationship analysis – citation analysis
Inxight
• Type of Tool
■ Text mining software tool
• Capability
■ Natural Language Processing
■ Contextual extractions (leaning towards semantic analysis)
• Data Source
■ Unstructured text from websites, internal repositories, full-text documents
■ Documents have to be pre-processed to extract meta-data and identify entity types
• Results
■ Hierarchical categorization
Inxight - Strengths
• Federated search capability
• Claims to be more accurate than a human reader
• Software can work in 32 languages and can understand 27 entity types
• Can process 1.2 gigabytes per hour
• Claims to have the most powerful linguistic algorithms in the field
Temis
• Type of tool
■ Text Mining Solutions - software
• Capability
■ Natural language processing
► Insight Discoverer™ Extractor – information extraction server powered by XeLDA and used with specialized Skill Cartridges
► Insight Discoverer™ Categorizer – document categorization server
► Insight Discoverer™ Clusterer – automated classification server
► XeLDA – multilingual linguistic engine for natural language processing
► Skill Cartridge – a set of customizable knowledge components that define the information to be extracted. The two major knowledge components are multilingual dictionaries and multilingual extraction rules (which establish relationships between defined concepts)
Skill Cartridge Overview
• Open architecture
■ Plug & Play annotation components
■ Each defines areas of interests & extraction rules
■ Extraction rules describe the sentence structure that characterizes a concept
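To make the idea of a dictionary plus an extraction rule concrete, here is a minimal sketch of a hypothetical acquisition rule written as a regular expression. It is not the Skill Cartridge format or the XeLDA engine; the verb list, the pattern, and the company names are all assumptions made for illustration.

```python
import re

# Hypothetical dictionary component: verbs that signal an acquisition.
ACQUISITION_VERBS = ["acquires", "acquired", "buys", "bought"]

# Hypothetical extraction rule describing the sentence structure:
#   <Buyer> <acquisition verb> <Target> [for <Amount>]
# Buyer and Target are approximated as runs of capitalized words.
RULE = re.compile(
    r"(?P<buyer>[A-Z]\w+(?: [A-Z]\w+)*) "
    r"(?:" + "|".join(ACQUISITION_VERBS) + r") "
    r"(?P<target>[A-Z]\w+(?: [A-Z]\w+)*)"
    r"(?: for (?P<amount>\$[\d.]+ (?:million|billion)))?"
)

def extract_acquisitions(text):
    """Return one dict (buyer, target, amount) per sentence matching the rule."""
    return [m.groupdict() for m in RULE.finditer(text)]

# Made-up company names, used only to exercise the rule.
text = "Alpha Corp acquired Beta Labs for $2.5 billion. Gamma Inc buys Delta."
results = extract_acquisitions(text)
for r in results:
    print(r)
```

A production rule set would rely on linguistic analysis rather than capitalization, but the split is the same: the dictionary says which words matter, and the rule says how they must be arranged.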
[Figure: XeLDA™ turns text (any kind, any format) into words (any concept). Plug & Play Skill Cartridges™ in the Insight Discoverer™ Extractor then assign meaning. Examples: a Merger & Acquisition cartridge (Meaning = Acquisition: target & buyer, amount & date, ...) and a Positive & Negative Sentiment Analysis cartridge (Meaning = Satisfaction: people, companies, products; satisfaction; support; ...).]
Temis
• Data Sources
■ Any kind, any format: internal & external data, documents, literature, patents, clinical trials, chemistry and biology, bioinformatics, internet, email, etc.
• Results
■ Clusters, rankings, and lists to discover information trends and relationships
Temis - Strengths
• Searching by concepts
► Selecting concepts from concept tree
• Specialized Skill Cartridges
► General Skill Cartridges
– Analytics
– Text Mining 360°
– Competitive Intelligence
– Human Resources Management
► Life science Skill Cartridges
– Biological Entity Relationships – best selling
– Medical Entity Relationships
– Chemical Entity Relationships
– Competitive Intelligence Life Sciences Edition
Temis - Strengths
• Strong extraction, categorization, and clustering capabilities
• Robust XeLDA linguistic engine
• Quick trend analysis
• Chemical Document Browser – specialized extraction module for translating chemical substance nomenclature into chemical structures
OmniViz
• Type of tool
■ Visual-based data/text mining software
• Capability
■ Algorithm-based statistical analysis, not semantics
• Data source/type
■ Numeric, text, categorical, chemical structures, sequence, structured/unstructured text
• Results
■ Interactive visualization maps such as CoMet, Correlation, Galaxy Proximity, etc.
OmniViz - Strengths
■ Interactive visualizations
■ Supports analysis of large amounts of data (millions of documents) - numeric, categorical and full-text analysis, including patents.
■ Broad applications including gene expression, sequence & pathway analysis, chemical structures, cheminformatics, clinical trial, patent analysis, diagnosis and treatment, legal, marketing data, regulatory compliance, intelligence analysis, etc.
■ Flexible data import and merge capabilities
ClearForest
• Type of Tool
■ Text mining tool (text analytics solution)
• Capability
■ Semantic analysis/NLP
• Data Sources
■ Unstructured text – websites
■ Patents
■ Internal documents
■ Meta-data
• Results
■ Structured data entities
■ List of potential solutions for identified issues
■ Visualization tools – trend graphs, category maps
► Color and font are used to show intensity of relationships
ClearForest
Text Analytics: How it Works
[Figure: Unstructured text (documents in Text, Word, Excel, Email, WWW, PDF; database text fields) enters the Tagging Platform, which performs extraction across records, including domain-specific entities & relationships. The output (XML, DB) feeds Unified Analysis through role-based interfaces.]
Example output table:
Part | Problem | Condition
Fuel Pump | Fails | Corroded
Pump Relay | Shorts | Cold weather
Headlight | Fails | Running hot
Engine | Stalls | At low speeds
Example XML output:
<PartProblemCondition>
<Part> Fuel Pump </Part>
<Problem> Fails </Problem>
<Condition> Corroded </Condition>
</PartProblemCondition>
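The slide's PartProblemCondition record shows the kind of structured output a tagging step produces. As a minimal sketch, assuming the part, problem, and condition have already been extracted from the text, the record can be written and read back with Python's standard xml.etree.ElementTree module. The function names are hypothetical and this is not ClearForest's API.

```python
import xml.etree.ElementTree as ET

def tag_record(part, problem, condition):
    """Build a PartProblemCondition element from already-extracted values."""
    root = ET.Element("PartProblemCondition")
    for name, value in (("Part", part), ("Problem", problem), ("Condition", condition)):
        ET.SubElement(root, name).text = value
    return root

def to_row(element):
    """Flatten the element into a dict, i.e. one row of the output table."""
    return {child.tag: child.text for child in element}

element = tag_record("Fuel Pump", "Fails", "Corroded")
xml_text = ET.tostring(element, encoding="unicode")
row = to_row(ET.fromstring(xml_text))
print(xml_text)
print(row)
```

The hard part, getting from free text such as a repair note to the three field values, is exactly what the tagging platform is for and is not attempted here.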
ClearForest
Packaged Extraction Modules: Patents
Inputs
• U.S. Patent Search
• MicroPatent Search
• Database fields
• Text, Word, Excel, etc.
Outputs
Structured Data Entities
• Agent
• Application Number
• Assignee
• Assignee Address
• Examiner
• Filing Date
• Inventor
• Inventor Address
• IPC
• Issue Date
• Number Of Claims
• Patent Citations
• Patent Number
• US Class
Entities
• Claim Element
• Claim Invention
• Extracted Terms
• Invention Terms
• Measurement Terms
• Number of Claims
• Patent Section
• Problem Solved Terms
• Problems Solved
• Process Technology Terms
• Technology Terms
ClearForest - Strengths
• Can be applied to a wide range of applications, as evidenced by the wide variety of available extraction modules
■ Security/intelligence gathering
■ Product/customer information
■ Corporate/People profiles
■ Patents
■ Biomedical entities
• Analytics tool can discover unexpected relationships between entities that would not otherwise have been uncovered by standard, manual methods.
VantagePoint
• Type of tool
■ Text mining software mainly used for technology assessment and company profiling
• Capability
■ Uses pattern matching, rule-based, and natural language processing techniques
• Data Sources
■ Works best with structured data - text data from bibliographic databases
• Results
■ Summaries, charts, matrices, maps, and graphs
VantagePoint - Key Features
• Rapid navigation in large abstract collections
• Helps find relationships within your data
• Visually displays relationships
• Buckets documents to help in categorization
• Utilities for cleaning data
• User-created thesauri for reducing data
• Scripting capabilities to automate knowledge-gathering
• Easily exports output to other applications
• Can be configured to text mine most forms of structured bibliographic data
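The data-cleaning and thesaurus features listed above amount to mapping variant spellings onto one preferred term before counting. The sketch below is a hedged illustration of that idea; the thesaurus entries and the normalization steps are assumptions, not VantagePoint's thesaurus format or scripting interface.

```python
from collections import Counter

# Hypothetical user-created thesaurus: normalized variant -> preferred name.
THESAURUS = {
    "bristol myers squibb co": "Bristol-Myers Squibb",
    "bristol-myers squibb company": "Bristol-Myers Squibb",
    "bms": "Bristol-Myers Squibb",
}

def normalize(name):
    """Lower-case, drop periods and commas, collapse spaces, then look up the thesaurus."""
    key = " ".join(name.lower().replace(".", "").replace(",", "").split())
    # Names without a thesaurus entry are kept as written.
    return THESAURUS.get(key, name.strip())

def cleaned_counts(names):
    """Count records per assignee after thesaurus cleanup."""
    return Counter(normalize(n) for n in names)

raw = ["Bristol Myers Squibb Co.", "BMS", "Bristol-Myers Squibb Company", "Acme Ltd"]
counts = cleaned_counts(raw)
print(counts)
```

Without the cleanup step the three variants would appear as three separate assignees, each with a single record, which is why list cleanup precedes most of the analyses in these tools.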
VantagePoint - Strengths
• List Creation and Cleanup
■ Patent assignee, author, inventor
■ Pre-built IPC, user-created thesauri
• Analytical tool box
■ Rapid navigation in large abstract collections to answer who, where, what, when but not how and why
■ Visually displays relationships
• Scripting capabilities to automate knowledge-gathering
■ Configure to extract from structured databases
RefViz
• Type of tool
■ Text Analysis and Data Visualization software
• Capability
■ Statistical and Linguistic analysis
► “Mathematical signature” – relationship of words
► Uses a thesaurus tool
• Data Sources
■ Only structured data from title, abstracts/notes fields, or ISI Web of Science, PubMed, OCLC, Output
• Results
■ “Galaxy” & matrix visualization
RefViz - Strengths
■ Reference Retriever™ can search multiple online sources simultaneously
■ Can be used together with EndNote, ProCite, and Reference Manager to provide an additional level of analysis to existing reference collections
■ Analyzes large numbers of references by thematic content
■ Interactive, visual landscape
Reveal trends and associations in references
The Galaxy view organizes references according to how they are related conceptually.
References on farming and herbs, either their cultivation or use as herbicides, are found in the upper left region of the Galaxy.
Groups in the lower right focus on herbs in medicine.
The region in between farming and medicine contains a mix of references about herbage diets in farm animals, herbal extracts from plants, and research on health effects of herbicide exposure.
Quosa
• Type of tool
■ Text mining tool based on concept extraction/clustering
• Capability
■ Statistical analysis (term extraction, frequency ranking, concept extraction using dynamic extraction algorithm from MIT/Harvard)
• Data sources
■ Unstructured text - PubMed, Ovid, Google Scholar
■ Patents
■ Internal documents
• Results
■ Highly organized collection of documents (folders on shared server or local machine)
■ Team sharing and annotating
Quosa - Strengths
Full-text retrieval and management of scientific documents
■ Get the full article from a journal or patent gateway
► PubMed, Ovid, USPTO website
■ Document Summary from My Article Organizer
■ Download to EndNote
M-CAM Doors™
• Type of tool
■ Patent database provider, with text analysis and risk management solution
• Capability
■ Linguistic & semantic-based analysis, multilingual
• Data Sources
■ Patents from over 88 patenting authorities, 50 million patent documents
■ Journal articles (by the end of the summer 2006)
• Results
■ “Compass” citation view
■ “Magellan” telescope & hourglass – patent life timeline
■ Patent uniqueness and enforceability analysis
■ Competitive intelligence analysis - financial risk analysis for merger/acquisition and stock trading
M-CAM Doors™
Hourglass view – shows behavior and intent
Red bar – cited patents
Blue bar – citing patents
Green bar – concurrent art – share pendency
Purple bar – volume of uncited patents
Orange bar – volume of patents that did not cite subject patent
M-CAM Doors™ - Strengths
• Powerful visual interface for citation analysis with related family & legal status views
• Can rate each patent for its uniqueness, reliance on related patents, and enforcement potential – based on Hourglass view
• Can rank patent clusters by relevance to business objectives
• Competitive Intelligence/Investment Research
■ New Patent Thursday™, Patent Portfolio Confidence Rating™, Custom PPCR™
PatAnalyst
• Type of tool
■ Patent database provider – integrated source (UNIPAT) of patent databases from US, PCT, EPO, PAJ, Germany, UK, France and Switzerland
■ Patent search & examination service
• Capability
■ No text mining algorithm
• Data Sources
■ 51.5 MM patent documents – bibliographic data from 70 countries from EPO
■ 15 MM full-text documents – 8 countries/patenting authorities
• Results
■ Viewer – analyze and organize the patent documents/families
■ Easy-to-use analytical colored text-highlighting of keywords
■ Organized folders of documents
PatAnalyst - Strengths
• Powerful user interface with enhanced display features
■ Highlighted keywords are shown in different colors
■ Side-by-side views of full-text and standard bibliographic data
■ Integrated IPC category trees
■ “Live” legal status & patent family tree view from EPO Viewer (EPOQUE)
■ Combined search of full-text & bibliographic data
Aureka
• Type of tool
■ Content and software tool specializing in visualization and citation analysis
• Capability
■ Keyword and Statistical Analysis
• Data Sources
■ Patent databases listed in MicroPatent’s FullText collection
• Results
■ ThemeScape maps, hyperbolic citation trees, text clusters
Aureka ThemeScape Map of Stem Cell Technology
A ThemeScape map of a large set of documents provides an initial view of the content. Additional probing and analysis of the map will help to reveal more insight.
Citation Tree of Patent EP0778277
A cited patent provides insight into a corporation’s strategic intent with a patent: building a picket fence, treating it as a non-core patent, or showing a lack of R&D interest.
Aureka – Strengths
• Strong citation analysis tool
► Interactive citation tree – intelligence analysis and strategic planning
• Annotation capabilities
• Strong visualization analysis
► Patent mapping with ThemeScape
► Clustering by Vivisimo
Wisdomain
• Type of tool
■ Content and software tool. Web-based searching and citation tool. Analysis module is local
• Capability
■ Keyword analysis, citation map visualized searching
• Data Sources
■ Patents, specialized in US, EP, PCT, PAJ, INPADOC legal and family status, China abstracts, Korea abstracts
• Results
■ Genealogy tree, tables, charts
Wisdomain - Strengths
• Strong citation analysis capability
► Backward and forward citations, more than one nesting
► Collateral citation analysis
► Citation alerts
• Genealogy Tree
► Good in competitive analysis and licensing strategy planning
• Graphic view of the search results
[Figure: Collateral Citation – identifying similar patents sharing the same pending period with the subject patent. The subject patent (applied 1990, issued 1993) is shown with its pending period; 7 collateral patents are identified based on indirect citation relations, with the key collateral patent highlighted.]
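The collateral citation idea can be sketched as a filter over candidate patents. The slide does not define "indirect citation relations", so this sketch assumes it means sharing at least one cited reference with the subject patent without a direct citation in either direction. The patent numbers and citation sets are hypothetical; only the 1990 and 1993 dates come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Patent:
    number: str
    applied: int                              # filing year
    issued: int                               # grant year
    cites: set = field(default_factory=set)   # numbers of patents this one cites

def pending_overlap(a, b):
    """True if the two pending periods (applied..issued) share at least one year."""
    return a.applied <= b.issued and b.applied <= a.issued

def collateral_patents(subject, candidates):
    """Candidates pending at the same time and indirectly related (assumed definition)."""
    result = []
    for p in candidates:
        if p.number == subject.number:
            continue
        direct = p.number in subject.cites or subject.number in p.cites
        shared = bool(subject.cites & p.cites)
        if pending_overlap(subject, p) and shared and not direct:
            result.append(p.number)
    return result

subject = Patent("S", 1990, 1993, {"A", "B"})
candidates = [
    Patent("P1", 1991, 1994, {"A"}),       # overlapping pendency, shares reference A
    Patent("P2", 1995, 1997, {"A"}),       # shares A but was pending later
    Patent("P3", 1989, 1991, {"C"}),       # overlapping pendency, nothing shared
    Patent("P4", 1992, 1993, {"B", "S"}),  # cites the subject directly
]
collateral = collateral_patents(subject, candidates)
print(collateral)
```

Because patents pending at the same time usually cannot cite one another, a shared-reference test of this kind is one plausible way to surface them; the vendor's actual criteria may differ.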
Summary
Vendor Name | Strength | Potential User Groups
ClearForest | Extraction modules | Business Intelligence
GoldFire | Sophisticated semantic analysis tool | R&D scientists
Inxight | Extraction & Federated Search | R&D Informatics
OmniViz | Interactive visualization | R&D scientists
Temis | Extraction using Specialized Skill Cartridges | R&D scientists, Business Intelligence
Quosa | Full-text retrieval & mgmt | R&D scientists
RefViz | Bibliographic data post-processing | R&D scientists, Information Professionals
VantagePoint | Analytical tool box for technology or company assessment | Information Professionals, Business Intelligence
Aureka | Patent mapping, clustering & citation analysis | Legal/Patent Dept., R&D scientists, Information Professionals, Strategic Planning, Business Intelligence
M-CAM | Patent uniqueness & enforcement analysis | Business Intelligence, Legal/Patent Dept., Information Professionals
PatAnalyst | Powerful full-text user interface with display features | Information Professionals, R&D scientists
Wisdomain | Strong collateral citation analysis | R&D scientists, Information Professionals
Path Forward
■ Phase II
► Pilot selected tools
► Identify potential client groups and interview representative clients
Closing Remarks
Acknowledgements
Peter Mattei Aureka
Thomas Klose ClearForest
Shelley Pavlek GoldFire/Invention Machine
Joanne Freeman Inxight
Marlene Khouri M-CAM
Heahyun Yoo OmniViz
Tony Medina PatAnalyst
Michael Rogers Quosa
Karen Stesis RefViz
Tisha Zawisky Temis
Lou Ann DiNallo VantagePoint
Mary Talmadge-Grebenar Wisdomain
Joseph Bezek
Claudia Powers
Ramesh Durvasula (Informatics)
Ronald Stoner (Mead Johnson)
Questions