Introduction to Text Mining and Semantics
![Page 1: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/1.jpg)
Introduction to Text Mining and Semantics
Seth Grimes
President, Alta Plana
![Page 2: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/2.jpg)
New York Times
October 9, 1958
![Page 3: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/3.jpg)
From text to information
“Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically.”
-- Marti A. Hearst, “Untangling Text Data Mining,” 1999
![Page 4: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/4.jpg)
Document input and processing
Information extraction
Hans Peter Luhn, “A Business Intelligence System,” IBM Journal, October 1958
Knowledge management
![Page 5: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/5.jpg)
Statistical analysis of content
“Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance.”
Hans Peter Luhn, “The Automatic Creation of Literature Abstracts,” IBM Journal, April 1958
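The idea in the Luhn quotation can be sketched in a few lines of Python. This is a simplified illustration, not Luhn's published procedure: it scores each sentence by the average document-wide frequency of its non-stop words and keeps the top-scoring sentences. The stop-word list and the function names (`word_frequencies`, `score_sentences`, `summarize`) are assumptions made for this sketch.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
              "that", "this", "by", "on", "for", "with", "as", "are", "was"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each non-stop word occurs in the whole text."""
    return Counter(w for w in tokenize(text) if w not in STOP_WORDS)


def score_sentences(text):
    """Return (score, sentence) pairs in document order.

    A sentence's score is the mean document-wide frequency of its
    non-stop words: a crude 'relative measure of significance'.
    """
    freqs = word_frequencies(text)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for sentence in sentences:
        words = [w for w in tokenize(sentence) if w not in STOP_WORDS]
        score = sum(freqs[w] for w in words) / len(words) if words else 0.0
        scored.append((score, sentence))
    return scored


def summarize(text, n=1):
    """Keep the n highest-scoring sentences, preserving document order."""
    scored = score_sentences(text)
    top = sorted(scored, key=lambda pair: pair[0], reverse=True)[:n]
    chosen = {sentence for _, sentence in top}
    return [sentence for _, sentence in scored if sentence in chosen]


sample = "Text mining finds patterns in text. Mining text needs statistics. Cats sleep."
print(summarize(sample, n=1))
```

Because the score relies on word counts alone, it ignores grammar and meaning, which is the limitation Luhn himself points out on the next slide.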
![Page 6: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/6.jpg)
Statistical analysis limitations
“This rather unsophisticated argument on ‘significance’ avoids such linguistic implications as grammar and syntax... No attention is paid to the logical and semantic relationships the author has established.”
-- Hans Peter Luhn, 1958
![Page 7: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/7.jpg)
Semantic links
New York Times, September 8, 1957
Anaphora / coreference: “They”
![Page 8: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/8.jpg)
“The Diverse and Exploding Digital Universe,” (IDC, 2008)
Approximately 70% of the digital universe is created by individuals.
“The broadcast, media, and entertainment industries garner about 4% of the world’s revenues but already generate, manage, or otherwise oversee 50% of the digital universe.”
Digital content universe
![Page 9: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/9.jpg)
The digital universe:
• Web sites, news & journal articles, images, video.
• Blogs, forum postings, and social media.
• E-mail, contact-center notes and transcripts; recorded conversation.
• Surveys, feedback forms, warranty & insurance claims.
• Office documents, regulatory filings, reports, scientific papers.
• And every other sort of document imaginable.
Is Search up to the job?
The “unstructured data” challenge
![Page 10: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/10.jpg)
How are the quality, value & authority of search results?
Hotel’s opinion
Guest’s opinion - about Priceline
Who profits from search?
Search results
![Page 11: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/11.jpg)
From Web 1.0 to Web 2.0
How can we do better?
“We have many of the tools in place -- from Web 2.0 technologies…”
“The Diverse and Exploding Digital Universe,” (IDC, 2008)
![Page 12: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/12.jpg)
Web 2.0
“Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as a platform.” -- Tim O’Reilly, 2004
“[A] move from personal websites to blogs and blog site aggregation, from publishing to participation,… an ongoing and interactive process... to links based on tagging.”
-- Terry Flew, “New Media: An Introduction,” 2008
Web 2.0 is dynamic, personalized, interactive, collaborative.
![Page 13: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/13.jpg)
“We have many of the tools in place -- from Web 2.0 technologies… to unstructured data search software and the Semantic Web -- to tame the digital universe. Done right, we can turn information growth into economic growth.”
-- “The Diverse and Exploding Digital Universe,” (IDC, 2008)
From information growth to economic growth
![Page 14: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/14.jpg)
Text mining: from information to intelligence
Text mining enables smarter search that better responds to user goals, e.g., answers –
![Page 15: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/15.jpg)
From Web 2.0 to Web 3.0
For even better findability:
“The Semantic Web is a web of data, in some ways like a global database.”
-- Tim Berners-Lee, 1998
Web 3.0 is Web 2.0 + the Semantic Web + semantic tools.
Recurring themes:
• Semantically enriched content & search.
• Linked Data.
• Context sensitive.
• Location aware.
![Page 16: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/16.jpg)
The Semantic Web vision
Linked Data: “exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web.”
An open-standards architecture, coordinated by the W3C (World Wide Web Consortium)
![Page 17: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/17.jpg)
Steps in the right direction…
![Page 18: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/18.jpg)
Unfiltered duplicates
External reference
“Kind” = type, variety, not a sentiment.
… and missteps
Complete misclassification
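The “kind” misstep is easy to reproduce. The sketch below is a naive word-list sentiment scorer; the word lists and the function name `naive_sentiment` are invented for illustration and are not taken from any tool shown on the slide. Because it matches words without regard to their role in the sentence, it treats “kind” in “what kind of” as praise.

```python
import re

# Hypothetical mini-lexicons, made up for this example.
POSITIVE = {"kind", "great", "good", "friendly"}
NEGATIVE = {"bad", "rude", "dirty"}


def naive_sentiment(text):
    """Label text by counting lexicon hits, ignoring grammar and word sense."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# "kind" is an adjective of praise here, so the label is reasonable.
print(naive_sentiment("The staff were kind."))
# "kind" means type here, yet the scorer still counts it as praise.
print(naive_sentiment("What kind of room is this?"))
```

Telling the two uses apart takes at least part-of-speech or phrase-level analysis, which is where text mining goes beyond keyword matching.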
![Page 19: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/19.jpg)
Getting to Web 3.0
Text mining / analytics enables Web 3.0 and the Semantic Web.
• Automated content categorization and classification.
• Text augmentation: metadata generation, content tagging.
• Information extraction to databases.
• Exploratory analysis and visualization.
Technical concepts:
• Linked Data
• RDF, SPARQL, OWL
• RDFa, Microformats, eRDF
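To make the RDF idea concrete: RDF states facts as (subject, predicate, object) triples, and a SPARQL query is built from triple patterns in which some positions are left open. The toy sketch below mimics one such pattern in plain Python. It is not an RDF library and not SPARQL; the `ex:` identifiers and the `match` helper are hypothetical names, and the facts are ones stated elsewhere in this deck.

```python
# Each fact is a (subject, predicate, object) triple, the basic RDF data shape.
# The "ex:" prefix stands in for a made-up example namespace.
TRIPLES = [
    ("ex:SethGrimes", "ex:presidentOf", "ex:AltaPlana"),
    ("ex:HansPeterLuhn", "ex:authorOf", "ex:ABusinessIntelligenceSystem"),
    ("ex:HansPeterLuhn", "ex:authorOf", "ex:TheAutomaticCreationOfLiteratureAbstracts"),
    ("ex:ABusinessIntelligenceSystem", "ex:publishedIn", "ex:IBMJournal"),
    ("ex:TheAutomaticCreationOfLiteratureAbstracts", "ex:publishedIn", "ex:IBMJournal"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "What did Hans Peter Luhn write?" expressed as a single triple pattern.
for _, _, work in match(TRIPLES, s="ex:HansPeterLuhn", p="ex:authorOf"):
    print(work)
```

Information extraction feeds exactly this kind of store: text mining turns sentences into triples, and Linked Data conventions make the identifiers shareable across sites.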
![Page 20: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/20.jpg)
I recently published a study report, “Text Analytics 2009: User Perspectives on Solutions and Providers.”
I estimated a $350 million global market in 2008, up 40% from 2007.
I relayed findings from a survey that asked…
Text mining: users’ perspective
![Page 21: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/21.jpg)
Primary applications
What are your primary applications where text comes into play?
[Bar chart; axis runs from 0% to 45%. Category labels that survive in this transcript: Law enforcement; Other; E-discovery; Insurance, risk management, or fraud; Content management or publishing; Research (not listed); Competitive intelligence. Bar values, not matched to labels: 7%, 8%, 13%, 14%, 15%, 15%, 17%, 18%, 19%, 22%, 33%, 33%, 37%, 40%.]
![Page 22: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/22.jpg)
Analyzed textual information
What textual information are you analyzing or do you plan to analyze?
Current users responded:
• blogs and other social media (Twitter, social-network sites, etc.): 62%
• news articles: 55%
• on-line forums: 41%
• e-mail and correspondence: 38%
• customer/market surveys: 35%
![Page 23: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/23.jpg)
Extracted information
Do you need (or expect to need) to extract or analyze:
• Named entities – people, companies, geographic locations, brands, ticker symbols, etc.: 71%
• Topics and themes: 65%
• Sentiment, opinions, attitudes, emotions: 60%
• Concepts, that is, abstract groups of entities: 58%
• Events, relationships, and/or facts: 55%
• Metadata such as document author, publication date, title, headers, etc.: 53%
• Other entities – phone numbers, e-mail & street addresses: 40%
• Other: 15%
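Of the items in this list, the “other entities” (phone numbers and e-mail addresses) are the simplest to extract, because they follow surface patterns. The sketch below uses two regular expressions for that purpose. Both patterns are simplifying assumptions for this example (the phone pattern only covers a US-style 3-3-4 digit layout), and `extract_entities` is a hypothetical helper name.

```python
import re

# Simplified patterns for illustration; real-world formats vary far more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}")


def extract_entities(text):
    """Pull pattern-based entities out of free text into a structured record."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


note = "Contact jane.doe@example.com or call 555-123-4567."
print(extract_entities(note))
```

Named entities, sentiment, and events do not follow fixed patterns like these, so they generally call for dictionaries, linguistic rules, or statistical models.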
![Page 24: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/24.jpg)
Overall satisfaction
Please rate your overall experience – your satisfaction – with text mining.
![Page 25: Introduction to Text Mining and Semantics](https://reader035.fdocuments.in/reader035/viewer/2022062617/54c65b8c4a795971538b45a6/html5/thumbnails/25.jpg)
Moving ahead
Apply text mining to discover value in content.
Develop / improve metadata and taxonomies.
Adopt semantic technologies -- for content publishing and for user interactions -- to boost flexibility, findability, and profitability.
And understand your audience:
“By focusing on the fundamental aspects of the consumers’ online behavior -- not just current best practices -- companies will be better prepared when Web 2.0+ morphs into Web 3.0 and beyond.”
-- Donna L. Hoffman, UC Riverside, in the McKinsey Quarterly