Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases
![Page 1: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/1.jpg)
Semantic Data Search and Analysis Using Web-based User-Generated
Knowledge Bases
Dr. Maria Grineva, Systems Group @ ETH Zurich
Sunday, April 7, 13
![Page 2: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/2.jpg)
Today’s Search is Based On Links
• Full-text search is the main way to access information on the Web
• The goal of Web search engines: find the pages most relevant to the user’s query
• Google employs the Web’s hyperlinks to compute relevance of a Web page (PageRank)
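As a rough illustration of link-based relevance, the sketch below runs a simplified PageRank power iteration over a tiny link graph. The toy graph, the damping factor of 0.85, and the function name `pagerank` are assumptions made for this example; it is not a description of Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: three pages link to "hub", which links back to "a".
toy_links = {"a": ["hub"], "b": ["hub"], "c": ["hub"], "hub": ["a"]}
ranks = pagerank(toy_links)
```

In this toy graph "hub" ends up with the highest score because three pages link to it, while "b" and "c", which receive no links, keep only the baseline share.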
22 March 2011 Systems Group @ ETH Zurich for Hasler Foundation
![Page 3: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/3.jpg)
Domains Without Links
• PageRank does not work when documents are not interlinked
• Breaking news and blog posts - must be searchable in real time, before any links to them have been created
• Enterprise databases - documents are not well interconnected because of organizational silos and the limited number of people who create and use them
![Page 4: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/4.jpg)
Web-based User-Generated Knowledge Bases
• To rank and organize documents that are not interlinked well, we need additional knowledge bases:
• Wikipedia - Online encyclopedia
• Twitter - real-time microblogging service
![Page 5: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/5.jpg)
The Goal of This Project
Develop a technology which automatically extracts semantic information:
• from Wikipedia - term meanings, relationships, ontologies ...
• from Twitter - real-time information about breaking news, trends, people’s opinions ...
and applies this information to organize:
• news and blogs on the Web
• documents in enterprise databases
We will release our technology as an open source software framework
![Page 6: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/6.jpg)
Semantic Text Analysis Using Wikipedia
• Leveraging Wikipedia to improve text analysis methods:
• Comprehensive coverage (6M terms vs. 65K in Britannica)
• Continuously brought up-to-date
• Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, info-boxes)
• New algorithms:
• Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic Inference
• Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds
• Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation
• Zero-cost deployment and customization: No need to train methods, no human labor, no “cold start” problem
![Page 7: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/7.jpg)
Basic Technique: Semantic Relatedness of Terms
• We analyze Wikipedia’s link structure to compute the semantic relatedness of Wikipedia terms
• We use the Dice measure with weighted hyperlinks (bi-directional links, direct links, “see also” links, etc.)
Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
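The Dice measure over weighted links can be sketched as follows. This is one plausible weighted generalization, not necessarily the exact formula used in the project: the overlap of two articles' link neighborhoods is the sum of the smaller weight for each shared neighbor, normalized by the total weight of both neighborhoods. The weight values in `LINK_WEIGHTS`, the function name `weighted_dice`, and the sample neighborhoods are all hypothetical.

```python
# Hypothetical link-type weights; the talk does not give the actual values.
LINK_WEIGHTS = {"see_also": 3.0, "bidirectional": 2.0, "direct": 1.0}

def weighted_dice(neighbors_a, neighbors_b):
    """Weighted Dice coefficient between two articles.

    Each argument maps a neighboring article title to the weight of the
    link connecting it to the article. With all weights equal to 1 this
    reduces to the classic Dice measure: twice the number of shared
    neighbors divided by the sum of the two neighborhood sizes.
    """
    total = sum(neighbors_a.values()) + sum(neighbors_b.values())
    if total == 0:
        return 0.0
    shared = set(neighbors_a) & set(neighbors_b)
    overlap = sum(min(neighbors_a[n], neighbors_b[n]) for n in shared)
    return 2.0 * overlap / total

# Made-up neighborhoods for two articles.
apache = {
    "HTTP": LINK_WEIGHTS["bidirectional"],
    "Open source": LINK_WEIGHTS["direct"],
    "Web server": LINK_WEIGHTS["see_also"],
}
nginx = {
    "HTTP": LINK_WEIGHTS["direct"],
    "Web server": LINK_WEIGHTS["see_also"],
    "Russia": LINK_WEIGHTS["direct"],
}
score = weighted_dice(apache, nginx)
```

Taking the minimum weight for shared neighbors keeps the score between 0 and 1 and makes it symmetric in its two arguments.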
![Page 8: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/8.jpg)
Word Sense Disambiguation
• Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians
• We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text
• Example: Platform is mentioned in the context of implementation, open-source, web-server, HTTP
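A minimal sketch of context-based disambiguation, assuming a relatedness function between a candidate sense and a context term is already available: each candidate sense is scored by its average relatedness to the surrounding terms, and the best-scoring sense wins. The sense names, the scores in `TOY_SCORES`, and the helper names are invented for illustration; a real system would take candidates from a disambiguation page and compute relatedness from the link structure.

```python
def disambiguate(candidates, context, relatedness):
    """Pick the candidate sense most related, on average, to the context terms."""
    def score(sense):
        if not context:
            return 0.0
        return sum(relatedness(sense, term) for term in context) / len(context)
    return max(candidates, key=score)

# Hypothetical candidate senses, as a disambiguation page might list them.
senses = ["Computing platform", "Railway platform", "Oil platform"]

# Made-up relatedness scores; unlisted pairs count as unrelated (0.0).
TOY_SCORES = {
    ("Computing platform", "open-source"): 0.6,
    ("Computing platform", "web-server"): 0.7,
    ("Computing platform", "HTTP"): 0.5,
    ("Oil platform", "implementation"): 0.1,
}

def toy_relatedness(sense, term):
    return TOY_SCORES.get((sense, term), 0.0)

context_terms = ["implementation", "open-source", "web-server", "HTTP"]
best = disambiguate(senses, context_terms, toy_relatedness)
```

With the toy scores above, the four context terms select the computing sense of "Platform".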
![Page 9: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/9.jpg)
Prototype of a Semantic Search Engine for the Blogosphere
![Page 10: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/10.jpg)
Twitter - A Real-Time News Medium
• ~200M users all over the world posting short messages (tweets) via mobile devices and web browsers
• ~140M tweets per day
• Twitter is an open social network where anyone can follow anyone
• Retweets - a mechanism for fast news spreading
![Page 11: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/11.jpg)
Following + Retweets: Twitter is the Fastest News Medium
• Twitter reacts faster than mainstream media: Haiti earthquake, Hudson River plane crash
• Everyone can be a reporter: real-time updates on the revolutions in Tunisia, Egypt, Libya, Iran ...
![Page 12: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/12.jpg)
Extracting Useful Information From Twitter
• Popularity of a URL
• Sentiments, opinions about a news story (tweets containing the news URL)
• Trending topics: what is being actively discussed right now
• Personalization of news based on the user’s friend connections: The Tweeted Times http://tweetedtimes.com
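The first item, URL popularity, can be approximated by counting how many distinct users tweeted each link. The sketch below assumes tweets arrive as `(user, text)` pairs and uses a deliberately simple URL pattern; the sample tweets, the function name `url_popularity`, and the choice to count distinct users rather than raw tweets are assumptions for illustration.

```python
import re
from collections import defaultdict

# Simple pattern for http(s) links; real tweets would need sturdier parsing.
URL_PATTERN = re.compile(r"https?://\S+")

def url_popularity(tweets):
    """Count distinct users who tweeted each URL.

    tweets is an iterable of (user, text) pairs.
    """
    users_by_url = defaultdict(set)
    for user, text in tweets:
        for url in URL_PATTERN.findall(text):
            # Strip trailing punctuation that is not part of the link.
            users_by_url[url.rstrip(".,!?)")].add(user)
    return {url: len(users) for url, users in users_by_url.items()}

sample = [
    ("alice", "Big news http://example.com/story"),
    ("bob", "RT @alice: Big news http://example.com/story"),
    ("alice", "Again: http://example.com/story!"),
    ("carol", "Unrelated http://example.org/other"),
]
popularity = url_popularity(sample)
```

Counting distinct users keeps one account that repeats a link from inflating its score; retweets by other users still count.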
![Page 13: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/13.jpg)
The Tweeted Times: a personalized newspaper generated from a user’s Twitter account
![Page 14: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/14.jpg)
At the Systems Layer
• Scalable distributed architecture is required:
• Hadoop (MapReduce software framework) for batch processing of Wikipedia snapshots
• Real-time analytics based on distributed key-value store for online Twitter stream processing
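To make the batch side concrete, here is a sketch of a MapReduce-style job that counts inbound links per article in a Wikipedia snapshot. It is plain Python that mimics the map, shuffle, and reduce phases in memory and does not use the Hadoop API; the toy snapshot, the link pattern, and the function names are assumptions for illustration.

```python
import re
from collections import defaultdict

# Captures the target of an internal wiki link such as [[Target|label]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")

def map_article(title, wikitext):
    """Map phase: emit (target, 1) for every internal link in one article."""
    for match in WIKI_LINK.finditer(wikitext):
        yield match.group(1).strip(), 1

def reduce_counts(target, values):
    """Reduce phase: sum the partial counts for one link target."""
    return target, sum(values)

def run_job(articles):
    """Simulate the shuffle step in memory: group mapped values by key."""
    grouped = defaultdict(list)
    for title, wikitext in articles.items():
        for key, value in map_article(title, wikitext):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

# Hypothetical two-article snapshot.
toy_snapshot = {
    "IBM": "[[Computer]] company; see [[Mainframe computer|mainframes]].",
    "Apache HTTP Server": "An open-source [[Web server]] for a [[Computer]].",
}
inbound = run_job(toy_snapshot)
```

Because each article is mapped independently and each target is reduced independently, the same two functions could be distributed across many machines by a framework that handles the grouping step.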
![Page 15: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/15.jpg)
Scalable Real-Time Analytics Based On Distributed Key-Value Store
• At Systems Group, we are working on a system for real-time analytics based on Cassandra:
• We extend Cassandra with:
• push-style procedure for real-time analytics
• incremental computations (alternative to batch-processing) - processing data as it arrives from the stream
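The contrast with batch processing can be sketched as follows: instead of recomputing counts over the full data set, each arriving event updates the stored state, and a callback is pushed to the consumer the moment a condition becomes true. This is a conceptual sketch only; an in-memory dictionary stands in for the distributed key-value store, no Cassandra API is used, and the class name `IncrementalCounter`, the threshold, and the sample hashtags are hypothetical.

```python
from collections import defaultdict

class IncrementalCounter:
    """Keeps per-key counts up to date as events arrive, one at a time."""

    def __init__(self, threshold, on_trigger):
        # In-memory stand-in for a distributed key-value store.
        self.counts = defaultdict(int)
        self.threshold = threshold
        self.on_trigger = on_trigger

    def push(self, key):
        """Apply one event; fire the callback when the threshold is first reached."""
        self.counts[key] += 1
        if self.counts[key] == self.threshold:
            self.on_trigger(key, self.counts[key])

alerts = []
counter = IncrementalCounter(threshold=3, on_trigger=lambda k, c: alerts.append((k, c)))
for hashtag in ["#egypt", "#haiti", "#egypt", "#egypt", "#egypt"]:
    counter.push(hashtag)
```

Each event costs a single read-modify-write on one key, so the result is available as soon as the third "#egypt" event arrives rather than after the next batch run.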
![Page 16: Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases](https://reader033.fdocuments.in/reader033/viewer/2022051817/54931ce3ac7959342e8b47a7/html5/thumbnails/16.jpg)
References
• Prototype of the semantic search engine Blognoon: http://blognoon.com
• The Tweeted Times - personalized newspaper based on user’s Twitter account: http://tweetedtimes.com
• Triggy: a system for real-time analytics: http://www.systems.ethz.ch/research/projects