10/14/2015 Data Mining: Concepts and Techniques — Chapter 10 — 10.3.1 Mining Text and Web Data
Data Mining: Concepts and Techniques
— Chapter 10 —10.3.1 Mining Text and Web Data (I)
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
Acknowledgements: Based on the slides by students at CS512 (Spring 2009)
Outline
- Introduction to Information Retrieval (Rui Li)
- Text categorization (Parikshit Sondhi)
- Web link analysis (Kavita Ganesan)
- Mining and Searching Structured Data on the Web (Bo Zhao)

Information Retrieval
What is Information Retrieval?
- There exists a collection of text documents; a user gives a query to express an information need; the retrieval system returns relevant documents to the user
- Typical IR systems: online library catalogs, online document management systems, web search engines (e.g., Google)
Information Retrieval vs. Database Systems:
- unstructured/free text vs. structured data
- ambiguous vs. well-defined semantics
- incomplete vs. complete specification
- relevant documents vs. matched records
- no transaction management vs. transaction management
Typical IR System Architecture
[Architecture diagram: documents are run through a Tokenizer and an Indexer to build the Index (document representations); the user's query is turned into a query representation; a Scorer matches the query representation against the indexed document representations and returns ranked results to the user.]
Document Representation
A document can be described by a set of representative keywords called index terms. Different index terms have varying relevance when used to describe document contents.
Steps:
- Tokenize the document into words
- Remove stop words using a stop-word list (e.g., "is", "a", "or")
- Stem words: several words are small syntactic variants of each other since they share a common word stem (e.g., drug, drugs, drugged)
- Calculate each term's weight based on word frequency
Query representation is a similar process.
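The preprocessing steps above can be sketched as follows. The stop-word list and the crude suffix-stripping stemmer are toy stand-ins (a real system would use a full stop-word list and, e.g., the Porter stemmer):

```python
from collections import Counter
import re

STOP_WORDS = {"is", "a", "or", "the", "and", "of", "to", "in"}  # toy list

def stem(word):
    # Crude suffix stripping as a stand-in for a real stemmer:
    # maps "drugs" and "drugged" to the common stem "drug".
    for suffix in ("ged", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def represent(doc):
    """Tokenize, remove stop words, stem, and count term frequencies."""
    tokens = re.findall(r"[a-z]+", doc.lower())          # 1. tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 2. remove stop words
    tokens = [stem(t) for t in tokens]                   # 3. stem
    return Counter(tokens)                               # 4. raw TF weights

rep = represent("A drug or drugs? The drugged patient is sleeping.")
```

Here `rep` collapses "drug", "drugs", and "drugged" into a single index term with frequency 3.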
Indexing: the Inverted Index
Maintains two hash- or B+-tree-indexed tables:
- document_table: a set of document records <doc_id, postings_list>
- term_table: a set of term records <term, postings_list>
Answering a query: find all docs associated with one term or a set of terms
+ easy to implement
+ effective for fetching documents with a specific term
– does not handle synonymy and polysemy well, and postings lists can be too long (storage can be very large)
Other index techniques: e.g., the signature file
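A minimal in-memory sketch of the term table and of answering a conjunctive query by intersecting postings lists (real systems keep hash- or B+-tree-indexed tables on disk; the documents here are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """term_table: term -> postings list of doc_ids containing the term."""
    term_table = defaultdict(list)
    for doc_id, text in docs.items():
        for term in sorted(set(text.lower().split())):
            term_table[term].append(doc_id)
    return term_table

def answer(term_table, terms):
    """Find all docs associated with a set of terms (intersect postings)."""
    postings = [set(term_table.get(t, [])) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {1: "data mining algorithms", 2: "web mining", 3: "database systems"}
index = build_inverted_index(docs)
print(answer(index, ["mining"]))          # docs 1 and 2 contain "mining"
```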
Ranking Model
The basic question: given a query, how do we know whether document A is more relevant than document B?
Relevance ≈ Similarity:
- Query and document are represented similarly; a query can be regarded as a "document"
- Relevance(d, q) ∝ similarity(d, q)
Key issues: how to represent the query/document, and how to define the similarity measure
Typical models: Boolean model, Vector Space Model, Language Model
The Notion of Relevance
- Relevance ≈ Similarity(Rep(q), Rep(d)), with different representations and similarity measures:
  - Vector space model (Salton et al., 75)
  - Prob. distribution model (Wong & Yao, 89)
  - …
- Relevance ≈ P(r=1|q,d), r ∈ {0,1}: probability of relevance
  - Generative models:
    - Doc generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - Regression model (Fox 83)
- Relevance ≈ P(d → q) or P(q → d): probabilistic inference, with different inference systems:
  - Prob. concept space model (Wong & Yao, 95)
  - Inference network model (Turtle & Croft, 91)
Vector Space Model
Represent a doc/query by a term vector:
- Term: basic concept, e.g., word or phrase
- Each term defines one dimension, and N terms define a high-dimensional space
- Each element of the vector corresponds to a term weight, i.e., the "importance" of the term
[Illustration: documents D1-D11 and a query plotted in a space with axes Java, Microsoft, and Starbucks; the documents closest to the query vector are the best matches.]
04/19/23 Data Mining: Principles and Algorithms
How to Assign Weights
Two-fold heuristics based on frequency:
- TF (term frequency): more frequent within a document → more relevant to its semantics (e.g., "query" vs. "commercial")
- IDF (inverse document frequency): less frequent among documents → more discriminative (e.g., "algebra" vs. "science")
TF-IDF weighting: weight(t, d) = TF(t, d) × IDF(t)
- Frequent within a doc → high TF → high weight
- Selective among docs → high IDF → high weight
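The TF-IDF weighting can be computed directly. This sketch assumes the common variant IDF(t) = log(N / DF(t)), since the slide does not fix an exact IDF formula:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc,
    with weight(t, d) = TF(t, d) * IDF(t) and IDF(t) = log(N / DF(t))."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["data", "mining", "data"], ["web", "mining"], ["web", "search"]]
w = tf_idf(docs)
```

Note that a term occurring in every document gets IDF = log(1) = 0: it is frequent but not selective, so its weight vanishes.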
How to Measure Similarity?
Given two documents represented as term-weight vectors d1 and d2, the similarity can be defined as:
- dot product: sim(d1, d2) = Σ_i w_1i · w_2i
- normalized dot product (cosine): sim(d1, d2) = (Σ_i w_1i · w_2i) / (|d1| · |d2|)
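Both similarity definitions in one short sketch, with sparse dictionaries standing in for term vectors:

```python
import math

def dot(d1, d2):
    """Dot-product similarity of two sparse term-weight vectors."""
    return sum(w * d2.get(t, 0.0) for t, w in d1.items())

def cosine(d1, d2):
    """Normalized dot product: dot(d1, d2) / (|d1| * |d2|)."""
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot(d1, d2) / (norm1 * norm2) if norm1 and norm2 else 0.0

a = {"data": 1.0, "mining": 2.0}
b = {"data": 2.0, "mining": 4.0}   # same direction, different length
print(cosine(a, b))                # identical orientation -> 1.0
```

The normalization is what makes cosine insensitive to document length: `a` and `b` differ in magnitude but point the same way.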
Advantages and Disadvantages of the VS Model
Advantages:
- Empirically effective (top TREC performance)
- Intuitive
- Easy to implement
- Well studied / the most evaluated model
Disadvantages:
- Assumes term independence
- Assumes query and document are of the same kind
- Lacks "predictive adequacy"
- Arbitrary term weighting
- Arbitrary similarity measure
Language Models for Retrieval (Ponte & Croft 98)
Each document is associated with its own language model:
- a text mining paper: text ?, mining ?, association ?, clustering ?, …, food ?, …
- a food nutrition paper: food ?, nutrition ?, healthy ?, diet ?, …
Query = "data mining algorithms": which model would most likely have generated this query?
Text Generation with a Unigram LM
A (unigram) language model p(w|θ) generates a document by sampling words:
- Topic 1 (text mining): text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → generates a text mining paper
- Topic 2 (health): food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → generates a food nutrition paper
Estimation of a Unigram LM
Given a document, estimate p(w|θ) from its word counts. For a "text mining paper" with a total of 100 words:
text 10 → 10/100, mining 5 → 5/100, association 3 → 3/100, database 3 → 3/100, algorithm 2 → 2/100, …, query 1 → 1/100, efficient 1 → 1/100, …
Ranking Docs by Query Likelihood
Estimate a language model for each document d1, d2, …, dN, then rank the documents by the query likelihoods p(q|d1), p(q|d2), …, p(q|dN).
Retrieval as Language Model Estimation
Document ranking based on query likelihood: for query q = w1 w2 … wn,

  log p(q|d) = Σ_i log p(w_i|d)

where p(·|d) is the document language model.
- The retrieval problem reduces to the estimation of p(w_i|d)
- Smoothing is an important issue, and distinguishes different approaches
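A sketch of query-likelihood ranking with smoothing. Jelinek-Mercer interpolation with λ = 0.5 is one common smoothing choice, assumed here since the slides do not fix one; the two toy documents echo the earlier text-mining/nutrition example:

```python
import math
from collections import Counter

def smoothed_lm(doc_tokens, collection_tf, collection_len, lam=0.5):
    """Jelinek-Mercer smoothing: interpolate the document's maximum-likelihood
    estimate with the collection model so unseen query words get p > 0."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    def p(w):
        return lam * tf[w] / n + (1 - lam) * collection_tf[w] / collection_len
    return p

def query_log_likelihood(query_tokens, p):
    # log p(q|d) = sum_i log p(w_i|d)
    return sum(math.log(p(w)) for w in query_tokens)

docs = {"d1": "text mining text clustering".split(),
        "d2": "food nutrition healthy diet".split()}
collection = [w for toks in docs.values() for w in toks]
ctf, clen = Counter(collection), len(collection)

query = "text mining".split()
scores = {d: query_log_likelihood(query, smoothed_lm(toks, ctf, clen))
          for d, toks in docs.items()}
print(max(scores, key=scores.get))  # the text mining paper ranks highest
```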
Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses):

  precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved:

  recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

[Venn diagram: within the set of all documents, the Relevant and Retrieved sets overlap in the region "Relevant & Retrieved".]
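Both measures can be computed directly from the relevant and retrieved sets:

```python
def precision_recall(relevant, retrieved):
    """precision = |Relevant & Retrieved| / |Retrieved|
       recall    = |Relevant & Retrieved| / |Relevant|"""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 2 of the 3 retrieved docs are relevant; 2 of the 4 relevant docs retrieved
p, r = precision_recall(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
```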
Acknowledgements: some slides come from Professor Jiawei Han's CS512 course slides and from Professor ChengXiang Zhai's CS410 course slides (language model part).

By: Parikshit Sondhi
Computer Science, University of Illinois at Urbana-Champaign
Some slides have been adapted from Prof. Han's presentation
Text Categorization
Document Classification: Motivation
News article classification, automatic email filtering, webpage classification, word sense disambiguation, …
Text Categorization
- Pre-given categories and labeled document examples (categories may form a hierarchy)
- Classify new documents
- A standard classification (supervised learning) problem
[Illustration: a categorization system is trained on documents labeled Sports, Business, Education, Science, …, then assigns new documents to Sports, Business, Education.]
Document Classification: Problem Definition
Assign a boolean value {0, 1} to each entry a_ij of the decision matrix, where:
- C = {c1, …, cm} is a set of pre-defined categories
- D = {d1, …, dn} is a set of documents to be categorized
- a_ij = 1 means d_j belongs to c_i; a_ij = 0 means d_j does not belong to c_i
(A Tutorial on Automated Text Categorisation, Fabrizio Sebastiani, Pisa, Italy)
Flavors of Classification
- Single-label: for a given d_i, at most one (d_i, c_i) is true; train a system that takes a d_i and C as input and outputs a single c_i
- Multi-label: for a given d_i, zero, one, or more (d_i, c_i) can be true; train a system that takes a d_i and C as input and outputs C', a subset of C
- Binary: build a separate system for each c_i that takes a d_i as input and outputs a boolean value for (d_i, c_i); the most general approach; based on the assumption that the decision on (d_i, c_i) is independent of the decision on (d_i, c_j)
Classification Methods
Manual: typically rule-based (the knowledge engineering approach)
- Does not scale up (labor-intensive, rule inconsistency)
- May be appropriate for special data in a particular domain
Automatic: typically exploiting machine learning techniques
- Vector space model based: prototype-based (Rocchio), k-nearest neighbor (kNN), decision trees (learn rules), neural networks (learn a non-linear classifier), support vector machines (SVM)
- Probabilistic or generative model based: naïve Bayes classifier
Steps in Document Classification
The classification process:
- Data preprocessing, e.g., term extraction, dimensionality reduction, feature selection, etc.
- Definition of the training and test sets
- Creation of the classification model using the selected classification algorithm
- Classification model validation
- Classification of new/unknown text documents
Taking an Example: TFIDF Classifiers
Vector Space Model
- Represent a doc by a term vector; term: basic concept, e.g., word or phrase
- Each term defines one dimension; N terms define an N-dimensional space
- Each element of the vector corresponds to a term weight, e.g., d = (x1, …, xN), where x_i is the "importance" of term i
A new document is assigned to the most likely category based on vector similarity (e.g., based on the cosine formula).
VS Model: Illustration
[Illustration: categories C1, C2, C3 plotted in a space with axes Java, Microsoft, and Starbucks; a new doc is assigned to the nearest category.]
TFIDF Classifier
The basic idea of the algorithm is to represent each document d as a vector d = (d(1), …, d(|F|)) in a vector space, so that documents with similar content have similar vectors.
- Each dimension of the vector space represents a word selected by the feature selection process.
- d(i), called the weight of word w_i in document d, is calculated as a combination of the statistics TF(w_i, d) and DF(w_i).
(A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, Thorsten Joachims, Carnegie Mellon University, Pittsburgh, PA)
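A Rocchio-style prototype classifier along these lines can be sketched as follows: average each category's training vectors into a centroid, then assign a new document to the most cosine-similar centroid. The term weights below are hand-made toy values, not real TFIDF statistics:

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Average the term-weight vectors of one category's training docs."""
    c = {}
    for vec in vectors:
        for t, w in vec.items():
            c[t] = c.get(t, 0.0) + w / len(vectors)
    return c

def train(labeled):  # labeled: category -> list of term-weight vectors
    return {cat: centroid(vecs) for cat, vecs in labeled.items()}

def classify(prototypes, doc_vec):
    return max(prototypes, key=lambda cat: cosine(prototypes[cat], doc_vec))

prototypes = train({
    "sports":   [{"game": 2.0, "team": 1.0}, {"team": 2.0, "score": 1.0}],
    "business": [{"market": 2.0, "stock": 1.0}, {"stock": 2.0, "profit": 1.0}],
})
print(classify(prototypes, {"team": 1.0, "game": 1.0}))  # prints "sports"
```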
Representation
- Each distinct word is a feature, with the number of times the word occurs in the document as its value; this value is usually a function of TF(w, d) and IDF(w).
- To avoid unnecessarily large feature vectors, words are considered features only if they occur in the training data at least m times (e.g., m = 3).
Preprocessing: Feature Selection
All available features vs. a "good" subset: the problem of finding a "good" subset of features is called feature selection.
Feature selection methods:
1. Pruning of infrequent words: words are only considered as features if they occur at least a few times in the training data.
2. Pruning of high-frequency words: this technique is supposed to eliminate non-content words like "the", "and", "for".
Classification: TFIDF Classifier
Evaluations
Effectiveness measure, classic: precision & recall
- Precision: of the documents assigned to a category, the fraction that truly belong to it
- Recall: of the documents that truly belong to a category, the fraction that are assigned to it
Evaluation (cont'd)
Benchmarks, classic: the Reuters collection
- A set of newswire stories classified under categories related to economics
Effectiveness: difficulties of strict comparison
- different parameter settings
- different "splits" (or selections) between training and testing
- various optimizations
However, some results are widely recognized: best are boosting-based committee classifiers & SVM; worst is the naïve Bayes classifier.
Other factors, especially efficiency, also need to be considered.
Document Classification: Approach Comparisons
Document Clustering
Motivation: automatically group related documents based on their contents
- No predetermined training sets or taxonomies
- Generate a taxonomy at runtime
The most popular clustering methods are k-means clustering, agglomerative hierarchical clustering, EM (Gaussian mixtures), …
The Steps and Algorithms
Clustering process:
- Data preprocessing: remove stop words, stem, feature extraction, lexical analysis, etc.
- Hierarchical clustering: compute similarities by applying clustering algorithms
- Model-based clustering (neural network approach): clusters are represented by "exemplars" (e.g., SOM)
K-Means Clustering
Given a set of documents (e.g., TFIDF vectors), a distance measure (e.g., cosine), and K (the number of groups):
- For each of the K groups, initialize its centroid with a random document
- While not converged:
  - Assign each document to the nearest group (represented by its centroid)
  - For each group, calculate the new centroid (the group mass point, i.e., the average document in the group)
Slide adapted from Dr. Andrew Moore’s Presentation
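The loop above can be sketched directly. Cosine distance and the toy 2-D vectors are illustrative choices; real document vectors would be high-dimensional TFIDF vectors:

```python
import math
import random

def cosine_dist(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - (dot / (nu * nv) if nu and nv else 0.0)

def kmeans(docs, k, iters=20, seed=0):
    """docs: list of dense term-weight vectors of equal length."""
    random.seed(seed)
    centroids = random.sample(docs, k)  # init centroids with random documents
    assign = []
    for _ in range(iters):
        # Assignment step: each document goes to the nearest centroid.
        assign = [min(range(k), key=lambda j: cosine_dist(d, centroids[j]))
                  for d in docs]
        # Update step: each centroid becomes the average of its documents.
        for j in range(k):
            members = [d for d, a in zip(docs, assign) if a == j]
            if members:
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return assign, centroids

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assign, _ = kmeans(docs, k=2)   # the two pairs end up in separate groups
```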
Summary: Text Categorization
- Wide application domain
- Effectiveness comparable to that of professionals
- Manual TC is not 100% accurate and unlikely to improve substantially
- Automated TC is growing at a steady pace
- Prospects and extensions: very noisy text, such as OCR output and speech transcripts
References
- Fabrizio Sebastiani, "Machine Learning in Automated Text Categorization", ACM Computing Surveys, Vol. 34, No. 1, March 2002.
- Yiming Yang, "An evaluation of statistical approaches to text categorization", Journal of Information Retrieval, 1:67-88, 1999.
- Yiming Yang and Xin Liu, "A re-examination of text categorization methods", Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999.
Thank You

April 19, 2023 Data Mining: Concepts and Techniques

Web Link Analysis
By Kavita Ganesan
RECAP
What is ranking in information retrieval?
[Illustration: a search performed on Google returns Doc 1, Doc 2, Doc 3, Doc 4, ranked 1st, 2nd, 3rd, and 4th.]
Why is ranking important?
- Users tend to look at only the top few results, so make sure that good matches are at the very top
- Fast access to information: savvy users want results immediately
What happens if pages are poorly ranked?
- Important matches are missed
- Poor user retention
Ranking in Text Information Retrieval
Before the web existed:
- Each document was treated as a bag of words, with minimal structure
- Ranking heuristics were based solely on the words in the documents, e.g., term frequency and inverse document frequency
After the web was born, documents:
- have structure
- contain hyperlinks
- contain components like title, author, abstract, sections, references
The question is: can we leverage this information to improve ranking?
Exploiting Inter-Document Links
What does a link tell us?
- Links indicate the utility of a doc
- Anchor text gives a description of the target page
- Pages can act as hubs and authorities
Link analysis algorithms: PageRank and HITS both use hyperlink analysis to rank documents.
PageRank
Based on the idea of a "random surfer": the likelihood that a person randomly clicking on links will arrive at any particular page.
- Pages are represented as states of a Markov chain
- The probability of moving from one page to another is modelled as a state transition probability
PageRank: Example
Pages A, B, C, D link as follows: A → B, C; B → A, D; C → A; D → A, C.
State transition matrix (rows = from, columns = to, in order A, B, C, D):

  A:  0    1/2  1/2  0
  B:  1/2  0    0    1/2
  C:  1    0    0    0
  D:  1/2  0    1/2  0

Reading column A: PR(A) = 1/2 · PR(B) + 1 · PR(C) + 1/2 · PR(D)
PageRank PageRank value for any page u can be expressed
as
PR(u) VEBu
PR(v) L(v)
L(v) = number of outbound links of page vPR(v) = PageRank of page vBu = set of pages linking to page u
April 19, 2023Data Mining: Concepts and
Techniques 55
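A power-iteration sketch of this formula on the four-page example. Note that the formula on the slide omits the damping factor that production PageRank adds:

```python
def pagerank(links, iters=100):
    """Power iteration for PR(u) = sum over v in B_u of PR(v) / L(v).
    links: page -> list of outbound pages. Real PageRank adds a damping
    factor, which the slide's formula omits."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for v, outs in links.items():
            for u in outs:
                nxt[u] += pr[v] / len(outs)  # v passes PR(v)/L(v) to each u
        pr = nxt
    return pr

# The four-page example: A -> B, C;  B -> A, D;  C -> A;  D -> A, C
links = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["A", "C"]}
pr = pagerank(links)
```

On this graph the scores converge to the stationary distribution of the Markov chain: A gets the highest score since every other page links to it.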
HITS
HITS (Hyperlink-Induced Topic Search) was developed by Jon Kleinberg at Cornell.
The algorithm identifies two types of pages:
- Authority: pages that provide important, trustworthy information on a given topic
- Hub: pages that contain links to authorities
Authorities and hubs exhibit a mutually reinforcing relationship: a better hub points to many good authorities, and a better authority is pointed to by many good hubs.
HITS Algorithm
1. Start with each node (page) having a hub score and an authority score of 1
2. Run the authority update rule
3. Run the hub update rule
4. Normalize the values: divide each hub score by the sum of all hub scores, and divide each authority score by the sum of all authority scores
5. Repeat from step 2 as necessary
HITS Algorithm — Authority Update
A node's authority score is the sum of the hub scores of the nodes that point to it. A page has high authority if it is linked to by pages that are recognized as hubs for information.
[Illustration: B, C, and D all point to A, so authority(A) = hub(B) + hub(C) + hub(D).]
HITS Algorithm — Hub Update
A node's hub score is the sum of the authority scores of the nodes that it points to. A page is a good hub if it links to pages that have high authority.
[Illustration: A points to E, F, and G, so hub(A) = authority(E) + authority(F) + authority(G).]
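The full update-normalize loop can be sketched as follows; the link graph is a toy example combining the two illustrations (B, C, D point to A, which in turn points to E and F):

```python
def hits(links, iters=20):
    """links: page -> list of pages it points to (a tiny sketch of HITS)."""
    pages = set(links) | {u for outs in links.values() for u in outs}
    hub = dict.fromkeys(pages, 1.0)    # step 1: all scores start at 1
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iters):
        # Authority update: sum the hub scores of nodes pointing to p.
        auth = {p: sum(hub[v] for v, outs in links.items() if p in outs)
                for p in pages}
        # Hub update: sum the authority scores of nodes p points to.
        hub = {p: sum(auth[u] for u in links.get(p, [])) for p in pages}
        # Normalize each score vector so it sums to 1.
        asum, hsum = sum(auth.values()), sum(hub.values())
        auth = {p: s / asum for p, s in auth.items()}
        hub = {p: s / hsum for p, s in hub.items()}
    return hub, auth

links = {"B": ["A"], "C": ["A"], "D": ["A"], "A": ["E", "F"]}
hub, auth = hits(links)
```

As expected, A emerges as the strongest authority (three hubs point to it), while B, C, and D score highest as hubs.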
PageRank vs. HITS
Both are iterative algorithms based on the linkage of documents on the web. Key differences:
- HITS is executed at query time (authority and hub scores are query-specific), which takes a performance hit; PageRank is pre-computed
- HITS computes two scores per document (hub and authority); PageRank computes a single score
End

[Illustration, example authority pages: people such as Kevin Chang, Cheng Zhai, and Marianne Winslett; sites such as ibm.com, berkeley.edu, stanford.edu.]
If a page is popular, then it must be an important page.
Mining and Searching Structured Data on the Web
Bo Zhao ([email protected])
Department of Computer Science
University of Illinois at Urbana-Champaign
Structured Data are EVERYWHERE!
- Deep Web: databases behind websites (aa.com)
- Web 2.0 content: Flickr and Del.icio.us tags
- Google Base: structured data portals
- Surface Web: emails, organizations, countries, …
Solutions
- Deep Web data integration: vertical search engines; on-the-fly meta-querying systems; pay-as-you-go integration
- Deep Web surfacing
- Entity search on the surface Web
Vertical Search Engines: the "Warehousing" Approach
- Academic search: Libra@MSRA, DBLife@WISC, ArnetMiner@Tsinghua
- Many other domains: shopping, events, apartments, …
E.g., Libra Academic Search [NieZW+05] (courtesy MSRA):
- Integrates information from multiple types of sources (web databases, PS/DOC files, journal, author, and conference homepages, …)
- Ranks papers, conferences, and authors for a given query
- Handles structured queries
On-the-fly Meta-querying Systems
- MetaQuerier@UIUC
- WISE-Integrator: http://www.data.binghamton.edu:8080/wise-integrator/
- Commercial systems: http://www.cheaptickets.com, http://pipl.com, …
E.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]: maintain a "db of dbs" to FIND sources (e.g., Amazon.com, Cars.com, 411localte.com, Apartments.com) and QUERY the selected sources through a unified query interface.
Technical Challenges
- Source modeling & selection: how to describe a source and find the right sources for query answering?
- Schema matching: how to match the schematic structures between sources?
- Source querying, crawling, and object ranking: how to query a source? how to crawl all objects and search them?
- Data extraction: how to extract result pages into relations?
Source Modeling & Selection: how to describe a source and find the right sources for query answering
- Focus: discovery of sources. Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].
- Focus: extraction of source models. Hidden grammar-based parsing [ZhangHC04]; proximity-based extraction [HeMY+04]; classification to align with a given taxonomy [HessK03, Kushmerick03].
- Focus: organization of sources and query routing. Offline clustering [HeTC04, PengMH+04]; online search for query routing [KabraLC05].
Form Extraction: the Problem
Output all the query conditions; for each:
- Group elements into query conditions
- Tag elements with their "semantic roles": attribute, operator, value
Schema Matching: how to match the schematic structures between sources
- Focus: matching a large number of interface schemas, often in a holistic way. Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05]; query probing [WangWL+04]; clustering [HeMY+03, WuYD+04]; corpus-assisted [MadhavanBD+05]; Web-assisted [WuDY06].
- Focus: constructing unified interfaces. As a global generative model [HeC03]; cluster-merge-select [HeMY+03].
E.g., WISE-Integrator: cluster-merge-represent [HeMY+03]
Source Querying: how to query a source? how to crawl all objects and search them?
1. Meta-querying model. Focus: on-the-fly querying; MetaQuerier Query Assistant [ZhangHC05].
2. Vertical-search-engine model. Focus: source crawling to collect objects, via form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06]; focus: object search and ranking [NieZW+05].
On-the-fly Querying [ZhangHC05]: Type-Locality-Based Predicate Translation
[Pipeline: a target template P and target predicate t* go through a Type Recognizer, then a domain-specific handler (text, numeric, or datetime handler), then a Predicate Mapper, which outputs the source predicate s.]
- Correspondences occur within localities
- Translation is done by the type handler
Source Crawling by Query Selection [WuWL+06]
Example database:

  Author  Title        Category
  Ullman  Compiler     System
  Ullman  Data Mining  Application
  Ullman  Automata     Theory
  Han     Data Mining  Application

Conceptually, view the DB as a graph: nodes are attribute values (Ullman, Han, Compiler, Automata, Data Mining, Application, Theory, System); edges are occurrence relationships.
Crawling is then transformed into a graph traversal problem: find a set of nodes N in the graph G such that for every node i in G there exists a node j in N with j → i, and the total cost of the nodes in N is minimized.
77
Object Ranking - Object Relationship Graph [NieZW+05]
Popularity Propagation Factor for each type of relationship link
Popularity of an object is also affected by the popularity of the Web pages containing the object
78
Data Extraction: how to extract result pages into relations
[Architecture: a mediator sits above multiple wrappers, one per source; the wrapper-mediator architecture [Wiederhold92].]
- Focus: semi-automatic wrapper construction. Manual construction; semi-automatic, learning-based: HLRT [KushmerickWD97], Stalker [MusleaMK99], Softmealy [HsuD98].
- Focus: even more automatic approaches. Semi-automatic, learning-based: [ZhaoMWRY05], [IRMKS06]; automatic, syntax-based: RoadRunner [MeccaCM01], ExAlg [ArasuG03], DEPTA [LiuGZ03, ZhaiL05].
You Can Only Afford to Pay As You Go
The classic data integration solution: build data integration systems over deep web sources, reformulate user queries at search time, and build a data integration system for every domain of interest.
This is impractical for web search:
- Cannot query sources too often
- Precise content descriptions are required
- There are too many domains of interest
- Mediated schema design is infeasible
Web Search Queries and Users
- Web queries are typically keyword queries, while data integration solutions assume structured queries
- Web users typically do not care whether results are structured or unstructured
- User attention is restricted to a small number of portals (~1)
PAYGO Architecture
- There can be many, potentially ill-defined, domains: mediated schema → schema clusters
- Precise mappings cannot be created to all data sources: exact mappings → approximate mappings
- Users prefer keyword queries to structured queries: query reformulation → query routing
- Data sources are diverse and mappings approximate: exact answers → heterogeneous result ranking
Uncertainty everywhere!
Pay As You Go in PAYGO
Integration is a continuous process: a priori integration is impossible, and the understanding of mappings, sources, ranking, etc. evolves over time.
Mechanisms to facilitate evolution over time:
- Automatic schema clustering and matching
- Implicit use of user feedback, e.g., from result clicks
- Result variations to elicit disambiguating user feedback
Queries are always answered with best effort; users "pay" more by correcting or creating semantic mappings.
Query Routing Example
Query: "honda civic 2007 review"
1. Keyword analysis: make, model, year, attribute
2. Domain selection: vehicle
3. Query construction: vehicle(mk:honda, md:civic, yr:2007, review:?)
4. Source selection: car-reviews-by-year.com > car-reviews.com > car-prices.com
5. Result ranking
Surfacing the Deep Web: A More Practical Solution?
Pre-compute all interesting form submissions for each HTML form:
- Each form submission corresponds to a distinct URL; add the URLs for each form submission into the search engine index
- Enables reuse of existing search engine infrastructure: deep-web URLs are like any other URL (GET method)
- Reduced load on deep-web sites: pages are fetched only in response to user clicks on search results, and search engine performance does not depend on the deep-web source
Surfacing Challenges
1. Predicting the appropriate values for text inputs: valid input values are required for retrieving data, e.g., ingredients in recipes.com and zip codes in borderstores.com
2. Predicting the correct input combinations: generating all possible URLs is wasteful and unnecessary; Cars.com has ~500K listings but 250M possible queries
Google's Deep-Web Crawling System
- Affects more than 1000 queries per second
- Enables access to more than a million deep-web sites
- Spans 50+ languages and 100+ domains
- Serves results from 400K+ distinct forms per day
- These results validate the utility of deep-web content
Other systems: http://www.deeppeep.org/, …
Searching for Structured Data on the Surface Web: EntitySearch@UIUC
Entity extraction and indexing; ranking entities directly must be:
- Contextual: utilize entities' surrounding context
- Uncertain: extractions are non-perfect
- Holistic: combine many evidences from multiple sources
- Discriminative: web pages are of varying quality
- Associative: tell true associations from accidental ones
Other systems: NAGA (http://www.mpi-inf.mpg.de/~kasneci/naga/), Correlator (http://correlator.sandbox.yahoo.net/)
References
- Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web. K. C.-C. Chang. Tutorial, SIGMOD 2006.
- EntityRank: Searching Entities Directly and Holistically. T. Cheng, X. Yan, and K. C.-C. Chang. VLDB 2007.
- Web-scale Data Integration: You Can Only Afford to Pay As You Go. Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy. CIDR 2007.
- Google's Deep-Web Crawl. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy. VLDB 2008.
Thank you!