Probe, Count, and Classify: Categorizing Hidden Web Databases

Panagiotis G. Ipeirotis, Luis Gravano
Columbia University

Mehran Sahami
E.piphany Inc.
Surface Web vs. Hidden Web
Surface Web: link structure, crawlable
Hidden Web: no link structure; documents "hidden" behind search forms

[Illustration: a search form with a Keywords field and SUBMIT/CLEAR buttons]
Do We Need the Hidden Web?
Example: PubMed/MEDLINE
PubMed (www.ncbi.nlm.nih.gov/PubMed) search for "cancer": 1,341,586 matches
AltaVista search for "cancer site:www.ncbi.nlm.nih.gov": 21,830 matches

Surface Web: 2 billion pages. Hidden Web: 500 billion pages (?)
Interacting With Searchable Text Databases
Searching: Metasearchers
Browsing: Yahoo!-like web directories: InvisibleWeb.com, SearchEngineGuide.com
Example from InvisibleWeb.com
Health > Publications > PubMED
Created Manually!
Classifying Text Databases Automatically: Outline
Definition of classification
Classification through query probing
Experiments
Database Classification: Two Definitions
Coverage-based classification: database contains many documents about a category. Coverage: number of documents about this category.

Specificity-based classification: database contains mainly documents about a category. Specificity: (number of documents about the category) / |DB|.
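As a concrete illustration, a minimal sketch with hypothetical counts (the 10,000/5,000/100 numbers below are invented for illustration, not from the experiments):

```python
# Hypothetical database: 10,000 documents, 5,000 of them about Basketball
# and 100 about Health. (Invented numbers, for illustration only.)
docs_about = {"Basketball": 5000, "Health": 100}
db_size = 10000

for category, n in docs_about.items():
    cov = n              # Coverage: #docs about the category
    spec = n / db_size   # Specificity: #docs about the category / |DB|
    print(f"{category}: Coverage={cov}, Specificity={spec:.2f}")
# Basketball: Coverage=5000, Specificity=0.50
# Health: Coverage=100, Specificity=0.01
```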
Database Classification: An Example
Category: Basketball
Coverage-based classification: ESPN.com, NBA.com; not KnicksTerritory.com
Specificity-based classification: NBA.com, KnicksTerritory.com; not ESPN.com
Database Classification: More Details
Thresholds for coverage and specificity:
Tc: coverage threshold (e.g., 100)
Ts: specificity threshold (e.g., 0.5)
Tc, Ts are "editorial" choices

[Example hierarchy: Root → Sports (C=800, S=0.8) with children Baseball (S=0.5) and Basketball (S=0.5); Root → Health (C=200, S=0.2)]
Ideal(D): the set of classes for database D. Class C is in Ideal(D) if:
D has "enough" coverage and specificity (Tc, Ts) for C and all of C's ancestors, and
D fails to have both "enough" coverage and specificity for each child of C.
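A minimal sketch of this definition in Python, assuming we already know each category's coverage and specificity (the `children`, `cov`, and `spec` structures are hypothetical inputs, not the authors' code):

```python
def qualifies(cat, cov, spec, Tc, Ts):
    """D has "enough" coverage AND specificity for category cat."""
    return cov[cat] >= Tc and spec[cat] >= Ts

def ideal(root, children, cov, spec, Tc, Ts):
    """Ideal(D): classes where D qualifies (along with all ancestors)
    but qualifies for none of the children."""
    result = set()
    def visit(cat):
        if not qualifies(cat, cov, spec, Tc, Ts):
            return
        qualifying_kids = [c for c in children.get(cat, [])
                           if qualifies(c, cov, spec, Tc, Ts)]
        if qualifying_kids:
            for c in qualifying_kids:   # push D further down the hierarchy
                visit(c)
        else:
            result.add(cat)             # D "stops" at cat
    visit(root)
    return result
```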
From Document to Database Classification
If we know the categories of all documents inside the database, we are done!
We do not have direct access to the documents. Databases do not export such data!
How can we extract this information?
Our Approach: Query Probing
1. Train a rule-based document classifier.
2. Transform classifier rules into queries.
3. Adaptively send queries to databases.
4. Categorize the databases based on adjusted number of query matches.
Training a Rule-based Document Classifier
Feature Selection: Zipf’s law pruning, followed by information-theoretic feature selection [Koller & Sahami’96]
Classifier Learning: AT&T's RIPPER [Cohen 1995]
Input: a set of pre-classified, labeled documents
Output: a set of classification rules

IF linux THEN Computers
IF jordan AND bulls THEN Sports
IF lung AND cancer THEN Health
Constructing Query Probes

Transform each rule into a query:
IF lung AND cancer THEN Health → +lung +cancer
IF linux THEN Computers → +linux

Send the queries to the database.
Get the number of matches for each query, NOT the documents (i.e., the number of documents that match each rule).
These documents would have been classified by the rule under its associated category!
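A minimal sketch of this rule-to-query transformation; the tuple-based rule format is a simplification of RIPPER's actual output:

```python
# Each rule: (list of required terms, category). Mirrors the examples above.
rules = [
    (["lung", "cancer"], "Health"),   # IF lung AND cancer THEN Health
    (["linux"], "Computers"),         # IF linux THEN Computers
]

def rule_to_query(terms):
    """AND semantics: every term in the rule body becomes a required keyword."""
    return " ".join("+" + t for t in terms)

probes = [(rule_to_query(terms), cat) for terms, cat in rules]
# -> [('+lung +cancer', 'Health'), ('+linux', 'Computers')]
```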
Adjusting Query Results
Classifiers are not perfect!
Queries do not "retrieve" all the documents in a category.
Queries for one category "match" documents not in this category.
From the classifier’s training phase we know its “confusion matrix”
Confusion Matrix

M[i][j] = fraction of documents of correct class j that the classifier assigns to class i (rows: classified into; columns: correct class):

             comp   sports  health
comp         0.70   0.10    0.00
sports       0.18   0.65    0.04
health       0.02   0.05    0.86

Real contents of example database D (DB-real): 1000 comp, 5000 sports, 50 health documents.

Expected probing results, per category probe:
comp:   0.70·1000 + 0.10·5000 + 0.00·50 = 700 + 500 + 0 = 1200
sports: 0.18·1000 + 0.65·5000 + 0.04·50 = 180 + 3250 + 2 = 3432
health: 0.02·1000 + 0.05·5000 + 0.86·50 = 20 + 250 + 43 = 313

M · Coverage(D) ≈ ECoverage(D)

Example: 10% of "Sports" documents are classified as "Computers," so 10% of the 5000 "Sports" docs go to "Computers."
Confusion Matrix Adjustment: Compensating for the Classifier's Errors

Same confusion matrix M and example database:

             comp   sports  health
comp         0.70   0.10    0.00
sports       0.18   0.65    0.04
health       0.02   0.05    0.86

DB-real: 1000, 5000, 50. Probing results (ECoverage(D)): 1200, 3432, 313.

Coverage(D) ≈ M⁻¹ · ECoverage(D)

M is diagonally dominant, hence invertible.
Multiplication by M⁻¹ better approximates the correct result.
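A sketch of the adjustment using the slide's numbers; NumPy's linear solver is used here instead of explicitly forming M⁻¹ (same result, numerically safer), an implementation choice of this sketch rather than of the paper:

```python
import numpy as np

# M[i][j]: fraction of class-j documents that the classifier assigns to class i.
M = np.array([[0.70, 0.10, 0.00],    # classified into comp
              [0.18, 0.65, 0.04],    # classified into sports
              [0.02, 0.05, 0.86]])   # classified into health

ecoverage = np.array([1200, 3432, 313])  # ECoverage(D): probe match counts

# Coverage(D) ~ M^-1 . ECoverage(D)
coverage = np.linalg.solve(M, ecoverage)
print(coverage.round())  # [1000. 5000. 50.], the real per-class counts
```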
Classifying a Database
1. Send the query probes for the top-level categories.
2. Get the number of matches for each probe.
3. Calculate Specificity and Coverage for each category.
4. "Push" the database to the qualifying categories (those with Specificity > Ts and Coverage > Tc).
5. Repeat for each of the qualifying categories.
6. Return the classes that satisfy the coverage/specificity conditions.

The result is the approximation of the Ideal classification.
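A minimal sketch of this top-down loop (not the authors' implementation). `probe_matches` is a hypothetical function returning the adjusted match count, i.e., the Coverage estimate, for a category's probes; since |DB| is not directly observable, the sketch estimates Specificity from the sibling coverages, an assumption of this sketch:

```python
def classify(db, node, children, probe_matches, Tc, Ts):
    """Return the approximation of Ideal(db) in the subtree rooted at node."""
    subcats = children.get(node, [])
    if not subcats:
        return {node}                        # qualifying leaf
    coverage = {c: probe_matches(db, c) for c in subcats}  # steps 1-3
    total = sum(coverage.values()) or 1      # |DB| estimate (sketch assumption)
    qualifying = [c for c in subcats
                  if coverage[c] > Tc and coverage[c] / total > Ts]  # step 4
    if not qualifying:
        return {node}                        # db "stops" at node (step 6)
    result = set()
    for c in qualifying:                     # step 5: recurse
        result |= classify(db, c, children, probe_matches, Tc, Ts)
    return result
```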
Real Example: ACM Digital Library (Tc=100, Ts=0.5)

[Classification tree, showing (Coverage, Specificity) per category:
Root → Arts (0, 0), Sports (22, 0.008), Science (430, 0.042), Health (0, 0), Computers (9919, 0.95)
Computers → Programming (1042, 0.18), Hardware (2709, 0.465), Software (2060, 0.355)
Programming → C/C++, Java, Visual Basic, Perl]
Experiments: Data
72-node 4-level topic hierarchy from InvisibleWeb/Yahoo! (54 leaf nodes)
500,000 Usenet articles (April-May 2000):
Newsgroups assigned by hand to hierarchy nodes
RIPPER trained with 54,000 articles (1,000 articles per leaf)
27,000 articles used to construct estimates of the confusion matrices
Remaining 419,000 articles used to build 500 Controlled Databases of varying category mixes and sizes
Comparison With Alternatives
DS: random sampling of documents via query probes (Callan et al., SIGMOD'99). Different task: gather vocabulary statistics; we adapted it for database classification.

TQ: Title-based Probing (Yu et al., WISE 2000). Query probes are simply the category names.
Experiments: Metrics

Accuracy of classification results:
Expanded(N) = N and all of N's descendants
Correct = Expanded(Ideal(D))
Classified = Expanded(Approximate(D))

Precision = |Correct ∩ Classified| / |Classified|
Recall = |Correct ∩ Classified| / |Correct|
F-measure = 2 · Precision · Recall / (Precision + Recall)

Cost of classification: number of queries sent to the database
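A minimal sketch of these metrics, assuming a hypothetical `descendants(n)` helper that returns the set of n's descendants in the hierarchy:

```python
def expanded(nodes, descendants):
    """Expanded(N): the nodes in N together with all their descendants."""
    out = set(nodes)
    for n in nodes:
        out |= descendants(n)
    return out

def precision_recall_f(ideal, approximate, descendants):
    correct = expanded(ideal, descendants)           # Expanded(Ideal(D))
    classified = expanded(approximate, descendants)  # Expanded(Approximate(D))
    hits = len(correct & classified)
    p = hits / len(classified) if classified else 0.0
    r = hits / len(correct) if correct else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```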
Average F-measure, Controlled Databases
[Plot: F-measure (0.3 to 1.0) vs. Ts (0 to 1, with Tc=8) for DS [Callan et al.], PnC, and TQ [Yu et al.]]
PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing
Experimental Results: Controlled Databases
Feature selection helps. Confusion-matrix adjustment helps.
F-measure above 0.8 for most <Tc, Ts> combinations.
Results degrade gracefully with hierarchy depth.
Relatively small number of probes needed for most <Tc, Ts> combinations tried. Also, probes are short: 1.5 words on average; 4 words maximum.
Both better performance and lower cost than DS [Callan et al. adaptation] and TQ [Yu et al.].
Web Databases
130 real databases classified from InvisibleWeb™.
Used InvisibleWeb's categorization as correct.
Simple "wrappers" for querying (only the number of matches is needed).
The Ts, Tc thresholds are not known (unlike with the Controlled Databases) but are implicit in the InvisibleWeb categorization.
We can learn/validate the thresholds (tricky but easy!). More details in the paper!
Web Databases: Learning Thresholds
[3D plot: F-measure (0 to 0.8) as a function of the coverage threshold Tec (powers of 4 from 1 to 262144, log scale) and the specificity threshold Tes (0 to 0.9)]
Experimental Results: Web Databases
130 Real Web Databases.
F-measure above 0.7 for best <Tc, Ts> combination learned.
185 query probes per database on average needed for classification.
Also, probes are short: 1.5 words on average; 4 words maximum.
Conclusions
Accurate classification using only a small number of short queries
No need for document retrieval. Only need a result like: "X matches found."
No need for any cooperation or special metadata from databases
Current and Future Work

• Build "wrappers" automatically
• Extend to non-topical categories
• Evaluate impact of varying search interfaces (e.g., Boolean vs. ranked)
• Extend to other classifiers (e.g., SVMs or Bayesian models)
• Integrate with searching (connection with database selection?)
Contributions

Easy, inexpensive method for database classification
Uses results from document classification
"Indirect" classification of the documents in a database: does not inspect documents, only the number of matches
Adjustment of results according to the classifier's performance
Easy wrapper construction
No need for any metadata from the database
Related Work
Callan et al., SIGMOD 1999
Gauch et al., ProFusion
Dolin et al., Pharos
Yu et al., WISE 2000
Raghavan and Garcia-Molina, VLDB 2001
Controlled Databases
500 databases built using 419,000 newsgroup articles
One label per document
350 databases with a single (not necessarily leaf) category
150 databases with varying category mixes
Database size ranges from 25 to 25,000 articles
Indexed and queried using SMART
F-measure for Different Hierarchy Depths
[Plot: F-measure (0.50 to 1.00) vs. hierarchy level (0 to 3) for DS [Callan et al.], PnC, and TQ [Yu et al.]; Tc=8, Ts=0.3]
PnC = Probe & Count, DS = Document Sampling, TQ = Title-based Probing
Query Probes Per Controlled Database
[Plots: number of interactions with the database for PnC, DS, and TQ; (a) 500 to 2000 interactions vs. Tes (0 to 1), (b) 250 to 2000 interactions vs. Tec (1 to 8192, log scale)]
Web Databases: Number of Query Probes
[3D plot: query probes per database (0 to 600) over the threshold grid Tec (powers of 4 from 1 to 262144, log scale) and Tes (0 to 0.9)]
3-fold Cross-validation
[Plot: F-measure (0 to 0.9) across threshold values (0 to 1) for each of the three cross-validation folds F-1, F-2, F-3]