Livnat Sharabani
SDBI 2006
The Hidden Web

Based on:
- “Distributed search over the hidden web: Hierarchical database sampling and selection” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, VLDB 2002)
- “When one sample is not enough: Improving text database selection using shrinkage” (Luis Gravano, Panagiotis G. Ipeirotis, Columbia University, SIGMOD 2004)

Content
- What is the hidden web?
- Content summary
- Database classification
- Combined algorithm
- Shrinkage
- Experimental results
- Summary

What is the hidden web?
- The “hidden web” (or “invisible web”) is what you cannot retrieve ("see") in the results pages of general web search engines.
- The “surface web” (or “visible web”) is what you do see in the results pages of general web search engines.

[Figure: “Surface” web vs. “Hidden” web]

Why Are Some Pages Invisible?
- Technical barriers:
  - Typing or judgment is required.
  - Pages are generated dynamically.
- Pages that search engines choose to exclude:
  - Links containing ‘?’ (can be a spider trap).
  - Flash, Shockwave (spiders are optimized for HTML).

The hidden web: the majority
- Text databases on the web that are “hidden” behind search interfaces.

“Surface” web vs. “Hidden” web

Surface web:
- Link structure.
- The content is crawlable.
- The content is indexed by search engines like Google.

Hidden web:
- Documents are “hidden” in databases.
- The content is not crawlable.
- Each collection must be queried individually.

Metasearchers
- A metasearcher is a tool for searching over multiple hidden databases simultaneously through a single query interface.
- A metasearcher performs three main tasks:
  - Database selection.
  - Query translation.
  - Result merging.

[Figure labels: Query, Metasearcher, WEB, DB1, DB2, DB3, results]

DB Content Summary
- Statistics that characterize the database content:
  - Document frequencies of the words that appear in the database.
  - Number of documents stored in the database.

Examples:

CNN.fn (Num Docs: 44,730)
Word      df
breast    124
cancer    44
...       ...

CANCERLIT (Num Docs: 148,944)
Word      df
breast    121,134
cancer    91,688
...       ...

Typical DB Selection Algorithm
- A typical database selection algorithm relies on the database content summaries to make its decisions.
- Given a content summary, the algorithm estimates how relevant the database is for a given query.

bGlOSS
- The algorithm estimates, for each database, the number of documents that are expected to contain all the words in the query.
- Example: for the query “breast cancer”, bGlOSS calculates:
  - CANCERLIT: |C| = 148,944, df(breast) = 121,134, df(cancer) = 91,688
    148,944 * (121,134 / 148,944) * (91,688 / 148,944) ≈ 74,569
  - CNN.fn: |C| = 44,730, df(breast) = 124, df(cancer) = 44
    44,730 * (124 / 44,730) * (44 / 44,730) ≈ 0
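
The estimate for the “breast cancer” query can be reproduced in a few lines of Python. This is only an illustrative sketch: the function name `bgloss_estimate` and the dictionary layout of the content summaries are assumptions made for the example, not something the slides define. It treats the query words as independent, which is what the product in the worked example expresses.

```python
from math import prod

# Content summaries taken from the example; the dict layout is an assumption.
SUMMARIES = {
    "CANCERLIT": {"num_docs": 148_944, "df": {"breast": 121_134, "cancer": 91_688}},
    "CNN.fn": {"num_docs": 44_730, "df": {"breast": 124, "cancer": 44}},
}


def bgloss_estimate(summary, query_words):
    """Estimate how many documents contain all query words.

    Assumes the words occur independently: |C| * product of df(w) / |C|.
    A word missing from the summary is treated as df = 0.
    """
    size = summary["num_docs"]
    return size * prod(summary["df"].get(w, 0) / size for w in query_words)


if __name__ == "__main__":
    for name, summary in SUMMARIES.items():
        print(name, round(bgloss_estimate(summary, ["breast", "cancer"])))
```

A metasearcher would rank the databases by this score and forward the query to the top-ranked ones; here CANCERLIT clearly wins.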

Database Selection
- Database selection is based on the content summaries.
- How does the metasearcher obtain a database's content summary?
  - Exported by the database itself.
  - Manually generated description.
  - Use a technique to automate the extraction of content summaries from searchable text databases.

Content Summary Construction
- Pioneering work by J. Callan and M. Connell was presented at SIGMOD ’99.
- Their algorithm extracts a document sample from a given database D and computes the frequency of each observed word w in the sample.

Content Summary Construction
The algorithm:
1. Start with a comprehensive word dictionary.
2. Pick a word and send it as a query to database D.
3. Retrieve the top k documents returned.
4. If the number of retrieved documents exceeds a pre-specified threshold, stop sampling. Otherwise, return to step 2.
5. For each word w in the retrieved documents, calculate SampleDF(w).
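
The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyDatabase` is a hypothetical in-memory stand-in for a real search interface, and the parameter names (`k`, `threshold`, `max_queries`, `seed`) are invented for the example. The sketch always picks words at random from the dictionary.

```python
import random
from collections import Counter


class ToyDatabase:
    """Hypothetical stand-in for a searchable text database."""

    def __init__(self, docs):
        self.docs = docs  # maps doc_id -> document text

    def search(self, word, k):
        """Return up to k (doc_id, text) pairs whose text contains the word."""
        hits = [(doc_id, text) for doc_id, text in sorted(self.docs.items())
                if word in text.split()]
        return hits[:k]


def sample_database(db, dictionary, k=4, threshold=300, max_queries=1000, seed=0):
    """Query-based sampling: returns the document sample and SampleDF."""
    rng = random.Random(seed)
    sample = {}
    for _ in range(max_queries):              # safety cap, an assumption
        word = rng.choice(dictionary)         # step 2: pick a word
        for doc_id, text in db.search(word, k):  # step 3: top k documents
            sample[doc_id] = text
        if len(sample) > threshold:           # step 4: stop when enough docs
            break
    sample_df = Counter()                     # step 5: SampleDF(w)
    for text in sample.values():
        sample_df.update(set(text.split()))   # count each word once per doc
    return sample, sample_df
```

Note that SampleDF counts documents in the sample only, so it says nothing about the true size of the database.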

Content Summary Construction
- There are two main versions of this algorithm, which differ in how they pick the query words:
  - RS-Ord (Random Sampling, Other Resource): picks a random word from the dictionary.
  - RS-Lrd (Random Sampling, Learned Resource): picks a word from the previously retrieved documents.
- Neither version retrieves the actual document frequency of each word w. Hence two databases that differ significantly in size might be assigned similar content summaries.

Database Classification
- Classifying a database into a hierarchy of topics is another way to characterize the content of the database.
- Example: “CANCERLIT” can be classified under the category “health”.

[Figure: topics hierarchy]

Automatic Document Classifier I
- Queries closely associated with a topical category retrieve mainly documents about that category.
  Example: a query with “breast” and “cancer” is likely to retrieve documents related to health.
- By observing the number of matches generated for each query at a database, we can classify the database.
  Example: if a database generates a large number of matches for queries associated with health and few matches for other categories, we can classify the database under the category health.

Automatic Document Classifier II
- A rule-based document classifier uses a set of rules that define classification decisions. Examples:
  - “Jordan” AND “basketball” => sports
  - “hepatitis” => health
- A database can be classified into more than one category.

Automatic Document Classifier III
- For each subcategory c_i, the algorithm defines:
  - Coverage(c_i): the number of documents estimated to belong to c_i.
  - Specificity(c_i): the fraction of documents estimated to belong to c_i.
- The algorithm classifies a database into a category c_i if the values of Coverage(c_i) and Specificity(c_i) exceed two pre-specified thresholds.

Example

Rules:
- “soccer” => sport
- “basketball” => sport
- “diet” => health
- “diabetes” => health

Pre-defined thresholds:
- Coverage(c_i) = 100
- Specificity(c_i) = 0.5

Document frequencies:
Word          df
soccer        300
basketball    200
diet          140
diabetes      12
cancer        250

Coverage(sport) = 300 + 200 = 500
Coverage(health) = 140 + 12 = 152
Specificity(sport) = 500 / (500 + 152) ≈ 0.77
Specificity(health) = 152 / (500 + 152) ≈ 0.23

               sport    health
Coverage       500      152
Specificity    0.77     0.23

The word “cancer” does not appear in the rules, so it affects neither coverage nor specificity.
With these thresholds the database is classified under sport (500 > 100 and 0.77 > 0.5) but not under health (0.23 < 0.5).
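
The arithmetic of this example can be checked with a short Python sketch. The function and variable names are assumptions made for illustration, and each rule is simplified to a single word that votes for one category. The numbers are the document frequencies listed in the example (soccer 300, basketball 200, diet 140, diabetes 12, cancer 250).

```python
# Single-word rules and match counts from the example; names are illustrative.
RULES = {"soccer": "sport", "basketball": "sport",
         "diet": "health", "diabetes": "health"}
MATCHES = {"soccer": 300, "basketball": 200, "diet": 140,
           "diabetes": 12, "cancer": 250}


def coverage_and_specificity(rules, matches):
    """Coverage: matches summed per category. Specificity: share of the total."""
    coverage = {}
    for word, category in rules.items():
        coverage[category] = coverage.get(category, 0) + matches.get(word, 0)
    total = sum(coverage.values())
    specificity = {c: (n / total if total else 0.0) for c, n in coverage.items()}
    return coverage, specificity


def classify(coverage, specificity, min_coverage=100, min_specificity=0.5):
    """Keep the categories that exceed both thresholds."""
    return [c for c in coverage
            if coverage[c] > min_coverage and specificity[c] > min_specificity]


if __name__ == "__main__":
    cov, spec = coverage_and_specificity(RULES, MATCHES)
    print(cov, {c: round(s, 2) for c, s in spec.items()}, classify(cov, spec))
```

Words without a rule, such as “cancer”, never enter the sums, so they change neither measure.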

QProber
- View demo

Construct Content Summary
Algorithm outline:
1. Retrieve a document sample.
2. Generate a preliminary content summary.
3. Categorize the database.
4. Estimate the absolute frequencies of the words retrieved from the database.

Document Sample
Document sample for category c:
  newdocs = Ø
  For each subcategory c_i of c:
    For each query q relevant for c_i:
      newdocs = newdocs ∪ {top k documents returned for q}
      If q consists of a single word w,
        then ActualDF(w) = #matches returned for q.
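
The sampling loop for one category might look like this in Python. It is a sketch under assumptions: the database is a plain dictionary of texts, a query is a tuple of words combined with AND, and the names `sample_for_category` and `probes_by_subcategory` are invented for the example. A real hidden-web database would report the number of matches itself.

```python
def matches(doc_words, query):
    """A query is a tuple of words; all of them must occur in the document."""
    return all(word in doc_words for word in query)


def sample_for_category(docs, probes_by_subcategory, k=2):
    """Collect newdocs for one category and record ActualDF for one-word probes.

    docs maps doc_id -> text (a toy stand-in for the remote database).
    probes_by_subcategory maps a subcategory name to its list of queries.
    """
    newdocs = {}
    actual_df = {}
    for subcategory, queries in probes_by_subcategory.items():
        for query in queries:
            hits = [doc_id for doc_id, text in sorted(docs.items())
                    if matches(set(text.split()), query)]
            for doc_id in hits[:k]:           # keep only the top k documents
                newdocs[doc_id] = docs[doc_id]
            if len(query) == 1:               # single-word query: df is exact
                actual_df[query[0]] = len(hits)
    return newdocs, actual_df
```

The match count of a single-word query is the word's true document frequency in the database, which is why only those queries feed ActualDF.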

Document Sample: Example I

[Figure: topic hierarchy with root START; categories Sport, Arts, Science, Health; subcategories Basketball, Soccer]

Rules:
Sport     “Jordan” and “bulls”, “Romario” and “soccer”, “Maradona”, “swimming”, etc.
Health    “diabetes”, “diet” and “fat”, “stomach”, etc.
...

We know ActualDF(.)

Content Summary
Build content summary for category c:
  For each word w in newdocs:
    SampleDF(w) = #documents in newdocs that contain w.
![Page 43: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/43.jpg)
43
Categorizing the Database
The algorithm is recursive.
We go down the topic hierarchy according to the "Coverage" and the "Specificity".
Categorization:
If Coverage(c_i) > threshold1 and Specificity(c_i) > threshold2
Then getContentSummary(c_i)
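The recursive descent can be sketched as follows. This is a minimal sketch under stated assumptions: the coverage and specificity values are taken as precomputed inputs (how they are obtained is not covered on this slide), and the names `classify`, `children`, `t_cov` and `t_spec` are hypothetical. The sketch only returns the categories reached; the slide's `getContentSummary` call would happen at each descent.

```python
def classify(category, children, coverage, specificity, t_cov, t_spec):
    """Return the deepest categories that the database is pushed down into.

    children:    dict mapping a category to its list of subcategories
    coverage:    dict mapping a category to its (given) coverage value
    specificity: dict mapping a category to its (given) specificity value
    """
    result = []
    for sub in children.get(category, []):
        if coverage[sub] > t_cov and specificity[sub] > t_spec:
            # The subcategory passes both thresholds: descend into it.
            result.extend(
                classify(sub, children, coverage, specificity, t_cov, t_spec)
            )
    # If no subcategory qualifies, the database stays at this category.
    return result or [category]


# Made-up hierarchy and scores, for illustration only.
children = {"Root": ["Sport", "Health"], "Sport": ["Basketball", "Soccer"]}
coverage = {"Sport": 120, "Health": 5, "Basketball": 90, "Soccer": 10}
specificity = {"Sport": 0.8, "Health": 0.05, "Basketball": 0.7, "Soccer": 0.1}

print(classify("Root", children, coverage, specificity, t_cov=50, t_spec=0.5))
```

With these numbers only Sport and then Basketball pass both thresholds, so the database ends up classified under Basketball.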
![Page 44: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/44.jpg)
44
Document Sample – Example II
[Figure: topic hierarchy with the nodes START, Sport, Arts, Science, Health, Basketball and Soccer, and the label "NBA statistics".]
Requirements: Coverage(c_i) > x1, Specificity(c_i) > x2
![Page 45: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/45.jpg)
45
Construct Content Summary
Algorithm outline:
1. Retrieve a document sample.
2. Generate a preliminary content summary.
3. Categorize the database.
4. Estimate the absolute frequencies of the words retrieved from the database.
![Page 46: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/46.jpg)
46
Estimating absolute document Frequencies
To estimate the absolute document frequencies, the paper uses Zipf's observation, which was later refined by Mandelbrot:
$f = P(r + p)^{-B}$
![Page 47: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/47.jpg)
47
Estimating absolute document Frequencies
$f = P(r + p)^{-B}$
f => the frequency of the word.
r => the rank of the word (by its frequency).
P, p, B => parameters of the specific document collection.
![Page 50: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/50.jpg)
50
Estimating absolute document Frequencies - Example
Rank: r("Bulls") = 1, r("Jordan") = 2, r("Maradona") = 3, r("Romario") = 4
Rules (Sport): "Jordan" and "Bulls", "Romario" and "soccer", "Maradona", "swimming", etc.

| Word | SampleDF | ActualDF |
|---|---|---|
| Jordan | 45 | --- |
| Bulls | 80 | --- |
| Maradona | 40 | 6800 |
| Romario | 32 | --- |
| … | | |
![Page 51: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/51.jpg)
51
Estimating absolute document Frequencies
Estimate actual word frequencies:
1. Sort the words in descending order of their SampleDF(.). Determine the rank r_i of each word w_i.
2. Estimate P, p, B from the ActualDF(.) values you have.
3. Estimate the absolute document frequency for all words in the sample.
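The three steps above can be sketched in Python. This is a simplified illustration, not the papers' fitting procedure: it holds p fixed (an assumption of this sketch) so that P and B follow from an ordinary least-squares line through the known (rank, ActualDF) points in log-log space. The function names are hypothetical, and the known ActualDF values in the usage are made up.

```python
import math


def rank_words(sample_df):
    """Step 1: rank words by descending SampleDF (rank 1 = most frequent)."""
    ordered = sorted(sample_df, key=sample_df.get, reverse=True)
    return {word: i + 1 for i, word in enumerate(ordered)}


def fit_zipf(known, p=0.0):
    """Step 2: fit log f = log P - B * log(r + p) by least squares.

    known: list of (rank, actual_df) pairs with at least two distinct ranks.
    p is held fixed here, so only P and B are estimated.
    """
    xs = [math.log(r + p) for r, _ in known]
    ys = [math.log(f) for _, f in known]
    n = len(known)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(mean_y - slope * mean_x), -slope  # (P, B)


def estimate_df(rank, P, B, p=0.0):
    """Step 3: evaluate f = P * (r + p) ** (-B) for a given rank."""
    return P * (rank + p) ** (-B)


ranks = rank_words({"Jordan": 45, "Bulls": 80, "Maradona": 40, "Romario": 32})
# Made-up (rank, ActualDF) pairs standing in for the known frequencies.
P, B = fit_zipf([(1, 1000.0), (2, 500.0), (4, 250.0)])
print(ranks["Jordan"], estimate_df(ranks["Jordan"], P, B))
```

Fitting all three parameters, as the slide describes, needs a nonlinear fit and at least three known frequencies; fixing p keeps the sketch within the standard library.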
![Page 52: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/52.jpg)
52
Estimating absolute document Frequencies - Example
Rank: r("Bulls") = 1, r("Jordan") = 2, r("Maradona") = 3, r("Romario") = 4
According to Maradona (and more ActualDF values), estimate P, p and B.
Estimate the ActualDF of "Jordan", "Bulls", etc.
Rules (Sport): "Jordan" and "Bulls", "Romario" and "soccer", "Maradona", "swimming", etc.

| Word | SampleDF | ActualDF |
|---|---|---|
| Jordan | 45 | --- |
| Bulls | 80 | --- |
| Maradona | 40 | 6800 |
| Romario | 32 | --- |
| … | | |
![Page 53: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/53.jpg)
53
Content Summary Problems
The sparse data problem:
The content summary tends to include the most frequent words but generally misses many other words that appear in only a few documents.
Example: The word "hemophilia" appears in 0.1% of the PubMed documents. A typical content summary for PubMed will not include "hemophilia", so the metasearcher will treat PubMed as not relevant to a query containing "hemophilia".
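A quick calculation shows why such a word is likely to be missed. The sketch below assumes the sample is drawn uniformly and independently, which query-based sampling is not, and the sample size of 300 documents is an assumed figure for illustration only.

```python
def miss_probability(doc_fraction, sample_size):
    """Chance that none of `sample_size` independently drawn documents
    contains a word that occurs in `doc_fraction` of the database."""
    return (1.0 - doc_fraction) ** sample_size


# "hemophilia" appears in 0.1% of the documents; assumed sample of 300.
print(round(miss_probability(0.001, 300), 2))
```

Under these assumptions the word is absent from the sample roughly three times out of four, so the content summary would usually report it as not occurring at all.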
![Page 54: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/54.jpg)
54
Content Summary Problems
Disproportion:
Some words might be disproportionately represented in the document summary.
Challenge:
Improving the quality of the content summary without necessarily increasing the document sample size.
![Page 55: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/55.jpg)
55
Content
What is the hidden web?
Content Summary.
Database classification.
Combined Algorithm.
Shrinkage.
Experiments Result.
Summary.
![Page 56: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/56.jpg)
56
Shrinkage
When multiple databases correspond to similar topic categories, they tend to have similar content summaries.
The content summaries of databases under similar topics can mutually complement each other.
![Page 57: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/57.jpg)
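57
Category Content Summary
[Figure: topic hierarchy with the nodes Root, Sport, Health and Heart, and the databases D1, D2 and D3.]
One database: estimated size ^DB = 1000, Df("hypertension") = 480, P("hypertension") = 0.48
Another database: estimated size ^DB = 2000, Df("hypertension") = 0, P("hypertension") = 0
Category: P("hypertension") = ((2000 * 0) + (1000 * 0.48)) / 3000 = 0.16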
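The category value on this slide is a size-weighted average of the per-database probabilities. A minimal sketch, with the hypothetical function name `category_probability`:

```python
import math


def category_probability(databases):
    """databases: list of (estimated_size, P(w|D)) pairs for the databases
    classified under one category. Returns the size-weighted average."""
    total_size = sum(size for size, _ in databases)
    return sum(size * prob for size, prob in databases) / total_size


# The two databases from the slide, for the word "hypertension".
p_category = category_probability([(1000, 0.48), (2000, 0.0)])
print(p_category)
```

Weighting by estimated size means a large database with no occurrences pulls the category probability down more than a small one would.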
![Page 58: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/58.jpg)
58
Shrunk content Summary I
To create a shrunk content summary, we must first create the category content summaries for all the categories in the hierarchy.
Consider a path in the topic hierarchy C_1, …, C_m where C_i = parent(C_{i+1}).
[Figure: path Root, c1, c2, c3, D in the topic hierarchy.]
![Page 59: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/59.jpg)
59
Shrunk content Summary II
A shrunk content summary for database D classified under categories c_1 … c_m is:
[Equation not preserved in the extracted slide.]
Where:
![Page 60: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/60.jpg)
60
Shrunk content Summary III
[Figure: path Root, C1, C2, C3, D in the topic hierarchy.]
P(w|D) = 0.6
P(w|C3) = 0.4
P(w|C2) = 0.78
P(w|C1) = 0.3
P(w|Root) = 0.01
Shrunk content summary:
0.01*λ0 + 0.3*λ1 + 0.78*λ2 + 0.4*λ3 + 0.6*λ4
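The expression above is a weighted mixture of the probabilities along the path from Root to D. The sketch below evaluates it for one word. The weights are made-up values for illustration, the requirement that they sum to 1 is an assumption of this sketch, and how the weights are actually computed is not covered here.

```python
import math


def shrunk_probability(probs, lambdas):
    """probs:   [P(w|Root), P(w|C1), ..., P(w|Cm), P(w|D)]
    lambdas: matching weights [λ0, λ1, ..., λm, λm+1]."""
    if len(probs) != len(lambdas):
        raise ValueError("need exactly one weight per probability")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        # Assumption of this sketch: the weights form a convex combination.
        raise ValueError("weights are assumed to sum to 1")
    return sum(lam * p for lam, p in zip(lambdas, probs))


probs = [0.01, 0.3, 0.78, 0.4, 0.6]    # values from the slide
lambdas = [0.05, 0.1, 0.15, 0.2, 0.5]  # hypothetical; λ4 (for D) is largest
shrunk = shrunk_probability(probs, lambdas)
print(shrunk)
```

Giving the last weight the largest value keeps the database's own summary dominant while the ancestors fill in words the sample missed.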
![Page 61: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/61.jpg)
61
Shrunk content Summary IV
The category weights: λ_{m+1} is the highest among the λ_i's, which means the highest weight is given to the original content summary.
The shrunk content summary incorporates information from multiple content summaries, and thus it can be closer to the complete (and unknown) content summary.
![Page 62: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/62.jpg)
62
Shrunk content summary – is it always good?
Not always. If the "uncertainty" associated with the score is low, don't use shrinkage:
The sample size: if the database sample includes most of the documents from the DB (a small DB), then the sample is sufficiently complete. In this case shrinkage is not needed and might be undesirable.
The frequency of the query words: if all the query words appear in almost all of the sample documents, then the distribution of the words over the DB is "certain". The same goes if every query word appears in close to no sample document.
![Page 63: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/63.jpg)
63
Content
What is the hidden web?
Content Summary.
Database classification.
Combined Algorithm.
Shrinkage.
Experiments Result.
Summary.
![Page 64: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/64.jpg)
64
Experiments Result
The papers refer to 2 aspects:
Content summary quality.
Database selection accuracy.
The papers show that the idea of exploiting content summaries of similarly classified databases increases the content summary quality and improves the database selection for a given query.
![Page 65: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/65.jpg)
65
Content summary quality I
Comparing coverage of the retrieved vocabulary: RS-ORD and RS-LRD vs. different Rulers.
[Figure: plot with the axis labels "Specificity" and "% retrieved words".]
![Page 66: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/66.jpg)
66
Content summary quality II
Comparing rank of words: RS-ORD and RS-LRD vs. different Rulers.
![Page 67: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/67.jpg)
67
Content summary quality III
Comparing the number of queries sent to the database: RS-ORD and RS-LRD vs. different Rulers.
![Page 68: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/68.jpg)
68
Database selection using shrinkage
Shrinkage improves the selection of relevant databases.
![Page 69: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/69.jpg)
69
Content
What is the hidden web?
Content Summary.
Database classification.
Combined Algorithm.
Shrinkage.
Experiments Result.
Summary.
![Page 70: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/70.jpg)
70
Summary I
Database selection is critical to building efficient metasearchers that interact with a potentially large number of databases.
The metasearcher uses the database content summaries to select the most relevant databases for a given query.
![Page 71: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/71.jpg)
71
Summary II
The papers present methods to improve the database content summary:
Creating a content summary with an estimation of the actual document frequencies.
Categorizing databases in a classification scheme.
A method that exploits the content summaries of similarly classified databases and combines them using shrinkage.
![Page 72: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/72.jpg)
72
The End
"The invisible portion of the Web will continue to grow exponentially before the tools to uncover the hidden Web are ready for general use" (http://brightplanet.com/technology/deepweb.asp)
QUESTIONS?
![Page 73: Livnat Sharabani SDBI 2006 The Hidden Web. 2 Based on: “Distributed search over the hidden web: Hierarchical database sampling and selection” “Distributed.](https://reader035.fdocuments.in/reader035/viewer/2022062409/56649ec75503460f94bd404d/html5/thumbnails/73.jpg)
73
Appendix
The metasearcher Turbo10 - http://turbo10.com/index.html