Daniele Alfarone ○ Erasmus student ○ Milan (Italy) Deep-Web Crawling “Enlightening the dark...


  • Slide 1
  • Daniele Alfarone, Erasmus student, Milan (Italy). Deep-Web Crawling: “Enlightening the dark side of the web”
  • Slide 2
  • Structure: 1. Introduction (What is the Deep-Web; How to crawl it). 2. Google's approach (Problem statement; Main algorithms; Performance evaluation). 3. Improvements (Main limitations; Some ideas to improve). 4. Conclusions.
  • Slide 3
  • What is the Deep-Web? The Deep-Web is the content hidden behind HTML forms.
  • Slide 4
  • Hidden content: this content cannot be reached by traditional crawlers. The Deep-Web has 10 times more data than the currently searchable content.
  • Slide 5
  • How do webmasters deal with it? Search engines are not the only interested party: websites also want to be more accessible to crawlers. Some websites publish pages with long lists of static links so that traditional crawlers can index them.
  • Slide 6
  • How can search engines crawl the Deep-Web? Search engines cannot expect every website to publish such link lists. One option is to develop vertical search engines focused on a specific topic (e.g. flights, jobs). But: coverage is limited to the topics for which a vertical search engine has been built; it is difficult to maintain semantic mappings between individual data sources and a common DB; boundaries between different domains are fuzzy.
  • Slide 7
  • Are there smarter approaches? Currently the Web contains more than 10 million high-quality HTML forms, and the number is still growing exponentially. Any approach that involves human effort cannot scale: we need a fully automatic approach without site-specific coding. Solution: the surfacing approach. 1. Choose a set of queries to submit to the web form. 2. Store the URLs of the pages obtained. 3. Pass all the URLs to the crawler. [Figure: number of websites since 1990 (7% have a high-quality form)]
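
The three surfacing steps can be sketched in a few lines of Python. This is only an illustration, not Google's implementation: the function name `surface_urls`, the form address, and the input values are all made up for the example, and the form is assumed to use method=GET.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action_url, input_values):
    """Build one GET URL per combination of candidate input values.

    action_url and input_values are hypothetical: a real crawler would read
    them from the form's HTML.
    """
    names = sorted(input_values)
    urls = []
    for combo in product(*(input_values[name] for name in names)):
        # Step 1: one query = one assignment of values to the form inputs.
        query = urlencode(dict(zip(names, combo)))
        # Step 2: store the URL of the page that the submission would return.
        urls.append(f"{action_url}?{query}")
    return urls


# Step 3: this list is what would be handed over to the ordinary crawler.
urls = surface_urls(
    "http://example.com/search",
    {"make": ["fiat", "ford"], "year": ["2007", "2008"]},
)
```

With two values for each of two inputs, the sketch yields four URLs, which already hints at why the number of inputs filled at once has to be kept small.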
  • Slide 8
  • Part 2: Google's approach. Problem statement; Main algorithms; Performance evaluation.
  • Slide 9
  • Solving the surfacing problem: Google's approach. The problem is divided into two sub-problems: 1. Decide which form inputs to fill. 2. Find appropriate values to fill into these inputs.
  • Slide 10
  • HTML form example. [Figure: form with labelled free-text inputs and choice inputs]
  • Slide 11
  • HTML form example. [Figure: form with labelled selection inputs and presentation inputs]
  • Slide 12
  • Which form inputs to fill: query templates. Defined by Google as the list of input types to be filled to create a set of queries. [Figure: Query Template #1]
  • Slide 13
  • Which form inputs to fill: query templates. Defined by Google as the list of input types to be filled to create a set of queries. [Figure: Query Template #2]
  • Slide 14
  • How to create informative query templates: 1. Discard presentation inputs (currently a big challenge). 2. Choose the optimal dimension for the template. Too big: it increases crawling traffic and produces pages without results. Too small: every submission returns a large number of results, and the website may either limit the number of results or only allow browsing them through pagination (which is not always easy to follow).
  • Slide 15
  • Informativeness tester. How does Google evaluate whether a template is informative? Query templates are evaluated on the distinctness of the web pages resulting from the form submissions they generate. To estimate the number of distinct web pages, the results are clustered based on the similarity of their content. A template is informative if $\frac{\#\,\text{distinct pages}}{\#\,\text{pages}} > 25\%$.
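
A minimal sketch of the test, assuming the 25% threshold from the slide. The content-similarity clustering is replaced here by a much cruder stand-in (a hash of the whitespace- and case-normalised page text), so `signature` is an assumption of this example rather than the method actually used.

```python
import hashlib

INFORMATIVE_THRESHOLD = 0.25  # the 25% threshold stated on the slide


def signature(page_text):
    """Crude stand-in for clustering: identical normalised text, same cluster."""
    normalised = " ".join(page_text.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def distinctness_fraction(pages):
    """Number of distinct result pages divided by number of pages fetched."""
    if not pages:
        return 0.0
    return len({signature(page) for page in pages}) / len(pages)


def is_informative(pages, threshold=INFORMATIVE_THRESHOLD):
    return distinctness_fraction(pages) > threshold


# Nine submissions returned the same empty-result page, one returned data.
mostly_empty = ["No results"] * 9 + ["Fiat Punto 2007"]
varied = ["Fiat Punto", "Ford Focus", "Fiat Panda", "Fiat  punto"]
```

In this toy data `mostly_empty` has 2 distinct pages out of 10 and fails the test, while `varied` has 3 distinct pages out of 4 and passes.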
  • Slide 16
  • How to scale to big web forms? Given a form with N inputs, the number of possible templates is $2^N - 1$. To avoid running the informativeness tester on all possible templates, Google developed an algorithm called Incremental Search for Informative Query Templates (ISIT).
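
The slide names ISIT without listing its steps. The sketch below assumes a bottom-up search: test all templates with one input, then extend only the informative ones by one input at a time, up to a maximum dimension. The function `isit`, the `max_dimension` parameter, and the toy informativeness test are assumptions made for illustration.

```python
def isit(inputs, is_informative, max_dimension=3):
    """Incrementally search for informative templates (sets of input names).

    Returns the informative templates found and how many templates were tested.
    """
    candidates = [frozenset([name]) for name in inputs]
    informative = []
    tested = 0
    dimension = 1
    while candidates and dimension <= max_dimension:
        survivors = []
        for template in candidates:
            tested += 1
            if is_informative(template):
                survivors.append(template)
        informative.extend(survivors)
        # Only informative templates are extended, each by one extra input.
        extended = set()
        for template in survivors:
            for name in inputs:
                if name not in template:
                    extended.add(template | {name})
        candidates = sorted(extended, key=sorted)
        dimension += 1
    return informative, tested


# Toy form: only templates built from "make" and "model" are informative.
form_inputs = ["make", "model", "year", "sort"]
informative, tested = isit(form_inputs, lambda t: t <= {"make", "model"})
```

On this toy form the search tests 11 templates instead of all $2^4 - 1 = 15$; the saving grows quickly with N, since most large templates are never generated.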
  • Slide 17
  • ISIT example. [Figure: ISIT example]
  • Slide 18
  • Generating input values. Assigning values to a select menu is as easy as selecting all the possible values. Generating meaningful values for text boxes is a big challenge. Text boxes are used in different ways in web forms. Generic text boxes: they retrieve all documents in a database that match the words typed (e.g. title or author of a book). Typed text boxes: they act as a selection predicate on a specific attribute in the WHERE clause of a SQL query (e.g. zip codes, US states, prices).
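
For select menus, collecting every possible value amounts to reading the option elements from the form's HTML. A minimal sketch using only the Python standard library follows; the class name `SelectOptionCollector` and the sample form are hypothetical.

```python
from html.parser import HTMLParser


class SelectOptionCollector(HTMLParser):
    """Collect the value attribute of every <option>, grouped by <select> name."""

    def __init__(self):
        super().__init__()
        self.options = {}
        self._current = None  # name of the <select> currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self._current = attrs.get("name")
            if self._current is not None:
                self.options.setdefault(self._current, [])
        elif tag == "option" and self._current is not None and "value" in attrs:
            self.options[self._current].append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._current = None


sample_form = (
    '<form action="/search"><select name="state">'
    '<option value="any">Any</option>'
    '<option value="AL">Alabama</option>'
    '<option value="AK">Alaska</option>'
    "</select></form>"
)
collector = SelectOptionCollector()
collector.feed(sample_form)
```

After parsing, `collector.options` maps each select menu name to the list of values that can be submitted for it.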
  • Slide 19
  • Values for generic text boxes: 1. Initial seed keywords are extracted from the form page. 2. A query template with only the generic text box is submitted. 3. Additional keywords are extracted from the resulting page. 4. Keywords that are not representative of the page are discarded (TF-IDF rank). The process runs until a sufficient number of keywords has been extracted.
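
The iterative probing loop can be sketched as follows. The TF-IDF scoring is a textbook variant, not necessarily the one Google uses, and `fetch_results` (a callable that submits a keyword and returns the result page text), `background_docs`, `target`, and `k` are assumptions of this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(page_text, background_docs, k=3):
    """Rank the words of a page by TF-IDF against a background corpus."""
    tf = Counter(tokenize(page_text))
    doc_sets = [set(tokenize(doc)) for doc in background_docs]
    n_docs = len(doc_sets) + 1  # +1 counts the page itself
    scores = {}
    for word, count in tf.items():
        df = 1 + sum(word in words for words in doc_sets)
        scores[word] = count * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def probe_keywords(seed_page, fetch_results, background_docs, target=5, k=3):
    """Grow a keyword list by submitting keywords and mining the result pages."""
    keywords = top_keywords(seed_page, background_docs, k)  # step 1
    queue = list(keywords)
    seen = set(keywords)
    while queue and len(seen) < target:
        result_page = fetch_results(queue.pop(0))  # step 2
        for word in top_keywords(result_page, background_docs, k):  # steps 3-4
            if word not in seen:
                seen.add(word)
                keywords.append(word)
                queue.append(word)
    return keywords[:target]


background = ["the book is on the table", "search the site for a book"]
form_page = "search books: tolkien tolkien hobbit the the the"
fake_site = {"tolkien": "silmarillion silmarillion tolkien", "hobbit": "bilbo bilbo"}
found = probe_keywords(form_page, lambda kw: fake_site.get(kw, ""), background)
```

Common words such as "the" score zero because they appear in every background document, which is the effect the TF-IDF filter in step 4 is meant to have.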
  • Slide 20
  • Values for typed text boxes. The number of types that can appear in HTML forms across different domains is limited (e.g. city, date, price, zip). Forms with typed text boxes produce reasonable result pages only with type-appropriate values. To recognize the correct type, the form is submitted with known values of different types, and the type with the highest distinctness fraction is considered to be the correct one.
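
A minimal sketch of this type-recognition step. The sample values, the `submit` callable (which stands in for a real form submission and returns the result page text), and the simplified distinctness measure (exact page equality rather than clustering) are all assumptions.

```python
# Hypothetical sample values for a few common types.
SAMPLE_VALUES = {
    "zip": ["10001", "60601", "94105"],
    "city": ["Milan", "Leuven", "Boston"],
    "price": ["10", "100", "1000"],
}


def distinctness_fraction(pages):
    """Simplified: pages count as distinct only if their text differs."""
    return len(set(pages)) / len(pages) if pages else 0.0


def guess_type(submit, samples=SAMPLE_VALUES):
    """Submit known values of each type and pick the most distinct outcome."""
    scores = {
        type_name: distinctness_fraction([submit(value) for value in values])
        for type_name, values in samples.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


def fake_zip_form(value):
    """Toy stand-in for a site whose text box expects a 5-digit zip code."""
    if value.isdigit() and len(value) == 5:
        return f"Stores near {value}"
    return "No results"


best_type, type_scores = guess_type(fake_zip_form)
```

Against the toy form, zip codes produce three different pages while cities and prices all collapse to the same "No results" page, so the zip type wins.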
  • Slide 21
  • Performance evaluation: query templates with only select menus. As the number of inputs increases, the number of possible templates increases exponentially, but the number tested increases only linearly, as does the number found to be informative.
  • Slide 22
  • Performance evaluation: mixed query templates. In a test on 1 million HTML forms, the URLs were generated using a template which had: only one text box (57%); one or more select menus (37%); one text box and one or more select menus (6%). Today on Google.com one query out of 10 contains "surfaced" results.
  • Slide 23
  • Part 3: Improvements. Main limitations; Some ideas to improve.
  • Slide 24
  • 1. POST forms are discarded. The output of Google's whole Deep-Web crawl is a list of URLs for each form considered. The result pages from a form submitted with method=POST don't have a unique URL. Google bypasses these forms, relying on the fact that the RFC specifications recommend POST forms only for operations that write to the website database (e.g. comments in a forum, sign-up to a website). But in reality websites make massive use of POST forms, for: URL shortening; maintaining the state of a form after its submission.
  • Slide 25
  • How can we crawl POST forms? Two approaches can lift the limitation imposed by Google: 1. POST forms can be crawled by sending the server a complete HTTP request, rather than just a URL. The problem then becomes how to link (in the SERP) to the page obtained by submitting the POST form. 2. An approach which would solve all the problems stated is simply to convert the POST form to its GET equivalent. An analysis is required to assess what percentage of websites also accept GET parameters for POST forms.
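
The second approach boils down to moving the form fields from the request body into the query string. The helper below is a hypothetical sketch of that conversion only; whether a given server actually honours the resulting URL has to be checked separately, for instance by comparing the GET response with the POST response.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def post_to_get_url(action_url, fields):
    """Append the POST form fields to the action URL as GET parameters.

    Existing query parameters of the action URL are preserved.
    """
    parts = urlsplit(action_url)
    params = parse_qsl(parts.query, keep_blank_values=True) + list(fields.items())
    return urlunsplit(parts._replace(query=urlencode(params)))


get_url = post_to_get_url(
    "http://example.com/search.php?lang=en",
    {"q": "deep web", "page": "1"},
)
```

If the server accepts it, the resulting URL can be stored and linked like any other surfaced URL, which removes the SERP-linking problem of the first approach.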
  • Slide 26
  • 2. Select menus with bad default values. When instantiating a query template, each select menu not included in the template is assigned its default value, under the assumption that it is a wildcard value like "Any" or "All". This assumption is probably too strong: in several select menus the default option is simply the first one in the list. E.g. for a select menu of U.S. states we would expect "All", but we can find "Alabama". If a bad option like "Alabama" is selected, a high percentage of the database will remain undiscovered.
  • Slide 27
  • How can we recognize a bad default value? Idea: submit the form with all possible values and count the results. If the number of results for the (potentially) default value is close to the sum of all the other results, it is probably a real default value. Once we recognize a bad default value, we force the inclusion of the select menu in every template for the given form.
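
The idea can be expressed as a simple comparison of result counts. The slide does not say how close "close" is, so the 10% `tolerance` below is an arbitrary assumption, as are the function name and the sample counts.

```python
def is_wildcard_default(result_counts, default_value, tolerance=0.1):
    """Return True if the default option behaves like "All"/"Any".

    result_counts maps each option of the select menu to the number of
    results returned when the form is submitted with that option.
    """
    default_count = result_counts[default_value]
    others = sum(
        count for value, count in result_counts.items() if value != default_value
    )
    if others == 0:
        return False  # nothing to compare against
    return abs(default_count - others) <= tolerance * others


good_menu = {"All": 1000, "Alabama": 300, "Alaska": 200, "Arizona": 480}
bad_menu = {"Alabama": 300, "Alaska": 200, "Arizona": 480}
```

In `good_menu` the default "All" returns roughly as many results as the other options combined, so it is accepted as a wildcard. In `bad_menu` the default "Alabama" returns far fewer, so the menu would be forced into every template.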
  • Slide 28
  • 3. Managing mandatory inputs. HTML forms often indicate to the user which inputs are mandatory (e.g. with asterisks or red borders). Recognizing the mandatory inputs can offer some benefits. It reduces the number of URLs generated by ISIT: only the templates which contain all the mandatory fields are passed to the informativeness tester. It also avoids assigning the default value (not always correct) to inputs that can simply be discarded because they are not mandatory.
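
One way to apply the first benefit is to generate only templates that contain every mandatory input, instead of filtering them afterwards. The function, the `max_dimension` limit, and the toy flight-search form below are assumptions made for illustration.

```python
from itertools import combinations


def templates_with_mandatory(inputs, mandatory, max_dimension=3):
    """Enumerate templates that include all mandatory inputs.

    Optional inputs are added on top of the mandatory ones, up to
    max_dimension inputs per template.
    """
    mandatory = frozenset(mandatory)
    optional = [name for name in inputs if name not in mandatory]
    templates = []
    for extra in range(0, max_dimension - len(mandatory) + 1):
        for combo in combinations(optional, extra):
            template = mandatory | frozenset(combo)
            if template:  # skip the empty template when nothing is mandatory
                templates.append(template)
    return templates


flight_form = ["from", "to", "date", "class"]
candidates = templates_with_mandatory(flight_form, {"from", "to"})
```

For this toy form only three templates remain to be tested, compared with the 14 templates of dimension at most 3 that would exist without the mandatory-input constraint.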
  • Slide 29
  • 4. Filling text boxes by exploiting JavaScript suggestions. An alternative approach for filling text boxes is to exploit the auto-completion suggestions that a website offers via JavaScript, whenever they are available.
  • Slide 30
  • Algorithm to extract the suggestions: 1. Type into the text box all the possible combinations of the first 3 letters (with the English alphabet: $26^3 = 17{,}576$ submissions). 2. For each combination of 3 letters, retrieve all the auto-completion suggestions using a JavaScript simulator. 3. All suggestions can be assumed to be valid inputs, so we don't need to filter them according to relevance. 4. The relevance filter will be applied only if the website is not particularly interesting.
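
Steps 1 and 2 can be sketched as follows. The `suggest` callable stands in for the JavaScript simulator, which is outside the scope of this example; it and the toy vocabulary are assumptions.

```python
from itertools import product
from string import ascii_lowercase


def three_letter_prefixes():
    """All 3-letter combinations over the English alphabet."""
    return ["".join(letters) for letters in product(ascii_lowercase, repeat=3)]


def collect_suggestions(suggest, prefixes):
    """Query the (simulated) auto-completion for each prefix and merge results."""
    found = set()
    for prefix in prefixes:
        found.update(suggest(prefix))
    return sorted(found)


# Toy stand-in for a site's auto-completion endpoint.
vocabulary = ["milan", "milano", "leuven"]


def fake_suggest(prefix):
    return [word for word in vocabulary if word.startswith(prefix)]


prefixes = three_letter_prefixes()
suggestions = collect_suggestions(fake_suggest, prefixes)
```

The prefix list has exactly $26^3 = 17{,}576$ entries, matching the number of submissions given in step 1.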
  • Slide 31