Search engines

Transcript of Search engines


2. Even a blind squirrel finds a nut, occasionally. But few of us are determined enough to search through millions, or billions, of pages of information to find our nut. So, to reduce the problem to a more or less manageable one, web search engines were introduced a few years ago.

3. Finding key information in the gigantic World Wide Web is like finding a needle lost in a haystack. For this purpose we would use a special magnet that would automatically, quickly, and effortlessly attract that needle for us. In this scenario, the magnet is the search engine.

4. Search (computing): to examine a computer file, disk, database, or network for particular information.
Engine: something that supplies the driving force or energy to a movement, system, or trend.
Search engine: a computer program that searches for particular keywords and returns a list of documents in which they were found, especially a commercial service that scans documents on the Internet.

5. Search is a wicked problem:
  • No definitive formulation.
  • Considerable uncertainty.
  • Complex interdependencies.
  • Incomplete, contradictory, and changing requirements.
  • Stakeholders have radically different world views and different frames for understanding the project or process.
  • The problem is never solved.
[Diagram: a user poses a query through a search interface to a search engine, which returns results drawn from content; the user then asks, browses, or searches again. Surrounding labels: roles, goals, tasks, behavior, language, vocabulary, syntax, linguistics, input, interaction, feedback, index, algorithms, metadata, controlled vocabulary, design, knowledge management.]

6. [Diagram of related fields: interaction design, information architecture, discovery, search patterns, futures studies, knowledge management, wayfinding.]

7. Generations of search engines:
  • 1st generation (ca. 1994): AltaVista, Excite, Infoseek. Ranking based on content: pure information retrieval.
  • 2nd generation (ca. 1996): Lycos. Ranking based on content + structure: site popularity.
  • 3rd generation (ca. 1998): Google, Teoma, Yahoo. Ranking based on content + structure + value: page reputation.
  • In the works: ranking based on the need behind the query.

8.
Content similarity ranking:
  • The more rare words two documents share, the more similar they are.
  • Documents are treated as "bags of words" (no effort to understand the contents).
  • Similarity is measured by vector angles.
  • Query results are ranked by sorting the angles between the query and the documents.
[Figure: document vectors d1 and d2 and the query vector, plotted on term axes t1, t2, t3.]

9. A hyperlink from a page in site A to some page in site B is considered a popularity vote from site A to site B. Rank is similar to popularity.
[Figure: example sites, including www.aa.com and www.zz.com, labeled with their vote counts.]

10. The reputation ("PageRank") of a page Pi = the sum of a fraction of the reputations of all pages Pj that point to Pi.
  • The idea is similar to academic co-citations.
  • Beautiful math behind it: PR = principal eigenvector of the web's link matrix; PR is equivalent to the chance of randomly surfing to the page.
  • The HITS algorithm tries to recognize "authorities" and "hubs".

11. [Diagram: crawl the web; check for duplicates and store the documents (DocIds); create an inverted index; the search engine servers take the user query, search the inverted index, and show results to the user.]

12. What a search engine does:
  • Crawling: follow links to find information.
  • Indexing: record what words appear where.
  • Ranking: what information is a good match to a user query? What information is inherently good?
  • Displaying: find a good format for the information.

13. 50% of emails received are spam!

14. But Google is usually so good at finding info. Why does it do that?

15. I try another search engine. I try different keywords, but if I still can't find an answer, I just think real hard for an answer. I focus on the encyclopedia.

16. I punch the screen. Just kidding, LOL.

17.
  • don't know how to form a sound search query;
  • don't have a strategy for dealing with poor results;
  • can't articulate how they know content is credible;
  • don't check the author or date of an article.

18.
  • Step 1: define the data you want.
  • Step 2: figure out where it's likely to be found.
  • Step 3: select the search tool most likely to provide it.
  • Step 4: learn how to interpret your results.

19.
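As a rough illustration of slides 8 and 10, the sketch below ranks a few toy documents by the cosine of the angle between bag-of-words vectors, and estimates PageRank by repeatedly passing each page's reputation along its links (power iteration). The function names, the exact rare-word (IDF) weighting, the damping factor of 0.85, and the treatment of pages with no outgoing links are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter


def rank_by_cosine(query, docs):
    """Rank documents (name -> text) by cosine similarity to the query.

    Documents are bags of words; rare words get a larger weight
    (an IDF-style weight chosen for this sketch).
    """
    bags = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n = len(bags)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for bag in bags.values() for word in bag)

    def vector(bag):
        # Words that occur in no document are ignored.
        return {w: tf * math.log(1 + n / df[w]) for w, tf in bag.items() if w in df}

    def cosine(u, v):
        dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q = vector(Counter(query.lower().split()))
    scores = {name: cosine(q, vector(bag)) for name, bag in bags.items()}
    # A larger cosine means a smaller angle, so sort in descending order.
    return sorted(scores, key=scores.get, reverse=True)


def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a link graph (page -> list of pages it links to).

    The damping factor and iteration count are assumed values.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = links.get(p) or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy_docs = {
        "d1": "search engine ranking",
        "d2": "cooking pasta recipe",
        "d3": "web search engine crawler index",
    }
    print(rank_by_cosine("search engine ranking", toy_docs))
    print(pagerank({"A": ["B"], "C": ["B"], "B": ["A"]}))
```

Sorting by descending cosine is the same as sorting by ascending angle, which is how slide 8 describes the ranking. In the toy link graph, page B collects votes from both A and C, so it ends up with the highest score.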
The most commonly used search tools are:
  • Search engines
  • Subject directories
Other search tools include:
  • Targeted directories
  • Focused crawlers
  • Portals
  • Vortals
  • Meta-tools
  • Value-added search services

20. Search engines are the preferred tool when you:
  • Are looking for something very specific
  • Need to pin down a quick fact or two
  • Need to know if any information exists at all on a subject
  • Want mass quantities of links, but are not concerned about quality control.

21. A subject directory is a database of titles, citations, and websites organized by category.
  • Advantage: Most directories are created, edited, and maintained by people, so they are usually carefully evaluated and annotated.
  • Disadvantage: They typically include fewer sites than a search engine because of the great amount of human effort involved.

22. Open Directory Project: the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors. Closed-model directories such as Yahoo! and LookSmart are pulled together by professional editors who select the links and set up the categories. The user generally gets high-quality results.

23. Subject directories are organized and selective. They are useful when you want to know more about broad-based subjects, such as:
  • General topics
  • Popular topics
  • Targeted directories
  • Current events
  • Product information

24. Many search engines are now hybrids: search tools that have an engine as well as a directory. Sometimes targeted directories are matched with focused crawlers to produce a very powerful hybrid search tool. (e.g.

25. Metasearches use multiple engines to look for your keywords.
  • Advantage: You have many search engines all looking for what you need. Great when you are looking for something that is hard to find.
  • Disadvantage: It's hard to fine-tune your search and narrow things down. Also, metasearches can sometimes give you more information than you need.

26.
  • Beaucoup!
  • Clusty
  • Mamma, "the mother of all search engines"
  • Ixquick

27. Yahooligans: Made for ages 7-12; pages are hand-picked to be appropriate for children. Not only is the content on these pages monitored, but so are the ads that are displayed.
Froogle: Made for the frugal shopper, this offshoot of Google has engines that catalog products and find you the cheapest price for a given item on the Internet. It's in its beta version, so they are still working out some kinks.

28. Boolean operators (AND, OR, and NOT)
  • AND: Limits the number of hits (results) you receive. In many search sites this is implied (if you type two or more words, it assumes you want x AND y AND z, etc.).
  • OR: Increases the number of hits you receive. Synonyms for words can be used.
  • NOT: Limits the number of hits you receive. Useful for excluding the unwanted meanings of words that have more than one meaning. Ex: Sun NOT Microsystems. Sometimes written as a (-) sign (as in Google).

29. Phrase search
  • Usually quotation marks are used.
  • Useful for a specific search (song lyrics, part of a poem, etc.).
  • Ex: "fly me to the moon"
Truncation and wildcards
  • Used as placeholders for additional characters, usually (*).
  • Truncation finds any characters that come after the placeholder. Ex: Red* finds red, reds, redwood, redding, etc.
  • Wildcards find different characters within a word. Ex: Wom*n finds woman, women.
Stop words
  • Small words that are used often.
  • Some stop words include: and, the, a, not, to, be, etc.
  • Ex: "Give me a cookie" and "Give me cookie" would yield similar results.
  • Most search engines and databases ignore these.

30. Limiters
  • Most search engines and databases provide other ways to narrow your search.
  • Often found under "Advanced Search".
  • Varies greatly!
Search limiters
  • Keyword (usually the default)
  • Title
  • Author
  • Subject
  • Multiple search boxes
Other limiters
  • Date
  • Language
  • Type (book, DVD, magazine, etc.) or (web: .gov, .edu, .org)
Google Advanced Search; Wilson Select Plus

31. Power searching also uses math, the universal language. It uses symbols such as + and -.
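The operators from slides 28, 29, and 31 can be made concrete with a small matcher. This is a minimal sketch, not how any particular search engine works: `parse_query`, `term_matches`, and `search` are hypothetical names, bare words are combined with an implied AND, a leading - excludes a term, a leading + simply marks a term as required, quotation marks delimit a phrase, and * stands for any run of letters or digits.

```python
import re


def parse_query(query):
    """Split a query into required terms and excluded terms."""
    required, excluded = [], []
    # Each token is an optional +/- sign followed by a quoted phrase or a bare word.
    for sign, phrase, word in re.findall(r'([+-]?)(?:"([^"]*)"|(\S+))', query):
        term = (phrase or word).lower()
        if not term:
            continue
        (excluded if sign == "-" else required).append(term)
    return required, excluded


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"\w+", text.lower())


def term_matches(term, words):
    """Check one term (word, wildcard pattern, or phrase) against a word list."""
    if " " in term:
        # Phrase search: the words must appear consecutively.
        phrase = tokenize(term)
        k = len(phrase)
        return any(words[i:i + k] == phrase for i in range(len(words) - k + 1))
    # Truncation and wildcards: * matches any run of word characters.
    pattern = re.compile(re.escape(term).replace(r"\*", r"\w*"))
    return any(pattern.fullmatch(w) for w in words)


def search(query, docs):
    """Return the names of documents (name -> text) that satisfy the query."""
    required, excluded = parse_query(query)
    hits = []
    for name, text in docs.items():
        words = tokenize(text)
        if all(term_matches(t, words) for t in required) and not any(
            term_matches(t, words) for t in excluded
        ):
            hits.append(name)
    return hits


if __name__ == "__main__":
    toy_docs = {
        "a": "The Sun rises in the east",
        "b": "Sun Microsystems built workstations",
        "c": "Fly me to the moon and back",
        "d": "Women and a woman walked past the redwood",
    }
    for q in ["sun -microsystems", '"fly me to the moon"', "wom*n", "red*"]:
        print(q, search(q, toy_docs))
```

Stop-word removal is left out on purpose; adding it would mean dropping words such as "the" and "a" in `tokenize` and `parse_query` before matching.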
Example: Clinton Lewinsky on Yahoo!

32. Use these commands in the search window.
  • intitle: Find sites with one search term in the title.
  • allintitle: Find sites with all search terms in the title.
  • inurl: Find sites with one search term in the URL.
  • allinurl: Find sites with all search terms in the URL.
  • site: Limit your search to a specific web site.
  • filetype: Specify a type of document to search.
8/2/2007

33.
  • Find pages containing the term in the title: intitle:[search term]
  • Find pages with terms in the text: allintext:[search terms]
  • Find similar pages to a certain website: related:[insert URL]
  • Find pages with the term in the URL: inurl:[insert search term]
Try it out!

34.
  • Find pages containing the term in the title: title:[search term]
  • Find pages with the term in the URL: url.all:[search term]

35. Also called the "deep web"; consists of materials that search engines will not or cannot index. Usually consists of web-based databases or PDF files. Example: American Memory Project: Jackie Robinson.

36. Google: The only traditional search engine that can recognize .pdf and .doc files.
Profusion: A metasearch tool that lets you search .pdf files.

37. Google
  • By far the most used search site (76% of searches on the Internet are done using Google).
  • Simple one-line search box
  • Phrase completion function
  • "Did you mean" function
  • "I'm Feeling Lucky!"
  • Other search options: Images, Videos, Maps, News, Shopping (limiters)
Search strategies
TYPE                              INCLUDED?  HOW
Boolean operators (AND, OR, NOT)  Yes        AND = [default]; OR = OR (capitalized); NOT = [-]
Phrase search                     Yes        Quotation marks [" "]
Wildcards / truncation            Some       No truncation (Google automatically searches other endings); wildcards = [*]
Advanced search                   Yes        Limit by language, file type, domain, etc.

38. Bing

39. Bing (Microsoft's latest search engine) Starts out with a simple one bo