LIS618 lecture 10 Thomas Krichel 2003-04-23. Structure some repeats from last week other special...

LIS618 lecture 10

Thomas Krichel

2003-04-23

Structure

• some repeats from last week

• other special syntaxes

• usenet news in google

• open directory project in google.

query language II

• * is a wildcard for any word

• +stopword requires the presence of the stop word stopword. The list of stop words has not been published.

• In fact, which words are treated as stop words varies from query to query.

• There is a limit of 10 words, but a * does not count towards the limit
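The word limit can be checked before sending a query. The following is a minimal sketch, assuming that words are separated by whitespace and that only a standalone * is exempt from the count; the function names are hypothetical and the limit of 10 is simply the figure quoted above.

```python
LIMIT = 10  # word limit quoted on the slide; an assumption, not an official constant


def count_query_words(query: str) -> int:
    """Count whitespace-separated words, skipping standalone * wildcards."""
    return sum(1 for word in query.split() if word != "*")


def within_limit(query: str, limit: int = LIMIT) -> bool:
    """Return True if the query does not exceed the word limit."""
    return count_query_words(query) <= limit


if __name__ == "__main__":
    print(count_query_words('"directory * * trees"'))
    print(within_limit("a b c d e f g h i j * *"))
```

Padding a phrase with * therefore lengthens the pattern without using up any of the ten words.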

special syntax I

• intitle: find in title only. Example: "intitle:google"

• intext: find in text only. This will exclude occurrences of the search term in anchor or title data. Example: "intext:html"

• inanchor: finds pages that some other page links to with the query as anchor text. Example: inanchor:"a list of my courses" finds my courses page because another page links to it with that text.
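These operators all share the shape operator:term. The sketch below builds such expressions, assuming that the term must follow the colon without a space and that multi-word terms need double quotes; the helper name `special` is hypothetical.

```python
def special(operator: str, term: str) -> str:
    """Build an operator:term expression, quoting multi-word terms.

    Assumption: no space is allowed between the colon and the term.
    """
    term = term.strip()
    already_quoted = term.startswith('"') and term.endswith('"')
    if " " in term and not already_quoted:
        term = f'"{term}"'
    return f"{operator}:{term}"


if __name__ == "__main__":
    print(special("intitle", "google"))
    print(special("intext", "html"))
    print(special("inanchor", "a list of my courses"))
```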

special syntax II

• cache: shows the version of a page that is held in the Google cache. This is useful if a result page appears to have nothing to do with the query terms.

• Example: cache:openlib.org/home/krichel will show the cached version of the page.

• If you add further terms, they will be highlighted.

daterange: special syntax

• limits the search to pages indexed within a range of dates. When the crawler visits a page, it reindexes the page only if the page has changed.

• dates are expressed as Julian day numbers, i.e. the number of days since noon UTC on January 1, 4713 BC in the Julian calendar. The lecture date, 2003-04-23, is day 2452753; day 2452739 is 2003-04-09.

• example: daterange:2452640-2452739
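Julian day numbers are awkward to work out by hand. The following is a minimal sketch of the conversion using only the Python standard library. The function names are hypothetical. The offset 1721425 is the difference between the Julian day number and Python's proleptic Gregorian ordinal, and the sketch assumes that whole-day numbers are all that daterange: needs.

```python
from datetime import date

# Julian day number minus Python's proleptic Gregorian ordinal (0001-01-01 is ordinal 1)
JDN_OFFSET = 1721425


def to_julian_day(d: date) -> int:
    """Return the Julian day number of a calendar date."""
    return d.toordinal() + JDN_OFFSET


def from_julian_day(jdn: int) -> date:
    """Return the calendar date of a Julian day number."""
    return date.fromordinal(jdn - JDN_OFFSET)


def daterange(start: date, end: date) -> str:
    """Build a daterange: expression for the given span of dates."""
    return f"daterange:{to_julian_day(start)}-{to_julian_day(end)}"


if __name__ == "__main__":
    print(to_julian_day(date(2003, 4, 23)))   # 2452753
    print(from_julian_day(2452640))           # 2002-12-31
    print(from_julian_day(2452739))           # 2003-04-09
    print(daterange(date(2002, 12, 31), date(2003, 4, 9)))
```

Under this conversion the range 2452640-2452739 covers 2002-12-31 through 2003-04-09.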

mixing special syntax expressions

• The link: syntax does not mix with others.

• Other bad ideas:

– "site:openlib.org -inurl:openlib"

– "site:edu site:com"

• Things that work well:

– intitle:search

– intitle:biology inurl:help

Examples

• George Bush site:nytimes.com

• "Copyright * The New York Times" "George Bush"

• intitle:"directory * * trees"

• Botany intitle:"directory of" site:edu

• "powered by blogger" OR site:blogspot.com

• "classical music" (inurl:mailman | inurl:listserv)
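Queries like the ones above contain quotes, colons and pipes, so they need escaping when placed in a URL. The sketch below shows one way to do this with the standard library. The base URL https://www.google.com/search and the parameter name q are assumptions about the search form, not something stated in the lecture, and the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed search endpoint and parameter name; adjust if the form differs
BASE_URL = "https://www.google.com/search"


def search_url(query: str, base: str = BASE_URL) -> str:
    """Return a search URL with the query percent-encoded in the q parameter."""
    return f"{base}?{urlencode({'q': query})}"


if __name__ == "__main__":
    examples = [
        "George Bush site:nytimes.com",
        '"classical music" (inurl:mailman | inurl:listserv)',
    ]
    for example in examples:
        url = search_url(example)
        # Decoding the URL again recovers the original query unchanged
        decoded = parse_qs(urlsplit(url).query)["q"][0]
        print(url, decoded == example)
```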

phonebook: special syntax

• also rphonebook for residential and bphonebook for businesses

• A location seems to be required. Compare:

– phonebook: long island university

– phonebook: long island university ny

• no

– wildcards

– exclusions

– OR

stocks on google

• stocks:ticker looks up the ticker symbol ticker at http://finance.yahoo.com

• you can find ticker symbols there

• ticker symbols are useful to find financial information about publicly traded companies.

google images

• it has the following special syntaxes

– intitle: searches for images on a page with a given title, "intitle:long island university"

– inurl: searches for images in pages that have a certain URL, inurl:liu.edu

– site: restricts the search to a certain site; should be combined with a search term, like "site:liu.edu koenig"

Google interfaces to 3rd party data

• Google Groups is an interface to Usenet news

• Google directory is an interface to the Open Directory Project.

• In both cases Google depends on the quality of the underlying data source.

usenet news

• Usenet is a collection of user-submitted notes on various subjects that are posted to servers on a worldwide network. Each subject collection of posted notes is known as a newsgroup.

• A newsgroup is a discussion about a particular subject consisting of notes written to a networked site and distributed through Usenet.

• Newsgroups are hierarchical. Hierarchical levels are separated by dots, for example comp.text.tex

• alt stands for "alternative"; a common joke expands it as "anarchists, lunatics and terrorists".
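The dot-separated naming scheme can be unpacked mechanically. The following is a minimal sketch, with hypothetical function names, that lists the levels of the hierarchy a newsgroup belongs to.

```python
from typing import List


def hierarchy(newsgroup: str) -> List[str]:
    """Return every level of the hierarchy, from the top level down to the group."""
    parts = newsgroup.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def top_level(newsgroup: str) -> str:
    """Return the top-level hierarchy of a newsgroup name."""
    return newsgroup.split(".")[0]


if __name__ == "__main__":
    print(hierarchy("comp.text.tex"))
    print(top_level("rec.music.classical.recordings"))
```

For comp.text.tex this yields comp, comp.text and comp.text.tex.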

usenet history

• The idea of network news was born in 1979 when two graduate students, Tom Truscott and Jim Ellis, thought of using UUCP to connect machines for the purpose of information exchange among users. They set up a small network of three machines in North Carolina.

• UUCP is ``UNIX to UNIX copy'', a protocol used to copy files between machines running some flavor of UNIX, without the need for the IP protocol. Usenet is older than the Internet.

decline of usenet

• essentially open to all (peer-to-peer system)

• used by spammers for – posting – gathering addresses

• steady decline of quality of contributions

• steady decline of quantity of contributions

usenet worth checking out

• independent reviews of products, often written by experts.

• Example: interpretation of Beethoven sonatas by Wilhelm Kempff.

• Sorting by date reveals that the newsgroup rec.music.classical.recordings is still active. On a good day, you will find no finer guide to records.

special syntax for usenet

• group: limits the search to postings in a certain group

• title: limits to titles of postings

• author: searches for author name or email address

• Mixing syntaxes works well

the open directory project

• "The Open Directory Project is the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors."

• They claim a historical precedent in the Oxford English Dictionary.

• Formerly known as ``GnuHoo'', then ``NewHoo'', then acquired by Netscape, and called ``dmoz''.

dmoz.org

• dmoz is maintained by volunteer ``net-citizens''. No special qualifications are required, but the editors are claimed to be experts.

• There are about 30,000 volunteers (they claim).

• Powers the core directory services for the Web's largest and most popular search engines and portals

– Netscape Search

– AOL Search

– Google

– Lycos

– HotBot

– DirectHit

• Headquarters run by Netscape

http://openlib.org/home/krichel

Thank you for your attention!