Web Intelligence By Otto Borchert April 28, 2003.

Post on 30-Dec-2015


Web Intelligence

By Otto Borchert

April 28, 2003

Background

• Application Layer / HTTP

• Agents

• Present - Google / PageRank

• Future - Semantic Web / OWL

Hypertext Transfer Protocol (HTTP)

• Application level protocol (World Wide Web)

• Runs over TCP, normally port 80

• Information retrieved using a URL (Uniform Resource Locator): protocol://host:port

• Typical HTTP packet format

– START_LINE<CRLF>

– MESSAGE_HEADER<CRLF>

– <CRLF>

– MESSAGE_BODY<CRLF>

Request Messages

• Given by client on START_LINE

• Includes:

– OPTIONS: request information about available options

– GET: (one of the 2 most commonly used) retrieve document identified in URL

– HEAD: (the other most commonly used) retrieve metainformation about document identified in URL (find out how old a page is)

– POST: give information to server

– PUT: store document under specified URL

– DELETE: delete specified URL

– TRACE: loopback request message

– CONNECT: for use by proxies

Example request

• GET http://www.cs.ndsu.nodak.edu/index.html HTTP/1.1

– Entire URL given in START_LINE

• GET /index.html HTTP/1.1
  Host: www.cs.ndsu.nodak.edu

– Path of the page given in START_LINE, host in MESSAGE_HEADER
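A minimal sketch of how the two request forms above could be assembled as text, following the START_LINE / MESSAGE_HEADER / blank line / MESSAGE_BODY layout. The function name `build_request` and its parameters are hypothetical, chosen only for illustration; nothing is sent over the network.

```python
def build_request(method, target, headers=None, body=""):
    """Assemble START_LINE, MESSAGE_HEADER lines, a blank line, and the body."""
    lines = [f"{method} {target} HTTP/1.1"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # A bare CRLF (empty line) separates the headers from the body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


# Form 1: the entire URL in the START_LINE.
absolute_form = build_request("GET", "http://www.cs.ndsu.nodak.edu/index.html")

# Form 2: the path in the START_LINE, the host in a MESSAGE_HEADER.
origin_form = build_request("GET", "/index.html", {"Host": "www.cs.ndsu.nodak.edu"})

print(repr(origin_form))
```

Printing with `repr` makes the CRLF pairs visible, which is the part of the format that is easiest to get wrong.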

Server reply

• Server replies with a Response Message

• Contains the version of HTTP being used, a 3-digit code indicating whether or not the request was successful, and the reason for giving that code

Codes

• 1xx – Informational (Request received, continuing process)

• 2xx – Success (Action successfully received, understood, and accepted)

• 3xx – Redirection (further action must be taken to complete the request)

• 4xx – Client Error (request contains bad syntax or cannot be fulfilled)

• 5xx – Server Error (server failed to fulfill an apparently valid request)
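As an illustration, the first digit of the code is enough to pick one of the five classes above. The sketch below splits a response START_LINE into its three parts and looks up the class; `parse_status_line` is a hypothetical helper name, not part of any library.

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def parse_status_line(line):
    """Split a status line into (version, code, reason, class name)."""
    # Split at most twice so a multi-word reason such as "Not Found" stays whole.
    version, code, reason = line.strip().split(" ", 2)
    code = int(code)
    return version, code, reason, STATUS_CLASSES.get(code // 100, "Unknown")


print(parse_status_line("HTTP/1.1 404 Not Found"))
print(parse_status_line("HTTP/1.1 301 Moved Permanently"))
```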

Example Replies

• HTTP/1.1 200 OK

– Web page request succeeded; the page is returned and displayed

• HTTP/1.1 404 Not Found

– The usual not found error

• HTTP/1.1 301 Moved Permanently

– The page has moved; includes a MESSAGE_HEADER (Location), like those in a request, to tell where the page has been moved to

HTTP extras

• Version 1.0 opened one TCP connection for each request; version 1.1 allows persistent connections

• HTTP was set up with web caching in mind. One can check the date a page was last updated and store the newest versions of frequently accessed pages on a local machine
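A rough sketch of the date check behind caching, assuming the cache remembers when its copy was last modified. `If-Modified-Since` and `Last-Modified` are standard HTTP headers; the helper names and the sample dates are made up for illustration, and no request is actually sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_headers(cached_last_modified):
    """Headers asking the server to send the page only if it has changed."""
    return {"If-Modified-Since": format_datetime(cached_last_modified, usegmt=True)}


def is_stale(cached_last_modified, server_last_modified):
    """True when the server's Last-Modified date is newer than the cached copy."""
    return parsedate_to_datetime(server_last_modified) > cached_last_modified


# Hypothetical cached copy, last modified on 1 April 2003 at 12:00 UTC.
cached = datetime(2003, 4, 1, 12, 0, 0, tzinfo=timezone.utc)

print(conditional_headers(cached))
print(is_stale(cached, "Mon, 28 Apr 2003 08:00:00 GMT"))
```

A server that sees `If-Modified-Since` and has nothing newer can answer `304 Not Modified` with no body, so the cached copy is reused.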

Is the web intelligent?

• Intelligence is a poorly defined word anyway. For example, would you consider these intelligent?

– Document analysis systems for cataloging and summarizing Web pages

– Profiling systems for placing selective Web advertising

– Data mining and analysis

– Tools for searching databases supported by Web browsers

– Translation tools that convert to and from human languages

– Statistical software for network caching, routing, and tracking

– Knowledge-based systems for automated e-mail reading

– Smart agents for Internet-based product and service marketing

– Video object recognition and searching

Is the web intelligent? (2)

• One of the most important advances in making the web intelligent is the use of agents.

• These agents take many forms, including many of those listed on the previous slide

What is an agent?

• No standard definition

• Can be:

– Web Crawler

– Travel Agent

– Secretary

– Hard to distinguish between agent and program. Agent normally performs actions based on data it finds, without much human intervention

• Agents can be defined as intelligent as well

• Act as the glue for many of the following ideas

The Present of Web Intelligence - Google

• Presently the most used search engine the Internet has to offer.

• Provides a unique blend of computer hardware and software to complete millions of user searches each day

• Based on a system called PageRank

PageRank

• Developed at Stanford University by Larry Page and Sergey Brin (Google’s founders)

• Uses a system of link ranking

– If there is a link from page A to page B, page B is correlated to page A

– If page A is a strong page to begin with, page B becomes stronger as well
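The idea that a strong page passes strength along its links can be sketched as a simple iteration. This is a simplified textbook-style version, not Google's production system; the damping factor 0.85, the iteration count, and the four-page toy web are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page keeps a small baseline rank regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy web: B is linked from A, C, and D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "B"], "D": ["B"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy web, B collects links from three pages and ends up strongest, while D, which nothing links to, keeps only the baseline rank.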

Word Association

• On top of PageRank, there is also a system of word matching.

– Word counts (Do the words exist on the page?)

– Proximity checks (Are the words close together?)
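The two checks above can be illustrated with a few lines of code. This is only a sketch of the general idea; the tokenizer, the function names, and the sample text are assumptions, and a real search engine's matching is far more involved.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_counts(page_text, query_words):
    """Word count check: how often does each query word appear on the page?"""
    tokens = tokenize(page_text)
    return {word: tokens.count(word) for word in query_words}


def min_distance(page_text, word_a, word_b):
    """Proximity check: smallest gap in word positions, or None if a word is missing."""
    tokens = tokenize(page_text)
    pos_a = [i for i, t in enumerate(tokens) if t == word_a]
    pos_b = [i for i, t in enumerate(tokens) if t == word_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)


page = "The semantic web adds meaning to the web. Agents search the web."
print(word_counts(page, ["web", "agents"]))
print(min_distance(page, "semantic", "web"))
```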

Can’t you cheat PageRank?

• People try every day!

• Higher search ranking == More exposure

• Link Farms

– Places where people merely have millions of links to a web page in hopes the target will move higher on the list.

– Google’s answer: Page importance. Once link farms are discovered, they are given a negative rank, so if you have a page on a link farm, its rank will go down as well

Another way to cheat

• Put lots of words related to your page in your page (even if they are not visible)

• Google’s answer: PageRank is primary, cheaters are given lower priority

Moral Decisions

• Wired article

– Computer screen shows (location, query) pairs for random searches on Google’s engines.

– One search during the late hours on the West Coast was “How to stop a friend from committing suicide”

– Can’t do much about it but make sure they get the right information the next time

The Future of Web Intelligence

• The Semantic Web

What is the Semantic Web?

• As the web presently stands, it is complete nonsense to most software applications.

– Two completely different statements

• The ball is round

• The round ball

• The semantic web is a series of protocols meant to enrich the current web with meaning

Series of Protocols

• RDF – Resource Description Framework

• OWL – Web Ontology Language (extension of RDF)

Resource Description Framework

• From World Wide Web Consortium webpage

• RDF “defines a mechanism for describing resources that makes no assumptions about a particular application domain, nor defines (a priori) the semantics of any application domain. The definition of the mechanism should be domain neutral, yet the mechanism should be suitable for describing information about any domain”

RDF – Some examples

• Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila.

– Abstract, conceptual framework

– Concrete syntax using XML

Abstract example

• Subject (Resource) – http://www.w3.org/Home/Lassila

• Predicate (Property) – Creator

• Object (literal) – "Ora Lassila"

• Graphic

Concrete syntax

• Ora Lassila is the creator of the resource http://www.w3.org/Home/Lassila.

<rdf:RDF>
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>
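To connect the concrete syntax back to the abstract subject / predicate / object form, here is a minimal sketch that reads the XML above and recovers the triple. It is not a real RDF parser and handles only this simple shape. The namespace declarations are added so the snippet parses on its own; the URI used for the `s` prefix is an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
S_NS = "http://description.org/schema/"  # assumed namespace for the "s" prefix

document = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:s="{S_NS}">
  <rdf:Description about="http://www.w3.org/Home/Lassila">
    <s:Creator>Ora Lassila</s:Creator>
  </rdf:Description>
</rdf:RDF>
"""


def extract_triples(xml_text):
    """Return a list of (subject, predicate, object) tuples."""
    root = ET.fromstring(xml_text)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get("about")
        for prop in description:
            # ElementTree tags look like "{namespace}Name"; keep only the name.
            predicate = prop.tag.split("}", 1)[1]
            triples.append((subject, predicate, prop.text))
    return triples


print(extract_triples(document))
```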

Web Ontology Language

• What is an ontology?– “defines the terms used to describe and

represent an area of knowledge”

• OWL defines ontologies for use on the web

• Actually an extension of RDF

Ontologies

• Date and Time

• Countries of the World

• Wines

• Space Shuttle Information

Some example OWL statements

<owl:Class rdf:ID="WineGrape">
  <rdfs:subClassOf rdf:resource="&food;Grape" />
</owl:Class>

<owl:Class rdf:ID="WhiteWine">
  <owl:intersectionOf rdf:parseType="Collection">
    <owl:Class rdf:about="#Wine" />
    <owl:Restriction>
      <owl:onProperty rdf:resource="#hasColor" />
      <owl:hasValue rdf:resource="#White" />
    </owl:Restriction>
  </owl:intersectionOf>
</owl:Class>
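Read informally, the second statement says a WhiteWine is anything that is both a Wine and has the value White for hasColor. The sketch below shows that intersection as a plain membership test. It is a loose illustration, not an OWL reasoner; the dictionary layout and the sample individuals are hypothetical.

```python
def is_white_wine(individual):
    """True when the individual is a Wine AND its hasColor value is White."""
    is_wine = "Wine" in individual["types"]
    has_white_color = individual["properties"].get("hasColor") == "White"
    return is_wine and has_white_color


# Hypothetical individuals, described with plain dictionaries.
chardonnay = {"types": {"Wine"}, "properties": {"hasColor": "White"}}
merlot = {"types": {"Wine"}, "properties": {"hasColor": "Red"}}
white_paint = {"types": {"Paint"}, "properties": {"hasColor": "White"}}

print([is_white_wine(x) for x in (chardonnay, merlot, white_paint)])
```

Only the first individual satisfies both conditions, which is the point of defining the class as an intersection.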

Conclusion

• Web intelligence is a broad new field for exploration

• Present efforts like Google can be improved upon with more semantic information

• Any questions?