Thanks to Jim Hendler, Carl Lagoze, Jayavel Shanmugasundaram, Sara Cohen, Jonathan Mamou, Yaron...

Thanks to Jim Hendler, Carl Lagoze, Jayavel Shanmugasundaram, Sara Cohen, Jonathan Mamou, Yaron Kanza, Mark Sapossnek, Yehoshua Sagiv, Frank van Harmelen XML, RDF and Advanced Search (Semantic Web)


Thanks to Jim Hendler, Carl Lagoze, Jayavel Shanmugasundaram, Sara Cohen, Jonathan Mamou, Yaron Kanza, Mark Sapossnek, Yehoshua Sagiv, Frank van Harmelen

XML, RDF and Advanced Search (Semantic Web)


What we have covered

• What is IR
• Evaluation
• Tokenization and properties of text
• Web crawling
• Query models
• Vector methods
• Measures of similarity
• Indexing
• Inverted files
• Basics of internet and web
• Spam and SEO
• Search engine design
• Google and Link Analysis
• Social network analysis

– This lecture: metadata, XML, RDF; issues in advanced search and the Semantic Web


The importance of data and their rules

• Tim Berners-Lee
– Inventor of the World Wide Web
– Founder of the W3C

• Presentation at TED


“Metadata is data about data”

Metadata and Markup languages

Metadata is often written in XML


Metadata is semi-structured data conforming to commonly agreed-upon models, providing operational interoperability in a heterogeneous environment


What is metadata? Some simple definitions

• ‘Structured data about data’.
– Dublin Core Metadata Initiative FAQ, 2005
– http://dublincore.org/resources/faq/

• Machine-understandable information about Web resources or other things.
– Tim Berners-Lee, W3C, 1997
– http://www.w3.org/DesignIssues/Metadata


"Web resources or other things"

– HTML documents
– digital images
– databases
– books
– museum objects
– archival records
– metadata records
– Web sites
– collections
– services
– physical places
– people
– organizations
– “works”
– formats
– concepts
– events

• Metadata might be "about"… anything!


What is metadata? Towards a "functional" view

• Data associated with objects which relieves their potential users of having to have full advance knowledge of their existence or characteristics.
– Lorcan Dempsey & Rachel Heery, "Metadata: a current view of practice and issues", 1998
– http://www.ukoln.ac.uk/metadata/publications/jdmetadata/

• Structured data about resources that can be used to help support a wide range of operations.
– Michael Day, "Metadata in a Nutshell", 2001
– http://www.ukoln.ac.uk/metadata/publications/nutshell/


What might metadata "say"?

What is this called?

What is this about?

Who made this?

When was this made?

Where do I get (a copy of) this?

When does this expire?

What format does this use?

Who is this intended for?

What does this cost?

Can I copy this? Can I modify this?

What are the component parts of this?

What else refers to this?

What did "users" think of this?

(etc!)
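Several of the questions above correspond roughly to elements of the Dublin Core element set. The sketch below builds such a record with Python's standard library. The resource being described (title, creator, URL and so on) and the wrapping `record` element are invented for the example.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC)

# Hypothetical description of a document; every value is made up for illustration.
answers = {
    "title": "An Example Lecture",               # What is this called?
    "subject": "metadata; Semantic Web",         # What is this about?
    "creator": "A. Lecturer",                    # Who made this?
    "date": "2010-01-01",                        # When was this made?
    "identifier": "http://example.org/lecture",  # Where do I get (a copy of) this?
    "format": "application/pdf",                 # What format does this use?
    "rights": "Copying permitted",               # Can I copy this?
}

# "record" is an assumed wrapper element, not part of Dublin Core itself.
record = ET.Element("record")
for name, value in answers.items():
    ET.SubElement(record, f"{{{DC}}}{name}").text = value

xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Each element answers one question, so a program can pick out, say, the creator without reading the document itself.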


What operations/functions?

• resource disclosure & discovery
• resource retrieval, use
• resource management, including preservation
• verification of authenticity
• intellectual property rights management
• commerce
• content-rating
• authentication and authorization
• personalization and localization of services
• (etc!)


What operations/functions?

• Different functions: different metadata
• Metadata (and metadata standards) are sometimes classified according to function
– Descriptive: primarily for discovery, retrieval
– Administrative: primarily for management
– Structural: relationships between component parts of resources
– Contextual: relationships between resources

• No “one size fits all” solution!


Metadata importance

• “data about data” is about as good as the definition gets...

• As a data resource grows, metadata becomes more important

• Lack of metadata has different consequences
– documentation: metadata can be regenerated automatically, or by hand
– datasets, pictures: once lost, can be impossible to regenerate


Types of Metadata

• Descriptive
– Discovery / description of objects
  • Title, author, abstract, etc.

• Structural
– Storage & presentation of objects
  • 1 pdf file, 1 ppt file, 1 LaTeX file, etc.

• Administrative
– Managing and preservation of objects
  • Access control lists, terms and conditions, format descriptions, “meta-metadata”

See http://www.loc.gov/standards/metadata.html


Which View is Correct?

figure 1 from: http://www.dlib.org/dlib/january01/lagoze/01lagoze.html


Approaches to Metadata

• From Ng, Park and Burnett, 1997 (also JASIS, 50(13))
http://www.scils.rutgers.edu/~sypark/asis.html

– library science: bibliographic control
  • “organizing the physical containers of information, by means of bibliographical description, subject analysis, and classification notation construction, so that the container can be efficiently described, identified, located and retrieved”

– computer and information science: data management
  • “not only to store, access and utilize data effectively, but also to provide data security, data sharing, and data integrity”


Metadata Formats and Implementation

• Use markup languages
– Interoperable
– Extensible
– Robust

• Permits advanced search features

When online, the beginning of a semantic web!


What is a markup language?

• Textual (i.e. human-readable) language where significant elements are indicated by markers
– <TITLE>XML</TITLE>

• Examples are RTF, HTML, XML, TeX, etc.
• Easy to process and can be manipulated by a variety of application programs


Standard Generalized Markup Language (SGML)

• Based on GML (generalized markup language), developed by IBM in the 1960s

• An international standard (ISO 8879:1986) defines how descriptive markup should be embedded in a document

• Can define any document format of any complexity

• Enables extensibility, structure and validation

• Too many optional features for the Web

• Gave birth to the extensible markup language (XML), W3C recommendation in 1998


The Purpose of SGML

• SGML is designed to make your information last longer than the systems that created it. Such longevity also implies immunity to short-term changes -- such as a change from one application program to another -- so SGML is also inherently designed for re-purposing and portability.


What is SGML?

• SGML (and its derivatives, HTML and XML) are ASCII character-based representations of electronic data

• Remember, it's all bits--meaning is derived from how they are organized…

• Think of SGML docs as strings that must be parsed--a web browser parses an HTML doc and uses the markup codes to display the data contained

• Since it's all ASCII, these docs can also be handled by non-parsing tools (such as vi, emacs, perl, etc.)


SGML, XML, HTML

• SGML is the “mother tongue” – but is overkill for most common desktop applications.

• XML is an abbreviated version of SGML
– easier to define own document types
– easier for programmers to write programs to handle documents (and data)
– omits all the options (and most of the more complex and less-used parts) of SGML

• HTML is just one of many SGML or XML “applications” – most frequently used on the Web


SGML Components

• SGML documents have three parts:
– Declaration: specifies which characters and delimiters may appear in the application
– DTD (document type definition) / style sheet: defines the syntax of markup constructs
– Document instance: actual text (with the tags) of the document

• More information can be found at: http://www.W3.Org/markup/SGML


World Wide Web Consortium (W3C)


What is XML?

• XML – eXtensible Markup Language

• designed to improve the functionality of the Web by providing more flexible and adaptable information and identification

• “extensible” because not a fixed format like HTML

• a language for describing other languages (a meta-language)

• design your own customised markup language


The HTML World

<body>
  <h1> XML and Information Retrieval: A SIGIR 2000 Workshop </h1>
  <p> The workshop was held on 28 July 2000. The editors of the workshop were David Carmel, Yoelle Maarek, and Aya Soffer </p>
  <h2> XQL and Proximal Nodes </h2>
  <p> The paper was authored by Ricardo Baeza-Yates and Gonzalo Navarro. The abstract of this paper is given below. </p>
  <p> We consider the recently proposed language … </p>
  <p> The paper references the following papers: <a href="http://www.acm.org/www8/paper/xmlql"> … </a> … </p>
  …


The XML World

<workshop date="28 July 2000">
  <title> XML and Information Retrieval: A SIGIR 2000 Workshop </title>
  <editors> David Carmel, Yoelle Maarek, Aya Soffer </editors>
  <proceedings>
    <paper id="1">
      <title> XQL and Proximal Nodes </title>
      <author> Ricardo Baeza-Yates </author>
      <author> Gonzalo Navarro </author>
      <abstract> We consider the recently proposed language … </abstract>
      <section name="Introduction">
        Searching on structured text is becoming more important with XML …
        <subsection name="Related Work"> The XQL language … </subsection>
      </section>
      …
      <cite xmlns:xlink="http://www.acm.org/www8/paper/xmlql"> … </cite>
    </paper>
    …
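To see what the XML version buys a program, here is a minimal sketch that reads the workshop date, paper title and authors directly by tag name, which the HTML version would not allow without guessing at headings and paragraphs. The document is a shortened, closed version of the fragment above. Closing the unfinished elements is an assumption made so that the text parses.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the slide's fragment. The elided text ("…")
# is kept as a placeholder; the closing tags are added by assumption.
doc = """
<workshop date="28 July 2000">
  <title>XML and Information Retrieval: A SIGIR 2000 Workshop</title>
  <editors>David Carmel, Yoelle Maarek, Aya Soffer</editors>
  <proceedings>
    <paper id="1">
      <title>XQL and Proximal Nodes</title>
      <author>Ricardo Baeza-Yates</author>
      <author>Gonzalo Navarro</author>
      <abstract>We consider the recently proposed language …</abstract>
    </paper>
  </proceedings>
</workshop>
"""

root = ET.fromstring(doc)
workshop_date = root.get("date")            # attribute of the root element
for paper in root.iter("paper"):            # every <paper>, at any depth
    paper_title = paper.findtext("title")
    authors = [a.text for a in paper.findall("author")]
    print(workshop_date, "|", paper_title, "|", ", ".join(authors))
```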


Why use XML?

• XML is written in SGML – the Standard Generalized Markup Language, an international standard (ISO 8879)

• XML = very simple dialect of SGML
• goal = enable generic SGML to be served, received and processed on the Web in ways not possible with HTML


Why use XML?

• XML is not just for Web pages

• use to store any kind of structured document

• to enclose/encapsulate information in order to pass it between different computing systems that are otherwise unable to communicate


Key feature of XML

• An application is free to use XML tagged data in many different ways, e.g.
– produce an image
– generate a formatted text listing
– display the XML document’s markup in pretty colors
– restructure the data into a format for storing in a database, transmission over a network, input to another program.
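As a small illustration of this point, the sketch below takes one XML document and uses it in two of the ways listed: as a formatted text listing and as rows ready for a database table. The sample data and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this example.
doc = """<papers>
  <paper id="1"><title>First Paper</title><year>2000</year></paper>
  <paper id="2"><title>Second Paper</title><year>2001</year></paper>
</papers>"""
root = ET.fromstring(doc)

def as_text_listing(root):
    """Use 1: a formatted text listing for people to read."""
    return "\n".join(
        f"{p.get('id'):>3}. {p.findtext('title')} ({p.findtext('year')})"
        for p in root.iter("paper")
    )

def as_rows(root):
    """Use 2: (id, title, year) tuples that could be inserted into a database table."""
    return [
        (int(p.get("id")), p.findtext("title"), int(p.findtext("year")))
        for p in root.iter("paper")
    ]

print(as_text_listing(root))
print(as_rows(root))
```

The XML itself is unchanged in both cases; only the program reading it decides what to do with the tagged data.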


XML is important because...

• Removes 2 constraints that held back Web development:
– dependence on a single, inflexible document type (HTML) [much abused]
– the complexity of full SGML [many options but hard to program]


• XML… allows the flexible development of user-defined document types.

• provides a robust, non-proprietary, persistent, and verifiable file format for the storage and transmission of text and data both on and off the Web


XML Software?

• many programs are “XML ready” already today.

• xml.coverpages.org covers news of new additions to XML


Is XML a Computer Language?

• XML is not C or C++ or like any other programming language

• By itself, it cannot specify calculations, actions, decisions to be carried out in any order

• XML is a markup specification language


XML - a Markup Language

• with XML, you can design ways of describing information (text or data), usually for storage, transmission or processing by a program

• XML conveys no information about what should be done with the data or text – it merely describes it.

• By itself, XML does not do anything – it is a data description format


How do I run or execute an XML file?

• You can’t and you don’t!
• XML is not a programming language
• XML is a markup specification language
• XML files are just data (waiting for a program to do something with them)
• XML files can be viewed with an XML editor or XML-compatible browser


Things to Remember

• XML does not replace HTML – it provides an alternative which allows you to define your own set of markup elements to a published standard:

<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>


Things to Remember

• All parts of an XML document are case sEnSiTiVe

• Element type names are case sensitive, so <BODY> … </body> is out.

• Attribute names are case sensitive …
– <PIC width="7cm"/> and
– <PIC WIDTH="6cm"/>
– describe different attributes, not just different values for the attribute “PIC width”.
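Both rules can be checked with any conforming XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` is made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<BODY>text</BODY>"))  # start and end tags match exactly
print(is_well_formed("<BODY>text</body>"))  # end tag differs in case: rejected

# width and WIDTH are two distinct attributes, so both may appear on one element.
pic = ET.fromstring('<PIC width="7cm" WIDTH="6cm"/>')
print(pic.attrib)
```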


What is XQuery?

– XQuery is the language for querying XML data
  • The best way to explain XQuery is to say that XQuery is to XML what SQL is to database tables.

– XQuery uses XPath expressions to extract XML data.
  • XPath is a language for finding information in an XML document.
  • XPath is used to navigate through elements and attributes in an XML document.

– XQuery is defined by the W3C.
– XQuery is supported by all the major database engines (IBM, Oracle, Microsoft, etc.)
  • XQuery 1.0 W3C Recommendation
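To give a feel for path expressions, here is a minimal sketch using the limited XPath subset built into Python's `xml.etree.ElementTree`. It is not XQuery, and full XPath needs a dedicated library. The second paper in the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<proceedings>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <author>Gonzalo Navarro</author>
  </paper>
  <paper id="2">
    <title>A Hypothetical Second Paper</title>
    <author>A. N. Other</author>
  </paper>
</proceedings>"""
root = ET.fromstring(doc)

# Navigate by element and attribute: authors of the paper whose id is "1".
authors = [a.text for a in root.findall("./paper[@id='1']/author")]

# Predicate on a child's text: titles of papers that have a given author.
titles = [p.findtext("title")
          for p in root.findall("./paper[author='Gonzalo Navarro']")]

print(authors)
print(titles)
```

Note that both paths only work because the query writer already knows the element names and nesting.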


Motivation for XML Search

• It is becoming increasingly popular to publish data on the Web in the form of XML documents.

• Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents.
– It is not possible to pose queries that explicitly refer to XML tags.
– Search engines return references (i.e. links) to documents and not specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.






Problems with XQuery

• A query language for XML, such as XQuery, can be used to extract data from XML documents.

• However, such a query language is not an alternative to an XML search engine for several reasons.
– The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user.
– Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per document basis.
– XQuery lacks any mechanism for ranking answers.

• Solution: XML search engine


XML Search Tool Design Features?

• A simple syntax that can be used by naive users
• Search results should include XML fragments and not necessarily full documents
• The XML fragments in an answer should be semantically related
– For example, a paper and an author should be in an answer only if the paper was written by this author

• Search results should be ranked
• Search results should be returned in “reasonable” time
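These requirements can be sketched with a toy keyword search. It is a deliberately naive illustration, not how any real XML search engine works: each `<paper>` element is treated as the unit of "semantic relatedness", and fragments are ranked by raw term counts. The second paper and the function name `search` are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = """<proceedings>
  <paper id="1">
    <title>XQL and Proximal Nodes</title>
    <author>Ricardo Baeza-Yates</author>
    <abstract>Searching on structured text with XML</abstract>
  </paper>
  <paper id="2">
    <title>A Hypothetical Paper on XML Search</title>
    <author>A. N. Other</author>
    <abstract>XML search engines rank XML fragments</abstract>
  </paper>
</proceedings>"""

def search(root, query, fragment_tag="paper"):
    """Return (score, fragment id) pairs for fragments containing every query term."""
    terms = query.lower().split()
    hits = []
    for frag in root.iter(fragment_tag):
        words = " ".join(frag.itertext()).lower().split()
        counts = [words.count(t) for t in terms]
        if all(counts):                  # all terms must occur in the same fragment
            hits.append((sum(counts), frag.get("id")))
    return sorted(hits, reverse=True)    # highest score first

root = ET.fromstring(doc)
print(search(root, "xml"))                # both papers match, ranked by term count
print(search(root, "baeza-yates other"))  # empty: the two names never share a paper
```

The query is plain keywords, the answers are fragments rather than whole documents, and two terms only match together when they occur in the same paper.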


XML Search Engines

• Summary of XML engines
– Open source ones starting to emerge

• Or just use web search engine with filetype:xml
– Usually doesn’t work!

• Many for commercial use and some in design
– Active research area

• Web XML is a step in the direction of the semantic web!


What is Web 2.0 ?

• Term coined by Tim O’Reilly and MediaLive International as part of a brainstorming session about the future of the web in 2004

• Also may be called the Live Web or Living Web
• Refers to more interactive technologies that engage, facilitate and empower users
• Companies utilizing interactive technologies are the hot investments
• Companies are just starting to embrace these technologies for business value
• Tim’s Def (Video); Schmidt’s (Video)
• The Machine (Video)


Web 1.0 vs 2.0 (Some Examples)

Web 1.0 --> Web 2.0
DoubleClick --> Google AdSense
Ofoto --> Flickr
Akamai --> BitTorrent
mp3.com --> Napster
Britannica Online --> Wikipedia
personal websites --> blogging
domain name speculation --> search engine optimization
page views --> cost per click
screen scraping --> web services
publishing --> participation
content management systems --> wikis
directories (taxonomy) --> tagging ("folksonomy")
stickiness --> syndication

Source: www.oreilly.com, “What is web 2.0: Design Patterns and Business Models for the next Generation of Software”, 9/30/2005


Web 3.0

This will be the INTELLIGENT Web!

The Semantic Web!


How will we get the semantic web?

Now... that should clear up a few things around here


• The Web and Web 2.0 were designed with humans in mind. (Human Understanding)

• The Web 3.0 will anticipate our needs for information, whether it is State Department information when traveling, foreign embassy contacts, airline schedules, hotel reservations, area taxis, or famous restaurants. The new Web will be designed for computers. (Machine Understanding)

• The Web 3.0 will be designed to anticipate the meaning of the search.


Web 2.0 vs Web 3.0

Web 2.0: On the Web, you can see your e-mails, photographs, and restaurant appointments.

Web 3.0: On the Web...

...you can see your photographs arranged so that you know what restaurants you visited on a particular date, and based on related emails sent that day.


• The next stage for the Web will be making data accessible to artificial intelligence agents.

• The Web 3.0 will need new languages beyond HTML or XML. One such language is RDF, the Resource Description Framework.

• The Web 3.0 will need data delivered in computer-readable form (RDF).


General idea of Semantic Web

Make current web more machine accessible and intelligent! (currently all the intelligence is in the user)

Motivating use-cases

• Search engines
– concepts, not keywords
– semantic narrowing/widening of queries

• Shopbots
– semantic interchange, not screenscraping

• E-commerce
– Negotiation, catalogue mapping, personalisation

• Web Services
– Need semantic characterisations to find them

• Navigation
– by semantic proximity, not hardwired links

• .....


Example

• Try these queries with Google:
– Distance between Paris and Madrid
  Google returns: www.freedom-tour.com/mall/kmeurope.htm (giving you distances in miles and kilometers)
– (The) Largest city of France
  Google returns: France – Largest City: Paris
– (The) Largest city of Spain
  Google returns: Spain – Largest City: Madrid

• Now, try these with Google:
– Distance between largest city of France and largest city of Spain
– Distance between “largest city of France” and “largest city of Spain”
– And worse, Distance between “the largest city of France” and “the largest city of Spain”
  No result returned by Google!


Example

• So, what’s wrong with Google?
– Nothing. The problem is with the World Wide Web:

• The Web contains unstructured information

– and Google is a keyword- and phrase-based search engine

• Initiative to make the contents on the Web structured information/represented knowledge – the Semantic Web
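A rough sketch of why structured information helps with the composed query from the previous slide: once "largest city" facts are stored as statements rather than buried in page text, a program can chain them. The two largest-city facts come from the slide. The predicate names, the coordinates and the helper functions are assumptions made for this example.

```python
from math import radians, sin, cos, asin, sqrt

# Facts as (subject, predicate, object) statements. Predicate names and the
# approximate city coordinates (degrees) are assumptions for illustration.
facts = {
    ("France", "largestCity", "Paris"),
    ("Spain", "largestCity", "Madrid"),
    ("Paris", "latLon", (48.8566, 2.3522)),
    ("Madrid", "latLon", (40.4168, -3.7038)),
}

def lookup(subject, predicate):
    """Return the single object stored for (subject, predicate)."""
    (value,) = [o for s, p, o in facts if s == subject and p == predicate]
    return value

def distance_km(city_a, city_b):
    """Great-circle (haversine) distance between two cities with known coordinates."""
    lat1, lon1 = map(radians, lookup(city_a, "latLon"))
    lat2, lon2 = map(radians, lookup(city_b, "latLon"))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))    # 6371 km: mean Earth radius

# "Distance between largest city of France and largest city of Spain"
d = distance_km(lookup("France", "largestCity"), lookup("Spain", "largestCity"))
print(round(d), "km")
```

A keyword engine has to find a page that happens to contain the whole phrase; here the inner questions are answered first and their results feed the outer one.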


General idea of Semantic Web (2)

Do this by:
• Somehow making data and metadata available on the Web in machine-understandable form (formalized)

• Structure the data and meta-data in ontologies

These are non-trivial design decisions. Alternative would be:


Expressed using the W3C stack


What it’s like to be a machine on the Web


Required are:

• Explicit meta-data

• Shared domain descriptions

Machine-processable content

Machine-support for interoperability


machine accessible meaning (What it’s like to be a machine)

[Figure: a CV with the sections name, education, work, private]


XML machine accessible meaning

[Figure: a CV (name, education, work, private) marked up with XML tags]


So why not just use XML?

• No agreement on:
– structure
  • is country a:
    – object?
    – class?
    – attribute?
    – relation?
    – something else?
  • what does nesting mean?
– vocabulary
  • is country the same as nation?

<country name="Netherlands">
  <capital name="Amsterdam">
    <areacode>020</areacode>
  </capital>
</country>

<nation>
  <name>Netherlands</name>
  <capital>Amsterdam</capital>
  <capital_areacode> 020 </capital_areacode>
</nation>

● Are the above XML documents the same?
● Do they convey the same information?
● Is that information machine-accessible?
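One way to make the three questions concrete: a program can only recognise that the two documents carry the same facts if a person writes a separate reader for each layout. The sketch below does exactly that; the function names `read_a` and `read_b` are made up, and the knowledge that `country` and `nation` mean the same thing lives in the code, not in the XML.

```python
import xml.etree.ElementTree as ET

doc_a = """<country name="Netherlands">
  <capital name="Amsterdam"><areacode>020</areacode></capital>
</country>"""

doc_b = """<nation>
  <name>Netherlands</name>
  <capital>Amsterdam</capital>
  <capital_areacode> 020 </capital_areacode>
</nation>"""

def read_a(text):
    """Hand-written reader for the attribute-and-nesting layout."""
    root = ET.fromstring(text)
    capital = root.find("capital")
    return (root.get("name"), capital.get("name"),
            capital.findtext("areacode").strip())

def read_b(text):
    """Hand-written reader for the flat, element-only layout."""
    root = ET.fromstring(text)
    return tuple(root.findtext(tag).strip()
                 for tag in ("name", "capital", "capital_areacode"))

same_syntax = ET.fromstring(doc_a).tag == ET.fromstring(doc_b).tag
same_facts = read_a(doc_a) == read_b(doc_b)
print(same_syntax, same_facts)
```

The documents differ as XML yet agree as facts, and only the hand-written readers bridge the two.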


“2nd aim of Semantic Web”: Data integration

– Unstructured and semi-structured sources (document collections, message traffic, web pages, ...)

– Structured data without an explicit data schema (non-local databases, data tables, charts and reports, ...)

– Non-Text collections (image, video, sound, ...)

– Streams of data from sensors, programs, services

Must specify the structure of data resources...


2nd aim of Semantic Web: Data integration

... so a processor can tell how the "attributes" and "values" are related

– What is required vs. optional?
– How many values for a particular attribute?
– What attributes are keys for other attributes?
– Which attributes are necessarily related to other attributes, and in what way?
– How do the attributes (and values) in one data source map to attributes and values describing another source?
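A few of these questions can be made concrete in a small sketch. Everything below is an assumption for illustration: the shared vocabulary, the two source names and the attribute names are hypothetical and come from no standard. The sketch answers "required vs. optional", "how many values" and "how do attributes in one source map to another" for two toy records.

```python
# Hypothetical shared vocabulary: is an attribute required, and how many
# values may it carry (None means unbounded).
SHARED_SCHEMA = {
    "name": {"required": True, "max_values": 1},
    "capital": {"required": True, "max_values": 1},
    "areacode": {"required": False, "max_values": None},
}

# Hypothetical per-source mappings from local attribute names to shared ones.
MAPPINGS = {
    "source_a": {"country_name": "name", "capital_name": "capital",
                 "areacode": "areacode"},
    "source_b": {"name": "name", "capital": "capital",
                 "capital_areacode": "areacode"},
}


def to_shared(source, record):
    """Rename a record's attributes into the shared vocabulary."""
    mapping = MAPPINGS[source]
    shared = {}
    for local_name, value in record.items():
        shared.setdefault(mapping[local_name], []).append(value)
    return shared


def violations(shared):
    """List missing required attributes and attributes with too many values."""
    problems = []
    for attr, rule in SHARED_SCHEMA.items():
        values = shared.get(attr, [])
        if rule["required"] and not values:
            problems.append(f"missing required attribute: {attr}")
        if rule["max_values"] is not None and len(values) > rule["max_values"]:
            problems.append(f"too many values for: {attr}")
    return problems


record_a = {"country_name": "Netherlands", "capital_name": "Amsterdam",
            "areacode": "020"}
record_b = {"name": "Netherlands", "capital": "Amsterdam",
            "capital_areacode": "020"}

print(to_shared("source_a", record_a) == to_shared("source_b", record_b))
print(violations(to_shared("source_b", {"name": "Netherlands"})))
```

Here the mappings are written by hand. Producing them automatically is the hard part the later slides refer to as ontology mapping.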

Page 64:

Stack of languages

• XML:
  – Surface syntax, no semantics
• XML Schema:
  – Describes structure of XML documents
• RDF:
  – Data model for “relations” between “things”
• RDF Schema (RDFS):
  – RDF Vocabulary Definition Language
• OWL:
  – A more expressive Vocabulary Definition Language

Page 65:

Semantic web languages today

• Today there are three semantic web languages
  – RDF – Resource Description Framework
    http://www.w3.org/RDF/
  – DAML+OIL – DARPA Agent Markup Language
    http://www.daml.org/ (deprecated)
  – OWL – Web Ontology Language
    http://www.w3.org/2001/sw/
    • OWL Lite
    • OWL DL
    • OWL Full

Page 66:

RDF is the first Semantic Web language

RDF Data Model, three views:
– XML Encoding (good for machine processing):
  <rdf:RDF ……..> <….> <….> </rdf:RDF>
– Graph (good for human viewing)
– Triples (good for reasoning):
  stmt(docInst, rdf_type, Document)
  stmt(personInst, rdf_type, Person)
  stmt(inroomInst, rdf_type, InRoom)
  stmt(personInst, holding, docInst)
  stmt(inroomInst, person, personInst)

RDF is a simple language for building graph-based representations
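The link between the triple view and the graph view can be sketched in a few lines of Python. The five statements are the ones on the slide; the helper names `as_graph` and `match` are illustrative, and this is a plain-tuple sketch, not an RDF library.

```python
from collections import defaultdict

# The five stmt(...) facts from the slide, as (subject, predicate, object).
TRIPLES = [
    ("docInst", "rdf_type", "Document"),
    ("personInst", "rdf_type", "Person"),
    ("inroomInst", "rdf_type", "InRoom"),
    ("personInst", "holding", "docInst"),
    ("inroomInst", "person", "personInst"),
]


def as_graph(triples):
    """Graph view: each node maps to its outgoing (edge label, target) arcs."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


print(as_graph(TRIPLES)["personInst"])
print(match(TRIPLES, p="rdf_type"))
```

Pattern matching over triples of this kind is the basic operation that RDF query languages build on.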

Page 67:

The RDF Data Model

• An RDF document is an unordered collection of statements, each with a subject, predicate and object (aka triples)
• A triple can be thought of as a labelled arc in a graph
• Statements describe properties of web resources
• A resource is any object that can be pointed to by a URI:
  – a document, a picture, a paragraph on the Web, …
    • e.g., http://umbc.edu/~ypeng/F07671.html
  – a book in the library, a real person (?)
    • e.g., isbn://5031-4444-3333
  – …
• Properties themselves are also resources (URIs)
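Two points from this list, "unordered collection" and "properties are also resources", can be illustrated with Python sets. The page and ISBN identifiers are the ones on the slide. The `example.org` property URIs and the statements themselves are hypothetical, invented only for this sketch.

```python
PAGE = "http://umbc.edu/~ypeng/F07671.html"   # from the slide
BOOK = "isbn://5031-4444-3333"                # from the slide
CITES = "http://example.org/vocab#cites"      # hypothetical property URI
LABEL = "http://example.org/vocab#label"      # hypothetical property URI

# The same statements written in two different orders.
doc_one = {
    (PAGE, CITES, BOOK),
    (CITES, LABEL, "cites"),   # here the property CITES is itself the subject
}
doc_two = {
    (CITES, LABEL, "cites"),
    (PAGE, CITES, BOOK),
}


def properties_described(statements):
    """Predicates that also occur as the subject of some statement."""
    predicates = {p for _, p, _ in statements}
    subjects = {s for s, _, _ in statements}
    return predicates & subjects


print(doc_one == doc_two)
print(properties_described(doc_one))
```

Because a property is named by a URI like any other resource, statements can be made about it. RDF Schema relies on exactly this.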

Page 68:

Page 69:

RDF without a Schema

• Object -> Attribute -> Value triples
• objects are web-resources
• Value is again an Object:
  – triples can be linked
  – data-model = graph

[Figure: pers05 –Author-of→ ISBN... –Publ-by→ MIT]
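Because a value can itself be an object, triples chain into paths. The sketch below follows such a path through the two arcs of the figure. `follow` is an illustrative helper name, and "ISBN..." is kept exactly as abbreviated on the slide: it is a placeholder, not a real identifier.

```python
# The two linked triples from the figure. "ISBN..." is the slide's own
# abbreviation and stands in for an unspecified book identifier.
TRIPLES = [
    ("pers05", "Author-of", "ISBN..."),
    ("ISBN...", "Publ-by", "MIT"),
]


def follow(triples, start, path):
    """Follow a sequence of attributes from a start node; return the end nodes."""
    frontier = {start}
    for attribute in path:
        frontier = {o for s, p, o in triples if s in frontier and p == attribute}
    return frontier


# Who published the things pers05 is an author of?
print(follow(TRIPLES, "pers05", ["Author-of", "Publ-by"]))
```

No schema is needed for this traversal. It also means nothing stops a typo such as "Publ-By" from silently returning an empty result.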

Page 70:

What does RDF Schema add?

• Defines vocabulary for RDF
• Organizes this vocabulary in a typed hierarchy
  • Class, subClassOf, type
  • Property, subPropertyOf
  • domain, range

[Figure: class hierarchy with Author and Reader each subClassOf Person; instances Frank and Lynda attached by type; property communicatesTo, with a domain and a range, linking the two instances]
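What the schema vocabulary buys can be shown with a toy inference step. The names below come from the figure, but the extracted text does not say which instance belongs to which class. The assignment used here (communicatesTo has domain Author and range Reader, and Frank communicatesTo Lynda) is an assumption for illustration. The code is a sketch of the subClassOf, domain and range rules, not a complete RDFS reasoner.

```python
SCHEMA = [
    ("Author", "subClassOf", "Person"),
    ("Reader", "subClassOf", "Person"),
    ("communicatesTo", "domain", "Author"),   # assumed direction
    ("communicatesTo", "range", "Reader"),    # assumed direction
]

# Assumed instance data: a single statement, with no explicit types.
DATA = [("Frank", "communicatesTo", "Lynda")]


def superclasses(cls, schema):
    """cls plus every class reachable from it through subClassOf."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in schema:
            if p == "subClassOf" and s == current and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


def inferred_types(data, schema):
    """Derive (instance, class) pairs from type, domain and range statements."""
    types = set()
    for s, p, o in data:
        if p == "type":
            types.update((s, c) for c in superclasses(o, schema))
        for prop, rule, cls in schema:
            if prop != p:
                continue
            if rule == "domain":
                types.update((s, c) for c in superclasses(cls, schema))
            elif rule == "range":
                types.update((o, c) for c in superclasses(cls, schema))
    return types


print(sorted(inferred_types(DATA, SCHEMA)))
```

From one untyped statement the schema lets a processor conclude that Frank is an Author, Lynda is a Reader, and both are Persons.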

Page 71:

Which Semantic Web?

• Version 1: "Semantic Web as Web of Data" (TBL)
• recipe: expose databases on the web, use XML, RDF, integrate
• metadata from:
  – expressing DB schema semantics in machine interpretable ways
• enable integration and unexpected re-use

Page 72:

Which Semantic Web?

• Version 2: “Enrichment of the current Web”
• recipe: Annotate, classify, index
• metadata from:
  – automatically producing markup: named-entity recognition, concept extraction, tagging, etc.
• enable personalization, search, browse, ...
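"Automatically producing markup" can be pictured with a deliberately tiny annotator. The lookup table below is a hypothetical gazetteer made up for this sketch. Real named-entity recognition relies on trained models rather than a fixed list, so this only shows the shape of the input and output.

```python
import re

# Hypothetical gazetteer: surface string -> entity type.
GAZETTEER = {
    "Netherlands": "Country",
    "Amsterdam": "City",
    "MIT": "Organization",
}

# One alternation over all known names, matched on word boundaries.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, GAZETTEER)) + r")\b")


def annotate(text):
    """Return (surface, type) for each gazetteer hit, in text order."""
    return [(m.group(1), GAZETTEER[m.group(1)]) for m in PATTERN.finditer(text)]


def to_markup(text):
    """Wrap each hit in an XML-like tag named after its entity type."""
    def wrap(match):
        kind = GAZETTEER[match.group(1)]
        return f"<{kind}>{match.group(1)}</{kind}>"
    return PATTERN.sub(wrap, text)


sentence = "Amsterdam is the capital of the Netherlands."
print(annotate(sentence))
print(to_markup(sentence))
```

The resulting tags are metadata that nobody wrote by hand, which is the point of this second recipe.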

Page 73:

Which Semantic Web?

• Version 1: “Semantic Web as Web of Data”
• Version 2: “Enrichment of the current Web”

Different use-cases
Different techniques
Different users

Page 74:

Four popular fallacies about the Semantic Web

Semantic Web research

Page 75:

First: clear up some popular misunderstandings

False statement No :

“Semantic Web people try to enforce meaning from the top”

They only “enforce” a language. They don’t enforce what is said in that language.

Compare: HTML is “enforced” from the top, but content is entirely free.

Page 76:

First: clear up some popular misunderstandings

False statement No :

“The Semantic Web people will require everybody to subscribe to a single predefined "meaning" for the terms we use.”

Of course, meaning is fluid, contextual, etc.

Lots of work on (semi-)automatically bridging between different vocabularies.

Page 77:

First: clear up some popular misunderstandings

False statement No :

“The Semantic Web will require users to understand the complicated details of formalised knowledge representation.”

All of this is “under the hood”.

Page 78:

First: clear up some popular misunderstandings

False statement No :

“The Semantic Web people will require us to manually markup all the existing web-pages.”

Lots of work on automatically producing semantic markup: named-entity recognition, concept extraction, etc.

Page 79:

The current state of Semantic Web

Semantic Web research

Page 80:

4 hard questions about the Semantic Web:

Q1: “where does the meta-data come from?”
  – NL technology is delivering on concept-extraction
  – Socially emerging (learning from tagging)
Q2: “where do the meta-data-schema come from?”
  – many handcrafted schema
  – hierarchy learning remains hard
  – relation extraction remains hard
Q3: “what to do with many meta-data schema?”
  – ontology mapping/aligning remains VERY hard
Q4: “where’s the ‘Web’ in the Semantic Web?”
  – more attention to social aspects (P2P, FOAF)
  – non-textual media remains hard
  – deal with typical Web requirements

Page 81:

Advanced Search

Metadata and semantic web will make advanced search much easier

• Growth of web metadata
• Folksonomies!
• Tools that automatically generate metadata

TREC 2008

Page 82:

Search for Web 3.0

• Natural language queries

• Search agent (avatar) understands and anticipates your needs

• Personal life search with avatar

Page 83:

The Evolving Web

[Figure: timeline from DOCUMENTS to DATA/PROGRAMS, leading toward a Web of Knowledge]
– 1990: HyperText Markup Language, HyperText Transfer Protocol (Foundation of the Current Web)
– 2000: Resource Description Framework, eXtensible Markup Language (Self-Describing Documents)
– 2010: Proof, Logic and Ontology Languages (Shared terms/terminology, Machine-Machine communication)

Berners-Lee, Hendler; Nature, 2001

Page 84:

Web Semantics

Semantic Web Layer Cake (Berners-Lee, 99; Swartz-Hendler, 2001)

Page 85:

Semantic Web ca. 2008

• Semantic Web companies starting & growing
  – Siderean, SandPiper, SiberLogic, Ontology Works, Intellidimension, Intellisophic, TopQuadrant, Data Grid, Mondeca, ontoPrise …
• Web 3.0 new buzzword: Garlik, Metaweb, RadarNetworks, Joost, Talis, …
• Semantic Search: Powerset, CK Lingo, Curbside MD, ZoomInfo, …
• Bigger players buying in
  – Adobe, Cisco, Dow Jones, HP, IBM, Eli Lilly, Microsoft™, Nokia, Oracle, Pfizer, Sun, Vodafone, Yahoo!, Reuters, …
• Gartner identifies Corporate Semantic Web as one of three "High impact" Web technologies
• Tool market forming: AllegroGraph, Altova, TopBraid, …
• Government projects in and across agencies
  – US, UK, EU, Japan, Korea, China, India …
• Several "verticals" heavily using Semantic Web technologies
  – Health Care and Life Sciences (Interest Group at W3C)
  – Financial services
  – Human Resources
  – Sciences other than Life Science (Virtual observatory, Geo ontology, …)
• Many open source tools available
  – Kowari, RDFLib, Jena, Sesame, Protégé, SWOOP, Pellet, Onto(xxx), Wilbur, …

(Jim Hendler - internal talk, Microsoft Labs, July 2008)

Semantic Web 2008 - ?

Page 86:

Web 4.0 :-?)

Page 87:

The next 5000 days of the web

• Kevin Kelly
  – Founder of WIRED magazine
  – Video

Page 88:

Web 4.0

• Machines talk back!

Page 89:

Search for Web 4.0

• We get real help when we search!

Terminator: the Sarah Connor Chronicles

Cameron’s on our side!

Page 90:

What we covered

• The web of data
  – XML, RDF, others
• Web 2.0
  – The social web
• Web 3.0
  – The semantic web

• Future of the web