Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB,...

74
Linked and graph data introduction & lecture 1: what’s the big deal about big data? George Fletcher Eindhoven University of Technology The Netherlands Short course in the International Doctoral School in ICT University of Modena and Reggio Emilia, Italy 22 October 2013

Transcript of Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB,...

Page 1: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Linked and graph dataintroduction & lecture 1: what’s the big deal about big data?

George Fletcher

Eindhoven University of TechnologyThe Netherlands

Short course in the International Doctoral School in ICTUniversity of Modena and Reggio Emilia, Italy

22 October 2013

Page 2: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

introduction

Page 3: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Goals of this course

Participants will

1. acquire an overview of recent research developments in themanagement of “big” data on the web (i.e., linked (open) andgraph data);

2. gain insight into challenges posed for querying in this contextand current solutions in indexing and query processing forovercoming these challenges; and,

3. develop a perspective on interesting research directions in webdata management.

Page 4: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Syllabus of the course: Tuesday, 22 October

Introduction and overview of linked and graph data, and discussionof related research in the big data field.

lecture 1. what’s the big deal about big data? (10:00-12:00)

I big data ∪ web data

I big data ∩ web data

I our focus: modeling and querying linked and graph data

lecture 2. models for web data (14:00-16:00)

I RDF and RDFS data model

I Updates on RDFS graphs

Page 5: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Syllabus of the course: Wednesday, 23 October

Foundations of query processing over graph and linked data.

lecture 3. querying graph data (11:00-13:00)

I graphs and the macro-structure of the web

I graph database models

I query languages for graphs

lecture 4. querying web data (14:30-16:30)

I SPARQL 1.1

I RDF & SPARQL exercises

Page 6: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Syllabus of the course: Thursday, 24 October

Engineering solutions for efficient query processing over graph andlinked data.

lecture 5. querying web data, cont. (11:00-13:00)

I models for linked data

I querying linked data

lecture 6. indexing web data and query processing (14:30-16:30)

I value-based indexing of web data

I structural indexing of web data

Page 7: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Syllabus of the course: Tuesday, 29 October

Advanced topics and concluding discussion; final exam.

lecture 7. uncertain graphs & wrap up (11:00-13:00)

I uncertain graph data & query model

I annotated graphs

I recap of lectures & open discussion

final session. written exam (14:30-16:30)

Page 8: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Syllabus of the course

course contact

I email: g.h.l.“my family name”@tue.nl

I homepage: google me (george fletcher eindhoven)

I course homepage: found on my “teaching” pagehttp://wwwis.win.tue.nl/~gfletcher/modena/

course materials

I Online links to all background and lecture material will beprovided as the course progresses.

I There are no specific prerequisites for the course participants,other than a general Computer Science background. All othernecessary details and concepts will be introduced as needed ineach lecture.

Page 9: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Who are we?

memy background

I Assistant professor in the Web Engineering group at TUEindhoven

I PhD in Computer Science from Indiana University,Bloomington, USA

I Bachelor’s degrees in Mathematics and Cognitive Sciencefrom the Univ. of North Florida, USA

my research

I in general: data management systems

I PhD thesis in query learning for data integration

I current research focus on the theory, engineering, andapplications of query languages for graph and web data

and you?

Page 10: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Who are we?

memy background

I Assistant professor in the Web Engineering group at TUEindhoven

I PhD in Computer Science from Indiana University,Bloomington, USA

I Bachelor’s degrees in Mathematics and Cognitive Sciencefrom the Univ. of North Florida, USA

my research

I in general: data management systems

I PhD thesis in query learning for data integration

I current research focus on the theory, engineering, andapplications of query languages for graph and web data

and you?

Page 11: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Thank you

My warmest thanks to:

I dr. Federica Mandreoli,

I The International Doctoral School in ICT,

I The University of Modena and Reggio Emilia, and

I all of the participants in this course!

Page 12: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Five minute paper

What are the top two or three things you would like to take awayfrom this course?

Page 13: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

lecture 1

what’s the big deal about big data?

Page 14: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Overview of this lecture

I What does “big” data mean?

I What does “web” data mean?

I Refresher on first order logic

Page 15: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data

Our ability to generate and capture data only continues to explode

Indeed, we live in a world that is

I increasingly driven by this generation and consumption ofmassive ever-growing data sets (leading to an emerging“second economy”),

I which in turn is driven by both technology and a rapidlyincreasing “datafication” of all aspects of our public andprivate lives, and well, of most everything.

Managing this Big Data is now central to commerce,entertainment, government, education, ...

Page 16: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the hype

Funny Dilbert comic strip removed. Seehttp://dilbert.com/strips/comic/2012-07-29/

Page 17: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the hype

Of course, the data management community has been intensivelystudying data for decades

I the prestigious Very Large Data Bases (VLDB) conferencehad its first edition in 1975

I “big data” < “very large data”?

(But of course, the symbiosis of technological trends anddatafication does indeed raise many novel and deep challenges)

Page 18: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the hype

Of course, the data management community has been intensivelystudying data for decades

I the prestigious Very Large Data Bases (VLDB) conferencehad its first edition in 1975

I “big data” < “very large data”?

(But of course, the symbiosis of technological trends anddatafication does indeed raise many novel and deep challenges)

Page 19: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the hype

Recently, a critical mass of awareness of the ubiquity and theenormous potential value of data has been reached, as articulatedin two landmark essays:

I The end of theory (2008)

I The unreasonable effectiveness of data (2009)

This led to an explosion of interest and energy cutting across thescientific and business worlds, e.g.,

I journal of Science, special issue on “Dealing with Data”(2011)

I Harvard Business Review, special issue on “Spotlight on BigData” (2012)

I Big Data: The Management RevolutionI Data Scientist: The Sexiest Job of the 21st Century

I National Research Council (USA), special report on “Frontiersin massive data analysis” (2013)

Page 20: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the hype

Recently, a critical mass of awareness of the ubiquity and theenormous potential value of data has been reached, as articulatedin two landmark essays:

I The end of theory (2008)

I The unreasonable effectiveness of data (2009)

This led to an explosion of interest and energy cutting across thescientific and business worlds, e.g.,

I journal of Science, special issue on “Dealing with Data”(2011)

I Harvard Business Review, special issue on “Spotlight on BigData” (2012)

I Big Data: The Management RevolutionI Data Scientist: The Sexiest Job of the 21st Century

I National Research Council (USA), special report on “Frontiersin massive data analysis” (2013)

Page 21: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the reality

Big Data is now experiencing some backlash, tempering andnuancing this enthusiasm

I e.g., On being a data skeptic (2013)I “Observation always involves theory.” – Edwin Hubble

There is, of course, still much truth in all of the enthusiasm aboutBig Data

Due to datafication and technology trends, applications now mustface data scale, heterogeneity, and distribution as the norm, ratherthan as exceptions, as in the past

I often captured as the four V’s of “volume”, “variety”,“veracity”, and “velocity”

Furthermore, there is indeed increased potential for extractingsignificant untapped value from data

I e.g., the successes of Google and Facebook are essentiallybuilt on extracting value from data

Page 22: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the reality

Big Data is now experiencing some backlash, tempering andnuancing this enthusiasm

I e.g., On being a data skeptic (2013)I “Observation always involves theory.” – Edwin Hubble

There is, of course, still much truth in all of the enthusiasm aboutBig Data

Due to datafication and technology trends, applications now mustface data scale, heterogeneity, and distribution as the norm, ratherthan as exceptions, as in the past

I often captured as the four V’s of “volume”, “variety”,“veracity”, and “velocity”

Furthermore, there is indeed increased potential for extractingsignificant untapped value from data

I e.g., the successes of Google and Facebook are essentiallybuilt on extracting value from data

Page 23: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the reality

Big Data is now experiencing some backlash, tempering andnuancing this enthusiasm

I e.g., On being a data skeptic (2013)I “Observation always involves theory.” – Edwin Hubble

There is, of course, still much truth in all of the enthusiasm aboutBig Data

Due to datafication and technology trends, applications now mustface data scale, heterogeneity, and distribution as the norm, ratherthan as exceptions, as in the past

I often captured as the four V’s of “volume”, “variety”,“veracity”, and “velocity”

Furthermore, there is indeed increased potential for extractingsignificant untapped value from data

I e.g., the successes of Google and Facebook are essentiallybuilt on extracting value from data

Page 24: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: the reality

However, there is increased difficulty in extracting “value” (a fifthV), due to the other four V’s

In data management, embracing and confronting these challengeshas led to an explosion of NoSQL systems, as alternatives to thedominant paradigm of structured data (i.e., “SQL” stores)

Page 25: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: NoSQL systems

Serious relaxations in “traditional” data management assumptionsinclude

I uncertain dataI structural heterogeneity

I semi-structured, denormalized dataI query paradigms for such loosely structured data

I looser data consistency

I semantic heterogeneity

NoSQL systems typically embrace one or more of these

I Note, though, that many of these relaxations predate NoSQLand even “SQL” systems (e.g., CODASYL and the networkdata model) ...

Page 26: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: NoSQL systems

Serious relaxations in “traditional” data management assumptionsinclude

I uncertain dataI structural heterogeneity

I semi-structured, denormalized dataI query paradigms for such loosely structured data

I looser data consistency

I semantic heterogeneity

NoSQL systems typically embrace one or more of these

I Note, though, that many of these relaxations predate NoSQLand even “SQL” systems (e.g., CODASYL and the networkdata model) ...

Page 27: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: NoSQL systems

Let’s consider data structural heterogeneity, driven by the need forso-called “polyglot persistence”

Much modern data doesn’t cleanly map to a tight relationalstructure

I missing or multi-valued attributesI e.g., not all site visitors have exactly one phone number

I nested/hierarchical structureI ontologies, folksonomiesI JSON, XML

I loose structureI social networksI chem-/bio- networks

Page 28: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: NoSQL systems

Four broad classes of approaches taken by “NoSQL” systems:

1. key-value databases (essentially scalable hash tables)I example systems: BerkelyDB, Tokyo Cabinet

2. document databases (generalization of key-value model toinclude nested structure, e.g., JSON and XML)

I example systems: MongoDB, CouchDB, Couchbase

3. column-family stores (hybrid of key-value and relationalmodel, where rows can have different schemas)

I example systems: Cassandra, Amazon SimpleDB

4. graph databasesI example systems: Neo4j, IBM DB2 RDF GraphStore

Document and graph stores are firmly rooted in applications arisingin web data management ...

Page 29: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Big Data: NoSQL systems

Four broad classes of approaches taken by “NoSQL” systems:

1. key-value databases (essentially scalable hash tables)I example systems: BerkelyDB, Tokyo Cabinet

2. document databases (generalization of key-value model toinclude nested structure, e.g., JSON and XML)

I example systems: MongoDB, CouchDB, Couchbase

3. column-family stores (hybrid of key-value and relationalmodel, where rows can have different schemas)

I example systems: Cassandra, Amazon SimpleDB

4. graph databasesI example systems: Neo4j, IBM DB2 RDF GraphStore

Document and graph stores are firmly rooted in applications arisingin web data management ...

Page 30: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Web Data

The explosion of the Web was both a precursor to and accelerantfor the rise of Big Data.

Historically, web data has been modeled as

I hypertext and SGML

I text and HTML

I XML and JSON

I ...

Primarily focusing on the metaphor of the “document” ...

But early “web” scientists had even broader visions for humanknowledge ...

Page 31: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Web Data

The explosion of the Web was both a precursor to and accelerantfor the rise of Big Data.

Historically, web data has been modeled as

I hypertext and SGML

I text and HTML

I XML and JSON

I ...

Primarily focusing on the metaphor of the “document” ...

But early “web” scientists had even broader visions for humanknowledge ...

Page 32: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Web DataLeJivre etla representation, dumonde

Zintelhqenr.il a-ecitlivpc pouple inindu exterifiur

Le lirlrae (Ju Jrvrn fait nai'treJans iLiuti'f.n inttiUigences la

dummde etlea creatjofis

Moyens divers de communicaUanavee le mon.de

dans la nature(Objels reols)

I) La pTiotanra.ji'ncu b t n '

I'igure 2

J," u n i v p . r s . T i n t e l l i g e n c e . l a s c i e n c e , l e l i v r e

Les choaes

s .laEealitc.le Cosmos

Les .intelligences

i ponsentlcv choses Fraom

La scienceRemnt et coordonne en ses cadres lespinse.tnde iniUas lesmt,elliaences parttculieres

Les LivresTranscnvent et photographisnt l&science,selon lordre divigd des connais&aiwes\-» r.ollnction dc li»ret pDrdCnL \a Bibl iotheqn'

La Bihlio.qraphieJnventone et catalooue lesla reunion de notices Bibl'ogpapliiquGS •o'""'-.

I t rrpertoire B'blpoq^aphique univcrsel

•L'Encyclopedie. ( T c x t e e t l m s g e )Douier Attas MicfoFilm

Concentre, C/SE.TE et noordnnne lezonlenu des hvrr,$

o

La Classification.Conforms alordre cfuel'intelliaefice,

deccuvre dsnslee clwses. sect a lSiois a.Jordqnnance ifc Ja science rfwJjvre* de

p?ife ri de 1'Encydopsdie

0

i t

r

—*—T—7—hH

—d!-—«

7

n

3

L'Umvers, 1'Intelligcnce, \s Science, le iivre

The Mundaneumas conceived by the Belgian author and peace activist Paul Otlet (1868-1944)

Page 33: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Web Data: linked data

This vision has resurfaced recently in the Linked Data initiative,which is essentially a vision for modeling and sharing graph data onthe web, using web standards

Linked data principles:

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names

3. When someone looks up a URI, provide useful information,using the standards (RDF, SPARQL)

4. Include links to other URIs so that they can discover morethings

e.g., BBC, NYTimes, Wikipedia, Uniprot, PubMed, DBLP, ...

Linked open data: public sector data as linked data

I e.g, dati.gov.it, data.gov.uk, data.gouv.fr, data.overheid.nl,data.gov, data.eu, datameti.go.jp, ...

Page 34: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Web Data: linked data

This vision has resurfaced recently in the Linked Data initiative,which is essentially a vision for modeling and sharing graph data onthe web, using web standards

Linked data principles:

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names

3. When someone looks up a URI, provide useful information,using the standards (RDF, SPARQL)

4. Include links to other URIs so that they can discover morethings

e.g., BBC, NYTimes, Wikipedia, Uniprot, PubMed, DBLP, ...

Linked open data: public sector data as linked data

I e.g, dati.gov.it, data.gov.uk, data.gouv.fr, data.overheid.nl,data.gov, data.eu, datameti.go.jp, ...

Page 35: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Web Data: linked data

This vision has resurfaced recently in the Linked Data initiative,which is essentially a vision for modeling and sharing graph data onthe web, using web standards

Linked data principles:

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names

3. When someone looks up a URI, provide useful information,using the standards (RDF, SPARQL)

4. Include links to other URIs so that they can discover morethings

e.g., BBC, NYTimes, Wikipedia, Uniprot, PubMed, DBLP, ...

Linked open data: public sector data as linked data

I e.g, dati.gov.it, data.gov.uk, data.gouv.fr, data.overheid.nl,data.gov, data.eu, datameti.go.jp, ...

Page 36: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Our focus

In this course we study the intersection of big data and web data,namely linked data and graph data

Our focus will be on linked and graph data management (and notinformation retrieval, although the distinction is increasingly fuzzy)

I i.e., scaleably implementing variations of first order logic andtheir application

Specifically, we will investigate the challenges of modeling andquerying linked and graph data

(Equally) important issues not touched upon in this course include:data integration, security, privacy, user interfaces, data integrityand consistency, ...

Page 37: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

Our focus

In this course we study the intersection of big data and web data,namely linked data and graph data

Our focus will be on linked and graph data management (and notinformation retrieval, although the distinction is increasingly fuzzy)

I i.e., scaleably implementing variations of first order logic andtheir application

Specifically, we will investigate the challenges of modeling andquerying linked and graph data

(Equally) important issues not touched upon in this course include:data integration, security, privacy, user interfaces, data integrityand consistency, ...

Page 38: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

a quick refresher on first order logic

Page 39: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

First order logic: ∃,∀,¬,∧,∨,R(x), x = y , x = a

Most of the query languages we will consider are fragments orextensions of first order logic (FO)

that is, the “queries” (i.e., computable mappings from databaseinstances to database instances) expressible in these languages areequivalently expressible as formulas in FO

we next review some basics of FO

Page 40: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: data

Fix some universe U of atomic values.

I A relation schema consists of a name R and finite set ofattribute names attributes(R) = {A1, . . . ,Ak}. The arity of Ris arity(R) = k.

I A fact over relation schema R of arity k is a term of the formR(a1, . . . , ak), where a1, . . . , ak ∈ U.

I alternatively, a tuple over R is a function from attributes(R) toU.

I An instance of relation schema R is a finite set of facts/tuplesover R.

I A database schema (aka “signature”) is a finite set D ofrelation schemas.

I An instance of database schema D is a set of relationinstances, one for each R ∈ D.

Page 41: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: data

Fix some universe U of atomic values.

I A relation schema consists of a name R and finite set ofattribute names attributes(R) = {A1, . . . ,Ak}. The arity of Ris arity(R) = k.

I A fact over relation schema R of arity k is a term of the formR(a1, . . . , ak), where a1, . . . , ak ∈ U.

I alternatively, a tuple over R is a function from attributes(R) toU.

I An instance of relation schema R is a finite set of facts/tuplesover R.

I A database schema (aka “signature”) is a finite set D ofrelation schemas.

I An instance of database schema D is a set of relationinstances, one for each R ∈ D.

Page 42: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

Fix some database schema D.

SQL

SELECT A1,...,Ak

FROM S1,...,Sm

WHERE Cond

where S1, . . . ,Sm ∈ D, Cond is a well-formed selection conditionover A, and A1, . . . ,Ak ∈ A, for A =

⋃S∈{S1,...,Sm} attributes(S).

Relational Algebra (RA)

πA1,...,Ak(σCond(S1 × · · · × Sm))

where S1, . . . ,Sm ∈ D, Cond is a well-formed selection conditionover A, and A1, . . . ,Ak ∈ A, for A =

⋃S∈{S1,...,Sm} attributes(S).

Page 43: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

Fix some database schema D.

SQL

SELECT A1,...,Ak

FROM S1,...,Sm

WHERE Cond

where S1, . . . ,Sm ∈ D, Cond is a well-formed selection conditionover A, and A1, . . . ,Ak ∈ A, for A =

⋃S∈{S1,...,Sm} attributes(S).

Relational Algebra (RA)

πA1,...,Ak(σCond(S1 × · · · × Sm))

where S1, . . . ,Sm ∈ D, Cond is a well-formed selection conditionover A, and A1, . . . ,Ak ∈ A, for A =

⋃S∈{S1,...,Sm} attributes(S).

Page 44: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

Datalog

result(A1, . . . ,Ak)← S1(A1), . . . ,Sm(Am),C1, . . . ,Cj

where S1, . . . ,Sm ∈ D, Ai is a list of variables of length arity(Si )(for 1 ≤ i ≤ m), Ci is a well-formed selection condition over A (for1 ≤ i ≤ j), and A1, . . . ,Ak ∈ A, for the set A of all variablesappearing in A1, . . . ,Am.

Page 45: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

List the titles of all books written by Italian speakers.

Page 46: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

List the titles of all books written by Italian speakers.

Page 47: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

In SQL

SELECT book.title

FROM author, book

WHERE book.authorID = author.authorID

AND

author.language = ‘Italian’

Page 48: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

In RAπbook.title(σC (author × book))

where C is

book.authorID = author .authorID ∧ author .language = ‘Italian’

Page 49: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

In Datalog

result(T )← author(A,N,D, LA), book(B,T ,A,P, LB,Y ),

LA = ‘Italian’

Page 50: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

List the names of all authors writing in Italian or born in1985.

Page 51: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

UNION

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author)) ∪ πname(σbirthdate=1985(author))

result(N) ← author(A,N,B, L), L = ‘Italian’

result(N) ← author(A,N,B, L),B = 1985

Page 52: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

UNION

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author)) ∪ πname(σbirthdate=1985(author))

result(N) ← author(A,N,B, L), L = ‘Italian’

result(N) ← author(A,N,B, L),B = 1985

Page 53: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

UNION

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author)) ∪ πname(σbirthdate=1985(author))

result(N) ← author(A,N,B, L), L = ‘Italian’

result(N) ← author(A,N,B, L),B = 1985

Page 54: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

List the names of all authors writing in Italian and bornin 1985.

Page 55: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

INTERSECT

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author)) ∩ πname(σbirthdate=1985(author))

I (N) ← author(A,N,B, L), L = ‘Italian’

B(N) ← author(A,N,B, L),B = 1985

result(N) ← I (N),B(N)

Page 56: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

INTERSECT

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author)) ∩ πname(σbirthdate=1985(author))

I (N) ← author(A,N,B, L), L = ‘Italian’

B(N) ← author(A,N,B, L),B = 1985

result(N) ← I (N),B(N)

Page 57: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

INTERSECT

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author)) ∩ πname(σbirthdate=1985(author))

I (N) ← author(A,N,B, L), L = ‘Italian’

B(N) ← author(A,N,B, L),B = 1985

result(N) ← I (N),B(N)

Page 58: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

List the names of all authors writing in Italian not born in1985.

Page 59: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

EXCEPT

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author))− πname(σbirthdate=1985(author))

I (N) ← author(A,N,B, L), L = ‘Italian’

B(N) ← author(A,N,B, L),B = 1985

result(N) ← I (N), not B(N)

Page 60: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

EXCEPT

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author))− πname(σbirthdate=1985(author))

I (N) ← author(A,N,B, L), L = ‘Italian’

B(N) ← author(A,N,B, L),B = 1985

result(N) ← I (N), not B(N)

Page 61: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

SELECT A.Name

FROM Author A

WHERE A.Language = ‘Italian’

EXCEPT

SELECT A.Name

FROM Author A

WHERE A.Birthdate = 1985

πname(σlanguage=‘I...’(author))− πname(σbirthdate=1985(author))

I (N) ← author(A,N,B, L), L = ‘Italian’

B(N) ← author(A,N,B, L),B = 1985

result(N) ← I (N), not B(N)

Page 62: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

List the IDs of all books which are sold in every store.

Page 63: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

In RA

πbookID(book)− πbookID ((πstoreID(store)× πbookID(book))− sells)

Page 64: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

In SQL

SELECT B.bookID

FROM book B

WHERE NOT EXISTS

(SELECT bookID, storeID

FROM book, store

WHERE bookID=B.bookID

EXCEPT

sells)

Page 65: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

In Datalog

all(S ,B) ← book(B,T ,A,P, L,Y ), store(S ,Ad ,Ph)

missing(B) ← all(S ,B), not sells(S ,B)

result(B) ← book(B,T ,A,P, L,Y ), not missing(B)

Page 66: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: syntax

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

in TRC, the Tuple Relational Calculus (i.e., essentially straight FO)

{t | ∃b ∈ book(t.bookID = b.bookID∧∀s ∈ store∃` ∈ sell(`.storeID = s.storeID

∧ `.bookID = b.bookID))}

Page 67: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: complexity of query evaluation

Fact. SQL (without aggregation and other bells-and-whistles), RA,TRC, and (safe non-recursive) Datalog are all equivalent inexpressive power.

LetI FO denote the full TRC

I i.e., any of the above languages

I Conj denote the TRC using only ∃ and ∧I i.e., the “conjunctive” queriesI corresponds in SQL to basic SELECT-FROM-WHERE blocksI corresponds to the {σ, π,×} fragment of RAI corresponds to single positive datalog rules

I AConj denote the “acylic” conjunctive queriesI queries with join treesI full def in a later lecture

Page 68: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: complexity of query evaluation

Fact. SQL (without aggregation and other bells-and-whistles), RA,TRC, and (safe non-recursive) Datalog are all equivalent inexpressive power.

LetI FO denote the full TRC

I i.e., any of the above languages

I Conj denote the TRC using only ∃ and ∧I i.e., the “conjunctive” queriesI corresponds in SQL to basic SELECT-FROM-WHERE blocksI corresponds to the {σ, π,×} fragment of RAI corresponds to single positive datalog rules

I AConj denote the “acylic” conjunctive queriesI queries with join treesI full def in a later lecture

Page 69: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: complexity of query evaluation

Fact. SQL (without aggregation and other bells-and-whistles), RA,TRC, and (safe non-recursive) Datalog are all equivalent inexpressive power.

LetI FO denote the full TRC

I i.e., any of the above languages

I Conj denote the TRC using only ∃ and ∧I i.e., the “conjunctive” queriesI corresponds in SQL to basic SELECT-FROM-WHERE blocksI corresponds to the {σ, π,×} fragment of RAI corresponds to single positive datalog rules

I AConj denote the “acylic” conjunctive queriesI queries with join treesI full def in a later lecture

Page 70: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: complexity of query evaluation

Fact. SQL (without aggregation and other bells-and-whistles), RA,TRC, and (safe non-recursive) Datalog are all equivalent inexpressive power.

LetI FO denote the full TRC

I i.e., any of the above languages

I Conj denote the TRC using only ∃ and ∧I i.e., the “conjunctive” queriesI corresponds in SQL to basic SELECT-FROM-WHERE blocksI corresponds to the {σ, π,×} fragment of RAI corresponds to single positive datalog rules

I AConj denote the “acylic” conjunctive queriesI queries with join treesI full def in a later lecture

Page 71: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: complexity of query evaluation

Fact. The complexity of query evaluation is as follows

FO Conj AConj

combined PSPACE-complete NP-complete LOGCFL-completedata Logspace Logspace Linear time

where combined means in the size of the query and the databaseand data means in the size of the database (i.e., for some fixedquery).

Page 72: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

FO review: exercise

author(authorID, name, birthdate, language)book(bookID, title, authorID, publisher, language, year)store(storeID, address, phone)sells(storeID, bookID)

In Datalog, express the following.

1. Retrieve the IDs of all books sold by stores in Modena.

2. Retrieve the IDs of all books written by an Italian speaker.

3. Retrieve the IDs of all books published in a language differentfrom that of the book’s author.

4. Retrieve the IDs of all authors who have books published inevery language (known in the database).

Page 73: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

References & Credits, 1/2

There is no “required” reading for this lecture. The following arejust recommended background reading and credits.

I The end of theory: the data deluge makes the scientific method obsolete.Chris Anderson. Wired Magazine 16(7), 2008.http://www.wired.com/science/discoveries/magazine/16-07/pb_theory

I The second economy. W. Brian Arthur. McKinsey Quarterly, 2011.http://www.mckinsey.com/insights/strategy/the_second_economy

I The rise of big data. Kenneth Cukier and Viktor Mayer-Schoenberger.Foreign Affairs, May/June 2013.http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data

I The unreasonable effectiveness of data. Alon Halevy, Peter Norvig, andFernando Pereira. IEEE Intelligent Systems, March/April 2009.http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//pubs/archive/35179.pdf

Page 74: Linked and graph datagfletcher/modena/fletcher-modena1.pdf · I example systems: MongoDB, CouchDB, Couchbase 3. column-family stores (hybrid of key-value and relational model, where

References & Credits, 2/2

I Towards a Big Data reference architecture. Markus Maier. MSc thesis,Eindhoven University of Technology, 2013.http://www.win.tue.nl/~gfletche/Maier_MSc_thesis.pdf

I Frontiers in massive data analysis. National Research Council of theNational Academies. Washington, D.C., 2013.http://www.nap.edu/catalog.php?record_id=18374 (register for free PDF)

I On being a data skeptic. Cathy O’Neil. O’Reilly Media, 2013.

http://cdn.oreillystatic.com/oreilly/radarreport/0636920032328/On_Being_a_Data_Skeptic.pdf

I NoSQL distilled: a brief guide to the emerging world of polyglotpersistence. Pramod J. Sadalage and Martin Fowler. Addison-Wesley,2013. http://martinfowler.com/nosql.html