EDBT 2015: Summer School Overview

EDBT Summer School 2015 Badenes, Carlos Garijo, Daniel Priyatna, Freddy Palamos, Spain 31/8 - 4/9 2015

Transcript of EDBT 2015: Summer School Overview

Page 1: EDBT 2015: Summer School Overview

EDBT Summer School 2015

Badenes, Carlos
Garijo, Daniel
Priyatna, Freddy

Palamos, Spain
31/8 - 4/9 2015

Page 2: EDBT 2015: Summer School Overview

The Venue


(where we thought we would be) (where we actually were)

Page 3: EDBT 2015: Summer School Overview

Overview

Graph Data Management

Part I: Theoretical
- Notes about lectures

Part II: Practical
- Sparksee Technology
- Challenges

Page 4: EDBT 2015: Summer School Overview

Part I: Theoretical

๏ Large Scale Graph Processing System - (C. Badenes)
  Sherif Sakr - National ICT Australia

๏ Graph Visualization - (C. Badenes)
  Peter Eades - University of Sydney

๏ Graph Data Management - (F. Priyatna)
  Claudio Gutierrez - Universidad de Chile

๏ Applications of Flexible Querying to Graphs - (F. Priyatna)
  Alexandra Poulovassilis - Birkbeck, University of London

๏ Graph Management Benchmarking - (F. Priyatna)
  Peter Boncz - CWI and Vrije Universiteit Amsterdam

๏ Graph Algorithms - (D. Garijo)
  Dennis Shasha - New York University

๏ Parallel Processing - (D. Garijo)
  Bin Shao - Microsoft Research, Beijing

Page 5: EDBT 2015: Summer School Overview

Graph Data Management

Dr. Claudio Gutierrez
Computer Science Department
Universidad de Chile

http://richard.cyganiak.de/blog/2006/06/perez-et-al-semantics-and-complexity-of-sparql/

(2-2, 1-1)

Page 6: EDBT 2015: Summer School Overview

Graph Data Management

A general view of the main features of current graph databases.

A hypernode is a directed graph whose nodes can themselves be graphs (or hypernodes), allowing nesting of graphs.

A property graph is a directed, labelled, attributed multigraph. That is, a graph where the edges are directed, both nodes and edges are labelled and can have any number of properties (or attributes), and there can be multiple edges between any two nodes.
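The property graph definition can be illustrated with a small in-memory structure. This is only a sketch in plain Java: the class and method names (`PropertyGraphSketch`, `addNode`, `addEdge`, `edgesBetween`) are made up for illustration and do not come from the lecture or from any graph database API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical minimal property graph: directed, labelled, attributed multigraph. */
final class PropertyGraphSketch {

    /** A node with a label and any number of key/value properties. */
    static final class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        Node(String label) { this.label = label; }
    }

    /** A directed, labelled edge that carries its own properties. */
    static final class Edge {
        final Node source;
        final Node target;
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        Edge(Node source, Node target, String label) {
            this.source = source;
            this.target = target;
            this.label = label;
        }
    }

    final List<Node> nodes = new ArrayList<>();
    // Edges are kept in a list, so several edges between the same two nodes are allowed.
    final List<Edge> edges = new ArrayList<>();

    Node addNode(String label) {
        Node node = new Node(label);
        nodes.add(node);
        return node;
    }

    Edge addEdge(Node source, Node target, String label) {
        Edge edge = new Edge(source, target, label);
        edges.add(edge);
        return edge;
    }

    /** All edges going from source to target (direction matters). */
    List<Edge> edgesBetween(Node source, Node target) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.source == source && e.target == target) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Node alice = g.addNode("Person");
        alice.properties.put("name", "Alice");
        Node bob = g.addNode("Person");
        g.addEdge(alice, bob, "knows").properties.put("since", 2015);
        g.addEdge(alice, bob, "worksWith");
        System.out.println(g.edgesBetween(alice, bob).size()); // two parallel edges
    }
}
```

Storing edges as first-class objects, rather than as a node-to-node adjacency set, is what makes both edge properties and multiple edges between the same pair of nodes possible.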

Page 7: EDBT 2015: Summer School Overview

Applications of Flexible Querying to Graphs

Dr. Alexandra Poulovassilis
Department of Computer Science and Information Systems, Birkbeck, University of London

Reasoning in Event-Based Distributed Systems
Authors: Helmer, Sven; Poulovassilis, Alexandra; Xhafa, Fatos

Adapting to Change in Content, Size, Topology and Use
Editors: Levene, Mark; Poulovassilis, Alexandra (Eds.)

The Functional Approach to Data Management: Modeling, Analyzing and Integrating Heterogeneous Data
Editors: Gray, P.M.D.; Kerschberg, L.; King, P.J.H.; Poulovassilis, A. (Eds.)

Page 8: EDBT 2015: Summer School Overview

Applications of Flexible Querying to Graphs

Query relaxation generally returns additional answers compared to the exact form of the database query.

Q2 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  RELAX ( ?x :hasFamilyName ?z ) }

Q3 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  ?x :label ?z . }

Q3.1 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  ?x :hasGivenName ?z . }

Q3.2 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  ?x :hasFamilyName ?z . }

Property hierarchy:
  hasFamilyName subPropertyOf label
  hasGivenName subPropertyOf label

Query approximation returns potentially different answers compared to the exact form of the query.

Q1 = SELECT * WHERE {
  APPROX ( :Battle_of_Waterloo :happenedIn/(:hasLongitude|:hasLatitude) ?x ) }

Q1.1 = SELECT * WHERE {
  :Battle_of_Waterloo :hasLongitude ?x }

Q1.2 = SELECT * WHERE {
  :Battle_of_Waterloo :hasLatitude ?x }

SparqlAR
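The relaxation step on this slide (replacing a property by its super-property) can be sketched as follows. This is an illustration only, not the SparqlAR implementation: the class `RelaxSketch` and its methods are hypothetical, each property is assumed to have at most one super-property, and triple patterns are handled as plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of property relaxation along a subPropertyOf hierarchy. */
final class RelaxSketch {

    // child property -> parent property (assumption: single parent per property)
    private final Map<String, String> superProperty = new HashMap<>();

    void addSubPropertyOf(String child, String parent) {
        superProperty.put(child, parent);
    }

    /** The property followed by its successive super-properties; the index is the relaxation distance. */
    List<String> relaxations(String property) {
        List<String> chain = new ArrayList<>();
        String current = property;
        while (current != null && !chain.contains(current)) { // guard against cycles
            chain.add(current);
            current = superProperty.get(current);
        }
        return chain;
    }

    /** One triple pattern per relaxation, from the exact pattern to the most general one. */
    List<String> relaxedPatterns(String subject, String property, String object) {
        List<String> patterns = new ArrayList<>();
        for (String p : relaxations(property)) {
            patterns.add(subject + " " + p + " " + object + " .");
        }
        return patterns;
    }

    public static void main(String[] args) {
        RelaxSketch r = new RelaxSketch();
        r.addSubPropertyOf(":hasFamilyName", ":label");
        r.addSubPropertyOf(":hasGivenName", ":label");
        for (String pattern : r.relaxedPatterns("?x", ":hasFamilyName", "?z")) {
            System.out.println(pattern);
        }
    }
}
```

With the hierarchy from the slide, relaxing `?x :hasFamilyName ?z` yields the exact pattern first and then `?x :label ?z`, which is the pattern of Q3 and also matches resources that only have a given name.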

Page 9: EDBT 2015: Summer School Overview

Graph Management Benchmarking

Dr. Peter Boncz
Centrum Wiskunde & Informatica (CWI)

Page 10: EDBT 2015: Summer School Overview

Graph Management Benchmarking


Description: Given a start Person, find the Forums which that Person’s friends and friends of friends (excluding start Person) became Members of after a given date. Return top 20 Forums, and the number of Posts in each Forum that was Created by any of these Persons. For each Forum consider only those Persons which joined that particular Forum after the given date. Sort results descending by the count of Posts, and then ascending by Forum identifier.
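To make the query description concrete, here is a rough sketch of its logic over plain Java collections. It is not benchmark code and does not use any graph database: the class `ForumQuerySketch`, the array layouts (`{forumId, personId, joinDate}` for memberships, `{forumId, creatorId}` for posts) and the choice to keep forums with zero matching posts are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory version of the "forums of friends and friends of friends" query. */
final class ForumQuerySketch {

    /** Returns rows {forumId, postCount}, sorted by count descending, then forum id ascending. */
    static List<long[]> topForums(long start, Map<Long, Set<Long>> knows,
                                  List<long[]> memberships, List<long[]> posts,
                                  long minDate, int limit) {
        // Friends and friends of friends, excluding the start person.
        Set<Long> persons = new HashSet<>();
        for (long friend : knows.getOrDefault(start, Collections.emptySet())) {
            persons.add(friend);
            persons.addAll(knows.getOrDefault(friend, Collections.emptySet()));
        }
        persons.remove(start);

        // forum -> those persons who joined that forum after minDate
        Map<Long, Set<Long>> joined = new HashMap<>();
        for (long[] m : memberships) {
            if (persons.contains(m[1]) && m[2] > minDate) {
                joined.computeIfAbsent(m[0], k -> new HashSet<>()).add(m[1]);
            }
        }

        // forum -> number of posts in that forum created by its qualifying members
        Map<Long, Long> counts = new HashMap<>();
        for (Long forum : joined.keySet()) {
            counts.put(forum, 0L);
        }
        for (long[] p : posts) {
            Set<Long> members = joined.get(p[0]);
            if (members != null && members.contains(p[1])) {
                counts.merge(p[0], 1L, Long::sum);
            }
        }

        List<long[]> rows = new ArrayList<>();
        for (Map.Entry<Long, Long> e : counts.entrySet()) {
            rows.add(new long[] {e.getKey(), e.getValue()});
        }
        rows.sort((a, b) -> a[1] != b[1] ? Long.compare(b[1], a[1]) : Long.compare(a[0], b[0]));
        if (rows.size() > limit) {
            return new ArrayList<>(rows.subList(0, limit));
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<Long, Set<Long>> knows = new HashMap<>();
        knows.put(1L, new HashSet<>(Arrays.asList(2L)));
        knows.put(2L, new HashSet<>(Arrays.asList(1L, 3L)));
        knows.put(3L, new HashSet<>(Arrays.asList(2L)));
        List<long[]> memberships = Arrays.asList(
                new long[] {10, 2, 5}, new long[] {10, 3, 1}, new long[] {20, 3, 7});
        List<long[]> posts = Arrays.asList(
                new long[] {10, 2}, new long[] {10, 3}, new long[] {20, 3}, new long[] {20, 3});
        for (long[] row : topForums(1L, knows, memberships, posts, 2L, 20)) {
            System.out.println("forum " + row[0] + ": " + row[1] + " posts");
        }
    }
}
```

In the sample data, person 3 joined forum 10 before the cut-off date, so that person's post in forum 10 is not counted, while both posts in forum 20 are.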

Page 11: EDBT 2015: Summer School Overview

Graph Management Benchmarking


Page 12: EDBT 2015: Summer School Overview

Graph Motifs


Page 13: EDBT 2015: Summer School Overview

Graph Motifs


Page 14: EDBT 2015: Summer School Overview

Parallel Processing


Page 15: EDBT 2015: Summer School Overview

Parallel Processing


Page 16: EDBT 2015: Summer School Overview

Large Scale Graph Processing System

Dr. Sherif Sakr
Associate Professor at the College of Public Health and Health Informatics, King Saud bin Abdul-Aziz University

“Big Data (Graph) Processing Systems: State-of-the-art and open challenges”

Page 17: EDBT 2015: Summer School Overview

Large Scale Graph Processing System

Pregel Family
Bulk Synchronous Parallel (BSP) model
L. G. Valiant. A Bridging Model for Parallel Computation. Commun. ACM, 1990

GraphLab Family
Gather, Apply, Scatter (GAS) model
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB, 2012
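The BSP model behind the Pregel family can be illustrated with a single-machine simulation of supersteps. This is a sketch under simplifying assumptions, not Pregel or any real framework: the class `BspSketch` is hypothetical, vertices are processed in a sequential loop instead of in parallel, and the example computation (each vertex adopts the smallest vertex id it hears about, which labels connected components in an undirected graph) is just a common textbook choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical single-machine simulation of BSP supersteps with vertex-to-vertex messages. */
final class BspSketch {

    private static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> inboxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inboxes.add(new ArrayList<>());
        }
        return inboxes;
    }

    /** adjacency.get(v) lists the neighbours of v; the graph is assumed undirected. */
    static int[] minLabel(List<List<Integer>> adjacency) {
        int n = adjacency.size();
        int[] label = new int[n];

        // Superstep 0: every vertex starts with its own id and sends it to its neighbours.
        List<List<Integer>> inbox = emptyInboxes(n);
        for (int v = 0; v < n; v++) {
            label[v] = v;
            for (int u : adjacency.get(v)) {
                inbox.get(u).add(v);
            }
        }

        boolean messagesInFlight = true;
        while (messagesInFlight) {
            // Barrier: messages sent in the previous superstep become visible only now.
            List<List<Integer>> next = emptyInboxes(n);
            messagesInFlight = false;
            for (int v = 0; v < n; v++) { // per-vertex compute step
                int best = label[v];
                for (int message : inbox.get(v)) {
                    best = Math.min(best, message);
                }
                if (best < label[v]) {
                    // Value changed: update and notify neighbours in the next superstep.
                    label[v] = best;
                    for (int u : adjacency.get(v)) {
                        next.get(u).add(best);
                        messagesInFlight = true;
                    }
                }
                // Otherwise the vertex stays silent, which plays the role of voting to halt.
            }
            inbox = next;
        }
        return label;
    }

    public static void main(String[] args) {
        List<List<Integer>> adjacency = new ArrayList<>();
        adjacency.add(List.of(1));    // 0 - 1
        adjacency.add(List.of(0, 2)); // 1 - 2
        adjacency.add(List.of(1));
        adjacency.add(List.of());     // 3 is isolated
        System.out.println(java.util.Arrays.toString(minLabel(adjacency)));
    }
}
```

Because each vertex only reads messages delivered at the barrier and only writes its own value, the sequential loop gives the same result a parallel execution of the same superstep would.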

Page 18: EDBT 2015: Summer School Overview

Graph Visualization

Dr. Peter Eades
Research Professor at the School of Information Technologies, The University of Sydney

Data -> (Visualization Function) -> Drawing -> (Perception Function) -> Human

faithful + readable

Page 19: EDBT 2015: Summer School Overview

Graph Visualization

Topology-Shape-Metric approach
- Clustered Planarity: scaling to large graphs

Energy-based approach
- Multilevel methods, Fast Approximations: scaling to large graphs

Page 20: EDBT 2015: Summer School Overview

Part II: Practical

๏ Sparksee - Sparsity Technologies (C. Badenes)
  Universitat Politècnica de Catalunya

๏ Challenges :: OEG-Team - (D. Garijo)
  Similarities between Wikipedia Articles

Page 21: EDBT 2015: Summer School Overview

Sparksee


Page 22: EDBT 2015: Summer School Overview

Sparksee

Schema

// User node
int nodeUser = graph.newNodeType("User");
int userNickName = graph.newAttribute(nodeUser, "nickname", DataType.String, AttributeKind.Unique);

// 'knows' edge type
int edgeKnows = graph.newEdgeType("knows", true, true);

// User1
long user1 = graph.newNode(nodeUser);
graph.setAttribute(user1, userNickName, new Value().setString("User1"));

// 'knows' edge
long knows1 = graph.newEdge(edgeKnows, user1, user2);

Query: Get common Messages for the given Hashtags

// Find the OIDs of the Hashtags with the given hashtag texts.
int tag = g.findType("Tag");
int tagName = g.findAttribute(tag, "name");
long tag1 = g.findObject(tagName, new Value().setString(ht1));
long tag2 = g.findObject(tagName, new Value().setString(ht2));

// Retrieve the Messages of each hashtag and intersect the two collections.
int tags = g.findType("tags");
Objects msgs1 = g.neighbors(tag1, tags, EdgesDirection.Ingoing);
Objects msgs2 = g.neighbors(tag2, tags, EdgesDirection.Ingoing);
long nums = msgs1.intersection(msgs2);

Page 23: EDBT 2015: Summer School Overview

Challenge

Similarities in Wikipedia

- Description
- To Evaluate
  - The design
  - A good proof of functionality
  - The efficiency, in terms of computation time
  - The originality of the proposed method
- Technical prerequisites of participants
  - Basic programming skills
  - To be familiar with some graph library
- Technical support provided to participants
  - English Wikipedia data (dump):
    - articles_ids.csv
    - articles_links.csv
    - articles_body.csv
    - articles_redirect.csv
    - categories_ids.csv
    - articles_category.csv
    - categories_relations.csv

Page 24: EDBT 2015: Summer School Overview

Problem

Similarity between Wikipedia Articles

Wikipedia Article: text, links, categories

Page 25: EDBT 2015: Summer School Overview

Hypothesis

Wikipedia Article: text, links, categories

simWA(R1,R2) = α·simTxt(R1,R2) + β·simLinks(R1,R2) + ɣ·simCtg(R1,R2)

where α+β+ɣ=1
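The weighted combination can be written down directly. The sketch below is illustrative only: the class and method names are hypothetical, and the sample call uses the weights (0.6 for text, 0.2 for links, 0.2 for categories) and the Lionel Messi / Princess Akiko component values that appear on the later Proof of Concept and Comparison slides.

```java
/** Hypothetical helper for the weighted article similarity simWA. */
final class WeightedSimilaritySketch {

    /** alpha weights simTxt, beta weights simLinks, gamma weights simCtg; the weights must sum to 1. */
    static double simWA(double simTxt, double simLinks, double simCtg,
                        double alpha, double beta, double gamma) {
        if (Math.abs(alpha + beta + gamma - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must sum to 1");
        }
        return alpha * simTxt + beta * simLinks + gamma * simCtg;
    }

    public static void main(String[] args) {
        // 0.6 * 0.060 + 0.2 * 0.008 + 0.2 * 0.030 is approximately 0.0436
        System.out.println(simWA(0.060, 0.008, 0.030, 0.6, 0.2, 0.2));
    }
}
```

The result of roughly 0.0436 agrees with the 0.043 shown for that pair in the deck, which suggests the reported scores were computed with exactly this formula.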

Page 26: EDBT 2015: Summer School Overview

Similarity based on Text

Latent Dirichlet Allocation

Topics: TOPIC_1, TOPIC_2, ..., TOPIC_n

Ri: p = [0.5, 0.3, .., 0.7]
Rj: q = [0.2, 0.4, .., 0.9]
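Once each article is represented by a vector of topic weights, a text similarity can be computed between the two vectors. The slide does not say which measure was used, so the cosine similarity below is an assumption chosen for illustration, and the class name `TopicSimilaritySketch` is hypothetical.

```java
/** Hypothetical sketch: cosine similarity between two topic-weight vectors. */
final class TopicSimilaritySketch {

    static double cosine(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("vectors must cover the same topics");
        }
        double dot = 0.0;
        double normP = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < p.length; i++) {
            dot += p[i] * q[i];
            normP += p[i] * p[i];
            normQ += q[i] * q[i];
        }
        if (normP == 0.0 || normQ == 0.0) {
            return 0.0; // an all-zero vector shares no topic with anything
        }
        return dot / Math.sqrt(normP * normQ);
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.3, 0.7}; // topic weights of article Ri (illustrative values)
        double[] q = {0.2, 0.4, 0.9}; // topic weights of article Rj (illustrative values)
        System.out.println(cosine(p, q));
    }
}
```

A divergence between probability distributions, such as Jensen-Shannon, would be an equally plausible choice when the vectors are normalised topic distributions.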

Page 27: EDBT 2015: Summer School Overview

Similarity based on Categories

Articles with multiple common categories are likely to be similar.

Noise filtering is necessary (e.g., “All articles lacking in-text citations”).
See https://github.com/cbadenes/siminwikart-challenge4/blob/master/category/wikipedia_bad_categories.txt
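The effect of noise filtering can be shown with a small sketch. The slide does not give the category similarity formula, so the Jaccard overlap used here is an assumption, as are the class name `CategorySimilaritySketch` and the sample category names other than the maintenance category quoted above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: category overlap after removing maintenance ("noise") categories. */
final class CategorySimilaritySketch {

    static double simCtg(Set<String> categoriesA, Set<String> categoriesB, Set<String> noise) {
        Set<String> a = new HashSet<>(categoriesA);
        a.removeAll(noise);
        Set<String> b = new HashSet<>(categoriesB);
        b.removeAll(noise);

        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        a.retainAll(b); // a now holds the shared categories
        return (double) a.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> noise = new HashSet<>(Arrays.asList("All articles lacking in-text citations"));
        Set<String> first = new HashSet<>(Arrays.asList(
                "Living people", "Footballers", "All articles lacking in-text citations"));
        Set<String> second = new HashSet<>(Arrays.asList(
                "Living people", "Racing drivers", "All articles lacking in-text citations"));
        System.out.println(simCtg(first, second, noise));
    }
}
```

Without the filter the two sample articles would share two of four categories; with it they share one of three, so the maintenance category no longer inflates the score.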

Page 28: EDBT 2015: Summer School Overview

Similarity based on Links

Articles with multiple common links are likely to be similar.

Sim(A,B) = |links(A) ∩ links(B)| / ( |links(A) ∪ links(B)| / 2 )
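The link similarity can be computed with two set operations. The sketch below follows the formula as it is written on the slide, reading the intersection and union as set sizes; the class name `LinkSimilaritySketch` and the sample link names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the slide's link similarity: |A ∩ B| / (|A ∪ B| / 2). */
final class LinkSimilaritySketch {

    static double simLinks(Set<String> linksA, Set<String> linksB) {
        Set<String> intersection = new HashSet<>(linksA);
        intersection.retainAll(linksB);
        Set<String> union = new HashSet<>(linksA);
        union.addAll(linksB);
        if (union.isEmpty()) {
            return 0.0; // no links at all: treat as no evidence of similarity
        }
        return intersection.size() / (union.size() / 2.0);
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("A", "B", "C", "D"));
        Set<String> b = new HashSet<>(Arrays.asList("C", "D", "E", "F"));
        System.out.println(simLinks(a, b)); // 2 shared links over (6 / 2)
    }
}
```

Note that with this normalisation two articles with identical link sets score 2 rather than 1, so the value is twice the Jaccard index and ranges from 0 to 2.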

Page 29: EDBT 2015: Summer School Overview

Proof of Concept

Articles: Fernando Alonso, Lionel Messi, Iker Casillas, Princess Akiko

Weights: simTxt = 0.6, simLinks = 0.2, simCtg = 0.2 (in the notation of the Hypothesis slide: α = 0.6, β = 0.2, ɣ = 0.2)

Combined similarity values:

[1] 0.062, [3] 0.075
[1] 0.666, [3] 0.683
[1] 0.058, [3] 0.069
[1] 0.043, [3] 0.072
[1] 0.019, [3] 0.023
[1] 0.068, [3] 0.069

Per-component values:

simTxt = 0.059, simLinks = 0.019, simCtg = [1] 0.117, [3] 0.181
simTxt = 0.065, simLinks = 0.0, simCtg = [1] 0.095, [3] 0.161
simTxt = 0.052, simLinks = 0.019, simCtg = [1] 0.166, [3] 0.172
simTxt = 0.980, simLinks = 0.175, simCtg = [1] 0.217, [3] 0.302
simTxt = 0.060, simLinks = 0.008, simCtg = [1] 0.030, [3] 0.172
simTxt = 0.069, simLinks = 0.004, simCtg = [1] 0.080, [3] 0.134

Page 30: EDBT 2015: Summer School Overview

Comparison

Lionel Messi

Princess Akiko

simTxt = 0.060 -> <common words>
simLinks = 0.008 -> (England, Buenos_Aires, Chile, Madrid, Argentina)
simCtg = [1] 0.030 -> living_person

Page 31: EDBT 2015: Summer School Overview

Proposal


Graph based on Links Graph based on Similarities

Page 32: EDBT 2015: Summer School Overview

Problem

Wikipedia links reliability (missing links)

Wikipedia Article: text, links, categories

Page 33: EDBT 2015: Summer School Overview

Further Refinement


Similarities between categories (as topics) can define relations between articles

Graph based on Links Graph based on Similarities

Subgraph Pattern Matching

+Topic Model

+

Page 34: EDBT 2015: Summer School Overview

Code


https://github.com/cbadenes/siminwikart-challenge4

Page 35: EDBT 2015: Summer School Overview

Happy Ending


Page 36: EDBT 2015: Summer School Overview

Kitkat Time

• Suggestions?

• Name for the system?

• Contributors?
