EDBT 2015: Summer School Overview
EDBT Summer School 2015
Badenes, Carlos; Garijo, Daniel; Priyatna, Freddy
Palamos, Spain, 31/8 - 4/9 2015
The Venue
(where we thought we would be) (where we actually were)
Overview
Graph Data Management
Part I: Theoretical
- Notes about lectures
Part II: Practical
- Sparksee Technology
- Challenges
Part I: Theoretical
๏ Large Scale Graph Processing System - (C. Badenes) Sherif Sakr - National ICT Australia
๏ Graph Visualization - (C. Badenes) Peter Eades - University of Sydney
๏ Graph Data Management - (F. Priyatna) Claudio Gutierrez - Universidad de Chile
๏ Applications of Flexible Querying to Graphs - (F. Priyatna) Alexandra Poulovassilis - Birkbeck, University of London
๏ Graph Management Benchmarking - (F. Priyatna) Peter Boncz - CWI and Vrije Universiteit Amsterdam
๏ Graph Algorithms - (D. Garijo) Dennis Shasha - New York University
๏ Parallel Processing - (D. Garijo) Bin Shao - Microsoft Research, Beijing
Graph Data Management
Dr. Claudio Gutierrez, Computer Science Department, Universidad de Chile
http://richard.cyganiak.de/blog/2006/06/perez-et-al-semantics-and-complexity-of-sparql/
(2-2, 1-1)
a general view of the main features of current graph databases
Graph Data Management
A hypernode is a directed graph whose nodes can themselves be graphs (or hypernodes), allowing nesting of graphs.
A property graph is a directed, labelled, attributed multigraph. That is, a graph where the edges are directed, both nodes and edges are labelled and can have any number of properties (or attributes), and there can be multiple edges between any two vertices.
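The property graph definition above can be sketched as a small in-memory data structure. This is only an illustration under assumptions: the class and method names (PropertyGraphSketch, addNode, addEdge, edgesFromTo) are hypothetical and do not belong to any graph database API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal property graph; not the API of any graph database.
class PropertyGraphSketch {
    // A node carries a label and any number of properties.
    static final class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        Node(String label) { this.label = label; }
    }

    // An edge is directed (source -> target), labelled, and has properties too.
    static final class Edge {
        final Node source;
        final Node target;
        final String label;
        final Map<String, Object> properties = new HashMap<>();
        Edge(Node source, Node target, String label) {
            this.source = source;
            this.target = target;
            this.label = label;
        }
    }

    final List<Node> nodes = new ArrayList<>();
    // A plain list, so several edges may connect the same pair of nodes (multigraph).
    final List<Edge> edges = new ArrayList<>();

    Node addNode(String label) {
        Node n = new Node(label);
        nodes.add(n);
        return n;
    }

    Edge addEdge(Node source, Node target, String label) {
        Edge e = new Edge(source, target, label);
        edges.add(e);
        return e;
    }

    // All edges going from source to target, respecting direction.
    List<Edge> edgesFromTo(Node source, Node target) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.source == source && e.target == target) result.add(e);
        }
        return result;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Node alice = g.addNode("Person");
        alice.properties.put("name", "Alice");
        Node bob = g.addNode("Person");
        g.addEdge(alice, bob, "knows").properties.put("since", 2015);
        g.addEdge(alice, bob, "worksWith");
        System.out.println(g.edgesFromTo(alice, bob).size());
    }
}
```

Storing edges in a list rather than in a map keyed by endpoint pair is what allows parallel edges between the same two nodes.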
Applications of Flexible Querying to Graphs
Dr. Alexandra Poulovassilis, Department of Computer Science and Information Systems, Birkbeck, University of London
Reasoning in Event-Based Distributed Systems
Authors: Helmer, Sven; Poulovassilis, Alexandra; Xhafa, Fatos
Adapting to Change in Content, Size, Topology and Use
Editors: Levene, Mark; Poulovassilis, Alexandra (Eds.)
The Functional Approach to Data Management: Modeling, Analyzing and Integrating Heterogeneous Data
Editors: Gray, P.M.D.; Kerschberg, L.; King, P.J.H.; Poulovassilis, A. (Eds.)
Applications of Flexible Querying to Graphs
Query relaxation, which generally returns additional answers compared to the exact form of the database query.
Query approximation, which returns potentially different answers compared to the exact form of the query.
Q2 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  RELAX ( ?x :hasFamilyName ?z ) }
Q3 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  ?x :label ?z . }
Q3.1 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  ?x :hasGivenName ?z . }
Q3.2 = SELECT * WHERE {
  ?x :actedIn :Tea_with_Mussolini .
  ?x :hasFamilyName ?z . }
Q1 = SELECT * WHERE {
  APPROX ( :Battle_of_Waterloo :happenedIn/(:hasLongitude|:hasLatitude) ?x ) }
Q1.1 = SELECT * WHERE {
  :Battle_of_Waterloo :hasLongitude ?x }
Q1.2 = SELECT * WHERE {
  :Battle_of_Waterloo :hasLatitude ?x }
hasFamilyName subPropertyOf label
hasGivenName subPropertyOf label
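A rough sketch of one relaxation step over the subPropertyOf hierarchy: the property in the relaxed pattern is replaced by its direct super-property, so answers using sibling properties are also returned. This is a simplification under assumptions and not the actual evaluation procedure of the language in the lecture: it ignores answer costs and ranking, and the names RelaxSketch, match and relaxedMatch as well as the sample triples are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of relaxing a triple pattern via subPropertyOf.
class RelaxSketch {
    // Direct super-property of each property (the hierarchy shown on the slide).
    static final Map<String, String> SUPER = new HashMap<>();
    static {
        SUPER.put("hasFamilyName", "label");
        SUPER.put("hasGivenName", "label");
    }

    // True if p equals target or is a (transitive) sub-property of it.
    static boolean isSubPropertyOrSelf(String p, String target) {
        for (String cur = p; cur != null; cur = SUPER.get(cur)) {
            if (cur.equals(target)) return true;
        }
        return false;
    }

    // Objects o such that (subject, q, o) is a triple with q at or below property.
    static List<String> match(List<String[]> triples, String subject, String property) {
        List<String> out = new ArrayList<>();
        for (String[] t : triples) {
            if (t[0].equals(subject) && isSubPropertyOrSelf(t[1], property)) {
                out.add(t[2]);
            }
        }
        return out;
    }

    // One relaxation step: replace the property by its direct super-property.
    static List<String> relaxedMatch(List<String[]> triples, String subject, String property) {
        String sup = SUPER.get(property);
        return match(triples, subject, sup == null ? property : sup);
    }

    public static void main(String[] args) {
        List<String[]> triples = new ArrayList<>();
        triples.add(new String[] {"actor1", "hasGivenName", "G1"});   // made-up data
        triples.add(new String[] {"actor1", "hasFamilyName", "F1"});  // made-up data
        System.out.println(match(triples, "actor1", "hasFamilyName"));        // exact pattern
        System.out.println(relaxedMatch(triples, "actor1", "hasFamilyName")); // relaxed to label
    }
}
```

In this sketch the exact pattern yields only the family name, while the relaxed pattern also yields the given name, which mirrors how Q2 relates to Q3, Q3.1 and Q3.2.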
SparqlAR
Graph Management Benchmarking
Dr. Peter Boncz, Centrum Wiskunde & Informatica (CWI)
Graph Management Benchmarking
Description: Given a start Person, find the Forums which that Person’s friends and friends of friends (excluding start Person) became Members of after a given date. Return top 20 Forums, and the number of Posts in each Forum that was Created by any of these Persons. For each Forum consider only those Persons which joined that particular Forum after the given date. Sort results descending by the count of Posts, and then ascending by Forum identifier.
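The final ordering step of this query (descending post count, then ascending Forum identifier, top 20) can be sketched as below. Only the ranking is shown; the friend-of-friend traversal and membership filtering are omitted. ForumCount and top are hypothetical names, and the sample numbers are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the result ordering described in the query text.
class ForumRankingSketch {
    static final class ForumCount {
        final long forumId;
        final long postCount;
        ForumCount(long forumId, long postCount) {
            this.forumId = forumId;
            this.postCount = postCount;
        }
    }

    // Sort by post count descending, then forum id ascending; keep at most 'limit' rows.
    static List<ForumCount> top(List<ForumCount> counts, int limit) {
        List<ForumCount> sorted = new ArrayList<>(counts);
        sorted.sort(Comparator.comparingLong((ForumCount f) -> f.postCount).reversed()
                .thenComparingLong(f -> f.forumId));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        List<ForumCount> counts = new ArrayList<>();
        counts.add(new ForumCount(3, 5));
        counts.add(new ForumCount(1, 5));
        counts.add(new ForumCount(2, 9));
        for (ForumCount f : top(counts, 20)) {
            System.out.println(f.forumId + " " + f.postCount);
        }
    }
}
```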
Graph Motifs
Parallel Processing
Large Scale Graph Processing System
Dr. Sherif Sakr, Associate Professor at the College of Public Health and Health Informatics, King Saud bin Abdul-Aziz University
“Big Data (Graph) Processing Systems: State-of-the-art and open challenges”
Large Scale Graph Processing System
Pregel Family: Bulk Synchronous Parallel (BSP) model
L. G. Valiant. A Bridging Model for Parallel Computation. Commun. ACM, 1990
GraphLab Family: Gather, Apply, Scatter (GAS) model
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB, 2012
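A minimal single-threaded simulation of the BSP model in the vertex-centric style, propagating the maximum vertex value through a graph. It is a sketch under assumptions and not the API of Pregel or any of its descendants: BspMaxValueSketch and run are hypothetical names, the per-vertex loop stands in for parallel workers, and swapping the inboxes stands in for the synchronization barrier.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-threaded simulation of BSP supersteps (max-value propagation).
class BspMaxValueSketch {
    static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) boxes.add(new ArrayList<Integer>());
        return boxes;
    }

    // Updates values in place; returns the number of supersteps executed.
    static int run(int[] values, int[][] outNeighbors) {
        int n = values.length;
        List<List<Integer>> inbox = emptyInboxes(n);
        int superstep = 0;
        while (true) {
            List<List<Integer>> next = emptyInboxes(n);
            boolean sent = false;
            // Local computation: conceptually every vertex runs in parallel.
            for (int v = 0; v < n; v++) {
                int max = values[v];
                for (int m : inbox.get(v)) max = Math.max(max, m);
                boolean changed = max > values[v];
                values[v] = max;
                // Communication: send only in superstep 0 or when the value changed;
                // otherwise the vertex stays silent (votes to halt).
                if (superstep == 0 || changed) {
                    for (int w : outNeighbors[v]) {
                        next.get(w).add(max);
                        sent = true;
                    }
                }
            }
            superstep++;
            if (!sent) break;   // no messages in flight: computation ends
            inbox = next;       // barrier: messages become visible in the next superstep
        }
        return superstep;
    }

    public static void main(String[] args) {
        int[] values = {3, 6, 2, 1};
        int[][] out = {{1}, {0, 2}, {1, 3}, {2}};   // chain 0 - 1 - 2 - 3, both directions
        int steps = run(values, out);
        System.out.println(Arrays.toString(values) + " after " + steps + " supersteps");
    }
}
```

Messages written during a superstep go to a separate set of inboxes and are read only after the swap, which is the property that distinguishes BSP from asynchronous execution.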
Graph Visualization
Dr. Peter Eades, Research Professor at the School of Information Technologies, The University of Sydney
Data -> (Visualization Function) -> Drawing -> (Perception Function) -> Human
faithful + readable
Graph Visualization
Topology-Shape-Metric approach
Energy-based approach
Clustered Planarity:
Multilevel methods:
Fast Approximations:
scaling to large graphs
scaling to large graphs
Part II: Practical
๏ Sparksee - Sparsity Technologies (C. Badenes) Universitat Politècnica de Catalunya
๏ Challenges :: OEG-Team - (D. Garijo) Similarities between Wikipedia Articles
Sparksee
schema
// User node
int nodeUser = graph.newNodeType("User");
int userNickName = graph.newAttribute(nodeUser, "nickname", DataType.String, AttributeKind.Unique);
// knows edge
int edgeKnows = graph.newEdgeType("knows", true, true);
// User1
long user1 = graph.newNode(nodeUser);
graph.setAttribute(user1, userNickName, new Value().setString("User1"));
// edge 'knows' (user2 is assumed to be created in the same way as user1)
long knows1 = graph.newEdge(edgeKnows, user1, user2);
query: Get common Messages for the given Hashtags
// Find the OIDs of the Hashtags with the given hashtag texts (ht1, ht2).
int tag = g.findType("Tag");
int tagName = g.findAttribute(tag, "name");
long tag1 = g.findObject(tagName, new Value().setString(ht1));
long tag2 = g.findObject(tagName, new Value().setString(ht2));
// Retrieve the Messages for each hashtag and intersect the two collections.
int tags = g.findType("tags");
Objects msgs1 = g.neighbors(tag1, tags, EdgesDirection.Ingoing);
Objects msgs2 = g.neighbors(tag2, tags, EdgesDirection.Ingoing);
// nums is the number of Messages tagged with both hashtags.
long nums = msgs1.intersection(msgs2);
Challenge
Similarities in Wikipedia
- Description
- To Evaluate
  - The design
  - A good proof of functionality
  - The efficiency, in terms of computation time
  - The originality of the proposed method
- Technical prerequisites of participants
  - Basic programming skills
  - To be familiar with some graph library
- Technical support provided to participants
  - English Wikipedia data (dump):
    - articles_ids.csv
    - articles_links.csv
    - articles_body.csv
    - articles_redirect.csv
    - categories_ids.csv
    - articles_category.csv
    - categories_relations.csv
Problem
Similarity between Wikipedia Articles
Wikipedia Article: text, links, categories
Hypothesis
Wikipedia Article: text -> simTxt, links -> simLinks, categories -> simCtg
simWA(R1,R2) = α·simTxt(R1,R2) + β·simLinks(R1,R2) + γ·simCtg(R1,R2)
where α+β+γ=1
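The weighted combination can be checked with a few lines of code. The sketch below plugs in numbers that appear in the proof of concept: component scores 0.980 (text), 0.175 (links) and 0.217 (categories), with weight 0.6 on text and 0.2 on each of links and categories. WeightedSimilaritySketch and its parameter names are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper computing the weighted article similarity.
class WeightedSimilaritySketch {
    // wTxt, wLinks and wCtg are the weights of the three components; they must sum to 1.
    static double simWA(double simTxt, double simLinks, double simCtg,
                        double wTxt, double wLinks, double wCtg) {
        if (Math.abs(wTxt + wLinks + wCtg - 1.0) > 1e-9) {
            throw new IllegalArgumentException("weights must sum to 1");
        }
        return wTxt * simTxt + wLinks * simLinks + wCtg * simCtg;
    }

    public static void main(String[] args) {
        double score = simWA(0.980, 0.175, 0.217, 0.6, 0.2, 0.2);
        // 0.588 + 0.035 + 0.0434 = 0.6664
        System.out.println(String.format(Locale.ROOT, "%.3f", score));
    }
}
```

The result, about 0.666, agrees with the combined score reported for that pair in the proof of concept.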
Similarity based on Text
Latent Dirichlet Allocation
Topics: TOPIC_1, TOPIC_2, …, TOPIC_n
Ri: p = [0.5, 0.3, .., 0.7]
Rj: q = [0.2, 0.4, .., 0.9]
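Once each article is represented by a vector of topic weights, two articles can be compared by comparing their vectors. The slides do not say which measure was used; the sketch below assumes cosine similarity purely for illustration. TopicSimilaritySketch and cosine are hypothetical names, and the vectors are shortened to three made-up topics.

```java
// Hypothetical comparison of two topic-weight vectors; cosine is an assumption.
class TopicSimilaritySketch {
    static double cosine(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("vectors must have the same number of topics");
        }
        double dot = 0.0;
        double normP = 0.0;
        double normQ = 0.0;
        for (int i = 0; i < p.length; i++) {
            dot += p[i] * q[i];
            normP += p[i] * p[i];
            normQ += q[i] * q[i];
        }
        if (normP == 0.0 || normQ == 0.0) return 0.0;   // no topic mass: treat as dissimilar
        return dot / (Math.sqrt(normP) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        double[] p = {0.5, 0.3, 0.7};   // topic weights of article Ri (illustrative)
        double[] q = {0.2, 0.4, 0.9};   // topic weights of article Rj (illustrative)
        System.out.println(cosine(p, q));
    }
}
```

A divergence-based measure over proper probability distributions would be an equally plausible choice; the structure of the computation stays the same.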
Similarity based on Categories
Articles with multiple common categories are likely to be similar
Noise filtering is necessary (e.g., “All articles lacking in-text citations”). See https://github.com/cbadenes/siminwikart-challenge4/blob/master/category/wikipedia_bad_categories.txt
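The filtering step can be sketched as removing maintenance categories before looking at what two articles share. This is an illustration under assumptions: CategoryOverlapSketch and commonCategories are hypothetical names, the blocklist is passed in as a plain set rather than read from the file above, and how the shared categories are turned into a score is left open.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: categories shared by two articles after noise filtering.
class CategoryOverlapSketch {
    static Set<String> commonCategories(Set<String> a, Set<String> b, Set<String> noise) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);      // keep categories present in both articles
        common.removeAll(noise);  // drop maintenance categories that carry no meaning
        return common;
    }

    public static void main(String[] args) {
        Set<String> noise = new HashSet<>(Arrays.asList("All articles lacking in-text citations"));
        Set<String> a = new HashSet<>(Arrays.asList(
                "Living people", "All articles lacking in-text citations", "CategoryX"));
        Set<String> b = new HashSet<>(Arrays.asList(
                "Living people", "All articles lacking in-text citations", "CategoryY"));
        System.out.println(commonCategories(a, b, noise));
    }
}
```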
Similarity based on Links
Sim(A,B) = |links(A) ∩ links(B)| / ( |links(A) ∪ links(B)| / 2 )
Articles with multiple common links are likely to be similar
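The link-based formula translates directly into set operations, reading each term as the size of the corresponding set. The sketch follows the formula as written on the slide, so the score of two identical non-empty link sets is 2 rather than 1. LinkSimilaritySketch and sim are hypothetical names, and the link names are placeholders.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the link-based similarity, following the slide's formula.
class LinkSimilaritySketch {
    static double sim(Set<String> linksA, Set<String> linksB) {
        Set<String> intersection = new HashSet<>(linksA);
        intersection.retainAll(linksB);
        Set<String> union = new HashSet<>(linksA);
        union.addAll(linksB);
        if (union.isEmpty()) return 0.0;   // neither article has links
        return intersection.size() / (union.size() / 2.0);
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<>(Arrays.asList("l1", "l2", "l3", "l4"));
        Set<String> b = new HashSet<>(Arrays.asList("l3", "l4", "l5", "l6", "l7", "l8"));
        // 2 shared links, 8 distinct links overall: 2 / (8 / 2)
        System.out.println(sim(a, b));
    }
}
```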
Proof of Concept
Articles: Fernando Alonso, Lionel Messi, Iker Casillas, Princess Akiko
Weights: (simTxt) α = 0.6, (simLinks) β = 0.2, (simCtg) γ = 0.2
Combined scores (simWA), for the simCtg variants [1] and [3]:
[1] 0.062  [3] 0.075
[1] 0.666  [3] 0.683
[1] 0.058  [3] 0.069
[1] 0.043  [3] 0.072
[1] 0.019  [3] 0.023
[1] 0.068  [3] 0.069
Component scores:
simTxt = 0.059  simLinks = 0.019  simCtg = [1] 0.117, [3] 0.181
simTxt = 0.065  simLinks = 0.0    simCtg = [1] 0.095, [3] 0.161
simTxt = 0.052  simLinks = 0.019  simCtg = [1] 0.166, [3] 0.172
simTxt = 0.980  simLinks = 0.175  simCtg = [1] 0.217, [3] 0.302
simTxt = 0.060  simLinks = 0.008  simCtg = [1] 0.030, [3] 0.172
simTxt = 0.069  simLinks = 0.004  simCtg = [1] 0.080, [3] 0.134
Comparison
Lionel Messi
Princess Akiko
simTxt = 0.060 -> <common words>
simLinks = 0.008 -> (England, Buenos_Aires, Chile, Madrid, Argentina)
simCtg = [1] 0.030 -> living_person
Proposal
Graph based on Links Graph based on Similarities
Problem
Wikipedia links reliability (missing links)
Wikipedia Article: text, links, categories
Further Refinement
Similarities between categories (as topics) can define relations between articles
Graph based on Links Graph based on Similarities
Subgraph Pattern Matching + Topic Model
Code
https://github.com/cbadenes/siminwikart-challenge4
Happy Ending
Kitkat Time
• Suggestions?
• Name for the system?
• Contributors?