1 Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing Nie Zhi...
![Page 1: 1 Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing Nie Zhi niezhixuesen@163.com.](https://reader034.fdocuments.in/reader034/viewer/2022042822/56649e7d5503460f94b80a17/html5/thumbnails/1.jpg)
Efficient SPARQL Query Processing in MapReduce through Data Partitioning and Indexing
Outline
– Introduction
– Related work
– SPARQL Query Processing in MapReduce
– Experiments
– Conclusion
RDF
Resource Description Framework
Subject-predicate-object expressions (S-P-O)
[Figure: RDF graph in the namespace http://www.mpii.de/yago/resource/. The subject (S) Albert Einstein is linked by predicates (P) to objects (O): isCalled "Albert Einstein", isCalled "阿尔伯特•爱因斯坦", wasBornIn Ulm, hasWonPrize Nobel Prize in Physics.]
SPARQL Query Language for RDF
Query:
PREFIX source:<http://www.mpii.de/yago/resource/>
SELECT ?name ?where
WHERE {
  ?who source:hasWonPrize Nobel Prize in Physics.
  ?who source:isCalled ?name.
  ?who source:wasBornIn ?where
}
[Figure: the RDF graph for Albert Einstein (isCalled, hasWonPrize, wasBornIn), with the edges matched by the query highlighted.]
Result:
name | where
Albert Einstein | Ulm
阿尔伯特•爱因斯坦 | Ulm
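The query can be read as pattern matching over triples. Below is a minimal Python sketch, assuming triples are plain tuples and variables are strings starting with "?". The match helper and the prefix-less names are illustrative only; they are not part of any SPARQL engine or of the system described in these slides.

```python
# Illustrative sketch: evaluate a basic graph pattern over in-memory triples.
TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
]


def is_var(term):
    return term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, candidate)


QUERY = [
    ("?who", "hasWonPrize", "Nobel Prize in Physics"),
    ("?who", "isCalled", "?name"),
    ("?who", "wasBornIn", "?where"),
]

rows = [(b["?name"], b["?where"]) for b in match(QUERY, TRIPLES)]
```

Here rows holds one (name, where) pair per solution, which corresponds to the result table on this slide.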
RDF knowledge bases
Semantic Web, Web 2.0
Extracting knowledge from the Web
– YAGO
– DBpedia
– Freebase
– Billion Triple Challenge
…
RDF knowledge base
– 295 data sets
– 31 billion RDF triples
– 504 million RDF links
(September 2011)
Challenge and Opportunity
Challenge
– RDF data is growing rapidly; researchers are working with billions of triples.
– Relational databases have limited scalability.
Opportunity
– Google GFS, MapReduce, BigTable
– Hadoop: an implementation of the MapReduce framework and HDFS
– Adopters: Yahoo!, Amazon, Tencent, Baidu, Taobao ...
We need to build on these recent achievements in handling massive-scale Web data on clusters.
MapReduce: word count
Input
file1: the weather is good
file2: today is good
file3: good weather is good

Map output
Worker 1: (the 1), (weather 1), (is 1), (good 1)
Worker 2: (today 1), (is 1), (good 1)
Worker 3: (good 1), (weather 1), (is 1), (good 1)

Reduce input
Worker 1: (the 1)
Worker 2: (is 1), (is 1), (is 1)
Worker 3: (weather 1), (weather 1)
Worker 4: (today 1)
Worker 5: (good 1), (good 1), (good 1), (good 1)

Reduce output
Worker 1: (the 1)
Worker 2: (is 3)
Worker 3: (weather 2)
Worker 4: (today 1)
Worker 5: (good 4)

Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(k3, v3)
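The word-count example can be simulated on a single machine. This is a minimal Python sketch of the map, shuffle, and reduce steps. It illustrates the programming model only and is not Hadoop code; the function names are made up for this example.

```python
from collections import defaultdict

FILES = {
    "file1": "the weather is good",
    "file2": "today is good",
    "file3": "good weather is good",
}


def map_phase(text):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in text.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce(k2, list(v2)) -> (k3, v3): sum the counts for one word."""
    return key, sum(values)


map_output = [pair for text in FILES.values() for pair in map_phase(text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(map_output).items())
```

In a real cluster each map call runs on the worker holding the input split, and the shuffle moves all pairs with the same key to one reduce worker.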
Solution 1
Directly map the SPARQL query into a sequence of MapReduce jobs
Pro.
– Scalable
Con.
– A burden on the user in terms of usage and maintenance
– No support for complex queries
– No index
– Does not consider the characteristics of RDF data
Solution 2
Map the SPARQL query to Pig, which compiles to MapReduce jobs
Pro.
– Scalable
– Supports complex queries
Con.
– No index
– Does not consider the characteristics of RDF data
Architecture overview
[Figure: system architecture. Components shown: SPARQL Translator (BGP, Union, Filter, Optional); RDF 2 JSON Loader; Optimizer; JAQL Query Language (Transform, Filter, Join, Sort, Group, Built-in Functions); JSON Data Model; Map-Reduce Runtime; HDFS; Cluster Deployment and Management.]
JSON
JSON (JavaScript Object Notation) is a lightweight data-interchange format.
It is based on a subset of the JavaScript programming language.
JSON is built on two structures:
– A collection of name/value (key/value) pairs
– An ordered list of values (array)
RDF to JSON
RDF triples:
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
JSON format:
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },
 {s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },
 {s:Albert Einstein, p:wasBornIn, o:Ulm },
 {s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },
 {s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },
 {s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
JSON is built on two structures:
– name/value (key/value) pairs: {s:Albert Einstein}
– list of values (array): [{s:Albert Einstein},{}…]
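The conversion can be sketched in a few lines of Python, assuming the triples are already parsed into (s, p, o) tuples. The to_records helper is illustrative and is not the loader used in the system. Note that strict JSON quotes keys and string values, which the slide notation omits for brevity.

```python
import json

TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
]


def to_records(triples):
    """Turn each (s, p, o) triple into one name/value record."""
    return [{"s": s, "p": p, "o": o} for s, p, o in triples]


records = to_records(TRIPLES)
# ensure_ascii=False keeps non-ASCII literals readable in the output text.
text = json.dumps(records, ensure_ascii=False)
```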
JAQL
JAQL is an open-source language for querying JSON (JavaScript Object Notation) data.
It provides a general parallel data processing platform on Hadoop
Developed by IBM
Basic Idea
SPARQL can be supported on Hadoop by translating queries into JAQL operators
Filter
Transform
Join
Group
Sort
Built-in functions: merge(d1, d2), regex(), etc.
SPARQL to JAQL Transformation
SPARQL Query
PREFIX source:<http://www.mpii.de/yago/resource/>
SELECT ?name ?where
WHERE {
  ?who source:hasWonPrize Nobel Prize in Physics.
  ?who source:isCalled ?name.
  ?who source:wasBornIn ?where.
}
JAQL Query
// read files from HDFS by predicate name
$1 = read(hdfs('source:hasWonPrize'))
  -> filter $.o == "Nobel Prize in Physics"   // select
  -> transform {$.s};                         // project
$2 = read(hdfs('source:isCalled')) -> transform {$.s,$.o};
$3 = read(hdfs('source:wasBornIn')) -> transform {$.s,$.o};
// multi-join
join $1, $2, $3
  where $1.s == $2.s and $2.s == $3.s
  into { name:$2.o, where:$3.o };             // project to ?name ?where
Example record: {s:Albert Einstein, p:isCalled, o:Albert Einstein }
[Figure annotations: numbered markers (1 to 3 and 1 to 4) link parts of the two queries to MapReduce job1, MapReduce job2, MapReduce job3, and MapReduce job4.]
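The filter, transform, and join steps of the JAQL query can be mimicked locally. The following Python sketch assumes each predicate file is a list of {s, o} records held in memory. FILES, read, and by_subject are stand-ins invented for this example; this is not JAQL and it runs no MapReduce jobs.

```python
# Stand-in for per-predicate files on HDFS (assumed contents).
FILES = {
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "isCalled": [
        {"s": "Albert Einstein", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "o": "阿尔伯特•爱因斯坦"},
    ],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
}


def read(name):
    return FILES[name]


def by_subject(rows):
    """Hash rows on their subject so the join can look matches up directly."""
    index = {}
    for row in rows:
        index.setdefault(row["s"], []).append(row)
    return index


# $1: filter (select), then transform (project) to the subject only.
d1 = [{"s": r["s"]} for r in read("hasWonPrize")
      if r["o"] == "Nobel Prize in Physics"]
# $2 and $3: project subject and object.
d2 = by_subject(read("isCalled"))
d3 = by_subject(read("wasBornIn"))

# Multi-way equi-join on s, projected to ?name and ?where.
result = [
    {"name": r2["o"], "where": r3["o"]}
    for r1 in d1
    for r2 in d2.get(r1["s"], [])
    for r3 in d3.get(r1["s"], [])
]
```

Applying the filter before the join keeps the joined inputs small, which matters more when each step is a separate job over large files.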
Data storage
In the Hadoop framework:
– A file is the smallest unit of input to a MapReduce job and is read from disk.
One straightforward strategy is to store all the data in one file:
– Every read operation must then scan the entire data set.
Data Partitioning Strategy
Data Partitioning Strategy
– Horizontal partitioning
– Vertical partitioning
– Clustered property partitioning
Horizontal partitioning with JSON
For example:
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing
Store in HDFS:
File 1 File name: Hash(Subject1)
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: Hash(Subject2)
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai }]
File 3 File name: Hash(Subject3)
[{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
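Horizontal partitioning groups triples by a hash of the subject. A minimal Python sketch follows, assuming MD5 as the hash function and a "part-" file-name prefix; the slides do not say which hash or naming scheme is used, so both are placeholders.

```python
import hashlib
from collections import defaultdict

TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]


def partition_name(subject):
    """File name derived from Hash(subject); MD5 is an assumed choice."""
    return "part-" + hashlib.md5(subject.encode("utf-8")).hexdigest()


def horizontal_partition(triples):
    """Place every triple in the file named after the hash of its subject."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[partition_name(s)].append({"s": s, "p": p, "o": o})
    return dict(files)


files = horizontal_partition(TRIPLES)
```

All triples of one subject land in the same file, so a lookup with a known subject touches exactly one file.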
Vertical Partitioning with JSON
For example:
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing
Store in HDFS:
File 1 File name: isCalled
[{s:Albert Einstein, o:Albert Einstein },{s:Albert Einstein, o:阿尔伯特•爱因斯坦 }]
File 2 File name: wasBornIn
[{s:Albert Einstein, o:Ulm },{s:Charles K. Kao , o:Shanghai},{s:Faye Wong, o:Beijing}]
File 3 File name: wasBornOnDate
[{s:Albert Einstein, o:1879-03-14 }]
File 4 File name: hasWonPrize
[{s:Albert Einstein, o:Nobel Prize in Physics },{s:Charles K. Kao , o:Nobel Prize in Physics },{s:Faye Wong, o:MTV Video Music Awards }]
File 5 File name: diedOnDate
[{s:Albert Einstein, o:1955-04-18 }]
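Vertical partitioning groups triples by predicate, so the predicate becomes the file name and only s and o are stored. A minimal Python sketch, with vertical_partition as an illustrative name:

```python
from collections import defaultdict

TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]


def vertical_partition(triples):
    """One file per predicate; records keep only subject and object."""
    files = defaultdict(list)
    for s, p, o in triples:
        files[p].append({"s": s, "o": o})
    return dict(files)


files = vertical_partition(TRIPLES)
```

A triple pattern with a known predicate then reads a single file instead of the whole data set.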
Clustered property partitioning with JSON
For example:
Albert Einstein isCalled Albert Einstein
Albert Einstein isCalled 阿尔伯特•爱因斯坦
Albert Einstein wasBornIn Ulm
Albert Einstein wasBornOnDate 1879-03-14
Albert Einstein hasWonPrize Nobel Prize in Physics
Albert Einstein diedOnDate 1955-04-18
Charles K. Kao hasWonPrize Nobel Prize in Physics
Charles K. Kao wasBornIn Shanghai
Faye Wong hasWonPrize MTV Video Music Awards
Faye Wong wasBornIn Beijing
Store in HDFS:
File 1 File name: cluster1
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: cluster2
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai },{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
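The slides do not state the clustering criterion. One reading that reproduces the two files in this example is to group subjects that carry the same set of properties. The Python sketch below implements that assumed criterion only; it is not claimed to be the authors' algorithm.

```python
from collections import defaultdict

TRIPLES = [
    ("Albert Einstein", "isCalled", "Albert Einstein"),
    ("Albert Einstein", "isCalled", "阿尔伯特•爱因斯坦"),
    ("Albert Einstein", "wasBornIn", "Ulm"),
    ("Albert Einstein", "wasBornOnDate", "1879-03-14"),
    ("Albert Einstein", "hasWonPrize", "Nobel Prize in Physics"),
    ("Albert Einstein", "diedOnDate", "1955-04-18"),
    ("Charles K. Kao", "hasWonPrize", "Nobel Prize in Physics"),
    ("Charles K. Kao", "wasBornIn", "Shanghai"),
    ("Faye Wong", "hasWonPrize", "MTV Video Music Awards"),
    ("Faye Wong", "wasBornIn", "Beijing"),
]


def clustered_partition(triples):
    """Group subjects with identical property sets (assumed criterion)."""
    # Step 1: collect the set of properties used by each subject.
    props = defaultdict(set)
    for s, p, _ in triples:
        props[s].add(p)
    # Step 2: name one cluster per distinct property set, in first-seen order.
    names = {}
    cluster_of = {}
    for s, _, _ in triples:
        key = frozenset(props[s])
        if key not in names:
            names[key] = f"cluster{len(names) + 1}"
        cluster_of[s] = names[key]
    # Step 3: write each triple to its subject's cluster file.
    files = defaultdict(list)
    for s, p, o in triples:
        files[cluster_of[s]].append({"s": s, "p": p, "o": o})
    return dict(files)


files = clustered_partition(TRIPLES)
```

On the example data this puts Albert Einstein in cluster1 and both Charles K. Kao and Faye Wong in cluster2, since the latter two share the properties hasWonPrize and wasBornIn.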
Partition Index: Vertical Partitioning
Inverted index on s
s | File list
Albert Einstein | isCalled, wasBornIn, wasBornOnDate, hasWonPrize, diedOnDate
…
Inverted index on o
o | File list
Albert Einstein | isCalled
…
File 1 File name: isCalled
[{s:Albert Einstein, o:Albert Einstein },{s:Albert Einstein, o:阿尔伯特•爱因斯坦 }]
File 2 File name: wasBornIn
[{s:Albert Einstein, o:Ulm },{s:Charles K. Kao , o:Shanghai},{s:Faye Wong, o:Beijing}]
File 5 File name: diedOnDate
[{s:Albert Einstein, o:1955-04-18 }]
File 3 File name: wasBornOnDate
[{s:Albert Einstein, o:1879-03-14 }]
File 4 File name: hasWonPrize
[{s:Albert Einstein, o:Nobel Prize in Physics },{s:Charles K. Kao , o:Nobel Prize in Physics },{s:Faye Wong, o:MTV Video Music Awards }]
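An inverted index maps each subject or object value to the list of files that contain it. A minimal Python sketch over the five predicate files above, with build_index as an illustrative helper (the slides do not describe how the index is built or stored):

```python
from collections import defaultdict

# The five predicate files from the vertical partitioning example.
FILES = {
    "isCalled": [
        {"s": "Albert Einstein", "o": "Albert Einstein"},
        {"s": "Albert Einstein", "o": "阿尔伯特•爱因斯坦"},
    ],
    "wasBornIn": [
        {"s": "Albert Einstein", "o": "Ulm"},
        {"s": "Charles K. Kao", "o": "Shanghai"},
        {"s": "Faye Wong", "o": "Beijing"},
    ],
    "wasBornOnDate": [{"s": "Albert Einstein", "o": "1879-03-14"}],
    "hasWonPrize": [
        {"s": "Albert Einstein", "o": "Nobel Prize in Physics"},
        {"s": "Charles K. Kao", "o": "Nobel Prize in Physics"},
        {"s": "Faye Wong", "o": "MTV Video Music Awards"},
    ],
    "diedOnDate": [{"s": "Albert Einstein", "o": "1955-04-18"}],
}


def build_index(files, field):
    """Map each value of `field` ('s' or 'o') to the files that contain it."""
    index = defaultdict(list)
    for name, records in files.items():
        for record in records:
            if name not in index[record[field]]:
                index[record[field]].append(name)
    return dict(index)


s_index = build_index(FILES, "s")
o_index = build_index(FILES, "o")
```

With the predicate already encoded in the file name, only the s and o indexes are needed for this layout.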
Partition Index: Horizontal partitioning
Inverted index on p
p | File list
isCalled | Hash(Subject1)
…
Inverted index on o
o | File list
Nobel Prize in Physics | Hash(Subject1), Hash(Subject2)
…
File 1 File name: Hash(Subject1)
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: Hash(Subject2)
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai }]
File 3 File name: Hash(Subject3)
[{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
Partition Index: Clustered property partitioning
File 1 File name: cluster1
[{s:Albert Einstein, p:isCalled, o:Albert Einstein },{s:Albert Einstein, p:isCalled, o:阿尔伯特•爱因斯坦 },{s:Albert Einstein, p:wasBornIn, o:Ulm },{s:Albert Einstein, p:wasBornOnDate, o:1879-03-14 },{s:Albert Einstein, p:hasWonPrize, o:Nobel Prize in Physics },{s:Albert Einstein, p:diedOnDate, o:1955-04-18 }]
File 2 File name: cluster2
[{s:Charles K. Kao , p:hasWonPrize, o:Nobel Prize in Physics },{s:Charles K. Kao , p:wasBornIn, o:Shanghai },{s:Faye Wong, p:hasWonPrize, o:MTV Video Music Awards },{s:Faye Wong, p:wasBornIn, o:Beijing}]
Inverted index on p
p | File list
isCalled | cluster1
…
Inverted index on o
o | File list
Albert Einstein | cluster1
…
Inverted index on s
s | File list
Albert Einstein | cluster1
Charles K. Kao | cluster2
Faye Wong | cluster2
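At query time, the indexes can prune the files a triple pattern has to read. The Python sketch below intersects the file lists of every constant in a pattern. The index contents are written out by hand from the two cluster files in this example, and files_for is a hypothetical helper, not part of the described system.

```python
# Inverted indexes for the clustered layout, derived from the example files.
ALL_FILES = {"cluster1", "cluster2"}
S_INDEX = {
    "Albert Einstein": {"cluster1"},
    "Charles K. Kao": {"cluster2"},
    "Faye Wong": {"cluster2"},
}
P_INDEX = {
    "isCalled": {"cluster1"},
    "wasBornOnDate": {"cluster1"},
    "diedOnDate": {"cluster1"},
    "wasBornIn": {"cluster1", "cluster2"},
    "hasWonPrize": {"cluster1", "cluster2"},
}
O_INDEX = {
    "Albert Einstein": {"cluster1"},
    "阿尔伯特•爱因斯坦": {"cluster1"},
    "Ulm": {"cluster1"},
    "1879-03-14": {"cluster1"},
    "1955-04-18": {"cluster1"},
    "Nobel Prize in Physics": {"cluster1", "cluster2"},
    "Shanghai": {"cluster2"},
    "MTV Video Music Awards": {"cluster2"},
    "Beijing": {"cluster2"},
}


def files_for(pattern):
    """Return the files that can contain matches for one triple pattern."""
    candidates = set(ALL_FILES)
    for term, index in zip(pattern, (S_INDEX, P_INDEX, O_INDEX)):
        if not term.startswith("?"):  # constants narrow the candidate set
            candidates &= index.get(term, set())
    return candidates
```

For example, files_for(("?who", "wasBornIn", "Beijing")) needs only cluster2, while a pattern made entirely of variables still has to read every file.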
Experiments
Dataset: Billion Triples Challenge 2010 (BTC10)
– 3.2B <s, p, o, q> quads, 624 GB
– The resulting dataset has 1,426,823,976 unique triples
Environment
– Hadoop 0.20.2
– Ubuntu 10.04, Linux 2.6.32-24-server, 64-bit
– 30 nodes: one node is the master, and the others are slaves
– 47 GB memory, 4.3 TB disk space, and 24 processors (Intel(R) Xeon(R) CPU E5645 @ 2.40GHz)
– "dfs.replication" is 2
– JAQL 0.5.1
– Java 1.6
Experiments
Fig. Distribution of data
Experiments
Fig. Cost time of each query
Conclusion
Solution for SPARQL queries in MapReduce
– Transform the queries into JAQL operators running on Hadoop
Transformation of SPARQL to JAQL
– Filter, Transform, Join, …
Data partitioning strategies
– Horizontal partitioning
– Vertical partitioning
– Clustered property partitioning
Experimental results
– Clustered property partitioning performs best
– Horizontal partitioning performs worst
Scalability
RDBMS
– Waits and deadlocks increase nonlinearly with the size of the transactions and concurrency
– Scale-up (vertical scaling): commercial RDBMSes are very expensive
– Schema: structured data
MapReduce
– Linear scaling, high throughput
– Scale-out (horizontal scaling)
– Schema-free: unstructured data
RDBMS vs. MapReduce
 | Traditional RDBMS | MapReduce
Data size | Gigabytes | Petabytes
Access | Interactive and batch | Batch
Updates | Read and write many times | Write once, read many times
Structure | Static schema | Dynamic schema
Integrity | High | Low
Scaling | Nonlinear | Linear
Table: RDBMS compared to MapReduce
Limits of Hadoop
The Apache Hadoop MapReduce framework has hit a scalability limit around 4,000 machines
The MapReduce JobTracker needs a drastic overhaul to address several deficiencies in its scalability, memory consumption, threading-model, reliability and performance
The Next Generation of Apache Hadoop MapReduce
Divide the two major functions of the JobTracker, resource management and job scheduling/monitoring, into separate components.
ResourceManager ApplicationMaster
Reliability
Availability
Scalability–beyond 10,000 machines
Backward (and Forward) Compatibility
Evolution –for customers to control upgrades
Predictable Latency
Cluster utilization
Conclusion
Hadoop (MapReduce)
– Pro.
  Scalable
  High throughput
– Con.
  Comes at the expense of latency
  No index
  No more than 4000 nodes
SPARQL on Cloud
– Pro.
  Scalable
  High throughput
– Con.
  Comes at the expense of latency
  Complex queries: JAQL join operation
Thanks!
SPARQL queries
Q1: select ?X ?Y where { ?X rdfs:label Albert Einstein. ?X smc:page ?Y. ?X rdf:type smc:Subject. }
Q2: select ?x ?y ?z where { dbsc:Ulm rdf:type ?x. ?x rdfs:label ?y. ?x rdfs:comment ?z. }
Q3: select ?who ?Y ?date1 ?Z ?date2 ?prize where { ?who source:bornIn ?Y. ?who source:bornOnDate ?date1. ?who source:diedIn ?Z. ?who source:diedOnDate ?date2. ?who source:hasWonPrize ?prize. }
Q4: select ?x ?author ?title where { ?x purl:hasAuthor ?author. ?x purl:hasBooktitle ISWC 2009. ?x purl:hasTitle ?title. }
Q5: select distinct ?name ?lat ?long ?pop where { ?a property:name ?name. ?a property:region dbsc:Nord-Pas-de-Calais. ?a pos:lat ?lat. ?a pos:long ?long. ?a property:population ?pop. }
SPARQL queries
Q6: select ?bn ?b ?p where { ?a property:name ?bn. ?a property:dateOfBirth ?b. ?a property:placeOfBirth ?p. }
Q7: select ?Y ?type ?prize where { source:Albert_Einstein source:bornIn ?Y. source:Albert_Einstein rdf:type ?type. source:Albert_Einstein source:hasWonPrize ?prize. }
Q8: select ?a ?type ?pub where { ?a rdf:type ?type. ?a semweb:publisher ?pub. ?a semweb:periodical_title Theory of Computing Systems. }
Q9: select distinct ?a ?lat ?long ?pop where { ?a geo:ontology#name Chevilly. ?a geo:ontology#inCountry geo:countries#FR. ?a pos:lat ?lat. ?a pos:long ?long. ?a geo:ontology#population ?pop. }
Q10: select distinct ?l ?long ?lat where { ?a property:placeOfBirth ?l. ?l pos:lat ?lat. ?l pos:long ?long. }
SPARQL queries
Q3 and Q10 are star-join queries with popular predicates and unspecified objects.
Q1, Q4, Q5, Q6, Q8, and Q9 are also star joins, but with one or more known objects.
Q2 is a chain query.
In Q7, the subject is a fixed value rather than a variable.