Storing RDF Data in Hadoop And Retrieval

Pankil Doshi, Asif Mohammed

Mohammad Farhan Husain, Dr. Latifur Khan

Dr. Bhavani Thuraisingham

Goal

• To build efficient storage using Hadoop for petabytes of data
• To build an efficient query mechanism
• Possible outcomes
– Open source framework for RDF
– Integration with Jena

Possible Approaches

• Store RDF data in HDFS and query through MapReduce programming
– Our current approach
• Store RDF data in HDFS and process queries outside of Hadoop
– Done in the BIOMANTA [1] project; no details available, however
• HBase
– Currently being worked on by another team in the Semantic Web lab

Dataset And Queries

• LUBM [2]
– Dataset generator
– 14 benchmark queries
– Generates data for imaginary universities
– Used by many researchers to compare query execution performance

Our Clusters

• 4-node cluster in Semantic Web lab
• 10-node cluster in SAIAL lab
– 4 GB main memory
– Intel Pentium IV 3.0 GHz processor
– 640 GB hard drive
• OpenCirrus HP Labs test bed
– Sponsor: Andy Seaborne, HP Labs

Tasks Completed/In Progress

• Set up Hadoop cluster
• Generate, preprocess & insert data
• Devise algorithm to produce MapReduce code for a SPARQL query
• Code for 14 queries
• Cascading output of one job to another job as input without using hard disk

Two Storage Approaches

1. Multiple File Approach:
• Dump files as generated by the LUBM generator, possibly merging some
• Each line of a file contains subject, predicate and object

2. Predicate Based Approach:
• Divide files based on predicate
• The file name is the predicate name
• Each line then contains only subject and object
• On average there are about 20 different predicates

Common preprocessing: adding prefixes

    http://www.University10Department5:.... == U10D5:….

Input triples:

    D0U0:Graduate20 ub:type lehigh:GraduateStudent
    D0U0:Graduate20 ub:memberOf lehigh:University0

Example of predicate based file division:

Filename: type
    D0U0:Graduate20 lehigh:GraduateStudent
    ………

Filename: memberOf
    D0U0:Graduate20 lehigh:University0
    ……

Further division by object type:

Filename: type_GraduateStudent
    D0U0:Graduate20
    …

Filename: memberOf_University
    D0U0:Graduate20 lehigh:University0
    …
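The file division above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual preprocessing code: the function names are hypothetical, files are modeled as in-memory lists rather than HDFS files, and the third sample triple (the type of lehigh:University0) is invented so that the object-type suffix can be derived.

```python
from collections import defaultdict


def local_name(term):
    """Drop the prefix from a prefixed name, e.g. 'ub:type' -> 'type'."""
    return term.split(":", 1)[-1]


def split_by_predicate(triples):
    """Group (subject, object) pairs into one 'file' per predicate."""
    files = defaultdict(list)
    for subject, predicate, obj in triples:
        files[local_name(predicate)].append((subject, obj))
    return dict(files)


def split_by_object_type(files):
    """Refine predicate files by the class of the object (assumed rule).

    'type' files become type_<Class> and keep only the subject; other
    files become <predicate>_<Class> when the object's class is known.
    """
    class_of = {s: local_name(o) for s, o in files.get("type", [])}
    refined = defaultdict(list)
    for predicate, pairs in files.items():
        for subject, obj in pairs:
            if predicate == "type":
                refined[f"type_{local_name(obj)}"].append((subject,))
            elif obj in class_of:
                refined[f"{predicate}_{class_of[obj]}"].append((subject, obj))
            else:
                refined[predicate].append((subject, obj))
    return dict(refined)


TRIPLES = [
    ("D0U0:Graduate20", "ub:type", "lehigh:GraduateStudent"),
    ("D0U0:Graduate20", "ub:memberOf", "lehigh:University0"),
    # Hypothetical triple, added so the class of University0 is known.
    ("lehigh:University0", "ub:type", "lehigh:University"),
]

FILES = split_by_predicate(TRIPLES)
REFINED = split_by_object_type(FILES)
```

Here REFINED holds one entry per refined file name, with subject-only rows for the type files and subject-object rows for the others.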

Sample Query:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
    SELECT ?X WHERE {
        ?X rdf:type ub:Publication .
        ?X ub:publicationAuthor D0U0:AssistantProfessor0
    }

Map Function: Check which file (key) the data (value) comes from and filter it according to the query conditions. For example:
• If the data is from file “type_Publication”, output the pair
• If the data is from file “publicationAuthor_*”, look for D0U0:AssistantProfessor0 as the object

Reduce Function: Collect all the values for a key, check them against the conditions, and output the key as the result if they are satisfied.

Example: keep only those results having both ub:Publication and D0U0:AssistantProfessor0.
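The map and reduce functions for the sample query might look roughly like the following plain-Python simulation. It is a sketch under assumptions, not the Hadoop code used in the project: the function names, the "type"/"author" tags, the small in-memory shuffle in run, and the sample records (including the D0U0:Publication0 identifiers) are all hypothetical.

```python
from collections import defaultdict

TARGET_AUTHOR = "D0U0:AssistantProfessor0"


def map_record(filename, line):
    """Emit (subject, tag) pairs, filtering by the file the line came from."""
    fields = line.split()
    if not fields:
        return
    if filename == "type_Publication":
        # Every subject in this file is a ub:Publication.
        yield fields[0], "type"
    elif filename.startswith("publicationAuthor_"):
        # Keep only rows whose object is the requested author.
        if len(fields) == 2 and fields[1] == TARGET_AUTHOR:
            yield fields[0], "author"


def reduce_key(subject, tags):
    """Output the subject only if both query conditions were matched."""
    if {"type", "author"} <= set(tags):
        yield subject


def run(records):
    """Tiny stand-in for the shuffle phase: group map output by key."""
    groups = defaultdict(list)
    for filename, line in records:
        for key, tag in map_record(filename, line):
            groups[key].append(tag)
    return sorted(s for key, tags in groups.items() for s in reduce_key(key, tags))


# Hypothetical input records: (file name, line content).
RECORDS = [
    ("type_Publication", "D0U0:Publication0"),
    ("type_Publication", "D0U0:Publication1"),
    ("publicationAuthor_AssistantProfessor", "D0U0:Publication0 D0U0:AssistantProfessor0"),
    ("publicationAuthor_AssistantProfessor", "D0U0:Publication1 D0U0:AssistantProfessor1"),
]

ANSWER = run(RECORDS)
```

With these records only D0U0:Publication0 satisfies both triple patterns, so it is the single binding for ?X.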

Algorithm

    SELECT ?X, ?Y WHERE {
        1. ?X rdf:type ub:Chair .
        2. ?Y rdf:type ub:Department .
        3. ?X ub:worksFor ?Y .
        4. ?Y ub:subOrganizationOf <http://www.University0.edu>
    }

[Figure: join graph. Nodes are the triple patterns 1 to 4 (1: X; 2: Y; 3: X, Y; 4: Y); each edge is labeled with the variable shared by the two patterns it connects. |E| = 4]

    Variable   Nodes      Joins
    X          1, 3       1-3
    Y          2, 3, 4    2-3, 3-4, 4-2

Job 1 map output keys:
1. Y – 2, 3, 4 (3 joins)
Job 1 joins: 3
1 join left, so one more job is needed
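The Variable / Nodes / Joins table can be derived mechanically from the query. The sketch below shows one way to do it, assuming each triple pattern is reduced to the set of variables it mentions; the function name and the dictionary representation are hypothetical, and counting one edge per shared variable per pair of patterns is an assumption that happens to match this example.

```python
from itertools import combinations


def join_table(patterns):
    """Map each shared variable to (nodes mentioning it, pairwise joins)."""
    by_var = {}
    for node, variables in patterns.items():
        for var in sorted(variables):
            by_var.setdefault(var, []).append(node)
    return {
        var: (nodes, list(combinations(nodes, 2)))
        for var, nodes in sorted(by_var.items())
        if len(nodes) > 1  # a variable in a single pattern needs no join
    }


# Triple patterns 1 to 4 of the example query, as variable sets.
QUERY = {1: {"X"}, 2: {"Y"}, 3: {"X", "Y"}, 4: {"Y"}}

TABLE = join_table(QUERY)
EDGE_COUNT = sum(len(joins) for _, joins in TABLE.values())
```

For the example query, X connects patterns 1 and 3, Y connects patterns 2, 3 and 4 pairwise, and the total number of edges agrees with |E| = 4.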

Algorithm (contd.)

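[Figure: join graph after job 1. Node A(2, 3, 4) carries variables X, Y; node B(1) carries variable X; the edge between A and B is labeled X.]

    Variable   Nodes   Joins
    X          A, B    A-B

Job 2 map output key:
1. X – A, B (1 join)
Job 2 joins: 1
No joins left, so no more jobs are needed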
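The two-job plan above can be reproduced with a small greedy planner. The slides do not spell out the selection rule, so the rule used here is an assumption: in each job, pick the variables that cover the most nodes first, skip any variable whose nodes are already taken in that job, and merge each chosen group into a single node. The function name and the merged-node labels J1, J2 (the slides call them A and B) are hypothetical.

```python
from itertools import count


def plan_jobs(patterns):
    """Return a list of jobs; each job is a list of (variable, joined nodes).

    patterns maps a node id to the set of variables that node carries.
    """
    nodes = {node: set(variables) for node, variables in patterns.items()}
    fresh = count(1)
    jobs = []
    while True:
        # Group the current nodes by the variables they mention.
        by_var = {}
        for node, variables in nodes.items():
            for var in variables:
                by_var.setdefault(var, []).append(node)
        candidates = {v: ns for v, ns in by_var.items() if len(ns) > 1}
        if not candidates:
            break  # no joins left, no more jobs needed
        used, job = set(), []
        # Assumed greedy rule: most-shared variable first, ties by name.
        for var in sorted(candidates, key=lambda v: (-len(candidates[v]), v)):
            group = candidates[var]
            if used.isdisjoint(group):
                job.append((var, sorted(group, key=str)))
                used.update(group)
        # Replace every joined group by one merged node.
        for var, group in job:
            merged = set().union(*(nodes.pop(n) for n in group))
            nodes[f"J{next(fresh)}"] = merged
        jobs.append(job)
    return jobs


QUERY = {1: {"X"}, 2: {"Y"}, 3: {"X", "Y"}, 4: {"Y"}}
JOBS = plan_jobs(QUERY)
```

For the example query this yields two jobs: the first joins patterns 2, 3 and 4 on Y, and the second joins pattern 1 with the merged node on X.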

Some Query Results

[Figure: query execution times. Horizontal axis: number of triples; vertical axis: time in milliseconds]

Query Preprocessing

• Original query 2:

    ?X rdf:type ub:GraduateStudent .
    ?Y rdf:type ub:University .
    ?Z rdf:type ub:Department .
    ?X ub:memberOf ?Z .
    ?Z ub:subOrganizationOf ?Y .
    ?X ub:undergraduateDegreeFrom ?Y

• Rewritten:

    ?X rdf:type ub:GraduateStudent .
    ?X ub:memberOf_Department ?Z .
    ?Z ub:subOrganizationOf_University ?Y .
    ?X ub:undergraduateDegreeFrom_University ?Y
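The rewriting of query 2 can be expressed as a simple rule, sketched below. The rule is inferred from the example rather than stated in the slides: when a typed variable appears as the object of another pattern, its class is folded into that pattern's predicate and its rdf:type pattern is dropped; a typed variable that never appears as an object keeps its type pattern. The function name and the tuple representation of triple patterns are hypothetical.

```python
def rewrite(patterns):
    """Fold rdf:type constraints on object variables into predicate names."""
    # Class (without prefix) of every variable that has an rdf:type pattern.
    class_of = {s: o.split(":", 1)[-1] for s, p, o in patterns if p == "rdf:type"}
    # Typed variables that occur as the object of a non-type pattern.
    typed_objects = {o for s, p, o in patterns if p != "rdf:type" and o in class_of}
    rewritten = []
    for s, p, o in patterns:
        if p == "rdf:type":
            if s not in typed_objects:
                rewritten.append((s, p, o))  # still needed, e.g. for ?X
        elif o in class_of:
            rewritten.append((s, f"{p}_{class_of[o]}", o))
        else:
            rewritten.append((s, p, o))
    return rewritten


QUERY2 = [
    ("?X", "rdf:type", "ub:GraduateStudent"),
    ("?Y", "rdf:type", "ub:University"),
    ("?Z", "rdf:type", "ub:Department"),
    ("?X", "ub:memberOf", "?Z"),
    ("?Z", "ub:subOrganizationOf", "?Y"),
    ("?X", "ub:undergraduateDegreeFrom", "?Y"),
]

REWRITTEN = rewrite(QUERY2)
```

The rewritten patterns map directly onto the object-typed files (memberOf_Department, subOrganizationOf_University, and so on), which removes two of the six patterns and the joins they would require.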

Parallel Experiment with Pig

• Script for query 2:

    /* Load statements */
    GS = LOAD 'type_GraduateStudent' AS (gs_subject:chararray);
    MO = LOAD 'memberOf_Department' AS (mo_subject:chararray, mo_object:chararray);
    SOF = LOAD 'subOrganizationOf_University' AS (sof_subject:chararray, sof_object:chararray);
    UDF = LOAD 'undergraduateDegreeFrom_University' AS (udf_subject:chararray, udf_object:chararray);

    /* Joins */
    MO_UDF_GS = JOIN GS BY gs_subject, UDF BY udf_subject, MO BY mo_subject PARALLEL 8;
    MO_UDF_GS = FOREACH MO_UDF_GS GENERATE mo_subject, udf_object, mo_object;
    MO_UDF_GS_SOF = JOIN SOF BY (sof_subject, sof_object), MO_UDF_GS BY (mo_object, udf_object);
    MO_UDF_GS_SOF = FOREACH MO_UDF_GS_SOF GENERATE mo_subject, udf_object, mo_object;

    /* Store query answer */
    STORE MO_UDF_GS_SOF INTO 'Query2' USING PigStorage('\t');

Parallel Experiment with Pig

• 2 jobs created for query 2
• For 330 million triples, answers in 20 minutes
– Direct MapReduce approach takes 10 minutes

Future Works

• Run all 14 queries for 100 million, 200 million, …, 1 billion triples and compare with Jena In-Memory, RDB, SDB and TDB models
• Cascading output of one job to another job as input without using hard disk
• Generic MapReduce code
• Proof of algorithm
• Modification of algorithm for queries with optional triple patterns
• Indexing, summary statistics

References

• [1] BIOMANTA: http://www.biomanta.org/
• [2] LUBM: http://swat.cse.lehigh.edu/projects/lubm/
