Infovore: An Open Source MapReduce Framework For Processing Graph Data

Paul Houle, Ontology2

description

This talk describes Infovore, a tool that uses the Map/Reduce approach to clean up, filter, and combine RDF data sets to deliver purpose-built data sets for practical consumers of linked data.

Transcript of Infovore: An Open Source MapReduce Framework For Processing Graph Data

Page 1: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Infovore, an Open-Source Map/Reduce Framework For Processing Graph Data

Paul Houle

Ontology2

Page 2: Infovore: An Open Source MapReduce Framework For Processing Graph Data
Page 3: Infovore: An Open Source MapReduce Framework For Processing Graph Data
Page 4: Infovore: An Open Source MapReduce Framework For Processing Graph Data
Page 5: Infovore: An Open Source MapReduce Framework For Processing Graph Data

2+ billion facts, 20+ GB!

Page 6: Infovore: An Open Source MapReduce Framework For Processing Graph Data

the data your project needs

Page 7: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Why handle complete data sets?

Quality Perimeter

Infovore

Page 8: Infovore: An Open Source MapReduce Framework For Processing Graph Data

RDF Tools vs. Invalid Triples

Image cc-by from arj03
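
One way to keep invalid triples from stopping a whole job is to test each line on its own and set the failures aside. The sketch below assumes line-oriented N-Triples input; the regular expression is a deliberately crude approximation and not a full parser, and `split_valid` is a hypothetical name, not part of Infovore.

```python
import re

# Crude approximation of one N-Triples statement; NOT a complete grammar.
IRI = r"<[^<>\s]*>"
BNODE = r"_:\S+"
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^' + IRI + r")?"
TRIPLE = re.compile(
    rf"^\s*({IRI}|{BNODE})\s+({IRI})\s+({IRI}|{BNODE}|{LITERAL})\s*\.\s*$"
)

def split_valid(lines):
    """Separate lines that look like triples from lines that do not."""
    good, bad = [], []
    for line in lines:
        (good if TRIPLE.match(line) else bad).append(line)
    return good, bad
```

Rejected lines are kept rather than discarded, so they can be counted and inspected later.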

Page 9: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Scaling Limits of Triple Stores

[Diagram: six CPUs, main memory, and hard drive or flash storage, with a random-access bottleneck]

Page 10: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Map/Reduce conserves memory!

Image cc-by-sa from Anua22a

Page 11: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Partitioning Data

md5("http://dbpedia.org/resource/Tree") = b78f8f508982ceb4e8dd3510fac75f62

… 330 331 332 333 334 335 …
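
The idea on this slide can be sketched as hashing the subject URI and mapping the digest to a numbered bin. The modulo step and the partition count of 1024 are assumptions for illustration; the slides do not say how Infovore derives the bin number from the hash.

```python
import hashlib

def partition_for(subject_uri: str, num_partitions: int = 1024) -> int:
    """Map a subject URI to a partition number via its MD5 digest.

    num_partitions and the modulo mapping are illustrative assumptions.
    """
    digest = hashlib.md5(subject_uri.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions
```

Because the bin depends only on the subject, every triple about the same subject lands in the same partition.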

Page 12: Infovore: An Open Source MapReduce Framework For Processing Graph Data

If you really try it…

… 330 331 332 333 334 335 …

Page 13: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Preprocessing Freebase

• Expand prefixes
• Remove
    • fbase:type.type.instance
    • fbase:type.type.expected_by
    • rdf:type with fbase:* subject
• Reverse
    • fbase:type.permission.controls
    • fbase:dataworld.gardening_hint.replaced_by
• Rewrite
    • fbase:type.object.type to rdf:type
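
The four rule kinds above can be sketched as a per-triple function. Everything here is an illustration under assumptions: the namespace IRI bound to `fbase:`, the treatment of the slide's type predicate as the standard `rdf:type` IRI, and the choice to keep the predicate unchanged when reversing. Only one reversed predicate is listed; others from the slide would be added the same way.

```python
FBASE = "http://rdf.freebase.com/ns/"  # assumed namespace for the fbase: prefix
PREFIXES = {
    "fbase:": FBASE,
    "rdf:": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}
RDF_TYPE = PREFIXES["rdf:"] + "type"

REMOVE = {FBASE + "type.type.instance", FBASE + "type.type.expected_by"}
REVERSE = {FBASE + "type.permission.controls"}
REWRITE = {FBASE + "type.object.type": RDF_TYPE}

def expand(term):
    """Replace a known prefix with its namespace IRI (terms only, naive)."""
    for prefix, namespace in PREFIXES.items():
        if term.startswith(prefix):
            return namespace + term[len(prefix):]
    return term

def preprocess(triple):
    """Return the rewritten (s, p, o) tuple, or None if the triple is dropped."""
    s, p, o = (expand(t) for t in triple)
    if p in REMOVE:
        return None
    if p == RDF_TYPE and s.startswith(FBASE):
        return None  # existing type triples on Freebase subjects are removed
    if p in REVERSE:
        s, o = o, s  # assumption: swap subject and object, keep the predicate
    p = REWRITE.get(p, p)
    return (s, p, o)
```

The removal check runs before the rewrite, so a type triple produced by the rewrite rule is not itself dropped.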

Page 14: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Parallel Super Eyeball

Page 15: Infovore: An Open Source MapReduce Framework For Processing Graph Data

sort | uniq

:Surgeon a :Occupation .
:Surgeon rdfs:label "Surgeon"@en .
:Surgeon :mustHave :Md .

:Tree a :Plant .
:Tree rdfs:label "Tree"@en .
:Tree :has :Leaves .

:Victory a :AbstractConcept .
:Victory rdfs:label "Victory" .
:Victory :emotionalTone :Positive .
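
The effect of `sort | uniq` on one-triple-per-line text can be sketched in a few lines: sorting puts identical lines next to each other, so duplicates are dropped by comparing each line with its predecessor. This in-memory version is only an illustration; at the data sizes discussed in this talk the sort would have to be external or distributed.

```python
def sort_uniq(lines):
    """Yield the distinct lines in sorted order, like `sort | uniq`."""
    previous = None
    for line in sorted(lines):
        if line != previous:
            yield line
        previous = line
```

A side effect of sorting on the whole line is that all triples sharing a subject end up adjacent.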

Page 16: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Huge scalability…

:Tree

:Victory

:Surgeon

Main memory
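
Once the input is sorted by subject, a job only has to hold one subject's triples in main memory at a time. A minimal sketch, assuming non-empty whitespace-separated lines whose first token is the subject (`subject_of` and `by_subject` are hypothetical helper names):

```python
from itertools import groupby

def subject_of(line):
    """First whitespace-separated token of a triple line."""
    return line.split(None, 1)[0]

def by_subject(sorted_lines):
    """Stream (subject, triples) groups from input already sorted by subject."""
    for subject, group in groupby(sorted_lines, key=subject_of):
        yield subject, list(group)
```

Memory use is bounded by the largest single subject rather than by the size of the data set.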

Page 17: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Pig, Hadoop and All That…

Source: http://www.dbis.informatik.hu-berlin.de/forschung/projekte/query-optimization-in-rdf-databases.html

Page 18: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Monitoring for Quality Control

Operational Statistics (RDF)

Preprocess Partition Clean Sort Classify Filter

Page 19: Infovore: An Open Source MapReduce Framework For Processing Graph Data

:basekb

Page 20: Infovore: An Open Source MapReduce Framework For Processing Graph Data

Parallel Loading into Triple Stores

… 330 331 332 333 334 335 …

OpenLink Virtuoso: 4x speedup
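
Since the partitions are independent files, they can be handed to several loader workers at once. The sketch below shows only the fan-out pattern: `load_partition` is a hypothetical placeholder and does not call any real Virtuoso API, and the worker count is arbitrary rather than derived from the speedup quoted on the slide.

```python
from concurrent.futures import ThreadPoolExecutor

def load_partition(path):
    """Hypothetical placeholder: a real loader would bulk-load `path` here."""
    return path

def load_all(paths, loader=load_partition, workers=4):
    """Run the loader over every partition file using a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(loader, paths))
```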

Page 21: Infovore: An Open Source MapReduce Framework For Processing Graph Data

:basekb lite

:Freebase

:Chosenfacts

:Rulebox

:Chosentopics

Page 22: Infovore: An Open Source MapReduce Framework For Processing Graph Data

rdf diff
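
In its simplest form, a diff between two versions of a data set is a pair of set differences over triples. This sketch assumes one triple per line, an identical serialization in both versions, and no blank nodes; it is not a description of how Infovore implements its diff.

```python
def rdf_diff(old_lines, new_lines):
    """Return (added, removed) triples between two versions of a data set."""
    old, new = set(old_lines), set(new_lines)
    return sorted(new - old), sorted(old - new)
```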

Page 23: Infovore: An Open Source MapReduce Framework For Processing Graph Data

See for yourself

https://github.com/paulhoule/infovore/wiki