Infovore: An Open Source MapReduce Framework For Processing Graph Data
Infovore, an Open-Source Map/Reduce Framework For Processing Graph Data
Paul Houle
Ontology2
2+ billion facts, 20+ GB!
the data your project needs
Why handle complete data sets?
Quality Perimeter
Infovore
RDF Tools vs. Invalid Triples
Image cc-by from arj03
Scaling Limits of Triple Stores
[Diagram: multiple CPUs, Main Memory, Random-access bottleneck, Hard Drive or Flash Storage]
Map/Reduce conserves memory!
Image cc-by-sa from Anua22a
Partitioning Data
md5("http://dbpedia.org/resource/Tree") = b78f8f508982ceb4e8dd3510fac75f62
330 331 332 333 334 335 …
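The partitioning idea can be sketched in a few lines. This is a minimal sketch, not Infovore's actual scheme: it assumes the partition number is derived from the MD5 digest of the subject URI, reduced modulo a fixed partition count. The name `partition_for` and the count of 1024 are hypothetical; the slides do not state either.

```python
import hashlib

NUM_PARTITIONS = 1024  # assumed value; the slides do not give the partition count


def partition_for(subject_uri: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a subject URI to a partition number via its MD5 digest."""
    digest = hashlib.md5(subject_uri.encode("utf-8")).digest()
    # Treat the first 4 bytes as an unsigned big-endian integer, then reduce it.
    return int.from_bytes(digest[:4], "big") % num_partitions


if __name__ == "__main__":
    uri = "http://dbpedia.org/resource/Tree"
    print(hashlib.md5(uri.encode("utf-8")).hexdigest())
    print(partition_for(uri))
```

Because the partition depends only on the subject, every triple about the same subject lands in the same partition, so each partition can be processed independently.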
If you really try it…
330 331 332
333
334 335 …
Preprocessing Freebase
• Expand prefixes
• Remove
  • fbase:type.type.instance
  • fbase:type.type.expected_by
  • rdf:type w/ fbase:* subject
• Reverse
  • fbase:type.permission.controls
  • fbase:dataworld.gardening_hint.replaced_by
• Rewrite
  • fbase:type.object.type to rdf:type
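The remove, reverse, and rewrite rules above can be sketched as a single streaming filter. This is an illustration under stated assumptions, not Infovore's code: triples are modeled as plain `(subject, predicate, object)` string tuples with prefixes left unexpanded, the property names are written in the forms `rdf:type` and `fbase:dataworld.gardening_hint.replaced_by` (an assumption about what the slide intends), a reversed triple keeps its predicate, and the `rdf:type` removal rule is applied to input triples only. The function name `preprocess` is hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

# Rule tables taken from the slide; the spellings are assumptions (see lead-in).
REMOVE = {"fbase:type.type.instance", "fbase:type.type.expected_by"}
REVERSE = {
    "fbase:type.permission.controls",
    "fbase:dataworld.gardening_hint.replaced_by",
}
REWRITE = {"fbase:type.object.type": "rdf:type"}


def preprocess(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Apply the remove / reverse / rewrite rules to a stream of triples."""
    for s, p, o in triples:
        if p in REMOVE:
            continue  # drop the triple entirely
        if p == "rdf:type" and s.startswith("fbase:"):
            continue  # drop incoming rdf:type triples with an fbase:* subject
        if p in REVERSE:
            yield (o, p, s)  # swap subject and object (predicate kept: assumption)
            continue
        yield (s, REWRITE.get(p, p), o)  # rename the predicate if a rule matches
```

Each rule looks at one triple at a time, so the step needs no shared state and fits a map-only job.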
Parallel Super Eyeball
sort | uniq
:Surgeon a :Occupation .
:Surgeon rdfs:label "Surgeon"@en .
:Surgeon :mustHave :Md .
:Tree a :Plant .
:Tree rdfs:label "Tree"@en .
:Tree :has :Leaves .
:Victory a :AbstractConcept .
:Victory rdfs:label "Victory" .
:Victory :emotionalTone :Positive .
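The effect of `sort | uniq` followed by per-subject grouping can be sketched as below. This is a small in-memory stand-in, not the Parallel Super Eyeball implementation: it assumes one triple per non-empty line with the subject as the first whitespace-delimited token, and the helper names `subject_of`, `sort_uniq`, and `group_by_subject` are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def subject_of(line: str) -> str:
    """Return the first whitespace-delimited token (assumed to be the subject)."""
    return line.split(None, 1)[0]


def sort_uniq(lines: Iterable[str]) -> List[str]:
    """In-memory stand-in for `sort | uniq`: drop duplicates, then sort."""
    return sorted(set(lines))


def group_by_subject(sorted_lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (subject, triples) for each run of lines sharing a subject."""
    for subject, group in groupby(sorted_lines, key=subject_of):
        yield subject, list(group)


if __name__ == "__main__":
    lines = [
        ":Tree a :Plant .",
        ":Surgeon a :Occupation .",
        ":Tree :has :Leaves .",
        ":Tree a :Plant .",  # duplicate, removed by sort_uniq
    ]
    for subject, triples in group_by_subject(sort_uniq(lines)):
        print(subject, len(triples))
```

Once the input is sorted, `group_by_subject` holds only one subject's triples at a time, which is why the grouping step itself stays within main memory regardless of the total data size.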
Huge scalability…
:Tree
:Victory
:Surgeon
Main memory
Pig, Hadoop and All That…
Source: http://www.dbis.informatik.hu-berlin.de/forschung/projekte/query-optimization-in-rdf-databases.html
Monitoring for Quality Control
Operational Statistics (RDF)
Preprocess Partition Clean Sort Classify Filter
:basekb
Parallel Loading into Triple Stores
330 331 332 333 334 335 …
OpenLink Virtuoso: 4x Speedup
:basekb lite
:Freebase
:Chosenfacts
:Rulebox
:Chosentopics
rdf diff
See for yourself
https://github.com/paulhoule/infovore/wiki