Data model for analysis of scholarly documents in the MapReduce paradigm


Page 1

Data model for analysis of scholarly documents in the MapReduce paradigm

Adam Kawa, Lukasz Bolikowski, Artur Czeczko, Piotr Jan Dendek, Dominika Tkaczyk

Centre for Open Science (CeON), ICM UW

Warsaw, July 6, 2012


Page 2

Agenda

1 Problem definition

2 Requirements specification

3 Exemplary solutions based on Apache Hadoop Ecosystem tools


Page 3

The data in our possession

Vast collections of scholarly documents to store

10 million full texts (PDF, plain text)

17 million document metadata records (described in the XML-based BWMeta format)

4 TB of data (10 TB including data archives)


Page 4

The tasks that we perform

Big knowledge to extract and discover

17 million document metadata records (XML)

contain title, subtitles, abstract, keywords, references, contributors and their affiliations, the publishing journal, ...

input for many state-of-the-art machine learning algorithms

relatively simple ones: searching for documents with a given title, finding scientific teams, ...

quite complex ones: author name disambiguation, bibliometrics, classification code assignment, ...


Page 5

The requirements that we have specified

Multiple demands regarding storage and processing of large amounts of data:

scalability and parallelism — easily handle tens of terabytes of data and parallelize the computation effectively

flexible data model — possibility to add or update data, and to enhance its content with implicit information discovered by our algorithms

latency requirements — support for batch offline processing as well as random, real-time read/write requests

availability of many clients — accessible to programmers and researchers with diverse language preferences and expertise

reliability and cost-effectiveness — ideally open-source software that does not require expensive hardware


Page 6

Document-related data as linked data

Information about document-related resources can be naturally described as a directed labeled graph

entities (e.g. documents, contributors, references) are nodes in the graph

relationships between entities are directed labeled edges in the graph


Page 7

Linked graph as a collection of RDF triples

A directed labeled graph can be simply represented as a collection of RDF triples

a triple consists of subject, predicate and object

a triple represents a statement which denotes that a resource (subject) holds a value (object) for some attribute (predicate) of that resource

a triple can represent any statement about any resource
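To make this concrete, here is a minimal sketch in Java of document-related statements expressed as triples (the identifiers and predicate names are hypothetical, not taken from the deck):

    /** Minimal sketch: document-related facts as subject-predicate-object triples. */
    public class TripleExample {

        record Triple(String subject, String predicate, String object) {}

        public static void main(String[] args) {
            // Hypothetical identifiers and predicates, for illustration only.
            Triple[] statements = {
                new Triple("doc:123",   "dc:title",   "A study of citation networks"),
                new Triple("doc:123",   "dc:creator", "person:42"),
                new Triple("person:42", "foaf:name",  "Jane Doe"),
                new Triple("doc:123",   "cites",      "doc:456")
            };
            // Each statement: the subject holds the object as the value of the predicate.
            for (Triple t : statements) {
                System.out.println(t.subject() + " --" + t.predicate() + "--> " + t.object());
            }
        }
    }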


Page 8

Hadoop as a solution for scalability/performance issues

Apache Hadoop is the most commonly used open-source solution for storing and processing big data in a reliable, high-performance and cost-effective way.

Scalable storage

Parallel processing

Subprojects and many Hadoop-related projects

HDFS — distributed file system that provides high-throughput access to large data

MapReduce — framework for distributed processing of large data sets (Java and e.g. JavaScript, Python, Perl, Ruby via Streaming)

HBase — scalable, distributed data store with flexible schema, random read/write access and fast scans

Pig/Hive — higher-level abstractions on top of MapReduce (simple data manipulation languages)


Page 9

Apache Hadoop Ecosystem tools as RDF triple stores

SHARD [3] — a Hadoop-backed RDF triple store

stores triples in flat files in HDFS

data cannot be modified randomly

less efficient for queries that require the inspection of only a small number of triples

PigSPARQL [6] — translates SPARQL queries into Pig Latin programs and runs them on a Hadoop cluster

stores RDF triples with the same predicate in separate, flat files in HDFS

H2RDF [5] — an RDF store that combines MapReduce with HBase

stores triples in HBase using three flat-wide tables

Jena-HBase [4] — an HBase-backed RDF triple store

provides six different pluggable HBase storage layouts


Page 10

HBase as storage layer for RDF triples

Storing RDF triples in Apache HBase has several advantages

flexible data model — columns can be dynamically added and removed;multiple versions of data in a particular cell; data serialized to a byte array

random read and write — more suitable for semi-structured RDF data than HDFS, where files cannot be modified randomly and usually a whole file must be read sequentially to find a subset of records

availability of many clients

interactive clients — native Java API, REST or Apache Thrift

batch clients — MapReduce (Java), Pig (Pig Latin) and Hive (HiveQL)

automatically sorted records — quick lookups and partial scans; joins asfast (linear) merge-joins
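As a sketch of such random access, assuming the 2012-era HBase Java client and a hypothetical table named triples with a column family p:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRandomAccess {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "triples");  // hypothetical table name

            // Random write: a dynamically added column, no schema change needed.
            Put put = new Put(Bytes.toBytes("doc:123"));
            put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:title"),
                    Bytes.toBytes("A study of citation networks"));
            table.put(put);

            // Random read: fetch a single row back by its key.
            Result row = table.get(new Get(Bytes.toBytes("doc:123")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("p"), Bytes.toBytes("dc:title"))));

            table.close();
        }
    }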


Page 11

Exemplary HBase schema — Flat-wide layout

Advantages

no prior knowledge about data is required

colocation of all information about a resource within a single row

support of multi-valued properties

support of reified statements (statements about statements)

Disadvantages

unbounded number of columns

increased storage space
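A sketch of a write under this layout (table and family names are hypothetical): the subject becomes the row key and each predicate becomes a dynamically created column holding the object:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Flat-wide layout: row key = subject, column qualifier = predicate, cell = object. */
    public class FlatWideStore {
        static void store(HTable triples, String s, String p, String o) throws Exception {
            Put put = new Put(Bytes.toBytes(s));
            // Columns are created on the fly, so no prior knowledge of predicates is
            // needed, but a row may accumulate an unbounded number of columns.
            put.add(Bytes.toBytes("p"), Bytes.toBytes(p), Bytes.toBytes(o));
            triples.put(put);
        }
    }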


Page 12

Exemplary HBase schema — Vertically Partitioned layout [1]

Advantages

support of multi-valued properties

support of reified statements (statements about statements)

storage space savings when compared to the previous layout

first-step (predicate-bound) pairwise joins as fast merge-joins

Disadvantages

increased number of joins
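A sketch of the same store operation under this layout (names hypothetical): one pre-created table per predicate, with the object kept as the column qualifier so that multi-valued properties become multiple columns of one row:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Vertically partitioned layout: one table per predicate, row key = subject. */
    public class VerticallyPartitionedStore {
        static void store(Configuration conf, String s, String p, String o) throws Exception {
            // Each predicate maps to its own table, e.g. "dc:title" -> "dc_title".
            HTable table = new HTable(conf, p.replace(':', '_'));
            Put put = new Put(Bytes.toBytes(s));
            // The object is the qualifier; several objects for one (s, p) coexist in one row.
            put.add(Bytes.toBytes("o"), Bytes.toBytes(o), new byte[0]);
            table.put(put);
            table.close();
        }
    }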


Page 13

Exemplary HBase schema — Hexastore layout [2]

Advantages

support of multi-valued properties

support of reified statements (statements about statements)

first-step pairwise joins as fast merge-joins

Disadvantages

increased number of joins

increased storage space

more complicated update operations
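A sketch of the write path under this layout (names hypothetical): every triple is indexed six times, once per ordering of subject, predicate and object, which explains both the storage overhead and the update cost:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Hexastore layout: six index tables, one per ordering of (s, p, o). */
    public class HexastoreStore {
        // Expected order of the index tables: spo, sop, pso, pos, osp, ops.
        static void store(HTable[] indexes, String s, String p, String o) throws Exception {
            String[] keys = {
                s + "|" + p + "|" + o,   // spo
                s + "|" + o + "|" + p,   // sop
                p + "|" + s + "|" + o,   // pso
                p + "|" + o + "|" + s,   // pos
                o + "|" + s + "|" + p,   // osp
                o + "|" + p + "|" + s    // ops
            };
            for (int i = 0; i < keys.length; i++) {
                Put put = new Put(Bytes.toBytes(keys[i]));
                put.add(Bytes.toBytes("t"), Bytes.toBytes("x"), new byte[0]);
                indexes[i].put(put);  // six writes per triple
            }
        }
    }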


Page 14

HBase schema — other layouts

Some derivative and hybrid layouts exist that combine the advantages of the original layouts

a combination of the vertically partitioned and the hexastore layout [4]

a combination of the flat-wide and the vertically partitioned layouts [4]


Page 15

Challenges

a large number of join operations

relatively expensive

practically cannot be avoided (at least for more complex queries)

but specialized join techniques can be used, e.g. multi-join, merge-sort join, replicated join, skewed join

lack of native support for cross-row atomicity (e.g. in the form of transactions)


Page 16

Possible performance optimization techniques

property tables — properties often queried together are stored in the same record for quick access [8, 9]

materialized path expressions — precomputation and materialization of the most commonly used paths through an RDF graph [1, 2]

graph-oriented partitioning scheme [7]

take advantage of the spatial locality inherent in graph pattern matching

higher replication of data that is on the border of any particular partition (however, problematic for a graph that is modified)
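A sketch of the property-table idea (table and predicate names are hypothetical): predicates that are usually queried together, such as a document's title and publication date, share one row of a dedicated table, so one Get replaces a join:

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    /** Property table: co-queried predicates colocated in a single row. */
    public class DocCorePropertyTable {
        static void store(HTable docCore, String doc, String title, String date)
                throws Exception {
            Put put = new Put(Bytes.toBytes(doc));
            // One row answers "title and date of a document" without any join.
            put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:title"), Bytes.toBytes(title));
            put.add(Bytes.toBytes("p"), Bytes.toBytes("dc:date"), Bytes.toBytes(date));
            docCore.put(put);
        }
    }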


Page 17

The ways of processing data from HBase

Various tools are integrated with HBase and can read data from and write data to HBase tables

Java MapReduce

possibility to use our legacy Java code in map and reduce methods

delivers better performance than Apache Pig

Apache Pig

provides common data operations (e.g. filters, unions, joins, ordering) and nested types (e.g. tuples, bags, maps)

supports multiple specialized join implementations

possibility to run MapReduce jobs directly from Pig Latin scripts

can be embedded in Python code

Interactive clients (e.g. Java API, REST or Apache Thrift)

interactive access to a relatively small subset of our data by sending API calls on demand, e.g. from a web-based client
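As a sketch of the batch path, a MapReduce mapper reading rows scanned from an HBase table (the table layout and names are the hypothetical flat-wide ones from the earlier sketches):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;

    /** Emits (title, 1) for every document row scanned from the triples table. */
    public class TitleCountMapper extends TableMapper<Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                throws IOException, InterruptedException {
            byte[] title = row.getValue(Bytes.toBytes("p"), Bytes.toBytes("dc:title"));
            if (title != null) {
                context.write(new Text(Bytes.toString(title)), ONE);
            }
        }
    }

In a full job this mapper would be wired to a table scan with TableMapReduceUtil.initTableMapperJob and paired with an ordinary reducer.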


Page 18

Case study: author name disambiguation algorithm

The most complex algorithm that we have run over Apache HBase so far is the author name disambiguation algorithm.


Page 19

Thanks!

More information about CeON: http://ceon.pl/en/research

© 2012 Adam Kawa. This document is available under the Creative Commons Attribution 3.0 Poland license.

The text of the license is available at: http://creativecommons.org/licenses/by/3.0/pl/


Page 20

[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach. Scalable Semantic Web Data Management Using Vertical Partitioning. In VLDB, pages 411–422, 2007.

[2] C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple Indexing for Semantic Web Data Management. In VLDB, pages 1008–1019, 2008.

[3] K. Rohloff and R. Schantz. High-Performance, Massively Scalable Distributed Systems Using the MapReduce Software Framework: The SHARD Triple-Store. In International Workshop on Programming Support Innovations for Emerging Distributed Applications, 2010.

[4] V. Khadilkar, M. Kantarcioglu, P. Castagna, and B. Thuraisingham. Jena-HBase: A Distributed, Scalable and Efficient RDF Triple Store. Technical report, 2012. http://www.utdallas.edu/~vvk072000/Research/Jena-HBase-Ext/tech-report.pdf

[5] N. Papailiou, I. Konstantinou, D. Tsoumakos, and N. Koziris. H2RDF: Adaptive Query Processing on RDF Data in the Cloud. In Proceedings of the 21st International Conference on World Wide Web (WWW demo track), Lyon, France, 2012.


Page 21

[6] A. Schätzle, M. Przyjaciel-Zablocki, and G. Lausen. PigSPARQL: Mapping SPARQL to Pig Latin. In 3rd International Workshop on Semantic Web Information Management (SWIM 2011), in conjunction with the 2011 ACM International Conference on Management of Data (SIGMOD 2011), Athens, Greece.

[7] J. Huang, D. Abadi, and K. Ren. Scalable SPARQL Querying of Large RDF Graphs. Proceedings of the VLDB Endowment, Volume 4 (VLDB 2011).

[8] K. Wilkinson, C. Sayers, H. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In SWDB, pages 131–150.

[9] K. Wilkinson. Jena Property Table Implementation. In SSWS, 2006.
