Download - Beyond Hadoop and MapReduce

BEYOND HADOOP AND MAPREDUCE

1

Alexander Alten-Lorenz mapredit.blogspot.com

http://mapredit.blogspot.com

2

HDFS (NN) small files / large namespaces incompatibility

Hadoop as an Platform is not multi-tenant ready in a secure way

Traditional Hadoop importance (M/R) moves behind the world needs

BigData apps rely more and more on predictive as well as reactive platforms

PaaS and IaaS are not compatible with the concept of Hadoop

WHY WE NEED TO LOOK BEYOND

AVAILABLE DISTRIBUTED FILE SYSTEMS

Beyond Hadoop and MapReduce

3

4

In Memory Distributed Shared Filesystem Written in Java In-memory checkpointing and caching Use underlaying FS to provide fault tolerance Pluggable FS layer

Spark compatible

HDFS / MR / Hive compatible

Native support for raw tables Supports multi-columned tables Memory pinning for certain hot tables

Fast facts:

Web: http://tachyon-project.org

Source: https://github.com/amplab/tachyon

http://tachyon-project.org/



https://amplab.cs.berkeley.edu/software/

5

Distributed file system Written in C++ Provides Object, File and Block storage as well as kernel support (experimental) Metadata principe, offers High Availability and Replication Self healing mechanisms, „no downtime“ mechanisms supports B-tree (incl. k-value) model

Hadoop compatible Native Hadoop integration (hadoop-cephfs.jar) RH Gluster compatible (OpenStack) S3 compatible

Fast facts:

Web: http://ceph.com/docs/next/ Source: https://github.com/ceph/ceph




6

Distributed file system scale-out network-attached storage file system.

=> Write once, read everywhere RDMA interconnect as one large parallel network file system client and server component (glusterfsd / glusterfs ) GlusterFS relies on an elastic hashing algorithm, rather than using either a centralised or distributed metadata model

Hadoop compatible Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop) available

Fast facts:

Web: http://www.gluster.org/ Source: http://download.gluster.org/pub/gluster/glusterfs/3.5/

http://en.wikipedia.org/wiki/Hash_function

https://forge.gluster.org/hadoop/glusterfs-hadoop


http://download.gluster.org/pub/gluster/glusterfs/3.5/

http://download.gluster.org/pub/gluster/glusterfs/3.5/

PAAS / SAAS SOLUTIONS

7

Beyond Hadoop and MapReduce

8

Cluster manager and PaaS layer Provide ressource isolation Provide ressource sharing between applications or frameworks run many different applications on a dynamic shared pool of nodes DCOS (Data Center Operating System)

Support HDFS (DFS Layer) Ceph will be supported soon Tachyon support experimental Runs on local file system too

Spark, Cassandra, Storm, Docker, HDFS, Hadoop support

Fast facts:

Web: http://mesos.apache.org/

DCOS: http://mesosphere.com/

Source: https://github.com/mesos/mesos

Tachyon: https://github.com/mesosphere/tachyon-mesos


http://mesosphere.com/

http://mesosphere.com/

https://github.com/mesos/mesos

https://github.com/mesos/mesos

https://github.com/mesosphere/tachyon-mesos

https://github.com/mesosphere/tachyon-mesos

9

Public / Private cloud and IaaS provider VM based cloud computing Enables horizontal scaling by using unused resources or spinning up new services Wide broaden industry committers Abstract HW layer Supported by RedHat

Fast facts:

Web: http://www.openstack.org/

OpenStack Foundation: http://www.openstack.org/foundation

Source: https://github.com/openstack/openstack


http://www.openstack.org/foundation

http://www.openstack.org/foundation

https://github.com/openstack/openstack

https://github.com/openstack/openstack

HADOOP PAAS

10

-‐ EXPERIMENTAL -‐

11

Mesos

HDFS (Gluster)

Spark NoSQL MR Search IngestETLSQL

Doc

ker

This approach uses Mesos as PaaS layer for the DFS infrstructure. Mesos supports HDFS as well as GlusterFS, HDFS seems to be more robust in that approach. Hadoop and Spark will be maintained by docker containers.

This solution is not so powerful as JBOD hardware in terms of traditional MapReduce, but plays well within an Spark / InMemory environment. Hadoop is used as an transition layer to move more and more applications to Spark.

Datacenter

HADOOP IAAS

12

-‐ EXPERIMENTAL -‐

13

Apache SparkSahara

OpenStack

Spark

SQL

NoSQL

MR

Search

KafkaETLHiveSpark

Stream

Spark

ML

GlusterFS

Apache Hadoop

Apache Flink

Tachyon

Storm

The hardware layer will be completely abstracted by OpenStack. Tachyon closes the gap between IO and Virtualisation (Cloud) and works as an data layer bridge between Hadoop projects, Ingest and Spark. Sahara manages Spark workers as well as Hadoop nodes.

Datacenter

THANK YOU!

14

Alexander Alten-Lorenz mapredit.blogspot.com

@mapredit

http://mapredit.blogspot.com