Beyond Hadoop and MapReduce


Page 1: Beyond Hadoop and MapReduce

BEYOND  HADOOP  AND  MAPREDUCE


Alexander Alten-Lorenz mapredit.blogspot.com

Page 2: Beyond Hadoop and MapReduce

WHY WE NEED TO LOOK BEYOND

HDFS (NameNode) copes poorly with small files and large namespaces

Hadoop as a platform is not ready for secure multi-tenancy

Traditional Hadoop (M/R) is falling behind what the world needs

Big data apps rely more and more on predictive as well as reactive platforms

PaaS and IaaS are not compatible with the concept of Hadoop
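To put a rough number on the small-files point: the NameNode keeps the whole namespace in memory, so the object count matters more than the data volume. The sketch below is a back-of-the-envelope estimate only. The figure of about 150 bytes per namespace object (file or block) is a commonly quoted rule of thumb, and the 128 MiB block size is an assumed default; neither comes from these slides.

```python
# Rough NameNode heap estimate. Both constants are assumptions,
# not measured values and not taken from the slides.
BYTES_PER_NAMESPACE_OBJECT = 150      # rule-of-thumb cost per file or block
DEFAULT_BLOCK_SIZE = 128 * 1024**2    # assumed HDFS block size (128 MiB)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def namenode_heap_bytes(total_bytes: int, file_size: int,
                        block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Estimate heap needed to track total_bytes stored as equal-sized files."""
    num_files = ceil_div(total_bytes, file_size)
    blocks_per_file = ceil_div(file_size, block_size)
    # One object for the file itself plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_NAMESPACE_OBJECT


TIB = 1024**4
small_files = namenode_heap_bytes(TIB, file_size=1024**2)   # 1 MiB files
large_files = namenode_heap_bytes(TIB, file_size=1024**3)   # 1 GiB files
print(f"1 TiB as 1 MiB files: {small_files / 1024**2:.1f} MiB of heap")
print(f"1 TiB as 1 GiB files: {large_files / 1024**2:.1f} MiB of heap")
```

Under these assumptions the same terabyte costs a few hundred times more NameNode memory when it is stored as small files.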

Page 3: Beyond Hadoop and MapReduce

AVAILABLE  DISTRIBUTED  FILE  SYSTEMS


Page 4: Beyond Hadoop and MapReduce

Tachyon

Fast facts:

In-memory distributed shared file system
Written in Java
In-memory checkpointing and caching
Uses the underlying FS to provide fault tolerance
Pluggable FS layer

Spark compatible
HDFS / MR / Hive compatible

Native support for raw tables
Supports multi-column tables
Memory pinning for certain hot tables

Web: http://tachyon-project.org
Source: https://github.com/amplab/tachyon
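The combination of an in-memory tier, a pluggable underlying FS and pinning can be pictured with a toy model. This is a minimal sketch of the idea only, not Tachyon's API or implementation: the class names `DictUnderFS` and `MemoryTier` are hypothetical, capacity is counted in entries rather than bytes, and write-through persistence is an assumed simplification of how the underlying FS provides fault tolerance.

```python
from collections import OrderedDict


class DictUnderFS:
    """Hypothetical stand-in for a pluggable underlying file system."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        self._files[path] = data

    def read(self, path):
        return self._files[path]


class MemoryTier:
    """Toy in-memory layer: write-through, LRU eviction, pinning."""

    def __init__(self, under_fs, capacity):
        self.under_fs = under_fs      # anything with read(path) / write(path, data)
        self.capacity = capacity      # maximum number of cached entries
        self.cache = OrderedDict()    # least recently used entry first
        self.pinned = set()

    def pin(self, path):
        """Protect a hot path from eviction."""
        self.pinned.add(path)

    def write(self, path, data):
        self.under_fs.write(path, data)   # persist first, so memory loss is safe
        self._remember(path, data)

    def read(self, path):
        if path in self.cache:
            self.cache.move_to_end(path)  # mark as recently used
            return self.cache[path]
        data = self.under_fs.read(path)   # cache miss: fall back to the under FS
        self._remember(path, data)
        return data

    def _remember(self, path, data):
        self.cache[path] = data
        self.cache.move_to_end(path)
        while len(self.cache) > self.capacity:
            victim = next((p for p in self.cache if p not in self.pinned), None)
            if victim is None:            # everything is pinned, nothing to evict
                break
            del self.cache[victim]


fs = DictUnderFS()
tier = MemoryTier(fs, capacity=2)
tier.write("/hot/table", b"rows")
tier.pin("/hot/table")
tier.write("/tmp/a", b"a")
tier.write("/tmp/b", b"b")                # evicts /tmp/a, keeps the pinned table
```

Because every write reaches the underlying FS before it is cached, a fresh `MemoryTier` over the same `DictUnderFS` can still read all data after the memory contents are lost.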

Page 5: Beyond Hadoop and MapReduce

Ceph

Fast facts:

Distributed file system
Written in C++
Provides object, file and block storage as well as kernel support (experimental)
Metadata principle, offers high availability and replication
Self-healing and "no downtime" mechanisms
Supports B-tree (incl. k-value) model

Hadoop compatible
Native Hadoop integration (hadoop-cephfs.jar)
RH Gluster compatible (OpenStack)
S3 compatible

Web: http://ceph.com/docs/next/
Source: https://github.com/ceph/ceph

Page 6: Beyond Hadoop and MapReduce

GlusterFS

Fast facts:

Distributed, scale-out network-attached storage file system
=> Write once, read everywhere
Storage servers joined over an RDMA interconnect into one large parallel network file system
Server and client components (glusterfsd / glusterfs)
Relies on an elastic hashing algorithm rather than a centralised or distributed metadata model

Hadoop compatible
Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop) available

Web: http://www.gluster.org/
Source: http://download.gluster.org/pub/gluster/glusterfs/3.5/
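The point of hashing instead of a metadata model is that every client can compute where a file lives without asking a server. The sketch below illustrates that idea only and is not GlusterFS code: the brick names are hypothetical, MD5 stands in for the real hash function, and the hash space is simply split into equal ranges.

```python
import hashlib
from bisect import bisect_right

HASH_SPACE = 2**32  # 32-bit hash space, split into one range per brick


def name_hash(filename: str) -> int:
    """Map a file name to a 32-bit value (MD5 is a stand-in hash here)."""
    digest = hashlib.md5(filename.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_layout(bricks):
    """Give each brick an equal slice of the hash space; the last takes the rest."""
    step = HASH_SPACE // len(bricks)
    starts = [i * step for i in range(len(bricks))]
    return starts, list(bricks)


def locate(filename: str, layout) -> str:
    """Find the brick whose range contains the file's hash. No lookup service."""
    starts, bricks = layout
    index = bisect_right(starts, name_hash(filename)) - 1
    return bricks[index]


# Hypothetical brick names, for illustration only.
layout = build_layout(["server1:/brick", "server2:/brick", "server3:/brick"])
for name in ("part-00000", "part-00001", "events.log"):
    print(name, "->", locate(name, layout))
```

Any client that knows the same layout reaches the same answer, which is why no central metadata server sits in the data path.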

Page 7: Beyond Hadoop and MapReduce

PAAS  /  SAAS  SOLUTIONS


Page 8: Beyond Hadoop and MapReduce

Apache Mesos

Fast facts:

Cluster manager and PaaS layer
Provides resource isolation
Provides resource sharing between applications or frameworks
Runs many different applications on a dynamic shared pool of nodes
DCOS (Data Center Operating System)

Supports HDFS (DFS layer)
Ceph will be supported soon
Tachyon support is experimental
Runs on the local file system too

Spark, Cassandra, Storm, Docker, HDFS, Hadoop support

Web: http://mesos.apache.org/
DCOS: http://mesosphere.com/
Source: https://github.com/mesos/mesos
Tachyon: https://github.com/mesosphere/tachyon-mesos

Page 9: Beyond Hadoop and MapReduce

OpenStack

Fast facts:

Public / private cloud and IaaS provider
VM-based cloud computing
Enables horizontal scaling by using unused resources or spinning up new services
Broad base of industry committers
Abstracts the hardware layer
Supported by Red Hat

Web: http://www.openstack.org/
OpenStack Foundation: http://www.openstack.org/foundation
Source: https://github.com/openstack/openstack

Page 10: Beyond Hadoop and MapReduce

HADOOP  PAAS

- EXPERIMENTAL -

Page 11: Beyond Hadoop and MapReduce

[Diagram labels: Mesos; HDFS (Gluster); Docker; Spark, NoSQL, MR, Search, Ingest, ETL, SQL; Datacenter]

This approach uses Mesos as the PaaS layer for the DFS infrastructure. Mesos supports HDFS as well as GlusterFS; HDFS seems to be the more robust choice in this setup. Hadoop and Spark are run and maintained in Docker containers.

This solution is not as powerful as JBOD hardware for traditional MapReduce, but it works well in a Spark / in-memory environment. Hadoop serves as a transition layer for moving more and more applications to Spark.
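As an illustration of Spark running in Docker containers on Mesos, the sketch below assembles a `spark-submit` command line. It assumes a Spark release that still ships Mesos support; the `mesos://` master URL form and the `spark.mesos.executor.docker.image` property come from Spark's Mesos documentation, while the host name, image name and application file are hypothetical placeholders.

```python
def spark_submit_args(mesos_master: str, docker_image: str, app: str) -> list:
    """Build spark-submit arguments for a Mesos master with Docker executors."""
    return [
        "spark-submit",
        "--master", f"mesos://{mesos_master}",
        # Executors are launched inside this Docker image by Mesos.
        "--conf", f"spark.mesos.executor.docker.image={docker_image}",
        app,
    ]


# Hypothetical values, for illustration only.
args = spark_submit_args(
    mesos_master="mesos-master.example:5050",
    docker_image="example/spark:latest",
    app="job.py",
)
print(" ".join(args))
```

The list form can be passed directly to `subprocess.run` on a machine where `spark-submit` is installed.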

Page 12: Beyond Hadoop and MapReduce

HADOOP  IAAS

- EXPERIMENTAL -

Page 13: Beyond Hadoop and MapReduce

[Diagram labels: Apache Spark, Sahara, OpenStack, Spark SQL, NoSQL, MR, Search, Kafka, ETL, Hive, Spark Stream, Spark ML, GlusterFS, Apache Hadoop, Apache Flink, Tachyon, Storm, Datacenter]

The hardware layer is completely abstracted by OpenStack. Tachyon closes the gap between I/O and virtualisation (cloud) and works as a data layer bridge between Hadoop projects, ingest and Spark. Sahara manages Spark workers as well as Hadoop nodes.

Page 14: Beyond Hadoop and MapReduce

THANK  YOU!


Alexander Alten-Lorenz mapredit.blogspot.com

@mapredit