BEYOND HADOOP AND MAPREDUCE
1
Alexander Alten-Lorenz mapredit.blogspot.com
2
HDFS (NN) small files / large namespaces incompatibility
Hadoop as an Platform is not multi-tenant ready in a secure way
Traditional Hadoop importance (M/R) moves behind the world needs
BigData apps rely more and more on predictive as well as reactive platforms
PaaS and IaaS are not compatible with the concept of Hadoop
WHY WE NEED TO LOOK BEYOND
AVAILABLE DISTRIBUTED FILE SYSTEMS
Beyond Hadoop and MapReduce
3
4
In Memory Distributed Shared Filesystem Written in Java In-memory checkpointing and caching Use underlaying FS to provide fault tolerance Pluggable FS layer
Spark compatible
HDFS / MR / Hive compatible
Native support for raw tables Supports multi-columned tables Memory pinning for certain hot tables
Fast facts:
Web: http://tachyon-project.org
Source: https://github.com/amplab/tachyon
5
Distributed file system Written in C++ Provides Object, File and Block storage as well as kernel support (experimental) Metadata principe, offers High Availability and Replication Self healing mechanisms, „no downtime“ mechanisms supports B-tree (incl. k-value) model
Hadoop compatible Native Hadoop integration (hadoop-cephfs.jar) RH Gluster compatible (OpenStack) S3 compatible
Fast facts:
Web: http://ceph.com/docs/next/ Source: https://github.com/ceph/ceph
6
Distributed file system scale-out network-attached storage file system.
=> Write once, read everywhere RDMA interconnect as one large parallel network file system client and server component (glusterfsd / glusterfs ) GlusterFS relies on an elastic hashing algorithm, rather than using either a centralised or distributed metadata model
Hadoop compatible Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop) available
Fast facts:
Web: http://www.gluster.org/ Source: http://download.gluster.org/pub/gluster/glusterfs/3.5/
PAAS / SAAS SOLUTIONS
7
Beyond Hadoop and MapReduce
8
Cluster manager and PaaS layer Provide ressource isolation Provide ressource sharing between applications or frameworks run many different applications on a dynamic shared pool of nodes DCOS (Data Center Operating System)
Support HDFS (DFS Layer) Ceph will be supported soon Tachyon support experimental Runs on local file system too
Spark, Cassandra, Storm, Docker, HDFS, Hadoop support
Fast facts:
Web: http://mesos.apache.org/
DCOS: http://mesosphere.com/
Source: https://github.com/mesos/mesos
Tachyon: https://github.com/mesosphere/tachyon-mesos
9
Public / Private cloud and IaaS provider VM based cloud computing Enables horizontal scaling by using unused resources or spinning up new services Wide broaden industry committers Abstract HW layer Supported by RedHat
Fast facts:
Web: http://www.openstack.org/
OpenStack Foundation: http://www.openstack.org/foundation
Source: https://github.com/openstack/openstack
HADOOP PAAS
10
-‐ EXPERIMENTAL -‐
11
Mesos
HDFS (Gluster)
Spark NoSQL MR Search IngestETLSQL
Doc
ker
This approach uses Mesos as PaaS layer for the DFS infrstructure. Mesos supports HDFS as well as GlusterFS, HDFS seems to be more robust in that approach. Hadoop and Spark will be maintained by docker containers.
This solution is not so powerful as JBOD hardware in terms of traditional MapReduce, but plays well within an Spark / InMemory environment. Hadoop is used as an transition layer to move more and more applications to Spark.
Datacenter
HADOOP IAAS
12
-‐ EXPERIMENTAL -‐
13
Apache SparkSahara
OpenStack
Spark
SQL
NoSQL
MR
Search
KafkaETLHiveSpark
Stream
Spark
ML
GlusterFS
Apache Hadoop
Apache Flink
Tachyon
Storm
The hardware layer will be completely abstracted by OpenStack. Tachyon closes the gap between IO and Virtualisation (Cloud) and works as an data layer bridge between Hadoop projects, Ingest and Spark. Sahara manages Spark workers as well as Hadoop nodes.
Datacenter
Top Related