The Hadoop Ecosystem Table


This page is a summary to keep track of Hadoop-related projects, focused on the FLOSS environment.

Distributed Filesystem

Apache HDFS

The Hadoop Distributed File System (HDFS) offers a way to store large files across multiple machines. Hadoop and HDFS were derived from the Google File System (GFS) paper. Prior to Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an HDFS cluster. With ZooKeeper, the HDFS High Availability feature addresses this problem by providing the option of running two redundant NameNodes in the same cluster in an Active/Passive configuration with a hot standby.

1. hadoop.apache.org 2. Google File System - GFS Paper 3. Cloudera Why HDFS 4. Hortonworks Why HDFS

Red Hat GlusterFS

GlusterFS is a scale-out network-attached storage file system. GlusterFS was developed originally by Gluster, Inc., then by Red Hat, Inc., after their purchase of Gluster in 2011. In June 2012, Red Hat Storage Server was announced as a commercially supported integration of GlusterFS with Red Hat Enterprise Linux; the Gluster File System is now known as Red Hat Storage Server.

1. www.gluster.org 2. Red Hat Hadoop Plugin

Quantcast File System (QFS)

QFS is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop's HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters. It is written in C++ and has fixed-footprint memory management. QFS uses Reed-Solomon error correction as a method for assuring reliable access to data. Reed-Solomon coding is very widely used in mass storage systems to correct the burst errors associated with media defects. Rather than storing three full versions of each file like HDFS, resulting in the need for three times more storage, QFS only needs 1.5x the raw capacity because it stripes data across nine different disk drives.

1. QFS site 2. GitHub QFS 3. HADOOP-8885

Ceph Filesystem

Ceph is a free software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph's main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. The data is replicated, making it fault tolerant.

1. Ceph Filesystem site 2. Ceph and Hadoop 3. HADOOP-6253

Lustre file system

The Lustre filesystem is a high-performance distributed filesystem intended for larger network and high-availability environments. Traditionally, Lustre is configured to manage remote data storage disk devices within a Storage Area Network (SAN), which is two or more remotely attached disk devices communicating via a Small Computer System Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI. With Hadoop HDFS the software needs a dedicated cluster of computers on which to run. But folks who run high-performance computing clusters for other purposes often don't run HDFS, which leaves them with a bunch of computing power, tasks that could almost certainly benefit from a bit of MapReduce, and no way to put that power to work running Hadoop. Intel noticed this and, in version 2.5 of its Hadoop distribution that it quietly released last week, has added support for Lustre: the Intel HPC Distribution for Apache Hadoop Software, a new product that combines Intel Distribution for Apache Hadoop software with Intel Enterprise Edition for Lustre software. This is the only distribution of Apache Hadoop that is integrated with Lustre, the parallel file system used by many of the world's fastest supercomputers.

1. wiki.lustre.org 2. Hadoop with Lustre 3. Intel HPC Hadoop

Alluxio

Alluxio, the world's first memory-centric virtual distributed storage system, unifies data access and bridges computation frameworks and underlying storage systems. Applications only need to connect with Alluxio to access data stored in any underlying storage systems. Additionally, Alluxio's memory-centric architecture enables data access orders of magnitude faster than existing solutions. In the big data ecosystem, Alluxio lies between computation frameworks or jobs, such as Apache Spark, Apache MapReduce, or Apache Flink, and various kinds of storage systems, such as Amazon S3, OpenStack Swift, GlusterFS, HDFS, Ceph, or OSS. Alluxio brings significant performance improvement to the stack; for example, Baidu uses Alluxio to improve their data analytics performance by 30 times. Beyond performance, Alluxio bridges new workloads with data stored in traditional storage systems. Users can run Alluxio using its standalone cluster mode, for example on Amazon EC2, or launch Alluxio with Apache Mesos or Apache YARN. Alluxio is Hadoop compatible. This means that existing Spark and MapReduce programs can run on top of Alluxio without any code changes. The project is open source (Apache License 2.0) and is deployed at multiple companies. It is one of the fastest-growing open source projects. With less than three years of open source history, Alluxio has attracted more than 160 contributors from over 50 institutions, including Alibaba, Alluxio, Baidu, CMU, IBM, Intel, NJU, Red Hat, UC Berkeley, and Yahoo. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.

1. Alluxio site

GridGain

GridGain is an open source project licensed under Apache 2.0. One of the main pieces of this platform is the In-Memory Apache Hadoop Accelerator, which aims to accelerate HDFS and Map/Reduce by bringing both data and computations into memory. This work is done with the GGFS, a Hadoop-compliant in-memory file system. For I/O-intensive jobs GridGain GGFS offers performance close to 100x faster than standard HDFS. Paraphrasing Dmitriy Setrakyan from GridGain Systems talking about GGFS regarding Tachyon:

GGFS allows read-through and write-through to/from underlying HDFS or any other Hadoop-compliant file system with zero code change. Essentially GGFS entirely removes the ETL step from integration.
GGFS has the ability to pick and choose what folders stay in memory, what folders stay on disk, and what folders get synchronized with the underlying (HD)FS either synchronously or asynchronously.
GridGain is working on adding a native MapReduce component which will provide native complete Hadoop integration without changes in API, like Spark currently forces you to do. Essentially GridGain MR+GGFS will allow bringing Hadoop completely or partially in-memory in a plug-n-play fashion without any API changes.

1. GridGain site

XtreemFS

XtreemFS is a general purpose storage system and covers most storage needs in a single deployment. It is open-source, requires no special hardware or kernel modules, and can be mounted on Linux, Windows and OS X. XtreemFS runs distributed and offers resilience through replication. XtreemFS Volumes can be accessed through a FUSE component that offers normal file interaction with POSIX-like semantics. Furthermore, an implementation of Hadoop's FileSystem interface is included, which makes XtreemFS available for use with Hadoop, Flink and Spark out of the box. XtreemFS is licensed under the New BSD license. The XtreemFS project is developed by Zuse Institute Berlin. The development of the project has been funded by the European Commission since 2006 under Grant Agreements No. FP6-033576, FP7-ICT-257438, and FP7-318521, as well as the German projects MoSGrid, "First We Take Berlin", FFMK, GeoMultiSens, and BBDC.

1. XtreemFS site 2. Flink on XtreemFS 3. Spark XtreemFS


Distributed Programming

Apache Ignite

Apache Ignite In-Memory Data Fabric is a distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It includes a distributed key-value in-memory store, SQL capabilities, map-reduce and other computations, distributed data structures, continuous queries, messaging and events subsystems, and Hadoop and Spark integration. Ignite is built in Java and provides .NET and C++ APIs.

1. Apache Ignite 2. Apache Ignite documentation

Apache MapReduce

MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Apache MapReduce was derived from the Google "MapReduce: Simplified Data Processing on Large Clusters" paper. The current Apache MapReduce version is built over the Apache YARN framework. YARN stands for "Yet-Another-Resource-Negotiator". It is a new framework that facilitates writing arbitrary distributed processing frameworks and applications. YARN's execution model is more generic than the earlier MapReduce implementation. YARN can run applications that do not follow the MapReduce model, unlike the original Apache Hadoop MapReduce (also called MR1). Hadoop YARN is an attempt to take Apache Hadoop beyond MapReduce for data processing.

1. Apache MapReduce 2. Google MapReduce paper 3. Writing YARN applications
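
To make the programming model concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API; the input and output paths come from the command line and are placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner runs the reducer map-side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, this would typically be submitted with `hadoop jar`, letting YARN schedule the map and reduce tasks across the cluster.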

Apache Pig

Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data. Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop's processing system, MapReduce. Pig uses MapReduce to execute all of its data processing. It compiles the Pig Latin scripts that users write into a series of one or more MapReduce jobs that it then executes. Pig Latin looks different from many of the programming languages you have seen. There are no if statements or for loops in Pig Latin. This is because traditional procedural and object-oriented programming languages describe control flow, and data flow is a side effect of the program. Pig Latin instead focuses on data flow.

1. pig.apache.org 2. Pig examples by Alan Gates
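
As a sketch of what a Pig Latin data flow looks like, the snippet below embeds a word-count script in Java via Pig's PigServer API; the input file, field names, and output directory are assumptions for illustration.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // Local mode for illustration; ExecType.MAPREDUCE would target the cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);
    // A Pig Latin data flow: load, tokenize, group, count.
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
    // STORE triggers compilation of the flow into one or more MapReduce jobs.
    pig.store("counts", "wordcounts");
  }
}
```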

JAQL

JAQL is a functional, declarative programming language designed especially for working with large volumes of structured, semi-structured and unstructured data. As its name implies, a primary use of JAQL is to handle data stored as JSON documents, but JAQL can work on various types of data. For example, it can support XML, comma-separated values (CSV) data and flat files. A "SQL within JAQL" capability lets programmers work with structured SQL data while employing a JSON data model that's less restrictive than its Structured Query Language counterparts. Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS, much like a blend of Pig and Hive. Jaql's query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig. JAQL was created by workers at IBM Research Labs in 2008 and released to open source. While it continues to be hosted as a project on Google Code, where a downloadable version is available under an Apache 2.0 license, the major development activity around JAQL has remained centered at IBM. The company offers the query language as part of the tools suite associated with InfoSphere BigInsights, its Hadoop platform. Working together with a workflow orchestrator, JAQL is used in BigInsights to exchange data between storage, processing and analytics jobs. It also provides links to external data and services, including relational databases and machine learning data.

1. JAQL in Google Code 2. What is Jaql? by IBM

Apache Spark

Data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark provides an easier-to-use alternative to Hadoop MapReduce and offers performance up to 10 times faster than previous-generation systems like Hadoop MapReduce for certain applications. Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does, but with a fast in-memory approach and a clean functional-style API. With its ability to integrate with Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets. To make programming faster, Spark provides clean, concise APIs in Scala, Java and Python. You can also use Spark interactively from the Scala and Python shells to rapidly query big datasets. Spark is also the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run 100x faster than Hive.

1. Apache Spark 2. Mirror of Spark on Github 3. RDDs - Paper 4. Spark: Cluster Computing... - Paper 5. Spark Research
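
A minimal sketch of the in-memory, functional style described above, written against the Java RDD API (Spark 2.x signatures); the HDFS paths are hypothetical.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    // local[*] runs in-process for illustration; a real job would use a cluster master.
    SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      JavaRDD<String> lines = sc.textFile("hdfs:///user/me/input.txt"); // hypothetical path
      JavaPairRDD<String, Integer> counts = lines
          .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum); // shuffle-and-sum, kept in memory where possible
      counts.saveAsTextFile("hdfs:///user/me/counts");
    }
  }
}
```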

Apache Storm

Storm is a complex event processor (CEP) and distributed computation framework written predominantly in the Clojure programming language. It is a distributed real-time computation system for processing fast, large streams of data. Storm is an architecture based on the master-worker paradigm, so a Storm cluster mainly consists of master and worker nodes, with coordination done by ZooKeeper. Storm makes use of ZeroMQ (0mq, zeromq), an advanced, embeddable networking library. It provides a message queue, but unlike message-oriented middleware (MOM), a 0MQ system can run without a dedicated message broker. The library is designed to have a familiar socket-style API. Originally created by Nathan Marz and team at BackType, the project was open sourced after being acquired by Twitter. Storm was initially developed and deployed at BackType in 2011. After 7 months of development BackType was acquired by Twitter in July 2011. Storm was open sourced in September 2011. Hortonworks is developing a Storm-on-YARN version and plans to finish the base-level integration in 2013 Q4. This is the plan from Hortonworks. Yahoo/Hortonworks also plans to move the Storm-on-YARN code from github.com/yahoo/storm-yarn to be a subproject of the Apache Storm project in the near future. Twitter has recently released a Hadoop-Storm hybrid called "Summingbird." Summingbird fuses the two frameworks into one, allowing developers to use Storm for short-term processing and Hadoop for deep data dives: a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system.

1. Storm Project 2. Storm-on-YARN
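
A minimal topology sketch against the org.apache.storm Java API, wiring Storm's built-in TestWordSpout to a trivial bolt and running it on an in-process LocalCluster; the parallelism hints and topology name are arbitrary.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class EchoTopology {
  // A trivial bolt that re-emits each word it receives.
  public static class EchoBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      collector.emit(new Values(input.getStringByField("word")));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new TestWordSpout(), 2);                  // built-in test spout
    builder.setBolt("echo", new EchoBolt(), 4).shuffleGrouping("words");
    LocalCluster cluster = new LocalCluster();                          // in-process cluster for testing
    cluster.submitTopology("echo-topology", new Config(), builder.createTopology());
    Thread.sleep(10_000);                                               // let the stream flow briefly
    cluster.shutdown();
  }
}
```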

Apache Flink

Apache Flink (formerly called Stratosphere) features powerful programming abstractions in Java and Scala, a high-performance runtime, and automatic program optimization. It has native support for iterations, incremental iterations, and programs consisting of large DAGs of operations. Flink is a data processing system and an alternative to Hadoop's MapReduce component. It comes with its own runtime, rather than building on top of MapReduce. As such, it can work completely independently of the Hadoop ecosystem. However, Flink can also access Hadoop's distributed file system (HDFS) to read and write data, and Hadoop's next-generation resource manager (YARN) to provision cluster resources. Since most Flink users are using Hadoop HDFS to store their data, it already ships the required libraries to access HDFS.

1. Apache Flink incubator page 2. Stratosphere site
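
For flavor, a minimal batch word count against Flink's Java DataSet API; the inline input is a stand-in for a real HDFS source.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
  public static void main(String[] args) throws Exception {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    // Inline input for illustration; env.readTextFile("hdfs:///...") would read from HDFS.
    DataSet<String> text = env.fromElements("to be or not to be");
    DataSet<Tuple2<String, Integer>> counts = text
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
          @Override
          public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split("\\s+")) {
              out.collect(new Tuple2<>(word, 1));
            }
          }
        })
        .groupBy(0) // group on the word field of the tuple
        .sum(1);    // sum the count field
    counts.print(); // print() triggers execution of the dataflow
  }
}
```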

Apache Apex

Apache Apex is an enterprise-grade, Apache YARN-based, big-data-in-motion platform that unifies stream processing as well as batch processing. It processes big data in motion in a highly scalable, highly performant, fault tolerant, stateful, secure, distributed, and easily operable way. It provides a simple API that enables users to write or re-use generic Java code, thereby lowering the expertise needed to write big data applications.

1. Apache Apex from DataTorrent 2. Apache Apex main page 3. Apache Apex Proposal


The Apache Apex platform is supplemented by Apache Apex-Malhar, which is a library of operators that implement common business logic functions needed by customers who want to quickly develop applications. These operators provide access to HDFS, S3, NFS, FTP, and other file systems; Kafka, ActiveMQ, RabbitMQ, JMS, and other message systems; and MySQL, Cassandra, MongoDB, Redis, HBase, CouchDB and other databases, along with JDBC connectors. The library also includes a host of other common business logic patterns that help users to significantly reduce the time it takes to go into production. Ease of integration with all other big data technologies is one of the primary missions of Apache Apex-Malhar.

Apex, available on GitHub, is the core technology upon which DataTorrent's commercial offering, DataTorrent RTS 3, along with other technology such as a data ingestion tool called dtIngest, are based.

Netflix PigPen

PigPen is map-reduce for Clojure which compiles to Apache Pig. Clojure is a dialect of the Lisp programming language created by Rich Hickey, so it is a functional general-purpose language that runs on the Java Virtual Machine, Common Language Runtime, and JavaScript engines. In PigPen there are no special user-defined functions (UDFs). Define Clojure functions, anonymously or named, and use them like you would in any Clojure program. This tool is open sourced by Netflix, Inc., the American provider of on-demand Internet streaming media.

1. PigPen on GitHub

AMPLab SIMR

Apache Spark was developed with Apache YARN in mind. However, up to now, it has been relatively hard to run Apache Spark on Hadoop MapReduce v1 clusters, i.e. clusters that do not have YARN installed. Typically, users would have to get permission to install Spark/Scala on some subset of the machines, a process that could be time consuming. SIMR allows anyone with access to a Hadoop MapReduce v1 cluster to run Spark out of the box. A user can run Spark directly on top of Hadoop MapReduce v1 without any administrative rights, and without having Spark or Scala installed on any of the nodes.

1. SIMR on GitHub

Facebook Corona

"The next version of Map-Reduce" from Facebook, based on its own fork of Hadoop. The current Hadoop implementation of the MapReduce technique uses a single job tracker, which causes scaling issues for very large data sets. The Apache Hadoop developers have been creating their own next-generation MapReduce, called YARN, which Facebook engineers looked at but discounted because of the highly customised nature of the company's deployment of Hadoop and HDFS. Corona, like YARN, spawns multiple job trackers (one for each job, in Corona's case).

1. Corona on Github

Apache REEF

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. Apache REEF drastically simplifies development on those resource managers through the following features:

Centralized Control Flow: Apache REEF turns the chaos of a distributed application into events in a single machine, the Job Driver. Events include container allocation, Task launch, completion and failure. For failures, Apache REEF makes every effort of making the actual `Exception` thrown by the Task available to the Driver.
Task runtime: Apache REEF provides a Task runtime called Evaluator. Evaluators are instantiated in every container of a REEF application. Evaluators can keep data in memory in between Tasks, which enables efficient pipelines on REEF.
Support for multiple resource managers: Apache REEF applications are portable to any supported resource manager with minimal effort. Further, new resource managers are easy to support in REEF.
.NET and Java API: Apache REEF is the only API to write YARN or Mesos applications in .NET. Further, a single REEF application is free to mix and match Tasks written for .NET or Java.
Plugins: Apache REEF allows for plugins (called "Services") to augment its feature set without adding bloat to the core. REEF includes many Services, such as name-based communications between Tasks, MPI-inspired group communications (Broadcast, Reduce, Gather, ...) and data ingress.

1. Apache REEF Website

Apache Twill

Twill is an abstraction over Apache Hadoop YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their business logic. Twill uses a simple thread-based model that Java programmers will find familiar. YARN can be viewed as a compute fabric of a cluster, which means YARN applications like Twill will run on any Hadoop 2 cluster. YARN is an open source application that allows the Hadoop cluster to turn into a collection of virtual machines. Weave, developed by Continuuity and initially housed on Github, is a complementary open source application that uses a programming model similar to Java threads, making it easy to write distributed applications. In order to remove a conflict with a similarly named project on Apache, called "Weaver," Weave's name changed to Twill when it moved to Apache incubation. Twill functions as a scaled-out proxy. Twill is a middleware layer in between YARN and any application on YARN. When you develop a Twill app, Twill handles APIs in YARN that resemble a multi-threaded application familiar to Java. It is very easy to build multi-processed distributed applications in Twill.

1. Apache Twill Incubator

Damballa Parkour

Library for developing MapReduce programs using the Lisp-like language Clojure. Parkour aims to provide deep Clojure integration for Hadoop. Programs using Parkour are normal Clojure programs, using standard Clojure functions instead of new framework abstractions. Programs using Parkour are also full Hadoop programs, with complete access to absolutely everything possible in raw Java Hadoop MapReduce.

1. Parkour GitHub Project

Apache Hama

Apache Top-Level open source project, allowing you to do advanced analytics beyond MapReduce. Many data analysis techniques such as machine learning and graph algorithms require iterative computations; this is where the Bulk Synchronous Parallel model can be more effective than "plain" MapReduce.

1. Hama site

Datasalt Pangool

A new MapReduce paradigm. A new API for MR jobs, at a higher level than Java.

1. Pangool 2. GitHub Pangool

Apache Tez

Tez is a proposal to develop a generic application which can be used to process complex data-processing task DAGs and runs natively on Apache Hadoop YARN. Tez generalizes the MapReduce paradigm to a more powerful framework based on expressing computations as a dataflow graph. Tez is not meant directly for end-users; in fact it enables developers to build end-user applications with much better performance and flexibility. Hadoop has traditionally been a batch-processing platform for large amounts of data. However, there are a lot of use cases for near-real-time performance of query processing. There are also several workloads, such as Machine Learning, which do not fit well into the MapReduce paradigm. Tez helps Hadoop address these use cases. The Tez framework constitutes part of the Stinger initiative (a low-latency SQL-type query interface for Hadoop based on Hive).

1. Apache Tez Incubator 2. Hortonworks Apache Tez page

Apache DataFu

DataFu provides a collection of Hadoop MapReduce jobs and functions in higher level languages based on it to perform data analysis. It provides functions for common statistics tasks (e.g. quantiles, sampling), PageRank, stream sessionization, and set and bag operations. DataFu also provides Hadoop jobs for incremental data processing in MapReduce. DataFu is a collection of Pig UDFs (including PageRank, sessionization, set operations, sampling, and much more) that were originally developed at LinkedIn.

1. DataFu Apache Incubator

Pydoop

Pydoop is a Python MapReduce and HDFS API for Hadoop, built upon the C++ Pipes and the C libhdfs APIs, that allows you to write full-fledged MapReduce applications with HDFS access. Pydoop has several advantages over Hadoop's built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it allows you to access all standard library and third party modules, some of which may not be available otherwise.

1. SF Pydoop site 2. Pydoop GitHub Project

Kangaroo

Open-source project from Conductor for writing MapReduce jobs consuming data from Kafka. The introductory post explains Conductor's use case: loading data from Kafka to HBase by way of a MapReduce job using the HFileOutputFormat. Unlike other solutions which are limited to a single InputSplit per Kafka partition, Kangaroo can launch multiple consumers at different offsets in the stream of a single partition for increased throughput and parallelism.

1. Kangaroo Introduction 2. Kangaroo GitHub Project

TinkerPop

Graph computing framework written in Java. Provides a core API that graph system vendors can implement. There are various types of graph systems including in-memory graph libraries, OLTP graph databases, and OLAP graph processors. Once the core interfaces are implemented, the underlying graph system can be queried using the graph traversal language Gremlin and processed with TinkerPop-enabled algorithms. For many, TinkerPop is seen as the JDBC of the graph computing community.

1. Apache Tinkerpop Proposal 2. TinkerPop site
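
A small Gremlin traversal against TinkerGraph, TinkerPop's in-memory reference implementation, using the bundled "modern" sample graph.

```java
import java.util.List;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory;

public class GremlinExample {
  public static void main(String[] args) {
    // createModern() builds the small sample graph used throughout the TinkerPop docs.
    Graph graph = TinkerFactory.createModern();
    GraphTraversalSource g = graph.traversal();
    // Gremlin traversal: who does "marko" know?
    List<Object> names = g.V().has("name", "marko").out("knows").values("name").toList();
    System.out.println(names); // [vadas, josh]
  }
}
```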

Pachyderm MapReduce

Pachyderm is a completely new MapReduce engine built on top of Docker and CoreOS. In Pachyderm MapReduce (PMR) a job is an HTTP server inside a Docker container (a microservice). You give Pachyderm a Docker image and it will automatically distribute it throughout the cluster next to your data. Data is POSTed to the container over HTTP and the results are stored back in the file system. You can implement the web server in any language you want and pull in any library. Pachyderm also creates a DAG for all the jobs in the system and their dependencies, and it automatically schedules the pipeline such that each job isn't run until its dependencies have completed. Everything in Pachyderm "speaks in diffs" so it knows exactly which data has changed and which subsets of the pipeline need to be rerun. CoreOS is an open source lightweight operating system based on Chrome OS; actually CoreOS is a fork of Chrome OS. CoreOS provides only the minimal functionality required for deploying applications inside software containers, together with built-in mechanisms for service discovery and configuration sharing.

1. Pachyderm site 2. Pachyderm introduction article

Apache Beam

Apache Beam is an open source, unified model for defining and executing data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.

The model behind Beam evolved from a number of internal Google data processing projects, including MapReduce, FlumeJava, and Millwheel. This model was originally known as the "Dataflow Model" and first implemented as Google Cloud Dataflow, including a Java SDK on GitHub for writing pipelines and a fully managed service for executing them on Google Cloud Platform.

In January 2016, Google and a number of partners submitted the Dataflow Programming Model and SDKs portion as an Apache Incubator Proposal, under the name Apache Beam (unified Batch + strEAM processing).

1. Apache Beam Proposal 2. DataFlow Beam and Spark Comparison

NoSQL Databases

Column Data Model

Apache HBase

Google BigTable inspired. Non-relational distributed database. Random, real-time r/w operations in column-oriented very large tables (BDDB: Big Data Data Base). It's the backing system for MR job outputs. It's the Hadoop database. It's for backing Hadoop MapReduce jobs with Apache HBase tables.

1. Apache HBase Home 2. Mirror of HBase on Github
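
A minimal sketch of random, real-time read/write access through the HBase Java client API; the table name, column family, and row key are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("metrics"))) { // hypothetical table
      // Write one cell: row key "row-1", column family "d", qualifier "value".
      Put put = new Put(Bytes.toBytes("row-1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(42L));
      table.put(put);
      // Random, real-time read of the same row.
      Result result = table.get(new Get(Bytes.toBytes("row-1")));
      long value = Bytes.toLong(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value")));
      System.out.println("value = " + value);
    }
  }
}
```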

Apache Cassandra

Distributed Non-SQL DBMS, it's a BDDB. MR can retrieve data from Cassandra. This BDDB can run without HDFS, or on top of HDFS (DataStax fork of Cassandra). HBase and its required supporting systems are derived from what is known of the original Google BigTable and Google File System designs (as known from the Google File System paper Google published in 2003, and the BigTable paper published in 2006). Cassandra, on the other hand, is a recent open source fork of a standalone database system initially coded by Facebook, which, while implementing the BigTable data model, uses a system inspired by Amazon's Dynamo for storing data (in fact much of the initial development work on Cassandra was performed by two Dynamo engineers recruited to Facebook from Amazon).

1. Apache HBase Home 2. Cassandra on GitHub 3. Training Resources 4. Cassandra - Paper

Hypertable

Database system inspired by publications on the design of Google's BigTable. The project is based on the experience of engineers who were solving large-scale data-intensive tasks for many years. Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++. Sponsored by Baidu, the Chinese search engine.

TODO

Apache Accumulo

Distributed key/value store: a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Accumulo is software created by the NSA with security features.

1. Apache Accumulo Home

Apache Kudu

Distributed, columnar, relational data store optimized for analytical use cases requiring very fast reads with competitive write speeds.

Relational data model (tables) with strongly-typed columns and a fast, online alter table operation.
Scale-out and sharded with support for partitioning based on key ranges and/or hashing.
Fault-tolerant and consistent due to its implementation of Raft consensus.
Supported by Apache Impala and Apache Drill, enabling fast SQL reads and writes through those systems.
Integrates with MapReduce and Spark.
Additionally provides "NoSQL" APIs in Java, Python, and C++.

1. Apache Kudu Home 2. Kudu on Github 3. Kudu technical whitepaper (pdf)

Apache Parquet

Columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

1. Apache Parquet Home 2. Apache Parquet on Github

Document Data Model

MongoDB

Document-oriented database system. It is part of the NoSQL family of database systems. Instead of storing data in tables as is done in a "classical" relational database, MongoDB stores structured data as JSON-like documents.

1. Mongodb site

RethinkDB

RethinkDB is built to store JSON documents, and scale to multiple machines with very little effort. It has a pleasant query language that supports really useful queries like table joins and group by, and is easy to set up and learn.

1. RethinkDB site

ArangoDB

An open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.

1. ArangoDB site

Stream Data Model

EventStore

An open-source, functional database with support for Complex Event Processing. It provides a persistence engine for applications using event sourcing, or for storing time-series data. Event Store is written in C#, with C++ for the server, which runs on Mono or the .NET CLR, on Linux or Windows. Applications using Event Store can be written in JavaScript. Event sourcing (ES) is a way of persisting your application's state by storing the history that determines the current state of your application.

1. EventStore site

Key-value Data Model

Redis DataBase

Redis is an open-source, networked, in-memory data structures store with optional durability. It is written in ANSI C. In its outer layer, the Redis data model is a dictionary which maps keys to values. One of the main differences between Redis and other structured storage systems is that Redis supports not only strings, but also abstract data types. Sponsored by Redis Labs. It's BSD licensed.

1. Redis site 2. Redis Labs site
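
A small sketch using Jedis, a common third-party Java client for Redis (not part of Redis itself), touching both a plain string key and one of Redis's abstract data types.

```java
import redis.clients.jedis.Jedis;

public class RedisExample {
  public static void main(String[] args) {
    // Connect to a local Redis server (default port 6379); address is an assumption.
    try (Jedis jedis = new Jedis("localhost", 6379)) {
      // Plain string key-value.
      jedis.set("greeting", "hello");
      System.out.println(jedis.get("greeting"));
      // An abstract data type: a list used as a simple queue.
      jedis.rpush("jobs", "job-1", "job-2");
      System.out.println(jedis.lpop("jobs")); // -> job-1
    }
  }
}
```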

Linkedin Voldemort

Distributed data store that is designed as a key-value store used by LinkedIn for high-scalability storage.

1. Voldemort site

RocksDB

RocksDB is an embeddable persistent key-value store for fast storage. RocksDB can also be the foundation for a client-server database, but the current focus is on embedded workloads.

1. RocksDB site

OpenTSDB

OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. OpenTSDB was written to address a common need: store, index and serve metrics collected from computer systems (network gear, operating systems, applications) at a large scale, and make this data easily accessible and graphable.

1. OpenTSDB site

Graph Data Model

ArangoDB

An open-source database with a flexible data model for documents, graphs, and key-values. Build high performance applications using a convenient SQL-like query language or JavaScript extensions.

1. ArangoDB site

Neo4j

An open-source graph database written entirely in Java. It is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables.

1. Neo4j site

TitanDB

TitanDB is a highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users.

1. Titan site

NewSQL Databases

TokuDB

TokuDB is a storage engine for MySQL and MariaDB that is specifically designed for high performance on write-intensive workloads. It achieves this via Fractal Tree indexing. TokuDB is a scalable, ACID and MVCC compliant storage engine. TokuDB is one of the technologies that enable Big Data in MySQL.

1. Percona TokuDB site

HandlerSocket

HandlerSocket is a NoSQL plugin for MySQL/MariaDB (the storage engine of MySQL). It works as a daemon inside the mysqld process, accepting TCP connections, and executing requests from clients. HandlerSocket does not support SQL queries. Instead, it supports simple CRUD operations on tables. HandlerSocket can be much faster than mysqld/libmysql in some cases because it has lower CPU, disk, and network overhead.

TODO

Akiban Server

Akiban Server is an open source database that brings document stores and relational databases together. Developers get powerful document access alongside surprisingly powerful SQL.

TODO

Drizzle

Drizzle is a re-designed version of the MySQL v6.0 codebase and is designed around a central concept of having a microkernel architecture. Features such as the query cache and authentication system are now plugins to the database, which follow the general theme of "pluggable storage engines" that were introduced in MySQL 5.1. It supports PAM, LDAP, and HTTP AUTH for authentication via the plugins it ships. Via its plugin system it currently supports logging to files, syslog, and remote services such as RabbitMQ and Gearman. Drizzle is an ACID-compliant relational database that supports transactions via an MVCC design.

TODO

Haeinsa

Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase. Use Haeinsa if you need strong ACID semantics on your HBase cluster. It is based on the Google Percolator concept.

1. Haeinsa GitHub site

SenseiDB

Open-source, distributed, realtime, semi-structured database. Some features: full-text search, fast realtime updates, structured and faceted search, BQL (an SQL-like query language), fast key-value lookup, high performance under concurrent heavy update and query volumes, and Hadoop integration.

1. SenseiDB site

Sky

Sky is an open source database used for flexible, high performance analysis of behavioral data. For certain kinds of data, such as clickstream data and log data, it can be several orders of magnitude faster than traditional approaches such as SQL databases or Hadoop.

1. SkyDB site

BayesDB

BayesDB, a Bayesian database table, lets users query the probable implications of their tabular data as easily as an SQL database lets them query the data itself. Using the built-in Bayesian Query Language (BQL), users with no statistics training can solve basic data science problems, such as detecting predictive relationships between variables, inferring missing values, simulating probable observations, and identifying statistically similar database entries.

1. BayesDB site

InfluxDB

InfluxDB is an open source distributed time series database with no external dependencies. It's useful for recording metrics, events, and performing analytics. It has a built-in HTTP API so you don't have to write any server side code to get up and running. InfluxDB is designed to be scalable, simple to install and manage, and fast to get data in and out. It aims to answer queries in real-time. That means every data point is indexed as it comes in and is immediately available in queries that should return in under 100ms.

1. InfluxDB site

SQL-on-Hadoop

Apache Hive

Data Warehouse infrastructure developed by Facebook. Data summarization, query, and analysis. It provides an SQL-like language (not SQL92 compliant): HiveQL.

1. Apache HIVE site 2. Apache HIVE GitHub Project
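
A minimal sketch of querying HiveServer2 through its JDBC driver; the connection URL, credentials, and the access_logs table are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC endpoint; host, port, and database are placeholders.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      // HiveQL looks like SQL but compiles to distributed jobs over data in HDFS.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page ORDER BY hits DESC LIMIT 10");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```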

Apache HCatalog

HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored. Right now HCatalog is part of Hive. Only old versions are separated for download.

TODO

Apache Trafodion

Apache Trafodion is a webscale SQL-on-Hadoop solution enabling enterprise-class transactional and operational workloads on HBase. Trafodion is a native MPP ANSI SQL database engine that builds on the scalability, elasticity and flexibility of HDFS and HBase, extending these to provide guaranteed transactional integrity for all workloads including multi-column, multi-row, multi-table, and multi-server updates.

1. Apache Trafodion website 2. Apache Trafodion wiki 3. Apache Trafodion GitHub Project

Apache HAWQ

Apache HAWQ is a Hadoop native SQL query engine that combines the key technological advantages of an MPP database evolved from Greenplum Database with the scalability and convenience of Hadoop.

1. Apache HAWQ site 2. HAWQ GitHub Project

Apache Drill

Drill is the open source version of Google's Dremel system, which is available as an infrastructure service called Google BigQuery. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need.

1. Apache Incubator Drill

Cloudera Impala

The Apache-licensed Impala project brings scalable parallel database technology to Hadoop, enabling users to issue low-latency SQL queries to data stored in HDFS and Apache HBase without requiring data movement or transformation. It's a clone of Google Dremel (Google BigQuery).

1. Cloudera Impala site 2. Impala GitHub Project

Facebook Presto

Facebook has open sourced Presto, a SQL engine it says is on average 10 times faster than Hive for running queries across large data sets stored in Hadoop and elsewhere.

1. Presto site

Datasalt Splout SQL

Splout allows serving an arbitrarily big dataset with high QPS rates and at the same time provides full SQL query syntax.

TODO

Apache Tajo

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (the extract-transform-load process) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities. For reference, the Apache Software Foundation announced Tajo as a Top-Level Project in April 2014.

1. Apache Tajo site

Apache Phoenix

Apache Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

1. Apache Phoenix site
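
A short sketch of the JDBC workflow Phoenix exposes; the ZooKeeper quorum in the URL and the events table are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
  public static void main(String[] args) throws Exception {
    // Phoenix connects through the ZooKeeper quorum of the HBase cluster (host is a placeholder).
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS events (id BIGINT PRIMARY KEY, name VARCHAR)");
      // UPSERT is Phoenix's combined insert/update statement.
      stmt.executeUpdate("UPSERT INTO events VALUES (1, 'login')");
      conn.commit(); // Phoenix batches mutations until commit
      // This SELECT is compiled into HBase scans behind the scenes.
      ResultSet rs = stmt.executeQuery("SELECT id, name FROM events");
      while (rs.next()) {
        System.out.println(rs.getLong("id") + " " + rs.getString("name"));
      }
    }
  }
}
```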

Apache MRQL

MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, and Spark. MRQL (pronounced miracle) is the MapReduce Query Language, an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in four modes:

in MapReduce mode using Apache Hadoop,
in BSP mode (Bulk Synchronous Parallel mode) using Apache Hama,
in Spark mode using Apache Spark, and
in Flink mode using Apache Flink.

1. Apache Incubator MRQL site

Kylin

Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

1. Kylin project site

Data Ingestion

Apache Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

1. Apache Flume project site

Apache Sqoop

System for bulk data transfer between HDFS and structured datastores such as RDBMSes. Like Flume, but from HDFS to RDBMS.

1. Apache Sqoopproject site

Facebook Scribe

Log aggregator in real-time. It's an Apache Thrift service.

1. Facebook Scribe GitHub site

Apache Chukwa

Large scale log aggregator, and analytics.

1. Apache Chukwa site

Apache Kafka

Distributed publish-subscribe system for processing large amounts of streaming data. Kafka is a message queue developed by LinkedIn that persists messages to disk in a very performant manner. Because messages are persisted, it has the interesting ability for clients to rewind a stream and consume the messages again. Another upside of the disk persistence is that bulk importing the data into HDFS for offline analysis can be done very quickly and efficiently. Storm, developed by BackType (which was acquired by Twitter a year ago), is more about transforming a stream of messages into new streams.

1. Apache Kafka 2. GitHub source code
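
A minimal producer sketch against the Kafka Java client API; the broker address and topic name are assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // broker address is a placeholder
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Messages are appended to the topic's on-disk log; consumers can replay from any offset.
      for (int i = 0; i < 10; i++) {
        producer.send(new ProducerRecord<>("page-views", "user-" + i, "/index.html"));
      }
    }
  }
}
```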

Netflix Suro

Suro has its roots in Apache Chukwa, which was initially adopted by Netflix. It is a log aggregator like Storm or Samza.

TODO

Apache Samza

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Developed by LinkedIn.

1. Apache Samza site

Cloudera Morphline

Cloudera Morphlines is a new open source framework that reduces the time and skills necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards.

TODO

HIHO

This project is a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable. HIHO connects Hadoop with multiple RDBMSes and file systems, so that data can be loaded to Hadoop and unloaded from Hadoop.

TODO

Apache NiFi

Apache NiFi is a dataflow system that is currently under incubation at the Apache Software Foundation. NiFi is based on the concepts of flow-based programming and is highly configurable. NiFi uses a component-based extension model to rapidly add capabilities to complex dataflows. Out of the box NiFi has several extensions for dealing with file-based dataflows such as FTP, SFTP, and HTTP integration, as well as integration with HDFS. One of NiFi's unique features is a rich, web-based interface for designing, controlling, and monitoring a dataflow.

1. Apache NiFi

Apache ManifoldCF

Apache ManifoldCF provides a framework for connecting source content repositories like file systems, DB, CMIS, SharePoint, FileNet ... to target repositories or indexes, such as Apache Solr or ElasticSearch. It's a kind of crawler for multi-content repositories, supporting a lot of sources and multi-format conversion for indexing by means of the Apache Tika Content Extractor transformation filter.

1. Apache ManifoldCF

Service Programming

Apache Thrift

A cross-language RPC framework for service creation. It's the service base for Facebook technologies (Facebook was the original Thrift contributor). Thrift provides a framework for developing and accessing remote services. It allows developers to create services that can be consumed by any application written in a language that has Thrift bindings. Thrift manages serialization of data to and from a service, as well as the protocol that describes a method invocation, response, etc. Instead of writing all the RPC code, you can just get straight to your service logic. Thrift uses TCP, and so a given service is bound to a particular port.

1. Apache Thrift

Apache Zookeeper

It's a coordination service that gives you the tools you need to write correct distributed applications. ZooKeeper was developed at Yahoo! Research. Several Hadoop projects are already using ZooKeeper to coordinate the cluster and provide highly-available distributed services. Perhaps the most famous of those are Apache HBase, Storm, and Kafka. ZooKeeper is an application library with two principal implementations of the APIs, Java and C, and a service component implemented in Java that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems; it simplifies the development process, making it more agile and enabling more robust implementations. Back in 2006, Google published a paper on "Chubby", a distributed lock service which gained wide adoption within their data centers. ZooKeeper, not surprisingly, is a close clone of Chubby designed to fulfill many of the same roles for HDFS and other Hadoop infrastructure.

1. Apache Zookeeper 2. Google Chubby paper
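
A minimal sketch against the raw ZooKeeper Java client: connect, then publish an ephemeral znode of the kind used for service discovery and leader election; the ensemble address and znode path are assumptions.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    // Connect to the ensemble and block until the session is established.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) connected.countDown();
    });
    connected.await();
    // An ephemeral znode disappears automatically when this session ends,
    // which is the basic building block for service discovery and locks.
    zk.create("/worker-1", "host-a".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    byte[] data = zk.getData("/worker-1", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}
```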

Apache Avro

Apache Avro is a framework for modeling, serializing and making Remote Procedure Calls (RPC). Avro data is described by a schema, and one interesting feature is that the schema is stored in the same file as the data it describes, so files are self-describing. Avro does not require code generation. This framework can compete with other similar tools like Apache Thrift, Google Protocol Buffers, ZeroC ICE, and so on.

1. Apache Avro
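
A small sketch of Avro's schema-with-the-data idea using the generic (code-generation-free) Java API; the User schema is invented for illustration.

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // Define the schema inline; no code generation is required.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
        "{\"name\":\"name\",\"type\":\"string\"}," +
        "{\"name\":\"age\",\"type\":\"int\"}]}");
    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "ada");
    user.put("age", 36);
    File file = new File("users.avro");
    // The schema is written into the file header, making the file self-describing.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }
    // A reader recovers the schema from the file itself.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      for (GenericRecord r : reader) System.out.println(r);
    }
  }
}
```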

Apache Curator

Curator is a set of Java libraries that make using Apache ZooKeeper much easier.

TODO
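
A minimal sketch of how Curator trims the boilerplate of the raw ZooKeeper client shown above; the connect string and znode path are assumptions.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorExample {
  public static void main(String[] args) throws Exception {
    // Curator adds connection management and retry policies on top of ZooKeeper.
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3)); // connect string is a placeholder
    client.start();
    // creatingParentsIfNeeded() removes the chore of building the znode path by hand.
    client.create().creatingParentsIfNeeded().forPath("/app/config", "v1".getBytes());
    byte[] data = client.getData().forPath("/app/config");
    System.out.println(new String(data));
    client.close();
  }
}
```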


Apache Karaf

Apache Karaf is an OSGi runtime that runs on top of any OSGi framework and provides you a set of services, a powerful provisioning concept, an extensible shell and more.

TODO

Twitter Elephant Bird

Elephant Bird is a project that provides utilities (libraries) for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers, Thrift in MapReduce, Writables, Pig LoadFuncs, Hive SerDe, and HBase miscellanea. This open source library is massively used in Twitter.

1. Elephant Bird GitHub

Linkedin Norbert

Norbert is a library that provides easy cluster management and workload distribution. With Norbert, you can quickly distribute a simple client/server architecture to create a highly scalable architecture capable of handling heavy traffic. Implemented in Scala, Norbert wraps ZooKeeper and Netty and uses Protocol Buffers for transport to make it easy to build a cluster-aware application. A Java API is provided and pluggable load balancing strategies are supported, with round robin and consistent hash strategies provided out of the box.

1. Linkedin Project 2. GitHub source code

Scheduling & DR

Apache Oozie

Workflow scheduler system for MR jobs using DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and data availability.

1. Apache Oozie 2. GitHub source code

LinkedIn Azkaban

Hadoop workflow management. A batch job scheduler that can be seen as a combination of the cron and make Unix utilities, combined with a friendly UI.

LinkedIn Azkaban

Apache Falcon

Apache Falcon is a data management framework for simplifying data lifecycle management and processing pipelines on Apache Hadoop. It enables users to configure, manage and orchestrate data motion, pipeline processing, disaster recovery, and data retention workflows. Instead of hard-coding complex data lifecycle capabilities, Hadoop applications can now rely on the well-tested Apache Falcon framework for these functions. Falcon's simplification of data management is quite useful to anyone building apps on Hadoop. Data management on Hadoop encompasses data motion, process orchestration, lifecycle management, data discovery, etc., among other concerns that are beyond ETL. Falcon is a new data processing and management platform for Hadoop that solves this problem and creates additional opportunities by building on existing components within the Hadoop ecosystem (e.g. Apache Oozie, Apache Hadoop DistCp, etc.) without reinventing the wheel.

Apache Falcon

Schedoscope

Schedoscope is a new open-source project providing a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days. Datasets (including dependencies) are defined using a Scala DSL, which can embed MapReduce jobs, Pig scripts, Hive queries or Oozie workflows to build the dataset. The tool includes a test framework to verify logic and a command line utility to load and reload data.

GitHub source code

Machine Learning

Apache Mahout

Machine learning library and math library, on top of MapReduce.

Apache Mahout

WEKA

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. Weka is free software available under the GNU General Public License.

Weka 3

Cloudera Oryx

The Oryx open source project provides simple, real-time, large-scale machine learning / predictive analytics infrastructure. It implements a few classes of algorithm commonly used in business applications: collaborative filtering / recommendation, classification / regression, and clustering.

1. Oryx at GitHub 2. Cloudera forum for Machine Learning

Deeplearning4j

The Deeplearning4j open-source project is the most widely used deep-learning framework for the JVM. DL4J includes deep neural nets such as recurrent neural networks, Long Short Term Memory networks (LSTMs), convolutional neural networks, various autoencoders, and feedforward neural networks such as restricted Boltzmann machines and deep-belief networks. It also has natural language-processing algorithms such as word2vec, doc2vec, GloVe and TF-IDF. All Deeplearning4j networks run distributed on multiple CPUs and GPUs. They work as Hadoop jobs, and integrate with Spark at the slave level for host-thread orchestration. Deeplearning4j's neural networks are applied to use cases such as fraud and anomaly detection, recommender systems, and predictive maintenance.

1. Deeplearning4j Website 2. Gitter Community for Deeplearning4j

MADlib

The MADlib project leverages the data-processing capabilities of an RDBMS to analyze data. The aim of this project is the integration of statistical data analysis into databases. The MADlib project is self-described as the Big Data Machine Learning in SQL for Data Scientists. The MADlib software project began the following year as a collaboration between researchers at UC Berkeley and engineers and data scientists at EMC/Greenplum (now Pivotal).

1. MADlib Community

H2O

H2O is a statistical, machine learning and math runtime tool for big data analysis. Developed by the predictive analytics company H2O.ai, H2O has established a leadership in the ML scene together with R and Databricks' Spark. According to the team, H2O is the world's fastest in-memory platform for machine learning and predictive analytics on big data. It is designed to help users scale machine learning, math, and statistics over large datasets.

In addition to H2O's point-and-click Web UI, its REST API allows easy integration into various clients. This means explorative analysis of data can be done in a typical fashion in R, Python, and Scala, and entire workflows can be written up as automated scripts.

1. H2O at GitHub 2. H2O Blog

Sparkling Water

Sparkling Water combines two open source technologies: Apache Spark and H2O, a machine learning engine. It makes H2O's library of advanced algorithms, including Deep Learning, GLM, GBM, K-Means, PCA, and Random Forest, accessible from Spark workflows. Spark users are provided with the option to select the best features from either platform to meet their machine learning needs. Users can combine Spark's RDD API and Spark MLlib with H2O's machine learning algorithms, or use H2O independently of Spark in the model-building process and post-process the results in Spark.

Sparkling Water provides a transparent integration of H2O's framework and data structures into Spark's RDD-based environment by sharing the same execution space as well as providing an RDD-like API for H2O data structures.

1. Sparkling Water at GitHub 2. Sparkling Water Examples

Apache SystemML

Apache SystemML was open-sourced by IBM and is closely related to Apache Spark. Think of Apache Spark as the analytics operating system for any application that taps into huge volumes of streaming data; MLlib, the machine learning library for Spark, provides developers with a rich set of machine learning algorithms; and SystemML enables developers to translate those algorithms so they can easily digest different kinds of data and run on different kinds of computers.

SystemML allows a developer to write a single machine learning algorithm and automatically scale it up using Spark or Hadoop.

SystemML scales for big data analytics with high-performance optimizer technology, and empowers users to write customized machine learning algorithms using a simple domain-specific language (DSL) without learning complicated distributed programming. It is an extensible complement to Spark MLlib.

1. Apache SystemML 2. Apache Proposal
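As a hedged sketch of that idea, using the MLContext API that SystemML exposes to Spark programs (class and package names as of the SystemML 0.x/1.x line; the DML script itself is illustrative), the same R-like script runs unchanged whether the optimizer targets a single machine or a cluster:

```java
// A sketch of running an R-like DML script through SystemML's MLContext on Spark.
import org.apache.spark.sql.SparkSession;
import org.apache.sysml.api.mlcontext.MLContext;
import org.apache.sysml.api.mlcontext.Script;
import org.apache.sysml.api.mlcontext.ScriptFactory;

public class SystemMLSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("systemml-sketch").master("local[*]").getOrCreate();
    MLContext ml = new MLContext(spark);

    // The same DML source runs unchanged on a laptop or a cluster;
    // SystemML's optimizer chooses the execution plan.
    Script script = ScriptFactory.dml(
        "X = rand(rows=1000, cols=10);\n"
      + "s = sum(X);\n"
      + "print(\"sum of X: \" + s);");
    ml.execute(script);

    spark.stop();
  }
}
```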


Benchmarking and QA Tools

Apache Hadoop Benchmarking

There are two main JAR files in Apache Hadoop for benchmarking. These JARs are micro-benchmarks for testing particular parts of the infrastructure; for instance, TestDFSIO analyzes the disk system, TeraSort evaluates MapReduce tasks, WordCount measures cluster performance, etc. The micro-benchmarks are packaged in the tests and examples JAR files, and you can get a list of them, with descriptions, by invoking the JAR file with no arguments. As of the Apache Hadoop 2.2.0 stable release, the following JAR files are available for tests, examples and benchmarking: hadoop-mapreduce-examples-2.2.0.jar and hadoop-mapreduce-client-jobclient-2.2.0-tests.jar.

1. MAPREDUCE-3561, umbrella ticket to track all the issues related to performance
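Besides the command line, the bundled benchmarks can also be driven programmatically, since they implement Hadoop's Tool interface. A minimal sketch, assuming the hadoop-mapreduce-examples JAR is on the classpath and using hypothetical HDFS paths:

```java
// A sketch of launching the bundled TeraGen/TeraSort benchmarks from Java.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class TeraSortDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Generate one million synthetic rows, then sort them (paths hypothetical).
    ToolRunner.run(conf, new TeraGen(), new String[] {"1000000", "/bench/teragen"});
    ToolRunner.run(conf, new TeraSort(), new String[] {"/bench/teragen", "/bench/terasort"});
  }
}
```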

Yahoo Gridmix3 Hadoop cluster benchmarking from the Yahoo engineering team. TODO

PUMA Benchmarking

Benchmark suite which represents a broad range of MapReduce applications exhibiting application characteristics with high/low computation and high/low shuffle volumes. There are a total of 13 benchmarks, out of which Tera-Sort, Word-Count, and Grep are from the Hadoop distribution. The rest of the benchmarks were developed in-house and are currently not part of the Hadoop distribution. The three benchmarks from the Hadoop distribution are also slightly modified to take the number of reduce tasks as input from the user and generate final time completion statistics of jobs.

1. MAPREDUCE-5116 2. Faraz Ahmad researcher 3. PUMA Docs

Berkeley SWIM Benchmark

The SWIM benchmark (Statistical Workload Injector for MapReduce) is a benchmark representing a real-world big data workload, developed by the University of California at Berkeley in close cooperation with Facebook. This test provides rigorous measurements of the performance of MapReduce systems comprising real industry workloads.

1. GitHub SWIM

Intel HiBench HiBench is a Hadoop benchmark suite. TODO

Apache Yetus

To help maintain consistency over a large and disconnected set of committers, automated patch testing was added to Hadoop's development process. This automated patch testing (now included as part of Apache Yetus) works as follows: when a patch is uploaded to the bug tracking system, an automated process downloads the patch, performs some static analysis, and runs the unit tests. These results are posted back to the bug tracker, and alerts notify interested parties about the state of the patch.

However, the Apache Yetus project addresses much more than traditional patch testing; it is a broader approach that includes a massive rewrite of the patch testing facility used in Hadoop.

1. Altiscale Blog Entry 2. Apache Yetus Proposal 3. Apache Yetus Project site


Security

Apache Sentry

Sentry is the next step in enterprise-grade big data security and delivers fine-grained authorization to data stored in Apache Hadoop. An independent security module that integrates with open source SQL query engines Apache Hive and Cloudera Impala, Sentry delivers advanced authorization controls to enable multi-user applications and cross-functional processes for enterprise data sets. Sentry was a Cloudera development.

TODO

Apache Knox Gateway

System that provides a single point of secure access for Apache Hadoop clusters. The goal is to simplify Hadoop security for both users (i.e. who access the cluster data and execute jobs) and operators (i.e. who control access and manage the cluster). The Gateway runs as a server (or cluster of servers) that serves one or more Hadoop clusters.

1. Apache Knox 2. Apache Knox Gateway Hortonworks web

Apache Ranger

Apache Ranger (formerly called Apache Argus or HDP Advanced Security) delivers a comprehensive approach to central security policy administration across the core enterprise security requirements of authentication, authorization, accounting and data protection. It extends baseline features for coordinated enforcement across Hadoop workloads from batch, interactive SQL and real-time, and leverages the extensible architecture to apply policies consistently against additional Hadoop ecosystem components (beyond HDFS, Hive, and HBase), including Storm, Solr, Spark, and more.

1. Apache Ranger 2. Apache Ranger Hortonworks web

Metadata Management

Metascope

Metascope is a metadata management and data discovery tool which serves as an add-on to Schedoscope. Metascope is able to collect technical, operational and business metadata from your Hadoop datahub and makes them easy to search and navigate via a portal.

GitHub source code

System Deployment

Apache Ambari

Intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. Apache Ambari was donated by the Hortonworks team to the ASF. It's a powerful and nice interface for Hadoop and other typical applications from the Hadoop ecosystem. Apache Ambari is under heavy development, and it will incorporate new features in the near future. For example, Ambari is able to deploy a complete Hadoop system from scratch; however, it is not possible to use this GUI on a Hadoop system that is already running. The ability to provision the operating system could be a good addition; however, it is probably not on the roadmap.

1. Apache Ambari

Cloudera HUE

Web application for interacting with Apache Hadoop. It's not a deployment tool; it is an open-source Web interface that supports Apache Hadoop and its ecosystem, licensed under the Apache v2 license. HUE is used for Hadoop and its ecosystem user operations. For example, HUE offers editors for Hive, Impala, Oozie and Pig, notebooks for Spark, Solr Search dashboards, and HDFS, YARN and HBase browsers.

1. HUE home page

Apache Mesos

Mesos is a cluster manager that provides resource sharing and isolation across cluster applications, much as HTCondor, SGE or Torque do. However, Mesos has a Hadoop-centred design.

TODO

Myriad

Myriad is a Mesos framework designed for scaling YARN clusters on Mesos. Myriad can expand or shrink one or more YARN clusters in response to events, as per configured rules and policies.

1. Myriad Github

Marathon

Marathon is a Mesos framework for long-running services. Given that you have Mesos running as the kernel for your datacenter, Marathon is the init or upstart daemon.

TODO

Brooklyn

Brooklyn is a library that simplifies application deployment and management. For deployment, it is designed to tie in with other tools, giving single-click deploy and adding the concepts of manageable clusters and fabrics: many common software entities are available out-of-the-box; it integrates with Apache Whirr (and thereby Chef and Puppet) to deploy well-known services such as Hadoop and elasticsearch (or use POBS, plain-old-bash-scripts); and it can use PaaSes such as OpenShift, alongside self-built clusters, for maximum flexibility.

TODO

Hortonworks HOYA

HOYA is defined as "running HBase On YARN". The Hoya tool is a Java tool, and is currently CLI driven. It takes in a cluster specification, in terms of the number of regionservers, the location of HBASE_HOME, the ZooKeeper quorum hosts, the configuration that the new HBase cluster instance should use, and so on. So HOYA is for HBase deployment using a tool developed on top of YARN. Once the cluster has been started, it can be made to grow or shrink using the Hoya commands. The cluster can also be stopped and later resumed. Hoya implements the functionality through YARN APIs and HBase's shell scripts. The goal of the prototype was to have minimal code changes and, as of this writing, it has required zero code changes in HBase.

1. Hortonworks Blog

Apache Helix

Apache Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. Originally developed by LinkedIn, it is now an incubator project at Apache. Helix is developed on top of Zookeeper for coordination tasks.

1. Apache Helix


Apache Bigtop Bigtop was originally developed and released as an open source packaging infrastructure by Cloudera. Bigtop is used by some vendors to build their own distributions based on Apache Hadoop (CDH, Pivotal HD, Intel's distribution); however, Apache Bigtop does many more tasks, like continuous integration testing (with Jenkins, Maven, ...), and is useful for packaging (RPM and DEB), deployment with Puppet, and so on. Bigtop also features Vagrant recipes for spinning up "n-node" Hadoop clusters, and the BigPetStore blueprint application which demonstrates construction of a full-stack Hadoop app with ETL, machine learning, and dataset generation. Apache Bigtop could be considered as a community effort with a main focus: put all the bits of the Hadoop ecosystem together as a whole, rather than as individual projects.

1. Apache Bigtop.

Buildoop

Buildoop is an open source project licensed under the Apache License 2.0, based on the Apache Bigtop idea. Buildoop is a collaboration project that provides templates and tools to help you create custom Linux-based systems based on the Hadoop ecosystem. The project is built from scratch using the Groovy language, and is not based on a mixture of tools as Bigtop is (Makefile, Gradle, Groovy, Maven); it is probably easier to program than Bigtop, and the design is focused on the basic ideas behind buildroot and the Yocto Project. The project is in the early stages of development right now.

1. Hadoop Ecosystem Builder.

Deploop

Deploop is a tool for provisioning, managing and monitoring Apache Hadoop clusters, focused on the Lambda Architecture. LA is a generic design based on the concepts of Twitter engineer Nathan Marz. This generic architecture was designed to address common requirements for big data. The Deploop system is in ongoing development, in alpha phases of maturity. The system is set up on top of highly scalable technologies like Puppet and MCollective.

1. The Hadoop Deploy System.

SequenceIQ Cloudbreak

Cloudbreak is an effective way to start and run multiple instances and versions of Hadoop clusters in the cloud, in Docker containers or on bare metal. It is a cloud- and infrastructure-agnostic and cost-effective Hadoop-as-a-Service platform API. It provides automatic scaling, secure multi-tenancy and full cloud lifecycle management.

Cloudbreak leverages cloud infrastructure platforms to create host instances, uses Docker technology to deploy the requisite containers cloud-agnostically, and uses Apache Ambari (via Ambari Blueprints) to install and manage a Hortonworks cluster. This is a tool within the HDP ecosystem.

1. GitHub project. 2. Cloudbreak introduction. 3. Cloudbreak in Hortonworks.

Apache Eagle

Apache Eagle is an open source analytics solution for identifying security and performance issues instantly on big data platforms, e.g. Hadoop, Spark etc. It analyzes data activities, YARN applications, JMX metrics, and daemon logs, and provides a state-of-the-art alert engine to identify security breaches and performance issues and show insights. A big data platform normally generates a huge amount of operational logs and metrics in real time. Apache Eagle was founded to solve hard problems in securing and tuning performance for big data platforms by ensuring metrics and logs are always available and by alerting immediately even under huge traffic.

1. Apache Eagle Github Project. 2. Apache Eagle Web Site.

Applications

Apache Nutch

Highly extensible and scalable open source web crawler software project. A search engine based on Lucene: a Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. Web crawlers can copy all the pages they visit for later processing by a search engine that indexes the downloaded pages so that users can search them much more quickly.

TODO

Sphinx Search Server

Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files, quickly and easily, or index and search data on the fly, working with Sphinx pretty much as with a database server.

1. Sphinx search website

Apache OODT OODT was originally developed at NASA Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives.

TODO

HIPI Library HIPI is a library for Hadoop's MapReduce framework that provides an API for performing image processing tasks in a distributed computing environment.

TODO

PivotalR

PivotalR is a package that enables users of R, the most popular open source statistical programming language and environment, to interact with the Pivotal (Greenplum) Database as well as Pivotal HD / HAWQ and the open-source database PostgreSQL for Big Data analytics. R is a programming language and data analysis software: you do data analysis in R by writing scripts and functions in the R programming language. R is a complete, interactive, object-oriented language: designed by statisticians, for statisticians. The language provides objects, operators and functions that make the process of exploring, modeling, and visualizing data a natural one.

1. PivotalR on GitHub

Development Frameworks

Jumbune

Jumbune is an open source product that sits on top of any Hadoop distribution and assists in the development and administration of MapReduce solutions. The objective of the product is to assist analytical solution providers to port fault-free applications onto production Hadoop environments.

Jumbune supports all active major branches of Apache Hadoop, namely 1.x, 2.x, 0.23.x, and the commercial MapR, HDP 2.x and CDH 5.x distributions of Hadoop. It has the ability to work well with both YARN and non-YARN versions of Hadoop.

It has four major modules: MapReduce Debugger, HDFS Data Validator, On-demand cluster monitor and MapReduce job profiler. Jumbune can be deployed on any remote user machine and uses a lightweight agent on the NameNode of the cluster to relay relevant information to and fro.

1. Jumbune 2. Jumbune GitHub Project 3. Jumbune JIRA page

Spring XD

Spring XD (Xtreme Data) is an evolution of the Spring Java application development framework to help Big Data applications, by Pivotal. SpringSource was the company created by the founders of the Spring Framework. SpringSource was purchased by VMware, where it was maintained for some time as a separate division within VMware. Later VMware, and its parent company EMC Corporation, formally created a joint venture called Pivotal. Spring XD is more than a development framework library; it is a distributed and extensible system for data ingestion, real-time analytics, batch processing, and data export. It could be considered as an alternative to Apache Flume/Sqoop/Oozie in some scenarios. Spring XD is part of Pivotal Spring for Apache Hadoop (SHDP). SHDP, integrated with Spring, Spring Batch and Spring Data, is part of the Spring IO Platform as foundational libraries. Building on top of, and extending, this foundation, the Spring IO platform provides Spring XD as a big data runtime. Spring for Apache Hadoop (SHDP) aims to help simplify the development of Hadoop-based applications by providing a consistent configuration and API across a wide range of Hadoop ecosystem projects such as Pig, Hive, and Cascading, in addition to providing extensions to Spring Batch for orchestrating Hadoop-based workflows.

1. Spring XD on GitHub

Cask Data Application Platform

Cask Data Application Platform is an open source application development platform for the Hadoop ecosystem that provides developers with data and application virtualization to accelerate application development, address a range of real-time and batch use cases, and deploy applications into production. The deployment is made by Cask Coopr, an open source template-based cluster management solution that provisions, manages, and scales clusters for multi-tiered application stacks on public and private clouds. Another component is Tigon, a distributed framework built on Apache Hadoop and Apache HBase for real-time, high-throughput, low-latency data processing and analytics applications.

1. Cask Site

Categorize Pending ...

Apache Fluo

Apache Fluo (incubating) is an open source implementation of Percolator for Apache Accumulo. Fluo makes it possible to incrementally update the results of a large-scale computation, index, or analytic as new data is discovered. Fluo allows processing new data with lower latency than Spark or MapReduce in the case where all data must be reprocessed when new data arrives.

1. Apache Fluo Site 2. Percolator Paper

Twitter Summingbird

A system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird.

TODO

Apache Kiji Build real-time Big Data applications on Apache HBase. TODO

S4 Yahoo

S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

TODO

Metamarkets Druid Real-time analytical data store. TODO

Concurrent Cascading Application framework for Java developers to simply develop robust Data Analytics and Data Management applications on Apache Hadoop.

TODO
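As a hedged sketch of the Cascading style (constructor signatures per the Cascading 2.x Javadoc; input and output paths are hypothetical), a word count is expressed as a pipe assembly rather than as raw MapReduce:

```java
// A sketch of a Cascading word-count pipe assembly (Cascading 2.x Hadoop mode).
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    // Source and sink taps over HDFS paths (hypothetical locations).
    Tap docTap   = new Hfs(new TextLine(new Fields("line")), "input/docs");
    Tap countTap = new Hfs(new TextLine(), "output/wordcount");

    // Split each line into words, group by word, count per group.
    Pipe pipe = new Pipe("wordcount");
    pipe = new Each(pipe, new Fields("line"),
        new RegexSplitGenerator(new Fields("word"), "\\s+"));
    pipe = new GroupBy(pipe, new Fields("word"));
    pipe = new Every(pipe, new Count(new Fields("count")));

    FlowDef flowDef = FlowDef.flowDef()
        .setName("wordcount")
        .addSource(pipe, docTap)
        .addTailSink(pipe, countTap);

    // The planner turns the assembly into one or more MapReduce jobs.
    new HadoopFlowConnector().connect(flowDef).complete();
  }
}
```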

Concurrent Lingual

Open source project enabling fast and simple Big Data application development on Apache Hadoop. A project that delivers ANSI-standard SQL technology to easily build new, and integrate existing, applications onto Hadoop.

TODO

Concurrent Pattern Machine Learning for Cascading on Apache Hadoop through an API, and standards-based PMML. TODO

Apache Giraph

Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google.

TODO
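As a hedged sketch of the vertex-centric ("think like a vertex") model, following the single-source shortest paths example shipped with Giraph (the source vertex id 0 is an arbitrary choice):

```java
// A sketch of single-source shortest paths in Giraph's vertex-centric model.
import org.apache.giraph.edge.Edge;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;

public class ShortestPaths extends
    BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {

  private static final long SOURCE_ID = 0; // arbitrary source vertex

  @Override
  public void compute(Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
                      Iterable<DoubleWritable> messages) {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    // Shortest distance seen so far: 0 for the source, else the best message.
    double minDist = (vertex.getId().get() == SOURCE_ID) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      // Propagate the improved distance to all neighbours.
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        sendMessage(edge.getTargetVertexId(),
            new DoubleWritable(minDist + edge.getValue().get()));
      }
    }
    vertex.voteToHalt(); // reactivated only if a new message arrives
  }
}
```

Message passing and voteToHalt are essentially the whole coordination surface; Giraph handles the superstep barriers.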

Talend

Talend is an open source software vendor that provides data integration, data management, enterprise application integration and big data software and solutions.

TODO

Akka Toolkit Akka is an open-source toolkit and runtime simplifying the construction of concurrent applications on the Java platform.

TODO
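A minimal sketch of the classic actor API (Akka 2.5-era Java syntax; the actor and system names are illustrative):

```java
// A sketch of a classic Akka actor in Java.
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;

public class HelloActor extends AbstractActor {
  @Override
  public Receive createReceive() {
    // Actors process messages one at a time; no explicit locking is needed.
    return receiveBuilder()
        .match(String.class, msg -> System.out.println("received: " + msg))
        .build();
  }

  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("demo");
    ActorRef hello = system.actorOf(Props.create(HelloActor.class), "hello");
    hello.tell("hi there", ActorRef.noSender()); // asynchronous, fire-and-forget
    system.terminate();
  }
}
```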

Eclipse BIRT BIRT is an open source Eclipse-based reporting system that integrates with your Java/Java EE application to produce compelling reports.

TODO

SpagoBI

SpagoBI is an Open Source Business Intelligence suite, belonging to the free/open source SpagoWorld initiative, founded and supported by Engineering Group. It offers a large range of analytical functions, a highly functional semantic layer often absent in other open source platforms and projects, and a respectable set of advanced data visualization features including geospatial analytics.

TODO

Jedox Palo

Palo Suite combines all core applications (OLAP Server, Palo Web, Palo ETL Server and Palo for Excel) into one comprehensive and customisable Business Intelligence platform. The platform is completely based on Open Source products, representing a high-end Business Intelligence solution which is available entirely free of any license fees.

TODO

Twitter Finagle

Finagle is an asynchronous network stack for the JVM that you can use to build asynchronous Remote Procedure Call (RPC) clients and servers in Java, Scala, or any JVM-hosted language.

TODO

Intel GraphBuilder Library which provides tools to construct large-scale graphs on top of Apache Hadoop. TODO

Apache Tika Toolkit that detects and extracts metadata and structured text content from various documents using existing parser libraries.

TODO
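A minimal sketch of Tika's facade API, assuming tika-core and the parser bundle are on the classpath, and report.pdf is a hypothetical input file:

```java
// A sketch of Tika's facade API: detect a document's type and extract its text.
import java.io.File;
import org.apache.tika.Tika;

public class TikaSketch {
  public static void main(String[] args) throws Exception {
    Tika tika = new Tika();
    File doc = new File("report.pdf");
    System.out.println("Type: " + tika.detect(doc)); // MIME type detection
    System.out.println(tika.parseToString(doc));     // extracted plain text
  }
}
```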

Apache Zeppelin

Zeppelin is a modern web-based tool for data scientists to collaborate over large-scale data exploration and visualization projects. It is a notebook-style interpreter that enables collaborative analysis sessions to be shared between users. Zeppelin is independent of the execution framework itself. The current version runs on top of Apache Spark, but it has pluggable interpreter APIs to support other data processing systems. More execution frameworks could be added at a later date, e.g. Apache Flink and Crunch, as well as SQL-like backends such as Hive, Tajo and MRQL.

1. Apache Zeppelin site

Hydrosphere Mist

Hydrosphere Mist is a service for exposing Apache Spark analytics jobs and machine learning models as realtime, batch or reactive web services. It acts as a middleware between Apache Spark, the machine learning stack and user-facing applications.

1. Hydrosphere Mist GitHub

Published with GitHub Pages by Javi Roman, and contributors