Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Data Processing in Hadoop Lars George – Partner and Co-Founder @ OpenCore Big Data & Data Science Israel Meetup – 21.03.2017 Analytics and Data Pipelines in Practice

Transcript of Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Page 1: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Data Processing in Hadoop

Lars George – Partner and Co-Founder @ OpenCore

Big Data & Data Science Israel Meetup – 21.03.2017

Analytics and Data Pipelines in Practice

Page 2: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

About Me
• Partner & Co-Founder at OpenCore
• Before, EMEA Chief Architect at Cloudera
  • 5+ years
• Hadoop since 2007
• Apache Committer
  • HBase and Whirr
• O’Reilly Author: HBase – The Definitive Guide
  • Also in Japanese, Korean & Chinese
  • 2nd edition out soon!
• Contact
  • [email protected]
  • @larsgeorge

The Japanese edition is out too!

Page 3: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Agenda
• Hadoop History
• Data Pipelines
• Hadoop Components
• Data Processing
• Summary

Page 4: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hadoop History
A walk through time…

Page 5: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Tectonic Shifting: Prevalent Data Inertia

Page 6: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

The Original Inspirations for Hadoop

[Figure: the Google papers that inspired Hadoop: The Google File System (2003) and MapReduce (2004)]

Page 7: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

A Decade of Hadoop History on One Slide

Ten years ago, “Hadoop” referred to a scalable, fault-tolerant filesystem (HDFS) and programming framework (MapReduce) for distributed computing.

Today, it refers to both a kernel containing the aforementioned pieces, as well as a constantly evolving ecosystem of 25+ data stores, execution engines, programming and data access frameworks, and other componentry.

Recognize this guy?

Page 8: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hadoop’s Original Architecture

MapReduce (Data Processing and Resource Management)

HDFS (Filesystem/Storage)

Page 9: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hadoop's Architecture Today

MapReduce (Data Processing)

YARN (Resource Management)

HDFS (Storage)

Page 10: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Popular by Demand
• More resources are poured into Hadoop than into many other projects
• Vibrant community with many commercial entities backing the development
• The list on the right shows separate projects, which are combined in Hadoop distributions
  • Total would far exceed anything else
• Literally no alternatives!

Page 11: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Data Pipelines
From deluge to insight

Page 12: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Data Pipeline Components
• Pipelines need data and CPUs
• Continuous ingest lands new data in various ways
• Access to data allows consumers to build products
• All of this needs to be
  • Automated & managed
  • Done in a secure manner
• Finally, pipelines need to be properly onboarded
• Discovery is necessary to find schemas, data sources, etc.

[Figure: data pipeline components diagram. Labels shown: Ingest, Storage, Processing, Access; Automation + Data & Resource Management; Authentication, Authorization, Audits; Onboarding & Discovery; Physical Systems]

Page 13: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Pipelines Increase Value of Data

Page 14: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Now that we know how data pipelines span many layers in both hardware and software, we can look at what Hadoop has to offer in more detail…

Page 15: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hadoop Components
Growth and Controversies

Page 16: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Example: Cloudera

One Platform, Many Workloads

Batch, Interactive, and Real-Time. Leading performance and usability in one platform.

• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users

[Figure: platform diagram]
• Process
  • Ingest: Sqoop, Flume, NiFi
  • Transform: MapReduce, Hive, Pig, Spark
• Discover
  • Analytic Database: Impala
  • Search: Solr
• Model
  • Machine Learning: SAS, R, Spark, Mahout
• Serve
  • NoSQL Database: HBase
  • Streaming: Spark Streaming
• Unlimited Storage: HDFS, HBase
• Security and Administration: YARN, Cloudera Manager, Cloudera Navigator

Page 17: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hadoop: One Platform
• Unlike siloed, monolithic databases, Hadoop is a single, shared platform with multiple entry points (access engines)
• Scale and resilience are inherently built in
• There are no silos; everything is just a directory with data inside

But…
• How do you know what is where?
• Access needs to be tightly controlled, down to the field level!

Page 18: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Analogy: The Universal Flatbed
• Hadoop is a powerful engine exposed as a platform to carry loads
• Initially the platform is bare and beckons for customization
• You can convert the flatbed to what is needed

But…
• Once converted, how to switch between workloads?
• How do you share the engine with different users?

Page 19: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hadoop Architecture Today
• Components are selected to match customer demands
• A platform has many advantages, including paid QA time
• Some newer components can be added later on
  • Labs etc.
• Many buzzwords that need to be carefully vetted…

Page 20: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Evolution of the Hadoop Platform

The stack is continually evolving and growing!

[Figure: the stack per year, 2006 to 2015. Each year's stack contains everything from the earlier years plus the components listed here]
• 2006: Core Hadoop (HDFS, MapReduce)
• 2007: Solr, Pig
• 2008: HBase, ZooKeeper
• 2009: Hive, Mahout
• 2010: Sqoop, Avro
• 2011: Flume, Bigtop, Oozie, HCatalog, Hue, YARN
• 2012: Spark, Tez, Impala, Kafka, Drill
• 2013: Parquet, Sentry
• 2014: Knox, Flink
• 2015: Kudu, RecordService, Ibis, Falcon

Page 21: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

And There Is More

Page 22: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hadoop - The Movie: “Divergent”

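[Figure: timeline from 2006 to 2017 (marked years: 2006, 2008, 2011, 2013, 2014, 2015, 2016, 2017), starting at Hadoop Core and continuing with the CDH and HDP distributions. Labels shown: CM, Navigator, Sentry, Ranger, Ambari, Impala, CDSW, Zeppelin, Atlas, Knox, Solr, Spark, Kafka, Kudu, YARN]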

Page 23: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

So, Hadoop is both complicated and divergent? How can we build data pipelines then, using its components? What else is needed?

Page 24: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Data Processing In Hadoop Today
Coasting through the "Trough of Disillusionment"

Page 25: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Wait! Before we can look at the aspects of building a data pipeline, a bit more context on where users are coming from and what their needs are: The Waves of Adoption.

Page 26: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Waves of Adoption #1
• The “AllSpark” (as in the Transformers movie)
  • First companies to adopt Hadoop as a way to mirror Google’s approach
• Early Adopters
  • Inspired by early success stories, these engineering-focused companies extended Hadoop
• Followers
  • Companies that are OK to try out new things
  • Still engineering driven
• Late Bloomers
  • First Enterprises
• New Wave
  • Everyone else…

[Figure: adoption waves over time: AllSpark, Early Adopters, Followers, Late Bloomers, Enterprises, with a "TODAY!" marker]

Page 27: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Waves of Adoption #2
• Simple logic at bulk (batch processing of petabytes)
  • What: Reporting
  • With: SQL (Hive), Pig
  • Who: Analysts, Developers
• Streaming logic, likely in Lambda architecture
  • What: Decision support
  • With: OLAP Analytics, Druid, Oryx
  • Who: Data architects, DevOps
• Complex analytics
  • What: Machine Learning, AI
  • With: Notebooks, DS Workbench
  • Who: Data Scientists

[Figure labels: Batch, Lambda, Kappa?]

Page 28: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hybrids: Lambda FTW?
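
As a rough illustration of the Lambda idea named on this slide: a batch layer periodically recomputes a view from the full event log, a speed layer keeps a view of events that arrived since the last batch run, and a serving step merges both at query time. The sketch below is plain Python under that assumption; the event shape and the function names are invented for illustration and do not come from the talk or from any framework.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute counts from the full historical event log."""
    return Counter(event["key"] for event in events)


def speed_view(recent_events):
    """Speed layer: count only events that arrived after the last batch run."""
    return Counter(event["key"] for event in recent_events)


def serve(batch, speed):
    """Serving step: merge both views when a query comes in."""
    return batch + speed


# Hypothetical data: page views already covered by the last batch run ...
historical = [{"key": "page_a"}, {"key": "page_b"}, {"key": "page_a"}]
# ... and page views that arrived afterwards.
recent = [{"key": "page_a"}, {"key": "page_c"}]

merged = serve(batch_view(historical), speed_view(recent))
```

The cost of this design is that the same logic (here, counting) exists twice, once per layer, which is the usual argument for the stream-only Kappa alternative.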

Page 29: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Security
• Onboarding & Discovery
• Physical Systems

[Figure: data pipeline components diagram]

Page 30: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Storage & Processing

Storage
• Reliable and scalable systems: HDFS, Kafka, HBase
  • What about Kudu, Cassandra, … MongoDB?
• Data laid out in a structured manner
  • Information Architecture
  • Physical storage (e.g. columnar)

Processing
• Generic framework: YARN
  • What about Mesos? Non-batch jobs?
• Resource management hooks
• Pluggable engines
  • MapReduce, Spark, …
  • MPP Systems?

Page 31: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Information Architecture
• There is a need to define how data flows through the system and is organized
• This simplifies the onboarding process
• Can be simple, or arbitrarily complex
• Needs to be enforced as it is used
• Living system, may need to adapt
• Define batch and stream interfaces
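
One simple way to make such an information architecture enforceable is to derive every storage location from a single function instead of letting each team pick paths by hand. The sketch below assumes a hypothetical three-zone layout (`raw`, `curated`, `published`) under `/data`, with one directory per day; none of these names come from the talk.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical zones that data moves through, in order.
ZONES = ("raw", "curated", "published")


def dataset_path(zone, source, dataset, day):
    """Build the agreed directory for one day of one dataset.

    Rejecting unknown zones is what enforces the convention.
    """
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return PurePosixPath("/data") / zone / source / dataset / f"dt={day.isoformat()}"


example = dataset_path("raw", "crm", "orders", date(2017, 3, 21))
```

Because the layout lives in one place, it can change as the system evolves without every ingest job hard-coding its own directory scheme.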

Page 32: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Example: YARN Services?
• Little progress in years
• Still batch oriented
• Projects shoehorn service idea into YARN using kludges
  • Example: Slider, Twill

Page 33: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Security
• Onboarding & Discovery
• Physical Systems

[Figure: data pipeline components diagram]

Page 34: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Ingest
• Purpose
  • Receive data from heterogeneous sources
  • Save as-is, or do first pass processing
  • Store data in best format, aggregate small files
  • Comply with stack rules (security, IA)
• One of the most active areas
• Vibrant third-party ecosystem
  • StreamSets, Tamr, Waterline Data, Trifacta, IBM, …
  • Often a generic task, with Hadoop being only one target
• Open-source frameworks
  • NiFi
  • Flume (with Kafka)?
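
The "aggregate small files" step can be sketched as a planning problem: group incoming files into batches that each come close to a target size before they are written out. The following is a minimal greedy sketch in plain Python, not how NiFi or Flume implement it; the 128 MiB default is an assumption chosen to resemble a typical HDFS block size, and the file names are made up.

```python
MIB = 1024 * 1024


def plan_batches(files, target_bytes=128 * MIB):
    """Greedily group (name, size) pairs into batches of at most target_bytes.

    A file larger than target_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for name, size in files:
        # Close the current batch if adding this file would overshoot.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical landing-zone listing: (file name, size in bytes).
incoming = [
    ("a.json", 60 * MIB),
    ("b.json", 50 * MIB),
    ("c.json", 40 * MIB),
    ("d.json", 200 * MIB),
]
plan = plan_batches(incoming)
```

Each resulting batch would then be concatenated or rewritten into one larger file in the chosen storage format.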

Page 35: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Security
• Onboarding & Discovery
• Physical Systems

[Figure: data pipeline components diagram]

Page 36: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Access
• Hadoop traditionally has only a few interfaces
  • Interactive SQL
    • Shell, Notebooks, Hue
    • JDBC/ODBC
  • File Access
    • WebHDFS/HttpFS
  • Gateways
    • REST, Knox
• Needs to be set up based on the use-case
  • Throughput vs Latency
• Must apply security rules
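
To make the file access interface more concrete: WebHDFS exposes HDFS operations as HTTP requests of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such a URL and sends nothing. The host name, port, path, and user are placeholders, and passing `user.name` assumes a cluster without Kerberos; a secured cluster would authenticate differently.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL for one operation on one HDFS path."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Placeholder NameNode address and directory; adjust to the actual cluster.
url = webhdfs_url(
    "namenode.example.com",
    50070,
    "/data/raw/orders",
    "LISTSTATUS",
    {"user.name": "etl"},
)
```

A plain HTTP GET on this URL would return the directory listing as JSON, which is why this interface suits lightweight clients, while bulk transfers are usually better served by other access paths.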

Page 37: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Security
• Onboarding & Discovery
• Physical Systems

[Figure: data pipeline components diagram]

Page 38: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Automation & Management
• PoCs and prototyping are not production grade!
• Need to automate the pipelines with monitoring and alerting
• Full development lifecycle needs to be established
• Precious resources need to be managed
  • Easier if use-cases all fall into the same category
  • Difficult when they span many systems
  • One of the remaining topics not addressed at all in Hadoop
• Change management should handle dynamic reconfiguration

Page 39: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Automation
• Directed acyclic graphs (DAGs)
  • Define the actions and link them
  • Schedules based on various events (time or data)
  • Handle errors and maintenance
• Examples
  • Apache Oozie [2007, 2010 O/S, 2012 Apache]
    • Java
    • XML or Hue
  • Azkaban (LinkedIn) [2010]
    • Java
  • Luigi (Spotify) [2012]
    • Python
  • Apache Airflow (Airbnb) [2015]
    • Python
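
The core idea shared by these schedulers, defining actions and linking them into a DAG, can be shown without any of them. The sketch below uses only Python's standard library (`graphlib`, Python 3.9 or newer) and is not the API of Oozie, Azkaban, Luigi, or Airflow. The task names are hypothetical, and real schedulers add triggers, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, actions):
    """Run every action once, only after all of its upstream actions ran.

    dependencies maps each task name to the set of tasks it depends on.
    An exception raised by an action aborts the run at that task.
    """
    executed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()
        executed.append(task)
    return executed


# Hypothetical pipeline definition.
dependencies = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "report": {"aggregate"},
    "export": {"aggregate"},
}

log = []
# Each action just records its own name; a real one would launch a job.
actions = {name: (lambda n=name: log.append(n)) for name in dependencies}

order = run_pipeline(dependencies, actions)
```

`TopologicalSorter` raises `CycleError` if the definition contains a cycle, which is the property that makes the "acyclic" part of a DAG checkable before anything runs.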

Page 40: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Example: Notebooks
• Data scientists like prototyping
• But how to bring the results into production?
• One attempt is to boost notebooks with a framework that can handle their chaining and execution
  • Shared resources used
  • Depends on notebook backends

Source: https://databricks.com/blog/2016/08/30/notebook-workflows-the-easiest-way-to-implement-apache-spark-pipelines.html

Page 41: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Security
• Onboarding & Discovery
• Physical Systems

[Figure: data pipeline components diagram]

Page 42: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Security
• Many moving parts
  • Kerberos
  • RPC Level
  • ACLs
  • RBAC
  • UIs
  • Data
  • Encryption (at-rest and in-transit)
• Hard to configure properly
  • Management software helps to a degree

Page 43: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Security
• Onboarding & Discovery
• Physical Systems

[Figure: data pipeline components diagram]

Page 44: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Onboarding Use-Cases
• Ask the necessary questions ahead of time
• Use the answers to set (initially) strict limits
  • Use HDFS quotas, YARN queues, etc.
• Initialize the system with the defaults
• Communicate to other teams what the expected impact might be
• During onboarding explain the shared nature of Hadoop
  • Avoid “long faces” due to changes (change management)
• Define costs and chargeback models
• Automate into self-service if possible
  • Push updated configuration and notifications
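
Turning onboarding answers into limits is one of the easier parts to automate. The sketch below converts two hypothetical answers (expected number of files and expected data volume) into the `hdfs dfsadmin` quota commands for a project directory. It only builds the argument lists and does not run them. The directory and the numbers are made up, and YARN queue setup is left out because it lives in scheduler configuration rather than in a single command.

```python
def quota_commands(directory, max_names, max_bytes):
    """Return the HDFS quota commands for one onboarded use-case.

    -setQuota limits the number of names (files and directories),
    -setSpaceQuota limits the raw space consumed, in bytes.
    """
    return [
        ["hdfs", "dfsadmin", "-setQuota", str(max_names), directory],
        ["hdfs", "dfsadmin", "-setSpaceQuota", str(max_bytes), directory],
    ]


# Hypothetical answers from an onboarding questionnaire.
TIB = 1024 ** 4
commands = quota_commands("/data/projects/churn", 1_000_000, 10 * TIB)
```

Keeping this as a function makes it straightforward to place behind a self-service form later, with the same code generating the initial strict limits and any approved increases.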

Page 45: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stage:
• Storage & Processing
• Ingest
• Access
• Automation & Management
• Onboarding & Discovery
• Physical Systems

[Figure: data pipeline components diagram]

Page 46: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Stack Architecture
• Combine the reliable components into a whole stack
• Organize interfaces to outside systems by users and purpose
• Separate components for ease of maintenance
• Layer network to fit data flow
• Tight security control at vital points

Page 47: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Network Architecture

Page 48: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Wrap Up
Data pipeline deconstruction

Page 49: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

“Oh… and I thought I just add Hadoop to our technology landscape… you know, like a database or an appliance.”

– Misled Decision Maker

Page 50: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hype Curve

[Figure: hype curve, visibility over time, with the phases Technology Trigger, Peak of Inflated Expectations, Trough of Disillusionment, Slope of Enlightenment, Plateau of Productivity]

Page 51: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Technology Waves
• Hadoop is just one part of the hype curve
• Technologies that follow may (heavily, or even solely) depend on it
  • “Shaky foundations”?
• But… most (if not all) technologies are initially oversold and overhyped
• What happens in practice?

Page 52: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Hype Curve – The Hadoop Version

[Figure: hype curve, visibility over time, annotated along the curve]
• “Big Data is strategic for us!”
• First PoC
• “Where are the results?”
• “Darn, Hadoop is difficult!”
• “Security? Multitenancy? Development? Lifecycle? Environments?”
• “Maybe Hadoop is not for us?”
• Allocate more resources & budget
• First use-case in production
• Hadoop team productivity
• Meanwhile…

Page 53: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Summary
• Data Pipelines span many levels of architectures
  • Hardware, Networking, Information, Security, Data Management
• Core Hadoop itself provides only little in that regard
  • Vendors offer some support (closed or open source)
• Use-cases are often unknown
  • Guess as well as possible, generalize
• Careful planning is vital, mistakes are costly
  • Mixed workloads are a nightmare for resource management
  • Keep things simple (KISS principle)
• Knowledge needs to be built upfront
  • Hire someone in the know!

Page 54: Data Pipelines in Hadoop - SAP Meetup in Tel Aviv

Thank You!
@larsgeorge