Streaming Live Data and the Hadoop Ecosystem
© Hortonworks Inc. 2013
Streaming Live Data in the Hadoop Ecosystem
Oleg Zhurakousky (@z_oleg)
Simplistic view
Process
Acquire Data
Store Data
Real view
Modern data processing concerns
• Multiple sources of data
• Geo distribution
• Multiple protocols for data transport
• New technologies/products
• New data-processing paradigms
• Security
• New types of users
• Etc.
Apache Hadoop
• Apache Hadoop
– De facto Big Data open source platform
– Distributed storage
– Distributed processing
– Running for about 6 years in production at hundreds of companies like Yahoo, eBay and Facebook
Storage
HDFS – Hadoop Distributed File System
HDFS - details
[Diagram: one Namenode and three Datanodes (Datanode_1, Datanode_2, Datanode_3) storing HDFS Blocks 1 to 4]
• URI-based addressing – hdfs://myhost:55555/foo/bar/foo.txt
• Name Nodes and Data Nodes
• Block-based storage
• Data Replication
• Replica placements
• File formats
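The addressing, block and replication bullets above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the function names, the 128 MB block size, the replication factor of 3 and the round-robin placement are all assumptions made for illustration. Real HDFS placement is decided by the Namenode and takes rack topology into account.

```python
from urllib.parse import urlparse

# Assumed values for illustration; both are configurable in a real cluster.
BLOCK_SIZE = 128 * 1024 * 1024
REPLICATION = 3


def parse_hdfs_uri(uri):
    """Split an hdfs:// URI into (namenode host, port, file path)."""
    parts = urlparse(uri)
    if parts.scheme != "hdfs":
        raise ValueError("not an HDFS URI: " + uri)
    return parts.hostname, parts.port, parts.path


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, length) pairs; only the last block may be short."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks


def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Hypothetical round-robin placement: each replica on a different node."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    return {
        index: [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        for index, _ in blocks
    }


if __name__ == "__main__":
    print(parse_hdfs_uri("hdfs://myhost:55555/foo/bar/foo.txt"))
    blocks = split_into_blocks(300 * 1024 * 1024)
    print(place_replicas(blocks, ["Datanode_1", "Datanode_2", "Datanode_3"]))
```

The point of the sketch is that a file is never stored as one unit: it is cut into fixed-size blocks, and each block is copied to several Datanodes so that losing one node does not lose data.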
Processing
1st Generation Hadoop: Batch Focus
HADOOP 1.0: Built for Batch Apps
[Diagram: single-app silos labeled BATCH, INTERACTIVE and ONLINE, stacked on HDFS]
• All other usage patterns MUST leverage the same infrastructure
• Forces creation of silos to manage mixed workloads
Hadoop 1 Limitations
• Lacks support for alternate paradigms and services
– Forces everything to look like MapReduce
– Iterative applications in MapReduce are 10x slower
• Scalability
– Max cluster size ~5,000 nodes
– Max concurrent tasks ~40,000
• Availability
– Failure kills queued and running jobs
• Hard partition of resources into map and reduce slots
– Non-optimal resource utilization
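The utilization point can be made concrete with a little arithmetic. The sketch below is a simplification under stated assumptions: every task needs exactly one unit of capacity, and the two function names are hypothetical. It compares a cluster split into fixed map and reduce slots with one where the same capacity is a single shared pool.

```python
def running_with_fixed_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)


def running_with_shared_pool(map_tasks, reduce_tasks, total_slots):
    """Any task may use any free unit of capacity."""
    return min(map_tasks + reduce_tasks, total_slots)


if __name__ == "__main__":
    # Map-heavy phase: 10 map tasks waiting, no reduce tasks yet.
    fixed = running_with_fixed_slots(10, 0, map_slots=6, reduce_slots=4)
    shared = running_with_shared_pool(10, 0, total_slots=10)
    print(fixed, shared)
```

With the hard partition, the four reduce slots sit idle during the map-heavy phase even though work is waiting; with a shared pool, all ten tasks run at once.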
Our Vision: Hadoop as Next-Gen Platform
HADOOP 1.0: Single Use System (Batch Apps)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0: Multi Purpose Platform (Batch, Interactive, Online, Streaming, …)
• MapReduce (data processing), Others
• YARN (cluster resource management)
• HDFS2 (redundant, highly-available & reliable storage)
Hadoop 2 - YARN Architecture
• ResourceManager (RM): central agent; manages and allocates cluster resources
• NodeManager (NM): per-node agent; manages and enforces node resource allocations
• ApplicationMaster (AM): per-application; manages application lifecycle and task scheduling
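The division of labor between the three roles can be sketched as a toy model. This is not the YARN API: the class and method names below are illustrative only, memory is the only resource tracked, and the first-fit allocation is an assumption rather than how any real YARN scheduler works.

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Per-node agent: enforces the node's own memory limit."""
    name: str
    capacity_mb: int
    used_mb: int = 0

    def launch(self, memory_mb):
        if self.used_mb + memory_mb > self.capacity_mb:
            return False
        self.used_mb += memory_mb
        return True


class ResourceManager:
    """Central agent: hands out containers across all nodes (first fit)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.launch(memory_mb):
                return node.name
        return None  # cluster is full for a request of this size


class ApplicationMaster:
    """Per-application: tracks its own tasks and asks the RM for containers."""

    def __init__(self, rm, tasks):
        self.rm = rm
        self.pending = list(tasks)  # (task_name, memory_mb) pairs
        self.running = {}           # task_name -> node name

    def schedule(self):
        still_pending = []
        for name, memory_mb in self.pending:
            node = self.rm.allocate(memory_mb)
            if node is None:
                still_pending.append((name, memory_mb))
            else:
                self.running[name] = node
        self.pending = still_pending


if __name__ == "__main__":
    rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
    am = ApplicationMaster(rm, [("t1", 1024), ("t2", 1024), ("t3", 1024), ("t4", 1024)])
    am.schedule()
    print(am.running, am.pending)
```

Note that the ResourceManager knows nothing about tasks or job types. It only hands out capacity, which is what lets engines other than MapReduce share the same cluster.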
YARN: Taking Hadoop Beyond Batch
Applications Run Natively in Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search, Weave, …)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
Store ALL DATA in one place…
Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service
Hadoop/YARN Eco-system
Applications
• Apache Giraph – Graph Processing
• Apache Hama – BSP
• Apache Hadoop MapReduce – Batch
• Apache Tez – Batch/Interactive
• Apache Samza – Stream Processing
• Apache Storm – Stream Processing
• Apache Spark – Iterative applications
• Elasticsearch – Scalable Search
• Apache NiFi
• Apache Kafka
• …
Frameworks
• Apache Twill
• REEF by Microsoft
• Spring for Apache Hadoop
• …
Let’s write some code
DEMO
Streaming usage patterns
1. Capture -> Persist
2. Capture -> Process -> Persist
3. Capture -> Buffer -> Process -> Persist
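The three patterns above can be sketched with nothing but the Python standard library. This is a minimal illustration and not the code shown in the demo: the function names are hypothetical, a plain list stands in for the persistent store (for example HDFS), upper-casing stands in for real processing, and an in-memory bounded queue stands in for a durable buffer such as a message broker.

```python
import queue
import threading


def process(event):
    """Placeholder transformation; real processing would go here."""
    return event.upper()


def capture_persist(source, sink):
    """Pattern 1: write events to the store exactly as captured."""
    for event in source:
        sink.append(event)


def capture_process_persist(source, sink):
    """Pattern 2: transform each event inline before storing it."""
    for event in source:
        sink.append(process(event))


def capture_buffer_process_persist(source, sink, buffer_size=2):
    """Pattern 3: decouple capture from processing with a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel that tells the consumer to stop

    def consumer():
        while True:
            item = buffer.get()
            if item is done:
                break
            sink.append(process(item))

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in source:
        buffer.put(event)  # blocks while the buffer is full
    buffer.put(done)
    worker.join()


if __name__ == "__main__":
    stored = []
    capture_buffer_process_persist(iter(["a", "b", "c"]), stored)
    print(stored)
```

The buffer in pattern 3 lets the capture side keep accepting events during short processing slowdowns. Because the queue is bounded, a sustained slowdown pushes back on the producer instead of exhausting memory.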
Thank you!
Download Sandbox: Experience Apache Hadoop
Both 2.0 and 1.x Versions Available!
http://hortonworks.com/products/hortonworks-sandbox/
Questions?