An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14
© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
An Introduction to Hadoop and Cloudera
Nashville Cloudera User Group, 10/23/14
Ian Wrigley, Director, Educational Curriculum
[email protected] @iwrigley
201405
Presentation Topics
An Introduction to Hadoop and Cloudera
§ The Motivation for Hadoop
§ ‘Core Hadoop’: HDFS and MapReduce
§ CDH and the Hadoop Ecosystem
§ Data Storage: HBase
§ Data Integration: Flume and Sqoop
§ Data Processing: Spark
§ Data Analysis: Hive, Pig, and Impala
§ Data Exploration: Cloudera Search
§ Managing Everything: Cloudera Manager
§ Conclusion
Traditional Large-Scale Computation
§ Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Lots of complex processing
§ The early solution: bigger computers
– Faster processor, more memory
– But even this couldn’t keep up
Distributed Systems
§ The better solution: more computers
– Distributed systems: use multiple machines for a single job
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.”
– Grace Hopper
[Figure: a Database alongside a Hadoop Cluster]
Distributed Systems: Challenges
§ Challenges with distributed systems
– Programming complexity
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures
Distributed Systems: The Data Bottleneck (1)
§ Traditionally, data is stored in a central location
§ Data is copied to processors at runtime
§ Fine for limited amounts of data
Distributed Systems: The Data Bottleneck (2)
§ Modern systems have much more data
– terabytes+ a day
– petabytes+ total
§ We need a new approach…
Hadoop
§ A radical new approach to distributed computing
– Distribute data when the data is stored
– Run computation where the data is stored
Hadoop: Very High-Level Overview
§ Data is split into “blocks” when loaded
§ Each task typically works on a single block
– Many run in parallel
§ A master program manages tasks
[Figure: a text file split into blocks held on Slave Nodes, with a Master managing the tasks]
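The overview above can be sketched in a few lines of Python. This is only an illustration on a single machine, not Hadoop code: the block size, the vowel-counting task, and the function names are all assumptions made for the example, and a thread pool stands in for the slave nodes.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # bytes; deliberately tiny, real HDFS blocks are far larger


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the input into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def count_vowels(block):
    """A stand-in task: each task sees exactly one block."""
    return sum(1 for byte in block if chr(byte) in "aeiou")


def run_job(data):
    """The 'master': hands one block to each task and combines the results."""
    blocks = split_into_blocks(data)
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_counts = list(pool.map(count_vowels, blocks))
    return sum(partial_counts)
```

Because each task touches only its own block, the tasks need no coordination with one another; only the master sees every partial result.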
Core Hadoop Concepts
§ Applications are written in high-level code
§ Nodes talk to each other as little as possible
§ Data is distributed in advance
– Bring the computation to the data
§ Data is replicated for increased availability and reliability
§ Hadoop is scalable and fault-tolerant
Scalability
§ Adding nodes adds capacity proportionally
§ Increasing load results in a graceful decline in performance
– Not failure of the system
[Figure: Capacity plotted against Number of Nodes]
Fault Tolerance
§ Node failure is inevitable
§ What happens?
– System continues to function
– Master re-assigns tasks to a different node
– Data replication = no loss of data
– Nodes which recover rejoin the cluster automatically
“Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure.”
– Ken Arnold (CORBA designer)
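Task re-assignment after a node failure can be pictured with a small Python sketch. The round-robin policy and the function names below are assumptions chosen for brevity; the slides do not describe how Hadoop's master actually picks the replacement node.

```python
def assign_tasks(tasks, nodes):
    """Round-robin assignment of tasks to live nodes (hypothetical policy)."""
    assignment = {node: [] for node in nodes}
    for i, task in enumerate(tasks):
        assignment[nodes[i % len(nodes)]].append(task)
    return assignment


def handle_node_failure(assignment, failed_node):
    """Move the failed node's tasks onto the surviving nodes."""
    orphaned = assignment.pop(failed_node)
    survivors = sorted(assignment)
    if not survivors:
        raise RuntimeError("no live nodes left")
    for i, task in enumerate(orphaned):
        assignment[survivors[i % len(survivors)]].append(task)
    return assignment
```

The job as a whole keeps running: only the tasks that were on the failed node are rescheduled.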
HDFS Basic Concepts
§ The Hadoop Distributed File System (HDFS) is a filesystem written in Java
§ Sits on top of a native filesystem
§ Provides storage for massive amounts of data
– Scalable
– Fault tolerant
– Supports efficient processing with MapReduce, Spark, and other tools
[Figure: HDFS within a Hadoop Cluster]
How Files are Stored (4)
§ Data files are split into blocks and distributed to data nodes
§ Each block is replicated on multiple nodes (default 3x)
§ NameNode stores metadata
Metadata: information about files and blocks
[Figure: a Very Large Data File split into Block 1, Block 2, and Block 3; each block is stored on three data nodes, and the Name Node holds the metadata]
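A minimal sketch of block replication, assuming a simple rotating placement policy invented for this example (real HDFS placement is more involved and is not described in these slides). The function names are hypothetical.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, rotating the start node."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for b in range(num_blocks):
        placement[f"B{b + 1}"] = [
            nodes[(b + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def blocks_lost(placement, failed_nodes):
    """A block is lost only if every node holding a replica has failed."""
    failed = set(failed_nodes)
    return [blk for blk, holders in placement.items() if set(holders) <= failed]
```

With three replicas per block, any two nodes can fail without losing data; a block disappears only when all three of its holders are down at once.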
Example: Storing and Retrieving Files (1)
NameNode Metadata
/logs/031512.log: B1,B2,B3
/logs/041213.log: B4,B5
B1: A,B,D
B2: B,D,E
B3: A,B,C
B4: A,B,E
B5: C,E,D
[Figure: blocks 1-5 of /logs/031512.log and /logs/041213.log stored across Nodes A-E; a Client asks the NameNode for /logs/041213.log and receives B4,B5]
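The lookup in this example can be written out as a short Python sketch. The two dictionaries copy the NameNode metadata shown on the slide; the function names and the "first live replica" read policy are assumptions for illustration, not the HDFS client API.

```python
FILE_TO_BLOCKS = {
    "/logs/031512.log": ["B1", "B2", "B3"],
    "/logs/041213.log": ["B4", "B5"],
}
BLOCK_TO_NODES = {
    "B1": ["A", "B", "D"],
    "B2": ["B", "D", "E"],
    "B3": ["A", "B", "C"],
    "B4": ["A", "B", "E"],
    "B5": ["C", "E", "D"],
}


def locate(path):
    """Return [(block, nodes holding it), ...] for a file, in block order."""
    return [(block, BLOCK_TO_NODES[block]) for block in FILE_TO_BLOCKS[path]]


def read_plan(path, unavailable=()):
    """Pick, for each block, the first listed node that is still available."""
    plan = []
    for block, nodes in locate(path):
        live = [n for n in nodes if n not in unavailable]
        if not live:
            raise IOError(f"no live replica for {block}")
        plan.append((block, live[0]))
    return plan
```

The client asks only for metadata here; the block contents themselves would be fetched from the chosen data nodes.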
Important Notes About HDFS
§ HDFS performs best with a modest number of large files
– Millions, rather than billions, of files
– Each file typically 100MB or more
§ Files in HDFS are “write once”
– Files can be replaced but not changed
MapReduce
§ The Mapper
– Each Map task (typically) operates on a single HDFS block
– Map tasks (usually) run on the node where the block is stored
§ Shuffle and Sort
– Sorts and consolidates intermediate data from all mappers
– Happens after all Map tasks are complete and before Reduce tasks start
§ The Reducer
– Operates on shuffled/sorted intermediate data (Map task output)
– Produces final output
[Figure: Map, then Shuffle and Sort, then Reduce]
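The three phases can be imitated on one machine with plain Python. This word-count sketch is an assumption-laden illustration (in-memory lists instead of HDFS blocks, hypothetical function names), not the Hadoop MapReduce API.

```python
from collections import defaultdict


def map_phase(block):
    """Mapper: emit (word, 1) for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle_and_sort(mapped_outputs):
    """Group all values by key across mappers, with keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(blocks):
    mapped = [map_phase(block) for block in blocks]  # one Map task per block
    return [reduce_phase(k, v) for k, v in shuffle_and_sort(mapped)]
```

For example, `word_count(["the cat", "the dog the"])` groups the three occurrences of "the" from both blocks before the reducer sums them.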
The Hadoop Ecosystem (1)
[Figure: CDH comprises the Hadoop Core Components (Hadoop Distributed File System, MapReduce) and the Hadoop Ecosystem (Hive, Pig, Impala, Sqoop, Oozie, Flume, HBase, …)]
The Hadoop Ecosystem (2)
§ CDH includes many Hadoop Ecosystem components
§ Following are more details on some of the key components
[Figure: Hadoop Ecosystem: Hive, Pig, Impala, Sqoop, Oozie, Flume, HBase, …]
CDH
§ CDH (Cloudera’s Distribution, including Apache Hadoop)
– 100% open source, enterprise-ready distribution of Hadoop and related projects
– The most complete, tested, and widely-deployed distribution of Hadoop
– Integrates all key Hadoop ecosystem projects
HBase: The Hadoop Database
§ HBase: database layered on top of HDFS
– Provides interactive access to data
§ Stores massive amounts of data
– Petabytes+
§ High throughput
– Thousands of writes per second (per node)
§ Handles sparse data well
– No wasted space for a row with empty columns
§ Limited access model
– Optimized for lookup of a row by key rather than full queries
– No transactions: single row operations only
[Figure: HBase layered on HDFS]
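The access model above (lookup by row key, sparse rows) can be illustrated with a toy in-memory table. The class and its storage layout are assumptions for this sketch; only the operation names get, put, and scan are borrowed from HBase's vocabulary, and this is not the HBase client API.

```python
class SparseTable:
    """Toy row-key store: each row holds only the columns that were written."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        """Write one cell; absent columns take no space at all."""
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key):
        """Look up a single row by key; unknown keys give an empty row."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key, in key order."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])
```

Note what is missing: there is no way to ask "which rows have email set" without scanning, which mirrors the "row-key only" indexing described above.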
HBase vs RDBMS
Feature                                      RDBMS      HBase
Transactions                                 Yes        Single row only
Query language                               SQL        get/put/scan (or use Hive or Impala)
Indexes                                      Yes        Row-key only
Max data size                                TBs        PBs
Read/write throughput (queries per second)   Thousands  Millions
When To Use HBase
§ Use plain HDFS if…
– You only append to your dataset (no random write)
– You usually read the whole dataset (no random read)
§ Use HBase if…
– You need random write and/or read
– You do thousands of operations per second on TB+ of data
§ Use an RDBMS if…
– Your data fits on one big node
– You need full transaction support
– You need real-time query capabilities
Flume: Real-time Data Import
§ What is Flume?
– A service to move large amounts of data in real time
– Example: storing log files in HDFS
§ Flume is
– Distributed
– Reliable and available
– Horizontally scalable
– Extensible
Flume: High-Level Overview
[Figure: multiple Agents forwarding data through further Agent(s) to HDFS]
• Collect data as it is produced: files, syslogs, stdout or custom source
• Pre-process data before storing: e.g., transform, scrub, enrich
• Process in place: e.g., encrypt, compress
• Write in parallel: scalable throughput
• Store in any format: text, compressed, binary, or custom sink
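The collect, pre-process, and store stages can be pictured as a chain of Python generators. This is a loose sketch of the idea only: the stage names, the password-masking rule, and the gzip output are assumptions for the example and do not correspond to Flume's configuration or API.

```python
import gzip


def collect(lines):
    """Collect events as they are produced (here: lines from any iterable)."""
    for line in lines:
        yield line.rstrip("\n")


def scrub(events):
    """Pre-process: drop blank events and mask anything after 'password='."""
    for event in events:
        if not event.strip():
            continue
        head, sep, _ = event.partition("password=")
        yield head + sep + "***" if sep else event


def store_compressed(events):
    """Process in place and store: join the events and gzip the result."""
    return gzip.compress("\n".join(events).encode("utf-8"))
```

Chained together as `store_compressed(scrub(collect(source)))`, each event flows through every stage without the whole input being held at once until the final join.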
Sqoop: Exchanging Data With RDBMSs
§ Sqoop: SQL to Hadoop
– Transfers data between RDBMS and HDFS
– Uses a command-line tool or application connector
– Allows incremental imports
– Supports virtually all RDBMSs which speak JDBC
– Custom connectors available for some RDBMSs for increased speed
[Figure: Sqoop moving data between an RDBMS and HDFS]
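The idea behind an incremental import is to remember the highest value already imported from some column and fetch only newer rows next time. The sketch below mimics that idea with Python's built-in sqlite3 module standing in for the RDBMS; the function, table, and column names are hypothetical, and this is not how Sqoop itself is invoked.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows whose check_column exceeds last_value.

    Returns the new rows and the new high-water mark. Table and column
    names are interpolated directly, which is acceptable for a sketch
    but unsafe with untrusted input.
    """
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    )
    rows = cursor.fetchall()
    columns = [d[0] for d in cursor.description]
    idx = columns.index(check_column)
    new_last = rows[-1][idx] if rows else last_value
    return rows, new_last


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

first_batch, mark = incremental_import(conn, "orders", "id", 0)
conn.execute("INSERT INTO orders VALUES (3, 5.0)")
second_batch, mark = incremental_import(conn, "orders", "id", mark)
```

The second call returns only the row added after the first import, because the high-water mark from the first call is passed back in.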
Data Center Integration
[Figure: File Server, Relational Database (OLTP), Data Warehouse (OLAP), and Web/App Servers connected to the Hadoop Cluster through hadoop fs, Flume, and Sqoop]
Apache Spark
§ Apache Spark is a fast, general engine for large-scale data processing on a cluster
§ Originally developed at AMPLab at UC Berkeley
§ Open source Apache project
§ Provides several benefits over MapReduce
– Faster
– Better suited for iterative algorithms
  – Can hold intermediate data in RAM, resulting in much better performance
– Easier API
  – Supports Python, Scala, Java
– Supports real-time streaming data processing
Spark vs Hadoop MapReduce
§ MapReduce
– Widely used, huge investment already made
– Supports and supported by many complementary tools
– Mature, well-tested
§ Spark
– Flexible
– Elegant
– Fast
– Supports real-time streaming data processing
§ Over time Spark will supplant MapReduce as the general processing framework used by most organizations
Hive and Pig: High Level Data Languages
§ The motivation: MapReduce is powerful but hard to master
§ Even Spark requires a developer who can code in Scala or Python
§ A solution: Hive and Pig
– Built on top of MapReduce
  – Currently being ported to run on top of Spark for better performance
– Leverage existing skillsets
  – Data analysts who use SQL
  – Programmers who use scripting languages
– Open source Apache projects
  – Hive initially developed at Facebook
  – Pig initially developed at Yahoo!
Hive
§ What is Hive?
– HiveQL: An SQL-like interface to Hadoop
SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid
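To see what the query returns, the same statement can be run against a small in-memory table. The sketch uses Python's built-in sqlite3 module purely as a stand-in SQL engine (it is not Hive), and the table schema and sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; only price and storeid appear in the query above.
conn.execute(
    "CREATE TABLE purchases (itemid TEXT, price INTEGER, storeid INTEGER, purchaserid TEXT)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("tv", 12000, 7, "p1"),
        ("pen", 2, 3, "p2"),
        ("car", 250000, 3, "p3"),
    ],
)

big_ticket = conn.execute(
    "SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid"
).fetchall()
```

Only the two purchases above 10000 survive the filter, and they come back ordered by store.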
Pig
§ What is Pig?
– Pig Latin: A dataflow language for transforming large data sets
purchases = LOAD '/user/dave/purchases' AS (itemID, price, storeID, purchaserID);
bigticket = FILTER purchases BY price > 10000;
...
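The dataflow style of the script above, where each statement names a new relation derived from the previous one, can be mirrored in Python with generators. This is a rough analogy rather than Pig itself: the helper names, the comma-delimited sample data, and the in-memory source are assumptions for the sketch.

```python
import csv
import io

# Hypothetical sample standing in for the contents of /user/dave/purchases.
SAMPLE = "tv,12000,7,p1\npen,2,3,p2\ncar,250000,3,p3\n"


def load(source, fields):
    """Rough analogue of LOAD ... AS (...): read rows and name the fields."""
    for row in csv.reader(source):
        yield dict(zip(fields, row))


def filter_by(relation, predicate):
    """Rough analogue of FILTER ... BY: keep records matching the predicate."""
    return (record for record in relation if predicate(record))


purchases = load(io.StringIO(SAMPLE), ["itemID", "price", "storeID", "purchaserID"])
bigticket = list(filter_by(purchases, lambda r: int(r["price"]) > 10000))
```

As in the script, nothing is read until the final relation is materialized; each step only describes a transformation of the one before it.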
Impala: High Performance Queries
§ High-performance SQL engine for vast amounts of data
– Similar query language to HiveQL
– 10 to 50+ times faster than Hive, Pig, or MapReduce
– Effectively, provides ‘real time’ results
§ Impala runs on Hadoop clusters
– Data stored in HDFS
– Does not use MapReduce
§ Developed by Cloudera
– 100% open source, released under the Apache software license
Which to Choose? (1)
§ Choose the best solution for the given task
– Mix and match as needed
§ MapReduce
– Low-level approach offers flexibility, control, and performance
– More time-consuming and error-prone to write
– Choose when control and performance are most important
§ Pig, Hive, and Impala
– Faster to write, test, and deploy than MapReduce
– Better choice for most analysis and processing tasks
Which to Choose? (2)
§ Use Impala when…
– You have analysts familiar with SQL
– You need near real-time responses to ad hoc queries
– You have structured data with a defined schema
§ Use Hive or Pig when…
– You need support for custom file types, or complex data types
§ Use Pig when…
– You have developers experienced with writing scripts
– Your data is unstructured/multi-structured
§ Use Hive when…
– Your data is structured and you are performing long-running batch jobs
Comparing Pig, Hive, and Impala
Description of Feature            Pig       Hive      Impala
SQL-based query language          No        Yes       Yes
Schema                            Optional  Required  Required
Supports user-defined functions   Yes       Yes       Yes
Extensible file format support    Yes       Yes       No
Query speed                       Slow      Slow      Fast
Accessible via ODBC/JDBC          No        Yes       Yes
Do These Replace an RDBMS?
§ Probably not if the RDBMS is used for its intended purpose
§ Relational databases are optimized for:
– Relatively small amounts of data
– Immediate results
– In-place modification of data
§ Pig, Hive, and Impala are optimized for:
– Large amounts of read-only data
– Extensive scalability at low cost
§ Pig and Hive are better suited for batch processing
– Impala and RDBMSs are better for interactive use
Analysis Workflow Example
[Figure: workflow steps around a Hadoop Cluster with Impala]
Import Transaction Data from RDBMS
Sessionize Web Log Data with Pig
Sentiment Analysis on Social Media with Hive
Generate Nightly Reports using Pig, Hive, or Impala
Analyst using Impala shell for ad hoc queries
Analyst using Impala via BI tool
Cloudera Search
§ Real-time, scalable indexing
§ Load any type of data
§ Text and faceted searching
Cloudera Search Example: Twitter Feed Search
Iterative search using facets
Full text search
Reducing Complexity With Cloudera Manager
§ Putting Hadoop into production requires stringent uptimes
§ Clusters are made up of a large number of hosts
– Each host runs multiple Hadoop services
– Difficult to know the status of everything
§ Inevitable issues will arise with hardware and software
§ Keeping track of the cluster becomes an issue
– Are all hosts healthy and working?
– Am I using all of the best practices for the service?
– Is there a performance issue for a host or service?
– Is the cluster secure?
What Is Cloudera Manager?
§ Cloudera Manager is a purpose-built application designed to make the administration of Hadoop simple and straightforward
– Automates the installation of a Hadoop cluster
– Quickly adds and configures new services on a cluster
– Provides real-time monitoring of cluster activity
– Produces reports of cluster usage
– Manages users and groups who have access to the cluster
– Integrates with your existing enterprise monitoring tools
§ Cloudera Manager Express Edition
– Free
§ Cloudera Enterprise
– Cloudera Manager plus support
– Contact us for pricing
Cloudera Manager Dashboard
Health Status and Charting
Conclusion
§ There are several more projects in CDH
– CDH supports all the key projects you need
§ We haven’t even talked about security!
– CDH includes Kerberos integration for authentication
– Cloudera Enterprise provides all the security you need, whatever your industry
– Recently achieved PCI certification
§ Download the QuickStart VM to get started in a single VM
§ Try Cloudera on a real cluster for free
§ All available at cloudera.com/live
§ Questions?