Transcript of Modern Big Data Analytics Tools: An Overview
-
Modern Big Data Analytics Tools: An Overview
7/24/2019 1/43 S. Saranya / IT6006 / Modern Big Data Analytics Tools
-
Hadoop Midwife :-)
-
Once upon a time, in a land far, far away…
-
Fast forward 15 years…
-
What Happened?
-
In a blink of an eye…
HDFS
Sqoop, Flume
Coordination and workflow management: Zookeeper, Oozie
Command Center
GemFire XD
MapReduce
Pig, Hive
Tez
Giraph
Hadoop UI: Hue
SolrCloud
Phoenix
HBase
Crunch, Mahout
Spark: Shark, Streaming, MLlib, GraphX
Impala
HAWQ
Spring XD
MADlib
Hamster
PivotalR
YARN
Legend: ASF Projects, FLOSS Projects, Pivotal Products
-
Google Papers
-
Yahoo! Search
+
=
-
W-1-W
• WebMap: Graph processing for WWW
• Dreadnaught: Infrastructure for WebMap
• W-1-W: WebMap In One Week
• Juggernaut: Infrastructure for W-1-W
• JFS, JMR, Condor: Abandoned for Hadoop
-
Lucene, Nutch
-
"MapReduce is the revenge of system programmers on the database community."
- Anonymous at XLDB, Stanford, 2010
-
O’Reilly Books 2013
-
Who Uses Hadoop? (From Hadoop Summit 2010)
-
Big Data Landscape - July 2012
http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
-
Hadoop Maturity
• ETL Offload: Accommodate massive data growth with existing EDW investments
• Data Lakes: Unify Unstructured and Structured Data Access
• Big Data Apps: Build analytic-led applications impacting top line revenue
• Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture
-
70% of data generated by customers
80% of data being stored
3% being prepared for analysis
0.5% being analyzed
-
Storage Options
• HDFS, MapR, Quantcast QFS
• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre
• Amazon S3, EMC Atmos, OpenStack Swift
• GlusterFS, Ceph
• EMC ViPR
-
SQL-on-Hadoop
• Pivotal HAWQ
• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger
• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase
• More to come...
-
HAWQ & HDFS
• Master Servers: Planning & dispatch
• Network Interconnect
• Segment Servers: Query execution
• Storage: HDFS, HBase …
-
HAWQ on HDFS (diagram labels)
• Namenode: Meta Ops, replication
• Datanodes in Rack1 and Rack2: Read/Write
• Master host
• HAWQ Interconnect
• Segment hosts, each with several Segments and a Datanode
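The Namenode, Rack1/Rack2 and replication labels in this diagram point at HDFS's rack-aware replica placement. The following is a minimal sketch of the default policy for three replicas (first on the writer's Datanode, second on a node in a different rack, third on another node in that same remote rack). The function name `place_replicas`, the topology dictionary and the node names are hypothetical, and the sketch assumes every rack has at least two Datanodes. Real HDFS picks nodes randomly and also weighs load and free space; this version picks deterministically to stay easy to follow.

```python
def place_replicas(writer, topology):
    """Choose three datanodes for one block, following the default
    HDFS-style policy in simplified, deterministic form.

    topology maps a rack name to the list of datanodes on that rack
    (a hypothetical structure used only for this illustration).
    """
    # Look up which rack each datanode lives on.
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]

    # Replica 1: the datanode the client is writing from.
    first = writer

    # Replica 2: a datanode on a different rack, so a rack failure
    # cannot destroy every copy of the block.
    remote_rack = next(r for r in sorted(topology) if r != local_rack)
    second = sorted(topology[remote_rack])[0]

    # Replica 3: a different datanode on the same remote rack, which
    # keeps cross-rack traffic down to a single transfer.
    third = next(n for n in sorted(topology[remote_rack]) if n != second)

    return [first, second, third]
```

For example, with `{"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}` and a writer on `dn1`, the sketch places the replicas on `dn1`, `dn3` and `dn4`.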
-
HAWQ vs Hive
Lower is Better
-
In-Database Analytics
Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
-
MADlib Algorithms
-
MADlib Functions
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA
-
k-Means Usage
SELECT * FROM madlib.kmeanspp (
    'customers',  -- name of the input table
    'features',   -- name of the feature array column
    2             -- k : number of clusters
);
centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
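To make the output columns concrete, here is a minimal pure-Python sketch of k-means with k-means++ seeding that returns the same three fields: `centroids`, `objective_fn` and `frac_reassigned`. It is an illustration under stated assumptions, not MADlib's implementation. The function names, the default iteration limits and the choice of summed squared Euclidean distance as the objective are assumptions made for this example, and the input points are assumed not to be all identical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """Pick k initial centroids with k-means++ seeding."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Far-away points are proportionally more likely to be picked next.
        centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids


def kmeans(points, k, max_iter=20, min_frac_reassigned=0.001, seed=0):
    """Lloyd's iterations on top of k-means++ seeding (illustrative only)."""
    rng = random.Random(seed)
    centroids = kmeanspp_seed(points, k, rng)
    assignment = [-1] * len(points)
    frac_reassigned = 1.0
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        frac_reassigned = changed / len(points)
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
        # Stop once (almost) no point changed cluster in this pass.
        if frac_reassigned <= min_frac_reassigned:
            break
    objective = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return {
        "centroids": centroids,
        "objective_fn": objective,
        "frac_reassigned": frac_reassigned,
    }
```

Calling `kmeans(points, 2)` on a list of two-dimensional tuples returns two centroids, in the same spirit as the two coordinate pairs in the output above.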
-
PivotalR
• Interface is R client
• Execution is in database
• Parallelism handled by PivotalR
• Supports a portion of R
R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)
-
MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
-
Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)
HADOOP 1.0
• Pig (data flow), Hive (sql), Others (cascading)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0
• Pig (data flow), Hive (sql), Others (cascading)
• MR (batch)
• Tez (execution engine)
• RT Stream, Graph: Storm, Giraph
• Services: HBase
• YARN (cluster resource management)
• HDFS2 (redundant, reliable storage)
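The MapReduce batch model named in both stacks can be sketched in a few lines. This is a single-process illustration of the map, shuffle and reduce phases using word count as the job. The function names are hypothetical and nothing here is the Hadoop API; a real cluster runs the map and reduce tasks in parallel across nodes and reads its input from HDFS.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    groups = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())
```

For example, `run_job(["big data", "Big tools"])` counts `big` twice and `data` and `tools` once each.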
-
YARN Platform (Image Courtesy Arun Murthy, Hortonworks)
Applications Run Natively IN Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
YARN (Cluster Resource Management)
HDFS2 (Redundant, Reliable Storage)
-
YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)
• ResourceManager: Scheduler
• Client2
• NodeManagers (twelve shown)
• AM 1 with Container 1.1, Container 1.2, Container 1.3
• AM2 with Container 2.1, Container 2.2, Container 2.3, Container 2.4
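The division of labour in this diagram (a ResourceManager whose Scheduler hands out containers on NodeManagers, with one ApplicationMaster per application) can be illustrated with a toy model. This is a sketch under strong simplifying assumptions: every class, method and node name is hypothetical, capacity is counted in whole container slots instead of memory and vcores, and the placement rule (most free slots first) is made up for illustration. Real YARN schedulers such as the Capacity and Fair schedulers are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One worker node with a fixed number of container slots."""
    name: str
    capacity: int
    containers: list = field(default_factory=list)

    def free(self):
        return self.capacity - len(self.containers)


class ResourceManager:
    """Toy scheduler: place each container on the node with most free slots."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, container_id):
        # Ties are broken by list order because max() keeps the first maximum.
        node = max(self.nodes, key=lambda n: n.free())
        if node.free() == 0:
            return None  # cluster is full, request stays unsatisfied
        node.containers.append(container_id)
        return node.name

    def submit(self, app_id, num_containers):
        """Place the ApplicationMaster first, then its task containers."""
        placements = {}
        placements[f"AM {app_id}"] = self.allocate(f"AM {app_id}")
        for i in range(1, num_containers + 1):
            cid = f"Container {app_id}.{i}"
            placements[cid] = self.allocate(cid)
        return placements
```

Submitting application 1 with three containers and application 2 with four, as in the diagram, yields a mapping from each AM or container to the node that hosts it, or to `None` when no slot is left.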
-
GraphLab + Hamster on Hadoop
-
Data Platform of the Future?
Analytic Data Marts
Operational Intelligence
SQL Services
In-Memory Database
Run-Time Applications
Data Staging Platform
Stream Ingestion
Streaming Services
Data Mgmt. Services
In-Memory Grid
New Data-fabrics
...ETC
Software-Defined Datacenter
-
Questions?