© 2013 IBM Corporation Platform Computing 1 IBM BigInsights 2.1 Understanding the role of IBM...
© 2013 IBM Corporation
Platform Computing
1
IBM BigInsights 2.1 Understanding the role of IBM Platform Computing and GPFS FPO in the STG BigInsights 2.1 Reference Architecture
Gord Sissons, Steve Hurley, Chris Porter, Blane Rockafellow
Agenda
• About the BigInsights 2.1 HW reference architecture
• Solution components
• Key BigInsights Advantages
• Platform Computing Products
• IBM Platform Symphony
• IBM GPFS FPO
• IBM Platform Cluster Manager
The IBM System X BigInsights Reference Architecture
• One of a family of big data reference architectures from IBM
• Enables fast, risk-free deployment with validated configurations
• Flexibility to accommodate different client needs
• Value-added software components can be implemented with Lab Services
• Pre-assembled racks
• Customized to your needs
• Integrated and tested
• Supported as a solution
• Tailored to your needs
• Start small…and grow
• Easy to order
• Easy to manage
IBM BigInsights Reference Architecture
Hardware: Incorporating a balance of value, enterprise and performance options

Configuration | Starter | Half Rack w/ Mgmt Nodes* | Full Rack w/ Mgmt Nodes* | Full Data Node Rack*
Available storage (2TB/3TB) | 108TB / 144TB | 324TB / 432TB | 648TB / 864TB | 720TB / 960TB
Raw data space (2TB/3TB) | 27TB / 36TB | 81TB / 108TB | 114TB / 216TB | 180TB / 240TB
Mgmt Nodes / Data Nodes | 1 Mgmt / 3 Data | 3 Mgmt / 9 Data | 3 Mgmt / 18 Data | 0 Mgmt / 20 Data
Switches | 1 x 10GbE / 1 x 1GbE | 1 x 10GbE / 1 x 1GbE | 1 x 10GbE / 1 x 1GbE | 1 x 10GbE / 1 x 1GbE

* Number of management nodes required varies with cluster size and workload; for multi-rack configs, select combination of these racks as needed

Management Node: x3550 M4 with
• Two E5-2650 2GHz 8-core CPU
• 128GB RAM, 16x 8GB 1600MHz RDIMM
• Four 600GB 2.5” HDD (OS)
• Two dual-port 10GbE (data)
• Dual-port 1GbE (mgmt)

Data Node: x3630 M4 with
• Two E5-2450 2.1GHz 8-core CPU
• 48GB RAM, 6x 8GB 1600MHz RDIMM
• Two 3TB 3.5” HDD (OS/app)
• Twelve 3TB 3.5” HDD (data)
• Optional 4TB HDD upgrade
• Dual-port 10GbE (data)
• Dual-port 1GbE (mgmt)
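The capacity figures in the table can be cross-checked with simple arithmetic. The sketch below is illustrative only and is not part of any IBM tool: it assumes twelve data drives per data node and the 3TB and 4TB drive sizes listed in the data node specification, and the function and variable names are hypothetical. Under these assumptions the products reproduce the "Available storage" row for all four configurations.

```python
# Cross-check of the rack capacity figures.
# Assumptions: 12 data drives per data node and 3TB or 4TB drives,
# both taken from the data node specification on this slide.

DATA_DRIVES_PER_NODE = 12


def rack_capacity_tb(data_nodes: int, drive_tb: int) -> int:
    """Total capacity of the data drives in a configuration, in TB."""
    return data_nodes * DATA_DRIVES_PER_NODE * drive_tb


# Data node counts per configuration, as listed in the table.
configs = {
    "Starter": 3,
    "Half Rack": 9,
    "Full Rack": 18,
    "Full Data Node Rack": 20,
}

for name, nodes in configs.items():
    with_3tb = rack_capacity_tb(nodes, 3)
    with_4tb = rack_capacity_tb(nodes, 4)
    print(f"{name}: {with_3tb}TB / {with_4tb}TB")
```

The same function can be used to size a multi-rack configuration by summing the data node counts of the selected racks.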
IBM BigInsights Reference Architecture
Software: Your choice of best-of-breed and open-source components

• Analytics software environment: IBM InfoSphere BigInsights
• Resource sharing / multi-tenancy: Platform Symphony Advanced Edition (optional)
• Scheduler: Hadoop Scheduler / Platform Symphony Scheduler (included)
• Distributed file system: HDFS, or GPFS 3.5.0.9 (optional) with GPFS FPO Connector
• Provisioning and cluster management: Platform Cluster Manager / Platform Cluster Manager – AE (optional)
• Operating system: RHEL, SUSE 64-bit Linux
• Network: 2 x Mellanox ConnectX data network, 1 x dual-port 1 GbE management network
• Hardware: x3550 M4 master node(s), x3630 M4 compute nodes

* Optional items should be sold with Lab Services to ensure proper installation and configuration
A Comprehensive solution for big data analytics
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
Big Data Platform: Systems Management, Application Development, Visualization & Discovery, Accelerators, Information Integration & Governance, Data Warehouse, Hadoop System, Stream Computing
Agile, multi-tenant shared infrastructure
The IBM Big Data Platform
• Comprehensive platform
• Data at rest, data in motion
• Structured, unstructured, semi-structured
• Extensive library of data connectors
• Rich development tools
• Application accelerators
• Web-based management console
The IBM Big Data Platform
[Figure: IBM InfoSphere BigInsights component diagram. Panels: Visualization & Discovery; Integration; Workload Optimization; Runtime / Scheduler; Advanced Analytic Engines; File System; Data Store; Applications & Development; Administration; Management. Components shown: Streams, Netezza, Flume, DB2, DataStage, MapReduce, HDFS, HBase, Text Processing Engine & Extractor Library, BigSheets, JDBC, Text Analytics, Pig & Jaql, Hive, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Jaql, Pig, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Integrated Installer, Admin Console, Sqoop, Adaptive Algorithms, Dashboard & Visualization, Apps, Workflow Monitoring, HCatalog, Security, Audit & History, Lineage, R, Guardium, Platform Computing, Cognos, Symphony, GPFS FPO, Symphony AE. Legend: IBM, Open Source, Optional.]
Complexity - A Key Customer Challenge
Multiple distributed software components, often deployed on separate infrastructure
expensive to deploy, expensive to manage, expensive to evolve
Cluster sprawl drives cost and inefficiency
Operational challenges are looming
• Fast-evolving ecosystem
• Multiple versions and distributions
• Many inter-dependencies
• Data management challenges (HDFS)
• Application lifecycle management concerns
From Mike Gualtieri, Forrester Research
Resource Orchestration
Multi-tenant shared service environment
Provisioning & Management
Enterprise Storage
Workload Manager(s)
A smarter, consolidated infrastructure
IBM PLATFORM SYMPHONY
Understanding the advantage
IBM Platform Symphony
• A heterogeneous grid management platform
• A high-performance SOA middleware environment
• Supports diverse compute- and data-intensive applications
  • Compute- and data-intensive ISV analytic applications
  • In-house analytic applications (C/C++, C#/.NET, Java, Excel, R, etc.)
  • Optimized low-latency Hadoop-compatible run-time
• Can be used to launch, persist and manage non-grid-aware application services
• Reacts instantly to time-critical requirements
• Production proven multi-tenancy with resource sharing capabilities
• Embedded single-tenant license in InfoSphere BigInsights 2.1
Symphony brings unique capabilities to Big Data
Performance
• Performance advantages for a variety of MapReduce workloads: boost productivity and reduce or avoid cost
Resource sharing*
• Share infrastructure among departments and across multiple Hadoop and non-Hadoop applications to maximize efficiency and reduce cost
Scheduling agility
• Proportional, priority-based resource allocation, SLA guarantees, and fast configurable pre-emption ensure that Symphony can respond instantly to time-critical workloads
SLA management*
• Removes a major barrier to resource sharing, helping organizations evolve to a shared service model to maximize flexibility and reduce infrastructure costs
Reporting & Analytics*
• Optional Platform Analytics add-on enables organizations to monitor granular resource usage for charge-back accounting and improved capacity planning
Reliability
• Ensure reliability of core system services, and make individual Hadoop jobs recoverable to avoid down-time and ensure that critical reporting windows and SLAs are met
* IBM Platform Symphony Advanced Edition license required
IBM Platform Symphony
Performance
• Low-latency SOA workload manager
• Performance results vary between ~40% and ~10x depending on workload
• Audited results [1] show an average 7x advantage on social media workloads with a 50x advantage in raw scheduling performance
• Single-tenant [2] Symphony license included in BigInsights 2.1 Enterprise Edition
• Many performance enhancements
  • Push model for low-latency scheduling
  • Shuffle-stage optimizations
  • Use of native APIs for JAR file movement
  • Generic slots to fully utilize cluster
[1] Audited STAC Report available for download - http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/products/symphony/highperfhadoop.html
[2] The embedded Symphony license entitles a user to run only a single instance of BigInsights. No limits are placed on concurrently executing BigInsights workloads. Customers can purchase Platform Symphony Advanced Edition to support multiple grid consumers (tenants).
Comparative “sleep test” based on methodology to measure scheduling performance discussed at Hadoop World 2011. Compares Hadoop 0.20.2, Hadoop 1.0.1 (with 0.3 second heartbeat) and Hadoop 1.0.1 accelerated by IBM Platform Symphony.
http://www.slideshare.net/cloudera/hadoop-world-2011-hadoop-and-performance-todd-lipcon-yanpei-chen-cloudera
IBM Platform Symphony
Resource sharing
• Share resources among heterogeneous workloads (Hadoop and non-Hadoop)
• Up to 300 concurrent job trackers
• Flexible application profiles
• Support multiple IBM and third-party analytic applications on a shared infrastructure
  • InfoSphere Streams, IBM DataStage, SPSS, SAS, MathWorks MATLAB, R, etc.
IBM Platform Symphony
Scheduling agility
• Agile scheduling ensures that time-critical workloads start and finish fast
• Optionally give priority to interactive jobs (e.g. BigSheets, Big SQL)
• Resource allocations shift instantly based on priority adjustments and proportional allocations at run-time
• The generic slot model ensures that the cluster can be kept 100% busy
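The idea of proportional allocation with idle capacity lent to busy consumers can be sketched in a few lines. This is a minimal illustration of the general technique, not Symphony's actual scheduling algorithm: the function name, the share ratios and the demand figures are all hypothetical, and slots are treated as interchangeable (in the spirit of the generic slot model).

```python
def allocate_slots(total_slots, shares, demand):
    """Split total_slots among consumers in proportion to their shares.

    A consumer never receives more slots than it has pending work for;
    any entitlement it cannot use is redistributed, again proportionally,
    to consumers that still have unmet demand.
    """
    allocation = {name: 0 for name in shares}
    remaining = total_slots
    active = {name for name in shares if demand.get(name, 0) > 0}
    while remaining > 0 and active:
        share_sum = sum(shares[name] for name in active)
        granted = 0
        for name in sorted(active):
            entitlement = remaining * shares[name] // share_sum
            give = min(entitlement, demand[name] - allocation[name])
            allocation[name] += give
            granted += give
        active = {n for n in active if allocation[n] < demand[n]}
        if granted == 0:
            # Integer rounding left a few slots unassigned:
            # hand them out one at a time, largest share first.
            for name in sorted(active, key=lambda n: -shares[n]):
                if granted == remaining:
                    break
                allocation[name] += 1
                granted += 1
            active = {n for n in active if allocation[n] < demand[n]}
        remaining -= granted
    return allocation


# Hypothetical example: interactive work owns 3/4 of the cluster but only
# needs 30 slots, so the batch consumer borrows the unused entitlement.
print(allocate_slots(100,
                     {"interactive": 3, "batch": 1},
                     {"interactive": 30, "batch": 500}))
```

When the interactive demand later rises above its 75-slot entitlement, re-running the allocation returns the split to 75 / 25, which is the point at which a real scheduler would pre-empt borrowed slots.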
IBM Platform Symphony
SLA management
• Guarantee minimum quality of service
• Time-variant sharing policies
• Multiple resource sharing models
• Granular, directed sharing
• Configurable pre-emption policies
• Maintain multiple versions of application services to simplify life-cycle management
• Share resources between Dev, Test, Production & QA application instances
IBM Platform Symphony
Reporting and Analytics
• Comprehensive reporting built-in
• Monitor resource allocations to tune sharing
• Ensure business SLAs are being met
• Optional Platform Analytics add-on for OLAP analysis supporting chargeback accounting and improved capacity planning
IBM Platform Symphony
Reliability
• No single point of failure
• All services highly available
• Hadoop jobs recoverable in the event of failure
• Ensure deadlines and batch windows are met
• Service replay debugger helps rapidly diagnose problems that occur in production at scale
• Production proven at scale
IBM GPFS
Bringing new capabilities to IBM BigInsights
GPFS – bringing new capabilities to BigInsights
POSIX compliance
• While HDFS is a single-purpose file system, GPFS implements the POSIX specification natively, meaning that multiple applications can share the same file system, improving flexibility and avoiding data redundancy
File system reliability
• GPFS FPO eliminates the name node as a single point of failure, improving file system reliability and recoverability
Flexible storage configuration
• Employ the right storage architecture depending on the application need, using shared-nothing storage with n-way block replication for Hadoop workloads and traditional GPFS storage for non-Hadoop workloads, to improve flexibility and minimize cost
Enterprise features
• GPFS FPO and GPFS can co-exist on the same cluster, bringing advanced features to Hadoop environments, including active file management, information lifecycle management and file system snapshots, to simplify the management of large storage infrastructure
Support from the source
• Avoid the risk of storing critical data on an open-source file system with limited support. IBM owns the codebase for GPFS and can provide mission-critical support
POSIX file system
• Native POSIX file system
• Avoid workarounds like FUSE
• Avoid needless data movement and replication
• Variable block sizes provide good performance across diverse types of workloads
A single filesystem for both MapReduce and non-MapReduce applications
Hadoop MapReduce applications Native OS applications
GPFS – bringing new capabilities to BigInsights
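What "native OS applications" means in practice is that ordinary file calls work against the mounted file system, with no Hadoop client library or FUSE layer in between. The sketch below is a minimal illustration under stated assumptions: a temporary directory stands in for the mount point so that the example runs anywhere, and the file name and helper function are hypothetical.

```python
import os
import tempfile


def write_then_read(mount_point: str) -> str:
    """Write a small file with ordinary file I/O and read it back."""
    path = os.path.join(mount_point, "results.csv")
    with open(path, "w") as f:
        f.write("region,total\nemea,42\n")
    with open(path) as f:
        return f.read()


# Assumption: a temporary directory stands in for the POSIX mount point.
# On a real cluster this would be wherever the file system is mounted.
with tempfile.TemporaryDirectory() as mount_point:
    print(write_then_read(mount_point))
```

The same file could then be read by a MapReduce job, which is the data-sharing case the slide describes.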
File system reliability
• GPFS FPO avoids the need for a central namenode, a common failure point in HDFS
• Avoid long recovery times in the event of name node failure
• Pipelined replication for efficient storage of block replicas in GPFS FPO environment
• Boost performance for meta-data intensive applications where the name-node can emerge as a bottleneck.
HDFS Namenode
Secondary Namenode
Metadata is striped across GPFS FPO nodes, providing better reliability and avoiding the need for primary and secondary name nodes
IBM BigInsights cluster with GPFS FPO
GPFS – bringing new capabilities to BigInsights
Flexible storage configuration
• GPFS FPO uses distributed metadata, avoiding the need for a central namenode, a common failure point in HDFS environments
• Avoids long recovery times in the event that the namenode fails and metadata needs to be recovered from the secondary namenode
• Pipelined replication for efficient storage of block replicas in GPFS FPO environment
GPFS Server GPFS Server
Switched Fabric
Shared nothing storage - GPFS FPO
Shared storage - GPFS
IBM BigInsights cluster with GPFS FPO
GPFS – bringing new capabilities to BigInsights
PERFORMANCE & FLEXIBILITY
• Performance and efficiency: similar to HDFS for MapReduce workloads, but with the option to deploy a high-performance parallel, shared file system
IMPROVED DATA SHARING FOR BETTER COLLABORATION
• Enable improved collaboration and efficient sharing of data among globally distributed teams
BUSINESS CONTINUITY AND DATA INTEGRITY
• Ensure business continuity and data integrity with more reliable storage and remote data replication
MORE EFFECTIVE MANAGEMENT OF DATA OVER ITS LIFECYCLE
• Support automated, cost-efficient management of data over its life-cycle
AVOID EXPENSIVE DATA SILOS WITH MORE VERSATILE STORAGE
• Avoid expensive data silos with a single storage environment that supports diverse application types
Enterprise features
GPFS – bringing new capabilities to BigInsights
IBM PLATFORM CLUSTER MANAGER
Understanding the advantage
Platform Cluster Manager
Provisioning and management of distributed clusters, including self-service cluster creation and management by multiple user groups
IBM Platform Cluster Manager – Advanced Edition
Cluster & Grid Provisioning and Management
Overview
Multitenant self-service creation, flexing and management of multiple analytics and high performance computing (HPC) clusters
Key Capabilities
• Rapid deployment of heterogeneous analytics and HPC clusters
• Secure multi-tenant environment
• Dynamically grow and shrink clusters
• Provision physical and/or virtual machines
• Automates self-service cluster delivery and administration
• Consolidates infrastructure from multiple clusters, enabling analytics and HPC cloud environments
Benefits
• Faster time to full system readiness
• Single interface for integrated management & monitoring
• Reduces time to full user productivity
• Reduces IT costs with dramatic gains in infrastructure utilization
Private
Analytics & HPC Cluster Mgmt
Open Scalable Proven
Resource Pools
IBM Platform Cluster Manager – Advanced Edition
IBM Platform Cluster Manager – Advanced Edition
Ready-to-run clusters dynamically provisioned as tenants on shared infrastructure
[Figure: IBM Platform Cluster Manager Advanced Edition serving two HPC consumers and two analytics consumers. Grid instances #1 to #4 run, across the four instances: IBM Platform LSF (Life Sciences / EDA / CFD / CAE), 3rd-party schedulers (Life Sciences / EDA / CFD / CAE), IBM Platform Symphony with open-source Apache Hadoop, and IBM Platform Symphony with IBM InfoSphere BigInsights. Underlying infrastructure: compute- and storage-dense nodes (System X or Power), virtual infrastructure, IBM GPFS, rack switch.]
• Multiple analytics and HPC clusters
  • Rapid provisioning: get the clusters you need in minutes instead of hours and days
  • Heterogeneous: deploy LSF, Symphony, Grid Engine, PBS, Hadoop, most 3rd-party workload managers
• Dynamically grow and shrink clusters
  • Support expansion and shrinking of clusters as needed over time
  • Based on policy, calendar and user intervention
  • Share resources between clusters
• Multitenant
  • Account separation, different service catalogs, resource limits, per-account reporting
  • Dynamic VLAN creation
  • Authenticated access to portal, service catalog, provisioned machines & storage
• Physical, virtual and hybrid clusters
  • Choose the right resource to match the workload
  • Bare metal provisioning
  • Switch management
  • GUI for multiple xCAT instances
• Self-service delivery and administration
  • Clusters are available on demand when they are needed
  • Reduce/eliminate the need to wait for someone to act
• Consolidate
  • Breaks down silos and provides a larger resource pool
IBM Platform Cluster Manager – Main Capabilities
Platform Cluster Manager Cockpit view Manage physical hosts, virtual machines, clusters, tenant accounts, networks, storage and more
Design clusters for self-service with arbitrarily complex machine elements and software stacks complete with customizable pre and post-provisioning scripts
Automatically deploy ready-to-use analytic environments - InfoSphere BigInsights, Streams, DataStage, Platform Symphony, GPFS, Platform LSF or other analytic software environments
IBM InfoSphere BigInsights, Platform Computing
Extending the capabilities of IBM BigInsights
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
Big Data Platform: Systems Management, Application Development, Visualization & Discovery, Accelerators, Information Integration & Governance, Data Warehouse, Hadoop System, Stream Computing
Agile, multi-tenant shared infrastructure
Platform Symphony and GPFS provide significant advantages
• Improved performance
• More efficient use of infrastructure
• Diverse, concurrent workloads
• Dynamic resource allocation
• Fast workload pre-emption
• Sophisticated multi-tenancy
• Ease of management
• Guaranteed service levels
BigInsights, Platform Symphony & GPFS
Providing competitive advantage for Big Data infrastructure

Capability | Cloudera CDH | EMC / GP UAP | MapR | Hortonworks | Open Source | BigInsights Platform, GPFS FPO
Low-latency scheduling | Impala only | No | Some features | No | No |
Heterogeneous workloads | No | No | No | No | No |
Fast pre-emptive scheduling | No | No | No | No | No |
Time-variant SLA guarantees | No | No | Some features | No | No |
Usage Accounting & Analytics add-on | No | No | No | No | No |
Recoverable Hadoop jobs | No | No | No | No | No |

POSIX file system: No / NFS only / No / No
Enterprise file system features: No / No / No
BigInsights, Platform Symphony & GPFS
Providing competitive advantage for Big Data infrastructure

Capability | Cloudera CDH | EMC / GP UAP | MapR | Hortonworks | Open Source | BigInsights Platform, GPFS FPO
SQL Support | Impala | Pivotal | Drill | Via open source only | Impala, Drill |
BigSheets | No | No | No | No | No |
Accelerators | No | No | No | No | No |
Full-featured private cloud management | No | No | No | No | No |

External Data Connectors: GP DB built-in / No / No / No
Complete HW & Software solution: Through HW partners / No / No / No
Single vendor support: Through HW partners / No / No / No
Summing up
IBM BigInsights, Platform Computing, GPFS FPO
• Single-tenant license for Platform Symphony included in BigInsights 2.1
• Upgrade to Symphony Advanced Edition for resource sharing features
• Enterprise-class POSIX file system
• Advanced cluster provisioning, private cloud management
• The most complete infrastructure solution for Big Data analytics
ADDITIONAL SLIDES
Latency matters in Big Data Analytics

Other Grid Server (Client, Broker, Engines)
• Each engine polls the broker ~5 times per second (configurable); work is sent when an engine is ready
• Timeline: Serialize input data -> Network transport (client to broker) -> Broker compute time -> Wait for engine to poll broker -> Network transport (broker to engine) -> De-serialize input data -> Compute result -> Serialize result -> Post result back to broker -> …

IBM Platform Symphony is (much) faster because:
• Efficient C language routines use CDR (common data representation) and IOCP rather than slow, heavy-weight XML data encoding
• Network transit time is reduced by avoiding the text-based HTTP protocol and encoding data in the more compact CDR binary format
• Processing time for all Symphony services is reduced by using a native HPC C/C++ implementation for system services rather than Java
• Platform Symphony has a more efficient “push model” that entirely avoids the architectural problems with polling

Platform Symphony
• Timeline: Serialize input -> Network transport -> SSM compute time & logging -> Network transport (SSM to engine) -> De-serialize -> Compute result -> Serialize -> Network transport (engine to SSM) -> …
• No wait time due to polling, faster serialization/de-serialization, more network-efficient protocol

Being more efficient means getting more work done with fewer resources
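The cost of polling can be estimated with a short calculation. The sketch below is a simplified model, not a measurement: it assumes a task becomes available at a uniformly random moment within the polling interval, uses the ~5 polls per second quoted on the slide, and the function names are hypothetical. Under a push model this wait term is absent.

```python
def mean_poll_wait_seconds(polls_per_second: float) -> float:
    """Average wait for the next poll.

    Assumes the task arrives at a uniformly random point within the
    polling interval, so the mean wait is half the interval.
    """
    interval = 1.0 / polls_per_second
    return interval / 2.0


def worst_poll_wait_seconds(polls_per_second: float) -> float:
    """Worst case: the task arrives just after a poll has completed."""
    return 1.0 / polls_per_second


rate = 5  # polls per second, the figure quoted on the slide
print(f"mean wait: {mean_poll_wait_seconds(rate) * 1000:.0f} ms")
print(f"worst-case wait: {worst_poll_wait_seconds(rate) * 1000:.0f} ms")
```

For short tasks, a wait of this size per task can be comparable to or larger than the compute time itself, which is why scheduling latency shows up so strongly in the "sleep test" style of benchmark.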
7.5x Faster
Benchmark: SWIM: Facebook 2010 Workload
Understanding the advantage
Symphony 6.1 can schedule ~50x more tasks per second
Hadoop results taken from Hadoop World 2011 performance presentation, Lipcon & Chen
Hadoop 1.1.1