Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics

Get Started with Big Data Analytics Cynthia Saracco ([email protected]), Session #1031

Transcript of Big Data: InterConnect 2016 Session on Getting Started with Big Data Analytics

Page 1


Page 2

Please Note:
•  IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
•  Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
•  The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
•  The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
•  Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Page 3

Executive summary

•  Big Data analytics is growing rapidly across geographies and industries
•  Core open source technologies often include Hadoop and Spark
•  Common challenges
   – Getting started and growing skills
   – Demonstrating value quickly
•  Focus of this talk
   – Using open source and IBM technologies for Big Data analytics
   – Emphasis on free services and software with modest skill requirements, available on cloud or in on-premise installations

Page 4

Agenda

•  The big picture on Big Data: growth projections, applications, challenges
•  Understanding IBM’s approach: combining open source and IBM-specific technologies with IBM BigInsights
•  How to get started
   –  Cloud and on-premise installation options
   –  Managing your cluster
   –  Storing, querying, and analyzing your data with IBM Big SQL (an easy on-ramp to Hadoop for SQL professionals)
   –  Exploring your data without writing code using IBM BigSheets
   –  . . .
•  Summary / resources

Page 5

The big picture on Big Data
Opportunities, requirements, applications . . .

Page 6

1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have

1 in 2 business leaders say they don’t have access to the information they need to do their jobs

60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions

83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness

Data arrives at enormous rates: 2.5 million items per minute, 300,000 tweets per minute, 200 million emails per minute, 220,000 photos per minute, 5 TB per flight, > 1 PB per day from gas turbines

Big Data presents new opportunities for insights . . .

Page 7

What we hear from customers . . . .

•  Lots of potentially valuable data is dormant or discarded due to size/performance considerations

•  Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)

•  Not clear what should be analyzed (exploratory, iterative)

•  Information distributed across multiple systems and/or Internet

•  Some information has a short useful lifespan

•  Volumes can be extremely high

•  Analysis needed in the context of existing information (not stand alone)

Page 8

Big Data in practice

Page 9

IBM Big Data customer scenarios (Hadoop-based) •  Applications

–  Data warehouse integration

–  Cloud-based analytics

–  Telematics

–  Targeted marketing campaigns

–  Optimization of capital investments

–  . . .

•  Industries

–  Insurance

–  Travel

–  Entertainment

–  Energy

–  Technology

–  Banking

–  . . . .


http://www.ibm.com/analytics/us/en/case-studies.html#topic=hadoop
https://developer.ibm.com/hadoop/blog/2015/11/03/biginsights-and-big-sql-customer-use-cases/

Page 10

IBM’s Big Data approach
Leveraging Hadoop, Spark, and IBM technologies

Page 11

IBM analytics platform strategy for Big Data

•  Integrate and manage the full variety, velocity and volume of Big Data
•  Apply advanced analytics
•  Visualize all available data for ad-hoc analysis
•  Support workload optimization and scheduling
•  Provide for security and governance
•  Integrate with enterprise software

[Diagram: IBM Analytics Platform – "Built on Spark. Hybrid. Trusted." Capabilities: Discovery & Exploration, Prescriptive Analytics, Predictive Analytics, Content Analytics, Business Intelligence; data layer: Data Mgmt, Hadoop & NoSQL, Content Mgmt, Data Warehouse; underpinned by Information Integration & Governance, with Spark as the analytics operating system (machine learning), on premises and on cloud. Data at rest & in motion, inside & outside the firewall, structured & unstructured.]

Page 12

Hadoop and the enterprise

[Diagram: Ingestion and Real-time Analytic Zone (Streams, Connectors); Landing and Analytics Sandbox Zone (Hadoop, MapReduce, Hive/HBase column stores, documents in a variety of formats); Warehousing Zone (Enterprise Warehouse, Data Marts); Analytics and Reporting Zone (BI & Reporting, Predictive Analytics, Visualization & Discovery); Metadata and Governance Zone (ETL, MDM, Data Governance)]

Page 13

Cloud Data Services is Open For Data

An open portfolio of self-service, composable data and analytic services for the developer, data science professional, and analytic architect. We help transform businesses and organizations to build applications and gain new insights better and faster.

Comprehensive
•  Broadest selection of data and analytic services available on multiple cloud platforms
•  Pre-built integrations across the portfolio
•  Integrated with open data to gain deeper insights

Trusted
•  Fully managed: 24 x 7
•  Secure infrastructure
•  Mitigate risk and lower costs

Flexible
•  Open-source driven innovation
•  Industry leading support for hybrid deployments
•  Bare metal, virtual, pay-as-you-go and reserved

Page 14

IBM BigInsights for Apache Hadoop


§  Analytical platform for persistent Big Data
   –  100% open source core with IBM add-ons for analysts, data scientists, and admins
   –  Includes Hadoop and Spark
   –  On premise or cloud
§  Distinguishing characteristics
   –  Built-in analytics . . . enhances business knowledge
   –  Enterprise software integration . . . complements and extends existing capabilities
   –  Production-ready . . . speeds time-to-value
§  IBM advantage
   –  Combination of software, hardware, services and research

Page 15

Overview of BigInsights

[Diagram of the BigInsights stack:]
•  IBM Open Platform with Apache Hadoop (HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper)
•  IBM BigInsights Analyst: industry standard SQL (Big SQL), spreadsheet-style tool (BigSheets)
•  IBM BigInsights Data Scientist: Text Analytics, Machine Learning on Big R, Big R (R support)
•  IBM BigInsights Enterprise Management: POSIX distributed filesystem; multi-workload, multi-tenant scheduling

Free Quick Start (non-production):
•  IBM Open Platform
•  BigInsights Analyst, Data Scientist features
•  Community support
. . .

Page 16

How to get started
Acquiring an environment: cloud, on-premise options

Page 17

Options for accessing IBM BigInsights

•  Cloud options
   – Bluemix: http://bluemix.net
   – IMDemo cloud (technical previews): http://bigsql.imdemocloud.com
•  On-premise installations
   – VMware image
   – Docker image
   – Install image for your own cluster
•  Options differ somewhat in the breadth of features and privileges available
   – Check the documentation available from the download site
   – Support available via forums

Page 18

Cloud options (Bluemix): http://bluemix.net

Analytics for Hadoop (developer sandbox)
•  Prototype, demo, trial in the cloud
•  Empowers developers to rapidly drive insight from all data
•  Adds Hadoop-based analytics to your application
•  Enterprise features: BigSheets, Big SQL, Text Analytics, HiveQL, HttpFS
•  Delivered via IBM Bluemix. To be decommissioned shortly.

BigInsights for Apache Hadoop (production environment)
•  Production deployments at scale in the cloud
•  Delivers flexibility and efficiency with subscription pricing
•  Scales to meet spikes in demand without on-premise infrastructure
•  Drives enterprise-class, complex analytics on Big Data sets
•  Available via the IBM Cloud Marketplace and Bluemix

http://www.ibm.com/cloud http://www.bluemix.net

Page 19

Bluemix sandbox service

To be decommissioned shortly; pay-as-you-go offering in closed beta

Page 20

IBM BigInsights on Cloud (Bluemix subscription)

Secure, Dedicated Bare-metal Infrastructure

IBM Open Platform

Small Nodes Basic data extraction, transformation, file processing, search 20 cores, 64 GB RAM, 20 TB raw data disks (~6 TB usable), 8 TB OS disks, 10 Gb network

Medium Nodes Data warehouse optimization – store new data or extend warehouse 20 cores, 128 GB RAM, 28 TB raw data disks (~9 TB usable), 8 TB OS disks, 10 Gb network

Large Nodes Advanced analytics – intensive data processing 24 cores, 256 GB RAM, 32 TB raw data disk (~10 TB usable), 8 TB OS disks, 10 Gb network

Page 21


IMDemo Cloud sandbox with technical previews

To register for free use, visit http://bigsql.imdemocloud.com

Page 22

IBM BigInsights for Apache Hadoop
On-premise options: native install, VMware, Docker

[Same BigInsights stack as on the earlier overview slide: IBM Open Platform with Apache Hadoop plus the BigInsights Analyst, Data Scientist, and Enterprise Management features]

Free Quick Start (non-production):
•  IBM Open Platform
•  BigInsights Analyst, Data Scientist features
•  Community support
. . .

Page 23

Where to download images
•  Download the Quick Start offering
•  Links available from Hadoop Dev (“try it for free”): https://developer.ibm.com/hadoop/

Page 24

Looking for data? Look to Bluemix . . . .

•  IBM Analytics Exchange: new publicly accessible catalog with > 150 data sets.

•  Part of IBM’s Open for Data initiative

Page 25

How to get started
Managing your cluster

Page 26

Ambari console
•  Inspect status, start/stop services, etc.
•  Launch via Web browser (or cloud-specific link), e.g. http://myhost.ibm.com:8080
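Besides the browser console, Ambari exposes the same information through its REST API. The sketch below shows how a service-status check might look; the host, cluster name, and sample response are assumptions for illustration, not values from this session.

```python
# Sketch: checking a Hadoop service's state through Ambari's REST API.
# Host, cluster name, and the sample response below are placeholders.
import json
from urllib.request import Request

AMBARI = "http://myhost.ibm.com:8080"

def service_url(cluster, service):
    """Build the Ambari REST endpoint for one service's status."""
    return f"{AMBARI}/api/v1/clusters/{cluster}/services/{service}"

def service_state(response_body):
    """Extract the service state (e.g. 'STARTED') from an Ambari JSON response."""
    return json.loads(response_body)["ServiceInfo"]["state"]

# A real call would add basic auth plus the X-Requested-By header Ambari requires:
req = Request(service_url("mycluster", "HDFS"), headers={"X-Requested-By": "ambari"})

# urllib.request.urlopen(req) would return JSON along these lines:
sample = '{"ServiceInfo": {"cluster_name": "mycluster", "service_name": "HDFS", "state": "STARTED"}}'
print(service_state(sample))  # STARTED
```

The same endpoints accept PUT requests to start or stop services, which is how the console's start/stop buttons work under the covers.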


Page 27

How to get started
Storing, querying, and analyzing data with Big SQL

Page 28

Overview of SQL for Hadoop (Big SQL)

[Architecture diagram: SQL-based application → IBM data server client → Big SQL engine (SQL MPP run-time) → data storage (DFS), all within BigInsights]

§  Comprehensive, standard SQL
   –  SELECT: joins, unions, aggregates, subqueries . . .
   –  GRANT/REVOKE, INSERT … INTO
   –  Procedural logic in SQL
   –  Stored procs, user-defined functions
   –  IBM data server JDBC and ODBC drivers
§  Optimization and performance
   –  IBM MPP engine (C++) replaces the Java MapReduce layer
   –  Continuously running daemons (no start-up latency)
   –  Message passing allows data to flow between nodes without persisting intermediate results
   –  In-memory operations with the ability to spill to disk (useful for aggregations, sorts that exceed available RAM)
   –  Cost-based query optimization with 140+ rewrite rules
§  Various storage formats supported
   –  Data persisted in DFS, Hive, HBase
   –  No IBM proprietary format required
§  Integration with RDBMSs via LOAD, query federation

Page 29

Invocation options
•  Command-line interface: Java SQL Shell (JSqsh)
•  Web tooling (Data Server Manager)
•  Tools that support the IBM JDBC/ODBC driver

Page 30

Creating a Big SQL table
•  Standard CREATE TABLE DDL with extensions

create hadoop table users
( id int not null primary key,
  office_id int null,
  fname varchar(30) not null,
  lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;

Worth noting:
•  The “hadoop” keyword creates the table in DFS
•  Delimited row format and textfile are the defaults
•  Constraints are not enforced (but are useful for query optimization)
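Because the table is stored as delimited text, any file whose fields match the declared delimiter can serve as its data. This small sketch produces lines in the '|'-delimited layout the table above expects; the rows are invented sample values, not data from the session.

```python
# Generate a few rows of '|'-delimited text matching the users table layout
# (id, office_id, fname, lname). The values here are invented sample data.
rows = [
    (1, 100, "Ann", "Smith"),
    (2, 200, "Bob", "Jones"),
]

def to_delimited(row, delimiter="|"):
    """Render one tuple as a delimited line, as 'fields terminated by' expects."""
    return delimiter.join(str(field) for field in row)

print("\n".join(to_delimited(r) for r in rows))
# 1|100|Ann|Smith
# 2|200|Bob|Jones
```

A file of such lines can simply be copied into the table's DFS directory (or loaded with LOAD) and queried immediately.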

Page 31

Results from previous CREATE TABLE . . .
•  Data stored in a subdirectory of the Hive warehouse: .../hive/warehouse/myid.db/users
   – Default schema is the user ID; new schemas can be created
   – A “table” is just a subdirectory under schema.db
   – A table’s data are files within the table subdirectory
•  Metadata collected (Big SQL & Hive)
   – SYSCAT.* and SYSHADOOP.* views
•  Optionally, use the LOCATION clause of CREATE TABLE to layer a Big SQL schema over existing DFS directory contents
   – Useful if the table contents are already in DFS
   – Avoids the need to LOAD data
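The warehouse layout described above is purely mechanical, which a short sketch makes concrete. The warehouse prefix varies by installation, so the one below is an assumption; only the schema.db/table suffix is fixed by convention.

```python
# Compose the default DFS path for a managed Big SQL / Hive table.
# The warehouse prefix differs per installation; this one is an assumption.
WAREHOUSE = "/apps/hive/warehouse"

def table_path(schema, table):
    """Default location: <warehouse>/<schema>.db/<table>."""
    return f"{WAREHOUSE}/{schema}.db/{table}"

print(table_path("myid", "users"))  # /apps/hive/warehouse/myid.db/users
```

A table created with an explicit LOCATION clause ignores this convention and points at whatever directory you name.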

Page 32

Populating Tables via LOAD

•  Typically best runtime performance
•  Load data from a local or remote file system:

load hadoop using file url
'sftp://myID:[email protected]:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;

•  Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via a JDBC connection:

load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
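The JDBC form of LOAD is really just a statement assembled from a handful of parts: connection URL, credentials, source table and columns, an optional predicate, and the target table. This hedged sketch composes such a statement from those parts; the helper function and its argument names are illustrative, not part of Big SQL itself.

```python
# Sketch: assembling a LOAD HADOOP ... USING JDBC statement (as shown above)
# from its parts. The connection details are placeholders, not a live database.
def build_jdbc_load(url, user, password, src_table, columns, predicate, target):
    """Compose a Big SQL LOAD statement that pulls selected columns via JDBC."""
    return (
        "load hadoop\n"
        f"using jdbc connection url '{url}'\n"
        f"with parameters (user='{user}', password='{password}')\n"
        f"from table {src_table} columns ({', '.join(columns)})\n"
        f"where '{predicate}'\n"
        f"into table {target} overwrite"
    )

stmt = build_jdbc_load(
    "jdbc:db2://some.host.com:portNum/sampledb", "myID", "myPassword",
    "MEDIA", ["ID", "NAME"], "CONTACTDATE < ''2012-02-01''", "media_db2table_jan",
)
print(stmt)
```

Generating the statement this way makes it easy to script repeated loads (e.g. one partition per month) from the same connection details.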

Page 33

Querying your Big SQL tables

•  Same as an ISO-compliant RDBMS
•  No special query syntax for Hadoop tables
   – Projections, restrictions
   – UNION, INTERSECT, EXCEPT
   – Wide range of built-in functions (e.g. OLAP)
   – Full support for subqueries
   – All standard join operations
   –  . . .

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT * FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey )
  AND NOT EXISTS (
    SELECT * FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate )
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;

Page 34

Accessing Big SQL data from Spark shell

// based on BigInsights 4.1, which includes Spark 1.5.1
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// establish a Hive context
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// query some Big SQL data
val saleFacts = sqlContext.sql("select * from bigsql.sls_sales_fact")

// action on the data – count # of rows
saleFacts.count()
. . .
// transform the data as needed (create a Vector with data from 2 cols)
val subset = saleFacts.map { row => Vectors.dense(row.getDouble(16), row.getDouble(17)) }

// invoke a basic Spark MLlib statistical function over the data
val stats = Statistics.colStats(subset)

// print one of the statistics collected
println(stats.mean)

Page 35

A word about . . . SerDes

•  Custom serializers / deserializers (SerDes)
   – Read / write complex or “unusual” data formats (e.g., JSON)
   – Commonly used by the Hadoop community
   – Developed by users or available publicly
•  Add the SerDe .jar to the appropriate directories; reference the SerDe when creating the table

-- Create table for JSON data using the open source hive-json-serde-0.2.jar SerDe
-- Location clause points to the DFS dir containing the JSON data
-- External clause means the DFS dir & data won't be dropped after a DROP TABLE command
create external hadoop table socialmedia-json (Country varchar(20), . . . )
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '</hdfs_path>/myJSON';

select * from socialmedia-json;
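Conceptually, a JSON SerDe maps each JSON record (one per line in the DFS files) onto the table's declared columns at read time. The sketch below illustrates that mapping in plain Python; the sample record and field names are invented for illustration, not taken from the session's data set.

```python
# Illustration of what a JSON SerDe does on read: project each JSON record
# (one per line) onto the table's declared columns. Sample data is invented.
import json

columns = ["Country"]  # a subset of the socialmedia-json table's columns

def deserialize(line, cols):
    """Pull the declared columns out of one JSON record, like a SerDe row read."""
    record = json.loads(line)
    return tuple(record.get(c) for c in cols)

line = '{"Country": "Canada", "FeedInfo": "..."}'
print(deserialize(line, columns))  # ('Canada',)
```

Fields absent from a record simply come back as NULL (here, None), which matches how Hive-style SerDes treat missing JSON keys.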

Page 36

Sample JSON input for previous example

JSON-based social media data to load into Big SQL Table socialmedia-json defined with SerDe

Page 37

Sample Big SQL query output for JSON data

Sample output: Select * from socialmedia-json

Page 38

How to get started
Exploring your data without writing code using BigSheets

Page 39

Spreadsheet-style analysis (BigSheets)
•  Web-based analysis and visualization
•  Spreadsheet-like interface
   – Explore, manipulate data without writing code
   – Invoke pre-built functions
   – Generate charts
   – Export results of analysis
   – Create custom plug-ins
   –  . . .

Page 40

Working with BigSheets

•  Create a workbook for data in DFS
•  Customize the workbook through the graphical editor and built-in functions
   – Filter data
   – Apply functions / macros / formulas
   – Combine data from multiple workbooks
•  “Run” the workbook: apply the work to the full data set
•  Explore results in spreadsheet format and/or create charts
•  Optionally, export your data

[Diagram: the workbook builder front end and evaluation service run a simulation of the model over sample data; a full execution runs the model (via Pig) over the complete data set to produce final results]

Page 41

Summary and resources
Discover how you can take the next step with Big Data

Page 42

Summary

•  Big Data analytics in high demand
   – Open source technologies (e.g., Apache Hadoop, Spark)
   – Vendor-specific analytic tools, engines, and applications
•  Multiple options to build Big Data skills with IBM BigInsights
   – Cloud: Bluemix, IMDemo cloud (tech previews)
   – VMware / Docker images for your laptop (free download)
   – IBM BigInsights Quick Start edition native installation (free download)


Page 43

Hadoop Dev: developer site for IBM BigInsights
Downloads, forums, labs, papers, etc. on Hadoop Dev

https://developer.ibm.com/hadoop/

Page 44

Thank You – Your Feedback is Important!

Access the InterConnect 2016 Conference Attendee Portal to complete your session surveys from your smartphone, laptop or conference kiosk.