Big data presentation (2014)
© 2014 IBM Corporation
Big Data
Xavier Constant (xavier.constant@es.ibm.com)
Lecture at EADA, International Master in Marketing (2014)

Big Data Concepts
Big Data Technology
Data Scientists

Traditional DW
[Diagram labels: Operational System, ERP, CRM, Flat files / Spreadsheets, ETL, Data Warehouse(s), ETL, Data Marts, BI Server, Reports, Dashboards]
BENEFITS
– Mature technology
– SQL language (declarative, non-technical)
– Skills & resources availability (programmers, DBAs…)
LIMITATIONS
– Big operational data volumes
  • Queries take too long or don't even finish
  • Admin complexity (partitions, archiving…)
– New data types
  • Free text, images, video, audio…
  • Data in real time (sensors, logs, geospatial data, etc.)
– New analysis types
  • Exploratory
  • Predictive

Intrinsic Property of Data … it grows
– 1 in 2 business leaders don't have access to data they need
– 83% of CIOs cited BI and analytics as part of their visionary plan
– 5.4X more likely that top performers use business analytics
– 80% of the world's data today is unstructured
– 90% of the world's data was created in the last two years
– 20% of available data can be processed by traditional systems
Source: GigaOM, Software Group, IBM Institute for Business Value

Characteristics of Big Data
Velocity is the game changer. It's NOT just how fast data is produced or changed, BUT the speed at which it must be analyzed, received, understood and processed.

Paradigm shifts enabled by big data I: Leverage more of the data being captured
[Figure not captured in transcript; label: Bank X]

Paradigm shifts enabled by big data II: Reduce effort required to leverage data

Paradigm shifts enabled by big data III: Data leads the way – and sometimes correlations are good enough
[Figure not captured in transcript; labels: Hypothesis-based correlation, Weird correlation]

Paradigm shifts enabled by big data IV: Leverage data as it is captured

Complementary Analytics
Traditional Approach: structured, analytical, logical
– Structured, repeatable, linear
– Data Warehouse
– Traditional databases: internal app data, transaction data, mainframe data, OLTP system data, ERP data
New Approach: creative, holistic thought, intuition
– Unstructured, exploratory, dynamic
– Hadoop and Streams
– New sources: multimedia, web logs, social data, sensor data (images, RFID), text data (emails)

Types of Analytic Tools

Organisations are prioritising internal data sources
Untapped stores of internal data
– Size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
– New infrastructure components make them accessible for analysis
– Some data has been collected but not analyzed for years
Focus on customer insights
– Customers – influenced by digital experiences – often expect information provided to an organization will then be "known" during future interactions
– Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences (transactions, emails, call center interaction records)
Big data sources
– Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization

Stages of Big Data adoption
Big data adoption: when segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.

Hadoop workloads
Workload                 Today    In 18 months
Staging area             92%      92%
Online archive           92%      92%
Transformation engine    83%      92%
Ad hoc queries           58%      67%
Scheduled reports        42%      67%
Visual exploration       25%      67%
Data mining              58%      83%
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012.

Key Big Data Use Cases
– Big Data Exploration: find, visualize, understand all big data to improve decision making
– Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
– Operations Analysis: analyze a variety of machine data for improved business results
– Data Warehouse Modernization: integrate big data and data warehouse capabilities to increase operational efficiency
– Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time

Solution for Big Data
Data at rest
– Data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data are analyzed in real time, at the moment they are generated. They are analyzed without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions
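
To make "analyzed without any previous storage" concrete, here is a minimal sketch in plain Java. It is not the InfoSphere Streams or any CEP product API; the class name `SlidingWindowSketch` and the window size are assumptions made for illustration. Each reading is processed as it arrives, and only the last few readings are kept in memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative only: a moving average computed over a stream, one reading at a time. */
final class SlidingWindowSketch {
    private final int size;                            // hypothetical window length
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingWindowSketch(int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        this.size = size;
    }

    /** Consume one reading as it arrives; return the average of the most recent readings. */
    double accept(double reading) {
        window.addLast(reading);
        sum += reading;
        if (window.size() > size) {
            sum -= window.removeFirst();               // forget the oldest reading; nothing else is retained
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowSketch avg = new SlidingWindowSketch(3);
        for (double reading : new double[] {20.0, 22.0, 21.0, 30.0}) {
            System.out.println(avg.accept(reading));
        }
    }
}
```

The contrast with the "data at rest" case is that the full history is never written anywhere: memory use is bounded by the window, however long the stream runs.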

Hardware improvements through the years
CPU speeds
– 1990 - 44 MIPS at 40 MHz
– 2000 - 3,561 MIPS at 1.2 GHz
– 2010 - 147,600 MIPS at 3.3 GHz
RAM memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2000 – 64MB memory
– 2010 - 8-32GB (and more)
Disk capacity
– 1990 – 20MB
– 2000 - 1GB
– 2010 – 1TB
Disk latency (speed of reads and writes) – not much improvement in the last 7-10 years, currently around 70–80 MB/sec

How long will it take to read 1TB of data?
1TB (at 80 MB/sec)
– 1 disk - 3.4 hours
– 10 disks - 20 min
– 100 disks - 2 min
– 1000 disks - 12 sec
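
The figures on this slide follow from simple division. The sketch below reproduces them under two assumptions that the slide does not state: decimal units (1 TB = 1,000,000 MB) and a sustained 80 MB/sec per disk, with the data spread evenly so all disks are read in parallel. The class and method names are made up for the example.

```java
import java.util.Locale;

/** Back-of-the-envelope scan times for 1 TB (illustrative sketch only). */
final class ReadTimeSketch {
    // Assumptions: 1 TB = 1,000,000 MB and a sustained 80 MB/s per disk.
    static final double DATA_MB = 1_000_000.0;
    static final double MB_PER_SEC_PER_DISK = 80.0;

    /** Seconds needed when the data is spread evenly over `disks` disks read in parallel. */
    static double secondsToRead(int disks) {
        if (disks <= 0) throw new IllegalArgumentException("disks must be positive");
        return DATA_MB / (MB_PER_SEC_PER_DISK * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double s = secondsToRead(disks);
            System.out.printf(Locale.ROOT, "%4d disk(s): %8.1f s (%.2f h)%n", disks, s, s / 3600.0);
        }
    }
}
```

One disk needs 12,500 seconds, roughly three and a half hours; a thousand disks need 12.5 seconds. The slide's rounded values match this within rounding.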

Parallel Data Processing is the answer
It has been with us for a while
– GRID computing - spreads processing load
– Distributed workload - hard to manage applications, overhead on developer
– Parallel databases – DB2 DPF, Teradata, Netezza, etc. (distribute the data)

What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written

Design principles of Hadoop
New way of storing and processing the data
– Let system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of operating system
  • Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication

What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well, for resiliency
[Figure: a logical file (shown as a bit string) is split into blocks 1 to 4; in the cluster, each of the four blocks appears three times, on different nodes]
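
The idea of splitting a file into blocks and keeping several copies of each block on different nodes can be sketched as follows. This is a toy round-robin policy written for illustration, not the placement logic HDFS actually uses; the class name, the block size and the node count are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of block splitting and replica placement (not the real HDFS policy). */
final class BlockPlacementSketch {
    /** Number of fixed-size blocks needed for a file; the last block may be partial. */
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    /** For each block, the indexes of the nodes holding a replica (round-robin assumption). */
    static List<List<Integer>> place(int blocks, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((b + r) % nodes);   // consecutive nodes, so copies of one block never share a node
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        int blocks = blockCount(400L, 128L);     // hypothetical sizes that yield 4 blocks, as in the figure
        List<List<Integer>> p = place(blocks, 5, 3);
        for (int b = 0; b < p.size(); b++) {
            System.out.println("block " + (b + 1) + " -> nodes " + p.get(b));
        }
    }
}
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block it held, which is the resiliency point the slide makes.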

Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce application
1. Map phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
[Figure labels: MapReduce Application; Distribute map tasks to cluster; Hadoop Data Nodes; Shuffle; Return a single result set; Result Set]

    // Needs java.io.IOException, java.util.StringTokenizer,
    // org.apache.hadoop.io.* and org.apache.hadoop.mapreduce.*
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // Emit (word, 1) for every token in the input line
        public void map(Object key, Text val, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(val.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        // Sum all counts received for one word (the slide is cut off after the sum loop)
        public void reduce(Text key, Iterable<IntWritable> val, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : val) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

MapReduce Example
Count number of word occurrences
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
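
The three steps of this example can be traced without a cluster. The sketch below simulates map, shuffle and reduce in plain Java collections; it does not use the Hadoop API, and the class and method names are invented for the illustration. Keys are kept in a sorted map here, so the printed order differs from the order drawn on the slide.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory walk-through of map, shuffle and reduce for word counting (no Hadoop involved). */
final class WordCountSketch {
    /** Map: emit a (word, 1) pair for every token of one input split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return out;
    }

    /** Shuffle: group the values emitted by all map tasks under their key. */
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce: sum the values collected for each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    /** Run one map task per split, then shuffle and reduce. */
    static Map<String, Integer> run(String... splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // The two input lines from the slide, one per map task
        System.out.println(run("Hello World Bye World", "Hello IBM"));
        // prints {Bye=1, Hello=2, IBM=1, World=2}
    }
}
```

In a real job the map calls would run on different nodes and the shuffle would move data over the network; the data flow is the same.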

How to Analyze Large Data Sets in Hadoop
It's not just runtime. The development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql

Pig, Hive, Jaql – Similarities
– Reduced program size over Java
– Applications are translated to map and reduce jobs behind the scenes
– Extension points for extending existing functionality
– Interoperability with other languages
– Not designed for random reads/writes or low-latency queries

Pig, Hive, Jaql – Differences
Characteristic              Pig          Hive                                 Jaql
Developed by                Yahoo        Facebook                             IBM
Language                    Pig Latin    HiveQL                               Jaql
Type of language            Data flow    Declarative (SQL dialect)            Data flow
Data structures supported   Complex      Better suited for structured data    JSON, semi-structured
Schema                      Optional     Not optional                         Optional

Hadoop Distributions

Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights, with a legend distinguishing IBM and open source components. Labels as extracted: Visualization & Discovery, Integration, Workload Optimization, Streams, Netezza, Flume, DB2, DataStage, Runtime, Advanced Analytic Engines, File System, MapReduce, HDFS, Data Store, HBase, Text Processing Engine & Extractor Library, BigSheets, JDBC, Applications & Development, Text Analytics, MapReduce, Pig & Jaql, Hive, Administration, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Jaql, Pig, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Hive, Integrated Installer, Admin Console, Sqoop, Adaptive Algorithms, Dashboard & Visualization, Apps, Workflow, Monitoring, Management, HCatalog, Security, Audit & History, Lineage, R, Guardium, Platform Computing, Cognos, GPFS-FPO, Big SQL, NameNode High Avail, Avro]

Open Source frameworks I
– Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.
– Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
– HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
– HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
– Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.

Open Source frameworks II
– Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
– Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
– R: A project for statistical computing.
– Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
– ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.

Example of Hadoop Ecosystem
[IBM InfoSphere BigInsights diagram repeated; component picked out on this slide: BigSheets (Visualization & Discovery)]

BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers to work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team

BigSheets Collection Sample
– Spreadsheet-like structures defined by user
– Based on data accessible through BigInsights Web console – e.g. file system data, output from Web crawl, etc.

BigSheets Collection Operations
– Work with built-in "sheets" editor
– Add / delete columns
– Filter data
– Specify formulas to compute new values using spreadsheet-style syntax
– Apply built-in or custom macro functions
– …

BigSheets Collection Graphic Visualization
– Built-in charting facility aids analysis
– Pie charts, bar charts, tag clouds, maps, etc.
– Hover over sections to reveal details

Example of Hadoop Ecosystem
[IBM InfoSphere BigInsights diagram repeated; component picked out on this slide: Big SQL]

What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow

Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites query if necessary for improved performance
– Determines appropriate storage handler for data
– Produces execution plan
– Executes and coordinates query
[Diagram labels: Application, SQL Language, JDBC / ODBC Driver, Network Protocol; BigInsights Cluster; Head Node: Big SQL Server (SQL Engine; Storage Handlers: Del Files, SEQ Files, HBase, RDBMS, …); Head Node: Name Node; Head Node: Job Tracker; Head Node: Hive Metastore; Compute Nodes, each with Task Tracker, Data Node and Region Server. Server layout and relative sizes for illustrative purposes only]

Example of Hadoop Ecosystem
[IBM InfoSphere BigInsights diagram repeated; component picked out on this slide: Text Analytics]

What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distill structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development

Annotator Query Language (AQL)
Language to create rules for Text Analytics
– SQL-like language
– Fully declarative text analytics language
– Once compiled, produces an AOG plan to run against the data
– No "black boxes" or modules that can't be customized
– Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and are difficult to optimize for performance

    create view AmountWithUnit as
      extract pattern <N.match> <U.match>
        as match
      from Number N, Unit U;
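
For readers who do not know AQL, the rule above says: wherever a match of the `Number` view is directly followed by a match of the `Unit` view, emit the combined span. A rough analogue can be written with Java regular expressions. This is a sketch of the idea only, not how the Text Analytics engine works; the slide does not define `Number` or `Unit`, so the two patterns below are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex approximation of the AmountWithUnit rule (illustrative, not AQL semantics). */
final class AmountWithUnitSketch {
    // Hypothetical stand-ins for the Number and Unit views, which the slide does not define.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:kg|km|MB|GB|TB)";
    // A number, optional whitespace, then a unit that ends at a word boundary.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b" + NUMBER + "\\s*" + UNIT + "\\b");

    /** Return every "number followed by unit" span found in the text, in order. */
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("Reading 1 TB at 80 MB per second from a 20MB disk"));
    }
}
```

The declarative version has the advantage the slide points to: `Number` and `Unit` are views that can be redefined or reused by other rules, whereas here they are fixed strings inside one pattern.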

Text Analytic Simple Example
Input text:
  World Cup 2010 Highlights
  Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted information:
  Striker   Arjen Robben     Netherlands
  Keeper    Iker Casillas    Spain
  Winger    Andres Iniesta   Spain

Text Analytic Real Example

One step beyond Watson

Example of Hadoop Ecosystem
[IBM InfoSphere BigInsights diagram repeated; component picked out on this slide: R]

Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  – Partitioning of large data
  – Parallel cluster execution of R code
• Distributed machine learning
  – A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram labels: R Clients, IBM R Packages, Data Sources, Embedded R Execution, Scalable Machine Learning; callouts 1 to 3; "Pull data (summaries) to R client, or push R functions right on the data"]

Where Does Big Data Fit?
[Diagram labels, as extracted: Analytical database (DW); Source Systems; Analytical tools; 5. Explore data; 6. Parse, aggregate; "Capture in case it's needed"; 1. Extract, transform, load; "Capture only what's needed"; 9. Report and mine data]

Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

Data Science is Multidisciplinary

Successful Data Scientist Characteristics

Data Scientist Qualities

How Long Does It Take For a Beginner to Become a Good Data Scientist?

www.kaggle.com

Kaggle ranking

Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights

Learn Big Data Technologies
BigDataUniversity.com
– Flexible online delivery allows learning at your place and your pace
– Free courses, free study materials
– Cloud-based sandbox for exercises – zero setup
– Robust Course Management System and Content Distribution infrastructure
copy 2014 IBM Corporation65
65
copy 2014 IBM Corporation2
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation3
Traditional DW
BI Server
ERP
CRM
Data Marts
Reports Dashboards
Operational System
ETL ETL
BENEFITS
Mature Technology
SQL Language (declarative non technical)
Skills amp resources availablity (programmers DBAshellip)
LIMITATIONS
Big operational data volumes
Queries take too long or donrsquot even finish
Admin complexity (partitions archivinghellip)
New data types
Free text images video audiohellip
Data in real time (sensors logs geospatial data etchellip)
New analysis types
Exploratory
Predictive
Flat filesSpread sheets
Data Warehouse(s)
copy 2014 IBM Corporation4
1 in 2business leaders donrsquot have access to data they need
83of CIOrsquos cited BI and analytics as part of their visionary plan
54Xmore likely that top performers use business analytics
80of the worldrsquos data today is unstructured
90of the worldrsquos
data was created in the last two
years
20of available data can
be processed by traditional systems
Source GigaOM Software Group IBM Institute for Business Value
Intrinsic Property of Data hellip it grows
copy 2014 IBM Corporation5
Characteristics of Big Data
Velocity is the game changer Itrsquos NOT just how
fast data is produced or changed BUT the
speed at which it must be analyzed
received understood and processed
copy 2014 IBM Corporation6
Paradigm shifts enabled by big data ILeverage more of the data being captured
copy 2014 IBM Corporation7
Paradigm shifts enabled by big data ILeverage more of the data being captured
Bank X
copy 2014 IBM Corporation8
Paradigm shifts enabled by big data IIReduce effort required to leverage data
copy 2014 IBM Corporation9
Paradigm shifts enabled by big data IIReduce effort required to leverage data
copy 2014 IBM Corporation10
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
copy 2014 IBM Corporation11
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
Hypothesis based correlation Weird correlation
copy 2014 IBM Corporation12
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
copy 2014 IBM Corporation13
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation14
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation15
Complementary Analytics
Traditional ApproachStructured analytical logical
New ApproachCreative holistic thought intuition
Multimedia
Data Warehouse
Web Logs
Social Data
Sensor data
images
RFID
Internal AppData
TransactionData
MainframeData
OLTP SystemData
Traditional databases
ERP Data
StructuredRepeatable
Linear
UnstructuredExploratory
Dynamic
Text Data
emails
Hadoop andStreams
NewSources
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
17
Untapped stores of internal data
Size and scope of some internal data such as
detailed transactions and operational log data
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected but not
analyzed for years
Focus on customer insights
Customers ndash influenced by digital experiences
ndash often expect information provided to an
organization will then be ldquoknownrdquo during future
interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions Emails Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization
copy 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100 due to rounding
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
copy 2014 IBM Corporation20
Big Data ExplorationFind visualize understand all big data to improve decision making
Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources
Operations AnalysisAnalyze a variety of machinedata for improved business results
Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency
SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time
Key Big Data Use Cases
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
- Data to analyze are already stored (structured and unstructured)
- Examples: logs, Facebook, Twitter, etc.
- Solution: Hadoop (open source)
Data in motion
- Data are analyzed in real time, at the moment they are generated, without being stored first
- Examples: sensors, RFID, etc.
- Solution: Streams, CEP solutions
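The difference between the two modes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sensor readings, the window size, and the function names are invented for the example and are not part of Hadoop or of any Streams or CEP product.

```python
from collections import deque
from typing import Iterable, Iterator, List


def batch_average(stored: List[float]) -> float:
    """Data at rest: the full data set is already stored, so scan it once."""
    return sum(stored) / len(stored)


def streaming_average(readings: Iterable[float], window: int = 3) -> Iterator[float]:
    """Data in motion: emit a rolling average as each reading arrives,
    keeping only the last `window` values in memory."""
    recent = deque(maxlen=window)
    for value in readings:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical sensor readings, used only for this illustration.
sensor = [10.0, 20.0, 30.0, 40.0]
at_rest = batch_average(sensor)
in_motion = list(streaming_average(iter(sensor), window=2))
print(at_rest, in_motion)
```

The batch function needs the whole list before it can answer, while the streaming function produces a result per reading and never holds more than `window` values.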
Hardware improvements through the years
CPU speeds
- 1990: 44 MIPS at 40 MHz
- 2000: 3,561 MIPS at 1.2 GHz
- 2010: 147,600 MIPS at 3.3 GHz
RAM memory
- 1990: 640K conventional memory (256K extended memory recommended)
- 2000: 64MB memory
- 2010: 8-32GB (and more)
Disk capacity
- 1990: 20MB
- 2000: 1GB
- 2010: 1TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80MB/sec
How long it will take to read 1TB of data
1TB (at 80MB/sec per disk)
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 100 disks: 2 min
- 1000 disks: 12 sec
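The figures above follow from simple division. A minimal sketch of the arithmetic, assuming decimal units (1 TB = 10^12 bytes), data spread evenly over the disks, and perfectly parallel reads; the function name is made up for the example.

```python
def read_time_seconds(total_bytes: float, bytes_per_second: float, disks: int) -> float:
    """Time to read `total_bytes` when the data is spread evenly over
    `disks` drives that are read in parallel."""
    return total_bytes / (bytes_per_second * disks)


TB = 10**12               # 1 TB in decimal units (assumption)
THROUGHPUT = 80 * 10**6   # 80 MB/sec per disk, as stated on the slide

times = {n: read_time_seconds(TB, THROUGHPUT, n) for n in (1, 10, 100, 1000)}
for n, seconds in times.items():
    print(f"{n:>4} disks: {seconds:8.1f} s ({seconds / 3600:.2f} h)")
```

One disk needs 12,500 seconds, which is roughly three and a half hours; a thousand disks need about 12.5 seconds.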
Parallel Data Processing is the answer
It has been with us for a while
- GRID computing: spreads processing load
- Distributed workload: hard to manage applications, overhead on developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
- Well-suited to batch-oriented, read-intensive applications
- Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
Design principles of Hadoop
New way of storing and processing the data
- Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of operating system
  • Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide and conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well for resiliency
[Diagram: a logical file is split into blocks 1 to 4, and copies of each block are placed on different nodes of the cluster]
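The block and replica idea can be sketched as follows. This is a toy model, not HDFS code: the block size, node names, and round-robin placement are assumptions chosen for readability (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a logical file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.
    Illustrative only; this is not the HDFS placement policy."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


# Hypothetical 16-byte "file" and five-node cluster.
original = b"0123456789abcdef"
blocks = split_into_blocks(original, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4", "node5"])
for block, holders in placement.items():
    print(block, holders)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.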
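Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce application
1. Map phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
[Diagram: map tasks are distributed to the Hadoop data nodes in the cluster; after the shuffle, the reduce phase returns a single result set]
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reduce: sum the counts received for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      // The slide is cut off here; these two lines follow the standard Hadoop WordCount example.
      result.set(sum);
      context.write(key, result);
    }
  }
}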
MapReduce Example
Count number of word occurrences
Entry data
Hello World Bye World
Hello IBM
Map process
Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
<Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
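The three steps of this example can be reproduced on a single machine in plain Python. This is a sketch of the data flow only, with hypothetical function names; it does not use Hadoop, and the two input lines stand in for the two map tasks.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the grouped counts for each word."""
    return {key: sum(values) for key, values in grouped.items()}


splits = ["Hello World Bye World", "Hello IBM"]   # entry data from the example
mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In a real cluster each split is mapped on the node that stores it, and the shuffle moves the grouped pairs across the network to the reducers.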
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql
Pig, Hive, Jaql: Similarities
- Reduced program size over Java
- Applications are translated to map and reduce jobs behind the scenes
- Extension points for extending existing functionality
- Interoperability with other languages
- Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql: Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform, with a legend distinguishing IBM and open source components]
Labels shown: Visualization & Discovery, Integration, Workload Optimization, Streams, Netezza, Flume, DB2, DataStage, Runtime, Advanced Analytic Engines, File System, MapReduce, HDFS, Data Store, HBase, Text Processing Engine & Extractor Library, BigSheets, JDBC, Applications & Development, Text Analytics, Pig & Jaql, Hive, Administration, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Jaql, Pig, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Integrated Installer, Admin Console, Sqoop, Adaptive Algorithms, Dashboard & Visualization, Apps, Workflow Monitoring, Management, HCatalog, Security, Audit & History, Lineage, R, Guardium, Platform Computing, Cognos, GPFS-FPO, Big SQL, NameNode High Avail, Avro
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
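The HBase entry above contrasts row-oriented storage with column families. A small Python sketch of that layout difference follows; it is not the HBase API, and the table contents, family names, and function name are invented for the illustration.

```python
from typing import Dict, Set

Row = Dict[str, str]

# Row-oriented layout: all columns of a row are stored together.
row_store: Dict[str, Row] = {
    "user1": {"name": "Ana", "city": "Barcelona", "last_login": "2014-01-10"},
    "user2": {"name": "Luis", "city": "Madrid"},  # sparse: no last_login
}


def to_column_families(rows: Dict[str, Row],
                       families: Dict[str, Set[str]]) -> Dict[str, Dict[str, Row]]:
    """Regroup rows so that the cells of each column family are stored together.
    Missing cells are simply absent, which is why sparse data fits this layout."""
    store: Dict[str, Dict[str, Row]] = {family: {} for family in families}
    for row_key, columns in rows.items():
        for family, members in families.items():
            cells = {col: val for col, val in columns.items() if col in members}
            if cells:
                store[family][row_key] = cells
    return store


# Hypothetical grouping of attributes into two column families.
families = {"profile": {"name", "city"}, "activity": {"last_login"}}
cf_store = to_column_families(row_store, families)
print(cf_store["activity"])
```

A query that only needs the `activity` family reads that one group and never touches the profile cells.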
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
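The Lucene entry above says that an index built from the documents is what makes text search fast. The idea behind such an index (an inverted index) can be sketched in Python. This is a simplified illustration, not Lucene's API or file format; the documents and function names are assumptions.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical three-document collection.
docs = {1: "big data on Hadoop", 2: "text search with Lucene", 3: "Hadoop text analytics"}
index = build_index(docs)
hits = search(index, "hadoop text")
print(hits)
```

A lookup touches only the index entries for the query terms, so its cost does not depend on scanning every document.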
Example of Hadoop Ecosystem
[Same InfoSphere BigInsights diagram as on the earlier "Example of Hadoop Ecosystem" slide; the following slides cover BigSheets]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers to work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from Web crawl, etc.
BigSheets Collection Operations
Work with built-in "sheets" editor
Add/delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
...
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Same InfoSphere BigInsights diagram as on the earlier "Example of Hadoop Ecosystem" slide; the following slides cover Big SQL]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive ANSI SQL-92 support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC and ODBC compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites query if necessary for improved performance
- Determines appropriate storage handler for data
- Produces execution plan
- Executes and coordinates query
[Diagram: an application sends SQL through the JDBC/ODBC driver over a network protocol to the Big SQL server, which runs on a head node of the BigInsights cluster and contains the SQL engine and the storage handlers (Del files, SEQ files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker, and the Hive metastore. Each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Same InfoSphere BigInsights diagram as on the earlier "Example of Hadoop Ecosystem" slide; the following slides cover Text Analytics]
What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
- Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM opens up the Text Analytics engine technology for customization and development
Annotation Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
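For readers who have not seen AQL, the intent of the AmountWithUnit view (a number immediately followed by a unit) can be approximated with a regular expression in Python. This is only an analogy: AQL composes the Number and Unit views defined elsewhere, while the unit list, the sample sentence, and the function name here are assumptions made for the example.

```python
import re
from typing import List, Tuple

UNITS = ("kg", "km", "GB", "MB")          # hypothetical unit dictionary
NUMBER = r"\d+(?:\.\d+)?"                 # integer or decimal number
UNIT = "|".join(re.escape(u) for u in UNITS)
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s*({UNIT})\b")


def extract_amounts(text: str) -> List[Tuple[str, str]]:
    """Return (number, unit) pairs where a number is directly followed by a unit."""
    return [(m.group(1), m.group(2)) for m in AMOUNT_WITH_UNIT.finditer(text)]


# Made-up sentence, used only to exercise the pattern.
sample = "The cluster stores 20 GB per node and the truck carried 3.5 kg."
amounts = extract_amounts(sample)
print(amounts)
```

The declarative AQL form keeps the number and unit rules separate and reusable, whereas the regular expression bakes both into one pattern.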
Text Analytic Simple Example
World Cup 2010 Highlights
In the Football World Cup 2010, one team distinguished itself well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations: Striker, Arjen Robben, Netherlands; Keeper, Iker Casillas, Spain; Winger, Andres Iniesta, Spain
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Same InfoSphere BigInsights diagram as on the earlier "Example of Hadoop Ecosystem" slide; the following slide covers R]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  - Partitioning of large data
  - Parallel cluster execution of R code
• Distributed machine learning
  - A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram labels: R clients, IBM R packages, data sources, embedded R execution, scalable machine learning. Numbered callouts 1 to 3 include: pull data (summaries) to the R client, or push R functions right on the data]
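The "partition the data, run the function next to each partition, pull only summaries back to the client" pattern can be sketched as follows. The sketch is written in Python rather than R and does not use the Big R packages; the partitioning scheme, the data, and the function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def partition(values: Sequence[float], num_partitions: int) -> List[Sequence[float]]:
    """Split the data into roughly equal partitions (one per worker)."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def summarize(part: Sequence[float]) -> Tuple[int, float]:
    """Runs next to the data: returns only a small (count, sum) summary."""
    return len(part), float(sum(part))


def combine(summaries: List[Tuple[int, float]]) -> float:
    """Runs on the client: merges the per-partition summaries into a global mean."""
    total_count = sum(count for count, _ in summaries)
    total_sum = sum(subtotal for _, subtotal in summaries)
    return total_sum / total_count


data = list(range(1, 101))                       # stand-in for a large data set
summaries = [summarize(p) for p in partition(data, 4)]
mean = combine(summaries)
print(summaries, mean)
```

Only four small tuples travel to the client, regardless of how many values each partition holds.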
Where Does Big Data Fit?
[Diagram labels: Source systems; Analytical database (DW); Analytical tools; "Capture only what's needed"; "Capture in case it's needed"; 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist: the new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data (free PDF book)
  • http://public.dhe.ibm.com/common/ssi/ecm/im/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x (Redbook)
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises: zero setup
Robust course management system and content distribution infrastructure
copy 2014 IBM Corporation10
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
copy 2014 IBM Corporation11
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
Hypothesis based correlation Weird correlation
copy 2014 IBM Corporation12
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
copy 2014 IBM Corporation13
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation14
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation15
Complementary Analytics
Traditional ApproachStructured analytical logical
New ApproachCreative holistic thought intuition
Multimedia
Data Warehouse
Web Logs
Social Data
Sensor data
images
RFID
Internal AppData
TransactionData
MainframeData
OLTP SystemData
Traditional databases
ERP Data
StructuredRepeatable
Linear
UnstructuredExploratory
Dynamic
Text Data
emails
Hadoop andStreams
NewSources
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
17
Untapped stores of internal data
Size and scope of some internal data such as
detailed transactions and operational log data
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected but not
analyzed for years
Focus on customer insights
Customers ndash influenced by digital experiences
ndash often expect information provided to an
organization will then be ldquoknownrdquo during future
interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions Emails Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization
copy 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100 due to rounding
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
copy 2014 IBM Corporation20
Big Data ExplorationFind visualize understand all big data to improve decision making
Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources
Operations AnalysisAnalyze a variety of machinedata for improved business results
Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency
SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time
Key Big Data Use Cases
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Rest Data
ndash Data to analyze are already stored (structured and unstructured)
ndash Examples logs facebook twitter etc
ndash Solution Hadoop (open source)
Data in motion
ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage
ndash Examples Sensors RFID etc
ndash Solution Streams CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU Speedsndash 1990 - 44 MIPS at 40 MHz
ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz
RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)
Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB
Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec
copy 2014 IBM Corporation24
How long it will take to read 1TB of data
1TB (at 80Mb sec)ndash 1 disk - 34 hours
ndash 10 disks - 20 min
ndash 100 disks - 2 min
ndash 1000 disks - 12 sec
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)
copy 2014 IBM Corporation26
What is Apache Hadoop
Apache Open source software framework
Flexible enterprise-class support for processing large volumes of
data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)
ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search
technology
ndash Well-suited to batch-oriented read-intensive applications
ndash Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo
ndash Nodes can be combined into clusters
ndash New nodes can be added as needed without changing
bull Data formats
bull How data is loaded
bull How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
Example of Hadoop Ecosystem
(Diagram: the IBM InfoSphere BigInsights platform. Legend: IBM, Open Source.
Layers: Visualization & Discovery; Applications & Development; Administration; Advanced Analytic Engines; Runtime; File System; Data Store; Integration; Workload Optimization.
Components: BigSheets, Dashboard & Visualization, Apps, Big SQL, Text Analytics, Text Processing Engine & Extractor Library, R, Adaptive Algorithms, MapReduce, Adaptive MapReduce, Pig, Jaql, Hive, HCatalog, HBase, HDFS, GPFS-FPO, ZooKeeper, Lucene, Oozie, Sqoop, Flume, Avro, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, NameNode High Avail, Integrated Installer, Admin Console, Workflow Monitoring, Management, Security, Audit & History, Lineage, JDBC, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos.)
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
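The difference between column-family storage and row storage described for HBase can be made concrete with a small model. This is a sketch under stated assumptions, not the HBase client API: ColumnFamilySketch, put and getFamily are hypothetical names, and the nested in-memory maps only stand in for the separate on-disk stores that hold each column family.

```java
import java.util.Map;
import java.util.TreeMap;

public class ColumnFamilySketch {
    // family -> row key -> column qualifier -> value.
    // Each family's map stands in for a separate store, so cells of one family sit together.
    private final Map<String, TreeMap<String, TreeMap<String, String>>> families =
            new TreeMap<String, TreeMap<String, TreeMap<String, String>>>();

    // Stores one cell; rows only hold the cells that were actually written (sparse data).
    public void put(String rowKey, String family, String qualifier, String value) {
        TreeMap<String, TreeMap<String, String>> rows = families.get(family);
        if (rows == null) {
            rows = new TreeMap<String, TreeMap<String, String>>();
            families.put(family, rows);
        }
        TreeMap<String, String> cells = rows.get(rowKey);
        if (cells == null) {
            cells = new TreeMap<String, String>();
            rows.put(rowKey, cells);
        }
        cells.put(qualifier, value);
    }

    // Reads one family of one row; the other families are never touched.
    public Map<String, String> getFamily(String rowKey, String family) {
        TreeMap<String, TreeMap<String, String>> rows = families.get(family);
        if (rows == null || !rows.containsKey(rowKey)) {
            return new TreeMap<String, String>();
        }
        return rows.get(rowKey);
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("cust-001", "profile", "name", "Ana");
        table.put("cust-001", "profile", "city", "Barcelona");
        table.put("cust-001", "activity", "lastLogin", "2014-03-01");
        table.put("cust-002", "profile", "name", "Joan"); // no "activity" cells for this row
        System.out.println(table.getFamily("cust-001", "profile"));  // {city=Barcelona, name=Ana}
        System.out.println(table.getFamily("cust-002", "activity")); // {}
    }
}
```

A query that needs only the "profile" attributes reads a single family map, which is the benefit the column-family layout aims for; a row-oriented store would have to read every column of the row.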
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL databases) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
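The idea behind the Lucene index, mapping each term to the documents that contain it, can be illustrated with a toy inverted index. This is a sketch in plain Java and does not use the Lucene library: InvertedIndexSketch, addDocument and search are hypothetical names, tokenization is a simple split on non-alphanumeric characters, and a multi-term search is assumed to mean "all terms must match".

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<String, TreeSet<Integer>>();

    // Breaks the text into lower-case terms and records the document id under each term.
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            TreeSet<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
    }

    // Returns the ids of the documents that contain every given term (AND query).
    public Set<Integer> search(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            Set<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
            if (docs == null) {
                return new TreeSet<Integer>(); // one missing term empties an AND query
            }
            if (result == null) {
                result = new TreeSet<Integer>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addDocument(0, "Hadoop stores data in HDFS");
        index.addDocument(1, "Lucene builds an index for text search");
        index.addDocument(2, "Jaql can query a Lucene index");
        System.out.println(index.search("lucene", "index")); // [1, 2]
        System.out.println(index.search("hdfs"));            // [0]
    }
}
```

A search looks up each term directly instead of scanning every document, which is why the index is the basis of fast text search.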
Example of Hadoop Ecosystem
(Diagram: the IBM InfoSphere BigInsights ecosystem, repeated ahead of the BigSheets slides; BigSheets appears under Visualization & Discovery.)
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over big data in Hadoop
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with the built-in "sheets" editor
Add and delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
(Diagram: the IBM InfoSphere BigInsights ecosystem, repeated ahead of the Big SQL slides.)
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
The SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces the execution plan
– Executes and coordinates the query
(Diagram, server layout and relative sizes for illustrative purposes only: an application issues SQL through the JDBC/ODBC driver and a network protocol to the Big SQL Server, which runs on a head node of the BigInsights cluster and contains the SQL engine and the storage handlers for delimited files, sequence files, HBase, RDBMS, and others. Further head nodes host the Name Node, the Job Tracker, and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node, and a Region Server.)
Example of Hadoop Ecosystem
(Diagram: the IBM InfoSphere BigInsights ecosystem, repeated ahead of the Text Analytics slides.)
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
– Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotation Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
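What the AmountWithUnit view expresses, a number immediately followed by a unit, can be approximated with a regular expression. The following is an illustrative analog in plain Java, not AQL and not the Text Analytics engine: the class AmountWithUnitSketch is hypothetical, and the NUMBER and UNIT patterns are assumed stand-ins for the Number and Unit views, which the slide does not define.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmountWithUnitSketch {
    // Assumed stand-in for the Number view: an integer or a decimal number.
    static final String NUMBER = "\\d+(?:\\.\\d+)?";
    // Assumed stand-in for the Unit view: a small fixed dictionary of units.
    static final String UNIT = "(?:MB|GB|TB|MHz|GHz)";
    // A number followed by a unit, separated by optional whitespace.
    static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b" + NUMBER + "\\s*" + UNIT + "\\b");

    // Returns every "number unit" span found in the text, in order of appearance.
    public static List<String> extract(String text) {
        List<String> matches = new ArrayList<String>();
        Matcher matcher = AMOUNT_WITH_UNIT.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [20 MB, 1 TB, 3.3 GHz]
        System.out.println(extract("Disk capacity grew from 20 MB in 1990 to 1 TB in 2010, at 3.3 GHz."));
    }
}
```

The years in the sample sentence are not matched because no unit follows them. A declarative language such as AQL composes named views instead of one hand-written expression, which is what keeps larger rule sets maintainable.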
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations (position, player, country)
Striker, Arjen Robben, Netherlands
Keeper, Iker Casillas, Spain
Winger, Andres Iniesta, Spain
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
(Diagram: the IBM InfoSphere BigInsights ecosystem, repeated ahead of the Big R slide.)
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed machine learning
– A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
(Diagram labels: R Clients; IBM R Packages; Data Sources; Embedded R Execution; Scalable Machine Learning; "Pull data (summaries) to R client"; "Or push R functions right on the data"; numbered steps 1 to 3.)
Where Does Big Data Fit?
(Diagram labels: Source Systems; Analytical database (DW); Analytical tools. Numbered steps shown: 5 Explore data; 6 Parse, aggregate; 1 Extract, transform, load; 9 Report and mine data. Captions: "Capture in case it's needed"; "Capture only what's needed".)
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/im/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust Course Management System and Content Distribution infrastructure
Paradigm shifts enabled by big data III: Data leads the way – and sometimes correlations are good enough
Hypothesis-based correlation / Weird correlation
Paradigm shifts enabled by big data IV: Leverage data as it is captured
Complementary Analytics
Traditional approach: structured, analytical, logical
– Data Warehouse, traditional databases
– Sources: internal app data, transaction data, mainframe data, OLTP system data, ERP data
– Structured, repeatable, linear
New approach: creative, holistic thought, intuition
– Hadoop and Streams
– New sources: multimedia, web logs, social data, sensor data, images, RFID, text data, emails
– Unstructured, exploratory, dynamic
Types of Analytic Tools
Organisations are prioritising internal data sources
Untapped stores of internal data
– The size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
– New infrastructure components make them accessible for analysis
– Some data has been collected, but not analyzed, for years
Focus on customer insights
– Customers, influenced by digital experiences, often expect that information provided to an organization will then be "known" during future interactions
– Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences (transactions, emails, call center interaction records)
Big data sources: Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization
Stages of Big Data adoption
Big data adoption: When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.
Hadoop workloads
Workload                Today   In 18 Months
Staging area            92%     92%
Online archive          92%     92%
Transformation engine   83%     92%
Ad hoc queries          58%     67%
Scheduled reports       42%     67%
Visual exploration      25%     67%
Data mining             58%     83%
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012.
Key Big Data Use Cases
Big Data Exploration: Find, visualize, and understand all big data to improve decision making
Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: Lower risk, detect fraud, and monitor cyber security in real time
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
– The data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data are analyzed in real time, at the moment they are generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions
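Analyzing data in motion means each reading is processed as it arrives and then discarded. The following is a minimal sketch in plain Java, not InfoSphere Streams or any CEP product: RunningAverageSketch, accept, isSpike and mean are hypothetical names, and the rule "flag a reading that exceeds the running mean by more than a threshold" is an assumed example of a real-time check.

```java
public class RunningAverageSketch {
    private long count = 0;     // number of readings seen so far
    private double mean = 0.0;  // running mean of those readings

    // Folds one reading into the running mean; the reading itself is not retained.
    public double accept(double reading) {
        count++;
        mean += (reading - mean) / count;
        return mean;
    }

    // Flags a reading that exceeds the mean of the earlier readings by more than
    // the threshold, then folds the reading into the mean.
    public boolean isSpike(double reading, double threshold) {
        boolean spike = count > 0 && reading - mean > threshold;
        accept(reading);
        return spike;
    }

    public double mean() {
        return mean;
    }

    public static void main(String[] args) {
        RunningAverageSketch sensor = new RunningAverageSketch();
        double[] readings = {20.0, 21.0, 19.0, 35.0, 20.0};
        for (double reading : readings) {
            if (sensor.isSpike(reading, 10.0)) {
                System.out.println("spike: " + reading); // fires only for 35.0
            }
        }
    }
}
```

The state kept between readings is two numbers, regardless of how many readings have passed, which is what makes analysis without previous storage feasible.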
Hardware improvements through the years
CPU speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM memory
– 1990: 640 KB conventional memory (256 KB extended memory recommended)
– 2000: 64 MB
– 2010: 8-32 GB (and more)
Disk capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
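The figures on this slide follow from dividing the data volume by the aggregate read rate. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1,000,000 MB), data spread evenly over the disks, and perfectly parallel reads; ReadTimeSketch and secondsToRead are hypothetical names.

```java
public class ReadTimeSketch {
    static final double MB_PER_TB = 1000000.0; // decimal units, an assumption

    // Seconds needed to read the data when every disk streams its share in parallel.
    public static double secondsToRead(double terabytes, double mbPerSecPerDisk, int disks) {
        return terabytes * MB_PER_TB / (mbPerSecPerDisk * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double seconds = secondsToRead(1.0, 80.0, disks);
            System.out.println(disks + " disk(s): " + seconds + " s");
        }
    }
}
```

With these assumptions one disk needs 12,500 s (about 3.5 hours), 10 disks need 1,250 s (about 21 minutes), 100 disks need 125 s (about 2 minutes), and 1000 disks need 12.5 s, in line with the rounded values above.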
Parallel Data Processing is the answer
It has been with us for a while
– GRID computing: spreads the processing load
– Distributed workload: hard to manage applications, overhead on the developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well suited to batch-oriented, read-intensive applications
– Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
Design principles of Hadoop
New way of storing and processing the data
– Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of the operating system
  • Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
(Diagram: a logical file is split into blocks 1 to 4, and three copies of each block are spread across the nodes of the cluster.)
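Splitting a file into blocks and spreading replicas over nodes can be sketched in a few lines. This is a toy model, not the HDFS API or its placement policy: BlockPlacementSketch, blockCount and placeReplicas are hypothetical names, the 64 MB block size and 200 MB file size are assumed values, and the round-robin rule is a simplification of the rack-aware placement a real cluster uses.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPlacementSketch {
    // Number of fixed-size blocks needed to hold a file (the last block may be partial).
    public static int blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (int) ((fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
    }

    // Toy round-robin policy: replica r of block b goes to node (b + r) mod nodes,
    // so the replicas of one block always land on different nodes.
    public static List<List<Integer>> placeReplicas(int blocks, int replication, int nodes) {
        if (replication > nodes) {
            throw new IllegalArgumentException("replication exceeds node count");
        }
        List<List<Integer>> placement = new ArrayList<List<Integer>>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<Integer>();
            for (int r = 0; r < replication; r++) {
                replicas.add((b + r) % nodes);
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        int blocks = blockCount(200 * mb, 64 * mb); // 4 blocks
        // prints [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 0]]
        System.out.println(placeReplicas(blocks, 3, 5));
    }
}
```

With three replicas per block, any single node can fail and every block is still available on two other nodes, which is the resiliency the slide refers to.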
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce application
1. Map phase (break the job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
(Diagram: map tasks are distributed to the Hadoop Data Nodes of the cluster, a shuffle follows, and the reduce phase returns a single result set.)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);  // emit <word, 1>
    }
  }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) sum += v.get();  // add up the 1s emitted for this word
    result.set(sum);
    context.write(key, result);  // emit <word, total>
  }
}
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform, mixing IBM and open source components (HDFS, GPFS-FPO, MapReduce, Adaptive MapReduce, HBase, Pig, Hive, Jaql, HCatalog, Avro, Lucene, Oozie, ZooKeeper, Sqoop, Flume, R, BigSheets, Big SQL, Text Analytics), with administration and security features (integrated installer, admin console, flexible scheduler, audit and history, lineage), alongside Streams, Netezza, DB2, DataStage, Guardium, Platform Computing and Cognos]

What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive ANSI SQL-92 support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow

Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
The SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces the execution plan
- Executes and coordinates the query
[Diagram: an application sends SQL through the JDBC/ODBC driver, over a network protocol, to the Big SQL server on a head node of the BigInsights cluster. The server contains the SQL engine and the storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker and the Hive metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]

Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights component stack, including Text Analytics]

What is Text Analytics
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
- Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development

Annotator Query Language (AQL)
Language to create rules for Text Analytics
- SQL-like language
- Fully declarative text analytics language
- Once compiled, it produces an AOG plan that runs against the data
- No "black boxes" or modules that can't be customized
- Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and they are difficult to optimize for performance

    create view AmountWithUnit as
      extract pattern <N.match> <U.match>
        as match
      from Number N, Unit U;

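The AmountWithUnit view pairs a number with the unit that follows it. As a rough analogy only, the same idea can be sketched in plain Java with a regular expression. This is not AQL and not how the Text Analytics engine works internally; the class name, the method name and the small unit dictionary are hypothetical choices made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmountWithUnitSketch {
    // A number (optionally with decimals), whitespace, then a unit taken
    // from a small hypothetical dictionary. AQL would instead combine two
    // separately defined views, Number and Unit.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b(\\d+(?:\\.\\d+)?)\\s+(km|kg|MB|GB|TB)\\b");

    // Returns every "number unit" span found in the text, in order.
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("The cluster stores 20 TB on disks of 500 GB each"));
    }
}
```

The declarative version has one advantage this sketch lacks: the Number and Unit rules stay separate and reusable, and the compiler decides how to evaluate the pattern.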
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team stood out from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations
Country | Position | Player
Netherlands | Striker | Arjen Robben
Spain | Keeper | Iker Casillas
Spain | Winger | Andres Iniesta

Text Analytic Real Example

One step beyond Watson

Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights component stack, including R]

Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
- Partitioning of large data
- Parallel cluster execution of R code
• Distributed machine learning
- A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram with three numbered elements: R clients with IBM R packages either pull data (summaries) to the R client or push R functions right on the data (embedded R execution); scalable machine learning runs against the data sources.]

Where Does Big Data Fit
[Diagram: source systems, analytical database (DW) and analytical tools, contrasting "capture in case it's needed" with "capture only what's needed". Labeled steps that survive in the transcript: 1 Extract, transform, load; 5 Explore data; 6 Parse, aggregate; 9 Report and mine data.]

Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation56
Data scientist: the new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist"
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions

Data Science is Multidisciplinary

Successful Data Scientist Characteristics

Data Scientist Qualities

How Long Does It Take for a Beginner to Become a Good Data Scientist?

www.kaggle.com

Kaggle ranking

Learn Big Data
Reading Materials - Online
- Understanding Big Data (free PDF book)
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x (Redbook)
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights

Learn Big Data Technologies
BigDataUniversity.com
- Flexible online delivery allows learning at your place and your pace
- Free courses, free study materials
- Cloud-based sandbox for exercises: zero setup
- Robust course management system and content distribution infrastructure

Paradigm shifts enabled by big data III: Data leads the way, and sometimes correlations are good enough
Hypothesis-based correlation / Weird correlation

Paradigm shifts enabled by big data IV: Leverage data as it is captured

Complementary Analytics
Traditional approach: structured, analytical, logical
- Data warehouse
- Structured, repeatable, linear
- Traditional databases: internal app data, transaction data, mainframe data, OLTP system data, ERP data
New approach: creative, holistic thought, intuition
- Hadoop and Streams
- Unstructured, exploratory, dynamic
- New sources: multimedia, web logs, social data, sensor data, images, RFID, text data, emails

Types of Analytic Tools

Organisations are prioritising internal data sources
Untapped stores of internal data
- Size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
- New infrastructure components make them accessible for analysis
- Some data has been collected but not analyzed for years
Focus on customer insights
- Customers, influenced by digital experiences, often expect that information provided to an organization will then be "known" during future interactions
- Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences
(Transactions, emails, call center interaction records)
[Chart: big data sources. Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.]

Stages of Big Data adoption
[Chart: big data adoption. When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.]

Hadoop workloads
Workload | Today | In 18 months
Staging area | 92% | 92%
Online archive | 92% | 92%
Transformation engine | 83% | 92%
Ad hoc queries | 58% | 67%
Scheduled reports | 42% | 67%
Visual exploration | 25% | 67%
Data mining | 58% | 83%
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012

Key Big Data Use Cases
- Big Data Exploration: find, visualize and understand all big data to improve decision making
- Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
- Operations Analysis: analyze a variety of machine data for improved business results
- Data Warehouse Modernization: integrate big data and data warehouse capabilities to increase operational efficiency
- Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time

Big Data Concepts
Big Data Technology
Data Scientists

Solution for Big Data
Data at rest
- The data to analyze are already stored (structured and unstructured)
- Examples: logs, Facebook, Twitter, etc.
- Solution: Hadoop (open source)
Data in motion
- Data are analyzed in real time, at the moment they are generated, without any previous storage
- Examples: sensors, RFID, etc.
- Solution: Streams, CEP solutions

Hardware improvements through the years
CPU speeds
- 1990: 44 MIPS at 40 MHz
- 2000: 3,561 MIPS at 1.2 GHz
- 2010: 147,600 MIPS at 3.3 GHz
RAM memory
- 1990: 640 KB conventional memory (256 KB extended memory recommended)
- 2000: 64 MB
- 2010: 8-32 GB (and more)
Disk capacity
- 1990: 20 MB
- 2000: 1 GB
- 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/s

How long will it take to read 1 TB of data?
1 TB (at 80 MB/s)
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 100 disks: 2 min
- 1000 disks: 12 sec

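The read times on this slide follow from simple arithmetic: total bytes divided by the combined throughput of all disks reading in parallel. A minimal sketch, assuming a decimal terabyte (10^12 bytes), 80 MB/s per disk and a perfectly even split with no coordination overhead; the class and method names are illustrative only.

```java
import java.util.Locale;

final class DiskReadTime {
    // Seconds needed to scan `bytes` when the work is split evenly over
    // `disks` drives that each sustain `bytesPerSecond`.
    static double secondsToRead(double bytes, double bytesPerSecond, int disks) {
        return bytes / (bytesPerSecond * disks);
    }

    public static void main(String[] args) {
        double oneTerabyte = 1e12; // decimal terabyte (assumption)
        double throughput = 80e6;  // 80 MB/s per disk, as on the slide
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double s = secondsToRead(oneTerabyte, throughput, disks);
            System.out.printf(Locale.ROOT, "%4d disk(s): %8.1f s = %6.2f min = %5.2f h%n",
                    disks, s, s / 60.0, s / 3600.0);
        }
    }
}
```

One disk needs 12,500 seconds (about 3.5 hours) and 1000 disks need 12.5 seconds, which the slide rounds to 3.4 hours and 12 seconds.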
Parallel Data Processing is the answer
It has been with us for a while
- Grid computing: spreads the processing load
- Distributed workload: applications are hard to manage, and the overhead falls on the developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)

What is Apache Hadoop
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
- Well suited to batch-oriented, read-intensive applications
- Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written

Design principles of Hadoop
New way of storing and processing the data
- Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of the operating system
  • Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication

What is the Hadoop Distributed File System
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well for resiliency
[Diagram: a logical file is split into blocks 1 to 4; each block appears three times across the nodes of the cluster.]

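The block idea can be sketched in a few lines of plain Java: cut a file into fixed-size blocks, then store each block on several nodes. This is a simplification under stated assumptions. The round-robin placement rule, the 64 MB block size in the example and all names are illustrative; real HDFS placement is decided by the NameNode and is rack-aware.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockPlacementSketch {
    // Number of fixed-size blocks needed to hold a file of fileSize bytes.
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    // Round-robin placement: block b is stored on nodes b, b+1, ...
    // (modulo the node count). An illustration, not the HDFS policy.
    static List<List<Integer>> place(int blocks, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("replication exceeds node count");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((b + r) % nodes);
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        int blocks = blockCount(200 * mb, 64 * mb); // 200 MB file, 64 MB blocks
        // Prints [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 0]]
        System.out.println(place(blocks, 5, 3));
    }
}
```

With three replicas per block, losing any single node still leaves two copies of every block, which is the resiliency the slide refers to.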
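Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce application
1. Map phase (break the job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
[Diagram: map tasks are distributed to the Hadoop data nodes in the cluster; after the shuffle, a single result set is returned.]

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map: emit (word, 1) for every token of the input line
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text val, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(val.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce: sum the counts that the shuffle grouped under each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> val, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : val) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }
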
MapReduce Example
Count the number of word occurrences
Entry data
- Hello World Bye World
- Hello IBM
Map process
- Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
- Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
- <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>

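The three phases of this example can be traced without a cluster. The sketch below runs them in memory with plain Java collections, so the intermediate pairs are easy to inspect. It does not use the Hadoop API; the class and method names are illustrative, and a TreeMap is used only to make the printed order deterministic.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every token in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values of all map outputs by key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce phase: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            totals.put(e.getKey(), sum);
        }
        return totals;
    }

    static Map<String, Integer> wordCount(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split)); // on a cluster, each split runs on its own node
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // Prints {Bye=1, Hello=2, IBM=1, World=2}
        System.out.println(wordCount(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

The result matches the final output on the slide; only the ordering differs, because the sketch sorts keys alphabetically.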
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql

Pig, Hive, Jaql: Similarities
- Reduced program size over Java
- Applications are translated to map and reduce jobs behind the scenes
- Extension points for extending existing functionality
- Interoperability with other languages
- Not designed for random reads/writes or low-latency queries

Pig, Hive, Jaql: Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional

Hadoop Distributions

Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform, mixing IBM and open source components (HDFS, GPFS-FPO, MapReduce, Adaptive MapReduce, HBase, Pig, Hive, Jaql, HCatalog, Avro, Lucene, Oozie, ZooKeeper, Sqoop, Flume, R, BigSheets, Big SQL, Text Analytics), with administration and security features (integrated installer, admin console, flexible scheduler, audit and history, lineage), alongside Streams, Netezza, DB2, DataStage, Guardium, Platform Computing and Cognos]

Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.

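The difference between column-family storage and row-oriented storage, as described for HBase, is easier to see as a data structure. The sketch below keeps one sorted store per column family, so reading a family never touches the others, and rows may omit columns freely (sparse data). It is a conceptual model only, not the HBase client API; every name in it is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class ColumnFamilySketch {
    // family -> rowKey -> qualifier -> value.
    // Each family is its own sorted store, so its cells live together.
    private final Map<String, Map<String, Map<String, String>>> families = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    // Scanning one family reads only that family's store.
    Map<String, Map<String, String>> scanFamily(String family) {
        return families.getOrDefault(family, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("user1", "profile", "name", "Ana");
        table.put("user1", "activity", "lastLogin", "2014-05-01");
        table.put("user2", "profile", "name", "Luis"); // no "activity" cells: sparse row
        // Prints {user1={name=Ana}, user2={name=Luis}}
        System.out.println(table.scanFamily("profile"));
    }
}
```

A row-oriented layout would invert the nesting (row first, then all of its columns), which is why it must read whole rows even when a query needs a single attribute group.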
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
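
The index that makes Lucene searches fast is, at its core, an inverted index: a map from each term to the documents that contain it. The following is a minimal sketch of that idea in plain Java. It does not use the Lucene API, and the tokenization rule (lowercase, split on non-alphanumeric characters) and all names are simplifying assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Break the text into lowercase terms and record the document id under each.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A lookup is a single map access instead of a scan over every document.
    SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Hadoop stores data in HDFS");
        index.add(2, "Lucene builds an index");
        index.add(3, "Jaql can query a Lucene index");
        System.out.println(index.search("Lucene")); // prints [2, 3]
    }
}
```

The cost of building the index is paid once at write time; each search afterwards avoids rereading the documents.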
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture
ndash Comprehensive SQL92 ansi support
ndash Standards compliant client drivers (JDBC amp ODBC)
ndash Efficient handling of point queries
ndash Wide variety of data sources and file formats
ndash Extensive HBase focus
ndash Open source interoperability
Our driving design goalsndash Existing queries should run with no or few modifications
ndash Existing JDBC and ODBC compliant tools should continue to function
ndash Queries should be executed as efficiently as the chosen storage
mechanisms allow
copy 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with
Hive via the Hive metastorendash Each can query the others tables
SQL engine analyzes incoming
queriesndash Separates portion(s) to execute at
the server vs portion(s) to execute
on the cluster
ndash Re-writes query if necessary for
improved performance
ndash Determines appropriate storage
handler for data
ndash Produces execution plan
ndash Executes and coordinates query
Server layout and relative sizes
for illustrative purposes only
Application
SQL Language
JDBC ODBC Driver
BigInsights Cluster
Head Node
Big SQL Server
Head Node
Name Node
Head Node
Job TrackerNetwork Protocol
SQL Engine
Storage Handlers
Del
Files
SEQ
FilesHBase RDBMS bullbullbull
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Compute Node
Task
Tracker
Data
Node
Region
Server
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Head Node
Hive Metastore
copy 2014 IBM Corporation46
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
Text Analytics
copy 2014 IBM Corporation47
What is Text Analytics
High Performance and Scalable rule based Information Extraction Engine
Distill structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build test and refine rules
ndash Developer tools an easy to use text analytics language and a set of
extractors for fast adoption
ndash Multilingual support including support for DBCS languages
Developed at IBM Research since 2004 System T
BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development
copy 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL Like Language
Fully declarative text analytics language
Once compiled produced an AOG plan to work in the data
No ldquoblack boxesrdquo or modules that canrsquot be customized
Tooling for easy customization because you are abstracted from the
programmatic details
Competing solutions make use of locked up black-box modules that cannot be
customized which restricts flexibility and are difficult to optimize for performance
create view AmountWithUnit as
extract pattern ltNmatchgt ltUmatchgt
as match
from Number N Unit U
copy 2014 IBM Corporation49
Text Analytic Simple Example
NetherlandsStrikerArjen Robben
Keeper SpainIker Casillas
WingerAndres Iniesta Spain
World Cup 2010 Highlights
Football World Cup 2010 one team distinguished well
from the rest winning the final Early in the second
half Netherlandsrsquo striker Arjen Robben had a chance
to score but the awesome keeper for Spain Iker
Casillas made the save Winner superiority was
reflected when Winger Andres Iniesta scored for Spain
for the win
copy 2014 IBM Corporation5050
Text Analytic Real Example
copy 2014 IBM Corporation5151
One step beyond Watson
copy 2014 IBM Corporation52
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
Text Analytics
R
copy 2014 IBM Corporation53
Big R
bull Explore visualize transform
and model big data using
familiar R syntax and
paradigm
bull Scale out R with MR
programming
ndash Partitioning of large data
ndash Parallel cluster execution of R
code
bull Distributed Machine
Learning
ndash A scalable statistics engine that
provides canned algorithms and
an ability to author new ones all
via R
R Clients
Scalable
Machine
Learning
Data Sources
Embedded R
Execution
IBM R Packages
IBM R Packages
Pull data
(summaries) to
R client
Or push R
functions
right on the
data
1
2
3
copy 2014 IBM Corporation54
Where Does BigData Fit
Analytical database
(DW)
Source Systems
Analytical tools
5 Explore data
6 Parse aggregate
ldquoCapture in case itrsquos neededrdquo
1 Extract transform load
ldquoCapture only whatrsquos neededrdquo
9 Report and mine data
copy 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists

Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking

Learn Big Data
Reading Materials - Online
- Understanding Big Data - Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights

Learn Big Data Technologies
BigDataUniversity.com
- Flexible on-line delivery allows learning at your place and your pace
- Free courses, free study materials
- Cloud-based sandbox for exercises - zero setup
- Robust Course Management System and Content Distribution infrastructure

Paradigm shifts enabled by big data I: Leverage more of the data being captured
[Figure label: Bank X]
Paradigm shifts enabled by big data II: Reduce effort required to leverage data
Paradigm shifts enabled by big data III: Data leads the way - and sometimes correlations are good enough
[Figure labels: Hypothesis-based correlation; Weird correlation]
Paradigm shifts enabled by big data IV: Leverage data as it is captured

Complementary Analytics
Traditional Approach: structured, analytical, logical
- Structured, repeatable, linear
- Data Warehouse and traditional databases fed by: Internal App Data, Transaction Data, Mainframe Data, OLTP System Data, ERP Data
New Approach: creative, holistic thought, intuition
- Unstructured, exploratory, dynamic
- Hadoop and Streams fed by new sources: Multimedia, Web Logs, Social Data, Sensor data, images, RFID, Text Data, emails

Types of Analytic Tools

Organisations are prioritising internal data sources
Untapped stores of internal data
- Size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
- New infrastructure components make them accessible for analysis
- Some data has been collected but not analyzed for years
Focus on customer insights
- Customers, influenced by digital experiences, often expect information provided to an organization will then be "known" during future interactions
- Combining disparate internal sources (transactions, emails, call center interaction records) with advanced analytics creates insights into customer behavior and preferences
[Figure: Big data sources. Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.]

Stages of Big Data adoption
[Figure: Big data adoption. When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.]

Hadoop workloads
Workload | Today (%) | In 18 Months (%)
Staging area | 92 | 92
Online archive | 92 | 92
Transformation Engine | 83 | 92
Ad hoc queries | 58 | 67
Scheduled reports | 42 | 67
Visual exploration | 25 | 67
Data mining | 58 | 83
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012

Key Big Data Use Cases
- Big Data Exploration: Find, visualize, understand all big data to improve decision making
- Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
- Operations Analysis: Analyze a variety of machine data for improved business results
- Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
- Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real time
Big Data Concepts
Big Data Technology
Data Scientists

Solution for Big Data
Data at rest
- Data to analyze are already stored (structured and unstructured)
- Examples: logs, Facebook, Twitter, etc.
- Solution: Hadoop (open source)
Data in motion
- Data are analyzed in real time, at the moment they are generated, without any previous storage
- Examples: sensors, RFID, etc.
- Solution: Streams, CEP solutions

Hardware improvements through the years
CPU speeds
- 1990: 44 MIPS at 40 MHz
- 2000: 3,561 MIPS at 1.2 GHz
- 2010: 147,600 MIPS at 3.3 GHz
RAM memory
- 1990: 640K conventional memory (256K extended memory recommended)
- 2000: 64 MB memory
- 2010: 8-32 GB (and more)
Disk capacity
- 1990: 20 MB
- 2000: 1 GB
- 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec

How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 100 disks: 2 min
- 1000 disks: 12 sec
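The arithmetic behind these read times can be sketched in a few lines. This is only an illustration under stated assumptions that are not on the slide: 1 TB is taken as 10**12 bytes, each disk sustains 80 * 10**6 bytes per second, and the data is striped evenly so all disks read in parallel with no overhead. The function names are made up for the example.

```python
# Sketch: time to scan a dataset when it is striped evenly across N disks.
# Assumptions (not from the slide): 1 TB = 10**12 bytes, 80 MB/sec = 80 * 10**6
# bytes/sec per disk, and perfectly parallel reads with no coordination overhead.

TB = 10**12
MB = 10**6


def read_time_seconds(total_bytes: float, bytes_per_sec: float, disks: int) -> float:
    """Seconds needed when each disk reads an equal share at the same rate."""
    if disks < 1:
        raise ValueError("need at least one disk")
    return total_bytes / (bytes_per_sec * disks)


def humanize(seconds: float) -> str:
    """Format a duration using the largest convenient unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} min"
    return f"{seconds:.1f} sec"


if __name__ == "__main__":
    for disks in (1, 10, 100, 1000):
        t = read_time_seconds(1 * TB, 80 * MB, disks)
        print(f"{disks:>4} disk(s): {humanize(t)}")
```

Under these assumptions a single disk needs 12,500 seconds (roughly three and a half hours) and 1000 disks need 12.5 seconds, which is the scale of improvement the slide is pointing at.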

Parallel Data Processing is the answer
It has been with us for a while
- GRID computing: spreads processing load
- Distributed workload: hard to manage applications, overhead on developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)

What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
- Well-suited to batch-oriented, read-intensive applications
- Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written

Design principles of Hadoop
New way of storing and processing the data
- Let system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of operating system
  • Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication

What is the Hadoop Distributed File System?
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well for resiliency
[Figure: a logical file is split into blocks 1-4; each block appears three times across the nodes of the cluster.]
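The idea of cutting a file into blocks and keeping several copies of each block on different nodes can be sketched as follows. This is a toy model, not HDFS itself: the tiny block size, the three replicas and the round-robin placement are assumptions chosen for illustration (real HDFS uses far larger blocks and a rack-aware placement policy), and all function and node names are hypothetical.

```python
# Sketch: split a file into fixed-size blocks and place replicas on nodes.
# Assumptions (not from the slides): tiny block size, three replicas, and simple
# round-robin placement. Real HDFS uses much larger blocks and rack-aware placement.

from collections import defaultdict


def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the logical file into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
    return placement


def blocks_per_node(placement: dict[int, list[str]]) -> dict[str, list[int]]:
    """Invert the placement map: which blocks does each node hold?"""
    held = defaultdict(list)
    for block_id, holders in placement.items():
        for node in holders:
            held[node].append(block_id)
    return dict(held)


if __name__ == "__main__":
    blocks = split_into_blocks(b"0123456789ABCDEF", block_size=4)   # 4 blocks
    placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4", "node5"])
    for node, ids in sorted(blocks_per_node(placement).items()):
        print(node, "holds blocks", ids)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block, which is the resiliency argument made on the slide.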

Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
A MapReduce application runs in three phases on the Hadoop data nodes:
1. Map phase (break job into small parts); map tasks are distributed to the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
Return a single result set

    // Excerpt from the standard Hadoop WordCount example (inner classes of the job class;
    // imports from org.apache.hadoop.io and org.apache.hadoop.mapreduce are not shown on the slide).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);   // emit <word, 1> for every token
        }
      }
    }
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();
      public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : val) sum += v.get();   // add up the 1s emitted for this word
        result.set(sum);
        context.write(key, result);
      }
    }

MapReduce Example
Count the number of word occurrences
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
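The same word count can be traced end to end in plain Python. This is a single-process sketch with no Hadoop involved; the function names are illustrative and are not part of any Hadoop API. It only mirrors the three steps of the example: each input line is one map task, the shuffle groups values by word, and the reduce sums them.

```python
# Sketch of the word-count flow in plain Python (no Hadoop involved).
# The function names are illustrative, not part of any Hadoop API.

from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a <word, 1> pair for every token in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(mapped: list[list[tuple[str, int]]]) -> dict[str, list[int]]:
    """Shuffle: group the values emitted by all map tasks by key."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: sum the values for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(splits: list[str]) -> dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input splits."""
    mapped = [map_phase(split) for split in splits]   # one map task per split
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    print(word_count(["Hello World Bye World", "Hello IBM"]))
```

On a real cluster the map tasks run on the nodes that hold the data and the shuffle moves pairs across the network; here everything happens in one process so the data flow is easy to follow.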

How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql

Pig, Hive, Jaql - Similarities
- Reduced program size over Java
- Applications are translated to map and reduce jobs behind the scenes
- Extension points for extending existing functionality
- Interoperability with other languages
- Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql - Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional

Hadoop Distributions

Example of Hadoop Ecosystem
[Figure: component diagram of IBM InfoSphere BigInsights. Legend: IBM / Open Source. Labels in extraction order: Visualization & Discovery, Integration, Workload Optimization, Streams, Netezza, Flume, DB2, DataStage, Runtime, Advanced Analytic Engines, File System, MapReduce, HDFS, Data Store, HBase, Text Processing Engine & Extractor Library, BigSheets, JDBC, Applications & Development, Text Analytics, Pig & Jaql, Hive, Administration, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Jaql, Pig, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Integrated Installer, Admin Console, Sqoop, Adaptive Algorithms, Dashboard & Visualization, Apps, Workflow Monitoring, Management, HCatalog, Security, Audit & History, Lineage, R, Guardium, Platform Computing, Cognos, GPFS-FPO, Big SQL, NameNode High Avail, Avro.]

Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
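The difference between keeping a whole row together and keeping a column family together can be pictured with ordinary dictionaries. The sketch below is plain Python, not the HBase client API; the table contents and the family names ("info", "activity") are invented for illustration.

```python
# Sketch: row-oriented vs. column-family layout for the same records.
# The table, family names ("info", "activity") and values are made up for illustration;
# this is plain Python, not the HBase client API.

rows = {
    "user1": {"info:name": "Ana", "info:city": "Madrid", "activity:last_login": "2014-03-01"},
    "user2": {"info:name": "Ben", "activity:last_login": "2014-03-02"},   # sparse: no city
}


def row_oriented(table: dict[str, dict[str, str]]) -> list[tuple[str, dict[str, str]]]:
    """Relational-style layout: all columns of a row are kept together."""
    return [(row_key, dict(columns)) for row_key, columns in sorted(table.items())]


def by_column_family(table: dict[str, dict[str, str]]) -> dict[str, dict[tuple[str, str], str]]:
    """Column-family layout: cells are grouped by the family prefix of the column name."""
    families: dict[str, dict[tuple[str, str], str]] = {}
    for row_key, columns in sorted(table.items()):
        for column, value in columns.items():
            family, qualifier = column.split(":", 1)
            families.setdefault(family, {})[(row_key, qualifier)] = value
    return families


if __name__ == "__main__":
    for family, cells in by_column_family(rows).items():
        print(family, cells)
```

Note that the missing city for user2 simply produces no cell in the "info" family, which hints at why this layout suits sparse data sets.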
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.

Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
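Why an index makes text search fast can be shown with a toy inverted index: each term points to the documents that contain it, so a query only touches the entries for its own terms. The sketch is plain Python and does not use or imitate the Lucene API; the sample documents and function names are made up.

```python
# Sketch: a toy inverted index, the core idea behind index-based text search.
# This is plain Python for illustration and does not use the Lucene API.

from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric terms."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        "d1": "Hadoop stores data in HDFS",
        "d2": "Lucene builds an index for text search",
        "d3": "Hive queries data stored in Hadoop",
    }
    index = build_index(docs)
    print(sorted(search(index, "hadoop data")))
```

A production library adds much more (field handling, scoring, on-disk index formats), so this only conveys the lookup principle.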
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
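Running actions only once their dependencies are met amounts to ordering a dependency graph. A minimal sketch using Python's standard-library topological sorter is shown below; the action names are hypothetical, and this is neither the Oozie workflow format nor its API.

```python
# Sketch: run workflow actions only after the actions they depend on.
# Action names are hypothetical; this is not the Oozie workflow format or API.

from graphlib import TopologicalSorter   # standard library, Python 3.9+


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every action comes after all of its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    workflow = {
        "import_logs": set(),
        "clean_logs": {"import_logs"},
        "count_words": {"clean_logs"},
        "load_customers": set(),
        "join_report": {"count_words", "load_customers"},
    }
    print(execution_order(workflow))
```

Several valid orders can exist for the same graph (for instance, "load_customers" may run before or after the log steps), so a scheduler is free to run independent actions in parallel.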
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.

Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights component diagram, the same diagram as on the other "Example of Hadoop Ecosystem" slides. The next topic is BigSheets.]

BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers to work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team

BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console - e.g. file system data, output from Web crawl, etc.

BigSheets Collection Operations
Work with built-in "sheets" editor
- Add / delete columns
- Filter data
- Specify formulas to compute new values using spreadsheet-style syntax
- Apply built-in or custom macro functions
- ...

BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details

Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights component diagram, the same diagram as on the other "Example of Hadoop Ecosystem" slides. The next topic is Big SQL.]

What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive SQL-92 ANSI support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC and ODBC compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow

Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites query if necessary for improved performance
- Determines appropriate storage handler for data
- Produces execution plan
- Executes and coordinates query
[Figure: an Application issues SQL through a JDBC / ODBC driver over a network protocol to the Big SQL Server, which contains the SQL Engine and Storage Handlers (Del files, SEQ files, HBase, RDBMS, ...). The BigInsights cluster shows head nodes (Big SQL Server, Name Node, Job Tracker, Hive Metastore) and compute nodes, each with a Task Tracker, Data Node and Region Server. Server layout and relative sizes for illustrative purposes only.]

Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights component diagram, the same diagram as on the other "Example of Hadoop Ecosystem" slides. The next topic is Text Analytics.]

What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
- Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development

Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, produces an AOG plan to work on the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance

    create view AmountWithUnit as
      extract pattern <N.match> <U.match>
        as match
      from Number N, Unit U;
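For readers who do not have the AQL tooling at hand, the intent of the AmountWithUnit rule (a number immediately followed by a unit) can be approximated with a regular expression. This is a rough Python analogue, not AQL and not how the Text Analytics engine evaluates rules; the unit dictionary and the number pattern are assumptions made up for the example.

```python
# Sketch: a Python analogue of the AmountWithUnit rule (a number followed by a unit).
# The unit list is a made-up example dictionary; AQL itself is not used here.

import re

UNITS = ["percent", "dollars", "euros", "km", "kg", "GB", "MB", "TB"]

# A number (optionally with thousands separators or decimals), whitespace, then a unit.
NUMBER = r"\d+(?:[.,]\d+)*"
UNIT = "|".join(re.escape(u) for u in UNITS)
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s+({UNIT})\b")


def extract_amounts(text: str) -> list[tuple[str, str]]:
    """Return (number, unit) pairs in the order they appear in the text."""
    return AMOUNT_WITH_UNIT.findall(text)


if __name__ == "__main__":
    print(extract_amounts("The cluster read 1 TB in 12.5 minutes and cost 3,500 dollars."))
```

In the sample sentence, "12.5 minutes" is skipped because "minutes" is not in the example unit list; extending the dictionary changes what is extracted, which is the kind of customization the slide attributes to rule-based extraction.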

Text Analytics Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations
Position | Player | Country
Striker | Arjen Robben | Netherlands
Keeper | Iker Casillas | Spain
Winger | Andres Iniesta | Spain

Text Analytics Real Example

One step beyond Watson
copy 2014 IBM Corporation7
Paradigm shifts enabled by big data ILeverage more of the data being captured
Bank X
copy 2014 IBM Corporation8
Paradigm shifts enabled by big data IIReduce effort required to leverage data
copy 2014 IBM Corporation9
Paradigm shifts enabled by big data IIReduce effort required to leverage data
copy 2014 IBM Corporation10
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
copy 2014 IBM Corporation11
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
Hypothesis based correlation Weird correlation
copy 2014 IBM Corporation12
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
copy 2014 IBM Corporation13
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation14
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation15
Complementary Analytics
Traditional ApproachStructured analytical logical
New ApproachCreative holistic thought intuition
Multimedia
Data Warehouse
Web Logs
Social Data
Sensor data
images
RFID
Internal AppData
TransactionData
MainframeData
OLTP SystemData
Traditional databases
ERP Data
StructuredRepeatable
Linear
UnstructuredExploratory
Dynamic
Text Data
emails
Hadoop andStreams
NewSources
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
17
Untapped stores of internal data
Size and scope of some internal data such as
detailed transactions and operational log data
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected but not
analyzed for years
Focus on customer insights
Customers ndash influenced by digital experiences
ndash often expect information provided to an
organization will then be ldquoknownrdquo during future
interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions Emails Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization
copy 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100 due to rounding
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
copy 2014 IBM Corporation20
Big Data ExplorationFind visualize understand all big data to improve decision making
Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources
Operations AnalysisAnalyze a variety of machinedata for improved business results
Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency
SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time
Key Big Data Use Cases
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Rest Data
ndash Data to analyze are already stored (structured and unstructured)
ndash Examples logs facebook twitter etc
ndash Solution Hadoop (open source)
Data in motion
ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage
ndash Examples Sensors RFID etc
ndash Solution Streams CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU Speedsndash 1990 - 44 MIPS at 40 MHz
ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz
RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)
Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB
Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec
copy 2014 IBM Corporation24
How long it will take to read 1TB of data
1TB (at 80Mb sec)ndash 1 disk - 34 hours
ndash 10 disks - 20 min
ndash 100 disks - 2 min
ndash 1000 disks - 12 sec
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)

What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
- Well suited to batch-oriented, read-intensive applications
- Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written

Design principles of Hadoop
New way of storing and processing the data
- Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of the operating system
  • Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication

What is the Hadoop Distributed File System?
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well, for resiliency
[Figure: a logical file is split into blocks 1 to 4; in the cluster, each block appears three times, on different nodes]
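The block distribution and replication described above can be sketched in plain Java. This is a toy model, not the real HDFS placement policy: the round-robin rule, the file and block sizes, and the names `BlockPlacement`, `blockCount`, and `place` are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of block placement; not the real HDFS placement policy. */
public class BlockPlacement {
    /** Number of fixed-size blocks needed for a file (the last block may be partial). */
    public static int blockCount(long fileBytes, long blockBytes) {
        return (int) ((fileBytes + blockBytes - 1) / blockBytes);
    }

    /** Assigns each block to `replicas` distinct nodes, round-robin. Requires replicas <= nodes. */
    public static List<List<Integer>> place(int blocks, int nodes, int replicas) {
        if (replicas > nodes) {
            throw new IllegalArgumentException("replicas must not exceed nodes");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((b + r) % nodes); // consecutive nodes, so copies never share a node
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Assumed sizes: a 400 MB file cut into 128 MB blocks gives 4 blocks.
        int blocks = blockCount(400L << 20, 128L << 20);
        List<List<Integer>> p = place(blocks, 5, 3);
        for (int b = 0; b < p.size(); b++) {
            System.out.println("block " + (b + 1) + " -> nodes " + p.get(b));
        }
    }
}
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held, which is the resiliency the slide refers to.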

Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
A MapReduce application runs in three phases
1. Map phase (break the job into small parts): map tasks are distributed to the Hadoop data nodes in the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
A single result set is returned to the application
Word count mapper and reducer (nested classes of the job class; imports from java.io, java.util, org.apache.hadoop.io and org.apache.hadoop.mapreduce omitted):
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(val.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : val) sum += v.get();
            result.set(sum);
            context.write(key, result);     // emit (word, total)
        }
    }

MapReduce Example
Count the number of occurrences of each word
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
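The three steps of this example can be imitated in a single Java process, with no Hadoop involved. The sketch below is only an illustration of the data flow; the class `WordCountSketch` and its method names are hypothetical, and a real job would spread the map and reduce calls over cluster nodes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process imitation of map, shuffle, and reduce; no Hadoop involved. */
public class WordCountSketch {
    /** Map: emit a (word, 1) pair for every token in one input split. */
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    /** Shuffle: group all emitted values by key, with keys kept in sorted order. */
    public static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce: sum the values collected for each key. */
    public static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    /** Runs one map call per split, then shuffle, then reduce. */
    public static Map<String, Integer> count(String... splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String s : splits) {
            mapOutputs.add(map(s));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // The two input splits from the slide.
        System.out.println(count("Hello World Bye World", "Hello IBM"));
    }
}
```

For the two input lines of the slide, the result maps Bye to 1, Hello to 2, IBM to 1, and World to 2, the same totals as the final reduce output above.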

How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql

Pig, Hive, Jaql - Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql - Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional

Hadoop Distributions

Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights platform diagram with an IBM / Open Source legend. Labels shown: Visualization & Discovery, Integration, Workload Optimization, Runtime, Advanced Analytic Engines, File System, Data Store, Applications & Development, Administration, Streams, Netezza, Flume, DB2, DataStage, MapReduce, HDFS, HBase, Text Processing Engine & Extractor Library, BigSheets, JDBC, Text Analytics, Pig & Jaql, Hive, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Jaql, Pig, ZooKeeper, Lucene, Oozie, Adaptive MapReduce, Integrated Installer, Admin Console, Sqoop, Adaptive Algorithms, Dashboard & Visualization, Apps, Workflow Monitoring, Management, HCatalog, Security, Audit & History, Lineage, R, Guardium, Platform Computing, Cognos, GPFS-FPO, Big SQL, NameNode High Availability, Avro.]

Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
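The column-family idea in the HBase entry can be illustrated with nested maps in plain Java. This is a toy layout, not the HBase API: the class `ColumnFamilySketch`, its methods, and the sample rows are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy column-family layout: family -> row key -> qualifier -> value. Not the HBase API. */
public class ColumnFamilySketch {
    private final Map<String, Map<String, Map<String, String>>> families = new TreeMap<>();

    /** Stores one cell under its column family, so cells of a family sit together. */
    public void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    /** Reads one cell, or null when the row has no such column (sparse rows store nothing). */
    public String get(String rowKey, String family, String qualifier) {
        Map<String, Map<String, String>> rows = families.get(family);
        if (rows == null) {
            return null;
        }
        Map<String, String> cells = rows.get(rowKey);
        return cells == null ? null : cells.get(qualifier);
    }

    /** All rows of one family; other families are not touched by this scan. */
    public Map<String, Map<String, String>> scanFamily(String family) {
        return families.getOrDefault(family, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("user1", "profile", "name", "Ana");
        table.put("user1", "activity", "lastLogin", "2014-03-01");
        table.put("user2", "profile", "name", "Luis"); // user2 has no activity cells at all
        System.out.println(table.scanFamily("profile"));
    }
}
```

A row-oriented store would keep all of user1's columns side by side; here a scan of the "profile" family never reads "activity" data, and missing cells cost no space, which is why this layout suits sparse data sets.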

Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.

Example of Hadoop Ecosystem
[Figure: the same InfoSphere BigInsights ecosystem diagram, this time calling out BigSheets under Visualization & Discovery]

BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team

BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.

BigSheets Collection Operations
Work with the built-in "sheets" editor
Add/delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
...

BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details

Example of Hadoop Ecosystem
[Figure: the same InfoSphere BigInsights ecosystem diagram, this time calling out Big SQL]

What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive SQL-92 ANSI support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow

Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces the execution plan
- Executes and coordinates the query
[Figure: an application issues SQL through the JDBC/ODBC driver and a network protocol to the Big SQL Server, which runs on a head node of the BigInsights cluster and contains the SQL engine and the storage handlers (Del files, SEQ files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker, and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]

Example of Hadoop Ecosystem
[Figure: the same InfoSphere BigInsights ecosystem diagram, this time calling out Text Analytics]

What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
- Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM opens up the Text Analytics engine technology for customization and development

Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and they are difficult to optimize for performance
    create view AmountWithUnit as
      extract pattern <N.match> <U.match>
        as match
      from Number N, Unit U;
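For readers who do not know AQL, the rule above says: find a Number match directly followed by a Unit match and return the combined span. A rough Java analogue using regular expressions is sketched below. It is not how the Text Analytics engine works internally; the `NUMBER` and `UNIT` patterns are hypothetical stand-ins for the Number and Unit views, whose definitions the slide does not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Rough Java analogue of the AmountWithUnit view: a number directly followed by a unit. */
public class AmountWithUnit {
    // Hypothetical stand-ins for the Number and Unit views referenced by the AQL rule.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:GB|MB|TB|kg|km|ms)";
    private static final Pattern AMOUNT =
            Pattern.compile("\\b(" + NUMBER + ")\\s*(" + UNIT + ")\\b");

    /** Returns every "number unit" span found in the text, in order of appearance. */
    public static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            matches.add(m.group(1) + " " + m.group(2));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("The job read 1 TB from 100 disks at 80 MB per second."));
    }
}
```

In the sample sentence, "1 TB" and "80 MB" are extracted, while "100 disks" is skipped because "disks" is not in the unit list. The declarative AQL version leaves such matching details to the engine, which is the point the slide makes about abstraction.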

Text Analytics Simple Example
World Cup 2010 Highlights
"Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when winger Andres Iniesta scored for Spain for the win."
Extracted annotations
- Striker, Arjen Robben, Netherlands
- Keeper, Iker Casillas, Spain
- Winger, Andres Iniesta, Spain
Text Analytic Real Example
One step beyond Watson

Example of Hadoop Ecosystem
[Figure: the same InfoSphere BigInsights ecosystem diagram, this time calling out R]

Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  - Partitioning of large data
  - Parallel cluster execution of R code
• Distributed machine learning
  - A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Figure: R clients with the IBM R packages either pull data (summaries) from the data sources to the R client, or push R functions right onto the data through embedded R execution; scalable machine learning runs next to the data. The figure numbers its three callouts 1 to 3.]

Where Does Big Data Fit?
[Figure: source systems, an analytical database (DW), and analytical tools, connected by numbered steps. Legible labels: "1. Extract, transform, load", "5. Explore data", "6. Parse, aggregate", "9. Report and mine data", plus the captions "Capture only what's needed" and "Capture in case it's needed".]
Big Data Concepts
Big Data Technology
Data Scientists

Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking

Learn Big Data
Reading Materials - Online
- Understanding Big Data - Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/im/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights

Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises - zero setup
Robust course management system and content distribution infrastructure
Operations AnalysisAnalyze a variety of machinedata for improved business results
Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency
SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time
Key Big Data Use Cases
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Rest Data
ndash Data to analyze are already stored (structured and unstructured)
ndash Examples logs facebook twitter etc
ndash Solution Hadoop (open source)
Data in motion
ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage
ndash Examples Sensors RFID etc
ndash Solution Streams CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU Speedsndash 1990 - 44 MIPS at 40 MHz
ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz
RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)
Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB
Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec
copy 2014 IBM Corporation24
How long it will take to read 1TB of data
1TB (at 80Mb sec)ndash 1 disk - 34 hours
ndash 10 disks - 20 min
ndash 100 disks - 2 min
ndash 1000 disks - 12 sec
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)
copy 2014 IBM Corporation26
What is Apache Hadoop
Apache Open source software framework
Flexible enterprise-class support for processing large volumes of
data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)
ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search
technology
ndash Well-suited to batch-oriented read-intensive applications
ndash Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo
ndash Nodes can be combined into clusters
ndash New nodes can be added as needed without changing
bull Data formats
bull How data is loaded
bull How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene and forms the basis of its rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie lets users define actions and the dependencies between them, and then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start at a given time or on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL databases) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
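The Lucene entry above says that an index built from the documents is what makes text search fast. A minimal sketch of the general inverted-index idea follows; it is not Lucene's API or on-disk format, and the sample documents and function names are assumptions made for the example.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = {}
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


if __name__ == "__main__":
    docs = {
        1: "Hadoop stores big data in HDFS",
        2: "Lucene builds a text index",
        3: "Big data needs a search index",
    }
    index = build_index(docs)
    print(search(index, "big data"))
```

A query only looks up the terms it mentions instead of scanning every document, which is the property the slide refers to as rapid text search.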
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights stack, mixing IBM and open source components. Layers: Visualization & Discovery, Dashboard & Visualization, Applications & Development, Administration, Advanced Analytic Engines, Runtime, Data Store, File System, Security, Integration, Workload Optimization. Components named: BigSheets, Big SQL, Text Analytics, Pig, Hive, Jaql, MapReduce, Adaptive MapReduce, HBase, HDFS, GPFS-FPO, HCatalog, Avro, Lucene, Oozie, ZooKeeper, Sqoop, Flume, R, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos.]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over big data in Hadoop
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team.
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights web console, e.g. file system data, output from a web crawl, etc.
BigSheets Collection Operations
Work with the built-in "sheets" editor
- Add or delete columns
- Filter data
- Specify formulas to compute new values using spreadsheet-style syntax
- Apply built-in or custom macro functions
- ...
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
- Pie charts, bar charts, tag clouds, maps, etc.
- Hover over sections to reveal details
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights stack shown again; the slides that follow cover Big SQL.]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive ANSI SQL-92 support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
The SQL engine analyzes incoming queries
- Separates the portion(s) to execute at the server from the portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces the execution plan
- Executes and coordinates the query
[Diagram: an application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL Server (SQL engine plus storage handlers for delimited files, sequence files, HBase, RDBMS, ...) on a head node of the BigInsights cluster. Other head nodes host the Name Node, the Job Tracker, and the Hive Metastore; each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights stack shown again; the slides that follow cover Text Analytics.]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
- Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
- Multilingual support, including support for DBCS (double-byte character set) languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
- SQL-like language
- Fully declarative text analytics language
- Once compiled, a set of rules produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
- Tooling makes customization easy because you are abstracted from the programmatic details
- Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
Example rule:
    create view AmountWithUnit as
      extract pattern <N.match> <U.match>
        as match
      from Number N, Unit U;
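To make the AmountWithUnit rule more concrete, here is a rough Python emulation of what it expresses: a match from a Number view immediately followed by a match from a Unit view. This is not AQL and not how the Text Analytics engine works internally; the number pattern and the small unit dictionary are assumptions made for the example.

```python
import re
from typing import List, Tuple

# Hypothetical stand-ins for the Number and Unit views referenced by the rule.
NUMBER_PATTERN = r"\d+(?:\.\d+)?"
UNIT_WORDS = ("percent", "dollars", "euros", "GB", "TB")  # illustrative dictionary
UNIT_PATTERN = "|".join(re.escape(unit) for unit in UNIT_WORDS)

# A number followed by whitespace and a unit, mirroring <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER_PATTERN})\s+({UNIT_PATTERN})\b")


def amount_with_unit(text: str) -> List[Tuple[str, str, str]]:
    """Return (full match, number, unit) for every amount found in the text."""
    return [(m.group(0), m.group(1), m.group(2)) for m in AMOUNT_WITH_UNIT.finditer(text)]


if __name__ == "__main__":
    sample = "The cluster stores 12 TB and costs 3.5 dollars per hour"
    for match, number, unit in amount_with_unit(sample):
        print(match, "->", number, unit)
```

The declarative AQL version leaves the matching strategy to the engine, whereas this sketch hard-codes a single regular expression.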
Text Analytics Simple Example
Source text: World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself from the rest, winning the final. Early in the second half, the Netherlands' striker, Arjen Robben, had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations:
  Position   Player           Country
  Striker    Arjen Robben     Netherlands
  Keeper     Iker Casillas    Spain
  Winger     Andres Iniesta   Spain
Text Analytics Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights stack shown again; the slide that follows covers R.]
Big R
- Explore, visualize, transform, and model big data using familiar R syntax and paradigm
- Scale out R with MapReduce (MR) programming
  - Partitioning of large data
  - Parallel cluster execution of R code
- Distributed machine learning
  - A scalable statistics engine that provides canned algorithms and the ability to author new ones, all via R
[Diagram: R clients with IBM R packages work against data sources, embedded R execution, and scalable machine learning on the cluster; clients either pull data (summaries) to the R client or push R functions right onto the data.]
Where Does Big Data Fit?
[Diagram: source systems, an analytical database (DW), and analytical tools. Step labels: 1. Extract, transform, load ("Capture only what's needed"); 5. Explore data; 6. Parse, aggregate ("Capture in case it's needed"); 9. Report and mine data.]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist: the new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take for a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading materials (online)
- Understanding Big Data (free PDF book)
  - http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
  - www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x (Redbook)
  - http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  - www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  - www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  - www-01.ibm.com/software/data/infosphere/stream-computing
- developerWorks forums, demos
  - http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
- Flexible online delivery allows learning at your place and at your pace
- Free courses, free study materials
- Cloud-based sandbox for exercises: zero setup
- Robust course management system and content distribution infrastructure
Paradigm shifts enabled by big data III: Data leads the way, and sometimes correlations are good enough
[Figure: hypothesis-based correlation vs. weird correlation]
Paradigm shifts enabled by big data IV: Leverage data as it is captured
Complementary Analytics
Traditional approach: structured, analytical, logical
- Characteristics: structured, repeatable, linear
- Platform: data warehouse, traditional databases
- Sources: internal app data, transaction data, mainframe data, OLTP system data, ERP data
New approach: creative, holistic thought, intuition
- Characteristics: unstructured, exploratory, dynamic
- Platform: Hadoop and Streams
- New sources: multimedia, web logs, social data, sensor data, images, RFID, text data, emails
Types of Analytic Tools
Organisations are prioritising internal data sources
Untapped stores of internal data
- The size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
- New infrastructure components make them accessible for analysis
- Some data has been collected but not analyzed for years
Focus on customer insights
- Customers, influenced by digital experiences, often expect that information provided to an organization will then be "known" during future interactions
- Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences (transactions, emails, call center interaction records)
[Chart: Big data sources. Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.]
Stages of Big Data adoption
[Chart: Big data adoption. When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.]
Hadoop workloads (% of respondents)
Workload                 Today   In 18 months
Staging area              92         92
Online archive            92         92
Transformation engine     83         92
Ad hoc queries            58         67
Scheduled reports         42         67
Visual exploration        25         67
Data mining               58         83
Based on respondents that have implemented Hadoop. Source: BI Leadership Forum, April 2012
Key Big Data Use Cases
- Big Data Exploration: Find, visualize, and understand all big data to improve decision making
- Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
- Operations Analysis: Analyze a variety of machine data for improved business results
- Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
- Security/Intelligence Extension: Lower risk, detect fraud, and monitor cyber security in real time
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
- The data to analyze are already stored (structured and unstructured)
- Examples: logs, Facebook, Twitter, etc.
- Solution: Hadoop (open source)
Data in motion
- Data are analyzed in real time, at the moment they are generated, without any prior storage
- Examples: sensors, RFID, etc.
- Solution: Streams, CEP (complex event processing) solutions
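The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the processing style, not the API of InfoSphere Streams or of any CEP product; the sensor readings, the threshold, and the window size are invented for the example.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def batch_average(stored_readings: List[float]) -> float:
    """Data at rest: everything is already stored, so analyze it in one pass."""
    return sum(stored_readings) / len(stored_readings)


def streaming_alerts(readings: Iterable[float], threshold: float,
                     window: int = 3) -> Iterator[Tuple[int, float]]:
    """Data in motion: inspect each reading as it arrives, keep only the last
    `window` values in memory, and yield (position, moving average) whenever
    the moving average exceeds `threshold`."""
    recent = deque(maxlen=window)
    for position, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            average = sum(recent) / window
            if average > threshold:
                yield position, average


if __name__ == "__main__":
    sensor = [20, 21, 30, 35, 40]  # hypothetical temperature readings
    print("batch average:", batch_average(sensor))
    for position, average in streaming_alerts(iter(sensor), threshold=28):
        print(f"alert at reading {position}: moving average {average:.1f}")
```

The batch function needs the full data set before it can answer, while the streaming generator reacts reading by reading and never holds more than one window of data.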
Hardware improvements through the years
CPU speeds
- 1990: 44 MIPS at 40 MHz
- 2000: 3,561 MIPS at 1.2 GHz
- 2010: 147,600 MIPS at 3.3 GHz
RAM
- 1990: 640 KB conventional memory (256 KB extended memory recommended)
- 2000: 64 MB
- 2010: 8-32 GB (and more)
Disk capacity
- 1990: 20 MB
- 2000: 1 GB
- 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years; currently around 70-80 MB/s
How long will it take to read 1 TB of data?
1 TB (at 80 MB/s)
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 100 disks: 2 min
- 1,000 disks: 12 sec
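The figures above follow from simple arithmetic, which the short Python sketch below reproduces. It assumes 1 TB = 10^12 bytes, a sustained 80 MB/s per disk, and data spread evenly so that all disks read in parallel; the function names are made up for the example. Exact values come out slightly higher than the slide's figures (roughly 3.5 hours for one disk), which appear to be rounded down.

```python
TERABYTE = 10**12              # bytes (assumption: decimal terabyte)
DISK_THROUGHPUT = 80 * 10**6   # bytes per second per disk (80 MB/s)


def read_time_seconds(total_bytes: int, disks: int) -> float:
    """Ideal scan time when `disks` disks each read an equal share in parallel."""
    return total_bytes / (DISK_THROUGHPUT * disks)


def humanize(seconds: float) -> str:
    """Format a duration using the largest sensible unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.0f} min"
    return f"{seconds:.1f} sec"


if __name__ == "__main__":
    for disks in (1, 10, 100, 1000):
        print(f"{disks:>4} disks: {humanize(read_time_seconds(TERABYTE, disks))}")
```

The model ignores seek time, network transfer, and coordination overhead, so real clusters will be slower; the point is only that scan time shrinks in proportion to the number of disks.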
Parallel data processing is the answer
It has been with us for a while
- Grid computing: spreads the processing load
- Distributed workload: applications are hard to manage, with overhead on the developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  - Originally built to address scalability problems of Nutch, an open source web search technology
- Well suited to batch-oriented, read-intensive applications
- Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  - Data formats
  - How data is loaded
  - How jobs are written
Design principles of Hadoop
New way of storing and processing the data
- Let the system handle most of the issues automatically
  - Failures
  - Scalability
  - Reduce communications
  - Distribute data and processing power to where the data is
  - Make parallelism part of the operating system
  - Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well, for resiliency
[Diagram: a logical file is split into blocks 1 to 4; each block appears three times across the nodes of the cluster.]
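The block distribution described on this slide can be sketched with a toy model in Python. It is not the real HDFS placement policy: the tiny block size, the replication factor of 3, the round-robin placement, and the node names are all assumptions chosen to mirror the picture of four blocks stored three times each.

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a logical file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_blocks(placement: Dict[int, List[str]], failed_node: str) -> Set[int]:
    """Blocks that still have at least one replica after `failed_node` is lost."""
    return {
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    }


if __name__ == "__main__":
    blocks = split_into_blocks(b"0123456789abcdef", block_size=4)
    placement = place_blocks(len(blocks), ["n1", "n2", "n3", "n4", "n5"], replication=3)
    for block_id, holders in placement.items():
        print(f"block {block_id + 1}: {holders}")
    print("readable after losing n1:", sorted(b + 1 for b in surviving_blocks(placement, "n1")))
```

Because every block lives on several nodes, losing one node leaves the whole file readable, which is the resiliency the slide attributes to replication.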
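Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
A MapReduce application runs in three phases:
1. Map phase (break the job into small parts); map tasks are distributed to the Hadoop data nodes in the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
The application returns a single result set.
Code excerpt (word count mapper and reducer, nested in a job class; the slide cuts the reducer short, so its last two statements are filled in from the standard word count example):
    // Needs java.io.IOException, java.util.StringTokenizer,
    // org.apache.hadoop.io.* and org.apache.hadoop.mapreduce.*
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(Object key, Text val, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);   // emit <word, 1>
        }
      }
    }
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();
      public void reduce(Text key, Iterable<IntWritable> val, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : val) {
          sum += v.get();
        }
        result.set(sum);
        context.write(key, result);   // emit <word, total>
      }
    }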
MapReduce Example: count the number of occurrences of each word
Input data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1>
  <IBM, 1>
  <Hello, 2>
  <World, 2>
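The word count walk-through above can be simulated on a single machine in a few lines of Python. This is a plain in-memory sketch of the three phases, not Hadoop code; the function names are chosen for the example, and the two input lines are the ones from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a <word, 1> pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the word)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: sum the values of each key into a single result set."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(splits: List[str]) -> Dict[str, int]:
    # Each split would be handled by a separate map task on a real cluster.
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(["Hello World Bye World", "Hello IBM"]))
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves the intermediate pairs across the network to the reducers.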
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql
Pig, Hive, Jaql: Similarities
- Reduced program size compared with Java
- Applications are translated to map and reduce jobs behind the scenes
- Extension points for extending existing functionality
- Interoperability with other languages
- Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql: Differences
  Characteristic              Pig          Hive                                Jaql
  Developed by                Yahoo        Facebook                            IBM
  Language                    Pig Latin    HiveQL                              Jaql
  Type of language            Data flow    Declarative (SQL dialect)           Data flow
  Data structures supported   Complex      Better suited for structured data   JSON, semi-structured
  Schema                      Optional     Not optional                        Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights stack, mixing IBM and open source components. Layers: Visualization & Discovery, Dashboard & Visualization, Applications & Development, Administration, Advanced Analytic Engines, Runtime, Data Store, File System, Security, Integration, Workload Optimization. Components named: BigSheets, Big SQL, Text Analytics, Pig, Hive, Jaql, MapReduce, Adaptive MapReduce, HBase, HDFS, GPFS-FPO, HCatalog, Avro, Lucene, Oozie, ZooKeeper, Sqoop, Flume, R, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos.]
ndash A scalable statistics engine that
provides canned algorithms and
an ability to author new ones all
via R
R Clients
Scalable
Machine
Learning
Data Sources
Embedded R
Execution
IBM R Packages
IBM R Packages
Pull data
(summaries) to
R client
Or push R
functions
right on the
data
1
2
3
copy 2014 IBM Corporation54
Where Does BigData Fit
Analytical database
(DW)
Source Systems
Analytical tools
5 Explore data
6 Parse aggregate
ldquoCapture in case itrsquos neededrdquo
1 Extract transform load
ldquoCapture only whatrsquos neededrdquo
9 Report and mine data
copy 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation56
Data scientist ndash The new cool guy in town
Article in Fortune ldquoThe unemployment rate in
the US continues to be abysmal (91 in
July) but the tech world has spawned a
new kind of highly skilled nerdy-cool job
that companies are scrambling to fill data
scientistrdquo
McKinsey Global Institute ldquoBig data Reportrdquo
By 2018 the United States alone could
face a shortage of 140000 to 190000
people with deep analytical skills as well as
15 million managers and analysts with the
know-how to use the analysis of big data to
make effective decisions
copy 2014 IBM Corporation57
Data Science is Multidisciplinary
copy 2014 IBM Corporation58
Successful Data Scientist Characteristics
copy 2014 IBM Corporation59
Data Scientist Qualities
copy 2014 IBM Corporation60
How Long Does It Take For a Beginner to Become
a Good Data Scientist
copy 2014 IBM Corporation61
wwwkagglecom
copy 2014 IBM Corporation62
Kaggle ranking
copy 2014 IBM Corporation63 copy 2013 IBM Corporation63
Learn Big Data
Reading Materials - Online
ndash Understanding Big Data ndash Free PDF Book
bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF
ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights
bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml
ndash Implementing IBM InfoSphere BigInsights on System x - Redbook
bull httpwwwredbooksibmcomredpiecesabstractssg248077html
Resources
ndash Big Data Information Center
bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml
ndash InfoSphere BigInsights
bull www-01ibmcomsoftwaredatainfospherebiginsights
ndash Stream Computing
bull www-01ibmcomsoftwaredatainfospherestream-computing
ndash DeveloperWorks forums demos
bull httpwwwibmcomdeveloperworkswikibiginsights
copy 2014 IBM Corporation64
Learn Big Data Technologies
BigDataUniversitycom
Flexible on-line delivery
allows learning your place
and your pace
Free courses free study
materials
Cloud-based sandbox for
exercises ndash zero setup
Robust Course
Management System and
Content Distribution
infrastructure
copy 2014 IBM Corporation65
65
copy 2014 IBM Corporation10
Paradigm shifts enabled by big data III: Data leads the way – and sometimes correlations are good enough
copy 2014 IBM Corporation11
Paradigm shifts enabled by big data III: Data leads the way – and sometimes correlations are good enough
Hypothesis-based correlation
Weird correlation
copy 2014 IBM Corporation12
Paradigm shifts enabled by big data III: Data leads the way – and sometimes correlations are good enough
copy 2014 IBM Corporation13
Paradigm shifts enabled by big data IV: Leverage data as it is captured
copy 2014 IBM Corporation14
Paradigm shifts enabled by big data IV: Leverage data as it is captured
copy 2014 IBM Corporation15
Complementary Analytics
Traditional approach: structured, analytical, logical
– Data Warehouse, traditional databases
– Sources: internal app data, transaction data, mainframe data, OLTP system data, ERP data
– Structured, repeatable, linear
New approach: creative, holistic thought, intuition
– Hadoop and Streams
– New sources: multimedia, web logs, social data, sensor data, images, RFID, text data, emails
– Unstructured, exploratory, dynamic
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
Untapped stores of internal data
– Size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
– New infrastructure components make them accessible for analysis
– Some data has been collected but not analyzed for years
Focus on customer insights
– Customers – influenced by digital experiences – often expect that information provided to an organization will then be “known” during future interactions
– Combining disparate internal sources (transactions, emails, call center interaction records) with advanced analytics creates insights into customer behavior and preferences
[Chart: Big data sources. Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.]
copy 2014 IBM Corporation18
Stages of Big Data adoption
[Chart: Big data adoption. When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.]
copy 2014 IBM Corporation19
Hadoop workloads
Workload                 Today    In 18 months
Staging area             92%      92%
Online archive           92%      92%
Transformation engine    83%      92%
Ad hoc queries           58%      67%
Scheduled reports        42%      67%
Visual exploration       25%      67%
Data mining              58%      83%
Based on respondents that have implemented Hadoop. Source: BI Leadership Forum, April 2012
copy 2014 IBM Corporation20
Key Big Data Use Cases
– Big Data Exploration: find, visualize and understand all big data to improve decision making
– Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
– Operations Analysis: analyze a variety of machine data for improved business results
– Data Warehouse Modernization: integrate big data and data warehouse capabilities to increase operational efficiency
– Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Data at rest
– Data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data are analyzed in real time, at the moment they are generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions
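To make the “data in motion” idea concrete, here is a minimal sketch in plain Java of analyzing readings as they arrive while keeping only a small window in memory. It is not the Streams or any CEP product API; the class name, the window size and the sensor values are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SlidingAverageSketch {
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum = 0.0;

    SlidingAverageSketch(int windowSize) {
        if (windowSize <= 0) throw new IllegalArgumentException("windowSize must be positive");
        this.windowSize = windowSize;
    }

    /** Consumes one reading as it arrives and returns the average of the most recent readings. */
    double onReading(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // forget the oldest reading; nothing is persisted
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingAverageSketch avg = new SlidingAverageSketch(3);
        for (double reading : new double[] {20.0, 22.0, 24.0, 30.0}) { // hypothetical sensor values
            System.out.println(avg.onReading(reading));
        }
    }
}
```

Each result is available the moment a reading arrives, and memory use stays bounded by the window size no matter how long the stream runs.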
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM memory
– 1990: 640K conventional memory (256K extended memory recommended)
– 2000: 64 MB memory
– 2010: 8-32 GB (and more)
Disk capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec
copy 2014 IBM Corporation24
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1,000 disks: 12 sec
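The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 TB = 1,000,000 MB), a sustained 80 MB/sec per disk, and data spread evenly over disks that are read in parallel; the class and method names are illustrative.

```java
import java.util.Locale;

class ScanTime {
    // Assumed figures: 1 TB to read, 80 MB/s sustained per disk (decimal units).
    static final double TOTAL_MB = 1_000_000.0;
    static final double MB_PER_SEC_PER_DISK = 80.0;

    /** Seconds to scan the data when it is spread evenly over `disks` drives read in parallel. */
    static double secondsToRead(int disks) {
        return TOTAL_MB / (MB_PER_SEC_PER_DISK * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double s = secondsToRead(disks);
            System.out.printf(Locale.ROOT, "%4d disk(s): %.1f s (%.2f h)%n", disks, s, s / 3600.0);
        }
    }
}
```

One disk needs 12,500 seconds (about three and a half hours), while 1,000 disks need 12.5 seconds. The speed-up is linear only because every disk is assumed to hold an equal share of the data.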
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It has been with us for a while
– GRID computing: spreads the processing load
– Distributed workload: hard-to-manage applications, overhead on the developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
copy 2014 IBM Corporation26
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop
New way of storing and processing the data
– Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of the operating system
  • Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
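The idea of splitting a file into blocks and keeping several copies of each block on different nodes can be sketched as follows. This is a deliberately simplified round-robin placement, not the actual HDFS placement policy; the block size, node count and replica count are assumed values.

```java
import java.util.ArrayList;
import java.util.List;

class BlockPlacementSketch {
    /**
     * Splits a file of `fileSize` bytes into fixed-size blocks and assigns each block
     * to `replicas` distinct nodes, rotating the starting node per block.
     * Returns, for each block, the list of node indexes holding a copy.
     */
    static List<List<Integer>> place(long fileSize, long blockSize, int nodes, int replicas) {
        if (replicas > nodes) throw new IllegalArgumentException("need at least as many nodes as replicas");
        int blocks = (int) ((fileSize + blockSize - 1) / blockSize); // ceiling division
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((b + r) % nodes); // consecutive nodes, so copies of one block never share a node
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: a 400 MB file, 128 MB blocks, 5 nodes, 3 copies of each block.
        List<List<Integer>> p = place(400L << 20, 128L << 20, 5, 3);
        for (int b = 0; b < p.size(); b++) {
            System.out.println("block " + (b + 1) + " -> nodes " + p.get(b));
        }
    }
}
```

With three copies per block, losing any single node still leaves two copies of every block it held, which is the resiliency the last bullet refers to.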
[Diagram: a logical file is split into blocks 1-4; the blocks are distributed across the cluster, and each block is stored on three different nodes.]
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
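MapReduce application
1. Map phase (break the job into small parts); map tasks are distributed to the Hadoop data nodes in the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
Return a single result set

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { // nested in the job class
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } // emit <word, 1>
  }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) sum += v.get(); // add up the 1s emitted for this word
    result.set(sum); context.write(key, result); // emit <word, total>
  }
}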
copy 2011 IBM Corporation30
MapReduce Example
Count the number of occurrences of each word
Entry data
– Hello World Bye World
– Hello IBM
Map process
– Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
– Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
– <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
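The three phases of this word-count example can be traced without a cluster. The sketch below simulates map, shuffle and reduce in plain Java on the two input lines from the slide. It does not use the Hadoop API, and all class and method names are illustrative.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    /** Map phase: turn one input line into <word, 1> pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    /** Shuffle phase: group the values emitted by all mappers by key (sorted by key). */
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce phase: sum the grouped values for each word. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    /** Runs one simulated map task per input split, then shuffles and reduces. */
    static Map<String, Integer> run(String... splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) mapOutputs.add(map(split));
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        System.out.println(run("Hello World Bye World", "Hello IBM")); // {Bye=1, Hello=2, IBM=1, World=2}
    }
}
```

In a real job the map calls run on different nodes and the shuffle moves data over the network. Here everything runs in one process, so only the data flow is illustrated.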
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
It’s not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
copy 2014 IBM Corporation32
Pig, Hive, Jaql – Similarities
– Reduced program size over Java
– Applications are translated to map and reduce jobs behind the scenes
– Extension points for extending existing functionality
– Interoperability with other languages
– Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic               Pig          Hive                                Jaql
Developed by                 Yahoo        Facebook                            IBM
Language                     Pig Latin    HiveQL                              Jaql
Type of language             Data flow    Declarative (SQL dialect)           Data flow
Data structures supported    Complex      Better suited for structured data   JSON, semi-structured
Schema                       Optional     Not optional                        Optional
Hadoop Distributions
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights and the products that integrate with it. Legend: IBM / Open Source.]
Areas shown: Visualization & Discovery; Dashboard & Visualization; Applications & Development; Advanced Analytic Engines; Workload Optimization; Runtime; Data Store; File System; Administration; Integration; Management; Security; Audit & History; Lineage
Components shown: BigSheets, Big SQL, Text Analytics, Text Processing Engine & Extractor Library, Adaptive Algorithms, R, MapReduce, Adaptive MapReduce, Pig, Hive, Jaql, HBase, HDFS, GPFS-FPO, NameNode High Availability, Avro, Lucene, Index, ZooKeeper, Oozie, HCatalog, Flume, Sqoop, JDBC, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Integrated Installer, Admin Console, Apps, Workflow, Monitoring, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos
copy 2014 IBM Corporation36
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
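The HBase description above contrasts column families with row-oriented storage. A minimal in-memory sketch of that layout follows; it is not the HBase client API, and the table contents and names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class ColumnFamilySketch {
    // family -> row key -> column qualifier -> value; each family is kept in its own sorted map
    private final Map<String, Map<String, Map<String, String>>> families = new TreeMap<>();

    void put(String row, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(row, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    /** Reads one family for one row; other families are never touched, and absent cells take no space. */
    Map<String, String> get(String row, String family) {
        Map<String, Map<String, String>> rows = families.get(family);
        if (rows == null || !rows.containsKey(row)) return new TreeMap<>();
        return rows.get(row);
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        // Hypothetical rows and columns, purely for illustration.
        table.put("user1", "profile", "name", "Ana");
        table.put("user1", "profile", "city", "Barcelona");
        table.put("user1", "activity", "lastLogin", "2014-03-01");
        table.put("user2", "profile", "name", "Luis"); // sparse: user2 has no city and no activity
        System.out.println(table.get("user1", "profile"));
        System.out.println(table.get("user2", "activity"));
    }
}
```

Because each family lives in its own structure, a query that needs only the profile columns never reads activity data, and rows with missing attributes store nothing for them.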
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
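The Lucene entry above says the index is what makes text search fast. The sketch below shows the underlying idea, an inverted index from terms to document ids, in plain Java. It is not the Lucene API, and a real Lucene index stores much more (fields, positions, scoring data); the class name and sample documents are assumptions.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Index time: split the text into lowercase terms and record the document id under each term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Query time: a single map lookup instead of a scan over every document. */
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Hadoop stores big data");      // hypothetical documents
        index.add(2, "Lucene indexes text");
        index.add(3, "Big SQL queries Hadoop data");
        System.out.println(index.search("hadoop")); // [1, 3]
        System.out.println(index.search("pig"));    // []
    }
}
```

The cost of tokenizing is paid once when a document is added, so each later search is a lookup whose cost does not grow with the size of the collection.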
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights ecosystem, as on the earlier “Example of Hadoop Ecosystem” slide]
copy 2014 IBM Corporation39
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new “sheets” (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM’s Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.
copy 2014 IBM Corporation41
BigSheets Collection Operations
Work with the built-in “sheets” editor
– Add and delete columns
– Filter data
– Specify formulas to compute new values using spreadsheet-style syntax
– Apply built-in or custom macro functions
– …
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights ecosystem, as on the earlier “Example of Hadoop Ecosystem” slide]
copy 2014 IBM Corporation44
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
copy 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other’s tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces the execution plan
– Executes and coordinates the query
[Diagram: an application issues SQL through the JDBC/ODBC driver and a network protocol to the Big SQL Server on a head node of the BigInsights cluster. The Big SQL Server contains the SQL engine and the storage handlers (Del files, SEQ files, HBase, RDBMS, …). Other head nodes host the Name Node, the Job Tracker and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
copy 2014 IBM Corporation46
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights ecosystem, as on the earlier “Example of Hadoop Ecosystem” slide]
copy 2014 IBM Corporation47
What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
copy 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan to run over the data
No “black boxes” or modules that can’t be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance

create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
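For readers who do not know AQL, the rule above says: find a Number match immediately followed by a Unit match and return the combined span. The sketch below approximates that with a Java regular expression. It is not how the Text Analytics engine evaluates AQL, and because the slide does not define the Number and Unit views, the two sub-patterns here are assumed stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountWithUnitSketch {
    // Assumed stand-ins for the Number and Unit views, which the slide does not define.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:GB|MB|TB|km|kg)";
    // A Number match followed by whitespace and a Unit match, like <N.match> <U.match>.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b" + NUMBER + "\\s+" + UNIT + "\\b");

    /** Returns every "number unit" span found in the text, in order of appearance. */
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) matches.add(m.group());
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("The cluster holds 20 TB on disk and 64 GB of RAM per node."));
    }
}
```

The declarative AQL version has the advantage that Number and Unit are reusable views of their own, whereas in the regex both are baked into one pattern.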
copy 2014 IBM Corporation49
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands’ striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.
Extracted annotations
Position   Player           Country
Striker    Arjen Robben     Netherlands
Keeper     Iker Casillas    Spain
Winger     Andres Iniesta   Spain
copy 2014 IBM Corporation5050
Text Analytic Real Example
copy 2014 IBM Corporation5151
One step beyond Watson
copy 2014 IBM Corporation52
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights ecosystem, as on the earlier “Example of Hadoop Ecosystem” slide]
copy 2014 IBM Corporation53
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  – Partitioning of large data
  – Parallel cluster execution of R code
• Distributed machine learning
  – A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram: R clients, IBM R packages, data sources, embedded R execution and scalable machine learning. Callouts: pull data (summaries) to the R client, or push R functions right onto the data.]
copy 2014 IBM Corporation54
Where Does Big Data Fit?
[Diagram labels: Source Systems; Analytical database (DW); Analytical tools; “Capture in case it’s needed”; “Capture only what’s needed”. Numbered steps shown: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data.]
copy 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation56
Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist.”
McKinsey Global Institute, “Big data” report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
copy 2014 IBM Corporation57
Data Science is Multidisciplinary
copy 2014 IBM Corporation58
Successful Data Scientist Characteristics
copy 2014 IBM Corporation59
Data Scientist Qualities
copy 2014 IBM Corporation60
How Long Does It Take For a Beginner to Become a Good Data Scientist?
copy 2014 IBM Corporation61
www.kaggle.com
copy 2014 IBM Corporation62
Kaggle ranking
copy 2014 IBM Corporation63
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
copy 2014 IBM Corporation64
Learn Big Data Technologies
BigDataUniversity.com
– Flexible online delivery allows learning at your place and your pace
– Free courses, free study materials
– Cloud-based sandbox for exercises – zero setup
– Robust course management system and content distribution infrastructure
copy 2014 IBM Corporation65
copy 2014 IBM Corporation11
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
Hypothesis based correlation Weird correlation
copy 2014 IBM Corporation12
Paradigm shifts enabled by big data IIIData leads the way ndash and sometimes correlations are good enough
copy 2014 IBM Corporation13
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation14
Paradigm shifts enabled by big data IVLeverage data as it is captured
copy 2014 IBM Corporation15
Complementary Analytics
Traditional ApproachStructured analytical logical
New ApproachCreative holistic thought intuition
Multimedia
Data Warehouse
Web Logs
Social Data
Sensor data
images
RFID
Internal AppData
TransactionData
MainframeData
OLTP SystemData
Traditional databases
ERP Data
StructuredRepeatable
Linear
UnstructuredExploratory
Dynamic
Text Data
emails
Hadoop andStreams
NewSources
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
17
Untapped stores of internal data
Size and scope of some internal data such as
detailed transactions and operational log data
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected but not
analyzed for years
Focus on customer insights
Customers ndash influenced by digital experiences
ndash often expect information provided to an
organization will then be ldquoknownrdquo during future
interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions Emails Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization
copy 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100 due to rounding
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
copy 2014 IBM Corporation20
Key Big Data Use Cases
- Big Data Exploration: Find, visualize, and understand all big data to improve decision making
- Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
- Operations Analysis: Analyze a variety of machine data for improved business results
- Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
- Security/Intelligence Extension: Lower risk, detect fraud, and monitor cyber security in real time
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
- The data to analyze is already stored (structured and unstructured)
- Examples: logs, Facebook, Twitter, etc.
- Solution: Hadoop (open source)
Data in motion
- Data is analyzed in real time, at the moment it is generated. It is analyzed without any previous storage
- Examples: sensors, RFID, etc.
- Solution: Streams, CEP solutions
Hardware improvements through the years
CPU speeds
- 1990: 44 MIPS at 40 MHz
- 2000: 3,561 MIPS at 1.2 GHz
- 2010: 147,600 MIPS at 3.3 GHz
RAM memory
- 1990: 640K conventional memory (256K extended memory recommended)
- 2000: 64 MB memory
- 2010: 8-32 GB (and more)
Disk capacity
- 1990: 20 MB
- 2000: 1 GB
- 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 100 disks: 2 min
- 1000 disks: 12 sec
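The arithmetic behind these figures can be checked with a small sketch. It assumes decimal units (1 TB = 1,000,000 MB), a sustained 80 MB/sec per disk, and data spread evenly over disks that are read fully in parallel; the class and method names are illustrative only.

```java
public class ReadTimeSketch {
    // Assumed sustained throughput per disk, taken from the slide (80 MB/sec).
    static final double MB_PER_SEC = 80.0;
    // 1 TB expressed in MB, using decimal units (an assumption).
    static final double TB_IN_MB = 1_000_000.0;

    /** Seconds needed to scan 1 TB spread evenly over `disks` disks read in parallel. */
    static double secondsToRead(int disks) {
        return TB_IN_MB / (MB_PER_SEC * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double s = secondsToRead(disks);
            System.out.printf("%4d disks: %.1f s (%.2f h)%n", disks, s, s / 3600.0);
        }
    }
}
```

Under these assumptions one disk needs 12,500 seconds (roughly three and a half hours) and 1000 disks need 12.5 seconds, in line with the rounded figures on the slide. The point is that read time falls linearly with the number of disks working in parallel.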
Parallel Data Processing is the answer
It has been with us for a while
- GRID computing: spreads the processing load
- Distributed workload: hard-to-manage applications, overhead on the developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
- Well suited to batch-oriented, read-intensive applications
- Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
Design principles of Hadoop
New way of storing and processing the data
- Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of the operating system
  • Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well for resiliency
[Diagram: a logical file is split into blocks 1 to 4; several copies of each block are stored on different nodes of the cluster]
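The block distribution described on this slide can be sketched as follows. This is a simplified illustration, not HDFS's actual placement policy (which is rack-aware): replicas are assigned round-robin, and the 128 MB block size, replication factor of 3, and five-node cluster are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPlacementSketch {
    /** Number of fixed-size blocks needed to hold a file of the given size. */
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize); // ceiling division
    }

    /**
     * Assigns each block to `replication` distinct nodes, round-robin.
     * Simplification for illustration; requires replication <= nodes.
     */
    static List<List<Integer>> place(int blocks, int nodes, int replication) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((b + r) % nodes); // consecutive nodes, wrapping around
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        int blocks = blockCount(400L, 128L); // a 400 MB file with 128 MB blocks
        List<List<Integer>> placement = place(blocks, 5, 3);
        for (int b = 0; b < placement.size(); b++) {
            System.out.println("block " + (b + 1) + " -> nodes " + placement.get(b));
        }
    }
}
```

With these numbers the file splits into four blocks and each block lives on three different nodes, so losing any single node still leaves two copies of every block.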
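Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce application
1. Map phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
[Diagram: the MapReduce application distributes map tasks to the Hadoop data nodes in the cluster; after the shuffle, a single result set is returned]
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Both classes are nested inside a driver class (for example, WordCount).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();
      public void map(Object key, Text val, Context context)
          throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);   // emit <word, 1>
        }
      }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable result = new IntWritable();
      public void reduce(Text key, Iterable<IntWritable> val, Context context)
          throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : val) {
          sum += v.get();
        }
        result.set(sum);
        context.write(key, result);   // emit <word, total>
      }
    }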
MapReduce Example
Count the number of word occurrences
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1>
  <IBM, 1>
  <Hello, 2>
  <World, 2>
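The three phases of this example can be traced without a cluster. The sketch below is plain Java with no Hadoop dependency; it only mimics the data flow (map emits pairs, shuffle groups them by key, reduce sums each group), and all class and method names are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    /** Map step: emit a <word, 1> pair for every token of one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** Shuffle step: group all emitted values by key (kept sorted by key). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce step: sum the grouped values of each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    /** Runs the whole pipeline; each input line stands in for one map task. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String line : lines) {
            all.addAll(map(line));
        }
        return reduce(shuffle(all));
    }

    public static void main(String[] args) {
        // prints {Bye=1, Hello=2, IBM=1, World=2}
        System.out.println(wordCount(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

In a real job the map calls run on different nodes and the shuffle moves data across the network; here everything happens in one process, which is enough to see why the reduce step receives all the values for a given word together.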
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql
Pig, Hive, Jaql: Similarities
- Reduced program size over Java
- Applications are translated to map and reduce jobs behind the scenes
- Extension points for extending existing functionality
- Interoperability with other languages
- Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql: Differences
Characteristic            | Pig       | Hive                              | Jaql
Developed by              | Yahoo     | Facebook                          | IBM
Language                  | Pig Latin | HiveQL                            | Jaql
Type of language          | Data flow | Declarative (SQL dialect)         | Data flow
Data structures supported | Complex   | Better suited for structured data | JSON, semi-structured
Schema                    | Optional  | Not optional                      | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights component overview. Area labels: Visualization & Discovery, Applications & Development, Administration, Advanced Analytic Engines, Workload Optimization, Runtime, File System, Data Store, Integration. Components shown: BigSheets, Dashboard & Visualization, Apps, Workflow, Monitoring, Management, Text Analytics, MapReduce, Pig & Jaql, Hive, Big SQL, Integrated Installer, Admin Console, Text Processing Engine & Extractor Library, Adaptive Algorithms, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Adaptive MapReduce, HDFS, GPFS-FPO, NameNode High Avail, HBase, Jaql, Pig, ZooKeeper, Lucene, Oozie, HCatalog, Avro, Sqoop, Flume, R, Security, Audit & History, Lineage, Streams, Netezza, DB2, DataStage, JDBC, Guardium, Platform Computing, Cognos. Legend: IBM / Open Source]
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights component overview; the following slides cover BigSheets]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with the built-in "sheets" editor
- Add / delete columns
- Filter data
- Specify formulas to compute new values using spreadsheet-style syntax
- Apply built-in or custom macro functions
- ...
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
- Pie charts, bar charts, tag clouds, maps, etc.
- Hover over sections to reveal details
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights component overview; the following slides cover Big SQL]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive SQL-92 ANSI support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
The SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces an execution plan
- Executes and coordinates the query
[Diagram: an application issues SQL through a JDBC/ODBC driver over a network protocol to the Big SQL Server (SQL engine plus storage handlers for Del files, SEQ files, HBase, RDBMS, ...) on a head node of the BigInsights cluster. Other head nodes host the Name Node, the Job Tracker, and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only]
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights component overview; the following slides cover Text Analytics]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
- Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and are difficult to optimize for performance
    create view AmountWithUnit as
      extract pattern <N.match> <U.match>
        as match
      from Number N, Unit U;
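To make the intent of the AmountWithUnit rule concrete, here is a rough Java analogue using regular expressions. It is only an approximation of what the AQL view expresses (a Number match immediately followed by a Unit match), not how the Text Analytics engine works; the unit vocabulary and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmountWithUnitSketch {
    // Stand-ins for the Number and Unit views; the unit list is a made-up example vocabulary.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:GB|MB|TB|MHz|GHz|sec|min|hours)";
    // A Number match followed by whitespace and a Unit match.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b(" + NUMBER + ")\\s+(" + UNIT + ")\\b");

    /** Returns every "number unit" span found in the text, in order of appearance. */
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group(1) + " " + m.group(2));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("Reading 1 TB at 80 MB per second takes about 3.4 hours"));
    }
}
```

The declarative AQL version states only what to match and leaves the execution strategy to the compiled plan, whereas this sketch hard-codes one matching strategy.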
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Annotations highlighted in the text
- Striker: Arjen Robben (Netherlands)
- Keeper: Iker Casillas (Spain)
- Winger: Andres Iniesta (Spain)
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Diagram repeated: IBM InfoSphere BigInsights component overview; the following slide covers R]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  - Partitioning of large data
  - Parallel cluster execution of R code
• Distributed machine learning
  - A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram labels: R Clients, IBM R Packages, Data Sources, Embedded R Execution, Scalable Machine Learning. Callouts: pull data (summaries) to the R client, or push R functions right on the data]
Where Does Big Data Fit?
[Diagram labels: Source Systems, Analytical database (DW), Analytical tools. Steps shown: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data. Callouts: "Capture in case it's needed" and "Capture only what's needed"]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist: the new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data (free PDF book)
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x (Redbook)
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
- Flexible online delivery allows learning at your place and your pace
- Free courses, free study materials
- Cloud-based sandbox for exercises: zero setup
- Robust Course Management System and Content Distribution infrastructure
TransactionData
MainframeData
OLTP SystemData
Traditional databases
ERP Data
StructuredRepeatable
Linear
UnstructuredExploratory
Dynamic
Text Data
emails
Hadoop andStreams
NewSources
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
17
Untapped stores of internal data
Size and scope of some internal data such as
detailed transactions and operational log data
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected but not
analyzed for years
Focus on customer insights
Customers ndash influenced by digital experiences
ndash often expect information provided to an
organization will then be ldquoknownrdquo during future
interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions Emails Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization
copy 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100 due to rounding
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
copy 2014 IBM Corporation20
Big Data ExplorationFind visualize understand all big data to improve decision making
Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources
Operations AnalysisAnalyze a variety of machinedata for improved business results
Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency
SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time
Key Big Data Use Cases
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Rest Data
ndash Data to analyze are already stored (structured and unstructured)
ndash Examples logs facebook twitter etc
ndash Solution Hadoop (open source)
Data in motion
ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage
ndash Examples Sensors RFID etc
ndash Solution Streams CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU Speedsndash 1990 - 44 MIPS at 40 MHz
ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz
RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)
Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB
Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec
copy 2014 IBM Corporation24
How long it will take to read 1TB of data
1TB (at 80Mb sec)ndash 1 disk - 34 hours
ndash 10 disks - 20 min
ndash 100 disks - 2 min
ndash 1000 disks - 12 sec
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)
copy 2014 IBM Corporation26
What is Apache Hadoop
Apache Open source software framework
Flexible enterprise-class support for processing large volumes of
data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)
ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search
technology
ndash Well-suited to batch-oriented read-intensive applications
ndash Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo
ndash Nodes can be combined into clusters
ndash New nodes can be added as needed without changing
bull Data formats
bull How data is loaded
bull How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
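The difference between row-oriented storage and HBase-style column families can be illustrated with a small sketch. This is plain Python, not the HBase API; the row keys, the "profile" and "activity" families and all values are hypothetical, chosen only to show how sparse rows are regrouped by family.

```python
from collections import defaultdict

# Hypothetical sparse rows: each maps "family:qualifier" -> value.
# Rows do not need to share the same columns.
rows = {
    "user1": {"profile:name": "Ana", "profile:city": "Barcelona",
              "activity:last_login": "2014-03-01"},
    "user2": {"profile:name": "Ben", "activity:clicks": "17"},
}

def row_oriented(rows):
    """Keep every column of a row together, as a row-oriented engine would."""
    return [(key, dict(cols)) for key, cols in sorted(rows.items())]

def column_family_oriented(rows):
    """Group cells by column family, so each family is stored together."""
    families = defaultdict(list)
    for key, cols in sorted(rows.items()):
        for column, value in sorted(cols.items()):
            family, qualifier = column.split(":", 1)
            families[family].append((key, qualifier, value))
    return dict(families)

layout = column_family_oriented(rows)
for family, cells in sorted(layout.items()):
    print(family, cells)
```

A scan that only needs the "profile" family touches one group of cells and never reads the "activity" data, which is the practical benefit the definition above describes.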
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: The R Project for Statistical Computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL databases) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
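The idea behind the Lucene index described above (break documents into terms, then look terms up instead of scanning text) can be sketched as a toy inverted index. This is not the Lucene API; the function names and the three sample documents are assumptions made for illustration, and real Lucene adds analysis, scoring and on-disk segments.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "big data on Hadoop",
    2: "text search on big collections",
    3: "Hadoop cluster",
}
index = build_index(docs)
print(search(index, "big hadoop"))
```

Once the index exists, a query costs a few dictionary lookups and set intersections regardless of how long the documents are, which is why the index is the key component for rapid search.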
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over big data in Hadoop
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team.
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console (e.g. file system data, output from a Web crawl, etc.)
BigSheets Collection Operations
Work with the built-in "sheets" editor
- Add/delete columns
- Filter data
- Specify formulas to compute new values using spreadsheet-style syntax
- Apply built-in or custom macro functions
- ...
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
- Pie charts, bar charts, tag clouds, maps, etc.
- Hover over sections to reveal details
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive SQL-92 ANSI support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces the execution plan
- Executes and coordinates the query
[Figure: Big SQL architecture. An application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL Server on a head node of the BigInsights cluster. The server contains the SQL engine and storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker and the Hive Metastore; each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
- Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
- SQL-like language
- Fully declarative text analytics language
- Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
- Tooling for easy customization, because you are abstracted from the programmatic details
- Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
Example:
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
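For readers without an AQL environment, the intent of the AmountWithUnit view (a number immediately followed by a unit) can be approximated with a regular expression. This is a rough sketch, not a translation of AQL semantics: the NUMBER and UNIT patterns stand in for the Number and Unit views, whose real definitions the slide does not show, and the unit list is an assumption.

```python
import re

# Assumed stand-ins for the Number and Unit views; the unit list is illustrative only.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:km|kg|MB|GB|TB|GHz|MHz)"
AMOUNT_WITH_UNIT = re.compile(rf"({NUMBER})\s+({UNIT})\b")

def amount_with_unit(text):
    """Return (number, unit) pairs where a number is directly followed by a unit."""
    return [(m.group(1), m.group(2)) for m in AMOUNT_WITH_UNIT.finditer(text)]

sample = "The disk holds 1 TB and the CPU runs at 3.3 GHz, up from 40 MHz."
print(amount_with_unit(sample))
```

The declarative AQL version keeps the two building blocks as separate, reusable views, whereas the regular expression fuses them into one pattern; that separation is what the slide means by rules that can be customized.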
Text Analytic Simple Example
Source text:
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted information:
Player | Position | Country
Arjen Robben | Striker | Netherlands
Iker Casillas | Keeper | Spain
Andres Iniesta | Winger | Spain
Text Analytic Real Example
One step beyond Watson
Big R
- Explore, visualize, transform and model big data using familiar R syntax and paradigm
- Scale out R with MR programming
  - Partitioning of large data
  - Parallel cluster execution of R code
- Distributed Machine Learning
  - A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Figure: R clients with IBM R packages connected to the data sources through three numbered paths: pull data (summaries) to the R client, or push R functions right on the data (embedded R execution), and scalable machine learning.]
Where Does Big Data Fit?
[Figure: data flow between source systems, the big data platform, the analytical database (DW) and analytical tools. Labelled steps: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data. Annotations: "Capture in case it's needed" and "Capture only what's needed".]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data - Free PDF Book
  - http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  - www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x - Redbook
  - http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  - www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  - www-01.ibm.com/software/data/infosphere/biginsights/
- Stream Computing
  - www-01.ibm.com/software/data/infosphere/stream-computing/
- DeveloperWorks forums, demos
  - http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
- Flexible online delivery allows learning at your place and your pace
- Free courses, free study materials
- Cloud-based sandbox for exercises: zero setup
- Robust Course Management System and Content Distribution infrastructure
Paradigm shifts enabled by big data IV: Leverage data as it is captured
Complementary Analytics
Traditional Approach: structured, analytical, logical
- Data Warehouse, fed by traditional sources: transaction data, internal app data, mainframe data, OLTP system data, ERP data, traditional databases
- Structured, repeatable, linear
New Approach: creative, holistic thought, intuition
- Hadoop and Streams, fed by new sources: multimedia, web logs, social data, sensor data, images, RFID, text data, emails
- Unstructured, exploratory, dynamic
Types of Analytic Tools
Organisations are prioritising internal data sources
Untapped stores of internal data
- The size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
- New infrastructure components make them accessible for analysis
- Some data has been collected but not analyzed for years
Focus on customer insights
- Customers, influenced by digital experiences, often expect that information provided to an organization will then be "known" during future interactions
- Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences (transactions, emails, call center interaction records)
Big data sources
Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.
Stages of Big Data adoption
Big data adoption
When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1,061.
Totals do not equal 100% due to rounding.
Hadoop workloads
[Figure: bar chart of Hadoop workloads (staging area, online archive, transformation engine, ad hoc queries, scheduled reports, visual exploration, data mining), comparing the percentage of respondents using each workload today and in 18 months. Values extracted from the chart, in order: 92, 92, 83, 58, 42, 25, 58 and 92, 92, 92, 67, 67, 67, 83.]
Based on respondents that have implemented Hadoop. BI Leadership Forum, April 2012.
Key Big Data Use Cases
- Big Data Exploration: find, visualize and understand all big data to improve decision making
- Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
- Operations Analysis: analyze a variety of machine data for improved business results
- Data Warehouse Modernization: integrate big data and data warehouse capabilities to increase operational efficiency
- Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
- Data to analyze are already stored (structured and unstructured)
- Examples: logs, Facebook, Twitter, etc.
- Solution: Hadoop (open source)
Data in motion
- Data are analyzed in real time, at the moment they are generated, without any previous storage
- Examples: sensors, RFID, etc.
- Solution: Streams, CEP solutions
Hardware improvements through the years
CPU speeds
- 1990: 44 MIPS at 40 MHz
- 2000: 3,561 MIPS at 1.2 GHz
- 2010: 147,600 MIPS at 3.3 GHz
RAM memory
- 1990: 640K conventional memory (256K extended memory recommended)
- 2000: 64 MB memory
- 2010: 8-32 GB (and more)
Disk capacity
- 1990: 20 MB
- 2000: 1 GB
- 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
- 1 disk: 3.4 hours
- 10 disks: 20 min
- 100 disks: 2 min
- 1000 disks: 12 sec
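The figures above follow from dividing the data volume by the aggregate throughput of the disks. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes), a sustained 80 MB/sec per disk and a perfectly even split of the data across disks; the function names are illustrative.

```python
TB = 10**12          # bytes in one terabyte (decimal units, an assumption)
RATE = 80 * 10**6    # assumed sustained throughput per disk: 80 MB/sec

def read_time_seconds(total_bytes, disks, rate=RATE):
    """Seconds to scan total_bytes spread evenly over `disks` disks read in parallel."""
    return total_bytes / (disks * rate)

def humanize(seconds):
    """Format a duration using the largest convenient unit."""
    if seconds >= 3600:
        return f"{seconds / 3600:.1f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.1f} min"
    return f"{seconds:.1f} sec"

times = {n: humanize(read_time_seconds(TB, n)) for n in (1, 10, 100, 1000)}
for disks, duration in times.items():
    print(f"{disks:>4} disks: {duration}")
```

One disk needs 12,500 seconds (about 3.5 hours) and 1000 disks need 12.5 seconds; the slide rounds these down. Throughput per disk stays the same, so the only way to shorten the scan is to read many disks at once.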
Parallel Data Processing is the answer
It has been with us for a while
- GRID computing: spreads the processing load
- Distributed workload: hard to manage applications, overhead on the developer
- Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  - Originally built to address scalability problems of Nutch, an open source Web search technology
- Well suited to batch-oriented, read-intensive applications
- Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  - Data formats
  - How data is loaded
  - How jobs are written
Design principles of Hadoop
New way of storing and processing the data
- Let the system handle most of the issues automatically
  - Failures
  - Scalability
  - Reduce communications
  - Distribute data and processing power to where the data is
  - Make parallelism part of the operating system
  - Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well, for resiliency
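The block and replica ideas above can be sketched in a few lines. This is a simplified illustration, not HDFS code: the block size, the node names and the round-robin placement are assumptions (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a logical file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

logical_file = b"x" * 1000
blocks = split_into_blocks(logical_file, block_size=256)
nodes = ["node1", "node2", "node3", "node4", "node5"]
placement = place_replicas(len(blocks), nodes)
for block_id, holders in placement.items():
    print(f"block {block_id + 1}: {holders}")
```

Because every block lives on three different nodes, losing any single node leaves at least two copies of each block, and a job can be scheduled on whichever node already holds the block it needs.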
[Figure: a logical file is split into four blocks (1 to 4), and each block is stored on three different nodes of the cluster.]
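Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
A MapReduce application runs in three phases:
1. Map phase (break the job into small parts): map tasks are distributed to the Hadoop data nodes in the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
The application returns a single result set.
Word count mapper and reducer (job driver omitted):
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: emit <word, 1> for every token of the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts that the shuffle grouped under each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}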
MapReduce Example
Count the number of word occurrences
Entry data:
Hello World Bye World
Hello IBM
Map process:
Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output):
<Bye, 1>
<IBM, 1>
<Hello, 2>
<World, 2>
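The same word count can be traced on a single machine to make the three phases concrete. This is a plain-Python sketch of the data flow, not Hadoop code; the function names are illustrative, and each input line plays the role of the split handled by one map task.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values of each key into the final result set."""
    return {key: sum(values) for key, values in grouped.items()}

splits = ["Hello World Bye World", "Hello IBM"]
mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In Hadoop the map calls run in parallel on the nodes that hold the data and the shuffle moves pairs across the network, but the intermediate pairs and the final counts are the same as in the slide.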
How to Analyze Large Data Sets in Hadoop
It's not just about runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql
Pig, Hive, Jaql - Similarities
- Reduced program size over Java
- Applications are translated to map and reduce jobs behind the scenes
- Extension points for extending existing functionality
- Interoperability with other languages
- Not designed for random reads/writes or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture
ndash Comprehensive SQL92 ansi support
ndash Standards compliant client drivers (JDBC amp ODBC)
ndash Efficient handling of point queries
ndash Wide variety of data sources and file formats
ndash Extensive HBase focus
ndash Open source interoperability
Our driving design goalsndash Existing queries should run with no or few modifications
ndash Existing JDBC and ODBC compliant tools should continue to function
ndash Queries should be executed as efficiently as the chosen storage
mechanisms allow
copy 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with
Hive via the Hive metastorendash Each can query the others tables
SQL engine analyzes incoming
queriesndash Separates portion(s) to execute at
the server vs portion(s) to execute
on the cluster
ndash Re-writes query if necessary for
improved performance
ndash Determines appropriate storage
handler for data
ndash Produces execution plan
ndash Executes and coordinates query
Server layout and relative sizes
for illustrative purposes only
Application
SQL Language
JDBC ODBC Driver
BigInsights Cluster
Head Node
Big SQL Server
Head Node
Name Node
Head Node
Job TrackerNetwork Protocol
SQL Engine
Storage Handlers
Del
Files
SEQ
FilesHBase RDBMS bullbullbull
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Compute Node
Task
Tracker
Data
Node
Region
Server
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Head Node
Hive Metastore
copy 2014 IBM Corporation46
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram, with a legend distinguishing IBM from open source components. Labels shown:
Layers and groups: Visualization & Discovery, Dashboard & Visualization, Applications & Development, Advanced Analytic Engines, Workload Optimization, Runtime, Data Store, File System, Administration, Management, Security, Integration
Components: BigSheets, Big SQL, Text Analytics, Apps, Workflow Monitoring, MapReduce, Adaptive MapReduce, Pig, Hive, Jaql, HCatalog, R, Text Processing Engine & Extractor Library, Adaptive Algorithms, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, ZooKeeper, Lucene, Oozie, Sqoop, Avro, HBase, HDFS, GPFS-FPO, NameNode High Availability, Integrated Installer, Admin Console, Audit & History, Lineage
Integration points: JDBC, Streams, Netezza, Flume, DB2, DataStage, Guardium, Platform Computing, Cognos]
copy 2014 IBM Corporation47
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
– Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM has opened up the Text Analytics Engine technology for customization and development
copy 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs against the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions use locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
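The AmountWithUnit rule matches a number followed by a unit. As a rough illustration of the same idea outside AQL, here is a plain-Java sketch using regular expressions. The two patterns standing in for the Number and Unit views, the list of units, and the class name are all assumptions made for this example; none of it is part of AQL or of the BigInsights API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AmountWithUnitSketch {
    // Hypothetical stand-ins for the Number and Unit views used by the AQL rule.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:kg|km|MB|GB|TB)";

    // A number, whitespace, then a unit: the sequence expressed by <N.match> <U.match>.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b" + NUMBER + "\\s+" + UNIT + "\\b");

    // Returns every "number unit" span found in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("Each node stores 1 TB and ships 80 MB of logs."));
    }
}
```

Unlike this hard-coded pattern, the declarative rule composes existing views, so changing what counts as a Number or a Unit does not require touching the rule itself.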
copy 2014 IBM Corporation49
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself from the rest by winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations:
Position   Player           Country
Striker    Arjen Robben     Netherlands
Keeper     Iker Casillas    Spain
Winger     Andres Iniesta   Spain
copy 2014 IBM Corporation5050
Text Analytic Real Example
copy 2014 IBM Corporation5151
One step beyond Watson
copy 2014 IBM Corporation52
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram (IBM and open source components), repeated from the earlier slide of the same title.]
copy 2014 IBM Corporation53
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed machine learning
– A scalable statistics engine that provides canned algorithms and the ability to author new ones, all via R
[Figure: R clients use IBM R packages to reach the data sources. Data (summaries) can be pulled to the R client, or R functions can be pushed right onto the data (embedded R execution), alongside a scalable machine learning engine. The diagram carries numbered callouts 1, 2 and 3.]
copy 2014 IBM Corporation54
Where Does Big Data Fit?
[Figure: flow between source systems, the analytical database (DW) and analytical tools. Labels shown, in extraction order:
5 Explore data
6 Parse, aggregate
"Capture in case it's needed"
1 Extract, transform, load
"Capture only what's needed"
9 Report and mine data]
copy 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation56
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
copy 2014 IBM Corporation57
Data Science is Multidisciplinary
copy 2014 IBM Corporation58
Successful Data Scientist Characteristics
copy 2014 IBM Corporation59
Data Scientist Qualities
copy 2014 IBM Corporation60
How Long Does It Take For a Beginner to Become a Good Data Scientist?
copy 2014 IBM Corporation61
www.kaggle.com
copy 2014 IBM Corporation62
Kaggle ranking
copy 2014 IBM Corporation63
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
copy 2014 IBM Corporation64
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure
copy 2014 IBM Corporation65
copy 2014 IBM Corporation14
Paradigm shifts enabled by big data IV: Leverage data as it is captured
copy 2014 IBM Corporation15
Complementary Analytics
Traditional Approach: structured, analytical, logical
– Structured, repeatable, linear
– Data Warehouse and traditional databases
– Sources: internal app data, transaction data, mainframe data, OLTP system data, ERP data
New Approach: creative, holistic thought, intuition
– Unstructured, exploratory, dynamic
– Hadoop and Streams
– New sources: multimedia, web logs, social data, sensor data, images, RFID, text data, emails
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
Untapped stores of internal data
– The size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
– New infrastructure components make them accessible for analysis
– Some data has been collected but not analyzed for years
Focus on customer insights
– Customers, influenced by digital experiences, often expect that information provided to an organization will then be "known" during future interactions
– Combining disparate internal sources (transactions, emails, call center interaction records) with advanced analytics creates insights into customer behavior and preferences
[Chart: Big data sources. Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.]
copy 2014 IBM Corporation18
Stages of Big Data adoption
[Chart: Big data adoption. When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.]
copy 2014 IBM Corporation19
Hadoop workloads
Workload                 Today   In 18 Months
Staging area             92      92
Online archive           92      92
Transformation engine    83      92
Ad hoc queries           58      67
Scheduled reports        42      67
Visual exploration       25      67
Data mining              58      83
Based on respondents that have implemented Hadoop. Source: BI Leadership Forum, April 2012
copy 2014 IBM Corporation20
Key Big Data Use Cases
– Big Data Exploration: Find, visualize and understand all big data to improve decision making
– Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
– Operations Analysis: Analyze a variety of machine data for improved business results
– Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
– Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real time
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Data at rest
– The data to analyze is already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data is analyzed in real time, at the moment it is generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions
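The difference between the two modes can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: the readings are made-up numbers, the class and method names are hypothetical, and neither Hadoop nor a streaming product is involved. The batch method needs the whole stored data set, while the running average updates its result as each reading arrives and keeps nothing else.

```java
import java.util.Arrays;
import java.util.List;

class RestVersusMotionSketch {
    // Data at rest: the readings are already stored, so the whole set is scanned in one batch.
    static double batchAverage(List<Double> stored) {
        double sum = 0.0;
        for (double v : stored) {
            sum += v;
        }
        return stored.isEmpty() ? 0.0 : sum / stored.size();
    }

    // Data in motion: each reading updates the result on arrival and is then discarded.
    static final class RunningAverage {
        private long count;
        private double mean;

        void accept(double reading) {
            count++;
            mean += (reading - mean) / count; // incremental mean update
        }

        double value() {
            return mean;
        }
    }

    public static void main(String[] args) {
        List<Double> readings = Arrays.asList(20.0, 22.0, 21.0, 25.0);
        RunningAverage live = new RunningAverage();
        for (double r : readings) {
            live.accept(r); // stands in for a sensor event arriving
        }
        System.out.println(batchAverage(readings) + " " + live.value());
    }
}
```

Both paths reach the same average here; the point of the streaming form is that the answer is available after every event, without first landing the data in storage.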
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM
– 1990: 640K conventional memory (256K extended memory recommended)
– 2000: 64 MB
– 2010: 8-32 GB (and more)
Disk capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk speed (reads and writes)
– Not much improvement in the last 7-10 years; currently around 70-80 MB/s
copy 2014 IBM Corporation24
How long will it take to read 1 TB of data?
1 TB (at 80 MB/s)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
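The arithmetic behind these read times is the data size divided by the aggregate throughput of the disks. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and reads that parallelize perfectly across disks; the class and method names are illustrative only:

```java
class DiskReadTimeSketch {
    // Assumption: 1 TB = 1,000,000 MB and every disk contributes its full throughput.
    static double secondsToRead(double terabytes, double mbPerSecondPerDisk, int disks) {
        double totalMb = terabytes * 1_000_000.0;
        return totalMb / (mbPerSecondPerDisk * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double seconds = secondsToRead(1.0, 80.0, disks);
            System.out.printf("%4d disk(s): %.1f s (%.2f h)%n", disks, seconds, seconds / 3600.0);
        }
    }
}
```

With one disk the result is 12,500 seconds, roughly three and a half hours; with 1000 disks it drops to 12.5 seconds. Real clusters fall short of this ideal because of coordination and network overhead.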
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It has been with us for a while
– GRID computing: spreads the processing load
– Distributed workload: hard to manage applications, overhead on the developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
copy 2014 IBM Corporation26
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search technology
– Well suited to batch-oriented, read-intensive applications
– Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop
New way of storing and processing the data
– Let the system handle most of the issues automatically
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well, for resiliency
[Figure: a logical file is split into blocks 1 to 4, and each block is stored as three copies on different nodes of the cluster.]
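To make the block and replica idea concrete, here is a small plain-Java sketch that splits a file into blocks and assigns each block to several nodes. It is an illustration only: the 128 MB block size, the round-robin placement policy, the node names and the class name are assumptions for this example, and real HDFS placement also takes racks and node load into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BlockPlacementSketch {
    // Number of blocks needed for a file, rounding up for the last partial block.
    static int blockCount(long fileBytes, long blockBytes) {
        return (int) ((fileBytes + blockBytes - 1) / blockBytes);
    }

    // Assigns each block to `replicas` distinct nodes, walking the node list round-robin.
    static Map<Integer, List<String>> place(int blocks, List<String> nodes, int replicas) {
        if (replicas > nodes.size()) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int block = 0; block < blocks; block++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add(nodes.get((block + r) % nodes.size()));
            }
            placement.put(block + 1, holders); // blocks are numbered from 1
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4", "node5");
        // Assumed sizes: a 400 MB file stored in 128 MB blocks.
        int blocks = blockCount(400L * 1024 * 1024, 128L * 1024 * 1024);
        System.out.println(place(blocks, nodes, 3));
    }
}
```

Because every block lives on three different nodes, losing any single node leaves at least two copies of each block available.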
copy 2011 IBM Corporation29
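Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
A MapReduce application runs in three phases
1. Map phase (break the job into small parts); map tasks are distributed to the Hadoop data nodes in the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
The application returns a single result set
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: emits (word, 1) for every token of the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}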
copy 2011 IBM Corporation30
MapReduce Example
Count the number of word occurrences
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
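The three phases of this example can be traced in plain Java without a Hadoop cluster. The sketch below is a single-process imitation, not Hadoop code: the class and method names are hypothetical, each input line plays the role of one map task, and a sorted map stands in for the shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountPhasesSketch {
    // Map phase: one (word, 1) pair per token of an input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group the interim values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs all three phases; each line stands in for the input of one map task.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> interim = new ArrayList<>();
        for (String line : lines) {
            interim.addAll(map(line));
        }
        return reduce(shuffle(interim));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

Running it on the two input lines of the slide yields Bye 1, Hello 2, IBM 1 and World 2, the same counts as the final reduce output.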
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
copy 2014 IBM Corporation32
Pig, Hive, Jaql – Similarities
– Reduced program size over Java
– Applications are translated to map and reduce jobs behind the scenes
– Extension points for extending existing functionality
– Interoperability with other languages
– Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic              Pig         Hive                                Jaql
Developed by                Yahoo       Facebook                            IBM
Language                    Pig Latin   HiveQL                              Jaql
Type of language            Data flow   Declarative (SQL dialect)           Data flow
Data structures supported   Complex     Better suited for structured data   JSON, semi-structured
Schema                      Optional    Not optional                        Optional
Hadoop Distributions
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram (IBM and open source components), repeated from the earlier slide of the same title.]
copy 2014 IBM Corporation36
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
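The column-family layout described for HBase can be pictured as nested maps. The sketch below is a conceptual model in plain Java, not the HBase client API: the class, its methods and the sample rows are all hypothetical. Cells of one family are kept together, and cells that were never written take no space, which is why the layout suits sparse data.

```java
import java.util.Map;
import java.util.TreeMap;

class ColumnFamilyStoreSketch {
    // family -> row key -> qualifier -> value: each column family is stored together,
    // so reading one family never touches the cells of another.
    private final Map<String, Map<String, Map<String, String>>> families = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    // Returns the stored cells of one family, keyed by row; unknown families yield an empty map.
    Map<String, Map<String, String>> scanFamily(String family) {
        return families.getOrDefault(family, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilyStoreSketch table = new ColumnFamilyStoreSketch();
        table.put("user1", "profile", "name", "Ana");
        table.put("user1", "activity", "lastLogin", "2014-03-01");
        table.put("user2", "profile", "name", "Luis");
        System.out.println(table.scanFamily("profile"));
    }
}
```

A row-oriented store would instead key everything by row first, so a scan over one attribute group would have to pass over every other column of each row.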
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
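The index that makes text search fast is, at its core, an inverted index: a map from each term to the documents containing it. The following plain-Java sketch shows that idea only. It does not use the Lucene library, and the class name, tokenization rule and sample documents are assumptions chosen for brevity.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Break a document into lower-cased terms and record the document id under each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A lookup is a single map access instead of a scan over every document.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Hadoop stores data in HDFS");
        index.add(2, "Lucene builds an index over text");
        index.add(3, "Jaql can query a Lucene index");
        System.out.println(index.search("lucene"));
    }
}
```

A production library adds much more on top of this structure, such as per-field indexes, relevance scoring and on-disk segments.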
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram (IBM and open source components), repeated from the earlier slide of the same title.]
copy 2014 IBM Corporation39
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with the built-in "sheets" editor
Add/delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram (IBM and open source components), repeated from the earlier slide of the same title.]
copy 2014 IBM Corporation44
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with few or no modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
copy 2014 IBM Corporation15
Complementary Analytics
Traditional ApproachStructured analytical logical
New ApproachCreative holistic thought intuition
Multimedia
Data Warehouse
Web Logs
Social Data
Sensor data
images
RFID
Internal AppData
TransactionData
MainframeData
OLTP SystemData
Traditional databases
ERP Data
StructuredRepeatable
Linear
UnstructuredExploratory
Dynamic
Text Data
emails
Hadoop andStreams
NewSources
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
17
Untapped stores of internal data
Size and scope of some internal data such as
detailed transactions and operational log data
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected but not
analyzed for years
Focus on customer insights
Customers ndash influenced by digital experiences
ndash often expect information provided to an
organization will then be ldquoknownrdquo during future
interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions Emails Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization
copy 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100 due to rounding
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
Key Big Data Use Cases
Big Data Exploration: find, visualize and understand all big data to improve decision making
Enhanced 360° View of the Customer: extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
Operations Analysis: analyze a variety of machine data for improved business results
Data Warehouse Modernization: integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: lower risk, detect fraud and monitor cyber security in real time
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
– Data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data are analyzed in real time, at the moment they are generated. They are analyzed without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions
Hardware improvements through the years
CPU Speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM Memory
– 1990: 640K conventional memory (256K extended memory recommended)
– 2000: 64 MB memory
– 2010: 8-32 GB (and more)
Disk Capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk Latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
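The read times on this slide follow from simple division. The sketch below redoes the arithmetic under two assumptions that are not stated on the slide: decimal units (1 TB = 1,000,000 MB) and perfectly parallel reads with no coordination overhead. The class and method names are illustrative only.

```java
import java.util.Locale;

// Back-of-the-envelope scan time for 1 TB spread evenly over N disks.
// Assumes decimal units (1 TB = 1,000,000 MB) and perfectly parallel reads.
class ReadTimeSketch {
    static final double TOTAL_MB = 1_000_000.0;  // 1 TB
    static final double MB_PER_SEC = 80.0;       // per-disk read speed quoted on the slide

    // Seconds needed when the data is spread across `disks` drives read in parallel.
    static double secondsToRead(int disks) {
        return TOTAL_MB / (MB_PER_SEC * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double s = secondsToRead(disks);
            System.out.printf(Locale.ROOT, "%4d disk(s): %8.1f s (%.2f h)%n",
                    disks, s, s / 3600.0);
        }
    }
}
```

One disk needs 12,500 seconds (about 3.5 hours) and 1000 disks need 12.5 seconds, which matches the rounded figures above. Real clusters fall short of this ideal because of network transfer, scheduling and uneven data distribution.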
Parallel Data Processing is the answer
It has been with us for a while
– GRID computing: spreads processing load
– Distributed workload: hard to manage applications, overhead on developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
Design principles of Hadoop
New way of storing and processing the data
– Let system handle most of the issues automatically
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of operating system
• Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Figure: a logical file is split into blocks 1 to 4, and copies of each block are spread across the nodes of the cluster]
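The idea of splitting a file into blocks and keeping several copies of each block on different nodes can be sketched in a few lines. This is a toy model, not the real HDFS placement policy: the round-robin rule, the file and block sizes, and all names below are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of block distribution: NOT the real HDFS placement policy.
class BlockPlacementSketch {
    // Number of fixed-size blocks needed for a file (the last block may be partial).
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    // Assign each block to `replication` distinct nodes, round-robin.
    // Requires replication <= nodes so copies of one block land on different nodes.
    static List<List<Integer>> place(int blocks, int nodes, int replication) {
        if (replication > nodes) {
            throw new IllegalArgumentException("replication > nodes");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> replicas = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                replicas.add((b + r) % nodes);
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        int blocks = blockCount(400L, 128L);          // hypothetical sizes, in MB
        List<List<Integer>> p = place(blocks, 5, 3);  // 5 nodes, 3 copies per block
        for (int b = 0; b < p.size(); b++) {
            System.out.println("block " + (b + 1) + " -> nodes " + p.get(b));
        }
    }
}
```

With these numbers the file needs four blocks, and losing any single node still leaves two copies of every block, which is the resiliency argument made on the slide.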
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Distribute map tasks to cluster (Hadoop Data Nodes)
Return a single result set
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> { // nested in the job class
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } // emit <word, 1>
  }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> { // nested in the job class
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) sum += v.get(); // add up the counts for this word
    result.set(sum); context.write(key, result); // emit <word, total>
  }
}
MapReduce Example
Count number of word occurrences
Entry Data
Hello World Bye World
Hello IBM
Map Process
Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2: <Hello, 1> <IBM, 1>
Shuffle Process
Reduce Process (final output)
<Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
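The three phases of this example can be imitated in memory without a cluster. The sketch below uses plain Java collections and no Hadoop classes; the class and method names are assumptions made for illustration, and treating each input line as one map task is a simplification.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map -> shuffle -> reduce; no Hadoop involved.
class WordCountSketch {
    // Map: one <word, 1> pair per token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // Shuffle: group all values by key (kept sorted by key in a TreeMap).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));  // each line plays the role of one map task
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // The two input lines from the slide.
        System.out.println(wordCount(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

Running it on the slide's two lines gives Bye 1, Hello 2, IBM 1 and World 2, the same totals as the Reduce output above. In Hadoop the grouping step is done by the framework between the map and reduce tasks rather than by user code.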
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic             | Pig       | Hive                               | Jaql
Developed by               | Yahoo     | Facebook                           | IBM
Language                   | Pig Latin | HiveQL                             | Jaql
Type of language           | Data flow | Declarative (SQL dialect)          | Data flow
Data structures supported  | Complex   | Better suited for structured data  | JSON, semi-structured
Schema                     | Optional  | Not optional                       | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights, combining IBM and open source components. Areas shown: Visualization & Discovery, Dashboard & Visualization, Applications & Development, Administration, Advanced Analytic Engines, Workload Optimization, Runtime, File System, Data Store, Security, Integration. Components named: HDFS, GPFS-FPO, NameNode High Availability, MapReduce, Adaptive MapReduce, HBase, Hive, Pig, Jaql, HCatalog, Avro, Lucene, Oozie, ZooKeeper, Flume, Sqoop, Big SQL, BigSheets, Text Analytics, R, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos]
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
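The Lucene entry above says that search is fast because documents are broken into terms and an index is built from them. A toy inverted index shows the principle. It is written in plain Java and does not use or mirror the Lucene API; the tokenization rule, the sample documents and all names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index in plain Java; this does not use or mirror the Lucene API.
class InvertedIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Index one document: lowercase it, split on anything that is not a letter
    // or digit, and record the document id under every term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look a term up directly in the index instead of scanning every document.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        List<String> docs = Arrays.asList(
                "Hadoop stores data in HDFS",
                "Hive runs SQL on Hadoop",
                "Lucene builds an index");
        for (int i = 0; i < docs.size(); i++) {
            index.add(i, docs.get(i));
        }
        System.out.println(index.search("hadoop"));
        System.out.println(index.search("index"));
    }
}
```

A lookup touches only the entry for the requested term, so its cost does not grow with the number of documents that lack the term. A real search library adds analysis, scoring and on-disk index structures on top of this idea.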
Example of Hadoop Ecosystem
[Same IBM InfoSphere BigInsights ecosystem diagram, with BigSheets (Visualization & Discovery) highlighted]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers to work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console, e.g. file system data, output from Web crawl, etc.
BigSheets Collection Operations
Work with built-in "sheets" editor
Add/delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Same IBM InfoSphere BigInsights ecosystem diagram, with Big SQL highlighted]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC and ODBC compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites query if necessary for improved performance
– Determines appropriate storage handler for data
– Produces execution plan
– Executes and coordinates query
[Diagram: an application sends SQL through the JDBC/ODBC driver, over a network protocol, to the Big SQL Server on a head node of the BigInsights cluster. The server contains the SQL engine and storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only]
Example of Hadoop Ecosystem
[Same IBM InfoSphere BigInsights ecosystem diagram, with Text Analytics highlighted]
What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distill structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy to use text analytics language and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotation Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that is run against the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
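The AmountWithUnit rule matches a number followed by a unit. For readers who do not know AQL, the same idea can be approximated with a regular expression. The sketch below is plain Java, not AQL and not a BigInsights API; the NUMBER and UNIT definitions are hypothetical stand-ins for the Number and Unit views that the rule refers to but the slide does not define.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java analogue of the AmountWithUnit rule; not AQL and not a BigInsights API.
class AmountWithUnitSketch {
    // Hypothetical stand-ins for the Number and Unit views referenced by the rule.
    static final String NUMBER = "\\d+(?:\\.\\d+)?";
    static final String UNIT = "(?:MB|GB|TB|MHz|GHz)";

    // A number followed, after optional whitespace, by a unit.
    static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b(" + NUMBER + ")\\s*(" + UNIT + ")\\b");

    // Return every "<number> <unit>" span found in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group(1) + " " + m.group(2));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("A 1 TB disk read at 80 MB per second, CPU at 3.3 GHz"));
    }
}
```

The declarative AQL version leaves the matching strategy to the compiled plan and builds on reusable views, whereas the regular expression hard-codes both the number format and the unit dictionary in one pattern.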
Text Analytic Simple Example
Netherlands, Striker, Arjen Robben
Keeper, Spain, Iker Casillas
Winger, Andres Iniesta, Spain
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Same IBM InfoSphere BigInsights ecosystem diagram, with R highlighted]
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed Machine Learning
– A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram: R clients with IBM R packages work against data sources, Embedded R Execution and Scalable Machine Learning. Clients either pull data (summaries) to the R client or push R functions right onto the data]
Where Does Big Data Fit?
[Diagram: source systems, an analytical database (DW) and analytical tools. Labels on the "capture only what's needed" path: extract, transform, load; report and mine data. Labels on the "capture in case it's needed" path: explore data; parse, aggregate]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist"
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust Course Management System and Content Distribution infrastructure
copy 2014 IBM Corporation65
65
copy 2014 IBM Corporation16
Types of Analytic Tools
copy 2014 IBM Corporation17
Organisations are prioritising internal data sources
17
Untapped stores of internal data
Size and scope of some internal data such as
detailed transactions and operational log data
have become too large and varied to manage
within traditional systems
New infrastructure components make them
accessible for analysis
Some data has been collected but not
analyzed for years
Focus on customer insights
Customers ndash influenced by digital experiences
ndash often expect information provided to an
organization will then be ldquoknownrdquo during future
interactions
Combining disparate internal sources with
advanced analytics creates insights into
customer behavior and preferences
(Transactions Emails Call center interaction records)
Big data sources
Respondents were
asked which data
sources are currently
being collected and
analyzed as part of
active big data efforts
within their
organization
copy 2014 IBM Corporation18
Stages of Big Data adoption
18
Big data adoption
When segmented into four groups based on current levels of big data activity respondents showed significant consistency
in organizational behaviors Total respondents n = 1061
Totals do not equal 100 due to rounding
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
copy 2014 IBM Corporation20
Big Data ExplorationFind visualize understand all big data to improve decision making
Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources
Operations AnalysisAnalyze a variety of machinedata for improved business results
Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency
SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time
Key Big Data Use Cases
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Rest Data
ndash Data to analyze are already stored (structured and unstructured)
ndash Examples logs facebook twitter etc
ndash Solution Hadoop (open source)
Data in motion
ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage
ndash Examples Sensors RFID etc
ndash Solution Streams CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU Speedsndash 1990 - 44 MIPS at 40 MHz
ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz
RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)
Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB
Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec
copy 2014 IBM Corporation24
How long it will take to read 1TB of data
1TB (at 80Mb sec)ndash 1 disk - 34 hours
ndash 10 disks - 20 min
ndash 100 disks - 2 min
ndash 1000 disks - 12 sec
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)
copy 2014 IBM Corporation26
What is Apache Hadoop
Apache Open source software framework
Flexible enterprise-class support for processing large volumes of
data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)
ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search
technology
ndash Well-suited to batch-oriented read-intensive applications
ndash Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo
ndash Nodes can be combined into clusters
ndash New nodes can be added as needed without changing
bull Data formats
bull How data is loaded
bull How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file, and it is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach differs from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which the Hive service breaks down into MapReduce jobs that then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
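The column-family layout described for HBase can be illustrated with a small in-memory sketch. This is plain Java, not the HBase client API: the class ColumnFamilySketch and its methods put and scanFamily are hypothetical names chosen for illustration. It only shows that cells are grouped by family first, so reading one family never touches the values of another, and that a row may simply have no cells in a family (sparse data).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class ColumnFamilySketch {
    // family -> rowKey -> (qualifier -> value).
    // Each family keeps its own sorted map of rows, mimicking "stored together".
    private final Map<String, TreeMap<String, Map<String, String>>> families =
            new LinkedHashMap<>();

    // Stores one cell; missing cells take no space at all.
    public void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new LinkedHashMap<>())
                .put(qualifier, value);
    }

    // Returns every row that has cells in the given family, ordered by row key.
    public Map<String, Map<String, String>> scanFamily(String family) {
        return families.getOrDefault(family, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("user1", "profile", "name", "Ana");
        table.put("user1", "activity", "lastLogin", "2014-03-01");
        table.put("user2", "profile", "name", "Luis"); // no "activity" cells for user2
        System.out.println(table.scanFamily("profile"));
        System.out.println(table.scanFamily("activity"));
    }
}
```

A row-oriented store would instead keep one record per row with a slot for every column, including the empty ones.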
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene and forms the basis of its rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL databases) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
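The idea of building an index from documents and then searching the index, as described for Lucene, can be sketched as a tiny inverted index. This is a simplified illustration in plain Java and does not use the Lucene library; InvertedIndexSketch, add, and search are hypothetical names, and the tokenization (lowercasing and splitting on non-word characters) is an assumption made to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Breaks the text into lowercase tokens and records the document id for each one.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Looks the term up in the index instead of rescanning every document.
    public SortedSet<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Hello World Bye World");
        index.add(2, "Hello IBM");
        System.out.println(index.search("hello"));
        System.out.println(index.search("ibm"));
    }
}
```

The search cost depends on the size of the index entry for the term, not on the total amount of text, which is why the index is the basis of fast search.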
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights component diagram, repeated; the component introduced next is BigSheets, under Visualization & Discovery]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new “sheets” (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team.
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through the BigInsights Web console – e.g., file system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with built-in “sheets” editor
Add / delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights component diagram, repeated; the component introduced next is Big SQL]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC and ODBC compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces an execution plan
– Executes and coordinates the query
[Figure: an application issues SQL through the JDBC/ODBC driver, over a network protocol, to the Big SQL server on a head node of the BigInsights cluster. The Big SQL server contains the SQL engine and the storage handlers (DEL files, SEQ files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker, and the Hive metastore. Each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights component diagram, repeated; the component introduced next is Text Analytics]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
– Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotation Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No “black boxes” or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
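To make the rule above concrete, here is a rough plain-Java analogue of "a number followed by a unit" using a regular expression. It is only a sketch of the idea and not how the AQL engine works: the class AmountWithUnitSketch, the method extract, and the short list of units (kg, km, MB, GB, TB) are assumptions made for illustration, standing in for the Number and Unit views that the AQL rule refers to.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AmountWithUnitSketch {
    // A number (optionally with decimals), whitespace, then one unit from a
    // small hypothetical dictionary.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b(\\d+(?:\\.\\d+)?)\\s+(kg|km|MB|GB|TB)\\b");

    // Returns every "number unit" span found in the text, in order of appearance.
    public static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("The disk holds 1 TB and reads 80 MB per second"));
    }
}
```

The declarative rule only states what to match (a Number next to a Unit); a hand-written extractor like this one has to spell out how, which is the customization effort the slide says AQL abstracts away.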
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations (position | player | country):
Striker | Arjen Robben | Netherlands
Keeper | Iker Casillas | Spain
Winger | Andres Iniesta | Spain
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights component diagram, repeated; the component introduced next is R]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed Machine Learning
– A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Figure: R clients with IBM R packages either pull data (summaries) to the R client or push R functions right onto the data (embedded R execution); scalable machine learning runs against the data sources]
Where Does Big Data Fit?
[Figure: data flows from source systems to an analytical database (DW) and to analytical tools. Visible step labels: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data. Visible captions: “Capture only what's needed” and “Capture in case it's needed”]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist.”
McKinsey Global Institute, “Big data” report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery allows learning at your place and at your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust Course Management System and Content Distribution infrastructure
Organisations are prioritising internal data sources
Untapped stores of internal data
– The size and scope of some internal data, such as detailed transactions and operational log data, have become too large and varied to manage within traditional systems
– New infrastructure components make them accessible for analysis
– Some data has been collected but not analyzed for years
Focus on customer insights
– Customers, influenced by digital experiences, often expect that information provided to an organization will then be “known” during future interactions
– Combining disparate internal sources with advanced analytics creates insights into customer behavior and preferences
(Transactions, Emails, Call center interaction records)
Big data sources: Respondents were asked which data sources are currently being collected and analyzed as part of active big data efforts within their organization.
Stages of Big Data adoption
Big data adoption: When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents n = 1061. Totals do not equal 100% due to rounding.
Hadoop workloads (% of respondents that have implemented Hadoop)
Workload | Today | In 18 Months
Staging area | 92 | 92
Online archive | 92 | 92
Transformation engine | 83 | 92
Ad hoc queries | 58 | 67
Scheduled reports | 42 | 67
Visual exploration | 25 | 67
Data mining | 58 | 83
Source: BI Leadership Forum, April 2012
Key Big Data Use Cases
Big Data Exploration: Find, visualize, and understand all big data to improve decision making
Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: Lower risk, detect fraud, and monitor cyber security in real time
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
– The data to analyze is already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data is analyzed in real time, at the moment it is generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions
Hardware improvements through the years
CPU speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM memory
– 1990: 640K conventional memory (256K extended memory recommended)
– 2000: 64 MB memory
– 2010: 8-32 GB (and more)
Disk capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
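The figures above follow from simple arithmetic: read time equals data size divided by the combined throughput of all disks reading in parallel. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and perfectly even distribution of the data; the class and method names are illustrative only.

```java
public class ParallelReadTime {
    // Seconds needed to read totalMB when the data is spread evenly over
    // `disks` disks that each deliver mbPerSecPerDisk.
    public static double secondsToRead(double totalMB, double mbPerSecPerDisk, int disks) {
        return totalMB / (mbPerSecPerDisk * disks);
    }

    public static void main(String[] args) {
        double oneTerabyteInMB = 1_000_000; // assumption: decimal units
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double seconds = secondsToRead(oneTerabyteInMB, 80, disks);
            System.out.printf("%4d disk(s): %.1f s = %.1f min = %.2f h%n",
                    disks, seconds, seconds / 60, seconds / 3600);
        }
    }
}
```

One disk needs 12,500 seconds (about 3.5 hours) and 1000 disks need 12.5 seconds, in line with the rounded values on the slide.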
Parallel Data Processing is the answer
It has been with us for a while
– GRID computing: spreads processing load
– Distributed workload: hard to manage applications, overhead on developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + local disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
Design principles of Hadoop
New way of storing and processing the data
– Let system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of operating system
  • Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Figure: a logical file is split into blocks 1 to 4; each block appears three times across the nodes of the cluster]
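The splitting and replication shown on this slide can be sketched in a few lines of plain Java. This is not HDFS code. The class BlockPlacementSketch and its methods are hypothetical, and the round-robin placement rule is a deliberate simplification chosen for illustration (the real HDFS placement policy is rack-aware). It only demonstrates two points: a file becomes a number of fixed-size blocks, and every block is held by several distinct nodes.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockPlacementSketch {
    // Number of fixed-size blocks needed for a file (the last block may be partial).
    public static int blockCount(long fileBytes, long blockBytes) {
        return (int) ((fileBytes + blockBytes - 1) / blockBytes);
    }

    // Hypothetical round-robin rule: replica r of block b is stored on node (b + r) % nodes.
    // Returns, for each block, the list of nodes that hold a copy of it.
    public static List<List<Integer>> place(int blocks, int replicas, int nodes) {
        if (replicas > nodes) {
            throw new IllegalArgumentException("replicas must not exceed nodes");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((b + r) % nodes);
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        int blocks = blockCount(400, 128); // illustrative sizes giving 4 blocks, as in the figure
        System.out.println(place(blocks, 3, 5));
    }
}
```

Because each block lives on three different nodes, losing any single node leaves at least two copies of every block available.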
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set
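import java.io.IOException; import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer;
public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } // emit (word, 1)
    }
  }
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
      int sum = 0; for (IntWritable v : val) sum += v.get(); // add up the 1s emitted for this word
      result.set(sum); context.write(key, result); // emit (word, total)
    }
  }
}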
Distribute map tasks to cluster (Hadoop Data Nodes)
MapReduce Example
Count the number of word occurrences
Entry data: “Hello World Bye World” and “Hello IBM”
Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2: <Hello, 1> <IBM, 1>
Reduce (final output): <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
Process: Entry Data → Map → Shuffle → Reduce
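The three steps of this example can be traced without a cluster. The following sketch simulates map, shuffle, and reduce in a single JVM using only the Java standard library. It is an illustration of the data flow, not Hadoop code, and the class WordCountSimulation and its method names are made up for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class WordCountSimulation {
    // Map phase: emit a (word, 1) pair for every token of one input split.
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: bring all values that share a key together.
    public static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values of each key into a single count.
    public static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs the whole pipeline over a list of input splits.
    public static SortedMap<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> allPairs = new ArrayList<>();
        for (String split : splits) {
            allPairs.addAll(map(split)); // on a cluster, each split is mapped on a different node
        }
        return reduce(shuffle(allPairs));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

On a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves the intermediate pairs over the network; here everything happens in one list and one sorted map.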
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights component diagram showing IBM and open source components across visualization and discovery, applications and development, administration, advanced analytic engines, runtime, data store, and file system layers, with integration to Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, and Cognos]
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture
ndash Comprehensive SQL92 ansi support
ndash Standards compliant client drivers (JDBC amp ODBC)
ndash Efficient handling of point queries
ndash Wide variety of data sources and file formats
ndash Extensive HBase focus
ndash Open source interoperability
Our driving design goalsndash Existing queries should run with no or few modifications
ndash Existing JDBC and ODBC compliant tools should continue to function
ndash Queries should be executed as efficiently as the chosen storage
mechanisms allow
copy 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with
Hive via the Hive metastorendash Each can query the others tables
SQL engine analyzes incoming
queriesndash Separates portion(s) to execute at
the server vs portion(s) to execute
on the cluster
ndash Re-writes query if necessary for
improved performance
ndash Determines appropriate storage
handler for data
ndash Produces execution plan
ndash Executes and coordinates query
Server layout and relative sizes
for illustrative purposes only
Application
SQL Language
JDBC ODBC Driver
BigInsights Cluster
Head Node
Big SQL Server
Head Node
Name Node
Head Node
Job TrackerNetwork Protocol
SQL Engine
Storage Handlers
Del
Files
SEQ
FilesHBase RDBMS bullbullbull
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Compute Node
Task
Tracker
Data
Node
Region
Server
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Head Node
Hive Metastore
copy 2014 IBM Corporation46
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
Text Analytics
copy 2014 IBM Corporation47
What is Text Analytics
High Performance and Scalable rule based Information Extraction Engine
Distill structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build test and refine rules
ndash Developer tools an easy to use text analytics language and a set of
extractors for fast adoption
ndash Multilingual support including support for DBCS languages
Developed at IBM Research since 2004 System T
BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development
copy 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL Like Language
Fully declarative text analytics language
Once compiled produced an AOG plan to work in the data
No ldquoblack boxesrdquo or modules that canrsquot be customized
Tooling for easy customization because you are abstracted from the
programmatic details
Competing solutions make use of locked up black-box modules that cannot be
customized which restricts flexibility and are difficult to optimize for performance
create view AmountWithUnit as
extract pattern ltNmatchgt ltUmatchgt
as match
from Number N Unit U
copy 2014 IBM Corporation49
Text Analytics: Simple Example
World Cup 2010 Highlights
In the Football World Cup 2010, one team distinguished itself from the rest by winning the final. Early in the second half, Netherlands’ striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner’s superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations:
Position | Player         | Country
Striker  | Arjen Robben   | Netherlands
Keeper   | Iker Casillas  | Spain
Winger   | Andres Iniesta | Spain
copy 2014 IBM Corporation5050
Text Analytic Real Example
copy 2014 IBM Corporation5151
One step beyond Watson
copy 2014 IBM Corporation52
Example of Hadoop Ecosystem
[Diagram: the same IBM InfoSphere BigInsights / Hadoop ecosystem overview shown on the earlier "Example of Hadoop Ecosystem" slides; the next component covered is R.]
copy 2014 IBM Corporation53
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  – Partitioning of large data
  – Parallel cluster execution of R code
• Distributed machine learning
  – A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram: R clients with IBM R packages work against the data sources in three ways: pull data (summaries) to the R client; or push R functions right on the data (embedded R execution); scalable machine learning.]
copy 2014 IBM Corporation54
Where Does Big Data Fit?
[Diagram: source systems, an analytical database (DW) and analytical tools. Labels: “Capture in case it’s needed” (5. Explore data; 6. Parse, aggregate) and “Capture only what’s needed” (1. Extract, transform, load; 9. Report and mine data).]
copy 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation56
Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist”
McKinsey Global Institute, “Big data” report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions
copy 2014 IBM Corporation57
Data Science is Multidisciplinary
copy 2014 IBM Corporation58
Successful Data Scientist Characteristics
copy 2014 IBM Corporation59
Data Scientist Qualities
copy 2014 IBM Corporation60
How Long Does It Take For a Beginner to Become a Good Data Scientist?
copy 2014 IBM Corporation61
www.kaggle.com
copy 2014 IBM Corporation62
Kaggle ranking
copy 2014 IBM Corporation63
Learn Big Data
Reading Materials – Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x – Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
copy 2014 IBM Corporation64
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure
copy 2014 IBM Corporation65
copy 2014 IBM Corporation18
Stages of Big Data adoption
[Figure: big data adoption. When segmented into four groups based on current levels of big data activity, respondents showed significant consistency in organizational behaviors. Total respondents: n = 1061. Totals do not equal 100% due to rounding.]
copy 2014 IBM Corporation19
Hadoop workloads
Workload (% of respondents) | Today | In 18 months
Staging area                | 92    | 92
Online archive              | 92    | 92
Transformation engine       | 83    | 92
Ad hoc queries              | 58    | 67
Scheduled reports           | 42    | 67
Visual exploration          | 25    | 67
Data mining                 | 58    | 83
Based on respondents that have implemented Hadoop. Source: BI Leadership Forum, April 2012
copy 2014 IBM Corporation20
Key Big Data Use Cases
Big Data Exploration: Find, visualize and understand all big data to improve decision making
Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real time
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Data at rest
– Data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data are analyzed in real time, at the moment they are generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM memory
– 1990: 640K conventional memory (256K extended memory recommended)
– 2000: 64 MB memory
– 2010: 8-32 GB (and more)
Disk capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk latency (speed of reads and writes)
– Not much improvement in the last 7-10 years; currently around 70-80 MB/sec
copy 2014 IBM Corporation24
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
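The figures on this slide follow from a simple division. The sketch below reproduces the arithmetic, assuming the data is spread evenly over the disks, every disk sustains 80 MB/sec, and 1 TB is counted as 1,000,000 MB; the class and method names are made up for the example.

```java
final class ReadTimeSketch {
    // Seconds needed to scan totalMegabytes when the data is spread evenly
    // over `disks` disks that are read in parallel at mbPerSecond each.
    static double secondsToRead(double totalMegabytes, int disks, double mbPerSecond) {
        return totalMegabytes / (disks * mbPerSecond);
    }

    public static void main(String[] args) {
        double oneTerabyte = 1_000_000.0; // 1 TB expressed in MB (decimal units)
        for (int disks : new int[] {1, 10, 100, 1000}) {
            double seconds = secondsToRead(oneTerabyte, disks, 80.0);
            System.out.printf("%d disk(s): %.1f s (%.2f h)%n", disks, seconds, seconds / 3600.0);
        }
    }
}
```

One disk needs 12,500 seconds, a little under three and a half hours, and a thousand disks need 12.5 seconds, which is why spreading the data over many disks matters more than a faster single disk.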
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It has been with us for a while
– GRID computing: spreads processing load
– Distributed workload: hard to manage applications, overhead on developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
copy 2014 IBM Corporation26
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop
New way of storing and processing the data
– Let system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of operating system
  • Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well, for resiliency
[Figure: a logical file (shown as a bit string) is split into blocks 1 to 4; three copies of each block are spread across the nodes of the cluster.]
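The block layout described on this slide can be sketched in a few lines of Java. This is an illustration only, not the HDFS API: the block size, the round-robin choice of nodes and the names BlockPlacementSketch, blockCount and place are assumptions, and real HDFS decides replica placement with a more involved policy.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockPlacementSketch {
    // Number of fixed-size blocks needed to hold a file of fileSize bytes.
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    // For each block, pick `replicas` distinct nodes in round-robin order.
    // placement.get(b) lists the node indexes that hold a copy of block b.
    static List<List<Integer>> place(int blocks, int nodes, int replicas) {
        if (replicas > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((b + r) % nodes);
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        int blocks = blockCount(400L, 100L); // a 400-byte "file" in 100-byte blocks
        System.out.println(place(blocks, 5, 3));
    }
}
```

Because every block lives on several nodes, losing one node leaves at least two copies of each of its blocks elsewhere in the cluster.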
copy 2011 IBM Corporation29
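Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
A MapReduce application runs in three steps; map tasks are distributed to the Hadoop data nodes in the cluster
1. Map phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
The application returns a single result set
Example (word count); both classes are nested in a driver class that the slide does not show:
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Map phase: emit (word, 1) for every token of the input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sum the counts collected for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }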
copy 2011 IBM Corporation30
MapReduce Example
Count the number of occurrences of each word
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
  Pairs with the same word are brought together
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
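The three steps of this example can be traced without a cluster. The sketch below simulates map, shuffle and reduce in plain Java collections on the two input lines of the slide. It does not use the Hadoop API, and the class and method names are invented for the illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSimulation {
    // Map: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values of all pairs by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;
            }
            counts.put(group.getKey(), sum);
        }
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line)); // each line stands in for one map task
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

In a real job the map calls run on different nodes and the shuffle moves data over the network; here all three steps run in one process so the intermediate pairs are easy to inspect.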
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
It’s not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
copy 2014 IBM Corporation32
Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic            | Pig       | Hive                              | Jaql
Developed by              | Yahoo     | Facebook                          | IBM
Language                  | Pig Latin | HiveQL                            | Jaql
Type of language          | Data flow | Declarative (SQL dialect)         | Data flow
Data structures supported | Complex   | Better suited for structured data | JSON, semi-structured
Schema                    | Optional  | Not optional                      | Optional
Hadoop Distributions
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights and the surrounding Hadoop ecosystem, with a legend distinguishing IBM and open source components.
Areas shown: Visualization & Discovery, Dashboard & Visualization, Integration, Workload Optimization, Applications & Development, Advanced Analytic Engines, Administration, Management, Security, Audit & History, Lineage, Index, Runtime, Data Store, File System, Apps, Workflow, Monitoring.
Components shown: BigSheets, Big SQL, Cognos, R, Streams, Netezza, Flume, DB2, DataStage, Sqoop, JDBC, Guardium, Platform Computing, Text Analytics, Text Processing Engine & Extractor Library, Adaptive Algorithms, MapReduce, Adaptive MapReduce, Pig, Hive, Jaql, HCatalog, Oozie, ZooKeeper, Lucene, Avro, HBase, HDFS, GPFS-FPO, NameNode High Availability, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Integrated Installer, Admin Console.]
copy 2014 IBM Corporation36
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
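The difference between column-family storage and row-oriented storage, as described for HBase above, can be pictured with nested maps. The sketch below is a toy in-memory model in plain Java, not the HBase client API; the class ColumnFamilySketch, its methods and the sample rows are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

final class ColumnFamilySketch {
    // family -> row key -> column qualifier -> value.
    // All cells of one column family are kept together.
    private final Map<String, Map<String, Map<String, String>>> families = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    // Reads one family without touching the cells of any other family.
    Map<String, Map<String, String>> scanFamily(String family) {
        return families.getOrDefault(family, new TreeMap<>());
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("user1", "profile", "name", "Ana");
        table.put("user1", "activity", "lastLogin", "2014-01-15");
        table.put("user2", "profile", "name", "Luis"); // sparse: user2 has no activity cells
        System.out.println(table.scanFamily("profile"));
    }
}
```

Missing cells simply take no space in this layout, which is one reason the approach suits sparse data sets.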
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
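The Oozie entry above describes actions that run once their dependencies are met. A minimal sketch of that idea in plain Java follows. It is not the Oozie API and does not use Oozie's workflow definition format; WorkflowSketch, addAction and executionOrder are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class WorkflowSketch {
    // action name -> names of the actions it depends on
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    void addAction(String name, String... dependsOn) {
        dependencies.put(name, Arrays.asList(dependsOn));
    }

    // Returns the actions in an order in which each one runs only after
    // all of its dependencies have completed.
    List<String> executionOrder() {
        Set<String> done = new LinkedHashSet<>();
        while (done.size() < dependencies.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> action : dependencies.entrySet()) {
                if (!done.contains(action.getKey()) && done.containsAll(action.getValue())) {
                    done.add(action.getKey()); // dependencies met: "run" the action
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        return new ArrayList<>(done);
    }

    public static void main(String[] args) {
        WorkflowSketch workflow = new WorkflowSketch();
        workflow.addAction("report", "aggregate");
        workflow.addAction("aggregate", "import");
        workflow.addAction("import");
        System.out.println(workflow.executionOrder());
    }
}
```

The order in which actions are declared does not matter; only the declared dependencies decide when each action becomes runnable.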
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
[Diagram: the same IBM InfoSphere BigInsights / Hadoop ecosystem overview shown on the earlier "Example of Hadoop Ecosystem" slide; the next component covered is BigSheets, under Visualization & Discovery.]
copy 2014 IBM Corporation39
BigSheets
Browser-based analytical tool that generates MapReduce jobs that work over big data in Hadoop
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new “sheets” (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM’s Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through the BigInsights Web console – e.g. file system data, output from Web crawl, etc.
copy 2014 IBM Corporation41
BigSheets Collection Operations
Work with built-in “sheets” editor
Add / delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
[Diagram: the same IBM InfoSphere BigInsights / Hadoop ecosystem overview shown on the earlier "Example of Hadoop Ecosystem" slides; the next component covered is Big SQL.]
copy 2014 IBM Corporation44
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
copy 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other’s tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites query if necessary for improved performance
– Determines appropriate storage handler for data
– Produces execution plan
– Executes and coordinates query
[Diagram: an application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL server on a head node of the BigInsights cluster. The Big SQL server contains the SQL engine and storage handlers (Del files, SEQ files, HBase, RDBMS, …). Other head nodes host the Name Node, the Job Tracker and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
copy 2014 IBM Corporation46
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
Text Analytics
copy 2014 IBM Corporation47
What is Text Analytics
High Performance and Scalable rule based Information Extraction Engine
Distill structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build test and refine rules
ndash Developer tools an easy to use text analytics language and a set of
extractors for fast adoption
ndash Multilingual support including support for DBCS languages
Developed at IBM Research since 2004 System T
BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development
copy 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL Like Language
Fully declarative text analytics language
Once compiled produced an AOG plan to work in the data
No ldquoblack boxesrdquo or modules that canrsquot be customized
Tooling for easy customization because you are abstracted from the
programmatic details
Competing solutions make use of locked up black-box modules that cannot be
customized which restricts flexibility and are difficult to optimize for performance
create view AmountWithUnit as
extract pattern ltNmatchgt ltUmatchgt
as match
from Number N Unit U
copy 2014 IBM Corporation49
Text Analytic Simple Example
NetherlandsStrikerArjen Robben
Keeper SpainIker Casillas
WingerAndres Iniesta Spain
World Cup 2010 Highlights
Football World Cup 2010 one team distinguished well
from the rest winning the final Early in the second
half Netherlandsrsquo striker Arjen Robben had a chance
to score but the awesome keeper for Spain Iker
Casillas made the save Winner superiority was
reflected when Winger Andres Iniesta scored for Spain
for the win
copy 2014 IBM Corporation5050
Text Analytic Real Example
copy 2014 IBM Corporation5151
One step beyond Watson
copy 2014 IBM Corporation52
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
Text Analytics
R
copy 2014 IBM Corporation53
Big R
bull Explore visualize transform
and model big data using
familiar R syntax and
paradigm
bull Scale out R with MR
programming
ndash Partitioning of large data
ndash Parallel cluster execution of R
code
bull Distributed Machine
Learning
ndash A scalable statistics engine that
provides canned algorithms and
an ability to author new ones all
via R
R Clients
Scalable
Machine
Learning
Data Sources
Embedded R
Execution
IBM R Packages
IBM R Packages
Pull data
(summaries) to
R client
Or push R
functions
right on the
data
1
2
3
copy 2014 IBM Corporation54
Where Does BigData Fit
Analytical database
(DW)
Source Systems
Analytical tools
5 Explore data
6 Parse aggregate
ldquoCapture in case itrsquos neededrdquo
1 Extract transform load
ldquoCapture only whatrsquos neededrdquo
9 Report and mine data
copy 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation56
Data scientist ndash The new cool guy in town
Article in Fortune ldquoThe unemployment rate in
the US continues to be abysmal (91 in
July) but the tech world has spawned a
new kind of highly skilled nerdy-cool job
that companies are scrambling to fill data
scientistrdquo
McKinsey Global Institute ldquoBig data Reportrdquo
By 2018 the United States alone could
face a shortage of 140000 to 190000
people with deep analytical skills as well as
15 million managers and analysts with the
know-how to use the analysis of big data to
make effective decisions
copy 2014 IBM Corporation57
Data Science is Multidisciplinary
copy 2014 IBM Corporation58
Successful Data Scientist Characteristics
copy 2014 IBM Corporation59
Data Scientist Qualities
copy 2014 IBM Corporation60
How Long Does It Take For a Beginner to Become
a Good Data Scientist
copy 2014 IBM Corporation61
wwwkagglecom
copy 2014 IBM Corporation62
Kaggle ranking
copy 2014 IBM Corporation63 copy 2013 IBM Corporation63
Learn Big Data
Reading Materials - Online
ndash Understanding Big Data ndash Free PDF Book
bull httppublicdheibmcomcommonssiecmeniml14297usenIML14297USENPDF
ndash Developing publishing and deploying your first big data application with InfoSphere BigInsights
bull wwwibmcomdeveloperworksdatalibrarytecharticledm-1209bigdatabiginsightsindexhtml
ndash Implementing IBM InfoSphere BigInsights on System x - Redbook
bull httpwwwredbooksibmcomredpiecesabstractssg248077html
Resources
ndash Big Data Information Center
bull www-01ibmcomsoftwareebusinessjstartbigdatainfocenterhtml
ndash InfoSphere BigInsights
bull www-01ibmcomsoftwaredatainfospherebiginsights
ndash Stream Computing
bull www-01ibmcomsoftwaredatainfospherestream-computing
ndash DeveloperWorks forums demos
bull httpwwwibmcomdeveloperworkswikibiginsights
copy 2014 IBM Corporation64
Learn Big Data Technologies
BigDataUniversitycom
Flexible on-line delivery
allows learning your place
and your pace
Free courses free study
materials
Cloud-based sandbox for
exercises ndash zero setup
Robust Course
Management System and
Content Distribution
infrastructure
copy 2014 IBM Corporation65
65
copy 2014 IBM Corporation19
Hadoop workloads
92
92
83
58
42
25
58
92
92
92
67
67
67
83
Staging area
Online archive
Transformation Engine
Ad hoc queries
Scheduled reports
Visual exploration
Data mining
Today In 18 Months
Based on respondents that have implemented Hadoop BI Leadership Forum April 2012
copy 2014 IBM Corporation20
Big Data ExplorationFind visualize understand all big data to improve decision making
Enhanced 360o Viewof the CustomerExtend existing customer views (MDM CRM etc) by incorporating additional internal and external information sources
Operations AnalysisAnalyze a variety of machinedata for improved business results
Data Warehouse ModernizationIntegrate big data and data warehouse capabilities to increase operational efficiency
SecurityIntelligence ExtensionLower risk detect fraud and monitor cyber security in real-time
Key Big Data Use Cases
copy 2014 IBM Corporation21
Big Data Concepts
Big Data Technology
Data Scientists
copy 2014 IBM Corporation22
Solution for Big Data
Rest Data
ndash Data to analyze are already stored (structured and unstructured)
ndash Examples logs facebook twitter etc
ndash Solution Hadoop (open source)
Data in motion
ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage
ndash Examples Sensors RFID etc
ndash Solution Streams CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU Speedsndash 1990 - 44 MIPS at 40 MHz
ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz
RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)
Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB
Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec
copy 2014 IBM Corporation24
How long it will take to read 1TB of data
1TB (at 80Mb sec)ndash 1 disk - 34 hours
ndash 10 disks - 20 min
ndash 100 disks - 2 min
ndash 1000 disks - 12 sec
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)
copy 2014 IBM Corporation26
What is Apache Hadoop
Apache Open source software framework
Flexible enterprise-class support for processing large volumes of
data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)
ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search
technology
ndash Well-suited to batch-oriented read-intensive applications
ndash Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo
ndash Nodes can be combined into clusters
ndash New nodes can be added as needed without changing
bull Data formats
bull How data is loaded
bull How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
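// Word count with the Hadoop MapReduce API (imports: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.*)
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Emit <word, 1> for every token of the input line
    public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(val.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    // Sum the counts collected for one word
    public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : val) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
[Figure: map tasks are distributed to the Hadoop data nodes of the cluster]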

MapReduce Example
Count the number of occurrences of each word
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
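The three phases of this example can be traced with a small in-memory simulation. It is a sketch in plain Python, not Hadoop code: the function names are made up for the illustration and everything runs in a single process, whereas on a cluster the map calls would run on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word of one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: bring together all values emitted for the same key."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {word: sum(counts) for word, counts in groups.items()}


lines = ["Hello World Bye World", "Hello IBM"]       # the two input splits of the slide
mapped = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle_phase(mapped))
print(sorted(result.items()))
# [('Bye', 1), ('Hello', 2), ('IBM', 1), ('World', 2)]
```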

How to Analyze Large Data Sets in Hadoop
It’s not just runtime: the development phase has to be taken into account.
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java.
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop:
– Pig
– Hive
– Jaql

Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries

Pig, Hive, Jaql – Differences
Characteristic            | Pig       | Hive                              | Jaql
Developed by              | Yahoo     | Facebook                          | IBM
Language                  | Pig Latin | HiveQL                            | Jaql
Type of language          | Data flow | Declarative (SQL dialect)         | Data flow
Data structures supported | Complex   | Better suited for structured data | JSON, semi-structured
Schema                    | Optional  | Not optional                      | Optional

Hadoop Distributions

Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights platform; a legend distinguishes IBM from open source components.
Layers: Visualization & Discovery, Applications & Development, Administration, Advanced Analytic Engines, Workload Optimization, Runtime, Data Store, File System, Integration.
Components: BigSheets, Dashboard & Visualization, Apps, Text Analytics, MapReduce, Pig & Jaql, Hive, Big SQL, R, Integrated Installer, Admin Console, Workflow, Monitoring, Management, Security, Audit & History, Lineage, Text Processing Engine & Extractor Library, Adaptive Algorithms, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Adaptive MapReduce, ZooKeeper, Lucene, Oozie, Sqoop, HCatalog, Avro, Flume, HBase, HDFS, GPFS-FPO, NameNode High Availability, JDBC, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos.]

Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
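The difference between row-oriented storage and HBase-style column families can be made concrete with a toy layout in Python. This is a sketch, not the HBase API: the sample rows and both helper functions are invented for the illustration, and only the "family:qualifier" column naming follows HBase's convention.

```python
# Two sparse rows; user2 has no "info:city" cell.
rows = {
    "user1": {"info:name": "Ana", "info:city": "Madrid", "stats:visits": 12},
    "user2": {"info:name": "Luis", "stats:visits": 3},
}


def row_oriented(rows, columns):
    """Row store: each row keeps a slot for every column (None when absent)."""
    return [
        (key,) + tuple(cells.get(column) for column in columns)
        for key, cells in rows.items()
    ]


def column_family_oriented(rows):
    """Column-family store: cells are grouped by family; absent cells take no space."""
    families = {}
    for key, cells in rows.items():
        for column, value in cells.items():
            family, qualifier = column.split(":", 1)
            families.setdefault(family, {}).setdefault(key, {})[qualifier] = value
    return families


columns = ["info:name", "info:city", "stats:visits"]
table = row_oriented(rows, columns)
families = column_family_oriented(rows)
print(table)
print(families)
```

In the row layout the missing city still occupies a slot, while in the column-family layout it is simply not stored, which is why the approach suits sparse data sets.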

Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: The R Project for Statistical Computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
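The idea behind Oozie's scheduling, running an action only once its dependencies are met, can be sketched with Python's standard `graphlib` module. This is not Oozie's workflow format or API (Oozie workflows are defined in XML); the action names are hypothetical and the sketch only shows the dependency ordering.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "count_words": {"clean_logs"},
    "build_index": {"clean_logs"},
    "publish_report": {"count_words", "build_index"},
}


def run_order(workflow):
    """Return the actions in waves; every action in a wave has all its dependencies met."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # actions that can start now
        waves.append(ready)
        sorter.done(*ready)                  # mark them finished to unlock successors
    return waves


waves = run_order(workflow)
print(waves)
# [['import_logs'], ['clean_logs'], ['build_index', 'count_words'], ['publish_report']]
```

Actions in the same wave have no dependency on each other, so a scheduler would be free to run them in parallel.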

Example of Hadoop Ecosystem
[Figure: the BigInsights ecosystem diagram shown again, this time singling out BigSheets under Visualization & Discovery]

BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new “sheets” (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM’s Emerging Technologies team

BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.

BigSheets Collection Operations
Work with the built-in “sheets” editor
Add / delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…

BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details

Example of Hadoop Ecosystem
[Figure: the BigInsights ecosystem diagram shown again, this time singling out Big SQL]

What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive SQL-92 ANSI support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow

Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other’s tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces the execution plan
– Executes and coordinates the query
[Figure: an application issues SQL through the JDBC/ODBC driver, which talks over a network protocol to the Big SQL server (SQL engine plus storage handlers for delimited files, sequence files, HBase, RDBMS, ...). In the BigInsights cluster, head nodes host the Big SQL server, the NameNode, the JobTracker and the Hive metastore; each compute node runs a TaskTracker, a DataNode and a region server. Server layout and relative sizes are for illustrative purposes only.]

Example of Hadoop Ecosystem
[Figure: the BigInsights ecosystem diagram shown again, this time singling out Text Analytics]

What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
– Multilingual support, including support for DBCS (double-byte character set) languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics engine technology for customization and development

Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No “black boxes” or modules that can’t be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
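For readers who do not know AQL, the rule above (a Number match immediately followed by a Unit match) behaves roughly like the regular expression below. This is only an analogy in Python, not how the Text Analytics engine works; the `NUMBER` and `UNIT` patterns are assumptions, since the slide does not show how the Number and Unit views are defined.

```python
import re

# Assumed stand-ins for the Number and Unit views (not shown on the slide).
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:kg|km|MB|GB|TB)"          # hypothetical unit dictionary

# A number, optional whitespace, then a unit: the analogue of <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"(?P<number>{NUMBER})\s*(?P<unit>{UNIT})\b")


def amount_with_unit(text):
    """Return every (number, unit) pair found in the text."""
    return [(m.group("number"), m.group("unit")) for m in AMOUNT_WITH_UNIT.finditer(text)]


sample = "The cluster stores 1.5 TB on disks that read 80 MB per second."
print(amount_with_unit(sample))
# [('1.5', 'TB'), ('80', 'MB')]
```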

Text Analytic Simple Example
World Cup 2010 Highlights
In the Football World Cup 2010, one team distinguished itself from the rest, winning the final. Early in the second half, the Netherlands’ striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner’s superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted information
Position | Player         | Country
Striker  | Arjen Robben   | Netherlands
Keeper   | Iker Casillas  | Spain
Winger   | Andres Iniesta | Spain

Text Analytic Real Example

One step beyond Watson

Example of Hadoop Ecosystem
[Figure: the BigInsights ecosystem diagram shown again, this time singling out R]

Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  – Partitioning of large data
  – Parallel cluster execution of R code
• Distributed machine learning
  – A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Figure: R clients and the data sources, both with IBM R packages, connected by three numbered paths: pull data (summaries) to the R client, or push R functions right onto the data (embedded R execution); scalable machine learning]
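The "partition the data, push the function to it, pull only summaries back" pattern of this slide can be sketched as follows. The sketch is in Python rather than R and does not use any Big R function; all names are invented for the illustration, and the partitions are processed in a loop where a cluster would process them in parallel.

```python
def partition(values, num_partitions):
    """Split the data into roughly equal partitions, as if spread over nodes."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def local_summary(part):
    """Runs next to the data: reduce one partition to a small (sum, count) pair."""
    return (sum(part), len(part))


def combine(summaries):
    """Runs on the client: merge the per-partition summaries into a global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


values = list(range(1, 101))                      # stand-in for a large data set
parts = partition(values, 4)
summaries = [local_summary(p) for p in parts]     # only these small pairs travel to the client
mean = combine(summaries)
print(mean)                                       # 50.5
```

Only four small pairs cross the network instead of the hundred values, which is the point of pushing the function to the data.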

Where Does Big Data Fit?
[Figure: numbered data-flow diagram linking source systems, the analytical database (DW) and analytical tools. Legible steps: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data. Labels: “Capture in case it’s needed” and “Capture only what’s needed”]

Big Data Concepts
Big Data Technology
Data Scientists

Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist.”
McKinsey Global Institute, “Big data” report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.

Data Science is Multidisciplinary

Successful Data Scientist Characteristics

Data Scientist Qualities

How Long Does It Take For a Beginner to Become a Good Data Scientist?

www.kaggle.com

Kaggle ranking

Learn Big Data
Reading materials (online)
– Understanding Big Data (free PDF book)
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x (Redbook)
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights

Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure

Key Big Data Use Cases
Big Data Exploration: Find, visualize and understand all big data to improve decision making
Enhanced 360° View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
Operations Analysis: Analyze a variety of machine data for improved business results
Data Warehouse Modernization: Integrate big data and data warehouse capabilities to increase operational efficiency
Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real time

Big Data Concepts
Big Data Technology
Data Scientists

Solution for Big Data
Data at rest
– The data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data are analyzed in real time, at the moment they are generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP solutions

Hardware improvements through the years
CPU speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM memory
– 1990: 640 KB conventional memory (256 KB extended memory recommended)
– 2000: 64 MB memory
– 2010: 8–32 GB (and more)
Disk capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk latency (speed of reads and writes): not much improvement in the last 7–10 years, currently around 70–80 MB/sec

How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
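The figures on this slide follow from dividing the data size by the combined throughput of the disks. A short worked calculation, assuming a decimal terabyte (1 TB = 1,000,000 MB), data spread evenly over the disks and perfectly parallel reads; the function name is made up for the example.

```python
def read_time_seconds(size_mb, throughput_mb_per_s, disks):
    """Time to read `size_mb` spread evenly over `disks` disks that are read in parallel."""
    return size_mb / (throughput_mb_per_s * disks)


ONE_TB_MB = 1_000_000   # assumption: decimal terabyte

for disks in (1, 10, 100, 1000):
    seconds = read_time_seconds(ONE_TB_MB, 80, disks)
    print(f"{disks:>4} disks: {seconds:>8.1f} s = {seconds / 60:.1f} min = {seconds / 3600:.2f} h")
```

One disk needs 12,500 seconds (about three and a half hours), while 1000 disks need 12.5 seconds: the read time shrinks in direct proportion to the number of disks, which is the case for parallel processing made on the next slide.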

Parallel Data Processing is the answer
It has been with us for a while
– GRID computing: spreads the processing load
– Distributed workload: hard-to-manage applications, overhead on the developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)

What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture
ndash Comprehensive SQL92 ansi support
ndash Standards compliant client drivers (JDBC amp ODBC)
ndash Efficient handling of point queries
ndash Wide variety of data sources and file formats
ndash Extensive HBase focus
ndash Open source interoperability
Our driving design goalsndash Existing queries should run with no or few modifications
ndash Existing JDBC and ODBC compliant tools should continue to function
ndash Queries should be executed as efficiently as the chosen storage
mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
The SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces an execution plan
– Executes and coordinates the query
[Diagram: an application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL server running on a head node of the BigInsights cluster. The Big SQL server contains the SQL engine and the storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights platform stack, with integration, runtime, file system, data store, application and development, administration, and visualization and discovery components]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
– Multilingual support, including support for DBCS (double-byte character set) languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM has opened up the Text Analytics engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that is run over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
  as match
  from Number N, Unit U;
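The AmountWithUnit rule above pairs a Number match with the Unit match that directly follows it. As a rough illustration of that idea only, here is a sketch in plain Java regular expressions. It is not AQL and does not use the Text Analytics engine; the NUMBER and UNIT definitions are hypothetical stand-ins, since the slide does not show how the Number and Unit views are defined.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AmountWithUnitSketch {
    // Hypothetical stand-ins for the Number and Unit views (not defined on the slide).
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:kg|km|GB|MB|TB)";

    // A number, then whitespace, then a unit: loosely mirrors <N.match> <U.match>.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b" + NUMBER + "\\s+" + UNIT + "\\b");

    // Returns every "number unit" span found in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("The node has 64 GB of RAM and 12 disks of 1 TB each."));
    }
}
```

In this sketch "12 disks" is skipped because "disks" is not in the assumed unit list, which is the same effect the declarative rule gets by requiring a Unit match right after the Number match.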
Text Analytic Simple Example
Input text:
World Cup 2010 Highlights
In the Football World Cup 2010, one team distinguished itself from the rest by winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations:
– Striker: Arjen Robben (Netherlands)
– Keeper: Iker Casillas (Spain)
– Winger: Andres Iniesta (Spain)
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights platform stack, with integration, runtime, file system, data store, application and development, administration, and visualization and discovery components]
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MapReduce (MR) programming
  – Partitioning of large data
  – Parallel cluster execution of R code
• Distributed machine learning
  – A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram: R clients with IBM R packages connect to the cluster's data sources, embedded R execution (also with IBM R packages) and scalable machine learning. Data (summaries) can be pulled to the R client, or R functions can be pushed right onto the data.]
Where Does Big Data Fit?
[Diagram: source systems, an analytical database (DW) and analytical tools, contrasting "capture only what's needed" with "capture in case it's needed". Numbered steps shown include: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data.]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading materials (online)
– Understanding Big Data (free PDF book)
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x (Redbook)
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– developerWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure
Big Data Concepts
Big Data Technology
Data Scientists
Solution for Big Data
Data at rest
– The data to analyze are already stored (structured and unstructured)
– Examples: logs, Facebook, Twitter, etc.
– Solution: Hadoop (open source)
Data in motion
– Data are analyzed in real time, at the moment they are generated, without any previous storage
– Examples: sensors, RFID, etc.
– Solution: Streams, CEP (complex event processing) solutions
Hardware improvements through the years
CPU speeds
– 1990: 44 MIPS at 40 MHz
– 2000: 3,561 MIPS at 1.2 GHz
– 2010: 147,600 MIPS at 3.3 GHz
RAM
– 1990: 640 KB conventional memory (256 KB extended memory recommended)
– 2000: 64 MB
– 2010: 8-32 GB (and more)
Disk capacity
– 1990: 20 MB
– 2000: 1 GB
– 2010: 1 TB
Disk read/write speed: not much improvement in the last 7-10 years, currently around 70-80 MB/s
How long will it take to read 1 TB of data?
1 TB (at 80 MB/s)
– 1 disk: 3.4 hours
– 10 disks: 20 min
– 100 disks: 2 min
– 1000 disks: 12 sec
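The figures above follow from dividing the data volume by the combined transfer rate of all disks. A minimal sketch of that arithmetic, assuming a decimal terabyte (10^12 bytes), 80 MB/s per disk, data spread evenly, and no coordination overhead (the class and method names are illustrative):

```java
final class DiskScanSketch {
    // Seconds needed to read totalBytes when the data is spread evenly over
    // `disks` drives that are read in parallel at bytesPerSecond each.
    static double secondsToRead(long totalBytes, long bytesPerSecond, int disks) {
        return (double) totalBytes / ((double) bytesPerSecond * disks);
    }

    public static void main(String[] args) {
        long oneTerabyte = 1_000_000_000_000L; // decimal terabyte (assumption)
        long rate = 80_000_000L;               // 80 MB/s per disk, as on the slide
        for (int disks : new int[] {1, 10, 100, 1000}) {
            System.out.println(disks + " disk(s): " + secondsToRead(oneTerabyte, rate, disks) + " s");
        }
    }
}
```

One disk needs 12,500 seconds, roughly three and a half hours, and a thousand disks need 12.5 seconds, which agrees with the rounded values on the slide.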
Parallel Data Processing is the answer
It has been with us for a while
– Grid computing: spreads the processing load
– Distributed workload: hard to manage applications, overhead on the developer
– Parallel databases: DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well suited to batch-oriented, read-intensive applications
– Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
Design principles of Hadoop
New way of storing and processing the data
– Let the system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of the operating system
  • Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Diagram: a logical file is split into blocks 1 to 4, and each block is stored on three different nodes of the cluster]
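The block distribution described above can be sketched in a few lines. This is a simplified illustration, not the HDFS implementation: the 400 MB file size, 128 MB block size, five nodes and round-robin placement are all assumptions chosen so that the result resembles the diagram (four blocks, three copies each). Real HDFS placement also takes racks and node load into account.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockPlacementSketch {
    // Number of fixed-size blocks needed to hold fileSize bytes (the last block may be partial).
    static int blockCount(long fileSize, long blockSize) {
        return (int) ((fileSize + blockSize - 1) / blockSize);
    }

    // Assign each block to `replicas` distinct nodes by walking the node list round-robin.
    // Round-robin is an assumption made for brevity; it is not the HDFS placement policy.
    static List<List<Integer>> place(int blocks, int nodes, int replicas) {
        if (replicas > nodes) {
            throw new IllegalArgumentException("need at least as many nodes as replicas");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((b + r) % nodes);
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        long fileSize = 400L * 1024 * 1024;  // hypothetical 400 MB file
        long blockSize = 128L * 1024 * 1024; // hypothetical 128 MB block size
        int blocks = blockCount(fileSize, blockSize);
        List<List<Integer>> placement = place(blocks, 5, 3);
        for (int b = 0; b < blocks; b++) {
            System.out.println("block " + (b + 1) + " -> nodes " + placement.get(b));
        }
    }
}
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held, which is the resiliency the slide refers to.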
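Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce application
1. Map phase (break job into small parts); map tasks are distributed to the Hadoop data nodes of the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
Return a single result set
Word count mapper and reducer (nested classes of a job class):
  import java.io.IOException;
  import java.util.StringTokenizer;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  // Mapper: emits <word, 1> for every token of an input line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reducer: sums the counts received for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) sum += v.get();
      result.set(sum);
      context.write(key, result);
    }
  }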
MapReduce Example
Count the number of word occurrences
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
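The three phases of this example can be traced without a cluster. The sketch below imitates map, shuffle and reduce with plain Java collections on the same two input lines. It does not use the Hadoop API, and the class and method names are illustrative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WordCountSketch {
    // Map: turn one input line into <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values of all map outputs by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int v : entry.getValue()) {
                sum += v;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Runs one map call per input line, then shuffle, then reduce.
    static Map<String, Integer> run(List<String> lines) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String line : lines) {
            mapOutputs.add(map(line));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // prints {Bye=1, Hello=2, IBM=1, World=2}
        System.out.println(run(Arrays.asList("Hello World Bye World", "Hello IBM")));
    }
}
```

The counts match the slide's final output; only the order differs, because this sketch sorts the keys alphabetically.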
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic             Pig         Hive                                Jaql
Developed by               Yahoo       Facebook                            IBM
Language                   Pig Latin   HiveQL                              Jaql
Type of language           Data flow   Declarative (SQL dialect)           Data flow
Data structures supported  Complex     Better suited for structured data   JSON, semi-structured
Schema                     Optional    Not optional                        Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights platform, with a legend distinguishing IBM from open source components]
Areas: Visualization & Discovery, Applications & Development, Administration, Integration, Advanced Analytic Engines, Workload Optimization, Runtime, Data Store, File System, Management, Security, Audit & History, Lineage, Index
Components: BigSheets, Dashboard & Visualization, Apps, Workflow, Monitoring, Admin Console, Integrated Installer, Text Analytics, Text Processing Engine & Extractor Library, Adaptive Algorithms, Big SQL, R, MapReduce, Adaptive MapReduce, Pig, Hive, Jaql, HBase, HDFS, GPFS-FPO, NameNode High Availability, HCatalog, Avro, Lucene, Oozie, ZooKeeper, Flume, Sqoop, JDBC, Splittable Text Compression, Enhanced Security, Flexible Scheduler
Related IBM products: Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
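The HBase entry above contrasts column-family storage with row-oriented storage. A toy sketch of that layout in plain Java may help: cells are grouped first by column family, so one family can be read for all rows without touching the others, and rows that lack a family simply have no entry (sparse data). This does not use the HBase API; the class, the family names and the sample values are hypothetical.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

final class ColumnFamilySketch {
    // family -> row key -> column qualifier -> value. Each family has its own sorted map,
    // standing in for the separate store that keeps a column family's cells together.
    private final Map<String, TreeMap<String, TreeMap<String, String>>> families = new TreeMap<>();

    void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    // Read one family for all rows without touching the other families.
    Map<String, TreeMap<String, String>> scanFamily(String family) {
        TreeMap<String, TreeMap<String, String>> rows = families.get(family);
        return rows == null ? Collections.<String, TreeMap<String, String>>emptyMap() : rows;
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("user1", "profile", "name", "Ana");          // hypothetical sample data
        table.put("user1", "activity", "lastLogin", "2014-03-01");
        table.put("user2", "profile", "name", "Luis");         // user2 has no "activity" cells
        System.out.println(table.scanFamily("profile"));
        System.out.println(table.scanFamily("activity"));
    }
}
```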
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
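The Lucene entry above says that an index built from the document text is what makes search fast. The underlying idea is an inverted index: a map from each term to the documents that contain it. A minimal sketch in plain Java follows. It does not use the Lucene API, and the tokenization (lower-casing and splitting on non-word characters) is an assumption made for brevity.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class InvertedIndexSketch {
    // term -> ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Break a document into lower-cased terms and record the document id under each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look a term up directly in the index instead of scanning every document.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Hadoop stores data in HDFS");
        index.add(2, "Lucene builds an index over text");
        index.add(3, "Jaql can query a Lucene index");
        System.out.println(index.search("lucene")); // prints [2, 3]
    }
}
```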
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights platform stack, with integration, runtime, file system, data store, application and development, administration, and visualization and discovery components]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with the built-in "sheets" editor
Add/delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
copy 2014 IBM Corporation22
Solution for Big Data
Rest Data
ndash Data to analyze are already stored (structured and unstructured)
ndash Examples logs facebook twitter etc
ndash Solution Hadoop (open source)
Data in motion
ndash Data are analyzed in real time just in the moment they are generated They are analyzed with any previous storage
ndash Examples Sensors RFID etc
ndash Solution Streams CEP solutions
copy 2014 IBM Corporation23
Hardware improvements through the years
CPU Speedsndash 1990 - 44 MIPS at 40 MHz
ndash 2000 - 3561 MIPS at 12 GHz ndash 2010 - 147600 MIPS at 33 GHz
RAM Memoryndash 1990 ndash 640K conventional memory (256K extended memory recommended)ndash 2000 ndash 64MB memoryndash 2010 - 8-32GB (and more)
Disk Capacityndash 1990 ndash 20MBndash 2000 - 1GBndash 2010 ndash 1TB
Disk Latency (speed of reads and writes) ndash not much improvement in last 7-10 years currently around 70 ndash 80MB sec
copy 2014 IBM Corporation24
How long it will take to read 1TB of data
1TB (at 80Mb sec)ndash 1 disk - 34 hours
ndash 10 disks - 20 min
ndash 100 disks - 2 min
ndash 1000 disks - 12 sec
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)
copy 2014 IBM Corporation26
What is Apache Hadoop
Apache Open source software framework
Flexible enterprise-class support for processing large volumes of
data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)
ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search
technology
ndash Well-suited to batch-oriented read-intensive applications
ndash Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo
ndash Nodes can be combined into clusters
ndash New nodes can be added as needed without changing
bull Data formats
bull How data is loaded
bull How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set
[Figure: map tasks are distributed to the Hadoop data nodes of the cluster; the shuffle feeds the reduce phase, which returns the result set]
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);              // emit <word, 1> for every token
    }
  }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) sum += v.get();  // add up the counts for this word
    result.set(sum);
    context.write(key, result);
  }
}
MapReduce Example
Count the number of occurrences of each word
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
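The three phases of this example can be traced with a small single-machine simulation in Python. It is only a sketch of the data flow: the function names are made up here, and nothing in it is distributed or tied to the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(mapped_outputs):
    """Shuffle: group the emitted values by key across all map tasks."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


splits = ["Hello World Bye World", "Hello IBM"]   # one split per map task
mapped = [map_phase(split) for split in splits]
result = reduce_phase(shuffle_phase(mapped))
print(sorted(result.items()))
# [('Bye', 1), ('Hello', 2), ('IBM', 1), ('World', 2)]
```

In a real cluster each call to `map_phase` would run on the node holding that split, and the shuffle would move data over the network; the grouping logic is the same.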
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
Pig, Hive, Jaql – Similarities
– Reduced program size over Java
– Applications are translated to map and reduce jobs behind the scenes
– Extension points for extending existing functionality
– Interoperability with other languages
– Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic            | Pig       | Hive                              | Jaql
Developed by              | Yahoo     | Facebook                          | IBM
Language                  | Pig Latin | HiveQL                            | Jaql
Type of language          | Data flow | Declarative (SQL dialect)         | Data flow
Data structures supported | Complex   | Better suited for structured data | JSON, semi-structured
Schema                    | Optional  | Not optional                      | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform, with a legend distinguishing IBM and open source components]
Areas: Visualization & Discovery, Applications & Development, Administration, Advanced Analytic Engines, Workload Optimization, Runtime, Data Store, File System, Integration
Components: BigSheets, Dashboard & Visualization, Apps, Workflow Monitoring, Management, Security, Audit & History, Lineage, Text Analytics, MapReduce, Pig & Jaql, Hive, Text Processing Engine & Extractor Library, Adaptive Algorithms, R, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Adaptive MapReduce, Integrated Installer, Admin Console, Big SQL, HBase, HDFS, GPFS-FPO, NameNode High Avail, Jaql, Pig, ZooKeeper, Lucene, Oozie, Sqoop, HCatalog, Avro, Flume, JDBC, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
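To make the HBase entry above more concrete, here is a minimal Python sketch contrasting a row-oriented view of some records with a grouping by column family. It does not use or imitate the HBase API; the `family:qualifier` column names, the sample rows, and the function name are assumptions for illustration.

```python
# Row-oriented view: every row carries all of its columns together.
rows = [
    {"row_key": "u1", "info:name": "Ana", "info:city": "Madrid", "stats:visits": 12},
    {"row_key": "u2", "info:name": "Luis", "stats:visits": 3},   # sparse: no city stored
]


def group_by_column_family(rows):
    """Regroup cells so that the elements of each column family sit together.

    Result shape: {family: {row_key: {qualifier: value}}}
    """
    families = {}
    for row in rows:
        for column, value in row.items():
            if column == "row_key":
                continue
            family, qualifier = column.split(":", 1)
            families.setdefault(family, {}).setdefault(row["row_key"], {})[qualifier] = value
    return families


families = group_by_column_family(rows)
print(families["info"])    # all "info" cells, for every row
print(families["stats"])   # all "stats" cells, stored apart from "info"
```

A missing attribute (the city of `u2`) simply has no cell, which is why this layout is often described as a good fit for sparse data sets.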
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
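The Lucene entry above says the index is what makes text search fast. A minimal inverted index in Python shows the principle; this is not Lucene's API or its on-disk format, and the sample documents, whitespace tokenizer, and function names are assumptions for illustration.

```python
from collections import defaultdict


def build_index(documents):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every term of the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


documents = {
    1: "Hadoop stores data in HDFS",
    2: "Lucene builds an index for text search",
    3: "Hive runs SQL queries on Hadoop data",
}
index = build_index(documents)
print(search(index, "hadoop data"))   # {1, 3}
```

A query only touches the index entries for its own terms, so lookup cost depends on the query rather than on scanning every document.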
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform overview, repeated; BigSheets (Visualization & Discovery) is the component covered next]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console – e.g., file system data, output from Web crawl, etc.
BigSheets Collection Operations
Work with built-in "sheets" editor
Add / delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform overview, repeated; Big SQL is the component covered next]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites query if necessary for improved performance
– Determines appropriate storage handler for data
– Produces execution plan
– Executes and coordinates query
[Diagram: an application sends SQL through the JDBC/ODBC driver over a network protocol to the Big SQL server (SQL engine plus storage handlers for DEL files, SEQ files, HBase, RDBMS, …) on a head node of the BigInsights cluster. Other head nodes host the Name Node, the Job Tracker, and the Hive metastore; each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform overview, repeated; Text Analytics is the component covered next]
What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
– Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that is run against the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
  as match
  from Number N, Unit U;
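For readers who do not know AQL, the intent of the AmountWithUnit rule (a number immediately followed by a unit) can be approximated with a regular expression in Python. This is only an analogy: AQL matches over tokens and views rather than raw characters, and the number pattern, the unit list, and the sample sentence below are assumptions standing in for the Number and Unit views.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"                    # stands in for the Number view
UNITS = ["MHz", "GHz", "MB", "GB", "TB"]     # stands in for the Unit view (a small dictionary)
UNIT = "|".join(re.escape(unit) for unit in UNITS)

# A number followed by optional whitespace and a unit, in the spirit of <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s*({UNIT})\b")


def amount_with_unit(text):
    """Return (amount, unit, matched text) for every number-unit pair found."""
    return [(float(m.group(1)), m.group(2), m.group(0))
            for m in AMOUNT_WITH_UNIT.finditer(text)]


sample = "In 2010 a typical CPU ran at 3.3 GHz with 8 GB of RAM and a 1TB disk."
matches = amount_with_unit(sample)
print(matches)
# [(3.3, 'GHz', '3.3 GHz'), (8.0, 'GB', '8 GB'), (1.0, 'TB', '1TB')]
```

Note that "2010" is not reported, because no unit follows it; the rule fires only when both parts appear side by side.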
Text Analytic Simple Example
Extracted (position, player, country):
  Striker: Arjen Robben (Netherlands)
  Keeper: Iker Casillas (Spain)
  Winger: Andres Iniesta (Spain)
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Text Analytic Real Example
One step beyond: Watson
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform overview, repeated; R is the component covered next]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  – Partitioning of large data
  – Parallel cluster execution of R code
• Distributed Machine Learning
  – A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram: R clients and the data sources both carry IBM R packages; data (summaries) can be pulled to the R client, or R functions pushed right on the data (embedded R execution); a scalable machine learning engine sits alongside]
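The "partition the data, run the function on each partition in parallel, combine small summaries" pattern described for Big R can be sketched generically in Python. This is not the Big R API and not R code: the function names are hypothetical, and a local thread pool merely stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, num_partitions):
    """Split the data into roughly equal contiguous partitions."""
    if not values:
        raise ValueError("no data to partition")
    size = -(-len(values) // num_partitions)      # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(chunk):
    """The function 'pushed to the data': runs on one partition, returns a small summary."""
    return (sum(chunk), len(chunk))


def distributed_mean(values, num_partitions=4):
    """Run partial_stats on every partition in parallel, then combine the summaries."""
    chunks = partition(values, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        summaries = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


data = list(range(1, 101))
mean = distributed_mean(data)
print(mean)   # 50.5
```

Only the per-partition summaries (a sum and a count) travel back to the caller, which mirrors the slide's point about pulling summaries to the client instead of the raw data.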
Where Does Big Data Fit?
[Diagram: source systems, an analytical database (DW), and analytical tools. Labels on the diagram: "Capture in case it's needed"; "Capture only what's needed"; explore data; parse, aggregate; extract, transform, load; report and mine data]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials – Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x – Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust Course Management System and Content Distribution infrastructure
Hardware improvements through the years
CPU Speeds
– 1990 – 44 MIPS at 40 MHz
– 2000 – 3,561 MIPS at 1.2 GHz
– 2010 – 147,600 MIPS at 3.3 GHz
RAM Memory
– 1990 – 640K conventional memory (256K extended memory recommended)
– 2000 – 64 MB memory
– 2010 – 8–32 GB (and more)
Disk Capacity
– 1990 – 20 MB
– 2000 – 1 GB
– 2010 – 1 TB
Disk Latency (speed of reads and writes) – not much improvement in the last 7–10 years, currently around 70–80 MB/sec
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
– 1 disk – 3.4 hours
– 10 disks – 20 min
– 100 disks – 2 min
– 1000 disks – 12 sec
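The figures above follow from dividing the data volume by the combined throughput of the disks. A short Python check, assuming a decimal terabyte (1,000,000 MB), an even spread of the data, and perfectly parallel reads (the function name is made up for this sketch):

```python
def read_time_seconds(data_mb, disks, mb_per_sec_per_disk=80.0):
    """Time to scan data_mb megabytes spread evenly over `disks` disks read in parallel."""
    return data_mb / (disks * mb_per_sec_per_disk)


ONE_TB_IN_MB = 1_000_000   # decimal terabyte, an assumption

for disks in (1, 10, 100, 1000):
    seconds = read_time_seconds(ONE_TB_IN_MB, disks)
    print(f"{disks:>4} disks: {seconds:>8.1f} s = {seconds / 60:>6.1f} min = {seconds / 3600:.2f} h")
```

One disk needs 12,500 seconds (roughly three and a half hours), while 1000 disks need 12.5 seconds, which is the case for spreading data over many inexpensive disks.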
Parallel Data Processing is the answer
It has been with us for a while
– GRID computing – spreads processing load
– Distributed workload – hard-to-manage applications, overhead on developer
– Parallel databases – DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture
ndash Comprehensive SQL92 ansi support
ndash Standards compliant client drivers (JDBC amp ODBC)
ndash Efficient handling of point queries
ndash Wide variety of data sources and file formats
ndash Extensive HBase focus
ndash Open source interoperability
Our driving design goalsndash Existing queries should run with no or few modifications
ndash Existing JDBC and ODBC compliant tools should continue to function
ndash Queries should be executed as efficiently as the chosen storage
mechanisms allow
copy 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with
Hive via the Hive metastorendash Each can query the others tables
SQL engine analyzes incoming
queriesndash Separates portion(s) to execute at
the server vs portion(s) to execute
on the cluster
ndash Re-writes query if necessary for
improved performance
ndash Determines appropriate storage
handler for data
ndash Produces execution plan
ndash Executes and coordinates query
Server layout and relative sizes
for illustrative purposes only
Application
SQL Language
JDBC ODBC Driver
BigInsights Cluster
Head Node
Big SQL Server
Head Node
Name Node
Head Node
Job TrackerNetwork Protocol
SQL Engine
Storage Handlers
Del
Files
SEQ
FilesHBase RDBMS bullbullbull
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Compute Node
Task
Tracker
Data
Node
Region
Server
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Head Node
Hive Metastore
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights platform diagram with an IBM / Open Source legend. Labels: Visualization & Discovery, Dashboard & Visualization, BigSheets, Big SQL, Text Analytics, Apps, JDBC, Applications & Development, MapReduce, Pig & Jaql, Hive, Administration, Integrated Installer, Admin Console, Workflow Monitoring, Management, Security, Audit & History, Lineage, Advanced Analytic Engines, Text Processing Engine & Extractor Library, Adaptive Algorithms, R, Workload Optimization, Runtime, Adaptive MapReduce, Flexible Scheduler, Index, Splittable Text Compression, Enhanced Security, Jaql, Pig, ZooKeeper, Lucene, Oozie, HCatalog, Sqoop, Avro, Data Store, HBase, File System, HDFS, GPFS-FPO, NameNode High Avail, Integration, Streams, Netezza, Flume, DB2, DataStage, Guardium, Platform Computing, Cognos.]
What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
- Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
  as match
  from Number N, Unit U;
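To make the idea behind the AmountWithUnit rule concrete, the following is a rough Java analogue built on java.util.regex. It is not AQL and not the System T engine: it only mimics "a number immediately followed by a unit". The class name, the method name and the tiny unit dictionary are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex stand-in for a "number followed by unit" rule; not AQL and not System T. */
final class AmountWithUnitSketch {
    // Hypothetical unit dictionary; a real extractor would use a far larger one.
    private static final Pattern AMOUNT_WITH_UNIT =
        Pattern.compile("\\b(\\d+(?:\\.\\d+)?)\\s+(kg|km|MB|GB|TB|percent)\\b");

    /** Returns every "number unit" span found in the text, in order of appearance. */
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = AMOUNT_WITH_UNIT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(extract("Read 1 TB from 100 disks at 80 MB per second"));
    }
}
```

In this sketch "100 disks" is skipped because "disks" is not in the unit dictionary, which is the same role the Unit view plays in the AQL rule.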
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted entities:
- Striker: Arjen Robben (Netherlands)
- Keeper: Iker Casillas (Spain)
- Winger: Andres Iniesta (Spain)
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram showing IBM and open source components for visualization and discovery, applications and development, administration, advanced analytic engines, runtime, data store, file system and integration.]
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  - Partitioning of large data
  - Parallel cluster execution of R code
• Distributed Machine Learning
  - A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Figure: Big R architecture with three numbered paths between R clients and the cluster. Labels: R Clients, IBM R Packages (on both sides), Data Sources, Embedded R Execution, Scalable Machine Learning. Captions: "Pull data (summaries) to R client" and "Or push R functions right on the data".]
Where Does Big Data Fit?
[Figure: data flow between source systems, the analytical database (DW) and analytical tools. Numbered steps visible: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data. Captions: "Capture in case it's needed" and "Capture only what's needed".]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data - Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks: forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises - zero setup
Robust Course Management System and Content Distribution infrastructure
How long will it take to read 1 TB of data?
1 TB (at 80 MB/sec)
- 1 disk - 3.4 hours
- 10 disks - 20 min
- 100 disks - 2 min
- 1000 disks - 12 sec
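These figures follow from dividing the data size by the aggregate throughput of the disks. A minimal sketch of that arithmetic, assuming 1 TB = 10^12 bytes, 80 MB/s = 80 x 10^6 bytes per second per disk and perfectly parallel reads with no overhead (the class and method names are illustrative only):

```java
import java.util.Locale;

/** Back-of-the-envelope scan time for 1 TB spread evenly over N disks. */
final class ScanTimeSketch {
    // Assumptions: decimal units, perfect parallelism, no seek or network cost.
    static final double TERABYTE = 1e12;          // bytes
    static final double BYTES_PER_SECOND = 80e6;  // per disk

    /** Seconds needed when the data is split evenly across the given number of disks. */
    static double scanSeconds(int disks) {
        return TERABYTE / (BYTES_PER_SECOND * disks);
    }

    public static void main(String[] args) {
        for (int disks : new int[] {1, 10, 100, 1000}) {
            System.out.printf(Locale.ROOT, "%4d disk(s): %.1f s%n", disks, scanSeconds(disks));
        }
    }
}
```

Under these assumptions one disk needs 12,500 seconds (about 3.5 hours) and 1000 disks need 12.5 seconds, in line with the rounded figures on the slide.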
Parallel Data Processing is the answer
It has been with us for a while
- GRID computing - spreads processing load
- Distributed workload - hard to manage applications, overhead on developer
- Parallel databases - DB2 DPF, Teradata, Netezza, etc. (distribute the data)
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
- Inspired by Google technologies (MapReduce, GFS, BigTable, ...)
- Initiated at Yahoo
  • Originally built to address scalability problems of Nutch, an open source Web search technology
- Well-suited to batch-oriented, read-intensive applications
- Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
- CPU + local disks = "node"
- Nodes can be combined into clusters
- New nodes can be added as needed without changing
  • Data formats
  • How data is loaded
  • How jobs are written
Design principles of Hadoop
New way of storing and processing the data
- Let system handle most of the issues automatically
  • Failures
  • Scalability
  • Reduce communications
  • Distribute data and processing power to where the data is
  • Make parallelism part of operating system
  • Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
- Massive amounts of data through parallelism
- A variety of data (structured, unstructured, semi-structured)
- Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
- Data is stored across the entire cluster (multiple nodes)
- Programs are brought to the data, not the data to the program
- Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
- The entire cluster participates in the file system
- Blocks of a single file are distributed across the cluster
- A given block is typically replicated as well for resiliency
[Figure: a logical file, drawn as a bit string, is split into blocks 1 to 4; each block is shown three times across the nodes of the cluster.]
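The block and replica idea can be sketched in a few lines of Java. This is a toy model under stated assumptions: the 128 MB block size, the five-node cluster and the round-robin placement are chosen for illustration, and real HDFS placement is rack-aware and more involved. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of splitting a file into blocks and replicating them across nodes. */
final class BlockPlacementSketch {
    /** Number of fixed-size blocks needed for a file (the last block may be partial). */
    static int blockCount(long fileBytes, long blockBytes) {
        return (int) ((fileBytes + blockBytes - 1) / blockBytes);
    }

    /**
     * Assigns each block to the requested number of distinct nodes, round-robin.
     * Illustration only: this is not the HDFS placement policy.
     */
    static List<List<Integer>> place(int blocks, int nodes, int replicas) {
        if (replicas > nodes) {
            throw new IllegalArgumentException("replicas must not exceed nodes");
        }
        List<List<Integer>> placement = new ArrayList<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < replicas; r++) {
                holders.add((b + r) % nodes); // consecutive nodes, wrapping around
            }
            placement.add(holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        // Assumed sizes: a 400 MB file cut into 128 MB blocks gives 4 blocks.
        int blocks = blockCount(400L << 20, 128L << 20);
        System.out.println(place(blocks, 5, 3));
    }
}
```

With three replicas per block, any single node can fail and every block is still available on two other nodes, which is the resiliency the slide refers to.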
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Map tasks are distributed to the Hadoop data nodes in the cluster, and the job returns a single result set
// Nested in a driver class; needs java.io.IOException, java.util.StringTokenizer, org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.*
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);  // emit (word, 1)
    }
  }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) sum += v.get();
    result.set(sum);
    context.write(key, result);  // emit (word, total)
  }
}
MapReduce Example
Count the number of word occurrences
Entry data
  Hello World Bye World
  Hello IBM
Map process
  Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
  Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
  <Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
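The same word count can be traced without a cluster. The sketch below imitates the three steps in memory with plain Java collections; it uses no Hadoop classes, and the class and method names are invented for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** In-memory imitation of the map, shuffle and reduce steps for word count. */
final class WordCountSketch {
    /** Map: emit a (word, 1) pair for every token in one input split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle: group the emitted values by key, with keys kept in sorted order. */
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    /** Reduce: sum the grouped values for each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    /** Runs one map call per split, then shuffles and reduces the combined output. */
    static Map<String, Integer> wordCount(String... splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        System.out.println(wordCount("Hello World Bye World", "Hello IBM"));
    }
}
```

Each call to map stands in for one map task, so the two input lines play the role of Map 1 and Map 2 on the slide.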
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql
Pig, Hive, Jaql - Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql - Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram showing IBM and open source components for visualization and discovery, applications and development, administration, advanced analytic engines, runtime, data store, file system and integration.]
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
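The HBase entry above describes sparse data grouped into column families. One way to picture the logical data model is a nested sorted map from row key to column family to column qualifier. The sketch below is only that picture in plain Java: it does not use the HBase client API, it says nothing about how HBase lays data out on disk, and all names in it are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy sparse table: row key -> column family -> column qualifier -> value. */
final class ColumnFamilySketch {
    private final Map<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    /** Stores one cell, creating the row and family on first use. */
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(qualifier, value);
    }

    /** Returns the stored value, or null when the cell was never written. */
    String get(String rowKey, String family, String qualifier) {
        Map<String, Map<String, String>> families = rows.get(rowKey);
        if (families == null) {
            return null;
        }
        Map<String, String> columns = families.get(family);
        return columns == null ? null : columns.get(qualifier);
    }

    /** Number of cells actually stored for a row; cells never written take no space. */
    int cellCount(String rowKey) {
        int n = 0;
        Map<String, Map<String, String>> families = rows.get(rowKey);
        if (families != null) {
            for (Map<String, String> columns : families.values()) {
                n += columns.size();
            }
        }
        return n;
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("user1", "info", "name", "Ana");
        table.put("user1", "contact", "email", "ana@example.com");
        table.put("user2", "info", "name", "Luis");
        System.out.println(table.get("user1", "info", "name"));
        System.out.println(table.cellCount("user2"));
    }
}
```

Rows need not share the same columns, which is why this kind of model suits sparse data sets: user2 simply has no "contact" family rather than a row full of empty values.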
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
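The Lucene entry above says that an index is what makes text search fast. The underlying idea, an inverted index from terms to the documents containing them, can be sketched in plain Java. This is not the Lucene API and omits everything a real engine adds (fields, scoring, analyzers); the class and method names are invented for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Minimal inverted index: term -> ids of the documents that contain it. */
final class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    /** Splits text on anything that is not a letter or digit and records each lower-cased term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Looks a single term up in the index instead of scanning every document. */
    Set<Integer> search(String term) {
        Set<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : new TreeSet<>(hits);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Hadoop stores big data");
        index.add(2, "Lucene indexes text data");
        System.out.println(index.search("data"));
    }
}
```

A query touches only the entry for its term, so lookup cost depends on the number of matching documents rather than on the size of the whole collection.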
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram showing IBM and open source components for visualization and discovery, applications and development, administration, advanced analytic engines, runtime, data store, file system and integration.]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through the BigInsights Web console - e.g., file system data, output from Web crawl, etc.
BigSheets Collection Operations
Work with built-in "sheets" editor
Add / delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
...
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Figure: IBM InfoSphere BigInsights ecosystem diagram showing IBM and open source components for visualization and discovery, applications and development, administration, advanced analytic engines, runtime, data store, file system and integration.]
copy 2014 IBM Corporation25
Parallel Data Processing is the answer
It was with us for a whilendash GRID computing - spreads processing load
ndash Distributed workload - hard to manage applications overhead on
developer
ndash Parallel databases ndash DB2 DPF Teradata Netezza etc (distribute the
data)
copy 2014 IBM Corporation26
What is Apache Hadoop
Apache Open source software framework
Flexible enterprise-class support for processing large volumes of
data ndash Inspired by Google technologies (MapReduce GFS BigTable hellip)
ndash Initiated at Yahoobull Originally built to address scalability problems of Nutch an open source Web search
technology
ndash Well-suited to batch-oriented read-intensive applications
ndash Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes
of data in a highly parallel cost effective mannerndash CPU + local disks = ldquonoderdquo
ndash Nodes can be combined into clusters
ndash New nodes can be added as needed without changing
bull Data formats
bull How data is loaded
bull How jobs are written
copy 2014 IBM Corporation27
Design principles of Hadoop New way of storing and processing the data
ndash Let system handle most of the issues automaticallybull Failuresbull Scalabilitybull Reduce communications bull Distribute data and processing power to where the data isbull Make parallelism part of operating systembull Meant for heterogeneous commodity hardware
Bring processing to Data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handlendash Massive amounts of data through parallelism
ndash A variety of data (structured unstructured semi-structured)
ndash Using inexpensive commodity hardware
Reliability provided through replication
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
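The idea of a schema that travels with the file and is checked on every write can be sketched in plain Python. This is only an illustration: it does not use the Avro APIs, and `SCHEMA`, `PY_TYPES`, `validate`, and `write_record` are hypothetical names invented for the example.

```python
import json

# Hypothetical schema in the spirit of an Avro record schema (not the Avro API).
SCHEMA = {
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": "string"},
        {"name": "tags", "type": "array"},  # stands in for a complex type
    ],
}

# Map schema type names to the Python types accepted for them.
PY_TYPES = {"int": int, "string": str, "array": list}


def validate(record, schema):
    """Raise ValueError if the record does not match the schema."""
    for field in schema["fields"]:
        name, type_name = field["name"], field["type"]
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], PY_TYPES[type_name]):
            raise ValueError(f"field {name} is not of type {type_name}")


def write_record(lines, record, schema):
    """Append a record to an in-memory 'file', validating it first."""
    if not lines:
        lines.append(json.dumps(schema))  # the schema is stored in the file itself
    validate(record, schema)
    lines.append(json.dumps(record))


lines = []
write_record(lines, {"id": 1, "email": "a@example.com", "tags": ["new"]}, SCHEMA)
```

A reader of such a file can recover the schema from the first line before interpreting the records, which is the property the Avro description emphasizes.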
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach differs from a row-oriented relational database, where all columns of a row are stored together.
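The difference between the two layouts can be sketched with a toy data set. This is a simplified illustration, not HBase code: the `family:qualifier` column names, the sample rows, and both helper functions are assumptions made for the example, and details of real storage such as versions and ordering are ignored.

```python
from collections import defaultdict

# Sample rows; the second one is sparse (it has no city).
rows = [
    {"key": "u1", "info:name": "Ana", "info:city": "Madrid", "stats:visits": 3},
    {"key": "u2", "info:name": "Ben", "stats:visits": 7},
]


def row_oriented(rows):
    """One tuple per row: all columns of a row sit together (None fills gaps)."""
    columns = ["info:name", "info:city", "stats:visits"]
    return [(r["key"],) + tuple(r.get(c) for c in columns) for r in rows]


def family_oriented(rows):
    """One store per column family: cells of a family sit together."""
    stores = defaultdict(list)
    for r in rows:
        for column, value in r.items():
            if column == "key":
                continue
            family, qualifier = column.split(":")
            stores[family].append((r["key"], qualifier, value))
    return dict(stores)


row_layout = row_oriented(rows)
family_layout = family_oriented(rows)
```

In the family-oriented layout the missing city simply has no cell, which hints at why this style suits sparse data sets.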
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which the Hive service breaks down into MapReduce jobs that are then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene and forms the basis of its rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
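Why an index makes text search fast can be illustrated with a toy inverted index. This sketch is plain Python and is not the Lucene API; `build_index`, `search`, and the sample documents are hypothetical, and real Lucene adds analysis, scoring, and on-disk structures that are left out here.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "Hadoop stores data in HDFS",
    2: "Lucene builds an index for text search",
    3: "Hadoop and Lucene can work together",
}
index = build_index(docs)
hits = search(index, "hadoop lucene")
```

A query only touches the entries for its own terms, so its cost depends on how many documents contain those terms rather than on the size of the whole collection.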
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
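Scheduling actions once their dependencies are met can be sketched with the standard library's `graphlib` module (Python 3.9 or later). This is not how Oozie workflows are written; the `workflow` dictionary, its action names, and `run_workflow` are hypothetical and only illustrate the dependency-driven ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "count_words": {"clean_logs"},
    "build_report": {"count_words", "clean_logs"},
}


def run_workflow(workflow, run_action):
    """Run every action once, only after all its dependencies have finished."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # also raises CycleError on circular dependencies
    order = []
    while sorter.is_active():
        for action in sorter.get_ready():  # every dependency is already done
            run_action(action)
            order.append(action)
            sorter.done(action)
    return order


executed = run_workflow(workflow, lambda name: None)
```

In this example the dependencies form a chain, so there is only one valid execution order; with independent actions, several could become ready at the same time and run in parallel.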
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL databases) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights ecosystem diagram shown earlier, repeated with BigSheets (Visualization & Discovery) called out]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over big data in Hadoop
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team.
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with the built-in "sheets" editor
Add or delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights ecosystem diagram shown earlier, repeated with Big SQL called out]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem:
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals:
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
The SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces an execution plan
– Executes and coordinates the query
[Figure: an application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL Server on a head node of the BigInsights cluster. The Big SQL Server contains the SQL engine and storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker, and the Hive Metastore; each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights ecosystem diagram shown earlier, repeated with Text Analytics called out]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
– Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions rely on locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
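For readers who do not know AQL, the intent of the AmountWithUnit view (a number match immediately followed by a unit match) can be approximated with a regular expression in Python. This is a rough analogy, not AQL semantics: the `NUMBER` and `UNIT` patterns are hypothetical stand-ins for the Number and Unit views, whose definitions the slide does not show.

```python
import re

# Hypothetical stand-ins for the Number and Unit views.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:kg|km|GB|TB)"

# A Number match followed by a Unit match, separated by optional whitespace.
AMOUNT_WITH_UNIT = re.compile(rf"({NUMBER})\s*({UNIT})\b")


def amounts_with_units(text):
    """Return (number, unit) pairs found in the text, in order."""
    return AMOUNT_WITH_UNIT.findall(text)


found = amounts_with_units("The cluster holds 2.5 TB across 40 nodes and grew 300 GB.")
```

Here "40 nodes" is skipped because "nodes" is not one of the assumed units, which mirrors how the view only matches when both parts are present.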
Text Analytic Simple Example
Extracted results (position | player | country):
Striker | Arjen Robben | Netherlands
Keeper | Iker Casillas | Spain
Winger | Andres Iniesta | Spain
Source text: World Cup 2010 Highlights
"Football World Cup 2010, one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win."
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights ecosystem diagram shown earlier, repeated with R called out]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigms
• Scale out R with MapReduce (MR) programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed machine learning
– A scalable statistics engine that provides canned algorithms and the ability to author new ones, all via R
[Figure: R clients and the IBM R packages sit over the data sources, with three numbered paths: pull data (summaries) to the R client, or push R functions right onto the data (embedded R execution), plus scalable machine learning]
Where Does Big Data Fit?
[Figure: source systems, an analytical database (DW), and analytical tools. Steps shown: 1. Extract, transform, load ("Capture only what's needed"); 5. Explore data and 6. Parse, aggregate ("Capture in case it's needed"); 9. Report and mine data]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big Data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your own place and pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure
What is Apache Hadoop?
Apache open source software framework
Flexible, enterprise-class support for processing large volumes of data
– Inspired by Google technologies (MapReduce, GFS, BigTable, …)
– Initiated at Yahoo
• Originally built to address scalability problems of Nutch, an open source Web search technology
– Well-suited to batch-oriented, read-intensive applications
– Supports a wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost-effective manner
– CPU + local disks = "node"
– Nodes can be combined into clusters
– New nodes can be added as needed without changing
• Data formats
• How data is loaded
• How jobs are written
Design principles of Hadoop
New way of storing and processing the data
– Let the system handle most of the issues automatically
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Meant for heterogeneous commodity hardware
Bring processing to the data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well for resiliency
[Figure: a logical file is split into blocks 1 to 4, and each block is stored as three copies spread across the cluster]
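The block-and-replica idea can be sketched in a few lines of Python. This is an illustration under simple assumptions, not HDFS code: the round-robin placement, the node names, and all three helper functions are hypothetical, and HDFS's real placement policy is more elaborate.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative policy only; it is not HDFS's actual placement algorithm.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks still readable after one node fails."""
    return [b for b, holders in placement.items()
            if any(node != failed_node for node in holders)]


blocks = split_into_blocks(b"x" * 10, block_size=3)  # 10 bytes -> 4 blocks
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4", "n5"])
```

Because every block lives on three different nodes, losing any single node in this sketch leaves all blocks readable, which is the resiliency the slide refers to.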
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)
Return a single result set
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);  // emit <word, 1> for every token
    }
  }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) sum += v.get();  // add up the counts for this word
    result.set(sum);
    context.write(key, result);  // emit <word, total>
  }
}
[Figure: map tasks are distributed to the Hadoop data nodes of the cluster; the shuffle passes their output to the reduce step, which returns the result set]
MapReduce Example
Count the number of occurrences of each word
Entry data:
Hello World Bye World
Hello IBM
Map process:
Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output):
<Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
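The three stages of this example can be reproduced on a single machine with a short Python sketch. It only imitates the data flow; it does not use Hadoop, and the function names `map_phase`, `shuffle`, and `reduce_phase` are chosen for the illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["Hello World Bye World", "Hello IBM"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
```

On a cluster, each input line (or split) would go to a different map task and the grouped keys would be spread over several reducers; the logic of each stage stays the same.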
Learn Big Data Technologies
BigDataUniversitycom
Flexible on-line delivery
allows learning your place
and your pace
Free courses free study
materials
Cloud-based sandbox for
exercises ndash zero setup
Robust Course
Management System and
Content Distribution
infrastructure
copy 2014 IBM Corporation65
65
Design principles of Hadoop
New way of storing and processing the data
– Let the system handle most of the issues automatically
• Failures
• Scalability
• Reduce communications
• Distribute data and processing power to where the data is
• Make parallelism part of the operating system
• Meant for heterogeneous commodity hardware
Bring processing to data
Hadoop = HDFS + MapReduce infrastructure
Optimized to handle
– Massive amounts of data through parallelism
– A variety of data (structured, unstructured, semi-structured)
– Using inexpensive commodity hardware
Reliability provided through replication
What is the Hadoop Distributed File System?
Driving principles
– Data is stored across the entire cluster (multiple nodes)
– Programs are brought to the data, not the data to the program
– Follows the divide-and-conquer paradigm
Data is stored across the entire cluster (the DFS)
– The entire cluster participates in the file system
– Blocks of a single file are distributed across the cluster
– A given block is typically replicated as well, for resiliency
[Figure: a logical file (a stream of bits) is split into blocks 1 to 4; in the cluster, each block appears three times, on different nodes]
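The block layout described on this slide can be illustrated with a small sketch. It is not HDFS code: the class name BlockPlacementSketch, the tiny block size, the node names and the round-robin placement are all assumptions made for illustration (real HDFS uses far larger blocks and its own placement policy).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of splitting a file into blocks and
// replicating each block on several nodes. Names and policy are hypothetical.
class BlockPlacementSketch {

    // Split the file content into fixed-size blocks (the last one may be shorter).
    static List<String> splitIntoBlocks(String content, int blockSize) {
        List<String> blocks = new ArrayList<>();
        for (int start = 0; start < content.length(); start += blockSize) {
            int end = Math.min(start + blockSize, content.length());
            blocks.add(content.substring(start, end));
        }
        return blocks;
    }

    // Assign each block to `replication` distinct nodes, round-robin.
    // Returns node name -> block numbers (1-based) stored on that node.
    static Map<String, List<Integer>> place(int blockCount, List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("replication exceeds number of nodes");
        }
        Map<String, List<Integer>> layout = new LinkedHashMap<>();
        for (String node : nodes) {
            layout.put(node, new ArrayList<>());
        }
        for (int block = 0; block < blockCount; block++) {
            for (int copy = 0; copy < replication; copy++) {
                String node = nodes.get((block + copy) % nodes.size());
                layout.get(node).add(block + 1);
            }
        }
        return layout;
    }

    public static void main(String[] args) {
        // A 22-character "file" with a block size of 6 yields 4 blocks.
        List<String> blocks = splitIntoBlocks("1011010010100100111001", 6);
        Map<String, List<Integer>> layout =
                place(blocks.size(), List.of("node1", "node2", "node3", "node4"), 3);
        layout.forEach((node, stored) -> System.out.println(node + " stores blocks " + stored));
    }
}
```

With three copies of every block on different nodes, the loss of any single node still leaves two copies of each block, which is the resiliency argument made on the slide.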
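Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce application
1. Map phase (break the job into small parts); map tasks are distributed to the Hadoop data nodes of the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
Return a single result set

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map phase: emit <word, 1> for every token of the input value
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text val, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(val.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  // Reduce phase: sum the counts received for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> val, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : val) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}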
MapReduce Example
Count the number of word occurrences
Entry data
Hello World Bye World
Hello IBM
Map process
Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output)
<Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
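The three processes of this example can be traced in a few lines of plain Java. This is only a single-JVM simulation and does not use the Hadoop API; the class WordCountSimulation and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates map, shuffle and reduce in one JVM,
// without the Hadoop API. All names here are hypothetical.
class WordCountSimulation {

    // Map: emit a <word, 1> pair for every token of one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values of all map outputs by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce: sum the grouped values of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    // Run the whole pipeline: one map call per input split, then shuffle and reduce.
    static Map<String, Integer> run(List<String> splits) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(map(split));
        }
        return reduce(shuffle(mapOutputs));
    }

    public static void main(String[] args) {
        // Prints {Bye=1, Hello=2, IBM=1, World=2}
        System.out.println(run(List.of("Hello World Bye World", "Hello IBM")));
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the shuffle moves data over the network; here everything is sequential so that the data flow is easy to follow.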
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
– Pig
– Hive
– Jaql
Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql – Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights, with a legend distinguishing IBM from open source components]
Areas: Visualization & Discovery, Applications & Development, Administration, Advanced Analytic Engines, Workload Optimization, Runtime, File System, Data Store, Integration, Management, Security
Components: BigSheets, Dashboard & Visualization, Apps, Workflow Monitoring, Text Analytics, Text Processing Engine & Extractor Library, Big SQL, R, Pig, Hive, Jaql, MapReduce, Adaptive MapReduce, Adaptive Algorithms, Flexible Scheduler, Index, Splittable Text Compression, Enhanced Security, Integrated Installer, Admin Console, Audit & History, Lineage, ZooKeeper, Oozie, Lucene, HCatalog, Avro, HDFS, GPFS-FPO, NameNode High Availability, HBase, JDBC, Sqoop, Flume, Streams, Netezza, DB2, DataStage, Guardium, Platform Computing, Cognos
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
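The column-family idea described for HBase can be pictured with a toy in-memory model. This is not the HBase client API: the class ColumnFamilySketch, its methods and the sample family names are assumptions made purely for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy in-memory model of column-family storage.
// This is not the HBase client API; every name here is hypothetical.
class ColumnFamilySketch {

    // family -> row key -> column qualifier -> value
    private final Map<String, Map<String, Map<String, String>>> families = new TreeMap<>();

    // Store one cell; only cells that exist take up space (good for sparse data).
    void put(String rowKey, String family, String qualifier, String value) {
        families.computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(rowKey, r -> new TreeMap<>())
                .put(qualifier, value);
    }

    // Reading one family touches only the data stored for that family.
    Map<String, Map<String, String>> scanFamily(String family) {
        return families.getOrDefault(family, new TreeMap<>());
    }

    // Look up a single cell; returns null when the cell was never written.
    String get(String rowKey, String family, String qualifier) {
        Map<String, String> row = scanFamily(family).get(rowKey);
        return row == null ? null : row.get(qualifier);
    }

    public static void main(String[] args) {
        ColumnFamilySketch table = new ColumnFamilySketch();
        table.put("row1", "info", "name", "Ada");
        table.put("row1", "stats", "visits", "3");
        table.put("row2", "info", "name", "Bob"); // row2 has no "stats" cells at all
        System.out.println("info family:  " + table.scanFamily("info"));
        System.out.println("stats family: " + table.scanFamily("stats"));
    }
}
```

Because the outermost key is the family, a scan of one family never reads the cells of another, whereas a row-oriented layout would keep all columns of row1 side by side.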
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization, and group services.
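The Oozie idea of running an action only once its dependencies are met can be sketched as a small dependency-ordered scheduler. This does not use Oozie or its workflow definition format; the class WorkflowSketch, its methods and the action names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: orders actions so that each runs after its dependencies.
// This is not the Oozie API; every name here is hypothetical.
class WorkflowSketch {

    // action -> set of actions it depends on
    private final Map<String, Set<String>> dependencies = new LinkedHashMap<>();

    void addAction(String name, String... dependsOn) {
        dependencies.put(name, new LinkedHashSet<>(List.of(dependsOn)));
        for (String dep : dependsOn) {
            dependencies.putIfAbsent(dep, new LinkedHashSet<>());
        }
    }

    // Returns an order in which every action follows all of its dependencies.
    List<String> executionOrder() {
        Map<String, Integer> unmet = new LinkedHashMap<>();
        dependencies.forEach((action, deps) -> unmet.put(action, deps.size()));

        Deque<String> ready = new ArrayDeque<>();
        unmet.forEach((action, count) -> {
            if (count == 0) {
                ready.add(action);
            }
        });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String done = ready.poll();
            order.add(done);
            // Every action waiting on `done` now has one dependency fewer.
            for (Map.Entry<String, Set<String>> entry : dependencies.entrySet()) {
                if (entry.getValue().contains(done)) {
                    int left = unmet.merge(entry.getKey(), -1, Integer::sum);
                    if (left == 0) {
                        ready.add(entry.getKey());
                    }
                }
            }
        }
        if (order.size() != dependencies.size()) {
            throw new IllegalStateException("cyclic dependencies");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowSketch workflow = new WorkflowSketch();
        workflow.addAction("import");
        workflow.addAction("clean", "import");
        workflow.addAction("aggregate", "clean");
        workflow.addAction("report", "aggregate", "clean");
        // Prints [import, clean, aggregate, report]
        System.out.println(workflow.executionOrder());
    }
}
```

Time-based or data-arrival triggers, which the slide also mentions, are left out; the sketch covers only the dependency ordering.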
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console, e.g. file system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with the built-in "sheets" editor
– Add or delete columns
– Filter data
– Specify formulas to compute new values using spreadsheet-style syntax
– Apply built-in or custom macro functions
– …
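The collection operations listed above (filtering rows, adding a column computed by a formula) can be mimicked on a small in-memory table. This says nothing about how BigSheets is implemented; the class SheetSketch, its methods and the sample columns are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative only: a tiny in-memory "sheet" with filter and formula columns.
// This is not BigSheets code; every name here is hypothetical.
class SheetSketch {

    private final List<Map<String, Object>> rows = new ArrayList<>();

    SheetSketch addRow(Map<String, Object> row) {
        rows.add(new LinkedHashMap<>(row));
        return this;
    }

    // Filter data: derive a new sheet holding only the rows that match.
    SheetSketch filter(Predicate<Map<String, Object>> keep) {
        SheetSketch result = new SheetSketch();
        for (Map<String, Object> row : rows) {
            if (keep.test(row)) {
                result.addRow(row);
            }
        }
        return result;
    }

    // Formula: derive a new sheet with one extra column computed per row.
    SheetSketch addColumn(String name, Function<Map<String, Object>, Object> formula) {
        SheetSketch result = new SheetSketch();
        for (Map<String, Object> row : rows) {
            Map<String, Object> copy = new LinkedHashMap<>(row);
            copy.put(name, formula.apply(row));
            result.rows.add(copy);
        }
        return result;
    }

    List<Map<String, Object>> rows() {
        return rows;
    }

    public static void main(String[] args) {
        SheetSketch sales = new SheetSketch()
                .addRow(Map.of("product", "A", "units", 3, "price", 2.5))
                .addRow(Map.of("product", "B", "units", 1, "price", 10.0));

        SheetSketch enriched = sales
                .filter(row -> (Integer) row.get("units") >= 2)
                .addColumn("revenue", row -> (Integer) row.get("units") * (Double) row.get("price"));

        for (Map<String, Object> row : enriched.rows()) {
            System.out.println(row.get("product") + " revenue = " + row.get("revenue"));
        }
    }
}
```

Each operation returns a new sheet instead of changing the original, mirroring the slide's description of generating new "sheets" from existing collections.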
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive SQL-92 ANSI support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
The SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces the execution plan
– Executes and coordinates the query
[Diagram, server layout and relative sizes for illustrative purposes only: an application issues SQL through the JDBC/ODBC driver, which reaches the Big SQL server over a network protocol. Within the BigInsights cluster, head nodes host the Big SQL server (SQL engine plus storage handlers for Del files, SEQ files, HBase, RDBMS, ...), the Name Node, the Job Tracker and the Hive metastore; each compute node runs a Task Tracker, a Data Node and a Region Server.]
What is Text Analytics?
High-performance and scalable rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance

create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
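For readers who do not know AQL, the intent of the AmountWithUnit rule (a number immediately followed by a unit) can be approximated with a regular expression in Java. This is only an analogy, not how the AQL engine works: the class AmountWithUnitSketch is hypothetical, and the NUMBER and UNIT expressions are stand-ins for the Number and Unit views, whose definitions the slide does not show.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a regex analogue of the AmountWithUnit rule.
// This is not AQL and not the Text Analytics engine; names are hypothetical.
class AmountWithUnitSketch {

    // Hypothetical stand-ins for the Number and Unit views used by the rule.
    private static final String NUMBER = "\\d+(?:\\.\\d+)?";
    private static final String UNIT = "(?:kg|km|GB|TB|USD|EUR)";

    // A number, then whitespace, then a unit: the sequence the rule describes.
    private static final Pattern AMOUNT_WITH_UNIT =
            Pattern.compile("\\b(" + NUMBER + ")\\s+(" + UNIT + ")\\b");

    // Return every "number unit" span found in the text, in order of appearance.
    static List<String> extract(String text) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = AMOUNT_WITH_UNIT.matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "The cluster stores 15 TB on disks that cost 120 USD each, about 2.5 kg per unit.";
        // Prints [15 TB, 120 USD, 2.5 kg]
        System.out.println(extract(text));
    }
}
```

The declarative rule composes two existing views, while the regex hard-codes both; that difference is what the slide means by being abstracted from the programmatic details.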
Text Analytic Simple Example
Annotations: Netherlands, Striker, Arjen Robben; Keeper, Spain, Iker Casillas; Winger, Andres Iniesta, Spain
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished well
from the rest, winning the final. Early in the second
half, Netherlands' striker Arjen Robben had a chance
to score, but the awesome keeper for Spain, Iker
Casillas, made the save. Winner superiority was
reflected when winger Andres Iniesta scored for Spain
for the win.
Text Analytic Real Example
One step beyond Watson
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed machine learning
– A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram labels: R Clients; IBM R Packages; Data Sources; Embedded R Execution; Scalable Machine Learning; "Pull data (summaries) to R client, or push R functions right on the data"; steps numbered 1 to 3]
Where Does Big Data Fit?
[Diagram labels: Source Systems; Analytical database (DW); Analytical tools; 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data; "Capture in case it's needed" versus "Capture only what's needed"]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure
copy 2014 IBM Corporation65
65
copy 2014 IBM Corporation28
What is the Hadoop Distributed File System Driving principals
ndash Data is stored across the entire cluster (multiple nodes)
ndash Programs are brought to the data not the data to the program
ndash Follows the Divide and Conquer paradigm
Data is stored across the entire cluster (the DFS)
ndash The entire cluster participates in the file system
ndash Blocks of a single file are distributed across the cluster
ndash A given block is typically replicated as well for resiliency
10110100101001001110011111100101001110100101001011001001010100110001010010111010111010111101101101010110100101010010101010101110010011010111
0100
Logical File
1
2
3
4
Blocks
1
Cluster
1
1
2
22
3
3
34
44
copy 2011 IBM Corporation29
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1 Map Phase(break job into small parts)
2 Shuffle(transfer interim output
for final processing)
3 Reduce Phase(boil all output down to
a single result set)
Return a single result setResult Set
Shuffle
public static class TokenizerMapper extends MapperltObjectTextTextIntWritablegt
private final static IntWritableone = new IntWritable(1)
private Text word = new Text()
public void map(Object key Text val ContextStringTokenizer itr =
new StringTokenizer(valtoString())while (itrhasMoreTokens()) wordset(itrnextToken())
contextwrite(word one)
public static class IntSumReducer extends ReducerltTextIntWritableTextIntWrita
private IntWritable result = new IntWritable()
public void reduce(Text keyIterableltIntWritablegt val Context context)int sum = 0for (IntWritable v val)
sum += vget()
Distribute map
tasks to cluster
Hadoop Data Nodes
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive ANSI SQL-92 support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
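The design goal that existing queries run unchanged can be illustrated with a plain SQL-92 style aggregate issued through a standard driver interface. This is only a sketch: Python's built-in sqlite3 module stands in for the SQL endpoint (it is not Big SQL), and the `sales` table and its rows are invented for illustration.

```python
import sqlite3

# sqlite3 is only a local stand-in for a SQL endpoint; it is not Big SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region VARCHAR(20), amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 10), ("south", 5), ("north", 7)],  # invented sample rows
)

# A plain SQL-92 style aggregate: nothing in it is specific to one engine.
QUERY = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY region
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```

The point of the sketch is that the query text itself carries no engine-specific syntax, which is what lets existing tools keep working when the storage underneath changes.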
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
The SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces an execution plan
- Executes and coordinates the query
[Diagram: an application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL Server on a head node of the BigInsights cluster. The Big SQL Server contains the SQL engine and the storage handlers (Del files, SEQ files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Hadoop ecosystem diagram (IBM InfoSphere BigInsights with related IBM and open source components) repeated, introducing Text Analytics.]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
- Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
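For readers who do not know AQL, the AmountWithUnit rule above (a Number match immediately followed by a Unit match) can be approximated with a regular expression. This is a rough Python analogue, not AQL and not how the Text Analytics engine works internally; the `NUMBER` and `UNIT` patterns are hypothetical stand-ins for the Number and Unit views, which the slide does not define.

```python
import re

# Hypothetical stand-ins for the Number and Unit views used by the AQL rule.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:kg|km|USD|EUR)"

# A Number match followed by a Unit match, separated by optional whitespace.
AMOUNT_WITH_UNIT = re.compile(rf"({NUMBER})\s*({UNIT})\b")

def amounts_with_unit(text):
    """Return (number, unit) pairs found in text, in order of appearance."""
    return AMOUNT_WITH_UNIT.findall(text)

sample = "The parcel weighs 2.5 kg and shipping costs 12 EUR."
print(amounts_with_unit(sample))
```

The declarative AQL version keeps Number and Unit as separately maintained views, so either can be refined without touching the rule that combines them; the regex above hard-codes both.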
Text Analytics Simple Example
Input text: World Cup 2010 Highlights
"Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win."
Extracted information:
Striker: Arjen Robben (Netherlands)
Keeper: Iker Casillas (Spain)
Winger: Andres Iniesta (Spain)
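The extraction in this example can be sketched with small dictionaries and a proximity heuristic. The code below is an illustrative assumption, not the Text Analytics engine: the dictionaries, the two-capitalized-words name pattern and the "nearest country mention" rule are all invented for this sketch and would be far too fragile for real text.

```python
import re

# Tiny hand-made dictionaries; a real extractor would ship far larger ones.
POSITIONS = ("striker", "keeper", "winger")
COUNTRIES = ("Netherlands", "Spain")

POSITION_RE = re.compile(r"\b(" + "|".join(POSITIONS) + r")\b", re.IGNORECASE)
COUNTRY_RE = re.compile(r"\b(" + "|".join(COUNTRIES) + r")\b")
NAME_RE = re.compile(r"[A-Z][a-z]+ [A-Z][a-z]+")  # two capitalized words

def extract_players(text):
    """Return (position, player, country) tuples, one per position mention."""
    countries = list(COUNTRY_RE.finditer(text))
    results = []
    for pos in POSITION_RE.finditer(text):
        # The player is the first name-like pair after the position word.
        name = NAME_RE.search(text, pos.end())
        if name is None or not countries:
            continue
        # The country is the mention closest to the position word.
        nearest = min(countries, key=lambda c: abs(c.start() - pos.start()))
        results.append((pos.group(1).capitalize(), name.group(0), nearest.group(1)))
    return results

text = (
    "Early in the second half, Netherlands' striker Arjen Robben had a chance "
    "to score, but the awesome keeper for Spain, Iker Casillas, made the save. "
    "Winner superiority was reflected when Winger Andres Iniesta scored for "
    "Spain for the win."
)
for row in extract_players(text):
    print(row)
```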
Text Analytics Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Hadoop ecosystem diagram (IBM InfoSphere BigInsights with related IBM and open source components) repeated, introducing R.]
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MapReduce (MR) programming
- Partitioning of large data
- Parallel cluster execution of R code
• Distributed Machine Learning
- A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram labels: R Clients, IBM R Packages, Data Sources, Embedded R Execution, Scalable Machine Learning; callouts: "Pull data (summaries) to R client" or "push R functions right on the data".]
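The "partition the data, run the function next to each partition, pull back only summaries" idea can be sketched as follows. This is a Python illustration of the pattern, not Big R itself: a thread pool stands in for the cluster, and `partition`, `partial_summary` and `distributed_mean` are hypothetical names chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, n_parts):
    """Split data into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def partial_summary(chunk):
    """Runs 'next to the data': returns a small (sum, count) summary."""
    return sum(chunk), len(chunk)

def distributed_mean(data, n_parts=4):
    """Push the function to each partition, then combine the summaries."""
    chunks = partition(data, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(partial_summary, chunks))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

values = list(range(1, 101))
print(distributed_mean(values))
```

Only the (sum, count) pairs travel back to the client, which is the reason for pulling summaries rather than raw data.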
Where Does Big Data Fit?
[Diagram labels: Source Systems; Analytical database (DW); Analytical tools; "Capture in case it's needed"; "Capture only what's needed"; numbered steps shown: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data.]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data - Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises - zero setup
Robust Course Management System and Content Distribution infrastructure
Introduction to MapReduce
Scalable to thousands of nodes and petabytes of data
A MapReduce application runs in three phases:
1. Map phase (break the job into small parts); map tasks are distributed to the Hadoop data nodes in the cluster
2. Shuffle (transfer interim output for final processing)
3. Reduce phase (boil all output down to a single result set)
The job returns a single result set.
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);  // emit <word, 1>
    }
  }
}
public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) sum += v.get();
    result.set(sum);
    context.write(key, result);  // emit <word, total>
  }
}
MapReduce Example
Count the number of occurrences of each word
Entry data:
Hello World Bye World
Hello IBM
Map process:
Map 1: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2: <Hello, 1> <IBM, 1>
Shuffle process
Reduce process (final output):
<Bye, 1> <IBM, 1> <Hello, 2> <World, 2>
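The word count above can be traced end to end with a small in-memory simulation. This is a sketch in plain Python, not Hadoop: the three functions only mimic the map, shuffle and reduce phases on a single machine, and their names are chosen for this illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(mapped_outputs):
    """Group all emitted values by key across the map outputs."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

splits = ["Hello World Bye World", "Hello IBM"]
mapped = [map_phase(s) for s in splits]  # one map task per input line
counts = reduce_phase(shuffle_phase(mapped))
print(sorted(counts.items()))
```

Each element of `mapped` corresponds to the output of one map task on the slide, and `counts` holds the final reduce output.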
How to Analyze Large Data Sets in Hadoop
It's not just runtime: the development phase has to be taken into account
Although the Hadoop framework is implemented in Java, MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few application development languages have emerged that build on top of Hadoop
- Pig
- Hive
- Jaql
Pig, Hive, Jaql - Similarities
Reduced program size over Java
Applications are translated to map and reduce jobs behind the scenes
Extension points for extending existing functionality
Interoperability with other languages
Not designed for random reads/writes or low-latency queries
Pig, Hive, Jaql - Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional
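What "translated to map and reduce jobs behind the scenes" means can be shown with a toy translation of one query shape. The sketch below is an assumption-laden illustration in Python, not how Pig, Hive or Jaql compile programs: `compile_group_by_sum` and `run_job` are hypothetical helpers, and the page-hit records are invented.

```python
from collections import defaultdict

def compile_group_by_sum(key_field, value_field):
    """Turn 'SELECT key, SUM(value) ... GROUP BY key' into map and reduce functions."""
    def mapper(record):
        # The GROUP BY column becomes the key, the summed column the value.
        return record[key_field], record[value_field]

    def reducer(key, values):
        # The aggregate function becomes the reduce step.
        return key, sum(values)

    return mapper, reducer

def run_job(records, mapper, reducer):
    """Single-machine stand-in for running the generated job."""
    groups = defaultdict(list)
    for record in records:
        key, value = mapper(record)
        groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

records = [
    {"page": "/home", "hits": 3},
    {"page": "/docs", "hits": 1},
    {"page": "/home", "hits": 2},
]
mapper, reducer = compile_group_by_sum("page", "hits")
print(run_job(records, mapper, reducer))
```

The user only states which column to group by and which to sum; the map and reduce functions are generated, which is where the reduced program size over hand-written Java comes from.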
Hadoop Distributions
Example of Hadoop Ecosystem
[Hadoop ecosystem diagram (IBM InfoSphere BigInsights with related IBM and open source components), introducing the open source frameworks.]
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data types and complex type definitions within a schema.
Flume: A distributed, reliable and highly available service for efficiently moving large amounts of data in a Hadoop cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column families, so that the elements of a column family are all stored together. This approach is different from a row-oriented relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction, so that you do not need to know where or how your data is stored. You can change how you write data while still supporting existing data in older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements, which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos Business Intelligence software.
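The HBase description above contrasts column-family storage with row-oriented storage. A toy in-memory model makes the difference concrete; it is a sketch only, `ColumnFamilyStore` is a hypothetical class and not the HBase client API, and the rows are invented sample data.

```python
from collections import defaultdict

class ColumnFamilyStore:
    """Toy store: cells are grouped by column family, not by row."""

    def __init__(self):
        # family -> row key -> {qualifier: value}
        self._families = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        self._families[family].setdefault(row_key, {})[qualifier] = value

    def get(self, row_key, family):
        """Read one row's cells from a single family."""
        return dict(self._families[family].get(row_key, {}))

    def scan_family(self, family):
        """Read one family for all rows without touching other families."""
        return {row: dict(cells) for row, cells in sorted(self._families[family].items())}

store = ColumnFamilyStore()
store.put("user1", "profile", "name", "Ana")
store.put("user1", "activity", "last_login", "2014-03-01")
store.put("user2", "profile", "name", "Ben")  # sparse: no activity cells stored
print(store.scan_family("profile"))
```

Because `user2` has no `activity` cells, nothing is stored for that combination, which is why this layout suits sparse data sets.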
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene and forms the basis of its rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
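The Lucene entry above says the index is what makes text search fast. The core idea, an inverted index from terms to documents, can be sketched in a few lines. This is a simplified Python illustration, not the Lucene API: real Lucene adds analyzers, fields, scoring and on-disk segments, and the three sample documents are invented.

```python
from collections import defaultdict

def build_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = [index.get(t, set()) for t in terms]
    return set.intersection(*hits)

docs = {1: "Hadoop stores big data", 2: "Lucene indexes text", 3: "Hadoop and Lucene"}
index = build_index(docs)
print(sorted(search(index, "hadoop lucene")))
```

A query only touches the posting sets of its own terms, so search cost depends on those sets rather than on the size of the whole collection.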
Example of Hadoop Ecosystem
[Hadoop ecosystem diagram (IBM InfoSphere BigInsights with related IBM and open source components) repeated, introducing BigSheets.]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new "sheets" (collections) to visualize their data. They can even export data into a variety of common formats with a click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM's Emerging Technologies team.
copy 2011 IBM Corporation30
MapReduce Example
Hello World Bye World
Hello IBM
Reduce (final output)
lt Bye 1gt
lt IBM 1gt
lt Hello 2gt
lt World 2gt
Map 1lt Hello 1gt
lt World 1gt
lt Bye 1gt
lt World 1gt
Count number of words occurrences
Map 2lt Hello 1gt
lt IBM 1gt
Entry Data
Map
Process
Reduce
Process
Shuffle
Process
copy 2014 IBM Corporation31
How to Analyze Large Data Sets in Hadoop
Its not just runtime Development phase has to be taken into
account
Although the Hadoop framework is implemented in Java
MapReduce applications do not need to be written in Java
To abstract complexities of Hadoop programming model a few
application development languages have emerged that build on top
of Hadoopndash Pig
ndash Hive
ndash Jaql
ndash Jaql
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
BigSheets
Browser-based analytical tool that generates MapReduce jobs
working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data
structures (collections). Once data is represented in a collection,
business analysts can filter and enrich its content using built-in
functions and macros. Furthermore, analysts can combine data
residing in different collections, as well as generate charts and new
“sheets” (collections) to visualize their data. They can even export
data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the
BigSheets project of IBM’s Emerging Technologies team.
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console – e.g., file
system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with the built-in “sheets” editor
Add/delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
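As a rough picture of these collection operations, the sketch below models a collection as a list of rows in plain Python. It is not BigSheets itself or its formula syntax; the column names, data, and helper functions are made up for illustration.

```python
# A "collection" modeled as a list of rows (dicts); all data is made up.
collection = [
    {"product": "A", "units": 10, "unit_price": 2.5},
    {"product": "B", "units": 0, "unit_price": 4.0},
    {"product": "C", "units": 3, "unit_price": 10.0},
]

def filter_rows(rows, predicate):
    """Filter data: keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]

def add_column(rows, name, formula):
    """Add a column whose value is computed from each row by a formula."""
    return [{**row, name: formula(row)} for row in rows]

def delete_column(rows, name):
    """Delete a column from every row."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

sold = filter_rows(collection, lambda r: r["units"] > 0)
with_revenue = add_column(sold, "revenue", lambda r: r["units"] * r["unit_price"])
result = delete_column(with_revenue, "unit_price")
```

Each operation returns a new list, which mirrors the idea of deriving a new sheet from an existing collection rather than editing the source data.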
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform stack, repeated to introduce Big SQL.]
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive ANSI SQL-92 support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage
mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites query if necessary for improved performance
– Determines appropriate storage handler for data
– Produces execution plan
– Executes and coordinates query
[Diagram: an application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL Server on a head node of the BigInsights cluster. The Big SQL Server contains the SQL engine and the storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker, and the Hive Metastore; each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform stack, repeated to introduce Text Analytics.]
What is Text Analytics
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
– Developer tools, an easy-to-use text analytics language, and a set of
extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM opens up the Text Analytics engine
technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, produces an AOG plan to work on the data
No “black boxes” or modules that can’t be customized
Tooling for easy customization, because you are abstracted from the
programmatic details
Competing solutions make use of locked-up black-box modules that cannot be
customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
extract pattern <N.match> <U.match>
as match
from Number N, Unit U;
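To make the intent of the `AmountWithUnit` view concrete, here is a rough Python analogue of "a number match followed by a unit match". It is not AQL and does not reproduce how the Text Analytics engine executes rules. The slide does not define the `Number` and `Unit` views, so the number pattern and the unit list below are assumptions.

```python
import re

# Assumed stand-ins for the Number and Unit views, which the slide does not define.
NUMBER = r"\d+(?:\.\d+)?"
UNITS = ["percent", "dollars", "euros", "kg", "km"]
UNIT = "|".join(re.escape(u) for u in UNITS)

# A number followed, after whitespace, by a unit: the analogue of <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s+({UNIT})\b")

def amount_with_unit(text):
    """Return (number, unit) pairs found in the text, in order of appearance."""
    return [(m.group(1), m.group(2)) for m in AMOUNT_WITH_UNIT.finditer(text)]

if __name__ == "__main__":
    print(amount_with_unit("Revenue grew 12.5 percent to 40 dollars, up from 3 last year"))
```

A bare number such as the trailing "3" is not returned, because the pattern requires a unit right after it. That is the same restriction the sequence pattern expresses declaratively.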
Text Analytic Simple Example
Netherlands, Striker, Arjen Robben
Keeper, Spain, Iker Casillas
Winger, Andres Iniesta, Spain
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself from the rest,
winning the final. Early in the second half, Netherlands’ striker
Arjen Robben had a chance to score, but the awesome keeper for Spain,
Iker Casillas, made the save. Winner superiority was reflected when
winger Andres Iniesta scored for Spain for the win.
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform stack, repeated to introduce R.]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MapReduce (MR) programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed machine learning
– A scalable statistics engine that provides canned algorithms and
an ability to author new ones, all via R
[Diagram: R clients, through IBM R packages, either pull data (summaries) to the R client or push R functions right onto the data. Labeled elements: Data Sources, Embedded R Execution, Scalable Machine Learning; numbered callouts 1 to 3.]
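The "partition the data, run the same function on every partition in parallel, pull back only summaries" idea can be sketched in a few lines. This is an illustration in Python, not Big R or its R API. The thread pool stands in for cluster nodes, and the data set and function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, key):
    """Group rows by a key function, mimicking partitioning of a large data set."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return groups

def group_apply(rows, key, func, workers=4):
    """Run func on every partition in parallel and collect one result per key."""
    groups = partition(rows, key)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {k: pool.submit(func, part) for k, part in groups.items()}
        return {k: f.result() for k, f in futures.items()}

def mean_delay(part):
    """The function pushed to the data: a per-partition summary."""
    return sum(r["delay"] for r in part) / len(part)

# Made-up sample data.
flights = [
    {"carrier": "AA", "delay": 10},
    {"carrier": "AA", "delay": 20},
    {"carrier": "IB", "delay": 5},
]
summary = group_apply(flights, key=lambda r: r["carrier"], func=mean_delay)
```

Only the small `summary` dictionary travels back to the caller, which is the point of pushing the function to the data instead of pulling every row to the client.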
Where Does Big Data Fit?
[Diagram: source systems, an analytical database (DW), and analytical tools, with numbered steps and two capture strategies.]
Steps shown: 1. Extract, transform, load; 5. Explore data; 6. Parse, aggregate; 9. Report and mine data
Strategies shown: “Capture only what’s needed” and “Capture in case it’s needed”
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in
the U.S. continues to be abysmal (9.1% in
July), but the tech world has spawned a
new kind of highly skilled, nerdy-cool job
that companies are scrambling to fill: data
scientist.”
McKinsey Global Institute, “Big data” report:
By 2018, the United States alone could
face a shortage of 140,000 to 190,000
people with deep analytical skills, as well as
1.5 million managers and analysts with the
know-how to use the analysis of big data to
make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become
a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible on-line delivery
allows learning at your place
and your pace
Free courses, free study
materials
Cloud-based sandbox for
exercises – zero setup
Robust Course
Management System and
Content Distribution
infrastructure
How to Analyze Large Data Sets in Hadoop
It’s not just runtime: the development phase has to be taken into
account
Although the Hadoop framework is implemented in Java,
MapReduce applications do not need to be written in Java
To abstract the complexities of the Hadoop programming model, a few
application development languages have emerged that build on top
of Hadoop
– Pig
– Hive
– Jaql
Pig, Hive, Jaql – Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind the scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
reads/writes or low-latency queries
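The phrase "translated to map and reduce jobs" can be made concrete with a word count, the kind of job a one-line grouping query in one of these languages would typically turn into. The sketch below simulates the map, shuffle, and reduce phases in a single Python process. It is not what Pig, Hive, or Jaql actually generate, and the function names are illustrative.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over a small in-memory input."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["big data", "Big SQL"]))
```

On a cluster, the map calls run on the nodes holding the input blocks and the reduce calls run per key after the shuffle. Writing these phases by hand in Java is the boilerplate the higher-level languages remove.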
Pig, Hive, Jaql – Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights platform stack, spanning visualization and discovery, applications and development, administration, integration, workload optimization, advanced analytic engines, runtime, data store, and file system layers; legend: IBM / open source.]
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primary data
types and complex type definitions within a schema.
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop
cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications
are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together. This approach is different from a row-oriented
relational database, where all columns of a row are stored together.
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored. You can change how you write data while still supporting existing data in
older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements,
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere
BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software.
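The Avro idea of a schema that travels with the file and is checked on write can be illustrated with a simplified sketch. This is not the Avro API or its binary container format. The schema dictionary follows the shape of an Avro record schema, but the JSON-lines layout, the helper functions, and the type mapping are assumptions made for the example.

```python
import io
import json

# Assumed mapping from a few schema type names to Python types.
PY_TYPES = {"string": str, "int": int, "double": float, "boolean": bool}

def write_records(stream, schema, records):
    """Write the schema as the first line, then one validated record per line."""
    stream.write(json.dumps(schema) + "\n")
    for record in records:
        for field in schema["fields"]:
            value = record.get(field["name"])
            if not isinstance(value, PY_TYPES[field["type"]]):
                raise TypeError(f"field {field['name']!r} is not {field['type']}")
        stream.write(json.dumps(record) + "\n")

def read_records(stream):
    """Read the embedded schema back, then the records that follow it."""
    schema = json.loads(stream.readline())
    return schema, [json.loads(line) for line in stream]

schema = {"type": "record", "name": "User",
          "fields": [{"name": "name", "type": "string"},
                     {"name": "age", "type": "int"}]}
buffer = io.StringIO()
write_records(buffer, schema, [{"name": "Ana", "age": 31}])
buffer.seek(0)
stored_schema, rows = read_records(buffer)
```

Because the schema is stored with the data, a reader needs no outside definition to interpret the records. That self-describing property is what the Avro description above refers to.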
Open Source frameworks II
Lucene: A high-performance text search engine library that is written entirely in Java. When you search within a
collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key
component of Lucene that forms the basis of rapid text search capabilities. You use the searching methods within the Lucene
libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build,
scan, and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides
users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the
required dependencies are met. Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system.
copy 2014 IBM Corporation32
Pig Hive Jaql ndash Similarities
Reduced program size over Java
Applications are translated to map
and reduce jobs behind scenes
Extension points for extending
existing functionality
Interoperability with other
languages
Not designed for random
readswrites or low-latency queries
Pig Hive Jaql ndash Differences
Characteristic Pig Hive Jaql
Developed by Yahoo Facebook IBM
Language Pig Latin HiveQL Jaql
Type of language
Data flow Declarative (SQL dialect) Data flow
Data structures supported
Complex Better suited for structured data
JSON semi structured
Schema Optional Not optional Optional
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser-based analytical tool that generates MapReduce jobs
working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data
structures (collections). Once data is represented in a collection,
business analysts can filter and enrich its content using built-in
functions and macros. Furthermore, analysts can combine data
residing in different collections, as well as generate charts and new
"sheets" (collections) to visualize their data. They can even export
data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the
BigSheets project of IBM's Emerging Technologies team.
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights web console, e.g., file
system data, output from a web crawl, etc.
BigSheets Collection Operations
Work with the built-in "sheets" editor
Add and delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
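The collection operations listed above (add and delete columns, filter, compute new values with a formula) can be sketched in plain Python. This is only an analogy: BigSheets is a browser tool that generates MapReduce jobs, and the helper names `add_column`, `delete_column`, and `filter_rows` and the sample data are hypothetical.

```python
# A "collection" modeled as a list of rows, each row a dict of column -> value.
rows = [
    {"product": "A", "units": 3, "price": 2.5},
    {"product": "B", "units": 0, "price": 4.0},
    {"product": "C", "units": 5, "price": 1.0},
]


def add_column(collection, name, formula):
    """Return a new collection with a column computed per row by `formula`."""
    return [{**row, name: formula(row)} for row in collection]


def delete_column(collection, name):
    """Return a new collection without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in collection]


def filter_rows(collection, predicate):
    """Keep only the rows for which `predicate` is true."""
    return [row for row in collection if predicate(row)]


# Each step yields a new "sheet", leaving the source collection untouched.
sheet = add_column(rows, "revenue", lambda r: r["units"] * r["price"])
sheet = filter_rows(sheet, lambda r: r["revenue"] > 0)
sheet = delete_column(sheet, "price")

if __name__ == "__main__":
    for row in sheet:
        print(row)
```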
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights ecosystem stack, shown again ahead of the Big SQL slides.]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive SQL-92 ANSI support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites query if necessary for improved performance
- Determines appropriate storage handler for data
- Produces execution plan
- Executes and coordinates query
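Two of the planning steps above, choosing a storage handler from catalog metadata and deciding where work runs, can be sketched as follows. Everything here is hypothetical: the catalog contents, the handler names, and the rule that point queries stay on the server while scans go to the cluster are assumptions for illustration, not Big SQL's actual logic.

```python
# Hypothetical catalog standing in for the shared Hive metastore.
CATALOG = {
    "web_logs": {"format": "delimited", "location": "/data/web_logs"},
    "clicks": {"format": "sequence", "location": "/data/clicks"},
    "customers": {"format": "hbase", "location": "customers"},
}

# Hypothetical handler names, one per storage format.
HANDLERS = {
    "delimited": "DelimitedFileHandler",
    "sequence": "SequenceFileHandler",
    "hbase": "HBaseHandler",
}


def plan_query(table, is_point_query):
    """Build a toy execution plan for a query against `table`."""
    meta = CATALOG[table]
    handler = HANDLERS[meta["format"]]
    # Assumption: small point lookups run at the server, scans on the cluster.
    run_on = "server" if is_point_query else "cluster"
    return {
        "table": table,
        "handler": handler,
        "location": meta["location"],
        "run_on": run_on,
    }


if __name__ == "__main__":
    print(plan_query("customers", is_point_query=True))
    print(plan_query("web_logs", is_point_query=False))
```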
[Diagram: an application (SQL language, JDBC/ODBC driver) connects over a network protocol to the BigInsights cluster. Head nodes host the Big SQL Server (SQL engine and storage handlers for Del files, SEQ files, HBase, RDBMS, ...), the Name Node, the Job Tracker, and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights ecosystem stack, shown again ahead of the Text Analytics slides.]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
- Developer tools, an easy-to-use text analytics language, and a set of
extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development
Annotation Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the
programmatic details
Competing solutions make use of locked-up black-box modules that cannot be
customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
extract pattern <N.match> <U.match>
as match
from Number N, Unit U;
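To make the AmountWithUnit view concrete, here is a rough Python analogue: a number immediately followed by a unit. It is a sketch under assumptions, not AQL semantics. The `NUMBER` and `UNIT` patterns below are invented stand-ins for the `Number` and `Unit` views, whose definitions the slide does not show.

```python
import re

# Assumed stand-ins for the Number and Unit views (not shown in the slide).
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:kg|km|GB|TB)"

# A number followed by whitespace and a unit, mirroring <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s+({UNIT})\b")


def amount_with_unit(text):
    """Return every 'number unit' span found in `text`."""
    return [m.group(0) for m in AMOUNT_WITH_UNIT.finditer(text)]


if __name__ == "__main__":
    print(amount_with_unit("The cluster stores 12 TB and ingests 1.5 GB per hour."))
```

The declarative AQL version leaves matching strategy to the compiled plan; the regular expression here hard-codes one strategy.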
Text Analytics Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself from
the rest, winning the final. Early in the second
half, Netherlands' striker Arjen Robben had a chance
to score, but the awesome keeper for Spain, Iker
Casillas, made the save. Winner superiority was
reflected when winger Andres Iniesta scored for Spain
for the win.
Extracted annotations:
Striker | Arjen Robben | Netherlands
Keeper | Iker Casillas | Spain
Winger | Andres Iniesta | Spain
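A rule-based extractor for this passage can be sketched with two small dictionaries and one pattern. This is a toy Python illustration, not the Text Analytics engine: the dictionaries, the "two capitalized words" person rule, and the clause splitting are assumptions chosen to fit this one text.

```python
import re

POSITIONS = {"striker", "keeper", "winger"}   # assumed position dictionary
COUNTRIES = {"Netherlands", "Spain"}          # assumed country dictionary
# Assumption: a person is two consecutive capitalized words.
PERSON = re.compile(r"\b([A-Z][a-z]+ [A-Z][a-z]+)\b")

TEXT = (
    "Football World Cup 2010: one team distinguished itself from the rest, "
    "winning the final. Early in the second half, Netherlands' striker "
    "Arjen Robben had a chance to score, but the awesome keeper for Spain, "
    "Iker Casillas, made the save. Winner superiority was reflected when "
    "winger Andres Iniesta scored for Spain for the win."
)


def extract_players(text):
    """Return (position, player, country) for each clause containing all three."""
    records = []
    for clause in re.split(r"\.|, but ", text):
        words = re.findall(r"[A-Za-z]+", clause)
        position = next((w for w in words if w.lower() in POSITIONS), None)
        country = next((w for w in words if w in COUNTRIES), None)
        person = PERSON.search(clause)
        if position and country and person:
            records.append((position.capitalize(), person.group(1), country))
    return records


if __name__ == "__main__":
    for record in extract_players(TEXT):
        print(record)
```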
Text Analytics Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights ecosystem stack, shown again ahead of the Big R slide.]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
- Partitioning of large data
- Parallel cluster execution of R code
• Distributed machine learning
- A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
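The partition-and-parallelize idea behind scaling out R can be sketched in Python as a stand-in. None of this is the Big R API: `partition`, `partial_summary`, and `distributed_mean` are hypothetical names, and a thread pool plays the role of the cluster. The function is pushed to each partition and only small summaries are pulled back to the client.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split `data` into `n_parts` interleaved chunks."""
    return [data[i::n_parts] for i in range(n_parts)]


def partial_summary(chunk):
    """Runs next to the data: return only (count, sum) for one chunk."""
    return len(chunk), sum(chunk)


def distributed_mean(data, n_parts=4):
    """Push `partial_summary` to every chunk, then combine on the client."""
    chunks = [c for c in partition(data, n_parts) if c]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        summaries = list(pool.map(partial_summary, chunks))
    count = sum(n for n, _ in summaries)
    total = sum(s for _, s in summaries)
    return total / count


if __name__ == "__main__":
    print(distributed_mean(list(range(1, 101))))
```

Only the (count, sum) pairs cross the boundary between workers and client, which is the point of pulling summaries rather than raw data.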
[Diagram: R clients and data sources connected through IBM R packages, with three numbered paths: pull data (summaries) to the R client, push R functions right on the data (embedded R execution), and scalable machine learning.]
Where Does Big Data Fit?
[Diagram labels: Source Systems; Analytical database (DW); Analytical tools; "Capture in case it's needed"; "Capture only what's needed"; numbered steps: 1 Extract, transform, load; 5 Explore data; 6 Parse, aggregate; 9 Report and mine data.]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in
the US continues to be abysmal (9.1% in
July), but the tech world has spawned a
new kind of highly skilled, nerdy-cool job
that companies are scrambling to fill: data
scientist."
McKinsey Global Institute, "Big Data" report:
By 2018, the United States alone could
face a shortage of 140,000 to 190,000
people with deep analytical skills, as well as
1.5 million managers and analysts with the
know-how to use the analysis of big data to
make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data - Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises - zero setup
Robust course management system and content distribution infrastructure
Pig, Hive, Jaql - Differences
Characteristic | Pig | Hive | Jaql
Developed by | Yahoo | Facebook | IBM
Language | Pig Latin | HiveQL | Jaql
Type of language | Data flow | Declarative (SQL dialect) | Data flow
Data structures supported | Complex | Better suited for structured data | JSON, semi-structured
Schema | Optional | Not optional | Optional
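The "Type of language" row is the main practical difference, and a small Python sketch can show it. This is an analogy only: the list comprehension and loop stand in for a data-flow script such as Pig Latin or Jaql, and SQLite (from the Python standard library) stands in for a declarative SQL dialect such as HiveQL. The `visits` data is made up.

```python
import sqlite3
from collections import defaultdict

visits = [("alice", "home"), ("bob", "home"), ("alice", "cart"), ("alice", "home")]

# Data-flow style: explicit steps, each one feeding the next.
filtered = [v for v in visits if v[1] == "home"]          # step 1: filter
grouped = defaultdict(list)                               # step 2: group
for name, page in filtered:
    grouped[name].append(page)
dataflow_result = sorted((name, len(pages)) for name, pages in grouped.items())

# Declarative style: describe the result and let the engine plan the steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (name TEXT, page TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", visits)
declarative_result = conn.execute(
    "SELECT name, COUNT(*) FROM visits WHERE page = 'home' "
    "GROUP BY name ORDER BY name"
).fetchall()
conn.close()

if __name__ == "__main__":
    print(dataflow_result)
    print(declarative_result)
```

Both halves compute the same answer; the declarative half also needs a table schema up front, which echoes the "Schema" row of the table.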
Hadoop Distributions
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights ecosystem stack, combining IBM and open source components such as HDFS, MapReduce, HBase, Hive, Pig, Jaql, ZooKeeper, Oozie, Lucene, Sqoop, Flume, HCatalog, Avro, BigSheets, Big SQL, Text Analytics, and R.]
Open Source frameworks I
Avro: A data serialization system that includes a schema within each file. A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs. Users can include primitive data
types and complex type definitions within a schema.
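The two ideas in the Avro entry, validating records against a schema at write time and storing the schema alongside the data, can be sketched in plain Python. This does not use the Avro library or its schema format: `SCHEMA`, `validate`, and `write_container` are hypothetical names for illustration.

```python
# Hypothetical schema: field names mapped to expected Python types.
SCHEMA = {
    "name": "User",
    "fields": [
        {"name": "id", "type": int},
        {"name": "email", "type": str},
        {"name": "tags", "type": list},
    ],
}


def validate(record, schema):
    """Raise if `record` does not match `schema`."""
    for field in schema["fields"]:
        name, expected = field["name"], field["type"]
        if name not in record:
            raise ValueError(f"missing field: {name}")
        if not isinstance(record[name], expected):
            raise TypeError(f"{name} must be {expected.__name__}")


def write_container(records, schema):
    """Validate every record on write and keep the schema with the data."""
    records = list(records)
    for record in records:
        validate(record, schema)
    return {"schema": schema, "records": records}


if __name__ == "__main__":
    container = write_container(
        [{"id": 1, "email": "a@example.com", "tags": ["new"]}], SCHEMA
    )
    print(container["schema"]["name"], len(container["records"]))
```

Because the schema travels inside the container, a reader needs no outside information to interpret the records.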
Flume: A distributed, reliable, and highly available service for efficiently moving large amounts of data in a Hadoop
cluster.
HBase: A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets. Unlike relational database systems, HBase does not support a structured query language like SQL. HBase applications
are written in Java™, much like a typical MapReduce application. HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together. This approach is different from a row-oriented
relational database, where all columns of a row are stored together.
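The storage difference described in the HBase entry can be shown with a small Python sketch. It is a conceptual model only, not HBase's storage engine or client API: the function `by_column_family` and the sample rows are made up, and only the `family:qualifier` column naming follows HBase convention.

```python
# Row-oriented view: all columns of a row are kept together.
rows = {
    "user1": {"info:name": "Ana", "info:city": "Madrid", "stats:visits": 12},
    "user2": {"info:name": "Luis", "stats:visits": 3},  # sparse: no city stored
}


def by_column_family(table):
    """Regroup cells so that each column family is stored together."""
    families = {}
    for row_key, cells in table.items():
        for column, value in cells.items():
            family, qualifier = column.split(":", 1)
            families.setdefault(family, {}).setdefault(row_key, {})[qualifier] = value
    return families


if __name__ == "__main__":
    for family, data in by_column_family(rows).items():
        print(family, data)
```

A read that touches only the `stats` family never has to load the `info` cells, and missing cells such as the city for `user2` take no space.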
HCatalog: A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored. You can change how you write data while still supporting existing data in
older formats. HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig.
Hive: A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations, in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS). SQL developers write statements,
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster. InfoSphere
BigInsights includes a JDBC driver (Big SQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software.
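How a SQL statement can break down into a MapReduce job is easiest to see on a grouped count. The Python sketch below mimics the map, shuffle, and reduce phases for a query like `SELECT page, COUNT(*) FROM visits GROUP BY page`. It is a single-process illustration under assumptions, not Hive's actual planner; the table and function names are hypothetical.

```python
from collections import defaultdict

visits = [
    {"user": "alice", "page": "home"},
    {"user": "bob", "page": "home"},
    {"user": "alice", "page": "cart"},
]


def map_phase(rows):
    """Emit one (group key, 1) pair per input row."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Collect all values that share a key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key to produce COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle(map_phase(visits)))

if __name__ == "__main__":
    print(counts)
```

On a cluster the map and reduce phases run in parallel on different nodes; the shuffle is the step that moves each key's values to a single reducer.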
Hadoop Distributions
34
copy 2014 IBM Corporation35
Example of Hadoop Ecosystem
Visualization amp DiscoveryIntegration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
BigSheets JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Dashboard amp Visualization
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
BigSheets Collection Operations
Work with the built-in "sheets" editor
Add or delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
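The collection operations listed above (filter, add a computed column, delete a column) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not BigSheets itself: the sample rows, column names, and helper functions are all hypothetical, and BigSheets would run the equivalent work as MapReduce jobs rather than in memory.

```python
# Hypothetical collection: a list of rows, each row a dict of column -> value.
collection = [
    {"page": "a.html", "visits": 120, "bounces": 30},
    {"page": "b.html", "visits": 15, "bounces": 12},
    {"page": "c.html", "visits": 300, "bounces": 45},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def add_column(rows, name, formula):
    """Return new rows with an extra column computed by formula(row)."""
    return [{**row, name: formula(row)} for row in rows]


def delete_column(rows, name):
    """Return new rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# Filter, then compute a new value per row, then drop a column.
busy = filter_rows(collection, lambda r: r["visits"] >= 100)
with_rate = add_column(busy, "bounce_rate", lambda r: r["bounces"] / r["visits"])
sheet = delete_column(with_rate, "bounces")
```

Each helper returns a new list instead of modifying its input, which mirrors the way every step in the slides produces a new "sheet" from an existing collection.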
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Diagram: the same IBM InfoSphere BigInsights component diagram as on the earlier "Example of Hadoop Ecosystem" slides. The component covered next is Big SQL.]
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive ANSI SQL-92 support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
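To make "existing queries should run with no or few modifications" concrete, here is the kind of ordinary SQL-92 style query the slide has in mind. No Big SQL server is assumed: Python's built-in sqlite3 module is used purely as a stand-in engine so the example runs anywhere, and the `sales` table and its columns are hypothetical.

```python
import sqlite3

# Stand-in engine: an in-memory SQLite database, NOT a Big SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region VARCHAR(20), amount INTEGER)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 100), ("south", 250), ("north", 50), ("east", 75)],
)

# A plain aggregate query using only long-standing SQL constructs
# (GROUP BY, HAVING, ORDER BY), the sort of statement existing tools emit.
QUERY = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    HAVING SUM(amount) >= 100
    ORDER BY region
"""
totals = conn.execute(QUERY).fetchall()
conn.close()
```

The point of the design goal is that a statement like `QUERY` would not need to be rewritten in a Hadoop-specific dialect; only the connection (a JDBC or ODBC driver in the slides) would change.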
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces an execution plan
- Executes and coordinates the query
[Diagram: an application issues SQL through a JDBC/ODBC driver, over a network protocol, to the Big SQL Server (SQL engine plus storage handlers for Del files, SEQ files, HBase, RDBMS, ...), which runs on a head node of the BigInsights cluster. Other head nodes host the Name Node, the Job Tracker, and the Hive Metastore. Each compute node runs a Task Tracker, a Data Node, and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
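The planning steps on the Architecture slide (look up the table, choose a storage handler, decide what runs at the server and what runs on the cluster) can be pictured with a toy planner. Everything in this sketch is hypothetical: the catalog contents, the handler names, and especially the rule for choosing "server" versus "cluster" are assumptions made for illustration, not Big SQL's actual logic.

```python
from dataclasses import dataclass

# Hypothetical catalog: table name -> storage format
# (a stand-in for what the shared Hive metastore would record).
CATALOG = {"clicks": "delimited", "events": "sequence", "profiles": "hbase"}

# Hypothetical handler names, one per storage format.
HANDLERS = {
    "delimited": "DelimitedFileHandler",
    "sequence": "SequenceFileHandler",
    "hbase": "HBaseHandler",
}


@dataclass
class Plan:
    table: str
    handler: str
    run_on: str  # "server" or "cluster"


def plan_query(table: str, is_point_query: bool) -> Plan:
    """Pick a storage handler for the table and decide where the work runs."""
    storage_format = CATALOG[table]
    handler = HANDLERS[storage_format]
    # Assumption for this sketch only: a key lookup against HBase is cheap
    # enough to answer at the server; everything else goes to the cluster.
    if is_point_query and storage_format == "hbase":
        run_on = "server"
    else:
        run_on = "cluster"
    return Plan(table, handler, run_on)


lookup_plan = plan_query("profiles", is_point_query=True)
scan_plan = plan_query("clicks", is_point_query=False)
```

A real engine would split a single query into several parts and rewrite it for performance; the sketch keeps one decision per query so the handler choice stays visible.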
Example of Hadoop Ecosystem
[Diagram: the same IBM InfoSphere BigInsights component diagram as on the earlier "Example of Hadoop Ecosystem" slides. The component covered next is Text Analytics.]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test, and refine rules
- Developer tools, an easy-to-use text analytics language, and a set of extractors for fast adoption
- Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (SystemT)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotation Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that is run against the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions rely on locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
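For readers who do not know AQL, the AmountWithUnit rule above says: wherever a Number match is immediately followed by a Unit match, emit the combined span. A rough Python analogue using regular expressions is sketched below. It is not AQL and not how the Text Analytics engine works internally; the number pattern and the small unit list are assumptions standing in for the `Number` and `Unit` views.

```python
import re

# Assumed stand-ins for the Number and Unit views referenced by the rule.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:kg|km|GB|TB|percent)"  # hypothetical unit dictionary

# A number, whitespace, then a unit: the regex analogue of <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s+({UNIT})\b")


def amount_with_unit(text):
    """Return (number, unit) pairs where a number is directly followed by a unit."""
    return AMOUNT_WITH_UNIT.findall(text)


matches = amount_with_unit(
    "The cluster stores 12 TB and adds 1.5 TB per month, across 40 nodes."
)
```

Here "40 nodes" is not returned because "nodes" is not in the unit dictionary, which is the same effect the AQL rule gets from requiring a `Unit` match.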
Text Analytic Simple Example
Input text:
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.
Extracted annotations:
Player          Position  Country
Arjen Robben    Striker   Netherlands
Iker Casillas   Keeper    Spain
Andres Iniesta  Winger    Spain
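The extraction in this example can be approximated with a few hand-written rules: a dictionary of positions, a dictionary of countries, and a pattern for two-word capitalized names, applied clause by clause. The Python sketch below is an assumption-laden illustration of rule-based extraction, not the Text Analytics engine or AQL; the dictionaries, the name pattern, and the clause splitting are all hypothetical simplifications tuned to this passage.

```python
import re

POSITIONS = ("striker", "keeper", "winger")  # hypothetical dictionary
COUNTRIES = ("Netherlands", "Spain")         # hypothetical dictionary

# Two consecutive capitalized words, skipping capitalized position words
# such as "Winger" so they are not mistaken for a first name.
_skip = "|".join(p.capitalize() for p in POSITIONS)
PERSON = re.compile(rf"\b(?!(?:{_skip})\b)[A-Z][a-z]+ [A-Z][a-z]+\b")


def first_in(words, clause, flags=0):
    """Return the first dictionary word that occurs in the clause, if any."""
    for word in words:
        if re.search(rf"\b{word}\b", clause, flags):
            return word
    return None


def extract_players(text):
    """Return (player, position, country) for each clause that has all three."""
    rows = []
    # Split on sentence ends and on ", but" so each clause holds one player.
    for clause in re.split(r"\.\s+|,\s+but\s+", text):
        person = PERSON.search(clause)
        position = first_in(POSITIONS, clause, re.IGNORECASE)
        country = first_in(COUNTRIES, clause)
        if person and position and country:
            rows.append((person.group(), position, country))
    return rows


TEXT = (
    "Early in the second half, Netherlands' striker Arjen Robben had a chance "
    "to score, but the awesome keeper for Spain, Iker Casillas, made the save. "
    "Winner superiority was reflected when Winger Andres Iniesta scored for "
    "Spain for the win."
)
players = extract_players(TEXT)
```

Rules this simple break quickly on other text (three-word names, two players in one clause), which is the gap a declarative rule language with tooling for testing and refining rules is meant to close.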
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Diagram: the same IBM InfoSphere BigInsights component diagram as on the earlier "Example of Hadoop Ecosystem" slides. The component covered next is R.]
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MapReduce (MR) programming
  - Partitioning of large data
  - Parallel cluster execution of R code
• Distributed machine learning
  - A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram: R Clients, IBM R Packages, Data Sources, Embedded R Execution, Scalable Machine Learning, with three numbered callouts. Legible callout text: pull data (summaries) to the R client, or push R functions right on the data.]
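The "partition the data, run the function next to each partition, pull only summaries back" idea can be sketched without a cluster. The example below is in Python rather than R and uses threads as a stand-in for cluster nodes, so it illustrates the pattern only; it does not use or imitate the Big R API, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(values, n_parts):
    """Split values into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partial_stats(chunk):
    """Runs next to one partition and returns a small summary: (count, sum)."""
    return len(chunk), sum(chunk)


def distributed_mean(values, n_parts=4):
    """Push partial_stats to each partition, then combine the summaries."""
    chunks = [c for c in partition(values, n_parts) if c]
    # Threads stand in for cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        summaries = list(pool.map(partial_stats, chunks))
    count = sum(n for n, _ in summaries)
    total = sum(s for _, s in summaries)
    return total / count


mean = distributed_mean(list(range(1, 101)))
```

Only the `(count, sum)` pairs travel back to the caller, never the raw values, which is the reason for pushing the function to the data instead of pulling the data to the client.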
Where Does Big Data Fit?
[Diagram labels: Source Systems, Analytical database (DW), Analytical tools. Numbered steps legible in the extraction: 1 Extract, transform, load; 5 Explore data; 6 Parse, aggregate; 9 Report and mine data. Captions: "Capture in case it's needed" and "Capture only what's needed".]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data - Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises - zero setup
Robust course management system and content distribution infrastructure
copy 2014 IBM Corporation36
Open Source frameworks I
Avro A data serialization system that includes a schema within each file A schema defines the data types that are
contained within a file and is validated as the data is written to the file using the Avro APIs Users can include primary data
types and complex type definitions within a schema
Flume A distributed reliable and highly available service for efficiently moving large amounts of data in a Hadoop
cluster
HBase A column-oriented database management system that runs on top of HDFS and is often used for sparse data
sets Unlike relational database systems HBase does not support a structured query language like SQL HBase applications
are written in Javatrade much like a typical MapReduce application HBase allows many attributes to be grouped into column
families so that the elements of a column family are all stored together This approach is different from a row-oriented
relational database where all columns of a row are stored together
HCatalog A table and storage management service for Hadoop data that presents a table abstraction so that you do
not need to know where or how your data is stored You can change how you write data while still supporting existing data in
older formats HCatalog wraps additional layers around the Hive metadata store to provide an enhanced metadata service
that includes functions for both MapReduce and Pig
Hive A data warehouse infrastructure that facilitates data extract-transform-load (ETL) operations in addition to
analyzing large data sets that are stored in the Hadoop Distributed File System (HDFS) SQL developers write statements
which are broken down by the Hive service into MapReduce jobs and then run across a Hadoop cluster InfoSphere
BigInsights includes a JDBC driver (BigSQL) that is used for programming with Hive and for connecting with Cognos
Business Intelligence software
copy 2014 IBM Corporation37
Open Source frameworks II
Lucene A high-performance text search engine library that is written entirely in Java When you search within a
collection of text Lucene breaks the documents into text fields and builds an index from them The index is the key
component of Lucene that forms the basis of rapid text search capabilities You use the searching methods within the Lucene
libraries to find text components With InfoSphere BigInsights Lucene is integrated into Jaql providing the ability to build
scan and query Lucene indexes
Oozie A management application that simplifies workflow and coordination between MapReduce jobs Oozie provides
users with the ability to define actions and dependencies between actions Oozie then schedules actions to run when the
required dependencies are met Workflows can be scheduled to start based on a given time or based on the arrival of
specific data in the file system
R A Project for Statistical Computing
Scoop A tool designed to easily import information from structured databases (such as SQL) and related Hadoop
systems (such as Hive and HBase) into your Hadoop cluster You can also use Sqoop to extract data from Hadoop and
export it to relational databases and enterprise data warehouses
Zookeeper A centralized infrastructure and set of services that enable synchronization across a cluster ZooKeeper
maintains common objects that are needed in large cluster environments such as configuration information distributed
synchronization and group services
copy 2014 IBM Corporation38
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
Big SQL
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture
ndash Comprehensive SQL92 ansi support
ndash Standards compliant client drivers (JDBC amp ODBC)
ndash Efficient handling of point queries
ndash Wide variety of data sources and file formats
ndash Extensive HBase focus
ndash Open source interoperability
Our driving design goalsndash Existing queries should run with no or few modifications
ndash Existing JDBC and ODBC compliant tools should continue to function
ndash Queries should be executed as efficiently as the chosen storage
mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other’s tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces an execution plan
– Executes and coordinates the query
[Diagram: an application issues SQL through the JDBC/ODBC driver, over a network protocol, to the Big SQL server on a head node of the BigInsights cluster. The server contains the SQL engine and storage handlers (delimited files, sequence files, HBase, RDBMS, ...). Other head nodes host the Name Node, the Job Tracker and the Hive metastore. Each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
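The planning steps listed on the Architecture slide can be pictured with a small sketch. This is not Big SQL code: the catalog contents, the handler names and the rule that sends HBase point lookups to the server are all assumptions made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical catalog: table name -> storage format.
# It stands in for the metadata shared through the Hive metastore.
CATALOG = {"web_logs": "delimited", "clicks": "sequence", "customers": "hbase"}

# Hypothetical handler names, one per storage format named on the slide.
STORAGE_HANDLERS = {
    "delimited": "DelimitedFileHandler",
    "sequence": "SequenceFileHandler",
    "hbase": "HBaseHandler",
}


@dataclass
class Plan:
    table: str
    handler: str
    run_on: str  # "server" or "cluster"


def plan_query(table: str, point_lookup: bool) -> Plan:
    """Choose a storage handler and decide where the work runs."""
    storage_format = CATALOG[table]
    handler = STORAGE_HANDLERS[storage_format]
    # Assumed rule: a single-row lookup on HBase is cheap enough to run
    # at the server; everything else is sent to the cluster.
    run_on = "server" if point_lookup and storage_format == "hbase" else "cluster"
    return Plan(table, handler, run_on)


if __name__ == "__main__":
    print(plan_query("customers", point_lookup=True))
    print(plan_query("web_logs", point_lookup=False))
```

The point of the sketch is only the shape of the decision: look up the table's format, pick the matching handler, then split work between server and cluster.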
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights platform figure again, this time leading into Text Analytics]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
– Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
– Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
– Multilingual support, including support for DBCS languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM has opened up the Text Analytics engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs against the data
No “black boxes” or modules that can’t be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility and makes them difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
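To make the AmountWithUnit view concrete, here is a rough Python analogue: a number immediately followed by a unit. It is a sketch, not how the AQL runtime works. The slide does not define the Number and Unit views, so the number pattern and the small unit list below are assumptions.

```python
import re

# Assumed stand-ins for the Number and Unit views referenced by the AQL rule.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:kg|km|EUR|USD)"

# A number, whitespace, then a unit: roughly the sequence <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s+({UNIT})\b")


def amount_with_unit(text: str) -> list[str]:
    """Return every 'number unit' span found in the text."""
    return [m.group(0) for m in AMOUNT_WITH_UNIT.finditer(text)]


if __name__ == "__main__":
    sample = "The parcel weighs 12 kg and travelled 340 km for 25.50 EUR."
    print(amount_with_unit(sample))
```

The declarative AQL version states what to match and leaves the execution strategy to the compiled plan; the regex version hard-codes one strategy.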
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished itself from the rest, winning the final. Early in the second half, Netherlands’ striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner’s superiority was reflected when winger Andres Iniesta scored for Spain for the win.
Extracted annotations (position, player, country):
Striker, Arjen Robben, Netherlands
Keeper, Iker Casillas, Spain
Winger, Andres Iniesta, Spain
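The extraction shown on this slide can be approximated with two dictionaries and a proximity rule. This is a minimal sketch under stated assumptions, not the Text Analytics engine: the dictionaries, the "two capitalized words" name rule and the nearest-match pairing are all invented for illustration.

```python
import re

# Hypothetical dictionaries; a real extractor would load much larger ones.
POSITIONS = ("striker", "keeper", "winger")
COUNTRIES = ("Netherlands", "Spain")

TEXT = (
    "Early in the second half, Netherlands' striker Arjen Robben had a chance "
    "to score, but the awesome keeper for Spain, Iker Casillas, made the save. "
    "The winner's superiority was reflected when winger Andres Iniesta scored "
    "for Spain for the win."
)


def find_terms(terms, text):
    """Return (start, end, term) for every dictionary hit, ignoring case."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE
    )
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(text)]


def closest(hits, start, end):
    """Pick the hit with the smallest character gap to the span [start, end)."""
    return min(hits, key=lambda h: max(h[0] - end, start - h[1]))[2]


def extract_players(text):
    positions = find_terms(POSITIONS, text)
    countries = find_terms(COUNTRIES, text)
    # Assumption: a player name is two consecutive capitalized words.
    names = re.finditer(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)
    return [
        (
            closest(positions, m.start(), m.end()).lower(),
            m.group(0),
            closest(countries, m.start(), m.end()),
        )
        for m in names
    ]


if __name__ == "__main__":
    for row in extract_players(TEXT):
        print(row)
```

Pairing by nearest match is fragile on longer texts; a rule language such as AQL lets each of these rules be refined separately.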
Text Analytic Real Example
One step beyond: Watson
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights platform figure again, this time leading into R]
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MR programming
  – Partitioning of large data
  – Parallel cluster execution of R code
• Distributed machine learning
  – A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Diagram: R clients and data sources connected through IBM R packages, with three numbered paths. Labels: pull data (summaries) to the R client, or push R functions right on the data (embedded R execution); scalable machine learning]
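The "partition, run in parallel, pull summaries back" idea on this slide can be sketched as follows. The sketch is in Python rather than R and uses none of the Big R API; the function names and the use of a local thread pool in place of a cluster are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n_parts):
    """Split data into at most n_parts contiguous chunks."""
    if not data:
        raise ValueError("data must not be empty")
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def summarize(chunk):
    """The function 'pushed to the data': returns a small summary only."""
    return len(chunk), sum(chunk)


def distributed_mean(data, n_parts=4):
    """Compute a mean from per-partition summaries."""
    parts = partition(data, n_parts)
    # A thread pool stands in for parallel execution on cluster nodes.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        summaries = list(pool.map(summarize, parts))
    count = sum(c for c, _ in summaries)
    total = sum(s for _, s in summaries)
    return total / count


if __name__ == "__main__":
    print(distributed_mean(list(range(1, 101))))
```

Only the (count, sum) pairs travel back to the client, which is the reason for pushing functions to the data instead of pulling the data itself.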
Where Does Big Data Fit?
[Diagram: source systems, analytical database (DW) and analytical tools. Labels: 1. Extract, transform, load (“Capture only what’s needed”); 5. Explore data; 6. Parse, aggregate (“Capture in case it’s needed”); 9. Report and mine data]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist.”
McKinsey Global Institute, “Big data” report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure
Open Source frameworks II
Lucene: A high-performance text search engine library written entirely in Java. When you search within a collection of text, Lucene breaks the documents into text fields and builds an index from them. The index is the key component of Lucene and forms the basis of its rapid text search capabilities. You use the searching methods within the Lucene libraries to find text components. With InfoSphere BigInsights, Lucene is integrated into Jaql, providing the ability to build, scan and query Lucene indexes.
Oozie: A management application that simplifies workflow and coordination between MapReduce jobs. Oozie provides users with the ability to define actions and dependencies between actions. Oozie then schedules actions to run when the required dependencies are met. Workflows can be scheduled to start at a given time or on the arrival of specific data in the file system.
R: A project for statistical computing.
Sqoop: A tool designed to easily import information from structured databases (such as SQL) and related Hadoop systems (such as Hive and HBase) into your Hadoop cluster. You can also use Sqoop to extract data from Hadoop and export it to relational databases and enterprise data warehouses.
ZooKeeper: A centralized infrastructure and set of services that enable synchronization across a cluster. ZooKeeper maintains common objects that are needed in large cluster environments, such as configuration information, distributed synchronization and group services.
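The Lucene entry above describes building an index from documents and then searching it. A toy inverted index shows the idea; this is a sketch in Python, not Lucene's Java API, and the tokenizer and function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "Hadoop stores big data",
        2: "Lucene indexes text",
        3: "Big data text search",
    }
    index = build_index(docs)
    print(search(index, "big data"))
```

Searching touches only the entries for the query terms, not the documents themselves, which is why the index is the basis of fast lookup.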
Example of Hadoop Ecosystem
[Diagram: the IBM InfoSphere BigInsights platform figure again, this time leading into BigSheets]
BigSheets
Browser-based analytical tool that generates MapReduce jobs working over Hadoop big data
Helps non-programmers work with a Hadoop cluster
Users model their big data as familiar spreadsheet-like tabular data structures (collections). Once data is represented in a collection, business analysts can filter and enrich its content using built-in functions and macros. Furthermore, analysts can combine data residing in different collections, as well as generate charts and new “sheets” (collections) to visualize their data. They can even export data into a variety of common formats with the click of a button.
Much of the technology included in Sheets was derived from the BigSheets project of IBM’s Emerging Technologies team
BigSheets Collection Sample
Spreadsheet-like structures defined by the user
Based on data accessible through the BigInsights Web console – e.g. file system data, output from a Web crawl, etc.
BigSheets Collection Operations
Work with built-in “sheets” editor
Add/delete columns
Filter data
Specify formulas to compute new values using spreadsheet-style syntax
Apply built-in or custom macro functions
…
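The collection operations listed on this slide (filter, add or delete columns, computed formulas) map onto simple row transformations. The sketch below is plain Python over a list of dictionaries, not the BigSheets interface; the column names and helper functions are hypothetical.

```python
# A "collection" modelled as a list of rows; column names are made up.
ROWS = [
    {"product": "A", "units": 10, "price": 2.5},
    {"product": "B", "units": 0, "price": 4.0},
    {"product": "C", "units": 3, "price": 10.0},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def add_column(rows, name, formula):
    """Add a column whose value is computed from each row."""
    return [{**row, name: formula(row)} for row in rows]


def delete_column(rows, name):
    """Return the rows without the named column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


if __name__ == "__main__":
    sold = filter_rows(ROWS, lambda r: r["units"] > 0)
    with_revenue = add_column(sold, "revenue", lambda r: r["units"] * r["price"])
    print(delete_column(with_revenue, "price"))
```

Each helper returns a new list, so the original collection stays unchanged, much as a new sheet is derived from an existing one.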
copy 2014 IBM Corporation39
BigSheets
Browser based Analytical Tool that generates MapReduce Jobs
working over Hadoop Big Data
Helps non-programmers to work with Hadoop cluster
User models their big data as familiar spreadsheet-like tabular data
structures (collections) Once data is represented in a collection
business analysts can filter and enrich its content using built-in
functions and macros Furthermore analysts can combine data
residing in different collections as well as generate charts and new
ldquosheetsrdquo (collections) to visualize their data They can even export
data into a variety of common formats with a click of a button
Much of the technology included in Sheets was derived from the
BigSheets project of IBMrsquos Emerging Technologies team
copy 2014 IBM Corporation40
BigSheets Collection Sample
Spreadsheet-like structures defined by user
Based on data accessible through BigInsights Web console ndash eg file
system data output from Web crawl etc
copy 2014 IBM Corporation41
Big Sheets Collection Operations
Work with built-in ldquosheetsrdquo editor
Add delete columns
Filter data
Specify formulas to compute new
values using spreadsheet-style
syntax
Apply built-in or custom macro
functions
helliphelliphelliphellip
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts, bar charts, tag clouds, maps, etc.
Hover over sections to reveal details
Example of Hadoop Ecosystem
[Figure: component diagram of IBM InfoSphere BigInsights, with a legend distinguishing IBM and open source components. Labels recovered from the diagram:]
Visualization & Discovery; BigSheets; Dashboard & Visualization; Apps; Workflow Monitoring; Applications & Development; Text Analytics; MapReduce; Pig & Jaql; Hive; Big SQL; Administration; Admin Console; Integrated Installer; Management; Security; Audit & History; Lineage
Advanced Analytic Engines; Text Processing Engine & Extractor Library; Adaptive Algorithms; R; Workload Optimization; Adaptive MapReduce; Flexible Scheduler; Splittable Text Compression; Enhanced Security; Index; Lucene; Jaql; Pig; ZooKeeper; Oozie; HCatalog; Avro
Runtime; File System; HDFS; GPFS-FPO; NameNode High Avail; Data Store; HBase; Integration; JDBC; Sqoop; Flume; Streams; Netezza; DB2; DataStage; Guardium; Platform Computing; Cognos
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
- Scalable server architecture
- Comprehensive ANSI SQL-92 support
- Standards-compliant client drivers (JDBC & ODBC)
- Efficient handling of point queries
- Wide variety of data sources and file formats
- Extensive HBase focus
- Open source interoperability
Our driving design goals
- Existing queries should run with no or few modifications
- Existing JDBC- and ODBC-compliant tools should continue to function
- Queries should be executed as efficiently as the chosen storage mechanisms allow
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
- Each can query the other's tables
SQL engine analyzes incoming queries
- Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
- Rewrites the query if necessary for improved performance
- Determines the appropriate storage handler for the data
- Produces the execution plan
- Executes and coordinates the query
[Figure: Big SQL architecture. An application sends SQL through a JDBC/ODBC driver and a network protocol to the Big SQL server (SQL engine plus storage handlers for delimited files, sequence files, HBase, RDBMS, ...) on a head node of the BigInsights cluster. Other head nodes host the Name Node, the Job Tracker and the Hive metastore; each compute node runs a Task Tracker, a Data Node and a Region Server. Server layout and relative sizes are for illustrative purposes only.]
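The idea of separating the portion of a query that runs on the cluster from the portion that runs at the server can be sketched as follows. This is not Big SQL's planner or API: it is a toy Python illustration in which threads stand in for compute nodes, and the table contents and the function names (`cluster_portion`, `server_portion`, `run_query`) are assumptions made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data partitions, one per compute node (values invented).
PARTITIONS = [
    [{"region": "EU", "amount": 120}, {"region": "US", "amount": 80}],
    [{"region": "EU", "amount": 40}, {"region": "EU", "amount": 10}],
    [{"region": "US", "amount": 300}],
]

def cluster_portion(partition, region):
    """Runs next to the data: filter rows and return a small partial aggregate."""
    amounts = [row["amount"] for row in partition if row["region"] == region]
    return {"count": len(amounts), "total": sum(amounts)}

def server_portion(partials):
    """Runs at the server: merge the partial aggregates into the final answer."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {"count": count, "total": total, "avg": total / count if count else None}

def run_query(region):
    # Roughly: SELECT COUNT(*), SUM(amount), AVG(amount) FROM sales WHERE region = ?
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda part: cluster_portion(part, region), PARTITIONS))
    return server_portion(partials)

eu = run_query("EU")
print(eu)
```

Only the small partial aggregates travel back to the server, which is the reason for pushing the filter and the first aggregation step to the nodes that hold the data.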
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights component diagram again, shown before the Text Analytics slides]
What is Text Analytics?
High-performance, scalable, rule-based information extraction engine
Distills structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build, test and refine rules
- Developer tools, an easy-to-use text analytics language and a set of extractors for fast adoption
- Multilingual support, including support for DBCS (double-byte character set) languages
Developed at IBM Research since 2004 (System T)
BigInsights is the first time IBM opens up the Text Analytics Engine technology for customization and development
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs over the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
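To make the AmountWithUnit rule concrete, here is a rough Python equivalent using regular expressions. It is only a stand-in for what the AQL view expresses (a Number match immediately followed by a Unit match) and does not use the Text Analytics engine; the number pattern, the unit dictionary and the function name `amount_with_unit` are assumptions for illustration.

```python
import re

# Assumed stand-ins for the Number and Unit views referenced by the AQL rule.
NUMBER = r"\d+(?:\.\d+)?"
UNITS = ["kg", "km", "liters", "dollars", "percent"]  # hypothetical unit dictionary
UNIT = "|".join(sorted(map(re.escape, UNITS), key=len, reverse=True))

# A number token immediately followed by a unit token, as in <N.match> <U.match>.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s+({UNIT})\b")

def amount_with_unit(text):
    """Return (number, unit) pairs for every number followed by a known unit."""
    return [(m.group(1), m.group(2)) for m in AMOUNT_WITH_UNIT.finditer(text)]

sample = "The truck carried 250 kg over 12.5 km and used 40 liters of fuel in 3 trips."
matches = amount_with_unit(sample)
print(matches)
```

The "3 trips" at the end is not reported because "trips" is not in the unit dictionary, which is the same behaviour a dictionary-backed Unit view would give.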
Text Analytic Simple Example
World Cup 2010 Highlights
Football World Cup 2010: one team distinguished well from the rest, winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. Winner superiority was reflected when Winger Andres Iniesta scored for Spain for the win.
Extracted results:
Country | Position | Player
Netherlands | Striker | Arjen Robben
Spain | Keeper | Iker Casillas
Spain | Winger | Andres Iniesta
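A rule-based extraction like the one in this example could be approximated with a few hand-written patterns. The sketch below is Python with regular expressions, not AQL and not the BigInsights extractor library; the country and position dictionaries, the naive "First Last" name rule and the three sentence patterns are assumptions tailored to this one passage.

```python
import re

COUNTRY = r"(?P<country>Netherlands|Spain)"             # assumed country dictionary
POSITION = r"(?P<position>(?i:striker|keeper|winger))"  # assumed position dictionary
PLAYER = r"(?P<player>[A-Z][a-z]+ [A-Z][a-z]+)"         # naive "First Last" name rule

# One rule per phrasing seen in the passage; real rule sets would be far richer.
RULES = [
    re.compile(rf"{COUNTRY}'?s?\s+{POSITION}\s+{PLAYER}"),             # "Netherlands' striker X"
    re.compile(rf"{POSITION}\s+for\s+{COUNTRY},?\s+{PLAYER}"),         # "keeper for Spain, X"
    re.compile(rf"{POSITION}\s+{PLAYER}\s+scored\s+for\s+{COUNTRY}"),  # "winger X scored for Spain"
]

def extract_players(text):
    """Return (country, position, player) tuples in order of appearance."""
    found = []
    for rule in RULES:
        for m in rule.finditer(text):
            found.append((m.start(), m["country"], m["position"].lower(), m["player"]))
    return [row[1:] for row in sorted(found)]

passage = (
    "Early in the second half, Netherlands' striker Arjen Robben had a chance "
    "to score, but the awesome keeper for Spain, Iker Casillas, made the save. "
    "Winner superiority was reflected when Winger Andres Iniesta scored for "
    "Spain for the win."
)
players = extract_players(passage)
for row in players:
    print(row)
```

Each pattern plays the role of one extraction rule; the structured tuples are what gets distilled from the unstructured text.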
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
[Figure: the IBM InfoSphere BigInsights component diagram again, shown before the Big R slide]
Big R
• Explore, visualize, transform and model big data using familiar R syntax and paradigm
• Scale out R with MapReduce (MR) programming
  - Partitioning of large data
  - Parallel cluster execution of R code
• Distributed machine learning
  - A scalable statistics engine that provides canned algorithms and an ability to author new ones, all via R
[Figure: Big R architecture. Labels: R Clients; IBM R Packages (shown twice); Embedded R Execution; Scalable Machine Learning; Data Sources; "Pull data (summaries) to R client"; "Or push R functions right on the data"; numbered callouts 1, 2, 3.]
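The "partition the data, run the function in parallel, pull back only summaries" pattern can be sketched as below. This is Python standing in for R, it does not use the Big R packages, and threads stand in for cluster nodes; the sample rows and the helper names (`partition_by`, `fit_slope`, `group_apply`) are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows: (store, week, sales). Values are invented.
ROWS = [
    ("north", 1, 10.0), ("north", 2, 12.0), ("north", 3, 14.0),
    ("south", 1, 30.0), ("south", 2, 27.0), ("south", 3, 24.0),
]

def partition_by(rows, key_index):
    """Split rows into groups keyed by one column (the 'partitioning' step)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)

def fit_slope(rows):
    """Least-squares slope of sales over week for one partition."""
    xs = [r[1] for r in rows]
    ys = [r[2] for r in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def group_apply(rows, key_index, func):
    """Push func to every partition in parallel and pull back only the small results."""
    groups = partition_by(rows, key_index)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(func, groups.values()))
    return dict(zip(groups.keys(), results))

slopes = group_apply(ROWS, 0, fit_slope)
print(slopes)
```

The function travels to each partition and only one number per group comes back, which mirrors the "push R functions right on the data" option in the figure.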
Where Does Big Data Fit?
[Figure: flow diagram linking Source Systems, the Analytical database (DW) and Analytical tools. Recovered step labels: 1 Extract, transform, load; 5 Explore data; 6 Parse, aggregate; 9 Report and mine data. Captions: "Capture in case it's needed" and "Capture only what's needed".]
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist - The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take For a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
- Understanding Big Data (free PDF book)
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
- Developing, publishing and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
- Implementing IBM InfoSphere BigInsights on System x (Redbook)
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
- Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
- InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
- Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
- DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises: zero setup
Robust course management system and content distribution infrastructure
copy 2014 IBM Corporation42
BigSheets Collection Graphic Visualization
Built-in charting facility aids analysis
Pie charts bar charts tag clouds maps etc
Hover over sections to reveal details
copy 2014 IBM Corporation43
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
Text Analytics MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
copy 2014 IBM Corporation44
What is Big SQL
Big SQL brings robust SQL support to the Hadoop ecosystemndash Scalable server architecture
ndash Comprehensive SQL92 ansi support
ndash Standards compliant client drivers (JDBC amp ODBC)
ndash Efficient handling of point queries
ndash Wide variety of data sources and file formats
ndash Extensive HBase focus
ndash Open source interoperability
Our driving design goalsndash Existing queries should run with no or few modifications
ndash Existing JDBC and ODBC compliant tools should continue to function
ndash Queries should be executed as efficiently as the chosen storage
mechanisms allow
copy 2014 IBM Corporation45
Architecture
Big SQL shares catalogs with
Hive via the Hive metastorendash Each can query the others tables
SQL engine analyzes incoming
queriesndash Separates portion(s) to execute at
the server vs portion(s) to execute
on the cluster
ndash Re-writes query if necessary for
improved performance
ndash Determines appropriate storage
handler for data
ndash Produces execution plan
ndash Executes and coordinates query
Server layout and relative sizes
for illustrative purposes only
Application
SQL Language
JDBC ODBC Driver
BigInsights Cluster
Head Node
Big SQL Server
Head Node
Name Node
Head Node
Job TrackerNetwork Protocol
SQL Engine
Storage Handlers
Del
Files
SEQ
FilesHBase RDBMS bullbullbull
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Compute Node
Task
Tracker
Data
Node
Region
Server
bullbullbull
Compute Node
Task
Tracker
Data
Node
Region
Server
Head Node
Hive Metastore
copy 2014 IBM Corporation46
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
Text Analytics
copy 2014 IBM Corporation47
What is Text Analytics
High Performance and Scalable rule based Information Extraction Engine
Distill structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build test and refine rules
ndash Developer tools an easy to use text analytics language and a set of
extractors for fast adoption
ndash Multilingual support including support for DBCS languages
Developed at IBM Research since 2004 System T
BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development
copy 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL-like language
Fully declarative text analytics language
Once compiled, it produces an AOG plan that runs against the data
No "black boxes" or modules that can't be customized
Tooling for easy customization, because you are abstracted from the programmatic details
Competing solutions make use of locked-up black-box modules that cannot be customized, which restricts flexibility, and that are difficult to optimize for performance
create view AmountWithUnit as
  extract pattern <N.match> <U.match>
    as match
  from Number N, Unit U;
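To make the AmountWithUnit rule more concrete, here is a minimal Python sketch of the same idea: find a number that is immediately followed by a unit. It is not AQL and does not use the Text Analytics engine. The NUMBER and UNIT patterns are hypothetical stand-ins for the Number and Unit views that the slide's rule assumes already exist, and requiring whitespace between the two is a simplification of how AQL matches adjacent tokens.

```python
import re

# Hypothetical stand-ins for the Number and Unit views used by the AQL rule.
NUMBER = r"\d+(?:\.\d+)?"
UNIT = r"(?:kg|km|GB|TB|USD|EUR)"

# Rough equivalent of `extract pattern <N.match> <U.match>`:
# a Number followed by a Unit, separated here by whitespace.
AMOUNT_WITH_UNIT = re.compile(rf"\b({NUMBER})\s+({UNIT})\b")


def amount_with_unit(text):
    """Return (number, unit) pairs for every Number followed by a Unit."""
    return [(m.group(1), m.group(2)) for m in AMOUNT_WITH_UNIT.finditer(text)]


if __name__ == "__main__":
    print(amount_with_unit("The cluster stores 20 TB and the rack weighs 450.5 kg"))
```

The point of the comparison is the declarative style: the AQL rule only states what to match, in terms of other views, and leaves the execution plan to the compiler.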
Text Analytic Simple Example
Extracted annotations:
Netherlands, Striker, Arjen Robben
Keeper, Spain, Iker Casillas
Winger, Andres Iniesta, Spain
World Cup 2010 Highlights
In the 2010 Football World Cup, one team stood out from the rest by winning the final. Early in the second half, Netherlands' striker Arjen Robben had a chance to score, but the awesome keeper for Spain, Iker Casillas, made the save. The winner's superiority was reflected when winger Andres Iniesta scored for Spain for the win.
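The extraction in this example can be sketched in a few lines of Python. This is only an illustration of rule-based extraction with dictionaries and a name pattern, not the Text Analytics engine or AQL. The POSITIONS and COUNTRIES dictionaries, the clause splitting, and the two-capitalized-words name rule are all assumptions made for this sketch, and TEXT is the sample text above with its punctuation restored.

```python
import re

# Hypothetical dictionaries; a real extractor would ship far larger ones.
POSITIONS = {"striker", "keeper", "winger"}
COUNTRIES = {"Netherlands", "Spain"}

TEXT = (
    "Early in the second half, Netherlands' striker Arjen Robben had a chance "
    "to score, but the awesome keeper for Spain, Iker Casillas, made the save. "
    "Winger Andres Iniesta scored for Spain for the win."
)

# Lookahead so that overlapping candidates ("Winger Andres", "Andres Iniesta")
# are all considered.
NAME = re.compile(r"(?=\b([A-Z][a-z]+) ([A-Z][a-z]+)\b)")
WORD = re.compile(r"[A-Za-z]+")


def extract_players(text):
    """Return one {player, position, country} record per clause that has all three."""
    records = []
    # Split into clauses so each player is paired with the nearest position and country.
    for clause in re.split(r"\.\s*|\bbut\b", text):
        words = WORD.findall(clause)
        position = next((w.lower() for w in words if w.lower() in POSITIONS), None)
        country = next((w for w in words if w in COUNTRIES), None)
        player = next(
            (
                f"{first} {last}"
                for first, last in NAME.findall(clause)
                if first.lower() not in POSITIONS
                and first not in COUNTRIES
                and last not in COUNTRIES
            ),
            None,
        )
        if position and country and player:
            records.append({"player": player, "position": position, "country": country})
    return records


if __name__ == "__main__":
    for record in extract_players(TEXT):
        print(record)
```

Hand-written rules like these break easily on new phrasing, which is the motivation for a declarative rule language with tooling to build, test, and refine extractors.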
Text Analytic Real Example
One step beyond Watson
Example of Hadoop Ecosystem
Figure: the IBM InfoSphere BigInsights platform diagram again, with the same areas and components as on the earlier Example of Hadoop Ecosystem slide.
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigm
• Scale out R with MapReduce (MR) programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed machine learning
– A scalable statistics engine that provides canned algorithms and the ability to author new ones, all via R
Figure: R clients, through IBM R packages, either pull data (summaries) to the R client or push R functions right onto the data (embedded R execution); scalable machine learning runs against the data sources. The diagram numbers three paths (1, 2, 3).
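The "partition the data, run the function next to each partition, return only summaries" idea can be sketched as follows. This is a Python illustration of the pattern only: it is not Big R and not R, the function names are hypothetical, the flight records are made-up sample data, and local threads stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, key):
    """Group rows by a key, like splitting a large table into per-group chunks."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


def mean_delay(rows):
    """The function that is 'pushed to the data': it runs once per partition."""
    return sum(r["delay"] for r in rows) / len(rows)


def group_apply(rows, key, func, workers=4):
    """Apply func to every partition in parallel and return only the small summaries."""
    groups = partition(rows, key)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(func, groups.values()))
    return dict(zip(groups.keys(), results))


# Made-up sample data for illustration.
flights = [
    {"carrier": "AA", "delay": 10},
    {"carrier": "AA", "delay": 20},
    {"carrier": "IB", "delay": 5},
]
summary = group_apply(flights, "carrier", mean_delay)

if __name__ == "__main__":
    print(summary)
```

Only the per-group summaries travel back to the client, which is what makes the approach workable when the full data set does not fit on the analyst's machine.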
Where Does Big Data Fit?
Analytical database (DW)
Source Systems
Analytical tools
5. Explore data
6. Parse, aggregate
"Capture in case it's needed"
1. Extract, transform, load
"Capture only what's needed"
9. Report and mine data
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take for a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust course management system and content distribution infrastructure
What is Big SQL?
Big SQL brings robust SQL support to the Hadoop ecosystem
– Scalable server architecture
– Comprehensive SQL-92 (ANSI) support
– Standards-compliant client drivers (JDBC & ODBC)
– Efficient handling of point queries
– Wide variety of data sources and file formats
– Extensive HBase focus
– Open source interoperability
Our driving design goals
– Existing queries should run with no or few modifications
– Existing JDBC- and ODBC-compliant tools should continue to function
– Queries should be executed as efficiently as the chosen storage mechanisms allow
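The design goal "existing queries should run with no or few modifications" is easiest to see with a plain, standard SQL statement issued through a standard driver interface. The sketch below uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is not a Big SQL connection. The sales table and its rows are made up for illustration. With Big SQL, the same kind of statement would be sent through the JDBC or ODBC driver instead.

```python
import sqlite3

# Stand-in database: an in-memory SQLite instance, NOT a Big SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# An ordinary aggregate query of the kind existing BI tools already generate.
QUERY = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
ORDER BY region
"""
rows = conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for region, total in rows:
        print(region, total)
```

Nothing in the query is Hadoop-specific, which is the point: the tool that issues it does not need to know where or how the data is stored.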
Architecture
Big SQL shares catalogs with Hive via the Hive metastore
– Each can query the other's tables
SQL engine analyzes incoming queries
– Separates portion(s) to execute at the server vs. portion(s) to execute on the cluster
– Rewrites the query if necessary for improved performance
– Determines the appropriate storage handler for the data
– Produces the execution plan
– Executes and coordinates the query
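The planning steps above can be pictured with a toy planner. This is a hedged sketch and not Big SQL's actual optimizer: the CATALOG contents, the handler names, the Plan type, and the rule "point lookups on HBase run at the server, everything else runs on the cluster" are all assumptions invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical catalog, standing in for table metadata read from the Hive metastore.
CATALOG = {
    "web_logs": {"format": "delimited"},
    "customers": {"format": "hbase"},
}

# Hypothetical mapping from storage format to a storage handler name.
STORAGE_HANDLERS = {
    "delimited": "DelimitedFileHandler",
    "sequence": "SequenceFileHandler",
    "hbase": "HBaseHandler",
}


@dataclass
class Plan:
    table: str
    handler: str
    run_on: str  # "server" or "cluster"


def plan_query(table, is_point_query, catalog=CATALOG):
    """Pick a storage handler and decide where the scan should run."""
    meta = catalog[table]
    handler = STORAGE_HANDLERS[meta["format"]]
    # Assumed rule of thumb: a point lookup on HBase is cheap enough to run
    # at the server; anything else is sent to the cluster.
    local = is_point_query and meta["format"] == "hbase"
    return Plan(table, handler, "server" if local else "cluster")


if __name__ == "__main__":
    print(plan_query("customers", is_point_query=True))
    print(plan_query("web_logs", is_point_query=False))
```

A real engine would also rewrite the query and split a single statement into server-side and cluster-side portions; the sketch only shows the handler choice and the placement decision.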
copy 2014 IBM Corporation46
Example of Hadoop Ecosystem
Dashboard amp Visualization
Integration
Workload Optimization
Streams
Netezza
Flume
DB2
DataStage
IBM InfoSphere BigInsights
Runtime
Advanced Analytic Engines
File System
MapReduce
HDFS
Data StoreHBase
Text Processing Engine amp Extractor Library)
JDBC
Applications amp Development
MapReduce
Pig amp Jaql Hive
Administration
Index
Splittable Text Compression
Enhanced Security
Flexible Scheduler
Jaql
Pig
ZooKeeper
Lucene
Oozie
Adaptive MapReduce
Hive
Integrated Installer
Admin Console
Sqoop
Adaptive Algorithms
Apps
Workflow Monitoring
Management
HCatalog
Security
Audit amp History
Lineage
R
Guardium
PlatformComputing
Cognos
IBMOpen Source
GPFS-FPO
NameNode High Avail
Avro
Visualization amp Discovery
BigSheets
Big SQL
Text Analytics
copy 2014 IBM Corporation47
What is Text Analytics
High Performance and Scalable rule based Information Extraction Engine
Distill structured information from unstructured data
- Rich annotator library supports multiple languages
Provides sophisticated tooling to help build test and refine rules
ndash Developer tools an easy to use text analytics language and a set of
extractors for fast adoption
ndash Multilingual support including support for DBCS languages
Developed at IBM Research since 2004 System T
BigInsights is the first time IBM opens up the Text Analytics Engine
technology for customization and development
copy 2014 IBM Corporation48
Annotator Query Language (AQL)
Language to create rules for Text Analytics
SQL Like Language
Fully declarative text analytics language
Once compiled produced an AOG plan to work in the data
No ldquoblack boxesrdquo or modules that canrsquot be customized
Tooling for easy customization because you are abstracted from the
programmatic details
Competing solutions make use of locked up black-box modules that cannot be
customized which restricts flexibility and are difficult to optimize for performance
create view AmountWithUnit as
extract pattern ltNmatchgt ltUmatchgt
as match
from Number N Unit U
copy 2014 IBM Corporation49
Text Analytic Simple Example
NetherlandsStrikerArjen Robben
Keeper SpainIker Casillas
WingerAndres Iniesta Spain
World Cup 2010 Highlights
Football World Cup 2010 one team distinguished well
from the rest winning the final Early in the second
half Netherlandsrsquo striker Arjen Robben had a chance
to score but the awesome keeper for Spain Iker
Casillas made the save Winner superiority was
reflected when Winger Andres Iniesta scored for Spain
for the win
copy 2014 IBM Corporation5050
Text Analytic Real Example
copy 2014 IBM Corporation5151
One step beyond Watson
Example of Hadoop Ecosystem
[Diagram: IBM InfoSphere BigInsights and surrounding products, with a legend distinguishing IBM and open source components. Labels recovered from the figure:]
Group labels: Dashboard & Visualization; Integration; Workload Optimization; Runtime; Advanced Analytic Engines; File System; Data Store; Applications & Development; Administration; Management; Security; Audit & History; Lineage; Visualization & Discovery
Components: Streams, Netezza, Flume, DB2, DataStage, JDBC, Sqoop, Guardium, Platform Computing, Cognos, MapReduce, Adaptive MapReduce, HDFS, GPFS-FPO, NameNode High Avail, HBase, HCatalog, Avro, ZooKeeper, Oozie, Lucene, Pig, Hive, Jaql, R, Text Processing Engine & Extractor Library, Text Analytics, BigSheets, Big SQL, Index, Splittable Text Compression, Enhanced Security, Flexible Scheduler, Integrated Installer, Admin Console, Adaptive Algorithms, Apps, Workflow Monitoring
Big R
• Explore, visualize, transform, and model big data using familiar R syntax and paradigms
• Scale out R with MR (MapReduce) programming
– Partitioning of large data
– Parallel cluster execution of R code
• Distributed Machine Learning
– A scalable statistics engine that provides canned algorithms and the ability to author new ones, all via R
[Diagram labels: R Clients; IBM R Packages; Embedded R Execution; Scalable Machine Learning; Data Sources. Captions: "Pull data (summaries) to R client" and "Or push R functions right on the data".]
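The "partition the data, run the function in parallel, pull only summaries to the client" idea can be sketched in a few lines. The example below is written in Python rather than R and does not use any Big R API; the function names (`partition`, `summarize`, `distributed_mean`) are hypothetical, and a thread pool stands in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts chunks (round-robin)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def summarize(chunk):
    """Function 'pushed' to each partition: returns a small summary, not the rows."""
    return (len(chunk), sum(chunk))


def distributed_mean(rows, n_parts=4):
    """Run summarize on every partition in parallel, then combine on the client."""
    parts = [p for p in partition(rows, n_parts) if p]
    with ThreadPoolExecutor(max_workers=max(1, len(parts))) as pool:
        summaries = list(pool.map(summarize, parts))
    count = sum(n for n, _ in summaries)
    total = sum(s for _, s in summaries)
    return total / count if count else float("nan")


if __name__ == "__main__":
    print(distributed_mean(list(range(1, 101))))
```

Only the per-partition `(count, sum)` pairs travel back to the client, which is the point of pushing functions to the data instead of pulling the raw rows.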
Where Does Big Data Fit?
[Diagram labels: Analytical database (DW); Source Systems; Analytical tools]
5 Explore data
6 Parse, aggregate
"Capture in case it's needed"
1 Extract, transform, load
"Capture only what's needed"
9 Report and mine data
Big Data Concepts
Big Data Technology
Data Scientists
Data scientist – The new cool guy in town
Article in Fortune: "The unemployment rate in the U.S. continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist."
McKinsey Global Institute, "Big Data" report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
Data Science is Multidisciplinary
Successful Data Scientist Characteristics
Data Scientist Qualities
How Long Does It Take for a Beginner to Become a Good Data Scientist?
www.kaggle.com
Kaggle ranking
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
• http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
• www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
• http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
• www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
• www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
• www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks: forums, demos
• http://www.ibm.com/developerworks/wiki/biginsights
Learn Big Data Technologies
BigDataUniversity.com
Flexible online delivery allows learning at your place and your pace
Free courses, free study materials
Cloud-based sandbox for exercises – zero setup
Robust Course Management System and Content Distribution infrastructure
copy 2014 IBM Corporation53
Big R
bull Explore visualize transform
and model big data using
familiar R syntax and
paradigm
bull Scale out R with MR
programming
ndash Partitioning of large data
ndash Parallel cluster execution of R
code
bull Distributed Machine
Learning
ndash A scalable statistics engine that
provides canned algorithms and
an ability to author new ones all
via R
R Clients
Scalable
Machine
Learning
Data Sources
Embedded R
Execution
IBM R Packages
IBM R Packages
Pull data
(summaries) to
R client
Or push R
functions
right on the
data
1
2
3
copy 2014 IBM Corporation54
Where Does BigData Fit
Analytical database
(DW)
Source Systems
Analytical tools
5 Explore data
6 Parse aggregate
ldquoCapture in case itrsquos neededrdquo
1 Extract transform load
ldquoCapture only whatrsquos neededrdquo
9 Report and mine data
copy 2014 IBM Corporation55
Big Data Concepts
Big Data Technology
Data Scientists
© 2014 IBM Corporation 56
Data scientist – The new cool guy in town
Article in Fortune: “The unemployment rate in the US continues to be abysmal (9.1% in July), but the tech world has spawned a new kind of highly skilled, nerdy-cool job that companies are scrambling to fill: data scientist.”
McKinsey Global Institute, “Big data” report: By 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.
© 2014 IBM Corporation 57
Data Science is Multidisciplinary
© 2014 IBM Corporation 58
Successful Data Scientist Characteristics
© 2014 IBM Corporation 59
Data Scientist Qualities
© 2014 IBM Corporation 60
How Long Does It Take For a Beginner to Become a Good Data Scientist?
© 2014 IBM Corporation 61
www.kaggle.com
© 2014 IBM Corporation 62
Kaggle ranking
© 2014 IBM Corporation 63
Learn Big Data
Reading Materials - Online
– Understanding Big Data – Free PDF Book
  • http://public.dhe.ibm.com/common/ssi/ecm/en/iml14297usen/IML14297USEN.PDF
– Developing, publishing, and deploying your first big data application with InfoSphere BigInsights
  • www.ibm.com/developerworks/data/library/techarticle/dm-1209bigdatabiginsights/index.html
– Implementing IBM InfoSphere BigInsights on System x - Redbook
  • http://www.redbooks.ibm.com/redpieces/abstracts/sg248077.html
Resources
– Big Data Information Center
  • www-01.ibm.com/software/ebusiness/jstart/bigdata/infocenter.html
– InfoSphere BigInsights
  • www-01.ibm.com/software/data/infosphere/biginsights
– Stream Computing
  • www-01.ibm.com/software/data/infosphere/stream-computing
– DeveloperWorks forums, demos
  • http://www.ibm.com/developerworks/wiki/biginsights
© 2014 IBM Corporation 64
Learn Big Data Technologies
BigDataUniversity.com
• Flexible online delivery allows learning at your place and at your pace
• Free courses, free study materials
• Cloud-based sandbox for exercises – zero setup
• Robust Course Management System and Content Distribution infrastructure
© 2014 IBM Corporation 65