Self-Service Access and Exploration of Big Data

The Briefing Room

Twitter Tag: #briefr

Welcome

Host: Eric Kavanagh


Mission

• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!


December: Innovators

January: Big Data

February: Analytics

March: Data in Motion


Innovators

• Charles Babbage conceived the Analytical Engine in 1834.
• Automation and ease of use have driven innovation in computing ever since.
• The Cloud and Big Data are raising the bar.


Robin Bloor is Chief Analyst at The Bloor Group

Analyst: Robin Bloor


Cirro

• Cirro provides a single method to access any type of data, on any platform, in any environment.
• Its product suite consists of Cirro Data Hub, Analyst for Excel and Multi Store, all designed to remove complexity from Big Data analytics.
• Cirro’s products are cloud based and can run in public, private and on-premise environments.


Mark Theissen

Mark is CEO at Cirro. He is a respected analytics and data warehousing expert with more than 22 years in the industry. Most recently Mark was the worldwide data warehousing technical lead at Microsoft following the acquisition of DATAllegro. At DATAllegro Mark was the COO and a member of the board of directors. Prior to joining DATAllegro, Mark was Vice President and Research Lead at META Group (Gartner Group) for Enterprise Analytics Strategies, covering data warehousing, business intelligence and data integration markets. Before META, Mark was VP of Professional Services at Accruent, where he was responsible for domestic and overseas services and operations. Mark has a BS in Computer Information Systems from Chapman University and an MBA from the University of California, Irvine.

©2012 Cirro Inc. All rights reserved.

Corporate Overview

Bringing Big Data to the Desktop


The Big Data Dilemma


Accessing Big Data


Incumbent Approach Hadoop Approach


What the Market Needs

An enterprise data hub to access any type of data, on any platform, in any environment


The Enterprise Data Hub


Simplifying the Access to Your Data

Conventional Approach: People manage the access to data
• Structured - Unstructured Mashups
• SQL (multiple versions)
• Java
• Sqoop
• MapReduce
• Hive, Hadoop Install & Config
• Hive, Sqoop Install & Config
• Source Control
• Database Management

Cirro Approach: Cirro Data Hub manages access to data
• Cirro Data Hub
• Access tool


Architecture Overview

Cirro Data Hub
• Cost-based federation optimizer
• Smart caching
• Dynamic optimization
• Normalized cost estimates
• Metadata for unstructured sources

Cirro Function Library
• Library of functions
• Logic to build complex specific formulas

Cirro Analyst
• Excel plug-in that allows analysts to explore & process Big Data and traditional data

Cirro Multi Store (optional)
• Pre-built structured/unstructured data store
• Used for holding data or additional workspace
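The slide names a cost-based federation optimizer with normalized cost estimates but does not show how either works. As a rough, generic illustration of the idea only (not Cirro's actual optimizer), the sketch below converts each engine's native cost figure into a shared unit and picks the cheapest candidate. The class, the function names, the calibration factors and the cost numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    """One candidate plan: which engine runs it and what that engine says it costs."""
    engine: str
    native_cost: float  # in the engine's own, non-comparable cost units


def normalized_cost(estimate, seconds_per_unit):
    """Convert an engine-specific cost into a shared unit (estimated seconds)."""
    return estimate.native_cost * seconds_per_unit[estimate.engine]


def cheapest_plan(estimates, seconds_per_unit):
    """Pick the candidate with the lowest normalized cost."""
    return min(estimates, key=lambda e: normalized_cost(e, seconds_per_unit))


# Hypothetical calibration factors and estimates, invented for illustration.
SECONDS_PER_UNIT = {"hive": 1.0, "mysql": 0.01}
CANDIDATES = [
    PlanEstimate(engine="hive", native_cost=120.0),
    PlanEstimate(engine="mysql", native_cost=4000.0),
]

best = cheapest_plan(CANDIDATES, SECONDS_PER_UNIT)
print(best.engine)
```

The point of normalizing is that raw cost numbers from different engines cannot be compared directly; only after conversion to a common unit does picking the minimum mean anything.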


Typical Deployment

IT Staff
• Programmers
• Developers
• DBAs
• Extend, add proprietary functions to CFL

Excel Analyst Users
• Design views
• Minimal IT support
• Publish views
• Data exploration
• Analysis

Data Consumers (Tableau, Business Objects, other BI tools)
• Access CDH views via ODBC & JDBC across all data types

Cirro Data Hub
• Cirro Function Library
• Proprietary MapReduce
• Custom views

Data Sources
• RDBMS (SQL): Oracle, Teradata, MySQL, Vertica
• NoSQL: Splunk, Cassandra, MongoDB
• Hadoop (HQL, MapReduce): Hive, Hadoop Distributed File System


Sample Use Case

Summarize the number of tweets per hour with certain keywords from a raw Twitter feed.

Requirements:
• Use raw Twitter data files in Hadoop
• Keywords stored in SQL table for easy manipulation
• Results into Tableau/Excel for visualization


Too Many Skills, Coding, Processing

Write mapper/reducer in Java using a development tool:
• Parse Twitter text
• Convert to lower case
• Parse words
• Exclude common words
• Group words by hour

Import Java classes into Hadoop

Execute command-line Hadoop using CLI:
    bin/hadoop jar TwitterParse /home/cloudera/WordCount.jar /usr/tweet/input /usr/local/output -libjars

Move result into Hive using JDBC SQL tool:
    create table output1 (text STRING, created_at STRING, count BIGINT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE
    LOAD DATA INPATH '/usr/data/1-88f1-864e22e77801/part*' OVERWRITE INTO TABLE output1

Move SQL table with keywords to Hive through Sqoop using CLI:
    export --connect jdbc:mysql://10.17.185.44/mytable --password mypasswd --username root --table words --export-dir '/home/cloudera/inputfile'
    create table mytable (word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
    LOAD DATA INPATH '/home/cloudera/inputfile/part*' OVERWRITE INTO TABLE mytable

Run Hive query using JDBC SQL tool:
    select a.text, a.created_at, a.count from output1 a join mytable b on (a.text = b.word)

Import results into Excel using Excel
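The mapper/reducer step above is listed but not shown. The following is a minimal Python sketch of that logic (the slide calls for Java), assuming each input line is a JSON tweet with `text` and `created_at` fields and that `created_at` uses Twitter's classic timestamp layout. The stop-word list, the function names and the sample tweets are illustrative assumptions, not part of the original deck.

```python
import json
import re
from collections import Counter
from datetime import datetime

# Assumption: a tiny stand-in stop-word list; a real job would use a fuller one.
COMMON_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

# Assumed timestamp layout, e.g. "Mon Dec 03 14:05:09 +0000 2012".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def map_tweet(raw_line):
    """Mapper: yield ((word, hour), 1) for each non-common word in one tweet."""
    tweet = json.loads(raw_line)
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    hour = created.strftime("%Y-%m-%d %H:00")
    for word in re.findall(r"[a-z0-9']+", tweet["text"].lower()):
        if word not in COMMON_WORDS:
            yield (word, hour), 1


def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each (word, hour) key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def count_words_per_hour(raw_lines):
    """Run the mapper over every input line, then reduce the emitted pairs."""
    emitted = (pair for line in raw_lines for pair in map_tweet(line))
    return reduce_counts(emitted)


# Made-up sample tweets for illustration.
SAMPLE = [
    json.dumps({"created_at": "Mon Dec 03 14:05:09 +0000 2012",
                "text": "Hadoop is the reservoir"}),
    json.dumps({"created_at": "Mon Dec 03 14:40:00 +0000 2012",
                "text": "Big Data and Hadoop"}),
    json.dumps({"created_at": "Mon Dec 03 15:01:00 +0000 2012",
                "text": "hadoop again"}),
]

counts = count_words_per_hour(SAMPLE)
```

Even in this compact form, the step needs parsing, text normalization and aggregation code before any of the Hadoop, Hive or Sqoop commands come into play.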



The same use case expressed as Cirro Analyst formulas in Excel:

B1=TwitterParse("/user/twitter/sample","text,created_at")
B2=ToLower(B1,"text")
B3=WordSeparate(B2,"text")
B4=Exclude(B3,"text")
B5=GroupBy(B4,"text,created_at")
B6=Cirro_Match(B5,"text","MYSQL.KeyWords","word",C9)

Results displayed at cell C9
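The final formula matches the grouped words against keywords held in a SQL table. As a hedged sketch of what such a match amounts to, the Python below joins grouped counts to a keyword table, using an in-memory SQLite database as a stand-in for MySQL. The function name, the temporary table and the sample rows are assumptions for illustration and say nothing about how Cirro_Match is implemented.

```python
import sqlite3


def match_keywords(grouped_rows, connection):
    """Keep only (word, created_at, word_count) rows whose word is a stored keyword."""
    connection.execute(
        "CREATE TEMP TABLE IF NOT EXISTS grouped "
        "(word TEXT, created_at TEXT, word_count INTEGER)"
    )
    connection.execute("DELETE FROM grouped")
    connection.executemany("INSERT INTO grouped VALUES (?, ?, ?)", grouped_rows)
    cursor = connection.execute(
        "SELECT g.word, g.created_at, g.word_count "
        "FROM grouped g JOIN KeyWords k ON g.word = k.word "
        "ORDER BY g.created_at, g.word"
    )
    return cursor.fetchall()


# Stand-in for the MySQL keyword table referenced in the formula.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE KeyWords (word TEXT)")
conn.executemany("INSERT INTO KeyWords VALUES (?)", [("hadoop",), ("cirro",)])

# Made-up grouped counts, shaped like the output of the GroupBy step.
GROUPED = [
    ("hadoop", "2012-12-03 14:00", 2),
    ("big", "2012-12-03 14:00", 1),
    ("hadoop", "2012-12-03 15:00", 1),
]

matched = match_keywords(GROUPED, conn)
```

This is the same join as the Hive query in the conventional approach; keeping the keywords in a SQL table means the list can be edited without touching the parsing logic.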



Analyst: Robin Bloor

Perceptions & Questions

The Bloor Group

Big Data, Hot Data?

Hadoop & The Big Data Dynamic

Hadoop has become the de facto reservoir for data

– We witnessed something like this a long time ago with ISAM files, before the advent of RDBMS
– The difference this time is that Hadoop has an ecosystem and it is growing
– Big Data (usually caught first by Hadoop) is mostly new data and mostly event data
– Hadoop is not (yet) a performance engine. It is an all-purpose capability
– It is delivering business benefits in a big way: it is hot

BI Categories

HINDSIGHT: Regular reporting/operational BI, Excel

OVERSIGHT: Dashboards, OLAP, BPM, Excel

INSIGHT: Data mining, statistical analysis (trends and relationships)

FORESIGHT: Predictive analytics

The New BI Universe (?)

Data Sources

Hadoop and Hadoop ++

Standard SQL

NoSQL

Graph DBMS, XML DBMS, Flat files

Metadata Hub?

Problems Of The Data Layer

Hadoop is capable of ETL and often used for ETL, but that usually involves coding of a kind

A connectivity architecture is needed

IT REQUIRES SIMPLE CONNECTORS

Point-to-point connectivity usually was, is and may always be a bad idea

BI tools, which had good-enough interfaces to RDBMS, don’t link to Hadoop directly, and probably shouldn’t

The data layer is more complicated than it was and its complexity is increasing

Hadoop is multi-role and hence can spawn multiple instances

• How would one use the Cirro Multi Store?
• Which companies/products do you regard as competitors (either directly or close competitors)?
• How does a Cirro implementation proceed, i.e., where do you start, what are the medium term goals, what do you replace?
• Conceptually a hub for the data layer is attractive. But how well does it scale out?
• Can the hub be physically distributed, i.e., one logical instance with multiple physical instances?
• How does your proprietary MapReduce differ from Hadoop MapReduce?
• Is there any aspect of BI that you don’t or can’t cater for (CEP, Data governance, MDM, etc.)?


Upcoming Topics

January: Big Data

February: Analytics

March: Data in Motion

2013 Editorial Calendar www.insideanalysis.com


Thank You for Your Attention
