Big Data and the BI Wild West

Big Data and the BI Wild West Don’t Bring an Elephant to a Gun Fight! Paul Groom

description

For Hadoop to “cross the chasm”, it will need widespread adoption by organizations; but the keystone is not the widely talked-about social media such as Facebook, Twitter and LinkedIn. It is the seemingly mundane “dark data” in business, captured but left unutilized or under-utilized, that will start the move away from the standard architectures of old and toward the brave new world generally associated with “Big Data”. As members of the Hadoop community, our challenge is to bring about that change rapidly and responsibly, bringing order to the “wild west” of today’s disruptive business intelligence landscape. BI is the foothold from which to bring Hadoop into the mainstream. Success requires linking new technologies with the mature ones in use today to enable the search for value. Beyond the racks and clusters, we need to bring the science and understanding that enable organizations to leave the past behind and move to the brave new world. This requires bringing along applications, processes, and groups of users, and intelligently combining NoSQL, relational, predictive, and advanced analytics technologies so that they are easily consumable, even by the business user.

Transcript of Big Data and the BI Wild West

Big Data and the BI Wild West: Don’t Bring an Elephant to a Gun Fight!

Paul Groom

Tools

Processes

Objectives

Why Business Intelligence?

View

Learn

Action

Community

Acquire

What is Business Intelligence?

Numbers

Tables

Charts

Indicators

Time - History - Lag

Access - to view (portal) - to data - to depth - Control/Secure

Consumption - digestion

…with ease and simplicity

Business [Intelligence] Desires

More timely

Lower latency

More granularity

More user interactions

Richer data model

Self service

View and generate

Got mobile?

200 million employees bring their own device to work

Nearly half of the workforce will be made up of millennials by 2020

50% of BYOD companies have had a security breach

1/3 have broken or would break corporate policy on BYOD

Data flow

Dynamic access

Drill unlimited

Disruption: Data Discovery tools

BI tools have plateaued…again

Decision Support (Reporting) in late 90’s

Business Intelligence of 00’s

…led to data mining

…leading to analytics and data science

More math

…a lot more math

Machine learning algorithms

Dynamic Simulation

Statistical Analysis

Clustering

Behaviour modelling

The drive for deeper understanding

Reporting & BPM

Fraud detection

Dynamic Interaction

Campaign Management

Technology/Automation

Analytical Complexity

create external script LM_PRODUCT_FORECAST environment rsint
receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER, PRODNO INTEGER, DAILYSALES INTEGER )
partition by PRODNO order by PRODNO, ROW_ID
sends ( R_OUTPUT varchar )
isolate partitions
script S'endofr(
# Simple R script to run a linear fit on daily sales
prod1<-read.csv(file=file("stdin"), header=FALSE,row.names=1)
colnames(prod1)<-c("DOW","ID","PRODNO","DAILYSALES")
dim1<-dim(prod1)
daily1<-aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), median)
daily1[,2]<-daily1[,2]/sum(daily1[,2])
basesales<-array(0,c(dim1[1],2))
basesales[,1]<-prod1$ID
basesales[,2]<-(prod1$DAILYSALES/daily1[prod1$DOW+1,2])
colnames(basesales)<-c("ID","BASESALES")
fit1=lm(BASESALES ~ ID,as.data.frame(basesales))
forecast<-array(0,c(dim1[1]+28,4))
colnames(forecast)<-c("ID","ACTUAL","PREDICTED","RESIDUALS")
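The visible part of the R script scales each day's sales by a day-of-week share (the median for that weekday divided by the sum of the weekday medians) and then fits a straight line to the adjusted values. The slide cuts the script off, so the following is only a rough Python sketch of that visible idea, not a port of the original. The function names, the sample multipliers and the made-up data are all assumptions for illustration, and the 28-row forecast horizon is taken from the size of the `forecast` array.

```python
from statistics import median

def day_of_week_shares(rows):
    """rows: (row_id, dow, daily_sales). Returns {dow: share of summed medians}."""
    sales_by_dow = {}
    for _row_id, dow, sales in rows:
        sales_by_dow.setdefault(dow, []).append(sales)
    medians = {dow: median(values) for dow, values in sales_by_dow.items()}
    total = sum(medians.values())
    return {dow: m / total for dow, m in medians.items()}

def base_sales(rows):
    """Divide each day's sales by its day-of-week share, as the R script does."""
    shares = day_of_week_shares(rows)
    return [(row_id, sales / shares[dow]) for row_id, dow, sales in rows]

def linear_fit(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def forecast_base_sales(rows, days_ahead=28):
    """Extend the fitted line past the last row (still in adjusted units)."""
    slope, intercept = linear_fit(base_sales(rows))
    next_id = max(row_id for row_id, _, _ in rows) + 1
    return [(i, intercept + slope * i) for i in range(next_id, next_id + days_ahead)]

# Hypothetical data: flat underlying demand with a weekly pattern.
WEEKDAY_MULTIPLIER = [0.8, 0.9, 1.0, 1.0, 1.1, 1.4, 1.2]
rows = [(i, i % 7, 70 * WEEKDAY_MULTIPLIER[i % 7]) for i in range(28)]
slope, intercept = linear_fit(base_sales(rows))
predictions = forecast_base_sales(rows)
```

On this flat sample the weekly pattern divides out, so the fitted slope is effectively zero. Real data would also need the predictions scaled back by the weekday share, a step the truncated slide does not show.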

select Trans_Year, Num_Trans,
       count(distinct Account_ID) Num_Accts,
       sum(count(distinct Account_ID)) over (partition by Trans_Year order by Num_Trans) Total_Accts,
       cast(sum(total_spend)/1000 as int) Total_Spend,
       cast(sum(total_spend)/1000 as int) / count(distinct Account_ID) Avg_Yearly_Spend,
       rank() over (partition by Trans_Year order by count(distinct Account_ID) desc) Rank_by_Num_Accts,
       rank() over (partition by Trans_Year order by sum(total_spend) desc) Rank_by_Total_Spend
from ( select Account_ID,
              Extract(Year from Effective_Date) Trans_Year,
              count(Transaction_ID) Num_Trans,
              sum(Transaction_Amount) Total_Spend,
              avg(Transaction_Amount) Avg_Spend
       from Transaction_fact
       where extract(year from Effective_Date)<2009
         and Trans_Type='D'
         and Account_ID<>9025011
         and actionid in (select actionid from DEMO_FS.V_FIN_actions where actionoriginid =1)
       group by Account_ID, Extract(Year from Effective_Date) ) Acc_Summary
group by Trans_Year, Num_Trans
order by Trans_Year desc, Num_Trans;

select dept, sum(sales)
from sales_fact
where period between date '01-05-2006' and date '31-05-2006'
group by dept
having sum(sales) > 50000;

select sum(sales) from sales_history where year = 2006 and month = 5 and region=1;

select total_sales from summary where year = 2006 and month = 5 and region=1;

Behind the numbers

It’s all about getting work done

Bottlenecks

Tasks evolving:

Used to be a simple fetch of a value

Then it was computing a dynamic aggregate

Now complex algorithms!
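The first two stages of this evolution can be contrasted in a small runnable sketch. It uses Python's built-in `sqlite3` as a stand-in database. The table names `summary` and `sales_fact` echo the sample queries in the deck, but the columns, the rows and the helper function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table sales_fact (dept text, region integer, period text, sales integer);
create table summary (year integer, month integer, region integer, total_sales integer);
""")

# Hypothetical fact rows: (dept, region, ISO date, sales).
conn.executemany(
    "insert into sales_fact values (?, ?, ?, ?)",
    [
        ("toys", 1, "2006-05-03", 30000),
        ("toys", 1, "2006-05-17", 25000),
        ("books", 1, "2006-05-09", 40000),
        ("books", 2, "2006-05-21", 20000),
        ("toys", 1, "2006-06-02", 99000),
    ],
)

# Stage 1 relies on someone having precomputed the answer in advance.
conn.execute("""
insert into summary
select 2006, 5, region, sum(sales)
from sales_fact
where period between '2006-05-01' and '2006-05-31'
group by region
""")

def fetch_summary(conn, year, month, region):
    """Simple fetch of a value: one row from a pre-aggregated table."""
    row = conn.execute(
        "select total_sales from summary where year = ? and month = ? and region = ?",
        (year, month, region),
    ).fetchone()
    return row[0] if row else None

def dynamic_aggregate(conn, start, end, threshold):
    """Dynamic aggregate: scan the fact rows and group them at query time."""
    return conn.execute(
        """select dept, sum(sales) from sales_fact
           where period between ? and ?
           group by dept having sum(sales) > ?
           order by dept""",
        (start, end, threshold),
    ).fetchall()

may_region_1 = fetch_summary(conn, 2006, 5, 1)
may_by_dept = dynamic_aggregate(conn, "2006-05-01", "2006-05-31", 50000)
```

The fetch touches one stored row, while the aggregate has to read every matching fact row each time it runs. That difference is where the work, and the bottleneck, moves as questions become more ad hoc.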

Bottlenecks

Time to influence

Reaction – what? – potential value

Action – opportunity - interaction

BI is becoming democratized

BI Wild West

Data

Business [Intelligence] Desires in relation to Big Data

More timely

Lower latency

More granularity

More user interactions

Richer data model

Self service

The Data Warehouse?

Realities

Reports against the DW are just plain dull, boring even!

And then came…

Hadoop ticks many but not all the boxes


Stomped on costs

Made economics of scale practical

Talk to BI team about plugging into Hadoop – should be simple?

No need to pre-process before storage i.e. no need to align to storage

No need to triage before storage

New economics = new attitude: just grab and retain all data; the data science team will dig into it later

Call IT: Why is SQL so limited?

Early bridge Building

Early Hadoop integration tools

The new bounty hunters: Drill, Impala, Pivotal, Stinger

The No SQL Posse

Wanted: Dead or Alive

SQL

…but Hadoop still too slow for interactive BI

…loss of train-of-thought

For once technology is on our side

…oh and BTW RAM is cheap!

CPU

Network

Storage

Lots of these

Not so many of these

Hadoop is…

Hadoop inherently disk oriented

Typically low ratio of CPU to Disk

‘Flash’ washing is not the solution

Analytics needs

low latency, no I/O wait

Analytical Platform Reference Architecture

Analytical Platform Layer

Near-line Storage (optional)

Application & Client Layer

All BI Tools, All OLAP Clients, Excel

Persistence Layer

Hadoop Clusters

Enterprise Data Warehouses

Legacy Systems

Kognitio Storage

Reporting

Cloud Storage

SQL MDX

Cognos

Reach out, actively select and pull back to consume

MPP everything – get more work done

“No SQL” graduates to “not-only-SQL”

SQL remains preferred data access language … for business community

SQL can encapsulate other processing - in-line Python, R, Java etc.
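As a rough illustration of SQL wrapping in-line code, the sketch below registers Python functions with an in-memory SQLite database and calls them from a query. This is a stand-in for the idea only. It is not Kognitio's external script mechanism, and the table, the data, and the names `py_median` and `sales_band` are invented for the example.

```python
import sqlite3
from statistics import median

class Median:
    """Python aggregate exposed to SQL; SQLite calls step() once per row."""
    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return median(self.values) if self.values else None

def sales_band(amount):
    """Python scalar function exposed to SQL."""
    return "high" if amount >= 1000 else "low"

conn = sqlite3.connect(":memory:")
conn.create_aggregate("py_median", 1, Median)
conn.create_function("sales_band", 1, sales_band)

conn.execute("create table daily_sales (prodno integer, dailysales integer)")
conn.executemany(
    "insert into daily_sales values (?, ?)",
    [(1, 900), (1, 1100), (1, 1500), (2, 200), (2, 300), (2, 400)],
)

# The business user writes plain SQL; the Python runs inside the query.
result = conn.execute(
    """select prodno,
              py_median(dailysales),
              sales_band(py_median(dailysales))
       from daily_sales
       group by prodno
       order by prodno"""
).fetchall()
```

The query text stays ordinary SQL, so existing BI tools can issue it unchanged while the non-SQL logic lives behind a function name.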

Discovery

Production

Big Data + Hadoop + in-memory for BI


Wild West 1865 to 1890

"The Significance of the Frontier in American History" (1893), a thesis by Frederick Jackson Turner.

The West not as a particular geographic place, but a frontier process - as a series of Wests on a receding frontier line - the point where savagery meets civilization.

For Turner, American history was largely a tale of people leaving settled areas for the frontier, and their struggle to survive in new lands.

Driving the golden spike for Hadoop and BI

connect

kognitio.com

kognitio.tel

kognitio.com/blog

twitter.com/kognitio

linkedin.com/companies/kognitio

tinyurl.com/kognitio

youtube.com/kognitio

contact

Michael Hiskey, VP, Marketing & Business [email protected]

Paul Groom, Chief Innovation [email protected]

Steve Friedberg - press contact, MMI [email protected]

Kognitio is a Platinum Sponsor of the Hadoop Summit – see us at booth #31 – center!