Big Data and the BI Wild West
Transcript of Big Data and the BI Wild West, a slide deck uploaded by hadoopsummit (category: Technology)
What is Business Intelligence?
Numbers - Tables - Charts - Indicators
Time - History - Lag
Access - to view (portal) - to data - to depth - Control/Secure
Consumption - digestion
…with ease and simplicity
Business [Intelligence] Desires
More timely
Lower latency
More granularity
More user interaction
Richer data model
Self service
Got mobile?
200 million - employees bring their own device to work
Nearly half - of the workforce will be made up of millennials by 2020
50% - of companies with BYOD have had a security breach
1/3 - have broken or would break corporate policy on BYOD
BI tools have plateaued…again
Decision Support (Reporting) in late 90’s
Business Intelligence of 00’s
…led to data mining
…leading to analytics and data science
Machine learning algorithms
Dynamic simulation
Statistical Analysis
Clustering
Behaviour modelling
The drive for deeper understanding
[Chart: analytical complexity vs. technology/automation - Reporting & BPM, Fraud detection, Campaign Management, Dynamic Interaction]
create external script LM_PRODUCT_FORECAST environment rsint
receives ( SALEDATE DATE, DOW INTEGER, ROW_ID INTEGER,
           PRODNO INTEGER, DAILYSALES INTEGER )
partition by PRODNO order by PRODNO, ROW_ID
sends ( R_OUTPUT varchar )
isolate partitions
script S'endofr(
# Simple R script to run a linear fit on daily sales
prod1 <- read.csv(file=file("stdin"), header=FALSE, row.names=1)
colnames(prod1) <- c("DOW","ID","PRODNO","DAILYSALES")
dim1 <- dim(prod1)
daily1 <- aggregate(prod1$DAILYSALES, list(DOW = prod1$DOW), median)
daily1[,2] <- daily1[,2]/sum(daily1[,2])
basesales <- array(0, c(dim1[1], 2))
basesales[,1] <- prod1$ID
basesales[,2] <- (prod1$DAILYSALES/daily1[prod1$DOW+1,2])
colnames(basesales) <- c("ID","BASESALES")
fit1 = lm(BASESALES ~ ID, as.data.frame(basesales))
forecast <- array(0, c(dim1[1]+28, 4))
colnames(forecast) <- c("ID","ACTUAL","PREDICTED","RESIDUALS")
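The R script deseasonalizes daily sales by day-of-week median shares and then fits a linear trend. As a rough, illustrative re-statement of that logic in plain Python (not the author's code: the helper `linear_fit` and the row layout are mine, mirroring the R column names):

```python
from statistics import median

def linear_fit(xs, ys):
    """Closed-form least squares: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def forecast_base_sales(rows):
    """rows: list of (DOW, ID, DAILYSALES) tuples.
    Divide out day-of-week seasonality (median share per DOW),
    then fit BASESALES ~ ID, as the R code does with lm()."""
    # Median sales per day-of-week, normalised to shares summing to 1.
    by_dow = {}
    for dow, _id, sales in rows:
        by_dow.setdefault(dow, []).append(sales)
    med = {dow: median(v) for dow, v in by_dow.items()}
    total = sum(med.values())
    share = {dow: m / total for dow, m in med.items()}
    ids = [r[1] for r in rows]
    base = [sales / share[dow] for dow, _id, sales in rows]
    return linear_fit(ids, base)
```

With purely periodic sales (one constant level per weekday, no trend), the deseasonalized series is flat and the fitted slope is zero, which is the sanity check for the approach.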
select Trans_Year, Num_Trans,
       count(distinct Account_ID) Num_Accts,
       sum(count(distinct Account_ID)) over (partition by Trans_Year order by Num_Trans) Total_Accts,
       cast(sum(total_spend)/1000 as int) Total_Spend,
       cast(sum(total_spend)/1000 as int) / count(distinct Account_ID) Avg_Yearly_Spend,
       rank() over (partition by Trans_Year order by count(distinct Account_ID) desc) Rank_by_Num_Accts,
       rank() over (partition by Trans_Year order by sum(total_spend) desc) Rank_by_Total_Spend
from ( select Account_ID,
              Extract(Year from Effective_Date) Trans_Year,
              count(Transaction_ID) Num_Trans,
              sum(Transaction_Amount) Total_Spend,
              avg(Transaction_Amount) Avg_Spend
       from Transaction_fact
       where extract(year from Effective_Date) < 2009
         and Trans_Type = 'D'
         and Account_ID <> 9025011
         and actionid in (select actionid from DEMO_FS.V_FIN_actions
                          where actionoriginid = 1)
       group by Account_ID, Extract(Year from Effective_Date)
     ) Acc_Summary
group by Trans_Year, Num_Trans
order by Trans_Year desc, Num_Trans;
select dept, sum(sales)
from sales_fact
where period between date '01-05-2006' and date '31-05-2006'
group by dept
having sum(sales) > 50000;
select sum(sales) from sales_history where year = 2006 and month = 5 and region=1;
select total_sales from summary where year = 2006 and month = 5 and region=1;
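The last two statements answer the same question from progressively more pre-aggregated tables: a scan of the history table versus a single-row lookup in a summary table. A minimal sketch of that equivalence, using Python's sqlite3 with invented table contents:

```python
import sqlite3

# Illustrative only: a tiny detail table and a summary table
# pre-computed from it (hypothetical schema and data).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales_history (year INT, month INT, region INT, sales INT);
    CREATE TABLE summary (year INT, month INT, region INT, total_sales INT);
    INSERT INTO sales_history VALUES
        (2006, 5, 1, 100), (2006, 5, 1, 250), (2006, 5, 2, 999);
    -- Summary rows pre-computed from the detail rows above.
    INSERT INTO summary
        SELECT year, month, region, SUM(sales)
        FROM sales_history GROUP BY year, month, region;
""")

# Aggregate computed at query time from the detail rows...
detail = con.execute(
    "SELECT SUM(sales) FROM sales_history "
    "WHERE year = 2006 AND month = 5 AND region = 1").fetchone()[0]
# ...versus fetched directly from the pre-aggregated summary.
summary = con.execute(
    "SELECT total_sales FROM summary "
    "WHERE year = 2006 AND month = 5 AND region = 1").fetchone()[0]
assert detail == summary
```

Same answer either way; the summary table trades storage and refresh work for a cheaper read, which is exactly the plateau the slides describe.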
Behind the numbers
It’s all about getting work done
Bottlenecks
Tasks evolving:
Used to be simple fetch of value
Then was compute of dynamic aggregates
Now complex algorithms!
Bottlenecks
Time to influence
Reaction – what? – potential value
Action – opportunity - interaction
BI is becoming democratized
Business [Intelligence] Desires in relation to Big Data
More timely
Lower latency
More granularity
More user interaction
Richer data model
Self service
Hadoop ticks many but not all the boxes
Talk to BI team about plugging into Hadoop – should be simple?
No need to pre-process before storage i.e. no need to align to storage
No need to triage before storage
New economics = new attitude: just grab and retain all data; the data science team will dig into it later
Call IT: Why SQL so limited?
Hadoop is…
Inherently disk-oriented
Typically a low ratio of CPU to disk
Analytical Platform Reference Architecture
Application & Client Layer - All BI Tools, All OLAP Clients, Excel, Reporting
Analytical Platform Layer - Kognitio
Near-line Storage (optional)
Persistence Layer - Hadoop Clusters, Enterprise Data Warehouses, Legacy Systems, Kognitio Storage, Cloud Storage
MPP everything – get more work done
“No SQL” graduates to “not-only-SQL”
SQL remains the preferred data access language … for the business community
SQL can encapsulate other processing - in-line Python, R, Java etc.
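A tiny illustration of SQL encapsulating other processing (this is generic sqlite3/Python, not Kognitio's external-script syntax; the function name `log_spend` and the table are invented): register a procedural function, then invoke it from inside a SQL statement.

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")
# Register a Python function so SQL can call it per row --
# the same pattern, in miniature, as pushing R/Python inside the platform.
con.create_function("log_spend", 1, lambda x: math.log(x))

con.execute("CREATE TABLE t (account_id INT, spend REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, 100.0), (2, 1000.0)])

# The SELECT drives the Python code; the business user still just writes SQL.
rows = con.execute(
    "SELECT account_id, log_spend(spend) FROM t ORDER BY account_id"
).fetchall()
print(rows)
```

The point is the division of labour: the query planner handles data movement and parallelism, while the embedded function carries the analytic logic.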
Big Data + Hadoop + in-memory for BI
Wild West 1865 to 1890
"The Significance of the Frontier in American History" (1893), a thesis by Frederick Jackson Turner.
The West not as a particular geographic place, but a frontier process - as a series of Wests on a receding frontier line - the point where savagery meets civilization.
For Turner, American history was largely a tale of people leaving settled areas for the frontier, and their struggle to survive in new lands.
connect
kognitio.com
kognitio.tel
kognitio.com/blog
twitter.com/kognitio
linkedin.com/companies/kognitio
tinyurl.com/kognitio
youtube.com/kognitio
contact
Michael Hiskey - VP, Marketing & Business Development - [email protected]
Paul Groom - Chief Innovation Officer - [email protected]
Steve Friedberg - press contact - MMI Communications - [email protected]
Kognitio is a Platinum Sponsor of the Hadoop Summit – see us at booth #31 – center!