What You Should Know About Big Data
The Leader in Big Data Consulting
www.mammothdata.com | @mammothdataco
What You Should Know About Big Data
{CIO/CTO Breakfast Forum | Columbia}
Andrew C. Oliver, President & Founder
● @acoliver
● Programming since age 8
● Java since ~1997
● Founded POI project (currently hosted at Apache) with Marc Johnson ~2000
○ Former member Jakarta PMC
○ Emeritus member of Apache Software Foundation
● Joined JBoss ~2002
● Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org)
● Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver
○ I make fanboys cry
Open Software Integrators
Founded Nov 2007 by Andrew C. Oliver (me) in Durham, NC
Based in Durham, NC
Office also in Chicago, IL
Operate nationally (and occasionally internationally)
Started out specializing in Java/Linux/Enterprise Scalability, now moved more towards NoSQL, Big Data
Professional Services (Consulting, Training, Strategy)
Overview
What is Big Data?
What is Hadoop?
But…
Where should you use Big Data technologies?
Market Segments for Hadoop
Where shouldn’t you use Big Data technologies?
How can you identify places to use this?
Why should you do this?
Alphabet soup
What is Big Data?
A marketing term for a set of technologies,
mainly in the Hadoop ecosystem
Not a specific number of bytes or petabytes
What is Hadoop?
Core
HDFS - a distributed filesystem
YARN - a cluster manager
Map-Reduce implementation / API
Pig - a scripting language for writing map-reduce jobs
Hive - SQL and data warehousing infrastructure
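
To make the Map-Reduce item above concrete, here is a minimal sketch of the idea in plain Python, using the usual word-count example. It runs on one machine and does not use Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. On a real cluster the framework runs the map and reduce steps in parallel across nodes and does the grouping step for you.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key. A real framework does this between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is data"]
    print(word_count(sample))
```

The point of the split is that each map call only sees one line and each reduce call only sees one key, so both steps can be spread over many machines without coordination.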
But...
There is a larger ecosystem beyond this core...
Where Should You Use Big Data Technologies?
Unstructured Data
Lots of Data
High volume input
Data warehousing
Streams
Machine Learning / Decision Support
BI/Analytics
Market Segments
“New” market, new kinds of problems
Data Warehousing Market (MPP systems, Teradata, Netezza)
Machine Learning / Decision / BI
...growing… really fast
Where Shouldn’t You Use Big Data Technology?
With a few exceptions this isn’t your “operational” datastore
Cassandra sometimes
Clickstreams sort of
How Can You Identify Your Opportunities to Use Big Data Technology?
Some are obvious
Long running queries?
Questions your database can’t handle
Can you aggregate all the data you need to answer your questions?
Where are costs such as licensing a constraint?
Unify disparate data streams
Why Should You Do This?
How much data have you thrown away and then found out was useful?
Weblogs since 1996
What questions do you have?
How is your database doing for those long running queries?
How much would it cost to expand your proprietary data warehouse?
Competitive advantage
Alphabet Soup
Core
Hadoop Distributed File System (HDFS) - distributed filesystem
Yet Another Resource Negotiator (YARN) - cluster manager
Pig - SQL on steroids, query language for map-reduce jobs
Hive / Impala - data warehousing / SQL frameworks
More
HBase / Cassandra - Column Family datastores (time series data especially)
Spark/Shark and Storm - in-memory map-reduce (low latency), also streams
Oozie - workflow / job control
Ambari - admin/deployment tool (also Cloudera Manager)
Sqoop - ETL tool to extract/transform/load from your RDBMS
Flume - Enterprise Service Bus like tool for transporting data in/out
Mahout - Machine Learning / Decision making
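
For the Column Family datastores in the list above (HBase / Cassandra), a rough sketch of why they suit time series data: each row key holds many columns kept in sorted order, so a range of timestamps for one source can be read as a single slice. The toy class below is an in-memory model in plain Python, not the HBase or Cassandra API; ColumnFamilyTable, put, and scan are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy model: row key -> {column name -> value}, read back in column order."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell. Rows are wide and need no fixed schema."""
        self._rows[row_key][column] = value

    def scan(self, row_key, start, end):
        """Return (column, value) pairs with start <= column < end, sorted by column."""
        row = self._rows.get(row_key, {})
        return [(col, row[col]) for col in sorted(row) if start <= col < end]


if __name__ == "__main__":
    table = ColumnFamilyTable()
    # One row per sensor, one column per timestamp (denormalized on purpose).
    table.put("sensor-42", 1700000120, 21.9)
    table.put("sensor-42", 1700000000, 21.5)
    table.put("sensor-42", 1700000060, 21.7)
    print(table.scan("sensor-42", 1700000000, 1700000120))
```

A real column-family store keeps the columns sorted on disk and spreads row keys across nodes, which is what makes this access pattern fast at scale; the sketch only shows the shape of the data.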
Conclusions
RDBMS may not scale to your needs
Your data may not map efficiently to tables
Column Family/Big Table - fast, scalable, denormalized, map reduce, good for series, not efficient for complex data
Hadoop is an ecosystem of different software packages mainly centered around HDFS and Map Reduce (but not exclusively)
Both expands our capabilities and disrupts old technologies
Not usually an operational datastore
Use this where you need it; most places create a basic POC and then deploy a competency center/platform then increase uses
There is a long list of alphabet soup and addons...
Thank you for attending!