Posted on 17-Mar-2018
Josh Patterson
Email:
josh@pattersonconsultingtn.com
Twitter:
@jpatanooga
Github:
https://github.com/jpatanooga
Slideshare:
http://www.slideshare.net/jpatanooga/
Past
Published in IAAI-09:
“TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA)
Hadoop and the Smartgrid
Cloudera
Principal Solution Architect
Today: Patterson Consulting
Overview
• What is Hadoop?
• Hadoop and Industry
• Is Hadoop for Me?
Hadoop Distributed File System (HDFS)
MapReduce
Apache Hadoop
• Consolidates Mixed Data
  • Move complex and relational data into a single repository
• Stores Inexpensively
  • Keep raw data always available
  • Use industry standard hardware
• Processes at the Source
  • Eliminate ETL bottlenecks
  • Mine data first, govern later
Why is it Called Hadoop?
Doug’s son had a toy elephant that he called “Ha-Doop”
Doug Cutting Invented Hadoop.
What Hadoop Does
Uses commodity hardware / servers
Scales into Petabytes without hardware changes
Manages fault tolerance and replication with its distributed file system
Scalable processing engine handles all types of data
Text, logs, documents
Binary, images, video
Hadoop Distributed File System (HDFS)
Based on design of Google’s GFS
Designed for high throughput, to sustain parallel MapReduce processing jobs
Data stored in large files
Files split into large blocks (64 MB, 128 MB, 256 MB, etc.)
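To make the block sizes above concrete, here is a minimal sketch of the arithmetic behind storing one file in HDFS. The function name is hypothetical, and the 128 MB block size and replication factor of 3 are assumptions (both are configurable per cluster); the slides do not state which values were used.

```python
def hdfs_block_layout(file_size_bytes, block_size_bytes=128 * 1024**2, replication=3):
    """Return (number_of_blocks, raw_bytes_stored) for a single file.

    Assumptions for illustration: 128 MB blocks and 3 replicas.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication >= 1")
    # Ceiling division: a partially filled final block still counts as a block.
    num_blocks = -(-file_size_bytes // block_size_bytes)
    # The final block only occupies the bytes it actually holds,
    # so raw usage is the file size times the number of replicas.
    raw_bytes = file_size_bytes * replication
    return num_blocks, raw_bytes


one_gib = 1024**3
print(hdfs_block_layout(one_gib))                                    # 128 MB blocks
print(hdfs_block_layout(one_gib, block_size_bytes=64 * 1024**2))     # 64 MB blocks
```

Larger blocks mean fewer blocks per file, which keeps the amount of block metadata small and favors long sequential reads.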
MapReduce: Distributed Processing
Hadoop Analysis Tools
Java MapReduce
Hive and Impala
SQL-like language for Hadoop
Declarative higher level language
Pig
Procedural higher level language
Filters, joins, UDFs
Starting Out: 2008
Got going at Facebook and Yahoo
Became the backbone of many CA startups
In 2009 we did a POC with Hadoop @ TVA
http://openpdc.codeplex.com/
Source: IDC White Paper - sponsored by EMC.
As the Economy Contracts, the Digital Universe Expands. May 2009.
Unstructured Data Explosion
• 2,500 exabytes of new information in 2012 with Internet as primary driver
Relational
Complex, Unstructured
(You)
Financial Services Got Interested in Hadoop
Banks saw the potential to look at a lot of transactions for things like
Fraud
Money Laundering Detection
Comprehensive Credit Reports
Teradata had become very expensive
Started using Hadoop to augment mass data transforms
Genomics
A genome is 2.8GB
There are lots of genomes
Genomics (ISB, Novartis, etc) became very interesting on Hadoop around 2010 and 2011
There are many CPU bound processes in genomics
But a lot of it is also disk bound – great for Hadoop
Other Verticals Jumped In
Telecoms
Lots of call histories, billing records, etc. to look at
Media
Recommend what folks should watch next based on what they already watch
Manufacturing
Sensor data on devices works well as timeseries in Hadoop
Insurance
With lots of data, insurers can build better models of how folks live
This gives insurance companies better ways to model policies
Data is Not Always Big
But the world is becoming progressively more interested in data
Big and small
Data analysis is driving how we build new products, manage our lives, and consume content
Focus on producing a result that is relevant to the industry
And not on how much data you have or if it qualifies as “big”
ETL Pipelines
Many early Hadoop use cases involve porting a data transform pipeline into Hive or MapReduce
Allows for linear scalability in throughput
Processing web logs has always been the MapReduce base case
Many times the result data is sent back to an RDBMS store
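The web log base case can be sketched in plain Python to show the map, shuffle, and reduce steps. This is a single-process illustration only, not Hadoop's API: the function names, the sample log lines, and the access-log layout (request in double quotes) are all assumptions made for the example.

```python
from collections import defaultdict


def map_hits(line):
    """Map step: emit (url_path, 1) for one access-log line, or nothing if malformed."""
    parts = line.split('"')
    if len(parts) < 2:
        return []
    request = parts[1].split()  # e.g. ['GET', '/index.html', 'HTTP/1.1']
    if len(request) < 2:
        return []
    return [(request[1], 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_hits(key, values):
    """Reduce step: total the hits for one URL path."""
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in map_hits(line)]
    return dict(reduce_hits(k, v) for k, v in shuffle(mapped).items())


# Hypothetical sample data, made up for this sketch.
SAMPLE_LOG = [
    '10.0.0.1 - - [17/Mar/2018:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [17/Mar/2018:10:00:01 +0000] "GET /about.html HTTP/1.1" 200 128',
    '10.0.0.1 - - [17/Mar/2018:10:00:02 +0000] "GET /index.html HTTP/1.1" 404 0',
    'malformed line',
]

print(run_job(SAMPLE_LOG))
```

On a real cluster the map calls run in parallel next to the data blocks and the framework performs the shuffle; the small per-URL totals are the kind of result that is then loaded back into an RDBMS.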
Recommend Products
Ever used Facebook’s “People you may Know”?
Ever used Amazon’s “People Also Bought”?
These are recommendation systems
Hadoop powers both of these
Deep Dive into Recommenders on Hadoop
http://www.slideshare.net/jpatanooga/la-hug-dec-2011-recommendation-talk
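As a rough illustration of the "also bought" idea, one simple approach is to count how often two items appear in the same purchase. The slides do not say which algorithm these sites use, so this is a swapped-in technique for illustration; the function names and the sample baskets are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    co = defaultdict(Counter)
    for basket in baskets:
        # Deduplicate so an item bought twice in one basket counts once.
        for a, b in combinations(sorted(set(basket)), 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def also_bought(co, item, top_n=3):
    """Return up to top_n items most often bought with `item` (ties broken by name)."""
    counts = co.get(item, Counter())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical purchase history, made up for this sketch.
BASKETS = [
    ["book", "lamp", "pen"],
    ["book", "pen"],
    ["lamp", "desk"],
    ["book", "pen", "desk"],
]

co = build_cooccurrence(BASKETS)
print(also_bought(co, "book", top_n=2))
```

The pair counting is independent per basket, which is why this kind of job spreads well across a batch system: each basket can be processed separately and the counts summed afterwards.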
Hadoop as Next-Gen Data Warehouse
Easier to work with data where it lives
Less dependent on DBAs and schemas
Hadoop is Community driven
Is spiritually very similar to Linux
Open source core
With Distro model (think RedHat and Ubuntu)
Is Hadoop For My Use Case?
Am I doing a lot of table / disk based transforms?
The process is disk bound and batch-oriented, so yes
Do I need to ad-hoc query large amounts of data?
Hive or Impala make sense here
Am I dealing with a lot of incoming transactional data that I’d like to analyze (logs, typically)?
Hadoop is great for cheap storage and scalable processing
Questions?
Thanks for coming out to hear about Hadoop!