Chattanooga Hadoop Meetup - Hadoop 101 - November 2014



Josh Patterson

Email: josh@pattersonconsultingtn.com

Twitter: @jpatanooga

Github: https://github.com/jpatanooga

Slideshare: http://www.slideshare.net/jpatanooga/

Past

Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”

Grad work in meta-heuristics, ant algorithms

Tennessee Valley Authority (TVA): Hadoop and the smartgrid

Cloudera: Principal Solution Architect

Today: Patterson Consulting

Overview

• What is Hadoop?

• Hadoop and Industry

• Is Hadoop for Me?

Hadoop Distributed File System (HDFS)

MapReduce

Apache Hadoop

• Consolidates mixed data: moves complex and relational data into a single repository

• Stores inexpensively: keeps raw data always available on industry-standard hardware

• Processes at the source: eliminates ETL bottlenecks; mine data first, govern later


Why is it Called Hadoop?

Doug Cutting invented Hadoop and named it after his son’s toy elephant, “Ha-Doop.”

What Hadoop Does

Uses commodity hardware / servers

Scales into petabytes without hardware changes

Manages fault tolerance and replication with its distributed file system

Scalable processing engine handles all types of data

Text, logs, documents

Binary, images, video

Hadoop Distributed File System (HDFS)

Based on the design of Google’s GFS

Built for the sustained high throughput that MapReduce parallel processing jobs require

Data stored in large files, split into large blocks (64 MB, 128 MB, 256 MB, etc.)
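A rough sketch of that block splitting in plain Python (just the arithmetic, not actual HDFS code; `split_into_blocks` and the 128 MB default are illustrative assumptions, though 128 MB is a common HDFS block size):

```python
# Sketch: how a file's byte range maps onto fixed-size HDFS-style blocks.
def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Return (offset, length) pairs, one per block of the file."""
    blocks = []
    offset = 0
    while offset < file_size_bytes:
        # The final block may be smaller than the configured block size.
        length = min(block_size_bytes, file_size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file with 128 MB blocks splits into 128 + 128 + 44 MB.
MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
print(len(blocks))                             # 3
print([length // MB for _, length in blocks])  # [128, 128, 44]
```

Each of those blocks is then replicated (typically three ways) across different machines, which is how HDFS gets its fault tolerance.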

MapReduce: Distributed Processing
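The pattern itself is simple enough to sketch without a cluster: a map function emits key/value pairs, the framework groups them by key (the shuffle), and a reduce function folds each group into a result. A minimal single-process word-count sketch in Python (the real framework runs the map and reduce tasks in parallel across machines and HDFS blocks):

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line of input."""
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    """Reduce: fold all the counts for one word into a total."""
    return (word, sum(counts))

def word_count(lines):
    # Shuffle: group the mapped pairs by key, as the framework would.
    groups = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            groups[word].append(count)
    return dict(reduce_phase(w, c) for w, c in groups.items())

print(word_count(["big data", "big hadoop"]))  # {'big': 2, 'data': 1, 'hadoop': 1}
```

Because each map call sees only one record and each reduce call sees only one key's group, both phases parallelize naturally.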

Hadoop Analysis Tools

Java MapReduce

Hive and Impala

SQL-like, declarative higher-level languages for Hadoop

Pig

Procedural higher-level language

Filters, joins, UDFs


Starting Out: 2008

Got going at Facebook and Yahoo

Became the backbone of many California startups

In 2009 we did a POC with Hadoop @ TVA

http://openpdc.codeplex.com/

Source: IDC White Paper, sponsored by EMC: “As the Economy Contracts, the Digital Universe Expands,” May 2009.

Unstructured Data Explosion

• 2,500 exabytes of new information in 2012 with Internet as primary driver

[Slide graphic: complex, unstructured data growth dwarfing relational data]

Financial Services Got Interested in Hadoop

Banks saw the potential to look at a lot of transactions for things like:

Fraud

Money laundering detection

Comprehensive credit reports

Teradata had become very expensive

Started using Hadoop to augment mass data transforms

Genomics

A genome is 2.8GB

There are lots of genomes

Genomics groups (ISB, Novartis, etc.) became very interested in Hadoop around 2010 and 2011

There are many CPU bound processes in genomics

But a lot of it is also disk bound – great for Hadoop

Other Verticals Jumped In

Telecoms

Lots of call histories to look at, billing, etc.

Media

Help recommend folks stuff to watch based on what they watch

Manufacturing

Sensor data on devices works well as timeseries in Hadoop

Insurance

Lots of data can build better models of how folks live

Can give insurance companies better ways to model policies

Data is Not Always Big

But the world is becoming progressively more interested in data

Big and small

Data analysis is driving how we build new products, manage our lives, and consume content

Focus on producing a result that is relevant to the industry

And not on how much data you have or if it qualifies as “big”

ETL Pipelines

Many early Hadoop use cases involve porting a data transform pipeline into Hive or MapReduce

Allows for linear scalability in throughput

Processing web logs has always been the MapReduce base case

Many times the result data is sent back to an RDBMS store
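A hedged sketch of that base case as a single-process Python stand-in for what would be a MapReduce job over HDFS (the `ip path status` log format and the helper names are assumptions for illustration, not a real log schema):

```python
# Sketch of a log-transform ETL step: parse raw web-log lines, drop bad
# records, and aggregate hits per status code. In Hadoop, the parse/filter
# would be the map phase and the aggregation the reduce phase, run in
# parallel over the file's HDFS blocks.
from collections import Counter

def parse_line(line):
    """Parse an assumed 'ip path status' log line; return None if malformed."""
    parts = line.split()
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    ip, path, status = parts
    return {"ip": ip, "path": path, "status": int(status)}

def hits_per_status(lines):
    records = (parse_line(line) for line in lines)
    return Counter(r["status"] for r in records if r is not None)

logs = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "garbage line",
    "10.0.0.1 /index.html 200",
]
print(hits_per_status(logs))  # Counter({200: 2, 404: 1})
```

The small aggregated result (a handful of status-code counts, versus gigabytes of raw logs) is exactly the kind of output that gets exported back to an RDBMS for reporting.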

Recommend Products

Ever used Facebook’s “People you may Know”?

Ever used Amazon’s “People Also Bought”?

These are recommendation systems

Hadoop powers both of these

Deep Dive into Recommenders on Hadoop

http://www.slideshare.net/jpatanooga/la-hug-dec-2011-recommendation-talk

Hadoop as Next-Gen Data Warehouse

Easier to work with data where it lives

Less dependent on DBAs and schemas

Hadoop is community driven

Spiritually very similar to Linux: an open source core with a distro model (think Red Hat and Ubuntu)

Is Hadoop For My Use Case?

Am I doing a lot of table / disk based transforms?

The process is disk bound and batch-oriented, so yes

Do I need to ad-hoc query large amounts of data?

Hive or Impala make sense here

Am I dealing with a lot of incoming transactional data that I’d like to analyze (logs, typically)?

Hadoop is great for cheap storage and scalable processing

Questions?

Thanks for coming out to hear about Hadoop!