SF Hadoop Users Group August 2014 Meetup Slides

Hadoop at Lookout Aug 13, 2014 Yash Ranadive @yashranadive Thursday, August 14, 14


Slides for Hadoop Users Group Meetup on 13th August 2014

Transcript of SF Hadoop Users Group August 2014 Meetup Slides

Page 1: SF Hadoop Users Group August 2014 Meetup Slides

Hadoop at Lookout

Aug 13, 2014

Yash Ranadive @yashranadive

Thursday, August 14, 14

Page 2: SF Hadoop Users Group August 2014 Meetup Slides

BIO

• Data Engineer

• From Mumbai, India

• Lived in 7 different cities in US

• @yashranadive

• etl.svbtle.com


Page 3: SF Hadoop Users Group August 2014 Meetup Slides

AGENDA

• What we do @Lookout

• Data warehouse

• Evolution from monolithic to micro-services

• Protocol Buffers

• Areas we are exploring


Page 4: SF Hadoop Users Group August 2014 Meetup Slides

WHAT WE DO @LOOKOUT


Page 5: SF Hadoop Users Group August 2014 Meetup Slides

Over 50 million registered users


Page 6: SF Hadoop Users Group August 2014 Meetup Slides

DATA TEAM

• 3 Data Engineers

• 6 data analysts

• Hadoop

• 64 hosts

• 300 TB capacity


Page 7: SF Hadoop Users Group August 2014 Meetup Slides

DATA WAREHOUSE

[Diagram labels: Internal and External Data Sources; Chunker; Mudskipper; MySQL Star Schema Warehouse; HDFS; Hive; HBase; Impala; R; Hue; Shiny; Tableau; Custom Apps]


Page 8: SF Hadoop Users Group August 2014 Meetup Slides

FROM MONOLITHIC TO MICROSERVICES


Page 9: SF Hadoop Users Group August 2014 Meetup Slides

MONOLITHIC APPLICATION

Routing

Controller

Mobile/Web Clients

Database

RAILS APPLICATION

HTTP

ORM

Views

Tables


Page 10: SF Hadoop Users Group August 2014 Meetup Slides

DATA INGESTION - MONOLITHIC

Application master_db slave_db

Data Warehouse

MySQL Hive

ETL

ELT MySQL

Replication

External Sources

Reporting

Ingestion is batch-oriented


Page 11: SF Hadoop Users Group August 2014 Meetup Slides

PROBLEM

• Rails offers fast time to market (TTM) but is challenging to scale

• One code base

• Slower Deployments

• Too complex and large to manage

• Solution

• Microservices / service oriented architecture

• Break the app out into smaller services


Page 12: SF Hadoop Users Group August 2014 Meetup Slides

MICROSERVICES ARCHITECTURE

Routing

Controller

Mobile/Web Clients

Database

RAILS APPLICATION

HTTP

ORM

Views

Tables

Settings Service

PhotoBackup

We frequently add new services


Page 13: SF Hadoop Users Group August 2014 Meetup Slides

DATA INGESTION - MICROSERVICES

Application master_db slave_db

Data Warehouse

MySQL Hive

ETL

ELT MySQL

Replication

External Sources

Reporting

Settings Service

Backup Service

Locate Service

Messaging Layer

Consumer


Page 14: SF Hadoop Users Group August 2014 Meetup Slides

DATA INGESTION - MONOLITHIC VS MICROSERVICES

Monolithic - Snapshot of a point in time

select * from user_settings;

id | setting_id  | user_id | modified_at
=============================================
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z

Microservices - Events

{guid: 1, event_type: "modify_setting", setting_id: "backup", setting_status: "ON", user_id: "2629", timestamp: "20140709T0400Z"}

{guid: 3, event_type: "start_backup", user_id: "2629", timestamp: "20140709T0400Z"}

...
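The difference between the two shapes of data can be illustrated with a small sketch: a snapshot table holds only the latest state, while an event stream can be replayed to derive that state. The sketch below is an assumption for illustration only, not Lookout's code. The third event (guid 4) and the function name `snapshot_settings` are hypothetical.

```python
# Hypothetical event stream modelled on the two sample events on this slide.
# The guid 4 event is invented to show a later change to the same setting.
EVENTS = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup", "user_id": "2629",
     "timestamp": "20140709T0400Z"},
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0405Z"},
]


def snapshot_settings(events):
    """Replay modify_setting events into {(user_id, setting_id): (status, timestamp)}."""
    state = {}
    # The fixed-width timestamp format sorts correctly as a plain string.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] != "modify_setting":
            continue  # other event types do not change settings state
        key = (event["user_id"], event["setting_id"])
        state[key] = (event["setting_status"], event["timestamp"])
    return state


if __name__ == "__main__":
    print(snapshot_settings(EVENTS))
```

The snapshot keeps only the final "OFF" state, whereas the event list also records that the setting was "ON" earlier and that a backup started in between.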


Page 15: SF Hadoop Users Group August 2014 Meetup Slides

DESIGN

• We wanted to create an always-on event ingestion framework that:

• Would scale workers on demand

• Would be easy to monitor


Page 16: SF Hadoop Users Group August 2014 Meetup Slides

FIRST STAB - WORKER

Service -> ActiveMQ -> Ruby Worker -> HIVE

• Upstart script that daemonized Ruby process

• Monitoring using Zenoss

• Very easy to set up

• Mapping Files for JSON -> CSV

• Ruby is terse and clean
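The "mapping files for JSON -> CSV" idea might look roughly like the sketch below. The original worker was written in Ruby and its mapping format is not shown in the slides, so the `MAPPING` structure, the column order, and the function name `json_to_csv_row` are all assumptions made for illustration (written here in Python).

```python
import csv
import io
import json

# Hypothetical mapping: event type -> ordered JSON keys that become CSV columns.
MAPPING = {
    "modify_setting": ["guid", "user_id", "setting_id", "setting_status", "timestamp"],
    "start_backup": ["guid", "user_id", "timestamp"],
}


def json_to_csv_row(line, mapping=MAPPING):
    """Convert one JSON event (a string) into one CSV line using the mapping."""
    event = json.loads(line)
    columns = mapping[event["event_type"]]
    buf = io.StringIO()
    # Missing keys become empty cells rather than raising an error.
    csv.writer(buf, lineterminator="\n").writerow([event.get(c, "") for c in columns])
    return buf.getvalue()


if __name__ == "__main__":
    sample = ('{"guid": 3, "event_type": "start_backup", '
              '"user_id": "2629", "timestamp": "20140709T0400Z"}')
    print(json_to_csv_row(sample), end="")
```

Keeping the column order in a mapping rather than in code means a new event type only needs a new mapping entry, which is one plausible reason for using mapping files.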


Page 17: SF Hadoop Users Group August 2014 Meetup Slides

PROBLEMS

• ActiveMQ

• ActiveMQ did not scale well - even with multiple machines in the AMQ cluster

• ActiveMQ creates a separate queue for every consumer of the topic

• Monitoring using Zenoss is not ideal, especially for multi-process consumers

• The worker ran on a single machine, so it was not fault tolerant


Page 18: SF Hadoop Users Group August 2014 Meetup Slides

CURRENT ARCHITECTURE - WORKER

Service -> Kafka -> Storm -> HIVE

• Monitoring using Storm’s Thrift API

• Scaling number of workers is easy

• Kafka has better scalability than ActiveMQ

Service ActiveMQ


Page 19: SF Hadoop Users Group August 2014 Meetup Slides

Storm

STORM TOPOLOGY

Service Kafka HDFS

Kafka Spout

ActiveMQ Spout

Processing Bolt

Storm-hdfs bolt

Landing Directory

Hive Directory


Page 20: SF Hadoop Users Group August 2014 Meetup Slides

JSON PROBLEMS

• Problems with JSON

• No predefined schema

• No enforcement of backward compatibility

• Solution

• Protocol Buffers (also Avro/Thrift)


Page 21: SF Hadoop Users Group August 2014 Meetup Slides

PROTOBUFS

• What?

• Way of encoding structured data

• Binary

• Why?

• Schema

• Backward compatibility

• Smaller in size than JSON
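To make the "binary" and "smaller in size than JSON" points concrete, here is a minimal hand-rolled sketch of the Protocol Buffers wire format (varints and length-delimited fields). It is not the protobuf library and not Lookout's schema: the field numbers 1 to 3 and the helper names are assumptions, and `user_id` is treated as an integer here for illustration.

```python
import json


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_field(field_number, value):
    # Tag = (field_number << 3) | wire_type, where wire type 0 is varint.
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number, text):
    # Wire type 2 is length-delimited: tag, byte length, then the bytes.
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


def encode_event(guid, user_id, setting_id):
    # Hypothetical field numbers: 1 = guid, 2 = user_id, 3 = setting_id.
    return (encode_varint_field(1, guid)
            + encode_varint_field(2, user_id)
            + encode_string_field(3, setting_id))


binary = encode_event(1, 2629, "backup")
as_json = json.dumps({"guid": 1, "user_id": 2629, "setting_id": "backup"})

if __name__ == "__main__":
    print(len(binary), "bytes as protobuf wire format")
    print(len(as_json.encode("utf-8")), "bytes as JSON")
```

The saving comes mostly from replacing field names with small numeric tags, which is also why both sides need the schema to interpret the bytes.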


Page 22: SF Hadoop Users Group August 2014 Meetup Slides

VERSIONING

• backward compatible changes only

.proto .proto

Version 1.4 Version 1.1

Producer Consumer Queue
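One reason a producer and a consumer on different schema versions can coexist is that a protobuf reader skips field numbers it does not know. The sketch below hand-decodes the wire format to show this. It is an illustration under stated assumptions, not the protobuf library: the field numbers, the extra field 4, and the names `decode_known`, `V14_BYTES`, and `V11_FIELDS` are hypothetical, and only wire types 0 and 2 are handled.

```python
def read_varint(buf, pos):
    """Read a base-128 varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_known(buf, known):
    """Decode fields listed in known ({field_number: name}); skip all others."""
    message = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes)
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if field_number in known:
            message[known[field_number]] = value
        # Unknown field numbers are read past and dropped.
    return message


# Bytes a hypothetical newer producer might emit: field 1 (user_id = 2629),
# field 2 (setting_id = "backup"), field 4 (added in a later version, value 1).
V14_BYTES = bytes([0x08, 0xC5, 0x14, 0x12, 0x06]) + b"backup" + bytes([0x20, 0x01])

# An older consumer that only knows fields 1 and 2.
V11_FIELDS = {1: "user_id", 2: "setting_id"}

if __name__ == "__main__":
    print(decode_known(V14_BYTES, V11_FIELDS))
```

This only works if changes stay backward compatible, for example adding new field numbers rather than reusing or retyping existing ones.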


Page 23: SF Hadoop Users Group August 2014 Meetup Slides

SHARING PROTOBUF SCHEMAS

Artifactory (Schema Repo)

Producers: Push Java jars, Ruby gems

Data Team Storm Project: Pull Java jars


Page 24: SF Hadoop Users Group August 2014 Meetup Slides

BUT HOW DO YOU STORE PROTOBUFS IN HDFS?


Page 25: SF Hadoop Users Group August 2014 Meetup Slides

HOW WE STORE PROTOBUFS

• Store raw version

• Raw dump of Kafka topic into HDFS

• Convert them to a tuple using Storm

• Inflate then convert to TSV

• Can query raw protobufs directly from HIVE but we don’t yet

• elephant-bird (difficult to get it working)


Page 26: SF Hadoop Users Group August 2014 Meetup Slides

Storm

STORM TOPOLOGY

Service Kafka HDFS

Kafka Spout

ActiveMQ Spout

Deserialize Protobuf

Storm-hdfs bolt

Landing Directory

Hive Directory


Page 27: SF Hadoop Users Group August 2014 Meetup Slides

AREAS WE ARE EXPLORING


Page 28: SF Hadoop Users Group August 2014 Meetup Slides

SPARK

• ETL

• Word count: ~5 lines of Scala code vs. 58 lines of Java MapReduce code

• Spark Streaming can achieve results similar to Storm's through micro-batching: http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

• Machine Learning

• Online learning using MLlib

• Logistic Regression and SVM
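The micro-batching idea mentioned above can be sketched without Spark: instead of processing one record at a time, the stream is cut into small batches and each batch updates a running result. This is a plain-Python illustration, not the Spark Streaming API. Spark Streaming cuts batches by time interval, whereas this sketch cuts by record count for simplicity, and the function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield successive lists of up to batch_size items from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def running_word_count(lines, batch_size=2):
    """Yield the cumulative word counts after each micro-batch of lines."""
    totals = Counter()
    for batch in micro_batches(lines, batch_size):
        totals.update(word for line in batch for word in line.split())
        yield dict(totals)


if __name__ == "__main__":
    for counts in running_word_count(["a b", "b c", "a"], batch_size=2):
        print(counts)
```

Larger batches trade latency for throughput, which is the main contrast with record-at-a-time processing.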


Page 29: SF Hadoop Users Group August 2014 Meetup Slides

H2O

• In-memory machine learning

• Tight integration with R

• Preferred by Data Scientists


Page 30: SF Hadoop Users Group August 2014 Meetup Slides

OPEN SOURCE PROJECTS

• Currently open sourced

• Pipefish - write from MySQL to HDFS: github.com/lookout/pipefish

• Future

• Mudskipper - capture change-data events from MySQL binlogs.

• Chunker - download MySQL table data in chunks


Page 31: SF Hadoop Users Group August 2014 Meetup Slides

Questions
