SF Hadoop Users Group August 2014 Meetup Slides

Hadoop at Lookout

Aug 13, 2014

Yash Ranadive

@yashranadive

Thursday, August 14, 14

• Data Engineer

• From Mumbai, India

• Lived in 7 different cities in US

• @yashranadive

• etl.svbtle.com

AGENDA

• What we do @Lookout

• Data warehouse

• Evolution from monolithic to micro-services

• Protocol Buffers

• Areas we are exploring

WHAT WE DO @LOOKOUT

Over 50 million registered users

DATA TEAM

• 3 Data Engineers

• 6 data analysts

• Hadoop

• 64 hosts

• 300 TB capacity

DATA WAREHOUSE

INTERNAL AND EXTERNAL DATA SOURCES

MySQL Star Schema

Warehouse

HIVE HBase Impala

Chunker

Mudskipper

R Hue Shiny Tableau Custom Apps

WAREHOUSE

FROM MONOLITHIC TO MICROSERVICES

MONOLITHIC APPLICATION

Routing

Controller

Mobile/Web Clients

Database

RAILS APPLICATION

Tables

DATA INGESTION - MONOLITHIC

Application master_db slave_db

Data Warehouse

MySQL Hive

ETL

ELT

MySQL

Replication

External Sources

Reporting

Ingestion is batch-oriented

PROBLEM

• Rails offers fast time to market (TTM) but is challenging to scale

• One code base

• Slower Deployments

• Too complex and large to manage

• Solution

• Microservices / service oriented architecture

• Break the app out into smaller services

MICROSERVICES ARCHITECTURE

Routing

Controller

Mobile/Web Clients

Database

RAILS APPLICATION

Tables

Settings Service

PhotoBackup

We frequently add new services

DATA INGESTION - MICROSERVICES

Application master_db slave_db

Data Warehouse

MySQL Hive

ELT

MySQL

Replication

External Sources

Reporting

Settings Service

Backup Service

Locate Service

Messaging Layer

Consumer

DATA INGESTION - MONOLITHIC VS MICROSERVICES

select * from user_settings;

id | setting_id  | user_id | modified_at
=============================================
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z

{guid: 1, event_type: "modify_setting", setting_id: "backup", setting_status: "ON", user_id: "2629", timestamp: "20140709T0400Z"}

{guid: 3, event_type: "start_backup", user_id: "2629", timestamp: "20140709T0400Z"}

...

Monolithic - Snapshot of a point in time

Microservices - Events
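
The difference between the two models can be sketched in Python. This is only an illustration, not Lookout's code: the event fields follow the sample events on this slide, while the third ("OFF") event, the function name `replay`, and the layout of the resulting dictionary are assumptions made for the example.

```python
# Sample events shaped like the ones on the slide. The "OFF" event is a
# hypothetical addition so that the replay has something to overwrite.
events = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup", "user_id": "2629",
     "timestamp": "20140709T0400Z"},
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0410Z"},
]


def replay(events):
    """Fold an event stream into a point-in-time snapshot of settings.

    Returns {(user_id, setting_id): (status, modified_at)}, which is roughly
    the information a row in a table such as user_settings would hold.
    """
    snapshot = {}
    # Timestamps in this fixed-width format sort correctly as plain strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["event_type"] != "modify_setting":
            continue  # events such as start_backup do not change settings
        key = (event["user_id"], event["setting_id"])
        snapshot[key] = (event["setting_status"], event["timestamp"])
    return snapshot


snapshot = replay(events)
```

The snapshot keeps only the latest state per user and setting, whereas the event list also records that the setting was once ON and that a backup was started.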

DESIGN

• We wanted to create an always-on event ingestion framework that:

• Would scale workers on demand

• Would be easy to monitor

FIRST STAB - WORKER

Service ActiveMQ Ruby Worker HIVE

• Upstart script that daemonized the Ruby process

• Monitoring using Zenoss

• Very easy to set up

• Mapping Files for JSON -> CSV

• Ruby is terse and clean

PROBLEMS

• ActiveMQ

• ActiveMQ did not scale well - even with multiple machines in the AMQ cluster

• ActiveMQ creates a separate queue for every consumer of the topic

• Monitoring using Zenoss is not ideal especially for multi-process consumers

• The worker ran on a single machine, so it was not fault tolerant

CURRENT ARCHITECTURE - WORKER

Service Kafka Storm HIVE

• Monitoring using Storm’s thrift API

• Scaling number of workers is easy

• Kafka has better scalability than ActiveMQ

Service ActiveMQ

STORM TOPOLOGY

Service Kafka HDFS

Kafka Spout

ActiveMQ Spout

Processing Bolt

Storm-hdfs bolt

Landing Directory

Hive Directory

JSON PROBLEMS

• Problems with JSON

• No predefined schema

• No enforcement of backward compatibility

• Solution

• Protocol Buffers (also Avro/Thrift)

PROTOBUFS

• What?

• Way of encoding structured data

• Binary

• Why?

• Schema

• Backward compatibility

• Smaller in size than JSON

VERSIONING

• backward compatible changes only

.proto .proto

Version 1.4 Version 1.1

Producer Queue Consumer
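
Why an older consumer can read messages from a newer producer can be sketched with a hand-rolled decoder for a small part of the protobuf wire format. This is an illustration only: real code would use the classes generated by `protoc` from the shared `.proto` files, this sketch handles just wire types 0 (varint) and 2 (length-delimited), and the field numbers and names are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_field(number, value):
    """Encode an int as wire type 0 (varint) or a str as wire type 2."""
    if isinstance(value, int):
        return encode_varint((number << 3) | 0) + encode_varint(value)
    data = value.encode("utf-8")
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


def decode(buf, known_fields):
    """Decode only the field numbers in known_fields and skip the rest."""
    message = {}
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            length, pos = decode_varint(buf, pos)
            value = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if number in known_fields:  # unknown field numbers are ignored
            message[known_fields[number]] = value
    return message


# Hypothetical schemas: the newer producer added field 3 (setting_status).
V1_1_FIELDS = {1: "user_id", 2: "setting_id"}
V1_4_FIELDS = {1: "user_id", 2: "setting_id", 3: "setting_status"}

payload = encode_field(1, 2629) + encode_field(2, "backup") + encode_field(3, "ON")
old_view = decode(payload, V1_1_FIELDS)
new_view = decode(payload, V1_4_FIELDS)
```

Because every field carries its own number and wire type, the version 1.1 consumer can step over field 3 without knowing what it means. This only works as long as existing field numbers are never reused or retyped, which is one reason to allow backward compatible changes only.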

SHARING PROTOBUF SCHEMAS

Artifactory (Schema Repo)

Producers

Push: Java jars, Ruby gems

Data Team Storm Project

Pull: Java jars

BUT HOW DO YOU STORE PROTOBUFS IN HDFS?

HOW WE STORE PROTOBUFS

• Store raw version

• Raw dump of Kafka topic into HDFS

• Convert them to a tuple using Storm

• Inflate then convert to TSV

• Can query raw protobufs directly from HIVE but we don’t yet

• elephant-bird (difficult to get it working)
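
The "inflate then convert to TSV" step can be sketched as follows, assuming the protobuf message has already been deserialized into a dictionary. The function name `to_tsv_line`, the column list, and the choice to replace embedded tabs and newlines with spaces are assumptions for this example. The `\N` marker is Hive's default NULL representation for text tables.

```python
def to_tsv_line(record, columns):
    """Render one decoded record as a tab-separated line.

    Missing or None values become \\N, Hive's default NULL marker for text
    tables. Tabs and newlines inside values are replaced with spaces so they
    cannot break row or column boundaries.
    """
    cells = []
    for column in columns:
        value = record.get(column)
        if value is None:
            cells.append("\\N")
        else:
            text = str(value)
            for separator in ("\t", "\r", "\n"):
                text = text.replace(separator, " ")
            cells.append(text)
    return "\t".join(cells)


# Hypothetical column order for a Hive table fed by setting events.
COLUMNS = ["guid", "event_type", "setting_id", "user_id", "timestamp"]

record = {"guid": 3, "event_type": "start_backup", "user_id": "2629",
          "timestamp": "20140709T0400Z"}
line = to_tsv_line(record, COLUMNS)
```

Fixing the column order in one place keeps every line aligned with the Hive table definition even when an event omits optional fields, as `start_backup` does here with `setting_id`.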

STORM TOPOLOGY

Service Kafka HDFS

Kafka Spout

ActiveMQ Spout

Deserialize Protobuf

Storm-hdfs bolt

Landing Directory

Hive Directory

AREAS WE ARE EXPLORING

• ETL

• Word count: ~5 lines of Scala code vs. 58 lines of Java MapReduce code

• Spark Streaming can achieve results similar to Storm's through micro-batching

http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming

• Machine Learning

• Online learning using MLlib

• Logistic Regression and SVM

• In-memory machine learning

• Tight integration with R

• Preferred by Data Scientists

OPEN SOURCE PROJECTS

• Currently open sourced

• Pipefish - write from MySQL to HDFS

github.com/lookout/pipefish

• Future

• Mudskipper - capture change-data events from MySQL binlogs.

• Chunker - download MySQL table data in chunks

Questions
