Hadoop Summit 2014: Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
Data Analytics Infrastructure©2013 LinkedIn Corporation. All Rights Reserved.
Building a Self-Service Hadoop Platform at LinkedIn with Azkaban
Hadoop Summit 2014
David Z. Chen
Hadoop at LinkedIn
Profile Page, Home Page
Evolution of Workflows
2009, 2010, 2011, 2012, 2013
Azkaban 1.0
Run workflows
Schedule jobs
Job History
Failure notification
Easy to use web UI and visualizations
Azkaban 2.0
Major re-architecting
Separate executor and web servers
User authentication
Pluggable database drivers
  – H2
  – MySQL
Brand new UI
Jobtype plugins
  – Built-in type: command
  – Pluggable jobtypes: Java, Pig, Hive
  – Non-Hadoop jobtypes: Teradata, Voldemort
Viewer plugins: extending the Azkaban UI for other tools
  – HDFS browser
  – Reportal
LinkedIn-specific code as plugins
Azkaban 2.5
UI overhauled using Bootstrap
Embedded flows
New self-service tools
  – Job Summary
  – Flow Summary
  – Pig Visualizer
Jobtype-specific plugins
HDFS viewer improvements
  – Display file schema in addition to content
  – Parquet file viewer
And more
Who’s using Azkaban?
Software Engineers
Data Scientists
Analysts
Product Managers
Azkaban Today
Workflow manager and scheduler
Integrated runtime environment
Unified front-end for Hadoop tools
Good News! Success!
1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day
Bad News! Success
1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day
Creating and Running Workflows
Creating Workflows
Add job "type" plugins
  – hadoopJava
  – Command
  – Pig
  – Hive
Dependencies
  – Determine the dependency graph
Parameter passing
  – Parameters can be passed to a job

peanutbutter.job
  type=pig
  creamy.level=4
  chunky.level=4
  ...

jelly.job
  type=hadoopJava
  jelly.type=grape
  sugar=HFCS
  ...

bread.job
  type=command
  bread.type=wheat
  dependencies=peanutbutter,jelly
  ...
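The three .job files on this slide describe a small dependency graph through the dependencies property. As a minimal sketch of that idea (not Azkaban's actual implementation), the Python below parses the key=value definitions and derives one valid execution order. The helper names parse_job and execution_order are hypothetical and exist only for this illustration; the elided "..." lines are left out.

```python
from graphlib import TopologicalSorter

# Job definitions from the slide, keyed by job name (the .job file name
# without its extension). The elided "..." lines are omitted.
JOB_FILES = {
    "peanutbutter": "type=pig\ncreamy.level=4\nchunky.level=4",
    "jelly": "type=hadoopJava\njelly.type=grape\nsugar=HFCS",
    "bread": "type=command\nbread.type=wheat\ndependencies=peanutbutter,jelly",
}


def parse_job(text):
    """Parse key=value lines into a dict (hypothetical helper)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def execution_order(job_files):
    """Return job names so that every job comes after its dependencies."""
    graph = {}
    for name, text in job_files.items():
        deps = parse_job(text).get("dependencies", "")
        # Map each job to the set of jobs it depends on.
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return list(TopologicalSorter(graph).static_order())


order = execution_order(JOB_FILES)
# bread is always last; peanutbutter and jelly have no ordering between them.
print(order)
```

Because peanutbutter and jelly do not depend on each other, a scheduler is free to run them in parallel; only bread has to wait for both.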
Embedded Flows
Embed a flow as a node in another flow.
  – "flow" job type
  – Set flow.name to name of the embedded flow
  – Parameters can be passed to flow

Diagram: the embedded flow bread (peanutbutter and jelly, followed by bread)

coffee.job
  type=hive
  coffee.decaf=false
  coffee.cream=true
  ...

fruit.job
  type=hadoopJava
  fruit.type=apple
  ...

sandwich.job
  type=flow
  flow.name=bread
  dependencies=coffee,fruit
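To make the embedding concrete, here is a rough Python sketch of how the sandwich flow could be expanded into a flat run order. It assumes, as flow.name=bread suggests, that a flow is identified by its final job, and it runs jobs one at a time for simplicity. The run_order and deps_of helpers and the "sandwich:" name prefix are illustrative choices, not Azkaban's API.

```python
# Jobs from the slides: name -> properties. Elided "..." lines are omitted.
JOBS = {
    "peanutbutter": {"type": "pig"},
    "jelly": {"type": "hadoopJava"},
    "bread": {"type": "command", "dependencies": "peanutbutter,jelly"},
    "coffee": {"type": "hive"},
    "fruit": {"type": "hadoopJava"},
    "sandwich": {"type": "flow", "flow.name": "bread",
                 "dependencies": "coffee,fruit"},
}


def deps_of(props):
    """Split the comma-separated dependencies property into a list."""
    raw = props.get("dependencies", "")
    return [d.strip() for d in raw.split(",") if d.strip()]


def run_order(jobs, end_job, prefix=""):
    """One valid sequential order for the flow ending at end_job.

    Assumption: a flow is named after its final job. Embedded flows are
    expanded in place and their jobs are prefixed with the embedding node.
    """
    order = []
    seen = set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        props = jobs[name]
        for dep in deps_of(props):
            visit(dep)  # dependencies run first
        if props.get("type") == "flow":
            # The node stands for a whole flow: expand that flow here.
            order.extend(run_order(jobs, props["flow.name"], prefix + name + ":"))
        else:
            order.append(prefix + name)

    visit(end_job)
    return order


result = run_order(JOBS, "sandwich")
print(result)
# ['coffee', 'fruit', 'sandwich:peanutbutter', 'sandwich:jelly', 'sandwich:bread']
```

The embedded bread flow starts only after coffee and fruit have finished, which is what the dependencies line in sandwich.job expresses.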
Project Management: Project Page
Running Workflows: Flow Execution Panel
Running Workflows: Notification Options
Running Workflows: Failure Options
Finish Current
  – Finishes currently running flows, then stops
Cancel All
  – Kills all running jobs and finishes immediately
Finish Possible
  – Finishes all jobs whose dependencies have been met, then fails
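The difference between the three failure options is easiest to see on a small example. The sketch below is a simplified model, not Azkaban's scheduler: it assumes that any job allowed to keep running eventually succeeds, and the status strings and the apply_failure_option helper are made up for illustration.

```python
def blocked_by_failure(job, deps, failed):
    """True if job depends, directly or transitively, on a failed job."""
    return any(d in failed or blocked_by_failure(d, deps, failed)
               for d in deps[job])


def apply_failure_option(option, deps, state):
    """Return the final status of each job once a failure has occurred.

    option: "finish_current", "cancel_all" or "finish_possible"
    deps:   job name -> list of jobs it depends on
    state:  job name -> "SUCCEEDED", "FAILED", "RUNNING" or "PENDING"
    Assumption: jobs that are allowed to run always succeed.
    """
    failed = {job for job, s in state.items() if s == "FAILED"}
    final = {}
    for job, s in state.items():
        if s in ("SUCCEEDED", "FAILED"):
            final[job] = s
        elif s == "RUNNING":
            # Only Cancel All interrupts work that is already in flight.
            final[job] = "KILLED" if option == "cancel_all" else "SUCCEEDED"
        elif option == "finish_possible" and not blocked_by_failure(job, deps, failed):
            # Finish Possible still starts jobs that do not need the failed job.
            final[job] = "SUCCEEDED"
        else:
            final[job] = "CANCELLED"
    return final


# a has failed while b is still running; c needs a, d needs b, e needs c and d.
DEPS = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
STATE = {"a": "FAILED", "b": "RUNNING", "c": "PENDING",
         "d": "PENDING", "e": "PENDING"}

for option in ("finish_current", "cancel_all", "finish_possible"):
    print(option, apply_failure_option(option, DEPS, STATE))
```

In this model only Finish Possible lets d run, because d does not depend on the failed job a; in all three cases the flow as a whole still ends as failed.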
Running Workflows: Flow Parameters
Running Workflows: Concurrent Execution Options
Skip Executions
  – Prevent concurrent executions
Run Concurrently
  – Concurrently run the flow
Pipeline
  – Distance 1: jobA waits until the concurrent execution's jobA finishes
  – Distance 2: jobA waits until the children of the concurrent execution's jobA finish
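The two pipeline distances can be read as a simple gating rule on each job of the newer execution. The following Python is only a sketch of that rule as the slide words it; the can_start function, its arguments, and the job names jobB and jobC are hypothetical.

```python
def can_start(job, children, finished_in_previous, distance):
    """Decide whether `job` in a new execution may start.

    children:             job name -> list of jobs that directly depend on it
    finished_in_previous: set of jobs already finished in the earlier,
                          still-running execution of the same flow
    distance:             1 or 2, as in the Pipeline option
    """
    if job not in finished_in_previous:
        # Both distances require the earlier execution's copy of the job to be done.
        return False
    if distance == 1:
        return True
    if distance == 2:
        # Additionally wait for the job's children in the earlier execution.
        return all(c in finished_in_previous for c in children.get(job, []))
    raise ValueError("distance must be 1 or 2")


# Hypothetical flow: jobA is followed by jobB and jobC.
CHILDREN = {"jobA": ["jobB", "jobC"], "jobB": [], "jobC": []}

# The earlier execution has finished jobA and jobB, but jobC is still running.
finished = {"jobA", "jobB"}
print(can_start("jobA", CHILDREN, finished, distance=1))  # True
print(can_start("jobA", CHILDREN, finished, distance=2))  # False
```

A larger distance keeps the newer execution further behind the older one, which matters when a job and its children touch the same intermediate data.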
Running Workflows: Executing Flow Page
Running Workflows: Flow Job List
Scheduling Workflows
Scheduling Workflows: Schedule Flow Panel
Scheduling Workflows: Scheduled Flows
Scheduling Flows: Setting SLAs
Debugging and Tuning
Hadoop at LinkedIn
1000+ users
Several clusters
2,500 flows executing per day
30,000 jobs executing per day
Job Execution History
Flow Execution History
Running Workflows: Job Logs
Job Summary
Pig Visualizer
Flow Summary
Browsing HDFS
HDFS Viewer: Browsing Files
HDFS Viewer: Viewing Files
HDFS Viewer: File Schema
HDFS Viewer: Supported File Types
Avro Parquet Binary JSON Sequence File Image Text
Reportal
Reportal: Dashboard
Reportal: New Report
Reportal: Viewing Results
Reportal: Supported Query Types
Pig
Hive
Teradata
Upcoming Features
Azkaban Gradle Plugin and DSL
Describe Azkaban flow and deploy with Gradle
Single file (more if you want) to describe all your workflows
  – Compiles to .job files
Static checker
Valid Groovy code
  – Add conditionals for deployment to different clusters

azkaban {
  jobConfDir = './jobs'
  workflow('workflow2') {
    pigJob('job2') {
      script = 'src/main/pig/count-by-country.job'
      parameter 'inputFile', '/user/foo/sample'
      reads '/data/databases/foo', [as: 'input']
      writes '/data/databases/bar', [as: 'output']
    }
    hiveJob('job3') {
      query = 'show tables'
    }
    workflowDepends 'job2', 'job3'
  }
}
Future Roadmap
New visualizers (Hive, Tez, etc.)
Support DSL from other tools
Operationalization tooling
Scalability improvements
Improved plugin interfaces
Future Discussions
Conditional branching
Hive Metastore browser
Pluggable executors (e.g. YARN)
Persistence storage server
Launching and monitoring long-running YARN applications (Samza, Storm, etc.)
Main Contributors
David Chen (LinkedIn)
Hien Luu (LinkedIn)
Anthony Hsu (LinkedIn)
Alex Bain (LinkedIn)
Richard Park (RelateIQ)
Chenjie Yu (Tango)
Shida Li (University of Waterloo)
How to Contribute
Website: azkaban.github.io
GitHub: github.com/azkaban
LinkedIn’s Data Website: data.linkedin.com