3. Apache Tez Introduction - Apache Kylin Meetup @Shanghai


Page 1: Apache Tez - Next Generation of Execution Engine upon Hadoop

Jeff Zhang (@zjffdu)

© Hortonworks Inc. 2015

Page 2: Who's this guy

• Started using Pig in 2009; became a Pig committer in Nov 2009
• Joined Hortonworks in 2014
• Tez committer since Oct 2014

Page 3: Agenda

• Tez Introduction
• Tez Feature Deep Dive
• Tez Status & Roadmap

Page 4: Example of MR versus Tez

[Figure: the same Hive query planned on MR and on Tez]

Hive - MR: separate jobs, with I/O synchronization barriers between them
• Job 1 (Join a & b)
• Job 2 (Group by of a Join b)
• Job 3 (Group by of c)
• Job 4 (Join of S & R)

Hive - Tez: single job
• Join a & b
• Group by of a Join b
• Group by of c
• Join of S & R

Page 5: Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph (DAG).
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN, the resource management framework for Hadoop.
• Open source Apache project and Apache licensed.

Page 6: What is DAG & Why DAG

• Directed Acyclic Graph
• Any complicated DAG can be composed of the following 3 basic paradigms
– Sequential (Projection, Filter, GroupBy, …)
– Merge (Join, Union, Intersect, …)
– Divide (Split, …)
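The three paradigms above can be illustrated with a small dependency graph. This is a minimal sketch in plain Python, not the Tez API: the vertex names (`load_a`, `filter`, `join`, and so on) are hypothetical, and the standard-library `graphlib` module stands in for whatever scheduling Tez does internally.

```python
from graphlib import TopologicalSorter


def build_dag():
    """Return a toy DAG as {vertex: set of vertices it reads from}."""
    return {
        "filter": {"load_a"},          # Sequential: load_a -> filter
        "join": {"filter", "load_b"},  # Merge: two inputs feed one vertex
        "group_by": {"join"},          # Divide: join feeds two consumers
        "store_raw": {"join"},
    }


def execution_order(dag):
    """Return one valid order in which the vertices could be started."""
    return list(TopologicalSorter(dag).static_order())
```

Because the graph is acyclic, a valid order always exists; `TopologicalSorter` raises `graphlib.CycleError` if a cycle is introduced by mistake.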

Page 7: Expressing DAG in Tez API

• DAG API (Logic View)
– Allow user to build DAG
– Topological structure of the data computation flow
• Runtime API (Runtime View)
– Application logic of each computation unit (vertex)
– How to move/read/write data between vertices

Page 8: DAG API (Logic View)

• Vertex (Processor, Parallelism, Resource, etc…)
• Edge (EdgeProperty)
– DataMovement
  – Scatter Gather (Join, GroupBy …)
  – Broadcast (Pig Replicated Join / Hive Broadcast Join)
  – One-to-One (Pig Order by)
  – Custom
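To make the data movement types concrete, the sketch below shows which destination tasks a record from one source task would reach under each type. It is an illustration in plain Python rather than Tez code: the function name `destinations` is hypothetical, and hash partitioning by key is an assumed scatter-gather policy (the Custom type is left out because its routing is user-defined).

```python
import zlib


def destinations(movement, src_task, num_dst_tasks, key=None):
    """Return the destination task indices for one record."""
    if movement == "SCATTER_GATHER":
        # Assumption: records are partitioned by a stable hash of the key,
        # so every record with the same key lands on the same task.
        return [zlib.crc32(key.encode("utf-8")) % num_dst_tasks]
    if movement == "BROADCAST":
        # Every destination task receives the full output.
        return list(range(num_dst_tasks))
    if movement == "ONE_TO_ONE":
        # Task i of the source vertex feeds task i of the destination vertex.
        if src_task >= num_dst_tasks:
            raise ValueError("one-to-one needs matching parallelism")
        return [src_task]
    raise ValueError(f"unsupported movement type: {movement}")
```

Scatter gather suits joins and group-bys because equal keys must meet on one task; broadcast suits a small table replicated to every task of a large one.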

Page 9: Runtime API (Runtime View)

[Figure: Input → Processor → Output]

• Input
– Through which processor receives data on an edge
– Vertex can have multiple inputs
• Processor
– Application Logic (One vertex one processor)
– Consume the inputs and produce the outputs
• Output
– Through which processor writes data to an edge
– One vertex can have multiple outputs
• Example of Input/Output/Processor
– MRInput & MROutput (InputFormat/OutputFormat)
– OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather)
– UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One)
– PigProcessor/HiveProcessor
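The Input / Processor / Output split can be mimicked in a few lines. This is a toy model in plain Python, assuming in-memory lists in place of real edges; the class names `ListInput`, `ListOutput`, and `WordCountProcessor` are hypothetical and are not Tez classes.

```python
from collections import Counter


class ListInput:
    """Toy input: hands the processor the records of one incoming edge."""

    def __init__(self, records):
        self._records = list(records)

    def read(self):
        return iter(self._records)


class ListOutput:
    """Toy output: collects the key/value pairs written to one outgoing edge."""

    def __init__(self):
        self.records = []

    def write(self, key, value):
        self.records.append((key, value))


class WordCountProcessor:
    """Application logic of one vertex: consume all inputs, fill all outputs."""

    def run(self, inputs, outputs):
        counts = Counter()
        for inp in inputs.values():
            for line in inp.read():
                counts.update(line.split())
        for word, n in sorted(counts.items()):
            for out in outputs.values():
                out.write(word, n)
```

The processor only sees named inputs and outputs, so the same logic runs unchanged whether an edge is backed by a file, a shuffle, or a broadcast.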

Page 10: Benefit of DAG

• Easier to express computation in DAG
• No intermediate data written to HDFS
• Less pressure on NameNode
• No resource queuing effort & less resource contention
• More optimization opportunity with more global context

Page 11: Agenda

• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap

Page 12: Container-Reuse

• Reuse the same container across DAGs/Vertices/Tasks
• Benefit of Container-Reuse
– Fewer resources consumed
– Reduced overhead of launching JVMs
– Reduced overhead of negotiating with the ResourceManager
– Reduced overhead of resource localization
– Reduced network IO
– Object Caching (Object Sharing)

Page 13: Tez Session

• Multiple Jobs/DAGs in one AM
• Container-reuse across Jobs/DAGs
• Data sharing between Jobs/DAGs

Page 14: Dynamic Parallelism Estimation

• VertexManager
– Listens to the status of the other vertices
– Coordinates and schedules its tasks
– Handles communication between vertices
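One way a vertex manager could use upstream status to pick its task count is sketched below. This is an assumed heuristic written in plain Python, not Tez's actual algorithm: the function name, its parameters, and the policy of never raising parallelism above the initial value are all illustrative choices.

```python
import math


def estimate_parallelism(completed_output_bytes, total_source_tasks,
                         initial_parallelism, desired_bytes_per_task,
                         min_parallelism=1):
    """Estimate how many downstream tasks are needed.

    completed_output_bytes: output sizes reported by finished source tasks.
    """
    if not completed_output_bytes:
        # Nothing observed yet: keep the parallelism chosen at plan time.
        return initial_parallelism
    # Extrapolate total output from the tasks that have finished so far.
    average = sum(completed_output_bytes) / len(completed_output_bytes)
    estimated_total = average * total_source_tasks
    wanted = math.ceil(estimated_total / desired_bytes_per_task)
    # Clamp to [min_parallelism, initial_parallelism].
    return max(min_parallelism, min(initial_parallelism, wanted))
```

For example, if 2 of 10 source tasks have each produced 100 bytes and each downstream task should handle about 250 bytes, the estimate drops from an initial 20 tasks to 4.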

Page 15: ATS Integration

• Tez is fully integrated with YARN ATS (Application Timeline Service)
– DAG Status, DAG Metrics, Task Status, Task Metrics are captured
• Diagnostics & Performance analysis
– Data Source for monitoring & diagnostics
– Data Source for performance analysis

Page 16: Recovery

• AM can crash in corner cases
– OOM
– Node failure
– …
• Continue from the last checkpoint
• Transparent to end users

[Figure: AM Crash]

Page 17: Order By of Pig

f = Load 'foo' as (x, y);
o = Order f by x;

[Figure: two plans for the Order By, each with the stages Load, Sample (Calculate Histogram), Partition, Sort. One plan stages data through HDFS; the edge labels shown are Broadcast, One-to-One, and Scatter Gather (twice).]
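The stages named on this slide (sample, histogram, partition, sort) amount to a range-partitioned sort, which can be sketched on a single machine as follows. This is a minimal illustration in plain Python, not Pig or Tez code: the function names, the fixed random seed, and the choice of evenly spaced sample quantiles as partition boundaries are assumptions.

```python
import bisect
import random


def sample_boundaries(keys, num_partitions, sample_size, seed=0):
    """Sample the keys and pick num_partitions - 1 range boundaries."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // num_partitions]
            for i in range(1, num_partitions)]


def partition(rows, boundaries, key):
    """Send each row to the partition whose key range contains it."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, key(row))].append(row)
    return parts


def order_by(rows, key, num_partitions=3, sample_size=100):
    """Sort each range partition independently, then concatenate."""
    if not rows:
        return []
    boundaries = sample_boundaries([key(r) for r in rows],
                                   num_partitions, sample_size)
    parts = partition(rows, boundaries, key)
    return [row for part in parts for row in sorted(part, key=key)]
```

Because the partitions cover disjoint, increasing key ranges, sorting each one separately and concatenating them gives a globally sorted result; the boundaries are small, which is why they are a natural fit for a broadcast edge.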

Page 18: Tez UI

Page 19: Tez UI

Page 20: Tez UI

Download data from ATS

Page 21: RoadMap

• Shared output edges
– Same output to multiple vertices
• Local mode stabilization
• Optimizing (include/exclude) vertex at runtime
• Partial completion VertexManager
• Co-Scheduling
• Framework stats for better runtime decisions

Page 22: Tez – Adoption

• Apache Hive
– Start from Hive 0.13
– set hive.execution.engine=tez;
• Apache Pig
– Start from Pig 0.14
– pig -x tez
• Cascading
• Flink

Page 23: Tez Community

• Useful Links
– http://tez.apache.org/
– JIRA: https://issues.apache.org/jira/browse/TEZ
– Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git
– Mailing Lists
  – Dev List: [email protected]
  – User List: [email protected]
  – Issues List: [email protected]
• Tez Meetup
– http://www.meetup.com/Apache-Tez-User-Group

Page 24: Thank You!

Questions & Answers