Introduc)on*to*Apache*Kaa - · PDF fileIdeal*Architecture:*Stream*Data ... Oracle Oracle...

27
Introduc)on to Apache Ka1a Jun Rao Cofounder of Confluent

Transcript of Introduc)on*to*Apache*Kaa - · PDF fileIdeal*Architecture:*Stream*Data ... Oracle Oracle...

Introduc)on  to  Apache  Ka1a  

Jun  Rao  Co-­‐founder  of  Confluent  

Agenda  

•  Why  people  use  Ka1a  •  Technical  overview  of  Ka1a  •  What’s  coming  

What’s  Apache  Ka1a  

Distributed,  high  throughput  pub/sub  system    

Ka1a  Usage  

Common  PaIern  in  Data-­‐driven  Companies    

5  

Value ↑

Insights ↑

Product

Science Data

User

Virality ↑

Signals ↑

GeLng  Data  Is  Hard  

•  Data  scien)sts  spent  70%  )me  on  geLng  and  cleaning  data  

•  Why?  

Variety  in  Data  Sources  •  Database  records  

– Users,  products,  orders  •  Business  metrics  

–  Clicks,  impressions,  pageviews  •  Opera)onal  metrics  

–  CPU  usage,  requests/sec  •  Applica)on  logs  

–  Service  calls,  errors  •  IOT  •  …  

Variety  in  New  Specialized  Systems  

•  Batch  oriented  – Hadoop  ecosystem  (Pig,  Hive,  Spark,  …)  

•  Real  )me  –  Key/value  store  (Cassandra,  MongoDB,  HBase,  …)  –  Search  (Elas)cSearch,  Solr,  …)  –  Stream  processing  (Storm,  Spark  streaming,  Samza)  – Graph  (GraphLab,  FlockDB,  …)  –  Time  series  DB  (open  TSDB,  …)  –   …  

Danger  of  Point-­‐to-­‐point  Pipelines  

OracleOracle

Oracle User Tracking

Hadoop Log Search Monitoring Data

WarehouseSocial Graph

Rec. Engine

Search Email

VoldemortVoldemort

VoldemortEspresso

EspressoEspresso Operational

LogsOperational

Metrics

Production Services

...Security

Ideal  Architecture:  Stream  Data  Plaborm  

Ka1a  is  the  center  of  stream  data  plaborm!  

OracleOracle

Oracle User Tracking

Hadoop Log Search Monitoring Data

WarehouseSocial Graph

Rec Engine

Search Email

VoldemortVoldemort

VoldemortEspresso

EspressoEspresso Operational

LogsOperational

Metrics

Production Services

...Security

Log

Ka1a  at  LinkedIn  

•  800  billion  messages,  175  TB  data  per  day  •  13  million  messages  wriIen/sec  •  65  million  messages  read/sec  •  Tens  of  thousands  of  producers  •  Thousands  of  consumers  

Agenda  

•  Why  people  use  Ka1a  •  Technical  overview  of  Ka1a  

– Throughput  and  scalability  – Real  )me  and  batch  consump)on  – Durability  and  availability  

•  What’s  coming  

Concepts  and  API  

•  Producer  (Java)   byte[] key = "key".getBytes(); byte[] value = "value".getBytes(); record = new ProducerRecord("my-topic", key, value); producer.send(record);

streams[] = Consumer.createMessageStreams("topic1", 1); for(message: streams[0]) { // do something with message }

•  Consumer  (Java)  

•  Topic  defines  a  message  queue  

Distributed  Architecture  

14  

Topics  are  par))oned  for  parallelism  

producer

kafka cluster

producerproducerproducerproducerproducer

producerproducerproducer

consumerconsumerconsumerconsumerconsumerconsumer

consumerconsumerconsumer

Built-­‐in  Cluster  Management  

•  Automated  failure  detec)on  and  failover  – Leverage  Apache  Zookeeper  

•  Online  data  movement  

Simple  Efficient  Storage  with  a  Log  

16  

0 1 2 3 4 5 6 7 8 9 10 11 12Log

Data Source

writes

Destination System A(time = 7)

reads

Destination System B(time = 11)

reads

Batching  and  Compression  

•  Batching  – Producer,  broker,  consumer  

•  Compression  – Gzip,  snappy,  lz4  – End-­‐to-­‐end  

Real  Time  and  Batch  Consump)on  

•  Mul)-­‐subscrip)on  model  •  Data  persisted  on  disk,  NOT  cached  in  JVM  

– Rely  on  pagecache  •  Zero-­‐copy  transfer  (broker  -­‐>  consumer)  •  Ordered  consump)on  per  topic  par))on  

– Significantly  less  bookkeeping  

Durability  and  Availability  

Built-­‐in  replica)on  •  Configurable  replica)on  factor  •  Tolera)ng  f  –  1  failures  with  f  replicas  •  Automated  failover    

Replicas  and  Layout  

•  Topic  par))on  has  replicas  •  Replicas  spread  evenly  among  brokers  

topic1-part1

logs

broker 1

topic1-part2

logs

broker 2

topic2-part2

topic2-part1

logs

broker 3

topic1-part1

logs

broker 4

topic1-part2

topic2-part2 topic1-part1 topic1-part2

topic2-part1

topic2-part2

topic2-part1

Data  Flow  in  Replica)on  

broker 1

producer

leader

broker 2

follower

broker 3

follower

4  

2  

2  

3  

commit

ack  

When  producer  receives  ack   Latency   Durability  on  failures  

no  ack   no  network  delay   some  data  loss  

wait  for  leader   1  network  roundtrip   a  few  data  loss  

wait  for  commiIed   2  network  roundtrips   no  data  loss  

topic1-part1 topic1-part1 topic1-part1 consumer

1  

Extend  to  Mul)ple  Par))ons  

Leaders  are  evenly  spread  among  brokers  

broker 1 broker 2

topic3-part1 follower

broker 3

topic3-part1 follower

topic1-part1

producer

leader

topic1-part1

follower

topic1-part1

follower

broker 4

topic3-part1 leader

producer topic2-part1

producer

leader

topic2-part1

follower

topic2-part1

follower

Agenda  

•  Why  people  use  Ka1a  •  Technical  overview  of  Ka1a  •  What’s  coming  

Stream  Data  Plaborm  

Future  Releases  of  Apache  Ka1a  •  New  java  consumer  client  

–  BeIer  performance  –  Easier  protocol  for  non-­‐java  client  

•  Security  –  Authen)ca)on:  SSL  and  Kerberos  –  Authoriza)on    

•  BeIer  cluster  management  – More  automated  tools  –  Quotas  

•  Transac)onal  support  –  Exactly  once  delivery  

Confluent  

•  Mission:  Make  stream  data  plaborm  a  reality  •  Ka1a  development  and  support  •  Product:  

– Metadata  management  (released)  – Rest  endpoint  (released)  – Connectors  for  common  systems  – Monitor  data  flow  end-­‐to-­‐end  – Stream  processing  integra)on  

Q&A    

•  More  info  on  Apache  Ka1a  – hIp://ka1a.apache.org/  

•  Confluent  – hIp://confluent.io  – hIp://confluent.io/careers  

•  Ka1a  meetup  tonight  at  6:30  (Texas  V)