SDI/ISTC Seminar

Yi Pan
LinkedIn

Yi Pan graduated from UCI with a Ph.D. in Computer Science in 2008. Since then, he has worked on distributed platforms for Internet applications. He started at Yahoo!, working on Yahoo!'s NoSQL database project, where he led the development of multiple features, such as real-time notification of database updates, secondary indexes, and live migration from legacy systems to NoSQL databases. Later, he led the development of the Cloud Messaging System, which is used heavily as a pub-sub service and transaction log for distributed databases at Yahoo!. In 2014, he joined LinkedIn and became the lead of the Apache Samza team, which provides a scalable stream processing service for the whole company.

Building a Lambda-less Stream Processing System using Local States and Windowing

This talk will provide an overview of LinkedIn's distributed stream processing platform, including Samza, Kafka, and Databus. It will first cover the high-level scenarios for stream processing at LinkedIn, followed by detailed requirements around scalability, re-processing, accuracy of results, and ease of programmability. We will then focus on the requirements of stateful stream processing applications and explain how Samza’s state management allows us to build applications that meet all of the above requirements. The key concepts, architecture, and usage of LinkedIn's stream processing pipeline will be explained, including state management in Samza and the use and configuration of Kafka and Databus as input/output and as a change log. We will also discuss in detail how we leverage a reliable, replayable messaging system (i.e., Kafka) together with durable state management in Samza to build a Lambda-less stream processing platform. The key mechanism for achieving a unified processing model between batch and real-time stream processing is windowing. We will also dive into the requirements for, and our solutions to, windowing a real-time stream.
Thursday, April 14, 2016
RMCIC 4th Floor, Panther Hollow Room
12:00 - 1:00 pm

VISITOR HOSTS: Majd Sakr, Garth Gibson
VISITOR COORD: Majd Sakr, [email protected], 412-268-1161

For more information or questions: Karen Lindenfelser, 8-6716, [email protected]
http://www.pdl.cmu.edu/SDI/

Partially funded by:
