netflix-real-time-data-strata-talk

Real-Time Data Insights In Netflix Danny Yuan (@g9yuayon) Jae Bae 1 Friday, March 1, 13

description

All animations and demos are replaced with static screenshots

Transcript of netflix-real-time-data-strata-talk

Page 1: netflix-real-time-data-strata-talk

Real-Time Data Insights In Netflix

Danny Yuan (@g9yuayon)

Jae Bae

Friday, March 1, 13

Page 5: netflix-real-time-data-strata-talk

Who Am I?

Member of Netflix’s Platform Engineering team, working on large-scale data infrastructure (@g9yuayon)

Built and operated Netflix’s cloud crypto service

Worked with Jae Bae on querying multi-dimensional data in real time

crypto service, that manages pretty much all the keys Netflix uses in the cloud, which translates to billions of requests per day.

Page 6: netflix-real-time-data-strata-talk

Use Cases

Real-time Operational Metrics

Business or Product Insights

We’re going to discuss two types of use cases today: real-time operational metrics, and business or product insights. By the way, who would have guessed that Canadians’ number one search query would be 90210?

Page 9: netflix-real-time-data-strata-talk

What Are Log Events?

Field Name          Field Value
ClientApplication   “API”
ServerApplication   “Cryptex”
StatusCode          200
ResponseTime        73

Before we dive into use cases, let me explain what our log data looks like. Much of Netflix’s log data can be represented by “events”. Netflix applications send hundreds of different types of log events every day. A log event is really just a set of fields. A field has a name and a value. The value itself can be a string, a number, or a nested set of fields.
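To make the "set of fields" idea concrete, here is a minimal sketch of a log event as a nested dictionary. The four top-level fields come from the slide; the nested `Request` field and the `flatten` helper are hypothetical, added only to illustrate a value that is itself a set of fields.

```python
from typing import Dict, Union

# A field value is a string, a number, or a nested set of fields.
FieldValue = Union[str, int, float, Dict[str, "FieldValue"]]
LogEvent = Dict[str, FieldValue]

# The four fields shown on the slide.
event: LogEvent = {
    "ClientApplication": "API",
    "ServerApplication": "Cryptex",
    "StatusCode": 200,
    "ResponseTime": 73,
}

# Hypothetical nested field, not from the talk.
event["Request"] = {"Method": "GET", "Path": "/example"}


def flatten(fields: LogEvent, prefix: str = "") -> Dict[str, Union[str, int, float]]:
    """Flatten nested fields into dotted names, e.g. Request.Method."""
    flat: Dict[str, Union[str, int, float]] = {}
    for name, value in fields.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=key + "."))
        else:
            flat[key] = value
    return flat
```

Flattening nested fields into dotted names is one possible way to treat every field as a dimension that can be queried later; the talk does not say how nested fields were actually handled.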

Page 10: netflix-real-time-data-strata-talk

Tens of Thousands of Servers Come and Go

Highly Reliable Collectors Collect Log Events from All Servers

Dynamically Configurable Destinations

Diagram: Server Farms → Log Collectors → Kafka, Hadoop, HTTP Endpoints

Inside Netflix, hundreds of applications run on tens of thousands of machines. Machines come and go all the time, but they all generate tons of application log events, and send them to highly reliable data collectors. The collectors in turn send data to various destinations.

Page 14: netflix-real-time-data-strata-talk

Netflix is a log generating company that also happens to stream movies

- Adrian Cockcroft

As Adrian used to say, Netflix is a log generating company that also happens to stream movies. When we have a vast amount of logs from different applications, we also get a treasure trove. In fact, numerous teams (BI, operations, product development, data science) mine this data all the time. To put this into perspective, let me share some numbers.

Page 15: netflix-real-time-data-strata-talk

1,500,000

During peak hours, our data pipeline collects over 1.5 million log events per second.

Page 16: netflix-real-time-data-strata-talk

70,000,000,000

Or 70 billion a day on average.
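As a quick sanity check on how the two numbers relate, the arithmetic below converts the daily average into a per-second rate and compares it with the peak. Only the two figures from the slides are used; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60            # 86,400 seconds
events_per_day = 70_000_000_000           # daily average from the talk
peak_events_per_second = 1_500_000        # peak rate from the talk

average_events_per_second = events_per_day / SECONDS_PER_DAY
peak_to_average = peak_events_per_second / average_events_per_second

print(f"average: {average_events_per_second:,.0f} events/s")   # average: 810,185 events/s
print(f"peak/average: {peak_to_average:.2f}")                   # peak/average: 1.85
```

So the stated peak is a little under twice the average rate implied by the daily total.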

Page 17: netflix-real-time-data-strata-talk

Making Sense of Billions of Events

Making sense of such a vast amount of information is a continuing challenge for Netflix. After all, most of the time it is not feasible to look into individual log events to get anything useful out. We’ve got to have intelligent ways to digest our data.

Page 19: netflix-real-time-data-strata-talk

We’ve Got Tools

And over the past couple of years Netflix has built numerous tools to help us.

We have Turbine, a real-time dashboard for application metrics on live machines. It is also open sourced, by the way.

We have Atlas, our monitoring solution, which handles millions of application metrics every second.

We have CSI, which uses a number of machine learning algorithms to identify correlations and trends in monitored data.

We have Biopsys, which searches logs on multiple live servers and streams results back to a user’s browser.

We also have Hadoop and Hive, of course. The DSE team has built a number of tools to help teams use Hadoop as easily as possible. And we even have DSE Sting, which visualizes the results of Hive queries.

And we had a log summarization service that alerted people to the top error-generating services. They are, however, static snapshots of data that we can’t easily drill down into, and they are usually half an hour late.

Page 30: netflix-real-time-data-strata-talk

What Is Missing?

Why do we need yet another tool then? The key question is, what is missing?

Page 31: netflix-real-time-data-strata-talk

Interactive Exploration

For one thing: interactive exploration. Sometimes we want to get data in real time so we can act quickly. Some data is only useful in a small time window, after all. Sometimes we want to run lots of experimental queries just to find the right insights. If we wait too long for a query result to come back, we won’t be able to iterate fast enough. Either way, we need to get query results back in seconds.

Page 32: netflix-real-time-data-strata-talk

Getting Results Back in Seconds

Because aggregation is out of the way, we can simply de-dup the error messages and index them in a search engine. So you get the best of both worlds: an instant error report, and an instant error search engine.

Page 36: netflix-real-time-data-strata-talk

Getting Results Back in Seconds

150,000

Here is one example: we process more than 150 thousand events per second about device activities. What if we’d like to know, geographically, how many users started playing videos in the past 5 minutes? So I submit my query, and in a few seconds....

The globe is divided into a 1600x800 grid; each client activity’s coordinates are mapped to a grid cell, and the activity is then counted in that cell.

But this is an aggregated view. What if I want to drill down into the data immediately along different dimensions? In this particular case, to find failed attempts on our Silverlight players that run on PCs and Macs?
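The grid counting described above can be sketched as follows. The 1600x800 grid size is from the talk; the linear latitude/longitude mapping (an equirectangular projection) and the function names are assumptions, since the talk does not say how coordinates were mapped to cells.

```python
from collections import Counter
from typing import Iterable, Tuple

GRID_WIDTH, GRID_HEIGHT = 1600, 800  # grid size from the talk


def to_cell(lat: float, lon: float) -> Tuple[int, int]:
    """Map a latitude/longitude pair to a (column, row) grid cell.

    Assumes lat in [-90, 90] and lon in [-180, 180], mapped linearly.
    """
    col = int((lon + 180.0) / 360.0 * GRID_WIDTH)
    row = int((90.0 - lat) / 180.0 * GRID_HEIGHT)
    # Clamp the edge cases lon == 180 and lat == -90 into the last cell.
    return min(col, GRID_WIDTH - 1), min(row, GRID_HEIGHT - 1)


def count_by_cell(coords: Iterable[Tuple[float, float]]) -> Counter:
    """Count activities per grid cell."""
    return Counter(to_cell(lat, lon) for lat, lon in coords)
```

With this mapping each cell covers 0.225 degrees in both directions, so two activities a tenth of a degree apart near (0, 0) fall into the same cell.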

Page 38: netflix-real-time-data-strata-talk

Querying Data Along Different Dimensions

And from the same event, we may get answers to different questions: How many people started viewing House of Cards in the past 6 hours?

Page 43: netflix-real-time-data-strata-talk

Discover Outstanding Data

HTTP 500

There are three fundamental questions we usually want to answer from a large amount of data. The first is to find the outstanding data. For a small number of rows, we can get a summary table. But for a large amount of data, even the summary table itself can be huge. Lots of the information could be noise too. So a Top N query really helps here. For example, don’t you want to know what happened in the last 10 seconds, or which applications generated most of the errors in the last 5 seconds? Now that’s something called timely feedback. Let me share with you a more complete example.

Hundreds of thousands of requests captured.
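A Top N query of this kind can be sketched in a few lines. This is an in-memory illustration, not how the production system computes it; the `Timestamp` field and the "status code 500 or above counts as an error" rule are assumptions, while `ServerApplication` and `StatusCode` are the field names from the earlier log event slide.

```python
from collections import Counter
from typing import Any, Dict, Iterable, List, Tuple


def top_n_error_apps(
    events: Iterable[Dict[str, Any]], now: float, window_s: float, n: int
) -> List[Tuple[str, int]]:
    """Return the n applications with the most errors in the last window_s seconds."""
    counts = Counter(
        e["ServerApplication"]
        for e in events
        if e["StatusCode"] >= 500 and now - window_s <= e["Timestamp"] <= now
    )
    return counts.most_common(n)


# Made-up sample events for illustration.
sample_events = [
    {"ServerApplication": "A", "StatusCode": 500, "Timestamp": 95.0},
    {"ServerApplication": "A", "StatusCode": 500, "Timestamp": 98.0},
    {"ServerApplication": "B", "StatusCode": 503, "Timestamp": 99.0},
    {"ServerApplication": "C", "StatusCode": 200, "Timestamp": 99.0},
    {"ServerApplication": "A", "StatusCode": 500, "Timestamp": 50.0},  # outside the window
]
print(top_n_error_apps(sample_events, now=100.0, window_s=10.0, n=2))
```

The point of Top N is that the result stays small no matter how many distinct applications appear in the window.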

Page 48: netflix-real-time-data-strata-talk

Discover Outstanding Data

Page 49: netflix-real-time-data-strata-talk

See Trends Over Time

The second fundamental question is: what are the trends over time? Moreover, how does the trend compare with that of the same data in a different time window? Again, slicing and dicing is very important here because it helps us narrow down our view.
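Comparing a trend against the same data in a different time window can be sketched as bucketing event timestamps and pairing each bucket with the one a fixed offset earlier. The function names, bucket size, and offset below are illustrative assumptions, not the query interface used in the talk.

```python
from typing import Iterable, List, Tuple


def counts_per_bucket(
    timestamps: Iterable[float], start: float, end: float, bucket_s: float
) -> List[int]:
    """Count events in consecutive buckets of bucket_s seconds covering [start, end)."""
    n_buckets = int((end - start) // bucket_s)
    counts = [0] * n_buckets
    for ts in timestamps:
        if start <= ts < end:
            idx = int((ts - start) // bucket_s)
            if idx < n_buckets:
                counts[idx] += 1
    return counts


def compare_windows(
    timestamps: Iterable[float], start: float, end: float, bucket_s: float, offset_s: float
) -> List[Tuple[int, int, int]]:
    """Pair each bucket's count with the count offset_s earlier: (current, previous, change)."""
    ts_list = list(timestamps)
    current = counts_per_bucket(ts_list, start, end, bucket_s)
    previous = counts_per_bucket(ts_list, start - offset_s, end - offset_s, bucket_s)
    return [(cur, prev, cur - prev) for cur, prev in zip(current, previous)]
```

In practice the offset would typically be something like a day or a week, so that each bucket is compared with the same time of day in the earlier window.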

Page 53: netflix-real-time-data-strata-talk

See Data Distributions

The third fundamental question is: what is the distribution of my data? An average alone is not enough. Sometimes it can even be deceiving. Percentiles paint a more accurate picture.
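A small example of how an average can deceive. The response times below are made up for illustration, and the nearest-rank percentile is just one common definition, chosen here because it is easy to verify by hand.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up response times in milliseconds: most requests are fast, a few are very slow.
response_times_ms = [70] * 95 + [3000] * 5

print(mean(response_times_ms))            # 216.5
print(percentile(response_times_ms, 50))  # 70
print(percentile(response_times_ms, 95))  # 70
print(percentile(response_times_ms, 99))  # 3000
```

The mean of 216.5 ms describes no actual request: 95% of requests took 70 ms, and the slow tail only shows up at the 99th percentile.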

Page 55: netflix-real-time-data-strata-talk

Technical Challenges

I’d like to share some technical challenges we encountered when integrating Druid.

Page 58: netflix-real-time-data-strata-talk

Problem: Minimizing programming effort

Solution:
- Homogeneous architecture
- Separating producing logs from consuming logs

Even though we instrument code to death, people don’t want to write more code just for a nascent tool. Luckily for us, though, we’ve got a homogeneous architecture in place, and we’ve already separated producing logs from consuming logs. Applications have a common build and continuous integration environment, an identical deployment base, and a shared platform runtime.

Page 60: netflix-real-time-data-strata-talk

A Single Data Pipeline

Diagram: Log data, Log Filter, Collector Agent, Log Collectors

LogManager.logEvent(anEvent)

Every application shares the same design and the same underlying runtime. The logic of delivering log events is completely hidden away from programmers. All they need to do is construct a log event and hand it to LogManager.
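The producer side can be sketched as a thin facade. Only the call `LogManager.logEvent(anEvent)` appears on the slide; everything behind it here (the in-memory queue standing in for the hand-off to a collector agent, and the drop-when-full policy) is a hypothetical stand-in, not Netflix's implementation.

```python
import queue
from typing import Any, Dict


class LogManager:
    """Hypothetical stand-in for the platform's LogManager shown on the slide."""

    # In-memory buffer standing in for the hand-off to a collector agent (assumption).
    _outbox: "queue.Queue[Dict[str, Any]]" = queue.Queue(maxsize=10_000)

    @classmethod
    def logEvent(cls, event: Dict[str, Any]) -> bool:
        """Accept an event without blocking; report whether it was queued."""
        try:
            cls._outbox.put_nowait(dict(event))
            return True
        except queue.Full:
            # Assumed policy: drop the event rather than stall the application thread.
            return False


# Application code only builds an event and hands it over.
an_event = {
    "ClientApplication": "API",
    "ServerApplication": "Cryptex",
    "StatusCode": 200,
    "ResponseTime": 73,
}
LogManager.logEvent(an_event)
```

The design point from the talk is that application code never sees filtering, routing, or delivery; it only constructs the event.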

Photo credit: http://www.flickr.com/photos/decade_null/142235888/sizes/m/in/photostream/

Page 63: netflix-real-time-data-strata-talk

Isolated Log Processing

Log data

Log Filter

Log Filter

Log Filter

Log Dispatcher

Sink Plugin

Sink Plugin

Sink Plugin ElasticSearch

Hadoop

Kafka Druid

Since producing log events is dead simple, we moved all the processing logic to the backend. We introduced a plugin design that is flexible enough to filter, transform, and dispatch log events to different destinations with high throughput.
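The filter, dispatcher, and sink-plugin stages on this slide could be sketched roughly as follows. None of these names come from the talk: `SinkPlugin`, `InMemorySink`, and `LogDispatcher` are hypothetical, events are reduced to string maps, and the transform step is left out to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical plugin interface: a sink receives the events routed to it.
interface SinkPlugin {
    void write(Map<String, String> event);
}

// Demo sink that simply remembers what it received; a real sink would
// forward the event to a destination such as a queue or an index.
final class InMemorySink implements SinkPlugin {
    final List<Map<String, String>> received = new ArrayList<>();

    @Override
    public void write(Map<String, String> event) {
        received.add(event);
    }
}

// Hypothetical dispatcher: each route pairs one filter with one sink, so
// every destination sees only the events it asked for.
final class LogDispatcher {
    private static final class Route {
        final Predicate<Map<String, String>> filter;
        final SinkPlugin sink;

        Route(Predicate<Map<String, String>> filter, SinkPlugin sink) {
            this.filter = filter;
            this.sink = sink;
        }
    }

    private final List<Route> routes = new ArrayList<>();

    LogDispatcher route(Predicate<Map<String, String>> filter, SinkPlugin sink) {
        routes.add(new Route(filter, sink));
        return this;
    }

    // Offers the event to every route; a route writes it only if its filter matches.
    void dispatch(Map<String, String> event) {
        for (Route r : routes) {
            if (r.filter.test(event)) {
                r.sink.write(event);
            }
        }
    }
}

final class DispatcherDemo {
    public static void main(String[] args) {
        InMemorySink archive = new InMemorySink();   // stands in for a batch sink
        InMemorySink realtime = new InMemorySink();  // stands in for a real-time sink
        LogDispatcher dispatcher = new LogDispatcher()
                .route(e -> true, archive)
                .route(e -> "ERROR".equals(e.get("level")), realtime);

        dispatcher.dispatch(Map.of("level", "INFO", "msg", "started"));
        dispatcher.dispatch(Map.of("level", "ERROR", "msg", "timeout"));
        System.out.println(archive.received.size() + " archived, "
                + realtime.received.size() + " sent to real time");
    }
}
```

Because each route carries its own filter, adding a destination means adding a route rather than touching the producers.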

Page 66: netflix-real-time-data-strata-talk

Problem: Not All Logs Are Worth Processing

Solution: Dynamic Filtering

Storing and processing log events takes time, requires resources, and ultimately costs money. Many events are useful only at the moment someone needs them. Therefore, we built dynamic filtering into our platform.

Page 67: netflix-real-time-data-strata-talk

We created both a fluent API and a corresponding infix mini-language to filter any JavaBean-like object.
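The talk does not show the API itself, so the following is only a sketch of how a fluent filter over JavaBean-like objects could work. `BeanFilters`, `eq`, `gt`, and `RequestEvent` are hypothetical names; properties are read through getters by reflection, and composition relies on the standard `Predicate.and`/`or`/`negate` methods.

```java
import java.lang.reflect.Method;
import java.util.function.Predicate;

final class BeanFilters {
    // Reads a JavaBean-style property through its getter, e.g. "status" -> getStatus().
    static Object property(Object bean, String name) {
        String getter = "get" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
        try {
            Method m = bean.getClass().getMethod(getter);
            return m.invoke(bean);
        } catch (ReflectiveOperationException e) {
            return null; // a missing or unreadable property never matches
        }
    }

    // Matches beans whose property equals the expected value.
    static Predicate<Object> eq(String name, Object expected) {
        return bean -> expected.equals(property(bean, name));
    }

    // Matches beans whose numeric property is strictly greater than the threshold.
    static Predicate<Object> gt(String name, long threshold) {
        return bean -> {
            Object v = property(bean, name);
            return v instanceof Number && ((Number) v).longValue() > threshold;
        };
    }
}

// Hypothetical JavaBean-like event used only for the demo.
final class RequestEvent {
    private final int status;
    private final long latencyMs;

    RequestEvent(int status, long latencyMs) {
        this.status = status;
        this.latencyMs = latencyMs;
    }

    public int getStatus() { return status; }
    public long getLatencyMs() { return latencyMs; }
}

final class FilterDemo {
    public static void main(String[] args) {
        // A hypothetical infix form of the same filter might read:
        //   status = 500 and latencyMs > 200
        Predicate<Object> slowErrors =
                BeanFilters.eq("status", 500).and(BeanFilters.gt("latencyMs", 200));

        System.out.println(slowErrors.test(new RequestEvent(500, 350)));
        System.out.println(slowErrors.test(new RequestEvent(200, 350)));
    }
}
```

Working against getters rather than a fixed event class is what lets one filter definition apply to any bean-shaped event.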

Page 70: netflix-real-time-data-strata-talk

Problem: JSON Payload Is Tedious

Solution: Build a parser

It’s just inhumane to ask people to write JSON payloads directly. Remember, our goal is to let users get query results back in seconds. It doesn’t make sense to ask a user to spend half an hour constructing a query and another half hour debugging it.

Page 72: netflix-real-time-data-strata-talk

curl -X POST http://druid -d @data

An added benefit of parsing the query up front is that semantic errors are caught early, before the request is sent.
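To illustrate the idea of catching errors before anything is posted, here is a deliberately tiny sketch. The one-line query syntax, the `SCHEMAS` table, and the JSON layout are all invented for this example; they are not the talk's actual query language, and the JSON is not Druid's real query format.

```java
import java.util.Map;
import java.util.Set;

final class QuerySketch {
    // Hypothetical schema registry: data source name -> known dimensions.
    static final Map<String, Set<String>> SCHEMAS =
            Map.of("requests", Set.of("country", "device"));

    // Parses "count <dataSource> by <dimension>" and returns a JSON payload,
    // rejecting malformed or semantically invalid queries before any HTTP call.
    static String toJson(String query) {
        String[] t = query.trim().split("\\s+");
        if (t.length != 4 || !t[0].equals("count") || !t[2].equals("by")) {
            throw new IllegalArgumentException("expected: count <dataSource> by <dimension>");
        }
        Set<String> dimensions = SCHEMAS.get(t[1]);
        if (dimensions == null) {
            throw new IllegalArgumentException("unknown data source: " + t[1]);
        }
        if (!dimensions.contains(t[3])) {
            throw new IllegalArgumentException("unknown dimension: " + t[3]);
        }
        // Illustrative payload only; the field names are assumptions.
        return String.format(
                "{\"dataSource\":\"%s\",\"dimension\":\"%s\",\"aggregation\":\"count\"}",
                t[1], t[3]);
    }

    public static void main(String[] args) {
        System.out.println(toJson("count requests by country"));
        try {
            toJson("count requests by colour");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A misspelled dimension fails immediately with a readable message instead of surfacing as an empty or confusing result from the backend.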

Page 75: netflix-real-time-data-strata-talk

Problem: Managing data sources can be hairy

Solution: Use cell-like deployment

This is a nascent system with quite a few moving parts. We needed to add new data sources, remove data sources, update their schemas, and debug individual data sources. Such operations should be easy and should have minimal impact on the production system.

Page 76: netflix-real-time-data-strata-talk

Kafka Kafka Kafka

Log Data Pipeline

Druid Druid Druid

We use a cell-like architecture. Each data source has its own persistent queue, its own configuration, and its own indexing cluster. Adding a new data source requires only adding a new set of ASGs (auto scaling groups).

Tuning also becomes isolated.
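The isolation property can be pictured with a small model. This is only a sketch under assumptions: `Cell`, `CellRegistry`, the naming scheme for queues and clusters, and the `flushIntervalMs` setting are all hypothetical and stand in for whatever the real deployment tooling tracks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical description of one cell: everything a single data source owns.
final class Cell {
    final String dataSource;
    final String queueTopic;       // its own persistent queue
    final String indexingCluster;  // its own indexing cluster
    final Map<String, String> config = new LinkedHashMap<>(); // its own configuration

    Cell(String dataSource) {
        this.dataSource = dataSource;
        this.queueTopic = "queue-" + dataSource;       // naming scheme is an assumption
        this.indexingCluster = "index-" + dataSource;  // naming scheme is an assumption
    }
}

// Hypothetical registry: adding or removing a data source touches one cell only.
final class CellRegistry {
    private final Map<String, Cell> cells = new LinkedHashMap<>();

    Cell add(String dataSource) {
        if (cells.containsKey(dataSource)) {
            throw new IllegalStateException("cell already exists: " + dataSource);
        }
        Cell cell = new Cell(dataSource);
        cells.put(dataSource, cell);
        return cell;
    }

    void remove(String dataSource) {
        cells.remove(dataSource);
    }

    Cell get(String dataSource) {
        return cells.get(dataSource);
    }

    int size() {
        return cells.size();
    }
}

final class CellDemo {
    public static void main(String[] args) {
        CellRegistry registry = new CellRegistry();
        Cell playback = registry.add("playback");
        Cell search = registry.add("search");

        // Tuning one cell leaves the other untouched.
        playback.config.put("flushIntervalMs", "500");
        System.out.println("search config entries: " + search.config.size());

        // Removing one data source does not disturb the rest.
        registry.remove("playback");
        System.out.println("remaining cells: " + registry.size());
    }
}
```

Nothing is shared between cells in this model, which is why a schema change or a debugging session on one data source cannot degrade another.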

Page 77: netflix-real-time-data-strata-talk

Integrating with Netflix’s Infrastructure

Integration with Netflix’s infrastructure is essential. We need insights to operate this system, and we need smooth operations.

Page 78: netflix-real-time-data-strata-talk

For example, the current deployment handles 380,000 messages per second, or close to 2 TB/hour, at peak. Without integration with our monitoring system, we wouldn’t notice system glitches like the one shown in this chart.
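As a rough sanity check on these two figures, assuming decimal terabytes and that both numbers describe the same peak (the talk states neither), the implied average message size is about 1.5 KB:

```latex
\frac{2 \times 10^{12}\ \text{B/hour}}
     {3600\ \text{s/hour} \times 3.8 \times 10^{5}\ \text{messages/s}}
\approx 1.46 \times 10^{3}\ \text{B/message}
```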

Page 79: netflix-real-time-data-strata-talk

On Netflix Side

• Integrating Kafka with Netflix cloud

• Real-time plug-in on Netflix’s data pipeline

• User-configurable event filtering

Page 80: netflix-real-time-data-strata-talk

On Druid Side

• Integration with Netflix’s monitoring system - Emitter+Servo

• Integration with Netflix’s platform library

• Handling of Zookeeper’s session interruption

• Tuning sharding spec for linear scalability

Emitter integration with Servo

There are lots of injection points in Druid where we can introduce our own implementations. This greatly helped our integration.

Page 82: netflix-real-time-data-strata-talk

Open Source Plan

Log Collectors

Event Filter Collector Agent

rtexplorer

Druid

We built our tool sets on top of many excellent open source tools, and it’s our pleasure to contribute back. Therefore, we’re going to open source all the tools we built some time this year.