Cowboy dating with big data-AI-Ukraine-2019 · intro –partner 1. intro –partner 2....

Page 1

COWBOY DATING WITH BIG DATA

DATA PLATFORM EVOLUTION IN ACTION

BORIS TROFIMOV

Page 2

Big Data competence lead @ Sigma Software

Worked with Verizon/Yahoo/AOL, Collective

Cofounder of Odessa JUG

Passionate follower of Scala

Associate professor at ONPU

ABOUT ME

Page 3

INTRO – PARTNER 1

Page 4

INTRO – PARTNER 2

Page 5

AWARDS: PARTNER 2

• Medal for Kotlin
• Full Cavalier of Spring & Spring Boot
• Order for the Capture of Kubernetes

Page 6

LESSON 1 – SHARED STORAGE

Page 7

EXPECTATIONS

PRODUCT

Page 8

SHARED STORAGE

[Diagram: UI, MVP, API FACADE, DB]

Page 9

SHARED STORAGE

[Diagram: UI, MVP, API FACADE, DB]

WHERE IS THE REPORTING?

Page 10

SHARED STORAGE

[Diagram: UI, MVP, API FACADE, DB, 3rd PROVIDERS]

Page 11

SHARED STORAGE

PROS
• Fast time to market (TTM)
• Relatively cheap from an infrastructure and cost perspective

CONS
• Tight coupling of data and code
• Different scaling scenarios
• Performance and availability issues

Page 12

SHARED STORAGE

[Diagram: UI, MVP, API FACADE, OLTP, DATA PLATFORM, OLAP, 3rd PROVIDERS]

Page 13

DATA PLATFORM

[Diagram: 3rd PROVIDERS, SYSTEM EVENTS, API FACADE, DATA PLATFORM]

Page 14

LESSON 2 – WEAK SCHEDULING

Page 15

SCHEDULING

[Diagram: UI, MVP, API FACADE, OLTP, Data Platform, OLAP, 3rd PROVIDERS]

Page 16

SCHEDULING

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, DATA PROCESSING SCRIPT (x3), OLAP, API FACADE]

Page 17

SCHEDULING

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, CROND, QUARTZ, DATA PROCESSING SCRIPT (x3), OLAP, API FACADE]

Page 18

SCHEDULING

[Diagram: UI, MVP, API FACADE, OLTP, Data Platform, OLAP, 3rd PROVIDERS]

WHAT CRONTAB?

Page 19

SCHEDULING

Use prod-ready schedulers
§ Airflow
§ Azkaban
§ Oozie
§ Jenkins?

What we gain
§ Identity control and Audit
§ Job Lineage, Logging and Troubleshooting
§ Tools to design Workflows/DAGs
§ Fault Tolerance features (rerun etc.)
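
What a scheduler adds over a bare crontab can be shown in miniature. The block below is a minimal sketch in plain Python, not the API of Airflow, Azkaban or Oozie: the `run_dag` helper, the task names and the `max_attempts` parameter are assumptions made for illustration. It runs tasks in dependency order, retries a failing task and keeps an attempt log for troubleshooting.

```python
from graphlib import TopologicalSorter


def run_dag(tasks, deps, max_attempts=3):
    """Run callables in dependency order, retrying each failed task.

    tasks: {name: callable}; deps: {name: set of upstream task names}.
    Returns a log of (task, attempt, status) tuples.
    """
    log = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # Every attempt failed: stop so downstream tasks never run.
            raise RuntimeError(f"task {name!r} failed after {max_attempts} attempts")
    return log


# Hypothetical pipeline: ingest -> process -> load, where process fails once.
calls = {"process": 0}


def flaky_process():
    calls["process"] += 1
    if calls["process"] == 1:
        raise IOError("transient failure")


tasks = {"ingest": lambda: None, "process": flaky_process, "load": lambda: None}
deps = {"ingest": set(), "process": {"ingest"}, "load": {"process"}}
log = run_dag(tasks, deps)
```

A production scheduler provides the same behavior plus identity control, a UI and persistent history, which is why the slide recommends one over hand-rolled cron scripts.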

Page 20

LESSON 3 – MONOLITHIC DATA PLATFORM

Page 21

DATA PLATFORM

[Diagram: 3rd PROVIDERS, SYSTEM EVENTS, Scheduler, DATA PLATFORM, DATA PROCESSING SCRIPT, OLAP, API FACADE]

Page 22

DATA PLATFORM

[Diagram: 3rd PROVIDERS, SYSTEM EVENTS, Scheduler, DATA PLATFORM, DATA PROCESSING SCRIPT, OLAP, API FACADE]

WHY A MONOLITH AGAIN?

Page 23

DECOUPLING DATA PLATFORM

[Diagram: DATA PLATFORM, DWH, ANALYTICS, REPORTING]

Page 24

DECOUPLING DATA PLATFORM

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, Scheduler, INGEST, PROCESS, LOAD, DATA LAKE, DWH]

Page 25

DECOUPLING DATA PLATFORM

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, Scheduler, INGEST / PROCESS / LOAD (x2), DATA LAKE, DWH]

Page 26

DECOUPLING DATA PLATFORM

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, Scheduler, INGEST / PROCESS / LOAD (x3), DATA LAKE, DWH, OLAP, ANALYTICS, ML TRAINING, ML RUNNING]

Page 27

DECOUPLING DATA PLATFORM

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, Scheduler, INGEST / PROCESS / LOAD (x3), DATA LAKE, DWH, OLAP, ANALYTICS, ML TRAINING, ML RUNNING, REPORTING, REPORTING ENGINE, REPORTING CACHE, REPORTING METADATA, SCHEDULED REPORTS, API FACADE]

Page 28

DECOUPLING DATA PLATFORM

ARCHITECTURE DECISIONS

§ Raw data should be stored inside Data Lake

§ Introduce granular reusable and testable steps inside pipelines [ingest, validate, enrich, aggregate etc.]

§ Separate pipeline per vendor/feed

§ Introduce Data Lineage for easy troubleshooting

§ Separate concerns (Scalability, Fault Tolerance)
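
The idea of granular, reusable and testable pipeline steps can be sketched as plain functions chained by a small runner. This is only an illustration under assumptions: the step names follow the slide, while the event fields, the `creative_to_campaign` lookup and the `run_pipeline` helper are hypothetical.

```python
from collections import Counter


def validate(events):
    """Keep only events that carry every mandatory field."""
    return [e for e in events if "user" in e and "creative_id" in e]


def enrich(events, creative_to_campaign):
    """Attach campaign_id looked up from a reference table."""
    return [
        {**e, "campaign_id": creative_to_campaign.get(e["creative_id"])}
        for e in events
    ]


def aggregate(events):
    """Count events per campaign."""
    return dict(Counter(e["campaign_id"] for e in events))


def run_pipeline(raw, steps):
    """Apply each step to the output of the previous one."""
    data = raw
    for step in steps:
        data = step(data)
    return data


# Hypothetical feed: the second event lacks a mandatory field and is dropped.
raw = [
    {"user": "a", "creative_id": 123},
    {"creative_id": 321},
    {"user": "b", "creative_id": 321},
]
lookup = {123: 1, 321: 1}
result = run_pipeline(raw, [validate, lambda ev: enrich(ev, lookup), aggregate])
```

Because each step is a pure function, it can be unit-tested alone and reused in the separate per-vendor pipelines the slide calls for.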

Page 29

VENDOR-AGNOSTIC TECHNOLOGY STACK

§ Apache NiFi for data routing and ingestion

§ Apache Spark/Flink/Presto/Beam for processing

§ Kafka/Hive for Data Lake Storage

§ Hive/MemSQL for DWH

§ Vertica/Redshift/MemSQL/ClickHouse for OLAP

DECOUPLING DATA PLATFORM

Page 30

LESSON 4 – AGGREGATE IT

Page 31

AGGREGATE IT

[Diagram: UI, MVP, API FACADE, OLTP, Data Platform (INGEST, DWH), OLAP, 3rd PROVIDERS]

WHY DO THE REPORTS TAKE SO LONG TO RUN???

Page 32

COMMON PITFALLS

§ Direct access to Data Lake or cold DWH

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, AirFlow, INGEST, PROCESS, LOAD, DATA LAKE / DWH, DWH, ML TRAINING, ML RUNNING, ANALYTICS, REPORTING, REPORTING ENGINE, REPORTING CACHE, REPORTING METADATA, SCHEDULED REPORTS, API FACADE]

Page 33

COMMON PITFALLS

§ Reports query RAW data

[Diagram: DATA PLATFORM, 3rd PROVIDERS, SYSTEM EVENTS, AirFlow, INGEST, PROCESS, LOAD, DATA LAKE, DWH, OLAP, ML TRAINING, ML RUNNING, ANALYTICS, REPORTING, REPORTING ENGINE, REPORTING CACHE, REPORTING METADATA, SCHEDULED REPORTS, API FACADE]

Page 34

INTRODUCE AGGREGATIONS

impressions

date      hour  user_cookie  creative_id
05/21/19  03    4444444      123
05/21/19  03    5555555      321
05/21/19  03    6666666      321
05/21/19  04    7777777      567

campaigns

creative_id  campaign_id
123          1
321          1
567          2

performance_ad (impressions JOIN campaigns)

date      hour  campaign_id  creative_id  impressions
05/21/19  03    1            123          1
05/21/19  03    1            321          2
05/21/19  04    2            567          1
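
The aggregation on this slide can be reproduced in a few lines. The sketch below uses the slide's own rows for `impressions` and `campaigns`; the function name `build_performance_ad` and the in-memory representation are assumptions, since a real platform would run the equivalent join and group-by in Spark, Hive or a similar engine.

```python
from collections import Counter

# Raw fact rows from the slide: (date, hour, user_cookie, creative_id).
impressions = [
    ("05/21/19", "03", "4444444", 123),
    ("05/21/19", "03", "5555555", 321),
    ("05/21/19", "03", "6666666", 321),
    ("05/21/19", "04", "7777777", 567),
]

# Dimension table from the slide: creative_id -> campaign_id.
campaigns = {123: 1, 321: 1, 567: 2}


def build_performance_ad(impressions, campaigns):
    """Join impressions to campaigns, then count rows per group.

    Returns sorted (date, hour, campaign_id, creative_id, impressions) tuples.
    The user_cookie column is dropped, which is what shrinks the table.
    """
    counts = Counter()
    for date, hour, _user_cookie, creative_id in impressions:
        counts[(date, hour, campaigns[creative_id], creative_id)] += 1
    return sorted(key + (n,) for key, n in counts.items())


performance_ad = build_performance_ad(impressions, campaigns)
```

Reports then read the three-row `performance_ad` table instead of scanning every raw impression.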

Page 35

INTRODUCE AGGREGATIONS

[Diagram: INGESTION, VALIDATION, ENRICHMENT; HDFS/HIVE/... RAW FACT TABLE (BATCH 1, BATCH 2); AGGREGATOR; HDFS/HIVE/… AGGREGATION TABLE (BATCH 1, BATCH 2)]

Page 36

INTRODUCE AGGREGATIONS

[Diagram: INGESTION, VALIDATION, ENRICHMENT; HDFS/HIVE/... RAW FACT TABLE (BATCH 1, BATCH 2, BATCH 3); AGGREGATOR; HDFS/HIVE/… AGGREGATION TABLE (BATCH 1, BATCH 2)]

Page 37

INTRODUCE AGGREGATIONS

[Diagram: INGESTION, VALIDATION, ENRICHMENT; HDFS/HIVE/... RAW FACT TABLE (BATCH 1, BATCH 2, BATCH 3); AGGREGATOR; HDFS/HIVE/… AGGREGATION TABLE (BATCH 1, BATCH 2, BATCH 3)]
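
The three slides above appear to show batches landing in the raw fact table and the aggregator folding each new batch into the aggregation table. A minimal sketch of that incremental step follows; the `Aggregator` class, the batch IDs and the `(hour, campaign_id)` row shape are assumptions for illustration, not the talk's implementation.

```python
from collections import Counter


class Aggregator:
    """Folds raw-fact batches into an aggregation table, one batch at a time."""

    def __init__(self):
        self.aggregation_table = Counter()  # (hour, campaign_id) -> impressions
        self.processed_batches = set()

    def process_new_batches(self, raw_fact_table):
        """raw_fact_table: {batch_id: [(hour, campaign_id), ...]}.

        Aggregates only batches not seen before and returns their IDs.
        """
        done = []
        for batch_id in sorted(raw_fact_table):
            if batch_id in self.processed_batches:
                continue  # already aggregated, so a rerun cannot double count
            self.aggregation_table.update(raw_fact_table[batch_id])
            self.processed_batches.add(batch_id)
            done.append(batch_id)
        return done


# Batches 1 and 2 are present on the first run; batch 3 arrives later.
raw_fact_table = {1: [("03", 1), ("03", 1)], 2: [("03", 1), ("04", 2)]}
aggregator = Aggregator()
first_run = aggregator.process_new_batches(raw_fact_table)
raw_fact_table[3] = [("04", 2)]
second_run = aggregator.process_new_batches(raw_fact_table)
```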

Page 38

BREAK DOWN DOMAIN

[Diagram: DOMAIN]

• Break the domain down into business areas
• Cover each area with a dedicated aggregation
• Example for a video platform
  • Ad performance
  • Player performance
  • Video performance
  • Revenue performance
• Build once, reuse across multiple reports

Page 39

LESSON 5 – AVAILABILITY

Page 40

AVAILABILITY

[Diagram: UI, MVP, API FACADE, OLTP, Data Platform, OLAP, 3rd PROVIDERS]

Page 41

AVAILABILITY

[Diagram: UI, MVP, API FACADE, OLTP, Data Platform, OLAP, 3rd PROVIDERS]

WHERE ARE MY REPORTS?

Page 42

THINGS EASY TO MISS

AVAILABILITY

§ If possible, do not share infrastructure between the Data Platform and core services

§ Choose wisely between Kappa and Lambda architectures

§ Introduce effective monitoring

§ Know your data latency and design the solution around it

Page 43

THINGS EASY TO MISS

FAULT TOLERANCE

§ Every job should be fail-ready and retry-able by design

§ Enable multiple attempts on scheduler side

§ Use idempotent sinks

§ Implement backpressure: Prefer Pull over Push, leverage Blob/S3/HDFS or Kafka
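
Retries and idempotent sinks work as a pair: a retry is only safe if writing the same batch twice leaves the sink unchanged. The sketch below illustrates this with an in-memory sink keyed by batch ID; `IdempotentSink`, `run_with_retries` and the simulated crash are hypothetical names and behavior chosen for the example.

```python
class IdempotentSink:
    """Stores each batch under its ID, so a repeated write overwrites
    the earlier copy instead of appending duplicates."""

    def __init__(self):
        self.rows_by_batch = {}

    def write(self, batch_id, rows):
        self.rows_by_batch[batch_id] = list(rows)

    def total_rows(self):
        return sum(len(rows) for rows in self.rows_by_batch.values())


def run_with_retries(job, max_attempts=3):
    """Call job() until it succeeds; return (result, attempts_used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(), attempt
        except Exception:
            if attempt == max_attempts:
                raise


sink = IdempotentSink()
state = {"attempts": 0}


def job():
    # Simulated failure mode: the first attempt crashes AFTER writing.
    state["attempts"] += 1
    sink.write("batch-1", ["row-a", "row-b"])
    if state["attempts"] == 1:
        raise RuntimeError("crashed after write")
    return "done"


result, attempts = run_with_retries(job)
```

With an append-only sink the same scenario would leave four rows; here the rerun leaves exactly two.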

Page 44

THINGS EASY TO MISS

EFFECTIVE MONITORING

§ Collect system and app-specific metrics

§ Measure data availability [in-rate, out-rate, lag]; see Bandar-Log: https://github.com/VerizonAdPlatforms/bandar-log/

§ Think about Datadog [local agents, dashboards, monitors, notes]
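
The three data availability metrics can be derived from two offset samples taken some seconds apart. This is a generic sketch, not Bandar-Log's API: the `Sample` fields and the `availability_metrics` function are assumptions, and the offsets stand for something like Kafka producer and consumer positions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    ts: float      # sample time in seconds
    produced: int  # latest offset written into the stream
    consumed: int  # latest offset processed by the consumer


def availability_metrics(prev, curr):
    """Compute in-rate, out-rate (events per second) and lag (events)."""
    dt = curr.ts - prev.ts
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "in_rate": (curr.produced - prev.produced) / dt,
        "out_rate": (curr.consumed - prev.consumed) / dt,
        "lag": curr.produced - curr.consumed,
    }


metrics = availability_metrics(
    Sample(ts=0.0, produced=1000, consumed=900),
    Sample(ts=10.0, produced=1500, consumed=1200),
)
```

An in-rate that stays above the out-rate means the lag keeps growing, which is the condition worth alerting on.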

Page 45

DASHBOARD EXAMPLE [INGESTION]

Page 46

DASHBOARD EXAMPLE [AGGREGATIONS]

Page 47

LESSON 6 – DATA GOVERNANCE

Page 48

DATA GOVERNANCE

[Diagram: UI, MVP, API FACADE, OLTP, Data Platform, OLAP, 3rd PROVIDERS]

Page 49

DATA GOVERNANCE

[Diagram: UI, MVP, API FACADE, OLTP, Data Platform, OLAP, 3rd PROVIDERS]

I WILL FIND YOU IN THE AFTERLIFE!!!

Page 50

DATA GOVERNANCE CHECKLIST

Did I think about Personal Data Protection?
Did I think about Data Access Control?
Did I think about Data Eviction?
Did I think about Data Lineage?
Did I think about Data Quality?
Did I think about Data Inventory?

Page 51

PERSONAL DATA PROTECTION

§ Learn what Personally Identifiable Data (PID) is

§ Think twice before storing any PID

§ Anonymize data as early as possible in ETL and prefer anonymized data over PID wherever possible

§ Introduce an Anonymized Unique ID (AUID) and store the PID <-> AUID relationship separately
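
One way to picture the AUID approach: events keep only an opaque ID, while the mapping back to the person lives in a separate, tightly controlled store. The sketch below is an assumption-laden illustration: `AuidVault`, `anonymize` and the `email` field are hypothetical, and an in-memory dict stands in for what would be a restricted database.

```python
import uuid


class AuidVault:
    """Keeps the PID <-> AUID mapping apart from analytical data."""

    def __init__(self):
        self._auid_by_pid = {}
        self._pid_by_auid = {}

    def auid_for(self, pid):
        """Return the stable AUID for a PID, creating one on first use."""
        if pid not in self._auid_by_pid:
            auid = uuid.uuid4().hex  # random, so it reveals nothing about the PID
            self._auid_by_pid[pid] = auid
            self._pid_by_auid[auid] = pid
        return self._auid_by_pid[pid]

    def forget(self, pid):
        """Drop the mapping so stored AUIDs can no longer be traced back."""
        auid = self._auid_by_pid.pop(pid, None)
        if auid is not None:
            del self._pid_by_auid[auid]


def anonymize(event, vault, pid_field="email"):
    """Replace the PID field of an event with its AUID."""
    clean = {k: v for k, v in event.items() if k != pid_field}
    clean["auid"] = vault.auid_for(event[pid_field])
    return clean


vault = AuidVault()
first = anonymize({"email": "a@example.com", "creative_id": 123}, vault)
second = anonymize({"email": "a@example.com", "creative_id": 321}, vault)
```

Both events share one AUID, so per-user analytics still work, while deleting the vault entry is enough to detach the data from the person.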

Page 52

DATA ACCESS CONTROL

§ Introduce IAM for components and developers inside Data Lake and DWH
  Control access to PID and anonymized data

§ Introduce ACL for end users inside OLAP
  Leverage OLAP features to support ACL: per row, table, schema, database

Page 53

DATA EVICTION

§ Design data and applications with eviction in mind

§ Introduce a data retention policy and schedule cleanup jobs

§ Set separate retention policies for raw and aggregation tables

§ Document retention policy
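
A scheduled cleanup job usually boils down to comparing partition dates against a per-table retention period. The sketch below shows that selection step only; the table names, the retention values in `RETENTION_DAYS` and the `partitions_to_drop` helper are hypothetical, and actually dropping partitions is left to the storage engine.

```python
from datetime import date, timedelta

# Hypothetical policy: raw data is kept briefly, aggregations much longer.
RETENTION_DAYS = {"raw_impressions": 30, "performance_ad": 365}


def partitions_to_drop(table, partitions, today, retention=RETENTION_DAYS):
    """Return the daily partitions of a table older than its retention period."""
    cutoff = today - timedelta(days=retention[table])
    return sorted(p for p in partitions if p < cutoff)


today = date(2019, 5, 21)
partitions = [date(2019, 3, 1), date(2019, 5, 1), date(2019, 5, 20)]
raw_expired = partitions_to_drop("raw_impressions", partitions, today)
agg_expired = partitions_to_drop("performance_ad", partitions, today)
```

Keeping the policy in one documented mapping also covers the slide's last point, since the retention rules are then readable in a single place.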

Page 54

DATA LINEAGE

§ Shit happens

§ Shit will happen, think about it in advance

RECOMMENDATIONS

§ Each ETL step should persist its output with reasonable retention policy

§ Persist any application logs (Spark/Yarn, CMD apps, ETL, …)

§ Log any significant application decisions

§ Persist any provenance logs (NiFi, …)

Page 55

DATA QUALITY

§ Introduce data validation [even if it is undefined] and track validation issues
  § Schema errors (wrong type, missing mandatory field)
  § Semantic errors (unknown or poorly formatted IDs)
  § Business errors (certain business constraints per-event or cross-event)

§ Track any errors and expose metrics

§ Track discrepancies and expose metrics
  § Discrepancy between raw and aggregation data
  § Discrepancy between real-time and batch
  § Discrepancy between vendor data
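
The three error classes can be checked in order and counted, so that every rejected event shows up in a metric. The sketch below is illustrative only: the event fields, the `KNOWN_CREATIVES` set, the non-negative `price` rule and the metric names are assumptions, not rules from the talk.

```python
from collections import Counter

KNOWN_CREATIVES = {123, 321, 567}  # hypothetical reference data


def validate_event(event):
    """Return None for a valid event, otherwise the error category."""
    # Schema: mandatory fields present and of the right type.
    if not isinstance(event.get("creative_id"), int) or "user_cookie" not in event:
        return "schema"
    # Semantic: the ID must refer to something known.
    if event["creative_id"] not in KNOWN_CREATIVES:
        return "semantic"
    # Business: a per-event constraint, here a non-negative price.
    if event.get("price", 0) < 0:
        return "business"
    return None


def validate_batch(events):
    """Split a batch into valid events and per-category metrics."""
    valid, metrics = [], Counter()
    for event in events:
        error = validate_event(event)
        if error is None:
            valid.append(event)
            metrics["valid"] += 1
        else:
            metrics[f"error.{error}"] += 1
    return valid, metrics


events = [
    {"user_cookie": "4444444", "creative_id": 123},
    {"user_cookie": "5555555", "creative_id": "321"},  # wrong type
    {"user_cookie": "6666666", "creative_id": 999},    # unknown ID
    {"user_cookie": "7777777", "creative_id": 567, "price": -1},
]
valid, metrics = validate_batch(events)
```

The resulting counters are what would be exported to the monitoring system, so a spike in any error category becomes visible.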

Page 56

DATA INVENTORY

§ Document how data is organized

§ Document where data is stored

§ Document what data is exported and where to

§ Document what data is ingested and where from

§ Document as granularly as possible: per vendor, data source, ETL component etc.

Page 57

LESSON 7 – INTRODUCE DATA ENGINEERING

Page 58
Page 59
Page 60
Page 61

SHARING RESPONSIBILITY FOR DATA

Distinguish expertise

Involve Data Engineers to make Data Platform better and faster

Page 62

THANK YOU