RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

RUNNING A PETABYTE SCALE DATA SYSTEM

Alexey Kharlamov, Nov 14th, 2016

Good, Bad, and Ugly Decisions

AGENDA

MULTITENANCY

• Problem statement

• Resource management

• Workload isolation

CONTINUOUS INTEGRATION

• What is different?

• Caveats of the conventional approach

• BigData release pipeline

INTRODUCTION

• Who?

• What?

• Why?

SERVICES

Data Strategy

Big Data Architecture

Data Science

Big Data DevOps and Support

Solutions and Accelerators

BIG DATA AND DATA SCIENCE PRACTICE

15+ World-Class Data Architects

200+ Big Data Engineers & Hadoop DevOps

10% Hadoop Certified Engineers

20+ Data Scientists

BIO

Alexey is a Solution Architect at EPAM Systems Ltd, where he leads the EMEA Big Data Competency Center. He has over 20 years of software engineering experience and has built multiple systems for low-latency and distributed data processing in the financial, e-retail, and advertising industries.

During his career, Alexey has designed systems processing millions of messages per second and managing petabytes of stored data. He uses RDBMSs, NoSQL stores, data grids, and the Big Data toolchain in his daily work to help companies on their Big Data journey.

Alexey Kharlamov

EPAM Systems, Solution Architect

DATA THAT CANNOT BE PROCESSED ON A SINGLE MACHINE

• Data

– Machine-generated data from social networks, games, sensors, ad networks

– Large volumes

– Allows building fine-grained models of reality

• Traits

– ~1000 USD/TB

– Hundreds of servers, thousands of rotational drives (failure is a reality)

– High-performance server-to-server network

– It takes days to copy data from a single server

BIG DATA SYSTEM
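
The "days to copy data from a single server" trait can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 48 TB per server and the 1 Gbit/s and 10 Gbit/s link speeds are assumed figures, not numbers from the talk, and `copy_time_days` is a hypothetical helper name.

```python
def copy_time_days(data_tb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Days needed to move `data_tb` decimal terabytes over one network link.

    `efficiency` is the assumed fraction of nominal link speed actually achieved.
    """
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 86_400                        # seconds -> days


if __name__ == "__main__":
    # Assumed example: one server holding 48 TB (e.g. 12 drives of 4 TB each).
    for gbit in (1.0, 10.0):
        print(f"{gbit:>4} Gbit/s: {copy_time_days(48, gbit):.2f} days")
```

Under these assumptions a single server takes several days to drain over a 1 Gbit/s link, which is why moving data between environments is treated as expensive in the slides that follow.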

CONTINUOUS INTEGRATION @ SCALE

• Multiple environments for different purposes

– Local/Continuous Integration

– Quality Assurance

– Production

• The environments are kept in sync

– Configuration

– Databases

• Code and test datasets are deployed to the environments to test different aspects of a system

CLASSICAL (WEB) APPROACH

1 laptop, 1 VM, 2 hosts, 100+ hosts

TRADITIONAL APPROACH

TOTALLY DIFFERENT

ENVIRONMENT SYNCHRONIZATION OUTCOME

• CI, QA and PROD are constantly different

• A test failure on CI or QA does not mean the job will fail in PROD, and vice versa

• People stop relying on additional environments to test their jobs

• The most frequent bugs

– Unexpected field value / rubbish

– Input data change

– Resource issues due to data skew or growth

• Environments have different hardware

– Number of nodes

– Generations of servers

• Hard to synchronize configuration

– Reprovisioning takes hours

– Engineers tend to forget to copy configuration parameters

• Hard to synchronize data

– Different amount of disk space and CPU

– Copying takes hours

PREVAILING ISSUE TYPES

• Unexpected field value / rubbish

– Test data do not cover all possible values

– Sampled data may miss exactly this error

– Need to test on production data

• Incompatible change in data format

– Frequently introduced by third parties, unexpectedly

– Falls through ETL layers

• Resource issues due to data skew or growth

– Causes job termination or cluster failure

– Must be tested on exactly the same hardware configuration
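
As an illustration of the data skew issue, the sketch below measures how unevenly records are spread across keys. It is a minimal example under assumptions: `skew_ratio` is a hypothetical helper, and any alert threshold applied to its result would have to be chosen per job.

```python
from collections import Counter
from typing import Hashable, Iterable


def skew_ratio(keys: Iterable[Hashable]) -> float:
    """Ratio of the largest key group to the mean group size.

    1.0 means perfectly even keys; large values mean one reducer or
    partition would receive far more records than the others.
    """
    counts = Counter(keys)
    if not counts:
        return 0.0
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean


if __name__ == "__main__":
    sample = ["user-1"] * 8 + ["user-2", "user-3"]
    print(f"skew ratio: {skew_ratio(sample):.2f}")
```

A ratio computed on a sample can understate skew in the full dataset, which matches the point that such issues only show up reliably on production data and hardware.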


PERFECT TEST USES PRODUCTION DATA

PERFECT TEST USES PRODUCTION HARDWARE

• Logical partitions for DEV, QA, PROD on the cluster

– Full processing capacity available

– Always up-to-date data and configuration

– No environment synchronization needed

• Cluster becomes multitenant

– Partitions must be isolated!

– Code must be portable!

• Developers need more

– Faster turnaround times

– Easy interactive debugging and cross-process traceability

QA: SINGLE CLUSTER FOR EVERYTHING
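
One way to read "code must be portable" is that the same job takes its queue and paths from the partition it runs in. The sketch below assumes a hypothetical layout: the queue names, the `/domains/...` paths, and the choice to let every partition read production input while writing only to its own area are illustrative assumptions, not details from the talk.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Domain:
    name: str
    yarn_queue: str   # hypothetical queue name for this partition
    hdfs_root: str    # hypothetical directory owned by this partition


# Assumed layout: one logical partition per purpose on the same cluster.
DOMAINS: Dict[str, Domain] = {
    "dev": Domain("dev", "root.dev", "/domains/dev"),
    "qa": Domain("qa", "root.qa", "/domains/qa"),
    "prod": Domain("prod", "root.prod", "/domains/prod"),
}


def job_settings(domain_name: str, dataset: str) -> Dict[str, str]:
    """Resolve queue and paths for a job without hard-coding the environment."""
    domain = DOMAINS[domain_name]
    return {
        "queue": domain.yarn_queue,
        # Assumption: every partition reads the production copy of the input.
        "input": f"{DOMAINS['prod'].hdfs_root}/input/{dataset}",
        # Output always stays inside the partition that ran the job.
        "output": f"{domain.hdfs_root}/output/{dataset}",
    }


if __name__ == "__main__":
    print(job_settings("qa", "clicks"))
```

The job code itself never names an environment; only the `domain_name` argument changes between DEV, QA, and PROD runs.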

QA: HADOOP MINICLUSTER

• Full clone of a Hadoop cluster in a single JVM

– Job Driver

– NameNode

– DataNode

– Hive

– HBase

• Step Into... Hadoop and debug

– MapReduce jobs

– User Defined Functions

– Coprocessors

– Queries

QA: CONTINUOUS QUALITY MONITORING

• Assertion of invariants per data chunk or time period

– Number of records

– Field data profile

– Conversion failures

– Missing dictionary/dimension data

– Field values range

• Alerting on assertion failure

– Too many errors!

– Number of records differs!
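
The invariant checks listed above can be sketched as a small per-chunk profiler plus an assertion step. This is a minimal sketch under assumptions: `ChunkStats`, `profile_chunk`, `check_invariants`, and the default tolerance and error-rate thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ChunkStats:
    records: int = 0
    conversion_failures: int = 0
    out_of_range: int = 0


def profile_chunk(rows: Iterable[str], parse: Callable[[str], float],
                  low: float, high: float) -> ChunkStats:
    """Collect simple quality counters for one data chunk or time period."""
    stats = ChunkStats()
    for raw in rows:
        stats.records += 1
        try:
            value = parse(raw)
        except (ValueError, TypeError):
            stats.conversion_failures += 1   # field could not be converted
            continue
        if not low <= value <= high:
            stats.out_of_range += 1          # value outside the expected range
    return stats


def check_invariants(stats: ChunkStats, expected_records: int,
                     tolerance: float = 0.2, max_error_rate: float = 0.01) -> List[str]:
    """Return alert messages for every invariant the chunk violates."""
    alerts: List[str] = []
    if expected_records and abs(stats.records - expected_records) / expected_records > tolerance:
        alerts.append("Number of records differs!")
    errors = stats.conversion_failures + stats.out_of_range
    if stats.records and errors / stats.records > max_error_rate:
        alerts.append("Too many errors!")
    return alerts


if __name__ == "__main__":
    stats = profile_chunk(["1", "2", "x", "500"], int, low=0, high=100)
    print(stats, check_invariants(stats, expected_records=4))
```

In practice the expected record count would come from the previous period or from the upstream system, and the alerts would be routed to whatever monitoring the platform already uses.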

MULTITENANCY

• Uses the unit allocated to them, but would always like more

• Wants independence from others

• Does not want to be bothered by others, but may throw a party from time to time

APARTMENT RENTAL

TENANT

• Provides unit fulfilling tenant needs

• Fixes broken facilities

• Ensures tenants follow rules

• Evicts misbehaving tenants

LANDLORD

• A logical partition of platform resources independently executing a cluster application

– Data processing scripts and drivers

– Cluster services (workflow managers, query engines)

– Bespoke services (REST, Web UI, etc.)

• Resource management

– YARN resource pool defines the share of resources available to the application

– HDFS quotas for data volume control

• Isolation

– Linux cgroups enforce CPU/RAM utilization limits

– Filesystem ACLs restrict access

– Own service instance per domain (Hive, scheduler, etc.)

– YARN can preempt tasks running for too long

– Watchdog processes terminate runaway jobs

APPLICATION DOMAIN
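
The watchdog idea can be sketched as a pure decision function that picks jobs exceeding their domain's runtime limit. Everything here is an assumption for illustration: `RunningJob`, `find_runaway_jobs`, and the per-domain limits are hypothetical, and the step that actually lists and terminates jobs on the cluster is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class RunningJob:
    job_id: str
    domain: str
    started_at: float   # epoch seconds


def find_runaway_jobs(jobs: Iterable[RunningJob],
                      max_runtime_s: Dict[str, float],
                      now: float) -> List[str]:
    """Return IDs of jobs that ran longer than their domain allows.

    Domains without a configured limit are never flagged.
    """
    runaway: List[str] = []
    for job in jobs:
        limit = max_runtime_s.get(job.domain)
        if limit is not None and now - job.started_at > limit:
            runaway.append(job.job_id)
    return runaway


if __name__ == "__main__":
    jobs = [
        RunningJob("job-a", "dev", started_at=0),
        RunningJob("job-b", "prod", started_at=0),
        RunningJob("job-c", "dev", started_at=900),
    ]
    limits = {"dev": 600, "prod": 7200}   # assumed limits, in seconds
    print(find_runaway_jobs(jobs, limits, now=1000))
```

Keeping the decision separate from the kill action makes the policy easy to test and lets each tenant domain carry its own limit.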

ELASTIC COMPUTING CAPACITY

Mesosphere

• Researchers and developers frequently need a playground

• Application domains need to dynamically allocate resources

– Metal as a Service

– Virtualization

– Containerization

• Containers are perfect for portable code bundling

– Statelessness encourages externalization of configuration

– All dependencies included

– Explicit amount of resources allocated

– Easy migration between hosts

TAKEAWAYS

AUGMENT HADOOP WITH FLUID COMPUTATIONAL CAPACITY

CREATE ISOLATED DOMAINS FOR TENANTS AND WORKLOADS

USE UNIFIED PLATFORM FOR ALL ACTIVITIES

THANK YOU

[email protected]

@aih1013
