Disco workshop: From zero to CDN log processing
2
1. Intro to parallel computing • Algorithms • Programming model • Applications
2. Intro to MapReduce • History • (In)applicability • Examples • Execution overview
3. Writing MapReduce jobs with Disco • Disco & DDFS • Python • Your first Disco job • Disco @ SpilGames
4. CDN log processing • Architecture • Availability & performance monitoring • Steps to get to our Disco landscape
Overview
3
Introduction to Parallel Computing
4
Traditionally (von Neumann model), software has been written for serial computation:
• To be run on a single computer having a single CPU • A problem is broken into a discrete series of instructions • Instructions are executed one after another • Only one instruction may execute at any moment in time
Serial computations
5
A parallel computer is of little use unless efficient parallel algorithms are available • The issues in designing parallel algorithms are very different from those in designing their sequential counterparts
• A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures
Design of efficient algorithms
6
Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, …) defined by F(n) = F(n-1) + F(n-2)
Sequential algorithm, not parallelizable
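The data dependency is visible in code: every term needs the two terms before it, so the iterations of the loop below cannot run concurrently (a minimal Python sketch):

```python
def fib(n):
    """Return the n-th Fibonacci number (1-indexed: F(1) = F(2) = 1)."""
    a, b = 1, 1
    for _ in range(n - 1):
        # Each step depends on the result of the previous step,
        # so the loop cannot be split across workers.
        a, b = b, a + b
    return a
```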
7
Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:
• To be run using multiple CPUs • A problem is broken down into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions • Instructions from each part execute simultaneously on different CPUs
Parallel computations
8
Summation of numbers
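Summation, by contrast, is associative: partial sums can be computed concurrently and combined at the end. A toy sketch using Python threads (a real speedup would need processes or a cluster, since CPython threads share one interpreter lock):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(numbers, workers=4):
    # Split the input into roughly equal chunks, one per worker.
    chunk = max(1, len(numbers) // workers)
    parts = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum, parts)   # each chunk summed concurrently
    return sum(partials)                  # combine the partial results
```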
9
• Description • The mental model the programmer has about the detailed execution of their applications
• Purpose • Improve programmer productivity
• Evaluation • Expression • Simplicity • Performance
Programming Model
10
• Message passing • Independent tasks encapsulating local data • Tasks interact by exchanging messages
• Shared memory • Tasks share a common address space • Tasks interact by reading and writing this space asynchronously
• Data parallelization • Tasks execute a sequence of independent operations • Data usually evenly partitioned across tasks • Also referred to as “embarrassingly parallel”
Parallel Programming Models
11
• Historically used for large-scale problems in science and engineering • Physics – applied, nuclear, particle, fusion, photonics • Bioscience, biotechnology, genetics, sequencing • Chemistry, molecular sciences • Mechanical engineering – from prosthetics to spacecraft • Electrical engineering, circuit design, microelectronics • Computer science, mathematics
Applications (Scientific)
12
• Commercial applications also provide a driving force in parallel computing. These applications require the processing of large amounts of data • Databases, data mining • Oil exploration • Web search engines, web-based business services • Medical imaging and diagnosis • Pharmaceutical design • Management of national and multi-national corporations • Financial and economic modeling • Advanced graphics & VR • Networked video and multimedia technologies
Applications (Commercial)
13
• Parallelize • Distribute
• Problems? • Concurrency problems • Coordination • Scalability • Fault tolerance
What if my job is too “big”?
14
• Application is modeled as a Directed Acyclic Graph • The DAG defines the dataflow
• Computational vertices • The vertices of the graph define the operations on data
• Channels • File • TCP pipe • SHM FIFO
• Not as restrictive as MapReduce • Multiple inputs and outputs
• Allows developers to define communication between vertices
Microsoft: MSN search group: DRYAD
15
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
16
Introduction to MapReduce
17
I have a question which a data set can answer. I have lots of data and I have a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across each node. Specifically, MapReduce maps data in the form of key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or Reducer, reduces the data to an “answer” or a list of “answers”.
What is MapReduce?
18
• Published in 2004 by Google
MapReduce history
19
• Published in 2004 by Google • Functional programming (e.g. Lisp, Erlang)
• map() function • Applies a function to each value of a sequence
• reduce() function (fold()) • Combines all elements of a sequence using a binary operator
MapReduce history
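Python has both primitives built in, which makes the borrowed semantics easy to see (a quick illustration; in Python 3, reduce lives in functools):

```python
from functools import reduce

words = ["map", "and", "reduce"]
# map(): apply a function to each value of a sequence.
lengths = list(map(len, words))
# reduce() / fold(): combine all elements with a binary operator.
total = reduce(lambda a, b: a + b, lengths)
```

Here `lengths` is `[3, 3, 6]` and `total` is `12`.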
21
• Restrictive semantics • Pipelining Map/Reduce stages is possibly inefficient • Solves problems well only within a narrow programming domain • DB community: our parallel RDBMSs have been doing this forever… • Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions • It’s not always a high-performance solution. Straight Python, simple batch-scheduled Python, and C code can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big-data problems
Why NOT MapReduce?
22
• Distributed grep, sort, word frequency • Inverted index construction • PageRank • Web link-graph traversal • Large-scale PDF generation, image conversion • Artificial intelligence, machine learning • Geographical data, Google Maps • Log querying • Statistical machine translation • Analyzing similarities of users’ behavior • Processing clickstream and demographic data • Research for ad systems • Vertical search engine for trustworthy wine information
What is it good for?
23
• Google (proprietary implementation in C++) • Hadoop (open-source implementation in Java) • Disco (Erlang, Python) • Skynet (Ruby) • BashReduce (Last.fm) • Spark (Scala, a functional OO language on the JVM) • Plasma MapReduce (OCaml) • Storm (the Hadoop of real-time processing)
cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py
Flavors of MapReduce
24
• Process data using special map() and reduce() functions • The map() function is called on every item in the input and emits a series of intermediate key/value pairs
• All values associated with a given key are grouped together
• The reduce() function is called on every unique key and its value list, and emits a value that is added to the output
The MR programming model
25
• More formally • Map(k1, v1) -> list(k2, v2) • Reduce(k2, list(v2)) -> list(v2)
The MR programming model
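The whole dataflow can be modeled in a few lines of plain Python. A toy, single-process sketch of the semantics (no distribution, no fault tolerance — just map, group by key, reduce):

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    # Map: (k1, v1) -> list(k2, v2), then group all values by key.
    intermediate = defaultdict(list)
    for record in inputs:
        for k2, v2 in mapper(record):
            intermediate[k2].append(v2)
    # Reduce: (k2, list(v2)) -> one result per unique key.
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}

# Word count expressed in this model:
counts = run_mapreduce(
    ["a b a", "b c"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {'a': 2, 'b': 2, 'c': 1}
```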
26
• Greatly reduces parallel programming complexity • Reduces synchronization complexity • Automatically partitions data • Provides failure transparency
• Practical • Hundreds of jobs every day
MapReduce benefits
27
• Partitions input data • Schedules execution across a set of machines • Handles machine failure • Manages IPC
The MR runtime system
28
• Distributed grep • Map function emits <word, line_number> if a word matches the search criteria
• Reduce function is the identity function • URL access frequency
• Map function processes web logs, emits <url, 1> • Reduce function sums values, emits <url, total>
MR Examples
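The URL access frequency example can be sketched as two plain Python functions; the log format (`<ip> <url>` per line) and the function names here are illustrative assumptions, not Disco API:

```python
from collections import Counter

def access_map(log_line):
    # URL access frequency: emit (url, 1) per request line.
    # Assumes hypothetical '<ip> <url>' log lines.
    ip, url = log_line.split()
    yield (url, 1)

def access_reduce(pairs):
    # Sum the 1s per url: emit (url, total).
    totals = Counter()
    for url, one in pairs:
        totals[url] += one
    return dict(totals)

logs = ["1.1.1.1 /a", "2.2.2.2 /b", "3.3.3.3 /a"]
pairs = [kv for line in logs for kv in access_map(line)]
totals = access_reduce(pairs)   # {'/a': 2, '/b': 1}
```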
29
• Geospatial query processing • Given an intersection, find all roads connecting to it • Rendering the tiles in the map • Finding the nearest feature to a given address
MR Examples
30
• “Learning the right abstraction will simplify your life.” – Travis Oliphant
MR Examples
Program                  Map()                      Reduce()
Distributed grep         Matched lines              pass
Reverse web-link graph   <target, source>           <target, list(source)>
URL count                <url, 1>                   <url, total_count>
Term-vector per host     <hostname, term-vector>    <hostname, all-term-vector>
Inverted index           <word, doc_id>             <word, list(doc_id)>
Distributed sort         <key, value>               pass
31
• The user program, via the MR library, shards the input data
MR Execution 1/8
32
• The user program creates process copies (workers) distributed on a machine cluster.
• One copy will be the “Master” and the others will be worker threads
MR Execution 2/8
33
• The master distributes M map and R reduce tasks to idle workers. • M == number of shards • R == the key space is divided into R parts
MR Execution 3/8
34
• Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs • Output buffered in RAM
MR Execution 4/8
35
• Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process
MR Execution 5/8
36
• Master process gives disk locations to an available reduce-task worker, who reads all associated intermediate data
MR Execution 6/8
37
• Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing unique keys and their associated values. Reduce function output is appended to the reduce-task’s partition output file
MR Execution 7/8
38
• Master process wakes up user process when all tasks have completed.
• Output contained in R output files.
MR Execution 8/8
39
• An input reader • A map() function • A partition function • A compare function (sort) • A reduce() function • An output writer
Hot spots
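Of these hooks, the partition function is the one most often overridden. A minimal hash-based sketch (the usual default; `zlib.crc32` stands in for a stable hash here, since Python 3's built-in `hash()` is salted per process):

```python
import zlib

def partition(key, num_partitions):
    # Map an intermediate key to one of R reduce partitions.
    # A stable checksum guarantees that every worker sends pairs
    # with the same key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```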
40
MR Execution Overview
41
• Fault tolerance • Master process periodically pings workers
• Map-task failure – re-execute » all output was stored locally
• Reduce-task failure – only re-execute partially completed tasks » all output stored in the global file system
MR Execution Overview
42
• Don’t move data to workers… move workers to the data! • Store data on local disks of nodes in the cluster • Start up the workers on the node that has the data locally
• Why? • Not enough RAM to hold all the data in memory • Disk access is slow, but disk throughput is good
• A distributed file system is the answer • GFS (Google File System) (= Big File System) • HDFS (Hadoop DFS) = GFS clone • DDFS (Disco DFS)
Distributed File System
43
• Sequential -> Parallel -> Distributed • Hype after Google published the paper in 2004 • A very narrow set of problems • Big data is a marketing buzzword
Summary for Part I.
44
• MapReduce is a paradigm for distributed computing developed (patented…) by Google for performing analysis on large amounts of data distributed across thousands of commodity computers • The Map phase processes the input one element at a time and returns a (key, value) pair for each element
• An optional Partition step partitions Map results into groups based on a partition function on the key.
• The engine merges partitions and sorts all the map results.
• The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results.
Summary for Part I (cont.)
45
Writing MapReduce jobs with Disco
46
• Writing MapReduce jobs can be VERY time-consuming • MapReduce patterns • Debugging a failure is a nightmare • Large clusters require a dedicated team to keep them running • Writing a Disco job becomes a software engineering task
• …rather than a data analysis task
Take a deep breath
47
Disco
48
• “Massive data – Minimal code” – by Nokia Research Center • http://discoproject.org • Written in Erlang
• Orchestrating control • Robust, fault-tolerant distributed applications
• Python for operating on data • Easy to learn • Complex algorithms with very little code • Utilize favorite Python libraries
• The complexity is hidden, but…
About Disco
49
• Distributed • Increase storage capacity by adding nodes • Processing on nodes without transferring data
• Replicated • Chunked data stored in gzip-compressed chunks • Tag-based • Attributes • CLI
• $ ddfs ls data:log • $ ddfs chunk data:bigtxt ./bigtxt • $ ddfs blobs data:bigtxt • $ ddfs xcat data:bigtxt
Disco Distributed “filesystem”
50
• Everything is preinstalled • Disco localhost setup: https://github.com/spilgames/disco-development-workflow
Sandbox environment
51
• www.pythonforbeginners.com – by Magnus • Import • Data structures: {} dict, [] list, () tuple • Defining functions and classes • Control flow primitives and structures: for, if, … • Exception handling • Regular expressions • GeoIP, MySQLdb, … • To understand what yield does, you must understand what generators are. And before generators come iterables.
Python – What you’ll need
52
When you create a list, you can read its items one by one; this is called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print i
1
2
3
Python Lists
53
mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:
>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...     print i
0
1
4
Python Iterables
54
Generators are iterables, but you can read them only once. This is because they do not store all the values in memory; they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print i
0
1
4
It is just the same except you used () instead of []. But you cannot perform for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end after calculating 4, one by one.
Python Generators
55
yield is a keyword that is used like return, except the function will return a generator.
>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()
>>> print mygenerator
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print i
0
1
4
Python Yield
56
• What is the total count for each unique word in the text?
• Word counting is the Hello World! of MapReduce
• We need to write map() and reduce() functions • Map(rec) -> list(k, v) • Reduce(k, v) -> list(res)
• Your application communicates with the Disco API • from disco.core import Job, result_iterator
Your first disco job
57
• Splitting the file (related chunks) into lines • Map(line, params)
• Split line into words • Emit k,v tuple: <word, 1>
• Reduce(iter, params) • Often, this is an algebraic expression • <word, [1,1,1]> -> <word, 3>
Word count
58
• Modules to import • Setting the master host • DDFS • Job() • result_iterator(Job.wait()) • Job.purge()
Word count: Your application
59
def fun_map(line, params):
    for word in line.split():
        yield word, 1
Word count: Your map
60
def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)
Built-in: disco.worker.classic.func.sum_reduce()
Word count: Your reduce
61
job = Job().run(input=…, map=fun_map, reduce=fun_reduce)
for word, count in result_iterator(job.wait(show=True)):
    print (word, count)
job.purge()
Word count: Your results
62
class MyJob1(Job):
    @classmethod
    def map(self, data, params):
        …
    @classmethod
    def reduce(self, iter, params):
        …
…
MyJob2.run(input=MyJob1.wait())  # <- job chaining
Word count: More advanced
63
• Event tracking & advertising-related jobs • Heatmap: page clicks -> 2D density distributions • Reconstructing sessions • Ad research • Behavioral modeling
• Log crunching • Gameplays per country • Frontend performance (CDN) • 404s, response code tracking • Intrusion detection #security
Disco @ SpilGames
64
• Calculate your resource need estimates • Deploy in workflow • We have
• Git • Package repository / deployment orchestration • Disco-tools: http://github.com/spilgames/disco-tools/ • Job runner: http://jobrunner/ • Data warehouse • Interactive, graphical report generation
Disco @ SpilGames
65
66
CDN log processing
67
• Question • Availability of each CDN provider
• Data source • JavaScript sampler on the client side • Load balancer -> HA logging endpoints -> Access logs -> Disco Distributed FS
CDN Availability monitoring
68
CDN Availability monitoring
69
• Input • URI parsing • /res.ext?v=o,1|e,1|os,1|ce,1|hw,1|c,0|l,1
• Expected output • ProviderO 98.7537% • ProviderE 57.8851% • ProviderC 99.4584% • ProviderL 99.4847%
CDN Availability monitoring
70
# cdnData: "o,1|e,1|os,1|ce,1|hw,1|c,0|l,1" • Parse a log entry • Yield samples
• <o, 1> • <e, 1> • <os, 1> • <ce, 1> • <hw, 1> • <c, 0> • <l, 1>
CDN Availability monitoring (map)
71
def map_cdnAvailability(line, params):
    import urlparse
    try:
        (timestamp, data) = line.split(',', 1)
        data = dict(urlparse.parse_qsl(data, False))
        for cdnData in data['a'].split('|'):
            try:
                cdnName = cdnData.split(',')[0]
                cdnAvailable = int(cdnData.split(',')[1])
                yield cdnName, cdnAvailable
            except:
                pass
    except:
        pass
CDN Availability monitoring (map)
72
Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]> • kvgroup(iter) • The trick:
• samples = […] • len(samples) -> number of all samples • sum(samples) -> number available • A = sum()/len() * 100.0
CDN Availability monitoring (reduce)
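Computed on that sample list, the trick looks like this:

```python
samples = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
# sum() counts the available samples, len() counts all of them.
availability = float(sum(samples)) / len(samples) * 100.0
# 9 available out of 12 samples -> 75.0%
```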
73
def reduce_cdnAvailability(iter, params):
    from disco.util import kvgroup
    for cdnName, cdnAvailabilities in kvgroup(sorted(iter)):
        try:
            cdnAvailabilities = list(cdnAvailabilities)
            totalSamples = len(cdnAvailabilities)
            totalAvailable = sum(cdnAvailabilities)
            totalUnavailable = totalSamples - totalAvailable
            yield cdnName, round(float(totalAvailable) / totalSamples * 100.0, 4)
        except:
            pass
CDN Availability monitoring (reduce)
74
• DDFS • tag://logs:cdn:la010:12345678900 • disco.ddfs.list(tag) • disco.ddfs.[get|set]attr(tag, attr, value)
• Job(name, master).run(input, map, reduce) • partitions = R • map_reader = disco.worker.classic.func.chain_reader • save = True
Advanced usage
75
CDN Performance
95th percentile with per country breakdown
76
• Question • 95th percentile of response times per CDN per country
• Data source • JavaScript sampler on the client side • LB -> HA logging endpoints -> Access logs -> DDFS
• Input • /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Expected output • ProviderN CountryA: 3891 ms CountryB: 1198 ms … • ProviderC CountryA: 3793 ms CountryB: 1397 ms … • ProviderE CountryA: 3676 ms CountryB: 1676 ms … • ProviderL CountryA: 4332 ms CountryB: 1233 ms …
CDN Performance
77
The 95th percentile
The 95th percentile is the value below which 95% of the data points fall; the remaining 5% are above it. 95 is a magic number used in networking because you have to plan for the most-of-the-time case.
78
v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1 • Line parsing is about the same • Advanced key: <cdn:country, performance> • How to get the country from the IP?
• Job().run(…required_modules=["GeoIP"]…) • No global variables within map() – Why?
• Use Job().run(…params={}…) instead • yield "%s:%s" % (cdnName, country), cdnPerf
CDN Performance (map)
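A sketch of such a map function. This is illustrative only: the line format and the `params.geoip` lookup object are assumptions, with the GeoIP handle presumed to be shipped in via `Job().run(params=…)` as suggested above, not the production job:

```python
def map_cdnPerformance(line, params):
    # Hypothetical line format: "<timestamp>,v=o,1234|l,2345&ipaddr=<ip>"
    try:
        timestamp, rest = line.split(',', 1)
        query, _, ipaddr = rest.partition('&ipaddr=')
        samples = query.split('=', 1)[1]            # "o,1234|l,2345"
        # params.geoip is an assumed GeoIP lookup object passed via params.
        country = params.geoip.country_code_by_addr(ipaddr)
        for sample in samples.split('|'):
            cdnName, cdnPerf = sample.split(',')
            yield "%s:%s" % (cdnName, country), int(cdnPerf)
    except Exception:
        pass  # skip malformed lines, as in the availability job
```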
79
# <hw, [123, 234, 345, 456, 567, 678, 798]>
def percentile(N, percent, key=lambda x: x):
    # N must be a sorted list of values; percent is in [0, 1].
    import math
    if not N:
        return None
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    # Linearly interpolate between the two nearest ranks.
    d0 = key(N[int(f)]) * (c - k)
    d1 = key(N[int(c)]) * (k - f)
    return d0 + d1
CDN Performance (reduce)
80
• Outputs • Print to screen • Write to a file • Write to DDFS – why not? • Another MR job with chaining • Email it • Write to MySQL • Write to Vertica • Zip and upload to Spil OOSS
Other goodies
81
1. Question & data source • JavaScript code • Nginx endpoint • Logrotate • (de-personalize) • DDFS load scripts
2. MR jobs 3. Jobrunner jobs 4. Present your results
Steps to get to our Disco landscape
82
• Editing on live servers • No version control • No staging environment • Not using a deployment mechanism • Not using Continuous Integration • Poor parsing • No redundancy for MC applications • Not purging your job • Not documenting your job • Using hard-coded configuration inside MR code
Bad habits
83
• No peer review • Not getting back events from slaves • Using job.wait() • Job().run(partitions=1)
Bad habits cont.
84
• Writing Disco jobs can be easy • Finding the right abstraction for a problem is not… • A framework is on the way -> DRY • You can find a lot of good patterns in SET and other jobs
You successfully took a step towards understanding how to • Process large amounts of data • Solve some specific problems with MR
Summary
85
• Ecosystems • DiscoDB: lightning-fast key->value mapping • Discodex: disco + ddfs + discodb
• Disco vs. Hadoop • HDFS, Hadoop ecosystem • NoSQL result stores
Bonus: Outlook
Questions?
87
• Presentation can be found at: http://spil.com/discoworkshop2013
• You can contact me at: [email protected]
Thank you!