Disco workshop: From zero to CDN log processing
2
1. Intro to parallel computing • Algorithms • Programming model • Applications
2. Intro to MapReduce • History • (In)applicability • Examples • Execution overview
3. Writing MapReduce jobs with Disco • Disco & DDFS • Python • Your first Disco job • Disco @ SpilGames
4. CDN log processing • Architecture • Availability & performance monitoring • Steps to get to our Disco landscape
Overview
3
Introduction to Parallel Computing
4
Traditionally (von Neumann model), software has been written for serial computation:
• To be run on a single computer having a single CPU • A problem is broken into a discrete series of instructions • Instructions are executed one after another • Only one instruction may execute at any moment in time
Serial computations
5
A parallel computer is of little use unless efficient parallel algorithms are available • The issues in designing parallel algorithms are very different from those in designing their sequential counterparts
• A significant amount of work is being done to develop efficient parallel algorithms for a variety of parallel architectures
Design of efficient algorithms
6
Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, …) defined by F(n) = F(n-1) + F(n-2)
Sequential algorithm, not parallelizable
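The data dependency is visible in code: every term needs the two terms before it, so the iterations of the loop below cannot run concurrently (a minimal Python sketch):

```python
def fib(n):
    """Return the n-th Fibonacci number (1-indexed: F(1) = F(2) = 1)."""
    a, b = 1, 1
    for _ in range(n - 1):
        # Each step depends on the result of the previous step,
        # so the loop cannot be split across workers.
        a, b = b, a + b
    return a
```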
7
Parallel computing is the simultaneous use of multiple computing resources to solve a computational problem:
• To be run using multiple CPUs • A problem is broken down into discrete parts that can be solved concurrently
• Each part is further broken down to a series of instructions • Instructions from each part execute simultaneously on different CPUs
Parallel computations
8
Summation of numbers
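Summation, by contrast, is associative: partial sums can be computed concurrently and combined at the end. A toy sketch using Python threads (a real speedup would need processes or a cluster, since CPython threads share one interpreter lock):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(numbers, workers=4):
    # Split the input into roughly equal chunks, one per worker.
    chunk = max(1, len(numbers) // workers)
    parts = [numbers[i:i + chunk] for i in range(0, len(numbers), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum, parts)   # each chunk summed concurrently
    return sum(partials)                  # combine the partial results
```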
9
• Description • The mental model the programmer has about the detailed execution of their applications
• Purpose • Improve programmer productivity
• Evaluation • Expression • Simplicity • Performance
Programming Model
10
• Message passing • Independent tasks encapsulating local data • Tasks interact by exchanging messages
• Shared memory • Tasks share a common address space • Tasks interact by reading and writing this space asynchronously
• Data parallelization • Tasks execute a sequence of independent operations • Data usually evenly partitioned across tasks • Also referred to as “embarrassingly parallel”
Parallel Programming Models
11
• Historically used for large-scale problems in science and engineering • Physics – applied, nuclear, particle, fusion, photonics • Bioscience, biotechnology, genetics, sequencing • Chemistry, molecular sciences • Mechanical engineering – from prosthetics to spacecraft • Electrical engineering, circuit design, microelectronics • Computer science, mathematics
Applications (Scientific)
12
• Commercial applications also provide a driving force in parallel computing. These applications require the processing of large amounts of data • Databases, data mining • Oil exploration • Web search engines, web-based business services • Medical imaging and diagnosis • Pharmaceutical design • Management of national and multi-national corporations • Financial and economic modeling • Advanced graphics & VR • Networked video and multimedia technologies
Applications (Commercial)
13
• Parallelize • Distribute
• Problems? • Concurrency problems • Coordination • Scalability • Fault tolerance
What if my job is too “big”?
14
• Application is modeled as a Directed Acyclic Graph • The DAG defines the dataflow
• Computational vertices • The vertices of the graph define the operations on data
• Channels • File • TCP pipe • SHM FIFO
• Not as restrictive as MapReduce • Multiple inputs and outputs
• Allows developers to define communication between vertices
Microsoft: MSN search group: DRYAD
15
“A simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.”
Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Google Inc.
16
Introduction to MapReduce
17
I have a question which a data set can answer. I have lots of data and I have a cluster of nodes. MapReduce is a parallel framework which takes advantage of my cluster by distributing the work across each node. Specifically, MapReduce maps data in the form of key-value pairs, which are then partitioned into buckets. The buckets can be spread easily over all the nodes in the cluster, and each node, or Reducer, reduces the data to an “answer” or a list of “answers”.
What is MapReduce?
18
• Published in 2004 by Google
MapReduce history
19
• Published in 2004 by Google • Functional programming (e.g. Lisp, Erlang)
• map() function • Applies a function to each value of a sequence
• reduce() function (fold()) • Combines all elements of a sequence using a binary operator
MapReduce history
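Python has both primitives built in, which makes the borrowed semantics easy to see (a quick illustration; in Python 3, reduce lives in functools):

```python
from functools import reduce

words = ["map", "and", "reduce"]
# map(): apply a function to each value of a sequence.
lengths = list(map(len, words))
# reduce() / fold(): combine all elements with a binary operator.
total = reduce(lambda a, b: a + b, lengths)
```

Here `lengths` is `[3, 3, 6]` and `total` is `12`.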
21
• Restrictive semantics • Pipelining Map/Reduce stages is possibly inefficient • Solves problems well only within a narrow programming domain • DB community: our parallel RDBMSs have been doing this forever… • Data scale matters: use MapReduce if you truly have large data sets that are difficult to process using simpler solutions • It’s not always a high-performance solution. Straight Python, simple batch-scheduled Python, and C code can all outperform MR by an order of magnitude or two on a single node for many problems, even for so-called big-data problems
Why NOT MapReduce?
22
• Distributed grep, sort, word frequency • Inverted index construction • PageRank • Web link-graph traversal • Large-scale PDF generation, image conversion • Artificial intelligence, machine learning • Geographical data, Google Maps • Log querying • Statistical machine translation • Analyzing similarities of users’ behavior • Processing clickstream and demographic data • Research for ad systems • Vertical search engine for trustworthy wine information
What is it good for?
23
• Google (proprietary implementation in C++) • Hadoop (open-source implementation in Java) • Disco (Erlang, Python) • Skynet (Ruby) • BashReduce (Last.fm) • Spark (Scala, a functional OO language on the JVM) • Plasma MapReduce (OCaml) • Storm (the Hadoop of real-time processing)
cat a_bunch_of_files | ./mapper.py | sort | ./reducer.py
Flavors of MapReduce
24
• Process data using special map() and reduce() functions • The map() function is called on every item in the input and emits a series of intermediate key/value pairs
• All values associated with a given key are grouped together
• The reduce() function is called on every unique key and its value list, and emits a value that is added to the output
The MR programming model
25
• More formally • Map(k1, v1) -> list(k2, v2) • Reduce(k2, list(v2)) -> list(v2)
The MR programming model
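The whole dataflow can be modeled in a few lines of plain Python. A toy, single-process sketch of the semantics (no distribution, no fault tolerance — just map, group by key, reduce):

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    # Map: (k1, v1) -> list(k2, v2), then group all values by key.
    intermediate = defaultdict(list)
    for record in inputs:
        for k2, v2 in mapper(record):
            intermediate[k2].append(v2)
    # Reduce: (k2, list(v2)) -> one result per unique key.
    return {k: reducer(k, vs) for k, vs in sorted(intermediate.items())}

# Word count expressed in this model:
counts = run_mapreduce(
    ["a b a", "b c"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {'a': 2, 'b': 2, 'c': 1}
```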
26
• Greatly reduces parallel programming complexity • Reduces synchronization complexity • Automatically partitions data • Provides failure transparency
• Practical • Hundreds of jobs every day
MapReduce benefits
27
• Partitions input data • Schedules execution across a set of machines • Handles machine failure • Manages IPC
The MR runtime system
28
• Distributed grep • Map function emits <word, line_number> if a word matches the search criteria
• Reduce function is the identity function • URL access frequency
• Map function processes web logs, emits <url, 1> • Reduce function sums values, emits <url, total>
MR Examples
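The URL access frequency example can be sketched as two plain Python functions; the log format (`<ip> <url>` per line) and the function names here are illustrative assumptions, not Disco API:

```python
from collections import Counter

def access_map(log_line):
    # URL access frequency: emit (url, 1) per request line.
    # Assumes hypothetical '<ip> <url>' log lines.
    ip, url = log_line.split()
    yield (url, 1)

def access_reduce(pairs):
    # Sum the 1s per url: emit (url, total).
    totals = Counter()
    for url, one in pairs:
        totals[url] += one
    return dict(totals)

logs = ["1.1.1.1 /a", "2.2.2.2 /b", "3.3.3.3 /a"]
pairs = [kv for line in logs for kv in access_map(line)]
totals = access_reduce(pairs)   # {'/a': 2, '/b': 1}
```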
29
• Geospatial query processing • Given an intersection, find all roads connecting to it • Rendering the tiles in the map • Finding the nearest feature to a given address
MR Examples
30
• “Learning the right abstraction will simplify your life.” – Travis Oliphant
MR Examples
Program                  Map()                      Reduce()
Distributed grep         Matched lines              pass
Reverse web-link graph   <target, source>           <target, list(source)>
URL count                <url, 1>                   <url, total_count>
Term-vector per host     <hostname, term-vector>    <hostname, all-term-vector>
Inverted index           <word, doc_id>             <word, list(doc_id)>
Distributed sort         <key, value>               pass
31
• The user program, via the MR library, shards the input data
MR Execution 1/8
32
• The user program creates process copies (workers) distributed on a machine cluster.
• One copy will be the “Master” and the others will be worker threads
MR Execution 2/8
33
• The master distributes M map and R reduce tasks to idle workers. • M == number of shards • R == the key space is divided into R parts
MR Execution 3/8
34
• Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs • Output buffered in RAM
MR Execution 4/8
35
• Each worker flushes intermediate values, partitioned into R regions, to disk and notifies the Master process
MR Execution 5/8
36
• Master process gives disk locations to an available reduce-task worker, who reads all associated intermediate data
MR Execution 6/8
37
• Each reduce-task worker sorts its intermediate data, then calls the reduce() function, passing unique keys and their associated values. Reduce function output is appended to the reduce-task’s partition output file
MR Execution 7/8
38
• Master process wakes up user process when all tasks have completed.
• Output contained in R output files.
MR Execution 8/8
39
• An input reader • A map() function • A partition function • A compare function (sort) • A reduce() function • An output writer
Hot spots
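Of these hooks, the partition function is the one most often overridden. A minimal hash-based sketch (the usual default; `zlib.crc32` stands in for a stable hash here, since Python 3's built-in `hash()` is salted per process):

```python
import zlib

def partition(key, num_partitions):
    # Map an intermediate key to one of R reduce partitions.
    # A stable checksum guarantees that every worker sends pairs
    # with the same key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions
```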
40
MR Execution Overview
41
• Fault tolerance • Master process periodically pings workers
• Map-task failure – re-execute » all output was stored locally
• Reduce-task failure – only re-execute partially completed tasks » all output stored in the global file system
MR Execution Overview
42
• Don’t move data to workers… move workers to the data! • Store data on local disks of nodes in the cluster • Start up the workers on the node that has the data locally
• Why? • Not enough RAM to hold all the data in memory • Disk access is slow, but disk throughput is good
• A distributed file system is the answer • GFS (Google File System) (= Big File System) • HDFS (Hadoop DFS) = GFS clone • DDFS (Disco DFS)
Distributed File System
43
• Sequential -> Parallel -> Distributed • Hype after Google published the paper in 2004 • A very narrow set of problems • Big data is a marketing buzzword
Summary for Part I.
44
• MapReduce is a paradigm for distributed computing developed (patented…) by Google for performing analysis on large amounts of data distributed across thousands of commodity computers • The Map phase processes the input one element at a time and returns a (key, value) pair for each element
• An optional Partition step partitions Map results into groups based on a partition function on the key.
• The engine merges partitions and sorts all the map results.
• The merged results are passed to the Reduce phase. One or more reduce jobs reduce the (key, value) pairs to produce the final results.
Summary for Part I (cont.)
45
Writing MapReduce jobs with Disco
46
• Writing MapReduce jobs can be VERY time-consuming • MapReduce patterns • Debugging a failure is a nightmare • Large clusters require a dedicated team to keep them running • Writing a Disco job becomes a software engineering task
• …rather than a data analysis task
Take a deep breath
47
Disco
48
• “Massive data – Minimal code” – by Nokia Research Center • http://discoproject.org • Written in Erlang
• Orchestrating control • Robust, fault-tolerant distributed applications
• Python for operating on data • Easy to learn • Complex algorithms with very little code • Utilize favorite Python libraries
• The complexity is hidden, but…
About Disco
49
• Distributed • Increase storage capacity by adding nodes • Processing on nodes without transferring data
• Replicated • Chunked data stored in gzip-compressed chunks • Tag-based • Attributes • CLI
• $ ddfs ls data:log • $ ddfs chunk data:bigtxt ./bigtxt • $ ddfs blobs data:bigtxt • $ ddfs xcat data:bigtxt
Disco Distributed “filesystem”
50
• Everything is preinstalled • Disco localhost setup: https://github.com/spilgames/disco-development-workflow
Sandbox environment
51
• www.pythonforbeginners.com – by Magnus • Import • Data structures: {} dict, [] list, () tuple • Defining functions and classes • Control flow primitives and structures: for, if, … • Exception handling • Regular expressions • GeoIP, MySQLdb, … • To understand what yield does, you must understand what generators are. And before generators come iterables.
Python – What you’ll need
52
When you create a list, you can read its items one by one; this is called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
...     print i
1
2
3
Python Lists
53
mylist is an iterable. When you use a list comprehension, you create a list, and so an iterable:
>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
...     print i
0
1
4
Python Iterables
54
Generators are iterables, but you can read them only once. This is because they do not store all the values in memory; they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
...     print i
0
1
4
It is just the same except you used () instead of []. But you cannot perform for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and end after calculating 4, one by one.
Python Generators
55
yield is a keyword that is used like return, except the function will return a generator.
>>> def createGenerator():
...     mylist = range(3)
...     for i in mylist:
...         yield i*i
...
>>> mygenerator = createGenerator()
>>> print mygenerator
<generator object createGenerator at 0xb7555c34>
>>> for i in mygenerator:
...     print i
0
1
4
Python Yield
56
• What is the total count for each unique word in the text?
• Word counting is the Hello World! of MapReduce
• We need to write map() and reduce() functions • Map(rec) -> list(k, v) • Reduce(k, v) -> list(res)
• Your application communicates with the Disco API • from disco.core import Job, result_iterator
Your first disco job
57
• Splitting the file (related chunks) into lines • Map(line, params)
• Split line into words • Emit k,v tuple: <word, 1>
• Reduce(iter, params) • Often, this is an algebraic expression • <word, [1,1,1]> -> <word, 3>
Word count
58
• Modules to import • Setting the master host • DDFS • Job() • result_iterator(Job.wait()) • Job.purge()
Word count: Your application
59
def fun_map(line, params):
    for word in line.split():
        yield word, 1
Word count: Your map
60
def fun_reduce(iter, params):
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)
Built-in: disco.worker.classic.func.sum_reduce()
Word count: Your reduce
61
job = Job().run(input=…, map=fun_map, reduce=fun_reduce)
for word, count in result_iterator(job.wait(show=True)):
    print (word, count)
job.purge()
Word count: Your results
62
class MyJob1(Job):
    @classmethod
    def map(self, data, params):
        …
    @classmethod
    def reduce(self, iter, params):
        …
…
MyJob2.run(input=MyJob1.wait())  # <- job chaining
Word count: More advanced
63
• Event tracking & advertising-related jobs • Heatmap: page clicks -> 2D density distributions • Reconstructing sessions • Ad research • Behavioral modeling
• Log crunching • Gameplays per country • Frontend performance (CDN) • 404s, response code tracking • Intrusion detection #security
Disco @ SpilGames
64
• Calculate your resource need estimates • Deploy in workflow • We have
• Git • Package repository / deployment orchestration • Disco-tools: http://github.com/spilgames/disco-tools/ • Job runner: http://jobrunner/ • Data warehouse • Interactive, graphical report generation
Disco @ SpilGames
65
66
CDN log processing
67
• Question • Availability of each CDN provider
• Data source • JavaScript sampler on the client side • Load balancer -> HA logging endpoints -> Access logs -> Disco Distributed FS
CDN Availability monitoring
68
CDN Availability monitoring
69
• Input • URI parsing • /res.ext?v=o,1|e,1|os,1|ce,1|hw,1|c,0|l,1
• Expected output • ProviderO 98.7537% • ProviderE 57.8851% • ProviderC 99.4584% • ProviderL 99.4847%
CDN Availability monitoring
70
# cdnData: "o,1|e,1|os,1|ce,1|hw,1|c,0|l,1" • Parse a log entry • Yield samples
• <o, 1> • <e, 1> • <os, 1> • <ce, 1> • <hw, 1> • <c, 0> • <l, 1>
CDN Availability monitoring (map)
71
def map_cdnAvailability(line, params):
    import urlparse
    try:
        (timestamp, data) = line.split(',', 1)
        data = dict(urlparse.parse_qsl(data, False))
        for cdnData in data['a'].split('|'):
            try:
                cdnName = cdnData.split(',')[0]
                cdnAvailable = int(cdnData.split(',')[1])
                yield cdnName, cdnAvailable
            except:
                pass
    except:
        pass
CDN Availability monitoring (map)
72
Availability of <hw, [1,1,1,0,1,1,1,0,1,1,0,1]> • kvgroup(iter) • The trick:
• samples = […] • len(samples) -> number of all samples • sum(samples) -> number available • A = sum()/len() * 100.0
CDN Availability monitoring (reduce)
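Computed on that sample list, the trick looks like this:

```python
samples = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
# sum() counts the available samples, len() counts all of them.
availability = float(sum(samples)) / len(samples) * 100.0
# 9 available out of 12 samples -> 75.0%
```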
73
def reduce_cdnAvailability(iter, params):
    from disco.util import kvgroup
    for cdnName, cdnAvailabilities in kvgroup(sorted(iter)):
        try:
            cdnAvailabilities = list(cdnAvailabilities)
            totalSamples = len(cdnAvailabilities)
            totalAvailable = sum(cdnAvailabilities)
            totalUnavailable = totalSamples - totalAvailable
            yield cdnName, round(float(totalAvailable) / totalSamples * 100.0, 4)
        except:
            pass
CDN Availability monitoring (reduce)
74
• DDFS • tag://logs:cdn:la010:12345678900 • disco.ddfs.list(tag) • disco.ddfs.[get|set]attr(tag, attr, value)
• Job(name, master).run(input, map, reduce) • partitions = R • map_reader = disco.worker.classic.func.chain_reader • save = True
Advanced usage
75
CDN Performance
95th percentile with per country breakdown
76
• Question • 95th percentile of response times per CDN per country
• Data source • JavaScript sampler on the client side • LB -> HA logging endpoints -> Access logs -> DDFS
• Input • /res.ext?v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1
• Expected output • ProviderN CountryA: 3891 ms CountryB: 1198 ms … • ProviderC CountryA: 3793 ms CountryB: 1397 ms … • ProviderE CountryA: 3676 ms CountryB: 1676 ms … • ProviderL CountryA: 4332 ms CountryB: 1233 ms …
CDN Performance
77
The 95th percentile
The 95th percentile is the value below which 95% of the data points fall; the remaining 5% are above it. 95 is a magic number used in networking because you have to plan for the most-of-the-time case.
78
v=o,1234|l,2345|c,3456&ipaddr=127.0.0.1 • Line parsing is about the same • Advanced key: <cdn:country, performance> • How to get the country from the IP?
• Job().run(…required_modules=["GeoIP"]…) • No global variables within map() – Why?
• Use Job().run(…params={}…) instead • yield "%s:%s" % (cdnName, country), cdnPerf
CDN Performance (map)
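A sketch of such a map function. This is illustrative only: the line format and the `params.geoip` lookup object are assumptions, with the GeoIP handle presumed to be shipped in via `Job().run(params=…)` as suggested above, not the production job:

```python
def map_cdnPerformance(line, params):
    # Hypothetical line format: "<timestamp>,v=o,1234|l,2345&ipaddr=<ip>"
    try:
        timestamp, rest = line.split(',', 1)
        query, _, ipaddr = rest.partition('&ipaddr=')
        samples = query.split('=', 1)[1]            # "o,1234|l,2345"
        # params.geoip is an assumed GeoIP lookup object passed via params.
        country = params.geoip.country_code_by_addr(ipaddr)
        for sample in samples.split('|'):
            cdnName, cdnPerf = sample.split(',')
            yield "%s:%s" % (cdnName, country), int(cdnPerf)
    except Exception:
        pass  # skip malformed lines, as in the availability job
```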
79
# <hw, [123, 234, 345, 456, 567, 678, 798]>
def percentile(N, percent, key=lambda x: x):
    # N must be a sorted list of values; percent is in [0, 1].
    import math
    if not N:
        return None
    k = (len(N) - 1) * percent
    f = math.floor(k)
    c = math.ceil(k)
    if f == c:
        return key(N[int(k)])
    # Linearly interpolate between the two nearest ranks.
    d0 = key(N[int(f)]) * (c - k)
    d1 = key(N[int(c)]) * (k - f)
    return d0 + d1
CDN Performance (reduce)
80
• Outputs • Print to screen • Write to a file • Write to DDFS – why not? • Another MR job with chaining • Email it • Write to MySQL • Write to Vertica • Zip and upload to Spil OOSS
Other goodies
81
1. Question & data source • JavaScript code • Nginx endpoint • Logrotate • (de-personalize) • DDFS load scripts
2. MR jobs 3. Jobrunner jobs 4. Present your results
Steps to get to our Disco landscape
82
• Editing on live servers • No version control • No staging environment • Not using a deployment mechanism • Not using Continuous Integration • Poor parsing • No redundancy for MC applications • Not purging your job • Not documenting your job • Using hard-coded configuration inside MR code
Bad habits
83
• No peer review • Not getting back events from slaves • Using job.wait() • Job().run(partitions=1)
Bad habits cont.
84
• Writing Disco jobs can be easy • Finding the right abstraction for a problem is not… • A framework is on the way -> DRY • You can find a lot of good patterns in SET and other jobs
You successfully took a step towards understanding how to • Process large amounts of data • Solve some specific problems with MR
Summary
85
• Ecosystems • DiscoDB: lightning-fast key->value mapping • Discodex: disco + ddfs + discodb
• Disco vs. Hadoop • HDFS, Hadoop ecosystem • NoSQL result stores
Bonus: Outlook
Questions?
87
• Presentation can be found at: http://spil.com/discoworkshop2013
• You can contact me at: [email protected]
Thank you!