CM40212: Internet Technology - University of Bathjap/cm40212/slides/05-handout.pdf · Does not...
Transcript of CM40212: Internet Technology - University of Bathjap/cm40212/slides/05-handout.pdf · Does not...
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
CM40212: Internet Technology
Julian PadgetTel: x6971, E-mail: [email protected]
Distributed Progamming Patternsand Middleware
November 1, 2011
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 1 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
Outline
1 Web programmingAppletsServletsCookiesAJAX
2 Distributed programmingMIMDSIMD/SPMD
3 Programming frameworksMPIMap ReduceMemcachedHADOOP
4 Summary
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 2 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
Objectives
Establish scope of programming task: not a singlethread, not a single computer
Several complementary frameworks—achievesimilar/different goals... but just technology
Better to understand programming models and theirstrengths and weaknesses
ExpressivenessSecurityDebuggability(?) and maintainabilityReliability!
Task: to recognize the appropriate abstraction to use(or invent)
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 3 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
The coding space
browser
applet
webserver
servlet
databasesetc.
or? and?
laptop/desktop
Condorpool
NationalGrid
Service
LocalCluster
ComputeCloud
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 4 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming
Content
1 Web programmingAppletsServletsCookiesAJAX
2 Distributed programming
3 Programming frameworks
4 Summary
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 5 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming
Places to put code
In the browser: applets
In the server: servlets
In the back-end: Perl, Python, PHP, Ruby-on-Rails
Back-end frameworks: Ajax (Backbase, Dojo, Spry)
browser
applet
webserver
servlet
databasesetc.
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 6 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Applets
Applets
Applets can use the browser API to:
Received notification of milestones
Load data files (specified relative to the URL of theapplet or the page in which it is running)
Display status strings
Make the browser display a document
Find other applets running in the same page
Play sounds
Access parameters specified by the user in the<APPLET> tag
Constrained environment for security
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 7 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Applets
The Applet Interface
There are four important Applet methods that enable asubclass to handle major events:
1 init: To initialize the applet each time it’s loaded (orreloaded)
2 start: To start the applet’s execution, such as whenthe applet is loaded or when the user revisits a pagethat contains the applet
3 stop: To stop the applet’s execution, such as when theuser leaves the applet’s page or quits the browser
4 destroy: To perform a final cleanup in preparation forunloading
For example:public class Simple extends JApplet {
. . .
public void init() { . . . }
public void start() { . . . }
public void stop() { . . . }
public void destroy() { . . . }
. . .
}
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 8 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Applets
Security Restrictions
Applets pose a security risk: downloaded code (fromwhere?) is executed within the browser on clientmachineDifferent browsers have different security policies andpolicies change, so cannot hard-code...Specifically:
Cannot load libraries or define native methodsCannot (normally) read or write files on the clientCannot make network connections except to theoriginating hostCannot start any program on the clientCannot read arbitrary system propertiesApplet created windows must look different
SecurityManager object on browser implements securitypolicies: throws a SecurityException when violation isdetected. Applet can catch the SecurityException andprocess it
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 9 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Applets
Applet Capabilities
The java.applet package API allows:
Creation of network connections to originating hostDisplay of HTML documents in the same browserInvocation of public methods of other applets on thesame pageLocally-loaded applets are not subject to the samerestrictions
Circumventing the restrictions:
Use a server application on the applet’s hostAllows saving files—on the applet host rather than theclientCommunicate over sockets
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 10 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Servlets
Servlets I
Supersedes Common Gateway Interface (CGI) scripts
servlet is a Java class for extending serverscapabilities via a request-response programming model
Servlets run in a container on the host: insulatesservlets from one another and the host from them
All servlets must implement the servlet interface
The servlet life cycle:
1 An incoming request is mapped to a servlet. If there isno servlet instance:
Load the servlet classCreate an instanceInitialize instance via init method
2 Invoke the service method3 Servlet removal achieved via destroy method (includes
finalization)
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 11 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Servlets
Servlets IIContainer typically creates a thread for each request:can ensure at most one by implementingSingleThreadModel interface
Otherwise must use synchronized methods and/orsynchronized statements:
public void addName(String name) {
synchronized(this) {
lastName = name;
nameCount++;
}
nameList.add(name);
}
Does not prevent concurrent access to static variablesor external objects
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 12 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Servlets
Sessions
Client state—because HTTP is stateless—e.g. forshopping carts:
Interacts with cookies (q.v.)Session is attribute of HttpServletRequestProvides simple key/value association mechanism
HTTP client cannot signal session is ended, so use atime-to-live counter
If cookies are disabled, session is encoded in URL
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 13 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Cookies
Cookies I
Invented to solve problem of persistent data
RFC2965, HTTP State Management Mechanism,D.Kristol, L.Montulli, October 2000, ftp://ftp.rfc-editor.org/in-notes/rfc2965.txt
Defines three new headers, Cookie, Cookie2, andSet-Cookie2, to carry state information betweenparticipating origin servers and user agents:
1 User Agent → ServerPOST /acme/login HTTP/1.1
[form data]
User identifies self via a form.2 Server → User Agent
HTTP/1.1 200 OK
Set-Cookie2: Customer="WILE_E_COYOTE"; Version="1"; Path="/acme"
Cookie reflects user’s identity.
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 14 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming Cookies
Cookies II3 User Agent → Server
POST /acme/pickitem HTTP/1.1
Cookie: $Version="1"; Customer="WILE_E_COYOTE"; $Path="/acme"
[form data]
User selects an item for ”shopping basket”.4 Server → User Agent
HTTP/1.1 200 OK
Set-Cookie2: Part_Number="Rocket_Launcher_0001"; Version="1";
Path="/acme"
Shopping basket contains an item.5 and so on...
Minimum implementation requirements:At least 300 cookiesAt least 4096 bytes per cookieAt least 20 cookies per unique host or domain name
Applications should use as few and as small cookies aspossible, and they should cope gracefully with the lossof a cookie.
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 15 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Applets
Servlets
Cookies
AJAX
Distributedprogramming
Programmingframeworks
Summary
Web programming AJAX
Asynchronous JavaScript and XML
Main feature: allows web applications to retrieve datafrom the server asynchronously in the backgroundwithout interfering with the display and behavior of theexisting page.
Uses XMLHttpRequest object or Remote Scripting
Does not require Javascript, XML or asynchronismPros:
Reduction in client-server traffic: only send changes, notfresh pagesBetter (perceived) response timeReduction in server connections
Cons:Dynamically-created pages 6↔ browser “back” buttonDifficulty in book-markingDifficulty with indexingAccessibility restrictions—e.g. Javascript disabledAbsence of a standard
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 16 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming
Content
1 Web programming
2 Distributed programmingMIMDSIMD/SPMD
3 Programming frameworks
4 Summary
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 17 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming
Places to put code
Distribute function across network
Distribute data across network
Control: synchronous or asynchronous
laptop/desktop
Condorpool
NationalGrid
Service
LocalCluster
ComputeCloud
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 18 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming
Concurrency and Parallelism
Programs components can appear to, or actuallyexecute at the same time new ways for makingmistakes:
1 Synchronization:
Deadlock: suspended waiting for resourcesLivelock: repeatedly checking for resources
2 Race conditions3 Non-determinism
Flynn’s classification identifies four forms:
SISD =Single Instruction
Single Datum
SIMD =Single Instruction
Multiple Data
MISD =Multiple Instruction
Single Datum
MIMD =Multiple Instruction
Multiple Data
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 19 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming
Basic concurrency concepts
What makes it different from sequential computation?
non-determinism:
computations proceed at different ratescommumnications vary in speed and reliability
communication: channels, messages, shared variablesinterference: contention for resourcesatomicity: several (interdependent) steps (must) appearas one and succeed, or fail and all be undone (c.f.finally etc.)cooperation vs. pre-emption: voluntary ceding ofcontrol (suspend) or time-slicingcoordination: mechanisms to balance cooperation orcompetition between tasks while limiting interference
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 20 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming
Concurrency patterns
Concurrent programs frequently follow one or a mixtureof a few well-known patterns:
producer/consumer: one thread generates data foranother – communication channel becomes source ofcontentionpipeline: general case of producer/consumermaster/slave: break task into independent parts anddistribute, collect and combine resultsdivide and conquer: general case of above, whereslaves become masters of smaller parts
Essential coordination aspect is access to shared data:
Aim: only one task can change data at one timeBut: many can access data – consistency problemSolution: mechanisms for mutual exclusion
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 21 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming MIMD
Object spaces
Low-level synchronization mechanisms are (very)difficult to use safely
(A) Solution: Decouple sender and receivertemporally – need not co-existspatially – need no information about each other
Origins:Linda [Carriero and Gelernter, 1989]Chemical Abstract Machine (CHAM)[Berry and Boudol, 1990]Unity [Chandy and Misra, 1988]
Conceptually:
message/tuplepool
T1
T2Tn−1
Tn
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 22 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming MIMD
Object space operations
Operations in Linda (cf. Javaspaces[Freeman et al., 1999]):
in(pattern): where pattern = tag(elt1,...,eltn),tag is a literal, elt is constant or a variableaction: match pattern against tuples in pool, select one(if exists) and remove, otherwise blockout(tagged tuple): constructs tuple and stores inpoolrd(pattern): as input but fails instead of blocking ifno match. ≡ testing a semaphore before waiting...equally useless (race condition).eval(expression): creates new taks to evaluateexpression, result is written to pool.
Conceptually attractive, easy to learn, easy to use, butcan be problematic in practice.
Javaspaces: http:
//java.sun.com/developer/Books/JavaSpaces
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 23 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
MIMD
SIMD/SPMD
Programmingframeworks
Summary
Distributed programming SIMD/SPMD
Data Parallelism
SIMD: single instruction multiple data; requires specialhardware (or data abstraction)
History:
vector processing – 1970s, 80s, 90s. Manufacturers:Contol Data Corporation (CDC), Cray. Vectorextensions to FORTRAN.array processors – 1980s, 90s. Manufacturers: ICL(Distributed Array Processor), Thinking MachinesConnection Machines, Maspar. Architecture: thousandsof small processors (e.g. 1 bit ALU, 16K memory),local+global connection network. Array extensions toFORTRAN.Languages: C*, FORTRAN*, *Lisp
SPMD: single program multiple data; can be realized ona Condor pool, commodity hardware cluster or computecloud
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 24 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks
Content
1 Web programming
2 Distributed programming
3 Programming frameworksMPIMap ReduceMemcachedHADOOP
4 Summary
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 25 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI summary I
De facto standard for SPMD programming
Point-to-point and collective communications
MPI belongs in layers 5 and higher (session,presentation, application) of the OSI Reference Model
Implementations cover most layers of the referencemodel, with socket and TCP being used in thetransport layer
Functionality:
Virtual topologySynchronizationCommunication
between a set of processes that have been mapped tonodes/servers/computer instances
API extends several widely-used languages
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 26 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI summary IIFeatures:
Point-to-point rendezvous-type send/receiveoperations—synchronous, asynchronous, buffered, andready formsChoice of Cartesian or graph-like logical processtopologyExchange of data between process pairs (send/receiveoperations)Combination of partial results of computations(gathering and reduction operations)Synchronization of nodes (barrier operation)Environmental enquiry operations (number of processesin session, processor identity, neighbouring processes,etc.)
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 27 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
Primary user is e-Science (teraflops, terabytes)
Distributed systems + small programs = slow
Message overheads > gains from parallelism
MPI supports SPMD + message-passing
Functionality:
library of messaging functionsworks with (unmodified) C, Fortran, etc.even Java, Python and othersMPI is a standard not an implementation
Example:
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 28 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
1 #include <stdio.h>2 #include <mpi.h>3
4 int main(int argc, char **argv)5 {6 int rc, myrank, nproc, namelen;7 char name[128];8
9 rc = MPI_Init(&argc, &argv);10 if (rc != MPI_SUCCESS) {11 printf ("Error starting MPI program\n");12 MPI_Abort(MPI_COMM_WORLD, rc);13 }14
15 MPI_Comm_rank(MPI_COMM_WORLD, &myrank);16 MPI_Comm_size(MPI_COMM_WORLD, &nproc);17
18 if (myrank == 0) {19 printf("main reports %d procs\n", nproc);20 }21
22 namelen = 128;23 MPI_Get_processor_name(name, &namelen);24 printf("hello world %d from ’%s’\n", myrank);25
26 MPI_Finalize();27 return 0;28 }
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 29 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
Notes:
MPI Init(&argc, &argv); – system set up; createsall the network connections;
rc – check it workedMPI COMM WORLD – system can be divided into subsetsof processors called communicators:
The WORLD communicator is all processorsMPI COMM SELF refers to the calling processor
MPI Comm rank – each process in a communicator has aunique rank [0, (size of the communicator −1)]
For WORLD the rank is 0 to (# of processors −1)
MPI Comm size – Size of the communicator
if (myrank == i) – All processors run the same code(SPMD). This is how to get a kind of MIMD
MPI Finalize – Tidy up. Forces receipt of messages.Establishes a barrier synchronization.
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 30 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
A basic problem is how to get data from one processorto another
Processor A sends data (integers, floats, strings, etc.)to B
A uses a send function, and B uses a receive function
1 int n[5];2 ...3 if (myrank == 0) {4 MPI_Send(n, 5, MPI_INT, 1, 99, MPI_COMM_WORLD);5 }6 else if (myrank == 1) {7 MPI_Status stat;8 MPI_Recv(n, 5, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);9 }
This supposes A has rank 0, B rank 1
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 31 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
A sends
n – A pointer to a memory location containing thedata; can be a single variable or an array of values
5 – The number of items to send
MPI INT – The type of the items
1 – The rank of the destination
99 – An integer tag to help identify a particularcommunicatoin in the message logs...
MPI COMM WORLD – The communicator
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 32 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
B receives
n – A pointer to a memory location from where to readthe data
5 – The number of items to read
MPI INT – The type of the items
0 – The rank of the source
99 – The tag on the message receiver wants: useMPI ANY TAG otherwise
MPI COMM WORLD – The communicator
stat – A structure containing the status of thetransfer, in particular the source and tag; and the errortype in case of an error
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 33 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
Types include:
MPI CHAR, MPI SHORT, MPI INT, MPI LONG,MPI FLOAT, MPI DOUBLE, MPI BYTE
MPI Send and MPI Recv are blocking operations:
MPI Send waits until the data has been sentThe buffer n in A can be safely reused when MPI Send
returnsBut the data may not have reached or been read by B
Similarly, MPI Recv waits until the all data is copied
This provides weak synchronisation between A and B
But asynchronism requires careful programming
Messages take a significant time to be transmitted
A send and a receive are not simultaneous
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 34 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
Other communication operators include:
MPI Ssend – Waits until the destination has started toreceive the message
MPI Isend – Send, but do not wait – how soon canbuffer be (re-)used?
MPI Irecv – Indicate a buffer for data storage, but donot wait for it; the data transferred asynchronously
MPI Wait – Block until a given non-blocking send orrecv has completed
And lots more
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 35 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
Simple synchronisation: MPI Barrier(MPI Comm
comm);
This blocks until all the processes in the communicatorhave reached the barrier
Properties of messaging:
MPI messages are in order, but not fair:
A sends M1 then M2 to BB receives M1, then M2
because messages from one source to the samedestination do not overtake each other
However:a prior (but unread) message from A to B may beovertaken by a later message from C to Bthere is no guarantee of order on messages fromdifferent sources
“not fair” means “not guaranteed fair”
Normally events happen as expected, but not always
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 36 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
Send and receive are for point-to-point messages: onesource and one destination
There are several others:
broadcastscattergatherreducescan
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 37 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
MPI_Bcast(void* buffer, int count,
MPI_Datatype datatype, int root,
MPI_Comm comm);
Data sent from process with rank root to all otherprocesses (in the communicator)
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 38 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
All processes, including the receivers, should callMPI Bcast with the same value for root
int n[2];
if (myrank == 1) {
n[0] = 23;
n[1] = 42;
}
...
MPI_Bcast(n, 2, MPI_INT, 1, MPI_COMM_WORLD);
All processes will now have the same values for theircopies of n
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 39 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
MPI_Scatter(void* sendbuf,int sendcount,
MPI_Datatype sendtype, void* recvbuf,
int recvcount, MPI_Datatype recvtype,
int root, MPI_Comm comm);
Takes the data sendbuf in processor with rank root, andsends sendcount items from the array to every otherprocessor (and to itself) to be stored in recvbuf:
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 40 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
MPI_Gather(void* sendbuf, int sendcount,
MPI_Datatype sendtype, void* recvbuf,
int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm);
Takes sendcount elements of data sendbuf from eachprocessor and puts them in the array recvbuf onprocessor root
MPI Gather is the inverse of MPI Scatter
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 41 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
MPI_Reduce(void* sendbuf, void* recvbuf, int count,
MPI_Datatype datatype, MPI_Op op, int root,
MPI_Comm comm);
Applies a reduction operation op to each value insendbuf, putting the result(s) into recvbuf onprocessor root
Operations include: max, min, +, *, ∧, ∨Can also use programmer-define reduction operators
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 42 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
MPI_Scan(void* sendbuf, void* recvbuf,
int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm);
A prefix scan of the source sendbuf
Processor of rank i gets the reduction of values fromprocessors 0 . . . i stored in its recvbuf
Prefix scans are very useful in parallel algorithms
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 43 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
Need to think carefully about messages to getmaximum efficiency
For example, reduce latency through non-blockingoperations
Compute something else while a receive operationcompletes, then go backRequires careful programmingOverlapping communication and computation is a good,but delicate
Also:
messaging has a high overhead: needs (very) largeprogramshard to program effectively: needs experiencelacks dynamism: processor count is fixed and cannotchange during execution
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 44 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks MPI
MPI
MPI has succeeded for many reasons
An open standard, inviting several competingimplementations
Thus implementations are optimised and efficient
MPI is simple in concept, so straightforward to program(not necessarily easy to program. . . )
MPI is flexible as it contains lots of kinds ofcommunication
MPI is supported by many languages and environments
MPI scales well to very large problems
The MPI standard is still being developed and updated
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 45 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks Map Reduce
The Google problem
Motivation: Large Scale Data Processing
Many tasks: Process lots of data to produce other data
Want to use hundreds or thousands of CPUs... but thisneeds to be easy
MapReduce provides:
Automatic parallelization and distributionFault-toleranceI/O schedulingStatus and monitoring
Content adapted fromhttp://labs.google.com/papers/mapreduce.html, retrieved
20081026.
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 46 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks Map Reduce
The MapReduce model I
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) ->
list(out_key, intermediate_value)
Processes input key/value pairProduces set of intermediate pairs
reduce (out_key, list(intermediate_value)) ->
list(out_value)
Combines all intermediate values for a particular keyProduces a set of merged output values (usually justone)
Inspired by similar primitives in LISP and otherlanguages
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 47 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks Map Reduce
The MapReduce model II
Fine granularity tasks: many more map tasks thanmachines
Minimizes time for fault recoveryCan pipeline shuffling with map executionBetter dynamic load balancing
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 48 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks Map Reduce
The MapReduce model IIIOften use 200,000 map/5000 reduce tasks w/ 2000machines
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 49 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks Map Reduce
Fault tolerance + Redundancy
When processes break: re-executeDetect failure via periodic heartbeatsRe-execute completed and in-progress map tasksRe-execute in progress reduce tasksTask completion committed through master
Master failure: not consideredRobust: task completed even when 1600 out of 1800machines failedSlow workers significantly lengthen completion timeSolution: Near end of phase, spawn backup copies oftasks—whichever one finishes first “wins”Effect: Dramatically shortens job completion timeRewrote Google’s production indexing system into 24MapReduce operationsStatistics: #jobs = 29,423, avg. duration = 634secs,total machine time = 79,186days, input data =3,288TB
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 50 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks Memcached
memcached
High-performance, distributed memory object cachePurpose: speed up dynamic web applications byreducing database loadFunctionality: a giant hash table distributed acrossmultiple machinesBehaviour: when the table is full, subsequent insertscause older data to be purged in least recently used(LRU) orderUse case: developed by Danga Interactive forLiveJournal.com:
Site statistics: 20 million+ dynamic page views per dayfor 1 million usersReduced database load close to 0; faster page loadtimes; better resource utilization, and faster access todatabases on a memcache miss.
Summary athttp://en.wikipedia.org/wiki/Memcached
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 51 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks HADOOP
HADOOP
Distributed processing of large data sets
On clusters of computers
Simple programming model
Scalable
Reliability model in software high-availability onunreliable hardwareComponents:
Avro: A data serialization system.Cassandra: A scalable multi-master database with nosingle points of failure.HBase: A scalable, distributed database that supportsstructured data storage for large tables.Hive: A data warehouse infrastructure that providesdata summarization and ad hoc querying.Mahout: machine learning and data mining library.Pig: A high-level data-flow language and executionframework for parallel computation.
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 52 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
MPI
Map Reduce
Memcached
HADOOP
Summary
Programming frameworks HADOOP
HADOOP users
Facebook: We use Hadoop to store copies of internal logand dimension data sources and use it as a source forreporting/analytics and machine learning.Currently we have 2 major clusters:
A 1100-machine cluster with 8800 cores and about 12PB raw storage.
A 300-machine cluster with 2400 cores and about 3 PBraw storage.
Each (commodity) node has 8 cores and 12 TB ofstorage.
We are heavy users of both streaming as well as theJava APIs. We have built a higher level datawarehousing framework using Hive
http://wiki.apache.org/hadoop/PoweredBy
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 53 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
Summary
Content
1 Web programming
2 Distributed programming
3 Programming frameworks
4 Summary
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 54 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
Summary
Summary
Web programming is a special case of distributedprogramming
Large number of tools/frameworks—best learnt on need
Basic principle: code in browser + code in server +communication
Distributed programming in general is hard to debugand hard to get right
Fortunately many problems are embarassingly paralleland—with the right abstractions—can be mappedreliably to stock hardware
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 55 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
Summary
Resources I
“Never believe one source”
Java applets tutorial:http://java.sun.com/docs/books/tutorial/
deployment/applet/appletsonlyindex.html
Java servlets tutorial http://java.sun.com/j2ee/tutorial/1_3-fcs/doc/Servlets.html
Java threads tutorial:http://java.sun.com/docs/books/tutorial/
essential/concurrency/index.html
Message-passing interface:http://www.open-mpi.org/
Tuple spaces:http://en.wikipedia.org/wiki/Tuple_space
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 56 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
Summary
Resources IIJava spaces: http:
//java.sun.com/developer/Books/JavaSpaces
memcached:http://en.wikipedia.org/wiki/Memcached andhttp://www.danga.com/memcached/
RFC2965, HTTP State Management Mechanism,D.Kristol, L.Montulli, October 2000, ftp://ftp.rfc-editor.org/in-notes/rfc2965.txt
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 57 / 58
CM40212: InternetTechnology
Julian Padget
Web programming
Distributedprogramming
Programmingframeworks
Summary
Summary
References
Berry, G. and Boudol, G. (1990).The chemical abstract machine.In POPL ’90: Proceedings of the 17th ACM SIGPLAN-SIGACTsymposium on Principles of programming languages, pages 81–94, NewYork, NY, USA. ACM Press.
Carriero, N. and Gelernter, D. (1989).Linda in context.Commun. ACM, 32(4):444–458.
Chandy, K. and Misra, J. (1988).Parallel Programming Design: A Foundation.Addison-Wesley.ISBN 0-201-05866-9.
Freeman, E., Hupfer, S., and Arnold, K. (1999).JavaSpaces Principles, Patterns, and Practice.Pearson Education.ISBN: 0201309556,http://java.sun.com/developer/Books/JavaSpaces/.
Julian Padget (CS/Bath) CM40212: Internet Technology November 1, 2011 58 / 58