A multi-tool in computing clouds: Tuple Space
Joerg Fritsch,
School of Computer Science & Informatics
Cardiff University, 16 January 2014
A multi-tool in computing clouds: Tuple Space
• Key themes: parallelization, shared nothing, Challenging Data (aka “Big Data”)
• Tuple Space: the multi-tool
• Use Case (1): Overcoming limitations of tier-based architectures
• Use Case (2): In-stream processing of Big Data
• Some miscellaneous remarks
Agenda
• Eventually everything is about scalability.
• Scalable software: make use of 1000s of cores
– Distribution
– Decomposition & modularity
– Coordination
• Data does not fit in main memory
– Distribution
– Stream processing
• Need for speed: reduce time complexity
The why’s and how’s
Key Themes: Parallelization
• Clouds will need to support scalable programs.
• Many programs have to parallelize relatively small computations with high inter-dependency.
• “Any” application is scaled through distribution over parallel (multicore) hardware.
• Everything “inside a cloud” is physically distributed (data, processing).
• Large scale distributed processing. “Many Core”.
Key Themes: Shared Nothing
• Synchronization = shared “something”, for example memory, disk, data(base)
• Asynchronous = “shared nothing”
• Avoid synchronization issues
• Abstract multithreading and parallelization issues away from the developer, e.g. the actor model
• Highly scalable! – for example Erlang
Challenging Data (aka “Big Data”)
• Data in computing clouds is challenging
• 3V Data (Gartner, 2001): Volume, Variety, Velocity
• Volume: perceived as “Big”
– Hadoop & traditional RDBMSs often similar in data volume
– Differ in number of nodes (proportional to no. of cores)
– Analytics
• Variety: unstructured data, data mashups
– Hadoop does not cast into schemes, rows, cols
• Velocity: streams
Challenging Data (aka “Big Data”), cont’d
• Batch tasks are the prevailing computational model:
– Map Reduce
– Computation over “offline” data sets (on disks)
– Parallelized polynomial time: N^m/k
• Stream processing is catching up:
– Operating on real-time data
– N log(N) time
– You only get “one shot”
– In-memory data structures (e.g. Redis, Memcached)
– Examples: Storm project, AWS Kinesis, Apache S4
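The contrast between the two models can be sketched in a few lines of Python (an illustrative example, not from the slides; the event data and function names are invented):

```python
# Batch vs. stream on the same data: a batch job may scan the complete
# "offline" data set as often as it likes; a stream processor sees each
# tuple exactly once and keeps only in-memory state ("one shot").
from collections import Counter

events = ["click", "view", "click", "buy", "view", "click"]

def batch_count(data):
    """Batch model: the whole data set is available at once."""
    return Counter(data)

def stream_count(stream):
    """Stream model: consume tuples one at a time, single pass only."""
    state = Counter()            # in-memory state (cf. Redis, Memcached)
    for event in stream:
        state[event] += 1
        yield dict(state)        # emit a running result per tuple

final_batch = batch_count(events)
*_, final_stream = stream_count(iter(events))   # keep only the last result
print(final_batch["click"], final_stream["click"])  # both count 3 clicks
```

Both arrive at the same counts here; the difference is that the stream version never needed the full data set in memory, only its running state.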
• Tuples are key-value pairs
• Tuple Space acting as Distributed Shared Memory (DSM)
• Four primitives to manipulate and store tuples: rd(), in(), out(), eval()
• No schema, ideal for unstructured data
• Tuples matched using associative lookup
• Associative lookup generally very powerful: CAM table/Routing, Data Flow programming & processors
• Commercial use as in-memory Data Grids
Tuple Space, Gelernter (1985)
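A minimal sketch of the four primitives, assuming a Linda-style template match where None acts as a wildcard (the TupleSpace class and all names here are illustrative, not a real library):

```python
# Minimal in-memory sketch of the four Linda primitives: out(), rd(),
# in() (spelled in_, since "in" is a Python keyword) and eval().
import threading

class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Write a tuple into the space."""
        with self._cond:
            self._tuples.append(tuple(tup))
            self._cond.notify_all()

    def _match(self, template, tup):
        # Associative lookup: same arity, None fields match anything.
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup))

    def rd(self, template):
        """Blocking read: return a matching tuple, leave it in the space."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(template, tup):
                        return tup
                self._cond.wait()

    def in_(self, template):
        """Blocking take: return a matching tuple and remove it."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(template, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

    def eval(self, fn, *args):
        """Spawn a computation whose result is written back as a tuple."""
        threading.Thread(target=lambda: self.out(fn(*args))).start()

ts = TupleSpace()
ts.out(("job", 1, "pending"))
print(ts.rd(("job", None, None)))   # -> ('job', 1, 'pending')
print(ts.in_(("job", 1, None)))     # same tuple, now removed
```

Note how rd() and in() block until a matching tuple appears, which is exactly what decouples producers and consumers in time.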
• Loose coupling
• Decoupled in
– Time
– Space
– Synchronization
• Distributed shared memory (DSM) vs distributed memory (“like MPI”)
Tuple Space, cont’d
• In memory key-value store, can be persistent across system reboot
• No schema
• Keys matched with glob-style patterns in O(1) time
– Good-enough implementation of associative lookup
• Binary safe
• Other key-value stores may be equally suitable, with different advantages/disadvantages:
– Distributed hash tables (DHT)
– Memcached: distribution
– Dynamo: present as an application service in AWS
Redis Key-Value Store as Tuple Space
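The glob-style matching can be illustrated with Python’s fnmatch module over a plain dict standing in for the store; the pattern syntax (`*`, `?`, `[...]`) is the same one Redis accepts for KEYS/SCAN MATCH. The keys and the lookup helper here are invented for illustration:

```python
# Glob-style associative lookup over a dict that stands in for a
# key-value store. Keys encode structure ("sensor:<id>:<field>") so that
# patterns select whole families of tuples at once.
from fnmatch import fnmatchcase

store = {
    "sensor:1:temp": "21.5",
    "sensor:1:humidity": "40",
    "sensor:2:temp": "19.0",
}

def lookup(pattern):
    """Return all (key, value) pairs whose key matches the glob pattern."""
    return {k: v for k, v in store.items() if fnmatchcase(k, pattern)}

print(lookup("sensor:*:temp"))   # all temperature readings
print(lookup("sensor:1:*"))      # everything from sensor 1
```

This scan is linear in the number of keys; it illustrates only the matching semantics, not the performance claim on the slide.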
• Coordination vs Threading
• Composition happens outside of the worker or agent code
– FPLs: composition and currying outside of functions
– Stream processing and composition of kernels
– Unix pipes: application_1 | application_3 | application_2
– Pipes/(message queues) represent the dataflow graph
• Error handling?
– What happens to the mutable state if app_3 (or any of the kernels) fails?
Coordination Language LINDA, Gelernter (1992)
app_1 → app_3 → app_2 → std_out
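The pipe diagram above can be sketched with thread workers wired together by queues, so that the dataflow graph lives outside the worker code (the names app_1/app_3/app_2 follow the slide; everything else is an illustrative assumption):

```python
# Linda-style coordination sketch: each worker is a plain function of
# input -> output; the composition (app_1 | app_3 | app_2) is wired up
# with queues entirely outside the worker code, like Unix pipes.
import queue
import threading

SENTINEL = object()  # end-of-stream marker

def worker(fn, inbox, outbox):
    """Generic agent: apply fn to each item flowing through."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))

app_1 = lambda x: x + 1
app_3 = lambda x: x * 2
app_2 = lambda x: f"result={x}"

# The dataflow graph lives here, not inside the workers.
q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
for fn, src, dst in [(app_1, q0, q1), (app_3, q1, q2), (app_2, q2, q3)]:
    threading.Thread(target=worker, args=(fn, src, dst), daemon=True).start()

for x in [1, 2, 3]:
    q0.put(x)
q0.put(SENTINEL)

results = []
while (item := q3.get()) is not SENTINEL:
    results.append(item)
print(results)  # ['result=4', 'result=6', 'result=8']
```

The error-handling question on the slide shows up immediately here: if app_3 raised mid-stream, any state it held would be lost, and nothing in the queues alone would replay its inputs.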
• Not enough expressive power! (for complex coordination)
• Ways to make it more expressive:
– Algebra of Communicating Processes (ACP)
– ACP generally quite suitable for streams, clicks, GUIs, dataflow programming
– Constraint Handling Rules (CHRs)
– Agents Communicating through Logic Theories (ACLT), Omicini et al. (1995), Denti et al. (1998)
• For example: Barrier (i.e. MPI_barrier)/Eureka conditions, Turing powerful implementation
Coordination Language LINDA, cont’d
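The barrier/eureka example might be sketched as follows, with a dict guarded by a condition variable standing in for the tuple-space counter (all names are illustrative):

```python
# Barrier sketch in the spirit of MPI_barrier on a tuple space: each of
# n workers takes the counter tuple, decrements it, and puts it back;
# the "eureka" condition fires when the counter reaches zero and all
# workers proceed together.
import threading

space = {("barrier", "count"): 3}   # three parties must arrive
cond = threading.Condition()
order = []

def arrive(name):
    with cond:
        space[("barrier", "count")] -= 1        # in() + out() of the counter
        cond.notify_all()
        while space[("barrier", "count")] > 0:  # wait for the eureka condition
            cond.wait()
    order.append(name)                          # past the barrier

threads = [threading.Thread(target=arrive, args=(f"w{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(order))  # ['w0', 'w1', 'w2'] -- all three passed the barrier
```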
• Database, data grid
– No schema
• Key/value store
• Extension to programming languages
– Without adaptation, not Turing-powerful
• Message bus, message queue
• Means of coordination
– Workers, agents, skeletons
• Memory virtualization
– Extension of main memory across physical boundaries
Recap: Tuple Space is like a(n) …
Use Case (1 of 2): Overcoming limitations of tier-based architectures
• Concept has been around since 1998
• Costly serialization (of data) is required at every system boundary → latency!
• Often depicted with three simple tiers: web server, application server and data(base)
• Many more devices & protocols involved: redundant load balancers, spanning tree, etc.
Tier-based architectures
• To date: not many alternatives
• Space based architectures
– GigaSpaces
– TIBCO ActiveSpaces
• Notion of a one-stop shop
– Networks: L2 Ethernet fabrics
– Networks: integrated packet processing
• Nobody wants to hit a spindle!
– In-memory computing
Tier-based architectures (alternatives)
The end of Tier-based architectures
Source: http://wiki.gigaspaces.com
The end of Tier-based architectures (cont’d 1)
Source: http://wiki.gigaspaces.com
The end of Tier-based architectures (cont’d 2)
Space-based cloud platform:
– No tiers
– Implicit load balancing
– Harmonization of messaging, data and coordination
Traditional tier-based cloud platform
Use Case (2 of 2): In-stream processing of Big Data
“More programmer-friendly parallel dataflow languages await discovery, I think. Map Reduce is one (small) step in that direction.”
Jeff Hammerbacher, Engineer-to-Engineer Lectures, June 2010
• Stream
– An unbounded sequence of tuples
• Map Reduce excels at ad-hoc queries, but is no fit for recursion and hence for machine learning (ML)
• Error-resilient: stateful stream processing
– Redis supports transactions
– Tuple space can contain global mutable state
• Tuple vs Batch / Fine grain vs coarse grain
Stream processing of 3V Data
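A hedged sketch of the error-resilient, stateful processing described above: the running state and the last processed offset are checkpointed together, so after a crash the stream can be replayed from the checkpoint instead of from scratch. The checkpoint dict stands in for global mutable state kept in a tuple space or Redis (where state and offset could be updated in one transaction); all names are invented:

```python
# Stateful stream processing with checkpoint-and-replay. State and
# offset always move together, so recovery never double-counts a tuple.
stream = [("click", 1), ("view", 1), ("click", 1), ("buy", 1)]
checkpoint = {"offset": 0, "counts": {}}

def process(stream, checkpoint, crash_at=None):
    """Consume tuples starting at the checkpointed offset."""
    for i in range(checkpoint["offset"], len(stream)):
        if i == crash_at:
            raise RuntimeError("worker died")   # simulated failure
        kind, n = stream[i]
        checkpoint["counts"][kind] = checkpoint["counts"].get(kind, 0) + n
        checkpoint["offset"] = i + 1            # state and offset move together

try:
    process(stream, checkpoint, crash_at=2)    # crash mid-stream
except RuntimeError:
    pass
process(stream, checkpoint)                    # replay from the checkpoint
print(checkpoint["counts"])  # {'click': 2, 'view': 1, 'buy': 1}
```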
• Redis has a built-in Lua interpreter to manipulate data
• Commercial tuple spaces are mostly “reactive”
• Context-based recursion on the portion of data that is in memory (aka “granularity”)
(Reactive) In-Memory Tuple Space
Tuple space architecture for in-stream processing of Big Data
Commonalities
FPLs & Flow-based Programming (Johnston, 2004):
– Immutable data, shared nothing
– Freedom from side effects
– Locality of effects
– Lazy evaluation
– Data dependency equivalent to scheduling
FPLs & Tuple Space (Fritsch & Walker, 2013):
– Coordination
– Distribution
– Decoupling
– Inter-process communication (IPC)
Commonalities cont’d
Flow-based Programming & Tuple Space:
– Both need “a space”
– IP space in flow-based programs
– Tuple Space in LINDA
Altogether:
– (Data) queues
– Coordination does not need to reckon with side effects
– Coordination & composition
– Representation of a dataflow graph in place of a (thread) call graph
• News/RSS streams
• Clicks
– Online advertisement analytics (e.g. spider.io)
– URLs (e.g. bit.ly)
– GUI programming
• Logistics & transportation
• Media
– GPUs (streams + kernels)
• Mashups: create new wisdom from multiple data sources (incompatible in velocity, volume, variety/structure)
– Separate errors
• Debit card transactions
– Data masking
– Fraud detection/feedback context mashups
Real World Applications
• The ultimate mashup: batch data (aka “map reduce”) and speed data (aka “streams”)
– Lambda architecture
– Complementary to each other (e.g. Apache Spark, Lambda architecture)
• Currently three paradigms: RDBMSs, Map Reduce, streams.
– Distributed query processing is a key element
Points to ponder
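The batch/speed mashup can be sketched as a query that merges a precomputed batch view with counts over recent events, in the spirit of the Lambda architecture (the metric names and numbers here are illustrative):

```python
# Lambda-architecture sketch: the batch layer periodically precomputes a
# view over all historical data; the speed layer covers only the events
# that arrived since the last batch run. A query merges both.
from collections import Counter

batch_view = Counter({"clicks": 10_000, "buys": 120})  # from the batch layer
recent_events = ["clicks", "clicks", "buys"]           # not yet in any batch

def query(metric):
    """Serve a metric by combining the batch and speed layers."""
    speed_view = Counter(recent_events)                # streaming layer
    return batch_view[metric] + speed_view[metric]

print(query("clicks"))  # 10002
```

When the next batch run absorbs the recent events, the speed view is discarded and rebuilt, which is what keeps the two layers complementary rather than redundant.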
• Tuple Space is a piece of software as well
• Scalability of tuple space
– Distribution vs fast in memory computation
• Complex coordination is a must!
– So is error handling (stream replay?)
• Number of supporting elements needed
– (auto) scaler
– cloud-like deployment: DevOps recipes
– ZooKeeper?
Issues
Thank you
Denti, Enrico, Antonio Natali, and Andrea Omicini. "On the expressive power of a language for programming coordination media." Proceedings of the 1998 ACM symposium on Applied Computing. ACM, 1998.
Fritsch, Joerg, and Coral Walker. "CMQ-A lightweight, asynchronous high-performance messaging queue for the cloud." Journal of Cloud Computing 1.1 (2012): 1-13.
Fritsch, Joerg, and Coral Walker. "Cwmwl, a LINDA-based PaaS fabric for the cloud." Journal of Communications, Special Issue on Cloud and Big Data (to be published).
Gelernter, David. "Generative communication in Linda." ACM Transactions on Programming Languages and Systems (TOPLAS) 7.1 (1985): 80-112.
Gelernter, David, and Nicholas Carriero. "Coordination languages and their significance." Communications of the ACM 35.2 (1992): 96.
Johnston, Wesley M., J. R. Hanna, and Richard J. Millar. "Advances in dataflow programming languages." ACM Computing Surveys (CSUR) 36.1 (2004): 1-34.
Omicini, A., Denti, E., & Natali, A. (1995). Agent coordination and control through logic theories. In Topics in Artificial Intelligence (pp. 439-450). Springer Berlin Heidelberg.
References