Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation –...
Transcript of Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation –...
![Page 1: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/1.jpg)
Chapter 7: Multiprocessing
Advanced Operating Systems (263‐3800‐00L) Timothy Roscoe
Herbstsemester 2012http://www.systems.ethz.ch/education/courses/hs11/aos/
© Systems Group Department of Computer Science ETH Zürich
![Page 2: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/2.jpg)
Milestone 7: another core
• Assignment: – Bring up the second Cortex‐A9 core
• Today:–Multiprocessor operating systems
– User‐level RPC– Inter‐core messaging in Barrelfish
![Page 3: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/3.jpg)
Multiprocessor OSes
• Multiprocessor computers were anticipated by the research community long before they became mainstream– Typically restricted to “big iron”
• But relatively few OSes designed from the outset for multiprocessor hardware
• A multiprocessor OS:– Runs on a tightly‐coupled (usually shared‐memory) multiprocessor machine
– Provides system‐wide OS abstractions
![Page 4: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/4.jpg)
Multics
• Time‐sharing operating system for a multiprocessor mainframe
• Joint project between MIT, General Electric, and Bell Labs (until 1969)
• 1965 – mid 1980s– Last Multics system decommissioned in 2000
• Goals: reliability, dynamic reconfiguration, security, etc.
• Very influential
![Page 5: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/5.jpg)
Multics: typical configuration
CPU CPU
memory memorymemory memory
I/O controller
I/O controller
I/O controller
to remote terminals, magnetic tape, disc, console reader punch etc
GE645 computerSymmetric multiprocessor
Communication was by using “mailboxes” in the memory modules and corresponding interrupts (asynchronous).
![Page 6: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/6.jpg)
Multics on GE645memory
cache
CPU
chip
Failure boundary (board/box)
• Reliable interconnect• No caches• Single level of shared memory• Uniform memory access (UMA)
• Online reconfiguration of the hardware• Regularly partitioned into 2 separate systems for testing and development and then recombined
• Slow!
![Page 7: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/7.jpg)
Hydra
• Early 1970s, CMU• Multiprocessor operating system for C.mmp(Carnegie‐Mellon Multi‐Mini‐Processor)– Up to 16 PDP‐11 processors– Up to 32MB memory
• Design goals:– Effective utilization of hardware resources– Base for further research into OSes and runtimes for multiprocessor systems
![Page 8: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/8.jpg)
C.mmp multiprocessor
Switch
Mp0 (2MB)
Mp15 (2MB)
Pc0 Pc15
Mp
Dmap
Switchto secondary memory and devices
Dmap
MpKc Kc
Clock
Interrupt
Primarymemory
Central processor (PDP‐11)
Control for clock, IPC
address relocation hardware
![Page 9: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/9.jpg)
Hydra (cont)
• Limited hardware– No hardware messaging, send IPIs
– No caches• 8k private memory on processors
– No virtual memory support
• Crossbar switch to access memory banks– Uniform memory access (~1us if no contention)
– But had to worry about contention
• Not scalable
●●●
●●●
![Page 10: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/10.jpg)
Cm*
• Late 1970s, CMU• Improved scalability over C.mmp– 50 processors, 3MB shared memory– Each processor is a DEC LSI‐11 processor with bus, local memory and peripherals
– Set of clusters (up to 14 processors per cluster) connected by a bus
– Memory can be accessed locally, within the cluster and at another cluster (NUMA)
– No cache• 2 Oses developed: StarOS and Medusa
![Page 11: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/11.jpg)
Cm*
KMAP
KMAP
KMAP
KMAP
KMAP
CM30
CM31
CM39●●●
CM20
CM21
CM29●●●
CM10
CM11
CM19●●●
CM00
CM01
CM09●●● CM
40CM41
CM49●●●
50 compute modules (CMs)5 communication controllers (Kmaps)
One Kmapper cluster
![Page 12: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/12.jpg)
Cm*
●●●●●●
●●●
●●●
●●●
• NUMA• Reliable message‐passing• No caches• Contention and latency big issues when accessing remote memory• Sharing is expensive• Concurrent processes run better if independent
![Page 13: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/13.jpg)
Medusa
• OS for Cm*, 1977‐1980• Goal: reflect the underlying distributed architecture of the hardware
• Single copy of the OS impractical– Huge difference in local vs non‐local memory access times
– 3.5us local vs 24us cross‐cluster• Complete replication of the OS impractical– Small local memories (64 or 128KB)– Typical OS size 40‐60KB
![Page 14: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/14.jpg)
Medusa (cont)
• Replicated kernel on each processor– Interrupts, context switching
• Other OS functions divided into disjoint utilities– Utility code always executed on local processor– Utility functions invoked (asynchronously) by sending messages on pipes
• Utilities:– Memory manager– File system– Task force manager
• All processes are task forces, consisting of multiple activities that are co‐scheduled across multiple processors
– Exception reporter– Debugger/tracer
![Page 15: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/15.jpg)
Medusa (cont)
• Had to be careful about deadlock, eg file open:– File manager must request storage for file control block from memory manager
– If swapping between primary and secondary memory is required, then memory manager must request I/O transfer from file system
→ Deadlock
• Used coscheduling of activities in a task force to avoid livelock
![Page 16: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/16.jpg)
Firefly
• Shared‐memory, multiprocessor, personal workstation– Developed at DEC SRC, 1985‐1987
• Requirements:– Research platform (powerful, multiprocessor)– Built in a short space of time (off‐the‐shelf components as much as possible)
– Suitable for an office (not too large, loud or power‐hungry)
– Ease of programming (hardware cache coherence)
![Page 17: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/17.jpg)
Cache
CPUFPU
CPUFPU
Cache
CPUFPU
CPUFPU
Firefly (version 2)
32MByte memory
I/O controllers(disk, network,
display, keyboard, mouse)
Secondary processors: CVAX 78034(typically 4)
Primary processor:MicroVAX 78032Cache
CPUFPU
CPUFPU
Logic for I/O
Q‐Bus
M‐Bus
Cache
CPUFPU
CPUFPU
Cache
CPUFPU
CPUFPU
![Page 18: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/18.jpg)
Firefly
• UMA• Reliable interconnect• Hardware support for cache coherence
• Bus contention an important issue• Analysis using trace‐driven simulation and a simple queuing model found that adding processors improved performance up to about 9 processors
![Page 19: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/19.jpg)
Topaz
• Software system for the Firefly• Multiple threads of control in a shared address space• Binary emulation of Ultrix system call interface• Uniform RPC communication mechanism– Same machine and between machines
• System kernel called the Nub– Virtual memory– Scheduler– Device drivers
• Rest of the OS ran in user‐mode• All software multithreaded– Executed simultaneously on multiple processors
![Page 20: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/20.jpg)
Memory consistency models
If one CPU modifies memory, when do others observe it?• Strict/Sequential: reads return the most recently written
value• Processor/PRAM: writes from one CPU are seen in order,
writes by different CPUs may be reordered• Weak: separate rules for synchronizing accesses (e.g. locks)
– Synchronising accesses sequentially consistent– Synchronising accesses act as a barrier:
• previous writes completed• future read/writes blocked
Important to know your hardware!– e.g. x86: processor consistency, PowerPC: weak consistency
![Page 21: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/21.jpg)
Memory consistency models
If one CPU modifies memory, when do others observe it?• Strict/Sequential: reads return the most recently written
value• Processor/PRAM: writes from one CPU are seen in order,
writes by different CPUs may be reordered• Weak: separate rules for synchronizing accesses (e.g. locks)
– Synchronising accesses sequentially consistent– Synchronising accesses act as a barrier:
• previous writes completed• future read/writes blocked
Important to know your hardware!– x86: processor consistency– PowerPC: weak consistency
![Page 22: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/22.jpg)
Hardware cache coherence
Example: MOESI protocol• Every cache line is in one of five states:
Modified: dirty, present only in this cacheOwned: dirty, present in this cache and possibly others
Exclusive: clean, present only in this cacheShared: present in this cache and possibly othersInvalid: not present
• May satisfy read from any state• Fetch to shared or exclusive state• Write requires modified or exclusive state;
– if shared, must invalidate other caches• Owned: line may be transferred without flushing to memory
![Page 23: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/23.jpg)
Hive
• Stanford, early 1990s
• Targeted at the Stanford FLASH multiprocessor– Large‐scale ccNUMA
• Main goal was fault containment– Contain hardware and software failure to the smallest possible set of resources
• Second goal was scalability through limited sharing of kernel resources
![Page 24: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/24.jpg)
Stanford FLASH architecture
Memory Processor
Coherence Controller
2nd‐Level Cache
Net I/O
![Page 25: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/25.jpg)
Stanford FLASH
• Reliable message‐passing– Nodes can fail independently
• Designed to scale to 1000’s of nodes
• Non‐Uniform Memory Access– Latency increases with distance
• Hardware cache coherence– Directory‐based protocol– Data structures occupy 7‐9% of main memory
●●●
![Page 26: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/26.jpg)
Hive (cont)
• Each “cell” (ie kernel) independently manages a small group of processors, plus memory and I/O devices– Controls a portion of the global address space
• Cells communicate mostly by RPC– But for performance can read and write each other’s memory directly
• Resource management by program called Wax running in user‐space– Global allocation policies for memory and processors– Threads on different cells synchronize via shared memory
![Page 27: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/27.jpg)
Hive: failure detection and fault containment
• Failure detection mechanisms– RPC timeouts– Keep‐alive increments on shared memory locations– Consistency checks on reading remote cell data structures– Hardware errors, eg bus errors
• Fault containment– Hardware firewall (an ACL per page of memory) prevents wild writes
– Preemptive discard of all pages belonging to a failed process
– Aggressive failure detection• Distributed agreement algorithm confirms cell has failed and reboot it
![Page 28: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/28.jpg)
DiscoRunning commodity OSes on scalable multiprocessors [Bugnion et al., 1997]
• Context: ca. 1995, large ccNUMA multiprocessors appearing
• Problem: scaling OSes to run efficiently on these was hard– Extensive modification of OS required– Complexity of OS makes this expensive– Availability of software and OSes trailing hardware
• Idea: implement a scalable VMM, run multiple OS instances• VMM has most of the features of a scalable OS, e.g.:
– NUMA‐aware allocator– Page replication, remapping, etc.
• VMM substantially simpler/cheaper to implement• Run multiple (smaller) OS images, for different applications
![Page 29: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/29.jpg)
Disco architecture
[Bugnion et al., 1997]
![Page 30: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/30.jpg)
Disco Contributions
• First project to revive an old idea: virtualization– New way to work around shortcomings of commodity Oses– Much of the paper focuses on efficient VM implementation
– Authors went on to found VMware
• Another interesting (but largely unexplored) idea:programming a single machine as a distributed system– Example: parallel make, two configurations:
1. Run an 8‐CPU IRIX instance2. Run 8 IRIX VMs on Disco, one with an NFS server
– Speedup for case 2, despite VM and vNIC overheads
![Page 31: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/31.jpg)
K42
• OS for cache‐coherent NUMA systems
• IBM Research, 1997–2006ish• Successor of Tornado and
Hurricane systems(University of Toronto)
• Supports Linux API/ABI• Aims: high locality, scalability• Heavily object‐oriented
– Resources managed by set of object instances
![Page 32: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/32.jpg)
Why use OO in an OS?
[Appavoo, 2005]
![Page 33: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/33.jpg)
Clustered ObjectsExample: shared counter
• Object internally decomposed into processor‐local representatives
• Same reference on any processor– Object system routes
invocation to local representative
Choice of sharing and locking strategy local to each object
• In example, inc and dec arelocal; only val needs to communicate
![Page 34: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/34.jpg)
Clustered objectsImplementation using processor‐local object translation table:
![Page 35: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/35.jpg)
Challenges with clustered objects
• Degree of clustering (number of reps, partitioned vs replicated) depends on how the object is used
• State maintained by the object reps must be kept consistent
• Determining global state is hard– Eg How to choose the next highest priority thread for scheduling when priorities are distributed across many user‐level scheduler objects
![Page 36: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/36.jpg)
Concrete example: VM objects
• OO decomposition minimizes sharing for unrelated data structures– No global locks reduced synchronization
• Clustered objects system limits sharing within an object
![Page 37: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/37.jpg)
K42 Principles/Lessons
• Focus on locality, not concurrency, to achieve scalability
• Adopt distributed component model to enable consistent construction of locality‐tuned components
• Support distribution within an OO encapsulation boundary:– eases complexity – permits controlled/manageable introduction of localized data structures
![Page 38: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/38.jpg)
Clear trend….
• Finer‐grained locking of shared memory• Replication as an optimization of shared memory
These are research OSes or Supercomputers.So why would you care?
Traditional OSesTraditional OSes
Shared state ,One‐big‐lock
Finer‐grainedlocking
Clustered objectspartitioning
![Page 39: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/39.jpg)
Further reading• Multics: www.multicians.org
• “C.mmp: a multi‐mini‐processor”, W. Wulf and C.G. Bell, Fall Joint Computer Conference, Dec 1972
• “HYDRA: The kernel of a multiprocessor operating system”, W. Wulf et al, Comm. ACM, 17(6) , June 1974
• “Overview of the Hydra Operating System Development”, W. Wulf et al, 5th SOSP, Nov 1975
• “Policy/Mechanism Separation in Hydra”, R. Levin et al, 5th SOSP, Nov 1975
• “Medusa: An Experiment in Distributed Operating System Structure”, John K. Ousterhout et al, CACM, 23(2), Feb 1980
• “Firefly: a multiprocessor workstation”, Chuck Thacker and Lawrence Stewart, Computer Architecture News, 15(5), 1987
• “The duality of memory and communication in the implementation of a multiprocessor operating system”, Michael Young et al, 11th SOSP, Nov 1987 [Mach]
• Mach: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/mach/public/www/mach.html
• “The Stanford FLASH Multiprocessor”, J Kuskin et al, ISCA, 1994
• “Hive: Fault Containment for Shared‐Memory Multiprocessors”, J.Chapin et al, 15th SOSP, Dec 1995
• “K42: Building a Complete Operating System”, 1st EuroSys, April, 2006
• “Tornado: Maximising Locality and Concurrency in a Shared Memory Multiprocessor Operating System”, Gamsa et al, OSDI, Feb 1999
• K42: http://domino.research.ibm.com/comm/research_projects.nsf/pages/k42.index.html
![Page 40: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/40.jpg)
User‐level RPC
![Page 41: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/41.jpg)
User‐level RPC (RPC)
• Arguably, URPC is to Scheduler Activations what LRPC was to kernel threads– Send messages between address spaces directlyno kernel involved!
– Eliminate unnecessary processor reallocation– Amortize processor reallocation over several calls– Exploit inherent parallelism in send / receiving
• Decouple:– notification (user‐space)– scheduling (kernel)– data transfer (also user space).
![Page 42: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/42.jpg)
Application
URPC operation
Stubs
URPC
Scheduler activations
Application
Stubs
URPC
Scheduler activations
Channel
KernelReallocateprocessor
![Page 43: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/43.jpg)
How is the kernel not involved?
• Shared memory channels, mapped pairwise between domains
• Queues with non‐spinning TAS locks at each end• Threads block on channels entirely in user space• Messaging is asynchronous below thread abstractions• Can switch to another thread in same address space
– rather than block waiting for another address space
• Big win: Multiprocessor with concurrent client and server threads
![Page 44: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/44.jpg)
URPC latency
C = #cores for clientsS = #cores for server
![Page 45: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/45.jpg)
URPC throughput
C = #cores for clientsS = #cores for server
![Page 46: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/46.jpg)
URPC performance
All on a Firefly (4‐processor CVAX). Irony:• LRPC, L4 seek performance by optimizing kernel path• URPC gains performance by bypassing kernel entirely
Mechanism Operation Performance
URPC
Cross‐AS latency 93 s
Inter‐processor overhead 53 s
Thread fork 43 s
LRPCLatency 157 s
Thread fork > 1000 s
HardwareProcedure call 7 s
Kernel trap 20 s
![Page 47: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/47.jpg)
Discussion
• L4, LRPC:– Optimize for synchronous, null RPC performance bypass scheduler
– Hard to perform accurate resource accounting
• URPC:– Integrate with the scheduler– Decouple from event transmission slightly slower null RPC times when idle higher RPC throughput lower latency on multiprocessors
![Page 48: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/48.jpg)
Interprocess communication in Barrelfish
![Page 49: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/49.jpg)
Communication stack
CPU
CPU driver
User‐space dispatcher
• Typically IPI• Adjunct to Interconnect Drivers
Interconnect drivers
Notification drivers
Smart stubs
Group communication
Routing
• Low‐level messaging• Highly exposed, fixed MTU• User‐space where possible• Polled or event‐based
• Portability layer (C API)• Generated from IDL
• Multihop routing• Multicast tree construction
• Consensus• Replica maintenance
![Page 50: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/50.jpg)
Barrelfish communication stack
CPU
CPU driver
User‐space dispatcher
• Typically IPI• Adjunct to Interconnect Drivers
Interconnect drivers
Notification drivers
Smart stubs
Group communication
Routing
• Low‐level messaging• Highly exposed, fixed MTU• User‐space where possible• Polled or event‐based
• Portability layer (C API)• Generated from IDL
• Multihop routing• Multicast tree construction
• Consensus• Replica maintenance
![Page 51: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/51.jpg)
Interconnect Drivers
• Barely abstract an inter‐core message channel– Expose message size (e.g. 64 bytes)– Don’t implement flow control, etc. unless it’s there– Don’t require privilege, unless you have to
• May not be able to send capabilities over this channel!– Separate out notification, unless it’s coupled– Interface can be either polled or event‐driven
• General philosophy:– Don’t cover up anything– Expose all functionality– Abstract at a higher layer: in the stubs
![Page 52: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/52.jpg)
CC‐UMP Interconnect Driver
• Cache‐coherent shared memory – inspired by URPC
• Ring buffer of cache‐line sized messages– 64 bytes or 32 bytes– 1 word for bookkeeping; last one written (end of line)
• Credit‐based flow control out of band
• One channel per IPC binding (not shared)
![Page 53: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/53.jpg)
CC‐UMP: cache‐coherent user‐space messaging
Unidirectional channelsFixed‐size circular bufferAll messages are 64‐byte cache lines
Rx
Tx
Sender fills in message in next cache‐line sized
slot
Receiver polls for update at end of
message
Ack pointer advanced via messages in reverse path
Ack
![Page 54: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/54.jpg)
CC‐UMP: cache‐coherent user‐space messaging
S
ISending core’s cache: invalid => no copy of
the line
Receiver’s cache: shared => read‐only copy of the line
Sending core’s write buffer
Tx
Rx
![Page 55: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/55.jpg)
CC‐UMP: cache‐coherent user‐space messaging
Rx
S
ISending core’s cache: invalid => no copy of
the line
Receiver’s cache: shared => read‐only copy of the line
Sending core’s write buffer
Sender starts to write message; h/w combines writes in write buffer
1
Tx
![Page 56: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/56.jpg)
CC‐UMP: cache‐coherent user‐space messaging
S
ISending core’s cache: invalid => no copy of
the line
Receiver’s cache: shared => read‐only copy of the line
Sending core’s write buffer
Sender starts to write message; h/w combines writes in write buffer
1
TxRx
![Page 57: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/57.jpg)
CC‐UMP: cache‐coherent user‐space messaging
S
ISending core’s cache: invalid => no copy of
the line
Receiver’s cache: shared => read‐only copy of the line
Sending core’s write buffer
Sender starts to write message; h/w combines writes in write buffer
1
TxRx
![Page 58: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/58.jpg)
CC‐UMP: cache‐coherent user‐space messaging
Sending core’s write buffer
Sender starts to write message; h/w combines writes in write buffer
1
Write buffer fills: fetch target cache line in exclusive (E) state
2Tx
INVA
LIDATE
I
S
E
I
Sending core’s cache: invalid => no copy of
the line
Receiver’s cache: shared => read‐only copy of the line
Receiver’s cache: invalid => out‐of‐date
copy of the line
Sending core’s cache: exclusive => clean, writable copy
Rx
![Page 59: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/59.jpg)
CC‐UMP: cache‐coherent user‐space messaging
I
Sending core’s cache: modified => dirty r/w
copy
Sending core’s write buffer
Sender starts to write message; h/w combines writes in write buffer
1
Write buffer fills: fetch target cache line in exclusive (E) state
2M
Drain buffered writes into cache line, change to modified (M) state
3
Tx
INVA
LIDATE
Receiver’s cache: invalid => out‐of‐date
copy of the lineRx
![Page 60: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/60.jpg)
CC‐UMP: cache‐coherent user‐space messaging
Sending core’s cache: shared => read‐only
copy
Receiver’s cache: shared =>
read‐only copy
Sending core’s write buffer
Sender starts to write message; h/w combines writes in write buffer
1
Write buffer fills: fetch target cache line in exclusive (E) state
2
Drain buffered writes into cache line, change to modified (M) state
3
Reader polls again; own cache is invalid (I) so needs to fetch
fresh read‐only copy (S)
4
Tx
Receiver’s cache: invalid => out‐of‐date
copy of the lineI
M
S
SSending core’s cache: modified => dirty r/w
copyINVA
LIDATE PR
OBE
Rx
![Page 61: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/61.jpg)
Conventional wisdom
• Stub compilers are a solved problem (Flick)– Optimizing stub compilers compute most efficient way to copy values into a buffer
– Buffers are assumed to be “big enough”– Marshalling code separate from send/receive code– Marshalling code doesn’t handle fragmentation/reassembly
• But: – Interconnect drivers don’t have a buffer abstraction– Transmission units are small (cache lines, registers)– Efficient packing varies across interconnect drivers
![Page 62: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/62.jpg)
Flounder and specialized stubs
• Different backend code generator for each ICD
• Lots of engineering, but:– Haskell makes this easier
– Code reuse where possible– Filet‐o‐Fish would help more (but we don’t use it here)
• Highly optimized: performs final specialization
![Page 63: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/63.jpg)
Stub performance really matters
CC‐UMP Interconnect driver64‐byte (16‐word) MTU
Nehalem‐EX 64‐core system(between packages)
![Page 64: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/64.jpg)
Where does the time go?
![Page 65: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/65.jpg)
Where does the time go?
Null messageIntel Nehalem‐EXAMD ShanghaiCC‐UMP Interconnect driver
![Page 66: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/66.jpg)
Communication binding
Client
Monitor
Server
Monitor
Core 1 Core 2
1. bind_client(iref, cframe)
2. remote_bind_req(iref, cframe)
5. remote_bind_reply(sframe)
3. bind(cframe)
4. ack(sframe)
4. ack(sframe)
sframe
cframe
Monitors route binding requests and replies
![Page 67: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/67.jpg)
Name service
Client Server
Monitor
1. alloc_iref()
Name server is orthogonal to IPC system
Nameserver
2. register(iref, name)3. query(name)
![Page 68: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/68.jpg)
Intra‐OS routing
• Routing within a single machine?– Hardware may not give full‐mesh connectivity
– Some Interconnects must be multiplexed in software (tunnelled)• E.g. PCIe channel to SCC• E.g Ethernet
• Monitors and library provide intra‐OS routing
![Page 69: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/69.jpg)
Multicast
• Even with full routes, may need routing for group communication– High cost of dropping into software for a hop– Balanced with parallelism from e.g. tree topologies
• Routing library provides efficient construction of dissemination trees for specific hardware– Built at runtime from on‐line measurements
![Page 70: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/70.jpg)
Example: radix tree multicast
![Page 71: Chapter Multiprocessing - ETH · • Shared‐memory, multiprocessor, personal workstation – Developed at DEC SRC, 1985‐1987 • Requirements: – Research platform (powerful,](https://reader033.fdocuments.in/reader033/viewer/2022041506/5e251b6672892045792cd590/html5/thumbnails/71.jpg)
Summary
• Multiprocessors are different– Real concurrency– Exploit parallelism in message send/receive
• Use shared memory to bypass kernel– Scheduling decisions decoupled from messages– Spatial scheduling increasingly important
• Lowest level of a complete stack– Stubs, routing, multicast, etc. – almost a network…