Research Issues/Challenges to Systems Software for Multicore and Data-Intensive Applications
Xiaodong Zhang, Ohio State University
In collaboration with F. Chen, X. Ding, Q. Lu, P. Sadayappan, Ohio State;
S. Jiang, Wayne State University; Z. Zhang, Iowa State; Q. Lu, Intel; J. Lin, IBM;
Kei Davis, Los Alamos National Lab
Footsteps of Challenges in HPC
1970s-80s: killer applications demanded many CPU cycles, but a single processor was very slow (below 1 MHz); the beginning of parallel processing: algorithms, architecture, software.
1980s: communication bottlenecks and the burden of parallel programming; challenge I: fast interconnection networks; challenge II: automatic parallelization and shared virtual memory.
1990s: the "memory wall" and utilization of commodity processors; challenge I: cache design and optimization; challenge II: networks of workstations for HPC.
2000s and now: the "disk wall" and multicore processors.
Moore's Law Driven Computing Research (IEEE Spectrum, May 2008)
25 years of the golden age of parallel computing.
10 years of the dark age of parallel computing: the CPU-memory gap was the major concern.
A new era of multicore computing: the memory problem continues.
Unbalanced System Improvements
[Chart: latencies of cache (SRAM access time), DRAM (access time), and disk (seek time) in CPU cycles, 1980-2000; disk seek time reaches roughly 5,000,000 CPU cycles by 2000.]
Bryant and O’Hallaron, “Computer Systems: A Programmer’s Perspective”, Prentice Hall, 2003
Measured in CPU cycles, disks in 2000 are 57 times "slower" than their ancestors in 1980, increasingly widening the speed gap between peta-scale computing and peta-byte accesses.
Dropping Prices of Solid State Disks (SSDs) (CACM, 7/08)
[Chart: flash cost ($) per GB, falling over time.]
Prices & Performance: Disks, DRAMs, and SSDs (CACM, 7/08)
Power Consumption for Typical Components (CACM, 7/08)
Opportunities of Technology Advancements
• Single-core CPUs reached their peak performance
– 1971 (2,300 transistors on the Intel 4004 chip): 0.4 MHz
– 2005 (1 billion+ transistors on the Intel Pentium D): 3.75 GHz
– After a 10,000x improvement, clock rates stopped rising and then dropped
– CPU improvement is now reflected in the number of cores per chip
• Increased DRAM capacity enables large working sets
– 1971 ($400/MB) to 2006 (0.09 cent/MB): 444,444 times cheaper
– The buffer cache is increasingly important to break the "disk wall"
• SSDs (flash memory) can further break the "wall"
– Low power (6-8x lower than disks, 2x lower than DRAM)
– Fast random reads (200x faster than disks, 25x slower than DRAM)
– Slow writes (300x slower than DRAM, 12x faster than disks)
– Relatively expensive (8x more than disks, 5x cheaper than DRAM)
Research and Challenges
• New issues in multicore
– Utilizing parallelism in multicore is much more complex
– Resource competition in multicore causes new problems
– OS scheduling is multicore- and shared-cache-unaware
– Challenge: caches are not in the scope of OS management
• Fast data access is most desirable
– Sequential locality in disks is not effectively exploited
– Where should flash memory be in the storage hierarchy?
– How can flash memory and the buffer cache improve disk performance and energy?
– Challenge: disks are not in the scope of OS management
Data-Intensive Scalable Computing (DISC)
Massively accessing/processing data sets in parallel; drafted by R. Bryant at CMU, endorsed by industry (Intel, Google, Microsoft, Sun) and by scientists in many areas.
Applications in science, industry, and business.
Special requirements for DISC infrastructure:
A Top 500 for DISC, ranked by data throughput as well as FLOPS.
Frequent interactions between parallel CPUs and distributed storage; scalability is challenging.
DISC is not an extension of supercomputing; it demands new technology advancements.
Systems Comparison (courtesy of Bryant)
Conventional computers:
– Disk data stored separately; no support for data collection or management
– Data brought in for computation: time consuming, limits interactivity
DISC:
– The system collects and maintains data: a shared, active data set
– Computation is co-located with the disks: faster access
Outline
Why is multicore the only choice?
Performance bottlenecks in multicores.
OS plays a strong supportive role.
A case study of DBMS on multicore.
OS-based cache partitioning in multicores.
Summary.
Multicore is the Only Choice to Continue Moore's Law

                        Performance   Power
Baseline frequency      1.00x         1.00x
Over-clocked (1.2x)     1.13x         1.73x
Under-clocked (0.8x)    0.87x         0.51x
Dual-core (0.8x)        1.73x         1.02x

A dual-core at 0.8x frequency delivers much better performance (two cores at 0.87x each gives 1.73x) at similar power consumption (2 x 0.51x = 1.02x).

R.M. Ramanathan, Intel Multi-Core Processors: Making the Move to Quad-Core and Beyond, white paper
Shared Resource Conflicts in Multicores
[Figure: two cores sharing a last-level cache and a memory bus; workloads are cache-sensitive, computation-intensive, or streaming jobs.]
• Scheduling two cache-sensitive jobs together causes cache conflicts.
• Scheduling two streaming jobs together causes memory bus saturation.
• Scheduling two computation-intensive jobs together underutilizes the cache and bus.
• Scheduling a cache-sensitive job with a streaming job causes both conflicts and congestion: the streaming job pollutes the cache and increases memory activity.
Many Cores, Limited Cache, Single Bus
[Figure: many cores, each with a cache, sharing a single memory bus to memory.]
• Many cores: oversupplying computational power.
• Limited cache: lowering cache capacity per process and per core.
• Single bus: increasing bandwidth sharing by many cores.
Moore's Law Driven Relational DBMS Research
1970: relational data model (E. F. Codd)
1976: DBMSs: System R and Ingres (IBM, UC Berkeley)
1977-1997: parallel DBMSs: DIRECT, Gamma, Paradise (University of Wisconsin-Madison)
1999: architecture-optimized DBMS (MonetDB): "Database Architecture Optimized for the New Bottleneck: Memory Access" (VLDB'99)
What should we do as DBMSs meet multicores?
Good/Bad News for "Hash Join"
The shared cache provides both data sharing and cache contention.
select * from Ta, Tb where Ta.x = Tb.y
[Figure: Query 1 on Core 1 executes HJ(H1, Tb) and Query 2 on Core 2 executes HJ(H2, Ta); hash tables H1 and H2 and tuples from Ta and Tb meet in the shared cache. Sharing: both queries scan the same tables. Conflict: the two hash tables compete for shared cache space.]
Consequence of Cache Contention
During concurrent query execution on multicores, cache contention raises new concerns:
• Suboptimal query plans, due to shared-cache unawareness in the optimizer
• Confused scheduling, generating unnecessary cache conflicts
• Cache allocated by demand, not by locality: weak-locality blocks, such as one-time-accessed blocks, pollute and waste cache space
Suboptimal Query Plan
• The query optimizer selects the "best" plan for a query.
• The query optimizer is not shared-cache aware.
• Some query plans are cache sensitive and some are not.
• A mix of cache-sensitive and insensitive queries would confuse the scheduler.
Multicore Unaware Scheduling
• Default scheduling takes a FIFO order, causing cache conflicts.
• Multicore-aware optimization: co-schedule queries that do not conflict, as sketched below.
• "Hash joins" are cache sensitive; "table scans" are insensitive.
• Default: co-schedules two "hash joins" (cache conflict!).
• Multicore-aware: co-schedules a "hash join" with a "table scan".
[Chart: average response time (s) of hash join and table scan under default scheduling vs. CMP-aware scheduling.]
30% improvement!
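To make the pairing rule concrete, here is a minimal C sketch (an illustration, not the talk's implementation); the struct and function names are hypothetical.

#include <stddef.h>

struct query { int cache_sensitive; };  /* hash join: 1, table scan: 0 */

/* Choose a co-runner for queue[first]: prefer the opposite sensitivity
 * so two hash joins never thrash the shared cache; otherwise fall back
 * to plain FIFO order. Returns -1 if the queue has no other query. */
ptrdiff_t pick_corunner(const struct query *queue, size_t n, size_t first)
{
    ptrdiff_t fallback = -1;
    for (size_t i = 0; i < n; i++) {
        if (i == first)
            continue;
        if (queue[i].cache_sensitive != queue[first].cache_sensitive)
            return (ptrdiff_t)i;        /* mixed pair: conflict-free  */
        if (fallback < 0)
            fallback = (ptrdiff_t)i;    /* earliest same-type query   */
    }
    return fallback;
}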
Locality-based Cache Allocation
• Different queries have different cache utilization.
• Cache allocation is demand-based by default.
• Weak-locality queries should be allocated small space.
• Co-schedule a "hash join" (strong locality) with a "table scan" (one-time accesses); allocate more cache to the "hash join".
[Chart: query response time (s) of hash join and table scan, default vs. cache partitioning.]
16% improvement!
A DBMS Framework for Multicores
[Figure: queries pass through the query optimizer, the query scheduler, and cache partitioning before running on cores that share the last-level cache.]
• Query optimization (DB level): the query optimizer generates optimal query plans based on usage of the shared cache.
• Query scheduling (DB level): group co-running queries to minimize access conflicts in the shared cache.
• Cache partitioning (OS level): allocate cache space to maximize cache utilization for the co-scheduled queries.
Challenges and Opportunities
• Challenges to DBMSs
– A DBMS running in user space cannot directly control cache allocation in multicores
– Scheduling: predict potential cache conflicts among co-running queries
– Partitioning: determine access locality for different query operations (join, scan, aggregation, sorting, ...)
• Opportunities
– The query optimizer can provide hints of data access patterns and estimate working-set sizes during query execution
– The operating system can manage cache allocation by using page coloring during virtual-to-physical address mapping
OS-Based Cache Partitioning
• Static cache partitioning
– Predetermines the number of cache blocks allocated to each program at the beginning of its execution
– Divides the shared cache into multiple regions and partitions cache regions through OS page address mapping
• Dynamic cache partitioning
– Adjusts cache quotas among processes dynamically
– Changes processes' cache usage through OS page address re-mapping
Page Coloring
[Figure: a virtual address (virtual page number + page offset) translates to a physical address (physical page number + page offset); in a physically indexed cache, the cache address (tag, set index, block offset) overlaps the physical page number in the page color bits, which the OS controls.]
• Physically indexed caches are divided into multiple regions (colors).
• All cache lines in a physical page are cached in one of those regions (colors).
The OS can control the page color of a virtual page through address mapping, by selecting a physical page with a specific value in its page color bits, as sketched below.
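To illustrate the mechanism, this small C sketch (an illustration, not from the talk) computes a page's color; the cache geometry constants are assumptions chosen for a typical shared L2.

#include <stdint.h>

/* Assumed geometry for illustration: 4 MB shared L2 cache,
 * 16-way set-associative, 64-byte lines, 4 KB pages. */
#define CACHE_SIZE   (4u << 20)
#define CACHE_ASSOC  16u
#define LINE_SIZE    64u
#define PAGE_SIZE    4096u

/* Number of cache sets. */
#define NUM_SETS     (CACHE_SIZE / (CACHE_ASSOC * LINE_SIZE))  /* 4096 sets */
/* Bytes that map to distinct sets before the index wraps around. */
#define WAY_SIZE     (NUM_SETS * LINE_SIZE)                    /* 256 KB    */
/* Page colors = way size / page size. */
#define NUM_COLORS   (WAY_SIZE / PAGE_SIZE)                    /* 64 colors */

/* The color is the low part of the physical page number that also
 * participates in the cache set index. */
static inline unsigned page_color(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / PAGE_SIZE) % NUM_COLORS);
}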
Enhancement for Static Cache Partitioning
[Figure: physical pages are grouped into page bins according to their page color (1, 2, 3, 4, ..., i, i+1, i+2, ...); OS address mapping steers Process 1 and Process 2 to disjoint bins of the physically indexed cache.]
The shared cache is partitioned between the two processes through address mapping.
Cost: main memory space needs to be partitioned too (co-partitioning).
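Building on the page_color sketch above, the following hedged C sketch shows one way static partitioning could allocate pages: free pages sit in per-color bins, and each process draws only from its granted colors. struct page and bin_pop are hypothetical stand-ins for kernel structures.

struct page;                          /* opaque physical page descriptor */
struct page *bin_pop(unsigned color); /* assumed per-color free list     */

struct color_set {
    unsigned colors[NUM_COLORS];  /* colors this process may use */
    unsigned n;                   /* how many it was granted     */
    unsigned next;                /* round-robin cursor          */
};

/* Allocate a physical page for a process, cycling through its colors so
 * its pages spread evenly over its cache partition. */
struct page *alloc_colored_page(struct color_set *cs)
{
    for (unsigned tries = 0; tries < cs->n; tries++) {
        unsigned c = cs->colors[cs->next];
        cs->next = (cs->next + 1) % cs->n;
        struct page *pg = bin_pop(c);
        if (pg)
            return pg;
    }
    return 0;  /* this partition's bins are empty: memory is co-partitioned */
}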
Page Re-Coloring for Dynamic Partitioning
[Figure: a page-links table with one linked list per color (0, 1, 2, 3, ..., N-1); a subset of colors is allocated to the process.]
• Page re-coloring (sketched below):
– Allocate a page in the new color
– Copy the memory contents
– Free the old page
Pages of a process are organized into linked lists by their colors.
Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.
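The three re-coloring steps on the slide might look like the sketch below; all kernel helpers (copy_page_contents, remap_to, bin_push, color_of) are hypothetical stand-ins, and bin_pop comes from the previous sketch.

struct process;
struct page *bin_pop(unsigned color);
void copy_page_contents(struct page *dst, struct page *src);
void remap_to(struct process *p, struct page *from, struct page *to);
void bin_push(unsigned color, struct page *pg);
unsigned color_of(struct page *pg);

/* Re-color one page: the three steps named on the slide. */
void recolor_page(struct process *p, struct page *old_pg, unsigned new_color)
{
    struct page *new_pg = bin_pop(new_color); /* 1. allocate in new color */
    if (!new_pg)
        return;                           /* no free page: keep old color */
    copy_page_contents(new_pg, old_pg);   /* 2. copy memory contents      */
    remap_to(p, old_pg, new_pg);          /* point page tables at copy    */
    bin_push(color_of(old_pg), old_pg);   /* 3. free the old page         */
}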
Reduce Page Migration Overhead
• Control the frequency of page migration
– Frequent enough to capture phase changes
– Infrequent enough to keep migration overhead low
• Lazy migration: avoid unnecessary migration
– Observation: not all pages are accessed between their two migrations
– Optimization: do not migrate a page until it is accessed
Lazy Page Migration
[Figure: the process page-links table (colors 0 to N-1) with the allocated colors highlighted; pages not accessed between two repartitions never need to be migrated.]
• With the optimization:
– Only 2% page migration overhead on average
– Up to 7% in the worst case
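A sketch of the lazy scheme, reusing recolor_page from the previous sketch: repartitioning only marks and unmaps the affected pages, and the copy happens in the fault path on first touch. The helper names are hypothetical; a real kernel would hook its own fault handler.

/* Hypothetical helpers assumed by this sketch: */
void mark_pages_for_recolor(struct process *p, unsigned old_color,
                            unsigned new_color); /* unmap + tag target color */
unsigned pending_color_of(struct page *pg);

/* When the partition changes, copy nothing yet: just unmap the affected
 * pages so their next access faults. */
void repartition_lazily(struct process *p, unsigned old_c, unsigned new_c)
{
    mark_pages_for_recolor(p, old_c, new_c);
}

/* Fault path: migrate a marked page only on its first touch. Pages never
 * accessed between two repartitions are never copied. */
void handle_marked_fault(struct process *p, struct page *pg)
{
    recolor_page(p, pg, pending_color_of(pg));
}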
Research in Cache Partitioning
• Merits of page-color based cache partitioning
– Static partitioning has low overhead, although memory has to be co-partitioned among processes
– A measurement-based platform can be built to evaluate cache partitioning methods on real machines
• Limits and research issues
– The overhead of dynamic cache partitioning is high
– The cache is managed indirectly, at the page level
– Can the OS directly partition the cache with low overhead?
– We have proposed a hybrid method for this purpose
"Disk Wall" is a Critical Issue
Many data-intensive applications generate huge data sets on disks worldwide at very high speed.
LANL turbulence simulation: processing 100+ TB.
Google searches and accesses over 10 billion web pages and tens of TB of data on the Internet.
Internet traffic is expected to increase from 1 to 16 million TB/month due to multimedia data.
We carry very large digital data: films, photos, ...
The cost-effective and reliable home for data is disks.
Slow disk data access is the major bottleneck.
Data-Intensive Scalable Computing (DISC)
Massively accessing/processing data sets in parallel; drafted by R. Bryant at CMU, endorsed by industry (Intel, Google, Microsoft, Sun) and by scientists in many areas.
Applications in science, industry, and business.
Special requirements for DISC infrastructure:
A Top 500 for DISC, ranked by data throughput as well as FLOPS.
Frequent interactions between parallel CPUs and distributed storage; scalability is challenging.
DISC is not an extension of supercomputing; it demands new technology advancements.
Systems Comparison (courtesy of Bryant)
Conventional computers:
– Disk data stored separately; no support for data collection or management
– Data brought in for computation: time consuming, limits interactivity
DISC:
– The system collects and maintains data: a shared, active data set
– Computation is co-located with the disks: faster access
Sequential Locality is Unique in Disks
Sequential locality: disk accesses in sequence are fastest.
Disk speed is limited by mechanical constraints: seeks and rotation (high latency and power consumption).
The OS can guess the sequential disk layout, but is not always right.
Weak OS Ability to Exploit Sequential Locality
The OS is not exactly aware of the disk layout.
Sequential data placement has been implemented since the Fast File System in BSD (1984):
put the files in one directory in sequence on disk;
follow the execution sequence to place data on disk.
This assumes temporal sequence = disk layout sequence.
The assumption is not always right, and performance suffers:
data are accessed in both sequential and random patterns;
an application accesses multiple files;
buffer caching/prefetching know little about the disk layout.
IBM Ultrastar 18ZX Specification*
Sequential read: 4,700 IO/s
Random read: < 200 IO/s
* Taken from IBM "ULTRASTAR 9LZX/18ZX Hardware/Functional Specification", Version 2.4
Our goal: maximize opportunities for sequential accesses, for high speed and high I/O throughput.
Existing Approaches and Limits
Programming for disk performance:
hiding disk latency by overlapping computation; sorting large data sets (SIGMOD'97); application dependent and a programming burden.
Transparent and Informed Prefetching (TIP):
applications issue hints on their future I/O patterns to guide prefetching/caching (SOSP'99); not general enough to cover all applications.
Collective I/O: gather multiple I/O requests into contiguous disk accesses for parallel programs.
Our Objectives
Exploit sequential locality in disks by minimizing random disk accesses and making caching and prefetching disk-aware, utilizing both the buffer cache and SSDs.
An application-independent approach: put disk access information on the OS map.
Exploit DUal LOcalities (DULO):
temporal locality of program execution;
sequential locality of disk accesses.
What is the Buffer Cache Aware and Unaware of?
[Figure: application I/O requests pass through the buffer cache (caching and prefetching) and the I/O scheduler to the disk driver and the disk.]
The buffer cache is an agent between I/O requests and disks:
aware of access patterns in time sequence (a good position to exploit temporal locality);
not clear about the physical layout (limited ability to exploit sequential locality in disks).
Existing functions: send unsatisfied requests to disks; LRU replacement by temporal locality; prefetch under an assumption of sequential access.
Ineffectiveness of the I/O scheduler: sequential locality on disk is not open to buffer management.
Limits of Hit-Ratio Based Buffer Cache Management
Minimizing the cache miss ratio exploits only temporal locality:
sequentially accessed blocks carry a small miss penalty;
randomly accessed blocks carry a large miss penalty.
Average access time = Hit time x Hit rate + Miss penalty x Miss rate
Temporal locality determines the hit and miss rates; sequential locality determines the miss penalty.
Unique and Critical Roles of the Buffer Cache
[Figure: blocks A, B, C, D and X1-X4 scattered over the tracks of a hard disk drive.]
The buffer cache can influence the request stream patterns seen by disks.
If the buffer cache is disk-layout-aware, the OS is able to:
distinguish sequentially and randomly accessed blocks;
give "expensive" random blocks high caching priority in DRAM/SSD;
replace long sequential runs of data blocks to disks in a timely manner.
Disk accesses then become more sequential.
Prefetching Efficiency is Performance Critical
• Prefetching may incur non-sequential disk access
– Non-sequential accesses are much slower than sequential accesses
– Disk layout information must be introduced into prefetching policies
[Figure: with synchronous requests the process repeatedly idles while waiting for the disk; prefetch requests overlap disk access with computation, but idle gaps remain.]
It is increasingly difficult to hide disk accesses behind computation.
File-Level Prefetching is Disk-Layout Unaware
• Multiple files sequentially allocated on disk cannot be prefetched at once.
• Metadata are allocated separately on disk and cannot be prefetched with the data.
• Sequentiality at the file abstraction may not translate to sequentiality on the physical disk.
• Deep access history information is usually not recorded.
[Figure: files X, Y, Z, R and blocks A-D interleaved on disk with the metadata of files X, Y, Z.]
Opportunities and Challenges with Disk Spatial Locality (Disk-Seen)
Exploit DULO for fast disk accesses.
Challenges in building the Disk-Seen system infrastructure:
disk layout information is increasingly hidden inside disks;
analyze and utilize disk-layout information accurately and in a timely manner;
identify long disk sequences;
consider trade-offs between temporal and spatial locality (buffer cache hit ratio vs. miss penalty: not necessarily LRU order);
manage the data structures with low overhead;
implement it in the OS kernel for practical use.
Disk-Seen Task 1: Make Disk Layout Information Available
Which disk layout information to use?
Logical block number (LBN): the location mapping provided by disk firmware (each block is given a sequence number).
Accesses of contiguous LBNs perform close to accesses of physically contiguous blocks on disk (except where bad blocks occur).
The LBN interface is highly portable across platforms.
How can the disk layout information be managed efficiently?
LBNs are only used to identify disk locations for reads/writes; we want to track access times of disk blocks and search for access sequences via LBNs.
Disk block table: a data structure for efficient disk block tracking.
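As an illustration of such a block table (a sketch under assumed sizes, not the exact DULO structure), a small radix tree indexed by LBN can record per-block access times in C:

#include <stdint.h>
#include <stdlib.h>

#define FANOUT 512u  /* entries per node; an illustrative assumption */

struct bt_leaf  { uint32_t last_access[FANOUT]; };  /* per-block timestamps */
struct bt_inner { void *slot[FANOUT]; };            /* child pointers       */

static struct bt_inner bt_root;  /* 3 levels cover FANOUT^3 (~134M) blocks */

/* Record that block `lbn` was accessed at time `now`. */
void bt_record(uint64_t lbn, uint32_t now)
{
    unsigned i1 = (unsigned)((lbn / ((uint64_t)FANOUT * FANOUT)) % FANOUT);
    unsigned i2 = (unsigned)((lbn / FANOUT) % FANOUT);
    unsigned i3 = (unsigned)(lbn % FANOUT);

    struct bt_inner *mid = bt_root.slot[i1];
    if (!mid && !(bt_root.slot[i1] = mid = calloc(1, sizeof *mid)))
        return;                          /* allocation failed: skip entry */
    struct bt_leaf *leaf = mid->slot[i2];
    if (!leaf && !(mid->slot[i2] = leaf = calloc(1, sizeof *leaf)))
        return;
    leaf->last_access[i3] = now;         /* timestamp for this block      */
}

/* Look up the last access time of `lbn` (0 = never seen). */
uint32_t bt_last_access(uint64_t lbn)
{
    struct bt_inner *mid = bt_root.slot[(lbn / ((uint64_t)FANOUT * FANOUT)) % FANOUT];
    if (!mid) return 0;
    struct bt_leaf *leaf = mid->slot[(lbn / FANOUT) % FANOUT];
    return leaf ? leaf->last_access[lbn % FANOUT] : 0;
}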
Disk-Seen Task 2: Exploiting Dual Localities (DULO)
[Figure: an LRU stack divided into a staging section (holding a sequencing bank and a correlation buffer) and an evicting section.]
Sequence forming: a sequence is a number of blocks whose disk locations are adjacent to one another and that have been accessed during a limited time period.
Sequence sorting: order sequences by recency (temporal locality) and size (spatial locality).
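A minimal C sketch of sequence forming, assuming the sequencing bank has already been sorted by LBN; the time-gap threshold is an assumption for illustration.

#include <stddef.h>
#include <stdint.h>

struct blk { uint64_t lbn; uint32_t ts; };  /* one sequencing-bank entry */

#define MAX_TS_GAP 16u  /* "limited time period"; an assumed threshold */

/* Cut a sequence whenever blocks stop being physically adjacent or were
 * accessed too far apart in time. Stores each sequence length in
 * seq_len[] and returns how many sequences were formed. */
size_t form_sequences(const struct blk *bank, size_t n, size_t *seq_len)
{
    size_t nseq = 0, len = 0;
    for (size_t i = 0; i < n; i++) {
        if (len > 0) {
            uint32_t dt = bank[i].ts > bank[i-1].ts ?
                          bank[i].ts - bank[i-1].ts : bank[i-1].ts - bank[i].ts;
            if (bank[i].lbn != bank[i-1].lbn + 1 || dt > MAX_TS_GAP) {
                seq_len[nseq++] = len;   /* close the current sequence  */
                len = 0;
            }
        }
        len++;                           /* extend (or start) a sequence */
    }
    if (len > 0)
        seq_len[nseq++] = len;           /* flush the last sequence      */
    return nseq;                         /* random blocks end up length 1 */
}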
Disk-Seen Task 3: DULO-Caching
Adapted GreedyDual algorithm: keep a global inflation value L and a value H for each sequence.
Calculate H values for sequences in the sequencing bank: H = L + 1 / Length(sequence), so random blocks (length 1) have the largest H values.
When a sequence s is replaced, set L = H(s). L increases monotonically, making future sequences have larger H values.
Sequences with smaller H values are placed closer to the bottom of the LRU stack, as sketched below.
[Figure: an LRU stack where random blocks have H = L0 + 1, four-block sequences have H = L0 + 0.25, and L inflates from L0 to L1 after an eviction.]
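The H-value bookkeeping can be sketched in C as follows (illustrative, using the slide's formula H = L + 1/length):

/* One entry per sequence formed in the sequencing bank. */
struct seqent { double H; unsigned length; };

static double L_infl = 0.0;  /* the global inflation value L */

/* Value sequences as they leave the bank: a random block (length 1)
 * gets H = L + 1; a long sequence gets H close to L. */
void value_sequences(struct seqent *s, unsigned n)
{
    for (unsigned i = 0; i < n; i++)
        s[i].H = L_infl + 1.0 / s[i].length;
}

/* Evict the sequence with the smallest H (bottom of the LRU stack) and
 * inflate L to its H, so future sequences outvalue stale ones.
 * Assumes n >= 1. */
struct seqent *evict_sequence(struct seqent *s, unsigned n)
{
    struct seqent *victim = &s[0];
    for (unsigned i = 1; i < n; i++)
        if (s[i].H < victim->H)
            victim = &s[i];
    L_infl = victim->H;
    return victim;
}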
DULO-Caching Principles
Move long sequences to the bottom of the stack: replace them early; they come back fast from disk.
Replacement priority is set by sequence length.
Move LRU sequences to the bottom of the stack: exploit temporal locality of data accesses.
Keep random blocks in the upper levels of the stack: hold them, since they are expensive to fetch back from disk.
Disk-Seen Task 5: DULO-Prefetching
[Figure: around a block that initiates prefetching, a temporal window (over timestamps) and a spatial window (over LBNs) select resident and non-resident blocks.]
Prefetch size: the maximum number of blocks to be prefetched.
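Combining the two windows with the block table sketched earlier gives a hedged sketch of the prefetch decision; the window sizes are assumptions, and is_resident and issue_read are hypothetical helpers.

#include <stdint.h>

#define SPATIAL_WINDOW  32u    /* LBN span scanned around the trigger */
#define TEMPORAL_WINDOW 1024u  /* recency bound, in access timestamps */

uint32_t bt_last_access(uint64_t lbn);  /* from the block table sketch  */
int      is_resident(uint64_t lbn);     /* hypothetical cache lookup    */
void     issue_read(uint64_t lbn);      /* hypothetical async disk read */

/* On a triggering access, prefetch non-resident blocks that fall inside
 * both windows, up to the prefetch size. */
void dulo_prefetch(uint64_t trigger, uint32_t now, unsigned prefetch_size)
{
    unsigned issued = 0;
    for (uint64_t lbn = trigger + 1;
         lbn <= trigger + SPATIAL_WINDOW && issued < prefetch_size; lbn++) {
        if (is_resident(lbn))
            continue;                    /* already cached               */
        uint32_t last = bt_last_access(lbn);
        if (last != 0 && now - last <= TEMPORAL_WINDOW) {
            issue_read(lbn);             /* likely part of a sequence    */
            issued++;
        }
    }
}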
What can DULO-Caching/-Prefetching do, and not do?
Effective for:
mixed sequential/random accesses (cache them differently);
many small files (package them in one prefetch);
many one-time sequential accesses (replace them quickly);
repeatable complex patterns that cannot be detected without disk information (remember them).
Not effective for:
dominantly random or sequential accesses (performs equivalently to LRU);
a large file sequentially located on disk (file-level prefetching handles it);
non-repeatable accesses (performs equivalently to file-level prefetching).
DiskSeen: a System Infrastructure to Support DULO-Caching and DULO-Prefetching
[Figure: the buffer cache is divided into a prefetching area, a caching area, and a destaging area, with block transfers between the areas and the disk. DULO-prefetching adjusts windows/streams; on-demand reads are placed at the stack top; DULO-caching manages LRU blocks and long sequences.]
DULO-Caching Does Not Affect Execution Times of Pure Sequential or Random Workloads
TPC-H query #6 (sequential accesses); diff (random accesses).
DULO-Caching Reduces Execution Times for Workloads with Mixed Patterns
PostMark (mixed patterns of both sequential and random accesses).
DULO-Prefetching Reduces Execution Times for Workloads with Many Small Files
[Chart: execution time (sec) of grep, cvs, and diff under Linux 2.6.11 vs. DULO.]
DULO-Prefetching Reduces Execution Times for Workloads with Complex Access Patterns
[Chart: execution time (sec) of strided, reverse, and TPC-H (Q4) workloads under Linux 2.6.11 vs. DULO.]
Conclusions (1): Issues of Multicores
Resource conflicts arise in shared caches and on the memory bus; the OS is not shared-cache-aware.
Multicore- and cache-aware scheduling is essential: schedule jobs based on resource demand initially; reschedule jobs to optimize resource utilization.
Build a hybrid OS resource management system: dynamically allocate cache space to each process; put cache information on the OS map; minimize OS and hardware overheads.
Conclusions (2): Disk Performance
Disk performance is limited because the OS is unable to effectively exploit sequential locality.
The buffer cache is a critical component for storage, but existing OSes mainly exploit temporal locality.
Build a Disk-Seen system infrastructure for DULO-Caching and DULO-Prefetching.
Flash memory in the storage hierarchy: hold the random accesses; serve as an L2 cache for the DRAM buffer cache; cache for low power.