Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.
![Page 1: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/1.jpg)
Heracles: Improving Resource Efficiency at Scale
ISCA’15
Stanford University
Google, Inc.
![Page 2: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/2.jpg)
Outline
Introduction
Design
◦ Isolation Mechanisms
◦ Controllers
Evaluation
Conclusion
![Page 3: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/3.jpg)
Motivation
Average server utilization in most datacenters is low, ranging between 10% and 50%.
◦ It is difficult to consolidate latency-critical services on a subset of highly utilized servers.
Increase server utilization by launching best-effort tasks on the same server as a latency-critical job.
![Page 4: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/4.jpg)
Motivation (Cont.)
Previous work tends to protect latency-critical (LC) workloads, but reduces the opportunities for higher utilization through co-location.
![Page 5: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/5.jpg)
Goal
Eliminate SLO violations at all levels of load for the LC job while maximizing the throughput of best-effort (BE) tasks.
![Page 6: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/6.jpg)
Heracles
A real-time, feedback-based controller
◦ Enables the safe co-location of best-effort (BE) tasks alongside a latency-critical (LC) service.
◦ Ensures that LC jobs meet their latency target while maximizing the resources given to BE tasks.
![Page 7: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/7.jpg)
Heracles (Cont.)
◦ Four hardware and software isolation mechanisms.
  ◦ Hardware: shared cache partitioning, fine-grained power/frequency settings.
  ◦ Software: core isolation, network traffic control.
![Page 8: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/8.jpg)
Isolation Mechanisms (Soft)
Core isolation
◦ Pin each workload to a set of cores using cpuset cgroups.
◦ Speed of (re)allocation: tens of milliseconds.
Network traffic
◦ Limit the outgoing bandwidth of BE tasks using Linux traffic control.
◦ No limit on the LC job.
◦ Takes effect in less than hundreds of milliseconds.
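The core-isolation step above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `split_cores` and `cpuset_string` and the 16-core server are assumptions. It only computes the core lists and formats them in the list syntax that a cgroup's `cpuset.cpus` file accepts; actually writing the file requires root and is left out.

```python
from typing import List, Tuple


def split_cores(total_cores: int, be_cores: int) -> Tuple[List[int], List[int]]:
    """Hypothetical policy: BE tasks get the last `be_cores` cores, LC keeps the rest."""
    if not 0 <= be_cores < total_cores:
        raise ValueError("the LC job must keep at least one core")
    boundary = total_cores - be_cores
    return list(range(boundary)), list(range(boundary, total_cores))


def cpuset_string(cores: List[int]) -> str:
    """Format a core list in cpuset list syntax, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    ranges: List[List[int]] = []
    for core in sorted(set(cores)):
        if ranges and core == ranges[-1][1] + 1:
            ranges[-1][1] = core          # extend the current contiguous run
        else:
            ranges.append([core, core])   # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


# Assumed 16-core server with 4 cores handed to BE tasks.
lc, be = split_cores(16, 4)
lc_cpus, be_cpus = cpuset_string(lc), cpuset_string(be)
# Each string would then be written to the cpuset.cpus file of the
# corresponding cgroup (not done here).
```

Keeping the two sets disjoint is the point: an LC thread and a BE thread never share a core.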
![Page 9: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/9.jpg)
Isolation Mechanisms (Hard)
LLC isolation
◦ Cache Allocation Technology (CAT) in recent Intel chips.
  ◦ Uses way-partitioning to define non-overlapping partitions on the LLC.
  ◦ Takes effect in a few milliseconds.
◦ Implement a software monitor to track the DRAM bandwidth usage of LC and BE jobs.
  ◦ Scale down the number of cores for BE jobs if the LC job does not receive sufficient bandwidth.
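Way-partitioning can be pictured as two bitmasks over the cache ways, one per workload class. The sketch below is an illustration only: the function name `way_masks` and the 20-way cache are assumptions, and it does not program any hardware. It relies on the fact that CAT capacity bitmasks must be contiguous runs of set bits.

```python
from typing import Tuple


def way_masks(total_ways: int, be_ways: int) -> Tuple[int, int]:
    """Return (lc_mask, be_mask): two contiguous, non-overlapping way bitmasks.

    Hypothetical layout: BE tasks get the lowest `be_ways` ways and the
    LC job gets all remaining ways.
    """
    if not 0 < be_ways < total_ways:
        raise ValueError("both partitions need at least one way")
    be_mask = (1 << be_ways) - 1              # low-order ways for BE
    all_ways = (1 << total_ways) - 1
    lc_mask = all_ways ^ be_mask              # everything else for LC
    return lc_mask, be_mask


# Assumed 20-way LLC with 4 ways given to BE tasks.
lc_mask, be_mask = way_masks(20, 4)
masks_hex = (hex(lc_mask), hex(be_mask))
```

Growing or shrinking the BE partition is then a matter of recomputing the two masks with a different `be_ways`.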
![Page 10: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/10.jpg)
Isolation Mechanisms (Hard) (Cont.)
Power isolation
◦ CPU frequency monitoring, Running Average Power Limit (RAPL), and per-core DVFS.
◦ Takes effect within a few milliseconds.
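One way these three pieces could fit together is a simple rule: when the package is close to its power limit and the LC cores run below their guaranteed frequency, slow the BE cores down; otherwise let them speed back up. The sketch below is a hedged illustration; the 90% power threshold, the function name `be_frequency_action`, and the sample readings are all assumptions, and no RAPL or DVFS interface is actually touched.

```python
def be_frequency_action(power_w: float, tdp_w: float,
                        lc_freq_ghz: float, guaranteed_ghz: float,
                        power_threshold: float = 0.90) -> str:
    """Decide how to adjust BE core frequency from power and LC frequency readings.

    All thresholds are illustrative assumptions, not measured values.
    """
    near_limit = power_w > power_threshold * tdp_w
    lc_throttled = lc_freq_ghz < guaranteed_ghz
    if near_limit and lc_throttled:
        return "lower_be_frequency"      # free power headroom for the LC cores
    if not near_limit and not lc_throttled:
        return "raise_be_frequency"      # spare power can go to BE tasks
    return "hold"


# Hypothetical readings: 140 W on a 145 W part, LC cores at 2.1 GHz
# against a 2.3 GHz guaranteed frequency.
action = be_frequency_action(140.0, 145.0, 2.1, 2.3)
```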
![Page 11: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/11.jpg)
Design Approach
An optimization problem
◦ Maximize utilization under the constraint that the SLO must be met.
Heracles
◦ Decomposes the high-dimensional optimization problem into many smaller, independent problems.
  ◦ Decouples the interference sources.
◦ Monitors latency, latency slack, and load.
  ◦ Adjusts the BE job allocation accordingly.
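The monitoring loop above can be sketched as a single decision step. This is a minimal sketch under stated assumptions: slack is taken to be the fraction of the latency target still unused, and the load thresholds (0.85, 0.80) and slack threshold (0.10) as well as the function names are illustrative choices, not values confirmed by these slides.

```python
def latency_slack(target_ms: float, latency_ms: float) -> float:
    """Slack as a fraction of the target; negative means the SLO is violated."""
    return (target_ms - latency_ms) / target_ms


def decide(slack: float, load: float,
           high_load: float = 0.85, low_load: float = 0.80,
           growth_slack: float = 0.10) -> str:
    """One controller step: map (slack, load) to an action on BE tasks.

    Thresholds are assumptions for illustration only.
    """
    if slack < 0:
        return "disable_be"            # SLO violated: stop BE work immediately
    if load > high_load:
        return "disable_be"            # LC load too high to share the server
    if load < low_load:
        return "enable_be"             # plenty of headroom
    if slack < growth_slack:
        return "disallow_be_growth"    # keep BE running but do not grow it
    return "no_change"


# Hypothetical sample: 100 ms target, 80 ms measured tail latency, 50% load.
slack = latency_slack(100.0, 80.0)
action = decide(slack, load=0.5)
```

The gap between the two load thresholds acts as hysteresis, so BE tasks are not toggled on and off when load hovers around a single cutoff.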
![Page 12: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/12.jpg)
System Diagram
![Page 13: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/13.jpg)
High-level Controller
![Page 14: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/14.jpg)
Core & Memory Sub-controller
![Page 15: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/15.jpg)
Max Load under SLO
![Page 16: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/16.jpg)
Power and Network Sub-controller
![Page 17: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/17.jpg)
Evaluation
Two sets of experiments
◦ Co-locate LC applications with BE tasks on a single server.
◦ Measure the end-to-end latency of websearch on tens of servers while BE tasks are also running.
Effective Machine Utilization (EMU)
◦ LC throughput + BE throughput
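As a worked example of the EMU metric, the sketch below assumes each throughput is normalized to what the same workload achieves when it runs alone on the server; that normalization, the function name `emu`, and all numbers are assumptions for illustration.

```python
def emu(lc_tput: float, lc_tput_alone: float,
        be_tput: float, be_tput_alone: float) -> float:
    """EMU = normalized LC throughput + normalized BE throughput."""
    return lc_tput / lc_tput_alone + be_tput / be_tput_alone


# Hypothetical numbers: the LC job serves 4,000 of a possible 10,000 QPS,
# and the BE task completes 50 of a possible 100 work units per second.
value = emu(4000.0, 10000.0, 50.0, 100.0)
```

Under this normalization the sum can exceed 1.0 when the two workloads stress different resources and lose little from sharing.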
![Page 18: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/18.jpg)
Workloads
Three Google production LC workloads:
◦ websearch
◦ ml_cluster
  ◦ Real-time text clustering using machine learning
◦ memkeyval
  ◦ In-memory key-value store
Run LC workloads with benchmarks that stress a single shared resource.
◦ Stream-LLC, Stream-DRAM, cpu-pwr, iperf, brain, and streetview.
![Page 19: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/19.jpg)
Latency of LC Applications
![Page 20: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/20.jpg)
EMU
![Page 21: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/21.jpg)
Shared Resource Utilization
![Page 22: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/22.jpg)
Websearch in Cluster
![Page 23: Heracles: Improving Resource Efficiency at Scale ISCA’15 Stanford University Google, Inc.](https://reader036.fdocuments.in/reader036/viewer/2022070404/56649f345503460f94c51626/html5/thumbnails/23.jpg)
Conclusion
Heracles
◦ A heuristic, feedback-based system that manages four isolation mechanisms so that a latency-critical workload can be co-located with batch jobs without SLO violations.
◦ Evaluation on real hardware demonstrates an average utilization of 90% across all evaluated scenarios without any SLO violations for the latency-critical job.