Fast switching of threads between cores - Advanced Operating Systems

Fast Switching of Threads Between Cores

Richard Strong & Dean Tullsen (University of California, San Diego)

Jayaram Mudigonda, Jeffrey C. Mogul & Nathan Binkert (HP Labs)

Ruhaim Izmeth | MS14901218

Nipuna Pannala | MS14902208

Introduction

● We are now in the multicore era.
● Multicore CPUs enable inter-core communication at far lower cost, in order-of-magnitude terms, than traditional multiprocessors. This reduces the time the hardware needs to move a migrating thread's working set.
● But the software cost of moving a thread remains high.

Asymmetric Multicore Processor

● Core-to-core performance asymmetry appears to be a very useful way to improve energy and area efficiency.

● It costs relatively little performance but gives greater throughput per watt.

● Asymmetric multicore processors increase the need for frequent and very efficient migration of threads between cores.

Fast Switching of Threads between Cores

● To get good performance when switching threads between cores:
○ The OS scheduler needs to migrate a thread from a slow core to a fast or ideal core.
○ It is also necessary to balance the load between cores (in a symmetric or asymmetric system).
○ All thread execution time segments should be relatively short.

Simple Cores…

● Simple cores can normally be a better match for memory-bound application code.
○ Operating systems and OS-like code are typical memory-bound workloads.

Thread Migration Techniques

● Migration Mechanism 1: Constantinou
○ This work considered a variety of costs associated with thread migration, but its primary focus was warming up the thread's state (caches and branch predictors).
○ It does not address the software cost of migrating threads between cores.

● Migration Mechanism 2: Choi
○ This work covers the specific case of migrating the branch predictor state when a thread switches cores.
○ It does not address the software overhead issues.

Shared Thread Multiprocessor: Brown & Tullsen

● Hardware manages thread movement.
● Thread state is represented in hardware and is shared among all the cores on a chip.
● Therefore hardware can move threads between cores without direct OS involvement.

Software Approaches to Core Switching

• Is core B in the idle state?
• Is there any thread to run on core A after T switches to B?
• Can we ensure T is the most appropriate thread to run on B?

Transfer the architectural state of thread T from A to B.
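The three checks above can be sketched as a small predicate. This is only an illustration: the `Thread` and `Core` classes, their fields, and `can_switch` are hypothetical names, not taken from the paper or from Linux, and "most appropriate" is approximated here by a simple priority comparison.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Thread:
    name: str
    priority: int  # higher value = more important (assumption for this sketch)


@dataclass
class Core:
    name: str
    running: Optional[Thread] = None
    run_queue: List[Thread] = field(default_factory=list)

    def is_idle(self) -> bool:
        return self.running is None and not self.run_queue


def can_switch(t: Thread, a: Core, b: Core) -> dict:
    """Answer the three questions for moving thread t from core a to core b."""
    return {
        "target_idle": b.is_idle(),
        # Will A have something else to run once T leaves?
        "source_has_work": any(x is not t for x in a.run_queue),
        # T counts as 'most appropriate' if nothing queued on B outranks it.
        "t_best_for_target": all(x.priority <= t.priority for x in b.run_queue),
    }


t = Thread("T", priority=5)
n = Thread("N", priority=3)
core_a = Core("A", running=t, run_queue=[n])
core_b = Core("B")
print(can_switch(t, core_a, core_b))
```

Only when the answers are acceptable would the architectural state of T actually be transferred; that step is not modelled here.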

Approaches used in the research

● V1: Linux’s thread-migration mechanism
● V2: Modified scheduler
● V3: Scheduler fast-paths
● V4: Addressing IPI costs
● V5: Cross-core wakeup from quiesce

V1: Linux Thread Migration Mechanism

● Normally used for relatively long-term load balancing across the cores.

● Linux's thread-migration mechanism is the state of the art in core switching.

● One thread is available to initiate the migration.

V1: Linux Thread Migration Mechanism

● When a task wants to migrate, it puts itself on the per-core migration queue.

● If the target core is idle, the thread wakes up from the per-core migration queue and moves to the run queue of the target core.

● After getting approval from the target queue, the thread executes on the target core.

V1: Linux Thread Migration Mechanism: Cons

● This migration approach involves an “extra” context switch between the initiating thread and the migrating thread.

Increasing the Efficiency of the Linux Thread Migration Mechanism

● To remove the extra context switch:
○ Let threads make migration decisions themselves
○ Centralize the thread status
○ Increase the number of per-core queues
○ Create cross-core signals

V2: Modified scheduler

[Figure: V2 modified scheduler. Labels: Core 0, Core 1, threads N and T, a Run Queue per core, Alternative Queue (AQ), schedule() interrupt, control block of T (Core: 1), SwitchCore(1).]

● Removes the extra context switch described in V1.
● Thread migration is initiated by the process itself.
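A toy model may help make the V2 idea concrete. It is a sketch under assumptions, not the modified kernel code: the `Core` and `Thread` classes, the `switch_core` function, the placement of the alternative queue on the target core, and the ordering of the steps are guesses made for illustration from the labels on the slide.

```python
from collections import deque


class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.run_queue = deque()   # threads ready to run on this core
        self.alt_queue = deque()   # AQ: threads arriving from other cores
        self.current = None

    def schedule(self):
        """Pick the next thread; migrants waiting in the AQ are admitted first."""
        while self.alt_queue:
            self.run_queue.append(self.alt_queue.popleft())
        self.current = self.run_queue.popleft() if self.run_queue else None
        return self.current


class Thread:
    def __init__(self, name):
        self.name = name
        self.target_core = None    # stands in for the "Core: 1" control-block field


def switch_core(thread, source, target):
    """Thread-initiated migration: no helper thread, so no extra context switch."""
    thread.target_core = target.core_id
    source.current = None              # the thread gives up the source core
    target.alt_queue.append(thread)    # hand-off through the target's AQ
    source.schedule()                  # the source picks its next thread
    return target.schedule()           # models the schedule() interrupt on the target


core0, core1 = Core(0), Core(1)
t, n = Thread("T"), Thread("N")
core0.current = t
core0.run_queue.append(n)
switch_core(t, core0, core1)
print(core0.current.name, core1.current.name)
```

After the call, N runs on core 0 and T runs on core 1, with the migrating thread itself driving the hand-off.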

V3: Scheduler fast-paths

● The original modified schedule()
● A fast schedule source version (FSS), called to initiate a core switch
● A fast schedule target version (FST), called at the target core in response to the cross-core signal

FSS and FST omit a number of housekeeping functions normally done in schedule() (e.g. priority calculation).

FSS only passes a hint to FST, so no locking takes place.

FST checks the AQ; FSS does not.

V3: Scheduler fast-paths

V4: Addressing inter-processor interrupt (IPI) costs

Inter-processor interrupts are sent to ‘wake up’ polling or paused processors.

The modified scheduler wakes up the target core if it is idle.

The IPI-sending code, which sends interrupts to all members of a specified set, was modified to be more efficient.

schedule() is invoked on the target core by the interrupt.

Modified System Calls

Long-running system calls were modified to initiate CoreSwitch().

Modified system calls: open, stat, read, write, readv, writev, select, poll, fsync, fdatasync, recvfrom, sendto and sendfile.

4096 bytes
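The policy of switching cores only on selected system calls can be pictured as a simple lookup. The sketch below is an assumption-laden illustration: `CORE_SWITCH_SYSCALLS`, `dispatch`, and the core labels are hypothetical names, and the real change lives inside the kernel's system call paths rather than in a table like this.

```python
# System calls that initiate a core switch in this toy policy.
CORE_SWITCH_SYSCALLS = frozenset({
    "open", "stat", "read", "write", "readv", "writev", "select", "poll",
    "fsync", "fdatasync", "recvfrom", "sendto", "sendfile",
})


def dispatch(syscall, os_core, app_core):
    """Return the core a call would run on: the OS core for listed calls."""
    return os_core if syscall in CORE_SWITCH_SYSCALLS else app_core


print(dispatch("fdatasync", os_core="simple", app_core="complex"))
print(dispatch("gettid", os_core="simple", app_core="complex"))
```

Short calls such as gettid stay on the application core in this model, since a switch would cost more than the call itself.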

Simulation Environment

The M5 simulator was used to generate detailed timelines showing when interesting events such as procedure calls, cache misses, and long-latency instructions occur.

The x86 models in M5 are not yet debugged, so Alpha-based cores are simulated:

● Complex core: Alpha EV6 (21264), 64KB L1
● Simple core: EV4-based (21064), 8KB L1
● Shared L2 of 3.5 MBytes
● Main-memory access time of 25 nsec

sim_XXX: the number of 'X' characters denotes the number of processors

e.g. sim_c: single processor

sim_sC: dual processor

Simulation Environment - Configuration naming scheme

Core type   Prefix at 750 MHz   Prefix at 3 GHz
Complex     c                   C
Simple      s                   S
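The naming scheme can be decoded mechanically. The helper below is a hypothetical illustration (the `PREFIXES` table and `decode` function are not part of M5); it assumes each letter after `sim_` describes one core, with lower case meaning 750 MHz and upper case meaning 3 GHz.

```python
# One letter per core: c/C = complex, s/S = simple; case selects the clock.
PREFIXES = {
    "c": ("complex", "750 MHz"),
    "C": ("complex", "3 GHz"),
    "s": ("simple", "750 MHz"),
    "S": ("simple", "3 GHz"),
}


def decode(config):
    """Decode a name such as 'sim_sC' into one (type, clock) pair per core."""
    prefix, _, letters = config.partition("_")
    if prefix != "sim" or not letters:
        raise ValueError(f"not a simulator configuration name: {config!r}")
    return [PREFIXES[ch] for ch in letters]


print(decode("sim_c"))
print(decode("sim_sC"))
```

Under this reading, `sim_sC` is a dual-processor system with one simple 750 MHz core and one complex 3 GHz core.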

Tests run on Linux v 2.6.18 kernel

Only one trial run per experiment, as the simulator is deterministic

Microbenchmark results

gettid() was modified to call coreswitch(), and the call was run N = 1,000,000 times in a tight loop.
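The shape of such a microbenchmark can be sketched in user space. This is not the authors' harness: `time_calls` is a hypothetical helper, and `threading.get_native_id()` (Python 3.8+) is used only as a convenient way to ask the kernel for the thread ID. On a stock kernel the call does not switch cores, and interpreter overhead dominates the measured time, so the numbers say nothing about the paper's results.

```python
import threading
import time


def time_calls(fn, n):
    """Return the mean wall-clock time per call of fn over n back-to-back calls."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n


# N matches the slide; the callee stands in for the modified gettid().
per_call = time_calls(threading.get_native_id, 1_000_000)
print(f"{per_call * 1e9:.0f} ns per call")
```

Dividing the total time by N amortises the loop start-up and timer overhead, which is why a tight loop with a large N is used.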

Cross-core wakeup from quiesce

● Idle-loop polling is inefficient.

● Initiating a cross-CPU interrupt is slow, as a powered-down CPU needs to be awakened.

● The kernel should dynamically decide between spinning (polling) and powering down, based on recent history.
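One way to picture a history-based decision is shown below. This is a minimal sketch, not the mechanism from the paper: the `IdlePolicy` class, the break-even threshold, and the use of a moving average over the last few idle periods are all assumptions chosen for illustration.

```python
from collections import deque


class IdlePolicy:
    """Choose between polling and powering down from recent idle-period lengths."""

    def __init__(self, break_even_us, history=8):
        self.break_even_us = break_even_us      # assumed cost of a sleep/wake cycle
        self.recent = deque(maxlen=history)     # last few observed idle periods

    def record(self, idle_us):
        self.recent.append(idle_us)

    def decide(self):
        if not self.recent:
            return "poll"                       # no history yet: stay responsive
        mean_idle = sum(self.recent) / len(self.recent)
        return "power_down" if mean_idle > self.break_even_us else "poll"


policy = IdlePolicy(break_even_us=50)
for us in (5, 8, 6):
    policy.record(us)
print(policy.decide())
for us in (400, 900, 700):
    policy.record(us)
print(policy.decide())
```

With short recent idle periods the core keeps polling so that a cross-core wakeup is cheap; once idle periods grow well past the break-even point, powering down becomes the better choice.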

Macrobenchmark results - Web Benchmark

Macrobenchmark results - Database Benchmark

Using “TPC-B-like” example from the Berkeley DB distribution

Core switch done only on fdatasync()

Eliminated disk I/O delays by using a RAM disk on the real hardware, and by setting the access time to zero in M5’s disk simulator.

Future Work

● Energy measurement/savings benchmarks for the above tests

● Determining the best core to switch to and the best time to switch

● Finding the optimal mechanism for deciding whether to poll or power down a processor

Summary

● The cost of core switching matters more when asymmetric multicores are used.

● Core switching to slower OS cores on frequent, expensive system calls sometimes reduces performance.
○ But it also provides the opportunity to power down the complex application cores.

References

● J. Aas. Understanding the Linux 2.6.8.1 CPU Scheduler. http://josh.trancesoftware.com/linux/, Feb. 2005.

● S. Balakrishnan, R. Rajwar, M. Upton, and K. Lai. The Impact of Performance Asymmetry in Emerging Multicore Architectures. In Proc. ISCA, pages 506–517, 2005.

● M. Becchi and P. Crowley. Dynamic Thread Assignment on Heterogeneous Multiprocessor Architectures. J. Instruction Level Parallelism, pages 1–26, June 2008.

● N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 Simulator: Modeling Networked Systems. IEEE Micro, 26(4):52–60, 2006.

● D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-level power analysis and optimizations. In Proc. ISCA, pages 83–94, Jun. 2000.

Thank You
