Structure of Computer Systems
Course 6
Multi-core systems
Multithreading and multi-processing
Exploiting different forms of parallelism:
• data-level parallelism (DLP): the same operations applied to a set of data; SIMD architectures, multiple ALUs
• instruction-level parallelism (ILP): instruction phases executed in parallel; pipeline architectures
• thread-level parallelism (TLP): instruction sequences/streams executed in parallel; hyper-threading, multiprocessor architectures (multi-core, GRID, cloud, parallel computers)
Thread-level parallelism execution issues:
• synchronization between threads
• data consistency
• concurrent access to shared resources
• communication between threads
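The issue of concurrent access to a shared resource can be illustrated with a minimal sketch. The function name `count_with_lock` and its parameters are hypothetical, chosen only for this example; the lock-protected counter is one common way to keep shared data consistent, not the only one.

```python
import threading


def count_with_lock(num_threads: int, increments: int) -> int:
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # The lock makes the read-modify-write sequence atomic,
            # so no update from another thread is lost.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every thread before reading the result
    return counter


if __name__ == "__main__":
    print(count_with_lock(4, 10_000))
```

Without the lock, two threads could read the same old value and one increment would be lost; the `join` calls are the synchronization point that makes the final read safe.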
Multiprocessing
Limits of performance increase
Amdahl's law:
$$S = \frac{t_s}{t_p} = \frac{t_s}{(1-q)\,t_s + q\,\frac{t_s}{n}} = \frac{1}{(1-q) + \frac{q}{n}}$$
where:
• S: speedup of a parallel execution
• t_s: time for sequential execution
• t_p: time for parallel execution
• q: fraction of a program which can be executed in parallel
• n: number of nodes/threads
Examples:
q=50%, n->∞ => S=2
q=75%, n->∞ => S=4
q=95%, n->∞ => S=20
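The examples above follow directly from Amdahl's law, S = 1 / ((1 - q) + q / n). The following is a small sketch for checking them; the function name `amdahl_speedup` is an assumption made for this illustration, and `math.inf` stands in for the limit n → ∞.

```python
import math


def amdahl_speedup(q: float, n: float) -> float:
    """Speedup S = 1 / ((1 - q) + q / n) for parallel fraction q on n nodes."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # q / math.inf evaluates to 0.0, which gives the limit for n -> infinity.
    denominator = (1.0 - q) + q / n
    if denominator == 0.0:
        return math.inf  # fully parallel program on infinitely many nodes
    return 1.0 / denominator


if __name__ == "__main__":
    for q in (0.50, 0.75, 0.95):
        print(f"q={q:.2f}: n=8 -> {amdahl_speedup(q, 8):.2f}, "
              f"limit -> {amdahl_speedup(q, math.inf):.2f}")
```

The limit values depend only on the sequential fraction 1 - q, which is why adding nodes beyond a certain point brings very little extra speedup.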
Hyper-threading
• hyper-threading: parallel execution of instruction streams on a single CPU
• Idea: when a thread is stalled because of a hazard, another thread can be executed
• Solution:
    • two threads are executed in parallel on the same pipelined CPU
    • after every stage, two buffers (registers) store the partial results of the two threads
• Speedup: approximately 30%
• The operating system will detect 2 logical CPUs!
[Figure: the five pipeline stages (IF, ID, Ex, M, Wb) shown for single-threaded execution (one thread) and for hyper-threaded execution (Thread 1 and Thread 2).]
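The remark that the operating system detects logical CPUs can be observed from a program. The sketch below uses Python's standard `os.cpu_count()`, which reports logical CPUs; the wrapper name `logical_cpu_count` is hypothetical.

```python
import os


def logical_cpu_count() -> int:
    """Number of logical CPUs the operating system reports (at least 1)."""
    # os.cpu_count() may return None when the count cannot be determined.
    return os.cpu_count() or 1


if __name__ == "__main__":
    print(f"The operating system reports {logical_cpu_count()} logical CPU(s)")
```

On a machine with hyper-threading enabled, this number is typically twice the number of physical cores. The Python standard library has no portable call for the physical core count, so the sketch does not attempt to report it.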
Multiprocessors
Parallel execution of instruction streams on multiple CPUs
Implementations:
• multi-core architectures: multiple CPUs in a single integrated circuit (IC)
• parallel computers: multiple CPUs on different ICs, but in the same computer infrastructure
• distributed computing facilities: multiple CPUs on different computers, connected through a network
    • network of PCs
    • GRID architectures: distributed computing resources for virtual organizations (VOs), mainly for batch processing
    • cloud architectures: computing resources (execution and storage) offered as a service; they can be hired dynamically
• combination of all the above: multi-cores in parallel computers, building distributed computing facilities
Multi-core processors
Why multi-core:
• it is difficult to push single-core clock frequencies even higher; in the last 4-5 years clock frequency growth has saturated at 2.5-3 GHz
• power consumption and dissipation problems (higher frequency means more power)
• pipeline architectures (instruction-level parallelism) have reached their efficiency limits (around 20 pipeline stages)
• designing a very complex CPU (with multiple optimization schemes involved) requires the coordination of very large design teams
• many new applications are multithreaded (e.g. servers that handle multiple concurrent requests, agent systems, gaming, simulation, etc.)
Multi-core processors
Issues (design decisions):
• same or different functionalities for the CPUs (homogeneous vs. heterogeneous CPUs)
    • symmetric cores (SMP, symmetric multi-core processor): every core has the same structure and functionality
    • asymmetric cores (ASMP): there are coordination cores and (simpler) specialized cores
• the relation with the memory
    • symmetric memory access (SYMA), more commonly called uniform memory access (UMA)
    • non-uniform memory access (NUMA)
• connection between cores
    • common bus: parallel or network-based (see network-on-chip)
    • crossbar: multiple connections controlled with a switch
    • memory hierarchy (cache): common memory zones
Multi-core processors
Architectural solutions
[Figure: Symmetric multi-core with private L1 cache and shared L2 and memory. Labels shown: two cores, each with its own L1; a switch; a shared L2; memory.]
[Figure: Symmetric multi-core with partially shared L2 and L3. Labels shown: four cores, each with its own L1; L2 and L3 caches; a crossbar; Memory Module 1 and Memory Module 2.]
Multi-core processors
Architectural solutions (cont.)
[Figure: Heterogeneous multi-core with local and shared cache. Labels shown: Core (2x SMT), L1, L2, cores with Local Store, ring network, I/O, Memory Module.]
[Figure: Two processors with two cores and shared memory. Labels shown: Processor 1 and Processor 2, each with two cores, private L1 caches, an L2 and a switch; Memory.]
Multi-core processors
Shared cache: high-speed memory used by a number of cores (CPUs)
Advantages:
• efficient allocation of the existing memory space
• one core may pre-fetch data for the other core
• sharing of common data
• no cache coherence problems within the shared cache (only one copy of each line exists)
• fewer accesses to external memory
Drawbacks:
• conflicts between cores when allocating space in the cache; one core may replace the other core's data
• more complex control circuit and longer latency because of the switching
• one core may lock out the other core's access
Multi-core processors
Cache coherence of private memory
How to keep the data consistent across caches?
Solutions:
• write-through: every write is also made to memory; not very efficient
• snooping and invalidation: each core snoops the bus and invalidates its cache line if a write from another core affects its cache's content (e.g. the snooping phase of the Pentium Pro's P6 bus)
[Figure: four cores (core 1 to core 4), each with a private cache, connected to a shared memory; a write by one core and a read by another core illustrate the inconsistency.]
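The combination of write-through and snoop-based invalidation can be sketched as a toy model. The classes `Bus` and `Cache` are hypothetical and heavily simplified (one word per cache line, no replacement policy, no timing); they only illustrate the idea and do not model the P6 bus protocol.

```python
class Bus:
    """Shared bus: holds main memory and the list of attached caches."""

    def __init__(self):
        self.memory = {}   # address -> value
        self.caches = []   # every private cache snooping this bus

    def broadcast_write(self, writer, addr, value):
        self.memory[addr] = value            # write-through to memory
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(addr)  # other cores drop stale copies


class Cache:
    """Private cache of one core (toy model, one word per line)."""

    def __init__(self, bus):
        self.bus = bus
        self.lines = {}    # address -> cached value
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.bus.broadcast_write(self, addr, value)

    def snoop_invalidate(self, addr):
        self.lines.pop(addr, None)           # invalidate the line if present


if __name__ == "__main__":
    bus = Bus()
    core1, core2 = Cache(bus), Cache(bus)
    bus.memory[0x10] = 1
    core1.read(0x10)
    core2.read(0x10)          # both caches now hold the old value
    core1.write(0x10, 7)      # core2's copy is invalidated by snooping
    print(core2.read(0x10))   # core2 misses and re-reads the new value
```

Without the `snoop_invalidate` step, the second core would keep returning its stale copy, which is the inconsistency shown in the figure.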
Multi-core processors
Symmetric vs. asymmetric cores
Symmetric architecture
• all cores are the same
• cores can perform any task; they are interchangeable
• Advantages:
    • easy to build (simple replication)
    • easy to program, compile and execute multithreaded programs
• Examples:
    • Intel, AMD: dual- and quad-core, Core 2
    • Sun: UltraSPARC T1 (Niagara), 8 cores
Multi-core processors
Symmetric vs. asymmetric cores (cont.)
Asymmetric (heterogeneous) architecture
• some cores have different functionalities:
    • 1-2 master cores and many (simpler) slave cores
    • 1 main core and multiple specialized cores (graphics, floating point, multimedia)
• compilation should take into account which functionalities each core can perform
• Advantages:
    • many more simple cores can be integrated
• Examples:
    • IBM: Cell processor, used in the PlayStation 3
Multi-core processors
Asymmetric (heterogeneous) architecture
IBM Cell architecture: 9 cores
• 1 PPE (Power Processor Element): coordination and data transfer
• 8 SPEs (Synergistic Processing Elements): specialized mathematical units
• Applications: supercomputers, PlayStation consoles, home cinema, video cards
Multi-core processors
Advantages of multi-core processors:
• Signals between different CPUs travel shorter distances, so they degrade less.
• These higher-quality signals allow more data to be sent in a given time period, since individual signals can be shorter and do not need to be repeated as often.
• Cache coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip.
• A dual-core processor uses slightly less power than two coupled single-core processors.
Multi-core processors
Disadvantages of multi-core processors:
• The ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.
    • Most current video games will run faster on a 3 GHz single-core processor than on a 2 GHz dual-core processor (of the same core architecture).
• Two processing cores sharing the same system bus and memory bandwidth limit the real-world performance advantage.
    • If a single core is close to being memory-bandwidth limited, going to dual-core might only give a 30% to 70% improvement.
    • If memory bandwidth is not a problem, a 90% improvement can be expected.
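The comparison between a 3 GHz single-core and a 2 GHz dual-core processor can be worked through with a rough model. The sketch assumes that performance scales linearly with clock frequency and that Amdahl's law describes the parallel speedup; both are simplifications, and the function names are hypothetical.

```python
def relative_performance(q: float, n: int, clock_ratio: float) -> float:
    """Multi-core performance relative to the faster single core.

    q           -- fraction of the program that runs in parallel (0..1)
    n           -- number of cores
    clock_ratio -- multi-core clock divided by single-core clock
    Assumes linear scaling with clock frequency and Amdahl's law.
    """
    speedup = 1.0 / ((1.0 - q) + q / n)
    return clock_ratio * speedup


def break_even_fraction(n: int, clock_ratio: float) -> float:
    """Smallest parallel fraction q at which the n-core chip matches the
    single core. A result above 1 means it can never catch up."""
    if n < 2:
        raise ValueError("n must be at least 2")
    # Solve clock_ratio / ((1 - q) + q / n) = 1 for q.
    return (1.0 - clock_ratio) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    ratio = 2.0 / 3.0   # 2 GHz dual-core versus 3 GHz single-core
    print(f"break-even q: {break_even_fraction(2, ratio):.3f}")
    for q in (0.3, 0.5, 0.9):
        print(f"q={q}: relative performance {relative_performance(q, 2, ratio):.2f}")
```

Under these assumptions the dual-core chip wins only when about two thirds or more of the work runs in parallel, which is consistent with the claim that mostly single-threaded games run faster on the higher-clocked single core.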
Multi-core processors
Thread affinity: we can specify whether a thread may be executed on any core or only on a specific core
• soft affinity: controlled by the operating system
    • an interrupted thread should continue on the same core
• hard affinity: flags associated with a thread that indicate on which core(s) it may be executed
    • useful for real-time and control applications, to reduce the load on a core on which critical threads are executed
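Hard affinity can be set from a program on some operating systems. The sketch below uses Python's `os.sched_setaffinity` and `os.sched_getaffinity`, which are available on Linux but not on every platform; the wrapper name `pin_current_process` is hypothetical, and the fallback behavior on unsupported platforms is a choice made for this example.

```python
import os


def pin_current_process(cores: set) -> set:
    """Restrict the caller to the given cores (hard affinity).

    Returns the affinity set in effect afterwards. On platforms without
    os.sched_setaffinity the function changes nothing and returns an
    empty set.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    os.sched_setaffinity(0, cores)   # pid 0 refers to the caller itself
    return set(os.sched_getaffinity(0))


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        allowed = os.sched_getaffinity(0)
        print("allowed cores before:", sorted(allowed))
        print("allowed cores after: ", sorted(pin_current_process({min(allowed)})))
        os.sched_setaffinity(0, allowed)   # restore the original mask
    else:
        print("hard affinity is not supported through os on this platform")
```

Leaving the mask untouched corresponds to soft affinity: the scheduler is free to choose the core and usually tries to keep a thread where its cached data already is.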