Bab 5 Mesin Selari - ftsm.ukm.myweek4parallelmachine).pdf · 4 Type of Processing SISD Single...

Parallel Machine

CPU Usage

A normal computer has 1 CPU and 1 memory.

Problem (the Von Neumann bottleneck): processing is slow because the CPU is faster than the memory.

Solution: use multiple CPUs or multiple ALUs for simultaneous processing. Such machines are known as parallel computers or multiprocessor computers.

To improve the efficiency of a computer: increase the speed of the processor and improve memory access.

Type of Processing

Flynn Taxonomy

Single Instruction: SISD, SIMD
Multiple Instruction: MISD, MIMD

Type of Processing

SISD (Single Instruction Single Data)
A single processor executes a single instruction stream to operate on data stored in a single memory. Example: Von Neumann machine.

SIMD (Single Instruction Multiple Data)
A single machine instruction controls the simultaneous execution of a number of processing elements on a lockstep basis. Each processing element has an associated data memory, so that each instruction is executed on a different set of data. Two forms: vector and parallel.
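The difference between SISD and SIMD can be sketched in a few lines of Python. This is only an illustration: the function names `run_sisd` and `run_simd` and the two-instruction `program` are assumptions made for the example, and the "processing elements" are simulated by list entries rather than real hardware.

```python
def run_sisd(program, value):
    """One processor, one instruction stream, one data item."""
    for instruction in program:
        value = instruction(value)
    return value


def run_simd(program, pe_memories):
    """One instruction stream broadcast to many processing elements (PEs).

    Each PE holds its own data item. All PEs finish the current
    instruction before the next one is issued (lockstep).
    """
    values = list(pe_memories)
    for instruction in program:
        values = [instruction(v) for v in values]
    return values


# Hypothetical two-instruction program: add 1, then double.
program = [lambda x: x + 1, lambda x: x * 2]

print(run_sisd(program, 3))             # 8
print(run_simd(program, [0, 1, 2, 3]))  # [2, 4, 6, 8]
```

In both cases there is a single instruction stream; only the number of data items processed per instruction changes.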

Type of Processing

MISD (Multiple Instruction Single Data)
A sequence of data is transmitted to a set of processors, each of which executes a different instruction sequence. This structure is unusual and has not been commercially implemented.

MIMD (Multiple Instruction Multiple Data)
A set of processors simultaneously execute different instruction sequences on different data sets. Examples: SMP (Symmetric Multiprocessor) and NUMA (Non-Uniform Memory Access).
Shared memory: connected by switch or bus.
Distributed / local memory: connected by switch or bus.

MIMD Distributed Memory

Uses many connected CPUs.
Each CPU controls the execution of its own operations separately.
Can perform various tasks simultaneously.
Two techniques for connecting CPUs and memory:
Direct connection
Net/grid connection

In a net/grid connection, the path between one corner and the opposite corner is long. Solution: use the hypercube / n-cube.

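Direct Connection

Net Connection

Hypercube Connection

Figure: a 3-cube whose eight nodes are labelled 000, 001, 010, 011, 100, 101, 110 and 111.

Route from 100 to 111:

100 XOR 111 = 011

The two 1 bits in the result mark the positions in which the addresses differ. Changing one of these bits at a time gives the possible routes, which pass through 110 or through 101.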
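The XOR rule above can be checked with a short Python sketch. The function names `differing_dimensions` and `shortest_routes` are illustrative assumptions; the sketch simply flips the differing address bits in every possible order.

```python
from itertools import permutations


def differing_dimensions(src, dst):
    """Bit positions in which the two node addresses differ (src XOR dst)."""
    diff = src ^ dst
    return [d for d in range(diff.bit_length()) if (diff >> d) & 1]


def shortest_routes(src, dst, width=3):
    """All shortest routes, obtained by flipping the differing bits in every order."""
    routes = []
    for order in permutations(differing_dimensions(src, dst)):
        node, path = src, [src]
        for d in order:
            node ^= 1 << d          # move along dimension d
            path.append(node)
        routes.append([format(n, f"0{width}b") for n in path])
    return routes


print(format(0b100 ^ 0b111, "03b"))   # 011
for route in shortest_routes(0b100, 0b111):
    print(" -> ".join(route))
# 100 -> 101 -> 111
# 100 -> 110 -> 111
```

The number of hops equals the number of 1 bits in the XOR result, here two.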

MIMD Shared Memory - Bus

Using a bus is simple and easy.

Figure: CPU1, CPU2 and CPU3, each with its own cache memory, connected to one shared memory.

Using a bus
Problem: Von Neumann bottleneck.
Solution: use a cache memory in each CPU.

Problem: cache coherence. Two processors read the same data. When one of them changes the data, the other processor assumes its copy is still the original and does not know about the change.
Solution: handle coherence in software or in hardware.
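The coherence problem described above can be reproduced with a small simulation. This is a sketch under assumptions: the `CPU` class, the address `0x10` and the write-through behaviour are invented for the illustration and do not model any particular machine.

```python
class CPU:
    """A CPU with a private cache in front of a shared memory (no coherence)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory (a dict: address -> value)
        self.cache = {}        # private cache (address -> value)

    def read(self, addr):
        if addr not in self.cache:            # miss: copy from main memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value             # other caches are NOT informed


memory = {0x10: 5}
cpu1, cpu2 = CPU(memory), CPU(memory)

cpu1.read(0x10)
cpu2.read(0x10)            # both caches now hold the value 5
cpu1.write(0x10, 99)       # CPU1 changes the data

stale = cpu2.read(0x10)
print(stale, memory[0x10])  # 5 99
```

CPU2 still sees the old value because its cached copy was never updated or invalidated.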

Solution in software
Classify the data:
Shared: read-only or read-write
Unshared

The problem arises only with shared read-write data.

Solution: do not allow this data to be cached.
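A minimal sketch of the software approach, assuming each address has been tagged with one of the three classes above. The names `DataClass`, `cacheable` and `CPU` are hypothetical and chosen only for this example.

```python
from enum import Enum


class DataClass(Enum):
    UNSHARED = 1
    SHARED_READ_ONLY = 2
    SHARED_READ_WRITE = 3


def cacheable(data_class):
    """Only shared read-write data is kept out of the caches."""
    return data_class is not DataClass.SHARED_READ_WRITE


class CPU:
    def __init__(self, memory, classes):
        self.memory = memory     # shared main memory
        self.classes = classes   # address -> DataClass
        self.cache = {}

    def read(self, addr):
        if not cacheable(self.classes[addr]):
            return self.memory[addr]          # always go to main memory
        if addr not in self.cache:
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def write(self, addr, value):
        self.memory[addr] = value
        if cacheable(self.classes[addr]):
            self.cache[addr] = value


memory = {0x10: 5}
classes = {0x10: DataClass.SHARED_READ_WRITE}
cpu1, cpu2 = CPU(memory, classes), CPU(memory, classes)

cpu2.read(0x10)
cpu1.write(0x10, 99)
fresh = cpu2.read(0x10)
print(fresh)   # 99
```

The price of this approach is that every access to shared read-write data pays the full main-memory latency.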

MIMD Shared Memory - Bus

Solution in hardware
Uses a cache memory controller and a cache memory resolution protocol.
The block containing the required word is loaded into the cache memory.
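The slides do not say which protocol the cache controllers use. As one possible illustration, the sketch below assumes a simple write-invalidate scheme in which every controller watches the bus; the class names `Bus` and `CacheController` are hypothetical.

```python
class Bus:
    """Shared bus that every cache controller is attached to."""

    def __init__(self):
        self.controllers = []

    def attach(self, controller):
        self.controllers.append(controller)

    def broadcast_invalidate(self, addr, sender):
        for controller in self.controllers:
            if controller is not sender:
                controller.lines.pop(addr, None)   # drop the stale copy, if any


class CacheController:
    def __init__(self, bus, memory):
        self.bus, self.memory, self.lines = bus, memory, {}
        bus.attach(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: load from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_invalidate(addr, self)  # assumed write-invalidate step
        self.lines[addr] = value
        self.memory[addr] = value                  # write-through, for simplicity


memory = {0x10: 5}
bus = Bus()
c1, c2 = CacheController(bus, memory), CacheController(bus, memory)

c1.read(0x10)
c2.read(0x10)
c1.write(0x10, 99)
print(c2.read(0x10))   # 99
```

After the write, the second controller misses on its next read and reloads the current value, so no software classification of the data is needed.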

MIMD Shared Memory - Switch

A crossbar switch connects n CPUs with k memories.
Advantage: the network is non-blocking.
Disadvantage: it uses many crosspoints (the number grows as n^2).

Omega Network

Has log2 n stages (levels) with n/2 switches in each stage.
Example: omega network for 8 CPUs x 8 memories
Stages: log2 8 = 3
Switches per stage: 8/2 = 4
Total number of switches: 3 * 4 = 12

Advantage: fewer crosspoints. Disadvantage: the network is blocking.

Crossbar switch, for comparison: 8 CPUs x 8 memories = 64 switches
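The switch counts above follow directly from the two formulas. A short sketch, with the hypothetical helper names `crossbar_crosspoints` and `omega_switches`:

```python
def crossbar_crosspoints(n_cpus, n_memories):
    """A crossbar needs one crosspoint per (CPU, memory) pair."""
    return n_cpus * n_memories


def omega_switches(n):
    """Return (stages, switches per stage, total) for an n x n omega network."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")
    stages = n.bit_length() - 1      # exact log2(n) for a power of two
    per_stage = n // 2
    return stages, per_stage, stages * per_stage


print(crossbar_crosspoints(8, 8))   # 64
print(omega_switches(8))            # (3, 4, 12)
```

The gap widens quickly: for n = 64 the crossbar needs 4096 crosspoints, while the omega network needs 6 * 32 = 192 switches.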

Omega Network

Figure: 8 x 8 omega network. Inputs 000 to 111 on one side, outputs 000 to 111 on the other, and three stages of four switches each: 1A to 1D, 2A to 2D, 3A to 3D.

Benes Network

Resolves the blocking problem of the omega network.
Uses more switches and more stages.
Provides more route options from CPU to memory.

SIMD Parallel Computer

Executes the same instruction on many data items simultaneously.
Simpler, cheaper and very fast.
Example: Connection Machine.

Connection Machine

Consists of 4 quadrants, which can be operated separately.
1 quadrant = 2 parts of 8K PE (8192 processors) each.
Each quadrant has:
ALU
8 Kb memory
4-bit flags
Interface with memory and the I/O system
1 router

Connection Machine

The compiler is written in C or LISP.
Each 8K PE section of a quadrant is divided into 2 sub-cubes of 4K PE (256 processor chips).
Each 4K PE sub-cube has its own I/O system.
I/O bus width = 64 bits.
Has 39 I/O disk drives, with 1 bit per disk.

SIMD Vector Computer

The Connection Machine is only suitable for solving artificial intelligence problems.
For floating-point arithmetic, such as graphics processing that involves vectors, the Connection Machine is not suitable.
Example of a SIMD vector computer: the CRAY-1 supercomputer.

CRAY-1

Consists of multiple ALUs that can operate simultaneously:
2 addressing units to compute addresses
4 scalar integer units for arithmetic operations
6 vector integer units for vector operations

Cache Memory

Characteristics of Memory System

Location: whether the memory is internal or external to the computer. Examples: main memory and cache (internal); optical disk and magnetic disk (external).
Capacity: number of words or number of bytes
Unit of transfer: word or block
Access method: sequential, direct, random, associative
Performance: access time, cycle time and transfer time
Physical type: semiconductor, magnetic, optical
Physical characteristic: volatile or erasable
Organization: memory modules

Cache Memory Principles

Cache memory is intended to give a memory speed approaching that of the fastest memory available, and at the same time provide a large memory size at the price of less expensive types of semiconductor memory.
The cache contains a copy of portions of main memory.
When the processor attempts to read a word of memory:
A check is made to determine if the word is in the cache.
If so, the word is delivered to the processor.
If not, a block of main memory is read into the cache and the word is delivered to the processor.

Because of the phenomenon of locality of reference, it is likely that there will be future references to that same memory location or to other words in the block.
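The read steps above can be sketched as a small simulation. The block size `K = 4`, the class name `Cache` and the sample memory contents are assumptions made for the example, and the cache here is unbounded, so replacement is not modelled.

```python
K = 4  # words per block (assumed value for the illustration)


class Cache:
    def __init__(self, main_memory, block_size=K):
        self.main_memory = main_memory
        self.block_size = block_size
        self.blocks = {}          # block number -> list of words
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        block_no, offset = divmod(addr, self.block_size)
        if block_no in self.blocks:       # the word is in the cache
            self.hits += 1
        else:                             # miss: read the whole block into the cache
            self.misses += 1
            start = block_no * self.block_size
            self.blocks[block_no] = self.main_memory[start:start + self.block_size]
        return self.blocks[block_no][offset]


main_memory = list(range(100, 116))       # 16 words holding the values 100..115
cache = Cache(main_memory)

values = [cache.read(addr) for addr in (5, 6, 7, 5)]
print(values, cache.hits, cache.misses)   # [105, 106, 107, 105] 3 1
```

Only the first access misses. The next three hit because they fall in the same block, which is the effect of locality of reference.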

Cache/Main Memory Structure

Cache/Main Memory Principles

Main memory consists of up to 2^n addressable words, with each word having a unique n-bit address.
For mapping purposes, this memory is considered to consist of a number of fixed-length blocks of K words each. That is, there are M = 2^n / K blocks in main memory.
The cache consists of m blocks, called lines.
Each line contains K words, plus a tag of a few bits.
Each line also includes control bits.

Cache Read Operation

Cache Mapping Function

An algorithm is needed for mapping main memory blocks to cache lines, because there are fewer cache lines than main memory blocks.
The choice of the mapping function dictates how the cache is organized.
Three techniques can be used: direct, associative and set associative.

Example

A line is an adjacent series of bytes in main memory (that is, their addresses are contiguous). Suppose a line is 16 bytes in size. For example, suppose we have a 2^12 = 4K-byte cache with 2^8 = 256 16-byte lines; a 2^24 = 16M-byte main memory, which is 2^12 = 4K times the size of the cache; and a 400-line program, which will not all fit into the cache at once.

Direct Mapping

Under this mapping scheme, each memory line j maps to cache line j mod 128, so the memory address looks like this:

Tag (5 bits) | Line (7 bits) | Word (4 bits)

Here:
The "Word" field selects one from among the 16 addressable words in a line.
The "Line" field defines the cache line where this memory line should reside.
The "Tag" field of the address is then compared with that cache line's 5-bit tag to determine whether there is a hit or a miss. If there is a miss, we need to swap out the memory line that occupies that position in the cache and replace it with the desired memory line.

Direct Mapping

E.g., suppose that we want to read or write a word at the address 357A, whose 16 bits are 0011010101111010. This translates to Tag = 6, Line = 87, and Word = 10 (all in decimal). If line 87 in the cache has the same tag (6), then memory address 357A is in the cache. Otherwise, a miss has occurred and the contents of cache line 87 must be replaced by the memory line 001101010111 = 855 before the read or write is executed.

Direct mapping is the most efficient cache mapping scheme, but it is also the least effective in its utilization of the cache; that is, it may leave some cache lines unused.
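The field values in this example can be reproduced with a few shifts and masks. The sketch assumes the 16-bit address split of 5 tag bits, 7 line bits and 4 word bits implied by the worked example; the names `split_direct` and `access` are illustrative, and the second address 0x057A was picked only because it also maps to line 87.

```python
TAG_BITS, LINE_BITS, WORD_BITS = 5, 7, 4   # 16-bit address, 128 cache lines


def split_direct(addr):
    """Split a 16-bit address into (tag, line, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    line = (addr >> WORD_BITS) & ((1 << LINE_BITS) - 1)
    tag = addr >> (WORD_BITS + LINE_BITS)
    return tag, line, word


def access(tags, addr):
    """Return True on a hit. On a miss, replace whatever occupies the line."""
    tag, line, _ = split_direct(addr)
    hit = tags[line] == tag
    if not hit:
        tags[line] = tag
    return hit


print(split_direct(0x357A))      # (6, 87, 10)

tags = [None] * 128              # one stored tag per cache line
print(access(tags, 0x357A))      # False: first reference, miss
print(access(tags, 0x357A))      # True: line 87 now holds tag 6
print(access(tags, 0x057A))      # False: also line 87, but tag 0
print(access(tags, 0x357A))      # False: the line was evicted
```

The last two accesses show the weakness noted above: two memory lines that share a cache line keep evicting each other even though the other 127 lines are empty.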

Associative Mapping

This mapping scheme attempts to improve cache utilization, but at the expense of speed. Here, the cache line tags are 12 bits, rather than 5, and any memory line can be stored in any cache line. The memory address looks like this:

Tag (12 bits) | Word (4 bits)

Here:
The "Tag" field identifies one of the 2^12 = 4096 memory lines; all the cache tags are searched to find out whether or not the Tag field matches one of the cache tags. If so, we have a hit; if not, there is a miss and we need to replace one of the cache lines by this line before reading or writing into the cache.
The "Word" field again selects one from among 16 addressable words (bytes) within the line.

Associative Mapping

For example, suppose again that we want to read or write a word at the address 357A, whose 16 bits are 0011010101111010. Under associative mapping, this translates to Tag = 855 and Word = 10 (in decimal). So we search all of the 128 cache tags to see if any one of them matches 855. If not, there is a miss and we need to replace one of the cache lines with line 855 from memory before completing the read or write.

The search of all 128 tags in the cache is time-consuming. However, the cache is fully utilized, since none of its lines will be unused prior to a miss (recall that direct mapping may detect a miss even though the cache is not completely full of active lines).
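A sketch of the associative lookup for the same address. The text does not name a replacement policy, so first-in first-out is assumed here purely for illustration; `split_associative` and `AssociativeCache` are hypothetical names.

```python
WORD_BITS = 4


def split_associative(addr):
    """Split a 16-bit address into (12-bit tag, 4-bit word)."""
    return addr >> WORD_BITS, addr & ((1 << WORD_BITS) - 1)


class AssociativeCache:
    def __init__(self, num_lines=128):
        self.num_lines = num_lines
        self.tags = []                  # stored tags, oldest first

    def access(self, addr):
        """Return True on a hit. Any memory line may occupy any cache line."""
        tag, _ = split_associative(addr)
        if tag in self.tags:            # stands for comparing against every stored tag
            return True
        if len(self.tags) == self.num_lines:
            self.tags.pop(0)            # assumed FIFO replacement
        self.tags.append(tag)
        return False


print(split_associative(0x357A))   # (855, 10)

cache = AssociativeCache()
print(cache.access(0x357A))   # False: first reference
print(cache.access(0x057A))   # False: first reference
print(cache.access(0x357A))   # True: no conflict, both lines fit
```

In hardware the comparison against all tags happens in parallel, which is what makes this scheme costly; the `in` test here only models the outcome.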

Set Associative Mapping

This scheme is a compromise between the direct and associative schemes described above. Here, the cache is divided into sets of tags, and the set number is directly mapped from the memory address (e.g., memory line j is mapped to cache set j mod 64), as suggested by the diagram below:

Set Associative Mapping

The memory address is now partitioned like this:

Tag (6 bits) | Set (6 bits) | Word (4 bits)

Here:
The "Tag" field identifies one of the 2^6 = 64 different memory lines in each of the 2^6 = 64 different "Set" values. Since each cache set has room for only two lines at a time, the search for a match is limited to those two lines (rather than the entire cache). If there is a match, we have a hit and the read or write can proceed immediately. Otherwise, there is a miss and we need to replace one of the two cache lines by this line before reading or writing into the cache.
The "Word" field again selects one from among 16 addressable words inside the line.

Set Associative Mapping

In set-associative mapping, when the number of lines per set is n, the mapping is called n-way associative. For instance, the above example is 2-way associative.

Example: again, suppose that we want to read or write a word at the memory address 357A, whose 16 bits are 0011010101111010. Under set-associative mapping, this translates to Tag = 13, Set = 23, and Word = 10 (all in decimal). So we search only the two tags in cache set 23 to see if either one matches tag 13. If so, we have a hit. Otherwise, one of these two must be replaced by the memory line being addressed (line 855) before the read or write can be executed.
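The 2-way example can be sketched in the same style. Which of the two lines is replaced on a miss is not specified in the text, so least-recently-used replacement is assumed here; `split_set_associative` and `TwoWayCache` are hypothetical names, and the addresses 0x057A and 0x097A were chosen only because they also fall in set 23.

```python
TAG_BITS, SET_BITS, WORD_BITS = 6, 6, 4    # 16-bit address, 64 sets of 2 lines


def split_set_associative(addr):
    """Split a 16-bit address into (tag, set, word)."""
    word = addr & ((1 << WORD_BITS) - 1)
    set_no = (addr >> WORD_BITS) & ((1 << SET_BITS) - 1)
    tag = addr >> (WORD_BITS + SET_BITS)
    return tag, set_no, word


class TwoWayCache:
    def __init__(self, num_sets=64, ways=2):
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]   # per set: tags, least recent first

    def access(self, addr):
        """Return True on a hit. Only the tags of one set are searched."""
        tag, set_no, _ = split_set_associative(addr)
        tags = self.sets[set_no]
        if tag in tags:
            tags.remove(tag)
            tags.append(tag)            # mark as most recently used
            return True
        if len(tags) == self.ways:
            tags.pop(0)                 # assumed LRU replacement
        tags.append(tag)
        return False


print(split_set_associative(0x357A))   # (13, 23, 10)

cache = TwoWayCache()
results = [cache.access(a) for a in (0x357A, 0x057A, 0x357A, 0x097A, 0x357A, 0x057A)]
print(results)   # [False, False, True, False, True, False]
```

Two lines that map to set 23 can now coexist, which direct mapping could not offer, while each lookup compares only two tags instead of all 128.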
