LYU0703 Parallel Distributed Programming on PS3 1 Huang Hiu Fung 05700512 Wong Chung Hoi05596742...

Post on 17-Jan-2016



LYU0703 Parallel Distributed Programming on PS3

Huang Hiu Fung 05700512
Wong Chung Hoi 05596742

Supervised by Prof. Michael R. Lyu

Department of Computer Science and Engineering, CUHK
2007-2008 Final Year Project Presentation (1st term)


Agenda

• Background Information
• Architecture of PlayStation®3
• Principles of Parallel Programming
• Optimization of the ADVISER program:
  1. Sequential Approach
  2. Parallel Approach
• Conclusion
• Future Works
• Q&A


Background Information

Limitations of single-core processors:

1. Memory Access Latency

2. Wire Delays

3. Power Consumption


Power Consumption

$P = C V^2 F$

P = power
C = capacitance
V = voltage
F = processor frequency (cycles per second)


Development of Multi-Core Processor

Fig. 1.4 Growth of No. of Cores in Processors


Development of Multi-Core Processor

• Reduce power consumption
  - use multiple cores at low frequency instead of one core at high frequency

• Efficient processing of multiple tasks
  - divide the computation work
  - execute among the cores concurrently
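The power argument can be illustrated with the standard dynamic power relation P = C × V² × F, using the symbols defined on the Power Consumption slide. The sketch below is only an illustration: the capacitance, voltage and frequency values are hypothetical, and so is the assumption that a core clocked at half the frequency can run at a lower supply voltage (0.8 V instead of 1.0 V).

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power of a processor: P = C * V^2 * F."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical values chosen only for illustration.
C = 1.0e-9      # switched capacitance in farads
F = 3.2e9       # clock frequency in hertz
V_HIGH = 1.0    # supply voltage assumed necessary at full frequency
V_LOW = 0.8     # supply voltage assumed sufficient at half frequency

one_fast_core = dynamic_power(C, V_HIGH, F)
two_slow_cores = 2 * dynamic_power(C, V_LOW, F / 2)

ratio = two_slow_cores / one_fast_core
print(f"one core at F:      {one_fast_core:.3f} W")
print(f"two cores at F / 2: {two_slow_cores:.3f} W ({ratio:.0%} of the single core)")
```

Under these assumptions the two slower cores offer the same aggregate clock rate while drawing less power, because the voltage enters the formula squared.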


Project Objectives

• Show the need for parallel programming to optimize computation-intensive applications

• Study the features of parallel programming and compare the sequential and parallel approaches

• Optimize an application to show the improvement gained from parallel programming


Architecture of PlayStation®3 (PS3)

• A multi-core machine produced by Sony, with the Cell Broadband Engine

• Strong Computation Power

• Open platform for other applications and development


Cell Broadband Engine (Cell BE)

PPE – Power Processor Element

SPE – Synergistic Processor Element

EIB – Element Interconnect Bus


Power Processor Element (PPE)

• Based on the 64-bit PowerPC architecture
• General purpose operation
• Designed to be control-intensive
• Control I/O of main memory and other devices by the OS
• Control over all 8 SPEs

Fig. 2.5 Design of PPE


Synergistic Processor Element (SPE)

• Designed to provide computation performance

• SPU – performs the allocated task
• LS – the only memory
• MFC – controls data transfer
• 8 SPEs in total in the Cell
• Only 6 accessible
  - 1 reserved for system software
  - 1 disabled

Fig. 2.6 Design of an SPE


Element Interconnect Bus (EIB)

• Internal communication bus inside Cell

• Connects different elements: PPE, SPEs, memory controller

Fig. 2.7 Data Flow and Program Control


Principles of Parallel Programming

Parallel algorithm                  Serial algorithm
multiple processing units           single processing unit
communication overhead              no communication overhead
higher complexity in code           straightforward code
ensure load balance between PUs     everything is done by the CPU


Concept of Load Balance

• Distribute data evenly

• Total runtime depends on the busiest processing element

• Computation time is wasted on idling processing elements
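The effect of load balance on total runtime can be sketched as follows. The item counts and the one-second-per-item cost are hypothetical; the only idea taken from the slide is that the job ends when the busiest processing element ends, while the others sit idle.

```python
def total_runtime(loads, seconds_per_item=1.0):
    """All processing elements run concurrently, so the job finishes
    only when the busiest element finishes."""
    return max(loads) * seconds_per_item

def idle_time(loads, seconds_per_item=1.0):
    """Total time the less loaded elements spend waiting for the busiest one."""
    finish = total_runtime(loads, seconds_per_item)
    return sum(finish - load * seconds_per_item for load in loads)

balanced = [100, 100, 100, 100, 100, 100]   # 600 items spread evenly
unbalanced = [350, 50, 50, 50, 50, 50]      # same 600 items, one hot spot

for name, loads in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(name, total_runtime(loads), idle_time(loads))
```

Both distributions hold the same amount of work, but the unbalanced one takes 3.5 times as long and leaves five elements idle for most of the run.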


Method of parallelism

• Data parallelism

• Task parallelism

Parallel Architecture

Flynn's taxonomy

                 Single Instruction    Multiple Instruction
Single Data      SISD                  MISD
Multiple Data    SIMD                  MIMD


SISD

• Traditional Computer

• von Neumann model


SIMD

• Same instruction on all data

• Data parallelism

• SIMD intrinsic function


MISD

• No well-known system

• Mentioned for completeness


MIMD

• Different instructions on different data

• Task parallelism

• Further broken down into
  – Shared Memory System
  – Distributed Memory System


Shared Memory System

• Access to central memory for data

• PS3: achieved by the MFC issuing DMA commands


Distributed Memory System

• Each PE has its own memory

• PS3: each SPE has a 256 KB Local Store

• PS3 is a hybrid shared-distributed memory system


ADVISER

• Comparing 2 video clips

1. Generating meaningful data (in the form of numbers) for frames from the video

2. Comparing and looking for the most similar frames

3. Locating the similar segment, which consists of a series of very similar frames


Input

• 2 folders: “Repository” and “Target”

• hl3 file = vector of 1024 double-precision values


Input                   No. of hl3 files
Target directory        5473
Repository directory    7547

Processing

• hl3 file = vector of 1024 double precision values

• File P

• File Q

• Similarity =

• The smaller the value, the better the match
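A minimal sketch of the comparison between two hl3 vectors P and Q, assuming the similarity is the sum of squared element-wise differences. That metric is an assumption inferred from the subtract, multiply and accumulate loop described later in the presentation; the function name `difference` and the sample vectors are hypothetical.

```python
VECTOR_LENGTH = 1024  # number of values in one hl3 file, per the slide

def difference(p, q):
    """Sum of squared element-wise differences (assumed metric).
    A smaller value means the two frames are more alike."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Hypothetical data: q differs from p in exactly two positions.
p = [0.0] * VECTOR_LENGTH
q = list(p)
q[0] = 3.0
q[1] = -4.0

print(difference(p, p))  # identical vectors
print(difference(p, q))
```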


Output

• M “Target” files, N “Repository” files

• O(M × N) comparisons

• Computation time = 633 sec

• Flash demo


target hl3 1 most match repository A difference value = ??
target hl3 2 most match repository B difference value = ??
target hl3 3 most match repository C difference value = ??
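The sequential search behind this output can be sketched as a double loop that compares every target with every repository entry, which is where the O(M × N) cost comes from. The metric (sum of squared differences), the function names and the two-value vectors are assumptions made for illustration; the real files hold 1024 values each.

```python
def difference(p, q):
    """Assumed metric: sum of squared element-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def best_matches(targets, repository):
    """For each target vector, return (repository index, difference)
    of the closest repository vector. Performs M * N comparisons."""
    results = []
    for t in targets:
        best_index, best_value = None, float("inf")
        for j, r in enumerate(repository):
            d = difference(t, r)
            if d < best_value:
                best_index, best_value = j, d
        results.append((best_index, best_value))
    return results

# Tiny hypothetical vectors.
targets = [[1.0, 1.0], [5.0, 5.0], [9.0, 0.0]]
repository = [[0.0, 0.0], [5.0, 6.0], [8.0, 1.0]]

for i, (j, d) in enumerate(best_matches(targets, repository), start=1):
    print(f"target hl3 {i} most match repository {j} difference value = {d}")
```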

Parallel Version

• Data parallelism

• Split data to 6 SPEs evenly

• Computation time for 6 SPEs = 330 sec

• Flash demo
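A sketch of the data-parallel split, with Python threads standing in for the 6 SPEs. This is not Cell SDK code: `split_evenly`, `parallel_match` and the one-value "frames" are hypothetical, and Python threads only illustrate the structure (split, compute per chunk, merge) rather than real parallel speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 6  # one worker per accessible SPE

def split_evenly(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most 1."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def closest(target, repository):
    """Index of the repository value nearest to target (squared difference)."""
    return min(range(len(repository)), key=lambda j: (target - repository[j]) ** 2)

def match_chunk(chunk, repository):
    return [closest(t, repository) for t in chunk]

def parallel_match(targets, repository, workers=NUM_WORKERS):
    chunks = split_evenly(targets, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() keeps chunk order, so merged results line up with the targets.
        partial = list(pool.map(lambda c: match_chunk(c, repository), chunks))
    return [index for part in partial for index in part]

# Hypothetical one-value "frames" to keep the example small.
repository = [0.0, 10.0, 20.0, 30.0]
targets = [float(v) for v in range(0, 32, 2)]   # 16 targets

print([len(c) for c in split_evenly(targets, NUM_WORKERS)])
print(parallel_match(targets, repository))
```

Each worker needs the whole repository but only its own share of the targets, so the split requires no communication between workers until the results are merged.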


Parallel Version

• Expected speed up 6X

• Actual speed up 2X

• PC CPU, PPU and SPE all run at different speeds

• Computation time with CPU = 633 sec

• Computation time with 1 SPE = 1928 sec

• Computation time with PPU = 3119 sec

• CPU > SPE > PPU
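The gap between the expected and the actual speed-up depends on which baseline is used. The short calculation below uses only the timings reported on the slides; the helper name `speedup` is hypothetical.

```python
# Timings reported on the slides, in seconds.
PC_CPU = 633
ONE_SPE = 1928
PPU = 3119
SIX_SPES = 330

def speedup(baseline, optimized):
    """How many times faster the optimized run is than the baseline."""
    return baseline / optimized

print(f"6 SPEs vs PC CPU: {speedup(PC_CPU, SIX_SPES):.2f}x")
print(f"6 SPEs vs 1 SPE:  {speedup(ONE_SPE, SIX_SPES):.2f}x")
print(f"1 SPE vs PC CPU:  {speedup(PC_CPU, ONE_SPE):.2f}x")
```

Measured against a single SPE, the 6-SPE run is close to the expected 6X; the figure of about 2X appears when the PC's CPU is the baseline, because one SPE running this unoptimized code is roughly three times slower than the CPU.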

Time Attack

1. SIMD intrinsic function

2. Changing data type

3. Double Buffering

4. Parallel Read

5. Distributing Job to idling PPE

6. SIMD on loop counter

7. Loop unrolling

SIMD intrinsic function

• Addition, subtraction, multiplication, etc.

• Operates on 128-bit registers

• Data type: double (64 bits)

• Speed up 2X
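The 2X figure follows from how many values fit in one register. A small calculation, assuming the ideal case in which one SIMD instruction replaces one scalar instruction per lane (real gains also depend on memory traffic and instruction latency):

```python
REGISTER_BITS = 128  # register width stated on the slide

def lanes(element_bits):
    """Number of elements of the given width that fit in one register."""
    return REGISTER_BITS // element_bits

for name, bits in (("double", 64), ("float", 32), ("int", 32)):
    print(f"{name:6s} ({bits} bits): {lanes(bits)} values per instruction")
```

With 64-bit doubles each instruction handles two values; a 32-bit type doubles that again to four.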


Changing Data Type to int

• Precision not important

• Major speed up from SIMD intrinsic

• Data type: int (32 bits)

• Total speed up 4X

• Computation time = 71 sec


Chart: Running Time of Parallel PS3, with SIMD Version. x-axis: No. of SPU used (1 to 6); y-axis: sec (0 to 2500); series: Parallel version, Parallel + SIMD version.

Changing Data Type to float

• SPE is specialized for high precision computation

• No intrinsic for int data type at all

• Data type: float (32 bits)

• Saves data conversion time

• Speed up by 30%

• Computation time = 49 sec


Chart: Running Time of Parallel, with SIMD, float input PS3 version. x-axis: No. of SPU used (1 to 6); y-axis: sec (0 to 500); series: Parallel + SIMD + int, Parallel + SIMD + float.

Double buffering

• Save communication time

• MFC and SPU

• 2 buffers
  – Prefetching
  – Processing

• Not heavy in communication

• Minor speed up
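The double-buffering idea can be sketched with a background thread playing the role of the MFC: while one buffer is processed, the next block is already being fetched into the other. This is an illustration in plain Python, not Cell SDK code; `fetch`, `process` and the sample blocks are hypothetical stand-ins for a DMA transfer and the SPU computation.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(blocks, index):
    """Stand-in for a DMA transfer that brings one block into a buffer."""
    return list(blocks[index])

def process(buffer):
    """Stand-in for the computation on one buffer."""
    return sum(x * x for x in buffer)

def run_double_buffered(blocks):
    if not blocks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as transfer_engine:
        pending = transfer_engine.submit(fetch, blocks, 0)    # fill the first buffer
        for i in range(len(blocks)):
            current = pending.result()                         # wait for the transfer
            if i + 1 < len(blocks):
                # Start fetching the next block into the other buffer ...
                pending = transfer_engine.submit(fetch, blocks, i + 1)
            # ... while the current buffer is being processed.
            results.append(process(current))
    return results

blocks = [[1, 2, 3], [4, 5], [6]]
print(run_double_buffered(blocks))
```

The gain is bounded by the transfer time that can be hidden, which is consistent with the minor speed up reported for a program that is not heavy in communication.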


Parallel Reading for All Files

• Read “Target” and “Repository” concurrently

• Share file reading job among SPEs

• No improvement as predicted; even slower

• Reason: the hard disk cannot handle concurrent requests

• Failed attempt


Distributing Job to Idling PPE

• PPE's current job: read files, distribute files, collect results

• Use stall time to do some computation

• Relatively low computation power of PPE

• No significant improvement

• Increases program complexity

• This approach was abandoned


Applying SIMD for Loop Counter

Major computation power is consumed in:

1. Initialize i = 0, diff = (0, 0, 0, 0).
2. For i < (number of float numbers in a file) / (number of floats packed in a register):
   A. temp = SIMD subtraction on vector i in the “Target” and “Repository” files.
   B. diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = i + 1.
4. Loop back to 2.
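The loop above can be emulated in plain Python, with 4-element lists standing in for 128-bit registers of four floats. The helper names are hypothetical and do not correspond to real SPU intrinsics; the final addition of the four lanes is an assumed last step, since the slide stops at the accumulation in diff.

```python
LANES = 4  # floats packed in one 128-bit register

def simd_sub(a, b):
    return [x - y for x, y in zip(a, b)]

def simd_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def simd_add(a, b):
    return [x + y for x, y in zip(a, b)]

def squared_difference(target, repository):
    """Accumulate (t - r)^2 lane by lane, one 4-float vector per iteration."""
    if len(target) != len(repository) or len(target) % LANES != 0:
        raise ValueError("inputs must have equal length, a multiple of LANES")
    i, diff = 0, [0.0] * LANES                        # initialize
    while i < len(target) // LANES:                   # loop over vectors
        t = target[i * LANES:(i + 1) * LANES]
        r = repository[i * LANES:(i + 1) * LANES]
        temp = simd_sub(t, r)                         # SIMD subtraction
        diff = simd_add(simd_mul(temp, temp), diff)   # SIMD multiply, then add
        i = i + 1                                     # scalar loop counter
    return sum(diff)                                  # assumed final lane sum

target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
repository = [0.0] * 8
print(squared_difference(target, repository))
```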


Applying SIMD for Loop Counter

• Try to optimize step 3

• Apply SIMD to the loop counter

• Addition and comparison operations on the loop counter are reduced by a factor of 8


Applying SIMD for Loop Counter

1. Initialize i = (0, 1, 2, 3, 4, 5, 6, 7), diff = (0, 0, 0, 0).
2. For i[0] < (number of float numbers in a file) / (number of floats packed in a register):
   temp = SIMD subtraction on vector i[0] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[1] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[2] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[3] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[4] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[5] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[6] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
   temp = SIMD subtraction on vector i[7] in the “Target” and “Repository” files.
   diff = SIMD addition (SIMD multiplication (temp, temp), diff).
3. i = SIMD addition (i, (8, 8, 8, 8, 8, 8, 8, 8)).
4. Loop back to 2.
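A plain-Python emulation of this version, again with lists standing in for registers. The names `accumulate` and `squared_difference_unrolled` are hypothetical, and the returned iteration count is added only to make the saving visible: one counter update and one comparison now cover eight vectors.

```python
LANES = 4    # floats per 128-bit register
UNROLL = 8   # vectors handled per loop iteration, as on the slide

def accumulate(diff, target, repository, v):
    """One subtract / multiply / add step on vector number v."""
    t = target[v * LANES:(v + 1) * LANES]
    r = repository[v * LANES:(v + 1) * LANES]
    return [d + (x - y) * (x - y) for d, x, y in zip(diff, t, r)]

def squared_difference_unrolled(target, repository):
    if len(target) != len(repository) or len(target) % (LANES * UNROLL) != 0:
        raise ValueError("length must be a multiple of LANES * UNROLL")
    vectors = len(target) // LANES
    i = list(range(UNROLL))            # i = (0, 1, ..., 7)
    diff = [0.0] * LANES
    iterations = 0
    while i[0] < vectors:              # one comparison per 8 vectors
        diff = accumulate(diff, target, repository, i[0])
        diff = accumulate(diff, target, repository, i[1])
        diff = accumulate(diff, target, repository, i[2])
        diff = accumulate(diff, target, repository, i[3])
        diff = accumulate(diff, target, repository, i[4])
        diff = accumulate(diff, target, repository, i[5])
        diff = accumulate(diff, target, repository, i[6])
        diff = accumulate(diff, target, repository, i[7])
        i = [k + UNROLL for k in i]    # one vector add updates all 8 indices
        iterations += 1
    return sum(diff), iterations

# Hypothetical 1024-value vectors, matching the size of one hl3 file.
target = [float(k % 7) for k in range(1024)]
repository = [0.0] * 1024
value, iterations = squared_difference_unrolled(target, repository)
print(value, iterations)
```

For a 1024-float file there are 256 vectors, so the loop body runs 32 times instead of 256.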


Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version

No. of SPU used             1    2    3    4    5    6
Read input time (sec)       4    5    3    4    4    4
Total elapsed time (sec)    286  146  97   75   60   51
Net elapsed time (sec)      282  141  94   71   56   47
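How well this version scales with the number of SPUs can be checked from the net elapsed times in the result table. The helper `scaling` is hypothetical; speed-up and efficiency are computed relative to the 1-SPU run, and because the reported times are rounded to whole seconds the efficiency can come out marginally above 100%.

```python
# Net elapsed times from the result table, in seconds, indexed by SPUs used.
net_elapsed = {1: 282, 2: 141, 3: 94, 4: 71, 5: 56, 6: 47}

def scaling(times):
    """Return {n: (speedup, efficiency)} relative to the 1-SPU run."""
    base = times[1]
    return {n: (base / t, base / (t * n)) for n, t in times.items()}

for n, (s, e) in scaling(net_elapsed).items():
    print(f"{n} SPU: speed-up {s:.2f}x, efficiency {e:.0%}")
```

The net computation time scales almost linearly up to 6 SPUs, while the read input time stays roughly constant.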


Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version

Chart: Running Time of Parallel, with SIMD, float input, SIMD for loop counter PS3 version. x-axis: No. of SPU used (1 to 6); y-axis: sec (0 to 400); series: Parallel+SIMD+float, Parallel+SIMD+float+SIMD for i.


Result of the parallel, with SIMD, float input, SIMD for loop counter PS3 version

• Little improvement (about 4%)

• Shows that further loop unrolling could give faster performance

• The best performance becomes 47 sec


Loop Unrolling

• Proved that optimizing the loop can improve performance

• Complete loop unrolling

• More obvious speed up


Result of the parallel, with SIMD, float input, loop unrolling PS3 version

No. of SPU used             1    2    3    4    5    6
Read input time (sec)       3    4    3    3    4    3
Total elapsed time (sec)    159  82   55   42   35   30
Net elapsed time (sec)      156  78   52   39   31   27


Result of the parallel, with SIMD, float input, loop unrolling PS3 version

Chart: Running Time of Parallel, with SIMD, float input, loop unrolling PS3 version. x-axis: No. of SPU used (1 to 6); y-axis: sec (0 to 300); series: Parallel+SIMD+float+SIMD for i, Parallel+SIMD+float+loop unrolling.


Result of the parallel, with SIMD, float input, loop unrolling PS3 version

• 45% faster

• The overall best performance becomes 27 sec


Conclusion of Optimization

• PC version: 633 sec

• PS3 with 1 SPU (i.e. sequential version on PS3): 1928 sec

• Final optimized version on PS3: 27 sec

23 times faster than the PC version
71 times faster than the sequential version on PS3


Conclusion of Optimization

Chart: Elapsed time change with different approaches applied, with 6 SPUs. x-axis: parallel, SIMD, float type, SIMD for i, loop unrolling; y-axis: elapsed time in sec (0 to 350).


Future Works

• Port the whole ADVISER application to PlayStation®3

• Optimization throughout the whole application


Q&A


The End