Recursion Unrolling for Divide and Conquer Programs

Recursion Unrolling for Divide and Conquer Programs Radu Rugina and Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology


Page 1: Recursion Unrolling  for Divide and Conquer Programs

Recursion Unrolling for Divide and Conquer Programs

Radu Rugina and Martin Rinard

Laboratory for Computer Science

Massachusetts Institute of Technology

Page 2: Recursion Unrolling  for Divide and Conquer Programs

What This Talk Is About

•Automatic generation of efficient large base cases for divide and conquer programs

Page 3: Recursion Unrolling  for Divide and Conquer Programs

Outline

1. Motivating Example
2. Computation Structure
3. Transformations
4. Related Work
5. Conclusion

Page 4: Recursion Unrolling  for Divide and Conquer Programs

1. Motivating Example

Page 5: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Divide matrices into sub-matrices: A0, A1, A2, etc.

• Use blocked matrix multiply equations

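A × B = R:

$$
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\times
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
=
\begin{pmatrix} A_0 B_0 + A_1 B_2 & A_0 B_1 + A_1 B_3 \\ A_2 B_0 + A_3 B_2 & A_2 B_1 + A_3 B_3 \end{pmatrix}
$$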

Page 6: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Recursively multiply sub-matrices

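A × B = R:

$$
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\times
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
=
\begin{pmatrix} A_0 B_0 + A_1 B_2 & A_0 B_1 + A_1 B_3 \\ A_2 B_0 + A_3 B_2 & A_2 B_1 + A_3 B_3 \end{pmatrix}
$$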

Page 7: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Terminate recursion with a simple base case

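A × B = R, for single-element matrices:

$$
\underbrace{(a_0)}_{A} \times \underbrace{(b_0)}_{B} = \underbrace{(a_0 b_0)}_{R}
$$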

Page 8: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}

Implements R += A B
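The pointer arithmetic in matmul only makes sense for a particular storage layout. The sketch below assumes that n is the total number of elements and that each matrix is stored quadrant by quadrant, recursively (A0 first, then A1, A2, A3). The helpers quad_index and matmul_matches_naive are hypothetical names introduced for this illustration and are not part of the talk.

```c
#include <assert.h>
#include <string.h>

/* matmul as on the slide: R += A * B. n is the total number of elements;
   each matrix is assumed to be stored quadrant by quadrant. */
void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    int q = n / 4;  /* elements per quadrant */
    matmul(A,       B,       R,       q);
    matmul(A,       B + q,   R + q,   q);
    matmul(A + 2*q, B,       R + 2*q, q);
    matmul(A + 2*q, B + q,   R + 3*q, q);
    matmul(A + q,   B + 2*q, R,       q);
    matmul(A + q,   B + 3*q, R + q,   q);
    matmul(A + 3*q, B + 2*q, R + 2*q, q);
    matmul(A + 3*q, B + 3*q, R + 3*q, q);
  }
}

/* Hypothetical helper: offset of element (row, col) of a size x size matrix
   (size a power of two) in the assumed quadrant-recursive layout. */
int quad_index(int row, int col, int size) {
  int offset = 0;
  while (size > 1) {
    int half = size / 2;
    int quadrant = 2 * (row >= half) + (col >= half);  /* 0..3 */
    offset += quadrant * half * half;
    row %= half;
    col %= half;
    size = half;
  }
  return offset;
}

/* Hypothetical check: multiply two 4 x 4 matrices with matmul and compare
   against the textbook triple loop. Returns 1 when every element agrees. */
int matmul_matches_naive(void) {
  enum { S = 4 };
  int a[S * S], b[S * S], r[S * S];
  int ra[S][S], rb[S][S];
  memset(r, 0, sizeof r);
  for (int i = 0; i < S; i++) {
    for (int j = 0; j < S; j++) {
      ra[i][j] = i + 2 * j + 1;
      rb[i][j] = 3 * i - j;
      a[quad_index(i, j, S)] = ra[i][j];
      b[quad_index(i, j, S)] = rb[i][j];
    }
  }
  matmul(a, b, r, S * S);
  for (int i = 0; i < S; i++) {
    for (int j = 0; j < S; j++) {
      int expected = 0;
      for (int k = 0; k < S; k++) expected += ra[i][k] * rb[k][j];
      if (r[quad_index(i, j, S)] != expected) return 0;
    }
  }
  return 1;
}
```

Because matmul accumulates into R, the result array has to be zeroed before the call when a plain product is wanted.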

Page 9: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

Divide matrices into sub-matrices and recursively multiply sub-matrices


Page 10: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

Identify sub-matrices with pointers


Page 11: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

Use a simple algorithm for the base case


Page 12: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Advantage of small base case: simplicity

• Code is easy to:
  • Write
  • Maintain
  • Debug
  • Understand


Page 13: Recursion Unrolling  for Divide and Conquer Programs

Divide and Conquer Matrix Multiply

• Disadvantage: inefficiency

• Large control flow overhead:

• Most of the time is spent dividing the matrix into sub-matrices


Page 14: Recursion Unrolling  for Divide and Conquer Programs

Hand Coded Implementation

void serialmul(block *As, block *Bs, block *Rs)
{
  int i, j;
  DOUBLE *A = (DOUBLE *) As;
  DOUBLE *B = (DOUBLE *) Bs;
  DOUBLE *R = (DOUBLE *) Rs;
  for (j = 0; j < 16; j += 2) {
    DOUBLE *bp = &B[j];
    for (i = 0; i < 16; i += 2) {
      DOUBLE *ap = &A[i * 16];
      DOUBLE *rp = &R[j + i * 16];
      register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
      register DOUBLE s1_0 = rp[16], s1_1 = rp[17];
      s0_0 += ap[0] * bp[0];     s0_1 += ap[0] * bp[1];
      s1_0 += ap[16] * bp[0];    s1_1 += ap[16] * bp[1];
      s0_0 += ap[1] * bp[16];    s0_1 += ap[1] * bp[17];
      s1_0 += ap[17] * bp[16];   s1_1 += ap[17] * bp[17];
      s0_0 += ap[2] * bp[32];    s0_1 += ap[2] * bp[33];
      s1_0 += ap[18] * bp[32];   s1_1 += ap[18] * bp[33];
      s0_0 += ap[3] * bp[48];    s0_1 += ap[3] * bp[49];
      s1_0 += ap[19] * bp[48];   s1_1 += ap[19] * bp[49];
      s0_0 += ap[4] * bp[64];    s0_1 += ap[4] * bp[65];
      s1_0 += ap[20] * bp[64];   s1_1 += ap[20] * bp[65];
      s0_0 += ap[5] * bp[80];    s0_1 += ap[5] * bp[81];
      s1_0 += ap[21] * bp[80];   s1_1 += ap[21] * bp[81];
      s0_0 += ap[6] * bp[96];    s0_1 += ap[6] * bp[97];
      s1_0 += ap[22] * bp[96];   s1_1 += ap[22] * bp[97];
      s0_0 += ap[7] * bp[112];   s0_1 += ap[7] * bp[113];
      s1_0 += ap[23] * bp[112];  s1_1 += ap[23] * bp[113];
      s0_0 += ap[8] * bp[128];   s0_1 += ap[8] * bp[129];
      s1_0 += ap[24] * bp[128];  s1_1 += ap[24] * bp[129];
      s0_0 += ap[9] * bp[144];   s0_1 += ap[9] * bp[145];
      s1_0 += ap[25] * bp[144];  s1_1 += ap[25] * bp[145];
      s0_0 += ap[10] * bp[160];  s0_1 += ap[10] * bp[161];
      s1_0 += ap[26] * bp[160];  s1_1 += ap[26] * bp[161];
      s0_0 += ap[11] * bp[176];  s0_1 += ap[11] * bp[177];
      s1_0 += ap[27] * bp[176];  s1_1 += ap[27] * bp[177];
      s0_0 += ap[12] * bp[192];  s0_1 += ap[12] * bp[193];
      s1_0 += ap[28] * bp[192];  s1_1 += ap[28] * bp[193];
      s0_0 += ap[13] * bp[208];  s0_1 += ap[13] * bp[209];
      s1_0 += ap[29] * bp[208];  s1_1 += ap[29] * bp[209];
      s0_0 += ap[14] * bp[224];  s0_1 += ap[14] * bp[225];
      s1_0 += ap[30] * bp[224];  s1_1 += ap[30] * bp[225];
      s0_0 += ap[15] * bp[240];  s0_1 += ap[15] * bp[241];
      s1_0 += ap[31] * bp[240];  s1_1 += ap[31] * bp[241];
      rp[0] = s0_0;  rp[1] = s0_1;
      rp[16] = s1_0; rp[17] = s1_1;
    }
  }
}

cilk void matrixmul(long nb, block *A, block *B, block *R)
{
  if (nb == 1) {
    serialmul(A, B, R);
  } else if (nb >= 4) {
    spawn matrixmul(nb/4, A, B, R);
    spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B, R+2*(nb/4));
    spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+3*(nb/4));
    sync;
    spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);
    spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+2*(nb/4), R+2*(nb/4));
    spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));
    sync;
  }
}

Page 15: Recursion Unrolling  for Divide and Conquer Programs

Goal

• The programmer writes simple code with small base cases

• The compiler automatically generates efficient code with large base cases

Page 16: Recursion Unrolling  for Divide and Conquer Programs

2. Computation Structure

Page 17: Recursion Unrolling  for Divide and Conquer Programs

Running Example – Array Increment

void f(char *p, int n) {
  if (n == 1) {
    /* base case: increment one element */
    (*p) += 1;
  } else {
    f(p, n/2);      /* increment first half */
    f(p+n/2, n/2);  /* increment second half */
  }
}
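A minimal sketch of how the running example behaves, assuming n is at least 1. The second halving call only covers the whole array when n is a power of two: with an odd n greater than 1, integer division makes the two halves cover fewer than n elements. The helper all_equal is a hypothetical name used only for this illustration.

```c
#include <assert.h>

/* Running example from the slide: increment n consecutive elements. */
void f(char *p, int n) {
  assert(n >= 1);  /* n == 0 would recurse without ever reaching n == 1 */
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);        /* first half */
    f(p + n/2, n/2);  /* second half */
  }
}

/* Hypothetical helper: 1 if the first n elements of p all equal v. */
int all_equal(const char *p, int n, char v) {
  for (int i = 0; i < n; i++) {
    if (p[i] != v) return 0;
  }
  return 1;
}
```

Calling f on a zeroed 8-element array leaves every element at 1. Calling it with n = 3 increments only the first two elements, which is why the sizes used in the rest of the talk are powers of two.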

Page 18: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

[Figure: execution of f(p,4)]

Page 19: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

[Figure: execution of f(p,4). Root activation: Test n=1; Call f; Call f]

Page 20: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

[Figure: execution of f(p,4). Root activation: Test n=1; Call f; Call f. Label: activation frame on the stack]

Page 21: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

[Figure: execution of f(p,4). Root activation: Test n=1; Call f; Call f. Label: executed instructions]

Page 22: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

[Figure: execution of f(p,4). Root activation: Test n=1; Call f; Call f]

Page 23: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

[Figure: execution of f(p,4), first two levels of the call tree]

n=4: Test n=1; Call f; Call f
  n=2: Test n=1; Call f; Call f
  n=2: Test n=1; Call f; Call f

Page 24: Recursion Unrolling  for Divide and Conquer Programs

Dynamic Call Tree for n=4

[Figure: execution of f(p,4), complete call tree]

n=4: Test n=1; Call f; Call f
  n=2: Test n=1; Call f; Call f
    n=1: Test n=1; Inc *p
    n=1: Test n=1; Inc *p
  n=2: Test n=1; Call f; Call f
    n=1: Test n=1; Inc *p
    n=1: Test n=1; Inc *p

Page 25: Recursion Unrolling  for Divide and Conquer Programs

Control Flow Overhead

[Figure: dynamic call tree for the execution of f(p,4), levels n=4, n=2, n=1. Legend: call overhead]

Page 26: Recursion Unrolling  for Divide and Conquer Programs

Control Flow Overhead

[Figure: dynamic call tree for the execution of f(p,4), levels n=4, n=2, n=1. Legend: call overhead + test overhead]

Page 27: Recursion Unrolling  for Divide and Conquer Programs

Computation

[Figure: dynamic call tree for the execution of f(p,4), levels n=4, n=2, n=1. Legend: call overhead + test overhead; computation]

Page 28: Recursion Unrolling  for Divide and Conquer Programs

Large Base Cases = Reduced Overhead

[Figure: execution of f(p,4) with a base case for n=2]

n=4: Test n=2; Call f; Call f
  n=2: Test n=2; Inc *p; Inc *(p+1)
  n=2: Test n=2; Inc *p; Inc *(p+1)
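The effect of a larger base case on control flow overhead can be checked by counting. The sketch below instruments the array increment example with hypothetical counters (struct counts, f_small, and f_large are names introduced here, not taken from the talk) and assumes n is a power of two.

```c
#include <assert.h>

/* Hypothetical instrumentation: activations, condition tests, increments. */
struct counts { int calls, tests, incs; };

/* Original procedure: base case for n == 1. */
void f_small(char *p, int n, struct counts *c) {
  c->calls++;
  c->tests++;
  if (n == 1) {
    (*p) += 1;
    c->incs++;
  } else {
    f_small(p, n/2, c);
    f_small(p + n/2, n/2, c);
  }
}

/* Variant with a base case for n == 2 (requires n >= 2). */
void f_large(char *p, int n, struct counts *c) {
  assert(n >= 2);
  c->calls++;
  c->tests++;
  if (n == 2) {
    (*p) += 1;
    *(p + 1) += 1;
    c->incs += 2;
  } else {
    f_large(p, n/2, c);
    f_large(p + n/2, n/2, c);
  }
}
```

For n = 4 the small base case costs 7 activations and 7 tests for 4 increments, while the n = 2 base case needs 3 activations and 3 tests for the same 4 increments.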

Page 29: Recursion Unrolling  for Divide and Conquer Programs

3. Transformations

Page 30: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p+n/2, n/2);
  }
}

Start with the original recursive procedure

Page 31: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

Make two copies of the original procedure

Page 32: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Transform direct recursion to mutual recursion
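A small sketch suggesting that the mutually recursive pair computes the same result as the original procedure. In C, f2 needs a forward declaration because f1 calls it before its definition. The helper same_result is hypothetical and only compares the two versions on a zeroed buffer of at most 64 elements.

```c
#include <assert.h>

/* Original directly recursive procedure. */
void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p + n/2, n/2);
  }
}

void f2(char *p, int n);  /* forward declaration for the mutual recursion */

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p + n/2, n/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p + n/2, n/2);
  }
}

/* Hypothetical check: 1 if f and f1 leave identical buffers for size n. */
int same_result(int n) {
  char a[64] = {0}, b[64] = {0};
  assert(n >= 1 && n <= 64);
  f(a, n);
  f1(b, n);
  for (int i = 0; i < 64; i++) {
    if (a[i] != b[i]) return 0;
  }
  return 1;
}
```

The calls simply alternate between f1 and f2 at successive recursion levels, so the sequence of increments is unchanged.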

Page 33: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining


Inline procedure f2 at call sites in f1

Page 34: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining

void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/4, n/2/2);
    }
  }
}

Page 35: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining


• Reduced procedure call overhead

• More code exposed at the intra-procedural level

• Opportunities to simplify control flow in the inlined code

Page 36: Recursion Unrolling  for Divide and Conquer Programs

Transformation 1: Recursion Inlining


• Reduced procedure call overhead

• More code exposed at the intra-procedural level

• Opportunities to simplify control flow in the inlined code:

• identical condition expressions

Page 37: Recursion Unrolling  for Divide and Conquer Programs

Transformation 2: Conditional Fusion

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}

Merge if statements with identical conditions
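The sketch below compares the inlined procedure and its fused form against the original. The names f1_inlined, f1_fused, and agrees_with_original are hypothetical. The fusion is only safe here because the code between the two tests does not change n, so both conditions are guaranteed to evaluate to the same value.

```c
#include <assert.h>

/* Original procedure. */
void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p + n/2, n/2);
  }
}

/* After recursion inlining: two if statements test the same condition. */
void f1_inlined(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1_inlined(p, n/2/2);
      f1_inlined(p + n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p + n/2) += 1;
    } else {
      f1_inlined(p + n/2, n/2/2);
      f1_inlined(p + n/2 + n/4, n/2/2);
    }
  }
}

/* After conditional fusion: the two if statements are merged into one. */
void f1_fused(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p + n/2) += 1;
  } else {
    f1_fused(p, n/2/2);
    f1_fused(p + n/2/2, n/2/2);
    f1_fused(p + n/2, n/2/2);
    f1_fused(p + n/2 + n/4, n/2/2);
  }
}

/* Hypothetical check: 1 if g leaves the same buffer as f for size n. */
int agrees_with_original(void (*g)(char *, int), int n) {
  char a[64] = {0}, b[64] = {0};
  assert(n >= 1 && n <= 64);
  f(a, n);
  g(b, n);
  for (int i = 0; i < 64; i++) {
    if (a[i] != b[i]) return 0;
  }
  return 1;
}
```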

Page 38: Recursion Unrolling  for Divide and Conquer Programs

Transformation 2: Conditional Fusion


• Reduced branching overhead and bigger basic blocks

• Larger base case for n/2 = 1

Page 39: Recursion Unrolling  for Divide and Conquer Programs

Unrolling Iterations


Repeatedly apply inlining and conditional fusion

Page 40: Recursion Unrolling  for Divide and Conquer Programs

Second Unrolling Iteration

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/4, n/2/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}

Page 41: Recursion Unrolling  for Divide and Conquer Programs

Second Unrolling Iteration

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f2(p, n/2/2);
    f2(p+n/2/2, n/2/2);
    f2(p+n/2, n/2/2);
    f2(p+n/2+n/4, n/2/2);
  }
}

void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}

Page 42: Recursion Unrolling  for Divide and Conquer Programs

Result of Second Unrolling Iteration

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}
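To get a feel for how much call overhead two unrolling iterations remove, the sketch below counts activations of the original and of the twice-unrolled procedure. The global counter calls and the helper count_calls are hypothetical instrumentation, and the check inside count_calls assumes n is a power of two no larger than 64.

```c
#include <assert.h>

static int calls;  /* hypothetical instrumentation: number of activations */

/* Original procedure, instrumented. */
void f(char *p, int n) {
  calls++;
  if (n == 1) {
    *p += 1;
  } else {
    f(p, n/2);
    f(p + n/2, n/2);
  }
}

/* Result of the second unrolling iteration, instrumented. */
void f1(char *p, int n) {
  calls++;
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p + n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p + n/2/2) += 1;
    *(p + n/2) += 1;
    *(p + n/2 + n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p + n/2/2/2, n/2/2/2);
    f1(p + n/2/2, n/2/2/2);
    f1(p + n/2/2 + n/2/2/2, n/2/2/2);
    f1(p + n/2, n/2/2/2);
    f1(p + n/2 + n/2/2/2, n/2/2/2);
    f1(p + n/2 + n/2/2, n/2/2/2);
    f1(p + n/2 + n/2/2 + n/2/2/2, n/2/2/2);
  }
}

/* Hypothetical helper: run g on a zeroed buffer, check that every element
   was incremented exactly once, and return the number of activations. */
int count_calls(void (*g)(char *, int), int n) {
  char buf[64] = {0};
  assert(n >= 1 && n <= 64);
  calls = 0;
  g(buf, n);
  for (int i = 0; i < n; i++) assert(buf[i] == 1);
  return calls;
}
```

For n = 64 the original makes 127 activations and the unrolled version 73. The saving is limited because at this size the recursion bottoms out in the n = 1 base case rather than the n = 4 one.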

Page 43: Recursion Unrolling  for Divide and Conquer Programs

Unrolling Iterations

• The unrolling process stops when the number of iterations reaches the desired unrolling factor

• The unrolled recursive procedure:
  • Has base cases for larger problem sizes
  • Divides the given problem into more sub-problems of smaller sizes

• In our example:
  • Base cases for n=1, n=2, and n=4
  • Problems are divided into 8 problems of 1/8 size

Page 44: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2); series: inline, inline+fusion]

Page 45: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2); series: inline, inline+fusion]

Page 46: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2); series: inline, inline+fusion]

Page 47: Recursion Unrolling  for Divide and Conquer Programs

Efficiency of Unrolled Recursive Part

• Because the recursive part is also unrolled, recursion may not exercise the large base cases

• Which base case is executed depends on the size of the input problem

• In our example:
  • For a problem of size n=8, the base case for n=1 is executed
  • For a problem of size n=16, the base case for n=2 is executed
  • The efficient base case for n=4 is not executed in these cases
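The sizes quoted above can be checked by recording which base case the twice-unrolled procedure reaches. In this sketch the hits array and the run helper are hypothetical instrumentation, and n is assumed to be a power of two no larger than 64.

```c
#include <assert.h>

/* Hypothetical instrumentation: hits[0] counts the n=1 base case,
   hits[1] the n=2 base case, hits[2] the n=4 base case. */
static int hits[3];

/* Twice-unrolled procedure with the recursive part still unrolled. */
void f1(char *p, int n) {
  if (n == 1) {
    hits[0]++;
    *p += 1;
  } else if (n/2 == 1) {
    hits[1]++;
    *p += 1;
    *(p + n/2) += 1;
  } else if (n/2/2 == 1) {
    hits[2]++;
    *p += 1;
    *(p + n/2/2) += 1;
    *(p + n/2) += 1;
    *(p + n/2 + n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p + n/2/2/2, n/2/2/2);
    f1(p + n/2/2, n/2/2/2);
    f1(p + n/2/2 + n/2/2/2, n/2/2/2);
    f1(p + n/2, n/2/2/2);
    f1(p + n/2 + n/2/2/2, n/2/2/2);
    f1(p + n/2 + n/2/2, n/2/2/2);
    f1(p + n/2 + n/2/2 + n/2/2/2, n/2/2/2);
  }
}

/* Hypothetical helper: reset the counters and run f1 on a zeroed buffer. */
void run(int n) {
  char buf[64] = {0};
  assert(n >= 1 && n <= 64);
  hits[0] = hits[1] = hits[2] = 0;
  f1(buf, n);
}
```

With n = 8 all eight leaves land in the n = 1 base case, and with n = 16 all eight land in the n = 2 base case. Only sizes such as 4 or 32 reach the n = 4 base case.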

Page 48: Recursion Unrolling  for Divide and Conquer Programs

Solution: Recursion Re-Rolling

• Roll back the recursive part of the unrolled procedure after the large base cases are generated

• Re-Rolling ensures that larger base cases are always executed, independent of the input problem size

• The compiler unrolls the recursive part only temporarily, to generate the base cases

Page 49: Recursion Unrolling  for Divide and Conquer Programs

Transformation 3: Recursion Re-Rolling

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}

Page 50: Recursion Unrolling  for Divide and Conquer Programs

Transformation 3: Recursion Re-Rolling

Identify the recursive part, which is the final else branch of the unrolled f1:

  else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }

Page 51: Recursion Unrolling  for Divide and Conquer Programs

Transformation 3: Recursion Re-Rolling

Replace with the recursive part of the original procedure; the base cases for n=1, n=2, and n=4 stay unchanged:

  else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }

Page 52: Recursion Unrolling  for Divide and Conquer Programs

Final Result

void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
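A sketch of the re-rolled procedure with hypothetical instrumentation (the hits array and the run helper are not part of the talk), assuming n is a power of two no larger than 64. It records which base case is reached so the effect of re-rolling can be observed directly.

```c
#include <assert.h>

/* Hypothetical instrumentation: hits[0] counts the n=1 base case,
   hits[1] the n=2 base case, hits[2] the n=4 base case. */
static int hits[3];

/* Re-rolled procedure: large base cases, original recursive part. */
void f1(char *p, int n) {
  if (n == 1) {
    hits[0]++;
    *p += 1;
  } else if (n/2 == 1) {
    hits[1]++;
    *p += 1;
    *(p + n/2) += 1;
  } else if (n/2/2 == 1) {
    hits[2]++;
    *p += 1;
    *(p + n/2/2) += 1;
    *(p + n/2) += 1;
    *(p + n/2 + n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p + n/2, n/2);
  }
}

/* Hypothetical helper: reset the counters, run f1 on a zeroed buffer,
   and return 1 if each of the n elements was incremented exactly once. */
int run(int n) {
  char buf[64] = {0};
  assert(n >= 1 && n <= 64);
  hits[0] = hits[1] = hits[2] = 0;
  f1(buf, n);
  for (int i = 0; i < n; i++) {
    if (buf[i] != 1) return 0;
  }
  return 1;
}
```

Because the recursive part halves the problem one level at a time, every power-of-two size of at least 4 ends in the n = 4 base case: n = 8 reaches it twice and n = 16 four times.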

Page 53: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 54: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 55: Recursion Unrolling  for Divide and Conquer Programs

Other Optimizations

• Inlining moves code from the inter-procedural level to the intra-procedural level

• Conditional fusion brings code from the inter-basic-block level to the intra-basic-block level

• Together, inlining and conditional fusion give subsequent compiler passes the opportunity to perform more aggressive optimizations

Page 56: Recursion Unrolling  for Divide and Conquer Programs

Comparison to Hand Coded Programs

• Two applications: Matrix multiply, LU decomposition

• Three machines: Pentium III, Origin 2000, PowerPC

• Two different problem sizes

• Compare automatically unrolled programs to optimized, hand coded versions from the Cilk benchmarks

• Best automatically unrolled version performs:
  • Between 2.2 and 2.9 times worse for matrix multiply
  • As good as hand coded version for LU

Page 57: Recursion Unrolling  for Divide and Conquer Programs

Related Work

• Procedure Inlining:
  • Scheifler (1977)
  • Richardson, Ganapathi (1989)
  • Chambers, Ungar (1989)
  • Cooper, Hall, Torczon (1991)
  • Appel (1992)
  • Chang, Mahlke, Chen, Hwu (1992)

Page 58: Recursion Unrolling  for Divide and Conquer Programs

Conclusion

• Recursion Unrolling
  • analogous to the loop unrolling transformation

• Divide and Conquer Programs
  • The programmer writes simple base cases
  • The compiler automatically generates large base cases

• Key Techniques
  • Inlining: conceptually inline recursive calls
  • Conditional Fusion: simplify intra-procedural control flow
  • Re-Rolling: ensure that large base cases are executed

Page 62: Recursion Unrolling  for Divide and Conquer Programs

Comparison to Hand Coded Programs

• Matrix multiply for 512 x 512 elements:
  • Best automatically unrolled program: 2.55 sec.
  • Hand coded with three nested loops: 3.46 sec.
  • Hand coded Cilk program: 1.16 sec.

• Matrix multiply for 1024 x 1024 elements:
  • Best automatically unrolled program: 20.47 sec.
  • Hand coded with three nested loops: 27.40 sec.
  • Hand coded Cilk program: 9.19 sec.
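The ratios implied by these timings can be worked out directly. This is plain arithmetic on the figures listed on this slide (the slide does not say which machine they come from); T_unrolled, T_Cilk, and T_loops are shorthand introduced here for the three reported times.

```latex
\begin{align*}
512 \times 512:\quad
  & \frac{T_{\text{unrolled}}}{T_{\text{Cilk}}} = \frac{2.55}{1.16} \approx 2.20,
  & \frac{T_{\text{loops}}}{T_{\text{unrolled}}} = \frac{3.46}{2.55} \approx 1.36 \\
1024 \times 1024:\quad
  & \frac{T_{\text{unrolled}}}{T_{\text{Cilk}}} = \frac{20.47}{9.19} \approx 2.23,
  & \frac{T_{\text{loops}}}{T_{\text{unrolled}}} = \frac{27.40}{20.47} \approx 1.34
\end{align*}
```

So on these runs the automatically unrolled program takes about 2.2 times as long as the hand coded Cilk program, and the three-nested-loop version takes roughly a third longer than the unrolled program.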

Page 63: Recursion Unrolling  for Divide and Conquer Programs

Correctness

• Recursion unrolling preserves the semantics of the program:

• The unrolled program terminates if and only if the original recursive program terminates

• When both the original and the unrolled program terminate, they yield the same result

Page 64: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Pentium III, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 65: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Pentium III, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 66: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Power PC, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 67: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Power PC, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 68: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Origin 2000, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 69: Recursion Unrolling  for Divide and Conquer Programs

Speedup for Matrix Multiply

Origin 2000, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 70: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Pentium III, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 71: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Pentium III, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 72: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Power PC, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 73: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Power PC, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 74: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Origin 2000, Matrix of 1024 x 1024 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]

Page 75: Recursion Unrolling  for Divide and Conquer Programs

Speedup for LU

Origin 2000, Matrix of 512 x 512 elements

[Figure: bar chart of speedup (y-axis, 0 to 10) against unrolling factor (x-axis: 1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]