SIMD Optimization in COINS Compiler Infrastructure

Mitsugu Suzuki (The University of Electro-Communications)
Nobuhisa Fujinami (Sony Computer Entertainment Inc.)

Transcript of SIMD Optimization in COINS Compiler Infrastructure

Page 1: SIMD Optimization in COINS Compiler Infrastructure

SIMD Optimization in COINS Compiler Infrastructure

Mitsugu Suzuki (The University of Electro-Communications)
Nobuhisa Fujinami (Sony Computer Entertainment Inc.)

Page 2: SIMD Optimization in COINS Compiler Infrastructure

Agenda

- COINS SIMD optimization
- Two topics on SIMD optimization
    - Data Size Inference
    - SIMD Benchmark
- Current status and required improvements

Page 3: SIMD Optimization in COINS Compiler Infrastructure

SIMD optimization‥‥ Concept and decision

- Implemented as an LIR-to-LIR transformer.
- Requires no additional special extensions for source languages.
- Matters that can be optimized at the source level are postponed.
    → HIR-level matters, e.g. vectorization (appropriate loop unrolling), if-peeling, complex if-conversion, etc.

Page 4: SIMD Optimization in COINS Compiler Infrastructure

#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))

short *v1, *v2, *v3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of vector register. */

for (i = 0; i < M; i++)            // case-A
    *v1++ = AVE(*v2++, *v3++);

for (i = 0; i < M; i++)            // case-B
    v1[i] = AVE(v2[i], v3[i]);

for (i = 0; i < M; i += 4) {       // case-C
    v1[i]   = AVE(v2[i],   v3[i]);
    v1[i+1] = AVE(v2[i+1], v3[i+1]);
    ...
    v1[i+3] = AVE(v2[i+3], v3[i+3]);
}

for (i = 0; i < M; i += 4) {       // case-D
    v1[0] = AVE(v2[0], v3[0]);
    v1[1] = AVE(v2[1], v3[1]);
    ...
    v1[3] = AVE(v2[3], v3[3]);
    v1 += 4; v2 += 4; v3 += 4;
}

×

Page 5: SIMD Optimization in COINS Compiler Infrastructure

#define AVE(x,y) (((x)>>1)+((y)>>1)+(((x)|(y))&1))

struct { short r, g, b, a; } *u1, *u2, *u3;
/* Assume that all pointers are aligned, and distances of source and
   destination pointers are longer than the size of vector register. */

for (i = 0; i < M; i++) {          // case-E
    u1[i].r = AVE(u2[i].r, u3[i].r);
    u1[i].g = AVE(u2[i].g, u3[i].g);
    u1[i].b = AVE(u2[i].b, u3[i].b);
    u1[i].a = AVE(u2[i].a, u3[i].a);
}

Page 6: SIMD Optimization in COINS Compiler Infrastructure

SIMD optimization‥‥ Processing flow

1. If-conversion.
2. Decompose basic blocks into DAGs.
3. Match LIR patterns to specific SIMD operations.
4. Combine same basic operations (parallelization).

(⇒ 3rd page of the handout)
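Step 4 can be pictured without reference to any particular SIMD instruction set. The following is a minimal sketch in portable C, not COINS output: it assumes four unsigned 16-bit lanes packed into one `uint64_t` as a stand-in for a vector register, and the names `ave_scalar`, `ave_packed4` and `pack4` are hypothetical. The slides use `short`; unsigned lanes are used here only to keep the shifts well defined.

```c
#include <stdint.h>

/* Scalar AVE from the slides, applied to one unsigned 16-bit value. */
static uint16_t ave_scalar(uint16_t x, uint16_t y)
{
    return (uint16_t)((x >> 1) + (y >> 1) + ((x | y) & 1));
}

/* Pack four 16-bit values into one 64-bit word, v[0] in the lowest lane. */
static uint64_t pack4(const uint16_t v[4])
{
    return (uint64_t)v[0]
         | ((uint64_t)v[1] << 16)
         | ((uint64_t)v[2] << 32)
         | ((uint64_t)v[3] << 48);
}

/* Four AVE operations combined into one sequence of 64-bit operations. */
static uint64_t ave_packed4(uint64_t x, uint64_t y)
{
    const uint64_t clear_msb = 0x7FFF7FFF7FFF7FFFULL; /* drops bits shifted in from the neighbouring lane */
    const uint64_t lane_lsb  = 0x0001000100010001ULL; /* lowest bit of every lane */

    uint64_t hx    = (x >> 1) & clear_msb;  /* per-lane x >> 1 */
    uint64_t hy    = (y >> 1) & clear_msb;  /* per-lane y >> 1 */
    uint64_t round = (x | y) & lane_lsb;    /* per-lane (x | y) & 1 */

    /* Each lane sums to at most 0x7FFF + 0x7FFF + 1 = 0xFFFF,
       so no carry crosses a lane boundary. */
    return hx + hy + round;
}
```

The packed version is only valid because every intermediate value of this averaging style fits in the lane width.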

Page 7: SIMD Optimization in COINS Compiler Infrastructure

Data size inference ‥‥ Why needed?

Two styles of averaging integers
(assumption: both x and y are 8-bit unsigned integers):

#define AVE(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1))
    (x)>>1 and (y)>>1 need 7 bits each; (x)|(y), the sums and the result need 8 bits.
    ⇒ max 8 bits: no extension is needed (SIMD-instruction-oriented coding)

#define AVE(x,y) (((x) + (y) + 1) >> 1)
    The sum (x) + (y) + 1 needs 9 bits; the result needs 8 bits.
    ⇒ max 9 bits: zero-extension is needed (normal-instruction-oriented coding)

But the compiler must extend x and y to the promoted integral type (typically 32 bits) ← integral promotion rule
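The claim that both styles compute the same average, and that they differ only in the width of their intermediate values, can be checked exhaustively for 8-bit inputs. A minimal sketch; the macro names `AVE_SHIFT` and `AVE_ADD` and the function `compare_ave_styles` are introduced here for illustration only.

```c
#include <stdint.h>

#define AVE_SHIFT(x,y) (((x)>>1) + ((y)>>1) + (((x)|(y))&1))
#define AVE_ADD(x,y)   (((x) + (y) + 1) >> 1)

/* Returns 1 if both styles agree for every pair of 8-bit unsigned inputs.
   Also records the largest intermediate value each style produces. */
static int compare_ave_styles(unsigned *max_add, unsigned *max_shift)
{
    *max_add = 0;
    *max_shift = 0;
    for (unsigned x = 0; x <= 255; x++) {
        for (unsigned y = 0; y <= 255; y++) {
            unsigned t_add   = x + y + 1;           /* value before the final shift */
            unsigned t_shift = AVE_SHIFT(x, y);     /* largest value of the shift style */
            if (t_add > *max_add)     *max_add = t_add;
            if (t_shift > *max_shift) *max_shift = t_shift;
            if (AVE_ADD(x, y) != AVE_SHIFT(x, y))
                return 0;
        }
    }
    return 1;
}
```

With 8-bit inputs the add style peaks at 511 (9 bits) while the shift style never exceeds 255 (8 bits), which matches the two "max" figures above.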

Page 8: SIMD Optimization in COINS Compiler Infrastructure

Data size inference‥‥ Method

1. Get the value range for each node.
2. Get the altering bits from the value range.
3. Get the meaningful bits for each node from the ones given by its upper node.

Getting value ranges and required bits is based on their inference rules.
Patterns of the meaningful bits are matched during instruction selection.
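Step 1 and a simplified form of step 2 can be sketched as follows. The rules below are assumptions chosen for illustration (conservative interval arithmetic on unsigned values), not the actual COINS inference rules, and all names (`Range`, `r_add`, `bits_needed`, ...) are hypothetical. Step 3, the top-down propagation of meaningful bits, is not modelled.

```c
#include <stdint.h>

/* Unsigned value range lo..hi of one node. */
typedef struct { uint32_t lo, hi; } Range;

static Range mk(uint32_t lo, uint32_t hi) { Range r = { lo, hi }; return r; }

/* Assumed inference rules; values stay far below 2^32, so nothing wraps. */
static Range r_const(uint32_t c)         { return mk(c, c); }
static Range r_add(Range a, Range b)     { return mk(a.lo + b.lo, a.hi + b.hi); }
static Range r_rshu(Range a, unsigned n) { return mk(a.lo >> n, a.hi >> n); }
static Range r_band(Range a, Range b)    { return mk(0, a.hi < b.hi ? a.hi : b.hi); }

/* Set every bit below the highest set bit of v. */
static uint32_t fill_ones(uint32_t v)
{
    v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16;
    return v;
}

static Range r_bor(Range a, Range b)
{
    return mk(a.lo > b.lo ? a.lo : b.lo, fill_ones(a.hi | b.hi));
}

/* Number of bits that can change within the range (at least 1). */
static unsigned bits_needed(Range r)
{
    unsigned n = 1;
    while (n < 32 && (r.hi >> n) != 0) n++;
    return n;
}

static unsigned max2(unsigned a, unsigned b) { return a > b ? a : b; }

/* Widest node of (*b + *c + 1) >> 1 */
static unsigned widest_add_style(Range b, Range c)
{
    Range s1 = r_add(b, c);
    Range s2 = r_add(s1, r_const(1));
    Range q  = r_rshu(s2, 1);
    return max2(max2(bits_needed(s1), bits_needed(s2)), bits_needed(q));
}

/* Widest node of (*b >> 1) + (*c >> 1) + ((*b | *c) & 1) */
static unsigned widest_shift_style(Range b, Range c)
{
    Range hb = r_rshu(b, 1);
    Range hc = r_rshu(c, 1);
    Range s  = r_add(hb, hc);
    Range o  = r_bor(b, c);
    Range m  = r_band(o, r_const(1));
    Range t  = r_add(s, m);
    unsigned w = max2(bits_needed(hb), bits_needed(hc));
    w = max2(w, bits_needed(s));
    w = max2(w, bits_needed(o));
    w = max2(w, bits_needed(m));
    return max2(w, bits_needed(t));
}
```

For two inputs in 0..255, the add style reaches a 9-bit node (upper bound 511) while every node of the shift style fits in 8 bits. This sketch computes a lower bound of 1 for the `+ 1` node where the slides show 0; only the upper bound matters for the bit width.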

Page 9: SIMD Optimization in COINS Compiler Infrastructure

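[Figure: LIR DAGs for the two averaging styles, with nodes annotated by their inferred value ranges]

*a = (*b + *c + 1) >> 1;
    SET MEM:I8 ← CONVIT:I8 (RSHU (ADD (ADD (CONVZX MEM:I8, CONVZX MEM:I8), CONST 1), CONST 1))
    Value ranges: CONVZX 0..255; inner ADD 0..510; outer ADD 0..511; CONST 1..1; RSHU 0..255

*a = (*b>>1) + (*c>>1) + ((*b | *c) & 1);
    SET MEM:I8 ← CONVIT:I8 (ADD (ADD (RSHU (CONVZX MEM:I8, CONST 1), RSHU (CONVZX MEM:I8, CONST 1)), BAND (BOR (CONVZX MEM:I8, CONVZX MEM:I8), CONST 1)))
    Value ranges: CONVZX 0..255; RSHU 0..127; inner ADD 0..254; BOR 0..255; CONST 1..1; BAND 0..1; outer ADD 0..255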

Page 11: SIMD Optimization in COINS Compiler Infrastructure

[Figure: the two LIR DAGs of page 9 again, for
    *a = (*b + *c + 1) >> 1;
and
    *a = (*b>>1) + (*c>>1) + ((*b | *c) & 1);
with the same value ranges (0..255 at the CONVZX nodes; up to 0..510 and 0..511 at the ADD nodes of the first form; at most 0..255 in the second form), now additionally annotated with the number of meaningful bits (8) on several nodes.]

Page 14: SIMD Optimization in COINS Compiler Infrastructure

SIMD Benchmark‥‥ Why needed?

- Existing benchmarks are not suited for tuning SIMD optimization.
    - SIMD-optimizable patterns are buried among non-SIMD-optimizable ones.
    - Existing code is far from SIMD-optimizable form (without hole-in-one matching).
- Step-wise milestones for SIMD optimization were required.

Page 15: SIMD Optimization in COINS Compiler Infrastructure

SIMD Benchmark‥‥ Design

- SIMD-optimizable code patterns were extracted from real media processing applications.
- Multiple versions were crafted by hand for each code pattern
    - so as to cover a wide range, from an easily SIMD-optimized level to the original,
    - classified by SIMD optimization technique.
- Execution times are reported for each version.

Page 16: SIMD Optimization in COINS Compiler Infrastructure

Original:

int16_t acLevel = data[i];
if (acLevel < 0) {
    acLevel = (-acLevel) - quant_d_2;
    if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
    acLevel = (acLevel * mult) >> SCALEBITS;
    sum += acLevel;
    coeff[i] = -acLevel;
} else {
    acLevel = acLevel - quant_d_2;
    if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
    acLevel = (acLevel * mult) >> SCALEBITS;
    sum += acLevel;
    coeff[i] = acLevel;
}

If-peeled (with or without loop unrolling):

acLevel = ((data[i] < 0) ? -data[i] : data[i]) - quant_d_2;
acLevel2 = (acLevel * mult) >> SCALEBITS;
sum += ((acLevel < quant_m_2) ? 0 : acLevel2);
coeff[i] = ((acLevel < quant_m_2) ? 0 : ((data[i] < 0) ? -acLevel2 : acLevel2));

Page 17: SIMD Optimization in COINS Compiler Infrastructure

Original:

int16_t acLevel = data[i];
if (acLevel < 0) {
    acLevel = (-acLevel) - quant_d_2;
    if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
    acLevel = (acLevel * mult) >> SCALEBITS;
    sum += acLevel;
    coeff[i] = -acLevel;
} else {
    acLevel = acLevel - quant_d_2;
    if (acLevel < quant_m_2) { coeff[i] = 0; continue; }
    acLevel = (acLevel * mult) >> SCALEBITS;
    sum += acLevel;
    coeff[i] = acLevel;
}

If-converted (with or without loop unrolling):

acMsk1 = (int)data[i] >> 31;
acLevel = ((data[i] & ~acMsk1) | ((-data[i]) & acMsk1)) - quant_d_2;
acMsk2 = (acLevel < quant_m_2) ? 0 : 0xffff;
acLevel = (acLevel * mult) >> SCALEBITS;
sum += acMsk2 & acLevel;
coeff[i] = acMsk2 & (((-acLevel) & acMsk1) | (acLevel & (~acMsk1)));
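Whether a hand-written branch-free version still computes what the branchy original does can be checked mechanically. The following is a minimal sketch of such a check, not the benchmark's code: `QuantParams`, `quant_branchy` and `quant_masked` are hypothetical names, the parameter values are up to the caller, and the masks are built with `-(condition)` instead of the `>> 31` and `0xffff` forms on the slide so that the sketch does not rely on right-shifting negative values. It assumes two's-complement `int` and a non-negative `quant_m_2`.

```c
#include <stddef.h>
#include <stdint.h>

#define SCALEBITS 16

/* Hypothetical parameter bundle for the quantization loop. */
typedef struct { int quant_d_2, quant_m_2, mult; } QuantParams;

/* Reference version with branches; returns the accumulated sum. */
static int quant_branchy(int16_t *coeff, const int16_t *data, size_t n, QuantParams p)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        int level = (data[i] < 0 ? -data[i] : data[i]) - p.quant_d_2;
        if (level < p.quant_m_2) { coeff[i] = 0; continue; }
        level = (level * p.mult) >> SCALEBITS;
        sum += level;
        coeff[i] = (int16_t)(data[i] < 0 ? -level : level);
    }
    return sum;
}

/* If-converted version: every branch is replaced by a bit mask. */
static int quant_masked(int16_t *coeff, const int16_t *data, size_t n, QuantParams p)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d    = data[i];
        int neg  = -(d < 0);                          /* all ones if d < 0, else 0 */
        int mag  = ((d & ~neg) | (-d & neg)) - p.quant_d_2;
        int keep = -(mag >= p.quant_m_2);             /* all ones if above threshold */
        int level = ((mag & keep) * p.mult) >> SCALEBITS;  /* 0 when not kept */
        sum += level;
        coeff[i] = (int16_t)((-level & neg) | (level & ~neg));
    }
    return sum;
}
```

Running both functions over the same input and comparing the sums and the `coeff` arrays gives a quick regression check for each hand-crafted benchmark version.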

Page 18: SIMD Optimization in COINS Compiler Infrastructure

Current status and required improvements

- The skeleton of the SIMD optimizer has been implemented.
- The following are MUSTs:
    - Enrichment of templates for specific SIMD operations.
    - Isolation of the machine-dependent and machine-independent parts of the SIMD optimizer.
    - A recovery method for failures in SIMD operation matching.
    - Alignment and overlap checks for pointers.

⇒ will be solved in the next release
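The alignment and overlap checks listed above could, for instance, be emitted as a run-time guard in front of the SIMD version of a loop. This is a minimal sketch of such a guard under stated assumptions, not the planned COINS implementation: the function names are hypothetical, `vec_bytes` is assumed to be a power of two, and pointers are compared through `uintptr_t`, which is implementation-defined but works on common flat-address platforms. The overlap test is conservative: it requires the whole destination range to be disjoint from each source range, which is stricter than the distance assumption stated with the case-A to case-E loops.

```c
#include <stddef.h>
#include <stdint.h>

/* 1 if p is a multiple of align (align must be a power of two). */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* 1 if the byte ranges [a, a+len) and [b, b+len) are disjoint. */
static int no_overlap(const void *a, const void *b, size_t len)
{
    uintptr_t ua = (uintptr_t)a;
    uintptr_t ub = (uintptr_t)b;
    return ua + len <= ub || ub + len <= ua;
}

/* Guard for dst[i] = f(src1[i], src2[i]) over len bytes. */
static int simd_safe(const void *dst, const void *src1, const void *src2,
                     size_t len, size_t vec_bytes)
{
    return is_aligned(dst, vec_bytes)
        && is_aligned(src1, vec_bytes)
        && is_aligned(src2, vec_bytes)
        && no_overlap(dst, src1, len)
        && no_overlap(dst, src2, len);
}
```

A caller would branch to the scalar loop whenever `simd_safe` returns 0.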