AISC: Approximate Instruction Set Computer
Alexandra Ferrerón, Darío Suárez-Gracia, Jesús Alastruey-Benedé, Ulya R. Karpuzcu
University of Zaragoza · University of Minnesota, Twin Cities
WAX'18, March 25, 2018
AISC
Instruction Set Architecture (ISA)
2
• View from the hardware stack:
  • Behavioral design specification
  • Determines hardware complexity
• View from the software stack:
  • Definition of the machine capabilities
  • Determines functional completeness
AISC
Approximate Instruction Set Computer (AISC)
• Single-ISA heterogeneous fabric
  • Each compute engine (core or fixed function) supports a subset of the ISA
  • Component (functionally-incomplete) ISAs may overlap
• An incomplete ISA can reduce microarchitectural complexity, thereby improving performance per Watt
• The union of the incomplete ISA subsets renders a functionally complete ISA
3
AISC
Approximate Instruction Set Computer (AISC)
4
• Code that cannot be approximated can run at full accuracy
• Improved energy efficiency if ISA-induced approximation is tolerable
AISC
Approximate Instruction Set Computer (AISC)
5
• How to determine incomplete, approximate ISA subsets?
• Vertical approximation: exclude less frequently used complex instructions
• Horizontal approximation: simplify each instruction, e.g., by precision reduction
AISC
Proof-Of-Concept
6
WAX'18, March 25, 2018 · Alexandra Ferrerón1, Jesús Alastruey-Benedé1, Darío Suárez-Gracia1, Ulya R. Karpuzcu2
1 Universidad de Zaragoza, Spain · 2 University of Minnesota, Twin Cities
errors or simply enforced by design. The latter applies for the following discussion.
2 Proof-of-Concept Implementation

Let us start with a motivating example. Fig. 1 shows how the (graphic) output of a typical noise-tolerant application, SRR1, changes under representative Vertical, Horizontal, and Horizontal + Vertical approximation under AISC. The application is compiled with GCC 4.8.4 at -O1 on an Intel® Core™ i5 3210M machine. As we perform manual transformations on the code, high optimization levels hinder the task; we resort to -O1 for our proof of concept and leave a deeper exploration of compiler optimizations for future work. We focus on the main kernel, where the actual computation takes place, and conservatively assume that this entire code would be mapped to a compute engine with an approximated ISA. We use ACCURAX metrics [1] to quantify the accuracy loss. We prototype basic Horizontal and Vertical ISA approximations on Pin 2.14 [6]. Fig. 1(a) captures the output of the baseline for comparison, Native execution, which excludes any approximation. We observe that the accuracy loss remains barely visible, but still varies across different approximations. Let us next take a closer look at the sources of this diversity.
(a) Native (b) Vertical (c) Horizontal (d) Horiz.+Vert.
Figure 1. Graphic output of the SRR benchmark under representative AISC approximations (b)-(d).

2.1 Vertical Approximation

The key question is how to pick the instructions to drop. A more general version of this question, which instructions to approximate under AISC, already applies across all dimensions, but it becomes more critical in this case. As the most aggressive technique in our bag of tricks, Vertical can incur a significant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centric phases as opposed to control [3]. Therefore, confining instruction dropping to data flow can help limit the incurred accuracy loss. Fig. 1(b) captures an example execution outcome, where we randomly deleted static (arithmetic) floating-point instructions. For each static instruction, we based the dropping decision on a pre-defined threshold t: we generated a random number r in the range [0, 1], and dropped the static instruction if r remained below t. We experimented with threshold values between 1% and 10%.
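The per-instruction dropping rule above can be sketched as follows. This is a minimal Python illustration of the decision logic only, not the actual Pin-based tool; the function names and the use of a seeded generator are assumptions for the sake of a self-contained example.

```python
import random

def should_drop(threshold: float, rng: random.Random) -> bool:
    """Draw r uniformly from [0, 1) and drop the static
    instruction when r falls below the threshold t."""
    return rng.random() < threshold

def tag_instructions(n_static: int, threshold: float, seed: int = 0) -> list:
    """Tag each static floating-point instruction once, at
    instrumentation time, so the decision is per static (not
    per dynamic) instruction, as in the paper."""
    rng = random.Random(seed)
    return [should_drop(threshold, rng) for _ in range(n_static)]
```

Tagging statically (once per instruction, not per execution) matters: a dropped static instruction is skipped on every dynamic occurrence, which is what confines the perturbation to a fixed subset of the data flow.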
2.2 Horizontal Approximation

Without loss of generality, we experimented with three approximations to reduce operand widths: DPtoSP, DP(SP)toHP,

1 Super Resolution Reconstruction, a computer vision application from the Cortex suite [12]. We use the (64×64) "EIA" input data set of 16 frames. The output is the (256×256) reconstructed image.
and DP(SP)toINT. Under the IEEE 754 standard, 32 (64) bits express a single (double) precision floating-point number: one bit specifies the sign; 8 (11) bits, the exponent; and 23 (52) bits, the mantissa, i.e., the fraction. For example, (-1)^sign × 2^(exponent-127) × 1.mantissa represents a single-precision floating-point number. DPtoSP is a bit-discarding variant, which omits the 32 least-significant bits of the mantissa of each double-precision operand of an instruction and keeps the exponent intact. DP(SP)toHP comes in two flavors. DPtoHP omits the 48 least-significant bits of the mantissa of each double-precision operand of an instruction and keeps the exponent intact; SPtoHP, the 16 least-significant bits of the mantissa of each single-precision operand of an instruction. Fig. 1(c) captures an example execution outcome under DPtoHP. DP(SP)toINT also comes in two flavors. DPtoINT (SPtoINT) replaces double (single) precision instructions with their integer counterparts, by rounding the floating-point operand values to the closest integer.

2.3 Horizontal + Vertical Approximation

Without loss of generality, we experimented with two representatives in this case: MULtoADD and DIVtoMUL. MULtoADD converts multiplication instructions to a sequence of additions. We picked the smaller of the factors as the multiplier (which determines the number of additions), and rounded floating-point multipliers to the closest integer. DIVtoMUL converts division instructions to multiplications. We first calculated the reciprocal of the divisor, which next gets multiplied by the dividend to render the end result. In our proof-of-concept implementation based on the x86 ISA, the reciprocal instruction has 12-bit precision. DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on the other hand, relies on one iteration of the Newton-Raphson method [4] to increase the precision of the reciprocal to 23 bits.
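The bit discarding of Section 2.2 amounts to masking low mantissa bits in the IEEE 754 encoding while leaving sign and exponent untouched. A minimal Python sketch of that operation (illustrative only; the paper's tool operates on operands inside Pin, not on host doubles via struct):

```python
import struct

def discard_mantissa_bits(x: float, nbits: int) -> float:
    """Zero the nbits least-significant mantissa bits of an
    IEEE 754 double, keeping sign and exponent intact.
    DPtoSP corresponds to nbits=32; DPtoHP to nbits=48."""
    # Reinterpret the double as its 64-bit encoding.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Clear the low mantissa bits; sign and exponent live in
    # the top 12 bits and are untouched for nbits <= 52.
    bits &= ~((1 << nbits) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

Because truncation only clears bits, the result's magnitude never exceeds the original, and the worst-case error is bounded by 2^nbits units in the last place of the operand.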
DIVtoMUL12 can be regarded as an approximate version of DIVtoMUL.NR, eliminating the Newton-Raphson iteration, and hence enforcing a less accurate estimate of the reciprocal (of only 12-bit precision). Fig. 1(d) captures an example execution outcome under DIVtoMUL.NR.
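The DIVtoMUL variants can be sketched in a few lines. This is a hedged illustration: the 12-bit seed below is emulated by rounding the true reciprocal to 12 fractional bits, which stands in for (but is not identical to) the semantics of the x86 low-precision reciprocal instruction; the Newton-Raphson step itself is the standard one.

```python
def refine_reciprocal(d: float, r0: float) -> float:
    """One Newton-Raphson step for 1/d: r1 = r0 * (2 - d * r0).
    Each step roughly doubles the number of correct bits,
    e.g. a ~12-bit seed to ~23 bits (DIVtoMUL.NR)."""
    return r0 * (2.0 - d * r0)

def div_to_mul(dividend: float, divisor: float, seed_bits: int = 12) -> float:
    """Replace a division with reciprocal-then-multiply."""
    # Crude low-precision seed (illustrative stand-in for the
    # hardware reciprocal estimate).
    scale = 1 << seed_bits
    r0 = round((1.0 / divisor) * scale) / scale
    # DIVtoMUL12 would stop at r0; DIVtoMUL.NR refines once.
    r1 = refine_reciprocal(divisor, r0)
    return dividend * r1
```

The quadratic convergence of Newton-Raphson is what makes a single extra multiply-add sequence enough to move from 12-bit to roughly 23-bit reciprocal accuracy.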
3 Conclusion & Discussion

Our proof-of-concept analysis revealed that, in its restricted form (where the region of interest of an application is mapped in its entirety to an incomplete-ISA compute engine, irrespective of potential changes in noise tolerance within the course of its execution), AISC can cut energy by up to 37% at around 10% accuracy loss.

The most critical design aspect is how instruction sequences should be mapped to restricted-ISA compute engines, and how such sequences should be migrated from one engine to another within the course of execution, to track potential temporal changes in algorithmic noise tolerance. While fast code migration is not impossible, if not orchestrated carefully the energy overhead of fine-grain migration can easily become prohibitive. Therefore, a break-even point in terms of migration frequency and granularity exists, past which AISC may degrade energy efficiency.
2
AISC
Proof-Of-Concept
7
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
WAX’18, March 25, 2018Alexandra Ferrerón1, Jesús Alastruey-Benedé1, Darío Suárez-Gracia1, Ulya R. Karpuzcu2
1 Universidad de Zaragoza, Spain 2 University of Minnesota, Twin Cities
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
errors or simply enforced by design. The latter applies forthe following discussion.
2 Proof-of-concept ImplementationLet us start with a motivating example. Fig. 1 shows how the(graphic) output of a typical noise-tolerant application, SRR 1,changes for representative Vertical, Horizontal, and Horizon-tal + Vertical approximation under AISC. The application iscompiled with GCC 4.8.4 with -O1 on an Intel® Core™ i53210M machine. As we perform manual transformations onthe code, high optimization levels hinder the task; we resortto -O1 for our proof-of-concept and leave for future workmore exploration on compiler optimizations. We focus onthe main kernel where the actual computation takes place,and conservatively assume that this entire code would bemapped to a compute engine with approximated ISA. Weuse ACCURAX metrics [1] to quantify the accuracy-loss. Weprototype basic Horizontal and Vertical ISA approximationson Pin 2.14 [6]. Fig. 1(a) captures the output for the base-line for comparison, Native execution, which excludes anyapproximation. We observe that the accuracy loss remainsbarely visible, but still varies across di�erent approximations.Let us next take a closer look at the sources of this diversity.
(a) Native (b) Vertical (c) Horizontal (d) Horiz.+Vert.
Figure 1. Graphic output of SRR benchmark under repre-sentative AISC approximations (b)-(d).2.1 Vertical ApproximationThe key question is how to pick the instructions to drop. Amore general version of this question, which instructions toapproximate under AISC, already applies across all dimen-sions, but the question becomes more critical in this case. Asthe most aggressive in our bag of tricks, Vertical can incursigni�cant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centricphases as opposed to control [3]. Therefore, con�ning in-struction dropping to data-�ow can help limit the incurredaccuracy loss. Fig. 1(b) captures an example execution out-come, where we randomly deleted static (arithmetic) �oatingpoint instructions. For each static instruction, we based thedropping decision on a pre-de�ned threshold t. We gener-ated a random number r in the range [0, 1], and droppedthe static instruction if r remains below t. We experimentedwith threshold values between 1% and 10%.
2.2 Horizontal ApproximationWithout loss of generality, we experimented with three ap-proximations to reduce operand widths: DPtoSP, DP(SP)toHP,1Super Resolution Reconstruction, a computer vision application from theCortex suite [12]. We use the (64⇥64) “EIA” input data set of 16 frames. Theoutput is the (256⇥256) reconstructed image.
and DP(SP)toINT. Under the IEEE 754 standard, 32 (64) bitsexpress a single (double) precision �oating point number:one bit speci�es the sign; 8 (11) bits, the exponent; and 23 (52)bits the mantissa, i.e., the fraction. For example, (�1)si�n ⇥2exponent�127⇥1.mantissa represents a single-precision �oat-ing number. DPtoSP is a bit discarding variant, which omits 32least-signi�cant bits of the mantissa of each double-precisionoperand of an instruction, and keeps the exponent intact.DP(SP)toHP comes in two �avors. DPtoHP omits 48 least-signi�cant bits of the mantissa of each double-precisionoperand of an instruction, and keeps the exponent intact; SP-toHP, 16 least-signi�cant bits of the mantissa of each single-precision operand of an instruction. Fig. 1(c) captures anexample execution outcome under DPtoHP. DP(SP)toINTalso comes in two �avors. DPtoINT (SPtoINT) replaces dou-ble (single) precision instructions with their integer counter-parts, by rounding the �oating point operand values to theclosest integer.2.3 Horizontal +Vertical ApproximationWithout loss of generality, we experimented with two rep-resentatives in this case: MULtoADD and DIVtoMUL. MUL-toADD converts multiplication instructions to a sequenceof additions. We picked the smaller of the factors as themultiplier (which determines the number of additions), androunded �oating point multipliers to the closest integer num-ber. DIVtoMUL converts division instructions to multipli-cations. We �rst calculated the reciprocal of the divisor,which next gets multiplied by the dividend to render theend result. In our proof-of-concept implementation based onthe x86 ISA, the reciprocal instruction has 12-bit precision.DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on theother hand, relies on one iteration of the Newton-Raphsonmethod [4] to increase the precision of the reciprocal to23 bits. 
DIVtoMUL12 can be regarded as an approximateversion of DIVtoMUL.NR, eliminating the Newton-Raphsoniteration, and hence enforcing a less accurate estimate ofthe reciprocal (of only 12 bit precision). Fig. 1(d) captures anexample execution outcome under DIVtoMUL.NR.
3 Conclusion & DiscussionOur proof-of-concept analysis revealed that, in its restrictedform – where the region of interest of an application ismapped in its entirety to an incomplete-ISA compute en-gine, irrespective of potential changes in noise tolerancewithin the course of its execution – AISC can cut energy upto 37% at around 10% accuracy loss.The most critical design aspect is how instruction se-
quences should be mapped to restricted-ISA compute en-gines, and how such sequences should be migrated from oneengine to another within the course of execution, to trackpotential temporal changes in algorithmic noise tolerance.While fast code migration is not impossible, if not orches-trated carefully, the energy overhead of �ne-grain migrationcan easily become prohibitive. Therefore, a break-even pointin terms of migration frequency and granularity exists, pastwhich AISC may degrade energy e�ciency.
2
AISC
Proof-Of-Concept
8
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
WAX’18, March 25, 2018Alexandra Ferrerón1, Jesús Alastruey-Benedé1, Darío Suárez-Gracia1, Ulya R. Karpuzcu2
1 Universidad de Zaragoza, Spain 2 University of Minnesota, Twin Cities
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
errors or simply enforced by design. The latter applies forthe following discussion.
2 Proof-of-concept ImplementationLet us start with a motivating example. Fig. 1 shows how the(graphic) output of a typical noise-tolerant application, SRR 1,changes for representative Vertical, Horizontal, and Horizon-tal + Vertical approximation under AISC. The application iscompiled with GCC 4.8.4 with -O1 on an Intel® Core™ i53210M machine. As we perform manual transformations onthe code, high optimization levels hinder the task; we resortto -O1 for our proof-of-concept and leave for future workmore exploration on compiler optimizations. We focus onthe main kernel where the actual computation takes place,and conservatively assume that this entire code would bemapped to a compute engine with approximated ISA. Weuse ACCURAX metrics [1] to quantify the accuracy-loss. Weprototype basic Horizontal and Vertical ISA approximationson Pin 2.14 [6]. Fig. 1(a) captures the output for the base-line for comparison, Native execution, which excludes anyapproximation. We observe that the accuracy loss remainsbarely visible, but still varies across di�erent approximations.Let us next take a closer look at the sources of this diversity.
(a) Native (b) Vertical (c) Horizontal (d) Horiz.+Vert.
Figure 1. Graphic output of SRR benchmark under repre-sentative AISC approximations (b)-(d).2.1 Vertical ApproximationThe key question is how to pick the instructions to drop. Amore general version of this question, which instructions toapproximate under AISC, already applies across all dimen-sions, but the question becomes more critical in this case. Asthe most aggressive in our bag of tricks, Vertical can incursigni�cant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centricphases as opposed to control [3]. Therefore, con�ning in-struction dropping to data-�ow can help limit the incurredaccuracy loss. Fig. 1(b) captures an example execution out-come, where we randomly deleted static (arithmetic) �oatingpoint instructions. For each static instruction, we based thedropping decision on a pre-de�ned threshold t. We gener-ated a random number r in the range [0, 1], and droppedthe static instruction if r remains below t. We experimentedwith threshold values between 1% and 10%.
2.2 Horizontal ApproximationWithout loss of generality, we experimented with three ap-proximations to reduce operand widths: DPtoSP, DP(SP)toHP,1Super Resolution Reconstruction, a computer vision application from theCortex suite [12]. We use the (64⇥64) “EIA” input data set of 16 frames. Theoutput is the (256⇥256) reconstructed image.
and DP(SP)toINT. Under the IEEE 754 standard, 32 (64) bitsexpress a single (double) precision �oating point number:one bit speci�es the sign; 8 (11) bits, the exponent; and 23 (52)bits the mantissa, i.e., the fraction. For example, (�1)si�n ⇥2exponent�127⇥1.mantissa represents a single-precision �oat-ing number. DPtoSP is a bit discarding variant, which omits 32least-signi�cant bits of the mantissa of each double-precisionoperand of an instruction, and keeps the exponent intact.DP(SP)toHP comes in two �avors. DPtoHP omits 48 least-signi�cant bits of the mantissa of each double-precisionoperand of an instruction, and keeps the exponent intact; SP-toHP, 16 least-signi�cant bits of the mantissa of each single-precision operand of an instruction. Fig. 1(c) captures anexample execution outcome under DPtoHP. DP(SP)toINTalso comes in two �avors. DPtoINT (SPtoINT) replaces dou-ble (single) precision instructions with their integer counter-parts, by rounding the �oating point operand values to theclosest integer.2.3 Horizontal +Vertical ApproximationWithout loss of generality, we experimented with two rep-resentatives in this case: MULtoADD and DIVtoMUL. MUL-toADD converts multiplication instructions to a sequenceof additions. We picked the smaller of the factors as themultiplier (which determines the number of additions), androunded �oating point multipliers to the closest integer num-ber. DIVtoMUL converts division instructions to multipli-cations. We �rst calculated the reciprocal of the divisor,which next gets multiplied by the dividend to render theend result. In our proof-of-concept implementation based onthe x86 ISA, the reciprocal instruction has 12-bit precision.DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on theother hand, relies on one iteration of the Newton-Raphsonmethod [4] to increase the precision of the reciprocal to23 bits. 
DIVtoMUL12 can be regarded as an approximateversion of DIVtoMUL.NR, eliminating the Newton-Raphsoniteration, and hence enforcing a less accurate estimate ofthe reciprocal (of only 12 bit precision). Fig. 1(d) captures anexample execution outcome under DIVtoMUL.NR.
3 Conclusion & DiscussionOur proof-of-concept analysis revealed that, in its restrictedform – where the region of interest of an application ismapped in its entirety to an incomplete-ISA compute en-gine, irrespective of potential changes in noise tolerancewithin the course of its execution – AISC can cut energy upto 37% at around 10% accuracy loss.The most critical design aspect is how instruction se-
quences should be mapped to restricted-ISA compute en-gines, and how such sequences should be migrated from oneengine to another within the course of execution, to trackpotential temporal changes in algorithmic noise tolerance.While fast code migration is not impossible, if not orches-trated carefully, the energy overhead of �ne-grain migrationcan easily become prohibitive. Therefore, a break-even pointin terms of migration frequency and granularity exists, pastwhich AISC may degrade energy e�ciency.
2
AISC
Proof-Of-Concept
9
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
WAX’18, March 25, 2018Alexandra Ferrerón1, Jesús Alastruey-Benedé1, Darío Suárez-Gracia1, Ulya R. Karpuzcu2
1 Universidad de Zaragoza, Spain 2 University of Minnesota, Twin Cities
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
errors or simply enforced by design. The latter applies forthe following discussion.
2 Proof-of-concept ImplementationLet us start with a motivating example. Fig. 1 shows how the(graphic) output of a typical noise-tolerant application, SRR 1,changes for representative Vertical, Horizontal, and Horizon-tal + Vertical approximation under AISC. The application iscompiled with GCC 4.8.4 with -O1 on an Intel® Core™ i53210M machine. As we perform manual transformations onthe code, high optimization levels hinder the task; we resortto -O1 for our proof-of-concept and leave for future workmore exploration on compiler optimizations. We focus onthe main kernel where the actual computation takes place,and conservatively assume that this entire code would bemapped to a compute engine with approximated ISA. Weuse ACCURAX metrics [1] to quantify the accuracy-loss. Weprototype basic Horizontal and Vertical ISA approximationson Pin 2.14 [6]. Fig. 1(a) captures the output for the base-line for comparison, Native execution, which excludes anyapproximation. We observe that the accuracy loss remainsbarely visible, but still varies across di�erent approximations.Let us next take a closer look at the sources of this diversity.
(a) Native (b) Vertical (c) Horizontal (d) Horiz.+Vert.
Figure 1. Graphic output of SRR benchmark under repre-sentative AISC approximations (b)-(d).2.1 Vertical ApproximationThe key question is how to pick the instructions to drop. Amore general version of this question, which instructions toapproximate under AISC, already applies across all dimen-sions, but the question becomes more critical in this case. Asthe most aggressive in our bag of tricks, Vertical can incursigni�cant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centricphases as opposed to control [3]. Therefore, con�ning in-struction dropping to data-�ow can help limit the incurredaccuracy loss. Fig. 1(b) captures an example execution out-come, where we randomly deleted static (arithmetic) �oatingpoint instructions. For each static instruction, we based thedropping decision on a pre-de�ned threshold t. We gener-ated a random number r in the range [0, 1], and droppedthe static instruction if r remains below t. We experimentedwith threshold values between 1% and 10%.
2.2 Horizontal ApproximationWithout loss of generality, we experimented with three ap-proximations to reduce operand widths: DPtoSP, DP(SP)toHP,1Super Resolution Reconstruction, a computer vision application from theCortex suite [12]. We use the (64⇥64) “EIA” input data set of 16 frames. Theoutput is the (256⇥256) reconstructed image.
and DP(SP)toINT. Under the IEEE 754 standard, 32 (64) bitsexpress a single (double) precision �oating point number:one bit speci�es the sign; 8 (11) bits, the exponent; and 23 (52)bits the mantissa, i.e., the fraction. For example, (�1)si�n ⇥2exponent�127⇥1.mantissa represents a single-precision �oat-ing number. DPtoSP is a bit discarding variant, which omits 32least-signi�cant bits of the mantissa of each double-precisionoperand of an instruction, and keeps the exponent intact.DP(SP)toHP comes in two �avors. DPtoHP omits 48 least-signi�cant bits of the mantissa of each double-precisionoperand of an instruction, and keeps the exponent intact; SP-toHP, 16 least-signi�cant bits of the mantissa of each single-precision operand of an instruction. Fig. 1(c) captures anexample execution outcome under DPtoHP. DP(SP)toINTalso comes in two �avors. DPtoINT (SPtoINT) replaces dou-ble (single) precision instructions with their integer counter-parts, by rounding the �oating point operand values to theclosest integer.2.3 Horizontal +Vertical ApproximationWithout loss of generality, we experimented with two rep-resentatives in this case: MULtoADD and DIVtoMUL. MUL-toADD converts multiplication instructions to a sequenceof additions. We picked the smaller of the factors as themultiplier (which determines the number of additions), androunded �oating point multipliers to the closest integer num-ber. DIVtoMUL converts division instructions to multipli-cations. We �rst calculated the reciprocal of the divisor,which next gets multiplied by the dividend to render theend result. In our proof-of-concept implementation based onthe x86 ISA, the reciprocal instruction has 12-bit precision.DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on theother hand, relies on one iteration of the Newton-Raphsonmethod [4] to increase the precision of the reciprocal to23 bits. 
DIVtoMUL12 can be regarded as an approximateversion of DIVtoMUL.NR, eliminating the Newton-Raphsoniteration, and hence enforcing a less accurate estimate ofthe reciprocal (of only 12 bit precision). Fig. 1(d) captures anexample execution outcome under DIVtoMUL.NR.
3 Conclusion & DiscussionOur proof-of-concept analysis revealed that, in its restrictedform – where the region of interest of an application ismapped in its entirety to an incomplete-ISA compute en-gine, irrespective of potential changes in noise tolerancewithin the course of its execution – AISC can cut energy upto 37% at around 10% accuracy loss.The most critical design aspect is how instruction se-
quences should be mapped to restricted-ISA compute en-gines, and how such sequences should be migrated from oneengine to another within the course of execution, to trackpotential temporal changes in algorithmic noise tolerance.While fast code migration is not impossible, if not orches-trated carefully, the energy overhead of �ne-grain migrationcan easily become prohibitive. Therefore, a break-even pointin terms of migration frequency and granularity exists, pastwhich AISC may degrade energy e�ciency.
2
double precision ➔ half-precision
AISC
Proof-Of-Concept
10
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
WAX’18, March 25, 2018Alexandra Ferrerón1, Jesús Alastruey-Benedé1, Darío Suárez-Gracia1, Ulya R. Karpuzcu2
1 Universidad de Zaragoza, Spain 2 University of Minnesota, Twin Cities
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
errors or simply enforced by design. The latter applies forthe following discussion.
2 Proof-of-concept ImplementationLet us start with a motivating example. Fig. 1 shows how the(graphic) output of a typical noise-tolerant application, SRR 1,changes for representative Vertical, Horizontal, and Horizon-tal + Vertical approximation under AISC. The application iscompiled with GCC 4.8.4 with -O1 on an Intel® Core™ i53210M machine. As we perform manual transformations onthe code, high optimization levels hinder the task; we resortto -O1 for our proof-of-concept and leave for future workmore exploration on compiler optimizations. We focus onthe main kernel where the actual computation takes place,and conservatively assume that this entire code would bemapped to a compute engine with approximated ISA. Weuse ACCURAX metrics [1] to quantify the accuracy-loss. Weprototype basic Horizontal and Vertical ISA approximationson Pin 2.14 [6]. Fig. 1(a) captures the output for the base-line for comparison, Native execution, which excludes anyapproximation. We observe that the accuracy loss remainsbarely visible, but still varies across di�erent approximations.Let us next take a closer look at the sources of this diversity.
(a) Native (b) Vertical (c) Horizontal (d) Horiz.+Vert.
Figure 1. Graphic output of SRR benchmark under repre-sentative AISC approximations (b)-(d).2.1 Vertical ApproximationThe key question is how to pick the instructions to drop. Amore general version of this question, which instructions toapproximate under AISC, already applies across all dimen-sions, but the question becomes more critical in this case. Asthe most aggressive in our bag of tricks, Vertical can incursigni�cant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centricphases as opposed to control [3]. Therefore, con�ning in-struction dropping to data-�ow can help limit the incurredaccuracy loss. Fig. 1(b) captures an example execution out-come, where we randomly deleted static (arithmetic) �oatingpoint instructions. For each static instruction, we based thedropping decision on a pre-de�ned threshold t. We gener-ated a random number r in the range [0, 1], and droppedthe static instruction if r remains below t. We experimentedwith threshold values between 1% and 10%.
AISC
Proof-Of-Concept
11
[Slide figure: example output under the DIV → MUL transformation]
errors or simply enforced by design. The latter applies for the following discussion.
2 Proof-of-concept Implementation

Let us start with a motivating example. Fig. 1 shows how the (graphic) output of a typical noise-tolerant application, SRR¹, changes under representative Vertical, Horizontal, and Horizontal + Vertical approximations in AISC. The application is compiled with GCC 4.8.4 at -O1 on an Intel® Core™ i5 3210M machine. Because we perform manual transformations on the code, higher optimization levels hinder the task; we resort to -O1 for our proof of concept and leave a broader exploration of compiler optimizations for future work. We focus on the main kernel, where the actual computation takes place, and conservatively assume that this entire code is mapped to a compute engine with an approximated ISA. We use ACCURAX metrics [1] to quantify the accuracy loss. We prototype basic Horizontal and Vertical ISA approximations on Pin 2.14 [6]. Fig. 1(a) captures the output of the baseline for comparison, Native execution, which excludes any approximation. We observe that the accuracy loss remains barely visible, but still varies across the different approximations. Let us next take a closer look at the sources of this diversity.

¹ Super Resolution Reconstruction, a computer vision application from the Cortex suite [12]. We use the (64×64) “EIA” input data set of 16 frames. The output is the (256×256) reconstructed image.
[Figure 1: panels (a) Native, (b) Vertical, (c) Horizontal, (d) Horiz.+Vert. — graphic output of the SRR benchmark under representative AISC approximations (b)-(d).]

2.1 Vertical Approximation

The key question is how to pick the instructions to drop. A more general version of this question, which instructions to approximate under AISC, applies across all dimensions, but it becomes most critical in this case. As the most aggressive technique in our bag of tricks, Vertical can incur significant loss in accuracy. The targeted recognition-mining-synthesis applications can tolerate errors in data-centric phases, as opposed to control [3]. Therefore, confining instruction dropping to data-flow can help limit the incurred accuracy loss. Fig. 1(b) captures an example execution outcome, where we randomly deleted static (arithmetic) floating point instructions. For each static instruction, we based the dropping decision on a pre-defined threshold t: we generated a random number r in the range [0, 1] and dropped the static instruction if r remains below t. We experimented with threshold values between 1% and 10%.
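The thresholded random dropping described above can be sketched as follows. This is a hypothetical illustration, not the authors' Pin-based tool; the function name `drop_mask`, the default threshold, and the fixed seed are all assumptions made for the sketch.

```python
import random

def drop_mask(static_instrs, t=0.05, seed=0):
    """Decide, per static FP instruction, whether to drop it.

    For each instruction, draw a uniform r in [0, 1) and mark the
    instruction as dropped when r < t (hypothetical sketch of the
    thresholded random dropping described in the text).
    """
    rng = random.Random(seed)
    return {ins: rng.random() < t for ins in static_instrs}
```

With t between 0.01 and 0.10 this reproduces the 1%-10% dropping rates explored in the text; fixing the seed makes a given dropping decision reproducible across runs.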
2.2 Horizontal Approximation

Without loss of generality, we experimented with three approximations that reduce operand widths: DPtoSP, DP(SP)toHP, and DP(SP)toINT. Under the IEEE 754 standard, 32 (64) bits express a single (double) precision floating point number: one bit specifies the sign; 8 (11) bits, the exponent; and 23 (52) bits, the mantissa, i.e., the fraction. For example, (−1)^sign × 2^(exponent−127) × 1.mantissa represents a single-precision floating point number. DPtoSP is a bit-discarding variant: it omits the 32 least-significant bits of the mantissa of each double-precision operand of an instruction and keeps the exponent intact. DP(SP)toHP comes in two flavors. DPtoHP omits the 48 least-significant bits of the mantissa of each double-precision operand of an instruction and keeps the exponent intact; SPtoHP, the 16 least-significant bits of the mantissa of each single-precision operand. Fig. 1(c) captures an example execution outcome under DPtoHP. DP(SP)toINT also comes in two flavors. DPtoINT (SPtoINT) replaces double (single) precision instructions with their integer counterparts, rounding the floating point operand values to the closest integer.

2.3 Horizontal + Vertical Approximation

Without loss of generality, we experimented with two representatives in this case: MULtoADD and DIVtoMUL. MULtoADD converts multiplication instructions to a sequence of additions. We picked the smaller of the factors as the multiplier (which determines the number of additions) and rounded floating point multipliers to the closest integer. DIVtoMUL converts division instructions to multiplications. We first calculate the reciprocal of the divisor, which then gets multiplied by the dividend to render the end result. In our proof-of-concept implementation based on the x86 ISA, the reciprocal instruction has 12-bit precision. DIVtoMUL12 uses this instruction. DIVtoMUL.NR, on the other hand, relies on one iteration of the Newton-Raphson method [4] to increase the precision of the reciprocal to 23 bits. DIVtoMUL12 can be regarded as an approximate version of DIVtoMUL.NR: it eliminates the Newton-Raphson iteration and hence enforces a less accurate (only 12-bit) estimate of the reciprocal. Fig. 1(d) captures an example execution outcome under DIVtoMUL.NR.
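The bit-discarding transforms of Section 2.2 (DPtoSP, DPtoHP) amount to zeroing low-order mantissa bits of an IEEE 754 double while leaving the sign and exponent untouched. The sketch below illustrates this; the function names are hypothetical, and this is not the authors' Pin-based implementation.

```python
import struct

def discard_mantissa_bits(x, nbits):
    # Reinterpret the double as its raw 64 bits, zero the nbits
    # least-significant mantissa bits, and keep sign and exponent
    # intact (the value still occupies a 64-bit double in storage).
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    bits &= ~((1 << nbits) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

def dp_to_sp(x):
    return discard_mantissa_bits(x, 32)  # drop 32 mantissa LSBs

def dp_to_hp(x):
    return discard_mantissa_bits(x, 48)  # drop 48 mantissa LSBs
```

Since the exponent is preserved, the dynamic range is unchanged; only the fraction loses precision, e.g., dp_to_sp leaves 20 of the 52 mantissa bits, bounding the relative error by about 2^-20.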
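The two DIVtoMUL variants can be sketched as follows, assuming we emulate the 12-bit x86 reciprocal estimate by truncating the exact reciprocal's mantissa; one Newton-Raphson step, r' = r(2 − d·r), then roughly doubles the bits of precision. All function names here are hypothetical illustrations.

```python
import struct

def _reciprocal_12bit(d):
    # Emulate a ~12-bit-precision reciprocal estimate (a stand-in
    # for the x86 reciprocal instruction) by keeping only 12 of the
    # 52 mantissa bits of the exact double reciprocal.
    b = struct.unpack('<Q', struct.pack('<d', 1.0 / d))[0]
    b &= ~((1 << (52 - 12)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack('<d', struct.pack('<Q', b))[0]

def div_to_mul12(n, d):
    # DIVtoMUL12: multiply the dividend by the crude estimate.
    return n * _reciprocal_12bit(d)

def div_to_mul_nr(n, d):
    # DIVtoMUL.NR: refine the estimate with one Newton-Raphson
    # iteration before multiplying. If r = (1/d)(1 + e), then
    # r(2 - d*r) = (1/d)(1 - e^2), squaring the relative error.
    r = _reciprocal_12bit(d)
    r = r * (2.0 - d * r)
    return n * r
```

This makes the accuracy ordering in the text concrete: the seed's relative error of about 2^-12 becomes about 2^-24 after one iteration, matching the reported 23-bit reciprocal.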
3 Conclusion & Discussion

Our proof-of-concept analysis revealed that, in its restricted form (where the region of interest of an application is mapped in its entirety to an incomplete-ISA compute engine, irrespective of potential changes in noise tolerance within the course of its execution), AISC can cut energy by up to 37% at around 10% accuracy loss.

The most critical design aspect is how instruction sequences should be mapped to restricted-ISA compute engines, and how such sequences should be migrated from one engine to another within the course of execution, to track potential temporal changes in algorithmic noise tolerance. While fast code migration is not impossible, if not orchestrated carefully, the energy overhead of fine-grain migration can easily become prohibitive. Therefore, a break-even point in terms of migration frequency and granularity exists, past which AISC may degrade energy efficiency.
Up to 37% energy cut at around 10% accuracy loss
AISC
Design Aspects
12
• Which subset of the ISA should each compute engine support?
• How to map instruction sequences to compute engines?
• How to keep the potential accuracy loss bounded?
• How to orchestrate migration of code sequences
  • from one compute engine to another within the course of computation
  • tolerance to noise may vary for different application phases
• Most critical design aspect: migration granularity
• A break-even point exists for migration granularity (and frequency)
LTAI
AISC: Approximate Instruction Set Computer
Alexandra Ferrerón, Darío Suárez-Gracia, Jesús Alastruey-Benedé, Ulya R. Karpuzcu
WAX'18, March 25, 2018
AISC
Instruction Set Architecture (ISA)
15
[Slide figure: the hardware/software stack — software (e.g., int main(void) { ... }) sits above the Architecture (ISA) boundary, which exposes instructions such as DIV.D F0, F2, F4; ADD.D F10, F0, F8; SUB.D F8, F8, F14 to the hardware pipeline stages (Fetch, Decode, ...).]