Performing Advanced Bit Manipulations Efficiently in General-Purpose Processors
Yedidya Hilewitz and Ruby B. Lee
Princeton Architecture Lab for Multimedia and Security
Department of Electrical Engineering, Princeton University
18th IEEE Symposium on Computer Arithmetic (ARITH-18), Montpellier, France, June 25-27, 2007



PALMS Princeton University

Yedidya Hilewitz and Ruby B. Lee Performing Advanced Bit Manipulations Efficiently


Background and Motivation

Advanced bit manipulations are not well supported by commodity microprocessors
  These operations are performed using “programming tricks” (cf. Hacker’s Delight)
Bit manipulations play a role in applications of increasing importance
We propose a new shifter architecture: the shifter is replaced by a new unit that directly supports advanced bit manipulation operations


Outline

Background and motivation
Advanced bit manipulation operations
  Delineation and example usage
New shift-permute functional unit
Summary and conclusions


Advanced Bit Manipulation Instructions

Bit Permutation
  Butterfly (bfly) and Inverse Butterfly (ibfly)
Bit Gather and Bit Scatter
  Parallel Extract and Parallel Deposit


Any of the n! permutations of n bits can be done with one pass of bfly and ibfly instructions
  bfly + ibfly = general permutation circuit
Each circuit has lg(n) stages of n 2:1 MUXes, split into n/2 pairs that pass through or swap their inputs
[Figures: 8-bit Butterfly; 8-bit Inverse Butterfly]
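The stage structure described above can be modeled in a few lines of Python. This is only a software sketch, not the hardware design: the function names, the ordering of pairs within a stage, and the convention that a control bit of 1 means "swap" are all assumptions made for illustration.

```python
def stage(x, ctrl, d, n):
    """One stage of 2:1 MUX pairs: bits (i, i + d) either pass through or swap.

    Assumption: bit k of ctrl steers the k-th pair, counted from the
    least significant end, and 1 means swap.
    """
    lows = [i for i in range(n) if not i & d]   # lower member of each pair
    for k, i in enumerate(lows):
        if (ctrl >> k) & 1:
            a, b = (x >> i) & 1, (x >> (i + d)) & 1
            if a != b:                          # swapping equal bits is a no-op
                x ^= (1 << i) | (1 << (i + d))
    return x


def ibfly(x, ctrls, n=8):
    """Inverse butterfly: pair distances 1, 2, 4, ..., n/2."""
    d = 1
    for c in ctrls:
        x = stage(x, c, d, n)
        d *= 2
    return x


def bfly(x, ctrls, n=8):
    """Butterfly: pair distances n/2, ..., 4, 2, 1."""
    d = n // 2
    for c in ctrls:
        x = stage(x, c, d, n)
        d //= 2
    return x
```

With every control bit set, each stage flips one bit of every position index, so the word comes out bit-reversed: `ibfly(0b11010000, [0xF, 0xF, 0xF])` gives `0b00001011`. With all control bits clear, the input passes through unchanged.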


Bit Gather (Parallel Extract) and Bit Scatter (Parallel Deposit)

Parallel Extract: pex r1 = r2, r3
  Extracts the bits of r2 flagged by 1’s in r3, then compresses and right justifies them in the result register
  Parallel extract maps to the ibfly datapath
Parallel Deposit: pdep r1 = r2, r3
  Deposits the right-justified bits of r2 into the result register at the positions flagged by 1’s in r3
  Parallel deposit maps to the bfly datapath
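The semantics of the two instructions can be pinned down with a bit-at-a-time reference model. The sketch below is a plain software loop written for clarity; it says nothing about how the ibfly and bfly datapaths compute the result, and the Python function names simply mirror the mnemonics.

```python
def pex(x, mask):
    """Gather the bits of x selected by mask, packed at the right."""
    result, k = 0, 0
    while mask:
        low = mask & -mask          # lowest remaining 1 in the mask
        if x & low:
            result |= 1 << k
        k += 1
        mask ^= low                 # clear that mask bit
    return result


def pdep(x, mask):
    """Scatter the low bits of x to the positions selected by mask."""
    result, k = 0, 0
    while mask:
        low = mask & -mask
        if (x >> k) & 1:
            result |= low
        k += 1
        mask ^= low
    return result
```

For example, `pex(0b10110010, 0b11110000)` returns `0b1011`, and `pdep(0b1011, 0b11110000)` returns `0b10110000`. In general `pdep(pex(x, m), m)` equals `x & m`.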


Example Usage: Bioinformatics - DNA Sequence Reversal

DNA bases A, C, G and T are represented by two-bit codes
Reversing a DNA sequence is equivalent to reversing the order of bit pairs
  This is a bfly or ibfly permutation
1 ibfly instruction is equivalent to 11-23 ALU and shifter instructions
  At minimum: 2×(and, and, shift, shift, or) + a byte reverse instruction
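The minimum software sequence counted above can be sketched as follows. The 32-bit word size (sixteen bases), the masks, and the function names are assumptions chosen for illustration; each of the two swap steps uses the five operations listed (and, and, shift, shift, or), followed by one byte reversal.

```python
def reverse_bases(x):
    """Reverse the order of the sixteen 2-bit codes in a 32-bit word."""
    # Step 1: swap the two 2-bit codes inside every nibble.
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333)
    # Step 2: swap the two nibbles inside every byte.
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F)
    # Step 3: reverse the byte order (a single instruction on many ISAs).
    return int.from_bytes(x.to_bytes(4, "little"), "big")


def reverse_bases_naive(x, bases=16):
    """Reference version: peel off one 2-bit code at a time."""
    out = 0
    for _ in range(bases):
        out = (out << 2) | (x & 3)
        x >>= 2
    return out
```

For instance, `reverse_bases(0x0000001B)` returns `0xE4000000`: the four codes 3, 2, 1, 0 at the bottom of the word move to the top in reverse order.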


Advanced Bit Manipulation Functional Unit

We propose adding a new functional unit to directly perform advanced bit manipulations
To minimize the cost, we intend for this new functional unit to replace the shifter unit
  The shifter currently performs basic bit manipulation operations
Our new functional unit represents an evolution of shifter designs


Basic Bit Manipulation Operations

shift r1 = r2, s

extract r1 = r2, pos, len

mix r1 = r2, r3

rotate r1 = r2, s

deposit r1 = r2, pos, len


Parallel Extract and Parallel Deposit

[Figures: Parallel Extract; Parallel Deposit]


Evolution of Shifter Designs

[Figure: Barrel Shifter; Log Shifter; our proposed design, shown as "?"]


New Shifter Design

An inverse butterfly (or butterfly) circuit enhanced with an extra multiplexer stage is the basis of the new shifter design
We will show that either the butterfly or the inverse butterfly circuit can individually perform a rotate
Rotations are the basic operation underlying shift, extract, deposit and mix
  The other basic bit manipulation operations are modeled as a rotate plus zeroing, sign bit propagation or merging
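The "rotate plus zeroing, sign bit propagation or merging" model can be illustrated in software. The sketch below assumes a 32-bit word and shift amounts below 32; the function names and the `background` operand of `deposit` are hypothetical, and mix is left out. The masking step stands in for the extra multiplexer stage.

```python
N = 32                      # assumed word size
MASK = (1 << N) - 1


def rotr(x, s):
    """Right rotate an N-bit word by s."""
    s %= N
    return ((x >> s) | (x << (N - s))) & MASK


def shr_logical(x, s):
    """Logical right shift = rotate, then zero the s wrapped bits."""
    return rotr(x, s) & (MASK >> s)


def shr_arith(x, s):
    """Arithmetic right shift = rotate, then overwrite wrapped bits with the sign."""
    keep = MASK >> s
    fill = (MASK ^ keep) if (x >> (N - 1)) & 1 else 0
    return (rotr(x, s) & keep) | fill


def extract(x, pos, length):
    """Extract = rotate the field down to bit 0, then zero everything above it."""
    return rotr(x, pos) & ((1 << length) - 1)


def deposit(x, pos, length, background=0):
    """Deposit = rotate the field up to pos, then merge it into background."""
    field = ((1 << length) - 1) << pos
    return (rotr(x, N - pos) & field) | (background & ~field & MASK)
```

As a usage example, `extract(0xABCD1234, 8, 8)` returns `0x12`, and `deposit(0xFF, 8, 4, background=0xAAAAAAAA)` returns `0xAAAAAFAA`.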


New Shift-Permute Functional Unit Implementation


Configuring Inverse Butterfly for Rotations

Hard problem: generating the control bits for rotations on the inverse butterfly circuit
We derive an expression for the control bits based on a recursive function of the shift amount, s, and the stage number, j


Example: Right Rotation by 5 on 8-bit Inverse Butterfly Circuit

After each stage j, the input is right rotated by 5 (mod 2^j) within each 2^j-bit subcircuit


Example: Right Rotation by 5 on 8-bit Inverse Butterfly Circuit

After stage 1, the input is right rotated by 5 (mod 2) = 1 within each 2-bit subcircuit


Example: Right Rotation by 5 on 8-bit Inverse Butterfly Circuit

After stage 2, the input is right rotated by 5 (mod 4) = 1 within each 4-bit subcircuit
Bits that wrapped at the output of the previous stage are swapped



Example: Right Rotation by 5 on 8-bit Inverse Butterfly Circuit

After stage 3, the input is right rotated by 5
Bits that wrapped at the output of the previous stage are passed through


Rotations in general on n-bit Inverse Butterfly Circuit

shift amount, s < n/2 → swap bits that wrapped

shift amount, s ≥ n/2 → pass through bits that wrapped
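The swap / pass-through rule can be checked with a small simulation. The sketch below applies the rule at every stage j, using s mod 2^j in place of s and 2^j in place of n, which is how the 8-bit example above proceeds. The function names, the "1 = swap" convention and the per-pair indexing are assumptions for illustration, not the circuit's actual control encoding.

```python
def rot_controls(s, n):
    """Swap flags per stage for a right rotation by s (0 <= s < n)."""
    ctrls, h = [], 1                 # h = 2^(j-1), the pair distance in stage j
    while h < n:
        r = s % h                    # rotation already done inside each h-bit half
        top = bool((s // h) & 1)     # bit s_(j-1) of the shift amount
        # Pair i holds wrapped bits when i >= h - r. Wrapped bits are swapped
        # when s_(j-1) = 0 and passed through (all others swapped) when s_(j-1) = 1.
        ctrls.append([(i >= h - r) ^ top for i in range(h)])
        h *= 2
    return ctrls


def ibfly_rotate(x, s, n=8):
    """Rotate by pushing x through the inverse butterfly stages."""
    h = 1
    for flags in rot_controls(s, n):
        for base in range(0, n, 2 * h):          # every 2h-bit subcircuit
            for i, swap in enumerate(flags):
                lo, hi = base + i, base + i + h
                if swap and ((x >> lo) ^ (x >> hi)) & 1:
                    x ^= (1 << lo) | (1 << hi)
        h *= 2
    return x


def rotr(x, s, n=8):
    """Plain right rotation, used as the reference."""
    s %= n
    return ((x >> s) | (x << (n - s))) & ((1 << n) - 1)
```

Checking `ibfly_rotate(x, s) == rotr(x, s)` for all 8-bit x and all s from 0 to 7 confirms that the rule produces a rotation; every subcircuit at a given stage receives the same flags.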


Circuit Implementation of Rotation Control Bit Generator

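$$cb_j = \bigl(f(s,j) \oplus s_{j-1}\bigr) \,\|\, s_{j-1}$$

$$f(s,j) = \begin{cases} \{\}, & j = 1 \\ \bigl(f(s,j-1) \lor s_{j-2}\bigr) \,\|\, s_{j-2} \,\|\, \bigl(f(s,j-1) \land s_{j-2}\bigr), & j \ge 2 \end{cases}$$

Here $\|$ is concatenation and a single-bit operand is applied to every bit of $f$. The operators $\oplus$, $\lor$ and $\land$ are not legible in this transcript; they are inferred from the swap / pass-through rule on the preceding slide, taking a control bit of 1 to mean swap and listing each stage's bits from the most significant pair down.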
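One reading of the control-bit recursion can be tested against the "swap or pass the wrapped bits" rule. In the sketch below, f(s, 1) is the empty string and, for j ≥ 2, f(s, j) is f(s, j-1) OR s_(j-2), then the bit s_(j-2), then f(s, j-1) AND s_(j-2); the stage-j control string is f(s, j) XOR s_(j-1) followed by the bit s_(j-1). The choice of operators, the "1 = swap" convention, and the most-significant-pair-first bit order are assumptions, as are the function names.

```python
def f(s, j):
    """Recursive helper: a string of 2^(j-1) - 1 bits, most significant first."""
    if j == 1:
        return ""
    prev = f(s, j - 1)
    b = (s >> (j - 2)) & 1                            # bit s_(j-2)
    upper = "".join(str(int(c) | b) for c in prev)    # f OR s_(j-2)
    lower = "".join(str(int(c) & b) for c in prev)    # f AND s_(j-2)
    return upper + str(b) + lower


def cb(s, j):
    """Control bits of stage j from the recursion (1 = swap)."""
    b = (s >> (j - 1)) & 1                            # bit s_(j-1)
    return "".join(str(int(c) ^ b) for c in f(s, j)) + str(b)


def cb_direct(s, j):
    """Same bits from the rule: wrapped pairs swap iff s_(j-1) = 0."""
    h = 1 << (j - 1)
    r, b = s % h, (s >> (j - 1)) & 1
    return "".join(str(int(i >= h - r) ^ b) for i in reversed(range(h)))
```

For the rotate-by-5 example, `cb(5, 1)` is `"1"`, `cb(5, 2)` is `"10"` and `cb(5, 3)` is `"0111"`, and the two functions agree for every shift amount and stage tried in the accompanying checks.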


Comparison to Barrel and Log Shifters

                                   Barrel   Log          IBFLY
# of Gates                         n^2      n×log4(n)    n×lg(n)
Control Lines                      n        lg(n)        n/2×lg(n)
Gate Delay (of datapath)           1        log4(n)      lg(n)
Mux Width (Capacitance)            n        4            2
Relative Delay (Logical Effort)    1.16×    1            1.19×
Bit Manipulation Capabilities      basic    basic        basic + advanced
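To make the asymptotic entries concrete, the expressions in the table can be evaluated for a particular word size. The sketch below only plugs numbers into the formulas as listed; the choice of n = 64 in the usage note and the function name are assumptions, and the relative-delay row is not modeled.

```python
import math


def shifter_costs(n):
    """Evaluate the table's formulas for an n-bit datapath (n a power of 4)."""
    lg = int(math.log2(n))
    log4 = lg // 2                  # log4(n) = lg(n) / 2
    return {
        "gates":         {"barrel": n * n, "log": n * log4, "ibfly": n * lg},
        "control_lines": {"barrel": n,     "log": lg,       "ibfly": (n // 2) * lg},
        "gate_delay":    {"barrel": 1,     "log": log4,     "ibfly": lg},
        "mux_width":     {"barrel": n,     "log": 4,        "ibfly": 2},
    }
```

For n = 64 this gives 4096, 192 and 384 gates for the barrel, log and inverse butterfly designs respectively, with 64, 6 and 192 control lines.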


Summary and Conclusions

We proposed evolving the shifter to a new design using butterfly and inverse butterfly datapaths
  The new shifter subsumes the basic shifter, the multimedia shift-permute unit and the advanced bit manipulation unit
We have shown how to perform basic shifter operations on these datapaths
  Rotation control bit generator
  Extra multiplexer stage for masking and merging
Use of the new shifter design in future microprocessor implementations allows for increased capabilities at only marginal cost