IXP Lab 2012: Part 3


Programming Tips

Outline

• Memory Independent Techniques
  – Instruction Selection
  – Task Partition
• Memory Dependent Techniques
  – Reducing Overhead
    • Reduce the number of memory accesses
    • Reduce average access latency
  – Hiding Overhead

NCKU CSIE CIAL Lab 2

Memory Independent Techniques

• Instruction Selection
  – General Coding Skill
  – Use Hardware Instruction
• Task Partition
  – Multi-Processing
  – Context-Pipelining

General Coding Skill

• Remove loops (unroll them where the trip count is fixed)
• Shift operations
  – Avoid multiply and divide
• Inline functions
  – __inline and __forceinline
• Branch prediction
  – Branch prediction penalty
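
The "shift operation" tip can be sketched in portable C as follows. This is a minimal illustration, not Microengine C; the helper names (times8, div4, mod16, times10) are hypothetical, and the shift forms are only equivalent to divide and modulo for unsigned values and power-of-two constants.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 8 with a left shift (8 == 1 << 3). */
uint32_t times8(uint32_t x) { return x << 3; }

/* Divide by 4 with a right shift; equivalent only for unsigned values. */
uint32_t div4(uint32_t x) { return x >> 2; }

/* Remainder modulo 16 with a mask (16 is a power of two). */
uint32_t mod16(uint32_t x) { return x & 15u; }

/* Multiply by 10 as (x * 8) + (x * 2): two shifts and one add. */
uint32_t times10(uint32_t x) { return (x << 3) + (x << 1); }

/* Compare each shift form with the plain arithmetic operator. */
void shift_selfcheck(void)
{
    uint32_t x;
    for (x = 0; x < 1000; x++) {
        assert(times8(x) == x * 8);
        assert(div4(x) == x / 4);
        assert(mod16(x) == x % 16);
        assert(times10(x) == x * 10);
    }
}
```

A constant that is not a power of two, such as 10, can often still be split into a few shifts and adds, as times10 shows.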

Hardware Instruction

• POP_COUNT
• FFS
• Multiply
• CRC
• Hashing
• CAM

POP_COUNT--Brief

• Population count
• Reports the number of bits set in a 32-bit register
• 3 cycles latency
• Example:
  – pop_count( 0x3121 ) = ?
  – 0x3121 = 0011 0001 0010 0001
  – Result = 5

POP_COUNT--Naïve Implementation

unsigned int pop_count_for(unsigned int x)
{
    unsigned int y = 0;
    unsigned int i;

    /* Test the lowest bit 32 times, shifting right each time. */
    for (i = 0; i < 32; i++) {
        if ((x & 1) == 1)
            y++;
        x = x >> 1;
    }
    return y;
}

POP_COUNT--Faster Implementation

unsigned int pop_count_agg(unsigned int x)
{
    x -= ((x >> 1) & 0x55555555);                      /* sums of 2-bit fields */
    x = (((x >> 2) & 0x33333333) + (x & 0x33333333));  /* sums of 4-bit fields */
    x = (((x >> 4) + x) & 0x0f0f0f0f);                 /* sums of 8-bit fields */
    x += (x >> 8);
    x += (x >> 16);
    return (x & 0x0000003f);                           /* result is 0..32 */
}

Reference http://aggregate.org/MAGIC/
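
As a quick sanity check, the two software versions above can be compared in portable C. This is only a sketch: it uses uint32_t instead of unsigned int so that the 32-bit width is explicit, and popcount_selfcheck is a hypothetical helper name. The hardware pop_count instruction is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

/* Loop version: test the lowest bit 32 times. */
uint32_t pop_count_for(uint32_t x)
{
    uint32_t y = 0;
    uint32_t i;
    for (i = 0; i < 32; i++) {
        if ((x & 1u) == 1u)
            y++;
        x >>= 1;
    }
    return y;
}

/* Parallel-sum version: add neighbouring fields of 1, 2, 4, 8, 16 bits. */
uint32_t pop_count_agg(uint32_t x)
{
    x -= ((x >> 1) & 0x55555555u);
    x = (((x >> 2) & 0x33333333u) + (x & 0x33333333u));
    x = (((x >> 4) + x) & 0x0f0f0f0fu);
    x += (x >> 8);
    x += (x >> 16);
    return x & 0x0000003fu;
}

/* Both versions must agree; the slide example 0x3121 has 5 bits set. */
void popcount_selfcheck(void)
{
    static const uint32_t samples[] = {
        0u, 1u, 0x3121u, 0x80000000u, 0x0f0f0f0fu, 0xffffffffu
    };
    unsigned i;
    for (i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(pop_count_for(samples[i]) == pop_count_agg(samples[i]));
    assert(pop_count_agg(0x3121u) == 5u);
}
```

The parallel-sum version runs a fixed number of shift, mask, and add steps and has no loop or branch.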

POP_COUNT--Hardware Instruction

unsigned int pop_count_hardware(unsigned int x)
{
    return pop_count(x);
}

POP_COUNT--Additional Information

• Bitmap-RFC (Liu, TECS 2008)

FFS

• Finds the first (least-significant) set bit in the data and returns its position, counting from bit 0
• Example:
  – ffs ( 0x3121 ) = 0
    • 0011 0001 0010 0001
  – ffs ( 0x3120 ) = 5
    • 0011 0001 0010 0000
  – ffs ( 0x3100 ) = 8
    • 0011 0001 0000 0000
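
For reference, the behaviour shown in the examples above can be modeled in portable C. This is a sketch under assumptions: ffs_soft is a hypothetical name, and returning -1 for a zero input is a choice made here, since the slides do not say what the hardware instruction does in that case.

```c
#include <assert.h>
#include <stdint.h>

/* Return the 0-based position of the least-significant set bit.
 * Returning -1 for x == 0 is an assumption of this sketch. */
int ffs_soft(uint32_t x)
{
    int pos = 0;
    if (x == 0)
        return -1;
    while ((x & 1u) == 0) {
        x >>= 1;
        pos++;
    }
    return pos;
}

/* Reproduce the three examples from the slide. */
void ffs_selfcheck(void)
{
    assert(ffs_soft(0x3121u) == 0);
    assert(ffs_soft(0x3120u) == 5);
    assert(ffs_soft(0x3100u) == 8);
}
```

The loop can take up to 31 iterations in software, which is the cost the hardware instruction avoids.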

Multiply

• Specific Multiply Instruction
  – Multiply_24x8()
  – Multiply_16x16()
  – Multiply_32x32_hi()
  – Multiply_32x32_lo()
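
What these variants compute can be sketched in portable C. The operand widths below are read off the instruction names and are an assumption, as are the helper names (mul24x8, mul16x16, mul32x32_hi, mul32x32_lo); this is not the Microengine C API.

```c
#include <assert.h>
#include <stdint.h>

/* 24-bit by 8-bit product; the result always fits in 32 bits. */
uint32_t mul24x8(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFFFu) * (b & 0xFFu);
}

/* 16-bit by 16-bit product; the result always fits in 32 bits. */
uint32_t mul16x16(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* Upper 32 bits of the full 64-bit product. */
uint32_t mul32x32_hi(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Lower 32 bits of the full 64-bit product. */
uint32_t mul32x32_lo(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* (2^32 - 1)^2 = 2^64 - 2^33 + 1, so hi = 0xFFFFFFFE and lo = 1. */
void multiply_selfcheck(void)
{
    assert(mul32x32_hi(0xFFFFFFFFu, 0xFFFFFFFFu) == 0xFFFFFFFEu);
    assert(mul32x32_lo(0xFFFFFFFFu, 0xFFFFFFFFu) == 0x00000001u);
    assert(mul16x16(0xFFFFu, 0xFFFFu) == 0xFFFE0001u);
    assert(mul24x8(0xFFFFFFu, 0xFFu) == 0xFEFFFF01u);
}
```

Choosing the narrowest variant that fits the operands is presumably the point of having several instructions.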

CRC

• 14 cycles latency
• Example of CRC operation

crc_write( 0x42424242 );
crc_32_be( source_address, bytes_0_3 );
crc_32_be( dest_address, bytes_0_3 );
…

cache_index = crc_read();
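
The write, update, read sequence above can be mimicked in software to see what kind of value ends up in the cache index. The sketch below is a plain bitwise CRC-32 (polynomial 0x04C11DB7, most-significant bit first). Whether the hardware unit uses exactly this bit and byte ordering is an assumption, and crc32_be_byte, crc32_be_buf, crc32_be_word, cache_index_soft, and CACHE_SLOTS are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY  0x04C11DB7u
#define CACHE_SLOTS 256u   /* hypothetical table size, a power of two */

/* Fold one byte into the residue, most-significant bit first. */
uint32_t crc32_be_byte(uint32_t residue, uint8_t byte)
{
    int bit;
    residue ^= (uint32_t)byte << 24;
    for (bit = 0; bit < 8; bit++) {
        if (residue & 0x80000000u)
            residue = (residue << 1) ^ CRC32_POLY;
        else
            residue <<= 1;
    }
    return residue;
}

/* Fold a byte buffer into the residue. */
uint32_t crc32_be_buf(uint32_t residue, const uint8_t *data, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        residue = crc32_be_byte(residue, data[i]);
    return residue;
}

/* Fold a 32-bit word, most-significant byte first. */
uint32_t crc32_be_word(uint32_t residue, uint32_t word)
{
    int shift;
    for (shift = 24; shift >= 0; shift -= 8)
        residue = crc32_be_byte(residue, (uint8_t)(word >> shift));
    return residue;
}

/* Seed the residue, fold in both addresses, reduce to a table index. */
uint32_t cache_index_soft(uint32_t src_addr, uint32_t dst_addr)
{
    uint32_t residue = 0x42424242u;            /* crc_write */
    residue = crc32_be_word(residue, src_addr); /* crc_32_be */
    residue = crc32_be_word(residue, dst_addr); /* crc_32_be */
    return residue & (CACHE_SLOTS - 1u);        /* crc_read, then mask */
}

/* Standard check value for this polynomial with an all-ones seed. */
void crc_selfcheck(void)
{
    const uint8_t msg[] = "123456789";
    assert(crc32_be_buf(0xFFFFFFFFu, msg, 9) == 0x0376E6E7u);
}
```

The software loop costs eight steps per byte, which gives a sense of why a 14-cycle hardware unit is attractive for hashing header fields.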

Hash

• hash_48()
• hash_64()
• hash_128()

• Example:
SIGNAL sig_hash;
hash_48(data_out, data_in, count, sig_done, &sig_hash);
__wait_for_all(&sig_hash);

CAM--Brief

• Content Addressable Memory
• Each ME has 16 32-bit CAM entries
• Each ME's CAM is private; other MEs cannot access it
• On a lookup, all entries are searched in parallel
• On a successful lookup, the index of the matching entry is returned
• Otherwise, the index of the entry to be replaced is returned

CAM--Structure

• cam_lookup_t

CAM--Usage

cam_lookup_t cam_result;

cam_result = cam_lookup( data );
if( cam_result.hit == 1 ) {
    /* Hit: access the entry at cam_result.entry_num */
    …
}
else {
    /* Miss: cam_result.entry_num is the entry to replace */
    ……
    cam_write( cam_result.entry_num, data, 15 );
}
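
To make the lookup and replace flow concrete, here is a small software model of a 16-entry CAM in portable C. It is a sketch under assumptions: the cam_soft_* names and types are hypothetical, least-recently-used is assumed as the replacement order, and the explicit valid flag is a convenience of the model, not something the slides describe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CAM_ENTRIES 16

typedef struct {
    unsigned hit;        /* 1 on a match, 0 on a miss */
    unsigned entry_num;  /* matching entry on a hit, entry to replace on a miss */
    unsigned state;      /* state stored with the entry (meaningful on a hit) */
} cam_result_soft;

typedef struct {
    uint32_t      tag[CAM_ENTRIES];
    unsigned      state[CAM_ENTRIES];
    unsigned      valid[CAM_ENTRIES];
    unsigned long stamp[CAM_ENTRIES];  /* time of last use, 0 = never used */
    unsigned long clock;
} cam_soft;

/* Invalidate every entry. */
void cam_soft_clear(cam_soft *c)
{
    memset(c, 0, sizeof *c);
}

/* Search all entries; on a miss report the least recently used one. */
cam_result_soft cam_soft_lookup(cam_soft *c, uint32_t tag)
{
    cam_result_soft r = { 0, 0, 0 };
    unsigned i, lru = 0;

    for (i = 0; i < CAM_ENTRIES; i++) {
        if (c->valid[i] && c->tag[i] == tag) {
            c->stamp[i] = ++c->clock;   /* a hit refreshes the entry */
            r.hit = 1;
            r.entry_num = i;
            r.state = c->state[i];
            return r;
        }
        if (c->stamp[i] < c->stamp[lru])
            lru = i;
    }
    r.entry_num = lru;
    return r;
}

/* Store a tag and its state in the given entry. */
void cam_soft_write(cam_soft *c, unsigned entry, uint32_t tag, unsigned state)
{
    c->tag[entry] = tag;
    c->state[entry] = state;
    c->valid[entry] = 1;
    c->stamp[entry] = ++c->clock;
}

/* Miss, write, then hit on the same tag. */
void cam_selfcheck(void)
{
    cam_soft c;
    cam_result_soft r;

    cam_soft_clear(&c);
    r = cam_soft_lookup(&c, 0xABu);
    assert(r.hit == 0);
    cam_soft_write(&c, r.entry_num, 0xABu, 15);
    r = cam_soft_lookup(&c, 0xABu);
    assert(r.hit == 1 && r.state == 15);
}
```

The software model scans the entries one by one; the hardware compares all 16 in parallel, which is what makes the CAM usable as a cache tag store.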

Task Partition

• Multi-Processing
  – More computing power
  – Easy to implement
• Context-Pipelining
  – More usable resources
  – Hard to balance

Memory Dependent Techniques--Reducing Overhead

• Reduce the number of memory accesses
  – Wide-Word Accesses
  – Result Caches
• Reduce average access latency
  – Multi-Level Memory Hierarchy
  – Data Cache

Wide-Word Accesses--Brief

• Access the needed data in batches
• Reduces the number of accesses required
• Useful when the data is stored contiguously

[Figure: eight contiguous long words at MEM_ADDR+0, +4, +8, +12, +16, +20, +24, +28]

Wide-Word Accesses--Usage (One Node per Access)

__declspec(sram_read_reg) UINT32 A;
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*4), 1, sig_done, &sig_read );
__wait_for_all( &sig_read );

/* Access A ...... */
----------------------------------------------
Result: 8 accesses are needed

Wide-Word Accesses--Usage (Two Nodes per Access)

__declspec(sram_read_reg) UINT32 A[2];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*8), 2, sig_done, &sig_read );
__wait_for_all( &sig_read );

/* Access A ...... */
----------------------------------------------
Result: 4 accesses are needed

Wide-Word Accesses--Usage (Four Nodes per Access)

__declspec(sram_read_reg) UINT32 A[4];
SIGNAL sig_read;

sram_read( &A, MEM_ADDR+(i*16), 4, sig_done, &sig_read );
__wait_for_all( &sig_read );

/* Access A ...... */
----------------------------------------------
Result: 2 accesses are needed

Wide-Word Accesses--Experiment

• Platform: IXP2800
• Total accesses: 8 LW (8 * 4 bytes)

Case              Total Cycles   Average Cycles / LW
1 LW * 8 times    1211           151.38
2 LW * 4 times     725            90.63
4 LW * 2 times     460            57.50
8 LW * 1 time      387            48.38

Wide-Word Accesses--Limitation

• Data must be contiguous
  – Suitable for linear search
  – Does not support random access
• The number of transfer registers is fixed
  – Each thread has 16 read / write transfer registers
  – Some transfer registers may already be reserved for other uses

Result Cache--Brief

• Caches the results computed by the application
• If the same fields appear again, the cached result is returned
• Memory accesses are saved on a cache hit
• Effectiveness depends on the temporal locality of the traffic
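
The idea can be sketched in portable C with a small direct-mapped table. Direct mapping is a simplification chosen for this sketch, not something the slides prescribe; result_cache, rc_init, rc_lookup, and slow_classify are hypothetical names, and slow_classify merely stands in for an expensive lookup that would otherwise go to memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RC_SLOTS 64u   /* power of two, so the index is a simple mask */

typedef struct {
    uint32_t key;
    uint32_t result;
    int      valid;
} rc_slot;

typedef struct {
    rc_slot       slot[RC_SLOTS];
    unsigned long hits;
    unsigned long misses;
} result_cache;

/* Stand-in for the expensive computation whose result is cached. */
uint32_t slow_classify(uint32_t key)
{
    return key % 7u;
}

/* Empty the cache and reset the counters. */
void rc_init(result_cache *rc)
{
    memset(rc, 0, sizeof *rc);
}

/* Return the cached result on a hit; compute and store it on a miss. */
uint32_t rc_lookup(result_cache *rc, uint32_t key)
{
    rc_slot *s = &rc->slot[key & (RC_SLOTS - 1u)];

    if (s->valid && s->key == key) {
        rc->hits++;
        return s->result;
    }
    rc->misses++;
    s->key = key;
    s->result = slow_classify(key);
    s->valid = 1;
    return s->result;
}

/* A repeated key is served from the cache the second time. */
void rc_selfcheck(void)
{
    result_cache rc;

    rc_init(&rc);
    assert(rc_lookup(&rc, 10u) == 3u);
    assert(rc_lookup(&rc, 10u) == 3u);
    assert(rc.hits == 1 && rc.misses == 1);
}
```

Two keys that map to the same slot evict each other, so the hit rate depends on how often the same fields recur in the traffic.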

Result Cache--IXP2400

• No hardware cache is supported in the IXP2400 ME
• A set-associative cache is not easy to implement
• The replacement policy also adds overhead

Result Cache--Design Consideration

• Shared or private cache?
• Size of the cache?
• Works with specific hardware?
• Miss penalty handling?

Result Cache--Example

Multi-Level Memory Hierarchy--Brief

• Reduces the average access latency
• The number of accesses remains unchanged
• If the data fits in a faster memory, place it there

Multi-Level Memory Hierarchy--Data Placement

• Small and read-only
  – Hard-code it
• Small but needs updating
  – Local Memory
• Larger
  – Scratchpad
• Largest
  – SRAM

Multi-Level Memory Hierarchy--Packet Data Type

• Packet-related data
  – Temporary data
  – Valid only for a specific packet
  – Local Memory
• Flow-related data
  – Related to a specific flow
  – Spatial locality
  – Wide-Word Access
• Application-related data
  – Valid for a specific application
  – Temporal locality
  – Result Cache

Split-Cache (Z. Liu, IET-COM 2007)

• Two separate hardware caches, one for application data and one for flow data

Data Cache--Brief

• Hardware cache mechanism that caches the data used in packet processing
  – App-Cache
  – Flow-Cache
• However, it is not supported by the IXP2400 (additional hardware is needed)

Data Cache--CAM + Local Memory

• The CAM combined with Local Memory acts like a hardware cache
• However, the number of CAM entries is limited
• Each CAM entry may work with several Local Memory cache entries

Memory Dependent Techniques--Hiding Overhead

• Does not actually reduce the overhead, but overlaps it with other work
  – Hardware Multi-Threading
  – Asynchronous Memory

Hardware Multi-Threading

• A thread swaps itself out while it accesses memory, letting another thread execute
• Each thread keeps its own set of registers, so no stack is needed for a thread swap
• Round-robin scheduling
• No thread preemption

Asynchronous Memory--Brief

• A thread is not blocked when it issues a memory request
• Thus, a thread can issue multiple memory requests at a time

Asynchronous Memory--Example (1 Issue)

Read X
__wait_for_all( &sig_x )
Read Y
__wait_for_all( &sig_y )

// Use X and Y …

Asynchronous Memory--Example (2 Issues)

Read X
Read Y
__wait_for_all( &sig_x, &sig_y )

// Use X and Y …

Wide-Word Access + Multiple Issues

[Figure: memory layout of eight long words, MEM_ADDR+0 through MEM_ADDR+28 in 4-byte steps]

Wide-Word Access +Multiple Issues (1LW, 2 Issue)

[Figure: memory layout of eight long words, MEM_ADDR+0 through MEM_ADDR+28 in 4-byte steps]

Wide-Word Access +Multiple Issues (2LW, 2 Issue)

[Figure: memory layout of eight long words, MEM_ADDR+0 through MEM_ADDR+28 in 4-byte steps]

Wide-Word Access +Multiple Issues (4LW, 2 Issue)

[Figure: memory layout of eight long words, MEM_ADDR+0 through MEM_ADDR+28 in 4-byte steps]

Wide-Word Access +Multiple Issues (Experiment)

Scheme             Total Cycles   Average Cycles / LW
1 LW * 1 Issue     1211           151.38
2 LW * 1 Issue      725            90.63
4 LW * 1 Issue      460            57.50
8 LW * 1 Issue      387            48.38
1 LW * 2 Issue      716            89.50
2 LW * 2 Issue      445            55.63
4 LW * 2 Issue      364            45.50
1 LW * 4 Issue      396            49.50
2 LW * 4 Issue      320            40.00
1 LW * 8 Issue      318            39.75
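
The derived column in the table is simply the total cycle count divided by the 8 long words read, and the same numbers give the speedup over the 1 LW * 1 Issue baseline. The sketch below redoes that arithmetic in portable C. The function names are hypothetical, and wait_rounds assumes that one wait covers every request issued before it.

```c
#include <assert.h>
#include <stdio.h>

#define TOTAL_LW        8     /* the experiment reads 8 long words */
#define BASELINE_CYCLES 1211  /* the 1 LW * 1 Issue row */

/* Average cycles per long word for one row of the table. */
double avg_cycles_per_lw(unsigned total_cycles)
{
    return (double)total_cycles / TOTAL_LW;
}

/* Speedup of a scheme relative to the 1 LW * 1 Issue baseline. */
double speedup_vs_baseline(unsigned total_cycles)
{
    return (double)BASELINE_CYCLES / total_cycles;
}

/* Number of waits needed when each wait covers lw * issues long words. */
unsigned wait_rounds(unsigned lw_per_access, unsigned issues)
{
    return TOTAL_LW / (lw_per_access * issues);
}

/* Print one row with its derived values. */
void print_row(unsigned lw, unsigned issues, unsigned total_cycles)
{
    printf("%u LW * %u Issue: waits=%u avg=%.2f speedup=%.2f\n",
           lw, issues, wait_rounds(lw, issues),
           avg_cycles_per_lw(total_cycles),
           speedup_vs_baseline(total_cycles));
}

/* The first and last rows of the table. */
void table_selfcheck(void)
{
    assert(avg_cycles_per_lw(1211) == 151.375);
    assert(avg_cycles_per_lw(318) == 39.75);
    assert(wait_rounds(1, 1) == 8);
    assert(wait_rounds(1, 8) == 1);
}
```

For example, print_row(1, 8, 318) reports a single wait for the scheme that issues eight 1-LW reads at once, which is the fastest row measured.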

Reference (1)

• Jayaram Mudigonda, Harrick M. Vin, Raj Yavatkar, “Overcoming the memory wall in packet processing: hammers or ladders?”, Proc. ANCS 2005.

• Duo Liu, Zheng Chen, Bei Hua, Nenghai Yu, Xinan Tang, “High-Performance Packet Classification Algorithm for Multithreaded IXP Network Processor”, ACM TECS 2008.

Reference (2)

• Z. Liu, K. Zheng, B. Liu, “Hybrid cache architecture for high-speed packet processing”, IET-COM 2007.
