A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard
Tien-Wei Hsieh, Youn-Long Lin
Department of Computer Science
National Tsing Hua University
TAIWAN
2004/6/16 Tien-Wei Hsieh
Abstract
Proposition
– 16-bit parallel context generator
– Stripe-skipping method
– 3-stage pipelined arithmetic encoder
– Renormalization strategy with forwarding method
Contribution
– We reduce the cycle count by 17% compared with the best-known design
– Our cycle count is within 5% of the optimum
Outline
Introduction
Previous work
Proposed architecture
Experimental results
Conclusion
JPEG2000 Image Coding
[Figure: encoder block diagram. Original Image Data -> Pre-process -> Discrete Wavelet Transform (DWT) -> Quantization -> Block Coding (Tier-1 Coding) -> Bit-stream Organization (Tier-2 Coding) -> Compressed Image Data. Block Coding and Bit-stream Organization together form EBCOT (Embedded Block Coding with Optimized Truncation).]
EBCOT Tier-1 Time Consuming
Platform: Pentium 4 2.8GHz, 736MB RAM, Microsoft Windows XP, VC++
Reference software: JPEG 2000 jasper 1.500.4
Test pattern: 512x512 gray image, 1 tile, 5/3 DWT, 3 decomposition levels, code-block size 64x64
[Figure: pie chart of encoder run time]
DWT             13.325 %
EBCOT Tier-1    71.625 %
EBCOT Tier-2     1.725 %
others          13.325 %
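The profile bounds what a Tier-1 accelerator can gain overall. A minimal sketch using Amdahl's law, assuming the 71.625 % Tier-1 share carries over unchanged to the target platform; the function name `amdahl_speedup` is illustrative and does not come from the slides.

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is made `factor` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)


TIER1_FRACTION = 0.71625  # EBCOT Tier-1 share of run time in the profile

# Upper bound: Tier-1 costs nothing at all after acceleration.
upper_bound = 1.0 / (1.0 - TIER1_FRACTION)

for factor in (2, 5, 10):
    overall = amdahl_speedup(TIER1_FRACTION, factor)
    print(f"Tier-1 {factor}x faster -> encoder {overall:.2f}x faster")
print(f"upper bound: {upper_bound:.2f}x")
```

Under this assumption the whole encoder cannot run more than about 3.5 times faster however quick Tier-1 becomes, because DWT, Tier-2, and the remaining work stay in software.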
EBCOT Tier-1 Block Coding (Context-based adaptive binary arithmetic coding)
[Figure: the subbands (LL, LH, HL, HH) from DWT & Quantization are divided into code-blocks 1 to N. Block Coding passes each code-block through Context Formation (CF), which outputs a context and a decision, and the Arithmetic Encoder (AE), producing sub-bitstreams 1 to N that go to Tier-2.]
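Bit-Plane Division of Code-block
[Figure: each pixel is drawn as a column of bits, one sign bit followed by the magnitude bits from MSB to LSB; the bits at the same position across all pixels form a bit-plane. Pixels are marked insignificant or significant. Example bits shown: 1 0 0 1 1 0 0 0.]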
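The bit-plane division can be sketched in a few lines. This is an illustrative sketch only: the function names and the sample coefficients are assumptions, and a pixel is taken to become significant at the first bit-plane (counting from the MSB) where its magnitude bit is 1.

```python
def to_bit_planes(coeffs, num_planes):
    """Split signed coefficients into sign bits and magnitude bit-planes (MSB first)."""
    signs = [1 if c < 0 else 0 for c in coeffs]
    mags = [abs(c) for c in coeffs]
    planes = [[(m >> p) & 1 for m in mags] for p in range(num_planes - 1, -1, -1)]
    return signs, planes


def first_significant_plane(planes, i):
    """Index (0 = MSB plane) where pixel i first has a 1 bit, or None if it never does."""
    for p, plane in enumerate(planes):
        if plane[i]:
            return p
    return None


# Hypothetical 4-pixel example with 4 magnitude bits per pixel.
signs, planes = to_bit_planes([5, -3, 0, 12], num_planes=4)
for p, plane in enumerate(planes):
    print(f"plane {p} (0 = MSB): {plane}")
print("signs:", signs)
```

Pixel 3 (value 12) is significant from the MSB plane onward, while pixel 2 (value 0) stays insignificant in every plane.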
Scanning Each Bit-plane 3 Times
– 4 bits in a column
– N stripes in a bit-plane
– M columns in a stripe
– Scan order: pass > stripe > column > bit
– Code-block size is 4N x M
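The nesting pass > stripe > column > bit corresponds to three nested loops per pass. A minimal sketch, assuming (row, column) coordinates with row 0 at the top; `scan_order` is an illustrative name.

```python
def scan_order(height, width):
    """Yield (row, col) positions in stripe order: stripe > column > bit."""
    for stripe_top in range(0, height, 4):        # N stripes of 4 rows each
        for col in range(width):                  # M columns in a stripe
            for row in range(stripe_top, min(stripe_top + 4, height)):
                yield row, col                    # 4 bits in a column


# An 8 x 2 block has N = 2 stripes and M = 2 columns.
order = list(scan_order(8, 2))
print(order)
```

Each of the three coding passes walks the bit-plane in this same order, so the pass loop would wrap around a call like this one.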
Coding a Bit-plane
[Figure: an example 8x8 bit-plane and the subsets of its bits coded in each of the three passes; significant bits are highlighted.]
0 0 1 1 0 1 0 1
1 1 0 0 0 1 0 0
1 0 0 0 1 1 1 0
0 0 1 1 0 0 1 0
0 1 0 0 0 1 1 0
0 1 1 0 0 0 1 1
1 1 0 0 1 0 0 0
1 0 0 0 0 0 1 1
Pass 1: insignificant bits with significant neighbors
Pass 2: significant bits
Pass 3: bits not coded in the previous two passes
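The three-pass rule can be sketched as a function that labels every bit of a bit-plane with the pass that codes it. This is a simplified sketch: it only decides pass membership, uses the 8 surrounding neighbors, and ignores run-length mode and any optional coding modes. All names and the small example are assumptions made for illustration.

```python
def scan_order(height, width):
    """Stripe-oriented scan: stripe > column > bit."""
    for top in range(0, height, 4):
        for col in range(width):
            for row in range(top, min(top + 4, height)):
                yield row, col


def has_significant_neighbor(sig, r, c):
    """True if any of the up-to-8 neighbors of (r, c) is significant."""
    h, w = len(sig), len(sig[0])
    return any(
        sig[rr][cc]
        for rr in range(max(r - 1, 0), min(r + 2, h))
        for cc in range(max(c - 1, 0), min(c + 2, w))
        if (rr, cc) != (r, c)
    )


def assign_passes(bits, sig):
    """Return (pass number per bit, significance state after the bit-plane)."""
    h, w = len(bits), len(bits[0])
    was_sig = [row[:] for row in sig]   # state at the start of this bit-plane
    sig = [row[:] for row in sig]       # working copy, updated while coding
    passes = [[0] * w for _ in range(h)]
    for r, c in scan_order(h, w):       # Pass 1: insignificant, significant neighbor
        if not sig[r][c] and has_significant_neighbor(sig, r, c):
            passes[r][c] = 1
            if bits[r][c]:
                sig[r][c] = 1           # becomes significant, affects later bits
    for r, c in scan_order(h, w):       # Pass 2: already significant
        if was_sig[r][c]:
            passes[r][c] = 2
    for r, c in scan_order(h, w):       # Pass 3: everything not coded so far
        if passes[r][c] == 0:
            passes[r][c] = 3
            if bits[r][c]:
                sig[r][c] = 1
    return passes, sig


bits = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
sig = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
passes, new_sig = assign_passes(bits, sig)
for row in passes:
    print(row)
```

In the example, the bit at row 0, column 1 turns significant during Pass 1, which pulls its right-hand neighbors into Pass 1 as well; this dependence on earlier bits in scan order is what makes parallel context formation non-trivial.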
Previous work
Context Formation
– Normal mode
  NTU: Skipping methods, Sample Skipping (SS) and Group-of-Column Skipping (GOCS)
  – Reduce 60% cycle count compared with straightforward method
  NCTU: Memory-saving algorithm
  – Reduce 4K bits memory space if the code-block size is 64x64
– Pass-parallel mode
  TKU: Pass-parallel context modeling
  – No cycle wasted
  – 0.1 ~ 0.2 dB image quality degradation
Arithmetic Encoder
– MQ coder (JBIG)
  Osaka University: 4-stage pipelined architecture
Proposed EBCOT Tier-1 Coder
[Figure: block diagram. Blocks: Address Generator, Code Block Memory, State Memory, Context Formation (CF), Compress & PISO, Arithmetic Encoder (AE). Signals: Pixel_in, 16 bits (on two paths), 40 (CX, D)s from CF into Compress & PISO, (CX, D) into AE, Byte_out.]
Context Formation Unit
Data Dependency for 16 Bits
[Figure: the 16 bits processed together are numbered 1 to 16 in a 4x4 grid (1 2 3 4 / 5 6 7 8 / 9 10 11 12 / 13 14 15 16); a chart plots each bit, 1 to 16, against delay steps 1 to 10.]
16-Way Parallel Architecture
[Figure: 16 processing units, numbered 1 to 16, arranged in groups of four (1 to 4, 5 to 8, 9 to 12, 13 to 16) with their interconnections.]
Memory Scheme
[Figure: Stripe N and Stripe N+1 mapped onto three memories (Memory A, Memory B, Memory C), with the ORDER in which the memories are used.]
Stripe Skipping
3 registers record coding condition of all stripes in 3 passes
– A stripe is skipped in Pass1 if all bits in the stripe are significant
– A stripe is skipped in Pass2 if all bits in the stripe are insignificant
– A stripe is skipped in Pass3 if all bits in the stripe have been coded in Pass1 or 2
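The three skip conditions map naturally onto three bit-vector registers with one bit per stripe. A minimal sketch, assuming the significance state is sampled at the start of the bit-plane and that the pass assignment of each bit (1, 2, or 3) is already known; the function name and the data layout are illustrative, not the register format of the actual design.

```python
def stripe_skip_registers(was_sig, passes):
    """Return three ints, one bit per stripe; a set bit means 'skip this stripe'."""
    height = len(was_sig)
    skip1 = skip2 = skip3 = 0
    for s, top in enumerate(range(0, height, 4)):
        rows = range(top, min(top + 4, height))
        sig_bits = [v for r in rows for v in was_sig[r]]
        pass_ids = [p for r in rows for p in passes[r]]
        if all(sig_bits):                        # Pass 1 codes only insignificant bits
            skip1 |= 1 << s
        if not any(sig_bits):                    # Pass 2 codes only significant bits
            skip2 |= 1 << s
        if all(p in (1, 2) for p in pass_ids):   # nothing left for Pass 3
            skip3 |= 1 << s
    return skip1, skip2, skip3


# Hypothetical 8 x 2 block: stripe 0 fully significant, stripe 1 fully insignificant.
was_sig = [[1, 1]] * 4 + [[0, 0]] * 4
passes = [[2, 2]] * 4 + [[3, 3]] * 4
skip1, skip2, skip3 = stripe_skip_registers(was_sig, passes)
print(f"skip Pass1: {skip1:02b}, skip Pass2: {skip2:02b}, skip Pass3: {skip3:02b}")
```

With a 64x64 code-block there are 16 stripes, so under this layout each register would need 16 bits.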
Arithmetic Encoder
Feedback Loop in AE Flow Chart
[Figure: flow chart. Context Decision -> Table Reading (the Context Table supplies index and mps; the Probability Estimation Table (PET) supplies Qe, NMPS, NLPS, SWITCH) -> A Calculation and C Calculation -> Renormalization -> Byte. Index Updating and MPS Updating feed back into the Context Table.]
Proposed Pipelined AE
[Figure: 3-stage pipeline (Stage 1, Stage 2, Stage 3) built from Context Decision, Table Reading, Context Table, Modified Probability Estimation Table (MPET), A Calculation, C Calculation, Bit Shifting, Renormalization, Index Updating, MPS Updating, and Byte output.]
Fast Renormalization
[Figure: two renormalization flow charts built from the following steps.]
Flow chart 1
– CT > A_shift ? (YES / NO)
– C = C << A_shift, CT = CT – A_shift
– C = C << CT, A_shift = A_shift – CT
– BYTEOUT
– Twice ? (YES / NO)
– BYTEOUT2
– DONE
Flow chart 2
– CT > A_shift ? (YES / NO)
– C = C << A_shift, CT = CT – A_shift
– C = C << CT, A_shift = A_shift – CT
– CT = 0
– BYTEOUT
– DONE
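The steps in the flow charts can be read as replacing the bit-by-bit renormalization loop of the MQ coder with multi-bit shifts by either A_shift or CT. A minimal sketch under that reading follows. The branch wiring is an assumption, since only the node labels survive, and `byteout` is a simplified stand-in with no bit stuffing or carry handling; all function names are illustrative.

```python
def byteout(c, out):
    """Simplified BYTEOUT stub (assumption): emit 8 bits of C and reset CT to 8."""
    out.append((c >> 19) & 0xFF)
    return c & 0x7FFFF, 8


def renorm_bit_serial(a, c, ct):
    """Reference loop: shift A and C one bit at a time until A >= 0x8000."""
    out = []
    while True:
        a <<= 1
        c <<= 1
        ct -= 1
        if ct == 0:
            c, ct = byteout(c, out)
        if a & 0x8000:
            return a, c, ct, out


def renorm_multi_bit(a, c, ct):
    """Multi-bit variant: shift by A_shift, or by CT when a byte boundary comes first."""
    out = []
    a_shift = 16 - a.bit_length()   # number of shifts that bring A to >= 0x8000
    a <<= a_shift
    while True:
        if ct > a_shift:            # no byte boundary inside the remaining shift
            c <<= a_shift
            ct -= a_shift
            return a, c, ct, out
        c <<= ct                    # shift up to the byte boundary
        a_shift -= ct
        c, ct = byteout(c, out)     # CT reaches 0, so a byte is emitted


# Worst case for the stub: A = 0x0001 needs 15 shifts and crosses two byte boundaries.
print(renorm_multi_bit(0x0001, 0x12345, 1))
```

Because a single renormalization can need up to 15 shifts while CT is at most 8, two bytes may be produced in one renormalization, which would explain the "Twice ?" test and the BYTEOUT2 node.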
Compress & Parallel In Serial Out
Interaction Between CF and AE
[Figure: timing diagram with Clock, CF_stall, and AE_stall waveforms.]
For example, CF generates 4, 2, 0, 0, 0, 2, 1, 2, 0, and 0 (CX, D) pairs respectively
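The example sequence can be replayed in a small cycle-level model to see where stalls come from. This is a sketch under stated assumptions, not the timing of the actual design: the AE consumes one (CX, D) pair per cycle, the PISO holds a single batch, and CF may load its next batch in the same cycle in which the PISO drains. `simulate` and its return values are illustrative names.

```python
def simulate(counts):
    """Return (total cycles, CF stall cycles, AE stall cycles) for a batch sequence."""
    piso = 0          # pairs still waiting in the PISO
    i = 0             # index of the next CF batch
    cycles = cf_stalls = ae_stalls = 0
    while i < len(counts) or piso > 0:
        cycles += 1
        if piso > 0:
            piso -= 1             # AE encodes one pair
        else:
            ae_stalls += 1        # nothing to encode
        if i < len(counts):
            if piso == 0:
                piso = counts[i]  # CF loads its next batch into the empty PISO
                i += 1
            else:
                cf_stalls += 1    # PISO still busy, CF must wait
    return cycles, cf_stalls, ae_stalls


example = [4, 2, 0, 0, 0, 2, 1, 2, 0, 0]
cycles, cf_stalls, ae_stalls = simulate(example)
print(f"cycles={cycles}, CF stalls={cf_stalls}, AE stalls={ae_stalls}")
```

In this model CF stalls after large batches and the AE stalls during runs of empty batches, so the 11 pairs of the example take 16 cycles.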
Overlapping CF and AE
[Figure: two timing diagrams, each with Clock, CF_stall, and AE_stall waveforms, for the same sequence of (CX, D) counts: 4 2 0 0 0 2 1 2 0 0.]
AHB Interface
[Figure: IP core block diagram. The Tier-1 Encoder (Context Formation, Arithmetic Encoder, Memory Block) attaches to the AHB through a Slave Interface (Register Block, Slave Controller, Slave Transaction) and a Master Interface (Master Controller).]
Objective of Experiment
The objective of our experiment is to prove
– Low power
– High performance
– AHB-compliant
Test pattern
– 512x512 gray images (airplane, baboon, lena, peppers)
– 1 tile
– 5/3 DWT
– 3 decomposition levels
– Code-block size 64x64
IP Qualification & Code Coverage
IP qualification (nLint)
– Compliant with RMM guidelines
Code coverage (Verification Navigator)
                      Our design    General expectancy
Statement coverage    97.9%         95%
Branch coverage       95.8%         95%
Toggle coverage       100%          95%
Path coverage         69.2%         50%
Design for testability (TetraMAX)
Total faults    Test patterns    Test coverage
77,200          439              99.99%
Synthesis and power analysis
                        Our design
Technology library      TSMC 0.35 µm
Area (gate count)       25,706 + 45kb memory
Max. Frequency (MHz)    43.48
Power (mW)              26.68
Synthesis tool: Design Compiler (under WCCOM)
Power analysis tool: PrimePower
Composition of coding cycle
[Figure: stacked bar chart, unit: 1,000,000 cycles]
            # of contexts ((CX, D)s)    # of BYTEOUTs    # of AE stalls    Total
Airplane    1.32                        0.14             0.12              (1.58)
Baboon      1.75                        0.22             0.13              (2.1)
Lena        1.46                        0.17             0.12              (1.75)
Peppers     1.35                        0.16             0.11              (1.62)
Simulation tool: ModelSim SE/PE 5.7e
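The totals in parentheses are the sums of the three components, which can be checked directly. The sketch below also reports how far each total lies above the number of (CX, D) pairs, treating that count as the minimum cycle count because the AE takes one pair per cycle. The split between BYTEOUTs and AE stalls follows the legend order of the chart and is an assumption; the variable and function names are illustrative.

```python
# Unit: 1,000,000 cycles. Tuple order: (contexts, BYTEOUTs, AE stalls).
composition = {
    "Airplane": (1.32, 0.14, 0.12),
    "Baboon": (1.75, 0.22, 0.13),
    "Lena": (1.46, 0.17, 0.12),
    "Peppers": (1.35, 0.16, 0.11),
}


def total_and_overhead(contexts, byteouts, ae_stalls):
    """Return (total cycles, fraction of cycles spent beyond the context count)."""
    total = contexts + byteouts + ae_stalls
    overhead = (total - contexts) / contexts
    return round(total, 2), overhead


for image, parts in composition.items():
    total, overhead = total_and_overhead(*parts)
    print(f"{image:9s} total {total:.2f}  overhead {overhead:.1%}")
```

The BYTEOUT and AE-stall cycles are the part of the total that optimizations such as faster renormalization and better CF/AE overlap can remove.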
Cycle reduction
[Figure: bar chart for Airplane, Baboon, Lena, and Peppers, unit: 1,000,000 cycles. Annotations: 2% reduction by stripe-skipping; 9% reduction by proposed renormalization. Bar labels: 0.078, 0.084, 0.069, 0.077 and 0.006, 0.005, 0.007, 0.005.]
Comparison
[Figure: bar chart of coding cycles for Airplane, Baboon, Lena, and Peppers, unit: 1,000,000 cycles, with three series: our design, column-based design, and lower bound. Bar labels: our design 1.54, 1.43, 1.83, 1.41; column-based design 1.88, 1.74, 1.76, 2.1; lower bound 1.32, 1.75, 1.46, 1.35.]
Platform Architecture: Altera Excalibur EPXA10DDR
[Figure: block diagram. Embedded Stripe: ARM922T Processor, Interrupt Controller, WatchDog Timer, AHB1, AHB1 to AHB2 Bridge, AHB2, DPRAM 128KB, SRAM 256KB, SDRAM Controller (SDRAM 128MB), UART Controller, External Bus Interface (FLASH 32MB), PLD to Stripe Bridge, Stripe to PLD Bridge. FPGA: PLD Master 1 to 3 and PLD Slave 1 to 4.]
Platform-based SOC Design Flow
Tools: ADS; SOPC Builder; Quartus II; ESS, ADS and ModelSim SE
[Figure: design flow chart. System spec. -> Profiling & HW/SW partition -> Software spec. and HW spec. for each component. Hardware side: FPGA design and Stripe configuration, BUS interface HDL coding and Device HDL coding (Interface.v, Accelerator.v, Component.v, Stripe.v), Integration (SOPC Builder), System PTF, SOPC generation, Pin assignment & Hardware compilation (Quartus II), Hardware image. Software side: Software coding with Library, user-defined firmware, Excalibur.h, Stripe.h, and Component SDK (*.c files), Compilation (ADS), Software image. HW/SW co-simulation uses the RTL codes; System building (Quartus II) produces the System image for prototyping on the SOPC platform.]
FPGA Prototyping Result
Platform: Altera Excalibur EPXA10DDR, 25MHz
[Figure: stacked bar chart of encoding time in seconds, split into Tier-1 and others, for Airplane, Baboon, Lena, and Peppers. Bar labels: pure software 100.98, 80.18, 81.47, 85.34; proposed accelerator 26.99, 31.2, 27.6, 27.78.]
Summary
Proposition
– 16-bit parallel context generator
– Stripe-skipping method
– 3-stage pipelined arithmetic encoder
– Renormalization strategy with forwarding method
Contribution
– We reduce the cycle count by 17% compared with the best-known design
– Our cycle count is within 5% of the optimum
Future Work
– Ping-pong method for the “compress & PISO” to reduce the 5% coding cycles
– ASIC Integration
Demo on the SoC platform
Pure software (100 sec.)
– Configure the FPGA
– Load the original image to SDRAM
– Execute the JPEG2000 encoder
– Get the compressed image from SDRAM
– Record the time consumed
Proposed accelerator (50 sec.)
– Configure the FPGA
– Load the original image to SDRAM
– Execute the JPEG2000 encoder
– Get the compressed image from SDRAM
– Compare images and time consumed