JPEG Decompression Andy Hocker Jeff Huxel Joel Williams Matthew Tilleman.

JPEG Decompression

Andy Hocker

Jeff Huxel

Joel Williams

Matthew Tilleman

Presentation Overview

– Project Overview
– Breakdown of work
– A little about JPEG
– SW Design
– iMX21 LCD Driver
– HW Design
– Conclusions and results

Project Overview

Originally…
– MPEG decompression and playback on the iMX21.
– Hardware offloading to the FPGA when needed.
– At least in the cases we investigated, the hardware was not capable enough to support this.

Project Overview

And now…
– Real-time JPEG decompression into a bitmap stored in memory.
– The iMX21 then takes the bitmap and displays it on the screen.
– We were able to display successive images.

Breakdown of work

JPEG decompression:
– SW JPEG: Andy
– HW IDCT: Jeff

LCD interfacing and registers:
– Joel and Matthew

A little about JPEG

Why is JPEG so pervasive? Because of how much compression it achieves for a given image quality. The image is divided into 8x8 pixel blocks, and each block is converted to the frequency domain with a set of orthogonal cosine functions (the discrete cosine transform, DCT). The frequency components are coefficients that also form an 8x8 array, in frequency space. Because most images don't have much high-frequency spectral content, most high-frequency coefficients are zero or close to zero, so the frequency array can be stored with less data than the original image array.
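The energy-compaction claim is easy to check with a direct (slow) forward DCT. Below is a minimal C sketch, assuming the standard JPEG 8x8 DCT definition and double-precision arithmetic; `fdct8x8` and the cosine helper `cos16` are illustrative names, not functions from our code.

```c
/* cos(m * pi / 16) for any m >= 0, from a small table (no libm needed). */
static double cos16(int m)
{
    static const double c[9] = {
        1.0, 0.98078528040323, 0.92387953251129, 0.83146961230255,
        0.70710678118655, 0.55557023301960, 0.38268343236509,
        0.19509032201613, 0.0
    };
    m %= 32;                          /* period is 32 in these units */
    if (m <= 8)  return c[m];
    if (m <= 16) return -c[16 - m];
    if (m <= 24) return -c[m - 16];
    return c[32 - m];
}

/* Forward 8x8 DCT, direct form:
 * F(u,v) = 1/4 C(u) C(v) sum_x sum_y f(x,y) cos((2x+1)u pi/16) cos((2y+1)v pi/16)
 * with C(0) = 1/sqrt(2) and C(k) = 1 otherwise. */
void fdct8x8(double f[8][8], double F[8][8])
{
    for (int u = 0; u < 8; u++) {
        for (int v = 0; v < 8; v++) {
            double sum = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    sum += f[x][y] * cos16((2 * x + 1) * u) * cos16((2 * y + 1) * v);
            double cu = (u == 0) ? cos16(4) : 1.0;   /* cos(pi/4) = 1/sqrt(2) */
            double cv = (v == 0) ? cos16(4) : 1.0;
            F[u][v] = 0.25 * cu * cv * sum;
        }
    }
}
```

For a flat block of value 20 (no detail at all), `F[0][0]` comes out as 160 (8 * 20) and every AC term is zero up to rounding error, so the whole block is carried by a single coefficient.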

Further compression is achieved by Huffman coding the data stream, which uses variable-length bit codes, giving the shortest codes to the most frequently used symbols, to further decrease the file size.

There is also a quantization table: each DCT coefficient is divided by its table entry and rounded, so at the expense of image quality the table values can be increased to make the coefficients stored in the data stream smaller.
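To make the quantization trade-off concrete, here is a minimal sketch of what happens to one coefficient, assuming plain round-to-nearest (an encoder is free to choose its own rounding); `quantize` and `dequantize` are illustrative names.

```c
/* Quantize one DCT coefficient: divide by the table entry q (> 0) and
 * round to the nearest integer, halves away from zero. This rounding is
 * where information is thrown away. */
int quantize(int coeff, int q)
{
    if (coeff >= 0)
        return (coeff + q / 2) / q;
    return -((-coeff + q / 2) / q);
}

/* Decoder side: multiply the stored level by the same table entry. */
int dequantize(int level, int q)
{
    return level * q;
}
```

With a table entry of 16, a coefficient of 100 is stored as 6 and comes back as 96, while a coefficient of -7 rounds to 0 and is lost entirely. Larger table entries give smaller stored values (fewer bits after Huffman coding) and larger rounding errors.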

JPEG is a lossy file format (except for some incredibly rare DPM settings, which are not supported in our software or in most software I have seen).

A little bit about JPEG

JPEG uses a color space called YCbCr. Y is the luminance, Cb is the blue difference and Cr is the red difference.

One of the reasons for using this color space has to do with human vision.

The human eye is significantly more sensitive to changes in luminance than to changes in color.

JPEG capitalizes on this by allowing the encoder to sub-sample Cb and Cr, so that further compression is achieved.

The typical setting is that for every 4 8x8 luminance blocks (a 2x2 array) there is 1 Cr block and 1 Cb block. Interpolation is then used to fill in chroma values for every pixel.
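With 2x2 subsampling, one 8x8 Cb block and one 8x8 Cr block cover the same 16x16 pixels as four Y blocks, so each such group needs 6 blocks instead of 12. The simplest way to recover a chroma value for every pixel is nearest-neighbour replication; the sketch below assumes that method purely for illustration, since a decoder may instead interpolate between neighbouring samples. `upsample_2x2` is an illustrative name.

```c
/* Nearest-neighbour upsampling of one 8x8 chroma block so that it
 * covers a 16x16 area (four 8x8 luma blocks, 2x2 subsampling).
 * Each chroma sample is replicated over a 2x2 group of pixels. */
void upsample_2x2(unsigned char in[8][8], unsigned char out[16][16])
{
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            out[y][x] = in[y / 2][x / 2];
}
```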

There are some noticeable side effects of this, depending on image resolution.

Picture courtesy Wikipedia

Subsampling of Chrominance

1 Cr and 1 Cb IDCT for each 2x2 group of Y (luminance) IDCTs, 32x32 pixels.

1 Cr, 1 Cb and 1 Y (luminance) IDCT for each 8x8 array, 32x32 pixels.

Huffman coding

Huffman coding decreases overall size by assigning each code word (CW) a bit sequence whose length depends on how frequently that CW is used.

The bit sequences for CWs that occur more frequently are shorter.

JPEG uses these code words to help determine the values of the AC and DC coefficients.

JPEG headers only define the number of codes of each length, plus the code words themselves; the decompressor needs to build the tree.

Each of Y, Cr and Cb, and each of AC and DC, can have its own Huffman table.
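Because the header gives only the number of codes of each length and the code words in order, the actual bit patterns are implied. A minimal sketch of that canonical assignment, following the procedure in Annex C of the JPEG specification; `build_codes` is an illustrative name, and it assumes a valid table with at most 256 code words.

```c
#include <stdint.h>

/* Build canonical Huffman codes from a DHT segment.
 * counts[i] = number of codes of length i+1 (16 entries); code words are
 * listed in order of increasing code length. Fills code[k] and size[k]
 * for the k-th code word and returns how many there are.
 * Codes of one length are consecutive integers; moving to the next
 * length shifts the running code left by one bit. */
int build_codes(const uint8_t counts[16], uint16_t code[256], uint8_t size[256])
{
    int k = 0;
    uint16_t c = 0;
    for (int len = 1; len <= 16; len++) {
        for (int i = 0; i < counts[len - 1]; i++) {
            code[k] = c++;
            size[k] = (uint8_t)len;
            k++;
        }
        c <<= 1;
    }
    return k;
}
```

With the standard luminance DC table from Annex K of the specification (counts 0, 1, 5, 1, 1, 1, 1, 1, 1, 0, ..., code words 0x00 to 0x0b in order), code word 0x04 is assigned the pattern 101, which matches the 101 example for CW 0x04 on this slide.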

[Figure: example Huffman tree with leaf code words 0x00, 0x04, 0x06, 0x08, 0x09 and 0x0a]

The lower nibble of a code word tells the system how many more bits to read in to determine the coefficient's value. For AC coefficients, the upper nibble (N) tells the decompressor that the previous N coefficients were zero, and code word 0x00 signals that all remaining coefficients are zero.

Example: the data stream is 101, which corresponds to CW 0x04, so 4 more bits are read in: 1111, so the value is 15 (if the bits were 0111, the value would be -8 [7-15]). CW 0x04 does not need to represent values from -7 to 7, since CWs 0x01 to 0x03 already cover those numbers, and so on.
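The sign rule in this example can be written as a small function. This is a sketch of the EXTEND step from the JPEG specification; `extend` is an illustrative name, and `v` holds the raw bits read after the code word.

```c
/* Turn the nbits raw bits read after a code word into a signed value.
 * A leading 1 bit means the value is positive and used as-is; a leading
 * 0 bit means it is negative: v - (2^nbits - 1). */
int extend(int v, int nbits)
{
    if (nbits == 0)
        return 0;
    if (v < (1 << (nbits - 1)))
        return v - ((1 << nbits) - 1);
    return v;
}
```

`extend(15, 4)` returns 15 and `extend(7, 4)` returns -8, matching the 1111 and 0111 cases.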

Next steps

The coefficient values are then de-quantized with the quantization table.

One note: the AC coefficients are added directly, but the DC coefficient is actually stored as the difference from the DC value of the previous block of the same component. This saves space.
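Undoing the DC differencing takes one running predictor per component. A minimal sketch; `dc_state`, `dc_reset` and `dc_decode` are hypothetical names, and the predictors start at 0 at the beginning of a scan (the JPEG specification also resets them at each restart marker).

```c
/* One running DC predictor per component (0 = Y, 1 = Cb, 2 = Cr). */
typedef struct {
    int pred[3];
} dc_state;

void dc_reset(dc_state *s)
{
    s->pred[0] = s->pred[1] = s->pred[2] = 0;
}

/* The stream stores diff = DC - previous DC of the same component,
 * so the decoder adds each diff to that component's predictor. */
int dc_decode(dc_state *s, int component, int diff)
{
    s->pred[component] += diff;
    return s->pred[component];
}
```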

They are then placed into the 8x8 frequency matrix following the zig-zag pattern shown to the left.

The zig-zag order is used because it fills the matrix so that the lowest frequencies are added first.
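The zig-zag order can be generated rather than typed in by hand. A minimal sketch, assuming coefficients are stored row-major in a 64-entry array; `build_zigzag` is an illustrative name and is not our `zigzag()` function, which maps a coefficient number to an x/y position instead.

```c
/* order[k] = natural (row * 8 + col) position of the k-th coefficient
 * in the stream. The scan walks the anti-diagonals row + col = s,
 * alternating direction, so the lowest frequencies come first. */
void build_zigzag(int order[64])
{
    int k = 0;
    for (int s = 0; s <= 14; s++) {
        int lo = s < 8 ? 0 : s - 7;   /* smallest valid row on this diagonal */
        int hi = s < 8 ? s : 7;       /* largest valid row on this diagonal */
        if (s % 2 == 0) {
            /* even diagonals run up and to the right (row decreasing) */
            for (int row = hi; row >= lo; row--)
                order[k++] = row * 8 + (s - row);
        } else {
            /* odd diagonals run down and to the left (row increasing) */
            for (int row = lo; row <= hi; row++)
                order[k++] = row * 8 + (s - row);
        }
    }
}
```

The first entries are 0, 1, 8, 16, 9, 2, 3, ...: the DC term, then the lowest horizontal and vertical frequencies, working outward.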

The values are then fed into the IDCT. This is repeated for the Y, Cr and Cb components.

The values are then converted from YCbCr space to RGB space.
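A minimal sketch of the conversion using the JFIF equations with the floating-point constants scaled by 2^16, so only integer multiplies, adds and a shift are needed (the same idea as the fixed-point rewrite described in the SW design notes). `ycc_to_rgb` and `clamp255` are illustrative names, not our `ycc_rgb()`, and the code assumes `>>` on a negative `int` is an arithmetic shift, as it is with GCC on ARM and x86.

```c
/* Clamp an int to the 0..255 range of an 8-bit channel. */
static unsigned char clamp255(int v)
{
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* JFIF YCbCr -> RGB in 16.16 fixed point:
 *   R = Y + 1.402    (Cr - 128)
 *   G = Y - 0.344136 (Cb - 128) - 0.714136 (Cr - 128)
 *   B = Y + 1.772    (Cb - 128)
 * Constants are the coefficients times 65536, rounded. */
void ycc_to_rgb(int y, int cb, int cr,
                unsigned char *r, unsigned char *g, unsigned char *b)
{
    const int half = 1 << 15;   /* rounding term for the >> 16 */
    cb -= 128;
    cr -= 128;
    *r = clamp255(y + ((91881 * cr + half) >> 16));
    *g = clamp255(y - ((22554 * cb + 46802 * cr + half) >> 16));
    *b = clamp255(y + ((116130 * cb + half) >> 16));
}
```

Neutral chroma (Cb = Cr = 128) gives a gray equal to Y, while Y = 128 with Cb = Cr = 0 gives pure green (0, 255, 0).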

And...

Sample quantization table

Voila! An Image

Actual results from the software IDCT implementation, displayed on the LCD at 16 bpp

SW Design

SW Design

Software was divided into modules to make it easy to use different IDCT approaches, both software and hardware.

The code can display to the LCD in 16 bpp mode using /dev/fb and the imxfb driver, and was designed to work with the new LCD driver.
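At 16 bpp each pixel has to be packed into 16 bits before it is written to the frame buffer. A minimal sketch, assuming the common RGB565 layout (red in the top 5 bits); the actual layout should be checked against the red/green/blue offsets the frame buffer reports in `struct fb_var_screeninfo`. `rgb565` is an illustrative name.

```c
#include <stdint.h>

/* Pack 8-bit R, G, B into one RGB565 pixel: 5 bits red (bits 15..11),
 * 6 bits green (bits 10..5), 5 bits blue (bits 4..0).
 * The low-order bits of each channel are simply dropped. */
uint16_t rgb565(unsigned char r, unsigned char g, unsigned char b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}
```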

The program can also write the resulting RGB image to a PPM file (an RGB bitmap) so that the results can be analyzed offline.
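PPM is convenient for offline analysis because it needs no library. A minimal sketch of a binary (P6) writer; `write_ppm` is an illustrative name, not our `print_frame()`. With a maximum sample value of 255 every sample is a single byte, so the pixel data itself has no byte-order issues.

```c
#include <stdio.h>

/* Write a binary PPM (P6): a short text header followed by raw
 * 8-bit R, G, B triples, row by row. Returns 0 on success. */
int write_ppm(FILE *fp, int width, int height, const unsigned char *rgb)
{
    if (fprintf(fp, "P6\n%d %d\n255\n", width, height) < 0)
        return -1;
    size_t n = (size_t)width * (size_t)height * 3;
    return fwrite(rgb, 1, n, fp) == n ? 0 : -1;
}
```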

The code has little- and big-endian support for file writing and has been compiled on ARM, x86 and PPC.

SW Design
– Picture.c: composed of functions for decoding the JPEG file, displaying it and writing files. It is a little unwieldy due to a large case statement that determines which JPEG header marker is currently being worked on.
– Idct.c: contains all the code for the 3 IDCT implementations (see next slide).
– Main.c: opens the frame buffer and memory maps it, initializes the IDCTs, decodes the picture and displays it (and writes it to a file if that is compiled in).
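The marker handling in Picture.c boils down to dispatching on the byte after each 0xFF. A minimal sketch of walking the header segments of an in-memory file, assuming well-formed input with no fill bytes; `marker_name` and `list_markers` are illustrative names, and only a handful of the marker codes defined by the JPEG specification are listed.

```c
#include <stddef.h>

/* Name a few common JPEG markers (the byte that follows 0xFF). */
const char *marker_name(unsigned char m)
{
    switch (m) {
    case 0xD8: return "SOI";   /* start of image */
    case 0xE0: return "APP0";  /* JFIF header */
    case 0xDB: return "DQT";   /* quantization table */
    case 0xC0: return "SOF0";  /* baseline frame header */
    case 0xC4: return "DHT";   /* Huffman table */
    case 0xDA: return "SOS";   /* start of scan */
    case 0xD9: return "EOI";   /* end of image */
    default:   return "other";
    }
}

/* Walk the header segments, stopping at SOS. Each header segment after
 * SOI is 0xFF, the marker byte, then a 16-bit big-endian length that
 * counts the two length bytes themselves. Stores up to max marker bytes
 * in out and returns how many were found. */
int list_markers(const unsigned char *buf, size_t len, unsigned char *out, int max)
{
    size_t pos = 2;                     /* skip SOI (FF D8) */
    int n = 0;
    if (len < 2 || buf[0] != 0xFF || buf[1] != 0xD8)
        return 0;
    while (pos + 4 <= len && buf[pos] == 0xFF && n < max) {
        unsigned char m = buf[pos + 1];
        out[n++] = m;
        if (m == 0xDA || m == 0xD9)     /* entropy-coded data or end follows */
            break;
        size_t seglen = ((size_t)buf[pos + 2] << 8) | buf[pos + 3];
        pos += 2 + seglen;
    }
    return n;
}
```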

IDCT

The first SW implementation is a dumb, verbatim implementation of the IDCT equation: 8*8*8*8 = 4096 loops per 8x8 IDCT!!

The second implementation is of the Wang-Chen algorithm, which only needs 16 loops but uses a 1024-int lookup table.
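For reference, a sketch of what a verbatim IDCT looks like, assuming double-precision arithmetic for clarity (our ARM build used fixed point instead); `idct_naive` and `cos16` are illustrative names. The iteration counter makes the 4096 figure concrete.

```c
/* cos(m * pi / 16) for any m >= 0, from a small table (no libm needed). */
static double cos16(int m)
{
    static const double c[9] = {
        1.0, 0.98078528040323, 0.92387953251129, 0.83146961230255,
        0.70710678118655, 0.55557023301960, 0.38268343236509,
        0.19509032201613, 0.0
    };
    m %= 32;
    if (m <= 8)  return c[m];
    if (m <= 16) return -c[16 - m];
    if (m <= 24) return -c[m - 16];
    return c[32 - m];
}

/* 8x8 IDCT straight from the definition:
 * f(x,y) = 1/4 sum_u sum_v C(u) C(v) F(u,v) cos((2x+1)u pi/16) cos((2y+1)v pi/16).
 * Four nested 8-iteration loops: the inner body runs 8*8*8*8 = 4096
 * times per block. Returns that count for illustration. */
long idct_naive(double F[8][8], double f[8][8])
{
    long iterations = 0;
    for (int x = 0; x < 8; x++)
        for (int y = 0; y < 8; y++) {
            double sum = 0.0;
            for (int u = 0; u < 8; u++)
                for (int v = 0; v < 8; v++) {
                    double cu = u ? 1.0 : cos16(4);   /* 1/sqrt(2) */
                    double cv = v ? 1.0 : cos16(4);
                    sum += cu * cv * F[u][v]
                         * cos16((2 * x + 1) * u) * cos16((2 * y + 1) * v);
                    iterations++;
                }
            f[x][y] = 0.25 * sum;
        }
    return iterations;
}
```

For a block whose only nonzero coefficient is a DC value of 80, every output pixel is 10 and the counter reports 4096.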

For the FPGA IDCT, we initialize by memory mapping the FPGA; the IDCT function writes the values and can then immediately read the results back.
– One issue we ran into is that when /dev/mem was memory mapped and the JPEG file was in /tmp, some corruption occurred. When the file was moved to the flash drive, the problem went away.

SW Design

There is an entire hand-made implementation of a binary tree, called struct node, to support the optimized Huffman trees.
– The decode time per symbol is on the order of the Huffman code bit length; I think you might be able to create a hashing function that could make this O(1).
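One way to get the O(1) lookup mentioned above is a lookahead table rather than a hash: index an array with the next `maxlen` bits of the stream and read off the symbol and the number of bits it actually used. The sketch below assumes that approach; `huff_lut`, `lut_build` and `lut_decode` are illustrative names. Since the arrays are sized for 16-bit codes, each table costs 128 kB, which is why practical decoders (libjpeg, for example) keep only a short lookahead and fall back to a slower path for long codes.

```c
#include <stdint.h>

#define MAXLEN 16

/* Lookahead table: every code of length L fills 2^(maxlen - L) entries. */
typedef struct {
    int maxlen;
    uint8_t sym[1 << MAXLEN];
    uint8_t len[1 << MAXLEN];   /* 0 = not a valid code prefix */
} huff_lut;

/* counts[i] = number of codes of length i+1; symbols in DHT order. */
void lut_build(huff_lut *t, const uint8_t counts[16], const uint8_t *symbols)
{
    int maxlen = 0;
    for (int l = 1; l <= 16; l++)
        if (counts[l - 1])
            maxlen = l;
    t->maxlen = maxlen;
    for (int i = 0; i < (1 << maxlen); i++) {
        t->sym[i] = 0;
        t->len[i] = 0;
    }
    unsigned code = 0;          /* canonical code, as in JPEG Annex C */
    int k = 0;
    for (int l = 1; l <= maxlen; l++) {
        for (int i = 0; i < counts[l - 1]; i++, code++, k++) {
            int shift = maxlen - l;
            for (unsigned j = code << shift; j < ((code + 1) << shift); j++) {
                t->sym[j] = symbols[k];
                t->len[j] = (uint8_t)l;
            }
        }
        code <<= 1;
    }
}

/* bits = the next t->maxlen stream bits, MSB first (zero-padded at the
 * end of the stream). Returns the symbol and sets *used to the number
 * of bits really consumed (0 means the bits are not a valid code). */
int lut_decode(const huff_lut *t, unsigned bits, int *used)
{
    *used = t->len[bits];
    return t->sym[bits];
}
```

With the standard luminance DC table (longest code 9 bits), the 9-bit window 101000000 decodes to code word 0x04 and reports that only 3 bits were consumed.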

No memory leaks were noticed, based on checking top between successive runs.
– Each image is on the order of 240*320*16 bit = 153,600 bytes, about 150 kB.

The code was optimized to read and write data an int at a time, instead of the initial byte-at-a-time format.

All the initial floating-point calculations, used for the IDCT and the YCbCr-to-RGB conversion, were modified to work as fixed-point shifts and adds. This was done because the ARM doesn't have an FPU, so floating point would have been emulated.

For fun, all the C functions used in JPEG decompression:

int free_huf(struct image* imgp);
int free_node(struct node* n);
int create_tree(struct node* top, unsigned char* num_code_per, unsigned char* codes);
int add_node(struct node* nd, unsigned char* level, unsigned char* cw);
int fillup(struct node* nd);
int initialize_img(struct image* imgp);
unsigned char get_bit(unsigned int index, unsigned int* arry);
unsigned short get_bits(unsigned int index, unsigned char length, unsigned int* arry);
int decode_data(struct image* imgp, unsigned short length, unsigned int* entptr, struct mcu* mcu_bl, struct pix_frame* fp, int* jpg, unsigned char* old);
int start_image(char *argv, struct pix_frame* framep);
int add_MCU_frame(struct pix_frame* frame, struct mcu* mcu_bl, struct image* imgp, unsigned int* mcu_num);
int make_frame(struct pix_frame* frame);
int free_frame(struct pix_frame* frame);
void ycc_rgb(unsigned char* y, unsigned char* cb, unsigned char* cr, unsigned char* r, unsigned char* g, unsigned char* b);
void print_frame(struct pix_frame* frame, char * argv);
unsigned short read_data(unsigned int* entptr, int size, int* jpg, unsigned char* old);
unsigned short read_data_off(unsigned int* entptr, int size, int* jpg, unsigned char* old, unsigned int* offset);
void display_frame(struct pix_frame* frame);
void set_coeff(int b);
int add_coeff(struct mcu* mcu_block, unsigned char* component, unsigned char* hsamp, unsigned char* vsamp, unsigned char* coeff_num, int* value, unsigned char* quantize);
void zigzag(unsigned char* coeff_num, unsigned char* x, unsigned char* y);
void mcu_init(struct mcu* mcu_zero, unsigned char* component, unsigned char* hsamp, unsigned char* vsamp);
int run_idct(struct mcu* mcu_block, unsigned char* component, unsigned char* hsamp, unsigned char* vsamp);
void mcu_dc_init(struct mcu* mcup);
void multiply(int * var, char table);
void init_idct(void);
void free_idct(void);
void idct(struct mcu* mcu_block, unsigned char* component, unsigned char* hsamp, unsigned char* vsamp);
void idctrow(int i, struct mcu* mcu_block, unsigned char* component, unsigned char* hsamp, unsigned char* vsamp);
void idctcol(int i, struct mcu* mcu_block, unsigned char* component, unsigned char* hsamp, unsigned char* vsamp);
void init_idct_fpga();
void free_idct_fpga();
void idct_fpga(struct mcu* mcu_block, unsigned char* component, unsigned char* hsamp, unsigned char* vsamp);

2165 lines of code

iMX21 LCD Driver

iMX21 LCD Driver
lcdtest.c
– Test application
– Reads and displays default register values as a sanity check
– [Attempts to] display all white to verify correct operation
lcdintf.c
– Abstraction layer
– Provides an interface to the LCD registers via “human-friendly” function names
mx21lcd.c
– Low-level driver
– Performs direct register reads/writes

[Figure: driver structure diagram showing mx21lcd.c, lcdintf.c, lcdtest.c, lcdintf.h and mx21lcd.h]

LCD Driver – mx21lcd.c

Provides register access to control:
– Physical screen size (W x H: 240 x 320)
– Virtual page width (also 240)
– Bits per pixel (the target was 18 bpp: 6 bits each for red, green and blue. This is the maximum color depth supported by the iMX21, giving 2^18 = 262,144 colors)
– Screen refresh on/off

LCD Driver – mx21lcd.c (cont.)
– Screen start address (holds the memory address of the first pixel)
  16 bpp stores 2 pixels per word
  18 bpp stores 1 pixel per word (14 bits are wasted)
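The storage cost of the two modes for the 240 x 320 panel can be checked with a small helper; `fb_words` is illustrative and handles only the 16 and 18 bpp cases described here.

```c
/* 32-bit words needed for a w x h frame buffer.
 * 16 bpp packs 2 pixels per word; 18 bpp uses one word per pixel,
 * leaving 32 - 18 = 14 bits unused. */
unsigned long fb_words(unsigned long w, unsigned long h, int bpp)
{
    unsigned long pixels = w * h;
    if (bpp == 16)
        return (pixels + 1) / 2;
    return pixels;              /* 18 bpp: 1 pixel per word */
}
```

At 16 bpp the frame needs 38,400 words (153,600 bytes); at 18 bpp it needs 76,800 words (307,200 bytes), the same size as the buffer allocated in lcdtest.c.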

LCD Driver – lcdintf.c

Provides a friendlier method to control the LCD
– “Human-readable” function names, e.g.
int lcdintf_set_screen_address_start(int id, int addr);
int lcdintf_get_screen_address_start(int id, int *addr);
int lcdintf_set_screen_size(int id, int xsize, int ysize);
int lcdintf_get_screen_size(int id, int *xsize, int *ysize);
– Each register has a get/set function for easy reads/writes
– Friendly function parameters, e.g. xsize and ysize, are used so that the specific bit layout of those parameters in the register is transparent to the programmer.
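The point of the friendly parameters is that only one function knows how values are split into register fields. A sketch of that idea with a made-up field layout: the shifts and masks are hypothetical, not taken from the iMX21 reference manual, and `pack_screen_size`/`unpack_screen_size` are illustrative helpers of the kind a function like `lcdintf_set_screen_size` could call.

```c
#include <stdint.h>

/* HYPOTHETICAL layout for illustration only: width in units of 16
 * pixels in bits 25..20, height in pixels in bits 9..0. Check the
 * iMX21 reference manual for the real screen-size register. */
#define XMAX_SHIFT 20
#define XMAX_MASK  0x3Fu
#define YMAX_MASK  0x3FFu

uint32_t pack_screen_size(int xsize, int ysize)
{
    return (((uint32_t)(xsize / 16) & XMAX_MASK) << XMAX_SHIFT)
         | ((uint32_t)ysize & YMAX_MASK);
}

void unpack_screen_size(uint32_t reg, int *xsize, int *ysize)
{
    *xsize = (int)((reg >> XMAX_SHIFT) & XMAX_MASK) * 16;
    *ysize = (int)(reg & YMAX_MASK);
}
```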

LCD Driver – lcdtest.c

Driver test application
– Reads and displays default values of registers.
– Sets screen size to 240 x 320.
– Sets virtual page width to 240.
– Sets bits per pixel to 6.
– Disables panel (refresh mode OFF).
– Allocates memory to hold pixel information:
  malloc an array of ints with size 240 x 320 = 76800 (76800 * 4 = 307200 bytes).
  Set each pixel to white by writing 0xffffffff to each location in the malloc'd array.
– Enables panel (refresh mode ON).

HW Design

2D 8x8 IDCT FPGA Accelerator

2D IDCT implemented on the FPGA
– Algorithm validated and fixed-point coefficients generated using MATLAB
– FPGA hardware coded in Verilog and tested in Virsim with a testbench
– C program written to directly test the FPGA on individual 1-D or 2-D IDCTs

2D DCT and IDCT

[Figure: 2D DCT and IDCT equations]
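For reference, the standard JPEG definitions of the 8x8 forward DCT and IDCT, with $C(0) = 1/\sqrt{2}$ and $C(k) = 1$ for $k > 0$, are:

$$F(u,v) = \frac{1}{4} C(u)\,C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}$$

$$f(x,y) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C(u)\,C(v)\,F(u,v) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}$$

Each cosine factor depends on only one coordinate, so the 2-D IDCT separates into 8 row transforms followed by 8 column transforms, each an 8-point 1-D IDCT:

$$f(x) = \frac{1}{2} \sum_{u=0}^{7} C(u)\,F(u) \cos\frac{(2x+1)u\pi}{16}$$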

Validate IDCT Implementation
– Verify my_idct (fast 1-D IDCT)
– Verify decomposition of the 2-D IDCT into 16 1-D IDCTs
– Use MATLAB for algorithm verification
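A direct-form sketch of the 1-D 8-point IDCT and of the 16-pass row/column decomposition, assuming double precision; this is the plain O(N²) 1-D transform, not the fast butterfly structure used in the hardware, and `idct1d`, `idct2d_separable` and `cos16` are illustrative names.

```c
/* cos(m * pi / 16) for any m >= 0, from a small table (no libm needed). */
static double cos16(int m)
{
    static const double c[9] = {
        1.0, 0.98078528040323, 0.92387953251129, 0.83146961230255,
        0.70710678118655, 0.55557023301960, 0.38268343236509,
        0.19509032201613, 0.0
    };
    m %= 32;
    if (m <= 8)  return c[m];
    if (m <= 16) return -c[16 - m];
    if (m <= 24) return -c[m - 16];
    return c[32 - m];
}

/* 8-point 1-D IDCT, direct form:
 * out[x] = 1/2 sum_u C(u) in[u] cos((2x+1)u pi/16), C(0) = 1/sqrt(2), else 1. */
void idct1d(const double in[8], double out[8])
{
    for (int x = 0; x < 8; x++) {
        double sum = 0.0;
        for (int u = 0; u < 8; u++) {
            double cu = (u == 0) ? cos16(4) : 1.0;
            sum += cu * in[u] * cos16((2 * x + 1) * u);
        }
        out[x] = 0.5 * sum;
    }
}

/* 2-D 8x8 IDCT as 16 1-D IDCTs: 8 along the rows, then 8 along the columns. */
void idct2d_separable(double F[8][8], double f[8][8])
{
    double tmp[8][8], col_in[8], col_out[8];
    for (int r = 0; r < 8; r++)
        idct1d(F[r], tmp[r]);               /* row pass */
    for (int c = 0; c < 8; c++) {           /* column pass */
        for (int r = 0; r < 8; r++)
            col_in[r] = tmp[r][c];
        idct1d(col_in, col_out);
        for (int r = 0; r < 8; r++)
            f[r][c] = col_out[r];
    }
}
```

Truncated to integers, `idct1d` gives all 12s for the input 36 0 0 0 0 0 0 0 and 46 21 6 9 9 6 21 46 for 60 0 40 0 20 0 0 0, the same values the FPGA test program printed for those inputs.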

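Fast 1-D 8-Point IDCT

Coefficients: $C_i = \frac{1}{2\cos(i\pi/16)}$

[Figure: fast 1-D IDCT flow graph built from butterflies (inputs x and y, outputs x+y and x-y) and coefficient multipliers such as C4]

Parhi, VLSI DSP Systems, pg. 285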

Coefficient Generation

MATLAB was used to generate CSD (canonic signed digit) fixed-point representations of the coefficients for the IDCT multiplies.

[Figure: C2*X computed as X>>1 + X>>5 + X>>7 + X>>9, summed with carry-save adders]
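The shift-and-add form can be checked in software. A minimal sketch, assuming $C_i = 1/(2\cos(i\pi/16))$ so that $C_2 \approx 0.5412$, and non-negative inputs; `mul_c2` is an illustrative name. The four shifts give $2^{-1} + 2^{-5} + 2^{-7} + 2^{-9} = 0.541015625$, within about 0.0002 of the exact coefficient.

```c
/* Multiply x (>= 0) by C2 ~= 0.5412 using only shifts and adds:
 * 2^-1 + 2^-5 + 2^-7 + 2^-9 = 0.541015625.
 * Each shifted copy of x corresponds to one adder input in hardware. */
int mul_c2(int x)
{
    return (x >> 1) + (x >> 5) + (x >> 7) + (x >> 9);
}
```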

2D IDCT Block Diagram

[Figure: 2D IDCT block diagram. UV array (8x8 x 8-bit) written 4 bytes at a time; pad to 16 bit and swap; Tmp array (8x8 x 16-bit); bypass with IDCT result; pipeline register; 8 pipelined 2-cycle 1-D IDCTs; quant/sat from 16b to 8b; XY array (8x8 x 8-bit) written as an 8-entry row or column and read back 4 bytes at a time; control unit/sequencer with WrAddr, R/W, AS, HOR_DONE[U], VERT_DONE[Y] and bypass control signals]

Verilog Simulation

FPGA Synthesis

On the first pass, place and route would not finish: a max path of >80 ns caused a timing violation.

The IDCT and array reads were pipelined, and the max frequency was then estimated at 42 MHz.

System Verification of FPGA

FPGA 2D IDCT memory mapped for access
– UV -> 0xd30000100
– XY -> 0xd30000140
– Simple C test code written with the UV and XY arrays assigned to pointers returned from mmap()

/tmp # ./example.exe 0 120 110 -20 32 -24 0 0 0
The idct of 120 110 -20 32 -24 0 0 0 is 91 89 69 44 41 39 4 -43
/tmp # ./example.exe 1 36 0 0 0 0 0 0 0
The idct of 36 0 0 0 0 0 0 0 is 12 12 12 12 12 12 12 12
/tmp # ./example.exe 2 60 0 40 0 20 0 0 0
The idct of 60 0 40 0 20 0 0 0 is 46 21 6 9 9 6 21 46
/tmp # ./example.exe 3 80 70 60 50 40 30 20 10
The idct of 80 70 60 50 40 30 20 10 is 138 23 13 11 10 9 9 9
/tmp # ./example.exe 4 40 20 0 0 0 0 0 0
The idct of 40 20 0 0 0 0 0 0 is 23 22 19 15 12 8 5 4
/tmp # ./example.exe 5 50 0 30 0 0 0 0 0
The idct of 50 0 30 0 0 0 0 0 is 31 23 11 3 3 11 23 31
/tmp # ./example.exe 6 60 0 0 40 0 0 0 0
The idct of 60 0 0 40 0 0 0 0 is 37 17 1 11 31 40 25 4
/tmp # ./example.exe 7 70 0 0 0 50 0 0 0
The idct of 70 0 0 0 50 0 0 0 is 41 7 7 42 42 7 7 42
/tmp # ./example.exe 145 75 50 43 44 41 35 32 0 9 14 0 0 0 0 0 0 24 17 29 39 31 10 0 0 27 27 0 0 6 0 0 52 25 29 33 27 6 0 0 76 30 16 0 6 25 2 0 50 30 21 16 6 0 1 13 0 29 18 11 13 13 0 0

FPGA IDCT Performance Metrics

2D IDCT at 33 MHz, 16 clocks per 8x8 block: about 480 ns per block, or about 2M blocks/second.

16 word writes and 16 word reads are needed to send/receive 1 block to/from the FPGA over a bus with an effective rate of 4M transfers/s.

Only a 125k 8x8 blocks/s transfer rate is possible.

Bus transfer is limiting performance.
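The two limits on this slide as a small sketch; `compute_bound` and `bus_bound` are illustrative helpers, and the inputs are the slide's own figures.

```c
/* Back-of-envelope FPGA IDCT throughput, in 8x8 blocks per second. */

/* Compute bound: one block every clocks_per_block cycles. */
long compute_bound(long clock_hz, long clocks_per_block)
{
    return clock_hz / clocks_per_block;
}

/* Bus bound: each block needs words_per_block 32-bit transfers
 * (16 writes for 64 input bytes + 16 reads for 64 output bytes). */
long bus_bound(long transfers_per_s, long words_per_block)
{
    return transfers_per_s / words_per_block;
}
```

With 33 MHz and 16 clocks per block the IDCT core could handle about 2.06 million blocks/s, while 32 word transfers per block on a 4M transfers/s bus cap it at 125,000 blocks/s, more than 16 times fewer.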

2-D IDCT FPGA Issues

Incorrect saturation levels on the output result.

Input block data was also saturated often, so a scaling needs to be applied.

Bandwidth is insufficient to realize a speedup.

More of the decode needs to be moved to the FPGA to overcome the bandwidth penalty (Huffman decode, inverse quantization, YUV->RGB).

Transferring decompressed block data eats up too much bandwidth.

Results

IDCT algorithm comparison (average time needed to read in and decompress the file and be ready to display; display time not included):
– Dumb software algorithm: 4.26 seconds
– Wang-Chen algorithm: 0.73 seconds
– FPGA IDCT algorithm: 2.17 seconds

The bus is most likely what limits the performance of the FPGA IDCT.

Even at our best rate, MPEG would have been impossible without significant performance tweaks.
– Best frames per second: 1.36

Images from SW IDCT and HW IDCT

There are some issues with the HW IDCT, which we think have to do with not having high enough bit precision on the input. An interesting note is that if the Cb and Cr values are 0, the end result is green, with various shades based on the luminance: Cb = Cr = 0 is 128 below neutral chroma, which pulls red and blue down and pushes green up.

JPEG IDCT progression (SW IDCT)

DC only

DC and 1 AC coefficient

DC and 4 AC coefficients

DC and 10 AC coefficients

DC and 32 AC coefficients

SW IDCT is the very slow but accurate version of the IDCT calculation, performed in software on the iMX21.

DC and 63 AC coefficients

Comparison versus Adobe Photoshop JPEG decompression

SW IDCT versus Photoshop decompression

SW IDCT versus Photoshop decompression (high brightness and contrast)

Wang-Chen versus Photoshop decompression (high brightness and contrast)

Wang-Chen IDCT versus Photoshop decompression

Conclusions/Summary

We were able to decode JPEG files and display them on an LCD at 16 bpp using software IDCT methods.

The FPGA IDCT appears to be working to some degree and may need precision modifications to work correctly.

When run on an x86 system, the code was able to decompress a 3000x3000 pixel image and write it to a file.
