Fast JPEG Coding on the GPU - GPU Technology...


Page 1: Fast JPEG Coding on the GPU - GPU Technology Conferenceon-demand.gputechconf.com/gtc/2012/presentations/S0273... · 2014-04-14 · JPEG Decoding No good parallel algorithm is known

Fast JPEG Coding on the GPU

Fyodor Serzhenko, Fastvideo, Dubna, Russia

Victor Podlozhnyuk, NVIDIA, Santa Clara, CA

© Fastvideo, 2011

Key Points

• We implemented the fastest JPEG codec

• Many applications using JPEG can benefit from our codec

High Speed Imaging

Data Path for High Speed Camera (500 – 1000 fps)

Camera data rate from 600 MB/s to 2400 MB/s.

Camera → External cables → PCI-E frame grabber → Host → Storage

Problem: how to record 1 hour or more?

Possible Solutions

RAID, SSD, online compression on FPGA / DSP / CPU / GPU

The fastest solution: JPEG compression on GPU
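To put the recording problem in numbers, the following sketch estimates the storage needed for one hour of camera output at the quoted data rates, with and without compression. It is a back-of-the-envelope calculation only: the function name is hypothetical, MB and TB are assumed to be decimal units (10^6 and 10^12 bytes), and the 10x and 20x ratios are the compression range quoted for JPEG elsewhere in this talk.

```python
def storage_tb(rate_mb_per_s, seconds, compression_ratio=1.0):
    """Storage in TB (10**12 bytes) for a stream given in MB/s (10**6 bytes/s)."""
    total_bytes = rate_mb_per_s * 1e6 * seconds / compression_ratio
    return total_bytes / 1e12

ONE_HOUR = 3600  # seconds

for rate in (600, 2400):          # camera data rates from the slide, MB/s
    for ratio in (1, 10, 20):     # 1 = uncompressed; 10x-20x = JPEG range
        tb = storage_tb(rate, ONE_HOUR, ratio)
        print(f"{rate} MB/s, ratio {ratio}x: {tb:.3f} TB per hour")
```

Under these assumptions an uncompressed hour takes roughly 2 to 9 TB, which is why online compression is attractive.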

Why JPEG

• Popular open compression standard

• Good image quality at 10x-20x compression ratio

• Moderate computational complexity

Main Stages of Baseline JPEG Algorithm

• Source Image

• Upload

• RGB→YUV Transform

• Image split to blocks 8x8

• 2D DCT

• Quantization

• Zig-Zag

• RLE + DPCM

• Huffman

• Bitstream

• Download
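The middle stages of this pipeline (2D DCT, quantization, zig-zag) can be illustrated for a single 8x8 block. The sketch below is a plain, unoptimized CPU reference written for clarity and is not the FVJPEG implementation, which the slides do not show. The function names and the flat quantization table are assumptions made for the example; a real encoder would use the standard or custom tables and a fast DCT.

```python
import math

N = 8  # JPEG block size

def dct_2d(block):
    """Direct (unoptimized) 2D DCT-II of an 8x8 block."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out

def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round to an integer."""
    return [[round(coeffs[i][j] / table[i][j]) for j in range(N)]
            for i in range(N)]

def zigzag_order():
    """(row, col) pairs in JPEG zig-zag scan order."""
    order = []
    for s in range(2 * N - 1):
        diag = [(i, s - i) for i in range(N) if 0 <= s - i < N]
        # Even anti-diagonals run bottom-left to top-right, odd ones the reverse.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order

def encode_block(pixels, table):
    """Level shift, DCT, quantization and zig-zag scan for one 8x8 block."""
    shifted = [[p - 128 for p in row] for row in pixels]
    q = quantize(dct_2d(shifted), table)
    return [q[i][j] for i, j in zigzag_order()]

# Hypothetical flat table, used only to keep the example short.
FLAT_TABLE = [[16] * N for _ in range(N)]
flat_block = [[136] * N for _ in range(N)]
print(encode_block(flat_block, FLAT_TABLE))
```

A uniform block produces a single non-zero DC coefficient followed by zeros, which is the kind of sparsity the later RLE and Huffman stages exploit.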

JPEG Codecs: GPU vs. CPU

Performance summary for the fastest JPEG codecs

JPEG Codec (Q=50%, CR=13)       Encode, MB/s    Decode, MB/s
Fastvideo FVJPEG + GTX 680      5200            4500
Fastvideo FVJPEG + GTX 580      3500            3500
Intel IPP-7.0 + Core i7 3770    680             850
Intel IPP-7.0 + Core i7 920     430             600
Vision Experts VXJPG 1.4 (*)    500             --
Accusoft PICTools Photo (*)     250             380

(*) - as reported by manufacturer

Best JPEG encoder IP Cores

JPEG IP Core                Encode, MB/s
Cast Inc. JPEG-E            750
Alma-Tech SVE-JPEG-E        500
Visengi JPEG Encoder        405

Results as reported by manufacturer

JPEG Encoding Rates for GPU & CPU

[Figure: JPEG compression throughput (MB/s, axis from 0 to 7000) versus quality level (0% to 100%). Series: GTX 680 + FVJPEG, GTX 580 + FVJPEG, GT 555M + FVJPEG, GT 240 + FVJPEG, Core i7 3770 + IPP-7, Core i7 920 + IPP-7.]

JPEG Encoding Time for GeForce 580

[Figure: Time for JPEG compression stages (ms, axis from 0 to 10) versus quality level (10%, 25%, 50%, 75%, 95%, 100%). Stages: Host-to-Device, DCT/Quant/Zig, RLE+DPCM, Huffman, Device-to-Host.]

JPEG Encoding Time for GeForce 680

[Figure: Time for JPEG compression stages (ms, axis from 0 to 10) versus quality level (10%, 25%, 50%, 75%, 95%, 100%). Stages: Host-to-Device, DCT/Quant/Zig, RLE+DPCM, Huffman, Device-to-Host.]

DCT and Entropy Encoding (GeForce 580)

[Figure: Throughput for JPEG encoding stages (GB/s, axis from 0 to 50) versus quality level (0% to 100%). Series: DCT, RLE+DPCM, Huffman.]
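For readers unfamiliar with the RLE+DPCM stage, here is a minimal sketch of what baseline JPEG does before Huffman coding: DC coefficients are coded as differences from the previous block (DPCM), and AC coefficients are coded as (zero run, value) pairs with the special ZRL and EOB symbols. The function names are hypothetical and the code is a sequential CPU illustration, not the GPU kernels measured in this talk.

```python
def dpcm_dc(dc_values):
    """Differences between consecutive DC coefficients (predictor starts at 0)."""
    prev = 0
    diffs = []
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs

def rle_ac(ac_coeffs):
    """Run-length code the 63 zig-zag ordered AC coefficients of one block.

    Returns (run_of_zeros, value) pairs. (15, 0) is ZRL, standing for
    16 zeros; (0, 0) is EOB, meaning the rest of the block is zero.
    """
    pairs = []
    run = 0
    for v in ac_coeffs:
        if v == 0:
            run += 1
            continue
        while run > 15:          # runs longer than 15 need ZRL symbols
            pairs.append((15, 0))
            run -= 16
        pairs.append((run, v))
        run = 0
    if run > 0:                  # trailing zeros collapse into one EOB
        pairs.append((0, 0))
    return pairs

print(dpcm_dc([40, 42, 39]))
print(rle_ac([5, 0, 0, -3] + [0] * 59))
```

The DC differences depend on the previous block, which is one reason the restart markers discussed for decoding matter: they reset the predictor.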

JPEG Decoding

• No good parallel algorithm is known for Huffman decoding

• Restart markers are a standard feature supported by all decoders

• Fully parallel JPEG decoding is still possible when restart markers are used

• Currently supported restart intervals: 0, 1, 2, 4, 8, 16, 32
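Restart markers help because the entropy-coded data between two markers can be decoded without knowledge of what came before: the bit stream is byte-aligned at each marker and the DC predictor is reset. The sketch below shows one way to split scan data into such independent segments. It assumes the input holds only entropy-coded data with stuffed bytes (0xFF 0x00) and restart markers (0xFF 0xD0 to 0xFF 0xD7); the function name is hypothetical and the slides do not describe how FVJPEG does this.

```python
def split_at_restart_markers(scan_data: bytes):
    """Split entropy-coded scan data into segments delimited by RSTn markers."""
    segments = []
    start = 0
    i = 0
    while i < len(scan_data) - 1:
        if scan_data[i] != 0xFF:
            i += 1
            continue
        nxt = scan_data[i + 1]
        if 0xD0 <= nxt <= 0xD7:       # RST0..RST7: close the current segment
            segments.append(scan_data[start:i])
            start = i + 2
            i += 2
        elif nxt == 0xFF:             # fill byte: re-examine the next 0xFF
            i += 1
        else:                         # stuffed 0xFF 0x00 pair, part of the data
            i += 2
    segments.append(scan_data[start:])
    return segments

data = b"\x12\x34\xff\x00\x56\xff\xd0\x78\x9a\xff\xd1\xbc"
for seg in split_at_restart_markers(data):
    print(seg.hex())
```

Each returned segment could then be handed to a separate worker (for example, one GPU thread per segment) for Huffman decoding.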

JPEG Decoding Rates for GPU & CPU

[Figure: JPEG decompression throughput (MB/s, axis from 0 to 7000) versus quality level (0% to 100%). Series: GTX 680 + FVJPEG, GTX 580 + FVJPEG, GT 555M + FVJPEG, GT 240 + FVJPEG, Core i7 3770 + IPP-7, Core i7 920 + IPP-7.]

JPEG Decoding Time for GeForce GTX 580

[Figure: Time for JPEG decompression stages (ms, axis from 0 to 9) versus quality level (10%, 25%, 50%, 75%, 95%, 100%). Stages: Host-to-Device, Huffman, RLE+DPCM, IDCT/Quant/Zig, Device-to-Host.]

JPEG Decoding Time for GeForce GTX 680

[Figure: Time for JPEG decompression stages (ms, axis from 0 to 9) versus quality level (10%, 25%, 50%, 75%, 95%, 100%). Stages: Host-to-Device, Huffman, RLE+DPCM, IDCT/Quant/Zig, Device-to-Host.]

Getting More Speed-up

• GPUs with PCI-Express 3.0 interface

• Concurrent copy and execution

• Multi-GPU computing
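The benefit of concurrent copy and execution can be estimated with a simple pipeline model. When upload, kernel execution and download run strictly one after another, each frame costs the sum of the stage times; when they overlap across consecutive frames, steady-state cost per frame approaches the slowest stage. The sketch below assumes ideal overlap (separate copy engines for each direction and no contention), and the stage times in the example are hypothetical, not measurements from this talk.

```python
def serial_time_ms(stages_ms, n_frames):
    """Total time when every frame runs all stages back to back."""
    return n_frames * sum(stages_ms)

def pipelined_time_ms(stages_ms, n_frames):
    """Total time with ideal overlap of stages across consecutive frames.

    The first frame passes through every stage; each later frame
    completes one bottleneck-stage interval after the previous one.
    """
    return sum(stages_ms) + (n_frames - 1) * max(stages_ms)

# Hypothetical per-frame times: host-to-device, GPU kernels, device-to-host.
stages = (2.0, 3.0, 1.0)
frames = 100
print(serial_time_ms(stages, frames), "ms without overlap")
print(pipelined_time_ms(stages, frames), "ms with ideal overlap")
```

In this model the gain is largest when transfer and compute times are similar, which is also why a faster PCI-Express link matters once the kernels themselves are fast.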

Applications to 3D rendering

• Modern 3D applications are working with increasingly high-resolution data sets

• JPEG is a standard color map storage format

• Decoding JPEG on the CPU has major drawbacks

• CPU-based decoding can be unacceptably slow even with partial GPU acceleration

• Transferring the raw decoded image or intermediate decoding results over PCI-Express is much more expensive than transferring the compressed JPEG data

• JPEG decoding on the GPU is a perfect solution to both problems

Applications to JPEG Imaging for Web

• Server-side image scaling to fit client devices.

• Thumbnail generation for big image databases.

Problem: how to cope with hundreds of millions of images per day?

Method outline

• Get images from the database and load them to Host

• Image Decompression → Resize → Compression

• Store final images to the database or send them to users

Conclusion

• Fast image coding on the GPU is a reality

• Modern GPUs are capable of running many non-floating-point algorithms efficiently

Future Work

• SDK for FVJPEG codec for Windows / Linux

• Optimized JPEG, MJPEG, JPEG2000

• Multi-GPU computing

• Custom software design

Questions?

• Contacts: [email protected]

• More info at www.fastcompression.com

PCs & Laptop for testing

• ASUS P6T Deluxe V2 LGA1366, X58, Core i7 920, 2.67 GHz, DDR-III 6 GB, GPU GeForce GTX 580 or GeForce GT 240

• ASUS P8Z77-PRO, Z77, Core i7 3770, 3.4 GHz, DDR-III 8 GB, GPU GeForce GTX 680 (cc = 3.0, 1536 cores)

• OS Windows-7, 64-bit, CUDA 4.1, driver 296.10

Laptop

• ASUS N55S, Core i5 2430M, DDR-III 6 GB

• GeForce GT 555M (cc = 2.1, 144 cores)

• OS Windows-7, 64-bit, CUDA 4.1, driver 296.10

Baseline JPEG parameters for test

• 8-bit grayscale images

• Compression quality from 10% to 100%

• Default static quantization and Huffman tables

• Test image: 7216 x 5408, 8-bit, CR = 12.8

• 8-thread encode/decode option for CPU

Conclusion: These parameters define the same calculation procedures for CPU & GPU.
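The test image parameters can be related to the throughput figures with a short calculation. The sketch below derives the raw and compressed sizes of the test image and the per-image time implied by a given throughput. It assumes that throughput is measured against uncompressed data and that MB means 10^6 bytes, neither of which the slides state; the 5200 MB/s and 680 MB/s values are the encode rates listed for the GTX 680 and the Core i7 3770 in the codec comparison.

```python
WIDTH, HEIGHT = 7216, 5408   # test image dimensions from the slide
BYTES_PER_PIXEL = 1          # 8-bit grayscale
COMPRESSION_RATIO = 12.8     # CR quoted for the test image

raw_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
compressed_bytes = raw_bytes / COMPRESSION_RATIO

def frame_time_ms(throughput_mb_per_s):
    """Time to process one test image at the given throughput (MB = 10**6 bytes)."""
    return raw_bytes / (throughput_mb_per_s * 1e6) * 1e3

print(f"raw size: {raw_bytes / 1e6:.1f} MB")
print(f"compressed size: {compressed_bytes / 1e6:.2f} MB")
for name, rate in (("GTX 680", 5200), ("Core i7 3770", 680)):
    print(f"{name}: {frame_time_ms(rate):.1f} ms per image")
```

Under these assumptions the GPU result lands in the single-digit millisecond range used on the time axes of the stage timing charts.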

Test image