
A Low-Cost Implementation of MJPEG Encoder on TI TMS320DM642 using On-Chip Memory Only

Noha A. El-Yamany, Southern Methodist University (SMU), Dallas, TX

Raj Pawate and Cheng Peng, Texas Instruments Inc., Stafford, TX

Draft #1

Problems are opportunities in disguise!

Agenda

Objectives, requirements and constraints

Quick overview on MJPEG

The proposed solution

Encoder profile

Constraints and future work

Objectives

TI is introducing solutions for the video security market, and low-cost (LC) IP network cameras (Netcams) have large sales potential.

Success of the on-chip MJPEG encoder project will:

1. enable the LC IP Netcam on the TI TMS320DM642, and

2. drive similar implementations on the DM64LC device derivatives.

[Figure: Low-cost solution. TMS320DM642 + SDRAM ($38.55♣ + $Y) versus TMS320DM642 ONLY ($38.55♣).]

♣ Courtesy of Texas Instruments Inc.

Requirements and Constraints

Requirements:

1. No external memory in the encoding process

2. Minimum MHz and memory consumption

→ Headroom for TCP/IP (or UDP) and minimal applications to run.

Constraints:

1. On-chip memory is 256 KB only

2. D1 resolution (4:2:2) @ 30 fps must be encoded within the real-time constraint

[Figure labels: JPEG Standard Features; Memory Constraints; DM642™ Capabilities; "?"; JPEG Bit Stream]

Quick Overview on MJPEG Encoding

Motion JPEG (MJPEG): Informal name for a multimedia format in which each video frame of a digital video sequence is separately compressed as a JPEG image.

JPEG Encoder:

[Figure: JPEG encoder pipeline. 8x8 Block Decomposition → FDCT → Quantization → Entropy Coding. Other labels: Quality Factor, Quant. & RLE, DC Encode, AC VLC, Byte Stuffing.]
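Byte stuffing, the last stage named in the encoder diagram, can be illustrated in a few lines. In a JPEG entropy-coded segment, every 0xFF data byte is followed by a stuffed 0x00 so that it cannot be mistaken for a marker. The function name `stuff_bytes` is hypothetical and this is only a behavioural sketch, not the encoder's actual routine.

```python
def stuff_bytes(entropy_coded: bytes) -> bytes:
    """Insert a 0x00 after every 0xFF byte of entropy-coded data."""
    out = bytearray()
    for b in entropy_coded:
        out.append(b)
        if b == 0xFF:
            out.append(0x00)  # stuffed byte keeps 0xFF from looking like a marker
    return bytes(out)


# Hypothetical three-byte fragment of a strip bit stream.
stuffed = stuff_bytes(b"\x12\xff\x34")
print(stuffed.hex())
```

Stuffing makes the bit stream slightly longer than the raw entropy-coded data, which is one reason a bit stream buffer is sized with some margin.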

[Figure: A High-Level Block Diagram of the On-Chip MJPEG Encoder. An NTSC camera supplies ITU-BT.656 YUV 4:2:2 video at D1 resolution to the TMS320DM642 (600 MHz), where JPEG ENCODE produces the JPEG bit stream (on-chip memory). A second path runs the JPEG bit stream through JPEG DECODE and sends ITU-BT.656 YUV 4:2:2 D1 video to a monitor.]

Dashed path is for JPEG syntax compliance verification.

[Figure: A Functional Block Diagram of the On-Chip MJPEG Encoder. Blocks: NTSC Camera; Video Port (VP0) Channel A with Y, Cr and Cb FIFOs; EDMA Controller; L2 Memory (SRAM) holding INBUFF [Strip Buffer], INTERBUFF [Intermediate Data Buffer], BITSTREAM [Bit Stream Buffer], JPEG Code, and JPEG Tables, Data & Scratch Memory; L1P; L1D; C64x DSP Core.]

1. Configuration of the Video Capture Port

• VP0 is a 20-bit video capture/display port (two channels, A and B)

• Each channel has a 2560-byte FIFO

• 8-bit ITU BT.656 capture mode is selected (channel A only)

• No color resampling (4:2:2)

[Figure: VP0 (Capture FIFO A). VDIN[9-0] feeds Y Buffer A (1280 bytes), Cb Buffer A (640 bytes) and Cr Buffer A (640 bytes); each buffer is labeled 8 and 64, and the buffers are read through YSRCA, CbSRCA and CrSRCA respectively.]

Sub-Frame Capture (and Encoding):

Because of the memory constraint (on-chip memory only), transferring a full frame to the internal memory is not feasible.

Sub-frame capture and encoding is a viable option: an 8-line strip is captured and encoded at a time.

Why 8 lines?

It is the minimum number of lines needed for JPEG encoding of 4:2:2 data, which keeps the data buffers in the L2 SRAM as small as possible.

[Figure: A strip of 8 lines within the 480-line frame.]
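A quick check of why a full frame cannot be buffered on-chip, as a minimal sketch. It assumes 720 × 480 active pixels with 8-bit samples, so that 4:2:2 data averages 2 bytes per pixel; the variable names are illustrative only.

```python
KB = 1024

WIDTH, HEIGHT = 720, 480      # D1 active picture size assumed here
BYTES_PER_PIXEL = 2           # 4:2:2, 8-bit: one Y plus one shared chroma sample per pixel
L2_BYTES = 256 * KB           # total on-chip L2 memory
STRIP_LINES = 8

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
frame_kb = frame_bytes / KB
strips_per_frame = HEIGHT // STRIP_LINES

print(f"frame size: {frame_kb} KB, L2 size: {L2_BYTES // KB} KB")
print(f"strips per frame: {strips_per_frame}")
```

The uncompressed frame is 675 KB, more than 2.5 times the whole L2, while splitting it into 8-line strips gives 60 strips per frame.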

Data Transfer from VP0 to the L2 SRAM (INBUFF):

EDMA events are generated on a line basis (because of the FIFO sizes).

The transfer size is set to the buffer threshold:

The Y buffer threshold is set to 720 bytes = 90 double words.

The Cr and Cb buffer thresholds are set to 360 bytes = 45 double words.

[Figure: VP0 (Capture FIFO A) with Y Buffer A (1280 bytes), Cb Buffer A (640 bytes) and Cr Buffer A (640 bytes).]
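The threshold values can be cross-checked against the FIFO sizes with a short sketch. It assumes a double word is 8 bytes (64 bits); the dictionary names are illustrative.

```python
DWORD_BYTES = 8  # one double word = 64 bits (assumption stated above)

# One video line per EDMA event: bytes per component.
threshold_bytes = {"Y": 720, "Cb": 360, "Cr": 360}
# Capture FIFO A sizes per component.
fifo_bytes = {"Y": 1280, "Cb": 640, "Cr": 640}

threshold_dwords = {name: n // DWORD_BYTES for name, n in threshold_bytes.items()}
fits_in_fifo = {name: threshold_bytes[name] <= fifo_bytes[name] for name in fifo_bytes}

print(threshold_dwords)
print(fits_in_fifo)
```

Each FIFO holds one full line of its component but not two, which is consistent with triggering the EDMA once per line.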

2. Configuration of the On-Chip Memory (L2)

The L2 memory can operate as SRAM, cache, or both.

Its total capacity is 256 KB (0x0000 0000 to 0x0003 FFFF).

After reset, the entire L2 is mapped as a 256 KB SRAM.

→ No need to configure it using the CSL function CACHE_setL2Mode( )

→ Code size is reduced by 1.875 KB (otherwise used for CACHE_wait, CACHE_clean, CACHE_setL2Mode & CACHE_wbInvL1d)

[Figure: L2 memory map, ALL SRAM: four 64 KB blocks starting at 0x0000 0000, 0x0001 0000, 0x0002 0000 and 0x0003 0000.]

1. INBUFF: 11.25 KB, to hold one strip of data: (720+360+360)×8 bytes = 11.25 KB

2. INTERBUFF: 22.5 KB, to hold the 16-bit precision strip

3. JPEG tables and scratch memory: 9.25 KB

4. JPEG code: 28 KB

5. BITSTREAM (frame bit stream buffer): 40 KB
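The two data buffer sizes in the list follow directly from the strip geometry. A minimal sketch of the arithmetic, with illustrative variable names:

```python
KB = 1024

Y_BYTES_PER_LINE = 720
CB_BYTES_PER_LINE = 360
CR_BYTES_PER_LINE = 360
STRIP_LINES = 8

# INBUFF holds one strip of 8-bit samples.
inbuff_bytes = (Y_BYTES_PER_LINE + CB_BYTES_PER_LINE + CR_BYTES_PER_LINE) * STRIP_LINES
# INTERBUFF holds the same strip with every sample widened to 16 bits.
interbuff_bytes = 2 * inbuff_bytes

print(f"INBUFF    = {inbuff_bytes / KB} KB")
print(f"INTERBUFF = {interbuff_bytes / KB} KB")
print(f"data buffers total = {(inbuff_bytes + interbuff_bytes) / KB} KB")
```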

The L2 holds the two data buffers (INBUFF and INTERBUFF), the bit stream buffer, and the JPEG code, tables, data and scratch memory (as shown in the figure).

[Figure: L2 Memory (SRAM) layout: INBUFF [Strip Buffer], INTERBUFF [Intermediate Data Buffer], BITSTREAM [Bit Stream Buffer], JPEG Code, and JPEG Tables, Data & Scratch Memory.]

[Figure: A strip of 8 lines (Line #1 data through Line #8 data, stored linearly) and its arrangement into 2D form (8×8 blocks). In the 2D form, the 8-pixel segments of lines 1, 2, ..., 8 that belong to one block are stored contiguously (11111111 22222222 ... 88888888), so consecutive segments of the same line are separated by an offset of 64 pixels.]

3. The Proposed Data Transfer and Encoding Scheme

[Figure: 8x8 Block Decomposition → FDCT]

JPEG encoding requires FDCT processing of incoming data arranged in 8x8 blocks of pixels (2D form).

Data captured from the VP is linearly arranged in the DSP memory.

Pixels should be expanded to 16-bit precision before the FDCT computation.

TI TMS320C6000™ JPEG encoders use a function, reformat_enc, to arrange the data in the 2D form and expand it to 16-bit precision:

Function            CPU Cycles/Strip    Code Size
reformat_enc        82,800♣             1.1 KB

♣ These numbers include the L1D and L1P overhead.

For D1 resolution @ 30 fps, and using the 600 MHz DM642, the strip transfer time (STT) is:

$$\mathrm{STT} \cong \frac{1}{30} \times \frac{1}{480} \times 8 \times 600 \times 10^{6} \cong 333{,}000 \ \text{CPU cycles}$$

$$\Rightarrow \mathrm{SET} < 333{,}000 \ \text{CPU cycles (for the system to be memory bound)}$$

[Figure: Timeline of Strip #1 and Strip #2 showing STT and SET.]

Implications:

The CPU cycle and cache overhead is large (82,800 CPU cycles/strip), and that assumes no other traffic delaying L1D servicing; it could be doubled!

→ Lower performance

There is a tight constraint on the strip encoding time (SET).

Two input data buffers are needed to relax the constraint on the processing time.

→ Larger memory requirement

Proposed Solution: Application-level optimization

Use the EDMA to simultaneously transfer the data from the VP into the input buffer (INBUFF) and arrange it in the 2D format.

Expand the data to 16 bits, from INBUFF to INTERBUFF, before the transfer of the new strip begins.

Pros:

1. No CPU cycles or cache/EDMA overhead involved in the 2D data arrangement

2. The 16-bit expansion happens in the NO TRAFFIC ZONE!

→ The time constraint on the 16-bit expansion is < 41,000 CPU cycles

→ Feasible, since the overhead is 4820♣ cycles/strip.

Function            IMG_pix_expand    reformat_enc    % Increase
CPU Cycles/Strip    4820♣             82,800♣         > 888 %
Code Size           128 Bytes         1.1 KB          ≈ 880 %

♣ These numbers include the L1D and L1P overhead.
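The cycle figures quoted for the two functions can be put against the per-strip budget. The sketch below uses only numbers from the slides (600 MHz, 30 fps, 480 lines, 8-line strips, 82,800 and 4820 cycles, the 41,000-cycle limit); the variable names are illustrative.

```python
CPU_HZ = 600_000_000
FPS = 30
LINES_PER_FRAME = 480
STRIP_LINES = 8

# CPU cycles available while one strip is being transferred.
stt_cycles = CPU_HZ * STRIP_LINES // (FPS * LINES_PER_FRAME)

REFORMAT_ENC_CYCLES = 82_800     # per strip
PIX_EXPAND_CYCLES = 4_820        # per strip
EXPANSION_LIMIT_CYCLES = 41_000  # no-traffic-zone constraint

share_reformat = REFORMAT_ENC_CYCLES / stt_cycles
share_expand = PIX_EXPAND_CYCLES / stt_cycles
cycle_ratio = REFORMAT_ENC_CYCLES / PIX_EXPAND_CYCLES
code_ratio = (1.1 * 1024) / 128  # 1.1 KB versus 128 bytes

print(f"STT = {stt_cycles} cycles")
print(f"reformat_enc uses {share_reformat:.1%} of a strip period")
print(f"IMG_pix_expand uses {share_expand:.1%} of a strip period")
print(f"cycle ratio {cycle_ratio:.1f}x, code size ratio {code_ratio:.1f}x")
```

On these numbers reformat_enc alone would take roughly a quarter of each strip period, while the expansion step takes under 2 % and sits well inside the 41,000-cycle limit.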

The Proposed Data Transfer and Encoding Scheme

[Figure: Timing diagram for Strip #1, Strip #2 and Strip #3, marked with STT, SET and STP. Activities shown: Transfer of the 2D Strip from VP0 to L2 (INBUFF); 16-Bit Expansion of the 2D Strip into INTERBUFF (No traffic); Strip Encoding; Transfer of Bit stream to Bit Stream Buffers.]

The Proposed Buffering Scheme

[1] EDMA transfer of the first 2D strip

[2] Half-word expansion

[3] CPU encodes 2D strip

[4] CPU stores strip bit stream

[5] EDMA transfer of the second 2D strip

♣ Steps [3] and [4] occur in parallel with step [5]

[Figure: Buffers involved: INBUFF (11.25 KB), INTERBUFF (22.5 KB), OUTBUFF.]
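Step [2], the half-word expansion, can be modelled in a few lines. This is only a behavioural sketch of what a routine such as IMG_pix_expand does, assuming unsigned 8-bit samples are zero-extended to 16 bits; the Python function name `pix_expand` and the sample values are hypothetical.

```python
from array import array


def pix_expand(samples: bytes) -> array:
    """Zero-extend unsigned 8-bit samples to 16-bit values, one per input byte."""
    return array("h", list(samples))


inbuff = bytes([0, 16, 128, 255] * 2)   # eight hypothetical 8-bit samples
interbuff = pix_expand(inbuff)

print(list(interbuff))
print(len(inbuff), "bytes ->", len(interbuff) * interbuff.itemsize, "bytes")
```

Because the output is twice the size of the input, the expansion needs its own destination buffer, which frees INBUFF for the next strip transfer while the CPU encodes from INTERBUFF.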

EDMA Channels Configuration for Data Transfer & 2D Arrangement

[Figure: EDMA Transfer #1 into the L2 memory. Line #1 data stored in the VP FIFO is written to the input data buffer (INBUFF) as 8-pixel segments placed at an offset of 64.]

[Figure: EDMA Transfer #2 into the L2 memory. Line #2 data stored in the VP FIFO is written to INBUFF so that each of its 8-pixel segments directly follows the corresponding line #1 segment, again at an offset of 64.]

The linking feature of the EDMA is exploited to achieve the desired 2D data transfer and arrangement.

Using the EDMA takes more programming effort, but it is worth it!
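The destination addressing that the EDMA transfers achieve can be modelled in software. The sketch below is not an EDMA configuration; it only reproduces the resulting layout for the luma component, assuming 720 pixels per line and 8-line strips. The names `dest_index` and `arrange_strip` are hypothetical.

```python
BLOCK = 8          # a JPEG block is 8x8 pixels
LINE_PIXELS = 720  # luma pixels per D1 line
STRIP_LINES = 8    # lines per strip


def dest_index(line: int, pixel: int) -> int:
    """Index in INBUFF for pixel `pixel` of strip line `line` (both 0-based)."""
    block, col = divmod(pixel, BLOCK)
    return block * BLOCK * BLOCK + line * BLOCK + col


def arrange_strip(lines):
    """Place linearly stored lines into block order, one line at a time."""
    inbuff = [None] * (LINE_PIXELS * STRIP_LINES)
    for line_no, line in enumerate(lines):      # one "transfer" per video line
        for pixel, value in enumerate(line):
            inbuff[dest_index(line_no, pixel)] = value
    return inbuff


# Mark every pixel with its line number (1..8), as in the slide figure.
strip = [[line_no + 1] * LINE_PIXELS for line_no in range(STRIP_LINES)]
inbuff = arrange_strip(strip)

print(inbuff[0:8], inbuff[8:16], inbuff[56:64])   # rows 1, 2 and 8 of block 0
print(dest_index(0, 8) - dest_index(0, 0))        # offset between segments of one line
```

Each 64-entry run of the result is one complete 8×8 block, so the encoder can read blocks sequentially without any CPU-side rearrangement.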

4. The Data Memory (L1D) and System Overhead

The L1D is a 16 KB (2-way set associative) cache with a 64-byte line size.

The L1D miss penalty is up to 6 CPU cycles.

JPEG encoding requires 8×8 block processing at 16-bit precision

⇒ an 8×8 block ≡ 128 bytes

To reduce L1D cache overhead, the number of blocks to process (N) at each JPEG iteration must be limited to the size of the L1D,

i.e., N ≤ 16×1024/128 = 128 blocks

For sub-frame capture and encoding, a total of 180 blocks need to be processed per strip.

To avoid L1D overhead, each component is encoded separately

→ procedural-level optimization

[Figure: An 8-line strip (180 blocks) encoded in three iterations (Iteration #1, #2, #3), one component per iteration: Y Component, 90 blocks (11.25 KB); U Component, 45 blocks (5.625 KB); V Component, 45 blocks (5.625 KB).]

Compulsory cache misses before each iteration can't be avoided.

Data sections are arranged to further reduce overhead.
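The block counts behind the per-component iterations can be checked with a short sketch, using the strip geometry from the slides (720 luma and 360 chroma samples per line, 8-line strips); the names are illustrative.

```python
L1D_BYTES = 16 * 1024
BLOCK_BYTES = 8 * 8 * 2          # 8x8 samples at 16-bit precision

max_blocks = L1D_BYTES // BLOCK_BYTES

# Blocks per 8-line strip for each component (samples per line / 8).
blocks = {"Y": 720 // 8, "U": 360 // 8, "V": 360 // 8}
total_blocks = sum(blocks.values())

fits_alone = {name: n <= max_blocks for name, n in blocks.items()}
kbytes = {name: n * BLOCK_BYTES / 1024 for name, n in blocks.items()}

print(f"L1D holds at most {max_blocks} blocks; a strip has {total_blocks}")
print(fits_alone)
print(kbytes)
```

The whole strip (180 blocks) exceeds the 128-block limit, but each component on its own stays below it, which is why one component per iteration keeps the working set inside the L1D.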

5. The Program Memory (L1P) and System Overhead

The L1P is a 16 KB direct-mapped cache with a 32-byte line size.

The L1P read miss penalty is up to 8 CPU cycles.

It is important to limit the JPEG code to fit the L1P size, to avoid any cache overhead in the encoding loop (compulsory cache misses cannot be avoided).

→ procedural-level optimization

So far, the JPEG kernels fit into the L1P (only 10.5 KB).

Optimization is in progress.

6. The External Memory and System Overhead

No external memory in the encoding process

Cons:

Tight memory budget → Problems are solutions!

Pros:

☺ Lower cost

☺ No external memory access latency to account for

☺ No cache coherence issues need to be addressed

☺ EDMA scheduling is easy

Encoder Profile - D1 YUV 4:2:2 @ 30 fps

1. Memory Requirement

a. Data buffers: 33.75 KB

b. JPEG tables: 2.5 KB

c. JPEG scratch memory: 6.75 KB

d. Program memory: 28 KB (no optimization) → (JPEG kernels 10.5 KB)

e. Frame bit stream buffer: 40 KB

Total: 111 KB

⇒ Remaining budget = 145 KB

2. MHz Requirement

For D1 YUV 4:2:2 → Upper bound ≈ 6,000,000 CPU cycles/frame

The final estimate will be reported soon in the second draft.

[Figure: TMS320DM642™ ONLY ($38.55)]
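The profile numbers can be tallied as follows. This is a minimal sketch using only the figures listed above plus the 256 KB L2 size, the 600 MHz clock and the 30 fps rate; the CPU load it prints is derived from the stated upper bound, not a measured result.

```python
L2_KB = 256
CPU_HZ = 600_000_000
FPS = 30

memory_kb = {
    "data buffers": 33.75,
    "JPEG tables": 2.5,
    "JPEG scratch memory": 6.75,
    "program memory": 28.0,
    "frame bit stream buffer": 40.0,
}

total_kb = sum(memory_kb.values())
remaining_kb = L2_KB - total_kb

CYCLES_PER_FRAME_UPPER_BOUND = 6_000_000
cycles_per_second = CYCLES_PER_FRAME_UPPER_BOUND * FPS
cpu_load = cycles_per_second / CPU_HZ

print(f"memory used: {total_kb} KB, remaining: {remaining_kb} KB")
print(f"upper bound: {cycles_per_second / 1e6:.0f} MHz, {cpu_load:.0%} of the CPU")
```

At the stated upper bound the encoder would need about 180 MHz of the 600 MHz available, leaving the rest for networking and application code.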

Sample Results

[Figure: Original frame (675 KB) and compressed frame (33 KB) at Q = 50%.]

These artifacts are still under investigation.

Unique Features of the Proposed Implementation

On-chip implementation → low cost

Low MHz and memory requirements → headroom for other tasks

VP configuration for direct transfer to on-chip memory

2D data transfer and arrangement from the VP using the EDMA

→ Lower overhead and memory requirement

Constraints and Future Work

The road to success is always under construction.

- Anonymous

Investigation of the resulting artifacts

Further optimization (e.g. cache analysis, code optimization)

Benchmarks

Support for different video resolutions

→ Only D1 YUV 4:2:2 @ 30 fps, or less, is supported

→ D1 YUV 4:2:0 will require larger data buffers (strip ⇒ 16 lines)

Support for interleaved mode
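The remark that D1 YUV 4:2:0 needs larger data buffers can be illustrated with a rough estimate. The sketch assumes that 4:2:0 chroma is subsampled by two vertically as well as horizontally, so a 16-line luma strip carries 8 chroma lines per component, and that the same INBUFF/INTERBUFF scheme is kept. The function `strip_bytes` is hypothetical, and the result is an estimate rather than a figure from the slides.

```python
KB = 1024


def strip_bytes(luma_lines: int, chroma_lines: int, width: int = 720) -> int:
    """Bytes of 8-bit samples in one strip: luma plus two half-width chroma planes."""
    return luma_lines * width + 2 * chroma_lines * (width // 2)


strip_422 = strip_bytes(luma_lines=8, chroma_lines=8)    # current 8-line strip
strip_420 = strip_bytes(luma_lines=16, chroma_lines=8)   # assumed 16-line strip

for name, size in (("4:2:2", strip_422), ("4:2:0", strip_420)):
    inbuff_kb = size / KB
    interbuff_kb = 2 * size / KB                          # 16-bit copy of the strip
    print(f"{name}: INBUFF {inbuff_kb} KB, INTERBUFF {interbuff_kb} KB")
```

Under these assumptions the strip buffers grow by half, from 33.75 KB to about 50.6 KB in total, which still fits in the remaining on-chip budget reported in the encoder profile.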

A comparison between the proposed solution and the TI JPEG encoder on the TMS320DM642 (2003), D1 YUV 4:2:2

Criterion         Proposed Solution                          Current TI JPEG Encoder
Data Memory       43 KB (JPEG scratch, tables & buffers)     29 KB (JPEG scratch, tables & buffers)
                  40 KB bit stream                           675 KB frame buffer
                                                             40 KB bit stream
Program Memory    28 KB (no optimization)                    25 KB (optimized)
                  (JPEG kernels 10.5 KB)                     (JPEG kernels 9.5625 KB)
Total Memory      111 KB                                     769 KB

