A Low-Cost Implementation of MJPEG Encoder on TI TMS320DM642 using On-Chip Memory Only
Noha A. El-Yamany, Southern Methodist University (SMU), Dallas, TX
Raj Pawate and Cheng Peng, Texas Instruments Inc., Stafford, TX
Draft #1
Problems are opportunities in disguise!
Agenda
Objectives, requirements and constraints
Quick overview on MJPEG
The proposed solution
Encoder profile
Constraints and future work
Objectives
TI is introducing solutions for the video security market, and low-cost (LC) IP network cameras (Netcams) have large sales potential.
Success of the on-chip MJPEG encoder project will:
1. enable the LC IP Netcam on TI TMS320DM642, and
2. drive similar implementations on the DM64LC device derivatives
[Figure: the low-cost solution. TMS320DM642 + SDRAM ($38.55♣ + $Y) versus TMS320DM642 only ($38.55♣)]
♣ Courtesy of Texas Instruments Inc.
Requirements and Constraints
Requirements:
1. No external memory in the encoding process
2. Minimum MHz and memory consumption
→ Headroom for TCP/IP (or UDP) and minimal applications to run
Constraints:
1. On-chip memory is 256 KB only
2. D1 resolution (4:2:2) @ 30 fps must be encoded within the real-time constraint
[Figure: JPEG standard features, memory constraints, and DM642™ capabilities → ? → JPEG bit stream]
Quick Overview on MJPEG Encoding
Motion JPEG (MJPEG): informal name for a multimedia format in which each frame of a digital video sequence is separately compressed as a JPEG image.
JPEG Encoder:
[Figure: 8x8 Block Decomposition → FDCT → Quantization → Entropy Coding. Inputs and sub-stages labelled in the figure: Quality Factor, Quant. & RLE, DC Encode, AC VLC, Byte Stuffing.]
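The byte stuffing stage named in the encoder diagram can be illustrated with a minimal sketch. It relies on the standard JPEG rule that a 0xFF byte inside entropy-coded data is followed by a 0x00 so that it cannot be mistaken for a marker. The function name `stuff_bytes` is hypothetical and is not taken from the TI encoder.

```python
def stuff_bytes(entropy_coded: bytes) -> bytes:
    """Insert a 0x00 after every 0xFF in entropy-coded JPEG data.

    Hypothetical helper for illustration; not the TI implementation.
    """
    out = bytearray()
    for value in entropy_coded:
        out.append(value)
        if value == 0xFF:
            out.append(0x00)  # prevents the decoder from seeing a marker
    return bytes(out)


# Two 0xFF bytes in the input, so two 0x00 bytes are inserted.
stuffed = stuff_bytes(bytes([0x12, 0xFF, 0xD9, 0xFF]))
```

The stuffed stream is at most twice the input length, which is worth keeping in mind when sizing a bit stream buffer.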
[Figure: NTSC camera → ITU-BT.656, YUV 4:2:2, D1 resolution → TMS320DM642 (600 MHz) JPEG encode → JPEG bit stream (on-chip memory) → JPEG decode → ITU-BT.656, YUV 4:2:2, D1 resolution → monitor]
A High Level Block Diagram of the On-Chip MJPEG Encoder
Dashed path is for JPEG syntax compliance verification.
[Figure: NTSC camera → Video Port (VP0), Channel A (Y, Cr and Cb FIFOs) → EDMA controller → L2 memory (SRAM). The L2 holds INBUFF [Strip Buffer], INTERBUFF [Intermediate Data Buffer], BITSTREAM [Bit Stream Buffer], JPEG Code, and JPEG Tables, Data & Scratch Memory. The C64x DSP core is connected through L1P and L1D.]
A Functional Block Diagram of the On-Chip MJPEG Encoder
[Figure: VP0 (Capture FIFO A). VDIN[9-0] feeds Y Buffer A (1280 bytes), Cb Buffer A (640 bytes) and Cr Buffer A (640 bytes), read out at YSRCA, CbSRCA and CrSRCA; bus widths are labelled 8 and 64.]
1. Configuration of the Video Capture Port
• VP0 is a 20-bit video capture/display port (two channels A and B)
• Each channel has a 2560-byte FIFO
• 8-bit ITU BT.656 capture mode is selected (channel A only)
• No color resampling (4:2:2)
Sub-Frame Capture (and Encoding):
Because of the memory constraints (on-chip memory only), transferring a full frame to the internal memory is not feasible.
Sub-frame capture and encoding is a viable option (a strip of 8 lines is captured and encoded at a time).
Why 8 lines?
It is the minimum number of lines for JPEG encoding of 4:2:2 data, which minimizes the data buffers in the L2 SRAM.
[Figure: a strip of 8 lines within a frame of 480 lines]
Data Transfer from VP0 to the L2 SRAM (INBUFF):
EDMA events will be on a line basis (because of the FIFO sizes)
The transfer size is set to the buffer threshold:
Y buffer threshold will be set to 720 bytes = 90 double words
Cr & Cb buffer thresholds will be set to 360 bytes = 45 double words
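The threshold figures follow from the line width and the 64-bit transfer unit. The sketch below reproduces the arithmetic, assuming 720 luma bytes per D1 line and 8 bytes per double word; the variable and function names are illustrative only.

```python
PIXELS_PER_LINE = 720      # D1 active line: one luma byte per pixel
BYTES_PER_DOUBLEWORD = 8   # thresholds are counted in 64-bit double words
Y_FIFO_BYTES = 1280        # Y Buffer A size from the VP0 description
C_FIFO_BYTES = 640         # Cb / Cr Buffer A size


def threshold_doublewords(bytes_per_line: int) -> int:
    """Return the per-line transfer size in double words."""
    if bytes_per_line % BYTES_PER_DOUBLEWORD:
        raise ValueError("line size must be a whole number of double words")
    return bytes_per_line // BYTES_PER_DOUBLEWORD


y_bytes = PIXELS_PER_LINE        # 4:2:2 keeps full-rate luma
c_bytes = PIXELS_PER_LINE // 2   # each chroma component is half-rate

y_threshold = threshold_doublewords(y_bytes)
c_threshold = threshold_doublewords(c_bytes)

# One line of each component must fit in its FIFO for line-based events.
fits_in_fifo = y_bytes <= Y_FIFO_BYTES and c_bytes <= C_FIFO_BYTES
```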
2. Configuration of the On-Chip Memory (L2)
The L2 memory can operate as SRAM, cache, or both.
Its total capacity is 256 Kbytes (0x0000 0000 to 0x0003 FFFF)
After reset, the entire L2 is mapped as a 256 KB SRAM.
→ No need to configure it using the CSL function CACHE_setL2Mode( )
→ Saves 1.875 KB of code size (used for CACHE_wait, CACHE_clean, CACHE_setL2Mode & CACHE_wbInvL1d)
[Figure: L2 memory map after reset, all SRAM: four 64 KB blocks starting at 0x0000 0000, 0x0001 0000, 0x0002 0000 and 0x0003 0000]
1. INBUFF: Size = 11.25 KB, to hold one strip of data.
(720+360+360)×8 = 11.25 Kbytes
2. INTERBUFF: 22.5 Kbytes, to hold the 16-bit precision strip.
3. JPEG tables and scratch memory 9.25 Kbytes
4. JPEG Code: 28 Kbytes
5. BITSTREAM (frame bit stream buffer): 40 KB
The L2 will hold 2 input data buffers, an intermediate bit stream buffer, and JPEG code and data buffers (as shown in figure).
[Figure: L2 Memory (SRAM) holding INBUFF [Strip Buffer], INTERBUFF [Intermediate Data Buffer], BITSTREAM [Bit Stream Buffer], JPEG Code, and JPEG Tables, Data & Scratch Memory]
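The INBUFF and INTERBUFF sizes can be derived from the strip geometry. This is a small sketch of that arithmetic, assuming 8-bit samples in INBUFF and 16-bit samples in INTERBUFF as stated in the list; the names are illustrative.

```python
LINES_PER_STRIP = 8
Y_BYTES, CB_BYTES, CR_BYTES = 720, 360, 360  # bytes per line, 4:2:2

# One strip of 8-bit samples.
inbuff_bytes = (Y_BYTES + CB_BYTES + CR_BYTES) * LINES_PER_STRIP

# The same strip after every sample is widened to 16 bits.
interbuff_bytes = 2 * inbuff_bytes

inbuff_kb = inbuff_bytes / 1024
interbuff_kb = interbuff_bytes / 1024

# Each 8x8 block holds 64 samples, so this is the block count per strip.
blocks_per_strip = inbuff_bytes // 64
```

The block count of 180 per strip is the same figure used later when sizing the work against the L1D cache.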
3. The Proposed Data Transfer and Encoding Scheme
[Figure: a strip of 8 lines (line #1 data, line #2 data, ..., line #8 data) and its arrangement into 2D form (8×8 blocks): 11111111 22222222 ... 88888888, with consecutive 8-pixel groups of the same line separated by an offset of 64 pixels. The 2D strip feeds 8x8 Block Decomposition and FDCT.]
JPEG encoding requires FDCT processing of incoming data arranged in
8x8 blocks of pixels (2D form).
Data captured from the VP is linearly arranged in the DSP memory.
Pixels should be expanded to 16-bit precision before FDCT computation.
TI TMS320C6000™ JPEG encoders use a function, reformat_enc, to arrange data in the 2D form and expand it to 16-bit precision.
Function        CPU Cycles/Strip    Code Size
reformat_enc    82,800♣             1.1 KB
♣ These numbers include the L1D and L1P overhead.
For D1 resolution @ 30 fps, and using the 600 MHz DM642:
\[ \mathrm{STT} \cong \frac{1}{30 \times 480} \times 8 \times 600 \times 10^{6} \cong 333{,}000 \ \text{CPU cycles} \]
\[ \Rightarrow \mathrm{SET} < 333{,}000 \ \text{CPU cycles (for the system to be memory bound)} \]
[Figure: timing diagram showing the strip transfer time (STT) and the strip encoding time (SET) for Strip #1 and Strip #2]
Implications:
CPU cycle count and cache overhead are large (82,800 CPU cycles/strip), assuming no other traffic that would delay L1D servicing; the figure could be doubled!
→ Lower performance
Tight constraint on the strip encoding time (SET).
Two input data buffers to relax the constraint on the processing time.
→ Larger memory requirement
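The 333,000-cycle bound can be checked numerically. The sketch below assumes 480 active lines per frame, 8 lines per strip, 30 fps and a 600 MHz clock, and compares the bound with the 82,800 cycles/strip quoted for reformat_enc. The variable names are illustrative.

```python
CPU_HZ = 600_000_000
FPS = 30
LINES_PER_FRAME = 480
LINES_PER_STRIP = 8

strips_per_frame = LINES_PER_FRAME // LINES_PER_STRIP

# CPU cycles available while one strip arrives from the video port.
stt_cycles = CPU_HZ // (FPS * strips_per_frame)

# Share of that budget consumed by reformat_enc alone.
REFORMAT_ENC_CYCLES = 82_800
reformat_share = REFORMAT_ENC_CYCLES / stt_cycles
```

Under these assumptions the reformatting step alone uses roughly a quarter of the per-strip budget, before any FDCT, quantization or entropy coding.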
Proposed Solution: Application level optimization
Use the EDMA to simultaneously transfer the data from the VP into the input buffer (INBUFF) and arrange it in the 2D format.
Expand the data to 16 bits from INBUFF to INTERBUFF before the next strip transfer begins.
Pros: 1. No CPU cycles or cache/EDMA overhead involved in 2D data arrangement
2. 16-bit expansion happens in the NO TRAFFIC ZONE!
→ The time constraint on 16-bit expansion is < 41,000 CPU cycles
→ Feasible since the overhead is 4820♣ cycles/strip.
♣ These numbers include the L1D and L1P overhead.
Function            IMG_pix_expand    reformat_enc    % Increase
CPU Cycles/Strip    4820♣             82,800♣         > 888 %
Code Size           128 Bytes         1.1 KB          ≈ 880 %
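The 16-bit expansion step is conceptually simple: every 8-bit sample is widened to a 16-bit half-word with the same value. The sketch below illustrates the idea only; `pix_expand` is a hypothetical stand-in and does not reproduce the signature or behaviour of the library routine named in the table.

```python
from array import array


def pix_expand(src: bytes) -> array:
    """Widen unsigned 8-bit samples to 16-bit half-words.

    Hypothetical illustration; list(src) is used so that each byte is
    treated as one integer sample rather than as raw half-word data.
    """
    return array("h", list(src))


expanded = pix_expand(bytes([0, 127, 255]))
values = expanded.tolist()
```

The output occupies twice the input size, which is why INTERBUFF is twice as large as INBUFF.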
The Proposed Data Transfer and Encoding Scheme
[Figure: timing diagram for Strip #1, Strip #2 and Strip #3 with intervals labelled STT, SET and STP. Activities shown: transfer of the 2D strip from VP0 to L2 (INBUFF), a no-traffic interval, 16-bit expansion of the 2D strip into INTERBUFF, strip encoding, and transfer of the bit stream to the bit stream buffers.]
The Proposed Buffering Scheme
[1] EDMA transfer of the first 2D strip
[2] Half-word expansion
[3] CPU encodes 2D strip
[4] CPU stores strip bit stream
[5] EDMA transfer of the second 2D strip
♣ Steps [3] and [4] occur in parallel with step [5]
[Figure: buffers involved: INBUFF (11.25 KB), INTERBUFF (22.5 KB) and OUTBUFF]
EDMA Channels Configuration for Data Transfer & 2D Arrangement
[Figure: EDMA transfer #1 into the L2 memory. Line #1 data stored in the VP FIFO is written into the input data buffer (INBUFF) as 8-pixel groups separated by an offset of 64.]
[Figure: EDMA transfer #2 into the L2 memory. Line #2 data stored in the VP FIFO is written into INBUFF directly after each line #1 group, again with an offset of 64 between groups.]
The linking feature of the EDMA is exploited to achieve the desired
2D data transfer and arrangement.
More programming effort due to using the EDMA, but it is worth it!
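The 2D arrangement can be emulated in software to make the addressing concrete. This is a minimal sketch and not an EDMA configuration: it assumes each captured line is split into 8-pixel groups and that group b of row r lands at offset b*64 + r*8 in the destination, matching the offset of 64 between groups of the same line. The function name and the toy width are hypothetical.

```python
def arrange_line(dst: bytearray, line: bytes, row: int) -> None:
    """Scatter one captured line into 8x8-block order (row is 0..7)."""
    for block in range(len(line) // 8):
        start = block * 64 + row * 8  # groups of one line are 64 apart
        dst[start:start + 8] = line[block * 8:block * 8 + 8]


WIDTH = 16  # toy line width: two 8x8 blocks per strip
# Line r is filled with the value r + 1, like the 1111.../2222... figure.
strip = [bytes([r + 1] * WIDTH) for r in range(8)]

inbuff = bytearray(WIDTH * 8)
for row, line in enumerate(strip):  # one "transfer" per captured line
    arrange_line(inbuff, line, row)

first_block = bytes(inbuff[:64])
second_block_first_row = bytes(inbuff[64:72])
```

After the eighth line has been scattered, every 64-byte run of the buffer is one complete 8×8 block, so no separate CPU pass is needed to reorder the data.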
4. The Data Memory (L1D) and System Overhead
The L1D is a 16 KB (2-way set associative) cache of 64-byte line size.
L1D miss penalty is up to 6 CPU cycles.
JPEG encoding requires 8×8 block processing at 16-bit precision
⇒ an 8×8 block ≡ 128 bytes
To reduce L1D cache overhead, the number of blocks to process (N) at each JPEG iteration must be limited to the size of the L1D,
i.e., N ≤ 16×1024/128 = 128 blocks
For sub-frame capture and encoding, a total of 180 blocks need to be processed per strip
To avoid L1D overhead, each component is encoded separately
→ procedural level optimization
[Figure: an 8-lines strip (180 blocks) encoded in three iterations (Iteration #1 to #3), one per component: Y component, 90 blocks (11.25 KB); U component, 45 blocks (5.625 KB); V component, 45 blocks (5.625 KB)]
Compulsory cache misses before each iteration can't be avoided
Data sections arrangement to further reduce overhead
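The L1D sizing argument can be written out as a short check. The sketch assumes a 16 KB L1D and 128-byte blocks (64 samples at 16 bits), and uses the per-component block counts of one strip; the names are illustrative.

```python
L1D_BYTES = 16 * 1024
BLOCK_BYTES = 8 * 8 * 2  # 64 samples at 16-bit precision

max_blocks_per_iteration = L1D_BYTES // BLOCK_BYTES

# Blocks per 8-line strip for each component of D1 4:2:2.
blocks = {"Y": 90, "U": 45, "V": 45}
total_blocks = sum(blocks.values())

# The whole strip exceeds the L1D, but each component fits on its own.
whole_strip_fits = total_blocks <= max_blocks_per_iteration
component_fits = {name: n <= max_blocks_per_iteration
                  for name, n in blocks.items()}
```

This is the reason for encoding one component per iteration rather than the whole strip at once.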
5. The Program Memory (L1P) and System Overhead
The L1P is a 16 KB direct-mapped cache with a 32-byte line size
L1P read miss penalty is up to 8 CPU cycles
It is important to limit the JPEG code to fit the L1P size to avoid any cache overhead in the encoding loop (compulsory cache misses cannot be avoided)
→ procedural level optimization
So far, the JPEG kernels fit into the L1P (only 10.5 KB)
Optimization is in progress
6. The External Memory and System Overhead
No external memory in the encoding process
Cons:
Tight memory budget → Problems are solutions!
Pros:
☺ Lower cost
☺ External memory access latency is ignored
☺ No cache coherence issues need to be addressed
☺ EDMA scheduling is easy
1. Memory Requirement
a. Data buffers: 33.75 KB
c. JPEG tables: 2.5 KB
d. JPEG scratch memory: 6.75 KB
e. Program memory: 28 KB (No optimization) → (JPEG Kernels 10.5 KB)
f. Frame Bit stream buffer: 40 KB
Total: 111 Kbytes
⇒ Remaining budget = 145 Kbytes
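The memory total and the remaining budget follow directly from the listed items. A small sketch of the sum, assuming the 256 KB L2 capacity stated earlier; the dictionary keys are descriptive labels only.

```python
L2_KB = 256.0

usage_kb = {
    "data buffers": 33.75,
    "JPEG tables": 2.5,
    "JPEG scratch memory": 6.75,
    "program memory (no optimization)": 28.0,
    "frame bit stream buffer": 40.0,
}

total_kb = sum(usage_kb.values())
remaining_kb = L2_KB - total_kb
```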
2. MHz Requirement
For D1 YUV 4:2:2 → Upper bound ≈ 6,000,000 CPU cycles/frame
The final estimate will be reported soon in the second draft
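The upper bound can be translated into a clock-rate requirement. The sketch below assumes the 6,000,000 cycles/frame upper bound, 30 fps and the 600 MHz device clock; the resulting load is a derived illustration, not a measured figure.

```python
CYCLES_PER_FRAME_UPPER_BOUND = 6_000_000
FPS = 30
CPU_HZ = 600_000_000

# Cycles per second needed at the upper bound (i.e. required MHz * 1e6).
cycles_per_second = CYCLES_PER_FRAME_UPPER_BOUND * FPS
required_mhz = cycles_per_second / 1_000_000

# Fraction of the 600 MHz device used at that bound.
cpu_load = cycles_per_second / CPU_HZ
```

At the stated upper bound, the encoder would need about 180 MHz, leaving the rest of the device for networking and application code.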
Encoder Profile - D1 YUV 4:2:2 @ 30 fps
[Figure: TMS320DM642™ only ($38.55)]
Sample Results
[Figure: original frame (675 KB) and compressed frame (33 KB) at Q = 50%]
These artifacts are still under investigation
The road to success is always under construction.
- Anonymous
Investigation of the resulting artifacts
Further optimization (e.g. cache analysis, code optimization)
Benchmarks
Unique Features of the Proposed Implementation
On-chip implementation → low cost
Low MHz and memory requirements → headroom for other tasks
VP configuration for direct transfer to on-chip memory
2D data transfer and arrangement from the VP using the EDMA
→ Lower overhead and memory requirement
Constraints and Future Work
Support for different video resolutions
→ Only D1 YUV 4:2:2 @ 30 fps or less is supported
→ D1 YUV4:2:0 will require larger data buffers (strip ⇒ 16 lines)
Support for interleaved mode
A comparison between the proposed solution & TI JPEG encoder on the TMS320DM642 (2003) – D1 YUV 4:2:2
Criterion         Proposed Solution                          Current TI JPEG Encoder
Data Memory       43 KB (JPEG scratch, tables & buffers)     29 KB (JPEG scratch, tables & buffers)
                  40 KB bit stream                           675 KB frame buffer
                                                             40 KB bit stream
Program Memory    28 KB (no optimization)                    25 KB (optimized)
                  (JPEG kernels 10.5 KB)                     (JPEG kernels 9.5625 KB)
Total Memory      111 KB                                     769 KB