Entropy Slices for Parallel Entropy Coding
K. Misra, J. Zhao and A. Segall
L A B O R A T O R I E S O F A M E R I C AL A B O R A T O R I E S O F A M E R I C A
Entropy Slices
Introduction: Entropy Slice
Introduce partitioning of slices into smaller "entropy" slices
Entropy slice:
  Reset context models
  Restrict definition for neighborhood
  Process identical to current slice by entropy decoder
Key difference: reconstruction uses information from neighboring entropy slices
[Figure: decoder flowchart, run for each picture/slice. Boxes: Parse slice header; entropy_slice_flag? (Y/N); Parse regular slice header; Parse entropy slice header; Reset CABAC state; Define neighbor info for CABAC & reconstruct; Define neighbor info for CABAC; Define neighbor info for reconstruct; Entropy decode slice data; Reconstruct slice.]
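The two neighborhood definitions on this slide can be sketched as below. This is a minimal illustration, not the TMuC implementation: the Block type, its fields and both function names are hypothetical, and bounding the reconstruction neighborhood by the regular slice is an assumption carried over from conventional slice behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    index: int             # decoding-order index (hypothetical)
    slice_id: int          # regular slice containing the block
    entropy_slice_id: int  # entropy slice containing the block


def available_for_entropy_decoding(cur: Block, nb: Block) -> bool:
    """Context selection may only use neighbors in the same entropy slice."""
    return nb.index < cur.index and nb.entropy_slice_id == cur.entropy_slice_id


def available_for_reconstruction(cur: Block, nb: Block) -> bool:
    """Reconstruction may cross entropy slice boundaries (assumed: not regular slice boundaries)."""
    return nb.index < cur.index and nb.slice_id == cur.slice_id


# Two adjacent blocks in the same regular slice but in different entropy slices.
left = Block(index=7, slice_id=0, entropy_slice_id=0)
cur = Block(index=8, slice_id=0, entropy_slice_id=1)
print(available_for_entropy_decoding(cur, left))  # False
print(available_for_reconstruction(cur, left))    # True
```

The entropy decoder treats the left block as unavailable, so it needs nothing from the other entropy slice, while reconstruction still sees it.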
Entropy Slices
We now introduce the major advantages of the entropy slice concept
Advantage #1 - Parallelization:
  Entropy slices do not depend on information outside of the entropy slice and can be decoded independently
  Allows for parallelization of the entire entropy decoding loop, including context adaptation and bin coding
Advantage #2 - Generalization:
  Entropy slices can be used for all entropy coding engines currently under study in the TMuC and TMuC software
  Moreover, we have software available for PIPE and CABAC V2V
  [Figure labels: CABAC, PIPE, UVLC]
Entropy Slices
Advantage #3 - No impact on single thread/core:
  Parallelization capability does not come at the expense of single thread/core applications
  A single thread/core process may
  1. Decode all entropy slices prior to reconstruction
  OR
  2. Decode entropy slice and then reconstruct without neighbourhood reset
  This is friendly to any architecture
Entropy Slices
Advantage #4 - Easy Adaptation to Decoder Design
  Bit-stream can be partitioned into a large number of entropy slices with little overhead
    For example, we will show performance of 32 entropy slices for 1080p on the next slide; this would translate to ~128 slices for 4k.
  Decoder can schedule N entropy decoders easily, where N is arbitrary
    One example: for 32 slices, an architecture with parallelization of 4 (N=4) would assign 8 slices per decoder.
    Another example: for 32 slices, an architecture with N=8 would assign 4 slices per decoder.
  Additionally, for large resolutions (4k, 8k) it is possible to scale to 100s of decoders for GPU implementations
[Figure: parallel entropy decoding, run for each picture. Parse N slice/entropy slice headers, or until the start of the next picture; entropy decode the 1st, 2nd, ..., mth slice data in parallel (m slices); reconstruct m slices. N: desired degree of parallelism; m: available slices in current picture.]
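The slice-to-decoder arithmetic above can be checked with a short sketch. The round-robin policy and the function name assign_slices are illustrative assumptions; the slides do not prescribe a scheduling policy. The 4k figure assumes 3840x2160, which has four times the samples of 1920x1080.

```python
def assign_slices(num_slices: int, num_decoders: int) -> list[list[int]]:
    """Round-robin assignment of entropy slice indices to decoders (hypothetical policy)."""
    return [list(range(d, num_slices, num_decoders)) for d in range(num_decoders)]


# 32 entropy slices spread over N = 4 and N = 8 decoders.
for n in (4, 8):
    plan = assign_slices(32, n)
    print(f"N={n}: slices per decoder = {[len(p) for p in plan]}")

# Same slice size in samples at 4k (assumed 3840x2160) as at 1080p (1920x1080).
slices_4k = 32 * (3840 * 2160) // (1920 * 1080)
print(f"slices at 4k: {slices_4k}")  # 128
```

Because the entropy slices are independent, any assignment works; round-robin is used here only because it balances the count per decoder.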
Entropy Slices
Advantage #5 - Coding Efficiency
  Insertion of Entropy Slices results in negligible impact on coding efficiency.
  For example, if we configure the encoder for a parallelization factor of 32, we get:

Intra           Y BD-rate   U BD-rate   V BD-rate
Class A         0.4         0.3         0.3
Class B         0.3         0.2         0.2
Class C         0.3         0.1         0.2
Class D         0.2         0.2         0.1
Class E         0.2         0.1         0.0
All             0.3         0.2         0.2

Random access   Y BD-rate   U BD-rate   V BD-rate
Class A         0.2         0.8         0.5
Class B         0.2         0.6         0.2
Class C         0.1         0.1         0.2
Class D         0.1         0.0         0.1
Class E
All             0.1         0.4         0.2

Low delay       Y BD-rate   U BD-rate   V BD-rate
Class A
Class B         0.1         0.9         0.3
Class C         0.0         0.3         0.1
Class D         0.0         -0.1        0.2
Class E         0.1         0.6         -0.3
All             0.0         0.4         0.1

Enc Time[%] / Dec Time[%] values shown on the slide: 111%, 103%, 103%, 101%, 103%, 105% (their pairing with the three configurations is lost in this transcript)
Entropy Slices
Advantage #6 - Specification
  Entropy slices allow simple and direct specification of parallelization at the Profile and Level stage
  This is accomplished by:
    Specifying the maximum number of bins in an Entropy Slice
    Specifying the maximum number of Entropy Slices per picture
  Allows additional specification of PIPE/V2V configurations
    Maximum number of bins per bin coder in an Entropy Slice
  Additional advantage: straightforward to determine conformance at encoder
Level limits table with an added column for the maximum number of bins in an entropy slice
Columns and units:
  Level: level number
  MaxMBPS: max macroblock processing rate (MB/s)
  MaxFS: max frame size (MBs)
  MaxDPB: max decoded picture buffer size (1024 bytes for 4:2:0)
  MaxBR: max video bit rate (1000 bits/s, 1200 bits/s, cpbBrVclFactor bits/s, or cpbBrNalFactor bits/s)
  MaxCPB: max CPB size (1000 bits, 1200 bits, cpbBrVclFactor bits, or cpbBrNalFactor bits)
  MaxVmvR: vertical MV component range (luma frame samples)
  MinCR: min compression ratio
  MaxMvsPer2Mb: max number of motion vectors per two consecutive MBs
  MaxBins: max number of bins in entropy slice

Level  MaxMBPS  MaxFS   MaxDPB    MaxBR    MaxCPB   MaxVmvR         MinCR  MaxMvsPer2Mb  MaxBins
1      1 485    99      148.5     64       175      [-64,+63.75]    2      -             -
1b     1 485    99      148.5     128      350      [-64,+63.75]    2      -             -
1.1    3 000    396     337.5     192      500      [-128,+127.75]  2      -             -
1.2    6 000    396     891.0     384      1 000    [-128,+127.75]  2      -             -
1.3    11 880   396     891.0     768      2 000    [-128,+127.75]  2      -             -
2      11 880   396     891.0     2 000    2 000    [-128,+127.75]  2      -             -
2.1    19 800   792     1 782.0   4 000    4 000    [-256,+255.75]  2      -             -
2.2    20 250   1 620   3 037.5   4 000    4 000    [-256,+255.75]  2      -             -
3      40 500   1 620   3 037.5   10 000   10 000   [-256,+255.75]  2      32            M3
3.1    108 000  3 600   6 750.0   14 000   14 000   [-512,+511.75]  4      16            M3.1
3.2    216 000  5 120   7 680.0   20 000   20 000   [-512,+511.75]  4      16            M3.2
4      245 760  8 192   12 288.0  20 000   25 000   [-512,+511.75]  4      16            M4
4.1    245 760  8 192   12 288.0  50 000   62 500   [-512,+511.75]  2      16            M4.1
4.2    522 240  8 704   13 056.0  50 000   62 500   [-512,+511.75]  2      16            M4.2
5      589 824  22 080  41 400.0  135 000  135 000  [-512,+511.75]  2      16            M5
5.1    983 040  36 864  69 120.0  240 000  240 000  [-512,+511.75]  2      16            M5.1
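A conformance check against the two proposed limits could look like the sketch below. The class and function names are hypothetical, and the numeric limits are placeholders chosen for illustration: the slides give only symbolic per-level values (M3 to M5.1) for the bin limit and no value for the slice count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntropySliceLimits:
    max_bins_per_entropy_slice: int      # per-level limit (symbolic on the slide)
    max_entropy_slices_per_picture: int  # second proposed limit


def picture_conforms(bins_per_slice: list[int], limits: EntropySliceLimits) -> bool:
    """True if one picture respects both limits; bins_per_slice has one entry per entropy slice."""
    if len(bins_per_slice) > limits.max_entropy_slices_per_picture:
        return False
    return all(b <= limits.max_bins_per_entropy_slice for b in bins_per_slice)


# Placeholder numbers, not taken from the proposal.
limits = EntropySliceLimits(max_bins_per_entropy_slice=100_000,
                            max_entropy_slices_per_picture=32)
print(picture_conforms([90_000, 100_000, 40_000], limits))  # True
print(picture_conforms([90_000, 120_000], limits))          # False
```

An encoder already knows how many bins it wrote into each entropy slice, so a check of this form needs no extra analysis of the bit-stream.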
Entropy Slices
Syntax
  Slice header
    Indicate slice is "entropy slice"
    Send only information necessary for entropy decoding

slice_header( ) {                                          C  Descriptor
  first_lctb_in_slice                                      2  ue(v)
  entropy_slice_flag                                       2  u(1)
  if(!entropy_slice_flag) {
    …
  }
  else {
    if( entropy_coding_mode_flag && slice_type != I )
      cabac_init_idc                                       2  ue(v)
  }
}
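The entropy slice branch of this syntax can be parsed as sketched below. The BitReader helper and the function name are hypothetical. The sketch also assumes that entropy_coding_mode_flag and slice_type are known from outside the entropy slice header, and it leaves the regular slice header unparsed because its fields are elided on the slide.

```python
class BitReader:
    """Minimal MSB-first bit reader (illustrative helper, not from the slides)."""

    def __init__(self, data: bytes) -> None:
        self.bits = "".join(f"{byte:08b}" for byte in data)
        self.pos = 0

    def u(self, n: int) -> int:
        """Read n bits as an unsigned integer: the u(n) descriptor."""
        value = int(self.bits[self.pos:self.pos + n], 2)
        self.pos += n
        return value

    def ue(self) -> int:
        """Read an unsigned Exp-Golomb code: the ue(v) descriptor."""
        leading_zeros = 0
        while self.u(1) == 0:
            leading_zeros += 1
        suffix = self.u(leading_zeros) if leading_zeros else 0
        return (1 << leading_zeros) - 1 + suffix


def parse_slice_header(reader: BitReader, entropy_coding_mode_flag: bool,
                       slice_type_is_i: bool) -> dict:
    header = {"first_lctb_in_slice": reader.ue(),
              "entropy_slice_flag": reader.u(1)}
    if not header["entropy_slice_flag"]:
        return header  # regular slice header fields are elided on the slide
    if entropy_coding_mode_flag and not slice_type_is_i:
        header["cabac_init_idc"] = reader.ue()
    return header


# Bits 00101 | 1 | 010, zero padded: first_lctb_in_slice=4, flag=1, cabac_init_idc=1.
header = parse_slice_header(BitReader(bytes([0x2D, 0x00])), True, False)
print(header)
```

In this example the whole entropy slice header occupies 9 bits before padding.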
Conclusions
We have presented the concept of an "entropy slice" for the HEVC system
Advantages include:
1. Parallel entropy decoding (both context adaptation and/or bin coding)
2. Generalization to any entropy coding system under study
3. No impact on serial implementations
4. Easy adaptation to different parallelization factors at the decoder
5. Negligible impact on coding efficiency (<0.2%)
6. Direct path for specifying parallelization at the profile/level stage
Software is available
Entropy Slices
In the last meeting, two topics were discussed
1. Size of entropy slice headers
2. Extension to potential architectures that do not decouple parsing and reconstruction
We address these in the next slides…
Entropy Slices
Header Size
  Very small (as asserted previously)
  Quantitative: 2 bytes + NALU (1 byte) for 1080p
  Scales for resolutions due to first_lctb_in_slice
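How the first_lctb_in_slice field grows with resolution follows from the length of a ue(v) code, which is 2*floor(log2(v+1))+1 bits. The sketch below evaluates the worst case (the last block index in the picture). The 64x64 block size and the function names are assumptions, and the sketch does not attempt to reproduce the 2-byte figure quoted above, which is the authors' own number.

```python
import math


def ue_bits(code_num: int) -> int:
    """Length in bits of the unsigned Exp-Golomb code for code_num."""
    return 2 * ((code_num + 1).bit_length() - 1) + 1


def block_count(width: int, height: int, block_size: int = 64) -> int:
    """Number of coding tree blocks in a picture (block_size is an assumption)."""
    return math.ceil(width / block_size) * math.ceil(height / block_size)


for name, (w, h) in {"1080p": (1920, 1080), "4k": (3840, 2160)}.items():
    last_index = block_count(w, h) - 1
    print(f"{name}: {block_count(w, h)} blocks, "
          f"worst-case first_lctb_in_slice = {ue_bits(last_index)} bits")
```

The field grows logarithmically with the number of blocks, so quadrupling the picture area adds only a few bits to the worst case.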
Entropy Slices
Extension to additional architectures
  At the previous meeting there was interest in extending the method to architectures that do not buffer symbols between parsing and reconstruction
  This anticipates "joint-wave-front" processing of both parsing and reconstruction loops
  We investigated this issue and concluded the following:
  1. In the current TMuC design, we observe that it is not possible to do wavefront processing of the parsing stage.
  2. If we configure the TMuC to support wavefront parsing, the extension of entropy slices is straightforward
Entropy Slices
Our approach: provide additional entry-points without neighbor restriction
[Figure: picture with "Entropy slice" entry-points, each marked "EC Init". EC Init: Use cabac_init_idc to initialize entropy coder]
Entropy Slices
[Figure: Entropy + Reconstruction steps: 16]
Entropy Slices
Syntax
1. Signal that the bin coding engine will be reset at start of each LCU row
2. Allow signaling cabac_init_idc for the reset

coding_unit( x0, y0, currCodingUnitSize ) {                C  Descriptor
  if (x0==0 && currCodingUnitSize==MaxCodingUnitSize && lcu_row_cabac_init_idc_flag==true && lcu_id!=first_lcu_in_slice) {
    cabac_init_idc_present_flag                            1  u(1)
    if( cabac_init_idc_present_flag )
      cabac_init_idc                                       2  ue(v)
  }
  a regular coding unit …
}

slice_header( ) {                                          C  Descriptor
  entropy_slice_flag                                       2  u(1)
  if (entropy_slice_flag) {
    first_lcu_in_slice                                     2  ue(v)
    lcu_row_cabac_init_flag                                1  u(1)
    if( lcu_row_cabac_init_flag ){
      lcu_row_cabac_init_idc_flag                          1  u(1)
    }
    if( entropy_coding_mode_flag && slice_type != I) {
      cabac_init_idc                                       2  ue(v)
    }
  }
  else {
    lcu_row_cabac_init_flag                                1  u(1)
    if( lcu_row_cabac_init_flag ){
      lcu_row_cabac_init_idc_flag                          1  u(1)
    }
    a regular slice header ……..
  }
}
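The positions at which the coding_unit condition above fires can be listed with a short sketch. It assumes that lcu_id is a raster-scan index, that one slice covers the whole picture, and that the LCU is 64x64; the function name is hypothetical.

```python
import math


def lcu_row_reset_points(width: int, height: int, lcu_size: int = 64,
                         first_lcu_in_slice: int = 0) -> list[int]:
    """Raster-scan LCU indices where x0 == 0 and the LCU is not the first in the slice."""
    cols = math.ceil(width / lcu_size)
    rows = math.ceil(height / lcu_size)
    return [r * cols for r in range(rows) if r * cols != first_lcu_in_slice]


points = lcu_row_reset_points(1920, 1080)
print(len(points), points[:3])
```

For 1920x1080 with the assumed LCU size there are 17 LCU rows, so the slice start plus the 16 row resets give 17 entry points.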
Entropy Slices
Performance

Max parallelism:
  Maintain initial 32x parallelization
  Additionally: one entry point for every LCU row
  17x for 1080p
  RD performance: 0.5-1%

Intra           Y BD-rate   U BD-rate   V BD-rate
Class A         0.5         0.4         0.5
Class B         0.5         0.4         0.4
Class C         0.6         0.5         0.6
Class D         0.6         0.5         0.6
Class E         0.5         0.5         0.6
All             0.5         0.5         0.5

Random access   Y BD-rate   U BD-rate   V BD-rate
Class A         0.9         1.5         1.2
Class B         1.0         1.5         1.2
Class C         0.9         0.8         0.9
Class D         1.3         1.2         1.3
Class E
All             1.1         1.2         1.2

Enc Time[%] / Dec Time[%]: shown as #NUM! on the slide

4x parallelism:
  Maintain initial 32x parallelism
  Additionally: four entry points in the ES (aligned with LCU rows; result 4x speedup)
  RD performance: 0.3%

Intra           Y BD-rate   U BD-rate   V BD-rate
Class A         0.4         0.4         0.4
Class B         0.3         0.2         0.2
Class C         0.3         0.2         0.3
Class D         0.2         0.2         0.1
Class E         0.2         0.2         0.2
All             0.3         0.2         0.2

Random access   Y BD-rate   U BD-rate   V BD-rate
Class A         0.4         1.2         1.0
Class B         #VALUE!     #VALUE!     #VALUE!
Class C         0.2         0.3         0.2
Class D         0.1         0.0         0.1
Class E
All             #VALUE!     #VALUE!     #VALUE!

Enc Time[%] / Dec Time[%]: shown as #NUM! on the slide
Entropy Slices
Conclusion
Entropy slices well tested and flexible
  Demonstrated in multiple environments (JM, JMKTA, TMuC)
  Demonstrated with CABAC and CAV2V
  Friendly to serial and parallel architectures (including both decoupled and coupled parsing/reconstruction architectures)
From the last meeting: "The basic concept of desiring enhanced high-level parallelism of the entropy coding stage to be in the HEVC design is agreed."
We propose
1. Adoption of the entropy slice technology into the TM
2. Evaluation of the "joint-wavefront" extension in a CE