Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023...

Understanding Sources of Inefficiency in General-Purpose ChipsHameed, Rehan, et al.

PRESENTED BY:

XIAOMING GUO

SIJIA HE

Outline➢Motivation

➢H.264 Basics

➢Key ideas

➢Implementation & Evaluation

➢Summary

➢Conclusion

➢Discussion

We Love General-Purpose Chips…

Runs a broad range of applications!

14nm lithography

4GHz frequency

4 cores/8 threads

64KB L1 cache/core

256KB L2 cache/core

8MB LLC

Intel Core i7-6700K

https://www.techpowerup.com/215333/intel-skylake-die-layout-detailed.html

But…… General-purpose chips are Inefficient for specific tasks

Performance (fps) Area (mm2) Energy/frame (mJ)

Intel (720×480 SD) 30 122 742

Intel (1280×720 HD) 11 122 2023

ASIC (1280×720 HD) 30 8 4

Highly optimized 2.8GHz Intel Pentium 4 implementation of H.264 encoder vs. ASIC

GP is 500x less energy efficient than ASIC in H.264 encoding

Outline➢Introduction

➢H.264 Basics

➢Key ideas

➢Summary

➢Conclusion

➢Discussion

SequentialParallel

H.264 BasicsOne of the most commonly used video coding format

MotionPrediction

Video Source

Inter-framePrediction

Intra-framePrediction (IP)

IME FME

Transform/Quantize

Entropy Encoding(CABAC)

Compressed Video

H.264 Basics: An Example

http://blog.webmproject.org/2010/07/inside-webm-technology-vp8-intra-and.html

➢H.264 Basics

➢Key ideas

➢Summary

➢Conclusion

➢Discussion

Key Ideas

1. Use an Tensilica-based, extensible CMP system as the baseline

2. Explore customizations of the baseline that reduce overheads totransform the baseline into an ASIC-like H.264 encoder

3. Study how the overhead is reduced

4. Understand the source of inefficiency in GP

➢H.264 Basics

➢Key ideas

➢Summary

➢Conclusion

➢Discussion

Implementation - Baseline

Implementation – Customization Step 1Data Level Parallelism:

Up to 18-way SIMD

https://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/CellProgrammingTutorial/BasicsOfSIMDProgramming.html

Instruction Level Parallelism:Up to 3-slot VLIW

Instruction bundle

Instruction 1 Instruction 2 Instruction 3

FPU Integer Unit Memory Unit

Evaluation – Customization Step 1Speedup: 9.2x

Good: An order of magnitude increase in performance and energy efficiency

Bad: Still two orders of magnitude worse than ASIC

Taken from the presentation slides for this paper on 2010 ISCA

Evaluation – Customization Step 1

Good: Function unit becomes more energy efficient

Bad: Overheads from other operations are still significant. FU only accounts for 10% in all the energy consumed.

Implementation – Customization Step 2Operation Fusion: Combine several instructions into one

Pseudo-code:acc = 0;acc = Addshft (acc, x0, x1, 20);acc = Addshft (acc, x0, x1, -5);acc = Addshft (acc, x0, x1, 1);xn = Sat(acc);

acc = acc + 20(x0 + x1):temp1 = 20 * x0

temp2 = 20 * x1

temp3 = temp1 + temp2acc = acc + temp3

Good: Compiler can do operation fusion. So it is “free”.

Bad: No significant change in performance and energy efficiency compared to SIMD + VLIW

Speedup: 15.7x (baseline)Speedup: 1.7x (SIMD+VLIW)

Evaluation – Observations So Far…Exploring data and instruction level parallelism helps ➢ 10x improvement in both performance and energy efficiency

The inefficiency is dominated by data movement➢ FU only takes 10% of total energy even with SIMD + VLIW

Need large computation/low communication➢ Specific hardware to perform 100s operations with 1 instruction

Implementation – Customization Step 3Special Hardware for FME

Implementation – Customization Step 3Special Hardware for IME

This image cannot currently be displayed.

• SAD: Sum of Absolute Difference

• 256 SAD / operation

• Pixels blocks in red box can shift down and right, maximizing data re-use.

• Minimize bits fetched/operation

Good: Orders of magnitude faster than GP. Energy efficiency within 3x of ASIC.

Bad: Maybe very difficult to implement in GP. Requires customized data storage, function unit and instruction.

Speedup: 256x (baseline)This image cannot currently be displayed.

Good: Over 35% of the energy is now in FU. More data re-use, lowering total energy.

Bad: Most of the code involves “magic” instructions

This image cannot currently be displayed.

Processor Energy Breakdown

➢H.264 Basics

➢Key ideas

➢Summary

➢Conclusion

➢Discussion

Summary➢Basic operations only take hundreds femtojoule/op in ASIC

➢Instruction/data fetch, pipeline registers, control, etc. take 140pJ/op

➢Even with SIMD, VLIW and operation fusion, the gap is hard to close

➢Need “magic instruction” that can perform 100s operations each time

➢Can be done by coupling customized data storage and data path

➢Example: using shift registers to enable data reuse and parallel access

➢H.264 Basics

➢Key ideas

➢Summary

➢Conclusion

➢Discussion

Conclusion➢Achieving ASIC-like energy efficiency is difficult due to the large overheads in fetch, pipeline registers, control, etc.

➢Need customized hardware that fits in processor framework that can execute 100s of operations/instruction.

➢Need tools that enable designers to create customized hardware easily to reduce the cost. (Bridging Moore’s Law performance gap is more about “How Much” does it cost – Prof. Todd Austin)

Questions?

Discussion1. The paper only studied the gap between general purpose chips and H.264

ASIC. Do the results also apply to other application?

1. The authors introduced “magic” instructions that can perform 100s operations to achieve ASIC-like efficiency. Is it realistic to have those instructions in a real design?

Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023...

Documents

Transcript of Understanding Sources of Inefficiency in General-Purpose Chips...Intel (1280×720 HD) 11 122 2023...

HP Pavilion 15 Notebook PC - HP® Official · PDF fileProduct Name HP Pavilion 15 Notebook PC ... DTS sound+ Integrated HP TrueVision HD webcam (fixed [no tilt], activity LED, 1280×720

· Reader, Web KaMepa, MP integrated 0.9 Megapixel HD Webcam (1280 x 720) with, microphone Satepvq, cell lithium-ion, up to 3 h 30 min Mobile Mark 2012 24 Mece CROPocr HA rlEYAT

Inspiron 15 5000 Gaming Setup and Specifications...• Video: 1280 x 720 (HD) at 30 fps (maximum) Diagonal viewing angle 74 degrees Touch pad Table 13. Touch pad Resolution • Horizontal:

HD Visual Communication Systems · video codec, which enables efficient transmission of high-quality images at up to 60 fps in a high-definition resolution of 1280 x 720 pixels. Stunning

ICAD · 1280 x 720, H264 codec video, MP4 file orvimeo link (set to allow download). 1280 x 720, H264 codec video, MP4 file orvimeo link (set to allow download). ... Stings refer

Quick Installation Guide Third Edition, October 2013… · It is an industrial-grade, H.264 box-type IP camera that combines HD resolution (1280 x 720), advanced IVA (Intelligent

HD IP CAM 1280 720 Manual - netipcam.com Camera User Guideb.pdf · HD IP CAM 1280 720 Manual. ... Watch the video: Open video folder and view the saved video segments. Intercom: Used

Visit or call 619-246-8236 · Payments accepted: Magstripe, EMV smartcard, EMV contactless (including Apple Pay') Security certification: PCI PTS 5.0 5-inch HD screen (1280 x 720

MEMORY RECODER PMW-1000...Features The PMW-1000 is a full-HD (1920 × 1080 and 1280 × 720) memory recorder. It features an enhanced networking and other IT functions, and is highly

MEMORY RECODER PMW-1000 · 6 Features Chapter 1 Overview Overview Chapter 1 Features The PMW-1000 is a full-HD (1920 × 1080 and 1280 × 720) memory recorder. It features an enhanced

36x - VivitekDisplay Resolutions 1280 x 720 (16:9, 720p at 60 FPS), 1920 x 1080 (16:9, 1080p at 60 FPS), 3840 x 2160 (16:9, UltraHD 4K at 30 FPS) 1280 x 720 (16:9, 720p at 60 FPS),

720 p hd SPY WATCH CAMERA

Hd 720 p waterproof sport helmet action camera cam dvr

HUAWEI TE40 · • Coding/Decoding resolution: 800 x 600, 1024 x 768, 1280 x 1024, 1280 x 720, 1920 x 1080, 1600 x 1200, 1920 x 1200

IQEYE ALLIANCE-MX CAMERA DOME - Amazon S3s3.amazonaws.com/cdn.vicon-security.com/wp-content/uploads/IQeye-Alliance-mx6... · Max Resolution: HD 720P - 1280 x 720 HD 1080P - 1920 x

FCB-EH Series - JenCam · affordable FCB-EH3150 features a 12x optical zoom lens with superb HD (1280 x 720) resolution, and is the smallest camera in the FCB- EH camera family. All

dl.openipc.org · Web viewWidth range, which is associated with the video coding formats SD and HD. The SD resolution is below 1280 x 720 pixels, and the HD resolution is 1280 x 720

Brilliant Cameras with essential functions · 2018. 11. 7. · Video Compression Format H.264, MJPEG Resolution 1920 x 1080, 1280 x 1024, 1280 x 960, 1280 x 720, 1024 x 768, 800 x

Direct Film (DF) for IED’s X-ray HD.pdf · ‣Video: VGA 640 x 480 + snapshot HD 1280 x 720 ‣Wired and Wifi option (100m line of sight) ‣Pan. angle: 210° horizontal ‣View

Wireless HD CCTV System - ERA Everywhere · PDF fileCCTV home security system suitable for most homes Easy to install - simply plug and go Includes an HD LCD monitor 1280 x 720 Full