Optimizing Zlib on Arm - Linux Foundation Events...Optimizing Zlib on Arm: The power of NEON...
Optimizing Zlib on Arm: The power of NEON
Adenilson Cavalcanti
ARM - San Jose (California)
@adenilsonc
Why zlib?
Zlib
Used everywhere (libpng, Skia, FreeType, Cronet, Firefox, Chrome, Linux kernel, Android, iOS, JDK, git, etc.).
Old code base released in 1995.
Written in K&R C style.
Context
Lacks any optimizations for ARM CPUs.
Problem statement
Identify potential optimization candidates and verify positive effects in Chromium.
Previous art
● Cloudflare
● Intel
● Zlib-ng
Before deepening the fork...
● Performed some benchmarking.
● Contacted each project.
● Mixed results (1 project never replied back).
Before forking...
None focused on decompression* or had ARM-specific optimizations.
*Important for a Web browser.
PNGs rely on zlib
● Transparent.
● Pre-filters.
● High-res.
Meet Mr. Parrot
Source: https://upload.wikimedia.org/wikipedia/commons/3/3f/ZebraHighRes.png
Parrots are not created equal
Original: 2.7MB
Palette: 0.8MB
Zopfli: 2.6MB
Perf to the rescue
NEON: Advanced SIMD (Single Instruction, Multiple Data)
NEON
● Optional on ARMv7.
● Mandatory on ARMv8.
Registers
ARMv7
● 16 registers @ 128 bits: Q0 - Q15.
● 32 registers @ 64 bits: D0 - D31.
● Varied set of instructions: load, store, add, mul, etc.
ARMv8
● 32 registers @ 128 bits: Q0 - Q31.
● 32 registers @ 64 bits: D0 - D31.
● 32 registers @ 32 bits: S0 - S31.
● 32 registers @ 16 bits: H0 - H31.
● 32 registers @ 8 bits: B0 - B31.
● Varied set of instructions: load, store, add, mul, etc.
An example: VADD.I16 Q0, Q1, Q2
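As a rough illustration of what the instruction above does, the sketch below models it in plain Python: each Q register is treated as eight independent 16-bit lanes, and Q0 receives the lane-wise sum of Q1 and Q2, wrapping around on overflow. The function name `vadd_i16` and the list-based register model are assumptions made for illustration only; real code would use the instruction itself or a compiler intrinsic.

```python
LANES = 8          # a 128-bit Q register holds eight 16-bit lanes
MASK16 = 0xFFFF    # integer adds wrap around at 16 bits


def vadd_i16(q1, q2):
    """Model of VADD.I16 Qd, Qn, Qm: add two registers lane by lane."""
    if len(q1) != LANES or len(q2) != LANES:
        raise ValueError("each register must hold exactly 8 lanes")
    # One instruction performs all eight additions at once in hardware;
    # here the same result is computed with an ordinary loop.
    return [(x + y) & MASK16 for x, y in zip(q1, q2)]


q1 = [1, 2, 3, 4, 5, 6, 7, 0xFFFF]
q2 = [10] * LANES
q0 = vadd_i16(q1, q2)
print(q0)  # [11, 12, 13, 14, 15, 16, 17, 9]
```

The last lane shows the wraparound: 0xFFFF + 10 does not fit in 16 bits, so only the low 16 bits are kept.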
Entropy & Compression
Entertaining definition
https://www.youtube.com/watch?v=l49MHwooaVQ
Formal definition
Shannon Entropy
$$H = -\sum_{i} p_i \log_2 p_i$$
Where: p_i: probability of character i appearing in the stream of characters.
https://en.wiktionary.org/wiki/Shannon_entropy
Practical explanation
a) HTML
b) JPEG
Practical visualization
./binwalk -E file
a) HTML: 0.68
b) JPEG: 0.95
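To make the numbers above more concrete, here is a minimal sketch of a byte-level entropy measurement. It assumes the 0.68 and 0.95 figures are Shannon entropy values scaled to the 0 to 1 range (bits per byte divided by 8); the function name `normalized_entropy` and the sample inputs are hypothetical and not taken from the slides.

```python
import math
from collections import Counter


def normalized_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, scaled to the range 0.0 .. 1.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how often each byte value occurs
    bits_per_byte = -sum(
        (n / total) * math.log2(n / total) for n in counts.values()
    )
    return bits_per_byte / 8.0  # 8 bits per byte is the maximum


# Repetitive markup scores low, a uniform spread of byte values scores high.
print(normalized_entropy(b"<p>hello</p>" * 100))
print(normalized_entropy(bytes(range(256)) * 4))
```

Low-entropy input such as HTML leaves a lot of redundancy for a compressor to remove, while already-compressed formats such as JPEG are close to the maximum and compress poorly.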
Decompression optimizations
Adler-32 simplistic implementation
https://en.wikipedia.org/wiki/Adler-32
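The code shown on the original slide did not survive extraction. As a stand-in, the sketch below follows the textbook definition of Adler-32 (two running sums taken modulo 65521) and cross-checks the result against Python's built-in `zlib.adler32`. The name `adler32_naive` is an assumption for illustration and this is not necessarily the exact code from the slide.

```python
import zlib

MOD_ADLER = 65521  # largest prime smaller than 2**16


def adler32_naive(data: bytes) -> int:
    """Byte-at-a-time Adler-32 with a modulo after every step."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER  # running sum of the bytes
        b = (b + a) % MOD_ADLER     # running sum of the 'a' values
    return (b << 16) | a


sample = b"Wikipedia"
print(hex(adler32_naive(sample)))  # 0x11e60398
assert adler32_naive(sample) == zlib.adler32(sample)
```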
Adler-32: problems
● Zlib’s Adler-32 was more than 7x faster than the naive implementation.
● It is hard to vectorize the following computation:
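The difficulty comes from the second sum depending on every intermediate value of the first. One common way to restructure the work, sketched below, is to process a block of n bytes at once: the first sum grows by the plain sum of the block, and the second grows by n times the old first sum plus a weighted sum of the block. Both block sums are independent per byte, which is the kind of work SIMD lanes handle well. The block size of 16 and the name `adler32_blocked` are assumptions for illustration; this is not claimed to be the exact scheme used in the Chromium patch.

```python
import zlib

MOD_ADLER = 65521
BLOCK = 16  # hypothetical block size, e.g. one 128-bit load of bytes


def adler32_blocked(data: bytes) -> int:
    """Adler-32 computed block by block instead of byte by byte."""
    a, b = 1, 0
    for start in range(0, len(data), BLOCK):
        chunk = data[start:start + BLOCK]
        n = len(chunk)
        plain_sum = sum(chunk)
        # The first byte of the block is counted n times in b, the last once.
        weighted_sum = sum((n - i) * byte for i, byte in enumerate(chunk))
        b = (b + n * a + weighted_sum) % MOD_ADLER  # uses the old value of a
        a = (a + plain_sum) % MOD_ADLER
    return (b << 16) | a


payload = bytes(range(256)) * 10 + b"tail"
assert adler32_blocked(payload) == zlib.adler32(payload)
```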
Adler-32: technical drawing (Jan 2017)
Adler-32: Intel got some love too!
https://bugs.chromium.org/p/chromium/issues/detail?id=688601
fast_chunk
● Second candidate in the perf profiling was inflate_fast.
● Very high level idea: perform long loads/stores in the byte array.
● Average 20% faster!
● Shipping on M62.
● Original patch by Simon Hosie.
https://bugs.chromium.org/p/chromium/issues/detail?id=697280
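The "long loads/stores" idea can be pictured with the back-reference copy that inflate performs: `length` bytes are copied from `dist` bytes behind the end of the output. The sketch below contrasts a byte-at-a-time copy with a chunked one. It is a simplified model under stated assumptions: the chunk width of 16, the function names, and the way overlapping copies are handled (shrinking the chunk to `dist`) are illustrative choices, not the logic of the actual patch.

```python
CHUNK = 16  # hypothetical width of one wide load/store, in bytes


def copy_match_bytewise(out: bytearray, dist: int, length: int) -> None:
    """Copy a match one byte at a time."""
    for _ in range(length):
        out.append(out[-dist])


def copy_match_chunked(out: bytearray, dist: int, length: int) -> None:
    """Copy the same match in wide chunks where possible."""
    # A chunk may never read bytes that have not been written yet,
    # so it is limited to 'dist' bytes when source and target overlap.
    step = min(CHUNK, dist)
    while length > 0:
        n = min(step, length)
        start = len(out) - dist
        out += out[start:start + n]
        length -= n


a = bytearray(b"abc")
b = bytearray(b"abc")
copy_match_bytewise(a, 3, 7)
copy_match_chunked(b, 3, 7)
print(a, b)  # both are bytearray(b'abcabcabca')
```

For matches whose distance is at least the chunk width, the chunked version does one sixteenth of the loop iterations, which is where the gain comes from.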
CRC-32
https://bugs.chromium.org/p/chromium/issues/detail?id=709716
● YMMV on PNGs (from 1 to 5%).
● Remember it is used while decompressing web content (29% boost for gzipped content).
● ARMv8-a has a crc32 instruction (from 3 to 10x faster than zlib’s crc32 C code).
● Shipping on M66.
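For reference, the checksum that gzip streams carry can be written as a short bit-by-bit loop. The sketch below is a deliberately slow reference version, checked against Python's `zlib.crc32`; the name `crc32_bitwise` is an assumption for illustration. A hardware instruction performs the inner work of this loop for several bytes at a time, which is consistent with the speedups quoted above.

```python
import zlib

POLY_REFLECTED = 0xEDB88320  # bit-reversed form of the CRC-32 polynomial 0x04C11DB7


def crc32_bitwise(data: bytes, crc: int = 0) -> int:
    """Reference CRC-32 (as used by gzip and PNG), one bit at a time."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


check = b"123456789"
print(hex(crc32_bitwise(check)))  # 0xcbf43926
assert crc32_bitwise(check) == zlib.crc32(check)
```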
Results: Chromium’s zlib*
* c-zlib
Arm: zlib format 1.4x
Arm: gzip format 1.5x
Arm: c-zlib vs. Vanilla
x86: c-zlib vs. Vanilla
We were missing compression...
Bonus: Compression on Arm
Slide-hash: NEON
https://chromium-review.googlesource.com/1136940
● Using NEON instruction vqsubq.
● Works on 8x 16-bit chunks.
● Perf gain of 5%.
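In zlib's deflate, sliding the hash tables means subtracting the window size from every 16-bit table entry and clamping entries that would go negative to zero. That clamped subtraction is exactly what a saturating subtract does, which is why an instruction like vqsubq fits well. The sketch below models this in Python; the function names and the list-based table are assumptions for illustration, and the grouping into eight lanes only mimics what a 128-bit register would process at once.

```python
LANES = 8  # eight 16-bit entries fit in one 128-bit register


def slide_hash_scalar(table, wsize):
    """Scalar form: a compare and a branch per entry."""
    return [m - wsize if m >= wsize else 0 for m in table]


def saturating_sub_u16(x, y):
    """Unsigned saturating subtract: results below zero clamp to zero."""
    return max(x - y, 0)


def slide_hash_lanes(table, wsize):
    """Lane-wise form: one saturating subtract per group of eight entries."""
    out = []
    for start in range(0, len(table), LANES):
        group = table[start:start + LANES]
        out.extend(saturating_sub_u16(m, wsize) for m in group)
    return out


table = [0, 5, 32768, 40000, 65535, 100, 32767, 32769, 7]
print(slide_hash_lanes(table, 32768))  # [0, 0, 0, 7232, 32767, 0, 0, 1, 0]
```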
insert-string: crypto CRC-32
https://chromium-review.googlesource.com/c/chromium/src/+/1173262
● Using ARMv8-a instruction crc32.
● Works on 1x 32-bit chunks.
● Perf gain of 24%.
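The general idea of hashing a 32-bit chunk with a CRC can be sketched as follows: take four input bytes, run them through a CRC, and keep the low bits as an index into the hash table. Everything specific here is an assumption for illustration: `zlib.crc32` merely stands in for the hardware instruction (the slides do not say which CRC polynomial the patch uses), and `HASH_BITS = 15` and the function name are hypothetical.

```python
import zlib

HASH_BITS = 15                   # hypothetical hash table size of 2**15 entries
HASH_MASK = (1 << HASH_BITS) - 1


def hash_window_crc(window: bytes) -> int:
    """Map a 4-byte window to a hash table index using a CRC."""
    if len(window) != 4:
        raise ValueError("window must be exactly 4 bytes (one 32-bit chunk)")
    # zlib.crc32 is a software stand-in for a single crc32 instruction.
    return zlib.crc32(window) & HASH_MASK


data = b"abcdabcx"
for pos in range(len(data) - 3):
    print(pos, hash_window_crc(data[pos:pos + 4]))
```

Identical 4-byte windows always land in the same bucket, which is what lets the compressor find earlier occurrences of the current input.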
Arm: current state
● Compression: average 1.36x faster, but 1.4x faster for HTML.
● Decompression: average 1.6x faster (gzip), but 1.8x faster for HTML.
Conclusions
● There is plenty of life left even in an old code base.
● NEON optimizations can yield a *huge* impact.
● It pays off to work in a lower layer.
● OSS love: Intel got it too.
Chromium’s zlib: c-zlib
● Decompression: 1.7x to 2x faster.
● Compression: 1.3x to 1.4x faster.
● Both ARM & x86 are supported.
● Highly tested (e.g. cronet, fuzzers).
● Widely deployed (over 1 billion users).
● Open to performance & security patches.
Zlib users should consider moving to Chromium’s zlib.
Resources
a) Slides: https://goo.gl/vaZA9o
b) Performance benchmarks: https://goo.gl/qLVdvh
c) Code: https://cs.chromium.org/chromium/src/third_party/zlib/
Final words
“This is how the open-source model works: building upon the work of others is far more efficient than rewriting everything.”
Jean-loup Gailly (zlib author)
https://slashdot.org/story/00/03/10/1043247/jean-loup-gailly-on-gzip-go-and-mandrake
Questions