Understanding Android Benchmarks

  • Understanding Android Benchmarks. "freedom" Koan-Sin Tan, freedom@computer.org. OSDC.tw, Taipei, Apr 11th, 2014
  • Disclaimers: many of the materials used in this slide deck are from the Internet and textbooks, e.g., many of the following materials are from Computer Architecture: A Quantitative Approach, 1st ~ 5th ed. Opinions expressed here are my personal ones and don't reflect my employer's view
  • Who am I: did some networking and security research before; working for a SoC company, recently on big.LITTLE scheduling and related stuff, and on parallel construct evaluation; run benchmarks from time to time to improve the performance of our products and to know our colleagues' progress
  • Focusing on the CPU and memory parts of benchmarks; let's ignore graphics (2D, 3D), storage I/O, etc.
  • Blackbox! Do a Google image search for "benchmark" and you can find that many of the results are Android-related benchmarks. Similar to the recent Cross-Strait Trade in Services Agreement (TiSA), most benchmarks on the Android platform are kind of a blackbox
  • Is Apple A7 good? When Apple released the new iPhone 5s, many technical blogs showed benchmarks in their reviews. The commonly used ones: GeekBench, JavaScript benchmarks, some graphics benchmarks. Why? Are they the right ones? Etc. E.g., http://www.anandtech.com/show/7335/the-iphone-5s-review
  • Open the blackbox
  • Android Benchmarks
  • No, not improvement in this way: http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks
  • Assuming there is no cheating, what can we do?
  • Outline: performance benchmark review; some Android benchmarks; what we did and what still can be done; future
  • To quote what Prof. Raj Jain quoted: "Benchmark v. trans. To subject (a system) to a series of tests in order to obtain prearranged results not available on competitive systems." From: The Devil's DP Dictionary, S. Kelly-Bootle
  • Why benchmarking? We did something good, so let's check if we did it right; we compare with our own previous results to see if we broke anything; we want to know how good our colleagues in other places are
  • What to report? Usually, what we mean by benchmarking is measuring performance. What to report? The intuitive answer: how many things we can do in a certain period of time. Yes, time. E.g., MIPS, MFLOPS, MiB/s, bps
  • MIPS and MFLOPS: MIPS (Million Instructions per Second), MFLOPS (Million Floating-Point Operations per Second). All instructions are not created equal: CISC machine instructions usually accomplish a lot more than those of RISC machines, so comparing the instructions of a CISC machine and a RISC machine is similar to comparing Latin and Greek
  • MIPS and what's wrong with them: MIPS is instruction set dependent, making it difficult to compare the MIPS of computers with different ISAs. MIPS varies between programs on the same computer; and most importantly, MIPS can vary inversely to performance: with hardware FP, MIPS is generally smaller, because the program runs fewer (but slower) instructions and still finishes sooner than with software FP routines
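  • A minimal sketch of how MIPS can move in the opposite direction to performance. All numbers here (clock rate, instruction counts, CPI) are hypothetical and chosen only for illustration; the helper names `run_time` and `mips` are mine, not from any benchmark.

```python
def run_time(instruction_count, cpi, clock_hz):
    """Execution time = instructions * cycles per instruction / clock rate."""
    return instruction_count * cpi / clock_hz


def mips(instruction_count, seconds):
    """Million instructions per second for one program run."""
    return instruction_count / (seconds * 1e6)


CLOCK_HZ = 1e9  # hypothetical 1 GHz core

# Hypothetical program, built two ways:
# software FP: many simple integer instructions emulate each FP operation
soft_instr, soft_cpi = 100e6, 1.0
# hardware FP: far fewer instructions, but each one takes more cycles on average
hard_instr, hard_cpi = 20e6, 2.0

soft_time = run_time(soft_instr, soft_cpi, CLOCK_HZ)
hard_time = run_time(hard_instr, hard_cpi, CLOCK_HZ)
soft_mips = mips(soft_instr, soft_time)
hard_mips = mips(hard_instr, hard_time)

print(f"software FP: {soft_time:.3f} s, {soft_mips:.0f} MIPS")
print(f"hardware FP: {hard_time:.3f} s, {hard_mips:.0f} MIPS")
```

    Under these assumed numbers the hardware-FP build finishes sooner yet reports the lower MIPS figure, which is why time, not MIPS, is the thing to compare.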
  • MFLOPS and what's wrong with them: applies only to programs with floating-point operations. It counts operations instead of instructions, but floating-point instructions still differ between machines with different ISAs. There are fast and slow floating-point operations. Possible solution: weight the operations and count them at the source code level: ADD, SUB, COMPARE: 1; DIVIDE, SQRT: 2; EXP, SIN: 4
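  • A small sketch of the weighted, source-level counting idea, using the weights listed on the slide. The operation counts and the run time are hypothetical, and the function names are made up for this example.

```python
# Weights as listed on the slide: cheap operations count 1, expensive ones more.
WEIGHTS = {
    "add": 1, "sub": 1, "compare": 1,
    "divide": 2, "sqrt": 2,
    "exp": 4, "sin": 4,
}


def normalized_flop_count(op_counts):
    """Weighted number of floating-point operations counted in the source."""
    return sum(WEIGHTS[op] * n for op, n in op_counts.items())


def normalized_mflops(op_counts, seconds):
    """Weighted operations per second, in millions."""
    return normalized_flop_count(op_counts) / (seconds * 1e6)


# Hypothetical source-level operation counts for one run of some program
counts = {"add": 4_000_000, "divide": 1_000_000, "sin": 500_000}
seconds = 0.5  # hypothetical measured run time

native = sum(counts.values()) / (seconds * 1e6)
print(f"native MFLOPS:     {native:.1f}")
print(f"normalized MFLOPS: {normalized_mflops(counts, seconds):.1f}")
```

    The point of the weighting is that a program heavy on divides or sines is not penalized relative to one that only adds, even though both report "operations per second".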
  • The best choice of benchmarks to measure performance is real applications
  • Problematic benchmarks. Kernels: small, key pieces of real applications, e.g., Linpack. Toy programs: 100-line programs from beginning programming assignments, e.g., quicksort. Synthetic benchmarks: fake programs invented to try to match the profile and behavior of real applications, e.g., Dhrystone
  • Why are they discredited? Small, fit in cache; obsolete instruction mix; uncontrolled source code; prone to compiler tricks; short runtimes on modern machines; single-number performance characterization with a single benchmark; difficult to reproduce results (short runtime and low-precision UNIX timer)
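  • To make the last point concrete, here is a rough sketch of what a coarse timer does to a short run. The 10 ms tick is an assumption (a 100 Hz clock), the run times are hypothetical, and times are kept as integer microseconds to avoid floating-point noise.

```python
TICK_US = 10_000  # assumed timer resolution: 10 ms (a 100 Hz tick)


def measured_us(true_us, tick_us=TICK_US):
    """Time reported by a clock that only advances in whole ticks."""
    return (true_us // tick_us) * tick_us


def worst_case_relative_error(true_us, tick_us=TICK_US):
    """The reading can be off by up to about one tick."""
    return tick_us / true_us


short_run_us = 4_000        # hypothetical 4 ms benchmark run
long_run_us = 10_000_000    # hypothetical 10 s benchmark run

print(measured_us(short_run_us))   # the short run is invisible to the timer
print(measured_us(long_run_us))
print(f"{worst_case_relative_error(short_run_us):.0%}")
print(f"{worst_case_relative_error(long_run_us):.1%}")
```

    With these assumed numbers a run shorter than one tick can read as zero, while a run of several seconds is off by a fraction of a percent at most, which is one reason old benchmarks need their iteration counts raised on modern machines.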
  • Dhrystone. Source: http://homepages.cwi.nl/~steven/dry.c, < 1000 LoC. Size of the CA15 binary compiled with bionic (instructions: ~ 14 KiB):
      text   data  bss    dec
      13918  467   10266  24660
  • Whetstone. Dhrystone is a pun on Whetstone. Source code: http://www.netlib.org/benchmark/whetstone.c
      Test      MFLOPS   MOPS     ms
      N1 float  119.78            0.16
      N2 float  171.98            0.78
      N3 if              154.25   0.67
      N4 fixpt           397.48   0.79
      N5 cos             19.08    4.36
      N6 float  84.22             6.41
      N7 equal           86.84    2.13
      N8 exp             5.95     6.26
      MWIPS     463.97            21.55
  • More on synthetic benchmarks. The best-known examples of synthetic benchmarks are Whetstone and Dhrystone. Problems: compiler and hardware optimizations can artificially inflate the performance of these benchmarks but not of real programs. The other side of the coin is that because these benchmarks are not natural programs, they don't reward optimizations of behaviors that occur in real programs. Examples: optimizing compilers can discard 25% of the Dhrystone code; examples include loops that are only executed once, making the loop overhead instructions unnecessary. Most Whetstone floating-point loops execute small numbers of times or include calls inside the loop. These characteristics are different from many real programs. Some more discussion is in the 1st edition of the textbook
  • LINPACK. LINPACK: a floating-point benchmark from the manual of the LINPACK library. Source: http://www.netlib.org/benchmark/linpackc and http://www.netlib.org/benchmark/linpackc.new, 883 LoC. Size of the CA15 binary compiled with bionic (instructions: ~ 13 KiB):
      text   data  bss  dec
      12670  408   0    13086
  • CoreMark (1/2). CoreMark is a benchmark that aims to measure the performance of central processing units (CPU) used in embedded systems. It was developed in 2009 by Shay Gal-On at EEMBC and is intended to become an industry standard, replacing the antiquated Dhrystone benchmark. The code is written in C and contains implementations of the following algorithms: linked list processing, matrix manipulation (common matrix operations), state machine (determine if an input stream contains valid numbers), and CRC (from Wikipedia)
  • CoreMark (2/2)
      name              LoC
      core_list_join.c  496
      core_matrix.c     308
      core_stat.c       277
      core_util.c       210
    CoreMark vs. Dhrystone: reporting rules; use of library calls, e.g., malloc(), is avoided; CRC to make sure the data are correct. However, CoreMark is a kernel + synthetic benchmark, and still has a quite small footprint:
      text   data  bss  dec
      18632  456   20   19108
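  • A sketch of the "CRC to make sure the data are correct" idea: fold the kernel's output into a checksum so a wrong (or skipped) computation is detected. This is not CoreMark's code; the CRC variant shown (reflected polynomial 0xA001, initial value 0) and the toy `kernel` function are assumptions picked for illustration.

```python
def crc16(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-16 (reflected polynomial 0xA001, initial value 0)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def kernel(n):
    """Hypothetical stand-in for a benchmark kernel: produces n result bytes."""
    return bytes((i * i + 3 * i) & 0xFF for i in range(n))


# Reference checksum, recorded once from a known-good run
expected = crc16(kernel(64))

# A later run (e.g. a new compiler or new flags) must reproduce it
result = kernel(64)
assert crc16(result) == expected, "benchmark produced wrong data"

# A result with a single corrupted byte no longer matches
corrupted = bytes([result[0] ^ 0x01]) + result[1:]
print(crc16(corrupted) != expected)
```

    Because the checksum depends on every output byte, the work cannot simply be dropped as dead code, and a score is only reported when the data check out.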
  • So? To overcome the danger of placing all the eggs in one basket, collections of benchmark applications, called benchmark suites, are a popular measure of the performance of processors across a variety of applications. Standard Performance Evaluation Corporation (SPEC)
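  • A rough sketch of how a suite turns many programs into one number, in the SPEC style: each program gets a ratio against a reference machine's time, and the ratios are combined with a geometric mean. The program names and all times below are hypothetical; the scaling by 100 reflects my understanding that the CPU2000 reference machine scores 100.

```python
import math


def spec_ratio(reference_seconds, measured_seconds, scale=100):
    """Ratio against the reference machine; a faster run gives a bigger number."""
    return scale * reference_seconds / measured_seconds


def geometric_mean(values):
    """n-th root of the product, computed via logs for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical (reference time, measured time) pairs in seconds
runs = {
    "prog_a": (1400.0, 700.0),
    "prog_b": (1800.0, 450.0),
    "prog_c": (1100.0, 1100.0),
}

ratios = {name: spec_ratio(ref, t) for name, (ref, t) in runs.items()}
score = geometric_mean(list(ratios.values()))

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.0f}")
print(f"suite score: {score:.0f}")
```

    The geometric mean is used so that no single program with a huge ratio dominates the result, and so that the ranking of two machines does not depend on which machine was chosen as the reference.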
  • Why CPU2000 in the 2010s? Why does ARM stick with SPEC CPU2000 instead of CPU2006? 1999 Q4 results, the earliest available CPU2000 results (http://www.spec.org/cpu2000/results/res1999q4/): CINT2000 base 133 to 424, CFP2000 base 126 to 514. 2005: Opteron 144, 1.8 GHz: 1,440 (CA15 at 1.9 GHz, as reported by nVidia, is 1,168). CPU2006 requires much more DRAM; 1 GiB of DRAM is not enough
      name          CA9  CA7  CA15  Krait
      SPECint 2000  356  320  537   326
      SPECfp 2000   298  236  567   350
    All normalized to 1.0 GHz
  • SPEC numbers from Quantitative Approach, 5th Edition