Presented by: John Tully Dept of Computer & Information Sciences University of Delaware

CISC 879 - Machine Learning for Solving Systems Problems


Using Machine Learning to Guide Architecture Simulation

Greg Hamerly (Baylor University)

Erez Perelman, Jeremy Lau, Brad Calder (UCSD)

Timothy Sherwood (UCSB)

Journal of Machine Learning Research 7 (2006)

http://cseweb.ucsd.edu/~calder/papers/JMLR-06-SimPoint.pdf


Simulation is Critical!

• Allows engineers to understand cycle-level behavior of processor before fabrication

• Can play with design options cheaply. How are performance, complexity, area, power affected when I make modification X, and remove feature Y?


But... Simulation is SLOW

• Modelling at cycle level is very slow

• SimpleScalar in cycle-accurate mode: a few hundred million cycles per hour

• Modelling at gate level is very, very, very slow

• ETI cutting-edge emulation technology: 5,000 cycles/second (24 hours = ~1 second of Cyclops-64 instructions).


Demands are increasing

• Size of benchmarks: applications can be quite large.

• Number of programs: industry-standard benchmarks are large suites. Many focus on variety (e.g. SPEC: 26 programs that stress ALUs, FPUs, memory, cache, etc.)

• Iterations required: experimenting with just one feature (e.g. cache size) can take hundreds of thousands of benchmark runs


‘Current’ Remedies

• Simulate programs for N instructions (whatever your timeframe allows), and just stop.

• Similarly, fast-forward through initialization portion, and then simulate N instructions.

• Simulate N instructions from only the “important” (most computationally intensive) portions of a program.

• None of these works well, and at their worst the results are embarrassing: error rates of almost 4,000%!


SimPoint to the Rescue

• 1. As a program executes, its behavior changes. The changes aren’t random – they’re structured as sequences of recurring behavior (termed phases).

• 2. If repetitive and structured behavior can be identified, then we only need to sample each unique behavior of a program (and not the whole thing) to get an idea for its execution profile.

• 3. How can we identify repetitive, structured behavior? Use machine learning!

• Now, only a small set of samples is needed. Collect points from each phase (simulation points) and weight them – together they accurately represent execution of the entire program.


Defining Phase Behavior

• Seems pretty easy at first... let's just collect hardware-based statistics, and classify phases accordingly

• CPI (performance)

• Cache Miss Rates

• Branch Statistics (Frequency, Prediction Rate)

• FPU instructions per cycle

• But what's the problem here?


Defining Phase Behavior

• Problem: if we use hardware-based stats, we're tying phases to architectural configuration!

• Every time we tweak architecture, we must re-define phases!

• Underlying methodology: identify phases without relying on architectural metrics. Then, we can find a set of samples that can be used across our entire design space.

• But what can we use that's independent of hardware-based statistics, but still relates to fundamental changes in what the hardware is doing?



• Basic Block Vector (BBV): a structure designed to capture how a program changes behavior over time.

• A distribution of how many times each basic block is executed over an interval (can use a 1D-array)

• Each entry weighted by # of instructions in the BB (so all instructions have equal weight).

• Subsets of information in BBVs can also be extracted

• Register usage vectors

• Loop / branch execution frequencies
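The BBV description above can be sketched in a few lines. This is only an illustration: the trace format (a list of basic block ids) and the names `build_bbv`, `block_trace`, and `block_sizes` are hypothetical, and the final normalization step (so the entries sum to 1) is an assumption added so that intervals of different lengths stay comparable.

```python
from collections import Counter

def build_bbv(block_trace, block_sizes):
    """Build one basic block vector (BBV) for a single interval.

    block_trace: ids of the basic blocks executed in the interval, in order
                 (hypothetical trace format).
    block_sizes: block_sizes[b] is the number of instructions in block b.
    """
    counts = Counter(block_trace)
    # Weight each block's execution count by its instruction count,
    # so every executed instruction contributes equally.
    bbv = [counts.get(b, 0) * size for b, size in enumerate(block_sizes)]
    total = sum(bbv)
    # Assumption: normalize so the entries sum to 1.
    return [x / total for x in bbv] if total else bbv

# Block 0 ran once, block 1 three times, block 2 once.
bbv = build_bbv([0, 1, 1, 2, 1], block_sizes=[4, 2, 10])
print(bbv)
```

In this toy interval, block 2 dominates the vector even though it ran only once, because it contains the most instructions.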



• Now, we can use BBVs to find patterns in the program. But can we prove they're useful?

• Detailed study by Lau et al.: very strong correlation between the following:

• 1) Difference in BBV of the interval, and BBV of the whole program (code changes)

• 2) CPI of the interval (performance)

• Graphic on next slide......

• Things are looking really good now – we can create a set of phases (and therefore, points to simulate) by ONLY looking at executed code.


Extracting Phases

• Next step: how do I actually turn my BBVs into phases?

• Create a function to compare two BBVs: how similar are they?

• Use machine learning data clustering algorithms to group similar BBVs. Each cluster (set of similar points) = a phase!

• SimPoint is the implementation of this

• Profiles programs (divides them into intervals, and creates BBVs for each).

• Uses the k-means clustering algorithm. Input includes the granularity of clusters, which dictates the size and number of phases!
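The comparison function and the clustering step might look roughly like the sketch below. It is a plain textbook k-means written for illustration, not the SimPoint code: Euclidean distance is an assumed choice of similarity function, `k` is passed in by hand, and the sketch omits any dimensionality reduction or automatic selection of the number of clusters. The names `distance` and `kmeans` and the toy BBVs are hypothetical.

```python
import random

def distance(a, b):
    """Euclidean distance between two BBVs of equal length (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(bbvs, k, iterations=20, seed=0):
    """Plain k-means. Returns (labels, centroids); labels[i] is the phase of interval i."""
    rng = random.Random(seed)
    # Start from k intervals chosen at random as initial centroids.
    centroids = [list(v) for v in rng.sample(bbvs, k)]
    labels = [0] * len(bbvs)
    for _ in range(iterations):
        # Assignment step: each interval joins its nearest centroid.
        labels = [min(range(k), key=lambda c: distance(v, centroids[c])) for v in bbvs]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(bbvs, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Five toy intervals: three dominated by block 0, two dominated by block 2.
bbvs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9],
        [0.1, 0.0, 0.9], [0.85, 0.15, 0.0]]
labels, centroids = kmeans(bbvs, k=2)
print(labels)
```

Intervals that share a label form one phase. Raising `k` gives more, smaller phases, which is the granularity knob mentioned above.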


Choosing Simulation Pts

• Final Step: choose simulation points. From each phase, SimPoint chooses one representative interval that will be simulated (in full detail) to represent the whole phase.

• All points in the phase are (theoretically) similar in performance statistics – so we can extrapolate.

• Machine learning also used to pick representative points of a cluster (the interval to use from a phase).

• Points are weighted based on interval size (and phase size, of course)

• Only needs to be done once per program+input combination – remember why?
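One plausible way to pick and weight the points, sketched under assumptions: the representative of a phase is taken to be the interval closest to the phase's centroid, and the weight is the fraction of all intervals that fall in that phase (which matches phase size when intervals have a fixed length). The function name `pick_simulation_points` and the toy data are hypothetical.

```python
def pick_simulation_points(bbvs, labels):
    """Return {phase: (interval_index, weight)} for every phase in labels."""
    points = {}
    for phase in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == phase]
        # Centroid = per-dimension mean of the BBVs in this phase.
        centroid = [sum(col) / len(members)
                    for col in zip(*(bbvs[i] for i in members))]

        def squared_dist(i):
            return sum((x - c) ** 2 for x, c in zip(bbvs[i], centroid))

        # Assumption: the interval nearest the centroid represents the phase.
        representative = min(members, key=squared_dist)
        # Weight = share of the program's intervals covered by this phase.
        points[phase] = (representative, len(members) / len(labels))
    return points

# Eight toy intervals already labelled with two phases.
bbvs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3],
        [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.15, 0.85], [0.25, 0.75]]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
points = pick_simulation_points(bbvs, labels)
print(points)
```

Only the chosen intervals (indices 1 and 4 in this toy data) would then be simulated in full detail, and their results combined using the weights.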


Choosing Simulation Pts

• User can tweak interval length, # of clusters, etc. – a tradeoff between accuracy (more points simulated) and simulation time.


Experimental Framework

• Test Programs: SPEC Benchmarks (26 applications, about half integer, half FP; designed to stress all aspects of a processor.)

• Simulation: SimpleScalar, Alpha architecture.

• Metrics: accuracy of simulation measured in CPI prediction error
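As a rough illustration of the metric, the sketch below assumes the whole-program CPI is estimated as the weighted sum of the CPIs measured at the simulation points, and that the prediction error is the relative difference from the CPI of a full simulation. The function names and every number in the example are made up for illustration; they are not results from the paper.

```python
def estimate_cpi(weights, sampled_cpis):
    """Weighted average of the CPIs measured at the simulation points."""
    return sum(w * c for w, c in zip(weights, sampled_cpis))

def cpi_error(estimated, true):
    """Relative CPI prediction error (assumed definition)."""
    return abs(estimated - true) / true

# Made-up example: two phases with weights 0.375 and 0.625.
est = estimate_cpi([0.375, 0.625], [1.2, 2.0])
err = cpi_error(est, true=1.75)
print(f"estimated CPI = {est:.3f}, error = {err:.1%}")
```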


Million Dollar Question...

• How does phase classification do?

• SPEC2000, 100 million instruction intervals, no more than 10 simulation points

• Gzip, Gcc: only 4 and 8 phases found, respectively


• How accurate is this thing?

• A lot better than “current” methods.....



• How much time are we saving?

• In the previous result, SimPoint simulates only 400-800 million instructions. According to the SPEC benchmark data sheet, the 'reference' input configurations are 50 billion and 80 billion instructions, respectively.

• So the baseline simulation had to execute ~100 times more instructions for this configuration, which took several months!

• Imagine if we needed to run on a few thousand combinations of cache size, memory latency, etc....

• Intel / Microsoft use it - must be pretty good.
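The "~100 times" figure can be checked with the numbers quoted on the slides, assuming one 100-million-instruction simulation point per phase (4 for gzip, 8 for gcc). The helper name `instruction_ratio` is hypothetical, and the ratio counts instructions only; it says nothing about wall-clock overheads such as reaching each simulation point.

```python
INTERVAL = 100_000_000  # instructions per interval, as quoted on the slides

def instruction_ratio(full_run_instructions, num_points, interval=INTERVAL):
    """How many times more instructions the full run executes than the sampled run."""
    simulated = num_points * interval
    return full_run_instructions / simulated

gzip_ratio = instruction_ratio(50_000_000_000, num_points=4)  # 400M simulated
gcc_ratio = instruction_ratio(80_000_000_000, num_points=8)   # 800M simulated
print(gzip_ratio, gcc_ratio)
```

Under these assumptions the full gzip run executes 125 times as many instructions as the sampled one, and the full gcc run 100 times as many.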



Putting it all together

• First implementation of machine learning techniques to perform program phase analysis.

• Main thing to take away: applications (even complex ones) only exhibit a few unique behaviors – they're simply interleaved with each other over time.

• Using machine learning, we can find these behaviors with methods that are independent of architectural metrics.

• By doing so, we only need to simulate a few carefully chosen intervals, which greatly reduces simulation time.


Related / Future Work

• Other clustering algorithms with the same data (multinomial clustering, regression trees) – k-means appears to do the best.

• “Un-tie” simulation points from binary – how could we do this?

• Map behavior back to source level after detecting it

• Now, we can use same simulation points for different compilations / input of a program

• Accuracy is just about as good as with fixed intervals (Lau et al.)
