CSE 567M Computer Systems Analysis
1-1©2006 Raj Jain www.rajjain.com
CSE 567M Computer Systems Analysis
Text Book
R. Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991, ISBN 0-471-50336-3. (Winner of the "1992 Best Computer Systems Book" award from the Computer Press Association.)
Objectives: What You Will Learn
Specifying performance requirements
Evaluating design alternatives
Comparing two or more systems
Determining the optimal value of a parameter (system tuning)
Finding the performance bottleneck (bottleneck identification)
Characterizing the load on the system (workload characterization)
Determining the number and sizes of components (capacity planning)
Predicting the performance at future loads (forecasting)
Basic Terms
System: Any collection of hardware, software, and firmware components.
Metrics: Criteria used to evaluate the performance of the system.
Workloads: The requests made by the users of the system.
Main Parts of the Course
An Overview of Performance Evaluation
Measurement Techniques and Tools
Experimental Design and Analysis
Measurement Techniques and Tools
Types of Workloads
Popular Benchmarks
The Art of Workload Selection
Workload Characterization Techniques
Monitors
Accounting Logs
Monitoring Distributed Systems
Load Drivers
Capacity Planning
The Art of Data Presentation
Ratio Games
Example
Which type of monitor (software or hardware) would be more suitable for measuring each of the following quantities:
Number of instructions executed by a processor?
Degree of multiprogramming on a timesharing system?
Response time of packets on a network?
Example
The performance of a system depends on the following three factors:
Garbage collection technique used: G1, G2, or none.
Type of workload: editing, computing, or AI.
Type of CPU: C1, C2, or C3.
How many experiments are needed? How does one estimate the performance impact of each factor?
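The first question can be answered directly: a full factorial design runs every combination of the three factors. A minimal sketch (the level names are taken from the example; the variable names are illustrative):

```python
from itertools import product

# Levels of the three factors from the example.
gc_levels = ["G1", "G2", "none"]            # garbage collection technique
workloads = ["editing", "computing", "AI"]  # workload type
cpus = ["C1", "C2", "C3"]                   # CPU type

# A full factorial design runs every combination once: 3 * 3 * 3 = 27.
experiments = list(product(gc_levels, workloads, cpus))
print(len(experiments))  # 27
```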
Example
The average response time of a database system is three seconds. During a one-minute observation interval, the idle time on the system was ten seconds. Using a queueing model for the system, determine the following:
System utilization
Average service time per query
Number of queries completed during the observation interval
Average number of jobs in the system
Probability of the number of jobs in the system being greater than 10
90-percentile response time
90-percentile waiting time
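A sketch of the calculation, assuming an open M/M/1 queue (an assumption; the slide says only "a queueing model"):

```python
import math

R = 3.0      # mean response time, seconds
T = 60.0     # observation interval, seconds
idle = 10.0  # observed idle time, seconds

busy = T - idle
U = busy / T                 # utilization: 50/60 ~ 0.833
S = R * (1 - U)              # M/M/1: R = S/(1 - U), so S = 0.5 s per query
C = busy / S                 # completions: busy time / service time = 100
X = C / T                    # throughput, queries per second
N = X * R                    # Little's law: mean jobs in system = 5
p_gt_10 = U ** 11            # M/M/1: P(n > k) = U^(k+1)
r90 = R * math.log(10)       # M/M/1 90-percentile response time, ~6.9 s
w90 = R * math.log(10 * U)   # M/M/1 90-percentile waiting time, ~6.4 s
```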
Common Mistakes in Evaluation
1. No Goals: there is no general-purpose model; the goals determine the techniques, metrics, and workload. Setting goals is not trivial.
2. Biased Goals: e.g., "to show that OUR system is better than THEIRS." Analysts should be impartial, like a jury.
3. Unsystematic Approach
4. Analysis Without Understanding the Problem
5. Incorrect Performance Metrics
6. Unrepresentative Workload
7. Wrong Evaluation Technique
Common Mistakes (Cont)
8. Overlook Important Parameters
9. Ignore Significant Factors
10. Inappropriate Experimental Design
11. Inappropriate Level of Detail
12. No Analysis
13. Erroneous Analysis
14. No Sensitivity Analysis
15. Ignoring Errors in Input
16. Improper Treatment of Outliers
17. Assuming No Change in the Future
18. Ignoring Variability
19. Too Complex Analysis
Common Mistakes (Cont)
20. Improper Presentation of Results
21. Ignoring Social Aspects
22. Omitting Assumptions and Limitations
Checklist for Avoiding Common Mistakes
1. Is the system correctly defined and the goals clearly stated?
2. Are the goals stated in an unbiased manner?
3. Have all the steps of the analysis been followed systematically?
4. Is the problem clearly understood before analyzing it?
5. Are the performance metrics relevant for this problem?
6. Is the workload correct for this problem?
7. Is the evaluation technique appropriate?
8. Is the list of parameters that affect performance complete?
9. Have all parameters that affect performance been chosen as factors to be varied?
Checklist (Cont)
10. Is the experimental design efficient in terms of time and results?
11. Is the level of detail proper?
12. Is the measured data presented with analysis and interpretation?
13. Is the analysis statistically correct?
14. Has the sensitivity analysis been done?
15. Would errors in the input cause an insignificant change in the results?
16. Have the outliers in the input or output been treated properly?
17. Have the future changes in the system and workload been modeled?
18. Has the variance of input been taken into account?
Checklist (Cont)
19. Has the variance of the results been analyzed?
20. Is the analysis easy to explain?
21. Is the presentation style suitable for its audience?
22. Have the results been presented graphically as much as possible?
23. Are the assumptions and limitations of the analysis clearly documented?
A Systematic Approach to Performance Evaluation
1. State Goals and Define the System
2. List Services and Outcomes
3. Select Metrics
4. List Parameters
5. Select Factors to Study
6. Select Evaluation Technique
7. Select Workload
8. Design Experiments
9. Analyze and Interpret Data
10. Present Results
Repeat as needed.
Criteria for Selecting an Evaluation Technique
Three Rules of Validation
Do not trust the results of an analytical model until they have been validated by a simulation model or measurements.
Do not trust the results of a simulation model until they have been validated by analytical modeling or measurements.
Do not trust the results of a measurement until they have been validated by simulation or analytical modeling.
Selecting Performance Metrics
Selecting Metrics
Include:
Performance: time, rate, resource
Error rate, probability
Time to failure and duration
Consider including:
Mean and variance
Individual and global
Selection criteria:
Low variability
Non-redundancy
Completeness
Case Study: Two Congestion Control Algorithms
Service: Send packets from a specified source to a specified destination, in order.
Possible outcomes:
Some packets are delivered in order to the correct destination.
Some packets are delivered out of order to the destination.
Some packets are delivered more than once (duplicates).
Some packets are dropped on the way (lost packets).
Case Study (Cont)
Performance: for packets delivered in order, time-rate-resource:
Response time to deliver the packets
Throughput: the number of packets per unit of time
Processor time per packet on the source end system
Processor time per packet on the destination end system
Processor time per packet on the intermediate systems
Variability of the response time ⇒ retransmissions
Response time: the delay inside the network
Case Study (Cont)
Out-of-order packets consume buffers ⇒ probability of out-of-order arrivals
Duplicate packets consume network resources ⇒ probability of duplicate packets
Lost packets require retransmission ⇒ probability of lost packets
Too much loss causes disconnection ⇒ probability of disconnect
Case Study (Cont)
Shared resource ⇒ fairness
Fairness index properties:
Always lies between 0 and 1.
Equal throughputs ⇒ fairness = 1.
If k of n users receive equal throughput x and the remaining n−k users receive zero throughput, the fairness index is k/n.
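The fairness index with these properties is Jain's index, f(x) = (Σxi)² / (n·Σxi²); a quick sketch:

```python
def fairness_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), always in (0, 1]."""
    n = len(throughputs)
    s = sum(throughputs)
    sq = sum(x * x for x in throughputs)
    return s * s / (n * sq)

# Equal throughputs give index 1; k of n users with equal share and the
# rest with zero give k/n.
print(fairness_index([10, 10, 10, 10]))  # 1.0
print(fairness_index([10, 10, 0, 0]))    # 0.5, i.e., k/n with k=2, n=4
```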
Case Study (Cont)
Throughput and delay were found redundant ⇒ use power (the ratio of throughput to response time).
Variance in response time was found redundant with the probability of duplication and the probability of disconnection.
Total: nine metrics.
Commonly Used Performance Metrics
Response time and Reaction time
Response Time (Cont)
Capacity
Common Performance Metrics (Cont)
Nominal capacity: maximum achievable throughput under ideal workload conditions, e.g., bandwidth in bits per second. The response time at maximum throughput is too high.
Usable capacity: maximum throughput achievable without exceeding a pre-specified response-time limit.
Knee capacity: the knee = low response time and high throughput.
Common Performance Metrics (Cont)
Turnaround time: the time between the submission of a batch job and the completion of its output.
Stretch factor: the ratio of the response time with multiprogramming to that without multiprogramming.
Throughput: rate (requests per unit of time). Examples:
Jobs per second
Requests per second
Millions of Instructions Per Second (MIPS)
Millions of Floating Point Operations Per Second (MFLOPS)
Packets Per Second (PPS)
Bits per second (bps)
Transactions Per Second (TPS)
Common Performance Metrics (Cont)
Efficiency: the ratio of usable capacity to nominal capacity; or, the ratio of the performance of an n-processor system to that of a one-processor system.
Utilization: the fraction of time the resource is busy servicing requests. For memory, the average fraction used.
Common Performance Metrics (Cont)
Reliability:
Probability of errors
Mean time between errors (error-free seconds)
Availability:
Mean Time to Failure (MTTF)
Mean Time to Repair (MTTR)
Availability = MTTF/(MTTF + MTTR)
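The availability formula can be checked with a one-liner; the MTTF/MTTR values below are hypothetical:

```python
def availability(mttf, mttr):
    # Steady-state availability = MTTF / (MTTF + MTTR).
    return mttf / (mttf + mttr)

# Hypothetical numbers: 1000 h between failures, 2 h to repair.
print(round(availability(1000.0, 2.0), 4))  # 0.998
```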
Utility Classification of Metrics
Setting Performance Requirements
Examples:
"The system should be both processing and memory efficient. It should not create excessive overhead."
"There should be an extremely low probability that the network will duplicate a packet, deliver a packet to the wrong destination, or change the data in a packet."
Problems: such requirements are non-specific, non-measurable, non-acceptable, non-realizable, or non-thorough.
Requirements should instead be SMART: Specific, Measurable, Acceptable, Realizable, and Thorough.
Case Study 3.2: Local Area Networks
Service: send frame to destination D.
Outcomes:
Frame is correctly delivered to D
Incorrectly delivered
Not delivered at all
Requirements:
Speed:
The access delay at any station should be less than one second.
Sustained throughput must be at least 80 Mbits/sec.
Reliability: five different error modes, with different amounts of damage and different levels of acceptability.
Case Study (Cont)
The probability of any bit being in error must be less than 1E-7.
The probability of any frame being in error (with error indication set) must be less than 1%.
The probability of a frame in error being delivered without error indication must be less than 1E-15.
The probability of a frame being misdelivered due to an undetected error in the destination address must be less than 1E-18.
The probability of a frame being delivered more than once (duplicate) must be less than 1E-5.
The probability of losing a frame on the LAN (due to all sorts of errors) must be less than 1%.
Case Study (Cont)
Availability: two fault modes, network reinitializations and permanent failures.
The mean time to initialize the LAN must be less than 15 milliseconds.
The mean time between LAN initializations must be at least one minute.
The mean time to repair a LAN must be less than one hour. (LAN partitions may be operational during this period.)
The mean time between LAN partitions must be at least half a week.
Measurement Techniques and Tools
"Measurements are not to provide numbers but insight." - Ingrid Bucher
1. What are the different types of workloads?
2. Which workloads are commonly used by other analysts?
3. How are the appropriate workload types selected?
4. How is the measured workload data summarized?
5. How is the system performance monitored?
6. How can the desired workload be placed on the system in a controlled manner?
7. How are the results of the evaluation presented?
Terminology
Test workload: any workload used in performance studies; it can be real or synthetic.
Real workload: one observed on a system being used for normal operations.
Synthetic workload:
Similar to the real workload
Can be applied repeatedly in a controlled manner
No large real-world data files
No sensitive data
Easily modified without affecting operation
Easily ported to different systems due to its small size
May have built-in measurement capabilities
Test Workloads for Computer Systems
1. Addition Instruction
2. Instruction Mixes
3. Kernels
4. Synthetic Programs
5. Application Benchmarks
Addition Instruction
Processors were the most expensive and most-used components of the system.
Addition was the most frequent instruction, so it was used as the earliest test workload.
Instruction Mixes
Instruction mix = instructions + usage frequencies.
Gibson mix: developed by Jack C. Gibson in 1959 for IBM 704 systems.
Instruction Mixes (Cont)
Disadvantages:
Complex classes of instructions are not reflected in the mixes.
Instruction time varies with:
Addressing modes
Cache hit rates
Pipeline efficiency
Interference from other devices during processor-memory access cycles
Parameter values, e.g., the frequency of zeros as a parameter, the distribution of zero digits in a multiplier, the average number of positions of preshift in floating-point add, and the number of times a conditional branch is taken
Instruction Mixes (Cont)
Performance metrics:
MIPS = Millions of Instructions Per Second
MFLOPS = Millions of Floating Point Operations Per Second
Kernels
Kernel = nucleus; the most frequent function.
Commonly used kernels: Sieve, Puzzle, Tree Searching, Ackermann's Function, Matrix Inversion, and Sorting.
Disadvantage: kernels do not make use of I/O devices.
Synthetic Programs
The need to measure I/O performance led analysts to exerciser loops.
The first exerciser loop was by Buchholz (1969), who called it a synthetic program.
A sample exerciser: see the program listing in Figure 4.1 of the book.
Synthetic Programs
Advantages:
Quickly developed and given to different vendors
No real data files
Easily modified and ported to different systems
Built-in measurement capabilities
Measurement process is automated
Repeated easily on successive versions of the operating system
Disadvantages:
Too small
Do not make representative memory or disk references
Mechanisms for page faults and disk cache may not be adequately exercised
CPU-I/O overlap may not be representative
Loops may create synchronizations ⇒ better or worse performance
Application Benchmarks
For a particular industry: Debit-Credit for banks.
Benchmark = workload (except instruction mixes).
Some authors: benchmark = a set of programs taken from real workloads.
Popular Benchmarks
Sieve
Based on Eratosthenes' sieve algorithm: find all prime numbers below a given number n.
Algorithm:
Write down all integers from 1 to n.
Strike out all multiples of k, for k = 2, 3, …, √n.
Example: Write down all numbers from 1 to 20. Mark all as prime:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Remove all multiples of 2 from the list of primes:
1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19
Sieve (Cont)
The next integer in the sequence is 3. Remove all multiples of 3:
1, 2, 3, 5, 7, 11, 13, 17, 19
The next candidate is 5; since 5² = 25 > 20, stop.
Pascal program to implement the sieve kernel: see the program listing in Figure 4.2 of the book.
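The Pascal listing is not reproduced here; a Python sketch of the same sieve kernel (unlike the worked example above, it drops 1, which is not prime):

```python
def sieve(n):
    """Return all primes <= n using Eratosthenes' sieve."""
    prime = [True] * (n + 1)
    prime[0:2] = [False, False]      # 0 and 1 are not prime
    k = 2
    while k * k <= n:                # stop once k^2 > n
        if prime[k]:
            for m in range(k * k, n + 1, k):
                prime[m] = False     # strike out multiples of k
        k += 1
    return [i for i, p in enumerate(prime) if p]

print(sieve(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```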
Other Benchmarks
Whetstone
U.S. Steel
LINPACK
Dhrystone
Doduc
TOP
Lawrence Livermore Loops
Digital Review Labs
Abingdon Cross Image-Processing Benchmark
SPEC Benchmark Suite
Systems Performance Evaluation Cooperative (SPEC): a non-profit corporation formed by leading computer vendors to develop a standardized set of benchmarks.
Release 1.0 consists of the following 10 benchmarks: GCC, Espresso, Spice 2g6, Doduc, NASA7, LI, Eqntott, Matrix300, Fpppp, and Tomcatv.
These primarily stress the CPU, the Floating Point Unit (FPU), and to some extent the memory subsystem ⇒ they compare CPU speeds.
Benchmarks to compare I/O and other subsystems may be included in future releases.
SPEC (Cont)
The elapsed time to run two copies of a benchmark on each of the N processors of a system (a total of 2N copies) is measured and compared with the time to run two copies of the benchmark on a reference system (which is VAX-11/780 for Release 1.0).
For each benchmark, the ratio of the time on the reference system to the time on the system under test is reported as the SPECthruput, using the notation #CPU@Ratio. For example, a system with three CPUs taking 1/15 as long as the reference system on the GCC benchmark has a SPECthruput of 3@15.
Measure of the per processor throughput relative to the reference system
SPEC (Cont)
The aggregate throughput for all processors of a multiprocessor system can be obtained by multiplying the ratio by the number of processors. For example, the aggregate throughput for the above system is 45.
The geometric mean of the SPECthruputs for the 10 benchmarks is used to indicate the overall performance for the suite and is called SPECmark.
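The SPECmark computation (geometric mean of per-benchmark ratios) can be sketched as follows; the ratios below are made-up illustrative values, not real SPEC results:

```python
import math

def specmark(ratios):
    """Geometric mean of the SPECthruput ratios for a benchmark suite."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical ratios for three benchmarks only (a real suite has ten).
print(round(specmark([15.0, 12.0, 20.0]), 2))  # geometric mean, ~15.33
```

The geometric mean is used because the per-benchmark values are ratios; it is the only mean for which the mean of the ratios equals the ratio of the means.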
The Art of Workload Selection
Services Exercised
Example: Timesharing Systems
Example: Networks
Example: Magnetic Tape Backup System
Level of Detail
Representativeness
Timeliness
Other Considerations in Workload Selection
The Art of Workload Selection
Considerations:
Services exercised
Level of detail
Loading level
Impact of other components
Timeliness
Services Exercised
SUT = System Under Test
CUS = Component Under Study
Services Exercised (Cont)
Do not confuse the SUT with the CUS.
Metrics depend upon the SUT: MIPS is fine for comparing two CPUs but not for comparing two timesharing systems.
The workload depends upon the system. Examples:
CPU: instructions
System: transactions
Transactions are not a good workload for a CPU, and vice versa.
For two systems identical except for the CPU:
Comparing the systems: use transactions
Comparing the CPUs: use instructions
Multiple services: exercise as complete a set of services as possible.
Example: Timesharing Systems
Applications ⇒ application benchmark
Operating system ⇒ synthetic program
Central processing unit ⇒ instruction mixes
Arithmetic logic unit ⇒ addition instruction
Example: Networks
Level of Detail (Cont)
Average resource demand:
Used for analytical modeling
Group similar services into classes
Distribution of resource demands:
Used if the variance is large
Used if the distribution impacts the performance
Workloads used in simulation and analytical modeling:
Non-executable: used in analytical/simulation modeling
Executable: can be executed directly on a system
Representativeness
The test workload and the real workload should have the same:
Elapsed time
Resource demands
Resource usage profile: the sequence in which, and the amounts by which, the different resources are used
Timeliness
Users are a moving target: new systems ⇒ new workloads.
Users tend to optimize their demand: fast multiplication ⇒ a higher frequency of multiplication instructions.
It is important to monitor user behavior on an ongoing basis.
Other Considerations in Workload Selection
Loading level: a workload may exercise a system to:
Its full capacity (best case)
Beyond its capacity (worst case)
The load level observed in the real workload (typical case)
For procurement purposes ⇒ typical case; for design ⇒ best through worst, all cases.
Impact of external components: do not use a workload that makes an external component the bottleneck, since then all alternatives in the system give equally good performance.
Repeatability
Workload Characterization Techniques
Terminology
Components and parameter selection
Characterization techniques: averaging, single-parameter histograms, multi-parameter histograms, principal component analysis, Markov models, clustering
Clustering methods: minimum spanning tree, nearest centroid
Problems with clustering
Terminology
User = the entity that makes the service request.
Workload components: applications, sites, user sessions.
Workload parameters (workload features): measured quantities, service requests, or resource demands. For example: transaction types, instructions, packet sizes, source-destination pairs of packets, and page-reference patterns.
Components and Parameter Selection
The workload component should be at the SUT interface.
Each component should represent as homogeneous a group as possible. Combining very different users into a site workload may not be meaningful.
The domain of control affects the component. Example: mail-system designers are more interested in determining a typical mail session than a typical user session.
Do not use parameters that depend upon the system, e.g., elapsed time or CPU time.
Components (Cont)
Characteristics of service requests:
Arrival time
Type of request, or the resource demanded
Duration of the request
Quantity of the resource demanded, for example, pages of memory
Exclude those parameters that have little impact.
Workload Characterization Techniques
1. Averaging
2. Single-Parameter Histograms
3. Multi-parameter Histograms
4. Principal Component Analysis
5. Markov Models
6. Clustering
Averaging
Mean
Standard deviation
Coefficient of variation: C.O.V. = s / x̄
Mode (for categorical variables): the most frequent value
Median: the 50-percentile
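All of these summary statistics are in the Python standard library; a small sketch on made-up CPU-time data:

```python
import statistics

# Hypothetical CPU-time sample (seconds) for one workload component.
x = [2.0, 3.0, 3.0, 5.0, 7.0]

mean = statistics.mean(x)
sd = statistics.stdev(x)   # sample standard deviation
cov = sd / mean            # coefficient of variation: C.O.V. = s / mean
mode = statistics.mode(x)  # most frequent value
median = statistics.median(x)

print(mean, sd, cov, mode, median)  # 4.0 2.0 0.5 3.0 3.0
```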
Case Study: Program Usage in Educational Environments
High coefficient of variation.
Characteristics of an Average Editing Session
Reasonable variation.
Single-Parameter Histograms
n buckets × m parameters × k components gives n·m·k values.
Use only if the variance is high.
Ignores correlation among parameters.
Multi-parameter Histograms
It is difficult to plot joint histograms for more than two parameters.
Principal Component Analysis
Key idea: use a weighted sum of the parameters to classify the components.
Let xij denote the ith parameter of the jth component:
yj = Σi=1..n wi xij
Principal component analysis assigns the weights wi such that the yj provide the maximum discrimination among the components.
The quantity yj is called the principal factor. The factors are ordered: the first factor explains the highest percentage of the variance.
Principal Component Analysis (Cont)
Statistically:
The y's are linear combinations of the x's:
yi = Σj=1..n aij xj
Here, aij is called the loading of variable xj on factor yi.
The y's form an orthogonal set, that is, their inner product is zero:
⟨yi, yj⟩ = Σk aik ajk = 0
This is equivalent to stating that the yi's are uncorrelated with each other.
The y's form an ordered set such that y1 explains the highest percentage of the variance in resource demands.
Finding Principal Factors
1. Find the correlation matrix.
2. Find the eigenvalues of the matrix and sort them in order of decreasing magnitude.
3. Find the corresponding eigenvectors. These give the required loadings.
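For the two-parameter case used in the example that follows, the 2×2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r, with eigenvectors (1, 1)/√2 and (1, −1)/√2. A sketch of the whole procedure on that special case, assuming a positive correlation (the function name and data are illustrative only):

```python
import math

def principal_factors_2d(xs, xr):
    """Two-parameter PCA: normalize, correlate, and form the first factor."""
    n = len(xs)
    mx, mr = sum(xs) / n, sum(xr) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs) / (n - 1))
    sr = math.sqrt(sum((v - mr) ** 2 for v in xr) / (n - 1))
    xs_n = [(v - mx) / sx for v in xs]   # zero mean, unit variance
    xr_n = [(v - mr) / sr for v in xr]
    r = sum(a * b for a, b in zip(xs_n, xr_n)) / (n - 1)  # correlation
    lam1, lam2 = 1 + r, 1 - r            # eigenvalues of [[1, r], [r, 1]]
    c = 1 / math.sqrt(2)                 # eigenvector component (length 1)
    y1 = [c * (a + b) for a, b in zip(xs_n, xr_n)]  # first principal factor
    return lam1, lam2, y1

# Perfectly correlated toy data: the first factor explains everything.
lam1, lam2, y1 = principal_factors_2d([1.0, 2.0, 3.0, 4.0],
                                      [2.0, 4.0, 6.0, 8.0])
```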
Principal Component Example
Principal Component Example
Compute the mean and standard deviation of each variable:
Principal Component (Cont)
Similarly:
Principal Component (Cont)
Normalize the variables to zero mean and unit standard deviation. The normalized values x's and x'r are given by:
Principal Component (Cont)
Compute the correlation among the variables:
Prepare the correlation matrix:
Principal Component (Cont)
Compute the eigenvalues of the correlation matrix by solving its characteristic equation.
The eigenvalues are 1.916 and 0.084.
Principal Component (Cont)
Compute the eigenvectors of the correlation matrix. The eigenvector q1 corresponding to λ1 = 1.916 is defined by the relationship:
C q1 = λ1 q1
which gives:
q11 = q21
Principal Component (Cont)
Restricting the length of the eigenvectors to one:
Obtain the principal factors by multiplying the eigenvectors by the normalized vectors.
Compute the values of the principal factors, then their sum and sum of squares.
Principal Component (Cont)
The sum must be zero.
The sums of squares give the percentage of variation explained.
Principal Component (Cont)
The first factor explains 32.565/(32.565 + 1.435), or 95.7%, of the variation.
The second factor explains only 4.3% of the variation and can thus be ignored.
Markov Models
Markov ⇒ the next request depends only on the last request.
Described by a transition matrix.
Transition matrices can also be used for application transitions, e.g., P(Link | Compile).
Used to specify page-reference locality: P(Reference module i | Referenced module j).
Transition Probability
Given the same relative frequency of requests of different types, the frequency can be realized with several different transition matrices.
If order is important, measure the transition probabilities directly on the real system.
Example: two packet sizes: small (80%), large (20%). An average of four small packets is followed by an average of one big packet, e.g., ssssbssssbssss.
Transition Probability (Cont)
Alternative: eight small packets followed by two big packets.
Or generate a random number x: x < 0.8 ⇒ generate a small packet; otherwise generate a large packet.
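A sketch contrasting this with a Markov model: the transition probabilities below are hypothetical, chosen so the chain is 80% small packets in steady state yet, unlike independent draws, never emits two large packets in a row:

```python
import random

# Hypothetical transition probabilities: after a small packet, stay small
# with probability 0.75; after a big packet, always return to small. The
# stationary distribution is 80% small, 20% big.
P = {"s": {"s": 0.75, "b": 0.25},
     "b": {"s": 1.00, "b": 0.00}}

def generate(n, seed=1):
    """Generate a packet-size trace from the transition matrix P."""
    random.seed(seed)
    state, out = "s", []
    for _ in range(n):
        out.append(state)
        state = "s" if random.random() < P[state]["s"] else "b"
    return "".join(out)

trace = generate(10000)
# Because P("b"|"b") = 0, the substring "bb" never occurs.
```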
Clustering
Clustering Steps
1. Take a sample, that is, a subset of workload components.
2. Select workload parameters.
3. Select a distance measure.
4. Remove outliers.
5. Scale all observations.
6. Perform clustering.
7. Interpret results.
8. Change parameters, or number of clusters, and repeat steps 3-7.
9. Select representative components from each cluster.
1. Sampling
In one study, 2% of the population was chosen for analysis; later, 99% of the population could be assigned to the clusters obtained.
Random selection
Select the top consumers of a resource
2. Parameter Selection
Criteria: impact on performance, variance.
Method: redo the clustering with one less parameter.
Principal component analysis: identify the parameters with the highest variance.
3. Transformation
If the distribution is highly skewed, consider a function of the parameter, e.g., the log of CPU time.
4. Outliers
Outliers = data points with extreme parameter values.
They affect normalization.
Exclude them only if they do not consume a significant portion of the system resources. Example: backup.
5. Data Scaling
1. Normalize to zero mean and unit variance:
x'ik = (xik − x̄k) / sk
2. Weights:
x'ik = wk xik, where wk ∝ relative importance, or wk = 1/sk
3. Range normalization:
x'ik = (xik − xmin,k) / (xmax,k − xmin,k)
Affected by outliers.
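The first and third scaling rules can be sketched as follows (a minimal version; the data values are made up):

```python
import math

def zero_mean_unit_variance(xs):
    """Scale so the parameter has mean 0 and standard deviation 1."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((v - m) ** 2 for v in xs) / (n - 1))
    return [(v - m) / s for v in xs]

def range_normalize(xs):
    """Scale into [0, 1]; the min/max make this outlier-sensitive."""
    lo, hi = min(xs), max(xs)
    return [(v - lo) / (hi - lo) for v in xs]

print(zero_mean_unit_variance([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
print(range_normalize([2.0, 4.0, 6.0]))          # [0.0, 0.5, 1.0]
```

A single extreme value stretches the range-normalized scale for every other observation, which is why outliers are removed (step 4 above) before scaling.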
Data Scaling (Cont)
4. Percentile normalization
Distance Metric
1. Euclidean distance: given {xi1, xi2, …, xin} and {xj1, xj2, …, xjn},
d = [ Σk=1..n (xik − xjk)² ]^½
2. Weighted Euclidean distance:
d = [ Σk=1..n ak (xik − xjk)² ]^½
Here ak, k = 1, 2, …, n, are suitably chosen weights for the n parameters.
3. Chi-square distance
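The two Euclidean variants can be sketched as (the weights in the usage line are illustrative):

```python
import math

def euclidean(xi, xj):
    """Plain Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def weighted_euclidean(xi, xj, w):
    """Euclidean distance with per-parameter weights ak."""
    return math.sqrt(sum(ak * (a - b) ** 2 for ak, a, b in zip(w, xi, xj)))

print(euclidean([0, 0], [3, 4]))                  # 5.0
print(weighted_euclidean([0, 0], [3, 0], [4, 1])) # 6.0
```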
Distance Metric (Cont)
The Euclidean distance is the most commonly used distance metric.
The weighted Euclidean is used if the parameters have not been scaled or if the parameters have significantly different levels of importance.
Use the chi-square distance only if the column sums x·k are close to each other; parameters with low values of x·k get higher weights.
Clustering Techniques
Goal: Partition into groups so the members of a group are as similar as possible and different groups are as dissimilar as possible.
Statistically, the intra-group variance should be as small as possible, and the inter-group variance should be as large as possible.
Total Variance = Intra-group Variance + Inter-group Variance
Clustering Techniques (Cont)
Nonhierarchical techniques: start with an arbitrary set of k clusters and move members until the intra-group variance is minimum.
Hierarchical techniques:
Agglomerative: start with n clusters and merge.
Divisive: start with one cluster and divide.
Two popular techniques:
Minimum spanning tree method (agglomerative)
Centroid method (divisive)
Minimum Spanning Tree Clustering Method
1. Start with k = n clusters.
2. Find the centroid of the ith cluster, i=1, 2, …, k.
3. Compute the inter-cluster distance matrix.
4. Merge the nearest clusters.
5. Repeat steps 2 through 4 until all components are part of one cluster.
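A sketch of steps 1–5. It uses squared Euclidean distances, which appears to be what the distance values in the example that follows use (e.g., the minimum inter-cluster distance of 2 between A = {2, 4} and B = {3, 5}); the coordinates are the five programs from that example:

```python
def centroid(cluster, points):
    xs = [points[m][0] for m in cluster]
    ys = [points[m][1] for m in cluster]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def sq_dist(p, q):
    # Squared Euclidean distance between two centroids.
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def agglomerate(points):
    """Repeatedly merge the two nearest clusters until one remains."""
    clusters = [[name] for name in points]   # step 1: k = n clusters
    merges = []
    while len(clusters) > 1:
        cents = [centroid(c, points) for c in clusters]   # step 2
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]    # step 3
        i, j = min(pairs, key=lambda ij: sq_dist(cents[ij[0]], cents[ij[1]]))
        merges.append((clusters[i], clusters[j]))         # step 4
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges                                         # step 5 done

pts = {"A": (2, 4), "B": (3, 5), "C": (1, 6), "D": (4, 3), "E": (5, 2)}
merges = agglomerate(pts)
```

Recording the merge order and distances is exactly the information a dendrogram plots.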
Minimum Spanning Tree Example
Step 1: Consider five clusters with ith cluster consisting solely of ith program.
Step 2: The centroids are {2, 4}, {3, 5}, {1, 6}, {4, 3}, and {5, 2}.
Spanning Tree Example (Cont)
Step 3: The Euclidean distance is:
Step 4: Minimum inter-cluster distance = 2. Merge A+B, D+E.
Spanning Tree Example (Cont)
Step 2: The centroid of cluster pair AB is {(2+3)/2, (4+5)/2}, that is, {2.5, 4.5}. Similarly, the centroid of pair DE is {4.5, 2.5}.
Spanning Tree Example (Cont)
Step 3: The distance matrix is:
Step 4: Merge AB and C.
Step 2: The centroid of cluster ABC is {(2+3+1)/3, (4+5+6)/3}, that is, {2, 5}.
Spanning Tree Example (Cont)
Step 3: The distance matrix is:
Step 4: The minimum distance is 12.5. Merge ABC and DE ⇒ a single cluster ABCDE.
Dendrogram
Dendrogram = spanning tree of the merges.
Purpose: obtain clusters for any given maximum allowable intra-cluster distance.
Nearest Centroid Method
1. Start with k = 1.
2. Find the centroid and the intra-cluster variance for the ith cluster, i = 1, 2, …, k.
3. Find the cluster with the highest variance and arbitrarily divide it into two clusters: find the two components that are farthest apart and assign the other components according to their distance from these points; place all components below the centroid in one cluster and all components above this hyperplane in the other.
4. Adjust the points in the two new clusters until the inter-cluster distance between them is maximum.
5. Set k = k + 1. Repeat steps 2 through 4 until k = n.
Cluster Interpretation
Assign all measured components to the clusters.
Clusters with very small populations and small total resource demands can be discarded. (Don't discard a cluster merely because it is small.)
Interpret the clusters in functional terms, e.g., a business application, or label them by their resource demands, for example, CPU-bound, I/O-bound, and so forth.
Select one or more representative components from each cluster for use as the test workload.
Problems with Clustering
Problems with Clustering (Cont)
Goal: minimize variance. Yet the results of clustering are highly variable; there are no rules for:
Selection of parameters
Distance measure
Scaling
Labeling each cluster by functionality is difficult. In one study, editing programs appeared in 23 different clusters.
Clustering requires many repetitions of the analysis.