CSE 567M Computer Systems Analysis
1-1©2006 Raj Jain www.rajjain.com
CSE 567M Computer Systems Analysis
Text Book
R. Jain, "The Art of Computer Systems Performance Analysis," Wiley, 1991, ISBN 0-471-50336-3. (Winner of the "1992 Best Computer Systems Book" award from the Computer Press Association.)
Objectives: What You Will Learn
Specifying performance requirements
Evaluating design alternatives
Comparing two or more systems
Determining the optimal value of a parameter (system tuning)
Finding the performance bottleneck (bottleneck identification)
Characterizing the load on the system (workload characterization)
Determining the number and sizes of components (capacity planning)
Predicting the performance at future loads (forecasting)
Basic Terms
System: Any collection of hardware, software, and firmware components.
Metrics: Criteria used to evaluate the performance of the system.
Workloads: The requests made by the users of the system.
Main Parts of the Course
An Overview of Performance Evaluation
Measurement Techniques and Tools
Experimental Design and Analysis
Measurement Techniques and Tools
Types of Workloads
Popular Benchmarks
The Art of Workload Selection
Workload Characterization Techniques
Monitors
Accounting Logs
Monitoring Distributed Systems
Load Drivers
Capacity Planning
The Art of Data Presentation
Ratio Games
Example
Which type of monitor (software or hardware) would be more suitable for measuring each of the following quantities:
Number of instructions executed by a processor?
Degree of multiprogramming on a timesharing system?
Response time of packets on a network?
Example
The performance of a system depends on the following three factors:
Garbage collection technique used: G1, G2, or none.
Type of workload: editing, computing, or AI.
Type of CPU: C1, C2, or C3.
How many experiments are needed? How does one estimate the performance impact of each factor?
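The first question can be answered directly: a full factorial design runs every combination of the three factors. A minimal sketch (the level names are taken from the example; the variable names are illustrative):

```python
from itertools import product

# Levels of the three factors from the example.
gc_levels = ["G1", "G2", "none"]            # garbage collection technique
workloads = ["editing", "computing", "AI"]  # workload type
cpus = ["C1", "C2", "C3"]                   # CPU type

# A full factorial design runs every combination once: 3 * 3 * 3 = 27.
experiments = list(product(gc_levels, workloads, cpus))
print(len(experiments))  # 27
```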
Example
The average response time of a database system is three seconds. During a one-minute observation interval, the idle time on the system was ten seconds. Using a queueing model for the system, determine the following:
System utilization
Average service time per query
Number of queries completed during the observation interval
Average number of jobs in the system
Probability of the number of jobs in the system being greater than 10
90-percentile response time
90-percentile waiting time
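A sketch of the calculation, assuming an open M/M/1 queue (an assumption; the slide says only "a queueing model"):

```python
import math

R = 3.0      # mean response time, seconds
T = 60.0     # observation interval, seconds
idle = 10.0  # observed idle time, seconds

busy = T - idle
U = busy / T                 # utilization: 50/60 ~ 0.833
S = R * (1 - U)              # M/M/1: R = S/(1 - U), so S = 0.5 s per query
C = busy / S                 # completions: busy time / service time = 100
X = C / T                    # throughput, queries per second
N = X * R                    # Little's law: mean jobs in system = 5
p_gt_10 = U ** 11            # M/M/1: P(n > k) = U^(k+1)
r90 = R * math.log(10)       # M/M/1 90-percentile response time, ~6.9 s
w90 = R * math.log(10 * U)   # M/M/1 90-percentile waiting time, ~6.4 s
```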
Common Mistakes in Evaluation
1. No Goals: there is no general-purpose model; the goals determine the techniques, metrics, and workload. Setting goals is not trivial.
2. Biased Goals: e.g., "to show that OUR system is better than THEIRS." Analysts should be impartial, like a jury.
3. Unsystematic Approach
4. Analysis Without Understanding the Problem
5. Incorrect Performance Metrics
6. Unrepresentative Workload
7. Wrong Evaluation Technique
Common Mistakes (Cont)
8. Overlook Important Parameters
9. Ignore Significant Factors
10. Inappropriate Experimental Design
11. Inappropriate Level of Detail
12. No Analysis
13. Erroneous Analysis
14. No Sensitivity Analysis
15. Ignoring Errors in Input
16. Improper Treatment of Outliers
17. Assuming No Change in the Future
18. Ignoring Variability
19. Too Complex Analysis
Common Mistakes (Cont)
20. Improper Presentation of Results
21. Ignoring Social Aspects
22. Omitting Assumptions and Limitations
Checklist for Avoiding Common Mistakes
1. Is the system correctly defined and the goals clearly stated?
2. Are the goals stated in an unbiased manner?
3. Have all the steps of the analysis been followed systematically?
4. Is the problem clearly understood before analyzing it?
5. Are the performance metrics relevant for this problem?
6. Is the workload correct for this problem?
7. Is the evaluation technique appropriate?
8. Is the list of parameters that affect performance complete?
9. Have all parameters that affect performance been chosen as factors to be varied?
Checklist (Cont)
10. Is the experimental design efficient in terms of time and results?
11. Is the level of detail proper?
12. Is the measured data presented with analysis and interpretation?
13. Is the analysis statistically correct?
14. Has the sensitivity analysis been done?
15. Would errors in the input cause an insignificant change in the results?
16. Have the outliers in the input or output been treated properly?
17. Have the future changes in the system and workload been modeled?
18. Has the variance of input been taken into account?
Checklist (Cont)
19. Has the variance of the results been analyzed?
20. Is the analysis easy to explain?
21. Is the presentation style suitable for its audience?
22. Have the results been presented graphically as much as possible?
23. Are the assumptions and limitations of the analysis clearly documented?
A Systematic Approach to Performance Evaluation
1. State Goals and Define the System
2. List Services and Outcomes
3. Select Metrics
4. List Parameters
5. Select Factors to Study
6. Select Evaluation Technique
7. Select Workload
8. Design Experiments
9. Analyze and Interpret Data
10. Present Results
Repeat as needed.
Criteria for Selecting an Evaluation Technique
Three Rules of Validation
Do not trust the results of an analytical model until they have been validated by a simulation model or measurements.
Do not trust the results of a simulation model until they have been validated by analytical modeling or measurements.
Do not trust the results of a measurement until they have been validated by simulation or analytical modeling.
Selecting Performance Metrics
Selecting Metrics
Include:
Performance: time, rate, resource
Error rate, probability
Time to failure and duration
Consider including:
Mean and variance
Individual and global
Selection criteria:
Low variability
Non-redundancy
Completeness
Case Study: Two Congestion Control Algorithms
Service: Send packets from a specified source to a specified destination, in order.
Possible outcomes:
Some packets are delivered in order to the correct destination.
Some packets are delivered out of order to the destination.
Some packets are delivered more than once (duplicates).
Some packets are dropped on the way (lost packets).
Case Study (Cont)
Performance: for packets delivered in order, time-rate-resource:
Response time to deliver the packets
Throughput: the number of packets per unit of time
Processor time per packet on the source end system
Processor time per packet on the destination end system
Processor time per packet on the intermediate systems
Variability of the response time ⇒ retransmissions
Response time: the delay inside the network
Case Study (Cont)
Out-of-order packets consume buffers ⇒ probability of out-of-order arrivals
Duplicate packets consume network resources ⇒ probability of duplicate packets
Lost packets require retransmission ⇒ probability of lost packets
Too much loss causes disconnection ⇒ probability of disconnect
Case Study (Cont)
Shared resource ⇒ fairness
Fairness index properties:
Always lies between 0 and 1.
Equal throughputs ⇒ fairness = 1.
If k of n users receive equal throughput x and the remaining n−k users receive zero throughput, the fairness index is k/n.
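The fairness index with these properties is Jain's index, f(x) = (Σxi)² / (n·Σxi²); a quick sketch:

```python
def fairness_index(throughputs):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), always in (0, 1]."""
    n = len(throughputs)
    s = sum(throughputs)
    sq = sum(x * x for x in throughputs)
    return s * s / (n * sq)

# Equal throughputs give index 1; k of n users with equal share and the
# rest with zero give k/n.
print(fairness_index([10, 10, 10, 10]))  # 1.0
print(fairness_index([10, 10, 0, 0]))    # 0.5, i.e., k/n with k=2, n=4
```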
Case Study (Cont)
Throughput and delay were found redundant ⇒ use power (the ratio of throughput to response time).
Variance in response time was found redundant with the probability of duplication and the probability of disconnection.
Total: nine metrics.
Commonly Used Performance Metrics
Response time and Reaction time
Response Time (Cont)
Capacity
Common Performance Metrics (Cont)
Nominal capacity: maximum achievable throughput under ideal workload conditions, e.g., bandwidth in bits per second. The response time at maximum throughput is too high.
Usable capacity: maximum throughput achievable without exceeding a pre-specified response-time limit.
Knee capacity: the knee = low response time and high throughput.
Common Performance Metrics (Cont)
Turnaround time: the time between the submission of a batch job and the completion of its output.
Stretch factor: the ratio of the response time with multiprogramming to that without multiprogramming.
Throughput: rate (requests per unit of time). Examples:
Jobs per second
Requests per second
Millions of Instructions Per Second (MIPS)
Millions of Floating Point Operations Per Second (MFLOPS)
Packets Per Second (PPS)
Bits per second (bps)
Transactions Per Second (TPS)
Common Performance Metrics (Cont)
Efficiency: the ratio of usable capacity to nominal capacity; or, the ratio of the performance of an n-processor system to that of a one-processor system.
Utilization: the fraction of time the resource is busy servicing requests. For memory, the average fraction used.
Common Performance Metrics (Cont)
Reliability:
Probability of errors
Mean time between errors (error-free seconds)
Availability:
Mean Time to Failure (MTTF)
Mean Time to Repair (MTTR)
Availability = MTTF/(MTTF + MTTR)
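The availability formula can be checked with a one-liner; the MTTF/MTTR values below are hypothetical:

```python
def availability(mttf, mttr):
    # Steady-state availability = MTTF / (MTTF + MTTR).
    return mttf / (mttf + mttr)

# Hypothetical numbers: 1000 h between failures, 2 h to repair.
print(round(availability(1000.0, 2.0), 4))  # 0.998
```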
Utility Classification of Metrics
Setting Performance Requirements
Examples:
"The system should be both processing and memory efficient. It should not create excessive overhead."
"There should be an extremely low probability that the network will duplicate a packet, deliver a packet to the wrong destination, or change the data in a packet."
Problems: such requirements are non-specific, non-measurable, non-acceptable, non-realizable, or non-thorough.
Requirements should instead be SMART: Specific, Measurable, Acceptable, Realizable, and Thorough.
Case Study 3.2: Local Area Networks
Service: send frame to destination D.
Outcomes:
Frame is correctly delivered to D
Incorrectly delivered
Not delivered at all
Requirements:
Speed:
The access delay at any station should be less than one second.
Sustained throughput must be at least 80 Mbits/sec.
Reliability: five different error modes, with different amounts of damage and different levels of acceptability.
Case Study (Cont)
The probability of any bit being in error must be less than 1E-7.
The probability of any frame being in error (with error indication set) must be less than 1%.
The probability of a frame in error being delivered without error indication must be less than 1E-15.
The probability of a frame being misdelivered due to an undetected error in the destination address must be less than 1E-18.
The probability of a frame being delivered more than once (duplicate) must be less than 1E-5.
The probability of losing a frame on the LAN (due to all sorts of errors) must be less than 1%.
Case Study (Cont)
Availability: two fault modes, network reinitializations and permanent failures.
The mean time to initialize the LAN must be less than 15 milliseconds.
The mean time between LAN initializations must be at least one minute.
The mean time to repair a LAN must be less than one hour. (LAN partitions may be operational during this period.)
The mean time between LAN partitions must be at least half a week.
Measurement Techniques and Tools
"Measurements are not to provide numbers but insight." - Ingrid Bucher
1. What are the different types of workloads?
2. Which workloads are commonly used by other analysts?
3. How are the appropriate workload types selected?
4. How is the measured workload data summarized?
5. How is the system performance monitored?
6. How can the desired workload be placed on the system in a controlled manner?
7. How are the results of the evaluation presented?
Terminology
Test workload: any workload used in performance studies; it can be real or synthetic.
Real workload: one observed on a system being used for normal operations.
Synthetic workload:
Similar to the real workload
Can be applied repeatedly in a controlled manner
No large real-world data files
No sensitive data
Easily modified without affecting operation
Easily ported to different systems due to its small size
May have built-in measurement capabilities
Test Workloads for Computer Systems
1. Addition Instruction
2. Instruction Mixes
3. Kernels
4. Synthetic Programs
5. Application Benchmarks
Addition Instruction
Processors were the most expensive and most-used components of the system.
Addition was the most frequent instruction, so it was used as the earliest test workload.
Instruction Mixes
Instruction mix = instructions + usage frequencies.
Gibson mix: developed by Jack C. Gibson in 1959 for IBM 704 systems.
Instruction Mixes (Cont)
Disadvantages:
Complex classes of instructions are not reflected in the mixes.
Instruction time varies with:
Addressing modes
Cache hit rates
Pipeline efficiency
Interference from other devices during processor-memory access cycles
Parameter values, e.g., the frequency of zeros as a parameter, the distribution of zero digits in a multiplier, the average number of positions of preshift in floating-point add, and the number of times a conditional branch is taken
Instruction Mixes (Cont)
Performance metrics:
MIPS = Millions of Instructions Per Second
MFLOPS = Millions of Floating Point Operations Per Second
Kernels
Kernel = nucleus; the most frequent function.
Commonly used kernels: Sieve, Puzzle, Tree Searching, Ackermann's Function, Matrix Inversion, and Sorting.
Disadvantage: kernels do not make use of I/O devices.
Synthetic Programs
The need to measure I/O performance led analysts to exerciser loops.
The first exerciser loop was by Buchholz (1969), who called it a synthetic program.
A sample exerciser: see the program listing in Figure 4.1 of the book.
Synthetic Programs
Advantages:
Quickly developed and given to different vendors
No real data files
Easily modified and ported to different systems
Built-in measurement capabilities
Measurement process is automated
Repeated easily on successive versions of the operating system
Disadvantages:
Too small
Do not make representative memory or disk references
Mechanisms for page faults and disk cache may not be adequately exercised
CPU-I/O overlap may not be representative
Loops may create synchronizations ⇒ better or worse performance
Application Benchmarks
For a particular industry: Debit-Credit for banks.
Benchmark = workload (except instruction mixes).
Some authors: benchmark = a set of programs taken from real workloads.
Popular Benchmarks
Sieve
Based on Eratosthenes' sieve algorithm: find all prime numbers below a given number n.
Algorithm:
Write down all integers from 1 to n.
Strike out all multiples of k, for k = 2, 3, …, √n.
Example: Write down all numbers from 1 to 20. Mark all as prime:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Remove all multiples of 2 from the list of primes:
1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19
Sieve (Cont)
The next integer in the sequence is 3. Remove all multiples of 3:
1, 2, 3, 5, 7, 11, 13, 17, 19
The next candidate is 5; since 5² = 25 > 20, stop.
Pascal program to implement the sieve kernel: see the program listing in Figure 4.2 of the book.
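The Pascal listing is not reproduced here; a Python sketch of the same sieve kernel (unlike the worked example above, it drops 1, which is not prime):

```python
def sieve(n):
    """Return all primes <= n using Eratosthenes' sieve."""
    prime = [True] * (n + 1)
    prime[0:2] = [False, False]      # 0 and 1 are not prime
    k = 2
    while k * k <= n:                # stop once k^2 > n
        if prime[k]:
            for m in range(k * k, n + 1, k):
                prime[m] = False     # strike out multiples of k
        k += 1
    return [i for i, p in enumerate(prime) if p]

print(sieve(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```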
Other Benchmarks
Whetstone
U.S. Steel
LINPACK
Dhrystone
Doduc
TOP
Lawrence Livermore Loops
Digital Review Labs
Abingdon Cross Image-Processing Benchmark
SPEC Benchmark Suite
Systems Performance Evaluation Cooperative (SPEC): a non-profit corporation formed by leading computer vendors to develop a standardized set of benchmarks.
Release 1.0 consists of the following 10 benchmarks: GCC, Espresso, Spice 2g6, Doduc, NASA7, LI, Eqntott, Matrix300, Fpppp, and Tomcatv.
These primarily stress the CPU, the Floating Point Unit (FPU), and to some extent the memory subsystem ⇒ they compare CPU speeds.
Benchmarks to compare I/O and other subsystems may be included in future releases.
SPEC (Cont)
The elapsed time to run two copies of a benchmark on each of the N processors of a system (a total of 2N copies) is measured and compared with the time to run two copies of the benchmark on a reference system (which is VAX-11/780 for Release 1.0).
For each benchmark, the ratio of the time on the reference system to the time on the system under test is reported as the SPECthruput, using the notation #CPU@Ratio. For example, a system with three CPUs taking 1/15 as long as the reference system on the GCC benchmark has a SPECthruput of 3@15.
Measure of the per processor throughput relative to the reference system
SPEC (Cont)
The aggregate throughput for all processors of a multiprocessor system can be obtained by multiplying the ratio by the number of processors. For example, the aggregate throughput for the above system is 45.
The geometric mean of the SPECthruputs for the 10 benchmarks is used to indicate the overall performance for the suite and is called SPECmark.
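The SPECmark computation (geometric mean of per-benchmark ratios) can be sketched as follows; the ratios below are made-up illustrative values, not real SPEC results:

```python
import math

def specmark(ratios):
    """Geometric mean of the SPECthruput ratios for a benchmark suite."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical ratios for three benchmarks only (a real suite has ten).
print(round(specmark([15.0, 12.0, 20.0]), 2))  # geometric mean, ~15.33
```

The geometric mean is used because the per-benchmark values are ratios; it is the only mean for which the mean of the ratios equals the ratio of the means.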
The Art of Workload Selection
Services Exercised
Example: Timesharing Systems
Example: Networks
Example: Magnetic Tape Backup System
Level of Detail
Representativeness
Timeliness
Other Considerations in Workload Selection
The Art of Workload Selection
Considerations:
Services exercised
Level of detail
Loading level
Impact of other components
Timeliness
Services Exercised
SUT = System Under Test
CUS = Component Under Study
Services Exercised (Cont)
Do not confuse the SUT with the CUS.
Metrics depend upon the SUT: MIPS is fine for comparing two CPUs but not for comparing two timesharing systems.
The workload depends upon the system. Examples:
CPU: instructions
System: transactions
Transactions are not a good workload for a CPU, and vice versa.
For two systems identical except for the CPU:
Comparing the systems: use transactions
Comparing the CPUs: use instructions
Multiple services: exercise as complete a set of services as possible.
Example: Timesharing Systems
Applications ⇒ application benchmark
Operating system ⇒ synthetic program
Central processing unit ⇒ instruction mixes
Arithmetic logic unit ⇒ addition instruction
Example: Networks
Level of Detail (Cont)
Average resource demand:
Used for analytical modeling
Group similar services into classes
Distribution of resource demands:
Used if the variance is large
Used if the distribution impacts the performance
Workloads used in simulation and analytical modeling:
Non-executable: used in analytical/simulation modeling
Executable: can be executed directly on a system
Representativeness
The test workload and the real workload should have the same:
Elapsed time
Resource demands
Resource usage profile: the sequence in which, and the amounts by which, the different resources are used
Timeliness
Users are a moving target: new systems ⇒ new workloads.
Users tend to optimize their demand: fast multiplication ⇒ a higher frequency of multiplication instructions.
It is important to monitor user behavior on an ongoing basis.
Other Considerations in Workload Selection
Loading level: a workload may exercise a system to:
Its full capacity (best case)
Beyond its capacity (worst case)
The load level observed in the real workload (typical case)
For procurement purposes ⇒ typical case; for design ⇒ best through worst, all cases.
Impact of external components: do not use a workload that makes an external component the bottleneck, since then all alternatives in the system give equally good performance.
Repeatability
Workload Characterization Techniques
Terminology
Components and parameter selection
Characterization techniques: averaging, single-parameter histograms, multi-parameter histograms, principal component analysis, Markov models, clustering
Clustering methods: minimum spanning tree, nearest centroid
Problems with clustering
Terminology
User = the entity that makes the service request.
Workload components: applications, sites, user sessions.
Workload parameters (workload features): measured quantities, service requests, or resource demands. For example: transaction types, instructions, packet sizes, source-destination pairs of packets, and page-reference patterns.
Components and Parameter Selection
The workload component should be at the SUT interface.
Each component should represent as homogeneous a group as possible. Combining very different users into a site workload may not be meaningful.
The domain of control affects the component. Example: mail-system designers are more interested in determining a typical mail session than a typical user session.
Do not use parameters that depend upon the system, e.g., elapsed time or CPU time.
Components (Cont)
Characteristics of service requests:
Arrival time
Type of request, or the resource demanded
Duration of the request
Quantity of the resource demanded, for example, pages of memory
Exclude those parameters that have little impact.
Workload Characterization Techniques
1. Averaging
2. Single-Parameter Histograms
3. Multi-parameter Histograms
4. Principal Component Analysis
5. Markov Models
6. Clustering
Averaging
Mean
Standard deviation
Coefficient of variation: C.O.V. = s / x̄
Mode (for categorical variables): the most frequent value
Median: the 50-percentile
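All of these summary statistics are in the Python standard library; a small sketch on made-up CPU-time data:

```python
import statistics

# Hypothetical CPU-time sample (seconds) for one workload component.
x = [2.0, 3.0, 3.0, 5.0, 7.0]

mean = statistics.mean(x)
sd = statistics.stdev(x)   # sample standard deviation
cov = sd / mean            # coefficient of variation: C.O.V. = s / mean
mode = statistics.mode(x)  # most frequent value
median = statistics.median(x)

print(mean, sd, cov, mode, median)  # 4.0 2.0 0.5 3.0 3.0
```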
Case Study: Program Usage in Educational Environments
High coefficient of variation.
Characteristics of an Average Editing Session
Reasonable variation.
Single-Parameter Histograms
n buckets × m parameters × k components gives n·m·k values.
Use only if the variance is high.
Ignores correlation among parameters.
Multi-parameter Histograms
It is difficult to plot joint histograms for more than two parameters.
Principal Component Analysis
Key idea: use a weighted sum of the parameters to classify the components.
Let xij denote the ith parameter of the jth component:
yj = Σi=1..n wi xij
Principal component analysis assigns the weights wi such that the yj provide the maximum discrimination among the components.
The quantity yj is called the principal factor. The factors are ordered: the first factor explains the highest percentage of the variance.
Principal Component Analysis (Cont)
Statistically:
The y's are linear combinations of the x's:
yi = Σj=1..n aij xj
Here, aij is called the loading of variable xj on factor yi.
The y's form an orthogonal set, that is, their inner product is zero:
⟨yi, yj⟩ = Σk aik ajk = 0
This is equivalent to stating that the yi's are uncorrelated with each other.
The y's form an ordered set such that y1 explains the highest percentage of the variance in resource demands.
Finding Principal Factors
1. Find the correlation matrix.
2. Find the eigenvalues of the matrix and sort them in order of decreasing magnitude.
3. Find the corresponding eigenvectors. These give the required loadings.
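For the two-parameter case used in the example that follows, the 2×2 correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 − r, with eigenvectors (1, 1)/√2 and (1, −1)/√2. A sketch of the whole procedure on that special case, assuming a positive correlation (the function name and data are illustrative only):

```python
import math

def principal_factors_2d(xs, xr):
    """Two-parameter PCA: normalize, correlate, and form the first factor."""
    n = len(xs)
    mx, mr = sum(xs) / n, sum(xr) / n
    sx = math.sqrt(sum((v - mx) ** 2 for v in xs) / (n - 1))
    sr = math.sqrt(sum((v - mr) ** 2 for v in xr) / (n - 1))
    xs_n = [(v - mx) / sx for v in xs]   # zero mean, unit variance
    xr_n = [(v - mr) / sr for v in xr]
    r = sum(a * b for a, b in zip(xs_n, xr_n)) / (n - 1)  # correlation
    lam1, lam2 = 1 + r, 1 - r            # eigenvalues of [[1, r], [r, 1]]
    c = 1 / math.sqrt(2)                 # eigenvector component (length 1)
    y1 = [c * (a + b) for a, b in zip(xs_n, xr_n)]  # first principal factor
    return lam1, lam2, y1

# Perfectly correlated toy data: the first factor explains everything.
lam1, lam2, y1 = principal_factors_2d([1.0, 2.0, 3.0, 4.0],
                                      [2.0, 4.0, 6.0, 8.0])
```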
Principal Component Example
Principal Component Example
Compute the mean and standard deviation of each variable:
Principal Component (Cont)
Similarly:
Principal Component (Cont)
Normalize the variables to zero mean and unit standard deviation. The normalized values x's and x'r are given by:
Principal Component (Cont)
Compute the correlation among the variables:
Prepare the correlation matrix:
Principal Component (Cont)
Compute the eigenvalues of the correlation matrix by solving its characteristic equation.
The eigenvalues are 1.916 and 0.084.
Principal Component (Cont)
Compute the eigenvectors of the correlation matrix. The eigenvector q1 corresponding to λ1 = 1.916 is defined by the relationship:
C q1 = λ1 q1
which gives:
q11 = q21
Principal Component (Cont)
Restricting the length of the eigenvectors to one:
Obtain the principal factors by multiplying the eigenvectors by the normalized vectors.
Compute the values of the principal factors, then their sum and sum of squares.
Principal Component (Cont)
The sum must be zero.
The sums of squares give the percentage of variation explained.
Principal Component (Cont)
The first factor explains 32.565/(32.565 + 1.435), or 95.7%, of the variation.
The second factor explains only 4.3% of the variation and can thus be ignored.
Markov Models
Markov ⇒ the next request depends only on the last request.
Described by a transition matrix.
Transition matrices can also be used for application transitions, e.g., P(Link | Compile).
Used to specify page-reference locality: P(Reference module i | Referenced module j).
Transition Probability
Given the same relative frequency of requests of different types, the frequency can be realized with several different transition matrices.
If order is important, measure the transition probabilities directly on the real system.
Example: two packet sizes: small (80%), large (20%). An average of four small packets is followed by an average of one big packet, e.g., ssssbssssbssss.
Transition Probability (Cont)
Alternative: eight small packets followed by two big packets.
Or generate a random number x: x < 0.8 ⇒ generate a small packet; otherwise generate a large packet.
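A sketch contrasting this with a Markov model: the transition probabilities below are hypothetical, chosen so the chain is 80% small packets in steady state yet, unlike independent draws, never emits two large packets in a row:

```python
import random

# Hypothetical transition probabilities: after a small packet, stay small
# with probability 0.75; after a big packet, always return to small. The
# stationary distribution is 80% small, 20% big.
P = {"s": {"s": 0.75, "b": 0.25},
     "b": {"s": 1.00, "b": 0.00}}

def generate(n, seed=1):
    """Generate a packet-size trace from the transition matrix P."""
    random.seed(seed)
    state, out = "s", []
    for _ in range(n):
        out.append(state)
        state = "s" if random.random() < P[state]["s"] else "b"
    return "".join(out)

trace = generate(10000)
# Because P("b"|"b") = 0, the substring "bb" never occurs.
```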
Clustering
Clustering Steps
1. Take a sample, that is, a subset of workload components.
2. Select workload parameters.
3. Select a distance measure.
4. Remove outliers.
5. Scale all observations.
6. Perform clustering.
7. Interpret results.
8. Change parameters, or number of clusters, and repeat steps 3-7.
9. Select representative components from each cluster.
1. Sampling
In one study, 2% of the population was chosen for analysis; later, 99% of the population could be assigned to the clusters obtained.
Random selection
Select the top consumers of a resource
2. Parameter Selection
Criteria: impact on performance, variance.
Method: redo the clustering with one less parameter.
Principal component analysis: identify the parameters with the highest variance.
3. Transformation
If the distribution is highly skewed, consider a function of the parameter, e.g., the log of CPU time.
4. Outliers
Outliers = data points with extreme parameter values.
They affect normalization.
Exclude them only if they do not consume a significant portion of the system resources. Example: backup.
5. Data Scaling
1. Normalize to zero mean and unit variance:
x'ik = (xik − x̄k) / sk
2. Weights:
x'ik = wk xik, where wk ∝ relative importance, or wk = 1/sk
3. Range normalization:
x'ik = (xik − xmin,k) / (xmax,k − xmin,k)
Affected by outliers.
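The first and third scaling rules can be sketched as follows (a minimal version; the data values are made up):

```python
import math

def zero_mean_unit_variance(xs):
    """Scale so the parameter has mean 0 and standard deviation 1."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((v - m) ** 2 for v in xs) / (n - 1))
    return [(v - m) / s for v in xs]

def range_normalize(xs):
    """Scale into [0, 1]; the min/max make this outlier-sensitive."""
    lo, hi = min(xs), max(xs)
    return [(v - lo) / (hi - lo) for v in xs]

print(zero_mean_unit_variance([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
print(range_normalize([2.0, 4.0, 6.0]))          # [0.0, 0.5, 1.0]
```

A single extreme value stretches the range-normalized scale for every other observation, which is why outliers are removed (step 4 above) before scaling.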
Data Scaling (Cont)
4. Percentile normalization
Distance Metric
1. Euclidean distance: given {xi1, xi2, …, xin} and {xj1, xj2, …, xjn},
d = [ Σk=1..n (xik − xjk)² ]^½
2. Weighted Euclidean distance:
d = [ Σk=1..n ak (xik − xjk)² ]^½
Here ak, k = 1, 2, …, n, are suitably chosen weights for the n parameters.
3. Chi-square distance
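The two Euclidean variants can be sketched as (the weights in the usage line are illustrative):

```python
import math

def euclidean(xi, xj):
    """Plain Euclidean distance between two parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def weighted_euclidean(xi, xj, w):
    """Euclidean distance with per-parameter weights ak."""
    return math.sqrt(sum(ak * (a - b) ** 2 for ak, a, b in zip(w, xi, xj)))

print(euclidean([0, 0], [3, 4]))                  # 5.0
print(weighted_euclidean([0, 0], [3, 0], [4, 1])) # 6.0
```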
Distance Metric (Cont)
The Euclidean distance is the most commonly used distance metric.
The weighted Euclidean is used if the parameters have not been scaled or if the parameters have significantly different levels of importance.
Use the chi-square distance only if the column sums x·k are close to each other; parameters with low values of x·k get higher weights.
Clustering Techniques
Goal: Partition into groups so the members of a group are as similar as possible and different groups are as dissimilar as possible.
Statistically, the intra-group variance should be as small as possible, and the inter-group variance should be as large as possible.
Total Variance = Intra-group Variance + Inter-group Variance
Clustering Techniques (Cont)
Nonhierarchical techniques: start with an arbitrary set of k clusters and move members until the intra-group variance is minimum.
Hierarchical techniques:
Agglomerative: start with n clusters and merge.
Divisive: start with one cluster and divide.
Two popular techniques:
Minimum spanning tree method (agglomerative)
Centroid method (divisive)
Minimum Spanning Tree Clustering Method
1. Start with k = n clusters.
2. Find the centroid of the ith cluster, i=1, 2, …, k.
3. Compute the inter-cluster distance matrix.
4. Merge the nearest clusters.
5. Repeat steps 2 through 4 until all components are part of one cluster.
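A sketch of steps 1–5. It uses squared Euclidean distances, which appears to be what the distance values in the example that follows use (e.g., the minimum inter-cluster distance of 2 between A = {2, 4} and B = {3, 5}); the coordinates are the five programs from that example:

```python
def centroid(cluster, points):
    xs = [points[m][0] for m in cluster]
    ys = [points[m][1] for m in cluster]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def sq_dist(p, q):
    # Squared Euclidean distance between two centroids.
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def agglomerate(points):
    """Repeatedly merge the two nearest clusters until one remains."""
    clusters = [[name] for name in points]   # step 1: k = n clusters
    merges = []
    while len(clusters) > 1:
        cents = [centroid(c, points) for c in clusters]   # step 2
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]    # step 3
        i, j = min(pairs, key=lambda ij: sq_dist(cents[ij[0]], cents[ij[1]]))
        merges.append((clusters[i], clusters[j]))         # step 4
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges                                         # step 5 done

pts = {"A": (2, 4), "B": (3, 5), "C": (1, 6), "D": (4, 3), "E": (5, 2)}
merges = agglomerate(pts)
```

Recording the merge order and distances is exactly the information a dendrogram plots.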
Minimum Spanning Tree Example
Step 1: Consider five clusters with ith cluster consisting solely of ith program.
Step 2: The centroids are {2, 4}, {3, 5}, {1, 6}, {4, 3}, and {5, 2}.
Spanning Tree Example (Cont)
Step 3: The Euclidean distance is:
Step 4: Minimum inter-cluster distance = 2. Merge A+B, D+E.
Spanning Tree Example (Cont)
Step 2: The centroid of cluster pair AB is {(2+3)/2, (4+5)/2}, that is, {2.5, 4.5}. Similarly, the centroid of pair DE is {4.5, 2.5}.
Spanning Tree Example (Cont)
Step 3: The distance matrix is:
Step 4: Merge AB and C.
Step 2: The centroid of cluster ABC is {(2+3+1)/3, (4+5+6)/3}, that is, {2, 5}.
Spanning Tree Example (Cont)
Step 3: The distance matrix is:
Step 4: The minimum distance is 12.5. Merge ABC and DE ⇒ a single cluster ABCDE.
Dendrogram
Dendrogram = spanning tree of the merges.
Purpose: obtain clusters for any given maximum allowable intra-cluster distance.
Nearest Centroid Method
1. Start with k = 1.
2. Find the centroid and the intra-cluster variance for the ith cluster, i = 1, 2, …, k.
3. Find the cluster with the highest variance and arbitrarily divide it into two clusters: find the two components that are farthest apart and assign the other components according to their distance from these points; place all components below the centroid in one cluster and all components above this hyperplane in the other.
4. Adjust the points in the two new clusters until the inter-cluster distance between them is maximum.
5. Set k = k + 1. Repeat steps 2 through 4 until k = n.
Cluster Interpretation
Assign all measured components to the clusters.
Clusters with very small populations and small total resource demands can be discarded. (Don't discard a cluster merely because it is small.)
Interpret the clusters in functional terms, e.g., a business application, or label them by their resource demands, for example, CPU-bound, I/O-bound, and so forth.
Select one or more representative components from each cluster for use as the test workload.
Problems with Clustering
Problems with Clustering (Cont)
Goal: minimize variance. Yet the results of clustering are highly variable; there are no rules for:
Selection of parameters
Distance measure
Scaling
Labeling each cluster by functionality is difficult. In one study, editing programs appeared in 23 different clusters.
Clustering requires many repetitions of the analysis.