Applied Data Analysis (With SPSS)


  • 7/28/2019 Applied Data Analysis (With SPSS)

    1/19

    Research Methodology: Tools

    Applied Data Analysis (with SPSS)

    Lecture 04: Cluster Analysis

    March 2011

    Prof. Dr. Jürg Schwarz

    MSc Business Administration

    Slide 2

    Contents

    Aims (slide 5)

    Introduction (slide 6)

    Outline (slide 9)

    Concepts of Cluster Analysis (slide 10)

    Cluster Analysis with SPSS: A detailed example (slide 24)


    Slide 5

    Aims

    Aims of the lecture

    You know different types of measures of distance / similarity

    You know the key steps in conducting a cluster analysis.

    You can conduct a cluster analysis with SPSS

    (Hierarchical agglomerative methods: Between-groups linkage and Ward)

    In particular, you know how to

    choose the appropriate measure of distance / similarity

    interpret the agglomeration schedule

    use the dendrogram to determine the number of clusters

    interpret the meaning of a cluster

    Slide 6

    Introduction

    Example

    Marketing research: Customer survey on brand awareness ("Markenbewusstsein")

    [Figure: scatter plot of brand awareness (index) against yearly income (index)]

    Survey features

    Sample of n = 150 customers

    Brand awareness index consists of 3 items:

    How likely is it that you will use the brand again in the future?

    How likely would you be to recommend the brand to your friends?

    Overall, how satisfied are you with the brand?

    Also included in the dataset:

    yearly income


    Slide 7

    Question

    Is there a linear relation between brand awareness and yearly income?

    Hypothesis: The higher a person's income, the higher his/her brand awareness.

    Conduct regression analysis with SPSS

    [Figure: scatter plot of brand awareness (index) against yearly income (index)]

    Output (summarized)

    Overall model test (F-test)

    Significance p = .014

    Test of coefficients

    Constant p < .001 (displayed by SPSS as .000)

    Income p = .014

    Coefficient of determination

    R Square = .040

    It is a really poor model: income explains only 4% of the variance in brand awareness.

    Nevertheless, there seems to be structure in the brand awareness dataset.


    Slide 8

    Question

    Is there structure in the brand awareness dataset?

    Are there clusters for the combination of yearly income and brand awareness?

    Conduct cluster analysis

    [Figure: scatter plot of brand awareness (index) against yearly income (index)]

    Output

    SPSS identified 3 distinct clusters

    Interpretation

    People with low income are least aware because they lack money.

    People with middle income have the highest brand awareness because of the dream of being richer.

    People with high income are moderately brand aware because they have a certain status but don't need to show off.


    Slide 9

    Outline

    Cluster analysis is a multivariate procedure for detecting natural groupings in data.

    The grouping is based on the scores of several measures (e.g. income and awareness).

    [Figure: scatter plot of brand awareness (index) against yearly income (index), with the distance d within a group and the distance D between groups]

    Goals in conducting cluster analysis

    Elements within a group should be as similar as possible: distance d should be small.

    Similarities between the groups should be minimal: distance D should be large.

    Features

    Because all information is used for grouping, cluster analysis is more objective than just a subjective impression.

    There is no optical illusion.

    Slide 10

    Concepts of Cluster Analysis

    Key steps in using a cluster analysis

    1. Measure of distance or similarity between objects (also called proximity measure)

    Depends on type of data: interval, counts, binary

    Distance: geometrical measure. Similarity: content-related measure

    2. Formation of clusters

    Calculation of proximity matrix

    Many different procedures: Hierarchical / non-hierarchical, agglomerative / divisive etc.

    3. Tools / criteria for determining the number of clusters

    Tools: Agglomeration schedule, structural chart, dendrogram, icicle plot ("Eiszapfen-Plot")

    Criteria (not available in SPSS): F-value, information criterion, etc.

    4. Display and save cluster membership

    Done by SPSS

    5. Interpretation of clusters

    Taking into account means (possibly variances) of cluster members


    Slide 11

    How to measure proximity

    From dataset (raw data) ...

                Variable 1   Variable 2   Variable 3   ...   Variable j
    Object 1
    Object 2
    Object 3
    :
    Object k

    ... to proximity matrix (distance or similarity; done by SPSS internally)

                Object 1   Object 2   Object 3   ...   Object k
    Object 1
    Object 2
    Object 3
    :
    Object k
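    As a rough sketch of this step, the following Python snippet turns a small raw data table into a proximity matrix of Euclidean distances. The three objects and their values are made up for illustration (they are not lecture data), and the function name `proximity_matrix` is an assumption of this sketch, not an SPSS routine.

```python
import math


def proximity_matrix(data):
    """Return the symmetric matrix of Euclidean distances between all rows."""
    n = len(data)
    matrix = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for l in range(k + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[k], data[l])))
            # The distance from k to l equals the distance from l to k.
            matrix[k][l] = matrix[l][k] = d
    return matrix


# Hypothetical raw data: 3 objects (rows) measured on 2 variables (columns).
raw_data = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

matrix = proximity_matrix(raw_data)
for row in matrix:
    print([round(value, 3) for value in row])
```

    The diagonal is zero because every object has distance 0 to itself, and only one triangle of the matrix carries independent information.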

    Slide 12

    Different proximity measures, depending on type of data

    "Measure" allows you to specify the distance (d) or similarity (s) to be used in clustering.

    Interval (e.g. brand awareness, yearly income)

    Euclidean distance (d)

    City block distance (d)

    Pearson correlation (s)

    :

    Counts (e.g. number of clients)

    Chi-square measure (d)

    Phi-square measure (d)

    :

    Binary (e.g. yes/no, female/male)

    Euclidean distance (d)

    Russell and Rao (s)

    Simple matching (s)

    Dice (s)

    (only a selection of 27!)


    Slide 13

    Proximity measure with interval variables

    Example: Brand awareness

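    Theorem of Pythagoras for a right triangle

    $$a^2 + b^2 = c^2 \quad\Rightarrow\quad c = \sqrt{a^2 + b^2}$$

    Distance between "pers_001" and "pers_002"

    $$d_{001,002} = \left[ (1.67 - 0.97)^2 + (2.95 - 1.73)^2 \right]^{1/2} = \left[ 0.490 + 1.488 \right]^{1/2} = 1.407$$

    Coordinates {x-axis, y-axis}: {1.67, 1.73} and {0.97, 2.95}

    [Figure: right triangle with leg a from 0.97 to 1.67 on the x-axis, leg b from 1.73 to 2.95 on the y-axis, and hypotenuse c = 1.407]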
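    The calculation can be checked with a few lines of Python. This is only a sketch: the coordinates are the two points shown on the slide, while the variable names `point_1` and `point_2` are assumptions (the slide does not say which point belongs to which person).

```python
import math

# Coordinates {x-axis, y-axis} of the two persons, as given on the slide.
point_1 = (1.67, 1.73)
point_2 = (0.97, 2.95)

# Legs of the right triangle.
a = point_1[0] - point_2[0]  # difference on the x-axis
b = point_2[1] - point_1[1]  # difference on the y-axis

# The hypotenuse c = sqrt(a^2 + b^2) is the Euclidean distance.
distance = math.sqrt(a ** 2 + b ** 2)
print(round(distance, 3))  # 1.407
```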

    Slide 14

    Generalized equation

    Minkowski distance (Hermann Minkowski, 1864 - 1909, German physicist)

    $$d_{k,l} = \left[ \sum_{j=1}^{J} \left| x_{kj} - x_{lj} \right|^{r} \right]^{1/r}$$

    r = Minkowski's constant

    d_{k,l} = Distance between objects k and l (e.g. distance between persons 001 and 002)

    J = Number of cluster variables (e.g. variables income and awareness)

    x_{kj}, x_{lj} = Values of variable j of objects k and l (e.g. income of persons 001 and 002)

    Values of Minkowski's constant

    r = 1: City block distance (also called L1-norm, Manhattan distance or taxi distance)

    r = 2: Euclidean distance (also called L2-norm)

    [Figure: L1 (city block) path and L2 (straight line) path between two points]
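    A minimal Python sketch of the generalized equation follows. The function name `minkowski_distance` is an assumption of this sketch; the two points are the coordinates from the brand awareness example on slide 13.

```python
def minkowski_distance(x_k, x_l, r):
    """Minkowski distance between objects k and l for a constant r >= 1."""
    if len(x_k) != len(x_l):
        raise ValueError("both objects need the same number of variables")
    # Sum over all J cluster variables, then take the r-th root.
    total = sum(abs(a - b) ** r for a, b in zip(x_k, x_l))
    return total ** (1.0 / r)


x_k = (1.67, 1.73)
x_l = (0.97, 2.95)

city_block = minkowski_distance(x_k, x_l, r=1)  # L1-norm
euclidean = minkowski_distance(x_k, x_l, r=2)   # L2-norm
print(round(city_block, 2))  # 1.92
print(round(euclidean, 3))   # 1.407
```

    The city block distance is never smaller than the Euclidean distance, because it follows the two legs of the triangle instead of the hypotenuse.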


    Slide 15

    Proximity measure with binary variables

    Example: Car configuration

    Identification of similarities between two objects by means of comparison

    Configuration

               ABS   Airbag   ESP   Navi   Metallic
    Mercedes    0      1       1     1        0
    BMW         0      1       1     0        1
    Case        D      A       A     C        B

    0 = feature not present, 1 = feature present

    4 Cases

    A = Feature exists in both comparison objects

    B, C = Feature exists in one comparison object

    D = Feature exists in none of the comparison objects

    Joint non-existence of a feature (case D) can also be an important similarity when defining proximity.

    Slide 16

    Binary proximity measures

    Proximity measure between two objects i and j depends on whether and how the cases are included and how they are weighted (weights $\delta_1$, $\delta_2$ and $\lambda$).

    General case: Simple Matching Coefficient*

    $$S_{ij} = \frac{a + \delta_1 d}{a + \lambda (b + c) + \delta_2 d}$$

    Variants, with description and definition

    Russell and Rao: case d reduces proximity

    $$S_{ij} = \frac{a}{a + b + c + d}$$

    Simple matching: case d raises proximity

    $$S_{ij} = \frac{a + d}{a + b + c + d}$$

    Dice: case d is not taken into account; similar features are weighted more

    $$S_{ij} = \frac{2a}{2a + b + c}$$

    *Sokal, R.R. and Michener, C.D., A statistical method for evaluating systematic relationships, University of Kansas Science Bulletin, 38:1409-1438, 1958.

    a = Number of features falling under case "A"

    b = Number of features falling under case "B"

    :


    Slide 17

    Example: Car configuration

    Configuration

               ABS   Airbag   ESP   Navi   Metallic
    Mercedes    0      1       1     1        0
    BMW         0      1       1     0        1
    Case        D      A       A     C        B

    0 = feature not present, 1 = feature present

    Count of cases: a = 2, b = 1, c = 1, d = 1

    Measure and proximity

    Russell and Rao: $S_{ij} = \frac{2}{2 + 1 + 1 + 1} = \frac{2}{5} = 0.4$

    Simple matching: $S_{ij} = \frac{2 + 1}{2 + 1 + 1 + 1} = \frac{3}{5} = 0.6$

    Dice: $S_{ij} = \frac{2 \cdot 2}{2 \cdot 2 + 1 + 1} = \frac{4}{6} = 0.67$

    Some remarks

    S_ij varies between 0 and 1

    There is no "right" proximity measure

    Important question/decision: Is non-existence important? (Should case d be taken into account?)
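    The car example can be reproduced with a short Python sketch. The function names are assumptions of this sketch; the assignment of cases B and C follows the slide's table (Metallic, present only in the BMW, is case B; Navi, present only in the Mercedes, is case C).

```python
def count_cases(first, second):
    """Count the four cases A, B, C, D for two binary feature vectors."""
    pairs = list(zip(first, second))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # feature in both objects
    b = sum(1 for x, y in pairs if x == 0 and y == 1)  # only in the second object
    c = sum(1 for x, y in pairs if x == 1 and y == 0)  # only in the first object
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # in neither object
    return a, b, c, d


def russell_rao(a, b, c, d):
    return a / (a + b + c + d)


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


def dice(a, b, c, d):
    # Case d is deliberately ignored by this measure.
    return 2 * a / (2 * a + b + c)


# Feature order: ABS, Airbag, ESP, Navi, Metallic (1 = present, 0 = not present).
mercedes = [0, 1, 1, 1, 0]
bmw = [0, 1, 1, 0, 1]

a, b, c, d = count_cases(mercedes, bmw)
print(a, b, c, d)                            # 2 1 1 1
print(russell_rao(a, b, c, d))               # 0.4
print(simple_matching(a, b, c, d))           # 0.6
print(round(dice(a, b, c, d), 2))            # 0.67
```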

    Slide 18

    How to form Clusters

    [Figure: cluster A and cluster B with the distances 1., 2. and 3. marked]

    How to define similarity?

    Similarity between cluster A and cluster B is measured by

    1. Nearest neighbor (also called single linkage in the cluster formation tree on slide 20)

    ... the minimum of all possible distances between the cases in cluster A and the cases in B.

    2. Centroid clustering (also called other linkage)

    ... the distance between the centroids of cluster A and of cluster B.

    3. Furthest neighbor (also called complete linkage)

    ... the maximum of all possible distances between the cases in cluster A and the cases in B.


    Slide 19

    Similarity between cluster A and cluster B is measured by

    Between-groups linkage (also called average linkage)

    ... the average of all the possible distances between the cases in cluster A and the cases in B.

    Within-groups linkage (also called other linkage)

    ... the average of all the possible distances between the cases within a single new cluster determined by combining cluster A and cluster B.

    Median clustering (also called other linkage)

    ... the distance between the SPSS-determined median for the cases in cluster A and the median for the cases in cluster B.
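    To make the difference between the linkage rules concrete, here is a small Python sketch that computes four of them (nearest neighbor, furthest neighbor, between-groups and centroid) for two tiny clusters. The points and all function names are made up for illustration; within-groups linkage and median clustering are left out to keep the sketch short.

```python
import math
from itertools import product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(cluster):
    """Mean of each variable over all cases of the cluster."""
    n = len(cluster)
    return tuple(sum(values) / n for values in zip(*cluster))


def linkage_distances(cluster_a, cluster_b):
    # All possible distances between a case in A and a case in B.
    pairwise = [euclidean(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                    # nearest neighbor
        "complete": max(pairwise),                  # furthest neighbor
        "average": sum(pairwise) / len(pairwise),   # between-groups linkage
        "centroid": euclidean(centroid(cluster_a), centroid(cluster_b)),
    }


# Hypothetical clusters with two cases each.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

distances = linkage_distances(cluster_a, cluster_b)
for name, value in distances.items():
    print(name, round(value, 3))
```

    For the same pair of clusters the single linkage distance is the smallest and the complete linkage distance the largest; average linkage always lies between the two.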

    Special case, taking into account sum of squares

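    Ward's method

    For a cluster, the sum of squares is the sum of squared distances of each case from the centroid.

    Sum of squared distances

    $$\sum_{i=1}^{k} d_i^2 = d_1^2 + d_2^2 + \dots + d_k^2$$

    [Figure: cluster with the distances d1, d2, ... of the cases from the centroid]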
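    The sum of squares can be sketched in Python as follows. Ward's method is usually described as merging, at each stage, the two clusters whose fusion increases the total within-cluster sum of squares the least; the snippet computes that increase for one candidate merge. The points and the function name `sum_of_squares` are made up for illustration.

```python
def sum_of_squares(cluster):
    """Sum of squared Euclidean distances of each case from the centroid."""
    n = len(cluster)
    center = [sum(values) / n for values in zip(*cluster)]
    return sum(
        sum((value - mean) ** 2 for value, mean in zip(case, center))
        for case in cluster
    )


# Hypothetical clusters.
cluster_a = [(1.0, 1.0), (3.0, 1.0)]
cluster_b = [(2.0, 5.0)]

ss_a = sum_of_squares(cluster_a)
ss_b = sum_of_squares(cluster_b)  # a single case has sum of squares 0
ss_merged = sum_of_squares(cluster_a + cluster_b)

# Increase in the total sum of squares caused by merging A and B.
increase = ss_merged - (ss_a + ss_b)
print(ss_a, ss_b, round(ss_merged, 3), round(increase, 3))
```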

    Slide 20

    Cluster formation tree (rules for cluster formation)

    There are several types of clustering procedures:

    Cluster algorithms
      Hierarchical
        Agglomerative
          Linkage methods: Single linkage, Complete linkage, Average linkage, Other linkage
          Variance methods: Ward's procedure
        Divisive
      Non-hierarchical
        k-Means procedure

    Non-hierarchical clustering is also called k-means clustering.

    Average linkage between groups is the default in SPSS ("Between-groups linkage")

    used in this course


    Slide 21

    Pros and cons

    Hierarchical clustering

    No a priori decision about the number of clusters

    Can be very slow

    Non-hierarchical clustering

    Need to specify the number of clusters (can be an arbitrary number)

    Faster, more reliable

    Features

    Procedure          Proximity measure        Remark
    Single linkage     distance or similarity   tendency to form chains
    Complete linkage   distance or similarity   tendency to smaller groups of same size
    Average linkage    distance or similarity   "between" single and complete linkage
    Other linkage      only distance            No remark
    Ward's method      only distance            tendency to groups of same size

    Slide 22

    Example of hierarchical method: Single linkage (nearest neighbor)

    Tendency to form chains

    Suitable for the detection of outliers

    Close groups are badly separated

    [Figure: nearest-neighbor merging from step k to step k + 1, forming a "chain". Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)]


    Slide 23

    Example of hierarchical method: Complete linkage (furthest neighbor)

    Tendency to form smaller groups with same size

    Not suitable for detecting outliers

    [Figure: furthest-neighbor merging from step k to step k + 1. Source: Cognitive Psychology Unit at Saarland University (www.uni-saarland.de) (Date of access: March, 2011)]

    Slide 24

    Cluster Analysis with SPSS: A detailed example

    Marketing research: Customer survey on brand awareness

    Data

    Random sub-sample of n = 15

    (Why this small sub-sample? Just to keep track of what SPSS does.)

    Data set: cluster_small.sav

    Syntax: cluster_small.sps

    [Figure: scatter plot of brand awareness (index) against yearly income (index) for the sub-sample]


    Slide 25

    SPSS Elements:


    Slide 27

    First step: Measure of distance or similarity between objects

    Output

    Proximity Matrix (Distances or similarities between items)

    Values represent Euclidean distances

    Example:

    Distance between cases 9 and 7

    :

    :

    Slide 28

    Second step: Formation of clusters

    Between-groups linkage

    Stage 1: Cases 7 and 9 have smallest distance ("Coefficients" = .203) => first cluster {7,9}

    First cluster {7,9} will be clustered with case 10 in stage 5 => cluster {7,9,10}

    Stage 2: Cases 13 and 14 have second smallest distance => second cluster {13,14}

    Second cluster {13,14} will be clustered with case 11 in stage 3 => cluster {11,13,14}

    :

    Agglomeration schedule: Displays the clusters combined at each stage.
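    How such a schedule comes about can be sketched in plain Python. The 15-case sub-sample is not reproduced in this transcript, so the sketch uses the six-person table (Persons A to F) from the interpretation example at the end of the lecture instead. It assumes Euclidean distance and between-groups (average) linkage; the function names are made up, and the code is an illustration of the principle, not a reimplementation of the SPSS procedure.

```python
import math
from itertools import combinations, product


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(cluster_a, cluster_b, data):
    """Mean of all distances between the cases in A and the cases in B."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(euclidean(data[i], data[j]) for i, j in pairs) / len(pairs)


def agglomeration_schedule(data):
    """Merge the two closest clusters until one is left; return all stages."""
    clusters = [(name,) for name in data]  # every case starts as its own cluster
    schedule = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], data),
        )
        coefficient = average_linkage(a, b, data)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        schedule.append((len(schedule) + 1, a, b, coefficient))
    return schedule


# Six-person table: attitude to life, attitude to innovation, risk taking.
people = {
    "A": (1, 2, 2), "B": (1, 3, 3), "C": (2, 4, 2),
    "D": (5, 4, 3), "E": (5, 4, 4), "F": (7, 6, 7),
}

schedule = agglomeration_schedule(people)
for stage, a, b, coefficient in schedule:
    print(stage, a, b, round(coefficient, 3))
```

    With these data, D and E are merged first (distance 1.0), then A and B, then C joins A and B. After stage 3 the three remaining clusters are {A, B, C}, {D, E} and {F}.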


    Slide 29

    Dendrogram

    [Figure: dendrogram with the merges of stages 1, 2, 3 and 5 marked]

    Slide 30

    Icicle plot

    14 clusters: Cases 7 and 9 in one cluster, all others each in their own clusters.

    13 clusters: 7 and 9 in one cluster, 13 and 14 in one cluster, all others each in their own clusters.

    12 clusters: 7 and 9 in one cluster, 11, 13 and 14 in one cluster, all others each in their own clusters.

    :

    The figure is called an icicle plot because the columns look like icicles hanging from above.

    The plot shows how cases are merged into clusters.

    Read it from bottom to top.


    Slide 31

    Third step: Determining the number of clusters

    0) Theoretical and empirical reasons (But, be careful about optical illusion!)

    In the case of brand awareness there are some indications for three clusters.

    A) Elbow criterion in the structure chart (can't be done with SPSS, but with Excel)

    [Figure: structure chart. x-axis: number of clusters (= sample size - "Stage"), 1 to 15; y-axis: proximity ("Coefficients"), 0.0 to 3.0. The elbow is marked at 3 clusters.]

    Please note: Usually there is a large jump from 1 cluster to 2 clusters, which is not the "elbow".

    elbow => choose 3 clusters
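    The elbow is normally found by eye. As a crude numerical stand-in, the following Python sketch looks for the largest jump between the coefficients of consecutive stages and converts the stage into a number of clusters (sample size minus stage). The coefficients are illustrative values, not the SPSS output of the lecture, and the function name and the `skip_final_merge` switch are assumptions of this sketch. The switch ignores the step from 2 clusters to 1 cluster, in line with the note that this large jump is not the elbow.

```python
def suggest_number_of_clusters(coefficients, skip_final_merge=True):
    """Suggest a cluster count from the largest jump between stages."""
    n_cases = len(coefficients) + 1  # n cases are merged in n - 1 stages
    jumps = [
        later - earlier
        for earlier, later in zip(coefficients, coefficients[1:])
    ]
    if skip_final_merge:
        jumps = jumps[:-1]  # drop the jump into the last stage (2 -> 1 cluster)
    if not jumps:
        raise ValueError("not enough stages to compare")
    stage = jumps.index(max(jumps)) + 1  # last stage before the largest jump
    return n_cases - stage


# Illustrative coefficients for stages 1 to 5 of a six-case analysis.
coefficients = [1.00, 1.41, 1.98, 4.10, 6.59]

print(suggest_number_of_clusters(coefficients))                          # 3
print(suggest_number_of_clusters(coefficients, skip_final_merge=False))  # 2
```

    A rule like this should only support the visual inspection of the structure chart and the dendrogram, not replace the theoretical and empirical reasoning mentioned under 0).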

    Slide 32

    B) Dendrogram

    Choose the number of clusters within the largest increase in heterogeneity

    [Figure: dendrogram with standardized distance axis; the largest increase in heterogeneity is marked]


    Slide 33

    Fourth step: Display and save cluster membership

    Output table of cluster membership

    Example of brand awareness: assumed 3 clusters

    If you're not sure about the number of clusters, choose a full range.

    Slide 34

    Saving the cluster membership

    Used for drawing a scatter plot, for example.

    Range of solutions: 2 to 5

    Example of brand awareness: assumed 3 clusters


    Slide 35

    Scatter plot:

    Slide 36

    One point was assigned incorrectly


    Slide 37

    Fifth step: Interpretation of clusters

    In the case of the brand awareness example, the interpretation is obvious and straightforward.

    Taking into account means

    The means of the clusters with respect to the original variables indicate how the clusters can be interpreted.

    Example of Lecture 01: Marketing survey on consumer buying behavior

    Questionnaire to ask people about their attitudes.

    Among other questions:

    "What is your general attitude to life?" (variable x1)

    "What is your attitude to innovation?" (variable x2)

    "What is your willingness to take risks?" (variable x3)

    Scales of variables vary from extremely negative (1) to extremely positive (7)

    Data of 6 people (objects in rows, attributes in columns)

    Object     general attitude to life   attitude to innovation   willingness to take risks
    Person A              1                         2                          2
    Person B              1                         3                          3
    Person C              2                         4                          2
    Person D              5                         4                          3
    Person E              5                         4                          4
    Person F              7                         6                          7

    Slide 38

    Mean of clusters, regarding the cluster variables (clusters in rows, attributes in columns)

    Cluster     general attitude to life   attitude to innovation   willingness to take risks
    (A, B, C)             1.3                       3                          2.3
    (D, E)                5                         4                          3.5
    (F)                   7                         6                          7

    Cluster 1 (A, B, C): pessimistic people who live in fear

    Cluster 2 (D, E): slightly optimistic ordinary people

    Cluster 3 (F): life-affirming adventurer
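    The cluster means can be recomputed from the six-person table with a short Python sketch. The data and the cluster membership are taken from the slides; the function name `cluster_means` and the cluster labels used as dictionary keys are assumptions of this sketch.

```python
# Six-person table: attitude to life, attitude to innovation, risk taking.
people = {
    "A": (1, 2, 2), "B": (1, 3, 3), "C": (2, 4, 2),
    "D": (5, 4, 3), "E": (5, 4, 4), "F": (7, 6, 7),
}

# Cluster membership as given on the slide.
clusters = {
    "Cluster 1": ["A", "B", "C"],
    "Cluster 2": ["D", "E"],
    "Cluster 3": ["F"],
}


def cluster_means(data, members):
    """Mean of every cluster variable over the members of one cluster."""
    rows = [data[name] for name in members]
    return [sum(column) / len(rows) for column in zip(*rows)]


means = {label: cluster_means(people, members) for label, members in clusters.items()}
for label, values in means.items():
    print(label, [round(value, 1) for value in values])
```

    Low means on all three variables for Cluster 1 and high means for Cluster 3 are what motivates the labels given above.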