Feedback: Clustering and k-Means - Marina...
Feedback: Clustering and k-Means
Question: If we want to find a market segment to sell children's clothes to, why don't we simply look at the records where the number of children is > 0 and target these people? Or we might take the intersection of marital status and children and target these people?
You can do it, of course: that is traditional programming. If you have a database, you can simply write an (SQL) query and extract all the records that meet your criteria.
However, the fact that this is feasible does not necessarily make it effective. Your data might contain hidden patterns that you would overlook with this traditional approach.
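As a minimal sketch of the traditional query approach, the snippet below uses Python's built-in sqlite3 module to extract customers with children, and then married customers with children. The table name, the column names and the four records are invented for illustration; they are not the course dataset.

```python
import sqlite3

# Hypothetical customer records (income, age, children, marital_status).
rows = [
    (200000, 45, 5, "married"),
    (60000, 60, 0, "married"),
    (15000, 28, 3, "single"),
    (30000, 33, 0, "single"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (income INTEGER, age INTEGER, "
    "children INTEGER, marital_status TEXT)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Criterion 1: everybody with at least one child.
with_children = conn.execute(
    "SELECT income, age, children, marital_status "
    "FROM customers WHERE children > 0"
).fetchall()

# Criterion 2: the intersection of marital status and children.
married_with_children = conn.execute(
    "SELECT income, age, children, marital_status "
    "FROM customers WHERE children > 0 AND marital_status = 'married'"
).fetchall()

print(with_children)
print(married_with_children)
```

The query only returns what we asked for: it cannot tell us that the two customers with children differ enormously in income, which is the kind of grouping a clustering run can bring to the surface.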
Remember what we said at the beginning of the course (lect 1):
Supervised and unsupervised machine learning algorithms can both be used for classification and data exploration.
Preliminaries
→ k-means & nominal/numeric attributes
Generally speaking, k-means is implemented for numeric attributes. If you have nominal attributes in the dataset, it is common practice to binarize them.
But … apparently Weka's implementation can handle nominal data. To compute the centroids, Weka uses the mean (numeric attributes) or the mode (nominal attributes). [Math course: the mode is a measure of central tendency; it is the value that appears most often.]
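The mean/mode idea can be sketched in a few lines of Python. This is only an illustration of the principle described above, not Weka's code; the function name centroid and the three records are hypothetical.

```python
from statistics import mean, mode

# Hypothetical members of one cluster: (income, children, marital_status).
cluster = [
    (30000, 2, "single"),
    (20000, 3, "single"),
    (25000, 1, "divorced"),
]

def centroid(instances, nominal_columns):
    """Mean for numeric columns, mode (most frequent value) for nominal ones."""
    columns = list(zip(*instances))
    return tuple(
        mode(col) if i in nominal_columns else mean(col)
        for i, col in enumerate(columns)
    )

# Column 2 (marital_status) is the only nominal attribute here.
center = centroid(cluster, nominal_columns={2})
print(center)
```

The resulting "center" mixes averaged numbers with a most-frequent label, which is why it is not necessarily a record that exists in the data.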
In real life, convert nominal data into binary, because nominal attributes might cause weird results (see footnote 1). But for this exercise we trust what Weka does.
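Binarizing a nominal attribute means replacing it with one 0/1 column per distinct value. A minimal sketch, assuming a hypothetical helper called binarize and made-up values:

```python
def binarize(values):
    """Turn one nominal attribute into one 0/1 column per distinct value."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

marital_status = ["married", "single", "divorced", "married"]
categories, encoded = binarize(marital_status)

print(categories)
for row in encoded:
    print(row)
```

After this step every attribute is numeric, so the mean and the Euclidean distance are well defined for all columns.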
ATT.: always check how Weka implements an ML algorithm: hover over the name of the algorithm in the Classify tab or in the Cluster tab.
k-Means Review
The most commonly used clustering strategy is based on the squared-error criterion. Objective: to minimize the squared error, where the squared error is the sum of the squared Euclidean distances (or any other distance you decide on) between each instance/observation and its cluster center. The sum of squared errors (SSE; see footnote 2) indicates how compact the clusters are: the lower the value, the better (see footnote 3). Conversely, the larger the inter-cluster distance, the better. In short, the smaller the intra-cluster distance and the bigger the inter-cluster distance, the better the clustering will be.
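To make the criterion concrete, here is a small sketch that computes the sum of squared errors for a given assignment of points to centroids. The points, centroids and function names are made up for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sum_of_squared_errors(points, assignments, centroids):
    """Add up the squared distance of every point to its own cluster center."""
    return sum(
        squared_distance(p, centroids[c]) for p, c in zip(points, assignments)
    )

# Hypothetical 2-D data: two tight groups, each with its centroid.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
centroids = [(1.0, 2.0), (9.0, 8.0)]
assignments = [0, 0, 1, 1]          # index of the cluster of each point

total = sum_of_squared_errors(points, assignments, centroids)
print(total)
```

Every point here lies exactly one unit from its center, so each contributes 1 to the total. Assigning a point to the far centroid instead would raise the value sharply, which is what "less compact" means.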
1 See Witten et al. (2011: 480-481) and also http://stackoverflow.com/questions/28396974/weka-simple-k-means-handling-nominal-attributes
2 See also < https://hlab.stanford.edu/brian/error_sum_of_squares.html >. Please read this very interesting thread in the Weka mailing list: some of the answers describe approaches to minimizing the within-cluster sum of squared errors. Remember that there are many empirical ways (or heuristics) to make sense of the data you have: < http://weka.8497.n7.nabble.com/Ignore-the-class-td33195.html >.
3 See also < http://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/1999/clustering/node17.html >. Remember, when attributes are not on comparable scales, you might want to apply the normalize or the standardize filter (depending on which is more appropriate for your data). When using the Euclidean distance, if you do not normalize/standardize your data, the variables measured in large-valued units will dominate the computed dissimilarity and the variables measured in small-valued units will contribute very little.
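The effect described in footnote 3 can be seen on three made-up customers. In this sketch, normalize is assumed to mean min-max scaling to [0, 1] and standardize is assumed to mean zero mean and unit standard deviation; the exact behaviour of Weka's filters should be checked in Weka itself.

```python
from math import sqrt
from statistics import mean, pstdev

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(column):
    """Min-max scaling: smallest value becomes 0, largest becomes 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def standardize(column):
    """Zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

# Hypothetical customers: (income in dollars, number of children).
customers = [(20000, 0), (21000, 5), (60000, 0)]

# Raw distances: income (large units) swamps the difference in children.
raw_ab = euclidean(customers[0], customers[1])
raw_ac = euclidean(customers[0], customers[2])

income, children = zip(*customers)
scaled = list(zip(normalize(income), normalize(children)))
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
print(standardize(children))
```

On the raw data, the customer with five children looks almost identical to the childless one because their incomes are close. After scaling, the difference in children weighs as much as the difference in income.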
While hierarchical clustering provides an explicit measure of inter-cluster distance (the linkage type), the Weka implementation of k-means reports the sum of squared errors, which measures the distance to the centroid. As the objective is to minimize this error, it can be considered equivalent to the intra-cluster distance.
The algorithm assesses each instance/observation and moves it into the nearest cluster. The nearest cluster is the one with the smallest Euclidean distance (or any other distance metric that you decide on) between the observation and the centroid of the cluster. If you have nominal data, Weka will compute the mode (not the mean), but other implementations might not handle nominal data…
Centroid: k-means works best when you provide good initial points for the clusters. It is hard to know what a good initial point is, and there are several techniques for finding one; in Weka the initial points are determined by the random seed parameter (default 10).
The order of the instances in the dataset also has an impact on the final clustering solution (do you remember that the perceptron had a similar problem?). Weka provides the option preserveInstancesOrder, default value False.
When a cluster changes (i.e. it loses or gains an instance/observation), the cluster centroids are recalculated.
This process repeats until no more instances/observations can be moved into a different cluster. At this point, all observations are in their nearest cluster according to the criterion above.
Difference with hierarchical clustering: clusters in hierarchical clustering cannot change once they are formed. With k-means, on the contrary, it is possible for two observations to be split into separate clusters after they have been joined together.
When all the instances/observations have been assigned, the final Sum of Squared error is computed…
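The procedure reviewed above can be sketched as follows. This is the common batch variant (assign all instances, then recompute all centroids), written from scratch for illustration; it is not Weka's SimpleKMeans, and Python's random generator will not reproduce Weka's results for the same seed. The defaults seed=10 and max_iterations=500 simply mirror the values mentioned in these notes, and the data are made up.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=10, max_iterations=500):
    """Basic k-means for numeric tuples; returns assignments, centroids, SSE, iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # initial centers depend on the seed
    assignments = None
    for iteration in range(1, max_iterations + 1):
        # Move every instance into its nearest cluster.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved: we are done
            break
        assignments = new_assignments
        # Recalculate the centroid of every non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    sse = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))
    return assignments, centroids, sse, iteration

# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centroids, sse, iterations = kmeans(points, k=2)

print(assignments)
print(centroids)
print(round(sse, 3), iterations)
```

Note how an instance can change cluster from one pass to the next, which is exactly what hierarchical clustering does not allow.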
The customer dataset
Mapping clusters to classes DOES make sense also in an exploratory/descriptive scenario.
Cluster Mode: Use training set
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation: customers
Instances: 9
Attributes: 5
    income
    age
    children
    marital_status
    education
Test mode:evaluate on training data
According to the position of the attributes in the dataset, we could say that education would be the default class of this dataset if we were doing supervised classification. But in this run both education and marital status are simply taken as attributes.
Remember that in the USA, college refers to undergraduate studies. If you want to go on to a Master's or a Doctorate, you go to graduate school.
(Digression. Watch out: what is said in Ian Witten's tutorial about the class of the iris dataset can be a bit misleading here. He suggests ignoring the class or using the class to evaluate the clusters. You can do this when you are doing "classification": he wants to classify the iris dataset using an unsupervised learning algorithm, putting the iris flowers into 3 separate clusters that make sense. Here, with the customer dataset, we want to explore possible groupings that make sense in order to focus our marketing campaign. We are doing exploration, not classification. We consider all the attributes equally important and we do not remove any of them. End of digression.)
So, for the customer dataset we take all the attributes into account for an initial exploration of our problem. We know that we want 5 clusters, and the algorithm computes the cluster assignment based on this request and on the default random seed.
We now check how compact our clusters are: SSE: 5.1; iterations: 3. Hmm… not too bad, but I am not convinced about cluster 3: its income is unconvincing. Let's try something else.
If we change the random seed to 100, we get: squared errors: 2.6 and iterations: 2
If we change the random seed to 1000, we get: squared errors: 3.0 and iterations: 2
We make several tries because we are EXPLORING the best cluster solutions based on our distance metric, and we think that setting the random seed to 100 (sum of squared errors: 2.6; iterations: 2) gives the most compact solution.
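Trying several seeds and keeping the run with the lowest SSE can be sketched as below. The k-means function is a bare-bones illustration rather than Weka's implementation, the nine 2-D points are invented, and the SSE values it prints have nothing to do with the 5.1 / 2.6 / 3.0 reported above for the customer dataset.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(points, k, seed, max_iterations=500):
    """Run a basic k-means from a seeded random start and return the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignments))

# Hypothetical data: nine points, three clusters requested.
points = [
    (1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0),
    (5.0, 5.0), (5.0, 6.0),
    (9.0, 1.0), (9.0, 2.0), (10.0, 1.0),
]

# Same data, same k, different starting points.
results = {seed: kmeans_sse(points, k=3, seed=seed) for seed in (10, 100, 1000)}
best_seed = min(results, key=results.get)

for seed, sse in results.items():
    print(f"seed={seed}: SSE={sse:.2f}")
print("lowest SSE with seed", best_seed)
```

A lower SSE only says that the clusters are more compact; whether the grouping is useful for the campaign still has to be judged by looking at the centroids.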
We look at the clusters and we see that cluster 0 contains one instance: a customer who has an income of 200 000 dollars, an average age of 45 and 5 children, is married and attended graduate school. This cluster contains rich, highly educated people with a high number of children, so it could potentially be of interest for our campaign: they have money and lots of children, and they might want to buy quality/expensive clothes.
You can also notice that cluster 1 has a decent income, they are married, but have no children, and they are quite old.
Cluster 3 is made of single people with children, and very low income.
And so on.
We have unveiled some patterns that are plausible, and we can guess some buying behaviours. For example, we might target a selling campaign for expensive outdoor children's clothes at well-off people (cluster 0), and we can focus another campaign on low-income single parents, who need to buy budget clothes (cluster 3).
All the clusters, except cluster 1, have a different "mean" number of children, associated with a different income and with a different marital status and education. This can have an impact on buying behaviour.
This pattern, if we think it is reliable, allows us to run a fine-grained selling campaign rather than a one-size-fits-all campaign, which can be less rewarding or less effective. Even more: although our stated purpose is "We will target the advertising only to the persons with young children", we might creatively target another selling campaign at people who do not have children and are on the verge of retirement, and encourage them to contribute to charity with "dress the undressed: buy a T-shirt for a distressed child in underdeveloped countries" (I am just inventing).
What is important here is to see that you have discovered patterns that can be useful for creating a fine-grained and customized selling campaign, patterns that you would not have discovered just by targeting people who have children and/or are married. We get a model that might be useful for making sense of unseen/future data. We learn from what we have: a model is created that generalizes from the actual data and is (hopefully) capable of making sense of data that we have not seen yet. So it is a way to have an eye on the future. This cannot be done with traditional programming or static modelling, such as the classical database approach where an SQL query extracts only the records that meet some requirements, such as having children or being married. Such a solution is possible, and it might be appropriate in certain cases, but it is very short-sighted and might not give the selling results you expect.
We do not know whether these clusters make sense in real life, for our purpose. It depends on our knowledge of the field, on our intuition, etc.
What we could do, since we have the possibility to do so, is to use the nominal attributes to get further insights. Weka allows us to use the attributes as classes to evaluate against. We can do this as part of our unsupervised exploratory analysis of our data.
Task 3 asked to use "marital status" to evaluate the clusters. Why can this be interesting? What does this tell us?
The marital status attribute is not taken into account when building the model.
We get a small sum of squared errors. Number of iterations: 2; within cluster sum of squared errors: 1.3 (a different value from the previous runs).
We get a slightly different pattern (but cluster 0 is confirmed).
How is the class assigned to the clusters? Let's learn how to read the output.
Classes to clusters evaluation. In this mode Weka first ignores the class attribute and generates the clustering. Then during the test phase it assigns classes to the clusters, based on the majority value of the class attribute within each cluster. Then it computes the classification error, based on this assignment and also shows the corresponding confusion matrix.
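The evaluation described above can be sketched as follows. The code follows the description literally (majority class per cluster, then error rate and confusion counts); Weka's own mapping may differ in details, for example in how ties are handled or in leaving a cluster without a class. The cluster assignments and labels are made up.

```python
from collections import Counter, defaultdict

def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and measure the resulting error."""
    confusion = defaultdict(Counter)       # confusion[cluster][class] = count
    for cluster, label in zip(cluster_ids, class_labels):
        confusion[cluster][label] += 1
    # Majority value of the class attribute within each cluster.
    mapping = {c: counts.most_common(1)[0][0] for c, counts in confusion.items()}
    errors = sum(
        1 for c, label in zip(cluster_ids, class_labels) if mapping[c] != label
    )
    return mapping, confusion, errors / len(class_labels)

# Hypothetical result of a clustering run in which the class was ignored.
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 2, 2]
marital_status = [
    "married", "married",
    "single", "single", "divorced",
    "divorced", "divorced", "married", "divorced",
]

mapping, confusion, error_rate = classes_to_clusters(cluster_ids, marital_status)

print(mapping)
for cluster in sorted(confusion):
    print(cluster, dict(confusion[cluster]))
print(f"incorrectly clustered: {error_rate:.1%}")
```

The class attribute plays no role while the clusters are built; it is only used afterwards, to see how the discovered groups line up with a label we already have.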
What we can understand from this clustering solution is that those who are married (cluster 4) tend to have fewer children and a higher income than those who are divorced (cluster 2) or single (cluster 3). Conversely, those who are divorced or single have quite a lot of children but a lower income. All the members of these 3 clusters are relatively young (between 25 and 35), etc.
You can see in the confusion matrix that cluster 0 has been assigned to married (exactly as in the previous run), but this information is ignored in the final calculation of the classification error because it is not the majority class of the marital status attribute within each cluster. Cluster 1 has also been assigned to married and divorced. Both cluster 0 and cluster 1 have children and a high income.
Again, this pattern gives us ideas on how to tailor a marketing campaign to different slices of the market; this time we give special importance to the relation between marital status and the potential for spending money on children's clothes. Maybe the initial assumption was that married people with a medium income (cluster 3) might be inclined to spend more on clothes, but this pattern shows us that they tend to have fewer children, so in practical terms they are going to spend less money all in all. Therefore it might be good to set up selling campaigns and special offers for single parents with many children (just inventing :-) ).
In conclusion, we can say that even when exploring data, it might be informative to map the clusters onto the classes that we have.
This mapping is not done to measure the goodness of our classification model, but to gain more insight into the possible relations and correlations existing in the data.
As Ian Witten said, it is a kind of black magic.
One important thing to remember: if you have strong patterns in the data, these patterns will surface in all kinds of manipulations.
With the junk dataset, on the other hand, we evaluate the clusters against the classes because we aim to evaluate the goodness of our unsupervised model against the correct classification:
with the default values, we get the following performance, which is not too bad for an unsupervised model.
The underpinnings: repetition
• exploration/description vs classification/prediction
• discover patterns that are not evident and that cannot be unveiled with traditional methods
• build a model to capture unseen or future cases
• describe and interpret data with the use of underlying mathematical models
• we do not set any limits to what we can do with the data we have, provided that it makes sense: we can basically use all the options and facilities that Weka allows us to use to gain insight into the data.