CPSC 340: Data Mining Machine Learningschmidtm/Courses/340-F15/L9.pdf · Machine Learning and Data...

CPSC 340: Machine Learning and Data Mining

Density-Based Clustering

Fall 2015

• Tutorials today.

• Office hours tomorrow

• Assignment 2 due Friday.

K-Means++

• Steps of k-means++:

1. Select the initial mean µ1 at random from among the objects xi.

2. Compute the distance dic from each object xi to each mean µc.

3. For each object xi, set di to the minimum distance across all means µc.

4. Choose the next mean by sampling an object with probability proportional to (di)².

5. Stop when we have k means, otherwise return to step 2.

• Expected approximation ratio is O(log(k)).
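The initialization steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans_pp_init` and its `rng` argument are hypothetical, and it assumes X is an n-by-d float array with at least k distinct rows (otherwise the sampling weights could all be zero).

```python
import numpy as np

def kmeans_pp_init(X, k, rng=None):
    """Pick k initial means from the rows of X using k-means++ sampling."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    # Step 1: the first mean is a uniformly random example.
    means = [X[rng.integers(n)]]
    while len(means) < k:
        # Steps 2-3: squared distance from each x_i to its nearest chosen mean.
        M = np.asarray(means)
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Step 4: sample the next mean with probability proportional to d_i^2.
        means.append(X[rng.choice(n, p=d2 / d2.sum())])
    # Step 5: stop once we have k means.
    return np.asarray(means)
```

An example that is already a mean has squared distance zero, so it is never sampled a second time.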

K-Means++ (example with k=4)

• First mean is a random example.

• Weight the examples by squared distance to the nearest mean.

• Sample the next mean with probability proportional to the squared distances.

• Repeat the weighting and sampling until we have k means (here, the target is k=4).

• Then run k-means as usual:

– Assign each object to the closest mean.

– Update the mean of each group.

– Keep going until no objects change groups.
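The assign/update loop that follows initialization could look roughly like the sketch below. The function name `kmeans`, the `max_iter` cap, and the choice to leave an empty group's mean unchanged are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def kmeans(X, means, max_iter=100):
    """Alternate assignments and mean updates until no label changes."""
    means = np.array(means, dtype=float)
    labels = None
    for _ in range(max_iter):
        # Assign each object to the closest mean.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Stop when no objects change groups.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update the mean of each non-empty group.
        for c in range(len(means)):
            if np.any(labels == c):
                means[c] = X[labels == c].mean(axis=0)
    return means, labels
```

The initial `means` can come from any initialization, including k-means++ sampling.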

Shape of K-Means Clusters

• K-means clusters are formed by the intersection of half-spaces.

(Figure: one half-space for each ordered pair of means: red over green, green over red, blue over green, green over blue, magenta over green, green over magenta. Each cluster is the intersection of the half-spaces that its mean wins.)

• Intersection of half-spaces forms a convex set:

– Line between any two points in the set stays in the set.

(Figure: two convex sets and one set that is not convex.)
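To see why the region where one mean beats another is a half-space, here is a short derivation. It uses the slides' µc for a mean and writes µc' for a competing mean (the prime notation is an assumption made here). An object x prefers µc over µc' when:

```latex
\begin{align*}
\|x - \mu_c\|^2 &\le \|x - \mu_{c'}\|^2 \\
\|x\|^2 - 2\mu_c^\top x + \|\mu_c\|^2 &\le \|x\|^2 - 2\mu_{c'}^\top x + \|\mu_{c'}\|^2 \\
2(\mu_{c'} - \mu_c)^\top x &\le \|\mu_{c'}\|^2 - \|\mu_c\|^2
\end{align*}
```

The last line is a linear inequality in x, which defines a half-space. The cluster for µc is the set of x satisfying this inequality for every other mean, so it is an intersection of half-spaces.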

K-Means with Non-Convex Clusters

https://corelifesciences.com/human-long-non-coding-rna-expression-microarray-service.html

• K-means cannot separate non-convex clusters.

– Though over-clustering can help (next class).

Application: Elephant Range Map

• Find habitat area of African elephants.

– Useful for assessing/protecting population.

• Build clusters from observations of locations.

• Clusters are non-convex:

– affected by vegetation, relief, rivers, water access.

• We do not want a partition:

– Some regions should not have a cluster.

http://www.defenders.org/elephant/basic-facts

Motivation for Density-Based Clustering

• Density-based clustering is a non-parametric clustering method:

– Clusters are defined by connected dense regions.

• These clusters can become more complicated the more data we have.

– Data points in non-dense regions are not assigned a cluster.

http://www.defenders.org/elephant/basic-facts

Other Potential Applications

• Where are high crime regions of a city?

• Where should taxis patrol?

• Where does Iguodala make/miss shots?

• Which products are similar to this one?

• Which pictures are in the same place?

• Where can protein ‘dock’?

https://en.wikipedia.org/wiki/Cluster_analysis

https://www.flickr.com/photos/dbarefoot/420194128/

http://letsgowarriors.com/replacing-jarrett-jack/2013/10/04/

http://www.dbs.informatik.uni-muenchen.de/Forschung/KDD/Clustering/

• Density-based clustering algorithm (DBSCAN) has two parameters:

– Radius: maximum distance between two points for them to be considered ‘close’.

– MinPoints: minimum number of ‘close’ points needed to define a cluster.

• Pseudocode for DBSCAN:

– For each example xi:

• If xi is already assigned to a cluster, do nothing.

• If xi is not a core point (fewer than minPoints neighbours within distance ‘r’), do nothing.

• If xi is a core point, expand cluster.

– Expand cluster function:

• Assign all xj within distance ‘r’ of core point xi to cluster.

• For each newly-assigned neighbour xj that is a core point, expand cluster.
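The pseudocode above could be turned into a small runnable sketch like the one below. Several details are assumptions, not taken from the slides: the function name `dbscan`, the label -1 for unassigned points, counting a point as its own neighbour, and the all-pairs distance matrix, which is only practical for small datasets.

```python
import numpy as np

def dbscan(X, r, min_points):
    """Label each row of X with a cluster id, or -1 if unassigned."""
    n = X.shape[0]
    # Pairwise Euclidean distances (O(n^2) memory; fine for a small sketch).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbours = [np.flatnonzero(dist[i] <= r) for i in range(n)]
    # Assumption: a point counts as one of its own neighbours.
    is_core = np.array([len(nb) >= min_points for nb in neighbours])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        # Skip points that are already assigned or are not core points.
        if labels[i] != -1 or not is_core[i]:
            continue
        # Expand a new cluster outward from core point i.
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            for m in neighbours[j]:
                if labels[m] == -1:
                    labels[m] = cluster
                    # Only core points keep expanding the cluster.
                    if is_core[m]:
                        stack.append(m)
        cluster += 1
    return labels
```

In this sketch a boundary point keeps the label of the first cluster that reaches it, so its assignment can depend on the order of the examples.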

Density-Based Clustering Issues

• Some points are not assigned to a cluster.

– Good or bad, depending on the application.

• Sensitive to the choice of radius and minPoints.

• Ambiguity of ‘non-core’ (boundary) points:

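– They could be within distance ‘r’ of core points from more than one cluster, so their assignment can depend on the order in which points are processed.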

• Other than this ambiguity, not sensitive to initialization.

• Assigning new points to clusters is expensive.

• In high dimensions, we need a lot of points to ‘fill’ the space.

Summary

1. K-means++: randomized initialization with good expected performance.

2. Shape of K-means clusters: intersection of half-spaces => convex sets.

3. Density-based clustering: useful for finding non-convex connected clusters.

4. DBSCAN algorithm: assign points in dense regions to same cluster.

• Next time:

– Dealing with clusters of different densities.
