The
Clustering Models Project
(CluMP)
Page



Overview

Clustering is a process of pattern induction. The kinds of patterns which are discovered vary according to the clustering method used. When one has some idea what patterns are to be found, this information can be used to improve the perceived quality of the discovered clusters. Two ways in which prior knowledge can exploited are: Clustering Models

There are many clustering methods out there. The most popular are the hierarcical agglomerative clustering methods. These start with each point in its own cluster, and merge clusters repeatedly until the desired number of clusters remain. On the other hand, the most well-founded and well-understood methods are the probabilistic model-based clustering methods.

Many people see the hierarchical agglomerative methods as a "hack," but we have shown that they are in fact model-based methods after all.

Constrained Clustering [Demo]

Sometimes you have an idea what kinds of clusters you want to get out of a clustering system. For example you may know that certain points belong together while others don't. See Kiri Wagstaff's constrained clustering page for other work in this area.

Let's say we have two points that the clustering proceedure put into different clusters, but we know they actually belong together. It's possible that the problem is that one or both points are outliers -- that we don't want them to behave like the points nearby. However, it's more likely that the points near them are wrong, too. To fix multiple points at once, we can interpret the constraint as a space-level constraint, and use it to warp the space so those two points become closer together. Check out the demo!

Doing this gives much better performance that simply using instance level constraints. The graphs below show accuracy as number of constraints increases, comparing instance-level constraints (with K-means) to space-level constraints (with complete-link clustering). The space-level constraints give better final accuracy, despite being used with an inferior clustering algorithm. Most clustering algorithms work best on separated spherical data. Our clustering method takes non-spherical data and uses constraints to contract the similarity space in a way that makes the clusters tighter and more spherical.


Publications



Contact Information

Comments about the project page? Feel free to email Dan or Sep.