Loading web-font TeX/Main/Regular

Nov 23, 2014

Density-based clustering in spatial data (2)

This is the second of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Ordering points to identify the clustering structure (OPTICS) is a data clustering algorithm presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999 [1]. It builds on the ideas of DBSCAN, which I described in a previous post. But let's cut the intro and dive right into it...

Let (X,d) be a finite discrete metric space, i.e.~let X be a data set on which we have a notion of similarity expressed in terms of a suitable distance function d. Given a positive integer m \in \mathbb{N} we can define a notion of density on points of our data set, which in turn can be used to define a perturbed metric. Given a starting point x_0 \in X the OPTICS-algorithm iteratively defines a linear ordering on X in terms of a filtration X_0 \subset X_1 \subset \ldots \subset X_n of X, where X_{n+1} is obtained from X_n by appending the closest element (with respect to the perturbed metric) in its complement.
The original OPTICS-algorithm also takes an additional parameter \varepsilon > 0. However this is only needed to reduce the time complexity and is ignored in our discussion for now -- to be more precise, we implicitly set \varepsilon = \infty.


Definitions

Let (X,d) be a finite discrete metric space and m \in \mathbb{N} a positive integer. We define the (co-)density \delta_m(x) of a point x by \delta_m(x) := d(x, \mathfrak{nn}_m(x) ),
where \mathfrak{nn}_m(x) is an m-nearest-neighbor of x. Loosely speaking: the lower the value \delta_m(x) the closer the neighbors of x are distributed around x. Note that with \varepsilon = \delta_m(x) the point x is a core point in the sense of [2] -- in a previous post about DBSCAN we called such a point \varepsilon-dense. That is why in the literature \delta_m(x) is referred to as core-distance. I however like to think of it, and hence refer to it, as a notion of co-density. Further note that for any y \in X we have (as long as d satisfies the triangle inequality) \delta_m(y) \leq d(x,y) + \delta_m(x).
Given the co-density \delta_m(.) we can define the reachability distance r_x(y) of y from x by r_x(y) := \text{max} \big\{ \delta_m(x), d(x,y) \big\}.
Note that r_x(y) is not symmetric, since the density of x and y may differ.

Sketch of the algorithm

Choose a starting point x_0 \in X. Then we can iteratively define a filtration X_0 \subset \ldots \subset X_n of the data set X by X_0 := \{ x_0 \} \ \ \text{and} \ \ X_{k+1} := X_k \cup \{ x_{k+1} \},
where x_{k+1} minimizes r_{X_k}(.) over X \setminus X_k. In parallel we define a sequence (r_n) of distances by r_0 = 0 \ \ \text{and} \ \ r_{k+1} := r_{X_k}(x_{k+1}).
Note that a small distance r_k may be understood as x_k being close to a rather dense region. Therefore the filtration tries to stay in dense regions for as long as possible, before it passes a less dense region. The cluster structure can now be extracted by analyzing the reachability-plot, that is a 2-dimensional plot, with the ordered x_k on the x-axis and the associated distances r_k on the y-axis. By the above considerations it should be clear that clusters show up as valleys in the reachability plot. The deeper the valley, the denser the cluster.

Example

Consider the data points X\subset \mathbb{R}^2 in the euclidian plane given in the following figure:


Below you see the reachability plot corresponding the OPTICS-algorithm applied to the above data set with m=4. The starting point is chosen among the group of points on the right.



References

[1] M. Ankerst, M.M. Breunig, H.-P. Kriegel, OPTICS: Ordering Points To Identify the Clustering Structure, ACM SIGMOD international conference on Management of data (1999), pp 49--60.
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), pp. 226--231.

1 comment :

  1. Dear Mirko,

    this is very interesting again. :-)

    I have one question or thought about the reachability and the filtration. If I were to implement that algorithm, my intuition would tell me "let's try to define the core-distance/co-density with respect to X_k instead of X". Because this way the algorithm would prefer to extend the filtration on those points which are dense in X_k (instead of dense in X). I think it would be interesting to see what difference that makes.

    To me that would have been a more natural choice at first, but the longer I think about it I am already reconsidering. My variation of the algorithm would probably not have the ability (or 'tendency') to drift to more dense regions if it started in a sparse region. That would seem to be the main difference. Ok, not a real question after all..

    But very interesting post again, I enjoy reading your blog. :-)

    ReplyDelete