Nov 17, 2014

Density-based clustering in spatial data (1)

This is the first of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 [1]. It uses notions of connectivity and density of points in the data set to compute the connected components of dense enough regions of the data set. But let's cut the intro and dive right into it...


Definitions

Let $(X,d)$ be a finite discrete metric space, i.e. let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Assume we fixed a pair of parameters $\varepsilon > 0$ and $m \in \mathbb{N}$. We say that two points $x,x' \in X$ are $\varepsilon$-connected if there is a sequence of points $x=x_0,\ldots,x_n=x'$ such that \[ d(x_i,x_{i+1}) < \varepsilon, \text{ for $i=0,\ldots,n-1$}. \] Note that $\varepsilon$-connectivity defines an equivalence relation on $X$ which we denote by $\sim_\varepsilon$.
For a subset $A \subset X$ we define its $\varepsilon$-boundary $\Delta_\varepsilon A$ to be the set of points whose distance to $A$ is at most $\varepsilon$, i.e. we set \[ \Delta A := \Delta_\varepsilon A := \{ x \in X \setminus A \ | \ d(A,x) \leq \varepsilon \}. \] Here $d(A,x)$ denotes the minimum $\text{min}_{a \in A} \{ d(a,x) \}$ over all points in $A$. This notion of boundary is borrowed from graph theory and slightly adapted to our situation. Don't worry if you don't like that notion it could also be omitted -- I will elaborate on that later on. The choice of the parameters above allows us to define a discrete notion of density $\rho(x)$ of a point $x$ by \[ \rho(x) = \rho_{\varepsilon}(x) := |N_\varepsilon(x)|, \] where $|.|$ counts the number of elements of a set and $N_\varepsilon(x)$ denotes the $\varepsilon$-neighborhood of $p$ (i.e.~the set of points whose distance to $x$ is at most $\varepsilon$). A point is called $m$-dense if its ($\varepsilon$-)density is greater or equal to $m$, i.e.~its $\varepsilon$-neighborhood $N_\varepsilon(p)$ contains at least $m$ points. In the literature these points usually are referred to as core points.

Sketch of the DBSCAN-algorithm

The idea behind DBSCAN can be formulated as follows. Choose pair of parameters $(\varepsilon, m) \in \mathbb{R}_{\geq 0} \times \mathbb{N}$. The first parameter, $\varepsilon > 0$, gives rise to the notions of connectivity, density, and the boundary of subset as described above. Let \[ X_{m \leq \rho} := \rho^{-1}\big( [m,\infty) \big) \subset X \] denote the points of density at least $m$ and let $C_1,\ldots,C_n$ denote its connected components with respect to the notion of $\varepsilon$-connectivity defined above, i.e. \[ \pi_0^\varepsilon(X_{m \leq \rho}) := \big( X_{m \leq \rho} \big)/_{x \sim_\varepsilon x'} = \big\{ C_1,\ldots,C_n \big\}. \] The components $C_1,\ldots,C_n$ can be understood as dense cores of the final clusters. Note that these sets are mutually disjoint. The final clusters $\overline{C}_1,\ldots,\overline{C}_n$ are obtained from $C_1,\ldots,C_n$ by adding their respetive boundaries, i.e. for $i=1,\ldots,n$ we define \[ \overline{C}_i := C_i \cup \Delta C_i. \] Actually $\overline{C}_i$ is just the $\varepsilon$-neighborhood $N_\varepsilon(C_i)$ of $C_i$, but I like the formulation in terms of $\Delta_\varepsilon$. Note that that $\overline{C}_1,\ldots,\overline{C}_n$ are not necessarily pairwise disjoint. To achieve that we need to make a choice (usually following the order in which the points are visited) and in that sense an implementation of DBSCAN is usually not deterministic.

References

[1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996), AAAI Press, pp. 226–231

2 comments :

  1. Hi Mirko,

    I stumpled across your blog and enjoyed reading it very much.
    And I think I even understood something. Or not, let's see. :-)

    Here are some questions and comments:
    1) I wonder if the sign is correct in your definition X_{ρ≥m}? Shouldnt it be \leq instead of \geq? Or maybe you want to take the inverse set? Or I am missing something.
    2) For Newbies like me, maybe you could add some 1-2 sentences about m. It obviously adds some filtering of the data, but due to the sign uncertainty I am confused whether it filters dense or sparse points. ;-) Is it supposed to be some noise suppression threshold?
    3) Why would you calculate the envelope in the end, if it is not pairwise disjoint? Is it correct, that the envelope is calculated according to the original set prior filtering and what would be the benefit? I find that to be most strange, or may be I miss something again. If that intermediate filtering is noise suppression, why would I want to re-include that in the end?

    Colin

    ReplyDelete
    Replies
    1. Hi Colin :)

      1+2) You are totally right about the wrong sign. It should be "points of density at least(!) m" not "at most". I changed that. Thanks!!

      3) Well, I would stop with the "dense cores" as well. I think that there is a version of DBSCAN that does so. However the original DBSCAN-algorithm doesn't, so I followed that line.

      I may add examples and some code in a future post to illustrate things better.

      Take care,
      Mirko

      Delete