Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 [1]. It uses notions of connectivity and density of points in the data set to compute the connected components of dense enough regions of the data set. But let's cut the intro and dive right into it...
Definitions
Let $(X,d)$ be a finite discrete metric space, i.e. let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Assume we fixed a pair of parameters $\varepsilon > 0$ and $m \in \mathbb{N}$. We say that two points $x,x' \in X$ are $\varepsilon$-connected if there is a sequence of points $x=x_0,\ldots,x_n=x'$ such that \[ d(x_i,x_{i+1}) < \varepsilon, \text{ for $i=0,\ldots,n-1$}. \] Note that $\varepsilon$-connectivity defines an equivalence relation on $X$ which we denote by $\sim_\varepsilon$.For a subset $A \subset X$ we define its $\varepsilon$-boundary $\Delta_\varepsilon A$ to be the set of points whose distance to $A$ is at most $\varepsilon$, i.e. we set \[ \Delta A := \Delta_\varepsilon A := \{ x \in X \setminus A \ | \ d(A,x) \leq \varepsilon \}. \] Here $d(A,x)$ denotes the minimum $\text{min}_{a \in A} \{ d(a,x) \}$ over all points in $A$. This notion of boundary is borrowed from graph theory and slightly adapted to our situation. Don't worry if you don't like that notion it could also be omitted -- I will elaborate on that later on. The choice of the parameters above allows us to define a discrete notion of density $\rho(x)$ of a point $x$ by \[ \rho(x) = \rho_{\varepsilon}(x) := |N_\varepsilon(x)|, \] where $|.|$ counts the number of elements of a set and $N_\varepsilon(x)$ denotes the $\varepsilon$-neighborhood of $p$ (i.e.~the set of points whose distance to $x$ is at most $\varepsilon$). A point is called $m$-dense if its ($\varepsilon$-)density is greater or equal to $m$, i.e.~its $\varepsilon$-neighborhood $N_\varepsilon(p)$ contains at least $m$ points. In the literature these points usually are referred to as core points.
Hi Mirko,
ReplyDeleteI stumpled across your blog and enjoyed reading it very much.
And I think I even understood something. Or not, let's see. :-)
Here are some questions and comments:
1) I wonder if the sign is correct in your definition X_{ρ≥m}? Shouldnt it be \leq instead of \geq? Or maybe you want to take the inverse set? Or I am missing something.
2) For Newbies like me, maybe you could add some 1-2 sentences about m. It obviously adds some filtering of the data, but due to the sign uncertainty I am confused whether it filters dense or sparse points. ;-) Is it supposed to be some noise suppression threshold?
3) Why would you calculate the envelope in the end, if it is not pairwise disjoint? Is it correct, that the envelope is calculated according to the original set prior filtering and what would be the benefit? I find that to be most strange, or may be I miss something again. If that intermediate filtering is noise suppression, why would I want to re-include that in the end?
Colin
Hi Colin :)
Delete1+2) You are totally right about the wrong sign. It should be "points of density at least(!) m" not "at most". I changed that. Thanks!!
3) Well, I would stop with the "dense cores" as well. I think that there is a version of DBSCAN that does so. However the original DBSCAN-algorithm doesn't, so I followed that line.
I may add examples and some code in a future post to illustrate things better.
Take care,
Mirko