Dec 8, 2014

Mapper - A discrete generalization of the Reeb graph

This is the third of a series of posts on cluster-algorithms and ideas in data analysis.

Mapper is a construction that uses a given cluster-algorithm to associate a simplicial complex to a reference map on a given data set. It was introduced by Carlsson--Mémoli--Singh in [1] and lies at the core of the of the topological data analysis startup Ayasdi. A good reference, I personally enjoyed reading, is [2]. Mapper is closely related to the Reeb graph of a real-valued function on a topological space. Just as the Reeb graph it is able to capture certain topological and geometrical features of the underlying space.

Nov 23, 2014

Density-based clustering in spatial data (2)

This is the second of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Ordering points to identify the clustering structure (OPTICS) is a data clustering algorithm presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander in 1999 [1]. It builds on the ideas of DBSCAN, which I described in a previous post. But let's cut the intro and dive right into it...

Let $(X,d)$ be a finite discrete metric space, i.e.~let $X$ be a data set on which we have a notion of similarity expressed in terms of a suitable distance function $d$. Given a positive integer $m \in \mathbb{N}$ we can define a notion of density on points of our data set, which in turn can be used to define a perturbed metric. Given a starting point $x_0 \in X$ the OPTICS-algorithm iteratively defines a linear ordering on $X$ in terms of a filtration $X_0 \subset X_1 \subset \ldots \subset X_n$ of $X$, where $X_{n+1}$ is obtained from $X_n$ by appending the closest element (with respect to the perturbed metric) in its complement.
The original OPTICS-algorithm also takes an additional parameter $\varepsilon > 0$. However this is only needed to reduce the time complexity and is ignored in our discussion for now -- to be more precise, we implicitly set $\varepsilon = \infty$.

Nov 17, 2014

Density-based clustering in spatial data (1)

This is the first of a series of posts on cluster-algorithms and ideas in data analysis (and related fields).

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 [1]. It uses notions of connectivity and density of points in the data set to compute the connected components of dense enough regions of the data set. But let's cut the intro and dive right into it...