Background Flow: clustering

Tuesday, December 13, 2011

rank-aware clustering

By Julia Stoyanovich at U Penn

Example application: results from a search for dates on yahoo!singles

Often a correlation between ranking funciton & selection criteria results in tremendous homogineity in the top results.

Note that the correlations may be localized: age & education correlate up to age 30, while age & income correlate up to age 50.

BARAC Bottom-up Algorirthm for Rank Aware Clustering
an elaboration of subspace clutering (review: Parsons SIGKDD 2006)
One cannot always use i.e. PCA to reduce dimensions, as one cluster might occupy dims a&b while a second is in b&c. Subspace clustering solves this.

Three steps to algo:

Build grid Split each dimension into intervals, rank all items in that interval. Runs in ~ linear time
Merge intervals For each neighboring set of intervals, merge them unless one dominates the other. ~ constant time. Only for dimensions with an ordering (?)
Join dimensions Build K-dimensional clusters from (K-1)-dim clusters, using a quality score derived from the rankings. ~ exponential time, but in practice polinomial and manageable.

Merge insures tightness Join insures maximality, and via the quality score criteria, rank awareness
Interval dominance If Theta percent (Theta in 0.5 to 1) of the top N results in the join of interval 1 and 2 are also in the top N of interval 1, then 1 dominates 2. (top(I_1,N)) intersect (top(I_1 + I_2,N))/N > Theta
Ranked subspace A subspace is a collection of intervals. Compare the top N values in the subspace to the top N values of the intervals-- we are not looking for density of points, but for density of winning points.
Want at least Q items from each interval, where Q can be the M ranks or sum of ranking score (in case the score the rank is based on has high variance)

Friday, August 19, 2011

biclustering

Useful for biological systems, as it can identify modules of cells/genes/whatever

cMonkey is one promising algorithm
cmonkey

at ISMB, someone presented on
multi-species c-monkey
pmed link
Abstract:

We describe an algorithm, multi-species cMonkey, for the simultaneous biclustering of heterogeneous multiple-species data collections and apply the algorithm to a group of bacteria containing Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes. The algorithm reveals evolutionary insights into the surprisingly high degree of conservation of regulatory modules across these three species and allows data and insights from well-studied organisms to complement the analysis of related but less well studied organisms.