Tuesday, December 13, 2011

rank-aware clustering

By Julia Stoyanovich at U Penn

Example application: results from a search for dates on Yahoo! Singles

Often, a correlation between the ranking function and the selection criteria results in tremendous homogeneity in the top results.

Note that the correlations may be localized: age & education correlate up to age 30, while age & income correlate up to age 50.

BARAC: Bottom-up Algorithm for Rank-Aware Clustering
an elaboration of subspace clustering (review: Parsons, SIGKDD 2006)
One cannot always use, e.g., PCA to reduce dimensions, since one cluster might occupy dimensions a & b while a second lies in b & c. Subspace clustering solves this.

Three steps to algo:
  1. Build grid: split each dimension into intervals and rank all items within each interval. Runs in roughly linear time (sketched below).
  2. Merge intervals: for each pair of neighboring intervals, merge them unless one dominates the other. Roughly constant time. Only for dimensions with an ordering (?)
  3. Join dimensions: build K-dimensional clusters from (K-1)-dimensional clusters, using a quality score derived from the rankings. Worst-case exponential time, but in practice polynomial and manageable.
Merge ensures tightness; Join ensures maximality and, via the quality-score criteria, rank awareness.
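
As a rough illustration of the grid-building step, here is a minimal Python sketch under my own assumptions (items are dicts, score is a callable; none of these names come from the talk): split one dimension into equal-width intervals and rank the items in each interval by score.

from collections import defaultdict

def build_grid(items, dim, num_intervals, score):
    """Step 1 sketch: bucket items along one dimension into equal-width
    intervals and rank each bucket by score, best first (~linear time)."""
    values = [item[dim] for item in items]
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_intervals or 1.0  # avoid zero width when all values are equal
    grid = defaultdict(list)
    for item in items:
        # clamp the maximum value into the last interval
        idx = min(int((item[dim] - lo) / width), num_intervals - 1)
        grid[idx].append(item)
    return {idx: sorted(bucket, key=score, reverse=True)
            for idx, bucket in grid.items()}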
Interval dominance: if at least a fraction Theta (Theta in [0.5, 1]) of the top N results of the merged interval I_1 ∪ I_2 are also in the top N of I_1 alone, then I_1 dominates I_2: |top(I_1, N) ∩ top(I_1 ∪ I_2, N)| / N ≥ Theta.
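
In code, that dominance test might look like the sketch below (argument names are mine; the top-N lists are assumed to hold item identifiers in rank order):

def dominates(top_i1, top_merged, n, theta=0.7):
    """I_1 dominates I_2 if at least theta of the top-n items of the merged
    interval I_1 ∪ I_2 already appear in the top-n of I_1 alone."""
    overlap = len(set(top_i1[:n]) & set(top_merged[:n]))
    return overlap / n >= theta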
Ranked subspace: a subspace is a collection of intervals. Compare the top N values in the subspace to the top N values of its intervals; we are not looking for density of points, but for density of winning points.
We want at least Q items from each interval, where Q can be counted in ranks or as a sum of ranking scores (useful when the score the rank is based on has high variance).
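
A hedged sketch of that coverage requirement, with Q measured simply as a count of items (all names here are illustrative, not from the paper):

def covers_all_intervals(top_subspace, interval_members, n, q):
    """Return True if every contributing interval places at least q of its
    items among the subspace's top-n -- density of winning points, not of
    points in general."""
    winners = set(top_subspace[:n])
    return all(len(winners & set(members)) >= q
               for members in interval_members.values())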
