Example application: results from a search for dates on yahoo!singles
Often a correlation between ranking funciton & selection criteria results in tremendous homogineity in the top results.
Note that the correlations may be localized: age & education correlate up to age 30, while age & income correlate up to age 50.
BARAC Bottom-up Algorirthm for Rank Aware Clustering
an elaboration of subspace clutering (review: Parsons SIGKDD 2006)
One cannot always use i.e. PCA to reduce dimensions, as one cluster might occupy dims a&b while a second is in b&c. Subspace clustering solves this.
Three steps to algo:
- Build grid Split each dimension into intervals, rank all items in that interval. Runs in ~ linear time
- Merge intervals For each neighboring set of intervals, merge them unless one dominates the other. ~ constant time. Only for dimensions with an ordering (?)
- Join dimensions Build K-dimensional clusters from (K-1)-dim clusters, using a quality score derived from the rankings. ~ exponential time, but in practice polinomial and manageable.
Interval dominance If Theta percent (Theta in 0.5 to 1) of the top N results in the join of interval 1 and 2 are also in the top N of interval 1, then 1 dominates 2. (top(I_1,N)) intersect (top(I_1 + I_2,N))/N > Theta
Ranked subspace A subspace is a collection of intervals. Compare the top N values in the subspace to the top N values of the intervals-- we are not looking for density of points, but for density of winning points.
Want at least Q items from each interval, where Q can be the M ranks or sum of ranking score (in case the score the rank is based on has high variance)
No comments:
Post a Comment