**Author****:** Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Richard J. Lipton, Jun Xu

**Download:** PDF

**Conference: **ICDE 2011: the IEEE International Conference on Data Engineering [link]

**Abstract:**

The study of skylines and their variants has receivedconsiderable attention in recent years. Skylines are essentiallysets of most interesting (undominated) tuples in a database.However, since the skyline is often very large, much researcheffort has been devoted to identifying a smaller subset of (sayk) “representative skyline” points. Several different deﬁnitionsof representative skylines have been considered. Most of theseformulations are intuitive in that they try to achieve some kindof clustering “spread” over the entire skyline, with k points. Inthis work, we take a more principled approach in deﬁning therepresentative skyline objective. One of our main contributionsis to formulate the problem of displaying k representative skylinepoints such that the probability that a random user would clickon one of them is maximized.

Two major research questions arise naturally from this formu-lation. First, how does one mathematically model the likelihoodwith which a user is interested in and will “click” on a certaintuple? Second, how does one negotiate the absence of theknowledge of an explicit set of target users; in particular whatdo we mean by “a random user”? To answer the ﬁrst question,we model users based on a novel formulation of thresholdpreferences which we will motivate further in the paper. Toanswer the second question, we assume a probability distributionof users instead of a ﬁxed set of users. While this makes theproblem harder, it lends more mathematical structures that canbe exploited as well, as one can now work with probabilities ofthresholds and handle cumulative density functions.

On the theoretical front, our objective is NP-hard. For thecase of a ﬁnite set of users with known thresholds, we presenta simple greedy algorithm that attains an approximation ratio of of the optimal. For the case of user distributions,we show that a careful yet similar greedy algorithm achieves thesame approximation ratio. Unfortunately, it turns out that thisalgorithm is rather involved and computationally expensive. Sowe present a threshold sampling based algorithm that is morecomputationally affordable and, for any ﬁxed , has anapproximation ratio of . We perform experimentson both real and synthetic data to show that our algorithmsigniﬁcantly outperforms previously proposed approaches.

**Update History**

**[v1] **November 14, 2010 (Conference version)