Skip Navigation
York U: Redefine the PossibleHOME | Current Students | Faculty & Staff | Research | International
Search »FacultiesLibrariesCampus MapsYork U OrganizationDirectorySite Index
Future Students, Alumni & Visitors
2002 Technical Reports

Cardinality Estimation of Skyline Queries: Harmonics in Data

Parke Godfrey

Technical Report CS-2002-03

York University

October 2002

Abstract

There is interest to support queries with preferences in relational systems. In accordance, the SKYLINE operator has been proposed as an extension to SQL. The corresponding skyline clause extends on the aggregate operators of MAX and MIN. It essentially selects the tuples that are mutually optimal over a number of attributes (by filtering out all tuples with worse values on every skyline attribute compared with the skyline tuples).

Recent work has focused on how to compute efficiently skyline queries in relational systems over large datasets. Steps remain, however, before the skyline operator can be incorporated into today's relational systems. A better understanding is needed on how the skyline operator composes with the other relational operators, both logically and algorithmically. The query optimizer must be able to incorporate the skyline operator. This will require extending the cost model for skyline queries. A critical component of the cost model must be a cardinality estimator for skyline queries' result sets.

Skyline cardinality estimation is the focus of this work. Under a basic model of assumptions of sparseness of values on attributes' domains (that is, virtually no duplicate values over an attribute) and statistical independence across attributes, we prove the expected skyline cardinality for two-dimensional skyline queries is the harmonic of the number of input tuples, and we generalize to prove the expected value for multi-dimensional skyline queries is a higher-order harmonic (a particular two parameter extension of the harmonic numbers) with respect to the number dimensions (skyline attributes) and the number of input tuples. We then consider the ramifications on the estimates as we relax the assumptions of the basic model, some of which are counter-intuitive. Our results provide a basis for a cost model for the skyline operator, and general insight into queries over multi-dimensional criteria.

Download paper in PDF format.



The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.