Project #2: Tipping Point CSE-4411M | Winter 2013

CSE-4411M
Database Management Systems
York University
Winter 2014

Project #2 [v2]
The Tipping Point
using IBM DB2

Introduction

For this project, you will explore, albeit briefly, IBM DB2's query optimizer (v9) and how it optimizes SQL queries. The DB2 server you will use — db2inst1 — is on keyword.cse.yorku.ca. (It is running DB2-v9. The DB2 available on the PRISM machines and RED is DB2-v10. Our v10 client is not playing nice with our v9 server … apologies.)

This hosts the database stl. In particular, your task is to explore when the optimizer switches from one plan to another for the same templated query, as reduction factors are changed. These are the tipping points in the plan space.

The STL Database

The StL data-warehouse is for St Lawrence Booksellers, Inc. It records sales by customer and book from the years 1997 to 2003. StL is a small, boutique web-based bookseller.

The StL Data-warehouse Schema

The schema of StL-DW consists of three tables:

`Customer`
`cust#`	`BIGINT`
`country`	`CHAR(3)`
`state`	`CHAR(2)`
`city`	`VARCHAR(15)`

`Book`
`book#`	`BIGINT`
`language`	`VARCHAR(10)`
`genre`	`VARCHAR(15)`
`publisher`	`VARCHAR(25)`
`price`	`DECIMAL(5, 2)`

`Purchase`
`cust#`	`BIGINT`
`book#`	`BIGINT`
`when`	`DATE`
`qnty`	`SMALLINT`
`sale`	`DECIMAL(5, 2)`

The underlined attributes show the primary keys. In the language of data-warehousing, Customer and Book are dimension tables, and Purchase is the fact table. Customer has 100,000 rows, Book has 14,831 rows, and Purchase has 6,838,088. There is a foreign key from Purchase to Customer and from Purchase to Book.

In Purchase, cust#, book#, and when are the dimension columns. The measure column is sale. It reports how much the sale was for. Each sale is equal to the book's price × qnty. The column qnty indicates how many copies of this book was bought in that given purchase. In most cases, qnty = 1. Note that sale has already accommodated qnty! In your queries, you do not have to multiply by qnty.

Since StL is a web-based book-seller, we mail the books to the customers' addresses. So one can think of "customer" as a location. Call that dimension location then. That dimension offers a natural roll-up: country > state > city > cust#.

The next dimension is book. Books have three attributes of interest: language (what language it is written in), genre (what category it is in, e.g., humor, politics, etc.), and publisher (what company publishes it). These form three dimensions on book, and offer another natural cube (treating Book itself as a small fact table).

The last dimension is time, which is represented by the attribute when in Purchase. It is of type DATE. So it does not show the time of a purchase, but reports the day of purchase. Notice there is no table Time, but we could pretend there is a virtual table for Time. E.g.,

`Time`
`year`	`YEAR`
`month`	`MONTH`
`day`	`DAY`

Thus, the time dimension provides for a natural roll-up. Standard SQL provides many functions for dealing with DATE and TIMESTAMP values. In fact, we can roll up when in many ways.

There are no NULL values in the data warehouse.

Indexes

The clustered indexes are on the primary keys as follows.

customer: cust#
Book: book#
Purchase: cust#, book#, when

Table Purchase has two additional unclustered indexes:

book#
when

Accessing the database

The database is accessible via DB2 on KEYWORD, i.e., keyword.cse.yorku.ca.

The database is named stl. So that is what you need to connect to. E.g.,


		% db2 "connect to stl"

The three tables are in a schema named stl. So to access them, one needs to prepend ‘stl.’ in front of the table name. E.g.,


		% db2 "select count(*) from stl.purchase"

(See the other instruction pages for DB2 off of the class homepage for more help.)

The StL-DW is reasonably large. Most of the (correct) queries for this project take 30 seconds to a couple of minutes to execute. Be patient. For real DWs, query execution time is often hours. Design your queries carefully. Do them in steps, and debug along the way. Also reuse (parts of) queries once you have them working in subsequent challenges.

Profiling Query Plans

We shall use the tool db2expln to explore query plans that the DB2 optimizer creates.

(We used the command db2exfmt in class. This provides more detailed information. But it is not readily available on KEYWORD, and is not needed here. DB2 also provides GUI interfaces for this type of stuff, with proper setup.)

You need to be able to “see” commands in DB2's install. From any shell you are using DB2, run


		% source ~db2inst1/.cshrc

This sets environment variables and command paths to see the tools in the instance of the DB2 system running on KEYWORD.

The following command will dynamically prepare the query in file sql_file, and output and explanation of DB2's plan for it to the shell.


		% db2expln -d stl -terminal -z \; -f sql_file

This assumes the SQL query in sql_file is terminated with a semicolon, as is convention. (The “-z \;” tells db2expln this. Otherwise, db2expln assumes the query text to be all on one line in the file. The “\” simply escapes the semicolon for shell, as semicolon has a special meaning in shell.)

To output into a file, say


		% db2expln -d stl -o output_file -z \; -f sql_file

To prepare and explain a query just typed on the command line:


		% db2expln -d stl -o output_file -z \;
			-q "query text here"

Tasks

Design the following queries. For each, you are finding a query with two variations: the query is the same except the values of a predicate is — or predicates are — different between the two versions. You are wanting that DB2's query optimizer finds different query plans for the two versions. E.g.,


        select sum(sale) as total
    from stl.purchase
    where when between '01/01/2001' and '12/31/2001';

and


        select sum(sale) as total
    from stl.purchase
    where when between '12/01/2001' and '12/31/2001';

Design queries with variants for the following.

Design a query that results in a plan that uses an index-nested loop join for one variant, but not for the other variant.
Design a query that uses aggregation that requires an explicit sort for the aggregation in the plan for one variant, but does not need the sort for the aggregation in the other variant — that is, it does the aggregation on the fly.
Design another query where the two variations result in different plans.

Deliverables

Make a directory named tipping for your project.

Within the directory, put the SQL for your queries into files named oneA.sql, oneB.sql, twoA.sql, twoB.sql, threeA.sql, and threeB.sql, where one, two, and three correspond to the queries in the above section, and A and B for each represent the two variants.

Save the results of db2expl for each in oneA.plan, oneB.plan, twoA.plan, twoB.plan, threeA.plan, and threeB.plan, for the queries, respectively.

Write a text file named report.txt Explain why each of the three query variations result in different query plans. Keep the report simple (under 600 words).

Submit the directory as follows.


		% submit 4411 tipping tipping

parke godfrey