|
CSE-4411M
Database Management Systems
York University
Winter 2014
|
Project #2 [v2]
The Tipping Point
using IBM DB2
|
|
|
|
For
this project,
you will explore,
albeit briefly,
IBM DB2's query optimizer (v9)
and how it optimizes SQL queries.
The DB2 server you will use
—
db2inst1
—
is on keyword.cse.yorku.ca .
(It is running DB2-v9.
The DB2 available on the PRISM machines and RED
is DB2-v10.
Our v10 client is not playing nice with our v9 server
…
apologies.)
This hosts the database stl .
In particular,
your task is to
explore when the optimizer switches from
one plan to another
for the same templated query,
as reduction factors are changed.
These are the tipping points in the plan space.
|
|
|
|
The
StL data-warehouse is for St Lawrence Booksellers, Inc.
It records sales by customer and book from the years 1997 to 2003.
StL is a small, boutique web-based bookseller.
|
|
The StL Data-warehouse Schema
The
schema of StL-DW consists of three tables:
Customer
|
cust#
|
BIGINT
|
country
|
CHAR(3)
|
state
|
CHAR(2)
|
city
|
VARCHAR(15)
|
|
Book
|
book#
|
BIGINT
|
language
|
VARCHAR(10)
|
genre
|
VARCHAR(15)
|
publisher
|
VARCHAR(25)
|
price
|
DECIMAL(5, 2)
|
|
Purchase
|
cust#
|
BIGINT
|
book#
|
BIGINT
|
when
|
DATE
|
qnty
|
SMALLINT
|
sale
|
DECIMAL(5, 2)
|
|
The underlined attributes show the primary keys.
In the language of data-warehousing,
Customer
and
Book
are dimension tables,
and
Purchase
is the fact table.
Customer has 100,000 rows,
Book has 14,831 rows,
and
Purchase has 6,838,088.
There is a foreign key
from Purchase to Customer
and from Purchase to Book .
In Purchase ,
cust# ,
book# , and
when
are the dimension columns.
The measure column is sale .
It reports how much the sale was for.
Each sale is equal
to the book's price × qnty .
The column qnty indicates how many copies of this book
was bought in that given purchase.
In most cases, qnty = 1.
Note that sale has already accommodated qnty !
In your queries,
you do not have to multiply by qnty .
Since StL is a web-based book-seller,
we mail the books to the customers' addresses.
So one can think of "customer" as a location.
Call that dimension location then.
That dimension offers a natural roll-up:
country >
state >
city >
cust# .
The next dimension is book.
Books have three attributes of interest:
language
(what language it is written in),
genre
(what category it is in, e.g., humor, politics, etc.),
and
publisher (what company publishes it).
These form three dimensions on book,
and offer another natural cube
(treating Book itself as a small fact table).
The last dimension is time,
which is represented by the attribute when
in Purchase .
It is of type DATE .
So it does not show the time of a purchase,
but reports the day of purchase.
Notice there is no table Time ,
but we could pretend there is a virtual table
for Time .
E.g.,
Time
|
year
|
YEAR
|
month
|
MONTH
|
day
|
DAY
|
Thus, the time dimension provides for a natural roll-up.
Standard SQL provides many functions for dealing with DATE
and TIMESTAMP values.
In fact,
we can roll up when in many ways.
There are no NULL values in the data warehouse.
|
|
Indexes
The
clustered indexes are on the primary keys as follows.
customer :
cust#
Book :
book#
Purchase :
cust# ,
book# ,
when
Table Purchase
has two additional unclustered indexes:
|
|
Accessing the database
The
database is accessible via DB2 on KEYWORD,
i.e., keyword.cse.yorku.ca .
The database is named stl .
So that is what you need to connect to.
E.g.,
% db2 "connect to stl"
The three tables are in a schema named stl .
So to access them, one needs to prepend
‘stl. ’
in front of the table name.
E.g.,
% db2 "select count(*) from stl.purchase"
(See the other instruction pages for DB2
off of the class homepage for more help.)
The StL-DW is reasonably large.
Most of the (correct) queries
for this project take 30 seconds to a couple
of minutes to execute.
Be patient.
For real DWs,
query execution time is often hours.
Design your queries carefully.
Do them in steps,
and debug along the way.
Also reuse (parts of) queries
once you have them working
in subsequent challenges.
|
|
|
|
We
shall use the tool db2expln
to explore query plans
that the DB2 optimizer creates.
(We used the command db2exfmt
in class.
This provides more detailed information.
But it is not readily available on KEYWORD,
and is not needed here.
DB2 also provides GUI interfaces for this type of stuff,
with proper setup.)
You need to be able to “see”
commands in DB2's install.
From any shell you are using DB2,
run
% source ~db2inst1/.cshrc
This sets environment variables and command paths
to see the tools in the instance of the DB2 system
running on KEYWORD.
The following command will dynamically prepare
the query in file sql_file ,
and output and explanation of DB2's plan for it to the shell.
% db2expln -d stl -terminal -z \; -f sql_file
This assumes the SQL query in sql_file
is terminated with a semicolon, as is convention.
(The “-z \; ”
tells db2expln this.
Otherwise, db2expln assumes
the query text to be all on one line in the file.
The
“\”
simply escapes the semicolon for shell,
as semicolon has a special meaning in shell.)
To output into a file, say
% db2expln -d stl -o output_file -z \; -f sql_file
To prepare and explain a query just typed on the command line:
% db2expln -d stl -o output_file -z \;
-q "query text here"
|
|
|
|
Design
the following queries.
For each,
you are finding a query with two variations:
the query is the same except
the values of a predicate is
—
or predicates are
—
different
between the two versions.
You are wanting that DB2's query optimizer finds different
query plans for the two versions.
E.g.,
select sum(sale) as total
from stl.purchase
where when between '01/01/2001' and '12/31/2001';
and
select sum(sale) as total
from stl.purchase
where when between '12/01/2001' and '12/31/2001';
Design queries with variants for the following.
-
Design a query that
results in a plan that uses an index-nested loop join
for one variant, but not for the other variant.
-
Design a query that
uses aggregation
that requires an explicit sort for the aggregation
in the plan for one variant,
but does not need the sort
for the aggregation in the other variant
—
that is, it does the aggregation on the fly.
-
Design another query
where the two variations result in different plans.
|
|
|
|
Make
a directory named tipping for your project.
Within the directory,
put the SQL for your queries into files named
oneA.sql ,
oneB.sql ,
twoA.sql ,
twoB.sql ,
threeA.sql ,
and
threeB.sql ,
where one, two, and three
correspond to the queries in the above section,
and A and B for each represent the two variants.
Save the results of db2expl for each in
oneA.plan ,
oneB.plan ,
twoA.plan ,
twoB.plan ,
threeA.plan ,
and
threeB.plan ,
for the queries, respectively.
Write a text file named report.txt
Explain why each of the three query variations
result in different query plans.
Keep the report simple
(under 600 words).
Submit the directory as follows.
% submit 4411 tipping tipping
|
|
|
parke godfrey
|
|
|