The training data set consists of 468 Web pages, 232 of which are from the Computer
Science Department at the University of Texas and 234 of which are from the CSE department
of the University of Washington. As mentioned earlier, each of the Web
pages in the data set belongs to one of the three classes: student, faculty and course.
The files are organized into a directory structure,
one directory for each class.
Each directory contains Web pages in plain text files.
The file name of each page corresponds to its URL, with '/' replaced by '^'.
Note that each page starts with a MIME header.
The test data set contains 206 Web pages from the Computer Science Department at Cornell
University. The files are also organized into three directories, one directory for
each class.
The data sets are available at the following course directories:
- Training data (/cs/course/4412/train/)
- Test data (/cs/course/4412/test/)
Note that your program can read the data directly from the above directories
if you use the full path names.
You can also obtain the gzipped tar file of the data sets from
here.
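A minimal sketch of reading the data into memory (the class names come from the assignment; the `latin-1` encoding and the assumption that the MIME header ends at the first blank line are my own):

```python
import os

CLASSES = ["student", "faculty", "course"]

def strip_mime_header(text):
    # The body of a page starts after the first blank line,
    # which terminates the MIME header.
    parts = text.split("\n\n", 1)
    return parts[1] if len(parts) == 2 else text

def load_pages(root):
    """Return (body_text, class_label) pairs from a directory tree
    with one subdirectory per class, e.g. /cs/course/4412/train/."""
    examples = []
    for label in CLASSES:
        class_dir = os.path.join(root, label)
        for name in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, name), encoding="latin-1") as f:
                examples.append((strip_mime_header(f.read()), label))
    return examples
```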
Steps
The project involves the following steps:
- Preprocessing the data. The goal of this step is to extract features from the
Web pages in the training set and use these features to represent each page
in the training and test data sets. After the preprocessing step, each page is represented
as a tuple, which is an example, and both training and test sets contain a set of tuples.
- Use a classification learning algorithm provided by Weka to learn a classifier from
the set of training examples. You can use any of the learning algorithms in Weka for this
purpose.
- Test the learned classifier on the test set and report the testing statistics.
The most challenging part of this project is the data preprocessing step. The learning step
is also important, but with Weka it should take far less time than preprocessing. Generally
speaking, data preprocessing takes more than 80% of the effort in real-world data mining
projects. In this project, you will gain experience in preprocessing raw data for data mining.
Data Preprocessing
The first question that you may have for data preprocessing is how to represent a Web page
with features, i.e., what attributes can we use to represent a Web page?
A simple method for representing a Web page (or a text document in general) is
the "bag-of-words" method, which uses a set of words to represent a document. How to
choose the set of words to represent documents is an important issue. Generally
speaking, we should choose the words that can distinguish Web pages of one class
from the pages of other classes.
Words that appear frequently in one class of pages but infrequently in the other
classes are more useful than words that appear frequently (or infrequently) in
all classes, so you should prefer words of this type.
However, you can use any word/feature selection method that you think is reasonable, such as:
- Document frequency (Selecting words according to the number of Web pages
containing the word)
- Information gain (Selecting words according to the information gain measure)
- Any other method in Weka or that you may design
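As an illustration of the first option, document-frequency selection can be sketched as follows (the function and variable names are mine; pages are assumed to be already tokenized into word lists):

```python
from collections import Counter

def select_words_by_df(tokenized_pages, k):
    """Return the k words that occur in the most training pages."""
    df = Counter()
    for tokens in tokenized_pages:
        df.update(set(tokens))  # count each word at most once per page
    return [word for word, _ in df.most_common(k)]
```

In practice you may also want to drop words that occur in almost every page, since they cannot distinguish the classes.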
After the set of words is selected, you can use one of the three methods we discussed in
class to represent attribute values:
- Use 0 or 1 to indicate the presence of a word in a Web page
- Use the absolute or relative frequency of a word in a Web page
- Use the TF/IDF value of a word in a Web page
For example, if you use the absolute frequency of a word in a Web page to represent an attribute
value and if the set of words is {word1, word2, ..., wordm} and
there are a total of n pages in a data set, the set of pages can be represented as follows:
|        | word1 | word2 | ... | wordm | Class   |
| page 1 | 3     | 0     | ... | 2     | faculty |
| page 2 | 0     | 12    | ... | 1     | course  |
| page 3 | 1     | 7     | ... | 0     | faculty |
| ...    | ...   | ...   | ... | ...   | ...     |
| page n | 6     | 0     | ... | 0     | student |
In the table, the number in cell (i, j) represents the number of occurrences
of word j in page i. Please note that you can use values other than raw
counts (such as the relative frequencies of the words in the page, so that the
values are normalized) to represent the page.
Also note that a word here means a string of characters. It may not be meaningful
in English, but may be useful in distinguishing different classes of Web pages.
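A sketch of turning one page into a row of the table above, using absolute frequencies (the names are illustrative):

```python
from collections import Counter

def page_to_row(tokens, selected_words, label):
    """Represent one page as (count_1, ..., count_m, class),
    i.e. one row of the table above. For relative frequencies,
    divide each count by len(tokens) instead."""
    counts = Counter(tokens)
    return tuple(counts[w] for w in selected_words) + (label,)
```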
You may conduct the following steps in your data preprocessing:
- Extract words from each Web page. To make the project easier, I have
written two small Bourne shell scripts for this task. The
extractwords.sh program extracts words
from a single file, and the extractmultfiles.sh
program calls extractwords.sh to extract words from multiple files (such as
files in a directory). Feel free to modify them to suit your own needs
if you wish.
- (Optional) Remove stop words from the extracted words. You can find a
list of stop words here. Please note that you
can add some domain-specific
stop words (such as head, body, mime, etc) to the list if you wish.
- (Optional) Stemming. Replace each word with its stem. You can find the
source code for the Porter stemming algorithm
here.
- Select a set of words as features/attributes to represent web pages. You can do this by following the steps below:
- (Optional) Build an inverted index file for the training data set. The inverted index file is
structured as follows:
word1 directory/filename1 count1
word2 directory/filename2 count2
word3 directory/filename3 count3
word4 directory/filename4 count4
... ... ...
That is, the index file contains three columns and a number of lines. Each
line contains a word, the file name of a Web page that contains the word, and
the number of occurrences (count) of the word in the Web page. If a word appears
in multiple Web pages, it appears on multiple lines in the index file. The lines
should preferably be sorted (e.g., alphabetically by word).
- Select a set of words for representing Web pages. Write a program that uses a word
selection method to select a "bag of words" based on the information in the inverted
index file for the training data. Alternatively, you can use tools in Weka
that convert text files into ARFF format, and then use Weka's feature selection
functions to select a "bag of words". Note that for feature selection, you can only use
the training data. You cannot use the test data for this purpose because at the
training stage, the test data are "unknown".
- Generate the training data table (in arff format) based on the
selected words.
- Generate the test data table (in arff format) based on the selected words.
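The two generation steps can be sketched as a small ARFF writer (a sketch under the assumptions that the attribute values are numeric word counts and that the selected words are valid ARFF attribute names):

```python
def write_arff(path, relation, selected_words, rows):
    """Write rows of word counts plus a class label in ARFF format.
    Each row is a tuple of counts followed by the class name."""
    with open(path, "w") as f:
        f.write("@relation %s\n\n" % relation)
        for w in selected_words:
            f.write("@attribute %s numeric\n" % w)
        f.write("@attribute class {student,faculty,course}\n\n")
        f.write("@data\n")
        for row in rows:
            f.write(",".join(str(v) for v in row) + "\n")
```

The same function would be called once with the training rows and once with the test rows; both files must list the attributes in the same order.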
After the training and test data tables are prepared, you can use Weka to learn
a classification model from the training data and test the learned model on
the test data. Note that your selection of a learning algorithm should be based only on the training data. Describe clearly in your report which method (e.g., cross-validation) you used to select the algorithm and provide evidence (e.g., the cross-validation results on the training data)
to support your selection of the learning algorithm.
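If you prefer to drive Weka from a script rather than its GUI, one option is Weka's command-line interface, where -t names the training file and -T the test file. The helper below only builds the command; the jar location and the choice of J48 are illustrative assumptions, not requirements:

```python
def weka_command(classifier, train_arff, test_arff, weka_jar="weka.jar"):
    """Build the java command line for training on train_arff and
    evaluating on test_arff with Weka's CLI."""
    return ["java", "-cp", weka_jar, classifier,
            "-t", train_arff, "-T", test_arff]

# For example, a J48 decision tree could be run with:
#   subprocess.run(weka_command("weka.classifiers.trees.J48",
#                               "train.arff", "test.arff"))
```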
What to hand in
- Your programs for this project (if any).
- The stop word list, if you added domain-specific stop words to it
- The training data in the arff format
- The test data in the arff format
- A report (.pdf file) that describes:
- the objective of the project (you may use an introduction section to describe it).
- your methods for preprocessing the data.
- which learning algorithm(s) (in Weka) you used to learn the
model(s) (including the parameter settings if you did not use the default
ones).
- the testing statistics of the learned model(s) on the test data
(such as classification accuracy and confusion matrix).
- any discussion and conclusions that can be drawn from the project.
- how to compile and use your programs (as an appendix).
How to hand in
You should submit your work before the due time online using the
Web submit system.
Please choose "4412" for "Course" and "Project" for "Assignment".
Or you can use the submit command to turn in your work before the due time. If
your work is in directory Dir, use
submit 4412 Project Dir
Marking Scheme
The project will account for 20% of your total mark. It will be evaluated out of 20 points.
Below is the mark breakdown:
- Clearness and organization of your report (6 points)
- Soundness and correctness of your solution (as described in your report and implemented in your programs if any) (6 points).
- Classification accuracy on the test data in comparison with the accuracies of other groups (6 points)
- Teamwork performance mark (2 points) - This mark will come from the other students in your group. At the end of
the project, each of you will be asked to rate the performance of the other members of your group. The ratings
given to you by the other members will determine your teamwork performance mark. This is to encourage all
students to get involved in the project.