Project
EECS-4412
Fall 2017
(Due Monday December 4 by 6pm)

Overview

The topic of this project is Web page classification. You are given a training data set containing a collection of Web pages from two computer science departments, and a test data set containing a collection of Web pages from another computer science department. The pages in these two data sets are hand-classified into the following categories (i.e., classes):
  • Student
  • Faculty
  • Course
You will learn a Web page classifier from the training data, and then use the classifier to classify the Web pages in the test data set. You can use Weka to learn a classifier. Please note that you can do this project in a group of 2 students.

The Data Sets

The training data set consists of 468 Web pages, 232 of which are from the Computer Science Department at the University of Texas and 234 of which are from the CSE Department at the University of Washington. As mentioned earlier, each Web page in the data set belongs to one of the three classes: student, faculty, and course. The files are organized into a directory structure, one directory for each class. Each directory contains Web pages in plain text files. The file name of each page corresponds to its URL, with '/' replaced by '^'. Note that each page starts with a MIME header.
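If you want to drop the MIME header before extracting words, one simple approach is to keep only the text after the first blank line, since a blank line is what separates a MIME header from the body. The following is only a sketch (the class and method names are my own illustration, and the extraction scripts provided below may already be sufficient for your purposes):

  import java.io.*;
  import java.nio.file.*;

  // Drops the MIME header from a page file. A sketch only: it assumes the
  // header is separated from the body by the first blank line, as in
  // standard MIME messages.
  public class StripHeader {
      static String body(Path page) throws IOException {
          String text = new String(Files.readAllBytes(page));
          int split = text.indexOf("\n\n");       // LF header/body separator
          int crlf  = text.indexOf("\r\n\r\n");   // some files use CRLF
          if (crlf >= 0 && (split < 0 || crlf < split)) split = crlf;
          return split < 0 ? text : text.substring(split).trim();
      }
  }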

The test data set contains 206 Web pages from the Computer Science Department at Cornell University. The files are also organized into three directories, one directory for each class. The data sets are available at the following course directories:

  • Training data (/cs/course/4412/train/)
  • Test data (/cs/course/4412/test/)
Note that your program can read the data directly from the above directories if you use the path names. You can also obtain a gzipped tar file of the data sets from here.

Steps

The project involves the following steps:

  1. Preprocess the data. The goal of this step is to extract features from the Web pages in the training set and to use these features to represent each page in the training and test data sets. After this step, each page is represented as a tuple (i.e., an example), so that the training and test sets each become a set of tuples.
  2. Use a classification learning algorithm provided by Weka to learn a classifier from the set of training examples. You can use any of the learning algorithms in Weka for this purpose.
  3. Test the learned classifier on the test set and report the testing statistics.
The most challenging part of this project is the data preprocessing step. The learning step is also important, but with Weka it should be much less time-consuming than the preprocessing. Generally speaking, data preprocessing takes more than 80% of the time in real-world data mining projects. In this project, you will gain experience in preprocessing raw data for data mining.

Data Preprocessing

The first question that you may have for data preprocessing is how to represent a Web page with features, i.e., what attributes can we use to represent a Web page?

A simple method for representing a Web page (or a text document in general) is the "bag-of-words" method, which uses a set of words to represent a document. How to choose the set of words to represent documents is an important issue. Generally speaking, we should choose words that distinguish Web pages of one class from pages of the other classes. Words that appear frequently in one class of pages but infrequently in the other classes are more useful than words that appear frequently (or infrequently) in all categories, so favor words of this type. However, you can use any word/feature selection method that you think is reasonable, such as:
  • Document frequency (Selecting words according to the number of Web pages containing the word)
  • Information gain (Selecting words according to the information gain measure; a small sketch of this appears after the list)
  • Any other method in Weka, or one that you design yourself
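For illustration, here is one possible way to rank candidate words by information gain. This is only a sketch: it assumes you have already collected, for each training page, its class label and the set of words it contains, and all names in it are my own illustration. You would sort the words by the returned gain values and keep the top ones as features.

  import java.util.*;

  // Ranks candidate words by information gain over the page classes.
  // Assumes pageClasses.get(i) is the class label of training page i and
  // pageWords.get(i) is the set of words occurring in page i.
  public class InfoGainRanker {

      // Entropy of a class distribution given by counts summing to total.
      static double entropy(Map<String, Integer> classCounts, int total) {
          double h = 0.0;
          for (int c : classCounts.values()) {
              if (c == 0 || total == 0) continue;
              double p = (double) c / total;
              h -= p * (Math.log(p) / Math.log(2));
          }
          return h;
      }

      static Map<String, Double> rank(List<String> pageClasses,
                                      List<Set<String>> pageWords,
                                      Set<String> candidates) {
          int n = pageClasses.size();
          Map<String, Integer> overall = new HashMap<>();
          for (String c : pageClasses) overall.merge(c, 1, Integer::sum);
          double baseEntropy = entropy(overall, n);   // H(C)

          Map<String, Double> gain = new HashMap<>();
          for (String w : candidates) {
              Map<String, Integer> present = new HashMap<>();
              Map<String, Integer> absent  = new HashMap<>();
              int nPresent = 0;
              for (int i = 0; i < n; i++) {
                  if (pageWords.get(i).contains(w)) {
                      present.merge(pageClasses.get(i), 1, Integer::sum);
                      nPresent++;
                  } else {
                      absent.merge(pageClasses.get(i), 1, Integer::sum);
                  }
              }
              int nAbsent = n - nPresent;
              // IG(w) = H(C) - [P(w)H(C|w present) + P(not w)H(C|w absent)]
              double cond = (nPresent / (double) n) * entropy(present, nPresent)
                          + (nAbsent  / (double) n) * entropy(absent, nAbsent);
              gain.put(w, baseEntropy - cond);
          }
          return gain;   // sort by value and keep the top-k words as features
      }
  }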
After the set of words is selected, you can use one of the three methods we discussed in class to represent attribute values:
  • Use 0 or 1 to indicate the presence of a word in a Web page
  • Use the absolute or relative frequency of a word in a Web page
  • Use the TF/IDF value of a word in a Web page
For example, if you use the absolute frequency of a word in a Web page to represent an attribute value and if the set of words is {word1, word2, ..., wordm} and there are a total of n pages in a data set, the set of pages can be represented as follows:

            word1   word2   ...   wordm   Class
  page 1    3       0       ...   2       faculty
  page 2    0       12      ...   1       course
  page 3    1       7       ...   0       faculty
  ...       ...     ...     ...   ...     ...
  page n    6       0       ...   0       student

In the table, the number in cell (i, j) represents the number of occurrences of wordj in page i. Please note that you can use values other than counts (such as the relative frequencies of the words in the page, so that the values are normalized) to represent the page.
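If you choose the TF/IDF option, a common weighting is tf * log(n / df), where tf is the count of the word in the page, n is the total number of pages, and df is the number of pages containing the word. Below is a small sketch under the assumption that the raw counts per page are already available (the names are my own illustration); other TF/IDF variants are equally acceptable.

  import java.util.*;

  // Computes TF/IDF values for the selected words. Assumes
  // counts.get(i).get(w) is the raw count of word w in page i (absent = 0),
  // and uses the common tf * log(n / df) weighting.
  public class TfIdf {
      static List<Map<String, Double>> weigh(List<Map<String, Integer>> counts,
                                             List<String> selectedWords) {
          int n = counts.size();
          // Document frequency: number of pages containing each selected word.
          Map<String, Integer> df = new HashMap<>();
          for (Map<String, Integer> page : counts)
              for (String w : selectedWords)
                  if (page.getOrDefault(w, 0) > 0) df.merge(w, 1, Integer::sum);

          List<Map<String, Double>> weights = new ArrayList<>();
          for (Map<String, Integer> page : counts) {
              Map<String, Double> row = new HashMap<>();
              for (String w : selectedWords) {
                  int tf = page.getOrDefault(w, 0);
                  int d  = df.getOrDefault(w, 0);
                  row.put(w, d == 0 ? 0.0 : tf * Math.log((double) n / d));
              }
              weights.add(row);
          }
          return weights;
      }
  }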

Also note that a word here means a string of characters. It may not be meaningful in English, but may be useful in distinguishing different classes of Web pages.

You may conduct the following steps in your data preprocessing:
  1. Extract words from each Web page. To make the project easier, I have written two small Bourne shell scripts for this task. The extractwords.sh program extracts the words from a single file, and the extractmultfiles.sh program calls extractwords.sh to extract words from multiple files (such as all the files in a directory). Feel free to modify them to suit your own needs.
  2. (Optional) Remove stop words from the extracted words. You can find a list of stop words here. Please note that you can add some domain-specific stop words (such as head, body, mime, etc.) to the list if you wish.
  3. (Optional) Stemming. Replace the extracted words with their stems. You can find the source code for the Porter stemming algorithm here.
  4. Select a set of words as features/attributes to represent web pages. You can do this by following the steps below:
    • (Optional) Build an inverted index file for the training data set. The inverted index file is structured as follows:

      word1   directory/filename1   count1
      word2   directory/filename2   count2
      word3   directory/filename3   count3
      word4   directory/filename4   count4
      ... ... ...


      That is, the index file contains three columns. Each line contains a word, the file name of a Web page that contains the word, and the number of occurrences (count) of the word in that page. If a word appears in multiple Web pages, it appears on multiple lines in the index file. The lines should preferably be sorted (e.g., in alphabetical order of the words).
    • Select a set of words for representing Web pages. Write a program that uses a word selection method to select a "bag of words" based on the information in the inverted index file for the training data. Alternatively, you can use the tools in Weka that convert text files into ARFF format, and then use Weka's feature selection functions to select a "bag of words". Note that for feature selection you can only use the training data. You cannot use the test data for this purpose because, at the training stage, the test data are "unknown".
  5. Generate the training data table (in arff format) based on the selected words.
  6. Generate the test data table (in arff format) based on the same selected words. (A sketch of these two steps follows this list.)
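For illustration, here is a rough sketch of steps 5 and 6, using absolute counts as attribute values (writeArff, counts, and pageClasses are my own illustrative names; it assumes the selected words, per-page counts, and class labels have already been computed). The key point is that both tables are generated from the same set of selected words, which come from the training data only.

  import java.io.*;
  import java.util.*;

  // Writes a data table in arff format. A sketch only: reuse the SAME
  // selected words when generating both the training and the test tables.
  public class ArffWriter {
      static void writeArff(String path, List<String> selectedWords,
                            List<Map<String, Integer>> counts,
                            List<String> pageClasses) throws IOException {
          try (PrintWriter out = new PrintWriter(new FileWriter(path))) {
              out.println("@relation webpages");
              // Words containing special characters would need quoting here.
              for (String w : selectedWords)
                  out.println("@attribute " + w + " numeric");
              out.println("@attribute class {student,faculty,course}");
              out.println("@data");
              for (int i = 0; i < counts.size(); i++) {
                  StringBuilder row = new StringBuilder();
                  for (String w : selectedWords)
                      row.append(counts.get(i).getOrDefault(w, 0)).append(',');
                  row.append(pageClasses.get(i));
                  out.println(row);
              }
          }
      }
  }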
After the training and test data tables are prepared, you can use Weka to learn a classification model from the training data and test the learned model on the test data. Note that your selection of a learning algorithm should be based only on the training data. You should describe clearly in your report what method (e.g., cross-validation) you used to select the algorithm and provide evidence (e.g., the cross-validation results on the training data) to support your selection of the learning algorithm.
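If you prefer to drive Weka from its Java API rather than from the GUI, the outline below shows one way to organize the cross-validation and the final test. J48 here is only a placeholder for whatever algorithm you end up selecting, and the file names are assumptions.

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  // Selects and evaluates a classifier with Weka's Java API. A sketch only:
  // choose the algorithm that does best in cross-validation on the TRAINING
  // data, then test the final model once on the test data.
  public class RunWeka {
      public static void main(String[] args) throws Exception {
          Instances train = DataSource.read("train.arff");
          Instances test  = DataSource.read("test.arff");
          train.setClassIndex(train.numAttributes() - 1);
          test.setClassIndex(test.numAttributes() - 1);

          // 10-fold cross-validation on the training data (algorithm selection).
          Evaluation cv = new Evaluation(train);
          cv.crossValidateModel(new J48(), train, 10, new Random(1));
          System.out.println("CV accuracy on training data: " + cv.pctCorrect());

          // Train on all training data, then evaluate once on the test data.
          J48 model = new J48();
          model.buildClassifier(train);
          Evaluation ev = new Evaluation(train);
          ev.evaluateModel(model, test);
          System.out.println(ev.toSummaryString());
          System.out.println(ev.toMatrixString());   // confusion matrix
      }
  }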

What to hand in

  • Your programs for this project (if any).
  • The stop word list, if you have added domain-specific stop words to it
  • The training data in the arff format
  • The test data in the arff format
  • A report (.pdf file) that describes:
    • the objective of the project (you may use an introduction section to describe it).
    • your methods for preprocessing the data.
    • which learning algorithm(s) (in Weka) you used to learn the model(s) (including the parameter settings if you did not use the defaults).
    • the testing statistics of the learned model(s) on the test data (such as classification accuracy and confusion matrix).
    • any discussion and conclusions that can be drawn from the project.
    • how to compile and use your programs (as an appendix)

How to hand in

You should submit your work before the due time online using the Web submit system. Please choose 4412 for "Course" and Project for "Assignment".

Or you can use the submit command to turn in your work before the due time. If your work is in directory Dir, use

submit 4412 Project Dir

Marking Scheme

The project will account for 20% of your total mark. It will be evaluated out of 20 points. Below is the mark breakdown:
  • Clarity and organization of your report (6 points)
  • Soundness and correctness of your solution (as described in your report and implemented in your programs if any) (6 points).
  • Classification accuracy on the test data in comparison with the accuracies of other groups (6 points)
  • Teamwork performance mark (2 points) - This mark will come from the other students in your group. At the end of the project, each of you will be asked to rate the performance of the other members of your group, and their ratings of you will determine your teamwork performance mark. This is to encourage all students to get involved in the project.