CSE1710 Lab

CSE 1710.03A Programming for Digital Media

Lab 10

Due date: Nov 23, 2009 at 20:00

Extracting Headlines and Generating a New WWW Page

Your task is to create a WWW page that lists the current news headlines, with some text substitutions to make things a bit more interesting. The program should be designed so that it can be called automatically every hour to refresh the contents of the generated HTML file.

As a source of headlines you will be using http://www.cse.yorku.ca/course/1710/labs/GoogleNews.html, which is a local copy of the WWW page http://news.google.com. The text substitutions should swap two random words in each headline.

For this, you have to write a function createNews() that takes no arguments. Each time it is called, it should automatically get the news headlines and (over-)write the output file, which has to be located at: "Z:\news.html".

The format of the output should be a normal HTML page, with an appropriate header. Each headline should be written as a level-1 heading, according to the HTML specification, i.e. using the '<h1>' tag. Do not forget to finish the generated HTML page with appropriate HTML sequences.
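As a rough illustration of the output format described above, here is a minimal sketch of how a page with one '<h1>' per headline might be assembled and written out. The names build_page and write_page, the title text, and the exact header layout are assumptions made for this example only; they are not part of the lab specification.

```python
def build_page(headlines, title="News Headlines"):
    """Return a complete HTML page with one <h1> element per headline.

    Hypothetical helper: the name and the default title are illustrative.
    """
    lines = ["<html>", "<head>", "<title>" + title + "</title>", "</head>", "<body>"]
    for headline in headlines:
        # Each headline becomes its own level-1 heading.
        lines.append("<h1>" + headline + "</h1>")
    # Finish the page with the closing tags.
    lines.append("</body>")
    lines.append("</html>")
    return "\n".join(lines) + "\n"


def write_page(path, headlines):
    """Overwrite the file at path with the generated page (hypothetical helper)."""
    out = open(path, "w")  # mode "w" replaces any existing contents
    try:
        out.write(build_page(headlines))
    finally:
        out.close()
```

Opening the file in "w" mode is what makes each call overwrite the previous version, which suits a program that is re-run every hour.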

You must name your program lab10.py.

As a starting point, you can use the following code segment, which downloads the contents of a specified WWW page into a string.

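import urllib

# Open the page; urlopen() returns a file-like object
contents = urllib.urlopen('http://www.cbc.ca')
# Read the entire page into a single string
text = contents.read()
# Close the connection once the data has been read
contents.close()
print text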

Here are a couple of hints:

Each (major) headline on the WWW page news.google.com is embedded as a level-2 heading as follows:
```
<h2 class="title"> <a target="_self" class="irrelevant stuff"
href="some link">Maple Leafs finally call it quits</a> </h2>
```
Hence, you have to start by searching for each instance of '<h2 class="title">', then skip over the following link tag (e.g. by searching for the closing '>'), then extract all text after that until the start of the following tag.
Note that the text above has been broken into two lines to make it easier to understand. However, you can safely assume that the headline itself will not have a newline character inside it (but one of the tags may have a newline embedded).
To swap two random words in each headline, it is easiest to first parse the string into a list of words, using string.split(), as discussed in one of the recent lectures. Then generate two random integers between 0 and the number of words in the headline minus 1 (inclusive), and use these two numbers as indices into the list of words. Then swap the two words.
The reason you can't open the Google news page directly with urlopen() is that Google actively prevents access to the page from programs. The simplest way to circumvent this is to use a local copy.
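The extraction and word-swap hints above could be sketched roughly as follows. This is only one possible approach, not a reference solution: the names MARKER, extract_headlines and swap_two_words are made up for this example, and the code assumes that no tag attribute contains a '>' character.

```python
import random

# Start of each headline block, as described in the hint (assumed to match exactly).
MARKER = '<h2 class="title">'


def extract_headlines(text):
    """Return the headline strings found in text (hypothetical helper)."""
    headlines = []
    pos = text.find(MARKER)
    while pos != -1:
        # Skip the link tag by finding its closing '>' after the marker.
        link_end = text.find(">", pos + len(MARKER))
        if link_end == -1:
            break
        # The headline runs up to the start of the following tag.
        next_tag = text.find("<", link_end + 1)
        if next_tag == -1:
            break
        headlines.append(text[link_end + 1:next_tag].strip())
        # Continue with the next headline block, if there is one.
        pos = text.find(MARKER, next_tag)
    return headlines


def swap_two_words(headline):
    """Swap two randomly chosen words in headline (hypothetical helper)."""
    words = headline.split()
    if len(words) >= 2:
        # randint() includes both endpoints, so the upper bound is len - 1.
        i = random.randint(0, len(words) - 1)
        j = random.randint(0, len(words) - 1)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)
```

Because the two indices are drawn independently, they can occasionally be equal, in which case the headline comes back unchanged. If two distinct positions are wanted, random.sample(range(len(words)), 2) is one way to get them.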

What to Turn in

As mentioned previously, you have to add comments with your own identification (name, student ID, date). Remember to save the file after you modify it!

How to submit the lab

For details on how to submit, please refer to lab2. However, this time please submit to lab10, i.e. issue the command

submit 1710 lab10 lab10.py

Note: You must complete all of the above steps correctly to receive full credit for this lab.