Paperboy: Specification

Collection, Processing and Presentation of Online News

Andrew Flegg, Andrew.Flegg@dcs.warwick.ac.uk,
12th October 1999 http://www.dcs.warwick.ac.uk/~csube/paperboy/

Introduction

The recent popularisation of the World Wide Web, combined with the ability to publish at any time of day and as frequently as desired, has led to a proliferation of web sites providing news on a wide range of topics. However, the increase in use of the Internet has also brought, at times, painfully slow download rates, as the available bandwidth is not sufficient to cope with demand. This leaves users increasingly infuriated; they abandon hope of reading the news relevant to them and move on to more pressing matters, such as their job or degree, or even hang up if they are on a relatively expensive dial-up connection.

There have been several agent-based attempts at, and claims of, automated presentation of news of interest to a user; none have been wholly successful. This is the motivation for this project. Paperboy will aim to solve some of these problems by using user-entered rules to find stories of interest on various web sites, caching them locally and later displaying the stories to the user off-line.

All web sites, however, differ in layout and syntax: some have titles and summaries on an index page, others just have titles as links to stories. The project will therefore use a plug-in system to ensure that each site can be understood and passed back to the rest of the program in a standard form, that is: the headline, a short summary paragraph and the main story. The use of plug-ins will also allow future expansion, and correction if, and when, a site changes the syntax of one of its pages.

For greatest efficiency the project will have two parts:

  1. "Gather mode" where each web site/story will be downloaded and stored if appropriate.
  2. Display mode where the stored files will be displayed back to the user and the options and rules for the stories edited.

The former will run without requiring any user input, with an option of displaying a small status window. It is intended to be run as an automated job to fetch the latest news, or as a task run on completion of a dial-up collection. Once the news has been stored, the second mode will display the stories without delay and without requiring the user to be online. The display mode will be the main application from the users' point of view and will provide the graphical user interface used to view the stories and to set and change the rules and any other necessary options.

When the user comes to look at the stories, various view options can be chosen: headlines only, summaries only, or both. Either all cached stories or just unread ones could be shown, and a facility would be provided to delete a story (if it was wrongly chosen), freeing up the local storage it uses. Deleting a story is unlikely to modify the rules, although a message could be sent to the rules system so that it could, at a later date, be replaced with a more intelligent system which learns which stories the user is interested in; this delete message would then be used to teach the system. The aim for the project, however, is a less ambitious system based on exact rules, possibly with a syntax similar to that used by AltaVista, e.g. warwick NEAR university.

Another feature would prevent the user from seeing duplicate stories, even if the headline (or summary) differed: identical URLs could be used to check for identical stories, which would certainly work on sites such as BBC Online.
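
As an illustration only (the class and method names here are hypothetical), such a duplicate check could be as simple as a set of previously stored URLs:

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical duplicate filter: the cache remembers every story URL it
    // has stored and drops any story whose URL has been seen before.
    class DuplicateFilter {
        private final Set seenUrls = new HashSet();

        /** Returns true the first time a URL is offered, false on repeats. */
        boolean accept(String url) {
            return seenUrls.add(url);   // Set.add returns false if already present
        }
    }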

The Objectives

The following objectives are the aims and technical challenges of the project, listed in order of importance; they will form the basis of the timetable later in this document.

O0: Research implementation details

The design of the project should be kept separate from the implementation; however, knowledge of how the program will be put together will allow for a more effective design.

O1: Design: GUI for display mode/plug-in API

As the display mode is a graphically intensive part of the project, the API for the plug-ins' options etc. will be decided after the GUI itself. This will allow the API to be as simple, yet as flexible, as possible. Plug-ins must be capable of returning specific abstract items of a story, for instance the headline, summary and body text, as well as describing any additional options they may have for the options system.
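
As a rough sketch of the kind of contract a plug-in might fulfil (the names NewsPlugin and Story, and the method signatures, are illustrative assumptions rather than the final API):

    import java.util.List;

    /** One article, broken into the abstract parts every plug-in must return. */
    class Story {
        String headline;
        String summary;   // short summary paragraph
        String body;      // main story text (may contain HTML)
        String url;       // source URL, also usable for duplicate detection
    }

    /** Contract each site-specific plug-in would implement. */
    interface NewsPlugin {
        /** Human-readable site name, e.g. "BBC Online". */
        String getSiteName();

        /** Parse the site's index page and return the Story objects found. */
        List getStories(String indexPageHtml) throws Exception;

        /** Names of any plug-in specific options for the options system. */
        List getOptionNames();
    }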

O2: Write parser for rules

Once the design phase has been completed the rules parser will be written: this will take semi-natural language strings, such as "andrew flegg" OR ajf, and return an object capable of reporting whether or not a block of text conforms to a particular rule.
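
The sketch below shows the kind of object the parser might return; the real parser will be generated with JavaCC, and the class names here are placeholders only:

    // Object form a rule such as  "andrew flegg" OR ajf  might compile to.
    interface Rule {
        /** True if the given block of text satisfies the rule. */
        boolean matches(String text);
    }

    /** Matches when a literal term appears anywhere in the text. */
    class TermRule implements Rule {
        private final String term;
        TermRule(String term) { this.term = term.toLowerCase(); }
        public boolean matches(String text) {
            return text.toLowerCase().indexOf(term) != -1;
        }
    }

    /** Matches when either sub-rule matches. */
    class OrRule implements Rule {
        private final Rule left, right;
        OrRule(Rule left, Rule right) { this.left = left; this.right = right; }
        public boolean matches(String text) {
            return left.matches(text) || right.matches(text);
        }
    }

    // Hand-built equivalent of  "andrew flegg" OR ajf :
    //   Rule r = new OrRule(new TermRule("andrew flegg"), new TermRule("ajf"));
    //   boolean interested = r.matches(storyText);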

O3: Fetch web pages through various proxies

Web pages are relatively easy to fetch; however, to increase the project's usefulness it is intended to allow pages to be retrieved through different types of HTTP proxy, including the authenticated proxies often in use at large companies.
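
A minimal sketch of such a fetch, assuming an HTTP proxy that accepts Basic authentication (the host, port and credential parameters are illustrative, and java.util.Base64 is a modern convenience; under JDK 1.2 the encoding would have to be done by hand):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ProxyFetch {
        public static String fetch(String page, String proxyHost, int proxyPort,
                                   String user, String pass) throws Exception {
            // Route all HTTP traffic through the given proxy.
            System.setProperty("http.proxyHost", proxyHost);
            System.setProperty("http.proxyPort", String.valueOf(proxyPort));

            HttpURLConnection conn =
                (HttpURLConnection) new URL(page).openConnection();

            // Basic proxy authentication: "user:pass", base64-encoded.
            String credentials = java.util.Base64.getEncoder()
                .encodeToString((user + ":" + pass).getBytes());
            conn.setRequestProperty("Proxy-Authorization", "Basic " + credentials);

            // Read the whole page into a string.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()));
            StringBuffer out = new StringBuffer();
            String line;
            while ((line = in.readLine()) != null) {
                out.append(line).append('\n');
            }
            in.close();
            return out.toString();
        }
    }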

O4: Complete gather mode

The infrastructure for providing plug-ins with the ability to fetch, process and store articles should now be complete, and proper system testing (rather than component testing) can begin.

O5: Write plug-in for BBC Online

This first plug-in will provide access to a wide range of news topics. BBC Online stories also contain links to related stories, which may be fetched if a plug-in-specific option is selected.

O6: Display fetched stories

This component is, effectively, an HTML viewer which will display locally stored HTML portions, images and text. This will relieve the user of the need to be online to read the news.
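
Since Swing is the chosen toolkit, the viewer could plausibly be built around JEditorPane, which renders simple HTML; in the sketch below the cache file name is purely illustrative:

    import java.io.File;
    import javax.swing.JEditorPane;
    import javax.swing.JFrame;
    import javax.swing.JScrollPane;

    // Rough sketch of the off-line story viewer: a locally cached story is
    // shown without the user needing to be online.
    public class StoryViewer {
        public static void main(String[] args) throws Exception {
            JEditorPane viewer = new JEditorPane();
            viewer.setEditable(false);
            viewer.setPage(new File("cache/bbc-12345.html").toURI().toURL());

            JFrame frame = new JFrame("Paperboy - story");
            frame.getContentPane().add(new JScrollPane(viewer));
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setSize(640, 480);
            frame.setVisible(true);
        }
    }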

O7: Complete display mode

With this finished the program is complete from a technical point of view; however, further plug-ins will be written and various improvements can still be made (e.g. a "learning" rules system).

O8: Write further plug-ins

Once the framework for plug-ins has been completed, plug-ins for additional sites like The Register, Slashdot and Wired News will increase the range of news available to the user.

O9: Expand rules system

The rules system should be expanded to learn from its mistakes: when a user selects a story for deletion, the rules used to select that story should be given less weight, reducing the rank of future stories they return.
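
One possible, purely illustrative, shape for this extension is to wrap each rule in a weight which decays whenever the user deletes a story the rule matched (Rule is the interface sketched under O2; the decay factor is arbitrary):

    // Each rule carries a weight that is reduced on negative feedback.
    class WeightedRule {
        private final Rule rule;
        private double weight = 1.0;

        WeightedRule(Rule rule) { this.rule = rule; }

        /** Contribution of this rule to a story's rank. */
        double score(String text) {
            return rule.matches(text) ? weight : 0.0;
        }

        /** Called when the user deletes a story this rule matched. */
        void penalise() {
            weight *= 0.9;
        }
    }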

It is expected that at least O7 will be completed before the project can be considered successful; from O9 onwards are listed possible goals for future improvement, which will be worked on if time allows.

Methods

It is planned that implementation will take place in Sun's JDK 1.2 (Java 2) using the Swing toolkit. The rules parser will be written using JavaCC. All other code will be written from scratch unless, in the course of the research, more effective methods are found.

Timetable

Term  Week  Primary task                            Expected duration (weeks)
1     2     Specification drawn up                  1 (complete)
      3     Design phase (components/classes/APIs)  5 (complete)
      8     Progress report                         1 (started)
      10    Progress Report due
2     1     Development begins                      8
      1     Web page fetching code                  0.5
      1-2   HTML page processing classes            1
      2     Rules parser                            1.5
      3-4   Article storage/retrieval               0.5
      4-5   GUI/event handler stubs                 1
      6     Options system                          1
      7     Display stories                         1
      8     Story search                            0.5
      8-9   Online help system/text                 1.5
      10    Presentation
3     2     Final report due

Resources

Development will primarily be undertaken on the Department of Computer Science's workstations. As this is a Java project, the higher-end workstations (such as the Ultra 5s and 10s) will be preferable.

Testing will also be done on other platforms such as Win32 and Linux to ensure cross-platform compatibility.