CS120a – Final Project Ideas

CS227 – Final Projects

Project Ideas

The following is a list of topics that should form the basis for your final projects. These topics, as given, are somewhat broad. It is your job to use these ideas (or any others that you come up with) as a springboard for a final project proposal. This proposal is due on March 8 at the start of class, so you should meet with your group as soon as possible to choose your project and define its scope.

Automatic Mirror Selector

Many web sites use mirroring to balance the request load from large numbers of clients. An example of this is Real Player, which requires that downloads of its product be made from sites in geographic proximity to the requesting client. The problem with this approach is that clients have no way of knowing the relative workloads at each mirror site. It may be worthwhile for a client in Boston to download a file from a site across the country rather than from a site in New York, if the load at the New York site is heavy enough to make a cross-country file transfer faster.

For this project, the goal is to come up with an automated mirror selector that predicts, from the client side, which of a collection of mirror sites will offer the best response time for a given file request. You should come up with a number of predictive tools (an obvious one that comes to mind is ping) and run experiments that determine how well each tool predicts the response times of the file requests that follow. Your tools should range from those that require no intrusion on the server site (e.g., ping) to those that require only negligible intrusion (e.g., "planting" a small file available for retrieval at the server).
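One way to score a predictive tool is to count how often the mirror it picks is also the mirror that delivers the file fastest. The sketch below assumes probe times (e.g., from ping) and download times have already been collected; the function names, mirror names, and numbers are all made up for illustration.

```python
from statistics import median

def pick_mirror(probe_ms):
    """Return the mirror with the lowest median probe time.

    probe_ms maps a mirror name to a list of probe round-trip times (ms).
    """
    return min(probe_ms, key=lambda m: median(probe_ms[m]))

def hit_rate(trials):
    """Fraction of trials in which the predicted mirror was also the
    fastest for the real download.

    Each trial is a pair (probe_ms, download_s), where download_s maps
    a mirror name to the measured download time in seconds.
    """
    hits = 0
    for probe_ms, download_s in trials:
        predicted = pick_mirror(probe_ms)
        actual = min(download_s, key=download_s.get)
        hits += predicted == actual
    return hits / len(trials)

# Hypothetical measurements, for illustration only.
trials = [
    # The nearby mirror answers probes quickly but is overloaded.
    ({"new_york": [12, 15, 11], "san_diego": [80, 78, 85]},
     {"new_york": 9.5, "san_diego": 4.2}),
    # The nearby mirror is lightly loaded.
    ({"new_york": [12, 14, 13], "san_diego": [82, 79, 81]},
     {"new_york": 2.1, "san_diego": 4.0}),
]
print(hit_rate(trials))
```

The same scoring function can be reused for each candidate tool by swapping the probe measurements, which makes the tools directly comparable.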

Query Benchmarks for XML-Based Query System

The purpose of this project is to give you a chance to experiment with a couple of research prototypes that implement an XML query language. Since they are prototypes, it's a little unfair to compare their performance, but, on the other hand, it is a great opportunity to learn something about how they work and about how you might go about constructing a benchmark. You will use XML-QL from AT&T Bell Labs and LORE from Stanford University. We are open to suggestions about alternate choices here.

Your job is first to install these systems and become familiar with the way that they work. You should then pick an application area for which you will test these systems. An example would be a digital library. To construct the benchmark, you will first need a data source. You can get this in any (legal) way that you want. You can get it off the net, you can get it from a friend, or you can build it yourself. Part of the exercise is to get this data into a form that can be processed by your target systems.

You must then design a mix of queries that you will send to each of the systems. You will also need a clear idea of what you will measure as a result of the system's execution of your queries. A good set of queries should exercise different aspects of a system. They should be representative of what you might expect in practice. You will justify the design of your benchmark in your project write-up.

Of course, you will run your benchmark on both systems and report your measured results.
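A benchmark run can be organized as a small timing harness that sends every query in the mix to every system and records a summary of the elapsed times. The sketch below is only one possible shape: it assumes each system can be wrapped in a Python callable that takes the query text, and the two "systems" shown are hypothetical stand-ins, not the actual XML-QL or LORE interfaces.

```python
import time
from statistics import median

def run_benchmark(systems, queries, repeats=5):
    """Time every query on every system.

    systems maps a system name to a callable that executes a query string.
    queries maps a query name to its text.
    Returns {(system_name, query_name): median elapsed seconds}.
    """
    results = {}
    for sys_name, execute in systems.items():
        for q_name, q_text in queries.items():
            samples = []
            for _ in range(repeats):
                start = time.perf_counter()
                execute(q_text)
                samples.append(time.perf_counter() - start)
            results[(sys_name, q_name)] = median(samples)
    return results

# Hypothetical stand-ins for the real systems; replace each with a
# wrapper that submits the query to the system under test.
def fake_system_a(query):
    return sum(range(1_000))

def fake_system_b(query):
    return sum(range(100_000))

# Placeholder query descriptions for a digital-library benchmark.
queries = {
    "selection": "find all books by a given author",
    "join": "find authors who share a publisher",
}

results = run_benchmark(
    {"system_a": fake_system_a, "system_b": fake_system_b}, queries, repeats=3
)
for (sys_name, q_name), seconds in sorted(results.items()):
    print(f"{sys_name:10s} {q_name:10s} {seconds:.6f} s")
```

Reporting the median over several repetitions is a design choice that dampens the effect of one-off delays; elapsed time is only one of the measures you might justify in your write-up.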

Client-side Profile Language and Interpreter

For this project, you should come up with a site-specific profile language and interpreter. For example, the chosen site might be CNN.com, and your profile language might specify the pages of interest to you that can be retrieved from CNN.com as they get generated. Your profile might specify interest in poll results for the 2000 election primaries as they get generated, basketball game results where some scorer got over 40 points, or weather forecasts that predict over a foot of snow for any U.S. city.

Many sites include "server-side" profiles. For example, CNN.com includes a feature called "myCNN" which allows users to select topics of interest from a checklist, and automatically generates pages with links to stories related to the chosen topics. The project proposed here would differ in the following ways:

it would process profiles on the client side rather than the server side (among other advantages, this would allow you to expand this project to profile and integrate data from multiple sites)
it would allow finer-grained specification of profile interests, based, for example, on the content of news stories (a player scored over 40 points in a basketball game) or on how recently a story was published (I’m not interested in stories that are a week old).

The exact flavor of the profile language would be up to you. You should define the language unambiguously (e.g., with an annotated grammar) and write an interpreter that would process a profile and generate a personalized web page that includes links to the pages specified in the profile.
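To make the idea concrete, here is a minimal sketch of a toy profile language and its interpreter. Everything in it is an assumption made for illustration: the grammar, the field names, and the idea that each story has already been reduced to a dictionary of fields. A profile is one rule per line, and a story is selected if it satisfies any rule.

```python
import operator
from html import escape

# Toy grammar (hypothetical):
#   profile   ::= rule { NEWLINE rule }
#   rule      ::= condition { " and " condition }
#   condition ::= FIELD OP VALUE
#   OP        ::= ">" | "<" | "="
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

def parse_rule(text):
    """Parse one rule into a list of (field, op, value) conditions."""
    conditions = []
    for part in text.split(" and "):
        field, op, value = part.split(maxsplit=2)
        if op not in OPS:
            raise ValueError(f"unknown operator: {op}")
        try:
            value = float(value)  # numeric values compare as numbers
        except ValueError:
            pass                  # anything else stays a string
        conditions.append((field, op, value))
    return conditions

def matches(story, conditions):
    """A story matches a rule when every condition holds."""
    return all(f in story and OPS[op](story[f], v) for f, op, v in conditions)

def render_page(profile_text, stories):
    """Return an HTML list linking to each story selected by the profile."""
    rules = [parse_rule(line) for line in profile_text.splitlines() if line.strip()]
    links = [
        f'<li><a href="{escape(s["url"])}">{escape(s["title"])}</a></li>'
        for s in stories
        if any(matches(s, rule) for rule in rules)
    ]
    return "<ul>\n" + "\n".join(links) + "\n</ul>"

# Hypothetical profile and stories.
profile = "section = sports and top_score > 40\nage_days < 7 and snow_inches > 12"
stories = [
    {"url": "http://example.com/s1", "title": "Guard scores 43",
     "section": "sports", "top_score": 43},
    {"url": "http://example.com/s2", "title": "Quiet night in the league",
     "section": "sports", "top_score": 22},
    {"url": "http://example.com/w1", "title": "Blizzard warning",
     "section": "weather", "age_days": 1, "snow_inches": 14},
]
print(render_page(profile, stories))
```

This sketch does no type checking, so an ordering comparison between a number and a string raises an error. A real design would state in its grammar which fields are numeric, and would also need a component that extracts the fields from the site's pages.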

Web Server Dashboard

This project would involve designing and building a web server tool that monitors the ‘current’ workload of a web server and displays an analysis of it graphically, such that what is ‘current’ (last 5 minutes, last 24 hours, etc.) can be specified by tweaking a "knob". A simple version of this tool might simply monitor the web access log, and maintain current request arrival rates (for time intervals that could also be specified with a "knob"), per-file retrieval statistics, per-requester retrieval statistics and average response times. It might also use clustering techniques to display an analytical model of the current workload, as was discussed in Chapter 6 of the Capacity Planning text. The dashboard could also have a "spam detector" to detect excessive numbers of HTTP requests arriving from a single IP address.
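As a rough sketch of the simple version, the code below computes the arrival rate, per-file counts, per-requester counts, and a list of suspect addresses over a configurable window. It assumes the access log is in Common Log Format; the function names, threshold, and log lines are invented for illustration, and response times are left out because that format does not record them.

```python
import re
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumes Common Log Format: host ident user [time] "request" status bytes
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')

def parse(line):
    """Return (timestamp, host, path), or None if the line does not parse."""
    m = LOG_LINE.match(line)
    if not m:
        return None
    host, stamp, _method, path, _status, _size = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return when, host, path

def window_stats(lines, now, window, spam_threshold):
    """Summarize the requests that arrived in the interval [now - window, now]."""
    recent = [r for r in map(parse, lines) if r and now - window <= r[0] <= now]
    per_host = Counter(host for _, host, _ in recent)
    per_file = Counter(path for _, _, path in recent)
    return {
        "requests_per_second": len(recent) / window.total_seconds(),
        "per_file": per_file,
        "per_host": per_host,
        # Hosts whose request count in the window exceeds the threshold.
        "suspects": sorted(h for h, n in per_host.items() if n > spam_threshold),
    }

# Made-up log lines.
lines = [
    '10.0.0.1 - - [08/Mar/2000:09:30:00 -0500] "GET /old.html HTTP/1.0" 200 512',
    '10.0.0.1 - - [08/Mar/2000:10:00:01 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.1 - - [08/Mar/2000:10:01:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [08/Mar/2000:10:02:00 -0500] "GET /news.html HTTP/1.0" 200 2048',
    '10.0.0.1 - - [08/Mar/2000:10:04:59 -0500] "GET /news.html HTTP/1.0" 200 2048',
]
now = datetime(2000, 3, 8, 10, 5, 0, tzinfo=timezone(timedelta(hours=-5)))
stats = window_stats(lines, now, window=timedelta(minutes=5), spam_threshold=2)
print(stats["requests_per_second"], stats["suspects"])
```

Here the `window` argument plays the role of the "knob": changing it to `timedelta(hours=24)` redefines what counts as current without touching the rest of the code.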

An important part of this project would involve specifying, in advance, its functionality and GUI. The functionality proposed here is meant solely to help you come up with ideas, and is in no way meant to limit the product you produce.

Web Cache Simulator

The task here is to better understand the role of a web cache and its effect on web performance. While it is commonly agreed that caching holds the key to good performance for popular web sites, there is still a good deal of disagreement about how caches should work and how and where they should be deployed. You can think of this project as a white paper containing experimental results that could be used to inform that debate.

The fundamental operation of a web cache is very much like that of any other cache that might be maintained in your computer. The trick is to always keep the items that are most likely to be accessed next. If your cache manager were omniscient, you would always win. However, this is difficult to achieve, so the best we can do is to use a good heuristic like LRU.

Since it is not that easy to obtain and configure a web cache server, we will be satisfied with a simulated environment. You will simulate a workload and a set of components that manage a fictitious cache. Your cache will sit in front of a database (read: web site) with some specified size. You can vary the size as a part of your simulation. Your cache will also have a specified size that is smaller than the database.

The documents in your database will also be of varying size. You can describe the variation in their size with some form of statistical distribution. For example, you might choose to model this size variation as a Poisson or a normal distribution. Playing with the parameters of the distribution will allow you to investigate varying degrees of skew in document size.

Other things that you might want to vary would include the type of workload (uniform, hot-cold) and its intensity (exponential interarrival times, bursty, etc.). The caching policy is also up for grabs. While LRU is a good starting point, it is likely that you could do better for web traffic. This is your chance to be creative. Try to design something that more closely reflects the web and that is simple (efficient) to implement.

Your final problem is to figure out what to measure. Response time improvement (with or without the cache) is an obvious choice, but you will likely think of more interesting metrics as your project progresses.
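A minimal sketch of such a simulation, under several stated assumptions: the policy is plain LRU with a byte-based capacity, document sizes are drawn from a normal distribution, the workload is hot-cold, and the only metric is hit rate. Request timing is not modeled, and all parameter values are arbitrary.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Byte-capacity cache that evicts the least recently used documents."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.docs = OrderedDict()  # doc_id -> size, least recently used first

    def request(self, doc_id, size):
        """Return True on a hit; on a miss, admit the document if it fits."""
        if doc_id in self.docs:
            self.docs.move_to_end(doc_id)  # mark as most recently used
            return True
        if size <= self.capacity:
            while self.used + size > self.capacity:
                _, evicted_size = self.docs.popitem(last=False)
                self.used -= evicted_size
            self.docs[doc_id] = size
            self.used += size
        return False

def simulate(cache, sizes, n_requests, hot_fraction=0.1, hot_prob=0.9, seed=0):
    """Run a hot-cold workload and return the hit rate.

    A fraction hot_fraction of the documents (the lowest ids) receives a
    share hot_prob of the requests. Assumes at least two documents.
    """
    rng = random.Random(seed)
    n_docs = len(sizes)
    n_hot = max(1, min(n_docs - 1, int(n_docs * hot_fraction)))
    hits = 0
    for _ in range(n_requests):
        if rng.random() < hot_prob:
            doc = rng.randrange(n_hot)
        else:
            doc = rng.randrange(n_hot, n_docs)
        hits += cache.request(doc, sizes[doc])
    return hits / n_requests

# Document sizes from a normal distribution (mean 10 KB), floored at 1 byte.
size_rng = random.Random(1)
sizes = [max(1, int(size_rng.gauss(10_000, 3_000))) for _ in range(1_000)]

for capacity in (100_000, 1_000_000, 5_000_000):
    rate = simulate(LRUCache(capacity), sizes, n_requests=20_000)
    print(f"cache {capacity:>9,d} bytes: hit rate {rate:.3f}")
```

Because the policy is isolated in the `request` method, an alternative policy (for example, one that takes document size into account) can be tried by replacing that one class and rerunning the same workload.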

Workload Generator

The tool wwwstat, developed at UC Irvine, generates an analytic workload model from the web access log. The model (table) that can be generated with this tool is limited. The only columns (i.e., workload parameters) of the table that can be generated are "bytes transferred", "requests", "% bytes transferred" and "% requests". The only rows of the table (i.e., workload classes) that can be generated are classes based on time intervals (days or hours), files requested or IP addresses of requesting clients. The goal of this project would be to build a more general wwwstat tool. This tool would permit specification of the columns (e.g., arrival rates) and rows desired (e.g., rows based on file sizes or automatically generated rows based on clustering), as well as permit global analysis of the workload file for identifying such phenomena as load spikes and causality relationships (e.g., people who load file A also tend to load file B 95% of the time).
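The two generalizations can be sketched in a few lines. The code below assumes the log has already been parsed into records with `client`, `path`, and `bytes` fields (hypothetical names). It builds a table whose rows are chosen by a caller-supplied function, and estimates an "A implies B" relationship as the fraction of clients who loaded A that also loaded B. It is not based on how wwwstat itself is implemented.

```python
from collections import defaultdict

def tabulate(records, row_of):
    """Group records into rows chosen by row_of; report requests and bytes."""
    table = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for rec in records:
        row = table[row_of(rec)]
        row["requests"] += 1
        row["bytes"] += rec["bytes"]
    return dict(table)

def confidence(records, a, b):
    """Fraction of clients that loaded file a who also loaded file b.

    Returns None when no client loaded a.
    """
    files_by_client = defaultdict(set)
    for rec in records:
        files_by_client[rec["client"]].add(rec["path"])
    loaded_a = [files for files in files_by_client.values() if a in files]
    if not loaded_a:
        return None
    return sum(b in files for files in loaded_a) / len(loaded_a)

# Made-up records.
records = [
    {"client": "c1", "path": "/a", "bytes": 500},
    {"client": "c1", "path": "/b", "bytes": 20_000},
    {"client": "c2", "path": "/a", "bytes": 500},
    {"client": "c2", "path": "/b", "bytes": 20_000},
    {"client": "c3", "path": "/a", "bytes": 500},
    {"client": "c4", "path": "/c", "bytes": 700},
]

# Rows based on file size rather than on time, file, or client address.
by_size = tabulate(records, lambda r: "small" if r["bytes"] < 10_000 else "large")
print(by_size)
print(confidence(records, "/a", "/b"))
```

Passing the row definition as a function keeps the tool open-ended: rows based on time intervals, clients, or cluster labels only require a different `row_of`.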

Profile-based Prefetcher

This project is based on our Data Recharging proposal that was recently submitted to NSF. It would involve building a tool that analyzes profile specifications generated by existing software (e.g., schedule files for calendar managers) and retrieves information from various sources (web pages, newsgroups, email) that is judged to be relevant according to these profiles. As an example, a schedule might reveal that a client has a meeting with Bruce Lindsay at IBM Almaden regarding Data Recharging on Friday. Your prefetcher tool might then send your client an email that lists web pages involving Data Recharging, recent mails to the client from Bruce Lindsay, and directions to IBM Almaden (perhaps including flight schedules to the Bay area, seat sales, etc.). Of course, all of this is fairly ambitious, so a key component of this project would be to specify the part of it that could be delivered in the time allotted.

Web-page Monitor

This project also has applications to Data Recharging. The goal of this project would be to build a web-page "monitor" that accepts a set of URLs (i.e., a bookmark file) as input from a client. This tool would notify the client whenever changes are made to any of these pages within some specified time period or since the client’s last visit to them. The processing done by this tool might involve building and maintaining a table that stores a hash value for each page listed in the bookmark file for a given client. This hash value would be computed on the basis of the contents of some preprocessed form of the page (e.g., with advertisements removed). Then, periodic revisits to pages and subsequent processing would generate new hash values that could be compared to previous hash values to detect changes. Such a tool could be used by any processor of user profiles that expresses interest in pages according to update criteria.
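The hash-table approach described above can be sketched as follows. This is an illustration under assumptions: page fetching is left to the caller, the preprocessing step only strips HTML comments and collapses whitespace (a stand-in for real advertisement removal), and the function names and URL are hypothetical.

```python
import hashlib
import re

def fingerprint(html):
    """Hash a page after crude preprocessing.

    HTML comments are dropped and runs of whitespace are collapsed so
    that purely cosmetic differences do not register as changes.
    """
    text = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    text = " ".join(text.split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(table, fetched):
    """Compare freshly fetched pages against the stored hash table.

    table maps URL -> previously stored hash and is updated in place.
    fetched maps URL -> page contents from the latest visit.
    Returns the URLs whose contents changed since the previous visit.
    """
    changed = []
    for url, html in fetched.items():
        new_hash = fingerprint(html)
        if url in table and table[url] != new_hash:
            changed.append(url)
        table[url] = new_hash
    return changed

# Three simulated visits to one bookmarked page.
table = {}
url = "http://example.com/page"
first = detect_changes(table, {url: "<p>Hello</p>"})
second = detect_changes(table, {url: "<p>Hello</p>   <!-- served by host 7 -->"})
third = detect_changes(table, {url: "<p>Hello, world</p>"})
print(first, second, third)
```

A first visit never reports a change because there is no earlier hash to compare against; only the second and later visits can.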

Project Proposals

One-to-two page proposals for your final projects are due on Wednesday, March 8 at the start of class. Each proposal should clearly sketch everything you intend to do so that you can 1) get valuable feedback from your colleagues and classmates, and 2) set out to do it with as few missteps as possible. Proposals should include:

A clear description of the goals of the project work.
A description of the applications of your project to the real world. What benefit will your final project offer to the web and/or database communities?
Preliminary thoughts about the design of the project.
Intended milestones for completing the project.
A description of the deliverable to be submitted at the project’s end.

Evaluation criteria that could be used to assess the success or failure of the project. Note that the evaluation criteria chosen should reflect the goals of the project described in the first item above.