Intro

Congrats, you've just made it to the final round of the national dodgeball championship! In order to maintain an edge over your competitors, you need a way to hit as many as you can in a single throw. To do so, you must determine which competitors are closest to you. Luckily, you're taking CS32 and you've heard of a nifty algorithm that can do just that: a KD-tree search! In this project, you will implement a KD-tree to find the nearest dodgeball "stars" that you need to eliminate! Good luck!

major league dodgeball logo

Overview

You will be implementing a multi-dimensional nearest neighbors algorithm to search for stars in space. Your program will be able to do two things efficiently:

  1. Find the k-nearest neighbors: Take in a location in space (specified as either an x,y,z-coordinate or the name of a star) and the desired number of neighbors to be found, a positive integer k. Returns the closest k stars to the given point.
  2. Radius search: Take in a location in space and a distance. Return all of the stars within the given distance to the point.
We expect your main() to be in a Main class in edu.brown.cs.<username>.stars. However, you should be thinking about making parts of your code general purpose. Therefore, you use package names that reflect the general nature of any particular class. If the code is useful beyond reading and querying a star database, it doesn't belong in a stars package.

Roadmap

The details of your Stars implementation will, as always in this course, be up to you. However, here is a general roadmap towards a functional project. Some of these may not make sense until you read the rest of this page! If you want more help getting started, go to the project gear up or talk to a TA.

  1. Read this entire page before starting.

  2. Pay attention to the design questions, as they will help you think abstractly about structuring your project.

  3. Clone your GitHub repository

  4. Start your Test Suite

  5. Our official advice is to write your system tests before you begin your implementation, and your JUnit tests once you have outlined your classes. Writing tests helps you understand the expected functionality of your code, define the API of your objects, and promotes incremental testing from the get-go. However, actually coding the project can provide further insight on the program itself and what tests to write. So, don't be worried if your testing plan evolves as you go.

  6. Write some code!

  7. You can complete these steps in any order you want. For example, you can start with your command line implementation (REPL) without filling in your search functions, or you can start with the CSV parser since it's fairly independent. Try to make all these individual peices abstract and reusable!

  8. Make a GUI

  9. Answer the Design Questions in your README

  10. Check your Code style and functionality

  11. Turn in your project using GitHub

    Don't underestimate the final Handin step! It involves generating a github pull request and is accompanied by a seperate Project Handin Page.

Tips for a stellar stars project and beyond

In Software Engineering, although the functionality of your code is important, the quality is also.

Imagine you created an application, but at some point, it was too much to maintain by yourself. How readable and workable is your code for others? Say users wanted a new feature in your app. Is your code generic, modular, and extensible so you can easily add new features?

One of the key objectives of CS32 is to teach students to write code that not only works, but also follows strong coding paradigms and best practices. That is why we designed the projects to build upon each other and also why you will have a partner & group project. So, with this in mind, try to make your Stars code:

  • Generic: think about how your code can handle multiple types of objects not just one
  • Modular: try to maintain abstraction and group related functionality in packages
  • Readable: use consistent style (32 Style Guide), comment descriptively, and avoid redundant code
  • Efficient: consider space and time complexity
  • Tested: test your code incrementally to make sure it works for general and edge cases

k-d Trees

In order to find nearby stars, you will write and use a k-d tree data structure. A k-d tree is a special form of binary tree in which each node represents some k-dimensional subspace and all the items contained in that space. Although there is far more information about k-d trees online than we will provide here (which you are encouraged to read about, but not to use any online code), the basic idea is that the space corresponding to each non-leaf node of the tree is split between its two children. The dimension used to split each node's space in half depends on the depth of the node (it alternates through each dimension is descends). Leaf nodes may contain one or a few items, depending on your design. Internal nodes may or may not contain items, also depending on your design. Tell us about your choices and why you made them in your README.

Finding a single nearest neighbor can be done much more quickly in a k-d tree than in a list because large portions of the search space can be eliminated quickly. The expected runtime for nearest neighbor lookups is logarithmic for balanced k-d trees, because each level of the tree should eliminate half of the remaining stars. You should probably begin by writing code to find the single star closest to a given point.

Once you can find one nearest neighbor, you should extend your program to find an arbitrary number of nearest neighbors. We do not expect you to support insertions and deletions after initial construction.

For more information on searching k-d trees and some high level pseudocode for the nearest neighbors search function, check out this document:

K-d Tree Help Doc

Data

We provide several CSV (Comma Separated Value) files that contain star data for you to load into your program for testing. The main file is data/stars/stardata.csv, which contains the locations of thousands of stars from the HYG database. This is an example of a good input file. You cannot use a CSV library to parse the file -- you must write your own!

What does a valid input file look like?

A valid input file will have a header line at the top containing 'StarID', 'ProperName', 'X', 'Y' and 'Z'. After a header line, each line in the file contains an ID, a name if one exists, and x, y, z coordinates for the star separated by commas. If a file does not conform with the said format, your program should print an informative error message.

What can I assume about the input files?

You may assume there will be no duplicate star names and that no star name contains a comma.

Command Line/REPL Specification

Running ./run in a terminal should start your program. It should accept two command-line arguments: --gui and --port <PORT>. We've included these for you in the stencil.

Your Java program will always have a command-line interface. It will have a graphical interface if specified with a --gui argument (explained in GUI Specification). In the command line, your program should run a 'REPL' (Read, eval, print loop), which should loop indefinitely, reading one line at a time, interpreting each line as a command (available commands specified below). It should then print the results of each command, in order. It should exit when it hits "end-of-file", which you can supply at the terminal by hitting Ctrl-D. To implement the REPL, see Java's BufferedReader class.

Your REPL should handle the following five commands:

  1. stars filename where filename is the path to a file containing stars information. This command loads in star position information. If you run this command multiple times, it should overwrite the k-d tree each time, rather than adding new elements to the existing k-d tree

  2. neighbors k x y z where k is the nonnegative, integral number of nearest neighbors to find, and x, y, z represents the position from which to find the nearest neighbors.

  3. neighbors k "name" where name is the name of a star, which must be given in quotes. For both of these neighbors commands, if there is a tie between stars (two stars being in the same location or two stars being equidistant from the given location), you may return one arbitrarily if choosing both would push you above the limit k.

  4. radius r x y z where r is the nonnegative radius around the position to search, and x, y, z represents the position. A radius of 0 includes a star if and only if it is exactly at the specified position.

  5. radius r "name" where name is the quoted name of a star, as in the neighbors command described above.

Example Output

Read n stars from <filename> should be printed in response to a successful stars command. For easier testing on our part, please do not worry about whether or not "stars" should be plural. It should always be plural in this response.

Input: stars data/stars/stardata.csv
Output: Read 119617 stars from data/stars/stardata.csv

Input: stars data/stars/just_one_star.csv
Output: Read 1 stars from data/stars/just_one_star.csv

For the neighbors and radius commands, the stars should be printed in sorted order, from nearest to furthest away. If a star is specified by name in the query, it should not be counted or printed during the search.

Your output should only print star IDs, not names. Here are some example input/output when your program is run against the supplied data file (though we will try it on many other commands, and many other files. Your tests should as well!)

Input:
neighbors 5 "Sol"
Output:
70667
71454
71457
87666
118721
Input:
neighbors 5 0 0 0
Output:
0
70667
71454
71457
87666

Be sure the format of your output looks like what we have above. Do not actually print "Input" and "Output"!

If the line read cannot be interpreted in a meaningful way (an incorrect command, a non-existent star name, etc), your program should print a reasonable error message and continue running. All error messages should begin with "ERROR:" (without the quotes). For example:

Input: radius -4 7 3 2
Output (example): ERROR: Radius must be non-negative.

Input: stars file_that_doesnt_exist.csv
Output (example): ERROR: File does not exist.

The "ERROR:" tag allows us to automatically test your code against error conditions without specifying exactly what your messages should contain (or even even hinting what all the errors you should detect are). We do award more credit for “better” error messages.

Design Questions

With every project we will also have 2-3 design questions that you must answer in your README. These questions are supposed to help you make good design decisions and get you to look at how your program can be used beyond the scope of this project. We highly recommend reading the design questions and taking some time to design your system before you start coding so that you could easily add the functionality we specified in the future. Good design is one of the most important skills of being a software engineer, so taking the time to figure out a good design will greatly help you in the future.

You should think about your answers to these questions before you start programming. Nonetheless, you may find once you've finished your project that the way you designed your program is not ideal. In that case, you can explain how you would have redesigned your code so that you can give the best answer possible. We will grade these answers on how strong the design you described is and not what was implemented in your handin. (You could still lose points for your design, but not simply because your design is different from what you describe here.)

Here is an example design question and answer for Boggle:

How could you change your code to handle board that can be arbitrarily sized instead of 4x4?

In the Board class, I would add another board constructor Board(int size) that changed the board construction loops to loop from 0 to size instead of 0 to 4. The default constructor would remain the same, and another constructor could be added to allow for both a particular string set and board size. I would remove the static SIZE constant in favor of a size instance variable, associated with each instance, that would be set in the constructor.

Design questions for Stars:

  1. Suppose that in addition to "neighbors" and "radius" you wanted to support 10+ more commands. How would you change your code - particularly your repl parsing - to do this? Don't worry about algorithmic details with the k-d tree, we're interested in the design.
  2. What are some problems you would run into if you wanted to use your k-d tree to find points that are close to each other on the earth's surface? You do not need to explain how to solve these problems.
  3. Your k-d tree supports most of the methods required by the Collection interface. What would you have to add/change to make code like Collection<Star> db = new KDTree<Star>() compile and work properly?

GUI Specification

If your program is invoked with "--gui" as an argument, it should start a web interface in addition to reading command-line input. If you haven't done so yet, you should look at Project 0 - Boggle for an introduction to Spark and sample code to help you begin. Our first lab will also help you get started!

If your program is invoked with "--gui --port 4567", http://localhost:4567/stars should display a page that allows a user to conveniently make all of the same types of queries that the CLI offers, but it should do so in a friendly, interactive manner. As will be the norm in 32, you have considerable leeway in designing your graphical interfaces, but we will be looking for convenient, clear interfaces. Since this is the first project, we will only expect a basic GUI. The CSV files can still be loaded on the command line, but you should create an HTML form that allows you to conduct k-nearest neighbors and radius searches, then displays the results. We encourage you to make the output more user-friendly than your command-line interface (don't just print IDs!).

Testing

We require thorough JUnit tests as a part of your testing plan. This project will be very challenging if you do not incrementally test your implementation as you create it. Hint: compare the results of your implementation with a simple, slow, and known correct implementation. We covered testing in the first lab. Also, be sure to use Jacoco through mvn site to evaluate how well the code you have written is covered by your JUnit tests!

We require that you pass the basic system tests that are provided by us in your repository (they're in your tests/ta/stars/ directory). These system tests are checking for basic functionality. We will run more comprehensive tests on your projects when grading, so don't be complacent about correctness. We'll also be testing your system tests on our own implementations of Stars to see if they catch various bugs in our code. Please include your own system tests in the tests/student/stars/ directory of your repository. Your system tests should look at basic functionality, edge cases, and malformed commands. Check out our cs32-test guide for more info.

Style

A portion of your grade in this course will be on code style. The standards which we will grade by are outlined in our official CS32 Style Guide. Definitely read over the overview and keep it in mind as you begin coding. More importantly, read your Checkstyle report, as generated by mvn site (the Boggle project explains how to find it). Do this early, and often, until you understand the rules well. It is sad to lose points on this, but we feel it is important. If Checkstyle raises an error that you find unreasonable, note the error and why you think it is unreasonable in your README. We expect you to have no unexplained Checkstyle errors in your final handin. Configuring Eclipse as suggested in the Working From Home Guide can save you literally hours of work when fixing style errors.

Graduate Student Requirements

Grad students must additionally complete the following requirements:

  1. Add a switch <naive|kdtree> command that changes the data structure that subsequent commands use
  2. Benchmark the difference between a naive O(n) implementation and a k-d tree

For benchmarking, run

time cs32-test tests/ta/stars/benchmark_naive.test

and

time cs32-test tests/ta/stars/benchmark_kdtree.test

and write the results in your README.md (we only care about the real)

Setting up your development environment

We're providing the config files for Checkstyle (config/cs32_checks.xml) and Maven (pom.xml) and an executable to run your program (run). We'll also provide you with a stencil Java project to get started. We will cover how to run and build your program in lab 1.

GitHub

If you haven't already, you should have signed the collaboration policy. When signing, we'll ask you to set up a Github account, and we'll collect your Github username.

Once you've done that, click on this GitHub Classroom link to retrieve the stencil code. This will create your own, personal repository (aka "repo") that contains your code. The repository will be hosted somewhere like: https://github.com/cs032-2020/stars-student.

Now you'll make a local clone of your repo. Click the green "Clone or download" button and copy the URL. On the computer you'll be working on, open a terminal and navigate to where you'd like to store your projects. Once you're there, run git clone <URL>. This copies the repo to your computer in a new directory, named "projects".

To update the remote repo (on Github) after editing your code, you'll first have to add the files you updated, then make a commit of your changes (with a message summarizing your changes), and finally push your changes to the remote repo. These commands, along with other helpful setup information can be found on the Getting Set Up lab as well as the Git+Testing lab. (We've also copied them below.)

  1. Use git add -A to add all changed files. Or, you can selectively add files with git add /path/to/file.

  2. Use git commit -m "<DESCRIPTIVE SUMMARY>" to save your changes locally. Think of this as setting a checkpoint that you'll be able to return to whenever you'd like. You'll need to add any files you want to save before making a commit. Be sure to include a short summary of your changes.

  3. Finally, use git push to upload your changes to the remote repo on Github. You can log on to Github and navigate to the repository to see your changes.

Git is a powerful version control system, and we've only cracked the surface. Since you're the only one editing this repo, you won't have to worry about what happens when someone else wants to make a change to the same file your working on. We'll explore other features of Git later, but you're encouraged to learn more here.

Minimum Functionality

As specified in the course missive you must meet minimum functionality for each assignment in order to pass CS32. If you hand in a project that misses minimum functionality, you have until the end of semester to implement minimum functionality and get it checked off at TA hours.

For Stars, minimum functionality requires several things. First, your project must pass the basic system tests provided. Next, you need a GUI which supports all four search types and displays results without crashing. Finally, your implementation must use a KDTree.

Grading Summary

Many of you may not have a lot of experience in building and working with large scale projects like Stars, so only for this project, we are giving you a brief overview of what grading looks like for CS32.

  1. README: This is one of the first things we will be looking at. A big part of software engineering is to provide explanations on your project design details and how to run your program for users (in our case, your grader). Furthermore, your design questions will all be graded.

  2. Functionality: Along with the test suite that we provide you, we will also be running your implementation on our hidden test suite. This test suite ensures accuracy as well as effective error handling, so make sure to consider all possible edge cases.

  3. Testing: This is another important part of the rubric. We grade both unit tests and system tests. For unit tests, we want to see that there is comprehensive test coverage on the majority of your classes (with an emphasis on your core classes, like K-d tree). Feel free to check out the Getting Set Up lab, which has information about JaCoCo that you can use to check your test coverage. Note that we are not grading you on JaCoCo, but it is a nice tool to have. And for system tests, you should be throughly testing the specs of the project, ensuring that any sort of error or bad input won't crash the program. Keep in mind that testing is quality over quantity!

  4. GUI: As mentioned in the GUI Specification section, your GUI should efficiently allow for neighbors and radius commands to run and display results (it's fine if you need to run stars command in the REPL). We only expect a basic GUI for this project, however, it should still be intuitive to use for users.

  5. Good code/Bad code: This section evaluates your code and class design for things that were done above and beyond, and for things that should be improved for the future. Some of the points are for general good/bad code styles, like having thorough and informative comments. Other points are for how organized and intuitive your project's structure is, and how reusable and extensible your classes and data structures are.

(Optional) Design Check

For this project, there is an optional design check component. For this, you must s how a TA at hours a functioning REPL. By functioning, we mean that it satisfies the constraints: (1) it opens and prompts the user, (2) the user can type in some input, hit enter, and the REPL will return to a state where the user can type more input, and (3) the REPL exits cleanly (meaning it closes upon some command). You can get checked off for completing this design check by coming to any TA (not lab) hours Saturday, Feb 1st to Monday, Feb 10th, inclusive and showing your sufficient REPL. If a TA notes that your REPL meets our standards for this design check, you will recieve one extra late day. Note: a REPL that satisfies these constraints is by no means a perfect REPL, but a good jumping off point for developing a modular and extensible REPL.

Handing In

As with every project, we expect you to update the README.md file, describing:

  • Known bugs
  • Design details specific to your code
  • Any runtime/space optimizations you made beyond the minimum requirements
  • How to run any tests you wrote/tried by hand
  • How to build/run your program
  • Answers to design questions
  • Explanations for any Checkstyle errors your code has (hopefully none)

Include these subsections under the "Stars" section in the README.md file.

Very Important Note: TAs should be able to run your program by executing ./run from the root project directory. Make sure that this works for your program before handing your project in or there will be significant point deductions. Writing "run in eclipse" in your README is not acceptable.

Most Important Note: You will be using Pull Requests to hand in your projects. Follow the Project Handin Guide and be sure to read all of it!! If you encounter any issues the TAs will be more than happy to help in hours.

FAQs

Check out the FAQs for Stars.