A 3D model can be reconstructed from a sequence of images

Project 5: Tracking and Structure from Motion
CS 143: Introduction to Computer Vision

Brief

Stencil code: /course/cs143/asgn/proj5/stencil/
Data: /course/cs143/data/proj5/
Handin: cs143_handin proj5
Required files: README, code/, html/, html/index.html
Due date: December 12th, 11:59 PM

Overview

In this assignment you'll be reconstructing the 3D shape of an object from a series of observations (a video). There are three essential components to this: keypoint selection, feature tracking, and structure from motion.

Keypoint Selection

The stencil code includes a Harris corner detector in harris.m, which you should use to select your keypoints. You can implement a better keypoint selection method for extra credit. Good features to track are ones whose motion can be estimated reliably (although note that for the tracker you'll be implementing Harris corners are optimal).

Writeup requirements

Include an image of the selected keypoints overlayed over the first video frame.

Feature Tracking

You'll be implementing a Kanade-Lucas-Tomasi tracker for the keypoints you detect. This essentially involves computing optical flow between successive video frames and moving the selected keypoints from the first frame along the flow field. You do not have to implement coarse-to-fine KLT as the video doesn't have any large movements.

The first assumption the KLT tracker makes is brightness constancy. A point should have the same value after translation in the next frame (where I is the image function):

Take the Taylor expansion of I(x + u, y + v, t + 1), where I_x/I_y is the x/y-derivative of the image and I_t is the temporal gradient:

Therefore:

And by the first equation:

This is only one constraint when we have two unknowns (u and v). We get more by assuming that nearby pixels at points p_i, i ∈ [1, 225] (in a 15×15 box around the pixel) move with the same u and v:

We can solve this overconstrained linear system via least-squares (abbreviating the above to Ad = b):

This results in two equations with two unknowns for each pixel. You can solve for u and v by inverting the 2x2 matrix on the left-hand side and multiplying it by the vector on the right-hand side. Once you have the per-pixel optical flow, you can easily track a point through the flow by looking up its displacement in the frame and adding that to its position to get the position for the next frame. Use interp2 to do the lookup or your points will drift significantly. Discard any keypoints if their predicted location moves out-of-frame (or anywhere near the borders, to be safe).

Useful functions: fspecial, conv2, gradient, interp2, isnan.

Writeup requirements

For 20 random keypoints, draw the 2D path over the sequence of frames using line segments.
On top of the first frame, plot the points which have moved out of frame at some point along the sequence.

Structure from Motion

Now, use the discovered keypoint tracks as input for the affine structure from motion procedure described in Shape and Motion from Image Streams under Orthography: a Factorization Method (Tomasi and Kanade 1992). See section 3.4 of this paper for an overview of the algorithm. Note that there should be a suffcient number of tracks that persist throughout the sequence to perform the factorization on a dense matrix. There is no need to fill in missing data for this problem.

To eliminate the affine ambiguity (i.e. apply the metric constraints) by discovering QQ^T (eq. 16 of Tomasi and Kanade), let L = QQ^T and solve for L using least squares. Finally, compute the Cholesky decomposition of L = QQ^T to recover the appropriate Q. A more detailed explanation of this "metric transformation" can be found in Section 2.3 of A Sequential Factorization Method for Recovering Shape and Motion from Image Streams, Morita and Kanade, 1997.

Useful functions: svd, chol.

Writeup requirements

Plot the predicted 3D locations of the tracked points for 3 different viewpoints. Choose several viewpoints so that the 3D structure is clearly visible.
Plot the predicted 3D path of the cameras. The camera position for each frame is given by the cross product k_f = i_f × j_f. For consistent results, normalize all k_f to be unit vectors. Give three plots, one for each dimension of k_f.

Resources

Szeliski Chapter 8.1.3, Dense Motion Estimation: Incremental Refinement
Szeliski chapter 7.3, Structure from Motion: Factorization
Lecture 23, Feature Tracking and Optical Flow
Lecutre 25, Structure from Motion
Shape and Motion from Image Streams under Orthography: a Factorization Method, Tomasi and Kanade, 1992. Section 3.
A Sequential Factorization Method for Recovering Shape and Motion from Image Streams, Morita and Kanade, 1997. Section 2.

Write up

For this project, and all other projects, you must do a project report in HTML. In the report you will describe your algorithm and any decisions you made to write your algorithm a particular way. Then you will show and discuss the results of your algorithm. Discuss any extra credit you did, and clearly show what contribution it had on the results (e.g. performance with and without each extra credit component).

Extra Credit

For all extra credit, be sure to analyze on your web page cases whether your extra credit has improved your results. Each item is "up to" some amount of points because trivial implementations may not be worthy of full extra credit.

Some ideas:

up to 5 pts: Add iterative refinement to your KLT tracker demonstrate improved tracking accuracy.
up to 5 pts: Integrate a pyramid into your KLT tracker and demonstrate improvement on sequences with large frame-to-frame displacements.
up to 5 pts: Correct for drift using the initial local image patches.
up to 10 pts: Use incomplete features to perform structure from motion on more challenging sequences where no feature track survives the entire sequence.
up to 15 pts: Implement structure-from-motion using a perspective camera using the eight-point algorithm.

Rubric

+40 pts: Working implementation of KLT tracking
+40 pts: Working implementation of structure from motion
+20 pts: Writeup
+15 pts: Extra credit (up to fifteen points)
-5*n pts: Lose 5 points for every time (after the first) you do not follow the instructions for the hand in format

Graduate Credit

To get graduate credit on this project you must do 10 points worth of extra credit. Those 10 points will not be added to your grade, but additional extra credit will be.

Web-Publishing Results

All the results for each project will be put on the course website so that the students can see each other's results. In class we will have presentations of the projects and the students will vote on who got the best results. If you do not want your results published to the web, you can choose to opt out. If you want to opt out, email cs143tas[at]cs.brown.edu saying so.

Handing in

This is very important as you will lose points if you do not follow instructions. Every time after the first that you do not follow instructions, you will lose 5 points. The folder you hand in must contain the following:

README - text file containing anything about the project that you want to tell the TAs
code/ - directory containing all your code for this assignment
html/ - directory containing all your html report for this assignment, including images (any images not under this directory won't be published)
html/index.html - home page for your results

Then run: cs143_handin proj5
If it is not in your path, you can run it directly: /course/cs143/bin/cs143_handin proj5

Credits

This project is adapted from Derek Hoiem's CS 543/ECE 549 course at the University of Illinois at Urbana-Champaign.

Project 5: Tracking and Structure from Motion CS 143: Introduction to Computer Vision

Brief

Overview

Keypoint Selection

Writeup requirements

Feature Tracking

Writeup requirements

Structure from Motion

Writeup requirements

Resources

Write up

Extra Credit

Rubric

Graduate Credit

Web-Publishing Results

Handing in

Credits

Project 5: Tracking and Structure from Motion
CS 143: Introduction to Computer Vision