Visual Odometry (Part 2)
This post is the second in a series on Visual Odometry. Find part 1 here.
Ok, the last time around we got stuck because:
- we were not detecting enough features
- the feature matches that we found were not great
Shown below is where we were at: using the ORB detector’s default detect method did not give us many good features. Additionally, the matches weren’t great, with large Hamming distances among the “worse” matches.
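For reference, here is a minimal sketch of that baseline (the video path is a placeholder, and detector is the ORB object we will keep reusing below):

```python
import cv2

# Grab two consecutive frames from the video (the path is a placeholder)
cap = cv2.VideoCapture("drive.mp4")
_, frame1 = cap.read()
_, frame2 = cap.read()
frame1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
frame2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# ORB with default settings caps out at nfeatures=500
detector = cv2.ORB_create()
kp1 = detector.detect(frame1, None)
kp2 = detector.detect(frame2, None)
print(f"Number of keypoints: {len(kp1)} and {len(kp2)}")
```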
Number of keypoints: 500 and 500
Feature (re-) detection
Another way to get features is to first find good corners in the frame and then use ORB to compute descriptors for those specific corners.
If we can choose robust corner features, it will be easier to track them between frames.
Thankfully, OpenCV has a function for this: goodFeaturesToTrack.
Below, we first find the good features in both frames and convert them to the KeyPoint type so they work with other OpenCV functions.
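A sketch of that step; the qualityLevel, minDistance, and keypoint size values below are my guesses at reasonable defaults, not necessarily the original ones:

```python
def corners_to_keypoints(frame, max_corners=3000):
    # Shi-Tomasi corner detection; thresholds here are illustrative
    corners = cv2.goodFeaturesToTrack(frame, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=7)
    # goodFeaturesToTrack returns an (N, 1, 2) float array; wrap each
    # corner in a cv2.KeyPoint so the rest of OpenCV can work with it
    return [cv2.KeyPoint(float(x), float(y), 20)
            for x, y in corners.reshape(-1, 2)]

kp1 = corners_to_keypoints(frame1)
kp2 = corners_to_keypoints(frame2)
```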
The results are good!
We get 3000 corners in each frame, and the same corners are being detected in both, so we can be hopeful about the feature matching step.
Feature (re-) matching
We can now compute ORB descriptors for these corners (remember that detector is an ORB object).
Then, we can use the brute-force matcher to find matches between the two frames; however, this time we will use the k-nearest neighbors algorithm.
The default match() method of the BFMatcher object returns only the best match, while the knnMatch() method returns the k best matches.
This is useful because Lowe’s ratio test needs the two best matches to distinguish good matches from bad ones: if the best match is much closer than the second best, it is likely a good one.
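A sketch of the descriptor computation, matching, and ratio test (the 0.75 ratio is the commonly used value, chosen here by me):

```python
# Compute ORB descriptors at the corners we found above; compute() may
# drop some keypoints, so we take back the filtered lists too
kp1, des1 = detector.compute(frame1, kp1)
kp2, des2 = detector.compute(frame2, kp2)

# Hamming distance for binary ORB descriptors; crossCheck must stay off
# (the default) for knnMatch to return k neighbours
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = bf.knnMatch(des1, des2, k=2)

# Lowe's ratio test: keep a match only when the best neighbour is
# clearly closer than the second best
good = []
for pair in matches:
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])
good.sort(key=lambda m: m.distance)
```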
Displaying the first 10 matches, we can see they look pretty good. Even the bad matches look pretty good!
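The display itself can be done with cv2.drawMatches, for example:

```python
# Draw the 10 best (lowest-distance) matches side by side
img_matches = cv2.drawMatches(frame1, kp1, frame2, kp2, good[:10], None,
                              flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv2.imshow("matches", img_matches)
cv2.waitKey(0)
```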
These matches will still be pretty noisy if we compute them across all the frames in the video, so we need to somehow enforce the perspective projection/projective geometry model on them. This can be done using the Fundamental Matrix.
Filtering bad matches using the Fundamental Matrix
The fundamental matrix \(F\) is a matrix that maps points in one image to their matches in the other image. It is usually introduced in a stereo setting, but it applies to any two views of a scene: a single moving camera gives us an approximate stereo pair in consecutive frames, as long as the frame rate is high compared to the camera motion. Ok, technically it doesn’t map points to points, but points to their respective epipolar lines. Due to the epipolar constraint, a point in one image plane is guaranteed to have its match in the other (stereo) image lie on a line, called the epipolar line: matching points \(x\) and \(x'\) (in homogeneous coordinates) must satisfy \(x'^\top F x = 0\). This is really cool, because now when we search for a match, we only need to search along a line instead of the whole 2-D image. This constraint restricts our search space by a lot!
Here, we will use this fact in reverse. We already have the matches, thanks to our brute-force matcher, albeit not all of them are high quality, as we saw above. And we know (actually, we assume) that the matches must satisfy the epipolar constraint, which is to say they must follow the mapping of the fundamental matrix. We can already tell whether a match is good by looking at it, but this constraint gives us a quantitative way of deciding. The points that don’t obey the fundamental matrix mapping are bad matches.
There is an algorithm called RANSAC (Random Sample Consensus) that fits the best model of a specified structure to noisy data, and it can help us find the mapping that satisfies most of our data. While the name may sound fancy, the algorithm itself is fairly common sense: take a random sample of the data, fit a model to it, and keep repeating until the model represents most of the data well enough (according to some pre-specified parameters). This is the mathematical equivalent of forming an opinion on a topic by asking a bunch of people what they think is right, and then going with whatever most people said. All things considered, it’s not a terrible way to form opinions. We will use it to filter out bad matches.
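Conveniently, OpenCV’s findFundamentalMat estimates \(F\) with RANSAC and also returns an inlier mask. A sketch (the pixel threshold and confidence values are my choices):

```python
import numpy as np

# Matched point coordinates, ordered so pts1[i] corresponds to pts2[i]
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# RANSAC-based estimation of the fundamental matrix
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
print("Fundamental Matrix:")
print(F)
```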
Fundamental Matrix:
[[ 3.64257554e-09 6.53331526e-06 -3.25147208e-03]
[-6.53220141e-06 -1.25524925e-08 6.15791865e-03]
[ 3.24476011e-03 -6.12805433e-03 -9.17567059e-03]]
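The mask returned by findFundamentalMat tells us which matches were RANSAC inliers, so we can keep just those and display them (again a sketch, continuing from above):

```python
# mask marks the RANSAC inliers, i.e. the matches consistent with F
inliers = [m for m, ok in zip(good, mask.ravel()) if ok]
print(f"{len(inliers)} of {len(good)} matches satisfy the epipolar constraint")

img_inliers = cv2.drawMatches(frame1, kp1, frame2, kp2, inliers[:10], None,
                              flags=cv2.DrawMatchesFlags_NOT_DRAW_SINGLE_POINTS)
cv2.imshow("inlier matches", img_inliers)
cv2.waitKey(0)
```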
Above, we can see the keypoints in both frames that best satisfy the epipolar constraint as enforced by the fundamental matrix.
I’m gonna call it here for this post. This is where we should’ve been after the first post, but we’re exploring here. In the next post, we will talk about estimating the pose of the camera; it will be math-heavy. See you there.
References
- First Principles of Computer Vision YouTube channel
- twitchslam by George Hotz
- KITTI Odometry with OpenCV Python by Nate Cibik
- Visual Odometry by David Scaramuzza and Friedrich Fraundorfer