http://www.cs.unc.edu/~nashel/290-075/
8 May 2000
Abstract
In this paper, I describe methods for and implementation of enhanced correlation between stereo image pairs. Given two rectified stereo images, a simple implementation of the correlation matching algorithm runs in O(XYW2R), where X and Y are the width and height of the image, W defines the fixed window size and R defines the disparity range over which to search. Improvements in the order of execution time are possible by reordering computation and buffering results, and scalar speed increases are made possible through interpolation of lower resolution depth images.
1. Introduction
The two challenges for a stereo vision system are correspondence and reconstruction. Correspondence involves choosing which parts of the left and right images are projections of the same real world object, while reconstruction involves finding the 3-D locations of points and objects in the scene. Furthermore, correspondence algorithms may be classified as either correlation-based or feature-based. Feature-based correspondence works on a sparse set of points, classified into features based upon some knowledge of the environment. Correlation works on the all of the image points, and assuming that no a priori information is available, correlation-based correspondence methods are preferable. Correspondence is usually the most computationally expensive process in a stereo vision system, and so speed up in this area should yield the greatest overall system improvements.
I am motivated to work on this problem by my experiences with the Office of the Future research group here at UNC. One goal for the project is to implement real time 3-D depth extraction from multiple camera views of people in an office environment. At the time of this report, speeds of approximately 800ms for depth extraction from a single pair of images were being obtained. This is clearly insufficient for the 33ms extraction time necessary for 30 frames per second full motion video, but it is a significant improvement over simple brute force models. Thus, to improve my understanding of the problem area I have implemented multiple methods for stereo correlation.
The correlation algorithms assume that the images are generated from a set of set of rectified stereo images. Test images were taken from the Carnegie Mellon Calibrated Imaging Laboratory stereo data sets, for which multiple images along with full intrinsic and extrinsic camera parameters are known.
2. Methodology
Beginning with two common assumptions for correspondence matching algorithms:
(1) Most scene points are visible from both viewpoints.
(2) Corresponding image regions are similar.
we may consider with the correlation matching algorithm as presented by Trucco and Verri [3, §7.2]. Given two rectified input images, Il and Ir, let pl and pr be pixels in the left and right image, 2W+1 be the width of the correlation window, and 2R+1 be the range of disparities from pl for which to search for matches in the right image, and ψ(u,v) a function of two intensity values. Furthermore, the two images are rectified using the epipolar geometry, and so the disparity search becomes a one-dimensional problem. The algorithm for correlation, c, is as follows:
For each pixel pl = [x,y]T
in Il
for each displacement d
in -R to R
let c = 0
for each value k in the search window -W to W
for each value l in the search window -W to W
c = c + ψ(Il(x+k,y+l), Ir(x+k+d,y+l))
The disparity, d for pl is:
d such that c(pl) is maximum correlation value
For ψ(u,v), I have implemented two functions:
ψ(u,v) = uv
which yields the cross-correlation between the search regions, and:
ψ(u,v) = -(u - v)2
the SSD (sum of square differences) [1].
2.1 Naïve Implementation
If the algorithm is implemented in the most straightforward manner, looping for each pixel and displacement, and then determining the correlation between search windows in either image, we have X x Y x(2R+1) x (2W+1) evaluations of ψ. Clearly, both implementations for ψ run in constant time, and so the time required for the algorithm is O(XYW2R). When such complexity is considered with real world data, for example, X=160, Y=120, W=10, and R=10, the approximate number of operations to determine correspondence is greater than 4 billion. At the present time, this naïve method is far too many for practical real-time depth extraction, which would entail 120 billion+ operations per second.
2.2 Algorithmic Order Improvements
We can take advantage of correlation window overlaps for pairs of pixels to precompute all ψ values for a particular disparity. Ishii [2] describes this method and the following case of further improvement. By first iterating over each disparity value, we calculate such ψ values just once and store them in a buffer from which they are recalled during an iteration over the image. Such a method will reduce the number of evaluations for ψ to O(MNR), but because the ψ values must still be summed, the overall complexity is still O(XYW2R). We refer to this improvement as the precomputed method.
Even with this improvement, as the correlation window moves across the scanline, most of the computations for ψ are unnecessarily repeated, with an overlap of 2W columns of (2W+1) pixels for each iteration. Thus, we may compute and store the column sum values, and compute the correlation for each succeeding column in constant time by subtracting the column sum of the previous column and adding that of the following column. The running time for the algorithm is reduced to O(MNR), which is not dependent upon the size of the correlation window. We refer to this combination of enhancements as the optimized method.
2.3 Scalar Execution Improvements
Because we have textures registered with, but separate from, the correlation and disparity maps, we can perform lower-resolution correspondence and interpolate to determine values for all texture points. Depending on the accuracy required for the extracted depth images, various interpolation parameters may be used. The interpolation scaling factor determines which image coordinates will have correspondence values calculated: a value of 2 means that every other pixel in the x and y directions will have a depth computed, 3 means that every third will, and so on. The interpolation method was an adapted optimized method with an interpolation scale factor added.
With the increase in speed comes a loss of precision for depth estimation. If the data is originally sampled at single or sub-pixel resolution and then interpolated, all information with a frequency greater than the scaling factor will be lost. This may pose a problem when dealing with artificial or natural shapes with a great amount of depth detail, such as a fence or dense trees. However, with shapes such as a human face, there are limited amounts of high frequency information, and the texture information may be much more important.
3. Results
The algorithms were implemented with MATLAB 5.3 and tested on a 450MHz
Pentium II Xeon system running Windows NT. Left and right images
from the CMU CIL stereo datasets were preprocessed and converted to appropriate
formats for use in MATLAB, as shown in figures 1 and 5. Correlation
matching was performed using both cross-correlation and SSD methods, for
all four correlation methods. See figures 2, 3, and 6 for graphical
depictions of the correlation results.
![]() |
![]() |
Figure 2: Castle - correlation map using SSD
Figure 3: Castle - correlation map using cross-correlation
A test suite of trial cases was generated for speed comparision purposes. Using a fixed disparity range defined by R=4, correlation was performed for all four methods with varying correlation window sizes. The results were as expected, presented in table 1 and graphed on a log scale in figure 4. We see that precomputation offers a scaled factor of improvement over the naïve method, although it is still dependent upon the size of W. The optimized and the related interpolated methods are both linear with respect to W and generally perform much faster. Interpolation with a scale factor of 2 resulted in a method almost twice as fast as the optimized computation.
Table 1: Correlation performance results using SSD, R=4
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Figure 4: Correlation performance results using SSD, R=4
4. Conclusions
Clearly, naïve implementations of the correspondence algorithm are much too slow to be of practical use in real-time or near real-time stereo depth extraction systems, even with the use of epipolar geometry, so critical in reducing the correlation search space. By methods of precomputation and buffering of ψ across both disparities and scanlines, the algorithm speed may be significantly enhanced. Furthermore, there exist useful methods for scalar speed improvements. especially when speed is more critical for success than depth extraction precision. When these methods are combined, significant speed up is possible, but more optimization is needed for real-time depth extraction.
References
| [1] | Fusiello, A., V. Roberto, and E. Trucco. Efficient Stereo with Multiple Windowing. Proc. IEEE Intern. Conf. on Computer Vision and Pattern Recognition, 858-863, 1997. |
| [2] | Ishii, Hiroshi. Optimized Correlation-Based Stereo Matching. http://www.stanford.edu/~onihei/cs223b/project/report.html. |
| [3] | Trucco, E. and A. Verri. Introductory Techiques for 3D Computer Vision. Prentice Hall, 1998. |
Appendix
Additional Images
![]() |
![]() |
Figure 6: Wall - correlation map using cross-correlation
Source Code Listings
The following files are availible online at http://www.cs.unc.edu/~nashel/290-075/:
ψ(u,v) functions
SSD.m
cross_correlation.m
Correlation functions
naive_corr.m
precomp_corr.m
optimized_corr.m
interp2_corr.m