8/10/2019 Final Document Print

    3D Reconstruction Based on Image Pyramid and Block Matching

    Dept. of ECE, MRITS 1

    CHAPTER 1

    INTRODUCTION TO 3D

    1.1 3D RECONSTRUCTION FROM SINGLE 2D IMAGES

From a single image of an everyday object, a sculptor can recreate its 3D shape (i.e., produce a statue of the object), even if that particular object has never been seen before. Presumably, it is familiarity with the shapes of similar 3D objects (i.e., objects from the same class), and with how they appear in images, that enables the artist to estimate the shape. This might not be the exact shape of the object, but it is often a good enough estimate for many purposes.

In general, the problem of 3D reconstruction from a single 2D image is ill-posed,

    since different shapes may give rise to the same intensity patterns. To solve this,

    additional constraints are required. Here, we constrain the reconstruction process by

    assuming that similarly looking objects from the same class (e.g., faces, fish), have

    similar shapes. We maintain a set of 3D objects, selected as examples of a specific class.

    We use these objects to produce a database of images of the objects in the class (e.g., by

    standard rendering techniques), along with their respective depth maps. These provide

    examples of feasible mappings from intensities to shapes and are used to estimate the

    shapes of objects in query images.

Methods for single-image reconstruction commonly use cues such as shading,

silhouette shapes, texture, and vanishing points. These methods restrict the allowable

    reconstructions by placing constraints on the properties of reconstructed objects (e.g.,

    reflectance properties, viewing conditions, and symmetry). A few approaches explicitly

    use examples to guide the reconstruction process. One approach reconstructs outdoor

    scenes assuming they can be labelled as ground, sky, and vertical billboards.

The target of the system is a geometric model of the scene. We therefore consider geometric reconstruction and not photometric (or image-based) reconstruction, which directly generates new views of a scene without (completely) reconstructing the 3D structure. With the purpose and application context stated, the limits are set as follows:


Static scenes: There are no moving objects, or the movement of objects is negligible.

Un-calibrated cameras: The input data is captured by an uncalibrated camera, i.e. the

camera's intrinsic parameters, such as the focal length, are unknown.

Varying intrinsic camera parameters: The camera intrinsic parameters (e.g. focal length) can vary freely. Together with the previous assumption, this adds flexibility to the

    system.

1.2 3D RECONSTRUCTION FROM VIDEO SEQUENCES

The application-oriented description of 3D reconstruction from video sequences

(shortly called 3D reconstruction) is as follows:

    1. The process starts with the data capturing step, in which a person moves around

    and captures a static scene using a hand-held camera.

2. The recorded video sequence is then pre-processed (e.g. selecting frames, removing noise, normalizing illumination).

    3. After that, the video sequence is processed to produce a 3D model of the scene.

    4. Finally, the 3D model can be rendered, or exported for editing using 3D modeling

    tools.

    Fig.1.1: Main tasks of 3D reconstruction

The 3D reconstruction (step 3) can be divided into 4 main tasks, which are as follows:

1.

    Feature detection and matching: The objective of this step is to find out the

    same features in different images and match them.

    2. Structure and motion recovery: This step recovers the structure and motion of

the scene (i.e. 3D coordinates of detected features; position, orientation and parameters of

    the camera at capturing positions).


    3. Stereo mapping: This step creates a dense matching map. In conjunction with the

structure recovered in the previous step, this enables building a dense depth map.

    4. Modeling: This step includes procedures needed to make a real model of the

    scene (e.g. building mesh models, mapping textures).

Some define the input as an image sequence, but fig.1.1 defines it as a video sequence, since our practical objective is a system that does reconstruction from video. By defining it like that, we want to clearly state that the intermediate step to go from video to image sequences (i.e. frame selection) is a part of the reconstruction process.

    1.2.1 Feature Detection and Matching

This process creates relations used by the next step, structure and motion recovery, by detecting and matching features in different images. Until now, the features

    used in structure recovery processes are points and lines. So here features are understood

    as points or lines.

    Fig.1.2: Pollefeys 3D modelling framework

Detectors: Given an image, a feature detector is a process that detects features from the image. The most important information a detector gives is the location of features, but other

    characteristics such as the scale can also be detected. Two characteristics that a good

    detector needs are repeatability and reliability. Repeatability means that the same feature can

    be detected in different images. Reliability means that the detected point should be

    distinctive enough so that the number of its matching candidates is small.
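To make the detector idea concrete, the following is a minimal corner-response sketch in the spirit of the Harris detector; the choice of detector, the function names and the parameters are illustrative assumptions, not taken from this text:

```python
import numpy as np

def box3(a):
    # 3x3 box filter via edge-padded slicing (pure NumPy)
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def corner_response(img, k=0.04):
    """Harris-style response: large and positive where the local gradient
    structure varies in two directions (a corner), near zero on flat
    regions, and negative on straight edges."""
    Iy, Ix = np.gradient(img.astype(float))
    # entries of the local second-moment matrix, averaged over a window
    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2
```

A repeatable detector would fire at roughly the same response maxima in different images of the same scene, which is exactly the property described above.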


    Classification: Point descriptors are classified into the following categories:

Distribution-based descriptors: Histograms are used to represent the characteristics of

the region. The characteristics could be pixel intensity, distance from the centre point,

relative ordering of intensity, or gradient.

Spatial-frequency descriptors: These techniques are used in the domain of texture

    classification and description. Texture description using Gabor filters is standardized in

    MPEG7.

Differential descriptors: The descriptor used to evaluate detector reliability is an

    example of a differential descriptor, in which a set of local derivatives (local jet) is used

    to describe an interest region.

Moments: Moments are used to describe a region. The central moments of a

region, in combination with the moment's order and degree, form the invariants.
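As a sketch of the first category, a distribution-based descriptor can be as simple as a normalized intensity histogram over a region; the names and the bin count below are illustrative assumptions:

```python
import numpy as np

def histogram_descriptor(patch, n_bins=16):
    # normalized histogram of pixel intensities (8-bit range assumed)
    hist, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
    return hist / hist.sum()

def descriptor_distance(d1, d2):
    # L1 distance between two descriptors; smaller means a better match
    return np.abs(d1 - d2).sum()
```

Normalization makes the descriptor invariant to region size, and the distance gives the matching cost used to select candidates.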

    1.3 LINES

    Two-view projective reconstruction can only use point correspondences. But in

    three or more view structure recovery it is possible to use line correspondences.

    1.3.1 Line detection

    Line detection usually includes edge detection, followed by line extraction.

    1.3.2 Edge detection

The key to solving the problem is the intensity change, which is shown via the

    gradient of the image. Edge detectors usually follow the same routine: smoothing,

    applying edge enhancement filters, applying a threshold, and edge tracing.

Evaluations of edge detectors are inconsistent and not convergent, for reasons such

as unclear objectives and varying parameters. A series of evaluations has been carried out

on different tasks, in which the application acts as a black box to test the algorithms. One

of them is structure from motion. The evaluation shows that, overall, the Canny detector

is most suitable because of its performance, fastest speed, and low sensitivity to parameter

variation. However, the structure-from-motion algorithm used there is not a three-view

one and uses line segments rather


than lines; also, the intermediate processing (line extraction and correspondence) that

would affect the final result is fixed. Thus the result is not concrete enough.

1.3.3 Line Extraction

Extracting lines can be done in several ways. The Hough transform is famous in

curve fitting; despite its long history, the Hough transform and its extensions are still

widely used. A simpler approach connects line segments with a limit on angle changes,

and then uses the least-median-of-squares method to fit the connected paths into lines. As

with edge detection, no complete and concrete evaluation of line extraction is available.
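A minimal sketch of the Hough transform for lines, in its standard (ρ, θ) parameterization (illustrative code, not taken from this project):

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Vote in (rho, theta) space for every edge pixel.
    edges: binary 2D array; returns the accumulator, thetas and rho offset."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag, n_theta), dtype=int)
    ys, xs = np.nonzero(edges)
    for i, t in enumerate(thetas):
        # rho = x*cos(theta) + y*sin(theta), shifted so indices are non-negative
        rhos = np.round(xs * np.cos(t) + ys * np.sin(t)).astype(int) + diag
        np.add.at(acc, (rhos, np.full_like(rhos, i)), 1)
    return acc, thetas, diag
```

Peaks in the accumulator correspond to lines x·cos θ + y·sin θ = ρ supported by many edge pixels.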

    1.3.4 Line matching

Lines can be matched based on attributes such as orientation, length, or extent of

overlap. Some matching strategies, such as nearest line or additional-view verification,

can be used to increase the speed and accuracy. Optical flow can be employed in the case

of a short baseline. Matching groups of lines (graph matching) is more accurate than

individual matching. Beardsley et al. use the geometric constraints in both the two-view and

three-view cases to match lines. The constraints are found by a robust method with

corresponding points.

Lines are highly structured features and generally give stronger constraints. Lines are

numerous and easy to extract in scenes with dominant artificial objects, e.g. urban

architecture. However, the fact that evaluations of line extraction and matching for

structure recovery are not complete and concrete is probably the reason why the theory of

three-view reconstruction with lines has been available for a long time, yet methods in

structure recovery usually use point correspondences. One of the few works that uses line

correspondences and trifocal tensors is that of Beardsley, but lines are not used directly:

point correspondences are used first to recover the geometry information.

    1.4 STRUCTURE AND MOTION RECOVERY

The second task, structure and motion recovery, recovers the structure of the scene and the

motion information of the camera. The motion information is the position, orientation,


    and intrinsic parameters of the camera at the captured views. The structure information is

    captured by the 3D coordinates of features.

    Given feature correspondences, the geometric constraints among views can be

established. The projection matrices that represent the motion information can then be recovered. Finally, the 3D coordinates of features, i.e. the structure information, can be computed

    via triangulation.
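The triangulation step can be sketched with the standard linear (DLT) method; the code below is an illustrative sketch, not the system's actual implementation:

```python
import numpy as np

def triangulate(points2d, projections):
    """Linear (DLT) triangulation of one 3D point from two or more views.
    points2d: list of (x, y) projections; projections: list of 3x4 matrices."""
    A = []
    for (x, y), P in zip(points2d, projections):
        # each view contributes two linear constraints on the homogeneous point
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]              # null-space vector = homogeneous 3D point
    return X[:3] / X[3]     # dehomogenize
```

With noise-free correspondences and known projection matrices this recovers the point exactly; with noisy data it gives an algebraic least-squares solution.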

    Fig.1.5: Structure and motion recovery process

    1.5 ADVANTAGES AND PROBLEMS OF USING VIDEO

    SEQUENCES

    It is possible to do 3D reconstruction from images. But in practice, it is more

    natural to use video sequences since it eases the capturing process and provides more

complete data. But problems also arise. The following describes the advantages and the

    problems of using video sequences as input.

1.5.1 Advantages

The most important advantage of using video sequences as input is the higher

    quality one can obtain. Both geometric accuracy and visual quality can be improved by

    exploiting the redundancy of data. Intuitively, more back-projecting rays of a point's

projections limit the possible 3D coordinates of the point. The best texture, found by

    selecting the best view or super-resolution can be used to get better visualization quality.


    Image sequences also enable some techniques to deal with shadow, shading and

    highlights.

Other advantages are automaticity and flexibility. Capturing data with a

handheld camera is more comfortable, since a person does not have to worry about

missing information or consider whether the captured information is enough for

reconstruction. And regarding processing time, instead of manually selecting some images

from a video, it is better to have a system that can do everything automatically.

    1.5.2 Problems

    To take advantage of the use of video sequences we have to deal with some

    problems, ranging from pre-processing (frame selection, sequence segmentation), during

    processing as has been seen in previous sub-sections, to post-processing (bundle

    adjustment, structure fusion).

    Frame selection: Among a number of frames, selecting good frames will improve the

    reconstruction result. Good frames are ones that have proper geometric attributes and

    good photometric quality. The problem is related to the estimation of views' position and

    orientation and photometric quality evaluation.

Sequence segmentation: Reconstruction algorithms assume that a sequence is

continuously captured. The sequence should be broken into proper scene parts, which are

reconstructed separately and fused later.

    Structure fusion: Results of processing different video segments (generated either by

    different captures or by segmentation) must be fused together to create a final unique

    result.

    Bundle adjustment: The reconstruction process includes local updates (e.g. feature

matching, structure update) and bias assumptions (e.g. use of the first view's coordinate

system). These lead to inconsistency and accumulated errors in the global result. There should be a global optimization step to produce a unique, consistent result.


    1.6 CRITICAL CASES

A critical case happens when it is impossible to make a metric reconstruction

from the input data, either because of the characteristics of the scene or because of the

capturing positions.

    In practice, metric reconstruction from video sequences captured by a person

    using a hand-held camera hardly falls into an absolute critical case. However, nearly

    critical cases are common in practice, e.g. a camera moving along a wall or on an elliptic

    orbit around the object. That is why studying critical configurations and detecting those

    cases is extremely important to create a robust reconstruction method, or select the most

    suitable method for the case.

There are two kinds of critical cases: (i) critical surfaces or critical configurations

and (ii) critical motion sequences (of the camera). The first class depends on the observed

points. The latter depends only on the camera motion, i.e. it can happen with any scene. A

"brute force" approach can be used to select the best algorithm. This, however, only helps

to reject the case, not to choose the proper method for it. Some important notes about

critical cases are:

Normal cases under some conditions, e.g. calibrated or fixed intrinsic parameters, can

turn into critical ones when the conditions change. The more uncalibrated the camera, i.e.

the fewer camera parameters are known, the more ambiguous the reconstruction will be.

    1.7 IMAGE-BASED 3D RECONSTRUCTION

    Image-based 3D reconstruction is an active field of research in Photogrammetric

    and Computer Vision. The need for detailed 3D models for mapping and navigation,

    inspection, cultural heritage conservation or photorealistic image-based rendering for the

    entertainment industry lead to the development of several techniques to recover the shape

    of objects. To achieve precise and high detailed reconstructions is often employed

    providing 2.5D range images and the respective 3D point cloud in a metric scale.

    On the other hand, laser-based methods are complex to handle for large scale

    outdoor scenes, especially for aerial data acquisition. In contrast to that, passive image-


based methods that utilize multiple overlapping views are easily deployable and low

cost in comparison, but require some post-processing effort to derive depth information. In

this work we investigate how redundancy and baseline influence the depth accuracy of

multiple-view matching methods.

In particular, we perform synthetic experiments on a typical aerial camera

network that corresponds to a 2D flight pattern with 80% forward-overlap and 60% side-lap, as

shown in fig.1.6. By covariance analysis of the triangulated scene points, the theoretical

bound of the depth accuracy is determined according to the triangulation angle and the

number of measurements (i.e. the redundancy).

    One of the main findings is that true multi-view matching/triangulation

    outperforms two-view fused stereo results by at least one order of magnitude in terms of

depth accuracy. Furthermore, we present a fast, accurate and robust matching and

reconstruction technique, suitable for high-resolution images of large-scale scenes, that is

able to compete by leveraging the redundancy of many views. The solution to multi-

view reconstruction is based on pair-wise stereo, employing efficient and robust optical

flow that is restricted to the epipolar geometry. Unlike standard aerial

    Fig.1.6(a) :The view network, a sparse reconstruction and uncertainties

    (magnified by 1000 for better visibility) for selected 3D points on a regularly

    sampled grid on the ground plane


    Fig.1.6(b) : Reconstructed dense point cloud from our multi-view method of an

    urban scene.

matching approaches that rely on 2.5D data fusion of pair-wise stereo depth maps, our

correspondence chaining (i.e. measurement linking) and triangulation approach

takes full advantage of the achievable baseline (i.e. triangulation angles). In contrast to

voxel-based approaches, polygonal meshes and local patches, we focus on algorithms

representing geometry as a set of depth maps. This eliminates the need for resampling the

geometry in the three-dimensional domain and can be easily parallelized. We evaluate the

approach on a multi-view benchmark data set that provides accurate ground truth and on

large-scale aerial images.

    1.8 UNCERTAINTY OF SCENE POINTS

    The depth uncertainty of a rectified stereo pair can be directly determined from

    the disparity error

Δz = (z² / (f · b)) · Δd ... (1)

    where z is the point depth, f the focal length and b the image baseline. Hence the depth

    precision is mainly a function of the ray intersection angle. In contrast, for multi view

    image matching and triangulation , the redundancy not only implies more measurements

    but additionally constrains the 3D point location through multiple ray intersections. These

    entities are not independent but are coupled, since they rely on the network geometric


    configuration that determines image overlap (i.e. redundancy) and baseline,

    simultaneously. Given a photogrammetric network of cameras and correspondences with

    known error distribution, the precision of triangulated points can be determined from the

3D confidence ellipsoid (i.e. covariance matrix CX), as shown in fig. 1.7. An empirical estimate of the covariance ellipsoid corresponding to multi-view triangulation can be

    computed by statistical simulation. For the moment we assume that camera orientations

    and 3D structure are fixed and known.
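The two-view relation of equation (1) can be checked with a few lines of code (a sketch; the symbols follow the equation above):

```python
def depth_uncertainty(z, f, b, sigma_d):
    """Propagate a disparity error to depth. From z = f*b/d it follows that
    |dz/dd| = f*b/d**2 = z**2/(f*b), hence sigma_z = z**2/(f*b) * sigma_d."""
    return z ** 2 / (f * b) * sigma_d
```

The quadratic dependence on z is the key point: doubling the depth quadruples the depth error, while doubling the baseline or the focal length halves it.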

    The cameras are distributed along a 2D grid (corresponding to flight paths) in

order to achieve an 80% forward overlap and 60% side-lap, as shown in fig.1.7. According to

a large-format digital aerial camera (e.g. UltraCam D from Microsoft), the image

resolution is set to 7500 × 11500 pixels with a field of view of 54°. Furthermore, 3D

points are evenly distributed on a 2D plane that corresponds to the bald earth surface

observed from a flying height of 900 m. Therefore, an average Ground Sampling Distance

(GSD) of 8 cm/pixel is achieved.

Given the cameras P_i, i = 1..N (i.e. calibrations and poses) and 3D points X_j, j = 1..M,

the respective ground-truth projections are produced as x_ij = P_i X_j. Therefore, for every 3D point a

set of point-tracks (i.e. 2D measurements) is generated, m = (<x_1, y_1>, <x_2, y_2>, ..., <x_k, y_k>). Next, the 2D

projections are perturbed by zero-mean Gaussian isotropic noise, x̂ = x + N(0, Σ), with

Σ = diag(σ², σ²) ... (2)

and standard deviation σ = 1 pixel (i.e. ≈ 8 cm GSD). Given the set of perturbed point-tracks

m̂ = (<x̂_1, ŷ_1>, <x̂_2, ŷ_2>, ..., <x̂_k, ŷ_k>) and the ground-truth projection

matrices P_i, i = 1..N, the 3D position of the respective point in space is determined. This

process requires the intersection of at least two known rays in space. Hence, we use a

linear triangulation method to determine the 3D position of a point track. This method

generalizes easily to the intersection of multiple rays, providing a least-squares solution.

Optionally, a non-linear optimizer based on the Levenberg-Marquardt algorithm is used to


refine the 3D point by minimizing the reprojection error. Through Monte Carlo simulation

on the perturbed measurement vectors m̂ we obtain a distribution of 3D points X_i

around a mean position X̄. From the Law of Large Numbers it follows that, for a large

number N of simulations, one can approximate the mean 3D position by

X̄ ≈ (1/N) Σ_i X_i ... (3)

and its respective covariance matrix by

C_X = E_N[(X_i − E_N[X_i]) (X_i − E_N[X_i])ᵀ] ... (4)

Using the singular value decomposition, the covariance matrix can then be diagonalized,

C_X = U diag(σ₁², σ₂², σ₃²) Vᵀ ... (5)

where the columns of U represent the main axes of the covariance ellipsoid and the σ_i are the

respective standard deviations. The decomposition of the covariance matrix in equation

(5) into its main axes directly relates to the uncertainty in the x-y and z directions. Under

the assumption of fronto-parallel image acquisition, the largest singular value σ₁

corresponds to the uncertainty in depth, and σ₂ and σ₃ to the uncertainty in the x and y directions,

respectively.
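The simulation of equations (2)-(5) can be sketched as follows; this is an illustrative sketch with a toy two-camera setup, reusing a standard linear triangulation, and all names and parameters are assumptions:

```python
import numpy as np

def triangulate(points2d, projections):
    # standard linear (DLT) triangulation from multiple views
    A = []
    for (x, y), P in zip(points2d, projections):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    X = np.linalg.svd(np.asarray(A))[2][-1]
    return X[:3] / X[3]

def mc_point_covariance(point3d, projections, sigma=1.0, n_sim=2000, seed=0):
    """Empirical mean (eq. 3), covariance (eq. 4) and axis standard
    deviations (eq. 5) of a triangulated point under zero-mean isotropic
    Gaussian image noise (eq. 2)."""
    rng = np.random.default_rng(seed)
    Xh = np.append(np.asarray(point3d, float), 1.0)
    # ground-truth projections x_ij = P_i X
    clean = [(P @ Xh)[:2] / (P @ Xh)[2] for P in projections]
    samples = np.array([
        triangulate([x + rng.normal(0.0, sigma, 2) for x in clean], projections)
        for _ in range(n_sim)
    ])
    mean = samples.mean(axis=0)                               # eq. (3)
    C = np.cov(samples.T)                                     # eq. (4)
    sigmas = np.sqrt(np.linalg.svd(C, compute_uv=False))      # eq. (5)
    return mean, C, sigmas
```

The largest of the returned standard deviations plays the role of σ₁ above, i.e. the depth uncertainty for a roughly fronto-parallel configuration.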

    1.9 LITERATURE SURVEY

With the advent of the multimedia age and the spread of the Internet, video storage on

CD/DVD and video transmission have been gaining a lot of popularity. The ISO Moving Picture

Experts Group (MPEG) video coding standards pertain to compressed video

storage on physical media like CD/DVD, whereas the International Telecommunication

Union (ITU) standards address real-time point-to-point or multi-point communications over a

network. The former has the advantage of having higher bandwidth for data transmission.

    In either standard the basic flow of the entire compression decompression process

    is largely the same and is shown in fig. 1.7. The encoding side estimates the motion in the

    current frame with respect to a previous frame. A motion compensated image for the

    current frame is then created that is built of blocks of image from the previous frame. The


motion vectors for the blocks used in motion estimation are transmitted, as well as the

difference of the compensated image with the current frame, which is also JPEG encoded and

sent. The encoded image that is sent is then decoded at the encoder and used as a

reference frame for the subsequent frames. The decoder reverses the process and creates a

full frame. The whole idea behind motion-estimation-based video compression

is to save on bits by sending JPEG-encoded difference images, which inherently have less

energy and can be highly compressed, as compared to sending a full frame that is JPEG

encoded. Motion JPEG, where all frames are JPEG encoded, achieves anything between

a 10:1 and 15:1 compression ratio, whereas MPEG can achieve a compression ratio of 30:1

and is also usable at a 100:1 ratio. It should be noted that the first frame is always sent in full,

and so are some other frames that may occur at some regular interval (like every 6th

frame). The standards do not specify this, and it might change with every video being

sent, based on the dynamics of the video.

    The most computationally expensive and resource hungry operation in the entire

    compression process is motion estimation. Hence, this field has seen the highest activity

    and research interest in the past two decades. The algorithms that have been implemented

    are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search (NTSS),

    Simple and Efficient TSS (SES), Four Step Search (4SS), Diamond Search (DS), and

    Adaptive Rood Pattern Search (ARPS).


    Fig.1.7: MPEG / H.26x video compression process flow


    1.10 BLOCK MATCHING ALGORITHMS

The underlying supposition behind motion estimation is that the patterns

corresponding to objects and background in a frame of a video sequence move within the

frame to form corresponding objects in the subsequent frame. The idea behind block

matching is to divide the current frame into a matrix of macro blocks, which are then

compared with the corresponding block and its adjacent neighbours in the previous frame to

create a vector that stipulates the movement of a macro block from one location to


another.

MSE = (1/N²) Σᵢ Σⱼ (Cᵢⱼ − Rᵢⱼ)² ... (7)

where N is the side of the macro block, the sums run over i, j = 0 .. N−1, and Cᵢⱼ and Rᵢⱼ are the pixels being compared in the current macro block and the reference macro block, respectively. The Peak Signal-to-Noise Ratio (PSNR), given below, characterizes the motion-compensated image that is created by using the motion vectors and macro blocks from the reference frame:

PSNR = 10 log₁₀ [(peak pixel value)² / MSE]

(a) Exhaustive Search (ES)

    This algorithm, also known as Full Search, is the most computationally expensive

    block matching algorithm of all. This algorithm calculates the cost function at each

possible location in the search window. As a result, it finds the best possible

match and gives the highest PSNR amongst all block matching algorithms. Fast block

    matching algorithms try to achieve the same PSNR doing as little computation as

    possible. The disadvantage to ES is that the larger the search window gets the more

    computations it requires.
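A sketch of exhaustive search with a Mean Absolute Difference (MAD) cost; the block size, names and parameters below are illustrative assumptions:

```python
import numpy as np

def mad(a, b):
    # Mean Absolute Difference between two macro blocks
    return np.abs(a.astype(float) - b.astype(float)).mean()

def exhaustive_search(cur, ref, bx, by, n=16, p=7):
    """Full search: evaluate every displacement in [-p, p] x [-p, p]
    around the block at (bx, by) and keep the least-cost one."""
    block = cur[by:by + n, bx:bx + n]
    best_cost, best_mv = np.inf, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            c = mad(block, ref[y:y + n, x:x + n])
            if c < best_cost:
                best_cost, best_mv = c, (dx, dy)
    return best_mv, best_cost
```

For p = 7 this evaluates up to (2p+1)² = 225 candidate blocks, which is exactly the cost that the fast algorithms below try to avoid.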

    (b)Three Step Search (TSS)

The general idea is represented in fig.1.9. It starts with the search location at the

centre and sets the step size S = 4, for a usual search parameter value of p = 7. It then

searches at eight locations ±S pixels around location (0,0). From these nine locations

searched so far, it picks the one giving the least cost and makes it the new search origin. It

then sets the new step size S = S/2, and repeats a similar search two more times until S = 1. At

that point it finds the location with the least cost function, and the macro block at that

location is the best match. The calculated motion vector is then saved for transmission. It

gives a flat reduction in computation by a factor of 9. So, for p = 7, ES will compute the

cost for 225 macro blocks whereas TSS computes the cost for only 25 macro blocks. The idea

behind TSS is that the error surface due to motion in every macro block is unimodal. A

unimodal surface is a bowl-shaped surface such that the weights generated by the cost

function increase monotonically from the global minimum.
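The three steps described above can be sketched as follows; this is an illustrative sketch, and it assumes a smooth (unimodal) error surface, since that is the premise TSS relies on:

```python
import numpy as np

def three_step_search(cur, ref, bx, by, n=16):
    """TSS: evaluate 9 points at step S = 4, move the origin to the
    least-cost point, halve S, and repeat until S = 1."""
    block = cur[by:by + n, bx:bx + n].astype(float)

    def cost(dx, dy):
        # MAD cost of the candidate block displaced by (dx, dy)
        y, x = by + dy, bx + dx
        if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
            return np.inf
        return np.abs(block - ref[y:y + n, x:x + n]).mean()

    cx = cy = 0
    step = 4
    while step >= 1:
        cands = [(cx + dx, cy + dy)
                 for dy in (-step, 0, step) for dx in (-step, 0, step)]
        costs = [cost(dx, dy) for dx, dy in cands]
        cx, cy = cands[int(np.argmin(costs))]
        step //= 2
    return (cx, cy)
```

Each pass evaluates at most 9 candidates (the centre is re-evaluated here for simplicity), which is where the factor-of-9 reduction over exhaustive search comes from.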


    Fig.1.9: Three Step Search procedure. The motion vector is (5, -3).

    (c)New Three Step Search (NTSS)

    NTSS improves on TSS results by providing a centre biased searching scheme

    and having provisions for half way stop to reduce computational cost. It was one of the

    first widely accepted fast algorithms and frequently used for implementing earlier

standards like MPEG 1 and H.261. The TSS uses a uniformly allocated checking pattern

    for motion detection and is prone to missing small motions. The NTSS process is

    illustrated graphically in fig.1.10. In the first step 16 points are checked in addition to the

    search origin for lowest weight using a cost function. Of these additional search

locations, 8 are at a distance of S = 4 away (similar to TSS) and the other 8 are at S = 1 away from the search origin. If the lowest cost is at the origin, then the search is stopped

    right here and the motion vector is set as (0, 0). If the lowest weight is at any one of the 8

    locations at S = 1, then we change the origin of the search to that point and check for

    weights adjacent to it.


    Fig.1.10: New Three Step Search block matching.

    Big circles are checking points in the first step of TSS and the squares are the

extra 8 points added in the first step of NTSS. Triangles and diamonds are the second step of

NTSS, showing 3 points and 5 points being checked when the least weight in the first step is at

one of the 8 neighbours of the window centre.


    Fig.1.11: Search patterns corresponding to each selected quadrant: (a) Shows all

    quadrants (b) quadrant I is selected (c) quadrant II is selected (d) quadrant III is

    selected (e) quadrant IV is selected.

Depending on which point it is, we end up checking 5 points or 3 points (Fig 1.11(b)

    & (c)). The location that gives the lowest weight is the closest match and motion vector is

    set to that location. On the other hand if the lowest weight after the first step was one of

the 8 locations at S = 4, then we follow the normal TSS procedure. Hence, although this

    process might need a minimum of 17 points to check every macro block, it also has the

    worst-case scenario of 33 locations to check.
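The first NTSS step, with its 17 checking points and halfway-stop logic, can be sketched as follows (a minimal Python illustration; the function and parameter names are our own, and a 16x16 macro block with MAD as the cost function is assumed):

```python
import numpy as np

def mad(cur, ref, x, y, dx, dy, N=16):
    """Mean Absolute Difference between the current macro block at (x, y)
    and the candidate block displaced by (dx, dy) in the reference frame."""
    a = cur[y:y + N, x:x + N].astype(np.float64)
    b = ref[y + dy:y + dy + N, x + dx:x + dx + N].astype(np.float64)
    return np.abs(a - b).mean()

def ntss_first_step(cur, ref, x, y, S=4):
    """First NTSS step: the search origin, the 8 points at distance S
    (as in TSS) and the 8 extra points at distance 1 -- 17 checks in all."""
    offsets = [(0, 0)]
    for step in (S, 1):
        offsets += [(dx, dy) for dx in (-step, 0, step)
                    for dy in (-step, 0, step) if (dx, dy) != (0, 0)]
    costs = {o: mad(cur, ref, x, y, *o) for o in offsets}
    best = min(costs, key=costs.get)
    if best == (0, 0):
        return best, 'stop'                # halfway stop: motion vector (0, 0)
    if max(abs(best[0]), abs(best[1])) == 1:
        return best, 'check-neighbours'    # examine the 3 or 5 adjacent points
    return best, 'continue-tss'            # fall back to the normal TSS steps
```

The returned action tells the caller whether to stop, refine around the inner ring, or continue with the ordinary TSS procedure.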

    (d)Simple and Efficient Search (SES)

    SES is another extension to TSS and exploits the assumption of unimodal error

    surface. The main idea behind the algorithm is that for a unimodal surface there cannot be

two minima in opposite directions, and hence the 8 point fixed pattern search of TSS

    can be changed to incorporate this and save on computations. The algorithm still has

three steps like TSS, but the innovation is that each step has two further phases. The search area is divided into four quadrants and

    the algorithm checks three locations A, B and C as shown in fig... A is at the origin and B

    and C are S = 4 locations away from A in orthogonal directions. Depending on certain

weight distribution amongst the three, the second phase selects a few additional points, as shown in fig. 1.11. The rules for determining the search quadrant for the second phase are as

    follows:

If MAD(A) ≥ MAD(B) and MAD(A) ≥ MAD(C), select (b);

If MAD(A) ≥ MAD(B) and MAD(A) < MAD(C), select (c);

If MAD(A) < MAD(B) and MAD(A) < MAD(C), select (d);

If MAD(A) < MAD(B) and MAD(A) ≥ MAD(C), select (e).
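A compact Python sketch of the quadrant selection (the function name is our own; the ≥/< comparisons follow the standard SES formulation, since the operators are partly garbled in the source text):

```python
def ses_select_pattern(mad_a, mad_b, mad_c):
    """Select the second-phase search pattern from the MAD values at A (the
    origin), B (horizontal, S away) and C (vertical, S away).  The returned
    letter names the corresponding pattern of fig. 1.11."""
    if mad_a >= mad_b and mad_a >= mad_c:
        return 'b'          # quadrant I
    if mad_a >= mad_b and mad_a < mad_c:
        return 'c'          # quadrant II
    if mad_a < mad_b and mad_a < mad_c:
        return 'd'          # quadrant III
    return 'e'              # quadrant IV: MAD(A) < MAD(B), MAD(A) >= MAD(C)
```

The four branches are mutually exclusive and exhaustive, so exactly one pattern is chosen for any weight distribution.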

Once the points to check in the second phase have been selected, we find the location with

    the lowest weight and set it as the origin. We then change the step size similar to TSS and

repeat the above SES procedure until we reach S = 1. The location with the lowest

    weight is then noted down in terms of motion vectors and transmitted. An example

process is illustrated in fig. 1.12.

    Fig.1.12: The SES procedure. The motion vector is (3, 7) in this example.

    Although this algorithm saves a lot on computation as compared to TSS, it was

    not widely accepted for two reasons. Firstly, in reality the error surfaces are not strictly

    unimodal and hence the PSNR achieved is poor compared to TSS. Secondly, there was

another algorithm, Four Step Search, published a year earlier, that presented lower computational cost than TSS and gave significantly better PSNR.

    (e)Four Step Search (4SS)

    Similar to NTSS, 4SS also employs center biased searching and has a halfway

    stop provision. 4SS sets a fixed pattern size of S = 2 for the first step, no matter what the

search parameter p value is. Thus it looks at 9 locations in a 5x5 window. If the least weight is found at the centre of the search window, the algorithm jumps to the fourth step. If the least weight is at one of the eight locations except the centre, then we make it the search origin and move to the second step. The search window is still maintained as 5x5 pixels wide. Depending on where the least weight location was, we might end up checking weights at 3 locations or 5 locations. The

    patterns are shown in Fig 1.13

Fig.1.13: Search patterns of the FSS. (a) First step (b) Second/Third step (c) Second/Third step (d) Fourth step.

    Once again if the least weight location is at the center of the 5x5 search window

we jump to the fourth step or else we move on to the third step. The third step is exactly the same as the second step. In the fourth step the window size is dropped to 3x3, i.e. S = 1. The

location with the least weight is the best matching macro block, and the motion vector is set to point to that location. A sample procedure is shown in Fig 1.14. This search algorithm has a best case of 17 checking points and a worst case of 27 checking points.
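The four steps can be condensed into the following Python sketch (a simplified illustration assuming a 16x16 macro block and MAD as the weight; unlike the real 4SS it re-evaluates all 9 window points at every step instead of only the 3 or 5 new ones, so the checking-point counts above do not apply to it):

```python
import numpy as np

def mad(cur, ref, x, y, dx, dy, N=16):
    """Mean Absolute Difference for the candidate displacement (dx, dy)."""
    a = cur[y:y + N, x:x + N].astype(np.float64)
    b = ref[y + dy:y + dy + N, x + dx:x + dx + N].astype(np.float64)
    return np.abs(a - b).mean()

def four_step_search(cur, ref, x, y):
    """4SS sketch: up to three 5x5 (S = 2) steps, then one 3x3 (S = 1) step."""
    cx, cy = 0, 0                               # current search origin
    for _ in range(3):                          # first to third step, S = 2
        cands = [(cx + dx, cy + dy) for dx in (-2, 0, 2) for dy in (-2, 0, 2)]
        best = min(cands, key=lambda v: mad(cur, ref, x, y, *v))
        if best == (cx, cy):                    # least weight at the centre:
            break                               # jump straight to step four
        cx, cy = best
    cands = [(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    return min(cands, key=lambda v: mad(cur, ref, x, y, *v))   # fourth step
```

On a smoothly varying frame the 5x5 steps home in on the neighbourhood of the true displacement and the final 3x3 step pins down the motion vector.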

    (f)Diamond Search (DS)

    DS algorithm is exactly the same as 4SS, but the search point pattern is changed

    from a square to a diamond, and there is no limit on the number of steps that the

algorithm can take. DS uses two different types of fixed patterns: one is the Large Diamond Search Pattern (LDSP) and the other is the Small Diamond Search Pattern (SDSP). These two patterns and the DS procedure are illustrated in Fig. 1.14. Just like in FSS, the first step uses LDSP, and if the least weight is at the center location we jump to the final step.

The subsequent steps, except the last one, are also similar and use LDSP, but the number of points where the cost function is checked is either 3 or 5, as illustrated in the second and third steps of the procedure shown in Fig. 1.14.

Fig. 1.14: Diamond Search procedure.

    This figure shows the large diamond search pattern and the small diamond search

pattern. It also shows an example path to the motion vector (-4, -2) in five search steps: four applications of LDSP and one of SDSP.

    The last step uses SDSP around the new search origin and the location with the

least weight is the best match. As the search pattern is neither too small nor too big, and there is no limit to the number of steps, this algorithm can find the global minimum very accurately. The end result is a PSNR close to that of ES, while the computational expense is significantly lower.
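A compact sketch of the procedure (Python; a 16x16 macro block and MAD as the weight are assumed, and for brevity every pattern point is re-evaluated at each step rather than only the new ones):

```python
import numpy as np

LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
        (-1, -1), (-1, 1), (1, -1), (1, 1)]          # large diamond, 9 points
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]    # small diamond, 5 points

def mad(cur, ref, x, y, dx, dy, N=16):
    """Mean Absolute Difference for the candidate displacement (dx, dy)."""
    a = cur[y:y + N, x:x + N].astype(np.float64)
    b = ref[y + dy:y + dy + N, x + dx:x + dx + N].astype(np.float64)
    return np.abs(a - b).mean()

def diamond_search(cur, ref, x, y, max_steps=100):
    """Repeat LDSP, re-centred on the best point, until the minimum stays at
    the centre; then a single SDSP pass refines the motion vector."""
    cx, cy = 0, 0
    for _ in range(max_steps):                       # DS itself sets no limit
        best = min([(cx + dx, cy + dy) for dx, dy in LDSP],
                   key=lambda v: mad(cur, ref, x, y, *v))
        if best == (cx, cy):
            break
        cx, cy = best
    return min([(cx + dx, cy + dy) for dx, dy in SDSP],
               key=lambda v: mad(cur, ref, x, y, *v))
```

Because the centre is listed first in each pattern, ties favour stopping, and the cost strictly decreases on every move, so the loop always terminates.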

(g)Adaptive Rood Pattern Search (ARPS)

    ARPS algorithm makes use of the fact that the general motion in a frame is

usually coherent, i.e. if the macro blocks around the current macro block moved in a particular direction, then there is a high probability that the current macro block will also have a similar motion vector. This algorithm uses the motion vector of the macro block to its immediate left to predict its own motion vector. An example is shown in fig. 1.15.

Fig.1.15: Adaptive Rood Pattern: The predicted motion vector is (3, -2), and the step

    size S = Max (|3|, |-2|) = 3.

    The predicted motion vector points to (3, -2). In addition to checking the location

pointed to by the predicted motion vector, it also checks points distributed in a rood pattern,

    as shown in fig.1.15, where they are at a step size of S = Max (|X|, |Y|). X and Y are the

x-coordinate and y-coordinate of the predicted motion vector. This rood pattern search is always the first step. It directly puts the search in an area where there is a high probability

    of finding a good matching block.

    The point that has the least weight becomes the origin for subsequent search steps,

    and the search pattern is changed to SDSP. The procedure keeps on doing SDSP until

the least weighted point is found to be at the center of the SDSP. A further small

    improvement in the algorithm can be to check for Zero Motion Prejudgment, using which

the search is stopped halfway if the least weighted point is already at the center of the rood pattern.
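The first ARPS step can be sketched as follows (Python; the helper name is our own). It returns the four rood-arm points plus the predicted location, skipping a duplicate when the prediction lands on a rood arm, and applying the zero-motion prejudgment:

```python
def arps_first_step(pred_mv):
    """Checking points for the first ARPS step, given the predicted motion
    vector (X, Y) of the macro block to the immediate left.  The rood arms
    use step size S = max(|X|, |Y|)."""
    px, py = pred_mv
    s = max(abs(px), abs(py))
    if s == 0:
        return [(0, 0)]              # zero-motion prejudgment: go straight to SDSP
    points = [(0, 0), (s, 0), (-s, 0), (0, s), (0, -s)]
    if (px, py) not in points:       # avoid double computation at a rood point
        points.append((px, py))
    return points
```

For the predicted vector (3, -2) of fig. 1.15 this yields S = 3 and six checking points; SDSP then takes over from whichever point has the least weight.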

Fig. 1.16: Search points per macro block while computing the PSNR performance of Fast Block Matching Algorithms.

The main advantage of this algorithm over DS is that if the predicted motion vector is (0, 0), it does not waste computational time doing LDSP; rather, it directly starts using

    SDSP. Furthermore, if the predicted motion vector is far away from the center, then again

ARPS saves on computations by directly jumping to that vicinity and using SDSP,

    whereas DS takes its time doing LDSP.

    Care has to be taken to not repeat the computations at points that were checked

earlier. Care also needs to be taken when the predicted motion vector turns out to match one of the rood pattern locations, so we have to avoid double computation at that point. For macro blocks in the first column of the frame, the rood pattern step size is fixed at 2 pixels.

    Fig.1.17: PSNR performance of Fast Block Matching Algorithms.

    1.11 THESIS OUTLINE

Chapter 1 dealt with the general concept of 3D reconstruction and with the block matching methods used in reconstructing a 3D image from 2D images. The methods covered are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search (NTSS), Simple and Efficient Search (SES), Four Step Search (4SS), Diamond Search (DS) and Adaptive Rood Pattern Search (ARPS).

    Chapter 2 deals with the stereo vision algorithms. Chapter 3 deals with the

    implementation of algorithms. Chapter 4 deals with the simulation results. Chapter 5

    describes the conclusion.

    CHAPTER 2

    STEREO VISION ALGORITHMS

    2.1 INTRODUCTION TO STEREO VISION

The stereo correspondence problem has historically been, and continues to be, one of the most investigated topics in computer vision, and a large body of literature on it has been published. The correspondence problem in computer vision concerns the

    matching of points, or other kinds of primitives, in two or more images such that the

    matched elements are the projections of the same physical elements in 3D scene, and the

    resulting displacement of a projected point in one image with respect to the other is

termed the disparity. Similarity is the guiding principle for solving the correspondence problem; however, since the stereo correspondence problem is an ill-posed task, in order to make it tractable it is usually necessary to exploit some additional information or constraints.

    The most popular constraint is the epipolar constraint, which can reduce the

search to one dimension rather than two. Other constraints commonly used are the disparity uniqueness constraint and the continuity constraint.

The origin of the word stereo is the Greek word stereos, which means firm or solid; with stereo vision, objects are seen as solid in three dimensions, with range. In stereo vision, the same scene is captured using two sensors from two different angles. The two captured images have many similarities and a smaller number of differences. In human vision, the brain combines the two captured images by matching the similarities and integrating the differences to obtain a three-dimensional model of the seen objects.

In machine vision, the three-dimensional model of the captured objects is obtained by finding the similarities between the stereo images and using projective geometry to process these matches. The main difficulty of reconstruction using stereo is finding matching correspondences between the stereo pair.

The latest trends in the field mainly pursue real-time execution speeds as well as decent accuracy. As indicated by this survey, the algorithms' theoretical matching cores are quite well established, leading researchers towards innovations resulting in more efficient hardware implementations.

    Detecting conjugate pairs in stereo images is a challenging research problem

    known as the correspondence problem, i.e., to find for each point in the left image, the

corresponding point in the right one. To determine these two points from a conjugate

    pair, it is necessary to measure the similarity of the points. The point to be matched

    without any ambiguity should be distinctly different from its surrounding pixels. Several

    algorithms have been proposed in order to address this problem. However, every

algorithm makes use of a matching cost function so as to establish correspondence between two pixels.

The most common ones are the absolute intensity differences (AD), the squared intensity differences (SD) and the normalized cross correlation (NCC); evaluations of the various matching costs can be found in the literature. Usually, the matching costs are aggregated over

    support regions. Those support regions, often referred to as support or aggregating

windows, could be square or rectangular, fixed-size or adaptive. The aggregation of

the aforementioned cost functions leads to the core of most of the stereo vision methods, which can be mathematically expressed as follows. For the case of the sum of absolute differences (SAD):

SAD(x, y, d) = Σ_W |I_l(x, y) - I_r(x, y - d)| ..........(1)

For the case of the sum of squared differences (SSD):

SSD(x, y, d) = Σ_W (I_l(x, y) - I_r(x, y - d))^2 ..........(2)

And for the case of the NCC:

NCC(x, y, d) = Σ_W I_l(x, y) · I_r(x, y - d) / sqrt(Σ_W I_l^2(x, y) · Σ_W I_r^2(x, y - d)) ..........(3)

where I_l and I_r are the intensity values in the left and right images, (x, y) are the pixel coordinates, d is the disparity value under consideration and W is the aggregated support

    region. The selection of the appropriate disparity value for each pixel is performed

afterwards. The simpler algorithms make use of the winner-takes-all (WTA) method of disparity selection:

D(x, y) = arg min_d SAD(x, y, d) ..........(4)

i.e., for every pixel (x, y) the disparity d with the minimum cost is selected. Equation (4) refers to the SAD method, but any other cost could be used instead. However, in many cases disparity selection is an iterative process, since each pixel's disparity depends on its neighbouring pixels' disparities. As a result, more than one iteration is needed in order to find the best set of disparities. This stage differentiates the local from the global algorithms, which will be analyzed below. An additional disparity refinement step is also frequently used.
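Equations (1) and (4) combine into the classic local pipeline: aggregate SAD over a square window, then pick the winning disparity per pixel. A minimal Python sketch (the function name and the simple box-filter aggregation are our own choices):

```python
import numpy as np

def disparity_map_sad(left, right, max_disp, w=3):
    """WTA disparity selection over SAD costs aggregated on a
    (2w+1) x (2w+1) support window, per equations (1) and (4)."""
    H, W = left.shape
    l, r = left.astype(np.float64), right.astype(np.float64)
    cost = np.full((max_disp + 1, H, W), np.inf)
    k = 2 * w + 1
    for d in range(max_disp + 1):
        diff = np.abs(l[:, d:] - r[:, :W - d])    # |Il(x, y) - Ir(x, y - d)|
        pad = np.pad(diff, w, mode='edge')        # box-filter aggregation
        agg = np.zeros_like(diff)
        for oy in range(k):
            for ox in range(k):
                agg += pad[oy:oy + diff.shape[0], ox:ox + diff.shape[1]]
        cost[d, :, d:] = agg
    return np.argmin(cost, axis=0)                # winner takes all, eq. (4)
```

Swapping the `diff` line for a squared difference or a normalized cross correlation turns the same skeleton into the SSD or NCC variant.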

    2.2 GOAL OF STEREO VISION

The goal of stereo vision is the recovery of the 3D structure of a scene using two or more images of the scene, each acquired from a different viewpoint in space. The images can be obtained using multiple cameras or one moving camera. The term binocular vision is used when two cameras are employed.

    Fig2.1: General setup of cameras

    2.2.1 Stereo setup and terminology

Fixation point: the point of intersection of the optical axes.

Baseline: the distance between the centres of projection.

Epipolar plane: the plane passing through the centres of projection and the point in the scene.

Epipolar line: the intersection of the epipolar plane with the image plane.

Conjugate pair: any point in the scene that is visible in both cameras will be projected to a pair of image points in the two images.

Disparity: the distance between corresponding points when the two images are superimposed.

Disparity map: the disparities of all points form the disparity map (can be displayed as an image).

    Fig2.2: Internal projection of camera in stereo vision

    Figure2.3: Two cameras in arbitrary position and orientation

    2.2.2 Triangulation - the principle underlying stereo vision

    The 3D location of any visible object point in space is restricted to the straight line

    that passes through the center of projection and the projection of the object point.

Binocular stereo vision determines the position of a point in space by finding the intersection of the two lines, each passing through a center of projection and the projection of the point in the corresponding image.
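For the common parallel-axis configuration with rectified images, this triangulation reduces to a closed form: a point seen with disparity d by cameras of focal length f (in pixels) and baseline B lies at depth Z = f · B / d. This formula is not stated explicitly above, but it follows from similar triangles; a trivial Python helper:

```python
def depth_from_disparity(d, f, B):
    """Depth Z = f * B / d for a rectified, parallel-axis stereo rig:
    f is the focal length in pixels, B the baseline, d the disparity."""
    if d <= 0:
        raise ValueError("only positive disparities give a finite depth")
    return f * B / d
```

With f = 500 px and B = 0.1 m, a disparity of 10 px gives a depth of 5 m; halving the disparity doubles the depth, which is why distant points are the hardest to range accurately.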

    Fig2.4: Positions of binocular

    2.2.3 The problems of stereo

    The correspondence problem.

    The reconstruction problem.

    2.2.4 The correspondence problem

    Finding pairs of matched points such that each point in the pair is the projection

    of the same 3D point.

    Triangulation depends crucially on the solution of the correspondence problem.

    Ambiguous correspondence between points in the two images may lead to several

    different consistent interpretations of the scene.

    Fig2.5: Correspondence problem in stereo vision

    2.2.5The reconstruction problem

    Given the corresponding points, we can compute the disparity map.

    The disparity map can be converted to a 3D map of the scene (i.e., recover the 3D

structure) if the stereo geometry is known.

    Fig2.6: Reconstruction problem in stereo vision

    2.3 STEREO CORRESPONDENCE

    The existing techniques for general two-view stereo correspondence roughly fall

into two categories: local methods and global methods. Local methods use only small neighborhoods surrounding the pixels, while global methods optimize some global

    (energy) function. Local methods, such as block matching, gradient-based optimization,

    and feature matching can be very efficient, but they are sensitive to locally ambiguous

    regions in images (e.g., occlusion regions or regions with uniform texture).

    Global methods, such as dynamic programming, intrinsic curves, graph cuts, and

    belief propagation can be less sensitive to these problems since global constraints provide

    additional support for regions difficult to match locally. However, these methods are

    more expensive in their computational cost.

    Stereo correspondence algorithms can be grouped into those producing sparse

    output and those giving a dense result. Feature based methods stem from human vision

    studies and are based on matching segments or edges between two images, thus resulting

    in a sparse output. This disadvantage, dreadful for many purposes, is counterbalanced by

    the accuracy and speed obtained. However, contemporary applications demand more and

    more dense output.

In order to categorize and evaluate them, a taxonomy has been proposed. According

    to this, dense matching algorithms are classified in local and global ones. Local methods

    trade accuracy for speed. They are also referred to as window-based methods because

    disparity computation at a given point depends only on intensity values within a finite

    support window. Global methods (energy-based) on the other hand are time consuming

    but very accurate.

    Their goal is to minimize a global cost function, which combines data and

    smoothness terms, taking into account the whole image. Of course, there are many other

    methods that are not strictly included in either of these two broad classes. The issue of

stereo matching has recruited a variety of computational tools. Advanced computational intelligence techniques are not uncommon and present interesting and promising results.

    While the aforementioned categorization involves stereo matching algorithms in

    general, in practice it is valuable for software implemented algorithms only. Software

    implementations make use of general purpose personal computers (PC) and usually result

    in considerably long running times. However, this is not an option when the objective is

    the development of autonomous robotic platforms, simultaneous localization and

    mapping (SLAM) or virtual reality (VR) systems.

    Such tasks require real-time, efficient performance and demand dedicated

    hardware and consequently specially developed and optimized algorithms. Only a small

    subset of the already proposed algorithms is suitable for hardware implementation.

    Hardware implemented algorithms are characterized from their theoretical algorithm as

    well as the implementation itself. There are two broad classes of hardware

    implementations: the field-programmable gate arrays (FPGA) and the application-

specific integrated circuits (ASIC) based ones. Figure 2 depicts an ASIC chip (a) and an FPGA development board (b). Each one can execute stereo vision algorithms without the

necessity of a PC, saving volume, weight and consumed energy. However, the evolution of FPGAs has made them an appealing choice due to their small prototyping times, their flexibility and their good performance.

    2.4 STEREO MATCHING ALGORITHMS

    The issue of stereo correspondence is of great importance in the field of machine

    vision, computer vision, depth measurements and environment reconstruction as well as

    in many other aspects of production, security, defense, exploration, and entertainment.

    Calculating the distance of various points or any other primitive in a scene relative to the

    position of a camera is one of the important tasks of a computer vision system.

    The most common method for extracting depth information from intensity images

    is by means of a pair of synchronized camera-signals, acquired by a stereo rig. The point-

    by-point matching between the two images from the stereo setup derives the depth

    images, or the so called disparity maps. This matching can be done as a one dimensional

search if accurately rectified stereo pairs, in which horizontal scan lines reside on the same epipolar line, are assumed, as shown in Figure 2.7. A point P1 in one image plane may have arisen from any of the points on the line C1P1, and may appear in the alternate

    image plane at any point on the so-called epipolar line.

    Thus, the search is theoretically reduced within a scan line, since corresponding

pair points reside on the same epipolar line. The difference in the horizontal coordinates

    of these points is the disparity. The disparity map consists of all disparity values of the

image. Having extracted the disparity map, problems such as 3D reconstruction, positioning, mobile robot navigation, obstacle avoidance, etc., can be dealt with in a more

    efficient way.

    Fig 2.7: Geometry of epipolar lines, where C1 and C2 are the left and right camera

    lens centers, respectively. Point P1 in one image plane may have arisen from any of

    points in the line C1P1, and may appear in the alternate image plane at any point on

    the epipolar line E2.

As numerous methods have been proposed since then, this section aspires to review the most recent ones. Most of the results presented in the rest of this chapter are based on the commonly used image sets and tests.

    The most common image sets are presented in Figure 2.8. Table 2.1 summarizes

their size as well as the number of disparity levels. Experimental results based on these image sets are given, where available. The preferred metric adopted in this chapter, in order to depict the quality of the resulting disparity maps, is the percentage of pixels whose absolute disparity error is greater than 1 in the unoccluded areas of the image.

    This metric, considered the most representative of the results quality, was used so as to

    make comparison easier. Other metrics, like error rate and root mean square error are also

    employed.

Fig.2.8: Left image of the stereo pair (left) and ground truth (right) for the Tsukuba (a), Sawtooth (b), Map (c), Venus (d), Cones (e) and Teddy (f) stereo pairs.

    The speed with which the algorithms process input image pairs is expressed in

    frames per second (fps). This metric has of course a lot to do with the used computational

platform and the kind of implementation. Inevitably, speed results are not directly

    comparable.

                  Tsukuba  Map      Sawtooth  Venus    Cones    Teddy

Size in pixels    384x288  284x216  434x380   434x383  450x375  450x375

Disparity levels  16       30       20        20       60       60

    Table 2.1: Characteristics of the most common image sets

    2.5 DENSE DISPARITY ALGORITHMS

    Methods that produce dense disparity maps gain popularity as the computational

power grows. Moreover, contemporary applications benefit from, and consequently demand, dense depth information. Therefore, in recent years efforts towards this

    direction are being reported much more frequently than towards the direction of sparse

    results.

Dense disparity stereo matching algorithms can be divided into two general classes,

    according to the way they assign disparities to pixels. Firstly, there are algorithms that

    decide the disparity of each pixel according to the information provided by its local,

    neighboring pixels. There are, however, other algorithms which assign disparity values to

    each pixel depending on information derived from the whole image. Consequently, the

    former ones are called local methods while the latter ones global.

    2.5.1. Local methods.

Local methods are usually fast and can at the same time produce decent results. Several new methods have been presented. In Figure 2.9, a Venn diagram presents the main characteristics of the local methods presented below. Under the term color usage we have grouped the methods that take advantage of the chromatic information of the image pair. Any algorithm can process color images, but not every one can use them in a beneficial way. Furthermore, in Figure 2.9 NCC stands for the use of normalized cross correlation

    and SAD for the use of sum of absolute differences as the matching cost function.

    As expected, the use of SAD as matching cost is far more widespread than any

other. One method uses the sum of absolute differences (SAD) correlation measure for RGB color images. It achieves high speed and reasonable quality. It makes use of the left

    to right consistency and uniqueness constraints and applies a fast median filter to the

    results.

It can achieve 20 fps for a 160x120 pixels image size, making this method suitable

    for real time applications. The PC platform is Linux on a dual processor 800MHz

    Pentium III system with 512 MB of RAM.

    Fig.2.9: Diagrammatic representation of the local methods categorization.

    Another fast area-based stereo matching algorithm, which uses the SAD as error

function, has also been presented. Based on the uniqueness constraint, it rejects previous matches as soon as better ones are detected. In contrast to bidirectional matching algorithms, this one

performs only one matching phase, while achieving similar results. The results obtained

    are tested for reliability and sub-pixel refined. It produces dense disparity maps in real-

    time using an Intel Pentium III processor running at 800MHz. The algorithm achieves

a speed of 39.59 fps for 320x240 pixels and 16 disparity levels, and the root mean square error

    for the standard Tsukuba pair is 5.77%.

The objective is to achieve minimum segmentation. The experimental results

    indicate 1.77%, 0.61%, 3.00%, and 7.63% error percentages. The execution speed of the

    algorithm varies from 1 to 0.2 fps on a 2.4GHz processor.

Another method presents almost real-time performance; it makes use of a

    refined implementation of the SAD method and a left-right consistency check. The errors

    in the problematic regions are reduced using different sized correlation windows. Finally,

    a median filter is used in order to interpolate the results. The algorithm is able to process

7 fps for 320x240 pixel images and 32 disparity levels. These results are obtained using

    an Intel Pentium 4 at 2.66GHz Processor.
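The left-right consistency check used by these methods can be sketched as follows (Python; the tolerance parameter and the use of -1 as an invalid marker are our own conventions): a left-image disparity d at column x survives only if the right-image disparity at column x - d agrees with it.

```python
import numpy as np

def left_right_check(disp_l, disp_r, tol=1):
    """Invalidate left-image disparities that fail the cross-check
    |d_l(y, x) - d_r(y, x - d_l(y, x))| <= tol, marking them -1."""
    H, W = disp_l.shape
    ys, xs = np.indices((H, W))
    xr = xs - disp_l                              # matching right-image column
    inside = xr >= 0
    back = disp_r[ys, np.clip(xr, 0, W - 1)]      # clip only to index safely
    valid = inside & (np.abs(disp_l - back) <= tol)
    out = disp_l.copy()
    out[~valid] = -1
    return out
```

Pixels that fail the check typically correspond to occlusions or mismatches, which is why the surviving map is sparser but considerably more reliable.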

A window-based method for correspondence search has been presented that uses

    varying support-weights. The support-weights of the pixels in a given support window

    are adjusted based on color similarity and geometric proximity to reduce the image

    ambiguity. The difference between pixel colors is measured in the CIE Lab color space

    because the distance of two points in this space is analogous to the stimulus perceived by

the human eye. The running time for the image pair with a 35x35 pixels support window is about 0.016 fps on an AMD 2700+ processor. The error ratio is 1.29%, 0.97%, 0.99%, and 1.13% for the Tsukuba, Sawtooth, Venus, and Map image sets, respectively. These figures can be further improved

    through a left-right consistency check.
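The adaptive support-weight idea described above condenses into a single expression: each window pixel's weight decays exponentially with its color distance dc from the window centre (measured in the CIE Lab space) and its spatial distance dg. A sketch (the gamma constants are typical published choices, not taken from this text):

```python
import math

def support_weight(dc, dg, gamma_c=7.0, gamma_p=36.0):
    """w = exp(-(dc / gamma_c + dg / gamma_p)): pixels that look like the
    window centre and lie close to it contribute most to the aggregated cost."""
    return math.exp(-(dc / gamma_c + dg / gamma_p))
```

The centre pixel itself gets weight 1, and the weight falls off smoothly with both distances, which is what lets a large fixed window behave like an adaptively shaped one.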

    For given input images, specular free two-band images are generated. The

    similarity between pixels of these input-image representations can be measured using

    various correspondence search methods such as the simple SAD-based method, the

adaptive support-weights method and the dynamic programming (DP) method. This pre-processing step can be performed in real time and compensates satisfactorily for specular reflections.

Another approach employs the zero mean normalized cross correlation (ZNCC) as matching cost. This method integrates a neural network (NN) model, which uses the least-mean-square delta rule for training. The NN decides on the proper window shape and size for each support region. The results obtained are satisfactory, but the 0.024 fps running speed reported for the common image sets, on a Windows platform with a 300MHz processor, renders this method unsuitable for real-time applications.
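The ZNCC cost itself is standard and can be stated compactly. A minimal sketch for two equally sized windows:

```python
import numpy as np

def zncc(a, b, eps=1e-12):
    """Zero-mean normalized cross correlation between two windows.
    Ranges from -1 to 1, with 1 for a perfect (affine-related) match;
    eps guards against division by zero on constant windows."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps
    return float((a * b).sum() / denom)
```

Because the mean is subtracted and the result is normalized, ZNCC is invariant to gain and offset changes between the two views, which is why it is favoured over SAD when the cameras are not radiometrically matched.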

Based on the same matching cost function, a more complex area-based method has been proposed, in which a perceptual organization framework considering both binocular and monocular cues is utilized. An initial matching is performed by a combination of

    normalized cross correlation techniques. The correct matches are selected for each pixel

    using tensor voting. Matches are then grouped into smooth surfaces. Disparities for the

    unmatched pixels are assigned so as to ensure smoothness in terms of both surface


orientation and color. The percentage of unoccluded pixels whose absolute disparity error is greater than 1 is 3.79, 1.23, 9.76, and 4.38 for the image sets. The execution speed reported is about 0.002 fps for the image pair with 20 disparity levels, running on an Intel Pentium 4 processor at 2.8GHz.

    There are, of course, more hardware-oriented proposals as well. Many of them

    take advantage of the contemporary powerful graphics machines to achieve enhanced

    results in terms of processing time and data volume. A hierarchical disparity estimation

algorithm implemented on a programmable 3D graphics processing unit (GPU) has been reported; this method can process either rectified or uncalibrated image pairs. Bidirectional matching is utilized in conjunction with a locally aggregated sum of absolute intensity differences.

Moreover, an approach based on Cellular Automata (CA) presents an architecture for real-time extraction of disparity maps. It is capable of processing 1 Mpixel image pairs at more than 40 fps. The core of the algorithm relies on matching pixels of each scan-line using a one-dimensional window and the SAD matching cost. The method also involves a pre-processing mean filtering step and a post-processing CA-based filtering step.

CA are models of physical systems, where space and time are discrete and interactions are local. They can easily handle complicated boundary and initial conditions. In CA analysis, physical processes and systems are described by a cell array and a local rule, which defines the new state of a cell depending on the states of its neighbors. All cells can work in parallel, since each cell can independently update its own state. Therefore, the proposed CA algorithm is massively parallel and an ideal candidate for hardware implementation.
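A synchronous CA update of the kind described above can be sketched as follows. The majority-vote rule shown is only an illustrative example of a local rule that could smooth a disparity map; the surveyed work defines its own rule:

```python
import numpy as np

def ca_step(cells, rule):
    """One synchronous update of a 2D cellular automaton: every cell's
    new state depends only on its 3x3 neighbourhood, so in hardware all
    cells update in parallel (here sequentially, for clarity)."""
    h, w = cells.shape
    padded = np.pad(cells, 1, mode='edge')   # replicate borders
    out = np.empty_like(cells)
    for y in range(h):
        for x in range(w):
            out[y, x] = rule(padded[y:y + 3, x:x + 3])
    return out

# an illustrative local rule: majority vote over the neighbourhood,
# usable as a post-processing filter on a disparity map
majority = lambda n: int(np.median(n))
```

Because `rule` reads only the previous generation, the update order does not matter, which is exactly the property that makes CA attractive for FPGA-style parallel implementations.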

    2.5.2 Global methods

Contrary to local methods, global ones produce very accurate results. Their goal is to find the optimum disparity function d = d(x, y) that minimizes a global cost function E, which combines data and smoothness terms.


E(d) = E_data(d) + k · E_smooth(d)    (5)

where E_data takes into consideration the (x, y) pixel values throughout the image, E_smooth encodes the algorithm's smoothing assumptions, and k is a weight factor.
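Equation (5) can be made concrete with a small sketch. Here E_data sums the matching cost of each pixel's chosen disparity from a precomputed cost volume, and E_smooth penalises the absolute disparity difference between neighbouring pixels; this particular smoothness term is one common choice, not one the survey prescribes:

```python
import numpy as np

def global_energy(disp, cost_volume, k=0.1):
    """Evaluate E(d) = E_data(d) + k * E_smooth(d) for a candidate
    disparity map.  cost_volume[y, x, d] is the matching cost of
    assigning disparity d to pixel (x, y)."""
    h, w = disp.shape
    yy, xx = np.mgrid[:h, :w]
    e_data = cost_volume[yy, xx, disp].sum()
    # truncated-free L1 smoothness over horizontal and vertical neighbours
    e_smooth = (np.abs(np.diff(disp, axis=0)).sum()
                + np.abs(np.diff(disp, axis=1)).sum())
    return e_data + k * e_smooth
```

Global methods differ mainly in how they search for the map minimizing this quantity (graph cuts, belief propagation, DP, and so on), not in the form of E itself.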

The main disadvantage of the global methods is that they are more time-consuming and computationally demanding. The source of these characteristics is the iterative refinement approach they employ. They can be roughly divided into those performing a global energy minimization and those pursuing the minimum for independent scan-lines using DP.

Figure 2.10 presents the main characteristics of the global algorithms discussed below. It is clear that recently published works prefer global optimization over DP. This observation is not surprising, considering that the term global optimization actually covers quite a few different methods. Additionally, DP tends to produce inferior, thus less impressive, results. Therefore, applications that do not have running speed constraints preferably utilize global optimization methods.

    2.6 REVIEW OF STEREO VISION ALGORITHMS

    Fig.2.10: Diagrammatic representation of the global methods categorization


    2.6.1. Global optimization

    The algorithms that perform global optimization take into consideration the whole

    image in order to determine the disparity of every single pixel. An increasing portion of

    the global optimization methodologies involves segmentation of the input images

    according to their colors.

    The algorithm presented uses color segmentation. Each segment is described by a

    planar model and assigned to a layer using a mean shift based clustering algorithm. A

    global cost function is used that takes into account the summed up absolute differences,

    the discontinuities between segments and the occlusions. The assignment of segments to

    layers is iteratively updated until the cost function improves no more. The experimental

results indicate that the percentage of unoccluded pixels whose absolute disparity error is greater than 1 is 1.53, 0.16, and 0.22 for the image sets, respectively.

Another stereo matching algorithm makes use of color segmentation in conjunction with the graph cuts method. The reference image is divided into non-overlapping segments using the mean shift color segmentation algorithm. Thus, a set of planes in the disparity space is generated. The minimization of an energy function is addressed in the segment rather than the pixel domain. A disparity plane is fitted to each segment using the graph cuts method. This algorithm presents good performance in the textureless and occluded regions as well as at disparity discontinuities. The running speed reported is 0.33 fps for a 384×288 pixel image pair when tested on a 2.4GHz Pentium 4 PC. The percentage of bad matched pixels is found to be 1.23, 0.30, 0.08, and 1.49 for the test image sets (the last being Map), respectively.

The ultimate goal of another work is to render dynamic scenes with interactive viewpoint control from the images of a few cameras. A suitable color segmentation-based algorithm is developed and implemented on a programmable ATI 9800 PRO GPU. Disparities within segments must vary smoothly, each image is treated equally, occlusions are modelled explicitly, and consistency between disparity maps is enforced, resulting in higher quality depth maps. The results for each pixel are refined in conjunction with the others.


In another method that uses the concept of image color segmentation, an initial disparity map is calculated using an adaptive window technique. The segments are iteratively combined into larger layers. The assignment of segments to layers is optimized using a global cost function. The quality of the disparity map is measured by warping the reference image to the second view, comparing it with the real image, and calculating the color dissimilarity.
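The warping-based quality measure just described can be sketched as follows (grayscale version for brevity; the surveyed method compares colors):

```python
import numpy as np

def warp_error(ref, other, disp):
    """Warp the reference view into the other view using the disparity
    map and accumulate the absolute intensity dissimilarity.  A lower
    value indicates a better disparity map."""
    h, w = ref.shape
    err = 0.0
    for y in range(h):
        for x in range(w):
            xw = x - disp[y, x]          # corresponding column in other view
            if 0 <= xw < w:              # skip pixels warped out of frame
                err += abs(float(ref[y, x]) - float(other[y, xw]))
    return err
```

A perfect disparity map on noise-free, Lambertian input would warp to an error of zero; in practice the measure is used to compare competing layer assignments.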

For the 384×288 pixel and the 434×383 pixel Venus test sets, the algorithm produces results at a 0.05 fps rate. For the 450×375 pixel Teddy image pair, the running speed decreases to 0.01 fps due to the increased scene complexity. Running speeds refer to an Intel Pentium 4 2.0GHz processor. The root mean square errors obtained are 0.73, 0.31 (Venus), and 1.07 for the respective image pairs.

    Moreover, Sun and his colleagues presented a method which treats the two

    images of a stereo pair symmetrically within an energy minimization framework that can

    also embody color segmentation as a soft constraint. This method enforces that the

    occlusions in the reference image are consistent with the disparities found for the other

    image. Belief propagation iteratively refines the results. Moreover, results for the version

    of the algorithm that incorporates segmentation are better.

The percentage of pixels with disparity error larger than 1 is 0.97, 0.19, 0.16, and 0.16 for the test image sets (the last being Map), respectively. The running speed for the aforementioned data sets is about 0.02 fps, tested on a 2.8GHz Pentium 4 processor.

    Color segmentation is utilized as well. The matching cost used here is a self-

    adapting dissimilarity measure that takes into account the sum of absolute intensity

differences as well as a gradient-based measure. Disparity planes are extracted using a technique that is insensitive to outliers. Disparity plane labelling is performed using belief

    propagation. Execution speed varies between 0.07 and 0.04 fps on a 2.21GHz AMD

    Athlon 64 processor. The results indicate 1.13, 0.10, 4.22, and 2.48 percent of bad

    matched pixels in non-occluded areas for the image sets, respectively.


Finally, one more algorithm utilizes energy minimization, color segmentation, plane fitting and repeated application of hierarchical belief propagation. This algorithm takes into account a color-weighted correlation measure. Discontinuities and occlusions are properly handled. The percentage of pixels with disparity error larger than 1 is 0.88, 0.14, 3.55, and 2.90 for the Tsukuba, Venus, Teddy and Cones image sets, respectively.

Elsewhere, two new symmetric cost functions for global stereo methods are proposed: a symmetric data cost function for the likelihood, as well as a symmetric discontinuity cost function for the prior in the MRF model for stereo. Both the reference image and the target image are taken into account to improve performance, without modelling half-occluded pixels explicitly and without using color segmentation. The use of both proposed symmetric cost functions in conjunction with a belief propagation based stereo method is evaluated.

    Experimental results for standard test bed images show that the performance of

    the belief propagation based stereo method is greatly improved by the combined use of

the proposed symmetric cost functions. The percentage of badly matched pixels in the non-occluded areas was found to be 1.07, 0.69, 0.64, and 1.06 for the image sets, respectively.

    The incorporation of Markov random fields (MRF) as a computational tool is also a

    popular approach.

A method based on Bayesian estimation theory with a prior MRF model for the assigned disparities has been described, in which the continuity, coherence and occlusion constraints as well as the adjacency principle are taken into account. The optimal estimator is computed using a Gauss-Markov random field model for the corresponding posterior marginal, which results in a diffusion process in the probability space. The results are accurate, but the algorithm is not suitable for real-time applications, since it needs a few minutes to process a 256×255 stereo pair with up to 32 disparity levels on an Intel Pentium III running at 450MHz.

Another approach treats every pixel of the input images as generated either by a process responsible for the pixels visible from the reference camera, which obey


    the constant brightness assumption, or by an outlier process, responsible for the pixels

    that cannot be corresponded. Depth and visibility are jointly modelled as a hidden MRF,

    and the spatial correlations of both are explicitly accounted for by defining a suitable

Gibbs prior distribution. An expectation maximization (EM) algorithm keeps track of which points of the scene are visible in which images, and accounts for visibility

    configurations. The percentages of pixels with disparity error larger than 1 are 2.57, 1.72,

    6.86 and 4.64 for the image sets, respectively.

Moreover, a stereo method specifically designed for image-based rendering has been described. This algorithm uses over-segmentation of the input images and computes matching values over entire segments rather than single pixels. Color-based segmentation preserves object boundaries. The depths of the segments for each image are computed using loopy belief propagation within an MRF framework. Occlusions are also considered. The percentage of bad matched pixels in the unoccluded regions is 1.69, 0.50, 6.74, and 3.19 for the Tsukuba, Venus, Teddy and Cones image sets, respectively. The aforementioned results refer to a 2.8GHz PC platform.

    An algorithm based on a hierarchical calculation of mutual information based

    matching cost is proposed. Its goal is to minimize a proper global energy function, not by

    iterative refinements but by aggregating matching costs for each pixel from all directions.

The final disparity map is sub-pixel accurate and occlusions are detected. The processing speed for the image set is 0.77 fps. The error in unoccluded regions is found to be less than 3% for all the standard image sets. Calculations are made on an Intel Xeon processor

    running at 2.8GHz.

Mutual information is once again used as cost function in another work. The extensions applied there result in intensity-consistent disparity selection for untextured areas and discontinuity-preserving interpolation for filling holes in the disparity maps. It successfully treats complex shapes and uses planar models for untextured areas. A bidirectional consistency check, sub-pixel estimation as well as invalid-disparity interpolation are performed.

The experimental results indicate that the percentages of bad matching pixels in unoccluded regions are 2.61, 0.25, 5.14, and 2.77 for the Tsukuba, Venus, Teddy and


Cones image sets, respectively, with 64 disparity levels searched each time. However, the reported running speed on a 2.8GHz PC is less than 1 fps.

In another work, dense disparity estimation is accomplished by a region-dividing technique that uses a Canny edge detector and a simple SAD function. The results are refined by regularizing the vector fields by means of minimizing an energy function. The root mean square errors obtained from this method are 0.9278 and 0.9094 for the image pairs. The

    running speed is 0.15 fps and 0.105 fps respectively on a Pentium 4 PC running Windows

    XP.

An uncommon measure is used in a work that describes an algorithm focused on achieving contrast-invariant stereo matching. It relies on multiple spatial frequency channels for local matching. The measure for this stage is the deviation of the phase difference from zero. The global solution is found by a fast non-iterative left-right diffusion process. Occlusions are found by enforcing the uniqueness constraint. The

    algorithm is able to handle significant changes in contrast between the two images and

    can handle noise in one of the frequency channels.

Another algorithm that generates high-quality results in real time is based on the minimization of a global energy function comprising a data and a smoothness term. Hierarchical belief propagation iteratively optimizes the smoothness term, achieving fast convergence by removing the redundant computations involved. In order to accomplish real-time operation, the authors take advantage of the parallelism of graphics hardware (GPU).

Experimental results indicate a 16 fps processing speed for 320×240 pixel self-recorded images with 16 disparity levels. The percentages of bad matching pixels in unoccluded regions for the image sets are found to be 1.49, 0.77, 8.72, and 4.61. The computer used is a 3GHz PC and the GPU is an NVIDIA 7900 GTX graphics card with 512MB of video memory.

Further work indicates that the computational cost of the graph cuts stereo correspondence technique can be efficiently decreased using the results of a simple local


stereo algorithm to limit the disparity search range. The idea is to analyze and exploit the failures of local correspondence algorithms. This method can accelerate the processing by a factor of 2.8, compared to the sole use of graph cuts, while the resulting energy is worse only by an average of 1.7%. These results proceed from an analysis done on a large dataset of 32 stereo pairs using a Pentium 4 at 2.6GHz PC.
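The range-limiting idea above amounts to deriving a per-pixel search interval from a fast local result, so that the subsequent global method only evaluates disparities near it. A minimal sketch, with the margin being an illustrative assumption:

```python
import numpy as np

def limited_range(local_disp, margin=2, max_disp=15):
    """Per-pixel disparity search interval [lo, hi] for a global
    method, centred on a fast local disparity estimate and clipped
    to the valid range."""
    lo = np.clip(local_disp - margin, 0, max_disp)
    hi = np.clip(local_disp + margin, 0, max_disp)
    return lo, hi
```

Shrinking the label set per pixel is what yields the reported speed-up, at the cost of occasionally excluding the true minimum where the local method failed badly.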

    2.6.2. Dynamic programming

    Many researchers develop stereo correspondence algorithms based on DP. This

    methodology is a fair trade-off between the complexity of the computations needed and

    the quality of the results obtained. In every aspect, DP stands between the local

algorithms and the global optimization ones. However, its computational complexity still renders it a less preferable option for hardware implementation.
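The core of scan-line DP stereo can be sketched as follows: given a local matching cost per (pixel, disparity) pair, the optimal disparity assignment along one scan-line is found by a forward accumulation pass and a backtracking pass. The linear transition penalty is one common choice, not one the surveyed works all share:

```python
import numpy as np

def dp_scanline(cost, smooth=0.5):
    """Optimal disparities along one scan-line by dynamic programming.
    cost[x, d] is the local matching cost of disparity d at column x;
    `smooth` penalises each unit of disparity change between neighbours."""
    w, nd = cost.shape
    acc = cost.copy()                       # accumulated optimal cost
    back = np.zeros((w, nd), dtype=int)     # backpointers for backtracking
    for x in range(1, w):
        for d in range(nd):
            trans = acc[x - 1] + smooth * np.abs(np.arange(nd) - d)
            back[x, d] = int(np.argmin(trans))
            acc[x, d] += trans[back[x, d]]
    disp = np.zeros(w, dtype=int)
    disp[-1] = int(np.argmin(acc[-1]))
    for x in range(w - 1, 0, -1):           # follow the cheapest path back
        disp[x - 1] = back[x, disp[x]]
    return disp
```

Because each scan-line is optimized independently, the classic streaking artefact appears between lines, which is exactly what the two-pass and tree-based variants discussed below try to remove.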

One work presents a unified framework that allows the fusion of any partial

    knowledge about disparities, such as matched features and known surfaces within the

    scene. It combines the results from corner, edge and dense stereo matching algorithms to

    impose constraints that act as guide points to the standard DP method. The result is a

    fully automatic dense stereo system with up to four times faster running speed and greater

    accuracy compared to results obtained by the sole use of DP.

    One or more disparity candidates for the true disparity of each pixel are assigned

by local matching using oriented spatial filters. Afterwards, a two-pass DP technique that performs optimization both along and between the scan-lines is applied. The result is the reduction of false matches as well as of the typical inter-scan-line inconsistency problem.

In another approach, the per-pixel matching costs are aggregated in the vertical direction only, resulting in improved inter-scan-line consistency and sharp object boundaries. This work exploits the color and distance proximity based weight assignment for the pixels inside a fixed support window, as reported above. The real-time performance is achieved due to the parallel use of the CPU and the GPU of a computer. This implementation can process


320×240 pixel images with 16 disparity levels at 43.5 fps and 640×480 pixel images with 16 disparity levels at 9.9 fps.

In contrast, another algorithm applies the DP method not across individual scan-lines but to a tree structure. Thus, the minimization procedure accounts for all the pixels of the image, compensating for the known streaking effect without being iterative. The reported running speed is a couple of frames per second for the tested image pairs, so real-time implementations are feasible. At the same time, the results obtained are comparable to those of the time-consuming global methods.

In a follow-up, the pixel-tree approach of the previous work is replaced by a region-tree one. First of all, the image is color-segmented using the mean-shift algorithm. During the stereo matching, a corresponding energy function defined on this region-tree structure is optimized using the DP technique. Occlusions are handled by compensating for border occlusions and by applying cross-checking. The obtained results indicate that the percentage of bad matched pixels in unoccluded regions is 1.39, 0.22, 7.42, and 6.31 for the Tsukuba, Venus, Teddy and Cones image sets. The running speed, on a 1.4GHz Intel Pentium M

    with 60 disparity levels.

    2.6.3 Other methods

There are of course other methods, producing dense disparity maps, which can be placed in neither of the previous categories. The methods discussed below use either wavelet-based techniques or combinations of various techniques. One such method is based on the continuous wavelet transform (CWT). It makes use of the redundant information that results from the CWT. Using 1D orthogonal and bi-orthogonal wavelets as well as 2D orthogonal wavelets, the maximum matching rate obtained is 88.22% for the image pair. Up-sampling the pixels in the horizontal direction by a factor of two, through zero insertion, further decreases the noise, and the matching rate is increased to 84.91%.

    Another work presents an algorithm based on non-uniform rational B-splines

(NURBS) curves. The curves replace the edges extracted with a wavelet-based method. The NURBS curves are projectively invariant and so reduce false matches due to distortion


    and image noise. Stereo matching is then obtained by estimating the similarity between

    projections of curves of an image and curves of another image. A 96.5% matching rate

    for a self-recorded image pair is reported for this method.

Finally, a different way of confronting the stereo matching issue is to investigate the possibility of fusing the results from spatially differentiated (stereo vision) scenery images with those from temporally differentiated (structure from motion) ones. This approach takes advantage of the merits of both methods, improving the performance.

    2.7 SPARSE DISPARITY ALGORITHMS

    Algorithms resulting in sparse, or semi-dense, disparity maps tend to be less

    attractive as most of the contemporary applications require dense disparity information.

However, they are very useful when fast depth estimation is required and, at the same time, detail across the whole picture is not so important. This type of algorithm tends to focus on the main features of the images, leaving occluded and poorly textured areas unmatched.

Consequently, high processing speeds and accurate results, albeit of limited density, are achieved. Very interesting ideas flourish in this direction, but since contemporary interest is directed towards dense disparity maps, only a few indicative algorithms are discussed here.

One algorithm detects and matches dense features between the left and right images of a stereo pair, producing a semi-dense disparity map. A dense feature is a

    conn