8/10/2019 Final Document Print

    3D Reconstruction Based on Image Pyramid and Block Matching

    Dept. of ECE, MRITS 1

    CHAPTER 1

    INTRODUCTION TO 3D

    1.1 3D RECONSTRUCTION FROM SINGLE 2D IMAGES

From a single image of an everyday object, a sculptor can recreate its 3D shape (i.e., produce a statue of the object), even if that particular object has never been seen before. Presumably, it is familiarity with the shapes of similar 3D objects (i.e., objects from the same class), and with how they appear in images, that enables the artist to estimate the shape. This might not be the exact shape of the object, but it is often a good enough estimate for many purposes.

In general, the problem of 3D reconstruction from a single 2D image is ill-posed,

    since different shapes may give rise to the same intensity patterns. To solve this,

    additional constraints are required. Here, we constrain the reconstruction process by

    assuming that similarly looking objects from the same class (e.g., faces, fish), have

    similar shapes. We maintain a set of 3D objects, selected as examples of a specific class.

    We use these objects to produce a database of images of the objects in the class (e.g., by

    standard rendering techniques), along with their respective depth maps. These provide

    examples of feasible mappings from intensities to shapes and are used to estimate the

    shapes of objects in query images.

Methods for single-image reconstruction commonly use cues such as shading,

silhouette shapes, texture, and vanishing points. These methods restrict the allowable

    reconstructions by placing constraints on the properties of reconstructed objects (e.g.,

    reflectance properties, viewing conditions, and symmetry). A few approaches explicitly

    use examples to guide the reconstruction process. One approach reconstructs outdoor

    scenes assuming they can be labelled as ground, sky, and vertical billboards.

The target of the system is a geometric model of the scene. We therefore consider geometric reconstruction and not photometric (or image-based) reconstruction, which directly generates new views of a scene without (completely) reconstructing the 3D structure. With the purpose and application context stated, the limits are set as follows:


Static scenes: There are no moving objects, or the movement of objects is negligible.

Un-calibrated cameras: The input data is captured by an uncalibrated camera, i.e. the

camera's intrinsic parameters, such as the focal length, are unknown.

Varying intrinsic camera parameters: The camera intrinsic parameters (e.g. focal length) can vary freely. Together with the previous assumption, this adds flexibility to the

    system.

1.2 3D RECONSTRUCTION FROM VIDEO SEQUENCES

The application-oriented description of 3D reconstruction from video sequences

(shortly called 3D reconstruction) is as follows:

    1. The process starts with the data capturing step, in which a person moves around

    and captures a static scene using a hand-held camera.

2. The recorded video sequence is then pre-processed (e.g. selecting frames, removing noise, normalizing illumination).

    3. After that, the video sequence is processed to produce a 3D model of the scene.

    4. Finally, the 3D model can be rendered, or exported for editing using 3D modeling

    tools.

    Fig.1.1: Main tasks of 3D reconstruction

The 3D reconstruction (step 3) can be divided into 4 main tasks, which are as follows:

1.

    Feature detection and matching: The objective of this step is to find out the

    same features in different images and match them.

    2. Structure and motion recovery: This step recovers the structure and motion of

the scene (i.e. 3D coordinates of detected features; position, orientation and parameters of

    the camera at capturing positions).


    3. Stereo mapping: This step creates a dense matching map. In conjunction with the

structure recovered in the previous step, this enables building a dense depth map.

    4. Modeling: This step includes procedures needed to make a real model of the

    scene (e.g. building mesh models, mapping textures).

Some define the input as an image sequence, but fig.1.1 defines it as a video sequence, since our practical objective is a system that does reconstruction from video. By defining it like that, we want to clearly state that the intermediate step to go from video to image sequences (i.e. frame selection) is a part of the reconstruction process.

    1.2.1 Feature Detection and Matching

This process creates relations used by the next step, structure and motion recovery, by detecting and matching features in different images. Until now, the features

    used in structure recovery processes are points and lines. So here features are understood

    as points or lines.

    Fig.1.2: Pollefeys 3D modelling framework

Detectors: Given an image, a feature detector is a process that detects features from the image. The most important information a detector gives is the location of features, but other

    characteristics such as the scale can also be detected. Two characteristics that a good

    detector needs are repeatability and reliability. Repeatability means that the same feature can

    be detected in different images. Reliability means that the detected point should be

    distinctive enough so that the number of its matching candidates is small.
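To make the detector idea concrete, the following is a minimal corner-response sketch in the spirit of the Harris detector; the choice of detector, the function names and the parameters are illustrative assumptions, not taken from this text:

```python
import numpy as np

def box3(a):
    # 3x3 box filter via edge-padded slicing (pure NumPy)
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def corner_response(img, k=0.04):
    """Harris-style response: large and positive where the local gradient
    structure varies in two directions (a corner), near zero on flat
    regions, and negative on straight edges."""
    Iy, Ix = np.gradient(img.astype(float))
    # entries of the local second-moment matrix, averaged over a window
    Sxx, Syy, Sxy = box3(Ix * Ix), box3(Iy * Iy), box3(Ix * Iy)
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2
```

A repeatable detector would fire at roughly the same response maxima in different images of the same scene, which is exactly the property described above.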


    Classification: Point descriptors are classified into the following categories:

Distribution-based descriptors: Histograms are used to represent the characteristics of

the region. The characteristics could be pixel intensity, distance from the centre point,

relative ordering of intensity, or gradient.

Spatial-frequency descriptors: These techniques are used in the domain of texture

    classification and description. Texture description using Gabor filters is standardized in

    MPEG7.

Differential descriptors: The descriptor used to evaluate detector reliability is an

    example of a differential descriptor, in which a set of local derivatives (local jet) is used

    to describe an interest region.

Moments: Moments are used to describe a region. The central moments of a

region, in combination with the moment's order and degree, form the invariants.
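As a sketch of the first category, a distribution-based descriptor can be as simple as a normalized intensity histogram over a region; the names and the bin count below are illustrative assumptions:

```python
import numpy as np

def histogram_descriptor(patch, n_bins=16):
    # normalized histogram of pixel intensities (8-bit range assumed)
    hist, _ = np.histogram(patch, bins=n_bins, range=(0, 256))
    return hist / hist.sum()

def descriptor_distance(d1, d2):
    # L1 distance between two descriptors; smaller means a better match
    return np.abs(d1 - d2).sum()
```

Normalization makes the descriptor invariant to region size, and the distance gives the matching cost used to select candidates.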

    1.3 LINES

    Two-view projective reconstruction can only use point correspondences. But in

    three or more view structure recovery it is possible to use line correspondences.

    1.3.1 Line detection

    Line detection usually includes edge detection, followed by line extraction.

    1.3.2 Edge detection

The key to solving the problem is the intensity change, which is shown via the

    gradient of the image. Edge detectors usually follow the same routine: smoothing,

    applying edge enhancement filters, applying a threshold, and edge tracing.

Evaluations of edge detectors are inconsistent and not convergent, for reasons such

as unclear objectives and varying parameters. A series of evaluations has been carried out

on different tasks, in which the application acts as a black box to test the algorithms. One

of them is structure from motion. The evaluation shows that, overall, the Canny detector

is most suitable because of its performance, fastest speed, and low sensitivity to parameter

variation. However, the structure-from-motion algorithm used there is not a three-view

one and uses line segments rather


than lines; also, the intermediate processing (line extraction and correspondence) that

would affect the final result is fixed. Thus the result is not concrete enough.

1.3.3 Line Extraction

Extracting lines can be done in several ways. The Hough transform is famous in

curve fitting; despite its long history, the Hough transform and its extensions are still

widely used. A simpler approach connects line segments with a limit on angle changes,

and then uses the least-median-of-squares method to fit the connected paths into lines. As

with edge detection, no complete and concrete evaluation of line extraction is available.
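A minimal sketch of the Hough transform for lines, in its standard (ρ, θ) parameterization (illustrative code, not taken from this project):

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Vote in (rho, theta) space for every edge pixel.
    edges: binary 2D array; returns the accumulator, thetas and rho offset."""
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag, n_theta), dtype=int)
    ys, xs = np.nonzero(edges)
    for i, t in enumerate(thetas):
        # rho = x*cos(theta) + y*sin(theta), shifted so indices are non-negative
        rhos = np.round(xs * np.cos(t) + ys * np.sin(t)).astype(int) + diag
        np.add.at(acc, (rhos, np.full_like(rhos, i)), 1)
    return acc, thetas, diag
```

Peaks in the accumulator correspond to lines x·cos θ + y·sin θ = ρ supported by many edge pixels.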

    1.3.4 Line matching

Lines can be matched based on attributes such as orientation, length, or extent of

overlap. Some matching strategies, such as nearest line or additional-view verification,

can be used to increase the speed and accuracy. Optical flow can be employed in the case

of a short baseline. Matching groups of lines (graph matching) is more accurate than

individual matching. Beardsley et al. use the geometric constraints in both the two-view and

three-view cases to match lines. The constraints are found by a robust method with

corresponding points.

Lines are highly structured features and generally give stronger constraints. Lines are

numerous and easy to extract in scenes with dominant artificial objects, e.g. urban

architecture. However, the fact that evaluations of line extraction and matching for

structure recovery are not complete and concrete is probably the reason why the theory of

three-view reconstruction with lines has been available for a long time, yet methods in

structure recovery usually use point correspondences. One of the few works that uses line

correspondences and trifocal tensors is that of Beardsley, but lines are not used directly:

point correspondences are used first to recover the geometry information.

    1.4 STRUCTURE AND MOTION RECOVERY

The second task, structure and motion recovery, recovers the structure of the scene and the

motion information of the camera. The motion information is the position, orientation,


    and intrinsic parameters of the camera at the captured views. The structure information is

    captured by the 3D coordinates of features.

    Given feature correspondences, the geometric constraints among views can be

established. The projection matrices that represent the motion information can then be recovered. Finally, the 3D coordinates of features, i.e. the structure information, can be computed

    via triangulation.
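The triangulation step can be sketched with the standard linear (DLT) method; the code below is an illustrative sketch, not the system's actual implementation:

```python
import numpy as np

def triangulate(points2d, projections):
    """Linear (DLT) triangulation of one 3D point from two or more views.
    points2d: list of (x, y) projections; projections: list of 3x4 matrices."""
    A = []
    for (x, y), P in zip(points2d, projections):
        # each view contributes two linear constraints on the homogeneous point
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]              # null-space vector = homogeneous 3D point
    return X[:3] / X[3]     # dehomogenize
```

With noise-free correspondences and known projection matrices this recovers the point exactly; with noisy data it gives an algebraic least-squares solution.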

    Fig.1.5: Structure and motion recovery process

    1.5 ADVANTAGES AND PROBLEMS OF USING VIDEO

    SEQUENCES

    It is possible to do 3D reconstruction from images. But in practice, it is more

    natural to use video sequences since it eases the capturing process and provides more

complete data. But problems also arise. The following describes the advantages and the

    problems of using video sequences as input.

1.5.1 Advantages

The most important advantage of using video sequences as input is the higher

    quality one can obtain. Both geometric accuracy and visual quality can be improved by

    exploiting the redundancy of data. Intuitively, more back-projecting rays of a point's

projections limit the possible 3D coordinates of the point. The best texture, found by

    selecting the best view or super-resolution can be used to get better visualization quality.


    Image sequences also enable some techniques to deal with shadow, shading and

    highlights.

Other advantages are automaticity and flexibility. Capturing data with a

handheld camera is more comfortable, since a person does not have to worry about

missing information or consider whether the captured information is enough for

reconstruction. And regarding processing time, instead of manually selecting some images

from a video, it is better to have a system that can do everything automatically.

    1.5.2 Problems

    To take advantage of the use of video sequences we have to deal with some

    problems, ranging from pre-processing (frame selection, sequence segmentation), during

    processing as has been seen in previous sub-sections, to post-processing (bundle

    adjustment, structure fusion).

    Frame selection: Among a number of frames, selecting good frames will improve the

    reconstruction result. Good frames are ones that have proper geometric attributes and

    good photometric quality. The problem is related to the estimation of views' position and

    orientation and photometric quality evaluation.

Sequence segmentation: Reconstruction algorithms assume that a sequence is

continuously captured. The sequence should be broken into proper scene parts, which are

reconstructed separately and fused later.

    Structure fusion: Results of processing different video segments (generated either by

    different captures or by segmentation) must be fused together to create a final unique

    result.

    Bundle adjustment: The reconstruction process includes local updates (e.g. feature

matching, structure update) and bias assumptions (e.g. use of the first view's coordinate

system). These lead to inconsistency and accumulated errors in the global result. There should be a global optimization step to produce a unique, consistent result.


    1.6 CRITICAL CASES

A critical case happens when it is impossible to make a metric reconstruction

from the input data, either because of the characteristics of the scene or because of the

capturing positions.

    In practice, metric reconstruction from video sequences captured by a person

    using a hand-held camera hardly falls into an absolute critical case. However, nearly

    critical cases are common in practice, e.g. a camera moving along a wall or on an elliptic

    orbit around the object. That is why studying critical configurations and detecting those

    cases is extremely important to create a robust reconstruction method, or select the most

    suitable method for the case.

There are two kinds of critical cases: (i) critical surfaces or critical configurations

and (ii) critical motion sequences (of the camera). The first class depends on the observed

points. The latter depends only on the camera motion, i.e. it can happen with any scene. A

"brute force" approach can be used to select the best algorithm. This, however, only helps

to reject the case, not to choose the proper method for it. Some important notes about

critical cases are:

Normal cases under some conditions, e.g. calibrated or fixed intrinsic parameters, can

turn into critical ones when the conditions change. The more uncalibrated the camera, i.e.

the fewer camera parameters are known, the more ambiguous the reconstruction will be.

    1.7 IMAGE-BASED 3D RECONSTRUCTION

    Image-based 3D reconstruction is an active field of research in Photogrammetric

    and Computer Vision. The need for detailed 3D models for mapping and navigation,

    inspection, cultural heritage conservation or photorealistic image-based rendering for the

    entertainment industry lead to the development of several techniques to recover the shape

    of objects. To achieve precise and high detailed reconstructions is often employed

    providing 2.5D range images and the respective 3D point cloud in a metric scale.

    On the other hand, laser-based methods are complex to handle for large scale

    outdoor scenes, especially for aerial data acquisition. In contrast to that, passive image-


based methods that utilize multiple overlapping views are easily deployable and low

cost in comparison, but require some post-processing effort to derive depth information. In

this work we investigate how redundancy and baseline influence the depth accuracy of

multiple-view matching methods.

In particular, we perform synthetic experiments on a typical aerial camera

network that corresponds to a 2D flight pattern with 80% forward-overlap and 60% side-lap, as

shown in fig.1.6. By covariance analysis of the triangulated scene points, the theoretical

bound of the depth accuracy is determined according to the triangulation angle and the

number of measurements (i.e. the redundancy).

    One of the main findings is that true multi-view matching/triangulation

    outperforms two-view fused stereo results by at least one order of magnitude in terms of

depth accuracy. Furthermore, we present a fast, accurate and robust matching and

reconstruction technique, suitable for high-resolution images of large-scale scenes, that is

able to compete by leveraging the redundancy of many views. The solution to multi-

view reconstruction is based on pair-wise stereo, employing efficient and robust optical

flow that is restricted to the epipolar geometry. Unlike standard aerial

    Fig.1.6(a) :The view network, a sparse reconstruction and uncertainties

    (magnified by 1000 for better visibility) for selected 3D points on a regularly

    sampled grid on the ground plane


    Fig.1.6(b) : Reconstructed dense point cloud from our multi-view method of an

    urban scene.

matching approaches that rely on 2.5D data fusion of pair-wise stereo depth maps, our

correspondence chaining (i.e. measurement linking) and triangulation approach

takes full advantage of the achievable baseline (i.e. triangulation angles). In contrast to

voxel-based approaches, polygonal meshes and local patches, we focus on algorithms

representing geometry as a set of depth maps. This eliminates the need for resampling the

geometry in the three-dimensional domain and can be easily parallelized. We evaluate the

approach on a multi-view benchmark data set that provides accurate ground truth and on

large-scale aerial images.

    1.8 UNCERTAINTY OF SCENE POINTS

    The depth uncertainty of a rectified stereo pair can be directly determined from

    the disparity error

Δz = (z² / (f · b)) · Δd ... (1)

    where z is the point depth, f the focal length and b the image baseline. Hence the depth

    precision is mainly a function of the ray intersection angle. In contrast, for multi view

    image matching and triangulation , the redundancy not only implies more measurements

    but additionally constrains the 3D point location through multiple ray intersections. These

    entities are not independent but are coupled, since they rely on the network geometric


    configuration that determines image overlap (i.e. redundancy) and baseline,

    simultaneously. Given a photogrammetric network of cameras and correspondences with

    known error distribution, the precision of triangulated points can be determined from the

3D confidence ellipsoid (i.e. covariance matrix CX), as shown in fig. 1.7. An empirical estimate of the covariance ellipsoid corresponding to multi-view triangulation can be

    computed by statistical simulation. For the moment we assume that camera orientations

    and 3D structure are fixed and known.
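The two-view relation of equation (1) can be checked with a few lines of code (a sketch; the symbols follow the equation above):

```python
def depth_uncertainty(z, f, b, sigma_d):
    """Propagate a disparity error to depth. From z = f*b/d it follows that
    |dz/dd| = f*b/d**2 = z**2/(f*b), hence sigma_z = z**2/(f*b) * sigma_d."""
    return z ** 2 / (f * b) * sigma_d
```

The quadratic dependence on z is the key point: doubling the depth quadruples the depth error, while doubling the baseline or the focal length halves it.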

    The cameras are distributed along a 2D grid (corresponding to flight paths) in

order to achieve an 80% forward overlap and 60% side-lap, as shown in fig.1.7. According to

a large-format digital aerial camera (e.g. UltraCam D from Microsoft), the image

resolution is set to 7500 × 11500 pixels with a field of view of 54°. Furthermore, 3D

points are evenly distributed on a 2D plane that corresponds to the bald earth surface

observed from a flying height of 900 m. Therefore, an average Ground Sampling Distance

(GSD) of 8 cm/pixel is achieved.

Given the cameras P_i, i = 1..N (i.e. calibrations and poses) and 3D points X_j, j = 1..M,

the respective ground-truth projections are produced as x_ij = P_i X_j. Therefore, for every 3D point a

set of point-tracks (i.e. 2D measurements) is generated, m = (<x_1, y_1>, <x_2, y_2>, ..., <x_k, y_k>). Next, the 2D

projections are perturbed by zero-mean Gaussian isotropic noise, x̂ = x + N(0, Σ), with

Σ = diag(σ², σ²) ... (2)

and standard deviation σ = 1 pixel (i.e. ≈ 8 cm GSD). Given the set of perturbed point-tracks

m̂ = (<x̂_1, ŷ_1>, <x̂_2, ŷ_2>, ..., <x̂_k, ŷ_k>) and the ground-truth projection

matrices P_i, i = 1..N, the 3D position of the respective point in space is determined. This

process requires the intersection of at least two known rays in space. Hence, we use a

linear triangulation method to determine the 3D position of a point track. This method

generalizes easily to the intersection of multiple rays, providing a least-squares solution.

Optionally, a non-linear optimizer based on the Levenberg-Marquardt algorithm is used to


refine the 3D point by minimizing the reprojection error. Through Monte Carlo simulation

on the perturbed measurement vectors m̂ we obtain a distribution of 3D points X_i

around a mean position X̄. From the Law of Large Numbers it follows that, for a large

number N of simulations, one can approximate the mean 3D position by

X̄ ≈ (1/N) Σ_i X_i ... (3)

and its respective covariance matrix by

C_X = E_N[(X_i − E_N[X_i]) (X_i − E_N[X_i])ᵀ] ... (4)

Using the singular value decomposition, the covariance matrix can then be diagonalized,

C_X = U diag(σ₁², σ₂², σ₃²) Vᵀ ... (5)

where the columns of U represent the main axes of the covariance ellipsoid and the σ_i are the

respective standard deviations. The decomposition of the covariance matrix in equation

(5) into its main axes directly relates to the uncertainty in the x-y and z directions. Under

the assumption of fronto-parallel image acquisition, the largest singular value σ₁

corresponds to the uncertainty in depth, and σ₂ and σ₃ to the uncertainty in the x and y directions,

respectively.
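The simulation of equations (2)-(5) can be sketched as follows; this is an illustrative sketch with a toy two-camera setup, reusing a standard linear triangulation, and all names and parameters are assumptions:

```python
import numpy as np

def triangulate(points2d, projections):
    # standard linear (DLT) triangulation from multiple views
    A = []
    for (x, y), P in zip(points2d, projections):
        A.append(x * P[2] - P[0])
        A.append(y * P[2] - P[1])
    X = np.linalg.svd(np.asarray(A))[2][-1]
    return X[:3] / X[3]

def mc_point_covariance(point3d, projections, sigma=1.0, n_sim=2000, seed=0):
    """Empirical mean (eq. 3), covariance (eq. 4) and axis standard
    deviations (eq. 5) of a triangulated point under zero-mean isotropic
    Gaussian image noise (eq. 2)."""
    rng = np.random.default_rng(seed)
    Xh = np.append(np.asarray(point3d, float), 1.0)
    # ground-truth projections x_ij = P_i X
    clean = [(P @ Xh)[:2] / (P @ Xh)[2] for P in projections]
    samples = np.array([
        triangulate([x + rng.normal(0.0, sigma, 2) for x in clean], projections)
        for _ in range(n_sim)
    ])
    mean = samples.mean(axis=0)                               # eq. (3)
    C = np.cov(samples.T)                                     # eq. (4)
    sigmas = np.sqrt(np.linalg.svd(C, compute_uv=False))      # eq. (5)
    return mean, C, sigmas
```

The largest of the returned standard deviations plays the role of σ₁ above, i.e. the depth uncertainty for a roughly fronto-parallel configuration.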

    1.9 LITERATURE SURVEY

With the advent of the multimedia age and the spread of the Internet, video storage on

CD/DVD and video transmission have been gaining a lot of popularity. The ISO Moving Picture

Experts Group (MPEG) video coding standards pertain to compressed video

storage on physical media like CD/DVD, whereas the International Telecommunication

Union (ITU) standards address real-time point-to-point or multi-point communications over a

network. The former has the advantage of having higher bandwidth for data transmission.

    In either standard the basic flow of the entire compression decompression process

    is largely the same and is shown in fig. 1.7. The encoding side estimates the motion in the

    current frame with respect to a previous frame. A motion compensated image for the

    current frame is then created that is built of blocks of image from the previous frame. The


motion vectors for the blocks used in motion estimation are transmitted, as well as the

difference of the compensated image with the current frame, which is also JPEG encoded and

sent. The encoded image that is sent is then decoded at the encoder and used as a

reference frame for the subsequent frames. The decoder reverses the process and creates a

full frame. The whole idea behind motion-estimation-based video compression

is to save on bits by sending JPEG-encoded difference images, which inherently have less

energy and can be highly compressed, as compared to sending a full frame that is JPEG

encoded. Motion JPEG, where all frames are JPEG encoded, achieves anything between

a 10:1 and 15:1 compression ratio, whereas MPEG can achieve a compression ratio of 30:1

and is also usable at a 100:1 ratio. It should be noted that the first frame is always sent in full,

and so are some other frames that may occur at some regular interval (like every 6th

frame). The standards do not specify this, and it might change with every video being

sent, based on the dynamics of the video.

    The most computationally expensive and resource hungry operation in the entire

    compression process is motion estimation. Hence, this field has seen the highest activity

    and research interest in the past two decades. The algorithms that have been implemented

    are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search (NTSS),

    Simple and Efficient TSS (SES), Four Step Search (4SS), Diamond Search (DS), and

    Adaptive Rood Pattern Search (ARPS).


    Fig.1.7: MPEG / H.26x video compression process flow


    1.10 BLOCK MATCHING ALGORITHMS

The underlying supposition behind motion estimation is that the patterns

corresponding to objects and background in a frame of a video sequence move within the

frame to form corresponding objects in the subsequent frame. The idea behind block

matching is to divide the current frame into a matrix of macro blocks, which are then

compared with the corresponding block and its adjacent neighbours in the previous frame to

create a vector that stipulates the movement of a macro block from one location to


another.

MSE = (1/N²) Σᵢ Σⱼ (Cᵢⱼ − Rᵢⱼ)² ... (7)

where N is the side of the macro block, the sums run over i, j = 0 .. N−1, and Cᵢⱼ and Rᵢⱼ are the pixels being compared in the current macro block and the reference macro block, respectively. The Peak Signal-to-Noise Ratio (PSNR), given below, characterizes the motion-compensated image that is created by using the motion vectors and macro blocks from the reference frame:

PSNR = 10 log₁₀ [(peak pixel value)² / MSE]

(a) Exhaustive Search (ES)

    This algorithm, also known as Full Search, is the most computationally expensive

    block matching algorithm of all. This algorithm calculates the cost function at each

possible location in the search window. As a result, it finds the best possible

match and gives the highest PSNR amongst all block matching algorithms. Fast block

    matching algorithms try to achieve the same PSNR doing as little computation as

    possible. The disadvantage to ES is that the larger the search window gets the more

    computations it requires.
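A sketch of exhaustive search with a Mean Absolute Difference (MAD) cost; the block size, names and parameters below are illustrative assumptions:

```python
import numpy as np

def mad(a, b):
    # Mean Absolute Difference between two macro blocks
    return np.abs(a.astype(float) - b.astype(float)).mean()

def exhaustive_search(cur, ref, bx, by, n=16, p=7):
    """Full search: evaluate every displacement in [-p, p] x [-p, p]
    around the block at (bx, by) and keep the least-cost one."""
    block = cur[by:by + n, bx:bx + n]
    best_cost, best_mv = np.inf, (0, 0)
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            c = mad(block, ref[y:y + n, x:x + n])
            if c < best_cost:
                best_cost, best_mv = c, (dx, dy)
    return best_mv, best_cost
```

For p = 7 this evaluates up to (2p+1)² = 225 candidate blocks, which is exactly the cost that the fast algorithms below try to avoid.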

    (b)Three Step Search (TSS)

The general idea is represented in fig.1.9. It starts with the search location at the

centre and sets the step size S = 4, for a usual search parameter value of p = 7. It then

searches at eight locations ±S pixels around location (0,0). From these nine locations

searched so far, it picks the one giving the least cost and makes it the new search origin. It

then sets the new step size S = S/2, and repeats a similar search two more times until S = 1. At

that point it finds the location with the least cost function, and the macro block at that

location is the best match. The calculated motion vector is then saved for transmission. It

gives a flat reduction in computation by a factor of 9. So, for p = 7, ES will compute the

cost for 225 macro blocks whereas TSS computes the cost for only 25 macro blocks. The idea

behind TSS is that the error surface due to motion in every macro block is unimodal. A

unimodal surface is a bowl-shaped surface such that the weights generated by the cost

function increase monotonically from the global minimum.
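The three steps described above can be sketched as follows; this is an illustrative sketch, and it assumes a smooth (unimodal) error surface, since that is the premise TSS relies on:

```python
import numpy as np

def three_step_search(cur, ref, bx, by, n=16):
    """TSS: evaluate 9 points at step S = 4, move the origin to the
    least-cost point, halve S, and repeat until S = 1."""
    block = cur[by:by + n, bx:bx + n].astype(float)

    def cost(dx, dy):
        # MAD cost of the candidate block displaced by (dx, dy)
        y, x = by + dy, bx + dx
        if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
            return np.inf
        return np.abs(block - ref[y:y + n, x:x + n]).mean()

    cx = cy = 0
    step = 4
    while step >= 1:
        cands = [(cx + dx, cy + dy)
                 for dy in (-step, 0, step) for dx in (-step, 0, step)]
        costs = [cost(dx, dy) for dx, dy in cands]
        cx, cy = cands[int(np.argmin(costs))]
        step //= 2
    return (cx, cy)
```

Each pass evaluates at most 9 candidates (the centre is re-evaluated here for simplicity), which is where the factor-of-9 reduction over exhaustive search comes from.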


    Fig.1.9: Three Step Search procedure. The motion vector is (5, -3).

    (c)New Three Step Search (NTSS)

    NTSS improves on TSS results by providing a centre biased searching scheme

    and having provisions for half way stop to reduce computational cost. It was one of the

    first widely accepted fast algorithms and frequently used for implementing earlier

standards like MPEG 1 and H.261. The TSS uses a uniformly allocated checking pattern

    for motion detection and is prone to missing small motions. The NTSS process is

    illustrated graphically in fig.1.10. In the first step 16 points are checked in addition to the

    search origin for lowest weight using a cost function. Of these additional search

locations, 8 are at a distance of S = 4 away (similar to TSS) and the other 8 are at S = 1 away from the search origin. If the lowest cost is at the origin, then the search is stopped

    right here and the motion vector is set as (0, 0). If the lowest weight is at any one of the 8

    locations at S = 1, then we change the origin of the search to that point and check for

    weights adjacent to it.


    Fig.1.10: New Three Step Search block matching.

    Big circles are checking points in the first step of TSS and the squares are the

extra 8 points added in the first step of NTSS. Triangles and diamonds are the second step of

NTSS, showing 3 points and 5 points being checked when the least weight in the first step is at

one of the 8 neighbours of the window centre.


    Fig.1.11: Search patterns corresponding to each selected quadrant: (a) Shows all

    quadrants (b) quadrant I is selected (c) quadrant II is selected (d) quadrant III is

    selected (e) quadrant IV is selected.

Depending on which point it is, we end up checking 5 points or 3 points (Fig 1.11(b)

    & (c)). The location that gives the lowest weight is the closest match and motion vector is

    set to that location. On the other hand if the lowest weight after the first step was one of

the 8 locations at S = 4, then we follow the normal TSS procedure. Hence, although this

    process might need a minimum of 17 points to check every macro block, it also has the

    worst-case scenario of 33 locations to check.
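The first NTSS step, with its 17 checking points and halfway-stop logic, can be sketched as follows (a minimal Python illustration; the function and parameter names are our own, and a 16x16 macro block with MAD as the cost function is assumed):

```python
import numpy as np

def mad(cur, ref, x, y, dx, dy, N=16):
    """Mean Absolute Difference between the current macro block at (x, y)
    and the candidate block displaced by (dx, dy) in the reference frame."""
    a = cur[y:y + N, x:x + N].astype(np.float64)
    b = ref[y + dy:y + dy + N, x + dx:x + dx + N].astype(np.float64)
    return np.abs(a - b).mean()

def ntss_first_step(cur, ref, x, y, S=4):
    """First NTSS step: the search origin, the 8 points at distance S
    (as in TSS) and the 8 extra points at distance 1 -- 17 checks in all."""
    offsets = [(0, 0)]
    for step in (S, 1):
        offsets += [(dx, dy) for dx in (-step, 0, step)
                    for dy in (-step, 0, step) if (dx, dy) != (0, 0)]
    costs = {o: mad(cur, ref, x, y, *o) for o in offsets}
    best = min(costs, key=costs.get)
    if best == (0, 0):
        return best, 'stop'                # halfway stop: motion vector (0, 0)
    if max(abs(best[0]), abs(best[1])) == 1:
        return best, 'check-neighbours'    # examine the 3 or 5 adjacent points
    return best, 'continue-tss'            # fall back to the normal TSS steps
```

The returned action tells the caller whether to stop, refine around the inner ring, or continue with the ordinary TSS procedure.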

    (d)Simple and Efficient Search (SES)

    SES is another extension to TSS and exploits the assumption of unimodal error

    surface. The main idea behind the algorithm is that for a unimodal surface there cannot be

two minima in opposite directions, and hence the 8 point fixed pattern search of TSS

    can be changed to incorporate this and save on computations. The algorithm still has

three steps like TSS, but the innovation is that each step has two further phases. The search area is divided into four quadrants and

    the algorithm checks three locations A, B and C as shown in fig... A is at the origin and B

    and C are S = 4 locations away from A in orthogonal directions. Depending on certain

weight distribution amongst the three, the second phase selects a few additional points, as shown in fig. 1.11. The rules for determining the search quadrant for the second phase are as

    follows:

If MAD(A) ≥ MAD(B) and MAD(A) ≥ MAD(C), select (b);

If MAD(A) ≥ MAD(B) and MAD(A) < MAD(C), select (c);

If MAD(A) < MAD(B) and MAD(A) < MAD(C), select (d);

If MAD(A) < MAD(B) and MAD(A) ≥ MAD(C), select (e).
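A compact Python sketch of the quadrant selection (the function name is our own; the ≥/< comparisons follow the standard SES formulation, since the operators are partly garbled in the source text):

```python
def ses_select_pattern(mad_a, mad_b, mad_c):
    """Select the second-phase search pattern from the MAD values at A (the
    origin), B (horizontal, S away) and C (vertical, S away).  The returned
    letter names the corresponding pattern of fig. 1.11."""
    if mad_a >= mad_b and mad_a >= mad_c:
        return 'b'          # quadrant I
    if mad_a >= mad_b and mad_a < mad_c:
        return 'c'          # quadrant II
    if mad_a < mad_b and mad_a < mad_c:
        return 'd'          # quadrant III
    return 'e'              # quadrant IV: MAD(A) < MAD(B), MAD(A) >= MAD(C)
```

The four branches are mutually exclusive and exhaustive, so exactly one pattern is chosen for any weight distribution.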

Once the points to check in the second phase have been selected, we find the location with

    the lowest weight and set it as the origin. We then change the step size similar to TSS and

repeat the above SES procedure until we reach S = 1. The location with the lowest

    weight is then noted down in terms of motion vectors and transmitted. An example

process is illustrated in fig. 1.12.

    Fig.1.12: The SES procedure. The motion vector is (3, 7) in this example.

    Although this algorithm saves a lot on computation as compared to TSS, it was

    not widely accepted for two reasons. Firstly, in reality the error surfaces are not strictly

    unimodal and hence the PSNR achieved is poor compared to TSS. Secondly, there was

another algorithm, Four Step Search, published a year earlier, that presented lower computational cost than TSS and gave significantly better PSNR.

    (e)Four Step Search (4SS)

    Similar to NTSS, 4SS also employs center biased searching and has a halfway

    stop provision. 4SS sets a fixed pattern size of S = 2 for the first step, no matter what the

search parameter p value is. Thus it looks at 9 locations in a 5x5 window. If the least weight is found at the centre of the search window, the algorithm jumps to the fourth step. If the least weight is at one of the eight locations except the centre, then we make it the search origin and move to the second step. The search window is still maintained as 5x5 pixels wide. Depending on where the least weight location was, we might end up checking weights at 3 locations or 5 locations. The

    patterns are shown in Fig 1.13

Fig.1.13: Search patterns of the FSS. (a) First step (b) Second/Third step (c) Second/Third step (d) Fourth step.

    Once again if the least weight location is at the center of the 5x5 search window

we jump to the fourth step or else we move on to the third step. The third step is exactly the same as the second step. In the fourth step the window size is dropped to 3x3, i.e. S = 1. The

location with the least weight is the best matching macro block, and the motion vector is set to point to that location. A sample procedure is shown in Fig 1.14. This search algorithm has a best case of 17 checking points and a worst case of 27 checking points.
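The four steps can be condensed into the following Python sketch (a simplified illustration assuming a 16x16 macro block and MAD as the weight; unlike the real 4SS it re-evaluates all 9 window points at every step instead of only the 3 or 5 new ones, so the checking-point counts above do not apply to it):

```python
import numpy as np

def mad(cur, ref, x, y, dx, dy, N=16):
    """Mean Absolute Difference for the candidate displacement (dx, dy)."""
    a = cur[y:y + N, x:x + N].astype(np.float64)
    b = ref[y + dy:y + dy + N, x + dx:x + dx + N].astype(np.float64)
    return np.abs(a - b).mean()

def four_step_search(cur, ref, x, y):
    """4SS sketch: up to three 5x5 (S = 2) steps, then one 3x3 (S = 1) step."""
    cx, cy = 0, 0                               # current search origin
    for _ in range(3):                          # first to third step, S = 2
        cands = [(cx + dx, cy + dy) for dx in (-2, 0, 2) for dy in (-2, 0, 2)]
        best = min(cands, key=lambda v: mad(cur, ref, x, y, *v))
        if best == (cx, cy):                    # least weight at the centre:
            break                               # jump straight to step four
        cx, cy = best
    cands = [(cx + dx, cy + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]
    return min(cands, key=lambda v: mad(cur, ref, x, y, *v))   # fourth step
```

On a smoothly varying frame the 5x5 steps home in on the neighbourhood of the true displacement and the final 3x3 step pins down the motion vector.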

    (f)Diamond Search (DS)

    DS algorithm is exactly the same as 4SS, but the search point pattern is changed

    from a square to a diamond, and there is no limit on the number of steps that the

algorithm can take. DS uses two different types of fixed patterns: one is the Large Diamond Search Pattern (LDSP) and the other is the Small Diamond Search Pattern (SDSP). These two patterns and the DS procedure are illustrated in Fig. 1.14. Just like in FSS, the first step uses LDSP, and if the least weight is at the center location we jump to the final step.

The subsequent steps, except the last one, are also similar and use LDSP, but the number of points where the cost function is checked is either 3 or 5, as illustrated in the second and third steps of the procedure shown in Fig. 1.14.

Fig. 1.14: Diamond Search procedure.

    This figure shows the large diamond search pattern and the small diamond search

pattern. It also shows an example path to the motion vector (-4, -2) in five search steps: four applications of LDSP and one of SDSP.

    The last step uses SDSP around the new search origin and the location with the

least weight is the best match. As the search pattern is neither too small nor too big, and there is no limit to the number of steps, this algorithm can find the global minimum very accurately. The end result is a PSNR close to that of ES, while the computational expense is significantly lower.
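A compact sketch of the procedure (Python; a 16x16 macro block and MAD as the weight are assumed, and for brevity every pattern point is re-evaluated at each step rather than only the new ones):

```python
import numpy as np

LDSP = [(0, 0), (0, -2), (0, 2), (-2, 0), (2, 0),
        (-1, -1), (-1, 1), (1, -1), (1, 1)]          # large diamond, 9 points
SDSP = [(0, 0), (0, -1), (0, 1), (-1, 0), (1, 0)]    # small diamond, 5 points

def mad(cur, ref, x, y, dx, dy, N=16):
    """Mean Absolute Difference for the candidate displacement (dx, dy)."""
    a = cur[y:y + N, x:x + N].astype(np.float64)
    b = ref[y + dy:y + dy + N, x + dx:x + dx + N].astype(np.float64)
    return np.abs(a - b).mean()

def diamond_search(cur, ref, x, y, max_steps=100):
    """Repeat LDSP, re-centred on the best point, until the minimum stays at
    the centre; then a single SDSP pass refines the motion vector."""
    cx, cy = 0, 0
    for _ in range(max_steps):                       # DS itself sets no limit
        best = min([(cx + dx, cy + dy) for dx, dy in LDSP],
                   key=lambda v: mad(cur, ref, x, y, *v))
        if best == (cx, cy):
            break
        cx, cy = best
    return min([(cx + dx, cy + dy) for dx, dy in SDSP],
               key=lambda v: mad(cur, ref, x, y, *v))
```

Because the centre is listed first in each pattern, ties favour stopping, and the cost strictly decreases on every move, so the loop always terminates.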

(g)Adaptive Rood Pattern Search (ARPS)

    ARPS algorithm makes use of the fact that the general motion in a frame is

usually coherent, i.e. if the macro blocks around the current macro block moved in a particular direction, then there is a high probability that the current macro block will also have a similar motion vector. This algorithm uses the motion vector of the macro block to its immediate left to predict its own motion vector. An example is shown in fig. 1.15.

Fig.1.15: Adaptive Rood Pattern: The predicted motion vector is (3, -2), and the step

    size S = Max (|3|, |-2|) = 3.

    The predicted motion vector points to (3, -2). In addition to checking the location

pointed to by the predicted motion vector, it also checks points distributed in a rood pattern,

    as shown in fig.1.15, where they are at a step size of S = Max (|X|, |Y|). X and Y are the

x-coordinate and y-coordinate of the predicted motion vector. This rood pattern search is always the first step. It directly puts the search in an area where there is a high probability

    of finding a good matching block.

    The point that has the least weight becomes the origin for subsequent search steps,

    and the search pattern is changed to SDSP. The procedure keeps on doing SDSP until

the least weighted point is found to be at the center of the SDSP. A further small

    improvement in the algorithm can be to check for Zero Motion Prejudgment, using which

the search is stopped halfway if the least weighted point is already at the center of the rood pattern.
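The first ARPS step can be sketched as follows (Python; the helper name is our own). It returns the four rood-arm points plus the predicted location, skipping a duplicate when the prediction lands on a rood arm, and applying the zero-motion prejudgment:

```python
def arps_first_step(pred_mv):
    """Checking points for the first ARPS step, given the predicted motion
    vector (X, Y) of the macro block to the immediate left.  The rood arms
    use step size S = max(|X|, |Y|)."""
    px, py = pred_mv
    s = max(abs(px), abs(py))
    if s == 0:
        return [(0, 0)]              # zero-motion prejudgment: go straight to SDSP
    points = [(0, 0), (s, 0), (-s, 0), (0, s), (0, -s)]
    if (px, py) not in points:       # avoid double computation at a rood point
        points.append((px, py))
    return points
```

For the predicted vector (3, -2) of fig. 1.15 this yields S = 3 and six checking points; SDSP then takes over from whichever point has the least weight.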

Fig. 1.16: Search points per macro block while computing the PSNR performance of Fast Block Matching Algorithms.

The main advantage of this algorithm over DS is that if the predicted motion vector is (0, 0), it does not waste computational time doing LDSP; rather, it directly starts using

    SDSP. Furthermore, if the predicted motion vector is far away from the center, then again

ARPS saves on computations by directly jumping to that vicinity and using SDSP,

    whereas DS takes its time doing LDSP.

    Care has to be taken to not repeat the computations at points that were checked

earlier. Care also needs to be taken when the predicted motion vector turns out to match one of the rood pattern locations, so we have to avoid double computation at that point. For macro blocks in the first column of the frame, the rood pattern step size is fixed at 2 pixels.

    Fig.1.17: PSNR performance of Fast Block Matching Algorithms.

    1.11 THESIS OUTLINE

Chapter 1 dealt with the general concept of 3D reconstruction and with the block matching methods used in reconstructing a 3D image from 2D images. The methods covered are Exhaustive Search (ES), Three Step Search (TSS), New Three Step Search (NTSS), Simple and Efficient Search (SES), Four Step Search (4SS), Diamond Search (DS) and Adaptive Rood Pattern Search (ARPS).

    Chapter 2 deals with the stereo vision algorithms. Chapter 3 deals with the

    implementation of algorithms. Chapter 4 deals with the simulation results. Chapter 5

    describes the conclusion.

    CHAPTER 2

    STEREO VISION ALGORITHMS

    2.1 INTRODUCTION TO STEREO VISION

The stereo correspondence problem has historically been, and continues to be, one of the most investigated topics in computer vision, and a large body of literature on it has been published. The correspondence problem in computer vision concerns the

    matching of points, or other kinds of primitives, in two or more images such that the

    matched elements are the projections of the same physical elements in 3D scene, and the

    resulting displacement of a projected point in one image with respect to the other is

termed the disparity. Similarity is the guiding principle for solving the correspondence problem; however, since the stereo correspondence problem is an ill-posed task, in order to make it tractable it is usually necessary to exploit some additional information or constraints.

    The most popular constraint is the epipolar constraint, which can reduce the

search to one dimension rather than two. Other constraints commonly used are the disparity uniqueness constraint and the continuity constraint.

The origin of the word stereo is the Greek word stereos, which means firm or solid; with stereo vision, objects are seen as solid in three dimensions, with range. In stereo vision, the same scene is captured using two sensors from two different angles. The two captured images have many similarities and a smaller number of differences. In human vision, the brain combines the two captured images by matching the similarities and integrating the differences to obtain a three-dimensional model of the seen objects.

In machine vision, the three-dimensional model of the captured objects is obtained by finding the similarities between the stereo images and using projective geometry to process these matches. The main difficulty of reconstruction using stereo is finding matching correspondences between the stereo pair.

The latest trends in the field mainly pursue real-time execution speeds as well as decent accuracy. As indicated by this survey, the algorithms' theoretical matching cores are quite well established, leading researchers towards innovations resulting in more efficient hardware implementations.

    Detecting conjugate pairs in stereo images is a challenging research problem

    known as the correspondence problem, i.e., to find for each point in the left image, the

corresponding point in the right one. To determine these two points from a conjugate

    pair, it is necessary to measure the similarity of the points. The point to be matched

    without any ambiguity should be distinctly different from its surrounding pixels. Several

    algorithms have been proposed in order to address this problem. However, every

algorithm makes use of a matching cost function so as to establish correspondence between two pixels.

The most common ones are the absolute intensity differences (AD), the squared intensity differences (SD) and the normalized cross correlation (NCC); evaluations of the various matching costs can be found in the literature. Usually, the matching costs are aggregated over

    support regions. Those support regions, often referred to as support or aggregating

windows, could be square or rectangular, fixed-size or adaptive. The aggregation of

the aforementioned cost functions leads to the core of most of the stereo vision methods, which can be mathematically expressed as follows. For the case of the sum of absolute differences (SAD):

SAD(x, y, d) = Σ_W |I_l(x, y) - I_r(x, y - d)| ..........(1)

For the case of the sum of squared differences (SSD):

SSD(x, y, d) = Σ_W (I_l(x, y) - I_r(x, y - d))^2 ..........(2)

And for the case of the NCC:

NCC(x, y, d) = Σ_W I_l(x, y) · I_r(x, y - d) / sqrt(Σ_W I_l^2(x, y) · Σ_W I_r^2(x, y - d)) ..........(3)

where I_l and I_r are the intensity values in the left and right images, (x, y) are the pixel coordinates, d is the disparity value under consideration and W is the aggregated support

    region. The selection of the appropriate disparity value for each pixel is performed

afterwards. The simpler algorithms make use of the winner-takes-all (WTA) method of disparity selection:

D(x, y) = arg min_d SAD(x, y, d) ..........(4)

i.e., for every pixel (x, y) the disparity d with the minimum cost is selected. Equation (4) refers to the SAD method, but any other cost could be used instead. However, in many cases disparity selection is an iterative process, since each pixel's disparity depends on its neighbouring pixels' disparities. As a result, more than one iteration is needed in order to find the best set of disparities. This stage differentiates the local from the global algorithms, which will be analyzed below. An additional disparity refinement step is also frequently used.
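Equations (1) and (4) combine into the classic local pipeline: aggregate SAD over a square window, then pick the winning disparity per pixel. A minimal Python sketch (the function name and the simple box-filter aggregation are our own choices):

```python
import numpy as np

def disparity_map_sad(left, right, max_disp, w=3):
    """WTA disparity selection over SAD costs aggregated on a
    (2w+1) x (2w+1) support window, per equations (1) and (4)."""
    H, W = left.shape
    l, r = left.astype(np.float64), right.astype(np.float64)
    cost = np.full((max_disp + 1, H, W), np.inf)
    k = 2 * w + 1
    for d in range(max_disp + 1):
        diff = np.abs(l[:, d:] - r[:, :W - d])    # |Il(x, y) - Ir(x, y - d)|
        pad = np.pad(diff, w, mode='edge')        # box-filter aggregation
        agg = np.zeros_like(diff)
        for oy in range(k):
            for ox in range(k):
                agg += pad[oy:oy + diff.shape[0], ox:ox + diff.shape[1]]
        cost[d, :, d:] = agg
    return np.argmin(cost, axis=0)                # winner takes all, eq. (4)
```

Swapping the `diff` line for a squared difference or a normalized cross correlation turns the same skeleton into the SSD or NCC variant.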

    2.2 GOAL OF STEREO VISION

The goal of stereo vision is the recovery of the 3D structure of a scene using two or more images of the scene, each acquired from a different viewpoint in space. The images can be obtained using multiple cameras or one moving camera. The term binocular vision is used when two cameras are employed.

    Fig2.1: General setup of cameras

    2.2.1 Stereo setup and terminology

Fixation point: the point of intersection of the optical axes.

Baseline: the distance between the centres of projection.

Epipolar plane: the plane passing through the centres of projection and the point in the scene.

Epipolar line: the intersection of the epipolar plane with the image plane.

Conjugate pair: any point in the scene that is visible in both cameras will be projected to a pair of image points in the two images.

Disparity: the distance between corresponding points when the two images are superimposed.

Disparity map: the disparities of all points form the disparity map (can be displayed as an image).

    Fig2.2: Internal projection of camera in stereo vision

    Figure2.3: Two cameras in arbitrary position and orientation

    2.2.2 Triangulation - the principle underlying stereo vision

    The 3D location of any visible object point in space is restricted to the straight line

    that passes through the center of projection and the projection of the object point.

Binocular stereo vision determines the position of a point in space by finding the intersection of the two lines, each passing through a center of projection and the projection of the point in the corresponding image.
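For the common parallel-axis configuration with rectified images, this triangulation reduces to a closed form: a point seen with disparity d by cameras of focal length f (in pixels) and baseline B lies at depth Z = f · B / d. This formula is not stated explicitly above, but it follows from similar triangles; a trivial Python helper:

```python
def depth_from_disparity(d, f, B):
    """Depth Z = f * B / d for a rectified, parallel-axis stereo rig:
    f is the focal length in pixels, B the baseline, d the disparity."""
    if d <= 0:
        raise ValueError("only positive disparities give a finite depth")
    return f * B / d
```

With f = 500 px and B = 0.1 m, a disparity of 10 px gives a depth of 5 m; halving the disparity doubles the depth, which is why distant points are the hardest to range accurately.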

    Fig2.4: Positions of binocular

    2.2.3 The problems of stereo

    The correspondence problem.

    The reconstruction problem.

    2.2.4 The correspondence problem

    Finding pairs of matched points such that each point in the pair is the projection

    of the same 3D point.

    Triangulation depends crucially on the solution of the correspondence problem.

    Ambiguous correspondence between points in the two images may lead to several

    different consistent interpretations of the scene.

    Fig2.5: Correspondence problem in stereo vision

    2.2.5The reconstruction problem

    Given the corresponding points, we can compute the disparity map.

    The disparity map can be converted to a 3D map of the scene (i.e., recover the 3D

structure) if the stereo geometry is known.

    Fig2.6: Reconstruction problem in stereo vision

    2.3 STEREO CORRESPONDENCE

    The existing techniques for general two-view stereo correspondence roughly fall

into two categories: local methods and global methods. Local methods use only small neighborhoods surrounding the pixels, while global methods optimize some global

    (energy) function. Local methods, such as block matching, gradient-based optimization,

    and feature matching can be very efficient, but they are sensitive to locally ambiguous

    regions in images (e.g., occlusion regions or regions with uniform texture).

    Global methods, such as dynamic programming, intrinsic curves, graph cuts, and

    belief propagation can be less sensitive to these problems since global constraints provide

    additional support for regions difficult to match locally. However, these methods are

    more expensive in their computational cost.

    Stereo correspondence algorithms can be grouped into those producing sparse

    output and those giving a dense result. Feature based methods stem from human vision

    studies and are based on matching segments or edges between two images, thus resulting

    in a sparse output. This disadvantage, dreadful for many purposes, is counterbalanced by

    the accuracy and speed obtained. However, contemporary applications demand more and

    more dense output.

In order to categorize and evaluate them, a taxonomy has been proposed. According

    to this, dense matching algorithms are classified in local and global ones. Local methods

    trade accuracy for speed. They are also referred to as window-based methods because

    disparity computation at a given point depends only on intensity values within a finite

    support window. Global methods (energy-based) on the other hand are time consuming

    but very accurate.

    Their goal is to minimize a global cost function, which combines data and

    smoothness terms, taking into account the whole image. Of course, there are many other

    methods that are not strictly included in either of these two broad classes. The issue of

stereo matching has recruited a variety of computational tools. Advanced computational intelligence techniques are not uncommon and present interesting and promising results.

    While the aforementioned categorization involves stereo matching algorithms in

    general, in practice it is valuable for software implemented algorithms only. Software

    implementations make use of general purpose personal computers (PC) and usually result

    in considerably long running times. However, this is not an option when the objective is

    the development of autonomous robotic platforms, simultaneous localization and

    mapping (SLAM) or virtual reality (VR) systems.

    Such tasks require real-time, efficient performance and demand dedicated

    hardware and consequently specially developed and optimized algorithms. Only a small

    subset of the already proposed algorithms is suitable for hardware implementation.

    Hardware implemented algorithms are characterized from their theoretical algorithm as

    well as the implementation itself. There are two broad classes of hardware

    implementations: the field-programmable gate arrays (FPGA) and the application-

specific integrated circuits (ASIC) based ones. Figure 2 depicts an ASIC chip (a) and an FPGA development board (b). Each one can execute stereo vision algorithms without the

necessity of a PC, saving volume, weight and consumed energy. However, the evolution of FPGAs has made them an appealing choice due to their small prototyping times, their flexibility and their good performance.

    2.4 STEREO MATCHING ALGORITHMS

    The issue of stereo correspondence is of great importance in the field of machine

    vision, computer vision, depth measurements and environment reconstruction as well as

    in many other aspects of production, security, defense, exploration, and entertainment.

    Calculating the distance of various points or any other primitive in a scene relative to the

    position of a camera is one of the important tasks of a computer vision system.

    The most common method for extracting depth information from intensity images

    is by means of a pair of synchronized camera-signals, acquired by a stereo rig. The point-

    by-point matching between the two images from the stereo setup derives the depth

    images, or the so called disparity maps. This matching can be done as a one dimensional

search if accurately rectified stereo pairs, in which horizontal scan lines reside on the same epipolar line, are assumed, as shown in Figure 2.7. A point P1 in one image plane may have arisen from any of the points on the line C1P1, and may appear in the alternate

    image plane at any point on the so-called epipolar line.

    Thus, the search is theoretically reduced within a scan line, since corresponding

pair points reside on the same epipolar line. The difference in the horizontal coordinates

    of these points is the disparity. The disparity map consists of all disparity values of the

image. Having extracted the disparity map, problems such as 3D reconstruction, positioning, mobile robot navigation, obstacle avoidance, etc., can be dealt with in a more

    efficient way.

    Fig 2.7: Geometry of epipolar lines, where C1 and C2 are the left and right camera

    lens centers, respectively. Point P1 in one image plane may have arisen from any of

    points in the line C1P1, and may appear in the alternate image plane at any point on

    the epipolar line E2.

As numerous methods have been proposed since then, this section aspires to review the most recent ones. Most of the results presented in the rest of this chapter are based on the commonly used image sets and tests.

    The most common image sets are presented in Figure 2.8. Table 2.1 summarizes

their size as well as the number of disparity levels. Experimental results based on these image sets are given, where available. The preferred metric adopted in this chapter, in order to depict the quality of the resulting disparity maps, is the percentage of pixels whose absolute disparity error is greater than 1 in the unoccluded areas of the image.

    This metric, considered the most representative of the results quality, was used so as to

    make comparison easier. Other metrics, like error rate and root mean square error are also

    employed.

Fig.2.8: Left image of the stereo pair (left) and ground truth (right) for the Tsukuba (a), Sawtooth (b), Map (c), Venus (d), Cones (e) and Teddy (f) stereo pairs.

    The speed with which the algorithms process input image pairs is expressed in

    frames per second (fps). This metric has of course a lot to do with the used computational

platform and the kind of implementation. Inevitably, speed results are not directly

    comparable.

                  Tsukuba  Map      Sawtooth  Venus    Cones    Teddy

Size in pixels    384x288  284x216  434x380   434x383  450x375  450x375

Disparity levels  16       30       20        20       60       60

    Table 2.1: Characteristics of the most common image sets

    2.5 DENSE DISPARITY ALGORITHMS

    Methods that produce dense disparity maps gain popularity as the computational

power grows. Moreover, contemporary applications benefit from, and consequently demand, dense depth information. Therefore, in recent years efforts towards this

    direction are being reported much more frequently than towards the direction of sparse

    results.

Dense disparity stereo matching algorithms can be divided into two general classes,

    according to the way they assign disparities to pixels. Firstly, there are algorithms that

    decide the disparity of each pixel according to the information provided by its local,

    neighboring pixels. There are, however, other algorithms which assign disparity values to

    each pixel depending on information derived from the whole image. Consequently, the

    former ones are called local methods while the latter ones global.

    2.5.1. Local methods.

Local methods are usually fast and can at the same time produce decent results. Several new methods have been presented. In Figure 2.9, a Venn diagram presents the main characteristics of the local methods presented below. Under the term color usage we have grouped the methods that take advantage of the chromatic information of the image pair. Any algorithm can process color images, but not every one can use them in a beneficial way. Furthermore, in Figure 2.9 NCC stands for the use of normalized cross correlation

    and SAD for the use of sum of absolute differences as the matching cost function.

    As expected, the use of SAD as matching cost is far more widespread than any

other. One method uses the sum of absolute differences (SAD) correlation measure for RGB color images. It achieves high speed and reasonable quality. It makes use of the left

    to right consistency and uniqueness constraints and applies a fast median filter to the

    results.

It can achieve 20 fps for a 160x120 pixels image size, making this method suitable

    for real time applications. The PC platform is Linux on a dual processor 800MHz

    Pentium III system with 512 MB of RAM.

    Fig.2.9: Diagrammatic representation of the local methods categorization.

    Another fast area-based stereo matching algorithm, which uses the SAD as error

function, has also been presented. Based on the uniqueness constraint, it rejects previous matches as soon as better ones are detected. In contrast to bidirectional matching algorithms, this one

performs only one matching phase, while achieving similar results. The results obtained

    are tested for reliability and sub-pixel refined. It produces dense disparity maps in real-

    time using an Intel Pentium III processor running at 800MHz. The algorithm achieves

a speed of 39.59 fps for 320x240 pixels and 16 disparity levels, and the root mean square error

    for the standard Tsukuba pair is 5.77%.

The objective is to achieve minimum segmentation. The experimental results

    indicate 1.77%, 0.61%, 3.00%, and 7.63% error percentages. The execution speed of the

    algorithm varies from 1 to 0.2 fps on a 2.4GHz processor.

Another method presents almost real-time performance; it makes use of a

    refined implementation of the SAD method and a left-right consistency check. The errors

    in the problematic regions are reduced using different sized correlation windows. Finally,

    a median filter is used in order to interpolate the results. The algorithm is able to process

7 fps for 320x240 pixel images and 32 disparity levels. These results are obtained using

    an Intel Pentium 4 at 2.66GHz Processor.
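The left-right consistency check used by these methods can be sketched as follows (Python; the tolerance parameter and the use of -1 as an invalid marker are our own conventions): a left-image disparity d at column x survives only if the right-image disparity at column x - d agrees with it.

```python
import numpy as np

def left_right_check(disp_l, disp_r, tol=1):
    """Invalidate left-image disparities that fail the cross-check
    |d_l(y, x) - d_r(y, x - d_l(y, x))| <= tol, marking them -1."""
    H, W = disp_l.shape
    ys, xs = np.indices((H, W))
    xr = xs - disp_l                              # matching right-image column
    inside = xr >= 0
    back = disp_r[ys, np.clip(xr, 0, W - 1)]      # clip only to index safely
    valid = inside & (np.abs(disp_l - back) <= tol)
    out = disp_l.copy()
    out[~valid] = -1
    return out
```

Pixels that fail the check typically correspond to occlusions or mismatches, which is why the surviving map is sparser but considerably more reliable.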

A window-based method for correspondence search has been presented that uses

    varying support-weights. The support-weights of the pixels in a given support window

    are adjusted based on color similarity and geometric proximity to reduce the image

    ambiguity. The difference between pixel colors is measured in the CIE Lab color space

    because the distance of two points in this space is analogous to the stimulus perceived by

the human eye. The running time for the image pair with a 35x35 pixels support window is about 0.016 fps on an AMD 2700+ processor. The error ratio is 1.29%, 0.97%, 0.99%, and 1.13% for the Tsukuba, Sawtooth, Venus, and Map image sets, respectively. These figures can be further improved

    through a left-right consistency check.
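The adaptive support-weight idea described above condenses into a single expression: each window pixel's weight decays exponentially with its color distance dc from the window centre (measured in the CIE Lab space) and its spatial distance dg. A sketch (the gamma constants are typical published choices, not taken from this text):

```python
import math

def support_weight(dc, dg, gamma_c=7.0, gamma_p=36.0):
    """w = exp(-(dc / gamma_c + dg / gamma_p)): pixels that look like the
    window centre and lie close to it contribute most to the aggregated cost."""
    return math.exp(-(dc / gamma_c + dg / gamma_p))
```

The centre pixel itself gets weight 1, and the weight falls off smoothly with both distances, which is what lets a large fixed window behave like an adaptively shaped one.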

    For given input images, specular free two-band images are generated. The

    similarity between pixels of these input-image representations can be measured using

    various correspondence search methods such as the simple SAD-based method, the

adaptive support-weights method and the dynamic programming (DP) method. This pre-processing step can be performed in real time and compensates satisfactorily for specular reflections.

Another approach employs the zero mean normalized cross correlation (ZNCC) as matching cost. This method integrates a neural network (NN) model, which uses the least-mean-square delta rule for training. The NN decides on the proper window shape and size for each support region. The results obtained are satisfactory, but the 0.024 fps running speed reported for the common image sets, on a Windows platform with a 300MHz processor, renders this method unsuitable for real-time applications.
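The ZNCC cost itself is standard and can be stated compactly. A minimal sketch for two equally sized windows:

```python
import numpy as np

def zncc(a, b, eps=1e-12):
    """Zero-mean normalized cross correlation between two windows.
    Ranges from -1 to 1, with 1 for a perfect (affine-related) match;
    eps guards against division by zero on constant windows."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps
    return float((a * b).sum() / denom)
```

Because the mean is subtracted and the result is normalized, ZNCC is invariant to gain and offset changes between the two views, which is why it is favoured over SAD when the cameras are not radiometrically matched.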

Based on the same matching cost function, a more complex area-based method has been proposed, in which a perceptual organization framework considering both binocular and monocular cues is utilized. An initial matching is performed by a combination of

    normalized cross correlation techniques. The correct matches are selected for each pixel

    using tensor voting. Matches are then grouped into smooth surfaces. Disparities for the

    unmatched pixels are assigned so as to ensure smoothness in terms of both surface


orientation and color. The percentage of unoccluded pixels whose absolute disparity error is greater than 1 is 3.79, 1.23, 9.76, and 4.38 for the image sets. The execution speed reported is about 0.002 fps for the image pair with 20 disparity levels, running on an Intel Pentium 4 processor at 2.8GHz.

    There are, of course, more hardware-oriented proposals as well. Many of them

    take advantage of the contemporary powerful graphics machines to achieve enhanced

    results in terms of processing time and data volume. A hierarchical disparity estimation

algorithm implemented on a programmable 3D graphics processing unit (GPU) has been reported; this method can process either rectified or uncalibrated image pairs. Bidirectional matching is utilized in conjunction with a locally aggregated sum of absolute intensity differences.

Moreover, an approach based on Cellular Automata (CA) presents an architecture for real-time extraction of disparity maps. It is capable of processing 1 Mpixel image pairs at more than 40 fps. The core of the algorithm relies on matching pixels of each scan-line using a one-dimensional window and the SAD matching cost. The method also involves a pre-processing mean filtering step and a post-processing CA-based filtering step.

CA are models of physical systems, where space and time are discrete and interactions are local. They can easily handle complicated boundary and initial conditions. In CA analysis, physical processes and systems are described by a cell array and a local rule, which defines the new state of a cell depending on the states of its neighbors. All cells can work in parallel, since each cell can independently update its own state. Therefore, the proposed CA algorithm is massively parallel and an ideal candidate for hardware implementation.
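A synchronous CA update of the kind described above can be sketched as follows. The majority-vote rule shown is only an illustrative example of a local rule that could smooth a disparity map; the surveyed work defines its own rule:

```python
import numpy as np

def ca_step(cells, rule):
    """One synchronous update of a 2D cellular automaton: every cell's
    new state depends only on its 3x3 neighbourhood, so in hardware all
    cells update in parallel (here sequentially, for clarity)."""
    h, w = cells.shape
    padded = np.pad(cells, 1, mode='edge')   # replicate borders
    out = np.empty_like(cells)
    for y in range(h):
        for x in range(w):
            out[y, x] = rule(padded[y:y + 3, x:x + 3])
    return out

# an illustrative local rule: majority vote over the neighbourhood,
# usable as a post-processing filter on a disparity map
majority = lambda n: int(np.median(n))
```

Because `rule` reads only the previous generation, the update order does not matter, which is exactly the property that makes CA attractive for FPGA-style parallel implementations.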

    2.5.2 Global methods

Contrary to local methods, global ones produce very accurate results. Their goal is to find the optimum disparity function d = d(x, y) that minimizes a global cost function E, which combines data and smoothness terms.


E(d) = E_data(d) + k · E_smooth(d)    (5)

where E_data takes into consideration the (x, y) pixel values throughout the image, E_smooth encodes the algorithm's smoothing assumptions, and k is a weight factor.
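Equation (5) can be made concrete with a small sketch. Here E_data sums the matching cost of each pixel's chosen disparity from a precomputed cost volume, and E_smooth penalises the absolute disparity difference between neighbouring pixels; this particular smoothness term is one common choice, not one the survey prescribes:

```python
import numpy as np

def global_energy(disp, cost_volume, k=0.1):
    """Evaluate E(d) = E_data(d) + k * E_smooth(d) for a candidate
    disparity map.  cost_volume[y, x, d] is the matching cost of
    assigning disparity d to pixel (x, y)."""
    h, w = disp.shape
    yy, xx = np.mgrid[:h, :w]
    e_data = cost_volume[yy, xx, disp].sum()
    # truncated-free L1 smoothness over horizontal and vertical neighbours
    e_smooth = (np.abs(np.diff(disp, axis=0)).sum()
                + np.abs(np.diff(disp, axis=1)).sum())
    return e_data + k * e_smooth
```

Global methods differ mainly in how they search for the map minimizing this quantity (graph cuts, belief propagation, DP, and so on), not in the form of E itself.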

The main disadvantage of the global methods is that they are more time-consuming and computationally demanding. The source of these characteristics is the iterative refinement approach they employ. They can be roughly divided into those performing a global energy minimization and those pursuing the minimum for independent scan-lines using DP.

Figure 2.10 presents the main characteristics of the global algorithms discussed below. It is clear that recently published works prefer global optimization over DP. This observation is not surprising, considering that the term global optimization actually covers quite a few different methods. Additionally, DP tends to produce inferior, thus less impressive, results. Therefore, applications that do not have running speed constraints preferably utilize global optimization methods.

    2.6 REVIEW OF STEREO VISION ALGORITHMS

    Fig.2.10: Diagrammatic representation of the global methods categorization


    2.6.1. Global optimization

    The algorithms that perform global optimization take into consideration the whole

    image in order to determine the disparity of every single pixel. An increasing portion of

    the global optimization methodologies involves segmentation of the input images

    according to their colors.

    The algorithm presented uses color segmentation. Each segment is described by a

    planar model and assigned to a layer using a mean shift based clustering algorithm. A

    global cost function is used that takes into account the summed up absolute differences,

    the discontinuities between segments and the occlusions. The assignment of segments to

    layers is iteratively updated until the cost function improves no more. The experimental

results indicate that the percentage of unoccluded pixels whose absolute disparity error is greater than 1 is 1.53, 0.16, and 0.22 for the image sets, respectively.

Another stereo matching algorithm makes use of color segmentation in conjunction with the graph cuts method. The reference image is divided into non-overlapping segments using the mean shift color segmentation algorithm. Thus, a set of planes in the disparity space is generated. The minimization of an energy function is addressed in the segment rather than the pixel domain. A disparity plane is fitted to each segment using the graph cuts method. This algorithm presents good performance in the textureless and occluded regions as well as at disparity discontinuities. The running speed reported is 0.33 fps for a 384×288 pixel image pair when tested on a 2.4GHz Pentium 4 PC. The percentage of bad matched pixels is found to be 1.23, 0.30, 0.08, and 1.49 for the test image sets (the last being Map), respectively.

The ultimate goal of another work is to render dynamic scenes with interactive viewpoint control from the images of a few cameras. A suitable color segmentation-based algorithm is developed and implemented on a programmable ATI 9800 PRO GPU. Disparities within segments must vary smoothly, each image is treated equally, occlusions are modelled explicitly, and consistency between disparity maps is enforced, resulting in higher quality depth maps. The results for each pixel are refined in conjunction with the others.


In another method that uses the concept of image color segmentation, an initial disparity map is calculated using an adaptive window technique. The segments are iteratively combined into larger layers. The assignment of segments to layers is optimized using a global cost function. The quality of the disparity map is measured by warping the reference image to the second view, comparing it with the real image, and calculating the color dissimilarity.
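The warping-based quality measure just described can be sketched as follows (grayscale version for brevity; the surveyed method compares colors):

```python
import numpy as np

def warp_error(ref, other, disp):
    """Warp the reference view into the other view using the disparity
    map and accumulate the absolute intensity dissimilarity.  A lower
    value indicates a better disparity map."""
    h, w = ref.shape
    err = 0.0
    for y in range(h):
        for x in range(w):
            xw = x - disp[y, x]          # corresponding column in other view
            if 0 <= xw < w:              # skip pixels warped out of frame
                err += abs(float(ref[y, x]) - float(other[y, xw]))
    return err
```

A perfect disparity map on noise-free, Lambertian input would warp to an error of zero; in practice the measure is used to compare competing layer assignments.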

For the 384×288 pixel and the 434×383 pixel Venus test sets, the algorithm produces results at a 0.05 fps rate. For the 450×375 pixel Teddy image pair, the running speed decreases to 0.01 fps due to the increased scene complexity. Running speeds refer to an Intel Pentium 4 2.0GHz processor. The root mean square errors obtained are 0.73, 0.31 (Venus), and 1.07 for the respective image pairs.

    Moreover, Sun and his colleagues presented a method which treats the two

    images of a stereo pair symmetrically within an energy minimization framework that can

    also embody color segmentation as a soft constraint. This method enforces that the

    occlusions in the reference image are consistent with the disparities found for the other

    image. Belief propagation iteratively refines the results. Moreover, results for the version

    of the algorithm that incorporates segmentation are better.

The percentage of pixels with disparity error larger than 1 is 0.97, 0.19, 0.16, and 0.16 for the test image sets (the last being Map), respectively. The running speed for the aforementioned data sets is about 0.02 fps, tested on a 2.8GHz Pentium 4 processor.

    Color segmentation is utilized as well. The matching cost used here is a self-

    adapting dissimilarity measure that takes into account the sum of absolute intensity

differences as well as a gradient-based measure. Disparity planes are extracted using a technique that is insensitive to outliers. Disparity plane labelling is performed using belief

    propagation. Execution speed varies between 0.07 and 0.04 fps on a 2.21GHz AMD

    Athlon 64 processor. The results indicate 1.13, 0.10, 4.22, and 2.48 percent of bad

    matched pixels in non-occluded areas for the image sets, respectively.


Finally, one more algorithm utilizes energy minimization, color segmentation, plane fitting and repeated application of hierarchical belief propagation. This algorithm takes into account a color-weighted correlation measure. Discontinuities and occlusions are properly handled. The percentage of pixels with disparity error larger than 1 is 0.88, 0.14, 3.55, and 2.90 for the Tsukuba, Venus, Teddy and Cones image sets, respectively.

Elsewhere, two new symmetric cost functions for global stereo methods are proposed: a symmetric data cost function for the likelihood, as well as a symmetric discontinuity cost function for the prior in the MRF model for stereo. Both the reference image and the target image are taken into account to improve performance, without modelling half-occluded pixels explicitly and without using color segmentation. The use of both proposed symmetric cost functions in conjunction with a belief propagation based stereo method is evaluated.

    Experimental results for standard test bed images show that the performance of

    the belief propagation based stereo method is greatly improved by the combined use of

the proposed symmetric cost functions. The percentage of badly matched pixels in the non-occluded areas was found to be 1.07, 0.69, 0.64, and 1.06 for the image sets, respectively.

    The incorporation of Markov random fields (MRF) as a computational tool is also a

    popular approach.

A method based on Bayesian estimation theory with a prior MRF model for the assigned disparities has been described, in which the continuity, coherence and occlusion constraints as well as the adjacency principle are taken into account. The optimal estimator is computed using a Gauss-Markov random field model for the corresponding posterior marginal, which results in a diffusion process in the probability space. The results are accurate, but the algorithm is not suitable for real-time applications, since it needs a few minutes to process a 256×255 stereo pair with up to 32 disparity levels on an Intel Pentium III running at 450MHz.

Another approach treats every pixel of the input images as generated either by a process responsible for the pixels visible from the reference camera, which obey


    the constant brightness assumption, or by an outlier process, responsible for the pixels

    that cannot be corresponded. Depth and visibility are jointly modelled as a hidden MRF,

    and the spatial correlations of both are explicitly accounted for by defining a suitable

Gibbs prior distribution. An expectation maximization (EM) algorithm keeps track of which points of the scene are visible in which images, and accounts for visibility

    configurations. The percentages of pixels with disparity error larger than 1 are 2.57, 1.72,

    6.86 and 4.64 for the image sets, respectively.

Moreover, a stereo method specifically designed for image-based rendering has been described. This algorithm uses over-segmentation of the input images and computes matching values over entire segments rather than single pixels. Color-based segmentation preserves object boundaries. The depths of the segments for each image are computed using loopy belief propagation within an MRF framework. Occlusions are also considered. The percentage of bad matched pixels in the unoccluded regions is 1.69, 0.50, 6.74, and 3.19 for the Tsukuba, Venus, Teddy and Cones image sets, respectively. The aforementioned results refer to a 2.8GHz PC platform.

    An algorithm based on a hierarchical calculation of mutual information based

    matching cost is proposed. Its goal is to minimize a proper global energy function, not by

    iterative refinements but by aggregating matching costs for each pixel from all directions.

The final disparity map is sub-pixel accurate and occlusions are detected. The processing speed for the image set is 0.77 fps. The error in unoccluded regions is found to be less than 3% for all the standard image sets. Calculations are made on an Intel Xeon processor

    running at 2.8GHz.

Mutual information is once again used as cost function in another work. The extensions applied there result in intensity-consistent disparity selection for untextured areas and discontinuity-preserving interpolation for filling holes in the disparity maps. It successfully treats complex shapes and uses planar models for untextured areas. A bidirectional consistency check, sub-pixel estimation as well as invalid-disparity interpolation are performed.

The experimental results indicate that the percentages of bad matching pixels in unoccluded regions are 2.61, 0.25, 5.14, and 2.77 for the Tsukuba, Venus, Teddy and


Cones image sets, respectively, with 64 disparity levels searched each time. However, the reported running speed on a 2.8GHz PC is less than 1 fps.

In another work, dense disparity estimation is accomplished by a region-dividing technique that uses a Canny edge detector and a simple SAD function. The results are refined by regularizing the vector fields by means of minimizing an energy function. The root mean square errors obtained from this method are 0.9278 and 0.9094 for the image pairs. The

    running speed is 0.15 fps and 0.105 fps respectively on a Pentium 4 PC running Windows

    XP.

An uncommon measure is used in a work that describes an algorithm focused on achieving contrast-invariant stereo matching. It relies on multiple spatial frequency channels for local matching. The measure for this stage is the deviation of the phase difference from zero. The global solution is found by a fast non-iterative left-right diffusion process. Occlusions are found by enforcing the uniqueness constraint. The

    algorithm is able to handle significant changes in contrast between the two images and

    can handle noise in one of the frequency channels.

Another algorithm that generates high-quality results in real time is based on the minimization of a global energy function comprising a data and a smoothness term. Hierarchical belief propagation iteratively optimizes the smoothness term, achieving fast convergence by removing the redundant computations involved. In order to accomplish real-time operation, the authors take advantage of the parallelism of graphics hardware (GPU).

Experimental results indicate a 16 fps processing speed for 320×240 pixel self-recorded images with 16 disparity levels. The percentages of bad matching pixels in unoccluded regions for the image sets are found to be 1.49, 0.77, 8.72, and 4.61. The computer used is a 3GHz PC and the GPU is an NVIDIA 7900 GTX graphics card with 512MB of video memory.

Further work indicates that the computational cost of the graph cuts stereo correspondence technique can be efficiently decreased using the results of a simple local


stereo algorithm to limit the disparity search range. The idea is to analyze and exploit the failures of local correspondence algorithms. This method can accelerate the processing by a factor of 2.8, compared to the sole use of graph cuts, while the resulting energy is worse only by an average of 1.7%. These results proceed from an analysis done on a large dataset of 32 stereo pairs using a Pentium 4 at 2.6GHz PC.
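The range-limiting idea above amounts to deriving a per-pixel search interval from a fast local result, so that the subsequent global method only evaluates disparities near it. A minimal sketch, with the margin being an illustrative assumption:

```python
import numpy as np

def limited_range(local_disp, margin=2, max_disp=15):
    """Per-pixel disparity search interval [lo, hi] for a global
    method, centred on a fast local disparity estimate and clipped
    to the valid range."""
    lo = np.clip(local_disp - margin, 0, max_disp)
    hi = np.clip(local_disp + margin, 0, max_disp)
    return lo, hi
```

Shrinking the label set per pixel is what yields the reported speed-up, at the cost of occasionally excluding the true minimum where the local method failed badly.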

    2.6.2. Dynamic programming

    Many researchers develop stereo correspondence algorithms based on DP. This

    methodology is a fair trade-off between the complexity of the computations needed and

    the quality of the results obtained. In every aspect, DP stands between the local

algorithms and the global optimization ones. However, its computational complexity still renders it a less preferable option for hardware implementation.
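The core of scan-line DP stereo can be sketched as follows: given a local matching cost per (pixel, disparity) pair, the optimal disparity assignment along one scan-line is found by a forward accumulation pass and a backtracking pass. The linear transition penalty is one common choice, not one the surveyed works all share:

```python
import numpy as np

def dp_scanline(cost, smooth=0.5):
    """Optimal disparities along one scan-line by dynamic programming.
    cost[x, d] is the local matching cost of disparity d at column x;
    `smooth` penalises each unit of disparity change between neighbours."""
    w, nd = cost.shape
    acc = cost.copy()                       # accumulated optimal cost
    back = np.zeros((w, nd), dtype=int)     # backpointers for backtracking
    for x in range(1, w):
        for d in range(nd):
            trans = acc[x - 1] + smooth * np.abs(np.arange(nd) - d)
            back[x, d] = int(np.argmin(trans))
            acc[x, d] += trans[back[x, d]]
    disp = np.zeros(w, dtype=int)
    disp[-1] = int(np.argmin(acc[-1]))
    for x in range(w - 1, 0, -1):           # follow the cheapest path back
        disp[x - 1] = back[x, disp[x]]
    return disp
```

Because each scan-line is optimized independently, the classic streaking artefact appears between lines, which is exactly what the two-pass and tree-based variants discussed below try to remove.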

One work presents a unified framework that allows the fusion of any partial

    knowledge about disparities, such as matched features and known surfaces within the

    scene. It combines the results from corner, edge and dense stereo matching algorithms to

    impose constraints that act as guide points to the standard DP method. The result is a

    fully automatic dense stereo system with up to four times faster running speed and greater

    accuracy compared to results obtained by the sole use of DP.

    One or more disparity candidates for the true disparity of each pixel are assigned

by local matching using oriented spatial filters. Afterwards, a two-pass DP technique that performs optimization both along and between the scan-lines is applied. The result is the reduction of false matches as well as of the typical inter-scan-line inconsistency problem.

In another approach, the per-pixel matching costs are aggregated in the vertical direction only, resulting in improved inter-scan-line consistency and sharp object boundaries. This work exploits the color and distance proximity based weight assignment for the pixels inside a fixed support window, as reported above. The real-time performance is achieved due to the parallel use of the CPU and the GPU of a computer. This implementation can process


320×240 pixel images with 16 disparity levels at 43.5 fps and 640×480 pixel images with 16 disparity levels at 9.9 fps.

In contrast, another algorithm applies the DP method not across individual scan-lines but to a tree structure. Thus, the minimization procedure accounts for all the pixels of the image, compensating for the known streaking effect without being iterative. The reported running speed is a couple of frames per second for the tested image pairs, so real-time implementations are feasible. At the same time, the results obtained are comparable to those of the time-consuming global methods.

In a follow-up, the pixel-tree approach of the previous work is replaced by a region-tree one. First of all, the image is color-segmented using the mean-shift algorithm. During the stereo matching, a corresponding energy function defined on this region-tree structure is optimized using the DP technique. Occlusions are handled by compensating for border occlusions and by applying cross-checking. The obtained results indicate that the percentage of bad matched pixels in unoccluded regions is 1.39, 0.22, 7.42, and 6.31 for the Tsukuba, Venus, Teddy and Cones image sets. The running speed, on a 1.4GHz Intel Pentium M

    with 60 disparity levels.

    2.6.3 Other methods

There are of course other methods, producing dense disparity maps, which can be placed in neither of the previous categories. The methods discussed below use either wavelet-based techniques or combinations of various techniques. One such method is based on the continuous wavelet transform (CWT). It makes use of the redundant information that results from the CWT. Using 1D orthogonal and bi-orthogonal wavelets as well as 2D orthogonal wavelets, the maximum matching rate obtained is 88.22% for the image pair. Up-sampling the pixels in the horizontal direction by a factor of two, through zero insertion, further decreases the noise, and the matching rate is increased to 84.91%.

    Another work presents an algorithm based on non-uniform rational B-splines

(NURBS) curves. The curves replace the edges extracted with a wavelet-based method. The NURBS curves are projectively invariant and so reduce false matches due to distortion


    and image noise. Stereo matching is then obtained by estimating the similarity between

    projections of curves of an image and curves of another image. A 96.5% matching rate

    for a self-recorded image pair is reported for this method.

Finally, a different way of confronting the stereo matching issue is to investigate the possibility of fusing the results from spatially differentiated (stereo vision) scenery images with those from temporally differentiated (structure from motion) ones. This approach takes advantage of the merits of both methods, improving the performance.

    2.7 SPARSE DISPARITY ALGORITHMS

    Algorithms resulting in sparse, or semi-dense, disparity maps tend to be less

    attractive as most of the contemporary applications require dense disparity information.

However, they are very useful when fast depth estimation is required and, at the same time, detail across the whole picture is not so important. This type of algorithm tends to focus on the main features of the images, leaving occluded and poorly textured areas unmatched.

Consequently, high processing speeds and accurate results, albeit of limited density, are achieved. Very interesting ideas flourish in this direction, but since contemporary interest is directed towards dense disparity maps, only a few indicative algorithms are discussed here.

One algorithm detects and matches dense features between the left and right images of a stereo pair, producing a semi-dense disparity map. A dense feature is a

    conn