
Automated 3D Object Modeling from Aerial Video Imagery

Prudhvi Krishna Gurram, Ph.D. Student

Chester F. Carlson Center for Imaging Science,

Rochester Institute of Technology

Army Research Laboratory, MD

Date: 09/02/2009

Outline

• Introduction
• Motivation
• Research objectives
• Approach & Results
    • Pre-processing step
    • Building stereo mosaics
    • 3D object identification
    • 3D object modeling
• Summary and conclusions
• Future work recommendations

Prudhvi Gurram, Research Seminar04/19/23 2

Introduction

Applications of physically realistic 3D scenes

• Military applications include target area simulation and moving target detection
• Civilian applications include damage assessment in case of natural disasters
• Other applications include medical imaging and robotic vision

Reconstructed 3D scenes must conform to real scenes in terms of

• Geometry
• Radiometry

RIT has the Digital Imaging and Remote Sensing Image Generation (DIRSIG) tool, which can create spectrally accurate synthetic imagery by simulating different sensor types.

DIRSIG requires a pre-defined 3D geometrical scene with spectra assigned to each facet of the scene


Motivation


[Figure: High-resolution video, Lidar data, and spectral imagery combine into a spectrally accurate scene model]

Rapidly construct radiometrically correct scene models based on multi-sensor data for use in DIRSIG synthetic scene generation

Motivation

Objective

• Extraction of 3D geometry of a scene from aerial video over a large scene

Possible Approaches

• Manual interpretation of stereo imagery (very intensive and time consuming for large areas, on the order of days or even months)
• Automated processing of video frames to build stereo mosaics for the extraction of 3D geometry
• Combine this with information from Lidar to improve the accuracy of the 3D scene
• Combine the 3D coordinates with material properties from hyperspectral imaging to render a 3D scene which conforms both geometrically and radiometrically to the real world

Research Objectives

Build stereo mosaics from video frames over large scenes

Identify 3D objects like buildings and trees in the scene using stereo mosaics

Accurately model 3D buildings in the scene

Improve the accuracy of 3D object identification and modeling by fusing Lidar data with visual imagery

Approach


[Flowchart: Video frames and Exterior Orientation (EO) and Interior Orientation (IO) parameters → Pre-processing of the video frames → Orientation-corrected video frames → Ray Interpolation → Stereo Mosaics → 3D Object Identification and Modeling → 3D Models]

Legend: Inputs; Intermediate output and visual aid; Final output



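Initial Video Frames

Exterior and Interior Orientation parameters

[Figure: Video camera at successive viewpoints, each with a rotation matrix and a camera center (viewpoint): $(R_1, T_1), (R_2, T_2), \ldots, (R_N, T_N)$]

Interior orientation matrix

$$K = \begin{bmatrix} f/d_{p_x} & 0 & 0 \\ 0 & f/d_{p_y} & 0 \\ 0 & 0 & 1 \end{bmatrix}$$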



Pre-processing of Video Frames


• Correct the orientation of the frames so that all the frames have the same orientation (nadir looking).
• The observed motion parallax of objects is then due to the translational motion of the camera only.

World coordinate system → Camera coordinate system

A world point $P_{world}$ can be expressed in the camera coordinate system, with rotation matrix $R$ and camera center at $T$, as

$$P = R\,(P_{world} - T)$$

Pre-processing (contd…)

Camera coordinate system → Image coordinate system

The image coordinates of this point are given by the Interior Orientation parameters embedded in matrix $K$:

$$P_{im} = \begin{bmatrix} kx \\ ky \\ k \end{bmatrix} = K R_i\,(P_{world} - T_i)$$

The image coordinates in any frame $i$ are transformed by matrix $A$ so that only translational motion is observed in the sensor:

$$P'_{im} = A P_{im} = K R_i^{-1} K^{-1} P_{im} = K\,(P_{world} - T_i)$$
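The orientation correction can be checked numerically. The following is a minimal sketch, assuming NumPy, a made-up interior orientation matrix, a small tilt about the x axis, and an arbitrary world point; none of these values come from the seminar data. It projects a point with the rotated camera, applies a correction matrix of the form K R⁻¹ K⁻¹, and compares the result with the projection from a camera that differs only by translation.

```python
import numpy as np

def rotation_about_x(angle_rad):
    """Rotation matrix for a tilt about the camera x axis (illustrative only)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def project(K, R, T, P_world):
    """Homogeneous image coordinates (kx, ky, k) of a world point."""
    return K @ R @ (P_world - T)

def orientation_correction(K, R):
    """Matrix A = K R^-1 K^-1 that removes the rotation from a frame."""
    return K @ np.linalg.inv(R) @ np.linalg.inv(K)

# Hypothetical values chosen only for the demonstration.
K = np.array([[1000.0, 0.0, 0.0],
              [0.0, 1000.0, 0.0],
              [0.0, 0.0, 1.0]])
R = rotation_about_x(np.deg2rad(5.0))
T = np.array([10.0, 20.0, 500.0])
P_world = np.array([30.0, -15.0, 0.0])

p_im = project(K, R, T, P_world)            # rotated frame
p_corrected = orientation_correction(K, R) @ p_im
p_expected = K @ (P_world - T)              # translation-only frame

print(np.allclose(p_corrected, p_expected))
```

Because the correction depends only on K and R, it is a single 3x3 matrix per frame and can be applied to every pixel of that frame as a homography.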



Parallel Ray Interpolation

Why are we using Parallel Ray Interpolation?

• To convert the view from perspective view to parallel-perspective view
• To simulate a linear pushbroom camera
• To use motion parallax information (while creating mosaics)
• To make the stereo mosaics seamless

[Figure: Parallel and perspective views; panels show the perspective view, the parallel view, and the parallel view obtained from the perspective view using fixed lines]

Ray Interpolation

[Figure: Viewpoint 1, Viewpoint 2, and the interpolated viewpoint above the image (mosaic) plane, showing the point in the image plane from viewpoint 1, from viewpoint 2, and from the interpolated viewpoint]

Acknowledgement: Zhigang Zhu et al., City College of New York, New York City, NY

PRISM (Parallel Ray Interpolation for Stereo Mosaicing)

[Figure: Frame 1 and Frame 2 within the image frame, each with its fixed line, and the overlapped region between the fixed lines]

Fast PRISM


[Figure: Source triangles in Frame 1 and Frame 2, and destination triangles in the left mosaic]

Motion Parallax

[Figure: Frame 1, Frame 2, and the interpolated frame (before triangulation)]

Can happen with low-flying aircraft and high-rise buildings

Interpolation

[Figure: Frame 1, Frame 2, overlay of Frames 1 and 2, and the interpolation result]

Interpolation

[Figure: Frame 1, Frame 2, and the interpolation result]

Triangulation Problem

[Figure: Frame 1, Frame 2, and the interpolated frame (after triangulation)]

Modified PRISM

Make sure that none of the triangles include regions with different motion parallax

Find edges of different regions and align the sides of triangles with the edges

Use segmentation to obtain different planar surfaces

The inner boundary of each segment forms an edge of a region/object


Overlapped region

[Figure: Frame 1 and Frame 2]

Segmented images

[Figure: Segmented Frame 1 and Segmented Frame 2]

Triangulation

[Figure: Frame k, showing the segments in the overlapped region, one of the segments, and the significant points found using a convex hull around the segment]

Triangulation

[Figure: Frame k+1, showing the matching curve and the other part of the segment between the matching curve and the fixed line]

On the Mosaic

[Figure: Mosaic with “orphan” pixels and the same mosaic with the orphan pixels filled; axes X and Y, with the motion direction marked]

• Using a constraint inherent in the Parallel-Perspective view
• Parallel view in dominant motion direction and Perspective view in the other direction
• Do not consider motion parallax along x direction
• In 3D translational case, use sequential linear interpolation to fill the orphan pixels
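Filling orphan pixels by linear interpolation can be sketched as follows. This is a minimal illustration, assuming NumPy, a boolean mask marking which mosaic pixels received a value, and row-by-row processing; the mask representation, the choice of axis, and the function names are assumptions, not the method as implemented in the seminar work.

```python
import numpy as np

def fill_orphans_1d(values, valid):
    """Linearly interpolate the invalid entries of one scan line."""
    values = np.asarray(values, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        return values.copy()          # nothing to interpolate from
    idx = np.arange(values.size)
    # np.interp holds the nearest valid value beyond the first/last valid pixel.
    return np.interp(idx, idx[valid], values[valid])

def fill_orphans(mosaic, valid):
    """Process the mosaic one row after another (sequentially)."""
    filled = np.array(mosaic, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    for r in range(filled.shape[0]):
        filled[r] = fill_orphans_1d(filled[r], valid[r])
    return filled

# Toy mosaic: zeros stand in for orphan pixels that no triangle covered.
mosaic = np.array([[10.0, 0.0, 0.0, 40.0],
                   [5.0, 0.0, 15.0, 0.0]])
valid = np.array([[True, False, False, True],
                  [True, False, True, False]])

filled = fill_orphans(mosaic, valid)
print(filled)
```

For a color mosaic the same routine would be applied to each channel separately.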

Results – Set 1


[Figure: Frame 1 and Frame 2 with the motion parallax region marked, and the mosaics built with Fast PRISM and Modified PRISM]

Results – Set 2


[Figure: Mosaics built with Fast PRISM and Modified PRISM]

Publications:

1. P. Gurram, E. Saber, and H. Rhody, “A Novel Triangulation Method for Building Parallel-Perspective Stereo Mosaics”, Proceedings of Electronic Imaging Symposium, SPIE, San Jose, CA, January 2007.
2. P. Gurram, E. Saber, and H. Rhody, “Segment-based Mesh Design for Building Parallel-Perspective Stereo Mosaics”, Accepted for publication in IEEE Transactions on Geoscience and Remote Sensing

Stereo Mosaic – Modified PRISM



3D Object Identification and Modeling (Deterministic Approach)

Build a nadir mosaic along with left and right stereo mosaics with fixed line looking in nadir direction

Use image segmentation to identify the various homogeneous surfaces in the mosaics

• Manually set the segmentation input parameters
• Each homogeneous surface is treated as a planar surface
• Use deterministic thresholds to identify polyhedral building surfaces based on the elevation map generated using stereo mosaics

[Figure: Viewpoints along the sensor motion above the image plane and the scene, producing the left mosaic, nadir mosaic, and right mosaic]

Deterministic Approach (contd…)


[Flowchart: blocks shown]

• Nadir mosaic → Segmentation → Tree regions; building surfaces extraction using height information
• Each surface → Boundary of each surface → Corners through Curvature Scale Space → Edges through line fit between corners
• Left mosaic and right mosaic → Plane fit for each surface using disparity between mosaics
• CAD model of each building, DTM, Digital Elevation Model (DEM), reconstructed scene

[Figure: Problems in the 3D model of a building due to solar shadow and noise in the images; panels show the stereo pair and the nadir mosaic, with the noise and solar shadow regions marked]


[Figure: Video compared with Lidar. The video panels mark solar shadow and noise; the Lidar panels show the raw Lidar and the CAD model from Lidar, with good elevation and planes]

There is no information in these cases, as one planar surface merges with a neighboring surface at a different height during segmentation

Fusing Lidar data with visual imagery

Problems with Deterministic Approach

• Had to set the segmentation input parameters manually
• Had to manually select Regions-Of-Interest (ROI) around buildings, which is tedious over large scenes
• Sparse depth/elevation map from stereo mosaics led to inaccurate 3D models
• Noise in the images led to problems with modeling 3D buildings (missing surfaces etc.)
• Deterministic thresholds led to the models being overfitted to a particular data set
• Some of the above problems can be avoided by fusing Lidar data with visual imagery

Publications:

1. Prudhvi Gurram, Eli Saber and Harvey Rhody, "Extraction of Digital Elevation Map from Parallel-Perspective Stereo Mosaics", IS&T/SPIE Electronic Imaging Symposium, San Jose, CA, Jan. 2008.
2. Prudhvi Gurram, Steve Lach, Eli Saber, Harvey Rhody and John Kerekes, "3D Scene Reconstruction through a Fusion of Passive Video and Lidar Imagery", IEEE Applied Imagery Pattern Recognition Workshop, Washington, DC, Oct. 2007.

3D Object Identification and Modeling

Two Parts

Object Identification (Buildings, Trees, Terrain, etc.)

• Using global statistics of various features in a Bayesian network for classification of surfaces
• Features include elevation map from visual imagery (stereo mosaics), elevation from Lidar data, color information, edges and corners extracted from visual imagery

Object Modeling (3D Buildings)

• Identified building surfaces have inaccurate 3D geometry due to sparse depth maps provided by stereo mosaics
• Improving accuracy of the building geometry measurements obtained from stereo mosaics using local optimization and individual video frames

3D Object Identification

Why Bayesian Networks?

• Useful in representing causal relationships between the features/nodes
• Specify conditional independence among the features
• Easier to combine prior knowledge (structure of the BN) with data
• Easier for an expert to intervene and predict the effects of such an intervention
• Avoid overfitting of models to a particular data set

Bayesian Network Semantics

• The features constitute the nodes of the BN
• If X is connected to Y (causal relationship), X is called the parent of Y
• Any node has conditional probability distribution P(X|Parents(X)) → P(X|A,B)
• The probabilities associated with each node are called parameters
• BN defined by
    • Structure (causal relationships)
    • Parameters (probabilities – conditional or prior)

[Figure: Example network with nodes A and B (parents of X), X, and Y and Z (conditionally independent given X); arrows denote causal relationships]
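For the small example network, the semantics above imply a factorization of the joint distribution. The block below is a sketch that assumes the structure read off the figure labels: A and B are the parents of X, and Y and Z are children of X with no other parents.

```latex
% Joint distribution as a product of one factor per node given its parents
P(A, B, X, Y, Z) = P(A)\, P(B)\, P(X \mid A, B)\, P(Y \mid X)\, P(Z \mid X)

% Conditional independence of Y and Z given X follows from this factorization
P(Y, Z \mid X) = P(Y \mid X)\, P(Z \mid X)
```

The priors P(A) and P(B) and the three conditional distributions are the parameters; the choice of which factors appear is the structure.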

Bayesian Network Structure


[Figure: BN structure to use visual imagery and Lidar data, and BN structure to use visual imagery only; both contain the node Region]

Other features in the BN

1. Elevation information from stereo mosaics
2. Elevation information from Lidar data
3. Corner information from nadir mosaic
4. Color information from nadir mosaic
5. Texture information from nadir mosaic
6. Area of regions

Classes: 1 – Buildings, 2 – Grass, 3 – Trees, 4 – Asphalt, 5 – Misc.

Stereo Mosaics


[Figure: Original video frames and the resulting left mosaic, nadir mosaic, and right mosaic]

Mosaics and Lidar Data


[Figure: Stereo mosaics, and Lidar data (rasterized and registered to the nadir mosaic by flying a linear pushbroom camera over the Lidar point cloud)]

Truth Map and Pseudo Truth Map


[Figure: Nadir mosaic, segment map, truth map, and pseudo truth map]

Generated using mean-shift image segmentation with spatial bandwidth $h_s = 15$ and color bandwidth $h_c = 10$

Classes: 1 – Buildings, 2 – Grass, 3 – Trees, 4 – Asphalt, 5 – Misc.

Feature Extraction

Elevation information from stereo mosaics

• Match corresponding points in the left, right and nadir mosaics (along the boundaries of segments) using a correlation technique and the epipolar constraints of parallel-perspective stereo mosaics
• Use the stereo geometry of parallel-perspective stereo mosaics to extract a depth/elevation map
• Fit least squares planes for each segment in the nadir mosaic
• Use the RANdom SAmple Consensus (RANSAC) algorithm to remove outliers (due to bad segmentation and bad point matches) during plane fit
• Use mean height, minimum height, maximum height, and number of inliers during plane fit as features in the BN
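The plane fit with outlier rejection can be sketched as below. This is a minimal illustration in NumPy on synthetic points; the plane model z = a·x + b·y + c, the inlier tolerance, the iteration count, and the feature names are assumptions made for the example, not values from the seminar.

```python
import numpy as np

def fit_plane_lsq(points):
    """Least-squares fit of z = a*x + b*y + c; returns (a, b, c)."""
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    return coeffs

def ransac_plane(points, n_iter=200, tol=0.5, seed=0):
    """Keep the plane hypothesis with the most inliers, then refit on them."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        sample = points[rng.choice(len(points), size=3, replace=False)]
        a, b, c = fit_plane_lsq(sample)
        residuals = np.abs(points[:, 2] - (a * points[:, 0] + b * points[:, 1] + c))
        inliers = residuals < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers.sum() < 3:
        raise ValueError("no plane hypothesis gathered enough inliers")
    return fit_plane_lsq(points[best_inliers]), best_inliers

# Synthetic segment: a flat roof near 10 m with ten bad matches far above it.
rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 20.0, size=(100, 2))
z = 10.0 + 0.05 * rng.standard_normal(100)
z[:10] += rng.uniform(5.0, 15.0, size=10)
points = np.column_stack([xy, z])

coeffs, inliers = ransac_plane(points)
heights = points[inliers, 2]
features = {
    "mean_height": float(heights.mean()),
    "min_height": float(heights.min()),
    "max_height": float(heights.max()),
    "num_inliers": int(inliers.sum()),
}
print(features)
```

The four values in `features` correspond to the per-segment elevation features listed above.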


Observe the noise in the elevation map – this is due to over-segmentation of tree regions and bad point matches

Feature Extraction

Elevation information from Lidar data

• Elevation map readily available from the Lidar point cloud
• Fit least squares planes for each segment in the nadir mosaic
• Use the RANdom SAmple Consensus (RANSAC) algorithm to remove outliers (due to bad segmentation) during plane fit
• Use mean height, minimum height, maximum height, and number of inliers during plane fit as features in the BN

Corner information from nadir mosaic

• Extract 2D corners from each segment of the nadir mosaic
• Orthorectify the corners using the initial elevation map from stereo mosaics
• Use the total number of corners and the numbers of right angle, 45 degree, and 135 degree corners as features in the BN

Surface area of each segment in absolute units (m²)

Mean values of hue and saturation of each segment

Feature Extraction

Visual entropy determined over 9x9 window on nadir mosaic

Lidar entropy determined over 9x9 window on rasterized Lidar data

Entropy represents the energy of the data over a window – can be used to represent the presence of texture in visual images and presence of changes of height in Lidar data
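A windowed entropy feature of this kind can be sketched as follows. This is a minimal illustration assuming NumPy, Shannon entropy of a 16-bin histogram inside each 9x9 window, and reflected borders; the bin count, the border handling, and the synthetic image are assumptions for the example.

```python
import numpy as np

def local_entropy(image, window=9, bins=16):
    """Shannon entropy (bits) of the quantized values in a sliding window."""
    image = np.asarray(image, dtype=float)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros(image.shape)          # constant image has no texture
    quantized = np.minimum(((image - lo) / (hi - lo) * bins).astype(int), bins - 1)
    pad = window // 2
    padded = np.pad(quantized, pad, mode="reflect")
    out = np.zeros(image.shape)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            patch = padded[r:r + window, c:c + window]
            counts = np.bincount(patch.ravel(), minlength=bins)
            p = counts[counts > 0] / patch.size
            out[r, c] = -np.sum(p * np.log2(p))
    return out

# Synthetic image: a flat region on the left, random texture on the right.
rng = np.random.default_rng(0)
flat = np.zeros((20, 20))
textured = rng.integers(0, 256, size=(20, 20)).astype(float)
image = np.hstack([flat, textured])

entropy_map = local_entropy(image)
print(entropy_map[10, 5], entropy_map[10, 30])
```

The same function could be run on the nadir mosaic for the visual entropy and on the rasterized Lidar elevation for the Lidar entropy.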

Bayesian Network Learning and Inference

• Equal frequency binning is used to discretize the features
• Known structure and incomplete data
    • Hidden nodes are introduced by an expert to make causal dependencies explicit
    • Use the Expectation-Maximization (EM) algorithm to learn the parameters of the nodes given all the training data
• During the testing phase, use the junction tree inference algorithm to marginalize over the nodes for which evidence is provided and obtain the posterior probabilities of the desired node (Region)
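Equal frequency binning can be sketched with quantiles. This is a minimal illustration assuming NumPy; the feature values and the choice of three bins are made up for the example.

```python
import numpy as np

def equal_frequency_bins(values, n_bins):
    """Discretize values so each bin receives roughly the same number of samples."""
    values = np.asarray(values, dtype=float)
    # Interior bin edges sit at the 1/n, 2/n, ... quantiles of the data.
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    labels = np.searchsorted(edges, values, side="right")
    return labels, edges

# Hypothetical mean heights (in metres) for nine segments.
mean_heights = [0.1, 0.2, 0.3, 2.9, 3.1, 3.0, 9.8, 10.1, 25.0]
labels, edges = equal_frequency_bins(mean_heights, n_bins=3)
print(labels, edges)
```

Unlike equal width binning, a single very tall segment (25.0 here) does not leave the other bins nearly empty.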



Decision Theory

Decision theory = ‘Probability theory’ + ‘Utility theory’

User-defined utilities are used to determine how the posteriori probabilities are used for making decisions

For region classification, utilities are set in such a way that the class with Maximum A Posteriori (MAP) probability is chosen
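The link between utilities and the MAP decision can be sketched as below. This is a minimal illustration assuming NumPy, a made-up posterior over the five classes, and an identity utility matrix (utility 1 for a correct decision, 0 otherwise); with that choice, maximizing expected utility selects the class with the largest posterior probability.

```python
import numpy as np

CLASSES = ["Buildings", "Grass", "Trees", "Asphalt", "Misc."]

def decide(posterior, utility):
    """Pick the decision with the highest expected utility.

    utility[d, c] is the utility of deciding class d when the true class is c.
    """
    expected_utility = utility @ posterior
    return int(np.argmax(expected_utility)), expected_utility

# Hypothetical posterior for one region, as produced by BN inference.
posterior = np.array([0.55, 0.05, 0.30, 0.08, 0.02])
utility = np.eye(len(CLASSES))      # reward only correct decisions

decision, expected_utility = decide(posterior, utility)
print(CLASSES[decision])
```

A non-identity utility matrix, for example one that penalizes missing a building more than raising a false alarm, would shift the decision away from plain MAP.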


Automated Choice of Best Segmentation Input Parameters

Mean-shift image segmentation input parameters

• Spatial bandwidth $h_s$ is set at a constant level since its variation does not considerably change the segmentation results
• Color bandwidth $h_c$ is varied from 2 to 20 in steps of 2, giving 10 sets of input parameters

The best set of parameters is chosen based on the quality of classification results

Quality metric: weighted sum of differences of True Positive Rate and False Positive Rate

$$h_c = \arg\max_{h_{c_k}} \sum_{i=1}^{N} W_i \left( TP_i^k - FP_i^k \right)$$

where $h_{c_k}$ represents one case of input parameter and $i$ represents the class number

• A True Positive is a hit and a False Positive is a false alarm
• True positives and false positives are calculated in terms of pixels, not regions

Results – Visual Data Only


        Building Class                  Trees Class
h_c     TP       FP       TP-FP        TP       FP       TP-FP
2       0.8326   0.0802   0.7524       0.9033   0.0860   0.8173
4       0.8375   0.0705   0.7670       0.8711   0.0764   0.7947
6       0.8109   0.0491   0.7618       0.9099   0.1034   0.8066
8       0.7871   0.0482   0.7389       0.8951   0.0806   0.8145
10      0.8474   0.0774   0.7700       0.9111   0.0968   0.8143
12      0.7951   0.0634   0.7318       0.8841   0.0810   0.8031
14      0.7735   0.0594   0.7142       0.8415   0.0658   0.7757
16      0.8322   0.0664   0.7658       0.7879   0.0646   0.7233
18      0.8131   0.0829   0.7302       0.8267   0.0596   0.7672
20      0.8026   0.0711   0.7315       0.7648   0.0523   0.7125

Best parameter: $h_c = 10$

[Figure: Classification map]

Weights used: $W_1 = 1, W_2 = 0, W_3 = 0, W_4 = 0, W_5 = 0$
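The parameter selection can be reproduced from the visual-only rates reported for the Building and Trees classes. The sketch below assumes NumPy and restricts the weighted sum to those two classes, since rates for the other three classes are not listed; the second weight vector is hypothetical and only illustrates how the choice shifts with the weights.

```python
import numpy as np

bandwidths = np.arange(2, 22, 2)   # color bandwidth h_c from 2 to 20

# Rows: Building class, Trees class (visual data only).
tp = np.array([
    [0.8326, 0.8375, 0.8109, 0.7871, 0.8474, 0.7951, 0.7735, 0.8322, 0.8131, 0.8026],
    [0.9033, 0.8711, 0.9099, 0.8951, 0.9111, 0.8841, 0.8415, 0.7879, 0.8267, 0.7648],
])
fp = np.array([
    [0.0802, 0.0705, 0.0491, 0.0482, 0.0774, 0.0634, 0.0594, 0.0664, 0.0829, 0.0711],
    [0.0860, 0.0764, 0.1034, 0.0806, 0.0968, 0.0810, 0.0658, 0.0646, 0.0596, 0.0523],
])

def best_color_bandwidth(weights, tp, fp, bandwidths):
    """Return the bandwidth maximizing the weighted sum of (TP - FP) over classes."""
    score = np.asarray(weights, dtype=float) @ (tp - fp)
    return int(bandwidths[np.argmax(score)]), score

# All weight on the Building class, as in the reported weights.
best_building, _ = best_color_bandwidth([1.0, 0.0], tp, fp, bandwidths)

# Hypothetical alternative: all weight on the Trees class.
best_trees, _ = best_color_bandwidth([0.0, 1.0], tp, fp, bandwidths)

print(best_building, best_trees)
```

With all weight on the Building class, the selected bandwidth agrees with the reported best parameter of 10.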

Results – Visual Data Only


Best parameter: $h_c = 2$

[Figure: Classification map]

Weights used: $W_1 = 0.1, W_2 = 0.9, W_3 = 0, W_4 = 0, W_5 = 0$

Results – Visual and Lidar fusion


        Building Class                  Trees Class
h_c     TP       FP       TP-FP        TP       FP       TP-FP
2       0.9478   0.0273   0.9204       0.9552   0.0291   0.9261
4       0.9651   0.0499   0.9151       0.9516   0.0273   0.9243
6       0.9576   0.0346   0.9230       0.9527   0.0280   0.9248
8       0.9609   0.0476   0.9133       0.9512   0.0278   0.9234
10      0.9604   0.0151   0.9453       0.9504   0.0277   0.9227
12      0.9590   0.0197   0.9393       0.9483   0.0252   0.9231
14      0.9424   0.0225   0.9199       0.9420   0.0227   0.9193
16      0.9433   0.0481   0.8952       0.9471   0.0224   0.9248
18      0.9189   0.0996   0.8194       0.9457   0.0226   0.9231
20      0.9188   0.1236   0.7952       0.9419   0.0225   0.9195

Best parameter: $h_c = 10$

[Figure: Classification map]

Weights used: $W_1 = 1, W_2 = 0, W_3 = 0, W_4 = 0, W_5 = 0$

Comparison of the Two Classifiers


Visual imagery only

        Building Class                  Trees Class
h_c     TP       FP       TP-FP        TP       FP       TP-FP
2       0.8326   0.0802   0.7524       0.9033   0.0860   0.8173
4       0.8375   0.0705   0.7670       0.8711   0.0764   0.7947
6       0.8109   0.0491   0.7618       0.9099   0.1034   0.8066
8       0.7871   0.0482   0.7389       0.8951   0.0806   0.8145
10      0.8474   0.0774   0.7700       0.9111   0.0968   0.8143
12      0.7951   0.0634   0.7318       0.8841   0.0810   0.8031
14      0.7735   0.0594   0.7142       0.8415   0.0658   0.7757
16      0.8322   0.0664   0.7658       0.7879   0.0646   0.7233
18      0.8131   0.0829   0.7302       0.8267   0.0596   0.7672
20      0.8026   0.0711   0.7315       0.7648   0.0523   0.7125

Visual and Lidar fusion

        Building Class                  Trees Class
h_c     TP       FP       TP-FP        TP       FP       TP-FP
2       0.9478   0.0273   0.9204       0.9552   0.0291   0.9261
4       0.9651   0.0499   0.9151       0.9516   0.0273   0.9243
6       0.9576   0.0346   0.9230       0.9527   0.0280   0.9248
8       0.9609   0.0476   0.9133       0.9512   0.0278   0.9234
10      0.9604   0.0151   0.9453       0.9504   0.0277   0.9227
12      0.9590   0.0197   0.9393       0.9483   0.0252   0.9231
14      0.9424   0.0225   0.9199       0.9420   0.0227   0.9193
16      0.9433   0.0481   0.8952       0.9471   0.0224   0.9248
18      0.9189   0.0996   0.8194       0.9457   0.0226   0.9231
20      0.9188   0.1236   0.7952       0.9419   0.0225   0.9195

[Figure: Classification maps for visual imagery only and for visual and Lidar fusion, alongside the truth map]

3D Object Modeling

Need for further optimization of 3D buildings

• Sparse depth maps obtained from stereo mosaics
• Height of each point quantized into levels defined by the view angle of the fixed line used for building the stereo mosaics

Project initial 3D models on to individual video frames

Minimize the distance between the projected corners of the building and the actual corners detected in the 2D images

Initial estimate of orthorectified, georeferenced 3D corners $(X, Y, Z)$

For each surface s:

1. Project the initial 3D corners on to the individual video frames in which the building is visible, giving the projected 3D corners $(\hat{x}, \hat{y})$:

$$\begin{bmatrix} k\hat{x} \\ k\hat{y} \\ k \end{bmatrix} = K R \left[ I_{3 \times 3} \mid -T \right] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

2. Find corresponding point pairs between the projected 3D corners and the actual 2D corners $(x, y)$ in the individual video frames

3. Optimize the 3D position of the corners in object space to reduce the sum of squared distances between each corner’s projected 2D coordinates and the actual 2D coordinates of the points in the video frames (a nonlinear least squares problem, solved with the Levenberg-Marquardt algorithm):

$$\min_{(X, Y, Z)} \sum_{i=1}^{N} (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2$$

4. Fit a RANSAC least squares plane through the optimized 3D corners of the surface to remove outliers due to bad point pairs

5. Recalculate the accurate 3D positions of the corners using the plane equation

Over all surfaces:

Combine all the corners of adjacent surfaces with common edges by applying appropriate constraints to build an accurate 3D model of the building
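The corner refinement step can be sketched as a small reprojection-error minimization. The block below is a minimal illustration in NumPy with a hand-written damped Gauss-Newton (Levenberg-Marquardt style) loop and a finite-difference Jacobian; the camera values, the nadir-looking rotation, the noise-free observations, and the function names are all assumptions for the example rather than the seminar's implementation.

```python
import numpy as np

def project(K, R, T, P):
    """Project a 3D point into a frame: k[x, y, 1] = K R (P - T)."""
    p = K @ R @ (P - T)
    return p[:2] / p[2]

def reprojection_residuals(P, K, cameras, observations):
    """Stacked differences between projected and observed 2D corners."""
    return np.concatenate([project(K, R, T, P) - obs
                           for (R, T), obs in zip(cameras, observations)])

def refine_corner(P0, K, cameras, observations, n_iter=50, damping=1e-3, eps=1e-6):
    """Minimize the sum of squared reprojection errors over (X, Y, Z)."""
    f = lambda P: reprojection_residuals(P, K, cameras, observations)
    P = np.asarray(P0, dtype=float)
    cost = np.sum(f(P) ** 2)
    for _ in range(n_iter):
        r = f(P)
        # Forward-difference Jacobian, one column per coordinate.
        J = np.column_stack([(f(P + eps * e) - r) / eps for e in np.eye(3)])
        H = J.T @ J
        step = np.linalg.solve(H + damping * np.diag(np.diag(H)), -J.T @ r)
        new_cost = np.sum(f(P + step) ** 2)
        if new_cost < cost:      # accept the step and trust the model more
            P, cost, damping = P + step, new_cost, damping * 0.1
        else:                    # reject the step and damp more heavily
            damping *= 10.0
    return P

# Hypothetical setup: five nadir-looking frames about 500 m above the scene.
K = np.diag([1000.0, 1000.0, 1.0])
R_nadir = np.diag([1.0, -1.0, -1.0])     # camera z axis points down
centers = [(-60.0, -10.0), (-30.0, 5.0), (0.0, 0.0), (30.0, -5.0), (60.0, 10.0)]
cameras = [(R_nadir, np.array([cx, cy, 500.0])) for cx, cy in centers]

true_corner = np.array([12.0, 8.0, 10.0])
observations = [project(K, R, T, true_corner) for R, T in cameras]

initial_corner = np.array([13.5, 6.5, 6.0])   # coarse estimate from the mosaics
refined_corner = refine_corner(initial_corner, K, cameras, observations)
print(refined_corner)
```

With real detections the observations carry noise and some point pairs are wrong, which is why the refined corners are then passed through the RANSAC plane fit described in the steps above.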

Results – 3D Object Modeling


[Figure: Building surfaces identified by the Bayesian Network]

[Figure: Surfaces belonging to a single building identified using connected components]

Summary and Conclusions

Automatically built stereo mosaics from video frames over large scenes using ray interpolation

Automatically identified 3D objects like buildings and trees in the scene using features from stereo mosaics

Improved the accuracy of 3D object identification and modeling by fusing Lidar data with visual imagery

Accurately modeled 3D buildings in the scene

In summary, an automated system has been designed to model 3D buildings from aerial video

Original Contributions


A segment-based mesh design for aerial triangulation without any prior 3D knowledge of the scene has been developed

• This new design helps in avoiding visual artifacts in the parallel-perspective stereo mosaics that are built using ray interpolation
• Consequently, the errors in the final 3D models of buildings are reduced
• Publication: P. Gurram, E. Saber, and H. Rhody, “Segment-based Mesh Design for Building Parallel-Perspective Stereo Mosaics”, Accepted for publication in IEEE Transactions on Geoscience and Remote Sensing

A novel method to set the input parameters of vision algorithms like color segmentation using the data-driven probabilistic inference in Bayesian networks has been designed

• This method automates the 3D object identification process and precludes the need for manual intervention to set the accurate input parameters for best quality of the final 3D building models
• Publication: P. Gurram, E. Saber, and H. Rhody, “An Automated System for Modeling 3D Buildings from Aerial Video”, to be submitted to ASPRS Photogrammetric Engineering and Remote Sensing Journal

Future Work

• Structural learning of Bayesian Networks
• Design of hierarchical Bayesian Networks
• Fusion of Lidar data with visual imagery during 3D modeling step

Acknowledgements

• Dr. Eli Saber, RIT
• Dr. Harvey Rhody, RIT
• Dr. Ferat Sahin, RIT
• Major Steve Lach, USAF
• Jason Faulring, LIAS, RIT
• Matthew Montanaro
• Archana Devasia
• Mustafa Jaber