Urban 3D Semantic Modelling Using Stereo Vision, ICRA 2013
Sunando Sengupta (1), Eric Greveson (2), Ali Shahrokni (2), Philip H. S. Torr (1)
(1) Oxford Brookes Vision Group, (2) 2d3 Sensing
Urban 3D Semantic Modelling: Road Scene
• Given a sequence of stereo images, we generate a dense 3D semantic model.
[Figure: input stereo image sequence and the resulting dense 3D semantic model]
Pipeline – Semantic Reconstruction
• Stereo images
• Depth map generation
• Camera estimation
• Surface reconstruction
• Semantic labelling of street-view images
• Semantic model generation
Camera Estimation
• Track features across the left-right stereo pair and across consecutive frames.
• Use the feature tracks to estimate camera poses.
• Refine the estimates with bundle adjustment.
[a] A. Geiger et al., Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite, CVPR 2012.
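Bundle adjustment refines camera poses and 3D points by minimising the reprojection error of the tracked features. The sketch below shows only that residual, assuming an undistorted pinhole camera; the function names, the pose parametrisation (rotation matrix plus translation), and the intrinsics are illustrative assumptions, not the authors' implementation.

```python
def project(point, rotation, translation, focal, cx, cy):
    """Project a 3D world point into the image of a pinhole camera.

    rotation is a 3x3 nested list, translation a length-3 list
    (hypothetical pose parametrisation chosen for this sketch).
    """
    # Camera-frame coordinates: X_c = R * X + t
    xc = [
        sum(rotation[r][k] * point[k] for k in range(3)) + translation[r]
        for r in range(3)
    ]
    return (focal * xc[0] / xc[2] + cx, focal * xc[1] / xc[2] + cy)


def reprojection_error(observations, points, rotation, translation,
                       focal, cx, cy):
    """Sum of squared pixel errors for one camera.

    observations is a list of (point_index, (u, v)) feature measurements.
    """
    total = 0.0
    for point_id, (u, v) in observations:
        pu, pv = project(points[point_id], rotation, translation,
                         focal, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(1.0, 2.0, 10.0)]
error = reprojection_error([(0, (11.0, 20.0))], points, identity,
                           [0.0, 0.0, 0.0], 100.0, 0.0, 0.0)
```

A full bundle adjuster would minimise this quantity jointly over all cameras and points with a non-linear least-squares solver.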
Depth-Map Estimation
• Semi-global block matching [1] for disparity estimation
• Per-pixel depth computed as z = (B × f) / d
  – B: baseline
  – f: focal length
  – d: pixel disparity
[1] H. Hirschmueller, Stereo Processing by Semi-Global Matching and Mutual Information, PAMI 2008.
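As a worked instance of the depth formula, the following sketch converts a single disparity value into metric depth. The function name, the units (baseline in metres, focal length and disparity in pixels), and the handling of non-positive disparities are assumptions made for illustration.

```python
def disparity_to_depth(disparity_px, baseline_m, focal_px):
    """Depth z = B * f / d for one pixel.

    Returns None for non-positive disparities, which carry no valid
    depth (an assumed convention for this sketch).
    """
    if disparity_px <= 0:
        return None
    return baseline_m * focal_px / disparity_px


# Hypothetical rig: 0.5 m baseline, 700 px focal length, 10 px disparity.
depth = disparity_to_depth(10.0, 0.5, 700.0)
```

Because depth is inversely proportional to disparity, a fixed disparity error produces a larger depth error for distant points.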
Depth Fusion
• Depth estimates are fused using the camera poses.
• They are fused into a truncated signed distance function (TSDF) volumetric representation [1].
[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images, SIGGRAPH 1996.
TSDF Volume [1]
• The entire space is divided into grids of voxels.
• For each voxel, compute the truncated signed distance:
  – positive, and increasing, when the voxel lies in free space
  – negative when it lies behind the surface
  – zero when it lies on the surface
• Performed for all depth maps.
[1] B. Curless and M. Levoy, A Volumetric Method for Building Complex Models from Range Images, SIGGRAPH 1996.
TSDF Volume
[Figure: 2D slice of the TSDF volume showing the camera, the actual surface, and the value stored in each voxel]
-1.0  -0.8  -0.3   0.2   0.8   1.0   1.0   1.0
-1.0  -0.9  -0.4   0.1   0.5   1.0   1.0   1.0
-1.0  -1.0  -0.8  -0.2   0.1   1.0   1.0   1.0
-1.0  -1.0  -0.9  -0.3   0.2   0.8   1.0   1.0
-1.0  -1.0  -0.9  -0.4   0.3   0.9   1.0   1.0
-1.0  -1.0  -0.8  -0.3   0.3   0.9   1.0   1.0
-1.0  -1.0  -0.9  -0.5   0.2   0.8   1.0   1.0
-1.0  -1.0  -0.6   0.1   0.7   1.0   1.0   1.0
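The TSDF update can be illustrated on a single camera ray. The sketch below assumes voxels indexed by their depth along the ray, a running average with unit weights in the style of Curless and Levoy, and a truncation distance `trunc`; all names and the unit-weight choice are assumptions rather than details from the slides.

```python
def fuse_ray(tsdf, weight, voxel_depths, measured_depth, trunc):
    """Fuse one depth measurement into the voxels along a camera ray.

    tsdf and weight are lists updated in place. Values are normalised
    to [-1, 1]: positive in free space, negative behind the surface.
    """
    for i, z in enumerate(voxel_depths):
        sdf = measured_depth - z  # > 0 between the camera and the surface
        if sdf < -trunc:
            continue  # far behind the surface: occluded, leave untouched
        value = max(-1.0, min(1.0, sdf / trunc))
        # Running average over all measurements seen so far.
        tsdf[i] = (tsdf[i] * weight[i] + value) / (weight[i] + 1)
        weight[i] += 1


voxel_depths = [1.0, 2.0, 3.0, 5.0]
tsdf = [0.0] * 4
weight = [0] * 4
fuse_ray(tsdf, weight, voxel_depths, measured_depth=2.0, trunc=1.0)
fuse_ray(tsdf, weight, voxel_depths, measured_depth=3.0, trunc=1.0)
```

Averaging over many depth maps suppresses noise in individual disparity estimates, and the surface is later extracted where the fused values cross zero.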
Incremental Volume Update
• Road scenes are long sequences of arbitrary length.
• A 3x3x1 volume of voxel grids is initialised.
• Volume is added incrementally as the vehicle moves out of the region.
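One way to picture the incremental update is a dictionary of voxel grids keyed by integer block index. In this sketch, a 3x3x1 neighbourhood is kept allocated around the vehicle and missing blocks are created as it moves. The re-centring policy, the block size, and the function names are assumptions for illustration only.

```python
import math


def block_index(position, block_size):
    """Integer (x, y) index of the block containing a ground-plane position."""
    x, y = position
    return (math.floor(x / block_size), math.floor(y / block_size))


def ensure_neighbourhood(blocks, position, block_size):
    """Allocate any missing block in the 3x3x1 neighbourhood of the vehicle.

    blocks maps (ix, iy, iz) to a voxel grid; an empty dict stands in
    for the real grid here. Returns the keys that were newly added.
    """
    cx, cy = block_index(position, block_size)
    added = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            key = (cx + dx, cy + dy, 0)
            if key not in blocks:
                blocks[key] = {}
                added.append(key)
    return added


blocks = {}
initial = ensure_neighbourhood(blocks, (0.5, 0.5), block_size=10.0)
after_move = ensure_neighbourhood(blocks, (12.0, 0.5), block_size=10.0)
```

Only blocks near the vehicle need to be touched per frame, so memory grows with the distance driven instead of being fixed up front.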
Semantic Image Segmentation
• We use a conditional random field (CRF) framework.
• Each pixel is a node in a grid graph G = (V, E).
• Each node is a random variable x taking a label from the label set.
[Figure: input image, CRF construction, final segmentation]
[1] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, "Associative hierarchical CRFs for object class image segmentation," in ICCV, 2009.
• Total energy E = E_pix + E_pair + E_region
• E_pix: models an individual pixel's cost of taking a label.
  – Computed via the dense boosting approach
  – Multi-feature variant of TextonBoost [1]
  – Example costs for a pixel x: Car 0.2, Road 0.3
• E_pair: models each pixel's interaction with its neighbourhood.
  – Encourages label consistency between adjacent pixels while remaining sensitive to edges.
  – Contrast-sensitive Potts model: neighbouring pixels x_i and x_j pay cost 0 when their labels agree (Car/Car, Road/Road) and g(i,j) when they differ (Car/Road).
• E_region: models the behaviour of a group of pixels.
  – Encourages all the pixels in a region to take the same label.
  – Groups of pixels are given by multiple mean-shift segmentations.
  – Example costs for a region c: Car 0.3, Road 0.1
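The three-term energy can be evaluated for a candidate labelling as sketched below. This is a deliberately simplified stand-in: the exponential form of g(i,j), the flat penalty for any region whose pixels disagree, and all names are assumptions, and the region term is much cruder than the higher-order potentials of the cited CRF framework.

```python
import math


def contrast_weight(ci, cj, beta=1.0):
    """Hypothetical g(i, j): shrinks as the intensity difference grows."""
    return math.exp(-beta * (ci - cj) ** 2)


def total_energy(labels, unary, intensity, edges, regions, region_penalty):
    """E = E_pix + E_pair + E_region for one labelling.

    unary[i][label] is the per-pixel cost, edges lists neighbouring
    pixel pairs, and regions lists groups of pixel indices.
    """
    e_pix = sum(unary[i][labels[i]] for i in range(len(labels)))
    # Potts: pay g(i, j) only when neighbouring labels differ.
    e_pair = sum(
        contrast_weight(intensity[i], intensity[j])
        for i, j in edges
        if labels[i] != labels[j]
    )
    # Simplified region term: flat penalty for an inconsistent region.
    e_region = sum(
        region_penalty
        for region in regions
        if len({labels[i] for i in region}) > 1
    )
    return e_pix + e_pair + e_region


unary = [{"car": 0.2, "road": 0.3}] * 3
intensity = [0.0, 0.0, 0.0]
edges = [(0, 1), (1, 2)]
regions = [[0, 1], [1, 2]]
mixed = total_energy(["car", "car", "road"], unary, intensity,
                     edges, regions, region_penalty=0.5)
uniform = total_energy(["car", "car", "car"], unary, intensity,
                       edges, regions, region_penalty=0.5)
```

In this toy case the uniform labelling has the lower energy, which is the behaviour the pairwise and region terms are meant to encourage; inference in practice searches for the minimum-energy labelling instead of scoring candidates one by one.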
Semantic Image Segmentation - Results
• Input images, output of our image-level CRF, and ground truth.
Mesh Face Labelling
• A histogram of labels (Z_f) is built for each mesh face by projecting points from the face into the labelled images.
• The majority label is taken as the label of the face.
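The voting step can be sketched as follows, assuming the labels observed for a face have already been collected by projecting its points into the labelled images. The function name and the handling of faces with no observations are assumptions.

```python
from collections import Counter


def label_face(projected_labels):
    """Return (majority_label, histogram) for one mesh face.

    projected_labels holds one label per projection of a face point
    into a labelled image. A face with no observations gets None.
    """
    histogram = Counter(projected_labels)
    if not histogram:
        return None, histogram
    label, _count = histogram.most_common(1)[0]
    return label, histogram


face_label, face_histogram = label_face(
    ["road", "car", "road", "pavement", "road"]
)
```

Voting across several views makes the face label robust to segmentation errors in any single image.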
Semantic Model
Top: Left – surface reconstruction, Right – semantic model
Bottom: Left – input image, Right – object label set
Evaluation
• The model is projected back using the estimated camera poses to create labelled images.
• Points in the model that are far from the camera are ignored in the projection.
• Metrics (tp, fp, fn: true positives, false positives, false negatives)
  – Recall = tp / (tp + fn)
  – Intersection vs Union = tp / (tp + fn + fp)
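Both metrics can be computed per class from paired predicted and ground-truth labels, as in the sketch below. The flat label lists and the convention of returning 0.0 when a denominator is empty are assumptions for illustration.

```python
def class_scores(predicted, ground_truth, cls):
    """Return (recall, intersection_vs_union) for one class."""
    tp = fp = fn = 0
    for p, g in zip(predicted, ground_truth):
        if p == cls and g == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif g == cls:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    iou = tp / (tp + fn + fp) if (tp + fn + fp) else 0.0
    return recall, iou


predicted = ["road", "road", "car", "car"]
ground_truth = ["road", "car", "car", "road"]
road_recall, road_iou = class_scores(predicted, ground_truth, "road")
```

Intersection vs Union is the stricter of the two because false positives also lower the score.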
Video
Future Work
http://cms.brookes.ac.uk/research/visiongroup/projects
• Use semantics to build the structure.
• Real-time implementation.
• Combine image level information and geometric contextual information.
Thank you!!!