Sparse coding for image/video denoising and superresolution

Sparse Coding & Dictionary Learning for Image Denoising & Super-Resolution

Yu Huang Sunnyvale, California

[email protected]


Outline

• The sparse-land model

• What is sparse coding?

• Methods of solving sparse coding

• Orthogonal Matching Pursuit (OMP)

• Strategy of dictionary selection (dictionary learning)

• What is the K-SVD algorithm?

• Image denoising

– Apply Sparse Coding for Denoising

– Learned Simultaneous Sparse Coding

– Locally Learned Dictionaries

– Clustering-based Sparse Represent.

• Image super-resolution

– Sparse coding for SR

– Joint dictionary learning for SR

– Self similarities & group sparsity for SR

– Adaptive sparse domain selection in SR

– Semi-coupled dictionary learning-based SR

• References

Appendix

• K-nearest neighbor;

• PCA, AP and spectral clustering;

• NMF and pLSA;

• ISOMAP;

• Locally Linear Embedding;

• Laplacian eigenmap;

• Gaussian mixture and EM;

• Graphical model;

• Generative model: MRF;

• Discriminative model: CRF;

• Graph cut;

• Belief propagation.

The Sparseland Model

• Defined as a set {D, X, Y} such that

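D X = Y (each signal yi, a column of Y, is a combination of a few atoms of the dictionary D, weighted by its sparse code xi, a column of X)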

What is Sparse Coding?

• Given a D and yi, how to find xi ?

• Constraint : xi is sufficiently sparse;

• Finding exact solution is difficult;

• Approximate a solution good enough?

Methods of Solving Sparse Coding • Greedy methods: projecting the residual on some atom;

– Matching pursuit, orthogonal matching pursuit;

• L1-norm: Least Absolute Shrinkage and Selection Operator (LASSO); – Basis pursuit; – The residual is updated iteratively in the direction of the atom;

• Gradient-based finding new search directions – Projected Gradient Descent – Coordinate Descent

• Homotopy: a set of solutions indexed by a parameter (regularization) – LARS (Least Angle Regression)

• First order/proximal methods: Generalized gradient descent – solving efficiently the proximal operator – soft-thresholding for L1-norm – Accelerated by the Nesterov optimal first-order method

• Iterative reweighting schemes – L2-norm: Chartrand and Yin (2008) – L1-norm: Candès et al. (2008)
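As an illustration of the first-order/proximal family above, here is a minimal sketch of iterative soft-thresholding (ISTA) for the L1-regularized problem. The function names `soft_threshold` and `ista`, the step-size choice and the toy data are assumptions made for this example; the Nesterov-accelerated variant mentioned above is not included.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: element-wise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(D, y, lam, n_iter=500):
    """Minimize 0.5 * ||y - D x||_2^2 + lam * ||x||_1 by proximal gradient steps."""
    # Step size 1/L, where L = ||D||_2^2 is the Lipschitz constant of the gradient.
    step = 1.0 / np.linalg.norm(D, 2) ** 2
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y)                     # gradient of the quadratic term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Toy problem (assumed data): a 3-sparse code over a random unit-norm dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 1.0]
y = D @ x_true
x_hat = ista(D, y, lam=0.05)
```

With the step size 1/L each iteration does not increase the objective, which is why no line search is needed in this sketch.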

Orthogonal Matching Pursuit (OMP)

• Input: D, y; output: x.

• Select dk with max projection on residue;

• xk = arg min ||y - Dk xk||;

• Update residue: r = y - Dk xk;

• Check terminating condition; if it is not met, go back to the selection step.
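The OMP loop can be sketched in a few lines of NumPy. This is an illustrative version, not a tuned implementation: the function name `omp`, the stopping tolerance `tol` and the sparsity budget `n_nonzero` are assumptions, and the atoms of D are expected to have unit norm.

```python
import numpy as np

def omp(D, y, n_nonzero, tol=1e-8):
    """Greedy sparse coding: approximate y with at most n_nonzero atoms of D."""
    residual = y.astype(float)
    support = []                      # indices of the selected atoms
    coeffs = np.zeros(0)
    x = np.zeros(D.shape[1])
    for _ in range(n_nonzero):
        # Select the atom with the largest projection on the residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:              # no new atom helps any further
            break
        support.append(k)
        # Least-squares fit on all selected atoms (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        # Update the residual; it is orthogonal to every selected atom.
        residual = y - D[:, support] @ coeffs
        if np.linalg.norm(residual) < tol:
            break
    x[support] = coeffs
    return x
```

Re-fitting all selected coefficients at every step is what distinguishes OMP from plain matching pursuit, which only updates the coefficient of the newly selected atom.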

Features of OMP

• A greedy algorithm, better than MP; – Able to find approximate solution;

• Full backward orthogonality of error;

• Close to the exact solution if the sparsity level T is really small;

• Simplistic in nature.

Strategy of Dictionary Selection • What D to use? • A fixed overcomplete set of basis: no adaptivity.

• Steerable wavelet; • Bandlet, curvelet, contourlet; • DCT Basis; • Gabor function; • ….

• Data adaptive dictionary – learn from data;

• K-SVD: a generalized K-means clustering process for Vector Quantization (VQ). – An iterative algorithm to effectively optimize the sparse approximation of signals in a learned dictionary.

• Other methods of dictionary learning: – non-negative matrix decompositions. – sparse PCA (sparse dictionaries). – fused-lasso regularizations (piecewise constant dictionaries)

• Extending the models: Sparsity + Self-similarity=Group Sparsity

What is the K-SVD Algorithm?

• Initialize Dictionary: – Select atoms from input; – Atoms can be image patches; – Patches are overlapping.

• Sparse Coding (OMP): – Use OMP or any pursuit method; – Output sparse code for all signals; – Minimize representation error.

• Update Dictionary: one atom at a time.
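The "one atom at a time" dictionary update can be sketched as a rank-1 SVD fit of the residual restricted to the signals that use the atom. The function name `ksvd_atom_update` and the in-place convention are assumptions of this sketch; D is the dictionary, X the sparse codes and Y the training signals with Y ≈ D X.

```python
import numpy as np

def ksvd_atom_update(D, X, Y, k):
    """Update atom k of D and its coefficients in X in place (Y is approximated by D @ X)."""
    users = np.nonzero(X[k, :])[0]        # signals whose sparse code uses atom k
    if users.size == 0:
        return                            # unused atom: leave it unchanged
    # Residual of those signals with the contribution of atom k removed.
    E = Y[:, users] - D @ X[:, users] + np.outer(D[:, k], X[k, users])
    # The best rank-1 approximation of E gives the new atom and coefficients.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    D[:, k] = U[:, 0]                     # unit-norm atom
    X[k, users] = s[0] * Vt[0, :]         # coefficients; the support is preserved
```

Restricting the update to `users` keeps the sparsity pattern of X fixed, so the representation error cannot increase during the update.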

Optimization for Dictionary Learning • Classical optimization alternates between D and α, but is very slow.

• Online learning handles potentially infinite or dynamic datasets;

– Dramatically faster than batch algorithms;

Sparse PCA • Given data x1, . . . , xn ∈ Rp, two views of PCA:

– Analysis view: find the projection of maximum variance (with deflation to obtain more components);

– Synthesis view: find the basis d1, . . . , dk such that all xi have low reconstruction error when decomposed on this basis • Find d1, . . . , dk ∈ Rp sparse so that the reconstruction error $\sum_{i} \min_{\alpha_i \in \mathbb{R}^k} \| x_i - \sum_{j=1}^{k} (\alpha_i)_j d_j \|_2^2$ is small

• Penalize/constrain dj/αi by l1-norm/l2-norm;

• For regular PCA, the two views are equivalent

• Sparse extensions – Interpretability;

– High-dimensional inference;

– Two views are different;

Image Denoising • Various assumptions of content internal structures;

• Learning-based – Field of experts (MRF), NN, CRF,…;

– Sparse coding: K-SVD, LSSC,….

• Self-similarity – Gaussian, Median;

– Bilateral filter, anisotropic diffusion;

– Non-local means.

• Sparsity prior – Wavelet shrinkage;

• Use of both Redundancy and Sparsity – BM3D (block matching 3-d filter): benchmark;

Apply Sparse Coding for Denoising

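• A cost function for the noise model Y = Z + n, where Y is the noisy image, Z the clean image and n the noise;

• Solve for Z by minimizing a data-fidelity term plus a prior term;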

• Break problem into smaller problems

• Aim at minimization at the patch level.

• Terms of the patch-level cost function: – Global proximity; – Sparsity of the representations; – Proximity of selected patch.
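The three terms named on this slide match the patch-based cost function of Elad and Aharon (2006, listed in Reference I). A sketch of that cost in LaTeX follows; the notation is assumed here, with $R_{ij}$ the operator that extracts the patch at position $(i,j)$, $\alpha_{ij}$ its sparse code over the dictionary $D$, and $\lambda$, $\mu_{ij}$ the weights.

```latex
\{\hat{\alpha}_{ij}, \hat{Z}\} = \arg\min_{\alpha_{ij},\, Z}\;
  \underbrace{\lambda \,\| Z - Y \|_2^2}_{\text{global proximity}}
  \;+\; \underbrace{\sum_{ij} \mu_{ij} \,\| \alpha_{ij} \|_0}_{\text{sparsity of the representations}}
  \;+\; \underbrace{\sum_{ij} \| D \alpha_{ij} - R_{ij} Z \|_2^2}_{\text{proximity of selected patch}}
```

With the codes $\alpha_{ij}$ fixed, the cost is quadratic in $Z$, which is what makes the final averaging step a closed-form computation.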

Image Data

• Extract overlapping patches from a single image; – clean or corrupted, even reference (multiple frames)?

– for example, 100k patches of size 8x8;

• Apply K-SVD to train a dictionary; – Size of 64x256 (n=64, dictionary size k=256).

– Lagrange multiplier lambda = 30/sigma, where sigma is the noise standard deviation;

• The coefficients from OMP; – the maximal iteration is 180 and noise gain C=1.15;

– the number of nonzero elements L=6 (sigma=5).

• Denoising by normalized weighted averaging:
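The normalized weighted averaging step can be sketched as follows: every denoised patch is added back at its position, the noisy image is added with weight lambda, and each pixel is divided by its total weight. The function name `aggregate_patches` and its argument layout are assumptions for illustration; the patches are assumed to be already denoised (for example D times the OMP coefficients).

```python
import numpy as np

def aggregate_patches(noisy, patches, positions, lam):
    """Per-pixel average: (lam * noisy + sum of patches) / (lam + number of covering patches)."""
    num = lam * noisy.astype(float)              # weighted noisy image
    den = np.full(noisy.shape, float(lam))       # accumulated weight per pixel
    for patch, (r, c) in zip(patches, positions):
        h, w = patch.shape
        num[r:r + h, c:c + w] += patch           # put the patch back in place
        den[r:r + h, c:c + w] += 1.0
    return num / den                             # lam > 0 keeps the denominator positive

# Usage with assumed toy data: here the "denoised" patches are exact copies
# of the image patches, so the aggregation reproduces the image itself.
image = np.arange(16, dtype=float).reshape(4, 4)
positions = [(r, c) for r in range(3) for c in range(3)]
patches = [image[r:r + 2, c:c + 2].copy() for r, c in positions]
restored = aggregate_patches(image, patches, positions, lam=0.5)
```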

Extended to Color Images

• Color correction in OMP:

– put more importance on the proximity of the mean value of the color patches.

– Coefficient gamma = 5.25;

• channels R, G and B are concatenated in the sparseland model.

Block Matching 3-D for Denoising

• For each patch, find similar patches;

• Group the similar patches into a 3-d stack;

• Perform a 3-D transform (2-d + 1-d) and coefficient thresholding (sparsity);

• Apply inverse 3-D transform (1-d + 2-d);

• Also combine multiple patches in a collaborative way (aggregation);

• Two stages: hard thresholding first, then Wiener filtering (soft).

BM3D Outline

[Figure: denoising comparison with panels Noisy, K-SVD, BM3D and NL Means.]

Locally Learned Dictionaries (K-LLD)

• Identify dictionary which best captures underlying geometric structure;

• Similar structures will have similar dictionary, similar weights;

• Cluster image based on geometric similarity (K-Means on the SKR weights);

• Learn dictionary and order of regression for each cluster;

• Performance is between K-SVD and BM3D.

K-LLD Outline

[Flowchart with stages: Noisy Image, Calculate weights, Clustering, Learn dictionaries, Kernel Regression, Iterate, Denoised Image.]

Learned Simultaneous Sparse Coding • Idea: combine dictionary learning and grouping;

– Non-local Means: self-similarity;

– Dictionary learning: sparse coding.

• Different from BM3D, which uses:

– Classical fixed orthogonal dictionaries;

• Problem in Sparse Coding: unstable sparse decompositions may cause reconstruction artifacts;

• LSSC model: A joint sparsity pattern imposed through a grouped-sparsity regularizer

• The performance is a little better than BM3D and K-SVD.


Clustering-based Sparse Represent.

• Idea: Combination of local and global sparsity;

– Dictionary learning (K-SVD);

– Structural clustering (BM3D);

• CSR Model:

– PCA/k-means Sparse coding (alpha) + k-NN clustering (beta);

• Equivalence of sparse coding and Bayesian network:

– Clustering in CSR looks like a 2nd stage sparse coding.

• Performance: better than K-SVD, close to BM3D.

• Question: globally get but locally fit?


Image Super-Resolution (SR) • SR: how to find missing details/HF comp? • Interpolation-based:

– Edge-directed; – B-spline; – Sub-pixel alignment;

• Reconstruction-based; – Gradient prior; – TV (Total Variation); – MRF (Markov Random Field).

• Learning-based (hallucination). – Example-based: texture synthesis, LR-HR mapping; – Self learning: sparse coding, self similarity-based;

• Estimate missing HR detail that isn’t present in the original LR image, and which we can’t make visible by simple sharpening;

• Image database with HR/LR image pairs;

• Algorithm uses a training set to learn the fine details of LR;

• It then uses learned relationships (MRF) to predict fine details.

What is Example Based SR?

One pass Algorithm

SR from a Single Image

• Multi-frame-based SR (alignment);

• Example-based SR.

SR from a Single Image

• Combination of Example-based and Multi-frame-based.

[Figure: patch search at the same scale and across different scales (Find NN, Parent, Copy); comparison of Example-based, Edge Statistics and Single Frame results.]

Sparse Coding for SR [Yang et al.08] • HR patches have a sparse representation w.r.t. an over-complete dictionary of patches randomly sampled from similar images.

• Sample 3 x 3 LR overlapping patches y on a regular grid.

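• Output HR patch: $x \approx D_h \alpha$ for some $\alpha \in \mathbb{R}^K$ with $\|\alpha\|_0 \ll K$, where $D_h$ is the HR dictionary.

• The input LR patch satisfies $y = L x \approx L D_h \alpha = D_l \alpha$: linear measurements of the sparse coefficient vector $\alpha$!

– $D_l = L D_h$: dictionary of low-resolution patches;

– $L$: downsampling/blurring operator.

• If we can recover the sparse solution $\alpha$ to the underdetermined system of linear equations $y = D_l \alpha$, we can reconstruct $x$ as $x = D_h \alpha$.

• Convex relaxation: replace the $\ell_0$ norm of $\alpha$ by its $\ell_1$ norm.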

T, T’: select overlap between patches F : 1st and 2nd derivatives from LR bicubic interpolation.

Sparse Coding for SR [Yang et al.08]

• Two training sets:

– Flower images: smooth areas, sharp edges;

– Animal images: HF textures;

• Randomly sample 100,000 HR-LR patch pairs from each set of training images.

[Figure: results of Sparse coding, MRF / BP [Freeman IJCV ‘00] and Bicubic interpolation, next to the Original.]

Joint Dictionary Learning for SR • Local sparse prior for detail recovery;

• Global constraints for artifact avoiding (L=SH);

• Joint dictionary learning:

– Annotations of the optimization: one operator extracts the overlap region; one vector holds the previous reconstruction on the overlap; a weight controls the tradeoff between matching the LR input and finding a neighbor-compatible HR patch.

Solved by back-projection: a gradient descent method

[Figure: Input LR image with results of Bicubic interpolation, Sparse coding and MRF / BP [Freeman IJCV ‘00].]

Self Similarities & Group Sparsity • Generate HR-LR patch pairs from image pyramid; • Like LSSC, grouping patch pairs by k-means clustering (Note: not only within the scale, but also cross scales);

[Pipeline figure: ANN search; bicubic interpolation; fill the uncovered area with the back projection method.]

Self Similarities & Group Sparsity • Features extracted for clustering (1st and 2nd gradients);

• For each cluster, run sparse coding; then reconstruct HR patches.

Adaptive Sparse Domain Selection

• Stability of sparse decomposition by domain selection, i.e. sub-dictionary learning (via PCA) after clustering features (k-means);

• Adaptive selection of sub-dictionary (wavelet iterative shrinkage);

• Local structure encoding with the piecewise AR models;

• Non-local similarities constraints for regularization;

• Reweighted sparsity for regularization as well;

• 727,615 patches of size 7×7 randomly from training images;

• 200 clusters initially, then merge;

• Computational cost: image 256x256, 100 iterations, 2~5 minutes.

• Terms of the objective function: the fidelity term; local AR model; non-local similarity regularization; sparsity penalty.

Adaptive Sparse Domain Selection

[Figure: LR input with results of Sparse coding and Sparse Domain Selection.]

Semi-Coupled Dictionary Learning • Dictionary pair Dh, Dl and a mapping function W will be simultaneously learned; – Not fully coupled learning; – clustering in sparse and exploiting nonlocal similarities;

• Training: dictionary and mapping update;

• Synthesis: reconstruct

[Figure: LR input and ground truth with results of Bicubic interpolation, Sparse coding and Semi-coupled dictionary learning.]

Reference I: Denoising

• K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE T-IP, 16(8):2080–2095, 2007.

• M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE T-IP, 15(12):3736–3745, 2006.

• J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. ICCV, 2009 (LSSC).

• S. Roth and M. Black. Fields of experts. IJCV, 82(2):205–229, 2009.

• C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. ICCV, pages 839–846, 1998.

• H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: can plain neural networks compete with BM3D? CVPR, 2012.

• A. Buades, B. Coll, and J. Morel. A non local algorithm for image denoising. CVPR, 2005.

• W. Dong, X. Li, L. Zhang, and G. Shi. Sparsity-based image denoising via dictionary learning and structural clustering. CVPR, 2011.

• P. Chatterjee and P. Milanfar. Clustering-based denoising with locally learned dictionaries (K-LLD). IEEE T-IP, 18(7), July 2009.

Reference II: SR

• W. T. Freeman, T. R. Jones, and E. C. Pasztor. Example-based super-resolution. IEEE CGA, 2002.

• D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. IEEE CVPR, 2009.

• J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution as sparse representation of raw image patches. CVPR, 2008.

• J. Yang, J. Wright, T. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE T-IP, 19(11):2861–2873, 2010.

• C.-Y. Yang, J.-B. Huang, and M.-H. Yang. Exploiting self-similarities for single frame super-resolution. ACCV, 2010.

• J. Sun, Z. Xu, and H. Y. Shum. Image super-resolution using gradient profile prior. CVPR, 2008.

• Y. HaCohen, R. Fattal, and D. Lischinski. Image upsampling via texture hallucination. IEEE ICCP, 2010.

• W. Dong, D. Zhang, G. Shi, and X. Wu. Image deblurring and super-resolution by adaptive sparse domain selection and adaptive regularization. IEEE T-IP, 20(7), 2011.

• S. Wang, L. Zhang, Y. Liang, and Q. Pan. Semi-coupled dictionary learning for super-resolution. CVPR, 2012.

Appendix

K-Nearest Neighbors • A non-parametric method for regression and classification;

• Input: the k closest training examples in the feature space.

• Output depends on the application:

– Classification: a class membership by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors;

– Regression: the property value for the object, computed as the average of the values of its k nearest neighbors.

• k-NN is an instance-based learning method, or lazy learning, where the function is approximated locally and computation is deferred until classification;

• A shortcoming of k-NN: sensitive to the local structure of the data.
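Both output modes fit in one small function. This is a brute-force sketch with Euclidean distance; the name `knn_predict` and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3, regression=False):
    """Predict for `query` from its k nearest training examples (exhaustive search)."""
    # Rank the training examples by Euclidean distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = [train_y[i] for i in order[:k]]
    if regression:
        return sum(nearest) / len(nearest)           # average of the neighbor values
    return Counter(nearest).most_common(1)[0][0]     # majority vote

# Assumed toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
label = knn_predict(points, labels, (0.2, 0.1), k=3)
value = knn_predict(points, values, (0, 0), k=3, regression=True)
```

All computation happens at query time, which is the "lazy learning" property noted above; nothing is fitted in advance.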

http://iopscience.iop.org/1742-5468/2010/11/P11015/fulltext

K-Nearest Neighbors • Pairwise Distance:

– Euclidean; – Mahalanobis; – city block; – correlation; – Minkowski; – Chebyshev; – Hamming; – Jaccard; – Spearman;

• kNN can be done by exhaustive search or approximate NN (kd-tree);

• Note: Condensed nearest neighbor (CNN) is an algorithm designed to reduce the data set for k-NN classification.

http://what-when-how.com/advanced-methods-in-computer-graphics/collision-detection-advanced-methods-in-computer-graphics-part-6/

PCA, AP & Spectral Clustering • Principal Component Analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components.

• This transformation is defined in such a way that the first principal component has the largest possible variance and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to the preceding components.

• PCA is sensitive to the relative scaling of the original variables.

• Also known as, or closely related to, the Karhunen–Loève transform (KLT), Hotelling transform, singular value decomposition (SVD), factor analysis, eigenvalue decomposition (EVD), spectral decomposition, etc.;

• Affinity Propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as k-means or k-medoids, AP does not require the number of clusters to be determined or estimated before running the algorithm;

• Spectral Clustering makes use of the spectrum (eigenvalues) of the data similarity matrix to perform dimensionality reduction before clustering in fewer dimensions. – The similarity matrix consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
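The PCA description above can be checked numerically with a short SVD-based sketch. The function name `pca` and its return convention are assumptions of this example; rows of X are observations.

```python
import numpy as np

def pca(X, n_components):
    """Return (components, explained_variance, scores) for the rows of X."""
    Xc = X - X.mean(axis=0)                        # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # orthonormal principal directions
    variance = s[:n_components] ** 2 / (X.shape[0] - 1)
    scores = Xc @ components.T                     # uncorrelated new variables
    return components, variance, scores
```

The singular values come out in decreasing order, so the first component carries the largest variance and each later one the largest variance that remains under the orthogonality constraint. Since PCA is sensitive to scaling, variables measured in different units are usually standardized before calling such a function.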


NMF & pLSA • Non-negative matrix factorization (NMF): a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements.

• The different types arise from using different cost functions for measuring the divergence between V and W*H and possibly by regularization of the W and/or H matrices; – squared error, Kullback-Leibler divergence or total variation (TV);

• NMF is an instance of a more general probabilistic model called "multinomial PCA", as is pLSA (probabilistic latent semantic analysis);

• pLSA is a statistical technique for two-mode (extended naturally to higher modes) analysis, modeling the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions; – Their parameters are learned using EM algorithm;

• pLSA is based on a mixture decomposition derived from a latent class model, rather than on downsizing the occurrence tables by SVD as in LSA.

• Note: an extended model, LDA (Latent Dirichlet allocation) , adds a Dirichlet prior on the per-document topic distribution.
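For the squared-error cost, NMF is commonly computed with multiplicative updates. The sketch below assumes that choice (other cost functions lead to different update rules); the function name `nmf`, the random initialization and the small constant `eps` that guards against division by zero are assumptions of the example.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a non-negative matrix V as W @ H with non-negative W and H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because each update multiplies by a non-negative ratio, non-negativity is preserved without any explicit projection step.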

NMF & pLSA

Note: d is the document index variable, c is a word's topic drawn from the document's topic distribution, P(c|d), and w is a word drawn from the word distribution of this word's topic, P(w|c). (d and w are observable variables, c is a latent variable.)

http://sens.tistory.com/319

ISOMAP • General idea:

– Approximate the geodesic distances by shortest graph distance.

– MDS (multi-dimensional scaling) using geodesic distances

• Algorithm: – Construct a neighborhood graph

– Construct a distance matrix

– Find the shortest path between every i and j (e.g. using Floyd-Warshall) and construct a new distance matrix such that Dij is the length of the shortest path between i and j.

– Apply MDS to the new distance matrix to find coordinates
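The ISOMAP steps can be sketched as follows, assuming a symmetrized k-nearest-neighbor graph that is connected. The function name `isomap` and its parameters are assumptions of this example, and the all-pairs shortest paths use a vectorized Floyd-Warshall pass.

```python
import numpy as np

def isomap(X, n_neighbors, n_components):
    """Embed the rows of X using graph shortest-path (geodesic) distances and classical MDS."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood graph: edges to the k nearest neighbors, symmetrized.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:n_neighbors + 1]   # index 0 is the point itself
        G[i, nbrs] = dist[i, nbrs]
        G[nbrs, i] = dist[i, nbrs]
    # Floyd-Warshall: allow paths through intermediate node k, for every k.
    for k in range(n):
        G = np.minimum(G, G[:, [k]] + G[[k], :])
    # Classical MDS: double-center the squared distances and take the top eigenvectors.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))
```

If the neighborhood graph is disconnected, some geodesic distances stay infinite and the MDS step breaks down, so the number of neighbors has to be large enough to connect the data.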

LLE (Locally Linear Embedding) • General idea: represent each point on the local linear subspace of the manifold as a linear combination of its neighbors to characterize the local neighborhood relations; then use the same linear coefficients for embedding to preserve the neighborhood relations in the low dimensional space;

• Compute the coefficients w for each data point by solving a constrained least-squares problem;

• Algorithm: – 1. Find weight matrix W of linear coefficients

– 2. Find low dimensional embedding Y that minimizes the reconstruction error $\Phi(Y) = \sum_i \| Y_i - \sum_j W_{ij} Y_j \|^2$

– 3. Solution: Eigen-decomposition of M=(I-W)’(I-W)

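Laplacian Eigenmaps • General idea: minimize the norm of the Laplace-Beltrami operator on the manifold

– It measures how far apart the map f sends nearby points.

– Avoid the trivial solution of f = const.

– The Laplace-Beltrami operator can be approximated by the Laplacian of the neighborhood graph with appropriate weights.

– Construct the Laplacian matrix L=D-W, where D is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$.

– $\int_M \|\nabla f\|^2$ can be approximated by its discrete equivalent $\frac{1}{2} \sum_{ij} W_{ij} (f_i - f_j)^2 = f^T L f$.

• Algorithm: – Construct a neighborhood graph (e.g., ε-ball, k-nearest neighbors).

– Construct an adjacency matrix W with weights on the graph edges, e.g. the heat kernel $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$ or simply $W_{ij} = 1$, and $W_{ij} = 0$ elsewhere.

– Minimize $f^T L f$ subject to $f^T D f = 1$.

– The generalized eigen-decomposition of the graph Laplacian is $L f = \lambda D f$.

– Spectral embedding of the Laplacian manifold: $x_i \to (f_1(i), \ldots, f_m(i))$, using the eigenvectors of the m smallest non-zero eigenvalues.

– The first eigenvector is trivial (the all one vector).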
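A sketch of the eigen-decomposition step, assuming a symmetric non-negative adjacency matrix W of a connected graph. The generalized problem L f = λ D f is solved here through the equivalent symmetric matrix D^{-1/2} L D^{-1/2}, which is one common route rather than the only one; the function name `laplacian_eigenmap` is an assumption of this example.

```python
import numpy as np

def laplacian_eigenmap(W):
    """Return (eigenvalues, eigenvectors) of L f = lambda D f, sorted by eigenvalue."""
    d = W.sum(axis=1)                       # node degrees (positive for a connected graph)
    L = np.diag(d) - W                      # graph Laplacian L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)      # ascending eigenvalues
    f = d_inv_sqrt[:, None] * vecs          # map back: f = D^{-1/2} v
    return vals, f

# Usage with an assumed toy graph: a path of 6 nodes with unit weights.
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
vals, f = laplacian_eigenmap(W)
embedding = f[:, 1:3]                       # skip the trivial constant eigenvector
```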

Gaussian Mixture Model & EM • A mixture model is a probabilistic model for representing the presence of subpopulations within an overall population;

• Mixture models are used to make statistical inferences about the properties of the sub-populations given only observations on the pooled population;

• A Gaussian mixture model can be Bayesian or non-Bayesian;

• A variety of approaches focus on maximum likelihood estimation (MLE), such as expectation maximization (EM), or on maximum a posteriori (MAP) estimation;

• EM is used to determine the parameters of a mixture with an a priori given number of components (a variant can adapt this number during the iterations);

– Expectation step: the "partial membership" of each data point in each constituent distribution is computed by calculating expectation values for the membership variables of each data point;

– Maximization step: plug-in estimates (mixing coefficients and component model parameters) are re-computed for the distribution parameters;

– Each successive EM iteration will not decrease the likelihood.

• Alternatives to EM for mixture models:

– Mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem, i.e. Gibbs sampling or Markov Chain Monte Carlo (MCMC);

– Spectral methods based on SVD;

– Graphical model: MRF or CRF.
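The E-step and M-step can be sketched for a one-dimensional Gaussian mixture. The function name `em_gmm_1d`, the quantile-based initialization and the small variance floor are assumptions made to keep the example short and deterministic.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture; returns weights, means, variances, log-likelihoods."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)     # spread the initial means over the data
    var = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    log_lik = []
    for _ in range(n_iter):
        # E-step: partial membership (responsibility) of each point in each component.
        dens = weights * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        log_lik.append(np.log(dens.sum(axis=1)).sum())
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing coefficients, means and variances.
        nk = resp.sum(axis=0)
        weights = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return weights, mu, var, log_lik
```

The recorded log-likelihood values make it easy to observe that successive iterations do not decrease the likelihood, up to the effect of the tiny variance floor.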


Graphical Models

• Graphical Models: Powerful framework for representing dependency structure between random variables.

• A graphical model represents the joint probability distribution over a set of random variables. • The graph contains a set of nodes (vertices) that represent random variables, and a set of links (edges) that represent dependencies between those random variables.

• The joint distribution over all random variables decomposes into a product of factors, where each factor depends on a subset of the variables. • Types of graphical models:

• Directed (Bayesian networks) • Undirected (Markov random fields, Boltzmann machines) • Hybrid graphical models that combine directed and undirected models, such as Deep Belief Networks, Hierarchical-Deep Models.

Generative Model: MRF • Random Field: F={F1,F2,…FM} is a family of random variables on a set S in which each Fi takes a value fi in a label set L.

• Markov Random Field: F is said to be a MRF on S w.r.t. a neighborhood N if and only if it satisfies Markov property.

– Generative model for joint probability p(x)

– allows no direct probabilistic interpretation

– define potential functions Ψ on maximal cliques A

• map joint assignment to non-negative real number

• requires normalization

• An MRF is an undirected graphical model

Discriminative Model: CRF • Conditional, not joint, probabilistic sequential models p(y|x)

• Allow arbitrary, non-independent features on the observation sequence X

• Specify the probability of possible label sequences given an observation sequence

• The probability of a transition between labels may depend on past and future observations

• Relax strong independence assumptions; no p(x) required

• A CRF is an MRF plus “external” variables, where the “internal” variables Y of the MRF are unobserved and the “external” variables X are observed

• Linear chain CRF: transition score depends on current observation

– Inference by DP as in HMM, learning by forward-backward as in HMM

• Optimization for learning CRF: discriminative model – Conjugate gradient, stochastic gradient,…

Graph Cut

• A flow network G(V, E) is defined as a directed graph where each edge (u,v) in E has a non-negative capacity c(u,v) >= 0;

• The max-flow problem is to find the flow of maximum value on a flow network G;

• A s-t cut or simply cut of a flow network G is a partition of V into S and T = V-S, such that s in S and t in T;

• A minimum cut of a flow network is a cut whose capacity is the least over all the s-t cuts of the network;

• Methods of max-flow or min-cut:

– Ford-Fulkerson method;

– "Push-Relabel" method.

http://www.hindawi.com/journals/mpe/2012/814356/fig8/
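As a concrete instance of the Ford-Fulkerson method, the sketch below uses breadth-first search to find augmenting paths (the Edmonds-Karp variant, chosen here for simplicity). The function name `max_flow` and the nested-dictionary graph format are assumptions of the example. It also returns the source side S of a minimum cut.

```python
from collections import deque

def max_flow(capacity, s, t):
    """capacity: {u: {v: c}}. Returns (max flow value, source side S of a minimum cut)."""
    # Residual graph with a reverse edge of capacity 0 for every edge.
    residual = {}
    for u, edges in capacity.items():
        for v, c in edges.items():
            residual.setdefault(u, {}).setdefault(v, 0)
            residual.setdefault(v, {}).setdefault(u, 0)
            residual[u][v] += c
    residual.setdefault(s, {})
    residual.setdefault(t, {})
    flow = 0
    while True:
        # Breadth-first search for an augmenting path from s to t.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No augmenting path left: the reachable nodes form the S side of a min cut.
            return flow, set(parent)
        # Bottleneck capacity along the path found.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow: decrease forward residuals, increase reverse residuals.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Usage with an assumed toy network.
network = {"s": {"a": 3, "b": 2}, "a": {"b": 1, "t": 2}, "b": {"t": 3}}
value, source_side = max_flow(network, "s", "t")
```

By the max-flow/min-cut theorem, the returned flow value equals the capacity of the cut between `source_side` and the remaining nodes.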

• Labeling is mostly solved as an energy minimization problem;

• Two common energy models:

– Potts Interaction Energy Model;

– Linear Interaction Energy Model.

• Graph G contains two kinds of vertices: p-vertices and i-vertices;

– all the edges in the neighborhood N, called n-links;

– edges between the p-vertices and the i-vertices called t-links.

• In the multiple labeling case, the multi-way cut should leave each p-vertex connected to one i-vertex;

• The minimum cost multi-way cut will minimize the energy function where the severed n-links would correspond to the boundaries of the labeled vertices;

• The approximation algorithms to find this multi-way cut:

– "alpha-expansion" algorithm;

– "alpha-beta swap" algorithm.

Belief Propagation

• A simplified Bayes Net: it propagates information throughout a graphical model via a series of messages sent iteratively between neighboring nodes; it is likely to converge to a consensus that determines the marginal probabilities of all the variables;

• Messages estimate the cost (or energy) of a configuration of a clique given all other cliques; the messages are then combined to compute a belief (marginal or maximum probability);

• Two types of BP methods:

– max-product;

– sum-product.

• BP provides exact solution when there are no loops in graph!

• Equivalent to dynamic programming/Viterbi in these cases;

• Loopy Belief Propagation: still provides approximate (but often good) solution;

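• Generalized BP for pairwise MRFs – Hidden variables xi and xj are connected through a compatibility function $\psi_{ij}(x_i, x_j)$;

– Hidden variables xi are connected to observable variables yi by the local “evidence” function $\phi_i(x_i, y_i)$;

• The joint probability of {x} is given by $p(\{x\}) = \frac{1}{Z} \prod_{(ij)} \psi_{ij}(x_i, x_j) \prod_i \phi_i(x_i, y_i)$, where Z is the normalization constant.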

• To improve inference by taking into account higher-order interactions among the variables; – An intuitive way is to define messages that propagate between groups of nodes rather than just single nodes;

– This is the intuition in Generalized Belief Propagation (GBP).
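On a graph without loops, sum-product BP gives exact marginals. The sketch below runs it on a chain-shaped pairwise MRF with one compatibility matrix shared by all neighboring pairs; the function name `chain_marginals` and this sharing are assumptions made to keep the example short. Here `unary[i]` plays the role of the local evidence for node i and `pairwise[a, b]` the compatibility of neighboring states a and b.

```python
import numpy as np

def chain_marginals(unary, pairwise):
    """Sum-product BP on a chain: returns the marginal distribution of every node."""
    n = len(unary)
    k = len(unary[0])
    fwd = [np.ones(k) for _ in range(n)]    # messages arriving from the left neighbor
    bwd = [np.ones(k) for _ in range(n)]    # messages arriving from the right neighbor
    for i in range(1, n):
        m = pairwise.T @ (unary[i - 1] * fwd[i - 1])
        fwd[i] = m / m.sum()                # normalize for numerical stability
    for i in range(n - 2, -1, -1):
        m = pairwise @ (unary[i + 1] * bwd[i + 1])
        bwd[i] = m / m.sum()
    # Belief = local evidence times all incoming messages, normalized.
    beliefs = [unary[i] * fwd[i] * bwd[i] for i in range(n)]
    return [b / b.sum() for b in beliefs]

# Usage with assumed toy potentials: three binary nodes that prefer to agree.
unary = [np.array([0.7, 0.3]), np.array([0.5, 0.5]), np.array([0.2, 0.8])]
pairwise = np.array([[2.0, 1.0], [1.0, 2.0]])
beliefs = chain_marginals(unary, pairwise)
```

Replacing the sums (matrix-vector products) by maximizations turns this into the max-product variant, which on a chain corresponds to the Viterbi recursion mentioned above.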

THANKS!
