Joint Optimization of Cascaded Classifiers for Computer Aided Detection by M.Dundar and J.Bi Andrey...

“Joint Optimization of Cascaded Classifiers for Computer Aided

Detection” by M.Dundar and J.Bi

Andrey KolobovBrandon Lucia

Talk OutlinePaper summaryProblem statementOverview of cascade classifiers and current

approaches for training themHigh-level idea of AND-OR framework Gory math behind AND-OR frameworkExperimental resultsDiscussion

Paper summary

• Proposes procedure for offline joint learning of cascade classifiers

• Resulting classifier is tested on polyp detection from computed tomography images

• Resulting cascade classifier is more accurate than cascade AdaBoost, on par with SVM, and faster than either

Problem statement• Polyp detection in a CT

image• Methodology: Identify candidate

structures (subwindows)

Compute features of candidate structures

Classify candidates

Cascade classifiers

A digression to previous work...

Why use cascades in the first place?

We motivate their use with Paul Viola and Michael Jones' 2004 work on detecting faces.

(Also, this is more vision-related)

Viola & Jones Face Detection Work

• Used cascaded classifiers to detect faces• To show why cascades are useful, evaluated

one big (200 feature) classifier vs. 10 20-feature classifiers– 5000 faces, 10000 non-faces– Stage n trained on faces + FP of stage n-1– Monolithic trained on union of all sets used to train

each stage in cascade


• Monolithic vs. Cascade similar in accuracy

• Cascade is ~10 times faster– Eliminate FPs early –

later stages don't think about them.


• In a small experiment, even the big classifier works– This paper claims that the big classifier only works

w/ ~10,000 (“maybe ~100,000”) negative examples– Cascaded version sifts through hundreds of

millions of negatives, since many are pared off at each stage.

Viola & Jones Face Detection Work• In a bigger experiment, they show that their 38

classifier cascade approach is 600 times faster than previous work. – Take that, previous work. (Schneiderman & Kanade,

2000)• The Point: Cascades eliminate candidates early on, so

later stages with higher complexity have to evaluate fewer candidates.

Cascade classifiers• Advantages over monolithic classifiers: faster

learning and smaller computation time• Key insight: a small number of features can reject a

big number of false candidates• To train Ci use set Ti of examples that passed

previous stages• Low false negative rate is critical at all stages

Example: cascade training with AdaBoost

• Given: set H = {h1, …, hM} of single-feature classifiers• Goal: construct cascade C1, …, CN s.t. Ci is a weighted

sum of an mi-subset of H, mi << M• Train Ci with AdaBoost to select an mi-subset using

examples that passed through previous cascade stages.• Optimal mi and N are very hard to find. They are set

empirically. Each stage is trained to achieve a some false positive rate FPi and/or false negative rate FNi

Drawbacks of modern cascade training

• Greedy; classifier at stage i is optimal for this stage but not globally.

• Drawback of AdaBoost cascade training: feature computational complexity is ignored, leading to inefficiency in early stages.

Proposition: AND-OR training• Each stage is trained for optimal system, not stage,

performance• All examples are used for training every stage• The stage classifier complexity increases down the

cascade• Parameters of different stage classifiers may be

adjusted throughout the training process, depending on how other classifiers are doing

• Motivation: a negative example is classified correctly iff it’s rejected by at least one stage (OR); a positive one must pass all stages (AND)

Proposition: AND-OR training

Review of Hyperplane Classifiers w/ Hinge Loss

Hinge Loss

If hinge loss is 0, correct classification.

If hinge loss > 0, incorrect classification.

Goal: Minimize Hinge Loss.

J (α) = Φ(α) + ∑Li=1

wi (1 − αT yi xi)+*

Min: J (α) = Φ(α) + ∑Li=1

wi E

E ≥ (1 − αT yi xi), E ≥ 0 *there are L pairs {xi,yi} of training data

Sequential Cascaded Classifiers So all that pertains to one classifier on its own.

Cascaded Version, w/ k classifiers. (xi gets split into k non-overlapping

subvectors)

– Subvectors ordered by computational complexity

For αk, J(αk) = Φ(αk) + ∑Li=1

wi (1 − αkT yi xik)+ if i is in

Tk-1

T is set of “yes”'s from previous classifiers.

Cut as many “no”'s as possible, leave the rest for the rest of the α's

AND-OR Cascaded Classifiers Everything at once – same starting pt. for each

J(α1, ... ,αK) = ∑Kk=1Φk(α) + ∑

i in F (1 − αkT yi xik)+ + ∑i in Tmax(0,(1

− α1T yi xi1), ... , (1 − αk

T yi xik))k=1

K

AND-OR Cascaded Classifiers Everything at once – same starting pt. for each

J(α1, ... ,αK) = ∑Kk=1Φk(α) + ∑

i in F (1 − αkT yi xik)+ + ∑i in Tmax(0,(1

− α1T yi xi1), ... , (1 − αk

T yi xik))k=1

K

“AND” “OR”Regularization Terms

This is the training cost function.

Optimizing Cascaded Classifiers New minimization (similar to the first one)

For each k in K, fix all (αj) | j ≠ k

Minimize Φk(α) +(∑ i in F wiEi) + (∑

i in T Ei)

Ei ≥ 1 − αkT yi xik, Ei ≥ 0

Ei ≥ max(0,(1 − α1T yi xi1),...,(1 − αk-1

T yi xik-1),

(1 − αk+1T yi xik+1),...,(1 − αK

T yi xiK))

all but αk are not variable, so this is easy (linear prog./quadratic prog.

solver)

This subproblem is convex.

Cyclic Optimization Algorithm 0.Initialize all αk to αk

0 using init. training dataset.

1.for all k: fix all αi, i != k, minimize eq. on prev. slide.

2.Compute J(α1, ... αk ... αK) using αkc instead of αk

c-1

3.If Jc has improved enough, or we've run long enough, stop, otherwise, go back to step 1.

Convergence Analysis They show that they are globally convergent to the set of

sub-optimal solutions

Using magic, and a theorem by two people named Fiotot, and Huard.

• Idea behind proof: Since we fix all vars but one, and minimize, the sequence of solutions is a decreasing sequence, with a lower bound of zero – so we converge to local min. or 0. We threshold and limit iterations, so we could hit a flat spot or just fall short.

Evaluation

• Application: polyp detection in computed tomography images. Important because polyps are an early stage of cancer

• Training set: 338 volumes (images), 88 polyps, 137.3 false positives per volume

• Test set: 396 volumes, 106 polyps, 139.4 false positives per volume

• Goal: reduce false positives per volume (FP/vol) to 0-5

Evaluation (cont’d)

• 46 features per candidate object• Comparison of AdaBoost-trained cascade, single-

stage SVM, and AND-OR trained cascade.• Features were split into 3 sets, in the increasing

order of complexity• 3-stage AND-OR classifier was built.• AdaBoost classifier was built in 3 phases, where

phase n used feature sets 1 through n for training.

Evaluation (cont’d)

• Sensitivity thresholds for AdaBoost were 0 missed polyps for phase 1, 2 for phase 2, and 1 for phase 3

• Number` of stages in each phase was picked to satisfy these thresholds.

Results

Results (cont’d)

• Cascade AdaBoost - 118 CPU secs • Cascade AND-OR – 81 CPU secs• SVM – 286 CPU secs• Cascade AND-OR is as accurate as SVM and

30% better than cascade AdaBoost

Discussion• Other methods of solving the optimization• How to assign cascade parameters

– How are feature sub-vectors sorted– heuristic? What is “ordered by computational complexity”?

• Speed increase was more prominent result, but hardly mentioned next to discussion of false positives.

• Generally better to use parallel cascades?

Joint Optimization of Cascaded Classifiers for Computer Aided Detection by M.Dundar and J.Bi Andrey...

Documents

Transcript of Joint Optimization of Cascaded Classifiers for Computer Aided Detection by M.Dundar and J.Bi Andrey...