From Deep BP-Learning to Internal X-Learning


Transcript of From Deep BP-Learning to Internal X-Learning

S.Y. Kung

Princeton

From Deep BP-Learning to Internal X-Learning

IEEE AI & Industry

Outline

A. From AI 1.0 to AI 2.0 to XAI 3.0

B. X-Learning: Enhance Accuracy via Model Reduction
• ITL: node ranking and structural learning
• DI: Internal Optimization Metrics (IOM)

C. Iterative X-Learning, DNs, & Results
• NP iteration: Joint PS Learning
• Deleterious Nodes (DNs)
• Xnet: MNIST, CIFAR, & ImageNet

D. Extension to Broader Applications
• LAX-learning, EX-learning
• Regression Analysis: RX-Learning
• IMaX-learning (5G-AI)

NN 1.0 (MLP) ⇒ NN 2.0 (DNN) ⇒ Xnet

AI 1.0 ⇒ AI 2.0 ⇒ XAI 3.0 (roughly 1950 ⇒ 2000 ⇒ 2020)

• AI 1.0: Rule-based. MIT AI Lab (1958, M. Minsky); Knowledge Systems Laboratory (1970, Feigenbaum). M. Minsky built the first neural network simulator (Princeton Ph.D., 1951).
• AI 2.0: Data-driven, training by examples; Deep BP-Learning ("deep learning", 深度学习).
• XAI 3.0: Internal X-Learning ("in-depth learning", 深入学习).

AI 2.0: DLN = Feature Engineering + Label Engineering. [Figure: feature extraction maps the input to a feature vector through weights W, W'; BP propagates the teacher signal back through the network.]

NN 2.0 = MLP + ConvNet + Deep BP-Learning. Memorization vs. generalization: BP is run on thousands (if not millions) of parameters.

Deep learning networks represent the state of the art in ML.

From Rule-based AI1.0 to Data-Driven AI2.0

NN 2.0 ↔ AI 2.0; NN ≅ DL; ML ≅ AI

Characterization of Convolutional Neural Nets

[Figure: LeNet-5 architecture [Han15]. Input 1×28×28 → 5×5 conv → 20×24×24 (24 = 28-5+1) → pooling → 20×12×12 → 5×5 conv → 50×8×8 (8 = 12-5+1) → pooling → 50×4×4 → fully connected 800 (= 50×4²) → 500 → output. Annotations: "BP friendly" (parameters) vs. "X-learning" (structure).]
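As a quick sanity check of that shape arithmetic, here is a minimal PyTorch sketch (my own, not the speaker's code) whose layer sizes follow the figure above:

```python
# Minimal LeNet-5-style sketch reproducing the quoted shape arithmetic:
# 24 = 28-5+1, 8 = 12-5+1, 800 = 50*4*4.
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, kernel_size=5)   # 1x28x28 -> 20x24x24
        self.conv2 = nn.Conv2d(20, 50, kernel_size=5)  # 20x12x12 -> 50x8x8
        self.pool = nn.MaxPool2d(2)                    # halves spatial size
        self.fc1 = nn.Linear(50 * 4 * 4, 500)          # 800 -> 500
        self.fc2 = nn.Linear(500, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))       # -> 20x12x12
        x = self.pool(torch.relu(self.conv2(x)))       # -> 50x4x4
        x = x.flatten(1)                               # -> 800
        return self.fc2(torch.relu(self.fc1(x)))

print(LeNet5()(torch.randn(1, 1, 28, 28)).shape)       # torch.Size([1, 10])
```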

From Deep BP-Learning to Internal X-Learning

Back-propagation (BP) is only for parameter learning.

X-learning adopts internal and explainable learning (Xnet), going beyond BP to train both the network's structure and its parameters.

DARPA XAI ⇒ X-Learning

Internal Hidden Nodes

DARPA's XAI ⇒ X-Learning

[Figure: classifying airplane vs. bird vs. drone.]

Motivated by XAI, X-learning will play vital roles in XAI.

End User Explainability: "XAI is most interested in explanations of higher-level decisions that would be relevant to the end user and the missions he/she needs to manage." [DARPA 2016]

Internal Node EXplainability

Biology for Internal X-Learning

[Figure: bird or drone?]

Why Internal X-Learning?

Parameter Learning: compute ∂L/∂w₁ and ∂L/∂w₂; EOM gradient descent.

Structural Learning: IOM-guided adaptation.

Structural characterization:
• The number of layers in the model.
• The number of nodes in each layer.

X-Learning facilitates both parameter and structure learning!!

B. X-Learning

A quantum jump from finite to zero (e.g. 2.3 ⇒ 2.2 ⇒ 0, or 1 ⇒ 0):
• individually (connection), or
• collectively (node).

X-learning ⇒ structural gradient, via a LASSO-type (link/node) optimization algorithm.
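To illustrate the "individually (connection) or collectively (node)" distinction, here is a hedged sketch (my own illustration, not the speaker's algorithm) of LASSO-style penalties: an L1 term sparsifies single links, while a group-L2 term over each node's incoming weights can drive a whole node to zero.

```python
# Hedged sketch of LASSO-style structural penalties: L1 drives individual
# connections to zero; group-L2 (group lasso) drives whole nodes to zero.
import torch

def structural_penalty(weight, lam_link=1e-4, lam_node=1e-3):
    """weight: (out_nodes, in_nodes) matrix of one fully connected layer."""
    link_penalty = weight.abs().sum()         # per-connection (L1)
    node_penalty = weight.norm(dim=1).sum()   # per-output-node (group L2)
    return lam_link * link_penalty + lam_node * node_penalty

W = torch.randn(4, 8, requires_grad=True)
loss = (W @ torch.randn(8)).pow(2).mean() + structural_penalty(W)
loss.backward()   # gradients now include the sparsifying (structural) terms
```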

B(i): Why X-Learning? Biological Motivation

Biological justification of neuron pruning: the human brain prunes more than 5,000 neurons daily. [Figure: synaptic density at birth, at age 6, and at age 14.]

[Figure: dense vs. sparse networks (4-3-2-2 and 5-4-3-2); dropping 3 nodes reduces the link count (4×3 + 3×2 + 2×2 = 22; 38 - 10 = 28).]

Evaluation of Links vs. Evaluation of Neurons: compared in terms of reduction effectiveness and accuracy.

Goal: Ranking of Neurons to facilitate structural gradient.

B(ii): ITL and IOM

For BP learning, all neurons are created equal.

Internal X-learning for node ranking and structural learning!!

Local metrics for classification-type hidden nodes (local input, local teacher):

(1) Internal Teacher Labels (ITL)
(2) Internal Optimization Metrics (IOM)

The internal learning paradigm facilitates structural learning.

For classification problems, the internal teacher labels can metaphorically be hidden in "Trojan horses" and transported (along with the data) from the input layer to all hidden nodes.

Nonlinear transformations in the neural network are meant to improve the IOM (e.g. DI) from layer to layer. [Figure: samples from classes A, B, and C in a low-DI space vs. a high-DI space.]

ETL is space-divided: the labels correspond to the output nodes (batch method).

In contrast, ITL is time-divided: a label is assigned to each incoming input sample, one instance at a time (data-adaptive BP).

[Figure: space-division vs. time-division of teacher labels for classes A and B at t = 1, t = 2, ….]

B(iii): Internal Optimization Metrics (IOM)

Carl Friedrich Gauss (born 1777). The numerous things named in honor of Gauss include the normal distribution, also known as the Gaussian distribution, the most common bell curve in statistics.

Sir Ronald Aylmer Fisher FRS (1890–1962) was a British statistician, "a genius who almost single-handedly created the foundations for modern statistical science" and "the single most important figure in 20th century statistics".

Claude Elwood Shannon was "the father of information theory". Shannon is noted for having founded information theory with his landmark 1948 paper, A Mathematical Theory of Communication.

Standing on Giants' Shoulders

DI (Discriminant Information): Three Scatter Matrices

C.R. Rao vs. S.Y. Kung: determinant vs. trace.

• Rao: the final score is equal to the product of the individual scores, i.e. the volumetric (determinant) score.
• Kung: the total score is equal to the sum of the individual scores (the trace).

DI (Discriminant Information)

A vital condition is that there exists no inter-component redundancy, a condition termed by C.R. Rao "Canonical Orthogonality". (For DCA, all the eigenvalues are generically distinct; therefore, all the columns of V are canonically orthogonal to each other.)

Numerically, the optimal DCA projection matrix can be derived from the principal eigenvectors of the Discriminant Matrix.
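A minimal numerical sketch of that recipe follows (my own rendering; the exact scatter definitions and the placement of the ridge ρ are assumptions consistent with the "three scatter matrices" and the noise-variance ρ mentioned below): form a discriminant matrix from the between-class and (regularized) total scatter, and keep its principal eigenvectors as the projection.

```python
# Hedged sketch: derive a DCA-style projection from the principal eigenvectors
# of a discriminant matrix built from between-class and total scatter.
import numpy as np

def dca_projection(X, y, m=2, rho=1e-3):
    """X: (N, d) features, y: (N,) integer class labels, m: output dimension."""
    mu = X.mean(axis=0)
    Xc = X - mu
    S_bar = Xc.T @ Xc / len(X)                    # (centered) total scatter
    S_B = np.zeros_like(S_bar)                    # between-class scatter
    for c in np.unique(y):
        Xk = X[y == c]
        d_mu = Xk.mean(axis=0) - mu
        S_B += len(Xk) / len(X) * np.outer(d_mu, d_mu)
    D = np.linalg.solve(S_bar + rho * np.eye(X.shape[1]), S_B)  # discriminant matrix
    eigvals, eigvecs = np.linalg.eig(D)
    order = np.argsort(-eigvals.real)             # principal eigenvectors first
    return eigvecs[:, order[:m]].real             # columns = projection matrix V

X = np.random.randn(300, 5); y = np.random.randint(0, 3, 300)
V = dca_projection(X, y)                          # (5, 2) projection matrix
```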

Suppose a hidden layer has 2 nodes (or more). [Figure: low-DI space (2 nodes) vs. high-DI space (2 nodes).]

C. Iterative X-Learning, DNs, & Results

• DI with SOTP Property

• NP iteration: Joint PS Learning

• Deleterious Nodes (DNs)

• Xnet: MNIST, CIFAR, & ImageNet

B(iii): Internal Optimization Metrics (IOM)

Beauty of math: a simple change (from determinant to trace) leads to

total = sum of individuals,

which offers a vital monotonically increasing property, crucial for using DI to rank nodes and subspaces.

C(i): Subspace/Component DI

Three Scatter Matrices

ρ: variance of additive noise

Subspace DI: DI(W) = tr{ [Wᵀ(S̄ + ρI)W]⁻¹ (WᵀS_B W) }, with S̄ the (centered) total scatter matrix and S_B the between-class scatter matrix.

To assess the IOM of the full space of a layer, we set W = I.

For (supervised) deep compression, we adopt W_i-keep / W_i-drop to keep/drop only the i-th node/channel.

Subspace DI for node evaluation: DI(W_i-keep) and DI(W_i-drop).
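A hedged sketch of that node-evaluation step (my own code, assuming the DI form reconstructed above): compute the DI of the full layer (W = I) and the DI lost when each node is dropped (W_i-drop = identity with column i removed), then rank the nodes by that loss. A node whose removal barely reduces, or even raises, the layer DI is a pruning candidate (a redundant or deleterious node in the deck's terminology).

```python
# Hedged sketch: rank the nodes of one hidden layer by the DI lost when each
# node is dropped (W_i-drop = identity matrix with column i removed).
import numpy as np

def di(W, S_bar, S_B, rho=1e-3):
    A = W.T @ (S_bar + rho * np.eye(len(S_bar))) @ W
    return np.trace(np.linalg.solve(A, W.T @ S_B @ W))

def rank_nodes(H, y, rho=1e-3):
    """H: (N, d) hidden-layer activations, y: (N,) class labels."""
    mu = H.mean(axis=0); Hc = H - mu
    S_bar = Hc.T @ Hc / len(H)
    S_B = np.zeros_like(S_bar)
    for c in np.unique(y):
        d_mu = H[y == c].mean(axis=0) - mu
        S_B += (y == c).mean() * np.outer(d_mu, d_mu)
    d = H.shape[1]
    full_di = di(np.eye(d), S_bar, S_B, rho)      # W = I: whole layer
    scores = []
    for i in range(d):
        W_drop = np.delete(np.eye(d), i, axis=1)  # drop only node i
        scores.append(full_di - di(W_drop, S_bar, S_B, rho))
    return np.argsort(scores)                     # lowest-contribution nodes first

H = np.random.randn(500, 8); y = np.random.randint(0, 3, 500)
print(rank_nodes(H, y))                           # pruning candidates first
```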

Parameter Learning: compute ∂L/∂w₁ and ∂L/∂w₂; EOM gradient descent.

Structural Learning: ranking of nodes.

Both parameter/structure gradient-type learning.

NP-Iterative Pruning Method

DI-based Pruning Method Offers Win-Win Deep Compression/Quantization

N-step: IOM (structural learning space); P-step: EOM (parameter learning space).
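To make the N/P alternation concrete, here is a schematic but runnable toy sketch (my own, not the speaker's implementation): the P-step runs BP/EOM gradient descent, and the N-step ranks hidden nodes with a simplified per-node variance-ratio score standing in for the subspace DI, then masks out the weakest node.

```python
# Schematic NP iteration on toy data: P-step = BP/EOM parameter updates,
# N-step = IOM-style (DI-like) ranking and masking of one hidden node.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 10)
y = (X[:, 0] + X[:, 1] > 0).long()                  # toy 2-class labels

hidden = 16
f1, f2 = nn.Linear(10, hidden), nn.Linear(hidden, 2)
mask = torch.ones(hidden)                            # 1 = node kept, 0 = pruned
opt = torch.optim.SGD(list(f1.parameters()) + list(f2.parameters()), lr=0.1)

def di_scores(H, y, rho=1e-3):
    """Simplified per-node score: between-class over total variance of each node."""
    Hc = H - H.mean(0)
    total = (Hc * Hc).mean(0) + rho
    between = sum((y == c).float().mean() * (H[y == c].mean(0) - H.mean(0)) ** 2
                  for c in y.unique())
    return between / total

for np_iter in range(5):
    for _ in range(100):                             # P-step: EOM gradient descent
        h = torch.relu(f1(X)) * mask
        loss = nn.functional.cross_entropy(f2(h), y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                            # N-step: IOM-guided pruning
        h = torch.relu(f1(X)) * mask
        scores = di_scores(h, y) + (1 - mask) * 1e9  # ignore already-pruned nodes
        mask[scores.argmin()] = 0.0                  # drop the least informative node
    print(f"iter {np_iter}: loss {loss.item():.3f}, nodes kept {int(mask.sum())}")
```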

C(ii): Performance Enhancement via Model Reduction

• Constructive (discriminative) nodes

• Redundant (but robust) nodes

• Deleterious nodes

DN= Deleterious neurons

Theme: Finding/Pruning deleterious neurons enhances performance.

In-depth learning (深入学习): DN = deleterious neurons.

X-learning adopts a joint parameter/structural gradient learning to gradually reduce the network towards an optimal structure, while traversing across the winning structural space.

DNs. Differential-DI subspace (2 nodes: a left-vs-right neuron and a top-down neuron). High-DI: left-vs-right neuron; low-DI: top-down neuron.

Our punchline: 1 + 1 > 2 and 2 - 1 > 2.

Visualization via Channel Images [KHL19]

The DI metric is used to determine which channels to prune. [Figure: channel images ordered from low-DI to high-DI.] Pruning enhances robustness.

[Figure: LeNet on MNIST, large net vs. DI-reduced net; deleterious neurons marked.]

By simply removing redundant and harmful nodes, one can enhance the model robustness. This leads to a joint parameter/structural learning paradigm.

C(iii): Experimental Results

• MNIST, CIFAR datasets
• ImageNet
• LPIRC

CIFAR-10: 10× speedup. MNIST: 100× speedup & storage. CIFAR-100: 5×* speedup (ITL).

Basic Xnet: Layer-Uniform (Layer-Independent) ITL

Metrics: accuracy; speed (power, energy, latency); storage.

X-learning can spot:
• very few highly deleterious nodes
• many redundant (but robust) nodes

Speed winner (power, energy, latency) and accuracy winner.

• LAXnet: Adaptive (Layer-dependent) ITL; from coarse ITL to fine ITL
• EXnet: End-User Explainable Learning ITL; DI(A) for A or Ā (= not A)

DI(A) and DI(A') for a special-interest group such as:
• Ā but A'
• A but Ā'

D. Extension

• Regression/Generation
• Super-Resolution Image Enhancement


Our Approach to XAI: EXnet

[Figure: bird or drone?]

From Low DI to High DI channels (neurons)

End-user (subset = top 3 ABC classes): A = airplane, B = bird, and C = car.

2019 ELE571_Mohammad_Prerit_Pesentation.pptx

EXnet's Approach to XAI: ABC Example

[Figure: Conv1 channel visualizations.]

Layer Structure in the (Optical Preprocessing) Retina Nervous System

LAX-Learning: Different Teachers for Different Layers

Multi-resolution LAX-Learning: initial tantalizing layers vs. final maturing layers.

Initially: 3-class ITL. Finally: 6-class ITL.

Coarse explainability (1/3 ITL) ⇒ fine explainability (1/6 ITL).

Multi-resolution X-Learning


AXnet improves the accuracy by 0.85% over the basic Xnet.

Multi-resolution pruning policy: 76.96%

Model                              Accuracy   FLOPs   Size
ResNet-164 (Baseline)              76.63%     505M    1.7M
ResNet-164 (Liu et al., 2017)      76.09%     247M    1.2M
X-ResNet-164 (Uni-resolution)      76.11%     210M    0.95M
X-ResNet-164 (Multi-resolution)    76.96%     210M    0.95M

LAXnet: Layer-Adaptive (Layer-dependent) ITL

Regression Analysis: RX-Learning

JPEX Examples on Super-Resolution Enhancement/Restoration

5× compression, 4× model, 3.6× FLOPS/latency.

X-Learning for 5G Inter-machine Intelligence

1M AI machines within 1 km²

Transferable Learning

Transferable pruning policy: 76.8%

IMaX-Learning: parameter transfer vs. structure transfer

Structural Transfer Learning: ResNet-164 on CIFAR (CIFAR-10 to CIFAR-100)
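A hedged sketch of the two transfer modes named above (my own illustration, with made-up channel counts): structure transfer reuses only the pruned per-layer widths found on the source task and retrains fresh parameters on the target task, whereas parameter transfer would also copy the weights.

```python
# Hedged sketch contrasting parameter transfer with structure transfer for a
# pruned model.  The channel widths below are hypothetical placeholders for
# the per-layer widths kept by X-learning pruning on the source task.
import torch
import torch.nn as nn

source_channels = [16, 32, 64]   # assumption: widths surviving pruning on CIFAR-10

def build_net(channels, num_classes):
    layers, c_in = [], 3
    for c in channels:
        layers += [nn.Conv2d(c_in, c, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
        c_in = c
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(c_in, num_classes))

source_net = build_net(source_channels, num_classes=10)   # trained on the source task
target_net = build_net(source_channels, num_classes=100)  # structure transfer: same
# pruned widths, fresh parameters, retrained on the target task.  Parameter
# transfer would instead copy source_net's conv weights before fine-tuning.
print(target_net(torch.randn(1, 3, 32, 32)).shape)        # torch.Size([1, 100])
```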

IMaX inter-machine learning (5G-AI)

Reduction/Trimming of Deep Learning Networks

Automatic Bigdata Deep Compiler (ABDC)

Automation of X-Learning

Thank You!!