Ted Willke, Sr Principal Engineer, Intel at MLconf SEA - 5/20/16


Can Cognitive Neuroscience Provide a Theory of Deep Learning Capacity?

Ted Willke and the Mind’s Eye Team

Intel Labs

May 20, 2016

2

“Breakthrough innovation occurs when we bring down boundaries and encourage disciplines to learn from each other”

― Gyan Nagpal, Talent Economics: The Fine Line Between Winning and Losing the Global War for Talent

2

MIND’S EYE

3

Cognitive Neuroscience

4

Cognitive neuroscience is the study of the neurobiological mechanisms that underlie cognitive processes, like attention, control, and decision making.

Answers questions like: How does the brain coordinate behaviour to achieve goals? What are the brain structures upon which these functions depend? How does brain function differ amongst people?

Draws upon brain imaging/recordings and other observations to derive models

5

Context-Dependent Decision Making

Michael Shvartsman, Vaibhav Srivastava, Narayanan Sundaram, Jonathan D. Cohen, “Using behavior to decode allocation of attention in context dependent decision making”, accepted at the International Conference on Cognitive Modeling, 2016.

6

Selective Forgetting

Ghootae Kim, Jarrod A. Lewis-Peacock, Kenneth A. Norman, and Nicholas B. Turk-Browne, “Pruning of memories by context-based prediction error,” Proceedings of the National Academy of Sciences, 2014.

7

Production and comprehension of naturalistic narrative speech

Silbert LJ, Honey CJ, Simony E, Poeppel D, Hasson U (2014) Coupled neural systems underlie the production and comprehension of naturalistic narrative speech. Proc Natl Acad Sci USA 111:E4687-4696.

8

CRACKS APPEAR, DISRUPTIVE IDEAS 30 years on

MIT Press, 1986

9

Cognitive Neuroscience

Adapted from Marvin Minsky in Artificial Intelligence at MIT, Expanding Frontiers, Patrick H. Winston (Ed.), Vol. 1, MIT Press, 1990. Reprinted in AI Magazine, Summer 1991

evolve

10

Neural networks

11

Neural Network preliminaries

http://wiki.apache.org/hama/MultiLayerPerceptron

12

Neural Network preliminaries

LeCun et al., “Deep Learning,” in Nature (2015)

13

Arbitrary functions

https://upload.wikimedia.org/wikipedia/commons/7/7b/XOR_perceptron_net.png
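
As a concrete illustration (a sketch added here, not part of the slides), the XOR network in the figure can be written out with hand-set weights; one hidden layer is enough to represent a function that no single-layer perceptron can:

# Two-layer perceptron computing XOR with fixed (hand-set) weights.
import numpy as np

def step(x):
    return (x > 0).astype(float)

# Hidden layer: unit 1 computes OR, unit 2 computes AND.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output unit: OR AND (NOT AND), i.e. XOR.
W2 = np.array([1.0, -1.0])
b2 = -0.5

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = step(W1 @ np.array(x, dtype=float) + b1)
    y = step(W2 @ h + b2)
    print(x, "->", int(y))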

14

The original tenets of parallel distributed processing (roughly)

1. Cognitive processes arise from the real-time propagation of activation via weighted connections

2. Active representations are patterns of activation distributed over ensembles of units

3. Processing is interactive (bidirectional)

4. Knowledge is encoded in the connection weights (not in a separate store)

5. Learning and long-term memory depend on changes to these weights

6. Processing, learning, and representation are graded and continuous

7. Processing, learning, and representation depend on the environment

T.T. Rogers, J.L. McClelland / Cognitive Science 28 (2014)
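
A tiny sketch (added for concreteness, not from the talk) of tenets 1, 4, and 5: activation propagates through weighted connections, knowledge lives entirely in those weights, and learning is nothing more than a change to them (a simple Hebbian update is used here purely as an illustration):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))        # all "knowledge": weights from 3 input to 4 output units

def propagate(x, W):
    # Tenet 1: activation spreads in real time via weighted connections.
    return 1.0 / (1.0 + np.exp(-(W @ x)))

def hebbian_update(x, W, lr=0.1):
    # Tenets 4-5: learning/long-term memory = adjusting the connection weights.
    y = propagate(x, W)
    return W + lr * np.outer(y, x)

x = np.array([1.0, 0.0, 1.0])                 # a distributed input pattern (tenet 2)
print(propagate(x, W))
W = hebbian_update(x, W)
print(propagate(x, W))                        # the same pattern now evokes a stronger response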

15

Brain-Inspired machine learning

Structure-Inspired Learning

Neurons (e.g., spiking models)

Networks (e.g., deep belief networks)

Architectures (e.g., Human Brain Project)

Cognitive-Inspired Learning

Reinforcement Learning

Context-based Memory

Noisy Decision Making

15

"Gray754" by Henry Vandyke Carter - Henry Gray(1918) Anatomy of the Human Body

16

Deep learning takes advantage of parallel distributed processing

http://www.amax.com/blog/wp-content/uploads/2015/12/blog_deeplearning3.jpg

17

Winning top spots in visual recognition challenges, etc.

(1) Lin et al., 2015, (2) https://www.cityscapes-dataset.com/dataset-overview/ (3) Deng et al., 2009 (4) http://lsun.cs.princeton.edu/2015.html

MS COCO (Common Objects in Context)

CityScapes Dataset (Semantic Understanding)

ImageNet (Object Localization)

LSUN (Saliency Prediction)

18

Yang et al. (2015)

What are sitting in the basket on a bicycle?

19

Yang et al. (2015)

Stacked Attention Networks for Image Question Answering

20

The Glory and the remaining mystery

We have achieved…

Exceeding human-level performance on visual recognition tasks

Mastering more and more complex games (Go)

Demonstrating human-level control in reinforcement learning (Atari)

Question-answering and other AI services are upon us

but we still don’t know…

How learnt (feature) representations are encoded (or whether they converge for different networks trained on the same data)

The capacity for learning representations

The trade-off between efficiency of representation and flexibility of processing

How things learnt interfere with each other

21

Representations and Learning Capacity

22

Li et al. (ICLR 2016)

Representation encoding: meaningful and consistent?

Can we reliably map feature representations between these networks?

23

Li et al. (ICLR 2016)

Convergent Learning?

Conclusions:

1. Some features are learned reliably in multiple nets (some are not)

2. Units learn to span low-D subspaces, which are common (but specific basis vectors are not)

3. Representations are encoded as a mix of single unit and slightly distributed codes

4. Mean activation values across different networks converge to a nearly identical distribution
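
The core analysis behind these conclusions can be sketched in a few lines (a hedged illustration, not the authors' code): record unit activations from two independently trained networks on the same inputs, correlate every unit pair, and look for one-to-one matches:

import numpy as np

def unit_correlations(acts_a, acts_b):
    # acts_a: (n_samples, n_units_a), acts_b: (n_samples, n_units_b)
    # Returns the matrix of Pearson correlations between every unit pair.
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    return (a.T @ b) / len(acts_a)

# Toy stand-in for activations recorded on a common stimulus set.
rng = np.random.default_rng(0)
acts_net1 = rng.normal(size=(1000, 64))
acts_net2 = rng.normal(size=(1000, 64))

corr = unit_correlations(acts_net1, acts_net2)
best_match = corr.argmax(axis=1)   # for each unit in net 1, its best partner in net 2
print(corr.max(axis=1)[:5], best_match[:5])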

24

Can cognitive neuroscience provide any insight into the nature of learning and task capacity?

25

The appeal of highly-parallel neural networks

Both cognitive neuroscience and machine learning applications exploit the following two features of neural networks to great benefit:

a) The ability to learn and process complex representations, taking into account a large number of interrelated and interacting constraints

b) The ability for the same network to process a wide range of potentially disparate representations (or tasks), sometimes called “multitask learning.”

But what are their limits?

26

The brain: The black box at the end of our necks

• Facts:

Only 2% of body weight but uses up to 20% of energy

~200B neurons

Neurons fire up to ~10 kHz

1K to 10K connections per neuron

• Cerebral neocortex:

~20B neurons

~125 trillion synapses

There are more ways to organize the neocortex’s ~125 trillion synapses than stars in the known universe

27

The paradox – one task at a time

28

A fundamental puzzle concerning human processing

Why, in some circumstances, is the brain capable of a remarkable degree of parallelism (e.g., locomotion, navigation, speech, and bimanual gesticulation), while in others its capacity for parallelism is radically limited (e.g., the inability to do mental arithmetic while constructing a grocery list at the same time)?

29

A theory

The difference in multitasking ability may reflect the degree to which different tasks rely on shared representations

The more that different processes interact, the stronger the imposition of seriality

May reflect a fundamental trade-off in neural network architectures between the efficiency of shared representations (and the capacity for generalization that they afford) and the effectiveness of multitasking.

30

Multi-tasking and cross-talk

Feng et al. (CABN 2014)

31

You will see a sequence of words. Quickly say the color of the letters.

32

SNOW

33

Ready!

34

BLUE

35

RED

36

BLACK

37

GREEN

41

BLACK

42

BLUE

46

GREEN

49

RED

52

BLUE

54

Now with the words upside down.

55

BLACK

56

GREEN

58

BLACK

59

RED

60

Were you faster to answer?

61

A Demonstration of interference

Stroop (1935)

62

Multi-tasking interference (in the Stroop test)

Cohen et al. (1990)

(Model diagram: Color and Word input pathways, Task units, Verbalize response)
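
In the spirit of that model, a heavily simplified sketch (toy numbers of my own, not the Cohen et al. implementation): two pathways converge on a shared response layer, the word-reading pathway is stronger because it is more practiced, and the task units can only partially gate it out, so it intrudes on color naming for incongruent stimuli:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Response units: [say "RED", say "GREEN"]
W_color = 1.0                        # weaker, less-practiced color-naming pathway
W_word  = 2.5                        # stronger, highly practiced word-reading pathway
task_gate = np.array([1.0, 0.2])     # attention mostly on the color task

def respond(color, word):
    # color/word are one-hot stimulus codes over [RED, GREEN]
    net = task_gate[0] * W_color * color + task_gate[1] * W_word * word
    return softmax(net)

red, green = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print("congruent   (RED in red ink):  ", respond(color=red, word=red))
print("incongruent (GREEN in red ink):", respond(color=red, word=green))
# The less decisive output for the incongruent case corresponds to the slower,
# more error-prone responses you just experienced in the demo.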

63

Control-Demanding Behavior (Feng et al. 2014)

First to describe the trade-off between the efficiency of representation (“multiplexing”) and the simultaneous engagement of different processing pathways (“multitasking”)

Showed that even a modest amount of multiplexing rapidly introduces cross-talk among processing pathways

Proposed that the large advantages of efficient encoding have driven the human brain to favour this over the capacity for control-demanding processes.

64

Types of interference

65

Maximum independent set (MIS)

The MIS is the largest set of processes in the network that can be simultaneously executed without interference.
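
For small graphs this can be computed directly; a hedged sketch (the toy interference graph below is assumed, not data from the talk), where nodes are processes and an edge means two processes share representations and so interfere:

from itertools import combinations

def max_independent_set(nodes, edges):
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(nodes), 0, -1):          # try the largest sets first
        for cand in combinations(nodes, size):
            if all(frozenset(p) not in edge_set for p in combinations(cand, 2)):
                return set(cand)                   # no pair in the set interferes
    return set()

tasks = ["walk", "talk", "navigate", "arithmetic", "grocery_list"]
interference = [("arithmetic", "grocery_list"), ("talk", "arithmetic")]
print(max_independent_set(tasks, interference))
# e.g. {'walk', 'talk', 'navigate', 'grocery_list'} can run in parallel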

66

network structure (distribution complexity)

The network's capacity for multitasking depends on the distribution of in-degrees and out-degrees of the network (here, only the in-degree of the output components is varied)

We represent this with a “distribution complexity” symmetry measure (maximized for a uniform distribution)

We study the characteristics of the network with the distribution complexity (DC) held fixed
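
The exact measure is not given in the slides; one natural candidate with the stated property (maximized for a uniform distribution) is the Shannon entropy of the in-degree distribution, which the sketch below uses purely as an assumed stand-in:

import numpy as np

def distribution_complexity(in_degrees):
    # Entropy of the in-degree distribution: 0 when every node has the same
    # in-degree, maximal when the degrees are spread uniformly.
    counts = np.bincount(in_degrees)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

print(distribution_complexity(np.array([2, 2, 2, 2])))   # single value  -> 0.0
print(distribution_complexity(np.array([1, 2, 3, 4])))   # uniform spread -> 2.0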

67

Takeaway: Even modest amounts of process overlap impose dramatic constraints on parallel processing capability

68

Trade-off between generalization and parallelism: Feed-Forward simulation

69

Training/Test details

Training

20 network groups, 20 random initializations per group

All networks trained on same stimuli, 16 tasks

Trained to generate 1-hot task outputs (MSE < 0.0001)

Test

70/30 split

Generalization is the average MSE over all stimuli in the test set

Parallel processing is measured as the response when 2, 3, or 4 tasks are activated simultaneously, scoring the MSE against the target patterns (see the sketch below)
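
A hedged sketch of this setup (the layer sizes, stimulus encoding, and task-cue scheme below are my own assumptions; only the 16 tasks, the MSE criterion, and the multi-task probe come from the slide):

import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_stim_feats, n_hidden = 16, 8, 32
n_out = n_tasks                                   # one one-hot output per task (assumption)

def init_net(max_w=0.5):
    # The maximum initial weight magnitude is the knob varied on the later slides.
    return [rng.uniform(-max_w, max_w, size=s)
            for s in [(n_stim_feats + n_tasks, n_hidden), (n_hidden, n_out)]]

def forward(net, x):
    W1, W2 = net
    h = np.tanh(x @ W1)
    return np.tanh(h @ W2), h

# ... training loop (gradient descent on MSE to < 1e-4, 70/30 split) omitted ...

def multitask_mse(net, stim, active_tasks, target):
    task_cue = np.zeros(n_tasks)
    task_cue[active_tasks] = 1.0                  # activate 2, 3, or 4 task cues at once
    out, _ = forward(net, np.concatenate([stim, task_cue]))
    return np.mean((out - target) ** 2)

stim = rng.uniform(size=n_stim_feats)
target = np.zeros(n_out)
target[[1, 5]] = 1.0                              # hypothetical target pattern for tasks 1 and 5
print(multitask_mse(init_net(), stim, active_tasks=[1, 5], target=target))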

70

Shared Representations

Smaller weights (a); larger weights (b)

71

Generalization vs parallel processing capability

72

Parallel processing capability vs max initial weights

73

Future work

Extend analysis to weighted graphs

Study more complex networks (e.g., deeper structures, recurrent connections)

Study human performance (via neuroimaging data)!

74

C. elegans

74

The OpenWorm Project (image generated by neuroConstruct)

SINCE 1986

Thank you!