Machine Learning for NLP
Neural networks and neuroscience
Aurélie Herbelot
2018
Centre for Mind/Brain Sciences, University of Trento
Introduction
‘Towards an integration of deep learning and neuroscience’
• Today: reading Marblestone et al (2016).
• Artificial neural networks (ANNs) are very different from the brain.
• Is there anything that computer science can learn from the actual brain architecture?
• Are there hypotheses that can be implemented / tested in ANNs and verified in experimental neuroscience?
Preliminaries: processing power
• There are approximately 10 billion neurons in the human cortex, many more than in the average ANN.
• The lack of units in ANNs is compensated by processing speed. Computers are faster than the brain...
• The brain is much more energy efficient than computers.
• Brains have evolved for tens of millions of years. ANNs are typically trained from scratch.
The (artificial) neuron
Dendritic computation: dendrites of a single neuron implement something similar to a perceptron.
By Glosser.ca - Own work, Derivative of File:Artificialneural network.svg, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=24913461
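If dendrites really do implement something perceptron-like, the computation is just a weighted sum of inputs followed by a threshold. A minimal sketch (the weights below are hand-picked for illustration, not taken from the paper):

```python
import numpy as np

def perceptron_output(x, w, b):
    """Classic perceptron: weighted sum of the inputs, then a hard threshold."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hand-picked weights implementing logical AND over two binary inputs.
w = np.array([1.0, 1.0])
b = -1.5
print(perceptron_output(np.array([1, 1]), w, b))  # 1
print(perceptron_output(np.array([1, 0]), w, b))  # 0
```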
Successes in ANNs
• Most insights in neural networks have been driven by mathematics and optimisation techniques:
• backpropagation algorithms;
• better weight initialisation;
• batch training;
• ...
• These advances don’t have much to do with neuroscience.
Preliminaries: deep learning
• Deep learning: a family of ML techniques using NNs.
• Term often misused, for architectures that are not that deep...
• Deep learning requires many layers of non-linear operations.
Bojarski et al (2016)
Neuroscience and machine learning today
• The authors argue for combining neuroscience and NNs again, via three hypotheses:
1. the brain, like NNs, focuses on optimising a cost function;
2. cost functions are diverse across brain areas and change over time;
3. specialised systems allow efficient solving of key problems.
H1: Humans optimise cost functions
• Biological systems are able to optimise cost functions.
• Neurons in a brain area can change the properties of their synapses to be better at whatever job they should perform.
• Some human behaviours tend towards optimality, e.g. through:
• optimisation of trajectories for motoric behaviour;
• minimisation of energy consumption.
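The core idea of cost-function optimisation — iteratively adjusting parameters to reduce a cost — can be sketched in a few lines. The quadratic cost below is arbitrary, standing in for quantities such as movement error or energy consumption:

```python
def minimise(grad, x0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient of the cost."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Cost C(x) = (x - 3)^2, with gradient 2(x - 3); the minimum is at x = 3.
x_star = minimise(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_star, 3))  # 3.0
```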
H2: Cost functions are diverse
• Neurons in different brain areas may optimise different things, e.g. error of movement, surprise in a visual stimulus, etc.
• This means that neurons could locally evaluate the quality of their statistical model.
• Cost functions can change over time: an infant needs to understand simple visual contrasts, and only later develops the ability to recognise faces.
• Simple statistical modules should enable a human to bootstrap over them and learn more complex behaviour.
Cost functions: NNs and the brain
H3: Structure matters
• Information flow is different across different brain areas:
• some areas are highly recurrent (for short-term memory?);
• some areas can switch between different activation modes;
• some areas do information routing;
• some areas do reinforcement learning and gating.
Some new ML concepts
• Recurrence: a unit shares its internal state with itself over several timesteps.
• Gating: all or part of the input to a unit is inhibited.
• Reinforcement learning: no direct supervision, but planning in order to get a potential future reward.
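The first two concepts can be illustrated on a single scalar unit (the weights below are arbitrary):

```python
import math

def gated_step(h, x, w_h=0.5, w_x=1.0, g=1.0):
    """One recurrent update: the unit mixes its previous state h with input x.
    The gate g in [0, 1] scales the input; g = 0 inhibits it completely."""
    return math.tanh(w_h * h + g * w_x * x)

h = 0.0
for x in [1.0, 0.5, -1.0]:             # recurrence: state carried across timesteps
    h = gated_step(h, x)
ignored = gated_step(h, 10.0, g=0.0)   # gating: a closed gate blocks the input
print(ignored == math.tanh(0.5 * h))   # True: the large input had no effect
```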
H3: Structure matters
• The brain is different from machine learning.
• It learns from limited amounts of information (not enough for supervised learning).
• Unsupervised learning is only viable if the brain finds the ‘right’ sequence of cost functions that will build complex behaviour.
“biological development and reinforcement learning can, in effect, program the emergence of a sequence of cost functions that precisely anticipates the future needs faced by the brain’s internal subsystems, as well as by the organism as a whole”
Modular learning
H1: The brain can optimise cost functions
What does brain optimisation mean?
• Does the brain have mechanisms that mirror various types of machine learning algorithms?
• Two claims are made in the paper:
• The brain has mechanisms for credit assignment during learning: it can optimise local functions in multi-layer networks by adjusting the properties of each neuron to contribute to the global outcome.
• The brain has mechanisms to specify exactly which cost functions it subjects its networks to.
• Potentially, the brain can do both supervised and unsupervised learning in ways similar to ANNs.
The cortex
• The cortex has an architecture comprising 6 layers, made of combinations of different types of neurons.
• The cortex has a key role in memory, attention, perception, awareness, thought, language, and consciousness.
• A primary function of the cortex is some form of unsupervised learning.
Unsupervised learning: local self-organisation
• Many theories of the cortex emphasise potential self-organisation: no need for multi-layer backpropagation.
• ‘Hebbian plasticity’ can give rise to various sorts of correlation or competition between neurons, leading to self-organised formations.
• Those formations can be seen as optimising a cost function like PCA.
Self-organising maps
• SOMs are ANNs for unsupervised learning, doing dimensionality reduction to (typically) 2 dimensions.
• Neurons are organised in a 2D lattice, fully connected to the input layer.
• Each unit in the lattice holds a weight vector in the input space. For each training example, the unit whose weights are most similar to it ‘wins’ and gets its weights updated. Its neighbours receive some weight update too.
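A minimal SOM training step, following the description above (the lattice size, learning rate and neighbourhood radius are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
lattice = rng.random((5, 5, 3))   # 5x5 lattice of units, each with a 3-d weight vector

def som_step(lattice, x, lr=0.5, radius=1):
    """Find the best-matching unit and pull it (and its lattice
    neighbours) towards the training example."""
    dists = np.linalg.norm(lattice - x, axis=2)
    bi, bj = np.unravel_index(np.argmin(dists), dists.shape)
    for i in range(max(0, bi - radius), min(lattice.shape[0], bi + radius + 1)):
        for j in range(max(0, bj - radius), min(lattice.shape[1], bj + radius + 1)):
            lattice[i, j] += lr * (x - lattice[i, j])
    return lattice

x = np.array([1.0, 0.0, 0.0])     # e.g. the colour red in RGB
for _ in range(20):
    lattice = som_step(lattice, x)
# The winning unit's weights have now converged towards the input.
```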
Self-organising maps
Wikipedia featured article data - By Denoir - CC BY-SA 3.0, https://en.wikipedia.org/w/index.php?curid=40452073
Unsupervised learning: inhibition and recurrence
• Beyond self-organisation, other processes seem to mirror mechanisms found in ANNs.
• Inhibitory processes in the brain may allow local control over when and how feedback is applied, giving rise to competition (SOMs) and complex gating systems (e.g. LSTMs, GRUs).
• Recurrent connectivity in the thalamus may control the storage of information over time, to make temporal predictions (like sequential models).
Supervised learning: gradient descent
• How to train when you don’t have backpropagation?
• Serial perturbation (the ‘twiddle’ algorithm): train an NN by changing one weight at a time and seeing what happens to the cost function. This is slow.
• Parallel perturbation: perturb all the weights of the network at once. This can train small networks, but is highly inefficient for large ones.
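A sketch of parallel perturbation on a tiny linear model — a simple ‘keep the change if the cost dropped’ variant; the data, step size and number of steps are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def cost(w, X, y):
    """Mean squared error of a linear unit."""
    return np.mean((X @ w - y) ** 2)

def perturb_train(w, X, y, sigma=0.02, steps=2000):
    """Parallel perturbation: nudge ALL weights at once and keep the
    perturbation whenever the cost improves. No gradients are used."""
    c = cost(w, X, y)
    for _ in range(steps):
        dw = sigma * rng.standard_normal(w.shape)
        c_new = cost(w + dw, X, y)
        if c_new < c:
            w, c = w + dw, c_new
    return w

X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5])       # targets from known weights
w = perturb_train(np.zeros(3), X, y)
print(cost(w, X, y) < cost(np.zeros(3), X, y))  # True
```

Even on this 3-weight problem the method needs thousands of cost evaluations, which illustrates why it scales so badly compared with backpropagation.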
Mechanisms for perturbation in the brain
• Real neural circuits have mechanisms (e.g., neuromodulators) that appear to code the signals relevant for implementing perturbation algorithms.
• A neuromodulator will modulate the activity of clusters of neurons in the brain, producing a kind of perturbation over potentially whole areas.
• But backpropagation in ANNs remains so much better...
Biological approximations of gradient descent
• E.g. XCAL (O’Reilly et al, 2012).
• Backpropagation can be simulated through a bidirectional network with symmetric connections.
• Contrastive method: at each synapse, compare the state of the network at different timesteps, before a stable state has been reached. Modify weights accordingly.
Beyond gradient descent
• Neuron physiology may provide mechanisms that go beyond gradient descent and help ANNs.
• Retrograde signals: direct error signals from outgoing cell synapses carry information to downstream neurons (a local feedback loop, helping self-organisation).
• Neuromodulation (again!): modulation gates synaptic plasticity to turn various brain areas on and off.
One-shot learning
• Learning from a single exposure to a stimulus. No gradient descent! Humans are good at this, machines very bad!
• I-theory: categories are stored as unique samples. The hypothesis is that a single sample is enough to discriminate between categories.
• ‘Replaying of reality’: the same sample is replayed over and over again, until it enters long-term memory.
Active learning
• Learning should be based on maximally informative examples: ideally, a system would look for information that will reduce its uncertainty most quickly.
• Stochastic gradient descent can be used to generate a system that samples the most useful training instances.
• Reinforcement learning can learn a policy to select the most interesting inputs.
• Unclear how this might be implemented in the brain, but there is such a thing as curiosity!
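A common concrete instantiation of this idea is uncertainty sampling: query the example the current model is least sure about. The probabilities below are made up:

```python
import numpy as np

def most_informative(probs):
    """Uncertainty sampling for a binary classifier: pick the unlabelled
    example whose predicted probability is closest to 0.5."""
    return int(np.argmin(np.abs(probs - 0.5)))

# Hypothetical model confidences on four unlabelled examples.
probs = np.array([0.95, 0.52, 0.10, 0.80])
print(most_informative(probs))  # 1: the model is most uncertain about example 1
```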
Cost functions across brain areas
Representation of cost functions
• Evolutionarily, it may be cheaper to define a cost function that allows a problem to be learnt, rather than to store the solution itself.
• We will need different functions for different types of learning.
Generative models for statistics
• One common form of unsupervised learning in the brain is the attempt to reproduce a sample.
• Higher brain areas attempt to reproduce the statistics of lower layers.
• The autoencoder is such a mechanism.
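A minimal linear autoencoder trained with gradient descent (the dimensions, learning rate and data are arbitrary): the hidden code is smaller than the input, so the network must capture the input statistics in order to reconstruct well.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))        # toy 'sensory' data

# Encode 4-d inputs into a 2-d code, then decode back.
W_enc = 0.1 * rng.standard_normal((4, 2))
W_dec = 0.1 * rng.standard_normal((2, 4))
lr = 0.02
for _ in range(3000):
    H = X @ W_enc                        # hidden code
    err = H @ W_dec - X                  # reconstruction error
    # Gradients of the mean squared reconstruction error.
    g_dec = H.T @ err / len(X)
    g_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

mse = np.mean((X @ W_enc @ W_dec - X) ** 2)
print(mse < np.mean(X ** 2))             # True: better than reconstructing zeros
```

A linear autoencoder of this kind converges towards the same subspace that PCA would find, which ties this slide back to the earlier point about Hebbian self-organisation optimising a PCA-like cost.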
Cost functions that approximate properties of the world
• Perception of objects is sparse: we only experience a very small subset of what is in the world. So sparseness should be integrated in perceptual networks (like visual autoencoders).
• Objects have regularities: e.g. they persist over time. We can learn to penalise object representations that are not temporally continuous.
• Objects tend to undergo predictable sequences of transformations (e.g. spatial transformations like rotation), which can be learnt via gradient descent.
• Maximising mutual information between sensory modalities is a natural way to improve learning (see work on language and vision).
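The sparseness point can be made concrete as a penalty term added to the cost function (the numbers below are made up):

```python
import numpy as np

def sparse_cost(x, x_hat, code, lam=0.1):
    """Reconstruction error plus an L1 penalty on the hidden code:
    the penalty pushes most hidden units towards zero activation."""
    return np.mean((x - x_hat) ** 2) + lam * np.sum(np.abs(code))

x = np.ones(4)
dense = np.array([0.5, 0.5, 0.5, 0.5])   # every unit a little bit active
sparse = np.array([1.5, 0.0, 0.0, 0.0])  # one strongly active unit
# With equal (here perfect) reconstructions, the sparser code is cheaper:
print(sparse_cost(x, x, sparse) < sparse_cost(x, x, dense))  # True
```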
Cost functions for supervised learning
• Issue with supervised learning: the brain must know what it is training.
• A difference in an event outcome (e.g. when reaching out for something) may give the brain some indication of which error to minimise.
• Or supervised learning is some form of consolidation of what the brain already knows, i.e. it tries to learn things in a more efficient, compressed way.
Cost functions for bootstrapping learning
• Signals are needed to learn problems where unsupervised methods (matching some statistics) are not enough.
• Availability of ‘proto-concepts’ to start off learning? The brain has evolved over thousands of years...
• E.g. a hand detector can be operated through an optical flow calculation to detect particular moving events. (See the frog’s ‘bug detector’ for a similar phenomenon.)
• Evidence that (some) emotion recognisers are encoded in the human brain.
Cost functions for stories
• Understanding stories is key to human cognition: a sequence of episodes, where one episode refers to the others through complex causes and effects, as well as implicit character goals.
• We don’t know how the cost functions of stories arise:
• story-telling as an imitation process?
• primitives may emerge from mechanisms to learn states and actions (e.g. in reinforcement learning);
• learned patterns of saliency-directed memory storage.
Optimisation in specialised structures
The need for specialised structures
• As in programming, we may assume that the brain already has a store of algorithms and data structures to learn faster.
• Training optimisation modules may involve specialised systems to route different types of signals (linked to different types of learning).
• While many different things are happening in the cortex, the brain also holds specialised structures which predate it – implying the cortex may have developed to use those more basic structures to train itself.
The need for specialised structures
• The point about structure is an important one.
• One strand of deep learning assumes that complex behaviours will emerge ‘on their own’ given enough training data.
• Other research argues that ANNs must have adequate structure to learn particular tasks.
Learning quantification (Sorodoc et al, 2018)
A network that reproduces the generalised quantifier structure, with scope and restrictor, outperforms all baselines, at 45% accuracy.
What do we need?
• Good learning seems to involve at least the following:
• Models of memory.
• Models of attention / saliency.
• Buffers to store various variables and the structures that contain them.
• Ability to deal with time.
• Some higher-level structures putting everything together.
• Some imagination...
Content-addressable memory
• Content-addressable memories allow us to recognise a pattern / situation we have encountered before.
• This is the structure of e.g. memory networks.
• The hippocampal area in the brain seems to act in a similar manner, offering pattern separation in the dentate gyrus.
• Such structures should allow the retrieval of complete memories from partial clues.
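A minimal content-addressable lookup: retrieval by similarity of content rather than by address (the stored patterns are made up):

```python
import numpy as np

# Two stored binary 'memories' (+1 / -1 features).
memories = np.array([
    [1, 1, 1, -1, -1, -1],
    [-1, -1, 1, 1, 1, -1],
])

def recall(cue):
    """Return the stored pattern with the largest overlap with the cue.
    Zeros in the cue mark unknown positions (a partial clue)."""
    overlaps = memories @ cue
    return memories[np.argmax(overlaps)]

cue = np.array([1, 1, 0, 0, 0, 0])   # only the first two features are known
print(recall(cue))                    # [ 1  1  1 -1 -1 -1]: the full first memory
```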
Working memory
• 7 ± 2 elements are storable!
• The brain may be using buffers to store distinct variables, such as subject/object in a sentence.
• Persistent, self-reinforcing patterns of neural activation via recurrent networks (e.g. LSTMs ‘remember’ some variables for some time).
Gating between buffers
• Need for a switchboard: controlling information flow between buffers.
• The basal ganglia seem to play a role in action selection circuitry, while interacting with working memory.
• They act as a gating system, inhibiting or disinhibiting an area of the cortex.
Buffers
• How do buffers interact? They need to ‘copy/paste’ information from one to the other.
• Problem to model this in the brain: the activation for ‘chair’ in one group of neurons has nothing to do with the activation for ‘chair’ in another group.
• Same problem in ML: interoperability of vectors.
Variable binding
• The issue of buffer interaction is related to the issue of binding in language (anaphora resolution).
• If such binding mechanisms are hard to model in a biological system, then we have a fundamental problem when trying to model language.
• PS: ANNs are also terrible at binding...
Attention
• Focus allows us to give more computational resources to a process: focusing on one object, we can learn about it more easily, with fewer data.
• Some indication that higher-level cortical areas may be specialised for attention.
• Pinpointing attention is a complex issue, as there are different types of attention: e.g. in vision, object-based, feature-based, location-based.
Attention in ANNs
• Both in vision and language, such mechanisms allow us to ‘focus’ on particular aspects of the input (for instance in VQA).
“What is in the basket?” Yang et al (2016)
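A sketch of the standard dot-product attention used in such models (the toy keys, values and query below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: weight each value by how well its key
    matches the query, so computation focuses on relevant inputs."""
    weights = softmax(keys @ query)
    return weights @ values, weights

# Toy example: 3 image regions; the query matches the second region's key.
keys = np.array([[1.0, 0.0], [0.0, 5.0], [0.5, 0.5]])
values = np.array([[1.0], [2.0], [3.0]])
query = np.array([0.0, 1.0])
out, w = attend(query, keys, values)
print(int(np.argmax(w)))  # 1: attention concentrates on the second region
```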
Dealing with time
• We often have to plan and execute complicated sequences of actions on the fly, in response to a new situation.
• E.g. we need a model of our body and environment to react to the ever-changing nature of our surroundings.
• The cerebellum seems to perform such a function. It is a huge feedforward architecture, with more connections than in the rest of the brain.
• The cerebellum may be involved in cognitive problems related to movement, such as the estimation of time intervals.
Hierarchical syntax
• The syntax of human language has a hierarchical structure.
• There is some fMRI evidence for anatomically separate registers representing the content of different grammar rules and semantic roles.
• There are some efforts to try and implement an equivalent of a push-down stack in NNs, as in syntactic / semantic parsing.
• See work by Friedemann Pulvermüller on syntax in the brain.
Putting it all together: hierarchical control
• There seems to be a hierarchy in the processing of different types of signals.
• The motor system involves a hierarchy of various elements, from the spinal cord to different cortical areas.
• The hypothesis is that those different areas respond to different cost functions.
Mental programs and imagination
• Humans need an ability to stitch together sub-actions to represent larger actions, in particular in planning.
• The hippocampus supports the generation and learning of sequential programs. It appears to explore different possible trajectories towards a goal.
• The hippocampus’s simulation capabilities seem to support the idea of a generative process for imagination, concept generation, scene construction and mental exploration.
• See generative models of Goodman & Lassiter (not NNs, but able to generate worlds using the Church programming language).
Can neuroscience and ML benefit each other?
• It is important to have an overview of what the brain has to solve in order to do good AI.
• Many functions of the brain are not understood. ML gives testable hypotheses to neuroscience.
• The brain has much more complex structures than current ANNs. Can we try to reproduce them in ML?