
CogBotLab: Machine Learning & Cognitive Robotics

Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

Alex Graves (1), Santiago Fernández (2), Marcus Liwicki (3), Horst Bunke (3), Jürgen Schmidhuber (1,2)

(1) TU Munich, Boltzmannstr. 3, 85748 Garching, Munich, Germany
(2) IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland

(3) IAM, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland
{alex,santiago,juergen}@idsia.ch  {liwicki,bunke}@iam.unibe.ch

On-line Handwriting Recognition

• On-line handwriting recognition extracts sequences of words from sequences of pen positions

• It is generally easier than off-line recognition, where the text image is used as input

• However, each letter is spread over many pen positions and extensive use of context is required; this is a problem for most sequence learning algorithms

• Especially difficult for unconstrained handwriting, where any writing style can be used

• Usual solution is to pre-process the data into a set of localised features (containing information about shape, trajectory etc.)

• Special features are required for delayed strokes, e.g. the crossing of a ‘t’ or the dot of an ‘i’

• Would prefer to map directly from pen trajectories to transcriptions, since this (1) involves less human effort, (2) is more general (e.g. could be used for languages with different alphabets) and (3) builds the feature extraction into the classifier, allowing the whole system to be trained together

Recurrent Neural Networks (RNNs)

• RNNs provide a differentiable map from input to output sequences

• The recurrent connections give access to previous context and build in robustness to local distortions and translations of the input

• Bidirectional RNNs feed the input sequence forwards and backwards to two separate hidden layers, giving access to both past and future context (e.g. the letters before and after the current one)

• One problem with traditional RNNs is the limited range of context they can access, due to the so-called vanishing gradient problem

• The Long Short-Term Memory (LSTM) architecture uses a linear ‘state’ unit surrounded by multiplicative ‘gates’ to bridge long time delays, thereby giving access to long-range context

[Figure: an LSTM memory block. The linear CEC unit has a self-connection of weight 1.0 and is surrounded by the input, output and forget gates; g and h are the squashing functions applied to the net input and the cell output.]
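As a rough illustration of the block above, here is a minimal single-step LSTM forward pass in NumPy. The names are ours and the peephole connections used in the full LSTM architecture are omitted; this is a sketch of the gating idea, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of an LSTM memory block (peepholes omitted).

    W has shape (4 * n_hidden, n_in + n_hidden) and maps the concatenated
    [x, h_prev] to the pre-activations of the input gate, forget gate,
    output gate and cell input; c_prev is the CEC state from the last step.
    """
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # multiplicative gates
    g = np.tanh(g)                                 # squashed cell input ('g' in the figure)
    c = f * c_prev + i * g                         # CEC: linear unit with self-connection 1.0
    h = o * np.tanh(c)                             # squashed, gated cell output ('h' in the figure)
    return h, c
```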

• Combining the above gives bidirectional LSTM (BLSTM)

• In principle BLSTM has access to context from the entire input sequence at every point in the output sequence
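A correspondingly simplified sketch of the bidirectional wiring, reusing lstm_step from above (hypothetical helper names): one hidden layer reads the input forwards, the other backwards, and their states are concatenated at every time step.

```python
import numpy as np

def run_lstm(xs, W, b, n_hidden):
    """Unroll lstm_step over an input sequence xs (a list of input vectors)."""
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, W, b)
        outputs.append(h)
    return outputs

def blstm_layer(xs, W_fwd, b_fwd, W_bwd, b_bwd, n_hidden):
    """Bidirectional LSTM layer: concatenate forward and backward hidden states,
    so every output position sees both past and future context."""
    fwd = run_lstm(xs, W_fwd, b_fwd, n_hidden)
    bwd = run_lstm(xs[::-1], W_bwd, b_bwd, n_hidden)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```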

Connectionist Temporal Classification (CTC)

• The standard objective functions for RNNs require a separate training signal for each point in the input sequence

[Figure: framewise training targets (sil ow d dcl n ix w ah dx ae q) compared with CTC network outputs, which spike only at the emitted labels.]

• This is a problem for tasks like handwriting recognition, where the alignment between the inputs and the labels is hard to determine automatically, and (at least for cursive text) the boundaries between letters are ambiguous

• CTC is a recently proposed output layer that trains RNNs to map directly from input sequences to label sequences, without requiring pre-alignment

• The labelling can be read directly from the network outputs (follow the spikes)
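“Following the spikes” amounts to best-path decoding: pick the most active output at each time step, collapse repeated labels, and remove blanks. A minimal sketch, assuming the blank label sits at index 0:

```python
def ctc_best_path(probs, blank=0):
    """probs: (T, K) array of per-timestep output activations, K including the blank."""
    best = probs.argmax(axis=1)          # most active label at each time step
    labels, prev = [], blank
    for k in best:
        if k != blank and k != prev:     # collapse repeats, drop blanks
            labels.append(int(k))
        prev = k
    return labels
```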

• The network can be trained with a log-likelihood objective function to maximise the probabilities of the correct labellings in the training set
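The probability p(l|x) of a correct labelling is the sum over all framewise alignments that collapse to l, computed with a forward recursion over the blank-extended label sequence. A bare-bones (unscaled, non-log-space) sketch of that recursion, with our own naming; practical implementations rescale or work in log space to avoid underflow.

```python
import numpy as np

def ctc_log_likelihood(probs, label, blank=0):
    """log p(label | x) via the CTC forward algorithm.

    probs: (T, K) per-timestep softmax outputs, with the blank at index 0.
    label: non-empty target label sequence, without blanks.
    The training objective is to minimise the negative of this value,
    summed over the training set.
    """
    ext = [blank]                        # blank-extended label: [-, l1, -, l2, ..., -]
    for k in label:
        ext += [k, blank]
    T, S = probs.shape[0], len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, blank]
    alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                              # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]                     # advance by one position
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]                     # skip the blank between distinct labels
            alpha[t, s] = a * probs[t, ext[s]]
    return np.log(alpha[T - 1, S - 1] + alpha[T - 1, S - 2])
```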

Integration with an External Grammar

• CTC outputs the label sequence l* with highest probability given the input sequence x:

    l* = arg max_l p(l|x)

• But for certain tasks, we want to further weight the labellings according to some probabilistic grammar G (e.g. using bigram or trigram word probabilities):

    l* = arg max_l p(l|x, G)

• If we assume that x is independent of G, the above equation reduces to

    l* = arg max_l p(l|x) p(l|G)

• This can be efficiently calculated with a variant of the token passing algorithm for HMMs
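The token-passing decoder searches the whole labelling space; as a toy illustration of the reweighting in the equation above, one can also rescore a small list of candidate labellings by adding the grammar log-probability. This n-best rescoring is our simplification, not the poster's algorithm.

```python
def rescore_with_grammar(candidates, bigram_logprob):
    """candidates: list of (words, logp_given_x) pairs, where words is a list of strings
    and logp_given_x is log p(l|x) from the CTC network.
    bigram_logprob(prev, word) returns log p(word | prev) under the grammar G."""
    def grammar_score(words):
        return sum(bigram_logprob(prev, w)
                   for prev, w in zip(['<s>'] + words[:-1], words))
    # arg max_l of log p(l|x) + log p(l|G)
    return max(candidates, key=lambda c: c[1] + grammar_score(c[0]))
```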

Experimental Data

• The IAM on-line database contains lines of unconstrained handwriting, with the pen positions captured from a “smart whiteboard” using an infrared receiver

• The input data consists of sequences of pen coordinates, along with the time each coordinate was recorded and an extra input to indicate when the pen was lifted off the board
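A sketch of how such an input sequence might be assembled; the exact encoding of the pen-up signal is not given on the poster, so the (x, y, time, end-of-stroke) layout and the standardisation below are assumptions.

```python
import numpy as np

def raw_input_sequence(strokes):
    """strokes: list of strokes, each a list of (x, y, t) samples from the whiteboard.
    Returns a (T, 4) array of x, y, time and an end-of-stroke flag marking pen lifts."""
    rows = []
    for stroke in strokes:
        for i, (x, y, t) in enumerate(stroke):
            pen_up = 1.0 if i == len(stroke) - 1 else 0.0
            rows.append([x, y, t, pen_up])
    seq = np.asarray(rows, dtype=float)
    # Standardise each input dimension; a common choice when feeding raw coordinates.
    return (seq - seq.mean(axis=0)) / (seq.std(axis=0) + 1e-8)
```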

• We used the inputs both raw and pre-processed with state-of-the-art techniques, such as skew and slant correction, size normalisation, feature extraction etc.

• A 20,000 word dictionary was used to constrain the output words, giving 94% coverage of the test set

Use of Context

• The main difference between the raw and pre-processed data was the amount of context required to identify letters (the pre-processing provides localised features that are easier to classify)

• The sensitivity of a CTC network to context can be analysed by plotting the sequential Jacobian: the derivative of the output at a particular point with respect to the inputs at all points in the sequence
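The sequential Jacobian is normally obtained by backpropagating from a single output unit; a finite-difference approximation makes the definition concrete. Here network is any callable mapping a (T, D) input array to (T, K) output activations (our interface, for illustration only).

```python
import numpy as np

def sequential_jacobian(network, inputs, t_out, k_out, eps=1e-4):
    """d output[t_out, k_out] / d inputs[t, d] for every time step t and
    input dimension d, estimated by forward differences."""
    base = network(inputs)[t_out, k_out]
    jac = np.zeros_like(inputs)
    for t in range(inputs.shape[0]):
        for d in range(inputs.shape[1]):
            bumped = inputs.copy()
            bumped[t, d] += eps
            jac[t, d] = (network(bumped)[t_out, k_out] - base) / eps
    return jac
```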

• In this example, with raw inputs, the sequential Jacobian is plotted for the output corresponding to letter ‘i’ at the point when ‘i’ is emitted (shown by the dotted line)

• The range of sensitivity extends through most of the word ‘having’

• Note the spike in sensitivity at the very end: this corresponds to the delayed dot of the ‘i’

Results

• Identical BLSTM CTC networks were applied to the raw and pre-processed data

• The word error rate (WER) on the test set was calculated both with and without a bigram language model (LM)
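Word error rate here is the word-level edit distance between the recognised and reference transcriptions, normalised by the reference length. A standard computation (not the authors' evaluation code):

```python
import numpy as np

def word_error_rate(ref_words, hyp_words):
    """Levenshtein distance over words, divided by the reference length."""
    R, H = len(ref_words), len(hyp_words)
    d = np.zeros((R + 1, H + 1), dtype=int)
    d[:, 0] = np.arange(R + 1)
    d[0, :] = np.arange(H + 1)
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1, j - 1] + (ref_words[i - 1] != hyp_words[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[R, H] / max(R, 1)
```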

• An HMM system was used for comparison

System  Input          LM    WER
HMM     pre-processed  yes   35.5%
CTC     raw            no    30.1 ± 0.5%
CTC     pre-processed  no    26.0 ± 0.3%
CTC     raw            yes   22.8 ± 0.2%
CTC     pre-processed  yes   20.4 ± 0.3%

• The CTC network substantially outperformed the HMM, with or without a language model, for both input types

• CTC performance was almost as good with raw inputs as with pre-processed inputs

• It would be impossible to train most sequence learning algorithms (e.g. HMMs or traditional RNNs) with the raw inputs only