Intelligent Handwriting Recognition

Transcript of the MIL presentation "Intelligent Handwriting Recognition" (29 pages)

Page 1

Intelligent Character Recognition

By Suhas Pillai

Advisor: Ray Ptucha

Page 2

Motivation

• In this digital age, there are still many documents that are handwritten and need to be scanned, e.g. legal documents, forms, receipts, bond papers.
• It is more difficult than OCR (Optical Character Recognition), because typed letters can easily be recognized with a fixed set of rules.
• A single word can be written in any number of ways, even by the same person; with so many variations, it is a difficult problem to crack.
• It is of high value to companies that design and develop scanners and printers, e.g. Kodak, Hewlett-Packard, Canon.
• Example: different ways of writing 'of' by the same person.

Page 3

Why not use an OCR tool?

● Tesseract is a famous open-source OCR engine.
● It is widely used for OCR and is maintained by Google.
● It works for around 30-40 different languages.
● It is good for OCR, but not good for intelligent character recognition.
● For example: original image vs. OCR output.
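For reference, this is roughly how Tesseract is driven from Python via the pytesseract wrapper (a minimal sketch, not from the presentation; 'word.png' is a placeholder path, and it assumes the tesseract binary plus the Pillow and pytesseract packages are installed):

    from PIL import Image
    import pytesseract

    # Run the OCR engine on a scanned image and print the recognized text.
    # On clean typed text this works well; on handwriting the output
    # degrades, as the original-image vs. OCR-output example above shows.
    print(pytesseract.image_to_string(Image.open('word.png')))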

Page 4

Previous approaches

• Sophisticated preprocessing techniques.
• Extracting handcrafted features.
• A combination of a classifier and a sequential model, i.e. a hybrid ANN/DNN + Hidden Markov Model.
• Sequential models like HMMs were good at providing transcriptions.

Page 5

Deep Learning

Page 6

Recurrent Neural Networks

• RNNs help model local as well as global context.
• They do not require alphabet-specific preprocessing and need no handcrafted features.
• They can be used for any other language and have shown promising results (e.g. machine translation, NLP, speech and handwriting recognition).
• They work on raw inputs (pixels).
• They form a globally trainable model.
• They are good at handling long-term dependencies.

Page 7

Why the LSTM cell? General idea about LSTMs

• It solves the vanishing gradient problem and thus works better for long-term dependencies.
• Its activation is controlled by 3 multiplicative gates:
  o Input gate
  o Forget gate
  o Output gate
• The gates allow the cell to store or retrieve information over time (see the sketch below).
• LSTMs have shown state-of-the-art results for speech recognition.
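As a concrete reference, here is a minimal NumPy sketch of one step of a standard (1D) LSTM cell (illustrative only, not the presenter's code; the peephole connections that appear in the equations later are omitted here):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, b):
        """One LSTM step; W stacks the four gate weight matrices, b the biases."""
        z = W @ np.concatenate([x, h_prev]) + b
        n = h_prev.shape[0]
        i = sigmoid(z[0*n:1*n])   # input gate: admit new information
        f = sigmoid(z[1*n:2*n])   # forget gate: keep or drop the old cell state
        o = sigmoid(z[2*n:3*n])   # output gate: expose the cell to the output
        g = np.tanh(z[3*n:4*n])   # candidate cell input
        c = f * c_prev + i * g    # the cell stores/retrieves information over time
        h = o * np.tanh(c)
        return h, c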

Page 8

Bidirectional and Multidimensional RNNs (LSTMs)

• Normal recurrent neural networks can look back at previous time steps (the left side) and get contextual information.
• This works well, but it has been seen that context from the right also helps, so we have BLSTMs (Bidirectional LSTMs).
• RNNs are usually structured for 1D sequences, so the input is always converted to a 1D vector before being fed to the RNN.
• So any d-dimensional data needs to be brought down to 1D before it can be processed by an RNN.
• To overcome this shortcoming, [1] suggested Multidimensional RNNs.

[1] Graves, Alex, and Jürgen Schmidhuber. "Offline handwriting recognition with multidimensional recurrent neural networks." Advances in Neural Information Processing Systems (NIPS), 2008.

Page 9

Multidimensional RNNs

• Standard LSTMs are explicitly one-dimensional, with one recurrent connection, and whether to use the information from that recurrent connection is controlled by just one forget gate.
• With multiple dimensions, we extend this idea to n dimensions, with n recurrent connections and the information controlled by n forget gates.
• The network starts scanning from the top left (a sketch of the scan order follows this list):
  1. The thick lines show connections to the current point (i, j).
  2. The connections within the hidden plane are recurrent.
  3. The dashed lines are previous points already scanned by the network.
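A sketch of that scan order in Python (illustrative; it assumes zero initial states h0, c0 and a per-point update function mdlstm_step like the one sketched after the gate equations on Page 16):

    # Visit each point (i, j) of an H x W input after its predecessors
    # (i - 1, j) and (i, j - 1) -- the dashed-line points in the figure.
    for i in range(H):
        for j in range(W):
            h[i][j], c[i][j] = mdlstm_step(
                x[i][j],
                prev_h=[h[i-1][j] if i > 0 else h0, h[i][j-1] if j > 0 else h0],
                prev_c=[c[i-1][j] if i > 0 else c0, c[i][j-1] if j > 0 else c0],
                W=weights, b=biases)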

Page 10

Multidimensional Recurrent Neural Networks

• Mathematically, the network is modeled using the equations on the following slides.

Page 11

Calculation for the input gate

$$ i_t = \sigma\Big( W_{xi}\, x_t + \sum_{d=1}^{D} W_{hi}^{d}\, h_{t-1}^{d} + \sum_{d=1}^{D} w_{ci}^{d} \odot c_{t-1}^{d} + b_i \Big) $$

where
• $i_t$ — input gate value at the current time step
• $\sigma$ — sigmoid activation
• $W_{xi}$ — weights from the input to the hidden layer
• $x_t$ — input to the LSTM block
• $W_{hi}^{d}$ — weights from the previous hidden layer to the current hidden layer's input gate
• $h_{t-1}^{d}$ — previous hidden layer outputs across the $d$ dimensions
• $w_{ci}^{d}$ — peephole weights
• $c_{t-1}^{d}$ — previous cell state outputs across the $d$ dimensions
• $b_i$ — input gate bias

Page 12

Calculation for the forget gate

The forget gate value is calculated for every dimension separately, because this lets the cell store or forget previous information based on which dimension is useful:

$$ f_t^{d} = \sigma\Big( W_{xf}^{d}\, x_t + \sum_{d'=1}^{D} W_{hf}^{d,d'}\, h_{t-1}^{d'} + w_{cf}^{d} \odot c_{t-1}^{d} + b_f^{d} \Big) $$

where
• $\sigma$ — sigmoid activation
• $W_{xf}^{d}$ — weights from the input to the forget gate
• $x_t$ — input to the forget gate from the input layer
• $W_{hf}^{d,d'}$ — weights from the previous hidden layers across the $d$ dimensions to the current hidden layer
• $h_{t-1}^{d'}$ — hidden layer outputs from the previous time steps
• $w_{cf}^{d}$ — peephole weights
• $c_{t-1}^{d}$ — cell state output of the previous time step
• $b_f^{d}$ — forget gate bias, one per dimension

Page 13

Calculation for the cell input

This is the output after the tanh activation. It is not the output of any of the 3 gates; it is the same as the output of a fully connected layer with a tanh activation:

$$ g_t = \tanh\Big( W_{xg}\, x_t + \sum_{d=1}^{D} W_{hg}^{d}\, h_{t-1}^{d} + b_g \Big) $$

where
• $\tanh$ — tanh activation
• $W_{xg}$ — weights from the input layer to the hidden layer
• $x_t$ — input from the input layer to the hidden layer
• $W_{hg}^{d}$ — weights from the previous hidden layers across the $d$ dimensions to the current hidden layer
• $h_{t-1}^{d}$ — hidden layer outputs from previous time steps across the $d$ dimensions
• $b_g$ — bias for the cell input

Page 14

Calculation for the cell state

$$ c_t = i_t \odot g_t + \sum_{d=1}^{D} f_t^{d} \odot c_{t-1}^{d} $$

• $c_t$ is the cell state of this particular LSTM block. A single LSTM block can have multiple cell states, but usually one cell state works well in practice.
• The term $i_t \odot g_t$ calculates whether the input from this particular time step is useful; if not, the input gate value will be close to zero, otherwise close to 1.
• The term $\sum_d f_t^{d} \odot c_{t-1}^{d}$ calculates which dimensions are useful. Suppose the information from dimension $X$ is not useful: the forget gate value calculated for that dimension will be 0, and it is multiplied with the previous time step's cell state of dimension $X$, so no information is carried forward from dimension $X$.

Page 15

Calculation for the output gate

$$ o_t = \sigma\Big( W_{xo}\, x_t + \sum_{d=1}^{D} W_{ho}^{d}\, h_{t-1}^{d} + w_{co} \odot c_t + b_o \Big) $$

where
• $o_t$ — output gate value at the current time step
• $\sigma$ — sigmoid activation
• $W_{xo}$ — weights from the input layer to the output gate
• $x_t$ — input from the input layer to the hidden layer
• $W_{ho}^{d}$ — weights from the previous hidden layers to the current hidden layer across the $d$ dimensions
• $h_{t-1}^{d}$ — hidden layer outputs of previous time steps across the $d$ dimensions
• $w_{co}$ — peephole weights
• $c_t$ — remember, this is the cell state at the current time step
• $b_o$ — output gate bias

Page 16

Calculation for the output of the hidden neuron

$$ h_t = o_t \odot \tanh(c_t) $$

• $h_t$ — output of the LSTM block (i.e. the hidden neuron) at the current time step.
• $o_t$ — output gate value; it decides whether this neuron's output should be given as an input to the hidden layers of future time steps. If not, the value will be close to 0, otherwise close to 1.
• $\tanh(c_t)$ — the cell state of the neuron passed through the tanh activation.
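Putting the five equations above together, here is a minimal NumPy sketch of one MDLSTM cell update for D dimensions (the weight-dictionary layout is an assumption for readability, not the presenter's implementation):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mdlstm_step(x, prev_h, prev_c, W, b):
        """x: input at the current point; prev_h/prev_c: lists of the D
        previous hidden and cell states (e.g. left and above for 2D)."""
        D = len(prev_h)
        # Input gate: one gate, summing context over all D dimensions.
        i = sigmoid(W['xi'] @ x
                    + sum(W['hi'][d] @ prev_h[d] for d in range(D))
                    + sum(W['ci'][d] * prev_c[d] for d in range(D))  # peepholes
                    + b['i'])
        # Forget gates: one per dimension, so a useless dimension is dropped.
        f = [sigmoid(W['xf'][d] @ x
                     + sum(W['hf'][d][e] @ prev_h[e] for e in range(D))
                     + W['cf'][d] * prev_c[d]
                     + b['f'][d])
             for d in range(D)]
        # Cell input (the tanh fully-connected term), then the new cell state.
        g = np.tanh(W['xg'] @ x
                    + sum(W['hg'][d] @ prev_h[d] for d in range(D))
                    + b['g'])
        c = i * g + sum(f[d] * prev_c[d] for d in range(D))
        # Output gate peeks at the *current* cell state; then the hidden output.
        o = sigmoid(W['xo'] @ x
                    + sum(W['ho'][d] @ prev_h[d] for d in range(D))
                    + W['co'] * c
                    + b['o'])
        h = o * np.tanh(c)
        return h, c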

Page 17

CTC (Connectionist Temporal Classification)

• Previous approaches to training end-to-end systems for handwriting recognition involved segmenting the input against the ground truths.
• As a result, we had to do forced alignment, which is prone to errors, and those errors propagate into the training system.
• To overcome this issue, we use Connectionist Temporal Classification (CTC).
• It provides two advantages (a usage sketch follows):
  1. We can have variable-length input; there is no need for forced alignment.
  2. The CTC loss function is differentiable, hence end-to-end trainable.
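For comparison with modern tooling, a hedged sketch of how a CTC loss is wired up in PyTorch (nn.CTCLoss; this is not the presenter's RNNlib or from-scratch code, and all sizes are made up):

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)                 # index 0 is reserved for the CTC blank
    T, N, C = 50, 1, 80                       # time steps, batch size, labels + blank
    logits = torch.randn(T, N, C, requires_grad=True)
    log_probs = logits.log_softmax(2)         # per-time-step outputs, no alignment
    target = torch.tensor([[5, 3, 8, 3]])     # unsegmented label sequence
    loss = ctc(log_probs, target,
               torch.tensor([T]),             # input length (variable per sample)
               torch.tensor([4]))             # target length
    loss.backward()                           # differentiable, end-to-end trainable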

Page 18

Difference between Forced Alignment & CTC

[Figure: forced alignment (left) vs. CTC alignment (right)]

Page 19

CTC Cost Function and Intuition

• The objective function is the negative log likelihood of correctly labelling the entire training data:

$$ O = -\sum_{(x,z) \in S} \ln p(z \mid x) $$

• $x$ — training sample; $S$ — training data; $z$ — target label sequence to be generated.

Page 20

CTC Cost Function and Intuition

• Derivative of the objective function with respect to the output, where $y_k^t$ is the output at time step $t$ for the $k$-th label:

$$ \frac{\partial \big(-\ln p(z \mid x)\big)}{\partial y_k^t} = -\frac{1}{p(z \mid x)\, y_k^t} \sum_{s \in lab(z,k)} \alpha_t(s)\, \beta_t(s) $$

• $p(z \mid x)$ is the probability of all the paths that can be formed for a particular input. A speech recognition analogy: suppose you are saying the word 'Robocop'. There are many ways of saying it, like 'Roooooooooobooocop', 'Robocoooooop', or 'Robo <pause> cop'. $p(z \mid x)$ is the total probability of all the sequences that can be formed for the given word. Since the number of possible paths (sequences) can be exponential, we use the forward-backward algorithm to find the total probability of the paths. More on this in the next slides.

Page 21

Forward-Backward Algorithm

[Figure: two lattices over time steps t = 0 … t = 4, one holding the alphas (forward pass) and one the betas (backward pass). Rows alternate between blank labels and the characters of the target word; each node stores the count of valid CTC paths reaching it (alphas) or leaving it (betas), with counts such as 1, 2, 3, 4, 5, 6, 7, 10, 21.]

Page 22

CTC Cost Function and Intuition

• The figure on the right represents the total number of paths that pass through every node at time step t = 2. For example, alpha for C = 3 and beta for C = 1, so the total number of paths going through C at time step t = 2 is 3 × 1 = 3. This is how we calculate all the paths going through every node at each time step.

• $y_k^t$ is the output at time step $t$ for label $k$.

• Sum across repeated labels: if your ground-truth word is KITKAT, then K appears twice, so you sum across both occurrences of label 'K'.

• $\alpha_t(s)\,\beta_t(s)$ is the total probability of the paths that contain label $s$ at time $t$. So if you have 10 labels and your softmax outputs equal probabilities, i.e. 0.1 at each of 5 time steps, then the probability of the paths through A is $16 \times 0.1^5$ (16 paths, each of probability $0.1^5$).
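To make the lattice concrete, here is a minimal sketch of the alpha (forward) recursion over the blank-extended label (a standard CTC forward pass, not the presenter's code; y is a (T, K) matrix of softmax outputs):

    import numpy as np

    def ctc_alphas(y, label, blank=0):
        ext = [blank]
        for ch in label:               # interleave blanks: b, l1, b, l2, ..., b
            ext += [ch, blank]
        T, S = y.shape[0], len(ext)
        alpha = np.zeros((T, S))
        alpha[0, 0] = y[0, ext[0]]     # start in the leading blank ...
        alpha[0, 1] = y[0, ext[1]]     # ... or directly in the first label
        for t in range(1, T):
            for s in range(S):
                a = alpha[t-1, s]                       # stay on the same node
                if s > 0:
                    a += alpha[t-1, s-1]                # advance one node
                if s > 1 and ext[s] != blank and ext[s] != ext[s-2]:
                    a += alpha[t-1, s-2]                # skip a blank (not allowed
                                                        # between repeated labels)
                alpha[t, s] = a * y[t, ext[s]]
        return alpha

    # p(z|x) is the mass reaching the last label or the trailing blank:
    # alpha = ctc_alphas(y, z); p = alpha[-1, -1] + alpha[-1, -2]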

Page 23

CTC Cost Function and Intuition

• Now, to backpropagate the gradients, we need the gradients with respect to the outputs before the activation is applied, i.e. with respect to the softmax inputs $a_k^t$, where

$$ y_k^t = \frac{e^{a_k^t}}{\sum_{k'} e^{a_{k'}^t}} $$

• Here $k'$ ranges over all the labels and $k$ is the $k$-th label.

• Finally, we arrive at the following equation for the gradient with respect to the output before the activation is applied:

$$ \frac{\partial \big(-\ln p(z \mid x)\big)}{\partial a_k^t} = y_k^t - \frac{1}{p(z \mid x)} \sum_{s \in lab(z,k)} \alpha_t(s)\, \beta_t(s) $$

• For example, take the gradient with respect to the activation at 'A', given that there are 10 labels and all initially output probability 0.1. Then, for label 'A':

P(z|x) = 0 + (3 × 0.1^5) + (3 × 0.1^5) + (16 × 0.1^5) + (3 × 0.1^5) + (3 × 0.1^5) + 0 = 28 × 0.1^5

$\sum_s \alpha_t(s)\,\beta_t(s) = 16 \times 0.1^5$, and the output activation of label 'A' at time $t$ is 0.1, so the gradient value is $0.1 - 16/28 = -0.4714$.
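A quick numeric check of that arithmetic, using the slide's path counts as given:

    y = 0.1                      # every label's softmax output
    p = 28 * y**5                # P(z|x): 3 + 3 + 16 + 3 + 3 = 28 paths of prob 0.1^5
    ab_A = 16 * y**5             # mass of the paths through 'A' at t = 2
    print(y - ab_A / p)          # -0.4714..., matching the slide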

Page 24

Stacking MDLSTMs

Page 25

Architecture

Louradour, Jérôme, and Christopher Kermorvant. "Curriculum learning for handwritten text line recognition." 11th IAPR International Workshop on Document Analysis Systems (DAS), IEEE, 2014.

Page 26

Results

• Trained and tested on the IAM Handwriting Database using:
  a) Python code written from scratch
  b) the RNNlib library

• Data split:
  1. Training data: 80K
  2. Validation data: 20K
  3. Testing data: 15K

• NCER % (Normalized Character Error Rate):
  1. Training NCER: 15.5%
  2. Testing NCER: 15%
  3. Testing NCER with lexicon: 12.60%

• Some examples from the database.

Page 27

Errors made by the network

Page 28

Thank You

Page 29

Questions?