UGent presentation: digital signatures · 2017. 7. 19. · The frequency of letter occurrences...


© 2015, iText Group NV, iText Software Corp., iText Software BVBA

Joris Schellekens

Natural Language Processing


Why?

PDF is where data goes to die

Dark data

Dirty data


Agenda

• Language Recognition

• Tokenization

• Part of Speech (POS) Tagging

• Keyword Extraction


Why do we want to discern languages?

Tokenization

Metadata

Stepping stone


Basic concept (N-Grams)

Intuition

Algorithm

Results


Intuition

Some letters are more common in one language than in another:

- Dutch text rarely uses ‘Q’, whereas French text uses it often

The frequency of letter occurrences (unigrams), double-letter occurrences (bigrams), etc. can be used as a ‘fingerprint’ of the language


This is an example sentence.

bigram   %       count
IS       18.18   2
EN       9.09    1
TE       9.09    1
NC       9.09    1
ES       9.09    1
AM       9.09    1
AN       9.09    1
EX       9.09    1
TH       9.09    1
PL       9.09    1


Algorithm (building model)

1. collect a large volume of text in a specific language

2. retrieve all unigrams, bigrams, trigrams from the text

3. count all n-grams, possibly applying some normalization

   - uppercase vs. lowercase

   - accents

   - numbers

4. store frequency-map for later use
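Steps 2 and 3 above can be sketched in Java roughly as follows. This is a minimal sketch, not the implementation behind the presentation: the names `NGramProfile`, `normalize` and `count` are hypothetical, n-grams are assumed to stay inside a single word, and digits and punctuation are simply dropped, which is only one possible normalization.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class NGramProfile {

    /** Strips accents, lowercases, and turns every run of non-letters into one space. */
    static String normalize(String text) {
        // NFD splits an accented letter into a base letter plus a combining mark
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String withoutAccents = decomposed.replaceAll("\\p{M}", "");
        return withoutAccents.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ")
                .trim();
    }

    /** Counts all n-grams of length 1..maxN that lie inside a single word. */
    static Map<String, Integer> count(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : normalize(text).split(" ")) {
            for (int n = 1; n <= maxN; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    counts.merge(word.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

With `maxN = 2`, the sentence "This is an example sentence." yields the bigram "is" twice in this sketch (once from THIS, once from IS). The resulting map is the frequency-map that step 4 would store.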


Algorithm (comparison)

1. extract all n-grams from a given piece of text (applying same normalization as earlier)

2. normalize the distribution

3. apply some filtering and smoothing

   - Filtering: compress the data, throwing away noise

   - Smoothing: how do we handle unseen events?

4. compare the distribution of the text with that of all known models (cosine similarity)
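The comparison in step 4 could look like the sketch below. It is an illustration under assumptions: `LanguageGuesser`, `cosine` and `guess` are hypothetical names, the profiles are plain n-gram-to-frequency maps, and an n-gram that is missing from a model simply counts as 0 (the crudest answer to the smoothing question; no filtering is applied).

```java
import java.util.Map;

final class LanguageGuesser {

    /** Cosine similarity of two sparse frequency vectors; 0 when either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            // n-grams absent from b contribute nothing to the dot product
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    /** Returns the name of the model most similar to the text, or null if there are no models. */
    static String guess(Map<String, Double> text, Map<String, Map<String, Double>> models) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Double>> model : models.entrySet()) {
            double score = cosine(text, model.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = model.getKey();
            }
        }
        return best;
    }
}
```

Cosine similarity ignores vector length, so raw counts and normalized frequencies give the same ranking here.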


Results

[Figure: recognition of languages as a function of language (DE, EN, FR, NL) and number of characters (16, 32, 64, 128, 256); vertical axis from 0 to 1.2]


Extensions

Handle “unknown” language

Combine all known languages in some random way, making a “random” or “noisy” pseudo-language. If this model is most similar to the text, classify as unknown.

Train the model on several unknown languages. Similar approach.

Can be used to detect variants within a language

Has even been used to detect style specific to a writer within a language, within the same historical period


Why do we want to discern tokens?

Tagging a document

Part of Speech tagging

Lemmatization

Keyword generation

Paragraph splitting


Basic concept (Levenshtein)

Intuition

Algorithm


Intuition

Assume that the difference between two pieces of text is measured as the number of edits

Edits are “delete a character”, “insert a character” and “replace a character by another”

We can use recursion (or, even better, dynamic programming). The number of edits between A[0:i] and B[0:j] can be derived from the number of edits between A[0:i-a] and B[0:j-b], where a and b are each 0 or 1 and not both 0.
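The dynamic-programming idea can be sketched as follows. The class name `Levenshtein` and the choice to keep only two rows of the table are assumptions made for this illustration; the original slide with the algorithm did not survive in the transcript.

```java
final class Levenshtein {

    /** Number of single-character insertions, deletions and replacements that turn a into b. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1]; // distances for the prefix A[0:i-1]
        int[] curr = new int[b.length() + 1]; // distances for the prefix A[0:i]
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // empty prefix of a: insert the first j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // empty prefix of b: delete the first i characters of a
            for (int j = 1; j <= b.length(); j++) {
                int replaceCost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                int replace = prev[j - 1] + replaceCost;
                curr[j] = Math.min(Math.min(delete, insert), replace);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

For example, `Levenshtein.distance("EXBMPLE", "EXAMPLE")` is 1 (one replacement) and `Levenshtein.distance("SENTEENCE", "SENTENCE")` is 1 (one deletion).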


Basic concept (BKTree)

Intuition

Algorithm


Intuition

Assume we wish to correct a given piece of text W.

The brute-force way would be to calculate the edit-distance between W and all possible words in the dictionary, retaining only those that satisfy some criterion

As with most search algorithms, a tree-like structure can reduce the complexity considerably.
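A BK-tree stores each word under its parent at the edge labelled with their edit distance; the triangle inequality then lets a query skip whole subtrees. The sketch below is a generic textbook-style BK-tree, not the code from the presentation, and the names `BKTree`, `add` and `search` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BKTree {
    private final String word;
    private final Map<Integer, BKTree> children = new HashMap<>();

    BKTree(String rootWord) {
        this.word = rootWord;
    }

    /** Inserts a word by following the edge whose label equals the distance to each node. */
    void add(String w) {
        BKTree node = this;
        while (true) {
            int d = distance(w, node.word);
            if (d == 0) return; // already present
            BKTree child = node.children.get(d);
            if (child == null) {
                node.children.put(d, new BKTree(w));
                return;
            }
            node = child;
        }
    }

    /** Returns all stored words within maxDistance edits of the query. */
    List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        collect(query, maxDistance, hits);
        return hits;
    }

    private void collect(String query, int maxDistance, List<String> hits) {
        int d = distance(query, word);
        if (d <= maxDistance) hits.add(word);
        // Triangle inequality: only edges labelled d-maxDistance..d+maxDistance can hold matches
        for (int k = Math.max(1, d - maxDistance); k <= d + maxDistance; k++) {
            BKTree child = children.get(k);
            if (child != null) child.collect(query, maxDistance, hits);
        }
    }

    /** Levenshtein distance, computed row by row. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With the words EXAMPLE, SENTENCE, THIS, IS, AN and SAMPLE in the tree, `search("EXBMPLE", 1)` returns only EXAMPLE.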


Putting it all together

Start with token = “”

While there is a prefix in the dictionary that matches the token
=> Read more characters from the stream

If there is no such prefix, but the token is a valid word
=> Add the token to the output

If the token is not a valid prefix, and not a valid word
=> Split at the last boundary that was still a valid prefix, continue from there
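One way to read the loop above is sketched here. This is an interpretation, not the presenter's `DictionaryBasedTokenizer`: the class `PrefixTokenizer` and its methods are hypothetical, the dictionary is a sorted set so that a prefix test is a single `ceiling` lookup, and a character that starts no dictionary word is emitted as a token of its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class PrefixTokenizer {
    private final TreeSet<String> dictionary = new TreeSet<>();

    PrefixTokenizer(List<String> words) {
        dictionary.addAll(words);
    }

    /** True if at least one dictionary word starts with s. */
    boolean isPrefix(String s) {
        // The smallest word >= s starts with s whenever any word does
        String candidate = dictionary.ceiling(s);
        return candidate != null && candidate.startsWith(s);
    }

    boolean isWord(String s) {
        return dictionary.contains(s);
    }

    /** Splits the text into the longest runs that are still a prefix of some dictionary word. */
    List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = start + 1; // always consume at least one character
            while (end < text.length() && isPrefix(text.substring(start, end + 1))) {
                end++;
            }
            out.add(text.substring(start, end));
            start = end;
        }
        return out;
    }
}
```

Each emitted token can then be labelled with `isWord` (valid word), `isPrefix` (valid prefix) or neither (unknown token). The exact split of a misspelled word depends on the dictionary, so the output will not necessarily match the table in the intermediate results.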


Results (intermediate)

Token      Result
THIS       valid word
_space_    unknown token, separated
IS         valid word
_space_    unknown token, separated
AN         valid word
_space_    unknown token, separated
EX         valid word
BM         valid prefix
PLE        valid prefix
_space_    unknown token, separated
…          …


Advanced

Splits too much (this is great)

Sometimes we need to merge tokens again

Define an objective function

   - Minimize number of tokens

   - Maximize number of tokens known in dictionary

   - Minimize |token length - known average token length|

   - Take into account misspelled words (BKTree)

   - Take into account unknown words

=> Use a meta-heuristic to find (local) optimum
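A toy version of such an objective function, combined with hill climbing as the meta-heuristic, might look like this. Everything specific here is an assumption: the class name `MergeByHillClimbing`, the weights 2.0, 1.0 and 0.25, and the fact that misspelled and unknown words are left out of the score. Real weights would need tuning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class MergeByHillClimbing {

    /** Higher is better: rewards known tokens, penalizes token count and unusual token lengths. */
    static double score(List<String> tokens, Set<String> dictionary, double averageLength) {
        int known = 0;
        double lengthPenalty = 0;
        for (String t : tokens) {
            if (dictionary.contains(t)) known++;
            lengthPenalty += Math.abs(t.length() - averageLength);
        }
        // Hypothetical weights, chosen only for this illustration
        return 2.0 * known - 1.0 * tokens.size() - 0.25 * lengthPenalty;
    }

    /** Repeatedly merges the first adjacent pair that improves the score, until no merge helps. */
    static List<String> improve(List<String> tokens, Set<String> dictionary, double averageLength) {
        List<String> current = new ArrayList<>(tokens);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int i = 0; i + 1 < current.size(); i++) {
                List<String> candidate = new ArrayList<>(current);
                candidate.set(i, current.get(i) + current.get(i + 1));
                candidate.remove(i + 1);
                double before = score(current, dictionary, averageLength);
                double after = score(candidate, dictionary, averageLength);
                if (after > before) {
                    current = candidate;
                    improved = true;
                    break; // restart the scan from the new state
                }
            }
        }
        return current;
    }
}
```

Starting from the tokens EX, AM, PLE with a dictionary containing EX and EXAMPLE and an average length of 5, this sketch ends at the single token EXAMPLE. As with any hill climber, the result is only a local optimum.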


Results (advanced)

Token       Result
THIS        valid word
_space_     later validated token
IS          valid word
_space_     later validated token
AN          valid word
_space_     later validated token
EXBMPLE     valid (misspelled) word (half score of known word)
_space_     later validated token
SENTEENCE   valid (misspelled) word (half score of known word)
_space_     later validated token
…           …


Part of Speech (POS) Tagging

Keyword Extraction

Spellcheck

Named Entity Recognition


Algorithm (high level)

Assign POS-tag to every token

Categories such as “noun”, “adjective”, etc.

Algorithms have not changed much since the boom of NLP

Use (large) corpus

Keep track of assignments (e.g. “EAT” => “VB”)

Keep track of chains (e.g. “VB TO ..” => VB)

Brute force approach vs. Markov chains (Viterbi)
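The "keep track of assignments" part can be illustrated with a most-frequent-tag baseline. This sketch deliberately ignores the chains and the Viterbi decoding mentioned above, and it is not the `MaxEntTagger` used later in the presentation; the class name, the `observe` method and the fallback tag are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class MostFrequentTagTagger {
    // token (uppercased) -> tag -> number of times seen in the training corpus
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    /** Records one (token, tag) pair from a tagged corpus. */
    void observe(String token, String tag) {
        counts.computeIfAbsent(token.toUpperCase(Locale.ROOT), k -> new HashMap<>())
              .merge(tag, 1, Integer::sum);
    }

    /** Tags every token with the tag it carried most often; unseen tokens get fallbackTag. */
    List<String> tag(List<String> tokens, String fallbackTag) {
        List<String> tags = new ArrayList<>();
        for (String token : tokens) {
            Map<String, Integer> seen = counts.get(token.toUpperCase(Locale.ROOT));
            String best = fallbackTag;
            int bestCount = 0;
            if (seen != null) {
                for (Map.Entry<String, Integer> e : seen.entrySet()) {
                    if (e.getValue() > bestCount) {
                        bestCount = e.getValue();
                        best = e.getKey();
                    }
                }
            }
            tags.add(best);
        }
        return tags;
    }
}
```

A chain-aware tagger would additionally count which tag follows which and pick the most likely tag sequence as a whole, which is where Viterbi comes in.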


import java.io.File;
import java.io.FileInputStream;
import java.util.List;

// Directory that holds the dictionary and tagger model files
File root = new File("C:\\Users\\Joris Schellekens\\Documents\\NetBeansProjects\\NLP\\src\\main\\java\\nlp\\models");

// 1. create tokenizer: a dictionary-based tokenizer wrapped in a hill-climbing tokenizer
ITokenizer t0 = new DictionaryBasedTokenizer();
t0.setFlag(IModel.IModelFlag.IGNORE_CASE);
t0.load(new FileInputStream(new File(root, "dictionary_en.xml")));
ITokenizer tokenizer = new HillClimbingTokenizer(t0);
tokenizer.load(new FileInputStream(new File(root, "dictionary_en.xml")));

// 2. create tagger
IPOSTagger tagger = new MaxEntTagger();
tagger.load(new FileInputStream(new File(root, "postagger_en.xml")));
tagger.setFlag(IModel.IModelFlag.IGNORE_CASE);

// 3. define input
String text = "I want to eat a banana.";

// 4. split the text into tokens, ignoring whitespace
tokenizer.setFlag(IModel.IModelFlag.IGNORE_WHITESPACE);
List<String> tokens = tokenizer.tokens(text);

// 5. POS tagging: one tag per token
List<String> tags = tagger.tag(tokens);

// 6. output: one tag and its token per line, separated by a tab
for (int i = 0; i < tokens.size(); i++) {
    System.out.println(tags.get(i) + "\t" + tokens.get(i));
}


Results

Token      Tag   Meaning
THIS       DT    determiner/pronoun, singular
IS         BEZ   verb "to be", present tense, 3rd person singular
AN         AT    article
EXAMPLE    NN    noun, singular, common
SENTENCE   NN    noun, singular, common
.          .     sentence terminator


Keyword Extraction

PDF Metadata

Document classification

Document retrieval


Intuition


We only care about some kinds of words (nouns, adjectives, verbs)

Not: “a” or “the” or “in” or “over”

Hence the previous POS-tagging
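Filtering on the POS tags could be as simple as the sketch below. The class name `KeywordCandidates` is hypothetical, and the rule that tags starting with NN, JJ or VB mark nouns, adjectives and verbs is an assumption about the tag set; the tags shown in the results table (DT, BEZ, AT, NN) suggest a Brown-style tag set, where forms of "to be" carry their own tags and would not be kept by this rule.

```java
import java.util.ArrayList;
import java.util.List;

final class KeywordCandidates {

    /** Keeps the tokens whose tag marks a noun, adjective or verb (assumed prefixes NN, JJ, VB). */
    static List<String> select(List<String> tokens, List<String> tags) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String tag = tags.get(i);
            if (tag.startsWith("NN") || tag.startsWith("JJ") || tag.startsWith("VB")) {
                kept.add(tokens.get(i));
            }
        }
        return kept;
    }
}
```

Applied to the tagged example sentence (THIS/DT IS/BEZ AN/AT EXAMPLE/NN SENTENCE/NN ./.), only EXAMPLE and SENTENCE remain as keyword candidates.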


But ..

From that point on, many algorithms differ (relatively new field of computer science)

Brute force

Word2vec

Distribution based

TF-IDF

Graph based

Eigenvalues

K-Truss
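Of the approaches listed, TF-IDF is the easiest to show in a few lines. The sketch uses one common variant (term frequency divided by document length, inverse document frequency as the natural logarithm of N / df); the presentation does not say which variant produced its scores, and the class name `TfIdf` is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TfIdf {

    /** Scores every term of the document against a corpus of tokenized documents. */
    static Map<String, Double> scores(List<String> document, List<List<String>> corpus) {
        Map<String, Integer> termCounts = new HashMap<>();
        for (String term : document) {
            termCounts.merge(term, 1, Integer::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            int documentFrequency = 0;
            for (List<String> other : corpus) {
                if (other.contains(e.getKey())) documentFrequency++;
            }
            double tf = e.getValue() / (double) document.size();
            // max(1, ...) avoids dividing by zero when the document is not part of the corpus
            double idf = Math.log((double) corpus.size() / Math.max(1, documentFrequency));
            result.put(e.getKey(), tf * idf);
        }
        return result;
    }
}
```

A term that occurs in every document of the corpus gets a score of 0, which is why very common words drop out of a keyword list.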


Can you guess the book?

MOUSE 0.034373513

TIME 0.031874901

WAY 0.030837694

THINGS 0.021030136

THING 0.019074103

RABBIT 0.018132116

FEET 0.017900016

CATS 0.01612702

POOL 0.016089348

DOOR 0.016043337

COURSE 0.015221295

USE 0.014015975

TABLE 0.013329826

WORDS 0.013056314

KEY 0.012309574

QUESTION 0.011830957

TEARS 0.011804331

MOMENT 0.011682474

EYES 0.011620854

GARDEN 0.011149873


HEAVING 0.382881279

SEA 0.024455707

WATER 0.018881216

SUN 0.012538923

FLOWERS 0.01053239

SISTER 0.008346418

SHIPS 0.007588059

THINGS 0.007276676

OTHERS 0.007239968

TURN 0.007074388

SKY 0.006811269

FISH 0.006658898

TREES 0.006632687

RISE 0.005986538

CASTLE 0.00590245

STATUE 0.00586363

LAND 0.005778375

BIRDS 0.00575662

FISHES 0.005703104

BOTTOM 0.005627887


Extensions

Lemmatization

Remove inflectional endings to return the base (dictionary) form of a word

Stemming

Remove inflectional endings to return a base form of a word

Not identical to the morphological root

Collisions: different words can be reduced to the same stem
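To make the difference with lemmatization concrete, here is a sketch that applies only step 1a of the Porter stemmer (SSES -> SS, IES -> I, SS -> SS, S -> nothing). The full Porter algorithm has several further steps that are omitted, and the class name `PluralStemmer` is hypothetical.

```java
import java.util.Locale;

final class PluralStemmer {

    /** Applies step 1a of the Porter stemmer to a single word. */
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // caresses -> caress
        if (w.endsWith("ies")) return w.substring(0, w.length() - 2);  // ponies -> poni
        if (w.endsWith("ss")) return w;                                // caress -> caress
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);    // cats -> cat
        return w;
    }
}
```

`stem("ponies")` gives "poni", which is not a dictionary word, so a stem is indeed not the same as the morphological root. THINGS and THING both end up as "thing", so their counts would be merged in a keyword list.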


Porter Stemming


TIME 0.0338

THINGS 0.033378

QUEEN 0.031056

KINGS 0.027731

TURTLE 0.026883

WAY 0.025023

HEADS 0.024986

RABBIT 0.02355

VOICE 0.023143

CAT 0.020859

DUCHESS 0.018518

MOUSE 0.018515

TONE 0.015746

HAND 0.01526

EYES 0.01489

DOORS 0.014346

MINUTES 0.013891

WORDS 0.013477

DAY 0.013451

HARE 0.012962

MOMENT 0.01296

CATERPILLAR 0.012042

JURY 0.011549

COURSE 0.011109

GARDEN 0.010652

MOUSE 0.034373513

TIME 0.031874901

WAY 0.030837694

THINGS 0.021030136

THING 0.019074103

RABBIT 0.018132116

FEET 0.017900016

CATS 0.01612702

POOL 0.016089348

DOOR 0.016043337

COURSE 0.015221295

USE 0.014015975

TABLE 0.013329826

WORDS 0.013056314

KEY 0.012309574

QUESTION 0.011830957

TEARS 0.011804331

MOMENT 0.011682474

EYES 0.011620854

GARDEN 0.011149873


Conclusion

Next step? Up to you ..


www.modsummit.com

www.developersummit.com