UGent presentation: digital signatures · 2017. 7. 19. · The frequency of letter occurrences...

© 2015, iText Group NV, iText Software Corp., iText Software BVBA

Joris Schellekens

Natural Language Processing


Why?

PDF is where data goes to die

Dark data

Dirty data


Agenda

• Language Recognition

• Tokenization

• Part of Speech (POS) Tagging

• Keyword Extraction



Why do we want to discern languages?

Tokenization

Metadata

Stepping stone



Basic concept (N-Grams)

Intuition

Algorithm

Results


Intuition

Some letters are more common in a language than others:

- Dutch text typically does not use ‘Q’ often, whereas French text does

The frequency of letter occurrences (unigrams), double-letter occurrences (bigrams), etc. can be used as a ‘fingerprint’ of the language



This is an example sentence.


bigram  %      count
IS      18.18  2
EN       9.09  1
TE       9.09  1
NC       9.09  1
ES       9.09  1
AM       9.09  1
AN       9.09  1
EX       9.09  1
TH       9.09  1
PL       9.09  1


Algorithm (building model)

1. collect a large volume of text in a specific language

2. retrieve all unigrams, bigrams, trigrams from the text

3. count all n-grams, possibly applying some normalization

   - uppercase vs. lowercase
   - accents
   - numbers

4. store frequency-map for later use
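The model-building steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the presenter's implementation: the class name `NGramCounter`, the choice to keep n-grams inside word boundaries, and the exact normalization (uppercase, accents stripped, digits and punctuation dropped) are all assumptions.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: counts unigrams, bigrams, trigrams, ... in a text.
final class NGramCounter {

    // Assumed normalization: strip accents, uppercase, keep only A-Z and spaces.
    static String normalize(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String withoutAccents = decomposed.replaceAll("\\p{M}", "");
        return withoutAccents.toUpperCase(Locale.ROOT).replaceAll("[^A-Z ]", "");
    }

    // Counts every n-gram of size 1..maxN that lies inside a single word.
    static Map<String, Integer> count(String text, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : normalize(text).split(" +")) {
            for (int n = 1; n <= maxN; n++) {
                for (int i = 0; i + n <= word.length(); i++) {
                    counts.merge(word.substring(i, i + n), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> model = count("This is an example sentence.", 3);
        System.out.println("IS  -> " + model.get("IS"));
        System.out.println("THI -> " + model.get("THI"));
    }
}
```

Dividing each count by the total number of n-grams of the same size turns the map into the frequency ‘fingerprint’; the resulting map can then be serialized in whatever format suits the application.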



Algorithm (comparison)

1. extract all n-grams from a given piece of text (applying same normalization as earlier)

2. normalize the distribution

3. apply some filtering and smoothing

   - Filtering: compress the data, throwing away noise
   - Smoothing: how do we handle unseen events?

4. compare the distribution of the text with that of all known models (cosine similarity)
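The comparison step can be sketched as a cosine similarity between sparse frequency maps. The class and method names below are hypothetical, the frequencies in `main` are invented toy values, and filtering and smoothing are left out; an n-gram missing from a model is simply treated as frequency 0.

```java
import java.util.Map;

// Hypothetical helper: picks the language model closest to a text profile.
final class LanguageGuesser {

    // Cosine similarity between two sparse vectors stored as n-gram -> frequency.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Returns the name of the most similar model, or null when no models are given.
    static String closest(Map<String, Double> profile, Map<String, Map<String, Double>> models) {
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, Map<String, Double>> model : models.entrySet()) {
            double score = cosine(profile, model.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = model.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy frequencies, invented purely for illustration.
        Map<String, Double> nl = Map.of("IJ", 0.5, "EN", 0.5);
        Map<String, Double> fr = Map.of("QU", 0.5, "ES", 0.5);
        Map<String, Double> text = Map.of("IJ", 0.3, "EN", 0.6, "ES", 0.1);
        System.out.println(closest(text, Map.of("NL", nl, "FR", fr)));
    }
}
```

Because cosine similarity ignores vector length, raw counts and normalized frequencies give the same ranking here; normalization matters mainly once filtering or smoothing is applied.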



Results


[Figure: recognition of languages as a function of language (DE, EN, FR, NL) and number of characters (16, 32, 64, 128, 256); vertical axis from 0 to 1.2]


Extensions

Handle “unknown” language

Combine all known languages in some random way, making a “random” or “noisy” pseudo-language. If this model is most similar to the text, classify as unknown.

Train the model on several unknown languages. Similar approach.

Can be used to detect variants within a language

Has even been used to detect style specific to a writer within a language, within the same historical period




Why do we want to discern tokens?

Tagging a document

Part of Speech tagging

Lemmatization

Keyword generation

Paragraph splitting



Basic concept (Levenshtein)

Intuition

Algorithm


Intuition

Assume that the difference between two pieces of text is measured as the number of edits needed to turn one into the other

Edits are “delete a character”, “insert a character” and “replace a character by another”

We can use recursion (or, even better, dynamic programming). The number of edits between A[0:i] and B[0:j] can be derived from the number of edits between A[0:i-1] and B[0:j] (delete), A[0:i] and B[0:j-1] (insert), and A[0:i-1] and B[0:j-1] (replace, or no edit when the last characters match).
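A minimal dynamic-programming sketch of this edit distance is shown below. The class name `Levenshtein` is chosen for illustration only; each edit is assumed to cost 1.

```java
// Illustrative sketch: edit distance with unit cost for delete, insert and replace.
final class Levenshtein {

    static int distance(String a, String b) {
        // d[i][j] = number of edits between the first i chars of a and the first j chars of b
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete everything
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert everything
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int replace = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + replace);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("EXBMPLE", "EXAMPLE"));   // prints 1
        System.out.println(distance("SENTEENCE", "SENTENCE")); // prints 1
    }
}
```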



Algorithm




Basic concept (BKTree)

Intuition

Algorithm


Intuition

Assume we wish to correct a given piece of text W.

The brute-force way would be to calculate the edit-distance between W and all possible words in the dictionary, retaining only those that satisfy some criterion

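As with most search problems, a tree-like structure can reduce the cost enormously: a BK-tree groups words by their edit distance to a parent word, so most of the dictionary never has to be compared against W.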
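The idea can be sketched as a small BK-tree in Java. This is an illustrative version, not the presenter's code: the class and method names are assumptions, and the edit distance is re-implemented inline to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative BK-tree over the edit distance.
final class BKTree {

    private static final class Node {
        final String word;
        final Map<Integer, Node> children = new HashMap<>(); // keyed by distance to this node
        Node(String word) { this.word = word; }
    }

    private Node root;

    void add(String word) {
        if (root == null) { root = new Node(word); return; }
        Node node = root;
        while (true) {
            int d = distance(word, node.word);
            if (d == 0) return; // word already stored
            Node child = node.children.get(d);
            if (child == null) { node.children.put(d, new Node(word)); return; }
            node = child;
        }
    }

    // Returns all stored words within maxDistance edits of the query.
    List<String> search(String query, int maxDistance) {
        List<String> hits = new ArrayList<>();
        if (root != null) search(root, query, maxDistance, hits);
        return hits;
    }

    private void search(Node node, String query, int max, List<String> hits) {
        int d = distance(query, node.word);
        if (d <= max) hits.add(node.word);
        // Triangle inequality: matches can only sit under edges labelled d-max .. d+max.
        for (int k = Math.max(1, d - max); k <= d + max; k++) {
            Node child = node.children.get(k);
            if (child != null) search(child, query, max, hits);
        }
    }

    // Two-row dynamic-programming edit distance.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int replace = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1), prev[j - 1] + replace);
            }
            int[] swap = prev; prev = cur; cur = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        BKTree tree = new BKTree();
        for (String w : List.of("THIS", "IS", "AN", "EXAMPLE", "SENTENCE")) tree.add(w);
        System.out.println(tree.search("EXBMPLE", 1)); // prints [EXAMPLE]
    }
}
```

The pruning in `search` is what replaces the brute-force scan: subtrees whose edge label is too far from the query's distance to the current node are never visited.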



Algorithm



Putting it all together

Start with token = “”

While the token is a prefix of some word in the dictionary => Append more characters from the stream

If the token is no longer a valid prefix, but is a valid word => Add the token to the output

If the token is neither a valid prefix nor a valid word => Split at the last boundary that was still a valid prefix, and continue from there
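One possible reading of this loop is sketched below. The class name `PrefixTokenizer` is hypothetical, and the tie-breaking is an assumption: the sketch emits the longest complete word it saw, falls back to the longest valid prefix, and otherwise emits a single unknown character. A sorted set stands in for the dictionary so that prefix checks stay simple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative dictionary-driven splitter; not the presenter's DictionaryBasedTokenizer.
final class PrefixTokenizer {

    private final TreeSet<String> dictionary = new TreeSet<>();

    PrefixTokenizer(List<String> words) {
        dictionary.addAll(words);
    }

    // True when at least one dictionary word starts with s.
    private boolean isPrefix(String s) {
        String next = dictionary.ceiling(s);
        return next != null && next.startsWith(s);
    }

    List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = start;   // exclusive end of the longest valid prefix
            int lastWord = -1; // exclusive end of the longest complete word seen so far
            while (end < text.length() && isPrefix(text.substring(start, end + 1))) {
                end++;
                if (dictionary.contains(text.substring(start, end))) lastWord = end;
            }
            // Prefer a complete word, then a valid prefix, then a single unknown character.
            int cut = lastWord != -1 ? lastWord : Math.max(end, start + 1);
            out.add(text.substring(start, cut));
            start = cut;
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixTokenizer tokenizer =
                new PrefixTokenizer(List.of("THIS", "IS", "AN", "EXAMPLE", "SENTENCE"));
        System.out.println(tokenizer.tokens("THIS IS AN EXBMPLE"));
    }
}
```

With this small dictionary the misspelled word is cut into several fragments, which is the over-splitting that the next slides address.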



Results (intermediate)


Token    Status
THIS     valid word
_space_  unknown token, separated
IS       valid word
AN       valid word
EX       valid word
BM       valid prefix
PLE      valid prefix
…        …


Advanced

Splits too much (this is great)

Sometimes we need to merge tokens again

Define an objective function

- Minimize number of tokens
- Maximize number of tokens known in dictionary
- Minimize |token length – known average token length|

Take into account misspelled words (BKTree)

Take into account unknown words

=> Use a meta-heuristic to find (local) optimum
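A rough sketch of such an objective function, combined with a simple hill-climbing merge step, could look like this. Everything specific here is an assumption: the class name `MergeClimber`, the weights (1.0 for a known word, 0.5 for a word one edit away from a known word, 0.1 per token and per character of length deviation), and the choice to only try merges of adjacent tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative hill climber that merges over-split tokens; weights are placeholders.
final class MergeClimber {

    private final Set<String> dictionary;
    private final double averageLength; // assumed known average token length

    MergeClimber(Set<String> dictionary, double averageLength) {
        this.dictionary = dictionary;
        this.averageLength = averageLength;
    }

    // Higher is better: rewards known words, penalizes token count and odd lengths.
    double score(List<String> tokens) {
        double total = -0.1 * tokens.size();
        for (String t : tokens) {
            if (dictionary.contains(t)) total += 1.0;
            else if (isNearMiss(t)) total += 0.5; // misspelled word: half score
            total -= 0.1 * Math.abs(t.length() - averageLength);
        }
        return total;
    }

    // Brute-force check; a BK-tree would replace this loop in practice.
    private boolean isNearMiss(String token) {
        for (String word : dictionary) if (distance(token, word) == 1) return true;
        return false;
    }

    // Applies the best-scoring adjacent merge until no merge improves the score.
    List<String> climb(List<String> start) {
        List<String> current = new ArrayList<>(start);
        while (true) {
            List<String> best = null;
            double bestScore = score(current);
            for (int i = 0; i + 1 < current.size(); i++) {
                List<String> candidate = new ArrayList<>(current.subList(0, i));
                candidate.add(current.get(i) + current.get(i + 1));
                candidate.addAll(current.subList(i + 2, current.size()));
                double s = score(candidate);
                if (s > bestScore) { bestScore = s; best = candidate; }
            }
            if (best == null) return current; // local optimum reached
            current = best;
        }
    }

    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int replace = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1), prev[j - 1] + replace);
            }
            int[] swap = prev; prev = cur; cur = swap;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        MergeClimber climber = new MergeClimber(Set.of("THIS", "IS", "AN", "EXAMPLE"), 4.0);
        System.out.println(climber.climb(List.of("AN", "EX", "B", "M", "P", "L", "E")));
    }
}
```

Hill climbing only guarantees a local optimum, which matches the wording on the slide; other meta-heuristics could be swapped in behind the same `score` function.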



Results (advanced)


Token

THIS valid word

_space_ later validated token

IS valid word


AN valid word


EXBMPLE valid (misspelled) word (half score of known word)


SENTEENCE valid (misspelled) word (half score of known word)


… …



Part of Speech (POS) Tagging

Keyword Extraction

Spellcheck

Named Entity Recognition


Algorithm (high level)


Assign POS-tag to every token

Categories such as “noun”, “adjective”, etc

Algorithms have not changed much since the boom of NLP

Use a (large) tagged corpus

Keep track of assignments (e.g. “EAT” => “VB”)

Keep track of chains (e.g. “VB TO ..” => VB)

Brute force approach vs. Markov chains (Viterbi)
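The Markov-chain variant can be sketched as a small bigram hidden Markov model decoded with Viterbi. This is a toy illustration and not the `MaxEntTagger` used in the deck: the class name, the add-one smoothing, and the two hand-tagged training sentences in `main` are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative bigram HMM tagger: counts assignments and chains, decodes with Viterbi.
final class ViterbiTagger {

    private static final String START = "<s>";
    private final Map<String, Map<String, Integer>> emissions = new HashMap<>();   // tag -> word -> count
    private final Map<String, Map<String, Integer>> transitions = new HashMap<>(); // tag -> next tag -> count
    private final Set<String> tags = new TreeSet<>();
    private final Set<String> vocabulary = new HashSet<>();

    // Each element of the sentence is a {word, tag} pair.
    void train(String[][] sentence) {
        String previous = START;
        for (String[] pair : sentence) {
            vocabulary.add(pair[0]);
            tags.add(pair[1]);
            emissions.computeIfAbsent(pair[1], k -> new HashMap<>()).merge(pair[0], 1, Integer::sum);
            transitions.computeIfAbsent(previous, k -> new HashMap<>()).merge(pair[1], 1, Integer::sum);
            previous = pair[1];
        }
    }

    // Add-one smoothed log-probability of key within a row of counts.
    private static double logProb(Map<String, Integer> row, String key, int outcomes) {
        int total = 0, count = 0;
        if (row != null) {
            for (int c : row.values()) total += c;
            count = row.getOrDefault(key, 0);
        }
        return Math.log((count + 1.0) / (total + outcomes));
    }

    List<String> tag(List<String> words) {
        List<String> tagList = new ArrayList<>(tags);
        int n = words.size(), k = tagList.size();
        if (n == 0 || k == 0) return new ArrayList<>();
        double[][] best = new double[n][k]; // best log-probability of a path ending in tag t at word i
        int[][] back = new int[n][k];       // previous tag on that best path
        for (int i = 0; i < n; i++) {
            for (int t = 0; t < k; t++) {
                double emit = logProb(emissions.get(tagList.get(t)), words.get(i), vocabulary.size() + 1);
                if (i == 0) {
                    best[i][t] = logProb(transitions.get(START), tagList.get(t), k) + emit;
                    continue;
                }
                best[i][t] = Double.NEGATIVE_INFINITY;
                for (int p = 0; p < k; p++) {
                    double s = best[i - 1][p]
                            + logProb(transitions.get(tagList.get(p)), tagList.get(t), k) + emit;
                    if (s > best[i][t]) { best[i][t] = s; back[i][t] = p; }
                }
            }
        }
        int last = 0;
        for (int t = 1; t < k; t++) if (best[n - 1][t] > best[n - 1][last]) last = t;
        String[] result = new String[n];
        for (int i = n - 1; i >= 0; i--) { result[i] = tagList.get(last); last = back[i][last]; }
        return Arrays.asList(result);
    }

    public static void main(String[] args) {
        ViterbiTagger tagger = new ViterbiTagger();
        // Tiny hand-tagged corpus, invented for illustration.
        tagger.train(new String[][] {{"I", "PPSS"}, {"WANT", "VB"}, {"TO", "TO"},
                {"EAT", "VB"}, {"A", "AT"}, {"BANANA", "NN"}});
        tagger.train(new String[][] {{"THIS", "DT"}, {"IS", "BEZ"}, {"AN", "AT"}, {"EXAMPLE", "NN"}});
        System.out.println(tagger.tag(List.of("I", "EAT", "A", "BANANA")));
    }
}
```

A brute-force tagger would score every possible tag sequence; Viterbi reaches the same best sequence while only keeping the best path per tag at each position.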



import java.io.File;
import java.io.FileInputStream;
import java.util.List;

// Directory that holds the model files (path from the presenter's machine)
File root = new File("C:\\Users\\Joris Schellekens\\Documents\\NetBeansProjects\\NLP\\src\\main\\java\\nlp\\models");

// 1. create tokenizer: a dictionary-based tokenizer wrapped in a hill-climbing tokenizer
ITokenizer t0 = new DictionaryBasedTokenizer();
t0.setFlag(IModel.IModelFlag.IGNORE_CASE);
t0.load(new FileInputStream(new File(root, "dictionary_en.xml")));
ITokenizer tokenizer = new HillClimbingTokenizer(t0);
tokenizer.load(new FileInputStream(new File(root, "dictionary_en.xml")));

// 2. create tagger
IPOSTagger tagger = new MaxEntTagger();
tagger.load(new FileInputStream(new File(root, "postagger_en.xml")));
tagger.setFlag(IModel.IModelFlag.IGNORE_CASE);

// 3. define input
String text = "I want to eat a banana.";

// 4. split
tokenizer.setFlag(IModel.IModelFlag.IGNORE_WHITESPACE);
List<String> tokens = tokenizer.tokens(text);

// 5. POS tagging
List<String> tags = tagger.tag(tokens);

// 6. output: one "tag <TAB> token" line per token
for (int i = 0; i < tokens.size(); i++) {
    System.out.println(tags.get(i) + "\t" + tokens.get(i));
}


Results

Token     Tag  Meaning
THIS      DT   determiner/pronoun, singular
IS        BEZ  verb "to be", present tense, 3rd person singular
AN        AT   article
EXAMPLE   NN   noun, singular, common
SENTENCE  NN   noun, singular, common
.         .    sentence terminator




Keyword Extraction

PDF Metadata

Document classification

Document retrieval


Intuition


We only care about some kinds of words (nouns, adjectives, verbs)

Not: “a” or “the” or “in” or “over”

Hence the previous POS-tagging


But ..


From that point on, the algorithms differ widely (this is a relatively new field of computer science)

Brute force

Word2vec

Distribution based

TF-IDF

Graph based

Eigenvalues

K-Truss
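As an illustration of the distribution-based family, TF-IDF can be sketched in a few lines. The class name is hypothetical, the weighting variant (relative term frequency times the natural log of N over document frequency) is one common choice among several, and the slides do not say which method produced the keyword scores they show.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative TF-IDF scorer over already tokenized (and POS-filtered) documents.
final class TfIdf {

    // score(term) = (occurrences of term in doc / doc length) * ln(N / documents containing term)
    static Map<String, Double> scores(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1.0 / doc.size(), Double::sum);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet()) {
            long df = corpus.stream().filter(d -> d.contains(e.getKey())).count();
            double idf = Math.log((double) corpus.size() / Math.max(1, df));
            result.put(e.getKey(), e.getValue() * idf);
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy documents, invented for illustration.
        List<String> first = List.of("RABBIT", "TIME", "RABBIT");
        List<String> second = List.of("SEA", "TIME");
        System.out.println(scores(first, List.of(first, second)));
    }
}
```

A term that appears in every document gets an IDF of 0, so only terms that distinguish a document from the rest of the corpus surface as keywords.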


Can you guess the book?

MOUSE 0.034373513

TIME 0.031874901

WAY 0.030837694

THINGS 0.021030136

THING 0.019074103

RABBIT 0.018132116

FEET 0.017900016

CATS 0.01612702

POOL 0.016089348

DOOR 0.016043337

COURSE 0.015221295

USE 0.014015975

TABLE 0.013329826

WORDS 0.013056314

KEY 0.012309574

QUESTION 0.011830957

TEARS 0.011804331

MOMENT 0.011682474

EYES 0.011620854

GARDEN 0.011149873




HEAVING 0.382881279

SEA 0.024455707

WATER 0.018881216

SUN 0.012538923

FLOWERS 0.01053239

SISTER 0.008346418

SHIPS 0.007588059

THINGS 0.007276676

OTHERS 0.007239968

TURN 0.007074388

SKY 0.006811269

FISH 0.006658898

TREES 0.006632687

RISE 0.005986538

CASTLE 0.00590245

STATUE 0.00586363

LAND 0.005778375

BIRDS 0.00575662

FISHES 0.005703104

BOTTOM 0.005627887


Extensions

Lemmatization

Remove inflectional endings to return the base (dictionary) form of a word

Stemming

Remove inflectional endings to return a base form of a word

Not identical to the morphological root

Collision
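To make the stemming idea and the collision problem concrete, here is a deliberately naive suffix stripper. It is not the Porter algorithm; the class name and the three rules are assumptions chosen to keep the example short.

```java
// Deliberately naive plural stripper, for illustration only (not the Porter stemmer).
final class NaiveStemmer {

    static String stem(String word) {
        if (word.length() > 4 && word.endsWith("IES")) {
            return word.substring(0, word.length() - 3) + "Y"; // STORIES -> STORY
        }
        if (word.endsWith("SS")) {
            return word; // DUCHESS stays DUCHESS
        }
        if (word.length() > 3 && word.endsWith("S")) {
            return word.substring(0, word.length() - 1); // THINGS -> THING
        }
        return word;
    }

    public static void main(String[] args) {
        System.out.println(stem("THINGS")); // prints THING
        System.out.println(stem("NEWS"));   // prints NEW
    }
}
```

Mapping THINGS and THING to the same stem lets their keyword scores be pooled, while NEWS collapsing onto NEW shows a collision: two unrelated words end up sharing one stem.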



Porter Stemming


TIME 0.0338

THINGS 0.033378

QUEEN 0.031056

KINGS 0.027731

TURTLE 0.026883

WAY 0.025023

HEADS 0.024986

RABBIT 0.02355

VOICE 0.023143

CAT 0.020859

DUCHESS 0.018518

MOUSE 0.018515

TONE 0.015746

HAND 0.01526

EYES 0.01489

DOORS 0.014346

MINUTES 0.013891

WORDS 0.013477

DAY 0.013451

HARE 0.012962

MOMENT 0.01296

CATERPILLAR 0.012042

JURY 0.011549

COURSE 0.011109

GARDEN 0.010652

MOUSE 0.034373513

TIME 0.031874901

WAY 0.030837694

THINGS 0.021030136

THING 0.019074103

RABBIT 0.018132116

FEET 0.017900016

CATS 0.01612702

POOL 0.016089348

DOOR 0.016043337

COURSE 0.015221295

USE 0.014015975

TABLE 0.013329826

WORDS 0.013056314

KEY 0.012309574

QUESTION 0.011830957

TEARS 0.011804331

MOMENT 0.011682474

EYES 0.011620854

GARDEN 0.011149873


Conclusion

Next step? Up to you ..



www.modsummit.com

www.developersummit.com
