Transcript of a UGent presentation · 2017-07-19
© 2015, iText Group NV, iText Software Corp., iText Software BVBA
Joris Schellekens
Natural Language Processing
Why?
• PDF is where data goes to die
• Dark data
• Dirty data
Agenda
• Language Recognition
• Tokenization
• Part of Speech (POS) Tagging
• Keyword Extraction
Why do we want to discern languages?
• Tokenization
• Metadata
• Stepping stone for further processing
Basic concept (N-Grams)
Intuition
Algorithm
Results
Intuition
Some letters are more common in a language than others: - Dutch text typically does not use ‘Q’ often, whereas French text does
The frequency of letter occurrences (unigrams), double-letter occurrences (bigrams), etc. can be used as a ‘fingerprint’ of the language
This is an example sentence.
bigram   %      count
IS       18.18  2
EN        9.09  1
TE        9.09  1
NC        9.09  1
ES        9.09  1
AM        9.09  1
AN        9.09  1
EX        9.09  1
TH        9.09  1
PL        9.09  1
Algorithm (building model)
1. Collect a large volume of text in a specific language
2. Retrieve all unigrams, bigrams, trigrams from the text
3. Count all n-grams, possibly applying some normalization:
   • uppercase vs. lowercase
   • accents
   • numbers
4. Store the frequency map for later use
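The four steps above can be sketched as follows; the class and method names are illustrative, not the actual iText API, and the normalization (drop anything outside A-Z) is one crude choice among several:

```java
import java.util.HashMap;
import java.util.Map;

public class NGramModel {

    // Count all n-grams of the given size after normalization
    // (uppercase everything, then crudely drop accents, digits, punctuation).
    public static Map<String, Integer> countNGrams(String text, int n) {
        String normalized = text.toUpperCase().replaceAll("[^A-Z ]", "");
        Map<String, Integer> counts = new HashMap<>();
        for (String word : normalized.split("\\s+")) {
            for (int i = 0; i + n <= word.length(); i++) {
                counts.merge(word.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Turn raw counts into relative frequencies: the language "fingerprint".
    public static Map<String, Double> toFrequencies(Map<String, Integer> counts) {
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> freq = new HashMap<>();
        counts.forEach((gram, c) -> freq.put(gram, c / total));
        return freq;
    }

    public static void main(String[] args) {
        Map<String, Integer> bigrams = countNGrams("This is an example sentence.", 2);
        System.out.println(bigrams.get("IS")); // 2: once in "THIS", once in "IS"
    }
}
```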
Algorithm (comparison)
1. Extract all n-grams from a given piece of text (applying the same normalization as before)
2. Normalize the distribution
3. Apply some filtering and smoothing:
   • Filtering: compress the data, throwing away noise
   • Smoothing: decide how to handle unseen events
4. Compare the distribution of the text with that of all known models (cosine similarity)
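Step 4, cosine similarity between two frequency maps, could look like this; a sketch with made-up frequencies, not the production code:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CosineSimilarity {

    // Cosine of the angle between two sparse n-gram distributions:
    // dot product over the union of keys, divided by the two norms.
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        Set<String> keys = new HashSet<>(a.keySet());
        keys.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String k : keys) {
            double x = a.getOrDefault(k, 0.0);
            double y = b.getOrDefault(k, 0.0);
            dot += x * y;
            normA += x * x;
            normB += y * y;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Invented fingerprints: the sample shares more mass with "english".
        Map<String, Double> sample = Map.of("TH", 0.6, "EN", 0.4);
        Map<String, Double> english = Map.of("TH", 0.5, "HE", 0.3, "EN", 0.2);
        Map<String, Double> dutch = Map.of("EN", 0.5, "DE", 0.3, "IJ", 0.2);
        System.out.println(cosine(sample, english) > cosine(sample, dutch)); // true
    }
}
```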
Results
[Chart: recognition of languages as a function of language (DE, EN, FR, NL) and number of characters (16, 32, 64, 128, 256); accuracy plotted on a 0 to 1 scale]
Extensions: handling an "unknown" language
• Combine all known languages in some random way into a "random" or "noisy" pseudo-language; if this model is the most similar to the text, classify the language as unknown.
• Alternatively, train a model on several unknown languages and take a similar approach.
• The same technique can be used to detect variants within a language.
• It has even been used to detect style specific to a writer within a language, within the same historical period.
Why do we want to discern tokens?
• Tagging a document
• Part of Speech tagging
• Lemmatization
• Keyword generation
• Paragraph splitting
Basic concept (Levenshtein)
Intuition
Algorithm
Intuition
Assume the difference between two pieces of text is measured as a number of edits.
Edits are "delete a character", "insert a character" and "replace a character by another".
We can use recursion (or, even better, dynamic programming): the number of edits between A[0:i] and B[0:j] can be derived from the number of edits between A[0:i-d] and B[0:j-d], where d ∈ {0, 1}.
Algorithm
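A minimal sketch of the dynamic-programming recurrence described above: cell d[i][j] holds the edit distance between A[0:i] and B[0:j], built from the three smaller subproblems.

```java
public class Levenshtein {

    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,        // delete a character from A
                        d[i][j - 1] + 1),       // insert a character into A
                        d[i - 1][j - 1] + sub); // replace (or match for free)
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("EXBMPLE", "EXAMPLE")); // 1: one substitution
    }
}
```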
Basic concept (BKTree)
Intuition
Algorithm
Intuition
Assume we wish to correct a given piece of text W.
The brute-force approach would be to calculate the edit distance between W and every word in the dictionary, retaining only those that satisfy some criterion.
Like most search algorithms, using a tree-like structure can reduce the complexity enormously.
Algorithm
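A minimal BK-tree sketch over edit distance (illustrative, not the actual iText class). Each child is indexed by its distance to the parent's word; by the triangle inequality, a search at distance d from the node only needs to visit children whose key lies in [d − tolerance, d + tolerance].

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BKTree {
    private final String word;
    private final Map<Integer, BKTree> children = new HashMap<>();

    public BKTree(String root) { this.word = root; }

    public void add(String w) {
        int d = editDistance(word, w);
        if (d == 0) return; // already present
        BKTree child = children.get(d);
        if (child == null) children.put(d, new BKTree(w));
        else child.add(w);
    }

    // Collect all words within 'tolerance' edits of the query.
    public List<String> search(String query, int tolerance) {
        List<String> out = new ArrayList<>();
        search(query, tolerance, out);
        return out;
    }

    private void search(String query, int tolerance, List<String> out) {
        int d = editDistance(word, query);
        if (d <= tolerance) out.add(word);
        // prune: only children in [d - tolerance, d + tolerance] can match
        for (int i = Math.max(0, d - tolerance); i <= d + tolerance; i++) {
            BKTree child = children.get(i);
            if (child != null) child.search(query, tolerance, out);
        }
    }

    // Standard Levenshtein DP, kept here so the class is self-contained.
    private static int editDistance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                dp[i][j] = Math.min(Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        BKTree tree = new BKTree("EXAMPLE");
        tree.add("SAMPLE");
        tree.add("SENTENCE");
        System.out.println(tree.search("EXBMPLE", 1)); // [EXAMPLE]
    }
}
```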
Putting it all together
• Start with token = ""
• While there is a prefix in the dictionary that matches the token, consume more characters from the stream
• If there is no such prefix, but the token is a valid word, add the token to the output
• If the token is not a valid prefix and not a valid word, split at the last boundary that was still a valid prefix, and continue from there
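The loop above can be sketched as a greedy longest-match tokenizer (a simplification of the described backtracking; the class name and toy dictionary are invented for illustration, and a real implementation would use a trie for the prefix test):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PrefixTokenizer {
    private final Set<String> dictionary;

    public PrefixTokenizer(Set<String> dictionary) { this.dictionary = dictionary; }

    // Is s a prefix of any dictionary word? (Linear scan; a trie would be faster.)
    private boolean isPrefix(String s) {
        for (String w : dictionary) if (w.startsWith(s)) return true;
        return false;
    }

    public List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Grow the token while it is still a prefix of some dictionary word,
            // remembering the last boundary at which it was a complete word.
            int j = i + 1, lastValid = -1;
            while (j <= text.length() && isPrefix(text.substring(i, j))) {
                if (dictionary.contains(text.substring(i, j))) lastValid = j;
                j++;
            }
            if (lastValid > 0) { // split at the last valid boundary
                out.add(text.substring(i, lastValid));
                i = lastValid;
            } else {             // unknown character: separate it as its own token
                out.add(text.substring(i, i + 1));
                i++;
            }
        }
        return out;
    }
}
```

With the dictionary {THIS, IS, AN, EXAMPLE}, tokenizing "THISISANEXAMPLE" yields [THIS, IS, AN, EXAMPLE], and spaces come out as separate unknown tokens, as in the intermediate results below.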
Results (intermediate)
Token    Status
THIS valid word
_space_ unknown token, separated
IS valid word
_space_ unknown token, separated
AN valid word
_space_ unknown token, separated
EX valid word
BM valid prefix
PLE valid prefix
_space_ unknown token, separated
… …
Advanced
The basic tokenizer splits too much ("this is great"); sometimes we need to merge tokens again.
Define an objective function:
• Minimize the number of tokens
• Maximize the number of tokens known in the dictionary
• Minimize |token length − average known token length|
• Take misspelled words into account (BKTree)
• Take unknown words into account
=> Use a meta-heuristic to find a (local) optimum
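One possible objective function combining these criteria (all weights are invented for illustration; the meta-heuristic would try merges and keep the highest-scoring tokenization):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenizationScore {

    // Higher is better: reward known tokens, penalize many tokens
    // and tokens whose length strays from the average known token length.
    static double score(List<String> tokens, Set<String> dictionary, double avgLen) {
        long known = tokens.stream().filter(dictionary::contains).count();
        double lenPenalty = tokens.stream()
                .mapToDouble(t -> Math.abs(t.length() - avgLen)).sum();
        return 2.0 * known - tokens.size() - 0.1 * lenPenalty;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("THIS", "IS", "GREAT"));
        List<String> over = Arrays.asList("TH", "IS", "IS", "GR", "EAT"); // over-split
        List<String> merged = Arrays.asList("THIS", "IS", "GREAT");       // merged back
        System.out.println(score(merged, dict, 4.0) > score(over, dict, 4.0)); // true
    }
}
```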
Results (advanced)
Token    Status
THIS valid word
_space_ later validated token
IS valid word
_space_ later validated token
AN valid word
_space_ later validated token
EXBMPLE valid (misspelled) word (half score of known word)
_space_ later validated token
SENTEENCE valid (misspelled) word (half score of known word)
_space_ later validated token
… …
Part of Speech (POS) Tagging
• Keyword Extraction
• Spellcheck
• Named Entity Recognition
Algorithm (high level)
Assign a POS tag to every token
• Categories such as "noun", "adjective", etc.
• Algorithms have not changed much since the boom of NLP
• Use a (large) corpus:
  - Keep track of assignments (e.g. "EAT" => "VB")
  - Keep track of chains (e.g. "VB TO .." => "VB")
• Brute-force approach vs. Markov chains (Viterbi)
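A toy Viterbi decoder over exactly the two tables described above: assignment counts (emissions such as "EAT" => "VB") and chain counts (tag-to-tag transitions). All probabilities and the tiny tag set are made up for illustration; they are not trained from a corpus.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ViterbiTagger {
    static final String[] TAGS = {"PPSS", "VB", "TO", "AT", "NN"};

    // Dynamic programming over tag sequences: score[i][t] is the best
    // probability of any tag path ending in TAGS[t] at word i.
    static List<String> tag(List<String> words,
                            Map<String, Map<String, Double>> emit,
                            Map<String, Map<String, Double>> trans) {
        int n = words.size(), k = TAGS.length;
        double[][] score = new double[n][k];
        int[][] back = new int[n][k];
        for (int t = 0; t < k; t++)
            score[0][t] = prob(emit, TAGS[t], words.get(0));
        for (int i = 1; i < n; i++) {
            for (int t = 0; t < k; t++) {
                double e = prob(emit, TAGS[t], words.get(i));
                double best = -1; int arg = 0;
                for (int p = 0; p < k; p++) {
                    double s = score[i - 1][p] * prob(trans, TAGS[p], TAGS[t]) * e;
                    if (s > best) { best = s; arg = p; }
                }
                score[i][t] = best; back[i][t] = arg;
            }
        }
        // Backtrack from the best final tag.
        int bestT = 0;
        for (int t = 1; t < k; t++) if (score[n - 1][t] > score[n - 1][bestT]) bestT = t;
        String[] out = new String[n];
        for (int i = n - 1; i >= 0; i--) { out[i] = TAGS[bestT]; if (i > 0) bestT = back[i][bestT]; }
        return Arrays.asList(out);
    }

    // Look up a probability, with tiny smoothing for unseen events.
    static double prob(Map<String, Map<String, Double>> m, String a, String b) {
        return m.getOrDefault(a, Map.of()).getOrDefault(b, 1e-6);
    }

    // Build the toy tables and tag the example sentence.
    static List<String> demo() {
        Map<String, Map<String, Double>> emit = new HashMap<>();
        emit.put("PPSS", Map.of("I", 1.0));
        emit.put("VB", Map.of("WANT", 0.5, "EAT", 0.5));
        emit.put("TO", Map.of("TO", 1.0));
        emit.put("AT", Map.of("A", 1.0));
        emit.put("NN", Map.of("BANANA", 1.0));
        Map<String, Map<String, Double>> trans = new HashMap<>();
        trans.put("PPSS", Map.of("VB", 1.0));
        trans.put("VB", Map.of("TO", 0.5, "AT", 0.5));
        trans.put("TO", Map.of("VB", 1.0));
        trans.put("AT", Map.of("NN", 1.0));
        return tag(Arrays.asList("I", "WANT", "TO", "EAT", "A", "BANANA"), emit, trans);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [PPSS, VB, TO, VB, AT, NN]
    }
}
```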
File root = new File("C:\\Users\\Joris Schellekens\\Documents\\NetBeansProjects\\NLP\\src\\main\\java\\nlp\\models");

// 1. create tokenizer
ITokenizer t0 = new DictionaryBasedTokenizer();
t0.setFlag(IModel.IModelFlag.IGNORE_CASE);
t0.load(new FileInputStream(new File(root, "dictionary_en.xml")));
ITokenizer tokenizer = new HillClimbingTokenizer(t0);
tokenizer.load(new FileInputStream(new File(root, "dictionary_en.xml")));

// 2. create tagger
IPOSTagger tagger = new MaxEntTagger();
tagger.load(new FileInputStream(new File(root, "postagger_en.xml")));
tagger.setFlag(IModel.IModelFlag.IGNORE_CASE);

// 3. define input
String text = "I want to eat a banana.";

// 4. split
tokenizer.setFlag(IModel.IModelFlag.IGNORE_WHITESPACE);
List<String> tokens = tokenizer.tokens(text);

// 5. POS tagging
List<String> tags = tagger.tag(tokens);

// 6. output
for (int i = 0; i < tokens.size(); i++) {
    System.out.println(tags.get(i) + "\t" + tokens.get(i));
}
Results
Token Tag Meaning
THIS DT determiner/pronoun, singular
IS BEZ verb "to be", present tense, 3rd person singular
AN AT article
EXAMPLE NN noun, singular, common
SENTENCE NN noun, singular, common
. . sentence terminator
Keyword Extraction
• PDF Metadata
• Document classification
• Document retrieval
Intuition
We only care about some kinds of words (nouns, adjectives, verbs)
Not: "a", "the", "in" or "over"
Hence the POS tagging in the previous step
But ...
From that point on, many algorithms differ (keyword extraction is a relatively young area of computer science):
• Brute force: Word2vec
• Distribution based: TF-IDF
• Graph based: eigenvalues, K-Truss
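The distribution-based TF-IDF score mentioned above can be sketched as follows; this is the standard formulation, not necessarily the exact variant used to produce the results below:

```java
import java.util.Arrays;
import java.util.List;

public class TfIdf {

    // Term frequency: how often 'term' occurs in 'doc', relative to doc length.
    static double tf(List<String> doc, String term) {
        long c = doc.stream().filter(term::equals).count();
        return (double) c / doc.size();
    }

    // Inverse document frequency: rare-across-the-corpus terms score higher.
    static double idf(List<List<String>> corpus, String term) {
        long n = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / (1 + n));
    }

    static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("MOUSE", "RABBIT", "MOUSE");
        List<String> d2 = Arrays.asList("SEA", "SHIP");
        List<String> d3 = Arrays.asList("RABBIT", "SEA");
        List<List<String>> corpus = Arrays.asList(d1, d2, d3);
        // MOUSE is frequent in d1 and appears in no other document -> high score;
        // RABBIT appears in two of three documents -> score driven toward zero.
        System.out.println(tfIdf(d1, corpus, "MOUSE") > tfIdf(d1, corpus, "RABBIT")); // true
    }
}
```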
Can you guess the book?
MOUSE 0.034373513
TIME 0.031874901
WAY 0.030837694
THINGS 0.021030136
THING 0.019074103
RABBIT 0.018132116
FEET 0.017900016
CATS 0.01612702
POOL 0.016089348
DOOR 0.016043337
COURSE 0.015221295
USE 0.014015975
TABLE 0.013329826
WORDS 0.013056314
KEY 0.012309574
QUESTION 0.011830957
TEARS 0.011804331
MOMENT 0.011682474
EYES 0.011620854
GARDEN 0.011149873
HEAVING 0.382881279
SEA 0.024455707
WATER 0.018881216
SUN 0.012538923
FLOWERS 0.01053239
SISTER 0.008346418
SHIPS 0.007588059
THINGS 0.007276676
OTHERS 0.007239968
TURN 0.007074388
SKY 0.006811269
FISH 0.006658898
TREES 0.006632687
RISE 0.005986538
CASTLE 0.00590245
STATUE 0.00586363
LAND 0.005778375
BIRDS 0.00575662
FISHES 0.005703104
BOTTOM 0.005627887
Extensions
• Lemmatization: remove inflectional endings to return the base (dictionary) form of a word
• Stemming: remove inflectional endings to return a base form of a word
  - Not identical to the morphological root
  - Collisions: distinct words can be reduced to the same stem
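A drastically simplified suffix stripper in the spirit of Porter's algorithm (the real algorithm applies measure conditions over five rule phases); it also demonstrates a collision, where two distinct words reduce to the same stem:

```java
public class SimpleStemmer {

    // Strip the first matching suffix, provided a stem of at least
    // three characters remains. Order matters: longer suffixes first.
    static String stem(String w) {
        String[] suffixes = {"ATIONAL", "IZATION", "ITIES", "NESS",
                             "MENT", "ING", "ITY", "AL", "ED", "S"};
        for (String s : suffixes) {
            if (w.endsWith(s) && w.length() - s.length() >= 3) {
                return w.substring(0, w.length() - s.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Collision: "UNIVERSAL" and "UNIVERSITY" both reduce to "UNIVERS".
        System.out.println(stem("UNIVERSAL") + " " + stem("UNIVERSITY")); // UNIVERS UNIVERS
        System.out.println(stem("KINGS")); // KING
    }
}
```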
Porter Stemming
With Porter stemming:
TIME 0.0338
THINGS 0.033378
QUEEN 0.031056
KINGS 0.027731
TURTLE 0.026883
WAY 0.025023
HEADS 0.024986
RABBIT 0.02355
VOICE 0.023143
CAT 0.020859
DUCHESS 0.018518
MOUSE 0.018515
TONE 0.015746
HAND 0.01526
EYES 0.01489
DOORS 0.014346
MINUTES 0.013891
WORDS 0.013477
DAY 0.013451
HARE 0.012962
MOMENT 0.01296
CATERPILLAR 0.012042
JURY 0.011549
COURSE 0.011109
GARDEN 0.010652

Without stemming (earlier result, for comparison):
MOUSE 0.034373513
TIME 0.031874901
WAY 0.030837694
THINGS 0.021030136
THING 0.019074103
RABBIT 0.018132116
FEET 0.017900016
CATS 0.01612702
POOL 0.016089348
DOOR 0.016043337
COURSE 0.015221295
USE 0.014015975
TABLE 0.013329826
WORDS 0.013056314
KEY 0.012309574
QUESTION 0.011830957
TEARS 0.011804331
MOMENT 0.011682474
EYES 0.011620854
GARDEN 0.011149873
Conclusion
Next step? Up to you ...
www.modsummit.com
www.developersummit.com