Security's in your DNA: Genomics for InfoSec
![Page 1: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/1.jpg)
Security’s in your DNA: Genomics for InfoSec
Rob Bird, @conduit242
![Page 2: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/2.jpg)
What is the most efficient way to analyze a sequence of events?
![Page 3: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/3.jpg)
What’s a genome?
• The genetic material of an organism
• A redundant encoding of instructions
• A big sequence of letters
![Page 4: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/4.jpg)
HIVtggatgggttaatttactccaagcaaagacaagatatccttgatctgtgggtctaccacacacaaggctacttccctgattggcagaattacacaccagggccaggagtcagatacccactaacatttggatggtgcttcaagctagtaccagttgatccagatgaagtagagaaggatactgagggagagaacaacagcctattacaccctatatgccaacatggaatggatgatgaggagaaagaagtattaaggtggaaatttgacagccgcctggcactaaaacacagagcccaagagatgcatccggagttctacaaagactgctgacacagaagttgctgacagggactttccgctgggactttccaggggaggtgtggtttgggcggagttggggagtggccaaccctcagatgctgcatataagcagctgcttttcgcttgtactgggtctctctaggtagaccagatccgagcctgggagctctctggctatctggggaacccactgcttaagcctcaataaagcttgccttgagtgctctaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagaccactctagactgagtaaaaatctctagcagtggcgcccgaacagggactcgaaagcgaaagtaagaccagagaagttctctcgacgcaggactcggcttgctgaggtgcacacagcaagaggcgagagcggcgactggtgagtacgccaatttttgactagcggaggctagaaggagagagatgggtgcgagagcgtcagtattaagcgggggaaaattagatgcatgggagagaattcggttaaggccagggggaaagaaaaaatatagaatgaaacatctagtatgggcaagcagggagctggaaagatttgcacttaaccctggcctgttagaaacaacagaaggatgtcaacaaataatagaacagttacaaccagctctcaagacaggaacagaagaacttagatcattatttaatacagtagtaaccctctattgtgtacatcaacggatagaggtaaaagacaccaaggaagctctagataaaatagaggaaatacaaaataagagcaagcaaaagacacaacaggcagcagctgccacaggaaacagcagcaatgtcagccaaaattaccctatagtgcaaaatgcacaagggcaaatggtacaccaggctgtatcacctaggacattgaatgcatgggtgaaggtaatagaagaaaaggctttcagcccagaagtaatacccatgttctcagcattgtcagaaggagccaccccacaagatttaaatatgatgctaaacatagtggggggacaccaggcagctatgcagatgttgaaagataccatcaatgaggaagctgcagaatgggacaggttacatccagtacaggcagggcctattccaccaggccaattgagagaaccaaggggaagtgacatagcaggaactactagtacccctcaagaacaaataggatggatgacaggcaacccacctattccagtgggagacatctataaaagatggataatcctgggattaaataaaatagtaagaatgtatagccctgttagcattttggacgtaaaacaagggccaaaagaacccttcagagactatgtagataggttctttaaaattctcagagctgagcaagctacacaggaggtaaaaggttggatgacagaaaccttgctggtccaaaatgcaaatccagattgtaagtccattttaagagcactaggaacaggagctacattagaagaaatgatgacagcatgccagggagtgggaggacccggccataaagcaagggttttggctgaggcaatgagtcaagtacaacatacaaacataatgatgcagagaggcaattttaggggtcagagaaggatgattaaatgtttcaattgtggcaaagaaggacacctagccagaaattgc
agagcccctaggaaaaagggctgttggaaatgtgggaaagagggacaccaaatgaaggactgcactgaaagacaggctaattttttagggaaaatttggccttccagcaaggggaggccaggaaactttccccagagcaggccagagccaacagccccaccagcagagctctttgggatggaggaagaaaaaacctccgctctgaagcaggagcagaaggacaggaaacaggacccacctttagtttccctcaaatcactctttggcaacgaccccttgtcacagtaaaagtagggggacagctaaaagaagctctattagatacaggagcagatgacacagtattagaagatataaatttgccaggaaaatggaaaccaagaatgatagggggaattggaggttttatcaaagtaaaacagtatgatcagatacttatagaaatttgtggaaaaaaggctataggtacagtattagtaggacccacacctgtcaacataattggaaggaatatgttgacccagattggatgtactttaaatttcccaattagtcctattgagactgtgccagtaaaattaaagccaggaatggatggcccaaaggttaaacaatggccattgacagaagaaaaaataaaagcattaacagaaatttgtacagatatggaaaaggaaggaaaaatttcaagaattgggcctgaaaatccatacaatactccaatatttgctataaagaaaaaagacagcactaaatggaggaaactagtagatttcagagagctcaataaaagaacacaagacttttgggaagttcaattgggaataccgcatccagcgggcctaaaaaagaaaaaatcagtaacagtactagatgtgggggacgcatatttttcagttcctttagatgaaagctttagaaagtatactgcgttcaccatacctagtacaaataatgagacaccaggaatcaggtatcaatacaatgtgctgccacagggatggaaaggatcaccggcaatattccagagtagcatgacaaaaatcttagagccctatagatcaaagaatccagaaataattatctatcaatacatggatgacttgtatgtaggatctgatttagaaatagggcagcatagaacaaaaatagaggagttgagagctcatctattgagctggggatttactacaccagacaaaaagcatcaaaaagaacctccatttctttggatggggtatgaactccatcctgacaaatggacagtacagcctatacaactgccagaaaaggatagctggactgtcaatgatatacagaagttggtggggaaactgaattgggcaagtcaaatttatgcagggattaaagtaaagcaactgtgcaaactcctcaggggagccaaagcactaacagaggtagtaactctgactgaggaagcagaattagaattggcagagaacagggaaattctaaaagaccctgtgcatggagtatattatgacccatcaaaagaattaatagcagaaatacagaaacaagggcaagaccaatggacatatcaaatttatcaagagccatttaaaaatctaaaaacaggaaaatatgcaagaaaaaggtctgctcacactaatgatgtaaagcaattagcagaagtggtgcaaaaggtggtcatggagagcatagtaatatggggaaagactcctaaatttaaactacccatacaaaaagagacatgggaaacatggtggatggactattggcaggctacctggattcctgaatgggagtttgtcaatacccctcccctagtaaaattgtggtaccagttagagaaagaccctatagcaggagcagaaactttctatgtagatggggcagccaatagggagactaagctaggaaaagcagggtatgtaactgacagaggaagacaaaaggttgtttccctaactgagacaacaaatcaaaagactgaactacatgcaatccatctagcc
ttacaggattcaggatcagaagtaaacatagtaacggactcacagtatgcattaggaatcattcaggcacaaccagacaggagtgaatcagaattagtcaatctaataatagaggagctaatagaaaaggacaaggtctacctgtcatgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaattagtcagttccggaattaggaaggtgctgtttttagatgggatagataaagctcaagaagaacatgaaagatatcacagcaattggaaagcaatggctagtgattttaatctgccacctatagtagcaaaggaaatagtagccagctgtgataaatgccaactaaaaggagaagccatgcatggacaggtagactgtagtccaggaatatggcaattagattgcacacatctagaaggaaaagtaatcctggtagcagtccatgtagccagtggttatatagaagcagaagttatcccagcagaaacaggacaagagacagcatactttctactaaaattagcaggaagatggccagtaaaagtagtacacacagacaatggaggcaatttcaccagtgctgcagttaaagcagcctgttggtgggcaaatatccaacaggaatttgggattccctacaatccccaaagtcaaggagtagtggaatctatgaataaagaattaaagaaaatcatagggcaggtaagagatcaagctgaacatcttaagacagcagtacaaatggcagtattcattcacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaaggataatagacataatagcaacagacatgcaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagagatccaatttggaaaggaccagcaaaactactctggaaaggtgaaggggcagtagtaatacaggacaatagtgatatcaaggtagtaccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaggtagacaggatgaggattagaacatggaacagtttagtaaaatatcatatgtatgtctcaaagaaagctcgaaagtggctctatagacatcactatgatagcaggcatccaaaagtaagttcagaagtacacatcccactaggggatgctagattagtagtaagaacatattggggtctgcatacaggagaaaaagactggcaattgggtcacggggtctccatagaatggaggctaagaagatatagcacacaaatagatcctgacctagcagaccaactaattcatctgcattattttgactgtttttcagaatctgccataaggagagccatattaggacaagtagttagccctaggtgtgtatatccaacaggacataaccaggtaggatccctacaatatctagcactgaaggcattagtaacaccaataaagacaagaccacctttgcctagtgttaagatattaacagaggatagatggaacaagccccagaagaccaggggccacagagggaaccatacaatgaatggatgttagaactgttagaagatcttaaacatgaagcagttagacactttcctagaccatgggctaggacaacatatatataacacctatggggatacttgggaaggagtcgaagctatagtaagaattttgcaacaactactgtttgttcatttcagaattgggtgccaacatagcagaataggcattattcaagggagaagagtcagaaatggagccggtagatcctaacttagagccctggaaccatccgggaagtcagcctacaactgcttgtaccaagtgttactgtaaaaagtgttgctatcattgcctagtttgctttctgaacaaaggcttaggcatctcctatggcaggaagaagcggagcaagcgacgacgaactcctcacagc
agtaaggatcatcaaaatcctataccaaagcagtaagtatcagtaattagtatatgtaatgagtcctttagaaatctgtgcaatagtaggattgatagtagcgctaatcatagcaatagttgtgtggactatagtaggtatagaatataagagattgttaaagcaaaggaaaatagacaggttaattaagaaaatacgagaaagagcagaagacagtggcaatgagagtgatggggacatggatgaattggcaaaacttgtggagagggggaactatgatcttggggatgttaatgatctgtagtactgcagaaaacttgtgggttactgtctactatggggtacctgtgtggaaagatgcagaaaccaccttattttgtgcatcagatgctaaagcatacgacacagaggcgcataatgtctgggctacacatgcctgtgtacccacagaccccaacccacaagaaatatatttggaaaatgtgacagaagagtttaacatgtggaaaaataacatggtagagcagatgcatacagatataatcagtctatgggatcaaagcctaaagccatgtgtacagttaacccctctctgcgttactttaaattgtaataacatcaccatcaataacatcaccaccaacatcactgaggacatgagaggagaaataaaaaactgctcgtacaatatgaccacagtattaagggataagagaaggaaagtgtattcacttttttatagacttgatatagtaccacttgatgaggggaataataactctgctgggagtagtgactatagattaataaattgtaatacctcaaccataacacaagcctgtccaaaggtctcttttgacccaattcctatacattattgtgctccagctggttttgcgattctaaaatgtaaggatccagatttcaatggaacagggccatgcaagaatgtcagcacagtacaatgcacacatggaatcaagccagtagtatcaactcaactgctgttaaatggcagtctagcagaaggaaaggtaagaattagatctgaaaatattacaaacaatgccaaaaacataatagtacaacttgtcaagcctgtaaaaattaattgtgtcagacctaacaacaatacaagaacaagtgtacgtataggaccaggacaaacattctatgcaacaggtgaaataataggggatataagacaagcattttgtactgtcaatgaatcagaatggaatgaaactttacaacaggtagctacgcaattaagagaacactttgagaacaaaacaataaaatttactaactcctcaggaggggatttagaaattacaacacatagctttaattgtggaggagaatttttctattgtaatacatcaggcctgtttaatagcacctggaataataataataccagggagaagataaatggtacagagtcaaatagcactataactctccattgcagaataaagcaaattataaataggtggcaggaagtaggacaagcaatgtatgcccctcccatcccaggagtaataaattgtagatcaaacattacaggactaatattaacaagagatggtggggatggggataacaatacggaaatcttcagacctggaggaggaaatatgaaggacaattggagaagtgaattatataagtataaagtagtaaaaattgaaccactgggagtagcacccaccagggctaagagaagagtggtggagagagcaaaaagagcagttggaataggagctgttttccttgggttcttaggagcagcaggaagcactatgggcgcggcgtcaataacgctgacggtacaggccagacaattattgtctggcatagtgcaacagcaaagcaatttgctgagggctatagaggctcaacaacatctgttgaaactcacggtctggggcattaaacagctccaggcaagagtccttgctgtggaaagatacctgcaggatcaacagctcctagga
atttggggctgctctggaaaactcatctgcaccactaatgtgccctggaactctagttggagtaataaatctcagagtgagatatgggagaacatgacctggctgcaatgggataaagaaattagcagttacacaggcataatatataaactaattgaagaatcgcagaaccagcaggaaaagaatgaacaagacttattggcattggacaagtgggcaagtctatggaattggtttgaaatatcaaagtggctgtggtatataaaaatatttataatgatagtaggaggattaataggattaagaatagtttttgctgtgctttctataatcaatagagttaggcagggatactcacctttgtcatttcagacccacaccccaaacccaagggaacccgacaggcccgaaagaatcgaagaagaaggtggagagcaaggcagagacagatcgatacgcttagtgagcggattcttagcacttgcctgggacgacctacggagcctgtgccttttcagctaccaccgcttgagagacttcatcttgattgcagcgaggactgtggaacttctgggacacagcagtctcaaggggttgagactggggtgggaaagcctcaagtatctggggaatcttctgctatattggagtcaggaactaaaaattagtgctgttaatttagttgataccatagcaatagcagtagctggctggacagataggattatagaaacaggacaaagattttgtagagctcttctcaacgtacctagaagaatcagacaaggatttgaaagggctctgctataacatgggtggcaagtggtcaaaaagtagcatagtgggatggcctgagattagggaaagaatgaggcgtgctcctccagcagcaaaaggagtaggagcagtatctcaagatttagataaatttggagcagttacaagcagtaatatgaatcaccctagttgcgtctggctggaagcacaagaggaaacggaggtaggctttccagtcaggccacaagtacctctaaggccaatgacttacaagggagcagtggatctcagccattttttaaaagaaaaggggggactggaagggttaatttactccaagcaaagacaagatatccttgatctgtgggtctaccacacacaaggctacttccctgattggcagaattacacaccagggccaggagtcagatacccactaacatttggatggtgcttcaagctagtaccagttgatccagatgaagtagagaaggatactgagggagagaacaacagcctattacaccctatatgccaacatggaatggatgatgaggagaaagaagtattaaggtggaaatttgacagccgcctggcactaaaacacagagcccaagagatgcatccggagttctacaaagactgctgacacagaagttgctgacagggactttccgctgggactttccaggggaggtgtggtttgggcggagttggggagtggccaaccctcagatgctgcatataagcagctgcttttcgcttgtactgggtctctctaggtagaccagatccgagcctgggagctctctggctatctggggaacccactgcttaagcctcaataaagcttgccttgagtgcb
![Page 5: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/5.jpg)
Basics
• Letters (nucleotides)
  – 4 in DNA: A, G, C, T
• Codons
  – Triplets of nucleotides, e.g. GAA
• Genomes have coding regions (proteins) & non-coding regions (other)
• One strand can be read forward, the other in reverse
![Page 6: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/6.jpg)
It’s all about the Codons
• The Genetic Code is a dictionary of Codons
• 64 entries (4^3)
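A minimal sketch of where the 64 comes from: enumerate every ordered triplet over the four DNA letters. The helper `split_codons` is a hypothetical name used only for illustration; it reads a sequence in one fixed frame and drops any leftover letters.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Every ordered triplet of the four letters is one codon: 4 * 4 * 4 = 64.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

def split_codons(sequence):
    """Split a sequence into consecutive non-overlapping triplets, dropping any remainder."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]

print(len(codons))               # 64
print(split_codons("GAATTCAG"))  # ['GAA', 'TTC']
```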
![Page 7: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/7.jpg)
![Page 8: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/8.jpg)
Analyzing Genomes
• Compare them to each other
  – Alignments (e.g. Smith-Waterman, etc.)
  – Distances
    • Levenshtein (edit) distance (metric)
    • Longest Common Subsequence distance (metric)
    • Normalized Compression Distance (metric)
  – Optimal Grammars
    • Pisa.c: Optimal sequence grammar search using hyperstring encodings
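Two of the distances listed above can be sketched in a few lines. This is an illustration, not code from the talk: `levenshtein` is the textbook dynamic-programming edit distance, and `ncd` approximates Normalized Compression Distance with zlib standing in for the compressor (an assumption; any real compressor only approximates the ideal).

```python
import zlib

def levenshtein(a, b):
    """Edit distance: minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def ncd(x, y):
    """Normalized Compression Distance between two byte strings, using zlib."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(levenshtein("kitten", "sitting"))  # 3
print(ncd(b"ACGT" * 200, b"ACGT" * 200))
```

Smaller NCD values mean the compressor finds more shared structure between the two inputs.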
![Page 9: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/9.jpg)
Analyzing Genomes
• Look for interesting regions
  – Information gain (Kullback-Leibler divergence)
  – Coding costs (Kolmogorov complexity)
  – Decaying coding costs (lossy Kolmogorov complexity)
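One way to read "information gain" is to compare the letter distribution inside a window with the distribution of the whole sequence. The sketch below does that with Kullback-Leibler divergence. The window size, the non-overlapping stride, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

def kl_divergence(window, background):
    """D_KL(P || Q) in bits: P is the window's letter distribution, Q the background's."""
    p_counts = Counter(window)
    q_counts = Counter(background)
    divergence = 0.0
    for letter, count in p_counts.items():
        p = count / len(window)
        q = q_counts[letter] / len(background)  # nonzero because the window comes from the background
        divergence += p * math.log2(p / q)
    return divergence

def window_scores(sequence, size):
    """Score each non-overlapping window against the whole sequence."""
    return [(start, kl_divergence(sequence[start:start + size], sequence))
            for start in range(0, len(sequence) - size + 1, size)]

sequence = "ACGT" * 10 + "AAAAAAAA" + "ACGT" * 10
for start, score in window_scores(sequence, 8):
    print(start, round(score, 3))
```

The window covering the run of A's scores far higher than its neighbours, which is the kind of region this measure flags.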
![Page 10: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/10.jpg)
Rule 1: Size doesn’t matter
![Page 11: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/11.jpg)
Smallest (almost)
• Mycoplasma genitalium
• 580,000 bp
![Page 12: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/12.jpg)
Largest
• Polychaos dubium
• 670 billion bp
![Page 13: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/13.jpg)
Rule 2: Repetition matters
![Page 14: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/14.jpg)
Don’t say that again
• Sections of DNA that do not repeat are the most important
• Protein coding genes and RNA coding genes are non-repetitive
• The genomes of higher-order creatures are largely repetitive
![Page 15: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/15.jpg)
Rule 3: Compression is hard
![Page 16: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/16.jpg)
Putting the squeeze on
• Normal compressors ~ 2-bit codes
• Special genetic compressors exist
• Compressibility equates to sequence predictability for the model in use
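A quick way to see the "~2 bits" point is to measure compressed bits per base. The sketch below uses zlib as a stand-in for a "normal compressor" and synthetic sequences rather than real genomes; both choices are assumptions made for illustration.

```python
import random
import zlib

def bits_per_base(sequence):
    """Compressed size in bits divided by the number of bases."""
    compressed = zlib.compress(sequence.encode("ascii"), 9)
    return 8 * len(compressed) / len(sequence)

rng = random.Random(0)  # fixed seed so the run is repeatable
random_dna = "".join(rng.choice("ACGT") for _ in range(100_000))
repetitive_dna = "ACGTTGCA" * 12_500

print(round(bits_per_base(random_dna), 2))      # roughly 2 or a little more: nothing to predict
print(round(bits_per_base(repetitive_dna), 2))  # far below 2: highly predictable
```

With four equally likely letters there are 2 bits of information per base, so a general-purpose compressor cannot do much better than a fixed 2-bit code unless the sequence has structure its model can predict.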
![Page 17: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/17.jpg)
So what does this have to do with security???
![Page 18: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/18.jpg)
A Question
If we could convert sequences of logs, packets, etc. to a genomic encoding,
could we use genomic analysis to dramatically speed up & improve forensics, incident response and
anomaly detection?
![Page 19: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/19.jpg)
YES
![Page 20: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/20.jpg)
How?
• Step 1: Convert events into alphabet
• Step 2: Convert stream into string of letters
• Step 3: Money bath
![Page 21: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/21.jpg)
A Naïve Solution
• Step 1: Hash each input, use hash value as a letter
• Step 2: Create stream of hash values
• Step 3: #fail
Why?
![Page 22: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/22.jpg)
Answer
• The alphabet is too big
• The stream will need at least 2^(2^<hash_key_size>) examples
• Stream is virtually unpredictable
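The failure is easy to reproduce. In the sketch below the log lines are made up and SHA-256 is an arbitrary choice of hash; the point is that events differing by one character land on unrelated "letters", so the stream never repeats in a useful way.

```python
import hashlib

def naive_letter(line):
    """Naive encoding: use the first 32 bits of a cryptographic hash as the 'letter'."""
    digest = hashlib.sha256(line.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

# Made-up, near-identical events.
events = [
    "sshd: Failed password for root from 10.0.0.1",
    "sshd: Failed password for root from 10.0.0.2",
    "sshd: Failed password for root from 10.0.0.3",
]
letters = [naive_letter(event) for event in events]
print(letters)
print(len(set(letters)), "distinct letters for", len(events), "near-identical events")
```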
![Page 23: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/23.jpg)
Enter blar.py
![Page 24: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/24.jpg)
WTF is a ‘blarp’?
• Let’s ask Google
• The sound a fat person makes being fat
• The sound of taking big fat data and making it useful & efficient small data
• A cool little python tool for creating and analyzing genomic encodings
• The last two will not be found on Google…yet
![Page 25: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/25.jpg)
Idea
• We want similar events to be represented by a single letter
• Hashes are random projections
• Let’s use geometry instead
![Page 26: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/26.jpg)
Position in space
• To precisely locate something in a D-dimensional space, you need its distances to n = D+1 reference points
• Key notion: To get something’s general area you can use n << D+1 reference points
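For D = 2 the first bullet can be checked directly: distances to three non-collinear reference points pin down one point. The sketch below subtracts the first distance equation from the other two, which leaves a 2x2 linear system. The function name and the sample coordinates are assumptions for illustration.

```python
import math

def locate_2d(refs, dists):
    """Recover (x, y) from distances to three non-collinear reference points."""
    (x0, y0), (x1, y1), (x2, y2) = refs
    d0, d1, d2 = dists
    # Subtracting |p - r0|^2 = d0^2 from the other two equations removes the squared unknowns.
    a11, a12 = 2 * (x1 - x0), 2 * (y1 - y0)
    a21, a22 = 2 * (x2 - x0), 2 * (y2 - y0)
    b1 = (x1 ** 2 + y1 ** 2) - (x0 ** 2 + y0 ** 2) - d1 ** 2 + d0 ** 2
    b2 = (x2 ** 2 + y2 ** 2) - (x0 ** 2 + y0 ** 2) - d2 ** 2 + d0 ** 2
    det = a11 * a22 - a12 * a21  # zero when the reference points are collinear
    return ((b1 * a22 - a12 * b2) / det, (a11 * b2 - a21 * b1) / det)

refs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
target = (3.0, 4.0)
dists = [math.dist(target, ref) for ref in refs]
print(locate_2d(refs, dists))
```

With only one or two of those distances the point is constrained to a ring or a pair of candidates, which is the "general area" idea in the second bullet.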
![Page 27: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/27.jpg)
Locality-Sensitive Hashing
• Introduced in the late 90’s (Indyk and Motwani, 1998)
• Used within indexing for text lookups on massive data sets
• Many hashes; data-type dependent
• Question: What if you thought about it as a ‘general area’ hash instead?
![Page 28: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/28.jpg)
How it works
• Basic type: Random Projection
• Given a numeric vector (e.g. 1, 15, 3, 14.8) calculate its dot product vs. a random vector
• If result is positive, call it a ‘1’
• If negative, call it a ‘0’
• Repeat
• Concatenate binary together, result is LSH
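The steps above can be sketched in plain Python. This is an illustration under assumptions, not the blar.py source: the random vectors are Gaussian, the seed is arbitrary, and the function names are hypothetical.

```python
import random

def make_projection_vectors(bits, dim, seed=0):
    """Draw `bits` random Gaussian vectors of length `dim`."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def random_projection_lsh(vector, projections):
    """Return one bit per projection: '1' if the dot product is positive, else '0'."""
    bits = []
    for projection in projections:
        dot = sum(p * v for p, v in zip(projection, vector))
        bits.append("1" if dot > 0 else "0")
    return "".join(bits)

projections = make_projection_vectors(bits=4, dim=4)
a = [1.0, 15.0, 3.0, 14.8]
b = [2.0, 30.0, 6.0, 29.6]  # same direction as a, twice the length
print(random_projection_lsh(a, projections))
print(random_projection_lsh(b, projections))  # same bits as the line above
```

Only the sign of each dot product is kept, so vectors pointing in similar directions tend to share bits, and vectors pointing in the same direction always do.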
![Page 29: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/29.jpg)
Blar.py Pipeline
Vectorize Input
Find Locality-Sensitive Hash
Convert to UTF-16 char
Output stream of UTF-16
Analyze sliding window over genome stream
Score
Chart stuff
![Page 30: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/30.jpg)
Vectorizing
• Idea: Count things that matter, take measurements, etc. and create an array to hold that information
• Where the rubber meets the road
• Lots of chances for domain expertise
![Page 31: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/31.jpg)
Basic Vectorizing in Blar.py
• Basic model: character n-grams
• Also known as Markov chains or Bag of Letters
• Counts up sliding windows of text
• E.g. 2-grams for ‘sassyfrassy’
  sa: 1 as: 2 ss: 2 sy: 2 yf: 1 fr: 1 ra: 1
  For 256^2 length array (1,0…0,2,0…0,2,0…
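The 2-gram counts for the example word can be reproduced with a sliding window and a counter. A minimal sketch; `char_ngrams` is a hypothetical helper name.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every length-n sliding window of characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

counts = char_ngrams("sassyfrassy", 2)
print(dict(counts))
# {'sa': 1, 'as': 2, 'ss': 2, 'sy': 2, 'yf': 1, 'fr': 1, 'ra': 1}
```

Stored densely, each possible byte pair needs its own slot, which is where the 256^2-length array comes from. Almost all of it stays zero.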
![Page 32: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/32.jpg)
Let’s Vectorize Better
• Use Feature Hashing otherwise known as the hashing trick
• Find hash mod length and increment counter for each model pattern
• Permits lossy counting with graceful random collisions
• Blar.py uses length 64 by default and xxHash
![Page 33: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/33.jpg)
Blar.py code

    import numpy
    import xxhash

    def feature_hash_string(s, window, dim):
        # Generate window-char Markov chains & create feature hashes
        chains = [xxhash.xxh32(s[i:i + window].encode("utf-8")).intdigest() % dim
                  for i in range(len(s) - (window - 1))]
        # Initialize counter array
        counters = numpy.zeros(dim)
        # Count instances of feature hashes
        for feature in chains:
            counters[feature] += 1
        # Return feature hash count vector
        return counters
![Page 34: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/34.jpg)
Now let’s find the LSH

    import numpy

    # Use random projection for LSH and output a UTF char for the locality-sensitive hash
    # COMP_VECTORS is not defined on the slide; it is assumed to hold `width`
    # random vectors, each the same length as v
    def locality_hash_vector(v, width):
        hash_bits = numpy.zeros(width, dtype=int)
        for x in range(width):
            projection = numpy.dot(COMP_VECTORS[x], v)
            if projection < 0:
                hash_bits[x] = 0
            else:
                hash_bits[x] = 1
        # Return unicode char equal to the LSH
        return chr(int(''.join(map(str, hash_bits)), 2))
![Page 35: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/35.jpg)
Blar.py analysis
• Analyzes 4-character sequences and assigns a decaying version of the optimal coding cost to each line
• Tells you how interesting a certain event is relative to everything else in the genome, accounting for ordering
• Blar.py genomes are extremely compressible, especially with bzip
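The slides do not show the scoring code, so the following is only one plausible reading of a "decaying coding cost", written as an assumption-laden sketch. Each 4-character window is charged -log2 of its smoothed frequency among the windows seen so far, and older observations fade by a fixed factor. The decay rate, the add-one smoothing, and the function name are all assumptions.

```python
import math

def decaying_coding_costs(genome, window=4, decay=0.99):
    """Return one cost in bits per window position in the genome string."""
    counts = {}   # decayed count per window pattern
    total = 0.0   # decayed count of all windows seen so far
    costs = []
    for i in range(len(genome) - window + 1):
        gram = genome[i:i + window]
        # Add-one smoothing keeps never-seen patterns at a finite cost.
        probability = (counts.get(gram, 0.0) + 1.0) / (total + len(counts) + 1.0)
        costs.append(-math.log2(probability))
        # Age every earlier observation, then record the new one.
        for key in counts:
            counts[key] *= decay
        total = total * decay + 1.0
        counts[gram] = counts.get(gram, 0.0) + 1.0
    return costs

genome = "ABAB" * 20 + "ZQXJ" + "ABAB" * 20
costs = decaying_coding_costs(genome)
peak = max(range(len(costs)), key=costs.__getitem__)
print(peak, round(costs[peak], 2))
```

In this toy genome the costliest windows are the ones overlapping the four out-of-place letters, while the repeated pattern settles to about one bit per window.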
![Page 36: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/36.jpg)
Blar.py defaults (ATM)
• 4-character sliding windows
• 4-bit hashes
• 64d feature hashes
• Outputs a list of the most interesting scores
• Outputs a few bad charts
![Page 37: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/37.jpg)
Blar.py vs. Toy File
1. Mary had a little lamb whose fleece was white as snow.
2. Mary had a little lamb whose fleece was white as snow.
3. Mary had a little lamb whose fleece was white as snow.
4. Mary had a little lamb whose fleece was white as snow.
5. Mary had a little lamb whose fleece was white as snow.
6. Gary had a little hand whose hair was as white as blow.
7. some more strings
8. some more strings
9. some more strings
10. some more strings
11. some more strings
12. John McAfee was the keynote for Skytalks.
13. John McAfee was the keynote for Skytalks.
14. John McAfee was the keynote for Skytalks.
15. some more strings
16. some more strings
17. some more strings
18. John McAfee was the keynote for Skytalks.
19. John McAfee was the keynote for Skytalks.
20. FOO BAR BAS
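To make the encoding step concrete, here is a self-contained sketch that turns the toy file into a 20-letter genome. It is not blar.py itself: `zlib.crc32` is swapped in for xxHash so the example needs only the standard library, the random vectors use an arbitrary seed, and the 4-bit hash is mapped to the printable letters A to P instead of a raw UTF-16 code point.

```python
import random
import zlib

DIM, BITS, WINDOW = 64, 4, 4  # the defaults listed on the slides

_rng = random.Random(42)      # arbitrary seed, chosen for repeatability
PROJECTIONS = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BITS)]

def vectorize(line):
    """Feature-hash the character n-grams of a line into DIM counters."""
    counters = [0] * DIM
    for i in range(len(line) - WINDOW + 1):
        gram = line[i:i + WINDOW].encode("utf-8")
        counters[zlib.crc32(gram) % DIM] += 1
    return counters

def encode(line):
    """Map a line to one letter via a BITS-bit random-projection hash."""
    vector = vectorize(line)
    value = 0
    for projection in PROJECTIONS:
        dot = sum(p * v for p, v in zip(projection, vector))
        value = (value << 1) | (1 if dot >= 0 else 0)
    return chr(ord("A") + value)  # printable stand-in for the raw code point

toy = (["Mary had a little lamb whose fleece was white as snow."] * 5
       + ["Gary had a little hand whose hair was as white as blow."]
       + ["some more strings"] * 5
       + ["John McAfee was the keynote for Skytalks."] * 3
       + ["some more strings"] * 3
       + ["John McAfee was the keynote for Skytalks."] * 2
       + ["FOO BAR BAS"])

genome = "".join(encode(line) for line in toy)
print(genome)
```

Identical lines always map to the same letter, so the genome is mostly runs of repeated letters. Which letter each distinct line receives depends on the random vectors.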
![Page 38: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/38.jpg)
Blar.py vs. Toy File
![Page 39: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/39.jpg)
Blar.py vs. Toy File
(Look Raffy, I’m using the completely inappropriate chart type)
![Page 40: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/40.jpg)
Blar.py vs. BlueGene/L
• From the USENIX Computer Failure Data Repository
• 1.2GB combined log file from 131,072 processors for six months
• 119MB compressed with gzip
• 9.4MB blar.py genome
• Blar.py ~1000 lines/sec
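The size figures above work out to the following ratios. The only assumption is treating 1.2GB as 1200MB. Note that the genome is a lossy encoding of the log rather than a reversible compression, so the ratios compare footprint, not fidelity.

```python
raw_mb = 1200      # 1.2GB, taking 1GB = 1000MB (an assumption)
gzip_mb = 119
genome_mb = 9.4

print(round(raw_mb / gzip_mb, 1))     # 10.1  (raw log vs. gzip)
print(round(raw_mb / genome_mb, 1))   # 127.7 (raw log vs. genome)
print(round(gzip_mb / genome_mb, 1))  # 12.7  (gzip vs. genome)
```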
![Page 41: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/41.jpg)
Blar.py vs. BlueGene/L
![Page 42: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/42.jpg)
Blar.py vs. BlueGene/L
![Page 43: Security's in your DNA: Genomics for InfoSec](https://reader033.fdocuments.in/reader033/viewer/2022061208/548773aab4af9f820d8b5428/html5/thumbnails/43.jpg)
TL;DR
• Fast, accurate, free: the Blar.py genomic encoding tool provides very fast, low-noise anomaly detection
• Stop searching in a crisis: Great way to quickly explore data for IR, forensics, etc., especially from unknown sources
• Want it? Follow me @conduit242 for the GitHub posting announcement