File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf ·...
Transcript of File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf ·...
![Page 1: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/1.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
1
DFRWS20003
File Classification Using Sub-Sequence Kernels
Olivier de VelComputer Forensics Group
DSTO, Australia
![Page 2: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/2.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
2
DFRWS20003
Presentation:
Document mining and computer forensics
Semi-structured documents
Broadband k-spectrum kernel
Coloured generalized suffix tree representation
Experimental methodology
Results and Conclusion
![Page 3: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/3.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
3
DFRWS20003
Computer Forensics: Focus on Document Mining
Eg, disk surface contents
structured data
unstructured datasemi-structured
data
![Page 4: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/4.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
4
DFRWS20003
Presentation:
Document mining and computer forensics
Semi-structured documents
Broadband k-spectrum kernel
Coloured generalized suffix tree representation
Experimental methodology
Results and Conclusion
![Page 5: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/5.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
5
DFRWS20003
Semi-Structured Documents
May or may not include natural language text
May be compound documents
Include low-level, short-range structures. For example:
byte block identifiers
mark-up tokens
![Page 6: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/6.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
6
DFRWS20003
Semi-Structured Documents
⇒ Traditional text mining techniques generally not appropriate.
![Page 7: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/7.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
7
DFRWS20003
Presentation:
Document mining and computer forensics
Semi-structured documents
Broadband k-spectrum kernel
Coloured generalized suffix tree representation
Experimental methodology
Results and Conclusion
![Page 8: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/8.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
8
DFRWS20003
Broadband k-Spectrum Kernel:
Use an extension to the string kernel.
String (k-Spectrum) Kernel:Is the set of all k-length contiguous subsequences that a sequence xi (of alphabet T, size |T|=l) containsFeature space dimension is up to |T|k (eg, if k=4 and |T| = 256 ⇒number of dimensions ≈ 109)
BOOKOOPERPPO…
BOOKOOKO
OKOOKOOP
(k=4)
Xi =
OOPEetc….
Vector representation: (φκ1(xi), φκ1(xi), …, φκ1(xi)) where κ1, κ2,…, κl ∈ Tk
![Page 9: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/9.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
9
DFRWS20003
Broadband k-Spectrum Kernel:
Broadband k-Spectrum Kernel:
Is the set of all (0, 1, ..k)-length contiguous subsequences that a sequence (of alphabet T) containsFeature space dimension is even larger!
BOOKOOPERPPO…
B, BO, BOO, BOOKO, OO, OOK, OOKO
O, OK, OKO, OKOOK, KO, KOO, KOOP
(k=4)
O, OO, OOP, OOPEetc….
![Page 10: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/10.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
10
DFRWS20003
Broadband k-Spectrum Kernel:
Matrix representation, φ(BB)(xi) :
(φκ1(1)(xi) φκ2(1)(xi) … φκl(1)(xi), φκ1(2)(xi) φκ2(2)(xi) … φκl(2)(xi),
:::
φκ1(2)(xi) φκ2(2)(xi) … φκl(2)(xi))
where κi(j) ∈ Tj for i = 1, 2,..l
![Page 11: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/11.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
11
DFRWS20003
Broadband k-Spectrum Kernel: What’s the Big Deal?
This allows us to define the inner product between two input sequences, xr and xs as:
GramMatrix(xr, xs) ≡ φ(BB)(xr) • φ(BB)(xs)
and use the “kernel trick” when maximising the margin of separation between two classes. The decision function for the SVM learning algorithm is:
f(x) = sign(Σ αI yi[φ(BB)(xr) • φ(BB)(xs)] + b)
in the transformed feature space rather than the input feature space.
![Page 12: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/12.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
12
DFRWS20003
Presentation:
Document mining and computer forensics
Semi-structured documents
Broadband k-spectrum kernel
Coloured generalized suffix tree representation
Experimental methodology
Results and Conclusion
![Page 13: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/13.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
13
DFRWS20003
Coloured Generalized Suffix Tree (CGST):
BOOKOOPERPPO PPEKOOBOKKOO(k=4)
The CGST is a labelling of the GST representation of a sequence of symbols:
It stores a count vector to be stored at each node in the GST – each vector element value equals the frequency of the subsequence in the sequence.
B O K K(1,1) (1,1) (0,1)
(1,0)
(0,1)
(1,0)
(0,1)
O K
(1,2)
(0,1)
(1,0)
OE K O(1,1) (0,1)
R P(0,1)
P
(1,3)
(1,0) (1,0)
OK K O(0,1)
O O B(0,1)
(0,1)
P(1,2)
(1,0)
![Page 14: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/14.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
14
DFRWS20003
CGST: Computing the Kernel (Gram) Matrix
For N input sequences, the NxN Gram matrix G is computed as follows:
initialize G r,s = 0 ∀ xr and xs
traverse the CGST
at each node, compute the product of each vector element pair and sum to G r,s
![Page 15: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/15.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
15
DFRWS20003
Presentation:
Document mining and computer forensics
Semi-structured documents
Broadband k-spectrum kernel
Coloured generalized suffix tree representation
Experimental methodology
Results and Conclusion
![Page 16: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/16.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
16
DFRWS20003
Experimental Methodology: Corpus
Document corpus for a controlled experiment :
5 document classes (MSWord, JPEG, MSExcel,
Java source, Adobe PDF)
465 documents (0.11KB min. to 523.3KB max.
size)
data are scaled to unit standard deviation and
zero mean
![Page 17: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/17.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
17
DFRWS20003
Experimental Methodology: Classifier
SVMlight as the (two-way) classifier,
Obtain two-way categorisation matrix for each
document type category, using 10-fold cross-validation
sampling,
Calculate per-document-type category classification
performance statistics – precision, recall and F1
statistics.
![Page 18: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/18.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
18
DFRWS20003
Presentation:
Document mining and computer forensics
Semi-structured documents
Broadband k-spectrum kernel
Coloured generalized suffix tree representation
Experimental methodology
Results and Conclusion
![Page 19: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/19.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
19
DFRWS20003
Experiments: Performance Results
Classification performance was evaluated using the
following parameters:
Minimum and maximum sub-sequence window
offset lengths,
![Page 20: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/20.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
20
DFRWS20003
Experiments: Performance Results
Minimum and Maximum Sub-sequence Window Offset Length:
Definition: For example;
sub-sequence window(length = 5)
MinSeqLength(=2)
MaxSeqLength(=4)
![Page 21: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/21.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
21
DFRWS20003
Experiments: Performance Results
Minimum Sub-sequence Offset Length, MinSeqLength:
12
34
5
MSWord
JPEG
MSExcel
Java
0
10
20
30
40
50
60
70
80
90
100
F1 Performance (%)
Minimum Sub-sequence Offset Length
Document Type
Optimal MinSeqLength=1
Use broadband spectrum
Faster drop-off for non-text
documents (except JPEG)
[Parameters: MaxStringLength=256 MaxSeqLength=5 MinSeqCount=10]
![Page 22: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/22.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
22
DFRWS20003
Experiments: Performance Results
Maximum Sub-sequence Offset Length, MaxSeqLength:
12
34
58
1015
MSWord
JPEG
MSExcel
Java
0
10
20
30
40
50
60
70
80
90
100
F1 Performance (%)
Maximum Sub-sequence Offset Length
Document Type
Optimal value is
4<MaxSeqLength<7
No significant improvements
for large sub-sequence offset
values
![Page 23: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/23.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
23
DFRWS20003
Conclusions:
Promising semi-structured document categorization results using the broadband k-spectrum kernel.Experiments suggest:
use broadband kernels rather than narrow band kernels (iefixed k-length sequence), document length can be small,use a minimum suffix count to achieve improved classification
performance and tree compression.Extensions:
Increase the number of document typesInvestigate techniques for capturing longer-range data structures in documents
![Page 24: File Classification Using Sub- Sequence Kernelsold.dfrws.org/2003/presentations/Brief-deVel.pdf · Semi-structured documents Broadband k-spectrum kernel Coloured generalized suffix](https://reader033.fdocuments.in/reader033/viewer/2022060719/607fd5f92593a163b569df86/html5/thumbnails/24.jpg)
INFORMATION NETWORKS DIVISIONCOMPUTER FORENSICS UNCLASSIFIED www.dsto.defence.gov.au
24
DFRWS20003
Questions ?Advertisement:
“Computer and Intrusion Forensics” by G. Mohay, A. Anderson, B.Collie, O. de Vel and R. McKemmishArtech House ISBN 1-58053-369-8 (2003)