High Performance Indexing of Large Heterogeneous Data Sets using GPU - GTC … · 2015. 3. 18. ·...
High Performance Indexing of Large Heterogeneous Data Sets using GPU
Massimo Bernaschi IAC – National Research Council of Italy
funded by the ISEC programme under GA n° 4000003856
Why a new indexer?
• Law Enforcement Agencies need an easy and fast tool to index and search seized disk images
GTC 2015 2
How it works
• Extract raw files and metadata from (seized) disk images
• Distribute them over multiple systems
• Extract plain text and metadata from every file – including deleted files
• Create distributed indexes
• Provide a friendly user interface to query results
• Organize query results in an intuitive visual representation
Architecture Overview
[Figure: architecture diagram. Components: Web GUI (Search, Admin), Mediator, Coordinator (Status Manager, Job Scheduler), DBMS and Database, Index Repository, Searcher, and an HPC cluster of worker nodes, each running a Worker Agent. Connections' legend: DB input/output, HPC cluster.]
Architecture Overview (cont.)
• Coordinator – Manages, coordinates and monitors the whole system
• DBMS – Provides the interface to the database
• Mediator – Mediates among all components to ease message communication
• Admin – Web UI used to manage the infrastructure, create investigation cases and add disk images for indexing
• Worker Agent – Runs on all worker nodes and provides services for monitoring, starting, stopping and configuring local components
• Index Repository – Stores the results of all indexing jobs
Architecture Overview (cont.)
• Each worker node can run one or more
– Image-Extractor: extracts files from seized disk images
– Docu-Parser: transforms extracted documents into plain text and metadata
– Docu-Indexer: creates searchable indexes from the transformed text and metadata
• These components are managed by worker agents
• They are connected to form an Extraction –> Parse –> Indexing pipeline
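The Extraction, Parse and Indexing chain can be pictured as three streaming stages. The following is a minimal single-process sketch in Python, not the system's code: the function names, the in-memory `image` dictionary and the whitespace tokenization are assumptions made for illustration, standing in for the Image-Extractor, Docu-Parser and Docu-Indexer components.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def extract(image: Dict[str, bytes]) -> Iterator[Tuple[str, bytes]]:
    """Stage 1 (hypothetical): yield (path, raw bytes) for each file in an image."""
    for path, raw in image.items():
        yield path, raw


def parse(files: Iterable[Tuple[str, bytes]]) -> Iterator[Tuple[str, str]]:
    """Stage 2 (hypothetical): turn raw bytes into plain text."""
    for path, raw in files:
        yield path, raw.decode("utf-8", errors="replace")


def index(docs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Stage 3 (hypothetical): map each lower-cased term to the paths containing it."""
    inverted: Dict[str, Set[str]] = defaultdict(set)
    for path, text in docs:
        for term in text.lower().split():
            inverted[term].add(path)
    return dict(inverted)


# Toy stand-in for a disk image: path -> file content.
image = {"/a.txt": b"Seized disk", "/b.txt": b"disk image"}

# Generators chain the stages, so each file flows through all three
# without any intermediate result being stored.
result = index(parse(extract(image)))
print(sorted(result["disk"]))
```

Chaining generators keeps only one file in flight per stage, which is the property the pipeline relies on; in the real system the stages run as separate components on worker nodes rather than in one process.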
Extract – Parse – Indexing Pipeline
[Figure: pipeline diagram. One Image-Extractor (1: EXTRACT) feeds four Docu-Parsers (2: PARSE), which feed four Docu-Indexers (3: INDEXING).]
Disk Image Extraction
• Performed by the Image-Extractor component
• Based on The Sleuth Kit Library® (http://www.sleuthkit.org/)
• Supports Unix, Linux, OS X and Windows volumes and file systems
• Extracts raw files and file system metadata
• System metadata: FILENAME, PATH, SIZE, CREATION_DATE, LAST_MODIFICATION_DATE
Document Parsing
• Performed by the Docu-Parser component
• Based on the Apache Tika™ library (http://tika.apache.org/)
• Detects and extracts document metadata and structured text
• Supports about 1400 file types
• Document metadata: AUTHOR, TITLE, KEYWORDS, SUMMARY, LANGUAGE, TOOL, RIGHTS, FORMAT
Document Indexing
• Performed by the Docu-Indexer component
• Based on the Apache Lucene™ libraries (http://lucene.apache.org/)
• Provides indexing and search capabilities
• Index size is roughly 20-30% of the size of the indexed text
• Indexes are collected into the Index Repository
Document Searching
• Based on the Apache Lucene™ libraries
• Provides searching capabilities:
– ranked searching
– multiple-index searching with merged results
– many powerful query types
– fielded searching (e.g. title, author, contents)
• Working on presenting results through an efficient and interactive interface
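Multiple-index searching with merged results can be illustrated with a small sketch. It assumes that each index returns its hits already sorted by descending score and that scores are comparable across indexes; the `merge_ranked` helper and the sample hit lists are hypothetical and do not use the Lucene API.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def merge_ranked(per_index_hits: Iterable[List[Hit]], top_k: int) -> List[Hit]:
    """Merge per-index hit lists (each sorted by descending score) into one ranking."""
    merged = heapq.merge(*per_index_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, top_k))


# Hypothetical hits returned by two worker-node indexes for the same query.
index_a = [(0.92, "a/report.pdf"), (0.40, "a/notes.txt")]
index_b = [(0.75, "b/mail.eml"), (0.10, "b/log.csv")]

top = merge_ranked([index_a, index_b], top_k=3)
print(top)
```

Because the merge is lazy, only the first `top_k` hits are ever pulled from the per-index lists, which matters when each index can return many results.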
HPC Document Indexing
• Text analysis requires tokenization, filtering and stop-word removal
• GPU cards offer huge computing power
• Combine CLucene indexing with GPU power to accelerate these steps
• CLucene libraries: http://clucene.sourceforge.net/
GPU CUDA Text Analysis
• Example input text: "My name is Bob.\0"
• One CUDA thread per character. Each thread applies the LowerCase filter:
– m y _ n a m e _ i s _ b o b . \0 (16 characters, positions 0 to 15; _ marks a space)
• Each CUDA thread performs tokenization by locating delimiter positions (-1 marks a non-delimiter character):
– -1 -1 2 -1 -1 -1 -1 7 -1 -1 10 -1 -1 -1 14 15
• Vector processing creates two vectors holding the start and end token indexes (relative to the input text):
– Start indexes: 0 3 8 11
– End indexes: 2 7 10 14
– Tokens: my, name, is, bob
• One CUDA thread per token. Each thread applies the StopWords filter:
– Remaining tokens: my, name, bob
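The steps of this example can be emulated on the CPU to check the vectors shown on the slide. This is a sequential Python sketch, not the CUDA kernels: each list element stands for the work of one GPU thread, and the delimiter set and the stop-word list are assumptions chosen to reproduce the example.

```python
from typing import List, Tuple

DELIMITERS = set(" .\0")  # assumed delimiter set (space, full stop, terminator)
STOP_WORDS = {"is"}       # assumed stop-word list


def analyze(text: str) -> Tuple[List[int], List[int], List[int], List[str]]:
    """Emulate the per-character and per-token steps; text must end with a delimiter."""
    # Step 1: one "thread" per character applies the LowerCase filter.
    lowered = text.lower()

    # Step 2: one "thread" per character writes its own position
    # if the character is a delimiter, otherwise -1.
    marks = [i if ch in DELIMITERS else -1 for i, ch in enumerate(lowered)]

    # Step 3: compact the marks into start and end index vectors.
    # A token ends at a delimiter whose previous character is not a delimiter.
    ends = [i for i in marks if i > 0 and marks[i - 1] == -1]
    # A token starts at a non-delimiter that is first or follows a delimiter.
    starts = [
        i for i, m in enumerate(marks)
        if m == -1 and (i == 0 or marks[i - 1] != -1)
    ]

    # Step 4: one "thread" per token applies the StopWords filter.
    tokens = [lowered[s:e] for s, e in zip(starts, ends)]
    kept = [t for t in tokens if t not in STOP_WORDS]
    return marks, starts, ends, kept


marks, starts, ends, kept = analyze("My name is Bob.\0")
print(marks)
print(starts, ends, kept)
```

On a GPU the compaction in step 3 would typically be done with a parallel primitive rather than a sequential scan; the slides only describe it as "vector processing", so the exact method is not known from the source.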
(2070 Fermi) GPU CUDA Results
[Figure: bar chart of time in seconds (axis from 0 to 70) versus plain-text size (4MB, 32MB, 128MB) for CLucene and GPU+CLucene. Speed-up labels, in the order shown: 2x, 7x, 9x.]
CUDA and (Java)Lucene 1/2
● How do they cooperate?
CUDA and (Java)Lucene 2/2
● How do they cooperate efficiently?
o Smart and efficient memory transfer using the Java Unsafe API
Test Environment
• 4 Worker Nodes
– 4 CPUs / 24 cores, 2.67 GHz, 48 GB RAM
– 2 2070 GPUs per node
– Running Worker Agents and Extract – Parse – Indexing Pipelines
• 1 Management Node
– Running all other components
• 1G Ethernet
Disk Images for Test
• Disk images built using the Govdocs1 document set
• The Govdocs1 digital corpus includes nearly 1 million freely redistributable files
• Govdocs1 is available at http://digitalcorpora.org/corpora/files
[Figure: Govdocs1 File Types, bar chart of the share of each file type (axis from 0% to 25%); labels include image, doc, ppt, ps, gz.]
Results
Disk Image Size (GB) | Extract-Parse-Indexing Time | DD Time  | # Files | Index Size (GB)
32                   | 00:09:15                    | 00:07:36 | 58225   | 4.8
80                   | 00:17:43                    | 00:14:50 | 117282  | 8.5
100                  | 00:37:16                    | 00:20:15 | 186305  | 12
210                  | 01:02:22                    | 00:33:21 | 368856  | 19
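The results can be turned into throughput figures with simple arithmetic. The sketch below only re-expresses the published numbers; the helper names `to_seconds` and `summarize` are hypothetical.

```python
from typing import List, Tuple

# (image size in GB, Extract-Parse-Indexing time, DD time), copied from the results.
RESULTS: List[Tuple[int, str, str]] = [
    (32, "00:09:15", "00:07:36"),
    (80, "00:17:43", "00:14:50"),
    (100, "00:37:16", "00:20:15"),
    (210, "01:02:22", "00:33:21"),
]


def to_seconds(hms: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarize(size_gb: int, pipeline_time: str, dd_time: str) -> Tuple[float, float]:
    """Return (pipeline throughput in GB/min, pipeline time divided by DD time)."""
    pipeline_s = to_seconds(pipeline_time)
    dd_s = to_seconds(dd_time)
    return size_gb / (pipeline_s / 60), pipeline_s / dd_s


for size_gb, pipeline_time, dd_time in RESULTS:
    gb_per_min, ratio = summarize(size_gb, pipeline_time, dd_time)
    print(f"{size_gb:>4} GB: {gb_per_min:.2f} GB/min, {ratio:.2f}x the DD time")
```

Read this way, the full pipeline takes roughly 1.2 times the DD time for the 32 GB and 80 GB images and between 1.8 and 1.9 times for the 100 GB and 210 GB images.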
[Figure: "Extract-Parse-Index Time" plotted against seized disk image size (32, 80, 100 and 210 GB); time axis from 0:00:00 to 1:12:00.]
29/09/14 21
64 GB Disk Image Indexing
[Figure: bar chart of size in GB (axis from 0 to 70) for Disk Image, Extracted Text and ISODAC Index, broken down by file type: text, pdf, xls, others, html, doc, csv, xml, ps, ppt, gz, image.]
Highlights
• Streamed In-Memory Extraction+Parse+Indexing
– Only indexes are written to disk
– Much faster than a Map-Reduce based solution
• File indexing failure recovery
– Files are processed again in case of failure
– Selectable file extraction and indexing
• Exportable indexes
– Generated indexes can be exported and handed back to investigators
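The "processed again in case of failure" policy can be sketched as a bounded retry loop. The `process_with_retry` helper, the retry limit and the flaky stand-in parser are all hypothetical; the slides do not say how the system detects failures or how many times it retries.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_with_retry(
    paths: Iterable[str],
    process: Callable[[str], str],
    max_attempts: int = 3,  # assumed limit, not taken from the slides
) -> Tuple[Dict[str, str], List[str]]:
    """Run `process` on every path, retrying a failed file up to `max_attempts` times."""
    done: Dict[str, str] = {}
    failed: List[str] = []
    for path in paths:
        for attempt in range(1, max_attempts + 1):
            try:
                done[path] = process(path)
                break  # success: move on to the next file
            except Exception:
                if attempt == max_attempts:
                    failed.append(path)  # give up and report the file
    return done, failed


# Stand-in parser that fails the first time it sees each file.
seen = set()


def flaky_parse(path: str) -> str:
    if path not in seen:
        seen.add(path)
        raise IOError(f"transient failure on {path}")
    return f"text of {path}"


done, failed = process_with_retry(["a.doc", "b.pdf"], flaky_parse)
print(done, failed)
```

Retrying per file rather than per disk image keeps a single corrupt document from forcing the whole job to restart.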
Future Work
• Distribute workload based on file type
• Enhance scheduling algorithm
• Support file extraction filtering
• Alternative ad-hoc parser based on file type
• CUDA version of Tesseract (for fast OCR)
• Enhanced and interactive results visualization
Tesseract OCR
• Profiling with Valgrind's Callgrind tool shows that 3 functions account for approximately 50% of self execution time
– Try to parallelize these in CUDA
• In multi-page documents, the ProcessPages function takes nearly 98% of total execution time
– Go parallel with OpenMP (1 page per thread), or
– Get the total number of pages and launch one process per page
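The "one page per thread" idea can be sketched with a thread pool. This Python sketch only illustrates how pages are distributed: `ocr_page` is a hypothetical placeholder, not a Tesseract call, and the slides propose OpenMP threads or one process per page rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def ocr_page(page: bytes) -> str:
    """Hypothetical placeholder for the per-page OCR work."""
    return page.decode("ascii").upper()


def ocr_document(pages: List[bytes], workers: int = 4) -> List[str]:
    """Process pages concurrently; results come back in page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, pages))


pages = [b"page one", b"page two", b"page three"]
print(ocr_document(pages))
```

Pages are independent units of work, which is why a per-page split is attractive when one function dominates the execution time of multi-page documents.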
Why Not?
• Hadoop performance
– MapReduce performance [Jiang et al. (2010)] [Lin et al. (2012)]
– HDFS performance [Dong et al. (2014)]
• Seized disk images are neither stored on the cluster nor available on a distributed infrastructure
• As fast as possible
– In-Memory Streaming Pipeline
• Only indexes are written to disk
• Ad-hoc recovery process