High Performance Indexing of Large Heterogeneous Data Sets using GPU - GTC … · 2015. 3. 18. ·...

High Performance Indexing of Large Heterogeneous Data Sets using GPU

Massimo Bernaschi IAC – National Research Council of Italy

funded by the ISEC programme under GA n° 4000003856

Why a new indexer?

• Law Enforcement Agencies need an easy and fast tool to index and search seized disk images

GTC 2015 2

How it works

• Extract raw files and metadata from (seized) disk images

• Distribute them over multiple systems

• Extract plain text and metadata from every file – including deleted files

• Create distributed indexes

• Provide a friendly user interface to query results

• Organize query results in an intuitive visual representation


Architecture Overview

[Figure: architecture diagram. Components shown: Web GUI (Search, Admin), Database, DBMS, Mediator, Coordinator (Status Manager, Job Scheduler), Searcher, Index Repo, and an HPC Cluster of Worker Nodes, each with a Worker Agent. Connections' legend: DB input/output, HPC cluster.]

Architecture Overview (cont.)

• Coordinator – Manages, coordinates and monitors the whole system

• DBMS – Provides the interface to the database

• Mediator – Mediates among all components to ease message communication

• Admin – Web UI used to manage the infrastructure, create investigation cases and add disk images for indexing

• Worker Agent – Runs on all worker nodes and provides services for monitoring, starting, stopping and configuring local components

• Index Repository – Repository used to store the results of all indexing jobs


Architecture Overview (cont.)

• Each worker node can run one or more of the following components:

– Image-Extractor: extracts files from seized disk images

– Docu-Parser: transforms extracted documents into plain text and metadata

– Docu-Indexer: creates searchable indexes from the transformed text and metadata

• The components are managed by worker agents

• They are connected to form an Extraction –> Parse –> Indexing Pipeline
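The three stages can be pictured as chained streaming steps in which each stage consumes the output of the previous one without writing intermediate results to disk. The following is a minimal Python sketch under stated assumptions: `extract_files`, `parse_documents` and `index_documents` are hypothetical stand-ins for the Image-Extractor, Docu-Parser and Docu-Indexer, and the "disk image" is just a dictionary of paths and bytes. It illustrates the data flow only, not the project's implementation.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def extract_files(image: Dict[str, bytes]) -> Iterator[Tuple[str, bytes]]:
    """Stage 1 (EXTRACT): yield (path, raw bytes) for every file in the toy image."""
    for path, raw in image.items():
        yield path, raw


def parse_documents(files: Iterable[Tuple[str, bytes]]) -> Iterator[Tuple[str, str]]:
    """Stage 2 (PARSE): turn raw bytes into plain text (here: a plain UTF-8 decode)."""
    for path, raw in files:
        yield path, raw.decode("utf-8", errors="replace")


def index_documents(docs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Stage 3 (INDEXING): build a toy inverted index mapping term -> sorted paths."""
    index: Dict[str, Set[str]] = {}
    for path, text in docs:
        for term in text.lower().split():
            index.setdefault(term, set()).add(path)
    return {term: sorted(paths) for term, paths in index.items()}


def run_pipeline(image: Dict[str, bytes]) -> Dict[str, List[str]]:
    # Generators keep the stages streaming: one file flows through all
    # three stages before the next one is extracted.
    return index_documents(parse_documents(extract_files(image)))


toy_image = {"/a.txt": b"seized disk image", "/b.txt": b"Disk index"}
toy_index = run_pipeline(toy_image)
print(toy_index["disk"])
```

In the real system the stages run as separate components on worker nodes, so the generator chain above would correspond to message passing between processes rather than function calls.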


Extract – Parse – Indexing Pipeline

[Figure: pipeline diagram with one Image-Extractor, four Docu-Parsers and four Docu-Indexers, arranged in three stages: 1: EXTRACT, 2: PARSE, 3: INDEXING.]


Disk Image Extraction

• Performed by the Image Extractor component

• Based on The Sleuth Kit Library®

• Supports Unix, Linux, OS X and Windows volumes and file systems

• Extracts raw files and file system metadata


The Sleuth Kit Library http://www.sleuthkit.org/

SYSTEM METADATA: CREATION_DATE, FILENAME, SIZE, PATH, LAST_MODIFICATION_DATE

Document Parsing

• Performed by the Docu-Parser component

• Based on Apache Tika™ Library

• Detects and extracts document metadata and structured text

• Supports about 1400 file types


Tika Library http://tika.apache.org/

DOCUMENT METADATA: AUTHOR, TITLE, KEYWORDS, SUMMARY, LANGUAGE, TOOL, RIGHTS, FORMAT

Document Indexing

• Performed by the Docu-Indexer component

• Based on Apache Lucene™ Libraries

• Provides indexing and search capabilities

• Index size is roughly 20-30% of the size of the indexed text

• Indexes are collected into the Index Repository


Apache Lucene™ Libraries http://lucene.apache.org/

Document Searching

• Based on Apache Lucene™ Libraries

• Provides searching capabilities:

– ranked searching

– multiple-index searching with merged results

– many powerful query types

– fielded searching (e.g. title, author, contents)

• Working on presenting results through an efficient and interactive interface
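Multiple-index searching with merged results can be pictured as a k-way merge of per-index ranked hit lists. The sketch below is illustrative Python, not Lucene's API: the `Hit` type and `merge_ranked` are hypothetical names, each index is assumed to return hits already sorted by descending score, and scores are assumed to be comparable across indexes.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (score, document id); hypothetical hit representation


def merge_ranked(results_per_index: Iterable[List[Hit]], top_k: int) -> List[Hit]:
    """Merge per-index hit lists (each sorted by descending score) into one top-k list."""
    # heapq.merge performs a lazy k-way merge; reverse=True expects inputs
    # sorted from largest to smallest key.
    merged = heapq.merge(*results_per_index, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, top_k))


index_a = [(0.9, "a/report.pdf"), (0.4, "a/notes.txt")]
index_b = [(0.7, "b/mail.eml"), (0.2, "b/todo.doc")]
top = merge_ranked([index_a, index_b], top_k=3)
print(top)  # [(0.9, 'a/report.pdf'), (0.7, 'b/mail.eml'), (0.4, 'a/notes.txt')]
```

Because the merge is lazy, only as many hits as requested are pulled from each per-index list.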


HPC Document Indexing

• Text analysis requires tokenization, filtering and stop words removal

• GPU cards offer huge computing power

• Combine CLucene indexing with GPU power to accelerate these steps


CLucene Libraries http://clucene.sourceforge.net/

GPU CUDA Text Analysis


• One CUDA Thread per character. Each thread applies LowerCase Filter.

Input text: "My name is Bob.\0"
Lower-cased text: "my name is bob.\0"

• Each CUDA Thread performs Tokenization by locating delimiter positions (a delimiter is marked with its own position, any other character with -1):

-1 -1 2 -1 -1 -1 -1 7 -1 -1 10 -1 -1 -1 14 15

• Vector processing creates two vectors representing start and end token indexes respectively.

Start Indexes (related to input text): 0 3 8 11
End Indexes (related to input text): 2 7 10 14
Tokens: my, name, is, bob

• One CUDA Thread per token. Each thread applies StopWords Filter.

Tokens after filtering: my, name, bob
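The data-parallel steps of this slide can be emulated on the CPU to check the index vectors. The sketch below is a sequential Python emulation, not CUDA and not the authors' kernels: each list comprehension stands in for "one thread per element", and the delimiter set and the tiny stop-word list are assumptions chosen so that the slide's example ("My name is Bob.") is reproduced.

```python
from typing import List, Set, Tuple

DELIMITERS = set(" .,;:!?\t\n\0")       # assumed delimiter set
STOP_WORDS = {"is", "a", "the", "of"}   # assumed, deliberately tiny stop-word list


def analyze(text: str, stop_words: Set[str] = STOP_WORDS) -> Tuple[List[int], List[int], List[str]]:
    """Return (start indexes, exclusive end indexes, tokens kept after filtering).

    The text is assumed to end with a delimiter (e.g. "\\0"), as in the slide.
    """
    n = len(text)
    # Step 1: one "thread" per character applies the lower-case filter.
    lowered = [c.lower() for c in text]
    # Step 2: one "thread" per character marks delimiters with their own
    # position and every other character with -1.
    flags = [i if lowered[i] in DELIMITERS else -1 for i in range(n)]
    # Step 3: a token starts at a non-delimiter preceded by a delimiter (or by
    # the text start) and ends at a delimiter preceded by a non-delimiter.
    starts = [i for i in range(n) if flags[i] == -1 and (i == 0 or flags[i - 1] != -1)]
    ends = [i for i in range(n) if flags[i] != -1 and i > 0 and flags[i - 1] == -1]
    # Step 4: one "thread" per token applies the stop-word filter.
    tokens = ["".join(lowered[s:e]) for s, e in zip(starts, ends)]
    kept = [t for t in tokens if t not in stop_words]
    return starts, ends, kept


starts, ends, kept = analyze("My name is Bob.\0")
print(starts)  # [0, 3, 8, 11]
print(ends)    # [2, 7, 10, 14]
print(kept)    # ['my', 'name', 'bob']
```

On a GPU, step 3 would typically be done with a parallel stream compaction over the flag vector rather than a sequential scan; the list comprehensions here only reproduce the result.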

(2070 Fermi) GPU CUDA Results

[Figure: time in seconds (axis 0 to 70) versus plain-text size (4MB, 32MB, 128MB) for CLucene and GPU+CLucene, annotated with speed-ups of 2x, 7x and 9x.]

CUDA and (Java)Lucene 1/2

● How do they cooperate?

CUDA and (Java)Lucene 2/2

● How do they cooperate efficiently?
o smart and efficient memory transfer using the Java Unsafe API

Test Environment

• 4 Worker Nodes
– 4 CPUs / 24 Cores 2.67GHz, 48 GB RAM
– 2 2070 GPUs per node
– Running Worker Agents and Extract – Parse – Indexing Pipelines

• 1 Management Node
– Running all other components


1G Ethernet

Disk Images for Test

• Disk images built using the Govdocs1 document set

• The Govdocs1 corpus includes nearly 1 million freely redistributable files

Govdocs1 available at http://digitalcorpora.org/corpora/files

[Figure: Govdocs1 File Types. Percentage (axis 0% to 25%) by file type: pdf, image, doc, ppt, ps, gz.]

Results

Disk Image Size (GB)   Extract-Parse-Indexing Time (hh:mm:ss)   DD Time (hh:mm:ss)   # Files   Index Size (GB)

32                     00:09:15                                 00:07:36             58225     4.8

80                     00:17:43                                 00:14:50             117282    8.5

100                    00:37:16                                 00:20:15             186305    12

210                    01:02:22                                 00:33:21             368856    19
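A few derived ratios make the table easier to compare across image sizes. The Python sketch below only re-uses the numbers from the table; the helper names are hypothetical, and it assumes that "DD Time" denotes the time of a raw `dd` copy of the same image, which the slide does not spell out.

```python
from typing import Dict, List, Tuple

# (image size in GB, pipeline time, DD time, number of files, index size in GB)
ROWS: List[Tuple[int, str, str, int, float]] = [
    (32, "00:09:15", "00:07:36", 58225, 4.8),
    (80, "00:17:43", "00:14:50", 117282, 8.5),
    (100, "00:37:16", "00:20:15", 186305, 12.0),
    (210, "01:02:22", "00:33:21", 368856, 19.0),
]


def to_seconds(hms: str) -> int:
    """Convert an hh:mm:ss string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def summarize(rows: List[Tuple[int, str, str, int, float]]) -> List[Dict[str, float]]:
    """Compute per-row ratios: pipeline time over DD time, files/s, index/image size."""
    out = []
    for size_gb, pipeline, dd, files, index_gb in rows:
        p, d = to_seconds(pipeline), to_seconds(dd)
        out.append({
            "size_gb": size_gb,
            "time_vs_dd": p / d,
            "files_per_second": files / p,
            "index_fraction": index_gb / size_gb,
        })
    return out


for row in summarize(ROWS):
    print(f"{row['size_gb']:>4} GB  {row['time_vs_dd']:.2f}x DD time  "
          f"{row['files_per_second']:.0f} files/s  "
          f"index = {row['index_fraction']:.1%} of image size")
```

Note that `index_fraction` here is relative to the disk image size, not to the extracted text, so it is not directly comparable with the 20-30% figure quoted for Lucene indexes.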

[Figure: "Extract-Parse-Index Time" (axis 0:00:00 to 1:12:00) versus seized disk image size in GB (32, 80, 100, 210).]

29/09/14 21

64 GB Disk Image Indexing

[Figure: size in GB (axis 0 to 70) of the Disk Image, the Extracted Text and the ISODAC Index, broken down by file type: text, pdf, xls, others, html, doc, csv, xml, ps, ppt, gz, image.]

Highlights

• Streamed In-Memory Extraction+Parse+Indexing
– Only indexes are written to disk
– Much faster than a Map-Reduce based solution

• File indexing failure recovery
– Files are processed again in case of failure
– Selectable file extraction and indexing

• Exportable indexes
– Generated indexes can be exported and handed back to investigators
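The failure-recovery highlight ("files are processed again in case of failure") can be illustrated with a bounded retry loop. This is a minimal Python sketch under assumptions: `index_with_retry`, the `max_attempts` limit and the flaky demo indexer are all hypothetical, and the slides do not describe how the actual recovery process is implemented.

```python
from typing import Callable, Dict, Iterable, List


def index_with_retry(paths: Iterable[str],
                     index_file: Callable[[str], None],
                     max_attempts: int = 3) -> List[str]:
    """Index every file, re-processing failed ones up to max_attempts rounds.

    Returns the paths that still failed after the last round.
    """
    pending = list(paths)
    for _ in range(max_attempts):
        failed = []
        for path in pending:
            try:
                index_file(path)
            except Exception:
                failed.append(path)  # keep the file for the next round
        pending = failed
        if not pending:
            break
    return pending


# Demo indexer: "b.pdf" fails once (transient error), "c.bin" always fails.
attempts: Dict[str, int] = {}


def flaky_index(path: str) -> None:
    attempts[path] = attempts.get(path, 0) + 1
    if path == "b.pdf" and attempts[path] == 1:
        raise IOError("transient parse error")
    if path == "c.bin":
        raise ValueError("unsupported format")


still_failed = index_with_retry(["a.txt", "b.pdf", "c.bin"], flaky_index)
print(still_failed)  # ['c.bin']
```

Bounding the number of rounds keeps a permanently unparsable file from blocking the rest of the job; such files can then be reported instead of retried forever.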


Future Works

• Distribute workload based on file type

• Enhance scheduling algorithm

• Support file extraction filtering

• Alternative ad-hoc parser based on file type

• CUDA version of Tesseract (for fast OCR)

• Enhanced and interactive results visualization


Tesseract OCR

• Profiling with Valgrind's Callgrind tool shows that 3 functions account for approximately 50% of the self execution time
– go parallel: try to parallelize these in CUDA

• In multi-page documents, the ProcessPages function takes nearly 98% of the total execution time
– go parallel: OpenMP with 1 page per thread, or get the total number of pages and launch a process per page
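The page-level option (one page per worker) can be sketched independently of Tesseract itself. The Python example below is an illustration under assumptions: `ocr_page` is a hypothetical stand-in that does no real OCR, and a thread pool is used only to show the "one page per worker, results in page order" pattern. For CPU-bound recognition a process pool would be the closer analogue of "launch a process per page".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def ocr_page(page: str) -> str:
    """Hypothetical stand-in for recognising one page; it only upper-cases the input."""
    return page.upper()


def ocr_document(pages: List[str], workers: int = 4) -> List[str]:
    """Process pages independently, one page per worker, preserving page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even if pages finish out of order.
        return list(pool.map(ocr_page, pages))


result = ocr_document(["page one", "page two", "page three"])
print(result)  # ['PAGE ONE', 'PAGE TWO', 'PAGE THREE']
```

This works because pages are independent units of work, which is what makes the per-page split attractive when a single function dominates the run time.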


Why Not ?

• Hadoop performance
– MapReduce performance [Jiang et al. (2010)] [Lin et al. (2012)]
– HDFS performance [Dong et al. (2014)]

• Seized disk images are neither stored on cluster nor available on a distributed infrastructure

• As fast as possible
– In-Memory Streaming Pipeline
– Only indexes are written to disk
– Ad-hoc Recovery process
