Sqrrl October Webinar: Data Modeling and Indexing

Securely explore your data
DATA MODELING AND INDEXING FOR APACHE ACCUMULO
Sqrrl Webinar Series
October, 2013
Adam Fuchs, CTO
Sqrrl Data, Inc.

description

This webinar provides a technical deep dive into the NoSQL database Apache Accumulo. Sqrrl extends Accumulo with additional security, analytical, and data modeling tools. Topics include data modeling techniques, secondary indices, JSON and Graph capabilities for Accumulo.

Transcript of Sqrrl October Webinar: Data Modeling and Indexing

Page 1: Sqrrl October Webinar: Data Modeling and Indexing

Securely explore your data

DATA MODELING AND INDEXING FOR APACHE ACCUMULO

Sqrrl Webinar Series
October, 2013
Adam Fuchs, CTO
Sqrrl Data, Inc.

Page 2: Sqrrl October Webinar: Data Modeling and Indexing

RECAP

1.  Introduction to Sqrrl and Accumulo
2.  Security In The Wild
3.  Sqrrl and Accumulo Technology
4.  The Data-Centric Security Ecosystem

In our September Webinar: Sqrrl, Apache Accumulo, and Cell-Level Security

Sqrrl Data, Inc. Confidential and Proprietary

Page 3: Sqrrl October Webinar: Data Modeling and Indexing

TODAY’S DISCUSSION

1.  Sqrrl and Accumulo Technology Review
2.  Table Designs
    1.  Dynamic Documents
    2.  Graphs
    3.  Inverted Indexes
3.  Putting It All Together with Sqrrl

Data Modeling and Indexing for Apache Accumulo


Page 4: Sqrrl October Webinar: Data Modeling and Indexing

LAYERED ARCHITECTURE

Turtles all the way down...

Application

Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)

Sqrrl Enterprise

Accumulo RPC (Sorted Key/Value I/O)

Hadoop RPC (File I/O)


Page 5: Sqrrl October Webinar: Data Modeling and Indexing

ACCUMULO DATA FORMAT

An Accumulo key is a 5-tuple, consisting of:

•  Row: Controls Atomicity
•  Column Family: Controls Locality
•  Column Qualifier: Controls Uniqueness
•  Visibility Label: Controls Access
•  Timestamp: Controls Versioning

Accumulo Key/Value Example

Row | Col. Fam. | Col. Qual. | Visibility | Timestamp | Value
John Doe | Notes | PCP | PCP_JD | 20120912 | Patient suffers from an acute …
John Doe | Test Results | Cholesterol | JD|PCP_JD | 20120912 | 183
John Doe | Test Results | Mental Health | JD|PSYCH_JD | 20120801 | Pass
John Doe | Test Results | X-Ray | JD|PHYS_JD | 20120513 | 1010110110100…
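The role of each key component can be illustrated with a small sketch. Accumulo sorts keys by row, column family, column qualifier, and visibility in ascending order, and by timestamp in descending order, so the newest version of a cell comes first. The Python below is only an illustrative model and not the Accumulo API: the `Key` tuple and `sort_order` helper are hypothetical names, and the older Cholesterol entry is made-up sample data added to show versioning.

```python
from collections import namedtuple

# Hypothetical stand-in for the 5-tuple key described on the slide.
Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_order(key):
    """Ascending on the first four components, descending on timestamp."""
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


entries = {
    Key("John Doe", "Test Results", "X-Ray", "JD|PHYS_JD", 20120513): "1010110110100...",
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute ...",
    Key("John Doe", "Test Results", "Mental Health", "JD|PSYCH_JD", 20120801): "Pass",
    # Made-up older version of the Cholesterol cell (not on the slide).
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20110912): "190",
}

sorted_keys = sorted(entries, key=sort_order)
for key in sorted_keys:
    print(key, "->", entries[key])
```

Because every entry for a row sorts together, a scan over the row "John Doe" reads one contiguous range, and the two Cholesterol versions sit next to each other with the newer one first.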


Page 6: Sqrrl October Webinar: Data Modeling and Indexing

THE ACCUMULO CLIENT API

Instance: new ZooKeeperInstance(...), new MockInstance()
  getConnector(...) -> Connector

Connector: TableOperations, InstanceOperations, SecurityOperations
  createScanner(...) -> Scanner
  createBatchScanner(...) -> BatchScanner
  createBatchWriter(...) -> BatchWriter

Scanner, BatchScanner: Range, IteratorOption
  iterator() -> Map.Entry (Key, Value)

BatchWriter
  addMutation(...): Mutation


Page 7: Sqrrl October Webinar: Data Modeling and Indexing

ACCUMULO TECHNOLOGY

[Diagram: Tablet Data Flow. Labels: Writes, Iterator Tree, In-Memory Map, Write Ahead Log (For Recovery), Minor Compaction, Sorted Indexed File (x3), Merging / Major Compaction, Reads, Scan.]

[Diagram: cluster architecture. Labels: Application (x3), Zookeeper (x3), Master, Tablet Server (x3), Tablet (x3), HDFS. Arrows: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority.]

Strengths
•  Shared-Nothing => Scalability
•  Micro-Batching for Efficient Random I/O
•  High Concurrency, Low Latency for Denormalized Data
•  Sparse, Flexible Schema supports dynamic and diverse data models
•  Cell-level Security promotes sharing

Weaknesses
•  Sorting induces write multiplication factor
•  Sparse schema support induces additional storage overhead
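The tablet data flow labels on this slide (in-memory map, write-ahead log, minor compaction, sorted indexed files, merging / major compaction, scan) describe a log-structured merge pattern. The Python below is a minimal illustrative model under that assumption. The `Tablet` class and its method names are hypothetical, iterator trees are omitted, and nothing here reflects Accumulo's actual internals.

```python
import heapq
from itertools import groupby


def newest_per_key(stream):
    """Keep only the first (newest) entry of each key in a sorted stream."""
    for _, versions in groupby(stream, key=lambda entry: entry[0]):
        yield next(versions)


class Tablet:
    def __init__(self):
        self.seq = 0      # logical clock standing in for timestamps
        self.wal = []     # write-ahead log, kept only for recovery
        self.memmap = {}  # in-memory map: key -> (seq, value)
        self.files = []   # immutable sorted files of (key, -seq, value)

    def write(self, key, value):
        self.seq += 1
        self.wal.append((key, self.seq, value))   # log first
        self.memmap[key] = (self.seq, value)      # then buffer in memory

    def _mem_sorted(self):
        return sorted((k, -s, v) for k, (s, v) in self.memmap.items())

    def minor_compact(self):
        """Flush the in-memory map to a new sorted file; the log can be dropped."""
        if self.memmap:
            self.files.append(self._mem_sorted())
            self.memmap, self.wal = {}, []

    def major_compact(self):
        """Merge all sorted files into one, discarding superseded versions."""
        if self.files:
            self.files = [list(newest_per_key(heapq.merge(*self.files)))]

    def scan(self):
        """Merge memory and files in key order, returning the newest values."""
        merged = heapq.merge(self._mem_sorted(), *self.files)
        return [(k, v) for k, _, v in newest_per_key(merged)]
```

The sketch also hints at the write multiplication listed under Weaknesses: each entry is written once to the log, once at minor compaction, and again every time a major compaction rewrites the file that holds it.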


Page 8: Sqrrl October Webinar: Data Modeling and Indexing

TODAY’S DISCUSSION

1.  Sqrrl and Accumulo Technology Review
2.  Table Designs
    1.  Dynamic Documents
    2.  Graphs
    3.  Inverted Indexes
3.  Putting It All Together with Sqrrl

Data Modeling and Indexing for Apache Accumulo


Page 9: Sqrrl October Webinar: Data Modeling and Indexing

PROXY/NETFLOW EXAMPLE

Source | Destination | Port | Bytes In | Bytes Out | Protocol
10.1.2.3 | google.com | 80 | 73,824 | 15,632 | http
10.1.2.4 | facebook.com | 443 | 10,328 | 13,284,129 | https
10.1.2.4 | google.com | 80 | 623,249 | 93,125 | http
10.1.2.3 | abcd1234.ru | 31337 | 158 | 523,698,104 | unknown
10.1.2.3 | netflix.com | 443 | 434,855,357 | 1,392,994 | https
10.1.2.4 | google.com | 443 | 23,084 | 583,331 | https
10.1.2.3 | 10.1.2.5 | 22 | 204 | 158 | ssh


Page 10: Sqrrl October Webinar: Data Modeling and Indexing

INDEXES AND QFDS

Input: Logs/Observations
•  Immutable
•  Append-Only
•  Real-Time

Transformation

Indexes and Question-Focused Datasets
•  Online
•  Sorted
•  Grouped
•  Aggregated


Page 11: Sqrrl October Webinar: Data Modeling and Indexing

QFD KEY GENERATION

Source | Destination | Port | Bytes In | Bytes Out | Protocol
10.1.2.3 | google.com | 80 | 73,824 | 15,632 | http

Key -> Value
10.1.2.3, Bytes In -> +73,824
10.1.2.3, Bytes Out -> +15,632
10.1.2.3, Ports Used -> +{80}
10.1.2.3, Protocols Used -> +{http}

[Diagram: Hosts QFD key space, 0x00 ... 0xFF]
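The key generation step on this slide, one log record fanned out into several (key, contribution) pairs for the Hosts QFD, can be sketched as follows. This is an illustrative Python model only: the function name `host_contributions` and the dictionary record format are assumptions, and only the source-host contributions listed on the slide are generated.

```python
def host_contributions(record):
    """Turn one proxy/netflow record into Hosts QFD contributions.

    Each contribution is ((row, column), delta), where delta is either a
    number to add or a set to union into the existing value.
    """
    src = record["Source"]
    return [
        ((src, "Bytes In"), record["Bytes In"]),
        ((src, "Bytes Out"), record["Bytes Out"]),
        ((src, "Ports Used"), frozenset([record["Port"]])),
        ((src, "Protocols Used"), frozenset([record["Protocol"]])),
    ]


# The record shown on the slide.
record = {
    "Source": "10.1.2.3",
    "Destination": "google.com",
    "Port": 80,
    "Bytes In": 73824,
    "Bytes Out": 15632,
    "Protocol": "http",
}

contributions = host_contributions(record)
for key, delta in contributions:
    print(key, "->", delta)
```

Each contribution is a delta rather than a final value, so records can be processed independently and in any order.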


Page 12: Sqrrl October Webinar: Data Modeling and Indexing

HOSTS QFD WITH AGGREGATION

IP | Ports Used | Protos Used | Total Bytes In | Total Bytes Out | Ports Hosted | Protos Hosted
10.1.2.3 | {22, 80, 443, 31337} | {http, https, ssh, unknown} | 434,931,543 | 525,106,888 | - | -
10.1.2.4 | {80, 443} | {http, https} | 656,661 | 13,960,585 | - | -
10.1.2.5 | - | - | 158 | 204 | {22} | {ssh}

New Contribution: (10.1.2.5, Total Bytes In -> +3,215)

158 + 3,215 = 3,373
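The aggregation on this slide folds each new contribution into the existing cell: numeric columns are summed and set-valued columns are unioned. In Accumulo this kind of merge is typically done server-side by combiner iterators. The Python below is only a client-side illustration of the arithmetic, with hypothetical function names `combine` and `apply_contributions`.

```python
def combine(current, delta):
    """Merge a contribution into the current cell value."""
    if current is None:
        return delta                 # first contribution creates the cell
    if isinstance(delta, (set, frozenset)):
        return current | delta       # set-valued columns: union
    return current + delta           # numeric columns: sum


def apply_contributions(qfd, contributions):
    """Apply ((row, column), delta) contributions to a dict-backed QFD."""
    for key, delta in contributions:
        qfd[key] = combine(qfd.get(key), delta)
    return qfd


# The example from the slide: 158 + 3,215 = 3,373.
qfd = {("10.1.2.5", "Total Bytes In"): 158}
apply_contributions(qfd, [
    (("10.1.2.5", "Total Bytes In"), 3215),
    (("10.1.2.5", "Ports Hosted"), frozenset([22])),
])
print(qfd[("10.1.2.5", "Total Bytes In")])
```

Sum and union are both associative and commutative, which is what lets contributions be merged in any order and at any stage, whether at scan time or during compaction.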


Page 13: Sqrrl October Webinar: Data Modeling and Indexing

CONNECTIVITY GRAPH

[Diagram: connectivity graph with nodes 10.1.2.3, 10.1.2.4, 10.1.2.5, facebook.com, google.com, abcd1234.ru, netflix.com]


Row | Col. Fam. | Col. Qual. | Val.
10.1.2.3 | Contacts | 10.1.2.5 | -
10.1.2.3 | Contacts | abcd1234.ru | -
10.1.2.3 | Contacts | google.com | -
10.1.2.3 | Contacts | netflix.com | -
10.1.2.4 | Contacts | facebook.com | -
10.1.2.4 | Contacts | google.com | -

Row | Col. Fam. | Col. Qual. | Val.
10.1.2.5 | Serves | 10.1.2.3 | -
abcd1234.ru | Serves | 10.1.2.3 | -
facebook.com | Serves | 10.1.2.4 | -
google.com | Serves | 10.1.2.3 | -
google.com | Serves | 10.1.2.4 | -
netflix.com | Serves | 10.1.2.3 | -
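The two edge tables above store each connection twice, once per direction, so that the neighbors of any node can be read from a single row. A minimal Python sketch of that encoding follows. The helper names `edge_entries` and `neighbors` are hypothetical, and the linear filter stands in for what would be a range scan over one row in a sorted table.

```python
# (Source, Destination) pairs taken from the proxy/netflow example.
flows = [
    ("10.1.2.3", "google.com"),
    ("10.1.2.4", "facebook.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "abcd1234.ru"),
    ("10.1.2.3", "netflix.com"),
    ("10.1.2.4", "google.com"),
    ("10.1.2.3", "10.1.2.5"),
]


def edge_entries(flows):
    """Emit (row, column family, column qualifier) for both edge directions."""
    entries = set()  # repeated flows collapse into one edge
    for src, dst in flows:
        entries.add((src, "Contacts", dst))  # forward edge
        entries.add((dst, "Serves", src))    # reverse edge
    return sorted(entries)


def neighbors(entries, row, family):
    """Return the qualifiers stored under one row and column family."""
    return [cq for r, cf, cq in entries if r == row and cf == family]


entries = edge_entries(flows)
print(neighbors(entries, "10.1.2.3", "Contacts"))
print(neighbors(entries, "google.com", "Serves"))
```

Storing the reverse edge doubles the writes but makes "who contacted google.com?" as cheap as "whom did 10.1.2.3 contact?".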

Page 14: Sqrrl October Webinar: Data Modeling and Indexing

INVERTED INDEXING

Table | Row | Column Family | Column Qualifier | Value
Forward Index | <UUID> | <Type> | <Field> | <Term>
Inverted Index | <Field> | <Term> | <UUID> | <Digest of Event>
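The forward and inverted layouts on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions: the column placement follows the order in which the slide lists the placeholders, the sample events and field names are made up, and the digest is an arbitrary truncated SHA-1 chosen only to have something to store. The helper names `index_entries` and `lookup` are hypothetical.

```python
import hashlib


def index_entries(uuid, event_type, fields):
    """Build forward and inverted index entries for one event.

    Forward:  (row=<UUID>,  cf=<Type>, cq=<Field>, value=<Term>)
    Inverted: (row=<Field>, cf=<Term>, cq=<UUID>,  value=<Digest of Event>)
    """
    items = sorted(fields.items())
    # Arbitrary digest for illustration; the slide does not define one.
    digest = hashlib.sha1(repr(items).encode("utf-8")).hexdigest()[:8]
    forward = [(uuid, event_type, field, term) for field, term in items]
    inverted = [(field, term, uuid, digest) for field, term in items]
    return forward, inverted


def lookup(inverted, field, term):
    """Find the UUIDs of events whose field matches the term."""
    return [uuid for f, t, uuid, _ in inverted if f == field and t == term]


# Made-up sample events.
events = {
    "e1": {"dest": "google.com", "proto": "http"},
    "e2": {"dest": "google.com", "proto": "https"},
}

forward_table, inverted_table = [], []
for uuid, fields in events.items():
    fwd, inv = index_entries(uuid, "flow", fields)
    forward_table.extend(fwd)
    inverted_table.extend(inv)
forward_table.sort()
inverted_table.sort()

print(lookup(inverted_table, "dest", "google.com"))
```

A query first reads the inverted index to get matching UUIDs, then reads the forward index rows for those UUIDs to retrieve the full events.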


Page 15: Sqrrl October Webinar: Data Modeling and Indexing

INVERTED INDEXING


Page 16: Sqrrl October Webinar: Data Modeling and Indexing

ADVANCED INDEXING

Table: Shard Table
Row: <Partition ID>
Column Family: "Docs", "Inv. Index", "Field Index", "Geo"
Column Qualifier (Tuples) / Value:
<UUID> <Value>
<Term> <UUID>
<Field:Term> <UUID> <Field>
<Hash> <UUID>
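The shard table idea, where the row is a partition ID and a document lives in the same row as its own index entries, can be sketched as below. This is an illustrative Python model with loudly stated assumptions: the partition function (CRC32 modulo a fixed shard count), the exact column qualifier tuples, the sample documents, and the helper names are all hypothetical choices for this sketch, and the "Geo" family is omitted.

```python
import zlib

NUM_PARTITIONS = 4  # hypothetical shard count


def partition_id(uuid):
    """Assign a document to a partition by hashing its UUID (assumed scheme)."""
    return zlib.crc32(uuid.encode("utf-8")) % NUM_PARTITIONS


def shard_entries(uuid, fields):
    """Emit (row, family, qualifier tuple, value) entries for one document."""
    row = partition_id(uuid)
    entries = []
    for field, term in fields.items():
        entries.append((row, "Docs", (uuid, field), term))
        entries.append((row, "Inv. Index", (term, uuid), ""))
        entries.append((row, "Field Index", (field + ":" + term, uuid), ""))
    return entries


def search(table, field, term):
    """Check the field index in every partition and collect matching UUIDs."""
    wanted = field + ":" + term
    hits = {q[1] for _, fam, q, _ in table if fam == "Field Index" and q[0] == wanted}
    return sorted(hits)


def fetch(table, uuid):
    """Read a document back from the same partition that holds its index."""
    row = partition_id(uuid)
    return {q[1]: v for r, fam, q, v in table
            if r == row and fam == "Docs" and q[0] == uuid}


# Made-up sample documents.
docs = {
    "e1": {"dest": "google.com", "proto": "http"},
    "e2": {"dest": "google.com", "proto": "https"},
}
table = sorted(e for uuid, fields in docs.items() for e in shard_entries(uuid, fields))

print(search(table, "dest", "google.com"))
print(fetch(table, "e1"))
```

Because index entries and documents share a row, each partition can answer its part of a query locally, and a search becomes a fan-out across partitions followed by a merge of the results.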


Page 17: Sqrrl October Webinar: Data Modeling and Indexing

TODAY’S DISCUSSION

1.  Sqrrl and Accumulo Technology Review
2.  Table Designs
    1.  Dynamic Documents
    2.  Graphs
    3.  Inverted Indexes
3.  Putting It All Together with Sqrrl

Data Modeling and Indexing for Apache Accumulo


Page 18: Sqrrl October Webinar: Data Modeling and Indexing

SQRRL ENTERPRISE

•  Dynamic Documents
    •  JSON I/O support
    •  Cell-level Security and Efficient Aggregation Extensions
•  Dynamic Graphs
    •  Co-partitioned with Documents for Integrated Search and Discovery
•  Search
    •  Lucene Query Syntax
    •  Accumulo Indexes Preserve Security Model
•  Processing
    •  SQL-Like Language for Transforming and Aggregating Results
    •  Parallel Slicing and Extraction


Simple API for Advanced Accumulo Usage

Page 19: Sqrrl October Webinar: Data Modeling and Indexing

REAL-TIME OPERATIONAL APPS


Contact us for a demo


Page 20: Sqrrl October Webinar: Data Modeling and Indexing

HOW TO LEARN MORE

Download our White Paper
•  www.sqrrl.com/whitepaper

Watch a video
•  www.sqrrl.com/downloads#videos

Request a demo or one-on-one workshop
•  www.sqrrl.com/contact

Come meet us
•  Accumulo Meetup (October 28, New York)
•  Strata + Hadoop World (October 28-30, New York)
•  IBM IOD (November 4-7, Las Vegas)
•  SC13 (November 18-21, Denver)


Page 21: Sqrrl October Webinar: Data Modeling and Indexing

THANK YOU

Thanks for attending!

To keep up to date with Sqrrl, check out our social media sites:
www.twitter.com/sqrrl_inc
www.linkedin.com/company/sqrrl
