An Introduction to Accumulo

AN INTRODUCTION TO APACHE ACCUMULO: HOW IT WORKS, WHY IT EXISTS, AND HOW IT IS USED. Donald Miner, CTO, ClearEdge IT Solutions, @donaldpminer. August 5th, 2014

This was presented for an O'Reilly Media webcast: http://www.oreilly.com/pub/e/3152?cmp=tw-na-webcast-product-webcast_an_introduction_to_apache_accumulo

This webcast covers the basics of Apache Accumulo's architecture and how it works, along with examples of how it is used. We'll also talk about some interesting use cases, such as text indexing, fine-grained multi-level access controls, and storing large-scale graphs. Finally, we'll briefly touch on what sets Accumulo apart from other similar and not-so-similar systems, and on where we think the Accumulo project is headed technically.

A description of Accumulo from the Apache Accumulo website:

The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and features are outlined here.

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, Hypertable, and Cassandra. Accumulo began its development in 2008 and joined the Apache community in 2011.

Transcript of An Introduction to Accumulo

Page 1: An Introduction to Accumulo

AN INTRODUCTION TO APACHE ACCUMULO

HOW IT WORKS, WHY IT EXISTS, AND HOW IT IS USED

Donald Miner

CTO, ClearEdge IT Solutions

@donaldpminer

August 5th, 2014

Page 2: An Introduction to Accumulo

The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.

COPIED AND PASTED FROM ACCUMULO.APACHE.ORG

Page 3: An Introduction to Accumulo

Adelaide Bartkowski
Alyssa Files
Beatriz Palmore
Cecilia Ours
Craig Avalos
Dianna Lapointe
Erma Davis
Fermina Smead
Garrett Harsh
Gaylene Sherry
Gilberto Pardue
Hui Nodal
Janell Tomita
Jannette Betters
Jeana Delk
Madlyn Radke
Peggie Allis
Rhona Zygmont
Tran Degarmo
Wilhelmina Papp

Page 4: An Introduction to Accumulo

-inf to D: Adelaide Bartkowski, Alyssa Files, Beatriz Palmore, Cecilia Ours, Craig Avalos, Dianna Lapointe

E to H: Erma Davis, Fermina Smead, Garrett Harsh, Gaylene Sherry, Gilberto Pardue, Hui Nodal

J to +inf: Janell Tomita, Jannette Betters, Jeana Delk, Madlyn Radke, Peggie Allis, Rhona Zygmont, Tran Degarmo, Wilhelmina Papp
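
The range partitioning shown on this slide can be sketched in plain Java. This is only an illustration, not Accumulo code: the class `TabletLocator`, the tablet names, and the choice of "E" and "J" as exclusive upper bounds are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: maps a row ID to the tablet whose key range contains it.
final class TabletLocator {
    // Exclusive upper bound of a tablet's range -> tablet name (names are made up).
    private final TreeMap<String, String> upperBounds = new TreeMap<>();
    private final String lastTablet; // tablet that extends to +inf

    TabletLocator(String lastTablet) {
        this.lastTablet = lastTablet;
    }

    void addTablet(String exclusiveUpperBound, String tablet) {
        upperBounds.put(exclusiveUpperBound, tablet);
    }

    String locate(String rowId) {
        // The smallest bound strictly greater than the row ID owns the row.
        Map.Entry<String, String> owner = upperBounds.higherEntry(rowId);
        return owner == null ? lastTablet : owner.getValue();
    }

    // Three tablets, roughly matching "-inf to D", "E to H", "J to +inf".
    static TabletLocator example() {
        TabletLocator locator = new TabletLocator("tablet-3");
        locator.addTablet("E", "tablet-1");
        locator.addTablet("J", "tablet-2");
        return locator;
    }

    public static void main(String[] args) {
        TabletLocator locator = example();
        System.out.println(locator.locate("Craig Avalos"));
        System.out.println(locator.locate("Garrett Harsh"));
        System.out.println(locator.locate("Tran Degarmo"));
    }
}
```

Because the bounds are kept sorted, finding the tablet for a row is a single ordered-map lookup, however many rows the table holds.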

Page 5: An Introduction to Accumulo

Accumulo Master

TabletServer TabletServer TabletServer

ZooKeeper

Page 6: An Introduction to Accumulo

KEY                  VALUE
Adelaide Bartkowski  91294124
Alyssa Files         491294
Beatriz Palmore      4124124124
Cecilia Ours         419120
Craig Avalos         940124
Dianna Lapointe      4921
Erma Davis           050194
Fermina Smead        10024599949
Garrett Harsh        140095931
Gaylene Sherry       914815
Gilberto Pardue      412414124124
Hui Nodal            962195192
Janell Tomita        12121
Jannette Betters     9192012
Jeana Delk           9120150
Madlyn Radke         4921
Peggie Allis         944944
Rhona Zygmont        123103
Tran Degarmo         9499494
Wilhelmina Papp      11221

Lookup by key “Garrett Harsh”: FAST

Lookup by value “4921”: SLOW
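
A rough way to see why the two lookups differ, using a plain Java `TreeMap` as a stand-in for a sorted key/value table. The class `SortedLookupDemo` and its method names are invented for this sketch, and only a few rows from the slide are loaded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a sorted map is fast to search by key, slow to search by value.
final class SortedLookupDemo {
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("Dianna Lapointe", "4921");
        table.put("Erma Davis", "050194");
        table.put("Garrett Harsh", "140095931");
        table.put("Madlyn Radke", "4921");
        table.put("Wilhelmina Papp", "11221");
        return table;
    }

    // Keys are sorted, so this is a tree search (logarithmic in the table size).
    static String lookupByKey(TreeMap<String, String> table, String key) {
        return table.get(key);
    }

    // Values are not indexed, so every entry has to be examined.
    static List<String> lookupByValue(TreeMap<String, String> table, String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> entry : table.entrySet()) {
            if (entry.getValue().equals(value)) {
                keys.add(entry.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = sampleTable();
        System.out.println(lookupByKey(table, "Garrett Harsh"));
        System.out.println(lookupByValue(table, "4921"));
    }
}
```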

Page 12: An Introduction to Accumulo

MIT Lincoln Lab study: 100 million inserts per second using Accumulo
http://arxiv.org/ftp/arxiv/papers/1406/1406.4923.pdf

Booz Allen Hamilton study:
• 942 tablet servers, 7.56 trillion entries, 408 TB, 26 hours
• 94 MB/sec, 15 TB/hr, 80 million inserts per second
• 11 tablet servers went down with no interruption
• Showed linear scalability for write throughput
• 22,000 queries per second
http://sqrrl.com/media/Accumulo-Benchmark-10312013-1.pdf

Page 13: An Introduction to Accumulo

Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and features are outlined here.

Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, Hypertable, and Cassandra. Accumulo began its development in 2008 and joined the Apache community in 2011.

COPIED AND PASTED FROM ACCUMULO.APACHE.ORG

Page 17: An Introduction to Accumulo

HBase vs. Accumulo
• Slight differences in visibility labels
• Coprocessors vs. Iterators
• Accumulo has faster write throughput*
• HBase’s reads are faster*
• HBase has more ecosystem integration
• BatchScanner
• Accumulo can shift around locality groups after the fact
• Accumulo has been shown to work with no problems at 1,000 nodes (BAH paper). Facebook and others run a “cell” design for HBase. Largest clusters in the hundreds*.

* We believe
Disclaimer: I am biased

Page 19: An Introduction to Accumulo

VISIBILITY LABELS!

(admin & developer) | analyst

Page 20: An Introduction to Accumulo

Column Visibility Syntax

Label              Description
A & B              Both ‘A’ and ‘B’ are required
A | B              Either ‘A’ or ‘B’ is required
A & (C | B)        ‘A’ and ‘C’, or ‘A’ and ‘B’, are required
A | (B & C)        ‘A’, or both ‘B’ and ‘C’, are required
(A | B) & (C & D)  ?
A & (B & (C | D))  ?

Patient has schizophrenia: insurer | (MD & psych)
Patient has stomach ulcers: insurer | doctor
Patient has cavity: insurer | dentist
Patient has consent for general anesthesia: surgeon
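
To make the label syntax concrete, here is a small stand-alone evaluator written for this transcript. It is a sketch under assumptions, not Accumulo's own parser: the class `VisibilitySketch` and the method `canSee` are hypothetical names, labels are limited to letters and digits, and mixing `&` with `|` without parentheses is rejected, which is, as far as I know, also how Accumulo treats such expressions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: evaluates a visibility expression against a reader's labels.
final class VisibilitySketch {
    private final String expr;
    private final Set<String> auths;
    private int pos = 0;

    private VisibilitySketch(String expr, Set<String> auths) {
        this.expr = expr.replace(" ", "");
        this.auths = auths;
    }

    /** True if a reader holding the given labels may see a cell with this expression. */
    static boolean canSee(String expression, String... authorizations) {
        VisibilitySketch parser =
            new VisibilitySketch(expression, new HashSet<>(Arrays.asList(authorizations)));
        boolean result = parser.parseExpression();
        if (parser.pos != parser.expr.length()) {
            throw new IllegalArgumentException("unexpected input at position " + parser.pos);
        }
        return result;
    }

    // expression := term ('&' term)*  or  term ('|' term)*
    private boolean parseExpression() {
        boolean result = parseTerm();
        char op = 0;
        while (pos < expr.length() && (expr.charAt(pos) == '&' || expr.charAt(pos) == '|')) {
            char next = expr.charAt(pos++);
            if (op != 0 && op != next) {
                throw new IllegalArgumentException("mixing & and | requires parentheses");
            }
            op = next;
            boolean rhs = parseTerm(); // always parsed, so the position stays correct
            result = (op == '&') ? (result && rhs) : (result || rhs);
        }
        return result;
    }

    // term := label  or  '(' expression ')'
    private boolean parseTerm() {
        if (pos < expr.length() && expr.charAt(pos) == '(') {
            pos++;
            boolean inner = parseExpression();
            if (pos >= expr.length() || expr.charAt(pos) != ')') {
                throw new IllegalArgumentException("missing closing parenthesis");
            }
            pos++;
            return inner;
        }
        int start = pos;
        while (pos < expr.length() && Character.isLetterOrDigit(expr.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("label expected at position " + start);
        }
        return auths.contains(expr.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(canSee("(admin & developer) | analyst", "analyst"));
        System.out.println(canSee("(admin & developer) | analyst", "admin"));
    }
}
```

In this sketch a reader holding only `analyst` passes `(admin & developer) | analyst`, while a reader holding only `admin` does not.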

Page 21: An Introduction to Accumulo

ITERATORS!

Page 23: An Introduction to Accumulo

More cool features
• Constraints: user-defined Java functions that allow or prevent new writes based on a condition
• Large rows: no limit on data stored in a row
• Multiple masters & FATE: able to execute table operations in a fault-tolerant manner
• MapReduce InputFormats
• Bulk import utilities: write directly to Accumulo file formats
• Batch scanner: client scans multiple ranges at once
• Batch writer: client buffers and organizes data before writing in parallel

Page 25: An Introduction to Accumulo

More cool features
• Thrift proxy: access Accumulo through Ruby, Python, …
• Monitor page: shows performance, status, errors, more
• Locality groups: group column families together on disk for performance tuning (changeable later)
• On-HDFS at rest encryption (work in progress)
• Table import and export

Page 27: An Introduction to Accumulo

Scalability & Performance
• Multiple HDFS volumes: Accumulo can use multiple NameNodes to store its data
• Master stores metadata in an Accumulo table
• Native in-memory map: data is first written into a buffer written in C++, outside of Java
• Relative encoding: consecutive keys with the same values are flagged instead of rewritten
• Scan pipelines: stages of the read path are parallelized into separate threads
• Caching: data recently scanned is cached

Page 28: An Introduction to Accumulo

HOW IT WORKS

Page 29: An Introduction to Accumulo

Data Model

KEY = ROW ID + COLUMN (FAMILY, QUALIFIER, VISIBILITY) + TIMESTAMP
VALUE

Row ID  Family   Qualifier  Visibility       Timestamp  Value
derek   …        …          …                …          …
don     contact  email      admin | private  11905014   [email protected]
don     contact  email      admin | private  12412412   [email protected]
don     contact  email      public           12412412   dm…@cl....com
don     contact  twitter    public           12423523   @donaldpminer
don     info     height     public           12314514   5’ 9”
don     info     SSN        private          12314514   123-45-6789
erica   …        …          …                …          …

Page 30: An Introduction to Accumulo

Data Model

Row ID  Family   Qualifier  Visibility        Timestamp  Value
derek   …        …          …                 …          …
don     contact  email      admin | private   11905014   [email protected]
don     contact  email      admin | private   12412412   [email protected]
don     contact  email      public            12412412   dm…@cl....com
don     contact  twitter    public | private  12423523   @donaldpminer
don     info     height     public | private  12314514   5’ 9”
don     info     SSN        private           12314514   123-45-6789
erica   …        …          …                 …          …

Name   email         twitter        picture     height  SSN
derek  de…@ad….com                  9efe23aa…   6’2”
don    dm…@cl….com   @donaldpminer              5’ 9”
erica                @erica         aef319eaf…

Page 31: An Introduction to Accumulo

Lookup key

Page 32: An Introduction to Accumulo

Collection of data that is kept together

Page 33: An Introduction to Accumulo

What the data is

Page 34: An Introduction to Accumulo

Who can see the data

Page 35: An Introduction to Accumulo

When the data was created

Page 36: An Introduction to Accumulo

UNIQUENESS

Page 37: An Introduction to Accumulo

SORTED

Page 38: An Introduction to Accumulo

Some piece of information

Page 39: An Introduction to Accumulo

Row ID Family Qualifier Visibility Timestamp Value

don info picture public 13119103 dd3ae1d3b951a33f…

Writing data into Accumulo

Page 40: An Introduction to Accumulo

Row ID Family Qualifier Visibility Timestamp Value

don info picture public 13119103 dd3ae1d3b951a33f…

Writing data into Accumulo

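import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;
import org.apache.hadoop.io.Text;

// Build the parts of the key and the value
Text rowID = new Text("don");
Text colFam = new Text("info");
Text colQual = new Text("picture");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();
Value value = new Value(MyPictureObj.getBytes());

// A Mutation holds the changes to a single row
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);

// conn is an existing Connector; the BatchWriter buffers mutations before sending them
BatchWriterConfig config = new BatchWriterConfig();
BatchWriter writer = conn.createBatchWriter("usertable", config);

writer.add(mutation);
writer.close();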

Page 41: An Introduction to Accumulo

Row ID Family Qualifier Visibility Timestamp Value

derek … … … … …

don contact email admin | private 11905014 [email protected]

don contact email admin | private 12412412 [email protected]

don contact email public 12412412 dm…@cl....com

don contact twitter public 12423523 @donaldpminer

don info height public 12314514 5’ 9”

don info picture public 13119103 dd3ae1d3b951a33f…

don info SSN private 12314514 123-45-6789

erica … … … … …

Row ID Family Qualifier Visibility Timestamp Value

don info picture public 13119103 dd3ae1d3b951a33f…

Writing data into Accumulo

Page 42: An Introduction to Accumulo

Writing data into Accumulo

New Record

Page 43: An Introduction to Accumulo

Writing data into Accumulo

New Record
Write Ahead Log (WAL): append
MemTable: sorted

Page 46: An Introduction to Accumulo

Writing data into Accumulo

Minor Compaction: MemTable flushed to a sorted RFile (minc)

Page 48: An Introduction to Accumulo

Writing data into Accumulo

Minor Compaction: repeated flushes accumulate RFiles (minc), three shown

Page 49: An Introduction to Accumulo

Writing data into Accumulo

Major Compaction: RFile (minc) files merged into one sorted RFile (majc)
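
The write path on these slides can be imitated in a few lines of plain Java. This is a toy model under stated assumptions, not how Accumulo is implemented: `WritePathSketch` and its members are hypothetical, "files" are in-memory maps, and each key holds a single value (real Accumulo keys also carry a timestamp and may keep several versions).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of: append to WAL, insert into MemTable, minor and major compaction.
final class WritePathSketch {
    final List<String> writeAheadLog = new ArrayList<>();           // append-only, arrival order
    TreeMap<String, String> memTable = new TreeMap<>();             // sorted, in memory
    final List<TreeMap<String, String>> rFiles = new ArrayList<>(); // sorted "files", oldest first

    void write(String key, String value) {
        writeAheadLog.add(key + "=" + value); // durability first
        memTable.put(key, value);             // then the sorted in-memory map
    }

    // Minor compaction: flush the MemTable as one new sorted file.
    void minorCompaction() {
        if (memTable.isEmpty()) {
            return;
        }
        rFiles.add(memTable);
        memTable = new TreeMap<>();
        writeAheadLog.clear(); // the flushed data no longer needs the log
    }

    // Major compaction: merge all files into one sorted file.
    void majorCompaction() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : rFiles) {
            merged.putAll(file); // oldest first, so newer values overwrite older ones
        }
        rFiles.clear();
        if (!merged.isEmpty()) {
            rFiles.add(merged);
        }
    }

    // A read consults the MemTable first, then the files from newest to oldest.
    String read(String key) {
        if (memTable.containsKey(key)) {
            return memTable.get(key);
        }
        for (int i = rFiles.size() - 1; i >= 0; i--) {
            String value = rFiles.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

The sketch also hints at why major compaction helps reads: fewer files means fewer places to look for a key.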

Page 50: An Introduction to Accumulo

Row ID Family Qualifier Visibility Timestamp Value

derek … … … … …

don contact email admin | private 11905014 [email protected]

don contact email admin | private 12412412 [email protected]

don contact email public 12412412 dm…@cl....com

don contact twitter public 12423523 @donaldpminer

don info height public 12314514 5’ 9”

don info picture public 13119103 dd3ae1d3b951a33f…

don info SSN private 12314514 123-45-6789

erica … … … … …

Range Family Visibilities

don-don info public

Reading data

Page 51: An Introduction to Accumulo

Range Family Visibilities

don-don info public

Reading data

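import java.util.Map.Entry;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.io.Text;

// conn is an existing Connector; only cells visible to "public" are returned
Authorizations auths = new Authorizations("public");
Scanner scan = conn.createScanner("usertable", auths);
scan.setRange(new Range("don", "don"));
scan.fetchColumnFamily(new Text("info"));
for (Entry<Key,Value> entry : scan) {
    String row = entry.getKey().getRow().toString();
    Value value = entry.getValue();
}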

Page 52: An Introduction to Accumulo

Reading data

MemTable RFile(minc)

RFile(minc)

RFile(minc)

RFile(majc)

Range Family Visibilities

don-don info public

Tablet: c - f

Page 53: An Introduction to Accumulo

Range Family Visibilities

don-don info public, user, tech

Reading data

Page 54: An Introduction to Accumulo

Range Visibilities

don-don public, user, tech

Reading data Scan

Page 55: An Introduction to Accumulo

Range Visibilities

d-e public, user, tech

Reading data Scan

Page 56: An Introduction to Accumulo

Iterators
• Iterators run on the tablet server side at these times:
  1. Scan Time
  2. Minor Compaction
  3. Major Compaction
• Multiple iterators are included with Accumulo
• Custom iterators can be created using the Iterator API

Page 57: An Introduction to Accumulo

Scan Time Iterator

Page 58: An Introduction to Accumulo

Minor Compaction Iterator

Page 59: An Introduction to Accumulo

Major Compaction Iterator

Page 60: An Introduction to Accumulo

Age-Off Iterator

Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      score             public             1005       24
bob     attribute      score             public             1004       55
bob     attribute      score             public             1003       71
bob     attribute      score             public             1002       66
bob     attribute      score             public             1001       39
bob     attribute      score             public             1000       33

Current Time: 1102

Entries < 100s old

Entries > 100s old

Scan time: server side filtering
Major compaction time: age off
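
The age-off idea can be sketched as a simple filter over the slide's rows. This is an illustration only and does not use Accumulo's iterator API: `AgeOffSketch`, `Cell`, and `ageOff` are hypothetical names, and treating an entry that is exactly 100 seconds old as still fresh is an assumption, since the slide does not say which side the boundary falls on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: keep only entries whose age does not exceed a limit.
final class AgeOffSketch {
    /** A simplified cell holding only what the filter needs. */
    static final class Cell {
        final String row;
        final long timestamp;
        final String value;

        Cell(String row, long timestamp, String value) {
            this.row = row;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    static List<Cell> ageOff(List<Cell> cells, long currentTime, long maxAge) {
        List<Cell> kept = new ArrayList<>();
        for (Cell cell : cells) {
            if (currentTime - cell.timestamp <= maxAge) { // boundary choice is an assumption
                kept.add(cell);
            }
        }
        return kept;
    }

    // The six "bob attribute score" entries from the slide.
    static List<Cell> slideCells() {
        return Arrays.asList(
            new Cell("bob", 1005, "24"), new Cell("bob", 1004, "55"),
            new Cell("bob", 1003, "71"), new Cell("bob", 1002, "66"),
            new Cell("bob", 1001, "39"), new Cell("bob", 1000, "33"));
    }

    public static void main(String[] args) {
        for (Cell cell : ageOff(slideCells(), 1102, 100)) {
            System.out.println(cell.timestamp + " " + cell.value);
        }
    }
}
```

Applied at scan time, a filter like this only hides old entries from the reader; applied during a major compaction, the old entries are not written to the new file at all.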

Page 61: An Introduction to Accumulo

Combiner Iterators

Apply a function to all available versions of a particular key

Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      score             public             1005       33
bob     attribute      score             public             1004       65
bob     attribute      score             public             1003       71
bob     attribute      score             public             1002       59
bob     attribute      score             public             1001       57
bob     attribute      score             public             1000       51

MAX 71

Scan time: server side combining
Minor & Major compaction time: consolidation
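
A combiner reduces all versions of one key to a single value. The sketch below shows that reduction in plain Java with the slide's numbers; it is not Accumulo's Combiner class, and `CombinerSketch`, `combine`, and the string form of the key are assumptions made for the example.

```java
import java.util.TreeMap;
import java.util.function.LongBinaryOperator;

// Hypothetical sketch: fold every version of a key into one value with a given function.
final class CombinerSketch {
    static TreeMap<String, Long> combine(String[] keys, long[] values, LongBinaryOperator fn) {
        TreeMap<String, Long> combined = new TreeMap<>();
        for (int i = 0; i < keys.length; i++) {
            // The first value for a key is stored as-is; later ones are folded in with fn.
            combined.merge(keys[i], values[i], (a, b) -> fn.applyAsLong(a, b));
        }
        return combined;
    }

    public static void main(String[] args) {
        String key = "bob attribute:score public"; // key without the timestamp
        String[] keys = {key, key, key, key, key, key};
        long[] values = {33, 65, 71, 59, 57, 51};
        System.out.println(combine(keys, values, Math::max)); // MAX, as on the slide
        System.out.println(combine(keys, values, Long::sum)); // a summing variant
    }
}
```

Swapping `Math::max` for `Long::sum` turns the same loop into a running counter, which is a common reason to reach for combiners.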

Page 62: An Introduction to Accumulo

USE CASES

Page 63: An Introduction to Accumulo

Basic Structured Data

Row ID  Column Family  Column Qualifier  Column Visibility  Timestamp  Value
bob     attribute      surname           public             Jul 2013   doe
bob     attribute      height            public             Jun 2012   5’11”
bob     insurance      dental            private            Sep 2009   MetLife
jane    attribute      bloodType         public             Jul 2011   ab-
jane    attribute      surname           public             Aug 2013   doe
jane    contact        cellPhone         public             Dec 2010   (808) 345-9876
jane    insurance      vision            private            Jan 2008   VSP
john    allergy        major             private            Feb 1988   amoxicillin
john    attribute      weight            public             Sep 2013   180
john    contact        homeAddr          public             Mar 2003   34 Baker LN

Page 64: An Introduction to Accumulo

Indexing Everything

Event Table
Row ID  Column Fam  Column Qual         Visibility  Time  value

Index Table
index   Column Fam  Column Qual:Row ID  Visibility  Time  -
to      Column Fam  Column Qual:Row ID  Visibility  Time  -
values  Column Fam  Column Qual:Row ID  Visibility  Time  -

Page 65: An Introduction to Accumulo

Index Table

Row ID          Column Family  Column Qualifier  Column Visibility  Timestamp  Value
(808) 345-9876  contact        cellPhone:jane    public             Dec 2010   -
180             attribute      weight:john       public             Sep 2013   -
34 Baker LN     contact        homeAddr:john     public             Mar 2003   -
5’11”           attribute      height:bob        public             Jun 2012   -
MetLife         insurance      dental:bob        private            Sep 2009   -
VSP             insurance      vision:jane       private            Jan 2008   -
ab-             attribute      bloodType:jane    public             Jul 2011   -
amoxicillin     allergy        major:john        private            Feb 1988   -
doe             attribute      surname:bob       public             Jul 2013   -
doe             attribute      surname:jane      public             Aug 2013   -
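
Building the index table from the event table amounts to swapping the value into the row position. A minimal sketch in plain Java, with sorted maps standing in for tables; `IndexTableSketch`, its methods, the NUL separator, and the omission of visibility and timestamp are all assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch: derive index entries (value -> family, qualifier:row) from event entries.
final class IndexTableSketch {
    private static final char SEP = '\0'; // separates key parts; sorts before printable characters

    // Each event is {row, family, qualifier, value}.
    static TreeMap<String, String> buildIndex(String[][] events) {
        TreeMap<String, String> index = new TreeMap<>();
        for (String[] e : events) {
            String indexKey = e[3] + SEP + e[1] + SEP + e[2] + ":" + e[0];
            index.put(indexKey, ""); // the index entry carries no value
        }
        return index;
    }

    // All original row IDs that hold the given value, found with one range scan.
    static List<String> rowsWithValue(TreeMap<String, String> index, String value) {
        List<String> rows = new ArrayList<>();
        String from = value + SEP;
        String to = value + (char) (SEP + 1);
        for (String key : index.subMap(from, to).keySet()) {
            rows.add(key.substring(key.lastIndexOf(':') + 1));
        }
        return rows;
    }

    static String[][] slideEvents() {
        return new String[][] {
            {"bob", "attribute", "surname", "doe"},
            {"bob", "insurance", "dental", "MetLife"},
            {"jane", "attribute", "surname", "doe"},
            {"john", "allergy", "major", "amoxicillin"},
        };
    }

    public static void main(String[] args) {
        TreeMap<String, String> index = buildIndex(slideEvents());
        System.out.println(rowsWithValue(index, "doe"));
    }
}
```

Because the index is sorted by value, what used to be a slow lookup by value becomes a fast lookup by key.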

Page 66: An Introduction to Accumulo

Data Lake

PATIENTS MEDICINES DOCTORS

INDEX

Page 67: An Introduction to Accumulo

Data Lake

PATIENTS MEDICINES DOCTORS

INDEX

Tell me everything you know

of amoxicillin

amoxicillin

Page 68: An Introduction to Accumulo

Data Lake

PATIENTS DISEASES DOCTORS

INDEX

amoxicillin

bob:allergy:amoxicillin
larry:takes:amoxicillin
Stomach ulcer:treatment:amoxicillin
smith:prescribed:amoxicillin
Infection:treatment:amoxicillin
Diarrhea:side effect:amoxicillin

Page 69: An Introduction to Accumulo

Graphs

[Figure: directed graph with nodes a, b, c, d, e]

Adjacency matrix (columns: Start Nodes, rows: End Nodes)

    a  b  c  d  e
a   -     1
b   1  -
c         -  1
d   1     1  -  1
e               -

Row ID  Column Family  Column Qualifier  Value
a       edge           b                 1
a       edge           d                 1
c       edge           a                 1
c       edge           d                 1
d       edge           c                 1
e       edge           d                 1
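
Stored this way, finding a node's outgoing edges is a scan over one row. The following plain-Java sketch mirrors that with a sorted set; `GraphTableSketch` and its methods are hypothetical, the column family `edge` is left implicit, and the breadth-first walk is only one example of what can be layered on top.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: an edge table as sorted "start<NUL>end" keys.
final class GraphTableSketch {
    private final TreeSet<String> edgeKeys = new TreeSet<>();

    void addEdge(String start, String end) {
        edgeKeys.add(start + '\0' + end);
    }

    // Outgoing neighbors: a range scan over every key in the start node's row.
    List<String> neighbors(String start) {
        List<String> out = new ArrayList<>();
        for (String key : edgeKeys.subSet(start + '\0', start + '\u0001')) {
            out.add(key.substring(start.length() + 1));
        }
        return out;
    }

    // Breadth-first traversal built from repeated row scans.
    List<String> breadthFirst(String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            for (String next : neighbors(queue.remove())) {
                if (seen.add(next)) {
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    // The six edges from the slide's table.
    static GraphTableSketch slideGraph() {
        GraphTableSketch graph = new GraphTableSketch();
        graph.addEdge("a", "b");
        graph.addEdge("a", "d");
        graph.addEdge("c", "a");
        graph.addEdge("c", "d");
        graph.addEdge("d", "c");
        graph.addEdge("e", "d");
        return graph;
    }

    public static void main(String[] args) {
        GraphTableSketch graph = slideGraph();
        System.out.println(graph.neighbors("a"));
        System.out.println(graph.breadthFirst("e"));
    }
}
```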

Page 70: An Introduction to Accumulo

Term-Partitioned Index

Query: “baseball” AND “football” AND “soccer”

Tablet Server 1 (knows about the term “baseball”)

Row ID    Column Family  Value
baseball  document       docid_3
baseball  document       docid_2
bat       document       docid_2

RESULTS: [docid_2, docid_3]

Tablet Server 2 (knows about the term “football”)

Row ID    Column Family  Value
football  document       docid_1
football  document       docid_3
glove     document       docid_1

RESULTS: [docid_1, docid_3]

Tablet Server 3 (knows about the term “soccer”)

Row ID    Column Family  Value
nba       document       docid_1
shoes     document       docid_1
soccer    document       docid_3

RESULTS: [docid_3]

Client: client-side set intersection of [docid_2, docid_3], [docid_1, docid_3], [docid_3]
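
The client-side step is an ordinary set intersection over the per-term results. A compact sketch in plain Java; `TermQuerySketch` and its methods are hypothetical, and a single in-memory map stands in for the three tablet servers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: look up each term separately, then intersect on the client.
final class TermQuerySketch {
    private final Map<String, Set<String>> postings = new HashMap<>();

    void index(String term, String docId) {
        postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // What one tablet server would return for a single term.
    Set<String> lookup(String term) {
        return postings.getOrDefault(term, Collections.<String>emptySet());
    }

    // Documents that contain every term.
    Set<String> queryAll(String... terms) {
        if (terms.length == 0) {
            return new TreeSet<>();
        }
        Set<String> result = new TreeSet<>(lookup(terms[0]));
        for (int i = 1; i < terms.length; i++) {
            result.retainAll(lookup(terms[i]));
        }
        return result;
    }

    // The entries from the slide's three tablet servers.
    static TermQuerySketch slideIndex() {
        TermQuerySketch idx = new TermQuerySketch();
        idx.index("baseball", "docid_3");
        idx.index("baseball", "docid_2");
        idx.index("bat", "docid_2");
        idx.index("football", "docid_1");
        idx.index("football", "docid_3");
        idx.index("glove", "docid_1");
        idx.index("nba", "docid_1");
        idx.index("shoes", "docid_1");
        idx.index("soccer", "docid_3");
        return idx;
    }

    public static void main(String[] args) {
        System.out.println(slideIndex().queryAll("baseball", "football", "soccer"));
    }
}
```

The trade-off of a term-partitioned index shows up here: each term lookup hits one server, but every partial result has to travel to the client before it can be intersected.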

Page 71: An Introduction to Accumulo

Geospatial Indexing: Z-Order Curve

33.333W, 55.555N = 3535.353535
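
The slide's example interleaves the decimal digits of the two coordinates. A minimal sketch of that digit interleaving in Java; `ZOrderSketch` and `interleave` are hypothetical names, and the sketch assumes both coordinates are already formatted to the same width with aligned decimal points, with sign or hemisphere handled elsewhere.

```java
// Hypothetical sketch: interleave the digits of two equally formatted coordinates.
final class ZOrderSketch {
    static String interleave(String first, String second) {
        if (first.length() != second.length()) {
            throw new IllegalArgumentException("coordinates must have the same width");
        }
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < first.length(); i++) {
            char a = first.charAt(i);
            char b = second.charAt(i);
            if (a == '.' || b == '.') {
                if (a != b) {
                    throw new IllegalArgumentException("decimal points must align");
                }
                key.append('.'); // keep one decimal point in the combined key
            } else {
                key.append(a).append(b); // one digit from each coordinate in turn
            }
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // 33.333W, 55.555N from the slide
        System.out.println(interleave("33.333", "55.555"));
    }
}
```

Since the leading digits of both coordinates end up at the front of the key, points that are close in both dimensions tend to share a key prefix and therefore sort near each other.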

Page 72: An Introduction to Accumulo

WHERE TO GO FROM HERE

Page 73: An Introduction to Accumulo

Resources

Apache Accumulo website
• accumulo.apache.org

Accumulo Summit 2014
• accumulosummit.com
• slideshare.net/AccumuloSummit

Multi-day in-person training
• UMBC Training Centers
• ClearEdge IT Solutions
• Sqrrl

Page 74: An Introduction to Accumulo

Find a job
