Introduction to Accumulo

Mario [email protected]

March 7, 2016

History

To accommodate their needs for analysis of large amounts of data on commodity hardware, Google developed three main distributed systems:

• GFS: distributed filesystem
• MapReduce: distributed data processing
• BigTable: distributed storage system for structured data

Accumulo is an open-source implementation of BigTable

Distributed Structured Data

• structured data should be
  – distributed for parallel processing
  – indexed for fast retrieval (“structured” means that it has some kind of “primary key”)
  – tabular for easy processing of complex data, each row can potentially have many columns

• databases offer indexes and tables but don’t scale without significant effort

• key-value stores can easily be distributed but have limited index support over keys and don’t have support for tabular format out of the box

Accumulo

• Accumulo is a key-value store with support for tabular data
  – keys are column identifiers, i.e. they uniquely identify a column of a row
  – a row is composed of multiple key-value pairs grouped by the prefix of the key, the row id

Example

EMAIL               NAME     LASTNAME   COMPANY
[email protected]   Olivia   Smith      Winsystems
[email protected]   Emily    Brown      Jones Inc.

⇓

KEY (composed of row id and column id)   VALUE
[email protected]                        Olivia
[email protected]                        Smith
[email protected]                        Winsystems
[email protected]                        Emily
[email protected]                        Brown
[email protected]                        Jones Inc.
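
The table-to-key-value mapping above can be sketched in a few lines of Python. This is only an illustration of the idea, not Accumulo code: the row ids `row-1` and `row-2` are placeholders standing in for the e-mail addresses, and the helper names `flatten` and `row` are made up for this example.

```python
# Flatten tabular rows into (key, value) entries, where each key is the
# pair (row id, column id). Row ids here are placeholders, not real data.
rows = {
    "row-1": {"NAME": "Olivia", "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    "row-2": {"NAME": "Emily", "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
}


def flatten(table):
    """Return a sorted list of ((row_id, column_id), value) entries."""
    entries = []
    for row_id, columns in table.items():
        for column_id, value in columns.items():
            entries.append(((row_id, column_id), value))
    return sorted(entries)


def row(entries, row_id):
    """Rebuild one row by collecting every entry that shares the row id."""
    return {col: val for (rid, col), val in entries if rid == row_id}


entries = flatten(rows)
for key, value in entries:
    print(key, value)
```

Because the entries are sorted by key, all columns of the same row end up next to each other, which is what makes "a row" a meaningful unit in a key-value store.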

Composite Keys

Keys in Accumulo are composite and have the following components

• row id: the row the key belongs to
• column family: the “column group” the key belongs to
• column qualifier: the column id
• column visibility: who can access this column
• timestamp: the version of the key

A single key-value is stored as

KEY                                                  VALUE
row id   column                              timestamp
         family   qualifier   visibility
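
A minimal sketch of the composite key as a Python data structure, to make the ordering concrete. It assumes the ordering described in the Accumulo documentation: ascending on row id, family, qualifier and visibility, and descending on timestamp, so the newest version of a cell comes first. The `Key` class and `sort_key` function are illustrative names, not the Accumulo API, and Python compares strings by code point rather than by raw bytes, which is equivalent only for ASCII data such as this.

```python
from typing import NamedTuple


class Key(NamedTuple):
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(k: Key):
    # Ascending on the four string components; the timestamp is negated
    # so that the most recent version of a cell sorts first.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


# Made-up entries: two versions of the same cell plus two other cells.
entries = [
    (Key("u1", "entities", "urls", "public", 10), "http://example.org"),
    (Key("u1", "coordinate", "", "public", 10), "-74.0/40.7"),
    (Key("u1", "entities", "urls", "public", 20), "http://example.com"),
    (Key("u0", "text", "", "public", 5), "hello"),
]

ordered = sorted(entries, key=lambda e: sort_key(e[0]))
for key, value in ordered:
    print(key, value)
```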

Accumulo features

• range queries: keys are stored in lexicographical order, which makes it possible to query “semantically close” data
  – e.g. temporal data can be stored such that aggregation of close days is local and fast

• fast: with proper key schemas a query can take milliseconds

• scalable: designed to store huge amounts of data over multiple tables

• built-in cache for recently queried data

• many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters) . . .
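
Why sorted keys make range queries cheap can be shown with a small in-memory sketch. It is not Accumulo code: a sorted Python list stands in for a table, the `scan` helper is a hypothetical name, and the daily counts are made up. Keys are ISO dates, so lexicographic order coincides with chronological order.

```python
import bisect

# Sorted (key, value) entries standing in for a table of daily counts.
entries = sorted([
    ("2016-03-05", 12),
    ("2016-03-06", 7),
    ("2016-03-07", 30),
    ("2016-02-28", 4),
    ("2016-04-01", 9),
])
keys = [k for k, _ in entries]


def scan(start, end):
    """Return entries with start <= key < end using two binary searches."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return entries[lo:hi]


# Aggregating close days touches one contiguous slice of the data.
first_week_of_march = scan("2016-03-01", "2016-03-08")
total = sum(value for _, value in first_week_of_march)
print(first_week_of_march, total)
```

The point of the sketch is that the matching entries form one contiguous block, so a query only has to locate the start of the range and read forward.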

Example

We want to store and analyze tweets from all around the world.

Example: Tweets analysis

• A tweet has the following (simplified) fields
  – coordinate: geospatial information composed of longitude and latitude
  – created_at: UTC time of the tweet
  – id: tweet unique identifier
  – user information, such as
      • user.id: unique identifier of the user
      • user.screen_name: user name
      • . . .
  – entities such as hashtags, urls . . .
  – text: tweet content
  – . . .

• how do we store this data in Accumulo?

• there is no single way to do it, it depends on the query

• two good practices
  – work with denormalized data
  – specialize tables for each kind of query
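
As a rough sketch of what these two practices mean together: the same tweet is written, in full, to one table per kind of query. Everything below is hypothetical and only for illustration: the table names `by_user` and `by_hashtag`, the `|` separator, the simplified tweet dictionary and the two helper functions are assumptions, not part of the original slides or of Accumulo.

```python
# A simplified, made-up tweet.
tweet = {
    "id": 1001,
    "user": {"id": 42},
    "created_at": "2016-03-07T18:30:00Z",
    "text": "Trying out #accumulo for #bigdata",
    "hashtags": ["accumulo", "bigdata"],
}


def by_user_entries(t):
    """One entry, keyed so that all tweets of a user are adjacent."""
    row = f"{t['user']['id']}|{t['created_at']}|{t['id']}"
    return [(row, t["text"])]


def by_hashtag_entries(t):
    """One entry per hashtag, each carrying its own copy of the text."""
    return [
        (f"{tag}|{t['created_at']}|{t['id']}", t["text"])
        for tag in t["hashtags"]
    ]


# The tweet text is duplicated on purpose (denormalization), so each
# table can answer its query without looking anything up elsewhere.
tables = {
    "by_user": by_user_entries(tweet),
    "by_hashtag": by_hashtag_entries(tweet),
}
for name, entries in tables.items():
    print(name, entries)
```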

Example: Twitter User Timeline

• schema

KEY                                                                             VALUE
row id                      column                                  timestamp
                            family         qualifier    visibility
user.id + created_at + id   "coordinate"                                        lon/lat
                            "entities"     "hashtags"                           hashtags
                            "entities"     "urls"                               urls
                            "text"                                              text

• Easy to process the entire timeline or a time interval for the same user

• Not good for other kinds of analysis
  – find all the tweets with a given hashtag
  – find all the tweets in New York
  – . . .
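
The timeline schema can be sketched as follows, under stated assumptions. The slides only say that the row id is "user.id + created_at + id"; the fixed-width, zero-padded encoding with a `|` separator used here is one hypothetical way to make string order match (user, time, tweet) order. A sorted list stands in for the table, and only the text value is stored to keep the example short.

```python
import bisect
from datetime import datetime, timezone


def utc(*args):
    """Shorthand for a timezone-aware UTC datetime."""
    return datetime(*args, tzinfo=timezone.utc)


def row_id(user_id, created_at, tweet_id):
    """Hypothetical encoding: fixed-width fields keep string order
    consistent with (user id, creation time, tweet id) order."""
    ts = created_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{user_id:020d}|{ts}|{tweet_id:020d}"


# Made-up tweets for two users.
table = sorted([
    (row_id(42, utc(2016, 3, 1, 9, 0, 0), 1001), "first"),
    (row_id(42, utc(2016, 3, 7, 18, 30, 0), 1002), "second"),
    (row_id(42, utc(2016, 4, 2, 8, 0, 0), 1003), "third"),
    (row_id(7, utc(2016, 3, 5, 12, 0, 0), 1004), "other user"),
])
row_ids = [r for r, _ in table]


def timeline(user_id, start, end):
    """Tweets of one user with start <= created_at < end, read as one
    contiguous slice of the sorted table."""
    lo = bisect.bisect_left(row_ids, row_id(user_id, start, 0))
    hi = bisect.bisect_left(row_ids, row_id(user_id, end, 0))
    return [text for _, text in table[lo:hi]]


march = timeline(42, utc(2016, 3, 1), utc(2016, 4, 1))
print(march)
```

The same sketch also shows the limitation noted above: finding all tweets with a given hashtag would require reading every row, because hashtags are not part of the row id.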

Summary

• Accumulo is great for storing large amounts of structured data

• Accumulo is good for interactive queries as well as for batch queries

• Accumulo is a low-level system
  – NoSQL (that’s not good!), which means no high-level language to query the data
  – a lot of flexibility, which can easily backfire

Thank you

Questions?
