Accumulo Summit 2014: Accumulo backed Tinkerpop Implementation
Introduction to Accumulo
Uploaded by: mario-pastorelli
Category: Data & Analytics
History
To accommodate their needs for analysis of large amounts of data on commodity hardware, Google developed three main distributed systems:
- GFS: distributed filesystem
- MapReduce: distributed data processing
- BigTable: distributed storage system for structured data
Accumulo is an open-source implementation of BigTable
Distributed Structured Data
- structured data should be
  - distributed for parallel processing
  - indexed for fast retrieval (“structured” means that it has some kind of “primary key”)
  - tabular for easy processing of complex data; each row can potentially have many columns
- databases offer indexes and tables but don’t scale without significant effort
- key-value stores can easily be distributed but have limited index support over keys and don’t support a tabular format out of the box
Accumulo
- Accumulo is a key-value store with support for tabular data
  - keys are column identifiers, i.e. each key uniquely identifies one column of one row
  - a row is composed of multiple key-value pairs grouped by the prefix of the key, the row id
Example

| EMAIL | NAME | LASTNAME | COMPANY |
| --- | --- | --- | --- |
| [email protected] | Olivia | Smith | Winsystems |
| [email protected] | Emily | Brown | Jones Inc. |

⇓

| KEY: row id | KEY: column id | VALUE |
| --- | --- | --- |
| [email protected] | NAME | Olivia |
| [email protected] | LASTNAME | Smith |
| [email protected] | COMPANY | Winsystems |
| [email protected] | NAME | Emily |
| [email protected] | LASTNAME | Brown |
| [email protected] | COMPANY | Jones Inc. |

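The mapping from a table to key-value pairs can be sketched in a few lines of Python. This is only an illustration of the idea, not Accumulo's API: the e-mail addresses are placeholders, and `to_key_values` and `to_rows` are hypothetical helper names.

```python
from itertools import groupby

# Hypothetical rows; the e-mail addresses are placeholders.
table = [
    {"EMAIL": "olivia@example.com", "NAME": "Olivia", "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    {"EMAIL": "emily@example.com", "NAME": "Emily", "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
]


def to_key_values(rows, row_id_field="EMAIL"):
    """Flatten each row into ((row id, column id), value) pairs, sorted by key."""
    pairs = []
    for row in rows:
        row_id = row[row_id_field]
        for column, value in row.items():
            if column != row_id_field:
                pairs.append(((row_id, column), value))
    return sorted(pairs)


def to_rows(pairs):
    """Rebuild rows by grouping consecutive pairs that share the row-id prefix."""
    return {
        row_id: {key[1]: value for key, value in group}
        for row_id, group in groupby(pairs, key=lambda kv: kv[0][0])
    }


key_values = to_key_values(table)
rows = to_rows(key_values)
```

Because the pairs are sorted by key, all columns of a row sit next to each other, which is what makes grouping by the row-id prefix cheap.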
Composite Keys
Keys in Accumulo are composite and have the following components
- row id: the row to which the key belongs
- column family: the “column group” to which the key belongs
- column qualifier: the column id
- column visibility: who can access this column
- timestamp: the version of the key
A single key-value is stored as

| row id | column family | column qualifier | column visibility | timestamp | value |
| --- | --- | --- | --- | --- | --- |
| KEY | KEY | KEY | KEY | KEY | VALUE |

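A composite key and its ordering can be modelled as below. This is a minimal sketch rather than Accumulo's own `Key` class. It assumes the ordering Accumulo documents for its keys: ascending by row id, family, qualifier and visibility, then descending by timestamp, so the newest version of a column comes first. The sample entries and the `latest` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

    def sort_key(self):
        # Ascending on the string components, newest timestamp first.
        return (self.row, self.family, self.qualifier, self.visibility, -self.timestamp)


# Hypothetical entries: two versions of one column plus another column.
entries = {
    Key("row1", "name", "first", "public", 1): "Olivia",
    Key("row1", "name", "first", "public", 2): "Olivia M.",
    Key("row1", "company", "", "public", 1): "Winsystems",
}

ordered = sorted(entries, key=Key.sort_key)


def latest(entries, row, family, qualifier):
    """Return the newest value stored for one column, or None if absent."""
    for key in sorted(entries, key=Key.sort_key):
        if (key.row, key.family, key.qualifier) == (row, family, qualifier):
            return entries[key]
    return None
```

With this ordering, reading the first entry that matches a column yields its most recent version without scanning the older ones.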
Accumulo features
- range queries: keys are stored in lexicographical order, which allows querying “semantically close” data together
  - e.g. temporal data can be stored such that aggregation over nearby days is local and fast
- fast: with a proper key schema a query can take milliseconds
- scalable: designed to store huge amounts of data across multiple tables
- built-in cache for recently queried data
- many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters) ...
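The range-query idea can be illustrated with a sorted list and two binary searches. This sketch does not use Accumulo. The daily counters are invented sample data, and ISO dates are assumed as keys because their lexicographical order matches chronological order.

```python
from bisect import bisect_left

# Hypothetical daily counters keyed by ISO date, kept sorted like keys in a table.
store = sorted([
    ("2014-01-30", 12),
    ("2014-01-31", 7),
    ("2014-02-01", 5),
    ("2014-02-02", 9),
    ("2014-03-01", 4),
])
keys = [key for key, _ in store]


def scan(start, end):
    """Return the entries with start <= key < end using two binary searches."""
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, end)
    return store[lo:hi]


# Aggregating one month only touches a contiguous slice of the sorted data.
february_total = sum(value for _, value in scan("2014-02-01", "2014-03-01"))
```

The same reasoning explains why the key schema matters: a query is fast when the data it needs is contiguous in key order.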
Example
We want to store and analyze tweets from all around the world.
Example: Tweets analysis
- A tweet has the following (simplified) fields
  - coordinate: geospatial information composed of longitude and latitude
  - created_at: UTC time of the tweet
  - id: unique identifier of the tweet
  - user information, such as
    - user.id: unique identifier of the user
    - user.screen_name: user name
    - ...
  - entities such as hashtags, urls...
  - text: tweet content
  - ...
- how do we store this data in Accumulo?
Example: Tweets analysis
- there is no single way to do it; it depends on the query
- two good practices
  - work with denormalized data
  - specialize tables for each kind of query
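The two practices can be sketched with plain dictionaries standing in for tables. Everything here is an assumption made for illustration: the table names `by_user` and `by_hashtag`, the sample tweet, and the helpers `ingest` and `find_by_hashtag` do not come from the slides or from Accumulo.

```python
from collections import defaultdict

# Hypothetical tweet with only the fields needed for this sketch.
tweet = {
    "id": 1001,
    "user_id": 42,
    "created_at": "2014-06-12T09:30:00Z",
    "text": "Trying #accumulo on #hadoop",
    "hashtags": ["accumulo", "hadoop"],
}

# One dict per table; each table is keyed for exactly one kind of query.
tables = defaultdict(dict)


def ingest(tweet):
    """Write the same tweet text into every specialized table (denormalized)."""
    tables["by_user"][(tweet["user_id"], tweet["created_at"], tweet["id"])] = tweet["text"]
    for tag in tweet["hashtags"]:
        tables["by_hashtag"][(tag, tweet["created_at"], tweet["id"])] = tweet["text"]


def find_by_hashtag(tag):
    """Return the texts of all tweets carrying the given hashtag, oldest first."""
    return [text for key, text in sorted(tables["by_hashtag"].items()) if key[0] == tag]


ingest(tweet)
```

The tweet text is stored three times here. That duplication is the price of answering each query from a single table without joins.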
Example: Twitter User Timeline
- schema

| KEY: row id | KEY: column family | KEY: column qualifier | KEY: column visibility | KEY: timestamp | VALUE |
| --- | --- | --- | --- | --- | --- |
| user.id + created_at + id | "coordinate" | | | | lon/lat |
| | "entities" | "hashtags" | | | hashtags |
| | "entities" | "urls" | | | urls |
| | "text" | | | | text |

- Easy to process the entire timeline or a time interval for the same user
- Not good for other kinds of analysis
  - find all the tweets with a given hashtag
  - find all the tweets in New York
  - ...
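The row id above only supports timeline scans if its parts sort correctly as strings. The sketch below assumes zero-padded numeric ids, ISO 8601 UTC timestamps and an underscore separator. None of these encoding details appear in the slides, and `row_id` and `timeline` are hypothetical helpers operating on an in-memory sorted list rather than on Accumulo.

```python
from bisect import bisect_left


def row_id(user_id, created_at, tweet_id):
    """Build 'user.id + created_at + id' from fixed-width, zero-padded parts."""
    return f"{user_id:020d}_{created_at}_{tweet_id:020d}"


# Hypothetical tweets, kept sorted by row id as they would be in a table.
tweets = sorted([
    (row_id(42, "2014-06-12T09:30:00Z", 1001), "first tweet"),
    (row_id(42, "2014-06-13T18:00:00Z", 1002), "second tweet"),
    (row_id(7, "2014-06-12T10:00:00Z", 1003), "someone else"),
    (row_id(42, "2014-07-01T08:00:00Z", 1004), "later tweet"),
])
row_ids = [rid for rid, _ in tweets]


def timeline(user_id, start, end):
    """Tweets of one user with start <= created_at < end, as one range scan."""
    lo = bisect_left(row_ids, f"{user_id:020d}_{start}")
    hi = bisect_left(row_ids, f"{user_id:020d}_{end}")
    return [text for _, text in tweets[lo:hi]]


june = timeline(42, "2014-06-01T00:00:00Z", "2014-07-01T00:00:00Z")
```

Zero-padding matters because keys compare as strings: without it, user 100 would sort before user 42. A hashtag or location query cannot use this layout, since those values are not a prefix of the row id.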
Summary
- Accumulo is great for storing large amounts of structured data
- Accumulo is good for interactive queries as well as for batch-oriented queries
- Accumulo is a low-level system
  - NoSQL (that’s not good!), which means no high-level language to query the data
  - a lot of flexibility, which can easily backfire
Thank you
Questions?