Introduction to Apache Accumulo
![Page 1: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/1.jpg)
Introduction to Apache Accumulo
Boulder/Denver BigData Meetup - March 21, 2012
Jared Winick
@jaredwinick
![Page 2: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/2.jpg)
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
![Page 3: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/3.jpg)
http://yourmotivational.com/uploads/8604.jpg
![Page 4: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/4.jpg)
Annotation added
Jeff Dean: Designs, Lessons and Advice from Building Large Distributed Systems
http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf
![Page 5: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/5.jpg)
Enables interactive access to…
Trillions of records
Petabytes of indexed data
across 100s-1000s of servers
![Page 6: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/6.jpg)
Short Accumulo History Lesson
http://www.flickr.com/photos/mr_t_in_dc/4249886990/sizes/l/in/photostream/
![Page 7: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/7.jpg)
2006
![Page 8: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/8.jpg)
2008
http://upload.wikimedia.org/wikipedia/commons/8/84/National_Security_Agency_headquarters%2C_Fort_Meade%2C_Maryland.jpg
![Page 9: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/9.jpg)
2011
![Page 10: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/10.jpg)
2012
![Page 11: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/11.jpg)
Uses of BigTable and Kin
(BigTable)
• Google Analytics [1]
• Crawl [1]
• AppEngine Datastore [2]
• Many more [1]
(HBase)
• Messages [3, 4, 6]
• Insights [5, 6]
(Accumulo)
• ???
(Cassandra)
• Rainbird (realtime analytics) [7]
1.) http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
2.) http://code.google.com/appengine/articles/storage_breakdown.html
3.) http://www.facebook.com/note.php?note_id=454991608919
4.) http://mvdirona.com/jrh/TalksAndPapers/KannanMuthukkaruppan_StorageInfraBehindMessages.pdf
5.) http://www.facebook.com/note.php?note_id=10150103900258920
6.) http://borthakur.com/ftp/SIGMODRealtimeHadoopPresentation.pdf
7.) http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
![Page 12: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/12.jpg)
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
![Page 13: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/13.jpg)
Multi-dimension Key
Key
• Row ID
• Column: Family, Qualifier, Visibility
• Timestamp
Value
http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
![Page 14: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/14.jpg)
Keys Sorted Lexicographically
Row ID, Column Family, Column Qualifier, Column Visibility, Timestamp
Everything is a byte[] except the Timestamp which is a long
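The sort order above can be sketched in a few lines of Python. This is only an illustration, not Accumulo code: the `Key` tuple and `sort_key` helper are hypothetical names, and the integer timestamps are made up. One detail not stated on the slide is assumed here: timestamps sort newest-first, so the latest version of a cell comes back first.

```python
from typing import NamedTuple

class Key(NamedTuple):
    # Hypothetical stand-in for an Accumulo key: byte fields plus a long timestamp.
    row: bytes
    family: bytes
    qualifier: bytes
    visibility: bytes
    timestamp: int

def sort_key(k: Key):
    # Byte fields compare lexicographically; the timestamp is negated so that
    # newer versions of the same cell sort first (assumption, see lead-in).
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)

keys = [
    Key(b"Bob", b"properties", b"phone", b"private", 3),
    Key(b"Alice", b"purchases", b"Xbox", b"public", 2),
    Key(b"Alice", b"properties", b"age", b"public", 1),
    Key(b"Alice", b"properties", b"age", b"public", 3),
]
ordered = sorted(keys, key=sort_key)
for k in ordered:
    print(k)
```

All of Alice's entries come before Bob's, and the two versions of Alice's age sit next to each other with the newer one first.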
![Page 15: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/15.jpg)
Physical Layout
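| Row ID | Col Fam | Col Qual | Col Vis | Time | Value |
| --- | --- | --- | --- | --- | --- |
| Alice | properties | age | public | March 2011 | 31 |
| Alice | properties | phone | private | Feb 2011 | 555-1234 |
| Alice | purchases | Xbox | public | Feb 2011 | $299 |
| Bob | properties | phone | private | March 2011 | 555-4321 |
| Bob | purchases | iPhone | public | Feb 2011 | $399 |

Key = Row ID through Time; Value = last column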
![Page 16: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/16.jpg)
Queries
• By exact Key or range of Keys
• Data is always returned in sorted order
Query Requirements Drive Data Model Design
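A rough sketch of why range queries are cheap on sorted keys: a scan is a binary search for the start of the range followed by a sequential read. The in-memory list and the `scan` helper below are illustrative assumptions, not the Accumulo client API, and keys are reduced to (row, family, qualifier) for brevity.

```python
import bisect

# Sorted (row, family, qualifier) -> value entries, standing in for a tablet.
entries = sorted([
    (("Alice", "properties", "age"), "31"),
    (("Alice", "properties", "phone"), "555-1234"),
    (("Alice", "purchases", "Xbox"), "$299"),
    (("Bob", "properties", "phone"), "555-4321"),
    (("Bob", "purchases", "iPhone"), "$399"),
])
keys = [k for k, _ in entries]

def scan(start, end):
    """Return entries with start <= key < end, already in sorted order."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return entries[lo:hi]

# Everything in row "Alice": the end key is the smallest key after that row.
alice = scan(("Alice",), ("Alice\x00",))
# Only Alice's purchases: a narrower range within the same row.
alice_purchases = scan(("Alice", "purchases"), ("Alice", "purchases\x00"))
```

Because only prefixes of the key can be searched this way, the row ID and column layout have to be chosen to match the queries the application needs.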
![Page 17: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/17.jpg)
http://incubator.apache.org/accumulo/user_manual_1.4-incubating/Accumulo_Design.html
![Page 18: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/18.jpg)
Accumulo
• Hadoop HDFS: Storage
• Zookeeper: Configuration/State
• Hadoop MapReduce: Analytics
• Clients: Read/Write
![Page 19: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/19.jpg)
[Diagram] Accumulo: a Master and multiple Tablet Servers; a Table is divided into Tablets, which are hosted by the Tablet Servers
[Diagram] Hadoop HDFS: a Name Node and multiple Data Nodes
![Page 20: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/20.jpg)
Tablet Server Failure
[Diagram] Accumulo: a Master and multiple Tablet Servers hosting the Tablets of a Table
[Diagram] Hadoop HDFS: a Name Node and multiple Data Nodes
1.) Detect Failure
2.) Reassign
![Page 21: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/21.jpg)
Writes
[Diagram] Client writes to a Tablet on an Accumulo Tablet Server; Hadoop HDFS Data Nodes sit underneath
1.) Write-Ahead Log (WAL)
2.) MemTable
![Page 22: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/22.jpg)
Writes
[Diagram] Client writes to a Tablet on an Accumulo Tablet Server; Hadoop HDFS Data Nodes sit underneath
1.) Write-Ahead Log (WAL)
2.) MemTable
3.) File 1
![Page 23: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/23.jpg)
Compactions
Minor: The process of flushing a MemTable of a Tablet to a single file in HDFS
Major: The process of combining multiple files into a single file
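The two compaction types can be sketched with a toy tablet. This is a simplified model under stated assumptions, not Accumulo's implementation: the `Tablet` class is hypothetical, "files" are plain sorted lists rather than HDFS files, and a newer value simply replaces an older one for the same key.

```python
class Tablet:
    def __init__(self):
        self.memtable = {}   # in-memory writes since the last flush
        self.files = []      # immutable sorted "files", oldest first

    def write(self, key, value):
        self.memtable[key] = value

    def minor_compact(self):
        # Flush the MemTable to a single new sorted file.
        if self.memtable:
            self.files.append(sorted(self.memtable.items()))
            self.memtable = {}

    def major_compact(self):
        # Combine all files into one; files are applied oldest-first so that
        # newer values overwrite older ones (assumed versioning policy).
        merged = {}
        for f in self.files:
            merged.update(f)
        self.files = [sorted(merged.items())]

t = Tablet()
t.write("a", "1")
t.write("b", "2")
t.minor_compact()          # files: [[a=1, b=2]]
t.write("a", "3")
t.minor_compact()          # files: [[a=1, b=2], [a=3]]
t.major_compact()          # files: [[a=3, b=2]]
```

Minor compactions keep memory bounded; major compactions keep the number of files a read has to consult small.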
![Page 24: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/24.jpg)
Tablet Splits
• Tablets are split when they reach a max size
• Always split on row boundary
• Master assigns a split Tablet to another Tablet server (no data is moved!)
![Page 25: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/25.jpg)
Reads
[Diagram] Client reads from a Tablet on an Accumulo Tablet Server; the result is a Merge of the Tablet's MemTable and its files
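The read-time merge can be sketched as a k-way merge over sorted sources. The function name and data below are assumptions for illustration: the MemTable is a dict, files are sorted lists ordered oldest-first, and when the same key appears in several sources the newest source wins.

```python
import heapq

def merged_read(memtable, files):
    """Yield (key, value) in key order, preferring the newest source per key."""
    # Age rank 0 is the MemTable (newest), followed by files newest-first.
    sources = [sorted(memtable.items())] + list(reversed(files))
    streams = [
        [(key, age, value) for key, value in src]
        for age, src in enumerate(sources)
    ]
    last = object()  # sentinel that never equals a real key
    for key, _age, value in heapq.merge(*streams):
        if key != last:
            # First hit for this key comes from the newest source holding it.
            yield key, value
            last = key

memtable = {"b": "new"}
files = [[("a", "1"), ("b", "old")], [("c", "3")]]  # oldest file first
result = list(merged_read(memtable, files))
```

No source is rewritten during a read; the merge only walks each sorted stream once, which is why keeping the file count low matters.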
![Page 26: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/26.jpg)
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
![Page 27: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/27.jpg)
http://wiki.eeng.dcu.ie/ee557/287-EE/version/default/part/ImageData/data/server-side_intro.gif
Iterators: Server-side programming
![Page 28: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/28.jpg)
Iterators
Can be run at:
• Scan Time
• Minor Compaction
• Major Compaction
Can do things like:
• Aggregation (Combiners)
• Age-Off
• Filtering (access control)
• Transformation
Push Processing to the Data
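A loose way to picture iterators is as a stack of generators over a sorted stream of entries, each one filtering or rewriting what passes through. The sketch below is not the Accumulo iterator interface; `age_off` and `summing_combiner` are hypothetical helpers, and entries are simplified to (key, timestamp, value) tuples.

```python
def age_off(entries, now, max_age):
    """Drop entries whose timestamp is older than max_age."""
    for key, ts, value in entries:
        if now - ts <= max_age:
            yield key, ts, value

def summing_combiner(entries):
    """Collapse consecutive entries with the same key into one summed entry."""
    current_key, current_ts, total = None, None, 0
    for key, ts, value in entries:
        if key != current_key:
            if current_key is not None:
                yield current_key, current_ts, total
            current_key, current_ts, total = key, ts, 0
        total += value
    if current_key is not None:
        yield current_key, current_ts, total

# Sorted by key, newest timestamp first within a key.
entries = [
    ("breakfast", 100, 2),
    ("breakfast", 90, 3),
    ("breakfast", 10, 5),   # too old, removed by age_off
    ("lunch", 95, 1),
]
result = list(summing_combiner(age_off(entries, now=100, max_age=50)))
```

Running the same stack at compaction time would make the reduction permanent, while running it at scan time leaves the stored data untouched.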
![Page 29: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/29.jpg)
Accumulo /əˈkjuːmjʊˌloʊ/
1. Sorted, distributed key/value store with cell-based access control and customizable server-side processing
![Page 30: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/30.jpg)
Access Control
• Every key-value has a visibility label
• Label is defined with boolean operators
• Label is arbitrary and ad-hoc
• Authorizations presented at scan time
• Data is filtered out automatically by system-level Iterator
Example labels: Public; Private | Admin; Finance | (HR & Manager)
![Page 31: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/31.jpg)
Access Control – Typical Architecture
[Diagram] Web Server, Enterprise Identity Management, Accumulo; Trusted Zone
1.) Pass Credentials
2.) Lookup User
3.) Return Authorizations
4.) Proxy Authorization
5.) Return Visible Data
6.) Return Data
![Page 32: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/32.jpg)
Access Control – Typical Architecture
[Diagram] Bob, Web Server, Enterprise Identity Management, Accumulo; Trusted Zone
1.) PKI Cert
2.) Lookup Bob
3.) Auths: [SECRET, UNCLASSIFIED, PROJECT X, PROJECT Y]
4.) Proxy Bob’s Auths
5.) Return [6,8]
6.) Return [6,8]
Data in Accumulo (visibility label, value):
SECRET&PROJECT X, 6
SECRET&PROJECT Y, 8
SECRET&PROJECT Z, 3
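The filtering in this example can be reproduced with a small label evaluator. This is a sketch, not Accumulo's visibility parser: `is_visible` is a hypothetical helper that supports only `&`, `|` and parentheses, evaluates operators left to right, does no error handling, and (unlike Accumulo's real syntax, as far as I know) accepts spaces inside terms so the slide's labels can be used verbatim.

```python
import re

def is_visible(label, auths):
    """Return True if the authorizations satisfy the visibility label."""
    tokens = [t.strip() for t in re.findall(r"[&|()]|[^&|()]+", label)]
    tokens = [t for t in tokens if t]
    if not tokens:
        return True  # assumption: an empty label is visible to everyone
    pos = 0

    def parse_term():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_expr()
            pos += 1  # skip the closing ")"
            return result
        return tok in auths

    def parse_expr():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            op = tokens[pos]
            pos += 1
            rhs = parse_term()
            result = (result and rhs) if op == "&" else (result or rhs)
        return result

    return parse_expr()

bob_auths = {"SECRET", "UNCLASSIFIED", "PROJECT X", "PROJECT Y"}
cells = [
    ("SECRET&PROJECT X", 6),
    ("SECRET&PROJECT Y", 8),
    ("SECRET&PROJECT Z", 3),
]
visible = [value for label, value in cells if is_visible(label, bob_auths)]
print(visible)
```

Bob lacks PROJECT Z, so only the values 6 and 8 pass the filter, matching steps 5 and 6 on the slide.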
![Page 33: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/33.jpg)
Demo
![Page 34: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/34.jpg)
Application Requirements
Build an application to analyze trends in Twitter messages.
• Query for word/phrase and view real-time activity in a time series graph
• View at different time ranges (1 day, 7 days, 30 days, etc.)
• Allow multiple query terms to compare activity (e.g. Breakfast, Lunch)
• Automatically extract daily trends for the user
![Page 35: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/35.jpg)
Demo Setup/Data
• Twitter Streaming API
• Only messages with US country codes
• 1,2,3-grams built
• Data since Dec 24 – Live
• Running on an average workstation: 1 SATA disk, 6 GB memory
• 72 GB, 2.6 billion entries and counting
![Page 36: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/36.jpg)
![Page 37: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/37.jpg)
Data Model
• Tweets table
– Row ID: n-gram
– Column Family: Date Granularity (DAY, HOUR)
– Column Qual: Date Value
– Value: Count
– SummingCombiner (Iterator) used to update Count

| Row ID | Col Fam | Col Qual | Value |
| --- | --- | --- | --- |
| breakfast | DAY | 20120318 | 31 |
| breakfast | DAY | 20120319 | 56 |
| … | … | … | … |
| lunch | HOUR | 2012031801 | 3 |
| lunch | HOUR | 2012031802 | 4 |
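A sketch of how a tweet might turn into entries for this table: each n-gram produces one entry per granularity with a value of 1, and the counts are summed per key. The helpers are hypothetical, the whitespace/lowercase tokenization is an assumption, and the `Counter` stands in for what the SummingCombiner does on the server side.

```python
from collections import Counter

def ngrams(text, max_n=3):
    """Yield the 1-, 2- and 3-grams of a message (assumed tokenization)."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def mutations(text, day, hour):
    """Yield ((row, family, qualifier), 1) for each n-gram at both granularities."""
    for gram in ngrams(text):
        yield (gram, "DAY", day), 1
        yield (gram, "HOUR", day + hour), 1

tweets = [
    ("Breakfast time", "20120318", "01"),
    ("late breakfast", "20120318", "02"),
]
counts = Counter()
for text, day, hour in tweets:
    for key, one in mutations(text, day, hour):
        counts[key] += one   # the combiner would do this sum server-side
```

With the combiner in place the writer never has to read the current count before incrementing it; it only ever inserts 1s.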
![Page 38: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/38.jpg)
Data Model
• Trends table
– Row ID: (Date Granularity + Date Value)
– Column Family: (Integer.MAX_VALUE – trendScore)
– Column Qual: n-gram
– Value: []

| Row ID | Col Fam | Col Qual | Value |
| --- | --- | --- | --- |
| DAY:20120318 | 2147483145 | church | |
| DAY:20120318 | 2147483316 | hangover | |
| … | … | … | … |
| DAY:20120319 | 2147476521 | the broncos | |
| DAY:20120319 | 2147477704 | tim tebow | |
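Subtracting the score from Integer.MAX_VALUE makes the highest-scoring n-gram sort first within each day, since keys are returned in ascending order. The sketch below rebuilds the rows shown above; the trend scores (502, 331, 7126, 5943) are derived from the slide's column family values, while `trend_key` and the zero-padding to ten digits are assumptions made so that string order matches numeric order.

```python
INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

def trend_key(granularity, date, score, ngram):
    """Build (row, family, qualifier) for the trends table."""
    row = f"{granularity}:{date}"
    family = f"{INT_MAX - score:010d}"  # inverted, zero-padded score
    return (row, family, ngram)

keys = sorted([
    trend_key("DAY", "20120318", 331, "hangover"),
    trend_key("DAY", "20120318", 502, "church"),
    trend_key("DAY", "20120319", 5943, "tim tebow"),
    trend_key("DAY", "20120319", 7126, "the broncos"),
])
for k in keys:
    print(k)
```

Reading the top trends for a day then becomes a scan of the first few entries of that row, with no sorting on the client.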
![Page 39: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/39.jpg)
MapReduce Analytics
• Utilize MapReduce for building trends
• AccumuloInputFormat reads from tweets table
• AccumuloOutputFormat writes to trends table
• AccumuloStorage LoadFunc for Pig available on github
![Page 40: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/40.jpg)
![Page 41: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/41.jpg)
Summary
•Accumulo exploits locality to enable interactive access to huge data sets while adding cell-level access control and server-side programming
•Nothing in life is free. Accumulo comes with the complexity and responsibility of managing a distributed system and designing indexes on your data
![Page 42: Introduction to Apache Accumulo](https://reader030.fdocuments.in/reader030/viewer/2022012400/54c6ad934a79593f6c8b4573/html5/thumbnails/42.jpg)
References
• Documentation, Mailing Lists, Links: http://incubator.apache.org/accumulo/
• HBase Shootout: http://www.slideshare.net/cloudera/h-base-and-accumulo-todd-lipcom-jan-25-2012
• Trendulo: https://github.com/jaredwinick/trendulo