Introduction to Apache Accumulo
Transcript of Introduction to Apache Accumulo
![Page 1: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/1.jpg)
Apache Accumulo
Introduction
![Page 2: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/2.jpg)
Introduction
• Aaron Cordova
• Founded Accumulo project with several others
• Led development through release 1.0
![Page 3: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/3.jpg)
Agenda
• Introduction
• Data Model
• API
• Architecture - scaling, recovery
• Security
• Data-lifecycle
• Applications
![Page 4: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/4.jpg)
Introduction
![Page 5: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/5.jpg)
History
• Began writing in summer of 2008, after comparing design goals with the BigTable paper and the existing implementations HBase and Hypertable
• Released internal version 1.0 summer of 2009.
• September 2011 accepted as an Apache Incubator project. Doug Cutting, founder of Hadoop, was the Champion Sponsor
• Feb 2012 1.4 Released
• March 2012 graduates to a top level Apache project
• V 1.5 due out soon
![Page 6: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/6.jpg)
Introduction
• Accumulo is a sparse, distributed, sorted, multi-dimensional map
• Modeled after Google’s BigTable design
• Scales to trillions of records and 100s of Terabytes
• Features automatic load balancing, high-availability, dynamic control over data layout
![Page 7: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/7.jpg)
Data Model
![Page 8: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/8.jpg)
Data Model
Key | Value
row ID | Column | Timestamp | Value
row ID | Family | Qualifier | Visibility | Timestamp | Value
![Page 9: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/9.jpg)
Data Model (Logical 2D table structure)
row attribute:age attribute:phone purchases:sneakers returns:hat
bill 49 555-1212 $100 -
george 38 - $80 $30
![Page 10: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/10.jpg)
Physical layout (sorted keys)
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
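The relationship between the logical table and the sorted physical layout can be sketched in a few lines. This is a minimal illustration using plain Python tuples, not the Accumulo client API; the function name `to_sorted_pairs` is hypothetical, and visibility and timestamp are left out of the key to keep the sketch short.

```python
# Illustrative model only: plain Python data, not the Accumulo client API.
# Logical 2D table from the slide: row -> {"family:qualifier": value}
logical = {
    "george": {"attribute:age": "38", "purchases:sneakers": "$80", "returns:hat": "$30"},
    "bill": {"attribute:age": "49", "attribute:phone": "555-1212", "purchases:sneakers": "$100"},
}

def to_sorted_pairs(table):
    """Flatten the logical table into ((row, family, qualifier), value) pairs sorted by key."""
    pairs = []
    for row, columns in table.items():
        for column, value in columns.items():
            family, qualifier = column.split(":", 1)
            pairs.append(((row, family, qualifier), value))
    # Sorting the tuples orders by row, then family, then qualifier.
    return sorted(pairs)

pairs = to_sorted_pairs(logical)
for key, value in pairs:
    print(*key, value)
```

Empty cells in the logical table simply have no key in the sorted list, which is what makes the map sparse.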
![Page 11: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/11.jpg)
High-level API
![Page 12: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/12.jpg)
Accumulo API
• To use Accumulo, you must write an application using the Accumulo Java client library. There is no SQL (hence NoSQL)
• Data is packaged into Mutation objects which are added to a BatchWriter which sends them to TabletServers
• Clients can scan a set of key value pairs by specifying optional start and end keys (Range) and obtaining a Scanner. Iterating over the scanner returns sorted key value pairs for that range. Each scan takes milliseconds to start.
• Can scan over a subset of the columns
• Can send a set of Ranges to a BatchScanner, get matching key value pairs, unsorted
![Page 13: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/13.jpg)
Insert
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
![Page 14: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/14.jpg)
Insert
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
bill attribute phone private Jun 2010 555-1212
![Page 15: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/15.jpg)
Insert
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
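The three Insert slides show a new key arriving and then sitting at its sorted position. A minimal sketch of that behaviour, assuming a sorted Python list as a stand-in for the table (the helper `insert` is hypothetical and is not how a BatchWriter works internally):

```python
import bisect

# Illustrative only: a sorted Python list stands in for a table's sorted keys.
table = [
    (("bill", "attribute", "age"), "49"),
    (("bill", "purchases", "sneakers"), "$100"),
    (("george", "attribute", "age"), "38"),
    (("george", "purchases", "sneakers"), "$80"),
    (("george", "returns", "hat"), "$30"),
]

def insert(sorted_pairs, key, value):
    """Insert a key-value pair at its sorted position and return that position."""
    entry = (key, value)
    position = bisect.bisect_left(sorted_pairs, entry)
    sorted_pairs.insert(position, entry)
    return position

position = insert(table, ("bill", "attribute", "phone"), "555-1212")
print(position, table[position])
```

The new phone entry lands between bill's age and bill's purchases, matching the final Insert slide.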
![Page 16: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/16.jpg)
Scan - Full key lookup
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Lookup key: bill attribute phone private Jun 2010
![Page 17: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/17.jpg)
Scan - Single row
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan row: bill
![Page 18: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/18.jpg)
Scan - Multiple Rows
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan range: bill - will
![Page 19: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/19.jpg)
Scan - Multiple Rows, Selected Columns
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
Scan range: bill - will, fetch column family: purchases
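The scan variants above (single row, a range of rows, a range limited to selected columns) all reduce to seeking into the sorted keys and reading forward. The sketch below illustrates that idea with a sorted list and `bisect`; the `scan` function and its parameters are hypothetical and do not mirror the Scanner API.

```python
import bisect

# Illustrative only: a sorted list of ((row, family, qualifier), value) pairs.
TABLE = sorted([
    (("bill", "attribute", "age"), "49"),
    (("bill", "attribute", "phone"), "555-1212"),
    (("bill", "purchases", "sneakers"), "$100"),
    (("george", "attribute", "age"), "38"),
    (("george", "purchases", "sneakers"), "$80"),
    (("george", "returns", "hat"), "$30"),
])

def scan(sorted_pairs, start_row, end_row, families=None):
    """Return pairs with start_row <= row <= end_row, optionally limited to some families."""
    # Seek to the first key at or after start_row instead of reading from the top.
    start = bisect.bisect_left(sorted_pairs, ((start_row,),))
    results = []
    for key, value in sorted_pairs[start:]:
        row, family, _qualifier = key
        if row > end_row:
            break  # keys are sorted, so nothing later can match
        if families is None or family in families:
            results.append((key, value))
    return results

single_row = scan(TABLE, "bill", "bill")
multiple_rows = scan(TABLE, "bill", "will")
selected = scan(TABLE, "bill", "will", families={"purchases"})
print(selected)
```

Because the keys are sorted, the cost of a scan depends on the size of the range read rather than on the size of the table.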
![Page 20: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/20.jpg)
Architecture - Scaling and Recovery
![Page 21: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/21.jpg)
Performance
• Accumulo ‘scales’ because aggregate read and write performance increases as more machines are added, and because individual read/write performance remains very good even with trillions of key-value pairs already in the system
• Sources: http://www.slideshare.net/acordova00/accumulo-on-ec2
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
[Chart: thousands of writes per second (axis ticks 10, 100, 1000, 10000) versus number of machines (axis ticks 1, 16, 64, 256, 1024), comparing Accumulo, BigTable circa 2006, and Cassandra]
![Page 22: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/22.jpg)
Accumulo Prerequisites
• One to hundreds of computers with local hard drives, connected via ethernet
• Password-less SSH access
• Local directory for write-ahead logs
• Hadoop and ZooKeeper installed, configured, and running
![Page 23: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/23.jpg)
Architecture
HDFS MapReduce
Accumulo
ZooKeeper
![Page 24: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/24.jpg)
Architecture: HDFS
HDFS
NameNode
DataNodes
File
![Page 25: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/25.jpg)
Architecture: HDFS
HDFS
NameNode
DataNodes
Block 1 Block 2
![Page 26: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/26.jpg)
Architecture: HDFS
HDFS
NameNode
DataNodes
![Page 27: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/27.jpg)
Architecture: Tables
Accumulo
Tablet Servers
Master
Table
![Page 28: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/28.jpg)
Architecture: Tables
Accumulo
Tablet Servers
Master
P1 P2 P3
![Page 29: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/29.jpg)
Architecture: Tables
Accumulo
Tablet Servers
Master
![Page 30: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/30.jpg)
Architecture: Writes
HDFS
P1
File1
MemTable
![Page 31: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/31.jpg)
Architecture: Writes
HDFS
P1
File1
MemTable
Client
Write-ahead Log
![Page 32: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/32.jpg)
Architecture: Writes
HDFS
File1 File 2
P1 MemTable
Write-ahead Log
![Page 33: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/33.jpg)
Architecture: Writes
HDFS
File1 File 2
P1 MemTable
Write-ahead Log X
![Page 34: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/34.jpg)
Architecture: Splits
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
![Page 35: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/35.jpg)
Architecture: Splits
Accumulo
Tablet Servers
Master
![Page 36: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/36.jpg)
Architecture: Splits
Accumulo
Tablet Servers
Master
![Page 37: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/37.jpg)
Architecture: Splits
Accumulo
Tablet Servers
Master
![Page 38: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/38.jpg)
Sorted keys - dynamic partitioning
• Because keys are sorted, tables can be partitioned based on the data
• Partitions (tablets) are uniform in size, regardless of data distribution (as long as single rows are smaller than the partition size)
• Partitioning is not based on the number of servers
• Servers can be added, removed, or fail at any time; the system is always automatically balanced
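A small sketch of why sorted keys give uniformly sized partitions even for skewed data. It assumes a tablet splits in half whenever it exceeds a size limit and uses row count as a stand-in for size in bytes; the function name and the threshold are hypothetical.

```python
def split_tablets(sorted_rows, max_rows):
    """Recursively split a sorted run of rows until every tablet holds at most max_rows."""
    if len(sorted_rows) <= max_rows:
        return [sorted_rows]
    middle = len(sorted_rows) // 2
    return split_tablets(sorted_rows[:middle], max_rows) + split_tablets(sorted_rows[middle:], max_rows)

# Skewed data: 90 rows start with "a", only 10 start with "z".
rows = sorted(["a%03d" % i for i in range(90)] + ["z%03d" % i for i in range(10)])
tablets = split_tablets(rows, max_rows=25)
sizes = [len(tablet) for tablet in tablets]
split_points = [tablet[0] for tablet in tablets[1:]]
print(sizes, split_points)
```

The split points follow the data: every boundary falls inside the dense "a" range, and the number of servers never enters the calculation.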
![Page 39: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/39.jpg)
Partitioning Contrast
• Some relational databases allow partitioning. May require users to choose a field or two on which to partition. Hopefully that field is uniformly distributed
• Hash-based systems (default Cassandra, CouchDB, Riak, Voldemort) avoid this problem, but at the cost of range scans. Some support range scans via other means.
• Many systems couple partition storage with partition service, requiring data movement to rebalance partition service (MongoDB, Cassandra, etc)
![Page 40: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/40.jpg)
Architecture: Reads
File1 File 2
P1 MemTable
Client
Merge
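The Writes and Reads slides carry only diagram labels, so the following is a sketch under assumptions: it follows the usual BigTable-style flow in which a write is logged, placed in an in-memory table, and later flushed to an immutable sorted file, while a read merges the in-memory table with the files. The `Tablet` class, its method names, and the flush threshold are all hypothetical, and versions of the same key are ignored.

```python
import heapq

class Tablet:
    """Toy tablet: an in-memory table plus immutable sorted files (illustrative only)."""

    def __init__(self, flush_threshold=2):
        self.write_ahead_log = []  # record of writes not yet flushed to a file
        self.mem_table = {}
        self.files = []            # each file is a sorted list of (key, value)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.write_ahead_log.append((key, value))  # log first, so the write can be replayed
        self.mem_table[key] = value
        if len(self.mem_table) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.files.append(sorted(self.mem_table.items()))
        self.mem_table = {}
        self.write_ahead_log = []  # flushed entries no longer need the log

    def read(self):
        """Merge the in-memory table and all files into one sorted list."""
        sources = self.files + [sorted(self.mem_table.items())]
        return list(heapq.merge(*sources))

tablet = Tablet(flush_threshold=2)
for name in ["bill", "george", "alice", "carol", "dave"]:
    tablet.write(name, "v")
print([key for key, _ in tablet.read()])
```

Each source is already sorted, so the merge never has to re-sort the data; it only interleaves the streams.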
![Page 41: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/41.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
![Page 42: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/42.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
![Page 43: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/43.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
![Page 44: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/44.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
![Page 45: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/45.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
Master reassigns
![Page 46: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/46.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
Replay Write-ahead Log
![Page 47: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/47.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
![Page 48: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/48.jpg)
Architecture: Recovery
Accumulo
Tablet Servers
Master
DataNodes
NameNode
![Page 49: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/49.jpg)
Metadata Hierarchy
root
metadata table: md1 md2 md3
user tables: user1 user2 index1 index2
![Page 50: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/50.jpg)
Architecture: Lookup
Accumulo
Tablet Servers
Master
Client
ZooKeeper
![Page 51: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/51.jpg)
Architecture: Lookup
Accumulo
Tablet Servers
Master
Client
ZooKeeper
Client knows ZooKeeper, finds root tablet
![Page 52: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/52.jpg)
Architecture: Lookup
Accumulo
Tablet Servers
Master
Client
ZooKeeper
Scan root tablet, find metadata tablet that describes the user table we want
![Page 53: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/53.jpg)
Architecture: Lookup
Accumulo
Tablet Servers
Master
Client
ZooKeeper
Read location info of tablets of user table and cache it
![Page 54: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/54.jpg)
Architecture: Lookup
Accumulo
Tablet Servers
Master
Client
ZooKeeper
Read directly from server holding the tablets we want
![Page 55: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/55.jpg)
Architecture: Lookup
Accumulo
Tablet Servers
Master
Client
ZooKeeper
Find other tablets via cache lookups
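The lookup steps can be sketched as a client that reads tablet locations once and then answers later lookups from its cache. This is a simplification under assumptions: the ZooKeeper, root tablet, and metadata tablet chain is collapsed into one dictionary read, each tablet is identified by its last row, and the server names, `METADATA` contents, and `Client` class are hypothetical.

```python
import bisect

# Hypothetical metadata: for each table, a sorted list of (last row in tablet, hosting server).
METADATA = {
    "user1": [("g", "server-a"), ("p", "server-b"), ("\uffff", "server-c")],
}

class Client:
    def __init__(self, metadata):
        self.metadata = metadata
        self.cache = {}          # table name -> cached tablet locations
        self.metadata_reads = 0  # counts trips to the metadata

    def locate(self, table, row):
        """Return the server hosting the tablet that contains the given row."""
        if table not in self.cache:
            self.metadata_reads += 1
            self.cache[table] = self.metadata[table]
        tablets = self.cache[table]
        end_rows = [end_row for end_row, _server in tablets]
        # The first tablet whose last row is >= the requested row holds it.
        index = bisect.bisect_left(end_rows, row)
        return tablets[index][1]

client = Client(METADATA)
servers = [client.locate("user1", row) for row in ("bill", "george", "will")]
print(servers, client.metadata_reads)
```

After the first lookup, locating a tablet costs only a binary search over cached end rows, so the client talks directly to tablet servers.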
![Page 56: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/56.jpg)
Security
![Page 57: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/57.jpg)
Security
• Design and Guarantees
• Data Labeling
• Authentication
• User Configuration
![Page 58: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/58.jpg)
Data Security
• Accumulo will only return cells whose visibility labels are satisfied by user credentials presented at Scan time
• Two necessary conditions
• Correctly labeling data on ingest
• Presenting right user credentials
![Page 59: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/59.jpg)
Security Labels
row ID | column | timestamp | value
row ID | family | qualifier | visibility | timestamp | value
Extension of BigTable data model
![Page 60: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/60.jpg)
Column Visibility
row col fam col qual col vis time value
bill attribute age public Jun 2010 49
bill attribute phone private Jun 2010 555-1212
bill purchases sneakers public Apr 2010 $100
george attribute age private Oct 2009 38
george purchases sneakers public Nov 2009 $80
george returns hat public Dec 2009 $30
![Page 61: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/61.jpg)
Security Label Syntax
• A & B - both A and B required
• A | B - must have either A or B
• (A | B) & C - must have C and either A or B
• A | (B & C) - must have A or both B and C
• A & (B | (C & D)) - must have A and either B or both C and D
![Page 62: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/62.jpg)
Security Label Example
• Drive needs:
• license&over15
• Join military:
• (over17|(over16&parentConsent)) & (greencard|USCitizen)
• Access to Classified data
• TS&SI&(USA|GBR|NZL|CAN|AUS)
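The label syntax above is small enough to evaluate with a short recursive parser. The sketch below is an illustration of the semantics, not Accumulo's implementation: the `evaluate` function is hypothetical, it does little validation of malformed input, and its choice to reject `&` and `|` mixed at the same level without parentheses is an assumption of this sketch.

```python
import re

def evaluate(expression, authorizations):
    """Return True if the set of authorizations satisfies the label expression."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    position = 0

    def parse_term():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            result = parse_expression()
            position += 1  # skip the closing ")"
            return result
        return token in authorizations

    def parse_expression():
        nonlocal position
        result = parse_term()
        operator = None
        while position < len(tokens) and tokens[position] in ("&", "|"):
            if operator is None:
                operator = tokens[position]
            elif tokens[position] != operator:
                raise ValueError("mixing & and | requires parentheses")
            position += 1
            operand = parse_term()  # always parse, so the token position stays correct
            result = (result and operand) if operator == "&" else (result or operand)
        return result

    return parse_expression()

military = "(over17|(over16&parentConsent)) & (greencard|USCitizen)"
print(evaluate("license&over15", {"license", "over15"}))
print(evaluate(military, {"over16", "parentConsent", "USCitizen"}))
print(evaluate("TS&SI&(USA|GBR|NZL|CAN|AUS)", {"TS", "CAN"}))
```

The first two calls are satisfied; the third is not, because the SI authorization is missing.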
![Page 63: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/63.jpg)
Security Perimeter
Security Model
Accumulo
Trusted Client Auth Service
User
ID, password, cert
auths
verify
auths data
data
![Page 64: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/64.jpg)
Trusted Client Responsibility
• Ensure that credentials belong to the user
• Ensure that the user is authenticated
![Page 65: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/65.jpg)
Application Authorization
• Each Trusted Client application must have its maximum set of authorizations configured before it can pass authorizations with a request
• The Trusted Client limits the set of authorizations by application
![Page 66: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/66.jpg)
Application Authorization Example
• Data may be labeled with any combination of the following:
{ personal, research, finance, diet, cancer }
• We wish to limit certain applications to a subset
![Page 67: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/67.jpg)
Example Table
row colF colQ col vis value
row0 name - personal|finance John
row0 age - personal|research 49
row0 phone - personal|finance 555-1212
row0 owed - personal|finance $5440
row0 diagnosis - personal|(research & cancer) melanoma
row0 diagnosis - personal|(research & diet) diabetes
![Page 68: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/68.jpg)
Application Authorizations
Cancer Research: cancer diagnoses, age
Diabetes Research: diet info, age
Accounting System: balance, name, phone
Personal Records Management: all
![Page 69: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/69.jpg)
Security Perimeter
Security Model
Accumulo
Auth Service
Researcher
ID, password, cert
Cancer Research App
![Page 70: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/70.jpg)
Security Perimeter
Security Model
Accumulo
Auth Service
ID, password, cert
verify
Researcher
Cancer Research App
![Page 71: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/71.jpg)
Security Perimeter
Security Model
Accumulo
Auth Service
ID, password, cert
research, cancer, diabetes
verify
Researcher
Cancer Research App
![Page 72: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/72.jpg)
Security Perimeter
Security Model
Accumulo
Auth Service
ID, password, cert
research,cancer
Researcher
Cancer Research App
![Page 73: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/73.jpg)
Security Perimeter
Security Model
Accumulo
Auth Service
ID, password, cert
data
research, cancer
Researcher
Cancer Research App
![Page 74: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/74.jpg)
Security Perimeter
Security Model
Accumulo
Auth Service
ID, password, cert
data
data
research,cancer
Researcher
Cancer Research App
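The sequence above can be sketched end to end: the auth service returns the researcher's authorizations, the application narrows them to what it is allowed to pass, and only cells whose labels are satisfied come back. The sketch assumes the narrowing is a set intersection and that the Cancer Research App's maximum authorizations are research and cancer; both are assumptions, as are the names `TABLE` and `scan_as`. Each visibility label is written by hand as a list of alternatives rather than parsed.

```python
# Each label is a list of alternatives; every authorization in one alternative is required.
# personal|(research & cancer)  ->  [{"personal"}, {"research", "cancer"}]
TABLE = [
    ("name", [{"personal"}, {"finance"}], "John"),
    ("age", [{"personal"}, {"research"}], "49"),
    ("phone", [{"personal"}, {"finance"}], "555-1212"),
    ("owed", [{"personal"}, {"finance"}], "$5440"),
    ("diagnosis", [{"personal"}, {"research", "cancer"}], "melanoma"),
    ("diagnosis", [{"personal"}, {"research", "diet"}], "diabetes"),
]

def scan_as(user_auths, app_max_auths):
    """Return the cells visible to a user acting through an application."""
    passed = user_auths & app_max_auths  # the application cannot pass more than its maximum
    return [
        (column, value)
        for column, visibility, value in TABLE
        if any(alternative <= passed for alternative in visibility)
    ]

visible = scan_as({"research", "cancer", "diabetes"}, {"research", "cancer"})
print(visible)
```

With these assumptions the researcher sees age and the melanoma diagnosis, which matches the "cancer diagnoses, age" entry for the Cancer Research application.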
![Page 75: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/75.jpg)
Data life-cycle
![Page 76: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/76.jpg)
Data Model
Key | Value
row ID | Column | Timestamp | Value
row ID | Family | Qualifier | Visibility | Timestamp | Value
![Page 77: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/77.jpg)
Versions
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
What can we do with multiple versions of the same data?
![Page 78: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/78.jpg)
Iterators
• Mechanism for adding online functionality to tables
• Aggregation (called Combiners)
• Age-Off
• Filtering (including by security label)
![Page 79: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/79.jpg)
Versioning Iterators
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
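A versioning iterator keeps only the newest few versions of each cell. A minimal sketch of that behaviour on the rows above, assuming plain tuples for keys; the function name and the `max_versions` parameter are hypothetical.

```python
from itertools import groupby

VERSIONS = [
    (("row1", "fam1", "qual1", 1005), "2"),
    (("row1", "fam1", "qual1", 1004), "5"),
    (("row1", "fam1", "qual1", 1003), "3"),
    (("row1", "fam1", "qual1", 1002), "2"),
    (("row1", "fam1", "qual1", 1001), "7"),
]

def keep_latest_versions(pairs, max_versions=1):
    """Keep the newest max_versions entries for each (row, family, qualifier) cell."""
    # Sort by cell, then by timestamp descending so the newest version comes first.
    ordered = sorted(pairs, key=lambda pair: (pair[0][:3], -pair[0][3]))
    kept = []
    for _cell, versions in groupby(ordered, key=lambda pair: pair[0][:3]):
        kept.extend(list(versions)[:max_versions])
    return kept

latest = keep_latest_versions(VERSIONS)
latest_two = keep_latest_versions(VERSIONS, max_versions=2)
print(latest, latest_two)
```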
![Page 80: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/80.jpg)
Filtering Iterators
• Age Off
• RegEx
• Arbitrary filtering
![Page 81: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/81.jpg)
Age Off
• Can specify a particular date - e.g. delete everything older than July 1, 2007
• Can specify a time period - e.g. delete everything older than 6 months
![Page 82: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/82.jpg)
Age-Off
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
Current Time: 1103
K/V pair is more than 100 sec. old
![Page 83: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/83.jpg)
Age-Off
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
Current Time: 1104
K/V pair is more than 100 sec. old
![Page 84: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/84.jpg)
Age-Off
rowID family qualifier timestamp value
row1 fam1 qual1 1005 2
row1 fam1 qual1 1004 5
row1 fam1 qual1 1003 3
row1 fam1 qual1 1002 2
row1 fam1 qual1 1001 7
Current Time: 1105
K/V pair is more than 100 sec. old
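The three Age-Off slides can be reproduced with a one-line filter. This sketch assumes an entry is dropped when it is strictly more than 100 seconds old, as the slide annotation says; the function name `age_off` and the `ttl` parameter are hypothetical.

```python
TIMESTAMPED = [
    (("row1", "fam1", "qual1", 1005), "2"),
    (("row1", "fam1", "qual1", 1004), "5"),
    (("row1", "fam1", "qual1", 1003), "3"),
    (("row1", "fam1", "qual1", 1002), "2"),
    (("row1", "fam1", "qual1", 1001), "7"),
]

def age_off(pairs, current_time, ttl):
    """Keep only pairs that are at most ttl seconds old at current_time."""
    return [(key, value) for key, value in pairs if current_time - key[3] <= ttl]

for now in (1103, 1104, 1105):
    surviving = [key[3] for key, _ in age_off(TIMESTAMPED, now, ttl=100)]
    print(now, surviving)
```

As the clock advances from 1103 to 1105, one more entry crosses the 100-second boundary at each step.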
![Page 85: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/85.jpg)
Manual Deletes
• Can insert ‘deletes’. They are inserted like other key-value pairs; any matching key with an older timestamp is suppressed from reads
• Compactions write non-deleted data to new files
• Old files are then removed from HDFS
• To ensure data is deleted from disk,
• write deletes (they are now absent from query results)
• compact (can compact a particular range of a table if it’s large)
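The two steps, writing a delete and then compacting, can be sketched as follows. This is an illustration under assumptions: entries are `(key, timestamp, value)` tuples, a sentinel object stands in for a delete entry, and the `read` and `compact` functions are hypothetical. Entries whose timestamp equals the delete's are not covered.

```python
DELETE = object()  # sentinel standing in for a delete entry

def read(entries):
    """Return {key: value} as a query would see it: the newest entry wins, deletes hide the key."""
    newest = {}
    for key, timestamp, value in entries:
        if key not in newest or timestamp > newest[key][0]:
            newest[key] = (timestamp, value)
    return {key: value for key, (_ts, value) in newest.items() if value is not DELETE}

def compact(entries):
    """Rewrite the entries, dropping deletes and the older entries they suppress."""
    deleted_at = {}
    for key, timestamp, value in entries:
        if value is DELETE:
            deleted_at[key] = max(timestamp, deleted_at.get(key, timestamp))
    return [
        (key, timestamp, value)
        for key, timestamp, value in entries
        if value is not DELETE and timestamp > deleted_at.get(key, float("-inf"))
    ]

entries = [
    ("bill:phone", 1, "555-1212"),
    ("bill:age", 1, "49"),
    ("bill:phone", 2, DELETE),  # the delete is just another entry with a newer timestamp
]
before = read(entries)
compacted = compact(entries)
print(before, compacted)
```

The phone number disappears from reads as soon as the delete is written, but it is only physically absent once compaction has rewritten the data.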
![Page 86: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/86.jpg)
Garbage Collection
• Garbage collector compares the files in HDFS with the set of files currently active
• When files are no longer on the active list, GC waits for a while, then deletes from HDFS
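The garbage collection rule amounts to a set difference plus a waiting period. A minimal sketch, assuming the collector tracks when each file left the active list; the file names, the `inactive_since` bookkeeping, and the grace period value are all hypothetical.

```python
def files_to_delete(files_in_hdfs, active_files, inactive_since, now, grace_period):
    """Return files that are not active and have been inactive for at least grace_period."""
    candidates = set(files_in_hdfs) - set(active_files)
    return sorted(
        name for name in candidates
        if now - inactive_since.get(name, now) >= grace_period
    )

doomed = files_to_delete(
    files_in_hdfs={"f1", "f2", "f3"},
    active_files={"f3"},
    inactive_since={"f1": 100, "f2": 190},
    now=200,
    grace_period=60,
)
print(doomed)
```

Here f2 is also inactive but has not yet waited out the grace period, so only f1 is removed.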
![Page 87: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/87.jpg)
Applications
• Fast lookups / scan on extremely large tables with flexible schemas, varying security
• Large index across heterogeneous data sets
• Continuous Summary Analytics via Iterators
• Secure Storage of key value pairs for MapReduce jobs
![Page 88: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/88.jpg)
Where does your data come from?
• BigTable was designed to store data for web applications serving millions of users. Web application creates all the data. Many NoSQL databases are designed solely for this purpose. Accumulo can certainly support that.
• However, many organizations have lots of data from various sources. Different schema, different security levels. Bringing them together for analysis is very valuable. Accumulo can support this too.
![Page 89: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/89.jpg)
Indexing and queries
• BigTable data model supports building a wide variety of indexes
• Simple strings, numbers, geo points, ip addresses, etc
• Each has to be coupled with query code
• New applications should first examine their data access use cases; indexes and query code to serve those use cases can then be written
• Best applications are constructed so each user request is a single scan, or a small number of scans
![Page 90: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/90.jpg)
Compared to MapReduce
• Hadoop’s HDFS stores simple files. Usually unsorted.
• MapReduce is designed to process all or most of the files at once.
• Accumulo maintains a set of sorted files in HDFS
• Accumulo scans are designed to access a small portion of the data quickly.
• Fairly complementary
![Page 91: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/91.jpg)
Tough use case
• Ran MapReduce on some input data set to create a large result set.
• Now have a few new records, want to update the result set
• MapReduce has to process all the data again, have to wait
• Accumulo allows users to perform a limited set of operations to update a result set incrementally, using Iterators
• Result sets are always up to date, immediately after insert
![Page 92: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/92.jpg)
Combiners
row col fam col qual col vis time value
bill perf June_calls P June 1 9
bill perf June_calls P June 4 3
bill perf July_calls P July 3 4
bill perf July_calls P July 11 7
bill perf August_calls P Aug 12 5
bill perf August_calls P Aug 29 2
![Page 93: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/93.jpg)
Combiners
row col fam col qual col vis time value
bill perf June_calls P - 12
bill perf July_calls P - 11
bill perf August_calls P - 7
![Page 94: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/94.jpg)
Combiners
• Almost equivalent to Reduce of MapReduce except:
• Cannot assume we have seen all the values for a particular key
• Exactly equivalent to a Combiner function
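The point that a combiner cannot assume it has seen every value for a key is easiest to see with a sum, which gives the same answer whether it is applied once or in stages. The sketch below uses the call counts from the Combiners table; the function `summing_combiner` is a hypothetical stand-in, not the Accumulo class.

```python
from collections import defaultdict

CALLS = [
    ("June_calls", 9), ("June_calls", 3),
    ("July_calls", 4), ("July_calls", 7),
    ("August_calls", 5), ("August_calls", 2),
]

def summing_combiner(pairs):
    """Sum the values seen for each key; the input may be only part of the data."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Combine everything at once.
all_at_once = summing_combiner(CALLS)

# Combine a partial set first, then combine that result with the rest.
partial = summing_combiner(CALLS[:3])
in_stages = summing_combiner(list(partial.items()) + CALLS[3:])
print(all_at_once, in_stages)
```

Both routes give June 12, July 11, and August 7, which is why a combiner can run wherever a subset of the values happens to be available.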
![Page 95: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/95.jpg)
Combiners
• Useful Combiners:
• Event count (StringSummation or LongSummation aggregator)
• Event hour occurrence histogram (NumArraySummation aggregator)
• Event duration histogram (NumArraySummation aggregator)
![Page 96: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/96.jpg)
Conceptual Graph Representation
[Diagram: graph with nodes a, b, c, d, e, f, g]
![Page 97: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/97.jpg)
Edge table
row col fam col qual col vis time value
a edge f 1.0
c edge b 1.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
g edge e 1.0
g edge f 1.0
![Page 98: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/98.jpg)
Edge Weights
• Summing Combiners are typically used to efficiently and incrementally update edge weights
• See SummingCombiner
![Page 99: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/99.jpg)
Edge table
row col fam col qual col vis time value
a edge f 1.0
c edge b 1.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
Incoming: a, edge, f, 1.0
![Page 100: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/100.jpg)
Edge table
row col fam col qual col vis time value
a edge f 2.0
c edge b 1.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
![Page 101: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/101.jpg)
Edge table
row col fam col qual col vis time value
a edge f 2.0
c edge b 1.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
Incoming: c, edge, b, 6.0
![Page 102: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/102.jpg)
Edge table
row col fam col qual col vis time value
a edge f 2.0
c edge b 7.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
![Page 103: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/103.jpg)
Edge table
row col fam col qual col vis time value
a edge f 2.0
c edge b 7.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
Incoming: a, edge, f, 2.3
![Page 104: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/104.jpg)
Edge table
row col fam col qual col vis time value
a edge f 4.3
c edge b 7.0
c edge d 1.0
d edge b 1.0
d edge e 1.0
e edge d 1.0
f edge g 1.0
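The sequence of incoming edges on the preceding slides can be replayed with a toy model. This is a plain-Python sketch, not Accumulo code: the `EdgeTable` class is hypothetical and simply keeps every written value per key, summing them when read, which is the effect a summing combiner has on the table.

```python
from collections import defaultdict

class EdgeTable:
    """Toy in-memory stand-in for a table with a summing combiner on its values."""

    def __init__(self):
        self._values = defaultdict(list)

    def put(self, row, fam, qual, value):
        # Writers only append; they never read the current weight first.
        self._values[(row, fam, qual)].append(value)

    def scan(self):
        # The combiner folds all values stored under a key into a single sum.
        return {key: sum(vals) for key, vals in sorted(self._values.items())}

table = EdgeTable()

# Initial edges, each with weight 1.0.
for row, qual in [("a", "f"), ("c", "b"), ("c", "d"), ("d", "b"),
                  ("d", "e"), ("e", "d"), ("f", "g")]:
    table.put(row, "edge", qual, 1.0)

# Incoming updates from the slides.
table.put("a", "edge", "f", 1.0)
table.put("c", "edge", "b", 6.0)
table.put("a", "edge", "f", 2.3)

weights = table.scan()
print(weights[("a", "edge", "f")], weights[("c", "edge", "b")])
```

The writer never needs to know the current weight of an edge, which is what makes the incremental update cheap.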
![Page 105: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/105.jpg)
Edge Table Applications
• Graph Analytics - traversal, neighbors, connected components
• Neighborhood = feature vector. Vector-based machine learning techniques. Nearest neighbor search, clustering, classification
• Automated dossiers, fact accumulation - ‘tell me everything we know about X’ in a single scan
• Find entities based on features - ‘show me everyone who has feature value > x’ or ‘with < 5 neighbors of type k’
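Two of the access patterns above can be sketched against a sorted list standing in for the edge table. This is plain Python, not the Accumulo API; the helper names are hypothetical, the data is the nine-edge table shown earlier, and "neighbors" is assumed here to mean outgoing edges only.

```python
from bisect import bisect_left

# (row, col qual, value) entries kept in sorted order, as a sorted table would.
EDGES = sorted([
    ("a", "f", 1.0), ("c", "b", 1.0), ("c", "d", 1.0),
    ("d", "b", 1.0), ("d", "e", 1.0), ("e", "d", 1.0),
    ("f", "g", 1.0), ("g", "e", 1.0), ("g", "f", 1.0),
])

def neighbors(node):
    """Single range scan: all entries for one row are contiguous in sort order."""
    start = bisect_left(EDGES, (node,))
    result = {}
    for row, qual, weight in EDGES[start:]:
        if row != node:
            break
        result[qual] = weight
    return result

def nodes_with_fewer_neighbors_than(limit):
    """Feature query: rows whose neighbor count is below `limit`."""
    rows = sorted({row for row, _, _ in EDGES})
    return [row for row in rows if len(neighbors(row)) < limit]

print(neighbors("c"))
print(nodes_with_fewer_neighbors_than(2))
```

The neighbor map returned for a node doubles as its feature vector, which is why the same table can feed both graph traversal and vector-based techniques.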
![Page 106: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/106.jpg)
RDF Triples
row col fam col qual col vis time value
DC is_capital_of USA 1.0
Don vacations_in Arctic 7.0
Don is_employed_by MI6 1.0
Sean has_status “007” 1.0
Sean starred_with Ursula 1.0
Sean starred_with Anya 0.7
Sean starred_with Teresa 0.3
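The layout in this table maps subject to row, predicate to column family, and object to column qualifier. A minimal plain-Python sketch of querying it follows; `facts_about` is a hypothetical helper, and the linear filter stands in for what would be a row-range scan on a real sorted table.

```python
# (subject, predicate, object, value) kept in sorted key order.
TRIPLES = sorted([
    ("DC", "is_capital_of", "USA", 1.0),
    ("Don", "vacations_in", "Arctic", 7.0),
    ("Don", "is_employed_by", "MI6", 1.0),
    ("Sean", "has_status", "007", 1.0),
    ("Sean", "starred_with", "Ursula", 1.0),
    ("Sean", "starred_with", "Anya", 0.7),
    ("Sean", "starred_with", "Teresa", 0.3),
])

def facts_about(subject, predicate=None):
    """Return (predicate, object, value) for one subject, optionally one predicate."""
    return [
        (p, o, v)
        for s, p, o, v in TRIPLES
        if s == subject and (predicate is None or p == predicate)
    ]

co_stars = [obj for _, obj, _ in facts_about("Sean", "starred_with")]
print(co_stars)
print(facts_about("Don"))
```

Because the subject is the row, everything known about one subject sits together in sort order. Queries by predicate or object alone would need additional orderings of the same triples.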
![Page 107: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/107.jpg)
RDF Triples - RYA
• See the Rya project: http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
![Page 108: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/108.jpg)
Additional Training
![Page 109: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/109.jpg)
Additional Training
• Talked about the basics today
• 3 days of developer training with hands-on examples covering
• installation, configuration, read / write API, MapReduce, security, table configuration, indexing specific types, querying index tables, combiners, custom iterators, table constraints, storing relational data, joins, high performance considerations, document-partitioned indexing (text search), machine learning, object persistence
• 2 days of administrator training covering
• hardware selection, process assignment, troubleshooting, maintenance, replication and high availability, cluster modification, failure handling
![Page 110: Introduction to Apache Accumulo](https://reader038.fdocuments.in/reader038/viewer/2022102621/54c6ad7e4a79595e6c8b4579/html5/thumbnails/110.jpg)
Next Scheduled Training Sessions
• March 5-7 Columbia MD
• April 9-11 Columbia MD
• http://www.tetraconcepts.com/training