Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]
Transcript of Accumulo Summit 2015: Using Fluo to incrementally process data in Accumulo [API]
[Slide 1]
Mike Walch
Using Fluo to incrementally process data in Accumulo
[Slide 2]
Problem: Maintain counts of inbound links

[Figure: example graph of links among fluo.io, github.com, apache.org, and nytimes.com]

Example Data:

| Website     | # Inbound Links |
|-------------|-----------------|
| fluo.io     | 0               |
| github.com  | 3               |
| apache.org  | 2               |
| nytimes.com | 0               |
[Slide 3]
Solution 1 - Maintain counts using batch processing

Data flows from the Internet through a WebCrawler into a WebCache; MapReduce jobs then aggregate a link count change log and merge the hourly aggregates into the historical counts.

Link count change log:

| Website     | Change |
|-------------|--------|
| fluo.io     | +1     |
| github.com  | -1     |
| apache.org  | +1     |
| github.com  | -1     |
| nytimes.com | +1     |
| apache.org  | +1     |

Last Hour Aggregates:

| Website     | # Inbound |
|-------------|-----------|
| fluo.io     | +1        |
| github.com  | -23       |
| apache.org  | +65       |
| nytimes.com | +105      |

Historical:

| Website     | # Inbound  |
|-------------|------------|
| fluo.io     | 53         |
| github.com  | 1,385,192  |
| apache.org  | 2,528,190  |
| nytimes.com | 53,395,000 |

Latest Counts:

| Website     | # Inbound  |
|-------------|------------|
| fluo.io     | 54         |
| github.com  | 1,385,169  |
| apache.org  | 2,528,255  |
| nytimes.com | 53,395,105 |
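The batch arithmetic above can be checked in a few lines of plain Java (a sketch only: the site names and counts come from the slide, and `applyDeltas` is a hypothetical helper, not part of any Fluo or MapReduce API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchCounts {
    // Applies one hour of aggregated link-count deltas to the historical
    // counts, reproducing the "Latest Counts" numbers from the slide.
    static Map<String, Long> applyDeltas(Map<String, Long> historical, Map<String, Long> deltas) {
        Map<String, Long> latest = new LinkedHashMap<>(historical);
        deltas.forEach((site, delta) -> latest.merge(site, delta, Long::sum));
        return latest;
    }

    public static void main(String[] args) {
        Map<String, Long> historical = new LinkedHashMap<>();
        historical.put("fluo.io", 53L);
        historical.put("github.com", 1_385_192L);
        historical.put("apache.org", 2_528_190L);
        historical.put("nytimes.com", 53_395_000L);

        Map<String, Long> lastHour = new LinkedHashMap<>();
        lastHour.put("fluo.io", 1L);
        lastHour.put("github.com", -23L);
        lastHour.put("apache.org", 65L);
        lastHour.put("nytimes.com", 105L);

        System.out.println(applyDeltas(historical, lastHour));
        // {fluo.io=54, github.com=1385169, apache.org=2528255, nytimes.com=53395105}
    }
}
```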
[Slide 4]
Solution 2 - Maintain counts using Fluo

Data flows from the Internet through a WebCrawler into a WebCache; Fluo applies the +1/-1 updates directly to the Fluo table:

| Website     | # Inbound  |
|-------------|------------|
| fluo.io     | 53         |
| github.com  | 1,385,192  |
| apache.org  | 2,528,190  |
| nytimes.com | 53,395,000 |
[Slide 5]
Solution 3 - Use both: update popular sites using batch processing & update long tail using Fluo

[Figure: website distribution by # inbound links — the popular head (e.g. nytimes.com, github.com) is updated every hour using MapReduce, while the long tail (e.g. fluo.io) is updated in real time using Fluo]
[Slide 6]
Fluo 101 - Basics
- Provides cross-row transactions and snapshot isolation which makes it safe to do concurrent updates
- Allows for incremental processing of data
- Based on Google’s Percolator paper
- Started as a side project by Keith Turner in 2013
- Originally called Accismus
- Tested using synthetic workloads
- Almost ready for production environments
[Slide 7]
Fluo 101 - Accumulo vs Fluo
- Fluo is a transactional API built on top of Accumulo
- Fluo stores its data in Accumulo
- Fluo uses Accumulo conditional mutations for transactions
- Fluo has a table structure (row, column, value) similar to Accumulo except Fluo has no timestamp
- Each Fluo application runs its own processes
- Oracle allocates timestamps for transactions
- Workers run user code (called observers) that perform transactions
[Slide 8]
Fluo 101 - Architecture

[Diagram: Fluo runs on a stack of Accumulo, HDFS, Zookeeper, and YARN. A client cluster runs Fluo clients for each application. Each Fluo application (Application 1, Application 2) runs its own oracle and workers; workers run observers (Observer1, Observer2, ObserverA), and each application stores its data in its own Accumulo table (Table1, Table2)]
[Slide 9]
Fluo 101 - Client API
Used by developers to ingest data or interact with Fluo from external applications (REST services, crawlers, etc)
```java
public void addDocument(FluoClient fluoClient, String docId, String content) {
  TypeLayer typeLayer = new TypeLayer(new StringEncoder());
  try (TypedTransaction tx1 = typeLayer.wrap(fluoClient.newTransaction())) {
    if (tx1.get().row(docId).col(CONTENT_COL).toString() == null) {
      tx1.mutate().row(docId).col(CONTENT_COL).set(content);
      tx1.commit();
    }
  }
}
```
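Outside of Fluo, the insert-only-if-absent pattern in `addDocument` can be sketched with plain Java (a hypothetical stand-in: a `ConcurrentMap` plays the role of the Fluo table, and `putIfAbsent` plays the role of the read-check-write transaction):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class AddDocumentSketch {
    // Mirrors the addDocument() pattern above: write the content only if no
    // document with this id exists yet. Returns true if the write happened.
    static boolean addDocument(ConcurrentMap<String, String> table, String docId, String content) {
        return table.putIfAbsent(docId, content) == null;
    }

    public static void main(String[] args) {
        ConcurrentMap<String, String> table = new ConcurrentHashMap<>();
        System.out.println(addDocument(table, "doc1", "my first hello world")); // true
        System.out.println(addDocument(table, "doc1", "other content"));        // false: doc1 exists
        System.out.println(table.get("doc1"));                                  // my first hello world
    }
}
```

The design point is the same in both cases: the check and the write happen as one atomic step, so two concurrent adders cannot both succeed for the same document id.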
[Slide 10]
Fluo 101 - Observers

- Developers can write observers that are triggered when a column is modified and are run by Fluo workers
- Best practice: do work/transactions in observers rather than in client code

```java
public class DocumentObserver extends TypedObserver {

  @Override
  public void process(TypedTransactionBase tx, Bytes row, Column column) {
    // do work here
  }

  @Override
  public ObservedColumn getObservedColumn() {
    return new ObservedColumn(CONTENT_COL, NotificationType.STRONG);
  }
}
```
[Slide 11]
Example Fluo Application
- Problem: Maintain word & document counts as documents are added and deleted from Fluo in real time
- The Fluo client performs two actions:
  1. Add document to table
  2. Mark document for deletion
- Which triggers two observers:
  - Add Observer - increases word and document counts
  - Delete Observer - decreases counts and cleans up
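What the two observers maintain can be modeled in plain Java (a minimal, non-transactional sketch with hypothetical names — plain maps stand in for the Fluo table rows shown on the following slides):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountModel {
    final Map<String, String> docs = new HashMap<>();        // d:<docId> -> content
    final Map<String, Integer> wordCounts = new HashMap<>(); // w:<word>  -> cnt
    int totalDocs = 0;                                       // total:docs -> cnt

    // Models the Add Observer: increment word counts and the document count.
    void addDocument(String docId, String content) {
        if (docs.putIfAbsent(docId, content) != null) return; // already present
        for (String word : content.split(" ")) wordCounts.merge(word, 1, Integer::sum);
        totalDocs++;
    }

    // Models the Delete Observer: decrement counts and clean up zeroed entries.
    void deleteDocument(String docId) {
        String content = docs.remove(docId);
        if (content == null) return;
        for (String word : content.split(" ")) {
            wordCounts.merge(word, -1, Integer::sum);
            if (wordCounts.get(word) == 0) wordCounts.remove(word);
        }
        totalDocs--;
    }

    public static void main(String[] args) {
        WordCountModel m = new WordCountModel();
        m.addDocument("doc1", "my first hello world");
        m.addDocument("doc2", "second hello world");
        System.out.println(m.wordCounts.get("hello") + " " + m.totalDocs); // 2 2
        m.deleteDocument("doc1");
        System.out.println(m.wordCounts.get("hello") + " " + m.totalDocs); // 1 1
    }
}
```

In real Fluo each increment or decrement would run inside a transaction triggered by a notification; this sketch only shows the bookkeeping the observers perform.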
[Slide 12]
Add first document to table

A Fluo client (running on the client cluster) adds the document; the AddObserver and DeleteObserver react to changes.

Fluo Table:

| Row      | Column | Value                |
|----------|--------|----------------------|
| d : doc1 | doc    | my first hello world |
[Slide 13]
An observer increments word counts

Fluo Table:

| Row          | Column | Value                |
|--------------|--------|----------------------|
| d : doc1     | doc    | my first hello world |
| w : first    | cnt    | 1                    |
| w : hello    | cnt    | 1                    |
| w : my       | cnt    | 1                    |
| w : world    | cnt    | 1                    |
| total : docs | cnt    | 1                    |
[Slide 14]
A second document is added

Fluo Table:

| Row         | Column | Value                |
|-------------|--------|----------------------|
| d : doc1    | doc    | my first hello world |
| d : doc2    | doc    | second hello world   |
| w : first   | cnt    | 1                    |
| w : hello   | cnt    | 2                    |
| w : my      | cnt    | 1                    |
| w : second  | cnt    | 1                    |
| w : world   | cnt    | 2                    |
| total : doc | cnt    | 2                    |
[Slide 15]
First document is marked for deletion

Fluo Table:

| Row         | Column | Value                |
|-------------|--------|----------------------|
| d : doc1    | doc    | my first hello world |
| d : doc1    | delete |                      |
| d : doc2    | doc    | second hello world   |
| w : first   | cnt    | 1                    |
| w : hello   | cnt    | 2                    |
| w : my      | cnt    | 1                    |
| w : second  | cnt    | 1                    |
| w : world   | cnt    | 2                    |
| total : doc | cnt    | 2                    |
[Slide 16]
Observer decrements counts and deletes document

After the Delete Observer runs, doc1's rows and the word counts that drop to zero (w : first, w : my) are removed:

| Row         | Column | Value              |
|-------------|--------|--------------------|
| d : doc2    | doc    | second hello world |
| w : hello   | cnt    | 1                  |
| w : second  | cnt    | 1                  |
| w : world   | cnt    | 1                  |
| total : doc | cnt    | 1                  |
[Slide 17]
Things to watch out for...
- Collisions occur when two transactions update the same data at the same time
  - Only one transaction will succeed; the others need to be retried
  - A few collisions are OK, but too many can slow computation
  - Avoid collisions by not updating the same row/column in every transaction
- Write skew occurs when two transactions read an overlapping data set and make disjoint updates without seeing each other's update
  - The result is different than if the transactions were serialized
  - Prevent write skew by making both transactions update the same row/column: if they run concurrently, a collision occurs and only one transaction succeeds
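Write skew is easiest to see in a small model (a sketch, not Fluo code: two transactions snapshot the same data, and each makes a disjoint write that would be safe on its own — the invariant x + y >= 1 is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class WriteSkewDemo {
    // Each "transaction" reads a snapshot, and if x + y would stay >= 1,
    // zeroes out its own cell. The writes are disjoint (one writes x, the
    // other y), so snapshot isolation alone detects no conflict.
    static Map<String, Integer> run(boolean concurrent) {
        Map<String, Integer> store = new HashMap<>();
        store.put("x", 1);
        store.put("y", 1);

        // Concurrent case: both transactions snapshot before either commits.
        Map<String, Integer> snapA = new HashMap<>(store);
        Map<String, Integer> snapB = concurrent ? new HashMap<>(store) : null;

        if (snapA.get("x") + snapA.get("y") > 1) store.put("x", 0); // txn A commits
        if (!concurrent) snapB = new HashMap<>(store);              // serial: B snapshots after A
        if (snapB.get("x") + snapB.get("y") > 1) store.put("y", 0); // txn B commits
        return store;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // serial: one write is blocked, invariant holds
        System.out.println(run(true));  // concurrent: both write, x + y drops to 0
    }
}
```

The slide's fix maps onto this model directly: if both transactions also wrote some shared row/column, the concurrent case would collide and one of them would be retried against the other's result.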
[Slide 18]
How does Fluo fit in?

[Chart: large-join throughput (higher to lower) vs. processing latency (slower to faster)]

- Batch Processing (MapReduce, Spark): highest large-join throughput, slowest latency
- Incremental Processing (Fluo, Percolator): in between on both axes
- Stream Processing (Storm): fastest latency, lowest large-join throughput
[Slide 19]
Don’t use Fluo if...
1. You want to do ad-hoc analysis on your data (use batch processing instead)
2. Your incoming data is being joined with a small data set (use stream processing instead)
[Slide 20]
Use Fluo if...
1. You want to maintain a large-scale computation using a series of small transactional updates
2. Periodic batch processing jobs are taking too long to join new data with existing data
[Slide 21]
Fluo Application Lifecycle
1. Use batch processing to seed computation with historical data
2. Use Fluo to process incoming data and maintain computation in real-time
3. While processing, Fluo can be queried and notifications can be sent to users
[Slide 22]
Major Progress

- 2010: Google releases the Percolator paper
- 2013: Keith Turner starts work on a Percolator implementation for Accumulo as a side project (originally called Accismus)
- 2014: Fluo can process transactions; 1.0.0-alpha released; the oracle and workers can be run in YARN; project renamed to Fluo
- 2015: 1.0.0-beta releasing soon; solidified the Fluo client/observer API; automated running a Fluo cluster on Amazon EC2; multi-application support; improved how observer notifications are found; created the stress test
[Slide 23]
Fluo Stress Test

- Motivation: needed a test that stresses Fluo and is easy to verify for correctness
- The stress test computes the number of unique integers by building a bitwise trie
- New integers are added at leaf nodes
- Observers watch all nodes, create parents, and percolate totals up to the root node
- The test runs successfully if the count at the root equals the number of leaf nodes
- Multiple transactions can operate on the same nodes, causing collisions

[Trie diagram: root xxxx = 5; internal nodes 11xx = 3, 10xx = 0, 01xx = 1, 00xx = 1; leaves include 1110, 1100, 0101, 0001]
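The trie computation can be sketched in plain Java (a simplified, non-transactional model with hypothetical names: each new integer percolates +1 up through its bit prefixes, so the root count equals the number of distinct integers — here fed the four distinct leaves recoverable from the slide):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StressTrieModel {
    static final int BITS = 4;
    final Set<Integer> leaves = new HashSet<>();
    final Map<String, Integer> nodeCounts = new HashMap<>(); // prefix like "11xx" -> count

    // Add a 4-bit integer; if it is new, percolate +1 up to the root ("xxxx").
    void add(int value) {
        if (!leaves.add(value)) return; // duplicate: nothing to percolate
        String bits = String.format("%4s", Integer.toBinaryString(value)).replace(' ', '0');
        for (int known = BITS - 1; known >= 0; known--) {
            String prefix = bits.substring(0, known) + "xxxx".substring(known);
            nodeCounts.merge(prefix, 1, Integer::sum);
        }
    }

    public static void main(String[] args) {
        StressTrieModel trie = new StressTrieModel();
        int[] values = {0b1110, 0b1100, 0b0101, 0b0001, 0b1110}; // one duplicate
        for (int v : values) trie.add(v);
        System.out.println(trie.nodeCounts.get("xxxx")); // 4 distinct integers
        System.out.println(trie.nodeCounts.get("11xx")); // 2
    }
}
```

In the real stress test each percolation step is a Fluo transaction, which is exactly why sibling updates near the root collide and exercise the retry machinery.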
[Slide 24]
Easy to run Fluo
1. On machine with Maven+Git, clone the fluo-dev and fluo repos
2. Follow some basic configuration steps
3. Run the following commands
```shell
fluo-dev download   # Downloads Accumulo, Hadoop, Zookeeper tarballs
fluo-dev setup      # Sets up Accumulo, Hadoop, etc. locally
fluo-dev deploy     # Builds a Fluo distribution and deploys it locally
fluo new myapp      # Creates configuration for the 'myapp' Fluo application
fluo init myapp     # Initializes 'myapp' in Zookeeper
fluo start myapp    # Starts the oracle and worker processes of 'myapp' in YARN
fluo scan myapp     # Prints a snapshot of the data in the Fluo table of 'myapp'
```

It's just as easy to run a Fluo cluster on Amazon EC2.
[Slide 25]
Fluo Ecosystem
- fluo: main project repo
- fluo-quickstart: simple Fluo example
- fluo-stress: stresses Fluo on a cluster
- fluo-io.github.io: Fluo project website
- phrasecount: in-depth Fluo example
- fluo-deploy: runs Fluo on an EC2 cluster
- fluo-dev: helps developers run Fluo locally
[Slide 26]
Future Direction

- Primary focus: release a production-ready 1.0 with a stable API
- Other possible work:
  - Fluo-32: real-world example application, possibly using CommonCrawl data
  - Fluo-58: support writing observers in Python
  - Fluo-290: support running Fluo on Mesos
  - Fluo-478: automatically scale Fluo workers up & down based on workload
[Slide 27]
Get involved!
1. Experiment with Fluo
   - API has stabilized
   - Tools and development process make it easy
   - Not recommended for production yet (wait for 1.0)
2. Contribute to Fluo
   - ~85 open issues on GitHub
   - Review-then-commit process