Nov 2010 HUG: Fuzzy Table - B.A.H
Transcript of Nov 2010 HUG: Fuzzy Table - B.A.H
![Page 2: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/2.jpg)
Session Agenda
Fuzzy Matching?
The Big Data Problem
A Scalable Solution
Performance
Questions?
![Page 3: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/3.jpg)
Fuzzy Matching?
![Page 4: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/4.jpg)
What is Fuzzy Matching?
These images are very similar, but obviously not the same. To find image #2 given image #1, some sort of fuzzy matching technique needs to be used.
Steps shown for images 1 and 2:
Start with some multimedia (image/voice/audio/video/etc.)
Feature extraction and normalization
Create a vector or matrix of doubles
Distance function* between the two results: 31.46
*Euclidean distance in this example
Images from Flickr; licensed under Creative Commons:
http://www.flickr.com/photos/mdpettitt/455527136/sizes/l/in/photostream/
http://www.flickr.com/photos/mdpettitt/455539917/sizes/l/in/photostream/
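The distance step on this slide can be sketched in a few lines of Java. This is only an illustration of Euclidean distance between two feature vectors of doubles; the class name `EuclideanDistance` and the feature values are made up, and the feature extraction that would produce real vectors is not shown.

```java
// Sketch: Euclidean distance between two feature vectors of equal length.
// The class name and the sample values are hypothetical, not from the talk.
final class EuclideanDistance {
    private EuclideanDistance() {}

    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("feature vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        double[] image1 = {0.0, 3.0, 1.0}; // made-up feature values
        double[] image2 = {4.0, 0.0, 1.0};
        System.out.println(distance(image1, image2)); // sqrt(16 + 9 + 0) = 5.0
    }
}
```

A smaller distance means the two items are more alike; a fuzzy match is then a question of whether the distance falls under some application-specific cutoff.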
![Page 5: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/5.jpg)
How Is Fuzzy Matching Being Used Today?
![Page 6: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/6.jpg)
Why Do We Care?
At the forefront of strategy and technology consulting for nearly a century
Deep functional knowledge spanning strategy and organization, technology, operations, and analytics
US government agencies in the defense, security, and civil sectors, as well as corporations, institutions, and not-for-profit organizations
![Page 7: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/7.jpg)
Biometrics – A Fuzzy Matching Problem
Same Person?
Lifted From A Crime Scene
Law Enforcement Database
![Page 8: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/8.jpg)
Biometrics – Example
Diagram, shown for samples 1 and 2:
Query biometrics database
Feature extraction and normalization
Create a vector or matrix of doubles
Distance function* between the two results: 2.41
*Euclidean distance in this example
![Page 9: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/9.jpg)
The Big Data Problem
![Page 10: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/10.jpg)
Growth of Multimedia Databases
Flickr – over 5 billion images
ImageShack – over 20 billion unique images
YouTube – over 6 billion videos
Hulu – over 380 million videos
http://techcrunch.com/2009/04/07/who-has-the-most-photos-of-them-all-hint-it-is-not-facebook/
http://ksudigg.wetpaint.com/page/YouTube+Statistics
http://techcrunch.com/2009/04/28/as-youtube-passes-a-billion-unique-us-viewers-hulu-rushes-into-third-place/
![Page 11: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/11.jpg)
Growth of Biometric Databases
Combined U.S. government databases will soon hold billions of identities
DHS’s US-VISIT has the world’s largest and fastest biometric database: over 110 million identities and 145,000 transactions daily*
The FBI’s Integrated Automated Fingerprint Identification System has 66 million identities with 8,000 added daily**
India is working on a database of fingerprints and face images for its population of 1.2 billion***
European Union’s Biometric Matching System (supporting visa applications, immigration, and border control) will grow to 70 million****
AllTrust Networks’ Paycheck Secure system has over 6 million identities and has performed over 70 million transactions*****
* US-VISIT: The world’s largest biometric application. William Graves.
** http://www.fbi.gov/hq/cjisd/iafis/iafis_facts.htm
*** http://www.business-standard.com/india/news/national-population-register-to-start-biometrics-data-collectiondec/399135/
**** http://www.findbiometrics.com/articles/i/5220/
***** http://www.alltrustnetworks.com/News/6Million/tabid/378/Default.aspx
![Page 12: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/12.jpg)
Biometric Databases are a Big (Data) Problem
Large-scale operations: searching and storing hundreds of millions to billions of identities
Multiple biometric templates and raw files per identity: fingerprints, faces, and iris
New raw files and templates stored on each verification: computation to update models for identity
Results are expected in real time
Cost-efficient storage and retrieval is hard: need innovative ways to reduce cost per match
500 M identities x (16 KB to 300 KB) x (10 to 20) = 1 – 2 PB
500 M identities x (256 b to 3 KB) x (10 to 20) = 2 – 27 TB
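As a rough check on the two storage estimates, the sketch below multiplies out the extreme ends of each range. It rests on several assumptions that the slide does not state: decimal units (1 KB = 1,000 bytes), "256 b" read as 256 bytes, the larger items being raw files and the smaller ones templates, and "(10 to 20)" being the number of items stored per identity. The class and method names are hypothetical.

```java
// Sketch: extreme bounds implied by the slide's ranges, under the assumptions
// stated above (decimal units, "256 b" read as bytes). Names are hypothetical.
final class StorageEstimate {
    private static final long KB = 1_000L;
    private static final double TB = 1e12;

    private StorageEstimate() {}

    /** Total bytes for a number of identities, a per-item size, and an item count per identity. */
    static long totalBytes(long identities, long bytesPerItem, long itemsPerIdentity) {
        return Math.multiplyExact(Math.multiplyExact(identities, bytesPerItem), itemsPerIdentity);
    }

    public static void main(String[] args) {
        long identities = 500_000_000L;

        // Larger items (assumed to be raw files): 16 KB to 300 KB each, 10 to 20 per identity.
        System.out.println("raw files, low end (TB):  " + totalBytes(identities, 16 * KB, 10) / TB);
        System.out.println("raw files, high end (TB): " + totalBytes(identities, 300 * KB, 20) / TB);

        // Smaller items (assumed to be templates): 256 bytes to 3 KB each, 10 to 20 per identity.
        System.out.println("templates, low end (TB):  " + totalBytes(identities, 256, 10) / TB);
        System.out.println("templates, high end (TB): " + totalBytes(identities, 3 * KB, 20) / TB);
    }
}
```

Under these assumptions the extremes come out to roughly 80 TB to 3 PB for the larger items and 1.28 TB to 30 TB for the smaller ones, so the totals quoted on the slide sit inside those bounds.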
![Page 13: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/13.jpg)
A Scalable Solution
![Page 14: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/14.jpg)
Hadoop and Multimedia Databases
HDFS as file storage for petabytes’ worth of multimedia (images/audio/video/etc.): redundancy, distribution
Mahout and MapReduce used for indexing and binning similar objects: improves overall search speeds
Improving feature selection by analyzing the entire database with MapReduce: select the most effective features for distinguishing identities
N-to-N matching search (a special type of identification search) to cleanse the database
What about low latency matching?
![Page 15: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/15.jpg)
Fuzzy Table: Large-scale, Low Latency, Fuzzy Matching Database
Designed for biometric applications, has uses in other domains
Enables fast parallel search of keys that cannot be effectively ordered: biometrics, images, audio, video
Enabled by Mahout and MapReduce for binning, re-encoding, and other bulk data operations
Inherits nice features of Hadoop: horizontal scalability over commodity hardware; distributed and parallel computation; high reliability and redundancy
![Page 16: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/16.jpg)
Fuzzy Table Architecture
![Page 17: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/17.jpg)
Bulk Binning and Real-time Classification
* Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot
![Page 18: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/18.jpg)
Fuzzy Table: Bulk Data Processing Component
Canopy clustering and k-means clustering partition data into bins: reduces search space; this concept is based on work done in academia*
Centroids from k-means clustering are used to create a “bin classifier”: determines the best bins to search for a given key
{Key, Value} records are stored as Sequence Files in HDFS: spread across the cluster for optimal parallel searching
MapReduce is used for all other bulk or batch data processing: batch operations (many-to-many search, duplicate detection); encoding the raw files into feature vectors; feature evaluation
* Efficient Search and Retrieval in Biometric Databases. Amit Mhatre, Srinivas Palla, Sharat Chikkerur, and Venu Govindaraju
* Efficient fingerprint search based on database clustering. Manhua Liu, Xudong Jiang, Alex Chichung Kot
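One plausible reading of the "bin classifier" is a nearest-centroid ranking: compute the distance from the query key to every k-means centroid and search the closest bins first. The sketch below illustrates that idea only; the talk does not say how the classifier is implemented, and `BinClassifier`, `bestBins`, and the centroid values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of a nearest-centroid "bin classifier" (an assumed design, not the
// talk's implementation): rank bins by distance from the key to each centroid.
final class BinClassifier {
    private final double[][] centroids; // centroids[i] is the centroid of bin i

    BinClassifier(double[][] centroids) {
        this.centroids = centroids;
    }

    /** Returns the indices of up to maxBins bins whose centroids are closest to the key. */
    List<Integer> bestBins(double[] key, int maxBins) {
        List<Integer> bins = new ArrayList<>();
        for (int i = 0; i < centroids.length; i++) {
            bins.add(i);
        }
        bins.sort(Comparator.comparingDouble((Integer bin) -> squaredDistance(key, centroids[bin])));
        return new ArrayList<>(bins.subList(0, Math.min(maxBins, bins.size())));
    }

    // Squared Euclidean distance is enough for ranking; the square root does not change the order.
    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}, {5.0, 5.0}}; // made-up centroids
        BinClassifier classifier = new BinClassifier(centroids);
        System.out.println(classifier.bestBins(new double[] {6.0, 6.0}, 2)); // [2, 1]
    }
}
```

Searching more than one bin trades latency for recall: a key that falls near a cluster boundary may have its true match stored in a neighboring bin.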
![Page 19: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/19.jpg)
Procedure
![Page 20: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/20.jpg)
Fuzzy Table: Data Storage and Bins
Bins are represented as directories and contain chunk files: /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
Chunk files contain many {Key, Value} pairs: Key is a biometric template, Value is a reference to the biographic record
Chunks are the same size as an HDFS block to simplify data-local search
HDFS load balancing distributes data evenly across the cluster: enables parallel search
Replication provides fault tolerance and speculative execution of queries
Data Servers only search local chunk files: results are returned in real time as soon as a match is found; preserves the principle of keeping computation next to data
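The directory layout can be illustrated with a small path builder. The sketch generalizes from the single example path on the slide, so the six-digit zero padding and the reading of "fingerprints" as a table name are assumptions, and `ChunkPaths` is a hypothetical helper rather than part of Fuzzy Table.

```java
import java.util.Locale;

// Sketch: build bin and chunk paths in the layout of the slide's example path.
// Padding width and naming scheme are inferred from that one example.
final class ChunkPaths {
    private ChunkPaths() {}

    static String binDir(String table, int bin) {
        return String.format(Locale.ROOT, "/fuzzytable/_table_%s/_bin_%06d", table, bin);
    }

    static String chunkFile(String table, int bin, int chunk) {
        return String.format(Locale.ROOT, "%s/_chunk_%06d", binDir(table, bin), chunk);
    }

    public static void main(String[] args) {
        System.out.println(chunkFile("fingerprints", 1, 1));
        // /fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001
    }
}
```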
![Page 21: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/21.jpg)
Low Latency Component
After data is organized, we want to retrieve it quickly
Does not use MapReduce: MapReduce is high latency due to jar shipping and other misc. tasks which support redundancy in the process
Need a lightweight framework to perform real-time queries with minimum overhead
Provides real-time matching and responses over an Apache Avro-based protocol
![Page 22: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/22.jpg)
Fuzzy Table Query
![Page 23: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/23.jpg)
Fuzzy Table Query
![Page 24: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/24.jpg)
Fuzzy Table Query
![Page 25: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/25.jpg)
Fuzzy Table Query
![Page 26: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/26.jpg)
Fuzzy Table Query
![Page 27: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/27.jpg)
Fuzzy Table Query
![Page 28: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/28.jpg)
Fuzzy Table Query
![Page 29: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/29.jpg)
Fuzzy Table Query
![Page 30: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/30.jpg)
Fuzzy Table: Optimizations
Master Server HDFS metadata caching
The HDFS Namenode is a performance bottleneck for low latency searches
Master Server caches HDFS block locations for all Fuzzytable files (bins and chunk files)
Periodic refresh of the cache keeps its metadata current
Increased HDFS replication factor (replication factor of N): Fuzzytable is close to a read-only system; data replication enables speculative execution
Data Servers only perform searches against data that resides locally on disk
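The metadata caching idea can be sketched as a snapshot that is reloaded on a timer, so that queries read from memory instead of asking the Namenode each time. This is a minimal illustration under assumptions: `BlockLocationCache` and its methods are hypothetical names, and the `loader` supplier stands in for whatever Namenode lookup the Master Server actually performs.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch: cache of chunk-file locations, reloaded wholesale at a fixed interval.
// All names are hypothetical; the loader is a stand-in for a Namenode lookup.
final class BlockLocationCache implements AutoCloseable {
    private final Supplier<Map<String, List<String>>> loader;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "block-location-refresh");
                t.setDaemon(true); // do not keep the JVM alive just for refreshes
                return t;
            });
    private volatile Map<String, List<String>> locations = Map.of();

    BlockLocationCache(Supplier<Map<String, List<String>>> loader, long refreshSeconds) {
        this.loader = loader;
        refresh(); // initial load
        scheduler.scheduleAtFixedRate(this::refresh, refreshSeconds, refreshSeconds, TimeUnit.SECONDS);
    }

    /** Reloads the snapshot; on failure the previous snapshot keeps being served. */
    void refresh() {
        try {
            locations = Map.copyOf(loader.get());
        } catch (RuntimeException e) {
            // Keep the old snapshot in this sketch; a real server would log the failure.
        }
    }

    /** Hosts that hold the given chunk file, answered from memory. */
    List<String> hostsFor(String chunkFile) {
        return locations.getOrDefault(chunkFile, List.of());
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) {
        String chunk = "/fuzzytable/_table_fingerprints/_bin_000001/_chunk_000001";
        Map<String, List<String>> fake = Map.of(chunk, List.of("dataserver-1", "dataserver-2"));
        try (BlockLocationCache cache = new BlockLocationCache(() -> fake, 60)) {
            System.out.println(cache.hostsFor(chunk));
        }
    }
}
```

Because the data is close to read-only, a stale snapshot between refreshes is a tolerable cost in this design; the refresh interval bounds how long a moved block can go unnoticed.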
![Page 31: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/31.jpg)
Performance
![Page 32: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/32.jpg)
Performance and Scalability Testing On EC2
Employed EC2 for all testing
Downloaded ~1 TB of images from Flickr (100 nodes)
Performed the Bulk Processing Component’s tasks across all 1 TB of images (80 nodes): duplicate detection and removal; feature extraction and normalization; Mahout’s canopy clustering; Mahout’s k-means clustering; join clusters with features; post-processing data into bins and chunk files
Ran a series of test iterations against the low latency component: querying increasing cluster sizes; queries performed using random images from the larger set
![Page 33: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/33.jpg)
Average Query Times
Chart: time to respond (ms) versus number of Data Servers
![Page 34: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/34.jpg)
Average Query Times
Chart: time to respond (ms) versus number of Data Servers
Linear scalability to ~7 nodes
Lower limit due to I/O latencies
![Page 35: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/35.jpg)
Longest Query Times
Chart: time to respond (ms) versus number of Data Servers
Frequent Namenode access + large number of DFS clients begins to erode performance
![Page 36: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/36.jpg)
Shortest Query Times
Chart: time to respond (ms) versus number of Data Servers
~500 ms
![Page 37: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/37.jpg)
EC2 Results Discussion
Linear scalability – great!
One data point shows 500 ms queries are possible
I/O latency is a lower bound on average query response time: combined disk, network
Future enhancements: reduce disk penalty via hardware, clever data structures, specialized data store
Reliance on HDFS/Namenode for filesystem metadata is another bottleneck: optimizations to HDFS client; distributed Namenode
![Page 38: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/38.jpg)
Performance and Scalability (Local)
Instrumented Master Server code
Compared initial implementation that accesses Namenode frequently with rework that caches filesystem metadata
Results matched those anticipated from EC2 testing
![Page 39: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/39.jpg)
Caching Performance
Chart: average response time (ns) versus number of threads polling the Master Server
Major discrepancy, grows with load
![Page 40: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/40.jpg)
Conclusion & Future Work
Large-scale, real-time multimedia/biometric database search is a hard problem, and it’s becoming computationally more expensive as the amount of data grows
Hadoop is a potential solution to this problem: MapReduce can be used for bulk processing to enable distributed, low latency fuzzy matching over HDFS
Hadoop is a great platform for solving all sorts of Big Data and distributed computing problems, even for low latency searching
Future work: Hadoop-level optimizations; currently implementing a new version based on HBase which supports online insertion and reorganization
![Page 41: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/41.jpg)
Contact Information – Cloud Computing Team
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701, (301) 543-4611
Michael Ridley, Associate
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701, (301) 543-4400
Jason Trost, Associate
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701, (301) 821-8000
Edmund Kohlwey, Senior Consultant
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701, (301) 821-8000
Robert Gordon, Associate
Booz Allen Hamilton Inc., 134 National Business Parkway, Annapolis Junction, Maryland 20701, (301) 617-3523
Jesse Yates, Consultant
@jason_trost
@ekohlwey
@jesse_yates
@mikeridley
![Page 42: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/42.jpg)
Thanks
Lalit Kapoor (@idefine) – Former team member
Brandyn White (@brandynwhite) – Assistance with Flickr image retrieval
![Page 43: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/43.jpg)
Questions
![Page 44: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/44.jpg)
Questions?
![Page 45: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/45.jpg)
Appendix
![Page 46: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/46.jpg)
Technologies Used
Cloudera’s Distribution of Hadoop (CDH3): MapReduce, HDFS
Mahout
Avro
Amazon EC2
Ubuntu Linux
Java
Python
Bash
![Page 47: Nov 2010 HUG: Fuzzy Table - B.A.H](https://reader035.fdocuments.in/reader035/viewer/2022070319/558390f0d8b42a8e0c8b5298/html5/thumbnails/47.jpg)
Fuzzy Table: Low Latency Fuzzy Matching Component Details
The low latency component consists of three main parts
Client – submits queries for Keys and gets back {Key, Value} pairs
Master Server – serves metadata about which Data Servers host which bins
Data Servers – actually perform fuzzy matching searches
Data Servers perform fuzzy matching against Keys in order to find {Key, Value} records
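// Score the query key against a stored key; the record matches when the score reaches the threshold.
double score = fuzzyMatcher.match(key, storedRec.getKey());
if (score >= threshold) {
    return storedRec;
}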
Fuzzy matching searches are performed in parallel across many Data Servers
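The parallel search can be approximated on a single machine with a thread pool, where each task plays the role of one Data Server scanning its local records. This is a sketch under assumptions, not the Fuzzy Table implementation: `ParallelFuzzySearch`, `StoredRecord`, and the `similarity` score (1 / (1 + Euclidean distance), so higher means closer, consistent with the `score >= threshold` test above) are all hypothetical, and the Avro-based networking is left out.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.ToDoubleBiFunction;

// Sketch: scan several partitions in parallel and return the first record whose
// score reaches the threshold. Threads stand in for Data Servers here.
final class ParallelFuzzySearch {
    /** A stored {Key, Value} pair: a feature-vector key and a reference to a record. */
    static final class StoredRecord {
        final double[] key;
        final String value;

        StoredRecord(double[] key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical similarity score: 1 / (1 + Euclidean distance); identical keys score 1.0. */
    static double similarity(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return 1.0 / (1.0 + Math.sqrt(sum));
    }

    static Optional<StoredRecord> search(double[] query,
                                         List<List<StoredRecord>> partitions,
                                         ToDoubleBiFunction<double[], double[]> matcher,
                                         double threshold) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        CompletionService<Optional<StoredRecord>> completed = new ExecutorCompletionService<>(pool);
        try {
            for (List<StoredRecord> partition : partitions) {
                completed.submit(() -> {
                    for (StoredRecord rec : partition) {
                        if (matcher.applyAsDouble(query, rec.key) >= threshold) {
                            return Optional.of(rec);
                        }
                    }
                    return Optional.<StoredRecord>empty();
                });
            }
            for (int i = 0; i < partitions.size(); i++) {
                try {
                    Optional<StoredRecord> result = completed.take().get();
                    if (result.isPresent()) {
                        return result; // first match wins; the finally block cancels the rest
                    }
                } catch (ExecutionException e) {
                    // A failed partition is skipped in this sketch.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow();
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<StoredRecord> server1 = List.of(new StoredRecord(new double[] {9.0, 9.0}, "record-A"));
        List<StoredRecord> server2 = List.of(new StoredRecord(new double[] {1.0, 1.0}, "record-B"));
        Optional<StoredRecord> hit = search(new double[] {1.0, 1.0}, List.of(server1, server2),
                ParallelFuzzySearch::similarity, 0.9);
        System.out.println(hit.map(r -> r.value).orElse("no match")); // record-B
    }
}
```

Returning as soon as any partition reports a match mirrors the earlier slide's point that results come back as soon as a match is found, rather than after every server has finished scanning.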