Rapid pruning of search space through hierarchical matching
RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING
Chandra Mouleeswaran, Machine Learning Scientist, ThreatMetrix Inc.
5/2/13 1
My Background
• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA
Career Path
- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette (Pershing): Artificial Intelligence Group
- Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading, etc.
Outline
• Task description
• Approaches
• Why search paradigm?
• Hierarchical matching
• Results
• Acknowledgments
The Device Identification Task
• Computationally, it's a CLASSIFICATION problem: { a0, a1, a2, a3, ..., an } → { ci }, where ai = ( attribute | field | key ) value and ci = ( label | signature | class | hash )
• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
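The classification task above can be sketched in a few lines. This is a minimal illustration, not ThreatMetrix code: attribute vectors are plain dicts, the similarity measure is a simple fraction of matching attributes, and the `tolerance` threshold and label scheme are hypothetical.

```python
# Sketch: map an attribute vector {a0..an} to a device label ci,
# creating a new class when no known device matches within tolerance.
# (Illustrative only; scoring and thresholds are made up for the example.)

def classify(attrs, repository, tolerance=0.6):
    """attrs: dict attribute -> value; repository: label -> attribute dict."""
    best_label, best_score = None, 0.0
    for label, known in repository.items():
        shared = [k for k in attrs if known.get(k) == attrs[k]]
        score = len(shared) / max(len(attrs), 1)  # fraction of matching attributes
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= tolerance:
        return best_label                         # returning device identified
    new_label = f"device-{len(repository)}"       # no good match: create new class
    repository[new_label] = dict(attrs)
    return new_label
```

A repository lookup that clears the tolerance returns the existing label; otherwise a fresh class is minted, matching the "new classes may be created" bullet.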
Task Challenges
• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time
Engineering Challenges
• Precision (accuracy) and latency (response time) are antagonistic constraints
• Project management

                  Repository Size (millions)   Load (TPS)   Latency (ms)
    Project start 28                           200          < 100
    Present       280                          300          < 100
    Change        10x                          1.5x         None
Approaches
• Rules engine
• Learning models
• Vector space models
Need an enterprise-grade solution!
Rules Engine
• No experts
• Number of rules?
• Maintenance?
Not a viable approach!
Learning Models
• Most machine learning methods deal predominantly with binary classification problems (e.g. fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression
Learning Models …
• Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use. Any simplification process will compromise accuracy
• Ability to explain is critical
• Tend to ignore domain knowledge
Challenge in providing an enterprise solution
Thoughts
• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally, and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field
Vector-Space Models
• Similarity-based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of the data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search
Sensitive to individual attribute changes!
Sources of Inspiration
• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored
Very short learning curve for novices!
Feature Selection
• Primitive and derived attributes
• Entropy
• Distribution
Domain
• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single field) approach is fast but not precise
• Using all fields is precise but response is slow
Now what?
Disjunction Max
• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries of each row
• Maximum score of sub-clauses is used by DisjunctionMaxQuery
• No single term in user input dominates
This is needed!
Src: SearchHub and LucidWorks
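The scoring idea behind DisjunctionMaxQuery can be illustrated outside Lucene. This is a simplified sketch (real Lucene adds a configurable tie-breaker multiplier, shown here as `tie`): each query term is scored against every field, contributes only its best field score, and term contributions are summed, so no single term or field dominates.

```python
# Sketch of DisjunctionMax-style scoring (simplified model, not Lucene code):
# per term, take the maximum field score; sum across terms.
# tie=0 gives a pure disjunction max; tie>0 lets the other fields
# contribute a small fraction, as Lucene's tieBreakerMultiplier does.

def dismax_score(term_field_scores, tie=0.0):
    """term_field_scores: one dict {field: score} per query term."""
    total = 0.0
    for scores in term_field_scores:
        best = max(scores.values())          # best-matching field wins
        others = sum(scores.values()) - best
        total += best + tie * others         # tie-broken disjunction max
    return total
```

With two terms scoring {a1: 2.0, a2: 0.5} and {a1: 0.0, a3: 1.0}, the pure-max score is 2.0 + 1.0, rather than the 3.5 a plain sum across fields would give.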
DisMax Experiments (index size = 60 Million)
Scenario 1
mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)
Scenario 2
mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, ..., termn }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
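The two scenarios map onto standard Solr request parameters (`defType`, `qf`, `mm`, `q` are real eDisMax parameters; the field names `a1..a3` follow the slides, and the helper itself is just an illustration):

```python
# Sketch of the two experimental configurations as Solr request parameters.
# Scenario 1: three fields, mm=2 must-match phrase clauses.
# Scenario 2: one field, mm=50% should-match term clauses.

def scenario_params(phrases=None, terms=None):
    if phrases:
        return {"defType": "edismax", "qf": "a1 a2 a3", "mm": "2",
                "q": " ".join(f'"{p}"' for p in phrases)}   # quoted phrases
    return {"defType": "edismax", "qf": "a1", "mm": "50%",
            "q": " ".join(terms)}                           # bare terms
```

The trade-off in the slides is visible in the shapes: few must-match phrase clauses are fast but imprecise, while many should-match terms over one field are precise but slow.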
Possible Workaround
• Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower-bound score
• Minimize candidates for DisMax search
- reduce total number of Solr instances to search
- reduce total number of disjunctive terms
[ Empirical estimate: t(n) = 2 * t(n-1), where t = time & n = number of disjunctive terms ]
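The empirical recurrence t(n) = 2 * t(n-1) unrolls to t(n) = t(1) * 2^(n-1): latency grows exponentially in the number of disjunctive terms, which is why trimming terms pays off multiplicatively rather than linearly. A tiny sketch of that arithmetic:

```python
# The slides' empirical estimate t(n) = 2 * t(n-1) implies
# t(n) = t(1) * 2**(n - 1): each extra disjunctive term roughly
# doubles query time, so removing k terms divides latency by 2**k.

def latency_estimate(base_ms, n_terms):
    """Closed form of the recurrence, given t(1) = base_ms."""
    return base_ms * 2 ** (n_terms - 1)
```

Under this model, going from 10 disjunctive terms to 7 (a 30% cut, as the phrase step achieved) would shrink estimated latency by a factor of 8.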
Phrases over Terms
• Used collocation (co-occurrence matrix) to determine the most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests
Reduced the number of DisMax terms by 30%
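The collocation step can be sketched as counting adjacent term pairs across field values, promoting frequent bigrams to phrases, and collapsing the terms they cover (a simplified illustration of the idea, not the production pipeline; the threshold and tokenization are assumptions):

```python
# Sketch of the phrases-over-terms step: build a bigram co-occurrence
# count, keep frequent pairs as phrases, then rewrite a term list so
# each phrase replaces the two terms it covers.
from collections import Counter

def common_phrases(values, min_count=2):
    """values: raw field strings; returns the set of frequent bigram phrases."""
    pairs = Counter()
    for v in values:
        toks = v.split()
        pairs.update(zip(toks, toks[1:]))   # adjacent co-occurrences
    return {" ".join(p) for p, c in pairs.items() if c >= min_count}

def compress(tokens, phrases):
    """Replace covered term pairs with phrases, left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and " ".join(tokens[i:i + 2]) in phrases:
            out.append(" ".join(tokens[i:i + 2]))  # phrase absorbs two terms
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Each absorbed pair removes one disjunctive clause from the DisMax query, which is how the 30% reduction in terms was obtained.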
Sources of Inspiration
• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th National Conference on Artificial Intelligence (AAAI-91), AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)
Hierarchical Matching
[Architecture diagram: CSV/JSON input feeds a Query Formulator that applies bag-of-words models, phrase models, domain-specific patterns, and DisMax filters; a Solr instances selector then routes the query to the Solr servers]
Verification
Conflict Resolution
• Top n candidates are returned from each Solr instance
• They are ranked based on a custom verification module
• Ties are broken using recency
• Top candidate is persisted and returned along with a custom score
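The conflict-resolution step above can be sketched as a merge-and-rank: pool the top-n candidates from every Solr instance, order by the custom verification score, and break ties on recency. The candidate shape (`id`, `last_seen`) and the `verify` callback are hypothetical stand-ins for the custom module:

```python
# Sketch of conflict resolution: merge per-instance candidate lists,
# rank by verification score with recency as the tiebreaker, and
# return the winning candidate's id (None if nothing came back).

def resolve(candidate_lists, verify):
    """candidate_lists: one list per Solr instance of dicts with
    'id' and 'last_seen'; verify: candidate -> custom score."""
    merged = [c for lst in candidate_lists for c in lst]
    if not merged:
        return None
    # Python compares the tuples lexicographically: score first, recency second.
    best = max(merged, key=lambda c: (verify(c), c["last_seen"]))
    return best["id"]
```

Sorting on a (score, recency) tuple is a compact way to express "ties are broken using recency" without a second pass.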
Comments
• DisMax performs multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
- Selection = approximate solution
- Evaluation = refinement
Where time went…
• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing
Sources for Tune Up
• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond Just Looking It Up, Lucid Imagination, May 2010
Testing
• Precision testing using self and mixed modes
• Latency tests
- custom harness for stand-alone tests
- integrated tests with the JMeter framework
Results
Latency Percentiles
[Chart: latency percentiles for the original edismax, the initial solution, Optimization 1 (filters, focused search, verification), and Optimization 2 (domain patterns, stop words, de-dupe)]
TPS
Response Times over Time
Project Execution
• Agile methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax Enterprise) & SolrCloud
Gleanings
• You can classify anything with Lucene/Solr, lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future
Gleanings …
• Deal with accuracy before latency
• If precision, latency, and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission-critical applications
• Rapid prototyping and willingness to fail
Summary
Simplify and match at multiple levels of abstraction
Contributors
Chandra Mouleeswaran Research & Prototyping
Fang Chen Research & Prototyping
Luke Mertens Productization & Scalability
Brent Pearson Release Management
Tracy Hsu Precision Testing & QA
Srinivas Nayani Deployment & QA
COMMENTS & FEEDBACK: Chandra Mouleeswaran [email protected]