Turning search upside down

Turning Search Upside Down using Lucene for very fast stored queries

Charlie Hull & Alan WoodwardManaging Director & Principal Developer at Flaxwww.flax.co.uk @FlaxSearch

http://www.flax.co.uk/

Who are Flax? Search engine specialists with decades of experience Developers, innovators and strategists Based in Cambridge, UK Technology agnostic – but open source exponents UK Authorized Partner of LucidWorks Customers include Reed Specialist Recruitment, NLA, Gorkana, Financial Times, News UK, UK Cabinet Office, Australian Associated Press, Accenture, University of Cambridge... We also run a Meetup group:

Who are we? Charlie Hull

– Managing Director and co-founder at Flax– I've been working in search since 1999– When I'm not running Flax I'm sometimes a stiltwalker &

juggler

Who are we? Charlie Hull

– Managing Director and co-founder at Flax– I've been working in search since 1999– When I'm not running Flax I'm sometimes a stiltwalker &

juggler

Alan Woodward– Principal developer at Flax– I used to run a 500m document FAST ESP index for Proquest– Lucene/Solr committer– When I'm not hacking I sing in the Cambridge University

Musical Society Chorus

Media monitoring Standard search – few queries over many documents

Monitoring search – many queries over each document

Customers interests manually turned into queries

Humans probably still have the final say on relevance

Eventual result is a list of articles emailed - or even printed

Exact position of a match is important to the customer

Media monitoring - parameters Tens of thousands of stored expressions or 'Keywords'

– can't rewrite these so must use same syntax!

Hundreds of thousands of articles to monitor every day

Source data can sometimes be scanned & OCR'd

False positives cost human operator time: false negatives cost customers!

Traditional approach is brute force using standard search engine software

Media monitoring – a Keyword("PALM BEACH COUNTY"; W/48 (("TOURIS*"; OR "TOUR"; OR "TOURS"; OR "TRAVEL*"; OR "HOLIDAY*"; OR "HOL"; OR "HOLS"; OR "HOTEL*"; OR "VISIT*"; OR "TRIP"; OR "TRIPS"; OR "DAYTRIP*"; OR "BEACH"; OR "!BEACHES"; OR "COAST"; OR "!COASTLINE*"; OR "ABTA"; OR "DAY TRIP*"; OR “SUITE"; OR "SUITES"; OR "A%CCOMMODATION"; OR "BED AND !BREAKFAST"; OR "B&B"; OR "BED & !BREAKFAST"; OR "!BREAKFAST AND BOARD"; OR "FULL BOARD"; OR "HALF BOARD"; OR "ALL !INCLUSIVE"; OR "THINGS TO DO"; OR "HOSP?TALITY"; OR "SHORT BREAK*"; OR "!WEEKEND BREAK"; OR "CITY BREAK*"; OR "!SIGHTSEE*"; OR "!VACATION*"; OR "E%XCURSION*"; OR "FLY* WITH"; OR "FLY* THERE"; OR "FLY* DRIVE"; OR "!GETAWAY"; OR "!BACKPACK*"; OR "BACK PACK*"; OR "!ECOTOURIS*"; OR "!WATERSPORT*"; OR "WATER SPORT*"; OR "FESTIVAL*"; OR "RESORT* & SPA"; OR "RESORT* AND SPA"; OR "WHALE WATCH*"; OR "GET THERE"; OR "WHERE TO STAY"; OR "GETTING THERE"; OR "STAYCATION*"; OR "VILLA"; OR "VILLAS"; OR "AIRPORT*"; OR "SPA"; OR "SPAS"; OR "OUTDOOR EVENT*"; OR "OUTDOOR ADVENTURE*"; OR "OUTDOOR PURSUIT*"; OR "OUTDOOR ACTIVIT*"; OR "CLIMBING WALL*"; OR "CLIMBING CENTRE*"; OR "ROCK CLIMB*"; OR "WHITE WATER RAFTING";) OR ("PLACES"; W/4 ("TO STAY"; OR "TO SEE"; OR "TO EAT";)) OR (("FLIGHT*"; OR "FLY"; OR "FLYING"; OR "CRUISE*";) W/4 ("OFFER"; OR "AVAILABLE"; OR "DEPART*"; OR "FROM"; OR "TRANSFER*";))))

Media monitoring – another Keyword(((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR ";G.P.R.S"; OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE COMM*"; OR ";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR ";H.S.U.P.A"; OR ";HIGH SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O"; OR ";SMS"; OR ";SHORT MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR ";!CELLPHONE*"; OR ";!TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR ";TELCO*"; OR ";VODAFONE"; OR ";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*"; OR ";TEXT MESSAG*"; OR ";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !WIRELESS";) W/48 ((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR ";!REFINANC*"; OR ";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR ";TAKEOVER*"; OR ";BUYOUT*"; OR ";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*"; OR ";ACCOUNT*"; OR ";MONEY"; OR ";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*"; OR ";JOINT*"; OR ";NEW VENTURE*"; OR ";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR ";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR ";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR ";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MIS-SELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*"; OR ";FIBRE OPTIC*"; OR ";TAX"; OR ";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR ";STAFF"; OR ";WORKER*"; OR ";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR ";UNFAIR"; OR ";%UNSCRUPULOUS"; OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*"; OR ";PRICE PLAN*"; OR ";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR ((";CALLS COST";) W/6 (";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))

Media monitoring – an article<?xml version="1.0" encoding="UTF-8"?> <document> <attributes> <field name="ARTICLEID">2277480</field> <field name="HEADLINE" type="TEXT">Holding a man of integrity, honour</field> <field name="SOURCEID" type="STRING">2</field> <field name="ARTAREA" type="TEXT">161.82</field> <field name="PAGESUM" type="STRING">1</field> <field name="SDATE" type="STRING">2011-08-02</field> <field name="PAGE" type="STRING">A03 NEW NAA</field> <field name="CDATE" type="DATE">2011-08-02 01:08:16</field> <field name="AUTHORNAME" type="STRING">MICHELLE GRATTAN</field> <field name="ARTAREAPERCENT" type="STRING">7.66</field> <field name="SENTENCES" type="STRING">LONG-time Victorian Labor leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch with his local community.</field> <field name="TEXT">Holding a man of integrity, honour By MICHELLE GRATTAN LONG-time Victorian Labor leader and Hawke government minister Clyde Holding, who has died aged 80, was hailed yesterday as a policy and party reformer who kept in touch with his local community. Mr Holding, sitting in the seat of Richmond, was Victorian opposition leader from 1967 to 1977 before entering federal politics as MP for Melbourne Ports in 1977. He held several portfolios in the Hawke government...</field> <field name="ORGFILENAME" type="STRING">AAAGE20110802NAANEWA03.pdf</field> <field name="SUBTITLE" type="STRING"></field> </attributes> </document>

So how does it work? Incoming article is added to a Lucene MemoryIndex

We parse the original Keyword syntax into a store of Lucene queries

Stored queries are run against this single-document index

Matches from each query are gathered and emitted

Simple!

It can't be that easy?Problems:

Doesn't scale very well with increasing numbers of queries Complex queries need to be rewritten for every document.

It can't be that easy?Problems:

Doesn't scale very well with increasing numbers of queries Complex queries need to be rewritten for every document.

Solution:

Cut down the number of queries run over each document Important not to get false negatives

“Turning search upside down” Narrow down our query space by selecting queries likely to match a given

document. Sounds a bit like... a search? Create an index of queries, using terms extracted from the query objects

themselves. Incoming documents are converted to a disjunction query using the

MemoryIndex's terms list. Specialised Collector then runs each matching query against the document. Massively reduces the number of queries run against a given document.

Indexing Queries TermQueries are simple – just extract the term For Disjunction queries, we extract terms from all subclauses For Conjunction queries, we can just use a single subclause Positional queries are treated as conjunctions Simple Numeric queries treated as terms Empty term sets are indexed with special terms to match all docs NumericRange queries are ignored

Indexing Queries RegexpQueries are a bit more complex:

Extract the longest exact substring from the regex and index that Create ngrams from the MemoryIndex termlist when building the document

disjunction query Reduces the effectiveness of the query filtering, but still allows significant

speedups.

Hacking Lucene & SolrTo allow MemoryIndex to expose positions we fixed for Lucene 4.x:

LUCENE-3831 - Passing a null fieldname to MemoryFields#terms in MemoryIndex throws a NPELUCENE-3827 - Make term offsets work in MemoryIndex

Contributed to the work on adding interval iterators to Lucene Scorers:

LUCENE 2878 - Allow Scorer to expose positions and payloads aka. nuke spans

...and then we rebuilt Solr on top of these modifications.

Why not use Elasticsearch percolators? Need to emit exact hit positions using Interval Iterators

Percolator doesn't allow such efficient optimisations based on document-specific filters – we need to cut down the number of queries we run as much as possible.

Why not use Elasticsearch percolators? Need to emit exact hit positions using Interval Iterators

Percolator doesn't allow such efficient optimisations based on document-specific filters – we need to cut down the number of queries we run as much as possible.

The fastest way to run a query is not to run it at all!

Introducing 'Luwak'A library for running many stored queries, fast.

Features Works with standard Lucene queries (TermQuery, BooleanQuery,

RegexpQuery...) Can be extended to work with custom query types Run multiple instances of it to scale for document volume Based on our fork of Lucene (should be merged into trunk soon though...) Available Real Soon Now on Github

Using LuwakMonitor monitor = new Monitor();monitor.addQueries(querylist);

InputDocument doc = source.getNextDocument();DocumentMatches matches = monitor.match(doc);

DocumentMatches object contains exact hit positions per-field for each query that matched the InputDocument, as well as some metrics (querycount, preptime, querytime).

Luwak output{"stats": {"preptime":3,"querytime":23,"querycount":1},"id":"C:\\TEST\\BMA\\2013\\04\\17\\e10a0737.xml","uniqueId":"e10a0737","Content":"… lots of xml …","QueryMatches":[

{"QueryId":"1", "FieldHighlights":[

{"Highlights":[{"WordOffset":11,"Offset":80,"Length":7,"Word":"quibble"},{"WordOffset":115,"Offset":739,"Length":7,"Word":"quibble"},{"WordOffset":117,"Offset":749,"Length":7,"Word":"quibble"}],

"Field":"bodytext"}]}]}

How fast is Luwak?For simple keywords (<20 terms):

70,000 keywords applied per second to an article Tested on a Macbook 20 times faster than our own previous implementation (using Xapian)

For more complex keywords (some run to three pages!) 20,000 keywords applied in 0.5 seconds Processes approximately 2000 articles/hour

Building Luwak into a monitoring system Monitors are packaged up into a service using Dropwizard We run multiple instances of the monitoring system to handle the required article throughput Queries held in a central database, and updates pushed out to each monitor instance using notification queues Articles are supplied in NewsML format, either on disk or via message brokers. We build a separate Solr index of articles – a searchable archive – for testing new Keywords and for end customers to use Everything uses RESTful Web Service APIs

What next? 3 clients are live with one in progress

Replacing systems built on dtSearch, Verity & Autonomy Can we create query parsers for other legacy systems?



Improve Luwak performance by sharding the Keyword set We know of cases of 1.5m keywords, hundreds of articles/sec




Get Lucene intervals code into trunk, so Luwak isn't based on a fork




Get Lucene intervals code into trunk, so Luwak isn't based on a fork

Find other uses for Luwak? Add metadata to items based on their content Classification into a taxonomy, nodes defined by Keywords

Thankyou!

Any questions?

[email protected]@[email protected]@romseygeekwww.flax.co.uk/blog +44 (0) 8700 118334

mailto:[email protected]

https://twitter.com/FlaxSearch

mailto:[email protected]

http://www.flax.co.uk/blog

Turning search upside down

Technology

Transcript of Turning search upside down