Search engines in the industry
a use case
Different interests
● researchers / engineers look for high precision and recall
● editors / writers are concerned about matching of queries and results
● marketers want to change / adapt results
Designing a search engine
● functional requirements
  ○ search
    ■ keywords, boolean retrieval, natural language
  ○ indexing
    ■ data sources
    ■ data types
  ○ administration
    ■ manage scoring / boosting functions
Designing a search engine
● architectural requirements
  ○ resiliency
  ○ scalability
  ○ no downtime
  ○ work with existing infrastructure
  ○ platforms
  ○ migrating from legacy systems
  ○ talk to other systems
Designing a search engine
● performance requirements
  ○ search
    ■ queries per second
    ■ time per search request
  ○ index
    ■ documents per second
    ■ time per indexing request
  ○ SLA?
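A minimal sketch of how the search-side numbers above could be checked against a target. The function names, the nearest-rank percentile choice, and the sample values and thresholds are all assumptions made for illustration, not figures from the project.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of numbers (0 < pct <= 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def check_search_sla(latencies_ms, window_seconds, max_p95_ms, min_qps):
    """Summarize one measurement window and compare it with the targets."""
    qps = len(latencies_ms) / window_seconds
    p95 = percentile(latencies_ms, 95)
    return {
        "qps": qps,
        "p95_ms": p95,
        "meets_sla": p95 <= max_p95_ms and qps >= min_qps,
    }

# Hypothetical measurements: 10 search requests observed over 2 seconds.
report = check_search_sla(
    [12, 15, 11, 30, 14, 13, 250, 16, 12, 18],
    window_seconds=2,
    max_p95_ms=200,
    min_qps=4,
)
print(report)
```

A single slow request is enough to miss a percentile target on a small sample, which is one reason to agree on the measurement window together with the SLA itself.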
Designing a search engine
● search engine performance requirements
  ○ recall percentiles threshold
  ○ precision percentiles threshold
  ○ minimize empty results
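As an illustration of the quality measures listed above, here is a small sketch that computes precision and recall for one query against a hand-labelled set of relevant documents, plus the share of queries that returned nothing. The document ids and the idea of a labelled relevance set are assumptions for the example.

```python
def precision_recall(returned_ids, relevant_ids):
    """Precision and recall of one result list against labelled relevant docs."""
    returned = set(returned_ids)
    relevant = set(relevant_ids)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def empty_result_rate(results_per_query):
    """Fraction of queries whose result list is empty."""
    if not results_per_query:
        return 0.0
    empty = sum(1 for results in results_per_query if not results)
    return empty / len(results_per_query)

# Hypothetical data: four documents returned, three known to be relevant.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
empty_rate = empty_result_rate([["d1"], [], ["d2", "d3"], []])
print(precision, recall, empty_rate)
```

Per-query values like these can then be collected over a query sample and compared with whatever percentile thresholds the stakeholders agree on.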
Data
● often mostly unknown
  ○ published vs unpublished / to be written documents
● almost always unmanageable
  ○ cannot decide when
    ■ it’ll be ready
    ■ it’ll have to be indexed
    ■ it’ll have to be searchable
● heterogeneous
  ○ different writers, languages, topics, styles, etc.
Process
Project
● ~50M heterogeneous documents
● Migrating from an old commercial solution to Apache Solr
● Google-like search
● Targeted search for different types of content
Advanced capabilities
● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents
Responsibilities
● architecture analysis and design
  ○ scaling under high load
● continuous definition of algorithms for indexing and searching
● system maintenance
Skills required
● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning
Architecture analysis and design
● Shape up a prototype architecture
  ○ separate machines for indexing and search
  ○ multiple load balanced machines for searching
  ○ define indexing and search algorithms
● Evaluate architecture
  ○ stress tests (performance)
  ○ quality tests (accuracy)
● Iterate
Architecture analysis and design
● analyze existing documents
  ○ avg size
  ○ language
  ○ topics, style, etc.
● analyze existing query logs
  ○ avg response time
  ○ avg length (how much does it take to specify a query?)
  ○ avg queries per second
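The query log figures above can be derived with a few lines of code. This is a sketch under assumptions: the log is taken to be a list of (timestamp in seconds, query text, response time in ms) tuples, and query length is counted in whitespace-separated terms. Real logs will need their own parsing.

```python
from statistics import mean

def summarize_query_log(entries):
    """entries: list of (timestamp_seconds, query_text, response_ms) tuples."""
    if not entries:
        raise ValueError("empty query log")
    timestamps = [ts for ts, _, _ in entries]
    span = max(timestamps) - min(timestamps)
    return {
        "avg_response_ms": mean(ms for _, _, ms in entries),
        "avg_query_terms": mean(len(query.split()) for _, query, _ in entries),
        # Throughput is undefined when all entries share one timestamp.
        "avg_qps": len(entries) / span if span > 0 else None,
    }

# Hypothetical log excerpt.
log = [
    (0.0, "solr migration", 20.0),
    (1.0, "nfs", 40.0),
    (2.0, "query elevation component", 30.0),
    (4.0, "boost", 30.0),
]
summary = summarize_query_log(log)
print(summary)
```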
Most time spent on
● testing how documents get indexed
● testing how user queries get transformed into platform-specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability
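To make the "user query to platform-specific query" step concrete, here is a sketch that turns free text into Solr request parameters using the Extended DisMax parser. The parameters `q`, `defType`, `qf`, `rows` and `fq` are standard Solr ones; the field names `title`, `body` and `content_type`, the boost value, and the function itself are assumptions for illustration and not the project's actual configuration.

```python
from urllib.parse import urlencode

def build_solr_params(user_query, content_type=None, rows=10):
    """Translate a free-text user query into Solr request parameters."""
    terms = user_query.split()  # collapses stray whitespace
    params = [
        ("q", " ".join(terms) if terms else "*:*"),  # match all when empty
        ("defType", "edismax"),
        ("qf", "title^3 body"),  # hypothetical fields, title boosted
        ("rows", str(rows)),
    ]
    if content_type:
        # Assumes content_type comes from a fixed, trusted list of values.
        params.append(("fq", f'content_type:"{content_type}"'))
    return params

params = build_solr_params("  apache   solr ", content_type="article")
query_string = urlencode(params)
print(query_string)
```

Keeping this translation in one small function makes it easy to test in isolation each time the ranking or field setup is tweaked.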
Challenges
● Architecture constraints
● Performance
● Diverging stakeholder concerns
● Dynamically scaling search
Sample architecture constraint #1
● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent reads and writes make it error prone
Sample architecture constraint #2
● Change search engine
● Systems talking to the SE need to switch API
● Only in the long run
● In the short run an adapter layer exposing the old APIs on top of the new APIs has to be developed
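A minimal sketch of such an adapter layer. Everything about the old API here is hypothetical (the `find` method and the `DOCID` / `TITLE` keys are invented for illustration); the new side is assumed to return a Solr-style JSON response with a `response.docs` list.

```python
class LegacySearchAdapter:
    """Exposes a hypothetical old-style `find` call on top of the new engine."""

    def __init__(self, new_search):
        # new_search: callable taking a dict of request parameters and
        # returning a Solr-style JSON dict.
        self._new_search = new_search

    def find(self, query_string, max_hits=10):
        raw = self._new_search({"q": query_string, "rows": max_hits})
        docs = raw.get("response", {}).get("docs", [])
        # Convert each new-format document back to the old record shape.
        return [{"DOCID": d.get("id"), "TITLE": d.get("title", "")} for d in docs]

def fake_new_search(params):
    """Stand-in for the real search client, used only for this example."""
    return {"response": {"numFound": 1, "docs": [{"id": "42", "title": "Solr on NFS"}]}}

hits = LegacySearchAdapter(fake_new_search).find("nfs", max_hits=5)
print(hits)
```

Callers keep using the old call shape while the engine behind it changes. The per-request conversion is extra work on every call.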
Indexing performance
● Most of the indexing time is spent converting data from the old (indexing) format to the new (indexing) format
● The adapter layer between old and new API becomes the bottleneck
● Time to switch to the new API natively
Diverging concerns
● Article authors check that the search engine handles their writings exactly, wanting perfect recall and precision
  ○ so a lot of time is spent on adjusting ranking
● Marketers want to be able to override ranking and put forward something they want to sell
  ○ ranking algorithm gets bypassed
● Need flexible algorithms
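One way to picture a flexible layer is to keep the relevance ranking untouched and merge editorially pinned documents on top of it afterwards. The sketch below is an assumption about how this could look, not the project's implementation; Solr also ships a Query Elevation Component for this kind of override.

```python
def apply_pins(ranked_ids, pinned_ids, limit=10):
    """Put pinned documents first, then the organic ranking, without duplicates."""
    seen = set()
    merged = []
    for doc_id in list(pinned_ids) + list(ranked_ids):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged[:limit]

# Hypothetical ids: "promo" is not in the organic results, "c" is moved up.
page = apply_pins(["a", "b", "c", "d"], ["promo", "c"], limit=4)
print(page)
```

Because the pins live outside the scoring function, authors' ranking adjustments and marketers' overrides can change independently of each other.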
Scale dynamically
● Search engine must not break even under high peaks of load
● Such peaks are often unpredictable
● Need a fast way to add more computing power
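A back-of-the-envelope sketch of the sizing question behind adding computing power: how many search machines a given load would need. The per-node throughput, the headroom factor, and the function name are assumptions; real values would come from the stress tests.

```python
import math

def search_nodes_needed(observed_qps, qps_per_node, headroom=0.7):
    """Nodes required so that each one runs at no more than `headroom` of capacity."""
    if qps_per_node <= 0 or not 0 < headroom <= 1:
        raise ValueError("qps_per_node must be > 0 and headroom in (0, 1]")
    return max(1, math.ceil(observed_qps / (qps_per_node * headroom)))

# Hypothetical peak: 900 queries per second, 200 per node at full capacity.
nodes = search_nodes_needed(900, 200)
print(nodes)
```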
Takeaways
● small iterations (no waterfalls!)
  ○ analyze portion of data / queries
  ○ change search / index algorithms
  ○ test, involve stakeholders
  ○ forces ability to reindex quickly
● look at data (documents, query logs)