Search engines in the industry


Page 1: Search engines in the industry

Search engines in the industry

a use case

Page 2: Search engines in the industry
Page 3: Search engines in the industry

Different interests

● researchers / engineers look for high precision and recall

● editors / writers are concerned about matching of queries and results

● marketers want to change / adapt results

Page 4: Search engines in the industry

Designing a search engine

● functional requirements
  ○ search
    ■ keywords, boolean retrieval, natural language
  ○ indexing
    ■ data sources
    ■ data types
  ○ administration
    ■ manage scoring / boosting functions
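
To make the keyword and boolean retrieval requirement concrete, here is a minimal sketch of an inverted index with AND / OR lookups. It assumes whitespace tokenization and lowercasing only; the names `build_index`, `search_and` and `search_or` and the toy documents are hypothetical, and a real engine such as Lucene does far more (analysis, scoring, compression).

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search_and(index, query):
    """Boolean AND: ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def search_or(index, query):
    """Boolean OR: ids of documents containing at least one query term."""
    result = set()
    for term in query.lower().split():
        result |= index.get(term, set())
    return result


# Hypothetical toy collection
docs = {
    1: "solr is a search platform",
    2: "lucene is a search library",
    3: "nfs is a file system",
}
index = build_index(docs)
```

With this toy collection, `search_and(index, "search library")` matches only document 2, while `search_or(index, "solr lucene")` matches documents 1 and 2.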

Page 5: Search engines in the industry

Designing a search engine

● architectural requirements
  ○ resiliency
  ○ scalability
  ○ no downtime
  ○ work with existing infrastructure
  ○ platforms
  ○ migrating from legacy systems
  ○ talk to other systems

Page 6: Search engines in the industry

Designing a search engine

● performance requirements
  ○ search
    ■ queries per second
    ■ time per search request
  ○ index
    ■ documents per second
    ■ time per indexing request
  ○ SLA?
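
The two search-side figures above (queries per second, time per search request) can be measured around any search call. The following is a rough sketch only: it runs queries sequentially, so it says nothing about behaviour under concurrent load, and `measure` and `fake_search` are hypothetical names standing in for a real client.

```python
import statistics
import time


def measure(search_fn, queries):
    """Run each query once and report throughput and per-request latency."""
    if not queries:
        raise ValueError("need at least one query")
    latencies = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "requests": len(latencies),
        "queries_per_second": len(latencies) / elapsed if elapsed > 0 else float("inf"),
        "mean_latency_s": statistics.mean(latencies),
        "max_latency_s": max(latencies),
    }


def fake_search(query):
    """Stand-in for a real search call (assumption: replace with a request to the engine)."""
    return query.split()


report = measure(fake_search, ["apache solr", "nfs lucene", "ranking"])
```

The same wrapper can be pointed at an indexing call to obtain documents per second and time per indexing request.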

Page 7: Search engines in the industry

Designing a search engine

● search engine performance requirements
  ○ recall percentiles threshold
  ○ precision percentiles threshold
  ○ minimize empty results
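
One possible reading of these requirements is "a given share of queries must reach a minimum precision / recall, and few queries may return nothing". The sketch below computes those figures for a set of judged queries. The function names and the sample judgments are hypothetical, and the slides do not say how the thresholds were actually defined.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query, given retrieved and relevant doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def share_meeting_threshold(values, threshold):
    """Fraction of queries whose metric is at or above the threshold."""
    return sum(v >= threshold for v in values) / len(values)


def empty_result_rate(results_per_query):
    """Fraction of queries that returned no results."""
    return sum(1 for r in results_per_query if not r) / len(results_per_query)


# Hypothetical judged queries: (retrieved ids, relevant ids)
judged = [
    ([1, 2, 3, 4], [1, 2]),
    ([5], [5, 6]),
    ([], [7]),
]
scores = [precision_recall(ret, rel) for ret, rel in judged]
precisions = [p for p, _ in scores]
recalls = [r for _, r in scores]
```

Here two of the three queries reach a recall of at least 0.5, and one of the three returns an empty result list.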

Page 8: Search engines in the industry

Data

● often mostly unknown
  ○ published vs unpublished / to be written documents
● almost always unmanageable
  ○ cannot decide when
    ■ it’ll be ready
    ■ it’ll have to be indexed
    ■ it’ll have to be searchable
● heterogeneous
  ○ different writers, languages, topics, styles, etc.

Page 9: Search engines in the industry

Process

Page 10: Search engines in the industry

Project

● ~50M heterogeneous documents
● Migrating from an old commercial solution to Apache Solr
● Google-like search
● Targeted search for different types of content

Page 11: Search engines in the industry

Advanced capabilities

● Smart understanding of queries
● Smart suggestion of queries
● Suggestion of similar / important contents
● Automatic classification of contents

Page 12: Search engines in the industry

Responsibilities

● architecture analysis and design
  ○ scaling under high load

● continuous definition of algorithms for indexing and searching

● system maintenance

Page 13: Search engines in the industry

Skills required

● basics of information retrieval
● a bit of distributed systems
● some natural language processing
● some machine learning

Page 14: Search engines in the industry

Architecture analysis and design

● Shape up a prototype architecture
  ○ separate machines for indexing and search
  ○ multiple load balanced machines for searching
  ○ define indexing and search algorithms
● Evaluate architecture
  ○ stress tests (performance)
  ○ quality tests (accuracy)

● Iterate

Page 15: Search engines in the industry

Architecture analysis and design

● analyze existing documents
  ○ avg size
  ○ language
  ○ topics, style, etc.
● analyze existing query logs
  ○ avg response time
  ○ avg length (how long does it take to specify a query?)
  ○ avg queries per second
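
The query-log figures listed above can be computed with a few lines once the logs are parsed. This is a sketch under assumptions: the record layout (timestamp in seconds, query text, response time in milliseconds) and the sample entries are hypothetical, and query length is counted in whitespace-separated terms.

```python
from collections import Counter

# Hypothetical log records: (unix timestamp in seconds, query text, response time in ms)
LOG = [
    (1000, "apache solr", 120),
    (1000, "search engine migration", 80),
    (1001, "nfs", 200),
    (1003, "lucene index on nfs", 100),
]


def analyze(log):
    """Average response time, average query length in terms, and queries per second."""
    n = len(log)
    timestamps = [ts for ts, _, _ in log]
    span = max(timestamps) - min(timestamps) + 1  # seconds covered, inclusive
    per_second = Counter(timestamps)
    return {
        "avg_response_ms": sum(ms for _, _, ms in log) / n,
        "avg_query_terms": sum(len(q.split()) for _, q, _ in log) / n,
        "avg_qps": n / span,
        "peak_qps": max(per_second.values()),
    }


stats = analyze(LOG)
```

For the four sample records this gives an average response time of 125 ms, 2.5 terms per query, an average of 1 query per second and a peak of 2 queries in one second.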

Page 16: Search engines in the industry

Most time spent on

● testing how documents get indexed
● testing how user queries get transformed into platform-specific queries
● tweaking indexing algorithms
● tweaking search algorithms
● tweaking ranking
● platform optimization for scalability

Page 17: Search engines in the industry

Challenges

● Architecture constraints
● Performance
● Diverging stakeholder concerns
● Dynamically scaling search

Page 18: Search engines in the industry

Sample architecture constraint #1

● Data storage has to be on NFS
● Lucene is IO intensive
● NFS makes it slower
● Concurrent reads and writes make it error-prone

Page 19: Search engines in the industry

Sample architecture constraint #2

● Change search engine
● Systems talking to the SE need to switch API
● Only in the long run
● In the short run, an adapter layer exposing the old APIs on top of the new APIs has to be developed
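
The short-run adapter layer can be pictured as a thin class that keeps the old call shape and delegates to the new engine. Everything here is an illustrative assumption: the slides do not describe either API, so `NewSearchClient.query`, `LegacySearchAdapter.find` and their parameters are hypothetical (the `rows` / `start` names are only loosely modeled on offset-style paging).

```python
class NewSearchClient:
    """Stand-in for the new engine's API (hypothetical signature)."""

    def query(self, q, rows=10, start=0):
        # A real client would send a request to the new engine here.
        return {"q": q, "rows": rows, "start": start, "docs": []}


class LegacySearchAdapter:
    """Exposes the old API shape while delegating to the new client."""

    def __init__(self, new_client):
        self._client = new_client

    def find(self, text, page=1, page_size=10):
        """Old-style paged call (hypothetical): translate pages to row offsets."""
        response = self._client.query(
            text, rows=page_size, start=(page - 1) * page_size
        )
        # Convert the new response into the structure old callers expect.
        return {"results": response["docs"], "page": page}


adapter = LegacySearchAdapter(NewSearchClient())
first_page = adapter.find("apache solr")
```

Existing callers keep calling `find` unchanged, and each one can later be moved to the new client directly, which is the long-run switch mentioned above.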

Page 20: Search engines in the industry

Indexing performance

● Most of the indexing time is spent converting data from the old (indexing) format to the new (indexing) format

● The adapter layer between the old and new APIs becomes the bottleneck

● Time to switch to the new API natively

Page 21: Search engines in the industry

Diverging concerns

● Article authors check that the search engine handles their writings exactly, wanting perfect recall and precision
  ○ so a lot of time is spent on adjusting ranking

● Marketers want to be able to override ranking and promote something they want to sell
  ○ ranking algorithm gets bypassed

● Need flexible algorithms
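
One way to read "flexible algorithms" is a ranking step that accepts both tuning (boosts, for the authors' concerns) and explicit overrides (pinned items, for the marketers). The sketch below is an assumption about how that could look, not the approach used in the project; `rank`, `boosts`, `pinned` and the sample scores are hypothetical.

```python
def rank(hits, boosts=None, pinned=None):
    """Order hits by relevance score, with optional boosts and pinned documents.

    hits:   dict of doc id -> base relevance score from the engine
    boosts: dict of doc id -> multiplier applied to the base score
    pinned: list of doc ids forced to the top, in the given order
    """
    boosts = boosts or {}
    # Only pin documents that actually matched the query.
    pinned = [d for d in (pinned or []) if d in hits]
    scored = {d: s * boosts.get(d, 1.0) for d, s in hits.items()}
    organic = sorted(
        (d for d in scored if d not in pinned),
        key=lambda d: scored[d],
        reverse=True,
    )
    return pinned + organic


# Hypothetical scores for one query
hits = {"article-a": 2.0, "article-b": 1.5, "promo-x": 0.4}
```

Calling `rank(hits, pinned=["promo-x"])` places the promoted item first and leaves the relevance order of the remaining results untouched, so the override stays separate from the tuning done for authors.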

Page 22: Search engines in the industry

Scale dynamically

● The search engine must not break, even under high load peaks

● Such peaks are often unpredictable

● Need a fast way to add more computing power
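
As a back-of-the-envelope illustration of adding computing power under load, the sketch below estimates how many search replicas a given query rate needs. The capacity per replica, the headroom fraction and the minimum replica count are all assumed parameters, and `replicas_needed` is a hypothetical helper, not part of any platform.

```python
import math


def replicas_needed(current_qps, qps_per_replica, headroom=0.3, min_replicas=2):
    """Number of search replicas for the current load, with spare capacity.

    headroom is the fraction of each replica's capacity kept free for spikes.
    """
    if qps_per_replica <= 0 or not 0 <= headroom < 1:
        raise ValueError("invalid capacity or headroom")
    usable = qps_per_replica * (1 - headroom)
    return max(min_replicas, math.ceil(current_qps / usable))
```

Because peaks are unpredictable, the headroom matters more than the exact formula: it buys time while new replicas start up and receive a copy of the index.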

Page 23: Search engines in the industry
Page 24: Search engines in the industry

Takeaways

● small iterations (no waterfalls!)
  ○ analyze portion of data / queries
  ○ change search / index algorithms
  ○ test, involve stakeholders
  ○ forces ability to reindex quickly

● look at data (documents, query logs)