Real time fulltext search with sphinx

Real time fulltext search with Sphinx Adrian Nuta // Sphinxsearch // 2013

description

My talk about real-time indexing and searching with Sphinx, given at FrOSCon 2013 in Sankt Augustin, Germany.

Transcript of Real time fulltext search with sphinx

Page 1: Real time fulltext search with sphinx

Real time fulltext search

with Sphinx

Adrian Nuta // Sphinxsearch // 2013

Page 2: Real time fulltext search with sphinx

Quick intro

Sphinx search

• high performance fulltext search engine

• written in C++

• serving searches since 2001

• can work on any modern architecture

• distributed under GPL2 licence

Page 3: Real time fulltext search with sphinx

Why a search engine?

• performance
  o a search engine delivers searches faster and with fewer resources
• quality of search
  o built-in FTS in databases doesn’t offer advanced search options
• independent FTS engines offer speed not only for fulltext searches but also for other types, such as geo or faceted searches

Page 4: Real time fulltext search with sphinx

Classic way of indexing in Sphinx

on-disk (classic) method:

• a data source is indexed
• to update the index you need to reindex
• in addition to the main index, a secondary (delta) index can be used to reindex only the latest changes
• easy, because indexing doesn’t require changes in the application, but:
  o reindexing, even of the delta only, can put pressure on the data source and the system

Page 5: Real time fulltext search with sphinx

Real time indexing in Sphinx

• the index has no data source
• everything that needs to be indexed must be added to the index explicitly
• you can add/update/remove documents at any time
• compared to the classic method, RT requires changes in the application
• performance is the same or nearly the same as with a classic index
• the only specific requirement: workers = threads

Page 6: Real time fulltext search with sphinx

Structures

Page 7: Real time fulltext search with sphinx

RealTime index definition

index rt {
    type = rt
    rt_field = title
    rt_field = content
    rt_attr_uint = user_id
    rt_attr_string = title
    rt_attr_json = metadata
}

Page 8: Real time fulltext search with sphinx

Schema - Fields

rt_field - fulltext field, raw text is not stored

Tokenization features: wildcarding (prefix or infix), morphology, custom charset definition, stopwords, synonyms, segmentation, HTML stripping, paragraph/sentence detection, etc.

Page 9: Real time fulltext search with sphinx

Schema - Attributes

• rt_attr_uint & rt_attr_bigint

• rt_attr_bool

• rt_attr_float

• rt_attr_multi & rt_attr_multi64 - integer sets

• rt_attr_timestamp

• rt_attr_string - actual text stored, kept in memory, used only for display, sorting and grouping.

• rt_attr_json - full support for JSON documents

Page 10: Real time fulltext search with sphinx

Content manipulation

Page 11: Real time fulltext search with sphinx

Quick intro to SphinxQL

• our SQL dialect

• any MySQL client can be used to connect to Sphinx
• a MySQL server is not required!
• full document updates are only possible with SphinxQL
• to enable it, add to the searchd section of the config:
  listen = host:port:mysql41

Page 12: Real time fulltext search with sphinx

Content insert

mysql> INSERT INTO rt (id,title,content,user_id,metadata)
       VALUES(100,'My title','Some long content to search',10,
       '{"image_id":1,"props":[20,30,40]}');
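Because an RT index has no data source, the application has to produce statements like the one above itself. A minimal sketch in Python of how that could look; the helper names `quote` and `build_insert` and the backslash-escaping rule are assumptions made for illustration, and a real application would normally rely on the parameter binding of its MySQL client library instead.

```python
import json

def quote(value):
    """Render a Python value as a SQL-style literal (illustrative escaping only)."""
    if isinstance(value, (int, float)):
        return str(value)
    # Assumption: backslash-escape backslashes and single quotes.
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + text + "'"

def build_insert(index, doc, verb="INSERT"):
    """Build an INSERT (or REPLACE) statement from a dict of column -> value."""
    columns = ",".join(doc)
    values = ",".join(quote(v) for v in doc.values())
    return f"{verb} INTO {index} ({columns}) VALUES({values});"

doc = {
    "id": 100,
    "title": "My title",
    "content": "Some long content to search",
    "user_id": 10,
    # The JSON attribute is sent as a compact JSON string.
    "metadata": json.dumps({"image_id": 1, "props": [20, 30, 40]},
                           separators=(",", ":")),
}
statement = build_insert("rt", doc)
print(statement)
```

Passing `verb="REPLACE"` yields the full-document replace form of the same statement.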

Page 13: Real time fulltext search with sphinx

Full content replace

mysql> REPLACE INTO rt (id,title,content,user_id,metadata)
       VALUES(100,'My title','Some long content to search',10,
       '{"image_id":1,"props":[20,30,40]}');
• needed for updates of text fields, JSON attributes and string attributes

Page 14: Real time fulltext search with sphinx

Updating numerics

• For numeric attributes, including MVA:
  mysql> UPDATE rt SET user_id = 10 WHERE id = 100;
• Numeric JSON elements can be updated in place:
  mysql> UPDATE rt SET metadata.image_id = 1234 WHERE id = 100;

Page 15: Real time fulltext search with sphinx

Deleting

mysql> DELETE FROM rt WHERE id = 100;
mysql> DELETE FROM rt WHERE user_id > 100;
mysql> TRUNCATE RTINDEX rt;
• TRUNCATE RTINDEX empties the memory shard, deletes all disk shards and releases the index binlogs

Page 16: Real time fulltext search with sphinx

Adding new attributes

mysql> ALTER TABLE rt ADD COLUMN gid INTEGER;
• for now, only int/bigint/float/bool attributes can be added

Page 17: Real time fulltext search with sphinx

Searching

Page 18: Real time fulltext search with sphinx

Searching

• searching an RT index is no different from searching a classic index
• dict = keywords is required for wildcard search

Page 19: Real time fulltext search with sphinx

Relevancy ranking

• built-in rankers:
  o proximity_bm25 (default)
  o none, matchany, wordcount, fieldmask, bm25
• custom ranker - create your own ranking expression

example:
ranker = proximity_bm25
is the same as
ranker = expr('sum(lcs*user_weight)*1000+bm25')
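To make the expression above concrete, here is a small Python sketch that evaluates `sum(lcs*user_weight)*1000+bm25` by hand. The function name and the sample numbers (per-field `lcs` values, weights and the `bm25` score) are made up for illustration; in Sphinx these factors are computed by the engine for each matched document.

```python
def proximity_bm25_weight(fields, bm25):
    """Evaluate sum(lcs*user_weight)*1000+bm25.

    fields: list of (lcs, user_weight) pairs, one per fulltext field.
    bm25:   document-level BM25 score.
    """
    return sum(lcs * user_weight for lcs, user_weight in fields) * 1000 + bm25

# Hypothetical document: the title matched with lcs=2 and weight 10,
# the content matched with lcs=1 and weight 1, and bm25 is 573.
weight = proximity_bm25_weight([(2, 10), (1, 1)], bm25=573)
print(weight)  # (2*10 + 1*1) * 1000 + 573 = 21573
```

The field-level proximity part dominates the result because of the factor of 1000, and bm25 only breaks ties between documents with equal proximity.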

Page 20: Real time fulltext search with sphinx

Tokenization settings example

index rt {
    charset_type = utf-8
    dict = keywords
    min_word_len = 2
    min_infix_len = 3
    morphology = stem_en
    enable_star = 1
}

Page 21: Real time fulltext search with sphinx

Operators on fulltext fields

• Boolean: hello | world, hello ! world

• phrase: "hello world"
• proximity: "hello world"~10
• quorum: "world is a beautiful place"/3
• exact form: =cats and =dogs
• strict order: cats << and << dogs
• zone limit: ZONE:(h2,h4) cats and dogs
• SENTENCE: all SENTENCE words SENTENCE "in one sentence"
• PARAGRAPH: "this search" PARAGRAPH "is fast"
• selected fields only: @(title,body) hello world
• excluded fields: @!(title,body) hello world
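Since these operators are ordinary characters in the query string, raw user input can trigger them by accident. A minimal Python sketch of escaping user text before it goes into a fulltext query; the character set in `SPECIAL` and the name `escape_match` are assumptions modelled on the escaping helpers shipped with the API clients, so check them against the Sphinx version in use.

```python
# Assumption: characters treated as query-syntax operators.
SPECIAL = '\\()|-!@~"&/^$=<'

def escape_match(text):
    """Prefix every operator character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)

print(escape_match("hello (world)"))   # hello \(world\)
print(escape_match("user@example"))    # user\@example
```

When the escaped text is embedded in a SphinxQL string literal, the usual SQL string quoting still has to be applied on top of this.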

Page 22: Real time fulltext search with sphinx

Using API

<?php

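require("sphinxapi.php");
$cl = new SphinxClient();                  // uses the client's default host and port
$res = $cl->Query('search me now', 'rt');  // search the 'rt' index
if ($res === false) die($cl->GetLastError());
print_r($res);

Official: PHP, Python, Ruby, Java, C
Unofficial: JS (Node.js), Perl, C++, Haskell, .NET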

Page 23: Real time fulltext search with sphinx

Using SphinxQL

mysql> SELECT * FROM rt WHERE MATCH('"search me fuzzy"~10')
       AND featured = 1 LIMIT 0,20;

mysql> SELECT * FROM rt
       WHERE MATCH('"search me fuzzy"~10 @tag computers') AND featured = 1
       GROUP BY user_id ORDER BY title ASC LIMIT 30,60
       OPTION field_weights=(title=10,content=1),
       ranker=expr('sum((4*lcs+2*(min_hit_pos==1)+exact_hit)*user_weight)*1000+bm25');

Page 24: Real time fulltext search with sphinx

Boolean filtering

mysql> SELECT *, views > 10 OR category = 4 AS cond
       FROM rt
       WHERE MATCH('"search me proximity"~10') AND featured = 1 AND cond = 1
       GROUP BY user_id ORDER BY title ASC
       LIMIT 30,60 OPTION ranker=sph04;

Page 25: Real time fulltext search with sphinx

Geo search

mysql> SELECT *, GEODIST(lat,long,0.71147,-1.29153) AS distance
       FROM rt WHERE distance < 1000 ORDER BY distance ASC;

mysql> SELECT *, GEODIST(lat,long,40.76439,-73.99976,
       {in=degrees,out=miles,method=adaptive}) AS distance
       FROM rt WHERE distance < 10 ORDER BY distance ASC;
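The two queries above use the same anchor point, once in radians and once in degrees. A short Python sketch that checks the conversion and shows what a great-circle distance computation looks like; the haversine formula and the Earth radius constant are stand-ins chosen for illustration, not a description of how GEODIST is implemented.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius, in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters; all arguments are in radians."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# The degree pair from the second query, converted to radians,
# matches the radian pair used in the first query.
lat = math.radians(40.76439)
lon = math.radians(-73.99976)
print(round(lat, 4), round(lon, 4))

# Distance from the anchor to a hypothetical point 0.0001 rad further north.
print(haversine_m(lat, lon, lat + 0.0001, lon))
```

This is why the first query filters on `distance < 1000` (radians in, meters out) while the second, with `in=degrees,out=miles`, filters on `distance < 10`.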

Page 26: Real time fulltext search with sphinx

Multi-queries

mysql> DELIMITER \\

mysql> SELECT *,COUNT(*) AS counter FROM rt WHERE MATCH('search me')
       GROUP BY property_one ORDER BY counter DESC;
       SELECT *,COUNT(*) AS counter FROM rt WHERE MATCH('search me')
       GROUP BY property_two ORDER BY counter DESC;
       SELECT *,COUNT(*) AS counter FROM rt WHERE MATCH('search me')
       GROUP BY property_three ORDER BY counter DESC;
       \\

• used for faceting
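A faceted request is the same fulltext query repeated with a different GROUP BY per facet. A small Python sketch that generates such a batch; the function name `facet_batch` is hypothetical, and the query text is assumed to be already escaped.

```python
def facet_batch(index, query, facets):
    """Return one semicolon-separated batch with a grouped query per facet."""
    template = ("SELECT *,COUNT(*) AS counter FROM {index} "
                "WHERE MATCH('{query}') "
                "GROUP BY {facet} ORDER BY counter DESC")
    return ";".join(
        template.format(index=index, query=query, facet=facet)
        for facet in facets
    )

batch = facet_batch("rt", "search me",
                    ["property_one", "property_two", "property_three"])
for statement in batch.split(";"):
    print(statement)
```

Sending the statements in one batch gives the daemon a chance to share work between them, since only the grouping differs.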

Page 27: Real time fulltext search with sphinx

Internals

Page 28: Real time fulltext search with sphinx

Internal architecture

Each RT index is a sharded index consisting of:

• one memory shard for latest content

• one or more disk shards

Page 29: Real time fulltext search with sphinx

Internal shards management

rt_mem_limit = maximum size of the memory shard
When the memory shard is full, it is flushed to disk as a new disk shard.
• OPTIMIZE INDEX rt - merges all disk shards into one
  o Merging too intensive? Throttle it with rt_merge_iops and rt_merge_maxiosize
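The shard lifecycle described above can be pictured with a toy model in Python. Everything here (the class `ToyRTIndex`, measuring size as UTF-8 byte length, storing documents as strings) is a simplification invented for illustration and says nothing about Sphinx's real on-disk format.

```python
class ToyRTIndex:
    """Toy model: one memory shard that spills into disk shards."""

    def __init__(self, mem_limit):
        self.mem_limit = mem_limit  # plays the role of rt_mem_limit (bytes)
        self.memory_shard = []
        self.memory_used = 0
        self.disk_shards = []

    def insert(self, doc):
        size = len(doc.encode("utf-8"))
        # If the new document would overflow the memory shard, flush first.
        if self.memory_shard and self.memory_used + size > self.mem_limit:
            self.flush()
        self.memory_shard.append(doc)
        self.memory_used += size

    def flush(self):
        """Turn the memory shard into a new disk shard."""
        self.disk_shards.append(self.memory_shard)
        self.memory_shard = []
        self.memory_used = 0

    def optimize(self):
        """Merge all disk shards into one, like OPTIMIZE INDEX."""
        merged = [doc for shard in self.disk_shards for doc in shard]
        self.disk_shards = [merged] if merged else []

index = ToyRTIndex(mem_limit=10)
for doc in ["aaaa", "bbbb", "cccc", "dddd", "eeee"]:
    index.insert(doc)
print(len(index.disk_shards), index.memory_shard)  # 2 ['eeee']
index.optimize()
print(len(index.disk_shards))  # 1
```

A smaller limit produces more, smaller disk shards, which is the situation OPTIMIZE INDEX is meant to clean up.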

Page 30: Real time fulltext search with sphinx

Binlog support

Sphinx supports binlogs, so the memory shard is not lost if the daemon crashes
• binlog_flush
  o similar to innodb_flush_log_at_trx_commit
  o 0 - flush and sync every second - fastest, but up to 1 second of data can be lost
  o 1 - flush and sync every transaction - safest, but slowest
  o 2 - flush every transaction, sync every second - best balance, default mode
• binlog_path
  o binlog_path = # an empty value disables logging

Page 31: Real time fulltext search with sphinx

Fast RT setup using classic index

• Create a classic index to get the initial data.
• Declare an RT index.
• mysql> ATTACH INDEX classic TO RTINDEX rt
  o transforms the classic index into an RT index
  o the operation is almost instant
  o in essence it is a file rename: the classic index becomes an RT disk shard

Page 32: Real time fulltext search with sphinx

Sphinx uses 1 CPU core per index

More power?

Distribute!

Page 33: Real time fulltext search with sphinx

Distributed RT index

Update on each shard, search on everything

index distributed
{
    type = distributed
    local = rtlocal_one
    local = rtlocal_two
    agent = some.ip:rtremote_one
}

don’t forget about dist_threads = x
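Searches go to the distributed index, but every write has to target one concrete RT shard, so the application needs a routing rule. A minimal Python sketch; routing by document id modulo the shard count and the name `shard_for` are assumptions for illustration, and any stable rule would do.

```python
# Shard names taken from the distributed index definition above.
SHARDS = ["rtlocal_one", "rtlocal_two", "rtremote_one"]

def shard_for(doc_id, shards=SHARDS):
    """Pick the RT index that should store the given document id."""
    return shards[doc_id % len(shards)]

for doc_id in (100, 101, 102):
    print(doc_id, "->", shard_for(doc_id))
```

Writes for `rtremote_one` would have to be sent over a connection to the remote server, since that index is not local to this searchd instance.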

Page 34: Real time fulltext search with sphinx

Copy RT index from one server to another
• just simulate a daemon restart
  o searchd --stopwait
  o flushes the memory shard to disk
• Copy all index files to the new server.
• Add the RT index to sphinx.conf on the new server.
• Start searchd on the new server.

Page 35: Real time fulltext search with sphinx

Questions?

www.sphinxsearch.com

Docs: http://sphinxsearch.com/docs/

Wiki: http://sphinxsearch.com/wiki/

Official blog: http://sphinxsearch.com/blog/

SVN repository: https://code.google.com/p/sphinxsearch/