Full-Text Search on Steroids - Sphinx Search
Full-Text Search on Steroids
Diego Sapriza, Senior Software Engineer
PHPer ~ DevOps
uruguaSHo
Uruguay
the country of playoff qualifiers
per inhabitant
http://AV4TAr.com
PHP.meetup.uy
DevOps.meetup.uy
@AV4TAr
Big Data
Project
• Search engine
• Relevance
• Faceted search
• Tags
• Geo-search
• Millions of records
• Speed
• Scaling
• Simplicity
If I can't find anything relevant, and fast,
the data is USELESS…
FAST
Getting the results we need
instead of the results that
merely match our query.
2.2.1 beta!
What is it?
A full-text search engine. It indexes databases (and XML) and is designed to scale easily.
Why use it?
• indexing and search speed
• better relevance
• scalability
• faceted search
• geo-search
• morphology
• HTML stripping
• …
IT FLIES!!!
Basic idea
configure the index
index
query the index
repeat
Components
client application
indexer, searchd
database / data source
database / data source
Where do I get the data from?
SQL: mysql, pgsql, mssql, odbc, …
XMLpipes
indexer
How and where do I index the data?
stopwords, wordforms, …
runs periodically
hello, "client"
processes queries using the indexes
searchd
client application
Sphinx API: php, python, ruby, java, c#, nodejs, haskell…
SphinxQL: mysql
SphinxSE: storage engine
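Sphinx API: php, python, ruby, java, c#, nodejs, haskell…
    <?php
    require('/path/to/sphinxapi.php');
    $cl = new SphinxClient();
    // host and port where searchd is listening
    $cl->SetServer('10.1.1.4', 3312);
    // keep only documents whose author_id attribute is 123
    $cl->SetFilter('author_id', array(123));
    // newest posts first
    $cl->SetSortMode(SPH_SORT_ATTR_DESC, 'post_date');
    // search for "test" in the main and delta indexes
    $cl->Query('test', 'main delta');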
SphinxQL: look Ma, no database!!!
mysql_connect() to Sphinx!
SphinxSE storage engine
    SELECT *
    FROM sphinx_table s
    JOIN products p ON p.id = s.id
    WHERE s.query = '@title iPad'
    ORDER BY p.price ASC
client application
indexer, searchd
database / data source
Data flow
client application
indexer, searchd
? interaction
database / data source
    source users_index
    {
        type     = mysql
        sql_user = sphinx
        sql_pass = sph.09$
        sql_db   = wby_beta
        sql_host = 127.0.0.1

        sql_query = SELECT u.id, u.id AS users_id, \
            CONCAT(u.name, ' ', u.lastname) AS name, u.profession, \
            IF(u.gender='m', 1, IF(u.gender='f', 2, 3)) AS numeric_gender, \
            u.city, u.state, u.country, c.email \
            FROM users u, credentials c \
            WHERE c.userHash = u.credentials_userHash AND u.temporal = 'n'

        sql_attr_uint = users_id
        sql_attr_uint = numeric_gender
    }

    index users_index
    {
        source        = users_index
        path          = /wby/sphinx/data/usersindex
        docinfo       = extern
        min_word_len  = 2
        charset_type  = sbcs
        min_infix_len = 3
        enable_star   = 0
    }

    indexer
    {
        mem_limit    = 4096MB
        max_iops     = 0
        write_buffer = 12M
        max_iosize   = 1048576
    }

    searchd
    {
        #listen       = 127.0.0.1:3312
        listen         = 0.0.0.0:3312
        log            = /wby/sphinx/searchd.log
        query_log      = /wby/sphinx/query.log
        read_timeout   = 5
        client_timeout = 300
        max_children   = 30
        pid_file       = /wby/sphinx/searchd.pid
        max_matches    = 1000
    }
data source
index
indexer, searchd
data sources
    source users_src
    {
        type     = mysql
        sql_user = DBUSER
        sql_pass = ******
        sql_db   = DB1
        sql_host = 127.0.0.1

        sql_query = \
            SELECT id, nombre, edad, ciudad, \
            fecha_edit FROM users

        sql_attr_uint      = edad
        sql_attr_timestamp = fecha_edit
    }
mysql
pgsql
odbc
Sphinx returns "only" ids and attributes
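Since the search result carries only document ids and attributes, the application still has to load the full rows from the database. A minimal sketch of that step follows, using an in-memory SQLite table as a stand-in for the real database. The `users` columns come from the data source slide; the sample rows, the `ranked_ids` list and the helper name are hypothetical.

```python
import sqlite3


def fetch_rows_in_rank_order(conn, ids):
    """Load full rows for the ids returned by the search, keeping search order."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, nombre, ciudad FROM users WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # The database returns rows in its own order; restore the search ranking.
    return [by_id[i] for i in ids if i in by_id]


# Stand-in database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, nombre TEXT, ciudad TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ana", "Montevideo"), (2, "Luis", "Salto"), (3, "Eva", "Rivera")],
)

ranked_ids = [3, 1]  # hypothetical: ids as they might come back from searchd
result = fetch_rows_in_rank_order(conn, ranked_ids)
```

Ids that no longer exist in the database are skipped silently, which can happen when the index is older than the data.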
Disk-based index
    index users_index
    {
        source       = users_src
        path         = /data/usrs_index
        min_word_len = 2
        charset_type = utf-8
    }
mysql
Disk-based index
    index users_index
    {
        source = users_src
        source = users_src1
        source = users_src2

        ...
multiple sources
pgsql
odbc
mysql
Distributed index
    index users_index_dist
    {
        type  = distributed
        local = archive
        agent = srv1.net:9312:src2
        agent = srv2.net:9312:src3
    }
agent
agent
mysql mysql
xml
Real-time index
    index rt_users_index
    {
        type              = rt
        path              = /sph/data/rt_usersindex
        rt_field          = name
        rt_field          = city
        rt_attr_uint      = id
        rt_attr_timestamp = date_added
        rt_mem_limit      = 256MB
    }
    # ./indexer users_index

    # ./indexer user_timelines --rotate
    Sphinx 2.0.3-release (r3043)
    Copyright (c) 2001-2011, Andrew Aksyonoff
    Copyright (c) 2008-2011, Sphinx Technologies Inc (http://sphinxsearch.com)

    using config file '/sphinx/etc/sphinx.conf'...
    indexing index 'user_timelines'...
    collected 1.303.297 docs, 4631.5 MB
    sorted 769.8 Mhits, 100.0% done
    total 1.303.297 docs, 4631519329 bytes
    total 1463.481 sec, 3164727 bytes/sec, 890.54 docs/sec
    total 1665 reads, 62.531 sec, 1639.9 kb/call avg, 37.5 msec/call avg
    total 5302 writes, 12.536 sec, 1022.3 kb/call avg, 2.3 msec/call avg
    rotating indices: succesfully sent SIGHUP to searchd (pid=22994).

~24 minutes, 4.5 GB.
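The summary figures can be cross-checked against the totals in the indexer output. The snippet below only redoes that arithmetic; the three input numbers are copied from the log, and the variable names are mine.

```python
# Totals copied from the indexer output for 'user_timelines'.
total_docs = 1_303_297
total_bytes = 4_631_519_329
total_seconds = 1463.481

minutes = total_seconds / 60                  # wall-clock time in minutes
bytes_per_sec = total_bytes / total_seconds   # indexing throughput in bytes
docs_per_sec = total_docs / total_seconds     # indexing throughput in documents

print(f"{minutes:.1f} min, {bytes_per_sec:.0f} bytes/sec, {docs_per_sec:.2f} docs/sec")
```

The results agree with the log to within rounding: a little over 24 minutes, roughly 3.16 million bytes per second and about 890 documents per second.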
indexing main
matching modes
• SPH_MATCH_ALL*
• SPH_MATCH_ANY
• SPH_MATCH_PHRASE
• SPH_MATCH_BOOLEAN
• SPH_MATCH_EXTENDED
• SPH_MATCH_FULLSCAN
Extended syntax
• AND / OR: hola & mundo, hola | mundo
• NOT: hola -mundo
• Field search: @title hola @body mundo
Extended syntax
• Phrase: "Hola mundo"
• Proximity: "Hola mundo"~10
• Distance: hola NEAR/10 mundo
Much more
• aaa << bbb << ccc
• ^hello world$
• "Chile" PARAGRAPH "Mundial"
• @* hello
• @!(title,body) hello world
• @body[50] hello
"hello world" @title "example program"~5 @body python -(php|perl) @* code
    cta1sfter:/srv/sphinx/bin# mysql -P9306 --protocol=tcp --prompt='sphinxQL> '
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 1
    Server version: 2.0.3-release (r3043)

    Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

    sphinxQL> SELECT * from user_timelines WHERE MATCH ('superbowl');
    +-----------+--------+------------+-----------+----------+------------+
    | id        | weight | twitter_id | tweets_id | link_id  | created    |
    +-----------+--------+------------+-----------+----------+------------+
    | 109531197 |   4675 |   24488771 |  57371370 | 35471785 | 1359858567 |
    | 109492540 |   4673 |   56690354 |  57351558 | 35459063 | 1359843568 |
    | 109493484 |   4673 |   24488771 |  57351953 | 35459063 | 1359843239 |
    | 109496715 |   4673 |   24488771 |  57353282 | 35459063 | 1359843352 |
    | 109496743 |   4673 |   24488771 |  57353292 | 35459063 | 1359843241 |
    | 109496779 |   4673 |   24488771 |  57353305 | 35459063 | 1359842932 |
    ...
    +-----------+--------+------------+-----------+----------+------------+
    20 rows in set (0.04 sec)
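When user input ends up inside a MATCH() expression or an extended-syntax query, the operator characters (@, |, -, ", ~ and friends) need escaping, otherwise a search for "hola-mundo" silently turns into a NOT query. The sketch below builds a few of the query forms from the syntax slides. The helper names are hypothetical, and the set of escaped characters is an assumption modeled on the EscapeString helper of the official API clients, so check it against your Sphinx version.

```python
import re

# Assumed set of operator characters: \ ( ) | - ! @ ~ " & / ^ $ =
_SPECIAL = re.compile(r'([\\()|\-!@~"&/^$=])')


def escape_term(text):
    """Prefix every operator character with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


def field_query(field, text):
    """Field search, e.g. @title hola"""
    return f"@{field} {escape_term(text)}"


def phrase(text, proximity=None):
    """Phrase search, optionally with a proximity limit, e.g. "hola mundo"~10"""
    query = f'"{escape_term(text)}"'
    if proximity is not None:
        query += f"~{proximity}"
    return query


# Combine pieces into one extended-syntax query string.
query = " ".join([phrase("hello world"), field_query("title", "example-program")])
print(query)
```

The resulting string can be passed as the query text to the API client or placed inside MATCH('...') in SphinxQL, where the usual SQL string quoting still applies on top.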
How do I keep the indexes up to date?
Especially the big ones!!!
The DELTA, you must use.
merge
Watch your disk space!!!
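The main+delta idea fits in a few lines: a full index run records the highest document id it saw, and each delta run picks up only the rows above that mark. The sketch below simulates this with SQLite in place of MySQL and plain id lists in place of real indexes. The `sph_counter` table and its `max_doc_id` column follow the configuration at the end of the deck; the `docs` table and the function names are hypothetical.

```python
import sqlite3


def index_main(conn):
    """Full run: store the current max id, then return every id up to it."""
    conn.execute(
        "REPLACE INTO sph_counter (counter_id, max_doc_id) "
        "SELECT 'docs', COALESCE(MAX(id), 0) FROM docs"
    )
    return [row[0] for row in conn.execute(
        "SELECT id FROM docs WHERE id <= "
        "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 'docs') "
        "ORDER BY id"
    )]


def index_delta(conn):
    """Delta run: return only the ids added since the last full run."""
    return [row[0] for row in conn.execute(
        "SELECT id FROM docs WHERE id > "
        "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 'docs') "
        "ORDER BY id"
    )]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE sph_counter (counter_id TEXT PRIMARY KEY, max_doc_id INTEGER)")

conn.executemany("INSERT INTO docs (body) VALUES (?)", [("a",), ("b",), ("c",)])
main_ids = index_main(conn)    # slow, covers everything so far

conn.executemany("INSERT INTO docs (body) VALUES (?)", [("d",), ("e",)])
delta_ids = index_delta(conn)  # fast, covers only the new rows
```

In a real setup the delta index is rebuilt often and folded into main from time to time with `indexer --merge`. The merge writes a new copy of the index, which is where the disk space warning comes from.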
Geodistance
    mysql> SELECT *, CONTAINS(GEOPOLY2D(
               40.95164274496,-76.88583678218,
               41.188446201688,-73.203723511772,
               39.900666261352,-74.171833538046,
               40.059260979044,-76.301076056469),
               latitude_deg, longitude_deg) AS inside
           FROM geodemo WHERE inside=1 LIMIT 0,100;
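To get a feel for what the `inside` column holds, here is a point-in-polygon check over the same four vertices. It is a plain planar ray-casting test that treats latitude and longitude as flat coordinates, so it only approximates what the server computes for a geographic polygon. The function name and the two test points are hypothetical.

```python
def point_in_polygon(lat, lon, polygon):
    """Ray casting: count how many polygon edges a ray from the point crosses."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        lat_i, lon_i = polygon[i]
        lat_j, lon_j = polygon[j]
        # Does the edge (j -> i) straddle the point's latitude?
        if (lat_i > lat) != (lat_j > lat):
            # Longitude where the edge crosses that latitude.
            crossing = (lon_j - lon_i) * (lat - lat_i) / (lat_j - lat_i) + lon_i
            if lon < crossing:
                inside = not inside
        j = i
    return inside


# Vertices from the query, as (latitude, longitude) pairs.
POLYGON = [
    (40.95164274496, -76.88583678218),
    (41.188446201688, -73.203723511772),
    (39.900666261352, -74.171833538046),
    (40.059260979044, -76.301076056469),
]

print(point_in_polygon(40.5, -75.2, POLYGON))  # a point in the middle of the area
print(point_in_polygon(45.0, -75.0, POLYGON))  # a point well to the north
```

Running the containment test inside the search server keeps the filter next to the full-text query, so the application never has to pull every row just to discard the ones outside the area.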
TIP: sphinx.conf.php
    #!/usr/bin/php
    <?php for ($i=1; $i<=4; $i++): ?>
    source chunk<?= $i ?>
    {
        sql_host = localhost
        sql_user = sphinx_usr
        sql_pass = ****
        sql_db   = dbchunk<?=$i?>
        . . .
    }
    <?php endfor; // end source loop ?>
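The same trick works with any scripting language: generate the repetitive source blocks with a loop instead of copying them by hand. Below is a rough Python version of the PHP template. The host, user and database names are the ones from the slide, the password stays masked, and the function names are mine. The settings elided on the slide are left as a comment rather than guessed.

```python
def render_source(i):
    """Render one 'source chunkN' block."""
    lines = [
        f"source chunk{i}",
        "{",
        "    sql_host = localhost",
        "    sql_user = sphinx_usr",
        "    sql_pass = ****",
        f"    sql_db   = dbchunk{i}",
        "    # . . . (remaining settings are elided on the slide)",
        "}",
    ]
    return "\n".join(lines)


def render_config(chunks=4):
    """Render all source blocks, separated by blank lines."""
    return "\n\n".join(render_source(i) for i in range(1, chunks + 1)) + "\n"


config = render_config(4)
print(config, end="")
```

Writing the output to a file and pointing indexer and searchd at it is the low-risk way to use this. Running the script directly as the config file through a shebang line, as the PHP tip does, depends on your Sphinx build and platform.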
facts
• standalone
• multiple databases
• does not update the indexes by itself
• Sphinx only returns ids
• heavy disk usage
• easy to integrate
• ordering by relevance
• exact search / boolean search...
• API in several languages
• implements the MySQL protocol
• easy to scale
Questions? @AV4TAr
http://AV4TAr.com
Thanks, see you at...
    cta1sfter:/srv/sphinx/bin# mysql -P9306 --protocol=tcp --prompt='sphinxQL> '
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 1
    Server version: 2.0.3-release (r3043)

    Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

    sphinxQL> SELECT * from user_timelines WHERE MATCH ('superbowl');
    +-----------+--------+------------+-----------+----------+--------+-----------+---------------+
    | id        | weight | twitter_id | tweets_id | link_id  | tld_id | extracted | created_stamp |
    +-----------+--------+------------+-----------+----------+--------+-----------+---------------+
    | 109531197 |   4675 |   24488771 |  57371370 | 35471785 | 132427 |         1 |    1359858567 |
    | 109492540 |   4673 |   56690354 |  57351558 | 35459063 |    685 |         1 |    1359843568 |
    | 109493484 |   4673 |   24488771 |  57351953 | 35459063 |    685 |         1 |    1359843239 |
    | 109496715 |   4673 |   24488771 |  57353282 | 35459063 |    685 |         1 |    1359843352 |
    | 109496743 |   4673 |   24488771 |  57353292 | 35459063 |    685 |         1 |    1359843241 |
    | 109496779 |   4673 |   24488771 |  57353305 | 35459063 |    685 |         1 |    1359842932 |
    ...
    +-----------+--------+------------+-----------+----------+--------+-----------+---------------+
    20 rows in set (0.04 sec)

    sphinxQL> show meta;
    +---------------+-----------+
    | Variable_name | Value     |
    +---------------+-----------+
    | total         | 1000      |
    | total_found   | 6302      |
    | time          | 0.034     |
    | keyword[0]    | superbowl |
    | docs[0]       | 6302      |
    | hits[0]       | 12189     |
    +---------------+-----------+
    6 rows in set (0.00 sec)
    source user_timelines : base
    {
        sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` \
            WHERE `created` <= DATE_SUB(CURDATE(),INTERVAL 8 DAY) \
            ORDER BY created DESC LIMIT 1
        sql_query_pre = REPLACE INTO sph_counter SET counter_id = "user_timelines", \
            modif=NOW(), max_doc_id = ( SELECT MAX(id) max FROM tweets_timelines), \
            last_doc_id = max_doc_id
        sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, \
            lm.expanded_link, lm.title, lm.description, lm.body, lm.tld_id, \
            lm.extracted, UNIX_TIMESTAMP(tt.created) AS created_stamp \
            FROM links_metadata lm, tweets_timelines tt \
            WHERE tt.id >= @tt_id AND lm.extracted = 1 AND tt.links_id = lm.id \
            AND tt.id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id="user_timelines")

        sql_attr_uint      = twitter_id
        sql_attr_uint      = tweets_id
        sql_attr_uint      = link_id
        sql_attr_uint      = tld_id
        sql_attr_timestamp = created_stamp
        sql_attr_uint      = extracted
    }

    index user_timelines
    {
        source               = user_timelines
        html_strip           = 1
        html_remove_elements = a, img
        path                 = /sphinx/data/user_timelines_index
        docinfo              = extern
        charset_type         = utf-8
    }
    source delta_user_timelines : user_timelines
    {
        sql_query_pre = SET NAMES utf8
        sql_query_pre = SELECT @tt_id:=id FROM `tweets_timelines` WHERE `created` <= \
            DATE_SUB(CURDATE(),INTERVAL 8 DAY) ORDER BY created DESC LIMIT 1
        sql_query_pre = SELECT @max:=max(tt.id) FROM links_metadata lm, tweets_timelines tt \
            WHERE lm.extracted = 1 AND tt.links_id = lm.id
        sql_query = SELECT tt.id, tt.twitter_id, tt.tweets_id, lm.id AS link_id, lm.expanded_link, \
            lm.title, lm.description, lm.body, lm.tld_id, lm.extracted, \
            UNIX_TIMESTAMP(tt.created) AS created_stamp \
            FROM links_metadata lm, tweets_timelines tt \
            WHERE tt.id >= @tt_id AND lm.extracted = 1 AND tt.links_id = lm.id AND \
            tt.id>( SELECT max_doc_id FROM sph_counter WHERE counter_id="user_timelines" )
        sql_query_post = UPDATE sph_counter SET last_doc_id=@max WHERE counter_id="user_timelines"
    }

    index delta_user_timelines : user_timelines
    {
        source               = delta_user_timelines
        html_strip           = 1
        html_remove_elements = a, img
        path                 = /sphinx/data/delta_user_timelines_index
        docinfo              = extern
        charset_type         = utf-8
    }
Links
• http://sphinxsearch.com/docs/current.html
• http://AV4TAr.com
• http://bit.ly/sphinx-autosuggest
• http://bit.ly/sphinx-query-builder
• http://bit.ly/sphinx-zfconf-011
• http://bit.ly/sphinx-high-performance