Fulltext engine for non fulltext searches

Post on 13-Jul-2015


Fulltext engine for Non-Fulltext Queries

Adrian Nuta // Sphinxsearch // 2013

• Introduction

• Non-fulltext queries

• Special data columns

• Fulltext to speed up non-fulltext queries

Introduction

What is Sphinx

• free, open-source, search server

• fast: 700 qps / core / 1M docs

• flexible: 100+ features

• scalable

o 300 mil. q / day

o 50 TB data, 100+ boxes

Sphinx document

• Doc ID

• Fulltext fields

o inverted index

o indexed, not stored

• Attributes

o Integer, Float, Bool, Timestamp, MVA, String, JSON, ...

o stored, not indexed

o held in memory or on disk

Meet SphinxQL

• Application to MySQL: MySQL connector, MySQL protocol, MySQL language

• Application to Sphinx: MySQL connector, MySQL protocol, SphinxQL language

SELECT * FROM mytable WHERE ...

Non-fulltext queries

What can Sphinx do besides fulltext?

• usual WHERE, ORDER, GROUP BY

• GROUP BY custom extensions:

o WITHIN GROUP ORDER BY

o GROUP <N> BY

• Aggregation, timestamp and math functions

• Comparison functions: IF(), INTERVAL(), IN()

• Geo spatial: GEODIST(), GEOPOLY2D()

WITHIN GROUP ORDER BY

mysql> SELECT *,DAY(added) as today FROM facetdemo WHERE property2 = 160 AND today =26 GROUP BY brand_id WITHIN GROUP ORDER BY price ASC ORDER BY brand_id ASC;

+---------+-------+----------+-----------+------------+---------------------+------------+----------+-------+

| id | price | brand_id | property2 | added | title | brand_name | property | today |

+---------+-------+----------+-----------+------------+---------------------+------------+----------+-------+

| 520157 | 10 | 1 | 160 | 1382745486 | Product Nine Seven | brand1 | Three | 26 |

| 1726473 | 10 | 2 | 160 | 1382796463 | Product Two Three | brand2 | Eight | 26 |

| 1588875 | 11 | 3 | 160 | 1382762264 | Product Three Six | brand3 | Five | 26 |

| 1556197 | 10 | 4 | 160 | 1382754018 | Product Eight Six | brand4 | Seven | 26 |

| 751443 | 11 | 5 | 160 | 1382803444 | Product Six Three | brand5 | One | 26 |

| 512776 | 11 | 6 | 160 | 1382743642 | Product Ten Five | brand6 | Six | 26 |

mysql> SELECT *,DAY(added) as today FROM facetdemo WHERE property2 = 160 AND today =26 GROUP BY brand_id WITHIN GROUP ORDER BY price DESC ORDER BY brand_id ASC;

+---------+-------+----------+-----------+------------+---------------------+------------+----------+-------+

| id | price | brand_id | property2 | added | title | brand_name | property | today |

+---------+-------+----------+-----------+------------+---------------------+------------+----------+-------+

| 815154 | 998 | 1 | 160 | 1382819286 | Product Two Nine | brand1 | Eight | 26 |

| 2793903 | 999 | 2 | 160 | 1382813601 | Product Eight Five | brand2 | Two | 26 |

| 699831 | 1000 | 3 | 160 | 1382790589 | Product One Six | brand3 | Eight | 26 |

| 714052 | 1000 | 4 | 160 | 1382794137 | Product One Ten | brand4 | Three | 26 |

| 2791902 | 999 | 5 | 160 | 1382813140 | Product Five Three | brand5 | Four | 26 |

| 2753725 | 1000 | 6 | 160 | 1382803662 | Product Seven Three | brand6 | Two | 26 |

Using GROUP <N> BY

mysql> SELECT * FROM facetdemo GROUP 3 BY brand_id WITHIN GROUP ORDER BY added DESC ORDER BY brand_id ASC;

+---------+-------+----------+------------+---------------------+------------+----------+

| id | price | brand_id | added | title | brand_name | property |

+---------+-------+----------+------------+---------------------+------------+----------+

| 1479848 | 938 | 1 | 1382735889 | Product Ten Seven | brand1 | Four |

| 2479064 | 398 | 1 | 1382734998 | Product Ten Five | brand1 | Eight |

| 1480553 | 687 | 1 | 1382734048 | Product Four Two | brand1 | One |

| 1479580 | 62 | 2 | 1382734834 | Product Nine Seven | brand2 | Ten |

| 1479585 | 357 | 2 | 1382734834 | Product Six Two | brand2 | Five |

| 477383 | 908 | 2 | 1382733871 | Product Ten Three | brand2 | Eight |

| 2478429 | 425 | 3 | 1382734839 | Product Three Ten | brand3 | Five |

| 477456 | 519 | 3 | 1382734818 | Product Ten One | brand3 | Six |

| 477521 | 190 | 3 | 1382734403 | Product Three Two | brand3 | Five |

| 2478459 | 931 | 4 | 1382734850 | Product One Two | brand4 | Five |

| 1479718 | 891 | 4 | 1382734065 | Product Two One | brand4 | Three |

| 2478514 | 106 | 4 | 1382733868 | Product Six Seven | brand4 | One |

| 477297 | 991 | 5 | 1382734844 | Product Five Eight | brand5 | Four |

| 2479053 | 648 | 5 | 1382733994 | Product Six One | brand5 | Nine |

| 1480798 | 250 | 5 | 1382732121 | Product One Seven | brand5 | Eight |
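The semantics of GROUP <N> BY combined with WITHIN GROUP ORDER BY can be sketched in plain Python. This is only an emulation of what the queries above return, not how Sphinx implements it; the function name `group_n_by` and the sample rows are hypothetical.

```python
from collections import defaultdict

def group_n_by(rows, key, n, within_order, reverse=False):
    """Keep the top `n` rows of each `key` group, ranked by `within_order`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    result = []
    for group_key in sorted(groups):  # plays the role of ORDER BY <key> ASC
        ranked = sorted(groups[group_key],
                        key=lambda r: r[within_order],
                        reverse=reverse)  # WITHIN GROUP ORDER BY
        result.extend(ranked[:n])         # GROUP <N> BY keeps n rows per group
    return result

# Hypothetical sample rows, not taken from the facetdemo index.
rows = [
    {"id": 1, "brand_id": 1, "added": 100},
    {"id": 2, "brand_id": 1, "added": 300},
    {"id": 3, "brand_id": 1, "added": 200},
    {"id": 4, "brand_id": 2, "added": 150},
    {"id": 5, "brand_id": 2, "added": 50},
]

top2 = group_n_by(rows, key="brand_id", n=2, within_order="added", reverse=True)
print([r["id"] for r in top2])
```

With n=1 the same function mirrors a plain GROUP BY whose representative row is picked by the WITHIN GROUP ORDER BY clause.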

Using HAVING

mysql> SELECT *,COUNT(*) FROM facetdemo where property2 = 190 and price>900 GROUP BY brand_id HAVING COUNT(*)>1000;

+-------+-------+----------+-----------+------------+-------------------+------------+----------+----------+

| id | price | brand_id | property2 | added | title | brand_name | property | count(*) |

+-------+-------+----------+-----------+------------+-------------------+------------+----------+----------+

| 2566 | 934 | 24 | 190 | 1382615816 | Product One Three | brand24 | Six | 1023 |

| 4807 | 905 | 11 | 190 | 1382616392 | Product Five Six | brand11 | Eight | 1023 |

| 5539 | 985 | 44 | 190 | 1382616552 | Product Ten Four | brand44 | Three | 1009 |

| 7655 | 912 | 10 | 190 | 1382617104 | Product Four Five | brand10 | Ten | 1028 |

| 16837 | 968 | 20 | 190 | 1382619365 | Product One Nine | brand20 | Five | 1015 |

+-------+-------+----------+-----------+------------+-------------------+------------+----------+----------+

5 rows in set (0.17 sec)

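Comparing simple queries

+----------------------------------------------------------+---------------------------------------------------------+-------+--------+------------+
| Operation                                                | Example                                                 | MySQL | Sphinx | Difference |
+----------------------------------------------------------+---------------------------------------------------------+-------+--------+------------+
| Filter by integer, group by integer                      | WHERE property_int = 190 GROUP BY brand_id              | 0.32  | 0.14   | 2.2x       |
| Group by integer, order by count(*)                      | GROUP BY brand_id ORDER BY COUNT(*) DESC                | 1.76  | 0.53   | 3.3x       |
| Filter by integer, order by timestamp                    | WHERE brand_id=20 ORDER BY added ASC                    | 0.00  | 0.14   | 0          |
| Filter by integer, order by timestamp and integer column | WHERE brand_id=20 ORDER BY added DESC, property_int ASC | 0.31  | 0.19   | 1.5x       |
+----------------------------------------------------------+---------------------------------------------------------+-------+--------+------------+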

Using IF comparison

mysql> SELECT COUNT(*), IF( property2=270 OR price<80, 1,

IF(property2=280 OR price> 900,2,3)

) AS expr FROM facetdemo GROUP BY expr;

+----------+------+

| count(*) | expr |

+----------+------+

| 7494455 | 3 |

| 1357178 | 2 |

| 1148366 | 1 |

+----------+------+

3 rows in set (1.04 sec)

Using INTERVAL for segmentation

mysql> SELECT id, price, INTERVAL(price,0,300,600,900) AS pricerange, COUNT(*) FROM facetdemo WHERE brand_id=27 GROUP BY pricerange ORDER BY pricerange ASC;

+------+-------+------------+----------+

| id | price | pricerange | count(*) |

+------+-------+------------+----------+

| 219 | 196 | 1 | 58283 |

| 46 | 467 | 2 | 60535 |

| 109 | 667 | 3 | 60789 |

| 5 | 962 | 4 | 20285 |

+------+-------+------------+----------+

4 rows in set (0.19 sec)
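How INTERVAL() maps a value to a segment number can be illustrated with a short Python sketch. It assumes the documented behavior that the result is 0 below the first point, 1 from the first point up to the second, and so on; the helper name `interval` is hypothetical, and the prices are the four sample values shown in the result above.

```python
from bisect import bisect_right
from collections import Counter

def interval(value, *points):
    """Emulate INTERVAL(): count how many of the sorted points are <= value."""
    return bisect_right(points, value)

# Prices of the representative rows shown in the result set above.
prices = [196, 467, 667, 962]
segments = [interval(p, 0, 300, 600, 900) for p in prices]
print(segments)

# Grouping by the segment number is then an ordinary count per bucket.
print(Counter(segments))
```

The segment numbers match the `pricerange` column in the query output, which is what makes INTERVAL() convenient as a GROUP BY key for range facets.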

Geo spatial in Sphinx

GEODIST(lat1, lon1, lat2, lon2, { option=value, ... })

o in { deg | degrees | rad | radians}

o out {m | meters | km | ft | mi | miles }

o method {haversine | adaptive}

haversine - high precision, expensive

adaptive - good precision, cheaper

(Polar flat-Earth algorithm )
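The trade-off between the two methods can be sketched in Python. The haversine formula below is the standard one; the flat version is a generic equirectangular approximation used here as a stand-in, since the exact algorithm behind Sphinx's adaptive method is not shown in these slides. The Earth radius constant and both function names are assumptions. Inputs are in radians, as with in=rad.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumption, Sphinx may use another constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; precise, but needs several trig calls."""
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def flat_km(lat1, lon1, lat2, lon2):
    """Equirectangular (flat-Earth) approximation; cheaper, fine for short distances."""
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return EARTH_RADIUS_KM * math.hypot(x, y)

# Two nearby points (radians); the second one is a made-up neighbour.
p1 = (0.710011075352, -1.2918035709982)
p2 = (0.7101, -1.2917)
print(haversine_km(*p1, *p2), flat_km(*p1, *p2))
```

For points this close together the two results agree closely, which is why a cheaper approximation is attractive when the query only cares about a small radius.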

• POLY2D(x1,y1,x2,y2,x3,y3, …)

• GEOPOLY2D (lat1,lng1,lat2,lng2,lat3,lng3,...)

• lat/lng in degrees

• CONTAINS(polygon, x, y)

mysql> SELECT *, CONTAINS(GEOPOLY2D(40.95164274496,-76.88583678218,41.188446201688,-73.203723511772,39.900666261352,-74.171833538046,40.059260979044,-76.301076056469),latitude_deg,longitude_deg) AS inside FROM geodemo WHERE inside=1 LIMIT 0,100;
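What a point-in-polygon test like CONTAINS() has to decide can be sketched with the classic ray-casting algorithm. This is a planar test (closer in spirit to POLY2D than to GEOPOLY2D, which deals with geographic coordinates), and it is not claimed to be Sphinx's implementation. The function name `contains` and the test points are hypothetical; the polygon is the one from the query above.

```python
def contains(polygon, x, y):
    """Ray casting: count polygon edges crossed by a ray from (x, y) towards +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside

# Polygon from the query above, as (latitude, longitude) pairs.
polygon = [
    (40.95164274496, -76.88583678218),
    (41.188446201688, -73.203723511772),
    (39.900666261352, -74.171833538046),
    (40.059260979044, -76.301076056469),
]
print(contains(polygon, 40.5, -75.0))  # hypothetical point inside the area
print(contains(polygon, 42.0, -75.0))  # hypothetical point north of it
```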

Special data columns

Multi value attribute (MVA)

• a column holding a set of integers

• example row: Price (float) = 199.99, Categories (MVA) = 24, 128, 300

MVA with multiple selection

mysql> SELECT id,price,brand_id,categories FROM facetdemo WHERE categories IN (13,14);

+------+-------+----------+------------+

| id | price | brand_id | categories |

+------+-------+----------+------------+

| 1 | 874 | 47 | 13 |

| 2 | 712 | 38 | 11,14 |

| 9 | 113 | 25 | 12,14 |

| 17 | 440 | 46 | 13,15 |

| 19 | 206 | 50 | 13,17 |

| 21 | 76 | 28 | 7,10,13 |

| 22 | 363 | 21 | 13,17,20 |

...

Grouping on MVA

mysql> SELECT id,price,brand_id,categories,GROUPBY(),COUNT(*) FROM facetdemo GROUP BY categories;

+------+-------+----------+------------+-----------+----------+

| id | price | brand_id | categories | groupby() | count(*) |

+------+-------+----------+------------+-----------+----------+

| 1 | 874 | 47 | 13 | 13 | 362931 |

| 2 | 712 | 38 | 11,14 | 14 | 185023 |

| 2 | 712 | 38 | 11,14 | 11 | 329874 |

| 3 | 773 | 7 | 12,16 | 16 | 143837 |

| 3 | 773 | 7 | 12,16 | 12 | 349446 |

| 4 | 803 | 31 | 6,9 | 9 | 267583 |

| 4 | 803 | 31 | 6,9 | 6 | 184772 |

...
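The two MVA behaviors shown above, filtering with IN and grouping, can be emulated in Python. This is a sketch of the observable semantics only: a document matches IN when any of its values is listed, and when grouping it is counted once for each value it holds, which is why ids 2, 3 and 4 appear twice in the output. The function names are hypothetical; the sample values are the first four rows of the result above.

```python
from collections import Counter

# First four documents from the result set above.
docs = [
    {"id": 1, "categories": [13]},
    {"id": 2, "categories": [11, 14]},
    {"id": 3, "categories": [12, 16]},
    {"id": 4, "categories": [6, 9]},
]

def filter_mva_in(docs, attr, wanted):
    """categories IN (...): keep documents holding at least one wanted value."""
    wanted = set(wanted)
    return [d["id"] for d in docs if wanted.intersection(d[attr])]

def group_by_mva(docs, attr):
    """GROUP BY on an MVA: each document contributes to every value it holds."""
    counts = Counter()
    for d in docs:
        for value in set(d[attr]):
            counts[value] += 1
    return counts

print(filter_mva_in(docs, "categories", (13, 14)))
print(sorted(group_by_mva(docs, "categories").items()))
```

Note that the group counts add up to more than the number of documents, since a document with two categories is counted in both groups.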

Going further: JSON

• starting with 2.1, Sphinx supports JSON documents

• useful for

o unstructured data

o complex one-to-many relations

{

"id": 1,

"gid": 2,

"title": "some title",

"tags":

[ "tag1", "tag2", "tag3" ],

"property": [

{

"name": "color",

"value": "blue"

},

{

"name": "weight",

"value": 2.56

}

]

}

JSON attributes

• filter, sort and group

• JSON/MVA array functions:

LENGTH(), LEAST(), GREATEST()

• Advanced JSON search in array of objects:

ANY(), ALL(), INDEXOF()

Advanced searching in JSON

document:

id: 1011

title: Hotel Sky

myjson: {
  "offers": [
    {
      "type": 3,
      "start": start_timestamp,
      "end": end_timestamp
    },
    {
      "type": 1,
      "start": start_timestamp,
      "end": end_timestamp
    }
  ]
}

SELECT *,ANY (

( item.type = 1 AND

item.start > my_start_timestamp AND

item.end < my_end_timestamp )

FOR item IN myjson.offers

) AS condition

FROM index

WHERE condition =1

• ANY ( cond FOR var IN json.array)

o true if at least one element matches the condition

• ALL ( cond FOR var IN json.array)

o true if all elements match the condition

• INDEXOF ( cond FOR var IN json.array)

o returns the index of the first element that matches the condition
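The three functions behave much like Python's built-in iteration helpers, which the following sketch uses to emulate them. The offer values and the search window are made up for illustration, and returning -1 from `index_of` when nothing matches is an assumption based on the Sphinx documentation rather than on these slides.

```python
def any_match(cond, array):
    """ANY(cond FOR var IN array): at least one element satisfies cond."""
    return any(cond(item) for item in array)

def all_match(cond, array):
    """ALL(cond FOR var IN array): every element satisfies cond."""
    return all(cond(item) for item in array)

def index_of(cond, array):
    """INDEXOF(cond FOR var IN array): position of the first match, else -1 (assumed)."""
    return next((i for i, item in enumerate(array) if cond(item)), -1)

# Hypothetical offers for one hotel document; timestamps are placeholders.
offers = [
    {"type": 3, "start": 100, "end": 200},
    {"type": 1, "start": 150, "end": 250},
]
my_start, my_end = 120, 300

def wanted(item):
    return item["type"] == 1 and item["start"] > my_start and item["end"] < my_end

print(any_match(wanted, offers), all_match(wanted, offers), index_of(wanted, offers))
```

The point of evaluating the whole condition per element is that all three predicates must hold for the same offer, which separate filters on flattened columns could not guarantee.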

Fulltext to speed up non-fulltext

Without a fulltext match, the query does a full scan and computes the heavy expression for the whole collection:

SELECT *,(...) as heavy_expr
WHERE attr=x AND heavy_expr =1

With a fulltext match, the heavy expression is computed only on the result set returned by the fulltext match:

SELECT *,(...) as heavy_expr
WHERE MATCH('attrx') AND heavy_expr =1
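The effect described above can be demonstrated with a toy inverted index in Python. This is a simplified model under stated assumptions, not Sphinx internals: the collection, the `heavy_expr` stand-in and both query functions are hypothetical, and the scan path evaluates the expression for every document, as the slide describes.

```python
from collections import defaultdict

# Hypothetical collection: 10,000 documents spread over 50 brands.
docs = {i: {"brand": f"brand{i % 50}", "price": i % 1000} for i in range(1, 10001)}

# Inverted index: keyword -> ids of the documents that contain it.
inverted = defaultdict(set)
for doc_id, doc in docs.items():
    inverted[doc["brand"]].add(doc_id)

heavy_calls = {"count": 0}

def heavy_expr(doc):
    """Stand-in for an expensive expression; counts how often it runs."""
    heavy_calls["count"] += 1
    return doc["price"] > 900

def scan_query(brand):
    """Full scan: the heavy expression is evaluated for every document."""
    heavy_calls["count"] = 0
    hits = [i for i, d in docs.items() if heavy_expr(d) and d["brand"] == brand]
    return sorted(hits), heavy_calls["count"]

def match_query(keyword):
    """Fulltext match first: the heavy expression runs only on matched documents."""
    heavy_calls["count"] = 0
    hits = [i for i in inverted.get(keyword, ()) if heavy_expr(docs[i])]
    return sorted(hits), heavy_calls["count"]

scan_hits, scan_calls = scan_query("brand20")
match_hits, match_calls = match_query("brand20")
print(scan_calls, match_calls, scan_hits == match_hits)
```

Both paths return the same documents, but the second one evaluates the expensive expression only for the documents the keyword lookup returned. This is the reason for indexing an attribute value as a keyword such as `brand20`.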

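Sphinx with FT filter

+----------------------------------------------------------+--------------------------------------------------------------+-------+---------------+----------------+
| Operation                                                | Example                                                      | MySQL | Sphinx w/o FT | Sphinx with FT |
+----------------------------------------------------------+--------------------------------------------------------------+-------+---------------+----------------+
| Filter by integer, order by timestamp and integer column | WHERE brand_id=20 ORDER BY added DESC, property_int ASC      | 0.31  | 0.19          |                |
| Fulltext filter, order by timestamp and integer column   | WHERE MATCH('brand20') ORDER BY added DESC, property_int ASC |       |               | 0.13           |
+----------------------------------------------------------+--------------------------------------------------------------+-------+---------------+----------------+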

Speed up geo spatial with fulltext

• example: find items within a 10 km radius of a point in New York City. Speed-up: search only items belonging to New York state

mysql> SELECT *, GEODIST(0.710011075352, -1.2918035709982,latitude,longitude,{in=rad,out=km,method=adaptive}) as distance FROM geodemo WHERE distance < 10 ORDER BY distance ASC LIMIT 0,10;

10 rows in set (0.17 sec)

mysql> SELECT *, GEODIST(0.710011075352, -1.2918035709982,latitude,longitude,{in=rad,out=km,method=adaptive}) as distance FROM geodemo WHERE MATCH('@state_code NY') AND distance < 10 ORDER BY distance ASC LIMIT 0,10;

10 rows in set (0.03 sec)

Questions?

adrian.nuta@sphinxsearch.com

http://www.sphinxsearch.com