Full text search

Rahila Syed, Beena Emerson

© 2013 NTT DATA, Inc.

This contains basic information about full text search and how it can be implemented in PostgreSQL. This was presented at India PostgreSQL meetup at Pune on 16 Nov, 2013.

Index

• Full text search and its types

• Full text search in PostgreSQL

• PostgreSQL extension

• Similarity Search

Full Text Search

What is full text search?

• Searching for a group of keywords in a pile of texts

– Document

– Query

– Similarity

• Full text search in a database

– Searching for a set of keywords in a text field of a database table

– The data used for full text search can be huge

– Indexing words and associating indexed words with documents

Full Text Search in PostgreSQL

Steps

• Creating tokens

– Parsing the document into a set of tokens such as numbers, words, complex words, and email addresses

• Creating lexemes

– Normalization, controlled by dictionaries:

• Removal of suffixes: converts variants into a single form (worry, worries, worried, etc.)

• Conversion to lower case

• Removal of stop words: common words that are useless for searching (the, at, etc.)

• Storing preprocessed documents

– Storing documents and creating indexes over them for faster search

• Relevance ranking
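
The first two steps can be sketched outside the database. The Python below is only an illustration: the stop-word list, the suffix rules, and the function names (tokenize, normalize, to_lexemes) are hypothetical, and the suffix stripping is far cruder than the Snowball dictionaries PostgreSQL uses. It shows how tokens become lexemes with positions while stop words are dropped.

```python
import re

# Hypothetical stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "at", "a", "is", "to", "of", "this", "be"}
SUFFIXES = ("ies", "ied", "ing", "ed", "s", "y")


def tokenize(document):
    """Split a document into word and number tokens."""
    return re.findall(r"[A-Za-z0-9]+", document)


def normalize(token):
    """Lower-case a token and strip one known suffix (a crude stemmer)."""
    word = token.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_lexemes(document):
    """Map each lexeme to its positions; stop words still consume a position."""
    lexemes = {}
    for position, token in enumerate(tokenize(document), start=1):
        if token.lower() in STOP_WORDS:
            continue
        lexemes.setdefault(normalize(token), []).append(position)
    return lexemes


print(normalize("worry"), normalize("worries"), normalize("worried"))
print(to_lexemes("Glad to be part of this meetup"))
```

All three variants of "worry" collapse to one form, and the second call keeps only the non-stop words with their original positions.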

Full text search in PostgreSQL

• Full integration

• 27 built-in configurations for 10 languages

• Support of user-defined FTS configurations

• Pluggable dictionaries (ispell, snowball, thesaurus) and parsers

• Relevance ranking

• GIN and GiST indexes

Full text search in PostgreSQL

Morphological search

• Indexed tokens are words of a language

• E.g. tree, book, rain

• Small index size

• Handles orthographical variants well

• Search results depend on how the text is divided into words

• Used for large documents such as theses

• Ex. tsvector

N-gram search

• Indexed tokens are sequences of characters

• E.g. _t, tr, re, e_ (2-grams)

• Big index size

• Cannot match orthographical variants

• Results are closer to an indexed LIKE

• Better suited for a limited set of words

• Ex. pg_bigm, pg_trgm

Why full text search?

• Searching for similar words (LIKE and regular expressions have no linguistic support)

• Ranking of search results

• Substring searches

– Ordinary (B-tree) indexes do not support substring search

– The LIKE operator does not use an index when the pattern starts with %

– Low performance compared to full text search using GIN and GiST

• Accuracy issue

– E.g. LIKE '%one%' matches prone, money, lonely
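
The accuracy issue can be reproduced outside the database. The Python sketch below uses the words from the slide and contrasts substring matching, which is what a pattern like '%one%' does, with whole-word matching, which is closer to comparing lexemes. The variable names are illustrative only.

```python
import re

documents = ["one", "prone", "money", "lonely"]

# Substring matching, analogous to: col LIKE '%one%'
substring_hits = [d for d in documents if "one" in d]

# Whole-word matching, closer to what a lexeme comparison does
word_hits = [d for d in documents if re.search(r"\bone\b", d)]

print(substring_hits)  # every word contains the substring "one"
print(word_hits)       # only the exact word "one"
```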

Measurement results

• POSIX regular expression

=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc ~ 'postgresql';
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=10.871..390.019 rows=250 loops=1)
   Filter: (doc ~ 'postgresql'::text)
   Rows Removed by Filter: 11397
 Total runtime: 390.060 ms

• LIKE query

=# EXPLAIN ANALYZE SELECT * FROM fulltext_search WHERE doc LIKE '%postgresql%';
                                QUERY PLAN
--------------------------------------------------------------------------
 Seq Scan on fulltext_search (cost=10000000000.00..10000000473.77 rows=40 width=152) (actual time=1.342..110.107 rows=250 loops=1)
   Filter: (doc ~~ '%postgresql%'::text)
   Rows Removed by Filter: 11397
 Total runtime: 110.134 ms

Measurement results

• Full text search

 Nested Loop (cost=352.83..508.22 rows=107 width=64) (actual time=1.397..1.575 rows=250 loops=1)
   -> Function Scan on to_tsquery query (cost=0.00..0.01 rows=1 width=32) (actual time=0.023..0.023 rows=1 loops=1)
   -> Bitmap Heap Scan on full_text_search (cost=352.83..507.14 rows=107 width=32) (actual time=1.371..1.516 rows=250 loops=1)
         Recheck Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
         -> Bitmap Index Scan on full_search_idx (cost=0.00..352.80 rows=107 width=0) (actual time=1.354..1.354 rows=348 loops=1)
               Index Cond: (query.query @@ to_tsvector('english'::regconfig, doc))
 Total runtime: 1.619 ms

Ranking Example

Normal search:

SELECT * FROM tbl WHERE col1 LIKE 'The tiger is the largest cat species';
                 col1
--------------------------------------
 The tiger is the largest cat species
(1 row)

Full text search (here using the similarity function and % operator of pg_trgm):

SELECT col1, similarity(col1, 'The tiger is the largest cat species') AS sml
FROM tbl_t WHERE col1 % 'The tiger is the largest cat species'
ORDER BY sml DESC, col1;
                  col1                   |   sml
-----------------------------------------+----------
 The tiger is the largest cat species    |        1
 The peacock is the largest bird species | 0.511111
 The cheetah is the fastest cat species  | 0.466667
(3 rows)

Indexes Used in Full Text Search

• GIN (Generalized Inverted Index)

• Custom strategies for particular data types

• Inverted indexes

• Interface for custom data types

• Slower to update

• Deterministic

• Appropriate for fixed data sets

KEY     TID
Meetup  100, 140
Pune    100, 150
Here    100
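
The KEY/TID table is the core idea of an inverted index: each key points to the rows that contain it. A minimal Python sketch follows. The rows and the function name build_inverted_index are hypothetical, chosen so that the resulting postings mirror the table above; a real GIN index stores this structure on disk in a far more elaborate form.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each lower-cased word to the sorted list of TIDs that contain it."""
    postings = defaultdict(set)
    for tid, text in rows.items():
        for word in text.lower().split():
            postings[word].add(tid)
    return {key: sorted(tids) for key, tids in postings.items()}


# Hypothetical rows, keyed by TID.
rows = {
    100: "Meetup here Pune",
    140: "Meetup",
    150: "Pune",
}

index = build_inverted_index(rows)
for key, tids in index.items():
    print(key, tids)
```

Looking up a key is then a single dictionary access, which is why searches are fast while every insert or update has to touch one posting list per word.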

Indexes Used in Full Text Search

• GiST (Generalized Search Tree)

• Interface for data types and access methods

• Document is represented in the index by a fixed-length signature

• Based on hash tables

• Probability of false match

• Table row must be retrieved to see if the match is correct

• Inappropriate for large data sets

• Filtering data at the end of the index search to remove false matches

EXPLAIN SELECT * FROM tab WHERE textsearch @@ to_tsquery('Mountain');
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Scan using text_search_idx on tab (cost=0.00..12.29 rows=2 width=1469)
   Index Cond: (textsearch @@ '''Mountain'''::tsquery)
   Filter: (textsearch @@ '''Mountain'''::tsquery)
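
The signature idea and the need for the final filter can be illustrated with a small Python sketch. This is not the GiST implementation: the 64-bit signature length, the use of zlib.crc32 as the hash, and all function names are assumptions made for the example. Each word sets one bit in a fixed-length signature, so two different words may share a bit, which is where false matches come from.

```python
import zlib

SIGNATURE_BITS = 64  # hypothetical signature length


def signature(words):
    """Fold a collection of words into one fixed-length bit signature."""
    bits = 0
    for word in words:
        bits |= 1 << (zlib.crc32(word.encode("utf-8")) % SIGNATURE_BITS)
    return bits


def may_contain(doc_signature, query_words):
    """True if every query bit is set; can be a false match, never a false miss."""
    query_signature = signature(query_words)
    return (doc_signature & query_signature) == query_signature


def search(documents, query_words):
    """Filter by signature first, then recheck against the actual row."""
    results = []
    for doc in documents:
        words = set(doc.lower().split())
        if not may_contain(signature(words), query_words):
            continue  # the signature proves the row cannot match
        if set(query_words) <= words:  # recheck removes false matches
            results.append(doc)
    return results


print(search(["The mountain is high", "A deep valley"], ["mountain"]))
```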

tsvector

• Representation of a document best suited for full text search

• Normalized lexemes formed by pre-processing the documents

• Function to convert normal text to tsvector:

• to_tsvector

to_tsvector([ config regconfig, ] document text) returns tsvector

=# SELECT to_tsvector('english', 'Glad to be part of this meetup');
         to_tsvector
------------------------------
 'glad':1 'meetup':7 'part':4
(1 row)

• The query above specifies 'english' as the configuration used to parse and normalize the string. If the configuration argument is omitted, the value of default_text_search_config is used.

tsquery

• Representation of a search query best suited for full text search

• Normalized lexemes formed by processing the query

• Lexemes may be combined using the AND, OR, and NOT operators

• All keywords used for the search

tsquery

• Functions to convert normal text to tsquery:

• to_tsquery

to_tsquery([ config regconfig, ] querytext text) returns tsquery

=# SELECT to_tsquery('meetups & in & ! Pune');
     to_tsquery
--------------------
 'meetup' & !'pune'
(1 row)

• plainto_tsquery

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

=# SELECT plainto_tsquery('english', 'meetups in Pune');
  plainto_tsquery
-------------------
 'meetup' & 'pune'
(1 row)

Match operator @@

• Checks a tsvector (document) against a tsquery (search words)

• Returns true if all tsquery elements are present in the tsvector of the document

=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('PostgreSQL Meetups');
 ?column?
----------
 t
(1 row)

=# SELECT to_tsvector('Welcome to this postgresql meetup') @@ plainto_tsquery('Pune meetup');
 ?column?
----------
 f
(1 row)
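
For an AND-only query such as the ones plainto_tsquery produces, the match reduces to a subset test on lexemes. The Python sketch below mimics the two examples above. The stop-word list, the trailing-"s" rule, and the function names are hypothetical simplifications, not PostgreSQL's actual normalization.

```python
# Hypothetical stop words; PostgreSQL takes these from the configured dictionary.
STOP_WORDS = {"to", "this", "in", "the", "a"}


def lexemes(text):
    """Lower-case, drop stop words, and strip a plural "s" (crude stemming)."""
    result = set()
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]
        result.add(word)
    return result


def matches(document, query):
    """True if every query lexeme is present in the document's lexemes."""
    return lexemes(query) <= lexemes(document)


print(matches("Welcome to this postgresql meetup", "PostgreSQL Meetups"))  # True
print(matches("Welcome to this postgresql meetup", "Pune meetup"))         # False
```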

Full text search without index

SELECT * FROM <table> WHERE to_tsvector('<config>', <colname>) @@ to_tsquery('<config>', '<search word>');

The configuration argument passed to to_tsvector and to_tsquery should be the same.

Example:

=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ to_tsquery('english', 'enjoy');
              col
--------------------------------
 He enjoyed the party
 He enjoys the classical music.
(2 rows)

Full text search using index

• Creating the index

CREATE INDEX <index_name> ON <table> USING gin(to_tsvector('<config>', <col>));

• Performing a search using the index

SELECT * FROM <table> WHERE to_tsvector('<config>', <col>) @@ plainto_tsquery('<config>', '<search word>');

Example:

=# CREATE INDEX idx ON tbl USING gin(to_tsvector('english', col));

=# SELECT * FROM tbl WHERE to_tsvector('english', col) @@ plainto_tsquery('english', 'enjoy');
              col
--------------------------------
 He enjoyed the party
 He enjoys the classical music.
(2 rows)

Full text search using separate column

• Procedure

– Create a column of tsvector type

– Define a trigger which automatically updates the tsvector column

– Perform the search on the tsvector column

• Advantages

– No need to specify the text search configuration in every query in order to make use of the index

– Faster searches, as the to_tsvector function is not called for each search query

Full text search using separate column

Example:

=# CREATE TABLE tbl (col text, tsv_col tsvector);

=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
   ON tbl FOR EACH ROW EXECUTE PROCEDURE
   tsvector_update_trigger(tsv_col, 'pg_catalog.english', col);

=# INSERT INTO tbl VALUES ('He enjoyed the party'), ('He enjoys the classical music.'), ('The moon winked at him');

=# SELECT * FROM tbl;
              col               |             tsv_col
--------------------------------+---------------------------------
 He enjoyed the party           | 'enjoy':2 'parti':4
 He enjoys the classical music. | 'classic':4 'enjoy':2 'music':5
 The moon winked at him         | 'moon':2 'wink':3
(3 rows)

Full text search using separate column

Example, with the same table, trigger, and rows as in the previous example:

=# SELECT col FROM tbl WHERE tsv_col @@ to_tsquery('enjoys');
              col
--------------------------------
 He enjoyed the party
 He enjoys the classical music.
(2 rows)

Ranking

• ts_rank

– Lexical ranking

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

=# select ts_rank(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful | thing'));
  ts_rank
-----------
 0.0607927

• ts_rank_cd

– Proximity ranking

=# select ts_rank_cd(to_tsvector('Free text seaRCh is a wonderful Thing'), to_tsquery('wonderful & thing'));
 ts_rank_cd
------------
        0.1

Ranking

• Structural ranking

– Query

select ts_rank(array[0.1, 0.1, 0.9, 0.1],
       setweight(to_tsvector('All about search'), 'B') ||
       setweight(to_tsvector('Free text seaRCh is a wonderful Thing'), 'A'),
       to_tsquery('wonderful & search'));

– Result

 ts_rank
----------
 0.328337

PostgreSQL Extension

pg_trgm

• Uses an index built from trigrams: 3 consecutive characters from a string

• Finds string similarity by comparing the trigrams

• Provides GiST and GIN index operator classes for creating an index

CREATE INDEX <idx> ON <tbl> USING gist (<col> gist_trgm_ops);

CREATE INDEX <idx> ON <tbl> USING gin (<col> gin_trgm_ops);

• Problems:

− No partial match algorithm

− Slow when the search key is shorter than 3 characters, because GIN_SEARCH_MODE_ALL is used

pg_bigm

• PostgreSQL module which provides full text search capability using a 2-gram index

• Based on pg_trgm

• First released in April 2013. Version 1.1 to be released soon.

• Developed by NTT DATA

• Site: http://sourceforge.jp/projects/pgbigm/

Difference

Feature                       pg_trgm                          pg_bigm
Method of full text search    3-gram: " a", " ab", abc, bcd    2-gram: " a", ab, bc, cd, "d "
Available index               GIN and GiST                     GIN only
1-2 character keyword search  Slow                             Fast

Install pg_bigm

• Download the tar.gz file from the site

• Install pg_bigm

$ make USE_PGXS=1
$ su
# make USE_PGXS=1 install

• Register: set the postgresql.conf variables

– shared_preload_libraries = 'pg_bigm'

– custom_variable_classes = 'pg_bigm' (only in 9.1)

• Load into the required database

=# CREATE EXTENSION pg_bigm;

Function – show_bigm

Argument: search string

Return value: array of all possible 2-gram character strings

Procedure:

• For each word, perform the following:

• Add a space character before and after the text

• Moving from left to right, extract strings in units of 2 characters

=# SELECT show_bigm('ab');
   show_bigm
----------------
 {" a",ab,"b "}
(1 row)
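
The procedure is short enough to restate in Python. This is a sketch of the steps listed on the slide, not the module's C implementation; in particular, returning the bigrams sorted and without duplicates is an assumption based on the output shown above.

```python
def show_bigm(text):
    """Return the sorted, de-duplicated 2-grams of each space-padded word."""
    bigrams = set()
    for word in text.split():
        padded = " " + word + " "  # add a space before and after
        for i in range(len(padded) - 1):
            bigrams.add(padded[i:i + 2])  # slide a 2-character window
    return sorted(bigrams)


print(show_bigm("ab"))   # [' a', 'ab', 'b ']
print(show_bigm("cat"))  # [' c', 'at', 'ca', 't ']
```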

Function - likequery

Argument: search string

Return value: string in a pattern to be used in LIKE for full text search

Procedure:

• Add % to the beginning and the end of the search string.

• Add a backslash (\) before every underscore (_), percent sign (%), and backslash (\) present in the search string.

=# SELECT likequery('pg_bigm ppt');
   likequery
----------------
 %pg\_bigm ppt%
(1 row)
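
The two steps translate directly into a few lines of Python. The function below is an illustrative re-creation of the procedure on the slide, not the extension's own code.

```python
def likequery(keyword):
    """Escape LIKE wildcards in the keyword and wrap it in % ... %."""
    escaped = ""
    for ch in keyword:
        if ch in ("\\", "%", "_"):
            escaped += "\\"  # backslash before \, % and _
        escaped += ch
    return "%" + escaped + "%"


print(likequery("pg_bigm ppt"))  # %pg\_bigm ppt%
```

Escaping matters because an unescaped underscore in a LIKE pattern matches any single character, so 'pg_bigm' would also match 'pgXbigm'.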

Creation of Index

• Only GIN is supported

• Create an index on the text column of a table

CREATE INDEX <index_name> ON <table> USING gin (<column> gin_bigm_ops);

Table

TID  Data
1    cat
5    mat

Generate bigrams

cat - " c", at, ca, "t "
mat - " m", at, ma, "t "

Index

Key   TID
" c"  1
" m"  5
at    1, 5
ca    1
ma    5
"t "  1, 5
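
The table-to-index step can be sketched in Python as follows. The helper names are hypothetical and the dictionary only models the Key/TID mapping shown above, not the on-disk GIN structure.

```python
from collections import defaultdict


def bigrams(text):
    """2-grams of the text padded with one space on each side."""
    padded = " " + text + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def build_bigram_index(rows):
    """Map each bigram to the sorted list of TIDs whose data contains it."""
    postings = defaultdict(set)
    for tid, data in rows.items():
        for gram in bigrams(data):
            postings[gram].add(tid)
    return {gram: sorted(tids) for gram, tids in sorted(postings.items())}


index = build_bigram_index({1: "cat", 5: "mat"})
for key, tids in index.items():
    print(repr(key), tids)
```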

Full text search Query

SELECT * FROM <tbl> WHERE <col> LIKE likequery('<word>');

=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('cat');
                            QUERY PLAN
-------------------------------------------------------------------
 Bitmap Heap Scan on tbl (cost=12.00..16.01 rows=1 width=4) (actual time=0.038..0.039 rows=1 loops=1)
   Recheck Cond: (col ~~ '%cat%'::text)
   -> Bitmap Index Scan on idx (cost=0.00..12.00 rows=1 width=0) (actual time=0.025..0.025 rows=1 loops=1)
         Index Cond: (col ~~ '%cat%'::text)
 Total runtime: 0.093 ms
(5 rows)

Full text search Query

Search key -> Generate bigrams -> Index lookup -> Result Candidates -> Perform Recheck -> Final Result

Index

Key   TID
" c"  1
" m"  5
at    1, 5
ca    1
ma    5
"t "  1, 5

Result Candidates

TID  Data
1    cat

Final Result

TID  Data
1    cat

Why Recheck?

• Removes wrong results from the result candidates of the index scan.

=# EXPLAIN ANALYZE SELECT * FROM tbl WHERE col LIKE likequery('trial');
                            QUERY PLAN
-------------------------------------------------------------------
 Bitmap Heap Scan on tbl (cost=24.00..28.01 rows=1 width=5) (actual time=0.060..0.060 rows=1 loops=1)
   Recheck Cond: (col ~~ '%trial%'::text)
   Rows Removed by Index Recheck: 1
   -> Bitmap Index Scan on idx (cost=0.00..24.00 rows=1 width=0) (actual time=0.043..0.043 rows=2 loops=1)
         Index Cond: (col ~~ '%trial%'::text)
 Total runtime: 0.117 ms
(6 rows)

Why Recheck?

Table

TID  Data
1    trial
2    trivial

Generate bigrams

trial   - " t", al, ia, "l ", ri, tr
trivial - " t", al, ia, iv, "l ", ri, tr, vi

Index

Key   TID
" t"  1, 2
al    1, 2
ia    1, 2
iv    2
"l "  1, 2
ri    1, 2
tr    1, 2
vi    2

Search 'trial'

Index scan

TID  Data
1    trial
2    trivial

Recheck

TID  Data
1    trial
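
The false candidate arises because every bigram of 'trial' also occurs in 'trivial'. The Python sketch below reproduces the two phases on hypothetical rows. Two simplifications are assumptions of the example: the index lookup is simulated by comparing bigram sets row by row, and the search key is not space-padded because the LIKE pattern has % on both sides.

```python
def bigrams(text, pad=True):
    """2-grams of the text, optionally padded with one space on each side."""
    source = " " + text + " " if pad else text
    return {source[i:i + 2] for i in range(len(source) - 1)}


# Hypothetical table contents, keyed by TID.
rows = {
    1: "He is awaiting trial",
    2: "It was a trivial mistake",
    3: "Those orchids are very special to her",
}


def index_scan(rows, key):
    """Candidates: rows that contain every bigram of the search key."""
    wanted = bigrams(key, pad=False)
    return [tid for tid, text in rows.items() if wanted <= bigrams(text)]


def recheck(rows, candidates, key):
    """Keep only candidates whose text really contains the key."""
    return [tid for tid in candidates if key in rows[tid]]


candidates = index_scan(rows, "trial")
final = recheck(rows, candidates, "trial")
print(candidates)  # [1, 2]
print(final)       # [1]
```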

Disabling Recheck

Parameter - enable_recheck

• Set to off to disable recheck and return all the results retrieved by the index scan

• Values: on/off

=# SET pg_bigm.enable_recheck = on;

=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
         doc
----------------------
 He is awaiting trial
(1 row)

=# SET pg_bigm.enable_recheck = off;

=# SELECT * FROM tbl WHERE doc LIKE likequery('trial');
           doc
--------------------------
 He is awaiting trial
 It was a trivial mistake
(2 rows)

pg_bigm Full Text Search Sample

=# CREATE TABLE tbl (col text);

=# CREATE INDEX tbl_idx ON tbl USING gin (col gin_bigm_ops);

=# INSERT INTO tbl VALUES
   ('He is awaiting trial'),
   ('Those orchids are very special to her'),
   ('pg_bigm performs full text search using 2 gram index'),
   ('pg_trgm performs full text search using 3 gram index');

=# SELECT * FROM tbl WHERE col LIKE likequery('full text search');
                         col
------------------------------------------------------
 pg_bigm performs full text search using 2 gram index
 pg_trgm performs full text search using 3 gram index
(2 rows)

Similarity Search

Function – bigm_similarity

Arguments: the two strings whose similarity is to be checked

Return value: the similarity of the two arguments (0 to 1)

• Measures the similarity of two strings by counting the number of 2-grams they share.

=# SELECT bigm_similarity('test', 'text');
 bigm_similarity
-----------------
             0.6
(1 row)
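
The value 0.6 can be reproduced by hand. 'test' yields the bigrams " t", te, es, st, "t " and 'text' yields " t", te, ex, xt, "t "; three of the five are shared. The Python sketch below divides the shared count by the larger bigram count. That formula is an assumption, chosen because it reproduces the values shown on these slides; the module's actual implementation may differ in details such as duplicate handling.

```python
def bigrams(text):
    """Distinct 2-grams of the text padded with one space on each side."""
    padded = " " + text + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigm_similarity(a, b):
    """Shared bigrams divided by the larger of the two bigram counts (assumed formula)."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    shared = len(grams_a & grams_b)
    return shared / max(len(grams_a), len(grams_b))


print(bigm_similarity("test", "text"))  # 0.6
print(bigm_similarity("test", "test"))  # 1.0
```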

Parameter - similarity_limit

• Specifies the threshold used for the similarity search

• The search returns rows with similarity value >= similarity_limit

• Default: 0.3

• The SET command can be used to modify the value.

=# SHOW pg_bigm.similarity_limit;
 pg_bigm.similarity_limit
--------------------------
 0.3
(1 row)

=# SET pg_bigm.similarity_limit = 0.5;

Similarity Operator - =%

• Used to perform similarity search

• Uses the full text search index

• Returns rows whose similarity is higher than or equal to the value of pg_bigm.similarity_limit

SELECT * FROM <tbl> WHERE <col> =% '<key>';

Similarity Search Sample

=# SET pg_bigm.similarity_limit = 0.2;

=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
  col  | bigm_similarity
-------+-----------------
 test  |               1
 text  |             0.6
 treat |        0.333333
(3 rows)

=# SET pg_bigm.similarity_limit = 0.5;

=# SELECT *, bigm_similarity(col, 'test') FROM tbl WHERE col =% 'test';
 col  | bigm_similarity
------+-----------------
 test |               1
 text |             0.6
(2 rows)
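
The effect of the threshold can be mimicked in Python. As before, the similarity formula (shared bigrams over the larger bigram count) is an assumption that matches the values in this sample, and the row "pune" is a hypothetical non-matching entry added for contrast.

```python
def bigrams(text):
    """Distinct 2-grams of the text padded with one space on each side."""
    padded = " " + text + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bigm_similarity(a, b):
    """Assumed formula: shared bigrams / larger bigram count."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    return len(grams_a & grams_b) / max(len(grams_a), len(grams_b))


def similarity_search(rows, key, similarity_limit=0.3):
    """Return the rows whose similarity to the key is >= similarity_limit."""
    return [row for row in rows if bigm_similarity(row, key) >= similarity_limit]


rows = ["test", "text", "treat", "pune"]
print(similarity_search(rows, "test", similarity_limit=0.2))  # ['test', 'text', 'treat']
print(similarity_search(rows, "test", similarity_limit=0.5))  # ['test', 'text']
```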
