A sneak peek at CBFT: A Full Text Search for Couchbase: Couchbase Connect 2015

A SNEAK PEEK AT CBFT

Couchbase Full-Text Server

Marty Schoch & Steve Yen, Couchbase, Inc.

©2015 Couchbase Inc. 2

about the speakers

Steve Yen, Couchbase

Marty Schoch, lead contributor to bleve

the most popular open-source full-text indexing engine for golang

agenda

• why cbft?
• what’s full-text search and how’s it work?
• design
• demo
• status / roadmap / what’s next

why cbft?

search product    couchbase connector?    yet another tier & cluster to manage?
(logo)            yes                     yes
(logo)            yes                     yes
Lucidworks        yes                     yes

simple

integrated

80/20 of features

what’s full-text search?

©2015 Couchbase Inc. 11

advanced search

search results

• Spelling Suggestions
• Result Text Snippets
• Highlighted Search Terms

faceted search

JSON document in Couchbase

Key: akay1980

Document:

    {
      "name": "Alan Kay",
      "description": "... the wisest engineer ..."
    }

Text Analysis : tokenizer + token filters

A pipeline of transformations

One Tokenizer

Zero or more Token Filters

“… the wisest engineer …”

tokens: the | wisest | engineer

• Seems like simple whitespace splitting… but this doesn’t work for all languages
• Unicode standard rules help (see Unicode Standard Annex #29)
• Still need to account for exceptions
  • E-mail addresses and URLs don’t follow normal rules
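The tokenizer step can be sketched in a few lines of Python. This is only an illustration, not how bleve or cbft tokenizes: the regular expression `TOKEN_RE` and the function `tokenize` are hypothetical, and a regex is a rough stand-in for the Unicode Standard Annex #29 word-boundary rules. It does show the exception handling mentioned above: e-mail addresses and URLs are matched first so they survive as single tokens.

```python
import re

# Hypothetical pattern, tried left to right at each position:
# e-mail addresses and URLs first, then plain words.
TOKEN_RE = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"   # e-mail address
    r"|https?://[^\s]+"                # URL
    r"|\w+(?:'\w+)?"                   # plain word, optional apostrophe part
)

def tokenize(text):
    """Split text into lowercase tokens, keeping e-mails and URLs whole."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]

tokens = tokenize("... the wisest engineer ...")
```

Python's `\w` is Unicode-aware for `str` input, but languages written without spaces between words would still need a dedicated segmenter.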

tokens: the | wisest | engineer

Stop Word Removal (drops “the”): wisest | engineer

Stemming: wise | engineer
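The two token filters above can be chained as a small pipeline. The sketch below is a toy under stated assumptions: `STOP_WORDS` is a tiny illustrative list and `toy_stem` is a hand-written suffix stripper, not a real stemming algorithm such as Porter or Snowball, and neither is taken from bleve.

```python
STOP_WORDS = {"a", "an", "and", "of", "the"}  # tiny illustrative list

def remove_stop_words(tokens):
    """Token filter 1: drop very common words."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Token filter 2: toy suffix stripper, NOT a real stemmer."""
    if token.endswith("est") and len(token) > 5:
        return token[:-2]      # wisest -> wise
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]      # engineers -> engineer
    return token

def analyze(tokens):
    """Run the filters in order: stop word removal, then stemming."""
    return [toy_stem(t) for t in remove_stop_words(tokens)]

terms = analyze(["the", "wisest", "engineer"])
```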

Inverted Index

term        document keys
wise        …, akay1980, …
engineer    …, akay1980, …
…
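An inverted index maps each analyzed term to the keys of the documents that contain it. A minimal in-memory sketch follows; the function name `build_inverted_index` and the second document `doc2` are hypothetical, and a real engine keeps these postings in a persistent store rather than a dict.

```python
from collections import defaultdict

def build_inverted_index(analyzed_docs):
    """Map each term to the sorted list of document keys that contain it."""
    postings = defaultdict(set)
    for key, terms in analyzed_docs.items():
        for term in terms:
            postings[term].add(key)
    return {term: sorted(keys) for term, keys in postings.items()}

docs = {
    "akay1980": ["wise", "engineer"],   # terms after text analysis
    "doc2": ["engineer", "manual"],     # hypothetical second document
}
index = build_inverted_index(docs)
```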

Search

search term: engineers

after analysis: engineer

Exact Match against the Inverted Index:

term        document keys
wise        …, akay1980, …
engineer    …, akay1980, …    (match)
…

Apply the same analysis at search time that we used at index time.
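Why the same analysis matters can be seen in a small sketch. Everything here is illustrative: `analyze` is a stand-in analyzer (lowercase, split on whitespace, strip a plural "s"), and `build_index` and `search` are hypothetical helpers, not cbft or bleve APIs. With analysis, the query "engineers" becomes "engineer" and matches exactly; without it, the lookup misses.

```python
def analyze(text):
    """Stand-in analyzer: lowercase, split on whitespace, strip a plural 's'."""
    terms = []
    for token in text.lower().split():
        if token.endswith("s") and len(token) > 3:
            token = token[:-1]
        terms.append(token)
    return terms

def build_index(docs):
    """Index time: analyze each document and record term -> document keys."""
    index = {}
    for key, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(key)
    return index

def search(index, query, analyzed=True):
    """Search time: exact-match lookup of each query term in the index."""
    terms = analyze(query) if analyzed else query.split()
    hits = set()
    for term in terms:
        hits |= index.get(term, set())
    return sorted(hits)

index = build_index({"akay1980": "the wisest engineer"})
with_analysis = search(index, "engineers")
without_analysis = search(index, "engineers", analyzed=False)
```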

Document Scoring

• tf/idf scoring
  • Term Frequency
    • How often does a term occur in a doc?
    • More often yields a higher score
  • Inverse Document Frequency
    • How many docs have this term?
    • More docs yield lower score (because the term is more common)
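The two factors can be combined as in the sketch below. It uses one textbook formulation, raw term count times log(N / document frequency); the exact formula bleve uses may differ (for example in normalization), and the sample documents are made up.

```python
import math

def tf_idf(term, doc_terms, all_docs):
    """Score one term for one document: term frequency times inverse document frequency."""
    tf = doc_terms.count(term)                        # how often in this doc
    df = sum(1 for d in all_docs if term in d)        # how many docs have it
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

# Hypothetical analyzed documents.
d1 = ["wise", "engineer"]
d2 = ["engineer", "manual"]
d3 = ["engineer", "engineer", "bridge"]
d4 = ["poet"]
docs = [d1, d2, d3, d4]

rare = tf_idf("wise", d1, docs)        # term in 1 of 4 docs
common = tf_idf("engineer", d1, docs)  # term in 3 of 4 docs
```

In this example the rare term "wise" outscores the common term "engineer" in the same document, and "engineer" scores higher in `d3`, where it occurs twice, than in `d1`.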

Quality Results

• Getting high quality results depends on the right text analysis

• Beware: adjustments that increase precision may reduce recall (and the other way around)
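The trade-off in the last bullet can be made concrete with the usual definitions: precision is the share of returned documents that are relevant, recall is the share of relevant documents that were returned. The document keys and the two result sets below are invented purely for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result set against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical strict analysis: few results, all of them relevant.
strict = precision_recall({"d1", "d2"}, relevant)

# Hypothetical loose analysis (e.g. aggressive stemming): more relevant
# docs found, but irrelevant ones come along too.
loose = precision_recall({"d1", "d2", "d3", "d5", "d6", "d7"}, relevant)
```

Here the strict setting gives precision 1.0 and recall 0.5, while the loose setting gives precision 0.5 and recall 0.75.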

cbft design / index partitioning

©2015 Couchbase Inc. 29

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

©2015 Couchbase Inc. 30

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

index partitions: A B C

©2015 Couchbase Inc. 31

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

index partitions: A B C

©2015 Couchbase Inc. 32

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

index partitions: A B C

(groups of vbuckets) 0-399 400-799 800-1023

©2015 Couchbase Inc. 33

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

index partitions: A B C

(groups of vbuckets) 0-399 400-799 800-1023

cbft nodes:

X

©2015 Couchbase Inc. 34

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

index partitions: A B C

(groups of vbuckets) 0-399 400-799 800-1023

assign to cbft nodes:

cbft nodes:

X

©2015 Couchbase Inc. 35

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

index partitions: A B C

(groups of vbuckets) 0-399 400-799 800-1023

assign to cbft nodes:

cbft nodes:

X Y Z

©2015 Couchbase Inc. 36

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, … … ,1021, 1022, 1023 (1024 vbuckets)

index partitions: A B C

(groups of vbuckets) 0-399 400-799 800-1023

assign to cbft nodes:replicas, too:

cbft nodes:

X Y Z

©2015 Couchbase Inc. 37

cbft design / index partitioning

bucket partitions: 0, 1, 2, 3, 4, …, 1021, 1022, 1023 (1024 vbuckets)

index partitions (groups of vbuckets):

  A: 0-399
  B: 400-799
  C: 800-1023

cbft nodes: X, Y, Z

assign index partitions to cbft nodes; replicas, too
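The grouping and assignment steps can be sketched as follows. The grouping reproduces the 0-399 / 400-799 / 800-1023 split from the slide. The placement is an assumption: the slides do not say which partition lands on which node, so `assign_to_nodes` uses a simple round-robin with one replica on the next node. Both function names are hypothetical, and cbft's actual planner may behave differently.

```python
def plan_index_partitions(num_vbuckets, vbuckets_per_partition):
    """Group consecutive vbucket ids into index partitions."""
    return [
        range(start, min(start + vbuckets_per_partition, num_vbuckets))
        for start in range(0, num_vbuckets, vbuckets_per_partition)
    ]

def assign_to_nodes(partition_names, nodes, num_replicas=1):
    """Round-robin placement (an assumption): primary on node i,
    replicas on the nodes that follow it."""
    layout = {}
    for i, name in enumerate(partition_names):
        owners = [nodes[(i + r) % len(nodes)] for r in range(num_replicas + 1)]
        layout[name] = {"primary": owners[0], "replicas": owners[1:]}
    return layout

partitions = plan_index_partitions(1024, 400)
layout = assign_to_nodes(["A", "B", "C"], ["X", "Y", "Z"])
```

Placing each replica on a different node than its primary means losing one cbft node still leaves a copy of every index partition.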

©2015 Couchbase Inc. 38

cbft design / indexing

[diagram: three couchbase nodes feeding three cbft nodes over DCP streams]

©2015 Couchbase Inc. 39

cbft design / indexing

couchbase couchbase couchbase

cbft cbft cbft

DCP streams

©2015 Couchbase Inc. 40

cbft design / queries

[diagram: your application talks REST to one cbft node, which is connected to the other cbft nodes]

a query sent to any cbft node…

…is scatter / gathered to the other cbft nodes
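Scatter / gather can be sketched as: send the query to every node in parallel, collect each node's partial hits, then merge them into one ranked list. In this sketch `query_node` is a stand-in for a remote call and `NODE_HITS` is invented sample data; nothing here reflects cbft's actual REST API or wire format.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical per-node results: each node only knows the hits
# from the index partitions it holds.
NODE_HITS = {
    "X": [("akay1980", 0.9), ("doc7", 0.2)],
    "Y": [("doc3", 0.6)],
    "Z": [("doc9", 0.4), ("doc5", 0.8)],
}

def query_node(node, query):
    """Stand-in for a remote query to one cbft node; returns (key, score) pairs."""
    return NODE_HITS[node]

def scatter_gather(query, nodes, limit):
    """Scatter the query to all nodes in parallel, then gather and merge by score."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: query_node(n, query), nodes))
    merged = [hit for part in partials for hit in part]
    return heapq.nlargest(limit, merged, key=lambda hit: hit[1])

top = scatter_gather("engineer", ["X", "Y", "Z"], limit=3)
```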

©2015 Couchbase Inc. 42

agenda

why cbft?what’s full-text search and how’s it work?designdemostatus / roadmap / what’s next

©2015 Couchbase Inc. 43

agenda

why cbft?what’s full-text search and how’s it work?designdemostatus / roadmap / what’s next

©2015 Couchbase Inc. 44

project status

cbft is developer preview!

please help kick the tires

http://labs.couchbase.com/cbft

©2015 Couchbase Inc. 45

project status / roadmap / what’s next

feature                                     today    future
bleve full-text engine                      y        y
advanced mappings                           y        y
faceted search                              y        y
incremental indexing                        y        y
index partitioning and replication          y        y
index aliases                               y        y
integrated into Couchbase Server & N1QL              y
API stability                                        y
production quality                                   y
performance optimization / tuning                    y
forestdb storage & partial rollbacks                 y
security, SSL                                        y
more docs, examples, SDK support                     y

©2015 Couchbase Inc. 47

links & Q+A

http://labs.couchbase.com/cbft

downloads, getting started, tech docs

and, share your feedback!

THANKS! (and please do the survey!)

©2015 Couchbase Inc. 48

A SNEAK PEEK AT CBFT

couchbase full-text server

THANKS! (and please do the survey!)

©2015 Couchbase Inc. 50

cbft design

[diagram: three couchbase nodes, three cbft nodes, and a cfg bucket]

DCP streams for incremental index updates

a cfg bucket holds metadata about the indexes