Riak Search - Berlin Buzzwords 2010

Post on 15-Jan-2015


Riak Search is a distributed data indexing and search platform built on top of Riak. The talk will introduce Riak Search, covering overall goals, architecture, and core functionality.

Transcript of Riak Search - Berlin Buzzwords 2010

Berlin Buzzwords · June 2010

Basho Technologies

Rusty Klophaus - @rklophaus

Riak Search: A Full-Text Search and Indexing Engine based on Riak

Why did we build it?

What are the major goals?

How does it work?

2

Part One

Why did we build Riak Search?

3

Riak is a scalable, highly-available, networked, open-source key/value store.

4

Key/Value

CLIENT RIAK

5

Writing to a Key/Value Store

Object

CLIENT RIAK

6

Writing to a Key/Value Store

Key

Object

CLIENT RIAK

Querying a Key/Value Store

7

Key + Instructions

Object(s)

CLIENT RIAK

Walk to Related

Keys

Querying Riak via LinkWalking

8

Key(s) + JS Functions

Computed Value(s)

CLIENT RIAK

Map

Reduce

Map

Querying Riak via Map/Reduce

9

Key/Value Stores like Key-Based Queries

10

where Category == "Shoes"

CLIENT RIAK

WTF!? I'm a KV store!

Query by Secondary Index

11

"Converse AND Shoes"

CLIENT RIAK

This is getting old.

Full-Text Query

12

These kinds of queries need an Index.
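To make the point concrete, here is a minimal sketch in Python of why a key/value store handles key lookups easily but struggles with a query like `Category == "Shoes"`. The `store` dictionary, the SKU keys, and the helper names are all hypothetical stand-ins for illustration; none of this is Riak's API.

```python
# Hypothetical in-memory stand-in for a key/value store (not Riak's API).
store = {
    "sku-1": {"brand": "Converse", "category": "Shoes"},
    "sku-2": {"brand": "Acme", "category": "Hats"},
    "sku-3": {"brand": "Vans", "category": "Shoes"},
}

def get(key):
    """Key-based query: one direct fetch."""
    return store[key]

def scan_by_category(category):
    """Without an index, a non-key query has to inspect every object."""
    return sorted(k for k, obj in store.items() if obj["category"] == category)

def build_category_index():
    """A secondary index maps a field value back to the keys that hold it."""
    index = {}
    for key, obj in store.items():
        index.setdefault(obj["category"], []).append(key)
    return index

category_index = build_category_index()
```

With the index in place, `category_index["Shoes"]` answers the query without touching unrelated objects, which is roughly the service the rest of the talk is about providing.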

*Market Opportunity!*

13

Part Two

What are the major goals of Riak Search?

14

Your Application

Riak

An application built on Riak.

15

Your Application

Riak Index

Object

Hrm... I need an index.

16

Your Application

Riak ???

Hrm... I need an index with more features.

17

Your Application

Riak Lucene

Lucene should do the trick...

18

Your Application

Lucene Lucene Lucene Riak

...shard to add more storage capacity...

19

Your Application

Lucene Lucene Lucene

Lucene Lucene Lucene

Lucene Lucene Lucene

Riak

...replicate to add more throughput.

20

21

Operations nightmare!

Your Application

Riak-ified Lucene

Riak

What do we really want?

22

Your Application

Riak Search

Riak

What do we really want?

23

Functionality? Be like Lucene (and more).

• Lucene Syntax

• Leverages Java Lucene Analyzers

• Solr Endpoints

• Integration via Riak Post-Commit Hook (Index)

• Integration via Riak Map/Reduce (Query)

• Near-Realtime

• Schema-less

24

Operations? Be like Riak.

• No special nodes

• Add nodes, get more compute and storage

• Automatically load balance

• Replicas for durability and performance

• Index and query in parallel

• Swappable storage backends

25

Part Three

How do we do it?

26

A Gentle Introduction to Document Indexing

27

Document

#1: Every dog has his day.

Inverted Index

day, 1

dog, 1

every, 1

has, 1

his, 1

The Inverted Index

28

Documents

#1: Every dog has his day.

#2: The dog's bark is worse than his bite.

#3: Let the cat out of the bag.

#4: It's raining cats and dogs.

Combined Inverted Index

and, 4

bag, 3

bark, 2

bite, 2

cat, 3

cat, 4

day, 1

dog, 1

dog, 2

dog, 4

every, 1

has, 1

...

The Inverted Index
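The combined index above can be reproduced with a few lines of Python. This is a minimal sketch, not Riak Search's implementation: the `analyze` function is a deliberately crude, hypothetical stand-in for a real analyzer (it lowercases, drops possessives, and strips a plural "s"), chosen only so that "dog's" and "dogs" both map to "dog" as they do on the slide.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Every dog has his day.",
    2: "The dog's bark is worse than his bite.",
    3: "Let the cat out of the bag.",
    4: "It's raining cats and dogs.",
}

def analyze(text):
    """Crude stand-in for an analyzer: lowercase, drop possessives and plural 's'."""
    terms = []
    for word in re.findall(r"[a-z']+", text.lower()):
        word = word.split("'")[0]              # "dog's" -> "dog", "it's" -> "it"
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]                   # "dogs" -> "dog", "cats" -> "cat"
        if word:
            terms.append(word)
    return terms

def build_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_index(DOCS)
```

Here `index["dog"]` is `[1, 2, 4]` and `index["cat"]` is `[3, 4]`, matching the `dog, 1`, `dog, 2`, `dog, 4`, `cat, 3`, `cat, 4` entries on the slide.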

29

"dog AND cat"

AND

dog cat

At Query Time...

30

AND

dog cat

dog, 1

dog, 2

dog, 4

cat, 3

cat, 4

At Query Time...

31

AND (Merge Intersection)

1

2

4

3

4

Result: 4

At Query Time...

32

OR (Merge Union)

1

2

4

3

4

Result: 1, 2, 3, 4

At Query Time...
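The two merge operations can be sketched as a single pass over two sorted posting lists. This is a generic illustration in Python of merge intersection and merge union, assuming each list holds sorted, de-duplicated document IDs; the function names are made up for this example.

```python
def merge_intersection(a, b):
    """AND: keep only the IDs present in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def merge_union(a, b):
    """OR: keep every ID from either sorted list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])   # whatever remains in either list is already sorted
    out.extend(b[j:])
    return out

dog = [1, 2, 4]
cat = [3, 4]
```

With these inputs, `merge_intersection(dog, cat)` gives `[4]` and `merge_union(dog, cat)` gives `[1, 2, 3, 4]`, the same results shown on the slides. Because both inputs are sorted, each merge needs only one pass and can stream its output.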

33

Complex Behavior from Simple Structures

34

Storage Approaches...

35

Riak Search uses Consistent Hashing to store data on Partitions

36

Partitions = 10

Number of Nodes = 5

Partitions per Node = 2

Replicas (NVal) = 2

Introduction to Consistent Hashing and Partitions

37

Object

Introduction to Consistent Hashing and Partitions
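A rough sketch in Python of the setup on the slide (10 partitions, 5 nodes, NVal of 2) follows. Several details are assumptions made for illustration and are not taken from the slides: the SHA-1 hash over a 160-bit ring, the round-robin assignment of partitions to nodes, the rule that replicas go to the next partition around the ring, and the node names.

```python
import hashlib

NUM_PARTITIONS = 10
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical node names
N_VAL = 2
RING_SIZE = 2 ** 160  # assumed: SHA-1 output space treated as a ring

def partition_for(key):
    """Hash the key onto the ring and return the partition that covers it."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    position = int.from_bytes(digest, "big")
    return position * NUM_PARTITIONS // RING_SIZE

def owner(partition):
    """Assumed round-robin ownership: each node gets every fifth partition."""
    return NODES[partition % len(NODES)]

def preference_list(key):
    """The first partition plus the next N_VAL - 1 partitions around the ring."""
    first = partition_for(key)
    return [(first + i) % NUM_PARTITIONS for i in range(N_VAL)]
```

Under these assumptions each node owns two partitions, and the two replicas of any object land on adjacent partitions that belong to different nodes, so losing one node does not lose the object.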

38

Document Partitioning vs. Term Partitioning

39

...and the Resulting Tradeoffs

40

#1: Every dog has his day.

Document Partitioning @ Index Time

41

"dog OR cat"

Document Partitioning @ Query Time

42

#1: Every dog has his day.

day, 1

dog, 1

every, 1

has, 1

his, 1

Term Partitioning @ Index Time

43

day, 1 has, 1

every, 1 his, 1

dog, 1

Term Partitioning @ Index Time

44

"dog OR cat"

Term Partitioning @ Query Time
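The difference between the two schemes comes down to which value is hashed to pick a partition. The Python sketch below is an illustration only: the `bucket` function uses CRC32 as a simple deterministic hash rather than whatever Riak Search uses, and all function names are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 10

def bucket(value):
    """Deterministic stand-in hash; CRC32 is an assumption for illustration."""
    return zlib.crc32(value.encode("utf-8")) % NUM_PARTITIONS

def index_by_document(doc_id, terms):
    """Document partitioning: every posting of a document goes to one partition."""
    return {bucket(str(doc_id)): [(term, doc_id) for term in terms]}

def index_by_term(doc_id, terms):
    """Term partitioning: each posting goes to the partition that owns its term."""
    placement = defaultdict(list)
    for term in terms:
        placement[bucket(term)].append((term, doc_id))
    return dict(placement)

def partitions_for_query_by_document(query_terms):
    """Any partition may hold a matching document, so all must be asked."""
    return set(range(NUM_PARTITIONS))

def partitions_for_query_by_term(query_terms):
    """Only the partitions that own the query terms need to be asked."""
    return {bucket(term) for term in query_terms}

terms = ["day", "dog", "every", "has", "his"]
```

Indexing document #1 touches one partition under document partitioning but up to five under term partitioning. A query such as "dog OR cat" reverses the picture: it fans out to all ten partitions in the first scheme and to at most two in the second.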

45

Document Partitioning

+ Lower Latency Queries

- Lower Throughput

- Lots of Disk Seeks

Term Partitioning

- Higher Latency Queries

+ Higher Throughput

- Hotspots in Ring (the "Obama" problem)

Tradeoffs...

46

Riak Search: Term Partitioning

47

Term-partitioning is the most viable approach for our beta clients’ needs: high throughput on Really Big Datasets.

Optimizations:

• Term splitting to reduce hot spots

• Bloom filters & caching to save query-time bandwidth

• Batching to save query-time & index-time bandwidth
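The slides do not say how Riak Search applies Bloom filters, so the following Python sketch is only one plausible reading: for an AND query, the partition holding one term's postings sends a compact Bloom filter instead of the full list, and the other partition uses it to discard document IDs that cannot match before sending anything back. The `BloomFilter` class, its sizes, and the SHA-256 based hashing are all assumptions chosen for the example.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

# Partition owning "dog" builds a filter over its postings...
dog_filter = BloomFilter()
for doc_id in [1, 2, 4]:
    dog_filter.add(doc_id)

# ...and the partition owning "cat" keeps only IDs that might also match "dog".
cat_postings = [3, 4]
candidates = [d for d in cat_postings if dog_filter.might_contain(d)]
```

Document 4 always survives the filter because Bloom filters never produce false negatives. A non-matching ID can occasionally slip through as a false positive, so the final intersection still has to be checked against the real postings.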

Support for either approach eventually.

Part Four

Review

48

"Converse AND Shoes"

CLIENT RIAK

WTF!? I'm a KV store!

Riak Search turns this...

49

"Converse AND Shoes"

CLIENT RIAK

Gladly!

...into this...

50

"Converse AND Shoes"

CLIENT RIAK

Keys or Objects

...into this...

51

Your Application

Riak Search

Riak

...while keeping operations easy.

52

Thanks! Questions?

Search Team:

John Muellerleile - @jrecursive

Rusty Klophaus - @rklophaus

Kevin Smith - @kevsmith

Currently working with a small set of Beta users.

Open-source release planned for Q3.

www.basho.com