Content Search for Business Using Solr: Presented by Wei Zhao, Box

Wei Zhao

Backend Software Engineer at Box

[email protected]

3

November 2014

Content Search for Business Using Solr

4

Box mission

To make organizations more productive, competitive and collaborative by connecting people and their most important information

5

25MM+ Users

225K+ Businesses

99% Fortune 500

6

Box search mission is to make user content easy to discover.

7

Box uses Solr for search

10 Billion+ Documents

10 TB+ Index size

100M+ Daily requests

8

Quick Search

9

Quick Search

10

Full Search

11

Agenda

1 Sharding – splitting the index

2 Highly available search

3 A few more things

4 Currently working on

5 Q&A

12

We shard things

13

Shard ID = File ID % Total Shards
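The rule on this slide can be written as a one-line function. This is a minimal sketch; the function name and the example shard count of 8 are illustrative and do not come from the talk. Note that the modulo yields shard numbers 0 to N-1, whereas the next slide labels shards from 1 to N.

```python
def shard_id(file_id: int, total_shards: int) -> int:
    """Return the shard for a file: Shard ID = File ID % Total Shards."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    return file_id % total_shards


# 12345 = 8 * 1543 + 1, so file 12345 lands on shard 1 of 8.
print(shard_id(12345, 8))
```

Because the shard depends only on the file ID, one user's files end up spread across all shards.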

14

Multi-tenant – One big logical index for all users

Solr index

Shard1 Shard2 Shard3 ShardN

15

Search scope

16

A typical Solr Document

File ID: 12345

Owner ID: user1

Parent Folder IDs: folder1, folder2

File Name: Solr.ppt

File Content: blah

......

17

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1, File 2, File 3, File 4

18

User1 with no shared folders

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1, File 2, File 3, File 4

19

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder2

Owner: User1, Parent: Folder1, Folder4

File 1, File 2, File 3, File 4

21

User2 shares Folder2 with User1

Owner: User1, Parent: Folder1

Owner: User2, Parent: Folder3

Owner: User2, Parent: Folder5

Owner: User1, Parent: Folder1, Folder4

File 1, File 2, File 3, File 4

Removed out of Folder2
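The sharing slides above suggest that a user's search scope is the files they own plus the files under folders shared with them. The sketch below illustrates that rule under stated assumptions: the field names `owner_id` and `parent_folder_ids` are hypothetical, and the file IDs are placeholders, because the slides do not say which file carries which owner and parent. It is not Box's actual query.

```python
# Placeholder documents mirroring the four owner/parent pairs on the slides.
DOCS = [
    {"id": "file_a", "owner_id": "User1", "parent_folder_ids": ["Folder1"]},
    {"id": "file_b", "owner_id": "User2", "parent_folder_ids": ["Folder3"]},
    {"id": "file_c", "owner_id": "User2", "parent_folder_ids": ["Folder2"]},
    {"id": "file_d", "owner_id": "User1", "parent_folder_ids": ["Folder1", "Folder4"]},
]


def scope_filter(user_id, shared_folder_ids=()):
    """Build a Lucene-syntax filter string for what a user may search."""
    clauses = [f"owner_id:{user_id}"]
    if shared_folder_ids:
        folders = " OR ".join(sorted(shared_folder_ids))
        clauses.append(f"parent_folder_ids:({folders})")
    return " OR ".join(clauses)


def visible(docs, user_id, shared_folder_ids=()):
    """Apply the same rule in memory to show which documents the filter admits."""
    shared = set(shared_folder_ids)
    return [
        d["id"]
        for d in docs
        if d["owner_id"] == user_id or shared & set(d["parent_folder_ids"])
    ]


print(visible(DOCS, "User1"))               # ['file_a', 'file_d']
print(visible(DOCS, "User1", ["Folder2"]))  # ['file_a', 'file_c', 'file_d']

# Moving file_c from Folder2 to Folder5 takes it out of User1's scope again.
moved = [
    dict(d, parent_folder_ids=["Folder5"]) if d["id"] == "file_c" else d
    for d in DOCS
]
print(visible(moved, "User1", ["Folder2"]))  # ['file_a', 'file_d']
```

Storing the parent folder IDs on each document means a share changes only the user's filter, while a move requires updating the moved file's document.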

23

Highly Available Search

24

• Index is highly available

• Search functionality is highly available

25

Index workflow

26

Box Front End

Upload

Index Queue

Queue 1

Queue 2

Queue 3

Indexer 1

Indexer 2

Indexer 3

MySQL

Index1

Index2

Index2
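One possible reading of this diagram is that each upload event is placed on a queue, and each indexer drains one queue into one index. The sketch below assumes a one-to-one mapping between queues, indexers and shards, routed by the File ID % Total Shards rule. The slide does not confirm that mapping, and the role of MySQL is left out. All class and function names are hypothetical.

```python
from collections import deque


class IndexQueue:
    """Toy stand-in for the 'Index Queue' box: one FIFO queue per shard."""

    def __init__(self, total_shards):
        self.queues = [deque() for _ in range(total_shards)]

    def enqueue(self, file_id, fields):
        # Assumption: events are routed with the File ID % Total Shards rule.
        self.queues[file_id % len(self.queues)].append((file_id, fields))


def run_indexer(queue, index):
    """Drain one queue into its index (a plain dict stands in for Solr)."""
    while queue:
        file_id, fields = queue.popleft()
        index[file_id] = fields


index_queue = IndexQueue(total_shards=3)
index_queue.enqueue(12345, {"file_name": "Solr.ppt"})   # 12345 % 3 == 0
index_queue.enqueue(7, {"file_name": "notes.txt"})      # 7 % 3 == 1

indexes = [{} for _ in range(3)]
for queue, index in zip(index_queue.queues, indexes):
    run_indexer(queue, index)
```

Keeping a queue between the front end and the indexers lets uploads succeed even while an indexer or index is temporarily unavailable, since queued events can be processed later.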

27

Search workflow

28

Box Front End

query

HA Proxy

Head node

HA Proxy

1 2 3 N

Box Front End

query

HA Proxy

Head node

HA Proxy

1 2 3 N

Data center boundary
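The diagram can be read as a head node that fans each query out to shards 1 to N and merges the results. The sketch below shows only that fan-out-and-merge idea, with in-memory dictionaries standing in for shards and a toy term-count scorer. The function names are hypothetical, and real Solr scoring and the proxy layers are not modelled.

```python
import heapq


def search_shard(shard_docs, term):
    """Score the documents on one shard by counting occurrences of the term."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term.lower())
        if score:
            hits.append((score, doc_id))
    return hits


def head_node_search(shards, term, rows=10):
    """Query every shard, then keep the `rows` highest-scoring hits overall."""
    all_hits = []
    for shard_docs in shards:
        all_hits.extend(search_shard(shard_docs, term))
    return heapq.nlargest(rows, all_hits)


shards = [
    {1: "solr search"},
    {2: "solr solr index"},
    {3: "box"},
]
print(head_node_search(shards, "solr"))  # [(2, 2), (1, 1)]
```

With modulo sharding on file ID, a user's files can sit on any shard, so the head node has to consult all of them for every query.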

29

A few more things

30

File Content Search

31

Box Front End

Upload

MySQL

Box File Storage

Indexer

Solr Index

Text Extraction

Extracted Text

32

Multi-language support

33

Raw file content

Language detector

English tokenizer

Spanish tokenizer

Japanese tokenizer

German tokenizer

file_content_en

file_content_es {hola}

file_content_ja....

file_content_de
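The routing on this slide, from detected language to a per-language field, can be sketched as below. The `file_content_<lang>` field names follow the slide. The detector is passed in as a plain function because the talk does not name one, and the fallback to English for unsupported languages is an assumption of this sketch.

```python
SUPPORTED_LANGUAGES = {"en", "es", "ja", "de"}


def content_field(lang_code, default="en"):
    """Map a detected language code to its Solr field name."""
    # Assumption: unsupported languages fall back to the default field.
    lang = lang_code if lang_code in SUPPORTED_LANGUAGES else default
    return f"file_content_{lang}"


def to_solr_fields(raw_text, detect_language):
    """Place the raw text in the field whose tokenizer matches its language."""
    return {content_field(detect_language(raw_text)): raw_text}


# A stub detector that always answers Spanish, standing in for a real one.
print(to_solr_fields("hola", lambda text: "es"))  # {'file_content_es': 'hola'}
```

Each field can then be configured with its own tokenizer, which is why one field per language is needed.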

34

To Dos

• Scale language support

• Support documents with mixed languages

35

Search Warm-up

36

• Front end informs backend to warm up on keyboard focus

• Backend prepares the search filter and caches it in a search session

• Backend sends a warm-up query to Solr
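The three steps above can be sketched as a small session object. This is an illustration under assumptions: the class and callback names are hypothetical, and the shape of the warm-up query (a match-all query with `rows` set to 0) is a guess, since the slide does not show it. `q`, `fq` and `rows` are standard Solr query parameters.

```python
class SearchSession:
    """Caches a user's search filter so it is built once per session."""

    def __init__(self, build_filter, send_to_solr):
        self.build_filter = build_filter    # hypothetical: user_id -> filter string
        self.send_to_solr = send_to_solr    # hypothetical: params dict -> results
        self.cached_filter = None

    def warm_up(self, user_id):
        """Called when the front end reports keyboard focus on the search box."""
        self.cached_filter = self.build_filter(user_id)
        # Assumed warm-up query: match everything in scope, return no rows.
        self.send_to_solr({"q": "*:*", "fq": self.cached_filter, "rows": 0})

    def search(self, user_id, text):
        """Run the real query, reusing the cached filter when it exists."""
        if self.cached_filter is None:
            self.cached_filter = self.build_filter(user_id)
        return self.send_to_solr({"q": text, "fq": self.cached_filter, "rows": 10})
```

By the time the user finishes typing, the expensive filter has already been built and Solr has already evaluated it once, so the first real query has less work to do.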

37

What we are working on

38

Things we are working on

• Search suggestions

• Search operators

• Use machine learning to influence ranking

• Logical sharding

39

Questions?

40

Contact: [email protected]

We are hiring!