Be Lazy & Scale

Be Lazy & Scale Full-Text Tagging Billions Of Messages

Transcript of Be Lazy & Scale

Page 1: Be Lazy & Scale

Be Lazy & Scale: Full-Text Tagging Billions Of Messages

Page 2: Be Lazy & Scale
Page 3: Be Lazy & Scale
Page 4: Be Lazy & Scale

reverse mapping checking getaddrinfo for xxxxx [xxx.xxx.xxx.xxx] failed - POSSIBLE BREAK-IN ATTEMPT!

pam_unix(sshd:session): session opened for user xxxxxx by (uid=0)

Bad protocol version identification 'root' from xxx.xx.xxx.xx port xxxxx

reverse mapping checking getaddrinfo for xxxxx [xxx.xxx.xxx.xxx] failed - POSSIBLE BREAK-IN ATTEMPT!

Bad protocol version identification 'root' from xxx.xx.xxx.xx port xxxxx

pam_unix(sshd:session): session opened for user xxxxxx by (uid=0)

Page 5: Be Lazy & Scale
Page 6: Be Lazy & Scale

Percolator

Traditionally you design documents based on your data, store them into an index, and then define queries via the search API in order to retrieve these documents. The percolator works in the opposite direction. First you store queries into an index and then, via the percolate API, you define documents in order to retrieve these queries.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
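The reversed direction described above can be sketched with a toy, in-memory percolator. This is only an illustration of the idea, not the Elasticsearch implementation: the `ToyPercolator` class, the `analyze` tokenizer and the tag names are all hypothetical, and every registered query is treated as a simple phrase.

```python
import re


def analyze(text):
    """Stand-in analyzer (assumption): lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


class ToyPercolator:
    """Stores queries first, then matches incoming documents against them."""

    def __init__(self):
        self.queries = {}  # tag name -> analyzed phrase

    def register(self, tag, phrase):
        self.queries[tag] = analyze(phrase)

    def percolate(self, message):
        tokens = analyze(message)
        return [tag for tag, phrase in self.queries.items()
                if contains_phrase(tokens, phrase)]


percolator = ToyPercolator()
percolator.register("break-in", "possible break-in attempt!")
percolator.register("bad-protocol", "bad protocol version identification")
percolator.register("session", "session opened")

message = "Bad protocol version identification 'root' from xxx.xx.xxx.xx port xxxxx"
print(percolator.percolate(message))  # ['bad-protocol']
```

Note that `percolate` runs every registered query against every message, so the cost grows with the number of tags.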

reverse mapping checking getaddrinfo for xxxxx [xxx.xxx.xxx.xxx] failed - POSSIBLE BREAK-IN ATTEMPT!

reverse mapping checking getaddrinfo for xxxxx [xxx.xxx.xxx.xxx] failed - POSSIBLE BREAK-IN ATTEMPT!

"possible break-in attempt!"

"bad protocol version identification"

"session opened"

Page 7: Be Lazy & Scale
Page 8: Be Lazy & Scale
Page 9: Be Lazy & Scale

/0-10/173

Page 10: Be Lazy & Scale
Page 11: Be Lazy & Scale

$$$

$$$

Page 12: Be Lazy & Scale

Bad protocol version identification ...

"bad protocol" --> Phrase Query

version --> Term Query

ident* --> Prefix Query

Boolean Query --> AND, OR, NOT
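For reference, the four query types above roughly correspond to the `match_phrase`, `term`, `prefix` and `bool` queries of the Elasticsearch query DSL. The sketch below builds them as plain Python dictionaries. The field name `message` is an assumption, and the slides do not show how the actual tag queries were written.

```python
import json

FIELD = "message"  # hypothetical field name, not taken from the slides

phrase_query = {"match_phrase": {FIELD: "bad protocol"}}  # Phrase Query
term_query = {"term": {FIELD: "version"}}                 # Term Query
prefix_query = {"prefix": {FIELD: "ident"}}               # Prefix Query (ident*)

# Boolean Query: must ~ AND, should ~ OR, must_not ~ NOT.
# When a must clause is present, should clauses are optional.
boolean_query = {
    "bool": {
        "must": [phrase_query],
        "should": [term_query],
        "must_not": [prefix_query],
    }
}

print(json.dumps(boolean_query, indent=2))
```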

Page 13: Be Lazy & Scale

Baseline: 105s

160 Tags (real life), 500000 Runs (based on real messages), ~ 33% Matches

1 Big OR: 109s (+3.8%)

Using single char message 'a': 96s (-8.5%)

Page 14: Be Lazy & Scale

Baseline: 105s

Trivial 1 Term clause / tag: 28.6s (-72.8%)

160

~ 29550000

0

~ 33%

Tags (real life), Terminal Clauses, Runs (based on real messages), Matches

Keep only 1 clause / tag: 62.7s (-41%)

Page 15: Be Lazy & Scale

Perco. Queries Index

Register Queries

In-Memory Index

Bad protocol ...

Bad protocol ...

Perco. Req. Bad protocol ...

Perco. Resp.

Execute Each Query

Page 16: Be Lazy & Scale
Page 17: Be Lazy & Scale

Query Term Index

possible --> 0
break --> 1
in --> 2
attempt --> 3
version --> 4

Query Clauses --> Rewritten Clauses

"POSSIBLE BREAK-IN ATTEMPT!" --> [0, 1, 2, 3]

connect* --> connect*

version --> 4
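One way to read this slide: every exact term used by a query clause gets a small integer id, phrase and term clauses are rewritten to those ids, and prefix clauses are kept as text. A minimal sketch under those assumptions follows. The function names and the `analyze` tokenizer are hypothetical; the clause order is chosen so the ids match the slide.

```python
import re


def analyze(text):
    """Stand-in analyzer (assumption): lowercase, split on non-alphanumerics."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_term_index(clauses):
    """Assign a dense integer id to each exact term; prefix clauses are skipped."""
    index = {}
    for clause in clauses:
        if clause.endswith("*"):
            continue
        for term in analyze(clause):
            index.setdefault(term, len(index))
    return index


def rewrite_clause(clause, index):
    """Phrase -> list of ids, single term -> one id, prefix -> unchanged text."""
    if clause.endswith("*"):
        return clause
    ids = [index[term] for term in analyze(clause)]
    return ids if len(ids) > 1 else ids[0]


clauses = ["POSSIBLE BREAK-IN ATTEMPT!", "connect*", "version"]
term_index = build_term_index(clauses)
print(term_index)
# {'possible': 0, 'break': 1, 'in': 2, 'attempt': 3, 'version': 4}
print([rewrite_clause(c, term_index) for c in clauses])
# [[0, 1, 2, 3], 'connect*', 4]
```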

Page 18: Be Lazy & Scale

Query Term Index

possible --> 0
break --> 1
in --> 2
attempt --> 3
version --> 4

Raw Message

reverse mapping checking getaddrinfo for xxxxx [xxx.xxx.xxx.xxx] failed - POSSIBLE BREAK-IN ATTEMPT!

Analyzed Message

[reverse, mapping, checking, getaddrinfo, for, xxxxx, xxx, xxx, xxx, xxx, failed, possible, break, in, attempt]

Message Rewritten in Query Space

[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3]

Query Term Presence Bitset

[true, true, true, true, false]
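The two derived structures on this slide can be reproduced with a few lines. This is a sketch, assuming the analyzed message is already available as a token list. `rewrite_message` and `presence_bitset` are hypothetical names, and a plain list of booleans stands in for a real bitset.

```python
TERM_INDEX = {"possible": 0, "break": 1, "in": 2, "attempt": 3, "version": 4}

analyzed = ["reverse", "mapping", "checking", "getaddrinfo", "for", "xxxxx",
            "xxx", "xxx", "xxx", "xxx", "failed",
            "possible", "break", "in", "attempt"]


def rewrite_message(tokens, index):
    """Map each token to its query term id, or -1 if no query uses it."""
    return [index.get(tok, -1) for tok in tokens]


def presence_bitset(rewritten, size):
    """bits[i] is True when query term i occurs somewhere in the message."""
    bits = [False] * size
    for term_id in rewritten:
        if term_id >= 0:
            bits[term_id] = True
    return bits


rewritten = rewrite_message(analyzed, TERM_INDEX)
bitset = presence_bitset(rewritten, len(TERM_INDEX))
print(rewritten)  # [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3]
print(bitset)     # [True, True, True, True, False]
```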

Page 19: Be Lazy & Scale

Analyzed Message

[reverse, mapping, checking, getaddrinfo, for, xxxxx, xxx, xxx, xxx, xxx, failed, possible, break, in, attempt]

Message Rewritten in Query Space

[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3]

Query Term Presence Bitset

[true, true, true, true, false]

"POSSIBLE BREAK-IN ATTEMPT!" --> [0, 1, 2, 3]

Quick Check / Early Termination

Actual Check: ~ contains
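A plausible reading of this slide is a two-step phrase match: first a cheap test on the presence bitset that can reject most messages early, then a "contains" test on the rewritten message for the survivors. The sketch below assumes that reading; the function names are hypothetical.

```python
def quick_check(phrase_ids, bitset):
    """Early termination: every phrase term must be present in the message."""
    return all(bitset[term_id] for term_id in phrase_ids)


def contains_run(rewritten, phrase_ids):
    """Actual check: the ids must appear as a contiguous run, in order."""
    n = len(phrase_ids)
    return any(rewritten[i:i + n] == phrase_ids
               for i in range(len(rewritten) - n + 1))


def phrase_matches(phrase_ids, rewritten, bitset):
    return quick_check(phrase_ids, bitset) and contains_run(rewritten, phrase_ids)


phrase = [0, 1, 2, 3]  # "POSSIBLE BREAK-IN ATTEMPT!"

rewritten = [-1] * 11 + [0, 1, 2, 3]
bitset = [True, True, True, True, False]
print(phrase_matches(phrase, rewritten, bitset))  # True

# All four terms present but in the wrong order: the quick check passes,
# the actual check rejects.
shuffled = [0, 3, 1, 2]
print(phrase_matches(phrase, shuffled, [True, True, True, True, False]))  # False
```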

Page 20: Be Lazy & Scale

Analyzed Message

[reverse, mapping, checking, getaddrinfo, for, xxxxx, xxx, xxx, xxx, xxx, failed, possible, break, in, attempt]

Message Rewritten in Query Space

[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3]

Query Term Presence Bitset

[true, true, true, true, false]

connect* --> connect*

Brute Force / startsWith (FAST!)
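The brute-force prefix check named on this slide can be sketched as a linear scan over the analyzed tokens. This assumes a prefix clause is written with a trailing `*`, as on the slides. `prefix_matches` is a hypothetical name, and Python's `str.startswith` stands in for `startsWith`.

```python
def prefix_matches(prefix_clause, tokens):
    """Scan every token and test it against the prefix (no index involved)."""
    prefix = prefix_clause.rstrip("*")
    return any(tok.startswith(prefix) for tok in tokens)


analyzed = ["reverse", "mapping", "checking", "getaddrinfo", "for", "xxxxx",
            "xxx", "xxx", "xxx", "xxx", "failed",
            "possible", "break", "in", "attempt"]

print(prefix_matches("connect*", analyzed))                 # False
print(prefix_matches("connect*", ["connection", "closed"])) # True
```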

Page 21: Be Lazy & Scale

Analyzed Message

[reverse, mapping, checking, getaddrinfo, for, xxxxx, xxx, xxx, xxx, xxx, failed, possible, break, in, attempt]

Message Rewritten in Query Space

[-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2, 3]

Query Term Presence Bitset

[true, true, true, true, false]

version --> 4

Simple Lookup
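With the term rewritten to its id, the term check reduces to reading one position of the presence bitset. A minimal sketch, with a list of booleans standing in for the bitset and `term_matches` as a hypothetical name:

```python
def term_matches(term_id, bitset):
    """A term clause matches when its bit is set; no scan of the message."""
    return bitset[term_id]


VERSION_ID = 4  # "version" --> 4 in the query term index
bitset = [True, True, True, True, False]

print(term_matches(VERSION_ID, bitset))  # False: "version" is not in the message
print(term_matches(0, bitset))           # True: "possible" is in the message
```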

Page 22: Be Lazy & Scale

AND/OR/NOT
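Once each terminal clause (phrase, prefix, term) yields a boolean, AND/OR/NOT can be evaluated over those results. The sketch below uses a small tuple-based tree, which is a hypothetical representation chosen for illustration. Because `all` and `any` consume the generator lazily, evaluation stops as soon as the outcome is known.

```python
def evaluate(node, leaf_result):
    """Evaluate a boolean tree; leaves are resolved through `leaf_result`."""
    op = node[0]
    if op == "leaf":
        return leaf_result(node[1])
    if op == "not":
        return not evaluate(node[1], leaf_result)
    children = (evaluate(child, leaf_result) for child in node[1:])
    if op == "and":
        return all(children)  # stops at the first False
    if op == "or":
        return any(children)  # stops at the first True
    raise ValueError(f"unknown operator: {op}")


# Clause results for the break-in message (assumed already computed).
results = {"phrase": True, "prefix": False, "term": False}

# "POSSIBLE BREAK-IN ATTEMPT!" AND NOT (connect* OR version)
query = ("and",
         ("leaf", "phrase"),
         ("not", ("or", ("leaf", "prefix"), ("leaf", "term"))))

print(evaluate(query, results.__getitem__))  # True
```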

Page 23: Be Lazy & Scale
Page 24: Be Lazy & Scale

160 Tags, 500000 Runs, ~ 33% Matches

105s --> 7.3s (x14.4 Faster)

320 Tags, 500000 Runs, ~ 33% Matches

195s --> 8.8s (x22.2 Faster)
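The speedup factors follow directly from the timings on this slide. A quick check of the arithmetic, with `speedup` as a hypothetical helper:

```python
def speedup(before_s, after_s):
    """Ratio of the original run time to the optimized run time."""
    return before_s / after_s


print(round(speedup(105, 7.3), 1))  # 14.4 (160 tags)
print(round(speedup(195, 8.8), 1))  # 22.2 (320 tags)
```

Doubling the number of tags takes the original run from 105s to 195s, while the optimized run goes from 7.3s to 8.8s.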

Page 25: Be Lazy & Scale