In-Memory Text Search Engines - KIT

31
In-Memory Text Search Engines Peter Sanders, Frederik Transier

Transcript of In-Memory Text Search Engines - KIT

In-Memory Text Search Engines

Peter Sanders,

Frederik Transier

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 2

Guest Logo

Demands and Goals of Text Search Engines

� Locating words in a huge number of documents.

� Important Operations: AND queries, phrase queries, document reporting.

� Using main-memory for maximumquery performance.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 3

Guest Logo

Inverted Index

� Unique IDs for:

� documents

� terms

� A document becomesa list of term IDs.

� Inverted Index: For each term we storea list of document IDs.

� Positional Inverted Index:Positions list for eachterm / document pair.

1037551357

9975221

120 151103513573

2

1

bye

world

hello

Structure of an Inverted Index

dictionary Inverted Lists

321253

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 4

Guest Logo

Intersection Algorithms: Randomized Inverted Indices

� Randomized Inverted Index: document IDs are assigned randomly. (e.g. using pseudo-random permutation)

� Two-level data structure:

� Split the range of document IDs into buckets based on their most significant

bits.

� Lookup-table: direct access to the first

value of a bucket1011011

1010100

1001011

1000100

0111101

0110111

0101110

0101000

0011111

0011000

0001111

0000111

11

10

01

00

12

8

4

0

lookup –table

Inverted List with lookup table

B = 4

( lookup table size: log2(n/B) )

0

n-1

4

8

n=12

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 5

Guest Logo

Intersection Algorithms: Randomized Inverted Indices

� Lookup is an intersection algorithm that runs on this data structure:

Inverted List with lookup table

n=12

Function lookup(M,N)

O := {} // output

i := -1 // current bucket key (now a dummy)

foreach d ∈ M do // unpack M

h := d >> kN // bucket key

l := d & (2kN-1) // least significant bits

if h > i then // a new bucket

i := h // set current bucket

j := t[i] // get start position

e := t[i+1] // get end position

while j < e do // bucket not exhausted

l‘ := N[j] // unpack if necessary

if l ≤ l‘ then

if l = l‘ then O := O ∪ {d}

break while

j++

return O

Lookup runs in

expected timeO(m+min{n,Bm})

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 6

Guest Logo

Experiments: Space Consumption (WT2g)

No significant differencesbetween ∆-bitc. and ∆-escaped

for the rand. representation.

∆-bitc. does notwork well for the

det. representation.

Escaping can exploit

nonuniformity of input.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 7

Guest Logo

Experiments: Performance of Lookup (WT2g)

bit-compressed ∆-encoding

Large bucketsare bad for

small ratios...

…but good fornearly equal

lengths.

Bucket size 8

seems to be a good compromise.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 8

Guest Logo

Experiments: Performance Comparison (WT2g)

Lookup is best

up to ratioclose to one.

Zipper, skippperand lookup are all very good for lists

of similar lengths.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 9

Guest Logo

Experiments: Space-time Tradeoff of Encodings (WT2g.s)

Performance loss of

∆-encoding isnegligible low for

rand. representation.

Escapingrequires

perceivablemore time.

Clearly different

running times for det. representation.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 10

Guest Logo

Experiments: Impact of Randomization on Lookup

Randomization gives theoretical

performance guarantees, but in practice deterministic data often

outperforms randomization.

Lookup is also a good heuristics for non-

randomized data.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 11

Guest Logo

Suffix Arrays

a

abra

abracadabra

acadabra

adabra

bra

bracadabra

cadabra

dabra

ra

racadabra

� Suffix Arrays as full-text index: concatenate all text documents.

� Phrase search: A home match for SAs.

� AND searches:

� search for terms

� Intersect the occurence lists.

� Document reporting

� Compressed or distinct SAs are available from the Pizza&Chili website:

http://pizzachili.di.unipi.itSuffix array of „abracadabra“

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 12

Guest Logo

Performance of different SAs on WT2g 1-50000

AND

phrase

3.12.13.13.2peak mem usage [GB]

23.08.711.59.3indexing time [min]

0.770.841.390.64compression

279.3302.6500.9230.9size [MB]

AF-indexSSA2CCSACSA

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 13

Guest Logo

Modular Design of the Compressed Inverted Index (CII)

Preprocessor

Document-grained

inverted index

Positional index

Text delta

Bags ofwords

Dictionary

Documents normalizeddocuments (term

ID/doc ID/pos)

terms

Differencesbetween

documents and their normalized

versions

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 14

Guest Logo

CII: Document-grained Inverted Index

10

9

8

7

6

5

4

3

2

1

1

.

.

.

4

4

3

2

1

…96521

…109432

Directstoreddoc IDs

Golomb coded list of doc ID deltas

max. K=128 values

max. K=128 values

….........

…... > K values

….........

…... > K values

Two-Level Lookup lists

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 15

Guest Logo

Document-grained Index: Lookup List

1011011

1010100

1001011

1000100

0111101

0110111

0101110

0101000

0011111

0011000

0001111

0000111

rankpos

12

8

4

0

11

10

01

00

84

56

28

0

lookup – table

Inverted List with lookup table

B = 4

( lookup table size: log2(n/B) )

0

n-1

4

8

n=12

� Two-level data structure, refinement of [Sanders and Transier, 2007]:

� Split the range of document IDs into buckets based on their most significant bits.

� Lookup-table: direct access to the first value of a bucket

� Rank information: number of values smaller than the

contents of a bucket.

� Variable Golomb coding:

estimating the average of each bucket.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 16

Guest Logo

CII: Positional Index

10

9

8

7

6

5

4

3

2

1

3

.

.

.

64

128

3

17

1

5

…96521

…1034102

Directstored

positions

Golomb coded list of doc ID deltas

singledoc

multi doc / single pos

….........

…... multi doc / multi pos

Two-Level list (top-levelindexed by ranks)

Bitcompressed lists (indexed by ranks)

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 17

Guest Logo

Querying the Compressed Inverted Index

� AND Query: Intersection of inverted lists in increasing order of list lengths.

� Phrase Query:

� As AND queries, but keeping track of the current ranks.

� Retrieve corresponding position lists.

� Check positions.

(((( ∩ )∩ )∩ )

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 18

Guest Logo

document 5document 5

document 5

CII: Document Reporting with Text Delta

0

8

6

9

3

1

4

9

7

5

Escape dictionary

List of termseperators

42

27

25

7

5

1

0

3

2

0

List of term escapes

All chars to uppercase

First char to uppercase

List of original terms 5

Next itema term

Next 8 termsare seperated

by spaces.

bag of words

List of normalized terms

document index

positional index

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 19

Guest Logo

� We have implemented all algorithms using C++.

� One core of an Intel Core 2 Duo E6600, clocked at 2.4 GHz with 2 x 2MB L2 cache and 4 GB main memory.

� openSuSE 10.2 (kernel 2.6.18), gcc 4.1.2 (-O3)

� Timing with PAPI 3.5.0

Experiments: Test System

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 20

Guest Logo

Experiments: Test Data

� Real world instance: first 50.000 docs of WT2g.

� Pseudo real-world queries:

� AND and phrase queries.

� Selecting random hits.

� Query lengths: 1-10 terms.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 21

Guest Logo

Indexing: Space requirements

25.1bag of words

412.8 (360.5)input size (norm.)

23.9dictionary

32.4document Index

126.3positional Index

230.8suffix array

0.1doc bounds

3.20.7peak mem usage [GB]

- (9.3)5.6 (5.1)indexing time [min]

- (0.64)0.76 (0.57)compression

314.8sum + text delta [MB]

108.7text delta [MB]

230.9206.1sum [MB]

CSACII

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 22

Guest Logo

Experiments: Average AND Query Time

3-4 orders of magnitude slower than CII

Average < 1 msfor all lengths.

tim

e [

ms]

query lengths

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 23

Guest Logo

Experiments: How many % AND queries take longer than t?

35 s in worst

case

All queries aredone in less than

3.6 ms.

Query length

of 2 terms.

% q

ueri

es

to b

ean

sw

ere

d

time [ms]

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 24

Guest Logo

Experiments: Average Phrase Query Time

CSA is marginal faster.

(factor 0.94 – 0.62,< 1ms absolute)

CII is more than

20 times fasterthan CSA.

query lengths

tim

e [

ms]

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 25

Guest Logo

Experiments: How many % phrase queries take longer than t?

t < 190 ms for all queries.

Query length of 2 terms.

Factor24 apart.

time [ms]

% q

ueri

es

to b

ean

sw

ere

d

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 26

Guest Logo

Experiments: Document Reporting

6-8 MB/sBandwith of CSA is

about 5 times smaller

Assuming a disk access latency of

5ms, load from disk would be fasterfor documents > 32 KB.

data

rate

[M

B/s

]

text size [KB]

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 27

Guest Logo

Outlook & Future Work

� Bag of words: adaptive coding scheme?

� Compression of the dictionary.

� Speeding up the most expensive phrase queries.

� Construction time?

� Fast updates?

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 28

Guest Logo

Conclusion

Thank You forYour attention!

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 29

Guest Logo

CII on the complete WT2g / WT2g.s

short queries are slower on WT2g.s, as there are more results. – Large queries are faster, as they

benefit from more lookup lists.

AND

phrase

Phrase queries are faster on

WT2g.s, because theposition lists are shorter.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 30

Guest Logo

Copyright 2007 SAP AG. All Rights Reserved

No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice.

Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors.

Microsoft, Windows, Outlook, and PowerPoint are registered trademarks of Microsoft Corporation.

IBM, DB2, DB2 Universal Database, OS/2, Parallel Sysplex, MVS/ESA, AIX, S/390, AS/400, OS/390, OS/400, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere, Netfinity, Tivoli, and Informix are trademarks or registered trademarks of IBM Corporation.

Oracle is a registered trademark of Oracle Corporation.

UNIX, X/Open, OSF/1, and Motif are registered trademarks of the Open Group.

Citrix, ICA, Program Neighborhood, MetaFrame, WinFrame, VideoFrame, and MultiWin are trademarks or registered trademarks of Citrix Systems, Inc.

HTML, XML, XHTML and W3C are trademarks or registered trademarks of W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.

Java is a registered trademark of Sun Microsystems, Inc.

JavaScript is a registered trademark of Sun Microsystems, Inc., used under license for technology invented and implemented by Netscape.

MaxDB is a trademark of MySQL AB, Sweden.

SAP, R/3, mySAP, mySAP.com, xApps, xApp, SAP NetWeaver, and other SAP products and services mentioned herein as well as their respective logos are trademarks or registered trademarks of SAP AG in Germany and in several other countries all over the world. All other product and service names mentioned are the trademarks of their respective companies. Data contained in this document serves informational purposes only. National product specifications may vary.

The information in this document is proprietary to SAP. No part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of SAP AG.

This document is a preliminary version and not subject to your license agreement or any other agreement with SAP. This document contains only intended strategies, developments, and functionalities of the SAP® product and is not intended to be binding upon SAP to any particular course of business, product strategy, and/or development. Please note that this document is subject to change and may be changed by SAP at any time without notice.

SAP assumes no responsibility for errors or omissions in this document. SAP does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.

SAP shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence.

The statutory liability for personal injury and defective products is not affected. SAP has no control over the information that you may access through the use of hot links contained in these materials and does not endorse your use of third-party Web pages nor provide any warranty whatsoever relating to third-party Web pages.

Universität Karlsruhe / SAP AG 2008, In-Memory Search Engines / Peter Sanders, Frederik Transier / 31

Guest Logo

Copyright 2007 SAP AG. Alle Rechte vorbehalten

Weitergabe und Vervielfältigung dieser Publikation oder von Teilen daraus sind, zu welchem Zweck und in welcher Form auch immer, ohne die ausdrückliche schriftliche Genehmigung durch SAP AG nicht gestattet. In dieser Publikation enthaltene Informationen können ohne vorherige Ankündigung geändert werden.

Die von SAP AG oder deren Vertriebsfirmen angebotenen Softwareprodukte können Softwarekomponenten auch anderer Softwarehersteller enthalten.

Microsoft®, WINDOWS®, NT®, EXCEL®, Word®, PowerPoint® und SQL Server® sind eingetragene Marken der Microsoft Corporation.

IBM®, DB2®, DB2 Universal Database, OS/2®, Parallel Sysplex®, MVS/ESA, AIX®, S/390®, AS/400®, OS/390®, OS/400®, iSeries, pSeries, xSeries, zSeries, z/OS, AFP, Intelligent Miner, WebSphere®, Netfinity®, Tivoli®, Informix und Informix® Dynamic ServerTM sind Marken der IBM Corporation.

ORACLE® ist eine eingetragene Marke der ORACLE Corporation.

UNIX®, X/Open®, OSF/1® und Motif® sind eingetragene Marken der Open Group.

Citrix®, das Citrix-Logo, ICA®, Program Neighborhood®, MetaFrame®, WinFrame®, VideoFrame®, MultiWin® und andere hier erwähnte Namen von Citrix-Produkten sind Marken von Citrix Systems, Inc.

HTML, DHTML, XML, XHTML sind Marken oder eingetragene Marken des W3C®, World Wide Web Consortium, Massachusetts Institute of Technology.

JAVA® ist eine eingetragene Marke der Sun Microsystems, Inc.

JAVASCRIPT® ist eine eingetragene Marke der Sun Microsystems, Inc., verwendet unter der Lizenz der von Netscape entwickelten und implementierten Technologie.

MaxDB ist eine Marke von MySQL AB, Schweden.

SAP, R/3, mySAP, mySAP.com, xApps, xApp, SAP NetWeaver, und weitere im Text erwähnte SAP-Produkte und -Dienstleistungen sowie die entsprechenden Logos sind Marken oder eingetragene Marken der SAP AG in Deutschland und anderen Ländern weltweit. Alle anderen Namen von Produkten und Dienstleistungen sind Marken der jeweiligen Firmen. Die Angaben im Text sind unverbindlich und dienen lediglich zu Informationszwecken. Produkte können länderspezifische Unterschiede aufweisen.

Die in dieser Publikation enthaltene Information ist Eigentum der SAP. Weitergabe und Vervielfältigung dieser Publikation oder von Teilen daraus sind, zu welchem Zweck und in welcher Form auch immer, nur mit ausdrücklicher schriftlicher Genehmigung durch SAP AG gestattet.

Bei dieser Publikation handelt es sich um eine vorläufige Version, die nicht Ihrem gültigen Lizenzvertrag oder anderen Vereinbarungen mit SAP unterliegt. Diese Publikation enthält nur vorgesehene Strategien, Entwicklungen und Funktionen des SAP®-Produkts. SAP entsteht aus dieser Publikation keine Verpflichtung zu einer bestimmten Geschäfts- oder Produktstrategie und/oder bestimmten Entwicklungen. Diese Publikation kann von SAP jederzeit ohne vorherige Ankündigung geändert werden.

SAP übernimmt keine Haftung für Fehler oder Auslassungen in dieser Publikation. Des Weiteren übernimmt SAP keine Garantie für die Exaktheit oder Vollständigkeit der Informationen, Texte, Grafiken, Links und sonstigen in dieser Publikation enthaltenen Elementen. Diese Publikation wird ohne jegliche Gewähr, weder ausdrücklich noch stillschweigend, bereitgestellt. Dies gilt u. a., aber nicht ausschließlich, hinsichtlich der Gewährleistung der Marktgängigkeit und der Eignung für einen bestimmten Zweck sowie für die Gewährleistung der Nichtverletzung geltenden Rechts.

SAP haftet nicht für entstandene Schäden. Dies gilt u. a. und uneingeschränkt für konkrete, besondere und mittelbare Schäden oder Folgeschäden, die aus der Nutzung dieser Materialien entstehen können. Diese Einschränkung gilt nicht bei Vorsatz oder grober Fahrlässigkeit.

Die gesetzliche Haftung bei Personenschäden oder Produkthaftung bleibt unberührt. Die Informationen, auf die Sie möglicherweise über die in diesem Material enthaltenen Hotlinkszugreifen, unterliegen nicht dem Einfluss von SAP, und SAP unterstützt nicht die Nutzung von Internetseiten Dritter durch Sie und gibt keinerlei Gewährleistungen oder Zusagen über Internetseiten Dritter ab.