Integrating Lucene into a Transactional XML Database

Integrating Lucene search engine into a transactional XML database. Boston, May 7-10, 2012. Petr Pleshachkov, EMC, [email protected], May 9, 2012. © Copyright 2012 EMC Corporation. All rights reserved.

description

Presented by Petr Pleshachkov, EMC (conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012). In this talk we present an integration of the Lucene search engine with the EMC Documentum xDB database (a native XML database). We introduce a new approach, implemented in xDB 10.3, which integrates the Lucene index (used to optimize XQuery queries) into the transactional xDB engine at the storage level. That is, Lucene files are stored in xDB data pages instead of on the file system as in earlier releases, and Lucene accesses all of its files through the xDB buffer pool instead of only the operating system buffer cache. This approach simplifies the implementation of traditional database features for Lucene within xDB, such as transaction isolation, rollbacks, recovery after database crashes, snapshot construction, replication, hot backups, and buffer management. We cover a performance analysis of the new approach for query and ingest operations, performance tuning tips, and future optimization techniques in this area. The presentation is intended as a description of an implementation and its performance analysis.

Transcript of Integrating Lucene into a Transactional XML Database

  • 1. Integrating Lucene search engine into a transactional XML database
      - Boston, May 7-10, 2012
      - Petr Pleshachkov, [email protected], May 9, 2012

  • 2. My Background
      - Petr Pleshachkov, Principal Software Engineer
      - xDB/xPlore team in Rotterdam
          - My site: EMC Netherlands
          - Other xPlore/xDB sites: Pleasanton (California), Shanghai (China), and Grenoble (France)
      - Areas of expertise:
          - Semistructured data management
          - Databases: transaction management, query optimization, full-text search
      - Academia & Research: PhD in Computer Science, ISP RAS

  • 3. Agenda
      - Overview of EMC Documentum xDB/xPlore
      - Integration of Lucene into xDB
      - xDB transaction model & Lucene transaction management
      - Performance analysis
      - Future optimizations

  • 4. Introducing Documentum xPlore
      - EMC Documentum is a leading supplier of Enterprise Content Management software
      - xPlore provides integrated search for Documentum
          - but is built as a standalone search engine to replace FAST Instream
          - Widely deployed across Documentum environments worldwide (70+ countries)
      - xPlore search engine is built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software

  • 5. Key values which xDB brings for xPlore
      - Why build a search engine over an XML database?
          - Flexible, hierarchical query & data models
          - Joins
          - High-throughput, low-latency indexing
              - See documents within seconds after saving
          - Leverage B-tree indexes when appropriate
              - Lucene doesn't fit all uses
          - Rich, innovative query language
          - Enterprise, single unified database

  • 6. Documentum xDB
      - Formerly the X-Hive database
      - 100% Java
      - XML stored in persistent DOM format
          - Each XML node can be located through a 64-bit identifier
          - Structure mapped to pages
          - Easy to operate on GB-sized XML files
      - Full transactional database
      - Query language: XQuery
      - Indexing & optimization
          - Palette of index options the optimizer can pick from
          - At its simplest: indexLookup(key) -> node id
      - Backup/restore, scalability, multi-node architecture

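Slide 6 describes an index at its simplest as indexLookup(key) -> node id, with each node addressed by a 64-bit identifier. A minimal sketch of that idea follows. The class and method names (ValueIndexSketch, put, rangeLookup) are hypothetical and not the xDB API, and an in-memory TreeMap stands in for a persistent B-tree.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for a value index; this is not the xDB API.
class ValueIndexSketch {
    // Assumption: a sorted in-memory map replaces the persistent B-tree.
    // Values are 64-bit node identifiers, so they are modelled as long.
    private final TreeMap<String, List<Long>> entries = new TreeMap<>();

    // Register a node id under an index key (several nodes may share a key).
    void put(String key, long nodeId) {
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(nodeId);
    }

    // indexLookup(key) -> node ids; an unknown key yields an empty list.
    List<Long> indexLookup(String key) {
        return entries.getOrDefault(key, Collections.<Long>emptyList());
    }

    // Sorted keys also allow inclusive range scans [from, to].
    List<Long> rangeLookup(String from, String to) {
        List<Long> out = new ArrayList<>();
        for (List<Long> ids : entries.subMap(from, true, to, true).values()) {
            out.addAll(ids);
        }
        return out;
    }

    public static void main(String[] args) {
        ValueIndexSketch index = new ValueIndexSketch();
        index.put("BRUTUS", 101L);
        index.put("CASSIUS", 102L);
        index.put("BRUTUS", 103L);
        System.out.println(index.indexLookup("BRUTUS"));   // [101, 103]
        System.out.println(index.rangeLookup("B", "C"));   // [101, 103]
    }
}
```

The sorted map is chosen only because it supports both exact and range lookups, which is the property that makes a B-tree useful for value comparisons.
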
  • 7. xDB Data Storage Model
      - An XML document can be thought of as a collection of elements and attributes (XML nodes)
      - This node structure can be represented as a tree: the DOM model
      - [Diagram: tree of nodes A, B, C, D, E stored on a database page]

  • 8. Libraries & Indexes
      - Scope of an index covers all XML files in all sub-libraries
      - [Diagram: libraries A, B, and C. Legend: xDB (X-Hive) library, xDB (X-Hive) index, xDB (X-Hive) XML file]

  • 9. Lucene Integration
      - Both value and full-text queries supported
          - XML SubPaths mapped into Lucene fields
          - Tokenized and value-based indexes available
      - Composite key queries supported
          - Lucene index is much more flexible than B-tree composite indexes

  • 10. Multipath Index Definition
      - Sample speeches:
          - BRUTUS: I am not gamesome: I do lack some part
          - CASSIUS: Then, Brutus, I have much mistook your passion; By means whereof this breast of mine hath buried Thoughts of great value, worthy cogitations.
      - INDEX ROOT PATH: //SPEECH
          - SubPath1: (/SPEAKER, VALUE_COMPARISON)
          - SubPath2: (//LINE, FULL_TEXT_SEARCH)

  • 11. Lucene Query Mapping
      - XQuery:
          for $SPEECH score $s in collection("col1")//SPEECH[SPEAKER = "BRUTUS" and //LINE contains text "lack"]
          order by $s
          return $SPEECH
      - Lucene query:
          BooleanQuery(TermQuery1, TermQuery2, BooleanClause.Occur.MUST)
          TermQuery1 = TermQuery(new Term("/speaker_field", "BRUTUS"))
          TermQuery2 = TermQuery(new Term("//line_field", "lack"))

  • 12. Lucene SubIndexes
      - Each user transaction creates a separate Lucene subIndex
      - A transaction performs all of its updates in its own subIndex
      - The delete operation does not physically touch subIndexes created by other transactions
      - A pair (minLSN, maxLSN) is associated with each subIndex and is used to construct a global index snapshot

  • 13. Blacklists
      - The delete operation of a transaction:
          - Physically deletes the document from the transaction's own subIndex
          - Adds a pair (subIndexMinLSN, NODE_ID) to the blacklist structure
      - The persistent blacklist structure is represented as an xDB B-tree index with key = subIndexMinLSN, value = NODE_ID
      - Periodically, a merge operation merges small subIndexes into a bigger one and physically deletes the blacklisted documents

  • 14. xDB transaction management
      - ARIES-based ACID transactions
          - Every page has a Log Sequence Number (pageLSN)
          - The buffer manager tracks dirty pages using RecLSNs
          - ALL updates are logged on a per-page basis, including updates performed during rollbacks
          - Periodically, an asynchronous thread runs the checkpoint procedure
      - The recovery procedure:
          - Repeat history: redo all the updates since the last successful checkpoint
          - Undo incomplete transactions

  • 15. xDB transaction isolation
      - READ_WRITE transactions follow the two-phase locking rule:
          - Expanding phase: locks are acquired and no locks are released
          - Shrinking phase: locks are released and no locks are acquired
      - READ_ONLY transactions do not acquire any locks!
          - The data snapshot at the moment the transaction starts is used
          - Using log records, recent changes are undone at the page level

  • 16. How to integrate Lucene into the transactional xDB database?
      - Old solution (xDB 10.1/10.2 releases)
          - All Lucene files are stored in a separate directory
          - A new transaction model for Lucene indexes is implemented
          - Lucene does not use the xDB buffer pool
          - Backup/restore and replication do not use xDB mechanisms
      - New solution (xDB 10.3)
          - All Lucene files are stored in the xDB data segment
          - The xDB transaction model is used, since all the updates go through xDB data pages
          - Backup/restore and replication are supported automatically

  • 17. Lucene Index Access Model
      - New LIDirectoryImpl class is implemented (extends the Directory class)
          - LIDirectory class stores all files in xDB blob objects
      - LIIndexInput class extends BufferedIndexInput
          - void readInternal(byte[] b, int offset, int len)
              - Reads data from the blob
              - The blob object is buffered at the xDB buffer management level
      - LIIndexOutput class extends BufferedIndexOutput
          - void flushBuffer(byte[] b, int offset, int len)
              - Writes Lucene data to the blob object
              - The operation is logged automatically at the buffer manager level

  • 18. Lucene Index Access Model (cont)
      - [Diagram components: Queries, Indexer, IndexReader, IndexWriter, LIDirectoryImpl, LIIndexInput (readInternal), LIIndexOutput (flushBuffer), Lucene caches, buffered data pages, Lucene blob objects]

  • 19. Lucene SubIndex Storage Model
      - [Diagram components: directory page, LIDirectoryStore, LiFileEntryStore entries, BlobStore pages, blob pages, blob tails]

  • 20. Lucene Index Master Record (MIR)
      - Tracks information about all subIndexes and their state
      - Represented as a concurrent B-tree index
      - Used for Lucene index view construction
      - Updated concurrently by ingest transactions and merging/cleaning tasks
      - Periodically, asynchronous tasks merge subIndexes into bigger ones
      - [Diagram: subIndexes SI_1, SI_2, SI_3 ... SI_N, each with a Directory object and blob objects]

  • 21. Ingest performance analysis (in seconds)
      - [Bar chart comparing xDB 10.3 (pre-release) and xDB 10.2 for ingesting 10000, 50000, and 100000 docs. Data labels: 2526.601, 2149.636, 1009.459, 1015.937, 180.956, 205.068]

  • 22. Query performance analysis (response time in ms)
      - [Bar chart comparing xDB 10.3 (pre-release) and xDB 10.2. Data labels: 14.013, 10.08, 7.713, 7.088]
      - Q1 series: queries with range and 3 value-comparison conditions
      - Q2 series: queries with full-text and 2 value-comparison conditions

  • 23. Future optimizations
      - Reduce the number of separate subIndexes
      - Final/NonFinal merge optimizations
      - Advanced buffer management techniques
      - Concurrent Lucene MultiPath Index
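
Slides 12 and 13 describe per-transaction subIndexes tagged with (minLSN, maxLSN) and a blacklist keyed by subIndexMinLSN. The sketch below shows one way a reader could assemble a snapshot from those pieces. It is an illustration under stated assumptions, not the xDB implementation: all class and method names are hypothetical, a TreeMap stands in for the B-tree, and the visibility rule (a subIndex is visible when its maxLSN is not greater than the reader's snapshot LSN) is an assumption the slides do not spell out. It also treats every blacklist entry as visible to every reader, which a real system would have to refine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of subIndexes plus a blacklist; not xDB code.
class SubIndexSnapshotSketch {

    // One Lucene subIndex written by a single transaction.
    static final class SubIndex {
        final long minLsn;
        final long maxLsn;
        final List<Long> nodeIds; // node ids of the documents it indexes

        SubIndex(long minLsn, long maxLsn, List<Long> nodeIds) {
            this.minLsn = minLsn;
            this.maxLsn = maxLsn;
            this.nodeIds = nodeIds;
        }
    }

    private final List<SubIndex> subIndexes = new ArrayList<>();
    // key = subIndexMinLSN, value = NODE_IDs deleted from that subIndex.
    private final TreeMap<Long, Set<Long>> blacklist = new TreeMap<>();

    void addSubIndex(SubIndex subIndex) {
        subIndexes.add(subIndex);
    }

    // Logical delete of a document held in another transaction's subIndex.
    void blacklist(long subIndexMinLsn, long nodeId) {
        blacklist.computeIfAbsent(subIndexMinLsn, k -> new TreeSet<>()).add(nodeId);
    }

    // Assumed rule: only subIndexes with maxLSN <= snapshotLsn are visible,
    // and blacklisted node ids are filtered out of each visible subIndex.
    List<Long> visibleNodeIds(long snapshotLsn) {
        List<Long> result = new ArrayList<>();
        for (SubIndex s : subIndexes) {
            if (s.maxLsn > snapshotLsn) {
                continue; // written after the snapshot was taken
            }
            Set<Long> deleted = blacklist.getOrDefault(s.minLsn, Collections.<Long>emptySet());
            for (Long id : s.nodeIds) {
                if (!deleted.contains(id)) {
                    result.add(id);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SubIndexSnapshotSketch index = new SubIndexSnapshotSketch();
        index.addSubIndex(new SubIndex(10, 20, List.of(1L, 2L)));
        index.addSubIndex(new SubIndex(30, 40, List.of(3L)));
        index.blacklist(10, 2L); // delete node 2 without touching its subIndex
        System.out.println(index.visibleNodeIds(25)); // [1]
        System.out.println(index.visibleNodeIds(50)); // [1, 3]
    }
}
```

Keeping deletes in a side structure means a transaction never rewrites another transaction's subIndex; the physical removal can then be deferred to the periodic merge that slide 13 mentions.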