A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and...


  • 8/6/2019 A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene

    1/45

    A Study of I/O and Virtualization Performance with a Search Engine based on an XML database and Lucene

    Ed Buech, EMC

    [email protected], May 25, 2011


    Agenda

    - My Background
    - Documentum xPlore Context and History
    - Overview of Documentum xPlore
    - Tips and Observations on IO and Host Virtualization


    My Background

    - Ed Buech
    - Information Intelligence Group within EMC
    - EMC Distinguished Engineer & xPlore Architect
    - Areas of expertise
      - Content Management (especially performance & scalability)
      - Database (SQL and XML) and Full text search
    - Previous experience: Sybase and Bell Labs
    - Part of the EMC Documentum xPlore development team
      - Pleasanton (CA), Grenoble (France), Shanghai, and Rotterdam (Netherlands)


    Documentum search 101

    - Documentum Content Server provides an object/relational data model and query language
      - Object metadata is called attributes (sample: title, subject, author)
      - Sub-types can be created with customer-defined attributes
      - Documentum Query Language (DQL)
    - Example:

        SELECT object_name FROM foo
        WHERE subject = 'bar' AND customer_id = 'ID1234'

    - DQL also supports full text extensions
    - Example:

        SELECT object_name FROM foo
        SEARCH DOCUMENT CONTAINS 'hello world'
        WHERE subject = 'bar' AND customer_id = 'ID1234'


    Introducing Documentum xPlore

    - Provides integrated search for Documentum, but is built as a standalone search engine to replace FAST InStream
    - Built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software


    Documentum Search History-at-a-glance

    Almost 15 years of structured/unstructured integrated search

    - Verity Integration (1996 - 2005)
      - Basic full text search through DQL
      - Basic attribute search
      - 1 day - 1 hour latency
      - Embedded implementation
    - FAST Integration (2005 - 2011)
      - Combined structured/unstructured search
      - 2 - 5 min latency
      - Score-ordered results
    - xPlore Integration (2010 - ???)
      - Replaces FAST in DCTM
      - Integrated security
      - Deep facet computation
      - HA/DR improvements
      - Latency: typically seconds
      - Improved Administration
      - Virtualization Support


    Enhancing Documentum Deployments with Search

    - Without Full Text in a Documentum deployment, a DQL query will be directed to the RDBMS
      - DQL is translated into SQL
    - However, relational querying has many limitations.

    [Diagram: the DCTM client sends DQL to the Content Server, which sends SQL to the RDBMS for search]


    Some Basic Design Concepts behind Documentum xPlore

    - Inverted indexes are not optimized for all use-cases
      - B+-tree indexes can be far more efficient for simple, low-latency/highly dynamic scenarios
    - De-normalization can't efficiently solve all problems
      - Update propagation problem can be deadly
      - Joins are a necessary part of most applications
    - Applications need fine control over not only search criteria, but also result sets


    Design concepts (cont)

    - Applications need fluid, changing metadata schemas that can be efficiently queried
      - Adding metadata through joins with side-tables can be inefficient to query
    - Users want the power of Information Retrieval on their structured queries
    - Data Management, HA, DR shouldn't be an after-thought
    - When possible, operate within standards
    - Lucene is not a database. Most Lucene applications deploy with databases.


    Lessons Learned

    [Diagram: fit to use-case, spanning structured query use-cases and unstructured query use-cases]


    Indexes, DB, and IR

    [Diagram: fit to use-case, spanning structured query use-cases and unstructured query use-cases. Labels: relational DB technology; scoring, relevance, entities; hierarchical data representations (XML); full text searches; constantly changing schemas]


    Documentum xPlore

    - Brings a best-of-breed XML database together with the powerful Apache Lucene full-text engine
    - Provides structured and unstructured search leveraging XML and XQuery standards
    - Designed with enterprise readiness, scalability, and ingestion
    - Advanced data management functionality necessary for large-scale systems
    - Industry-leading linguistic technology and comprehensive format filters
    - Metrics and analytics

    [Architecture diagram labels: xPlore API; Search Services; Node & Data Management Services; Indexing Services; Admin Services; Content Processing Services; Analytics; xDB API; xDB Query Processing & Optimization; xDB Transaction, Index & Page Management]


    Libraries / Collections & Indexes

    [Diagram: library A with sub-libraries B and C; the scope of an index covers all xml files in all sub-libraries. Legend: xDB segment; xDB library / xPlore collection; xDB index; xDB xml file (dftxml, tracking xml, status, metrics, audit)]


    Lucene Integration

    - Transactional
      - Non-committed index updates are kept in separate (typically in-memory) Lucene indexes
      - Recently committed (but dirty) indexes are backed by the xDB log
      - Query to index leverages a Lucene multi-searcher with a filter to apply update/delete blacklisting
    - Lucene indexes are managed to fit into xDB's ARIES-based recovery mechanism
    - No changes to Lucene
      - Goal: no obstacles to be as current as possible
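The multi-searcher idea above can be illustrated with a toy model. This is only a sketch under stated assumptions: `SubIndex` and `multi_search` are hypothetical names, the "indexes" are plain Python dictionaries rather than Lucene structures, and the slides do not describe xPlore's actual implementation at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class SubIndex:
    """Toy stand-in for one Lucene index: maps a term to a set of doc ids."""
    postings: dict = field(default_factory=dict)

    def add(self, doc_id, text):
        # Index every whitespace-separated, lower-cased token of the text.
        for term in text.lower().split():
            self.postings.setdefault(term, set()).add(doc_id)

    def search(self, term):
        # Return a copy so callers cannot mutate the posting list.
        return set(self.postings.get(term.lower(), set()))


def multi_search(term, committed, pending, blacklist):
    """Search the committed and the non-committed index together.

    Doc ids in `blacklist` were updated or deleted by the open transaction,
    so their committed versions are filtered out; updated versions are
    expected to be found in the `pending` index instead.
    """
    visible_committed = {d for d in committed.search(term) if d not in blacklist}
    return visible_committed | pending.search(term)
```

For example, if a transaction rewrites `doc1` and deletes `doc2`, both ids go on the blacklist and only the new text of `doc1` is added to the pending index; a query then sees the new `doc1` and no trace of `doc2`, without the committed index having been modified.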


    Lucene Integration (cont)

    - Both value and full text queries supported
      - XML elements mapped to Lucene fields
      - Tokenized and value-based fields available
    - Composite key queries supported
      - Lucene is much more flexible than traditional B-tree composite indexes
    - ACL and facet information stored in Lucene field array
      - Documentum's ACL security model is highly complex and potentially dynamic
      - Enables secure facet computation


    xPlore has Lucene search engine capabilities plus...

    - XQuery provides a powerful query & data manipulation language
      - A typical search engine can't even express a join
      - Creation of arbitrary structure for result set
      - Ability to call language-based functions or Java-based methods
    - Ability to use B-tree based indexes when needed
      - xDB optimizer decides this
    - Transactional update and recovery of data/index
    - Hierarchical data modeling capability


    Tips and Observations on IO and Host Virtualization

    - Virtualization offers huge savings for companies through consolidation and automation
      - Both disk and host virtualization available
    - However, there are pitfalls to avoid
      - One-size-fits-all
      - Consolidation contention
      - Availability of resources


    Tip #1: Don't assume that one-size-fits-all

    - Most IT shops will create VM or SAN templates that have a fixed resource consumption
      - Reduces admin costs
      - Example: two-CPU VM with 2 GB of memory
      - Deviations from this must be made in a special request
    - Recommendations:
      - Size correctly, don't accept insufficient resources
      - Test pre-production environments


    EMC Symmetrix: Nondisruptive Mobility

    Virtual LUN VP Mobility

    - Fast, efficient mobility
    - Maintains replication and quality of service during relocations
    - Supports up to thousands of concurrent VP LUN migrations
    - Recommendation: work with storage technicians to ensure backend storage has sufficient I/O

    [Diagram: a VLUN over virtual pools. Labels: Flash 400 GB RAID 5; Tier 2; Fibre Channel 600 GB 15K RAID 1; SATA 2 TB RAID 6]


    Tip #2: Consolidation Contention

    - Virtualization provides benefit from consolidation
    - Consolidation provides resources to the active
      - Your resources can be consumed by other VMs, other apps
      - Physical resources can be over-stretched
    - Recommendations:
      - Track actual capacity vs. planned
        - VMware: track number of times your VM is denied CPU
        - SANs: track % I/O utilization vs. number of I/Os
      - For VMware, leverage guaranteed minimum resource allocations and/or allocate to non-overloaded HW


    Some VMware statistics

    - Ready metric
      - Generated by vCenter and represents the number of cycles (across all CPUs) in which the VM was denied CPU
      - Generated in milliseconds; the real-time sample happens at best every 20 secs
      - For interactive apps: as a percentage of offered capacity, > 10% is considered worrisome
    - Pages-in, Pages-out
      - Can indicate over-subscription of memory
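As a rough illustration of how a Ready value in milliseconds relates to the 10% guideline, the sketch below converts one sample into a percentage of the sample interval. The function names are hypothetical, and dividing by the number of virtual CPUs is an assumption about how "offered capacity" should be normalized; the slides only state the 20-second interval and the 10% threshold.

```python
def ready_percent(ready_ms, interval_s=20.0, vcpus=1):
    """Express a CPU Ready sample (milliseconds) as a percentage of the
    CPU time offered during the sample interval.

    interval_s defaults to the 20-second real-time interval from the slide.
    vcpus is an assumption: the Ready value is summed across all CPUs, so
    the offered capacity is taken to be interval * number of vCPUs.
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    offered_ms = interval_s * 1000.0 * vcpus
    return ready_ms * 100.0 / offered_ms


def is_worrisome(ready_ms, interval_s=20.0, vcpus=1, threshold_pct=10.0):
    """Apply the slide's rule of thumb for interactive apps (> 10%)."""
    return ready_percent(ready_ms, interval_s, vcpus) > threshold_pct
```

Under these assumptions, a single-vCPU VM that reports 3,000 ms of Ready time in a 20-second sample was denied CPU for 15% of the interval, which is above the guideline.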


    Sample %Ready for a production VM with xPlore deployment for an entire week

    [Chart: %Ready over one week, y-axis from 0% to 16%. Callouts: the official area that indicates pain; in this case average response time doubled and max response time grew by 5x]


    Some Subtleties with Interactive CPU denial

    - The Ready metric represents denial upon demand
      - Interactive workloads can be bursty
      - If no demand, then the Ready counter will be low
    - Poor user response encourages less usage
      - Like walking on a broken leg
      - Causing fewer Ready samples

    [Diagram: a denial spike within a 20 sec interval]


    Sharing I/O capacity

    - If multiple VMs (or servers) are sharing the same underlying physical volumes and the capacity is not managed properly, then the available I/O capacity of the volume could be less than the theoretical capacity
    - This can be seen if the OS tools show that the disk is very busy (high utilization) while the number of I/Os is lower than expected

    [Diagram: a volume for the Lucene application and a volume for another application, both spread over the same set of drives and effectively sharing the I/O capacity]


    Recommendations on diagnosing disk I/O related issues

    - On Linux/UNIX
      - Have IT group install SAR and IOSTAT
      - Also install a disk I/O testing tool (like Bonnie)
      - Compare Bonnie output with SAR & IOSTAT data
        - High disk utilization at much lower achieved rates could indicate contention from other applications
        - Also, high SAR I/O wait time might be an indication of slow disks
    - On Windows
      - Leverage the Windows Performance Monitor
      - Objects: Processor, Physical Disk, Memory
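The Linux/UNIX comparison above (observed utilization and I/O rate versus a benchmark baseline) can be written down as a simple check. This is a minimal sketch: the function name and both thresholds (`busy_pct`, `low_ratio`) are illustrative assumptions, not values from the slides, and the inputs are expected to be read manually from iostat/sar output and from a Bonnie run.

```python
def likely_contention(util_pct, observed_iops, baseline_iops,
                      busy_pct=80.0, low_ratio=0.5):
    """Flag a disk that looks busy while delivering far fewer I/Os per
    second than the baseline measured with a disk I/O test tool.

    util_pct       disk utilization reported by the OS tools (0-100)
    observed_iops  I/Os per second reported while the application runs
    baseline_iops  I/Os per second achieved by the benchmark on the same volume
    busy_pct, low_ratio  illustrative thresholds (assumptions)
    """
    if baseline_iops <= 0:
        raise ValueError("baseline_iops must be positive")
    is_busy = util_pct >= busy_pct
    is_underdelivering = (observed_iops / baseline_iops) < low_ratio
    return is_busy and is_underdelivering
```

With a benchmark baseline of about 735 random I/Os per second, a volume that shows 95% utilization while serving only 200 I/Os per second would be flagged; the same utilization at 700 I/Os per second would not.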


    Sample output from the Bonnie tool

    Bonnie is an open source disk I/O driver tool for Linux that can be useful for pretesting Linux disk environments prior to an xPlore/Lucene install.

        bonnie -s 1024 -y -u -o_direct -v 10 -p 10

    This will increase the size of the file to 2 GB. Examine the output. Focus on the random I/O area:

                         ----Sequential Output (sync)------- ----Sequential Input--- --Rnd Seek-
                         -CharUnlk-- -DIOBlock-- -DRewrite-- -CharUnlk-- -DIOBlock-- --04k (10)-
        Machine MB        K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU  K/sec %CPU   /sec %CPU
        Mach2   10*2024   73928   97 104142  5.3  26246  2.9   8872 22.5  43794  1.9  735.7 15.2

      -s 1024     means that 2 GB files will be created
      -o_direct   means that direct I/O (by-passing buffer cache) will be done
      -v 10       means that 10 different 2 GB files will be created
      -p 10       means that 10 different threads will query those files

    This output means that the random read test saw 735 random I/Os per sec at 15% CPU busy.
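When several machines or volumes are tested, it can help to pull the random-seek figure out of the result row programmatically. The sketch below assumes the row has exactly the layout of the sample above (machine, size, five rate/%CPU pairs, then the random-seek pair); the function name and the dictionary keys are hypothetical, and other Bonnie variants may print a different layout.

```python
def parse_bonnie_row(row):
    """Parse one result row laid out like the sample output:
    machine, size, then six (rate, %CPU) pairs in column order.
    """
    fields = row.split()
    if len(fields) != 14:
        raise ValueError("expected 14 whitespace-separated fields")
    numbers = [float(value) for value in fields[2:]]
    # Key names are illustrative; they follow the column order of the sample.
    names = ["seq_out_char", "seq_out_block", "seq_out_rewrite",
             "seq_in_char", "seq_in_block", "rnd_seek"]
    result = {"machine": fields[0], "size_mb": fields[1]}
    for i, name in enumerate(names):
        result[name] = {"rate": numbers[2 * i], "cpu_pct": numbers[2 * i + 1]}
    return result


sample = ("Mach2 10*2024 73928 97 104142 5.3 26246 2.9 "
          "8872 22.5 43794 1.9 735.7 15.2")
seek = parse_bonnie_row(sample)["rnd_seek"]
print(seek["rate"], seek["cpu_pct"])
```

The `rnd_seek` entry holds the random I/Os per second and the CPU busy percentage, which are the two numbers the slide asks the reader to focus on.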


    IO / caching test use-case

    - Unselective term search
      - 100 sample queries
      - Avg(hits per term) = 4,300+, max ~ 60,000
      - Searching over 100s of DCTM object attributes + content
    - Medium result window
      - Avg(results returned per query) = 350 (max: 800)
    - Stored fields utilized
      - Some security & facet info
    - Goal: pre-cache portions of the index to improve response time in scenarios such as
      - Reboot, buffer cache contention, & vm memory contention
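One simple way to pre-cache part of an index file is to read the relevant byte range sequentially so that the operating system's file buffer cache holds it afterwards. The slides do not say how the pre-caching was performed in the test, so the sketch below is an assumption for illustration only; `warm_file_range` is a hypothetical helper, and the offsets of a given structure inside an index file would have to be determined separately.

```python
def warm_file_range(path, offset=0, length=None, chunk_size=1024 * 1024):
    """Sequentially read bytes [offset, offset + length) of a file.

    The data is discarded; the point is that the operating system keeps
    the pages it just read in its file buffer cache. With length=None the
    file is read from offset to the end. Returns the number of bytes read.
    """
    total = 0
    with open(path, "rb") as f:
        f.seek(offset)
        while length is None or total < length:
            want = chunk_size if length is None else min(chunk_size, length - total)
            data = f.read(want)
            if not data:  # reached end of file
                break
            total += len(data)
    return total
```

Reading in large sequential chunks is the design choice that matters here: it replaces many small random reads at query time with a few large reads up front. Whether the pages stay cached depends on memory pressure from other processes, which is exactly the contention the slide lists.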


    Some xPlore Structures for Search

    [Diagram labels: dictionary of terms; posting list (doc-ids for term); stored fields (facets and node-ids), from 1st doc to N-th doc; facet decompression map; security indexes (b-tree based); xDB XML store (contains text for summary). Frequency and position structures ignored for simplicity.]


    IO model for search in xPlore

    [Diagram labels, in order of appearance: search term "term1 term2"; dictionary; posting list (doc-ids for term); stored fields (xDB node-id plus facet / security info); security lookup (b-tree based); xDB XML store (contains text for summary); result set; facet decompression map]


    Separation of covering values in stored fields and summary

    [Diagram labels: stored fields (random access); potentially thousands of hits; security lookup; potentially thousands of results; facet calc; final facet calc values over thousands of results; small structure; xDB docs with text for summary; small number for result window (Res-1 sum, Res-2 sum, Res-3 sum, ..., Res-350 sum)]


    xPlore Memory Pool areas at-a-glance

    [Diagram: xPlore instance (fixed size) memory, containing the xDB buffer cache, Lucene caches & working memory, xPlore caches, and other vm working memory; operating system file buffer cache (dynamically sized); native code content extraction & linguistic processing memory]


    Lucene data resides primarily in OS buffer cache

    [Diagram: xPlore instance (fixed size) memory (xDB buffer cache, Lucene caches & working memory, xPlore caches, other vm working memory); operating system file buffer cache (dynamically sized); native code content extraction & linguistic processing memory. Shown with the Lucene structures (dictionary of terms; posting list of doc-ids for term; stored fields with facets and node-ids, from 1st doc to N-th doc) and the xDB XML store (contains text for summary). Callout: potential for many things to sweep Lucene from that cache]


    Test Env

    - 32 GB memory
    - Direct attached storage (no SAN)
    - 1.4 million documents
    - Lucene index size = 10 GB
    - Size of internal parts of Lucene CFS file
      - Stored fields (fdt, fdx): 230 MB (2% of index)
      - Term dictionary (tis, tii): 537 MB (5% of index)
      - Positions (prx): 8.78 GB (80% of index)
      - Frequencies (frq): 1.4 GB (13% of index)
    - Text in xDB stored compressed separately


    Some results of the query suite

        Test                   Avg resp to consume   MB pre-cached   I/O per result   Total MB loaded into
                               all results (sec)                                      memory (cached + test)
        Nothing cached         1.89                  0               0.89             77
        Stored fields cached   0.95                  241             0.38             272
        Term dict cached       1.73                  537             0.79             604
        Positions cached       1.58                  8,789           0.74             8,800
        Frequencies cached     1.65                  1,406           0.63             1,436
        Entire index cached    0.59                  10,970          < 0.05           10,970

    - Linux buffer cache cleared completely before each run
    - Response time as seen by final user in Documentum
    - Facets not computed in this example. Just a result set returned. With facets, the response time difference is more pronounced.
    - Mileage will vary depending on a series of factors that include query complexity, composition of the index, and number of results consumed


    Other Notes

    - Caching 2% of the index yields a response time that is only 60% greater than if the entire index was cached.
      - Caching cost only 9 secs on a mirrored drive pair
      - Caching cost 6800 large sequential I/Os vs. potentially 58,000 random I/Os
    - Mileage will vary, factors include
      - Phrase search
      - Wildcard search
      - Multi-term search
    - SANs can grow I/O capacity as search complexity increases
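The "60% greater" figure can be checked against the average response times reported in the query suite results. The short calculation below uses only those reported values; the dictionary and function names are just for illustration.

```python
# Average response times (sec) as reported in the query suite results.
avg_resp_sec = {
    "Nothing cached": 1.89,
    "Stored fields cached": 0.95,
    "Term dict cached": 1.73,
    "Positions cached": 1.58,
    "Frequencies cached": 1.65,
    "Entire index cached": 0.59,
}


def pct_slower_than(label, baseline="Entire index cached"):
    """How much longer (in %) a test took than the baseline test."""
    return (avg_resp_sec[label] / avg_resp_sec[baseline] - 1.0) * 100.0


def speedup_over(label, baseline="Nothing cached"):
    """How many times faster a test was than the baseline test."""
    return avg_resp_sec[baseline] / avg_resp_sec[label]


print(round(pct_slower_than("Stored fields cached")))   # about 61
print(round(speedup_over("Stored fields cached"), 2))   # about 1.99
```

So caching only the stored fields gives a response time roughly 61% above the fully cached case, and roughly halves the response time relative to a cold cache.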


    Contact

    - Ed Buech
    - [email protected]
    - http://community.emc.com/people/Ed_Bueche/blog
    - http://community.emc.com/docs/DOC-8945