Scaling the Content Repository with Elasticsearch

SCALING THE DOCUMENT REPOSITORY WITH ELASTICSEARCH

SOME CONTEXT

What We Do and What Problems We Try to Solve

NUXEO

we provide a Platform that developers can use to build highly customized Content Applications

we provide components, and the tools to assemble them

everything we do is open source (for real)

various customers - various use cases

me: developer & CTO - joined the Nuxeo project 10+ years ago

Track game builds

Electronic Flight Bags

Central repository for Models

Food industry PLM

https://github.com/nuxeo

DOCUMENT REPOSITORY

Store Documents / Assets / Objects

Blob objects

Complex data Structures

Hierarchy, references and links

Audit trail & Versioning

Data level security & encryption

Lifecycle, workflows ...

API (REST, CMIS, Java, JS...)

CRUD

Search

Service API

Heavily configurable : all data structures are flexible / customizable

Used by developers to build Content Applications on top of the Nuxeo Repository

OUR CHALLENGES

CRUD on large repository works

inject at 6,000 docs/s up to 1 Billion

not so many companies have that many documents anyway

Queries are the main scalability issue

impact of C_UD (create, update, delete) vs. search

multi-criteria queries + full-text

security filtering

configurable data structures

user defined queries

UI heavily depends on search

Search API is the most used:

search is the main scalability challenge

HISTORY: NUXEO & LUCENE

2006: Nuxeo CPS 3.6

(Python / Zope based)

Replace built-in index with

lucene + XML-RPC server

pyLucene

(GCJ build + Python bindings!)

Complex setup

2007: Nuxeo Platform 5.1

JCR : queries (and backup) issues

Integrate Compass Core

transactional & storage abstraction

Missing sync & concurrency issues

2009: Nuxeo 5.2

VCS : Homebrew SQL based repository

Search in database but some real limitations

2013 / 2014: Nuxeo 5.9.3

Reintroduce Lucene in the stack via elasticsearch

Learn from our past mistakes

Leverage elasticsearch architecture

easy deployment

safe indexing

powerful search

... we are now happy with Elasticsearch

Lucene and Nuxeo have a long story ...

REPOSITORY & SEARCH

Understanding the Issue

REPOSITORY & SEARCH

Search API is the most used


COMPLEX SQL QUERIES

Configurable Data Structure

+ User defined multi-criteria searches

=> multiple & complex SQL queries



SELECT "hierarchy"."id" AS "_C1"
FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
  AND ((TO_TSQUERY('english', 'sydney') @@ NX_TO_TSVECTOR("fulltext"."fulltext")))
  AND ("hierarchy"."isversion" IS NULL)
  AND ("_F1"."lifecyclestate" <> 'deleted')
  AND ("_F2"."created" IS NOT NULL)
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;

ABOUT SQL LIMITATIONS

Scaling queries is complex

depends on indexes, I/O speed and available memory

cannot satisfy all types of queries

poor performance on unselective multi-criteria queries

some types of queries can simply not be fast in SQL

Scalability

Scale up is expensive

Scale out is complex at best (XA & MVCC)

Sharding requires a global index

Fulltext support is usually poor

limitations on features & impact on performance

SQL technology is not the solution

IS NOSQL THE SOLUTION!?

USING NOSQL FOR THE REPOSITORY

ABOUT THE NOSQL OPTION

(sadly) NoSQL is no magic

it does work very well for CRUD and it scales easily, but

query options are limited and performance is not that good

multi-document transactions are usually not safe

more adapted for DBs with billions of entries and simple queries

SQL has some real advantages

ACID (and MVCC) is good

Workflows and bulk updates are a typical use case

(even transient) lack of consistency is complex to explain to users

lots of existing tools (BI & reporting), lots of existing skills (DBA)

PGSQL (or AWS RDS) can be very cost effective

SQL or NoSQL repository are not the solution

KEEP THE REPOSITORY SQL OR NOSQL

BUT FIND A SUPER FAST INDEX ENGINE

REPOSITORY & ELASTICSEARCH

Toward a Hybrid Storage

HYBRID STORAGE

Use each storage solution for what it does best

SQL DB

store content in an ACID way

store & retrieve

queries that need ACID and MVCC

elasticsearch

provide powerful and scalable queries

do the heavy lifting that the RDBMS can not do

scoring, native full-text, aggregates

distributed search

Route the query to the correct index depending on requirements
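A minimal Python sketch of this routing rule, under stated assumptions: `QueryRequirements` and `choose_backend` are hypothetical names invented for illustration, not the Nuxeo API. It only encodes the idea from this slide: queries that need the ACID / MVCC view stay on the repository, everything else goes to the index.

```python
from dataclasses import dataclass


@dataclass
class QueryRequirements:
    """Hypothetical description of what a query needs (illustration only)."""
    needs_transactional_view: bool = False  # must see uncommitted changes (ACID / MVCC)
    needs_fulltext: bool = False            # scoring, native full-text
    needs_aggregates: bool = False          # facets / aggregates


def choose_backend(req: QueryRequirements) -> str:
    """Return the name of the backend that should run the query."""
    if req.needs_transactional_view:
        # Only the repository sees changes made inside the current transaction.
        return "repository"
    # Full-text, aggregates and plain listings are served by the index.
    return "elasticsearch"


print(choose_backend(QueryRequirements(needs_fulltext=True)))
print(choose_backend(QueryRequirements(needs_transactional_view=True)))
```

The rule is deliberately one-sided: the index is the default, and the repository is chosen only when consistency inside a transaction matters more than speed.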

ELASTICSEARCH & REPOSITORY

One query

Several possible backends

PERFORMANCE RESULTS

Fast indexing

No ACID constraints / No impedance issue

3,500 documents/s when using SQL backend

10,000 documents/s when using MongoDB

Super query performance

query on term using inverted index

very efficient caching

native full text support & distributed architecture

3,000 queries/s with 1 elasticsearch node

6,000 queries/s with 2 elasticsearch nodes

SOME REAL LIFE FEEDBACK

Customer: “We are now testing the Nuxeo 6 stack in AWS. DB is Postgres SQL db.r3.8xlarge which is a 32 cpus. Between 350 and 400 tps the DB cpu is maxed out.”

Nuxeo support: “Please activate nuxeo-elasticsearch!”

Customer: “We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out?”

Nuxeo support: “It looks like you have some network congestion between your client and the servers.”

Customer: “...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...”

SQL VS ELASTICSEARCH

Scalability is simply of another order of magnitude

SCALE OUT

UNIFIED INDEX ON SHARDED REPOSITORY

Tested with 10 PgSQL databases

10 x 100 Million documents => 1 Billion documents

1 elasticsearch cluster

IS THIS MAGIC?

For users

it really looks like magic

For sales guys & solution architects

it is magic: it unleashes a lot of possibilities

performance is just one aspect

For Nuxeo Core Dev team

it was almost magic: some integration work was needed

INTEGRATING ELASTICSEARCH

Inside nuxeo-elasticsearch Plugin

CHALLENGES TO ADDRESS

Keep index in sync with the repository

No transaction management

Do not lose anything

Without support for update

Mitigate eventually consistent effect

Avoid displaying transient inconsistent state

Handle security filtering

Without join

Without post-filtering

SECURITY FILTERING

Constraints

Filtering must be done at index level : no post filtering

Join is not an option

cannot join with the DB or within Lucene (previously tested without success)

Solution

index the ReadACL as part of the JSON Document

list of groups / users who can read the resource

automatically add a filter clause on ACL

Consequences

Recursive indexing is needed

More pressure to maintain re-indexing processing

as a last resort: the Document security is checked by the repository anyway
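The approach above can be sketched in Python as follows. This is an illustration under assumptions, not the plugin's code: the field name `read_acl` and the three helper functions are hypothetical, and the `bool` / `filter` / `terms` layout follows the query DSL of later Elasticsearch releases (older releases expressed the same idea with a `filtered` query).

```python
def to_index_document(doc_id, title, read_acl):
    """Build the JSON document to index, with the ReadACL stored alongside the data."""
    return {"id": doc_id, "title": title, "read_acl": sorted(read_acl)}


def with_read_acl_filter(query, principals):
    """Wrap any query so that only documents readable by `principals` can match.

    `principals` is the user name plus the groups the user belongs to.
    """
    return {
        "bool": {
            "must": [query],
            "filter": [{"terms": {"read_acl": sorted(principals)}}],
        }
    }


def is_readable(indexed_doc, principals):
    """Local equivalent of the `terms` filter: one shared entry is enough."""
    return bool(set(indexed_doc["read_acl"]) & set(principals))


doc = to_index_document("doc-1", "Budget 2015", {"finance", "alice"})
secured = with_read_acl_filter({"match": {"title": "budget"}}, ["bob", "finance"])
print(secured)
print(is_readable(doc, ["bob", "finance"]))
```

Because the ACL lives inside each indexed document, a permission change on a folder has to be propagated to every descendant, which is why recursive re-indexing is listed as a consequence.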

SAFE INDEXING FLOW

Do not try to make it Transactional

Collect and de-duplicate Repository Events during Transaction

Wait for commit to be done at the repository level

then call elasticsearch

Do not lose any update

run Indexing Tasks in a distributed Job infrastructure

Jobs should be persisted

Jobs should be retried

Jobs should be monitored
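The flow described on this slide can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `TransactionIndexingCollector`, `run_with_retry` and `after_commit` are hypothetical names, the collector lives in memory (a real job infrastructure would persist and monitor the jobs), and `index_fn` stands for whatever call actually sends the command to elasticsearch.

```python
class TransactionIndexingCollector:
    """Collects indexing commands during one transaction, one per document."""

    def __init__(self):
        self._commands = {}  # doc_id -> last action seen ("index" or "delete")

    def on_event(self, doc_id, action):
        # De-duplicate: several events on the same document collapse into one command.
        self._commands[doc_id] = action

    def drain(self):
        """Return the pending commands in arrival order and forget them."""
        commands = list(self._commands.items())
        self._commands.clear()
        return commands


def run_with_retry(fn, *args, max_attempts=3):
    """Call fn(*args), retrying on failure up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts:
                raise  # give up: a real job queue would keep the job for later


def after_commit(collector, index_fn, max_attempts=3):
    """To be called once the repository commit has succeeded, never before."""
    for doc_id, action in collector.drain():
        run_with_retry(index_fn, doc_id, action, max_attempts=max_attempts)


collector = TransactionIndexingCollector()
collector.on_event("doc-1", "index")
collector.on_event("doc-1", "index")   # duplicate event, collapsed
collector.on_event("doc-2", "delete")
after_commit(collector, lambda doc_id, action: print(action, doc_id))
```

Calling elasticsearch only after the commit means a rolled-back transaction never reaches the index, and retrying the job covers the case where the index is briefly unavailable.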

ASYNC INDEXING FLOW

MITIGATE EVENTUALLY CONSISTENT

In the code:

use case: need to see results from within the transaction

query directly on the repository

leverage ACID and MVCC of SQL repository

full-text search and facets are usually not needed by the code

For the users:

use case: see changes in listings in "real time"

use pseudo-real time indexing

indexing actions triggered by UI threads are flagged

run as afterCompletion listener

refresh elasticsearch index
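A toy Python model of the pseudo-real-time path, assuming a simplified index where documents become searchable only after a refresh. `ToyIndex` and `after_completion` are hypothetical names for illustration; they stand in for the real index and for the listener that runs once the transaction has completed.

```python
class ToyIndex:
    """In-memory stand-in for a search index with near-real-time refresh."""

    def __init__(self):
        self._pending = {}     # indexed but not yet visible to searches
        self._searchable = {}  # visible to searches

    def index(self, doc_id, doc):
        self._pending[doc_id] = doc

    def refresh(self):
        """Make every pending document visible to searches."""
        self._searchable.update(self._pending)
        self._pending.clear()

    def search_ids(self):
        return sorted(self._searchable)


def after_completion(index, changes, async_queue, from_ui_thread):
    """Run after the transaction completes.

    Changes flagged as coming from a UI thread are indexed synchronously and
    followed by a refresh, so the next listing already shows them. All other
    changes are queued for the asynchronous indexing flow.
    """
    if from_ui_thread:
        for doc_id, doc in changes.items():
            index.index(doc_id, doc)
        index.refresh()
    else:
        async_queue.append(changes)


index, queue = ToyIndex(), []
after_completion(index, {"doc-1": {"title": "Hello"}}, queue, from_ui_thread=True)
after_completion(index, {"doc-2": {"title": "Bulk"}}, queue, from_ui_thread=False)
print(index.search_ids(), len(queue))
```

Only interactive changes pay the cost of a synchronous index call and refresh; bulk work stays on the asynchronous path.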

PSEUDO-SYNC INDEXING FLOW

DOES THIS WORK?

Live for about 18 months now

No missing sync issue

some customers asked for verification tools

but no problem was found

re-index in bulk mode is very fast anyway

No consistency issues

good usage of hybrid query engines

elasticsearch helped address several scaling challenges

but elasticsearch brings us much more than just scalability

BONUS FROM ELASTICSEARCH

More than Raw Speed

LEVERAGE AGGREGATES

Leverage elasticsearch aggregates

integrate with the Query system (PageProvider)

integrate with the Listing / UI model (ContentView)

Makes it easy to build and configure faceted search
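As a rough illustration of what faceted search on top of aggregates involves, the Python sketch below builds a search request with one `terms` aggregation per facet and turns the response buckets into facet values. The helper names and the field names (`dc:subjects`, `dc:creator`) are assumptions chosen for the example; the `aggs` request layout and the `aggregations` / `buckets` / `key` / `doc_count` response layout are the standard Elasticsearch ones. This is not the PageProvider or ContentView integration itself.

```python
def facet_request(query, facet_fields, size=10):
    """Build a search body asking for one terms aggregation per facet field."""
    return {
        "query": query,
        "aggs": {
            field: {"terms": {"field": field, "size": size}}
            for field in facet_fields
        },
    }


def facets_from_response(response):
    """Turn the aggregations of a search response into {facet: [(value, count)]}."""
    return {
        name: [(bucket["key"], bucket["doc_count"]) for bucket in agg.get("buckets", [])]
        for name, agg in response.get("aggregations", {}).items()
    }


body = facet_request({"match_all": {}}, ["dc:subjects", "dc:creator"])
print(body)

# A hand-written response fragment in the shape Elasticsearch returns.
sample_response = {
    "aggregations": {
        "dc:creator": {"buckets": [{"key": "alice", "doc_count": 12},
                                   {"key": "bob", "doc_count": 3}]}
    }
}
print(facets_from_response(sample_response))
```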

ADVANCED INDEXING

Fine tuning of elasticsearch indexing

multi language support using multiple analyzers and copy_to

compound fields created using groovy scripts

Introduce elasticsearch hints into NXQL

select a specific elasticsearch index / analyzer

leverage elasticsearch operators

do geolocation search

-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'

-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'

-- Use ES for GeoQuery based on geo_hash_cell: location near a point using geohash
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')

leverage what comes for free with elasticsearch

INDEX AUDIT TRAIL WITH ELASTICSEARCH

Use elasticsearch to store & index Audit trail

all events are serialized in JSON and stored inside elasticsearch

Unleash Audit system power

can store a lot of events

can store and query arbitrary JSON structure

ELASTICSEARCH PASS-THROUGH

Expose an HTTP pass-through API on top of Nuxeo integration

Integrate Authentication & Authorization

not all users can access workflow index

Integrate Security Filtering

activate data level security filtering

Expose "virtual index" via http

index + filter

Use elasticsearch API related components on Nuxeo data

Documents + Audit log

With embedded security

Easy real time data analytics on business data

DATA ANALYTICS WITH ELASTICSEARCH

Queries on Documents + Audit: flexible reporting on workflows

READ DOCUMENTS FROM ELASTICSEARCH

Full JSON Document is stored in elasticsearch

required to be able to do fast re-indexing

We can retrieve Documents from elasticsearch

execute full search & retrieve without touching the DB

By controlling indexing we can use the elasticsearch index

as a persistent cache on top of the repository

as a staging area for queries

_source
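A small Python sketch of "search & retrieve without touching the DB": since the full JSON document is kept in `_source`, the documents can be rebuilt from the search hits alone. The helper name `documents_from_hits` and the sample field `dc:title` are assumptions for illustration; the `hits.hits[]._id` / `_source` layout is the standard shape of an Elasticsearch search response.

```python
def documents_from_hits(response):
    """Return (id, document) pairs taken straight from the search hits."""
    hits = response.get("hits", {}).get("hits", [])
    return [(hit["_id"], hit["_source"]) for hit in hits]


# A hand-written response fragment in the shape Elasticsearch returns.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "doc-1", "_source": {"dc:title": "Sydney harbour"}},
            {"_id": "doc-2", "_source": {"dc:title": "Sydney opera"}},
        ]
    }
}

for doc_id, doc in documents_from_hits(sample_response):
    print(doc_id, doc["dc:title"])
```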

NEXT STEPS

Leveraging Even More elasticsearch

NEXT STEPS

Leverage elasticsearch percolator

push update on the nuxeo-drive clients

notify users about saved search

automatic categorization

Search result highlighting

not sure why it is still not there ...

Plug automatic denormalization

ANY QUESTIONS?

Thank You!

https://github.com/nuxeo

http://www.nuxeo.com/careers/
