Elasticsearch from the trenches


Transcript of Elasticsearch from the trenches

Page 1: Elasticsearch from the trenches

elasticsearch from the trenches

Jai Jones [email protected]

Page 2: Elasticsearch from the trenches

about me

- solution architect at slalom
- enjoy building search apps
- 7+ years Lucene
- 2+ years Hibernate Search
- ~2 years Elasticsearch

Page 3: Elasticsearch from the trenches

agenda

- the ask
- initial approach
- problems
- next steps
- lessons learned
- improvements
- questions

Page 4: Elasticsearch from the trenches

the ask

- search 6 billion docs in under 1.5 sec
- index 2 million new docs / day
- export billions of docs to CSV files
- index and search docs in real time
- use search throughout the application

- free text search
- faceted navigation
- suggestions
- dashboards

Page 5: Elasticsearch from the trenches

free text search

Page 6: Elasticsearch from the trenches

faceted navigation

drill down

Page 7: Elasticsearch from the trenches

suggestions

Page 8: Elasticsearch from the trenches

dashboards

Page 9: Elasticsearch from the trenches

hardware
- used "large" servers
- servers had lots of CPUs & RAM
- non-RAIDed spinning disks

- 5 dedicated nodes
- all nodes store data
- all nodes are master
- all nodes sort & aggregate

cluster

initial approach

Page 10: Elasticsearch from the trenches

shards
- used the default shard count
- 5 primary + 1 replica
- unlimited primary shards / node

indices
- data was chronological
- used the time-based index strategy

- weekly indices for transaction logs
- daily indices for audit logs

initial approach

Page 11: Elasticsearch from the trenches

memory
- dedicated 31 GB to the jvm heap
- used remaining memory for file system cache
- turned off linux process swapping
- maxed out linux file descriptors
- used G1 Garbage Collector

initial approach

index mappings
- indexed all fields
- stored big documents with 60+ fields
- nested documents
- parent-child relationships

Page 12: Elasticsearch from the trenches

searches
- searched all indices
- used query_string searches
- searched all fields
- sorted & aggregated on any field
- range queries
- parent-child queries

GET /index-*/_search

"query_string" : { "query": "+(eggplant | potato)", "default_field": "_all", "default_operator": "and"}

initial approach

Page 13: Elasticsearch from the trenches

problems

OutOfMemoryError
- field data exceeded the jvm heap
- shard count was in the thousands
- garbage collector could not free memory

CircuitBreakerException
- field data exceeded the jvm heap
- search results exceeded the jvm heap

- slow searches (latency increased from seconds to minutes)
- nodes became unresponsive
- frequent GC pauses

early signs

Page 14: Elasticsearch from the trenches

cluster down

- index corruption
- data loss

nodes failed to restart

Page 15: Elasticsearch from the trenches

next steps

shard capacity
- understand data & searches
- size based on actual usage

field data
- monitor
- identify the producers
- reduce usage

search
- identify bottlenecks
- optimize

cluster
- find failure points
- make topology changes
- make hardware changes

identify and fix problems...

Page 16: Elasticsearch from the trenches

shard capacity

- 1 shard can handle a lot of data
- actually it held ~5x more data
- didn't need 5 shards per index
- didn't need weekly/daily indices

learned...

- the shard is the unit of scale
- how much data can a single shard hold?
- find the single-shard breaking point

1. loaded a single shard with data

2. ran typical searches

3. recorded search response time

4. repeated until response time became unacceptable
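A minimal sketch of that exercise, assuming an ES 1.x-era cluster on localhost; the index name capacity-test, the bulk file docs.json and the sample query are placeholders, not the actual data or searches used:

# 1. create a single-shard, zero-replica test index (settings are illustrative)
curl -XPUT 'http://localhost:9200/capacity-test' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 0 }
}'

# 2. bulk load a batch of representative documents (docs.json is a placeholder bulk file)
curl -XPOST 'http://localhost:9200/capacity-test/_bulk' --data-binary @docs.json

# 3. run a typical search and record the "took" time (ms) from the response
curl -XGET 'http://localhost:9200/capacity-test/_search' -d '{
  "query": { "match": { "field1": "eggplant" } }
}'

# 4. keep loading and re-running searches until the response time becomes unacceptable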

Page 17: Elasticsearch from the trenches

field data
- which fields and indices are using a lot of field data?
- use the stats API to find out

- fields used for sorting & aggregation
- high cardinality fields
- id-cache for parent-child relationships
- field data is loaded the first time a field is accessed
- field data is maintained per-index
- field data is not GC'd

culprits...

# Node Stats
curl -XGET 'http://localhost:9200/_nodes/stats/indices/fielddata?human'

# Indices Stats
curl -XGET 'http://localhost:9200/_stats/fielddata/?human'
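Both stats calls also accept a fields parameter that breaks the usage down by field; a sketch against the same ES 1.x-era endpoints:

# Field data usage per node, broken down by field
curl -XGET 'http://localhost:9200/_nodes/stats/indices/fielddata?human&fields=*'

# Field data usage per index, broken down by field
curl -XGET 'http://localhost:9200/_stats/fielddata/?human&fields=*'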

Page 18: Elasticsearch from the trenches

search

searching all indices is slow, CPU-intensive and causes field data to be loaded for every index

# Searches all indices
/indexname-*/_search

# Searches specific indices
/indexname-2015/_search

query_string is flexible but allows inefficient searches, like leading-wildcard searches, and it searches the _all field by default

{ "query_string" : { "default_field" : "_all", "allow_leading_wildcard" "true", "query" : "this AND that OR thus" }}

what are the bottlenecks and resource killers?

Page 19: Elasticsearch from the trenches

cluster

- field data used up 70-90% of the heap memory
- not much heap left for node & shard management
- stop-the-world Garbage Collector (GC) pauses made the cluster unresponsive
- nodes dropped out of the cluster
- the G1 GC had longer pauses than the CMS GC
- sorting, aggregations and the id-cache for parent-child relationships used up a lot of heap memory
- managing too many shards used a lot of heap memory

why is the cluster crashing?
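One way to watch for heap pressure like this before it turns fatal is to poll the cluster, as in this sketch assuming the _cat and node-stats APIs available in ES 1.x and later:

# Per-node heap usage at a glance
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'

# Detailed per-node JVM heap and GC stats
curl -XGET 'http://localhost:9200/_nodes/stats/jvm?human'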

Page 20: Elasticsearch from the trenches

lessons learned...

- the number of shards per node should not exceed the number of CPU cores
- figure out the single-shard capacity
- monitor field data usage
- field data usage is permanent and does not get garbage collected
- too much field data usage will bring down the cluster
- search specific indices by target date range
- tune and test all search API searches
- split cluster into data, client and master nodes (see the node-role sketch below)
- use the default ES JVM settings and garbage collector
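A sketch of that data/client/master split in elasticsearch.yml, using ES 1.x-style node settings; one block per node type, with the actual counts depending on the target topology:

# dedicated master-eligible node
node.master: true
node.data: false

# dedicated data node
node.master: false
node.data: true

# dedicated client (coordinating) node: routes searches, sorts & aggregates
node.master: false
node.data: false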

Page 21: Elasticsearch from the trenches

hardware
- used "large" servers
- servers had lots of CPUs & RAM
- non-RAIDed spinning disks
- put master and client nodes on same servers

- 5 → 8 dedicated nodes
- all nodes are master → dedicated master nodes
- all nodes store data → dedicated data nodes
- all nodes sort & aggregate → dedicated client nodes

cluster

improvements

Page 22: Elasticsearch from the trenches

shards
- default shard count didn't work
- 5 → 1 primary + 1 replica
- unlimited primary shards / node → # of primary shards less than # of CPU cores (see the index-creation sketch below)

indices
- data was chronological
- used the time-based index strategy
- weekly → monthly indices for transaction logs
- daily → monthly indices for audit logs
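A sketch of what the revised layout could look like at index-creation time; the index name, template pattern and monthly naming scheme are illustrative (ES 1.x-era syntax):

# Monthly index with 1 primary shard + 1 replica
curl -XPUT 'http://localhost:9200/transactions-2015-01' -d '{
  "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
}'

# Or bake the settings into a template that matches future monthly indices
curl -XPUT 'http://localhost:9200/_template/transactions' -d '{
  "template": "transactions-*",
  "settings": { "number_of_shards": 1, "number_of_replicas": 1 }
}'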

improvements

Page 23: Elasticsearch from the trenches

memory
- dedicated 31 GB to the jvm heap
- used remaining memory for file system cache
- turned off linux process swapping
- maxed out linux file descriptors
- new G1 GC → stable CMS GC
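Roughly how those memory settings are applied, as a sketch using ES 1.x-era names; the exact paths and limits are illustrative:

# heap: ~31 GB (just under the compressed-oops cutoff), rest left to the OS file system cache
export ES_HEAP_SIZE=31g

# elasticsearch.yml: lock the heap so Linux never swaps it out
bootstrap.mlockall: true

# /etc/security/limits.conf: raise the open-file limit for the elasticsearch user
elasticsearch - nofile 65535

# GC: keep the default CMS collector, i.e. do not add -XX:+UseG1GC to the JVM options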

improvements

Page 24: Elasticsearch from the trenches

index mappings
- indexed all → 40 fields
- stored big documents with 60+ fields
- nested documents
- parent-child relationships
- used field aliases to define alternate fields used in sorting and aggregation
- used doc_values on sortable & aggregation fields
- changed boolean data type to string

"field": { "index": "no"}

# uses field data
"fieldA": {
  "type": "boolean"
}

# uses doc_values (no field data)
"fieldA": {
  "type": "string",
  "index": "analyzed",
  "fields": {
    "raw": {
      "type": "string",
      "index": "not_analyzed",
      "fielddata": { "format": "doc_values" }
    }
  }
}
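For context, a sketch of how those snippets sit inside an index-creation request; the index name, type name and field names are placeholders (ES 1.x mapping syntax):

curl -XPUT 'http://localhost:9200/transactions-2015-01' -d '{
  "mappings": {
    "transaction": {
      "properties": {
        "unusedField": { "type": "string", "index": "no" },
        "fieldA": {
          "type": "string",
          "index": "analyzed",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed",
              "fielddata": { "format": "doc_values" }
            }
          }
        }
      }
    }
  }
}'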

improvements

Page 25: Elasticsearch from the trenches

searches
- search all indices → target specific indices
- query_string → simple_query_string
- search on all → some fields
- sorting & aggregations on all → low cardinality fields
- range queries → filters
- parent-child → nested queries
- added query timeouts (see the filter/timeout sketch below)

GET /index-201501/_search

"simple_query_string" : { "query": "+(eggplant | potato)", "fields": ["field1", "field2"], "default_operator": "and"}

improvements

Page 26: Elasticsearch from the trenches

Questions?

Page 27: Elasticsearch from the trenches

Thank You