How we (Almost) Forgot Lambda Architecture and used Elasticsearch
Michael Stockerl, Data Engineer
[email protected] @stockerlm
Lambda Architecture Example: Answer Score
● Better sorting
● Hide bad answers
● Google Thin Content (SEO)
Learnings
+ Reads are fast
+ Spark helps building a Lambda Architecture
- Still duplicate code and complexity
- Each change needs an update of the batch view
Overall ranking with MySQL
SELECT user_id, SUM(points) AS score
FROM event_log
WHERE created_at BETWEEN NOW() - INTERVAL 90 DAY AND NOW()
GROUP BY user_id
ORDER BY score DESC
First results of performance test
● Some queries were fast enough
● BUT: queries took 17 to 20 seconds in the worst case
Aggregations in Elasticsearch
The aggregations framework helps provide aggregated data based on a search query. It is based on simple building blocks called aggregations, that can be composed in order to build complex summaries of the data.
Elasticsearch documentation
Aggregation for Top User List
"aggregations": {
  "top_users": {
    "terms": {
      "field": "user_id",
      "size": 100,
      "shard_size": 2000,
      "order": { "total_score": "desc" }
    },
    "aggregations": {
      "total_score": { "sum": { "field": "score" } }
    }
  }
}
How the parts of the aggregation map to the SQL query:
● terms on user_id: group by
● order on total_score: order by
● shard_size: tune accuracy
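Why shard_size tunes accuracy can be shown with a small simulation. This is a simplified model, not Elasticsearch code: each shard returns only its top shard_size buckets, and the coordinating node merges what it receives. The function names and the sample data are made up for illustration.

```python
from collections import defaultdict

def shard_top(events, shard_size):
    """Per-shard step: sum scores per user, keep the shard's top `shard_size` buckets."""
    totals = defaultdict(int)
    for user_id, score in events:
        totals[user_id] += score
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:shard_size]

def top_users(shards, size, shard_size):
    """Coordinating step: merge the per-shard buckets, keep the global top `size`."""
    merged = defaultdict(int)
    for events in shards:
        for user_id, total in shard_top(events, shard_size):
            merged[user_id] += total
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:size]

# User "c" is only second on each shard, but first overall.
shards = [
    [("a", 10), ("c", 9), ("b", 1)],
    [("b", 10), ("c", 9), ("a", 1)],
]
approx = top_users(shards, size=1, shard_size=1)  # "c" never leaves its shards
exact = top_users(shards, size=1, shard_size=3)   # every bucket is merged
```

With shard_size=1 the overall winner is missed; a larger shard_size sends more buckets to the coordinating node and makes the merged ranking more accurate, at the cost of more work per request.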
Request cache
● Search on local shards
● Cached locally on each shard
● Invalidated on changes
● Caches hits.total, aggregations and suggestions
➔ Too many updates
➔ A lot of cache misses
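The effect of frequent updates on the cache can be sketched with a toy model. It assumes one cache per shard that is dropped whenever the shard's data changes; the class and function names are hypothetical and nothing here calls Elasticsearch.

```python
class ShardRequestCache:
    """Toy model: cached results per request, dropped when the shard changes."""

    def __init__(self):
        self.entries = {}
        self.hits = 0
        self.misses = 0

    def search(self, request_key, compute):
        if request_key in self.entries:
            self.hits += 1
        else:
            self.misses += 1
            self.entries[request_key] = compute()
        return self.entries[request_key]

    def on_change(self):
        # Any change to the shard invalidates everything cached for it.
        self.entries.clear()

def simulate(events_hit_history, rounds=5):
    """One incoming event, then one search per round. Returns (hits, misses)."""
    history = ShardRequestCache()
    for _ in range(rounds):
        if events_hit_history:
            history.on_change()  # the event lands in the index holding the history
        history.search("top_users", lambda: "expensive aggregation")
    return history.hits, history.misses

single_index = simulate(events_hit_history=True)
split_index = simulate(events_hit_history=False)
```

When every event touches the index that holds the history, each search is a miss. When events go elsewhere, only the first search has to compute the aggregation.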
Split data:
● Data of today: use an index template to create the index with the first event
● Historical data: index without changes
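Routing each event to the index of its day could look like the sketch below. The "events-" prefix and the date format are assumptions, not taken from the talk; creating the index on the first write is left to the index template.

```python
from datetime import datetime, timezone

INDEX_PREFIX = "events-"  # hypothetical naming scheme

def index_for(event_time: datetime) -> str:
    """Name of the daily index an event belongs to (UTC day)."""
    day = event_time.astimezone(timezone.utc)
    return INDEX_PREFIX + day.strftime("%Y.%m.%d")

late_event = datetime(2016, 5, 3, 23, 30, tzinfo=timezone.utc)
early_event = datetime(2016, 5, 4, 0, 5, tzinfo=timezone.utc)
names = [index_for(late_event), index_for(early_event)]
```

The first event after midnight gets a new index name, so the template creates the new "data of today" index at that moment and the previous day's index receives no further writes.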
[Diagram: incoming event; "historical data" index and "data of today" index]
Use filtered aliases to select the data of a time range
[Diagram: incoming event; "historical data" and "data of today" indices; filtered alias covering the 90 days up to today]
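A filtered alias over both indices can be expressed as a body for the _aliases API. The sketch below only builds that body; the alias name, the index names and the use of an explicit start date (recomputed once per day) are assumptions for illustration.

```python
from datetime import date, timedelta

def alias_actions(today, alias="events-last-90-days",
                  indices=("events-historical", "events-today"), days=90):
    """Build a body for POST /_aliases that adds a date-filtered alias."""
    since = (today - timedelta(days=days)).isoformat()
    # Assumption: an explicit date keeps the filter identical all day long.
    date_filter = {"range": {"created_at": {"gte": since}}}
    return {
        "actions": [
            {"add": {"index": index, "alias": alias, "filter": date_filter}}
            for index in indices
        ]
    }

body = alias_actions(date(2016, 5, 3))
```

Searches against the alias then only see events inside the time range, without the service having to add the date filter itself.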
Use cached results from historical data
[Diagram: incoming event; "historical data" and "data of today" indices; filtered alias covering the 90 days up to today; cache; service calling _search?request_cache=true]
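The request the service sends could be built as follows. This is a sketch: the alias name is hypothetical, and the body reuses the top user aggregation from the earlier slide. It sets size to 0 because the request cache by default only caches requests that return no hits, and it serializes the body with sorted keys because the whole JSON body is used as the cache key.

```python
import json
from urllib.parse import urlencode

def top_users_request(alias="events-last-90-days"):
    """Return (path, body) for a cacheable top user search."""
    body = {
        "size": 0,  # only aggregations are needed, no hits
        "aggregations": {
            "top_users": {
                "terms": {
                    "field": "user_id",
                    "size": 100,
                    "shard_size": 2000,
                    "order": {"total_score": "desc"},
                },
                "aggregations": {
                    "total_score": {"sum": {"field": "score"}}
                },
            }
        },
    }
    path = "/" + alias + "/_search?" + urlencode({"request_cache": "true"})
    # Sorted keys give a byte-identical body on every call.
    return path, json.dumps(body, sort_keys=True)

path, payload = top_users_request()
```

Shards of the unchanged historical index can answer this request from their cache, so only the shards of today's index have to recompute their part.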
The next day
[Diagram: incoming event; "historical data", "data of yesterday" and "data of today" indices; filtered alias covering the 90 days up to today; cache; service calling _search?request_cache=true]
Merge the old indices
[Diagram: incoming event; "historical data", "data of yesterday" and "data of today" indices; filtered alias covering the 90 days up to today; cache; service calling _search?request_cache=true]
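One way to sketch the merge job is as a list of REST calls. The talk mentions an internal reindex framework; the standard _reindex API is swapped in here as an assumption, and all index and alias names are hypothetical.

```python
def merge_plan(yesterday_index, historical_index, alias):
    """Ordered (method, path, body) steps to fold yesterday's index into the history."""
    reindex = {
        "source": {"index": yesterday_index},
        "dest": {"index": historical_index},
    }
    drop_from_alias = {
        "actions": [{"remove": {"index": yesterday_index, "alias": alias}}]
    }
    return [
        ("POST", "/_reindex", reindex),          # copy yesterday's events
        ("POST", "/_aliases", drop_from_alias),  # stop searching the old index
        ("DELETE", "/" + yesterday_index, None),  # remove the old index
    ]

plan = merge_plan("events-2016.05.02", "events-historical", "events-last-90-days")
```

Between the first and the second step the copied events exist in both indices, so a real job would need to deal with that window. Writing into the historical index also invalidates its cached results, which is why warming the cache right after the merge is useful.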
Warm cache already in merge job
[Diagram: incoming event; "historical data" and "data of today" indices; filtered alias covering the 90 days up to today; cache; service calling _search?request_cache=true]
Learnings:
● Improved internal reindex framework
● Aliases are always your friends
● Request cache FTW
● Cache miss when you use the index name instead of the alias (?)
● Results may not be 100% accurate (but no problem for us)