Dev nexus 2017

56
Big Data processing with Elasticsearch and Apache Spark DevNexus, 2017

Transcript of Dev nexus 2017

Page 1: Dev nexus 2017

Big Data processing with Elasticsearch and Apache Spark

DevNexus, 2017

Page 2: Dev nexus 2017

Who Am I•Roy Russo•VP Engineering, Predikto•Co-Author - Elasticsearch in

Action•ElasticHQ.org

2

Page 3: Dev nexus 2017

Why Am I Here?•What is Search•Elasticsearch 101 in 10 minutes•Elasticsearch and Big Data processing-The Predikto process -Elasticsearch at scale

3

Page 4: Dev nexus 2017

Search is about filtering

information and

determining relevance.

4

Page 5: Dev nexus 2017

How does a Search Engine Work?

5

Select * FROM make WHERE name LIKE ‘%Tesla%’

Page 6: Dev nexus 2017

Search Engines use Magic

6

Where Magic == Inverted Index

It’s FM!

Page 7: Dev nexus 2017

Inverted Index1.Take some documents2.Tokenize them3.Find unique tokens4.Map tokens to documents

7

apple oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6

Page 8: Dev nexus 2017

Inverted Index

8

apples oranges peachDocument 1Document 2Document 3Document 4Document 5Document 6

Search for “apple peach”

Page 9: Dev nexus 2017

What is Elasticsearch?

9

Page 10: Dev nexus 2017

Elasticsearch is…•Search and Analytics engine-Yes, analytics too.

•Document Store-Every field is indexed/searchable

•Distributed- Fault-tolerant

10

Page 11: Dev nexus 2017

What Elasticsearch is not•Key-Value Store-Redis, Riak

•Column Family Store-C*, HBase

•Graph Database-Neo4J

•Relational Database

11

Page 12: Dev nexus 2017

ElasticSearch in a Nutshell•Based on Apache Lucene•Distributed•Document-Oriented•Schema "free"•HTTP + JSON•(Near) Real-time search•Ecosystem-Hosting, Monitoring Apps, Clients (SDK)

12

Page 13: Dev nexus 2017

Where can I get it?•Free and Open Source•https://www.elastic.co/•https://github.com/elastic/elasticsearch•Backed by a Company, Elastic-Training-Support-Auth/AuthZ-Kibana for UI

13

Page 14: Dev nexus 2017

How do I run it?•Download it-https://www.elastic.co/downloads

•bin/elasticsearch•http://localhost:9200

14

{status: 200,name: "Tesla",cluster_name: "elasticsearch_royrusso",version: {

number: "1.4.2",build_hash: "927caff6f05403e936c20bf4529f144f0c89fd8c",build_timestamp: "2014-12-16T14:11:12Z",build_snapshot: false,lucene_version: "4.10.2"

},tagline: "You Know, for Search"

}

Page 15: Dev nexus 2017

Elasticsearch requires Java. *

15

* You have 5 seconds to whine about it and then shutup.

Page 16: Dev nexus 2017

ElasticSearch for Centralized Logs

•Logstash + ElasticSearch + Kibana (ELK)

16

Page 17: Dev nexus 2017

Based on Apache Lucene•Free and Open Source•Started in 1999•What’s it do?

17

Page 18: Dev nexus 2017

Elasticsearch is a Document Store

18

Page 19: Dev nexus 2017

Document Store•Like MongoDB and CouchDB… but more

like Solr•Document DB

19

Page 20: Dev nexus 2017

Modeled in JSON

20

{ "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983}

{ "_index": "imdb", "_type": "movie", "_id": "u17o8zy9RcKg6SjQZqQ4Ow", "_version": 1, "_source": { "genre": "Crime", "language": "English", "country": "USA", "runtime": 170, "title": "Scarface", "year": 1983 }}

Page 21: Dev nexus 2017

Schema-Free•Dynamic Mapping-Elasticsearch "guesses" the data-types (string,

int, float…)

21

"imdb": { "movie": { "properties": { "country": { "type": "string“, “store”:true, “index”:false }, "genre": { "type": "string“, "null_value" : "na“, “store”:false, “index:true }, "year": { "type": "long" } } }}

Page 22: Dev nexus 2017

Terminology

•Cluster: 1..N Nodes w/ same Cluster Name•Node: One ElasticSearch instance (1 java

proc)•Shard = One Lucene instance-0 or more replicas 2

2

MySQL ElasticsearchDatabase Index

Table TypeRow Document

Column FieldSchema Mapping

Index (Everything is indexed)SQL Query DSL

Page 23: Dev nexus 2017

REST•HTTP Verbs: GET, POST, PUT, DELETE•JSON •_cat API

23

% curl '192.168.56.10:9200/_cat/health?v&ts=0'cluster status nodeTotal nodeData shards pri relo init unassignfoo green 3 3 3 3 0 0 0

Page 24: Dev nexus 2017

Random notes about Elasticsearch•I recommend public trainings for

beginners; developers and devops… •The QDSL is NOT intuitive.•Official support is expensive. •Release roadmap on github is hit-or-miss

in bizarro-land.•Beware client SDKs. Some of the

maintainers can be finicky.•The technology is great, but the space is

noisy.

24

Page 25: Dev nexus 2017

Elasticsearch is Distributed

25

Page 26: Dev nexus 2017

High Availability•No need for load balancer•Different Node Types•Indices are partitioned in to Shards•Replica shards on different Nodes•Automatic Master election & failover

26

Page 27: Dev nexus 2017

Cluster Topology

27

A1 A2B2 B2 B1

B3

B1 A1 A2

B3

4 Node Cluster

Index A: 2 Shards & 1 ReplicaIndex B: 3 Shards & 1 Replica

Page 28: Dev nexus 2017

Discovery•Nodes discover each other using multicast.-Unicast is an option

•Each cluster has an elected master node-Beware of split-brain

28

discovery.zen.ping.multicast.enabled: falsediscovery.zen.ping.unicast.hosts: ["host1", "host2:port", "host3"]

Page 29: Dev nexus 2017

Nodes• Master node handles cluster-wide (Meta-API) events- Node participation- New indices create/delete- Re-Allocation of shards

• Data Nodes- Indexing / Searching operations

• Tribe Nodes- Cross-cluster search

• Client Nodes- REST calls- Light-weight load balancers

• Ingest Nodes- Pipelines/Processors: Date, Convert, Rename, Split, Sort, UC, LC,

Foreach29

Page 30: Dev nexus 2017

The Basics - Shards•Primary Shard:- First time Indexing- Index has 1..N primary shards (default: 5)-# Not changeable once index created

•Replica Shard:-Copy of the primary shard-# Can be changed later-Each primary has 0..N replicas- HA:• Promoted to primary if primary fails• Get/Search handled by primary||replica3

0

Page 31: Dev nexus 2017

Shard Auto-Allocation

•Shard Phases:-Unassigned- Initializing-Started-Relocating

31

Node 1

0P

1R

Node 2

1P

0R

Node 3

0R

Add a Node: Shards Relocate

Page 32: Dev nexus 2017

New in 2.x and 5.x•2.x•Pipeline Aggregations•Say goodbye to Filters

•5.x• Ingest Node•Painless scripting•Data structures:• Not_analyzed -> text

Page 33: Dev nexus 2017

Elasticsearch in the Real World*

33

* Real World == My World

From Big Data to Predictions

Page 34: Dev nexus 2017

What we do1. Ingest Data : Sensors, maintenance,

weather, faults2. Generate Features3. Train Models4. Issue Predictions5. Visualize

Page 35: Dev nexus 2017

How it Works

Page 36: Dev nexus 2017

How it Works

36

MAX

Raw datain TB

Auto dynamic engineered

features

Auto dynamic feature

selection

Method and algorithmselection

Dynamic self learning

Page 37: Dev nexus 2017

How it REALLY Works

37

•Customer hands us X years of data•It’s a horrible mess•ETL to Elasticsearch via Spark•Feature generation via Spark•Train Models•Predictions Issued

85%

10%

5%

Onboarding

ETL Feature Generation Predictive Models

Page 38: Dev nexus 2017

What is feature generation?

38

•We want to predict engine overheating:•Feature 1: Max temperature of this

engine per week•Feature 2: Difference in temperature of

one reading compared to the mean of all readings across the fleet.

•Feature 3: For the month of July, how does mean engine temperature compare to all other engines for all July’s on record.

•…•Feature N

Page 39: Dev nexus 2017

The problem with 10 ms

39

•10ms response times are fast, unless:•10ms X 100 million queries:•16,667 minutes•277.8 hours•11.57 days

•More queries, slower responses•Downward spiral and cascade issues at

requester

Page 40: Dev nexus 2017

Inconvenient Truths in Data Science

40

•Everyone is a data scientist or Uber driver today; or both.

•Space is noisy. Lots of consulting-ware.•Customer data is a disaster.•Watson can’t even predict IBM layoffs.

Page 41: Dev nexus 2017

Elasticsearch at Predikto

41

•Write From Spark to Elasticsearch•Query from Spark to Elasticsearch•Visualize

Page 42: Dev nexus 2017

How we use Apache Spark

42

•It’s all DAGs•No new code-per-customer

•ETL:•DAG-driven•Configuration triggers functions.

•Feature Generation:•DAG-driven•Some magic

Page 43: Dev nexus 2017

DAG-driven ETL

43

Page 44: Dev nexus 2017

Architecture Considerations•Fast read•Mostly time-series data

•Fast write•Dynamic querying•Apache Spark integration•Easily deployed on AWS •Commodity hardware•HA•Schema free•GeoJSON

Page 45: Dev nexus 2017

Contenders•Cassandra •Datastax Enterprise•Relational (PSQL, MySQL)•Mongo•Solr•Hadoop

Page 46: Dev nexus 2017

Why Elasticsearch?•Fast read•Fast write (?)•Dynamic querying•Apache Spark integration via Hadoop

connector•Easy EC2 setup/configuration•HA•Schema free

Page 47: Dev nexus 2017

Heavy use in Machine Learning

47

•ETL process; FAST WRITES•Data is denormalized•Fields typed / standardized•Parallelized via Spark

•Feature generation; FAST READS•Billions of queries during process•Parallelized via Spark

•User Interface / API; FAST READS / DYNAMIC QUERYING•Query anything, any way

Page 48: Dev nexus 2017

Current Infrastructure•NGINX as reverse proxy•Why?

•Elasticsearch nodes are EC2 r4.2xl w/ 60GB RAM•EBS mount• SSD. Never use spinning media!• ~500GB per volume

•VPC•Cluster-per-customer

Page 49: Dev nexus 2017

Node Configuration•A base yml config:•bootstrap.mlockall: true•discovery.zen.ping.multicast.enabled:

false• index.search.slowlog.threshold.query.de

bug: 1s•plugin.mandatory: cloud-aws•discovery.type: ec2

•ES_HEAP_SIZE=30g•MAX_OPEN_FILES=65535•MAX_LOCKED_MEMORY=unlimited

Page 50: Dev nexus 2017

Deploying on AWS•Highly recommend the Cloud-AWS plugin

for discovery. https://www.elastic.co/guide/en/elasticsearch/plugins/current/discovery-ec2.html

1. Nodes should be configured with the same cluster name AND security group.

2. All EC2 instances must have the same IAM Role at creation!

3. Security group should be open on 9200-9300.

4. Discovery will happen on boot.

Page 51: Dev nexus 2017

More about Elasticsearch on AWS

•Volumes:•Use EBS for smaller clusters (3-5)ish

nodes.• Instance store for larger clusters• ephemeral

•Amazon Linux AMIs•EC2 performance enhancements may

help here.•Small instance types get network

throttled.•Tribe nodes for cross-region synch•I don’t recommend the AWS Elasticsearch

service

Page 52: Dev nexus 2017

Tending to Elasticsearch•Curator: https://github.com/elastic/curator•Actions:•Create INDEX•Alias•Snapshot

•Backups on AWS:•For 5.x, install the Snapshot plugin.•Snapshots to S3• Requires key/secret

Page 53: Dev nexus 2017

Handling Fast Writes•Consider collectors for throttling writes•Always use the bulk API•Disable replicas•Index.refresh_interval : -1•Tuning for concurrent writing is an art.• Indexing buffer sizes•Translog flush size

Page 54: Dev nexus 2017

Handling Fast Reads•Fast hardware•Avoid joins (nested & parent-child)•Avoid scripting•Pre 5.x: Use filters•Cache

•indices.fielddata.cache.size•Atomic queries

Page 55: Dev nexus 2017

Maintaining Big Data•Big data grows•Archive/close unused data•Partition data:• Index by time, Index by time AND

device• Aliases• Templates / Curator

•Get your mapping right.•Re-indexing big data is not trivial• Elasticsearch 5.x reindexer

Page 56: Dev nexus 2017

Questions?

56