node-crate: node.js and big data

node-crate: node.js & big data by Stefan Thies

Transcript of node-crate: node.js and big data

Page 1: node-crate: node.js and big data

node-crate: node.js & big data by Stefan Thies

Page 2: node-crate: node.js and big data

The path is the goal

Page 3: node-crate: node.js and big data

2000-2013 www.verint.com

Dev Team Lead, Product Management, Sales Engineer

since 2013 Consulting / Outsourcing

bigdata-analyst.de

just started … DevOps Evangelist @ www.sematext.com

follow me @seti321

about me

Page 4: node-crate: node.js and big data

Product evaluations

[Bar chart: points (axis 0 to 90) for MarkLogic, MongoDB, Elasticsearch, CouchDB and CRATE. Caption: Document-oriented data stores. Points for the product evaluation criteria of the specific project (RT, scalability, replication, features and commercial).]

Page 5: node-crate: node.js and big data

How did I get here?

• 2012-2014 Systems with Elasticsearch &

• Mobile Apps (Geo) with Appcelerator Titanium

• Data enrichment & Webcrawlers (whois, geo, appstores)

• Distributed Regex-Processing for CyberSec with 0MQ

• Security Layer around Elasticsearch (sails.js)

• … we did almost everything in NodeJS

Page 6: node-crate: node.js and big data

Design criteria

• Scalable & lean architecture

• Operations: NO Zoo of 3rd party components

• We chose Elasticsearch at that time

• Automatic installation, Docker

• One Language: JavaScript / Node.js

Page 7: node-crate: node.js and big data

Security & Admin

• Policies, Users, Roles

• REST API

• Websockets / RT

Page 8: node-crate: node.js and big data

"Data enrichment"

• Hey, we got Elasticsearch - lookup queries for 'static' data sources will be fast!

• Distributed processing based on 0MQ (pull/push) - high throughput, parallel processing, distributed worker processes

[Diagram: collection; two "Information extraction and processing" stages, each with data lookups in Elasticsearch]
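
The pull/push distribution mentioned on this slide can be illustrated without a ZeroMQ dependency. The following is a minimal in-process sketch, not the code from the talk: a hypothetical `PushSocket` class hands each message to the next connected worker in turn, which mimics how a ZeroMQ PUSH socket spreads work across its PULL peers. The class name, `connectWorker`, and the sample records are assumptions made for illustration only.

```javascript
// In-process simulation of the PUSH/PULL work distribution pattern.
// Assumption: this only mimics the round-robin fan-out of a ZeroMQ PUSH
// socket; a real setup would use a ZeroMQ binding and separate processes.
class PushSocket {
  constructor() {
    this.workers = []; // connected "PULL" handlers
    this.next = 0;     // index of the worker that receives the next message
  }

  connectWorker(handler) {
    this.workers.push(handler);
  }

  send(message) {
    if (this.workers.length === 0) {
      throw new Error('no worker connected');
    }
    const handler = this.workers[this.next];
    this.next = (this.next + 1) % this.workers.length;
    handler(message);
  }
}

// Two hypothetical enrichment workers that record what they processed.
const processed = { workerA: [], workerB: [] };
const push = new PushSocket();
push.connectWorker((msg) => processed.workerA.push(msg.domain));
push.connectWorker((msg) => processed.workerB.push(msg.domain));

['a.example', 'b.example', 'c.example', 'd.example']
  .forEach((domain) => push.send({ domain }));

console.log(processed);
```

Adding a third `connectWorker` call spreads the same stream over three workers without touching the sender, which is the property that makes the pattern attractive for parallel enrichment.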

Page 9: node-crate: node.js and big data

any problem?

collect mass data

Elasticsearch

Analyze & Visualize

other data sources: Geo, Company data, Open Source Information

massive updates!

processing queue / workers

Reporting (PDF)

Accurate Counts (Facets) -> Aggregation

Page 10: node-crate: node.js and big data

OPS issues

Alternative: 'any' DB (for updates) + ES

• It's a big mess regarding compatibility, maintenance and monitoring of all components: each box can be multiple machines, the River might not be updated for the latest DB or ES version, and a bug might force you to upgrade one of the components, which is where the dependency trouble starts …

• Reporting: custom programming of DSL queries, rendering HTML to PDF with PhantomJS. Painful if you know the standard report generators from the SQL world. How do you tell the customer to adapt it to their needs? Using a 'standard' DB (SQL or NoSQL) supported by the reporting tools would solve it.

DB Vx.x

Data-Processing Services

DB-River V y.y

Elasticsearch V z.z

Search & Analytics V. b.b

Page 11: node-crate: node.js and big data

Don’t panic

google like …

Page 12: node-crate: node.js and big data

A match at Slideshare!

• An early presentation from Jodok got my attention

• http://www.crate.io

Page 13: node-crate: node.js and big data

• The Mountain Hackathon 2014

birthday of node-crate

Page 14: node-crate: node.js and big data

Package status

• Igor Likhamanov

• Stefan Thies

• Martin Heidegger joined recently and made highly professional quality improvements!

Page 15: node-crate: node.js and big data

DevOps: Stack-Shrinking

• From 3 down to 1 storage service:

DB Vx.x

Data-Processing

DB-River V y.y

Elasticsearch V z.z

Search & Analytics V. b.b

Crate V a.a

Search & Analytics V. b.b

Data-Processing

Page 16: node-crate: node.js and big data

Data Enrichment Performance

• Elasticsearch has no "update by query"

• If we need to update e.g. 50,000 records, that means running a query to identify the relevant records and then sending 50,000 HTTP requests for updates, or building a large bulk update request with 50,000 instructions -> overhead! -> K.O.

• In Crate

• update something where something_else = 'other_value'

• ONE command; still a heavy operation because of the Lucene delete/index, BUT not 50,000 commands/network round trips on top …
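
To make the difference concrete, here is a minimal sketch of the single request body involved. It assumes Crate's HTTP SQL endpoint (`POST /_sql`), which accepts a JSON body with `stmt` and `args`. The table `domains`, its columns, and the helper name `buildUpdate` are hypothetical, and nothing is sent over the network.

```javascript
// Build the JSON body for one SQL UPDATE against Crate's HTTP endpoint
// (POST /_sql). Table and column names are made up for illustration and
// are interpolated as-is, so they must come from trusted code, not users.
function buildUpdate(table, changes, whereColumn, whereValue) {
  const columns = Object.keys(changes);
  const assignments = columns.map((c) => `${c} = ?`).join(', ');
  return {
    stmt: `update ${table} set ${assignments} where ${whereColumn} = ?`,
    args: columns.map((c) => changes[c]).concat([whereValue]),
  };
}

const request = buildUpdate(
  'domains',
  { country: 'DE' },
  'registrar',
  'example-registrar'
);
console.log(JSON.stringify(request));
```

One such body takes the place of the per-record update instructions described above. The server still has to reindex every matching document, but the client sends a single statement.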

Page 17: node-crate: node.js and big data

Data Enrichment - performance

collect mass data

CRATE data store

Analyze & Visualize

other data sources: Geo … Open Source Information

massive updates, no issue :)

processing queue / workers

Reporting (PDF) using CRATE JDBC

Page 18: node-crate: node.js and big data

BLOBs (Images, videos, packet data, …)

• Traditionally

• Meta-Data in DB + Files in some filesystem / separate object storage

• Both behave differently when scaling

• Crate stores BLOBs like other shards, including replicas

• More nodes, more capacity, replicas etc.

• BLOB storage scales with the data store

• Would be perfect for a 'dropbox'-like service :) or any archived data
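
As a rough illustration of how a client addresses such a BLOB, the sketch below computes the upload/download URL. It assumes Crate's blob HTTP API, where a blob lives under `/_blobs/<blob table>/<SHA-1 digest of the content>`, and a blob table created beforehand (for example with `create blob table myblobs`). The host, port, table name, and helper name `blobUrl` are placeholders, and no request is actually sent.

```javascript
const crypto = require('crypto');

// Crate addresses a blob by the SHA-1 digest of its content:
//   PUT/GET http://<host>:4200/_blobs/<blob table>/<digest>
// Host, port and table name below are placeholders for illustration.
function blobUrl(baseUrl, table, content) {
  const digest = crypto.createHash('sha1').update(content).digest('hex');
  return `${baseUrl}/_blobs/${table}/${digest}`;
}

const url = blobUrl('http://localhost:4200', 'myblobs', Buffer.from('abc'));
console.log(url);
```

Because the address is derived from the content, identical files map to the same URL, which suits archive-style storage.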

Page 19: node-crate: node.js and big data

Demo: Installation, usage, examples walk through …

• https://www.youtube.com/watch?v=ZaDFrd4ZwQk (setup)

• https://github.com/megastef/node-crate (node-crate on github)

• http://techblog.bigdata-analyst.de (sample applications)

• https://crate.io/docs/stable/ (documentation of CRATE.IO)

Page 20: node-crate: node.js and big data

Simple Example

Page 21: node-crate: node.js and big data

Import Data (bulk insert)

COPY web_log FROM '/var/logs/web_log.json' WITH (bulk_size=15000, concurrency=2)
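
COPY … FROM reads files that contain one JSON object per line. The sketch below shows one way such a file body could be produced in Node.js. The helper name `toJsonLines` and the sample field values are assumptions, not taken from the talk.

```javascript
// COPY ... FROM expects one JSON object per line. This helper serializes
// records into that layout; the sample records are hypothetical.
function toJsonLines(records) {
  return records.map((r) => JSON.stringify(r)).join('\n') + '\n';
}

const lines = toJsonLines([
  { ts: 1425380400000, host: 'web-1', useragent: 'Safari' },
  { ts: 1425380401000, host: 'web-2', useragent: 'Firefox' },
]);
process.stdout.write(lines);
```

Writing the returned string to a file with `fs.writeFileSync` would give an input file for the statement above.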

Page 22: node-crate: node.js and big data

create table web_log (ts timestamp, host string, …);

Special data types for:

• IP

• Geo Shapes

• Objects (dynamic)
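
A small sketch of how a statement using these types might be assembled from Node.js. The table name `web_log_demo` and the columns `client_ip`, `area` and `details` are hypothetical and are not the elided columns of the slide. Only the type names (`timestamp`, `string`, `ip`, `geo_shape`, `object(dynamic)`) follow what the slide lists.

```javascript
// Assemble a CREATE TABLE statement from a column map.
// Table and column names are illustrative assumptions; identifiers are
// interpolated as-is and must come from trusted code.
function createTableStmt(table, columns) {
  const defs = Object.entries(columns)
    .map(([name, type]) => `${name} ${type}`)
    .join(', ');
  return `create table ${table} (${defs})`;
}

const ddl = createTableStmt('web_log_demo', {
  ts: 'timestamp',
  host: 'string',
  client_ip: 'ip',
  area: 'geo_shape',
  details: 'object(dynamic)',
});
console.log(ddl);
```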

Page 23: node-crate: node.js and big data

insert into web_log (ts, useragent, ..) values (132323, 'Safari', …)
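
When many rows need to be inserted from Node.js, the values are usually passed as parameters rather than written into the statement. The sketch below builds such a request body, assuming Crate's HTTP SQL endpoint (`POST /_sql`) and its `bulk_args` field, which takes one argument list per row. The helper name `buildBulkInsert` and the second sample row are assumptions, and nothing is sent.

```javascript
// Build one request body that inserts many rows via "bulk_args", which
// Crate's HTTP endpoint (POST /_sql) accepts as an array of argument lists.
// Column names come from the caller; the second sample row is made up.
function buildBulkInsert(table, columns, rows) {
  const placeholders = columns.map(() => '?').join(', ');
  return {
    stmt: `insert into ${table} (${columns.join(', ')}) values (${placeholders})`,
    bulk_args: rows.map((row) => columns.map((c) => row[c])),
  };
}

const body = buildBulkInsert('web_log', ['ts', 'useragent'], [
  { ts: 132323, useragent: 'Safari' },
  { ts: 132324, useragent: 'Firefox' },
]);
console.log(JSON.stringify(body));
```

Using `?` placeholders keeps quoting out of the statement text, so values such as user agents containing quotes do not break the SQL.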

Page 24: node-crate: node.js and big data

select

update

Page 25: node-crate: node.js and big data

Anything missing?

• „Kibana“

• see my blog for how to add it ('officially' not supported)

• Performance monitoring

• see next section …

Page 26: node-crate: node.js and big data

Using Kibana with Crate

Page 27: node-crate: node.js and big data

Performance Monitoring

Page 28: node-crate: node.js and big data

Setup & Run

If you can’t measure it you can’t fix it!

Page 29: node-crate: node.js and big data
Page 30: node-crate: node.js and big data

Monitoring - Sematext SPM supported Applications

+

Release status for CRATE/SPM monitor: prototype, please contact me on demand

NEW

Page 31: node-crate: node.js and big data

SPM Monitoring

Page 32: node-crate: node.js and big data

My NPM Modules

• node-crate - DB driver for Crate for NodeJS - help with the 'Waterline/sails.js' ORM appreciated! We are open to other suggestions; we like the sails.js Websocket capability and security features (policies) and would get those 'for free'

• winston-crate - logger transport for Crate using node-crate

• bro-ids - simple interface to the BRO intrusion detection system (IP Monitoring)

Page 33: node-crate: node.js and big data

+ Sematext related work

• node-red-contrib-logsene - Node-Red (IoT, MQTT, …) - Logger for Logsene

• node-spm - Custom Metrics & Logging API for http://www.sematext.com adapted for NodeJS

• spmagent - Performance Monitoring for Node.js

• Garbage Collection, Event Loop Monitor, HTTP Metrics, Cluster mode, …

• Release: Very Soon! - Feb 2015 - [email protected] for early access

Page 34: node-crate: node.js and big data

Dig Search? Dig Analytics? Dig Big Data? Dig Performance? Dig Logging? Dig working with open-source?

We're hiring planet-wide! http://www.sematext.com/about/jobs.html

Page 35: node-crate: node.js and big data

Thank you for your attention.

03.03.15 DevOps Frankfurt

“Metrics & more …”