Towards 100% uptime with node

Post on 05-Dec-2014


description

Eliminating every last bit of downtime caused by deployment and application errors takes some work. Learn how a combination of domains, sensible handling of uncaught exceptions, graceful connection termination, and process management with the cluster module and its friends can give you confidence that your application is always available.

Transcript of Towards 100% uptime with node

Towards 100% Uptime with Node.js

9M uniques / month.

75K+ users, some are paid subscribers.

( We | you | users ) hate downtime.

Important, but out of scope:

Redundant infrastructure. Backups. Disaster recovery.

In scope:
Application errors. Deploys. Node.js stuff:

Domains. Cluster. Express.

Keys to 100% uptime.

1. Sensibly handle uncaught exceptions.

2. Use domains to catch and contain errors.

3. Manage processes with cluster.

4. Gracefully terminate connections.

1. Sensibly handle uncaught exceptions.

Uncaught exceptions happen when:
An exception is thrown but not caught.
An error event is emitted but nothing is listening for it.

From node/lib/events.js:

EventEmitter.prototype.emit = function(type) {
  // If there is no 'error' event listener then throw.
  if (type === 'error') {
    ...
  } else if (er instanceof Error) {
    throw er; // Unhandled 'error' event
  } else {
    ...

An uncaught exception crashes the process.

If the process is a server: one crash, 100s of dropped connections??

It starts with...

Domains.

2. Use domains to catch and contain errors.

try/catch doesn't do async.

try {
  var f = function() { throw new Error("uh-oh"); };
  setTimeout(f, 100);
} catch (ex) {
  console.log("try / catch won't catch", ex);
}

Domains are a bit like try/catch for async.

var d = require('domain').create();

d.on('error', function (err) {
  console.log("domain caught", err);
});

var f = d.bind(function() {
  throw new Error("uh-oh");
});

setTimeout(f, 100);

The active domain is domain.active.

var domain = require('domain');
var d = domain.create();
console.log(domain.active); // <-- null

var f = d.bind(function() {
  console.log(domain.active === d); // <-- true
  console.log(process.domain === domain.active); // <-- true
  throw new Error("uh-oh");
});

New EventEmitters bind to the active domain.

EventEmitter.prototype.emit = function(type) {
  if (type === 'error') {
    if (this.domain) { // This is important!
      ...
      this.domain.emit('error', er);
    } else if ...

Log the error. Helpful additional fields:

error.domain
error.domainEmitter
error.domainBound
error.domainThrown

Then it's up to you.
Ignore. Retry. Abort (e.g., return 500). Throw (becomes an uncaught exception).

Do I have to create a new domain every time I do an async operation?

Use middleware. More convenient.

In Express, this might look like:

var domainWrapper = function(req, res, next) {
  var reqDomain = domain.create();
  reqDomain.add(req);
  reqDomain.add(res);

  reqDomain.once('error', function(err) {
    res.send(500); // or next(err);
  });

  reqDomain.run(next);
};

Based on https://github.com/brianc/node-domain-middleware

https://github.com/mathrawka/express-domain-errors

Domain methods.
add: bind an EE to the domain.
run: run a function in context of domain.
bind: bind one function.
intercept: like bind but handles 1st arg err.
dispose: cancels IO and timers.

Domains are great

until they're not.

node-mongodb-native does not play well with the active domain.

console.log(domain.active); // a domain
AppModel.findOne(function(err, doc) {
  console.log(domain.active); // undefined
  next();
});

See https://github.com/LearnBoost/mongoose/pull/1337

Fix with explicit binding.

console.log(domain.active); // a domain
AppModel.findOne(domain.active.bind(function(err, doc) {
  console.log(domain.active); // still a domain
  next();
}));

What other operations don't play well with domain.active?

Good question!

Package authors could note this.

If you find one, let the package author know.

Can 100% uptime be achieved just by using domains?

No. Not if only one instance of your app is running.

3. Manage processes with cluster.

Cluster module. Node = one thread per process.

Most machines have multiple CPUs.

One process per CPU = cluster.

master / workers
1 master process forks n workers.
Master and workers communicate state via IPC.
When workers want to listen to a socket, master registers them for it.
Each new connection to the socket is handed off to a worker.
No shared application state between workers.

What about when a worker isn't working anymore?

Some coordination is needed.

1. Worker tells cluster master it's done accepting new connections.

2. Cluster master forks replacement.

3. Worker dies.

Another use case for cluster:

Deployment. Want to replace all existing servers.

Something must manage that = cluster master process.

Zero downtime deployment. When master starts, give it a symlink to worker code.

After deploying new code, update the symlink.

Send signal to master: fork new workers!

Master tells old workers to shut down, forks new workers from new code.

Master process never stops running.

Signals. A way to communicate with running processes.

SIGHUP: reload workers (some like SIGUSR2).

$ kill -s HUP <pid>
$ service <node-service-name> reload
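In the master, that can look like this (a sketch; `reloads` is an illustrative name, and process.emit is used here only to simulate delivery, where a real `kill -s HUP <pid>` ends up emitting the same event):

```javascript
var reloads = 0;

// Installing a handler means SIGHUP no longer terminates the process;
// it becomes our "reload workers" signal instead.
process.on('SIGHUP', function () {
  reloads++;
  console.log('SIGHUP: fork new workers, retire old ones');
});

// Simulate a signal in-process for demonstration.
process.emit('SIGHUP');
console.log(reloads); // 1
```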

Process management options.

Forever
github.com/nodejitsu/forever

Has been around... forever.
No cluster awareness — used on a single process.
Simply restarts the process when it dies.
More comparable to Upstart or Monit.

Naught
github.com/superjoe30/naught

Newer.
Cluster aware.
Zero downtime errors and deploys.
Runs as daemon.
Handles log compression, rotation.

Recluster
github.com/doxout/recluster

Newer.
Cluster aware.
Zero downtime errors and deploys.
Does not run as daemon.
Log agnostic.
Simple, relatively easy to reason about.

We went with recluster. Happy so far.

I have been talking about starting / stopping workers as if it's atomic.

It's not.

4. Gracefully terminate connections

when needed.

Don't call process.exit too soon!

Give it a grace period to clean up.

Need to clean up:
In-flight requests.
HTTP keep-alive (open TCP) connections.

Revisiting our middleware from earlier:

var domainWrapper = function(afterErrorHook) {
  return function(req, res, next) {
    var reqDomain = domain.create();
    reqDomain.add(req);
    reqDomain.add(res);

    reqDomain.once('error', function(err) {
      next(err);
      if (afterErrorHook) afterErrorHook(err); // Hook.
    });
    reqDomain.run(next);
  };
};

1. Call server.close.

var afterErrorHook = function(err) {
  server.close(); // <-- ensure no new connections
}

2. Shut down keep-alive connections.

var afterErrorHook = function(err) {
  app.set("isShuttingDown", true); // <-- set state
  server.close();
}

var shutdownMiddle = function(req, res, next) {
  if (app.get("isShuttingDown")) { // <-- check state
    req.connection.setTimeout(1); // <-- kill keep-alive
  }
  next();
}

Idea from https://github.com/mathrawka/express-graceful-exit

3. Then call process.exit in the server.close callback.

var afterErrorHook = function(err) {
  app.set("isShuttingDown", true);
  server.close(function() {
    process.exit(1); // <-- all clear to exit
  });
}

Set a timer. If the timeout period expires and the server is still around, call process.exit.
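Combining the close callback with that failsafe timer might look like this (a sketch; `gracefulExit` and the grace period are assumed names, not from the talk):

```javascript
// Close the server, but force exit if the grace period expires first.
function gracefulExit(server, gracePeriodMs) {
  var timer = setTimeout(function () {
    console.log('grace period expired; forcing exit');
    process.exit(1);
  }, gracePeriodMs);
  timer.unref(); // don't let the failsafe timer itself keep the process alive

  server.close(function () {
    clearTimeout(timer);
    process.exit(1); // all connections drained; all clear to exit
  });
}

// e.g. from the afterErrorHook: gracefulExit(server, 30 * 1000);
```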

Summing up:

Our ideal server.

On startup:
Cluster master comes up (for example, via Upstart).
Cluster master forks workers from symlink.
Each worker's server starts accepting connections.

On deploy:
Point symlink to new version.
Send signal to cluster master.
Master tells existing workers to stop accepting new connections.
Master forks new workers from new code.
Existing workers shut down gracefully.

On error:
Server catches it via domain.
Next action depends on you: retry? abort? rethrow? etc.

On uncaught exception: ??

// The infamous "uncaughtException" event!
process.on('uncaughtException', function(err) {
  // ??
});

Back to where we started:

1. Sensibly handle uncaught exceptions.

We have minimized these by using domains.

But they can still happen.

Node docs say not to keep running.

An unhandled exception means your application — and by extension node.js itself — is in an undefined state. Blindly resuming means anything could happen. You have been warned.

http://nodejs.org/api/process.html#process_event_uncaughtexception

What to do? First, log the error so you know what happened.

Then, you've got to kill the process.

It's not so bad. We can now do sowith minimal trouble.

On uncaught exception:
Log the error.
Server stops accepting new connections.
Worker tells cluster master it's done.
Master forks a replacement worker.
Worker exits gracefully when all connections are closed, or after a timeout.
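As a handler, that recipe condenses to something like this (a sketch; the shutdown steps stay as comments because they depend on your server and process manager):

```javascript
process.on('uncaughtException', function (err) {
  // 1. Log it so you know what happened.
  console.error('uncaught exception:', err.stack);

  // 2. Stop accepting new connections (server.close), kill
  //    keep-alives, and let the cluster master know so it can
  //    fork a replacement worker.

  // 3. Exit once in-flight requests finish, or after a timeout:
  // process.exit(1);
});
```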

What about the request that killed the worker?

How does the dying worker gracefully respond to it?

Good question!

People are also under the illusion that it is possible to trace back [an uncaught] exception to the http request that caused it...

-felixge, https://github.com/joyent/node/issues/2582

This is too bad, because you always want to return a response, even on error.

This is "Towards 100% Uptime" because these approaches don't guarantee a response for every request.

But we can get very close.

Fortunately, given what we've seen, uncaughts shouldn't happen often.

And when they do, only one connection will be left hanging.

Must restart cluster master when:
Upgrading Node.
Cluster master code changes.

During timeout periods, might have:
More workers than CPUs.
Workers running different versions (old/new).

Should be brief. Probably preferable to downtime.

Tip:

Be able to produce errors on demand on your dev and staging servers.

(Disable this in production.)

Tip:

Keep cluster master simple.
It needs to run for a long time without being updated.

Things change. I've been talking about:

{
  "node": "~0.10.20",
  "express": "~3.4.0",
  "connect": "~2.9.0",
  "mongoose": "~3.6.18",
  "recluster": "=0.3.4"
}

The Future: Node 0.11 / 0.12

For example, the cluster module has some changes.

Cluster is experimental. Domains are unstable.

If you thought this was interesting,

We're hiring.
careers.fluencia.com

Thanks!
@williamjohnbert
github.com/sandinmyjoints/towards-100-pct-uptime
github.com/sandinmyjoints/towards-100-pct-uptime-examples