Towards 100% uptime with node

Towards 100% Uptime with Node.js

description

Eliminating every last bit of downtime caused by deployment and application errors takes some work. Learn how a combination of domains, sensible handling of uncaught exceptions, graceful connection termination, and process management with the cluster module and its friends can give you confidence that your application is always available.

Transcript of Towards 100% uptime with node

Page 1: Towards 100% uptime with node

Towards 100% Uptime with Node.js

Page 2: Towards 100% uptime with node

9M uniques / month.

75K+ users, some are paid subscribers.

Page 3: Towards 100% uptime with node

(We | you | users) hate downtime.

Page 4: Towards 100% uptime with node

Important, but out of scope:

- Redundant infrastructure.
- Backups.
- Disaster recovery.

Page 5: Towards 100% uptime with node

In scope:

- Application errors.
- Deploys.
- Node.js stuff:
  - Domains.
  - Cluster.
  - Express.

Page 6: Towards 100% uptime with node

Keys to 100% uptime.

Page 7: Towards 100% uptime with node

1. Sensibly handle uncaught exceptions.

Page 8: Towards 100% uptime with node

2. Use domains to catch and contain errors.

Page 9: Towards 100% uptime with node

3. Manage processes with cluster.

Page 10: Towards 100% uptime with node

4. Gracefully terminate connections.

Page 11: Towards 100% uptime with node

1. Sensibly handle uncaught exceptions.

Page 12: Towards 100% uptime with node

Uncaught exceptions happen when:

- An exception is thrown but not caught.
- An error event is emitted but nothing is listening for it.
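The second case is easy to reproduce. The following is a minimal sketch, not code from the talk: the emitter names and the error message are made up for illustration. It relies only on the documented behavior that emitting 'error' on an EventEmitter with no 'error' listener throws the error.

```javascript
const { EventEmitter } = require('events');

// Emit an 'error' event and report what emit() throws, if anything.
// Returns the thrown error, or null when a listener handled the event.
function emitErrorOn(emitter) {
  try {
    emitter.emit('error', new Error('nobody is listening'));
    return null;
  } catch (thrown) {
    return thrown;
  }
}

const unguarded = new EventEmitter();
const guarded = new EventEmitter();
guarded.on('error', function (err) {
  // Handled here: a real listener would log the error and recover.
});

// No listener: emit() throws the Error itself.
const thrownWithoutListener = emitErrorOn(unguarded);
// With a listener: nothing is thrown, so the helper returns null.
const thrownWithListener = emitErrorOn(guarded);
```

Without the try/catch in the helper, the first call would crash the process, which is exactly the situation the rest of the talk is about.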

Page 13: Towards 100% uptime with node

From node/lib/events.js:

    EventEmitter.prototype.emit = function(type) {
      // If there is no 'error' event listener then throw.
      if (type === 'error') {
        ...
        } else if (er instanceof Error) {
          throw er; // Unhandled 'error' event
        } else {
          ...

Page 14: Towards 100% uptime with node

An uncaught exception crashes the process.

Page 15: Towards 100% uptime with node

If the process is a server: 

x 100s??

Page 16: Towards 100% uptime with node

It starts with...

Page 17: Towards 100% uptime with node

Domains.

2. Use domains to catch and contain errors.

Page 18: Towards 100% uptime with node

try/catch doesn't do async.

    try {
      var f = function() {
        throw new Error("uh-oh");
      };
      setTimeout(f, 100);
    } catch (ex) {
      console.log("try / catch won't catch", ex);
    }

Page 19: Towards 100% uptime with node

Domains are a bit like try/catch for async.

    var d = require('domain').create();

    d.on('error', function (err) {
      console.log("domain caught", err);
    });

    var f = d.bind(function() {
      throw new Error("uh-oh");
    });

    setTimeout(f, 100);

Page 20: Towards 100% uptime with node

The active domain is domain.active.

    var domain = require('domain');
    var d = domain.create();
    console.log(domain.active); // <-- null

    var f = d.bind(function() {
      console.log(domain.active === d); // <-- true
      console.log(process.domain === domain.active); // <-- true
      throw new Error("uh-oh");
    });

Page 21: Towards 100% uptime with node

New EventEmitters bind to the active domain.

    EventEmitter.prototype.emit = function(type) {
      if (type === 'error') {
        if (this.domain) { // This is important!
          ...
          this.domain.emit('error', er);
        } else if ...

Page 22: Towards 100% uptime with node

Log the error.

Helpful additional fields:

- error.domain
- error.domainEmitter
- error.domainBound
- error.domainThrown

Page 23: Towards 100% uptime with node

Then it's up to you.

- Ignore.
- Retry.
- Abort (e.g., return 500).
- Throw (becomes an unknown error).

Page 24: Towards 100% uptime with node

Do I have to create a new domain every time I do an async operation?

Page 25: Towards 100% uptime with node

Use middleware. More convenient.

Page 26: Towards 100% uptime with node

In Express, this might look like:

    var domain = require('domain');

    var domainWrapper = function(req, res, next) {
      var reqDomain = domain.create();
      reqDomain.add(req);
      reqDomain.add(res);

      reqDomain.once('error', function(err) {
        res.send(500); // or next(err);
      });

      reqDomain.run(next);
    };

Based on https://github.com/brianc/node-domain-middleware

https://github.com/mathrawka/express-domain-errors

Page 27: Towards 100% uptime with node

Domain methods.

- add: bind an EE to the domain.
- run: run a function in context of domain.
- bind: bind one function.
- intercept: like bind but handles 1st arg err.
- dispose: cancels IO and timers.
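Of these, intercept is the least obvious, so here is a small sketch of it. This is an illustration under assumptions, not code from the talk: the callback is invoked by hand to simulate what an async API such as fs.readFile would do on completion, and the error message is made up.

```javascript
const domain = require('domain');

const d = domain.create();
const caught = [];
d.on('error', function (err) {
  caught.push(err.message); // a real handler would log and respond
});

// intercept() wraps a Node-style callback. If the first argument is an
// Error, it goes to the domain's 'error' handler and the callback is not
// called. Otherwise the callback runs with the remaining arguments, so it
// never sees the err slot.
const received = [];
const onData = d.intercept(function (data) {
  received.push(data);
});

// Simulated completions of an async operation (hypothetical values).
onData(null, 'file contents');           // success path
onData(new Error('ENOENT (simulated)')); // failure path
```

The effect is that the usual `if (err) return ...` line disappears from each callback, at the cost of hiding where errors are handled.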

Page 28: Towards 100% uptime with node

Domains are great

until they're not.

Page 29: Towards 100% uptime with node

node-mongodb-native does not play well with active domain.

    console.log(domain.active); // a domain
    AppModel.findOne(function(err, doc) {
      console.log(domain.active); // undefined
      next();
    });

See https://github.com/LearnBoost/mongoose/pull/1337

Page 30: Towards 100% uptime with node

Fix with explicit binding.

    console.log(domain.active); // a domain
    AppModel.findOne(domain.active.bind(function(err, doc) {
      console.log(domain.active); // still a domain
      next();
    }));

Page 31: Towards 100% uptime with node

What other operations don't play well with domain.active?

Good question!

Package authors could note this.

If you find one, let package author know.

Page 32: Towards 100% uptime with node

Can 100% uptime be achieved just by using domains?

No.

Not if only one instance of your app is running.

Page 33: Towards 100% uptime with node

3. Manage processes with cluster.

Page 34: Towards 100% uptime with node

Cluster module.

Node = one thread per process.

Most machines have multiple CPUs.

One process per CPU = cluster.

Page 35: Towards 100% uptime with node

master / workers

- 1 master process forks n workers.
- Master and workers communicate state via IPC.
- When workers want to listen to a socket, master registers them for it.
- Each new connection to socket is handed off to a worker.
- No shared application state between workers.
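A minimal sketch of the forking step follows. It is an assumption-laden illustration rather than the talk's code: `startWorkers` is a made-up name, and the cluster object is passed in as a parameter only so the function can be exercised without spawning real processes.

```javascript
const os = require('os');

// Fork `workerCount` workers (default: one per CPU) and return them.
// `clusterApi` is expected to be Node's built-in `cluster` module.
function startWorkers(clusterApi, workerCount) {
  const count = workerCount || os.cpus().length;
  const workers = [];
  for (let i = 0; i < count; i++) {
    workers.push(clusterApi.fork());
  }
  return workers;
}
```

In a real entry point this would be called as `startWorkers(require('cluster'))` when `cluster.isMaster` is true (newer Node versions also call this `cluster.isPrimary`), while the worker branch starts the HTTP server.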

Page 36: Towards 100% uptime with node

What about when a worker isn't working anymore?

Some coordination is needed.

Page 37: Towards 100% uptime with node

1. Worker tells cluster master it's done accepting new connections.

2. Cluster master forks replacement.

3. Worker dies.
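The three steps above could be coordinated roughly as follows. This is a sketch under assumptions: the `{ cmd: 'offline' }` message shape and both function names are invented here, and process managers such as recluster define their own protocol. The `(worker, message)` listener signature for the cluster 'message' event is the one used by current Node versions.

```javascript
// Master side: fork a replacement as soon as a worker announces that it
// has stopped accepting connections, rather than waiting for it to exit.
// `clusterApi` is expected to be Node's built-in `cluster` module.
function replaceOnRequest(clusterApi) {
  clusterApi.on('message', function (worker, message) {
    if (message && message.cmd === 'offline') {
      clusterApi.fork();
    }
  });
}

// Worker side: announce, then stop accepting new connections.
// The actual exit happens later, once open connections have finished.
// `proc` is expected to be the global `process`.
function announceOffline(proc, server) {
  if (typeof proc.send === 'function') {
    proc.send({ cmd: 'offline' }); // send() only exists in forked workers
  }
  server.close();
}
```

Forking on the announcement, not on the 'exit' event, means the replacement is already starting while the old worker drains.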

Page 38: Towards 100% uptime with node

Another use case for cluster:

Deployment.

Want to replace all existing servers.

Something must manage that = cluster master process.

Page 39: Towards 100% uptime with node

Zero downtime deployment.

When master starts, give it a symlink to worker code.

After deploying new code, update symlink.

Send signal to master: fork new workers!

Master tells old workers to shut down, forks new workers from new code.

Master process never stops running.

Page 40: Towards 100% uptime with node

Signals.

A way to communicate with running processes.

SIGHUP: reload workers (some like SIGUSR2).

    $ kill -s HUP <pid>
    $ service <node-service-name> reload
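On the receiving end, the master needs a SIGHUP handler that swaps workers. The sketch below is one possible shape, not the talk's or recluster's implementation: both function names are made up, and the process and cluster objects are parameters so the logic can be run without real signals or forks. It assumes workers are forked from the symlinked path (for example via the cluster module's `exec` setting), so a fresh fork picks up the new code.

```javascript
// Replace every current worker with a fresh one. Each old worker is only
// disconnected once its replacement is listening, so there is always at
// least one worker accepting connections.
function reloadWorkers(clusterApi) {
  const oldWorkers = Object.values(clusterApi.workers);
  oldWorkers.forEach(function (oldWorker) {
    const replacement = clusterApi.fork();
    replacement.once('listening', function () {
      oldWorker.disconnect(); // stop accepting; let open connections finish
    });
  });
  return oldWorkers.length;
}

// Wire the reload to SIGHUP. `proc` is expected to be the global `process`
// and `clusterApi` the built-in `cluster` module.
function reloadOnHangup(proc, clusterApi) {
  proc.on('SIGHUP', function () {
    reloadWorkers(clusterApi);
  });
}
```

The master itself never restarts here: it only forks and disconnects, which is why it can keep running across deploys.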

Page 41: Towards 100% uptime with node

Process management options.

Page 42: Towards 100% uptime with node

Forever

github.com/nodejitsu/forever

- Has been around...forever.
- No cluster awareness: used on a single process.
- Simply restarts the process when it dies.
- More comparable to Upstart or Monit.

Page 43: Towards 100% uptime with node

Naught

github.com/superjoe30/naught

- Newer.
- Cluster aware.
- Zero downtime errors and deploys.
- Runs as daemon.
- Handles log compression, rotation.

Page 44: Towards 100% uptime with node

Recluster

github.com/doxout/recluster

- Newer.
- Cluster aware.
- Zero downtime errors and deploys.
- Does not run as daemon.
- Log agnostic.
- Simple, relatively easy to reason about.

Page 45: Towards 100% uptime with node

We went with recluster. Happy so far.

Page 46: Towards 100% uptime with node

I have been talking about starting / stopping workers

as if it's atomic.

It's not.

Page 47: Towards 100% uptime with node

4. Gracefully terminate connections

when needed.

Page 48: Towards 100% uptime with node

Don't call process.exit too soon!

Give it a grace period to clean up.

Page 49: Towards 100% uptime with node

Need to clean up:

- In-flight requests.
- HTTP keep-alive (open TCP) connections.

Page 50: Towards 100% uptime with node

Revisiting our middleware from earlier:

    var domainWrapper = function(afterErrorHook) {
      return function(req, res, next) {
        var reqDomain = domain.create();
        reqDomain.add(req);
        reqDomain.add(res);

        reqDomain.once('error', function(err) {
          next(err);
          if (afterErrorHook) afterErrorHook(err); // Hook.
        });
        reqDomain.run(next);
      };
    };

Page 51: Towards 100% uptime with node

1. Call server.close.

    var afterErrorHook = function(err) {
      server.close(); // <-- ensure no new connections
    };

Page 52: Towards 100% uptime with node

2. Shut down keep-alive connections.

    var afterErrorHook = function(err) {
      app.set("isShuttingDown", true); // <-- set state
      server.close();
    };

    var shutdownMiddle = function(req, res, next) {
      if (app.get("isShuttingDown")) { // <-- check state
        req.connection.setTimeout(1); // <-- kill keep-alive
      }
      next();
    };

Idea from https://github.com/mathrawka/express-graceful-exit

Page 53: Towards 100% uptime with node

3. Then call process.exit in server.close callback.

    var afterErrorHook = function(err) {
      app.set("isShuttingDown", true);
      server.close(function() {
        process.exit(1); // <-- all clear to exit
      });
    };

Page 54: Towards 100% uptime with node

Set a timer.

If timeout period expires and server is still around, call process.exit.
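A sketch of that timer, combined with the server.close callback, might look like this. It is an illustration under assumptions: `exitGracefully` and its parameters are made-up names, and the exit function is injectable only so the sketch can be run without actually terminating the process.

```javascript
// Close the server, then exit. If connections are still open after
// `graceMs` milliseconds, exit anyway.
// `exit` defaults to calling process.exit.
function exitGracefully(server, graceMs, exit) {
  const realExit = exit || function (code) { process.exit(code); };
  let exited = false;
  function exitOnce(code) {
    if (exited) return;
    exited = true;
    realExit(code);
  }

  const timer = setTimeout(function () {
    exitOnce(1); // grace period expired: give up on lingering connections
  }, graceMs);
  timer.unref(); // the timer alone should not keep the process alive

  server.close(function () {
    clearTimeout(timer);
    exitOnce(1); // all connections finished: all clear to exit
  });
}
```

The grace period is a trade-off: too short and in-flight requests are cut off, too long and a broken worker lingers alongside its replacement.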

Page 55: Towards 100% uptime with node

Summing up:

Our ideal server.

Page 56: Towards 100% uptime with node

On startup:

- Cluster master comes up (for example, via Upstart).
- Cluster master forks workers from symlink.
- Each worker's server starts accepting connections.

Page 57: Towards 100% uptime with node

On deploy:

- Point symlink to new version.
- Send signal to cluster master.
- Master tells existing workers to stop accepting new connections.
- Master forks new workers from new code.
- Existing workers shut down gracefully.

Page 58: Towards 100% uptime with node

On error:

- Server catches it via domain.
- Next action depends on you: retry? abort? rethrow? etc.

Page 59: Towards 100% uptime with node

On uncaught exception: ??

    // The infamous "uncaughtException" event!
    process.on('uncaughtException', function(err) {
      // ??
    });

Page 60: Towards 100% uptime with node

Back to where we started:

1. Sensibly handle uncaught exceptions.

We have minimized these by using domains.

But they can still happen.

Page 61: Towards 100% uptime with node

Node docs say not to keep running.

An unhandled exception means your application - and by extension node.js itself - is in an undefined state. Blindly resuming means anything could happen. You have been warned.

http://nodejs.org/api/process.html#process_event_uncaughtexception

Page 62: Towards 100% uptime with node

What to do?First, log the error so you know what happened.

Page 63: Towards 100% uptime with node

Then, you've got to kill the process.

Page 64: Towards 100% uptime with node

It's not so bad. We can now do so with minimal trouble.

Page 65: Towards 100% uptime with node

On uncaught exception:

- Log error.
- Server stops accepting new connections.
- Worker tells cluster master it's done.
- Master forks a replacement worker.
- Worker exits gracefully when all connections are closed, or after timeout.
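The worker's side of that sequence could be wired up as below. This is only a sketch of the ordering: `createFatalHandler` and the four functions it receives are hypothetical names supplied by the caller, standing in for whatever logging, server, and process-manager code the application actually uses.

```javascript
// Build an 'uncaughtException' handler that performs the worker's steps
// in order. All dependencies are hypothetical and supplied by the caller:
//   log(err)         - record the error somewhere durable
//   stopAccepting()  - e.g. flag keep-alive shutdown and close the server
//   notifyMaster()   - e.g. message the master so it forks a replacement
//   exitAfterGrace() - exit once connections drain or a timeout expires
function createFatalHandler(deps) {
  let shuttingDown = false;
  return function onUncaughtException(err) {
    deps.log(err); // always log, even if shutdown is already under way
    if (shuttingDown) return; // a second failure must not restart shutdown
    shuttingDown = true;
    deps.stopAccepting();
    deps.notifyMaster();
    deps.exitAfterGrace();
  };
}
```

It would be registered once per worker with `process.on('uncaughtException', createFatalHandler({ ... }))`.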

Page 66: Towards 100% uptime with node

What about the request that killed the worker?

How does the dying worker gracefully respond to it?

Good question!

Page 67: Towards 100% uptime with node

People are also under the illusion that it is possible to trace back [an uncaught] exception to the http request that caused it...

-felixge, https://github.com/joyent/node/issues/2582

Page 68: Towards 100% uptime with node

This is too bad, because you always want to return a response, even on error.

Page 69: Towards 100% uptime with node

This is Towards 100% Uptime because these approaches don't guarantee a response for every request.

But we can get very close.

Page 70: Towards 100% uptime with node

Fortunately, given what we've seen, uncaughts shouldn't happen often.

And when they do, only one connection will be left hanging.

Page 71: Towards 100% uptime with node

Must restart cluster master when:

- Upgrading Node.
- Cluster master code changes.

Page 72: Towards 100% uptime with node

During timeout periods, might have:

- More workers than CPUs.
- Workers running different versions (old/new).

Should be brief. Probably preferable to downtime.

Page 73: Towards 100% uptime with node

Tip:

Be able to produce errors on demand on your dev and staging servers.

(Disable this in production.)
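One way to do this is a route that fails on purpose. The sketch below is an assumption, not the talk's code: `crashOnDemand`, its `kind` values, and the error messages are invented, and it is written as a plain Express-style handler so it has no dependencies.

```javascript
// Returns an Express-style handler that fails on purpose.
// `kind` selects the failure:
//   'sync'  - throws inside the request handler
//   'async' - throws from a timer, outside the handler's call stack
// When `env` is 'production' the handler only passes the request along.
function crashOnDemand(kind, env) {
  return function (req, res, next) {
    if (env === 'production') return next();
    if (kind === 'async') {
      setTimeout(function () {
        throw new Error('deliberate async error');
      }, 0);
      return;
    }
    throw new Error('deliberate sync error');
  };
}
```

It could be mounted on a route of your choosing (the path is hypothetical), for example `app.get('/crash/async', crashOnDemand('async', process.env.NODE_ENV))`, and then used to watch how a worker is replaced while other requests are in flight.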

Page 74: Towards 100% uptime with node

Tip:

Keep cluster master simple.

It needs to run for a long time without being updated.

Page 75: Towards 100% uptime with node

Things change.

I've been talking about:

    {
      "node": "~0.10.20",
      "express": "~3.4.0",
      "connect": "~2.9.0",
      "mongoose": "~3.6.18",
      "recluster": "=0.3.4"
    }

Page 76: Towards 100% uptime with node

The Future: Node 0.11 / 0.12

For example, cluster module has some changes.

Page 77: Towards 100% uptime with node

Cluster is experimental.

Domains are unstable.

Page 79: Towards 100% uptime with node

If you thought this was interesting,

We're hiring.

careers.fluencia.com

Page 80: Towards 100% uptime with node

[email protected]/sandinmyjoints/towards-100-pct-uptime

github.com/sandinmyjoints/towards-100-pct-uptime-examples