Towards 100% uptime with node

Transcript
Page 1: Towards 100% uptime with node

Towards 100% Uptime with Node.js

Page 2: Towards 100% uptime with node

9M uniques / month.

75K+ users, some of whom are paid subscribers.

Page 3: Towards 100% uptime with node

( We | you | users ) hate downtime.

Page 4: Towards 100% uptime with node

Important, but out of scope:

Redundant infrastructure.
Backups.
Disaster recovery.

Page 5: Towards 100% uptime with node

In scope:

Application errors.
Deploys.
Node.js stuff:
Domains.
Cluster.
Express.

Page 6: Towards 100% uptime with node

Keys to 100% uptime.

Page 7: Towards 100% uptime with node

1. Sensibly handle uncaught exceptions.

Page 8: Towards 100% uptime with node

2. Use domains to catch and contain errors.

Page 9: Towards 100% uptime with node

3. Manage processes with cluster.

Page 10: Towards 100% uptime with node

4. Gracefully terminate connections.

Page 11: Towards 100% uptime with node

1. Sensibly handle uncaught exceptions.

Page 12: Towards 100% uptime with node

Uncaught exceptions happen when:

An exception is thrown but not caught.
An error event is emitted but nothing is listening for it.

Page 13: Towards 100% uptime with node

From node/lib/events.js:

EventEmitter.prototype.emit = function(type) {
  // If there is no 'error' event listener then throw.
  if (type === 'error') {
    ...
    } else if (er instanceof Error) {
      throw er; // Unhandled 'error' event
    } else {
      ...
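
The throw in that excerpt can be observed directly. The following is a minimal sketch, not taken from the talk: it emits 'error' once with no listener and once with one. The variable names are illustrative only.

```javascript
var EventEmitter = require('events').EventEmitter;

var emitter = new EventEmitter();

// No 'error' listener yet: emit() throws the Error it was given.
var thrown = null;
try {
  emitter.emit('error', new Error('nobody is listening'));
} catch (err) {
  // emit() runs synchronously, so try/catch can see the throw here.
  thrown = err;
}
console.log('emit threw:', thrown.message);

// With a listener attached, the same kind of emit is delivered, not thrown.
var handled = null;
emitter.on('error', function (err) {
  handled = err;
});
emitter.emit('error', new Error('somebody is listening'));
console.log('listener got:', handled.message);
```

In a real server the emit usually happens later, from a socket or stream callback, where no try/catch is on the stack. That is the case that takes the process down.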

Page 14: Towards 100% uptime with node

An uncaught exception crashes the process.

Page 15: Towards 100% uptime with node

If the process is a server: 

x 100s??

Page 16: Towards 100% uptime with node

It starts with...

Page 17: Towards 100% uptime with node

Domains.

2. Use domains to catch and contain errors.

Page 18: Towards 100% uptime with node

try/catch doesn't do async.

try {
  var f = function() {
    throw new Error("uh-oh");
  };
  setTimeout(f, 100);
} catch (ex) {
  console.log("try / catch won't catch", ex);
}

Page 19: Towards 100% uptime with node

Domains are a bit like try/catch for async.

var d = require('domain').create();

d.on('error', function (err) {
  console.log("domain caught", err);
});

var f = d.bind(function() {
  throw new Error("uh-oh");
});

setTimeout(f, 100);

Page 20: Towards 100% uptime with node

The active domain is domain.active.

var domain = require('domain'); // needed so that domain.active resolves
var d = domain.create();
console.log(domain.active); // <-- null

var f = d.bind(function() {
  console.log(domain.active === d); // <-- true
  console.log(process.domain === domain.active); // <-- true
  throw new Error("uh-oh");
});

Page 21: Towards 100% uptime with node

New EventEmitters bind to the active domain.

EventEmitter.prototype.emit = function(type) {
  if (type === 'error') {
    if (this.domain) { // This is important!
      ...
      this.domain.emit('error', er);
    } else if ...

Page 22: Towards 100% uptime with node

Log the error.
Helpful additional fields:

error.domain
error.domainEmitter
error.domainBound
error.domainThrown

Page 23: Towards 100% uptime with node

Then it's up to you.

Ignore.
Retry.
Abort (e.g., return 500).
Throw (becomes an unknown error).

Page 24: Towards 100% uptime with node

Do I have to create a new domain every time I do an async operation?

Page 25: Towards 100% uptime with node

Use middleware.
More convenient.

Page 26: Towards 100% uptime with node

In Express, this might look like:

var domain = require('domain');

var domainWrapper = function(req, res, next) {
  // One domain per request, so an error is contained to that request.
  var reqDomain = domain.create();
  reqDomain.add(req);
  reqDomain.add(res);

  reqDomain.once('error', function(err) {
    res.send(500); // or next(err);
  });

  reqDomain.run(next);
};

Based on https://github.com/brianc/node-domain-middleware

https://github.com/mathrawka/express-domain-errors

Page 27: Towards 100% uptime with node

Domain methods.

add: bind an EE to the domain.
run: run a function in context of domain.
bind: bind one function.
intercept: like bind but handles 1st arg err.
dispose: cancels IO and timers.
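
Of these methods, intercept is the least obvious, so here is a small sketch of it. It assumes a made-up callback-style function, fakeRead, standing in for any Node-style API that calls back with (err, data). Nothing in it comes from the talk's code.

```javascript
var domain = require('domain');

var d = domain.create();
var caught = [];
d.on('error', function (err) {
  caught.push(err.message);
});

// Hypothetical stand-in for a Node-style API: calls back with (err, data).
function fakeRead(shouldFail, callback) {
  callback(shouldFail ? new Error('read failed') : null, 'data');
}

var results = [];

// On success the leading err argument is stripped, so the callback
// receives only the data.
fakeRead(false, d.intercept(function (data) {
  results.push(data);
}));

// On failure the callback is skipped and the error goes to the
// domain's 'error' handler instead.
fakeRead(true, d.intercept(function (data) {
  results.push(data);
}));

console.log('results:', results);
console.log('caught:', caught);
```

The trade-off is that the callback no longer sees the error at all, so intercept suits callbacks whose only error handling would have been to pass the error along.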

Page 28: Towards 100% uptime with node

Domains are great

until they're not.

Page 29: Towards 100% uptime with node

node-mongodb-native does not play well with active domain.

console.log(domain.active); // a domain
AppModel.findOne(function(err, doc) {
  console.log(domain.active); // undefined
  next();
});

See https://github.com/LearnBoost/mongoose/pull/1337

Page 30: Towards 100% uptime with node

Fix with explicit binding.

console.log(domain.active); // a domain
AppModel.findOne(domain.active.bind(function(err, doc) {
  console.log(domain.active); // still a domain
  next();
}));

Page 31: Towards 100% uptime with node

What other operations don't play well with domain.active?

Good question!

Package authors could note this.

If you find one, let package author know.

Page 32: Towards 100% uptime with node

Can 100% uptime be achieved just by using domains?

No.
Not if only one instance of your app is running.

Page 33: Towards 100% uptime with node

3. Manage processes with cluster.

Page 34: Towards 100% uptime with node

Cluster module.

Node = one thread per process.

Most machines have multiple CPUs.

One process per CPU = cluster.

Page 35: Towards 100% uptime with node

master / workers

1 master process forks n workers.
Master and workers communicate state via IPC.
When workers want to listen to a socket, master registers them for it.
Each new connection to socket is handed off to a worker.
No shared application state between workers.
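
As a rough illustration of the master / worker split, here is a minimal sketch using only the core cluster, http, and os modules. The function names and port 8000 are assumptions for the example, not part of the talk. main() is deliberately left uncalled so that loading the file forks nothing.

```javascript
var cluster = require('cluster');
var http = require('http');
var os = require('os');

// Master side: fork one worker per CPU and report how many were forked.
function startMaster() {
  var count = os.cpus().length;
  for (var i = 0; i < count; i++) {
    cluster.fork();
  }
  return count;
}

// Worker side: an ordinary HTTP server. Because this runs inside a cluster
// worker, listen() shares the port with the other workers.
function startWorker(port) {
  var server = http.createServer(function (req, res) {
    res.end('handled by pid ' + process.pid + '\n');
  });
  server.listen(port);
  return server;
}

// Entry point. The same script runs in the master and in every worker;
// cluster.isMaster tells them apart (newer Node versions also call it
// isPrimary).
function main() {
  if (cluster.isMaster) {
    startMaster();
  } else {
    startWorker(8000); // hypothetical port
  }
}
```

Calling main() at the bottom of the entry script would start the master, which in turn re-runs the script once per worker.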

Page 36: Towards 100% uptime with node

What about when a worker isn't working anymore?

Some coordination is needed.

Page 37: Towards 100% uptime with node

1. Worker tells cluster master it's done accepting new connections.

2. Cluster master forks replacement.

3. Worker dies.
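
One possible shape for the master's side of these three steps is sketched below. It assumes the worker announces step 1 with an IPC message of the made-up form { cmd: 'done-accepting' }; that message format and the function name replaceWhenDone are inventions for this example. A plain EventEmitter stands in for a cluster worker so the flow can be followed without forking processes.

```javascript
var EventEmitter = require('events').EventEmitter;

// Master side: when a worker reports that it has stopped accepting new
// connections, fork its replacement immediately, at most once per worker.
function replaceWhenDone(worker, fork) {
  var replaced = false;
  worker.on('message', function (msg) {
    if (replaced || !msg || msg.cmd !== 'done-accepting') {
      return;
    }
    replaced = true;
    fork();
  });
}

// Stand-in for a cluster worker, which also emits 'message' in the master.
var fakeWorker = new EventEmitter();
var forks = 0;
replaceWhenDone(fakeWorker, function () {
  forks++;
});

fakeWorker.emit('message', { cmd: 'unrelated' });      // ignored
fakeWorker.emit('message', { cmd: 'done-accepting' }); // step 1, triggers step 2
fakeWorker.emit('message', { cmd: 'done-accepting' }); // duplicate, ignored
console.log('replacements forked:', forks);
```

In a real master, the fork argument would be cluster.fork and the worker would send the message with process.send before it starts shutting down (step 3).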

Page 38: Towards 100% uptime with node

Another use case for cluster:

Deployment.

Want to replace all existing servers.

Something must manage that = cluster master process.

Page 39: Towards 100% uptime with node

Zero downtime deployment.

When master starts, give it a symlink to worker code.

After deploying new code, update symlink.

Send signal to master: fork new workers!

Master tells old workers to shut down, forks new workers from new code.

Master process never stops running.

Page 40: Towards 100% uptime with node

Signals.

A way to communicate with running processes.

SIGHUP: reload workers (some like SIGUSR2).

$ kill -s HUP <pid>
$ service <node-service-name> reload
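
On the receiving end, the master needs a handler for that signal. The sketch below is one way it might look, assuming the master was set up to fork workers from the symlinked path. reloadWorkers and reloadCount are names made up for the example.

```javascript
var cluster = require('cluster');

var reloadCount = 0;

// Replace every current worker: fork the new one first, then ask the old
// one to disconnect so it stops taking new connections.
function reloadWorkers() {
  reloadCount++;
  var workers = cluster.workers || {};
  Object.keys(workers).forEach(function (id) {
    cluster.fork();           // new worker, started from the current code
    workers[id].disconnect(); // old worker winds down
  });
}

// `kill -s HUP <pid>` arrives here as an ordinary event on `process`.
process.on('SIGHUP', reloadWorkers);
```

Forking before disconnecting means the worker count briefly exceeds the usual number, which trades a little extra load for no gap in capacity.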

Page 41: Towards 100% uptime with node

Process management options.

Page 42: Towards 100% uptime with node

Forever
github.com/nodejitsu/forever

Has been around...forever.
No cluster awareness; used on a single process.
Simply restarts the process when it dies.
More comparable to Upstart or Monit.

Page 43: Towards 100% uptime with node

Naught
github.com/superjoe30/naught

Newer.
Cluster aware.
Zero downtime errors and deploys.
Runs as daemon.
Handles log compression, rotation.

Page 44: Towards 100% uptime with node

Recluster
github.com/doxout/recluster

Newer.
Cluster aware.
Zero downtime errors and deploys.
Does not run as daemon.
Log agnostic.
Simple, relatively easy to reason about.

Page 45: Towards 100% uptime with node

We went with recluster.
Happy so far.

Page 46: Towards 100% uptime with node

I have been talking about starting / stopping workers

as if it's atomic.

It's not.

Page 47: Towards 100% uptime with node

4. Gracefully terminate connections

when needed.

Page 48: Towards 100% uptime with node

Don't call process.exit too soon!

Give it a grace period to clean up.

Page 49: Towards 100% uptime with node

Need to clean up:

In-flight requests.
HTTP keep-alive (open TCP) connections.

Page 50: Towards 100% uptime with node

Revisiting our middleware from earlier:

var domainWrapper = function(afterErrorHook) {
  return function(req, res, next) {
    var reqDomain = domain.create();
    reqDomain.add(req);
    reqDomain.add(res);

    reqDomain.once('error', function(err) {
      next(err);
      if (afterErrorHook) afterErrorHook(err); // Hook.
    });
    reqDomain.run(next);
  };
};

Page 51: Towards 100% uptime with node

1. Call server.close.

var afterErrorHook = function(err) {
  server.close(); // <-- ensure no new connections
};

Page 52: Towards 100% uptime with node

2. Shut down keep-alive connections.

var afterErrorHook = function(err) {
  app.set("isShuttingDown", true); // <-- set state
  server.close();
};

var shutdownMiddle = function(req, res, next) {
  if (app.get("isShuttingDown")) { // <-- check state
    req.connection.setTimeout(1); // <-- kill keep-alive
  }
  next();
};

Idea from https://github.com/mathrawka/express-graceful-exit

Page 53: Towards 100% uptime with node

3. Then call process.exit in server.close callback.

var afterErrorHook = function(err) {
  app.set("isShuttingDown", true);
  server.close(function() {
    process.exit(1); // <-- all clear to exit
  });
};

Page 54: Towards 100% uptime with node

Set a timer.

If timeout period expires and server is still around, call process.exit.
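
Combining the server.close callback with that timer might look like the sketch below. The function name shutdownGracefully, its parameters, and the injectable exit argument are assumptions made for illustration; the talk's own examples may be organized differently.

```javascript
// Close the server, then exit. If close has not finished within timeoutMs
// (for instance because a connection never ends), exit anyway.
// `exit` is injectable so the flow can be exercised without ending the
// process; it falls back to process.exit.
function shutdownGracefully(server, timeoutMs, exit) {
  exit = exit || function (code) {
    process.exit(code);
  };

  var timer = setTimeout(function () {
    exit(1); // grace period is over
  }, timeoutMs);
  // Do not let the timer itself keep the process alive.
  if (timer.unref) {
    timer.unref();
  }

  server.close(function () {
    clearTimeout(timer);
    exit(1); // all connections are closed, safe to go
  });
}

// Possible use inside the error hook (values are hypothetical):
//   shutdownGracefully(server, 30 * 1000);
```

The length of the grace period is a judgment call: long enough for normal in-flight requests to finish, short enough that a stuck worker does not linger.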

Page 55: Towards 100% uptime with node

Summing up:

Our ideal server.

Page 56: Towards 100% uptime with node

On startup:

Cluster master comes up (for example, via Upstart).
Cluster master forks workers from symlink.
Each worker's server starts accepting connections.

Page 57: Towards 100% uptime with node

On deploy:

Point symlink to new version.
Send signal to cluster master.
Master tells existing workers to stop accepting new connections.
Master forks new workers from new code.
Existing workers shut down gracefully.

Page 58: Towards 100% uptime with node

On error:

Server catches it via domain.
Next action depends on you: retry? abort? rethrow? etc.

Page 59: Towards 100% uptime with node

On uncaught exception: ??

// The infamous "uncaughtException" event!
process.on('uncaughtException', function(err) {
  // ??
});

Page 60: Towards 100% uptime with node

Back to where we started:

1. Sensibly handle uncaught exceptions.

We have minimized these by using domains.

But they can still happen.

Page 61: Towards 100% uptime with node

Node docs say not to keep running.

An unhandled exception means your application - and by extension node.js itself - is in an undefined state. Blindly resuming means anything could happen. You have been warned.

http://nodejs.org/api/process.html#process_event_uncaughtexception

Page 62: Towards 100% uptime with node

What to do?

First, log the error so you know what happened.

Page 63: Towards 100% uptime with node

Then, you've got to kill the process.

Page 64: Towards 100% uptime with node

It's not so bad. We can now do so with minimal trouble.

Page 65: Towards 100% uptime with node

On uncaught exception:

Log error.
Server stops accepting new connections.
Worker tells cluster master it's done.
Master forks a replacement worker.
Worker exits gracefully when all connections are closed, or after timeout.
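
The worker's part of that sequence could be sketched as a handler factory like the one below. Everything about its shape is an assumption for illustration: makeCrashHandler, the deps object, and the notifyMaster callback (which might wrap process.send) are not from the talk.

```javascript
// Build an 'uncaughtException' handler from injected pieces, so each step
// of the sequence is visible and can be exercised with stand-ins.
function makeCrashHandler(deps) {
  var handling = false;

  return function onUncaughtException(err) {
    if (handling) {
      return; // already shutting down; the timer below will finish the job
    }
    handling = true;

    deps.log(err); // 1. log the error

    // Last resort: exit after the grace period no matter what.
    var timer = setTimeout(function () {
      deps.exit(1);
    }, deps.timeoutMs);
    if (timer.unref) {
      timer.unref();
    }

    // 2. stop accepting new connections; exit once existing ones are done
    deps.server.close(function () {
      clearTimeout(timer);
      deps.exit(1);
    });

    deps.notifyMaster(); // 3. tell the master, so it can fork a replacement
  };
}

// Possible wiring in a worker (all values hypothetical):
//   process.on('uncaughtException', makeCrashHandler({
//     log: console.error,
//     server: server,
//     notifyMaster: function () { process.send({ cmd: 'done-accepting' }); },
//     exit: function (code) { process.exit(code); },
//     timeoutMs: 30 * 1000
//   }));
```

Keeping the handler this small matters because the process is in an unknown state by the time it runs; the less it does before exiting, the less can go wrong.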

Page 66: Towards 100% uptime with node

What about the request that killed the worker?

How does the dying worker gracefully respond to it?

Good question!

Page 67: Towards 100% uptime with node

People are also under the illusion that it is possible to trace back [an uncaught] exception to the http request that caused it...

-felixge, https://github.com/joyent/node/issues/2582

Page 68: Towards 100% uptime with node

This is too bad, because you always want to return a response,

even on error.

Page 69: Towards 100% uptime with node

This is "Towards" 100% uptime because these approaches don't guarantee a response for every request.

But we can get very close.

Page 70: Towards 100% uptime with node

Fortunately, given what we've seen, uncaughts shouldn't happen often.

And when they do, only one connection will be left hanging.

Page 71: Towards 100% uptime with node

Must restart cluster master when:

Upgrading Node.
Cluster master code changes.

Page 72: Towards 100% uptime with node

During timeout periods, might have:

More workers than CPUs.
Workers running different versions (old/new).

Should be brief. Probably preferable to downtime.

Page 73: Towards 100% uptime with node

Tip:

Be able to produce errors on demand on your dev and staging servers.
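
A small Express-style middleware is one way to do this. The sketch below is an assumption, not the talk's implementation: the factory name errorOnDemand and the /__crash route are made up, and the environment string is passed in explicitly.

```javascript
// Returns a middleware that throws asynchronously when a special URL is
// requested, except in production, where it always passes the request on.
function errorOnDemand(env) {
  return function (req, res, next) {
    if (env === 'production' || req.url !== '/__crash') {
      return next();
    }
    // Throw from a later tick, outside this call stack, so that no
    // surrounding try/catch can swallow the error.
    setImmediate(function () {
      throw new Error('error on demand');
    });
  };
}

// Possible use (hypothetical): app.use(errorOnDemand(process.env.NODE_ENV));
```

Hitting the route on a staging box is then a quick way to watch the whole sequence: error logged, replacement forked, old worker gone.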

(Disable this in production.)

Page 74: Towards 100% uptime with node

Tip:

Keep cluster master simple.
It needs to run for a long time without being updated.

Page 75: Towards 100% uptime with node

Things change.

I've been talking about:

{
  "node": "~0.10.20",
  "express": "~3.4.0",
  "connect": "~2.9.0",
  "mongoose": "~3.6.18",
  "recluster": "=0.3.4"
}

Page 76: Towards 100% uptime with node

The Future: Node 0.11 / 0.12

For example, cluster module has some changes.

Page 77: Towards 100% uptime with node

Cluster is experimental.
Domains are unstable.

Page 79: Towards 100% uptime with node

If you thought this was interesting,

We're hiring.
careers.fluencia.com

Page 80: Towards 100% uptime with node

[email protected]/sandinmyjoints/towards-100-pct-uptime
github.com/sandinmyjoints/towards-100-pct-uptime-examples