Bugs
From Outer Space
While42 — SF chapter — #6
Why this talk?
Codito, ergo erro
I code, therefore I make mistakes
Outline
I'll show some really nasty bugs,
and tell stories of inglorious battles.
(Some of which I've actually fought!)
Featuring: Node.js, EC2, LXC, pseudo-terminals
and also: hardware bugs, dangerous bugs...
Our files, Node.js is truncating them!
It all starts with an angry customer.
“Sometimes, downloading this 700 KB JSON
file will fail, because it’s truncated!”
But… Do you even Content-Length?
(The client library should scream, but it
doesn’t.)
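Here's the check a well-behaved client library should do, as a minimal Node.js sketch (the URL is the endpoint from this story; plain HTTP and the exact path are assumptions):

// Compare the Content-Length header with the number of bytes actually received.
const http = require('http');

http.get('http://angrystartup.com/api/v1/download-all-the-things.json', (res) => {
  const expected = parseInt(res.headers['content-length'], 10);
  let received = 0;
  res.on('data', (chunk) => { received += chunk.length; });
  res.on('end', () => {
    if (expected && received !== expected) {
      console.error('truncated: got ' + received + ' of ' + expected + ' bytes');
    } else {
      console.log('ok: ' + received + ' bytes');
    }
  });
});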
Gotta Sniff Some Packets
Log into the load balancer (running Hipache)...
# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
interface: any
filter: (ip or ip6) and ( tcp port 80 )
match: /api/v1/download-all-the-things
####
T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]
GET /api/v1/download-all-the-things.json HTTP/1.0.
Host: angrystartup.com
X-Forwarded-Port: 443.
X-Forwarded-For: ::ffff:24.13.146.16.
X-Forwarded-Proto: https.
...
Ngrep Doesn’t Cut It.
FETCH THE WIRESHARKS!
# tcpdump -peni any -s0 -wdump tcp port 80
(Wait a bit)
^C
Transfer dump file
DEMO TIME!
What did we find out?
Truncated files happen because a chunk
(probably exactly one) gets dropped.
Impossible to reproduce locally.
Only the customer sees the problem.
THE PLOT THICKENS.
GET YOUR SWIMSUITS,
WE’RE DIVING INTO CODE!
This is Node.js. I have no idea what I’m doing.
Add console.log() statements in Hipache.
Add console.log() statements in node-http-proxy.
Add console.log() statements in node/lib/http.js.
The latter didn’t work.
“Fix”: replace require('http') with require('_http')
and add our own _http.js to our node_modules.
Do the same to net.js (in “our” _http.js).
Now analyze an endless stream of obscure events.
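For the curious, here's a sketch of the general idea (not the actual instrumented copy of node/lib/http.js): a drop-in _http.js that hands back the real module, plus a helper to trace the stream events we care about.

// node_modules/_http.js (sketch): re-export the real http module, plus a
// helper to log the lifecycle events of any stream we want to watch.
const http = require('http');

function trace(stream, name) {
  // 'data' is deliberately left out: adding a 'data' listener would switch
  // the stream into flowing mode and change the very behavior we are debugging.
  ['end', 'close', 'error', 'pause', 'resume', 'drain'].forEach(function (ev) {
    stream.on(ev, function () { console.log('[' + name + '] ' + ev); });
  });
  return stream;
}

module.exports = http;
module.exports.trace = trace;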
It’s all in the pauses
Backend sends lots of data to Hipache.
Hipache sends data to client, but client is slow.
Hipache “pauses” the backend stream.
(i.e. stops reading from the network socket.)
When the client has read enough data,
Hipache “resumes” the stream.
etc.
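In code, that dance looks roughly like this (a sketch of what stream.pipe() does for you under the hood; `backend` and `client` are hypothetical sockets):

// Relay data from the backend to the client, with manual backpressure.
function relay(backend, client) {
  backend.on('data', function (chunk) {
    // The client's write buffer is full: stop reading from the backend...
    if (!client.write(chunk)) {
      backend.pause();
    }
  });
  // ...and start reading again once the client has drained its buffer.
  client.on('drain', function () { backend.resume(); });
  backend.on('end', function () { client.end(); });
}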
SO FAR, SO GOOD
It’s all in the awkward
……………………...pauses
There are two layers in Node: tcp and http.
When the tcp layer reads the last chunk, the socket is closed by the backend.
The tcp layer notices, and sends an “end” event.
The “end” event causes the “http” layer to finish what it was doing, without sending a “resume”.
As a result, some chunks remain in the buffers
of the tcp layer. Lost in space. Forever alone.
How do we fix this?
Pester Node.js folks
Catch that “end” event, and when it happens,
send a “resume” to the stream to drain it.
(Implementation detail: you only have the http
socket, and you need to listen for an event on
the tcp socket, so you need to do slightly dirty
things with the http socket. But eh, it works!)
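Here's a sketch of that workaround (not the exact Hipache patch; `backendResponse` stands for the http response being proxied):

// When the backend closes the connection, resume the response stream so the
// chunks still sitting in the tcp layer's buffers get flushed out instead of
// being dropped.
function preventTruncation(backendResponse) {
  // The "slightly dirty" part: reach through the http object to grab the
  // underlying tcp socket, and listen for its 'end' event.
  const socket = backendResponse.socket || backendResponse.connection;
  socket.on('end', function () {
    backendResponse.resume();
  });
}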
What did we learn?
When you can’t reproduce a bug at will, record
it in action (tcpdump) and dissect it
(wireshark).
Spraying code with print statements helps.
(But it’s better to use the logging framework!)
You don’t have to know Node.js to fix Node.js!
Hardware has bugs, too
Pentium FDIV bug (1994):
errors at 4th decimal place
Pentium F00F bug (1997):
using the wrong instruction hangs the machine
ATA transfer speeds vary when you touch
ribbon cables (SATA introduced in 2003)
A story of Go, PTYs, LXC:
It never works the first time
# docker run -t -i ubuntu echo hello world
2013/08/06 23:20:53 Error: Error starting container 06d642aae1a:
fork/exec /usr/bin/lxc-start: operation not permitted
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
Strace to the rescue!
Steps:
1. Boot the machine.
2. Find pid of process to analyze.
(ps|grep, pidof docker...)
3. “strace -o log -f -p $PID”
4. “docker run -t -i ubuntu echo hello world”
5. Ctrl-C the strace process.
6. Repeat steps 3-4-5, using a different log file.
Note: can also strace directly, e.g. “strace ls”.
Let’s compare the log files
Thousands and thousands of lines.
Look for the error message.
(e.g. “operation not permitted”)
Other approach: start from the end, and try to
find the point when things started to diverge.
That’s why we have dual 30” monitors.
Investigation results
First time
[pid 1331] setsid() = 1331
[pid 1331] dup2(10, 0) = 0
[pid 1331] dup2(10, 1) = 1
[pid 1331] dup2(10, 2) = 2
[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)
[pid 1331] write(12, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 1331] _exit(253) = ?

Second time (and every following attempt)
[pid 1414] setsid() = 1414
[pid 1414] dup2(14, 0) = 0
[pid 1414] dup2(14, 1) = 1
[pid 1414] dup2(14, 2) = 2
[pid 1414] ioctl(0, TIOCSCTTY) = 0
[pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>
What does that mean?
For some reason, some part of the code wants
file descriptor 0 (that’s stdin) to be a terminal.
The first time we run, it fails, but in the process, we acquire a controlling terminal.
(UNIX 101: when a session leader that has no controlling terminal opens a file which is a terminal, that terminal becomes its controlling terminal, unless the file is opened with the O_NOCTTY flag. And we just called setsid(), so we are a session leader.)
The next attempts are therefore successful.
… Really?
To confirm that this is indeed the bug:
● start the process with “setsid”
(which detaches from the controlling
terminal)
and see that the bug is back;
● check the output of “ps” (it shows controlling
terminals) and see that indeed, before the
first execution, we didn’t have a controlling
terminal, and we have one after!
23083 ? Sl+ 0:12 ./docker -d -b br0
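Another way to check (a Node.js sketch, assuming Linux): read the tty_nr field of /proc/self/stat, which is 0 when there is no controlling terminal. Run it normally, then under setsid, and compare.

// Print the controlling terminal of the current process (0 = none).
const fs = require('fs');

function controllingTty() {
  const stat = fs.readFileSync('/proc/self/stat', 'utf8');
  // Skip past "pid (comm) "; the remaining fields are:
  // state ppid pgrp session tty_nr ...
  const fields = stat.slice(stat.lastIndexOf(')') + 2).split(' ');
  return Number(fields[4]);
}

const tty = controllingTty();
console.log(tty === 0 ? 'no controlling terminal' : 'controlling terminal: ' + tty);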
How to fix the bug?
¯\_(ツ)_/¯
I don’t know — yet!
(The bug was diagnosed last week,
and honestly, it’s not a showstopper.)
What did we learn?
strace is awesome for analyzing the behavior of running processes.
ltrace can be used, too, if you want to
analyze library calls rather than system calls.
If you’re really desperate, gdb is your friend.
(A very peculiar friend, but a friend
nonetheless.)
“Errare humanum est,
perseverare autem
diabolicum”
“To err is human,
but to really foul things up,
you need a computer”
Really nasty (and sad) bug:
The Therac-25
Radiotherapy machine (shoots beams to cure cancer)
Two modes: low energy and high energy.
In high energy mode, a special filter is inserted.
In earlier models, a hardware interlock prevented the high-energy beam from firing if the filter was not in place.
On the Therac-25, it’s in software.
Konami Code of Death
On the keyboard, press (in less than 8
seconds)
X ↑ E [ENTER] B
...And the high energy beam shoots, unfiltered!
6 accidents, 3 died. (This was 1985-1987.)
Explanation: race condition in the software.
It never happened during tests: this was an unusual sequence, and testers didn't type fast enough to trigger the race (experienced operators did).
Aggravating details
Many engineering and institutional issues. (No software review, no evaluation of possible failures, undocumented error codes, no sensor feedback…)
After entering the sequence and sending one
beam, the machine would display an error.
But errors happened “all the time” (usually
without adverse effect) so the operator would
just proceed (equivalent of pressing “retry”).
Let’s get back to weird
Linux Kernel bugs
Random crashes on EC2
Pool of ~50 identical instances, with same role.
Sometimes, one of them would crash.
Total crash: no SSH, no ping, no log, no
nothing.
EC2 console won’t show anything.
REPRODUCE THE BUG?
IMPOSSIBURU!
Try a million things...
Different kernel versions
Different filesystem tunings
Different security settings (GRSEC)
Different memory settings (overcommit, OOM)
Different instance sizes
Different EBS volumes
Different differences
NOTHING CHANGED
And one fine day...
A random test machine seems to exhibit the
bug very frequently (it would crash in a few
days, sometimes just a few hours).
CLONE IT!
ONE MILLION TIMES!
But, still...
We changed everything (again),
but we couldn’t find anything (again).
So we did something completely crazy:
we contacted AWS support (imagine that).
They asked us to repeat the tests with an
“official” image (AMI). This required porting
our runtime from Ubuntu 10.04 to 12.04.
And… (I’m running out of segues)
We re-ran the tests with the official image,
the machine crashed, we left it in crashed
state,
support analyzed the image.
Almost instantly, they told us
“oh yeah it’s a known issue,
see that link.”
U SERIOUS?
The explanation
The bug happens:
● on workloads using spinlocks intensively;
● only on Xen VMs with many CPUs.
It is linked to the special implementation of
spinlocks in Xen VMs.
When waking up CPUs waiting on a spinlock,
the code would only wake up the 1st one,
even if there were multiple CPUs waiting.
The patch (priceless)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..67bc7ba 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
 		if (per_cpu(lock_spinners, cpu) == xl) {
 			ADD_STATS(released_slow_kicked, 1);
 			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
-			break;
 		}
 	}
 }
--
What did we learn?
We didn’t try all the combinations.
(Trying on HVM machines would have
helped!)
AWS support can be helpful sometimes.
(This one was a surprise.)
Trying to debug a kernel issue without console
output is like trying to learn to read in the
dark.
(Compare to local VM with serial output…)
Overall Conclusions
When facing a mystic bug from outer space:
● reproduce it at all costs!
● collect data with tcpdump, ngrep, wireshark,
strace, ltrace, gdb; and log files, obviously!
● don’t be afraid of uncharted places!
● document it, at least with a 2 AM ragetweet!
Thank you! Questions?
Gotta follow them all:
@kwarter
@while_42
@GITSF
@dot_cloud
@docker
Your speaker today was:
Jérôme Petazzoni, dotCloud
@jpetazzo