Bugs
From Outer Space
While42 — SF chapter — #6
Why this talk?
Codito, ergo erro
I code, therefore I make mistakes
Outline
I'll show some really nasty bugs,
and tell stories of inglorious battles.
(Some of which I've actually fought!)
Featuring: Node.js, EC2, LXC, pseudo-terminals
and also: hardware bugs, dangerous bugs...
Our files, Node.js is truncating them!
It all starts with an angry customer.
“Sometimes, downloading this 700 KB JSON
file will fail, because it’s truncated!”
But… Do you even Content-Length?
(The client library should scream, but it
doesn’t.)
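Here's the check a well-behaved client library should do, as a minimal Node.js sketch (the URL is the endpoint from this story; plain HTTP and the exact path are assumptions):

// Compare the Content-Length header with the number of bytes actually received.
const http = require('http');

http.get('http://angrystartup.com/api/v1/download-all-the-things.json', (res) => {
  const expected = parseInt(res.headers['content-length'], 10);
  let received = 0;
  res.on('data', (chunk) => { received += chunk.length; });
  res.on('end', () => {
    if (expected && received !== expected) {
      console.error('truncated: got ' + received + ' of ' + expected + ' bytes');
    } else {
      console.log('ok: ' + received + ' bytes');
    }
  });
});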
Gotta Sniff Some Packets
Log into the load balancer (running Hipache)...
# ngrep -tipd any -Wbyline '/api/v1/download-all-the-things' tcp port 80
interface: any
filter: (ip or ip6) and ( tcp port 80 )
match: /api/v1/download-all-the-things
####
T 2013/08/22 04:11:27.848663 23.20.88.251:55983 -> 10.116.195.150:80 [AP]
GET /api/v1/download-all-the-things.json HTTP/1.0.
Host: angrystartup.com
X-Forwarded-Port: 443.
X-Forwarded-For: ::ffff:24.13.146.16.
X-Forwarded-Proto: https.
...
Ngrep Doesn’t Cut It.
FETCH THE WIRESHARKS!
# tcpdump -peni any -s0 -wdump tcp port 80
(Wait a bit)
^C
Transfer dump file
DEMO TIME!
What did we find out?
Truncated files happen because a chunk
(probably exactly one) gets dropped.
Impossible to reproduce locally.
Only the customer sees the problem.
THE PLOT THICKENS.
GET YOUR SWIMSUITS,
WE’RE DIVING INTO CODE!
This is Node.js. I have no idea what I’m doing.
Add console.log() statements in Hipache.
Add console.log() statements in node-http-proxy.
Add console.log() statements in node/lib/http.js.
The latter didn’t work.
“Fix”: replace require('http') with require('_http')
and add our own _http.js to our node_modules.
Do the same to net.js (in “our” _http.js).
Now analyze an endless stream of obscure events.
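For the curious, here's a sketch of the general idea (not the actual instrumented copy of node/lib/http.js): a drop-in _http.js that hands back the real module, plus a helper to trace the stream events we care about.

// node_modules/_http.js (sketch): re-export the real http module, plus a
// helper to log the lifecycle events of any stream we want to watch.
const http = require('http');

function trace(stream, name) {
  // 'data' is deliberately left out: adding a 'data' listener would switch
  // the stream into flowing mode and change the very behavior we are debugging.
  ['end', 'close', 'error', 'pause', 'resume', 'drain'].forEach(function (ev) {
    stream.on(ev, function () { console.log('[' + name + '] ' + ev); });
  });
  return stream;
}

module.exports = http;
module.exports.trace = trace;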
It’s all in the pauses
Backend sends lots of data to Hipache.
Hipache sends data to client, but client is slow.
Hipache “pauses” the backend stream.
(i.e. stops reading from the network socket.)
When the client has read enough data,
Hipache “resumes” the stream.
etc.
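In code, that dance looks roughly like this (a sketch of what stream.pipe() does for you under the hood; `backend` and `client` are hypothetical sockets):

// Relay data from the backend to the client, with manual backpressure.
function relay(backend, client) {
  backend.on('data', function (chunk) {
    // The client's write buffer is full: stop reading from the backend...
    if (!client.write(chunk)) {
      backend.pause();
    }
  });
  // ...and start reading again once the client has drained its buffer.
  client.on('drain', function () { backend.resume(); });
  backend.on('end', function () { client.end(); });
}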
SO FAR, SO GOOD
It’s all in the awkward
……………………...pauses
There are two layers in Node: tcp and http.
When the tcp layer reads the last chunk, the socket is closed by the backend.
The tcp layer notices, and sends an “end” event.
The “end” event causes the “http” layer to finish what it was doing, without sending a “resume”.
As a result, some chunks remain in the buffers
of the tcp layer. Lost in space. Forever alone.
How do we fix this?
Pester Node.js folks
Catch that “end” event, and when it happens,
send a “resume” to the stream to drain it.
(Implementation detail: you only have the http
socket, and you need to listen for an event on
the tcp socket, so you need to do slightly dirty
things with the http socket. But eh, it works!)
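Here's a sketch of that workaround (not the exact Hipache patch; `backendResponse` stands for the http response being proxied):

// When the backend closes the connection, resume the response stream so the
// chunks still sitting in the tcp layer's buffers get flushed out instead of
// being dropped.
function preventTruncation(backendResponse) {
  // The "slightly dirty" part: reach through the http object to grab the
  // underlying tcp socket, and listen for its 'end' event.
  const socket = backendResponse.socket || backendResponse.connection;
  socket.on('end', function () {
    backendResponse.resume();
  });
}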
What did we learn?
When you can’t reproduce a bug at will, record
it in action (tcpdump) and dissect it
(wireshark).
Spraying code with print statements helps.
(But it’s better to use the logging framework!)
You don’t have to know Node.js to fix Node.js!
Hardware has bugs, too
Pentium FDIV bug (1994):
errors at 4th decimal place
Pentium F00F bug (1997):
using the wrong instruction hangs the machine
ATA transfer speeds vary when you touch
ribbon cables (SATA introduced in 2003)
A story of Go, PTYs, LXC:
It never works the first time
# docker run -t -i ubuntu echo hello world
2013/08/06 23:20:53 Error: Error starting container 06d642aae1a:
fork/exec /usr/bin/lxc-start: operation not permitted
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
# docker run -t -i ubuntu echo hello world
hello world
Strace to the rescue!
Steps:
1. Boot the machine.
2. Find pid of process to analyze.
(ps|grep, pidof docker...)
3. “strace -o log -f -p $PID”
4. “docker run -t -i ubuntu echo hello world”
5. Ctrl-C the strace process.
6. Repeat steps 3-4-5, using a different log file.
Note: can also strace directly, e.g. “strace ls”.
Let’s compare the log files
Thousands and thousands of lines.
Look for the error message.
(e.g. “operation not permitted”)
Other approach: start from the end, and try to
find the point when things started to diverge.
That’s why we have dual 30” monitors.
Investigation results
First time
[pid 1331] setsid() = 1331
[pid 1331] dup2(10, 0) = 0
[pid 1331] dup2(10, 1) = 1
[pid 1331] dup2(10, 2) = 2
[pid 1331] ioctl(0, TIOCSCTTY) = -1 EPERM (Operation not permitted)
[pid 1331] write(12, "\1\0\0\0\0\0\0\0", 8) = 8
[pid 1331] _exit(253) = ?

Second time (and every following attempt)
[pid 1414] setsid() = 1414
[pid 1414] dup2(14, 0) = 0
[pid 1414] dup2(14, 1) = 1
[pid 1414] dup2(14, 2) = 2
[pid 1414] ioctl(0, TIOCSCTTY) = 0
[pid 1414] execve("/usr/bin/lxc-start", ["lxc-start", "-n", ...]) <...>
What does that mean?
For some reason, some part of the code wants
file descriptor 0 (that’s stdin) to be a terminal.
The first time we run, it fails, but in the process, we acquire a controlling terminal.
(UNIX 101: when a session leader that has no controlling terminal opens a file which is a terminal, that terminal becomes its controlling terminal, unless the file is opened with the O_NOCTTY flag. And we just called setsid(), so we are a session leader.)
The next attempts are therefore successful.
… Really?
To confirm that this is indeed the bug:
● start the process with “setsid”
(which detaches from the controlling
terminal)
and see that the bug is back;
● check the output of “ps” (it shows controlling
terminals) and see that indeed, before the
first execution, we didn’t have a controlling
terminal, and we have one after!
23083 ? Sl+ 0:12 ./docker -d -b br0
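Another way to check (a Node.js sketch, assuming Linux): read the tty_nr field of /proc/self/stat, which is 0 when there is no controlling terminal. Run it normally, then under setsid, and compare.

// Print the controlling terminal of the current process (0 = none).
const fs = require('fs');

function controllingTty() {
  const stat = fs.readFileSync('/proc/self/stat', 'utf8');
  // Skip past "pid (comm) "; the remaining fields are:
  // state ppid pgrp session tty_nr ...
  const fields = stat.slice(stat.lastIndexOf(')') + 2).split(' ');
  return Number(fields[4]);
}

const tty = controllingTty();
console.log(tty === 0 ? 'no controlling terminal' : 'controlling terminal: ' + tty);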
How to fix the bug?
¯\_(ツ)_/¯
I don’t know — yet!
(The bug was diagnosed last week,
and honestly, it’s not a showstopper.)
What did we learn?
strace is awesome for analyzing the behavior of running processes.
ltrace can be used, too, if you want to
analyze library calls rather than system calls.
If you’re really desperate, gdb is your friend.
(A very peculiar friend, but a friend
nonetheless.)
“Errare humanum est,
perseverare autem
diabolicum”
“To err is human,
but to really foul things up,
you need a computer”
Really nasty (and sad) bug:
The Therac-25
Radiotherapy machine (shoots beams to cure cancer)
Two modes: low energy and high energy.
In high energy mode, a special filter is inserted.
In earlier models, a hardware interlock prevented the high-energy beam from firing if the filter was not in place.
On the Therac-25, it’s in software.
Konami Code of Death
On the keyboard, press (in less than 8
seconds)
X ↑ E [ENTER] B
...And the high energy beam shoots, unfiltered!
6 accidents, 3 died. (This was 1985-1987.)
Explanation: race condition in the software.
It never happened during tests: this was an unusual sequence, and testers didn't type fast enough to trigger the race (experienced operators did).
Aggravating details
Many engineering and institutional issues. (No software review, no evaluation of possible failures, undocumented error codes, no sensor feedback…)
After entering the sequence and sending one
beam, the machine would display an error.
But errors happened “all the time” (usually
without adverse effect) so the operator would
just proceed (equivalent of pressing “retry”).
Let’s get back to weird
Linux Kernel bugs
Random crashes on EC2
Pool of ~50 identical instances, with same role.
Sometimes, one of them would crash.
Total crash: no SSH, no ping, no log, no
nothing.
EC2 console won’t show anything.
REPRODUCE THE BUG?
IMPOSSIBURU!
Try a million things...
Different kernel versions
Different filesystem tunings
Different security settings (GRSEC)
Different memory settings (overcommit, OOM)
Different instance sizes
Different EBS volumes
Different differences
NOTHING CHANGED
And one fine day...
A random test machine seems to exhibit the
bug very frequently (it would crash in a few
days, sometimes just a few hours).
CLONE IT!
ONE MILLION TIMES!
But, still...
We changed everything (again),
but we couldn’t find anything (again).
So we did something completely crazy:
we contacted AWS support (imagine that).
They asked us to repeat the tests with an
“official” image (AMI). This required porting
our runtime from Ubuntu 10.04 to 12.04.
And… (I’m running out of segues)
We re-ran the tests with the official image,
the machine crashed, we left it in crashed
state,
support analyzed the image.
Almost instantly, they told us
“oh yeah it’s a known issue,
see that link.”
U SERIOUS?
The explanation
The bug happens:
● on workloads using spinlocks intensively;
● only on Xen VMs with many CPUs.
It is linked to the special implementation of
spinlocks in Xen VMs.
When waking up CPUs waiting on a spinlock,
the code would only wake up the 1st one,
even if there were multiple CPUs waiting.
The patch (priceless)
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d69cc6c..67bc7ba 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -328,7 +328,6 @@ static noinline void xen_spin_unlock_slow(struct xen_spinlock *xl)
 		if (per_cpu(lock_spinners, cpu) == xl) {
 			ADD_STATS(released_slow_kicked, 1);
 			xen_send_IPI_one(cpu, XEN_SPIN_UNLOCK_VECTOR);
-			break;
 		}
 	}
 }
--
What did we learn?
We didn’t try all the combinations.
(Trying on HVM machines would have
helped!)
AWS support can be helpful sometimes.
(This one was a surprise.)
Trying to debug a kernel issue without console
output is like trying to learn to read in the
dark.
(Compare to local VM with serial output…)
Overall Conclusions
When facing a mystic bug from outer space:
● reproduce it at all costs!
● collect data with tcpdump, ngrep, wireshark,
strace, ltrace, gdb; and log files, obviously!
● don’t be afraid of uncharted places!
● document it, at least with a 2 AM ragetweet!
Thank you! Questions?
Gotta follow them all:
@kwarter
@while_42
@GITSF
@dot_cloud
@docker
Your speaker today was:
Jérôme Petazzoni, dotCloud
@jpetazzo