Child panics on OpenSolaris

Poul-Henning Kamp phk at phk.freebsd.dk
Fri Feb 12 19:26:18 CET 2010


In message <282e72051002120949u56eb6914mbd55e5a355931a38 at mail.gmail.com>, Paul 
Wright writes:

>> I wonder if Solaris has some kind of "I already told you it was
>> closed once" logic...
>
>Here's a snippet from varnishlog for one of these panics (also
>attached to this email in case line wraps wreck formatting):

Interesting!

This time the EBADF comes in the original worker thread, before we
hand the file descriptor over to the waiter, eliminating that entire
ball of wax from the picture.

>  419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091]
>  42679d: /opt/sbin/varnishd'cnt_wait+0x2bd [0x42679d]

I can find absolutely no trace of EBADF meaning "remote end closed"
in the Solaris docs or other docs on the web, but as far as I can
tell, that is indeed what happens here.

But as a kernel programmer, I can see where this might come from:

Receiving a TCP-RST means that the socket is never going to be
useful again.  Since you already have the socket/pcb locked, taking
it entirely out of its misery right away is cheap, and more efficient
than waiting for the process to notice and issue a close(2) on it,
and then having to relock the socket/pcb again etc. etc.

Next time you try to use the file descriptor, there is no socket
and EBADF ensues.

Reasoning that most programs notice the return value, and call
strerror(3) not caring very much what the exact value of errno
is, you can get away with returning EBADF.

Varnish, however, is written by a cranky old FreeBSD kernel hacker,
who has no pretensions about writing correct code the first time,
so 10% of the lines are asserts and yes, I actually _do_ care about
the specific errnos returned.
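
To make the contrast concrete, here is a minimal sketch of what caring
about the specific errno looks like.  This is an illustration only, not
code from Varnish; the helper names peer_went_away() and
check_socket_error() are hypothetical, and the set of "peer is gone"
errnos is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not Varnish code: the errno values we are
 * prepared to accept as "the remote end has gone away".
 */
static bool
peer_went_away(int err)
{
	return (err == ECONNRESET || err == EPIPE || err == ENOTCONN);
}

/*
 * Hypothetical helper: a strict caller asserts that a socket error
 * is one of the expected ones.  EBADF is not on the list, so it
 * trips the assert instead of being silently passed to strerror(3).
 */
static void
check_socket_error(int err)
{
	assert(peer_went_away(err));
}
```

A program that only prints strerror(errno) would never notice the
difference; one written this way stops as soon as an unexpected errno
shows up.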

And EBADF is not just any error code, it is the only errno which
has universally been recognized as meaning "programmer screwed up",
because you can only get it if you muck up your file descriptors.

Or, as one of the first hits Google gave me when researching this
describes it, more politely but no less firmly:

	Bad file number (EBADF): The file descriptor references a
	file that is either not open or is open for a conflicting
	purpose. (eg, a read(2) is specified against a file that
	is open for write(2) or vice-versa.) This is a programming
	bug.

	(http://www.princeton.edu/~unix/Solaris/troubleshoot/error.html)
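
The conventional way to earn an EBADF can be shown in a few lines.  The
sketch below is an illustration only (the function name
errno_from_closed_fd() is made up): it closes a descriptor and then
uses it anyway, which on a POSIX system yields EBADF.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Illustration only: write to a descriptor we have already closed
 * and return the errno the kernel hands back.
 */
static int
errno_from_closed_fd(void)
{
	int fds[2];
	int r;
	char c = 'x';

	r = pipe(fds);
	assert(r == 0);

	(void)close(fds[0]);
	(void)close(fds[1]);		/* The programmer mucks up here... */

	errno = 0;
	if (write(fds[1], &c, 1) == -1)	/* ...and uses the fd anyway. */
		return (errno);
	return (0);
}
```

Here the program itself closed the descriptor, which is what makes
EBADF a programming bug in this case.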

If I had implemented the hack I suspect Solaris contains, I would
have found some bit somewhere to make sure the errno would be the
correct, documented and expected one:

	#define ECONNRESET      54     /* Connection reset by peer */

Somebody with a Solaris service contract, if such things still
exist, should report this as a bug to them...

I will add a workaround to Varnish, with a suitable sarcastic
commentary...
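
One possible shape for such a workaround, as a rough sketch only: this
is not the actual Varnish change, and both the function name
normalize_socket_errno() and the ebadf_means_reset flag are
hypothetical.  The idea is to map EBADF back to ECONNRESET only where
the platform is suspected of reporting a reset connection that way.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical workaround sketch, not the actual Varnish code.
 * If the platform is suspected of returning EBADF for a connection
 * that was reset by the peer, translate it to ECONNRESET; pass
 * every other errno through unchanged.
 */
static int
normalize_socket_errno(int err, int ebadf_means_reset)
{
	assert(err != 0);	/* Only call this with a real error. */

	if (ebadf_means_reset && err == EBADF)
		return (ECONNRESET);
	return (err);
}
```

The obvious cost is that, with the flag set, a genuine file descriptor
bug would be hidden as well, so the flag should only be enabled on the
affected platform.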

Poul-Henning

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

