smart health check response?

David Birdsong david.birdsong at gmail.com
Mon Sep 14 07:51:56 CEST 2009


So far my experience with Varnish has been that it runs great until it
becomes overworked and consumes all resources on the machine, at which
point it stops responding to the parent process's health check, the
parent restarts it, and the entire cache is destroyed.

We have a pretty large working set, so we're still experimenting with
Varnish configurations that can handle our extremely random IO load.
One thing I've found is that Varnish will try its hardest to serve from
cache, which I can't argue against in theory, but in practice it would
be nice if it sometimes just gave up and fell back on the backends to
keep itself healthy during periods of extreme IO wait.  I'd rather
Varnish keep its cache and let the backends pick up some slack than
have it get killed by its parent watcher PID.  Once persistent storage
is available, frequent restarts may not be such a big deal.

So what I'm thinking is to work around this for now with HAproxy: have
it health-check Varnish or track connections, and under certain
conditions send traffic straight to the backends (a rough sketch of the
HAproxy side follows the VCL below).  This would give Varnish a chance
to serve its existing requests, potentially let IO wait drop, alert us
(we can add more Varnish instances), and then pass health checks again.

HAproxy polls varnishd; I've already added a simple health check in vcl_recv:
        # health-check URL: short-circuit to vcl_error with a private status code
        if (req.url ~ "^/hc/vnsh\.health") {
          error 704 "";
        }
and vcl_error:
sub vcl_error {
  if (obj.status == 704) {
    # turn the internal 704 into a 204 No Content for HAproxy
    set obj.status = 204;
    set obj.response = "No Content";
    deliver;
  }
}
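
For what it's worth, the HAproxy side of that polling could look roughly
like the sketch below.  The backend name, addresses, ports, and check
timings are placeholders rather than our actual config; HAproxy's
httpchk treats any 2xx/3xx response as healthy by default, so the 204
above satisfies it.

backend varnish_pool
    mode http
    # poll the VCL health endpoint above every 2s; 3 failures marks varnish down
    option httpchk GET /hc/vnsh.health
    server varnish1 10.0.0.10:6081 check inter 2000 fall 3 rise 2
    # while varnish is marked down, requests go straight to the origin
    server origin1  10.0.0.20:80 backup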

...are any of the stats available inside vcl_error?  Something like the
number of running threads, or anything else that would give the health
check some introspection?

The alternative is for me to have HAproxy track connection counts that
could correlate to max_threads (see the sketch below).
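
For that alternative, one rough option is a per-server maxconn, so
HAproxy queues excess requests instead of piling more work onto a
struggling varnishd.  Again just a sketch; the 500 is an assumed number
that would have to be derived from the actual thread settings
(thread_pools * thread_pool_max):

    # cap in-flight connections to varnish near its thread ceiling;
    # anything beyond this waits in HAproxy's queue, not in varnish
    server varnish1 10.0.0.10:6081 check maxconn 500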

Any feedback?  Have I missed any settings in Varnish that could
accomplish the same thing?


