Varnish crashing when system starts to swap

Thu Apr 17 15:03:39 CEST 2008

We have an environment that serves lots of small image files generated dynamically by the backend. The total dataset is about 2TB, but we're not looking to cache all of it, just to ease the load on the backend machines. We currently get about 2000-2500 hits/s in total, and we are running 3 Apache servers with mod_caucho as frontends.

We have installed Varnish on the same servers as the Apache frontends and configured each instance to use the local Apache as its backend. The machines are dual Opterons with dual cores, so 4 cores per server, with 16GB of RAM, and we're running RHEL 4.2.

This is our varnish setup:

user                 varnish (201)
group                varnish (201)
default_ttl          3600 [seconds]
thread_pools         1 [pools]
thread_pool_max      1000 [threads]
thread_pool_min      128 [threads]
thread_pool_timeout  60 [seconds]
overflow_max         100 [%]
rush_exponent        3 [requests per request]
sess_workspace       8192 [bytes]
obj_workspace        8192 [bytes]
sess_timeout         5 [seconds]
pipe_timeout         60 [seconds]
send_timeout         600 [seconds]
auto_restart         on [bool]
fetch_chunksize      128 [kilobytes]
vcl_trace            off [bool]
listen_address       ":80"
listen_depth         1024 [connections]
srcaddr_hash         1049 [buckets]
srcaddr_ttl          30 [seconds]
backend_http11       off [bool]
client_http11        off [bool]
cli_timeout          5 [seconds]
ping_interval        3 [seconds]
lru_interval         3600 [seconds]
cc_command           exec cc -fpic -shared -Wl,-x -o %o %s
max_restarts         4 [restarts]
max_esi_includes     5 [restarts]
cache_vbe_conns      off [bool]
cli_buffer           8192 [bytes]
diag_bitmap          0x0 [bitmap]

This is our startup command:

/opt/varnish/sbin/varnishd -a :80 -p lru_interval 3600 -f /opt/varnish/conf/default.vcl -T 127.0.0.1:6082 -t 3600 -w 128,1000,60 -u varnish -g varnish -s file,/srv/varnish/varnish_storage.bin,30G -P /var/run/varnish.pid

Varnish looks fine until it has served about 1.5 million requests. Then we see kswapd0 and kswapd1 start working, the load average rises to about 200, and the machine becomes totally unresponsive. top shows a lot of CPU time being spent in I/O wait, and the Varnish child process sometimes restarts. In the best case the process restarts and the server starts behaving again within 5 minutes, but sometimes Varnish dies completely. One thing we have noticed is that the memory reserved by Varnish keeps rising, and when it crashes it is usually around 14G.
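
In case it helps anyone reproduce this, here is a minimal sketch of how the free memory and swap usage could be logged while the request count climbs. It assumes a Linux /proc/meminfo with the usual "Key: value kB" lines. The function names parse_meminfo and swap_used_kb are made up for this example and are not part of Varnish.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Assumption: each relevant line looks like 'Key:   12345 kB'.
    Lines that do not match this shape are skipped.
    """
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, derived from SwapTotal and SwapFree."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]
```

On the affected machine one would read /proc/meminfo every few seconds, pass its contents to parse_meminfo, and log MemFree together with swap_used_kb next to the request counter, to see at which point swapping starts.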

The Varnish storage file is on the same physical disk as the system and the swap. Could that be the problem? Should Varnish really allocate so much memory that the system starts to swap to disk?
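
To put numbers on the question, here is a rough sketch of the arithmetic, using the 30G from our -s option and the 16GB of RAM per server. It assumes the file storage is memory-mapped by Varnish, so that a full cache cannot be entirely resident at once. It also treats both sizes as GiB and ignores what the OS and Apache need. The helper resident_fraction is hypothetical, written only for this illustration.

```python
GIB = 1024 ** 3


def resident_fraction(storage_bytes, ram_bytes):
    """Upper bound on the fraction of the storage file that fits in RAM.

    Assumption: nothing else on the machine uses memory, so the real
    fraction is lower than this.
    """
    if storage_bytes <= 0:
        raise ValueError("storage size must be positive")
    return min(1.0, ram_bytes / storage_bytes)


storage = 30 * GIB  # from "-s file,...,30G"
ram = 16 * GIB      # physical memory per server

print(f"at most {resident_fraction(storage, ram):.0%} of the storage file fits in RAM")
```

If that assumption holds, a storage file sized below the physical memory would be the first thing to try, but we have not tested this.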

Any suggestions or comments are welcome.

Regards
Calle Korjus