Best practice for not caching content requested by crawlers

Lasse Karstensen lasse.karstensen at gmail.com
Fri Jul 20 11:04:09 CEST 2012


Damon Snyder:
> We have reason to believe that we have some amount of cache pollution from
> crawlers. We believe this to be the case after we attempted to determine
> the size of our hot data set.
[..]
> So my question is, what is the best practice for doing this? If a request
> comes from the crawler and it's in the cache, I'm fine serving it from the
> cache. However, if the request comes from the crawler and it's not in the
> cache, I don't want Varnish to cache it.

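I'm not sure whether this is a good idea, but you can do it in VCL.
Cache hits never reach vcl_miss, so crawlers are still served from the
cache; on a miss, pass fetches the object from the backend without
storing it: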

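sub vcl_miss {
    # Only runs on a cache miss; hits are served from cache as usual.
    # (?i) makes the User-Agent match case-insensitive.
    if (req.http.User-Agent ~ "(?i)yandex|msnbot") {
        # Fetch from the backend, but do not insert the object into the cache.
        return (pass);
    }
}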

You can probably use OpenDDR/DeviceAtlas/$favorite_detectionengine to get
better accuracy than this regex.
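
If you do swap the detection out later, it may help to keep it separate
from the caching decision. The following is only a sketch in Varnish 3
syntax: the X-Crawler header name is made up for illustration, and the
regex is the same rough one as in the vcl_miss snippet above. vcl_recv
tags the request, and vcl_miss only looks at the tag:

```vcl
sub vcl_recv {
    # X-Crawler is a hypothetical marker header, not a Varnish built-in.
    # Drop any client-supplied copy first so it cannot be spoofed.
    unset req.http.X-Crawler;
    if (req.http.User-Agent ~ "(?i)yandex|msnbot") {
        set req.http.X-Crawler = "1";
    }
}

sub vcl_miss {
    # Crawler misses go to the backend without populating the cache.
    # Note that the marker header is forwarded to the backend too.
    if (req.http.X-Crawler) {
        return (pass);
    }
}
```

With this split, only the vcl_recv part has to change if the regex is
replaced by a detection engine.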

-- 
Lasse Karstensen
Varnish Software AS
http://www.varnish-software.com/


