From geoff at uplex.de Fri Sep 1 13:19:09 2023
From: geoff at uplex.de (Geoff Simmons)
Date: Fri, 1 Sep 2023 15:19:09 +0200
Subject: Caching Modified URLs by Varnish instead of the original requested URL
In-Reply-To:
References:
Message-ID: <1faee20c-6649-57e6-2953-fac9063fc30f@uplex.de>

On 8/31/23 22:06, Guillaume Quintard wrote:
>
> Other options include vmod_querystring (Dridi might possibly be of
> assistance on this topic) and vmod_urlplus (Varnish Enterprise), and the
> last, and possibly most promising one, vmod_re2

I would suggest going with vmod_re for a task like this:

https://gitlab.com/uplex/varnish/libvmod-re

Because:

- VMOD re is based on Varnish's internal interface for regex-ing, so it
  uses the pcre2 library that's always installed with Varnish. For VMOD
  re2 you also have to install the re2 library.

- pcre2 regex matching is generally faster than re2 matching. The point
  of re2 regexen is that matches won't go into catastrophic backtracking
  on pathological cases.

- The real strength of re2 lies in the set interface, which matches
  multiple regexen "simultaneously", and then can tell you which one
  matched. The matching regex can be associated with a backend, a
  subroutine, or a number of other VCL objects; and there are a variety
  of other bells and whistles. VMOD re is just about subexpression
  capture, which is the job to be done here.

For either VMOD re or re2, it's a good idea to initialize the regex in
vcl_init, so that it's pre-compiled at runtime. The versions of the
match function that take a regex as a parameter compile the regex on
every invocation.

So with VMOD re it would look like this:

import re;
import std;

sub vcl_init {
	# compiled once, when the VCL is loaded
	new query_pattern = re.regex(".*(q=)(.*?)(\&|$).*");
}

sub vcl_recv {
	if (query_pattern.match(req.url)) {
		set req.http.hash-url =
			query_pattern.backref(1) +
			std.tolower(query_pattern.backref(2)) +
			query_pattern.backref(3);
	}
}


HTH,
Geoff
-- 
** * * UPLEX - Nils Goroll Systemoptimierung

Scheffelstraße 32
22301 Hamburg

Tel +49 40 2880 5731
Mob +49 176 636 90917
Fax +49 40 42949753

http://uplex.de

From geoff at uplex.de Fri Sep 1 14:45:04 2023
From: geoff at uplex.de (Geoff Simmons)
Date: Fri, 1 Sep 2023 16:45:04 +0200
Subject: Caching Modified URLs by Varnish instead of the original requested URL
In-Reply-To: <1faee20c-6649-57e6-2953-fac9063fc30f@uplex.de>
References: <1faee20c-6649-57e6-2953-fac9063fc30f@uplex.de>
Message-ID:

Sorry, I get nerdy about this subject and can't help following up.

I said:

> - pcre2 regex matching is generally faster than re2 matching. The point
>   of re2 regexen is that matches won't go into catastrophic backtracking
>   on pathological cases.

Should have mentioned that pcre2 is even better at subexpression
capture, which is what the OP's question is all about.

> sub vcl_init {
>     new query_pattern = re.regex(".*(q=)(.*?)(\&|$).*");
> }

OMG no. Like this please:

	new query_pattern = re.regex("\b(q=)(.*?)(?:\&|$)");

I have sent an example of a pcre regex with .* (two of them!) to a
public mailing list, for which I will burn in hell.

To match a name-value pair in a cookie, use a regex with \b for 'word
boundary' in front of the name. That way it will match either at the
beginning of the Cookie value, or following an ampersand.

And ?: tells pcre not to bother capturing the last expression in
parentheses (they're just for grouping).

Avoid .* in pcre regexen if you possibly can.
You can, almost always.

With .* at the beginning, the pcre matcher searches all the way to the
end of the string, and then backtracks all the way back, looking for the
first letter to match. In this case 'q', and it will stop and search and
backtrack at any other 'q' that it may find while working backwards.

pcre2 fortunately has an optimization that ignores a trailing .* if it
has found a match up until there, so that it doesn't busily match the
dot against every character left in the string. So this time .* does no
harm, but it's superfluous, and violates the golden rule of pcre: avoid
.* if at all possible.

Incidentally, this is an area where re2 does have an advantage over
pcre2. The efficiency of pcre2 matching depends crucially on how you
write the regex, because details like \b instead of .* give it hints for
pruning the search. While re2 matching usually isn't as fast as pcre2
matching against well-written patterns, re2 doesn't depend so much on
that sort of thing.


OK I can chill now,
Geoff

From guillaume.quintard at gmail.com Fri Sep 1 14:58:03 2023
From: guillaume.quintard at gmail.com (Guillaume Quintard)
Date: Fri, 1 Sep 2023 07:58:03 -0700
Subject: Caching Modified URLs by Varnish instead of the original requested URL
In-Reply-To:
References: <1faee20c-6649-57e6-2953-fac9063fc30f@uplex.de>
Message-ID:

Thank you so much Geoff for that very useful knowledge dump!

Good call out on the .*, I realized I carried them over too, when I
copy-pasted the regex from the pure vcl example (where it's needed) to the
vmod one.

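To make the difference between the two patterns concrete, here is a minimal sketch that runs both against the example URL from this thread. It uses Python's `re` module purely as a stand-in for pcre2, which is an assumption made to get a runnable example; the names `ORIGINAL`, `REVISED` and `hash_url` are hypothetical and not part of any VMOD.

```python
import re
from typing import Optional

# Python's `re` stands in for pcre2 here (an assumption for the sake of a
# runnable example); \b, the lazy .*? and (?:...) behave the same way in both.
ORIGINAL = re.compile(r".*(q=)(.*?)(&|$).*")   # first version, with .* on both ends
REVISED = re.compile(r"\b(q=)(.*?)(?:&|$)")    # corrected version from the follow-up


def hash_url(url: str) -> Optional[str]:
    """Return "q=" plus the lowercased q value, or None if there is no q parameter."""
    m = REVISED.search(url)  # unanchored search: the pattern may match anywhere
    if m is None:
        return None
    return m.group(1) + m.group(2).lower()


url = "/search/ims?q=CRICKET bat&country_code=IN"
print(ORIGINAL.search(url).groups())  # three captures: name, value, delimiter
print(REVISED.search(url).groups())   # two captures: name and value only
print(hash_url(url))
```

Note that the revised pattern has only two capture groups, because `(?:...)` does not capture. Code written against the first pattern that also reads a third backreference would therefore need adjusting when switching patterns.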
And so, just to be clear about it:
- vmod-re is based on libpcre2
- vmod-re2 is based on libre2

Correct? I see no way I'm going to misremember that, at all :-D

-- 
Guillaume Quintard

From geoff at uplex.de Fri Sep 1 15:30:28 2023
From: geoff at uplex.de (Geoff Simmons)
Date: Fri, 1 Sep 2023 17:30:28 +0200
Subject: Caching Modified URLs by Varnish instead of the original requested URL
In-Reply-To:
References: <1faee20c-6649-57e6-2953-fac9063fc30f@uplex.de>
Message-ID: <0991ccf7-53d1-2a3c-96da-2ce8fdfd0b79@uplex.de>

On 9/1/23 16:58, Guillaume Quintard wrote:
>
> - vmod-re is based on libpcre2
> - vmod-re2 is based on libre2
> Correct?

Correct.

It used to be the case that libvmod-re only used the internal VRE
interface. So the VMOD was using whatever Varnish used, so to speak,
which happened to be pcre. But since the transition to pcre2, we have
some direct calls into libpcre2.

> I see no way I'm going to misremember that, at all :-D

Yeah. Back when it was pcre vs re2, that wasn't so hard. But now, oh well.


Best,
Geoff

From uday.polu at indiamart.com Sun Sep 3 15:57:08 2023
From: uday.polu at indiamart.com (Uday Kumar)
Date: Sun, 3 Sep 2023 21:27:08 +0530
Subject: Caching Modified URLs by Varnish instead of the original requested URL
In-Reply-To:
References:
Message-ID:

Thanks Guillaume, I'll look into it.

Thanks & Regards
Uday Kumar


On Fri, Sep 1, 2023 at 1:36 AM Guillaume Quintard <
guillaume.quintard at gmail.com> wrote:

> I'm pretty sure it's lowercasing "\2" correctly. The problem is
> that you want to lowercase the *value* referenced by "\2" instead.
>
> On this, I don't think you have a choice, you need to make that captured
> group its own string, lowercase it, and only then concatenate it. Something
> like:
>
> set req.http.hash-url =
>     regsuball(req.http.hash-url, ".*(q=)(.*?)(\&|$).*", "\1") +
>     std.tolower(regsuball(req.http.hash-url, ".*(q=)(.*?)(\&|$).*", "\2")) +
>     regsuball(req.http.hash-url, ".*(q=)(.*?)(\&|$).*", "\3");
>
> It's disgusting, but eh, we started with regex, so...
>
> Other options include vmod_querystring (Dridi might possibly be of
> assistance on this topic) and vmod_urlplus (Varnish Enterprise), and the
> last, and possibly most promising one, vmod_re2 which would allow you to do
> something like
>
> if (myset.match(".*(q=)(.*?)(\&|$).*", "\1")) {
>     set req.http.hash-url = myset.matched(1) + std.tolower(myset.matched(2))
>         + myset.matched(3);
> }
>
> --
> Guillaume Quintard
>
>
> On Thu, Aug 31, 2023 at 1:03 AM Uday Kumar wrote:
>
>> Hi Guillaume,
>>
>> In the process of modifying the query string in VCL code, we have a
>> requirement of lowercasing the value of a specific parameter, instead of
>> the whole query string.
>>
>> Example Request URL:
>> /search/ims?q=CRICKET bat&country_code=IN
>>
>> Requirement:
>> We have to modify the request URL by lowercasing the value of only the q
>> parameter, i.e. /search/ims?q=cricket bat&country_code=IN
>>
>> For that, we have found below regex:
>> set req.http.hash-url = regsuball(req.http.hash-url, "(q=)(.*?)(\&|$)",
>> "\1"+std.tolower("\2")+"\3");
>>
>> ISSUE:
>> std.tolower("\2") in the above statement is not lowercasing the
>> string that's captured, but if I test it using std.tolower("SAMPLE"), it's
>> lowercasing as expected.
>>
>> 1. May I know why it's not lowercasing if std.tolower("\2") is used?
>> 2. Also, please provide possible optimal solutions for the same. (using
>> regex)
>>
>> Thanks & Regards
>> Uday Kumar
>>
>>
>> On Wed, Aug 23, 2023 at 12:01 PM Uday Kumar wrote:
>>
>>> Hi Guillaume,
>>>
>>> *use includes and function calls*
>>> This is great, thank you so much for your help!
>>>
>>> Thanks & Regards
>>> Uday Kumar
>>>
>>>
>>> On Wed, Aug 23, 2023 at 1:32 AM Guillaume Quintard <
>>> guillaume.quintard at gmail.com> wrote:
>>>
>>>> Hi Uday,
>>>>
>>>> I'm not exactly sure how to read those diagrams, so I apologize if I'm
>>>> missing the mark or if I'm too broad here.
>>>>
>>>> There are a few points I'd like to attract your attention to. The first
>>>> one is that varnish doesn't cache the request or the URL. The cache is
>>>> essentially a big hashmap/dictionary/database, in which you store the
>>>> response. The request/url is the key for it, so you need to have it in its
>>>> "final" form before you do anything.
>>>>
>>>> From what I read, you are not against it, and you just want to sanitize
>>>> the URL in vcl_recv, but you don't like the idea of making the main file
>>>> too unwieldy. If I got that right, then I have a nice answer for you: use
>>>> includes and function calls.
>>>>
>>>> As an example:
>>>>
>>>> # cat /etc/varnish/url.vcl
>>>> sub sanitize_url {
>>>>     # do whatever modifications you need here
>>>> }
>>>>
>>>> # cat /etc/varnish/default.vcl
>>>> include "./url.vcl";
>>>>
>>>> sub vcl_recv {
>>>>     call sanitize_url;
>>>> }
>>>>
>>>>
>>>> That should get you going.
>>>>
>>>> Hopefully I didn't miss the mark too much here, let me know if I did.
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>>
>>>> On Tue, Aug 22, 2023 at 3:45 AM Uday Kumar wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>> For our spring boot application, we are using Varnish Caching in a
>>>>> production environment.
>>>>>
>>>>> Requirement: [To utilize cache effectively]
>>>>>
>>>>> Modify the URL (Removal of unnecessary parameters) while caching the
>>>>> user request, so that the modified URL can be cached by varnish which
>>>>> helps improve cache HITS for similar URLs.
>>>>>
>>>>> For Example:
>>>>>
>>>>> Let's consider the below Request URL
>>>>>
>>>>> Url at time t, 1.
>>>>> samplehost.com/search/ims?q=bags&source=android&options.start=0
>>>>>
>>>>> Our Requirement:
>>>>>
>>>>> To make varnish consider URLs with options.start=0 and without
>>>>> options.start parameter as EQUIVALENT, such that a single cached
>>>>> response(Single Key) can be utilized in both cases.
>>>>>
>>>>> *1st URL after modification:*
>>>>>
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>> *Cached URL at Varnish:*
>>>>>
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>> Now, Url at time t+1, 2.
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>> At present, varnish considers the above URL as different from 1st URL
>>>>> and uses a different key while caching the 2nd URL [So, it will be a
>>>>> miss]
>>>>>
>>>>> *So, URL after Modification:*
>>>>>
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>> Now, 2nd URL will be a HIT at varnish, effectively utilizing the
>>>>> cache.
>>>>>
>>>>> NOTE:
>>>>>
>>>>> We aim to execute this URL Modification without implementing the
>>>>> logic directly within the default.VCL file. Our intention is to
>>>>> maintain a clean and manageable codebase in the VCL.
>>>>>
>>>>> To address this requirement effectively, we have explored two
>>>>> potential Approaches:
>>>>>
>>>>> Approach-1:
>>>>>
>>>>> Approach-2:
>>>>>
>>>>> 1. Please go through the approaches mentioned above and let me know
>>>>> the effective solution.
>>>>>
>>>>> 2. Regarding Approach-2
>>>>>
>>>>> At Step 2:
>>>>>
>>>>> May I know if there is any way to access and execute a custom
>>>>> subroutine from another VCL, for modifying the Request URL? if yes,
>>>>> pls help with details.
>>>>>
>>>>> At Step 3:
>>>>>
>>>>> Tomcat Backend should receive the Original Request URL instead of the
>>>>> Modified URL.
>>>>>
>>>>> 3. Please let us know if there is any better approach that can be
>>>>> implemented.
>>>>>
>>>>> Thanks & Regards
>>>>> Uday Kumar
>>>>> _______________________________________________
>>>>> varnish-misc mailing list
>>>>> varnish-misc at varnish-cache.org
>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>>>>

From dridi at varni.sh Mon Sep 11 11:37:57 2023
From: dridi at varni.sh (Dridi Boukelmoune)
Date: Mon, 11 Sep 2023 11:37:57 +0000
Subject: out of workspace (bo)
In-Reply-To:
References:
Message-ID:

Hi Kevyn,

On Wed, Aug 30, 2023 at 11:37 AM Kevyn Fyleyssant wrote:
>
> Here is my VCL :
> https://pastebin.com/TpN8r0Um
>
> And my varnishd command :
> /usr/sbin/varnishd -a :8181 -p feature=+http2 -p http_resp_hdr_len=200k -p http_resp_size=2M -p http_req_hdr_len=200k -p workspace_backend=256k -p workspace_client=256k -p http_max_hdr=256 -f /etc/varnish/default.vcl -s malloc,4G

You already have a fairly large workspace, but to solve this you will
need to further increase workspace_backend.

You configured Varnish to accept up to 200kB per header field, and up
to 2MB per response headers (all fields combined). The worst case
scenario wouldn't fit in 256kB.

The file size should have no significant effect on workspace
consumption, so this recurring overflow is probably caused by beresp
headers alone.

You should first make sure you have a good understanding of the origin
server and why it may produce such large response headers. If this is
legitimate, there is no way around increasing workspace_backend. Your
VCL isn't doing workspace-intensive operations, so a little over 2MB
(for example 2MB+64kB aka 2112kB) should be enough.

Dridi
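
The sizing argument above can be checked with a little arithmetic. The sketch below assumes that varnishd size suffixes are powers of 1024 (k = KiB, M = MiB) and takes the 64 kB of headroom from the example in this message rather than from any measurement; `to_kib` is a hypothetical helper, not a Varnish tool.

```python
def to_kib(size: str) -> int:
    """Convert a varnishd-style size such as "200k" or "2M" to KiB.

    Assumes binary suffixes (k = 1024 bytes, M = 1024 k).
    """
    factor = {"k": 1, "m": 1024}[size[-1].lower()]
    return int(size[:-1]) * factor


# Values taken from the varnishd command line quoted in this message.
http_resp_size = to_kib("2M")        # all response header fields combined
workspace_backend = to_kib("256k")   # current backend workspace

# Headroom for everything else that lives in the workspace; 64 kB is the
# illustrative figure from this thread, not a measured requirement.
headroom = 64
suggested = http_resp_size + headroom

print(f"worst-case response headers: {http_resp_size}k")
print(f"current workspace_backend:   {workspace_backend}k")
print(f"worst case fits today:       {workspace_backend >= http_resp_size}")
print(f"suggested workspace_backend: {suggested}k")
```

The result would be passed the same way as in the original command line, for example `-p workspace_backend=2112k`, once the large response headers have been confirmed as legitimate.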