Re: No-Vary-Search from gs-lists-ietf-http-wg@gluelogic.com on 2024-06-14 (ietf-http-wg@w3.org from April to June 2024)

From: <gs-lists-ietf-http-wg@gluelogic.com>
Date: Thu, 13 Jun 2024 21:54:02 -0400
To: Jeremy Roman <jbroman@chromium.org>
Cc: ietf-http-wg@w3.org
Message-ID: <ZmuiurGTBkT82R0Q@xps13>
On Fri, Jun 14, 2024 at 11:09:17AM +1000, Mark Nottingham wrote:
> From a chair's perspective -- the important thing to establish at this point is whether the WG thinks there's a problem to solve here, and whether this is _approximately_ the right starting point. Document formatting and details of the specification aren't as relevant at this early stage.

As documented in the original message, numerous CDNs provide a feature
to treat query-string as no-vary.
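
As a rough sketch of what such a feature does (not any particular CDN's
implementation), a cache can normalize the URL before using it as the
cache key, either dropping the whole query string or only selected
parameters. The names cache_key, ignore_all, and ignore_params below are
hypothetical and chosen only for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url, ignore_all=False, ignore_params=()):
    """Build a cache key from url, leaving out query parameters
    that should not create separate cache entries.

    ignore_all and ignore_params are hypothetical knobs for this sketch.
    """
    parts = urlsplit(url)
    if ignore_all:
        # Treat the entire query string as no-vary.
        query = ""
    else:
        ignored = set(ignore_params)
        # Keep only the parameters that still distinguish responses,
        # preserving their original order.
        kept = [
            (name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in ignored
        ]
        query = urlencode(kept)
    # The fragment is never sent to the server, so it is dropped from the key.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
```

With this sketch, a request for /a?id=7&utm_source=x and a request for
/a?id=7 map to the same key when "utm_source" is listed in ignore_params.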

I am commenting from experience at a previous $job where this feature
was desired, and at least at the time, not supported by Squid.

Yes, I believe this is approximately the right starting point, though my
preference would be "Vary-Search" or "Vary-Query".  ... naming it
Vary-Query might dovetail nicely with QUERY method being discussed in
https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-safe-method-w-body

Cheers, Glenn

> > On 14 Jun 2024, at 9:56 AM, gs-lists-ietf-http-wg@gluelogic.com wrote:
> > 
> > On Thu, Jun 13, 2024 at 05:50:05PM -0400, Jeremy Roman wrote:
> >> This is a little hard to read in the plaintext rendering, I'll admit. It's
> >> marginally better in the HTML rendering (though even then, the stylesheet
> >> makes it only slightly distinguishable).
> > 
> > Yes, I read the plaintext draft.
> > 
> >> WHATWG/W3C specs tend to be read in the HTML rendering, where these links
> >> would have been blue and underlined. If quotation marks would help with
> >> readability in addition to the section cross-link, that's certainly a
> >> typographical change I can make.
> > 
> > I believe that IETF desires RFCs to be readable in more than HTML,
> > but I am not the person to provide guidance on that.  Would the chair
> > advise?
> > 
> >> There are many references in the doc to WHATWG specs rather than IETF
> >>> specifications for URLs.  Is this intentional?
> >>> 
> >> 
> >> The embodiment we (Google Chrome) have been working on is in a web browser
> >> which implements the WHATWG URL specification, and we want this to be
> >> useful in web browsers (and HTTP servers which are interacting with web
> >> browsers), so being compatible with the way browsers deal with query
> >> strings (namely, the application/x-www-form-urlencoded parser, which is
> >> used for, for instance, the URLSearchParams object exposed to JavaScript
> >> code).
> >> 
> >> I'm less familiar with the IETF specifications that other uses of HTTP use,
> >> though for instance RFC 3986 (Uniform Resource Identifier: Generic Syntax)
> >> doesn't say much beyond "query components are often used to carry
> >> identifying information in the form of 'key=value' pairs", and RFC 9110
> >> (HTTP Semantics) §4.1 simply specifies that this is an optional URL
> >> component in the "http" and "https" URI schemes. This doesn't provide
> >> enough for the purposes of this document, even if the two are otherwise
> >> compatible (which I'm not sure they are).
> > 
> > If WHATWG spec provides a stricter query string format variant,
> > and this RFC draft requires use of that stricter or different
> > formatting variant, then the differences should be highlighted.
> > 
> > e.g.
> > "No-Vary-Search uses a variant of query string format defined in WHATWG
> > (reference) which is stricter than the varieties of query string syntax
> > allowed in RFC (reference)."
> > 
> > A reference to the WHATWG ABNF for the stricter format variant of the
> > query string is also recommended.
> > 
> >>> The document does not mention the implication of the union of variants
> >>> between Vary and No-Vary-Search response headers.  A CDN or browser
> >>> might have to limit the number of variants cached.
> >>> 
> >> 
> >> At present the limitations on the cache are not present here, just logic to
> >> determine whether a response is suitable to be used. RFC 9111 (HTTP
> >> Caching) §4.1 deals with the analogous question for existing variants to a
> >> limited extent:
> >> 
> >> """
> >> 
> >> If multiple stored responses match, the cache will need to choose one to
> >> use. When a nominated request header field has a known mechanism for
> >> ranking preference (e.g., qvalues on Accept and similar request header
> >> fields), that mechanism MAY be used to choose a preferred response. If such
> >> a mechanism is not available, or leads to equally preferred responses, the
> >> most recent response (as determined by the Date header field) is chosen, as
> >> per Section 4
> >> <https://www.rfc-editor.org/rfc/rfc9111#constructing.responses.from.caches>.
> >> <https://www.rfc-editor.org/rfc/rfc9111#section-4.1-6>
> >> 
> >> Some resources mistakenly omit the Vary header field from their default
> >> response (i.e., the one sent when the request does not express any
> >> preferences), with the effect of choosing it for subsequent requests to
> >> that resource even when more preferable responses are available. When a
> >> cache has multiple stored responses for a target URI and one or more omits
> >> the Vary header field, the cache SHOULD choose the most recent (see Section
> >> 4.2.3 <https://www.rfc-editor.org/rfc/rfc9111#age.calculations>) stored
> >> response with a valid Vary field value.
> >> 
> >> """
> >> I anticipated that any discussion of this issue would make most sense as
> >> part of any (not yet present) text discussing how this supplements RFC
> >> 9111. Are there novel considerations here about CDN and browser limits that
> >> merit specification?
> > 
> > A reference to that might be sufficient to acknowledge that this RFC
> > extends caching concerns for clients and caching intermediaries.
> > 
> >> Overall, this document uses idioms I am less familiar with seeing in RFCs.
> >>> Maybe these idioms are more typical in WHATWG documents, but the
> >>> pseudo-code is different than what I typically see in RFCs.
> >>> Perhaps I am not familiar with the pseudo-markup variant, but it does
> >>> not look like markdown to me.
> >>> 
> >> 
> >> This is actually the RFC <em> element
> >> <https://authors.ietf.org/en/rfcxml-vocabulary#em> (semantic emphasis) as
> >> submitted, which is rendered in the plaintext rendering as leading and
> >> trailing underscores, but in the HTML and PDF renderings as italics. It's
> >> used here to set apart variable names from other text, because italics is
> >> the typographical convention for doing so in WHATWG/W3C algorithm
> >> pseudo-code. Not setting it apart at all might make it harder to skim the
> >> pseudocode (e.g., to clearly tell which values are used where), but it's
> >> certainly a bit noisy.
> >> 
> >> e.g. To _parse a URL search variance_ given _value_:
> >>> See also my confusion above reading
> >>>  "The obtain a URL search variance algorithm"
> >>> which could have been
> >>>  "The _obtain a URL search variance_ algorithm"
> >>> using the _every-other-word_ idiom from Section 4,
> >>> though _I_ _am_ _personally_ _not_ _a_ _fan_ _of_ _this_ formatting.
> >>> My preference would suggest using a real language (any one), instead of
> >>> pseudo-code, to create a reference implementation, if that is the goal.
> >>> Add comments to describe required behavior to clarify the reference
> >>> implementation.
> >>> 
> >> 
> >> It's a pseudo-code that is typical in WHATWG/W3C documents (
> >> https://infra.spec.whatwg.org/#algorithms). Those bodies tend to prefer
> >> explicit pseudo-code algorithms, as I understand it, because it forces the
> >> algorithm to be unambiguous about some precise aspects of the required
> >> behavior that are easy to gloss over in natural language (though of course,
> >> it's not the only way of doing so).
> >> 
> >> While the particular flavour of pseudo-code may not be familiar to this
> >> group, I have seen similar pseudo-code in documents such as RFC 8941
> >> (Structured Field Values for HTTP) which are relevant to this venue.
> > 
> > Again, I read the plaintext draft.  Let's wait for guidance from others
> > about document formatting and instead discuss content.
> > 
> >> The No-Vary-Search syntax with "except" reads to me as a double-negative:
> >>>  No-Vary-Search: params, except=("x")
> >>> 
> >>> Not knowing how far along this spec document is, was naming the header
> >>> "Vary-Search" considered?  With "Vary-Search", inverting the logic would
> >>> suggest "params" to default to all params varying (same as not
> >>> specifying Vary-Search), and "except" could be "no-vary"
> >>>  Vary-Search: params, no-vary=("x")
> >>> to indicate no-vary for "x", or
> >>>  Vary-Search: params, no-vary
> >>> to indicate all search params are no-vary (wildcard).
> >>> 
> >> 
> >> Yes, it was considered. This choice was made because it means that an
> >> absent header, or empty header value, should reflect that existing HTTP
> >> semantics are used.
> > 
> > The same applies to the theoretical "Vary-Search" I described.
> > 
> >> Since the default behavior is to vary on *all* parameters,
> >> their order, and in fact even the way those parameters are encoded in the
> >> URL, that means that the behavior of the header is naturally opposite to
> >> Vary (which starts from a default behavior of varying on *no* header
> >> fields).
> > 
> > That is more a definition than a reason.  OK, I get that the choice was
> > made, but why is it better than non-inverted logic?
> > 
> >> This does lead to a double negative, unfortunately, with the use of the
> >> term "vary". Conceptualizing it as "is the same resource" instead of as
> >> "not varying" addresses that (i.e., "This resource *is the same resource* as if
> >> it had been requested with other query parameters, except if x differs."
> >> has no double negative). Drawing the clear connection with Vary (which is
> >> well-known), though, seemed worth the double negative.
> > 
> > Respectfully, I disagree.
> > 
> > This header needs to be processed by caching intermediaries and client
> > for caching, just like "Vary" needs to be processed for caching
> > variants.  Caching variants will have to process and *invert* the logic
> > of "No-Vary-Search" to produce the "Vary"-set of varying parameters.
> > 
> > Put another way, storing the variant requires construction of a cache
> > key for the variant, which is some sort of encoding of the varying
> > parameters.  Since the set of varying parameters needs to be collected,
> > it makes more sense to me to have Vary and Vary-Search, rather than
> > Vary and invert-the-logic-of No-Vary-Search.  However, since
> > No-Vary-Search supports both positive and a negative ("except") ways to
> > define the varying search parameters or non-varying search parameters,
> > you might argue the opposite, depending on which version (positive or
> > negative) of No-Vary-Search you think may be used more frequently.
> > 
> >> 7.  Privacy Considerations
> >>> 
> >>> The ability to cache variants based on search parameters could possibly
> >>> compromise privacy due to fingerprinting and the ability to detect cache
> >>> hit versus cache miss even with coarse timing resolution.
> >>> 
> >> 
> >> Can you elaborate? If anything, I would have expected that disregarding
> >> certain query parameters would mean that someone probing the cache can
> >> learn *less* about which values of that query parameter have been seen by
> >> the cache previously.
> >> 
> >> In the context of web browsers, this sort of attack is also mitigated by HTTP
> >> cache partitioning
> >> <https://developer.chrome.com/blog/http-cache-partitioning> which is now
> >> specified (incompletely) by the WHATWG Fetch standard
> >> <https://fetch.spec.whatwg.org/#http-cache-partitions>.
> > 
> > If I have 10 different sets of 16 URLs each, can I use caching to create
> > a tracking identifier if I assign the client one URL from each of those
> > 10 sets and create a 10-hexdigit identifier?  Fetching all the URLs and
> > detecting which ones are cached might reveal the identifier?  When I
> > assign the URLs, I can assign a non-vary parameter 'tracker=1'.  When
> > a different page on a different site is requesting all the URLs to
> > detect which are cached, the client can add non-vary parameter
> > 'tracker=0'.  The server responding can assign HTTP caching headers
> > based on the response.  Anyway, this is not my area of expertise.
> > Yes, CORS headers factor in, but they take effort to set up properly, and
> > "properly" might be malicious if those headers are from malicious sites.
> > 
> > https://coveryourtracks.eff.org/
> > 
> > Some of your colleagues at Google would be able to better explain all the
> > creative ways Google is fingerprinting clients in addition to cookies.
> > 
> > Cheers, Glenn
> > 
> >> Cheers, Glenn
> >>> 
> >>> 
> >>> On Wed, Jun 12, 2024 at 01:23:23PM -0400, Jeremy Roman wrote:
> >>>> In the interest of continuing discussion on this list, the WICG draft has
> >>>> been reformatted in RFC format and reported to the Datatracker:
> >>>> 
> >>>> https://datatracker.ietf.org/doc/draft-wicg-http-no-vary-search/01/
> >>>> or directly on GitHub
> >>>> 
> >>> https://jeremyroman.github.io/http-no-vary-search/draft-wicg-http-no-vary-search.html
> >>>> 
> >>>> The text has been left mostly unchanged so far (modulo very small
> >>> editorial
> >>>> changes), and does not yet reflect any change to RFC 9111 behavior
> >>> (though
> >>>> hopefully it's clear what those changes would be, from the existing
> >>> text).
> >>>> 
> >>>> On Tue, Mar 19, 2024 at 2:26 AM Mark Nottingham <mnot@mnot.net> wrote:
> >>>> 
> >>>>> Hi Jeremy,
> >>>>> 
> >>>>>> On 19 Mar 2024, at 11:44, Jeremy Roman <jbroman@chromium.org> wrote:
> >>>>>> 
> >>>>>> Unfortunately it is not possible for me to join personally (time
> >>> zones
> >>>>> and personal complications). We might be able to brief a Chrome team
> >>> member
> >>>>> who is attending if there is interest (depending when this is), though
> >>> as
> >>>>> you point out it would necessarily be a fairly brief overview on short
> >>>>> notice (so it might not be possible).
> >>>>> 
> >>>>> It doesn't look likely that we'll have time for additional
> >>> presentations.
> >>>>> I'd suggest continuing the discussion on the list.
> >>>>> 
> >>>>> Just for some context -- we found this kind of capability useful when I
> >>>>> was at Yahoo! way back in 2010:
> >>>>>  https://www.mnot.net/talks/pdf/Stupid_Web_Caching_Tricks.pdf#page=36
> >>>>> 
> >>>>> Cloudflare supports configuration to ignore the whole query string, as
> >>>>> well as specific arguments in it:
> >>>>>  https://developers.cloudflare.com/cache/how-to/cache-keys/
> >>>>> 
> >>>>> As does Fastly:
> >>>>>  https://docs.fastly.com/en/guides/making-query-strings-agnostic
> >>>>> 
> >>>>> 
> >>> https://www.fastly.com/documentation/solutions/examples/manipulate-query-string/
> >>>>> 
> >>>>> As does Akamai (apparently, based upon the information available):
> >>>>> 
> >>>>> 
> >>> https://community.akamai.com/customers/s/article/Remove-query-strings-from-forward-request-and-cache-key?language=en_US
> >>>>> 
> >>>>> I know Varnish supports this as well; I've done it with Squid (using a
> >>>>> helper) too. Not sure about eg nginx or Apache httpd.
> >>>>> 
> >>>>> So I suspect it's safe to say there's interest in this general feature
> >>>>> from people who use HTTP caches.
> >>>>> 
> >>>>> The difference here is the control mechanism to invoke that behaviour
> >>> --
> >>>>> putting it in a response header is really nice because it's a)
> >>>>> standardised, so (eventually) interoperable across implementations,
> >>> and b)
> >>>>> driven by the resource on the origin server, who has the most
> >>> information
> >>>>> about the URL's semantics (rather than relying on out-of-band
> >>>>> configuration).
> >>>>> 
> >>>>> However, when a cache has multiple stored responses and they have
> >>>>> conflicting information about the cache key, we need to be careful
> >>> about
> >>>>> specifying the interaction. In a way, this is similar to Vary -- it
> >>> faced a
> >>>>> similar question, and the decisions made in its design made
> >>> implementation
> >>>>> difficult. We chose a different approach in Key and Variants to address
> >>>>> that; we should probably have a similar discussion here.
> >>>>> 
> >>>>> Cheers,
> >>>>> 
> >>>>> 
> >>>>> --
> >>>>> Mark Nottingham   https://www.mnot.net/
> >>>>> 
> >>>>> 
> >>> 
> > 
> 
> --
> Mark Nottingham   https://www.mnot.net/
Received on Friday, 14 June 2024 01:54:18 UTC