Re: No-Vary-Search from Jeremy Roman on 2024-06-13 (ietf-http-wg@w3.org from April to June 2024)

From: Jeremy Roman <jbroman@chromium.org>
Date: Thu, 13 Jun 2024 17:50:05 -0400
To: gs-lists-ietf-http-wg@gluelogic.com
Cc: ietf-http-wg@w3.org
Message-ID: <CACuR13fy8jFDBpsJ6-pp7MQ-xxSivM6BCRX5vz+cnVdVi45t_Q@mail.gmail.com>
On Thu, Jun 13, 2024 at 4:16 AM <gs-lists-ietf-http-wg@gluelogic.com> wrote:

> Jeremy,
>
> My first impressions follow.  You're more than welcome to disagree.
>
>
> Section 2 contains this confusing sentence.  Please clarify in the doc.
> "given by the obtain a"?
>
>       |  Implementations instead need to
>       |  implement the processing model given by the obtain a URL search
>       |  variance algorithm (Section 4.2).
>
> Section 3 similarly contains
>
>    The obtain a URL search variance algorithm (Section 4.2) ensures that
>    all URL search variances obey the following constraints:
>
> If "obtain a URL search variance" algorithm is the name of an algorithm,
> please indicate such.  (perhaps by quoting the name of the algorithm?)
> ...The sentence did not read clearly until I read Section 4.2, which has
> that the title (notice difference in case)
>   "4.2.  Obtain a URL search variance"
>

This is a little hard to read in the plaintext rendering, I'll admit. It's
marginally better in the HTML rendering (though even then, the stylesheet
makes it only slightly distinguishable).

WHATWG/W3C specs tend to be read in the HTML rendering, where these links
would have been blue and underlined. If quotation marks would help with
readability in addition to the section cross-link, that's certainly a
typographical change I can make.

> There are many references in the doc to WHATWG specs rather than IETF
> specifications for URLs.  Is this intentional?
>

The implementation we (Google Chrome) have been working on is in a web
browser which implements the WHATWG URL specification, and we want this to
be useful in web browsers (and HTTP servers which are interacting with web
browsers), so it needs to be compatible with the way browsers deal with
query strings (namely, the application/x-www-form-urlencoded parser, which
is used by, for instance, the URLSearchParams object exposed to JavaScript
code).
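
As a rough illustration of that parsing model, here is a minimal sketch in
Python. It assumes the standard library's urllib.parse.parse_qsl as a
stand-in; that function only approximates the WHATWG
application/x-www-form-urlencoded parser and is not what the draft
references. The helper name parse_query is hypothetical.

```python
from urllib.parse import parse_qsl


def parse_query(query: str) -> list[tuple[str, str]]:
    """Split a query string into an ordered list of (name, value) pairs.

    Assumption: parse_qsl is used here as an approximation of the
    form-urlencoded parser that browsers apply to query strings.
    """
    # keep_blank_values=True keeps pairs such as "b=" instead of dropping them.
    return parse_qsl(query, keep_blank_values=True)


# "+" and "%20" both decode to a space, so two differently encoded
# query strings can parse to the same list of pairs.
print(parse_query("q=hello+world&lang=en"))
print(parse_query("q=hello%20world&lang=en"))
```

The point of the sketch is only that the parsed pairs, not the raw bytes of
the query string, are what a browser-compatible comparison would look at.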

I'm less familiar with the IETF specifications that other uses of HTTP use,
though for instance RFC 3986 (Uniform Resource Identifier: Generic Syntax)
doesn't say much beyond "query components are often used to carry
identifying information in the form of 'key=value' pairs", and RFC 9110
(HTTP Semantics) §4.1 simply specifies that this is an optional URL
component in the "http" and "https" URI schemes. This doesn't provide
enough for the purposes of this document, even if the two are otherwise
compatible (which I'm not sure they are).


> The document does not mention the implication of the union of variants
> between Vary and No-Vary-Search response headers.  A CDN or browser
> might have to limit the number of variants cached.
>

At present the document does not address limits on the cache, just the
logic to determine whether a stored response is suitable to be used. RFC
9111 (HTTP Caching) §4.1 deals with the analogous question for existing
variants to a limited extent:

"""

If multiple stored responses match, the cache will need to choose one to
use. When a nominated request header field has a known mechanism for
ranking preference (e.g., qvalues on Accept and similar request header
fields), that mechanism MAY be used to choose a preferred response. If such
a mechanism is not available, or leads to equally preferred responses, the
most recent response (as determined by the Date header field) is chosen, as
per Section 4
<https://www.rfc-editor.org/rfc/rfc9111#constructing.responses.from.caches>.

Some resources mistakenly omit the Vary header field from their default
response (i.e., the one sent when the request does not express any
preferences), with the effect of choosing it for subsequent requests to
that resource even when more preferable responses are available. When a
cache has multiple stored responses for a target URI and one or more omits
the Vary header field, the cache SHOULD choose the most recent (see Section
4.2.3 <https://www.rfc-editor.org/rfc/rfc9111#age.calculations>) stored
response with a valid Vary field value.

"""
I anticipated that any discussion of this issue would make most sense as
part of any (not yet present) text discussing how this supplements RFC
9111. Are there novel considerations here about CDN and browser limits that
merit specification?
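
For concreteness, the "most recent response" fallback quoted above could be
sketched roughly as follows. This is a minimal sketch under assumptions:
stored responses are modelled as plain dicts of header fields, every
candidate has already been judged a match, and the function name
choose_most_recent is hypothetical.

```python
from email.utils import parsedate_to_datetime


def choose_most_recent(stored_responses: list[dict[str, str]]) -> dict[str, str]:
    """Pick the candidate whose Date header field is the latest.

    Assumption: each candidate is a dict of header fields and has a
    well-formed Date value; matching has already been done by the caller.
    """
    return max(
        stored_responses,
        key=lambda headers: parsedate_to_datetime(headers["Date"]),
    )


candidates = [
    {"Date": "Thu, 13 Jun 2024 17:00:00 GMT", "ETag": '"old"'},
    {"Date": "Thu, 13 Jun 2024 21:00:00 GMT", "ETag": '"new"'},
]
print(choose_most_recent(candidates)["ETag"])
```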

> Overall, this document uses idioms I am less familiar seeing in RFCs.
> Maybe these idioms are more typical in WHATWG documents, but the
> pseudo-code is different than what I typically see in RFCs.
> Perhaps I am not familiar with the pseudo-markup variant, but it does
> not look like markdown to me.
>

This is actually the RFC <em> element
<https://authors.ietf.org/en/rfcxml-vocabulary#em> (semantic emphasis) as
submitted, which is rendered in the plaintext rendering as leading and
trailing underscores, but in the HTML and PDF renderings as italics. It's
used here to set apart variable names from other text, because italics is
the typographical convention for doing so in WHATWG/W3C algorithm
pseudo-code. Not setting it apart at all might make it harder to skim the
pseudocode (e.g., to clearly tell which values are used where), but it's
certainly a bit noisy.

> e.g. To _parse a URL search variance_ given _value_:
> See also my confusion above reading
>   "The obtain a URL search variance algorithm"
> which could have been
>   "The _obtain a URL search variance_ algorithm"
> using the _every-other-word_ idiom from Section 4,
> though _I_ _am_ _personally_ _not_ _a_ _fan_ _of_ _this_ formatting.
> My preference would suggest using a real language (any one), instead of
> pseudo-code, to create a reference implementation, if that is the goal.
> Add comments to describe required behavior to clarify the reference
> implementation.
>

It's a pseudo-code that is typical in WHATWG/W3C documents (
https://infra.spec.whatwg.org/#algorithms). Those bodies tend to prefer
explicit pseudo-code algorithms, as I understand it, because it forces the
algorithm to be unambiguous about some precise aspects of the required
behavior that are easy to gloss over in natural language (though of course,
it's not the only way of doing so).

While the particular flavour of pseudo-code may not be familiar to this
group, I have seen similar pseudo-code in documents such as RFC 8941
(Structured Field Values for HTTP) which are relevant to this venue.

> The No-Vary-Search syntax with "except" reads to me as a double-negative:
>   No-Vary-Search: params, except=("x")
>
> Not knowing how far along this spec document is, was naming the header
> "Vary-Search" considered?  With "Vary-Search", inverting the logic would
> suggest "params" to default to all params varying (same as not
> specifying Vary-Search), and "except" could be "no-vary"
>   Vary-Search: params, no-vary=("x")
> to indicate no-vary for "x", or
>   Vary-Search: params, no-vary
> to indicate all search params are no-vary (wildcard).
>

Yes, it was considered. This choice was made so that an absent header, or
an empty header value, means that existing HTTP semantics are used. Since
the default behavior is to vary on *all* parameters, their order, and in
fact even the way those parameters are encoded in the URL, the behavior of
the header is naturally opposite to Vary (which starts from a default
behavior of varying on *no* header fields).

This does lead to a double negative, unfortunately, with the use of the
term "vary". Conceptualizing it as "is the same resource" rather than as
"not varying" addresses that (e.g., "This resource *is the same resource* as
if it had been requested with other query parameters, except if x differs."
has no double negative). Drawing a clear connection with Vary (which is
well known), though, seemed worth the double negative.
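
To make the "is the same resource" reading concrete, here is a simplified
sketch in Python. It is not the draft's algorithm: it only models an absent
header and the `params, except=(...)` form discussed above, uses
urllib.parse as an approximation of the browser parser, and the name
same_resource and its vary_only_on parameter are hypothetical.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlsplit


def same_resource(url_a: str, url_b: str, vary_only_on: Optional[set[str]]) -> bool:
    """Return True if the two URLs would be treated as the same resource.

    vary_only_on=None models an absent header: the query must match exactly.
    vary_only_on={"x"} models `No-Vary-Search: params, except=("x")`.
    """
    a, b = urlsplit(url_a), urlsplit(url_b)
    if (a.scheme, a.netloc, a.path) != (b.scheme, b.netloc, b.path):
        return False
    if vary_only_on is None:
        # Default HTTP semantics: parameters, order and encoding all matter.
        return a.query == b.query

    def significant(query: str) -> list[tuple[str, str]]:
        # Keep only the parameters that the response still varies on.
        pairs = parse_qsl(query, keep_blank_values=True)
        return [(name, value) for name, value in pairs if name in vary_only_on]

    return significant(a.query) == significant(b.query)


# Same resource: only "utm" differs, and the response does not vary on it.
print(same_resource("https://example.com/p?x=1&utm=a",
                    "https://example.com/p?utm=b&x=1", {"x"}))
```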

> 7.  Privacy Considerations
>
> The ability to cache variants based on search parameters could possibly
> compromise privacy due to fingerprinting and the ability to detect cache
> hit versus cache miss even with coarse timing resolution.
>

Can you elaborate? If anything, I would have expected that disregarding
certain query parameters would mean that someone probing the cache can
learn *less* about which values of that query parameter have been seen by
the cache previously.

In the context of web browsers, this sort of attack is also mitigated by
HTTP cache partitioning
<https://developer.chrome.com/blog/http-cache-partitioning>, which is now
specified (incompletely) by the WHATWG Fetch standard
<https://fetch.spec.whatwg.org/#http-cache-partitions>.
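
As a toy illustration of why partitioning blunts cross-site probing, here
is a minimal sketch. It assumes a cache keyed by (top-level site, URL);
real partition keys carry more detail than this, and the PartitionedCache
class is hypothetical.

```python
from typing import Optional


class PartitionedCache:
    """Toy cache keyed by (top-level site, URL) rather than by URL alone."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], bytes] = {}

    def store(self, top_level_site: str, url: str, body: bytes) -> None:
        self._entries[(top_level_site, url)] = body

    def lookup(self, top_level_site: str, url: str) -> Optional[bytes]:
        # A hit requires both the partition and the URL to match.
        return self._entries.get((top_level_site, url))


cache = PartitionedCache()
cache.store("https://a.example", "https://cdn.example/lib.js?v=1", b"body")
# A page on a different top-level site probing the same URL sees a miss.
print(cache.lookup("https://b.example", "https://cdn.example/lib.js?v=1"))
```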

> Cheers, Glenn
>
>
> On Wed, Jun 12, 2024 at 01:23:23PM -0400, Jeremy Roman wrote:
> > In the interest of continuing discussion on this list, the WICG draft has
> > been reformatted in RFC format and reported to the Datatracker:
> >
> > https://datatracker.ietf.org/doc/draft-wicg-http-no-vary-search/01/
> > or directly on GitHub
> > https://jeremyroman.github.io/http-no-vary-search/draft-wicg-http-no-vary-search.html
> >
> > The text has been left mostly unchanged so far (modulo very small
> > editorial changes), and does not yet reflect any change to RFC 9111
> > behavior (though hopefully it's clear what those changes would be, from
> > the existing text).
> >
> > On Tue, Mar 19, 2024 at 2:26 AM Mark Nottingham <mnot@mnot.net> wrote:
> >
> > > Hi Jeremy,
> > >
> > > > On 19 Mar 2024, at 11:44, Jeremy Roman <jbroman@chromium.org> wrote:
> > > >
> > > > Unfortunately it is not possible for me to join personally (time
> > > > zones and personal complications). We might be able to brief a
> > > > Chrome team member who is attending if there is interest (depending
> > > > when this is), though as you point out it would necessarily be a
> > > > fairly brief overview on short notice (so it might not be possible).
> > >
> > > It doesn't look likely that we'll have time for additional
> > > presentations. I'd suggest continuing the discussion on the list.
> > >
> > > Just for some context -- we found this kind of capability useful when I
> > > was at Yahoo! way back in 2010:
> > >   https://www.mnot.net/talks/pdf/Stupid_Web_Caching_Tricks.pdf#page=36
> > >
> > > Cloudflare supports configuration to ignore the whole query string, as
> > > well as specific arguments in it:
> > >   https://developers.cloudflare.com/cache/how-to/cache-keys/
> > >
> > > As does Fastly:
> > >   https://docs.fastly.com/en/guides/making-query-strings-agnostic
> > >
> > >   https://www.fastly.com/documentation/solutions/examples/manipulate-query-string/
> > >
> > > As does Akamai (apparently, based upon the information available):
> > >   https://community.akamai.com/customers/s/article/Remove-query-strings-from-forward-request-and-cache-key?language=en_US
> > >
> > > I know Varnish supports this as well; I've done it with Squid (using a
> > > helper) too. Not sure about eg nginx or Apache httpd.
> > >
> > > So I suspect it's safe to say there's interest in this general feature
> > > from people who use HTTP caches.
> > >
> > > The difference here is the control mechanism to invoke that behaviour
> > > -- putting it in a response header is really nice because it's a)
> > > standardised, so (eventually) interoperable across implementations,
> > > and b) driven by the resource on the origin server, who has the most
> > > information about the URL's semantics (rather than relying on
> > > out-of-band configuration).
> > >
> > > However, when a cache has multiple stored responses and they have
> > > conflicting information about the cache key, we need to be careful
> > > about specifying the interaction. In a way, this is similar to Vary --
> > > it faced a similar question, and the decisions made in its design made
> > > implementation difficult. We chose a different approach in Key and
> > > Variants to address that; we should probably have a similar discussion
> > > here.
> > >
> > > Cheers,
> > >
> > >
> > > --
> > > Mark Nottingham   https://www.mnot.net/
> > >
> > >
>
Received on Thursday, 13 June 2024 21:50:24 UTC