Re: No-Vary-Search from gs-lists-ietf-http-wg@gluelogic.com on 2024-06-13 (ietf-http-wg@w3.org from April to June 2024)

From: <gs-lists-ietf-http-wg@gluelogic.com>
Date: Thu, 13 Jun 2024 19:56:25 -0400
To: Jeremy Roman <jbroman@chromium.org>
Cc: ietf-http-wg@w3.org
Message-ID: <ZmuHKSKBZpXkd115@xps13>

On Thu, Jun 13, 2024 at 05:50:05PM -0400, Jeremy Roman wrote:
> This is a little hard to read in the plaintext rendering, I'll admit. It's
> marginally better in the HTML rendering (though even then, the stylesheet
> makes it only slightly distinguishable).

Yes, I read the plaintext draft.

> WHATWG/W3C specs tend to be read in the HTML rendering, where these links
> would have been blue and underlined. If quotation marks would help with
> readability in addition to the section cross-link, that's certainly a
> typographical change I can make.

I believe that the IETF wants RFCs to be readable in more than HTML,
but I am not the person to provide guidance on that.  Would the chair
advise?

> > There are many references in the doc to WHATWG specs rather than IETF
> > specifications for URLs.  Is this intentional?
> >
> 
> The embodiment we (Google Chrome) have been working on is in a web browser
> which implements the WHATWG URL specification, and we want this to be
> useful in web browsers (and HTTP servers which are interacting with web
> browsers), so being compatible with the way browsers deal with query
> strings (namely, the application/x-www-form-urlencoded parser, which is
> used for, for instance, the URLSearchParams object exposed to JavaScript
> code).
> 
> I'm less familiar with the IETF specifications that other uses of HTTP use,
> though for instance RFC 3986 (Uniform Resource Identifier: Generic Syntax)
> doesn't say much beyond "query components are often used to carry
> identifying information in the form of 'key=value' pairs", and RFC 9110
> (HTTP Semantics) §4.1 simply specifies that this is an optional URL
> component in the "http" and "https" URI schemes. This doesn't provide
> enough for the purposes of this document, even if the two are otherwise
> compatible (which I'm not sure they are).

If the WHATWG spec defines a stricter variant of the query string
format, and this RFC draft requires that stricter or different variant,
then the differences should be highlighted.

e.g.
"No-Vary-Search uses a variant of query string format defined in WHATWG
(reference) which is stricter than the varieties of query string syntax
allowed in RFC (reference)."

A reference to the WHATWG ABNF for the stricter format variant of the
query string is also recommended.

> > > The document does not mention the implication of the union of variants
> > between Vary and No-Vary-Search response headers.  A CDN or browser
> > might have to limit the number of variants cached.
> >
> 
> At present the limitations on the cache are not present here, just logic to
> determine whether a response is suitable to be used. RFC 9111 (HTTP
> Caching) §4.1 deals with the analogous question for existing variants to a
> limited extent:
> 
> """
> 
> If multiple stored responses match, the cache will need to choose one to
> use. When a nominated request header field has a known mechanism for
> ranking preference (e.g., qvalues on Accept and similar request header
> fields), that mechanism MAY be used to choose a preferred response. If such
> a mechanism is not available, or leads to equally preferred responses, the
> most recent response (as determined by the Date header field) is chosen, as
> per Section 4
> <https://www.rfc-editor.org/rfc/rfc9111#constructing.responses.from.caches>.
> <https://www.rfc-editor.org/rfc/rfc9111#section-4.1-6>
> 
> Some resources mistakenly omit the Vary header field from their default
> response (i.e., the one sent when the request does not express any
> preferences), with the effect of choosing it for subsequent requests to
> that resource even when more preferable responses are available. When a
> cache has multiple stored responses for a target URI and one or more omits
> the Vary header field, the cache SHOULD choose the most recent (see Section
> 4.2.3 <https://www.rfc-editor.org/rfc/rfc9111#age.calculations>) stored
> response with a valid Vary field value.
> 
> """
> I anticipated that any discussion of this issue would make most sense as
> part of any (not yet present) text discussing how this supplements RFC
> 9111. Are there novel considerations here about CDN and browser limits that
> merit specification?

A reference to that might be sufficient to acknowledge that this RFC
extends caching concerns for clients and caching intermediaries.

> > Overall, this document uses idioms I am less familiar seeing in RFCs.
> > Maybe these idioms are more typical in WHATWG documents, but the
> > pseudo-code is different than what I typically see in RFCs.
> > Perhaps I am not familiar with the pseudo-markup variant, but it does
> > not look like markdown to me.
> >
> 
> This is actually the RFC <em> element
> <https://authors.ietf.org/en/rfcxml-vocabulary#em> (semantic emphasis) as
> submitted, which is rendered in the plaintext rendering as leading and
> trailing underscores, but in the HTML and PDF renderings as italics. It's
> used here to set apart variable names from other text, because italics is
> the typographical convention for doing so in WHATWG/W3C algorithm
> pseudo-code. Not setting it apart at all might make it harder to skim the
> pseudocode (e.g., to clearly tell which values are used where), but it's
> certainly a bit noisy.
> 
> > e.g. To _parse a URL search variance_ given _value_:
> > See also my confusion above reading
> >   "The obtain a URL search variance algorithm"
> > which could have been
> >   "The _obtain a URL search variance_ algorithm"
> > using the _every-other-word_ idiom from Section 4,
> > though _I_ _am_ _personally_ _not_ _a_ _fan_ _of_ _this_ formatting.
> > My preference would suggest using a real language (any one), instead of
> > pseudo-code, to create a reference implementation, if that is the goal.
> > Add comments to describe required behavior to clarify the reference
> > implementation.
> >
> 
> It's a pseudo-code that is typical in WHATWG/W3C documents (
> https://infra.spec.whatwg.org/#algorithms). Those bodies tend to prefer
> explicit pseudo-code algorithms, as I understand it, because it forces the
> algorithm to be unambiguous about some precise aspects of the required
> behavior that are easy to gloss over in natural language (though of course,
> it's not the only way of doing so).
> 
> While the particular flavour of pseudo-code may not be familiar to this
> group, I have seen similar pseudo-code in documents such as RFC 8941
> (Structured Field Values for HTTP) which are relevant to this venue.

Again, I read the plaintext draft.  Let's wait for guidance from others
about document formatting and instead discuss content.

> > The No-Vary-Search syntax with "except" reads to me as a double-negative:
> >   No-Vary-Search: params, except=("x")
> >
> > Not knowing how far along this spec document is, was naming the header
> > "Vary-Search" considered?  With "Vary-Search", inverting the logic would
> > suggest "params" to default to all params varying (same as not
> > specifying Vary-Search), and "except" could be "no-vary"
> >   Vary-Search: params, no-vary=("x")
> > to indicate no-vary for "x", or
> >   Vary-Search: params, no-vary
> > to indicate all search params are no-vary (wildcard).
> >
> 
> Yes, it was considered. This choice was made because it means that an
> absent header, or empty header value, should reflect that existing HTTP
> semantics are used.

The same applies to the theoretical "Vary-Search" I described.

> Since the default behavior is to vary on *all* parameters,
> their order, and in fact even the way those parameters are encoded in the
> URL, that means that the behavior of the header is naturally opposite to
> Vary (which starts from a default behavior of varying on *no* header
> fields).

That is more a definition than a reason.  OK, I get that the choice was
made, but why is it better than non-inverted logic?

> This does lead to a double negative, unfortunately, with the use of the
> term "vary". Conceptualizing it instead of as "not varying" as "is the same
> resource" addresses that (i.e., "This resource *is the same resource* as if
> it had been requested with other query parameters, except if x differs."
> has no double negative). Drawing the clear connection with Vary (which is
> well-known), though, seemed worth the double negative.

Respectfully, I disagree.

This header needs to be processed by caching intermediaries and clients
for caching, just like "Vary" needs to be processed for caching
variants.  Caches that store variants will have to process and *invert*
the logic of "No-Vary-Search" to produce the "Vary"-set of varying
parameters.

Put another way, storing the variant requires construction of a cache
key for the variant, which is some sort of encoding of the varying
parameters.  Since the set of varying parameters needs to be collected,
it makes more sense to me to have Vary and Vary-Search, rather than
Vary and invert-the-logic-of No-Vary-Search.  However, since
No-Vary-Search supports both a positive and a negative ("except") way
to define the varying or non-varying search parameters, you might argue
the opposite, depending on which form (positive or negative) of
No-Vary-Search you think may be used more frequently.
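
To make the cache-key point concrete, here is a rough sketch (not taken
from the draft, and not its parsing algorithm).  The names cache_key,
no_vary_params and vary_params are hypothetical, and the sketch ignores
key-order and the draft's URL-encoding details.  It only shows that both
forms of the header end up being reduced to the set of parameters that
do vary:

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

def cache_key(url, no_vary_params=(), vary_params=None):
    """Build a cache key from only the search parameters that vary.

    Hypothetical helper: 'no_vary_params' models
    'No-Vary-Search: params=("a" "b")', while 'vary_params' models
    'No-Vary-Search: params, except=("x")'.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if vary_params is not None:
        # "except" form: everything is ignored except the listed names.
        kept = [(k, v) for k, v in pairs if k in vary_params]
    else:
        # positive form: the listed names are ignored, the rest vary.
        kept = [(k, v) for k, v in pairs if k not in no_vary_params]
    return (parts.scheme, parts.netloc, parts.path, urlencode(kept))

# Two URLs that differ only in an ignored parameter share a key.
a = cache_key("https://example.com/a?x=1&tracker=1", no_vary_params={"tracker"})
b = cache_key("https://example.com/a?x=1&tracker=0", no_vary_params={"tracker"})
print(a == b)
```

Either way, the cache has to compute the "kept" list, which is the
Vary-style set of varying parameters.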

> > 7.  Privacy Considerations
> >
> > The ability to cache variants based on search parameters could possibly
> > compromise privacy due to fingerprinting and the ability to detect cache
> > hit versus cache miss even with coarse timing resolution.
> >
> 
> Can you elaborate? If anything, I would have expected that disregarding
> certain query parameters would mean that someone probing the cache can
> learn *less* about which values of that query parameter have been seen by
> the cache previously.
> 
> In the context of web browsers, this sort of attack is also mitigated by HTTP
> cache partitioning
> <https://developer.chrome.com/blog/http-cache-partitioning> which is now
> specified (incompletely) by the WHATWG Fetch standard
> <https://fetch.spec.whatwg.org/#http-cache-partitions>.

If I have 10 different sets of 16 URLs each, can I use caching to create
a tracking identifier if I assign the client one URL from each of those
10 sets and create a 10-hex-digit identifier?  Fetching all the URLs and
detecting which ones are cached might reveal the identifier?  When I
assign the URLs, I can add a non-varying parameter 'tracker=1'.  When
a different page on a different site requests all the URLs to detect
which are cached, the client can add the non-varying parameter
'tracker=0'.  The server responding can assign HTTP caching headers
based on the response.  Anyway, this is not my area of expertise.
Yes, CORS headers factor in, but they take effort to set up properly,
and "properly" might be malicious if those headers are from malicious
sites.
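
As a toy model of the arithmetic only (it says nothing about what real
browsers allow, and cache partitioning may well defeat it), 10 sets of
16 URLs give 16**10 == 2**40 possible identifiers.  The host
tracker.example and the helper names below are made up for
illustration, and the "cache" is just a Python set:

```python
SETS = 10
HEX_DIGITS = "0123456789abcdef"   # 16 URLs per set, one per hex digit

def url(set_index, digit):
    # Hypothetical URL layout: one URL per (set, digit) pair.
    return f"https://tracker.example/set{set_index}/{digit}"

def urls_for(identifier):
    """Pick one URL from each set to encode a 10-hex-digit identifier."""
    digits = f"{identifier:0{SETS}x}"
    return [url(i, d) for i, d in enumerate(digits)]

def recover(is_cached):
    """Probe every URL and read the identifier back from cache hits."""
    digits = []
    for i in range(SETS):
        digits.append(next(d for d in HEX_DIGITS if is_cached(url(i, d))))
    return int("".join(digits), 16)

cache = set(urls_for(0x12345ABCDE))      # stand-in for the client's cache
print(recover(cache.__contains__) == 0x12345ABCDE)
print(16 ** SETS == 2 ** 40)
```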

https://coveryourtracks.eff.org/

Some of your colleagues at Google would be able to better explain all the
creative ways Google is fingerprinting clients in addition to cookies.

Cheers, Glenn

> > Cheers, Glenn
> >
> >
> > On Wed, Jun 12, 2024 at 01:23:23PM -0400, Jeremy Roman wrote:
> > > In the interest of continuing discussion on this list, the WICG draft has
> > > been reformatted in RFC format and reported to the Datatracker:
> > >
> > > https://datatracker.ietf.org/doc/draft-wicg-http-no-vary-search/01/
> > > or directly on GitHub
> > >
> > > https://jeremyroman.github.io/http-no-vary-search/draft-wicg-http-no-vary-search.html
> > >
> > > The text has been left mostly unchanged so far (modulo very small
> > > editorial changes), and does not yet reflect any change to RFC 9111
> > > behavior (though hopefully it's clear what those changes would be,
> > > from the existing text).
> > >
> > > On Tue, Mar 19, 2024 at 2:26 AM Mark Nottingham <mnot@mnot.net> wrote:
> > >
> > > > Hi Jeremy,
> > > >
> > > > > On 19 Mar 2024, at 11:44, Jeremy Roman <jbroman@chromium.org> wrote:
> > > > >
> > > > > Unfortunately it is not possible for me to join personally
> > > > > (time zones and personal complications). We might be able to
> > > > > brief a Chrome team member who is attending if there is
> > > > > interest (depending when this is), though as you point out it
> > > > > would necessarily be a fairly brief overview on short notice
> > > > > (so it might not be possible).
> > > >
> > > > It doesn't look likely that we'll have time for additional
> > > > presentations. I'd suggest continuing the discussion on the list.
> > > >
> > > > Just for some context -- we found this kind of capability useful when I
> > > > was at Yahoo! way back in 2010:
> > > >   https://www.mnot.net/talks/pdf/Stupid_Web_Caching_Tricks.pdf#page=36
> > > >
> > > > Cloudflare supports configuration to ignore the whole query string, as
> > > > well as specific arguments in it:
> > > >   https://developers.cloudflare.com/cache/how-to/cache-keys/
> > > >
> > > > As does Fastly:
> > > >   https://docs.fastly.com/en/guides/making-query-strings-agnostic
> > > >   https://www.fastly.com/documentation/solutions/examples/manipulate-query-string/
> > > >
> > > > As does Akamai (apparently, based upon the information available):
> > > >   https://community.akamai.com/customers/s/article/Remove-query-strings-from-forward-request-and-cache-key?language=en_US
> > > >
> > > > I know Varnish supports this as well; I've done it with Squid (using a
> > > > helper) too. Not sure about eg nginx or Apache httpd.
> > > >
> > > > So I suspect it's safe to say there's interest in this general feature
> > > > from people who use HTTP caches.
> > > >
> > > > The difference here is the control mechanism to invoke that
> > > > behaviour -- putting it in a response header is really nice because
> > > > it's a) standardised, so (eventually) interoperable across
> > > > implementations, and b) driven by the resource on the origin server,
> > > > who has the most information about the URL's semantics (rather than
> > > > relying on out-of-band configuration).
> > > >
> > > > However, when a cache has multiple stored responses and they have
> > > > conflicting information about the cache key, we need to be careful
> > > > about specifying the interaction. In a way, this is similar to Vary
> > > > -- it faced a similar question, and the decisions made in its design
> > > > made implementation difficult. We chose a different approach in Key
> > > > and Variants to address that; we should probably have a similar
> > > > discussion here.
> > > >
> > > > Cheers,
> > > >
> > > >
> > > > --
> > > > Mark Nottingham   https://www.mnot.net/
> > > >
> > > >
> >
Received on Thursday, 13 June 2024 23:56:38 UTC