Re: No-Vary-Search from Mark Nottingham on 2024-06-14 (ietf-http-wg@w3.org from April to June 2024)

From: Mark Nottingham <mnot@mnot.net>
Date: Fri, 14 Jun 2024 11:09:17 +1000
To: gs-lists-ietf-http-wg@gluelogic.com
Cc: Jeremy Roman <jbroman@chromium.org>, ietf-http-wg@w3.org
Message-Id: <5A2743FB-BE1F-43CB-8654-040C5B7298CF@mnot.net>
From a chair's perspective -- the important thing to establish at this point is whether the WG thinks there's a problem to solve here, and whether this is _approximately_ the right starting point. Document formatting and details of the specification aren't as relevant at this early stage.

Cheers,


> On 14 Jun 2024, at 9:56 AM, gs-lists-ietf-http-wg@gluelogic.com wrote:
> 
> On Thu, Jun 13, 2024 at 05:50:05PM -0400, Jeremy Roman wrote:
>> This is a little hard to read in the plaintext rendering, I'll admit. It's
>> marginally better in the HTML rendering (though even then, the stylesheet
>> makes it only slightly distinguishable).
> 
> Yes, I read the plaintext draft.
> 
>> WHATWG/W3C specs tend to be read in the HTML rendering, where these links
>> would have been blue and underlined. If quotation marks would help with
>> readability in addition to the section cross-link, that's certainly a
>> typographical change I can make.
> 
> I believe that IETF desires RFCs to be readable in more than HTML,
> but I am not the person to provide guidance on that.  Would the chair
> advise?
> 
>>> There are many references in the doc to WHATWG specs rather than IETF
>>> specifications for URLs.  Is this intentional?
>>> 
>> 
>> The embodiment we (Google Chrome) have been working on is in a web browser
>> which implements the WHATWG URL specification, and we want this to be
>> useful in web browsers (and HTTP servers which are interacting with web
>> browsers), so it needs to be compatible with the way browsers deal with query
>> strings (namely, the application/x-www-form-urlencoded parser, which is
>> used for, for instance, the URLSearchParams object exposed to JavaScript
>> code).
>> 
>> I'm less familiar with the IETF specifications that other uses of HTTP use,
>> though for instance RFC 3986 (Uniform Resource Identifier: Generic Syntax)
>> doesn't say much beyond "query components are often used to carry
>> identifying information in the form of 'key=value' pairs", and RFC 9110
>> (HTTP Semantics) §4.1 simply specifies that this is an optional URL
>> component in the "http" and "https" URI schemes. This doesn't provide
>> enough for the purposes of this document, even if the two are otherwise
>> compatible (which I'm not sure they are).
> 
> If the WHATWG spec provides a stricter query string format variant,
> and this RFC draft requires use of that stricter or different
> formatting variant, then the differences should be highlighted.
> 
> e.g.
> "No-Vary-Search uses a variant of query string format defined in WHATWG
> (reference) which is stricter than the varieties of query string syntax
> allowed in RFC (reference)."
> 
> A reference to the WHATWG ABNF for the stricter format variant of the
> query string is also recommended.
> 
>>> The document does not mention the implication of the union of variants
>>> between Vary and No-Vary-Search response headers.  A CDN or browser
>>> might have to limit the number of variants cached.
>>> 
>> 
>> At present the limitations on the cache are not present here, just logic to
>> determine whether a response is suitable to be used. RFC 9111 (HTTP
>> Caching) §4.1 deals with the analogous question for existing variants to a
>> limited extent:
>> 
>> """
>> 
>> If multiple stored responses match, the cache will need to choose one to
>> use. When a nominated request header field has a known mechanism for
>> ranking preference (e.g., qvalues on Accept and similar request header
>> fields), that mechanism MAY be used to choose a preferred response. If such
>> a mechanism is not available, or leads to equally preferred responses, the
>> most recent response (as determined by the Date header field) is chosen, as
>> per Section 4
>> <https://www.rfc-editor.org/rfc/rfc9111#constructing.responses.from.caches>.
>> <https://www.rfc-editor.org/rfc/rfc9111#section-4.1-6>
>> 
>> Some resources mistakenly omit the Vary header field from their default
>> response (i.e., the one sent when the request does not express any
>> preferences), with the effect of choosing it for subsequent requests to
>> that resource even when more preferable responses are available. When a
>> cache has multiple stored responses for a target URI and one or more omits
>> the Vary header field, the cache SHOULD choose the most recent (see Section
>> 4.2.3 <https://www.rfc-editor.org/rfc/rfc9111#age.calculations>) stored
>> response with a valid Vary field value.
>> 
>> """
>> I anticipated that any discussion of this issue would make most sense as
>> part of any (not yet present) text discussing how this supplements RFC
>> 9111. Are there novel considerations here about CDN and browser limits that
>> merit specification?
> 
> A reference to that might be sufficient to acknowledge that this RFC
> extends caching concerns for clients and caching intermediaries.
> 
>>> Overall, this document uses idioms I am less familiar seeing in RFCs.
>>> Maybe these idioms are more typical in WHATWG documents, but the
>>> pseudo-code is different than what I typically see in RFCs.
>>> Perhaps I am not familiar with the pseudo-markup variant, but it does
>>> not look like markdown to me.
>>> 
>> 
>> This is actually the RFC <em> element
>> <https://authors.ietf.org/en/rfcxml-vocabulary#em> (semantic emphasis) as
>> submitted, which is rendered in the plaintext rendering as leading and
>> trailing underscores, but in the HTML and PDF renderings as italics. It's
>> used here to set apart variable names from other text, because italics is
>> the typographical convention for doing so in WHATWG/W3C algorithm
>> pseudo-code. Not setting it apart at all might make it harder to skim the
>> pseudocode (e.g., to clearly tell which values are used where), but it's
>> certainly a bit noisy.
>> 
>>> e.g. To _parse a URL search variance_ given _value_:
>>> See also my confusion above reading
>>>  "The obtain a URL search variance algorithm"
>>> which could have been
>>>  "The _obtain a URL search variance_ algorithm"
>>> using the _every-other-word_ idiom from Section 4,
>>> though _I_ _am_ _personally_ _not_ _a_ _fan_ _of_ _this_ formatting.
>>> My preference would suggest using a real language (any one), instead of
>>> pseudo-code, to create a reference implementation, if that is the goal.
>>> Add comments to describe required behavior to clarify the reference
>>> implementation.
>>> 
>> 
>> It's a pseudo-code that is typical in WHATWG/W3C documents (
>> https://infra.spec.whatwg.org/#algorithms). Those bodies tend to prefer
>> explicit pseudo-code algorithms, as I understand it, because it forces the
>> algorithm to be unambiguous about some precise aspects of the required
>> behavior that are easy to gloss over in natural language (though of course,
>> it's not the only way of doing so).
>> 
>> While the particular flavour of pseudo-code may not be familiar to this
>> group, I have seen similar pseudo-code in documents such as RFC 8941
>> (Structured Field Values for HTTP) which are relevant to this venue.
> 
> Again, I read the plaintext draft.  Let's wait for guidance from others
> about document formatting and instead discuss content.
> 
>>> The No-Vary-Search syntax with "except" reads to me as a double-negative:
>>>  No-Vary-Search: params, except=("x")
>>> 
>>> Not knowing how far along this spec document is, was naming the header
>>> "Vary-Search" considered?  With "Vary-Search", inverting the logic would
>>> suggest "params" to default to all params varying (same as not
>>> specifying Vary-Search), and "except" could be "no-vary"
>>>  Vary-Search: params, no-vary=("x")
>>> to indicate no-vary for "x", or
>>>  Vary-Search: params, no-vary
>>> to indicate all search params are no-vary (wildcard).
>>> 
>> 
>> Yes, it was considered. This choice was made because it means that an
>> absent header, or empty header value, should reflect that existing HTTP
>> semantics are used.
> 
> The same applies to the theoretical "Vary-Search" I described.
> 
>> Since the default behavior is to vary on *all* parameters,
>> their order, and in fact even the way those parameters are encoded in the
>> URL, that means that the behavior of the header is naturally opposite to
>> Vary (which starts from a default behavior of varying on *no* header
>> fields).
> 
> That is more a definition than a reason.  Ok, I get that choice was
> made, but why is it better than non-inverted logic?
> 
>> This does lead to a double negative, unfortunately, with the use of the
>> term "vary". Conceptualizing it as "is the same resource" instead of as "not
>> varying" addresses that (i.e., "This resource *is the same resource* as if
>> it had been requested with other query parameters, except if x differs."
>> has no double negative). Drawing the clear connection with Vary (which is
>> well-known), though, seemed worth the double negative.
> 
> Respectfully, I disagree.
> 
> This header needs to be processed by caching intermediaries and clients
> for caching, just like "Vary" needs to be processed for caching
> variants.  Caching variants will have to process and *invert* the logic
> of "No-Vary-Search" to produce the "Vary"-set of varying parameters.
> 
> Put another way, storing the variant requires construction of a cache
> key for the variant, which is some sort of encoding of the varying
> parameters.  Since the set of varying parameters needs to be collected,
> it makes more sense to me to have Vary and Vary-Search, rather than
> Vary and invert-the-logic-of No-Vary-Search.  However, since
> No-Vary-Search supports both positive and negative ("except") ways to
> define the varying search parameters or non-varying search parameters,
> you might argue the opposite, depending on which version (positive or
> negative) of No-Vary-Search you think may be used more frequently.
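
The cache-key construction described above can be sketched in Python. This
is a minimal illustration under stated assumptions, not the algorithm from
the draft: SearchVariance, varying_params, and cache_key are hypothetical
names, only the boolean "params" form with an optional "except" list is
modeled (as in `No-Vary-Search: params, except=("x")`), and
urllib.parse.parse_qsl stands in for the WHATWG
application/x-www-form-urlencoded parser, which it only approximates.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import parse_qsl, urlsplit


@dataclass(frozen=True)
class SearchVariance:
    """Hypothetical stand-in for a parsed `params, except=(...)` value."""
    except_params: frozenset = frozenset()


def varying_params(query: str, policy: SearchVariance) -> tuple:
    # Keep only the parameters named in "except"; all others are ignored.
    pairs = parse_qsl(query, keep_blank_values=True)
    return tuple((k, v) for k, v in pairs if k in policy.except_params)


def cache_key(url: str, policy: Optional[SearchVariance] = None) -> tuple:
    parts = urlsplit(url)
    if policy is None:
        # No header: existing HTTP semantics, so the raw query string
        # (including parameter order and encoding) is part of the key.
        search = parts.query
    else:
        search = varying_params(parts.query, policy)
    return (parts.scheme, parts.netloc, parts.path, search)


# Assumed example policy: everything is ignored except "x".
policy = SearchVariance(except_params=frozenset({"x"}))
a = cache_key("https://example.com/p?x=1&tracker=1", policy)
b = cache_key("https://example.com/p?tracker=0&x=1", policy)
assert a == b  # both URLs collapse onto one stored variant
```

Under these assumptions the cache collects the set of varying parameters
whichever way the header spells them, so a positive or a negative header
name would feed the same key construction.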
> 
>>> 7.  Privacy Considerations
>>> 
>>> The ability to cache variants based on search parameters could possibly
>>> compromise privacy due to fingerprinting and the ability to detect cache
>>> hit versus cache miss even with coarse timing resolution.
>>> 
>> 
>> Can you elaborate? If anything, I would have expected that disregarding
>> certain query parameters would mean that someone probing the cache can
>> learn *less* about which values of that query parameter have been seen by
>> the cache previously.
>> 
>> In the context of web browsers, this sort of attack is also mitigated by HTTP
>> cache partitioning
>> <https://developer.chrome.com/blog/http-cache-partitioning> which is now
>> specified (incompletely) by the WHATWG Fetch standard
>> <https://fetch.spec.whatwg.org/#http-cache-partitions>.
> 
> If I have 10 different sets of 16-urls each, can I use caching to create
> a tracking identifier if I assign the client one url from each of those
> 10 sets and create a 10-hexdigit identifier?  Fetching all the URLs and
> detecting which ones are cached might reveal the identifier?  When I
> assign the URLs, I can assign a non-vary parameter 'tracker=1'.  When
> a different page on a different site is requesting all the URLs to
> detect which are cached, the client can add non-vary parameter
> 'tracker=0'.  The server responding can assign HTTP caching headers
> based on the response.  Anyway, this is not my area of expertise.
> Yes, CORS headers factor in, but they take effort to set up properly, and
> "properly" might be malicious if those headers are from malicious sites.
> 
> https://coveryourtracks.eff.org/
> 
> Some of your colleagues at Google would be able to better explain all the
> creative ways Google is fingerprinting clients in addition to cookies.
> 
> Cheers, Glenn
> 
>>> Cheers, Glenn
>>> 
>>> 
>>> On Wed, Jun 12, 2024 at 01:23:23PM -0400, Jeremy Roman wrote:
>>>> In the interest of continuing discussion on this list, the WICG draft has
>>>> been reformatted in RFC format and reported to the Datatracker:
>>>> 
>>>> https://datatracker.ietf.org/doc/draft-wicg-http-no-vary-search/01/
>>>> or directly on GitHub
>>>> 
>>>> https://jeremyroman.github.io/http-no-vary-search/draft-wicg-http-no-vary-search.html
>>>> 
>>>> The text has been left mostly unchanged so far (modulo very small editorial
>>>> changes), and does not yet reflect any change to RFC 9111 behavior (though
>>>> hopefully it's clear what those changes would be, from the existing text).
>>>> 
>>>> On Tue, Mar 19, 2024 at 2:26 AM Mark Nottingham <mnot@mnot.net> wrote:
>>>> 
>>>>> Hi Jeremy,
>>>>> 
>>>>>> On 19 Mar 2024, at 11:44, Jeremy Roman <jbroman@chromium.org> wrote:
>>>>>> 
>>>>>> Unfortunately it is not possible for me to join personally (time zones
>>>>>> and personal complications). We might be able to brief a Chrome team member
>>>>>> who is attending if there is interest (depending when this is), though as
>>>>>> you point out it would necessarily be a fairly brief overview on short
>>>>>> notice (so it might not be possible).
>>>>> 
>>>>> It doesn't look likely that we'll have time for additional presentations.
>>>>> I'd suggest continuing the discussion on the list.
>>>>> 
>>>>> Just for some context -- we found this kind of capability useful when I
>>>>> was at Yahoo! way back in 2010:
>>>>>  https://www.mnot.net/talks/pdf/Stupid_Web_Caching_Tricks.pdf#page=36
>>>>> 
>>>>> Cloudflare supports configuration to ignore the whole query string, as
>>>>> well as specific arguments in it:
>>>>>  https://developers.cloudflare.com/cache/how-to/cache-keys/
>>>>> 
>>>>> As does Fastly:
>>>>>  https://docs.fastly.com/en/guides/making-query-strings-agnostic
>>>>> 
>>>>>  https://www.fastly.com/documentation/solutions/examples/manipulate-query-string/
>>>>> 
>>>>> As does Akamai (apparently, based upon the information available):
>>>>>  https://community.akamai.com/customers/s/article/Remove-query-strings-from-forward-request-and-cache-key?language=en_US
>>>>> 
>>>>> I know Varnish supports this as well; I've done it with Squid (using a
>>>>> helper) too. Not sure about eg nginx or Apache httpd.
>>>>> 
>>>>> So I suspect it's safe to say there's interest in this general feature
>>>>> from people who use HTTP caches.
>>>>> 
>>>>> The difference here is the control mechanism to invoke that behaviour --
>>>>> putting it in a response header is really nice because it's a)
>>>>> standardised, so (eventually) interoperable across implementations, and b)
>>>>> driven by the resource on the origin server, who has the most information
>>>>> about the URL's semantics (rather than relying on out-of-band
>>>>> configuration).
>>>>> 
>>>>> However, when a cache has multiple stored responses and they have
>>>>> conflicting information about the cache key, we need to be careful about
>>>>> specifying the interaction. In a way, this is similar to Vary -- it faced a
>>>>> similar question, and the decisions made in its design made implementation
>>>>> difficult. We chose a different approach in Key and Variants to address
>>>>> that; we should probably have a similar discussion here.
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> 
>>>>> --
>>>>> Mark Nottingham   https://www.mnot.net/
>>>>> 
>>>>> 
>>> 
> 

--
Mark Nottingham   https://www.mnot.net/
Received on Friday, 14 June 2024 01:09:29 UTC