- From: Mark Nottingham <mnot@mnot.net>
- Date: Fri, 14 Jun 2024 11:09:17 +1000
- To: gs-lists-ietf-http-wg@gluelogic.com
- Cc: Jeremy Roman <jbroman@chromium.org>, ietf-http-wg@w3.org
From a chair's perspective -- the important thing to establish at this point is whether the WG thinks there's a problem to solve here, and whether this is _approximately_ the right starting point. Document formatting and details of the specification aren't as relevant at this early stage.

Cheers,


> On 14 Jun 2024, at 9:56 AM, gs-lists-ietf-http-wg@gluelogic.com wrote:
>
> On Thu, Jun 13, 2024 at 05:50:05PM -0400, Jeremy Roman wrote:
>> This is a little hard to read in the plaintext rendering, I'll admit. It's
>> marginally better in the HTML rendering (though even then, the stylesheet
>> makes it only slightly distinguishable).
>
> Yes, I read the plaintext draft.
>
>> WHATWG/W3C specs tend to be read in the HTML rendering, where these links
>> would have been blue and underlined. If quotation marks would help with
>> readability in addition to the section cross-link, that's certainly a
>> typographical change I can make.
>
> I believe that IETF desires RFCs to be readable in more than HTML,
> but I am not the person to provide guidance on that. Would the chair
> advise?
>
>>> There are many references in the doc to WHATWG specs rather than IETF
>>> specifications for URLs. Is this intentional?
>>>
>>
>> The embodiment we (Google Chrome) have been working on is in a web browser
>> which implements the WHATWG URL specification, and we want this to be
>> useful in web browsers (and HTTP servers which are interacting with web
>> browsers), so being compatible with the way browsers deal with query
>> strings (namely, the application/x-www-form-urlencoded parser, which is
>> used for, for instance, the URLSearchParams object exposed to JavaScript
>> code).
>>
>> I'm less familiar with the IETF specifications that other uses of HTTP use,
>> though for instance RFC 3986 (Uniform Resource Identifier: Generic Syntax)
>> doesn't say much beyond "query components are often used to carry
>> identifying information in the form of 'key=value' pairs", and RFC 9110
>> (HTTP Semantics) §4.1 simply specifies that this is an optional URL
>> component in the "http" and "https" URI schemes. This doesn't provide
>> enough for the purposes of this document, even if the two are otherwise
>> compatible (which I'm not sure they are).
>
> If WHATWG spec provides a stricter query string format variant,
> and this RFC draft requires use of that stricter or different
> formatting variant, then the differences should be highlighted.
>
> e.g.
> "No-Vary-Search uses a variant of query string format defined in WHATWG
> (reference) which is stricter than the varieties of query string syntax
> allowed in RFC (reference)."
>
> A reference to the WHATWG ABNF for the stricter format variant of the
> query string is also recommended.
>
>>> The document does not mention the implication of the union of variants
>>> between Vary and No-Vary-Search response headers. A CDN or browser
>>> might have to limit the number of variants cached.
>>>
>>
>> At present the limitations on the cache are not present here, just logic to
>> determine whether a response is suitable to be used. RFC 9111 (HTTP
>> Caching) §4.1 deals with the analogous question for existing variants to a
>> limited extent:
>>
>> """
>>
>> If multiple stored responses match, the cache will need to choose one to
>> use. When a nominated request header field has a known mechanism for
>> ranking preference (e.g., qvalues on Accept and similar request header
>> fields), that mechanism MAY be used to choose a preferred response.
>> If such
>> a mechanism is not available, or leads to equally preferred responses, the
>> most recent response (as determined by the Date header field) is chosen, as
>> per Section 4
>> <https://www.rfc-editor.org/rfc/rfc9111#constructing.responses.from.caches>.
>> <https://www.rfc-editor.org/rfc/rfc9111#section-4.1-6>
>>
>> Some resources mistakenly omit the Vary header field from their default
>> response (i.e., the one sent when the request does not express any
>> preferences), with the effect of choosing it for subsequent requests to
>> that resource even when more preferable responses are available. When a
>> cache has multiple stored responses for a target URI and one or more omits
>> the Vary header field, the cache SHOULD choose the most recent (see Section
>> 4.2.3 <https://www.rfc-editor.org/rfc/rfc9111#age.calculations>) stored
>> response with a valid Vary field value.
>>
>> """
>> I anticipated that any discussion of this issue would make most sense as
>> part of any (not yet present) text discussing how this supplements RFC
>> 9111. Are there novel considerations here about CDN and browser limits that
>> merit specification?
>
> A reference to that might be sufficient to acknowledge that this RFC
> extends caching concerns for clients and caching intermediaries.
>
>>> Overall, this document uses idioms I am less familiar seeing in RFCs.
>>> Maybe these idioms are more typical in WHATWG documents, but the
>>> pseudo-code is different than what I typically see in RFCs.
>>> Perhaps I am not familiar with the pseudo-markup variant, but it does
>>> not look like markdown to me.
>>>
>>
>> This is actually the RFC <em> element
>> <https://authors.ietf.org/en/rfcxml-vocabulary#em> (semantic emphasis) as
>> submitted, which is rendered in the plaintext rendering as leading and
>> trailing underscores, but in the HTML and PDF renderings as italics.
>> It's
>> used here to set apart variable names from other text, because italics is
>> the typographical convention for doing so in WHATWG/W3C algorithm
>> pseudo-code. Not setting it apart at all might make it harder to skim the
>> pseudocode (e.g., to clearly tell which values are used where), but it's
>> certainly a bit noisy.
>>
>>> e.g. To _parse a URL search variance_ given _value_:
>>> See also my confusion above reading
>>> "The obtain a URL search variance algorithm"
>>> which could have been
>>> "The _obtain a URL search variance_ algorithm"
>>> using the _every-other-word_ idiom from Section 4,
>>> though _I_ _am_ _personally_ _not_ _a_ _fan_ _of_ _this_ formatting.
>>> My preference would suggest using a real language (any one), instead of
>>> pseudo-code, to create a reference implementation, if that is the goal.
>>> Add comments to describe required behavior to clarify the reference
>>> implementation.
>>>
>>
>> It's a pseudo-code that is typical in WHATWG/W3C documents (
>> https://infra.spec.whatwg.org/#algorithms). Those bodies tend to prefer
>> explicit pseudo-code algorithms, as I understand it, because it forces the
>> algorithm to be unambiguous about some precise aspects of the required
>> behavior that are easy to gloss over in natural language (though of course,
>> it's not the only way of doing so).
>>
>> While the particular flavour of pseudo-code may not be familiar to this
>> group, I have seen similar pseudo-code in documents such as RFC 8941
>> (Structured Field Values for HTTP) which are relevant to this venue.
>
> Again, I read the plaintext draft. Let's wait for guidance from others
> about document formatting and instead discuss content.
>
>>> The No-Vary-Search syntax with "except" reads to me as a double-negative:
>>> No-Vary-Search: params, except=("x")
>>>
>>> Not knowing how far along this spec document is, was naming the header
>>> "Vary-Search" considered?
>>> With "Vary-Search", inverting the logic would
>>> suggest "params" to default to all params varying (same as not
>>> specifying Vary-Search), and "except" could be "no-vary"
>>> Vary-Search: params, no-vary=("x")
>>> to indicate no-vary for "x", or
>>> Vary-Search: params, no-vary
>>> to indicate all search params are no-vary (wildcard).
>>>
>>
>> Yes, it was considered. This choice was made because it means that an
>> absent header, or empty header value, should reflect that existing HTTP
>> semantics are used.
>
> The same applies to the theoretical "Vary-Search" I described.
>
>> Since the default behavior is to vary on *all* parameters,
>> their order, and in fact even the way those parameters are encoded in the
>> URL, that means that the behavior of the header is naturally opposite to
>> Vary (which starts from a default behavior of varying on *no* header
>> fields).
>
> That is more a definition than a reason. Ok, I get that choice was
> made, but why is it better than non-inverted logic?
>
>> This does lead to a double negative, unfortunately, with the use of the
>> term "vary". Conceptualizing it instead of as "not varying" as "is the same
>> resource" addresses that (i.e., "This resource *is the same resource* as if
>> it had been requested with other query parameters, except if x differs."
>> has no double negative). Drawing the clear connection with Vary (which is
>> well-known), though, seemed worth the double negative.
>
> Respectfully, I disagree.
>
> This header needs to be processed by caching intermediaries and client
> for caching, just like "Vary" needs to be processed for caching
> variants. Caching variants will have to process and *invert* the logic
> of "No-Vary-Search" to produce the "Vary"-set of varying parameters.
>
> Put another way, storing the variant requires construction of a cache
> key for the variant, which is some sort of encoding of the varying
> parameters.
> Since the set of varying parameters needs to be collected,
> it makes more sense to me to have Vary and Vary-Search, rather than
> Vary and invert-the-logic-of No-Vary-Search. However, since
> No-Vary-Search supports both positive and a negative ("except") ways to
> define the varying search parameters or non-varying search parameters,
> you might argue the opposite, depending on which version (positive or
> negative) of No-Vary-Search you think may be used more frequently.
>
>>> 7. Privacy Considerations
>>>
>>> The ability to cache variants based on search parameters could possibly
>>> compromise privacy due to fingerprinting and the ability to detect cache
>>> hit versus cache miss even with coarse timing resolution.
>>>
>>
>> Can you elaborate? If anything, I would have expected that disregarding
>> certain query parameters would mean that someone probing the cache can
>> learn *less* about which values of that query parameter have been seen by
>> the cache previously.
>>
>> In the context of web browsers, this sort of attack is also mitigated by HTTP
>> cache partitioning
>> <https://developer.chrome.com/blog/http-cache-partitioning> which is now
>> specified (incompletely) by the WHATWG Fetch standard
>> <https://fetch.spec.whatwg.org/#http-cache-partitions>.
>
> If I have 10 different sets of 16-urls each, can I use caching to create
> a tracking identifier if I assign the client one url from each of those
> 10 sets and create a 10-hexdigit identifier? Fetching all the URLs and
> detecting which ones are cached might reveal the identifier? When I
> assign the URLs, I can assign a non-vary parameter 'tracker=1'. When
> a different page on a different site is requesting all the URLs to
> detect which are cached, the client can add non-vary parameter
> 'tracker=0'. The server responding can assign HTTP caching headers
> based on the response. Anyway, this is not my area of expertise.
> Yes, CORS headers factor in, but takes effort to set up properly, and
> "properly" might be malicious if those headers are from malicious sites.
>
> https://coveryourtracks.eff.org/
>
> Some of your colleagues at Google would be able to better explain all the
> creative ways Google is fingerprinting clients in addition to cookies.
>
> Cheers, Glenn
>
>>> Cheers, Glenn
>>>
>>>
>>> On Wed, Jun 12, 2024 at 01:23:23PM -0400, Jeremy Roman wrote:
>>>> In the interest of continuing discussion on this list, the WICG draft has
>>>> been reformatted in RFC format and reported to the Datatracker:
>>>>
>>>> https://datatracker.ietf.org/doc/draft-wicg-http-no-vary-search/01/
>>>> or directly on GitHub
>>>> https://jeremyroman.github.io/http-no-vary-search/draft-wicg-http-no-vary-search.html
>>>>
>>>> The text has been left mostly unchanged so far (modulo very small editorial
>>>> changes), and does not yet reflect any change to RFC 9111 behavior (though
>>>> hopefully it's clear what those changes would be, from the existing text).
>>>>
>>>> On Tue, Mar 19, 2024 at 2:26 AM Mark Nottingham <mnot@mnot.net> wrote:
>>>>
>>>>> Hi Jeremy,
>>>>>
>>>>>> On 19 Mar 2024, at 11:44, Jeremy Roman <jbroman@chromium.org> wrote:
>>>>>>
>>>>>> Unfortunately it is not possible for me to join personally (time zones
>>>>>> and personal complications). We might be able to brief a Chrome team member
>>>>>> who is attending if there is interest (depending when this is), though as
>>>>>> you point out it would necessarily be a fairly brief overview on short
>>>>>> notice (so it might not be possible).
>>>>>
>>>>> It doesn't look likely that we'll have time for additional presentations.
>>>>> I'd suggest continuing the discussion on the list.
>>>>>
>>>>> Just for some context -- we found this kind of capability useful when I
>>>>> was at Yahoo!
>>>>> way back in 2010:
>>>>> https://www.mnot.net/talks/pdf/Stupid_Web_Caching_Tricks.pdf#page=36
>>>>>
>>>>> Cloudflare supports configuration to ignore the whole query string, as
>>>>> well as specific arguments in it:
>>>>> https://developers.cloudflare.com/cache/how-to/cache-keys/
>>>>>
>>>>> As does Fastly:
>>>>> https://docs.fastly.com/en/guides/making-query-strings-agnostic
>>>>>
>>>>> https://www.fastly.com/documentation/solutions/examples/manipulate-query-string/
>>>>>
>>>>> As does Akamai (apparently, based upon the information available):
>>>>>
>>>>> https://community.akamai.com/customers/s/article/Remove-query-strings-from-forward-request-and-cache-key?language=en_US
>>>>>
>>>>> I know Varnish supports this as well; I've done it with Squid (using a
>>>>> helper) too. Not sure about eg nginx or Apache httpd.
>>>>>
>>>>> So I suspect it's safe to say there's interest in this general feature
>>>>> from people who use HTTP caches.
>>>>>
>>>>> The difference here is the control mechanism to invoke that behaviour --
>>>>> putting it in a response header is really nice because it's a)
>>>>> standardised, so (eventually) interoperable across implementations, and b)
>>>>> driven by the resource on the origin server, who has the most information
>>>>> about the URL's semantics (rather than relying on out-of-band
>>>>> configuration).
>>>>>
>>>>> However, when a cache has multiple stored responses and they have
>>>>> conflicting information about the cache key, we need to be careful about
>>>>> specifying the interaction. In a way, this is similar to Vary -- it faced a
>>>>> similar question, and the decisions made in its design made implementation
>>>>> difficult. We chose a different approach in Key and Variants to address
>>>>> that; we should probably have a similar discussion here.
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> --
>>>>> Mark Nottingham https://www.mnot.net/
>>>>>
>>>>>
>>>
>

--
Mark Nottingham https://www.mnot.net/
Received on Friday, 14 June 2024 01:09:29 UTC
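
Editorial illustration (not part of the original message): the thread discusses how a cache would have to turn `No-Vary-Search: params, except=("x")` into a cache key built from the parameters that still vary. The Python sketch below shows one way that could look. It is a rough approximation only: `cache_key` is a hypothetical helper, the `example.com` URLs are made up, Python's `urllib.parse` stands in for the WHATWG application/x-www-form-urlencoded parser, and the draft's other options are ignored.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def cache_key(url: str, except_params: frozenset[str]) -> str:
    """Approximate key for `No-Vary-Search: params, except=(...)`.

    Hypothetical helper: query parameters NOT named in `except_params`
    are dropped, so URLs that differ only in those parameters share a key.
    """
    parts = urlsplit(url)
    # Decode the query into (name, value) pairs, keeping empty values.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    # Only the "except" parameters still distinguish cached responses.
    kept = [(name, value) for name, value in pairs if name in except_params]
    # Assumption: relative order of the kept pairs is preserved, not sorted.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


a = cache_key("https://example.com/search?x=1&utm=a", frozenset({"x"}))
b = cache_key("https://example.com/search?utm=b&x=1", frozenset({"x"}))
c = cache_key("https://example.com/search?x=2&utm=a", frozenset({"x"}))
print(a == b, a == c)
```

Under these assumptions the `except` list is already the set of varying parameters, so no separate inversion step is needed for this form of the header; the plain `params=(...)` form would instead filter out the listed names.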