- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Thu, 25 May 2023 10:21:34 -0700
- To: Tommy Pauly <tpauly@apple.com>
- Cc: HTTP Working Group <ietf-http-wg@w3.org>
> On May 24, 2023, at 7:26 PM, Tommy Pauly <tpauly@apple.com> wrote:
>
> Hello HTTP WG,
>
> As part of the WGLC for draft-ietf-httpbis-sfbis, we’ve been discussing the inclusion of "Display Strings" (strings that allow Unicode content).
>
> While not part of the initial scope of this -bis effort, this addition has had significant discussion and support expressed for inclusion.
>
> This email starts a formal consensus call to determine if the working group would like to expand the scope of draft-ietf-httpbis-sfbis to include Display Strings, specifically to merge in the following pull request (modulo any editorial changes that are needed):
>
> https://github.com/httpwg/http-extensions/pull/2494

I think this would have been better in parts, namely

  a) should sfbis add a data type for display strings of non-ASCII content?

  b) should display strings be encoded as ASCII via pct-encoding?

  c) should the encoded characters be limited to %x22 ("), %x25 (%), and relatively safe non-ASCII non-control valid UTF-8?

I support (a) if we also require (c).

I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8 and we are specifically talking about new fields for which there are no deployed parsers. Yes, I know what it says in RFC 9110.

I think (c) is a requirement regardless of how we do (b).

The PR doesn't clearly express any of these points. It says the strings contain Unicode (a character set) but they obviously don't; they contain sequences of unvalidated pct-encoded octets. This allows arbitrary octets to be encoded for something that is supposed to be a display string.

I don't think these are editorial questions. I think we need to have at least rough consensus on *what* the feature is allowed to contain before we add the feature to the spec.

If this is truly for a display string, the feature must be specific about the encoding and allowed characters. My suggestion would be to limit the string to non-CNTRL ASCII and non-control valid UTF-8.
We don't want to allow anything that would twist the feature to some other ends.

Assuming we do this with pct-encoding, we should not allow arbitrary octets to be encoded. We should disallow encodings that are unnecessary (normal printable ASCII aside from % and "), control characters, or octets not valid for UTF-8. That can be specified by prose and reference to the IETF specs, or we could specify the allowed ranges with a regular expression. Either one is better than allowing arbitrary octets to be encoded.

In general, it is safer to send raw UTF-8 over the wire in HTTP than it is to send arbitrary pct-encoded octets, simply because pct-encoding is going to bypass most security checks long enough for the data to reach an application where people do stupid things with strings that they assume contain something that is safe to display.

Note that I am not saying that we should consider normalization or any other weirdness specific to Unicode. We don't need to. We just need to stay within the confines of what has already been defined as valid and safe UTF-8. Everything else is being actively targeted by pentesters and script kiddies, on every public server on the Internet, to the point where we have to block it within CDN configurations just to avoid overloading the origin servers.

....Roy
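
For illustration only, the pct-encoding restriction described above could be checked roughly as follows. This is a sketch in Python, not text from the PR. The function name decode_display_string is hypothetical. It assumes the input is just the content between the quotes, that hex digits may be upper or lower case, and that "control" means the C0 range, DEL, and the C1 range.

```python
import re

# One token is either a pct-encoded octet or a raw printable ASCII
# character other than '"' (0x22) and '%' (0x25).
_TOKEN = re.compile(r'%([0-9A-Fa-f]{2})|([\x20-\x21\x23-\x24\x26-\x7e])')


def decode_display_string(encoded: str) -> str:
    """Decode a pct-encoded display string, rejecting disallowed input.

    Raises ValueError for malformed escapes, unnecessary encodings,
    octets that are not valid UTF-8, and control characters.
    """
    octets = bytearray()
    pos = 0
    while pos < len(encoded):
        match = _TOKEN.match(encoded, pos)
        if match is None:
            raise ValueError(f"disallowed character or escape at offset {pos}")
        if match.group(1) is not None:
            octet = int(match.group(1), 16)
            # Printable ASCII other than '"' and '%' never needs encoding.
            if 0x20 <= octet <= 0x7E and octet not in (0x22, 0x25):
                raise ValueError(f"unnecessary encoding at offset {pos}")
            octets.append(octet)
        else:
            octets.append(ord(match.group(2)))
        pos = match.end()

    try:
        # Strict decoding rejects overlong forms, surrogates, and
        # truncated or stray continuation octets.
        text = octets.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("octets are not valid UTF-8") from exc

    for ch in text:
        cp = ord(ch)
        if cp < 0x20 or 0x7F <= cp <= 0x9F:
            raise ValueError(f"control character U+{cp:04X}")
    return text
```

With these assumptions, "f%c3%bc%c3%bc" decodes to "füü", while "%41" (unnecessary), "%00" (control), and "%c0%af" (overlong, so not valid UTF-8) are all rejected. No normalization or other Unicode-specific processing is attempted.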
Received on Thursday, 25 May 2023 17:21:57 UTC