- From: Ilari Liusvaara <ilariliusvaara@welho.com>
- Date: Fri, 26 May 2023 12:52:31 +0300
- To: HTTP Working Group <ietf-http-wg@w3.org>
On Thu, May 25, 2023 at 10:21:34AM -0700, Roy T. Fielding wrote:
>
> If this is truly for a display string, the feature must be
> specific about the encoding and allowed characters.
> My suggestion would be to limit the string to non-CNTRL
> ASCII and non-control valid UTF-8. We don't want to allow
> anything that would twist the feature to some other ends.

I think the set of allowed characters should be the 1,111,999 non-Cc
Unicode codepoints.

However, Unicode also has formatting control codepoints (including fun
ones like direction overrides), and the set of those is not necessarily
stable. Obviously, the effect of any formatting control should end with
the string.

> Assuming we do this with pct-encoding, we should not allow
> arbitrary octets to be encoded. We should disallow encodings
> that are unnecessary (normal printable ASCII aside from % and "),
> control characters, or octets not valid for UTF-8. That can
> be specified by prose and reference to the IETF specs, or
> we could specify the allowed ranges with a regular expression.
> Either one is better than allowing arbitrary octets to be encoded.

I think it would be safer to add exactly one backslash escape sequence
for each of the 1,111,904 codepoints that are neither Cc nor ASCII. The
escape sequences should consist only of printable ASCII and should not
contain a further backslash or double quote.

It is possible to assign the escape sequences such that the worst-case
overhead over UTF-8 is 1 byte per codepoint.

-Ilari
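
The two counts quoted above can be reproduced with a short script. This is only a sketch, under assumptions that are mine and not stated in the message: "codepoints" is read as Unicode scalar values (surrogates U+D800..U+DFFF are left out, since they cannot appear in valid UTF-8), and Cc is taken to be exactly U+0000..U+001F plus U+007F..U+009F. The names `is_cc` and `is_allowed` are made up for illustration.

```python
# Sketch: count the codepoints discussed above.
# Assumption: surrogates are excluded because they are not valid in UTF-8.
SURROGATES = range(0xD800, 0xE000)


def is_cc(cp: int) -> bool:
    """True for general category Cc: C0 controls, DEL, and C1 controls."""
    return cp <= 0x1F or 0x7F <= cp <= 0x9F


def is_allowed(cp: int) -> bool:
    """True for codepoints that are neither surrogates nor Cc."""
    return cp not in SURROGATES and not is_cc(cp)


# All non-Cc codepoints (the proposed set of allowed characters).
allowed = sum(1 for cp in range(0x110000) if is_allowed(cp))

# Non-Cc codepoints outside ASCII (the ones that would need an escape).
needs_escape = sum(1 for cp in range(0x80, 0x110000) if is_allowed(cp))

print(f"{allowed:,}")       # 1,111,999
print(f"{needs_escape:,}")  # 1,111,904
```

The difference between the two numbers is 95, which is the count of printable ASCII characters (U+0020..U+007E).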
Received on Friday, 26 May 2023 09:52:40 UTC