Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis

> On May 24, 2023, at 7:26 PM, Tommy Pauly <tpauly@apple.com> wrote:
> 
> Hello HTTP WG,
> 
> As part of the WGLC for draft-ietf-httpbis-sfbis, we’ve been discussing the inclusion of "Display Strings" (strings that allow Unicode content).
> 
> While not part of the initial scope of this -bis effort, this addition has had significant discussion and support expressed for inclusion.
> 
> This email starts a formal consensus call to determine if the working group would like to expand the scope of draft-ietf-httpbis-sfbis to include Display Strings, specifically to merge in the following pull request (modulo any editorial changes that are needed):
> 
> https://github.com/httpwg/http-extensions/pull/2494

I think this call would have been better split into parts, namely:

  a) should sfbis add a data type for display strings of non-ASCII content?

  b) should display strings be encoded as ASCII via pct-encoding?

  c) should the encoded characters be limited to %x22 ("), %x25 (%),
     and relatively safe non-ASCII non-control valid UTF-8?

I support (a) if we also require (c).

I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8
and we are specifically talking about new fields for which there
are no deployed parsers. Yes, I know what it says in RFC 9110.

I think (c) is a requirement regardless of how we do (b).

The PR doesn't clearly express any of these points. It says the
strings contain Unicode (a character set) but they obviously don't;
they contain sequences of unvalidated pct-encoded octets.
This allows arbitrary octets to be encoded for something that
is supposed to be a display string.

I don't think these are editorial questions. I think we need
to have at least rough consensus on *what* the feature is
allowed to contain before we add the feature to the spec.

If this is truly for a display string, the feature must be
specific about the encoding and allowed characters.
My suggestion would be to limit the string to non-CNTRL
ASCII and non-control valid UTF-8. We don't want to allow
anything that would twist the feature to some other ends.
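
To make that limit concrete, here is a minimal sketch in Python of
a check on the decoded octets. It is an illustration only, not text
from the PR: the function name is hypothetical, and treating
"control" as Unicode general category Cc (C0, DEL, and C1) is my
assumption about where the line would be drawn.

```python
import unicodedata


def is_display_safe(octets: bytes) -> bool:
    """Return True if octets are valid UTF-8 with no control characters.

    Assumption: "control" means Unicode general category Cc, which
    covers U+0000-001F, U+007F, and U+0080-009F.
    """
    try:
        # Strict decoding rejects overlong forms, surrogates,
        # truncated sequences, and code points above U+10FFFF.
        text = octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return not any(unicodedata.category(ch) == "Cc" for ch in text)


if __name__ == "__main__":
    for sample in ("füü".encode("utf-8"), b"a\x00b", b"\xc0\xaf"):
        print(sample, is_display_safe(sample))
```

Note that this does no normalization and makes no judgment about
format characters or noncharacters; it only enforces "valid UTF-8,
no controls".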

Assuming we do this with pct-encoding, we should not allow
arbitrary octets to be encoded. We should disallow encodings
that are unnecessary (normal printable ASCII aside from % and "),
encodings of control characters, and octets not valid for UTF-8.
That can be specified by prose and reference to the IETF specs,
or we could specify the allowed ranges with a regular expression.
Either one is better than allowing arbitrary octets to be encoded.
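
As a rough idea of what the regular expression option could look
like, here is a sketch in Python that works on the encoded form
(the content between the quotes). The names are hypothetical and
several details are my assumptions rather than anything in the PR:
printable ASCII other than % and " is sent unencoded, hex digits
are accepted in either case, and C1 controls (U+0080-009F) are
excluded along with the C0 set.

```python
import re

# One pct-encoded UTF-8 continuation octet (0x80-0xBF).
CONT = r"%[89ab][0-9a-f]"

# Pct-encoded forms of well-formed multi-octet UTF-8 sequences.
UTF8_PCT = "|".join([
    r"%c2%[ab][0-9a-f]",                 # U+00A0-00BF (skips C1 controls)
    r"%(?:c[3-9a-f]|d[0-9a-f])" + CONT,  # U+00C0-07FF
    r"%e0%[ab][0-9a-f]" + CONT,          # U+0800-0FFF (no overlong forms)
    r"%e[1-9a-c]" + CONT * 2,            # U+1000-CFFF
    r"%ed%[89][0-9a-f]" + CONT,          # U+D000-D7FF (no surrogates)
    r"%e[ef]" + CONT * 2,                # U+E000-FFFF
    r"%f0%[9ab][0-9a-f]" + CONT * 2,     # U+10000-3FFFF
    r"%f[1-3]" + CONT * 3,               # U+40000-FFFFF
    r"%f4%8[0-9a-f]" + CONT * 2,         # U+100000-10FFFF
])

# Unencoded printable ASCII except " and %, or %22, or %25,
# or one of the multi-octet sequences above.
DISPLAY_CONTENT = re.compile(
    r"(?:[\x20\x21\x23\x24\x26-\x7e]|%22|%25|" + UTF8_PCT + r")*",
    re.IGNORECASE,
)


def is_allowed_encoding(encoded: str) -> bool:
    """Return True if the whole encoded string uses only allowed forms."""
    return DISPLAY_CONTENT.fullmatch(encoded) is not None


if __name__ == "__main__":
    for sample in ("f%c3%bc%c3%bc", "%41", "%00", "%c0%af"):
        print(sample, is_allowed_encoding(sample))
```

This rejects unnecessary encodings such as %41, encoded controls
such as %00 or %0a, and anything that would not decode to valid
UTF-8. It does not try to exclude noncharacters or format
characters; that would be a further narrowing of "relatively safe".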

In general, it is safer to send raw UTF-8 over the wire in HTTP
than it is to send arbitrary pct-encoded octets, simply because
pct-encoding is going to bypass most security checks long enough
for the data to reach an application where people do stupid
things with strings that they assume contain something that is
safe to display.

Note that I am not saying that we should consider normalization
or any other weirdness specific to Unicode. We don't need to.
We just need to stay within the confines of what has already
been defined as valid and safe UTF-8. Everything else is being
actively targeted by pentesters and script kiddies, on every
public server on the Internet, to the point where we have to
block it within CDN configurations just to avoid overloading
the origin servers.

....Roy

Received on Thursday, 25 May 2023 17:21:57 UTC