Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis

Hi Roy,

OK, I think we're making progress; see below.

> On 27 May 2023, at 5:38 am, Roy T. Fielding <> wrote:
> On May 25, 2023, at 3:38 PM, Mark Nottingham <> wrote:
>> Hi Roy,
>>> On 26 May 2023, at 3:21 am, Roy T. Fielding <> wrote:
>>> I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8
>>> and we are specifically talking about new fields for which there
>>> are no deployed parsers. Yes, I know what it says in RFC 9110.
>> Yes, the parsers may be new, but in some contexts, they may not have access to the raw bytes of the field value. Many HTTP libraries and abstractions (e.g., CGI) assume an encoding and expose strings; some of those may apply the advice that HTTP has documented for many years and assume ISO-8859-1.
> That's not a problem in practice, since the data does not change.
> It just looks like messy characters on display.
> What would be a problem is if an implementation transcoded the values 
> incorrectly while being parsed, or used code-point lengths instead
> of octet lengths for measuring the memory allocated in copies.
> But again, we are not breaking such systems: they are already broken
> and insecure, and at worst we are doing folks a service by surfacing
> the bad code in a visible way.

Right, I'm not concerned about this.

> The valid systems we might be breaking would be those that parse
> for high-bit octets and reject the message as invalid. I do not
> know of any such systems because of the legacy of ISO-8859-*
> (especially among Cyrillic servers). In any case, such systems
> don't use display strings.
> However, I agree that it is hard for me to argue against my
> own long history of being unable to adopt UTF-8 in HTTP.
> I just find it annoying to assume that a totally new parser
> of a totally new field should somehow be constrained in the
> parsing of its values by a mere perception of what might be
> the case for legacy parsers that shouldn't even be looking
> at new fields.

But the point here is that it's not a matter of new vs. old parsers. It's a matter of the fields being parsed by an existing implementation (of which there are a large number) and then handed -- either as a string of some sort, or as binary that may have been converted to a string and back to binary again -- to a new Structured Fields parser. What gets handed to the SF parser matters.
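To make that hand-off concrete, here's a minimal Python sketch (standard library only) of a legacy layer that applies the documented ISO-8859-1 convention to a raw UTF-8 field value -- the octets survive, but the string the application sees is mojibake:

```python
# Hypothetical hand-off: a library decodes field bytes as ISO-8859-1
# before exposing them to the application, per HTTP's legacy advice.
raw = "Gruß".encode("utf-8")            # field value bytes on the wire: b'Gru\xc3\x9f'
as_latin1 = raw.decode("iso-8859-1")    # what the legacy layer hands the app

assert as_latin1.encode("iso-8859-1") == raw   # the octets round-trip losslessly...
assert as_latin1 != "Gruß"                     # ...but the exposed string is mojibake
```

This is consistent with Roy's point that the data doesn't change -- it just looks messy -- but it's also the string that would then be fed to an SF parser.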

>> In the past we've made decisions like this and chosen to be conservative. We could certainly break that habit now, but we'd need (at the least) to have a big warning that this type might not be interoperable with deployed systems. Personally, I don't think that's worth it, given the relative rarity that we expect for this particular type, and the relatively low overhead of encoding.
> If this were an important use case, I would agree with you.
> We are talking about a display string, which seems to be
> the perfect opportunity to find out what we can get away
> with changing.

Hmm. "Use Structured Fields, maybe your content won't look messy" is hardly a selling point. 

>>> The PR doesn't clearly express any of these points. It says the
>>> strings contain Unicode (a character set) but they obviously don't;
>>> they contain sequences of unvalidated pct-encoded octets.
>>> This allows arbitrary octets to be encoded for something that
>>> is supposed to be a display string.
>> [...]
>>> If this is truly for a display string, the feature must be
>>> specific about the encoding and allowed characters.
>>> My suggestion would be to limit the string to non-CNTRL
>>> ASCII and non-control valid UTF-8. We don't want to allow
>>> anything that would twist the feature to some other ends.
>>> Assuming we do this with pct-encoding, we should not allow
>>> arbitrary octets to be encoded. We should disallow encodings
>>> that are unnecessary (normal printable ASCII aside from % and "),
>>> control characters, or octets not valid for UTF-8. That can
>>> be specified by prose and reference to the IETF specs, or
>>> we could specify the allowed ranges with a regular expression.
>>> Either one is better than allowing arbitrary octets to be encoded.
>> I think that's reasonable and we can discuss improvements after adopting the PR.
> I think the pct-encoding feature is actively dangerous without
> those constraints because it encourages a means to bypass HTTP's
> normal safeguards. I don't want to discuss them as improvements.
>>> In general, it is safer to send raw UTF-8 over the wire in HTTP
>>> than it is to send arbitrary pct-encoded octets, simply because
>>> pct-encoding is going to bypass most security checks long enough
> for the data to reach an application where people do stupid
>>> things with strings that they assume contain something that is
>>> safe to display.
>> That's an odd assertion - where are those security checks taking place?
> In places like the Fastly config, right now, though I only do that
> for an incoming request-target when I don't need a premium WAF.
> For example (extracted from an error snippet):
>   if (var.path ~ {"%[0-7][0-9A-Fa-f]"}) {
>     set obj.http.x-error = "Forbidden encoded ASCII in URL path";
>     set obj.status = 403;
>     set obj.response = "Forbidden";
>     return (deliver);
>   }
> [Note that this is making assumptions about what is allowed
> in a URL path that is specific to the origin servers behind
> this CDN. It is not a universal config.]
> Others use a WAF (or mod_security rules) applied to various
> parts of a request message, or just bayesian analysis of
> example fails.
> What I mean by this odd assertion is that raw UTF-8 sent
> through the message parsing algorithm of HTTP will result
> in a very obvious message for recipients on the backend,
> even if it contains unwanted characters, whereas pct-encoding
> makes the message look safe until it passes through the checks
> and it reaches a point in later processing where an application
> (perhaps unaware of the source of that data) foolishly
> decodes the string without expecting it to contain
> arbitrary octets that might become command invocations,
> request smuggling, or cache poisoning.

Right. Taking that Fastly rule as an example:

The question, I think, is whether and when tripping this kind of rule is desirable.

If we use UTF-8 on the wire for Display Strings, that kind of rule will fire in any WAF that's deployed it, blocking the message. Because many WAF configurations aren't easily accessible to the developers who are deploying new headers, that will create friction against adoption of Display Strings, and push developers towards workarounds like hiding the content in sf-binary or doing ad hoc percent encoding.

If we percent-encode Display Strings, that rule will not trip. The encoded data will at least be in a standard format that CRS (etc.) can eventually learn about. However, as you point out, in the meantime the WAF isn't "seeing" the content.

So I think I agree with you that a profile of the allowable characters is needed, but still disagree that putting UTF-8 on the wire is a good idea.
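For reference, here's a sketch of what the decode side of a percent-encoded Display String might look like -- failing loudly rather than substituting replacement characters. The function name and the lowercase-hex requirement are my reading of the PR's direction, not settled text:

```python
import re

def decode_display_string(inner: str) -> str:
    """Percent-decode the inner value of a candidate Display String and
    validate it as UTF-8, raising on any error rather than silently
    repairing it.  (Decode side only; not the full SF parser.)"""
    out = bytearray()
    i = 0
    while i < len(inner):
        if inner[i] == "%":
            # assume lowercase hex is required, per the PR's direction
            if not re.fullmatch(r"[0-9a-f]{2}", inner[i + 1:i + 3]):
                raise ValueError("malformed pct-encoding")
            out.append(int(inner[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(inner[i]))  # inner is ASCII at this point
            i += 1
    return out.decode("utf-8")  # strict: raises UnicodeDecodeError on bad UTF-8
```

With this shape, `decode_display_string("f%c3%bcr Elise")` yields `"für Elise"`, while something like `"%ff"` raises -- which is the "surface the bad code" behaviour rather than passing arbitrary octets downstream.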

> Of course, there is nothing preventing such pct-encoding from
> being included in any non-literal part of an HTTP message,
> which is what pentesters and script kiddies are constantly
> running against our Web properties (and those of our CMS
> customers) in the hope of finding some application, somewhere
> downstream, that will fail to validate the data it receives.
> This feature won't change that.


> The problem is that it takes what is normally considered
> an evil encoding (if found anywhere other than an expected
> URI-reference or x-url-encoded content) and calls it a
> "good encoding" for a display string, which means we will
> have to worry about breaking a new feature of HTTP instead
> of just blocking all bad strings.

Maybe, but there's considerable precedent; percent encoding is currently _the_ way to do this in HTTP fields (per RFC 5987).

> Even so, I can live with pct-encodings when they are restricted
> to a reasonably safe range of characters for display.
> For example,
> % pcre2grep -e '^([\x20-\x21\x23-\x24\x26-\x5B\x5D-\x7E]|\x5C[\x22\x5C]|%((2[25])|([Cc][2-9A-Fa-f]%[89A-Fa-f][0-9A-Fa-f])|([Dd][0-9A-Fa-f]%[89A-Fa-f][0-9A-Fa-f])|([Ee][0-9A-Fa-f](%[89A-Fa-f][0-9A-Fa-f]){2})|([Ff][0-4](%[89A-Fa-f][0-9A-Fa-f]){3})))*$'
> which, IIRC, is a safe subset of display string characters
> that allows printable ASCII (aside from " and %), safe
> non-ASCII UTF-8 as pct-escapes (regardless of current
> Unicode code points), and disallows the unsafe UTF-8.
> Alternatively, require that pct-encoding be limited to %22, %25,
> and pct-encoded sequences of valid non-ASCII, non-control, UTF-8
> octets, as defined by [UTF-8].
> It's somewhat pedantic, but guides implementations toward
> detecting such errors rather than ignoring them as someone
> else's problem. Also, it is something people can implement with
> interoperability, rather than a string of Unicode characters
> in general (which isn't).

I'm far from a Unicode expert, but my understanding is that regex protects against some but not all issues with inbound UTF-8. For example, this still allows code points greater than U+10FFFF. 

Which is fine, but we should be careful in what we claim it does and doesn't do -- <> might be a good reference point. 
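To sanity-check that claim, here's a rough Python translation of the pcre2grep pattern (assuming I've transcribed the character classes correctly) together with what it does and doesn't reject:

```python
# Rough re-expression of Roy's pattern; classes copied from his regex.
import re

SAFE = re.compile(
    r"^(?:[\x20-\x21\x23-\x24\x26-\x5B\x5D-\x7E]"      # printable ASCII minus " % \
    r"|\\[\x22\x5C]"                                    # escaped " and \
    r"|%(?:2[25]"                                       # pct-encoded " and %
    r"|[Cc][2-9A-Fa-f]%[89A-Fa-f][0-9A-Fa-f]"           # 2-byte UTF-8
    r"|[Dd][0-9A-Fa-f]%[89A-Fa-f][0-9A-Fa-f]"
    r"|[Ee][0-9A-Fa-f](?:%[89A-Fa-f][0-9A-Fa-f]){2}"    # 3-byte UTF-8
    r"|[Ff][0-4](?:%[89A-Fa-f][0-9A-Fa-f]){3}))*$"      # 4-byte UTF-8
)

assert SAFE.match("Hello, world")   # plain printable ASCII passes
assert SAFE.match("%c3%a9")         # é as pct-encoded UTF-8 passes
assert not SAFE.match("%00")        # bare pct-encoded control octet rejected
assert SAFE.match("%e0%80%80")      # but an overlong (invalid) sequence slips through
```

The last line is the "some but not all" point: the lead/continuation byte ranges are right, but overlong sequences, surrogates, and code points above U+10FFFF (e.g. `%f4%90%80%80`) still get through, so a strict UTF-8 decoder is needed behind it regardless.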

I'd be OK with including this in the spec, but would want to talk about how it's integrated. We can certainly describe the data model (Section 3) in these terms, but for parsing/serialisation, I'd want to make it possible to _not_ apply a regex like this (or whatever we settle on) if one has confidence that the implementation doing the decoding handles UTF-8 errors correctly -- which is going to be the case for most modern implementations, AIUI.


Mark Nottingham

Received on Saturday, 27 May 2023 01:27:15 UTC