Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis

Hi Mark,

On Sat, May 27, 2023 at 11:27:04AM +1000, Mark Nottingham wrote:
> > Others use a WAF (or mod_security rules) applied to various
> > parts of a request message, or just bayesian analysis of
> > example fails.
> > 
> > What I mean by this odd assertion is that raw UTF-8 sent
> > through the message parsing algorithm of HTTP will result
> > in a very obvious message for recipients on the backend,
> > even if it contains unwanted characters, whereas pct-encoding
> > makes the message look safe until it passes through the checks
> > and it reaches a point in later processing where an application
> > (perhaps unaware of the source of that data) foolishly
> > decodes the string without expecting it to contain
> > arbitrary octets that might become command invocations,
> > request smuggling, or cache poisoning.
> Right. For example:
> The question, I think, is whether and when tripping this kind of rule is desirable.
> If we use UTF-8 on the wire for Display Strings, that rule will fail in any
> WAF that's deployed it. Because many WAF configurations aren't easily
> accessible to the developers who are deploying new headers, that will create
> friction against its adoption, and cause them to do things like hide it in
> sf-binary or do percent encoding ad hoc.

Well, for me this is not the right question, because it validates the
ossification of the protocol caused by poorly configured components. We
should use what the protocol permits, not what some people allow in
their configs in a quest to show a maximum number of rejects to impress
their customers.

> If we percent-encode Display Strings, that rule will not trip. The encoded
> data will at least be in a standard format that CRS (etc.) can eventually
> learn about. However, as you point out, in the meantime the WAF isn't
> "seeing" the content.

And it can be the exact opposite. Very few headers take a percent encoding,
basically only those transporting a URL such as Location, Referer and Link,
and 15 years ago I myself saw WAF rules that systematically blocked the
percent character in any header, on the grounds that its only plausible
purpose there was to attempt SQL injection or shell code injection via
the generic decoding layer in the application.

> So I think I agree with you that a profile of the allowable characters is
> needed, but still disagree that putting UTF-8 on the wire is a good idea.

As much as I despise UTF-8 for all the trouble, security issues and
confusion it brings, I tend to think it would be safer to transport it
with limited restrictions than to percent-encode it. In fact I wouldn't
claim that this is UTF-8, I would say that these are sequences of
non-control bytes in whatever ASCII-derived charset, and that the
encoding is advertised somewhere else. But in any case it would exclude
0x00-0x1F. Proceeding like this removes the need for implementations to
change their code every 6 months when new code points are assigned to
support new batches of stupid emojis. This is a serious constraint that
I agree on with PHK, because lots of people count on their gateway to
obey the rule of not transporting forbidden stuff. If the forbidden stuff
is unclear or evolves over time, that becomes particularly confusing.
Instead, saying we're transporting 0x20-0xFF relieves the implementations
from checking for valid byte sequences and leaves that to the consumer,
who has all the context to judge how to proceed (including displaying a
code box when some breakage is found).
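
To illustrate what "leaving it to the consumer" could look like, here is
a minimal Python sketch. It is only an illustration under assumptions:
the function name render_display_string and the idea that the charset is
known out of band are hypothetical, not taken from any draft or
implementation. The consumer decodes with the advertised charset and
shows U+FFFD (the usual "code box") wherever the bytes are broken,
instead of any intermediary rejecting the field.

```python
def render_display_string(raw: bytes, charset: str = "utf-8") -> str:
    """Decode bytes received as-is for display to a human.

    Hypothetical consumer-side helper: `charset` is assumed to be
    advertised elsewhere. Undecodable sequences are not an error;
    errors="replace" substitutes U+FFFD for them.
    """
    return raw.decode(charset, errors="replace")


valid = render_display_string("café".encode("utf-8"))
broken = render_display_string(b"caf\xe9")  # Latin-1 byte, invalid as UTF-8
```

With these inputs, the valid string decodes unchanged, and the broken one
is still rendered, with the offending byte shown as a replacement
character.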

However, where I disagree with PHK is on using sf-binary for this, because
we don't want to be 100% transparent in fields used by application layers,
which usually have little knowledge of the implications of passing control
chars that might get decoded and placed as-is in headers, opening the way
to smuggling attacks. That's also why I don't like the use of pct-encoding
for this. I prefer that applications render strings incorrectly from time
to time rather than let anything pass or make validation complicated.
Another problem with percent encoding that we all know is that it doesn't
survive multiple hops well. Every time you chain multiple components that
apply various inspection or filtering, you can be sure that at least one
of them will not resist the temptation to decode a %25 in place before
passing it on. And when it's not the component itself, it's the user who
rewrites it from the configuration using ad-hoc mechanisms before
inspecting the string contents.
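
The multi-hop problem can be shown with a short Python sketch. The header
value below is made up for the example, and each call to
urllib.parse.unquote stands in for one hypothetical intermediary that
decodes in place.

```python
from urllib.parse import unquote

# Made-up value as sent on the wire: the CR LF is double-encoded,
# so a filter looking for "%0d%0a" or for raw CR/LF sees nothing.
wire = "name%250d%250aX-Injected: 1"

# First hop decodes once "to inspect" and forwards the result.
after_hop1 = unquote(wire)

# A later component (or the application) decodes again.
after_hop2 = unquote(after_hop1)

looked_safe_on_wire = "\r" not in wire and "%0d" not in wire
contains_crlf_at_end = "\r\n" in after_hop2
```

After the first decoding the value contains a literal "%0d%0a", and after
the second it contains a real CR LF followed by what looks like another
header field, even though nothing suspicious was visible on the wire.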

As such I'd just pass 0x20-0xFF as-is without encoding, which allows
plain ASCII, UTF-8, ISO-LATIN* for human consumption, without tabs,
CR, LF etc. that humans do not need when reading a message. When these
strings slip into another header field (because they will), they will
remain harmless and still easy to process if needed.
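
For a low-level component, the check described above reduces to a byte
range test. A minimal Python sketch, where the name is_transportable is
hypothetical and the range is exactly the 0x20-0xFF stated above:

```python
def is_transportable(value: bytes) -> bool:
    """Accept a value only if every byte is in 0x20-0xFF.

    Hypothetical gateway-side check: no charset knowledge and no UTF-8
    validation, only rejection of 0x00-0x1F (tab, CR, LF, NUL, ...).
    Note: 0x7F (DEL) falls inside the stated range, so it passes here.
    """
    return all(b >= 0x20 for b in value)


ok_utf8 = is_transportable("héllo wörld".encode("utf-8"))
ok_latin1 = is_transportable("héllo wörld".encode("latin-1"))
rejected_crlf = is_transportable(b"hello\r\nX-Injected: 1")
rejected_tab = is_transportable(b"hello\tworld")
```

The same function accepts UTF-8 and ISO-8859-1 input alike, and accepts
byte sequences that are invalid UTF-8, since judging those is left to the
consumer.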

> I'm far from a Unicode expert, but my understanding is that regex protects
> against some but not all issues with inbound UTF-8. For example, this still
> allows code points greater than U+10FFFF. 

And regexes are not practical for filtering in low-level components, and
are often blindly copy-pasted by admins and left in place forever by
people who think they're now safe.

> Which is fine, but we should be careful in what we claim it does and doesn't do -- <> might be a good reference point. 

Exactly the reason for my point above: we must not claim to convey valid
UTF-8, nor even UTF-8; we convey strings in extended character sets.

> I'd be OK with including this in the spec, but would want to talk about how
> it's integrated. We can certainly describe the data model (Section 3) in
> these terms, but for parsing/serialization, I'd want to make it possible to
> _not_ apply a regex like this (or whatever) if one has confidence that the
> implementation doing the decoding is handling UTF-8 errors correctly, which
> is going to be the case for most modern implementations AIUI.

For me it's important that all lower-level components that have no
business interfering with strings consumed by humans do not have to
parse or verify them at all. It should be perfectly valid to convey the
usual UTF-8 attacks like non-canonical sequences, homoglyphs, RTL
switching, truncated and overlong code sequences without the whole chain
being declared non-compliant. Otherwise the alternative will be simple:
it will consist in just dropping these fields :-/


Received on Saturday, 27 May 2023 08:38:06 UTC