Re: support for non-ASCII in strings, was: signatures vs sf-date

On 03.12.2022 10:52, Willy Tarreau wrote:
> Hi Julian,
>
> On Sat, Dec 03, 2022 at 08:47:10AM +0100, Julian Reschke wrote:
>>> There are some cases where non-ASCII strings are needed in header fields; mostly when you're presenting something from the fields to a human. Those cases are not that common. However, there's a catch to adding them: if full Unicode strings were available in the protocol, many designers would understandably use them, because it's been drilled into all our heads that Unicode is what you use for strings.
>>>
>>> Hence, footgun.
>>
>> I would appreciate it if you could explain why there is a problem we need
>> to prevent, and what exactly that problem is. Do you have an example?
>
> The main problem I'm personally having with this is that lots of
> text-based processing (regexes etc.) designed to apply to a subset of
> the input set will first have to pass through some non-bijective
> transformation (typically iconv), and that's where problems start to
> happen: the usual issues such as accented letters losing their accents
> and turning into their plain counterparts, sometimes only after being
> converted to upper case, and so on, making it possible for some invalid
> contents to match certain rules on certain components. I am particularly
> worried about letting this enter the protocol. If I set up a rule saying
> that /static always routes to the static server, it means that /stàtic
> will not go there. But what if, further down the chain, this gets turned
> into /STATIC and then back to /static, finally matching an existing
> directory on the default server? You will of course tell me that this is
> a bad example because I'm putting it in the URL, but the problem is
> exactly the same with other headers. Causing such trouble in Link,
> Content-Type (for content-analysis evasion), or the path or domain in
> Set-Cookie etc. is really problematic. On the request path we could
> imagine such things landing as far as logs or databases, with some
> diacritics being accidentally turned into language symbols or
> delimiters.
>
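
For illustration, here is a minimal Python sketch of the kind of lossy,
non-bijective transformation being described; the accent-stripping and
case-mapping steps are assumptions chosen to make the collision visible,
not something any specific component is known to do:

    import unicodedata

    def fold(path: str) -> str:
        # Hypothetical clean-up some intermediary might apply:
        # upper-case, decompose, then drop combining marks (lossy).
        upper = path.upper()                          # '/stàtic' -> '/STÀTIC'
        decomposed = unicodedata.normalize('NFKD', upper)
        return decomposed.encode('ascii', 'ignore').decode('ascii')

    original = '/st\u00e0tic'        # '/stàtic': does not match a rule on /static
    print(fold(original).lower())    # '/static': after the round trip, it does
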
> I actually find it very nice that anything that is not computer-safe
> has to be percent-encoded; it clearly sets a boundary between the two
> worlds: the one that must match bytes, and the one that interprets
> characters, including homoglyphs, emojis, RTL vs. LTR, etc. The world
> has had several decades to adapt to this, and web development
> frameworks now make dealing with it seamless for developers. People set
> up blogs, shopping carts and discussion boards with a few lines of code
> without ever having to wonder how data are encoded over the wire.
>
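
As a small illustration of that boundary, a sketch using Python's
urllib.parse (a generic example, not anything prescribed by this thread):

    from urllib.parse import quote, unquote_to_bytes

    human = 'st\u00e0tic'             # what a human reads: 'stàtic'
    wire = quote(human, safe='')      # what crosses the wire: 'st%C3%A0tic' (pure ASCII)
    octets = unquote_to_bytes(wire)   # what a byte-matching component compares: b'st\xc3\xa0tic'

    assert wire.isascii()                    # the byte world stays within ASCII
    assert octets.decode('utf-8') == human   # the character world decodes only where it matters
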
> Computers don't need to know what characters *look like*, only how they
> are encoded. Humans mostly don't need to know how characters are
> encoded and are only interested in what they look like. The current
> situation serves both worlds perfectly fine, and a move in either
> direction would, in my opinion, break this important balance.
>
> We could of course imagine passing some information indicating how
> contents are supposed to be interpreted when that is not obvious from
> the header field name, but if applications use non-standard fields,
> they're expected either to know how to exploit their contents or to
> ignore the field. It has always been like this, and it has been fine.
> After all, nothing prevents one from passing percent-encoded sounds,
> images, or even shell code in headers if they want to. Right now such
> content is reliably transported all the way to its target.
>
> Just my two cents,
> Willy

More than 2 cents, actually :-)

Willy, let me ask a clarifying question. Since you mentioned that percent
escaping is fine, it seems that what you're worried about is actual
octets with the highest bit set appearing in an HTTP field value? Or am I
misreading that?
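
(Concretely, and just to make sure we mean the same thing, assuming UTF-8
and a hypothetical field name, the difference would be between these two
wire forms:)

    # Illustration only; 'X-Example' is a made-up field name.
    raw     = b'X-Example: st\xc3\xa0tic\r\n'   # raw UTF-8 octets; 0xC3 and 0xA0 have the high bit set
    escaped = b'X-Example: st%C3%A0tic\r\n'     # percent-escaped; every octet is ASCII

    assert any(b > 0x7f for b in raw)
    assert all(b <= 0x7f for b in escaped)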

Best regards, Julian

Received on Saturday, 3 December 2022 13:48:30 UTC