Re: support for non-ASCII in strings, was: signatures vs sf-date

Hi Julian,

On Sat, Dec 03, 2022 at 08:47:10AM +0100, Julian Reschke wrote:
> > There are some cases where non-ASCII strings are needed in header fields; mostly, when you're presenting something to a human from the fields. Those cases are not as common. However, there's a catch to adding them: if full unicode strings were available in the protocol, many designers will understandably use them because it's been drilled into all our heads that unicode is what you use for strings.
> > 
> > Hence, footgun.
> 
> I would appreciate if you would explain why there is a problem we need
> to prevent, and what exactly that problem is. Do you have an example?

The main problem I personally have with this is that a lot of text-based
processing (regexes etc.) designed to apply to a subset of the input set
will first have to pass through some non-bijective transformation
(typically iconv), and that's where the problems start: the usual issues
such as accented letters losing their accents and turning into the plain
ones, sometimes only after being converted to upper case, and so on,
making it possible for some invalid contents to match certain rules on
certain components. I am particularly worried about letting this enter
the protocol. If I set up a rule saying that /static always routes to the
static server, it means that /stątic will not go there. But what if,
further down the chain, this gets turned into /STATIC and then back into
/static, finally matching an existing directory on the default server?
You will of course tell me that this is a bad example since I'm putting
it in the URL, but the problem is exactly the same with other header
fields. Causing such trouble in Link, Content-Type (for content-analysis
evasion), or the path or domain in Set-Cookie etc. is really problematic.
On the request path we could imagine such things landing as far as logs
or databases, with some diacritics accidentally being turned into
language symbols or delimiters.
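
To make the concern concrete, here is a minimal sketch (Python, with a
purely hypothetical matches_static() routing rule; the exact chain of
transformations obviously varies per component) of how a round trip
through case conversion and accent stripping can make a path that was
correctly rejected by a byte-level rule end up matching it:

    import unicodedata

    # Hypothetical rule: only paths under /static go to the static server.
    def matches_static(path):
        return path.startswith("/static")

    path = "/st\u0105tic"            # "/stątic": looks like /static but is not
    print(matches_static(path))      # False: the rule correctly skips it

    # Somewhere down the chain, one component upper-cases the path, another
    # strips diacritics (a non-bijective transformation, in the spirit of
    # iconv's //TRANSLIT), and a third lower-cases it again.
    upper = path.upper()                               # "/STĄTIC"
    decomposed = unicodedata.normalize("NFKD", upper)  # "Ą" -> "A" + combining ogonek
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = stripped.lower()                         # "/static"
    print(matches_static(lowered))   # True: now it matches an existing directory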

I actually find it very nice that anything that is not computer-safe has
to be percent-encoded; it clearly draws a line between the two worlds:
the one that must match bytes, and the one that interprets characters,
including homoglyphs, emojis, RTL vs LTR etc. The world has had several
decades to adapt to this, and web development frameworks now make it
seamless for developers to deal with. People set up blogs, shopping
carts and discussion boards with a few lines of code without ever having
to wonder how the data are encoded on the wire.
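
For what it's worth, a quick illustrative sketch (Python again, nothing
normative) of what that boundary looks like in practice: the wire form
stays plain ASCII bytes that either match or don't, and the decode back
to characters only happens when and where a consumer chooses to do it:

    from urllib.parse import quote, unquote

    path = "/st\u0105tic"             # the human-visible, character form
    wire = quote(path, safe="/")      # "/st%C4%85tic": computer-safe ASCII
    print(wire)
    print(wire == "/static")          # False: byte comparison is unambiguous
    print(unquote(wire))              # "/stątic": the character view, on demand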

Computers don't need to know what characters *look like*, only how they
are encoded. Humans mostly don't need to know how characters are encoded
and are only interested in what they look like. The current situation
serves both worlds perfectly fine, and a move in either direction would
break this important balance in my opinion.

We could of course imagine passing some info indicating how contents are
supposed to be interpreted when that's not obvious from the header field
name, but if applications use non-standard fields, they're expected
either to know how they are supposed to use their contents, or to ignore
the field. It has always been like this and it has been fine. After all,
nothing prevents anyone from passing percent-encoded sounds, images, or
even shell code in headers if they want to. Right now it's reliably
transported all the way to its target.

Just my two cents,
Willy

Received on Saturday, 3 December 2022 09:53:03 UTC