Libraries assuming iso-8859-1 (was: Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis)

On 2023-05-26 07:38, Mark Nottingham wrote:
> Hi Roy,
> 
>> On 26 May 2023, at 3:21 am, Roy T. Fielding <fielding@gbiv.com> wrote:
>>
>> I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8
>> and we are specifically talking about new fields for which there
>> are no deployed parsers. Yes, I know what it says in RFC 9110.
> 
> Yes, the parsers may be new, but in some contexts, they may not have access to the raw bytes of the field value. Many HTTP libraries and abstractions (e.g., CGI) assume an encoding and expose strings; some of those may apply the advice that HTTP has documented for many years and assume ISO-8859-1.

This is a valid point, but I think it can be addressed rather easily.

The solution is simply to move back from characters to bytes using 
ISO-8859-1, and then from bytes back to characters using UTF-8. The 
parser for Display Strings can do this.

In Ruby, assuming the string's encoding is ISO-8859-1 (Ruby carries an 
encoding with each string and can, to some extent, deal with multiple 
strings in different encodings, although these days it's mostly just 
everything in UTF-8), this would simply be
    display_string.force_encoding('UTF-8')
(this changes only the interpretation of the underlying bytes, not the 
bytes themselves).

In most other languages, where the actual string encoding is opaque and 
uniform (such as Python), this could be done with something like the 
following:
    display_string.encode('iso-8859-1').decode('utf-8')
Although I probably have written less than a dozen lines of Python code 
in my whole life, I checked this with
 >>> b'\xe2\x82\xac'.decode("iso-8859-1").encode('iso-8859-1').decode('utf-8')
which successfully printed
'€'
(the first "decode" is what the general HTTP library would do; the 
following encode/decode is what the structured header parser would do).
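The two stages can be sketched end-to-end in Python (the helper names 
here are hypothetical, chosen only for illustration):

```python
def http_library_decode(raw: bytes) -> str:
    # What a generic HTTP library might do: expose the field value as a
    # str by assuming ISO-8859-1, per long-standing HTTP advice.
    return raw.decode('iso-8859-1')

def parse_display_string(field_value: str) -> str:
    # What the Display String parser would then do: round-trip through
    # ISO-8859-1 to recover the original bytes, then interpret them as
    # UTF-8.
    return field_value.encode('iso-8859-1').decode('utf-8')

raw = b'\xe2\x82\xac'  # UTF-8 encoding of U+20AC EURO SIGN
assert parse_display_string(http_library_decode(raw)) == '\u20ac'
```

Because ISO-8859-1 maps every byte 0x00-0xFF to exactly one code point, 
the round trip is lossless.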

Of course, in a language such as Java, the whole thing would be a bit 
longer, having to instantiate a CharsetEncoder and a CharsetDecoder and 
so on :-(.

> Yes, in many cases you can use UTF-8 on the wire successfully. However, making that assumption is a local convention; we can't assume that it holds for the entire Internet, because we don't know all of the various implementations that have been deployed and how they behave. All we know is a) how the implementations we've seen behave, and b) what we've written down before.

The 'wire' (which for the moment I assume to be TCP and below, or TLS 
and below) just transports bytes, and therefore should not be a problem.
Problems may occur at places where (contrary to the HTTP specs) 
ISO-8859-1 isn't passed through in headers. There may also be problems 
in cases where ISO-8859-1 is interpreted as excluding the bytes in the 
range 0x80 to 0x9F (the C1 control range). But I think it's fair to say 
that such cases should be very rare.

There may also be implementations that just cut off the most significant 
bit in each byte, or otherwise don't let non-ASCII bytes through. It 
would be good to know whether such cases actually have been reported, or 
whether that's just something we think might be out there but isn't 
actually confirmed.


> In the past we've made decisions like this and chosen to be conservative. We could certainly break that habit now, but we'd need (at the least) to have a big warning that this type might not be interoperable with deployed systems. Personally, I don't think that's worth it, given the relative rarity that we expect for this particular type, and the relatively low overhead of encoding.

If by overhead you mean processing, then I agree that it's low. If you 
mean size, I think the situation is different. Here's a little table for 
some of the most important scripts, giving the byte count per character 
under a legacy encoding, raw UTF-8, and the proposed (percent-encoded 
UTF-8) form, plus the expansion factor relative to the legacy encoding:

                              Legacy  UTF-8   proposed  expansion
ASCII                            1      1        1         1
Latin+Accents, e.g. Polish       1     ~1.5     ~2         2
Arabic/Cyrillic/...              1      2        6         6
Indic scripts, ...               1      3        9         9
Chinese/Japanese/...             2      3        9         4.5

So some text in an Indic or South Asian script gets expanded by a factor 
of 9 when compared to a legacy single-byte encoding.
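The non-ASCII rows of the table are easy to reproduce: each UTF-8 byte 
becomes three characters ("%XY") under percent-encoding. A small sketch 
(the legacy encodings here are picked for illustration; ISCII for Indic 
scripts isn't in the Python standard library):

```python
from urllib.parse import quote

samples = [
    # (label, character, legacy encoding or None)
    ('Cyrillic',   '\u0434', 'koi8-r'),  # д: 1 legacy byte
    ('Devanagari', '\u0915', None),      # क: no stdlib legacy codec
    ('Chinese',    '\u4e2d', 'gbk'),     # 中: 2 legacy bytes
]

for label, ch, legacy in samples:
    utf8 = ch.encode('utf-8')
    pct = quote(utf8, safe='')           # e.g. '%D0%B4' for д
    legacy_len = len(ch.encode(legacy)) if legacy else None
    print(label, legacy_len, len(utf8), len(pct))
```

For the Devanagari character, 3 UTF-8 bytes become 9 percent-encoded 
characters, matching the factor of 9 above.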


>> In general, it is safer to send raw UTF-8 over the wire in HTTP
>> than it is to send arbitrary pct-encoded octets, simply because
>> pct-encoding is going to bypass most security checks long enough
>> for the data to reach an applications where people do stupid
>> things with strings that they assume contain something that is
>> safe to display.
> 
> That's an odd assertion - where are those security checks taking place?

I don't know about headers in general, but I hope people remember the 
attacks where it was possible to smuggle a path like 
/abc/def/../../xyz.html by percent-encoding (part of) "/../../", and so 
access the file xyz.html even though it was access-protected, because 
the access check happened before the decoding.
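The pattern is easy to demonstrate; here is a deliberately broken 
access check of the kind described above (a hypothetical sketch, not 
any particular server's code):

```python
from urllib.parse import unquote

def naive_access_check(path: str) -> bool:
    # Broken: inspects the still-encoded path, so "%2e%2e" slips
    # through because it does not literally contain "..".
    return '..' not in path

encoded = '/abc/def/%2e%2e/%2e%2e/xyz.html'
assert naive_access_check(encoded)       # the check passes...
assert '..' in unquote(encoded)          # ...but the decoded path traverses
```

The fix, of course, is to decode (and normalize) before checking, which 
is exactly the ordering mistake the original attack exploited.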

Regards,   Martin.

Received on Sunday, 28 May 2023 06:44:53 UTC