Re: Libraries assuming iso-8859-1 (was: Re: Consensus call to include Display Strings in draft-ietf-httpbis-sfbis) from Mark Nottingham on 2023-05-29 (ietf-http-wg@w3.org from April to June 2023)

From: Mark Nottingham <mnot@mnot.net>
Date: Mon, 29 May 2023 11:16:53 +1000
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>
Cc: Roy Fielding <fielding@gbiv.com>, Tommy Pauly <tpauly@apple.com>, HTTP Working Group <ietf-http-wg@w3.org>
Message-Id: <E80E63C6-639E-41CC-BD40-6EA698092C14@mnot.net>

Hi Martin,

> On 28 May 2023, at 4:44 pm, Martin J. Dürst <duerst@it.aoyama.ac.jp> wrote:
> 
> On 2023-05-26 07:38, Mark Nottingham wrote:
>> Hi Roy,
>>> On 26 May 2023, at 3:21 am, Roy T. Fielding <fielding@gbiv.com> wrote:
>>> 
>>> I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8
>>> and we are specifically talking about new fields for which there
>>> are no deployed parsers. Yes, I know what it says in RFC 9110.
>> Yes, the parsers may be new, but in some contexts, they may not have access to the raw bytes of the field value. Many HTTP libraries and abstractions (e.g., CGI) assume an encoding and expose strings; some of those may apply the advice that HTTP has documented for many years and assume ISO-8859-1.
> 
> This is a valid point, but I think it can be addressed rather easily.
> 
> The solution is simply to move back to bytes using ISO-8859-1 and then to move from bytes to characters using UTF-8. This can be done by the parser for Display Strings.
> 
> In Ruby, assuming that the string's encoding is ISO-8859-1 (Ruby carries an encoding for each string, and can to some extent deal with multiple strings in different encodings, although these days, it's mostly just everything in UTF-8), this would just be
>   display_string.force_encoding('UTF-8')
> (this just changes the interpretation of the underlying bytes).
> 
> In most other languages, where the actual string encoding is opaque and uniform (such as Python), this would be done e.g. by something like the following:
>   display_string.encode('iso-8859-1').decode('utf-8')
> Although I probably have written less than a dozen lines of Python code in my whole life, I checked this with
> >>> b'\xe2\x82\xac'.decode("iso-8859-1").encode('iso-8859-1').decode('utf-8')
> which successfully printed
> '€'
> (the first "decode" is what the general HTTP library would do; the following encode/decode is what the structured header parser would do).
> 
> Of course, in a language such as Java, the whole thing would be a bit longer, having to instantiate a CharsetEncoder and a CharsetDecoder and so on :-(.

Yes - this would work in theory. 
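
For reference, the round trip you describe could be wrapped up roughly as below. This is only a sketch: the name recover_utf8 is made up for illustration, and it assumes the HTTP library really did decode the field value as ISO-8859-1 before handing it over.

```python
def recover_utf8(field_value: str) -> str:
    """Undo an ISO-8859-1 decode and reinterpret the bytes as UTF-8.

    Hypothetical helper; assumes the HTTP library produced `field_value`
    by decoding the raw field bytes as ISO-8859-1.
    """
    # Raises UnicodeEncodeError if the string holds code points above U+00FF,
    # i.e. the library did not actually decode as ISO-8859-1.
    raw = field_value.encode("iso-8859-1")
    # Raises UnicodeDecodeError if the sender did not send valid UTF-8.
    return raw.decode("utf-8")


# What a library assuming ISO-8859-1 would hand to the parser.
as_seen_by_parser = b"\xe2\x82\xac".decode("iso-8859-1")
print(recover_utf8(as_seen_by_parser))  # prints the euro sign
```

Note the two ways this can fail: the parser has no reliable way to know which decoding the library applied, so a wrong guess surfaces as an exception at best.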

My concern is mostly rooted in the fact that this approach is untested at any reasonable scale -- currently, a library can decode as UTF-8, iso-8859-1, or plain ASCII and it will "just work" because the amount of non-ASCII being transported in HTTP headers is minuscule. Any interop seen so far is either accidental, or because all implementations that touch the message flows are known to the folks deploying non-ASCII fields.

If we deploy non-ASCII at scale, it's going to bring out the bugs -- places where someone assumed headers were ASCII, either in servers, implementations of interfaces like CGI, libraries, or intermediaries.
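
To make that concrete, here is a small sketch of what the three decoding assumptions do with the same field bytes. The sample values are made up for illustration and do not come from any real deployment.

```python
def decode_attempts(raw: bytes) -> dict:
    """Decode `raw` under each assumption; record the error name on failure."""
    results = {}
    for charset in ("ascii", "iso-8859-1", "utf-8"):
        try:
            results[charset] = raw.decode(charset)
        except UnicodeDecodeError as exc:
            results[charset] = type(exc).__name__
    return results


# ASCII-only value: every assumption yields the same string.
ascii_only = decode_attempts(b"text/html")

# UTF-8 encoded non-ASCII value: the three assumptions disagree.
non_ascii = decode_attempts("caf\u00e9".encode("utf-8"))

print(ascii_only)
print(non_ascii)
```

With the ASCII-only value all three results are identical, which is why today's traffic hides the problem. With the non-ASCII value the ASCII decoder raises, the ISO-8859-1 decoder silently returns a garbled string, and only the UTF-8 decoder returns what the sender meant.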

Now, we could say that those are exceptions few and far between, and that eventually those bugs will be corrected. However, the lingering doubt is going to stop folks (especially browser vendors) from using Display Strings, because they've been around this block many times -- esoteric issues in Web infrastructure have a very long half-life. See also David Benjamin's mail.

So, while I think it would be great if we could use UTF-8 in HTTP fields directly, I agree with Julian's characterisation of this as appropriate for an experiment, not something that we should include in infrastructure like SF, which has to be very highly reliable.

Cheers,

--
Mark Nottingham   https://www.mnot.net/

Received on Monday, 29 May 2023 01:17:05 UTC