- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Sun, 28 May 2023 15:44:41 +0900
- To: Mark Nottingham <mnot@mnot.net>, Roy Fielding <fielding@gbiv.com>
- Cc: Tommy Pauly <tpauly@apple.com>, HTTP Working Group <ietf-http-wg@w3.org>
On 2023-05-26 07:38, Mark Nottingham wrote:
> Hi Roy,
>
>> On 26 May 2023, at 3:21 am, Roy T. Fielding <fielding@gbiv.com> wrote:
>>
>> I think (b) is unnecessary given that HTTP is 8-bit clean for UTF-8
>> and we are specifically talking about new fields for which there
>> are no deployed parsers. Yes, I know what it says in RFC 9110.
>
> Yes, the parsers may be new, but in some contexts, they may not have
> access to the raw bytes of the field value. Many HTTP libraries and
> abstractions (e.g., CGI) assume an encoding and expose strings; some of
> those may apply the advice that HTTP has documented for many years and
> assume ISO-8859-1.

This is a valid point, but I think it can be addressed rather easily. The solution is simply to move back to bytes using ISO-8859-1 and then to move from bytes to characters using UTF-8. This can be done by the parser for Display Strings.

In Ruby, assuming that the string's encoding is ISO-8859-1 (Ruby carries an encoding for each string, and can to some extent deal with multiple strings in different encodings, although these days it's mostly just everything in UTF-8), this would just be

    display_string.force_encoding('UTF-8')

(this just changes the interpretation of the underlying bytes). In most other languages, where the actual string encoding is opaque and uniform (such as Python), this would be done e.g. by something like the following:

    display_string.encode('iso-8859-1').decode('utf-8')

Although I have probably written less than a dozen lines of Python code in my whole life, I checked this with

    >>> b'\xe2\x82\xac'.decode('iso-8859-1').encode('iso-8859-1').decode('utf-8')

which successfully printed '€' (the first "decode" is what the general HTTP library would do; the following encode/decode is what the structured header parser would do). Of course, in a language such as Java, the whole thing would be a bit longer, having to instantiate a CharsetEncoder and a CharsetDecoder and so on :-(.
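The round trip described above can be sketched as a small self-contained Python example; the helper name `reinterpret_as_utf8` is hypothetical, not from any particular library:

```python
def reinterpret_as_utf8(field_value: str) -> str:
    """Hypothetical helper for a Display String parser.

    Assumes the HTTP library has already decoded the raw header bytes
    as ISO-8859-1. Encoding back to ISO-8859-1 recovers those exact
    bytes; decoding them as UTF-8 yields the intended characters.
    """
    return field_value.encode('iso-8859-1').decode('utf-8')

# The euro sign (U+20AC) is three bytes in UTF-8: 0xE2 0x82 0xAC.
# A library decoding those bytes as ISO-8859-1 produces mojibake:
mis_decoded = b'\xe2\x82\xac'.decode('iso-8859-1')
assert reinterpret_as_utf8(mis_decoded) == '\u20ac'  # '€'
```

This works because ISO-8859-1 maps every byte 0x00–0xFF to a code point one-to-one, so the encode step is lossless.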
> Yes, in many cases you can use UTF-8 on the wire successfully. However,
> making that assumption is a local convention; we can't assume that it
> holds for the entire Internet, because we don't know all of the various
> implementations that have been deployed and how they behave. All we
> know is a) how the implementations we've seen behave, and b) what we've
> written down before.

The 'wire' (which for the moment I assume to be TCP and below, or TLS and below) just transports bytes, and therefore should not be a problem. Problems may occur at places where (contrary to the HTTP specs) ISO-8859-1 isn't passed through in headers. There may also be problems in cases where ISO-8859-1 is interpreted as excluding the bytes in the range 0x80 to 0x9F. But I think it's easy to say that such cases should be very rare. There may also be implementations that just cut off the most significant bit in each byte, or otherwise don't let non-ASCII bytes through. It would be good to know whether such cases have actually been reported, or whether that's just something we think might be out there but isn't actually confirmed.

> In the past we've made decisions like this and chosen to be
> conservative. We could certainly break that habit now, but we'd need
> (at the least) to have a big warning that this type might not be
> interoperable with deployed systems. Personally, I don't think that's
> worth it, given the relative rarity that we expect for this particular
> type, and the relatively low overhead of encoding.

If by overhead you mean processing, then I agree that's low. If you mean size, I think the situation is different. Here's a little table with some of the most important scripts and the byte count and expansion factor for their characters when compared to legacy encodings and pure UTF-8:

                                  Legacy   UTF-8   proposed   expansion
    ASCII                            1       1        1           1
    Latin+Accents, e.g. Polish       1      ~1.5     ~2           2
    Arabic/Cyrillic/...              1       2        6           6
    Indic scripts,...                1       3        9           9
    Chinese/Japanese/...             2       3        9           4.5

So some text in an Indic or South Asian script gets expanded by a factor of 9 when compared to a legacy single-byte encoding.

>> In general, it is safer to send raw UTF-8 over the wire in HTTP
>> than it is to send arbitrary pct-encoded octets, simply because
>> pct-encoding is going to bypass most security checks long enough
>> for the data to reach an application where people do stupid
>> things with strings that they assume contain something that is
>> safe to display.
>
> That's an odd assertion - where are those security checks taking place?

I don't know about headers in general, but I hope people remember the attack where it was possible to smuggle a path like /abc/def/../../xyz.html by percent-encoding (part of) "/../../" and access the file xyz.html, which was access-protected, because the access check happened before the decoding.

Regards,   Martin.
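The factor-of-9 figure in the table can be checked with a few lines of Python, assuming (as the table does) that the proposed encoding percent-encodes each UTF-8 byte as %XX:

```python
from urllib.parse import quote

# One Devanagari character: a single code point, 3 bytes in UTF-8.
ch = '\u0915'                  # DEVANAGARI LETTER KA
utf8 = ch.encode('utf-8')
pct = quote(utf8, safe='')     # every byte becomes a 3-character %XX escape

assert len(utf8) == 3          # the "UTF-8" column of the table
assert len(pct) == 9           # the "proposed" column: 3 bytes -> 9 characters
```

Relative to a hypothetical legacy single-byte encoding for the same script, that is the 9x expansion the table reports.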
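The check-before-decode bug mentioned above can be illustrated with a minimal sketch; the `is_allowed` check is a deliberately naive, hypothetical stand-in for a real access-control filter:

```python
from urllib.parse import unquote

def is_allowed(path: str) -> bool:
    # Naive, hypothetical access check: reject obvious traversal.
    return '/../' not in path

requested = '/abc/def/%2e%2e/%2e%2e/xyz.html'

# Checking before decoding lets the request through...
assert is_allowed(requested)

# ...but after percent-decoding, the path climbs out of the directory.
decoded = unquote(requested)
assert decoded == '/abc/def/../../xyz.html'
assert not is_allowed(decoded)
```

The fix, of course, is to decode (and normalize) first and only then apply the check; the point here is just that percent-encoding can carry bytes past checks that only look at the encoded form.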
Received on Sunday, 28 May 2023 06:44:53 UTC