- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 29 Aug 2002 13:50:50 +0900
- To: Francois Yergeau <FYergeau@alis.com>, charsets <ietf-charsets@iana.org>
Hello Francois,

These are some comments that I wrote a long while ago, but just
remembered them because of Patrick's mail. I think in particular the
comment on the BOM below is very important.

At 16:51 02/04/17 -0400, Francois Yergeau wrote:
>Martin Duerst wrote:
> > While the title of ISO/IEC 10646 includes 'multi-octet', I think
> > this is confusing,
>
>OK. The abstract now starts:
>
>   ISO/IEC 10646-1 defines a large character set called the
>   Universal Character Set (UCS) which encompasses most of the world's
>   writing systems. The originally proposed encodings of the UCS,
>   however, were not compatible with ...
>
>Same for the first para of the introduction. Other instances of
>"multi-octet" were of a different kind and I left them alone. Please check.

Very good.

> > <14>
> > This should be worded more generally, at least inserting something
> > like 'and similar algorithms'.
>
>Well, I do know about Boyer-Moore, but not others. I wouldn't want to
>generalize to something wrong.

There are many other cases, e.g. regular expressions, ... Just
mentioning the Boyer-Moore algorithm is much too narrow.

> > <25>
> >    3. Fill in the bits marked x from the bits of the character
> >       number, expressed in binary. Start from the lower-order bits
> >       of the character number and put them first in the last octet
> >       of the sequence, then the next to last, etc. until all x bits
> >       are filled in.
> >
> > This misses one important detail: the sequence in which the bits
> > are filled into a byte. This should be fixed. Maybe we can
> > make things even clearer, as follows:
>
>This text dates back to RFC 2044 (October 1996) and since then nobody has
>complained; in fact I have had a few reports from people saying this was
>the clearest exposition of UTF-8 they had seen. I'm therefore very
>reluctant to change it!
>
>It seems that people know where "low-order bits" go in a byte. Your
>proposed table may be more explicit, not necessarily clearer.

It may well be that putting the lower-order bits on the right is the
more obvious thing to do, but the description IS incomplete and should
be fixed, even if it's the clearest one around. [I don't want to
produce an implementation that does it the other way round just to
show that the description is incomplete.]

> > <32>
> > 'different versions' gives the impression that these might be
> > diverging versions.
>
>s/different versions/new versions/

Very nice.

> > 5. Byte order mark (BOM)
> >
> > This section needs more work. The 'change log' says that it's
> > mostly taken from the UTF-16 RFC. But the BOM for UTF-8 is
> > much less necessary, and much more of a problem, than for UTF-16.
> > We should clearly say that with IETF protocols, character encodings
> > are always either labeled or fixed, and therefore the BOM SHOULD
> > (and MUST at least for small segments) never be used for UTF-8.
> > And we should clearly give the main argument, namely that it
> > breaks US-ASCII compatibility (US-ASCII encoded as UTF-8
> > (without a BOM) stays exactly the same, but US-ASCII encoded
> > as UTF-8 with a BOM is different).
>
>I don't quite see your point. A US-ASCII string, with or without a BOM,
>is always a valid UTF-8 string; I don't see where compatibility is
>broken. I can see that protocols shouldn't *require* a BOM, because then
>a strict (BOM-less) ASCII string wouldn't meet the requirement. But
>that's not what you're saying, right?

No. What I'm saying is that a US-ASCII string with a BOM is no longer
US-ASCII.
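To make this concrete, here is a small sketch of what I mean (in
Python, with a made-up header string); it is only an illustration,
not text proposed for the draft:

    # Sketch only: the example string is invented for illustration.
    ascii_text = "Subject: Hello"              # plain US-ASCII protocol data
    utf8_plain = ascii_text.encode("utf-8")    # identical bytes: b'Subject: Hello'
    assert utf8_plain == ascii_text.encode("us-ascii")

    # The same data with the UTF-8 'signature' (BOM) prepended:
    utf8_bom = b"\xef\xbb\xbf" + utf8_plain

    # It is still valid UTF-8, but it is no longer US-ASCII at all:
    try:
        utf8_bom.decode("us-ascii")
    except UnicodeDecodeError:
        print("the BOM-prefixed form is not US-ASCII any more")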
Yet the property that US-ASCII stays US-ASCII is one of the important
properties for which the IETF has chosen to use UTF-8. Upgrading
software from US-ASCII to UTF-8 (without BOM) is in many cases really
easy. With the BOM, it becomes very painful.

Therefore, I would propose to include something like the following:

>>>>
   In the context of IETF protocols, the character encoding is either
   identified by a label (such as the 'charset' label) or by specifying
   a fixed encoding for a particular protocol element. The BOM as an
   encoding 'signature' for UTF-8 is therefore unnecessary. For larger
   chunks of text (e.g. MIME entities,...), the BOM SHOULD NOT be used.
   For smaller chunks of text (e.g. headers or parts of headers), the
   BOM MUST NOT be used.

> > <42>
> >    The character U+233B4 (a Chinese character meaning 'stump of
> >    tree'), prepended with a UTF-8 BOM, is encoded in UTF-8 as
> >    follows:
> >
> > Please don't give an example of a bad practice.
>
>I'll agree if we end up banning it, but otherwise I'd rather show it.

Let's ban it.

> > Then probably add an IANA Considerations section where you say:
> > "Please update the reference for UTF-8 to point to this memo." or so.
>
>Does that really belong *in* the doc itself?

I think yes. Please check some other RFCs with IANA Considerations
sections.

> > 8. Security Considerations
> >
> > - Most of the attacks described have actually taken place.
> >   I think some 'might's and 'could's should be changed so that
> >   it's clearer that these are very realistic threats.
>
>Suggestions?

> > - It might be a good idea, here or somewhere else in the document,
> >   to provide some regular expressions that fully check UTF-8 byte
> >   sequences.
>
>Regexps would be nice, but we'd need to refer to a definition of the
>regexp language itself. Any suitable source?

There is the ABNF RFC for ABNF. For Perl, the usual Perl book should do:

   Programming Perl (3rd Edition)
   by Larry Wall, Tom Christiansen, Jon Orwant
   O'Reilly & Associates; ISBN: 0596000278; (July 2000)

(A rough sketch of what such a full check could look like is in the
P.S. at the end of this mail.)

>Thanks for all the good comments!

Here is one more: maybe we should explicitly mention CESU-8, and
clearly say that it is different from UTF-8 and not intended for use
on the Internet.

   Unicode Technical Report #26
   Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8)

I think it's better to call the devil by name than to ignore it.

Regards,    Martin.
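P.S.: Here is the kind of full check I have in mind, written out as a
byte-oriented regular expression. It is in Python rather than Perl or
ABNF, and it assumes a code space restricted to U+0000..U+10FFFF with
the surrogate range excluded; take it as a rough sketch only, not as
proposed text for the draft.

    import re

    # Each alternative matches one well-formed UTF-8 sequence; the byte
    # ranges rule out overlong forms, surrogates (U+D800..U+DFFF) and
    # anything above U+10FFFF.
    UTF8_CHAR = rb"""
        [\x00-\x7F]                          # 1 byte:  U+0000..U+007F
      | [\xC2-\xDF][\x80-\xBF]               # 2 bytes: U+0080..U+07FF
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes: U+0800..U+0FFF
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes: U+1000..U+CFFF, U+E000..U+FFFF
      | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes: U+D000..U+D7FF, no surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes: U+10000..U+3FFFF
      | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes: U+40000..U+FFFFF
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes: U+100000..U+10FFFF
    """
    UTF8_STRING = re.compile(rb"(?:" + UTF8_CHAR + rb")*", re.X)

    def is_utf8(data):
        """True if the whole byte string is well-formed UTF-8."""
        return UTF8_STRING.fullmatch(data) is not None

    # 'stump of tree' (U+233B4) with some ASCII around it is accepted:
    assert is_utf8("stump of tree: \U000233B4".encode("utf-8"))
    # Overlong forms and UTF-8-encoded surrogates (as CESU-8 produces)
    # are rejected:
    assert not is_utf8(b"\xc0\xaf")
    assert not is_utf8(b"\xed\xa0\x80")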