- From: Martin Duerst <duerst@it.aoyama.ac.jp>
- Date: Mon, 03 Mar 2008 17:54:17 +0900
- To: Geoffrey Sneddon <foolistbar@googlemail.com>, Ian Hickson <ian@hixie.ch>
- Cc: HTML WG <public-html@w3.org>, public-i18n-core@w3.org
At 01:09 08/03/01, Geoffrey Sneddon wrote:
>
>On 29 Feb 2008, at 01:21, Ian Hickson wrote:
>> On Sat, 26 May 2007, Henri Sivonen wrote:
>>>
>>> The draft says:
>>> "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
>>>
>>> That's reasonable for UTF-8 when the encoding has been established by
>>> other means.
>>>
>>> However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to be
>>> signatureless), do we really want to drop the BOM silently? Shouldn't it
>>> count as a character that is in error?
>>
>> Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?

Yes. See below for details.

>> If yes, then we don't have to say anything, it's already an error.
>>
>> If not, what's the advantage of complaining about the BOM in this
>> case?

The fact that it needs explanation on this list should probably be
taken as a hint that we had better say something, or implementers will
easily overlook this.

>I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE, in
>fact, the only mention I find of it with regards to either in Unicode
>5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is
>interpreted as U+FEFF zero width no-break space."

That's exactly it. To make it very explicit, there is one code point
(U+FEFF) and two functions: BOM and ZWNBSP. What the above says is
that U+FEFF at the start of files marked as UTF-16LE/UTF-16BE is always
a ZWNBSP, and therefore is never a BOM. This means that a leading BOM is
forbidden.

If there are HTML files that can start with arbitrary characters,
then it might be okay to have a UTF-16LE or UTF-16BE file start with
U+FEFF, because this can then be interpreted as a ZWNBSP (although
a ZWNBSP at the start of a file doesn't make a lot of sense).
If HTML files have to start with markup, then a UTF-16LE or UTF-16BE
HTML file cannot start with U+FEFF, because a ZWNBSP isn't markup.
(Last time I knew HTML, it had to have at least a <title> element,
so it had to start with markup, but I don't know how that is working
out in HTML5.)

Regards,    Martin.

>I suppose the rationale given for removing it is the section that
>follows D101 (e.g., "When converting between different encoding
>schemes [...] UTF-8 byte sequences is not recommended by the Unicode
>Standard.").
>
>
>--
>Geoffrey Sneddon
><http://gsnedders.com/>
>
>

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
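Editorial illustration: the distinction drawn in the message (one code
point, two functions, selected by the encoding label) can be sketched in
Python. This is a minimal sketch and not a normative algorithm. The helper
name leading_u_feff_role and its return values are hypothetical, made up
for this example. The only library behaviour relied on is that Python's
"utf-16" codec consumes a leading signature while "utf-16-le" keeps U+FEFF
as an ordinary character.

```python
# Hypothetical helper for illustration; the name and return values are
# assumptions, not taken from any specification.
BOM_LE = b"\xff\xfe"  # U+FEFF serialized little-endian
BOM_BE = b"\xfe\xff"  # U+FEFF serialized big-endian


def leading_u_feff_role(data: bytes, label: str) -> str:
    """Return "BOM", "ZWNBSP", or "none" for the first two bytes of data."""
    label = label.upper()
    head = data[:2]
    if label == "UTF-16":
        # Unmarked UTF-16: a leading U+FEFF is a signature (BOM).
        return "BOM" if head in (BOM_LE, BOM_BE) else "none"
    if label == "UTF-16LE":
        # Byte order is already fixed, so U+FEFF is a content character.
        return "ZWNBSP" if head == BOM_LE else "none"
    if label == "UTF-16BE":
        return "ZWNBSP" if head == BOM_BE else "none"
    raise ValueError(f"not a UTF-16 encoding scheme: {label!r}")


# A little-endian byte stream that starts with U+FEFF, then markup.
sample = BOM_LE + "<title>".encode("utf-16-le")

print(leading_u_feff_role(sample, "UTF-16"))    # treated as a signature
print(leading_u_feff_role(sample, "UTF-16LE"))  # treated as content

as_utf16 = sample.decode("utf-16")       # codec consumes the signature
as_utf16le = sample.decode("utf-16-le")  # codec keeps U+FEFF as a character
print(repr(as_utf16), repr(as_utf16le))
```

Under the UTF-16LE label the decoded text begins with U+FEFF rather than
with "<", which is the case the message argues cannot be valid if an HTML
file has to start with markup.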
Received on Monday, 3 March 2008 08:55:48 UTC