- From: Geoffrey Sneddon <foolistbar@googlemail.com>
- Date: Mon, 3 Mar 2008 16:56:20 +0000
- To: Martin Duerst <duerst@it.aoyama.ac.jp>
- Cc: www-archive@w3.org
Off-list, as this isn't really related to the development of HTML whatsoever.

On 3 Mar 2008, at 08:54, Martin Duerst wrote:

>> I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE, in
>> fact, the only mention I find of it with regards to either in Unicode
>> 5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is
>> interpreted as U+FEFF zero width no-break space."
>
> That's exactly it. To make it very explicit, there is one codepoint
> (U+FEFF) and two functions: BOM and ZWNBSP. What the above says is
> that U+FEFF at the start of files marked as UTF-16LE/UTF-16BE is
> always ZWNBSP, and therefore is never a BOM. This means that a leading
> BOM is forbidden.

Ah. My mistake: thinking of ZWNBSP as just being the character name, and
not its specific meaning in the context (which of course is important
for U+FEFF).

> If there are HTML files that can start with arbitrary characters, then
> it might be okay to have a UTF-16LE or UTF-16BE file start with U+FEFF,
> because this can then be interpreted as a ZWNBSP (although a ZWNBSP
> at the start of a file doesn't make a lot of sense). If HTML files
> have to start with markup, then a UTF-16LE or UTF-16BE HTML file
> cannot start with U+FEFF, because a ZWNBSP isn't markup.
> (Last time I knew HTML, it had to have at least a <title> element,
> so it had to start with markup, but I don't know that is working
> out in HTML5.)

A conformant document must start with a doctype, but for a
non-conforming document a (leading) ZWNBSP will just end up at the start
of <body> (i.e., it gets treated like any other non-ASCII space
character).

--
Geoffrey Sneddon
<http://gsnedders.com/>
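
For illustration only: the BOM versus ZWNBSP distinction discussed above can be observed in Python's built-in codecs. This is a rough sketch of how one decoder treats the encoding labels, not a statement of what the Unicode text requires or of how an HTML parser behaves. The helper name decode_as and the sample bytes are made up for the example.

```python
# Sketch: the encoding label decides the role of a leading U+FEFF.
# DATA and decode_as are hypothetical names used only for this example.
DATA = b"\xff\xfeA\x00"  # bytes FF FE, then "A" in little-endian UTF-16


def decode_as(label: str, data: bytes = DATA) -> str:
    """Decode data under the given encoding label using Python's codecs."""
    return data.decode(label)


# Labelled "utf-16": the initial FF FE is consumed as a byte order mark.
as_utf16 = decode_as("utf-16")

# Labelled "utf-16-le": the initial FF FE is kept as the character U+FEFF,
# i.e. it is treated as ZWNBSP and stays in the decoded text.
as_utf16le = decode_as("utf-16-le")

print(repr(as_utf16))    # 'A'
print(repr(as_utf16le))  # '\ufeffA'
```

With the plain "utf-16" label the mark is dropped after selecting the byte order; with the explicit "utf-16-le" label it survives as a character, which matches the reading that U+FEFF at the start of a UTF-16LE/UTF-16BE stream is a ZWNBSP and not a BOM.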
Received on Monday, 3 March 2008 16:56:34 UTC