Re: BOM and UTF-16LE/BE (was: Re: several messages about handling encodings in HTML)

Off-list, as this isn't really related to the development of HTML  
whatsoever.

On 3 Mar 2008, at 08:54, Martin Duerst wrote:

>> I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE, in
>> fact, the only mention I find of it with regards to either in Unicode
>> 5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is
>> interpreted as U+FEFF zero width no-break space."
>
> That's exactly it. To make it very explicit, there is one codepoint
> (U+FEFF) and two functions: BOM and ZWNBSP. What the above says is
> that U+FEFF at the start of files marked as UTF-16LE/UTF-16BE is
> always ZWNBSP, and therefore is never a BOM. This means that a leading
> BOM is forbidden.

Ah. My mistake: I was thinking of ZWNBSP as just being the character
name, and not as its specific function in this context (which of course
is important for U+FEFF).
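
For what it's worth, the same distinction shows up in Python's standard
codecs, which makes a handy illustration. This is only a sketch: the
sample bytes are made up for the example, and I'm assuming the codecs
follow the Unicode rule quoted above rather than claiming they are
normative for HTML.

```python
import codecs

# Hypothetical sample: the bytes FF FE followed by a doctype in UTF-16LE.
data = codecs.BOM_UTF16_LE + "<!DOCTYPE html>".encode("utf-16-le")

# Labelled plain "utf-16": the leading FF FE is consumed as a byte order
# mark and does not appear in the decoded text.
as_utf16 = data.decode("utf-16")

# Labelled "utf-16-le": the byte order is already fixed by the label, so
# FF FE decodes to an ordinary U+FEFF character (ZWNBSP) in the text.
as_utf16le = data.decode("utf-16-le")

print(hex(ord(as_utf16[0])))    # 0x3c, i.e. '<'
print(hex(ord(as_utf16le[0])))  # 0xfeff
```

So under the "utf-16-le" label the document text really does begin with
a U+FEFF character rather than with markup.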

> If there are HTML files that can start with arbitrary characters, then
> it might be okay to have a UTF-16LE or UTF-16BE file start with U+FEFF,
> because this can then be interpreted as a ZWNBSP (although a ZWNBSP
> at the start of a file doesn't make a lot of sense). If HTML files
> have to start with markup, then a UTF-16LE or UTF-16BE HTML file
> cannot start with U+FEFF, because a ZWNBSP isn't markup.
> (Last time I knew HTML, it had to have at least a <title> element,
> so it had to start with markup, but I don't know how that is working
> out in HTML5.)

A conformant document must start with a doctype, but for a
non-conforming document a (leading) ZWNBSP will just end up at the start
of <body> (i.e., it gets treated like any other non-ASCII space
character).


--
Geoffrey Sneddon
<http://gsnedders.com/>

Received on Monday, 3 March 2008 16:56:34 UTC