Re: byte order mark article

John Cowan, Wed, 21 Nov 2012 16:27:27 -0500:

> Per Unicode, in UTF-16LE and UTF-16BE documents, there is no such
> thing as a BOM.  If a UTF-16LE document begins FF FE, that means the
> first character is U+FEFF, ZERO WIDTH NO-BREAK SPACE; likewise if
> a UTF-16BE document begins FE FF.
> 
> In the UTF-16 encoding, a leading FF FE or FE FF is a BOM rather than a
> character, and all following pairs of bytes are interpreted little-endian
> or big-endian respectively.  If the first two bytes are neither of these,
> a higher-level protocol must decide whether to interpret the pairs of
> bytes as big- or little-endian.  If no higher-level protocol exists,
> the interpretation is big-endian by default.

The theoretical ability of UTF-16LE and UTF-16BE to let a leading FF FE 
or FE FF represent a ZERO WIDTH NO-BREAK SPACE rather than a BOM seems 
to be without value for markup languages. The only exception I can 
think of would be a markup language in which the role that '<' plays 
in XML was given to the ZERO WIDTH NO-BREAK SPACE character itself.
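
To make the distinction concrete, here is a small sketch using Python's 
built-in codecs, which happen to follow the same rule. This only 
illustrates the Unicode distinction; it is not a claim about how any 
XML parser is implemented.

```python
# FF FE followed by '<' encoded little-endian.
data = b"\xff\xfe<\x00"

# Labelled UTF-16LE: the leading FF FE is kept as the character U+FEFF.
as_utf16le = data.decode("utf-16-le")

# Labelled UTF-16: the leading FF FE is consumed as a BOM.
as_utf16 = data.decode("utf-16")

print(repr(as_utf16le))
print(repr(as_utf16))
```

The first result carries an extra U+FEFF in front of the '<', which is 
exactly the character that has no use at the start of a markup document.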

Hence, it doesn't seem important that e.g. XML editors or XML parsers 
handle UTF-16LE or UTF-16BE correctly with regard to whether FF FE or 
FE FF, as the first two bytes, represent a BOM or a ZERO WIDTH NO-BREAK 
SPACE. In fact, it seems better if they never treat them as a 
character, as this removes at least one opportunity for a (fatal) 
error.
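
A minimal sketch of such lenient handling might look like the 
following. The function name decode_lenient and its declared parameter 
are hypothetical, and letting the BOM override the declared label is my 
assumption for the sketch, not a description of any browser's code.

```python
import codecs


def decode_lenient(data: bytes, declared: str = "utf-16-be") -> str:
    """Always treat a leading FF FE or FE FF as a BOM (hypothetical helper).

    `declared` stands in for the label from a higher-level protocol.
    It defaults to big-endian, as in the rule quoted above.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    # No BOM present: fall back to the declared byte order.
    return data.decode(declared)
```

With this, a document labelled UTF-16LE that starts with FF FE decodes 
without a stray U+FEFF in front of the first '<'.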

For that reason, it seems entirely OK that Firefox will, when version 
19 is released, treat a leading FF FE or FE FF as a BOM, even in XML 
documents. [1] (This can be tested in e.g. Firefox Nightly.) Firefox 
thus aligns its XML and HTML parsing in this detail. Other browsers, 
at least WebKit, did so long ago. I should add that Firefox 19 and 
WebKit also treat plain text the same way.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=716579#c14

-- 
leif halvard silli

Received on Wednesday, 21 November 2012 23:17:16 UTC