
Re: Should the UTF-8 BOM trump overriding via HTTP or by users?

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 7 Jun 2011 17:43:47 +0200
To: Bjoern Hoehrmann <derhoermi@gmx.net>
Cc: www-international <www-international@w3.org>
Message-ID: <20110607174347216292.7e847ad2@xn--mlform-iua.no>
Bjoern Hoehrmann, Tue, 07 Jun 2011 16:56:29 +0200:
> * Leif Halvard Silli wrote:
>> Bjoern Hoehrmann, Tue, 07 Jun 2011 06:39:34 +0200:
>>> Higher-level information overrides lower-level information, explicit
>>> information overrides fallbacks, and user agents should do what their
>>> users want them to do. So, HTTP-level Content-Type overrides document-
>>> internal information, a BOM overrides user-chosen fallbacks, and user-
>>> chosen overrides trump anything else.
>> 
>> You portray the BOM as "fallback". It actually is an encoding 
>> signature.
> 
> If you think I wrote something that is inconsistent with facts, then
> maybe you misread what I wrote? I did not, and did not mean to, por-
> tray a Unicode signature as a fallback in the sense I used the word.

I meant "fallback declaration" or "backup declaration, in case HTTP has 
no declaration".

> I meant fallback in the sense of a "If page lacks encoding declaration
> assume it's $encoding encoded" setting, as opposed to a "Whatever the
> page says it's encoded in, use $encoding to decode" setting.

Your priority map was easy to parse. So, in truth, what I reacted to 
was only the wording.
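
To make that priority map concrete, here is a rough sketch of the ordering as I read it from the quoted paragraph: a user-chosen override first, then the HTTP charset, then the BOM, then the user-chosen fallback. The function names, the parameter names and the default fallback value are all hypothetical; this is an illustration of the stated ordering, not what any spec or browser does. UTF-32 signatures and in-document declarations are left out on purpose.

```python
from typing import Optional

# Byte sequences commonly treated as Unicode signatures (BOMs).
# UTF-32 is deliberately omitted to keep the sketch short.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None."""
    for signature, encoding in BOMS:
        if data.startswith(signature):
            return encoding
    return None


def pick_encoding(
    data: bytes,
    http_charset: Optional[str] = None,   # charset parameter of Content-Type
    user_override: Optional[str] = None,  # "use this whatever the page says"
    user_fallback: str = "windows-1252",  # hypothetical default fallback
) -> str:
    """Apply the quoted ordering: override > HTTP > BOM > fallback."""
    if user_override:
        return user_override
    if http_charset:
        return http_charset
    bom_encoding = sniff_bom(data)
    if bom_encoding:
        return bom_encoding
    return user_fallback
```

Under this ordering, `pick_encoding(b"\xfe\xff", http_charset="iso-8859-1")` returns the HTTP value, and the leading bytes are never consulted as a signature.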

>> "Looks like a BOM". Looks like or are exactly those bytes? Can you 
>> describe a use case? When and how can an XML document/entity legally 
>> start with the BOM if it is not meant to  be interpreted as the BOM?  
> 
> Looks like as opposed to "defined as".
> 
>   Content-Type: application/xml-external-parsed-entity;charset=l1
> 
>   0xFE 0xFF
> 
> That's a properly formed external parsed entity containing LATIN SMALL
> LETTER THORN and LATIN SMALL LETTER Y WITH DIAERESIS. If you ignore the
> charset parameter, the bytes may look like a Unicode signature, but the
> bytes are not a Unicode signature because they are not defined as such.
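
The two readings of those bytes can be checked with a few lines of Python. This is only a small sketch: it assumes that the "l1" charset label means ISO-8859-1, and it uses Python's own codecs, whose BOM handling is not necessarily what an XML parser does.

```python
import unicodedata

# The two bytes of the example entity body.
data = bytes([0xFE, 0xFF])

# Read as ISO-8859-1 (assumed meaning of "charset=l1"): two Latin letters.
as_latin1 = data.decode("iso-8859-1")
names = [unicodedata.name(ch) for ch in as_latin1]
print(names)

# Read with Python's generic "utf-16" codec: the bytes are consumed as a
# big-endian byte order mark and no text is left over.
as_utf16 = data.decode("utf-16")
print(repr(as_utf16))

# Read as "utf-16-be", where the byte order is already fixed: the same
# bytes decode to the character U+FEFF itself.
as_utf16_be = data.decode("utf-16-be")
print(hex(ord(as_utf16_be)))
```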

So what could that seemingly narrow case lead to? 

Firstly, since the external entity is not UTF-8 or UTF-16 encoded, 
there is no guarantee that the parser will handle it. Thus a browser 
could not really be said to be breaking the XML spec if it was unable 
to handle such a thing properly.

Alternatively, the parser could let the user override the encoding 
setting so that it could handle the entity.

Meanwhile, the external parsed entity SHOULD itself begin with the 
text declaration, which in turn ought to state the encoding. In that 
case, the external entity would not need to be served with encoding 
information.

So there are several limitations and recommendations that have to be 
broken before that use case could be a real use case.

Btw, since this external parsed entity begins with those two characters 
rather than with U+FEFF, then XML 1.0 does not require that there is a 
BOM in the "current" (as opposed to in the "external parsed") XML file. 
In that regard, it is interesting to note that RFC 3023 is from 2001 
and doesn't discuss the UTF-8 BOM.

http://tools.ietf.org/html/rfc3023#page-15
-- 
Leif Halvard Silli
Received on Tuesday, 7 June 2011 15:44:20 GMT
