Re: several messages about handling encodings in HTML from Geoffrey Sneddon on 2008-02-29 (public-i18n-core@w3.org from January to March 2008)

From: Geoffrey Sneddon <foolistbar@googlemail.com>
Date: Fri, 29 Feb 2008 16:09:42 +0000
To: Ian Hickson <ian@hixie.ch>
Cc: whatwg@whatwg.org, HTML WG <public-html@w3.org>, public-i18n-core@w3.org
Message-Id: <75073856-A4D8-4826-AEF4-00F3E2E4C58C@googlemail.com>

On 29 Feb 2008, at 01:21, Ian Hickson wrote:

>> 	- Again there, shouldn't we be given unicode codepoints for that (as
>> it'll be a unicode string)?
>
> Not sure what you mean.

This is just me being incredibly dumb. Ignore it.

> On Sat, 26 May 2007, Henri Sivonen wrote:
>>
>> The draft says:
>> "A leading U+FEFF BYTE ORDER MARK (BOM) must be dropped if present."
>>
>> That's reasonable for UTF-8 when the encoding has been established by
>> other means.
>>
>> However, when the encoding is UTF-16LE or UTF-16BE (i.e. supposed to be
>> signatureless), do we really want to drop the BOM silently? Shouldn't it
>> count as a character that is in error?
>
> Do the UTF-16LE and UTF-16BE specs make a leading BOM an error?
>
> If yes, then we don't have to say anything, it's already an error.
>
> If not, what's the advantage of complaining about the BOM in this case?

I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE. In
fact, the only mention I can find of it with regard to either in Unicode
5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is
interpreted as U+FEFF zero width no-break space."
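
As a rough illustration of that interpretation (not anything the draft or
Unicode prescribes as an implementation), here is how Python's built-in codecs
happen to treat the same bytes. The variable names are just for this sketch:
under the "utf-16-le" label the leading FF FE comes out as an ordinary U+FEFF
character, whereas under the generic "utf-16" label it is consumed as a
byte order mark.

```python
# Bytes FF FE followed by "a" encoded as little-endian UTF-16.
data = b"\xff\xfe" + "a".encode("utf-16-le")

# Explicit-endianness label: the leading FF FE is decoded as a
# regular U+FEFF character and stays in the resulting string.
as_le = data.decode("utf-16-le")

# Generic label: the codec uses FF FE to pick the byte order and
# does not include it in the resulting string.
as_generic = data.decode("utf-16")

print([hex(ord(c)) for c in as_le])
print([hex(ord(c)) for c in as_generic])
```

So at least in this one decoder, a leading U+FEFF in UTF-16LE is treated as
content rather than as an error, which matches the sentence quoted above.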

I suppose the rationale given for removing it is in the section that
follows D101 (e.g., "When converting between different encoding
schemes…UTF-8 byte sequences is not recommended by the Unicode
Standard.").
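
To make the conversion concern concrete, a minimal sketch, assuming the
draft's rule is applied to the already-decoded text: drop a single leading
U+FEFF so that re-encoding as UTF-8 does not start with EF BB BF. The helper
name drop_leading_bom is made up for this example and is not from the draft.

```python
def drop_leading_bom(text: str) -> str:
    """Drop one leading U+FEFF from already-decoded text, if present."""
    # Hypothetical helper: only the first character is examined, so any
    # later U+FEFF in the text is left untouched.
    return text[1:] if text.startswith("\ufeff") else text


# UTF-16LE input that starts with FF FE; this decoder keeps it as U+FEFF.
decoded = (b"\xff\xfe" + "hi".encode("utf-16-le")).decode("utf-16-le")

cleaned = drop_leading_bom(decoded)

# Re-encoding the cleaned text as UTF-8 produces no leading signature.
utf8_bytes = cleaned.encode("utf-8")
print(utf8_bytes)
```

Without the drop step, the UTF-8 output would begin with the bytes EF BB BF,
which is the case the quoted recommendation is about.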


--
Geoffrey Sneddon
<http://gsnedders.com/>

Received on Friday, 29 February 2008 16:09:58 UTC