[whatwg] UTF and BOM terminology from Henri Sivonen on 2007-05-27 (public-whatwg-archive@w3.org from May 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 27 May 2007 11:56:29 +0300
Message-ID: <0DCBCBB0-FDC6-4286-993A-F66D275FB167@iki.fi>

"If the encoding is one of UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or  
UTF-32LE, then authors can use a BOM at the start of the file to  
indicate the character encoding."

That sentence should read:
"If the encoding is one of UTF-8, UTF-16, or UTF-32, then authors can  
use a BOM at the start of the file to indicate the character encoding."

Encoding labels containing LE or BE denote BOM-less variants, where
the label at the transfer protocol level supplies the endianness (see
RFC 2781, http://www.ietf.org/rfc/rfc2781.txt). When the spec refers
to UTF-16 with a BOM in a particular endianness, I think it should
say "big-endian UTF-16" and "little-endian UTF-16".
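
To illustrate the distinction, here is a minimal Python sketch of BOM
sniffing. It reports the generic label (UTF-8, UTF-16, UTF-32) and
gives the byte order as a separate piece of information, never as part
of the label. The names sniff_bom and _BOMS are hypothetical, and this
is not the detection algorithm of any particular spec.

```python
import codecs

# The 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks,
# because the little-endian UTF-32 BOM (FF FE 00 00) begins with the
# little-endian UTF-16 BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32", "big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32", "little-endian"),
    (codecs.BOM_UTF8, "UTF-8", None),  # UTF-8 has no byte order
    (codecs.BOM_UTF16_BE, "UTF-16", "big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16", "little-endian"),
]


def sniff_bom(data: bytes):
    """Return (label, byte_order, bom_length), or None if there is no BOM."""
    for bom, label, byte_order in _BOMS:
        if data.startswith(bom):
            return label, byte_order, len(bom)
    return None


if __name__ == "__main__":
    sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    print(sniff_bom(sample))
```

One caveat of this ordering: the bytes FF FE 00 00 could also be
little-endian UTF-16 text that starts with U+0000 after the BOM, so a
real consumer may want a different policy for that case.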

Declaring endianness at the transfer protocol level has no benefit
over using the BOM when the label is right, and there is a chance of
getting the label wrong, so the encoding labels with explicit
endianness are harmful for interchange. In my opinion, the spec
should avoid giving authors bad ideas by reinforcing these labels
through repetition.
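
The risk can be sketched with Python's built-in codecs, used here only
as an illustration (the variable names are hypothetical). With a BOM
and the generic "utf-16" label, the decoder works out the byte order
itself. With an explicit-endianness label, the decoder trusts the
label, and a wrong label silently produces different characters.

```python
import codecs

# The same two characters serialized as little-endian UTF-16.
payload = "hi".encode("utf-16-le")       # no BOM: 68 00 69 00
with_bom = codecs.BOM_UTF16_LE + payload  # BOM first: FF FE 68 00 69 00

# Generic label plus BOM: the byte order is read from the data and the
# BOM is consumed by the decoder.
from_bom = with_bom.decode("utf-16")

# Explicit label that happens to be right: works, but only because the
# label and the bytes agree.
right_label = payload.decode("utf-16-le")

# Explicit label that is wrong: no error is raised. Each byte pair is
# read in the opposite order, so 68 00 becomes U+6800, not U+0068.
wrong_label = payload.decode("utf-16-be")

if __name__ == "__main__":
    print(repr(from_bom), repr(right_label), repr(wrong_label))
```

Nothing in the BOM-less byte stream lets the receiver notice the
mislabeling, which is the interchange hazard described above.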

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Sunday, 27 May 2007 01:56:29 UTC