- From: Chris Newman <Chris.Newman@INNOSOFT.COM>
- Date: Mon, 27 Jul 1998 11:52:38 -0700 (PDT)
- To: Larry Masinter <masinter@parc.xerox.com>
- Cc: ietf-charsets@ISI.EDU
On Fri, 24 Jul 1998, Larry Masinter wrote:
> I think we're getting into trouble in this case because we're trying
> to examine all of the possible senders and receivers of UTF-16 and then
> defining when they should or shouldn't include a BOM. However, if you
> had a registered charset, call it "marked-utf-16" with definition:
>
>     Either big-endian UTF-16
>     or a single BOM followed by little-endian UTF-16
>
> then it would seem to be clear what a sender should send and what
> a receiver should receive, without all of this complex case analysis.

I find both this solution and the two charsets solution to be acceptable.

Here's the ABNF for UTF-16, where the BOM is optional for
network/big-endian byte-order and mandatory for little-endian byte-order.
This is unambiguously parsable with one octet of lookahead. If the
little-endian variation is eliminated, then it's unambiguously parsable
without lookahead.

UTF-16        = UTF-16BE-STR / UTF-16LE-STR

UTF-16BE-STR  = *UTF-16BE-CHAR
UTF-16BE-CHAR = UTF-16BE-LO / UTF-16BE-HI / UTF-16BE-SUR
UTF-16BE-LO   = (%x00-d7 / %xe0-fe) %x00-ff
UTF-16BE-HI   = %xff %x00-fd
UTF-16BE-SUR  = %xd8-db %x00-ff %xdc-df %x00-ff

UTF-16LE-STR  = %xff %xfe *UTF-16LE-CHAR
UTF-16LE-CHAR = UTF-16LE-LO / UTF-16LE-HI / UTF-16LE-SUR
UTF-16LE-LO   = %x00-ff (%x00-d7 / %xe0-fe)
UTF-16LE-HI   = %x00-fd %xff
UTF-16LE-SUR  = %x00-ff %xd8-db %x00-ff %xdc-df

Note that this permits the BOM to be part of the data, so the XML spec
would be compliant with this.

I sure wish that Unicode/ISO-10646 had specified that network-byte order
is required for use in files and on networks; then we might not have had
this problem. This is just repeating the TIFF magic number mistake on a
grander scale.

		- Chris
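
As an illustration only (not part of the original message), the byte-order rule in the grammar above could be sketched in Python roughly as follows. The function names `split_marked_utf16` and `decode_marked_utf16` are hypothetical. The sketch assumes Python's built-in `utf-16-be` and `utf-16-le` codecs are an acceptable stand-in for the per-character rules: they reject odd-length input and unpaired surrogates, and the explicit check for U+FFFE and U+FFFF covers the code points that the -HI rules exclude.

```python
def split_marked_utf16(data: bytes):
    """Pick the byte order the way the grammar does (hypothetical helper).

    A leading FF FE selects little-endian and is consumed as the BOM.
    Anything else is big-endian, and a leading FE FF stays in the data.
    """
    if data[:2] == b"\xff\xfe":
        return "little", data[2:]
    return "big", data


def decode_marked_utf16(data: bytes) -> str:
    """Decode a string matching the UTF-16 rule; raise ValueError otherwise."""
    order, payload = split_marked_utf16(data)
    codec = "utf-16-le" if order == "little" else "utf-16-be"
    # The strict codecs raise UnicodeDecodeError (a ValueError subclass)
    # on truncated input and on unpaired surrogates.
    text = payload.decode(codec)
    # The grammar excludes U+FFFE and U+FFFF in both byte orders.
    if any(ch in "\ufffe\uffff" for ch in text):
        raise ValueError("U+FFFE and U+FFFF are not permitted by the grammar")
    return text
```

Because big-endian data may not contain U+FFFE, a leading FF FE can only be the little-endian BOM, which is why the check on the first two octets is enough to choose the byte order in this sketch.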
Received on Monday, 27 July 1998 11:53:51 UTC