Re: writeToString, write and, UTF-16[BE|LE] (was: "LSSerializer.writeToString - what encoding declaration?") from John Cowan on 2004-02-05 (www-dom@w3.org from January to March 2004)

From: John Cowan <cowan@ccil.org>
Date: Wed, 4 Feb 2004 23:24:48 -0500
To: Steve Schafer <steve@fenestra.com>
Cc: www-dom@w3.org
Message-ID: <20040205042448.GF25073@ccil.org>

Steve Schafer scripsit:

> I can understand "should not" in this case, but I don't see a need for
> "must not." If that's enforced, then parsers that don't understand the
> "UTF-16LE" or "UTF16-BE" encoding declarations have little hope of
> heuristically discovering the actual encoding.

Not at all.  You can heuristically discover it by looking for the XML
declaration (ignoring NULs and interpreting the non-NULs as ASCII).
There are really only five cases for the Appendix F heuristic:

1) There is a UTF-16 or UTF-32 BOM followed by an ASCII encoding
   declaration with embedded NULs.  Believe the declared encoding.

2) There is a UTF-16 BOM but no encoding declaration.  The encoding is
   "UTF-16", and the BOM tells you whether it's big- or little-endian.

3) There is an ASCII encoding declaration.  Believe the declared
   encoding.  Skip any UTF-8 BOM that may be present.

4) There is an encoding declaration in EBCDIC.  Believe the declared
   encoding.

5) Otherwise, the encoding is UTF-8.  Skip any UTF-8 BOM that may
   be present.

The UTF-16LE and UTF-16BE encodings require an encoding declaration,
so they fall into case 1.  CESU-8, which is very like UTF-8, falls into
case 3.

-- 
Henry S. Thompson said, / "Syntactic, structural,               John Cowan
Value constraints we / Express on the fly."     jcowan@reutershealth.com
Simon St. Laurent: "Your / Incomprehensible     http://www.reutershealth.com
Abracadabralike / schemas must die!"            http://www.ccil.org/~cowan

Received on Wednesday, 4 February 2004 23:30:14 UTC