- From: John Cowan <cowan@ccil.org>
- Date: Wed, 4 Feb 2004 23:24:48 -0500
- To: Steve Schafer <steve@fenestra.com>
- Cc: www-dom@w3.org
Steve Schafer scripsit:

> I can understand "should not" in this case, but I don't see a need for
> "must not." If that's enforced, then parsers that don't understand the
> "UTF-16LE" or "UTF16-BE" encoding declarations have little hope of
> heuristically discovering the actual encoding.

Not at all. You can heuristically discover it by looking for the XML
declaration (ignoring NULs and interpreting the non-NULs as ASCII).

There are really only five cases for the Appendix F heuristic:

1) There is a UTF-16 or UTF-32 BOM followed by an ASCII encoding
   declaration with embedded NULs. Believe the declared encoding.

2) There is a UTF-16 BOM but no encoding declaration. The encoding is
   "UTF-16", and the BOM tells you whether it's big- or little-endian.

3) There is an ASCII encoding declaration. Believe the declared
   encoding. Skip any UTF-8 BOM that may be present.

4) There is an encoding declaration in EBCDIC. Believe the declared
   encoding.

5) Otherwise, the encoding is UTF-8. Skip any UTF-8 BOM that may be
   present.

The UTF-16LE and UTF-16BE encodings require an encoding declaration, so
they fall into case 1. CESU-8, which is very like UTF-8, falls into
case 3.

-- 
Henry S. Thompson said, / "Syntactic, structural,   John Cowan
Value constraints we / Express on the fly."         jcowan@reutershealth.com
Simon St. Laurent: "Your / Incomprehensible         http://www.reutershealth.com
Abracadabralike / schemas must die!"                http://www.ccil.org/~cowan
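
The five cases listed in the message above can be sketched roughly as
follows. This is only an illustration, not the normative Appendix F
procedure: the function name sniff_encoding, the 512-byte lookahead, the
use of Python's cp037 codec to read an EBCDIC declaration, and the
returned label strings are all assumptions made for the example.

```python
import re

# Matches encoding="..." in an XML declaration that has already been
# reduced to plain text (NULs stripped, or EBCDIC decoded).
_DECL = re.compile(
    r'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_encoding(data: bytes) -> str:
    """Guess an XML entity's encoding from its first bytes (hypothetical helper)."""
    head = data[:512]  # assumed lookahead; the declaration is near the start
    utf16_bom = None

    # Strip any BOM. UTF-32 is tested first because its little-endian BOM
    # begins with the same two bytes as the UTF-16 little-endian BOM.
    if head.startswith((b"\x00\x00\xfe\xff", b"\xff\xfe\x00\x00")):
        head = head[4:]
    elif head.startswith(b"\xfe\xff"):
        utf16_bom, head = "big-endian", head[2:]
    elif head.startswith(b"\xff\xfe"):
        utf16_bom, head = "little-endian", head[2:]
    elif head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]

    # Cases 1 and 3: ignore NULs, read the rest as ASCII, and believe
    # whatever the declaration says.
    ascii_view = head.replace(b"\x00", b"").decode("ascii", errors="replace")
    match = _DECL.match(ascii_view)
    if match:
        return match.group(1)

    # Case 2: UTF-16 BOM but no encoding declaration.
    if utf16_bom:
        return "UTF-16 (" + utf16_bom + ")"

    # Case 4: "<?xm" in EBCDIC is 4C 6F A7 94; cp037 is assumed to be
    # close enough to read the declaration itself.
    if head.startswith(b"\x4c\x6f\xa7\x94"):
        match = _DECL.match(head.decode("cp037"))
        if match:
            return match.group(1)

    # Case 5: everything else is UTF-8. A UTF-32 BOM with no declaration
    # is not among the five cases as listed, so it also ends up here.
    return "UTF-8"
```

Note that a BOM-less UTF-16LE or UTF-16BE entity is still recognized by
this sketch, because its declaration survives the NUL-stripping step.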
Received on Wednesday, 4 February 2004 23:30:14 UTC