Re: FW: Encoding detection again ... from C M Sperberg-McQueen on 1999-03-17 (xml-editor@w3.org from January to March 1999)

From: C M Sperberg-McQueen <cmsmcq@acm.org>
Date: Tue, 16 Mar 1999 18:11:33 -0600
To: msabin@cromwellmedia.co.uk
CC: xml-editor@w3.org, cmsmcq@acm.org
Message-Id: <199903170011.SAA141542@tigger.cc.uic.edu>

>From: Miles Sabin <msabin@cromwellmedia.co.uk>
>Date: Tue, 2 Mar 1999 15:51:43 -0000
>
>Hi,
>
>John Cowan suggested that I forward the following
>to you for consideration as an erratum for XML 1.0.
>
>Cheers,
>
>
>Miles
>
>Miles Sabin                          Cromwell Media
>Internet Systems Architect           5/6 Glenthorne Mews
>+44 (0)181 410 2230                  London, W6 0LJ
>msabin@cromwellmedia.co.uk           England

Thank you for the note.  My comments follow.

>-----Original Message-----
>From: Miles Sabin 
>Sent: 02 March 1999 11:59 am
>To: 'xml-dev@ic.ac.uk'
>Subject: Encoding detection again ...
>
>
>I've been browsing through the archives for an
>answer to this question, but I haven't been able
>to find anything that seems to give a completely
>unambiguous answer ...
>
>Appendix F of the spec says that given a document
>starting with the 4 octet sequence,
>
>  00 3C 00 3F
>
>I'm to infer BOM-less big-endian UTF-16, and 
>given a document starting with,
>
>  3C 00 3F 00
>
>I'm to infer BOM-less little-endian UTF-16.
>
>What I want to know is: why could these
>sequences not equally represent (respectively)
>big-endian UCS-2 or little-endian UCS-2? In
>other words, surely these octet sequences are
>ambiguous, and hence the encoding should be
>resolved definitively with either,
>
>  <?xml version="1.0" encoding="UTF-16"?>
>
>or,
>
>  <?xml version="1.0" encoding="ISO-10646-UCS-2"?>
>
>or an appropriate MIME header, i.e.,
>
>  Content-type: text/xml; charset="utf-16"
>
>or,
>
>  Content-type: text/xml; charset="ISO-10646-UCS-2"
>
>Just so there's no confusion ... I'm assuming:
>
>1. Unicode == UTF-16
>2. UCS-2 != UTF-16 (because UCS-2 lacks UTF-16's
>   support for characters outside the BMP).

You are quite right; those octet sequences could easily be found
either in UTF-16 or in UCS-2, and thus don't wholly distinguish the two.
The intention is indeed just as you infer: from the first few octets,
the program knows enough to read the entire encoding declaration, and
the entire encoding declaration says whether it's UTF-16 or UCS-2.  A
similar process is envisaged with UTF-8, ISO 646, ASCII, ISO 8859-*,
Shift-JIS, EUC, etc., but there the ambiguity is explicitly stated.
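
The two-step process described above might be sketched roughly as
follows, in Python. This is only an illustration under stated
assumptions: the names provisional_codec and declared_encoding, the
200-octet limit, and the simplified regular expression are all
hypothetical, and only the BOM-less cases mentioned in this thread are
covered.

```python
import re

# Provisional codecs keyed by the first four octets (BOM-less cases only).
# Each codec is just good enough to read the ASCII-range characters of the
# XML declaration; it does NOT settle the final encoding.
_PREFIXES = {
    b"\x00\x3c\x00\x3f": "utf-16-be",  # 16-bit big-endian: UTF-16 or UCS-2
    b"\x3c\x00\x3f\x00": "utf-16-le",  # 16-bit little-endian: UTF-16 or UCS-2
    b"\x3c\x3f\x78\x6d": "ascii",      # "<?xm": UTF-8, ISO 646, ISO 8859-*, etc.
}

# Simplified pattern for the encoding pseudo-attribute (an assumption; a real
# parser would follow the XML 1.0 grammar for the declaration).
_DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def provisional_codec(data: bytes):
    """Step 1: guess a codec family from the first four octets, or None."""
    return _PREFIXES.get(data[:4])


def declared_encoding(data: bytes, limit: int = 200):
    """Step 2: read the declaration with the provisional codec and return
    the encoding name it states, or None if none can be found."""
    codec = provisional_codec(data)
    if codec is None:
        return None
    head = data[:limit].decode(codec, errors="replace")
    match = _DECL.search(head)
    return match.group(1) if match else None
```

Given a document that begins 00 3C 00 3F, provisional_codec only
narrows things down to the 16-bit big-endian family; it is the value
returned by declared_encoding ("UTF-16" or "ISO-10646-UCS-2") that
resolves the remaining ambiguity.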

It would be less confusing if the ambiguity were explicitly
recognized in this case as well.  

Thanks for pointing this out!

-C. M. Sperberg-McQueen
 Co-editor, XML 1.0
 Senior Research Programmer, University of Illinois at Chicago
 cmsmcq@uic.edu, cmsmcq@acm.org

Received on Tuesday, 16 March 1999 19:15:05 UTC