- From: C M Sperberg-McQueen <cmsmcq@acm.org>
- Date: Tue, 16 Mar 1999 18:11:33 -0600
- To: msabin@cromwellmedia.co.uk
- CC: xml-editor@w3.org, cmsmcq@acm.org
>From: Miles Sabin <msabin@cromwellmedia.co.uk>
>Date: Tue, 2 Mar 1999 15:51:43 -0000
>
>Hi,
>
>John Cowan suggested that I forward the following
>to you for consideration as an erratum for XML 1.0.
>
>Cheers,
>
>
>Miles
>
>Miles Sabin                       Cromwell Media
>Internet Systems Architect        5/6 Glenthorne Mews
>+44 (0)181 410 2230               London, W6 0LJ
>msabin@cromwellmedia.co.uk        England

Thank you for the note. My comments follow.

>-----Original Message-----
>From: Miles Sabin
>Sent: 02 March 1999 11:59 am
>To: 'xml-dev@ic.ac.uk'
>Subject: Encoding detection again ...
>
>
>I've been browsing through the archives for an
>answer to this question, but I haven't been able
>to find anything that seems to give a completely
>unambiguous answer ...
>
>Appendix F of the spec says that given a document
>starting with the 4 octet sequence,
>
>  00 3C 00 3F
>
>I'm to infer BOM-less big-endian UTF-16, and
>given a document starting with,
>
>  3C 00 3F 00
>
>I'm to infer BOM-less little-endian UTF-16.
>
>What I want to know is: why could these
>sequences not equally represent (respectively)
>big-endian UCS-2 or little-endian UCS-2? In
>other words, surely these octet sequences are
>ambiguous, and hence the encoding should be
>resolved definitively with either,
>
>  <?xml version="1.0" encoding="UTF-16"?>
>
>or,
>
>  <?xml version="1.0" encoding="ISO-10646-UCS-2"?>
>
>or an appropriate MIME header, ie.,
>
>  Content-type: text/xml; charset="utf-16"
>
>or,
>
>  Content-type: text/xml; charset="ISO-10646-UCS-2"
>
>Just so there's no confusion ... I'm assuming:
>
>1. Unicode == UTF-16
>2. UCS-2 != UTF-16 (because UCS-2 lacks UTF-16's
>   support for characters outside the BMP).

You are quite right; those octet sequences could easily be found
either in UTF-16 or in UCS-2, and thus don't wholly disambiguate
them.

The intention is indeed just as you infer: from the first few octets,
the program knows enough to read the entire encoding declaration, and
the entire encoding declaration says whether it's UTF-16 or UCS-2.

A similar process is envisaged with UTF-8, ISO 646, ASCII, ISO 8859-*,
Shift-JIS, EUC, etc., but there the ambiguity is explicitly stated.
It would be less confusing if the ambiguity were explicitly recognized
in this case as well.

Thanks for pointing this out!

-C. M. Sperberg-McQueen
 Co-editor, XML 1.0
 Senior Research Programmer, University of Illinois at Chicago
 cmsmcq@uic.edu, cmsmcq@acm.org
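
The two-step process described above (use the first few octets to pick
a provisional decoding, then read the encoding declaration to settle
the matter) might be sketched roughly as follows. This is only an
illustration under assumptions: the function names guess_family and
declared_encoding, the 200-octet read-ahead, and the use of Python
codec names as family labels are all invented here and are not taken
from the spec.

```python
import re

def guess_family(head: bytes) -> str:
    """Map the first four octets to a provisional Python codec name.

    The result is only good enough to read the XML declaration; it does
    not distinguish UTF-16 from UCS-2, or UTF-8 from ISO 8859-*, etc.
    """
    first4 = head[:4]
    if first4 == b"\x00\x3c\x00\x3f":
        return "utf-16-be"   # 16-bit, big-endian, no BOM
    if first4 == b"\x3c\x00\x3f\x00":
        return "utf-16-le"   # 16-bit, little-endian, no BOM
    if first4 == b"\x3c\x3f\x78\x6d":
        return "ascii"       # "<?xm" in some ASCII-compatible encoding
    return "unknown"

# Simplified pattern for the encoding pseudo-attribute (an assumption,
# not the full grammar of the XML declaration).
DECL = re.compile(
    r'''<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)

def declared_encoding(data: bytes):
    """Return (provisional codec, declared encoding name or None)."""
    family = guess_family(data)
    if family == "unknown":
        return family, None
    # Decode only a short prefix; 200 is even, so 16-bit units stay whole.
    text = data[:200].decode(family, errors="replace")
    match = DECL.match(text)
    return family, (match.group(1) if match else None)
```

For a document beginning 00 3C 00 3F, the first step yields only
"16-bit big-endian"; the declaration (or a MIME header, which this
sketch ignores) is what says whether the text is UTF-16 or UCS-2.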
Received on Tuesday, 16 March 1999 19:15:05 UTC