charset details from Martin J. Duerst on 1999-10-25 (xml-editor@w3.org from October to December 1999)

From: Martin J. Duerst <duerst@w3.org>
Date: Mon, 25 Oct 1999 16:17:09 +0900
To: xml-editor@w3.org
Message-Id: <199910250722.QAA04840@sh.w3.mag.keio.ac.jp>

I herewith submit the following errata reports. For further details,
please see the thread starting at
http://lists.w3.org/Archives/Member/w3c-i18n-ig/1999Oct/0136.html
(W3C members only).

Regards,   Martin.

> 1) In http://www.w3.org/TR/REC-xml#charencoding
> 
> "All XML processors must be able to read entities in either UTF-8 or UTF-16."
> 
> This could be read as allowing a processor to support only UTF-8
> or only UTF-16. But I know that's not the way it was intended.
> 
> A better wording would probably be:
> 
> "All XML processors must be able to read both entities in UTF-8 and
> entities in UTF-16."
> 
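The intended reading of item 1, that a single processor handles both encodings, can be sketched as follows. This is only an illustration, not wording from the spec: `decode_entity` is a hypothetical helper, and it assumes the entity carries no encoding declaration that would need to be consulted and is not in UTF-32 (whose little-endian byte order mark begins with the same two bytes as the UTF-16 one).

```python
import codecs


def decode_entity(data: bytes) -> str:
    """Decode an entity that is in either UTF-8 or UTF-16 (hypothetical helper)."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the byte order mark and takes the
        # byte order from it.
        return data.decode("utf-16")
    # No UTF-16 byte order mark: treat the entity as UTF-8.
    # "utf-8-sig" also strips an optional UTF-8 byte order mark.
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    doc = '<?xml version="1.0"?><a>\u00e9</a>'
    # The same function accepts both encodings.
    print(decode_entity(doc.encode("utf-8")) == doc)
    print(decode_entity(doc.encode("utf-16")) == doc)
```

A processor that implemented only one of the two branches would match the unintended reading of the current sentence.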
> 
> 2) Again in http://www.w3.org/TR/REC-xml#charencoding
> 
>           In an encoding declaration, the values "UTF-8", "UTF-16",
>           "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the
>           various encodings and transformations of Unicode / ISO/IEC
>           10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9"
>           should be used for the parts of ISO 8859, and the values
>           "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the
>           various encoded forms of JIS X-0208-1997. XML processors
>           may recognize other encodings; it is recommended that
>           character encodings registered (as charsets) with the Internet
>           Assigned Numbers Authority [IANA], other than those just
>           listed, should be referred to using their registered names. Note
>           that these registered names are defined to be case-insensitive, so
>           processors wishing to match against them should do so in a
>           case-insensitive way.
> 
> There are several problems here:
> 
> - Case sensitivity is defined for each single value, instead of for the
>   value in general. It would be better to say that the value in general
>   is case-insensitive.
> - There is advice on how to label entities, but none on how to interpret
>   values. A parser that interprets "EUC-JP" as, let's say, "european
>   unified character set - joint code page" (there currently isn't such
>   a thing :-) would be fully conformant, although I'm not at all sure
>   that this was the intention when the XML spec was written, or that
>   this would be desirable. Two additions seem to be necessary:
>   - Say that all values registered with IANA must either be
>     interpreted in the way defined by IANA or treated as unknown
>     (-> error).
>   - Because we cannot predict what IANA will register in the future,
>     say that the x- prefix should be used for anything not registered
>     with IANA.
>   This would bring things in line with, e.g., the most recent wording
>   in XSLT. [http://www.w3.org/TR/xslt#output, see encoding]
> 
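The three rules proposed in item 2 (case-insensitive matching, registered names interpreted as registered or else rejected, and an x- prefix for private names) could look roughly like this in a processor. This is a sketch under assumptions: `REGISTERED` is a small illustrative table standing in for the IANA registry, mapped here to Python codec names, and `resolve_encoding` and `UnknownEncodingError` are hypothetical names.

```python
from typing import Dict, Optional

# Illustrative subset of the names listed in the quoted passage. This is a
# stand-in for the IANA charset registry, not the registry itself.
REGISTERED: Dict[str, str] = {
    "utf-8": "utf_8",
    "utf-16": "utf_16",
    "iso-8859-1": "iso8859_1",
    "iso-2022-jp": "iso2022_jp",
    "shift_jis": "shift_jis",
    "euc-jp": "euc_jp",
}


class UnknownEncodingError(ValueError):
    """Raised when an encoding name cannot be interpreted."""


def resolve_encoding(label: str, private: Optional[Dict[str, str]] = None) -> str:
    """Map an encoding declaration value to a codec name (hypothetical helper)."""
    key = label.lower()  # the value as a whole is matched case-insensitively
    if key in REGISTERED:
        # Registered names are interpreted only in their registered meaning.
        return REGISTERED[key]
    if key.startswith("x-") and private is not None and key in private:
        # Names outside the registry are accepted only with the x- prefix,
        # and only if this processor was told what they mean.
        return private[key]
    # Everything else is treated as unknown, which is an error.
    raise UnknownEncodingError(label)
```

With these rules, "EUC-JP" and "euc-jp" resolve to the same codec, and a processor cannot silently assign its own meaning to a registered name.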
> Can you follow up on this, or tell me it's already dealt with,
> or tell me what I have to do? Many thanks in advance.
> 
> 
> Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org
Received on Monday, 25 October 1999 03:21:34 UTC