Re: Unicode references from Martin Duerst on 2003-10-17 (public-qt-comments@w3.org from October 2003)

From: Martin Duerst <duerst@w3.org>
Date: Fri, 17 Oct 2003 16:09:27 -0400
To: "Ashok Malhotra" <ashokma@microsoft.com>, <w3c-i18n-ig@w3.org>, <w3c-xml-query-wg@w3.org>
Cc: <public-qt-comments@w3.org>, msm@w3.org
Message-Id: <4.2.0.58.J.20031017152433.055a75a8@localhost>

Hello Ashok,

Many thanks to you and others for the clarification. Now that we
know that you meant to write

 >>>>
There was a feeling that Unicode 4.0 was too new and as yet not widely 
implemented.
 >>>>

we can look at the actual issues.


The Character Model has a special section on how to reference Unicode/
ISO 10646:
http://www.w3.org/TR/charmod/#sec-RefUnicode
[note that that document itself unfortunately still references
Unicode 3 in the public version, but the editor's version is
changed to Unicode 4]

The important point from there is the distinction between generic
references and specific references. Generic references are appropriate
if the reference is to the coded character set, and the idea is that
newly encoded additions will be usable, too. My understanding is that
this is, and should be, the case for XML Query, with respect to
the data it processes.

Specific references have to be used if it is important to lock down
some things that are known to potentially change, such as some
character properties. Examples include the properties used for regular
expressions in XML Schema; for XML Query, this could be a specific
version of the Unicode collation algorithm (which is actually a
separate document and therefore not relevant here) or some specific
character properties.
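
To illustrate why character properties need a specific reference, here is
a minimal sketch. It assumes CPython, whose standard unicodedata module
ships a frozen Unicode 3.2.0 database (unicodedata.ucd_3_2_0) next to the
current one. The sample character, U+20B9, is my own choice for the
illustration and has nothing to do with XML Query; it was simply encoded
after version 3.2.

```python
import unicodedata

# Illustration only: U+20B9 INDIAN RUPEE SIGN was encoded in Unicode 6.0,
# so it is unassigned in the 3.2.0 database.
ch = "\u20b9"

old_db = unicodedata.ucd_3_2_0           # frozen Unicode 3.2.0 data
old_version = old_db.unidata_version     # "3.2.0"
new_version = unicodedata.unidata_version

cat_old = old_db.category(ch)            # general category under 3.2.0
cat_new = unicodedata.category(ch)       # general category under current data

print(f"Unicode {old_version}: {cat_old}")
print(f"Unicode {new_version}: {cat_new}")
```

The code point is the same in both cases, which is what a generic
reference covers, but the property lookup answers differently depending
on the database version, which is what a specific reference pins down.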

For the former, using the latest major version seems to be a very
good idea. It is widely available in book form and on the Web. It
would be strange to refer to an older book version; readers of
your spec who go out to buy the version 3.0 book and then discover
that there was version 4.0 would get angry at you for the money
they wasted. With respect to where generic references are appropriate,
there is absolutely no problem with referencing different versions
in different specs (e.g. Unicode 2.0 in XML 1.0, Unicode 3.0
in XML Schema, Unicode 4.0 in XML Query) because all these references
point to exactly the same thing, the open-ended coded character set,
and are consistent even though the numbers are not the same.

For the latter, this has to be examined case-by-case. For example,
for regular expressions, XML Schema refers to the Unicode Database
version 3.1, although it dodges complete interoperability by
encouraging support for newer versions. More than one reference
may be necessary.



At 08:28 03/10/16 -0700, Ashok Malhotra wrote:

>In their review of the F&O document the I18N WG asked that the references 
>to Unicode be changed to Unicode 4.0 and to change the reference to case 
>mapping to Annexure #15 of Unicode 4.0 and the normalization forms defined 
>in Unicode 4.0. See also 
>http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0105.html 
>and 
>http://lists.w3.org/Archives/Public/public-qt-comments/2003Jul/0106.html
>
>The XML Schema WG asked that the references to Unicode be consistent. See 
>http://lists.w3.org/Archives/Public/public-qt-comments/2003Aug/0003.html

I have looked through this document. I have not found any such request.
(I have copied Michael Sperberg-McQueen, maybe he can help)
The closest I have found is
http://www.w3.org/XML/Group/2003/07/xmlschema-fo-comments.html#d0e327
This refers to the fact that Unicode 2.0 and Unicode 3.0 do not
clearly outlaw UTF-8 encoding of non-BMP characters in six bytes
(three bytes for each of two surrogate code points).

If the intent of the XML Schema WG was to request that XML Query
should in any way tolerate such encoding, then this would not
be appropriate. XML 1.0 references RFC 2279 (see
http://www.w3.org/TR/REC-xml#sec-external-ent), which is already
clear that non-BMP characters have to be encoded as four bytes.
Tolerating it would also be a bad idea given that there are known security
problems connected with overlong UTF-8 byte sequences.
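
As a concrete illustration of the two encodings, here is a minimal sketch
in Python. The sample code point U+10000 and the helper strict_ok are
assumptions made for the example, not anything taken from the specs. It
builds the four-byte UTF-8 form, builds the six-byte form by encoding each
surrogate separately, and checks how a strict UTF-8 decoder treats them
and a classic overlong sequence.

```python
# Illustration only: U+10000 is the first code point outside the BMP.
ch = "\U00010000"

# Proper UTF-8: one four-byte sequence.
four_bytes = ch.encode("utf-8")

# Six-byte form: take the two UTF-16 surrogates and encode each one
# as its own three-byte sequence ("surrogatepass" permits this on purpose).
utf16 = ch.encode("utf-16-be")
high = int.from_bytes(utf16[:2], "big")   # 0xD800
low = int.from_bytes(utf16[2:], "big")    # 0xDC00
six_bytes = b"".join(
    chr(s).encode("utf-8", "surrogatepass") for s in (high, low)
)

# Overlong two-byte encoding of "/" (U+002F), which should be one byte.
overlong_slash = b"\xc0\xaf"


def strict_ok(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(four_bytes.hex(), strict_ok(four_bytes))
print(six_bytes.hex(), strict_ok(six_bytes))
print(overlong_slash.hex(), strict_ok(overlong_slash))
```

A decoder that accepted the second or third form would let the same
character arrive under more than one byte sequence, which is how checks
made on the raw bytes end up being bypassed.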


>This was discussed in the joint WG meeting in Toronto.  There was a 
>feeling that Unicode was too new and as yet not widely implemented.  There 
>was also the argument for consistency.  Thus, it was decided that the F&O 
>would refer to the XML standard, since that is what the spec is based on, 
>and be consistent with it.

Finally, back to the original statement:

 >>>>
There was a feeling that Unicode 4.0 was too new and as yet not widely 
implemented.
 >>>>

Some parts of Unicode 2.0 may in some sense not yet be very well implemented
(e.g. Indic script rendering in certain environments). On the other hand,
some parts of Unicode 4.0 are already implemented, because Unicode 4.0 went
through various reviews and a beta period.

So the above claim doesn't seem to make much sense to me.


Regards,    Martin.
Received on Friday, 17 October 2003 16:11:35 UTC