
Your comments on the Character Model [C068-C072, C079]

From: Richard Ishida <ishida@w3.org>
Date: Fri, 16 Jan 2004 08:18:49 -0000
To: <tbray@textuality.com>
Cc: <www-i18n-comments@w3.org>
Message-ID: <000301c3dc09$5ee0e800$6601a8c0@w3cishida>

Dear Tim,

Many thanks for your comments on the 2nd Last Call version of the Character
Model for the World Wide Web v1.0 [1].  We appreciate the interest you have
taken in this specification.

You can see the comments you submitted, grouped together, at 
(You can jump to a specific comment in the table by adding its ID to the end
of the URI.)

PLEASE REVIEW the decisions for the following additional comments and reply
to us within the next two weeks at mailto:www-i18n-comments@w3.org (copying
w3c-i18n-ig@w3.org) to say whether you are satisfied with the decisions:
        C068, C069, C070, C071, C072, C079

Information relating to these comments is included below. You will receive
notification of decisions on remaining comments at a later date.  Note also
our intention with regard to C074.

You can find the latest version of the Character Model at
http://www.w3.org/International/Group/charmod-edit/ . 

Best regards,
Richard Ishida, for the I18N WG


****C068 Tim Bray
   [531]3.6 Unique Character Encoding
     * See also the following comments: C114
     * Comment (received 2002-05-30) -- [534]Comments on Character Model
       [S] When designing a new protocol, format or API, specifications
       SHOULD mandate a unique character encoding.
       No. If the format is in XML and has likely usage scenarios which
       include creation by humans, this is a good enough reason to just
       go by the XML rules. For example, I habitually compose XML
       documents in ISO-8859-1, which suits my needs as a user of
       European languages. I see no reason whatsoever why a specification
       should invalidate either my habits or those of a Japanese author
       who wants to use some flavor of JIS.
       OK, I guess this argument could fall under the exception clause of
       SHOULD, but I'd go so far as to add
       [S] When designing an XML-based protocol which is apt to be
       authored by humans, specifications MUST NOT limit the use of
       character encodings beyond the rules provided by XML.
     * Decision: Partially accepted.
     * Rationale: We have added: "[S] When basing a protocol, format, or
       API on a protocol, format, or API that already has rules for
       character encoding, specifications SHOULD use rather than change
       these rules." and have added XML as an example. As said elsewhere,
       we prefer not to have requirements specific to a particular
       format. Also, the 'authored by humans' part is not necessarily
       true; in general, humans care about the actual text and about the
       tools they use, not about encodings.

    [531] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Encodings
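[Editorial aside, not part of the original thread: the behaviour C068 relies
on can be sketched with the Python standard library. The sample document is
hypothetical; the point is that an XML parser already honours the encoding
declaration, so an XML-based format inherits XML's encoding rules.]

```python
# Sketch: XML's own encoding rules in action. A document authored in
# ISO-8859-1 (as Tim Bray describes doing) carries its encoding in the
# XML declaration, and a conforming parser decodes it accordingly.
import xml.etree.ElementTree as ET

doc = '<?xml version="1.0" encoding="iso-8859-1"?><p>café</p>'.encode("iso-8859-1")

root = ET.fromstring(doc)
print(root.text)  # the parser decodes per the declaration
```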

****C069 Tim Bray
   [536]3.6.2 Admissibility of UTF-*
     * Comment (received 2002-05-30) -- [538]Comments on Character Model
       The paragraph beginning
       '[S] If the unique encoding approach is not chosen, specifications
       MUST designate at least one of the UTF-8 and UTF-16 encoding forms
       of Unicode as admissible... '
       is fine, but if the format uses XML, then XML's rules cover this
       and in fact require that UTF-8 and -16 are both admissible; that
       takes priority over the language here, and this should be noted.
     * Decision: Partially accepted.
     * Note: Covered by our edit resulting from [539]C114 and your
       previous comment [540]C068.

    [536] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-EncodingIdent

****C070 Tim Bray
   [542]4 Early Uniform Normalization
     * Comment (received 2002-05-30) -- [544]Comments on Character Model
       I am unable to develop an intelligent opinion as to the
       cost-benefit trade-off of Early Uniform Normalization and will
       remain unable to do so without hard information as to the cost.
       For example, if there was a C-language library available
       unencumbered by licensing issues which had a memory footprint
       smaller than say 10k and which ran at I/O speeds, you could
       reasonably argue that this is a cost effectively equal to zero. On
       the other hand, if E.U.N. requires a memory footprint of 256K or,
       worse, understanding and linking to the entire ICU library
       (blecch), the cost is likely to be unacceptable in a large class
       of applications.
       There's a normalizer demo at Unicode.org referenced from Appendix
       D, which suggests that a few hundred lines of Java suffice, but I
       haven't had time to build the tables or to really think about
       whether they are being done in the best possible way.
       I think my blockage on this point will be shared by the AC members
       who will eventually be asked to express an opinion on E.U.N. So I
       think somebody owes the world the gift of a few quantitative
       research results on these numbers.
     * Decision: Noted.
       We agree that this is an important consideration. Please refer to
       some earlier mails by Mark Davis about cost of
       checking/normalizing. Doing normalization really early (when data
       is input or converted) is usually very cheap because it can be
       done by design (e.g. keyboards with dead keys, conversion from a
       specific legacy encoding). Normalization is indeed best run at i/o
       speed, but this should be human input speed rather than network
       i/o speed.
       A general normalization algorithm needs significantly more than
       10KB footprint. But there is quite a wide range of possible
       tradeoffs between speed and footprint.
       We have added references to implementations and additional
       material in Appendix D, Resources for Normalization. There is
       also an FAQ at
       [546]http://www.unicode.org/faq/normalization.html. An
       implementation that I (MD) did for just *checking* NFC came in
       under 50KB (in C). Mark reported 110KB for actual normalization to
       NFC (in Java).

    [542] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Normalization
    [546] http://www.unicode.org/faq/normalization.html
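[Editorial aside, not part of the original thread: the distinction the WG
draws between cheap *checking* and full *normalization* can be illustrated
with Python's standard library; the sample strings are illustrative only.]

```python
# Sketch: checking whether text is already in NFC (the cheap "quick check"
# path) versus actually normalizing it. Early normalization at input time
# means later comparisons need no conversion at all.
import unicodedata

composed = "\u00e9"        # é as a single precomposed code point (NFC)
decomposed = "e\u0301"     # e + combining acute accent (NFD)

print(unicodedata.is_normalized("NFC", composed))    # already NFC
print(unicodedata.is_normalized("NFC", decomposed))  # needs normalizing

# Normalizing early makes the two forms compare equal thereafter:
print(unicodedata.normalize("NFC", decomposed) == composed)
```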

****C071 Tim Bray
   [548]6 Bit-by-bit identity
     * Comment (received 2002-05-30) -- [550]Comments on Character Model
       List item 4. 'Testing for bit-by-bit identity.'
       <pedantry intensity='severe'>This may be the way you do it, but I
       think it's the wrong way to talk about it. The point about Unicode
       is that it says a character is a thingie identified by a number
       which has a bunch of properties. At the end of the day, what you
       want people to do is to normalize the data in computer storage to
       a series of non-negative integers and, when testing for equality,
       if you have two sequences of non-negative integers which are equal
       in length and pairwise equal in value, then you have equality. It
       is conceivable in theory that the integer values are stored
       differently in two parts of the same program; and in practice, who
       knows what lurks inside a Perl 'scalar', and what really happens
       when Perl processes the '==' operator? So I think that item 4
       should say the strings are pairwise numerically equal by code
       point and leave it at that.</pedantry>
     * Decision: Rejected.
     * Rationale: What actually happens in the various programming
       languages we know is that they all require care to make sure that
       the encoding is really the same. There is no C function to
       automatically compare multibyte and wide-character representations,
       and so on. We think that it is much better to be too specific, to
       make sure implementers don't forget anything, rather than too
       general.
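[Editorial aside, not part of the original thread: the point at issue in
C071 can be shown concretely. Two strings that are identical as sequences
of code points can still differ bit-by-bit once encoded, which is why the
WG's wording insists on the encoding being the same. The example strings
are illustrative only.]

```python
# Sketch: same code points, different bits. Comparing encoded byte
# sequences directly is only meaningful when both use the same encoding.
s = "r\u00e9sum\u00e9"  # "résumé"

utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")

print(utf8 == utf16)  # bit-by-bit, the two encodings differ

# Decoded back to code points, they are pairwise numerically equal:
print(utf8.decode("utf-8") == utf16.decode("utf-16-le"))
print([ord(c) for c in s])  # the underlying non-negative integers
```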


****C072 Tim Bray
   [552]9 Referencing Unicode
     * Comment (received 2002-05-30) -- [554]Comments on Character Model
       [S] Since specifications in general need both a definition for
       their characters and the semantics associated with these
       characters, specifications SHOULD include a reference to the
       Unicode Standard, whether or not they include a reference to
       ISO/IEC 10646.
       Change SHOULD to MUST. There's no excuse for doing a spec that
       talks about this stuff without referencing Unicode. Among other
       things, it's easy to buy the Unicode spec, and the spec is useful;
       neither of these things are true about the ISO version.
     * Decision: Rejected.
     * Rationale: We do not think that a MUST is appropriate for this
       matter. Please see our answer to comment [555]C128.

    [552] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-RefUnicode

****C079 Tim Bray
   [591]4.4 '[C] In order to conform to this
   specification, all text content on the web MUST ...'
     * Comment (received 2002-05-30) -- [593]Comments on Character Model
       '[C] In order to conform to this specification, all text content
       on the web MUST...' Er, shouldn't this be [I] as well, since a lot
       of that content is produced by software? But see my comment on
       3.1.3 above, since split into [594]C073 and [595]C074.
     * Decision: Rejected.
     * Rationale: Covered by other requirements.


[1] The version of CharMod you commented on: 
[2] Latest editor's version (still being edited): 
[3] Last Call comments table, sorted by ID: 
Received on Friday, 16 January 2004 03:18:50 UTC
