- From: Richard Ishida <ishida@w3.org>
- Date: Fri, 16 Jan 2004 08:18:49 -0000
- To: <tbray@textuality.com>
- Cc: <www-i18n-comments@w3.org>
Dear Tim,

Many thanks for your comments on the 2nd Last Call version of the Character Model for the World Wide Web v1.0 [1]. We appreciate the interest you have taken in this specification.

You can see the comments you submitted, grouped together, at
http://www.w3.org/International/Group/2002/charmod-lc/SortByOriginator.html#C067
(You can jump to a specific comment in the table by adding its ID to the end of the URI.)

PLEASE REVIEW the decisions for the following additional comments and reply to us within the next two weeks at mailto:www-i18n-comments@w3.org (copying w3c-i18n-ig@w3.org) to say whether you are satisfied with the decision taken:

C068, C069, C070, C071, C072, C079

Information relating to these comments is included below. You will receive notification of decisions on the remaining comments at a later date. Note also our intention with regard to C074.

You can find the latest version of the Character Model at http://www.w3.org/International/Group/charmod-edit/ .

Best regards,
Richard Ishida, for the I18N WG


DECISIONS REQUIRING A RESPONSE
==============================

****C068 Tim Bray - [531]3.6 Unique Character Encoding

* See also the following comments: C114

* Comment (received 2002-05-30) -- [534]Comments on Character Model

[S] When designing a new protocol, format or API, specifications SHOULD mandate a unique character encoding.

No. If the format is in XML and has likely usage scenarios which include creation by humans, this is a good enough reason to just go by the XML rules. For example, I habitually compose XML documents in ISO-8859-1, which suits my needs as a user of European languages. I see no reason whatsoever why a specification should invalidate either my habits or those of a Japanese author who wants to use some flavor of JIS.
OK, I guess this argument could fall under the exception clause of SHOULD, but I'd go so far as to add:

[S] When designing an XML-based protocol which is apt to be authored by humans, specifications MUST NOT limit the use of character encodings beyond the rules provided by XML.

* Decision: Partially accepted.

* Rationale: We have added: "[S] When basing a protocol, format, or API on a protocol, format, or API that already has rules for character encoding, specifications SHOULD use rather than change these rules." and have added XML as an example. As said elsewhere, we prefer not to have requirements specific to a particular format. Also, the 'authored by humans' part is not necessarily true; in general, humans care about the actual text and about the tools they use, not about encodings.

[531] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Encodings
[534] http://lists.w3.org/Archives/Public/www-i18n-comments/2002May/0031.html

****C069 Tim Bray - [536]3.6.2 Admissibility of UTF-*

* Comment (received 2002-05-30) -- [538]Comments on Character Model

The paragraph beginning '[S] If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible...' is fine, but if the format uses XML, then XML's rules cover this and in fact require that UTF-8 and UTF-16 are both admissible; that takes priority over the language here, and this should be noted.

* Decision: Partially accepted.

* Note: Covered by our edit resulting from [539]C114 and your previous comment [540]C068.
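[Editorial aside, not part of the original correspondence: the XML rule appealed to in C068 and C069 is that an XML processor determines the character encoding from the document's encoding declaration, so the same content authored in different encodings parses to the same characters. A minimal Python sketch using the standard library's ElementTree (whose expat parser supports UTF-8, UTF-16, ISO-8859-1, and US-ASCII out of the box):]

```python
# Sketch: an XML processor reads the encoding declaration, so documents
# authored in ISO-8859-1 and in UTF-8 yield the same parsed characters.
import xml.etree.ElementTree as ET

latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>café</p>'.encode("iso-8859-1")
utf8_doc = '<?xml version="1.0" encoding="UTF-8"?><p>café</p>'.encode("utf-8")

# Same abstract text content, despite different byte representations.
assert ET.fromstring(latin1_doc).text == ET.fromstring(utf8_doc).text == "café"
```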
[536] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-EncodingIdent
[538] http://lists.w3.org/Archives/Public/www-i18n-comments/2002May/0031.html

****C070 Tim Bray - [542]4 Early Uniform Normalization

* Comment (received 2002-05-30) -- [544]Comments on Character Model

I am unable to develop an intelligent opinion as to the cost-benefit trade-off of Early Uniform Normalization, and will remain unable to do so without hard information as to the cost. For example, if there were a C-language library available, unencumbered by licensing issues, which had a memory footprint smaller than say 10K and which ran at I/O speeds, you could reasonably argue that this is a cost effectively equal to zero. On the other hand, if E.U.N. requires a memory footprint of 256K or, worse, understanding and linking to the entire ICU library (blecch), the cost is likely to be unacceptable in a large class of applications. There's a normalizer demo at Unicode.org referenced from Appendix D, which suggests that a few hundred lines of Java suffice, but I haven't had time to build the tables or to really think about whether they are being done in the best possible way. I think my blockage on this point will be shared by the AC members who will eventually be asked to express an opinion on E.U.N. So I think somebody owes the world the gift of a few quantitative research results on these numbers.

* Decision: Noted. We agree that this is an important consideration. Please refer to some earlier mails by Mark Davis about the cost of checking/normalizing. Doing normalization really early (when data is input or converted) is usually very cheap because it can be done by design (e.g. keyboards with dead keys, conversion from a specific legacy encoding). Normalization is indeed best run at I/O speed, but this should be human input speed rather than network I/O speed. A general normalization algorithm needs significantly more than 10KB footprint, but there is quite a wide range of possible trade-offs between speed and footprint. We have added references to implementations and additional material in Appendix D, Resources for Normalization:
[545]http://www.w3.org/International/Group/charmod-edit/#sec-n11n-resources
There is also an FAQ at [546]http://www.unicode.org/faq/normalization.html. An implementation that I (MD) did for just *checking* NFC came in under 50KB (in C). Mark reported 110KB for actual normalization to NFC (in Java).

[542] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-Normalization
[544] http://lists.w3.org/Archives/Public/www-i18n-comments/2002May/0031.html

****C071 Tim Bray - [548]6 Bit-by-bit identity

* Comment (received 2002-05-30) -- [550]Comments on Character Model

List item 4. 'Testing for bit-by-bit identity.' <pedantry intensity='severe'>This may be the way you do it, but I think it's the wrong way to talk about it. The point about Unicode is that it says a character is a thingie identified by number which has a bunch of properties. At the end of the day, what you want people to do is to normalize the data in computer storage to a series of non-negative integers and, when testing for equality, if you have two sequences of non-negative integers which are equal in length and pairwise equal in value, then you have equality. It is conceivable in theory that the integer values are stored differently in two parts of the same program; and in practice, who knows what lurks inside a Perl 'scalar', and what really happens when perl processes the '==' operator?
So I think that item 4 should say the strings are pairwise numerically equal by code point and leave it at that.</pedantry>

* Decision: Rejected.

* Rationale: What actually happens in the various programming languages we know is that they all require care to make sure that the encoding is really the same. There is no C function to automatically compare multibyte and wide-character representations, and so on. We think that it is much better to be too specific, to make sure implementers don't forget anything, than too abstract.

[548] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-IdentityMatching
[550] http://lists.w3.org/Archives/Public/www-i18n-comments/2002May/0031.html

****C072 Tim Bray - [552]9 Referencing Unicode

* Comment (received 2002-05-30) -- [554]Comments on Character Model

[S] Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646.

Change SHOULD to MUST. There's no excuse for doing a spec that talks about this stuff without referencing Unicode. Among other things, it's easy to buy the Unicode spec, and the spec is useful; neither of these things is true about the ISO version.

* Decision: Rejected.

* Rationale: We do not think that a MUST is appropriate for this matter. Please see our answer to comment [555]C128.

[552] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-RefUnicode
[554] http://lists.w3.org/Archives/Public/www-i18n-comments/2002May/0031.html

****C079 Tim Bray - [591]4.4 '[C] In order to conform to this specification, all text content on the web MUST ...'

* Comment (received 2002-05-30) -- [593]Comments on Character Model

'[C] In order to conform to this specification, all text content on the web MUST...' er, shouldn't this be [I] as well, since a lot of that content is produced by software?
But see my comment to 3.1.3 above (since split into [594]C073 and [595]C074).

* Decision: Rejected.

* Rationale: Covered by other requirements.

[591] http://www.w3.org/TR/2002/WD-charmod-20020430/#sec-NormalizationApplication
[593] http://lists.w3.org/Archives/Public/www-i18n-comments/2002May/0031.html

USEFUL LINKS
==============

[1] The version of CharMod you commented on:
http://www.w3.org/TR/2002/WD-charmod-20020430/

[2] Latest editor's version (still being edited):
http://www.w3.org/International/Group/charmod-edit/

[3] Last Call comments table, sorted by ID:
http://www.w3.org/International/Group/2002/charmod-lc/
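[Editorial aside, not part of the original correspondence: the two ideas debated under C070 and C071 -- normalize early, once, at input time, then test identity by pairwise code-point equality -- can be sketched in a few lines of Python with the standard-library unicodedata module; this illustrates the principle only, not the footprint measurements discussed above:]

```python
# Sketch of early uniform normalization (C070) and identity matching
# by code point (C071), using Python's standard unicodedata module.
import unicodedata

def normalize_early(text: str) -> str:
    """Normalize to NFC once, at input/conversion time."""
    return unicodedata.normalize("NFC", text)

def identical(a: str, b: str) -> bool:
    """Equal iff the NFC forms are equal in length and pairwise equal
    by code point -- spelled out explicitly, per the C071 comment."""
    na, nb = normalize_early(a), normalize_early(b)
    return len(na) == len(nb) and all(ord(x) == ord(y) for x, y in zip(na, nb))

# 'é' as one precomposed code point vs. 'e' + combining acute accent:
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"
assert precomposed != decomposed           # differ code point by code point
assert identical(precomposed, decomposed)  # but match once normalized to NFC
```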
Received on Friday, 16 January 2004 03:18:50 UTC