- From: Tim Bray <tbray@textuality.com>
- Date: Thu, 30 May 2002 16:55:27 -0700
- To: www-i18n-comments@w3.org, www-tag@w3.org
I think it's sensible to combine my input to the TAG discussion of this issue with my feedback to the charmod draft (but maybe I'm just being lazy).

My comments fall into two classes: substantive discussions of technical content and editorial nits. In this note, the nits are at the end, and TAGgers (in fact anyone but an actual charmod editor) can probably safely stop reading at the ============-line marking the end of substantive comments.

Some of my opinions might end up being echoed by the TAG, but this document as of now is just my opinions.

Substantive comments:

(1) 3.1.5 Collation

>>3.1.5 [S] [I] Software that sorts or searches text for users MUST
>>do so on the basis of appropriate collation units and ordering
>>rules for the relevant language and/or application.

Hmm, there are cases where you just don't know the language, and even if you do, is this a requirement in the general case for things like XQuery? I think there are scenarios where it's reasonable to say a particular module shall order things by Unicode character number order and that's all there is to it.

I think this should be rewritten to say that IF strings are being collated, they MUST be collated EITHER in the order appropriate to the language they're in, or, if that's not possible, by Unicode character number.

----------------------------------------------------

(2) 3.6 Unique Character Encoding

>>3.6 ... [S] When designing a new protocol, format or API,
>>specifications SHOULD mandate a unique character encoding.

No. If the format is in XML and has likely usage scenarios which include creation by humans, this is a good enough reason to just go by the XML rules. For example, I habitually compose XML documents in ISO-8859-1, which suits my needs as a user of European languages. I see no reason whatsoever why a specification should invalidate either my habits or those of a Japanese author who wants to use some flavor of JIS.
OK, I guess this argument could fall under the exception clause of SHOULD, but I'd go so far as to add:

[S] When designing an XML-based protocol which is apt to be authored by humans, specifications MUST NOT limit the use of character encodings beyond the rules provided by XML.

----------------------------------------------------

(3) 3.6.2 Admissibility of UTF-*

3.6.2 The paragraph beginning "[S] If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible... " is fine, but if the format uses XML, then XML's rules cover this and in fact require that UTF-8 and -16 are both admissible. This takes priority over the language here, and that should be noted.

----------------------------------------------------

(4) 4. Early Uniform Normalization

I am unable to develop an intelligent opinion as to the cost-benefit trade-off of Early Uniform Normalization and will remain unable to do so without hard information as to the cost. For example, if there were a C-language library available, unencumbered by licensing issues, which had a memory footprint smaller than, say, 10k and which ran at I/O speeds, you could reasonably argue that this is a cost effectively equal to zero. On the other hand, if E.U.N. requires a memory footprint of 256K or, worse, understanding and linking to the entire ICU library (blecch), the cost is likely to be unacceptable in a large class of applications.

There's a normalizer demo at Unicode.org referenced from Appendix D, which suggests that a few hundred lines of Java suffice, but I haven't had time to build the tables or to really think about whether they are being done in the best possible way.

I think my blockage on this point will be shared by the AC members who will eventually be asked to express an opinion on E.U.N. So I think somebody owes the world the gift of a few quantitative research results on these numbers.
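
For readers who want to see concretely what normalization buys, here is a minimal sketch. It assumes Python's standard `unicodedata` module as the normalizer, which is not the Unicode.org demo mentioned above, and it says nothing about memory footprint or speed, which is the open question. The helper name `nfc` is made up for illustration.

```python
import unicodedata


def nfc(text):
    """Return text in Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# Two spellings of the same user-perceived character:
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points

# Before normalization the code point sequences differ.
assert composed != decomposed

# After normalization they are the same sequence, so a plain
# code-point comparison is enough to test equality.
assert nfc(composed) == nfc(decomposed)
```

The point of normalizing early is that every later comparison can stay as simple as the last line; whether that is affordable depends on the cost numbers asked for above.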
----------------------------------------------------

(4) 6. Bit-by-bit identity

6. list item 4. "Testing for bit-by-bit identity."

<pedantry intensity="severe">This may be the way you do it but I think it's the wrong way to talk about it. The point about Unicode is that it says a character is a thingie identified by number which has a bunch of properties. At the end of the day, what you want people to do is to normalize the data in computer storage to a series of non-negative integers, and when testing for equality, if you have two sequences of non-negative integers which are equal in length and pairwise equal in value, then you have equality. It is conceivable in theory that the integer values are stored differently in two parts of the same program; and in practice, who knows what lurks inside a Perl "scalar", and what really happens when perl processes the "==" operator? So I think that item 4 should say the strings are pairwise numerically equal by code point and leave it at that.</pedantry>

----------------------------------------------------

(5) Referencing Unicode

>>9. ... [S] Since specifications in general need both a definition
>>for their characters and the semantics associated with these
>>characters, specifications SHOULD include a reference to the
>>Unicode Standard, whether or not they include a reference to
>>ISO/IEC 10646.

Change SHOULD to MUST. There's no excuse for doing a spec that talks about this stuff without referencing Unicode. Among other things, it's easy to buy the Unicode spec, and the spec is useful; neither of these things is true of the ISO version.

==================================================================

Nits:

3.1.3 "[S] Protocols, data formats and APIs MUST store, interchange or process text data in logical order" - shouldn't that be [S] [I] - software should do this too? In fact, arguably this should be [S] [I] [C]. Mind you, it seems that the boundaries between [S] [I] and [C] are pretty fuzzy.
If I were editing this thing, I'd just drop the whole notation and rely on getting the normative language right about what must be done, relying on the spec/data/software authors to follow the normative language that reasonably applies to them.

3.1.6 There is a problem in the phrase beginning "also known as octets"... it seems backward; the reason we talk about "octets" is that some bytes *used to be* non-8-bit; the fact that they're all 8-bit now means that the term "octet" is probably a bit redundant. Perhaps the wording is correct but my brain obstinately insists on reading it backward, so a little editorial cleanup is in order.

3.7 The bullet point beginning "[S] Escape syntax SHOULD either require explicit end delimiters" is fine, but the charmod document itself doesn't actually comply, per section 1.3's description of the U+hhhh notation. It might be elegant to cite the containing document as an example of non-compliance :)

4.2.2, list item "2" uses the term "legacy encoding"; since it's defined, shouldn't it be in bold?

4.2.2 (second NOTE), 4.2.3 (first NOTE) the phrase "(or the absence thereof)" baffles me no matter how many times I read it... please clarify a bit.

4.4 "[C] In order to conform to this specification, all text content on the web MUST..." er, shouldn't this be [I] as well, since a lot of that content is produced by software? But see my comment to 3.1.3 above.
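
Going back to the substantive comments on collation (1) and bit-by-bit identity (the second item numbered 4), the two operations proposed there can be sketched as follows. This is an illustration under assumptions, not wording for the spec: the function names are hypothetical, and Python is used only because its `ord()` exposes code points directly, independent of how the string is stored.

```python
def code_points(text):
    """Return the string as a list of non-negative integers (code points)."""
    return [ord(ch) for ch in text]


def equal_by_code_point(a, b):
    """Equal in length and pairwise numerically equal by code point."""
    ca, cb = code_points(a), code_points(b)
    return len(ca) == len(cb) and all(x == y for x, y in zip(ca, cb))


def sort_by_code_point(strings):
    """Fallback ordering when no language-appropriate collation is known."""
    return sorted(strings, key=code_points)


# The same text arriving in two different encodings has different bits...
utf8_bytes = "na\u00efve".encode("utf-8")
utf16_bytes = "na\u00efve".encode("utf-16-le")
assert utf8_bytes != utf16_bytes

# ...but once decoded, the code point sequences are equal.
assert equal_by_code_point(utf8_bytes.decode("utf-8"),
                           utf16_bytes.decode("utf-16-le"))

# Code point order is well defined without knowing the language,
# though it is not what a dictionary would do (uppercase sorts first).
print(sort_by_code_point(["zebra", "Zebra", "\u00e9clair", "apple"]))
# prints ['Zebra', 'apple', 'zebra', 'éclair']
```

The comparison is stated on integers rather than on stored bits, which is the distinction the pedantry above is after.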
Received on Thursday, 30 May 2002 19:56:10 UTC