Comments on Character Model from Tim Bray on 2002-05-30 (www-i18n-comments@w3.org from May 2002)

From: Tim Bray <tbray@textuality.com>
Date: Thu, 30 May 2002 16:55:27 -0700
To: www-i18n-comments@w3.org, www-tag@w3.org
Message-ID: <3CF6BBEF.1050407@textuality.com>

I think it's sensible to combine my input to the TAG discussion of this 
issue with my feedback to the charmod draft (but maybe I'm just being 
lazy).  My comments fall into two classes: substantive discussions of 
technical content and editorial nits.  In this note, the nits are at the 
end and TAGgers (in fact anyone but an actual charmod editor) can 
probably safely stop reading at the ============-line marking the end of 
substantive comments.

Some of my opinions might end up being echoed by the TAG, but this 
document as of now is just my opinions.

Substantive comments:

(1) 3.1.5 Collation

 >>3.1.5 [S] [I] Software that sorts or searches text for users MUST
 >>do so on the basis of appropriate collation units and ordering
 >>rules for the relevant language and/or application.

Hmm, there are cases where you just don't know the language, and even if 
you do, is this a requirement in the general case for things like 
XQuery?  I think there are scenarios where it's reasonable to say a 
particular module shall order things by Unicode character number order 
and that's all there is to it.  I think this should be rewritten to say 
that IF strings are being collated, they MUST be collated EITHER in the 
order appropriate to the language they're in or, if that's not possible, 
by Unicode character number.
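
A minimal sketch of that either/or rule, in Python.  The names collate
and toy_accent_blind_key are made up for illustration, and the toy key
is only a stand-in for real language-specific ordering rules, not an
actual tailoring for any language.

```python
import unicodedata
from typing import Callable, Iterable, List, Optional


def collate(strings: Iterable[str],
            language_key: Optional[Callable[[str], object]] = None) -> List[str]:
    """Sort with a language-specific key if one is known, else by code point."""
    if language_key is not None:
        return sorted(strings, key=language_key)
    # Python compares str values code point by code point, so a plain
    # sort is exactly "Unicode character number" order.
    return sorted(strings)


def toy_accent_blind_key(s: str) -> str:
    """Toy stand-in for real collation rules (assumption, not a real tailoring)."""
    return unicodedata.normalize("NFD", s).casefold()


# "\u00c4hre" is "Ähre" with a precomposed U+00C4.
words = ["zebra", "apple", "\u00c4hre"]
print(collate(words))                        # ['apple', 'zebra', 'Ähre']
print(collate(words, toy_accent_blind_key))  # ['apple', 'Ähre', 'zebra']
```

The fallback is fully deterministic and needs no language information,
which is the property that matters when the language is unknown.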

----------------------------------------------------

(2) 3.6 Unique Character Encoding

 >>3.6 ... [S] When designing a new protocol, format or API,
 >>specifications SHOULD mandate a unique character encoding.

No. If the format is in XML and has likely usage scenarios which include 
creation by humans, this is a good enough reason to just go by the XML 
rules.  For example, I habitually compose XML documents in ISO-8859-1, 
which suits my needs as a user of European languages.  I see no reason 
whatsoever why a specification should invalidate either my habits or 
those of a Japanese author who wants to use some flavor of JIS.

OK, I guess this argument could fall under the exception clause of 
SHOULD, but I'd go so far as to add

  [S] When designing an XML-based protocol which is apt to be
  authored by humans, specifications MUST NOT limit the use of
  character encodings beyond the rules provided by XML.

----------------------------------------------------

(3) 3.6.2 Admissibility of UTF-*

3.6.2 The paragraph beginning "[S] If the unique encoding approach is 
not chosen, specifications MUST designate at least one of the UTF-8 and 
UTF-16 encoding forms of Unicode as admissible... " is fine, but if the 
format uses XML, then XML's rules cover this and in fact require that 
UTF-8 and UTF-16 both be admissible.  That requirement takes priority 
over the language here, and this should be noted.
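
To make points (2) and (3) concrete, here is a small sketch using
Python's standard xml.etree.ElementTree parser.  The sample text and
the helper name as_xml_bytes are assumptions for illustration; the
point is only that one document, serialized in ISO-8859-1, UTF-8 or
UTF-16 with a matching encoding declaration, parses to the same
characters.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # "café"; U+00E9 is stored differently in each encoding


def as_xml_bytes(encoding: str) -> bytes:
    """Serialize the same tiny document with an explicit encoding declaration."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><p>{TEXT}</p>'
    return doc.encode(encoding)


for enc in ("ISO-8859-1", "UTF-8", "UTF-16"):
    raw = as_xml_bytes(enc)
    parsed = ET.fromstring(raw).text
    # The byte length differs per encoding; the parsed text does not.
    print(enc, len(raw), parsed == TEXT)
```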

----------------------------------------------------

(4) 4. Early Uniform Normalization

I am unable to develop an intelligent opinion as to the cost-benefit 
trade-off of Early Uniform Normalization and will remain unable to do so 
without hard information as to the cost.  For example, if there was a 
C-language library available unencumbered by licensing issues which had 
a memory footprint smaller than say 10k and which ran at I/O speeds, you 
could reasonably argue that this is a cost effectively equal to zero. 
On the other hand, if E.U.N. requires a memory footprint of 256K or, 
worse, understanding and linking to the entire ICU library (blecch), the 
cost is likely to be unacceptable in a large class of applications.

There's a normalizer demo at Unicode.org referenced from Appendix D, 
which suggests that a few hundred lines of Java suffice, but I haven't 
had time to build the tables or to really think about whether they are 
being done in the best possible way.

I think my blockage on this point will be shared by the AC members who 
will eventually be asked to express an opinion on E.U.N.  So I think 
somebody owes the world the gift of a few quantitative research results 
on these numbers.
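
As a rough sketch of how such a measurement could be started, the
following times NFC normalization with Python's standard unicodedata
module.  The function name and the sample text are assumptions, and
the figure it prints is machine-dependent; it says nothing about the
memory footprint or speed of a C library.

```python
import time
import unicodedata


def nfc_chars_per_second(text: str, repeats: int = 50) -> float:
    """Very rough wall-clock throughput of NFC normalization."""
    start = time.perf_counter()
    for _ in range(repeats):
        unicodedata.normalize("NFC", text)
    # Guard against a zero interval on coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(text) * repeats / elapsed


# Mostly-ASCII text with one combining sequence per repetition.
sample = ("e\u0301" + "plain ascii text ") * 10_000
print(f"{nfc_chars_per_second(sample):,.0f} chars/s (machine-dependent)")
```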

----------------------------------------------------

(4) 6. Bit-by-bit identity

6. list item 4. "Testing for bit-by-bit identity."

<pedantry intensity="severe">This may be the way you do it but I think 
it's the wrong way to talk about it.  The point about Unicode is that it 
says a character is a thingie identified by number which has a bunch 
of properties.  At the end of the day, what you want people to do is to 
normalize the data in computer storage to a series of non-negative 
integers and when testing for equality, if you have two sequences of 
non-negative integers which are equal in length and pairwise equal in 
value, then you have equality.  It is conceivable in theory that the 
integer values are stored differently in two parts of the same program; 
and in practice, who knows what lurks inside a Perl "scalar", and 
what really happens when Perl processes the "==" operator?  So I think 
that item 4 should say the strings are pairwise numerically equal by 
code point and leave it at that.</pedantry>
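
A minimal sketch of the distinction, in Python.  The function names
are made up for illustration.  The same string stored as UTF-8 and as
UTF-16 is not bit-by-bit identical, yet the two are pairwise equal by
code point once decoded.

```python
from typing import List


def code_points(text: str) -> List[int]:
    """View a string as a sequence of non-negative integers."""
    return [ord(ch) for ch in text]


def code_point_equal(a: str, b: str) -> bool:
    """Equal in length and pairwise equal in value."""
    xs, ys = code_points(a), code_points(b)
    return len(xs) == len(ys) and all(x == y for x, y in zip(xs, ys))


# "na\u00efve" is "naïve"; two different storage forms of the same text.
stored_utf8 = "na\u00efve".encode("utf-8")
stored_utf16 = "na\u00efve".encode("utf-16-le")

print(stored_utf8 == stored_utf16)                    # False
print(code_point_equal(stored_utf8.decode("utf-8"),
                       stored_utf16.decode("utf-16-le")))  # True
# Without prior normalization, canonically equivalent forms still differ.
print(code_point_equal("\u00e9", "e\u0301"))          # False
```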

----------------------------------------------------

(5) Referencing Unicode

 >>9.  ... [S] Since specifications in general need both a definition
 >>for their characters and the semantics associated with these
 >>characters, specifications SHOULD include a reference to the
 >> Unicode Standard, whether or not they include a reference to
 >>ISO/IEC 10646.

Change SHOULD to MUST.  There's no excuse for doing a spec that talks 
about this stuff without referencing Unicode.  Among other things, it's 
easy to buy the Unicode spec, and the spec is useful; neither of these 
things is true about the ISO version.

==================================================================
Nits:

3.1.3 "[S] Protocols, data formats and APIs MUST store, interchange or 
process text data in logical order" - shouldn't that be [S] [I] - 
software should do this too?  In fact, arguably this should be [S] [I] 
[C].  Mind you, it seems that the boundaries between [S] [I] and [C] are 
pretty fuzzy.  If I were editing this thing, I'd just drop the whole 
notation and rely on getting the normative language right about what 
must be done, relying on the spec/data/software authors to follow the 
normative language that reasonably applies to them.

3.1.6 There is a problem in the phrase beginning "also known as 
octets"... it seems backward; the reason we talk about "octets" is that 
some bytes *used to be* non-8-bit; the fact that they're all 8-bit now 
means that the term "octet" is probably a bit redundant.  Perhaps the 
wording is correct but my brain obstinately insists on reading it 
backward so a little editorial cleanup is in order.

3.7 The bullet point beginning "[S] Escape syntax SHOULD either require 
explicit end delimiters" is fine, but the charmod document itself 
doesn't actually comply per section 1.3's description of the U+hhhh 
notation.  It might be elegant to cite the containing document as an 
example of non-compliance :)
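
A small sketch of why the end delimiter matters.  Both decoders are
hypothetical: the first reads an undelimited U+hhhh-style escape
greedily (assumed here to allow four to six hex digits), the second
reads XML-style numeric character references, which end with ";".
Neither does range checking.

```python
import re


def decode_undelimited(text: str) -> str:
    """Greedy reader for U+hhhh escapes with no end delimiter."""
    return re.sub(r"U\+([0-9A-Fa-f]{4,6})",
                  lambda m: chr(int(m.group(1), 16)), text)


def decode_delimited(text: str) -> str:
    """Reader for &#xhhhh; escapes, where ';' marks the end explicitly."""
    return re.sub(r"&#x([0-9A-Fa-f]+);",
                  lambda m: chr(int(m.group(1), 16)), text)


# Intended text is U+00E9 followed by "cole", but "c" is also a hex digit,
# so the greedy reader consumes it and yields U+0E9C followed by "ole".
print(decode_undelimited("U+00E9cole") == "\u0e9cole")  # True
print(decode_delimited("&#xE9;cole") == "\u00e9cole")   # True
```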

4.2.2, list item "2" uses the term "legacy encoding"; since it's defined, 
shouldn't it be in bold?

4.2.2 (second NOTE), 4.2.3 (first NOTE) the phrase "(or the absence 
thereof)" baffles me no matter how many times I read it... please 
clarify a bit.

4.4 "[C] In order to conform to this specification, all text content on 
the web MUST..." er, shouldn't this be [I] as well, since a lot of that 
content is produced by software?  But see my comment to 3.1.3 above.
Received on Thursday, 30 May 2002 19:56:10 UTC