Review of WD-charmod-20040225 from Tim Bray on 2004-03-05 (www-i18n-comments@w3.org from March 2004)

From: Tim Bray <tbray@textuality.com>
Date: Fri, 5 Mar 2004 15:05:57 -0800
To: www-i18n-comments@w3.org
Message-Id: <A9770732-6EF9-11D8-95ED-000A95A51C9E@textuality.com>

Mostly editorial.

1. 1.2, second <ul, second point:
"	• 	Non-ASCII characters [ISO/IEC 646] are being used"
Awkward: the reference is to ASCII and not to "non-ASCII characters".  I 
suggest
"Characters outside the ASCII [ISO/IEC 646] repertoire are being used"

2. 1.2, second <ul, third point:
"	• 	More and more APIs are defined, not just protocols and  format"
So what?  Why is this point here?  Either remove it or explain how it 
relates to i18n.

3. 1.2, just below previous
"In short, the Web may be seen as a single, very large application..." 
this paragraph may or may not be true and is orthogonal to i18n (I 
think) so either remove it or explain why it matters

4. 1.2, 3rd last para
"It should be noted that such  aspects also exist in legacy encoding"
Awkward language, suggest ".. that such issues also exist for ..."

5. 1.3
The first sentence, beginning "For the purpose of this 
specification..." totally baffles me.  The notion of the "producer" of 
text data is entirely self-explanatory, and this sentence is 
unnecessary, and also confusing because most people don't have an 
internal world-view that distinguishes "products" and "formats".  I 
don't.  I suggest

"This specification distinguishes between the roles of <b>producer</b> 
and <b>recipient</b> of text data.  In a networked information system, 
a software module may be both a producer and a recipient."

6. 2., before the <ol>

A specification conforms... s/they/it/

7. 2. items 3 and 4 in the <ol>
I think "where applicable" is a little stronger and smoother than "if 
applicable"

8. 2. first para after the <ol> s/if it/if they/

9. 3.1, excerpt from Unicode

s/semantic values/semantic value/.

10. 3.3, first para after the <ul>
"Each glyph can be represented by a number of different glyph  images; 
a set of glyph images makes up a font."
The part before the semicolon is very awkward and I'm not sure I 
understand what it's saying.  Maybe an example?  Are you saying that 
even though é is a single character, the standalone accent is also in 
the font even if you can't use it standalone?

11. 3.3 material on selection

This section needs either to be split or to gain a new subsection, 
3.3.1 Selection.  There is a clear transition at the paragraph beginning 
"In the presence of bidirectional text..." from talking about 
directionality to talking about selection.  In fact, you could make a 
case for the paragraph beginning "Some scripts, in particular Arabic..." 
being a standalone section.  The material here on selection and 
bidirectionality is excellent, and it would be more useful if it had 
its own section number so people could reference it.

12. 4.1 first two sentences

the phrase "in particular on the WWW" is wrong, it's no more necessary 
to encode chars here than anywhere else.  I suggest "On the WWW, as in 
any computing environment, characters must be encoded to be of any 
use."

The second sentence beginning "In fact, much of the information..." is 
pure fluff, I suggest just losing it.  By byte count, the amount of 
text flowing around the network has been a small minority since the 
creation of alt.sex.pictures, which predates the web by a few years.  
You don't need to convince anyone that there's text out there and that 
encoding it is important.

13. 4.3 first para

"... where no markup or programing language applies." Non-idiomatic, 
suggest
"(not in the context of markup or a programming language)"

14. 4.3 Para beginning "Unicode contains some code points for internal 
use..."
Shouldn't the "should not" here be a MUST not?  No spec should *ever* 
specify sending a surrogate, except implicitly as part of an 
astral-plane character.
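
A minimal Python sketch of the surrogate point, offered as an 
illustration and not as anything taken from the draft: a lone surrogate 
code point cannot be serialized as strict UTF-8, while an astral-plane 
character legitimately produces a surrogate pair in UTF-16.  The helper 
name can_encode_utf8 and the sample character U+10400 are assumptions 
chosen for the example.

```python
# Illustrative only: Python strings can hold a lone surrogate in memory,
# but the strict UTF-8 codec refuses to serialize it.
lone = "\ud801"  # a high surrogate on its own, not a character


def can_encode_utf8(text: str) -> bool:
    """Return True if text can be serialized as strict UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# U+10400 lies outside the BMP; in UTF-16 it is carried by the
# surrogate pair D801 DC00, which is the only legitimate way for
# surrogate code units to appear on the wire.
astral = "\U00010400"
utf16_units = astral.encode("utf-16-be")

print(can_encode_utf8(lone))    # the lone surrogate is rejected
print(can_encode_utf8(astral))  # the real character is fine
print(utf16_units.hex())
```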

15. 4.4 C016
This is controversial.  I think in general this is reasonable, with the 
single exception of doing what XML did and blessing both UTF-8 and 
UTF-16.  The problem with a single encoding is that it forces people to 
choose between being Java/C# friendly (UTF-16) and C/C++ friendly 
(UTF-8).  Later on, you in fact seem to agree with this point.  
Furthermore it's trivially easy to distinguish between UTF-8 and UTF-16 
if you specify a BOM.  But I think that if I were defining the next CSS 
or equivalent I'd like to be able to say "UTF-8 or UTF-16" without 
feeling guilty.

16. Whole document
I don't see anywhere that it recommends that if you're using UTF-16 you 
always use a BOM, and that seems like a basic good practice, 
particularly if you're going to allow either UTF-8 or UTF-16.
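
As a rough sketch of why a BOM makes the "UTF-8 or UTF-16" choice cheap 
for a recipient, the following Python function inspects the leading 
bytes of a stream.  The function name sniff_bom is hypothetical, and the 
sketch assumes the format permits only UTF-8 and UTF-16 (a UTF-32 
little-endian BOM begins with the same two bytes as the UTF-16 
little-endian one, so a format that also allowed UTF-32 would need a 
longer check).

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess the encoding from a leading byte order mark, if present.

    Assumes only UTF-8 and UTF-16 are permitted by the format.
    Returns None when no BOM is found.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


# Python's "utf-16" codec writes a BOM in the platform's byte order.
print(sniff_bom("hi".encode("utf-16")))
print(sniff_bom(codecs.BOM_UTF8 + b"hi"))
print(sniff_bom(b"hi"))
```

Without a BOM the function returns None, and the recipient has to fall 
back on a default or on external information.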

17. 4.4.2, C033
This is fuzzy and doesn't actually tell me anything that I can use.  
Either remove it or beef it up with examples.

18. 4.4.2, C034
Would be better to recast this as an imperative: If facilities are 
offered for identifying character encoding, content MUST make use of 
them.

19. 4.4.2, C036
Once again, fluffy, recast as an imperative. Even better, roll C035 and 
C036 together.

20. 4.6, last item in <ul>
Item #3 is fuzzy.  I think what you really mean is
3. Expressing characters that can't be input directly (e.g. because of 
keyboard limitations).
4. Expressing characters that can't be displayed (e.g. because of font 
limitations)

21. Third EXAMPLE
This is incorrect.  Within CDATA sections, &#xd801; is perfectly legal 
and just encodes a string of 8 ASCII characters.  Outside of CDATA 
sections "&#xd801;" is illegal, but that's an XML thing, not a CDATA 
section thing.
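
The distinction can be checked with any conforming XML parser.  The 
sketch below uses Python's standard xml.etree.ElementTree; the element 
name "a" and the helper parses are made up for the example.  Inside a 
CDATA section the text is taken literally, while outside it the 
reference to a surrogate code point makes the document ill-formed.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, "&#xd801;" is not a character reference at all:
# it is just the eight literal characters & # x d 8 0 1 ;
inside = ET.fromstring("<a><![CDATA[&#xd801;]]></a>")
cdata_text = inside.text


def parses(doc: str) -> bool:
    """Return True if doc is accepted by the XML parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(repr(cdata_text), len(cdata_text))
print(parses("<a>&#xd801;</a>"))   # surrogate reference outside CDATA
print(parses("<a>&#x10400;</a>"))  # reference to a real character
```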

22. 4.6 C048
Seems silly.  We're pretty well deprecating everything except Unicode, 
right?  So this vague notion of "character set standards" is useless.  
And you already said use hex for Unicode.

23. 4.6 C049
The notion of a "character encoding based on Unicode" is jarring here.  
Doesn't the whole document say "use Unicode"?

24. 6.2 C056
I think it would be helpful to link back to the section where you show 
that a character does not map to a single unit of sound or display or 
input, as another good reason for this constraint.

25. 7. C058
Can you proceed to recommendation with this dependency on IRIs, which 
are not yet cooked?

26. 8. C062
I agree with this, could we strengthen it to say MUST reference 
Unicode?  Anyone defining a protocol or language that has text in it 
had better say the text is Unicode and if they say so, should really 
have a normative reference, right?  Is there any situation we can 
imagine where it would be OK to not have such a reference?
Received on Friday, 5 March 2004 18:06:02 UTC