- From: Tim Bray <tbray@textuality.com>
- Date: Fri, 5 Mar 2004 15:05:57 -0800
- To: www-i18n-comments@w3.org
- Message-Id: <A9770732-6EF9-11D8-95ED-000A95A51C9E@textuality.com>
Mostly editorial.

1. 1.2, second <ul>, second point:
   " • Non-ASCII characters [ISO/IEC 646] are being used"
   Awkward: the reference is to ASCII and not to "non-ASCII characters". I suggest "Characters outside the ASCII [ISO/IEC 646] repertoire are being used".

2. 1.2, second <ul>, third point:
   " • More and more APIs are defined, not just protocols and format"
   So what? Why is this point here? Either remove it or explain how it relates to i18n.

3. 1.2, just below previous:
   "In short, the Web may be seen as a single, very large application..."
   This paragraph may or may not be true and is orthogonal to i18n (I think), so either remove it or explain why it matters.

4. 1.2, 3rd last para:
   "It should be noted that such aspects also exist in legacy encoding"
   Awkward language, suggest ".. that such issues also exist for ..."

5. 1.3
   The first sentence, beginning "For the purpose of this specification...", totally baffles me. The notion of the "producer" of text data is entirely self-explanatory, so this sentence is unnecessary. It is also confusing, because most people don't have an internal world-view that distinguishes "products" and "formats". I don't. I suggest "This specification distinguishes between the roles of <b>producer</b> and <b>recipient</b> of text data. In a networked information system, a software module may be both a producer and a recipient."

6. 2., before the <ol>:
   A specification conforms... s/they/it/

7. 2., items 3 and 4 in the <ol>:
   I think "where applicable" is a little stronger and smoother than "if applicable".

8. 2., first para after the <ol>:
   s/if it/if they/

9. 3.1, excerpt from Unicode:
   s/semantic values/semantic value/.

10. 3.3, first para after the <ul>:
    "Each glyph can be represented by a number of different glyph images; a set of glyph images makes up a font."
    The part before the semicolon is very awkward and I'm not sure I understand what it's saying. Maybe an example? Are you saying that even though é is a single character, the standalone accent is also in the font even if you can't use it standalone?

11. 3.3, material on selection:
    This section needs either to be split or a new section 3.3.1 selection. There is a clear transition at the paragraph beginning "In the presence of bidirectional text..." from talking about directionality to talking about selection. In fact, you could make a case for the paragraph beginning "Some scripts, in particular Arabic..." being a standalone section. The material here on selection and bidirectionality is excellent, and it would be more useful if it had a section number so people could reference it.

12. 4.1, first two sentences:
    The phrase "in particular on the WWW" is wrong; it's no more necessary to encode chars here than anywhere else. I suggest "On the WWW, as in any computing environment, characters must be encoded to be of any use." The second sentence, beginning "In fact, much of the information...", is pure fluff; I suggest just losing it. By byte count, the amount of text flowing around the network has been a small minority since the creation of alt.sex.pictures, which predates the web by a few years. You don't need to convince anyone that there's text out there and that encoding it is important.

13. 4.3, first para:
    "... where no markup or programing language applies."
    Non-idiomatic, suggest "(not in the context of markup or a programming language)".

14. 4.3, para beginning "Unicode contains some code points for internal use...":
    Shouldn't the "should not" here be a MUST NOT? No spec should *ever* specify sending a surrogate, except implicitly as part of an astral-plane character.

15. 4.4, C016:
    This is controversial. I think in general this is reasonable, with the single exception of doing what XML did and blessing both UTF-8 and UTF-16. The problem with a single encoding is that it forces people to choose between being Java/C# friendly (UTF-16) and C/C++ friendly (UTF-8). Later on, you in fact seem to agree with this point. Furthermore, it's trivially easy to distinguish between UTF-8 and UTF-16 if you specify a BOM. But I think that if I were defining the next CSS or equivalent I'd like to be able to say "UTF-8 or UTF-16" without feeling guilty.

16. Whole document:
    I don't see anywhere that it recommends that if you're using UTF-16 you always use a BOM, and that seems like a basic good practice, particularly if you're going to allow either UTF-8 or UTF-16.

17. 4.4.2, C033:
    This is fuzzy and doesn't actually tell me anything that I can use. Either remove it or beef it up with examples.

18. 4.4.2, C034:
    Would be better to recast this as an imperative: If facilities are offered for identifying character encoding, content MUST make use of them.

19. 4.4.2, C036:
    Once again, fluffy; recast as an imperative. Even better, roll C035 and C036 together.

20. 4.6, last item in <ul>:
    Item #3 is fuzzy. I think what you really mean is:
      3. Expressing characters that can't be input directly (e.g. because of keyboard limitations).
      4. Expressing characters that can't be displayed (e.g. because of font limitations).

21. Third EXAMPLE:
    This is incorrect. Within CDATA sections, � is perfectly legal and just encodes a string of 8 ASCII characters. Outside of CDATA sections "�" is illegal, but that's an XML thing, not a CDATA section thing.

22. 4.6, C048:
    Seems silly. We're pretty well deprecating everything except Unicode, right? So this vague notion of "character set standards" is useless. And you already said use hex for Unicode.

23. 4.6, C049:
    The notion of a "character encoding based on Unicode" is jarring here. Doesn't the whole document say "use Unicode"?

24. 6.2, C056:
    I think it would be helpful to link back to the section where you show that a character does not map to a single unit of sound or display or input, as another good reason for this constraint.

24. 7., C058:
    Can you proceed to recommendation with this dependency on IRIs, which are not yet cooked?

25. 8., C062:
    I agree with this; could we strengthen it to say MUST reference Unicode? Anyone defining a protocol or language that has text in it had better say the text is Unicode, and if they say so, they should really have a normative reference, right? Is there any situation we can imagine where it would be OK to not have such a reference?
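
To make comments 14 through 16 concrete, here is a minimal Python sketch, not text proposed for the spec. The helper names `sniff_bom` and `has_lone_surrogate` are hypothetical. The first assumes a format that permits only UTF-8 or UTF-16 and requires a BOM; the second flags surrogate code points that are not part of an encoded astral-plane character.

```python
import codecs
from typing import Optional


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess UTF-8 vs UTF-16 from a leading byte order mark.

    Hypothetical helper: only the two encoding families named in
    comments 15 and 16 are considered; UTF-32 is deliberately ignored.
    """
    if data.startswith(codecs.BOM_UTF8):      # bytes EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return None  # no BOM: the encoding must be identified some other way


def has_lone_surrogate(text: str) -> bool:
    """Return True if text contains a surrogate code point.

    A Python str holds code points, so an astral-plane character is a
    single item; any value in D800..DFFF is therefore a lone surrogate.
    """
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)
```

With a mandatory BOM the check is a fixed-prefix comparison, which is the sense in which telling the two encodings apart is trivial; without one, `sniff_bom` returns `None` and the receiver has to fall back on an external declaration.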
Received on Friday, 5 March 2004 18:06:02 UTC