- From: Karlsson Kent - keka <keka@im.se>
- Date: Tue, 30 Jan 2001 20:19:13 +0100
- To: "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
- Cc: "'Martin Duerst'" <duerst@w3.org>, misha.wolf@reuters.com, "'Asmus Freytag'" <asmusf@ix.netcom.com>, "'Kenneth Whistler'" <kenw@sybase.com>, "'Mark Davis'" <mark.davis@us.ibm.com>
I'm not at all sure the document is ready for "last call". See below on clause 4.2.2.

=============================

* clause 1.3, code position notation; maybe sufficient here, but not precise.

"Conformance" -> "Conformity" (English)

The phrase "MUST NOT" reflects in itself a lack of internationalisation. In English, "must not" means the same as "shall not", so use the phrase "shall not". In other languages the word for "must" followed by the word for "not" (as in Swedish "måste inte") means the same as "does not have to", which is quite different from "shall not". The word for "shall" followed by the word for "not", however, has no such issue and retains the English meaning. "MAY NOT" has the same kind of problem. This is the reason why the ISO/IEC JTC 1 procedures do not allow the phrase "must not" (nor "must"), but instead use the phrase "shall not" (and, for similarity, use "shall" for the positive requirements).

The phrase "REQUIRED" seems superfluous; use "SHALL" (with a reformulation to form a proper sentence).

The terminology (SHALL, ..., OPTIONAL, ...) should come before the conformity clause (among other definitions, which are generally missing; see also below).

--typo: "All...specification" --> "All...specifications" (plural)

* clause 3.1.6. Except for compression, when are multiple 'characters' stored in a single 'physical unit of storage' (in a context where a 'physical unit of storage' is such a thing as a byte or a wyde)?

* clause 3.1.7. "MUST (sic) specify which"; but then there should be an explicit list in the "char-model" document, should there not? Otherwise you have an open-ended requirement.

* clause 3.2 There is no definition of terms in the document. Terms such as "byte" and "wyde" are left for the reader to guess, likewise for "octet", though that is more precise. Note that some well-known standards (such as that for C) do NOT limit a "byte" to be an "octet".
"code point"...; "code position" seems to be the 10646 term, though not formally defined. "Transfer Encoding Syntax" is missing here (and see below).

* clause 3.6.1. "charset" is mentioned a number of times. It should say that XML uses a pseudo-attribute called "encoding" rather than "charset".

It should be mentioned that, due to a decision to have only a few "Transfer-Encoding" values, some encodings that are really Transfer Encoding Syntaxes got registered as "charsets". For instance UTF-7 (despite the name it's not a UTF) and HZ-GB-2312 are really TESes, not CESes. UTF-7 is already deprecated, and was only intended for e-mail in the same way as Quoted-Printable was only intended for (7-bit) e-mail. No TES should be used other than for backwards compatibility in e-mail support (i.e. SHOULD NOT be used elsewhere; or even SHALL NOT be used elsewhere...).

There are also some 2022 "charsets" registered. But due to the lack of widespread support for 2022, it should be avoided except for backwards compatibility in e-mail support, and should not be used elsewhere.

Further, there are some older registered encodings related to 10646/Unicode apart from UTF-7: --UCS-2, --UTF-1, --UCS-4; UNICODE-1-1 as well as a few subsets. These should be recommended against.

* clause 3.7. For XML (and thus XHTML) one should recommend using the hexadecimal rather than the decimal "character escape".

XHTML has inherited a number of named (rather than numbered) "character escapes"; are these counted as character escapes too, or are they not? (See also below.)

XML 1.0 does not allow any "character escapes" in identifiers (they are allowed in comments, but I'm not sure if a "source viewer" is supposed to interpret them there). Maybe a note about that...

Some editing tools are too eager to automatically use character escapes (or named character entities, like &aring;) even though the target encoding can perfectly well represent the character directly without any problems.
There should be a recommendation not to do so, but to insert the characters directly as typed on the keyboard (or pasted in as plain text), when representable and when they would not cause parsing problems (as e.g. '<' would in XML).

* clause 4.1 "UTR #15" --> "UAX 15"; it's in UAX status, and the # is just ugly.

* clause 4.2.2. For clarity, the parenthetical definition should be removed, along with its application.

[this clause is a mess, as are the references to it]

"does not contain any character escapes whose unescaping..."? This appears to be targeting such things as numeric escapes (like &#... in XML). It's not clear if standardised named character entities are to be considered, or even worse, non-standard externally parsed entities that someone might have defined in another file (or whatever). If not, expanding them may result in non-NFC, which is guarded against when it comes to numeric character escapes. If they are, then defined entities (in the XML meaning) must be examined too. Are they then to be expanded during W3C-normalisation?

E.g. is &Auml;&#x304; W3C-normalised (for XHTML) or not? Note that expanding both the named and the numeric character reference, and then creating an NFC version, generates the single character called LATIN CAPITAL LETTER A WITH DIAERESIS AND MACRON.

The situation gets even worse with non-W3C-standard entities that may be defined, in the same "file" or in another "file", which may contain any text, including markup. Even if the definition itself is "W3C-normalised", at the point of use there may still be a concatenation of strings whose result is not in any normal form. Does a W3C-normaliser for XML need to consider externally parsed entities?

4.2.2. defines "W3C-normalised" w.r.t. the "character escape" syntax used, but is not clear about what that is.

A further problem is that 4.2.2 does NOT actually define what "W3C-normalisATION", the algorithm, is supposed to do. Is input to be rejected if not already normalised?
Probably not. Are some numeric character escapes to be expanded, combined with creation of a result in NFC? Maybe. But what about XML's entities; are they to be examined? And if the data is then found not to be W3C-normalised, what then? Expand the entity? That may contain markup, and be arbitrarily large; and the entity may have multiple occurrences. It may also destroy the document design ("I DID want that entity with a combining character first!").

It's not clear to me why W3C-normalisation has to be defined at all. Expanding character escapes (and other entities), if done while editing, should be accompanied by (local) establishment of NFC. But that is no different from, say, pasting in some text (that may contain or even begin with combining characters) during editing. Likewise for string identity matching: after expanding entities (and numeric character escapes), a local normalisation step may be needed. For such things as signature creation, entities and numeric character escapes would not be expanded, creating different signatures for the unexpanded and expanded versions.

Maybe there is some assumed distinction between numeric character escapes and other entities (still in the XML-ish sense) that I've missed. For instance, that numeric character escapes are to be interpreted during string identity matching and signature creation, while the otherwise rather similar (character or larger) named entities are not to be so expanded for those operations. If so, please explain. Also please explain why such a difference in treatment would not be a problem. Writing &Auml; instead of &#xC4; is not all that different.

There is nothing about versioning. A new text may be in NFC with respect to a version of Unicode that does not contain a "new" precomposed character (the code position being unallocated there). If that character is later allocated and used in the "new" text, the text will not be in NFC with respect to the "new" version of Unicode, which will decompose it.
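As an illustration of the named-plus-numeric case raised under clause 4.2.2 above, here is a minimal Python sketch. It is not anything the draft specifies. The function name `expand_and_check` is hypothetical, the input spelling `&Auml;&#x304;` is an assumed way of writing that case, and `html.unescape` uses Python's HTML5 entity table rather than the XHTML DTDs, so it only approximates what an XML processor would do.

```python
import html
import unicodedata


def expand_and_check(text):
    """Expand character references, then report whether the result is in NFC.

    Assumption: html.unescape stands in for an XML/XHTML processor's
    expansion of numeric references and standard named entities.
    """
    expanded = html.unescape(text)
    nfc = unicodedata.normalize("NFC", expanded)
    return expanded, expanded == nfc, nfc


# Named reference for A WITH DIAERESIS followed by a numeric
# reference for U+0304 COMBINING MACRON (assumed spelling).
expanded, already_nfc, nfc = expand_and_check("&Auml;&#x304;")

print([f"U+{ord(c):04X}" for c in expanded])  # code points after expansion
print(already_nfc)                            # is the expanded text in NFC?
print([unicodedata.name(c) for c in nfc])     # character names after NFC
```

Under these assumptions the expanded text is two code points and is not in NFC, while the NFC form is a single character. The escaped source text itself is pure ASCII, so it trivially passes any NFC check that does not look inside the references.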
The note about legacy (plain) text always being normalised might not be true for all (any?) legacy encodings for Vietnamese (and now maybe not for Hebrew either...). See in particular MS CP 1258.

Side remark: turning marked-up W3C-normalised text into plain text may produce non-NFC results in another way too; e.g. <ex>A<emphasise>&#x308;</emphasise></ex> (say that 'emphasise' uses red colour when displayed/printed). Just expanding the character escape while removing the markup tags results in a decomposed Ä as the plain text version.

* clause 5 Expand "GI" to "generic identifier" (or avoid that term, which is not even properly defined in the XML spec.).

* clause 8 (on URIs) [this is a general and ugly mess]

"The conversion MUST (sic) take place as late as possible." Good. Similarly, the conversion back to a form that does not use the %-encoding should be done as early as possible (in case a URI protocol element is passed back as a parameter, it should not then still be %-encoded). Nor should pre-%-encoded URIs occur in stored or generated documents. This should keep %xx's out of any UI. Note that the %-encoding is very similar to the TES Quoted-Printable.

--typo: "conversion a legal" --> "conversion to a legal".

* Example A.3 This example appears oversimplified; neither keyboard state nor intermediary displays (with quite different characters) are shown. That is hard to show in a simple table, but there should be some explanatory note about that.

=========================================================
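The "as late as possible" / "as early as possible" handling suggested for clause 8 above could look roughly like the following Python sketch. The helper names `to_wire_form` and `to_display_form`, the sample path, and the choice of UTF-8 as the byte encoding underneath the %-encoding are all assumptions made for illustration; the comments above do not prescribe them.

```python
from urllib.parse import quote, unquote


def to_wire_form(path):
    """%-encode only at the last moment, just before the URI goes on the wire.

    Assumption: non-ASCII characters are encoded as UTF-8 before
    %-encoding (the default of urllib.parse.quote).
    """
    return quote(path, safe="/")


def to_display_form(wire):
    """Undo the %-encoding as early as possible, so no %xx reaches the UI."""
    return unquote(wire)


stored = "/väg/å"            # what is stored or shown: no %-encoding
wire = to_wire_form(stored)  # what is actually transmitted
print(wire)
print(to_display_form(wire) == stored)
```

Keeping the stored form unencoded and converting only at the protocol boundary is what keeps pre-%-encoded URIs out of documents and user interfaces in this sketch.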
Received on Tuesday, 30 January 2001 14:22:42 UTC