- From: Ian Jacobs <ij@w3.org>
- Date: Mon, 29 Jan 2001 15:31:50 -0500
- To: www-i18n-comments@w3.org
- CC: ij@w3.org
Hello,

I have a few comments and questions about the 26 January 2001 (last
call) Working Draft of "Character Model for the World Wide Web 1.0"
[1]. My comments are not sent on behalf of any W3C Working Group.

My global comment is that I think the editors have done a good job
explaining some very difficult concepts, but that more work may be
necessary so that someone reading this specification without extensive
experience with these topics will be able to grasp the issues. I have
had some exposure to these topics and they are still challenging to
me. Where sections begin with examples, I find those sections easier
to absorb, and Appendix A is very instructive. My global advice is
therefore to add more examples and explanation of those examples.

Specific comments and questions follow. They are not prioritized. Some
are editorial, but more than "there is a missing comma".

Congratulations on your progress on this document!

 - Ian

[1] http://www.w3.org/TR/2001/WD-charmod-20010126

====================
COMMENTS
====================

| 1 Introduction
|
| 1.1 Goals and Scope

[snip]

| All W3C specifications have to conform to this document (see
| section [57]2 Conformance).

IJ: I don't think you can make this statement. This specification must
define conformance to it and should encourage other specs to require
conformance to it. However, I don't think it can impose this
dependency on other specifications. This may be something that a
future W3C architecture document can do.

[snip]

| 2 Conformance

[snip]

| In order to conform to this document, all applicable requirements
| MUST be satisfied.

IJ: I don't believe that you've defined "applicable". Is this a
blanket exception clause? I suspect this means that I am not required
to implement a "MUST" requirement if I don't consider it applicable to
my specification. I recommend narrowing the scope of this
significantly.
For instance, you can define conformance profiles (e.g., for format
specifications, for protocols, for graphical user interfaces, etc.)
and (ideally) have no applicability exemptions within each class.

| Requirements vary for content, software and
| specifications. All new or revised W3C specification MUST conform
| to the requirements applicable to specifications.

IJ: See comments above: I don't think this spec should/can include
such a statement.

| 3 Characters
|
| 3.1 Perceptions of Characters
|
| 3.1.1 Introduction
|
| 3.1.4 Units of Input
|
| In keyboard input, it is not the case that keystrokes and input
| characters correspond one-to-one.

IJ: I propose "it is not the case in general".

| 3.1.5 Units of Collation

[snip]

| Although
| Spanish sorting, and to some extent Spanish everyday use, treat
| "ch" as a character, digital encodings treat it as two
| characters, and keyboards do the same.

Does "digital encodings treat it as two characters" mean that these
encodings happen to do this today? The sentence as written suggests to
me that they should be treated as two characters (even though people
who use Spanish treat it as one character). I expected to read in this
sentence that encodings "should" or "should not" do this.

| 3.2 Digital Representation of Characters
|
| To be of any use in computers, in computer communications and in
| particular on the World Wide Web, characters must be encoded.

IJ: I think that this section should have the word "encoding" in the
title. Also, the term "character encoding" is used later in the
document, and it's not clear what that term applies to in this
section. Does "character encoding" comprise bullets 1-4, or just 3
and 4?

[snip]

| 1. A set of characters to
| be encoded is identified. The units of encoding, the
| characters,

I find "units of encoding" hard to understand, especially since it
means "characters" here. Can "units of encoding" be dropped?
The term "unit" is used throughout the document, and maybe it can be
kept if there's a clear statement early on about how it's used.

| 3.6 Choice and Identification of Character Encodings

[snip]

| Receiving software MUST determine the encoding from available
| information.

IJ: I propose instead: "Receiving software MUST determine the encoding
according to specification (including default encodings)." This would
include:

  * Specification-defined defaults
  * Information in headers
  * Information in markup

| It MAY recognize as many encodings (names and
| aliases) as appropriate.

IJ: What does "as appropriate" mean?

| A field-upgradeable mechanism may be
| appropriate for this purpose. When a IANA-registered charset name
| is recognized, receiving software MUST interpret the received
| data according to the encoding associated with the name in the
| IANA registry. When no charset is provided the receiving software
| MUST adhere to the default encoding(s) specified in the
| specification.

IJ: The last sentence could then be deleted.

| 3.6.2 Private Use Code Points

[snip]

| Where specifications need to allow the transmission of symbols
| not in Unicode or need to identify specific variants of Unicode
| characters, they MAY define markup for this purpose.

A couple of questions:

  - Can you give some examples of mechanisms other than markup?
  - Why is this a "MAY" rather than a "SHOULD"? What techniques are
    more recommended than this one?

| 4.2 Definitions for W3C Text Normalization
|
| 4.2.1 Unicode-normalized Text
|
| Text data is, for the purposes of this specification,
| Unicode-normalized if it is in a Unicode encoding form

IJ: Is a "Unicode encoding form" a "Unicode character encoding form
(CEF)"? It would be good to tie this back to the bulleted list of
section 3.2.

| and is in
| Unicode Normalization Form C (according to revision 18 of
| [115][UTR #15]).
|
| [115] http://www.w3.org/TR/charmod/#UTR15
|
| 4.2.2 W3C-normalized Text

IJ: Questions:

  - What is the definition of "legacy text"? "Legacy encoding"?
  - The definitions rely on the existence of normalizing transcoders.
    Does this mean actual software must exist? I would prefer that
    the definition relied on the possibility of writing a transcoder
    rather than the actual existence of one. On the other hand, it's
    not very satisfying to write "and if one could write a
    normalizing transcoder". But that's what it amounts to, I think.

| 4.2.3 Examples

IJ: I found these examples hard to understand because I don't know
anything about Unicode Normalization Form C, and it's not clear to me,
for example, whether U+0327 is a character escape. I think taking more
time to explain the examples would be useful.

| The string "suçon", expressed as the sequence of five characters
| U+0073 U+0075 U+00E7 U+006F U+006E and encoded in a Unicode
| encoding form, is both Unicode-normalized and W3C-normalized. The
| same string encoded in a legacy encoding for which there exists a
| normalizing-transcoder would be W3C-normalized but not
| Unicode-normalized.
|
| The string "suçon", expressed as the sequence of six characters
| U+0073 U+0075 U+0063 U+0327 U+006F U+006E (U+0327 is the
| COMBINING CEDILLA) and encoded in a Unicode encoding form, is
| neither W3C-normalized nor Unicode-normalized.
|
| In an XML or HTML context, the string "suc&#x327;on" is not
| W3C-normalized, whatever the encoding form, because expanding
| "&#x327;" yields the sequence "suc¸on" which is not
| Unicode-normalized. Note that, since Unicode-normalization
| doesn't take escapes into account, the string "suc&#x327;on" is
| Unicode-normalized if encoded in a Unicode encoding form.

I do not understand the last sentence. I suspect that more preceding
explanation will help.
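(For what it's worth, the first two examples can be checked with a
short Python sketch. This is my own illustration, not from the draft;
it uses the standard unicodedata module, and the \uXXXX escapes stand
for the code points quoted above:)

```python
import unicodedata

# Five-character (precomposed) form: U+0073 U+0075 U+00E7 U+006F U+006E
composed = "su\u00e7on"
# Six-character (decomposed) form: U+0327 is COMBINING CEDILLA
decomposed = "suc\u0327on"

# The two sequences are not identical as code point sequences...
print(composed == decomposed)  # False

# ...but Normalization Form C maps the decomposed form onto the
# precomposed one, so the two compare equal after normalization.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# The precomposed form is already in NFC (normalizing it is a no-op).
print(unicodedata.normalize("NFC", composed) == composed)  # True
```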
| 5 Compatibility and Formatting Characters

[snip]

| Specifications SHOULD exclude compatibility characters in the
| syntactic elements (markup, delimiters, identifiers) of the
| formats they define (e.g. exclusion of compatibility characters
| for GIs in XML).

IJ: What are "GIs"?

| 6 String Identity Matching

[snip]

| To avoid unnecessary conversions and, more importantly, to ensure
| predictability, all components of the Web must use the same
| identity testing mechanism. To meet this requirement and support
| the above definition of identity, this specification mandates the
| following steps for string identity matching:
|
| 1. Early uniform
| normalization to W3C-normalized form, as defined in [120]4.2.2
| W3C-normalized Text
| 2. Conversion to a common encoding of UCS, if necessary

Is this physical string encoding or byte string encoding?

| 3. Expansion of all escapes
| 4. Binary comparison
|
| In accordance with section [121]4 Early Uniform Normalization,
| the first step MUST be performed by the producers of the strings
| to be compared. This ensures 1) that the identity matching
| process can produce correct results using the next three steps
| and 2) that a minimum of effort is spent on solving the problem.
|
| [121] http://www.w3.org/TR/charmod/#sec-Normalization
|
| 7 String Indexing

I think that the beginning of this section would benefit from more
examples.

[snip]

| It is noteworthy that there exist other, non-numeric ways of
| identifying substrings which have favorable properties. For
| instance, substrings based on string matching are quite robust
| against small edits; substrings based on document structure (in
| structured formats such as XML) are even more robust against
| edits and even against translation of a document from one
| language to another.

IJ: one "markup" language or one human language?
| Consequently, specifications that need a way
| to identify substrings or point within a string SHOULD provide
| ways other than string indexing to perform this operation. Users
| of such specifications (software developers, content developers)
| SHOULD prefer those other ways whenever possible.

IJ: Is the DOM Range spec an example of this? Should it be used as an
example?

| Experience shows that more general, flexible and robust
| specifications result when individual characters are understood
| and processed as substrings, identified by a position before and
| a position after the substring. Understanding indices as boundary
| positions between the counting units also makes it easier to
| relate the indices resulting from the different string
| definitions. Specifications SHOULD use this form of indexing,
| regardless of the choice of counting units. APIs in addition
| SHOULD NOT specify single character or single encoding-unit
| arguments.

IJ: This last sentence is cryptic; some examples or further
explanation should be given.

| 8 Character Encoding in URI References

IJ: I think a slightly different ordering of the discussion might be
more effective if split along character set/character encoding lines.

Character set:

  a) The character set of URIs, defined in RFC 2396, is a subset of
     US-ASCII.
  b) Specs that use URI references must not impose this restriction
     and must allow URIs to include any Unicode character. This
     includes some characters disallowed by RFC 2396 (e.g., number
     sign, etc.). These are called "IRIs".

Character encoding:

  a) RFC 2396 does allow encoding of arbitrary byte sequences. Here's
     why this is limited...
  b) Since some applications will not be able to deal with the
     characters outside of RFC 2396, here's how you must convert to
     the encoding specified in RFC 2396...

| According to the definition in [130][RFC 2396], URI references
| are restricted to a subset of US-ASCII.
| This RFC also specifies
| an escaping mechanism to encode arbitrary byte values, using the
| %HH convention. However, because the RFC does not define the
| mapping from characters to bytes, the %HH convention by itself is
| of limited use.

IJ: Can you provide more explanation of why the convention is of
limited use, perhaps referring to the bulleted items of section 3.2?

[snip]

| 9 Referencing the Unicode Standard and ISO/IEC 10646

[snip]

| The fact that both ISO/IEC 10646 and Unicode are evolving (in
| synchronism)

I would prefer "in synchrony".

--
Ian Jacobs (jacobs@w3.org)   http://www.w3.org/People/Jacobs
Tel:  +1 831 457-2842
Cell: +1 917 450-8783
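P.S. To illustrate my question about the %HH convention being "of
limited use": the same character yields different %HH sequences
depending on which character-to-byte mapping the producer chose, so
the escaped bytes alone cannot be mapped back to characters. A small
Python sketch of my own (using the standard urllib.parse module):

```python
from urllib.parse import quote

# "ç" (U+00E7) percent-encoded under two different
# character-to-byte mappings:
print(quote("\u00e7", encoding="utf-8"))       # %C3%A7
print(quote("\u00e7", encoding="iso-8859-1"))  # %E7

# A consumer seeing "%E7" (or "%C3%A7") in a URI cannot recover the
# intended character without knowing which mapping was used.
```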
Received on Monday, 29 January 2001 15:31:53 UTC