- From: Martin J. Duerst <duerst@w3.org>
- Date: Mon, 27 Dec 1999 12:30:03 +0900
- To: www-xml-linking-comments@w3.org
Dear Linking WG,

These are the last call comments regarding Internationalization (i18n)
from the I18N WG/IG. I'm sending these comments directly to the public
list indicated for comments, but in case there are any topics that need
further discussion, please use crossposting between the relevant group
lists, without copying this list.

Character Sets and Escaping (2.2)
---------------------------------

This is very important for i18n, and as far as we understand from the
text and from previous discussions, our groups in principle understand
each other and agree on the way to go. However, the current wording
comes with a number of problems and can and should be improved. This
point in particular was discussed during the I18N WG's last
face-to-face meeting. Several rounds of discussion may be needed to get
to the final shape.

The main problems currently are:

- Title: Change 'character sets' to 'character encodings'.
- Substructure: There should be subsections to organize the topic
  further.
- Lack of details: People familiar with the referenced specifications,
  and people checking those thoroughly, will figure things out. However,
  these are usually in the minority, and in this particular case, it's
  very easy for implementers to think that they are not really
  concerned. Giving details, e.g. on which characters in the ASCII
  range have to be escaped, is very important to try to assure stable
  implementations. In particular, there should be lists of the various
  relevant URI character categories (reserved, marks, delims,
  unwise,...) and how to treat them.
- Lack of examples: The distinction/interaction between the various
  escaping schemes is complex. From our experience, we know that many
  implementers get this wrong in one way or another. Adding a number of
  examples seems crucial.
  (If the above two items should become too long, moving them to an
  appendix may be considered.)
- In 2.1, the sentence 'does not represent the general applicability of
  escaping' should be made more precise (e.g. 'the formal syntax in this
  specification defines the syntax of XPointers before the application
  of the various kinds of escaping described in Section 2.2').
- I am not yet completely clear as to whether the ^ escaping mechanism
  is needed. It would be great if we could avoid it. Let me try to show
  that we don't need it:
  - XPointer is the part of a URI behind the #.
  - The # is the 'most important' character in a URI, i.e. when a URI
    is processed, first look for a #.
  - The fragment identifier in URI syntax (RFC 2396) allows a wide range
    of things, in particular:

      fragment   = *uric
      uric       = reserved | unreserved | escaped
      reserved   = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                   "$" | ","
      unreserved = alphanum | mark
      mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" |
                   "(" | ")"
      escaped    = "%" hex hex

  - This includes (), for which no escaping is needed.
  - This means that %28 (for '(') and %29 (for ')') can be used to hide
    unbalanced parentheses from the scheme end detector.
  - This means that the ^ escaping mechanism is not needed anymore.
  - One escaping mechanism less is a big plus.
  The potential problems for the above argument that I currently see
  are:
  - The %28/%29 escapings are used or planned to be used for something
    else.
  - It is not possible to rely on the URI infrastructure to keep %28/%29
    and (/) separated for fragment parts.
- 'The UTF-8 encoding is used for URI-references': There is
  (unfortunately!) no general character encoding for URI references.
  The spec should say that the UTF-8 encoding is used for XPointers,
  and the use of any other encoding with %HH in URIs is an error.
- The Note on handling illegal characters in URIs should make clear the
  following:
  - Whether it is allowed to use characters not allowed in URIs in some
    place in an XML document (or outside) depends on that particular
    case, not on whether XPointer is used or not.
  - Where XPointer is used, it is advised to allow the use of characters
    not permitted in URIs, to make the use of XPointer easier.
- The second Validity Constraint in 2.2 doesn't belong there, and
  should be moved somewhere else.
- Both Unicode and UTF-8 are mentioned. References to them must be
  added (RFC 2279 for UTF-8, reference to Unicode 3.0 for Unicode).

Points and Ranges
-----------------

- These are very valuable additions on top of XPath. In particular,
  multiple (logical) ranges are essential to address single graphical
  ranges in Bidirectional contexts. Also, multiple ranges can be very
  helpful to indicate correspondences between versions of the same
  document in different languages; this wouldn't be possible with
  single ranges.
- There should be some note explaining the potential for multiple
  selections appearing as a result of a single graphical selection in a
  bidi context (near the 'unique()' function and/or at the end of 3.1).
- Points and ranges, in accordance with DOM2 and with our Character
  Model (see http://www.w3.org/TR/1999/WD-charmod-19991129/#Indexing),
  use 0-based indexing. However, as far as we understand, it seems
  impossible to access points or ranges based on these indices. This
  should preferably be changed by adding appropriate functions, or
  alternatively, be pointed out in a note.
- Points and Ranges correspond to DOM's "position" and "range".
  However, DOM's position and range are based on UTF-16 units. XPointer
  on the other hand works on UCS characters, as is clear from
  http://www.w3.org/TR/xpath#strings. It should be made clear that this
  also applies to points and ranges.

Various other issues
--------------------

- Please make sure the CR/PR has a pointer to a translations page
  (see e.g. the XPath Recommendation).
- Intro, Robustness requirements: 'must attempt to be
  internationalized' sounds strange. It seems to say 'the WG has to
  give it a try, but if they don't get it, no problem'. This is
  obviously wrong.
  Please say something like 'This specification must be appropriately
  internationalized.'
- 2.1.3, Child Sequences: This is a highly unstable way of addressing
  into a document. The i18n WG/IG are in particular concerned about
  stability e.g. when translating a document. Giving such an unstable
  way of addressing such a distinguished and short syntax seems highly
  inappropriate, in particular given the fact that the *[n] notation,
  which can be mixed with other ways of addressing, does exactly the
  same thing and is not much longer. We request that Child Sequences be
  removed from the spec for the above reasons. If this should not be
  possible, we alternatively request that a very strong warning against
  this way of addressing be added to 2.1.3.
- Whitespace handling in 'string-range': This is defined so that
  multiple spaces in the source match with multiple spaces in the
  XPointer. Insofar as this is used to deal with 'pretty printing' in
  the source or in XPointers (as opposed to catching spurious double
  spaces,...), this is inappropriate because it does not cover
  languages that are written without spaces, such as Thai, Chinese, and
  Japanese. This has to be improved.
- There should be some comment about matching and normalization in 3.5.
  The best thing to say is that only codepoint-by-codepoint matching is
  done, that both source and XPointer are assumed to be normalized, and
  that for things such as case-folding, appropriate functions in XPath
  should be used.

Regards,   Martin.

#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org
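
Illustration of the escaping and indexing points above. This is only a
minimal sketch in Python, not text proposed for the specification. The
helper names (escape_fragment, hide_parentheses, utf16_units) are
hypothetical, and the set of characters left unescaped is an assumption
taken from the RFC 2396 'reserved' and 'mark' lists quoted earlier.

```python
from urllib.parse import quote, unquote

# Characters kept literal: RFC 2396 'reserved' plus 'mark' (assumption:
# an XPointer fragment may carry all of them unescaped, including '()').
LITERAL = ";/?:@&=+$,-_.!~*'()"


def escape_fragment(fragment: str) -> str:
    """Encode characters outside the literal set as UTF-8, one %HH per byte."""
    return quote(fragment, safe=LITERAL, encoding="utf-8")


def hide_parentheses(text: str) -> str:
    """Write parentheses as %28 / %29 so that an unbalanced one is not
    mistaken for the end of a scheme part (the alternative to ^ escaping)."""
    return text.replace("(", "%28").replace(")", "%29")


def utf16_units(text: str) -> int:
    """Length in UTF-16 code units, the unit DOM positions are based on."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # U+00E9 becomes the two UTF-8 bytes C3 A9; the parentheses stay as is.
    print(escape_fragment("id(caf\u00e9)"))
    # An unbalanced ')' inside a string argument is hidden, then recovered.
    hidden = hide_parentheses("a)b")
    print(hidden, unquote(hidden))
    # A character outside the BMP is one UCS character but two UTF-16 units.
    sample = "a\U00010400"
    print(len(sample), utf16_units(sample))
```

The last comparison is the reason the DOM and XPointer notions of an
index can disagree for the same text: Python's len() counts UCS
characters, as XPath strings do, while utf16_units counts what DOM
counts.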
Received on Sunday, 26 December 1999 22:28:29 UTC