- From: Mike Brown <mbrown@corp.webb.net>
- Date: Mon, 29 Jan 2001 12:28:07 -0700
- To: "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
Section 8 is a mess, IMHO. I suggest making the following changes.

1. Divide it into subsections for better reference. One should be able to say "I am putting this in this W3C spec because it has to conform to section 8.1.1 of the 'Character Model for the WWW' Recommendation."

2. The opening statement implies that URI references cannot contain non-ASCII and disallowed ASCII characters at all, then it clarifies that such characters can be referenced using an escaping mechanism. It is also somewhat mistaken in the assertion that %HH escaping is provided for the purpose of escaping "arbitrary byte values". It is actually only defined as being a solution for escaping disallowed ASCII characters. However, it does imply that it is OK to use the escaping for non-ASCII characters as well. It just does not say exactly how to do it. I suggest making the text less ambiguous:

    According to the definition in [RFC 2396], URI references are comprised of a subset of characters from the US-ASCII repertoire. To represent non-ASCII and other disallowed characters, the RFC specifies an escaping mechanism whereby the disallowed characters are mapped to bytes and the bytes are represented using the %HH convention.

Then I would rephrase the rest of the opening paragraph, noting that RFC 2396 defers to the scheme-specific specs the recommendation of what encoding to use as the basis for the escaping. This is a point you seem to have completely missed. URI scheme specs are supposed to dictate what encoding to use for %HH escaping, but few actually do. (URN does, HTTP doesn't). By mandating that protocol/format elements must use UTF-8 as the basis for %HH escaping, you are in conflict with widely deployed implementations of schemes (HTTP in particular) that do not enforce such interpretation. It is also impossible to know what scheme is being used in the case of relative URIs!
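To make the ambiguity concrete, here is a minimal sketch in Python. The sample string and variable names are only illustrative assumptions, and the choice of ISO-8859-1 as the competing encoding is mine; the point is simply that the same character yields different %HH sequences depending on which character-to-byte encoding the escaper assumes, and that a receiver assuming a different encoding will misread the result.

```python
from urllib.parse import quote, unquote

# Hypothetical path segment containing one non-ASCII character (U+00E9).
segment = "caf\u00e9"

# The character is first mapped to bytes, then each byte becomes %HH.
# Different encodings give different bytes, hence different escapes.
utf8_escaped = quote(segment, encoding="utf-8")
latin1_escaped = quote(segment, encoding="iso-8859-1")

# A receiver that assumes UTF-8 cannot decode the ISO-8859-1 form
# (the lone 0xE9 byte is replaced with U+FFFD) ...
misread_as_utf8 = unquote(latin1_escaped, encoding="utf-8", errors="replace")

# ... and a receiver that assumes ISO-8859-1 garbles the UTF-8 form.
misread_as_latin1 = unquote(utf8_escaped, encoding="iso-8859-1")

print(utf8_escaped)             # caf%C3%A9
print(latin1_escaped)           # caf%E9
print(ascii(misread_as_utf8))   # 'caf\ufffd'
print(ascii(misread_as_latin1)) # 'caf\xc3\xa9'
```

Nothing in either escaped string says which encoding produced it, which is why a spec that mandates UTF-8 can disagree with a server that has always used something else.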
I have written about this issue at http://skew.org/xml/misc/URI-i18n/ and have provided a specific example of how enforcing UTF-8 based escaping can cause conflict. See the section entitled "XML related considerations".

Suggested text for end of opening paragraph (immediately follows text above):

    However, this mechanism is of little utility on its own, because it does not mandate which character-to-byte encoding to use. Other RFCs that cover different URI schemes (HTTP, URN, etc.) are expected to specify the encoding and semantics for each scheme. Unfortunately, few actually do. Consequently, widely deployed implementations of those schemes vary in how they interpret URI references. W3C specifications must address the ramifications of this situation according to the following directives.

3. Delete examples of URI reference subsets, or make the terminology less vague. Incorporate the "new syntax for URIs" section here rather than breaking it out later; it is redundant to speak of URI reference subsets and then to have fragment identifiers as being separate from this. Suggested text:

    W3C specifications that define protocol or format elements (e.g., HTTP headers, XML attributes) that are to be interpreted as URI references or subsets thereof (including new types of fragment identifiers), MUST allow such elements to contain non-ASCII and otherwise disallowed ASCII characters.

4. Delete examples of ASCII chars disallowed as literals in URIs. RFC 2396 covers them; why repeat here? This kind of repetition burdened the XML spec because it was not clear which statements were informative summaries of referenced specs, and which statements were normative requirements. Suggested text (immediately follows the text above):

    Such W3C specifications must specify that the disallowed characters must follow the %HH escaping convention defined in [RFC 2396].
    Such W3C specifications must specify that the disallowed characters be handled for %HH escaping as follows: (followed by the usual 3-step explanation)

5. The section on 'conversion' of protocol/format elements to URI references does not make sense and the need for it is unclear. I suggest leaving it out altogether. If the elements are expected to be interpreted as URI references, then for all intents and purposes they *are* URI references. What confusion has there been in the past that led to the need for the proposed requirements?

6. Discuss how to handle conflicts between the mandate of UTF-8 in URI references and actual implementations, particularly w.r.t. HTTP. I really do not have a suggestion for how to solve this issue, except that maybe you could just require W3C specs to acknowledge that it is not always practical to force UTF-8 interpretation of such references.

7. Move all lengthy examples and notes to a subsection at the end, and clearly indicate that they are informative, not normative, examples and discussion.

- Mike
____________________________________________________________________
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://skew.org/xml/
Received on Monday, 29 January 2001 14:27:33 UTC