CM4WWW 1.0 - Suggestion re: Discussion of URI i18n from Mike Brown on 2001-01-29 (www-i18n-comments@w3.org from January 2001)

From: Mike Brown <mbrown@corp.webb.net>
Date: Mon, 29 Jan 2001 12:28:07 -0700
To: "'www-i18n-comments@w3.org'" <www-i18n-comments@w3.org>
Message-ID: <8D96EDA0AC04D31197B400A0C96C1480F70990@ossex1.webb.net>

Section 8 is a mess, IMHO. I suggest making the following changes.

1. Divide it into subsections for better reference. One should be able to
say "I am putting this in this W3C spec because it has to conform to section
8.1.1 of the 'Character Model for the WWW' Recommendation."

2. The opening statement implies that URI references cannot contain
non-ASCII and disallowed ASCII characters at all, then it clarifies that
such characters can be represented using an escaping mechanism. It is also
somewhat mistaken in asserting that %HH escaping is provided for the
purpose of escaping "arbitrary byte values". RFC 2396 actually defines it
only as a solution for escaping disallowed ASCII characters. However, the
RFC does imply that it is OK to use the escaping for non-ASCII characters
as well. It just does not say exactly how to do it.

I suggest making the text less ambiguous:

   According to the definition in [RFC 2396], URI references
   consist of a subset of characters from the US-ASCII
   repertoire. To represent non-ASCII and other disallowed
   characters, the RFC specifies an escaping mechanism
   whereby the disallowed characters are mapped to bytes and
   the bytes are represented using the %HH convention.

Then I would rephrase the rest of the opening paragraph, noting that RFC
2396 leaves it to the scheme-specific specs to say which encoding to use
as the basis for the escaping. This is a point you seem to have completely
missed. URI scheme specs are supposed to dictate what encoding to use for
%HH escaping, but few actually do (URN does, HTTP doesn't). By mandating
that protocol/format elements must use UTF-8 as the basis for %HH
escaping, you are in conflict with widely deployed implementations of
schemes (HTTP in particular) that do not enforce such an interpretation.
It is also impossible to know what scheme is being used in the case of
relative URIs!
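
To make the ambiguity concrete, here is a minimal Python sketch. The
character chosen and the two encodings compared are assumptions picked
for illustration; they are not taken from any spec. It shows the same
character producing two different %HH sequences, and a UTF-8-only reader
failing to recover the Latin-1 form.

```python
from urllib.parse import quote, unquote

char = "é"  # U+00E9, an illustrative non-ASCII character

# The %HH form depends entirely on which character-to-byte encoding is used.
utf8_escaped = quote(char, encoding="utf-8")          # two bytes: C3 A9
latin1_escaped = quote(char, encoding="iso-8859-1")   # one byte: E9

# A consumer that assumes UTF-8 cannot decode the Latin-1 form; with
# errors="replace" it yields U+FFFD instead of the original character.
misread = unquote(latin1_escaped, encoding="utf-8", errors="replace")

# Only a consumer that knows the producer's encoding gets the character back.
roundtrip = unquote(latin1_escaped, encoding="iso-8859-1")

print(utf8_escaped, latin1_escaped, repr(misread), roundtrip)
```

Nothing in the escaped string itself says which of the two encodings
produced it, which is the gap the scheme specs were expected to fill.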

I have written about this issue at http://skew.org/xml/misc/URI-i18n/ and
have provided a specific example of how enforcing UTF-8 based escaping can
cause conflict. See the section entitled "XML related considerations".

Suggested text for end of opening paragraph (immediately follows text
above):

   However, this mechanism is of little utility on its own,
   because it does not mandate which character-to-byte
   encoding to use. Other RFCs that cover different URI
   schemes (HTTP, URN, etc.) are expected to specify the
   encoding and semantics for each scheme. Unfortunately,
   few actually do. Consequently, widely deployed
   implementations of those schemes vary in how they
   interpret URI references. W3C specifications must address
   the ramifications of this situation according to the
   following directives.

3. Delete examples of URI reference subsets, or make the terminology less
vague. Incorporate the "new syntax for URIs" section here rather than
breaking it out later; it is redundant to speak of URI reference subsets and
then to have fragment identifiers as being separate from this.

Suggested text:

   W3C specifications that define protocol or format
   elements (e.g., HTTP headers, XML attributes) that are to
   be interpreted as URI references or subsets thereof
   (including new types of fragment identifiers), MUST allow
   such elements to contain non-ASCII and otherwise
   disallowed ASCII characters.

4. Delete examples of ASCII chars disallowed as literals in URIs. RFC 2396
covers them; why repeat here? This kind of repetition burdened the XML spec
because it was not clear which statements were informative summaries of
referenced specs, and which statements were normative requirements.

Suggested text (immediately follows the text above):

   Such W3C specifications must specify that the disallowed
   characters must follow the %HH escaping convention
   defined in [RFC 2396].

   Such W3C specifications must specify that the disallowed
   characters be handled for %HH escaping as follows:

     (followed by the usual 3-step explanation)
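
For reference, a rough Python sketch of what such handling might look
like. It assumes the three steps are: convert each disallowed character
to UTF-8 bytes, escape each byte as %HH, and substitute the result for
the original character. The function name and the exact allowed-character
set (RFC 2396 reserved and unreserved characters, plus "%" and "#") are
my own choices for illustration, not wording from the draft.

```python
import string

# Characters assumed to be allowed as literals: RFC 2396 reserved and
# unreserved characters, plus "%" (existing escapes) and "#" (fragment
# delimiter in a URI reference).
RESERVED = ";/?:@&=+$,"
MARK = "-_.!~*'()"
ALLOWED = set(string.ascii_letters + string.digits + RESERVED + MARK + "%#")


def escape_uri_reference(text: str) -> str:
    """Hypothetical helper: %HH-escape every disallowed character via UTF-8."""
    out = []
    for ch in text:
        if ch in ALLOWED:
            out.append(ch)  # allowed characters pass through untouched
        else:
            # Step 1: convert the character to UTF-8 bytes.
            # Step 2: escape each byte as %HH.
            # Step 3: the escapes replace the original character.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(escape_uri_reference("http://example.org/café menu"))
```

Note that existing "%" characters are passed through as-is, so input
that is already escaped is not escaped a second time.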

5. The section on 'conversion' of protocol/format elements to URI references
does not make sense and the need for it is unclear. I suggest leaving it out
altogether. If the elements are expected to be interpreted as URI
references, then for all intents and purposes they *are* URI references.
What confusion has there been in the past that led to the need for the
proposed requirements?

6. Discuss how to handle conflicts between the mandate of UTF-8 in URI
references and actual implementations, particularly w.r.t. HTTP. I really do
not have a suggestion for how to solve this issue, except that maybe you
could just require W3C specs to acknowledge that it is not always practical
to force UTF-8 interpretation of such references.

7. Move all lengthy examples and notes to a subsection at the end, and
clearly indicate that they are informative, not normative, examples and
discussion.


   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at            My XML/XSL resources: 
webb.net in Denver, Colorado, USA              http://skew.org/xml/
Received on Monday, 29 January 2001 14:27:33 UTC