Issues: Section 3.1 references to non-ASCII characters from Chris Haynes on 2004-05-11 (public-iri@w3.org from May 2004)

From: Chris Haynes <chris@harvington.org.uk>
Date: Tue, 11 May 2004 10:37:52 +0100
To: <public-iri@w3.org>
Message-ID: <01b701c4373b$a33a1f70$0200000a@ringo>
Here are three related issues re. draft 7, Sect 3.1, Step 2.



I have a concern with the sentence "The disallowed characters consist of all
non-ASCII characters allowed in IRIs."

(Issue 1) Since this step is referring to (presumably legal) IRIs, then the
phrase "allowed in IRIs" is superfluous - there could be no others.

--------------

(Issue 2) Is the phrase "non-ASCII characters" sufficiently precise / normative?

I think there is a much cleaner definition available, providing you don't mind
dropping the allusion to the reasoning:

"The disallowed characters consist of all those matching 'ucschar' or 'iprivate'
of Section 2.2"

Alternatively, you could say something like "The disallowed characters consist of
all those whose UTF-8 encodings employ two or more octets" (which is more to the
point and all-embracing).
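
The octet-count wording can be checked mechanically. The following is a minimal Python sketch, not part of the draft: the helper names `utf8_octet_count` and `is_disallowed` are hypothetical, and the only fact relied on is that UTF-8 encodes code points U+0000 through U+007F in one octet and everything above in two or more.

```python
def utf8_octet_count(ch: str) -> int:
    """Return the number of octets in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


def is_disallowed(ch: str) -> bool:
    """Hypothetical predicate for the alternative wording:
    disallowed means the UTF-8 encoding uses two or more octets."""
    return utf8_octet_count(ch) >= 2


# A few sample characters on either side of the ASCII boundary.
samples = ["a", "~", "\u007f", "\u0080", "\u00e9", "\u20ac", "\U00010000"]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_octet_count(ch)} octet(s), "
          f"disallowed={is_disallowed(ch)}")
```

Under this definition the test is equivalent to "code point greater than U+007F", which is why it covers more characters than the 'ucschar' / 'iprivate' wording might.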

--------------


The definition of disallowed characters now leads us to an apparent conflict
with step 2.1, which currently says to "convert the character to one or more
octets using UTF-8".

Unless I've misunderstood some subtlety in the definition of 'disallowed
characters', all such characters will require at least two octets for their
encoding so we reach issue 3:


(Issue 3) In Step 2.1 none of the characters to be so processed can have just
one octet in their UTF-8 encoding, so the instruction, strictly speaking, cannot
be obeyed.

--------------


I also find the mixture of negatives and plurals in the introduction to step 2
somewhat confusing, so I've taken the liberty of suggesting some re-drafts which
address all three issues.


One possible re-draft of the start of Step 2, which consolidates all the above
points, is:

Version 1:
vvvvvvvvvvvvvvvv
   Step 2)
      IRI characters matching 'ucschar' or 'iprivate' (section 2.2) are
      disallowed in URI references. For each such character apply steps 2.1
      through 2.3 below.

         2.1) Encode the disallowed character using UTF-8, which will generate
         a sequence of two or more octets.

         2.2) Convert each octet to %HH ........
^^^^^^^^^^^^^^^^

An alternative re-draft is:

Version 2:
vvvvvvvvvvvvvvvv
   Step 2)
      IRI characters whose UTF-8 encodings employ two or more octets are
      disallowed in URI references. For each such character apply steps 2.1
      through 2.3 below.

         2.1) Encode the disallowed character using UTF-8, which will generate
         a sequence of two or more octets.

         2.2) Convert each octet to %HH ........
^^^^^^^^^^^^^^^^


Yet a third, which restores the reasoning, is:

Version 3:
vvvvvvvvvvvvvvvv
   Step 2)
      IRI characters whose UTF-8 encodings employ two or more octets are
      disallowed in URI references because they are not US-ASCII characters.
      For each such character apply steps 2.1 through 2.3 below.

         2.1) Encode the disallowed character using UTF-8, which will generate
         a sequence of two or more octets.

         2.2) Convert each octet to %HH ........
^^^^^^^^^^^^^^^^

Take your pick!
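
For illustration, steps 2.1 and 2.2 as worded in the re-drafts above can be sketched in Python. This is only a sketch under stated assumptions: the function name `iri_to_uri_step2` is hypothetical, the upper-case hex digits are a choice made here rather than something the quoted text fixes, and because step 2.3 is not quoted in this message the sketch simply assumes the resulting %HH sequence takes the place of the original character.

```python
def iri_to_uri_step2(iri: str) -> str:
    """Percent-encode every character whose UTF-8 encoding needs two or
    more octets; leave all other characters untouched."""
    out = []
    for ch in iri:
        if ord(ch) > 0x7F:
            # Step 2.1: encode the disallowed character using UTF-8
            # (always two or more octets for code points above U+007F).
            octets = ch.encode("utf-8")
            # Step 2.2: convert each octet to %HH.
            # Upper-case hex is an assumption of this sketch.
            out.append("".join(f"%{octet:02X}" for octet in octets))
        else:
            # One-octet (US-ASCII) characters are passed through unchanged.
            out.append(ch)
    # Assumed stand-in for the unquoted step 2.3: the %HH sequences
    # replace the original characters in the output string.
    return "".join(out)


print(iri_to_uri_step2("http://example.org/r\u00e9sum\u00e9"))
```

Note that the `else` branch is the only place a one-octet encoding could occur, and those characters never reach step 2.1, which is the point of Issue 3.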


---------------------------

One final general point: throughout the document I can see both 'ASCII' and
'US-ASCII' in use. Should not a single designation be selected, and a normative
reference supplied (such as that in RFC 2396 [ASCII])?


HTH

Chris Haynes
Received on Tuesday, 11 May 2004 05:41:11 UTC