Re: Issues: Section 3.1 references to non-ASCII characters from Martin Duerst on 2004-05-20 (public-iri@w3.org from May 2004)

From: Martin Duerst <duerst@w3.org>
Date: Thu, 20 May 2004 13:31:31 +0900
To: "Chris Haynes" <chris@harvington.org.uk>, <public-iri@w3.org>
Message-Id: <4.2.0.58.J.20040519172334.060bfd80@localhost>
Hello Chris,

I have noted this as issue non-ASCII-3.1-33.

At 10:37 04/05/11 +0100, Chris Haynes wrote:

>Here are three related issues re. draft 7, Sect 3.1, Step 2.
>
>
>
>I have a concern with the sentence "The disallowed characters consist of all
>non-ASCII characters allowed in IRIs."
>
>(Issue 1) Since this step is referring to (presumably legal) IRIs, then the
>phrase "allowed in IRIs" is superfluous - there could be no others.

see below.


>--------------
>
>(Issue 2) Is the phrase "non-ASCII characters" sufficiently precice / 
>normative?
>
>I think here is a much cleaner definition available, providing you don't mind
>dropping the allusion to the reasoning:
>
>"The disallowed characters consist of all those matching 'ucschar' or 
>'iprivate'
>of Section 2.2"
>
>Altenatively, you could say something like "The disallowed characters 
>consist of
>all those whose UTF-8 encodings employ two or more octets" (which is more 
>to the
>point and all-embracing).

see below.

>--------------
>
>
>The definition of disallowed characters  now leads us to an apparent conflict
>with step 2.1, which currently says to "convert the character to one or more
>octets using UTF-8".
>
>Unless I've misunderstood some subtlety in the definition of 'disallowed
>characters', all such characters will require at least two octets for their
>encoding so we reach issue 3:
>
>
>(Issue 3)  In Step 2.1 none of the characters to be so processed can have just
>one octet in their UCS-8 encoding, so the instruction, strictly-speaking, 
>cannot
>be obeyed.

Well, you are right, except that strictly speaking, there is also
the following possibility, mentioned later in the spec:

 >>>>
Infrastructure accepting IRIs MAY also deal with the printable characters
in US-ASCII that are not allowed in URIs, namely "&lt;", "&gt;", '"',
Space, "{", "}", "|", "\", "^", and "`", in Step 2.2 above.
 >>>>

Except that it should say "Step 2", which I have fixed.



>--------------
>
>
>I also find the mixture of negatives and plurals in the introduction to step 2
>somewhat confusing, so I've taken the liberty of suggesting some re-drafts 
>which
>addresses all three issues.
>
>
>One possible re-draft of the start of Step 2, which consolidates all the above
>points, is:
>
>Version 1:
>  vvvvvvvvvvvvvvvv
>    Step 2)
>       IRI characters matching 'ucschar' or 'iprivate' (section 2.2) are
>disallowed in URI
>       references. For each such character apply steps 2.1 through 2.3 below..
>
>          2.1) Encode the disallowed character using UTF-8, which will 
> generate a
>sequence
>          of two or more octets.
>
>          2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^
>
>An alternative re-draft is:
>
>Version 2:
>vvvvvvvvvvvvvvvv
>    Step 2)
>       IRI characters whose UCS-8 encodings emply two or more octets are
>disallowed in
>       URI references. For each such character apply steps 2.1 through 2.3
>below..
>
>          2.1) Encode the disallowed character using UTF-8, which will 
> generate a
>sequence
>           of two or more octets.
>
>          2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^
>
>
>yet a third, which restores the reasoning, is:
>
>Version 3:
>vvvvvvvvvvvvvvvv
>    Step 2)
>       IRI characters whose UCS-8 encodings emply two or more octets are
>disallowed in
>       URI references because they are not US-ASCII characters. For each such
>character
>       apply steps 2.1 through 2.3 below..
>
>          2.1) Encode the disallowed character using UTF-8, which will 
> generate a
>sequence
>          of two or more octets.
>
>         2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^

I think we can do this even shorter and more clearly.

I changed the introduction of step 2) to:

 >>>>
    For each character in 'ucschar' or 'iprivate', apply
    Steps 2.1 through 2.3 below.
 >>>>

I think that addresses your issues 1 and 2.


>Take your pick!
>
>
>---------------------------
>
>One final general point:  throughout the document I can see both 'ASCII' and
>'US-ASCII' in use. Should not a single designation be selected, and a 
>normative
>reference supplied (such as that in RFC 2396 [ASCII] )?

Good catch. I have added the reference, and went through the document
to change everything to US-ASCII, except for 'non-ASCII' (also used in
RFC 2396bis) and things like ToASCII,... I also changed a couple
occurrences of "US-ASCII range" to "US-ASCII repertoire" to allign
with terminology. I also cought one occurrence where US-ASCII is
alluded to as a script, which I fixed.


I hope this addresses your issues. Please confirm.

Regards,    Martin.
Received on Thursday, 20 May 2004 00:36:05 UTC