- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 20 May 2004 13:31:31 +0900
- To: "Chris Haynes" <chris@harvington.org.uk>, <public-iri@w3.org>
Hello Chris,
I have noted this as issue non-ASCII-3.1-33.
At 10:37 04/05/11 +0100, Chris Haynes wrote:
>Here are three related issues re. draft 7, Sect 3.1, Step 2.
>
>
>
>I have a concern with the sentence "The disallowed characters consist of all
>non-ASCII characters allowed in IRIs."
>
>(Issue 1) Since this step is referring to (presumably legal) IRIs, then the
>phrase "allowed in IRIs" is superfluous - there could be no others.
See below.
>--------------
>
>(Issue 2) Is the phrase "non-ASCII characters" sufficiently precise /
>normative?
>
>I think there is a much cleaner definition available, providing you don't mind
>dropping the allusion to the reasoning:
>
>"The disallowed characters consist of all those matching 'ucschar' or
>'iprivate' of Section 2.2"
>
>Alternatively, you could say something like "The disallowed characters
>consist of all those whose UTF-8 encodings employ two or more octets"
>(which is more to the point and all-embracing).
See below.
>--------------
>
>
>The definition of disallowed characters now leads us to an apparent conflict
>with step 2.1, which currently says to "convert the character to one or more
>octets using UTF-8".
>
>Unless I've misunderstood some subtlety in the definition of 'disallowed
>characters', all such characters will require at least two octets for their
>encoding so we reach issue 3:
>
>
>(Issue 3) In Step 2.1 none of the characters to be so processed can have just
>one octet in their UTF-8 encoding, so the instruction, strictly speaking,
>cannot be obeyed.
Well, you are right, except that strictly speaking, there is also
the following possibility, mentioned later in the spec:
>>>>
Infrastructure accepting IRIs MAY also deal with the printable characters
in US-ASCII that are not allowed in URIs, namely "<", ">", '"',
Space, "{", "}", "|", "\", "^", and "`", in Step 2.2 above.
>>>>
Except that it should say "Step 2", which I have fixed.
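For illustration, the optional handling of those US-ASCII characters might be
sketched as follows. The names EXTRA_ASCII and escape_extra_ascii are
hypothetical and do not come from the draft; the sketch only shows that each
of these characters yields exactly one UTF-8 octet, which is the case the
wording "one or more octets" in Step 2.1 still has to cover.

```python
# Printable US-ASCII characters not allowed in URIs, as listed in the quoted
# passage: < > " Space { } | \ ^ `
# (EXTRA_ASCII is a hypothetical name used only for this sketch.)
EXTRA_ASCII = set('<>" {}|\\^`')


def escape_extra_ascii(text):
    """Percent-encode only the characters in EXTRA_ASCII.

    Each of them is a single octet in UTF-8, so each becomes one %HH triplet.
    """
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in EXTRA_ASCII else ch
        for ch in text
    )


if __name__ == "__main__":
    print(escape_extra_ascii("a b<c>"))  # prints a%20b%3Cc%3E
```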
>--------------
>
>
>I also find the mixture of negatives and plurals in the introduction to step 2
>somewhat confusing, so I've taken the liberty of suggesting some re-drafts
>which address all three issues.
>
>
>One possible re-draft of the start of Step 2, which consolidates all the above
>points, is:
>
>Version 1:
> vvvvvvvvvvvvvvvv
> Step 2)
> IRI characters matching 'ucschar' or 'iprivate' (section 2.2) are
> disallowed in URI references. For each such character apply steps 2.1
> through 2.3 below.
>
> 2.1) Encode the disallowed character using UTF-8, which will
> generate a sequence of two or more octets.
>
> 2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^
>
>An alternative re-draft is:
>
>Version 2:
>vvvvvvvvvvvvvvvv
> Step 2)
> IRI characters whose UTF-8 encodings employ two or more octets are
> disallowed in URI references. For each such character apply steps 2.1
> through 2.3 below.
>
> 2.1) Encode the disallowed character using UTF-8, which will
> generate a sequence of two or more octets.
>
> 2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^
>
>
>Yet a third, which restores the reasoning, is:
>
>Version 3:
>vvvvvvvvvvvvvvvv
> Step 2)
> IRI characters whose UTF-8 encodings employ two or more octets are
> disallowed in URI references because they are not US-ASCII characters.
> For each such character apply steps 2.1 through 2.3 below.
>
> 2.1) Encode the disallowed character using UTF-8, which will
> generate a sequence of two or more octets.
>
> 2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^
I think we can make this even shorter and clearer.
I changed the introduction of step 2) to:
>>>>
For each character in 'ucschar' or 'iprivate', apply
Steps 2.1 through 2.3 below.
>>>>
I think that addresses your issues 1 and 2.
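As a rough sketch of what the revised Step 2 amounts to, the mapping could be
written as below. The code point ranges are assumed from the 'ucschar' and
'iprivate' productions as later published in RFC 3987, Section 2.2 (draft 7
may differ in detail), and the names UCSCHAR, IPRIVATE, is_disallowed and
iri_step2 are hypothetical.

```python
# Assumed ranges for 'ucschar' (RFC 3987, Section 2.2).
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)
# Assumed ranges for 'iprivate' (RFC 3987, Section 2.2).
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def is_disallowed(ch):
    """True if ch matches 'ucschar' or 'iprivate'."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)


def iri_step2(iri):
    """Apply Steps 2.1 through 2.3 to every matching character."""
    out = []
    for ch in iri:
        if is_disallowed(ch):
            octets = ch.encode("utf-8")                          # Step 2.1
            out.append("".join("%{:02X}".format(b) for b in octets))  # 2.2
        else:
            out.append(ch)
    return "".join(out)                                          # Step 2.3


if __name__ == "__main__":
    # U+00E9 is C3 A9 in UTF-8.
    print(iri_step2("http://example.org/r\u00e9sum\u00e9"))
    # prints http://example.org/r%C3%A9sum%C3%A9
```

Since every character in these ranges is U+00A0 or above, Step 2.1 always
yields at least two octets for them, which is the point behind issue 3.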
>Take your pick!
>
>
>---------------------------
>
>One final general point: throughout the document I can see both 'ASCII' and
>'US-ASCII' in use. Should not a single designation be selected, and a
>normative reference supplied (such as that in RFC 2396 [ASCII])?
Good catch. I have added the reference and gone through the document
to change everything to US-ASCII, except for 'non-ASCII' (also used in
RFC 2396bis) and things like ToASCII. I also changed a couple of
occurrences of "US-ASCII range" to "US-ASCII repertoire" to align
with terminology. I also caught one occurrence where US-ASCII is
alluded to as a script, which I fixed.
I hope this addresses your issues. Please confirm.
Regards, Martin.
Received on Thursday, 20 May 2004 00:36:05 UTC