Fwd: Re: Issues: Section 3.1 references to non-ASCII characters

[forwarded to the list, for documentation purposes]

>Date: Fri, 28 May 2004 15:02:52 +0900
>To: "Chris Haynes" <chris@harvington.org.uk>
>From: Martin Duerst <duerst@w3.org>
>Subject: Re: Issues: Section 3.1 references to non-ASCII characters
>
>At 08:07 04/05/20 +0100, Chris Haynes wrote:
>>Thanks Martin.
>>
>>Your amendments satisfy all my concerns.
>
>Okay, thanks, I have closed this issue.
>
>Regards,    Martin.
>
>
>>Chris Haynes
>>
>>
>>
>>  "Martin Duerst" responded:
>>
>> >
>> > Hello Chris,
>> >
>> > I have noted this as issue non-ASCII-3.1-33.
>> >
>> > At 10:37 04/05/11 +0100, Chris Haynes wrote:
>> >
>> > >Here are three related issues re. draft 7, Sect 3.1, Step 2.
>> > >
>> > >
>> > >
>> > >I have a concern with the sentence "The disallowed characters consist of
>> > >all non-ASCII characters allowed in IRIs."
>> > >
>> > >(Issue 1) Since this step is referring to (presumably legal) IRIs, then
>> > >the phrase "allowed in IRIs" is superfluous - there could be no others.
>> >
>> > see below.
>> >
>> >
>> > >--------------
>> > >
>> > >(Issue 2) Is the phrase "non-ASCII characters" sufficiently precise /
>> > >normative?
>> > >
>> > >I think there is a much cleaner definition available, providing you
>> > >don't mind dropping the allusion to the reasoning:
>> > >
>> > >"The disallowed characters consist of all those matching 'ucschar' or
>> > >'iprivate' of Section 2.2"
>> > >
>> > >Alternatively, you could say something like "The disallowed characters
>> > >consist of all those whose UTF-8 encodings employ two or more octets"
>> > >(which is more to the point and all-embracing).
>> >
>> > see below.
>> >
>> > >--------------
>> > >
>> > >
>> > >The definition of disallowed characters now leads us to an apparent
>> > >conflict with step 2.1, which currently says to "convert the character
>> > >to one or more octets using UTF-8".
>> > >
>> > >Unless I've misunderstood some subtlety in the definition of 'disallowed
>> > >characters', all such characters will require at least two octets for
>> > >their encoding so we reach issue 3:
>> > >
>> > >
>> > >(Issue 3) In Step 2.1 none of the characters to be so processed can have
>> > >just one octet in their UTF-8 encoding, so the instruction, strictly
>> > >speaking, cannot be obeyed.
>> >
>> > Well, you are right, except that strictly speaking, there is also
>> > the following possibility, mentioned later in the spec:
>> >
>> >  >>>>
>> > Infrastructure accepting IRIs MAY also deal with the printable characters
>> > in US-ASCII that are not allowed in URIs, namely "<", ">", '"',
>> > Space, "{", "}", "|", "\", "^", and "`", in Step 2.2 above.
>> >  >>>>
>> >
>> > Except that it should say "Step 2", which I have fixed.
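
The octet counts at stake in Issue 3 can be checked with a short Python sketch. It is only an illustration: the helper name utf8_octets and the sample characters are assumptions made here, not anything taken from the draft. It shows that the US-ASCII characters mentioned in the quoted paragraph encode to a single octet, whereas characters from U+0080 upwards need two or more.

```python
def utf8_octets(ch: str) -> list:
    """Return the UTF-8 octets of a single character as a list of integers."""
    return list(ch.encode("utf-8"))


# Hypothetical sample: two US-ASCII characters not allowed in URIs,
# followed by three non-ASCII characters of increasing code point.
samples = ["<", " ", "\u00e9", "\u20ac", "\U00010348"]

for ch in samples:
    octets = utf8_octets(ch)
    hex_octets = " ".join(f"{b:02X}" for b in octets)
    print(f"U+{ord(ch):04X}: {len(octets)} octet(s): {hex_octets}")
```

With only 'ucschar' and 'iprivate' in scope, the "one" in "one or more octets" is never exercised; it becomes relevant once the optional US-ASCII characters are handled in the same step.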
>> >
>> >
>> >
>> > >--------------
>> > >
>> > >
>> > >I also find the mixture of negatives and plurals in the introduction to
>> > >step 2 somewhat confusing, so I've taken the liberty of suggesting some
>> > >re-drafts which address all three issues.
>> > >
>> > >
>> > >One possible re-draft of the start of Step 2, which consolidates all the
>> > >above points, is:
>> > >
>> > >Version 1:
>> > >vvvvvvvvvvvvvvvv
>> > >    Step 2)
>> > >       IRI characters matching 'ucschar' or 'iprivate' (section 2.2) are
>> > >       disallowed in URI references. For each such character apply
>> > >       steps 2.1 through 2.3 below.
>> > >
>> > >          2.1) Encode the disallowed character using UTF-8, which will
>> > >          generate a sequence of two or more octets.
>> > >
>> > >          2.2) Convert each octet to %HH ........
>> > >^^^^^^^^^^^^^^^^
>> > >
>> > >An alternative re-draft is:
>> > >
>> > >Version 2:
>> > >vvvvvvvvvvvvvvvv
>> > >    Step 2)
>> > >       IRI characters whose UTF-8 encodings employ two or more octets are
>> > >       disallowed in URI references. For each such character apply
>> > >       steps 2.1 through 2.3 below.
>> > >
>> > >          2.1) Encode the disallowed character using UTF-8, which will
>> > >          generate a sequence of two or more octets.
>> > >
>> > >          2.2) Convert each octet to %HH ........
>> > >^^^^^^^^^^^^^^^^
>> > >
>> > >
>> > >yet a third, which restores the reasoning, is:
>> > >
>> > >Version 3:
>> > >vvvvvvvvvvvvvvvv
>> > >    Step 2)
>> > >       IRI characters whose UTF-8 encodings employ two or more octets are
>> > >       disallowed in URI references because they are not US-ASCII
>> > >       characters. For each such character apply steps 2.1 through 2.3
>> > >       below.
>> > >
>> > >          2.1) Encode the disallowed character using UTF-8, which will
>> > >          generate a sequence of two or more octets.
>> > >
>> > >          2.2) Convert each octet to %HH ........
>> > >^^^^^^^^^^^^^^^^
>> >
>> > I think we can do this even shorter and more clearly.
>> >
>> > I changed the introduction of step 2) to:
>> >
>> >  >>>>
>> >     For each character in 'ucschar' or 'iprivate', apply
>> >     Steps 2.1 through 2.3 below.
>> >  >>>>
>> >
>> > I think that addresses your issues 1 and 2.
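
The reworded Step 2 can be sketched in Python as follows. This is a minimal illustration, not text from the draft: the function names are hypothetical, the 'ucschar' and 'iprivate' code point ranges are those of RFC 3987 as later published (draft 7 may have differed), and the uppercase hex digits in %HH are a choice made for this sketch. Host-name handling and every other conversion step are left out.

```python
# Code point ranges for 'ucschar' and 'iprivate', assumed from RFC 3987
# as published; the draft under discussion may have used other ranges.
UCSCHAR = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13 (0x10000-0x1FFFD ... 0xD0000-0xDFFFD).
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]


def in_ucschar_or_iprivate(ch: str) -> bool:
    """True if the character falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE)


def iri_step2(iri: str) -> str:
    """Apply Steps 2.1 through 2.3 to each 'ucschar'/'iprivate' character."""
    out = []
    for ch in iri:
        if in_ucschar_or_iprivate(ch):
            octets = ch.encode("utf-8")                          # Step 2.1
            out.append("".join(f"%{b:02X}" for b in octets))     # Step 2.2
        else:
            out.append(ch)                                       # left as-is
    return "".join(out)                                          # Step 2.3


print(iri_step2("http://example.org/r\u00e9sum\u00e9"))
```

Because membership is tested against the two productions directly, the sketch needs no separate notion of "disallowed" or "non-ASCII" characters, which is the simplification the new wording aims at.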
>> >
>> >
>> > >Take your pick!
>> > >
>> > >
>> > >---------------------------
>> > >
>> > >One final general point: throughout the document I can see both 'ASCII'
>> > >and 'US-ASCII' in use. Should not a single designation be selected, and
>> > >a normative reference supplied (such as that in RFC 2396 [ASCII] )?
>> >
>> > Good catch. I have added the reference, and went through the document
>> > to change everything to US-ASCII, except for 'non-ASCII' (also used in
>> > RFC 2396bis) and things like ToASCII,... I also changed a couple of
>> > occurrences of "US-ASCII range" to "US-ASCII repertoire" to align
>> > with terminology. I also caught one occurrence where US-ASCII is
>> > alluded to as a script, which I fixed.
>> >
>> >
>> > I hope this addresses your issues. Please confirm.
>> >
>> > Regards,    Martin.
>> >
>> >
>> >
>> >
>> >

Received on Friday, 28 May 2004 02:04:58 UTC