- From: Martin Duerst <duerst@w3.org>
- Date: Thu, 20 May 2004 13:31:31 +0900
- To: "Chris Haynes" <chris@harvington.org.uk>, <public-iri@w3.org>
Hello Chris,

I have noted this as issue non-ASCII-3.1-33.

At 10:37 04/05/11 +0100, Chris Haynes wrote:
>Here are three related issues re. draft 7, Sect 3.1, Step 2.
>
>I have a concern with the sentence "The disallowed characters consist of
>all non-ASCII characters allowed in IRIs."
>
>(Issue 1) Since this step is referring to (presumably legal) IRIs, the
>phrase "allowed in IRIs" is superfluous - there could be no others.

See below.

>--------------
>
>(Issue 2) Is the phrase "non-ASCII characters" sufficiently precise /
>normative?
>
>I think there is a much cleaner definition available, providing you don't
>mind dropping the allusion to the reasoning:
>
>"The disallowed characters consist of all those matching 'ucschar' or
>'iprivate' of Section 2.2"
>
>Alternatively, you could say something like "The disallowed characters
>consist of all those whose UTF-8 encodings employ two or more octets"
>(which is more to the point and all-embracing).

See below.

>--------------
>
>The definition of disallowed characters now leads us to an apparent
>conflict with step 2.1, which currently says to "convert the character to
>one or more octets using UTF-8".
>
>Unless I've misunderstood some subtlety in the definition of 'disallowed
>characters', all such characters will require at least two octets for
>their encoding, so we reach issue 3:
>
>(Issue 3) In Step 2.1 none of the characters to be so processed can have
>just one octet in their UTF-8 encoding, so the instruction, strictly
>speaking, cannot be obeyed.

Well, you are right, except that, strictly speaking, there is also the
following possibility, mentioned later in the spec:

>>>>
Infrastructure accepting IRIs MAY also deal with the printable characters
in US-ASCII that are not allowed in URIs, namely "<", ">", '"', Space,
"{", "}", "|", "\", "^", and "`", in Step 2.2 above.
>>>>

Except that it should say "Step 2", which I have fixed.
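
To make the Step 2 procedure discussed above concrete, here is a minimal Python sketch. It is an illustration only, not text from the draft. The 'ucschar' and 'iprivate' ranges are taken from Section 2.2 as later published in RFC 3987 (an assumption; draft 7 may differ in detail), and the names `needs_escaping`, `iri_to_uri_step2`, and `escape_optional_ascii` are hypothetical.

```python
# Sketch of Step 2 of the IRI-to-URI mapping. Ranges are assumed from
# RFC 3987, Section 2.2; the draft under discussion may differ.
UCSCHAR = [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
# Planes 1 through 13: %x10000-1FFFD ... %xD0000-DFFFD
UCSCHAR += [(p << 16, (p << 16) | 0xFFFD) for p in range(1, 0xE)]
UCSCHAR.append((0xE1000, 0xEFFFD))
IPRIVATE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

# Printable US-ASCII characters not allowed in URIs, which infrastructure
# MAY also handle in Step 2: < > " Space { } | \ ^ `
OPTIONAL_ASCII = set('<>" {}|\\^`')


def needs_escaping(ch, escape_optional_ascii=False):
    """Return True if Step 2 applies to this character."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in UCSCHAR + IPRIVATE):
        return True
    return escape_optional_ascii and ch in OPTIONAL_ASCII


def iri_to_uri_step2(iri, escape_optional_ascii=False):
    """Percent-encode every character that Step 2 applies to."""
    out = []
    for ch in iri:
        if needs_escaping(ch, escape_optional_ascii):
            # Step 2.1: convert the character to UTF-8 octets.
            # Step 2.2: convert each octet to %HH.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    # Joining the pieces replaces each original character by its sequence.
    return "".join(out)


print(iri_to_uri_step2("http://example.org/r\u00e9sum\u00e9"))
print(iri_to_uri_step2("a b", escape_optional_ascii=True))
```

With these ranges, every 'ucschar' or 'iprivate' character is at or above U+00A0 and so yields at least two UTF-8 octets. Only the optional US-ASCII characters yield a single octet, which is the case where "one or more octets" in Step 2.1 matters.
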
>--------------
>
>I also find the mixture of negatives and plurals in the introduction to
>step 2 somewhat confusing, so I've taken the liberty of suggesting some
>re-drafts which address all three issues.
>
>One possible re-draft of the start of Step 2, which consolidates all the
>above points, is:
>
>Version 1:
>vvvvvvvvvvvvvvvv
>  Step 2)
>  IRI characters matching 'ucschar' or 'iprivate' (section 2.2) are
>  disallowed in URI references. For each such character apply steps 2.1
>  through 2.3 below.
>
>  2.1) Encode the disallowed character using UTF-8, which will generate a
>       sequence of two or more octets.
>
>  2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^
>
>An alternative re-draft is:
>
>Version 2:
>vvvvvvvvvvvvvvvv
>  Step 2)
>  IRI characters whose UTF-8 encodings employ two or more octets are
>  disallowed in URI references. For each such character apply steps 2.1
>  through 2.3 below.
>
>  2.1) Encode the disallowed character using UTF-8, which will generate a
>       sequence of two or more octets.
>
>  2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^
>
>Yet a third, which restores the reasoning, is:
>
>Version 3:
>vvvvvvvvvvvvvvvv
>  Step 2)
>  IRI characters whose UTF-8 encodings employ two or more octets are
>  disallowed in URI references because they are not US-ASCII characters.
>  For each such character apply steps 2.1 through 2.3 below.
>
>  2.1) Encode the disallowed character using UTF-8, which will generate a
>       sequence of two or more octets.
>
>  2.2) Convert each octet to %HH ........
>^^^^^^^^^^^^^^^^

I think we can make this even shorter and clearer. I changed the
introduction of Step 2 to:

>>>>
For each character in 'ucschar' or 'iprivate', apply Steps 2.1 through
2.3 below.
>>>>

I think that addresses your issues 1 and 2.

>Take your pick!
>
>---------------------------
>
>One final general point: throughout the document I can see both 'ASCII'
>and 'US-ASCII' in use.
>Should not a single designation be selected, and a normative reference
>supplied (such as that in RFC 2396 [ASCII])?

Good catch. I have added the reference and gone through the document to
change everything to US-ASCII, except for 'non-ASCII' (also used in
RFC 2396bis) and names such as ToASCII. I also changed a couple of
occurrences of "US-ASCII range" to "US-ASCII repertoire" to align with the
terminology, and I caught one occurrence where US-ASCII was alluded to as
a script, which I fixed.

I hope this addresses your issues. Please confirm.

Regards,
Martin.
Received on Thursday, 20 May 2004 00:36:05 UTC