- From: Martin Duerst <duerst@w3.org>
- Date: Fri, 28 May 2004 15:04:41 +0900
- To: public-iri@w3.org
[forwarded to the list, for documentation purposes] >Date: Fri, 28 May 2004 15:02:52 +0900 >To: "Chris Haynes" <chris@harvington.org.uk> >From: Martin Duerst <duerst@w3.org> >Subject: Re: Issues: Section 3.1 references to non-ASCII characters > >At 08:07 04/05/20 +0100, Chris Haynes wrote: >>Thanks Martin. >> >>Your amendments satisfy all my concerns. > >Okay, thanks, I have closed this issue. > >Regards, Martin. > > >>Chris Haynes >> >> >> >> "Martin Duerst" responded: >> >> > >> > Hello Chris, >> > >> > I have noted this as issue non-ASCII-3.1-33. >> > >> > At 10:37 04/05/11 +0100, Chris Haynes wrote: >> > >> > >Here are three related issues re. draft 7, Sect 3.1, Step 2. >> > > >> > > >> > > >> > >I have a concern with the sentence "The disallowed characters consist >> of all >> > >non-ASCII characters allowed in IRIs." >> > > >> > >(Issue 1) Since this step is referring to (presumably legal) IRIs, >> then the >> > >phrase "allowed in IRIs" is superfluous - there could be no others. >> > >> > see below. >> > >> > >> > >-------------- >> > > >> > >(Issue 2) Is the phrase "non-ASCII characters" sufficiently precice / >> > >normative? >> > > >> > >I think here is a much cleaner definition available, providing you >> don't mind >> > >dropping the allusion to the reasoning: >> > > >> > >"The disallowed characters consist of all those matching 'ucschar' or >> > >'iprivate' >> > >of Section 2.2" >> > > >> > >Altenatively, you could say something like "The disallowed characters >> > >consist of >> > >all those whose UTF-8 encodings employ two or more octets" (which is more >> > >to the >> > >point and all-embracing). >> > >> > see below. >> > >> > >-------------- >> > > >> > > >> > >The definition of disallowed characters now leads us to an apparent >> conflict >> > >with step 2.1, which currently says to "convert the character to one >> or more >> > >octets using UTF-8". >> > > >> > >Unless I've misunderstood some subtlety in the definition of 'disallowed >> > >characters', all such characters will require at least two octets for >> their >> > >encoding so we reach issue 3: >> > > >> > > >> > >(Issue 3) In Step 2.1 none of the characters to be so processed can have >>just >> > >one octet in their UCS-8 encoding, so the instruction, strictly-speaking, >> > >cannot >> > >be obeyed. >> > >> > Well, you are right, except that strictly speaking, there is also >> > the following possibility, mentioned later in the spec: >> > >> > >>>> >> > Infrastructure accepting IRIs MAY also deal with the printable characters >> > in US-ASCII that are not allowed in URIs, namely "<", ">", '"', >> > Space, "{", "}", "|", "\", "^", and "`", in Step 2.2 above. >> > >>>> >> > >> > Except that it should say "Step 2", which I have fixed. >> > >> > >> > >> > >-------------- >> > > >> > > >> > >I also find the mixture of negatives and plurals in the introduction >> to step >>2 >> > >somewhat confusing, so I've taken the liberty of suggesting some >> re-drafts >> > >which >> > >addresses all three issues. >> > > >> > > >> > >One possible re-draft of the start of Step 2, which consolidates all the >>above >> > >points, is: >> > > >> > >Version 1: >> > > vvvvvvvvvvvvvvvv >> > > Step 2) >> > > IRI characters matching 'ucschar' or 'iprivate' (section 2.2) are >> > >disallowed in URI >> > > references. For each such character apply steps 2.1 through 2.3 >>below.. >> > > >> > > 2.1) Encode the disallowed character using UTF-8, which will >> > > generate a >> > >sequence >> > > of two or more octets. >> > > >> > > 2.2) Convert each octet to %HH ........ >> > >^^^^^^^^^^^^^^^^ >> > > >> > >An alternative re-draft is: >> > > >> > >Version 2: >> > >vvvvvvvvvvvvvvvv >> > > Step 2) >> > > IRI characters whose UCS-8 encodings emply two or more octets are >> > >disallowed in >> > > URI references. For each such character apply steps 2.1 >> through 2.3 >> > >below.. >> > > >> > > 2.1) Encode the disallowed character using UTF-8, which will >> > > generate a >> > >sequence >> > > of two or more octets. >> > > >> > > 2.2) Convert each octet to %HH ........ >> > >^^^^^^^^^^^^^^^^ >> > > >> > > >> > >yet a third, which restores the reasoning, is: >> > > >> > >Version 3: >> > >vvvvvvvvvvvvvvvv >> > > Step 2) >> > > IRI characters whose UCS-8 encodings emply two or more octets are >> > >disallowed in >> > > URI references because they are not US-ASCII characters. For >> each such >> > >character >> > > apply steps 2.1 through 2.3 below.. >> > > >> > > 2.1) Encode the disallowed character using UTF-8, which will >> > > generate a >> > >sequence >> > > of two or more octets. >> > > >> > > 2.2) Convert each octet to %HH ........ >> > >^^^^^^^^^^^^^^^^ >> > >> > I think we can do this even shorter and more clearly. >> > >> > I changed the introduction of step 2) to: >> > >> > >>>> >> > For each character in 'ucschar' or 'iprivate', apply >> > Steps 2.1 through 2.3 below. >> > >>>> >> > >> > I think that addresses your issues 1 and 2. >> > >> > >> > >Take your pick! >> > > >> > > >> > >--------------------------- >> > > >> > >One final general point: throughout the document I can see both >> 'ASCII' and >> > >'US-ASCII' in use. Should not a single designation be selected, and a >> > >normative >> > >reference supplied (such as that in RFC 2396 [ASCII] )? >> > >> > Good catch. I have added the reference, and went through the document >> > to change everything to US-ASCII, except for 'non-ASCII' (also used in >> > RFC 2396bis) and things like ToASCII,... I also changed a couple >> > occurrences of "US-ASCII range" to "US-ASCII repertoire" to allign >> > with terminology. I also cought one occurrence where US-ASCII is >> > alluded to as a script, which I fixed. >> > >> > >> > I hope this addresses your issues. Please confirm. >> > >> > Regards, Martin. >> > >> > >> > >> > >> >
Received on Friday, 28 May 2004 02:04:58 UTC