- From: Chris Haynes <chris@harvington.org.uk>
- Date: Tue, 11 May 2004 10:37:52 +0100
- To: <public-iri@w3.org>
Here are three related issues regarding draft 7, Section 3.1, Step 2.

I have a concern with the sentence "The disallowed characters consist of all non-ASCII characters allowed in IRIs."

(Issue 1) Since this step refers to (presumably legal) IRIs, the phrase "allowed in IRIs" is superfluous: there could be no others.

--------------

(Issue 2) Is the phrase "non-ASCII characters" sufficiently precise / normative?

I think there is a much cleaner definition available, provided you don't mind dropping the allusion to the reasoning:

"The disallowed characters consist of all those matching 'ucschar' or 'iprivate' of Section 2.2"

Alternatively, you could say something like:

"The disallowed characters consist of all those whose UTF-8 encodings employ two or more octets"

(which is more to the point and all-embracing).

--------------

The definition of disallowed characters now leads us to an apparent conflict with step 2.1, which currently says to "convert the character to one or more octets using UTF-8".

Unless I've misunderstood some subtlety in the definition of 'disallowed characters', all such characters will require at least two octets for their encoding, so we reach issue 3:

(Issue 3) In Step 2.1 none of the characters to be so processed can have just one octet in their UTF-8 encoding, so the instruction, strictly speaking, cannot be obeyed.

--------------

I also find the mixture of negatives and plurals in the introduction to step 2 somewhat confusing, so I've taken the liberty of suggesting some re-drafts which address all three issues.

One possible re-draft of the start of Step 2, which consolidates all the above points, is:

Version 1:
vvvvvvvvvvvvvvvv
Step 2) IRI characters matching 'ucschar' or 'iprivate' (section 2.2) are disallowed in URI references. For each such character apply steps 2.1 through 2.3 below.

2.1) Encode the disallowed character using UTF-8, which will generate a sequence of two or more octets.

2.2) Convert each octet to %HH ........
^^^^^^^^^^^^^^^^

An alternative re-draft is:

Version 2:
vvvvvvvvvvvvvvvv
Step 2) IRI characters whose UTF-8 encodings employ two or more octets are disallowed in URI references. For each such character apply steps 2.1 through 2.3 below.

2.1) Encode the disallowed character using UTF-8, which will generate a sequence of two or more octets.

2.2) Convert each octet to %HH ........
^^^^^^^^^^^^^^^^

Yet a third, which restores the reasoning, is:

Version 3:
vvvvvvvvvvvvvvvv
Step 2) IRI characters whose UTF-8 encodings employ two or more octets are disallowed in URI references because they are not US-ASCII characters. For each such character apply steps 2.1 through 2.3 below.

2.1) Encode the disallowed character using UTF-8, which will generate a sequence of two or more octets.

2.2) Convert each octet to %HH ........
^^^^^^^^^^^^^^^^

Take your pick!

---------------------------

One final general point: throughout the document I can see both 'ASCII' and 'US-ASCII' in use. Should not a single designation be selected, and a normative reference supplied (such as that in RFC 2396 [ASCII])?

HTH

Chris Haynes
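
For illustration only, here is a minimal Python sketch of the step as re-drafted above. It assumes that "disallowed" simply means "outside US-ASCII" (code point 0x80 or higher), which is the reading behind Versions 2 and 3; it does not check the 'ucschar' or 'iprivate' productions of Version 1. The function name percent_encode_non_ascii is hypothetical and does not come from the draft.

```python
def percent_encode_non_ascii(iri: str) -> str:
    """Sketch: percent-encode every character outside US-ASCII.

    Assumption: a character is 'disallowed' exactly when its code point
    is 0x80 or above, i.e. when its UTF-8 encoding needs two or more octets.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # US-ASCII: a single UTF-8 octet, left untouched by this step.
            out.append(ch)
        else:
            # Step 2.1: encode with UTF-8 (always two or more octets here).
            octets = ch.encode("utf-8")
            assert len(octets) >= 2
            # Step 2.2: write each octet as %HH using uppercase hex digits.
            out.append("".join(f"%{octet:02X}" for octet in octets))
    return "".join(out)


if __name__ == "__main__":
    print(percent_encode_non_ascii("http://example.org/r\u00e9sum\u00e9"))
```

Run as a script, this prints http://example.org/r%C3%A9sum%C3%A9: each "é" (U+00E9) becomes the two octets C3 A9. The internal assertion reflects the point of Issue 3: under this reading, no character reaching step 2.1 ever encodes to a single octet.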
Received on Tuesday, 11 May 2004 05:41:11 UTC