Re: reviewing draft-weber-iri-guidelines-00 from Chris Weber on 2011-07-06 (public-iri@w3.org from July 2011)

From: Chris Weber <chris@lookout.net>
Date: Wed, 06 Jul 2011 11:51:51 -0700
To: "Phillips, Addison" <addison@lab126.com>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4E14AEC7.50109@lookout.net>

On 7/5/2011 1:33 PM, Phillips, Addison wrote:
> Hi Chris,
>
> Thank you for this document. I have a few comments, which follow:
>

Thank you for the feedback, Addison.

> 1. Section 4, item 1. Unicode whitespace includes additional
> characters other than the ones listed here (or in draft-3987bis). I
> think the choice of characters here is deliberate, but it might be
> wise to say something about it. Perhaps a note that says: "Remove
> leading and trailing instances of ASCII whitespace..." and followed
> by "Note that other Unicode whitespace and control characters are not
> affected by this rule."

Okay.
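
A minimal sketch of the distinction, in Python. The set of characters
treated as "ASCII whitespace" below (TAB, LF, FF, CR, SPACE) is an
assumption for illustration; the list in the draft is authoritative.
The function name is hypothetical.

```python
# Assumed set of ASCII whitespace: TAB, LF, FF, CR, SPACE.
# Adjust to match the characters actually listed in the draft.
ASCII_WHITESPACE = "\t\n\x0c\r "


def strip_ascii_whitespace(iri: str) -> str:
    """Remove leading and trailing ASCII whitespace only."""
    return iri.strip(ASCII_WHITESPACE)


raw = " \thttp://example.com/\u3000 \n"

# U+3000 IDEOGRAPHIC SPACE is Unicode whitespace but not ASCII
# whitespace, so it survives and also shields nothing to its left.
ascii_only = strip_ascii_whitespace(raw)

# For comparison, str.strip() with no argument removes all Unicode
# whitespace, including U+3000.
unicode_aware = raw.strip()
```

Here ascii_only keeps the trailing U+3000 while unicode_aware does not,
which is the difference the proposed note would call out.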

> 2. Section 4, item 2. Replacing blocks of contiguous whitespace with
> a single %20 is imprecise (for the same reason as my first comment).
> Presumably multiple unquoted non-terminal whitespace characters in an
> IRI represent an error of some sort. But would this be a valid IRI:
> "http://example.com?value=%20%20foo%20%20bar"? (I have %20'd multiple
> whitespace items for visibility).

With a literal SPACE in place of each "%20", this does appear to be
accepted as a valid URI in all browsers, all of which percent-encode
each literal space in the HTTP request.  The DOM parsing mostly matches,
except for MSIE, which does not percent-encode any spaces.

So it seems the guidance here would be to percent-encode each occurrence
of a SPACE character.  Would you agree?

In contrast, all browsers seem to remove the control characters:

http://www.example.com/foo/bar/&#x0009;:foo.com&#x000A;

Becomes:

http://www.example.com/foo/bar/:foo.com
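
The two behaviors can be sketched together in Python. This is only an
illustration of the observed results, not a description of how any
browser implements them. The set of removed control characters is an
assumption: the example above shows TAB and LF, and CR is added here by
analogy. The function name is hypothetical.

```python
import re

# Assumption: TAB, LF and CR are removed wherever they occur.
# Only TAB and LF appear in the example above.
REMOVED_CONTROLS = re.compile(r"[\t\n\r]")


def preprocess_spaces_and_controls(iri: str) -> str:
    """Drop the assumed control characters, then percent-encode
    every literal SPACE individually (no collapsing of runs)."""
    without_controls = REMOVED_CONTROLS.sub("", iri)
    return without_controls.replace(" ", "%20")


spaces = preprocess_spaces_and_controls(
    "http://example.com?value=  foo  bar"
)
controls = preprocess_spaces_and_controls(
    "http://www.example.com/foo/bar/\t:foo.com\n"
)
```

Each space becomes its own "%20", so the two-space runs in the first
input are preserved rather than collapsed into one.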


> 3. Section 4, item 3. Why UTF-8? Wouldn't a sequence of Unicode code
> points be preferable at this stage? UTF-8 is only necessary when
> converting to a URI.

Indeed, I agree, and this is also consistent with Section 3.1 of 3987bis
http://trac.tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.1
which allows for any Unicode encoding, such as UTF-8 or UTF-16, without
requiring a particular one.  Are you suggesting that UTF-16 be applied
at this stage?
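
To illustrate the point about deferring UTF-8, here is a simplified
Python sketch in which the IRI is held as a sequence of code points (a
Python str) and UTF-8 is only applied when mapping to a URI. It is an
assumption-laden simplification: it percent-encodes every non-ASCII
character and ignores the separate handling the authority would need.
The function name is hypothetical.

```python
def iri_to_uri_simplified(iri: str) -> str:
    """Percent-encode non-ASCII code points as UTF-8 octets.

    Simplification: ASCII is passed through untouched and the
    authority component gets no special treatment.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # UTF-8 enters the picture only here, at conversion time.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


iri = "http://example.com/r\u00e9sum\u00e9"
code_points = [f"U+{ord(ch):04X}" for ch in iri[-6:]]
uri = iri_to_uri_simplified(iri)
```

Up to the conversion call, nothing in the processing depends on which
Unicode encoding the IRI arrived in.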

>
> 4. Section 4, item 4. "entity references" ->  "entity reference".
>
> 5. Section 4, item 4. What does "entity reference" mean here? I can't
> find it as a formally defined term in any of the IRI documents. I
> know what it means in e.g. an HTML context. Should I assume that it
> means "local transfer encodings", such as HTML entities in an HTML
> document? Or should I assume it means IRI's own percent-encoding?
>
Yes, I was thinking of HTML and XML entities: HTML/XML numeric character
references such as "&#x0041;", as well as percent-encodings and other
high-level escapings.
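
For the HTML case, a short Python sketch of what replacing character
references could look like, using the standard library's html.unescape.
Treating this as a step separate from percent-decoding is an assumption
made here for clarity, and the function name is hypothetical.

```python
import html


def decode_character_references(text: str) -> str:
    """Replace HTML numeric and named character references with the
    characters they denote. Percent-encodings are left untouched."""
    return html.unescape(text)


decoded = decode_character_references(
    "http://example.com/&#x0041;?q=&eacute;%20x"
)
```

The "%20" is left as-is because it belongs to the IRI's own escaping
layer rather than to the host document format.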

> 6. Note that not every entity reference (assuming for a moment that
> we mean percent-encoding) can be so replaced? Perhaps: "Replace each
> entity that references a Unicode character with its corresponding
> character. Any remaining entities encode octets."
>

I saw from your other email that you were thinking the suggestion here 
would be to "unescape iunreserved characters".
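
One possible reading of "unescape iunreserved characters", sketched in
Python. The assumptions are significant: the test for non-ASCII
characters (anything at or above U+00A0) only approximates the ucschar
ranges, a run of escapes that is not valid UTF-8 as a whole is left
entirely alone, and escapes that are kept have their hex digits
uppercased. The function name is hypothetical.

```python
import re

UNRESERVED_ASCII = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match: re.Match) -> str:
    run = match.group(0)
    octets = bytes.fromhex(run.replace("%", ""))
    try:
        text = octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8: these escapes encode octets, so keep them.
        return run
    out = []
    for ch in text:
        # Approximation of iunreserved: ASCII unreserved, or any
        # code point >= U+00A0 (the real ucschar ranges are narrower).
        if ch in UNRESERVED_ASCII or ord(ch) >= 0xA0:
            out.append(ch)
        else:
            out.extend(f"%{o:02X}" for o in ch.encode("utf-8"))
    return "".join(out)


def unescape_iunreserved(iri: str) -> str:
    """Decode only escapes that map to (approximately) iunreserved
    characters; leave all other percent-encodings in place."""
    return PCT_RUN.sub(_decode_run, iri)
```

With this sketch, "%41%2F%7E" yields "A%2F~": the letter and tilde are
unescaped while the reserved "/" stays percent-encoded.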

> 7. Section 4, item 5. Is NFC desirable here? Do we need to consider
> path elements separately? Applying normalization blindly to the
> entire string risks altering information that may be desirable later.
> For example, it prevents including a denormalized query string, which
> may be generated by a user on purpose. The use of Unicode
> normalization might be better limited to:
>
> - IRI elements, such as authority, that require it inherently (but
> then we don't need to specify it here?) - comparison of path elements
> or IRIs for identity

Very true, applying NFC here could be detrimental.  And as my testing
shows, some browsers seem to apply NFC only to specific components;
Chrome, for example, applies it to the fragment, whereas Safari seems to
apply NFC to the path, query, and fragment.  I'm not sure if Safari is
handling those individually or treating everything after the authority
as an opaque string.  Probably safest to assume the former.  Test
results are up here:

https://spreadsheets.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5
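
Per-component normalization of the kind described above can be sketched
in Python with urllib.parse and unicodedata. This only illustrates the
"handle components individually" interpretation; it is not a claim about
how Chrome or Safari implement it. The function name and its default are
hypothetical.

```python
import unicodedata
from urllib.parse import urlsplit


def nfc_components(iri: str, components=("fragment",)) -> str:
    """Apply NFC only to the named components of the IRI
    (any of: 'path', 'query', 'fragment')."""
    parts = urlsplit(iri)
    updates = {
        name: unicodedata.normalize("NFC", getattr(parts, name))
        for name in components
    }
    return parts._replace(**updates).geturl()


# Each component contains a base letter plus U+0301 COMBINING ACUTE.
iri = "http://example.com/a\u0301?e\u0301#c\u0301"

fragment_only = nfc_components(iri)
after_authority = nfc_components(iri, ("path", "query", "fragment"))
```

In fragment_only the path and query keep their decomposed sequences,
while after_authority composes all three.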


> There is considerable discussion at W3C right now about Unicode
> Normalization in document formats. My sense is that NFC will *not* be
> a requirement elsewhere in the Web ecosystem. Perhaps requiring it
> for IRI pre-processing is inconsistent? The real question is whether
> any later processing is harmed by not performing the normalization.
> None of the remaining IRI processing steps appear to be affected by
> applying (or not) NFC---in fact I think that denormalized strings
> should parse in a manner identical to normalized ones if possible.
>
> NFC really only helps with identity/matching processing, as far as I
> can tell. I'm not saying it's not important. Only that it might be
> wise to limit its application.

So limit the application of NFC to the comparison of identifiers or 
their parts?  Are you saying that even during initial creation IRIs 
should not be normalized with NFC?
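
Limiting NFC to comparison might look like the following Python sketch,
in which the stored IRIs are never modified and normalization happens
only inside the equivalence check. The function name is hypothetical.

```python
import unicodedata


def iri_equivalent_under_nfc(a: str, b: str) -> bool:
    """Compare two IRIs after NFC, without altering either input."""
    return (
        unicodedata.normalize("NFC", a)
        == unicodedata.normalize("NFC", b)
    )


composed = "http://example.com/?q=caf\u00e9"      # precomposed e-acute
decomposed = "http://example.com/?q=cafe\u0301"   # e + combining acute

same_code_points = composed == decomposed
same_under_nfc = iri_equivalent_under_nfc(composed, decomposed)
```

The two strings differ code point for code point but compare equal
under NFC, and the denormalized query string is still available to any
later processing that wants it.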

>
> Thanks,
>
> Addison

Best regards,
Chris
Received on Wednesday, 6 July 2011 18:52:27 UTC