RE: reviewing draft-weber-iri-guidelines-00 from Phillips, Addison on 2011-07-05 (public-iri@w3.org from July 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Tue, 5 Jul 2011 13:33:55 -0700
To: Chris Weber <chris@lookout.net>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A94126890@EX-SEA31-D.ant.amazon.com>

Hi Chris,

Thank you for this document. I have a few comments, which follow:

1. Section 4, item 1. Unicode whitespace includes additional characters other than the ones listed here (or in draft-3987bis). I think the choice of characters here is deliberate, but it might be wise to say something about it. Perhaps a note that says: "Remove leading and trailing instances of ASCII whitespace..." and followed by "Note that other Unicode whitespace and control characters are not affected by this rule."
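
To illustrate the distinction in comment 1, here is a minimal Python sketch. The whitespace set (tab, LF, FF, CR, space) and the helper name `strip_ascii_whitespace` are assumptions for illustration only; the list in the draft is what governs.

```python
# Assumed "ASCII whitespace" set: tab, LF, FF, CR, space.
# The draft's actual list may differ; this is only an illustration.
ASCII_WHITESPACE = "\t\n\f\r "


def strip_ascii_whitespace(s: str) -> str:
    """Remove leading and trailing ASCII whitespace only (hypothetical helper)."""
    return s.strip(ASCII_WHITESPACE)


# U+00A0 (NO-BREAK SPACE) and U+3000 (IDEOGRAPHIC SPACE) are Unicode
# whitespace, but they are outside the ASCII set, so they survive.
sample = " \t\u00a0http://example.com/\u3000\n"
print(repr(strip_ascii_whitespace(sample)))

# By contrast, str.strip() with no argument removes all Unicode whitespace.
print(repr(sample.strip()))
```

The first call keeps U+00A0 and U+3000 in place, which is the behavior the suggested note would make explicit.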

2. Section 4, item 2. Replacing blocks of contiguous whitespace with a single %20 is imprecise (for the same reason as my first comment). Presumably multiple unquoted non-terminal whitespace characters in an IRI represent an error of some sort. But would this be a valid IRI: "http://example.com?value=%20%20foo%20%20bar"? (I have %20'd multiple whitespace items for visibility).
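
A rough sketch of what the rule in comment 2 appears to do, assuming the same ASCII whitespace set as above (tab, LF, FF, CR, space). The helper name is hypothetical and this is one reading of the draft, not its definitive behavior.

```python
import re

# Assumed ASCII whitespace set; other Unicode whitespace is not matched.
_ASCII_WS_RUN = re.compile(r"[\t\n\f\r ]+")


def collapse_ascii_whitespace(s: str) -> str:
    """Replace each run of literal ASCII whitespace with a single %20."""
    return _ASCII_WS_RUN.sub("%20", s)


# Two literal spaces become one %20, so the query value changes.
print(collapse_ascii_whitespace("http://example.com?value=  foo  bar"))

# A run of U+00A0 is not ASCII whitespace and is left untouched.
print(collapse_ascii_whitespace("http://example.com?value=\u00a0\u00a0foo"))
```

Under this reading, doubled spaces in a query value are silently reduced to one, while non-ASCII whitespace passes through unchanged.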

3. Section 4, item 3. Why UTF-8? Wouldn't a sequence of Unicode code points be preferable at this stage? UTF-8 is only necessary when converting to a URI. 
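
As a small illustration of comment 3, the sketch below keeps the IRI as a sequence of code points and brings in UTF-8 only at the point of conversion to a URI. The sample path is made up for the example.

```python
from urllib.parse import quote

# An IRI path held as a sequence of Unicode code points (a Python str).
iri_path = "/r\u00e9sum\u00e9"
print([f"U+{ord(c):04X}" for c in iri_path])

# UTF-8 enters only here: quote() encodes the str as UTF-8 by default
# and percent-encodes the resulting non-ASCII octets.
uri_path = quote(iri_path, safe="/")
print(uri_path)  # /r%C3%A9sum%C3%A9
```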

4. Section 4, item 4. "entity references" -> "entity reference". 

5. Section 4, item 4. What does "entity reference" mean here? I can't find it as a formally defined term in any of the IRI documents. I know what it means in e.g. an HTML context. Should I assume that it means "local transfer encodings", such as HTML entities in an HTML document? Or should I assume it means IRI's own percent-encoding?

6. Note that not every entity reference (assuming for a moment that we mean percent-encoding) can be so replaced. Perhaps: "Replace each entity that references a Unicode character with its corresponding character. Any remaining entities encode octets."
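
A minimal sketch of the wording suggested in comment 6, assuming "entity reference" means percent-encoding and that the octets are to be read as UTF-8. The helper name is hypothetical. For simplicity it treats each contiguous run of escapes as a unit, and it makes no attempt to protect reserved characters such as %2F.

```python
import re

# One or more consecutive %XX escapes.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_utf8_escapes(s: str) -> str:
    """Decode runs of %XX that form valid UTF-8; leave other runs as octets."""

    def repl(match: re.Match) -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not a Unicode character: keep the escapes, which encode octets.
            return match.group(0)

    return _PCT_RUN.sub(repl, s)


# %C3%A9 is valid UTF-8 for U+00E9; %FF%FE is not valid UTF-8.
print(decode_utf8_escapes("/caf%C3%A9/%FF%FE"))
```

Here `%C3%A9` is replaced by U+00E9, while `%FF%FE` stays percent-encoded because it does not decode to any Unicode character.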

7. Section 4, item 5. Is NFC desirable here? Do we need to consider path elements separately? Applying normalization blindly to the entire string risks altering information that may be desirable later. For example, it prevents including a denormalized query string, which may be generated by a user on purpose. The use of Unicode normalization might be better limited to:

- IRI elements, such as authority, that require it inherently (but then we don't need to specify it here?)
- comparison of path elements or IRIs for identity

There is considerable discussion at W3C right now about Unicode Normalization in document formats. My sense is that NFC will *not* be a requirement elsewhere in the Web ecosystem. Perhaps requiring it for IRI pre-processing is inconsistent? The real question is whether any later processing is harmed by not performing the normalization. None of the remaining IRI processing steps appears to be affected by whether NFC is applied. In fact, I think that denormalized strings should parse in a manner identical to normalized ones if possible.

NFC really only helps with identity/matching processing, as far as I can tell. I'm not saying it's not important. Only that it might be wise to limit its application.
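
To make the comparison-only idea concrete, here is a short sketch using Python's standard `unicodedata` module. The sample strings and the helper name `same_under_nfc` are illustrative assumptions.

```python
import unicodedata

# The same visible text in two forms: precomposed U+00E9 versus
# "e" followed by U+0301 (COMBINING ACUTE ACCENT).
composed = "http://example.com/?q=caf\u00e9"
decomposed = "http://example.com/?q=cafe\u0301"


def same_under_nfc(a: str, b: str) -> bool:
    """Compare two strings for identity after NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# As raw code point sequences, the two strings differ.
print(composed == decomposed)

# Normalizing only at comparison time still finds the match,
# while the stored string keeps the user's denormalized query.
print(same_under_nfc(composed, decomposed))
print(decomposed.endswith("e\u0301"))
```

Applying NFC only inside the comparison leaves the original string untouched, so a deliberately denormalized query survives pre-processing.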

Thanks,

Addison

Addison Phillips
Globalization Architect (Lab126)
Chair (W3C I18N WG)

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf
> Of Chris Weber
> Sent: Tuesday, July 05, 2011 12:37 PM
> To: PUBLIC-IRI@W3.ORG
> Subject: reviewing draft-weber-iri-guidelines-00
> 
> Hello all, I put out an early draft as an effort to address some of the topics
> mentioned in my message from
> <http://lists.w3.org/Archives/Public/public-iri/2011May/0036.html>.
> 
> The draft is available at
> <http://datatracker.ietf.org/doc/draft-weber-iri-guidelines/>
> 
> It's missing a lot and any feedback would be welcome.
> 
> Best regards,
> Chris
Received on Tuesday, 5 July 2011 20:34:22 UTC