RE: reviewing draft-weber-iri-guidelines-00 from Phillips, Addison on 2011-07-06 (public-iri@w3.org from July 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Wed, 6 Jul 2011 14:53:01 -0700
To: Chris Weber <chris@lookout.net>
CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A941CA7AF@EX-SEA31-D.ant.amazon.com>

Hi Chris,

> > 2. Section 4, item 2. Replacing blocks of contiguous whitespace with a
> > single %20 is imprecise (for the same reason as my first comment).
> > Presumably multiple unquoted non-terminal whitespace characters in an
> > IRI represent an error of some sort. But would this be a valid IRI:
> > "http://example.com?value=%20%20foo%20%20bar"? (I have %20'd multiple
> > whitespace items for visibility).
> 
> With a literal "SPACE" in place of each "%20" this does appear to be a valid URI
> in all browsers, all of which percent-encode each literal space in the HTTP
> request.  The DOM parsing mostly matches except for MSIE which does not
> percent-encode any spaces.

Right, but the instruction in your document is to replace multiple whitespace with a single whitespace.
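
To make the difference concrete, here is a minimal Python sketch. It assumes the draft's rule means "collapse each run of whitespace to one %20" (my reading of Section 4, item 2), and it approximates the browser behavior described above with a plain string replacement. The variable names are illustrative only.

```python
import re

# The example IRI from my comment, with literal spaces instead of %20.
raw = "http://example.com?value=  foo  bar"

# Behavior reported for browsers: each literal space is percent-encoded.
each_encoded = raw.replace(" ", "%20")

# Assumed reading of the draft's rule: each run of whitespace becomes one %20.
collapsed = re.sub(r"\s+", "%20", raw)

print(each_encoded)
print(collapsed)
```

The two results carry different query values, which is why the wording of the rule matters.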

> 
> Indeed, I agree, and this also agrees with Section 3.1 of 3987
> http://trac.tools.ietf.org/html/draft-ietf-iri-3987bis-05#section-3.1
> which allows for any Unicode encoding, such as UTF-8 or UTF-16 but isn't picky
> about which.  Are you suggesting that UTF-16 be applied at this stage?

No, I'm suggesting *not to specify* a particular Unicode character encoding because it is unnecessary. This is, for example, the approach that XML takes. If the IRI is treated as a sequence of Unicode code points (logical characters), it can actually be encoded and represented using the most appropriate Unicode character encoding (UTF-8, UTF-16, UTF-32, etc.) for the given environment. A particular implementation can then choose which encoding to use to achieve this. For example, UTF-16 makes the most sense for ECMAScript or possibly in the DOM, whereas a given browser might be processing in terms of UTF-8 internally.
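
As a rough illustration (a sketch only; the example IRI is made up), the same sequence of code points can be serialized in any Unicode encoding and recovered unchanged:

```python
# A hypothetical IRI containing two non-ASCII characters (U+00E9).
iri = "http://example.com/r\u00e9sum\u00e9"

# The logical view: a sequence of Unicode code points.
code_points = [ord(ch) for ch in iri]

# Three different serializations of that same sequence.
utf8 = iri.encode("utf-8")
utf16 = iri.encode("utf-16-le")
utf32 = iri.encode("utf-32-le")

# The byte lengths differ, but each one decodes to the identical IRI.
print(len(code_points), len(utf8), len(utf16), len(utf32))
print(utf8.decode("utf-8") == utf16.decode("utf-16-le") == utf32.decode("utf-32-le") == iri)
```

The choice of encoding changes the bytes, not the identifier.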

> 
> Very true, applying NFC here could be detrimental.  And as my testing shows,
> some browsers seem to be applying NFC only in specific elements such as how
> Chrome treats the fragment.  Although Safari seems to apply NFC to the path,
> query, and fragment.  I'm not sure if it's handling those individually or treating
> everything after authority as an opaque string.  Probably safest to assume the
> former.  Test results are up here:
> 
> https://spreadsheets.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=5

Right. This is great stuff. You might want to post what your tests are too, just for completeness.

> 
> 
> So limit the application of NFC to the comparison of identifiers or their parts?
> Are you saying that even during initial creation IRIs should not be normalized
> with NFC?

No. I'm saying that NFC is a "good idea" when a content creator is making an IRI... but that the spec may not *enforce* it and implementations should not assume it. Two IRIs that are canonically equivalent (under Unicode normalization) may still not be equivalent (under IRI rules). But that doesn't mean that it isn't a good idea to use normalized values to form the path in e.g. a RESTful request.

I'm not saying that NFC is not part of the IRI picture, only that, if Web content is not itself normalized, it becomes more important to allow it to be transmitted in a de-normalized form as fragids, path components, query parameters, et al.
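
A small Python sketch of the point about canonical equivalence, using a made-up example IRI and the standard `unicodedata` module. It assumes IRI comparison is a plain code point comparison, which is a simplification:

```python
import unicodedata

# Two hypothetical IRIs that differ only in normalization form.
composed = "http://example.com/caf\u00e9"     # precomposed U+00E9
decomposed = "http://example.com/cafe\u0301"  # "e" followed by U+0301

# Compared code point by code point, they are different identifiers.
same_as_written = composed == decomposed

# After NFC, they are the same string, i.e. canonically equivalent.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

print(same_as_written, same_after_nfc)
```

An implementation that silently applies NFC would turn the second IRI into the first, which may not be what a de-normalized resource expects.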

Best Regards,

Addison

Received on Wednesday, 6 July 2011 21:53:37 UTC