Re: prohibited code points and error handling in Chrome and MSIE from Bjoern Hoehrmann on 2011-07-10 (public-iri@w3.org from July 2011)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sun, 10 Jul 2011 19:18:41 +0200
To: Chris Weber <chris@lookout.net>
Cc: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <apkj179s1njvoe03jbr5nse1nh9veuq8g5@hive.bjoern.hoehrmann.de>

* Chris Weber wrote:
>I'm curious about a test case that caught my attention:
>
>(<a href='http://example.com/&#xfdd0;foo' id='302'>302</a><img 
>src='http://example.com/&#xfdd0;foo' />)
>
>For Chrome - do you know if this result is the way an IRI parsing should 
>get represented in the DOM? This seems to be the same result in other 
>test cases such as <http://&#xD87E;&#xDC68;.com> as well. But it also 
>happens with URI cases as well <http://[::eeee:192.168.0.1]/>

I am not sure how you arrive at your question. Above you have some
markup; assuming this is HTML, the HTML specification would have to
define how you get from the markup to strings that are resource
identifiers. There might be some API that represents them, which would
have to map the resource identifier strings onto whatever features the
API has, and how that works depends on the API specification.

If you go from some HTML snippet to "the DOM", there may be no resource
identifier parsing going on at all. You don't say which doctype or which
character encoding you are using; IDNs are very different from paths;
and there is a whole sea of sadness with browser technology and Unicode
surrogates. If we want to talk about IRI parsing, we have to agree on
what the input to "IRI parsing" is and what the output can be. I'd say
the input is, at the least, a sequence of Unicode scalar values.
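
To make "sequence of Unicode scalar values" concrete, here is a rough
Python sketch. The function names are made up for illustration and the
check is only an assumption about what a parser's input validation
might look like; it is not taken from any specification text. It flags
surrogates (which are not scalar values) and noncharacters such as
U+FDD0 separately, since they are different kinds of problem.

```python
def is_scalar_value(cp: int) -> bool:
    """True for U+0000..U+10FFFF, excluding the surrogates U+D800..U+DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (
        0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
    )


def check_input(text: str) -> list[str]:
    """Describe every code point that is a surrogate or a noncharacter."""
    problems = []
    for index, char in enumerate(text):
        cp = ord(char)
        if not is_scalar_value(cp):
            problems.append(f"index {index}: U+{cp:04X} is a surrogate")
        elif is_noncharacter(cp):
            problems.append(f"index {index}: U+{cp:04X} is a noncharacter")
    return problems


# Hypothetical inputs modelled on the two test cases quoted above.
print(check_input("http://example.com/\ufdd0foo"))
print(check_input("http://\ud87e\udc68.com"))
```

Note that Python keeps the two escapes in the second string as separate
lone surrogates, which is roughly the situation a browser is in when
the two halves arrive as separate character references.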

Output is more complicated, but it would probably revolve around things
that are defined in RFC 3986 or RFC 3987, or in a group draft; e.g., we
can talk about whether some such sequence is a relative or an absolute
reference (or should be handled as such when the input is malformed).
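
As an illustration of what such output could look like, the sketch
below splits a reference with the regular expression from Appendix B of
RFC 3986 and reports whether a scheme is present. The names Components,
split_reference and is_relative_reference are my own, and the
expression only separates components; it does not validate them, so
this is a starting point for discussion and not a conforming parser.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B. Every part is optional,
# so it matches any string and merely separates the components.
_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class Components(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_reference(reference: str) -> Components:
    """Split a reference into the five generic RFC 3986 components."""
    match = _REFERENCE.match(reference)
    assert match is not None  # the pattern cannot fail to match
    return Components(
        match.group(2), match.group(4), match.group(5),
        match.group(7), match.group(9),
    )


def is_relative_reference(reference: str) -> bool:
    """A reference without a scheme component is a relative reference."""
    return split_reference(reference).scheme is None


print(split_reference("http://example.com/%EF%B7%90foo"))
print(split_reference("/?foo"), is_relative_reference("/?foo"))
```

The scheme comes out without the trailing colon, because the colon is a
delimiter and not part of the component.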

That is, we would be talking about things that we control or are direct-
ly affected by. We do not control and are not affected by how HTML is
parsed or how HTML APIs work. It would be nice to minimize differences
and surprises, but ultimately we know that the results with some piece
of HTML markup may not be the results for XMLHttpRequest interactions or
any number of other operations in browsers and elsewhere.

>U+FDD0 is prohibited under IDNA2003's nameprep step, and disallowed by 
>IDNA2008. The results below are from the DOM parsing.
>
>Scheme Hostname Path Query Browser
>: Chrome/12.0
>http: example.com /%EF%B7%90foo Opera/9.80
>http: example.com ?zyx MSIE 7.0
>http: example.com ?zyx MSIE 8.0
>http: example.com /%EF%B7%90foo Firefox/4.0.1
>http: example.com /?foo Safari/5.0.5

I do not know how you derived these results, but under the definition
of the term in RFC 3986 it does not make much sense to have scheme
names with colons in them (percent-encoding is not allowed in scheme
names, so you cannot serialize such a name). Your intent is clear, but
we need to keep the layering and the terminology straight and intact.
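
For reference, the scheme production in RFC 3986 is a letter followed
by letters, digits, "+", "-" or ".". A minimal Python check, with
is_valid_scheme being a name I made up for this example, shows why
"http:" and ":" cannot be scheme names:

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, 3.1)
_SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def is_valid_scheme(name: str) -> bool:
    """True if the whole string matches the RFC 3986 scheme production."""
    return _SCHEME.fullmatch(name) is not None


for candidate in ("http", "http:", ":", ""):
    print(repr(candidate), is_valid_scheme(candidate))
```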

>The raw HTTP request results for the <img> are as follows. The only 
>exception was that Chrome did not make the request for the <img>.
>
>Path             Browser
>/%EF%B7%90foo    Opera/9.80
>/?foo            MSIE 7.0
>/?foo            MSIE 8.0
>/%EF%B7%90foo    Firefox/4.0.1
>/%EF%B7%90foo    Safari/5.0.5
>
>Although Chrome did not make a request for the <img>, the <a> link is 
>still clickable and resolves to the percent-encoded Unicode replacement 
>character U+FFFD in the path "/%EF%BF%BDfoo".

So browsers are internally inconsistent and disagree with each other.
As far as the existing IRI specification goes, I am not aware that it
says to replace certain characters with replacement characters, so if
there is an issue with the specific character, they should either all
keep it intact or all refuse to resolve it, if they want to behave
consistently and sensibly. Neither would require changes to the
specification if it indeed does not suggest replacing characters
(refusing to dereference IRIs is always an option, for security and
other reasons).
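
The two consistent behaviours can be sketched in Python as follows.
This assumes an unescaped path as input and uses urllib.parse.quote for
the UTF-8 percent-encoding; keep_intact and refuse_noncharacters are
hypothetical names, and neither function is meant as a description of
what any browser actually implements.

```python
from urllib.parse import quote


def keep_intact(path: str) -> str:
    """Percent-encode the UTF-8 bytes of each character, dropping none."""
    return quote(path, safe="/")


def refuse_noncharacters(path: str) -> str:
    """Raise ValueError if the path contains a Unicode noncharacter."""
    for char in path:
        cp = ord(char)
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            raise ValueError(f"refusing to resolve: U+{cp:04X}")
    return keep_intact(path)


# Keeping U+FDD0 gives the path seen from Opera, Firefox and Safari.
print(keep_intact("/\ufdd0foo"))

# For comparison, this is what results if U+FFFD is substituted first,
# as in the Chrome link described above.
print(keep_intact("/\ufffdfoo"))

try:
    refuse_noncharacters("/\ufdd0foo")
except ValueError as error:
    print(error)
```

Either function gives the same answer every time for the same input;
substituting U+FFFD is a third behaviour that, as far as I can tell,
would have to be specified somewhere.
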
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Received on Sunday, 10 July 2011 17:19:06 UTC