Re: prohibited code points and error handling in Chrome and MSIE from Chris Weber on 2011-07-12 (public-iri@w3.org from July 2011)

From: Chris Weber <chris@lookout.net>
Date: Mon, 11 Jul 2011 17:08:22 -0700
To: public-iri@w3.org
Message-ID: <4E1B9076.10900@lookout.net>
On 7/10/2011 10:18 AM, Bjoern Hoehrmann wrote:
> If you go from some HTML snippet to "the DOM" there may be no resource
> identifier parsing going on at all. You don't say which doctype you are
> using or which character encoding, IDNs are very different from paths,
> and there is a whole sea of sadness with browser technology and Unicode
> surrogates. If we want to talk about IRI parsing, we'd have to agree on
> what the input to "IRI parsing" is, and what the output can be. I'd say
> the input is a sequence of Unicode scalar values at the least.

Thanks for showing me how my message could be ambiguous with regard to 
the test case setup and how I interpreted results.  It has been my hope 
that we might discuss a general approach to testing IRI parsing that 
would be acceptable and understood by the WG so that we could discuss 
testing and results without the ambiguity.

To answer your question, I was testing Web browsers in Quirks mode with 
a complete HTML page where the fragment was in the <body> and the 
charset was specified as UTF-8 by the HTTP header Content-Type: 
text/html; charset=utf-8. I have re-run the tests using an HTML 4.01 
strict DOCTYPE and found the same results.

Maybe it would be more helpful to release my test harness, but it's not 
very portable at the moment: I use a combination of Web, database, and 
DNS servers to generate test HTML pages and collect results.  It's 
focused on generating HTML for Web browser testing but could be 
repurposed for testing servers and for general API testing as well (e.g. 
.NET's System.Uri), although wrappers would be required for each case.
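
To illustrate what such a per-API wrapper might look like, here is a
minimal sketch for Python's urllib.parse.  It is not part of my harness;
the function name decompose(), the field names (borrowed from the DOM
anchor properties), and the sample URL are all assumptions made for
illustration only.

```python
from urllib.parse import urlsplit


def decompose(url: str) -> dict:
    """Split a URL into fields named after the DOM anchor properties.

    Hypothetical wrapper: the field names mirror .protocol, .hostname,
    .pathname and .search so results could sit in the same table as
    browser results. It reports what urllib.parse does, nothing more.
    """
    parts = urlsplit(url)
    return {
        # The DOM reports the scheme with a trailing colon ("http:").
        "protocol": (parts.scheme + ":") if parts.scheme else "",
        "hostname": parts.hostname or "",
        "pathname": parts.path,
        # The DOM reports the query with its leading "?".
        "search": ("?" + parts.query) if parts.query else "",
    }


# Assumed sample input: U+FDD0 placed in the path, followed by a query.
row = decompose("http://example.com/\ufdd0foo?zyx")
print(row)
```

A wrapper like this only normalizes the shape of the output; each API
would still need its own wrapper, and any differences in the values are
the test results themselves.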

>
> Output is more complicated, but it would probably revolve around things
> that are defined in RFC 3986 or RFC 3987, or in a group draft, e.g., we
> can talk about whether some such sequence is a relative or absolute re-
> ference (or should be handled as such when the input is malformed).
>
> That is, we would be talking about things that we control or are direct-
> ly affected by. We do not control and are not affected by how HTML is
> parsed or how HTML APIs work. It would be nice to minimize differences
> and surprises, but ultimately we know that the results with some piece
> of HTML markup may not be the results for XMLHttpRequest interactions or
> any number of other operations in browsers and elsewhere.

I agree and would like to document the minimum set of interfaces that 
would be interesting or valuable to test as part of a test plan.

>
>> U+FDD0 is prohibited under IDNA2003's nameprep step, and disallowed by
>> IDNA2008. The results below are from the DOM parsing.
>>
>> Scheme Hostname Path Query Browser
>> : Chrome/12.0
>> http: example.com /%EF%B7%90foo Opera/9.80
>> http: example.com ?zyx MSIE 7.0
>> http: example.com ?zyx MSIE 8.0
>> http: example.com /%EF%B7%90foo Firefox/4.0.1
>> http: example.com /?foo Safari/5.0.5
>
> I do not know how you derived these results, but it does not make much
> sense to have scheme names with colons in them under the definition of
> the term in RFC 3986 (percent-encoding is not allowed, so you cannot
> serialize this). Your intent is clear, but we need to keep the layering
> and the terminology straight and intact.

I'm collecting DOM information the same way WebKit does at 
http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/ and the same 
way Julian does at http://greenbytes.de/tech/webdav/urldecomp.xml.

My test harness collects the DOM properties for the HTML anchor element 
using JavaScript.  As such, the "http:" result is the value of the 
.protocol property in the DOM.  I should probably name the table columns 
the same as the DOM property names to be consistent.

>
>> The raw HTTP request results for the <img> are as follows. The only
>> exception was that Chrome did not make the request for the <img>.
>>
>> Path Browser
>> /%EF%B7%90foo Opera/9.80
>> /?foo MSIE 7.0
>> /?foo MSIE 8.0
>> /%EF%B7%90foo Firefox/4.0.1
>> /%EF%B7%90foo Safari/5.0.5
>>
>> Although Chrome did not make a request for the <img>, the <a> link is
>> still clickable and resolves to the percent-encoded Unicode replacement
>> character U+FFFD in the path "/%EF%BF%BDfoo".
>
> So browsers are internally inconsistent and disagree with each other.
> As far as the existing IRI specification goes, I am not aware it says
> to replace certain characters with replacement characters, so if there
> is an issue with the specific character, they should either all keep it
> intact, or all refuse to resolve it if they want to behave consistently
> and sensibly. Neither would require changes to the specification if it
> does indeed not suggest to replace characters (refusing to dereference
> IRIs is always an option for security and other reasons).
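
For anyone wanting to check the byte sequences quoted above, the
following sketch shows how the two percent-encoded forms arise from
UTF-8.  The helper names is_noncharacter() and percent_encode_path() are
made up for this example, and the sample path "/<U+FDD0>foo" is an
assumption rather than the exact test input.

```python
from urllib.parse import quote


def is_noncharacter(cp: int) -> bool:
    """True for Unicode noncharacters: U+FDD0..U+FDEF and any code
    point whose low 16 bits are FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def percent_encode_path(path: str) -> str:
    """Percent-encode a path as UTF-8, leaving "/" untouched."""
    return quote(path, safe="/")


# Keeping U+FDD0 intact, as Opera and Firefox appear to do:
kept = percent_encode_path("/\ufdd0foo")

# Substituting U+FFFD for the noncharacter, as Chrome appears to do
# when the <a> link is followed:
replaced = percent_encode_path(
    "".join("\ufffd" if is_noncharacter(ord(c)) else c for c in "/\ufdd0foo")
)

print(kept, replaced)
```

This only reproduces the encodings seen in the results; it says nothing
about which behavior a specification should require.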

Does this WG think it's worthwhile to build a test plan?  Examining a 
set of URI/IRI schemes is in the WG charter, which I take to mean 
observation through testing.  A rough document I started is up at: 
https://spreadsheets.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=7

I'm happy to take on the effort here, but only if the results are deemed 
worthwhile by the group.  In that case I would also need feedback and 
guidance on three points: goals (what questions do we want to answer?), 
targets (what applications and interfaces should be tested?), and method 
(how should we go about testing to collect results?).

Best regards,
Chris
Received on Tuesday, 12 July 2011 00:09:00 UTC