- From: Chris Weber <chris@lookout.net>
- Date: Mon, 11 Jul 2011 17:08:22 -0700
- To: public-iri@w3.org
On 7/10/2011 10:18 AM, Bjoern Hoehrmann wrote:
> If you go from some HTML snippet to "the DOM" there may be no resource
> identifier parsing going on at all. You don't say which doctype you are
> using or which character encoding, IDNs are very different from paths,
> and there is a whole sea of sadness with browser technology and Unicode
> surrogates. If we want to talk about IRI parsing, we'd have to agree on
> what the input to "IRI parsing" is, and what the output can be. I'd say
> the input is a sequence of Unicode scalar values at the least.

Thanks for showing me how my message could be ambiguous with regard to the test case setup and how I interpreted results. It has been my hope that we might discuss a general approach to testing IRI parsing that would be acceptable to and understood by the WG, so that we could discuss testing and results without the ambiguity.

To answer your question, I was testing Web browsers in Quirks mode with a complete HTML page where the fragment was in the <body> and the charset was specified as UTF-8 by the HTTP header Content-Type: text/html; charset=utf-8. I have re-run the tests using an HTML 4.01 strict DOCTYPE and found the same results.

Maybe it would be more helpful to release my test harness, but it's not very portable at the moment - I use a combination of Web, database, and DNS server to generate test HTML pages and collect results. It's focused on generating HTML for Web browser testing but could be repurposed for testing servers and for general API testing as well (e.g. .NET's System.Uri), although wrappers would be required for each case.

>
> Output is more complicated, but it would probably revolve around things
> that are defined in RFC 3986 or RFC 3987, or in a group draft, e.g., we
> can talk about whether some such sequence is a relative or absolute
> reference (or should be handled as such when the input is malformed).
>
> That is, we would be talking about things that we control or are
> directly affected by. We do not control and are not affected by how
> HTML is parsed or how HTML APIs work. It would be nice to minimize
> differences and surprises, but ultimately we know that the results with
> some piece of HTML markup may not be the results for XMLHttpRequest
> interactions or any number of other operations in browsers and
> elsewhere.

I agree and would like to document the minimum set of interfaces that would be interesting or valuable to test as part of a test plan.

>
>> U+FDD0 is prohibited under IDNA2003's nameprep step, and disallowed by
>> IDNA2008. The results below are from the DOM parsing.
>>
>> Scheme  Hostname  Path  Query  Browser
>> :  Chrome/12.0
>> http:  example.com  /%EF%B7%90foo  Opera/9.80
>> http:  example.com  ?zyx  MSIE 7.0
>> http:  example.com  ?zyx  MSIE 8.0
>> http:  example.com  /%EF%B7%90foo  Firefox/4.0.1
>> http:  example.com  /?foo  Safari/5.0.5
>
> I do not know how you derived these results, but it does not make much
> sense to have scheme names with colons in them under the definition of
> the term in RFC 3986 (percent-encoding is not allowed, so you cannot
> serialize this). Your intent is clear, but we need to keep the layering
> and the terminology straight and intact.

I'm collecting DOM information the same way WebKit does at http://trac.webkit.org/browser/trunk/LayoutTests/fast/url/ and the same way Julian does at http://greenbytes.de/tech/webdav/urldecomp.xml. My test harness collects the DOM properties for the HTML anchor element using JavaScript. As such, the "http:" result is the value of the .protocol property in the DOM. I should probably name the table columns the same as the DOM property names to be consistent.

>
>> The raw HTTP request results for the <img> are as follows. The only
>> exception was that Chrome did not make the request for the <img>.
>>
>> Path            Browser
>> /%EF%B7%90foo   Opera/9.80
>> /?foo           MSIE 7.0
>> /?foo           MSIE 8.0
>> /%EF%B7%90foo   Firefox/4.0.1
>> /%EF%B7%90foo   Safari/5.0.5
>>
>> Although Chrome did not make a request for the <img>, the <a> link is
>> still clickable and resolves to the percent-encoded Unicode replacement
>> character U+FFFD in the path "/%EF%BF%BDfoo".
>
> So browsers are internally inconsistent and disagree with each other.
> As far as the existing IRI specification goes, I am not aware it says
> to replace certain characters with replacement characters, so if there
> is an issue with the specific character, they should either all keep it
> intact, or all refuse to resolve it if they want to behave consistently
> and sensibly. Neither would require changes to the specification if it
> does indeed not suggest to replace characters (refusing to dereference
> IRIs is always an option for security and other reasons).

Does this WG think it's worthwhile to build a test plan? Examining a set of URI/IRI schemes is in the WG charter, which I take to mean observation through testing. A rough document I started is up at:

https://spreadsheets.google.com/spreadsheet/ccc?key=0AifoWoA0trUndEZSTlRRNnd5MzE3N3RYOVlIVFFMREE&hl=en_US#gid=7

I'm happy to take on the effort here, but only if the results are deemed worthwhile by the group. In that case I would also require feedback and guidance in terms of goals (what questions do we want to answer?), targets (what applications and interfaces should be tested?), and method (how should we go about testing to collect results?).

Best regards,
Chris
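
For readers comparing the two path values reported in this message, here is a minimal sketch of where "/%EF%B7%90foo" and "/%EF%BF%BDfoo" come from. It assumes the tested path was U+FDD0 followed by "foo", which is inferred from the reported results and is not shown in the message itself. The helper name percent_encode_path is hypothetical, and Python's standard library stands in for whatever each browser does internally.

```python
from urllib.parse import quote

# U+FDD0 is a Unicode noncharacter; U+FFFD is the replacement character.
NONCHARACTER = "\ufdd0"
REPLACEMENT = "\ufffd"


def percent_encode_path(path: str) -> str:
    """Percent-encode a path as UTF-8 octets, leaving '/' intact.

    Hypothetical helper for illustration only; it is not taken from the
    test harness described in the message.
    """
    return quote(path, safe="/", encoding="utf-8", errors="strict")


# Assumed input: the noncharacter kept intact in the path.
kept = percent_encode_path("/" + NONCHARACTER + "foo")

# Assumed input: the noncharacter swapped for U+FFFD before encoding.
replaced = percent_encode_path("/" + REPLACEMENT + "foo")

print(kept)      # /%EF%B7%90foo
print(replaced)  # /%EF%BF%BDfoo
```

The first value matches the path reported for Opera, Firefox, and the Safari <img> request; the second matches the path Chrome resolved for the <a> link. The "/?foo" results do not correspond to either encoding, which suggests the character was substituted with a literal "?" before the request was built, though the message does not confirm the mechanism.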
Received on Tuesday, 12 July 2011 00:09:00 UTC