- From: Sam Ruby <rubys@intertwingly.net>
- Date: Sat, 06 Dec 2014 13:40:21 -0500
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: "public-ietf-w3c@w3.org" <public-ietf-w3c@w3.org>
On 12/06/2014 08:21 AM, Bjoern Hoehrmann wrote:
> * Sam Ruby wrote:
>> I will say that if the IETF-W3C liaison group feels that submitting this
>> content as an Internet-Draft makes sense, I will follow through on that.
>> After all, publishing this content on WebPlatform.org was a result of
>> me following up on a suggestion[1]. If there are other serious
>> suggestions, I WILL follow up on them.
>
> You could also consider submitting a problem statement or other kind of
> higher level document with pointers to your proposals. Something, in any
> case, is better than nothing, if you want to raise awareness within and
> get feedback from the IETF community.

Is this something you would be willing to co-author with me?

As a starter set, I see three problem areas:

Nomenclature
---

URL/URI/IRI is just the beginning. Over time different terms have been
used by different organizations. One survey can be found here:

http://tantek.com/2011/238/b1/many-ways-slice-url-name-pieces

At a minimum, this information belongs as an appendix in some
RFC/Recommendation/Standard. It even could stand alone.

Applicable standards
---

While the problem space seems like it would be reasonably
self-contained, in practice concepts like IDNA and Unicode make
profound differences. Even those standards have versions, and even
those versions have options like Normalization forms and
UseSTD3ASCIIRules. Two examples:

https://url.spec.whatwg.org/interop/urltest-results/61a4a14209
https://url.spec.whatwg.org/interop/urltest-results/683ac9869d

RFC 3986, for example, mentions IDNA and UTF-8, but doesn't nail down
these options.

Interop
---

Let's face it, every programming language these days has some form of
standard library, and in that library is some form of URL or URI parse
function. Many are horribly broken, even after we take into account the
nomenclature and applicable standard differences. Here is a concrete
example:

http://intertwingly.net/blog/2004/07/31/URI-Equivalence

That was a decade ago.
At the time, C# was the winner, Perl a close second, and Java a far
distant third. A decade later, I've rerun the tests for Perl and Java,
and sadly, they haven't changed.

If you take a survey of implementations, you will find that in addition
to the outliers, there are two families of implementations. One
collects around RFC 3986; its members are precise (in that they tend to
produce the same results) but not necessarily accurate in the face of
IDNA and Unicode considerations. The other collects around browser
results; its members are less precise (in that there are variations),
but tend overall to be more accurate with respect to other applicable
standards.

By the way, there is another and more insidious problem lurking in
places like file:// URIs. For now, I'll just leave it at that.

>> An example where help would be very much appreciated: would it be
>> possible for somebody who not only is familiar with RFC 3986 but also
>> has a sense for what parts might be changeable and what parts can't
>> change to review the following:
>>
>> https://url.spec.whatwg.org/interop/urltest-results/
>
> This page is rather difficult to digest. One problem is that there is no
> indication of expected results, and the colour coding does not indicate,
> for instance, where test results diverge from the relevant RFCs. My
>
> http://shadowregistry.org/js/misc/
>
> presents tests and results in a form that makes such information more
> readily available.

Actually, it is color coded (skip to the bottom of the page), but it
doesn't start from the assumption that there is one right answer and
that there are a number of errant implementations that don't conform to
that right answer. Such an assumption, if it could be made, would
indeed simplify the presentation.

Items colored in a reddish color are examples where there doesn't seem
to be agreement on what the right answer is. The next two colors cover
the cases where the IETF or WHATWG specifications are not in line with
the consensus.
The final color is where IETF and WHATWG agree. Even in those cases,
there often are a few outliers.

If you have other ideas on how to present this information, here's the
raw data captured for a number of user agents:

https://github.com/webspecs/url/tree/develop/evaluate/useragent-results

I welcome people to take this data and present it other ways. I welcome
but don't require people to contribute back: possible things I would be
interested in are other ways to present this data, more tests, or other
result sets that should be included. For example, adding Perl to these
evaluation results would make perfect sense.

>> And while that is a broad request, here is a much more focused request,
>> define some test cases which will define how relative references should
>> be evaluated against a base with an unknown URLs/URIs scheme:
>>
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=27233
>
> You already seem to have plenty of tests if you replace the scheme in
> them, and if you have a setup that can automatically evaluate tests, I
> would simply automatically generate test cases. For an example, see
>
> http://lists.w3.org/Archives/Public/www-archive/2011Aug/0001.html
>
> I also note that RFC 3986 already fully defines this, and I am not aware
> of differences in deployed code that cannot be changed in this regard.
> If there are, they ought to be brought up on the `public-iri` or `uri`
> list.

This is a case where my precision vs accuracy comment applies. My fear
is that people, by comparing test results, have come to standardize on
things outside of the spec when it comes to matters like IDNA and
Unicode. And in the places where they have done so, they may not be in
compliance with those other standards.

Some of these choices may be defensible. Perhaps Perl can't make
assumptions about character encoding when faced with percent-encoded
bytes. But then perhaps URI::eq shouldn't be providing boolean answers
on questions of equivalence.
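
To make the precision vs accuracy point concrete, here is a minimal
Python sketch of a syntax-only equivalence check in the spirit of RFC
3986 section 6.2.2 (case and percent-encoding normalization). The names
`normalize` and `equivalent` are hypothetical and are not taken from
URI::eq or any other library mentioned here. The sketch deliberately
does nothing about IDNA or Unicode normalization, which is where a
plain boolean answer starts to look questionable.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Characters RFC 3986 calls "unreserved": safe to decode when percent-encoded.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_pct(text):
    """Decode percent-encoded unreserved characters; uppercase other hex digits."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        return "%" + match.group(1).upper()
    return _PCT.sub(fix, text)


def normalize(uri):
    """Hypothetical helper: syntax-based normalization only, no IDNA or Unicode."""
    parts = urlsplit(uri)
    # Lowercase the host (and port), but leave any userinfo untouched.
    userinfo, sep, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + sep + hostport.lower()
    return urlunsplit((
        parts.scheme.lower(),
        _normalize_pct(netloc),
        _normalize_pct(parts.path),
        _normalize_pct(parts.query),
        _normalize_pct(parts.fragment),
    ))


def equivalent(a, b):
    """Boolean answer based purely on the normalized strings."""
    return normalize(a) == normalize(b)
```

Under these assumptions, `equivalent("HTTP://Example.COM/%7Euser",
"http://example.com/~user")` is true, while `/a%2Fb` and `/a/b` stay
distinct because `%2F` encodes a reserved character. A path containing
`%C3%A9` and one containing a literal `é` also compare as different,
since the sketch makes no assumption about the character encoding of
percent-encoded bytes.
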
It isn't that difficult to come up with scenarios where such
differences have security implications.

If you believe that this should be discussed on one of those lists,
please do so. Feel free to copy me, point to this email (it is publicly
archived), or even forward some or all of this email to those lists.

- Sam Ruby
Received on Saturday, 6 December 2014 18:40:52 UTC