Re: respecting IETF customs? from Sam Ruby on 2014-12-06 (public-ietf-w3c@w3.org from December 2014)

From: Sam Ruby <rubys@intertwingly.net>
Date: Sat, 06 Dec 2014 13:40:21 -0500
To: Bjoern Hoehrmann <derhoermi@gmx.net>
CC: "public-ietf-w3c@w3.org" <public-ietf-w3c@w3.org>
Message-ID: <54834D95.9090805@intertwingly.net>
On 12/06/2014 08:21 AM, Bjoern Hoehrmann wrote:
> * Sam Ruby wrote:
>> I will say that if the IETF-W3C liaison group feels that submitting this
>> content as an Internet-Draft makes sense, I will follow through on that.
>>   After all, publishing this content on WebPlatform.org was a result of
>> me following up on a suggestion[1].  If there are other serious
>> suggestions, I WILL follow up on them.
>
> You could also consider submitting a problem statement or other kind of
> higher level document with pointers to your proposals. Something, in any
> case, is better than nothing, if you want to raise awareness within and
> get feedback from the IETF community.

Is this something you would be willing to co-author with me?

As a starter set, I see three problem areas:

Nomenclature
---

URL/URI/IRI is just the beginning.  Over time different terms have been 
used by different organizations.  One survey can be found here:
http://tantek.com/2011/238/b1/many-ways-slice-url-name-pieces

At a minimum, this information belongs in an appendix of some 
RFC/Recommendation/Standard.  It could even stand alone.

Applicable standards
---

While the problem space seems like it would be reasonably 
self-contained, in practice standards like IDNA and Unicode make a 
profound difference.  Even those standards have versions, and even those 
versions have options like normalization forms and UseSTD3ASCIIRules.

Two examples:
https://url.spec.whatwg.org/interop/urltest-results/61a4a14209
https://url.spec.whatwg.org/interop/urltest-results/683ac9869d

RFC 3986, for example, mentions IDNA and UTF-8, but doesn't nail down 
these options.
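
To make the "options" point concrete, here is a minimal Python sketch of
how the choice of Unicode normalization form changes whether two host
labels compare equal.  The labels are hypothetical ones picked for
illustration; they are not taken from the test results linked above, and
this is not a full IDNA implementation.

```python
import unicodedata

# Hypothetical labels, chosen only for illustration.
composed = "caf\u00e9"       # 'e with acute' as one code point (U+00E9)
decomposed = "cafe\u0301"    # 'e' followed by a combining acute (U+0301)
fullwidth = "\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"  # fullwidth "example"

# A raw code-point comparison treats the two spellings as different...
raw_equal = composed == decomposed

# ...while comparing after NFC normalization treats them as the same.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))

# NFC leaves fullwidth letters alone; NFKC folds them to ASCII.
nfc_fullwidth = unicodedata.normalize("NFC", fullwidth)
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth)

print(raw_equal, nfc_equal, nfc_fullwidth == fullwidth, nfkc_fullwidth)
```

Two implementations that both "support Unicode" but pick different forms
can therefore disagree about which host a URL names.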

Interop
---

Let's face it, every programming language these days has some form of 
standard library, and in that library is some form of URL or URI parse 
function.  Many are horribly broken, even after we take into account the 
nomenclature and applicable standard differences.  Here is a concrete 
example:

http://intertwingly.net/blog/2004/07/31/URI-Equivalence

That was a decade ago.  At the time, C# was the winner, Perl a close 
second, and Java a distant third.  A decade later, I've rerun the tests 
for Perl and Java, and sadly, they haven't changed.
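
For readers who have not seen this class of test, the following Python
sketch shows the kind of comparison involved.  It is my own rough
illustration, not the code from the blog post: `normalize` is a
hypothetical helper that applies only a few of the syntax-based
normalizations from RFC 3986 section 6.2.2 (case of scheme and host,
case of percent-encoding hex digits, decoding of percent-encoded
unreserved characters).  It deliberately skips dot-segment removal,
default ports, and anything to do with IDNA.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
              "abcdefghijklmnopqrstuvwxyz"
              "0123456789-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_pct(text):
    """Decode %XX for unreserved characters; uppercase the hex otherwise."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(fix, text)


def normalize(uri):
    """Hypothetical helper: partial RFC 3986 syntax-based normalization."""
    parts = urlsplit(uri)  # urlsplit already lowercases the scheme
    # Lowercase host and port only; userinfo is case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit((
        parts.scheme,
        _normalize_pct(netloc),
        _normalize_pct(parts.path),
        _normalize_pct(parts.query),
        _normalize_pct(parts.fragment),
    ))


a = "HTTP://Example.COM/%7Esam"
b = "http://example.com/~sam"
print(a == b, normalize(a) == normalize(b))
```

A library whose equality check stops at the string comparison, or that
normalizes only some of these components, will answer differently from
one that does all of them.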

If you take a survey of implementations, you will find that in addition 
to the outliers, there are two families of implementations.  One 
clusters around RFC 3986; its members are precise (in that they tend to 
produce the same results) but not necessarily accurate in the face of 
IDNA and Unicode considerations.  The other clusters around browser 
results; its members are less precise (in that there are variations), 
but tend overall to be more accurate with respect to other applicable 
standards.

By the way, another, more insidious problem lurks in places like 
file:// URIs.  For now, I'll just leave it at that.

>> An example where help would be very much appreciated: would it be
>> possible for somebody who not only is familiar with RFC 3986 but also
>> has a sense for what parts might be changeable and what parts can't
>> change to review the following:
>>
>> https://url.spec.whatwg.org/interop/urltest-results/
>
> This page is rather difficult to digest. One problem is that there is no
> indication of expected results, and the colour coding does not indicate,
> for instance, where test results diverge from the relevant RFCs. My
>
>    http://shadowregistry.org/js/misc/
>
> presents tests and results in a form that makes such information more
> readily available.

Actually, it is color coded (skip to the bottom of the page), but it 
doesn't start from the assumption that there is one right answer and 
that there are a number of errant implementations that don't conform to 
that right answer.  Such an assumption, if it could be made, would 
indeed simplify the presentation.

Items colored in a reddish color are examples where there doesn't seem 
to be agreement on what the right answer is.

The next two colors cover the cases where the IETF or WHATWG 
specifications are not in line with the consensus.

The final color is where IETF and WHATWG agree.  Even in those cases, 
there often are a few outliers.

If you have other ideas on how to present this information, here's the 
raw data captured for a number of user agents:

https://github.com/webspecs/url/tree/develop/evaluate/useragent-results

I welcome people to take this data and present it other ways.  
Contributions back are welcome but not required: I would be interested 
in other ways to present this data, more tests, or other result sets 
that should be included.  For example, adding Perl to these evaluation 
results would make perfect sense.

>> And while that is a broad request, here is a much more focused request,
>> define some test cases which will define how relative references should
>> be evaluated against a base with an unknown URLs/URIs scheme:
>>
>> https://www.w3.org/Bugs/Public/show_bug.cgi?id=27233
>
> You already seem to have plenty of tests if you replace the scheme in
> them, and if you have a setup that can automatically evaluate tests, I
> would simply automatically generate test cases. For an example, see
>
>    http://lists.w3.org/Archives/Public/www-archive/2011Aug/0001.html
>
> I also note that RFC 3986 already fully defines this, and I am not aware
> of differences in deployed code that cannot be changed in this regard.
> If there are, they ought to be brought up on the `public-iri` or `uri`
> list.
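
The suggestion of generating tests by replacing the scheme could be
sketched roughly as follows in Python.  The base and the four
reference/result pairs are from the examples in RFC 3986 section 5.4.1;
the scheme name `x-test`, the helper names, and the dictionary layout
are assumptions made up for this sketch.  The `expected` values assume
that resolution is scheme-independent, as RFC 3986 specifies; whether
deployed code agrees is exactly what such tests would check.

```python
# Base URI and a few reference/result pairs from RFC 3986 section 5.4.1.
BASE = "http://a/b/c/d;p?q"
RFC3986_EXAMPLES = [
    ("g", "http://a/b/c/g"),
    ("../g", "http://a/b/g"),
    ("//g", "http://g"),
    ("?y", "http://a/b/c/d;p?y"),
]


def with_scheme(uri, scheme):
    """Replace everything before the first ':' with the given scheme."""
    _old, sep, rest = uri.partition(":")
    return scheme + sep + rest


def generate(schemes):
    """Yield one test case per (scheme, example) combination."""
    for scheme in schemes:
        base = with_scheme(BASE, scheme)
        for reference, expected in RFC3986_EXAMPLES:
            yield {
                "base": base,
                "input": reference,
                # Assumes scheme-independent resolution per RFC 3986.
                "expected": with_scheme(expected, scheme),
            }


cases = list(generate(["x-test"]))  # "x-test" is a made-up scheme name
for case in cases:
    print(case)
```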

This is a case where my precision vs accuracy comment applies.  My fear 
is that people, by comparing test results, have come to standardize on 
things outside of the spec when it comes to matters like IDNA and 
Unicode.  And in the places where they have done so, they may not be in 
compliance with those other standards.

Some of these choices may be defensible.  Perhaps Perl can't make 
assumptions about character encoding when faced with percent-encoded 
bytes. 
But then perhaps URI::eq shouldn't be providing boolean answers on 
questions of equivalence.  It isn't that difficult to come up with 
scenarios where such differences have security implications.
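
As a small illustration of why the character encoding assumption
matters, here is a Python sketch (not Perl, and not how URI::eq works)
using `urllib.parse`.  The strings are made-up examples.  The same
percent-encoded byte decodes to different text depending on the assumed
encoding, and two strings that decode to the same text can still differ
at the byte level.

```python
from urllib.parse import unquote, unquote_to_bytes

latin1_encoded = "caf%E9"     # the byte 0xE9: 'e with acute' in Latin-1
utf8_encoded = "caf%C3%A9"    # the bytes 0xC3 0xA9: the same letter in UTF-8

# The raw bytes are unambiguous...
raw = unquote_to_bytes(latin1_encoded)

# ...but the text depends on which encoding the caller assumes.
as_latin1 = unquote(latin1_encoded, encoding="latin-1")
as_utf8 = unquote(latin1_encoded, encoding="utf-8", errors="replace")

# Equivalent as text under one pair of assumptions, different as bytes.
same_text = unquote(utf8_encoded) == as_latin1
same_bytes = unquote_to_bytes(utf8_encoded) == raw

print(raw, as_latin1, repr(as_utf8), same_text, same_bytes)
```

An equality function that returns a plain boolean has to pick one of
these views without telling the caller which one it picked.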

If you believe that this should be discussed on one of those lists, 
please do so.  Feel free to copy me, point to this email (it is publicly 
archived), or to even forward some or all of this email to those lists.

- Sam Ruby
Received on Saturday, 6 December 2014 18:40:52 UTC