Re: [url] Requests for Feedback (was Feedback from TPAC) from Sam Ruby on 2014-12-24 (public-ietf-w3c@w3.org from December 2014)

From: Sam Ruby <rubys@intertwingly.net>
Date: Wed, 24 Dec 2014 16:03:48 -0500
To: "Roy T. Fielding" <roy.fielding@gmail.com>
CC: "public-ietf-w3c@w3.org" <public-ietf-w3c@w3.org>
Message-ID: <549B2A34.8070901@intertwingly.net>

On 12/24/2014 03:08 PM, Roy T. Fielding wrote:
> On Dec 24, 2014, at 9:06 AM, Sam Ruby wrote:
>> On 12/24/2014 11:47 AM, Roy T. Fielding wrote:
>>> On Dec 23, 2014, at 11:47 AM, Sam Ruby <rubys@intertwingly.net>
>>> wrote:
>>>>
>>>>> On 12/23/2014 02:07 PM, Mark Nottingham wrote:
>>>>>
>>>>> At first glance, it appears like a lot of the valid
>>>>> URI/invalid URL outcomes are because url LS is doing
>>>>> scheme-specific processing; is that the case? (Currently
>>>>> working with limited net access + heavy jet lag)
>>>>
>>>> That certainly explains a number of differences.
>>>> Additionally:
>>>>
>>>> 1) There are cases that ABNF can't capture.  I tend to agree
>>>> with Julian[1] that the ABNF should be treated as rough syntax
>>>> only, and that additional constraints should be specified in
>>>> prose.  That's effectively how the webplatform URL draft is
>>>> structured[2].
>>>>
>>>> 2) The URL LS is IDNA and Unicode more aware than RFC 3986 is.
>>>> Clearly, this is by design, but I will suggest that there is
>>>> an important lesson to be learned by the effort to split out
>>>> RFC 3987 into a separate RFC: I think that unintentionally had
>>>> the effect of "ghettoizing" IRIs.  I might be misreading
>>>> Martin, but perhaps that's why he suggested RFC 3986 errata as
>>>> the way to handle bidi?[3]
>>>
>>> No, it has to be understood that RFC3986 defines the set of
>>> addresses that are universally interoperable. IDNA is not
>>> INTEROPERABLE except in its punycode form.
>>
>> It is not clear to me what you are saying 'No' to.  Even ASCII only
>> URIs have never been universally INTEROPERABLE (just curious: why
>> are we shouting here?).
>
> because my iPhone sometimes sticks to all-caps ... it yells for me, in
> bed.
>
>> I have data that demonstrates considerable IDNA interoperability,
>> though clearly not universally.
>
> That's because you are using the HTML5 definition of interoperable,
> whereas I am talking about two independent implementations being able
> to communicate.
>
>> I'm personally willing to settle for "rough consensus and running
>> code".
>
> You have only looked at a handful of implementations, almost all of
> which are browsers.  For 3986, we had over 50 independent
> implementations participate in the process, with well over a thousand
> known implementations in the wild. We spent four years on that
> process, in addition to the years spent on RFC1808 and RFC2396.  You
> don't replace that by posting a few messages to a private liaison
> mailing list.

You must have bigger hands than me, and less than half are browsers.

The WHATWG has been working on URLs for several years.  That process
isn't as inclusive as I would like, and I'm working to fix that.  I've
also reached out to the apps area in the IETF and the webapps WG in the W3C.

I welcome the addition of more implementations to test.  To date, I've 
chosen not to focus on implementations that are neither actively being 
developed nor particularly compliant, even if those implementations are 
widely used.  The story will only get worse if such are included (an 
example would be java.net.URI).

>>> This is an entirely different problem than parsing arbitrary
>>> references so that they can be transformed into a URL, just as it
>>> is an entirely different problem to define the URL DOM API.
>>> Neither of those would have made IETF Standard because there was
>>> no single agreement on what to do. The best we could do was an
>>> appendix.
>>
>> I'm inclined to believe that the amount of consensus and running
>> code may be different in 2015 than it was in 2005.
>
> Yes, which is why I suggested that it can be specified.  It still is
> not what RFC3986 defines.  RFC3986 is like a uniform postal code. The
> fact that you can also send letters using non-uniform addressing like
> "third house west of the red barn" does not change what is the
> uniform addressing standard.

None of this changes the fact that existing, popular, allegedly RFC 3986
compliant implementations disagree.  In areas where there is agreement --
or even near agreement -- that differs from what RFC 3986 may state, I
feel that there should be a standard that documents that agreement.

>>> The problem with RFC3987 was that it tried to define a new
>>> addressing format instead of simply defining an arbitrary
>>> reference and how to get from there to an interoperable URI. It
>>> did not work because it wasn't written to handle arbitrary input
>>> and could not keep up with changes in IDNA.
>>>
>>> As I said when this ruckus started years ago, all that HTML needs
>>> is a specification for how to parse references and another for
>>> how to fill the URL DOM. Those are HTML concerns. The notion that
>>> 3986 had to be replaced is nothing more than ignorance combined
>>> with the arrogant way that HTML5 has been allowed to piss all
>>> over the rest of Web standards.
>>
>> It looks to me that you are allowing your emotions to cloud your
>> judgment here.
>>
>> I have data[1] that shows that ASCII only RFC 3986 valid URIs are
>> not fully interoperable today.  I am working on conformance rules
>> and new parsing rules that better match implementations.  I am
>> looking not just at browsers, but at a variety of libraries.  I
>> welcome contributions of scripts and programs that explore even
>> more libraries.
>>
>> It might be possible for us to split that effort into two parts.
>> One part would either be an errata for RFC3986 or an RFC3986bis.
>> The other would be layered on top of that.
>>
>> If it turns out that this doesn't happen for whatever reason
>> (technical or political, it matters not), then the URL standard
>> will simply be a more up to date and better description as to how
>> things actually work.
>
> Most of the data you have is a bunch of abnormal references that are
> parsed differently by different browser engines.  That does not make
> them any more or less interoperable, since they don't operate as
> identifiers in isolation.

To repeat: what I have is a set of ASCII only RFC 3986 valid URIs which 
I have tested across a number of implementations, less than half of 
which are browsers, and found differences.
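
As an illustration of this kind of comparison (a minimal sketch, not the
harness behind the results referenced in this thread), the following
runs one reference through two parsers and reports the components on
which they disagree.  The helper names (parse_appendix_b, parse_urlsplit,
differences, PARSERS) are hypothetical; the regular expression is the
generic one from RFC 3986, Appendix B, and urlsplit comes from the Python
standard library.

```python
import re
from urllib.parse import urlsplit

# Generic component-splitting regex from RFC 3986, Appendix B.
RFC3986_APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

COMPONENTS = ("scheme", "authority", "path", "query", "fragment")


def parse_appendix_b(ref):
    """Split a reference using the RFC 3986 Appendix B regex."""
    m = RFC3986_APPENDIX_B.match(ref)  # every group is optional, so this matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def parse_urlsplit(ref):
    """Split a reference using Python's urllib.parse.urlsplit."""
    parts = urlsplit(ref)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Hypothetical registry; other implementations could be added here.
PARSERS = {"appendix_b": parse_appendix_b, "urlsplit": parse_urlsplit}


def differences(ref, parsers):
    """Return {component: {parser_name: value}} where the parsers disagree."""
    results = {name: parse(ref) for name, parse in parsers.items()}
    diffs = {}
    for component in COMPONENTS:
        values = {name: result[component] for name, result in results.items()}
        if len(set(values.values())) > 1:
            diffs[component] = values
    return diffs
```

For example, differences("HTTP://example.com/", PARSERS) reports the
scheme among its entries, because urlsplit lowercases it while the
Appendix B regex leaves it as written.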

> It appears that your test oracle is broken for ie, since many of the
> references for which it has the correct result are flagged as errors
> because the DOM properties are empty.

Perhaps you are only looking at the values for href?  I'm looking at all 
of the values.

Meanwhile, there is an open bug for allowing Microsoft's behavior:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=27516

> In many other cases, the test results don't take into account the
> scheme-specific processing that might differ based on where the URL
> was obtained (e.g., userinfo is often allowed on the command-line but
> not when received from an external source) and optional
> canonicalization (e.g., eliding the ? on an http URI when the query
> is empty). Canonicalization is often done for the sake of improving
> cache hits, but is not required of implementations, and there is
> nothing to suggest that it should happen within the DOM (as opposed
> to when the URL is used within a fetch).

Some of these are indeed scheme specific.  Some are clear violations of
RFC 3986.  In any case, there are numerous interoperability issues, and
some may even have security implications.

> Those tests might tell you whether the URL DOM is consistent, but
> not whether the RFC3986 identifier is interoperable with servers.
> What matters to 3986 is that the reference provided in, say, an HTML
> href, will go out on the wire as valid ASCII URI components via the
> HTTP request-target and Host.  That is why the lack of interop for
> file schemes has been largely ignored by both the IETF and browsers,
> since they are not intended for the Internet.  Likewise, javascript
> and blob are not really URL schemes even if they might be reserved
> as such to avoid name conflicts; they are special-case exceptions
> for the browser href parsing engine that are not applicable to any
> other type of implementation.

In the case of browsers, there is a relationship between what goes over 
the wire and what is in the DOM.  I'm also testing non-browser libraries
based on those libraries' public APIs.

> This does not mean there isn't value in coming up with a consistent
> reference parsing model for HTML that reflects the browser
> environment and results in a consistent URL DOM and address display.
> It simply doesn't change what is in RFC3986, nor does it escape the
> fact that a browser is still dependent on RFC3986 to produce a valid
> URI out of whatever it happens to parse when it eventually chooses to
> use that URI in an IETF protocol like HTTP.

I keep saying this: it is not just for HTML, and it is not just browsers.

I will, however, say that great effort has been undertaken to ensure that:

1) the output produced by the URL parsing algorithm matches the ABNF 
specified by RFC 3986.

2) the set of conformant inputs to the URL parsing algorithm is a proper 
subset of the set of strings that match the ABNF specified by RFC 3986.

Violations of either of these are valid bugs.  At the moment, there is 
one such bug open that I will address shortly:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=27687
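
A rough way to probe property (1) is sketched below.  It is a necessary
condition only, not a full RFC 3986 ABNF validator: it checks that a
string contains nothing but the unreserved, gen-delims and sub-delims
characters from RFC 3986 section 2, plus well-formed percent-encoded
triplets.  The function name uses_only_uri_characters is hypothetical.

```python
import re

# unreserved, gen-delims and sub-delims (RFC 3986, section 2),
# or a percent sign followed by exactly two hex digits.
_URI_CHARS = re.compile(
    r"\A(?:[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=]|%[0-9A-Fa-f]{2})*\Z"
)


def uses_only_uri_characters(s):
    """True if s uses only RFC 3986 characters and valid percent-escapes.

    Passing this check does not prove that s matches the full URI grammar;
    failing it does show that s cannot match.
    """
    return _URI_CHARS.match(s) is not None
```

A parser output that contains a raw space, a non-ASCII character, or a
malformed escape such as "%zz" fails this check and would therefore
violate property (1).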

As an aside, I agree that it is confusing that the current URL standard
uses the term URL while simultaneously using the same term for the
output of the parsing algorithm -- again there is an existing bug:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=26405

> So, if you want to talk about replacing 3987 with a spec of how to
> get from an arbitrary string to an RFC3986 URI (and an i18n
> consistent display form for that URI), then I think you can make some
> progress.
>
> Changing RFC3986 is not going to happen without a full IETF WG. The
> result would have to be based on all implementations, not just
> browsers, and would have to be validated with all Internet protocols
> that depend on STD66.

I believe that both would require a full IETF WG.  So, let's do that.

> ....Roy

- Sam Ruby
Received on Wednesday, 24 December 2014 21:04:16 UTC