Re: URL parsing in HTML5

Hi Martin, thank you for your comments. I have time for only a quick
reply at the moment.

On 11/3/11 10:23 PM, "Martin J. Dürst" wrote:
> Hello Peter, others,
> 
> On 2011/11/04 13:21, Peter Saint-Andre wrote:
>> After chatting during TPAC 2011 with Addison, Larry, Richard, Ian, Mike,
>> Ted, Julian (etc.), I'd like to share some thoughts about a possible
>> compromise / resolution regarding Issue 56 in the HTML WG:
>>
>> http://www.w3.org/html/wg/tracker/issues/56
>>
>> Some observations and opinions:
>>
>> 1. It is unlikely that existing browsers will change their current URL
>> parsing behavior. (I am not judging whether that behavior is good or
>> bad.)
> 
> I agree that it's very unlikely that they change it in areas where they
> all agree on a particular behavior. Discussions in the IRI WG have often
> very quickly come up with examples where major browsers differ, and (at
> least) in these areas, some change seems desirable.

Yes, that's true.

>> 2. Documentation of that behavior is out of scope for the revisions to
>> RFC 3987, and outside the charter of the IRI WG, because it's a matter
>> of URI [pre-]processing (RFC 3986) and not IRI processing (RFC 3987).
> 
> I have to say that I'm very surprised to see such an "out of scope"
> statement. Of course, I haven't been party to the discussions you
> mention, and I admit that coming from you as the responsible Area
> Director, such a statement carries a lot of weight.

You are right. I was voicing my impression from discussions this week.
My impression might be wrong.

> However, as far as I can remember, the issue of how browsers deal with
> IRIs was always an important part in the deliberations that led up to
> the formation of the WG, and also during the WG.
> 
> Also, saying that browsers do URI (pre-)processing but not IRI
> (pre-)processing surprises me quite a lot, because the single most
> important difference between URIs and IRIs is that the latter allow
> non-ASCII characters, and browsers definitely do that. This is despite
> the fact that the HTML5 spec likes to call these "URL"s (which is
> neither URI nor IRI).

I meant, among other things, that behaviors like "remove whitespace from
the front and back of a proto-URL" are not specific to URIs or IRIs
because they are a matter of preprocessing. You are right that HTML
files can include UTF-8 encoded Unicode characters, so in theory we are
talking about IRI processing. However, many of the heuristics appear to
be related to things other than percent-encoding and such, so the topics
cross many boundaries. I think this has led to much of the confusion
about roles and responsibilities.
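
To illustrate the distinction, here is a minimal sketch in Python. It is
not a description of what any browser actually does: the function names
and the set of trimmed characters are assumptions made up for this
example. The first step is pure preprocessing and applies equally to a
URI or an IRI; only the second step touches percent-encoding.

```python
from urllib.parse import quote

# Assumption for illustration: which characters count as trimmable
# whitespace is a guess here, not taken from any specification.
_TRIM_CHARS = " \t\n\f\r"


def preprocess_candidate(candidate: str) -> str:
    """Trim whitespace from the front and back of a candidate string.

    Nothing here depends on whether the string is a URI or an IRI.
    """
    return candidate.strip(_TRIM_CHARS)


def percent_encode_path(path: str) -> str:
    """Percent-encode non-ASCII characters in a path as UTF-8 octets.

    This is the part that is specific to IRI handling. It is naive:
    an existing "%" in the input would be encoded again.
    """
    return quote(path, safe="/")


if __name__ == "__main__":
    cleaned = preprocess_candidate("  http://example.org/caf\u00e9 \n")
    print(repr(cleaned))
    print(percent_encode_path("/caf\u00e9"))
```

Keeping the two steps in separate functions mirrors the point above:
the trimming heuristic can be documented without saying anything about
percent-encoding.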

>> 3. It is unlikely that RFC 3986 will ever be modified to recommend the
>> current behavior, and simply impossible before HTML5 is advanced at the
>> W3C (even if such modifications were desirable).
> 
> Fully agreed.
> 
> 
>> 4. As far as I can see, the current behavior is in fact out of scope for
>> RFC 3986 and any future possible revisions to RFC 3986 because:
>>
>>     (a) it is mostly or completely a matter of pre-processing of strings
>>     that look like URIs/URLs/"web-addresses" -- we could call these
>>     "candidate strings" or "proto-URLs" or somesuch to disambiguate them
>>     from URIs
>>
>>     (b) this pre-processing behavior is applied only in the web context
>>     by browsers and software applications that want to be consistent
>>     with browsers
>>
>>     (c) because of (b), there is no great danger that this behavior will
>>     "leak" into processing of URIs in general (mailto:, sip:, tel:,
>>     URNs, and so on)
> 
> Mostly agree, except for (c). URI/IRI/URL processing isn't a matter of
> schemes; browsers handle mailto: schemes, and some deal with tel:
> schemes and others.

What I meant to say is that, given the way many browsers are implemented,
such specialized "web processing" would not necessarily leak into
generic URI parsing code. This hunch would need to be validated, but my
sense is that some folks have been worried that documenting browser
behavior would cause all URIs/IRIs in all applications to be processed
in ways that are not fully consistent with RFC 3986 / RFC 3987. Right
now I think such a fear might be misplaced.

>> 5. There's no necessity for work on documentation of the current URL
>> parsing behavior to happen at the IETF, given that it's out of scope for
>> the IRI WG.
> 
> As said above, I disagree with the latter part of the sentence, and
> therefore have to disagree with the overall conclusion.
> 
> It may very well be that for various and potentially even very good
> reasons, it is better to do this work somewhere else than at the IETF,
> but "it's out of scope for the IRI WG" doesn't really make a good
> reason, because the IRI WG was formed and until now has worked under the
> assumption that it's in scope.
> 
> [That the IRI WG hasn't made much progress on this issue may be a good
> reason to decide it shouldn't be part of the IRI WG's work, and should be
> done somewhere else, but that would be a different reason.]

I see your point and I look forward to discussing the matter further in
an open fashion so that we can figure out a way forward.

Peter

-- 
Peter Saint-Andre
https://stpeter.im/

Received on Friday, 4 November 2011 14:38:37 UTC