RE: URL work in HTML 5 (semifork)

(It was requested that we move this conversation to the public-iri mailing list, and I'm willing to take that advice. I'm bcc'ing www-archive@w3.org so that anyone looking there has a place to follow up.)

I agree that a "processing spec" which reflects what browsers actually do, or should do, when confronted with a string which claims to be a "URL", seems like a good idea.

In 2009, the transition from http://tools.ietf.org/html/draft-duerst-iri-bis-06 to http://tools.ietf.org/html/draft-duerst-iri-bis-07#section-13.1 was a major restructuring of the IRI spec, moving from a BNF-derived syntax specification to a more directive "processing model". This was followed by the formation of an IRI working group to manage the work, with the documents starting the series http://tools.ietf.org/html/draft-ietf-iri-3987bis which, through version 06, contained:

http://tools.ietf.org/html/draft-ietf-iri-3987bis-06#section-6.2 "Web Address Processing". 

   Many popular web browsers have taken the approach of being quite
   liberal in what is accepted as a "URL" or its relative forms.  This
   section describes their behavior in terms of a preprocessor which
   maps strings into the IRI space for subsequent parsing and
   interpretation as an IRI.

   In some situations, it might be appropriate to describe the syntax
   that a liberal consumer implementation might accept as a "Web
   Address" or "Hypertext Reference" or "HREF".  However, technical
   specifications SHOULD restrict the syntactic form allowed by
   compliant producers to the IRI or IRI reference syntax defined in
   this document even if they want to mandate this processing.

The specification may not have matched actual browser behavior, or expected or desired browser behavior, but it at least attempted to do what was said to be wanted: describe a processing model which is forgiving and accepts arbitrary input, while also providing a stricter interpretation of "legal" IRI syntax for IRI producers.
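
To make that split concrete, here is a minimal Python sketch of the two layers, under stated assumptions. The names lenient_preprocess and is_strict are hypothetical; the character class is only a rough stand-in for the RFC 3986 character repertoire, not the full grammar; and percent-encoding everything outside it is a simplification that maps into URI space rather than IRI space. It is not taken from any of the drafts and does not claim to match what any browser does.

```python
import re
from urllib.parse import quote

# Assumption: characters a strict producer may emit (RFC 3986 unreserved and
# reserved characters, plus "%" for percent-escapes). This is a character
# check only; it does not validate URI structure.
_ALLOWED = "-._~:/?#[]@!$&'()*+,;=%"
_STRICT_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")


def is_strict(candidate: str) -> bool:
    """Strict layer: accept only strings made of the allowed characters."""
    return _STRICT_CHARS.fullmatch(candidate) is not None


def lenient_preprocess(raw: str) -> str:
    """Forgiving layer (hypothetical): accept arbitrary input and map it
    into the strict character set before any further parsing."""
    trimmed = raw.strip()  # drop leading and trailing whitespace
    # Percent-encode anything outside the allowed set (spaces, backslashes,
    # non-ASCII as UTF-8); existing "%" signs are left untouched.
    return quote(trimmed, safe=_ALLOWED)


sloppy = "  http://example.com/a b "
cleaned = lenient_preprocess(sloppy)
```

A consumer would run lenient_preprocess first and hand the result to the strict component; a producer would be expected to pass is_strict without any fixup.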

It's hard to find any evidence of issues, discussion, or comments in the tracker or on the mailing list; it's great to finally get some participation, testing, and resolution of open issues.

My understanding is that around August 2011, to address the issue of 'venue selection', there was some kind of agreement that the "Web Address Processing" would be handled in W3C (originally in HTML WG and now in WebApps) while the stricter interpretation would be handled in IETF, and the section on "Web Address Processing" was removed from the IETF document as of http://tools.ietf.org/html/draft-ietf-iri-3987bis-07 .

If there is agreement now that the entire IRI / URL processing model will be described in a W3C specification (from string-of-characters in some document charset into "strings sent to IRI component processor or to HTTP client interface"), I think that's workable.

I think, though, that it is risky to have the same processing described in two different parallel specifications.

One path would be for 3987bis to instead normatively reference the W3C specification for URL processing (in place of most of 3.1-3.5 of http://tools.ietf.org/html/draft-ietf-iri-3987bis-12 ), leaving the remaining components: how to compare IRIs (in the comparison document), considerations for dealing with BIDI IRIs (in the bidi document), and how to be compatible with legacy systems which accept not only ASCII-only URIs, but also Unicode-based IRIs which require some compatibility with RFC 3987 (such as XML processors which only accept LEIRIs).

That would also converge the specifications. The only remaining concern is whether the W3C specification will follow the same backward compatibility guidelines -- to only make specification changes if there is a large majority of commonly deployed implementations that implement the change. If, for example, 30% of installed browsers treat "\" as if it is "/" and 70% of installed browsers do not, then the 30% should not lead to a change in the processing model.   Of course, percentages and market share are fluid, but let's look for some stability...  

Liberal handling of previously illegal strings as IRIs has some security implications that should be examined carefully; this is not like content parsing and style sheet application.

Larry
--
http://larry.masinter.net


> -----Original Message-----
> From: Jan Algermissen [mailto:jan.algermissen@nordsc.com]
> Sent: Tuesday, October 16, 2012 4:45 AM
> To: Anne van Kesteren
> Cc: Martin J. Dürst; Robin Berjon; Ted Hardie; Larry Masinter; plh@w3.org; Peter
> Saint-Andre (stpeter@stpeter.im); Pete Resnick (presnick@qualcomm.com);
> www-archive@w3.org; Michael(tm) Smith
> Subject: Re: URL work in HTML 5 (semifork)
> 
> 
> On Oct 16, 2012, at 1:29 PM, Anne van Kesteren wrote:
> 
> > I'm not arguing URLs should be allowed to contain SP, just that they
> > can (and do) in certain contexts and that we need to deal with that
> > (either by terminating processing or converting it to %20 or ignoring
> > it in case of domain names, if I remember correctly).
> 
> I am not understanding your perceived problem with two specs.
> 
> There is the RFC and that is telling us what a valid URI looks like.
> 
> In addition to that you can standardize 'recovery' algorithms for turning
> broken URIs to valid ones. Maybe with different 'heuristics levels' before
> giving up and reporting an error.
> 
> Any piece of software that wishes to be nice on 'URI providers' and process
> broken URIs to some extent can apply that standardized algorithm in a fixup
> phase before handing it on to the component that expects a valid URI.
> 
> The emphasis is then on fixing to get a valid URI as early in the stack
> as possible and avoid the fork on software components that deal with URIs.
> 
> I just don't see any need to mangle any specs. Syntax definition and fixing
> algorithm are orthogonal aspects, really. They belong in different specs.
> 
> Jan

Received on Tuesday, 16 October 2012 23:57:10 UTC