Re: heads-up about "new" URLs section in HTML5 editor's draft from Robert J Burns on 2008-06-28 (public-html@w3.org from June 2008)

From: Robert J Burns <rob@robburns.com>
Date: Sat, 28 Jun 2008 22:10:26 +0300
To: Michael (tm) Smith <mike@w3.org>
Cc: public-html@w3.org
Message-Id: <42F797C0-61C5-44D9-B731-DBC6C444636F@robburns.com>
Hi Mike,

We used to have a general guideline in this WG where changes needed to  
be accompanied by use cases. However, from reading the message you
sent and the new changes to the draft, I have no idea what changes
have been made to the definitions of URL and IRI, nor what real
problems such changes are meant to solve.

You say[1]:
> The rationale for redefining the term "URL" -- and for including
> the sections that specify URL parsing rules for user agents and
> how user agents must resolve URLs -- is provided in the "URLs"
> section introduction:


But obviously it does not follow that because user agents parse,  
resolve and otherwise handle URLs in ways specific to HTML5 UAs, that
the URLs and IRIs themselves need to be redefined. It should be  
sufficient to use the URL and IRI definitions as they are and add  
document conformance language surrounding "valid URL" along with  
defining how HTML5 UAs handle, resolve, parse and prepare URLs for  
HTTP (and otherwise) requests.

So why not accept and adopt the existing IRI and URL specifications in
HTML5 and then specify document conformance norms for a "valid
URL" (and a valid IRI if necessary)? Then we can clearly define what
characters require percent escaping within a document and how the URL  
will be delivered in schema requests.
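
As a rough illustration of what a document-conformance check for a
"valid URL" might look at, the sketch below lists the characters in a
string that fall outside the set RFC 3986 permits to appear literally
in a URI. It assumes RFC 3986 is the existing specification being
adopted; the function name and the example address are hypothetical,
and the check does not verify that each "%" is followed by two hex
digits.

```python
import string
from typing import List

# Characters RFC 3986 allows to appear literally in a URI:
# unreserved, gen-delims, sub-delims, plus "%" as the escape introducer.
UNRESERVED = string.ascii_letters + string.digits + "-._~"
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
URI_ALLOWED = set(UNRESERVED + GEN_DELIMS + SUB_DELIMS + "%")


def chars_needing_escape(url: str) -> List[str]:
    """Return, in order, the characters that would need percent escaping.

    Hypothetical helper: it only checks character membership and does not
    parse the URL into components.
    """
    return [ch for ch in url if ch not in URI_ALLOWED]


if __name__ == "__main__":
    # The space and the non-ASCII letter are reported; the rest is allowed.
    print(chars_needing_escape("http://example.com/a b/é"))
```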

Also the problematic edge cases that Philip, Ian, Julian and others  
have uncovered are both extremely problematic and very much edge  
cases. This is an area where we should be very careful (more so than  
previously) about codifying poor implementations. For example,  
Philip's discovery that percent escaped UTF-8 and escaped UTF-16 do  
not always get sent as they should is not something we should  
encourage[2] (Firefox appears to be the only one doing the right thing
here). Fixing this will make sites work in a more interoperable way.

Finally, having UAs (and server agents) properly handle UTF-8
(%ww%xx%yy%zz) and UTF-16 (\uxxxx) will help the transition to
transmission of pure Unicode URLs, since authors can switch soon to
escaping (as has been
recommended for years) and HTML5 UAs could later escape all URLs with  
non-Latin characters for transmission in schema requests. In this way  
eventually authors could author URLs (and IRIs in this case) in the  
document’s encoding and the UA would handle the encoding for requests  
(either UTF-8 or UTF-16 as the UA or HTML5 sees fit for the particular  
characters involved in the IRI). Perhaps this is where Ian’s heading
already; it is impossible to tell from what’s in your email or in the
spec so far.
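
To make the encoding step concrete, here is a minimal sketch of a UA
taking a path authored with a non-ASCII character and percent-encoding
it as UTF-8 for the request. It uses Python's standard urllib.parse;
the function name and the example path are hypothetical, and the
ISO-8859-1 line is included only to show that the same authored
character yields different escaped bytes under a different encoding.

```python
from urllib.parse import quote, unquote


def escape_for_request(path: str, encoding: str = "utf-8") -> str:
    """Percent-encode a path for transmission, keeping "/" literal.

    Hypothetical helper: a real UA would treat each URL component
    (path, query, fragment) according to its own rules.
    """
    return quote(path, safe="/", encoding=encoding)


if __name__ == "__main__":
    authored = "/wiki/Zürich"

    # UTF-8: "ü" (U+00FC) becomes the two bytes C3 BC.
    print(escape_for_request(authored))                # /wiki/Z%C3%BCrich

    # ISO-8859-1: the same character becomes the single byte FC.
    print(escape_for_request(authored, "iso-8859-1"))  # /wiki/Z%FCrich

    # A server that assumes UTF-8 recovers the authored path only
    # from the UTF-8 form.
    print(unquote(escape_for_request(authored)) == authored)  # True
```

If every UA settled on one request encoding, a server could decode
escaped paths without guessing which encoding the document used.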

So to summarize:

  1) adopt URLs and IRIs as specified elsewhere
  2) define HTML5 document conformance for author use of URLs and IRIs
  3) define HTML5 UA conformance for handling of URLs and IRIs  
(including for non-conforming URLs and IRIs)
  4) try to push HTML toward the transmission of pure UTF-8 and UTF-16  
(when necessary through percent-encoding) for URL and IRI requests

If there's some reason not to follow such an approach (especially for  
items one, two, and three together) then what are the use cases and  
problem statements that necessitate our divergence from the existing  
specifications?

Take care,
Rob

[1]: <http://lists.w3.org/Archives/Public/public-html/2008Jun/0348.html>
[2]: <http://lists.w3.org/Archives/Public/public-html/2008Jun/0358.html>
Received on Saturday, 28 June 2008 19:11:13 UTC