Re: Change proposal for ISSUE-56 from Roy T. Fielding on 2010-07-15 (public-html@w3.org from July 2010)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Wed, 14 Jul 2010 19:46:20 -0700
To: Adam Barth <w3c@adambarth.com>
Cc: HTML WG <public-html@w3.org>
Message-Id: <1667A550-A0B3-422A-ABFE-C7827BD253AE@gbiv.com>

On Jul 14, 2010, at 6:12 PM, Adam Barth wrote:

> == Proposal Details ==
> 
> The proposal details herein takes the form of a set of edit
> instructions, specific enough that they can be applied without
> ambiguity:
> 
> 1) Revert http://svn.whatwg.org/webapps@3245.  (Note: the editor and
> the working group should feel free to continue to improve this text
> after adopting this change proposal.)

Er, the link doesn't work, but the original text that you intend
to restore is not consistent with your change proposal.  The text
that I originally objected to does not recognize the distinction
between input strings and URIs, and in fact deliberately misuses
the term URL in a misguided attempt to "fix" a problem that never
existed in the first place.  Restoring bad text will not address the
issues in your rationale.

Why not just propose text that addresses the issue?  Forget about
the algorithm that used to be in the spec -- it was not accurate
anyway and certainly does not reflect interoperable implementation.

Parsing is not that hard to describe:

   Get string value (in context and document encoding)
   Remove leading and trailing whitespace

   If more than one reference is allowed,
     split the string into separate strings on \s+
   else
     look for embedded whitespace and either remove it
     or replace it with a single %20, depending on context
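
The whitespace handling above can be sketched in Python roughly as follows. This is only an illustration: the function name and the two flags are made up for the example, and Python's notion of whitespace is broader than what HTML treats as whitespace.

```python
import re


def split_references(value, allow_multiple, space_as_pct20=False):
    """Trim an attribute value, then split it or repair embedded whitespace.

    allow_multiple and space_as_pct20 are hypothetical flags standing in
    for "more than one reference allowed" and "depending on context".
    """
    value = value.strip()  # remove leading and trailing whitespace
    if allow_multiple:
        # Split on runs of whitespace (the \s+ case); empty input gives [].
        return value.split()
    if space_as_pct20:
        # Each run of embedded whitespace becomes a single %20.
        return [re.sub(r"\s+", "%20", value)]
    # Otherwise drop embedded whitespace entirely.
    return [re.sub(r"\s+", "", value)]
```

For example, split_references(" a  b ", False, space_as_pct20=True) yields a single reference "a%20b", while the same input with allow_multiple=True yields two references.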

   For each such reference,
     transcode the reference string to UTF-8;
     replace entity references with corresponding character;
     parse string into components according to an algorithm
       equivalent to the regex in RFC3986, Appendix B;
     further parse the authority component into its generic
       subcomponents: [ userinfo "@" ] host [ ":" port ]
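
A minimal Python sketch of the per-reference steps, assuming the input is already a Python str (so the UTF-8 transcoding step is implicit) and using html.unescape as a stand-in for entity replacement. The regular expression is the one from RFC 3986, Appendix B; the Reference type and function names are invented for the example.

```python
import html
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


class Reference(NamedTuple):
    scheme: Optional[str]
    userinfo: Optional[str]
    host: Optional[str]
    port: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def split_authority(authority):
    """Split authority into [ userinfo "@" ] host [ ":" port ]."""
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None
    host, colon, port = hostport.rpartition(":")
    # No colon, or the last colon is inside an IPv6 literal like [::1].
    if not colon or "]" in port:
        host, port = hostport, None
    return userinfo, host, port


def parse_reference(text):
    """Parse one reference string into its generic components."""
    text = html.unescape(text)  # replace entity references
    match = URI_RE.match(text)  # every group is optional, so this always matches
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    userinfo = host = port = None
    if authority is not None:
        userinfo, host, port = split_authority(authority)
    return Reference(scheme, userinfo, host, port, path, query, fragment)
```

Components that are absent come back as None, which keeps an empty query ("?") distinguishable from a missing one.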

To obtain the URI of a parsed reference:

   Encode any disallowed characters in each component by
      a) applying punycode to anything that is supposed to be a
         host *and* is presented in Internet dot-notation
         (i.e., don't punycode local names like WINS), or
      b) pct-encoding any disallowed or component-delimiting
         characters within each component;

   Perform the (non-strict) relative resolution algorithm of
      RFC3986, sec 5.2;

   Combine the components as defined in RFC3986, sec. 5.3.
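
As a rough illustration, the steps for obtaining the URI might look like this in Python. It is a sketch under several assumptions: urllib.parse stands in for the parsing, resolution, and recombination steps; Python's built-in "idna" codec (IDNA 2003) stands in for punycoding; "%" is left untouched so existing escapes are not double-encoded; and the function names are invented for the example.

```python
from urllib.parse import quote, urljoin, urlsplit, urlunsplit

SUB_DELIMS = "!$&'()*+,;="


def encode_host(host):
    """Punycode dotted names; pct-encode other non-ASCII (local) names."""
    if host.isascii():
        return host
    if "." in host:
        # Internet dot-notation: apply IDNA/punycode label by label.
        return host.encode("idna").decode("ascii")
    return quote(host, safe=SUB_DELIMS + "%")


def to_uri(reference, base):
    """Encode each component, then resolve the reference against base."""
    parts = urlsplit(reference)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if not colon or "]" in port:  # no port, or colon inside an IPv6 literal
        host, colon, port = hostport, "", ""
    netloc = quote(userinfo, safe=SUB_DELIMS + ":%") + at + encode_host(host) + colon + port
    encoded = urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=SUB_DELIMS + "/:@%"),
        quote(parts.query, safe=SUB_DELIMS + "/?:@%"),
        quote(parts.fragment, safe=SUB_DELIMS + "/?:@%"),
    ))
    # urljoin performs the relative resolution and recombination.
    return urljoin(base, encoded)
```

With a base of "http://a/b/c/d", the reference "../g h" comes out as "http://a/b/g%20h".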

To obtain the IRI (display form) of a parsed reference:

   Obtain the URI;
   Search for punycoded domain names and decode them to UTF-8;
   Replace each sequence of pct-encoded octets that corresponds
      to a valid UTF-8 character outside the ASCII subset with
      that UTF-8 character [DO NOT decode encoded ASCII];
   Transcode the string to the document (or display) encoding.
   [Apply appropriate spoof-highlighting filters.]
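
The display-form steps can be sketched the same way. Again this is an illustration, not a definitive implementation: it assumes the input is already a valid ASCII URI, uses Python's "idna" codec to undo punycode, leaves transcoding to the output layer, and omits the spoof-highlighting filters. The function names are invented for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit

PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_non_ascii(component):
    """Decode pct-encoded runs that form valid non-ASCII UTF-8 characters."""
    def replace(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        out, i = [], 0
        while i < len(raw):
            if raw[i] >= 0x80:
                # Try to decode exactly one 2-, 3-, or 4-octet character.
                for n in (2, 3, 4):
                    try:
                        out.append(raw[i:i + n].decode("utf-8"))
                    except UnicodeDecodeError:
                        continue
                    i += n
                    break
                else:
                    out.append("%%%02X" % raw[i])  # invalid octet stays encoded
                    i += 1
            else:
                out.append("%%%02X" % raw[i])  # encoded ASCII stays encoded
                i += 1
        return "".join(out)
    return PCT_RUN.sub(replace, component)


def to_iri(uri):
    """Turn an ASCII URI into a display-form IRI."""
    parts = urlsplit(uri)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.rpartition(":")
    if not colon or "]" in port:  # no port, or colon inside an IPv6 literal
        host, colon, port = hostport, "", ""
    if "xn--" in host:
        host = host.encode("ascii").decode("idna")  # undo punycode
    return urlunsplit((
        parts.scheme,
        decode_non_ascii(userinfo) + at + host + colon + port,
        decode_non_ascii(parts.path),
        decode_non_ascii(parts.query),
        decode_non_ascii(parts.fragment),
    ))
```

Escapes that are kept are re-emitted with uppercase hex digits, which RFC 3986 treats as equivalent to the lowercase form.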

Most implementations store most (if not all) of these components
or intermediate forms as a byproduct of parsing and display,
usually in the equivalent of a DOM.

....Roy

Received on Thursday, 15 July 2010 02:46:50 UTC