Re: Change proposal for ISSUE-56 from Adam Barth on 2010-07-15 (public-html@w3.org from July 2010)

From: Adam Barth <w3c@adambarth.com>
Date: Wed, 14 Jul 2010 19:58:39 -0700
To: "Roy T. Fielding" <fielding@gbiv.com>
Cc: HTML WG <public-html@w3.org>
Message-ID: <AANLkTin9FC5z2qpPDcldsUIx4TCu1hbmQWEeOAD8lL5f@mail.gmail.com>

On Wed, Jul 14, 2010 at 7:46 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
> On Jul 14, 2010, at 6:12 PM, Adam Barth wrote:
>> == Proposal Details ==
>>
>> The proposal details herein take the form of a set of edit
>> instructions, specific enough that they can be applied without
>> ambiguity:
>>
>> 1) Revert http://svn.whatwg.org/webapps@3245.  (Note: the editor and
>> the working group should feel free to continue to improve this text
>> after adopting this change proposal.)
>
> Er, the link doesn't work, but the original text that you intend
> to restore is not consistent with your change proposal.  The text
> that I originally objected to does not recognize the distinction
> between input strings and URIs, and in fact deliberately misuses
> the term URL in a misguided attempt to "fix" a problem that never
> existed in the first place.  Restoring bad text will not address the
> issues in your rationale.
>
> Why not just propose text that addresses the issue?  Forget about
> the algorithm that used to be in the spec -- it was not accurate
> anyway and certainly does not reflect interoperable implementation.
>
> Parsing is not that hard to describe:
>
>   Get string value (in context and document encoding)
>   Remove leading and trailing whitespace
>
>   If more than one reference allowed,
>     split the string into separate strings on \s+
>   else
>     look for embedded whitespace and either remove it
>     or replace it with a single %20, depending on context
>
>   For each such reference,
>     transcode the reference string to UTF-8;
>     replace entity references with corresponding character;
>     parse string into components according to an algorithm
>       equivalent to the regex in RFC3986, Appendix B;
>     further parse the authority component into its generic
>       subcomponents: [ userinfo "@" ] host [ ":" port ]
>
> To obtain the URI of a parsed reference:
>
>   Encode any disallowed characters in each component by
>      a) applying punycode to anything that is supposed to be a
>         host *and* is presented in Internet dot-notation
>         (i.e., don't punycode local names like WINS), or
>      b) pct-encoding any disallowed or component-delimiting
>         characters within each component;
>
>   Perform the (non-strict) relative resolution algorithm of
>      RFC3986, sec 5.2;
>
>   Combine the components as defined in RFC3986, sec. 5.3.
>
> To obtain the IRI (display form) of a parsed reference:
>
>   Obtain the URI;
>   Search for punycoded domain names and decode them to UTF-8;
>   Replace each sequence of pct-encoded octets that corresponds
>      to a valid UTF-8 character outside the ASCII subset with
>      that UTF-8 character [DO NOT decode encoded ASCII];
>   Transcode the string to the document (or display) encoding.
>   [Apply appropriate spoof-highlighting filters.]
>
> Most implementations store most (if not all) of these components
> or intermediate forms as a byproduct of parsing and display,
> usually in the equivalent of a DOM.
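
As a rough illustration of the parsing steps quoted above, here is a
minimal Python sketch. It is not proposed spec text. The function names
(split_references, parse_authority, parse_reference) are made up for
this example. It assumes the attribute value has already been decoded
to a Unicode string with entity references replaced, so the transcoding
and entity steps are skipped, and for embedded whitespace it picks the
"%20" option where the outline leaves the choice to context. The only
part taken directly from a spec is the regex from RFC 3986, Appendix B.

```python
import re

# Regex from RFC 3986, Appendix B. Groups 2, 4, 5, 7 and 9 hold the
# scheme, authority, path, query and fragment.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

# host is either a bracketed IP literal or anything without a colon.
HOST_PORT = re.compile(r"(\[[^\]]*\]|[^:]*)(?::(\d*))?")


def split_references(value, allow_multiple=False):
    """Trim the value, then split it or squash embedded whitespace."""
    value = value.strip()
    if allow_multiple:
        return value.split()
    # Assumption: replace embedded whitespace with a single %20.
    # The outline also allows removing it, depending on context.
    return [re.sub(r"\s+", "%20", value)]


def parse_authority(authority):
    """Split an authority into [ userinfo "@" ] host [ ":" port ]."""
    userinfo, at, hostport = authority.rpartition("@")
    match = HOST_PORT.fullmatch(hostport)
    if match:
        host, port = match.group(1), match.group(2)
    else:
        # Not in host[:port] form; keep it whole rather than guess.
        host, port = hostport, None
    return {"userinfo": userinfo if at else None, "host": host, "port": port}


def parse_reference(ref):
    """Split one reference string into its generic components."""
    match = URI_REFERENCE.match(ref)  # every group is optional, so this always matches
    scheme, authority, path, query, fragment = match.group(2, 4, 5, 7, 9)
    parts = {
        "scheme": scheme,
        "authority": authority,
        "path": path,
        "query": query,
        "fragment": fragment,
    }
    if authority is not None:
        parts.update(parse_authority(authority))
    return parts
```

For example, parse_reference("http://user@example.com:8080/a/b?x=1#frag")
yields host "example.com", port "8080" and path "/a/b". A relative
reference such as "../c?y" yields no scheme or authority, which is what
the relative resolution step in RFC 3986, sec. 5.2 would then operate on.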

That's fine with me.  I don't know what the specific text should be.
I was mostly suggesting reverting http://svn.whatwg.org/webapps@3245
as a starting point, but the text you have above seems like a
reasonable starting point as well.  It's going to take some study to
figure out exactly what the right text is, but the exact text isn't
essential to the proposal.
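
To make the display-form step in the quoted outline concrete, here is
another small Python sketch, again only an illustration and not
proposed text. The names decode_non_ascii_pct and display_host are
hypothetical. The sketch assumes the component is already a string,
decodes only pct-encoded runs that form valid non-ASCII UTF-8
characters, and leaves encoded ASCII (such as %20 or %2F) and invalid
sequences untouched. The host helper leans on Python's built-in "idna"
codec, which implements the older IDNA 2003 rules, so treat it as an
approximation.

```python
import re

# One or more consecutive pct-encoded octets, e.g. "%C3%A9".
PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_non_ascii_pct(component):
    """Decode pct-encoded non-ASCII UTF-8 characters; keep everything else."""

    def replace_run(match):
        text = match.group(0)
        octets = bytes.fromhex(text.replace("%", ""))
        out = []
        i = 0
        while i < len(octets):
            lead = octets[i]
            # Expected sequence length, judged from the lead octet.
            if lead >= 0xF0:
                length = 4
            elif lead >= 0xE0:
                length = 3
            elif lead >= 0xC0:
                length = 2
            else:
                length = 1  # ASCII or a stray continuation octet
            if length > 1:
                try:
                    out.append(octets[i:i + length].decode("utf-8"))
                    i += length
                    continue
                except UnicodeDecodeError:
                    pass  # invalid or truncated sequence: leave it encoded
            # Keep the original "%XX" triple for this octet unchanged.
            out.append(text[3 * i:3 * i + 3])
            i += 1
        return "".join(out)

    return PCT_RUN.sub(replace_run, component)


def display_host(host):
    """Decode punycoded labels in a host name, falling back to the input."""
    try:
        return host.encode("ascii").decode("idna")
    except UnicodeError:
        return host
```

With this, decode_non_ascii_pct("/caf%C3%A9%20au%2Flait") gives
"/café%20au%2Flait": the two octets for "é" are decoded, while the
encoded space and slash stay as they are. Spoof-highlighting and
transcoding to the display encoding are left out.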

Adam
Received on Thursday, 15 July 2010 02:59:33 UTC