- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Wed, 14 Jul 2010 19:46:20 -0700
- To: Adam Barth <w3c@adambarth.com>
- Cc: HTML WG <public-html@w3.org>
On Jul 14, 2010, at 6:12 PM, Adam Barth wrote:

> == Proposal Details ==
>
> The proposal details herein takes the form of a set of edit
> instructions, specific enough that they can be applied without
> ambiguity:
>
> 1) Revert http://svn.whatwg.org/webapps@3245. (Note: the editor and
> the working group should feel free to continue to improve this text
> after adopting this change proposal.)

Er, the link doesn't work, but the original text that you intend
to restore is not consistent with your change proposal. The text
that I originally objected to does not recognize the distinction
between input strings and URIs, and in fact deliberately misuses
the term URL in a misguided attempt to "fix" a problem that never
existed in the first place. Restoring bad text will not address
the issues in your rationale.

Why not just propose text that addresses the issue? Forget about
the algorithm that used to be in the spec -- it was not accurate
anyway and certainly does not reflect interoperable implementation.
Parsing is not that hard to describe:

   Get string value (in context and document encoding)
   Remove leading and trailing whitespace
   If more than one reference allowed,
      split the string into separate strings on \s+
   else
      look for embedded whitespace and either remove it or replace
      it with a single %20, depending on context
   For each such reference,
      transcode the reference string to UTF-8;
      replace entity references with corresponding character;
      parse string into components according to an algorithm
      equivalent to the regex in RFC3986, Appendix B;
      further parse the authority component into its generic
      subcomponents: [ userinfo "@" ] host [ ":" port ]

To obtain the URI of a parsed reference:

   Encode any disallowed characters in each component by
      a) applying punycode to anything that is supposed to be a
         host *and* is presented in Internet dot-notation
         (i.e., don't punycode local names like WINS), or
      b) pct-encode any disallowed or component-delimiting
         characters within each component;
   Perform the (non-strict) relative resolution algorithm of
   RFC3986, sec 5.2;
   Combine the components as defined in RFC3986, sec. 5.3.

To obtain the IRI (display form) of a parsed reference:

   Obtain the URI;
   Search for punycoded domain names and decode them to UTF-8;
   Replace each sequence of pct-encoded octets that correspond to
   a valid UTF-8 character outside the ASCII subset with that
   UTF-8 character [DO NOT decode encoded ASCII];
   Transcode the string to the document (or display) encoding.
   [Apply appropriate spoof-highlighting filters.]

Most implementations store most (if not all) of these components
or intermediate forms as a byproduct of parsing and display,
usually in the equivalent of a DOM.

....Roy
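
For illustration only, the component-splitting steps described above
(trim whitespace, apply the RFC 3986 Appendix B regex, split the
authority, and recombine per sec. 5.3) might be sketched in Python
roughly as follows. The function names split_reference,
split_authority, and recompose are hypothetical and do not come from
any spec or library. Using the last "@" to delimit userinfo is an
assumption made for lenient input. Encoding, punycode, and relative
resolution are deliberately left out.

```python
import re

# The regular expression from RFC 3986, Appendix B.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref):
    """Trim whitespace and split a reference into generic components.

    Components that are absent are None; the path is always a string.
    """
    m = URI_REFERENCE.match(ref.strip())
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def split_authority(authority):
    """Split authority into (userinfo, host, port).

    Follows the shape [ userinfo "@" ] host [ ":" port ].
    Assumption: the last "@" ends the userinfo.
    """
    userinfo, at, hostport = authority.rpartition("@")
    if not at:
        userinfo = None
    if hostport.startswith("[") and "]" in hostport:
        # IP-literal such as [::1]; colons inside brackets are not ports.
        end = hostport.index("]")
        host, rest = hostport[: end + 1], hostport[end + 1:]
        port = rest[1:] if rest.startswith(":") else None
    else:
        host, colon, port = hostport.partition(":")
        if not colon:
            port = None
    return userinfo, host, port


def recompose(c):
    """Combine components as in RFC 3986, sec. 5.3."""
    result = ""
    if c["scheme"] is not None:
        result += c["scheme"] + ":"
    if c["authority"] is not None:
        result += "//" + c["authority"]
    result += c["path"]
    if c["query"] is not None:
        result += "?" + c["query"]
    if c["fragment"] is not None:
        result += "#" + c["fragment"]
    return result
```

Because absent components are kept as None rather than as empty
strings, recompose(split_reference(s)) gives back the trimmed input
unchanged, which is the distinction sec. 5.3 relies on between an
undefined component and an empty one.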
Received on Thursday, 15 July 2010 02:46:50 UTC