- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Wed, 5 May 2010 18:56:37 -0700
- To: Adam Barth <ietf@adambarth.com>
- Cc: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>
On May 5, 2010, at 5:31 PM, Adam Barth wrote:

> On Wed, May 5, 2010 at 5:09 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
>> Please understand that browsers almost never parse URI or IRI or
>> anything in between. Browsers have input strings that contain one
>> or more references, usually in the document encoding, and so there
>> is a sequence of context-specific and charset-specific and
>> media-type-specific processing that occurs before you even get to
>> the individual URI-reference or IRI-reference that are defined by
>> 3986/3987.
>
> Where are those rules defined (e.g., for HTML documents)? I suspect
> that's the layer that interests me at the moment.

The pre-processing is defined in HTML4, for things like href and src
attributes, and nowhere for things like the location bar. There is
no single standard way of doing it. What is standard and defined by
3986 is how to encode non-URI characters and then interpret the
extracted reference relative to the base URI in order to obtain the
target URI.

>> Some people have proposed that most of that pre-processing be added
>> to the IRIbis spec, but I have seen no evidence to suggest that
>> such pre-processing is even remotely standardizable (it seems to
>> be different for every input context). If you can demonstrate or
>> get agreement on a single way to preprocess an input string, or at
>> least a few named processes (like single-ref and multi-ref), then
>> that would be useful.
>
> It seems likely that this would be possible and valuable for at least
> some widely used contexts (e.g., UTF8-encoded HTML documents).

Yes, but keep in mind there are at least three different contexts
within just UTF8-encoded HTML. It would be great if we could reduce
that to at most 2 (one for singleton references and one for
space-separated references).

>> It would have no effect on RFC 3986.
>> The only thing that would
>> impact 3986 is if the allowed characters or major components
>> changed in the wire syntax of the URI standard, which is simply
>> not going to happen because that would break a majority of
>> implementations (of which browsers make up less than 1%).
>> As far as 3986 is concerned, your algorithm is in Appendix B.
>> Note that the algorithm will work with any superset of ASCII.
>
> I don't have an algorithm yet, but, according to my understanding of
> your email, the algorithm in Appendix B appears to be a constraint on the
> *output* of the media/context-specific transformation that interests
> me.

Right, it is one algorithm that provides a consistent answer no
matter what is in the input string, assuming that the input has no
leading or trailing whitespace and consists of only one reference.
Something like that algorithm was implemented (at least in terms of
output) by most implementations and is known to be interoperable for
valid URI. However, it does not include steps for pct-encoding
non-URI characters or case-normalizing the case-insensitive ones,
since that isn't the role of parsing (i.e., you wouldn't want to do
that in an original-preserving editor).

>> IRI (3987) is more flexible because there are no wire implementations
>> that depend on its constraints -- it could just as easily have
>> been defined as an "any string" conversion/presentation process,
>> which would have satisfied the scope you are looking for if there
>> is sufficient agreement among implementations.
>
> I didn't understand this paragraph, but I'm not sure it's essential to
> our discussion.

There is an old debate about whether IRI should be an identifier
syntax of its own, for the sake of writing addresses on the side of
a bus or for use unencoded within some future wire protocol, or if
it should be the colloquial term for any i18n string that can be
converted to a URI. The difference is in how invalid input is
"handled" by the spec.

If IRIbis decided to define IRI as "any string", then your algorithm
would be in scope as one way to translate any string into an address
that can then be converted for use as a URI. I still don't know if
it would be the one true algorithm, since that would depend on many
more implementations than it sounds like you are going to test, but
it certainly couldn't hurt to know what the browsers do today.

If IRIbis decided to define IRI as a valid identifier, then your
investigation would still be in scope. How the result would fit in
with the rest of the specification is unknown -- perhaps as defining
some other term, like Larry was using HRef (yuck), or as an appendix
like the one in URI.

....Roy
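
For readers who want to see the Appendix B parse discussed above in
concrete form, here is a minimal Python sketch. The regular expression
is the one given in RFC 3986, Appendix B; the function name
split_reference and the dictionary keys are illustrative assumptions,
not part of either RFC. As the message notes, the sketch assumes a
single reference with no leading or trailing whitespace, and it does
no pct-encoding or case normalization.

```python
import re

# Regular expression from RFC 3986, Appendix B. Every part is optional,
# so it yields an answer for any input string.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref):
    """Split one reference into its five components.

    Hypothetical helper for illustration. Components that are absent
    come back as None; the path is always a string (possibly empty).
    Nothing is decoded, encoded, or normalized.
    """
    m = URI_REFERENCE.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

Applied to the RFC's own example,
split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related") gives
scheme "http", authority "www.ics.uci.edu", path "/pub/ietf/uri/", no
query, and fragment "Related". Non-ASCII characters and spaces pass
through untouched, which is the original-preserving behavior described
above; converting them for the wire would be a separate step.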
Received on Thursday, 6 May 2010 01:57:07 UTC