Re: Scope question from Roy T. Fielding on 2010-05-06 (public-iri@w3.org from May 2010)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Wed, 5 May 2010 18:56:37 -0700
To: Adam Barth <ietf@adambarth.com>
Cc: "Phillips, Addison" <addison@lab126.com>, "public-iri@w3.org" <public-iri@w3.org>
Message-Id: <735C381F-0D23-47DC-A5B9-A22F8BEAEE85@gbiv.com>

On May 5, 2010, at 5:31 PM, Adam Barth wrote:
> On Wed, May 5, 2010 at 5:09 PM, Roy T. Fielding <fielding@gbiv.com> wrote:
>> Please understand that browsers almost never parse URI or IRI or
>> anything in between.  Browsers have input strings that contain one
>> or more references, usually in the document encoding, and so there
>> is a sequence of context-specific and charset-specific and
>> media-type-specific processing that occurs before you even get to
>> the individual URI-reference or IRI-reference that are defined by
>> 3986/3987.
> 
> Where are those rules defined (e.g., for HTML documents)?  I suspect
> that's the layer that interests me at the moment.

The pre-processing is defined in HTML4, for things like href and src
attributes, and nowhere for things like the location bar.  There is
no single standard way of doing it.  What is standard and defined by
3986 is how to encode non-URI characters and then interpret the
extracted reference relative to the base URI in order to obtain
the target URI.
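
As a rough illustration of that standard part (not of any browser's
pre-processing), here is a minimal Python sketch. The function name
to_target_uri and the URI_SAFE constant are hypothetical, UTF-8 is
assumed as the encoding for non-URI characters, and the standard
library's urljoin stands in for the 3986 section 5 resolution.

```python
from urllib.parse import quote, urljoin

# Characters a URI may contain besides letters and digits: the
# unreserved and reserved sets of RFC 3986, plus "%" so that existing
# pct-escapes are left alone (assumption: a stray "%" is not repaired).
URI_SAFE = "-._~:/?#[]@!$&'()*+,;=%"


def to_target_uri(reference: str, base: str) -> str:
    """Pct-encode non-URI characters, then resolve against the base."""
    encoded = quote(reference, safe=URI_SAFE)  # UTF-8 by default
    return urljoin(base, encoded)


print(to_target_uri("../caf\u00e9?q=1", "http://example.com/a/b/c"))
```

The input here is assumed to be a single, already-extracted reference
with no surrounding whitespace.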

>> Some people have proposed that most of that pre-processing be added
>> to the IRIbis spec, but I have seen no evidence to suggest that
>> such pre-processing is even remotely standardizable (it seems to
>> be different for every input context).  If you can demonstrate or
>> get agreement on a single way to preprocess an input string, or at
>> least a few named processes (like single-ref and multi-ref), then
>> that would be useful.
> 
> It seems likely that this would be possible and valuable for at least
> some widely used contexts (e.g., UTF8-encoded HTML documents).

Yes, but keep in mind there are at least three different contexts
within just UTF8-encoded HTML.  It would be great if we could reduce
that to at most 2 (one for singleton references and one for
space-separated references).
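
To make the two target contexts concrete, here is a minimal Python
sketch. The names single_ref and multi_ref are hypothetical, and the
set of characters treated as whitespace is an assumption made for the
example, not something any spec in this thread has agreed on.

```python
import re
from typing import List

# Assumption: space, tab, LF, FF and CR count as separators.
HTML_SPACE = " \t\n\f\r"


def single_ref(value: str) -> str:
    """Singleton context: trim the ends, keep interior spaces."""
    return value.strip(HTML_SPACE)


def multi_ref(value: str) -> List[str]:
    """Space-separated context: every run of whitespace is a separator."""
    parts = re.split("[" + HTML_SPACE + "]+", value.strip(HTML_SPACE))
    return [part for part in parts if part]


print(single_ref("  a b.html \n"))
print(multi_ref(" a.html  b.html\tc "))
```

The difference that matters is the interior space: in the singleton
context it survives and has to be pct-encoded later, while in the
space-separated context it ends one reference and starts the next.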

>> It would have no effect on RFC 3986.  The only thing that would
>> impact 3986 is if the allowed characters or major components
>> changed in the wire syntax of the URI standard, which is simply
>> not going to happen because that would break a majority of
>> implementations (of which browsers make up less than 1%).
>> As far as 3986 is concerned, your algorithm is in Appendix B.
>> Note that the algorithm will work with any superset of ASCII.
> 
> I don't have an algorithm yet, but, according to my understanding of
> your email, the algorithm in Appendix B appears to be a constraint on the
> *output* of the media/context-specific transformation that interests
> me.

Right, it is one algorithm that provides a consistent answer no
matter what is in the input string, assuming that the input has
no leading or trailing whitespace and consists of only one reference.
Something like that algorithm was implemented (at least in terms of
output) by most implementations and is known to be interoperable
for valid URIs.  However, it does not include steps for pct-encoding
non-URI characters or case-normalizing the case-insensitive ones,
since that isn't the role of parsing (i.e., you wouldn't want to
do that in an original-preserving editor).
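
For reference, the Appendix B regular expression can be exercised
directly. The Python sketch below uses the expression as given in
RFC 3986; the function name split_reference and the dictionary keys are
hypothetical conveniences for the example.

```python
import re

# Regular expression from RFC 3986, Appendix B.
APPENDIX_B = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref: str) -> dict:
    """Split one reference into its five components, unmodified."""
    m = APPENDIX_B.match(ref)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
print(split_reference("HTTP://Example.COM/caf\u00e9?x=1"))
```

Components that are absent come back as None, and nothing is
pct-encoded or case-normalized: the second call returns "HTTP",
"Example.COM" and the non-ASCII path exactly as they were written.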

>> IRI (3987) is more flexible because there are no wire implementations
>> that depend on its constraints -- it could just as easily have
>> been defined as an "any string" conversion/presentation process,
>> which would have satisfied the scope you are looking for if there
>> is sufficient agreement among implementations.
> 
> I didn't understand this paragraph, but I'm not sure it's essential to
> our discussion.

There is an old debate about whether IRI should be an identifier syntax
of its own, for the sake of writing addresses on the side of a bus or
for use unencoded within some future wire protocol, or if it should be
the colloquial term for any i18n string that can be converted to a URI.
The difference is in how invalid input is "handled" by the spec.

If IRIbis decided to define IRI as "any string", then your algorithm
would be in scope as one way to translate any string into an address
that can then be converted for use as a URI.  I still don't know if
it would be the one true algorithm, since that would depend on many
more implementations than it sounds like you are going to test, but
it certainly couldn't hurt to know what the browsers do today.

If IRIbis decided to define IRI as a valid identifier, then your
investigation would still be in scope.  How the result would fit in with
the rest of the specification is unknown -- perhaps as defining
some other term, like Larry was using HRef (yuck), or as an appendix
like the one in URI.

....Roy
Received on Thursday, 6 May 2010 01:57:07 UTC