Re: Non-hierarchical base URLs (was Re: draft-abarth-url-01 uploaded)

On Mon, May 2, 2011 at 5:42 PM, Adam Barth <ietf@adambarth.com> wrote:

> >>> However, for the subset of possible references that do happen
> >>> to match what are called valid URI references by RFC3986, then
> >>> we have already tested consensus and deployed many implementations
> >>> that conform exactly to the results given in RFC3986.
> >>
> >> If these references are something other than URIs, and must be
> >> transcoded, why is it important that the subset that happens to look
> >> syntactically like a valid URI must be processed without that
> >> transcoding step? This implies that the transcoding must be the
> >> identity encoding in some cases. Where does that assumption come from?
> >
> > Authors have been using plain old ASCII references to URIs for
> > longer than the Web has been documented.  We expect them to
> > still work.  Likewise for references that are in the document
> > encoding but only use the subset of characters that are found
> > in ASCII.  URIs are defined in terms of characters, not octets,
> > so the transcoding I am referring to is the removal of whitespace,
> > pct-encoding of non-unreserved characters, etc.  A reference that
> > is already in URI form does not need to be transcoded.
>
> You're missing the constraint that browser vendors aren't going to
> change their implementations to align with this dream.  Our choice is
> between having the specification reflect that reality or having the
> spec tell a lie.
>
> Adam
>
>
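
To make the quoted transcoding step concrete, here is a minimal sketch of
it: whitespace is removed and anything that cannot appear in a URI is
pct-encoded. The function name transcode, the choice of UTF-8 for the
octets, and the decision to pass reserved characters and "%" through
untouched are my assumptions for illustration, not text from RFC3986 or
from any browser.

```python
import string

# Character classes from RFC 3986, sections 2.2 and 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
RESERVED = set(":/?#[]@!$&'()*+,;=")
# Assumption: "%" passes through so existing pct-escapes are kept as-is.
ALLOWED = UNRESERVED | RESERVED | {"%"}


def transcode(reference: str) -> str:
    """Hypothetical transcoding of a reference string into URI characters."""
    out = []
    for ch in reference:
        if ch.isspace():
            continue  # whitespace is removed, not encoded
        if ch in ALLOWED:
            out.append(ch)  # already a legal URI character
        else:
            # Assumption: encode the UTF-8 octets of anything else.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


if __name__ == "__main__":
    print(transcode("http://example.com/a?b=c#d"))
    print(transcode(" http://example.com/caf\u00e9 "))
```

Under these assumptions, a reference that is already in URI form comes out
unchanged, which is the identity case being argued about above. The sketch
does not check that "%" is followed by two hex digits.
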
Pulling back in something Boris said:

In Gecko's case, I believe there are 4 different categories.  We have one
parsing setup for "non-hierarchical" schemes (view-source, data, javascript,
about, etc), and 3 different parsing setups for "hierarchical" ones (http,
ftp, file, chrome; see the URLTYPE_* constants at
http://hg.mozilla.org/mozilla-central/file/c062731105cf/netwerk/base/public/nsIStandardURL.idl#l53
which happen to document how the parsing differs based on the different
type).

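As a rough sketch of what dispatching on scheme looks like, the following
sorts an input string into a parsing class by its scheme. This is not
Gecko's code: the names scheme_of and parsing_class are hypothetical, the
scheme lists contain only the examples named above, and the three
"hierarchical" setups are collapsed into one because the quoted text does
not say which scheme uses which.

```python
import string
from typing import Optional

# Schemes named in the quoted text; both lists are illustrative only.
NON_HIERARCHICAL = {"view-source", "data", "javascript", "about"}
HIERARCHICAL = {"http", "ftp", "file", "chrome"}

_SCHEME_START = set(string.ascii_letters)
_SCHEME_REST = set(string.ascii_letters + string.digits + "+-.")


def scheme_of(url: str) -> Optional[str]:
    """Return the lowercased scheme, or None if the string has no scheme."""
    head, sep, _rest = url.partition(":")
    if not sep or not head or head[0] not in _SCHEME_START:
        return None
    if any(ch not in _SCHEME_REST for ch in head):
        return None
    return head.lower()


def parsing_class(url: str) -> str:
    """Pick a parsing class from the scheme alone (hypothetical helper)."""
    scheme = scheme_of(url)
    if scheme in NON_HIERARCHICAL:
        return "non-hierarchical"
    if scheme in HIERARCHICAL:
        return "hierarchical"
    return "unknown"  # no scheme, or a scheme outside the example lists


if __name__ == "__main__":
    for candidate in ("data:text/plain,hi", "HTTP://example.com/", "example.com"):
        print(candidate, "->", parsing_class(candidate))
```

Note that a "bare" string such as "example.com" has no scheme to dispatch
on, so it falls into the "unknown" bucket here.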

I think the specification should note what the current behavior is, even if
what it is specifying is different from that behavior.  There seem to be a
couple of key questions here, but from the outside of this exchange, it's
getting hard to puzzle out the answers.  From my perspective, those
questions are:

1) For the classes of input strings important to HTML5, is there a baseline
behavior which is present for all classes?  Can we document that baseline
behavior?

2) Even if there is a baseline behavior, it is clear that at least some
additional behavior is required for specific classes.  The categorization
"hierarchical" and "non-hierarchical" doesn't seem to capture the full
parsing setup (at least for Gecko, according to Boris).  Is it still a
valuable classification?  Is anything other than processing specific to
the type of input string likely to be valuable?  (Note that scheme-specific
processing is one kind of processing specific to the input string type.)

3) Things that are not web browsers use URIs, and the amount of
cross-context embedding is large (web pages containing mailto URIs, SNMP
MIBs containing http URIs, SIP exchanges containing tel URIs and so on).
Defining behavior to be context-dependent and to survive embedding seems to
require some mechanism that is not currently present (especially given the
desire others have expressed to have things like BiDi work correctly with
"bare" URIs like "example.com").  If a context marker is created, will it
actually be used?

My personal view is that a fork here is a bad idea, as it will be very hard
to determine on what branch of the fork many strings should be evaluated.  I
hope others share my concern with that and are willing to compromise to
avoid it.

regards,

Ted Hardie

Received on Tuesday, 3 May 2011 01:14:38 UTC