Re: Advice on making IRI document suitable for reference by HTML (and other specs) from Erik van der Poel on 2010-01-03 (public-iri@w3.org from January 2010)

From: Erik van der Poel <erikv@google.com>
Date: Sat, 2 Jan 2010 17:16:27 -0800
To: Larry Masinter <masinter@adobe.com>
Cc: "Phillips, Addison" <addison@amazon.com>, "Roy T. Fielding" <fielding@gbiv.com>, "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <c07a32651001021716x2f56c64ew7a8a3f61ff16781f@mail.gmail.com>

>> The DOM
>> interfaces can be used to access the output of the parsing phase,
>> including, in the case of the DOM href interface, the absolute URL
>> that was produced by resolving a relative URL against a base URL.
>
> It's not clear to me whether relative-resolution is against
> a base URL (IRI) or against a set of parsed components. In
> particular, if the input "base" is malformed or invalid in
> some way, is that remembered?

I guess we ought to test this in a few browsers, and then decide what
to do about the spec. I haven't tested it very extensively, but I have
seen some differences between the browsers, e.g. MSIE appears to
accept <base href="www.example.com/foo"> (without the leading
"http://"). Of course, in some cases, it might be better to fix the
implementation(s) than to modify the spec.
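
As a rough illustration of why the scheme-less base matters, here is a minimal sketch using Python's urllib.parse rather than any browser. It is only an assumption that this library is a useful stand-in; the helper name resolve is hypothetical, and browsers such as MSIE may well behave differently.

```python
from urllib.parse import urljoin, urlsplit

def resolve(base: str, relative: str) -> str:
    """Resolve a relative reference against a base with urllib.parse.urljoin."""
    return urljoin(base, relative)

# A base with an explicit scheme resolves to an absolute URL.
well_formed = resolve("http://www.example.com/foo", "bar")

# Without the leading "http://", the base is treated as a relative path,
# so the result has no scheme and no authority component.
schemeless = resolve("www.example.com/foo", "bar")

print(well_formed)
print(schemeless)
print(repr(urlsplit(schemeless).scheme), repr(urlsplit(schemeless).netloc))
```

In this library the scheme-less base is not "remembered" as a host at all, which is one possible answer to the question above, though not necessarily the one browsers give.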

>> It
>> appears that many of the major browsers return Unicode in the DOM
>> interfaces, even when the host was originally in Punycode (in the
>> HTML).
>
> For all input? Only for HTTP?

I have run a few manual tests with HTTP, but I am still working on
automating them.
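
To make the Punycode/Unicode distinction concrete, here is a small sketch using Python's built-in "idna" codec (which implements IDNA 2003). The function names and the example host are hypothetical, and this says nothing about what any particular browser's DOM actually returns.

```python
def host_to_unicode(host: str) -> str:
    """Convert an ASCII (Punycode) host name to its Unicode form."""
    return host.encode("ascii").decode("idna")

def host_to_ascii(host: str) -> str:
    """Convert a Unicode host name to its ASCII (Punycode) form."""
    return host.encode("idna").decode("ascii")

# The host as it might appear in the HTML source.
punycode_host = "xn--bcher-kva.example"

# The form that a DOM interface returning Unicode would expose.
unicode_host = host_to_unicode(punycode_host)

print(unicode_host)
print(host_to_ascii(unicode_host))
```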

> What about "widget" URIs where
> the "authority" was going to be used for intra-package references?

I have no idea. If somebody would like to take this part on, that
would be great.

>> How much of this should be in the HTML spec, and how much in
>> the DOM spec? This is also a "split", as Ian calls it.
>
> I'm confused. There is no separate "DOM spec" currently, and
> no one is proposing one. Perhaps you mean "the URL spec" or
> "the IRI spec?"

Here are a couple of relevant DOM specs:

http://www.w3.org/TR/2003/REC-DOM-Level-2-HTML-20030109/html.html#ID-48250443
http://dev.w3.org/html5/spec/Overview.html#htmlanchorelement

>> The output of the resolution phase includes such things as the HTTP
>> Request-URI. The major HTML implementations all convert the ?query
>> part back to the original character encoding of the HTML before
>> placing it in the HTTP request. How much of this should be in the HTML
>> spec, and how much in the IRIbis spec? This is part of Ian's "split"
>> question.
>
> I'm concerned, and would like to minimize the impact of this
> by encouraging HTML implementations to convert query
> parameters to percent-hex-encoded form *before* constructing
> the IRI-with-query in the case where the query is being
> constructed for a non-Unicode document. This would reduce
> the number of URLs floating around whose interpretation
> depends on the context in which the URL/IRI is embedded,
> and which require scheme-specific processing, which is not
> always available. Currently, a general "convert HTML document
> to UTF-8" processor can't work properly without examining
> all of the URLs and scripts that produce URLs, rewriting
> all of the query parameters on http URLs, and (impossibly)
> analyzing the scripts and rewriting them.

I agree that one of the advantages of a "self-contained" IRI is that
you do not need a separate piece of info (the original character
encoding) to convert the ?query part back to the encoding expected by
the server.
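
A short sketch of the dependency, assuming Python's urllib.parse as a stand-in for what an implementation does when it builds the request. The helper name encode_query_value and the sample value are hypothetical.

```python
from urllib.parse import quote, unquote

def encode_query_value(value: str, document_encoding: str) -> str:
    """Percent-encode a query value using the document's character encoding."""
    return quote(value, safe="", encoding=document_encoding)

# The same non-ASCII text yields different bytes on the wire depending on
# the encoding of the document it came from.
latin1 = encode_query_value("café", "iso-8859-1")
utf8 = encode_query_value("café", "utf-8")

print(latin1)
print(utf8)

# Once percent-encoded, the value only decodes correctly if the reader
# knows which encoding was used.
print(unquote(latin1, encoding="iso-8859-1"))
print(unquote(latin1, encoding="utf-8"))
```

Percent-encoding before the IRI is constructed pins down the bytes, but the encoding itself still has to be known by whoever interprets them on the server side.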

However, at this point, it is not clear to me how many of the current
implementations leave the ?query part in non-ASCII format (rather than
converting back to the original encoding and percent-encoding) when
accessing the DOM interfaces. I also do not know whether any of the
implementers would be willing to change something like this.

By the way, other than the DOM interfaces, where do you expect
browsers to output IRIs? The URL field (address bar) and status bar
perhaps? Certainly not in the HTTP request headers, right?

Erik

Received on Sunday, 3 January 2010 01:17:04 UTC