RE: Advice on making IRI document suitable for reference by HTML (and other specs)

(Reply to Erik van der Poel's message)

It would help quite a bit if everyone could focus on
the documents we already have. Either starting with


or, if you really think a better starting point 

> URL processing can be divided into parsing and resolution.

I think that's what iri-bis-07 tries to do, to define IRI
processing as "parse" then "resolve" rather than
"convert to URI" then URI-based "parse and resolve".

> The DOM
> interfaces can be used to access the output of the parsing phase,
> including, in the case of the DOM href interface, the absolute URL
> that was produced by resolving a relative URL against a base URL.

It's not clear to me whether relative resolution is against
a base URL (IRI) or against a set of parsed components. In
particular, if the input "base" is malformed or invalid in
some way, is that remembered?

>  It
> appears that many of the major browsers return Unicode in the DOM
> interfaces, even when the host was originally in Punycode (in the
> HTML).

For all input? Only for HTTP? What about "widget" URIs where
the "authority" was going to be used for intra-package references?
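To make the question concrete, here is a minimal sketch of a
scheme-dependent "Punycode to Unicode" step. The function name
display_host and the DNS_HOST_SCHEMES set are hypothetical,
chosen only for illustration; no browser is claimed to work
this way, and the sketch skips the validation that real IDNA
processing requires.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes carry a DNS host name in the authority.
DNS_HOST_SCHEMES = {"http", "https", "ftp"}


def display_host(url: str) -> str:
    """Return the host as it might be shown to a script or a user."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.scheme not in DNS_HOST_SCHEMES:
        # e.g. "widget": the authority is not a DNS name, leave it alone.
        return host
    labels = []
    for label in host.split("."):
        if label.startswith("xn--"):
            # Strip the ACE prefix and decode the rest as Punycode.
            label = label[4:].encode("ascii").decode("punycode")
        labels.append(label)
    return ".".join(labels)


http_host = display_host("http://xn--bcher-kva.example/")
widget_host = display_host("widget://xn--bcher-kva/index.html")
```

Under these assumptions the http host comes back in Unicode
while the widget authority is returned unchanged, which is
exactly the kind of per-scheme difference a spec would need
to state.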

> How much of this should be in the HTML spec, and how much in
> the DOM spec? This is also a "split", as Ian calls it.

I'm confused. There is no separate "DOM spec" currently, and
no one is proposing one. Perhaps you mean "the URL spec" or
"the IRI spec"?

> I think the IRI definition can and should reasonably contain advice
> for combining relative IRI and absolute IRI in a way that would
> be useful for content types other than HTML.

Uh, of course. Right now, the definition of "relative resolution"
is in the URI standard (RFC 3986) and not explicitly in the
iri-bis draft (not in -07 either).  What iri-rewrite says is

# The term resolve (in the context of resolve a URL relative
# to another URL) is used to describe the process of combining
# two strings: an original URL and a base URL (usually an
# absolute URL) to obtain parsed components; these parsed
# components may then be recombined to construct a new URL.
# This is accomplished by parsing the original and base URLs
# (preprocessing by section 7.2 of [draft-duerst-iri-bis]
# first, then matching against the productions of section
# 3.2 of [draft-duerst-iri-bis]) but then combining the
# original and base components following the algorithms
# in section 5.2 of [RFC 3986], but applied to the Unicode
# characters which constitute the original and base.

Adding that to the iri-bis documents (with "IRI" substituted
for "URL") seems like it might be useful.
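As an illustration of "applied to the Unicode characters", here
is a sketch of the path-merging and dot-segment-removal steps
from RFC 3986 sections 5.2.3 and 5.2.4, run directly on Unicode
strings with no percent-encoding step. The function names are
mine, and the sketch covers only the path component, not the
full section 5.2.2 algorithm.

```python
def remove_dot_segments(path: str) -> str:
    """RFC 3986 section 5.2.4, operating on Unicode characters."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./x" becomes "/x"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../x" becomes "/x"
            if output:
                output.pop()         # drop the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/") to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def merge(base_path: str, ref_path: str, base_has_authority: bool) -> str:
    """RFC 3986 section 5.2.3."""
    if base_has_authority and base_path == "":
        return "/" + ref_path
    return base_path[: base_path.rfind("/") + 1] + ref_path


def resolve_path(base_path: str, ref_path: str) -> str:
    """Resolve a relative path reference against a base path."""
    return remove_dot_segments(merge(base_path, ref_path, True))


resolved = resolve_path("/a/b", "../é")
```

Nothing in these steps needs the input to be ASCII; the only
characters they inspect are "/" and ".", so the non-ASCII
characters pass through untouched.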

> The output of the resolution phase includes such things as the HTTP
> Request-URI. The major HTML implementations all convert the ?query
> part back to the original character encoding of the HTML before
> placing it in the HTTP request. How much of this should be in the HTML
> spec, and how much in the IRIbis spec? This is part of Ian's "split"
> question.

I'm concerned, and would like to minimize the impact of this
by encouraging HTML implementations to convert query
parameters to percent-hex-encoded form *before* constructing
the IRI-with-query when the query is being constructed for a
non-Unicode document. This would reduce the number of URLs
floating around whose interpretation depends on the context
in which the URL/IRI is embedded, and which require
scheme-specific processing that is not always available.
Currently, a general "convert HTML document to UTF-8"
processor can't work properly without examining all of the
URLs and the scripts that produce URLs, rewriting all of the
query parameters on http URLs, and (impossibly) analyzing
the scripts and rewriting them.
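A rough sketch of what I mean, assuming a document in
ISO-8859-1 and a hypothetical helper encode_query_value; the
example.org URL is made up for illustration:

```python
from urllib.parse import quote


def encode_query_value(value: str, document_encoding: str) -> str:
    """Percent-encode a query value using the document's own encoding."""
    # Encode to bytes in the legacy encoding first, then percent-encode
    # every byte that is not an unreserved character.
    return quote(value.encode(document_encoding), safe="")


# Query built for a document whose encoding is ISO-8859-1.
legacy_url = "http://example.org/search?q=" + encode_query_value(
    "café", "iso-8859-1"
)

# The same value, had the document been UTF-8.
utf8_url = "http://example.org/search?q=" + encode_query_value(
    "café", "utf-8"
)
```

Because the resulting URL is pure ASCII, its bytes are the same
whether the surrounding document is stored as ISO-8859-1 or
converted to UTF-8, so its interpretation no longer depends on
the embedding context.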

Received on Saturday, 2 January 2010 17:45:02 UTC