Re: BiDi IRI deployment? from Frank Ellermann on 2008-04-29 (www-international@w3.org from April to June 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Tue, 29 Apr 2008 04:54:05 +0200
To: www-international@w3.org
Message-ID: <fv62fa$un5$1@ger.gmane.org>

Erik van der Poel wrote:

> I notice that you did not address the query part in your response.

IIRC RFC 3987 has no special rules for anything except <ihost>;
it's always "transform legacy charset to UTF-8 and then
percent-encode" to get the equivalent URI.
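A minimal sketch of that transformation in Python, assuming the legacy charset of the octets is already known. The helper name `legacy_to_uri_component` is made up for illustration, and the sketch percent-encodes a single component (no delimiters are kept as-is):

```python
from urllib.parse import quote


def legacy_to_uri_component(raw: bytes, charset: str) -> str:
    """Hypothetical helper: legacy octets -> percent-encoded UTF-8."""
    # Step 1: interpret the octets in the legacy charset (e.g. KOI8-R).
    text = raw.decode(charset)
    # Step 2: quote() encodes the text as UTF-8 by default and
    # percent-encodes every octet that is not an unreserved character.
    return quote(text, safe="")


koi8_octets = "привет".encode("koi8-r")
print(legacy_to_uri_component(koi8_octets, "koi8-r"))
```

The same function works for any charset Python has a codec for; only the decode step depends on the legacy encoding.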

Any magic with, say, iri= parameters in an <iquery> happens on
the server; servers, like IRI producers, are supposed to know
how to handle any IRI in its URI-equivalent form.

The critical part, IDNA support, is handled by the producer and
the server; the clients and consumers can be obsolete.
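For the <ihost> special case, a rough illustration of what a producer might do, assuming Python's built-in "idna" codec (which implements the IDNA 2003 rules). The helper name `ihost_to_ascii` and the host name are only examples:

```python
def ihost_to_ascii(ihost: str) -> str:
    """Hypothetical helper: convert an internationalized host name
    to its ASCII (ACE) form, label by label."""
    # The stdlib "idna" codec applies nameprep and Punycode per label.
    return ihost.encode("idna").decode("ascii")


print(ihost_to_ascii("bücher.example"))
```

A client that only ever sees the resulting ASCII host does not need to know anything about IDNA.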

> Since URIs and IRIs do not have the "accept-charset" that HTML
> forms have, the "best practice" would be to use a charset that
> can encode all of Unicode (e.g. UTF-8).

Yes.  But legacy charsets do not need to be a problem.  Missing
characters can be given as NCRs, they are Unicode by definition
in any (X)(HT)ML document.  On the KOI8-R test page I have NCRs
for Greek characters.  "Only" all non-ASCII octets are KOI8-R.
In theory user agents can get this right when they know KOI8-R.

All is lost if they send the octets to the clipboard "as is"
without saying what they are; they had better transform KOI8-R
and NCRs to UTF-8 before talking with a clipboard.
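As a sketch of what "transform KOI8-R and NCRs to UTF-8" could look like, assuming the text has already been extracted from the markup and the page charset is known. The helper name `page_text_to_utf8` is hypothetical:

```python
import html


def page_text_to_utf8(raw: bytes, charset: str = "koi8-r") -> bytes:
    """Hypothetical helper: legacy octets plus NCRs -> UTF-8 octets."""
    # Non-ASCII octets are interpreted in the page's legacy charset.
    text = raw.decode(charset)
    # NCRs such as &#945; always denote Unicode code points, whatever
    # the page charset is. Note that html.unescape() also resolves
    # named references like &amp;.
    text = html.unescape(text)
    return text.encode("utf-8")


sample = "да ".encode("koi8-r") + b"&#945;"
print(page_text_to_utf8(sample).decode("utf-8"))
```

After this step the clipboard (or anything else) gets plain UTF-8 and no longer needs to be told about KOI8-R.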

But problems with forms, legacy charsets, and clipboards are not
an IRI problem, or rather I don't see where IRIs make this worse.

> The &#NNNNN; syntax has the advantage that it is consistent
> with de facto HTML form handling. (The server does not know
> whether the client started with an HTML form or an href.)

ACK, I normally prefer US-ASCII with NCRs for very limited uses
of non-ASCII, but that is only because I rarely need non-ASCII;
that is not an option in most languages.

> The IRIbis author(s) may wish to make this part optional (e.g.
> a profile), so that applications other than HTML can still opt
> for the "clean" solution (query part in escaped UTF-8).

AFAIK there is no standard for query parts, name=value pairs 
separated by "&" are only a popular convention, not mandated in
RFC 3986.  The required syntax is "begins after first ?", some
characters like space, "[", "<", ">", and "]" cannot occur in
a query part (percent-encoded is okay, a raw "?" is also okay),
and "#" or the end of the URI, e.g., indicated by ">", is the
end of the query.
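The delimiting rule can be sketched in a few lines of Python. The helper name `query_part` is made up, and the sketch only locates the query; it does not check which characters occur in it:

```python
from typing import Optional


def query_part(uri: str) -> Optional[str]:
    """Hypothetical helper: return the query of a URI, or None."""
    # The first "#" starts the fragment, so the query ends there
    # (or at the end of the URI if there is no fragment).
    before_fragment, _, _fragment = uri.partition("#")
    # The query begins after the first "?"; later raw "?" are data.
    _, found, query = before_fragment.partition("?")
    return query if found else None


print(query_part("http://example.org/p?a=1?b#frag"))
print(query_part("http://example.org/p#frag?x"))
```

Whatever the function returns is opaque as far as RFC 3986 is concerned; splitting it into name=value pairs is up to the application.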

Anything else *within* the query is free style - for some time
folks tried to establish ";" instead of "&" as separator.  IMO
it would be a bad idea if Martin starts to talk about issues
not specified for URIs in 3987bis.  The magic of RFC 3987 is
that it's straightforward.  Admittedly I ignore "legacy IRIs"
(a few MAYs) and "IRI comparison" in RFC 3987.

Query-part problems are not IRI problems; they already existed
before and have to be addressed elsewhere, not in 3987bis.

 Frank

Received on Tuesday, 29 April 2008 02:52:15 UTC