Re: BiDi IRI deployment? from Frank Ellermann on 2008-04-29 (www-international@w3.org from April to June 2008)

From: Frank Ellermann <nobody@xyzzy.claranet.de>
Date: Wed, 30 Apr 2008 01:30:55 +0200
To: www-international@w3.org
Message-ID: <fv8cmk$m8u$1@ger.gmane.org>
Erik van der Poel wrote:
 
> RFC 3987 does mention related issues.

The complete section 7 is informative, I fear I never read
it, because the syntax and prose in chapters 1..3 satisfied
all my "now that is really KISS" desires.

I couldn't judge chapter 4 (BiDi) anyway.
[Digression... I'm not completely convinced that numbers are
really written LTR in RTL languages, or if they just have a
"little endian" concept where RTL languages use "big endian"]

>| when a new Web form is set up using UTF-8 as the character
>| encoding of the form page, the returned query URIs will
>| use UTF-8 as the character encoding (unless the user, for
>| whatever reason, changes the character encoding) and will
>| therefore be compatible with IRIs."

A reason to change it is a browser not supporting UTF-8, but
I'm confident that the number of netscape 2.02 users sharply
declined by 100% from one to zero worldwide last year.  

Of course using UTF-8 is the most robust solution: a query
in a URI can't say which other charset its percent-encoded
octets might use.  FWIW it can be no charset at all, e.g. the
percent-encoded raw octets of an ICO or similar.
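
A minimal sketch of that ambiguity, using only the Python
standard library.  The sample word and the list of candidate
charsets are my own illustration, not taken from RFC 3987.
The same percent-encoded octets give different text (or none)
depending on the charset the receiver guesses, and nothing in
the URI settles the question:

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical form value, submitted from a KOI8-R page.
original = "тест"
query_value = quote(original.encode("koi8-r"))  # percent-encoded raw octets

# What the server actually receives: octets, with no charset label.
octets = unquote_to_bytes(query_value)


def try_decode(data: bytes, charset: str):
    """Return the decoded text, or None if the octets are invalid in charset."""
    try:
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


# Three guesses a receiver might make for the same octets.
guesses = {cs: try_decode(octets, cs) for cs in ("koi8-r", "iso-8859-1", "utf-8")}

for charset, text in guesses.items():
    print(charset, repr(text))
```

Only the KOI8-R guess recovers the original word; ISO-8859-1
yields different characters without any error, and strict
UTF-8 decoding rejects the octets.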

IRIs are only one example of that general issue, it affects
all queries as soon as any part of them is non-ASCII.

>| Second, it may include URIs constructed based on character
>| encodings other than UTF-8.  These URIs may be produced by
>| user agents that do not conform to this specification and
>| that use legacy character encodings to convert non-ASCII
>| characters to URIs.

It could be an ftp URI talking about file names on a server
using a legacy charset, or similar cases for other schemes.

RFC 3987 merely repeats what RFC 2277 said before it and
RFC 5198 said later: use UTF-8 over the wire, anything else
requires a way to indicate the charset.  And for HTTP GET
forms the resulting URI can't say what it is, clients trying
to state that a URI is not UTF-8 are doomed:

* A percent-encoded ICO is not UTF-8, nor any other charset.
* A URI by definition is US-ASCII following STD 66 syntax,
  otherwise it is broken and potentially dangerous.
* And RFC 3987 quietly adds the concept "as far as URIs use
  percent-encoded octets it is either some binary gibberish,
  or percent-encoded UTF-8".

The last point is the real magic in RFC 3987, it deprecates
the whole zoo of legacy charsets (again) without mentioning
the fact.  For definitions of "legacy" starting with UTF-16,
UTF-32, UTF-7, UTF-1, and then covering anything that is not
UTF-8 or its proper subset US-ASCII.

> I'm not sure whether we are communicating here. I'm talking
> about URIs that are sent from the client to a server

We are on the same track: if a user clicks on a "raw" IRI in
the href on the KOI8-R test page it cannot work with almost
any client (popular browsers excepted), because HTTP supports
only URIs, not "raw" IRIs.

Besides, old clients have no way to figure out the server in
these two IRIs (one KOI8-R IRI for the Cyrillic test TLD Wiki,
one Unicode IRI given with NCRs for the Greek test TLD Wiki).

> Currently, HTML browsers convert from Unicode to the document
> encoding when an HTML form is submitted or an href with a
> non-ASCII query part is clicked.

That sounds strange.  For the <ihost> part I found that FF2
converts it from legacy (KOI8-R) or Unicode (NCRs) to the
corresponding IDNA A-labels, otherwise the links don't work.
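
Roughly what that <ihost> conversion amounts to, as a sketch
and not a description of FF2's actual code.  It assumes
Python's built-in "idna" codec (which implements IDNA 2003);
the host names and the helper name host_to_a_label are
hypothetical:

```python
import html


def host_to_a_label(host: str) -> str:
    """Convert a Unicode host name to ASCII A-labels.

    Relies on Python's built-in "idna" codec (IDNA 2003 rules).
    """
    return host.encode("idna").decode("ascii")


# The same host as it might appear in a KOI8-R page (raw octets) ...
from_legacy = b"\xd4\xc5\xd3\xd4.example".decode("koi8-r")
# ... and in a page that spells it with numeric character references.
from_ncrs = html.unescape("&#x442;&#x435;&#x441;&#x442;.example")

# Both spellings name the same Unicode host, hence the same A-labels.
print(host_to_a_label(from_legacy))
print(host_to_a_label(from_ncrs))
```

Once both sources are reduced to Unicode, the document charset
no longer matters for the host part.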

For <ipath> Martin's test suite showed that FF2 didn't get
this right for legacy non-UTF-8 charsets (JFTR also not for
iso-8859-1).  Fixing that should be straightforward:

Treat any (X)HTML document internally as Unicode (RFC 2070
and later), if in doubt use UTF-8 for Unicode (RFC 2277 and
5198), and finally percent-encode UTF-8 (RFC 3986 and 3987).
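
Those three steps can be sketched in a few lines of Python.
The function name and the sample paths are hypothetical, and
the sketch ignores hrefs that already contain percent-escapes
(quote() would encode their "%" again):

```python
from urllib.parse import quote


def href_path_to_uri_path(raw: bytes, document_charset: str) -> str:
    """Map the octets of an href path in a document to a URI path."""
    # Step 1: treat the document text as Unicode, whatever its charset.
    text = raw.decode(document_charset)
    # Steps 2 and 3: encode as UTF-8, then percent-encode; "/" stays literal.
    return quote(text, safe="/", encoding="utf-8")


# The same path taken from an iso-8859-1 page and from a UTF-8 page.
print(href_path_to_uri_path(b"/caf\xe9", "iso-8859-1"))
print(href_path_to_uri_path("/café".encode("utf-8"), "utf-8"))
```

Both calls give the same URI path, so the server never needs
to know the charset of the page that carried the link.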

Where does the bit of doing something *else* for a <query>
or <iquery> enter the picture ?  What is the point of doing
something else, i.e. different from a <path> or <ipath> ?

> I'm saying that it would be more consistent if the browsers
> used NCRs for both forms *and* hrefs, since the server
> doesn't know which one the user was dealing with.

Do you want to send NCRs in URI query parts over the wire ?
Including sending "&" as "&amp;" etc. ?  I don't see how
that can be a good idea, servers would then be faced with
questions of how often they need to decode NCRs in the URI
*plus* the known issues of decoding %25hh, %25%25hh, etc.
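
To illustrate the "how often" problem, a small Python sketch
(the helper decode_until_stable and the sample values are my
own, hypothetical).  It only counts how many decoding passes
change a value; a real server has no way to know which pass
count the sender intended:

```python
import html
from urllib.parse import unquote


def decode_until_stable(value: str, decode, limit: int = 5):
    """Apply decode repeatedly; return (result, passes that changed it)."""
    passes = 0
    while passes < limit:
        decoded = decode(value)
        if decoded == value:
            break  # nothing left to decode
        value, passes = decoded, passes + 1
    return value, passes


# Percent-escapes nested once and twice.
print(decode_until_stable("%2541", unquote))
print(decode_until_stable("%252541", unquote))
# NCR-style escaping stacks up in the same way.
print(decode_until_stable("&amp;amp;", html.unescape))
```

With NCRs inside percent-encoded queries a server would have
to guess both counts at once, and a wrong guess either leaves
escapes behind or destroys a literal "%41" or "&amp;" that the
user really typed.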

> Maybe HTML forms and hrefs with query parts can be 
> specified in HTML 5 instead of IRIbis.

HTML5 already tries to reinvent the complete Internet, but
as far as HTML5 is a manual for browser implementors, yes,
HTML5 might need to talk about these issues.  And maybe it
does already, I have only read the "diff" draft carefully,
now waiting for the next round of "official" HTML5 drafts.

 Frank
Received on Tuesday, 29 April 2008 23:59:07 UTC