Re: BiDi IRI deployment?

From: Erik van der Poel <erikv@google.com>
Date: Fri, 25 Apr 2008 13:58:05 -0700
Message-ID: <c07a32650804251358p5055a42ax4eed4c5544cb3f7c@mail.gmail.com>
To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
Cc: www-international@w3.org

On Fri, Apr 25, 2008 at 11:54 AM, Frank Ellermann
<nobody@xyzzy.claranet.de> wrote:
>
>  Erik van der Poel wrote:
>
>  > Here are some URIs with Arabic and Hebrew in the host, path
>  > and query parts of the URI.
>  [...]
>
>  > Arabic host: http://xn--wgbe9chb01aytce.com/
>
>  Apparently valid URIs in all href attributes, no "raw" IRI.
>
>
>  > Hebrew host: http://www.xn--4dbbmod3aio.net/
>
>  That has a "raw" IRI in a href-link, but the page claims to
>  be XHTML 1 permitting only URIs, therefore it is invalid.

Do you know of any user agents that process the IRI differently,
depending on the XHTML 1 claim?

>  > Arabic path: http://ar.wikipedia.org/wiki/%D8%A3%D9%88%D8%AF%D9%85%D9%88%D8%B1%D8%AA%D9%8A%D8%A7
>
>  Apparently valid URIs, no "raw" IRI.  Ditto he.wikipedia.
>
>  Maybe I misunderstood the question.  I was about to post a
>  link to <http://idn.icann.org>, but that Wiki now also uses
>  "URI-fied" IRIs, not "raw" IRIs.  It is tricky to find any
>  document format permitting "raw" IRIs in links.
>
>  And "raw" UTF-8 IRIs are boring, popular browsers get this
>  right - "raw" IRIs in legacy charsets are more interesting.

I agree that those are more interesting. The major browsers are slowly
converging on a set of conventions in this area.

Host name: Content developers still use Punycode because MSIE 6 does
not support IDNA. Opera supports percent-escaped UTF-8 in host names,
but the other browsers don't (yet).
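
As a rough illustration of the two host-name forms mentioned above, here
is a minimal Python sketch. It assumes Python's built-in "idna" codec
(which implements IDNA 2003) and uses made-up host names under the
reserved ".example" TLD; the function names are hypothetical and are not
taken from any browser.

```python
from urllib.parse import quote


def host_to_punycode(host: str) -> str:
    """Convert each label of the host to its ASCII (xn--) form via IDNA 2003."""
    return host.encode("idna").decode("ascii")


def host_to_escaped_utf8(host: str) -> str:
    """Percent-escape the UTF-8 bytes of the host; dots and ASCII letters stay as-is."""
    return quote(host, safe="")


# Hypothetical hosts, chosen only for illustration.
latin_host = "b\u00fccher.example"                  # "bücher.example"
hebrew_host = "\u05e9\u05dc\u05d5\u05dd.example"    # a right-to-left label

print(host_to_punycode(latin_host))       # xn--bcher-kva.example
print(host_to_escaped_utf8(latin_host))   # b%C3%BCcher.example
print(host_to_punycode(hebrew_host))
print(host_to_escaped_utf8(hebrew_host))
```

The Punycode form is plain ASCII, so it works even in a user agent
without IDNA support; the escaped UTF-8 form only works where the user
agent unescapes the host and applies IDNA itself.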

Path: The Firefox developers have agreed to convert raw paths to
percent-escaped UTF-8, starting with Firefox 3.
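
A minimal Python sketch of that path conversion, using the Arabic
Wikipedia path quoted earlier in this message. The function name is
hypothetical, and this only illustrates the general idea (encode as
UTF-8, then percent-escape); it is not Firefox's actual code.

```python
from urllib.parse import quote, unquote

# The escaped path from the Arabic Wikipedia URI quoted above.
ESCAPED_PATH = "/wiki/%D8%A3%D9%88%D8%AF%D9%85%D9%88%D8%B1%D8%AA%D9%8A%D8%A7"


def escape_raw_path(raw_path: str) -> str:
    """Encode a raw (unescaped) path as UTF-8 and percent-escape it, keeping '/'."""
    return quote(raw_path, safe="/", encoding="utf-8")


# Recover the "raw" form an author might type into an href attribute.
raw_path = unquote(ESCAPED_PATH)

# Escaping the raw path again yields the URI form sent on the wire.
print(escape_raw_path(raw_path) == ESCAPED_PATH)  # True
```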

Query: The browser developers appear to have agreed to convert raw
queries back to the encoding of the original HTML document, with
certain exceptions (e.g. UTF-16 documents). MSIE leaves the query
unescaped, while Firefox escapes it. Finally, characters that cannot be
represented in the destination encoding are handled differently: MSIE
converts them to question marks, Firefox converts to UTF-8 instead, and
Safari uses the same format as submitted HTML forms (with HTTP GET),
i.e. decimal NCRs such as &#12345;.
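
The three fallback behaviours described above can be approximated in
Python as follows. This is only a sketch: the function name and the
fallback labels are hypothetical, the document encoding (ISO-8859-8) is
picked just for the example, and the "utf-8" branch assumes the whole
value is re-encoded, which may not match what Firefox really does.

```python
def encode_query(value: str, doc_charset: str, fallback: str) -> bytes:
    """Encode a raw query value into the document's charset.

    `fallback` selects what happens to characters the charset lacks:
      "question-mark" -> replace them with "?"
      "ncr"           -> replace them with decimal NCRs such as &#12345;
      "utf-8"         -> re-encode the whole value as UTF-8 (an assumption)
    """
    if fallback == "question-mark":
        return value.encode(doc_charset, errors="replace")
    if fallback == "ncr":
        return value.encode(doc_charset, errors="xmlcharrefreplace")
    if fallback == "utf-8":
        try:
            return value.encode(doc_charset)
        except UnicodeEncodeError:
            return value.encode("utf-8")
    raise ValueError(f"unknown fallback: {fallback!r}")


# U+05E9 exists in ISO-8859-8; U+3039 (decimal 12345) does not.
value = "\u05e9\u3039"
for mode in ("question-mark", "ncr", "utf-8"):
    print(mode, encode_query(value, "iso-8859-8", mode))
```

Whether the resulting bytes are then percent-escaped is a separate step,
since the browsers differ on that as well.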

When the browsers have converged enough, and the older browsers are
not being used so much any more, we may begin to see more "raw" IRIs
in HTML.

Erik
Received on Friday, 25 April 2008 20:58:46 GMT
