- From: Erik van der Poel <erikv@google.com>
- Date: Fri, 25 Apr 2008 13:58:05 -0700
- To: "Frank Ellermann" <hmdmhdfmhdjmzdtjmzdtzktdkztdjz@gmail.com>
- Cc: www-international@w3.org
On Fri, Apr 25, 2008 at 11:54 AM, Frank Ellermann <nobody@xyzzy.claranet.de> wrote:
>
> Erik van der Poel wrote:
>
> > Here are some URIs with Arabic and Hebrew in the host, path
> > and query parts of the URI.
> [...]
>
> > Arabic host: http://xn--wgbe9chb01aytce.com/
>
> Apparently valid URIs in all href attributes, no "raw" IRI.
>
> > Hebrew host: http://www.xn--4dbbmod3aio.net/
>
> That has a "raw" IRI in a href-link, but the page claims to
> be XHTML 1 permitting only URIs, therefore it is invalid.

Do you know of any user agents that process the IRI differently,
depending on the XHTML 1 claim?

> > Arabic path: http://ar.wikipedia.org/wiki/%D8%A3%D9%88%D8%AF%D9%85%D9%88%D8%B1%D8%AA%D9%8A%D8%A7
>
> Apparently valid URIs, no "raw" IRI. Ditto he.wikipedia.
>
> Maybe I misunderstood the question. I was about to post a
> link to <http://idn.icann.org>, but that Wiki now also uses
> "URI-fied" IRIs, not "raw" IRIs. It is tricky to find any
> document format permitting "raw" IRIs in links.
>
> And "raw" UTF-8 IRIs are boring, popular browsers get this
> right - "raw" IRIs in legacy charsets are more interesting.

I agree that those are more interesting. The major browsers are
slowly converging on a set of conventions in this area.

Host name: Content developers still use Punycode because MSIE 6 does
not support IDNA. Opera supports escaped UTF-8, but the other
browsers don't (yet).

Path: Firefox has agreed to convert raw paths to escaped UTF-8,
starting with Firefox 3.

Query: The browser developers appear to have agreed to convert raw
queries back to the original HTML encoding, with certain exceptions.
(E.g. UTF-16 documents.) Also, MSIE leaves the query unescaped, while
Firefox escapes it. Finally, characters outside the destination set
are converted differently: MSIE converts to question marks, Firefox
converts to UTF-8 instead, and Safari converts to the same format as
submitted HTML forms (with HTTP GET), i.e. decimal NCRs, e.g.
&#12345;

When the browsers have converged enough, and the older browsers are
not being used so much any more, we may begin to see more "raw" IRIs
in HTML.

Erik
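
For readers who want to see the host, path and query conventions
described above side by side, here is a rough sketch in Python. It is
not any browser's actual code. The function names and the three
fallback labels ("question_mark", "utf8", "ncr") are made up for this
illustration, and the per-character loop assumes a simple stateless
page encoding such as ISO-8859-1 (it is not meant for UTF-16 or
ISO-2022 documents). The host conversion relies on Python's built-in
"idna" codec, which implements IDNA as specified in RFC 3490.

```python
from urllib.parse import quote


def host_to_ascii(host: str) -> str:
    """Convert a Unicode host name to its Punycode (xn--) form."""
    return host.encode("idna").decode("ascii")


def path_to_escaped_utf8(path: str) -> str:
    """Percent-escape a raw path as UTF-8, keeping '/' separators."""
    return quote(path, safe="/", encoding="utf-8")


def encode_query_value(value: str, page_charset: str, fallback: str) -> bytes:
    """Encode a raw query value in the page's charset.

    Characters the charset cannot represent are handled according to
    `fallback` (hypothetical labels for the behaviours in the message):
      "question_mark" -> '?'
      "utf8"          -> the character's UTF-8 bytes
      "ncr"           -> a decimal numeric character reference
    """
    out = bytearray()
    for ch in value:
        try:
            out += ch.encode(page_charset)
        except UnicodeEncodeError:
            if fallback == "question_mark":
                out += b"?"
            elif fallback == "utf8":
                out += ch.encode("utf-8")
            elif fallback == "ncr":
                out += ("&#%d;" % ord(ch)).encode("ascii")
            else:
                raise ValueError("unknown fallback: %r" % fallback)
    return bytes(out)


def escape_bytes(raw: bytes) -> str:
    """Percent-escape already-encoded bytes (the 'escaped' variant)."""
    return quote(raw, safe="")


if __name__ == "__main__":
    print(host_to_ascii("b\u00fccher.example"))
    print(path_to_escaped_utf8("/wiki/\u0623"))
    for style in ("question_mark", "utf8", "ncr"):
        raw = encode_query_value("a\u3039", "iso-8859-1", style)
        print(style, raw, escape_bytes(raw))
```

Running the module prints the Punycode host, the escaped path, and
the three query variants for U+3039 in an ISO-8859-1 page, both as
raw bytes and percent-escaped.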
Received on Friday, 25 April 2008 20:58:46 UTC