Re: IURIs with bidi text from Martin J. Duerst on 2000-10-16 (uri@w3.org from October 2000)

From: Martin J. Duerst <duerst@w3.org>
Date: Mon, 16 Oct 2000 14:36:59 +0900
To: "Yves Savourel" <ysavourel@translate.com>, <uri@w3.org>
Message-Id: <4.2.0.58.J.20001016140614.030eaac0@sh.w3.mag.keio.ac.jp>
At 00/10/15 17:09 -0600, Yves Savourel wrote:
>Hello,
>
>I would like to ask for confirmation on how to represent URIs that include 
>bidi text.
>
>For example, the following URL: 
>http://www.الأمان.com contain the Arabic word for 
>"security". Looking at the various draft documents about IURI I assume it 
>should go through the following steps:
>
>1-- Represent the URI in UCS with the LRE and PDF characters at each end 
>and the LRM prefixing each reserved character. That would give the 
>following (with the non-ASCII character between brackets):
>
><202A>http<200E>:<200E>/<200E>/www<200E>.<0627><0644><0623><0645><0627><0646><200E>.com<202C>

No, the assumption is different. You would not use the LRE/PDF/whatever
in the URI itself. The main reason is that there are places where escaped
characters can't go, e.g. before the 'http:' (see example below).
The various special characters may be used when displaying the URI
in a reasonable way, i.e. making sure that the syntactically relevant
characters stay at the right place rather than being thrown around.
But they are just one potential way of implementing the display of
these URIs.



>2-- Then convert it into UTF-8.
>
>3-- Then escape any escapable octet. This would give the following:
>
>%E2%80%AAhttp%E2%80%8E:%E2%80%8E/%E2%80%8E/www%E2%80%8E.%D8%A7%D9%84%D8%A3%D9%85%D8%A7%D9%86%E2%80%8E.com%E2%80%AC
>
>Is this a correct example of implementing an Internationalized URI with 
>bidi text? Or I've got something wrong in the process?

Here you can easily see that it won't work, because no existing software
can identify this as an http URI.
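As a rough illustration of that point, the sketch below feeds both forms
to a generic URI parser. It assumes Python's standard urllib.parse as a
stand-in for "existing software"; the variable names are only for
illustration.

```python
from urllib.parse import urlsplit

# Form with escaped LRE/LRM/PDF characters (from the question above).
escaped = ("%E2%80%AAhttp%E2%80%8E:%E2%80%8E/%E2%80%8E/www%E2%80%8E."
           "%D8%A7%D9%84%D8%A3%D9%85%D8%A7%D9%86%E2%80%8E.com%E2%80%AC")

# Form without any bidi control characters.
plain = "http://www.%D8%A7%D9%84%D8%A3%D9%85%D8%A7%D9%86.com"

escaped_parts = urlsplit(escaped)
plain_parts = urlsplit(plain)

# The escaped LRE comes before 'http', so no scheme is recognized.
print(repr(escaped_parts.scheme))
# Without the control characters the scheme and host are found.
print(repr(plain_parts.scheme), repr(plain_parts.netloc))
```

Because '%' cannot start a scheme name, the parser treats the whole first
string as a relative path rather than as an http URI.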

The correct URI would be

http://www.%D8%A7%D9%84%D8%A3%D9%85%D8%A7%D9%86.com
(assuming the Arabic word was in logical order). URIs as such
are in logical order, and don't allow any special bidi characters
because that would lead to ambiguities (there are several ways
to get the same reordering behaviour).
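A minimal sketch of producing this form, assuming the Arabic label is
held in logical order as the code points given in the question
(0627 0644 0623 0645 0627 0646) and using Python's urllib.parse.quote,
which encodes to UTF-8 before escaping:

```python
from urllib.parse import quote

# Arabic label in logical order, with no LRE/LRM/PDF characters added.
label = "\u0627\u0644\u0623\u0645\u0627\u0646"

# quote() converts the string to UTF-8 and escapes each byte as %HH.
uri = "http://www." + quote(label, safe="") + ".com"
print(uri)
```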


>In addition, in an XML document, as per section 4.2.2 of the specs, is it 
>correct that I should use this last form directly in the document, and not 
>rely on the XML processor to do the transformation?

The original XML spec says
(http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent):

 >>>>
An XML processor should handle a non-ASCII character in a URI by
representing the character in UTF-8 as one or more bytes, and then
escaping these bytes with the URI escaping mechanism (i.e., by
converting each byte to %HH, where HH is the hexadecimal notation
of the byte value).
<<<<
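The procedure in the quoted paragraph can be sketched as follows. This
is only an illustration of the escaping rule, not code from any XML
processor; the function name escape_non_ascii is made up for the example.

```python
def escape_non_ascii(uri: str) -> str:
    """Escape each non-ASCII character as the %HH form of its UTF-8 bytes."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            # ASCII characters are left exactly as they are.
            out.append(ch)
        else:
            # One or more UTF-8 bytes, each written as %HH.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)


# URI with the Arabic label in logical order (code points from the question).
original = "http://www.\u0627\u0644\u0623\u0645\u0627\u0646.com"
converted = escape_non_ascii(original)
print(converted)
```

Since ASCII characters are passed through untouched, the 'http://' prefix
and the dots stay recognizable after the conversion.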

This says that the XML processor should do this for you, and therefore
it should be okay for you to put in the original characters. But there
are three problems here:

- It says 'should', not 'must'.
- It's not clear whether it applies to all URIs, or just to the URIs
   used in System Identifiers, and in the former case, it's not
   clear how an XML processor would find all URIs in a document
   (without e.g. Schema information).
- The text in the second edition of XML
   (http://www.w3.org/TR/REC-xml#sec-external-ent) is much clearer about
   how the conversion has to take place; unfortunately, it doesn't make
   clear who should do this conversion (the original document producer
   or the XML processor). The idea was not to change this for the second
   edition, but somehow it got lost. I'm following up on this.

So the current answer is, unfortunately:

You should be able to rely on the XML processor, but for the moment,
it might be safer to do the transformation yourself.



Regards,   Martin.
Received on Monday, 16 October 2000 02:43:30 UTC