- From: Martin J. Duerst <duerst@w3.org>
- Date: Mon, 16 Oct 2000 14:36:59 +0900
- To: "Yves Savourel" <ysavourel@translate.com>, <uri@w3.org>
At 00/10/15 17:09 -0600, Yves Savourel wrote:
>Hello,
>
>I would like to ask for confirmation on how to represent URIs that include
>bidi text.
>
>For example, the following URL:
>http://www.الأمان.com contain the Arabic word for
>"security". Looking at the various draft documents about IURI I assume it
>should go through the following steps:
>
>1-- Represent the URI in UCS with the LRE and PDF characters at each end
>and the LRM prefixing each reserved character. That would give the
>following (with the non-ASCII character between brackets):
>
><202A>http<200E>:<200E>/<200E>/www<200E>.<0627><0644><0623><0645><0627><0646><200E>.com<202C>

No, the assumption is different. You would not use the LRE/PDF/whatever
in the URI itself. The main reason is that there are places where
escaped characters can't go, e.g. before the 'http:' (see example below).

The various special characters may be used when displaying the URI
in a reasonable way, i.e. making sure that the syntactically relevant
characters stay at the right place rather than being thrown around.
But they are just one potential way of implementing the display of
these URIs.

>2-- Then convert it into UTF-8.
>
>3-- Then escape any escapable octet. This would give the following:
>
>%E2%80%AAhttp%E2%80%8E:E2%80%8E/%E2%80%8E/www%E2%80%8E.%D8%A7%D9%84%D8%A3%D9%85%D8%A7%D9%86%E2%80%8E.com%E2%80%AC
>
>Is this a correct example of implementing an Internationalized URI with
>bidi text? Or I've got something wrong in the process?

Here you see easily that it won't work, because no existing software
can identify this as a http URI. The correct URI would be

http://www.%D8%A7%D9%84%D8%A3%D9%85%D8%A7%D9%86.com

(assuming the Arabic word was in logical order). URIs as such are in
logical order, and don't allow any special bidi characters because
that would lead to ambiguities (there are several ways to get the
same reordering behaviour).

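As a rough sketch of the conversion described above, in Python: the
function name escape_non_ascii and the exact set of rejected bidi
formatting characters are illustrative assumptions, not taken from any
specification. It takes the characters in logical order, leaves all
ASCII characters (including 'http://') untouched, and escapes only the
UTF-8 bytes of the non-ASCII characters.

```python
from urllib.parse import quote

# Assumed set of bidi formatting characters to reject:
# LRM, RLM, LRE, RLE, PDF, LRO, RLO.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
}


def escape_non_ascii(uri: str) -> str:
    """Percent-escape the UTF-8 bytes of every non-ASCII character.

    The input is expected in logical order and must not contain
    bidi formatting characters.
    """
    if any(ch in BIDI_CONTROLS for ch in uri):
        raise ValueError("bidi formatting characters do not belong in the URI")
    # ASCII stays as-is; quote() encodes a non-ASCII character as UTF-8
    # and writes each byte as %HH.
    return "".join(ch if ord(ch) < 128 else quote(ch, safe="") for ch in uri)


if __name__ == "__main__":
    arabic = "\u0627\u0644\u0623\u0645\u0627\u0646"
    print(escape_non_ascii("http://www." + arabic + ".com"))
```

Run on the code points from the question, this should reproduce the
URI given above. Input that still carries LRE, PDF, or LRM is rejected
rather than silently cleaned, which is one possible design choice; a
display layer could add such characters separately.
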
>In addition, in an XML document, as per section 4.2.2 of the specs, is it
>correct that I should use this last form directly in the document, and not
>rely on the XML processor to do the transformation?

The original XML spec says
(http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent):

>>>>
An XML processor should handle a non-ASCII character in a URI by
representing the character in UTF-8 as one or more bytes, and then
escaping these bytes with the URI escaping mechanism (i.e., by
converting each byte to %HH, where HH is the hexadecimal notation
of the byte value).
<<<<

This says that the XML processor should do this for you, and
therefore it should be okay for you to put in the original
characters. But there are three problems here:

- It says 'should', not must.
- It's not clear whether it applies to all URIs, or just to the
  URIs used in System Identifiers, and in the former case, it's
  not clear how an XML processor would find all URIs in a document
  (without e.g. Schema information).
- The text in the second edition of XML
  (http://www.w3.org/TR/REC-xml#sec-external-ent)
  is much clearer about how the conversion has to take place;
  unfortunately, it doesn't make clear who should do this
  conversion (the original document producer or the XML processor).
  The idea was not to change this for the second edition, but
  somehow it got lost. I'm following up on this.

So the current answer is, unfortunately: You should be able to
rely on the XML processor, but for the moment, it might be safer
to do the transformation yourself.

Regards, Martin.
Received on Monday, 16 October 2000 02:43:30 UTC