RE: Unicode in a URL

Hello Mike,

At 19:09 01/04/26 -0600, Mike Brown wrote:
> > W3C specifies to use %-encoded UTF-8 for URLs.
>
>I think that's an overstatement.
>Neither the W3C nor the IETF make such a specification.

True. Neither the W3C nor the IETF makes such a general statement,
because we can't simply discard about 10 years of URI
history.


>http://www.w3.org/TR/charmod/#sec-URIs
>contains many ambiguities, conflicts with XML and HTTP,
>and is not yet a recommendation.

Thanks for having had a look at that. It may of course
contain some ambiguities, and may need improvement in
presentation and wording (that's why we have put it
out for last call, and we are now working on the comments
we got). It's also true that it's not yet a recommendation.
But XML is a Recommendation, and XLink and XML Schema are
both Proposed Recommendations (i.e. close to Recommendations),
and they all say the same thing in the places where they have to
say something about URIs.

And I'm aware of no conflicts with XML or HTTP at all.


>I wrote a little about this topic at
>http://skew.org/xml/misc/URI-i18n/

Overall, that's an extremely well written and well presented
document. But it contains a crucial misunderstanding.

It gives the following example (sorry for the "e'"; my Japanese
mailer doesn't handle Latin-1):

 >>>>
Here is a scenario that illustrates how the assumption of UTF-8
based escaping could conflict with the URI spec's deference to the
scheme specs:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE mydoc [
    <!ELEMENT mydoc (#PCDATA)>
    <!ENTITY greeting SYSTEM 
"http://somewhere/getgreeting?lang=es&name=C%C3%A9sar">
]>
<mydoc>&greeting;</mydoc>

The name Ce'sar is represented here as C%C3%A9sar in the UTF-8 based escaping,
as per the XML requirement.
 >>>>

This is wrong. There is no requirement to use UTF-8 for all the %hh escapes
in system literals. A system literal is a URI, and you can use whatever
%hh sequence you want. In particular, if you have to send the corresponding
Latin-1 bytes to "http://somewhere/getgreeting", then you can use
http://somewhere/getgreeting?lang=es&name=C%E9sar. URIs like this have
worked for a long time, and the W3C has neither reason nor intention
to stop you (if you really need that).
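
To make the difference between the two escaped forms concrete, here is a
small Python sketch (the variable names are mine, purely for illustration).
It only shows that the same name yields different %hh sequences depending on
the encoding chosen, and that a %hh sequence by itself denotes bytes, not
characters:

```python
from urllib.parse import quote, unquote_to_bytes

# The name with an actual e-acute (U+00E9).
name = "C\u00e9sar"

# Same character, two different byte sequences, hence two different escapes.
utf8_form = quote(name, encoding="utf-8")      # escapes the UTF-8 bytes C3 A9
latin1_form = quote(name, encoding="latin-1")  # escapes the Latin-1 byte E9

print(utf8_form)
print(latin1_form)

# Unescaping yields raw bytes; which characters they stand for is up to
# whatever receives the request.
print(unquote_to_bytes(utf8_form))
print(unquote_to_bytes(latin1_form))
```

Either form is a legal URI; which one is "right" depends only on what the
server at the other end expects.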

What the XML spec (and all the others mentioned above) says is something
different. Consider the following example:

 >>>>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE mydoc [
    <!ELEMENT mydoc (#PCDATA)>
    <!ENTITY greeting SYSTEM 
"http://somewhere/getgreeting?lang=es&name=C&#xe9;sar">
]>
<mydoc>&greeting;</mydoc>
  >>>>

Here there is an actual e-acute character in the file (I just used a numeric
character reference to make sure it gets through email). This can't be sent
off directly, because a URI may contain only ASCII characters, so it's better
if we clearly say what an XML processor (which is the thing that interprets
the XML, resolving the entities and doing other parsing-related things) has
to do.

It is this case to which the XML spec, the character model, and so on, apply.
In this case, the e-acute character is converted to %C3%A9, and the XML
processor tries to get the entity from
     http://somewhere/getgreeting?lang=es&name=C%C3%A9sar

So in short, what the XML spec is saying is:

- If you use non-ASCII characters directly in a system id, they're converted
   to %-escapes using UTF-8.
- If you want anything else, use exactly the %-escapes you want. You won't
   get the benefit of using the actual character in the source document.
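
The two rules above can be sketched in a few lines of Python. This is only
an illustration under simplifying assumptions: the function name
escape_system_id is hypothetical, and it treats only non-ASCII characters
as disallowed, leaving all ASCII (including any %hh escapes already
present) untouched:

```python
def escape_system_id(system_id: str) -> str:
    """Sketch of the escaping an XML processor applies to a system id.

    Simplification: only non-ASCII characters are considered disallowed.
    """
    out = []
    for ch in system_id:
        if ord(ch) < 128:
            # ASCII, including existing %hh escapes, passes through as is.
            out.append(ch)
        else:
            # Convert the character to UTF-8 and escape each byte as %HH.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


# An actual e-acute in the system id is escaped via UTF-8.
direct = escape_system_id(
    "http://somewhere/getgreeting?lang=es&name=C\u00e9sar")

# A hand-written Latin-1 escape is left exactly as the author wrote it.
manual = escape_system_id(
    "http://somewhere/getgreeting?lang=es&name=C%E9sar")

print(direct)
print(manual)
```

The point of the sketch is that the conversion is something the processor
does to characters it finds; it never reinterprets escapes you wrote
yourself.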

I hope this clears things up a bit. If there is anything in the XML spec
or any other spec that you think should be changed to make this clearer,
we are always open to suggestions. But I think the fact that it says
"The XML processor must escape disallowed characters as follows:" makes
it quite clear that this happens on parsing, not when creating the XML
document.


Regards,   Martin.

Received on Thursday, 26 April 2001 23:06:37 UTC