Re: Special characters in URIs

Dan Connolly (connolly@w3.org)
Thu, 27 May 1999 17:41:59 -0500


Message-ID: <374DCA37.1EDF77BE@w3.org>
Date: Thu, 27 May 1999 17:41:59 -0500
From: Dan Connolly <connolly@w3.org>
To: Larry Masinter <masinter@parc.xerox.com>
CC: "Martin J. Duerst" <duerst@w3.org>, ietf-url@imc.org, uri@Bunyip.Com
Subject: Re: Special characters in URIs

Larry Masinter wrote:
> 
> URL character escaping normally should only be done at the
> time the URL is constructed from its component pieces, and
> normally should only be undone (unescaped) when the URL
> is decomposed into its internal pieces.

True.

>  Your description
> of the process of either applying or removing %XX escaping
> seems to be based on having the escapes applied or removed
> when the URL is removed from or embedded in some context
> such as XML.

Only when it's removed from the context, not when it's embedded.

> In general, you cannot change an arbitrary
> %XX into the character the XX byte sequence represents in
> ASCII without some risk of changing the meaning of the URL,

true.

> and so you should not recommend this process at all.

The excerpt below doesn't mention unescaping, only how
to take an XML attribute value and turn it into a URL
in the case that it's not already a URL (because it
has non-URL characters).
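
As a rough illustration of that one-way step, here is a minimal Python
sketch. The function name and the exact set of characters left alone
(the RFC 2396 reserved and unreserved characters, plus "%" and "#") are
my assumptions for the example, not something the excerpt spells out.

```python
import string

# Assumed set of characters allowed to appear literally: RFC 2396
# unreserved and reserved characters, plus "%" (existing escapes are
# left untouched) and "#" (fragment delimiter).
URI_CHARS = set(
    string.ascii_letters + string.digits
    + "-_.!~*'()"      # unreserved marks
    + ";/?:@&=+$,"     # reserved
    + "%#"
)

def attribute_value_to_uri(value: str) -> str:
    """Escape only the non-URI characters: UTF-8 encode, then %XX each byte."""
    out = []
    for ch in value:
        if ch in URI_CHARS:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)
```

Note that a value that is already a URL passes through unchanged, which
is the point: nothing here unescapes anything.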

It's probably worth warning folks that the inverse operation
is not licensed, but that doesn't mean the operation
itself is a problem.
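
To make the warning concrete, a small sketch of why the inverse is
risky: blindly unescaping %2F turns data inside one path segment into
a path delimiter. The helper name and the sample path are hypothetical.

```python
from urllib.parse import unquote

def path_segments(uri_path):
    """Split a URI path into its segments on the literal "/" delimiter."""
    return uri_path.strip("/").split("/")

escaped = "/docs/a%2Fb"        # two segments: "docs" and "a%2Fb"
unescaped = unquote(escaped)   # "/docs/a/b": now three segments

print(path_segments(escaped))
print(path_segments(unescaped))
```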

> 
> Larry
> --
> http://www.parc.xerox.com/masinter
> 
> > The second case basically works by saying that if in these formats
> > (e.g. HTML), an URI contains a non-ASCII character, this character
> > is converted to a byte sequence using UTF-8 and then %-encoded to
> > produce a legal URI.
> 
> I think "works" is ambitious. It "works" because most
> HTTP servers are forgiving about this kind of transliteration
> and most URLs are HTTP.

It "works" in the case that, for example, a user copies
a filename from a desktop filebrowser into an XML document
	href="xyz__"
where __ is some non-URL character.

Meanwhile, the HTTP server, when it exports the xyz__ file,
uses the same convention: UTF-8 encoding, %XX escaped.

That doesn't mean the HTTP server should grab xyz%XX%XX off
the tcp socket and unescape it; it means the HTTP server
should (do something equivalent to) enumerate each file
in the directory and escape it, and compare the resulting URI path
to xyz%XX%XX.
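
In Python terms, the server side might look something like the sketch
below. The function names are made up for illustration, and I'm
assuming the directory listing is already available as text strings;
a real server would also have to settle details such as the case of
the hex digits.

```python
from urllib.parse import quote

def escape_filename(name):
    """UTF-8 encode a filename and %XX-escape everything but unreserved chars."""
    # quote() uses UTF-8 by default; the safe set keeps the RFC 2396 marks.
    return quote(name, safe="!~*'()")

def find_exported_file(requested_segment, filenames):
    """Compare in escaped form; the request itself is never unescaped."""
    for name in filenames:
        if escape_filename(name) == requested_segment:
            return name
    return None
```

So a request for xyz%C3%A9 matches a file only if escaping that file's
name yields exactly xyz%C3%A9.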

It's a bit of a kludge; the cleaner thing to do would
be to say "don't put things other than URIs in those
XML attribute values." But we haven't had any luck doing that.
And this "kludge" just so happens to be consistent with
the existing specs (though subtly) and consistent with
a fair amount of actual practice (or at least so I
gather from Martin; I haven't seen the evidence 1st hand).

And it provides a global convention for interoperability
between HTTP servers exporting filesystems that use
iso-latin-1 to encode filenames and those that
export filesystems that use shift-jis or UCS-2.
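
A sketch of what that convention buys, assuming each server knows its
own filesystem encoding: the same name stored as different bytes ends
up as the same URI path segment. The function name is hypothetical,
and UTF-16 stands in for UCS-2 here.

```python
from urllib.parse import quote

def exported_segment(raw_name, fs_encoding):
    """Decode a filename from the filesystem's encoding, then export it
    as UTF-8 bytes with %XX escaping."""
    text = raw_name.decode(fs_encoding)
    return quote(text, safe="!~*'()")

name = "\u65e5\u672c"  # a two-character Japanese filename
from_sjis = exported_segment(name.encode("shift_jis"), "shift_jis")
from_ucs2 = exported_segment(name.encode("utf-16-be"), "utf-16-be")
print(from_sjis == from_ucs2)
```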

-- 
Dan Connolly, W3C
http://www.w3.org/People/Connolly/