Re: [URN] Re: URN/URL spec issues... from Martin J. Dürst on 1997-11-04 (uri@w3.org from November 1997)

From: Martin J. Dürst <mduerst@ifi.unizh.ch>
Date: Tue, 4 Nov 1997 12:46:11 +0100 (MET)
To: "Sam X. Sun" <ssun@CNRI.Reston.VA.US>
cc: urn-ietf <urn-ietf@bunyip.com>, URI mailing list <uri@bunyip.com>
Message-ID: <Pine.SUN.3.96.971104121524.1769K-100000@enoshima.ifi.unizh.ch>
On Tue, 4 Nov 1997, Sam X. Sun wrote:

> > > One more question though. About the excluded characters. I can see the
> > > reason why ASCII 00-1F and 7F are excluded. But do characters like "<",
> > > ">", and "#" definitely have to be excluded also?
> > 
> > "<" and ">" are used to delimit URIs. If they are not excluded, it's
> > very difficult to know where a URI starts or ends. Also, "<" and ">"
> > are very frequent in HTML. 
> 
> Isn't the character " enough to serve as the delimiter? In HTML, the real
> delimiter that separates the URL is the character ", not "<" or ">".
> Characters "<" and ">" are used to separate the HTML tags. For example, in
> an HTML document, when a hyperlink is defined as
> <A HREF="http:my-link" options...>My Link</A>, only the http:my-link is the
> URL, which is delimited by a pair of " characters. Characters "<" and ">"
> are not in the context of the URL itself.

Correct. But at one time, there were browsers that accepted
<A HREF="http:my-link>. "<" and ">" are mainly used to delimit
URLs in free text, e.g. as <http:my-link>. And it's very nice if
a system allows you to click links in e.g. a plain text email message.
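
A minimal sketch of that free-text case, in Python. The helper name
find_delimited_uris and the pattern are illustrative assumptions, not
taken from any draft; the pattern relies on "<", ">" and whitespace
never occurring inside a URI, which is exactly what excluding them
guarantees.

```python
import re

# Hypothetical pattern: "<", a scheme name, ":", then anything up to ">".
# It only works because "<", ">" and whitespace cannot occur inside a URI.
ANGLE_URI = re.compile(r"<([A-Za-z][A-Za-z0-9+.-]*:[^<>\s]+)>")

def find_delimited_uris(text):
    """Return every URI written as <scheme:rest>, in order of appearance."""
    return ANGLE_URI.findall(text)

message = "See <http:my-link> and also <mailto:someone@example.org>, thanks."
for uri in find_delimited_uris(message):
    print(uri)
```

If ">" were allowed inside a URI, the pattern could no longer tell
where the URI ends without scheme-specific knowledge.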


> >"#" is the delimiter between what gets sent to the server and what 
> >remains at the client for further processing. If this is scheme-specific, 
> >this creates lots of problems.
> > 
> 
> I'm having a hard time figuring out what kind of problems it creates.
> Could you be more specific about the problem?

It's not that big a point, but assume an HTML parser is extracting a URL
reference and then wants to send the URL to a resolving machinery. If
it knows "keep the part after the # for yourself, give the part before
the # to the resolver", it's very easy. If it didn't know that, it
would have to ask the resolver, which would have to decide based on
scheme/protocol. Some browsers may just send all unknown schemes to
a proxy, and so we would need protocol additions so that the proxy
could send back the part after the # (or whatever). Having the #
and only the # for this purpose, and not allowing anything else to
use the # leads to a much nicer architecture.
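
As a rough illustration, the scheme-independent rule fits in a few
lines of Python. The function name split_reference and the example
references are made up for this sketch; the point is only that the
client never needs to look at the scheme.

```python
def split_reference(reference):
    """Split a URL reference at the first "#", whatever the scheme.

    Returns (for_resolver, fragment). The first part is handed to the
    resolver (or proxy); the second stays with the client. The fragment
    is None when the reference contains no "#".
    """
    before, sep, after = reference.partition("#")
    return before, (after if sep else None)

for ref in ("http://example.org/doc.html#section2",
            "newscheme:anything#part",      # hypothetical unknown scheme
            "http:my-link"):
    print(split_reference(ref))
```

With scheme-specific fragment syntax, the second example could not be
split without first asking something that knows "newscheme".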



> I might miss some big point here, and correct me if I'm wrong. But I do
> feel that there is an intention of making the URL/URI specification fit
> into the "http URL" model. But "http URL" is just A particular scheme
> under the URL family. New schemes should be allowed to come up with their
> own syntax definition to serve their own purpose, and not have to carry
> over the constraints of other schemes. Even for implementation simplicity,
> every scheme will have to do its own parsing anyway. Why not allow them to
> define their own set of reserved/excluded characters?

Well, it's not really that the URL/URI spec is fitted to http.
HTTP and HTML were the first to use URLs, as far as I know,
and made URLs popular. If it were not for HTTP and HTML, nobody
would use URLs. So there is some legacy, and some kind of
right-of-ownership and first-come-first-served.

URLs allow quite a wide range of scheme-specific syntax, but also
have some common concepts that allow generic parsing. The computer
processing aspect is not the only one that matters; the human user
has to be considered too. It would not be too difficult, if it were
needed, to build an infrastructure that e.g. considered "/"
to have scheme-specific semantics. But now that people are used
to relative URLs, the chances are large that they would make
many mistakes. So having some kind of common syntax has many
advantages.


> >Also, I would like to use this occasion to reiterate my (and many
> >others') request to put a note into draft-fielding-url-syntax-09.txt
> >to alert readers of the fact that internationalization of URIs
> >is converging towards UTF-8. The IMAP URL and the URN syntax
> >draft are clear evidence of this and can be cited easily.
> >Not putting in such a note would constitute a serious failure
> >to include relevant information. I will be glad to provide the
> >detailed wording.
> >
> 
> I also think using UTF8 as the underlying character set encoding for 
> global naming scheme, like URN, is a good choice. In fact, we specified 
> UTF8 as the character set encoding for the handle system. 

Great!


> On the other hand, "http URL" can and is surviving without a globally
> agreed character set encoding. And the link generally won't break even
> if changed from one character set encoding to another.

"surviving" is the right word here. It works as long as the encoding
stays the same. It definitely doesn't work when the encoding gets
changed. By proposing to use UTF-8, we don't want to force every
http server to change to UTF-8 immediately (or at all). Backwards
compatibility measures have been discussed that will allow an
amazingly smooth transition.


> Currently there are
> tons of non-ASCII URLs out there already, and this could make moving "http
> URL" into UTF8 very difficult.

No, it turns out that it's not very difficult. The key is that UTF-8
has a very particular structure, and therefore is easy to detect,
and that the namespace on a server is extremely sparsely populated.
I can point you to some papers of mine that discuss this.
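
A minimal sketch of the detection half of that argument, in Python.
The function name decode_path_octets and the iso-8859-1 fallback are
assumptions chosen for illustration, and this is not necessarily the
procedure described in the papers: it only shows that octets in a
legacy encoding rarely form valid UTF-8, so a strict decode separates
the two cases.

```python
def decode_path_octets(octets, legacy_encoding="iso-8859-1"):
    """Interpret raw URL octets as UTF-8 if they are valid UTF-8.

    Falls back to the server's legacy encoding (an assumed default here)
    when strict UTF-8 decoding fails. Returns (text, encoding_used).
    """
    try:
        return octets.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return octets.decode(legacy_encoding), legacy_encoding

print(decode_path_octets("Dürst".encode("utf-8")))
print(decode_path_octets("Dürst".encode("iso-8859-1")))
```

Pure ASCII octets decode identically either way. In the rare case
where legacy octets happen to be valid UTF-8, the sparse namespace
helps: the misread name is very unlikely to exist on the server.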


> Besides, UTF8 is not readable for most languages other than those
> written in ASCII, and this may make it unacceptable for people,
> say, using CJK or Greek. It might be more appropriate to let "http URLs"
> carry their character set encoding information with them,
> either embedded in the HTML context, or by switching the encoding setup
> from the browser.

It is true that if you take a non-ASCII URL, encode it as UTF-8, and then
insert the resulting octets into e.g. a Greek document (iso-8859-7), that
will look ugly or, even worse, will confuse an editor or browser.
But that's not what we are proposing. Whatever characters the URL
contains, these characters are encoded in the same way as the rest of
the characters in the document. For Greek characters in an iso-8859-7
document, these can be encoded as single octets, and then will be
nicely readable. If it is an HTML document, other characters can be
included by using numeric character references (the &#dddd; things),
for all characters from Unicode/ISO 10646. Where UTF-8 comes into
play is where %HH escaping is needed, or when the URL or part of it
is sent to the server.
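
To make the distinction concrete, here is a small Python sketch. The
function name escape_for_transport and the sample path are invented
for the example. The same Greek letters appear as single octets inside
an iso-8859-7 document, as a numeric character reference in HTML, and
as UTF-8 based %HH escapes once escaping is needed.

```python
import html
from urllib.parse import quote, unquote

def escape_for_transport(url_path):
    """Percent-escape non-ASCII characters using their UTF-8 octets."""
    return quote(url_path, safe="/", encoding="utf-8")

path = "/αβ"  # hypothetical path containing Greek letters

# Inside an iso-8859-7 document, each Greek letter is one readable octet.
print("α".encode("iso-8859-7"))

# In HTML, any Unicode character can be written as a numeric reference.
print(html.unescape("&#945;"))

# Only when %HH escaping is needed does UTF-8 come into play.
escaped = escape_for_transport(path)
print(escaped)
print(unquote(escaped))  # unquote assumes UTF-8 by default
```

The document's own encoding and the escaped form are independent: the
first is chosen for readability, the second for transport.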


Regards,	Martin.
Received on Tuesday, 4 November 1997 06:46:43 UTC