Re: draft-fielding-url-syntax-05.txt from Martin J. Duerst on 1997-05-12 (uri@w3.org from May 1997)

From: Martin J. Duerst <mduerst@ifi.unizh.ch>
Date: Mon, 12 May 1997 16:47:36 +0200 (MET DST)
To: Larry Masinter <masinter@parc.xerox.com>
cc: Chris Newman <Chris.Newman@innosoft.com>, IETF URI list <uri@bunyip.com>
Message-ID: <Pine.SUN.3.96.970512150549.245S-100000@enoshima>

In an earlier mail, Larry Masinter wrote:

> > - Shortly state what, in your opinion, has been resolved
> >         (and what may not yet have been resolved).
> 
> I don't think the latest draft will please everyone, but I think
> that by toning down the explanation that "URLs are limited to ASCII",
> pointing out that the definition of non-ASCII URLs is handled
> outside of this document, and starting on the process of actually
> resolving how non-ASCII URLs can be made to work, we've dealt with
> the requirements of starting the work on Internationalization and
> leaving enough space to come to a resolution on the issues.

That sounds like a reasonable way to go. In particular,
"pointing out that the definition of non-ASCII URLs is handled
outside of this document" is very helpful.


In a different mail, Chris Newman wrote:

> On Fri, 2 May 1997, Larry Masinter wrote:
> 
> > 2.1 URLs and non-ASCII characters   
> >    
> >    While URLs are sequences of characters and those characters are
> >    used (within delimited sections) to represent sequences of octets,
> >    in some cases those sequences of octets are used (via a 'charset'
> >    or character encoding scheme) to represent sequences of characters:

Saying "in some cases" is not completely wrong, but it is not a very
accurate representation of reality. Changing this to "in most cases"
would be much more appropriate.


> >    URL char. sequence <-> octet sequence <-> original char. sequence

I think it would be more intuitive to turn things around, i.e.

original char. sequence -> octet sequence <-> URL char. sequence

Note that the first arrow is one way only, because the reverse
mapping is not guaranteed. Probably going back to the figure that
we had in the last (non-submitted) draft would be even better,
to make this stick out. Also, a paragraph later, the words
"second mapping" have to be changed to "first mapping".
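
To make the asymmetry concrete, here is a minimal sketch in Python
(an illustration only, not text from the draft). The sample string
and the two charsets are assumptions chosen for the example; the
escaping helpers are from Python's standard urllib.parse module.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical original character sequence: "D" + u-umlaut + "rst".
original = "D\u00fcrst"

# First mapping: original characters -> octets.
# The result depends on which charset the producer happens to use.
octets_latin1 = original.encode("iso-8859-1")
octets_utf8 = original.encode("utf-8")

# Second mapping: octets <-> URL characters (escape sequences).
url_latin1 = quote_from_bytes(octets_latin1)  # 'D%FCrst'
url_utf8 = quote_from_bytes(octets_utf8)      # 'D%C3%BCrst'

# The second mapping is reversible: the octets always come back intact.
assert unquote_to_bytes(url_latin1) == octets_latin1
assert unquote_to_bytes(url_utf8) == octets_utf8

# The first mapping is not: the URL does not say which charset was
# used, so a receiver that guesses wrong gets different characters.
guess = unquote_to_bytes(url_utf8).decode("iso-8859-1")
assert guess != original
```

The same original characters yield two different URLs, and nothing
in either URL tells the receiver how to get back to them.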


> >    In cases where the original character sequence contains characters
> >    that are strictly within the set of characters defined in the
> >    US-ASCII character set,

This is a rather convoluted expression. How about:

In case the original character sequence only contains characters
from the character repertoire of US-ASCII,...


> >                            the mapping is simple: each original
> >    character is translated into the US-ASCII code for it, and
> >    subsequently represented either as the same character, or as an
> >    escape sequence.
> > 
> >    In general practice, many different character encoding schemes are
> >    used in the second mapping (between sequences of represented
> >    characters and sequences of octets) and there is generally no
> >    representation in the URL itself of which mapping was used. While
> >    there is a strong desire to provide for a general and uniform
> >    mapping between more general scripts and URLs, the standard for
> >    such use is outside of the scope of this document.
> 
> I find this much too wishy-washy.

Well, wishy-washy may not be the right term, but I agree that things
can be improved. In particular, it is very doubtful to me whether
the last sentence indeed is "pointing out that the definition of
non-ASCII URLs is handled outside of this document". There is
mention of a "standard", but that either means that there is
indeed a standard, or that the word "standard" is just a filler,
in which case the last sentence above only means that this
document does not deal with these issues, and not that they
are handled somewhere else. This makes a big difference.


> I think we should explicitly forbid the
> use of 8-bit characters and hex-encoded 8-bit characters, except as
> defined by the future I18N URL standard.  We need to make it very clear
> that programs sending 8-bit URLs over the wire are broken (unless they use
> UTF8 according to the future standard).

I think Chris was a bit too short and general here, but basically,
I agree. See below for details.
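
As a rough illustration of the two things Chris mentions, the
following Python sketch finds raw 8-bit octets and hex-encoded
8-bit octets in a URL as it would appear on the wire. The function
names are hypothetical, and nothing here is taken from the draft
or from any future standard.

```python
import re

# Matches one escape sequence such as %FC and captures its hex digits.
_ESCAPE = re.compile(rb"%([0-9A-Fa-f]{2})")

def raw_8bit_octets(url_bytes):
    """Return the octets in url_bytes that are outside US-ASCII."""
    return [b for b in url_bytes if b >= 0x80]

def escaped_8bit_octets(url_bytes):
    """Return the values of %XX escapes that denote octets >= 0x80."""
    values = [int(h, 16) for h in _ESCAPE.findall(url_bytes)]
    return [v for v in values if v >= 0x80]

# A URL sent with a raw ISO-8859-1 octet, and one with it escaped.
assert raw_8bit_octets(b"/D\xfcrst") == [0xFC]
assert escaped_8bit_octets(b"/D%FCrst") == [0xFC]

# A pure US-ASCII URL triggers neither check.
assert raw_8bit_octets(b"/Duerst%20x") == []
assert escaped_8bit_octets(b"/Duerst%20x") == []
```

Neither check can tell which charset produced the octets; they only
show where a URL leaves the US-ASCII range.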


On Fri, 2 May 1997, Larry Masinter wrote:

> #I find this much too wishy-washy. 
> Not every section of a document can explicitly forbid
> everything that is forbidden. In general, standards
> documents work best when they say "how do I use this"
> rather than listing lots of rules.

I agree in general. But there are big differences between
the case where something is forbidden just because it is
forbidden and cases where it is discouraged because it
interferes with future (and now clearly envisioned) uses.

I think we have made enormous progress in agreeing on many
components of URL internationalization. Yet when I read the
draft, although I have to admit that there are a lot of
changes in the right direction, there is little that
reflects the progress we have made or the consequences
this should have for URL implementations.


> The future standard will set the standard for the future.
> All this document says is that it doesn't set that standard.

Which is definitely not the same as your intention of
"pointing out that the definition of non-ASCII URLs is handled
outside of this document". There is a significant difference
between saying "it's not done here" and "it's done elsewhere".


> # We need to make it very clear that programs sending 8-bit URLs over
> # the wire are broken (unless they use UTF8 according to the future
> # standard).
> 
> The purpose of this document is to define the standard for
> how URLs work, and not to 'send a message' about a future
> standard.

It is very clear that the main purpose of this document is
to define the standard for how URLs currently work. Close to
30 pages are devoted to this purpose. The purpose, structure,
content, and integrity of this document are not affected
by a note stating what we already assume or know about
i18n URLs, in a way that will make the possible consequences
clear to implementers while not precluding anything in the
(improbable) case that we decide to go a completely different
way.

What we are currently doing is pretending that
our right hand doesn't know what our left hand is doing.
This in no way helps implementers (the main addressees
of the draft) make the right decisions or look out in the
right places for what's going on, or even get the idea to
look out in the first place. 


> I and Martin have actually started work on the
> 'message', and if you want to help 'send a message' about UTF8,
> I invite you to actually help craft the 'message'.

Helping to craft this 'message' (which I very strongly support!)
and advocating an appropriate formulation of the syntax
draft are not mutually exclusive activities :-).


> When we have a standard for UTF8 URLs, we'll have a standard
> for UTF8 URLs. But that's the only message that you can
> send that will have any meaning.

Then why don't we send it, in a form that doesn't need reading
between the lines? I propose the following wording,
to be added at the end of Section 2.1:

Note: There is currently work going on to define a standard
	for URLs including characters outside the US-ASCII
	repertoire based on UTF-8 as a character encoding.


Regards,	Martin.
Received on Monday, 12 May 1997 10:51:09 UTC