Re: revised "generic syntax" internet draft from Edward Cherlin on 1997-04-15 (uri@w3.org from April 1997)

From: Edward Cherlin <cherlin@newbie.net>
Date: Tue, 15 Apr 1997 12:55:28 -0700
To: uri@bunyip.com
Message-Id: <v03007830af796f9a9a4a@[206.245.192.43]>

"Roy T. Fielding" <fielding@kiwi.ICS.UCI.EDU> wrote:

>I am going to try this once more, and then end this discussion.

Good luck. :-) But seriously, I think we can make some progress here. These
concerns can be dealt with.

>I have already repeated these arguments several times, on several
>lists, over the past two years, and their substance has been repeatedly
>ignored by Martin and Francois.

I'll let them speak for themselves, but for me, it is not that I ignore
these points. I don't consider them as real obstacles, for reasons which I
have stated, and will now state again in more detail.

>That is why I get pissed-off when
>Martin sends out statements of consensus which are not true, whether he
>realizes them to be false or not.

I think we have a clear consensus now that there is no consensus on the use
of %HH-encoded UTF-8, but that there is some agreement that *something* has
to be done to support characters beyond ASCII sometime.

>The html-i18n RFC is a travesty
>because actual solutions to the problems were IGNORED in favor of
>promoting a single charset standard (Unicode).

You prefer not promoting a single standard?

>I personally would
>approve of systems using Unicode, but I will not standardize solutions
>which are fundamentally incompatible with existing practice.  In my world,
>it is more important to have a deployable standard than a perfect standard.

Total agreement with what you say here. We want a deployable standard,
rather than no standard, and we have agreed to put off the quest for a
perfect standard until next time.

>
>PROBLEM 1:  Users in network environments where non-ASCII characters
>            are the norm would prefer to use language-specific characters
>            in their URLs, rather than ASCII translations.

And already do, creating illegal URLs.

>Proposal 1a: Do not allow such characters, since the URL is an address
>             and not a user-friendly string.  Obviously, this solution
>             causes non-Latin character users to suffer more than people
>             who normally use Latin characters, but is known to interoperate
>             on all Internet systems.

Impractical. We do agree on this, don't we?

>Proposal 1b: Allow such characters, provided that they are encoded using
>             a charset which is a superset of ASCII.  Clients may display
>             such URLs in the same charset of their retrieval context,
>             in the data-entry charset of a user's dialog, as %xx encoded
>             bytes, or in the specific charset defined for a particular
>             URL scheme (if that is the case).  Authors must be aware that
>             their URL will not be widely accessible, and may not be safely
>             transportable via 7-bit protocols, but that is a reasonable
>             trade-off that only the author can decide.

In other words, make a major extension to URL syntax to allow 8-bit
characters. This will break much existing software, we are told.

>Proposal 1c: Allow such characters, but only when encoded as UTF-8.

In compliance with current 7-bit URL syntax.

>             Clients may only display such characters if they have a
>             UTF-8 font or a translation table.

That's "can", not "may". The tables are available on the Internet from the
Unicode, Inc. site, and there are free Unicode fonts for a number of
scripts. Software vendors have several commercial sources for the full
range of Unicode font requirements. There is no such thing as a UTF-8 font.

>             Servers are required to
>             filter all generated URLs through a translation table, even
>             when none of their URLs use non-Latin characters.

Is this a burden?

1) URLs in ASCII are already in correct UTF-8, so they need no translation.

2) Servers already have to test for the validity of URLs in all sorts of
ways, and perform numerous transformations. One of the tests they have to
do now in order to cope with common practice and common errors is whether
the characters are legal 7-bit ASCII.
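
To illustrate both points, here is a minimal Python sketch (the function
names are mine, chosen only for illustration). It shows the 7-bit test a
server might already perform, and that an all-ASCII URL has the same bytes
whether it is read as ASCII or as UTF-8.

```python
def is_seven_bit_ascii(url: str) -> bool:
    """Return True if every character of the URL is in the 7-bit ASCII range."""
    return all(ord(ch) < 0x80 for ch in url)


def ascii_equals_utf8(url: str) -> bool:
    """Return True if the ASCII and UTF-8 encodings of an all-ASCII URL match.

    Raises UnicodeEncodeError if the URL is not all-ASCII.
    """
    return url.encode("ascii") == url.encode("utf-8")


plain = "http://www.newbie.net/"
print(is_seven_bit_ascii(plain))   # the cheap validity test
print(ascii_equals_utf8(plain))    # same bytes either way, so no translation
```

So for the all-ASCII case the "translation table" amounts to doing nothing.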

>             Browsers
>             are required to translate all FORM-based GET request data
                  ^^^^^^^^
Or perhaps just encouraged, as in Martin's proposal.


>             to UTF-8, even when the browser is incapable of using UTF-8
>             for data entry.

No translation. ASCII is correct UTF-8. No testing needed, either, in this
case.

>             Authors must be aware that their
>             URL will not be widely accessible,

I.e., in the usage we expect, more accessible to the target audience, and
less so to those who don't matter.

>             and may not be safely
>             transportable via 7-bit protocols,

Eh? Weren't we discussing an all-ASCII encoding with %HH?
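
A small Python sketch of what I mean, using the standard urllib.parse
module (the helper name and the sample string are only for illustration):
the text is encoded as UTF-8, and every byte outside the unreserved ASCII
set is written as %HH, so what goes on the wire is 7-bit clean.

```python
from urllib.parse import quote


def to_percent_utf8(segment: str) -> str:
    """Encode a path segment as UTF-8, then %HH-escape every unsafe byte."""
    return quote(segment, safe="", encoding="utf-8")


encoded = to_percent_utf8("François")
print(encoded)                               # Fran%C3%A7ois
print(all(ord(ch) < 0x80 for ch in encoded)) # result is pure ASCII
```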

>             but that is a reasonable
>             trade-off that only the author can decide.  Implementers
>             must also be aware that no current browsers and servers
>             work in this manner (for obvious reasons of efficiency),

Not for reasons of efficiency, but for lack of a standard. Some of us are eager to do
these things, and are only awaiting sufficient guidance, such as a
suggestion of the direction the standards will take.

>             and thus recipients of a message would need to maintain two
>             possible translations for every non-ASCII URL accessed.

Presumably we can arrange for ASCII users to receive the ASCII-encoded
version, and others can use the version in their script. The software
requirements to do this are not onerous. We have been through similar
problems in implementing MIME mail, and it is happening again with HTML
mail. Not everybody can read what some people can send. Does this mean we
shouldn't implement the more powerful method?

>             In addition, all GET-based CGI scripts would need to be
>             rewritten to perform charset translation on data entry, since
>             the server is incapable of knowing what charset (if any)
>             is expected by the CGI.  Likewise for all other forms of
>             server-side API.

I don't understand this. CGI scripts for ASCII data entry do not have to be
rewritten. CGI scripts for national character sets mostly work and
sometimes break today, as they generate a multitude of illegal URLs. UTF-8
will be an *option*, but everyone will know that it is the recommended
option where appropriate, so implementors will have guidance and users will
eventually be spared the problem.

I assume that no Webmaster will deliberately set up a CGI script for UTF-8
without knowing whether the server can handle it. He can provide a Java
applet to any browser that doesn't know what to do. So who has the problem?

(Note to François and Martin: Can either of you create, or get someone to
create, such an applet? It might settle another point of controversy.)

>
>Proposal 1a is represented in the current RFCs and the draft,
>since it was the only one that had broad agreement among the
>implementations of URLs.  I proposed Proposal 1b as a means
>to satisfy Martin's original requests without breaking all
>existing systems, but he rejected it in favor of Proposal 1c.

I'm with Martin, on the grounds that 1b breaks existing systems, and his
proposal, which is *NOT* 1c, doesn't.

>I still claim that Proposal 1c cannot be deployed and will
>not be implemented, for the reasons given above.  The only
>advantage of Proposal 1c is that it represents the
>Unicode-uber-alles method of software standardization.

No straw men, please. And spare us the ad hominem rubbish, too. We had
enough of that last week. Unicode will prevail for sound technical and
economic reasons. We just want to help clear away the obstacles sooner
rather than later.

Martin and I, at least, claim that we have shown how to implement our
actual proposal, which is *NOT* 1c, without breaking anything. 1c would
require people to go out and get Unicode software. We *recommend* it to
anyone who needs non-ASCII characters. (BTW, please don't say non-Latin
when you mean non-ASCII. There are lots of standard English-language
Latin-script characters not in ASCII, and many hundreds more in other
Latin-script languages.)

>Proposal 1b achieves the same result, but without requiring
>changes to systems that have never used Unicode in the past.

I'm getting tired of repeating myself, but it is our proposal that requires
no changes, and yours that does.

>If Unicode becomes the accepted charset on all systems, then
>Unicode will be the most likely choice of all systems, for the
>same reason that systems currently use whatever charset is present
>on their own system.
>
>Martin is mistaken when he claims that Proposal 1c can be implemented

Martin is correct, since he does not now propose Proposal 1c.

>on the server-side by a few simple changes, or even a module in Apache.
>It would require a Unicode translation table for all cases where the
>server generates URLs,

Except for ordinary all-ASCII URLs, which I believe is the only case
troubling you.

>including those parts of the server which are
>not controlled by the Apache Group (CGIs, optional modules, etc.).

Where we assume that page designers and Webmasters will learn what to do,
as they must when dealing with any feature of HTML, URLs, etc. that is not
fully implemented--frames, for example.

>We cannot simply translate URLs upon receipt, since the server has no
>way of knowing whether the characters correspond to "language" or
>raw bits.  The server would be required to interpret all URL characters
>as characters, rather than the current situation in which the server's
>namespace is distributed amongst its interpreting components, each of which
>may have its own charset (or no charset).

Sorry, I don't follow. The server will be asked to deal with pure ASCII
URLs, unlike the present situation, where servers are handed URLs
containing 80-FF bytes. It can either treat them as pure ASCII, with no
processing, or it can look to see whether they may be %HH-encoded UTF-8,
which doesn't take AI.
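
As a rough sketch of the check I have in mind (Python, standard library
only; the function name and return labels are hypothetical): undo the %HH
escapes to get raw bytes, then see whether those bytes are plain ASCII or
well-formed UTF-8. This is a heuristic, since a byte string in some other
charset can occasionally happen to be well-formed UTF-8 as well.

```python
from urllib.parse import unquote_to_bytes


def classify_path(path: str) -> str:
    """Classify a URL path as 'ascii', 'utf-8', or 'unknown' after unescaping."""
    raw = unquote_to_bytes(path)
    if all(b < 0x80 for b in raw):
        return "ascii"          # nothing to do, treat as today
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "unknown"        # some other charset, or raw bits
    return "utf-8"              # well-formed UTF-8 behind the %HH escapes


print(classify_path("/index.html"))      # ascii
print(classify_path("/Fran%C3%A7ois"))   # utf-8
print(classify_path("/Fran%E7ois"))      # unknown
```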

Again, we may assume that URLs sent to a server bear some relation to the
URLs published from that server, or to CGI scripts and other methods of
generating URLs from clients. So we may assume that Webmasters will make
sure that they can interpret the URLs they plan to receive.

What you are describing is not the result of our proposal, but the current
situation in which the server may receive illegal URLs in any of dozens of
charsets. Our proposal would only reduce the problem.

>Even if we were to make such
>a change, it would be a disaster since we would have to find a way to
>distinguish between clients that send UTF-8 encoded URLs and all of those
>currently in existence that send the same charset as is used by the HTML
>(or other media type) page in which the FORM was obtained and entered
>by the user.

Try again? Clients could send legal all-ASCII URLs under our proposal. If a
server doesn't want to be bothered with UTF-8 processing, on the grounds
that it has only ASCII URLs pointing to its pages, and doesn't accept
anything but ASCII from CGI, it can ignore the issue totally. If a server
wants to accept UTF-8, it won't need AI to do so from a conforming client,
and will be no worse off than it is now with non-conforming clients.

>The compromise that Martin proposed was not to require UTF-8, but
>merely recommend it on such systems.  But that doesn't solve the problem,
>so why bother?

You think people won't rush to use a solution once they know which one it
is? You think they won't hammer on their vendors for it? (It's OK,
François, we know you're pedalling as fast as you can. :-)  ) You think
this won't be a standard feature of multilingual browsers and servers by
next year if we give the go-ahead? (Not all browsers and servers, but
that's the marketplace. You always have a tradeoff of features.)

>PROBLEM 2:  When a browser uses the HTTP GET method to submit HTML
>            form data, it url-encodes the data within the query part
>            of the requested URL as ASCII and/or %xx encoded bytes.
>            However, it does not include any indication of the charset
>            in which the data was entered by the user, leading to the
>            potential ambiguity as to what "characters" are represented
>            by the bytes in any text-entry fields of that FORM.
>
>Proposal 1a:  Let the form interpreter decide what the charset is, based
>              on what it knows about its users.  Obviously, this leads to
>              problems when non-Latin charset users encounter a form script
>              developed by an internationally-challenged programmer.
>
>Proposal 1b:  Assume that the form includes fields for selecting the data
>              entry charset, which is passed to the interpreter, and thus
>              removing any possible ambiguity.  The only problem is that
>              users don't want to manually select charsets.
>
>Proposal 1c:  Require that the browser submit the form data in the same
>              charset as that used by the HTML form.  Since the form
>              includes the interpreter resource's URL, this removes all
>              ambiguity without changing current practice.  In fact,
>              this should already be current practice.  Forms cannot allow
>              data entry in multiple charsets, but that isn't needed if the
>              form uses a reasonably complete charset like UTF-8.

Does this mean I can create a Latin-script page, using Unicode as the
charset, that is readable in an ASCII-only browser, so that I don't break
existing software trying to permit multiscript data entry? Does it mean I
can't use ASCII as the charset on a Latin-script page if I want to allow
multilingual data entry? That would be a strange requirement.
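
For anyone who wants the ambiguity of PROBLEM 2 made concrete, a short
Python sketch (the sample query value is invented for illustration): the
same %HH byte decodes to different characters depending on which charset
the interpreter guesses, whereas the UTF-8 form of the same character is a
different byte sequence altogether.

```python
from urllib.parse import quote, unquote

query_value = "caf%E9"  # hypothetical form field value, charset not stated

as_latin1 = unquote(query_value, encoding="iso-8859-1")
as_koi8r = unquote(query_value, encoding="koi8-r")
print(as_latin1 == as_koi8r)   # False: same bytes, different characters

# The UTF-8 form of the Latin-1 reading uses two bytes for the last letter.
print(quote(as_latin1, safe="", encoding="utf-8"))   # caf%C3%A9
```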

>Proposal 1d:  Require that the browser include a <name>-charset entry
>              along with any field that uses a charset other than the one
>              used by the HTML form.  This is a mix of 1b and 1c, but
>              isn't necessary given the comprehensive solution of 1c
>              unless there is some need for multi-charset forms.
>
>Proposal 1e:  Require all form data to be UTF-8.  That removes the
>              ambiguity for new systems, but does nothing for existing
>              systems since there are no browsers that do this.  Of course
>              they don't, because the CGI and module scripts that interpret
>              existing form data DO NOT USE UTF-8, and therefore would
>              break if browsers were required to use UTF-8 in URLs.

>The real question here is whether or not the problem is real.  Proposal 1c
>solves the ambiguity in a way which satisfies both current and any future
>practice, and is in fact the way browsers are supposed to be designed.
>Yes, early browsers were broken in this regard, but we can't fix early
>browsers by standardizing a proposal that breaks ALL browsers.
                                           ^^^^^^^^^^
No, it doesn't. We recommend, not require, UTF-8. We would encourage the
use of method 1c, where appropriate, but also encourage a transition to
UTF-8 *where appropriate*. In a year or two, after we have a full set of
implementations, we can discuss whether we need to *require* UTF-8 or any
other form of Unicode.

>The only
>reasonable solution is provided by Proposal 1c and matches what non-broken
>browsers do in current practice.  Furthermore, IT HAS NOTHING TO DO WITH
>THE GENERIC URL SYNTAX, which is what we are supposed to be discussing.
>
>The above are the only two problems that I've heard expressed by Martin
>and Francois -- if you have others to mention, do so now.  Neither of the
>above problems are solved by requiring the use of UTF-8, which is what
                              ^^^^^^^^^
You must be upset. You can't argue both ways, that we are requiring
something that will break software, and that we are only recommending it,
as you stated yourself above, or that requiring UTF-8 doesn't solve any
problems, and recommending UTF-8 doesn't solve any problems. I guarantee
that UTF-8 will solve problems that I have now.

>Larry was saying over-and-over-and-over again, apparently to deaf ears.

We're not deaf; we think you're wrong, as we have been saying
over-and-over-and-over again, apparently to deaf ears.

>I think both of us are tired of the accusations of being ASCII-bigots

And we're tired of the accusations of being Unicode-bigots.

>simply because we don't agree with your non-solutions.  Either agree to
>a solution that works in practice, and thus is supported by actual
>implementations, or we will stick with the status quo, which at least
>prevents us from breaking things that are not already broken.

Nah, you're not an ASCII bigot. You just don't get it. :-) :-) :-) Yes, you
want a sound solution. So do we. Let us return to the issues.

>BTW, I am not the editor -- I am one of the authors.  Larry is the only
>editor of the document right now.
>
> ...Roy T. Fielding
>    Department of Information & Computer Science    (fielding@ics.uci.edu)
>    University of California, Irvine, CA 92697-3425    fax:+1(714)824-1715
>    http://www.ics.uci.edu/~fielding/

It is clear that we now need to see implementations of %HH/UTF-8 form
encoding and interpretation in browsers and servers before we can get
agreement that implementation is possible. Two browsers and two servers, I
believe. Any takers?

Is there a consensus *among those who favor going ahead* that this is the
right way to do it? If so, let's pass the word and get on with it, whether
or not we get this particular standard fixed in this round.

Shall we start a separate implementation mailing list?

--
Edward Cherlin     cherlin@newbie.net     Everything should be made
Vice President     Ask. Someone knows.       as simple as possible,
NewbieNet, Inc.                                 __but no simpler__.
http://www.newbie.net/                Attributed to Albert Einstein
Received on Wednesday, 16 April 1997 01:53:56 UTC