Re: "Difficult Characters" draft

Martin J. Duerst (mduerst@ifi.unizh.ch)
Tue, 6 May 1997 20:50:15 +0200 (MET DST)


Date: Tue, 6 May 1997 20:50:15 +0200 (MET DST)
From: "Martin J. Duerst" <mduerst@ifi.unizh.ch>
To: Larry Masinter <masinter@parc.xerox.com>
cc: "Alain LaBont/e'/" <alb@riq.qc.ca>, URI mailing list <uri@bunyip.com>
Subject: Re: "Difficult Characters" draft
In-Reply-To: <336F5302.64F7@parc.xerox.com>
Message-ID: <Pine.SUN.3.96.970506203138.245T-100000@enoshima>

Larry,

Many thanks for your interesting thoughts.

On Tue, 6 May 1997, you wrote:

> Perhaps you could mention in your draft about the use of
> identifiers with characters outside of ASCII that such
> use is actually problematic,

Well, it is of course a little less easy for some people
than using ASCII.


> and that some applications
> which use canonical identifiers and exact match as a way
> of doing symbol lookup when restricted to ASCII-only symbols
> might find that users of languages other than English
> will be ill-served by such a design; in some applications
> using a careful language-sensitive equivalence lookup
> (instead of exact-match) would make the software actually
> accommodate the needs and practices of such users.

Well, there have been some interesting examples. But
frankly speaking, I don't think that the average
French speaker should have more difficulty
transcribing the correct accents than the average
US user has getting case correct.
Quite to the contrary, case in URLs is often rather
random, whereas accents in a known word can easily
be reconstructed. Also, having to test all possible
accented and non-accented versions of a word is much
easier than having to test all different capitalizations
of a word. What remains is most probably the basic
surprise when a French user is first confronted with an
accented URL. She might not believe it is real and just
type it in without accents, and get an error :-).
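
To make the comparison of search spaces concrete, here is a minimal
sketch under one reading of the argument: starting from a correctly
accented word, each accented letter either keeps or loses its accent,
whereas every letter can independently be upper or lower case. The
function names and the sample word are illustrative assumptions, not
part of the draft under discussion.

```python
import unicodedata
from itertools import product


def strip_accent(ch):
    """Return ch without combining marks, e.g. U+00E9 -> 'e'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def accent_variants(word):
    """Spellings in which each accented letter keeps or loses its accent."""
    choices = []
    for ch in word:
        base = strip_accent(ch)
        choices.append((ch,) if base == ch else (ch, base))
    return {"".join(p) for p in product(*choices)}


def case_variants(word):
    """Spellings in which each cased letter is upper or lower case."""
    choices = [
        (ch.lower(), ch.upper()) if ch.lower() != ch.upper() else (ch,)
        for ch in word
    ]
    return {"".join(p) for p in product(*choices)}


word = "r\u00e9sum\u00e9"  # hypothetical sample word with two accented letters
print(len(accent_variants(word)))  # 4  (2 accented letters, 2 choices each)
print(len(case_variants(word)))    # 64 (6 letters, 2 choices each)
```

Under this reading the accent search grows only with the number of
accented letters, while the case search grows with the length of the
word. A search that also tried every possible accent on every letter
would be larger than what is counted here.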


> The mail in the recent week has been full of good examples
> of places where canonicalization is either ill-specified
> or context-sensitive, and "equivalence matching"
> would be far more practical.

Equivalence matching could save a lot of US typos also.
But nobody ever cared to do equivalence matching for them.
It's assumed that the user types things in correctly.
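
As a rough illustration of what equivalence matching could look like on
the lookup side, the sketch below folds both the stored names and the
requested name to a comparison key (accents dropped, case ignored) and
returns the stored spelling on a hit. The class, the folding rule, and
the file names are assumptions made for illustration; the folding is
deliberately crude and is not language-sensitive.

```python
import unicodedata


def fold(name):
    """Comparison key: decompose, drop combining marks, then casefold."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


class EquivalenceIndex:
    """Hypothetical name table that matches by folded key, not exact string."""

    def __init__(self, names):
        # If two names fold to the same key, the later one wins.
        self._by_key = {fold(n): n for n in names}

    def lookup(self, requested):
        """Return the stored spelling for an equivalent name, or None."""
        return self._by_key.get(fold(requested))


index = EquivalenceIndex(["R\u00e9sum\u00e9.html", "Index.html"])
print(index.lookup("resume.HTML"))   # stored spelling: Résumé.html
print(index.lookup("missing.html"))  # None
```

Names that fold to the same key collide, so a real deployment would
have to decide which distinctions it is willing to give up.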


> Fortunately, it's possible that equivalence-based matching
> could be deployed for URLs;

That's interesting. But it would be a lot more work than the
conversions from and to UTF-8 that I have suggested for backwards
compatibility and that have raised great concerns from Roy.
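
For readers unfamiliar with the general idea of carrying non-ASCII
characters in URLs via UTF-8, here is a minimal sketch using
percent-escapes of the UTF-8 bytes. It only illustrates the round trip;
it is not the text of the proposal, and the helper names and the sample
path are assumptions.

```python
from urllib.parse import quote, unquote


def to_escaped_path(path):
    """Percent-encode a path; non-ASCII characters become UTF-8 byte escapes."""
    return quote(path, safe="/")


def from_escaped_path(escaped):
    """Decode percent-escapes, interpreting the bytes as UTF-8."""
    return unquote(escaped, encoding="utf-8")


original = "/cv/r\u00e9sum\u00e9.html"  # hypothetical path
escaped = to_escaped_path(original)
print(escaped)                             # /cv/r%C3%A9sum%C3%A9.html
print(from_escaped_path(escaped) == original)  # True
```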


> other kinds of exact-match
> names will require a separate analysis. Both DNS and HTTP-servers
> (if not FTP servers) could be coaxed into doing equivalence-matching
> instead of exact matching for reference lookup; if they also respond
> with the server's view of the "canonical" name, then we
> won't be asking clients to do what it seems like is nearly
> impossible.

What do you mean by "nearly impossible"? Is it the implementation
of the normalization algorithms that I propose? (one could send
a URL to some "normalization server" if that's really considered
a problem.) Is it a consistent collection of normalization rules
and warnings about codepoints not to use? Is it that you think we
are asking too much of the user?
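
To show the kind of problem a normalization step addresses, the sketch
below uses Unicode normalization form NFC as a stand-in. The message
does not specify the algorithm, so the choice of NFC and the function
name are assumptions.

```python
import unicodedata

# Two spellings that look the same on screen but differ in code points.
precomposed = "caf\u00e9"   # 'e with acute' as one code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent


def normalize_name(name):
    """Map canonically equivalent spellings to one form (NFC here)."""
    return unicodedata.normalize("NFC", name)


print(precomposed == decomposed)                                  # False
print(normalize_name(precomposed) == normalize_name(decomposed))  # True
```

An exact-match lookup treats the two raw strings as different names,
so the comparison only succeeds once both sides are normalized.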

We don't want to ask more of the French user than of the
US user, relative to his/her language abilities. And up to
now, we don't.

Regards,	Martin.