Re: resolving the URL mess

--On Friday, October 03, 2014 14:39 +0100 David Sheets
<kosmo.zb@gmail.com> wrote:

>> I recommend that you take a look at the discussions that have
>> occurred in recent months on the IETF's "urnbis" WG mailing
>> list, whether as an example of some issues or as a set of
>> views about the issues.  I think I like your proposed
>> approach (like others, ask me again when there is a draft we
>> can look at), but it seems to me that 3986 has failed us by
>> being insufficiently clear about the differences between
>> syntax and semantics, between recommendations and
>> requirements, and between examples or suggestions and
>> normative material.   From the perspective of someone who
>> spends a lot of time worrying about things that can go wrong
>> at the localization-internationalization boundary, especially
>> when precise language context is not identified and
>> unambiguous,  I am equally or more concerned when ideas taken
>> from 3987 are folded into the basic URL concept and mixed with
>> various practical alternatives for Unicode character
>> representation and escapes. I want to stress the
>> "insufficiently" part of that: it is possible that 3986 itself
is fine and the difficulty lies with those who try to extract
>> specific information from a 60+ page document.  But those
>> kinds of readers and readings are part of our reality.
> 
> Could you be a little more specific about what concerns you
> have regarding inclusion of IRI concepts into URLs? What do
> you see as the most effective specification approach if not
> unification? A unified specification of two objects, one a
> superset of the other? Layered specifications as we have today?

Borrowing a note from one of Sam's comments, one of our other
problems is that we have allowed ourselves to develop two
different types of UR* specs along a different dimension.  I'm
not sure I have this exactly right, but one of them seems to be
about what can be used in or near a user interface or by a web
page author in an href (or equivalent) and the other about what
is appropriate for over-the-wire transmission, e.g., between
client and server machines.  The very idea that they can be
defined differently, even in layers, depends somewhat on the
assumption that leaks will not occur, but we've got 30+ years of
experience with the Internet and elsewhere that tells us the
leaks will almost always occur sooner or later.  If we don't
take clear advantage of a layering architecture, things get
worse (i.e., you can remove "almost" from the previous sentence).

At least in the first case, we have many ways to represent a
Unicode character: three different Unicode Encoding Forms and
variations on them; %-style encoding that worked well for ISO
8859 but that is somewhat problematic for the combination of
those encoding forms (especially UTF-8) and people; assorted
native C, Java, etc., escaping forms (I don't know whether what
I and others wrote into RFCs 5198 and 5137 made things better or
just added to the confusion); and maybe others.  Attempts to use
IDNA-like Punycode encoding for things other than domain names
don't help either, especially given some of the restrictions on
that encoding.  For those who have become obsessed about
confusable characters in UIs, especially where language
identification or clues cannot be depended upon (as with domain
names), things get worse yet.  Pieces of the PRECIS work, to
say nothing of Unicode compatibility normalization, which
sometimes make some characters equivalent to others while
maintaining distinctions among them in other cases, don't help
either.  For the second, RFC 3986 and at least the current HTTP
spec effectively say "whatever you do, only %-escapes go down
the wire".  But I believe that rule is often violated --whether
accidentally, by sloppiness, or otherwise-- and that there are
servers whose operators believe that accepting and processing
the deviations is far better than listening to the complaints
and blame-casting.
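To make the "many representations" point concrete, here is a small
sketch using only the Python standard library. It is illustrative
only -- the strings "café" and "bücher" are my examples, not
anything from the specs -- but it shows three of the forms
mentioned above: raw UTF-8 percent-escapes, a *different* escape
sequence for the canonically equivalent decomposed spelling, and
Punycode as used for IDNA host labels.

```python
# Sketch: one "character" as seen by a user, several wire forms.
# The example strings here are hypothetical, chosen for illustration.
import unicodedata
from urllib.parse import quote

# "é" has two canonically equivalent Unicode spellings:
nfc = "caf\u00e9"      # U+00E9, precomposed (NFC)
nfd = "cafe\u0301"     # "e" + U+0301 combining acute (NFD)

# A human sees the same word, but the %-escaped (UTF-8) wire forms
# differ, so a server comparing octets sees two different paths:
print(quote(nfc))      # caf%C3%A9
print(quote(nfd))      # cafe%CC%81

# Normalizing before escaping collapses the difference:
assert quote(unicodedata.normalize("NFC", nfd)) == quote(nfc)

# Punycode (via the IDNA codec) is yet another encoding, meant for
# domain-name labels, not for path or query components:
print("bücher".encode("idna"))   # b'xn--bcher-kva'
```

The point of the sketch is only that each layer (UI, page source,
wire) can legitimately hold a different spelling of "the same"
identifier, which is exactly where the leaks described above occur.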

Given that, we have at least two separate issues, the former of
which has been a W3C i18n topic on and off in the last year and
which has to do with how much we can confuse users by permitting
multiple variations, especially when some of them don't work
universally.  And, again, I'm concerned about a race to the
bottom in which almost anything will work sometimes but, from a
user perspective, it is very hard to know what will work (or how
it will be interpreted) in any given practical case.  As I have
told PRECIS a few times, in all of the common cases, things will
just work but that makes the edge cases --cases that we may
recognize as special but that ordinary users and page authors
may not-- all the more surprising when encountered.

 best,
    john

Received on Friday, 3 October 2014 16:55:28 UTC