
RE: resolving the URL mess

From: John C Klensin <klensin@jck.com>
Date: Tue, 07 Oct 2014 10:42:08 -0400
To: Larry Masinter <masinter@adobe.com>, David Sheets <kosmo.zb@gmail.com>
cc: Sam Ruby <rubys@intertwingly.net>, public-urispec@w3.org, Anne van Kesteren <annevk@annevk.nl>
Message-ID: <D14FCC5281213E7DBAFADC2D@JcK-HP8200.jck.com>
(this response more or less assumes some of my comments in the
note in the same thread to Austin about three hours ago -- if
needed, please read that first)

--On Tuesday, October 07, 2014 01:47 +0000 Larry Masinter
<masinter@adobe.com> wrote:

>> >> I recommend that you take a look at the discussions that
>> >> have occurred in recent months on the IETF's "urnbis" WG
>> >> mailing list, whether as an example of some issues or as a
>> >> set of views about the issues.
> 
> There's too much to review there, John, perhaps we could try
> to summarize what the URN requirements were. I thought it was
> mainly fragment identifiers, which have many other problems
> 
> http://www.w3.org/TR/fragid-best-practices/

Actually, it isn't mainly fragment identifiers and never has
been.  A large fraction of the problem is that different people
(or groups) keep grabbing different pieces of the elephant and
then insisting that the part they are describing is the whole
beast -- different communities talking past each other,
sometimes by talking about their own picture of their own needs
and assuming that, if their perceived problems are solved,
everything else will fall into place.  There has also been a lot
of confusion between description of requirements and description
of perceived solutions.

I've tried summarizing several times and none of my summaries
have gotten any traction.   So either I've been wrong or people
are doing more talking than listening.

>> >>  I think I like your proposed
>> >> approach (like others, ask me again when there is a draft
>> >> we can look at), but it seems to me that 3986 has failed
>> >> us by being insufficiently clear about the differences
>> >> between syntax and semantics, between recommendations and
>> >> requirements, and between examples or suggestions and
>> >> normative material. 
> 
> I think it's necessary to say there is some disagreement over
> whether 3986 is insufficiently clear, or whether people are
> just not reading the text that is there.

First, "it seems to me" was intended to cover some of that.
Beyond that, some people have read the text (or think and claim
they have) _very_ carefully and have reached different
conclusions from others who also claim to have read it very
carefully.   Regardless of what the authors intended and the
existence of others who haven't "read the text that is there",
the existence of groups of careful readers with different
interpretations is, to me, the very essence of being
insufficiently clear about those differences.

> And I fear there
> might be a tinge of some sense of wanting control and
> authority, and using the perceived ambiguity as an excuse for
> forking URN from URI.

Let's back up a bit.  I think there are a few people in the IETF
who detest 3986 and would like to see it done away with or
replaced by a very different definition or style of definition.
I can assure you that I was not a member of that group when the
URNBIS effort started; you may choose to believe me about that
or not.  I believe there are also a few people in the IETF and
the broader community who strongly dislike URNs, do not consider
them useful (either at all or except in the narrowest of cases),
and who would be delighted to see them either eliminated or
sufficiently crippled that it becomes obvious to others that
they are not really useful.  I assume that both groups have
their reasons and are completely sincere.  Discussions between
them tend to not be very illuminating and often sound more like
very firm declarations of positions than a dialogue in which two
groups are listening to each other.

During the URNBIS work, there were a series of discussions that
amounted to some communities saying "we need to do X", to which
the response was not "no you don't" nor "here is this other way
to accomplish that" but "you cannot do it because 3986 won't
allow it".   

The latter type of response is, IMO, not useful if only because
the approval of 3986 as a full standard didn't turn it into holy
writ that cannot be reconsidered or amended.  So the question
then turned to whether 3986 actually prohibited the desired
behavior or not and, if it did, what modifications to it
--either globally or for URNs-- were needed.  It was impossible
for the URNBIS WG to resolve the first question, impossible to
resolve the second because people kept intruding with the first,
and impossible to make any progress at all while that discussion
raged.

The proposals to separate URNs from 3986 (and, later, from 3986
semantics) were made with great reluctance on my part and that
of at least most of the others who put them together (I was more
a recorder and editor of the ideas than creator/author) in the
hope of allowing URNBIS to move forward and concentrate on real
questions rather than arguments about what 3986 did or did not
say or did or did not restrict.   Because we could not reach
consensus on whether there were actual restrictions, both
approaches were ways of saying "if there are restrictions (and,
later, non-syntax restrictions), they don't apply to URNs, so
now we can move on".  Turning that into a theory about
conspiracy or power struggle just distracts from the issues.
Those issues include the usability of URNs in different
communities and, ultimately, how far the IETF is willing to go
to accommodate the needs of communities who have at least some
basis to believe they know what they are doing, or whether it is
better to adopt a narrower view with the almost certain result
of a separate and conflicting standard for URNs from another
body or bodies.
 
> >>  From the perspective of someone who
>> >> spends a lot of time worrying about things that can go
>> >> wrong at the localization-internationalization boundary,
>> >> especially when precise language context is not identified
>> >> and unambiguous,  I am equally or more concerned when
>> >> ideas taken from 3987 are folded into the basic URL
>> >> concept and mixed with various practical alternatives for
>> >> Unicode character representation and escapes.
> 
> I think almost all of the implementations fold these things
> together, and trying to separate them into layers might be
> good in theory but difficult to follow.

You may be right.   I believe that the result will be some truly
interesting interoperability problems and user confusion, but
perhaps that is just a cost we need to accept.

>> >> I want to stress the
>> >> "insufficiently" part of that: it is possible that 3986
>> >> itself is fine and the difficulty lies with those who try
>> >> to extract specific information from a 60+ page document.
>> >> But those kinds of readers and readings are part of our
>> >> reality.
> 
> Do you personally find this a problem, or is it really all
> "other people" who have trouble finding assurance?

My reading of 3986 is different from that expressed at IETF 90
by, e.g., Julian and Joe.  
 
>...
> A URL in a web page href, in unicode, is transmitted "over the
> wire". I think the distinction is artificial and just
> confusing. 

The current version of HTTP says something different.  Beyond
that, see the comment about interoperability and user confusion
above.

>> At least in the first case, we have many ways to represent a
>> Unicode character: three different Unicode Encoding Forms and
>> variations on them; %-style encoding that worked well for ISO
>> 8859 but that is somewhat problematic for the combination of
>> those encoding forms (especially UTF-8) and people; assorted
>> native C, Java, etc., escaping forms (I don't know whether
>> what I and others wrote into RFCs 5198 and 5137 made things
>> better or just added to the confusion); and maybe others. 
> 
> 3987 settled on %xx percent-hex encoding of UTF-8. 
> The update I was working on in IRI before it closed took
> the tack of using that EXCEPT for the hostname/authority
> in well-known schemes which would be encoded in
> punycode.

I know about both of those.  I am only suggesting that people
and implementations are unlikely to get that right and be
consistent about it.  In addition, as you certainly know,
whether or not a particular scheme is "well-known" enough to
rate exceptional treatment is an invitation both to endless
arguments and to leakage from implementations that believe their
favorite scheme is well-known enough (regardless of what others
think) or that a library will be used only for well-known
schemes.  Remember also that the Punycode encoding has
some built-in restrictions that make it inappropriate (or worse)
for some authority information that is not part of a domain
name.  Some known possible modifications of the Punycode
encoding can be used to get around a subset of those problems
but the encoding with those modifications is no longer standard
Punycode (another opportunity for interoperability problems).
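
To make the multiplicity concrete, here is a small sketch (mine,
not from the thread) of how the same non-ASCII reference can end
up with several coexisting representations; the strings are
illustrative only, using only Python's standard library:

```python
# Three representations of "the same" identifier material:
# raw Unicode, RFC 3987-style %-encoded UTF-8, and Punycode/ACE
# for the hostname (via Python's stdlib "idna" codec).
from urllib.parse import quote

iri_path = "/caf\u00e9"                    # raw Unicode path segment
pct = quote(iri_path.encode("utf-8"))      # %-encode the UTF-8 bytes
print(pct)                                 # /caf%C3%A9

host = "b\u00fccher.example"               # non-ASCII hostname
ace = host.encode("idna").decode("ascii")  # ACE form, per label
print(ace)                                 # xn--bcher-kva.example
```

Whether a given consumer compares, displays, or re-encodes these
consistently is exactly the interoperability question above.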

>> Attempts to use
>> IDNA-like Punycode encoding for things other than domain names
>> don't help either, especially given some of the restrictions
>> on that encoding.  
> 
> Could you please expand what you mean by this?
> I'm not aware of anyone using Punycode other than
> in hostname for http, ftp, and a few other schemes.

There have been proposals (and implementations) to use it for
email local parts, in various non-domain parts of URI paths, as
an alternative in Encoded-words in email headers, etc.   More
broadly, several groups of people have concluded that it is a
universal ACE rather than something very specific to IDNA.

>> For those who have become obsessed about
>> confusable characters in UIs, especially where language
>> identification or clues cannot be depended upon (as with
>> domain names), things get worse yet. 
> 
> I think avoiding confusable characters is hopeless, and
> some other means of safety check is necessary.

I probably agree about the hopelessness, but I know a lot of
folks who would welcome a serious proposal for appropriate
safety checks.  As you are probably aware, the issue has
promoted the sale of a great deal of snake oil.

>>  Pieces of the PRECIS work, to say
>> nothing about Unicode compatibility normalization, that
>> sometimes make some characters equivalent to others but that
>> maintain distinctions among them in other cases don't help
>> either.
> 
> I'm also not sure how this hurts URLs.... could you say more?

As the most obvious example, assume that there is a URI scheme
(I still don't know where URLs stop and URI schemes (other than
"urn:") become something else, and 3986 as usually interpreted
doesn't help much there either) that incorporates user identity
and/or "password" information in the authority.  Now assume that
there is an underlying protocol that binds those fields to a
PRECIS interpretation.  Suddenly, it makes a difference in which
order URI processing (or comparison) and PRECIS processing and
comparison are applied.  To the extent to which "sameness"
criteria for URIs are used in caching of information (not just
web page caching) or other optimizations, that can hurt the
utility and unambiguousness (wrt "sameness") of those URIs.
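
A small illustration of the ordering problem (my own sketch; NFKC
stands in here for a PRECIS-style mapping, which is a
simplification of what PRECIS profiles actually do):

```python
# Whether normalization happens before or after %-decoding changes
# whether two identifiers compare equal.
from urllib.parse import quote, unquote
from unicodedata import normalize

userinfo = "\ufb01le"                       # 'file' with U+FB01 ligature

# Producer A normalizes first, then %-encodes:
a = quote(normalize("NFKC", userinfo))      # 'file'
# Producer B %-encodes the raw string:
b = quote(userinfo)                         # '%EF%AC%81le'

print(a == unquote(b))                      # False: B still holds U+FB01
print(a == normalize("NFKC", unquote(b)))   # True only after re-normalizing
```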

>>  For the second, RFC 3986 and at least the current HTTP
>> spec effectively say "whatever you do, only %-escapes go down
>> the wire".  
> 
> I don't think 3986 makes any such restrictions. "Whatever
> you do, if you require a URI and not an IRI then you can
> only use  URIs and not use IRIs" would be more like it

At least as I read it, the ABNF of 3986 does not allow non-ASCII
characters in URIs.  If you read it differently, it exposes
another one of the disagreements about clarity I mention above.
Specs that normatively reference 3986 for their definitions of
what URIs are, and that mention URIs and not IRIs (via 3987 or
some other definition), are therefore bound to that restriction.
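
For what it's worth, a rough character-level check along the
lines of that reading (my sketch, not a full 3986 parser, and
deliberately ignoring where in a URI each character is allowed):

```python
# The 3986 ABNF admits only ASCII letters/digits, a fixed set of
# punctuation (gen-delims, sub-delims, unreserved), and %-escapes.
import re

URI_CHARS = re.compile(r"[A-Za-z0-9%:/?#\[\]@!$&'()*+,;=\-._~]*")

def ascii_uri_chars_only(s: str) -> bool:
    """True iff every character is in the 3986 URI repertoire."""
    return URI_CHARS.fullmatch(s) is not None

print(ascii_uri_chars_only("http://example.com/caf%C3%A9"))  # True
print(ascii_uri_chars_only("http://example.com/caf\u00e9"))  # False
```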

>...
>> Given that, we have at least two separate issues, the former
>> of which has been a W3C i18n topic on and off in the last
>> year and which has to do with how much we can confuse users
>> by permitting multiple variations, especially when some of
>> them don't work universally.  
> 
> I thought this work was being done by Unicode consortium,
> what is the W3C I18N document, could you provide a pointer?

Try the WHATWG -> W3C Encoding spec for one example.  And note
that, as soon as one says "Unicode" or even "UTF-8", one has
considerably restricted the domain of discourse (especially
since there are still web pages and sites out there that assume
anything unlabeled but containing characters with the high bit
on is encoded in ISO 8859-1).

>> And, again, I'm concerned about a race to the
>> bottom in which almost anything will work sometimes but, from
>> a user perspective, it is very hard to know what will work
>> (or how it will be interpreted) in any given practical case.
> 
> Some people might need a recap of the "race to the  bottom"
> argument, because that seems at the heart of the
> W3C/WHATWG struggle.

If that is true, I'll try to find time to do it.  At the moment,
I'm not convinced that anyone who does not already understand
that argument is interested.

>...
best,
   john
Received on Tuesday, 7 October 2014 14:42:41 UTC
