RE: resolving the URL mess

> >> I recommend that you take a look at the discussions that have
> >> occurred in recent months on the IETF's "urnbis" WG mailing
> >> list, whether as an example of some issues or as a set of
> >> views about the issues.

There's too much to review there, John. Perhaps we could try to
summarize what the URN requirements were. I thought it was mainly
fragment identifiers, which have many other problems:

http://www.w3.org/TR/fragid-best-practices/



> >>  I think I like your proposed
> >> approach (like others, ask me again when there is a draft we
> >> can look at), but it seems to me that 3986 has failed us by
> >> being insufficiently clear about the differences between
> >> syntax and semantics, between recommendations and
> >> requirements, and between examples or suggestions and
> >> normative material. 

I think it's necessary to say there is some disagreement over whether
3986 is insufficiently clear, or whether people are just not reading
the text that is there. And I fear there may be a tinge of a desire
for control and authority, with the perceived ambiguity used as an
excuse for forking URN from URI.

> >>  From the perspective of someone who
> >> spends a lot of time worrying about things that can go wrong
> >> at the localization-internationalization boundary, especially
> >> when precise language context is not identified and
> >> unambiguous,  I am equally or more concerned when ideas taken
> >> from 3987 are folded into the basic URL concept and mixed with
> >> various practical alternatives for Unicode character
> >> representation and escapes.

I think almost all of the implementations fold these things together,
and trying to separate them into layers might be good in theory
but is difficult to follow in practice.

> >> I want to stress the
> >> "insufficiently" part of that: it is possible that 3986 itself
> >> is fine and the difficulty lies with those who try to extract
> >> specific information from a 60+ page document.  But those
> >> kinds of readers and readings are part of our reality.

Do you personally find this a problem, or is it really all
"other people" who have trouble finding assurance?

> > Could you be a little more specific about what concerns you
> > have regarding inclusion of IRI concepts into URLs? What do
> > you see as the most effective specification approach if not
> > unification? A unified specification of two objects, one a
> > superset of the other? Layered specifications as we have today?
> 
> Barring a note from one of Sam's comments, one of our other
> problems is that we have allowed ourselves to develop two
> different types of UR* specs along a different dimension.  I'm
> not sure I have this exactly right, but one of them seems to be
> about what can be used in or near a user interface or by a web
> page author in an href (or equivalent) and the other about what
> is appropriate for over-the-wire transmission, e.g., between
> client and  server machines.    The very idea that they can be
> defined differently, even in layers, depends somewhat on the
> assumption that leaks will not occur, but we've got 30+ years of
> experience with the Internet and elsewhere that tells us the
> leaks will almost always occur sooner or later.  If we don't
> take clear advantage of a layering architecture, things get
> worse (i.e., you can remove "almost" from the previous sentence).

A URL in a web page href, in Unicode, is transmitted "over the wire".
I think the distinction is artificial and just confusing.
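
To make that concrete, here's a quick Python sketch (the example.com
URL is mine, purely illustrative) of the same identifier in its
"author" form and its "wire" form:

    from urllib.parse import quote

    href = "http://example.com/résumé.html"  # what the page author types
    wire_path = quote("/résumé.html")        # what the GET request line carries
    print(wire_path)                         # /r%C3%A9sum%C3%A9.html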


> At least in the first case, we have many ways to represent a
> Unicode character: three different Unicode Encoding Forms and
> variations on them; %-style encoding that worked well for ISO
> 8859 but that is somewhat problematic for the combination of
> those encoding forms (especially UTF-8) and people; assorted
> native C, Java, etc., escaping forms (I don't know whether what
> I and others wrote into RFCs 5198 and 5137 made things better or
> just added to the confusion); and maybe others. 
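
For concreteness, here is one character (é, U+00E9) in several of the
forms you list -- a quick Python illustration, nothing normative:

    from unicodedata import normalize
    from urllib.parse import quote

    s = "\u00e9"                      # é as a single code point (NFC)
    print(s.encode("utf-8"))          # b'\xc3\xa9'  -> %C3%A9 once %-escaped
    print(s.encode("utf-16-be"))      # b'\x00\xe9'  (a different encoding form)
    print(quote(s))                   # %C3%A9       (percent-encoded UTF-8)
    print(quote(normalize("NFD", s))) # e%CC%81      (same glyph, decomposed)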


3987 settled on %xx percent-hex encoding of UTF-8.
The update I was working on in the IRI WG before it closed took
the tack of using that EXCEPT for the hostname/authority
in well-known schemes, which would be encoded in
Punycode.
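
Roughly, that mapping looks like the following Python sketch -- my
simplified reading, not the normative algorithm (the safe-character
set is abbreviated, and userinfo is ignored):

    from urllib.parse import urlsplit, urlunsplit, quote

    def iri_to_uri(iri):
        # %-encode the UTF-8 bytes of non-ASCII characters everywhere,
        # EXCEPT the host, which gets IDNA/Punycode instead.
        p = urlsplit(iri)
        host = p.hostname.encode("idna").decode("ascii") if p.hostname else ""
        netloc = host + (":%d" % p.port if p.port else "")
        keep = "/:@!$&'()*+,;=~-._"   # abbreviated reserved set
        return urlunsplit((p.scheme, netloc,
                           quote(p.path, safe=keep),
                           quote(p.query, safe=keep + "?"),
                           quote(p.fragment, safe=keep + "?")))

    print(iri_to_uri("http://例え.テスト/引き割り.html"))
    # http://xn--r8jz45g.xn--zckzah/%E5%BC%95%E3%81%8D%E5%89%B2%E3%82%8A.html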

> Attempts to use
> IDNA-like Punycode encoding for things other than domain names
> don't help either, especially given some of the restrictions on
> that encoding.  

Could you please expand on what you mean by this?
I'm not aware of anyone using Punycode other than
in hostnames for http, ftp, and a few other schemes.

> For those who have become obsessed about
> confusable characters in UIs, especially where language
> identification or clues cannot be depended upon (as with domain
> names), things get worse yet. 

I think avoiding confusable characters is hopeless, and
some other kind of safety check is necessary.
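
For anyone who hasn't seen the stock homograph example, a quick Python
illustration (the paypal string is the usual textbook case, not a
claim about any real registration):

    # Latin 'a' vs Cyrillic 'а' (U+0430): visually identical in most fonts.
    latin = "http://paypal.com/"
    mixed = "http://p\u0430ypal.com/"
    print(latin == mixed)                     # False
    print([hex(ord(c)) for c in "p\u0430y"])  # ['0x70', '0x430', '0x79']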

>  Pieces of the PRECIS work, to say
> nothing about Unicode compatibility normalization, that
> sometimes make some characters equivalent to others but that
> maintain distinctions among them in other cases don't help
> either.

I'm also not sure how this hurts URLs. Could you say more?
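
To spell out the behavior you're pointing at, here's what
compatibility normalization (NFKC) actually does -- it folds some
characters together while leaving other look-alikes distinct:

    import unicodedata

    # NFKC folds some "compatibility" characters into plain forms...
    print(unicodedata.normalize("NFKC", "\ufb01le"))  # 'file' (fi ligature)
    print(unicodedata.normalize("NFKC", "\u2168"))    # 'IX' (Roman numeral nine)
    # ...but leaves cross-script look-alikes distinct:
    print(unicodedata.normalize("NFKC", "\u0430") == "a")  # False (Cyrillic vs Latin)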

>  For the second, RFC 3986 and at least the current HTTP
> spec effectively say "whatever you do, only %-escapes go down
> the wire".  

I don't think 3986 makes any such restriction. "Whatever
you do, if you require a URI and not an IRI, then you can
only use URIs and not IRIs" would be more like it.
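
The operational difference is small but mechanically checkable; a
trivial sketch (the helper name is mine, not anything from 3986):

    def is_plain_uri(s):
        # RFC 3986 URIs are pure ASCII; anything beyond that is IRI territory.
        return all(ord(c) < 128 for c in s)

    print(is_plain_uri("http://example.com/%C3%A9"))  # True: already a URI
    print(is_plain_uri("http://example.com/é"))       # False: an IRI, needs mapping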

> But I believe that rule is often violated --whether
> accidentally, by sloppiness, or otherwise-- and that there are
> servers who believe that accepting and processing the deviations
> is far better than listening to the complaints and blame-casting.

At one time there was an HTTP/2 issue about whether the path
could be UTF-8. I thought that might have been put off,
but I'm not sure.


> Given that, we have at least two separate issues, the former of
> which has been a W3C i18n topic on and off in the last year and
> which has to do with how much we can confuse users by permitting
> multiple variations, especially when some of them don't work
> universally.  

I thought this work was being done by the Unicode Consortium.
What is the W3C i18n document? Could you provide a pointer?

> And, again, I'm concerned about a race to the
> bottom in which almost anything will work sometimes but, from a
> user perspective, it is very hard to know what will work (or how
> it will be interpreted) in any given practical case.

Some people might need a recap of the "race to the bottom"
argument, because that seems to be at the heart of the
W3C/WHATWG struggle.

>  As I have
> told PRECIS a few times, in all of the common cases, things will
> just work but that makes the edge cases --cases that we may
> recognize as special but that ordinary users and page authors
> may not-- all the more surprising when encountered.
> 
Your misfortune is needing to repeat yourself and to draw clearer
conclusions.
