- From: Larry Masinter <masinter@adobe.com>
- Date: Tue, 7 Oct 2014 01:47:25 +0000
- To: John C Klensin <klensin@jck.com>, David Sheets <kosmo.zb@gmail.com>
- CC: Sam Ruby <rubys@intertwingly.net>, "public-urispec@w3.org" <public-urispec@w3.org>, Anne van Kesteren <annevk@annevk.nl>
> >> I recommend that you take a look at the discussions that have
> >> occurred in recent months on the IETF's "urnbis" WG mailing
> >> list, whether as an example of some issues or as a set of
> >> views about the issues.

There's too much to review there, John; perhaps we could try to summarize what the URN requirements were. I thought it was mainly fragment identifiers, which have many other problems: http://www.w3.org/TR/fragid-best-practices/

> >> I think I like your proposed
> >> approach (like others, ask me again when there is a draft we
> >> can look at), but it seems to me that 3986 has failed us by
> >> being insufficiently clear about the differences between
> >> syntax and semantics, between recommendations and
> >> requirements, and between examples or suggestions and
> >> normative material.

I think it's necessary to say there is some disagreement over whether 3986 is insufficiently clear, or whether people are just not reading the text that is there. And I fear there might be a tinge of some sense of wanting control and authority, and using the perceived ambiguity as an excuse for forking URN from URI.

> >> From the perspective of someone who
> >> spends a lot of time worrying about things that can go wrong
> >> at the localization-internationalization boundary, especially
> >> when precise language context is not identified and
> >> unambiguous, I am equally or more concerned when ideas taken
> >> from 3987 are folded into the basic URL concept and mixed with
> >> various practical alternatives for Unicode character
> >> representation and escapes.

I think almost all of the implementations fold these things together, and trying to separate them into layers might be good in theory but difficult to follow.

> >> I want to stress the
> >> "insufficiently" part of that: it is possible that 3986 itself
> >> is fine and the difficulty lies with those who try to extract
> >> specific information from a 60+ page document. But those
> >> kinds of readers and readings are part of our reality.

Do you personally find this a problem, or is it really all "other people" who have trouble finding assurance?

> > Could you be a little more specific about what concerns you
> > have regarding inclusion of IRI concepts into URLs? What do
> > you see as the most effective specification approach if not
> > unification? A unified specification of two objects, one a
> > superset of the other? Layered specifications as we have today?

> Barring a note from one of Sam's comments, one of our other
> problems is that we have allowed ourselves to develop two
> different types of UR* specs along a different dimension. I'm
> not sure I have this exactly right, but one of them seems to be
> about what can be used in or near a user interface or by a web
> page author in an href (or equivalent) and the other about what
> is appropriate for over-the-wire transmission, e.g., between
> client and server machines. The very idea that they can be
> defined differently, even in layers, depends somewhat on the
> assumption that leaks will not occur, but we've got 30+ years of
> experience with the Internet and elsewhere that tells us the
> leaks will almost always occur sooner or later. If we don't
> take clear advantage of a layering architecture, things get
> worse (i.e., you can remove "almost" from the previous sentence).

A URL in a web page href, in Unicode, is transmitted "over the wire". I think the distinction is artificial and just confusing.
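[To make the "over the wire" point concrete, here is a minimal sketch, not from the thread: it uses Python's standard urllib.parse and a made-up path, and shows how a Unicode URL written in an href is typically UTF-8 encoded and %-escaped before it appears in an HTTP request.]

```python
# Minimal illustrative sketch: a Unicode path from an href is UTF-8 encoded
# and then %-escaped before being sent on the wire.
from urllib.parse import quote

href_path = "/caf\u00e9/menu"            # as a page author might write it
wire_path = quote(href_path, safe="/")   # UTF-8 encode, then %-escape
print(wire_path)                         # -> /caf%C3%A9/menu
```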
> At least in the first case, we have many ways to represent a
> Unicode character: three different Unicode Encoding Forms and
> variations on them; %-style encoding that worked well for ISO
> 8859 but that is somewhat problematic for the combination of
> those encoding forms (especially UTF-8) and people; assorted
> native C, Java, etc., escaping forms (I don't know whether what
> I and others wrote into RFCs 5198 and 5137 made things better or
> just added to the confusion); and maybe others.

3987 settled on %xx percent-hex encoding of UTF-8. The update I was working on in IRI before it closed took the tack of using that EXCEPT for the hostname/authority in well-known schemes, which would be encoded in Punycode.

> Attempts to use
> IDNA-like Punycode encoding for things other than domain names
> don't help either, especially given some of the restrictions on
> that encoding.

Could you please expand on what you mean by this? I'm not aware of anyone using Punycode other than in the hostname for http, ftp, and a few other schemes.

> For those who have become obsessed about
> confusable characters in UIs, especially where language
> identification or clues cannot be depended upon (as with domain
> names), things get worse yet.

I think avoiding confusable characters is hopeless, and some other means of safety check is necessary.

> Pieces of the PRECIS work, to say
> nothing about Unicode compatibility normalization, that
> sometimes make some characters equivalent to others but that
> maintain distinctions among them in other cases don't help
> either.

I’m also not sure how this hurts URLs... could you say more?

> For the second, RFC 3986 and at least the current HTTP
> spec effectively say "whatever you do, only %-escapes go down
> the wire".

I don't think 3986 makes any such restrictions. "Whatever you do, if you require a URI and not an IRI, then you can only use URIs and not IRIs" would be more like it.

> But I believe that rule is often violated --whether
> accidentally, by sloppiness, or otherwise-- and that there are
> servers who believe that accepting and processing the deviations
> is far better than listening to the complaints and blame-casting.

At one time there was an HTTP/2 issue about whether the path could be UTF-8. I thought that might have been put off, but I'm not sure.

> Given that, we have at least two separate issues, the former of
> which has been a W3C i18n topic on and off in the last year and
> which has to do with how much we can confuse users by permitting
> multiple variations, especially when some of them don't work
> universally.

I thought this work was being done by the Unicode Consortium. What is the W3C I18N document? Could you provide a pointer?

> And, again, I'm concerned about a race to the
> bottom in which almost anything will work sometimes but, from a
> user perspective, it is very hard to know what will work (or how
> it will be interpreted) in any given practical case.

Some people might need a recap of the "race to the bottom" argument, because that seems to be at the heart of the W3C/WHATWG struggle.

> As I have
> told PRECIS a few times, in all of the common cases, things will
> just work but that makes the edge cases --cases that we may
> recognize as special but that ordinary users and page authors
> may not-- all the more surprising when encountered.
> Your misfortune is needing to repeat and draw clearer conclusions.
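[For readers who want the IRI-to-URI approach described above spelled out, here is a hedged sketch of my own, not text from RFC 3987 or the abandoned IRI draft. It uses only the Python standard library, whose "idna" codec implements IDNA2003, and the textbook hostname "bücher.example"; a real client does more than this.]

```python
# Hedged sketch of the approach described above: percent-encode the UTF-8
# bytes of the path/query/fragment, but encode the hostname with IDNA
# (Punycode). Illustration only; simplifications noted in comments.
from urllib.parse import urlsplit, urlunsplit, quote

def iri_to_uri(iri: str) -> str:
    parts = urlsplit(iri)
    # Hostname goes to Punycode via the stdlib IDNA (2003) codec.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host + (f":{parts.port}" if parts.port else "")
    # Assumes the input contains no pre-existing %-escapes (they would be
    # double-encoded here); a full converter has to handle that case.
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe="/"),
        quote(parts.query, safe="=&"),
        quote(parts.fragment, safe=""),
    ))

print(iri_to_uri("http://b\u00fccher.example/b\u00fccher?titel=caf\u00e9"))
# -> http://xn--bcher-kva.example/b%C3%BCcher?titel=caf%C3%A9
```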
Received on Tuesday, 7 October 2014 01:47:56 UTC