- From: John C Klensin <klensin@jck.com>
- Date: Fri, 03 Oct 2014 12:54:58 -0400
- To: David Sheets <kosmo.zb@gmail.com>
- cc: Sam Ruby <rubys@intertwingly.net>, Larry Masinter <masinter@adobe.com>, public-urispec@w3.org, Anne van Kesteren <annevk@annevk.nl>
--On Friday, October 03, 2014 14:39 +0100 David Sheets
<kosmo.zb@gmail.com> wrote:

>> I recommend that you take a look at the discussions that have
>> occurred in recent months on the IETF's "urnbis" WG mailing
>> list, whether as an example of some issues or as a set of
>> views about the issues. I think I like your proposed approach
>> (like others, ask me again when there is a draft we can look
>> at), but it seems to me that 3986 has failed us by being
>> insufficiently clear about the differences between syntax and
>> semantics, between recommendations and requirements, and
>> between examples or suggestions and normative material. From
>> the perspective of someone who spends a lot of time worrying
>> about things that can go wrong at the
>> localization-internationalization boundary, especially when
>> precise language context is not identified and unambiguous, I
>> am equally or more concerned when ideas taken from 3987 are
>> folded into the basic URL concept and mixed with various
>> practical alternatives for Unicode character representation
>> and escapes. I want to stress the "insufficiently" part of
>> that: it is possible that 3986 itself is fine and the
>> difficulty lies with those who try to extract specific
>> information from a 60+ page document. But those kinds of
>> readers and readings are part of our reality.
>
> Could you be a little more specific about what concerns you
> have regarding inclusion of IRI concepts into URLs? What do
> you see as the most effective specification approach if not
> unification? A unified specification of two objects, one a
> superset of the other? Layered specifications as we have
> today?

Borrowing a note from one of Sam's comments, one of our other
problems is that we have allowed ourselves to develop two
different types of UR* specs along a different dimension. I'm
not sure I have this exactly right, but one of them seems to be
about what can be used in or near a user interface or by a web
page author in an href (or equivalent), and the other about what
is appropriate for over-the-wire transmission, e.g., between
client and server machines. The very idea that they can be
defined differently, even in layers, depends somewhat on the
assumption that leaks will not occur, but we've got 30+ years of
experience with the Internet and elsewhere that tells us the
leaks will almost always occur sooner or later. If we don't take
clear advantage of a layering architecture, things get worse
(i.e., you can remove "almost" from the previous sentence).

At least in the first case, we have many ways to represent a
Unicode character: three different Unicode Encoding Forms and
variations on them; %-style encoding that worked well for ISO
8859 but that is somewhat problematic for the combination of
those encoding forms (especially UTF-8) and people; assorted
native C, Java, etc., escaping forms (I don't know whether what
I and others wrote into RFCs 5198 and 5137 made things better or
just added to the confusion); and maybe others. Attempts to use
IDNA-like Punycode encoding for things other than domain names
don't help either, especially given some of the restrictions on
that encoding. For those who have become obsessed with
confusable characters in UIs, especially where language
identification or clues cannot be depended upon (as with domain
names), things get worse yet.
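To make that multiplicity concrete, here is a minimal sketch
(Python, using only the standard urllib.parse and unicodedata
modules; the particular character and forms are just
illustrative) of how many spellings a single character, "é"
(U+00E9), can legitimately end up with in and around a URL:

    # Several spellings of the same character, U+00E9 ("é"),
    # as they can appear in or near a URL.
    import unicodedata
    from urllib.parse import quote

    e_acute = "\u00e9"                    # precomposed form (NFC)

    utf8_escaped   = quote(e_acute)                      # '%C3%A9' (UTF-8 %-encoding)
    latin1_escaped = quote(e_acute, encoding="latin-1")  # '%E9'    (old ISO 8859 habit)
    decomposed     = quote(unicodedata.normalize("NFD", e_acute))
                                                         # 'e%CC%81' (e + combining acute)
    source_escape  = "\\u00e9"            # C/Java/JSON-style escape in page source

    for label, form in [("UTF-8 %-escape", utf8_escaped),
                        ("Latin-1 %-escape", latin1_escaped),
                        ("NFD %-escape", decomposed),
                        ("source escape", source_escape)]:
        print(f"{label:18} {form}")

All of these can denote the same character to a human reader,
but a server or cache that compares strings byte-for-byte sees
four different identifiers.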
Pieces of the PRECIS work, to say nothing of Unicode
compatibility normalization, which sometimes make some
characters equivalent to others but maintain distinctions among
them in other cases, don't help either.

For the second, RFC 3986 and at least the current HTTP spec
effectively say "whatever you do, only %-escapes go down the
wire". But I believe that rule is often violated --whether
accidentally, by sloppiness, or otherwise-- and that there are
servers that believe accepting and processing the deviations is
far better than listening to the complaints and blame-casting.

Given that, we have at least two separate issues, the first of
which has been a W3C i18n topic on and off in the last year and
which has to do with how much we can confuse users by permitting
multiple variations, especially when some of them don't work
universally. And, again, I'm concerned about a race to the
bottom in which almost anything will work sometimes but, from a
user perspective, it is very hard to know what will work (or how
it will be interpreted) in any given practical case. As I have
told PRECIS a few times, in all of the common cases things will
just work, but that makes the edge cases --cases that we may
recognize as special but that ordinary users and page authors
may not-- all the more surprising when encountered.

    best,
      john
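(A small aside on the normalization point above: a minimal
Python sketch, using the standard unicodedata module with
characters chosen only as examples, of how NFKC folds some
characters together while keeping other look-alikes distinct.)

    # Compatibility normalization (NFKC) makes some characters
    # equivalent while preserving distinctions among others.
    import unicodedata

    samples = [
        ("\ufb01", "fi"),   # LATIN SMALL LIGATURE FI -> folded to "fi"
        ("\u2460", "1"),    # CIRCLED DIGIT ONE       -> folded to "1"
        ("\u0430", "a"),    # CYRILLIC SMALL A        -> NOT folded to Latin "a"
    ]

    for raw, plain in samples:
        folded = unicodedata.normalize("NFKC", raw)
        print(f"U+{ord(raw):04X} -> {folded!r}; equal to {plain!r}? {folded == plain}")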
Received on Friday, 3 October 2014 16:55:28 UTC