- From: John C Klensin <klensin@jck.com>
- Date: Tue, 07 Oct 2014 10:42:08 -0400
- To: Larry Masinter <masinter@adobe.com>, David Sheets <kosmo.zb@gmail.com>
- cc: Sam Ruby <rubys@intertwingly.net>, public-urispec@w3.org, Anne van Kesteren <annevk@annevk.nl>
(this response more or less assumes some of my comments in the note in the same thread to Austin about three hours ago -- if needed, please read that first)

--On Tuesday, October 07, 2014 01:47 +0000 Larry Masinter <masinter@adobe.com> wrote:

>> >> I recommend that you take a look at the discussions that have occurred in recent months on the IETF's "urnbis" WG mailing list, whether as an example of some issues or as a set of views about the issues.
>
> There's too much to review there, John, perhaps we could try to summarize what the URN requirements were. I thought it was mainly fragment identifiers, which have many other problems
>
> http://www.w3.org/TR/fragid-best-practices/

Actually, it isn't mainly fragment identifiers and never has been. A large fraction of the problem is that different people (or groups) keep grabbing different pieces of the elephant and then insisting that the parts they are describing are the whole beast -- different communities talking past each other, sometimes by talking about their own picture of their own needs and assuming that, if their perceived problems are solved, everything else will fall into place. There has also been a lot of confusion between description of requirements and description of perceived solutions.

I've tried summarizing several times and none of my summaries have gotten any traction. So either I've been wrong or people are doing more talking than listening.

>> >> I think I like your proposed approach (like others, ask me again when there is a draft we can look at), but it seems to me that 3986 has failed us by being insufficiently clear about the differences between syntax and semantics, between recommendations and requirements, and between examples or suggestions and normative material.
>
> I think it's necessary to say there is some disagreement over whether 3986 is insufficiently clear, or whether people are just not reading the text that is there.

First, "it seems to me" was intended to cover some of that. Beyond that, some people have read the text (or think and claim they have) _very_ carefully and have reached different conclusions from others who also claim to have read it very carefully. Regardless of what the authors intended and the existence of others who haven't "read the text that is there", the existence of groups of careful readers with different interpretations is, to me, the very essence of "insufficiently clear" about those differences.

> And I fear there might be a tinge of some sense of wanting control and authority, and using the perceived ambiguity as an excuse for forking URN from URI.

Let's back up a bit. I think there are a few people in the IETF who detest 3986 and would like to see it done away with or replaced by a very different definition or style of definition. I can assure you that I was not a member of that group when the URNBIS effort started; you may choose to believe me about that or not. I believe there are also a few people in the IETF and the broader community who strongly dislike URNs, do not consider them useful (either at all or except in the narrowest of cases), and who would be delighted to see them either eliminated or sufficiently crippled that it becomes obvious to others that they are not really useful. I assume that both groups have their reasons and are completely sincere.
Discussions between them tend to not be very illuminating and often sound more like very firm declarations of positions than a dialogue in which two groups are listening to each other.

During the URNBIS work, there were a series of discussions that amounted to some communities saying "we need to do X", to which the response was not "no you don't" nor "here is this other way to accomplish that" but "you cannot do it because 3986 won't allow it". The latter type of response is, IMO, not useful if only because the approval of 3986 as a full standard didn't turn it into holy writ that cannot be reconsidered or amended. So the question then turned to whether 3986 actually prohibited the desired behavior or not and, if it did, what modifications to it --either globally or for URNs-- were needed. It was impossible for the URNBIS WG to resolve the first question, impossible to resolve the second because people kept intruding with the first, and impossible to make any progress at all while that discussion raged.

The proposals to separate URNs from 3986 (and, later, from 3986 semantics) were made with great reluctance on my part and that of at least most of the others who put them together (I was more a recorder and editor of the ideas than creator/author), in the hope of allowing URNBIS to move forward and concentrate on real questions rather than arguments about what 3986 did or did not say or did or did not restrict. Because we could not reach consensus on whether there were actual restrictions, both approaches were ways of saying "if there are restrictions (and, later, non-syntax restrictions), they don't apply to URNs, so now we can move on".

Turning that into a theory about conspiracy or power struggle just distracts from the issues -- issues that include the question of usability of URNs in different communities and, ultimately, how far the IETF is willing to go to accommodate the needs of communities who have at least some basis to believe they know what they are doing, or whether it is better to adopt a narrower view with the almost certain result of a separate and conflicting standard for URNs from another body or bodies.

>
>> >> From the perspective of someone who spends a lot of time worrying about things that can go wrong at the localization-internationalization boundary, especially when precise language context is not identified and unambiguous, I am equally or more concerned when ideas taken from 3987 are folded into the basic URL concept and mixed with various practical alternatives for Unicode character representation and escapes.
>
> I think almost all of the implementations fold these things together, and trying to separate them into layers might be good in theory but difficult to follow.

You may be right. I believe that the result will be some truly interesting interoperability problems and user confusion, but perhaps that is just a cost we need to accept.

>> >> I want to stress the "insufficiently" part of that: it is possible that 3986 itself is fine and the difficulty lies with those who try to extract specific information from a 60+ page document. But those kinds of readers and readings are part of our reality.
>
> Do you personally find this a problem, or is it really all "other people" who have trouble finding assurance.

My reading of 3986 is different from that expressed at IETF 90 by, e.g., Julian and Joe.

>...
> A URL in a web page href, in unicode, is transmitted "over the wire".
> I think the distinction is artificial and just confusing.

The current version of HTTP says something different. Beyond that, see the comment about interoperability and user confusion above.

>> At least in the first case, we have many ways to represent a Unicode character: three different Unicode Encoding Forms and variations on them; %-style encoding that worked well for ISO 8859 but that is somewhat problematic for the combination of those encoding forms (especially UTF-8) and people; assorted native C, Java, etc., escaping forms (I don't know whether what I and others wrote into RFCs 5198 and 5137 made things better or just added to the confusion); and maybe others.

> 3987 settled on %xx percent-hex encoding of UTF-8. The update I was working on in IRI before it closed took the tack of using that EXCEPT for the hostname/authority in well-known schemes which would be encoded in punycode.

I know about both of those. I am only suggesting that people and implementations are unlikely to get that right and be consistent about it.

In addition, as you certainly know, whether or not a particular scheme is "well-known" enough to rate exceptional treatment is an invitation both to endless arguments and to leakage from implementations that believe their favorite scheme is well-known enough (regardless of what others think) or that a library will be used only for well-known schemes.

In addition, remember that the Punycode encoding has some built-in restrictions that make it inappropriate (or worse) for some authority information that is not part of a domain name. Some known possible modifications of the Punycode encoding can be used to get around a subset of those problems, but the encoding with those modifications is no longer standard Punycode (another opportunity for interoperability problems).
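To make that concrete, here is a small sketch using nothing but Python's standard library. The strings ("café", "Bücher") are invented examples, and the built-in "idna" codec implements the older IDNA 2003 rules, but the flavor of both problems is the same: one character can legitimately reach the wire in several byte forms, and the ACE/Punycode machinery carries restrictions designed for host-name labels, not for arbitrary authority or path data.

```python
from urllib.parse import quote

word = "café"

# RFC 3987-style: percent-encode the UTF-8 bytes of the character.
utf8_form = quote(word, safe="")                       # 'caf%C3%A9'

# Legacy practice for ISO 8859-1 content: percent-encode the Latin-1 byte.
latin1_form = quote(word.encode("latin-1"), safe="")   # 'caf%E9'

# Raw non-ASCII, as an IRI would carry it; RFC 3986's grammar is ASCII-only,
# so this form has to be converted before it is a URI at all.
raw_form = word

print(utf8_form, latin1_form, raw_form)
# Three spellings of the same word; a consumer comparing URIs byte for byte
# sees three distinct identifiers.

# IDNA/Punycode restrictions (the stdlib codec follows the IDNA 2003 rules):
label = "Bücher"
ace = label.encode("idna")        # b'xn--bcher-kva'
print(ace, ace.decode("idna"))    # round-trips as 'bücher' -- the case is gone,
                                  # which is harmless for host names but not for
                                  # case-sensitive userinfo or password fields.

# ACE labels are limited to 63 octets; over-long "labels" are simply rejected.
try:
    ("ü" * 64).encode("idna")
except UnicodeError as exc:
    print("rejected:", exc)
```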
>> Attempts to use IDNA-like Punycode encoding for things other than domain names don't help either, especially given some of the restrictions on that encoding.
>
> Could you please expand what you mean by this? I'm not aware of anyone using Punycode other than in hostname for http, ftp, and a few other schemes.

There have been proposals (and implementations) to use it for email local parts, in various non-domain parts of URI paths, as an alternative in Encoded-words in email headers, etc. More broadly, several groups of people have concluded that it is a universal ACE rather than something very specific to IDNA.

>> For those who have become obsessed about confusable characters in UIs, especially where language identification or clues cannot be depended upon (as with domain names), things get worse yet.
>
> I think avoiding confusable characters is hopeless, and some other means of safety check is necessary.

I probably agree about the hopelessness, but I know a lot of folks who would welcome a serious proposal for appropriate safety checks. As you are probably aware, the issue has promoted the sale of a great deal of snake oil.

>> Pieces of the PRECIS work, to say nothing of Unicode compatibility normalization, that sometimes make some characters equivalent to others but that maintain distinctions among them in other cases don't help either.
>
> I'm also not sure how this hurts URLs.... could you say more?

As the most obvious example, assume that there is a URI scheme (I still don't know where URLs stop and URI schemes (other than "urn:") become something else, and 3986 as usually interpreted doesn't help much there either) that incorporates user identity and/or "password" information in the authority. Now assume that there is an underlying protocol that binds those fields to a PRECIS interpretation. Suddenly, it makes a difference in which order URI processing (or comparison) and PRECIS processing and comparison are applied. To the extent to which "sameness" criteria for URIs are used in caching of information (not just web page caching) or other optimizations, that can hurt the utility and unambiguousness (wrt "sameness") of those URIs.
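A minimal sketch of that ordering problem, again with just the Python standard library. PRECIS itself is not available there, so plain NFKC normalization (on which PRECIS profiles are built) stands in for the protocol's string preparation; the "example" scheme, the host, and the user names are invented.

```python
import unicodedata
from urllib.parse import quote, unquote

def prep(s: str) -> str:
    # Stand-in for a PRECIS-style profile: compatibility normalization.
    return unicodedata.normalize("NFKC", s)

def userinfo(uri: str) -> str:
    # Crude extraction of the userinfo from "scheme://user@host/...".
    return uri.split("//", 1)[1].split("@", 1)[0]

user_a = "ju\u0308rgen"   # 'u' + combining diaeresis (decomposed)
user_b = "j\u00fcrgen"    # precomposed 'ü' -- the "same" name to a person

uri_a = "example://" + quote(user_a, safe="") + "@host/"
uri_b = "example://" + quote(user_b, safe="") + "@host/"

# Order 1: percent-encode first, then compare the URIs byte for byte
# (RFC 3986's "simple string comparison").
print(uri_a == uri_b)                          # False

# Order 2: decode the userinfo, apply the protocol's preparation, then compare.
print(prep(unquote(userinfo(uri_a))) ==
      prep(unquote(userinfo(uri_b))))          # True
```

Two defensible processing orders, two different answers about whether the identifiers name the same user.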
>> For the second, RFC 3986 and at least the current HTTP spec effectively say "whatever you do, only %-escapes go down the wire".
>
> I don't think 3986 makes any such restrictions. "Whatever you do, if you require a URI and not an IRI then you can only use URIs and not use IRIs" would be more like it.

At least as I read it, the ABNF of 3986 does not allow non-ASCII characters in URIs. If you read it differently, it exposes another one of the disagreements about clarity I mention above. Specs that normatively reference 3986 for their definitions of what URIs are, and that mention URIs and not IRIs (via 3987 or some other definition), are therefore bound to that restriction.

>...
>> Given that, we have at least two separate issues, the former of which has been a W3C i18n topic on and off in the last year and which has to do with how much we can confuse users by permitting multiple variations, especially when some of them don't work universally.
>
> I thought this work was being done by Unicode consortium, what is the W3C I18N document, could you provide a pointer?

Try the WHATWG -> W3C Encoding spec for one example. And note that, as soon as one says "Unicode" or even "UTF-8", one has considerably restricted the domain of discourse (especially since there are still web pages and sites out there that assume anything unlabeled but containing characters with the high bit on is encoded in ISO 8859-1).

>> And, again, I'm concerned about a race to the bottom in which almost anything will work sometimes but, from a user perspective, it is very hard to know what will work (or how it will be interpreted) in any given practical case.
>
> Some people might need a recap of the "race to the bottom" argument, because that seems at the heart of the W3C/WHATWG struggle.

If that is true, I'll try to find time to do it. At the moment, I'm not convinced that anyone who does not already understand that argument is interested.

>...

best,
   john

Received on Tuesday, 7 October 2014 14:42:41 UTC