- From: Graham Klyne <gk@ninebynine.org>
- Date: Thu, 09 Oct 2014 11:44:20 +0100
- To: public-urispec@w3.org, David Sheets <sheets@alum.mit.edu>
Hi David,

You responded:

> On Wed, Oct 8, 2014 at 7:32 AM, Graham Klyne <gk@ninebynine.org> wrote:
>> Hi,
>>
>> I've just read through the URI-spec list discussion to date, and find
>> myself rather confused about what it actually hopes to achieve.
>
> Hi Graham,
>
> I don't think you're alone in that confusion. From my perspective, the
> broad goal of any new specification effort should be harmonization of
> existing standards and formalization of their components.
> Specifically, Web browser implementors have found that 3986
> insufficiently describes some aspects of URL parsing and manipulation.

Fair enough...

> Others in our broader community of implementors feel similarly. The
> resulting WHATWG URL spec, which aims to correct this deficit in 3986,
> is now making normative statements about URLs and is being touted as a
> replacement for 3986.

... but where you lose me is in treating this as a deficiency in RFC3986.
I fully accept that there are things RFC3986 doesn't cover, but as I said
previously, I see that as a feature, not a bug. I don't see any need to go
back and tear up RFC3986 because of the things it does not say.

To take your example of "some aspects of URL parsing and manipulation", I
think it would be quite appropriate to write a spec that describes these
functions for browsers in a way that builds upon, rather than replaces,
RFC3986. I think it would be wrong to assume that all uses of URIs have
the same requirements for URI parsing and manipulation, and to bake a
particular set of mechanisms into a core URI spec would make the spec less
useful for other applications.

Would it be so hard, or insufficient for the example you mention, to write
a spec called, say, "URL parsing and manipulation for browsers" that
describes how to take a string from a browser address bar and turn it into
an RFC3986-compliant URI string?
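For a flavour of what I have in mind, here is a rough sketch using my
Haskell URI library [1]. The clean-up steps (trimming whitespace, guessing
an "http:" scheme when none is apparent) are illustrative assumptions
about what such a browser spec might choose to mandate, not anything
RFC3986 itself says:

    import Data.Char   (isSpace)
    import Network.URI (URI, escapeURIString, isAllowedInURI, parseURI)

    -- Hypothetical "address bar string -> RFC3986 URI" function.
    addressBarToURI :: String -> Maybe URI
    addressBarToURI input =
        parseURI (withScheme (escapeURIString isAllowedInURI (trim input)))
      where
        -- drop surrounding whitespace (an assumed browser behaviour)
        trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse
        -- if no scheme is apparent, assume "http:" (again, an assumption)
        withScheme s
          | ':' `elem` takeWhile (/= '/') s = s
          | otherwise                       = "http://" ++ s

The point is the final parseURI: whatever fix-ups such a spec chooses to
describe, its output should be a string that RFC3986 already accepts.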
... TL;DR: see above. The rest of this response delves deeper into some of
the points you raise, but all my comments ultimately derive from the
position indicated above.

...

> This state of affairs is confusing and, if left unattended, liable to
> make implementation of correct and interoperable (according to any
> specification) URI handling even more difficult than it already is.

For whom? This isn't a problem I've noticed. I work with language
libraries and they pretty much do what I need.

> ... We
> already know of many areas of confusion in 3986 (percent-encoding
> alphabets for different components, equivalence, parser error
> recovery...) and implementations will continue to diverge without
> significant effort to understand all of the present issues and unify
> the browser vendors', library authors', Web authors', and users' URI
> standards.

I recognize that there are difficulties in internationalization. But URI
strings as defined avoid those by sticking to US-ASCII. IRIs are an
attempt to address these issues, and I accept that's an area that might
usefully be clarified and regularized.

>> I've been writing software and specifications that work with URIs for
>> over a decade, and throughout that time I've found RFC3986 has been a
>> perfectly good specification for what it covers, viz:
>> - defining the syntax of a string used as a URI
>> - identifying parts that can be extracted from a valid URI (*)
>> - a specification for resolving a relative reference to a full
>>   (absolute) URI
>
> RFC3986 does an admirable job at defining some of these structures and
> functions. Notably, RFC3986 is silent on real-world normalization,
> parsing input with errors, incompatible implementations,
> internationalization, and scheme-specific properties.

Sure, it's silent on those things, and I'll repeat: I think that's a
feature, not a bug, because I don't think there's a single solution for
these that's best for all purposes:

- real-world normalization: for what purpose? I submit that different
  purposes will require different normal forms. The main issue I come
  across is URI equality testing, but in practice I find that most of the
  time it's sufficient to treat the URI as an opaque string and compare
  that (per RFC3986; see the sketch after this list). It may be that
  there are different URIs that dereference or identify the same
  resource, but no amount of normalization will make that problem go away
  - ultimately it's an issue that applications (of which browsers are one
  class) must deal with.

- dealing with input errors: error recovery is surely an application
  issue? I'd suggest that if there's a standardized "recovery" for an
  error then it's not an error so much as an alternative form.

- incompatible implementations: again, I think this only makes sense with
  some particular purpose in mind, and not all URI-using applications
  have the same purposes.

- internationalization: agree - see above - for those applications that
  need to deal with mapping between human-readable IRIs and
  US-ASCII-based URIs as protocol elements. But not all applications do
  (or not in the full generality where many of the I18N demons seem to
  lurk).

- scheme-specific properties: surely these are for scheme definitions to
  describe (within the framework of what is described for generic URIs)?
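For concreteness, here is roughly the extent of the purely syntax-based
normalization that RFC3986 itself sanctions, expressed with the helpers
from my Haskell library [1]; anything beyond this is, I'd argue,
application policy:

    import Network.URI (normalizeCase, normalizeEscape,
                        normalizePathSegments)

    -- Syntax-based normalization per RFC3986 section 6.2.2: case of the
    -- scheme and %-escape hex digits, removal of unnecessary %-escapes,
    -- and removal of "." and ".." path segments.
    normalizeURIString :: String -> String
    normalizeURIString =
        normalizePathSegments . normalizeEscape . normalizeCase

    -- e.g. normalizeURIString "HTTP://example.org/a/./b/%7eg"
    --        == "http://example.org/a/b/~g"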
So, while I agree that there are things that can usefully be done, I'm not
seeing anything here that requires replacement of RFC3986.

>> There are many things that one might do with URIs, or ways in which
>> they might be constructed, that are not covered by RFC3986. In my
>> view, that's a feature, not a bug.
>
> I certainly think we should be very careful with the scope of our work
> for upstream acceptance, prompt delivery, confusion avoidance, and
> effort dilution purposes. With that said, it is clear that there are a
> number of related functions that most implementations use or expose
> that are simply not covered by 3986. We should strive to provide a
> solid, unified, well-structured core specification to alleviate the
> pain I mentioned above.

I have my doubts that this is possible, because I don't believe there
exist one-size-fits-all solutions to the issues you mention. If and where
such solutions do exist, then I think they can be written as separate
specs that build upon RFC3986, and can prove their worth in that form.
Proven solutions might then be merged into future successors of RFC3986.

(This, BTW, is my notion of how a "living standard" might work: not as a
dynamic document, but as a dynamic constellation of individual pieces,
with those that have proven their worth used to provide stable points of
reference. For the time being, I see RFC3986 as one of those stable
points, and we risk great damage by trying to tinker with its scope.)

>> So, in my view, I think a URI spec activity would usefully use RFC3986
>> (or successor) as a base specification, and create additional specs
>> that describe additional usage-oriented aspects; e.g. a URI parsing
>> API, a procedure for converting a manually entered string into a URI
>> string, handling of URIs as identifiers vs URIs as locators,
>> internationalization issues, etc.
>
> I agree that RFC 3986 makes a useful guide (and WHATWG URL an
> interesting counterpoint). I would be wary of over-modularization of
> some of these URI specifications, however. Besides introducing very
> procedurally-formal boundaries between closely related functionality,
> development of these specs would almost certainly push back new
> requirements on the core specification.

I think RFC3986 is really much more than a "useful guide". We have 25
years of software development based on the key ideas of which RFC3986 is
the current evolved specification. I think it needs to stand at the heart
of any URI clarification efforts (not protected from evolution where
needed, but used as an anchor point to which other developments can be
referred).

(FWIW, as a developer I've never consulted the WHATWG URL spec, as I find
that RFC3986 is generally adequate for my needs and has the great
advantage of being stable. So this developer has no need of the WHATWG
URL spec.)

I really think that monolithic specs covering all uses are a bad idea, as
they come to look like application specifications, and end up prescribing
things that should properly be left as application implementation
concerns rather than focusing on the essentials needed for
interoperability. I see URIs in information architectures as somewhat
like the hourglass neck represented by the IP protocol in the family of
Internet protocol standards. By sticking to a minimal core concern, it is
able to support a greater variety of applications than a more
comprehensive specification might do. Of course, additional
specifications may still be needed for those particular applications.

> I would be absolutely thrilled to see a constellation of
> specifications incubated together and modularized internally. If that
> effort is successful, I think it would make sense to start looking at
> spinning out dependent specs.

I think there's a danger here of engaging in a monumental act of hubris,
by assuming that you can bring all of the required breadth of expertise
into a single forum. Far safer, and more productive IMO, would be to
stick with a core functionality of known value, and then develop
specifications that build on those core capabilities in well-defined
ways. I think you're much more likely to end up identifying a
constellation of universally useful features that way than by trying to
incubate them together.

As IETF URI scheme reviewer, I see a lot of scheme proposals that have
very little, if anything, to do with the Web. Given that the URI spec is
one of the foundation pieces of the Web, I sometimes find this a bit
disconcerting. But it is also testament to the widespread utility of URIs
as an engineering artifact beyond the Web for which they were designed.
IMO, this kind of utility is most unlikely to be achieved by an attempt
to incubate a constellation of core specifications. In this, I strongly
believe less is more - i.e. by doing less we can in the long run achieve
more.

> Finally, as there appears to be interest in very accurate
> specification of URI functions, I think any new effort for URI
> specification will necessarily involve a significant investment in
> tools for spec construction. If a specification strives to completely
> describe the inputs and outputs of functions (e.g. "string -> uri"),
> then, to my mind, it should exist as a formal description of such
> first and include annotations for human consumption secondarily. This
> is not to say that a human-readable spec is a second-class citizen in
> this world; simply that a machine-analyzable spec should also be first
> class!

I think that's an orthogonal concern. We already have some such tools
(ABNF comes to mind), though clearly there are others that might be
considered. I'd be very wary about making the development of such tools a
part of a URI specification group's charter.
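That said, to show the kind of machine-checkable assertion I'd find
unobjectionable, here is a trivial, hand-rolled check of the
"string -> uri" / "uri -> string" round trip using my Haskell library
[1]. The sample strings are my own; a real test suite would need far more
care:

    import Network.URI (parseURIReference, uriToString)

    -- Does a reference survive "string -> uri" then "uri -> string"?
    roundTrips :: String -> Bool
    roundTrips s = case parseURIReference s of
        Nothing -> False                     -- not an RFC3986 reference
        Just u  -> uriToString id u "" == s  -- re-serialize and compare

    main :: IO ()
    main = mapM_ (\s -> putStrLn (show (roundTrips s) ++ " : " ++ s))
        [ "http://example.org/a/b?q=1#f"
        , "../relative/path"
        , "mailto:gk@ninebynine.org"
        ]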
> I believe that URI functions (parsing, printing, normalizing,
> equating, resolving...) are self-contained enough, small enough, and
> widely used enough to make this new specification approach extremely
> valuable to everyone involved.

At the risk of sounding like a broken record, I think for the most part
they'd be equally useful as satellite specifications around the core of
RFC3986. If it turns out that these specs expose requirements that cannot
be achieved within what is mandated by RFC3986, then there is a case for
updating RFC3986 with respect to just those identified requirements - but
I think that case needs to be established before considering changes to
RFC3986.

>> As such, I think a list of perceived problems might be more useful
>> than a single problem statement. Then it might be reasonable to
>> discuss which of those problems are realistically addressable.
>
> I agree! I often think in terms of questions rather than problems,
> though.
>
> I'll start:
>
> *****
>
> - What are the common functions of type "string -> uri"?
>
> 3986 says regex parser and only talks about string when it matches the
> included ABNF. IIRC, the regex is in a non-normative appendix.

RFC3986 says nothing normatively about *how* to parse a URI, just what
constitutes a syntactically well-formed URI. The closest it comes to a
normative processing spec is relative reference resolution, which in turn
depends on isolation of key elements within the URI (scheme, authority,
etc.). But even that, as I recall, is not a normative procedure: other
implementations are OK if they achieve the same result.

So, yes, a parsing spec could be useful, but I don't see that it needs to
be part of the core URI spec. Similarly, I think an API spec might be
useful to promote consistency between URI library implementations, but
again not as part of the core.

> WHATWG URL says procedural, mutable state machine in English prose
> parser (one total) and aspires to cover any input string.

But not all applications have a need to "cover any input string" -
sometimes the right thing to do is say "that's not a URI". Most of the
time, that's all I need in my work. As you say...

> There are, actually, multiple related functions of string -> uri and
> some applications want to use a strict parser and some want to use a
> sloppy parser. Some implementations will always compose the parser
> with a normalization function or resolution and others will want to
> keep those functions separate. How can we be certain that desirable
> properties hold across these variations and guide implementors,
> developers, authors, and users to the safest and most desirable
> behavior?

The problem with creating a catalogue of functions is that it's not clear
where the cut-off should be. The focus of a specification here should IMO
be to address interoperability problems; so I think it might be more
useful to draw up a list of known interoperability problems, and then
consider which of those might be addressed by a clearer specification.
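To illustrate the "strict parser"/"sloppy parser" point above: even
staying strictly within RFC3986, my Haskell library [1] already exposes
several distinct "string -> uri" functions over different grammars,
before any question of error recovery arises:

    import Network.URI (parseURI, parseURIReference,
                        parseRelativeReference)

    main :: IO ()
    main = do
        print (parseURI "http://example.org/a")     -- Just: absolute URI
        print (parseURI "/a/b")                     -- Nothing: "not a URI"
        print (parseURIReference "/a/b")            -- Just: URI reference
        print (parseRelativeReference "http://x/")  -- Nothing: not relative

None of these is "sloppy" in the WHATWG sense; each simply answers a
different, well-defined question about the input string.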
> - What are the common functions of type "uri -> uri"?
>
> 3986 says there are a few components to normalize (percent hex casing,
> DNS casing, scheme casing, percent unnecessity, IPv6 hex casing, empty
> paths). It misses some like query encoding and DNS root label and
> explicitly doesn't cover internationalization.

Again, I think it would be more helpful to identify actual interop
problems. I've often had to face the question of whether or not to
%-encode, but it's rarely turned out to cause an interoperability
problem. On the few occasions it has, I've found the guidance in RFC3986
has been enough. But YMMV.

> WHATWG URL doesn't address this directly but includes a few
> normalizations directly in its parser state machine.
>
> - What are the common functions of type "uri -> uri -> uri"?
>
> 3986 says resolution against an absolute URI and stays silent on
> relative-relative resolution.

I use relative reference resolution quite a lot in my work, and I've
never found this to be a problem. I'm not offhand sure why, but can think
of two possible reasons:
(a) the absolute -> relative -> uri function as described also works for
    relative -> relative -> uri;
(b) if the end goal is an absolute URI, then the sequence can always be
    performed as a series of absolute -> relative -> uri steps.
But I'll accept that a clear specification of valid outcomes of
relative -> relative -> uri could be useful.

> WHATWG URL doesn't address this directly but includes resolution as
> part of its parser state machine.
>
> - What are the common functions of type "uri -> string"?
>
> One would hope that these are only ever effectively normalization
> functions (uri -> uri) composed with a single serialization function
> but there may be reasons that this definition isn't possible.
>
> 3986 and WHATWG URL treat this as mostly self-evident and dependent on
> the internal representation of a URI. Round-trip composition (compose
> "string -> uri" with "uri -> string" and "uri -> string" with "string
> -> uri") is absent from 3986 as it only covers valid grammatical forms
> and entirely missing from WHATWG URL.

I'd say that RFC3986 just doesn't address this, but leaves it as an API
issue. For example, in my Haskell URI parser I created functions to
extract components, which some have argued is not correct. I made some
choices that meant it was easier to re-assemble an original URI from its
components (e.g. including ":" in an extracted scheme). I don't think any
of my choices violated any edict of RFC3986, but different implementers
could reasonably make different choices.

So I'd say this is an area where an API spec could bring some useful
clarity and consistency, but it doesn't need to change any fundamentals
of RFC3986.
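Concretely, the design choice I'm describing looks like this in my
library [1]: the delimiters stay attached to the extracted components, so
re-assembly is little more than concatenation:

    import Network.URI (parseURI, uriFragment, uriPath, uriQuery,
                        uriScheme, uriToString)

    main :: IO ()
    main = do
        let Just u = parseURI "http://example.org/a/b?q=1#frag"
        putStrLn (uriScheme u)    -- "http:"  (":" kept with the scheme)
        putStrLn (uriPath u)      -- "/a/b"
        putStrLn (uriQuery u)     -- "?q=1"   ("?" kept with the query)
        putStrLn (uriFragment u)  -- "#frag"  ("#" kept with the fragment)
        -- so serialization amounts to concatenating the components:
        print (uriToString id u "" == "http://example.org/a/b?q=1#frag")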
> - Where are the test cases for a given spec assertion?
>
> No URI spec, as far as I know, covers this or delivers a comprehensive
> test suite.

Assembling a comprehensive test suite could be a useful outcome. There
are plenty of partial test suites out there (RFC3986 has many useful test
cases, Dan Connolly created one several years ago for his W3C work, I
created one for my Haskell URI parser, Sam Ruby has recently been
assembling test cases, and I'm sure there are more that can be
plundered).

A note of caution: some test cases may be applicable to a particular
usage, and not universally to all URIs. (I think some of Sam Ruby's
recent tests may fall into this category.)

> *****
>
> There are certainly other questions one could ask or problems one
> could raise and I'd be very interested in reading any you might have.
>
> The general issue of standards fragmentation and lack of precise,
> accurate functional specification leads me to pursue a single, unified
> specification about which things can be proven and from which
> documents, test oracles, and test suites can be produced.

You claim "lack of precise, accurate functional specification". I
disagree (mostly). A specification stands (or falls) with respect to some
stated purpose, and I think RFC3986 does pretty well with respect to its
stated purpose. There may be other valid concerns not covered by RFC3986,
and I think it's fine to address those concerns. What I don't see is any
good cause to tear up one of the Web's well-established foundational
elements in the process.

I think that an attempt by a small group to produce a "single, unified
specification" will be of little value beyond a relatively small coterie
of developers who happen to have a shared set of concerns. "There are
more things in heaven and earth, Horatio, / Than are dreamt of in your
philosophy."

On the other hand, I think attempting to formalize those things that
RFC3986 does say could be worthwhile, and doing likewise for additional
proposals such as those you suggest could be a useful check on whether
those proposals are or are not consistent with the core spec. (My own
Haskell implementation of URI parsing [1] was conducted, in part, as a
way to (semi)formalize and validate the ABNF as it was being written for
RFC3986, and I believe it may have resulted in some minor updates to the
draft spec.)

[1] http://hackage.haskell.org/package/network-2.1.0.0/docs/Network-URI.html
Received on Thursday, 9 October 2014 10:44:21 UTC