- From: David Sheets <sheets@alum.mit.edu>
- Date: Wed, 8 Oct 2014 18:17:59 +0100
- To: Graham Klyne <gk@ninebynine.org>
- Cc: "public-urispec@w3.org" <public-urispec@w3.org>
On Wed, Oct 8, 2014 at 7:32 AM, Graham Klyne <gk@ninebynine.org> wrote:
> Hi,
>
> I've just read through the URI-spec list discussion to date, and find myself
> rather confused about what it actually hopes to achieve.

Hi Graham,

I don't think you're alone in that confusion.

From my perspective, the broad goal of any new specification effort
should be harmonization of existing standards and formalization of
their components. Specifically, Web browser implementors have found
that 3986 insufficiently describes some aspects of URL parsing and
manipulation. Others in our broader community of implementors feel
similarly. The resulting WHATWG URL spec which aims to correct this
deficit in 3986 is now making normative statements about URLs and is
being touted as a replacement for 3986.

This state of affairs is confusing and, if left unattended, liable to
make implementation of correct and interoperable (according to any
specification) URI handling even more difficult than it already is. We
already know of many areas of confusion in 3986 (percent-encoding
alphabets for different components, equivalence, parser error
recovery...) and implementations will continue to diverge without
significant effort to understand all of the present issues and unify
the browser vendors', library authors', Web authors', and users' URI
standards.

> I've been writing software and specifications that work with URIs for over a
> decade, and throughout that time I've found RFC3986 has been a perfectly
> good specification for what it covers, viz:
> - defining the syntax of a string used as a URI
> - identifying parts that can be extracted from a valid URI (*)
> - a specification for resolving a relative reference to a full (absolute)
>   URI

RFC3986 does an admirable job at defining some of these structures and
functions. Notably, RFC3986 is silent on real-world normalization,
parsing input with errors, incompatible implementations,
internationalization, and scheme-specific properties.
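
To make the covered ground concrete, here is a minimal sketch in Python.
It uses the component-splitting regular expression from RFC 3986
Appendix B and, as an assumption, the standard library's
`urllib.parse.urljoin` as an approximation of the reference resolution
described in Section 5. The helper name `split_uri` is hypothetical. The
last call hints at the gap: the regular expression splits any string
into components without checking it against the ABNF.

```python
import re
from urllib.parse import urljoin

# Regular expression from RFC 3986, Appendix B. It splits a URI reference
# into components but does not validate it against the ABNF.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(s):
    """Hypothetical helper: split a string into the five generic components."""
    m = URI_RE.match(s)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),      # None when no "?" is present
        "fragment": m.group(9),   # None when no "#" is present
    }


# Example string used in Appendix B.
parts = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
print(parts)

# Relative reference resolution against an absolute base URI,
# using two of the examples from RFC 3986 Section 5.4.
base = "http://a/b/c/d;p?q"
print(urljoin(base, "g"))
print(urljoin(base, "../g"))

# Input that is not a valid URI is still split without complaint.
print(split_uri("not a uri"))
```

Syntax, component extraction, and resolution are all here; what to do
with the last input is not.
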

> There are many things that one might do with URIs, or ways in which they
> might be constructed, that are not covered by RFC3986. In my view, that's a
> feature, not a bug.

I certainly think we should be very careful with the scope of our work,
for the sake of upstream acceptance, prompt delivery, and avoiding
confusion and dilution of effort. With that said, it is clear that
there are a number of related functions that most implementations use
or expose that are simply not covered by 3986. We should strive to
provide a solid, unified, well-structured core specification to
alleviate the pain I mentioned above.

> So, in my view, I think a URI spec activity would usefully use RFC3986 (or
> successor) as a base specification, and create additional specs that
> describe additional usage-oriented aspects; e.g. a URI parsing API, a
> procedure for converting a manually entered string into a URI string,
> handling of URIs as identifiers vs URIs as locators, internationalization
> issues, etc.

I agree that RFC 3986 makes a useful guide (and WHATWG URL an
interesting counterpoint). I would be wary of over-modularization of
some of these URI specifications, however. Besides introducing very
procedurally-formal boundaries between closely related functionality,
development of these specs would almost certainly push new requirements
back onto the core specification. I would be absolutely thrilled to see
a constellation of specifications incubated together and modularized
internally. If that effort is successful, I think it would make sense
to start looking at spinning out dependent specs.

Finally, as there appears to be interest in very accurate specification
of URI functions, I think any new effort for URI specification will
necessarily involve a significant investment in tools for spec
construction. If a specification strives to completely describe the
inputs and outputs of functions (e.g. "string -> uri"), then, to my
mind, it should exist as a formal description of such first and include
annotations for human consumption secondarily. This is not to say that
a human-readable spec is a second-class citizen in this world; simply
that a machine-analyzable spec should also be first class! I believe
that URI functions (parsing, printing, normalizing, equating,
resolving...) are self-contained enough, small enough, and widely used
enough to make this new specification approach extremely valuable to
everyone involved.

> As such, I think a list of perceived problems might be more useful than a
> single problem statement. Then it might be reasonable to discuss which of
> those problems are realistically addressable.

I agree! I often think in terms of questions rather than problems,
though. I'll start:

*****

- What are the common functions of type "string -> uri"? 3986 offers a
  regex parser and only considers strings that match the included ABNF.
  WHATWG URL specifies a single, total parser as a procedural, mutable
  state machine in English prose and aspires to cover any input string.
  There are, actually, multiple related functions of string -> uri and
  some applications want to use a strict parser and some want to use a
  sloppy parser. Some implementations will always compose the parser
  with a normalization function or resolution and others will want to
  keep those functions separate. How can we be certain that desirable
  properties hold across these variations and guide implementors,
  developers, authors, and users to the safest and most desirable
  behavior?
- What are the common functions of type "uri -> uri"? 3986 says there
  are a few components to normalize (percent hex casing, DNS casing,
  scheme casing, percent unnecessity, IPv6 hex casing, empty paths). It
  misses some like query encoding and DNS root label and explicitly
  doesn't cover internationalization. WHATWG URL doesn't address this
  directly but includes a few normalizations directly in its parser
  state machine.
- What are the common functions of type "uri -> uri -> uri"? 3986 says
  resolution against an absolute URI and stays silent on
  relative-relative resolution. WHATWG URL doesn't address this directly
  but includes resolution as part of its parser state machine.
- What are the common functions of type "uri -> string"? One would hope
  that these are only ever effectively normalization functions
  (uri -> uri) composed with a single serialization function but there
  may be reasons that this definition isn't possible. 3986 and WHATWG
  URL treat this as mostly self-evident and dependent on the internal
  representation of a URI. Round-trip composition (compose
  "string -> uri" with "uri -> string" and "uri -> string" with
  "string -> uri") is absent from 3986 as it only covers valid
  grammatical forms and entirely missing from WHATWG URL.
- Where are the test cases for a given spec assertion? No URI spec, as
  far as I know, covers this or delivers a comprehensive test suite.

*****

There are certainly other questions one could ask or problems one could
raise and I'd be very interested in reading any you might have. The
general issue of standards fragmentation and lack of precise, accurate
functional specification leads me to pursue a single, unified
specification about which things can be proven and from which
documents, test oracles, and test suites can be produced.

I hope this helps clear up some of your confusion. Please let me know
if there is anything else I can help you with.

Thanks,

David
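
The function types in the questions above can be sketched as a small
set of Python functions with checkable properties. This is only an
illustration under stated assumptions: `urllib.parse.urlsplit` and
`urlunsplit` stand in for a parser and serializer, the names `parse`,
`serialize`, and `normalize` are hypothetical, and `normalize` covers
only a few of the syntax-based normalizations (scheme and host casing,
percent-hex casing, empty path with an authority).

```python
import re
from urllib.parse import SplitResult, urlsplit, urlunsplit

_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def parse(s: str) -> SplitResult:
    """string -> uri (assumed stand-in parser, not a strict ABNF check)."""
    return urlsplit(s)


def serialize(u: SplitResult) -> str:
    """uri -> string"""
    return urlunsplit(u)


def _upper_pct(part: str) -> str:
    # Uppercase the hex digits of every percent-encoded triplet.
    return _PCT.sub(lambda m: m.group(0).upper(), part)


def normalize(u: SplitResult) -> SplitResult:
    """uri -> uri: a partial, syntax-based normalization."""
    # Lowercase only the host[:port] part; userinfo is case-sensitive.
    userinfo, at, hostport = u.netloc.rpartition("@")
    netloc = _upper_pct(userinfo + at + hostport.lower())
    path = _upper_pct(u.path)
    if u.netloc and not path:
        path = "/"  # empty path with an authority becomes "/"
    return u._replace(
        scheme=u.scheme.lower(),
        netloc=netloc,
        path=path,
        query=_upper_pct(u.query),
        fragment=_upper_pct(u.fragment),
    )


# Round trip on an already-canonical string: string -> uri -> string.
s = "http://example.com/a/b?x=1#frag"
assert serialize(parse(s)) == s

# Normalization, then serialization.
n = normalize(parse("HTTP://Example.COM/%7euser"))
assert serialize(n) == "http://example.com/%7Euser"

# Normalization should be idempotent.
assert normalize(n) == n
```

Assertions like the last three are the kind of property a
machine-analyzable specification could state once and have checked
against every implementation.
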
Received on Wednesday, 8 October 2014 17:18:28 UTC