- From: Graham Klyne <gk@ninebynine.org>
- Date: Thu, 09 Oct 2014 11:44:20 +0100
- To: public-urispec@w3.org, David Sheets <sheets@alum.mit.edu>
Hi David,
You responded:
> On Wed, Oct 8, 2014 at 7:32 AM, Graham Klyne <gk@ninebynine.org> wrote:
>> Hi,
>>
>> I've just read through the URI-spec list discussion to date, and find myself
>> rather confused about what it actually hopes to achieve.
>
> Hi Graham,
>
> I don't think you're alone in that confusion. From my perspective, the
> broad goal of any new specification effort should be harmonization of
> existing standards and formalization of their components.
> Specifically, Web browser implementors have found that 3986
> insufficiently describes some aspects of URL parsing and manipulation.
Fair enough...
> Others in our broader community of implementors feel similarly. The
> resulting WHATWG URL spec which aims to correct this deficit in 3986
> is now making normative statements about URLs and is being touted as a
> replacement for 3986.
... but where you lose me is in treating this as a deficiency in RFC3986.
I fully accept there are things that RFC3986 doesn't cover, but as I said
previously I see that as a feature, not a bug. I don't see any need to go back
and tear up RFC3986 because of the things it does not say.
To take your example of "some aspects of URL parsing and manipulation", I think
it would be quite appropriate to write a spec that described these functions for
browsers in a way that builds upon rather than replaces RFC3986. I think it
would be wrong to assume that all uses of URIs have the same requirements for
URI parsing and manipulation, and to bake a particular set of mechanisms into a
core URI spec would be to make the spec less useful for other applications.
Would it be so hard, or insufficient for the example you mention, to write a
spec called, say "URL parsing and manipulation for browsers" that describes how
to take a string from a browser address bar and turn it into an
RFC3986-compliant URI string?
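For instance (a minimal sketch using the Network.URI library, and assuming the
input needs no scheme inference, IDNA handling, or other browser-specific
repair, all of which a real address-bar spec would have to cover):

    import Network.URI (URI, parseURI, escapeURIString, isAllowedInURI)

    -- Percent-encode any character not allowed in a URI, then validate
    -- the result against the RFC3986 grammar; Nothing if still invalid.
    addressBarToURI :: String -> Maybe URI
    addressBarToURI = parseURI . escapeURIString isAllowedInURI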
...
TL;DR: see above. The rest of this response delves deeper into some of the
points you raise, but all my comments ultimately derive from the position
indicated above.
...
>
> This state of affairs is confusing and, if left unattended, liable to
> make implementation of correct and interoperable (according to any
> specification) URI handling even more difficult than it already is.
For whom? This isn't a problem I've noticed. I work with language libraries
and they pretty much do what I need.
> ... We
> already know of many areas of confusion in 3986 (percent-encoding
> alphabets for different components, equivalence, parser error
> recovery...) and implementations will continue to diverge without
> significant effort to understand all of the present issues and unify
> the browser vendors', library authors', Web authors', and users' URI
> standards.
I recognize that there are difficulties in the area of internationalization. But URI
strings as defined avoid that by sticking to US-ASCII. IRIs are an attempt to
address these issues, and I accept that's an area that might usefully be
clarified and regularized.
>
>> I've been writing software and specifications that work with URIs for over a
>> decade, and throughout that time I've found RFC3986 has been a perfectly
>> good specification for what it covers, viz:
>> - defining the syntax of a string used as a URI
>> - identifying parts that can be extracted from a valid URI (*)
>> - a specification for resolving a relative reference to a full (absolute)
>> URI
>
> RFC3986 does an admirable job at defining some of these structures and
> functions. Notably, RFC3986 is silent on real-world normalization,
> parsing input with errors, incompatible implementations,
> internationalization, and scheme-specific properties.
Sure it's silent on those things, and I'll repeat: I think that's a feature not
a bug, because I don't think there's a single solution for these that's best for
all purposes:
- real-world normalization: for what purpose? I submit that different purposes
will require different normal forms. The main issue I come across is URI
equality testing, but in practice I find that most of the time it's sufficient
to treat the URI as an opaque string and compare that (per RFC3986; see the
sketch after this list). It may be that there are different URIs that
dereference or identify the same resource, but no amount of normalization will
make that problem go away - ultimately it's an issue that applications (of
which browsers are one class) must deal with.
- dealing with input errors: error recovery is surely an application issue?
I'd suggest if there's a standardized "recovery" for an error then it's not an
error so much as an alternative form.
- incompatible implementations: again, I think this only makes sense with some
particular purpose in mind, and not all URI-using applications have the same
purposes.
- internationalization: agree - see above - for those applications that need to
deal with mapping between human-readable IRIs and US-ASCII-based URIs as
protocol elements. But not all applications do (or not in the full generality
where many of the I18N demons seem to lurk).
- scheme-specific properties: surely, these are for scheme definitions to
describe (within the framework of what is described for generic URIs)?
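To illustrate the equality-testing point above, a sketch using the syntax-based
normalizations (cf. RFC3986 section 6.2.2) that Network.URI happens to expose;
anything stronger is, I'd argue, application-specific:

    import Network.URI (normalizeCase, normalizeEscape, normalizePathSegments)

    -- Simple string comparison (RFC3986 section 6.2.1), preceded by
    -- normalization of case, unnecessary percent-encodings, and dot segments.
    uriEq :: String -> String -> Bool
    uriEq a b = norm a == norm b
      where norm = normalizePathSegments . normalizeEscape . normalizeCase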
So, while I agree that there are things that can usefully be done, I'm not
seeing anything here that requires replacement of RFC3986.
>
>> There are many things that one might do with URIs, or ways in which they
>> might be constructed, that are not covered by RFC3986. In my view, that's a
>> feature, not a bug.
>
> I certainly think we should be very careful with the scope of our work
> for upstream acceptance, prompt delivery, confusion avoidance, and
> effort dilution purposes. With that said, it is clear that there are a
> number of related functions that most implementations use or expose
> that are simply not covered by 3986. We should strive to provide a
> solid, unified, well-structured core specification to alleviate the
> pain I mentioned above.
I have my doubts that this is possible, because I don't believe there exist
one-size-fits-all solutions to the issues you mention. If and where such
solutions do exist, then I think they can be written as separate specs that
build upon RFC3986, and can prove their worth in that form.
Proven solutions might then be merged into future successors of RFC3986.
(This, BTW, is my notion of how a "living standard" might work: not as a dynamic
document, but as a dynamic constellation of individual pieces, with those that
have proven their worth providing stable points of reference. For
the time being, I see RFC3986 as one of those stable points, and we risk great
damage by trying to tinker with its scope.)
>
>> So, in my view, I think a URI spec activity would usefully use RFC3986 (or
>> successor) as a base specification, and create additional specs that
>> describe additional usage-oriented aspects; e.g. a URI parsing API, a
>> procedure for converting a manually entered string into a URI string,
>> handling of URIs as identifiers vs URIs as locators, internationalization
>> issues, etc.
>
> I agree that RFC 3986 makes a useful guide (and WHATWG URL an
> interesting counterpoint). I would be wary of over-modularization of
> some of these URI specifications, however. Besides introducing very
> procedurally-formal boundaries between closely related functionality,
> development of these specs would almost certainly push-back new
> requirements on the core specification.
I think RFC3986 is really much more than a "useful guide". We have 25 years of
developed software based on the key ideas of which RFC3986 is the current
evolved specification. I think it needs to stand at the heart of any URI
clarification efforts (not protected from evolution where needed, but used as an
anchor point to which other developments can be referred). (FWIW, as a
developer I've never consulted the WHATWG URL spec, as I find that RFC3986 is
generally adequate for my needs and has the great advantage of being
stable. So this developer has no need of the WHATWG URL spec.)
I really think that monolithic specs covering all uses are a bad idea, as they
come to look like application specifications, and end up prescribing things that
should properly be left as application implementation concerns rather than
focusing on the essentials needed for interoperability.
I see URIs in information architectures as somewhat like the hourglass neck
represented by the IP protocol in the family of Internet protocol standards. By
sticking to a minimum core concern, it is able to support a greater variety of
applications than a more comprehensive specification might do. Of course,
additional specifications may still be needed for those particular applications.
>
> I would be absolutely thrilled to see a constellation of
> specifications incubated together and modularized internally. If that
> effort is successful, I think it would make sense to start looking at
> spinning out dependent specs.
I think there's a danger here of engaging in a monumental act of hubris, by
assuming that you can bring all of the required breadth of expertise into a
single forum. Far safer, and more productive IMO, would be to stick with a core
functionality of known value, and then develop specifications that build on
those core capabilities in well defined ways. I think you're much more likely
to end up identifying a constellation of universally useful features that way
than trying to incubate them together.
As IETF URI scheme reviewer, I see a lot of scheme proposals that have very
little, if anything, to do with the Web. Given that the URI spec is one of the
foundation pieces of the Web, I sometimes find this a bit disconcerting. But it
is also testament to the widespread utility of URIs as an engineering artifact
beyond the Web for which they were designed. IMO, this kind of utility is most
unlikely to be achieved by an attempt to incubate a constellation of core
specifications. In this, I strongly believe less is more - i.e. by doing less
we can in the long run achieve more.
>
> Finally, as there appears to be interest in very accurate
> specification of URI functions, I think any new effort for URI
> specification will necessarily involve a significant investment in
> tools for spec construction. If a specification strives to completely
> describe the inputs and outputs of functions (e.g. "string -> uri"),
> then, to my mind, it should exist as a formal description of such
> first and include annotations for human consumption secondarily. This
> is not to say that a human-readable spec is a second-class citizen in
> this world; simply that a machine-analyzable spec should also be first
> class!
I think that's an orthogonal concern. We already have some such tools (ABNF
comes to mind), though clearly there are others that might be considered. I'd
be very wary about making the development of such tools a part of a URI
specification group's charter.
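Indeed, the top of RFC3986's own ABNF (appendix A) already gives a
machine-checkable account of the syntax:

    URI       = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

    hier-part = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty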
>
> I believe that URI functions (parsing, printing, normalizing,
> equating, resolving...) are self-contained enough, small enough, and
> widely used enough to make this new specification approach extremely
> valuable to everyone involved.
At the risk of sounding like a broken record, I think for the most part that
they'd be equally useful as satellite specifications around the core of RFC3986.
If it turns out that these specs expose requirements that cannot be achieved
within what is mandated by RFC3986, then there is a case for updating RFC3986
with respect to just those identified requirements - but I think that case needs
to be established before considering changes to RFC3986.
>
>> As such, I think a list of perceived problems might be more useful than a
>> single problem statement. Then it might be reasonable to discuss which of
>> those problems are realistically addressable.
>
> I agree! I often think in terms of questions rather than problems, though.
>
> I'll start:
>
> *****
>
> - What are the common functions of type "string -> uri"?
>
> 3986 says regex parser and only talks about string when it matches the
> included ABNF.
IIRC, the regex is in a non-normative appendix. RFC3986 says nothing
normatively about *how* to parse a URI, just what constitutes a syntactically
well-formed URI.
The closest to a normative processing spec is the relative reference resolution,
which in turn depends on isolation of key elements within the URI (scheme,
authority, etc.). But even that, as I recall, is not a normative procedure:
other implementations are OK if they achieve the same result.
So, yes, a parsing spec could be useful, but I don't see that it needs to be
part of the core URI spec. Similarly, I think an API spec might be useful to
promote consistency between URI library implementations, but again not as part
of the core.
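By way of example of what such an API spec might pin down: Network.URI already
distinguishes one "string -> uri" function per RFC3986 grammar production, each
rejecting input that doesn't match in full (a sketch of that style; other
libraries reasonably make other choices):

    import Network.URI (parseURI, parseURIReference, parseAbsoluteURI)

    parses :: [Maybe String]
    parses = map (fmap show)
      [ parseURI          "http://example.com/a#f"  -- Just: <URI> allows a fragment
      , parseAbsoluteURI  "http://example.com/a#f"  -- Nothing: <absolute-URI> does not
      , parseURIReference "../b"                    -- Just: a relative reference
      , parseURI          "http://exa mple.com/"    -- Nothing: rejected, not repaired
      ]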
>
> WHATWG URL says procedural, mutable state machine in English prose
> parser (one total) and aspires to cover any input string.
But not all applications have a need to "cover any input string" - sometimes the
right thing to do is say "that's not a URI". Most of the time, that's all I
need in my work. As you say...
>
> There are, actually, multiple related functions of string -> uri and
> some applications want to use a strict parser and some want to use a
> sloppy parser. Some implementations will always compose the parser
> with a normalization function or resolution and others will want to
> keep those functions separate. How can we be certain that desirable
> properties hold across these variations and guide implementors,
> developers, authors, and users to the safest and most desirable
> behavior?
The problem with creating a catalogue of functions is that it's not clear where
the cut-off should be. The focus of a specification here should IMO be to
address interoperability problems; so I think it might be more useful to draw
up a list of known interoperability problems, and then consider which of those
might be addressed by a clearer specification.
>
> - What are the common functions of type "uri -> uri"?
>
> 3986 says there are a few components to normalize (percent hex casing,
> DNS casing, scheme casing, percent unnecessity, IPv6 hex casing, empty
> paths). It misses some like query encoding and DNS root label and
> explicitly doesn't cover internationalization.
Again, I think it would be more helpful to identify actual interop problems.
I've often had to face the question of whether or not to %-encode, but it's
rarely turned out to cause an interoperability problem. On the few occasions it
has, I've found the guidance in RFC3986 has been enough. But YMMV.
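For what it's worth, the pattern I mostly rely on is sketched below: escape
everything except the RFC3986 unreserved characters when embedding a data
value, and unescape on extraction.

    import Network.URI (escapeURIString, unEscapeString, isUnreserved)

    roundTrip :: (String, String)
    roundTrip = (enc, unEscapeString enc)
      where
        -- "a b/c" becomes "a%20b%2Fc"; unEscapeString recovers the original
        enc = escapeURIString isUnreserved "a b/c"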
>
> WHATWG URL doesn't address this directly but includes a few
> normalizations directly in its parser state machine.
>
> - What are the common functions of type "uri -> uri -> uri"?
>
> 3986 says resolution against an absolute URI and stays silent on
> relative-relative resolution.
I use relative reference resolution quite a lot in my work, and I've never found
this to be a problem. I'm not offhand sure why, but can think of two possible
reasons:
(a) the absolute -> relative -> uri function as described also works for
relative -> relative -> uri
(b) if the end goal is an absolute URI, then the sequence can always be
performed as a series of absolute -> relative -> uri
But I'll accept that a clear specification of valid outcomes of relative ->
relative -> uri could be useful.
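A sketch of point (b) using Network.URI (note: relativeTo is total in current
network-uri releases; older versions returned Maybe URI):

    import Network.URI (parseURI, parseURIReference, relativeTo)

    chained :: Maybe String
    chained = do
      base <- parseURI "http://a/b/c/d;p?q"
      r1   <- parseURIReference "../x/y"
      r2   <- parseURIReference "z"
      -- resolve r1 against base, then r2 against the (absolute) result
      return (show (r2 `relativeTo` (r1 `relativeTo` base)))
      -- Just "http://a/b/x/z"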
>
> WHATWG URL doesn't address this directly but includes resolution as
> part of its parser state machine.
>
> - What are the common functions of type "uri -> string"?
>
> One would hope that these are only ever effectively normalization
> functions (uri -> uri) composed with a single serialization function
> but there may be reasons that this definition isn't possible.
>
> 3986 and WHATWG URL treat this as mostly self-evident and dependent on
> the internal representation of a URI. Round-trip composition (compose
> "string -> uri" with "uri -> string" and "uri -> string" with "string
> -> uri") is absent from 3986 as it only covers valid grammatical forms
> and entirely missing from WHATWG URL.
I'd say that RFC3986 just doesn't address this, but leaves this as an API issue.
For example, in my Haskell URI parser, I created functions to extract
components, which some have argued is not correct. I made some choices that
meant it was easier to re-assemble an original URI from its components (e.g.
including ":" in an extracted scheme) - I don't think any of my choices violated
any edict of RFC3986, but different implementers could reasonably make different
choices.
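Concretely, in Network.URI the delimiters stay with the extracted components,
which is what makes re-assembly straightforward:

    import Network.URI (URI(..), parseURI)

    components :: Maybe (String, String, String)
    components = do
      u <- parseURI "http://example.com/p?q=1#f"
      -- ":" stays with the scheme, "?" with the query, "#" with the
      -- fragment, so show u reproduces the original string
      return (uriScheme u, uriQuery u, uriFragment u)
      -- Just ("http:", "?q=1", "#f")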
So I'd say this is an area where an API spec could bring some useful clarity and
consistency, but it doesn't need to change any fundamentals of RFC3986.
>
> - Where are the test cases for a given spec assertion?
>
> No URI spec, as far as I know, covers this or delivers a comprehensive
> test suite.
Assembling a comprehensive test suite could be a useful outcome. There are
plenty of partial test suites out there (RFC3986 has many useful test cases, Dan
Connolly created one several years ago for his W3C work, I created one for my
Haskell URI parser, Sam Ruby has recently been assembling test cases, and I'm
sure there are more that can be plundered).
A note of caution: some test cases may be applicable in certain usage, and not
universally for all URIs. (I think some of Sam Ruby's recent tests may fall
into this category.)
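As a sketch of how shared test cases might be encoded, here are a few of
RFC3986's own reference-resolution examples (section 5.4) run through
Network.URI:

    import Network.URI (parseURI, parseURIReference, relativeTo)

    -- (reference, expected result) against the RFC3986 base URI
    cases :: [(String, String)]
    cases =
      [ ("g",    "http://a/b/c/g")
      , ("./g",  "http://a/b/c/g")
      , ("../g", "http://a/b/g")
      , ("#s",   "http://a/b/c/d;p?q#s")
      ]

    runCases :: Maybe Bool
    runCases = do
      base <- parseURI "http://a/b/c/d;p?q"
      rs <- mapM (\(r, want) -> do
                    ref <- parseURIReference r
                    return (show (ref `relativeTo` base) == want))
                 cases
      return (and rs)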
>
> *****
>
> There are certainly other questions one could ask or problems one
> could raise and I'd be very interested in reading any you might have.
>
> The general issue of standards fragmentation and lack of precise,
> accurate functional specification leads me to pursue a single, unified
> specification about which things can be proven and from which
> documents, test oracles, and test suites can be produced.
You claim "lack of precise, accurate functional specification". I disagree
(mostly). A specification stands (or falls) with respect to some stated
purpose, and I think RFC3986 does pretty well with respect to its stated purpose.
There may be other valid concerns not covered by RFC3986, and I think it's fine
to address those concerns. What I don't see is any good cause to tear up one of
the Web's well-established foundational elements in the process.
I think that an attempt by a small group to produce a "single, unified
specification" will be of little value beyond a relatively small coterie of
developers who happen to have a shared set of concerns.
"There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy."
On the other hand, I think attempting to formalize those things that RFC3986
does say could be worthwhile, and doing likewise for additional proposals such
as those you suggest could be a useful check on whether they are or are not
consistent with the core spec.
(My own Haskell implementation of URI parsing [1] was conducted, in part, as a
way to (semi)formalize and validate the ABNF as it was being written for
RFC3986, and I believe it may have resulted in some minor updates to the draft
spec.)
[1] http://hackage.haskell.org/package/network-2.1.0.0/docs/Network-URI.html
#g