Re: So what is the problem with URIs?

Hi David,

You responded:
> On Wed, Oct 8, 2014 at 7:32 AM, Graham Klyne <gk@ninebynine.org> wrote:
>> Hi,
>>
>> I've just read through the URI-spec list discussion to date, and find myself
>> rather confused about what it actually hopes to achieve.
>
> Hi Graham,
>
> I don't think you're alone in that confusion. From my perspective, the
> broad goal of any new specification effort should be harmonization of
> existing standards and formalization of their components.
> Specifically, Web browser implementors have found that 3986
> insufficiently describes some aspects of URL parsing and manipulation.

Fair enough...

> Others in our broader community of implementors feel similarly. The
> resulting WHATWG URL spec which aims to correct this deficit in 3986
> is now making normative statements about URLs and is being touted as a
> replacement for 3986.

... but where you lose me is in treating this as a deficiency in RFC3986.

I fully accept there are things that RFC3986 doesn't cover, but as I said 
previously I see that as a feature, not a bug.  I don't see any need to go back 
and tear up RFC3986 because of the things it does not say.

To take your example of "some aspects of URL parsing and manipulation", I think 
it would be quite appropriate to write a spec that described these functions for 
browsers in a way that builds upon rather than replaces RFC3986.  I think it 
would be wrong to assume that all uses of URIs have the same requirements for 
URI parsing and manipulation, and to bake a particular set of mechanisms into a 
core URI spec would be to make the spec less useful for other applications.

Would it be so hard, or insufficient for the example you mention, to write a 
spec called, say "URL parsing and manipulation for browsers" that describes how 
to take a string from a browser address bar and turn it into an 
RFC3986-compliant URI string?

...

TL;DR: see above.  The rest of this response delves deeper into some of the 
points you raise, but all my comments ultimately derive from the position 
indicated above.

...

>
> This state of affairs is confusing and, if left unattended, liable to
> make implementation of correct and interoperable (according to any
> specification) URI handling even more difficult than it already is.

For whom?  This isn't a problem I've noticed.  I work with language libraries 
and they pretty much do what I need.

> ... We
> already know of many areas of confusion in 3986 (percent-encoding
> alphabets for different components, equivalence, parser error
> recovery...) and implementations will continue to diverge without
> significant effort to understand all of the present issues and unify
> the browser vendors', library authors', Web authors', and users' URI
> standards.

I recognize that there are difficulties in internationalization.  But URI 
strings as defined avoid that by sticking to US-ASCII.  IRIs are an attempt to 
address these issues, and I accept that's an area that might usefully be 
clarified and regularized.

>
>> I've been writing software and specifications that work with URIs for over a
>> decade, and throughout that time I've found RFC3986 has been a perfectly
>> good specification for what it covers, viz:
>> - defining the syntax of a string used as a URI
>> - identifying parts that can be extracted from a valid URI (*)
>> - a specification for resolving a relative reference to a full (absolute)
>> URI
>
> RFC3986 does an admirable job at defining some of these structures and
> functions. Notably, RFC3986 is silent on real-world normalization,
> parsing input with errors, incompatible implementations,
> internationalization, and scheme-specific properties.

Sure it's silent on those things, and I'll repeat:  I think that's a feature not 
a bug, because I don't think there's a single solution for these that's best for 
all purposes:

- real-world normalization:  for what purpose?  I submit that different purposes 
will require different normal forms.  The main issue I come across is URI 
equality testing, but in practice I find that most of the time it's sufficient 
to treat the URI as an opaque string and compare that (per RFC3986).  It may be 
that there are different URIs that dereference or identify the same resource, 
but no amount of normalization will make that problem go away - ultimately it's 
an issue that applications (of which browsers are one class) must deal with.

- dealing with input errors:  error recovery is surely an application issue? 
I'd suggest if there's a standardized "recovery" for an error then it's not an 
error so much as an alternative form.

- incompatible implementations: again, I think this only makes sense with some 
particular purpose in mind, and not all URI-using applications have the same 
purposes.

- internationalization: agree - see above - for those applications that need to 
deal with mapping between human-readable IRIs and US-ASCII-based URIs as 
protocol elements.  But not all applications do (or not in the full generality 
where many of the I18N demons seem to lurk).

- scheme-specific properties: surely, these are for scheme definitions to 
describe (within the framework of what is described for generic URIs)?

So, while I agree that there are things that can usefully be done, I'm not 
seeing anything here that requires replacement of RFC3986.
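
To illustrate the equality-testing point above, here is a minimal Python sketch, 
assuming only the standard urllib.parse module.  It contrasts comparing URIs as 
opaque strings with comparing them after lowercasing the scheme and host, which 
is one of the normalizations RFC3986 describes.  The name case_normalize and the 
sample URIs are made up for this example, and a different purpose could 
reasonably want a different normal form.

```python
from urllib.parse import urlsplit, urlunsplit


def case_normalize(uri: str) -> str:
    """Lowercase the scheme and host only; leave everything else untouched."""
    parts = urlsplit(uri)
    # Keep any userinfo as-is: only the host[:port] part is case-insensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    return urlunsplit(
        (parts.scheme.lower(), netloc, parts.path, parts.query, parts.fragment)
    )


a = "HTTP://Example.COM/Some/Path"
b = "http://example.com/Some/Path"

print(a == b)                                  # opaque string comparison: False
print(case_normalize(a) == case_normalize(b))  # after case normalization: True
```

Whether the second comparison is the "right" one depends entirely on what the 
application is trying to do, which is rather the point.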

>
>> There are many things that one might do with URIs, or ways in which they
>> might be constructed, that are not covered by RFC3986.  In my view, that's a
>> feature, not a bug.
>
> I certainly think we should be very careful with the scope of our work
> for upstream acceptance, prompt delivery, confusion avoidance, and
> effort dilution purposes. With that said, it is clear that there are a
> number of related functions that most implementations use or expose
> that are simply not covered by 3986. We should strive to provide a
> solid, unified, well-structured core specification to alleviate the
> pain I mentioned above.

I have my doubts that this is possible, because I don't believe there exist 
one-size-fits-all solutions to the issues you mention.  If and where such 
solutions do exist, then I think they can be written as separate specs that 
build upon RFC3986, and can prove their worth in that form.

Proven solutions might then be merged into future successors of RFC3986.

(This, BTW, is my notion of how a "living standard" might work: not as a dynamic 
document, but as a dynamic constellation of individual pieces, with those that 
have proven their worth used to provide stable points of reference.  For 
the time being, I see RFC3986 as one of those stable points, and we risk great 
damage by trying to tinker with its scope.)

>
>> So, in my view, I think a URI spec activity would usefully use RFC3986 (or
>> successor) as a base specification, and create additional specs that
>> describe additional usage-oriented aspects; e.g. a URI parsing API, a
>> procedure for converting a manually entered string into a URI string,
>> handling of URIs as identifiers vs URIs as locators, internationalization
>> issues, etc.
>
> I agree that RFC 3986 makes a useful guide (and WHATWG URL an
> interesting counterpoint). I would be wary of over-modularization of
> some of these URI specifications, however. Besides introducing very
> procedurally-formal boundaries between closely related functionality,
> development of these specs would almost certainly push-back new
> requirements on the core specification.

I think RFC3986 is really much more than a "useful guide".  We have 25 years of 
developed software based on the key ideas of which RFC3986 is the current 
evolved specification.  I think it needs to stand at the heart of any URI 
clarification efforts (not protected from evolution where needed, but used as an 
anchor point to which other developments can be referred).  (FWIW, as a 
developer I've never consulted the WHATWG URL spec, as I find that RFC3986 is 
generally adequate for my needs and has the great advantage of being stable.  
So this developer has no need of the WHATWG URL spec.)

I really think that monolithic specs covering all uses are a bad idea, as they 
come to look like application specifications, and end up prescribing things that 
should properly be left as application implementation concerns rather than 
focusing on the essentials needed for interoperability.

I see URIs in information architectures as somewhat like the hourglass neck 
represented by the IP protocol in the family of Internet protocol standards.  By 
sticking to a minimum core concern, it is able to support a greater variety of 
applications than a more comprehensive specification might do.  Of course, 
additional specifications may still be needed for those particular applications.

>
> I would be absolutely thrilled to see a constellation of
> specifications incubated together and modularized internally. If that
> effort is successful, I think it would make sense to start looking at
> spinning out dependent specs.

I think there's a danger here of engaging in a monumental act of hubris, by 
assuming that you can bring all of the required breadth of expertise into a 
single forum.  Far safer, and more productive IMO, would be to stick with a core 
functionality of known value, and then develop specifications that build on 
those core capabilities in well defined ways.  I think you're much more likely 
to end up identifying a constellation of universally useful features that way 
than trying to incubate them together.

As IETF URI scheme reviewer, I see a lot of scheme proposals that have very 
little, if anything, to do with the Web.  Given that the URI spec is one of the 
foundation pieces of the Web, I sometimes find this a bit disconcerting.  But it 
is also testament to the widespread utility of URIs as an engineering artifact 
beyond the Web for which they were designed.  IMO, this kind of utility is most 
unlikely to be achieved by an attempt to incubate a constellation of core 
specifications.  In this, I strongly believe less is more - i.e. by doing less 
we can in the long run achieve more.

>
> Finally, as there appears to be interest in very accurate
> specification of URI functions, I think any new effort for URI
> specification will necessarily involve a significant investment in
> tools for spec construction. If a specification strives to completely
> describe the inputs and outputs of functions (e.g. "string -> uri"),
> then, to my mind, it should exist as a formal description of such
> first and include annotations for human consumption secondarily. This
> is not to say that a human-readable spec is a second-class citizen in
> this world; simply that a machine-analyzable spec should also be first
> class!

I think that's an orthogonal concern.  We already have some such tools (ABNF 
comes to mind), though clearly there are others that might be considered.  I'd 
be very wary about making the development of such tools a part of a URI 
specification group's charter.

>
> I believe that URI functions (parsing, printing, normalizing,
> equating, resolving...) are self-contained enough, small enough, and
> widely used enough to make this new specification approach extremely
> valuable to everyone involved.

At the risk of sounding like a broken record, I think for the most part that 
they'd be equally useful as satellite specifications around the core of RFC3986.

If it turns out that these specs expose requirements that cannot be achieved 
within what is mandated by RFC3986, then there is a case for updating RFC3986 
with respect to just those identified requirements - but I think that case needs 
to be established before considering changes to RFC3986.

>
>> As such, I think a list of perceived problems might be more useful than a
>> single problem statement.  Then it might be reasonable to discuss which of
>> those problems are realistically addressable.
>
> I agree! I often think in terms of questions rather than problems, though.
>
> I'll start:
>
> *****
>
> - What are the common functions of type "string -> uri"?
>
> 3986 says regex parser and only talks about string when it matches the
> included ABNF.

IIRC, the regex is in a non-normative appendix.  RFC3986 says nothing 
normatively about *how* to parse a URI, just what constitutes a syntactically 
well-formed URI.

The closest to a normative processing spec is the relative reference resolution, 
which in turn depends upon isolation of key elements within the URI (scheme, 
authority, etc.).  But even that, as I recall, is not a normative procedure: 
other implementations are OK if they achieve the same result.
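
For reference, the isolation of key elements mentioned above can be sketched in 
a few lines of Python using the regular expression from the RFC3986 appendix.  
This is an illustration only: the name split_uri is made up, and the regex 
splits a string into components without checking that it is a syntactically 
well-formed URI.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_REGEX = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(reference: str) -> dict:
    """Isolate the five generic components; absent components are None."""
    # Every group is optional and the path may be empty, so this always matches.
    m = URI_REGEX.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# Example string taken from RFC 3986, Appendix B.
print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
```

An implementation that isolates the same components by some other means would 
be just as acceptable.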

So, yes, a parsing spec could be useful, but I don't see that it needs to be 
part of the core URI spec.  Similarly, I think an API spec might be useful to 
promote consistency between URI library implementations, but again not as part 
of the core.

>
> WHATWG URL says procedural, mutable state machine in English prose
> parser (one total) and aspires to cover any input string.

But not all applications have a need to "cover any input string" - sometimes the 
right thing to do is say "that's not a URI".  Most of the time, that's all I 
need in my work.  As you say...

>
> There are, actually, multiple related functions of string -> uri and
> some applications want to use a strict parser and some want to use a
> sloppy parser. Some implementations will always compose the parser
> with a normalization function or resolution and others will want to
> keep those functions separate. How can we be certain that desirable
> properties hold across these variations and guide implementors,
> developers, authors, and users to the safest and most desirable
> behavior?

The problem with creating a catalogue of functions is that it's not clear where 
the cut-off should be.  The focus of a specification here should IMO be to 
address interoperability problems;  so I think it might be more useful to draw 
up a list of known interoperability problems, and then consider which of those 
might be addressed by a clearer specification.

>
> - What are the common functions of type "uri -> uri"?
>
> 3986 says there are a few components to normalize (percent hex casing,
> DNS casing, scheme casing, percent unnecessity, IPv6 hex casing, empty
> paths). It misses some like query encoding and DNS root label and
> explicitly doesn't cover internationalization.

Again, I think it would be more helpful to identify actual interop problems. 
I've often had to face the question of whether or not to %-encode, but it's 
rarely turned out to cause an interoperability problem.  On the few occasions it 
has, I've found the guidance in RFC3986 has been enough.  But YMMV.
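
As a concrete illustration of the kind of guidance meant here, the following 
Python sketch applies two of the normalizations listed above: uppercasing the 
hex digits of percent-encoded triplets, and decoding triplets that encode 
unreserved characters.  The function name and sample URI are assumptions for 
the example, and it deliberately ignores component-specific rules.

```python
import re
import string

# "unreserved" characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
PCT_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_percent_encoding(uri: str) -> str:
    """Decode needlessly encoded unreserved characters; uppercase other triplets."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                        # e.g. %7e becomes ~
        return "%" + match.group(1).upper()    # e.g. %2f becomes %2F
    # A single pass, so "%2541" is not decoded twice.
    return PCT_TRIPLET.sub(fix, uri)


print(normalize_percent_encoding("http://example.com/%7euser/a%2fb%3f"))
# prints http://example.com/~user/a%2Fb%3F
```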

>
> WHATWG URL doesn't address this directly but includes a few
> normalizations directly in its parser state machine.
>
> - What are the common functions of type "uri -> uri -> uri"?
>
> 3986 says resolution against an absolute URI and stays silent on
> relative-relative resolution.

I use relative reference resolution quite a lot in my work, and I've never found 
this to be a problem.  I'm not offhand sure why, but can think of two possible 
reasons:
(a) the absolute -> relative -> uri function as described also works for 
relative -> relative -> uri
(b) if the end goal is an absolute URI, then the sequence can be always 
performed as a series of absolute -> relative -> uri

But I'll accept that a clear specification of valid outcomes of relative -> 
relative -> uri could be useful.
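
To make reason (b) concrete, here is a small Python sketch using urljoin from 
the standard urllib.parse module.  The first loop uses the base URI and a few of 
the reference examples from RFC3986; the example.org URIs in the second part are 
made up.  It shows a chain of absolute -> relative -> uri steps and does not 
attempt relative -> relative -> uri directly.

```python
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"  # base URI used in the RFC 3986 resolution examples

for ref in ["g", "../g", "?y", "g;x?y#s"]:
    print(ref, "->", urljoin(BASE, ref))

# Reason (b): when the end goal is an absolute URI, resolve one step at a time.
step1 = urljoin("http://example.org/a/b", "c/d")
step2 = urljoin(step1, "../e")
print(step1)  # http://example.org/a/c/d
print(step2)  # http://example.org/a/e
```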

>
> WHATWG URL doesn't address this directly but includes resolution as
> part of its parser state machine.
>
> - What are the common functions of type "uri -> string"?
>
> One would hope that these are only ever effectively normalization
> functions (uri -> uri) composed with a single serialization function
> but there may be reasons that this definition isn't possible.
>
> 3986 and WHATWG URL treat this as mostly self-evident and dependent on
> the internal representation of a URI. Round-trip composition (compose
> "string -> uri" with "uri -> string" and "uri -> string" with "string
> -> uri") is absent from 3986 as it only covers valid grammatical forms
> and entirely missing from WHATWG URL.

I'd say that RFC3986 just doesn't address this, but leaves this as an API issue.  
For example, in my Haskell URI parser, I created functions to extract 
components in a way which some have argued is not correct.  I made some choices 
that meant it was easier to re-assemble an original URI from its components 
(e.g. including ":" in an extracted scheme) - I don't think any of my choices 
violated any edict of RFC3986, but different implementers could reasonably make 
different choices.

So I'd say this is an area where an API spec could bring some useful clarity and 
consistency, but it doesn't need to change any fundamentals of RFC3986.
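
As a rough illustration of that kind of API choice (a Python sketch in the 
spirit of keeping delimiters with the extracted components, not the API of the 
Haskell library), the components can be extracted so that simple concatenation 
gives back the original string.  The function names here are hypothetical.

```python
import re

# Appendix B regex from RFC 3986; groups 1, 3, 6 and 8 include the delimiters.
URI_REGEX = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_keeping_delimiters(reference: str) -> tuple:
    """Return (scheme, authority, path, query, fragment), delimiters retained."""
    m = URI_REGEX.match(reference)
    # scheme keeps ":", authority keeps "//", query keeps "?", fragment keeps "#";
    # an absent component becomes the empty string.
    return tuple(m.group(i) or "" for i in (1, 3, 5, 6, 8))


def reassemble(parts: tuple) -> str:
    """Re-assembly is plain concatenation because the delimiters were kept."""
    return "".join(parts)


original = "foo://example.com:8042/over/there?name=ferret#nose"
parts = split_keeping_delimiters(original)
print(parts)
print(reassemble(parts) == original)  # True
```

One consequence of this choice is that a present-but-empty query ("?") stays 
distinguishable from an absent one, so the round trip does not lose it.  An API 
that strips the delimiters would need some other way to record that difference.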

>
> - Where are the test cases for a given spec assertion?
>
> No URI spec, as far as I know, covers this or delivers a comprehensive
> test suite.

Assembling a comprehensive test suite could be a useful outcome.  There are 
plenty of partial test suites out there (RFC3986 has many useful test cases, Dan 
Connolly created one several years ago for his W3C work, I created one for my 
Haskell URI parser, Sam Ruby has recently been assembling test cases, and I'm 
sure there are more that can be plundered).

A note of caution: some test cases may be applicable in certain usage, and not 
universally for all URIs.  (I think some of Sam Ruby's recent tests may fall 
into this category.)

>
> *****
>
> There are certainly other questions one could ask or problems one
> could raise and I'd be very interested in reading any you might have.
>
> The general issue of standards fragmentation and lack of precise,
> accurate functional specification leads me to pursue a single, unified
> specification about which things can be proven and from which
> documents, test oracles, and test suites can be produced.

You claim "lack of precise, accurate functional specification".  I disagree 
(mostly).  A specification stands (or falls) with respect to some stated 
purpose, and I think RFC3986 does pretty well with respect to its stated purpose.

There may be other valid concerns not covered by RFC3986, and I think it's fine 
to address those concerns.  What I don't see is any good cause to tear up one of 
the Web's well-established foundational elements in the process.

I think that an attempt by a small group to produce a "single, unified 
specification" will be of little value beyond a relatively small coterie of 
developers who happen to have a shared set of concerns.

    "There are more things in heaven and earth, Horatio,
     Than are dreamt of in your philosophy."

On the other hand, I think attempting to formalize those things that RFC3986 
does say could be worthwhile, and doing likewise for additional proposals such 
as those you suggest could be a useful check on whether any additional proposals 
are or are not consistent with the core spec.

(My own Haskell implementation of URI parsing [1] was conducted, in part, as a 
way to (semi)formalize and validate the ABNF as it was being written for 
RFC3986, and I believe it may have resulted in some minor updates to the draft 
spec.)

[1] http://hackage.haskell.org/package/network-2.1.0.0/docs/Network-URI.html

#g

Received on Thursday, 9 October 2014 10:44:21 UTC