Re: So what is the problem with URIs?

On Thu, Oct 9, 2014 at 11:44 AM, Graham Klyne <gk@ninebynine.org> wrote:
> Hi David,
>
> You responded:
>>
>> On Wed, Oct 8, 2014 at 7:32 AM, Graham Klyne <gk@ninebynine.org> wrote:
>>>
>>> Hi,
>>>
>>> I've just read through the URI-spec list discussion to date, and find
>>> myself rather confused about what it actually hopes to achieve.
>>
>>
>> Hi Graham,
>>
>> I don't think you're alone in that confusion. From my perspective, the
>> broad goal of any new specification effort should be harmonization of
>> existing standards and formalization of their components.
>> Specifically, Web browser implementors have found that 3986
>> insufficiently describes some aspects of URL parsing and manipulation.
>
>
> Fair enough...
>
>> Others in our broader community of implementors feel similarly. The
>> resulting WHATWG URL spec which aims to correct this deficit in 3986
>> is now making normative statements about URLs and is being touted as a
>> replacement for 3986.
>
> ... but where you lose me is in treating this as a deficiency in RFC3986.
>
> I fully accept there are things that RFC3986 doesn't cover, but as I said
> previously I see that as a feature, not a bug.  I don't see any need to go
> back and tear up RFC3986 because of the things it does not say.

I'm not sure what you mean by "tear up RFC3986" here. It is a very
valuable document and serves its purpose well. As we discuss in the
thread below, it misses a small number of items and is silent on a few
others. I don't see the scope of any replacement increasing much
beyond what RFC3986 covers now.

> To take your example of "some aspects of URL parsing and manipulation", I
> think it would be quite appropriate to write a spec that described these
> functions for browsers in a way that builds upon rather than replaces
> RFC3986.  I think it would be wrong to assume that all uses of URIs have the
> same requirements for URI parsing and manipulation, and to bake a particular
> set of mechanisms into a core URI spec would be to make the spec less useful
> for other applications.
>
> Would it be so hard, or insufficient for the example you mention, to write a
> spec called, say "URL parsing and manipulation for browsers" that describes
> how to take a string from a browser address bar and turn it into an
> RFC3986-compliant URI string?

I believe that specification would be difficult both to write well and
to consume, especially if it were precise in its error recovery and
normalization procedures. Browsers also put URI components into HTTP
requests, read them out of documents, and store them to disk. Each of
these applications needs to be understood and related to the core
specification in a way that ensures both accurately reflect deployed
and future software.

> ...
>
> TL;DR: see above.  The rest of this response delves deeper into some of the
> points you raise, but all my comments ultimately derive from the position
> indicated above.
>
> ...
>
>>
>> This state of affairs is confusing and, if left unattended, liable to
>> make implementation of correct and interoperable (according to any
>> specification) URI handling even more difficult than it already is.
>
>
> For whom?  This isn't a problem I've noticed.  I work with language
> libraries and they pretty much do what I need.

I also work with (and author) libraries and they also mostly do what I
need. The problem is that it's only *mostly* and every time I look at
the WHATWG spec, I am disappointed that I don't understand how my
software's behavior relates to browser behavior. Do I accept a subset
of what browsers accept? Do I resolve the same way? What about weird
schemes? What are the reasonable rules that I should follow? How can I
be certain that I can consume and understand nearly all URIs minted?

>> ... We
>> already know of many areas of confusion in 3986 (percent-encoding
>> alphabets for different components, equivalence, parser error
>> recovery...) and implementations will continue to diverge without
>> significant effort to understand all of the present issues and unify
>> the browser vendors', library authors', Web authors', and users' URI
>> standards.
>
>
> I recognize that there are difficulties in internationalization.  But URI
> strings as defined avoid that by sticking to US-ASCII.  IRIs are an attempt
> to address these issues, and I accept that's an area that might usefully be
> clarified and regularized.

Indeed. The relationship between URI and IRI is something that I
believe would benefit from coordination between a post-3986 spec and a
post-3987 spec. Specifically, the IRI spec should be able to
mechanically build on the concepts described in the URI spec.

>>> I've been writing software and specifications that work with URIs for
>>> over a decade, and throughout that time I've found RFC3986 has been a
>>> perfectly good specification for what it covers, viz:
>>> - defining the syntax of a string used as a URI
>>> - identifying parts that can be extracted from a valid URI (*)
>>> - a specification for resolving a relative reference to a full (absolute)
>>> URI
>>
>>
>> RFC3986 does an admirable job at defining some of these structures and
>> functions. Notably, RFC3986 is silent on real-world normalization,
>> parsing input with errors, incompatible implementations,
>> internationalization, and scheme-specific properties.
>
>
> Sure it's silent on those things, and I'll repeat:  I think that's a feature
> not a bug, because I don't think there's a single solution for these that's
> best for all purposes:
>
> - real-world normalization:  for what purpose?  I submit that different
> purposes will require different normal forms.  The main issue I come across
> is URI equality testing, but in practice I find that most of the time it's
> sufficient to treat the URI as an opaque string and compare that (per
> RFC3986).  It may be that there are different URIs that dereference or
> identify the same resource, but no amount of normalization will make that
> problem go away - ultimately it's an issue that applications (of which
> browsers are one class) must deal with.

I agree. Applications which wish to communicate accurately need to be
able to agree on a normalization form for the types of the protocol
elements they interchange. I desire a standard set of normal forms
from which to choose and statements about the properties and
interoperability of those forms.
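
As one illustration of what a named normal form could look like, here is
a minimal Python sketch of two of the syntax-based steps in RFC 3986
section 6.2.2: case normalization and percent-encoding normalization.
The function names are hypothetical, the component split uses the
regular expression from RFC 3986 Appendix B, and dot-segment removal and
scheme-specific rules are deliberately left out.

```python
import re

# Component-splitting regular expression from RFC 3986 Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _escape(match):
    # Decode escapes of unreserved characters; uppercase the hex of the rest.
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def fix_escapes(text):
    return re.sub(r"%([0-9A-Fa-f]{2})", _escape, text)


def normalize(uri):
    """Hypothetical 'case + percent-encoding' normal form (an assumption)."""
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    out = ""
    if scheme is not None:
        out += scheme.lower() + ":"
    if authority is not None:
        # Only the host (and port) is lowercased; userinfo keeps its case.
        userinfo, at, hostport = authority.rpartition("@")
        # Decode first, lowercase, then restore uppercase hex in escapes.
        host = fix_escapes(fix_escapes(hostport).lower())
        out += "//" + fix_escapes(userinfo) + at + host
    out += fix_escapes(path)
    if query is not None:
        out += "?" + fix_escapes(query)
    if fragment is not None:
        out += "#" + fix_escapes(fragment)
    return out


example = "HTTP://Example.COM/%7euser/a%2fb?q=%c3%a9#Frag"
print(normalize(example))
```

Under these assumptions the form is idempotent, which is the kind of
property a specification could state about each normal form it names.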

> - dealing with input errors:  error recovery is surely an application issue?
> I'd suggest if there's a standardized "recovery" for an error then it's not
> an error so much as an alternative form.

Unfortunately, we already live in that world. We can't deny that
classes of applications want to use sloppy URI-like forms for user
input. We can absolutely specify what those forms mean and caution
against their use in interchange. If the present course is followed,
the sloppy alternative form will become the only form or we will have
two de facto forms and lots of pain.

> - incompatible implementations: again, I think this only makes sense with
> some particular purpose in mind, and not all URI-using applications have the
> same purposes.

By defining sets of functions that applications implement, we can
directly compare the behavior of different applications for the same
purpose. If I expect your dereferencer to perform sloppy parsing and
basic normalization, I should be able to transmit to you a correct URI
with basic normalization and you should be able to parse it and
understand it unchanged from my transmission. We don't even have the
means to talk about this behavior at the specification level,
currently.

> - internationalization: agree - see above - for those applications that need
> to deal with mapping between human-readable IRIs and US-ASCII-based URIs as
> protocol elements.  But not all applications do (or not in the full
> generality where many of the I18N demons seem to lurk).

I agree. This is one clear module boundary in any specification of this kind.

> - scheme-specific properties: surely, these are for scheme definitions to
> describe (within the framework of what is described for generic URIs)?

By "scheme-specific properties", I meant concepts like relative
schemes which are only mentioned in passing in RFC 3986. As it stands,
these properties of classes of schemes are not covered by
<https://tools.ietf.org/html/draft-ietf-appsawg-uri-scheme-reg-02>.

> So, while I agree that there are things that can usefully be done, I'm not
> seeing anything here that requires replacement of RFC3986.

Jumping between documents that are not formalized and not necessarily
in sync in order to get clarification on many minor issues is painful.
A consolidated specification (not with the kitchen sink, though) is
much easier to use and consult.

>>> There are many things that one might do with URIs, or ways in which they
>>> might be constructed, that are not covered by RFC3986.  In my view,
>>> that's a feature, not a bug.
>>
>>
>> I certainly think we should be very careful with the scope of our work
>> for upstream acceptance, prompt delivery, confusion avoidance, and
>> effort dilution purposes. With that said, it is clear that there are a
>> number of related functions that most implementations use or expose
>> that are simply not covered by 3986. We should strive to provide a
>> solid, unified, well-structured core specification to alleviate the
>> pain I mentioned above.
>
>
> I have my doubts that this is possible, because I don't believe there exist
> one-size-fits-all solutions to the issues you mention.  If and where such
> solutions do exist, then I think they can be written as separate specs that
> build upon RFC3986, and can prove their worth in that form.

You don't believe that there are a few common, related functions
involving URIs that 95% of applications use? RFC3986 indicates that
some of these functions exist and describes them both normatively and
informatively. Unfortunately, I don't see how formal specifications
can effectively use RFC3986 without essentially formalizing it. At
that point, RFC3986 should be updated using the results of the
formalization.

> Proven solutions might then be merged into future successors of RFC3986.

Proven how? As I see it, our work here is to help define what we would
like to see in a successor to RFC3986, RFC3987, WHATWG URL, and
related, dependent standards. A lot of the functions we're talking
about are already widely deployed and do not agree in their behavior.

> (This, BTW, is my notion of how a "living standard" might work: not as a
> dynamic document, but as a dynamic constellation of individual pieces, those
> of which that have proven their worth used to provide stable points of
> reference.  For the time being, I see RFC3986 as one of those stable points,
> and we risk great damage by trying to tinker with its scope.)

We are a long way off from submitting an RFC. My hope is that our work
on the concepts already specified in RFC 3986 can be separated easily
from any other work we've done when the time comes to submit to the
IETF. Until then, I don't see value in continuing to fragment the
effort around harmonizing URI functional specifications.

>>> So, in my view, I think a URI spec activity would usefully use RFC3986
>>> (or successor) as a base specification, and create additional specs that
>>> describe additional usage-oriented aspects; e.g. a URI parsing API, a
>>> procedure for converting a manually entered string into a URI string,
>>> handling of URIs as identifiers vs URIs as locators, internationalization
>>> issues, etc.
>>
>>
>> I agree that RFC 3986 makes a useful guide (and WHATWG URL an
>> interesting counterpoint). I would be wary of over-modularization of
>> some of these URI specifications, however. Besides introducing very
>> procedurally-formal boundaries between closely related functionality,
>> development of these specs would almost certainly push back new
>> requirements on the core specification.
>
>
> I think RFC3986 is really much more than a "useful guide".  We have 25 years
> of developed software based on the key ideas of which RFC3986 is the current
> evolved specification.  I think it needs to stand at the heart of any URI
> clarification efforts (not protected from evolution where needed, but used
> as an anchor point to which other developments can be referred).  (FWIW, as
> a developer I've never consulted the WHATWG URL spec, as I find
> that RFC3986 is generally adequate for my needs and has the great advantage
> of being stable.  So this developer has no need of the WHATWG URL spec.)

I agree that it is the current foundational document and that it does
a good job at describing these URI objects. Much of RFC3986 is very
clear and helpful to implementors, developers, authors, and users.
3986 does not answer many questions regarding edge cases and specific
behaviors, however. It is those edge cases and behaviors that I would
like to see clarified.

> I really think that monolithic specs covering all uses are a bad idea, as
> they come to look like application specifications, and end up prescribing
> things that should properly be left as application implementation concerns
> rather than focusing on the essentials needed for interoperability.

I don't want to cover all uses or produce a monolithic spec. I'd
rather not work on any API specifics except to provide a suitable
vocabulary and function specifications for later work. I don't see how
leaving things like normalization unspecified is helpful, however, as
it harms interoperability.

With regard to my assertion that we should develop related components
together in a single spec, I only favor this approach at an early
stage so that module interfaces can be fluid in development. Official
publication of these components is a separate matter.

> I see URIs in information architectures as somewhat like the hourglass neck
> represented by the IP protocol in the family of Internet protocol standards.  By
> sticking to a minimum core concern, it is able to support a greater variety
> of applications than a more comprehensive specification might do.  Of
> course, additional specifications may still be needed for those particular
> applications.
>
>>
>> I would be absolutely thrilled to see a constellation of
>> specifications incubated together and modularized internally. If that
>> effort is successful, I think it would make sense to start looking at
>> spinning out dependent specs.
>
>
> I think there's a danger here of engaging in a monumental act of hubris, by
> assuming that you can bring all of the required breadth of expertise into a
> single forum.  Far safer, and more productive IMO, would be to stick with a
> core functionality of known value, and then develop specifications that
> build on those core capabilities in well defined ways.  I think you're much
> more likely to end up identifying a constellation of universally useful
> features that way than trying to incubate them together.

We may be misunderstanding each other because what you describe is
very much what I envision. Potentially, we are talking past each other
because I see a few of the dependent specifications as helping to
inform the properties of the core specification. If the core
specification offers a predicate for an absolute URI, a dependent spec
may want to reference it or push back with a new scheme property that
clarifies a resolution behavior.

> As IETF URI scheme reviewer, I see a lot of scheme proposals that have very
> little, if anything, to do with the Web.  Given that the URI spec is one of
> the foundation pieces of the Web, I sometimes find this a bit disconcerting.
> But it is also testament to the widespread utility of URIs as an engineering
> artifact beyond the Web for which they were designed.  IMO, this kind of
> utility is most unlikely to be achieved by an attempt to incubate a
> constellation of core specifications.  In this, I strongly believe less is
> more - i.e. by doing less we can in the long run achieve more.

I think URI is already quite successful and much of that success is
attributable to its specifications. It's not clear to me that the
present method of specification is sufficient to achieve global
interoperability in all of the facets that URIs expose. With the
continued deployment of URI both on the Web and off, we see new
implementation concerns and new interpretations of the same prose
specification.

As you say, the URI spec is one of the foundation pieces of the Web. I
believe it should have the absolute best specification we can muster
and this includes machine-assisted checking, testing, and proving.

>> Finally, as there appears to be interest in very accurate
>> specification of URI functions, I think any new effort for URI
>> specification will necessarily involve a significant investment in
>> tools for spec construction. If a specification strives to completely
>> describe the inputs and outputs of functions (e.g. "string -> uri"),
>> then, to my mind, it should exist as a formal description of such
>> first and include annotations for human consumption secondarily. This
>> is not to say that a human-readable spec is a second-class citizen in
>> this world; simply that a machine-analyzable spec should also be first
>> class!
>
>
> I think that's an orthogonal concern.  We already have some such tools (ABNF
> comes to mind), though clearly there are others that might be considered.
> I'd be very wary about making the development of such tools a part of a URI
> specification group's charter.

ABNF is a fine tool for some uses but misses some crucial features. I
believe there are other, related tools that we could put to good use.
I don't expect this group or any specification group to embark on the
*development* of those tools but I do expect to utilize the best tools
available and I will, personally, contribute to their development.

>> I believe that URI functions (parsing, printing, normalizing,
>> equating, resolving...) are self-contained enough, small enough, and
>> widely used enough to make this new specification approach extremely
>> valuable to everyone involved.
>
>
> At the risk of sounding like a broken record, I think for the most part
> that they'd be equally useful as satellite specifications around the core of
> RFC3986.

If the core of RFC3986 were sufficiently formalized to support those
efforts, I would probably agree with you.

> If it turns out that these specs expose requirements that cannot be
> achieved within what is mandated by RFC3986, then there is a case for
> updating RFC3986 with respect to just those identified requirements - but I
> think that case needs to be established before considering changes to
> RFC3986.

I propose we attempt to formalize RFC3986 and a constellation of
related functions around it in order to establish what official
specification changes should be made.

>>> As such, I think a list of perceived problems might be more useful than a
>>> single problem statement.  Then it might be reasonable to discuss which
>>> of those problems are realistically addressable.
>>
>>
>> I agree! I often think in terms of questions rather than problems, though.
>>
>> I'll start:
>>
>> *****
>>
>> - What are the common functions of type "string -> uri"?
>>
>> 3986 says regex parser and only talks about string when it matches the
>> included ABNF.
>
> IIRC, the regex is in a non-normative appendix.  RFC3986 says nothing
> normatively about *how* to parse a URI, just what constitutes a
> syntactically well-formed URI.

I'm not sure how the normativity of the regex parser that 3986
mentions is relevant. 3986 doesn't prescribe a regex parser but does
inform you that such a parser is possible for the grammar. This seems
to be a treatment of parsing. That is, 3986 does not mandate a parsing
method or means but does assert that valid URI references are parsable
with a regular expression. If this is not the case, the informative
regular expression is incorrect regarding the normative grammar and
should be removed from the document.

> The closest to a normative processing spec is the relative reference
> resolution, which in turn depends on isolation of key elements within the
> URI (scheme, authority, etc.).  But even that, as I recall, is not a
> normative procedure: other implementations are OK if they achieve the same
> result.

To achieve the same result, those other implementations must be
functionally equivalent. I'm not interested in specifying mandatory
implementations; I am interested in specifying precise functional
equivalence.

> So, yes, a parsing spec could be useful, but I don't see that it needs to be
> part of the core URI spec.  Similarly, I think an API spec might be useful
> to promote consistency between URI library implementations, but again not as
> part of the core.
>
>>
>> WHATWG URL says procedural, mutable state machine in English prose
>> parser (one total) and aspires to cover any input string.
>
>
> But not all applications have a need to "cover any input string" - sometimes
> the right thing to do is say "that's not a URI".  Most of the time, that's
> all I need in my work.  As you say...
>
>>
>> There are, actually, multiple related functions of string -> uri and
>> some applications want to use a strict parser and some want to use a
>> sloppy parser. Some implementations will always compose the parser
>> with a normalization function or resolution and others will want to
>> keep those functions separate. How can we be certain that desirable
>> properties hold across these variations and guide implementors,
>> developers, authors, and users to the safest and most desirable
>> behavior?
>
>
> The problem with creating a catalogue of functions is that it's not clear
> where the cut-off should be.  The focus on a specification here should IMO
> be to address interoperability problems;  so I think it might be more useful
> to draw up a list of known interoperability problems, and then consider
> which of those might be addressed by a clearer specification.

What are the exact rules for percent encoding? Is %2F a slash in a
path component? Should /%2E%2E/ in a path component cause traversal?
What is the fragment when I receive a URI reference of "/?##"? Can I
use percent encoding in a scheme? In a host? When are those
equivalent? What about internationalized domain name encoding with
equivalent glyphs in an otherwise ASCII URI?

Perhaps you have an interpretation of RFC3986 that gives you these
answers. I don't think the answers are very clear in the document as
it stands. Furthermore, I'd be very surprised if implementations were
consistent in how these questions are handled.
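
To make a few of these questions concrete, here is a small Python
sketch using only the standard library. It does not settle what the
right answers are; it only shows that the outcome depends on choices
(where to split, when to decode) that an implementation has to make.
The variable names are illustrative.

```python
from urllib.parse import unquote

# "/?##": splitting at the first "#", as the RFC 3986 Appendix B regular
# expression does, leaves a fragment of "#". The ABNF fragment rule does
# not allow a raw "#", so the reference is not valid under the grammar.
before_hash, _, fragment = "/?##".partition("#")

# "%2F" in a path: the number of segments depends on whether the path is
# split before or after percent-decoding.
path = "/a%2Fb/c"
split_then_decode = [unquote(segment) for segment in path.split("/")]
decode_then_split = unquote(path).split("/")

# "/%2E%2E/": "." is an unreserved character, so decoding the escapes
# yields a dot segment that a later step might treat as traversal.
decoded_dots = unquote("/%2E%2E/")

print(repr(before_hash), repr(fragment))
print(split_then_decode)
print(decode_then_split)
print(decoded_dots)
```

Two libraries that differ only in the order of these steps will
disagree on the segment count and on whether traversal happens.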

>> - What are the common functions of type "uri -> uri"?
>>
>> 3986 says there are a few components to normalize (percent hex casing,
>> DNS casing, scheme casing, percent unnecessity, IPv6 hex casing, empty
>> paths). It misses some like query encoding and DNS root label and
>> explicitly doesn't cover internationalization.
>
>
> Again, I think it would be more helpful to identify actual interop problems.
> I've often had to face the question of whether or not to %-encode, but it's
> rarely turned out to cause an interoperability problem.  On the few
> occasions it has, I've found the guidance in RFC3986 has been enough.  But
> YMMV.

I'd like my URI library to be correct.

>> WHATWG URL doesn't address this directly but includes a few
>> normalizations directly in its parser state machine.
>>
>> - What are the common functions of type "uri -> uri -> uri"?
>>
>> 3986 says resolution against an absolute URI and stays silent on
>> relative-relative resolution.
>
>
> I use relative reference resolution quite a lot in my work, and I've never
> found this to be a problem.  I'm not offhand sure why, but can think of two
> possible reasons:
> (a) the absolute -> relative -> uri function as described also works for
> relative -> relative -> uri
> (b) if the end goal is an absolute URI, then the sequence can be always
> performed as a series of absolute -> relative -> uri
>
> But I'll accept that a clear specification of valid outcomes of relative ->
> relative -> uri could be useful.

As far as I can tell, (a) does not hold (think of collecting parent
traversals on the left-hand side). (b) is not possible in many cases
when a base is unknown to intermediate functions. I'd really like to
see relative-relative resolution specified.
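
Here is a rough Python sketch of one reading of the left-hand-side
problem. It transcribes remove_dot_segments from RFC 3986 section 5.2.4
and applies the section 5.2.3 merge to a relative base, which the RFC
does not define; merge_paths and the example URIs are assumptions made
for illustration. urljoin from the standard library is used for the
absolute cases.

```python
from urllib.parse import urljoin


def remove_dot_segments(path):
    """RFC 3986 section 5.2.4, transcribed step by step."""
    output = []
    while path:
        if path.startswith("../"):          # step A
            path = path[3:]
        elif path.startswith("./"):         # step A
            path = path[2:]
        elif path.startswith("/./"):        # step B
            path = path[2:]
        elif path == "/.":                  # step B
            path = "/"
        elif path.startswith("/../"):       # step C
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":                 # step C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):           # step D
            path = ""
        else:                               # step E: move one segment over
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def merge_paths(base_path, ref_path):
    # Section 5.2.3 merge, applied here to a base with no authority.
    return base_path[: base_path.rfind("/") + 1] + ref_path


# Relative base "a/b", relative reference "../../x".
naive = remove_dot_segments(merge_paths("a/b", "../../x"))

absolute = "http://h/p/q/"
stepwise = urljoin(urljoin(absolute, "a/b"), "../../x")
composed = urljoin(absolute, naive)

print(naive)     # the extra ".." is dropped and the path became rooted
print(stepwise)
print(composed)
```

The stepwise result and the composed result differ because the
algorithm has nowhere to keep a ".." that climbs above the relative
base. A relative-relative function would presumably have to return
"../x" here, which does resolve to the stepwise result.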

Another possible function to consider specifying is unresolution or
difference. That is, two absolute URIs form the sides of a triangle
from the scheme-registry origin. What is the URI from one to the
other? (It may also be absolute.) This function may be suitable for a
dependent spec but it is very common in URI manipulation libraries
(including those in JavaScript for in-page use).

>> WHATWG URL doesn't address this directly but includes resolution as
>> part of its parser state machine.
>>
>> - What are the common functions of type "uri -> string"?
>>
>> One would hope that these are only ever effectively normalization
>> functions (uri -> uri) composed with a single serialization function
>> but there may be reasons that this definition isn't possible.
>>
>> 3986 and WHATWG URL treat this as mostly self-evident and dependent on
>> the internal representation of a URI. Round-trip composition (compose
>> "string -> uri" with "uri -> string" and "uri -> string" with "string
>> -> uri") is absent from 3986 as it only covers valid grammatical forms
>> and entirely missing from WHATWG URL.
>
>
> I'd say that RFC3986 just doesn't address this, but leaves this as an API
> issue.  For example, in my Haskell URI parser, I created functions to
> extract components which some have argued is not correct.  I made some
> choices that meant it was easier to re-assemble an original URI from its
> components (e.g. including ":" in an extracted scheme) - I don't think any
> of my choices violated any edict of RFC3986, but different implementers
> could reasonably make different choices.
>
> So I'd say this is an area where an API spec could bring some useful clarity
> and consistency, but it doesn't need to change any fundamentals of RFC3986.

I think RFC3986 should assert that conforming implementations should
reach a fixpoint in a single cycle. Additionally, this fixpoint should
be independent of the object's domain. That is, the same URI should
result from (parse (serialize (parse string))) and (parse (serialize
uri)) when "string" and "uri" represent the same URI.

Finally, if there were a standard set of normalization forms,
implementations could describe their round-trip behavior in terms of
these normalization classes.

>> - Where are the test cases for a given spec assertion?
>>
>> No URI spec, as far as I know, covers this or delivers a comprehensive
>> test suite.
>
>
> Assembling a comprehensive test suite could be a useful outcome.  There are
> plenty of partial test suites out there (RFC3986 has many useful test cases,
> Dan Connolly created one several years ago for his W3C work, I created one
> for my Haskell URI parser, Sam Ruby has recently been assembling test cases,
> and I'm sure there are more that can be plundered).

Indeed. I have a couple of my own and I believe Martin Dürst has
extensive suites for URI and IRI.

I would like to have the test cases derived from the specification
itself, exhaustively. I believe there are many benefits to this
approach including test case metadata, spec/test synchronization, and
test-annotated spec documents. Additionally, with a comprehensive
derived test suite, we can effectively interrogate implementations on
which fragments of the specification they obey and which they violate.
We can also compare the coverage of a collected set of tests (which
can have their domain and range now precisely defined) against the
derived tests and the specification document itself.

> A note of caution: some test cases may be applicable in certain usage, and
> not universally for all URIs.  (I think some of Sam Ruby's recent tests may
> fall into this category.)

This is one of the reasons that I would like to specify as many of
these auxiliary functions as possible so that we may understand
precisely what we are testing, where, and why.

>> *****
>>
>> There are certainly other questions one could ask or problems one
>> could raise and I'd be very interested in reading any you might have.
>>
>> The general issue of standards fragmentation and lack of precise,
>> accurate functional specification leads me to pursue a single, unified
>> specification about which things can be proven and from which
>> documents, test oracles, and test suites can be produced.
>
>
> You claim "lack of precise, accurate functional specification".  I disagree
> (mostly).  A specification stands (or falls) with respect to some stated
> purpose, and I think RFC3986 does pretty well with respect to its stated
> purpose.

I don't disagree. RFC3986's stated purpose was not to provide a
mathematically functional specification.

> There may be other valid concerns not covered by RFC3986, and I think it's
> fine to address those concerns.  What I don't see is any good cause to tear
> up one of the Web's well-established foundational elements in the process.

Again, I'm not sure what you mean by "tear up" in this context. I want
to define RFC3986 unequivocally.

> I think that an attempt by a small group to produce a "single, unified
> specification" will be of little value beyond a relatively small coterie of
> developers who happen to have a shared set of concerns.

We shall see how the work progresses. My intent is to strive for
adoption through overwhelming utility and I would appreciate your
support and guidance.

>    "There are more things in heaven and earth, Horatio,
>     Than are dreamt of in your philosophy."
>
> On the other hand, I think attempting to formalize those things that RFC3986
> does say could be worthwhile, and doing likewise for additional proposals
> such as those you suggest could be a useful check on whether any additional
> proposals are or are not consistent with the core spec.

This is my goal.

> (My own Haskell implementation of URI parsing [1] was conducted, in part, as
> a way to (semi)formalize and validate the ABNF as it was being written for
> RFC3986, and I believe it may have resulted in some minor updates to the
> draft spec.)

Neat! It's already on the Implementations page
<https://github.com/urispec/urispec/wiki/Implementations-and-Use-Cases>
of the wiki.

Thanks for your interest, Graham. I look forward to more discussions
in the future.

Best regards,

David

Received on Thursday, 9 October 2014 12:58:18 UTC