Re: So what is the problem with URIs?

On Wed, Oct 8, 2014 at 7:32 AM, Graham Klyne <gk@ninebynine.org> wrote:
> Hi,
>
> I've just read through the URI-spec list discussion to date, and find myself
> rather confused about what it actually hopes to achieve.

Hi Graham,

I don't think you're alone in that confusion. From my perspective, the
broad goal of any new specification effort should be harmonization of
existing standards and formalization of their components.
Specifically, Web browser implementors have found that 3986
insufficiently describes some aspects of URL parsing and manipulation.
Others in our broader community of implementors feel similarly. The
resulting WHATWG URL spec, which aims to correct this deficit in 3986,
is now making normative statements about URLs and is being touted as a
replacement for 3986.

This state of affairs is confusing and, if left unattended, liable to
make implementation of correct and interoperable (according to any
specification) URI handling even more difficult than it already is. We
already know of many areas of confusion in 3986 (percent-encoding
alphabets for different components, equivalence, parser error
recovery...) and implementations will continue to diverge without
significant effort to understand all of the present issues and unify
the browser vendors', library authors', Web authors', and users' URI
standards.

> I've been writing software and specifications that work with URIs for over a
> decade, and throughout that time I've found RFC3986 has been a perfectly
> good specification for what it covers, viz:
> - defining the syntax of a string used as a URI
> - identifying parts that can be extracted from a valid URI (*)
> - a specification for resolving a relative reference to a full (absolute)
> URI

RFC3986 does an admirable job of defining some of these structures and
functions. Notably, however, it is silent on real-world normalization,
parsing input with errors, incompatible implementations,
internationalization, and scheme-specific properties.

> There are many things that one might do with URIs, or ways in which they
> might be constructed, that are not covered by RFC3986.  In my view, that's a
> feature, not a bug.

I certainly think we should be very careful with the scope of our
work, both to ease upstream acceptance and prompt delivery and to
avoid confusion and diluted effort. With that said, it is clear that
there are a number of related functions that most implementations use
or expose that are simply not covered by 3986. We should strive to
provide a solid, unified, well-structured core specification to
alleviate the pain I mentioned above.

> So, in my view, I think a URI spec activity would usefully use RFC3986 (or
> successor) as a base specification, and create additional specs that
> describe additional usage-oriented aspects; e.g. a URI parsing API, a
> procedure for converting a manually entered string into a URI string,
> handling of URIs as identifiers vs URIs as locators, internationalization
> issues, etc.

I agree that RFC 3986 makes a useful guide (and WHATWG URL an
interesting counterpoint). I would be wary of over-modularizing
some of these URI specifications, however. Besides introducing
procedurally formal boundaries between closely related functionality,
development of these specs would almost certainly push new
requirements back onto the core specification.

I would be absolutely thrilled to see a constellation of
specifications incubated together and modularized internally. If that
effort is successful, I think it would make sense to start looking at
spinning out dependent specs.

Finally, as there appears to be interest in very accurate
specification of URI functions, I think any new effort for URI
specification will necessarily involve a significant investment in
tools for spec construction. If a specification strives to completely
describe the inputs and outputs of functions (e.g. "string -> uri"),
then, to my mind, it should exist first as a formal description of
those functions and carry annotations for human consumption second.
This is not to say that a human-readable spec is a second-class
citizen in this world; simply that a machine-analyzable spec should
also be first class!

I believe that URI functions (parsing, printing, normalizing,
equating, resolving...) are self-contained enough, small enough, and
widely used enough to make this new specification approach extremely
valuable to everyone involved.

> As such, I think a list of perceived problems might be more useful than a
> single problem statement.  Then it might be reasonable to discuss which of
> those problems are realistically addressable.

I agree! I often think in terms of questions rather than problems, though.

I'll start:

*****

- What are the common functions of type "string -> uri"?

3986 offers a regex parser and only addresses strings that match the
included ABNF.

WHATWG URL specifies a single parser, a procedural, mutable state
machine written in English prose, and aspires to cover any input
string.

There are, actually, multiple related functions of string -> uri and
some applications want to use a strict parser and some want to use a
sloppy parser. Some implementations will always compose the parser
with a normalization function or resolution and others will want to
keep those functions separate. How can we be certain that desirable
properties hold across these variations and guide implementors,
developers, authors, and users to the safest and most desirable
behavior?
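
To make the strict/sloppy distinction concrete, here is a rough Python
sketch of two "string -> uri" functions sharing one component splitter.
Only the splitting regex comes from RFC 3986 (Appendix B). The names
Uri, parse_strict, and parse_sloppy are hypothetical, the character
check is a deliberately coarse stand-in for the full ABNF, and the
repairs the sloppy parser makes are my assumptions, not the behavior
of any spec.

```python
import re
from typing import NamedTuple, Optional


class Uri(NamedTuple):
    """Hypothetical five-component result type (names are illustrative)."""
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


# Component-splitting regex from RFC 3986, Appendix B. It matches any string.
_SPLIT = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?", re.DOTALL
)

# Coarse stand-in for the ABNF: unreserved and reserved characters plus
# well-formed percent-encoded triplets. Per-component rules are NOT checked.
_CHARS = r"A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;="
_VALID = re.compile(r"(?:[" + _CHARS + r"]|%[0-9A-Fa-f]{2})*")
_ILLEGAL = re.compile(r"[^" + _CHARS + r"%]")
_STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def _split(s: str) -> Uri:
    m = _SPLIT.match(s)
    return Uri(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))


def parse_strict(s: str) -> Uri:
    """string -> uri that rejects anything outside the allowed alphabet."""
    if not _VALID.fullmatch(s):
        raise ValueError(f"not a valid URI reference: {s!r}")
    return _split(s)


def _encode(match) -> str:
    # Percent-encode each UTF-8 byte of the offending character.
    return "".join(f"%{b:02X}" for b in match.group(0).encode("utf-8"))


def parse_sloppy(s: str) -> Uri:
    """string -> uri that repairs input instead of rejecting it."""
    cleaned = s.strip()                           # drop surrounding whitespace
    cleaned = _STRAY_PERCENT.sub("%25", cleaned)  # escape dangling "%"
    cleaned = _ILLEGAL.sub(_encode, cleaned)      # escape illegal characters
    return _split(cleaned)
```

One property worth checking in a design like this is that the two
parsers agree wherever the strict one succeeds, i.e. parse_sloppy(s)
== parse_strict(s) for every s that parse_strict accepts.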

- What are the common functions of type "uri -> uri"?

3986 says there are a few components to normalize (percent hex casing,
DNS casing, scheme casing, unnecessary percent-encoding, IPv6 hex
casing, empty paths). It misses some, like query encoding and the DNS
root label, and explicitly doesn't cover internationalization.

WHATWG URL doesn't address this directly but includes a few
normalizations directly in its parser state machine.
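
As an illustration of a standalone "uri -> uri" function, here is a
minimal sketch of a few of the 3986 normalizations (scheme and host
casing, percent hex casing, unnecessary percent-encoding, empty path)
built on Python's urllib.parse. The name normalize is hypothetical.
The sketch leaves out dot-segment removal, default ports, the DNS root
label, and internationalization, and it inherits whatever urlsplit
itself does to its input.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _normalize_percent(text: str) -> str:
    """Decode escaped unreserved characters; uppercase the remaining hex."""
    def repl(m):
        char = chr(int(m.group(1), 16))
        return char if char in _UNRESERVED else "%" + m.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalize(uri: str) -> str:
    parts = urlsplit(uri)
    # Userinfo is case-sensitive, so only the host[:port] part is lowercased.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = (
        _normalize_percent(userinfo) + at + _normalize_percent(hostport.lower())
    )
    path = _normalize_percent(parts.path)
    if netloc and not path:
        path = "/"  # an empty path with an authority becomes "/"
    return urlunsplit((
        parts.scheme.lower(),
        netloc,
        path,
        _normalize_percent(parts.query),
        _normalize_percent(parts.fragment),
    ))
```

For example, normalize("HTTP://Example.COM/%7euser/%3a?q=%2f") gives
"http://example.com/~user/%3A?q=%2F". A normalization function should
also be idempotent, which is an easy property to test.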

- What are the common functions of type "uri -> uri -> uri"?

3986 specifies resolution against an absolute base URI and stays
silent on relative-relative resolution.

WHATWG URL doesn't address this directly but includes resolution as
part of its parser state machine.
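
A small sketch of "uri -> uri -> uri" as its own function, using
urljoin from Python's urllib.parse for the actual work. The wrapper
name resolve is hypothetical, and refusing a base without a scheme is
my assumption about how one might surface the absolute-base
requirement rather than silently picking a relative-relative answer.

```python
from urllib.parse import urljoin, urlsplit


def resolve(base: str, reference: str) -> str:
    """Resolve reference against base, insisting on an absolute base."""
    if not urlsplit(base).scheme:
        # Relative-relative resolution is left undefined here on purpose.
        raise ValueError("base URI must be absolute (have a scheme)")
    return urljoin(base, reference)


BASE = "http://a/b/c/d;p?q"  # base URI used by the RFC 3986 examples
print(resolve(BASE, "g"))     # http://a/b/c/g
print(resolve(BASE, "../g"))  # http://a/b/g
print(resolve(BASE, "?y"))    # http://a/b/c/d;p?y
print(resolve(BASE, "//g"))   # http://g
```

Whether a relative base should be an error, or should produce another
relative reference, is exactly the kind of question a unified spec
would have to answer.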

- What are the common functions of type "uri -> string"?

One would hope that these are only ever effectively normalization
functions (uri -> uri) composed with a single serialization function
but there may be reasons that this definition isn't possible.

3986 and WHATWG URL treat this as mostly self-evident and dependent on
the internal representation of a URI. Round-trip composition (compose
"string -> uri" with "uri -> string" and "uri -> string" with "string
-> uri") is absent from 3986, as it only covers valid grammatical
forms, and is entirely missing from WHATWG URL.
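
The two round-trip compositions can be written down as checkable
properties. This is a sketch against Python's urllib.parse, chosen
only because it is widely available; the function names are
hypothetical and the results describe that library, not either spec.

```python
from urllib.parse import SplitResult, urlsplit, urlunsplit


def string_round_trips(s: str) -> bool:
    """string -> uri -> string returns the original string."""
    return urlunsplit(urlsplit(s)) == s


def uri_round_trips(u: SplitResult) -> bool:
    """uri -> string -> uri returns an equal parsed value."""
    return urlsplit(urlunsplit(u)) == u


print(string_round_trips("http://example.com/a?b#c"))  # True
# urlsplit lowercases the scheme, so the original string is not recovered.
print(string_round_trips("HTTP://example.com/"))       # False
print(uri_round_trips(urlsplit("HTTP://example.com/")))  # True
```

The second case shows a parser that normalizes as it parses: the
string-level round trip fails even though the uri-level one holds.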

- Where are the test cases for a given spec assertion?

No URI spec, as far as I know, covers this or delivers a comprehensive
test suite.
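
As a sketch of what tying test cases to spec assertions could look
like, the table below copies a handful of reference-resolution
examples from RFC 3986, section 5.4.1, and runs them against any
"uri -> uri -> uri" implementation. The runner (failures) and the
deliberately wrong naive_resolve are hypothetical names made up for
illustration.

```python
from typing import Callable, List, Tuple
from urllib.parse import urljoin

BASE = "http://a/b/c/d;p?q"

# (reference, expected target) pairs from RFC 3986, section 5.4.1.
CASES: List[Tuple[str, str]] = [
    ("g", "http://a/b/c/g"),
    ("./g", "http://a/b/c/g"),
    ("g/", "http://a/b/c/g/"),
    ("/g", "http://a/g"),
    ("//g", "http://g"),
    ("?y", "http://a/b/c/d;p?y"),
    ("#s", "http://a/b/c/d;p?q#s"),
    ("../g", "http://a/b/g"),
]


def failures(
    resolve: Callable[[str, str], str],
    cases: List[Tuple[str, str]] = CASES,
) -> List[Tuple[str, str, str]]:
    """Return (reference, expected, actual) for every failing case."""
    failed = []
    for reference, expected in cases:
        actual = resolve(BASE, reference)
        if actual != expected:
            failed.append((reference, expected, actual))
    return failed


def naive_resolve(base: str, reference: str) -> str:
    """Deliberately wrong: swaps the last path segment textually."""
    return base.rsplit("/", 1)[0] + "/" + reference


print(failures(urljoin))             # []
print(len(failures(naive_resolve)))  # number of cases the naive version fails
```

If the table were generated from the specification itself, the same
data could drive the document's examples, a test oracle, and each
implementation's test suite.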

*****

There are certainly other questions one could ask or problems one
could raise and I'd be very interested in reading any you might have.

The general issue of standards fragmentation and lack of precise,
accurate functional specification leads me to pursue a single, unified
specification about which things can be proven and from which
documents, test oracles, and test suites can be produced.

I hope this helps clear up some of your confusion. Please let me know
if there is anything else I can help you with.

Thanks,

David

Received on Wednesday, 8 October 2014 17:18:28 UTC