Re: resolving the URL mess from Austin William Wright on 2014-10-07 (public-urispec@w3.org from October 2014)

From: Austin William Wright <aaa@bzfx.net>
Date: Mon, 6 Oct 2014 20:54:09 -0700
To: Larry Masinter <masinter@adobe.com>
Cc: John C Klensin <klensin@jck.com>, David Sheets <kosmo.zb@gmail.com>, Sam Ruby <rubys@intertwingly.net>, "public-urispec@w3.org" <public-urispec@w3.org>, Anne van Kesteren <annevk@annevk.nl>
Message-ID: <CANkuk-WyKvDqeOUgJwXUE5VJtHpjYHYweUMOs9NW9n-g7eMArw@mail.gmail.com>
On Mon, Oct 6, 2014 at 3:48 PM, Larry Masinter <masinter@adobe.com> wrote:

> > Many software applications utilize universal identifiers so that systems
> can refer to resources residing in other systems entirely.
>
> This kind of introduction is just confusing. We're not talking about
> identifiers in general, just URI/URL.


Mostly, it's laying the foundation for why URIs/URLs exist at all. If this
foundation is being eroded by incompatible URI/URL implementations, then
that is defeating the point of the URI, and that is a problem.

I can't think of a better way to phrase this to make it clear; would you
mind suggesting an improvement?

Also, technically I don't even see a difference, as any identifier can be
translated to a URI or IRI. And this is necessarily true of universal
identifiers in general. A notable explanation is found in the "URIs and the
Test of Independent Invention" section in <
http://www.w3.org/DesignIssues/Axioms.html>.


> And frankly I think we should include the political/organizational power
> struggle which seems to fuel much of the angst that gets in the way of a
> technical solution.
>

Any power struggles I've seen so far first came about due to
interoperability problems, by way of vendors choosing mutually incompatible
workarounds.

Do you have specific examples?

Power struggles in general will only go away when the relevant parties
can check their hubris, and we can't fix that. In many cases, though, we
have the ability to fix other causes.

> The URI, while seeing near-universal adoption, has many subtle
> inconsistencies
>
> This phrase almost contradicts itself and begs the question.   Is it that
> 3986 and 3987 are unclear, imprecise, incomplete, or is it that
> implementations didn't pay attention
>
> > in implementations that threaten the ability for different systems to
> refer to each other's resources, creating fragmentation and development of
> workarounds.
>
> Are we trying to solve an implementation compatibility problem, or just a
> specification compatibility problem? Or a situation where implementations
> don't agree, but that for the most part the differences are inconsequential?
>

I don't believe the RFCs themselves are causing the problems in question.
They provide ABNF and pseudocode; I'm not sure how you can get less
confusing than that. I know David Sheets has ideas, though; do you too?
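
To illustrate how little room the RFC leaves for interpretation: RFC 3986
Appendix B gives a regular expression that splits any URI reference into
its five components. Here is a minimal sketch of it in Python. The function
name split_reference and the dictionary layout are my own choices for
illustration, not anything the RFC defines.

```python
import re

# Regular expression from RFC 3986, Appendix B. Groups 2, 4, 5, 7 and 9
# hold the scheme, authority, path, query and fragment respectively.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_reference(ref):
    """Split a URI reference into its five components (hypothetical helper).

    A component that is absent from the input is returned as None.
    """
    m = URI_REFERENCE.match(ref)  # every part is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# The example string used in RFC 3986, Appendix B.
print(split_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
# A relative reference: no scheme and no authority.
print(split_reference("../g?x"))
```

This only splits the string; it does not validate it against the full ABNF.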


> > A mission statement (charter?) would follow:
>
> > To promote the convergence of the behavior of universal identifiers
> across all applications,
>
> I think this is way too ambitiously scoped. We're not interested in "all
> applications" in the universe, just ones that use URL, as you start to
> enumerate, but of course not exhaustively. And not "the behavior of
> universal identifiers ..."  for three reasons.
>
> * it is scoped not to _all_ universal identifiers but just these
>

As mentioned, I don't see a difference, but in any event, I'm not proposing
expanding our scope beyond URI-like identifiers.

> * software has behavior, a URL doesn't 'behave'. The hope is to specify some
>    kinds of behavior using URLs: parsing, translating, comparing, relative
> resolution,
>
>    Other behavior is specified elsewhere (like 'Fetch').
>

Correct, but RFC3986 defines behavior like what it means to dereference a
resource, and how to resolve a Reference to absolute form, complete with
pseudocode. "This behavior" refers to the behavior of a compliant
implementation.

Of course, some things are (appropriately) undefined, like how to handle a
malformed URI or IRI.


> * It's too ambitious: getting implementations to converge isn't something
> a spec can do.
>    I think there are two tasks that are feasible
>    a) document current widely deployed behavior as it is, in sufficient
> detail
>      that liberal implementations can know how other software will operate
> to the
>      extent that differences matter
>    b) recommend future best practices for URL creators to improve
> interoperability
>
>
> >  by identifying inconsistencies and proposing resolutions, including in:
> > Databases, Web browsers, other Web user-agents, XML parsers,
> > a plethora of JSON Hypermedia formats like JSON Schema and Hydra,
> > Semantic Web applications and file formats like Turtle, protocols like
> > HTTP and CoAP, databases, compact notations like CURIE, and more.
>
> This kind of list can't be exhaustive. And putting it into the charter, I
> think should be done more carefully than "and more", make it clear this
> list just shows how wide deployment of non-web URLs is.
>

How about s/and more/and many, many more/?

In seriousness, I'm not trying to be exhaustive, but to make the point that
there are a *lot* of uses, without turning the charter into an encyclopedia.
Would you care to propose a specific improvement?


> > Part of the problem is we need to be absolutely crystal-clear and
> consistent about what terms we use.
>
> I’m afraid we have no control over how terms are used in the world, where
> everyone knows what a URL is. So "absolutely crystal-clear" is way beyond
> us. I'm just hoping for improved clarity in WHATWG and W3C documents.
>

Not in the world, just for our use. I'm not concerned with how the
layperson uses the terms; in most cases URL and URI can be used
interchangeably, and context frequently makes it clear that they really
mean something else altogether (e.g. "Fragmentless HTTP URL").


> > There's many specs _about_ identifiers, but they all do different things:
>
> You give your list, I think mine in my blog post is more complete. (I left
> out the work on fragment identifiers).
>
> > * URI: Authoritatively defined in RFC3986, ...
> > * IRI: Defined in RFC3987 as a....
> > * URL: The URI was created as a generalization of the URL....
> > * URN: Likewise defined in terms of the URI, ...
>
>
> Getting consensus on the history and characterizations of these protocol
> elements is very hard. I'm not sure it's possible, or necessary. I *am*
> sure trying to put one history and overview in the charter is a non-starter.
>

I'm presenting definitions that seem to be held in common. I can't find
specific numbers, but I'd be willing to bet, and Google Scholar seems to
back this hypothesis up, that RFC3986 is the most cited standards-track RFC
ever.

Unless someone has a specific objection, I'm not sure how this could be a
non-starter. This is technical literature; we're entitled to a bunch of
different terms that the layperson need not care about.


> > Because an IRI, URI, URL, and URN all contain a scheme, they are called
> "absolute".
>
> Wha? Total non-sequitur and not really accurate anyway.
>

I'm making the point that a URI (unqualified) is absolute, while a URI
Reference is either absolute or relative. Perhaps in too many words, but I
want to be technically accurate.

Why non-sequitur? The point here was to define terms.


> > Some standards defined their own set of strings largely compatible with
> URIs, mostly for technical reasons. For example, RDF 1.0 for example
> defined "RDF URI References" due to predating RFC3986 (so named despite
> being absolute). RDF 1.1 now formally uses IRIs.
>
> I don't think this is true, actually. Doesn't it use LEIRIs? (The XML
> "Legacy Extended IRI" ?).
>

Maybe in the past? I see both RDF 1.1 Concepts and RDF 1.1 Semantics
require IRIs, and RFC3987 is a normative reference.

Though an LEIRI is another good example of a custom-defined URI-compatible
identifier. <http://www.w3.org/TR/leiri/>


> > There's also the class of strings called URI References...
>
> This story is confusing.
>

URI Reference is defined in <http://tools.ietf.org/html/rfc3986#section-4.1>,
being a string that can be either an absolute or relative URI; and can be
resolved into a URI. (When unqualified, "URI" always means the absolute
kind.)

How about that?

Personally, I think of it as "a reference to a URI", URI itself being "a
reference/identifier to a resource." Recursion!
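
To make the "resolved into a URI" part concrete, here is a rough sketch in
Python using the standard library's urllib.parse.urljoin. The base URI and
the expected results are taken from the examples in RFC 3986 section 5.4;
the resolve() wrapper is just a name I made up for illustration, and
urljoin is one implementation among many, not the reference one.

```python
from urllib.parse import urljoin

# Base URI used by the resolution examples in RFC 3986, section 5.4.
BASE = "http://a/b/c/d;p?q"


def resolve(reference, base=BASE):
    """Resolve a URI reference against a base URI (hypothetical wrapper)."""
    return urljoin(base, reference)


# Relative references are resolved against the base URI.
print(resolve("g"))     # http://a/b/c/g
print(resolve("../g"))  # http://a/b/g
print(resolve("g?y"))   # http://a/b/c/g?y
print(resolve("//g"))   # http://g

# A reference that already has a scheme is absolute, so the base is unused.
print(resolve("http://example.com/blah"))  # http://example.com/blah
```

Every input here is a URI Reference; every output is a URI in the
unqualified (absolute) sense.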


> > If we need to talk about how Web browsers implement URIs (or implement
> it _differently_), I propose the term Web Browser Address. I might adopt
> the acronym WBA.
>
> Oh please, not another term! How does this help?
>

There's good precedent that people may create their own syntaxes of "URI",
but they must give them a different name; they cannot use "URI", etc.
unqualified. E.g. "RDF URI Reference" and "Legacy Extended IRI".

It's just downright impossible to refer to three or four different
definitions of a concept by the same name.

For "Web Browser Address" or just "Web Address", I actually took the term
from your blog post, <http://masinter.blogspot.com/2014/09/the-url-mess.html>.
That seems like an accurate label, but I'm open to other suggestions.


> > For the concept of the URI, URL, IRI, etc, where the meaning of the
> string
> > is uniform across time and space (as opposed to document-local ids),
> The "meaning" of "http://example.com/blah" is as uniform as it can be
> across all time and space, in that it's just a little bit of syntax to
> join together
> "http" and "example.com" and "/blah".
>
> > I will simply use the term "identifier" or "universal identifier".
>
> For what?
>

There was a whole sentence before that... How about:

Universal identifier: Any string that names/identifies a resource across
time and space. Frequently, this will be by an interpretation (i.e. <
http://en.wikipedia.org/wiki/Interpretation_(logic)>). In our work, this
almost exclusively refers to IRIs, URIs, URLs, URNs, and other forms of
strings largely compatible with them.

Does this work better?


> > I would propose the following deliverables:
>
> > (1) Can we formally adopt this terminology? ...
>
> No. Not formally or informally.
>

Why? Every specification that uses these terms must define them, in the
form of a normative reference. We are no exception.

When subtleties matter, we need well-defined terms. Otherwise we're talking
over each other and not really getting anywhere. (It's already happened.)


> > (2) What, exactly, are the incompatibilities between implementations?
>
> "exactly" is impossible. And specifying exactly may not be worthwhile. And
> which implementations count?  What incompatibilities result in
> interoperability problems.
>

Understood. Let's start with the low hanging fruit and let diminishing
marginal returns determine when to stop.


> > Why do Web browsers have a different spec or implementation *at all*?
>
> "Why" is a risky question, and also not really worth pursuing. Likely
> suspects are NIH, laziness, artifact of "browser wars" or the current "web
> standards king-of-the-mountain", overly helpful or shortcut engineering
> "race to the bottom" ....
>
> But while that might be fun to talk about, it's mainly irrelevant. Just
> focus on current state and providing a path forward.
>

Fair enough.


Austin.
Received on Tuesday, 7 October 2014 03:54:37 UTC