Re: HTML and URI References compatability conserns from David Sheets on 2014-09-12 (semantic-web@w3.org from September 2014)

From: David Sheets <sheets@alum.mit.edu>
Date: Fri, 12 Sep 2014 01:05:17 +0100
To: Austin William Wright <aaa@bzfx.net>
Cc: Damian Steer <pldms@mac.com>, Semantic Web <semantic-web@w3.org>
Message-ID: <CAAWM5TwRPT5j7TuZt0q1s3WZFrNdvDfNrFk-jLTiMFdP13hn7g@mail.gmail.com>
On Fri, Sep 12, 2014 at 12:27 AM, Austin William Wright <aaa@bzfx.net> wrote:
> Since I maintain URI and IRI libraries, and numerous programs that use URIs
> for stating relationships (JSON Schema, RDF Interfaces, Turtle parser, and
> more), I'm interested in getting involved, pending some questions about the
> purpose of the proposed Community Group. Certainly there's been a lot of
> drama, since I sent this message, on public-webapps, www-tag, and
> public-w3process about the fork of the "URL" document. Will a Community
> Group be able to positively impact the issue?

I believe that a Community Group which communicates regularly and
openly about its progress on a formal specification will be able to
positively affect the present issues. In particular, I think that a
Community Group offers a place to work on a well-engineered
specification using modern tools without requiring immediate buy-in
from existing groups. Once our methods have been demonstrated, I
expect work to move to other, more traditional specification venues.

> Will we be able to shed light on the Semantic Web uses of the URI, IRI, and
> URI Reference? (The current documents seem to think that only Web browsers
> consume URIs.)

The Semantic Web consumers of URI references (which, in my view,
encompass URL, URI, and IRI) are an important constituency of any
URI specification document. However, I do not currently see a place
for Semantic Web or Linked Data specific content in such a document.
That is not to say that SemWeb concerns shouldn't be considered --
just that SemWeb uses of URI references should be clearly possible but
not called out specifically.

> Most importantly, I don't think it's necessary -- or even normatively
> possible -- to re-define how to parse URIs in HTML or any other spec. This
> is normatively done _only_ by RFC 3986 or a published successor that
> obsoletes it.

I intend to incubate a successor with this Community Group. It is my
sincere hope that we will, before the end of 2015, have begun the IETF
RFC process for a new URI reference standard.

> I would like to see a "URI/IRI API" that correctly uses the RFC3986/3987
> terminology. Would publishing an ECMAScript API be in scope?

Yes, publishing an ECMAScript API would eventually be in scope as such
an API would expose functions which the specification describes. I am
personally the maintainer of the ocaml-uri library
<https://github.com/mirage/ocaml-uri> and I would very much like to
see a test suite and test oracle for use against ECMAScript and other
languages' libraries.

Initially, the definition of the ECMAScript API could be sketched but
defining it more elaborately should probably wait until the functions
being specified are more clear. At that time, it may be the case that
the ECMAScript API we propose actually exposes only composites of the
specified functions (e.g.: compose parse normalize resolve).
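
As a rough illustration of what "exposing only composites" might look
like, here is a minimal Python sketch. It is not the proposed
ECMAScript API. The names compose, normalize, resolve_against and
absolute_normal are hypothetical, the standard urllib.parse module
stands in for the specified functions, and the normalization shown
covers only host case.

```python
from functools import reduce
from urllib.parse import SplitResult, urljoin, urlsplit, urlunsplit


def compose(*fns):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return reduce(lambda f, g: lambda x: f(g(x)), fns)


def parse(ref: str) -> SplitResult:
    # Stand-in for a specified parse function.
    return urlsplit(ref)


def normalize(parts: SplitResult) -> SplitResult:
    # Illustrative only: lowercase the host and port part of the
    # authority, leaving any userinfo before "@" untouched.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    return parts._replace(netloc=userinfo + at + hostport.lower())


def resolve_against(base: str):
    # Stand-in for a specified reference-resolution function.
    return lambda ref: urljoin(base, ref)


def absolute_normal(base: str):
    # The only function a public API might expose: a composite that
    # resolves, parses, normalizes, and serializes in one step.
    return compose(urlunsplit, normalize, parse, resolve_against(base))
```

A caller would then write absolute_normal("http://Example.COM/a/b")
and apply the result to each reference found in a document, without
ever touching the intermediate parsed form.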

> And as mentioned earlier, I'm interested in research into current
> implementation bugs of user agents and non-Web applications that consume
> IRIs, and if there's a way to fix them that's not (net) harmful. This is
> also one of the intended purposes, correct? For instance, there could
> possibly be a document describing how to fix invalid URI References, if that
> is acceptable (i.e. no "URI Strict Mode").

It's not clear to me if you are referring to fixing implementations or
fixing URIs. In general, there doesn't seem to be a valid way to fix
URIs that may have been used in a SW context as the only general
equality is byte-for-byte. With that said, I am very much interested
in specifying functions that consume potentially invalid URIs and
normalize them to be valid. If one understands the risks, such a
function could be used to "fix" invalid URIs.

There are a number of different normalizations:

1. valid -> normal
2. invalid -> valid
3. invalid -> normal

Ideally, 3 is the composition of 2 followed by 1. Normal URIs should
be fixpoints of 1 (that is, applying 1 to an already normal URI leaves
it unchanged). These functions would be most useful at the publication
side and could be used to great effect in careful consumers.
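
The relationship between the three normalizations can be sketched in
Python as follows. This is a sketch under stated assumptions, not a
specified algorithm: it works on a path component only, the function
names and the ALLOWED character set are illustrative, and the rules
shown are just the percent-encoding normalizations described in RFC
3986 section 6.2.2.

```python
import re

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")
UNRESERVED = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
# Characters assumed acceptable unescaped in a path: pchar plus "/".
ALLOWED = set(UNRESERVED + "!$&'()*+,;=:@/")


def invalid_to_valid(path: str) -> str:
    """(2) Percent-encode anything a path may not contain literally."""
    out = []
    i = 0
    while i < len(path):
        ch = path[i]
        if ch == "%" and _PCT.match(path, i):
            out.append(path[i:i + 3])  # well-formed triplet, keep it
            i += 3
        elif ch in ALLOWED:
            out.append(ch)
            i += 1
        else:
            # Stray "%", spaces, non-ASCII, etc. become UTF-8 triplets.
            out.extend("%%%02X" % b for b in ch.encode("utf-8"))
            i += 1
    return "".join(out)


def valid_to_normal(path: str) -> str:
    """(1) Decode escaped unreserved characters, uppercase the rest."""
    def fix(m):
        ch = chr(int(m.group(1), 16))
        return ch if ch in UNRESERVED else "%" + m.group(1).upper()
    return _PCT.sub(fix, path)


def invalid_to_normal(path: str) -> str:
    """(3) Defined as (2) followed by (1)."""
    return valid_to_normal(invalid_to_valid(path))
```

The fixpoint property is then a checkable statement: for any input p,
valid_to_normal(invalid_to_normal(p)) equals invalid_to_normal(p).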

> Generally, the goal is to work all the current issues of interoperability
> between Working Groups out? Wouldn't e.g. appsawg at the IETF, or another WG
> that deals with the URI, also be suited for this purpose?

A goal is to work out the issues of interoperability between the
Working Groups and the Real World. In addition, another goal is to
produce a single specification document that describes as fully as
possible the structure and interpretation of URI references, URLs,
IRIs, URNs, etc. This single source can then be used to generate a
text document, an executable test oracle, theorems about URIs, and
potentially an exhaustive test suite.

The venues you mention would be the ideal place for this work if the
use of formal methods, specifically specification using the Lem
<http://www.cl.cam.ac.uk/~pes20/lem/> tool, would be accepted. I do
not have high hopes that these venues are yet ready for such a
proposal. Therefore, I am starting a Community Group in which to
incubate this human-readable and machine-executable specification. I
believe we should have demonstrable proof that our methods work well
and provide value before we approach traditional standardization
bodies.

I hope that you'll join me in supporting a single, readable source of
URI specification which is guaranteed to stay in sync with an
executable model and is robust enough to be used to enumerate its own
test suite. I will begin with IPv4 and IPv6 address parsing including
interface identifiers. I am the primary author of
<https://github.com/mirage/ocaml-ipaddr> which does precisely this but
does not yet handle interface identifiers. I believe this subcomponent
of the specification can easily be written in fewer than 20 hours.
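
To give a feel for what checking a parser against a test oracle could
look like for the IPv4 case, here is a small Python sketch. It is not
taken from ocaml-ipaddr or from any Lem specification. parse_ipv4 is a
hypothetical hand-written dotted-quad parser, and Python's standard
ipaddress module is used as a stand-in oracle. Inputs with leading
zeros are deliberately left out of the comparison because the standard
module's treatment of them has changed between Python releases.

```python
import ipaddress


def parse_ipv4(text: str):
    """Parse a strict dotted-quad address to an int, or return None."""
    parts = text.split(".")
    if len(parts) != 4:
        return None
    value = 0
    for part in parts:
        # Each octet must be a non-empty run of ASCII digits.
        if not (part.isascii() and part.isdigit()):
            return None
        # Reject leading zeros such as "01" (ambiguous with octal).
        if len(part) > 1 and part[0] == "0":
            return None
        n = int(part)
        if n > 255:
            return None
        value = (value << 8) | n
    return value


def oracle(text: str):
    """Reference answer from the standard library, or None if rejected."""
    try:
        return int(ipaddress.IPv4Address(text))
    except ValueError:
        return None


def disagreements(samples):
    """Return the samples on which the parser and the oracle differ."""
    return [s for s in samples if parse_ipv4(s) != oracle(s)]
```

Calling disagreements on a list of sample strings returns the inputs
on which the two implementations differ, which is the kind of report
an exhaustive, generated test suite would produce at scale.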

Perhaps one of the hardest parts of this specification process will be
writing the proofs to demonstrate that high-level properties (e.g.
grammars) are satisfied by low-level specifications. Another difficult
point will be error recovery and handling. This issue in particular
will likely require nearly every syntactic component to allow error
variants which describe the issues with parsing but allow processing
to continue. Higher level functions can then specify precisely which,
if any, errors are allowed.
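
One way to picture error variants that let processing continue is the
following Python sketch for a single component, the port. Everything
here is an assumption made for illustration: the PortError variants,
the Parsed record, and the recovery rule (keep the leading digits) are
not drawn from any existing specification.

```python
import re
from dataclasses import dataclass, field
from enum import Enum, auto


class PortError(Enum):
    TRAILING_GARBAGE = auto()  # non-digit characters after the digits
    OUT_OF_RANGE = auto()      # numeric value above 65535


@dataclass
class Parsed:
    value: object                       # recovered value, or None
    errors: list = field(default_factory=list)


def parse_port(text: str) -> Parsed:
    """Low-level parser: records problems but always returns a result."""
    digits = re.match(r"[0-9]*", text).group(0)
    errors = []
    if digits != text:
        errors.append(PortError.TRAILING_GARBAGE)
    value = int(digits) if digits else None
    if value is not None and value > 65535:
        errors.append(PortError.OUT_OF_RANGE)
        value = None
    return Parsed(value, errors)


def accept(parsed: Parsed, allowed=frozenset()):
    """Higher-level policy: tolerate only the listed error variants."""
    rejected = [e for e in parsed.errors if e not in allowed]
    if rejected:
        raise ValueError("port rejected: %s" % rejected)
    return parsed.value


def strict_port(text: str):
    return accept(parse_port(text))


def lenient_port(text: str):
    return accept(parse_port(text), {PortError.TRAILING_GARBAGE})
```

Here strict_port("8080abc") raises, while lenient_port("8080abc")
returns 8080. Both share the same low-level parser and differ only in
the set of errors they declare acceptable.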

I understand this is a large amount of work but I believe, together,
we can put in place a system of specification that will capture the
behavior of URI objects and serve us powerfully for decades to come.

Thanks for your interest,

David

> Thanks,
>
> Austin.
>
> On Thu, Sep 11, 2014 at 12:58 PM, David Sheets <sheets@alum.mit.edu> wrote:
>>
>> On Mon, Aug 18, 2014 at 3:22 PM, Damian Steer <pldms@mac.com> wrote:
>> > On 18/08/14 12:54, Austin William Wright wrote:
>> >> As the maintainer of a library that converts and parses URIs and IRIs,
>> >> as well as many Semantic Web-related libraries that use it, I was
>> >> reading through the HTML draft, and it appears that the core ingredient
>> >> of RDF and Semantic Web--the URI [1] and IRI [2]--is not, in current
>> >> draft, normatively referenced from its key hypertext technology, HTML
>> >> [3].
>> >
>> > For the lazy, what is being referenced is:
>> >
>> > <http://url.spec.whatwg.org/>
>> >
>> > Hmm.
>>
>> I have just proposed a community group to do this properly. Please
>> consider supporting it and beginning the discussion of formal
>> specification of URI:
>> <http://www.w3.org/community/groups/proposed/#urispec>.
>>
>> Thanks,
>>
>> David Sheets
>>
>> > Damian
>> >
>>
>
Received on Friday, 12 September 2014 00:05:46 UTC