Semantic web architectural requirement [was Re: Squaring the HTTP-range-14 circle] from David Booth on 2011-06-21 (public-awwsw@w3.org from June 2011)

From: David Booth <david@dbooth.org>
Date: Tue, 21 Jun 2011 17:06:06 -0400
To: AWWSW TF <public-awwsw@w3.org>
Cc: Tim Berners-Lee <timbl@w3.org>, Pat Hayes <phayes@ihmc.us>
Message-ID: <1308690366.2165.66417.camel@dbooth-laptop>
[Moving this comment to the AWWSW list, as I think it will be more
appropriate there.]
Following up on:
http://lists.w3.org/Archives/Public/public-lod/2011Jun/0362.html

On Sat, 2011-06-18 at 23:05 -0500, Pat Hayes wrote:
> Really (sorry to keep raining on the parade, but) it is not as simple
> as this. Look, it is indeed easy to not bother distinguishing male
> from female dogs. One simply talks of dogs without mentioning gender,
> and there is a lot that can be said about dogs without getting into
> that second topic. But confusing web pages, or documents more
> generally, with the things the documents are about, now that does
> matter a lot more, simply because it is virtually impossible to say
> *anything* about documents-or-things without immediately being clear
> which of them - documents or things - one is talking about. And there
> is a good reason why this particular confusion is so destructive.
> Unlike the dogs-vs-bitches case, the difference between the document
> and its topic, the thing, is that one is ABOUT the other. This is not
> simply a matter of ignoring some potentially relevant information (the
> gender of the dog) because one is temporarily not concerned with it:
> it is two different ways of using the very names that are the fabric
> of the descriptive representations themselves. It confuses language
> with language use, confuses language with meta-language. It is like
> saying giraffe has seven letters rather than "giraffe" has seven
> letters. Maybe this does not break Web architecture, but it certainly
> breaks **semantic** architecture. 

I don't think that's correct.  AFAICT what's important for the semantic
web from an architectural perspective is the following:

  The client must be able to use a simple, architecturally 
  authoritative algorithm to determine, with full fidelity, 
  the URI owner's formally expressed identity for the resource.  

To pick this apart and explain what I mean:

Why "simple"?  To facilitate widespread uptake. 

Why "architecturally authoritative"?  So that everyone knows how the
architecture is supposed to work.  This is like having an authoritative
specification for HTTP: you don't want different people having different
ideas about how HTTP is supposed to work.

Why "algorithm"?  So that it can be done by a machine.

What do I mean by "full fidelity"?  If both the publisher and the client
are following the architecture and applicable standards then the client will
interpret the publisher's statements with the *same* formal semantics
that the publisher intended.  However, this does not -- and cannot --
extend beyond what is expressed in the machine-processable portion of
the statements.  It includes only what is expressed *formally* -- in
machine processable statements such as RDF or protocol codes.  It does
*not* include the human-oriented semantics of some natural language
prose embedded in an rdfs:comment.  Note also that "full fidelity" does
*not* mean that the referent of a URI can be uniquely determined.
Rather, it means that its identity is constrained with the same
constraints -- neither more nor fewer.
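
To make "the same constraints -- neither more nor fewer" concrete, here
is a minimal sketch in Python.  It is only an illustration under strong
assumptions: formal statements are modeled as plain (subject, predicate,
object) tuples, the function names and example URIs are hypothetical,
and syntactic set equality stands in for what would really be a
comparison under the applicable formal semantics (e.g. RDF entailment).

```python
from typing import FrozenSet, Set, Tuple

# Assumption: a formal statement is modeled as a (subject, predicate, object) tuple.
Triple = Tuple[str, str, str]


def formal_constraints(statements: Set[Triple], uri: str) -> FrozenSet[Triple]:
    """Return the formally expressed statements that mention `uri`."""
    return frozenset(t for t in statements if uri in (t[0], t[2]))


def full_fidelity(published: Set[Triple], interpreted: Set[Triple], uri: str) -> bool:
    """True if the client holds exactly the publisher's constraints on `uri`.

    Simplification: compares statements syntactically rather than by entailment.
    """
    return formal_constraints(published, uri) == formal_constraints(interpreted, uri)


# Hypothetical example data.
URI = "http://example.org/fido"
published = {(URI, "rdf:type", "ex:Dog")}
same = {(URI, "rdf:type", "ex:Dog")}
over_constrained = same | {(URI, "ex:sex", "ex:Male")}

print(full_fidelity(published, same, URI))
print(full_fidelity(published, over_constrained, URI))
```

Note that the check says nothing about whether the referent is uniquely
determined; it only compares the two sets of constraints.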

Why the "URI owner"?  Because this provides a deterministic chain of
authority.  From AWWW:
http://www.w3.org/TR/webarch/#uri-ownership
[[
URI ownership is a relation between a URI and a social entity, such as a
person, organization, or specification. URI ownership gives the relevant
social entity certain rights, including:
   1. to pass on ownership of some or all owned URIs to another owner—
delegation; and
   2. to associate a resource with an owned URI—URI allocation.
]]

Why "expressed"?  Because we cannot access the intent that the publisher
has in his/her head.  We can only use what the publisher actually
expressed.

Why "*formally* expressed"?  Two reasons: (a) the point is to enable
automated machine processing, and machines are not so good at things
like natural language processing; and (b) to enable lossless
communication.


The reason this is important architecturally is that it enables global,
lossless communication by machine.  However, this does not *obligate*
the publisher to be unambiguous if the publisher chooses to be
ambiguous.  (And as we both know, it is *impossible* for the publisher
to remove all possible ambiguity anyway: ambiguity is in the eyes of the
consuming application.)  Furthermore, it does not obligate the client to
compute the publisher's expressed resource identity.  OTOH, the client
must not claim that it has if it hasn't.

Notice that this architectural requirement does *not* imply that
publishers must distinguish documents from dogs.  This is why the class
of "information resource" does not need to be disjoint from the class of
dogs or people.  But it *does* imply that publishers be *able* to
distinguish documents from dogs (or male dogs from female dogs, etc.) if
they *choose* to do so in communicating to their clients.  I.e., if the
publisher chooses to make this distinction, it is important that the
client be able to determine that the distinction was made.  This is why
the httpRange-14 rule about 303 is important.
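
As a rough illustration of how a client could read this signal
mechanically, the sketch below maps the status code of a GET response to
what the httpRange-14 resolution lets the client conclude.  The function
name and the returned strings are hypothetical; only the three cases
(2xx, 303, 4xx) come from the resolution itself, and anything else is
reported as not covered.

```python
def httprange14_conclusion(status: int) -> str:
    """Map the HTTP status of a GET response to the httpRange-14 conclusion.

    Sketch only: the return values are illustrative labels, not standard terms.
    """
    if 200 <= status < 300:
        # 2xx: the URI identifies an information resource.
        return "information resource"
    if status == 303:
        # 303 See Other: the URI could identify any resource at all.
        return "any resource"
    if 400 <= status < 500:
        # 4xx: the nature of the resource is unknown.
        return "unknown"
    # Other codes are not addressed by the resolution.
    return "not covered"


for code in (200, 303, 404, 500):
    print(code, httprange14_conclusion(code))
```

A 303 here does not assert that the resource is a dog rather than a
document; it only leaves the question open, so that whatever the
publisher formally says at the redirect target can settle it.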


-- 
David Booth, Ph.D.
http://dbooth.org/

Opinions expressed herein are those of the author and do not necessarily
reflect those of his employer.
Received on Tuesday, 21 June 2011 21:06:30 UTC