Valid representations, canonical representations, and what the SW needs from the Web... from Patrick.Stickler@nokia.com on 2003-01-31 (www-tag@w3.org from January 2003)

From: <Patrick.Stickler@nokia.com>
Date: Fri, 31 Jan 2003 10:20:05 +0200
To: <sandro@w3.org>
Cc: <www-tag@w3.org>
Message-ID: <A03E60B17132A84F9B4BB5EEDE57957B5FBAEF@trebe006.europe.nokia.com>
Apologies in advance for the lengthy post, and appreciation for
those having the patience to read it... 


> -----Original Message-----
> From: ext Sandro Hawke [mailto:sandro@w3.org]
> Sent: 30 January, 2003 21:46
> To: Stickler Patrick (NMP/Tampere)
> Cc: www-tag@w3.org
> Subject: Re: RDDL and XML Schema instances are not valid 
> representations
> of namespaces 
> 
> 
> 
> > If an HTTP GET returns a representation of a resource, and RDDL or
> > XML Schema instances are considered valid representations of an
> > XML Namespace, then I see no useful value to the concept of 
> > representation, since there apparently are no bounds as to what it
> > might be, and very well might be random.
> 
> Indeed....
> 
> Each URI string can be used to point to several different things. 

If you mean indirectly, fine, but not directly. I am very much
opposed to the view that a URI can contextually denote different
resources.

The only mechanism that even remotely resembles contexts in the
present Web architecture is XML Namespaces, which of course are
opaque to RDF and in fact to most Web applications.

And though I agree that URIs can be used to indirectly refer to
multiple resources, I consider that out of scope for this discussion.

What I am focusing on here is (a) a URI denotes one single thing and
(b) if that URI is meaningful to HTTP, there are no well defined
boundaries on how far a "representation" returned by HTTP can diverge
from the inherent characteristics of that single denoted resource.

From what I can see, a representation need not embody *any* characteristics
of the resource itself, but can be any arbitrary content. I consider
that to constitute a complete breakdown in any real interface
between the Semantic Web and the Web since what is denoted by the
former has no reliable representation by the latter.

The lack of an authoritative and well defined concept of the nature
of and constraints on valid representations, as well as canonical
(bit-equal) representations for digital resources is a significant
omission in the interface between Web and SW.

> In
> thinking about what a URI string points to, while working with RDF or
> namespaces, I find it useful to ask:
> 
>    1.  What knowledge base might it be pointing to (if any)?
>        For every successful GET, over time, will I get a
>        serialization of the information in that knowledge base at that
>        time?  If GET doesn't get me anything, or what it gets can't be
>        thought of as the contents of a knowledge base, then the URI
>        is not identifying a knowledge base.  

Great. But an XML Namespace URI does not denote a knowledge base. It
denotes a simple set of strings (names). It does not include any
semantics that might *elsewhere* be associated with such names, and
in fact, different resources may assign different semantics to the
same name (again, my usual example of xhtml:html for Strict versus
Frameset, see below).

Thus, any content that could be construed as a knowledge base (e.g. RDDL)
returned as a representation of an XML Namespace is highly suspect and
IMO is not a valid and reasonable representation of a namespace.

Now, I'm not arguing that such a representation would not be useful to
certain applications, but rather that if we are to have a consistent
Web and SW architecture, then we should refrain from associating such
things as representations of XML Namespaces, as that far exceeds IMO
what a valid representation of a resource is.

There's a good bit of nudge-nudge-wink-wink going on here. The W3C
should play by its own rules and promote exemplary solutions
reflecting sound use of the Web architecture. Not hacks that further
confuse the foundational concepts and principles of the Web and SW.

If there is to be *any* meaningful interface between the Web and SW,
the concept of "representation" and the bounds of what is a valid and
acceptable representation and the concept of a canonical representation
must be given clear and formal treatment. At present, it is woefully
obscure and thus we get folks suggesting that RDDL or XML Schema instances 
are valid representations of XML Namespaces (which they are not).

>    2.  What subject (as in topic maps) might the URI be pointing to
>        (if any)? Does it seem like the text or pictures returned via
>        GET are conveying information about some one thing, every time,
>        all the time?  Alternatively, has the URI's owning authority
>        made it clear in some other way what the URI identifies?

I don't see this question as relevant to the issue at hand. There is
a W3C Recommendation which explicitly states what an XML Namespace
URI denotes -- a particular set of names. That's it. There is no
ambiguity there. The XML Namespace http://www.w3.org/1999/xhtml
denotes a set of names. There happen to exist other different 
resources which assign semantics to those terms, and in fact some
of those resources conflict with one another, as

   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"

defines xhtml:html as

   <!ELEMENT html (head, body)>

yet

   PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
   SYSTEM "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"

defines xhtml:html (same term! same namespace!) as

   <!ELEMENT html (head, frameset)>

so clearly *neither* of the above DTD resources is a representation
of the namespace itself, since both include knowledge that is not
in any way inherent in the namespace resource *and* one would expect
representations of a resource not to be in conflict with one another
semantically (but of course, there is no reasonable definition
of what a valid representation is, so...)
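
To make the distinction concrete, here is a minimal sketch in Python. It
assumes a namespace can be modelled as a bare set of names, and it encodes
only the two content models quoted above. The variable and function names
are hypothetical, and the name set is an illustrative subset, not the full
XHTML namespace.

```python
# A namespace modelled as nothing more than a set of names (illustrative subset).
XHTML_NS = "http://www.w3.org/1999/xhtml"
xhtml_names = frozenset({"html", "head", "body", "frameset"})

# Content models for <html>, copied from the two DTD declarations quoted above.
strict_dtd = {"html": ("head", "body")}
frameset_dtd = {"html": ("head", "frameset")}


def conflicting_names(a, b):
    """Return the names both schemas define, but with different content models."""
    return {name for name in a.keys() & b.keys() if a[name] != b[name]}


# Both DTDs draw the name "html" from the same namespace...
assert "html" in xhtml_names
# ...yet they disagree about what it contains.
print(conflicting_names(strict_dtd, frameset_dtd))  # {'html'}
```

The set of names carries no content model at all; the disagreement lives
entirely in the two DTDs, which are separate resources.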

The groundwork of REST and the pairing of the concepts of resource
and representation are great, and serve the needs of the Web, but
they *must* be taken much further if they are to address the needs
of the Semantic Web and facilitate a seamless integration of the
Web and Semantic Web. 

What is or is not a valid representation cannot be left up to 
arbitrary human intuition on a case by case basis. It must be
expressed in a sufficiently explicit manner to serve automated
agents, and machinery must be added so that automated agents are
provided some clue as to the nature of the representations being
obtained. This is particularly true for canonical (bit-equal)
representations of specific digital resources.

If I have a URI denoting a particular revision of a particular
digital resource, there should be some way inherent in the Web
architecture to (a) reliably GET a bit-for-bit exact copy of that
particular revision of that particular resource and (b) know that 
that is what I got, or be told otherwise.
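
A minimal sketch of what (a) and (b) could look like to an agent, in
Python. It assumes the publisher makes a SHA-256 digest of the revision
available by some trusted means; that is purely an assumption for
illustration, since the Web architecture defines no such mechanism. The
function name and sample bytes are hypothetical.

```python
import hashlib


def is_exact_copy(body: bytes, expected_sha256_hex: str) -> bool:
    """Return True only if body matches the published digest bit for bit."""
    return hashlib.sha256(body).hexdigest() == expected_sha256_hex.lower()


# Hypothetical revision content and the digest its publisher would announce.
original = b"revision 1 of some digital resource\n"
published_digest = hashlib.sha256(original).hexdigest()

# A faithful representation passes; one altered in transit does not.
print(is_exact_copy(original, published_digest))         # True
print(is_exact_copy(original + b" ", published_digest))  # False
```

With such a check the agent either knows it holds the canonical
representation or is told otherwise at once, rather than many steps later.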

To date, it's just been hit and miss, and lots of good luck, as
the Web architecture has no concept of such a canonical representation
even though most folks presume it, and expect it, and will complain
loudly when they don't get "files" from the server exactly as they
conceive of them.

Now, if a human gets a representation that isn't what was expected,
it's fair to presume they can figure it out more quickly than a
dumb automated SW agent where the "error" may not be detected until
much further along a given process or operation, and likely after
the content has passed between numerous agents.

> This isn't too different from how I think about URIs in general.  In
> writing HTML or talking to people, I mostly use URIs to point to
> reliable, authoritative sources of information.  Often that
> information is about some particular subject (like a book, e-mail
> message, or world event), but I still have to pick a good URI for that
> subject.  I do so based on the qualities of the information source.
> But humans jump quickly to the subject, so when I say "look at
> http://yyy" where that URI points them at a news story about a virus,
> we'll often talk directly about the virus (with no need to focus on
> the news story itself).  

And your point is?  If the URI denotes an information source, it 
denotes an information source. If it denotes an abstract resource,
it denotes an abstract resource. One may indirectly refer to all
kinds of things by a given URI, but the URI ONLY DENOTES ONE THING
and representations provided by HTTP GET should be valid and
reasonable representations OF THAT ONE THING and not of any arbitrary
resource that might be indirectly referred to in terms of that URI!

That's the point.

It is not valid to presume that a RDDL instance is a valid representation
of an XML Namespace just because all of the resources described can be
*indirectly* referred to by the namespace URI. Those other resources and
the information about them provided by a RDDL instance are not
inherent to the XML Namespace resource itself and as such have no business
in any valid representation of that resource.

> Still, if I say "check out http://yyy", and it's a web page about a
> book, you might wonder if it's the web page that's interesting, the
> book that's interesting, or even the subject of the book that's
> interesting.  I try to straighten this out in RDF by making the
> URI-to-whatever mapping explicit and very well documented.  

Again, I have no problem with indirect reference to arbitrary
resources by any URI, but we're talking about (a) what a URI denotes
and (b) what is a valid representation of the specific resource
denoted by that URI.

If you want to be able to more clearly say things like

  <#Sandro>  x:recommends [ x:bookDescribedBy <http://yyy> ] .
  <http://yyy> rdf:type <#WebPage> .

to recommend a book described by a web page, or

  <#Sandro>  x:recommends [ x:webPageDescribing <http://xxx> ] .
  <http://xxx> rdf:type <#AbstractConcept> .

to recommend a web page describing some abstract concept, or whatever, 
great. But in either case, the URI itself denotes just one thing,
and if you dereference that URI with HTTP GET you should get a representation
of that one thing, not of something else that happens to be indirectly
referenceable by the URI.

Taking the above, if you dereference http://yyy you should get a 
representation of a web page. If you dereference http://xxx you should
get a representation of an abstract concept (which might very well be
a web page).

> The problem in RDF is when people use URIs directly as node labels [as
> almost everyone does] because then it can be very hard to tell which
> mapping (which kind of pointing) they had in mind.  TimBL is the main
> force here arguing for what mapping everyone should have in mind, but
> with the WGs sitting out on this issue, consensus seems unlikely, and
> URIs in RDF will continue to be only marginally better signifiers than
> English words.

Well, I see this as being the whole point of RDF. To be able to say
what URIs mean in more explicit terms rather than having to guess
in terms of whatever arbitrary representations one gets from HTTP.

Still, once some software agent (or human) has sufficient knowledge
about a URI to know what it denotes and the nature of the resource denoted,
it should be able to expect that representations of that resource, if
obtainable, would be reasonably (a) accurate, (b) complete, (c) concise, 
and (d) precise. And if the resource is a digital resource and the 
representation is canonical, then in addition to the above, (e) exact
(a bit for bit copy).

Once we start talking about representations for SW agents rather than
representations for Web (human) users, the needed precision and
consistency of the definition of representations goes up -- and that is
what I think most folks are missing here. 

Good enough for the Web is not necessarily (and in this case, isn't) good 
enough for the SW.

> Back to XML: XML Namespace Names are URI strings for which sense #2
> always holds; to me they always identify an XML Namespace[1].  

I agree. Though it appears we don't agree what an XML Namespace actually
is... (see below)

> They
> may also work in sense #1, where the identified knowledge base is the
> collection of information, which you talked about, about schemas,
> tools, etc for working with XML documents using the namespace.

I strongly disagree. *If* we are to base the Web and SW architecture
on the concept of resource and representation, then XML Namespace
URIs do not denote knowledge bases, and knowledge bases of the kind
embodied in RDDL instances are not valid representations of those
XML Namespace resources.

XML Namespaces are simple sets of names. That's all. Anything more
and we're talking about some *other* resource(s).

The W3C needs to play by its own rules...

>     -- sandro
> 
>     
> [1] But what is an XML Namespace?  It's often described as a
>     collection of strings, but I find that insufficient. 

Too bad. That's what it is. If you find that insufficient, then
work to have the specification revised. But until and unless it
is, neither you nor anyone else has the right to redefine it
according to your own preferences -- if you intend to play fair
with everyone else in the playground (maybe you don't ;-)

>     Two
>     namespaces which conceptually have exactly the same strings in
>     them 

Are exactly the same namespace. Period.

>     still may have different semantics and so are different
>     namespaces.  

No. They may not. XML Namespaces define *no* semantics for their
members.

You are making the error of equating 'namespace' with 'model'. As
I've pointed out several times before

  namespace != vocabulary != model != schema

These are all distinct concepts, instantiated by distinct resources,
and if you wish to talk about all of them, you must assign each of
them a distinct URI.
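
As a rough illustration in Python, an agent could flag any URI that is
being made to denote more than one kind of resource. The Resource class,
the function name, and the example.org URIs are all hypothetical, chosen
only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    kind: str  # "namespace", "vocabulary", "model" or "schema"
    uri: str   # the URI claimed to denote this resource


def overloaded_uris(resources):
    """Return the URIs used to denote more than one kind of resource."""
    kinds_by_uri = {}
    for r in resources:
        kinds_by_uri.setdefault(r.uri, set()).add(r.kind)
    return {uri for uri, kinds in kinds_by_uri.items() if len(kinds) > 1}


KINDS = ("namespace", "vocabulary", "model", "schema")

# Good practice: four resources, four URIs.
distinct = [Resource(k, f"http://example.org/{k}/foo") for k in KINDS]
# Bad practice: the namespace URI overused to denote all four.
overused = [Resource(k, "http://example.org/ns/foo") for k in KINDS]

print(overloaded_uris(distinct))  # set()
print(overloaded_uris(overused))  # {'http://example.org/ns/foo'}
```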

Most models are expressed modularly in multiple schemas, and in 
variant schema languages, and employ multiple functional vocabularies,
which have terms grounded in multiple namespaces. Yet nowhere in
the W3C recs or notes is the inequality between these types of
resources stated explicitly and hence the confusion persists.

Even if there were, by coincidence, a 1:1 relationship between a particular
namespace, vocabulary, model, and schema such that all terms in a vocabulary 
were all grounded in one namespace and no other vocabulary used terms from
that namespace, and a model only used terms from the single vocabulary and no 
other model used terms from that vocabulary, and the model had one single schema 
defining it, etc. there would *still* be four distinct resources there, all 
needing their own URI to talk about them reliably and accurately -- but too 
many folks (lazily) use one URI to ambiguously denote all four resources, and 
that is where we get all this confusion.

They take the URI that denotes the namespace, and then (over)use it to
also denote (not just refer to) the vocabulary, the model, and the 
schema. Bad, bad, bad.

Can humans twist and coerce the Web to do useful things given such
bad practice? Sure. Can SW agents work with such bad practice? I don't 
think so. Can we get the Web folks to appreciate and support the
additional needs of the SW? It's not looking very promising...

Hence this tension between the definition of the Web architecture and
the greater and more demanding needs of the SW.

Regards,

Patrick

--
Patrick Stickler, Nokia/Finland, (+358 40) 801 9690, patrick.stickler@nokia.com
Received on Friday, 31 January 2003 03:20:11 UTC