Syntax vs Semantics vs XML Schema vs RDF Schema vs QNames vs URIs (was RE: Using urn:publicid: for namespaces) from Patrick.Stickler@nokia.com on 2001-08-14 (www-rdf-logic@w3.org from August 2001)

From: <Patrick.Stickler@nokia.com>
Date: Tue, 14 Aug 2001 10:54:35 +0300
To: sean@mysterylights.com, scranefield@infoscience.otago.ac.nz, www-rdf-interest@w3.org, www-rdf-logic@w3.org
Message-ID: <2BF0AD29BC31FE46B78877321144043114BF83@trebe003.NOE.Nokia.com>
(Apologies in advance if any of the following seems to be worded
 or expressed too strongly... insert smileys liberally ;-)


> -----Original Message-----
> From: ext Sean B. Palmer [mailto:sean@mysterylights.com]
> Sent: 14 August, 2001 02:00
> To: Stickler Patrick (NRC/Tampere); 
> scranefield@infoscience.otago.ac.nz;
> www-rdf-interest@w3.org
> Subject: Re: Using urn:publicid: for namespaces
> 
> 
> > > I did send a letter about this to www-rdf-interest a
> > > short while ago; perhaps you missed it :-)
> >
> > I must have. Can you send me a copy?
> 
> I can do better than that: I can give you a URL! [1] from RDF
> Interest, last month.

Thanks. Read through it. Comments to your proposal integrated below.

> [...]
> > > [ :ns <http://www.w3.org/1999/xhtml>; :expEType "title" ] .
> [...]
> > Firstly, I don't see how the above is a valid URI.
> 
> I'm assuming that you know about anonymous nodes in RDF, but aren't
> familiar with the Notation3 serialization. A "[]" is just an anonymous
> node, q.v. [2]. You can give it a URI if you want. In fact, this would
> have been useful for the XML Schema people to have defined the URIs
> that they use to represent their QNames.

This seems rather an "obese" solution. For every resource identified
by a QName in a serialization, create an anonymous node with "some"
URI and two child nodes, one for the namespace and one for the name.
And that's supposed to be better than a single, transparent URI that
all RDF parsers would derive from the QName. 

Sorry, I don't buy it. Nope.

> [...]
> > The problem I am focusing on in my proposal is getting from
> > RDF/XML instances to triples such that no matter what RDF parser
> > you are using, so long as it conforms to the standard, you will
> > get exactly the same set of triples with the exact same URIs
> > for resources, [...]
> 
> Why? Why not just define them as anonymous nodes? You can say that a
> combination of "ns" and "ExpEType/ExpAName" make for an unambiguous
> subject using the following rule:-
> 
>    { { :x :ns :y; :expEType :z . :a :ns :y; :expEType :z }
>    log:implies
>    { :x = :a } } a log:Truth; log:forAll :x , :y , :z , :a .

Well, I'm probably going to get grilled for this comment, but personally
I don't like anonymous nodes. After all, just what *is* an anonymous
node? Every application that I've seen that uses them has had to give
them some form of identity, and yet that identity is system dependent.

IMO, anonymous nodes were a hack to allow collection structures as Objects,
and yet collections (or rather ordered collections) in RDF do not work in
a context of multi-source syndication (and neither do DAML collections).
The proper way IMO to model collections is using an ontology of collection
relations and plain old triples with no anonymous nodes; but that's a
separate discussion that I don't want to start here.

Issues of completeness required by the closed world folks can be addressed
by assigning source or authority to statements so that one can selectively
filter those collection members defined in a particular source or by 
a particular authority and "outsiders" cannot add to that "view" of the
collection. IMO, the RDF conceptual model should have no anonymous nodes.
Collections based on serialized, syntactic structures should have no
realization in the underlying conceptual model; but again, that's yet 
another discussion ;-)
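
As a rough illustration of the source/authority idea, here is a minimal
Python sketch. It assumes each statement carries an extra "source" field;
the ex: names and the members() helper are made up for the example and are
not part of any RDF specification.

```python
from typing import NamedTuple, Optional, Set


class Statement(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # the source or authority that asserted this statement


# Plain triples plus a source; no anonymous nodes involved.
STATEMENTS = [
    Statement("ex:playlist", "ex:hasMember", "ex:track1", "ex:alice"),
    Statement("ex:playlist", "ex:hasMember", "ex:track2", "ex:alice"),
    Statement("ex:playlist", "ex:hasMember", "ex:track9", "ex:mallory"),
]


def members(collection: str, source: Optional[str] = None) -> Set[str]:
    """Return the members of a collection, optionally as seen by one source."""
    return {
        s.obj
        for s in STATEMENTS
        if s.subject == collection
        and s.predicate == "ex:hasMember"
        and (source is None or s.source == source)
    }


open_world_view = members("ex:playlist")           # every assertion counts
alice_view = members("ex:playlist", "ex:alice")    # closed "view" per source
```

Filtering on the source gives a complete, closed view of the collection as
one authority defined it, while the unfiltered query still sees every
assertion from every source.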

I will concede that there *might* be valid and necessary uses for anonymous 
nodes which I am not yet aware of, but regardless I get the impression 
(and I may very well be wrong, apologies in advance) that anonymous 
nodes are the new, "hot", interesting thing in RDF/DAML and so folks are 
predisposed to using them to solve every problem even when more
constrained, simpler, and better alternatives may be available.

For those who are convinced that anonymous nodes are a good thing, please
think about the implementational burden and portability/interoperability
issues they may introduce. There are lots of standards and models out
there that have really interesting and even elegant concepts, but are 
just too darn hard to implement efficiently, so no tools exist and the
standard dies (HyTime comes quickly to mind ;-). I hope that that doesn't 
happen to RDF because overly complex algorithms and data structures are 
needed to make sense of graphs with a plethora of anonymous nodes requiring 
constant recursive resolution by every SW agent to get to any "real" data
that is useful for a given application.

As someone who has to make stuff work in the "real world", I'd 
*much* rather get a single URI for a resource than some anonymous
node with a namespace and name dangling off it. Even if you give that
anonymous node a URI (in which case it is no longer anonymous ;-) my
axioms cannot and will not reference that URI because they are defined
in terms of resources, not complex QName data structures. And if
different systems name their "anonymous" QName root nodes differently,
my axioms are not portable. 
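
The portability problem can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the two label styles
("_:b1", "_:genid42") are invented stand-ins for whatever system-dependent
identity two different implementations might assign.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

XHTML_NS = "http://www.w3.org/1999/xhtml"


def qname_as_anonymous_node(ns: str, name: str, label: str) -> Set[Triple]:
    """Model a QName as a node with one arc for the namespace and one for the name."""
    return {(label, ":ns", ns), (label, ":expEType", name)}


# The same QName, processed by two systems that label the node differently.
system_1 = qname_as_anonymous_node(XHTML_NS, "title", "_:b1")
system_2 = qname_as_anonymous_node(XHTML_NS, "title", "_:genid42")

same_triples = system_1 == system_2  # False: the node labels differ

# An axiom written against system 1's label finds nothing in system 2.
axiom_subject = "_:b1"
hits_in_1 = [t for t in system_1 if t[0] == axiom_subject]
hits_in_2 = [t for t in system_2 if t[0] == axiom_subject]
```

The two graphs have the same shape, but establishing that requires
matching sub-graphs rather than comparing identifiers, which is exactly
the extra work a single URI would avoid.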

Sorry, I see an anonymous node-based treatment of QNames creating far, far 
more problems than it solves (see further below). Please let's get back to 
the core of the problem, which is the *mapping* (not representation) of
QNames to single resource URIs in a consistent, standardized manner. 

> People simply aren't going to adopt "standard mappings". They want
> flexible models.

Flexible models are good, but standard mappings are critical, no? If
we can't ensure that every SW agent is going to arrive at the same set
of triples from the same serialized instance, then we might as well
pack it up and quit. Integrity and consistency in global, distributed
knowledge representation for an environment such as SW is absolutely
essential. Without it, it cannot work.

There always has to be a balance between what is mandated by standards,
for the sake of interoperability and consistency, and what is left open
to accommodate new ideas, evolution of methodologies, competition, etc.

Mappings between serializations and triples cannot be flexible. No way.
(even if we never expect SW agents to talk in terms of triples but always 
reserialize to RDF/XML, it still raises problems with standardized axioms
and internal logic of reused software components).
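
To see why a flexible mapping hurts, consider this minimal Python sketch.
The two "parsers" are hypothetical and differ only in how they turn a
(namespace, name) pair into a URI; the namespace and data are made up.

```python
from typing import Callable, List, Set, Tuple

QName = Tuple[str, str]          # (namespace URI, local name)
Triple = Tuple[str, str, str]    # (subject URI, predicate URI, literal)


def concat(ns: str, name: str) -> str:
    """Plain concatenation of namespace and name."""
    return ns + name


def with_hash(ns: str, name: str) -> str:
    """Concatenation with a '#' inserted between namespace and name."""
    return ns + "#" + name


def parse(statements: List[Tuple[QName, QName, str]],
          to_uri: Callable[[str, str], str]) -> Set[Triple]:
    """Turn QName-based statements into triples using the given mapping."""
    return {(to_uri(*s), to_uri(*p), o) for s, p, o in statements}


NS = "http://example.org/vocab/"  # made-up namespace
serialized = [((NS, "report42"), (NS, "title"), "Annual Report")]

triples_a = parse(serialized, concat)
triples_b = parse(serialized, with_hash)

# An axiom that refers to the predicate URI produced by the first mapping.
axiom_predicate = "http://example.org/vocab/title"
matches_a = [t for t in triples_a if t[1] == axiom_predicate]
matches_b = [t for t in triples_b if t[1] == axiom_predicate]
```

Same serialized instance, two different sets of triples, and an axiom
that fires on one system but silently matches nothing on the other.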

> > If one RDF parser gives you ns:name -> nsname and another
> > gives you ns:name -> ns#name and yet another gives you
> > ns:name -> 'ns'name (a'la SWI Prolog + RDF) and yet another
> > gives you ns:name -> urn:qname:ns/name, etc. etc. [...]
> 
> Oh please! If the material being processed is indeed RDF, then the RDF
> parser should only be expected to use the first form of resolution
> from QName pair to URI. 

According to the current RDF spec, yes.

BUT that form of resolution/concatenation has been shown to be unreliable
and capable of producing ambiguous URIs! It's broken and *must* be replaced 
by something else. The current "popular" proposal, a'la XML Schema, inserting
a '#' character is unacceptable because it can produce invalid URIs, and
furthermore any combinatoric scheme based on simple concatenation cannot
achieve all of the possible mappings from QNames to URI schemes, e.g. URN
schemes which may employ nested bracketing. 

The current concatenation scheme used by RDF is based (IMO solely) on the
use of HTTP URLs and HTML/XML fragment syntax -- and is grossly inadequate
for addressing the possible cases of QName to URI mapping that are allowed
and legal on the Web. It got RDF started, but cannot carry RDF through to
a mature and functional SW.
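
Both failure modes are easy to reproduce. The following Python sketch
uses made-up example.org namespaces; the function names are mine and
only stand in for the two concatenation rules under discussion.

```python
def concat_uri(namespace: str, local_name: str) -> str:
    """Map a QName pair to a URI by simple concatenation."""
    return namespace + local_name


def hash_uri(namespace: str, local_name: str) -> str:
    """Map a QName pair to a URI by inserting '#' between the parts."""
    return namespace + "#" + local_name


# Ambiguity: two distinct (namespace, name) pairs ...
pair_1 = ("http://example.org/ab", "c")
pair_2 = ("http://example.org/a", "bc")

# ... collapse into one and the same URI.
uri_1 = concat_uri(*pair_1)
uri_2 = concat_uri(*pair_2)
ambiguous = pair_1 != pair_2 and uri_1 == uri_2

# Invalid result: a namespace that already ends in '#' gets a second one,
# and a URI reference may contain only one fragment separator.
double_fragment = hash_uri("http://example.org/schema#", "title")
```

Once the two pairs have been concatenated there is no way to recover
which namespace and which name the resulting URI came from.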

> That's not the issue. The issue is how to
> represent XML QNames in RDF, not how to process the XML Qnames that
> are used to form the RDF. 

Hello? What? QNames *in* RDF?! I don't think so! QNames are a creature
of the SYNTAX ONLY! They have no, and should have no, realization in 
the set of triples derived from a serialized instance!

*PLEASE* don't tell me that folks are working on how to model QNames in
RDF! What's next? Processing instructions? Character entities? Start
and end tags?

Resources are identified by *URI*s, not sub-graphs with an anonymous root!

> But yes, I agree with you very much that
> this needs to be done somehow. It's useful to say that a certain
> element in one language is the same as one in another language. But
> you can do that using the anonymous node proposal above: no extra
> syntax rubbish required.

My proposal adds a *single* declaratory element to the mix, and is
100% backwards compatible with the existing spec. and *all* existing
RDF systems.

I see it as being a far more constrained and efficient solution than 
adding anonymous nodes and modelling QNames in RDF -- both of which increase
the complexity load on SW software and needlessly complicate the data
model; whereas dealing with the QName to URI issue at the front end
as I propose, before getting to triples, adds no additional burden on 
the software whatsoever and allows *any* URI scheme to be used for
*any* resource while making their syntactic representation explicit,
consistent, and standardized.

> > IMO it is the underlying conceptual model of triples that is the
> > real value of RDF, and the serialization issues are entirely
> > secondary. [...]
> 
> Once again, very much agreed.

Great, but this also means that QNames, being a creature of serialization,
do not belong in the realm of triples and should disappear as distinct
data structures during parsing of the RDF/XML instance to triples. No?

Just because you *can* model QName structures in RDF, for various reasons,
does not mean such a representation should be core to all knowledge
defined in RDF. If you want an ontology and methodology to talk about
components of XML serialization, fine, but that's very different from
carrying over those components into the underlying RDF data model. 

> > Furthermore, since humans need a means of easy data entry,
> > and would prefer to enter 'en' for language rather than something
> > like "http://some.authority.com/languages/English",
> 
> You have to declare a datatype in that case, but sure, why not?
> 
>    this xml:lang "en" .
> 
> I think there's an enumeration in XML Schema for those values
> somewhere... it'd be cool if the W3C could post them in RDF using
> DAML.

Adopting XML Schema data types in RDF doesn't provide any actual
validation, nor does it provide any of the data type hierarchy
functionality provided by an XML Schema parser. There is no true
integration of XML Schema with DAML or RDF. The XML Schema data
types have simply been used as a standard vocabulary which DAML
schemas can point to, but where one still has to code considerably
to achieve any benefit. 

Now, if (1) there were XML Schemas for RDF, RDF Schema, DAML, etc.
and (2) one defined XML Schemas for each ontology in addition to
defining them in RDF Schema, and (3) there were production quality
XML Schema capable parsers with full support of the XML Infoset,
etc. etc. then one could use such an XML Schema parser to validate
the serialized instance prior to importation via the RDF Parser, but
that's still not the same thing as achieving actual validation of
data types within an RDF engine simply by relating a property
to some XML Schema data type class.

Trust me, as someone who is involved in designing systems needing 
to manage millions of pages of complex technical documentation, and
wanting to do so in a way that exploits metadata to the fullest potential,
I have looked longingly towards XML Schema and the presumed "adoption"
of XML Schema data types by DAML as a way to provide robust metadata
validation in a flexible, modular, and extensible manner using minimal
custom software code -- and unfortunately, in practice it's an illusion.
At present, it's like the good old days of early SGML -- you have to
roll your own, no matter what.

> > and since we really want our SW Agents to deal with resources
> > rather than literals as much as possible, we need to map the
> > literal 'en' to the more informative and useful resource URI [...]
> 
> Huh? Just use datatypes; no need to complicate things. Go through the
> DAML walkthrough (linked to from [3]).

Just because something sounds good on paper doesn't mean it holds up
in application.

I've been through the DAML walkthrough, and am not convinced that XML Schema
data types are worth the effort, insofar as they would have significance
within the RDF space (as opposed to the XML serialization space). I.e.

1. XML Schema is optimized/designed for serializations, not knowledge bases.

2. Most parity and collection related constraints cannot be defined with 
   XML Schema in a way that works with syndication of knowledge from
   multiple sources (i.e. multiple serializations).

3. Saying that a given literal value is legal in a serialization says
   nothing about how that literal value might represent an actual resource
   in the knowledge base nor anything about the relationship of that
   resource to other resources or constraints placed on occurrences of
   that resource within the knowledge base.

4. Literals in serializations tend to be of three types: (a) shorthand
   aliases for resources (e.g. 'en') which typically belong to bounded
   enumerations, (b) values which are members of infinite, unbounded
   enumerations (i.e. data types such as integers, characters, floats,
   dates, etc.), or (c) strings which are to be treated as opaque, insofar
   as the RDF engine is concerned (and which technically also are members
   of an infinite set, bound only by system limitations). 

   Values of type (a) need to be mapped to resource URIs, and constraints
   for them should be defined using RDF (i.e. RDF Schema, DAML, etc.) and
   thus XML Schema provides no validation benefit. Leaving such values as
   literals in triples loses a considerable amount of knowledge or the
   ability to define constraints in terms of RDF. Values of type (b) can 
   be validated using regular expressions, and for these XML Schema is IMO
   overkill. Values of type (c) require no validation (with regards to
   content) but must be accepted as-is. 
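
The three cases can be sketched in Python as follows. This is only an
illustration: the alias table, the Finnish URI, and the function names
are assumptions made for the example, and the integer pattern is a
deliberately simple one.

```python
import re

# Type (a): a bounded enumeration of aliases, each standing for a resource.
LANGUAGE_URIS = {
    "en": "http://some.authority.com/languages/English",
    "fi": "http://some.authority.com/languages/Finnish",  # hypothetical entry
}

# Type (b): an unbounded data type, checked with a regular expression.
INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")


def map_alias(alias: str, enumeration: dict) -> str:
    """Replace a shorthand alias with the URI of the resource it stands for."""
    if alias not in enumeration:
        raise ValueError(f"{alias!r} is not a member of the enumeration")
    return enumeration[alias]


def is_integer_literal(value: str) -> bool:
    """Validate a literal that must remain a literal in the triples."""
    return INTEGER_PATTERN.fullmatch(value) is not None


def keep_opaque(value: str) -> str:
    """Type (c): accept the string as-is, with no content validation."""
    return value


language = map_alias("en", LANGUAGE_URIS)
ok = is_integer_literal("-42")
bad = is_integer_literal("4x2")
```

After this step a type (a) value reaches the triples as a resource URI,
so any further constraints on it can be stated in RDF Schema or DAML
rather than in a serialization-level schema.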

Thus, pointing to XML Schema data types in RDF Schemas provides no actual 
validation, perpetuates the use of literal aliases for actual resource URIs,
and -- even if XML Schema were integrated into an RDF parser -- is overkill
for the validation needed for "true" literal values, per type (b) above.

Don't get me wrong. I like XML Schema for serializations. It's great and
I am impatient for the tools to mature so I can toss DTDs once and for all
out the window -- but XML Schema and RDF Schema are on two separate
functional and conceptual planes, and trying to merge them is IMO far
more trouble than it's worth.

-------

TO REITERATE: Adding a single mapping element to RDF as proposed
*completely* solves the whole QName vs. URI problem once and for all,
*without* breaking
a single existing RDF application, and works for *any* URI scheme, present
or future. *And* it addresses the literal to URI mapping problem as well and
gives you reasonable data type validation for RDF literals within the RDF
environment without the extra baggage of XML Schema.

What more could you ask for?


Regards,

Patrick

--
Patrick Stickler                      Phone:  +358 3 356 0209
Senior Research Scientist             Mobile: +358 50 483 9453
Software Technology Laboratory        Fax:    +358 7180 35409
Nokia Research Center                 Video:  +358 3 356 0209 / 4227
Visiokatu 1, 33720 Tampere, Finland   Email:  patrick.stickler@nokia.com
Received on Tuesday, 14 August 2001 03:55:08 UTC