XML Enriched N-Triples (XENT) from Sean B. Palmer on 2003-06-14 (www-rdf-interest@w3.org from June 2003)

From: Sean B. Palmer <sean@mysterylights.com>
Date: Sat, 14 Jun 2003 19:08:04 +0100
To: <www-rdf-interest@w3.org>
Message-ID: <00ae01c3329f$e7de4bc0$5954ff3e@localhost>
[+BCC to Tim Bray]

Tim Bray has again brought up the age-old debate about the inadequacy
of RDF/XML, this time by linking [1] to yet another person who so
openly slams RDF/XML [2] without, as far as I know, following the old
"don't criticize if you can't do any better" maxim.

Bray has, of course, tried his hand at an alternate RDF XML
serialization, RPV [3]. RPV has some pretty major shortcomings, in my
opinion. For example, one can't use QNames as an abbreviation method;
the best one can do is to provide a "base" for subjs/preds/objts. It
also doesn't seem to contain any facility for using bNodes--correct me
if I'm wrong.

I've previously tried to come up with alternate serializations myself,
notably BSWL [4], and N3-in-XML [5], but this time I wanted to try a
different approach. I believe that N-Triples is a good starting point
for any serialization due to its extraordinary level of parseability.
It is not, however, easy to author (no QNames, one triple per line),
and nor is it based on XML, which indicates to me that it is unlikely
ever to progress from being a simple RDFCore WG test format to
something used on a wider scale.

So this is a proposal to enrich N-Triples using XML.

At the basic level, XENT (an obvious but chance acronym) is very much
like N-Triples, with very minor XMLification. A <Graph> element is
used to wrap an entire document, upon which namespaces can be
declared. URIs use an 'URI syntax (prefixed with an apostrophe) now
instead of <URI>, since <URI> would obviously be illegal in XML. Each
triple is wrapped in a <t> element, and there is no longer any need
for the trailing period that was previously used for backwards
compatibility with N3. Line breaks can be added at will, since <t>
elements rather than newlines now delimit triples.

<Graph xmlns="@@">
<t>'http://example.org/
   'http://example.org/#author 'http://example.org/#bob
</t>
</Graph>

QNames are allowed in place of URIs. You just write these in the
actual text themselves--example coming up. (Aside: I expect that the
major criticism of this format will be its lack of recourse to innate
XML machinery for expressing the various parts of the triples; more on
why I believe that this is actually a *benefit* later on). bNodes are
represented using a $label syntax--this keeps parsing costs down, and
eliminates the _: prefix hack. Literals are now wrapped in an <s>
element.

Example:-

<Graph xmlns="@@" xmlns:ex="http://example.org/stuff/1.0/" >
<t>'http://www.w3.org/TR/rdf-syntax-grammar ex:editor $Dave</t>
<t>$Dave ex:fullName <s>Dave Beckett</s></t>
<t>$Dave ex:homePage 'http://purl.org/net/dajobe/</t>
</Graph>
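
As a rough illustration of how little machinery the flat syntax needs,
here is a minimal Python sketch that reads a <Graph> with the standard
xml.etree.ElementTree module. It assumes that the "@@" placeholder is
the literal namespace name, and the names read_graph and raw_terms are
made up for this example. It only collects the prefix declarations and
the raw terms of each <t>; it does not resolve QNames.

```python
import io
import xml.etree.ElementTree as ET

XENT_NS = "@@"  # assumption: the placeholder namespace used in the examples


def raw_terms(t):
    """Yield the terms of one <t>: plain tokens, or ("literal", text) for <s>."""
    # Text before the first child element holds whitespace-separated tokens.
    yield from (t.text or "").split()
    for child in t:
        if child.tag == "{%s}s" % XENT_NS:
            yield ("literal", child.text or "")
        # Text after a child element (its tail) may hold further tokens.
        yield from (child.tail or "").split()


def read_graph(xml_text):
    """Return (prefix_map, triples) for a flat XENT document."""
    prefixes = {}
    triples = []
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "end")):
        if event == "start-ns":
            prefix, uri = payload  # the default namespace has prefix ""
            prefixes[prefix] = uri
        elif payload.tag == "{%s}t" % XENT_NS:
            triples.append(list(raw_terms(payload)))
    return prefixes, triples
```

For the example above this gives three triples, with the <s> content
arriving as a ("literal", "Dave Beckett") pair and every other term as
a plain token such as "$Dave" or "ex:fullName".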

The last bits of syntax to introduce are the <properties> and <objects>
elements. Consider this N-Triples graph:-

_:Sean <...#name> "Sean B. Palmer" .
_:Sean <...#homepage> <http://purl.org/net/sbp/> .
_:Sean <...#nick> "sbp" .

The subject is repeated on every line. Using a <properties> element,
one can state it once and list the predicate/object pairs beneath it.

<Graph xmlns="@@"
   xmlns:foaf="http://xmlns.com/foaf/0.1/" >
<t>$Sean
   <properties>
      foaf:name <s>Sean B. Palmer</s>
      foaf:homepage 'http://purl.org/net/sbp/
      foaf:nick <s>sbp</s>
   </properties>
</t>
</Graph>

I think that this is highly readable, writable, and parseable. In
actual fact, even the non-abbreviated syntax isn't so bad for that
particular example (note that I've added an example of the <objects>
element to this one, too):-

<Graph xmlns="@@"
   xmlns:foaf="http://xmlns.com/foaf/0.1/" >
<t>$Sean foaf:name <s>Sean B. Palmer</s></t>
<t>$Sean foaf:homepage 'http://purl.org/net/sbp/</t>
<t>$Sean foaf:nick <objects><s>sbp</s> <s>SeanP</s></objects></t>
</Graph>

For lots of pred/objt repetition, though, <properties> and <objects>
will be useful. Here's another quick example:-

<Graph xmlns="@@"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       xmlns:ex="http://example.org/stuff/1.0/" >
<t>'http://www.w3.org/TR/rdf-syntax-grammar
   <properties>
      dc:title <s>RDF/XML Syntax Specification (Revised)</s>
      ex:editor $Dave
   </properties>
</t>
<t>$Dave ex:fullName <s>Dave Beckett</s></t>
<t>$Dave ex:homePage 'http://purl.org/net/dajobe/</t>
</Graph>
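
The abbreviations can be unfolded back into plain triples mechanically.
The following Python sketch, using xml.etree.ElementTree and treating
"@@" as the literal namespace name, is one possible reading of the
rules: <properties> holds predicate/object pairs for the subject of the
enclosing <t>, and <objects> holds several objects for one predicate.
Allowing <objects> inside <properties> is an assumption, and the
function names terms and expand are invented for this example.

```python
import xml.etree.ElementTree as ET

NS = "{@@}"  # assumption: the placeholder namespace used in the examples


def terms(elem):
    """Yield raw terms: tokens, ("literal", text), or a nested group."""
    yield from (elem.text or "").split()
    for child in elem:
        if child.tag == NS + "s":
            yield ("literal", child.text or "")
        elif child.tag == NS + "objects":
            yield ("objects", list(terms(child)))
        elif child.tag == NS + "properties":
            yield ("properties", list(terms(child)))
        yield from (child.tail or "").split()


def is_group(term, kind):
    """True if term is a nested ("objects", ...) or ("properties", ...) group."""
    return isinstance(term, tuple) and term[0] == kind


def expand(t):
    """Unfold one <t> element into a list of (subject, predicate, object)."""
    items = list(terms(t))
    subject, rest = items[0], items[1:]
    # A lone <properties> group replaces the predicate/object part.
    if len(rest) == 1 and is_group(rest[0], "properties"):
        rest = rest[0][1]
    triples = []
    # What remains is read as alternating predicate, object terms.
    for pred, obj in zip(rest[0::2], rest[1::2]):
        objs = obj[1] if is_group(obj, "objects") else [obj]
        triples.extend((subject, pred, o) for o in objs)
    return triples
```

Applied to the first <t> in the example above, expand returns two
triples sharing the subject 'http://www.w3.org/TR/rdf-syntax-grammar,
one for dc:title and one for ex:editor.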

Of course, one might be led into believing that datatyping all of the
tokens with <uri> <bNode> and <literal> elements and using elements
for QNames would be easier on parsers, but I challenge anyone raising
this criticism to actually *prove* that that is the case. If I receive
positive feedback on this serialization attempt (though I don't
particularly expect it...) I may attempt to put my money where my
mouth is, as it were, and write a parser. In the meantime, my
rationalization is that XML parsers tend to be in languages that can
cope with a little string munging: all one has to do is make sure that
it is possible to:-

* Keep a list of the namespaces declared, and their short names (both
XSLT and any XML parser worth its salt can do this)
* Be able to tokenize strings splitting on whitespace (easy
programming task)
* Be able to datatype based on whether a token starts with "$" or "'",
and get the substring from [1:] if it's not a QName, and split on the
colon and get the mapping to the URI otherwise (a bit of work, but I'm
sure that this is possible in XSLT and it's obviously laughably easy
in Perl, Python, Java, C, C++, etc.)
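
The three steps above can be sketched in a few lines of Python. This is
an illustration rather than a reference parser: the function names
classify and tokenize are invented here, the namespace map is passed in
by hand instead of being read from the XML, and <s> literals are
assumed to be handled separately by the XML layer.

```python
def classify(token, namespaces):
    """Return (kind, value) for one whitespace-delimited XENT token."""
    if token.startswith("'"):
        # 'URI syntax: drop the leading apostrophe.
        return ("uri", token[1:])
    if token.startswith("$"):
        # $label syntax: drop the leading dollar sign.
        return ("bnode", token[1:])
    # Anything else must be a QName: split on the first colon and
    # map the prefix to its namespace URI.
    prefix, colon, local = token.partition(":")
    if not colon or prefix not in namespaces:
        raise ValueError("not a URI, bNode or known QName: %r" % token)
    return ("uri", namespaces[prefix] + local)


def tokenize(text, namespaces):
    """Split text on whitespace and classify each token."""
    return [classify(token, namespaces) for token in text.split()]
```

With namespaces = {"ex": "http://example.org/stuff/1.0/"}, the call
tokenize("$Dave ex:homePage 'http://purl.org/net/dajobe/", namespaces)
returns a bNode labelled Dave followed by two URIs, the first expanded
to http://example.org/stuff/1.0/homePage.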

That's it. There are some issues, but they're mainly just todos.

* Internationalization. Probably can inherit most of the solutions
from N-Triples.
* Datatypes and lang on literals. <s lang="en">string</s> and
<s dt="datatypeURI">string</s> perhaps.
* XML literals. Perhaps <x> should preserve XML literals, and
everything else has to get flattened to text. Or! Perhaps <s> should
be an XML literal, and people can use <![CDATA[]]> to flatten anything
down should they need to. Tricky.
* Collections? Reification? Shouldn't be too hard to add. <t> could be
used as an object/subject for reification, perhaps, though then you
can't give it an id (unless you add an attribute to the <t> element,
perhaps).
* No more truly blank nodes. Does this even really matter?
* The 'URI syntax trick could be eliminated by saying that any
QName/URI things whose prefixes have been declared using XML
namespaces are QNames, and anything else is a URI, but that's horrid,
and it's only one character.

This is just a quick sketch and I don't have many free cycles with
which to work on it, but I'll try to contribute to any resultant
thread as much as I can. Comments are most welcome, of course.

Thanks,

[1] http://www.tbray.org/ongoing/When/200x/2003/06/13/SemWeb
[2] http://www-uk.hpl.hp.com/people/marbut/isTheSemanticWebHype.pdf
[3] http://www.textuality.com/xml/RPV.html
[4] http://infomesh.net/2001/07/bswl/
[5] http://lists.w3.org/Archives/Public/www-rdf-interest/2002Mar/0128

--
Sean B. Palmer, <http://purl.org/net/sbp/>
"phenomicity by the bucketful" - http://miscoranda.com/
Received on Saturday, 14 June 2003 14:08:15 UTC