The Mire 'twixt Documents And Data from Sean B. Palmer on 2000-12-02 (www-rdf-interest@w3.org from December 2000)

From: Sean B. Palmer <sean@mysterylights.com>
Date: Sat, 2 Dec 2000 17:00:52 -0000
To: <xml-dev@lists.xml.org>
Cc: <swi-dev@egroups.com>, <www-rdf-interest@w3.org>, "William Loughborough" <love26@gorge.net>, "Tapio Markula" <tapio1@gamma.nic.fi>
Message-ID: <002501c05c81$6ff5f320$3fff7ad5@z5n9x1>

Documents are here to stay, and so is data. Roughly put, we have HTML/WWW
for documents, and XML/RDF/SW for data. The problem we all face is that it
is very rare to have either a pure document or pure data. Documents always
have data to back them up, and conversely data always needs some kind of
prose explanation.

Look upon this as "explicit reification" if you must: everything needs a
prose definition at some level. Does this mean the SW has failed before it
has started? Of course not! It will work for pure data models, but there
aren't all that many pure data models out there. The information we mainly
deal with is simply annotated data.

At the moment we appear to have a mini formatting war [1] going on over
documents vs. data, alongside the ongoing battles over XML Schema vs. XML
DTDs (or, put a bit more rationally, XML vs. XHTML). But why can't we just
come to a sort of half-document, half-data consensus?

[[[
I believe that one of the best ways to transition into RDF, if not a
long-term deployment strategy for RDF, is to manage the information in
human-consumable form (XHTML) annotated with just enough info to extract the
RDF statements that the human info is intended to convey. [...] We all know
that we have to produce a human-readable version of the thing... why not use
that as the primary source?
]]] - [2]

Or in other words, use XHTML [3] as a repository for data, but one that
can still be marked up with annotations, explanations, and summaries. Aha!
The key concept here is this. Data can be stored somehow in XHTML and
annotated with two different types of further data: annotation intended to
facilitate the machine transformation and extraction of that data into
machine (RDF?) form, and annotation to assist humans in the interpretation
of that data [4].

The two most important building blocks for this conversation will be these
simple little tags and attributes (their meanings are self-explanatory):-

     <annotation xmlns="[TBD]">
     <inverseOf
          xmlns="http://www.daml.org/2000/10/daml-ont.daml">
     @annotation @class @type

If we added those simple tags etc. to a kind of XHTML slurry, then we would
have a lot more power to walk through the mire 'twixt documents and data.
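
To make the idea a little more concrete, here is a rough Python sketch of how
a program might pick the human-oriented notes out of such a slurry. Everything
in it is an assumption made for illustration: the annotation namespace is
still "[TBD]" above, so the example.org URI is only a stand-in, and the rule
that an annotation describes its parent element is invented for this example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Hypothetical stand-in: the real annotation namespace is still "[TBD]".
ANN_NS = "http://example.org/tbd/annotation"

SLURRY = f"""
<p xmlns="{XHTML_NS}" xmlns:a="{ANN_NS}">
  XHTML 1.0 was edited by <em>Steven Pemberton</em> et al.
  <a:annotation>The em element holds the editor's name.</a:annotation>
</p>
"""

def human_annotations(xml_text):
    """Return (parent local name, note text) for each annotation element.

    Assumed convention: an annotation element describes its parent.
    """
    root = ET.fromstring(xml_text)
    notes = []
    for parent in root.iter():          # root.iter() includes the root itself
        for child in parent:            # direct children only
            if child.tag == f"{{{ANN_NS}}}annotation":
                local = parent.tag.split("}")[-1]   # drop the {namespace} prefix
                notes.append((local, (child.text or "").strip()))
    return notes

print(human_annotations(SLURRY))
# [('p', "The em element holds the editor's name.")]
```

Because the note lives inside the document, next to the thing it explains, a
browser can show it to a person while a tool can just as easily skip it.
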
But this is all an abstract conversation, isn't it? Not really. Browsers
worldwide grok XHTML, and a few can use CSS to style other forms of XML. At
the moment, to cleanly extract data from XHTML, we have to pepper it (i.e.
annotate it) with hundreds of "classes", that is, class attributes [5] that
imply our meaning, as discussed for example in the semantic design
principles [6]. Instead, we could just add a few custom annotation and
logic tags (like the ones above) to (e.g.) m12n, and so create a
transformable form of XHTML to bridge the gap.
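
For what it's worth, the class-attribute approach we are stuck with today can
be sketched in a few lines of Python. This is only an illustration under
stated assumptions, not an established convention: the rule that an element
with an id is a subject and each class name beneath it is a predicate is made
up for this example, as are the function name and the "#" base.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

PAGE = f"""
<div xmlns="{XHTML_NS}" class="reference" id="ref3">
  <span class="title">XHTML 1.0</span>,
  <span class="editor">Steven Pemberton</span> et al.
</div>
"""

def class_triples(xml_text, base="#"):
    """Return (subject, predicate, object) tuples from class-annotated XHTML.

    Assumed convention (not from any spec): an element with an id is a
    subject, and every descendant carrying a class attribute contributes
    one statement whose predicate is the class name and whose object is
    the element's text content.
    """
    root = ET.fromstring(xml_text)
    triples = []
    for subject in root.iter():
        sid = subject.get("id")
        if sid is None:
            continue
        for node in subject.iter():
            if node is subject:         # the subject's own class is ignored
                continue
            cls = node.get("class")
            if cls:
                text = "".join(node.itertext()).strip()
                triples.append((base + sid, cls, text))
    return triples

for triple in class_triples(PAGE):
    print(triple)
```

Note how every statement needs its own class attribute, and how the meaning of
each class name lives only in the extractor; that is exactly the peppering
problem described above.
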
Strangely enough, the W3C's Amaya already has an annotation system [7] and
an annotation server [8]. But it doesn't tie into the document itself, and
therefore I doubt it sees much use (sorry!). However, the principle of
using annotations with data is a great idea, and one that surely should be
pursued.

Summary:-
We need some kind of "lingua franca" for annotating data in a form that is
both human-readable and transformable into a machine-readable format. (And
yes, this does smack of SDF [9].)

There aren't many examples of semantically annotated XHTML out there (in
fact, I can't find a single satisfactory one...) so I urge people to create
examples.

References:-
[1] http://doctypes.org/
     - Doctypes.org, M. Altheim
[2] http://lists.w3.org/Archives/Public/www-rdf-interest/2000Mar/0103.html
     - XSLT for screen-scraping RDF out of real-world data, Dan Connolly
[3] http://www.w3.org/TR/xhtml1/
     - XHTML 1.0, Steven Pemberton et al.
[4] http://www.mysterylights.com/sbp/#docordata
     - Documents vs. Data, Sean B. Palmer
[5] http://www.w3.org/TR/html401/struct/global.html#adef-class
     - The class Attribute - HTML 4.01, Dave Raggett et al.
[6] http://www.mysterylights.com/sbp/#semanticprinciples
     - Design Principles to Aid Semantics, Sean B. Palmer
[7] http://www.w3.org/2000/02/collaboration/annotation/AmayaDocs/Annotation.html
     - Annotations in Amaya
[8] http://annotest.w3.org/
     - The W3C's Annotea project
[9] http://lists.w3.org/Archives/Public/www-rdf-interest/2000Nov/0033.html
     - Semantic Document Frameworks, Sean B. Palmer

P.S. Apologies for the cross post: this note (i.e. rant) covers quite a few
topics...

Kindest Regards,
Sean B. Palmer
http://www.mysterylights.com/sbp/
http://www.w3.org/WAI/ [ERT/GL/PF]
"Perhaps, but let's not get bogged down in semantics."
   - Homer J. Simpson, BABF07.
Received on Saturday, 2 December 2000 12:01:02 UTC