Cross-set Introduction (progress report) from Murray Maloney on 2006-10-04 (public-grddl-wg@w3.org from October 2006)

From: Murray Maloney <murray@muzmo.com>
Date: Wed, 04 Oct 2006 19:38:36 -0400
To: public-grddl-wg <public-grddl-wg@w3.org>
Message-Id: <5.1.1.6.2.20061004175925.02ca4f18@mail.muzmo.com>
Here is an updated version of a proposed Cross-set Introduction to GRDDL.
I intend to continue working on this until next week's call. Feel free to offer
suggestions for changes in the interim. I have tried to capture as much as
possible from previous iterations of introductory material from all the WDs.
If you feel that I have failed to address anything that should be mentioned
in an introduction, please let me know.

Please note that I ask for help with an example of source XHTML, and a
potential result RDF. I don't think that we have to demonstrate an actual
transformation because we are just trying to illustrate dialects of languages
in the source and a nice RDF encoding of the same information.

<div>

<h2 id="intro">Introduction: Data and Documents</h2>

<p>Many dialects of languages are in use among the XML documents on the web.
There are dialects of XHTML, XML and RDF that are used to represent everything from
poetry to prose, purchase orders to invoices, spreadsheets to databases, schemas to scripts,
and linked lists to ontologies. Some offer formally defined semantics and others more
loosely coupled semantics. Recently, two progressive encoding techniques have emerged
to overlay additional semantics onto valid XHTML documents: RDFa and microformats
offer simple, open data formats built upon existing and widely adopted standards.
</p>

<p>While this breadth of expression is quite liberating, inspiring new
dialects to codify both common and customized meanings, it can prove to be
a barrier to understanding across different domains or fields. How, for
example, does software discover the author of a poem, a
spreadsheet and an ontology? And how can software determine whether
the authors of each are in fact the same person?</p>

<h3>Resource Descriptions</h3>
<p>The Resource Description Framework <a href="#RDFC04">[RDFC04]</a>
provides a standard for making statements about resources in the form
of a subject-predicate-object expression. One way to represent the
fact "<I>The Stand</I>'s author is Stephen King" in RDF would be as a triple
whose subject is "The Stand," whose predicate is "has the author," and
whose object is "Stephen King." The predicate, "has the author,"
expresses a relationship between the subject (The Stand) and the object
(Stephen King). Using URIs to uniquely identify the book, the author and
even the relationship would facilitate software design because not
everyone knows Stephen King or even spells his name consistently.
</p>
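<p>As a rough illustration of the paragraph above, the following Python sketch
models a triple as a plain tuple and shows why URIs help when names are spelled
inconsistently. The example.org URIs and the name lookup table are hypothetical
and exist only for this sketch; the predicate is the Dublin Core
<code>creator</code> element.</p>

```python
from collections import namedtuple

# A triple is one (subject, predicate, object) statement.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical URIs for the book and the person (assumptions for this sketch).
THE_STAND = "http://example.org/books/the-stand"
STEPHEN_KING = "http://example.org/people/stephen-king"
# Dublin Core "creator" stands in for the "has the author" relationship.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"

# Hypothetical lookup table: two spellings, one identifier.
name_to_uri = {
    "Stephen King": STEPHEN_KING,
    "Steven King": STEPHEN_KING,
}


def authorship(book_uri, author_name):
    """Build the triple stating that the named person is the book's author."""
    return Triple(book_uri, DC_CREATOR, name_to_uri[author_name])


first = authorship(THE_STAND, "Stephen King")
second = authorship(THE_STAND, "Steven King")

# Both spellings resolve to the same URI, so the two statements are identical.
same_statement = first == second
```

<p>Comparing the literal names would treat these as two different authors;
comparing URIs does not.</p>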

<PRE>
[Here, I would like someone to create an example of a source and a result:
         Source XHTML includes META and numerous LINK elements that all
         somehow cite authorship, including dc:author and others.
         Result RDF includes a tidy package of person/author/book triples
         which includes one that asserts that "Stephen King"/author/"The Stand"
         and another which mis-spells it as Steven King.]
</PRE>

<p>GRDDL is a mechanism for <b>G</b>leaning <b>R</b>esource
<b>D</b>escriptions from <b>D</b>ialects of <b>L</b>anguages.
That is, GRDDL provides a relatively inexpensive mechanism for
bootstrapping RDF content from uniform XML dialects, shifting the burden
from formulating RDF to creating transformation algorithms specifically for
each dialect. XML transformation languages such as XSLT are quite versatile
in their ability to process, manipulate, and generate XML. The use of XSLT to
generate XHTML from single-purpose XML vocabularies is historically celebrated
as a powerful idiom for separating structured content from presentation.</p>

<p>GRDDL shifts this idiom to a different end: separating structured content
from its authoritative meaning (or semantics). GRDDL works by associating
transformations with an individual document, either through direct inclusion of
references or indirectly through profile and namespace documents. Content
authors can nominate the transformations for producing RDF from their content
and use GRDDL to refer to them.</p>
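<p>The direct-inclusion case can be sketched in Python as follows. This is a
minimal sketch, not a GRDDL processor: it assumes the XHTML conventions from the
GRDDL drafts (a <code>profile</code> attribute naming
<code>http://www.w3.org/2003/g/data-view</code> and links with
<code>rel="transformation"</code>), which this introduction does not itself spell
out. The function name and sample document are illustrative only.</p>

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Profile URI as used in the GRDDL drafts (an assumption of this sketch).
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"


def find_transformations(xhtml_text):
    """Return the hrefs of rel="transformation" links in a GRDDL-profiled page."""
    root = ET.fromstring(xhtml_text)
    head = root.find("{%s}head" % XHTML_NS)
    if head is None:
        return []
    # The profile attribute may list several space-separated URIs.
    if GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    link_tags = ("{%s}link" % XHTML_NS, "{%s}a" % XHTML_NS)
    hrefs = []
    for element in root.iter():
        if element.tag not in link_tags:
            continue
        # rel may also hold several space-separated values.
        if "transformation" in element.get("rel", "").split():
            href = element.get("href")
            if href is not None:
                hrefs.append(href)
    return hrefs


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>Some Document</title>
    <link rel="transformation"
          href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl" />
  </head>
  <body></body>
</html>"""

NO_PROFILE = SAMPLE.replace(' profile="http://www.w3.org/2003/g/data-view"', "")

transformations = find_transformations(SAMPLE)
```

<p>The indirect cases (profile and namespace documents) would require fetching
and inspecting those documents and are left out of this sketch.</p>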

<h3>For example:</h3>
<p>Dublin Core metadata can be written in
an HTML dialect <a href="#RFC2731">[RFC2731]</a> that has a clear correspondence
to an encoding in RDF/XML <a href="#DCRDF">[DCRDF]</a>.
The following HTML and RDF excerpts illustrate the correspondence:</p>

<pre class="example">&lt;html xmlns="http://www.w3.org/1999/xhtml"&gt;
   &lt;head&gt; &lt;title&gt;Some Document&lt;/title&gt;
     &lt;meta name="DC.Subject" content="ADAM; Simple Search; Index+; prototype" /&gt;
   &lt;/head&gt; &lt;/html&gt;</pre>

<pre class="example">&lt;rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" &gt;
   &lt;rdf:Description rdf:about=""&gt;
     &lt;dc:subject&gt;ADAM; Simple Search; Index+; prototype&lt;/dc:subject&gt;
   &lt;/rdf:Description&gt;
&lt;/rdf:RDF&gt;</pre>

<p>The correspondence between the source and result forms of this example
is expressed as an algorithm in an XSLT transformation,
<a href="http://www.w3.org/2000/06/dc-extract/dc-extract.xsl">dc-extract.xsl</a>.</p>
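<p>To make the correspondence concrete, here is a small Python sketch of the
mapping shown in the two excerpts. It is not the algorithm in dc-extract.xsl, only
an approximation under simplifying assumptions: every <code>meta</code> element
whose name starts with <code>DC.</code> becomes a lower-cased element in the
Dublin Core namespace, and qualified names such as <code>DC.Date.Created</code>
are not handled. The function name is hypothetical.</p>

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Use the conventional prefixes when the result is serialized.
ET.register_namespace("dc", DC_NS)
ET.register_namespace("rdf", RDF_NS)


def dc_meta_to_rdf(xhtml_text):
    """Map <meta name="DC.X" content="..."> elements to dc:x properties."""
    root = ET.fromstring(xhtml_text)
    rdf = ET.Element("{%s}RDF" % RDF_NS)
    description = ET.SubElement(rdf, "{%s}Description" % RDF_NS)
    # rdf:about="" says the statements describe the source document itself.
    description.set("{%s}about" % RDF_NS, "")
    for meta in root.iter("{%s}meta" % XHTML_NS):
        name = meta.get("name", "")
        if name.startswith("DC."):
            # Simplification: "DC.Subject" becomes the element dc:subject.
            local_name = name[len("DC."):].lower()
            prop = ET.SubElement(description, "{%s}%s" % (DC_NS, local_name))
            prop.text = meta.get("content", "")
    return rdf


SOURCE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Some Document</title>
    <meta name="DC.Subject" content="ADAM; Simple Search; Index+; prototype" />
  </head></html>"""

result = dc_meta_to_rdf(SOURCE)
rdf_xml = ET.tostring(result, encoding="unicode")
```

<p>In a GRDDL setting this logic would live in the transformation that the
document nominates, so that a consumer does not need dialect-specific code like
the above.</p>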

<h3>Transformations</h3>
<p>Transformations are currently most often expressed using XSLT 1.0,
although other methods are permissible. Generally, if the transformation
can be fully expressed in XSLT 1.0 then it is preferable to use that format
since all GRDDL processors should be capable of interpreting
an XSLT 1.0 document.</p>

<p><a href="http://www.w3.org/TR/xproc/">XProc: An XML Pipeline Language</a>,
<i>a language for describing operations to be performed on XML documents,</i>
has recently been published as a W3C Working Draft.
It merits consideration for expressing more complex or sophisticated transformations
that require control over the flow of processing through a variety of XML processing tools.
Using XProc, one could apply a sequence of operations such as XInclude,
validation, and transformation to a document, aborting if the result or an
intermediate stage is not valid.</p>
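<p>The control flow described above (run stages in order, stop at the first
failure) can be sketched in Python. This is not XProc and uses none of its
syntax; the stage functions are hypothetical stand-ins, with a title check in
place of real validation and a title extraction in place of a real
transformation.</p>

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


class PipelineAborted(Exception):
    """Raised when any stage fails; later stages are never run."""


def parse(text):
    # Fails with ParseError if the input is not well-formed XML.
    return ET.fromstring(text)


def require_title(doc):
    # Stand-in for validation: reject documents that have no title element.
    if next(doc.iter("{%s}title" % XHTML_NS), None) is None:
        raise ValueError("document has no title")
    return doc


def extract_title(doc):
    # Stand-in for a transformation: reduce the document to its title text.
    return next(doc.iter("{%s}title" % XHTML_NS)).text


def run_pipeline(value, stages):
    """Feed each stage's output to the next, aborting on the first error."""
    for stage in stages:
        try:
            value = stage(value)
        except Exception as error:
            raise PipelineAborted("%s failed: %s" % (stage.__name__, error)) from error
    return value


STAGES = [parse, require_title, extract_title]

GOOD = ('<html xmlns="http://www.w3.org/1999/xhtml">'
        '<head><title>Some Document</title></head></html>')
BAD = '<html xmlns="http://www.w3.org/1999/xhtml"><head></head></html>'

title = run_pipeline(GOOD, STAGES)
```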

<h3>GRDDL WD</h3>
<p>
This GRDDL Working Draft is a concise technical specification of the GRDDL
mechanism and its XML syntax. It specifies the GRDDL syntax to use in
valid XHTML and well-formed XML documents, as well as how to encode
GRDDL into namespaces and HTML profiles. It also discusses the GRDDL
transformation link and security issues. Appendices provide
links to extended examples and to existing software and services that employ GRDDL.
</p>

<h3>GRDDL Primer</h3>
<p>
A Primer on Gleaning Resource Descriptions from Dialects of Languages (GRDDL)
is a progressive tutorial on the GRDDL mechanism.
It builds on a number of examples from the GRDDL Use Cases document to
illustrate GRDDL techniques for associating documents with transformations
for extracting RDF.
</p>

<h3>GRDDL Use Cases</h3>
<p>This document collects a number of use cases together with their goals and
requirements for GRDDL (Gleaning Resource Descriptions from Dialects of Languages),
a mechanism for getting <a href="#RDF">RDF</a> data out of XML documents
and in particular XHTML pages using explicitly associated transformation algorithms.
These use cases also illustrate how XML and XHTML documents can be decorated
with <a href="#microformats">microformat</a>, <a href="#EmbeddedRDF">Embedded RDF</a>
or <a href="#RDFa">RDFa</a> statements to support
<a href="#GRDDLTransformation">GRDDL transformations</a> in charge of extracting
valuable data that can then be used to automate a variety of tasks.</p>

<PRE>[The annotated Table of Use Cases would appear here in the Use Cases WD.]</PRE>
</div>
Received on Wednesday, 4 October 2006 23:39:07 UTC