DCMI Metadata Terms - issues with the RDFa script, content negotiation, etc from Thomas Baker on 2012-05-18 (public-rdfa@w3.org from May 2012)

From: Thomas Baker <tom@tombaker.org>
Date: Fri, 18 May 2012 01:08:12 -0400
To: Gregg Kellogg <gregg@greggkellogg.net>
Cc: Danny Ayers <danny.ayers@gmail.com>, Dan Brickley <danbri@danbri.org>, public-rdfa <public-rdfa@w3.org>, "hugh@hubns.com" <hugh@hubns.com>, Richard Cyganiak <richard.cyganiak@deri.org>, Jon Phipps <jphipps@madcreek.com>, Stuart Sutton <sasutton@dublincore.net>
Message-ID: <20120518050812.GA2086@alpha.dublincore.org>
Gregg, all,

The script currently generates RDFa [4] that says, about itself:

    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/creator> "DCMI Usage Board"@en .
    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/description> "This document is...etc..."@en .
    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/identifier> <http://dublincore.org/documents/2012/05/21/dcmi-terms/> .
    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/isVersionOf> <http://dublincore.org/documents/dcmi-terms/> .
    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/issued> "2012-05-21"^^<http://www.w3.org/2001/XMLSchema#date> .
    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/replaces> <http://dublincore.org/documents/2010/10/11/dcmi-terms/> .
    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/title> "DCMI Metadata Terms"@en .

Issues:
-- The subject should not be http://purl.org/dc/terms/, which is just
   one of the four namespace IRIs in DCMI Metadata Terms, but (I guess)
   http://purl.org/dc/.  Where is the script picking up /dc/terms/ (and 
   not, say, /elements/1.1/), and can it be tweaked to output just /dc/?

-- When translated using the distiller [7], the other serializations of
   RDF end up saying about themselves that they are versions of
   /documents/.../dcmi-terms/.  If the other serializations (RDF/XML and
   Turtle) are derived from the RDFa, I guess it is correct for them all
   to point to the RDFa document. I'm not sure this is best practice, 
   but it seems reasonable.  Any opinions about that here?

-- The distiller-derived RDF serializations also say:

    <https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml>
        <http://purl.org/dc/terms/tableOfContents>
        <https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml#contents> .

    I guess the distiller, if run on the index.shtml after publication,
    would show correct values for the subject and object, but I'm
    flagging it as something that would need to be hand-edited out of
    any serializations generated from the RDFa before they were
    published to the website.  Could we perhaps simply suppress the
    generation of this triple in the script?
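
As a stopgap for the first and third issues, the hand-editing could be
scripted over an N-Triples dump of the distiller output.  This is only a
rough sketch, not a change to the RDFa-generating script itself: it
assumes one triple per line, and it assumes that http://purl.org/dc/ is
the subject we settle on, which is still an open question.  The function
name clean is made up for illustration.

```python
# Sketch only: post-process N-Triples lines from the distiller.
# Assumption: http://purl.org/dc/ is the desired subject (still undecided).
TOC = "<http://purl.org/dc/terms/tableOfContents>"
OLD_SUBJECT = "<http://purl.org/dc/terms/>"
NEW_SUBJECT = "<http://purl.org/dc/>"


def clean(lines):
    """Drop tableOfContents triples and rewrite the document subject."""
    out = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate never contain whitespace in N-Triples,
        # so splitting twice leaves the object and final dot in `rest`.
        subject, predicate, rest = line.split(None, 2)
        if predicate == TOC:
            continue  # suppress the triple that points at raw.github.com
        if subject == OLD_SUBJECT:
            subject = NEW_SUBJECT
        out.append(f"{subject} {predicate} {rest}")
    return out
```

Note that this rewrites every triple whose subject is exactly
http://purl.org/dc/terms/, so it would need refining if the namespace
itself is ever described in the same file.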

I'm working with Jon Phipps on figuring out the content negotiation
piece of the puzzle, and we are guardedly optimistic that we may be able
to implement this before publication, which I am postponing from Monday
of next week to, say, the end of next week or even longer if we're close
to a solution and just need more time.  We have a few extra days to work
out the bugs.

One basic policy decision DCMI has taken is that we will continue, at
least for now, to serve RDF/XML (or Turtle) -- not _just_ RDFa.  (I'd
like to hear opinions about whether Turtle should already be the new
default, instead of RDF/XML.) 

That means that if we do not manage to get content negotiation working,
we may have to point the PURLs to an RDF/XML (or Turtle) representation.
The RDFa would still be there, but it would not be reachable from the
PURLs -- a situation we would need to keep trying to rectify.
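
To make the content negotiation piece concrete, here is a minimal
sketch of the selection logic a server would apply to the Accept
header.  It is not how the PURL server or our web server is actually
configured: the mapping of media types to file names and the choice of
RDF/XML as the fallback are assumptions for illustration, and the
function names are made up.

```python
# Sketch only: choose a representation from an HTTP Accept header.
# Assumption: these three variants exist side by side under these names.
VARIANTS = {
    "text/html": "index.shtml",
    "application/rdf+xml": "dc.rdf",
    "text/turtle": "dc.ttl",
}
DEFAULT = "application/rdf+xml"  # assumed fallback, per the RDF/XML policy


def parse_accept(header):
    """Return (media_type, q) pairs, highest q first, header order on ties."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media = parts[0].lower()
        if not media:
            continue
        q = 1.0  # q defaults to 1 when the client gives none
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        prefs.append((media, q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def choose(header):
    """Return (media_type, file_name) for the best acceptable variant."""
    for media, q in parse_accept(header or ""):
        if q <= 0:
            continue
        if media in VARIANTS:
            return media, VARIANTS[media]
        if media == "*/*":
            return DEFAULT, VARIANTS[DEFAULT]
    return DEFAULT, VARIANTS[DEFAULT]
```

With this logic a browser sending text/html first would get the RDFa
page, while a client asking only for text/turtle would get dc.ttl.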

However, if we do get content negotiation to work, we need to decide how
the RDF/XML (or Turtle) will be served.  Following the pattern of [9],
my initial idea was to publish one consolidated RDF schema with terms
sharing all four namespace IRIs [10] at
http://dublincore.org/2012/05/30/dc.rdf.

However, Jon thinks this break with our decade-old practice of
publishing a separate schema for each of the four namespace prefixes
might be confusing to data consumers.  He is proposing an approach
whereby PURLs using one of the four namespace IRIs would resolve to four
schemas (as they do now); he may have more to say about this idea
tomorrow.

If we were to hear support (e.g., from this list) for the idea of
publishing four (or five) schemas, I would face the very practical
problem of how to generate four separate schemas from one RDFa document.
I initially considered reviving the scripts used to generate the RDF/XML
schemas from the common source -- we deleted these last week and I have
now retrieved them from an old commit and archived them in [14] (with
header files in [15]).  However, the output of these scripts would need
to be tested against the output of the RDFa-generating script, and the
scripts would need to be edited to produce compatible output -- not just
today, but potentially in the future.  This does not seem like a good
idea.

I was hoping I could extract Turtle representations of terms, by
namespace IRI, with something quite simple like:

    rapper -o ntriples dc.rdf | gawk '$1 ~/dc\/terms/' | rapper -i ntriples -o turtle >dcterms.ttl

The script would need some sed transforms along the way to tweak the
title, description, etc, but this approach would be quick and simple and
we could rest assured that it would represent the RDF content of the
RDFa document accurately.  On the downside, the script would not output
Turtle with @prefix declarations, but would use full IRIs everywhere,
making it a bit less readable.  But that is all theoretical because the
script above simply does not work.  Maybe someone here can say why?  Are
there more powerful tools that could make these transformations?
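
For comparison, the filtering step on its own can be sketched in a few
lines of Python operating on N-Triples, one triple per line.  This is
an illustration of the idea, not a tested replacement for the pipeline:
the function name and the short keys in the namespace mapping are made
up, and only two of the four namespace IRIs are shown.

```python
# Sketch only: sort N-Triples lines into buckets by subject namespace.
NAMESPACES = {
    "dcelements": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def split_by_namespace(lines, namespaces):
    """Group triples by the namespace IRI that their subject starts with."""
    buckets = {name: [] for name in namespaces}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject = line.split(None, 1)[0]
        for name, iri in namespaces.items():
            if subject.startswith("<" + iri):
                buckets[name].append(line)
                break  # a subject belongs to at most one bucket
    return buckets
```

Each bucket could then be written to its own file.  Since N-Triples
lines are generally also readable as Turtle, the result could be served
as .ttl directly, though still with full IRIs and no @prefix
declarations.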

The basic question is whether we need to have four (or five!) separate
RDF/XML and four (or five!) separate Turtle representations at all, or
can instead serve up just one dc.rdf and/or one dc.ttl.  What does
everyone think?

Tom



On Fri, May 11, 2012 at 10:57:56PM -0400, Gregg Kellogg wrote:
> You can try using the "raw" mode [6], and use it in the distiller URI
> field. Just make sure you specify the "rdfa" input format. If it was
> an actual HTML file, you probably could rely on content detection.
> 
> You should be able to turn the result into turtle using [7].
> 
> There is a way to be able to view the file as formatted HTML, but I
> think you need to put it in a "gh-pages" branch [8].
> 
> > I'm a bit new to Git but proceeding carefully.  Please let me know
> > if there are any problems with the merge...
> > 
> > Tom
> > 
> > [4] https://github.com/dublincore/website/blob/master/build/html/dcmi-terms/index.shtml
> > [5] https://github.com/RDFLib/pyrdfa3
> [6] https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml
> [7] http://rdf.greggkellogg.net/distiller?format=turtle&in_fmt=rdfa&uri=https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml
> [8] http://help.github.com/pages/

[9] http://dublincore.org/2010/10/11/dcterms.rdf#
[10] https://github.com/dublincore/website/blob/master/build/html/dcmi-terms/dc.rdf
[11] http://purl.org/dc/elements/1.1/
[12] http://purl.org/dc/terms/
[13] http://purl.org/dc/dcmi-type-vocabulary/
[13] http://purl.org/dc/dcam/
[14] https://github.com/dublincore/website/tree/master/archive/xsl-old
[15] https://github.com/dublincore/website/tree/master/archive/headers-old

-- 
Tom Baker <tom@tombaker.org>
Received on Friday, 18 May 2012 05:09:03 UTC