Re: DCMI Metadata Terms - issues with the RDFa script, content negotiation, etc from Gregg Kellogg on 2012-05-18 (public-rdfa@w3.org from May 2012)

From: Gregg Kellogg <gregg@greggkellogg.net>
Date: Fri, 18 May 2012 02:43:50 -0400
To: Thomas Baker <tom@tombaker.org>
CC: Danny Ayers <danny.ayers@gmail.com>, Dan Brickley <danbri@danbri.org>, public-rdfa <public-rdfa@w3.org>, "hugh@hubns.com" <hugh@hubns.com>, Richard Cyganiak <richard.cyganiak@deri.org>, Jon Phipps <jphipps@madcreek.com>, Stuart Sutton <sasutton@dublincore.net>
Message-ID: <69628B81-4074-4B26-9A54-75FA58B3F383@greggkellogg.net>
On May 17, 2012, at 10:08 PM, Thomas Baker wrote:

> Gregg, all,
> 
> The script currently generates RDFa [4] that says, about itself:
> 
>    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/creator> "DCMI Usage Board"@en .
>    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/description> "This document is...etc..."@en .
>    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/identifier> <http://dublincore.org/documents/2012/05/21/dcmi-terms/> .
>    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/isVersionOf> <http://dublincore.org/documents/dcmi-terms/> .
>    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/issued> "2012-05-21"^^<http://www.w3.org/2001/XMLSchema#date> .
>    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/replaces> <http://dublincore.org/documents/2010/10/11/dcmi-terms/> .
>    <http://purl.org/dc/terms/> <http://purl.org/dc/terms/title> "DCMI Metadata Terms"@en .
> 
> Issues:
> -- The subject should not be http://purl.org/dc/terms/, which is just
>   one of the four namespace IRIs in DCMI Metadata Terms, but (I guess)
>   http://purl.org/dc/.  Where is the script picking up /dc/terms/ (and 
>   not, say, /elements/1.1/), and can it be tweaked to output just /dc/?

The base element has @resource set to <http://purl.org/dc/terms/>; this is being used to create triples such as the creator, description, etc. Removing it is easy, and then it will just set those based on the document URL. If the nominal document location is at <http://purl.org/dc/terms/>, this would generate the same thing, but AFAIK <http://purl.org/dc/terms/> redirects to <http://dublincore.org/2010/10/11/dcterms.rdf#>. In that case, the redirect target would be used as the document base, and so as the subject of the generated triples.

I'll change it to <http://purl.org/dc/>. For future reference, this is actually done in web/xsl/html-dcmiterms.xsl on line 99.

> -- When translated using the distiller [7], the other serializations of
>   RDF end up saying about themselves that they are versions of
>   /documents/.../dcmi-terms/.  If the other serializations (RDF/XML and
>   Turtle) are derived from the RDFa, I guess it is correct for them all
>   to point to the RDFa document. I'm not sure this is best practice, 
>   but it seems reasonable.  Any opinions about that here?

Typically, I'd expect that the URI for all versions of the document would be the same, and that content negotiation would be used to get the most appropriate serialization. I think the distiller's doing the right thing here.
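
As a rough illustration of the content negotiation described above, here is a minimal Python sketch of picking one serialization for a single URI from the client's Accept header. The list of offered media types, the HTML fallback, and the function names are assumptions made for the example; this is not DCMI's or the distiller's actual configuration.

```python
# Hypothetical sketch: choose a serialization for one URI from an Accept header.
# OFFERED and the "text/html" fallback are assumptions, not a real server setup.

OFFERED = ["text/html", "application/rdf+xml", "text/turtle"]


def parse_accept(header):
    """Return (media_type, q) pairs from an Accept header, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # a media range without a q parameter has quality 1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(prefs, key=lambda p: p[1], reverse=True)


def negotiate(header, offered=OFFERED, default="text/html"):
    """Pick the offered media type the client prefers most."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if media_type in offered:
            return media_type
        if media_type == "*/*":
            return default
        if media_type.endswith("/*"):
            prefix = media_type[:-1]  # e.g. "text/"
            for candidate in offered:
                if candidate.startswith(prefix):
                    return candidate
    # Simplification: fall back to the default instead of answering 406.
    return default
```

With this sketch, a request sending "Accept: text/turtle" would get Turtle, while a browser-style header listing text/html first would get the RDFa page, both from the same URI.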

> -- The distiller-derived RDF serializations also say:
> 
>    <https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml>
>        <http://purl.org/dc/terms/tableOfContents>
>        <https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml#contents> .
> 
>    I guess the distiller, if run on the index.shtml after publication,
>    would show correct values for the subject and object, but I'm
>    flagging it as something that would need to be hand-edited out of any
>    serializations generated from the RDFa before they were published to
>    the website.  Could we perhaps simply suppress the generation of
>    this triple in the script?

That's just because the table of contents is set in a <link> in the head, before the resource is set in the body element. I'll move setting the resource to the html element, and they should be in sync.

> I'm working with Jon Phipps on figuring out the content negotiation
> piece of the puzzle, and we are guardedly optimistic that we may be able
> to implement this before publication, which I am postponing from Monday
> of next week to, say, the end of next week or even longer if we're close
> to a solution and just need more time.  We have a few extra days to work
> out the bugs.
> 
> One basic policy decision DCMI has taken is that we will continue, at
> least for now, to serve RDF/XML (or Turtle) -- not _just_ RDFa.  (I'd
> like to hear opinions about whether Turtle should already be the new
> default, instead of RDF/XML.) 
> 
> That means that if we do not manage to get content negotiation working,
> we may have to point the PURLs to an RDF/XML (or Turtle) representation.
> The RDFa would still be there, but it would not be reachable from the
> PURLs -- a situation we would need to keep trying to rectify.
> 
> However, if we do get content negotiation to work, we need to decide how
> the RDF/XML (or Turtle) will be served.  Following the pattern of [9],
> my initial idea was to publish one consolidated RDF schema with terms
> sharing all four namespace IRIs [10] at
> http://dublincore.org/2012/05/30/dc.rdf.
> 
> However, Jon thinks this break with our decade-old practice of
> publishing a separate schema for each of the four namespace prefixes
> might be confusing to data consumers.  He is proposing an approach
> whereby PURLs using one of the four namespace IRIs would resolve to four
> schemas (as they do now); he may have more to say about this idea
> tomorrow.

This slightly complicates deriving these files from the RDFa, but is definitely doable. We could also go back to generating the RDF/XML from xslt and deriving the Turtle from that.

> If we were to hear support (e.g., from this list) for the idea of
> publishing four (or five) schemas, I would face the very practical
> problem of how to generate four separate schemas from one RDFa document.
> I initially considered reviving the scripts used to generate the RDF/XML
> schemas from the common source -- we deleted these last week and I have
> now retrieved them from an old commit and archived them in [14] (with
> header files in [15]).  However, the output of these scripts would need
> to be tested against the output of the RDFa-generating script, and the
> scripts would need to be edited to produce compatible output -- not just
> today, but potentially in the future.  This does not seem like a good
> idea.

This can probably also be done using a SPARQL CONSTRUCT to select the subsets of the RDFa-based graph to create.
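
A minimal sketch of what such a query might look like, built here by a small Python helper so the same template can be reused per namespace IRI. The helper name and the choice to match on the subject IRI only are assumptions for illustration; the resulting string would still need to be run by a SPARQL 1.1 engine against the graph extracted from the RDFa.

```python
# Hypothetical helper: build a SPARQL 1.1 CONSTRUCT query that keeps only
# triples whose subject IRI starts with the given namespace IRI.


def construct_query(namespace_iri):
    """Return a CONSTRUCT query selecting the subgraph for one namespace."""
    # Refuse characters that would break out of the quoted string literal.
    if any(ch in namespace_iri for ch in '"\\\n\r'):
        raise ValueError("unexpected character in namespace IRI")
    return (
        "CONSTRUCT { ?s ?p ?o }\n"
        "WHERE {\n"
        "  ?s ?p ?o .\n"
        f'  FILTER(ISIRI(?s) && STRSTARTS(STR(?s), "{namespace_iri}"))\n'
        "}\n"
    )


# Example: the query for the /dc/terms/ namespace.
DCTERMS_QUERY = construct_query("http://purl.org/dc/terms/")
```

Note that a subject-prefix filter like this would drop the document-level triples (title, description, etc.) once their subject is no longer inside the namespace, so those would have to be added back separately.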

> I was hoping I could extract Turtle representations of terms, by
> namespace IRI, with something quite simple like:
> 
>    rapper -o ntriples dc.rdf | gawk '$1 ~/dc\/terms/' | rapper -i ntriples -o turtle >dcterms.ttl
> 
> The script would need some sed transforms along the way to tweak the
> title, description, etc, but this approach would be quick and simple and
> we could rest assured that it would represent the RDF content of the
> RDFa document accurately.  On the downside, the script would not output
> Turtle with @prefix declarations, but would use full IRIs everywhere,
> making it a bit less readable.  But that is all theoretical because the
> script above simply does not work.  Maybe someone here can say why?  Are
> there more powerful tools that could make these transformations?

There's still a workflow that allows you to do this: get N-Triples out of the RDFa, run your scripts on them (although I think this can be done in SPARQL), add the prefix definitions back in to the N-Triples, and re-serialize through the distiller as Turtle or RDF/XML.
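
The filtering step of that workflow could be sketched in plain Python roughly as follows. This is an illustration only, assuming one triple per line with an IRI subject; the function names and the two sample triples are made up for the example, and blank-node subjects and comment lines are simply skipped.

```python
# Hypothetical sketch: keep only the N-Triples lines whose subject IRI
# falls under one namespace IRI (the role the gawk filter was meant to play).


def subject_of(line):
    """Return the subject IRI of an N-Triples line, or None if it has none."""
    line = line.strip()
    if not line.startswith("<"):
        return None  # blank line, comment, or blank-node subject
    end = line.find(">")
    if end == -1:
        return None
    return line[1:end]


def filter_by_namespace(lines, namespace_iri):
    """Yield the lines whose subject IRI starts with namespace_iri."""
    for line in lines:
        subject = subject_of(line)
        if subject is not None and subject.startswith(namespace_iri):
            yield line.rstrip("\n")


# Illustrative input: one term from each of two namespaces.
SAMPLE = [
    "<http://purl.org/dc/terms/title> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> .",
    "<http://purl.org/dc/elements/1.1/title> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#Property> .",
]

DCTERMS_ONLY = list(filter_by_namespace(SAMPLE, "http://purl.org/dc/terms/"))
```

The output is still N-Triples with full IRIs, so the prefix definitions and the re-serialization to Turtle or RDF/XML would happen in the later steps.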

> The basic question is whether we need to have four (or five!) separate
> RDF/XML and four (or five!) separate Turtle representations at all, or
> can instead serve up just one dc.rdf and/or one dc.ttl.  What does
> everyone think?

You could also consider having four or five URLs all resolve to the same resource. You'd just get more triples than you would have before, but I don't see the harm in that.

Gregg

> Tom
> 
> 
> 
> On Fri, May 11, 2012 at 10:57:56PM -0400, Gregg Kellogg wrote:
>> You can try using the "raw" mode [6], and use it in the distiller URI
>> field. Just make sure you specify the "rdfa" input format. If it was
>> an actual HTML file, you probably could rely on content detection.
>> 
>> You should be able to turn the result into turtle using [7].
>> 
>> There is a way to be able to view the file as formatted HTML, but I
>> think you need to put it in a "gh-pages" branch [8].
>> 
>>> I'm a bit new to Git but proceeding carefully.  Please let me know
>>> if there are any problems with the merge...
>>> 
>>> Tom
>>> 
>>> [4] https://github.com/dublincore/website/blob/master/build/html/dcmi-terms/index.shtml
>>> [5] https://github.com/RDFLib/pyrdfa3
>> [6] https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml
>> [7] http://rdf.greggkellogg.net/distiller?format=turtle&in_fmt=rdfa&uri=https://raw.github.com/dublincore/website/master/build/html/dcmi-terms/index.shtml
>> [8] http://help.github.com/pages/
> 
> [9] http://dublincore.org/2010/10/11/dcterms.rdf#
> [10] https://github.com/dublincore/website/blob/master/build/html/dcmi-terms/dc.rdf
> [11] http://purl.org/dc/elements/1.1/
> [12] http://purl.org/dc/terms/
> [13] http://purl.org/dc/dcmi-type-vocabulary/
> [13] http://purl.org/dc/dcam/
> [14] https://github.com/dublincore/website/tree/master/archive/xsl-old
> [15] https://github.com/dublincore/website/tree/master/archive/headers-old
> 
> -- 
> Tom Baker <tom@tombaker.org>
Received on Friday, 18 May 2012 06:44:49 UTC