giving up on datasets/trig as more than a web cache

Recently, I've tried to argue that trig (or whatever it's called) needs 
to be able to carry distinguished metadata.  This morning I've decided 
it doesn't, really, at least for the use cases I think about.   My 
replacement idea is to think about trig as *just* being a Web Cache, as 
just a convenient shorthand for pairing a bunch of URLs and their RDF 
contents, so you can publish or fetch them all at once.   I had been 
thinking about it as something else, as more of a first-class KR, but 
that doesn't seem to be flying.  (I guess this is yet another hold-over 
from my years of working with N3.)

Let's see if I can explain, for anyone else who might think a dataset 
could/should mean something more, and maybe myself, tomorrow.

The use cases I think about are nearly all about data federation, the 
stuff I wrote about and implemented as a federated phonebook [1].   
They're all about data being gathered from original sources and 
processing systems and passed on toward data consumers, as a package, as 
a new combined-source.   This seems to me like an incredibly important 
use case, one that requires standardization and that could really 
benefit from the idea of datasets and a dataset syntax.

I envisioned it as a converging pipeline, starting with turtle files 
(rdf graphs) as the leaves, but then having trig files (rdf datasets) as 
the major trunks.   The clients would always be getting a trig file (or 
using a sparql endpoint with the same dataset). For example, in 2.4 we 
get the situation where a division is gathering the data from its 
departments, and then passing them up to headquarters in one combined feed.

But if the feed is trig, and one is going to be able to figure out what 
really came from where/when so that bugs and incorrect data can be 
addressed, then trig has to have distinguished metadata.    And I hear a 
lot of people opposed to that, or at least opposed to any convenient way 
of supporting it, because SPARQL doesn't really have it.   So, instead, 
how about we just make the main feed be turtle, containing only the 
metadata?  All the data I was putting in named graphs stays out on the 
web, to be dereferenced by clients if they want.
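
To make the shape of that concrete, here is a minimal Python sketch of a 
metadata-only feed.  Everything named in it is an assumption for 
illustration (the FeedEntry fields, the example.org URLs, and the FAKE_WEB 
dict standing in for real HTTP dereferencing); none of it comes from a spec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional


@dataclass(frozen=True)
class FeedEntry:
    """One entry of the metadata-only feed (all fields are hypothetical)."""
    source_url: str   # where the actual RDF graph lives on the web
    supplied_by: str  # which department published it
    retrieved: str    # when the division last saw it (ISO date)


def dereference_selected(
    feed: Iterable[FeedEntry],
    fetch: Callable[[str], str],
    wanted: Optional[Callable[[FeedEntry], bool]] = None,
) -> Dict[str, str]:
    """Fetch only the sources the client cares about, keyed by URL."""
    return {
        entry.source_url: fetch(entry.source_url)
        for entry in feed
        if wanted is None or wanted(entry)
    }


# Stand-in for the web: URL -> Turtle text. A real client would do HTTP GETs.
FAKE_WEB = {
    "http://example.org/sales/phones.ttl": '<#alice> <#phone> "555-0100" .',
    "http://example.org/legal/phones.ttl": '<#bob> <#phone> "555-0199" .',
}

# The feed itself carries only metadata; the data stays at the source URLs.
FEED = [
    FeedEntry("http://example.org/sales/phones.ttl", "sales", "2012-09-26"),
    FeedEntry("http://example.org/legal/phones.ttl", "legal", "2012-09-27"),
]

# A client that only trusts (or only needs) the sales department's data.
sales_only = dereference_selected(
    FEED, FAKE_WEB.__getitem__, lambda e: e.supplied_by == "sales"
)
```

The point of the sketch is just that where/when information lives in the 
feed entries, so nothing about the fetched graphs needs to carry it.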

And then, for performance, if desired, the feed can also link to a trig 
file, saying "here, I've done all the fetching for you; if you're going 
to be dereferencing all this stuff anyway, you might as well take this 
instead".    It can do the same with a SPARQL endpoint, offered purely 
for convenience/performance.
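
A rough Python sketch of that "trig as web cache" behaviour, under stated 
assumptions: the cache is modelled as a plain dict from URL to graph text, 
and make_resolver, fake_fetch and the example.org URLs are hypothetical 
names of mine, not part of any existing library.

```python
from typing import Callable, Dict, List


def make_resolver(
    cache: Dict[str, str], fetch: Callable[[str], str]
) -> Callable[[str], str]:
    """Return a lookup that prefers the prefetched cache over the web."""
    def resolve(url: str) -> str:
        if url in cache:
            return cache[url]  # served from the bundled trig-style cache
        return fetch(url)      # not in the bundle: dereference it ourselves
    return resolve


# Records which URLs actually had to be fetched from the (fake) web.
fetch_log: List[str] = []


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP GET; a real client would go to the network."""
    fetch_log.append(url)
    return "<#fetched> <#from> <" + url + "> ."


# The cache is nothing more than URL -> graph content pairings.
TRIG_CACHE = {
    "http://example.org/sales/phones.ttl": '<#alice> <#phone> "555-0100" .',
}

resolve = make_resolver(TRIG_CACHE, fake_fetch)
cached = resolve("http://example.org/sales/phones.ttl")  # cache hit
missed = resolve("http://example.org/legal/phones.ttl")  # falls back to fetch
```

A client that ignores the cache entirely gets the same answers, only 
slower, which is what makes the cache optional.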

*shrug*    It should work fine.    Maybe it's even better 
architecture.    It certainly means the name should not be "SuperTurtle", 
since now trig remains a fairly obscure/internal/dump format, and 
(unlike Turtle) cannot actually be used to express data, other than 
simple pairings of URLs and graphs.

     -- Sandro

[1] http://dvcs.w3.org/hg/rdf/raw-file/default/rdf-spaces/index.html#use-cases

Received on Thursday, 27 September 2012 12:09:42 UTC