Re: Graph-State Resources (was Re: graphs and documents Re: [ALL] agenda telecon 14 Dec) from Steve Harris on 2011-12-20 (public-rdf-wg@w3.org from December 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Tue, 20 Dec 2011 10:25:57 +0000
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Sandro Hawke <sandro@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <2580770A-BB16-4BCB-9C18-4AD8431CDBDF@garlik.com>
On 19 Dec 2011, at 22:49, Richard Cyganiak wrote:

> On 15 Dec 2011, at 17:52, Sandro Hawke wrote:
>>> Unconvinced.  What's an RDFa document?  It's some RDF, some scripts, 
>>> some HTML links, some appearance.  Is that limiting it to RDF?
>> 
>> It's not a Graph-State Resource, as I'm trying to define the term.
> 
> That's a bummer.
> 
>> There's a lot more to its state (except in degenerate cases, like a sort
>> of RDFa-quine) than is conveyed in the triples.
> 
> One of the main use cases that makes me kind of want to have more in RDF datasets than the pure data-structure definition we have right now is web crawling. It would be nice to be able to have a well-defined representation of a web crawl as an RDF dataset. But this critically depends on being able to represent partial state (e.g., only the bits of an RDFa page marked up with RDFa) in the dataset.

Very strongly agreed. However, it's pretty hard, and I think it's far too early, with far too little real-world experience of how to handle this sensibly.

It also brings up the issue of scale. The quad backup of our crawl data runs to many, many terabytes even heavily compressed, so none of the currently standard (for small values of standard) quad / dataset formats is appropriate for transferring that state. For one thing, it would just take too long, even over 10GigE.
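
To put rough numbers on the transfer-time point, here is a back-of-envelope sketch. The 50 TB size and the efficiency factor are purely illustrative assumptions, not figures from the crawl described above, and the function name is made up for this example.

```python
def transfer_hours(size_terabytes: float,
                   link_gbit_per_s: float = 10.0,
                   efficiency: float = 1.0) -> float:
    """Hours needed to push a dump of the given size over a network link.

    size_terabytes  -- dump size in decimal terabytes (1 TB = 1e12 bytes)
    link_gbit_per_s -- nominal link speed in gigabits per second
    efficiency      -- fraction of the nominal rate actually achieved (assumed)
    """
    size_bits = size_terabytes * 1e12 * 8
    seconds = size_bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


# Hypothetical 50 TB dump at the full 10 Gbit/s line rate.
print(round(transfer_hours(50), 1))
# Same dump if only half the line rate is sustained.
print(round(transfer_hours(50, efficiency=0.5), 1))
```

Even under the ideal assumption this is many hours of sustained line-rate transfer, before any time spent serialising or re-loading the quads.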

>> I'm looking for a class of things which have very similar behavior and
>> attributes.  My most recent angle is trying to document how to use REST
>> with these things.  I want to be able to talk about how HEAD, GET, PUT,
>> and PATCH should work on these things.   RDFa documents have to be
>> handled quite differently -- one could not, for instance, PATCH an RDFa
>> document with an application/sparql-update patch.   I'm trying to focus
>> on the class of things for which SPARQL Update is a meaningful PATCH
>> language.
> 
> I think that's restricting it way too much.
> 
> Given adequate parsers and extractors, bits of RDF can be read out of almost every page on the Web.

Yes, we generate RDF mostly from pages of HTML, but also RDF(a), PDF, Excel documents etc.

As an interesting screw case, suppose you have a page http://example.com/foo, which contains both RDFa and HTML content that we're interested in. We have to store the RDFa triples and the derived triples separately.
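
One way to picture that separation is a dataset with one named graph per extraction method. This is a minimal sketch only: the "#rdfa" / "#derived" graph-naming scheme, the example.com vocabulary URI, and the sample triples are all hypothetical, chosen for illustration rather than taken from any real store.

```python
from collections import defaultdict

# A triple is (subject, predicate, object); the dataset maps graph name -> triples.
Triple = tuple[str, str, str]

PAGE = "http://example.com/foo"
# Hypothetical naming scheme: one graph per (page, extraction method).
RDFA_GRAPH = PAGE + "#rdfa"        # triples the publisher stated in RDFa
DERIVED_GRAPH = PAGE + "#derived"  # triples extracted from the HTML content

dataset: dict[str, set[Triple]] = defaultdict(set)

# Illustrative triples only.
dataset[RDFA_GRAPH].add(
    (PAGE, "http://purl.org/dc/terms/title", '"Foo"'))
dataset[DERIVED_GRAPH].add(
    (PAGE, "http://example.com/vocab#mentions", "http://example.com/bar"))


def triples_for_page(page: str) -> set[Triple]:
    """Merge every graph whose name, minus its fragment, equals the page URL."""
    merged: set[Triple] = set()
    for name, triples in dataset.items():
        if name.split("#", 1)[0] == page:
            merged |= triples
    return merged


print(len(dataset[RDFA_GRAPH]), len(dataset[DERIVED_GRAPH]))
print(len(triples_for_page(PAGE)))
```

Keeping the two graphs apart preserves which statements came from the publisher's markup and which came from an extractor, while the merge helper still gives a partial view of the page's state when that distinction does not matter.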

> Limiting the applicability of the “web-style dataset” pattern to only things published from SPARQL endpoints (or even only update-capable SPARQL endpoints) would result in something that's not useful to most current RDF users. Most of the RDF out there is read-only at this point and doesn't come from SPARQL stores. I don't see this changing anytime soon – RDF coming from SPARQL stores will grow, but so will RDF coming from CMSes and DBs and Excel sheets and screenscraping and other read-only non-SPARQL sources.

Even when it does come from a SPARQL store, sometimes external users get read access, while internal users get read/write.

- Steve

>> Looking for other properties which might apply to GSRs, I thought of
>> VoID and came across this:
>> 
>>       The fundamental concept of VoID is the dataset. A dataset is a
>>       set of RDF triples that are published, maintained or aggregated
>>       by a single provider. Unlike RDF graphs, which are purely
>>       mathematical constructs [RDF-CONCEPTS], the term dataset has a
>>       social dimension: we think of a dataset as a meaningful
>>       collection of triples, that deal with a certain topic, originate
>>       from a certain source or process, are hosted on a certain
>>       server, or are aggregated by a certain custodian. Also,
>>       typically a dataset is accessible on the Web, for example
>>       through resolvable HTTP URIs or through a SPARQL endpoint
>> 
>>               - http://www.w3.org/TR/2011/NOTE-void-20110303/#dataset
>> 
>> Terminology aside, that seems to match g-box rather well.  
> 
> Having written the quoted paragraph, I'm not sure that I agree.
> 
> The prototypical void:Dataset would be something like “all the RDF in DBpedia”. The prototypical g-box would be something like “Bob's FOAF file” (assuming it can change over time).
> 
> The term “g-box” evokes storage of a graph. GSR evokes, to me, observation of the result of HTTP prodding. void:Dataset evokes, to me, a larger, socially meaningful collection of RDF data.
> 
> Best,
> Richard

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Tuesday, 20 December 2011 10:31:02 UTC