Dataset Semantics

I step into this debate, not with any great understanding of the details, but with some expectations as a developer and as an implementer of RDF frameworks.

Part of the problem that I see in the WG dynamics is that there are a number of different ways in which things like a default graph might be used. As a developer, I find that the lack of guidance from this group (past and present) has led to confusion. The same is true of the SPARQL WG, in that implementations are free to treat the default graph differently: as the union of all named graphs, as a separate, unrelated graph with no name, or as the default location for metadata about the named graphs themselves. I think this situation has come about principally because these groups have given so little guidance as to what these features are intended to be used for.
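
To make the three readings concrete, here is a minimal sketch in plain Python. Triples are modeled as tuples, and the graph names, the `ex:` terms, and the `ex:tripleCount` predicate are all made up for illustration; none of this is taken from an actual implementation.

```python
# Hypothetical model: a triple is a (subject, predicate, object) tuple of strings.
Triple = tuple[str, str, str]

# Two illustrative named graphs; the names and contents are assumptions.
named_graphs: dict[str, set[Triple]] = {
    "http://example.org/g1": {("ex:a", "ex:p", "ex:b")},
    "http://example.org/g2": {("ex:c", "ex:p", "ex:d")},
}


def default_as_union(named: dict[str, set[Triple]]) -> set[Triple]:
    """Reading 1: the default graph is the union of all named graphs."""
    return set().union(*named.values())


def default_as_separate(own_triples: set[Triple]) -> set[Triple]:
    """Reading 2: the default graph is its own unnamed graph, unrelated to the others."""
    return set(own_triples)


def default_as_metadata(named: dict[str, set[Triple]]) -> set[Triple]:
    """Reading 3: the default graph holds statements about the named graphs.

    ex:tripleCount is an invented predicate, used only to have something to say.
    """
    return {(name, "ex:tripleCount", str(len(graph))) for name, graph in named.items()}
```

The same pattern matched against "the default graph" returns different answers under each reading, which is the interoperability problem in miniature.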

Not providing guidance now, when there is some implementation experience to draw on, would be, I think, a missed opportunity.

Speaking as an implementer, I expect to be able to use a dataset without advance knowledge of how the data is organized. This requires that there is some metadata that can be used to understand things like entailment regimes and the "meaning" of graph names. The SPARQL Service Description is a natural format for describing this, but there is no default binding to a dataset itself; I think there should be. In my usage, this is typically the default graph, but it could be some other "named" graph; however, if it is named, there doesn't seem to be a way to find it unless there is some normative language for how the dataset description is named within a dataset. I think it is most natural for this to be the default dataset, or for there to be a relation defined within the default dataset which names the dataset description.

IMO, the default graph should be used for metadata about the dataset, including, but not limited to, the SPARQL Service Description. I also believe that I should be able to use information in that service description to reason about the named graphs themselves.
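
As a rough sketch of what I mean, suppose the default graph carries a service description using the SPARQL Service Description vocabulary (`sd:name`, `sd:entailmentRegime`, `sd:defaultEntailmentRegime`). The blank node labels, the graph name, and the lookup function below are assumptions for illustration, and triples are again modeled as plain Python tuples rather than through any RDF library.

```python
SD = "http://www.w3.org/ns/sparql-service-description#"
ENT = "http://www.w3.org/ns/entailment/"

# Hypothetical default graph holding the dataset's own description.
default_graph = {
    ("_:service", SD + "defaultEntailmentRegime", ENT + "Simple"),
    ("_:service", SD + "defaultDataset", "_:dataset"),
    ("_:dataset", SD + "namedGraph", "_:ng1"),
    ("_:ng1", SD + "name", "http://example.org/g1"),
    ("_:ng1", SD + "entailmentRegime", ENT + "RDFS"),
}


def entailment_regime(description, graph_name):
    """Return the regime declared for graph_name, else the service default, else None."""
    # Nodes that describe the named graph with this name.
    nodes = {s for s, p, o in description if p == SD + "name" and o == graph_name}
    for s, p, o in description:
        if s in nodes and p == SD + "entailmentRegime":
            return o
    # No graph-specific declaration: fall back to the service-wide default.
    for s, p, o in description:
        if p == SD + "defaultEntailmentRegime":
            return o
    return None
```

A client that knows only "look in the default graph" can then discover how to treat `http://example.org/g1` without any built-in knowledge of the dataset.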

As an example use case, I might load information from a particular Wiki page into a graph named with the URL of the page, along with query parameters indicating a particular version of that page (clearly, the format of these URLs is arbitrary, but the way to describe them should be normative). If the page changes, I would likely load the data into a new named graph. I'd like to be able to use information in the dataset description to identify the most current version of the named graphs for a particular page, and potentially the named graphs for a collection of these pages (all pages from a current wiki, for example) at a specific time. I can imagine a system in which these graphs could be described using a vocabulary that allowed me to construct consistent SPARQL queries across these named graphs, but only if the location and semantics of this information can be determined without built-in knowledge of the dataset semantics. Perhaps this is too ambitious, but I believe that this is where we should be going in the long run.
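
A minimal sketch of the wiki case, assuming the dataset description records, for each named graph, which page it is a version of and when it was loaded. The `versionOf` and `loadedAt` properties, the URLs, and the timestamps are all invented for illustration; the point is only that such a vocabulary would need to be findable and normative.

```python
EX = "http://example.org/vocab#"   # hypothetical vocabulary
VERSION_OF = EX + "versionOf"
LOADED_AT = EX + "loadedAt"
PAGE = "http://wiki.example.org/Page"

# Hypothetical dataset description: one entry per named graph (page version).
description = {
    (PAGE + "?v=1", VERSION_OF, PAGE),
    (PAGE + "?v=1", LOADED_AT, "2012-09-01T00:00:00Z"),
    (PAGE + "?v=2", VERSION_OF, PAGE),
    (PAGE + "?v=2", LOADED_AT, "2012-09-15T00:00:00Z"),
}


def latest_version(desc, page, as_of=None):
    """Name of the most recently loaded graph for page, optionally no later than as_of.

    Timestamps are assumed to be UTC strings in one fixed ISO 8601 layout,
    so plain string comparison orders them chronologically.
    """
    versions = {s for s, p, o in desc if p == VERSION_OF and o == page}
    candidates = [
        (o, s)
        for s, p, o in desc
        if p == LOADED_AT and s in versions and (as_of is None or o <= as_of)
    ]
    return max(candidates)[1] if candidates else None
```

The graph name returned here could then be substituted into a `GRAPH <...> { ... }` clause, giving queries that stay the same as new versions of the page are loaded.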

Gregg Kellogg
gregg@greggkellogg.net

Received on Wednesday, 19 September 2012 20:43:01 UTC