- From: Bob MacGregor <bmacgregor@siderean.com>
- Date: Thu, 31 May 2007 11:07:49 -0700
- To: Jeen Broekstra <jeen.broekstra@aduna-software.com>
- Cc: Pat Hayes <phayes@ihmc.us>, public-rdf-dawg-comments@w3.org, Eric Prud'hommeaux <eric@w3.org>, Richard Newman <rnewman@franz.com>
- Message-Id: <260A8F86-45CA-456E-AF67-FE564A1E24FA@siderean.com>
I think that the fundamental problem relates to the fact that the SPARQL language is already obsolete even before it has been finished. This is because current RDF, and the graph-based notion that it promotes, is also obsolete. If one is thinking in terms of quads (e.g., as in Sesame and some major vendor products), then the notion of blank nodes in context position makes perfect sense. However, if you confine your thinking to triples, as Pat has done (correctly in the context of RDF/SPARQL), then I guess that graph names may be necessary.

Why is RDF obsolete? I can point to three serious drawbacks.

The most immediate is that RDF does not provide a practical means for storing models containing large numbers of graphs. The most common way to serialize/store RDF/XML is as a number of individual graphs, e.g., thousands of graphs. Much better would be an N4 or NQuads syntax, or the addition of a ":context" attribute to RDF (a sibling to the ":resource" attribute). Right now, there is no acceptable standard (that I'm aware of) for transmitting models containing large numbers of graphs.

The second drawback is at this point more oblique. Somewhat over a year ago, we implemented a quad compression scheme that not only saves significant space in the presence of large numbers of graphs, but also resulted in an order-of-magnitude performance improvement on models with a few million quads. Analysis showed that the performance differential was roughly linear in the number of graphs (one graph per document), so for larger applications, there would have been several orders of magnitude difference in performance. We have now embedded the compression into the quad store, i.e., we can't turn it off anymore. The compression is lossless except that it does not preserve graph names (since we use blank nodes for contexts/graphs, for us it's not a loss). While I expect it may take a while for the compression scheme to become widespread, performance always wins out.
The third drawback is the difference in mindset. Once you have quads, combined with aggressive use of multiple dimensions of provenance, the notion of graphs introduces a dissonance that makes it harder to visualize what is going on. Take the notion of the "default" graph containing all of the triples from all of the graphs. If we attach security information to each graph (which we often do), then the only time the "all triples" notion makes sense is when you run at system high; for all normal cases, queries only see a subset of the triples belonging to the union of the graphs.

More preposterous is the FROM NAMED construct. This makes sense only if you have a very small number of graphs, and if the names of the graphs are actually meaningful (not normally the case when you are seriously into provenance). Richard Newman's suggestion of FROM NAMED * provides a solution, except that the right syntax for that would be to eliminate FROM NAMED entirely and assume the star holds by default. And we will use GRAPH ?cxt to reference contexts, except that our own product will permit blank nodes to bind to the ?cxt argument.

Cheers, Bob

On May 31, 2007, at 0050, Jeen Broekstra wrote:

> Pat Hayes wrote:
>>
>>> Hi Pat,
>>>
>>> On May 29, 2007, at 1954, Pat Hayes wrote:
>>>
>>>> <snip>
>>>>
>>>> However, I am at a loss to understand how you refer to these
>>>> 150,000 graphs if you have no way to name them. How do you even
>>>> know how many you have?
>>>>
>>>
>>> Each of the graphs consists of triples extracted from a different
>>> document. The document might be identified by a file name, or a
>>> message ID, a documentum identifier, or whatever. The quads for
>>> that document share a common context argument; a blank node. The
>>> same blank node appears in subject position to record provenance
>>> assertions about the graph (which document, which extractor used,
>>> time of extraction, etc).
>>
>> That works as long as everything is inside the intended scope of the
>> blank node identifier, which is usually a document. But a query is
>> not usually inside the same scope as the graph(s) being queried, so
>> to use the blank node as an identifier in the query is (usually)
>> impossible.
>
> Allow me to jump in at this point with my personal POV.
>
> I think you overlook the fact that you can address blank nodes
> 'existentially' from a query, e.g. "give me the triples from the graph
> identified with the source property ex:foo and value ex:bar" :
>
>   SELECT ?x ?y ?z
>   WHERE {
>     ?g ex:foo ex:bar.
>     GRAPH ?g { ?x ?y ?z .}
>   }
>
> Surely in this kind of pattern ?g could well be allowed to be bound
> to a blank node. However, this is currently not possible in SPARQL
> because it explicitly requires that a graph name is a URI.
>
> FWIW we have implementation experience with allowing blank nodes here,
> because that is exactly what Sesame does; we call the mechanism
> 'context' rather than 'named graph', by the way, since the notion of
> 'naming' indeed tends to suggest that it is an actual *name*.
>
> I don't think it's a matter of scope, because the scope of the blank
> node is still the original dataset, it is not directly addressed from
> the query.
>
>
> Cheers,
>
> Jeen
> --
> Aduna - Guided Exploration
> www.aduna-software.com
>
> Prinses Julianaplein 14-b
> 3817 CS Amersfoort
> The Netherlands
> +31-33-4659987 (office)

Bob MacGregor
Chief Scientist
Siderean Software, Inc.
310.647.5690
bmacgregor@siderean.com
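
For readers who want the quoted pattern made concrete, here is a minimal Python sketch of the idea discussed in the thread: each document's triples are stored as quads under a fresh blank-node context, provenance is asserted with that same blank node in subject position, and a lookup finds the context "existentially" by its provenance rather than by name. This is only an illustration under stated assumptions, not how Sesame or the Siderean store works: the `BNode` class, `add_document`, `triples_in_graphs_where`, the choice to keep provenance in a default context (`None`), and the `ex:knows` / `ex:baz` terms are all hypothetical.

```python
from itertools import count


class BNode:
    """Hypothetical blank node: compared by identity, with no global name."""
    _ids = count()

    def __init__(self):
        # The label is local bookkeeping for printing only.
        self.label = f"_:b{next(BNode._ids)}"

    def __repr__(self):
        return self.label


def add_document(store, triples, provenance):
    """Store one document's triples as quads under a fresh blank-node context."""
    ctx = BNode()
    for s, p, o in triples:
        store.append((s, p, o, ctx))
    # Provenance uses the same blank node in subject position.
    # Assumption: provenance quads live in a default context (None).
    for p, o in provenance:
        store.append((ctx, p, o, None))
    return ctx


def triples_in_graphs_where(store, prop, value):
    """Roughly mimic:  ?g prop value .  GRAPH ?g { ?x ?y ?z }"""
    # Step 1: bind ?g to every subject carrying the provenance assertion.
    contexts = {s for s, p, o, c in store
                if c is None and p == prop and o == value}
    # Step 2: return the triples whose context is one of those bindings.
    return [(s, p, o) for s, p, o, c in store if c in contexts]


store = []
add_document(store, [("ex:a", "ex:knows", "ex:b")], [("ex:foo", "ex:bar")])
add_document(store, [("ex:c", "ex:knows", "ex:d")], [("ex:foo", "ex:baz")])

result = triples_in_graphs_where(store, "ex:foo", "ex:bar")
print(result)
```

The query code never mentions a blank node label; the context is reached only through its `ex:foo` / `ex:bar` assertion, which is the point Jeen makes about scope.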
Received on Thursday, 31 May 2007 18:08:25 UTC