Re: [Turtle] Two formats from Andy Seaborne on 2011-03-03 (public-rdf-wg@w3.org from March 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Thu, 03 Mar 2011 11:18:33 +0000
To: Steve Harris <steve.harris@garlik.com>
CC: Richard Cyganiak <richard@cyganiak.de>, Sandro Hawke <sandro@w3.org>, nathan@webr3.org, RDF-WG <public-rdf-wg@w3.org>
Message-ID: <4D6F7909.808@epimorphics.com>

On 03/03/11 07:51, Steve Harris wrote:
> On 2011-03-02, at 22:13, Richard Cyganiak wrote:
>
>> On 2 Mar 2011, at 19:32, Sandro Hawke wrote:
>>> 2.  The first, our standard version of Turtle, should be very
>>> conservative, inside the space of nearly all existing turtle
>>> documents and software.
>>
>> +1
>>
>>> 3.  We should have a different syntax, with a different
>>> mime-type, for handling [GRAPHS] in a Turtle-like language.
>>>
>>> If that's true, the next big issue is whether this second syntax
>>> is (as Ivan proposed) just Turtle plus the minimum needed to
>>> handle extra graphs (TriG?), or whether (since we don't have
>>> nearly as much BC to worry about) we should take the opportunity
>>> to add some extra stuff here.
>>
>> Adding extra stuff? I'd actually propose the opposite: Let's throw
>> some stuff out from the [GRAPHS] format.
>>
>> At the moment, I see multi-graph formats used mainly to exchange
>> dumps between SPARQL stores. Hence I see this as the main use case
>> to address.
>>
>> We've learned from N-Triples that line-based formats are great for
>> exchanging dumps.
>
> Agreed.
>
>> So, let's take N-Triples and add an optional 4th element to deal
>> with [GRAPHS]. A la N-Quads [1], but being explicit about what the
>> 4th element is. Also add some other good bits along the lines Andy
>> suggested elsewhere (UTF-8, base URI, proper media type). And
>> declare victory.
>
> Yes, but let's make that two formats, not one. I would prefer the
> N-Quads format and media type to mandate 4 columns, to minimise the
> potential surprise once you start parsing a "N-Triples" file.
>
> For one thing, some triplestores have different default behaviours
> when parsing triples formats than quads formats.
>
> In 4store, for example, if you import <file:triples.nt>, by default it
> will remove any existing triples in <file:triples.nt> before inserting
> the "new" ones; this appears to match user expectations. However, if
> you import <file:quads.nq> there is no real point in clearing out the
> <file:quads.nq> graph, as there's not often any data in the base URI
> of the file in an N-Quads file, and users don't seem to want you to go
> round clearing out any graph you find mentioned in the quads dump
> format before inserting: it causes weird import-time behaviour and
> unexpected consequences.
>
> For example, if I have an N-Quads file like:
>
> <http://example.com/a> <http://example.com/p> <http://example.com/b> <http://example.com/G1> .
> ...
> <http://example.com/G1> <http://example.com/contains> <http://example.com/a> <http://example.com/metadata> .
> ...
>
> It may well be surprising that importing this will wipe the
> <http://example.com/metadata> graph, which might be used by
> multiple N-Quads dump files.
>
> The user has no practical way to know if any / what graphs will be
> affected without pre-parsing the N-Quads dump file, which can be
> impractical for very large files.
>
> There's also the question of what to do if you find an N-Triples file
> in the wild, say as part of a web crawl. Currently it's safe to
> import any N-Triples file, and it will only affect triples within the
> graph of the file itself, but someone could deliberately create
> malicious N-Quads files designed to add data to well-known graph
> URIs, or to deliberately corrupt provenance data in related graphs:
>
> <http://example.com/a> <http://example.com/p> <http://example.com/b> <http://example.com/G1> .
> <http://example.com/G1> <http://mystore.example/trustLevel> "1.0"^^<http://www.w3.org/2001/XMLSchema#decimal> <http://example.com/G1#provenance> .
> <http://example.com/G1> <http://purl.org/dc/terms/date> "1970-02-23T00:00:00Z" <http://example.com/G1#provenance> .
>
> and so on
>
> Consequently there are several cases where the user would like to
> have different behaviours depending on whether the file you're
> parsing has 3 or 4 columns, so let's make it easy to find out without
> pre-parsing the whole file.

Does

N-Quads serializes a dataset (default graph and named graphs)
N-Triples serializes a graph

work for you?

It means that in N-Quads, the absence of the 4th column means default 
graph.  I know 4store does not have an independent default graph but
some other systems do.  N-Quads should capture the generality of an RDF 
dataset or graph store.
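
As a rough sketch of that reading (not a validating parser, and not any store's real loader), the lines of a file can be grouped into a dataset where a statement with no 4th term lands in the default graph. The names TERM, DEFAULT_GRAPH, parse_statement and load_dataset are made up for illustration, and using None to stand for the default graph is an assumption.

```python
import re
from collections import defaultdict

# Matches one RDF term: an IRI, a blank node, or a literal with an
# optional datatype or language tag. Deliberately loose, not a full grammar.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[A-Za-z0-9_]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)

DEFAULT_GRAPH = None  # hypothetical marker for the default graph


def parse_statement(line):
    """Return (subject, predicate, object, graph), or None for blank/comment lines."""
    body = line.strip()
    if not body or body.startswith("#"):
        return None
    if not body.endswith("."):
        raise ValueError(f"statement does not end with '.': {line!r}")
    terms = TERM.findall(body[:-1])
    if len(terms) == 3:
        # No 4th column: the triple belongs to the default graph.
        return (*terms, DEFAULT_GRAPH)
    if len(terms) == 4:
        return tuple(terms)
    raise ValueError(f"expected 3 or 4 terms: {line!r}")


def load_dataset(lines):
    """Group statements by graph name; DEFAULT_GRAPH keys the default graph."""
    dataset = defaultdict(list)
    for line in lines:
        statement = parse_statement(line)
        if statement is None:
            continue
        s, p, o, g = statement
        dataset[g].append((s, p, o))
    return dict(dataset)
```

A file with only three-term lines then yields a dataset with just a default graph, which is the sense in which N-Triples content would also be readable as N-Quads under this proposal.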

	Andy
Received on Thursday, 3 March 2011 11:19:15 UTC