Re: [Turtle] Two formats (was: Re: Turtle, Qurtle, Super-Turtle, N-Triple, N-Quads, Trig - BC and Scope) from Steve Harris on 2011-03-03 (public-rdf-wg@w3.org from March 2011)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 3 Mar 2011 07:51:17 +0000
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Sandro Hawke <sandro@w3.org>, nathan@webr3.org, RDF-WG <public-rdf-wg@w3.org>
Message-Id: <691E58B3-8C87-4103-A660-E8FCB15902E6@garlik.com>

On 2011-03-02, at 22:13, Richard Cyganiak wrote:

> On 2 Mar 2011, at 19:32, Sandro Hawke wrote:
>> 2.  The first, our standard version of Turtle, should be very
>> conservative, inside the space of nearly all existing turtle documents
>> and software.
> 
> +1
> 
>> 3.  We should have a different syntax, with a different mime-type, for
>> handling [GRAPHS] in a Turtle-like language. 
>> 
>> If that's true, the next big issue is whether this second syntax is (as
>> Ivan proposed) just Turtle plus the minimum needed to handle extra
>> graphs (TriG?), or whether (since we don't have nearly as much BC to
>> worry about) we should take the opportunity to add some extra stuff
>> here.
> 
> Adding extra stuff? I'd actually propose the opposite: Let's throw some stuff out from the [GRAPHS] format.
> 
> At the moment, I see multi-graph formats used mainly to exchange dumps between SPARQL stores. Hence I see this as the main use case to address.
> 
> We've learned from N-Triples that line-based formats are great for exchanging dumps.

Agreed.

> So, let's take N-Triples and add an optional 4th element to deal with [GRAPHS]. A la N-Quads [1], but being explicit about what the 4th element is. Also add some other good bits along the lines Andy suggested elsewhere (UTF-8, base URI, proper media type). And declare victory.

Yes, but let's make that two formats, not one. I would prefer the N-Quads format and media type to mandate 4 columns, to minimise the potential surprise once you start parsing an "N-Triples" file.

For one thing, some triplestores have different default behaviours when parsing triples formats than quads formats.

In 4store, for example, if you import <file:triples.nt>, by default it will remove any existing triples in <file:triples.nt> before inserting the "new" ones; this appears to match user expectations. However, if you import <file:quads.nq> there is no real point in clearing out the <file:quads.nq> graph, as an N-Quads file rarely puts any data in the graph named by the file's own base URI. Users also don't seem to want you to go round clearing out every graph you find mentioned in a quads dump before inserting: it causes weird import-time behaviour and unexpected consequences.
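To make the two defaults concrete, here is a rough sketch in Python. It is only a toy model of the behaviour described above, not 4store's actual code: the store is a plain dict from graph name to a set of triples, and the function names are made up for illustration.

```python
from collections import defaultdict

def import_triples(store, file_uri, triples):
    """Triples file: replace whatever the graph named after the file held."""
    store[file_uri] = set(triples)

def import_quads(store, quads):
    """Quads file: add to the named graphs, never clear them first."""
    for s, p, o, g in quads:
        store[g].add((s, p, o))

# Hypothetical store: graph name -> set of (s, p, o) triples.
store = defaultdict(set)

# Importing the same triples file twice replaces the earlier contents.
import_triples(store, "file:triples.nt", [("a", "p", "b")])
import_triples(store, "file:triples.nt", [("a", "p", "c")])

# Importing quads twice only ever adds to the graphs they mention.
import_quads(store, [("a", "p", "b", "http://example.com/G1")])
import_quads(store, [("a", "p", "c", "http://example.com/G1")])
```

After this runs, the <file:triples.nt> graph holds only the second triple, while <http://example.com/G1> holds both.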

For example, if I have an N-Quads file like:

<http://example.com/a> <http://example.com/p> <http://example.com/b> <http://example.com/G1> .
...
<http://example.com/G1> <http://example.com/contains> <http://example.com/a> <http://example.com/metadata> .
...

It may well be surprising that importing this will wipe the <http://example.com/metadata> graph, which might be used by multiple N-Quads dump files.

The user has no practical way to know which graphs, if any, will be affected without pre-parsing the N-Quads dump file, which can be impractical for very large files.
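For what it's worth, the pre-parse would look something like the Python sketch below. The regular expression is a deliberately simplified approximation of the N-Quads term syntax (it is not a conforming parser), and graphs_in_dump is a made-up name. The point is that every line has to be read before the answer is known.

```python
import re

# Simplified term pattern (an assumption, not the full grammar):
# an IRI, a blank node, or a literal with optional datatype or language tag.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[^\s]+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)

def graphs_in_dump(lines):
    """Return the set of graph names found in the fourth column."""
    graphs = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        terms = TERM.findall(line)
        if len(terms) == 4:
            graphs.add(terms[3])
    return graphs

dump = [
    "<http://example.com/a> <http://example.com/p> <http://example.com/b> <http://example.com/G1> .",
    "<http://example.com/G1> <http://example.com/contains> <http://example.com/a> <http://example.com/metadata> .",
]
affected = graphs_in_dump(dump)
```

Here affected ends up containing both <http://example.com/G1> and <http://example.com/metadata>, but only after a full pass over the input.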

There's also the question of what to do if you find an N-Triples file in the wild, say as part of a web crawl. Currently it's safe to import any N-Triples file, and it will only affect triples within the graph of the file itself, but someone could deliberately create malicious N-Quads files designed to add data to well-known graph URIs, or to deliberately corrupt provenance data in related graphs:

<http://example.com/a> <http://example.com/p> <http://example.com/b> <http://example.com/G1> .
<http://example.com/G1> <http://mystore.example/trustLevel> "1.0"^^<http://www.w3.org/2001/XMLSchema#decimal> <http://example.com/G1#provenance> .
<http://example.com/G1> <http://purl.org/dc/terms/date> "1970-02-23T00:00:00Z" <http://example.com/G1#provenance> .

and so on
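One way a crawler might defend itself, sketched in Python: only accept quads whose graph name is on a list the importer already trusts for that source. The allow-list policy and the name split_by_trust are purely illustrative assumptions, not something any store is known to implement.

```python
def split_by_trust(quads, allowed_graphs):
    """Partition (s, p, o, g) tuples into accepted and rejected lists."""
    accepted, rejected = [], []
    for quad in quads:
        if quad[3] in allowed_graphs:
            accepted.append(quad)
        else:
            rejected.append(quad)
    return accepted, rejected

# The three quads from the example above, as (s, p, o, g) tuples.
quads = [
    ("<http://example.com/a>", "<http://example.com/p>",
     "<http://example.com/b>", "<http://example.com/G1>"),
    ("<http://example.com/G1>", "<http://mystore.example/trustLevel>",
     '"1.0"^^<http://www.w3.org/2001/XMLSchema#decimal>',
     "<http://example.com/G1#provenance>"),
    ("<http://example.com/G1>", "<http://purl.org/dc/terms/date>",
     '"1970-02-23T00:00:00Z"', "<http://example.com/G1#provenance>"),
]

# Assumption: the importer trusts this source to write to G1 only.
accepted, rejected = split_by_trust(quads, {"<http://example.com/G1>"})
```

With that policy the data quad is accepted and the two provenance quads are rejected. Note that the check only makes sense once the importer knows it is dealing with a quads file in the first place.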

Consequently there are several cases where the user would like different behaviour depending on whether the file being parsed has 3 or 4 columns, so let's make it easy to find out without pre-parsing the whole file.
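With two distinct formats the decision can be made from the declared format alone. A minimal Python sketch, assuming the format is signalled by the .nt / .nq file extensions used above (a check on the media type would work the same way); the policy names are invented for illustration.

```python
import os

# Hypothetical mapping from file extension to default import behaviour.
POLICY_BY_EXTENSION = {
    ".nt": "replace-graph",  # triples: clear the file's own graph, then insert
    ".nq": "append-only",    # quads: insert only, never clear graphs
}

def import_policy(filename):
    """Pick the import behaviour without reading the file's contents."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return POLICY_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"cannot tell triples from quads: {filename!r}")
```

If one format allowed an optional 4th column, a lookup like this could not be trusted and the importer would be back to scanning the file.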

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Thursday, 3 March 2011 07:51:52 UTC