Re: [Turtle] Two formats (was: Re: Turtle, Qurtle, Super-Turtle, N-Triple, N-Quads, Trig - BC and Scope)

From: Steve Harris <steve.harris@garlik.com>
Date: Thu, 3 Mar 2011 07:51:17 +0000
Cc: Sandro Hawke <sandro@w3.org>, nathan@webr3.org, RDF-WG <public-rdf-wg@w3.org>
Message-Id: <691E58B3-8C87-4103-A660-E8FCB15902E6@garlik.com>
To: Richard Cyganiak <richard@cyganiak.de>

On 2011-03-02, at 22:13, Richard Cyganiak wrote:

> On 2 Mar 2011, at 19:32, Sandro Hawke wrote:
>> 2.  The first, our standard version of Turtle, should be very
>> conservative, inside the space of nearly all existing turtle documents
>> and software.
> 
> +1
> 
>> 3.  We should have a different syntax, with a different mime-type, for
>> handling [GRAPHS] in a Turtle-like language. 
>> 
>> If that's true, the next big issue is whether this second syntax is (as
>> Ivan proposed) just Turtle plus the minimum needed to handle extra
>> graphs (TriG?), or whether (since we don't have nearly as much BC to
>> worry about) we should take the opportunity to add some extra stuff
>> here.
> 
> Adding extra stuff? I'd actually propose the opposite: Let's throw some stuff out from the [GRAPHS] format.
> 
> At the moment, I see multi-graph formats used mainly to exchange dumps between SPARQL stores. Hence I see this as the main use case to address.
> 
> We've learned from N-Triples that line-based formats are great for exchanging dumps.

Agreed.

> So, let's take N-Triples and add an optional 4th element to deal with [GRAPHS]. A la N-Quads [1], but being explicit about what the 4th element is. Also add some other good bits along the lines Andy suggested elsewhere (UTF-8, base URI, proper media type). And declare victory.

Yes, but let's make that two formats, not one. I would prefer the N-Quads format and media type to mandate 4 columns, to minimise the potential surprise once you start parsing an "N-Triples" file.

For one thing, some triplestores have different default behaviours when parsing triples formats than quads formats.

In 4store, for example, if you import <file:triples.nt>, by default it will remove any existing triples in the <file:triples.nt> graph before inserting the "new" ones; this appears to match user expectations. However, if you import <file:quads.nq> there is no real point in clearing out the <file:quads.nq> graph, as an N-Quads file rarely contains any data in the graph named by the file's own URI. Users also don't seem to want you to go round clearing out every graph you find mentioned in a quads dump before inserting: it causes weird import-time behaviour and unexpected consequences.

For example, if I have an N-Quads file like:

<http://example.com/a> <http://example.com/p> <http://example.com/b> <http://example.com/G1> .
...
<http://example.com/G1> <http://example.com/contains> <http://example.com/a> <http://example.com/metadata> .
...

It may well be surprising that importing this will wipe the <http://example.com/metadata> graph, which might be used by multiple N-Quads dump files.

The user has no practical way to know which graphs, if any, will be affected without pre-parsing the N-Quads dump file, which can be impractical for very large files.
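
To illustrate the cost, here is a minimal Python sketch of that pre-parse: it has to read every statement just to collect the graph names in the fourth column. The names split_terms and graphs_mentioned are made up for illustration, and the term pattern is a simplification that does not cover the full N-Triples grammar.

```python
import re

# Simplified term pattern (an assumption, not the full grammar):
# an IRI, a blank node, or a literal with optional datatype or language tag.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)


def split_terms(line):
    """Return the terms of one statement, or [] for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return []
    return TERM.findall(line)


def graphs_mentioned(lines):
    """Collect every graph name that appears as a 4th column.

    This needs a full pass over the input, which is the cost at issue.
    """
    graphs = set()
    for line in lines:
        terms = split_terms(line)
        if len(terms) == 4:
            graphs.add(terms[3])
    return graphs
```

For the two example quads above, this would report <http://example.com/G1> and <http://example.com/metadata>, but only after reading the whole dump.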

There's also the question of what to do if you find an N-Triples file in the wild, say as part of a web crawl. Currently it's safe to import any N-Triples file, and it will only affect triples within the graph of the file itself, but someone could deliberately create malicious N-Quads files designed to add data to well-known graph URIs, or to deliberately corrupt provenance data in related graphs:

<http://example.com/a> <http://example.com/p> <http://example.com/b> <http://example.com/G1> .
<http://example.com/G1> <http://mystore.example/trustLevel> "1.0"^^<http://www.w3.org/2001/XMLSchema#decimal> <http://example.com/G1#provenance> .
<http://example.com/G1> <http://purl.org/dc/terms/date> "1970-02-23T00:00:00Z" <http://example.com/G1#provenance> .

and so on

Consequently there are several cases where the user would like different behaviour depending on whether the file being parsed has 3 or 4 columns, so let's make it easy to find out without pre-parsing the whole file.
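
As a sketch of what two separate formats would allow, an importer could pick its behaviour from the media type alone, before reading a single statement. The function name import_plan and the media type strings below are assumptions for illustration, not agreed names; the clear-then-insert default for triples mirrors the 4store behaviour described above.

```python
# Hypothetical media type names, used only for this sketch.
TRIPLES_TYPE = "application/n-triples"
QUADS_TYPE = "application/n-quads"


def import_plan(source_uri, media_type):
    """Decide which graphs to clear before import, from the media type alone."""
    if media_type == TRIPLES_TYPE:
        # Triples file: replace the graph named after the file itself.
        return {"clear_graphs": [source_uri], "target_graph": source_uri}
    if media_type == QUADS_TYPE:
        # Quads file: each statement names its own graph, so clear nothing.
        return {"clear_graphs": [], "target_graph": None}
    raise ValueError("unsupported media type: " + media_type)
```

With a single format where the 4th column is optional, the same decision could not be made without scanning the file first.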

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Thursday, 3 March 2011 07:51:52 UTC
