Re: [TTL] Standardizing N-Triples

* Andy Seaborne <andy.seaborne@epimorphics.com> [2011-04-02 16:31+0100]
> 
> 
> On 02/04/11 01:34, Steve Harris wrote:
> >On 2011-04-01, at 21:39, Nathan wrote:
> >
> >>Eric Prud'hommeaux wrote:
> >>>* Alex Hall<alexhall@revelytix.com>  [2011-04-01 15:29-0400]
> >>>>On Fri, Apr 1, 2011 at 3:21 PM, Nathan<nathan@webr3.org>  wrote:
> >>>>
> >>>>>Andy Seaborne wrote:
> >>>>>
> >>>>>>On 01/04/11 20:06, Nathan wrote:
> >>>>>>
> >>>>>>>Andy Seaborne wrote:
> >>>>>>>
> >>>>>>>>Are there examples of real worlds data that uses relative IRIs in
> >>>>>>>>N-triples? If not, we could decide that there is no base processing in
> >>>>>>>>RDF-triples, absolute IRIs only.
> >>>>>>>>
> >>>>>>>How can we have @base processing if there are no directives or @base
> >>>>>>>definitions? I'd strongly suggest we keep this to *IRI*s only.
> >>>>>>>
> >>>>>>The base is also set by where the file is read from.
> >>>>>>
> >>>>>Indeed, reliably though? for instance taking in to account the file being
> >>>>>sent by email, being part of a zip archive, being in the message body of a
> >>>>>PUT HTTP request, being in the body of a GET HTTP response with a
> >>>>>Content-Location which differs from the effective request URI?
> >>>>>
> >>>>>Personally, I'd quite like that can of worms left closed for RDF-Triples :)
> >>>>>
> >>>>+1, but that reflects my bias as a developer, where often times all I'm
> >>>>handed is an input stream with no information about where the content came
> >>>>from.  It's nice to be able to use that information when it's available, but
> >>>>I think it's extra complexity that's best left out of a simple format like
> >>>>N-Triples.
> >>>I'm a big fan of relocatable data and often take advantage of the
> >>>ability to have a set of interrelated resources which can be moved
> >>>from one location to another, or accessed both via e.g. http: and
> >>>file: protocols. As an example, the SPARQL test suite manifests have
> >>>relative references to the data, queries and expected results. This
> >>>allows me to run the tests off the web or to download a tarball to an
> >>>arbitrary location and run the tests. Relative references are a very
> >>>handy element of web architecture.
> >>>I expect that, if we demand absolute IRIs, folks will get around it
> >>>with sed scripts and the like, but it will be an unnecessary pain.
> >>
> >>A very good point Eric, personally I hadn't come across this with N-Triples yet due to my own use-cases so far, although I guess in hindsight I can see uses for relative IRIs here too..
> >>
> >>Jury's out for me on this one I'm afraid, can't weigh up the cost / possible ambiguity of relative IRIs vs having a simple unambiguous format.
> >>
> >>Saying that.. I think we can reasonably expect people only to use relative IRIs on the web, and not come crying because they've used them in a base-less environment..!
> >
> >Most (all?) of the other RDF syntaxes already allow for relative IRIs, so it doesn't add any new requirement to a system that can already handle RDF.
> >
> >I agree with Eric that it's useful, I'm not sure whether there will be systems that only consume NTriples though.
> 
> The relative IRI thing can be achieved by using and serving up
> Turtle. We could therefore keep N-triples with a design centered on
> a dump format, and sticking to only absolute IRIs makes sense there
> to "freeze" the data.

It's actually this use case which motivated me to consider the value
of transportability. Well, that, plus simple generator scripts (for
e.g. dumping a database) which are portable between systems if they
don't embed a base IRI. I'm not sure this matters a lot one way or the
other; just trying to guess the discriminators which will cause folks
to use NTriples.
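
To make the relocation point concrete, here is a minimal sketch of how
the same relative reference lands in different places depending on the
base it is resolved against. The base IRIs and file names are made up
for illustration, and Python's urllib.parse.urljoin stands in for
whatever resolver a real parser would use.

```python
from urllib.parse import urljoin


def resolve(base: str, ref: str) -> str:
    """Resolve a possibly-relative IRI reference against a base IRI."""
    return urljoin(base, ref)


# Hypothetical locations of the same manifest: on the web, and unpacked
# from a tarball onto local disk.
web_base = "http://example.org/tests/manifest.nt"
local_base = "file:///tmp/tests/manifest.nt"

# The relative reference follows the document wherever it is moved.
print(resolve(web_base, "data/d1.ttl"))
print(resolve(local_base, "data/d1.ttl"))

# An absolute IRI is unaffected by the base.
print(resolve(web_base, "http://example.com/abs"))
```

A dump containing only absolute IRIs skips this step entirely, which is
the "frozen" behaviour Andy describes.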

I'm not actually convinced that it's worth foisting another
sublanguage (or profile, if you prefer) on the world. I understand
that the principal motivation is the efficiency of dumping and
reloading, but I expect that far more clock cycles get spent
responsibly lexing IRIs and Unicode literals than on all the rest of
the productions which distinguish Turtle from NTriples.

Note the complexity of IRI:
     '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
which expands to quite a number of automata:
     '<'
     ([#-;=?-\[\]_a-z~-\x7F]
      |([\xC2-\xDF][\x80-\xBF])
      |(\xE0([\xA0-\xBF][\x80-\xBF]))
      |([\xE1-\xEC][\x80-\xBF][\x80-\xBF])
      |(\xED([\x80-\x9F][\x80-\xBF]))
      |([\xEE-\xEF][\x80-\xBF][\x80-\xBF])
      |(\xF0([\x90-\xBF][\x80-\xBF][\x80-\xBF]))
      |([\xF1-\xF3][\x80-\xBF][\x80-\xBF][\x80-\xBF])
      |(\xF4([\x80-\x8E][\x80-\xBF][\x80-\xBF])
             |(\x8F([\x80-\xBE][\x80-\xBF])
                   |(\xBF[\x80-\xBD])))
      )*
     '>'

(UTF-8 strings are similarly bulky.) You can map UTF-8 to e.g. 32 bit
chars in the input, but that just means another piece of software is
navigating the automata.
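
As a rough illustration of that trade-off, the sketch below matches the
IRI production over code points rather than bytes. It assumes Python's
re module and a strict UTF-8 decode; the function name lex_iri is
hypothetical. The pattern becomes short again, but only because the
decoder has already walked the byte-level UTF-8 automaton.

```python
import re

# Code-point-level form of the IRI production:
#   '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
IRI = re.compile(r'<[^<>"{}|^`\\\x00-\x20]*>')


def lex_iri(raw: bytes):
    """Return the IRI token at the start of raw, or None if there is none."""
    try:
        # Strict decoding rejects malformed UTF-8; this is where the
        # byte-level work happens.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
    m = IRI.match(text)
    return m.group(0) if m else None


print(lex_iri(b"<http://example.org/a> ."))
print(lex_iri("<http://example.org/\u00e9>".encode("utf-8")))
print(lex_iri(b"<http://example.org/\xff>"))  # malformed UTF-8
```

This decodes the whole input up front for simplicity; a real lexer
would decode incrementally.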

Given that printing Turtle that looks like NTriples is as efficient as
printing NTriples, I guess the useful input would be benchmarks for
parsing UTF-8y NTriples vs. Turtle. If there's not a 20% speed-up,
I'd imagine that the social cost of another language exceeds its
value.

Parsing conventional US-ASCII NTriples is quite simple, even with the
\u's needed for non-ASCII chars. CJK (Japanese, Chinese) chars are six
bytes as \u escapes instead of three in UTF-8. Most other chars, e.g.
western European, are two bytes in UTF-8. ASCII NTriples are bulky but
fast, but I'm not sure that motivates the social cost of another
language.
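
The size difference is easy to check. The sketch below compares the
UTF-8 encoding of a character with its ASCII escape; the helper name
escaped_ascii is made up, and it assumes the \uXXXX / \UXXXXXXXX escape
forms with uppercase hex digits.

```python
def escaped_ascii(s: str) -> bytes:
    """Encode s as US-ASCII, escaping every non-ASCII code point."""
    out = []
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # six bytes per character
        else:
            out.append("\\U%08X" % cp)     # ten bytes per character
    return "".join(out).encode("ascii")


# One western European character and one CJK character.
for ch in ("\u00e9", "\u65e5"):
    utf8_len = len(ch.encode("utf-8"))
    ascii_len = len(escaped_ascii(ch))
    print("U+%04X utf-8=%d bytes, escaped=%d bytes" % (ord(ch), utf8_len, ascii_len))
```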


> Turtle would be the usual format, N-Triples a subsidiary format,
> with the RDF specs (primer) in Turtle and just mentioning N-Triples
> in passing.  I'd expect people to author in Turtle.
> 
> 	Andy
> 
> >- Steve
> >

-- 
-ericP

Received on Saturday, 2 April 2011 20:29:19 UTC