Re: [TTL] Standardizing N-Triples from William Waites on 2011-04-04 (public-rdf-wg@w3.org from April 2011)

From: William Waites <ww@styx.org>
Date: Mon, 4 Apr 2011 15:11:43 +0200
To: Peter Frederick Patel-Schneider <pfps@research.bell-labs.com>
Cc: steve.harris@garlik.com, eric@w3.org, andy.seaborne@epimorphics.com, nathan@webr3.org, alexhall@revelytix.com, richard@cyganiak.de, public-rdf-wg@w3.org
Message-ID: <20110404131143.GO1333@styx.org>

* [2011-04-04 08:17:10 -0400] Peter Frederick Patel-Schneider <pfps@research.bell-labs.com> wrote:

] I would like to see this sort of argument backed up with numbers
] including all costs, such as I/O.  Ideally, such arguments should come
] with code, so that the quality of the implementation can be checked. 

OK, a quick test using lcsh-20110104.nt, which is available from the
LOC website and contains 4256000 statements.

The tests below are done with rapper which is part of the raptor
package, written in C, publicly available free software, probably one
of the better implementations out there, and just does serialisation
and parsing.

    time rapper -i ntriples -o ntriples lcsh-20110104.nt > /dev/null

    real	1m2.613s
    user	1m0.105s
    sys		0m1.509s

    time rapper -i ntriples -o turtle lcsh-20110104.nt > /dev/null

    real	3m16.250s
    user	2m34.078s
    sys		0m18.509s

    time rapper -i turtle -o ntriples lcsh-20110104.ttl > /dev/null

    real	1m50.161s
    user	1m34.954s
    sys		0m13.135s

    time rapper -i turtle -o turtle lcsh-20110104.ttl > /dev/null

    (memory exhausted, sorry)

When working with turtle, the process grows quite large, which
suggests that a significant part of the file, perhaps all of it, is
held in RAM. In both cases, serialising and parsing, the process ends
up taking about 800MB, and doing both at once would likely mean double
that, which is more free memory than my computer has.

There is probably room for improvement in the turtle parser /
serialiser, but quite obviously it is easy to make a streaming
ntriples serialiser and parser and harder to make one for turtle,
otherwise it would have been done. We can't count on their existence
for turtle but it seems reasonable to expect them for ntriples.  That
is probably the main reason why ntriples is preferred for dumps of
large datasets.
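
To illustrate why a streaming ntriples parser and serialiser is easy,
here is a minimal sketch in Python. It is not how rapper does it, and
the regular expression is a simplification of the real grammar (an
assumption for illustration; it does no escape validation). The names
iter_triples and copy_ntriples are made up for this example. The point
is only that each statement sits on its own line, so nothing needs to
be kept in memory beyond the current line.

```python
import re
from typing import Iterable, Iterator, TextIO, Tuple

# Simplified term patterns. Assumption: good enough for illustration,
# NOT a complete implementation of the N-Triples grammar.
_SUBJ = r'(<[^>]*>|_:\S+)'
_PRED = r'(<[^>]*>)'
_OBJ = r'(<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)'
_TRIPLE = re.compile(
    r'\s*' + _SUBJ + r'\s+' + _PRED + r'\s+' + _OBJ + r'\s*\.\s*$'
)


def iter_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (subject, predicate, object) one statement at a time."""
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Blank lines and comment lines carry no statement.
        if not stripped or stripped.startswith('#'):
            continue
        match = _TRIPLE.match(stripped)
        if match is None:
            raise ValueError(f"line {lineno}: cannot parse {stripped!r}")
        yield match.group(1), match.group(2), match.group(3)


def copy_ntriples(lines: Iterable[str], out: TextIO) -> int:
    """Parse and re-serialise statement by statement; return the count."""
    count = 0
    for s, p, o in iter_triples(lines):
        out.write(f"{s} {p} {o} .\n")
        count += 1
    return count
```

Passing an open file object as lines reads it lazily, so memory use
stays flat regardless of the size of the dump. Turtle offers no such
line-per-statement guarantee, because prefixes, abbreviated subjects
and nested structures span lines.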

Now it is hard to imagine that a turtle parser optimised for lower
memory use would be faster than one that just reads everything into RAM
and munges it. The rough measurements above show the turtle parser to
be almost twice as slow as the ntriples one even though it is not
optimised for memory at all.
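
For anyone who wants to check the arithmetic, the following small
Python sketch converts the "real" times reported above into seconds
and compares each run against the ntriples-to-ntriples baseline. The
helper name seconds is made up for this example. Note that the ratios
compare whole runs (parse plus serialise), so they are only a rough
proxy for parser cost alone.

```python
import re

STATEMENTS = 4256000  # statement count of lcsh-20110104.nt, as given above


def seconds(value: str) -> float:
    """Convert a `time` value such as '1m2.613s' to seconds."""
    match = re.fullmatch(r'(\d+)m([\d.]+)s', value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock ("real") times copied from the runs above.
real = {
    'ntriples -> ntriples': seconds('1m2.613s'),
    'ntriples -> turtle': seconds('3m16.250s'),
    'turtle -> ntriples': seconds('1m50.161s'),
}

baseline = real['ntriples -> ntriples']
for name, secs in real.items():
    print(f"{name}: {secs:.1f}s, "
          f"{secs / baseline:.2f}x baseline, "
          f"{STATEMENTS / secs:,.0f} statements/s")
```

On these figures the turtle-to-ntriples run takes roughly 1.76 times
as long as the baseline, and the ntriples-to-turtle run a little over
three times as long.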

Cheers,
-w
-- 
William Waites                <mailto:ww@styx.org>
http://river.styx.org/ww/        <sip:ww@styx.org>
F4B3 39BF E775 CF42 0BAB  3DF0 BE40 A6DF B06F FD45

Received on Monday, 4 April 2011 13:12:16 UTC