Re: Shell-native way of converting N-Quads to N-Triples

On Tue, Mar 07, 2023 at 02:26:43PM +0100, Martynas Jusevičius wrote:
> I found an answer from years ago saying "you can convert quads to
> triples with sed/perl", but no actual example on how to do it. Does
> anyone have such a script, ideally as shell-native as possible,
> without additional dependencies?
> 
> I've tried Jena's riot command. It doesn't do what I need because when
> reading quads and writing triples it writes the default graph, which
> is empty.
> 
> Currently I'm using a CONSTRUCT query and Jena's sparql command, but
> it's rather slow on large files.

   What I've found most useful when doing basic RDF processing in the
shell is to convert everything to N-Triples or N-Quads at the start,
and then convert back afterwards. I usually use rapper for this --
although, like most RDF tooling, it has an annoying habit of trying to
load everything into RAM, which limits the size of the input data
files. Somewhere I've got a small Python tool I wrote that can split a
Turtle file into smaller self-contained files on a purely syntactic
basis, which makes it easier to convert to N-Triples.

   Once you've got your data in one line per triple, it's then much
easier to deal with the RDF data using shell tools. But be very
careful of string literals with embedded newlines -- those aren't easy
to deal with in basic shell tools.
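
   As a rough illustration of what line-per-statement data allows, here
is a minimal sketch; the function names are made up for this example.
It relies on subjects and predicates never containing spaces, so the
first two space-separated fields are safe to cut on, and it assumes
the input has no comment or blank lines. Conforming serialisers escape
newlines inside literals as \n, so the second function only flags
lines that don't look like complete statements. It is a cheap sanity
check, not a parser.

```shell
# Hypothetical helpers; both read N-Triples or N-Quads on stdin.

# Count how often each predicate occurs, most frequent first.
# Assumes no comment or blank lines in the input.
nt_predicate_counts() {
    cut -d' ' -f2 | sort | uniq -c | sort -rn
}

# Print (with line numbers) any line that is not blank, not a comment,
# and does not end in whitespace followed by a full stop.
nt_suspect_lines() {
    grep -n -v -E '^[[:space:]]*(#.*)?$|[[:space:]]\.[[:space:]]*$'
}
```

   Usage would be something like "nt_predicate_counts < data.nt", where
data.nt stands for whatever file is being examined.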

   There are a few interesting solutions I've come up with, like using
rev|cut|rev to reliably and simply pull out things from the end of
each line (e.g. the graph in N-Quads).
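
   A minimal sketch of that trick, applied to the N-Quads case from the
subject line. The function names are hypothetical, and it assumes
every line carries a graph label, has exactly one space before the
final full stop, and has nothing after it (no trailing whitespace, no
comments, no blank lines).

```shell
# Hypothetical helpers; both read N-Quads on stdin.
# Assumes each line ends in "<graph> ." with single spaces and nothing
# after the full stop.

# Print the graph label of each quad: with the line reversed, the
# full stop is field 1 and the graph label is field 2.
nq_graph() {
    rev | cut -d' ' -f2 | rev
}

# Drop the graph label and re-terminate each line, giving N-Triples.
nq_to_nt() {
    rev | cut -d' ' -f3- | rev | sed 's/$/ ./'
}
```

   Usage would be something like "nq_to_nt < data.nq > data.nt". Working
from the end of the line is what makes this safe against spaces inside
literals, since graph labels themselves never contain spaces. A line
with no graph label would lose the tail of its object instead, so
statements in the default graph need filtering out first.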

   In general, I've found this kind of thing to be useful for simple
ad-hoc data-mangling or data-analysis, but it's not an approach I'd
adopt for an actual repeatable data pipeline.

   Hugo.

-- 
Hugo Mills             | Reading Mein Kampf won't make you a Nazi. Reading
hugo@... carfax.org.uk | Das Kapital won't make you a communist. But most
http://carfax.org.uk/  | trolls started out with a copy of Lord of the Rings.
PGP: E2AB1DE4          |

Received on Tuesday, 7 March 2023 14:00:11 UTC