about utf-8 support in N-Triples from Pierre-Antoine Champin on 2011-07-20 (public-rdf-wg@w3.org from July 2011)

From: Pierre-Antoine Champin <pierre-antoine.champin@liris.cnrs.fr>
Date: Wed, 20 Jul 2011 18:26:30 +0200
To: "public-rdf-wg@w3.org" <public-rdf-wg@w3.org>
Message-ID: <4E2701B6.30703@liris.cnrs.fr>

Hi all,

it seems that I did not make myself clear during today's telecon about
my concern with making N-Triples UTF-8 compliant.

Don't get me wrong: I would *love* to see N-Triples support UTF-8 (and
rid the universe of ASCII, for that matter ;)

But my concern is that:
* N-Triples will still support \uXXXX escaping
* so there would be several ways to serialize a literal in N-Triples

I could perfectly live with that, but I think that one use-case of
N-Triples is to be processed by RDF-unaware tools, such as grep, sed or
sort.

I know those tools have perfect UTF-8 support; but they don't know that
"\u00e9" is the same as "é". So if I'm grep'ing an N-Triples file for
the string "trouvé", I may miss it if it is spelled "trouv\u00e9". And
if I'm sort'ing the triples, the escaped characters will not be
interpreted, and so will be sorted wrongly.
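
To make this concrete, here is a rough Python sketch of what a
byte-oriented tool sees. The example.org IRIs and the helper
unescape_uxxxx are made up for illustration, and the helper is
deliberately simplistic: it handles only \uXXXX (not \UXXXXXXXX) and
does not guard against an escaped backslash preceding the "u".

```python
import re

# Two serializations of the same triple (the example.org IRIs are invented).
raw_line = '<http://example.org/s> <http://example.org/p> "trouvé" .'
escaped_line = '<http://example.org/s> <http://example.org/p> "trouv\\u00e9" .'


def unescape_uxxxx(line):
    """Replace each \\uXXXX escape with the character it denotes.

    Simplified on purpose: \\UXXXXXXXX and escaped backslashes are ignored.
    """
    return re.sub(
        r'\\u([0-9A-Fa-f]{4})',
        lambda m: chr(int(m.group(1), 16)),
        line,
    )


# A plain substring search, which is roughly what grep does.
found_raw = "trouvé" in raw_line
found_escaped = "trouvé" in escaped_line
found_after_unescape = "trouvé" in unescape_uxxxx(escaped_line)

# A plain code-point sort, standing in for sort(1) in a simple locale.
# The backslash (U+005C) sorts before "z", while "é" (U+00E9) sorts after it.
sorted_escaped = sorted(['"trouvz"', '"trouv\\u00e9"'])
sorted_raw = sorted(['"trouvz"', '"trouvé"'])

print(found_raw, found_escaped, found_after_unescape)
print(sorted_escaped)
print(sorted_raw)
```

The search only finds the escaped spelling once the escapes have been
decoded, and the two spellings end up on opposite sides of "trouvz" when
sorted.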

This is my concern with making N-Triples UTF-8 compliant: we lose the
nice property it had of offering exactly one way to serialize a given graph.

Would it be possible to specify that \uXXXX escaping can only be used
in ASCII files, while UTF-8 files *must* use the UTF-8 encoding?
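
A check for such a rule could be as simple as the sketch below. This is
only one possible reading of the proposal: it assumes that "a UTF-8
file" means a file containing at least one non-ASCII byte, and the
function name mixes_styles is invented for the example.

```python
import re

# A literal backslash, "u", and four hex digits, matched on raw bytes.
UXXXX = re.compile(rb'\\u[0-9A-Fa-f]{4}')


def mixes_styles(data: bytes) -> bool:
    """Return True if the data contains both raw non-ASCII bytes and \\uXXXX escapes.

    Assumption: a file with any byte above 0x7F counts as a "UTF-8 file",
    in which \\uXXXX escapes would be forbidden under the proposed rule.
    """
    has_raw_non_ascii = any(byte > 0x7F for byte in data)
    has_escape = UXXXX.search(data) is not None
    return has_raw_non_ascii and has_escape


ascii_only = b'<http://example.org/s> <http://example.org/p> "trouv\\u00e9" .\n'
utf8_only = '<http://example.org/s> <http://example.org/p> "trouvé" .\n'.encode('utf-8')
mixed = ascii_only + utf8_only

print(mixes_styles(ascii_only), mixes_styles(utf8_only), mixes_styles(mixed))
```

Only the mixed input is flagged; a purely escaped file and a purely
UTF-8 file both pass.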

  pa

Received on Wednesday, 20 July 2011 16:27:14 UTC