Re: about utf-8 support in N-Triples from Gavin Carothers on 2011-07-20 (public-rdf-wg@w3.org from July 2011)

From: Gavin Carothers <gavin@topquadrant.com>
Date: Wed, 20 Jul 2011 10:08:25 -0700
To: Pierre-Antoine Champin <pierre-antoine.champin@liris.cnrs.fr>
Cc: "public-rdf-wg@w3.org" <public-rdf-wg@w3.org>
Message-ID: <CAPqY83zYeFpr4hXAHkotg2wG=ApZ7-suS9FdczuU80jZquyRaA@mail.gmail.com>

On Wed, Jul 20, 2011 at 9:26 AM, Pierre-Antoine Champin
<pierre-antoine.champin@liris.cnrs.fr> wrote:
> Hi all,
>
> it seems that I could not make myself clear during today's telecon, on
> my concern about making N-Triples utf-8 compliant.
>
> Don't get me wrong: I would *love* to see N-Triples support utf-8 (and
> get the universe rid of ASCII, for that matter ;)
>
> But my concern is that:
> * N-Triples will still support \uXXXX escaping

+1, all current N-Triples files MUST continue to be valid Turtle and
N-Triples files.

> * so there would be several ways to serialize a literal in N-Triples

Ah, yeah, that is more of an issue.

>
> I could perfectly live with that, but I think that one use-case of
> N-Triples is to be processed by RDF-unaware tools, such as grep, sed or
> sort.
>
> I know those tools have perfect utf-8 support; but they don't know that
> "\u00e9" is the same as "é". So if I'm grep'ing an N-Triples file for
> the string "trouvé", I may miss it if it is spelled "trouv\u00e9".

Already true today. But yeah, I guess a search could hit SOME literals
written in raw UTF-8 and miss others that are escaped.
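
A rough sketch of how one might normalize a file before grep'ing it,
in Python. The function name unescape_uchars is made up for
illustration, and leaving ASCII-range escapes alone (so an escaped
newline or quote cannot break the one-triple-per-line layout) is my
own assumption, not anything the group has decided:

```python
import re

# Matches an escaped backslash, or a \uXXXX / \UXXXXXXXX numeric escape.
# Consuming "\\" first keeps an escaped backslash followed by "u00e9"
# from being mistaken for a numeric escape.
_UCHAR = re.compile(r'\\(?:\\|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))')


def unescape_uchars(line: str) -> str:
    """Replace non-ASCII \\u / \\U escapes in one line with the raw character."""
    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return match.group(0)  # escaped backslash: leave untouched
        code_point = int(hex_digits, 16)
        # Assumption: keep ASCII escapes (e.g. \u000A), surrogates, and
        # out-of-range values exactly as written.
        if (code_point <= 0x7F
                or 0xD800 <= code_point <= 0xDFFF
                or code_point > 0x10FFFF):
            return match.group(0)
        return chr(code_point)

    return _UCHAR.sub(repl, line)
```

Piping a file through something like this first would let a plain
grep for "trouvé" see both spellings.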

> And
> if I'm sort'ing the triples, the escaped characters will not be
> interpreted, and so get wrongly sorted.

This is already sort of true with newlines. You have to remember that
newlines always have to be escaped.

>
> This is my concern in making N-Triples utf-8 compliant: we lose the
> good property it had of having exactly one way of serializing a given
> graph.

While I agree we need this, N-Triples does NOT define one. Blank node
labelling is annoying.

>
> Would it be possible to specify that \uXXXX escaping can only be used
> in ASCII files, while UTF-8 files *must* use the UTF-8 encoding?

The "what if humans want to write a code point that is annoying to
type?" issue doesn't really seem to apply to N-Triples. I'm sort of
okay with this on the serialization side, but totally not okay with
not allowing mixed-mode files to be parsed. Perhaps just a
best-practice note to implementers?
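
For the serialization side, a minimal sketch of what such a note
might recommend, in Python. The name serialize_literal is
hypothetical, and I am assuming the only characters that have to be
escaped in a literal are the backslash, the double quote, LF and CR;
everything else is written out as-is and the file is saved as UTF-8:

```python
# Assumption: these four are the only characters a serializer must escape.
_REQUIRED_ESCAPES = {
    '\\': '\\\\',
    '"': '\\"',
    '\n': '\\n',
    '\r': '\\r',
}


def serialize_literal(value: str) -> str:
    """Return a quoted literal, never emitting \\uXXXX for non-ASCII text."""
    body = ''.join(_REQUIRED_ESCAPES.get(ch, ch) for ch in value)
    return '"' + body + '"'
```

A serializer doing this gives grep and sort one spelling per literal,
while parsers stay free to accept the escaped forms too.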

--Gavin

Received on Wednesday, 20 July 2011 17:08:56 UTC