Re: about utf-8 support in N-Triples from Andy Seaborne on 2011-07-20 (public-rdf-wg@w3.org from July 2011)

From: Andy Seaborne <andy.seaborne@epimorphics.com>
Date: Wed, 20 Jul 2011 17:42:43 +0100
To: public-rdf-wg@w3.org
Message-ID: <4E270583.1030606@epimorphics.com>

On 20/07/11 17:26, Pierre-Antoine Champin wrote:
> Hi all,
>
> it seems that I could not make myself clear during today's telecon, on
> my concern about making N-Triples utf-8 compliant.
>
> Don't get me wrong: I would *love* to see N-Triples support utf-8 (and
> get the universe rid of ASCII, for that matter ;)
>
> But my concern is that:
> * N-Triples will still support \uXXXX escaping
> * so there would be several ways to serialize a literal in N-Triples
>
> I could perfectly live with that, but I think that one use-case of
> N-Triples is to be processed by RDF-unaware tools, such as grep, sed or
> sort.
>
> I know those tools have perfect utf-8 support; but they don't know that
> "\u00e9" is the same as "é". So if I'm grep'ing an N-Triples file for
> the string "trouvé", I may miss it if it is spelled "trouv\u00e9". And
> if I'm sort'ing the triples, the escaped characters will not be
> interpreted, and so get wrongly sorted.
>
> This is my concern in making N-Triples utf-8 compliant: we lose the
> good property it had to have exactly one way of serializing a given graph.
>
> Would it be possible to specify that \uXXXX escaping can only be used
> in ASCII files, while UTF-8 files *must* use the UTF-8 encoding?
>
>    pa

The current N-Triples spec does not preclude \u for ASCII chars; it
would be unusual but it's not banned:

" " and "\u0020" as well as "\u000A" and "\n".

A solution in terms of text tools to create a robust pipeline would be
to feed the N-Triples through a text processor that rewrites \u escapes
to the UTF-8 form, then feed the result to grep.  It could check for bad
UTF-8 as well.
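
A minimal sketch of such a pre-processor, in Python.  The names
(unescape_line, rewrite_lines, KEEP_ESCAPED) are made up for
illustration and are not part of any spec or library.  It assumes
line-oriented N-Triples input, and it deliberately leaves some escapes
alone (control characters, the double quote, the backslash, angle
brackets, surrogates) because writing those raw could break the syntax;
that set is an assumption, not something the spec dictates.

```python
import re

# Any backslash escape: \uXXXX, \UXXXXXXXX, or backslash + one other character.
# Matching the last form keeps an escaped backslash followed by "u00e9" intact.
ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))')

# Assumption: characters that stay escaped so the output remains parseable.
KEEP_ESCAPED = set('"\\<>')


def _replace(match):
    hex_digits = match.group(1) or match.group(2)
    if hex_digits is None:
        return match.group(0)  # \n, \t, \", \\ and so on: leave untouched
    cp = int(hex_digits, 16)
    # Leave control characters, surrogates and out-of-range values escaped.
    if cp < 0x20 or cp == 0x7F or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return match.group(0)
    ch = chr(cp)
    return match.group(0) if ch in KEEP_ESCAPED else ch


def unescape_line(line):
    """Rewrite \\u and \\U escapes in one line of text to literal characters."""
    return ESCAPE.sub(_replace, line)


def rewrite_lines(raw_lines):
    """Yield UTF-8 encoded lines with escapes rewritten.

    raw_lines is any iterable of bytes objects, one per line.  Input that
    is not valid UTF-8 raises ValueError naming the offending line.
    """
    for number, raw in enumerate(raw_lines, start=1):
        try:
            line = raw.decode("utf-8")
        except UnicodeDecodeError as err:
            raise ValueError(f"line {number}: invalid UTF-8 ({err})") from err
        yield unescape_line(line).encode("utf-8")
```

To use it as a filter, one could end the script with
sys.stdout.buffer.writelines(rewrite_lines(sys.stdin.buffer)) (after
importing sys) and pipe its output into grep or sort.  Note that this
only normalises in one direction: ASCII escapes such as "\u0020" become
" ", but "\u000A" and "\n" are both left as written.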

Having seen the damage feeding UTF-8 through ISO-8859-1 pipelines can do 
(AKA default email settings), I can see a case for ASCII.  On balance, I 
think allowing UTF-8 is the right choice - it's not without issue though.

 Andy

Received on Wednesday, 20 July 2011 16:43:14 UTC