Re: about utf-8 support in N-Triples from Pierre-Antoine Champin on 2011-07-23 (public-rdf-wg@w3.org from July 2011)

From: Pierre-Antoine Champin <pierre-antoine.champin@liris.cnrs.fr>
Date: Sat, 23 Jul 2011 15:00:15 +0200
To: Andy Seaborne <andy.seaborne@epimorphics.com>
CC: "public-rdf-wg@w3.org" <public-rdf-wg@w3.org>
Message-ID: <4E2AC5DF.1010208@liris.cnrs.fr>

On 07/20/2011 06:42 PM, Andy Seaborne wrote:
> 
> 
> On 20/07/11 17:26, Pierre-Antoine Champin wrote:
>> Hi all,
>>
>> it seems that I could not make myself clear during today's telecon, on
>> my concern about making N-Triples utf-8 compliant.
>>
>> Don't get me wrong: I would *love* to see N-Triples support utf-8 (and
>> rid the universe of ASCII, for that matter ;)
>>
>> But my concern is that:
>> * N-Triples will still support \uXXXX escaping
>> * so there would be several ways to serialize a literal in N-Triples
>>
>> I could perfectly live with that, but I think that one use-case of
>> N-Triples is to be processed by RDF-unaware tools, such as grep, sed or
>> sort.
>>
>> I know those tools have perfect utf-8 support; but they don't know that
>> "\u00e9" is the same as "é". So if I'm grep'ing an N-Triples file for
>> the string "trouvé", I may miss it if it is spelled "trouv\u00e9". And
>> if I'm sort'ing the triples, the escaped characters will not be
>> interpreted, and so will get wrongly sorted.
>>
>> This is my concern in making N-Triples utf-8 compliant: we lose the
>> nice property it had of having exactly one way of serializing a given
>> graph.
>>
>> Would it be possible to specify that \uXXXX escaping can only be used
>> in ASCII files, while UTF-8 files *must* use the UTF-8 encoding?
>>
>>    pa
> 
> The current N-Triples spec does not preclude \u for ASCII chars; it
> would be unusual but it's not banned:
> 
> " " and "\u0020" as well as "\u000A" and "\n".

Oh well... So adding UTF-8 support to N-Triples does not create that
problem, as it was there all along :)

> A solution in terms of text tools to create a robust pipeline would be 
> to feed the N-triples through a text processor that rewrote \u to the 
> UTF-8 form, then feed to grep.  It could check for bad UTF-8 as well.

I thought about that, but had the impression that this step was *only*
necessary if N-Triples supported UTF-8. Since even ASCII-only N-Triples
requires this normalisation step, I see no reason why N-Triples
should not support UTF-8.
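
For illustration, here is a minimal sketch of what such a normalisation
step could look like, in Python. It is not taken from any existing tool:
the names normalize_line and normalize_stream are hypothetical, and the
choice of which characters stay escaped (control characters, the double
quote, the backslash, and angle brackets) is an assumption made so that
the output remains line-based. The sketch does not distinguish IRIs from
literals, so it is meant for feeding grep or sort rather than for
producing output to be re-parsed. Decoding each line as strict UTF-8
also provides the "check for bad UTF-8" mentioned above.

```python
import re
import sys

# Any backslash sequence: \uXXXX, \UXXXXXXXX, or a one-character escape
# such as \n or \\. Matching "\\" as a unit ensures that the text
# \\u00e9 (an escaped backslash followed by "u00e9") is left alone.
_ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))')

# Assumption: these characters stay escaped so the line structure and
# the term delimiters are not disturbed.
_KEEP_ESCAPED = {'"', '\\', '<', '>'}


def _replace(match):
    hex_digits = match.group(1) or match.group(2)
    if hex_digits is None:
        return match.group(0)  # \n, \t, \", \\ and friends: unchanged
    code_point = int(hex_digits, 16)
    # Leave out-of-range values and surrogates as they are.
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        return match.group(0)
    char = chr(code_point)
    # Control characters (including newline) stay escaped.
    if code_point < 0x20 or code_point == 0x7F or char in _KEEP_ESCAPED:
        return match.group(0)
    return char


def normalize_line(line):
    """Rewrite \\u and \\U escapes in one line to the characters they denote."""
    return _ESCAPE.sub(_replace, line)


def normalize_stream(in_bytes, out_bytes):
    """Filter a binary stream line by line; raises UnicodeDecodeError on bad UTF-8."""
    for raw in in_bytes:
        out_bytes.write(normalize_line(raw.decode('utf-8')).encode('utf-8'))
```

Calling normalize_stream(sys.stdin.buffer, sys.stdout.buffer) at the end
of the script would turn it into a filter that can sit in front of grep
or sort in a pipeline.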

> Having seen the damage feeding UTF-8 through ISO-8859-1 pipelines can do 
> (AKA default email settings), I can see a case for ASCII.

I can see a case *against* ISO-8859-1 pipelines :-P

> On balance, I 
> think allowing UTF-8 is the right choice - it's not without issue though.

+1

  pa

Received on Saturday, 23 July 2011 17:32:01 UTC