- From: Pierre-Antoine Champin <pierre-antoine.champin@liris.cnrs.fr>
- Date: Sat, 23 Jul 2011 15:00:15 +0200
- To: Andy Seaborne <andy.seaborne@epimorphics.com>
- CC: "public-rdf-wg@w3.org" <public-rdf-wg@w3.org>
On 07/20/2011 06:42 PM, Andy Seaborne wrote:
>
>
> On 20/07/11 17:26, Pierre-Antoine Champin wrote:
>> Hi all,
>>
>> it seems that I could not make myself clear during today's telecon, on
>> my concern about making N-Triples utf-8 compliant.
>>
>> Don't get me wrong: I would *love* to see N-Triples support utf-8 (and
>> rid the universe of ASCII, for that matter ;)
>>
>> But my concern is that:
>> * N-Triples will still support \uXXXX escaping
>> * so there would be several ways to serialize a literal in N-Triples
>>
>> I could perfectly live with that, but I think that one use-case of
>> N-Triples is to be processed by RDF-unaware tools, such as grep, sed or
>> sort.
>>
>> I know those tools have perfect utf-8 support; but they don't know that
>> "\u00e9" is the same as "é". So if I'm grep'ing an N-Triples file for
>> the string "trouvé", I may miss it if it is spelled "trouv\u00e9". And
>> if I'm sort'ing the triples, the escaped characters will not be
>> interpreted, and so get wrongly sorted.
>>
>> This is my concern in making N-Triples utf-8 compliant: we lose the
>> good property it had of having exactly one way of serializing a given
>> graph.
>>
>> Would it be possible to specify that \uXXXX escaping can only be used
>> in ASCII files, while UTF-8 files *must* use the UTF-8 encoding?
>>
>> pa
>
> The current N-Triples spec does not preclude \u for ASCII chars; it
> would be unusual but it's not banned:
>
> " " and "\u0020" as well as "\u00A0" and "\n".

Oh well... So adding UTF-8 support to N-Triples does not create that
problem, as it was there all along :)

> A solution in terms of text tools to create a robust pipeline would be
> to feed the N-Triples through a text processor that rewrote \u to the
> UTF-8 form, then feed to grep. It could check for bad UTF-8 as well.

I thought about that, but had the impression that this step was *only*
necessary if N-Triples supported UTF-8. As even ASCII-only N-Triples
requires this normalisation step, I see no reason why N-Triples should
not support UTF-8.

> Having seen the damage feeding UTF-8 through ISO-8859-1 pipelines can do
> (AKA default email settings), I can see a case for ASCII.

I can see a case *against* ISO-8859-1 pipelines :-P

> On balance, I
> think allowing UTF-8 is the right choice - it's not without issue though.

+1

  pa
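
For illustration, the normalisation step discussed above (rewriting \u
escapes to their literal form before handing the file to grep or sort)
could be sketched roughly as below. This is only a sketch, not anything
taken from the N-Triples spec: the names unescape_line, _ESCAPE and
_KEEP_ESCAPED are made up for the example, and the choice to leave
control characters, the double quote, the backslash and angle brackets
escaped is an assumption made so that the output line stays parseable.
It works on already-decoded text; checking for bad UTF-8 would happen
when the input bytes are decoded, which is not shown here.

```python
import re

# Matches an escaped backslash (so that "\\u00e9" is not misread as an
# escape), a \uXXXX escape, or a \UXXXXXXXX escape.
_ESCAPE = re.compile(r'\\\\|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

# Assumption: these characters stay escaped, because writing them
# literally could break the syntax of the line.
_KEEP_ESCAPED = set('"\\<>')


def _decode(match):
    """Return the literal character for one escape, or the escape itself."""
    hex_digits = match.group(1) or match.group(2)
    if hex_digits is None:
        # This was the escaped-backslash alternative: leave it untouched.
        return match.group(0)
    code = int(hex_digits, 16)
    # Leave control characters, surrogates and out-of-range values escaped.
    if code < 0x20 or code == 0x7F or code > 0x10FFFF:
        return match.group(0)
    if 0xD800 <= code <= 0xDFFF:
        return match.group(0)
    char = chr(code)
    if char in _KEEP_ESCAPED:
        return match.group(0)
    return char


def unescape_line(line):
    """Rewrite \\uXXXX and \\UXXXXXXXX escapes in one line of text."""
    return _ESCAPE.sub(_decode, line)


# The example from the thread: both spellings end up identical.
print(unescape_line('"trouv\\u00e9"'))  # prints "trouvé"
print(unescape_line('"a\\u0020b"'))     # prints "a b"
```

Applied to every line before grep or sort, this gives each literal a
single spelling for the characters it decodes, whether the file was
written as ASCII with escapes or as UTF-8.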
Received on Saturday, 23 July 2011 17:32:01 UTC