- From: Andy Seaborne <andy.seaborne@epimorphics.com>
- Date: Wed, 20 Jul 2011 17:42:43 +0100
- To: public-rdf-wg@w3.org
On 20/07/11 17:26, Pierre-Antoine Champin wrote:
> Hi all,
>
> it seems that I could not make myself clear during today's telecon, on
> my concern about making N-Triples utf-8 compliant.
>
> Don't get me wrong: I would *love* to see N-Triples support utf-8 (and
> get the universe rid of ASCII, for that matter ;)
>
> But my concern is that:
> * N-Triples will still support \uXXXX escaping
> * so there would be several ways to serialize a literal in N-Triples
>
> I could perfectly live with that, but I think that one use-case of
> N-Triples is to be processed by RDF-unaware tools, such as grep, sed or
> sort.
>
> I know those tools have perfect utf-8 support; but they don't know that
> "\u00e9" is the same as "é". So if I'm grep'ing an N-Triples file for
> the string "trouvé", I may miss it if it is spelled "trouv\u00e9". And
> if I'm sort'ing the triples, the escaped characters will not be
> interpreted, and so get wrongly sorted.
>
> This is my concern in making N-Triples utf-8 compliant: we lose the
> good property it had to have exactly one way of serializing a given graph.
>
> Would that be possible to specify that \uXXXX escaping can only be used
> in ASCII files, while UTF-8 files *must* use the UTF-8 encoding?
>
> pa

The current N-Triples spec does not preclude \u for ASCII chars; it would be unusual but it's not banned: " " and "\u0020" as well as "\u000A" and "\n".

A solution in terms of text tools to create a robust pipeline would be to feed the N-Triples through a text processor that rewrote \u to the UTF-8 form, then feed the result to grep. It could check for bad UTF-8 as well.

Having seen the damage feeding UTF-8 through ISO-8859-1 pipelines can do (AKA default email settings), I can see a case for ASCII.

On balance, I think allowing UTF-8 is the right choice - it's not without issue though.

Andy
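As a rough illustration of the "rewrite \u, then grep" pipeline suggested above, here is a minimal Python sketch. It is not taken from any spec or existing tool; the names unescape_line and rewrite_stream are made up for this example. It assumes that only non-ASCII escapes should be expanded, and it leaves ASCII escapes such as "\u0020" or "\u0022" untouched, because some of them (quote, backslash, newline, angle brackets) are syntactically significant. Decoding each line strictly as UTF-8 doubles as the "check for bad UTF-8" step.

```python
import re

# Match one escape at a time, so an escaped backslash ("\\") is consumed
# as a unit and is never mistaken for the start of a \u escape.
ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(.))")


def unescape_line(line: str) -> str:
    """Replace non-ASCII \\uXXXX / \\UXXXXXXXX escapes with the character itself."""

    def repl(match):
        hex_digits = match.group(1) or match.group(2)
        if hex_digits is None:
            return match.group(0)  # \n, \t, \", \\ and so on: keep as written
        cp = int(hex_digits, 16)
        # Assumption of this sketch: ASCII escapes stay escaped, and code
        # points that cannot be encoded as UTF-8 are left alone.
        if cp < 0x80 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return match.group(0)
        return chr(cp)

    return ESCAPE.sub(repl, line)


def rewrite_stream(src, dst):
    """Copy binary stream src to dst line by line, expanding escapes.

    Raises ValueError naming the offending line if the input is not valid UTF-8.
    """
    for lineno, raw in enumerate(src, start=1):
        try:
            text = raw.decode("utf-8")  # strict decoding rejects bad UTF-8
        except UnicodeDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid UTF-8 ({exc.reason})") from exc
        dst.write(unescape_line(text).encode("utf-8"))
```

Calling rewrite_stream(sys.stdin.buffer, sys.stdout.buffer) from a small wrapper script would let the output be piped into grep or sort, so that "trouv\u00e9" and "trouvé" end up spelled the same way before the RDF-unaware tool sees them.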
Received on Wednesday, 20 July 2011 16:43:14 UTC