Re: Comments regarding "Turtle and N-Triples Syntaxes for RDF" from Richard Cyganiak on 2012-05-21 (public-rdf-comments@w3.org from May 2012)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 21 May 2012 12:07:45 +0100
To: Gregory Williams <greg@evilfunhouse.com>
Cc: public-rdf-comments@w3.org
Message-Id: <F5C3EB1F-120F-42CE-A6B5-E664A15CB4C3@cyganiak.de>

Hi Gregory,

I'll leave it to the Turtle editors to properly reply, but let me just respond to one point that you raise.

On 19 May 2012, at 19:22, Gregory Williams wrote:
> I'm not happy with the change to make N-Triples a unicode format.

As someone whose native language uses characters outside of US-ASCII, and who has written and debugged N-Triples serializers, I'm *very* happy about this change.

> This change means that tools interacting with N-Triples will have to be unicode aware, and support the \u style of unicode escapes used in N-Triples.

That's not quite correct.

Old N-Triples parsers and serializers already have to be aware of the \u escapes.

After the change, N-Triples serializers do *not* have to be aware of \u, but can write UTF-8 directly. This massively simplifies N-Triples serializers.
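
To illustrate the difference, here is a rough Python sketch of the string-escaping step in a serializer. The function names escape_old and escape_new are made up for this example, and the escaping rules are my reading of the old and new grammars rather than a complete implementation:

```python
def escape_old(value: str) -> str:
    """Old-style literal body: ASCII only, everything else as \\uXXXX or \\UXXXXXXXX."""
    out = []
    for ch in value:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= cp <= 0x7E:
            out.append(ch)                 # printable ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # uppercase hex, four digits
        else:
            out.append("\\U%08X" % cp)     # uppercase hex, eight digits
    return "".join(out)


def escape_new(value: str) -> str:
    """UTF-8 style: only escape what would otherwise end or break the literal."""
    table = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r"}
    return "".join(table.get(ch, ch) for ch in value)


print(escape_old("Spïdermann"))  # Sp\u00EFdermann
print(escape_new("Spïdermann"))  # Spïdermann
```

With the second version the serializer only has to make sure the output stream is opened as UTF-8; the code point arithmetic goes away.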

Whether command line tools like sort/uniq/cut/join need to be Unicode-aware, and understand \u escapes, depends on what you want to do. Some things get harder, some things get easier.

> This is a big change from the old N-Triples format, where command line tools such as sort/uniq/cut/join could be used to easily parse and perform simple processing of N-Triples data. With the unicode change, this strategy is now much more likely to not work, as a single value now has many equivalent syntactic forms (e.g. "Spïdermann" vs. "Sp\u00EFdermann"). Moreover, even the unicode escapes now have many equivalent forms, as the HEX production in the grammar has been made case insensitive, accepting [0-9A-Fa-f] instead of the old [0-9A-F] (e.g. "Sp\u00EFdermann" vs. "Sp\u00efdermann"). As mentioned above, this is also an issue with case insensitive language tags.

That's correct. But note that sort/uniq/cut/join already may or may not work depending on the kind of whitespace used in an old-N-Triples file, so their use is only straightforward if you already know a bit about the way it was serialized.
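
A small Python sketch of the whitespace problem, using two made-up example lines that encode the same triple. The TERM pattern is a simplified tokenizer of my own that only covers the common term shapes; it is not the grammar from the spec:

```python
import re

# Hypothetical sample data: one triple, serialized with different whitespace.
lines = [
    '<http://example.org/s> <http://example.org/p> "o" .',
    '<http://example.org/s>\t<http://example.org/p>  "o".',
]

# Simplified term pattern: IRIs, blank nodes, and literals with an
# optional language tag or datatype.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:[A-Za-z][A-Za-z0-9]*'
    r'|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?'
)


def terms(line):
    """Return the terms of a line as a tuple, ignoring whitespace and the final dot."""
    return tuple(TERM.findall(line))


print(len(set(lines)))                  # 2: what a byte-wise uniq sees
print(len({terms(l) for l in lines}))   # 1: what a triple-aware tool sees
```

So even with an ASCII-only file, line-based deduplication is only reliable when you know the serializer always emits the same spacing.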

I'm not sure that allowing both \u00EF and \u00ef was really intentional; it may be an artefact of defining N-Triples as a subset of Turtle.

The WG is discussing introducing a “canonical” flavour of N-Triples that prescribes the kind and amount of whitespace that can be used, and that would probably outlaw \u escapes except for the characters that cannot be represented in another way. This flavour would be optimized for processing with these command line tools. One can hope that user demand pushes serializers towards generally emitting the canonical flavour.
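
Nothing about such a flavour has been decided, so the following Python sketch is only a guess at what the escape-related part of a normalizer could look like. The function name decode_u_escapes and the choice of which characters stay escaped (control characters, the quote, the backslash, and angle brackets) are my assumptions, not WG decisions:

```python
import re

# Match any backslash escape; groups 1 and 2 capture the hex digits of \u and \U.
ESCAPE = re.compile(r'\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|.)')

# Assumption: these characters would change how the line parses if written raw.
KEEP_ESCAPED = set('"\\<>')


def _replace(match):
    hexdigits = match.group(1) or match.group(2)
    if hexdigits is None:
        return match.group(0)              # \\, \", \n and friends stay as they are
    cp = int(hexdigits, 16)
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return match.group(0)              # not a valid character, leave untouched
    ch = chr(cp)
    if cp < 0x20 or cp == 0x7F or ch in KEEP_ESCAPED:
        return "\\u%04X" % cp              # keep the escape, normalize hex case
    return ch                              # everything else becomes plain UTF-8


def decode_u_escapes(line: str) -> str:
    """Rewrite a line so that each character has a single spelling."""
    return ESCAPE.sub(_replace, line)


print(decode_u_escapes('"Sp\\u00EFdermann"'))  # "Spïdermann"
print(decode_u_escapes('"Sp\\u00efdermann"'))  # "Spïdermann"
```

Running existing files through a filter like this before sort/uniq would collapse the equivalent forms mentioned above into one, at the cost of requiring a UTF-8 aware locale for the downstream tools.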

Best,
Richard

Received on Monday, 21 May 2012 11:08:36 UTC