Re: TriG test suite issues from Eric Prud'hommeaux on 2013-09-21 (public-rdf-comments@w3.org from September 2013)

From: Eric Prud'hommeaux <eric@w3.org>
Date: Sat, 21 Sep 2013 19:28:22 -0400
To: Gregory Williams <greg@evilfunhouse.com>
Cc: "public-rdf-comments@w3.org Comments" <public-rdf-comments@w3.org>
Message-ID: <CANfjZH1kYjAe99FsxL-W1y2UrgS3MUdWzhfeFwAad493E98XRQ@mail.gmail.com>

On Sep 20, 2013 10:42 PM, "Gregory Williams" <greg@evilfunhouse.com> wrote:
>
> In working with the latest TriG test suite, I ran across two issues which
> I hope can be addressed.
>
> 1)
>
> Several of the N-Quad files used in the test suite use new-style
> N-Triples \u escapes (with lowercase hex chars). I believe this was fixed
> in the Turtle test suite to allow existing (old-style) N-Triples parsers to
> be used to test implementations of new-Turtle systems,

Yep, that was the intent. Sandro has been using the label "ASCII-ntriples"
to refer to the 2004 language.

> and I think the same reasoning should apply to N-Quads and TriG. The
> files with the new-style escapes are:
>
> localName_with_assigned_nfc_bmp_PN_CHARS_BASE_character_boundaries.nq
> localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries.nq
> localName_with_nfc_PN_CHARS_BASE_character_boundaries.nq
>
> Can they please be changed to use all-caps hex characters in escapes?
>
>
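
The requested change to the .nq files can be sketched in a few lines of Python. This is only an illustration, not the script used for the Turtle suite: the function name uppercase_hex_escapes is hypothetical, and it assumes the only escapes that need touching are \uXXXX and \UXXXXXXXX.

```python
import re

# Hypothetical helper, not part of any test-suite tooling.
# The first alternative consumes an escaped backslash (\\) so that a literal
# backslash followed by "u" is not mistaken for a \u escape.
_ESCAPE = re.compile(r'(\\\\)|\\(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})')


def uppercase_hex_escapes(line: str) -> str:
    """Rewrite \\uXXXX / \\UXXXXXXXX escapes so their hex digits are upper case."""
    def fix(match):
        if match.group(1):
            # An escaped backslash: leave it exactly as it was.
            return match.group(1)
        body = match.group(2)
        # Keep the u/U marker as-is and upper-case only the hex digits.
        return "\\" + body[0] + body[1:].upper()

    return _ESCAPE.sub(fix, line)
```

Applied line by line to an N-Quads file, this leaves everything except the hex digits of the escapes unchanged, so the result should still be readable by parsers that only accept the 2004 upper-case form.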
> 2)
>
> Two (utf8 encoded) TriG files in the test suite contain the U+EFFFF
> codepoint. While the TriG (and Turtle) grammars allow this codepoint in
> their character ranges, this codepoint is not a valid Unicode character.
> This causes problems for me in testing my TriG code because I can't easily
> change the behavior of perl or the low-level libraries being used to handle
> Unicode and file I/O (which I believe are doing the correct thing in
> throwing errors when they see this codepoint). The relevant Unicode code
> table[1] says of this codepoint range:
>
> "These codes are intended for process-internal uses, but are not
> permitted for interchange."
>
> I see that this issue has been discussed on the mailing list with respect
> to the range being used in the grammar, but given this Unicode text, I
> can't see how this codepoint can reasonably be used in a test suite and
> expected not to cause problems. The two files I see containing this
> codepoint are:
>
> prefix_with_PN_CHARS_BASE_character_boundaries.trig
> labeled_blank_node_with_PN_CHARS_BASE_character_boundaries.trig

I believe I fixed the Turtle test suite to avoid the 66 noncharacters
(http://www.unicode.org/faq/private_use.html) as well. I wrote a Turtle lexer
that excluded them, but it took an extra 30 secs to build in flex so I
commented it out. Note that not only is FFFE-FFFF excluded, but also
FDD0-FDEF and the last two code points of each supplementary plane, i.e.
(1-10)(FFFE-FFFF) with the plane number in hex.
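
For checking test files, the noncharacter ranges above can be tested directly. The following is a minimal Python sketch, not the flex lexer mentioned above; the names is_noncharacter and find_noncharacters are made up for illustration.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters."""
    # The contiguous block of 32 noncharacters in the BMP.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane (0 through 0x10):
    # U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def find_noncharacters(text: str):
    """List (index, code point) pairs for each noncharacter in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_noncharacter(ord(ch))]
```

Running find_noncharacters over the decoded contents of a test file would flag U+EFFFF along with any of the other 65 code points, assuming the file can be decoded at all by the I/O layer in use.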

> thanks,
> .greg
>
>
> [1] http://www.unicode.org/charts/PDF/UEFF80.pdf
>
>

Received on Saturday, 21 September 2013 23:28:49 UTC