Re: Proposed fixed version of N-Triples https://www.w3.org/TR/n-triples/ Section 7 from Andy Seaborne on 2017-06-29 (public-rdf-comments@w3.org from June 2017)

From: Andy Seaborne <andy@apache.org>
Date: Thu, 29 Jun 2017 21:11:18 +0100
To: public-rdf-comments@w3.org
Message-ID: <346761c8-2e4e-4098-95f6-ae41f6575b7b@apache.org>
I think that changing the grammar in this way has disadvantages:

For larger languages, it adds a lot of clutter.

It does not reflect the practical aspects of tools.

Whitespace and comment processing is often done during tokenization and 
tokenizers even have special facilities, or common idioms, for doing 
that.  Having the grammar reflect that helps implementers.
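
To illustrate the tokenizer idiom mentioned above, here is a minimal sketch in Python. It is an assumption-laden illustration, not code from any particular RDF tool: the names TOKEN_SPEC and tokenize are hypothetical, and the terminal patterns are simplified (no UCHAR handling, ASCII-only blank node labels). White space and comments are matched by a SKIP rule and dropped before the parser ever sees them, so the grammar proper does not need to mention them.

```python
import re

# Hypothetical, simplified token patterns (not the full N-Triples terminals).
TOKEN_SPEC = [
    ("SKIP", r"[ \t]+|#[^\r\n]*"),            # white space or comment
    ("EOL", r"[\r\n]+"),
    ("IRIREF", r'<[^\x00-\x20<>"{}|^`\\]*>'),
    ("BNODE", r"_:[A-Za-z0-9_]+"),
    ("STRING", r'"(?:[^"\\\r\n]|\\.)*"'),
    ("LANGTAG", r"@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*"),
    ("DTYPE", r"\^\^"),
    ("DOT", r"\."),
]
TOKEN = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Yield (kind, lexeme) pairs, silently dropping SKIP tokens."""
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
```

Because IRIREF and STRING are matched as whole tokens, a '#' inside either of them never reaches the comment rule.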

 > [[Lines consisting entirely of white space and/or a comment are now 
permitted.]]

Being able to count lines to find the number of triples is 
intentional, IIRC.
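
A small Python sketch of the line-counting point, under the assumption that a "strict" document has exactly one triple per non-empty line. The function names are hypothetical. Once blank and comment-only lines are permitted, a plain line count overcounts and a filter is needed.

```python
import re


def count_lines(doc):
    """Count non-empty physical lines, splitting only on CR and LF."""
    return sum(1 for line in re.split(r"[\r\n]+", doc) if line)


def count_triples_tolerant(doc):
    """Count lines that are neither blank nor comment-only."""
    n = 0
    for line in re.split(r"[\r\n]+", doc):
        stripped = line.strip(" \t")
        if stripped and not stripped.startswith("#"):
            n += 1
    return n
```

The split uses only CR and LF rather than str.splitlines(), since splitlines() also breaks on characters such as U+2028 that may appear unescaped inside a literal.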

     Andy

On 29/06/17 10:15, Peter F. Patel-Schneider wrote:
> A message to semantic-web@w3.org
> https://lists.w3.org/Archives/Public/semantic-web/2017Jun/0065.html inspired
> me to take a closer look at the grammar for N-Triples.  I found a number of
> problems in the grammar for N-Triples there.  I propose the following fixed
> version of the grammar section.
> 
> 
> Problems addressed:
> 1/ White space permitted but not required between any two terminals and at
> beginning and end of document.
> 2/ Comments can only occur in specific places.
> 3/ Lines consisting entirely of white space and/or a comment are permitted.
> 4/ Confusing statement about Unicode code points removed.
> 
> Remaining issue:
> 1/ The grammar in the TR mentions white space in the context of any two
> terminals, which includes between the parts of a literal.  However, there
> is no example or test case that has white space there.   This grammar
> permits white space there.
> 
> 
> 7. Grammar
> 
> An N-Triples document is a Unicode [UNICODE] character string encoded in
> UTF-8.
> [[Remove: Unicode code points only in the range U+0 to U+10FFFF inclusive are
> allowed.  Rationale: These are the only Unicode code points.]]
> 
> White space (tab U+0009 or space U+0020) is allowed but not required between
> any two terminals.
> [[Replace: White space (tab U+0009 or space U+0020) is used to separate two
> terminals which would otherwise be (mis-)recognized as one terminal.
> Rationale: In N-Triples there is no possibility of such mis-recognition.]]
> White space is significant in the production STRING_LITERAL_QUOTE.
> 
> Comments in N-Triples take the form of '#', outside an IRIREF or
> STRING_LITERAL_QUOTE, and continue up-to, and excluding, the end of line
> (EOL), or end of file if there is no end of line after the comment
> marker. Comments are treated as white space.
> 
> The EBNF used here is defined in XML 1.0 [EBNF-NOTATION].
> 
> [[White space and comments are now explicit in the grammar similar to the
> situation in early versions of the N-Triples grammar.  Rationale: Makes it
> clear where white space and comments are permitted. ]]
> 
> Escape sequence rules are the same as Turtle [TURTLE]. However, as only the
> STRING_LITERAL_QUOTE production is allowed, new lines in literals MUST be
> escaped.
> [1]  ntriplesDoc  ::=  triple? (EOL triple)* END
> [2]  triple   ::=  WS? subject WS? predicate WS? object WS? '.'
> [3]  subject  ::=  IRIREF | BLANK_NODE_LABEL
> [4]  predicate  ::=  IRIREF
> [5]  object   ::=  IRIREF | BLANK_NODE_LABEL | literal
> [6]  literal  ::=  STRING_LITERAL_QUOTE (WS? '^^' WS? IRIREF | WS? LANGTAG)?
> 
> Productions for terminals
> [144s]  LANGTAG  ::=  '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
> [[Lines consisting entirely of white space and/or a comment are now permitted.]]
> [7]  EOL  ::=  ( WS? ('#' [^#xD#xA]* )? [#xD#xA] )+
> [7a] END ::=  EOL? WS? ('#' [^#xD#xA]* )?
> [8]  IRIREF  ::=  '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
> [9]  STRING_LITERAL_QUOTE  ::=  '"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
> [141s]  BLANK_NODE_LABEL  ::=  '_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')*
> PN_CHARS)?
> [10]  UCHAR  ::=  '\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
> [153s]  ECHAR  ::=  '\' [tbnrf"'\]
> [157s]  PN_CHARS_BASE  ::=  [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6]
> | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] |
> [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
> [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> [158s]  PN_CHARS_U  ::=  PN_CHARS_BASE | '_' | ':'
> [160s]  PN_CHARS  ::=  PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] |
> [#x203F-#x2040]
> [162s]  HEX  ::=  [0-9] | [A-F] | [a-f]
> 
> [[White space is included in grammar.]]
>  WS ::= [#x9#x20]+
> 
> 
> 
>
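
For readers who want to experiment with the quoted EOL production, here is a rough Python regex transcription. It assumes the comment marker is '#' (U+0023) and uses a hypothetical helper name, is_eol. It is only a sketch of that one production, not a parser for the whole grammar.

```python
import re

# EOL ::= ( WS? ('#' [^#xD#xA]* )? [#xD#xA] )+   with WS ::= [#x9#x20]+
EOL = re.compile(r"(?:(?:[\t ]+)?(?:#[^\r\n]*)?[\r\n])+")


def is_eol(s):
    """True if s, taken as a whole, matches the proposed EOL production."""
    return EOL.fullmatch(s) is not None
```

With this transcription a run of blank or comment-only lines is absorbed into a single EOL, which is why such lines no longer correspond one-to-one with triples.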
Received on Thursday, 29 June 2017 20:11:53 UTC