Re: Proposed fixed version of N-Triples https://www.w3.org/TR/n-triples/ Section 7

* Andy Seaborne <andy@apache.org> [2017-06-29 21:11+0100]
> I think that changing the grammar in this way has disadvantages:
> 
> For larger languages, it adds a lot of clutter.
> 
> It does not reflect the practical aspects of tools.
> 
> Whitespace and comment processing is often done during tokenization and
> tokenizers even have special facilities, or common idioms, for doing that.
> Having the grammar reflect that helps implementers.

Strong +1. The default behavior of almost every lexer is to
break on whitespace. Arguably, we could have been clearer about that,
though we were clear about matching the longest terminal (which
requires sorting the directives in some lexers).

The yacker for Turtle has:
[[
[56] PASSED TOKENS ::= ([ \t\r\n])+
                     | "#" ([^\r\n])*
]]
I recommend that we add something like that to the errata and call
this done. The large number of Turtle and SPARQL parsers out there
that do behave this way is evidence that the world made the obvious
assumption about the terminals.
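
For illustration, here is a minimal Python sketch of a lexer that treats
whitespace and comments as passed tokens and otherwise takes the longest
matching terminal. The names (PASSED, TERMINALS, tokenize) and the tiny
terminal table are assumptions made up for this example; this is not the
yacker implementation and it covers far less than the Turtle terminals.

```python
import re

# Passed (skipped) tokens, mirroring the rule quoted above:
# runs of whitespace, or '#' up to (not including) the end of line.
PASSED = re.compile(r'[ \t\r\n]+|#[^\r\n]*')

# A deliberately small, hypothetical terminal table (not the full Turtle set).
TERMINALS = [
    ("IRIREF", re.compile(r'<[^\x00-\x20<>"{}|^`\\]*>')),
    ("STRING", re.compile(r'"(?:[^"\\\r\n]|\\.)*"')),
    ("LANGTAG", re.compile(r'@[a-zA-Z]+(?:-[a-zA-Z0-9]+)*')),
    ("BNODE", re.compile(r'_:[A-Za-z0-9_]+')),
    ("DTYPE", re.compile(r'\^\^')),
    ("DOT", re.compile(r'\.')),
]

def tokenize(text):
    """Return (name, lexeme) pairs, silently dropping passed tokens."""
    pos = 0
    tokens = []
    while pos < len(text):
        skipped = PASSED.match(text, pos)
        if skipped:
            pos = skipped.end()
            continue
        # Try every terminal and keep the longest match; on a tie the
        # earlier table entry wins.
        best = None
        for name, rx in TERMINALS:
            m = rx.match(text, pos)
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"unexpected character at offset {pos}: {text[pos]!r}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens
```

Because passed tokens are only tried between terminals, a '#' inside an
IRIREF or a string is never mistaken for a comment, and
tokenize('<a><b><c>.') works with no whitespace at all.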

N-Triples is arguably another matter. Historically, it had much more
rigid (and awk-friendly) whitespace rules. It's reasonable for the
community to decide what they should be going forward, look for
data that violates those rules, and ask the custodians of that data if
they could update it. I suspect that most N-Triples data is parsed by
Turtle parsers, but doing so could be a favor to simpler parsers.
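
As a rough sketch of what "looking for violations" might involve, the
Python below surveys a document line by line and counts lines that are
blank or comment-only, which are exactly the lines that break the
"line count equals triple count" property. The function name
survey_lines is hypothetical, and the sketch assumes any other line
holds one triple; it does not validate the triple itself.

```python
import re

def survey_lines(text):
    """Classify each line as 'triple', 'blank', or 'comment'."""
    # Split only on CR, LF, or CRLF. str.splitlines() would also split on
    # characters such as U+2028, which may appear unescaped in a literal.
    lines = re.split(r'\r\n|\r|\n', text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing line terminator does not start a new line
    counts = {"triple": 0, "blank": 0, "comment": 0}
    for line in lines:
        stripped = line.strip(" \t")
        if not stripped:
            counts["blank"] += 1
        elif stripped.startswith("#"):
            # No triple can start with '#': subjects start with '<' or '_'.
            counts["comment"] += 1
        else:
            counts["triple"] += 1  # assumed, not validated
    return counts
```

A document with counts["blank"] == 0 and counts["comment"] == 0 can
still be counted with a plain line count; anything else needs a parser
or a filter like this one.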


> > [[Lines consisting entirely of white space and/or a comment are now
> > permitted.]]
> 
> Counting the number of lines to find the number of triples is intentional
> IIRC.
> 
>     Andy
> 
> On 29/06/17 10:15, Peter F. Patel-Schneider wrote:
> >A message to semantic-web@w3.org
> >https://lists.w3.org/Archives/Public/semantic-web/2017Jun/0065.html inspired
> >me to take a closer look at the grammar for N-Triples.  I found a number of
> >problems in the grammar for N-Triples there.  I propose the following fixed
> >version of the grammar section.
> >
> >
> >Problems addressed:
> >1/ White space permitted but not required between any two terminals and at
> >beginning and end of document.
> >2/ Comments can only occur in specific places.
> >3/ Lines consisting entirely of white space and/or a comment are permitted.
> >4/ Confusing statement about Unicode code points removed.
> >
> >Remaining issue:
> >1/ The grammar in the TR mentions white space in the context of any two
> >terminals, which includes between the parts of a literal.  However, there
> >is no example or test case that has white space there.   This grammar
> >permits white space there.
> >
> >
> >7. Grammar
> >
> >An N-Triples document is a Unicode [UNICODE] character string encoded in
> >UTF-8.
> >[[Remove: Unicode code points only in the range U+0 to U+10FFFF inclusive are
> >allowed.  Rationale: These are the only Unicode code points.]]
> >
> >White space (tab U+0009 or space U+0020) is allowed but not required between
> >any two terminals.
> >[[Replace: White space (tab U+0009 or space U+0020) is used to separate two
> >terminals
> >which would otherwise be (mis-)recognized as one terminal.
> >Rationale: In N-Triples there is no possibility of such mis-recognition.]]
> >White space is significant in the production STRING_LITERAL_QUOTE.
> >
> >Comments in N-Triples take the form of '#', outside an IRIREF or
> >STRING_LITERAL_QUOTE, and continue up-to, and excluding, the end of line
> >(EOL), or end of file if there is no end of line after the comment
> >marker. Comments are treated as white space.
> >
> >The EBNF used here is defined in XML 1.0 [EBNF-NOTATION].
> >
> >[[White space and comments are now explicit in the grammar similar to the
> >situation in early versions of the N-Triples grammar.  Rationale: Makes it
> >clear where white space and comments are permitted. ]]
> >
> >Escape sequence rules are the same as Turtle [TURTLE]. However, as only the
> >STRING_LITERAL_QUOTE production is allowed new lines in literals MUST be
> >escaped.
> >[1] 	ntriplesDoc 	::= 	triple? (EOL triple)* END
> >[2] 	triple	 	::= 	WS? subject WS? predicate WS? object WS? '.'
> >[3] 	subject 	::= 	IRIREF | BLANK_NODE_LABEL
> >[4] 	predicate 	::= 	IRIREF
> >[5] 	object	 	::= 	IRIREF | BLANK_NODE_LABEL | literal
> >[6] 	literal 	::= 	STRING_LITERAL_QUOTE (WS? '^^' WS? IRIREF | WS? LANGTAG)?
> >
> >Productions for terminals
> >[144s] 	LANGTAG 	::= 	'@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
> >[[Lines consisting entirely of white space and/or a comment are now permitted.]]
> >[7] 	EOL 	::= 	( WS? ('#' [^#xD#xA]* )? [#xD#xA] )+
> >[7a]	END	::= 	EOL? WS? ('#' [^#xD#xA]* )?
> >[8] 	IRIREF 	::= 	'<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
> >[9] 	STRING_LITERAL_QUOTE 	::= 	'"' ([^#x22#x5C#xA#xD] | ECHAR | UCHAR)* '"'
> >[141s] 	BLANK_NODE_LABEL 	::= 	'_:' (PN_CHARS_U | [0-9]) ((PN_CHARS | '.')*
> >PN_CHARS)?
> >[10] 	UCHAR 	::= 	'\u' HEX HEX HEX HEX | '\U' HEX HEX HEX HEX HEX HEX HEX HEX
> >[153s] 	ECHAR 	::= 	'\' [tbnrf"'\]
> >[157s] 	PN_CHARS_BASE 	::= 	[A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6]
> >| [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] |
> >[#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] |
> >[#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
> >[158s] 	PN_CHARS_U 	::= 	PN_CHARS_BASE | '_' | ':'
> >[160s] 	PN_CHARS 	::= 	PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] |
> >[#x203F-#x2040]
> >[162s] 	HEX 	::= 	[0-9] | [A-F] | [a-f]
> >
> >[[White space is included in grammar.]]
> >	WS	::=	[#x9#x20]+
> >
> >
> >
> >
> 

-- 
-ericP

office: +1.617.599.3509
mobile: +33.6.80.80.35.59

(eric@w3.org)
Feel free to forward this message to any list for any purpose other than
email address distribution.

There are subtle nuances encoded in font variation and clever layout
which can only be seen by printing this message on high-clay paper.

Received on Thursday, 29 June 2017 22:34:35 UTC