[Turtle] Some syntax related issues

Hello, 

I am attaching an email I was forwarded by Stefano D'Angelo (cc'd) highlighting some concerns he had with the Turtle syntax definition. I will summarise his / my findings here, but have the email thread posted below for provenance purposes. 

Relevant points: 


> According to [3], the lexical representation of decimal values should
> always contain a period as decimal indicator, hence rule (18) should
> be:
> 
> (18) decimal ::= ('-' | '+')? ( [0-9]+ '.' [0-9]* | '.' ([0-9])+ )

I am not sure I agree with this: "9" (in my reading) is a decimal number, as per [1].

Section 3.2.3.1 states: 

"Leading and trailing zeroes are optional. If the fractional part is zero, 
the period and following zero(es) can be omitted."
Yves brought up a similar issue in this thread on the RDF WG mailing list [2].

[1] http://www.w3.org/TR/xmlschema-2/#decimal
[2] http://www.w3.org/mid/20110228120029.GG12598@dvbstreamer.national.core.bbc.co.uk
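
To make the disagreement over rule (18) concrete, here is a minimal Python sketch comparing the two readings. Both regular expressions are approximations written for this illustration, not text taken from either specification, and the names XSD_DECIMAL, PROPOSED_RULE_18 and accepted_by are hypothetical.

```python
import re

# Approximation of the xsd:decimal lexical space (an assumption for this sketch):
# optional sign, digits, and an optional fractional part, so the period may be omitted.
XSD_DECIMAL = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")

# The proposed rule (18), which always requires a period.
PROPOSED_RULE_18 = re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)")


def accepted_by(pattern, lexical):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(lexical) is not None


for lexical in ["9", "9.0", "-0.5", "+.25"]:
    print(lexical, accepted_by(XSD_DECIMAL, lexical), accepted_by(PROPOSED_RULE_18, lexical))
```

Under these assumptions, "9" is accepted only by the first pattern, while "9.0" is accepted by both.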


> Rules 2, 4 and 5 need to be changed as follows to allow white space usage:
> 
> (2) statement ::= directive ws* '.' | triples ws* '.' | ws+

This makes perfect sense: it allows ws* between directives (prefixes and
bases) or triples and their terminating "."s.

I was a bit concerned that the spacing might already be covered by another
rule in the grammar, but if it is, I did not spot it.


> (4) prefixID ::= '@prefix' ws+ prefixName? ':' ws* uriref
> 
> (5) base ::= '@base' ws+ uriref


I am not sure it is necessary to allow whitespace in rules 4 and 5, but apparently librdf allows this.


> Rules 6, 7 and 8 should also be changed to avoid white space-related
> ambiguities (such as subject being a qname and there is no space
> between subject and predicate):
> 
> (6) triples ::= subject ws+ predicateObjectList
> 
> (7) predicateObjectList ::= verb ws+ objectList (';' verb ws+ objectList)* (';')?
> 
> (8) object ::= object (ws+ object)*

These changes all look good to me. They just allow whitespace in places where parsers commonly accept it.
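
The qname ambiguity mentioned in the quoted text can be illustrated with a small Python sketch. The QNAME pattern is a deliberately simplified, hypothetical stand-in for the grammar's qname rule (the real prefixName and name productions admit many more characters), and qnames is a made-up helper.

```python
import re

# Hypothetical, simplified qname: optional prefix, a colon, optional local name.
# This is an assumption for illustration, not the grammar's actual character classes.
QNAME = re.compile(r"(?:[A-Za-z][A-Za-z0-9_-]*)?:(?:[A-Za-z_][A-Za-z0-9_-]*)?")


def qnames(text):
    """Return the qname tokens a greedy left-to-right scanner would find."""
    return QNAME.findall(text)


# With whitespace, subject and predicate are separate tokens.
print(qnames("ex:s ex:p ex:o"))

# Without whitespace between subject and predicate, the first token
# swallows the predicate's prefix and the split lands in the wrong place.
print(qnames("ex:sex:p ex:o"))
```

In this sketch the second input is scanned as "ex:sex" followed by ":p", rather than the intended "ex:s" and "ex:p", which is the kind of problem a mandatory ws+ would rule out.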


> In order to be consistent with the current librdf implementation,
> rules 14 and 15 could (should?) also allow white spaces too without
> introducing ambiguities:
> 
> (14) literal ::= quotedString (ws* '@' ws* language )? | datatypeString | integer | double | decimal | boolean
> 
> (15) datatypeString ::= quotedString ws* '^^' ws* resource


I am not sure how I feel about these changes. I think I would like to see both @ and ^^ come directly after a literal in Turtle, but again, apparently raptor allows whitespace in these places.

> Then, the current definition of lcharacters allows them to be double
> quotes ("), which makes the end of a longString ambiguous (think
> """ABC"""DEF"""). This can be solved by changing rules 37 and 43 as
> follows:
> 
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+ )* #x22 #x22 #x22
> 
> (43) lcharacter ::= scharacter | #x9 | #xA | #xD


I see what you are doing here. I wonder if rule 37 should look more like:
(37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+)*  #x22 #x22 #x22

Anyway, SPARQL has its own definition of triple quoting, and I think the best plan is to unite Turtle with SPARQL on triple quoting, to allow for maximum consistency.
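
The longString ambiguity from the quoted """ABC"""DEF""" example can be sketched in Python as follows. The three regular expressions are simplified approximations made up for this illustration (ANY and NO_QUOTE stand in for the old and revised lcharacter, ignoring the finer details of escapes), and first_token is a hypothetical helper.

```python
import re

# Stand-in for the current lcharacter: any character except a backslash,
# or a backslash escape. Bare double quotes are allowed (an assumption).
ANY = r"(?:[^\\]|\\.)"
# Stand-in for the revised lcharacter: as above, but never a bare double quote.
NO_QUOTE = r'(?:[^"\\]|\\.)'

# Two equally plausible scanners for the current rule disagree with each other.
GREEDY = re.compile(r'"""' + ANY + r'*"""', re.DOTALL)
LAZY = re.compile(r'"""' + ANY + r'*?"""', re.DOTALL)

# Shape of the revised rule 37: inner quotes come in runs of one or two,
# and each run is followed by at least one non-quote character.
REVISED = re.compile(r'"""' + NO_QUOTE + r'*(?:""?' + NO_QUOTE + r'+)*"""', re.DOTALL)


def first_token(pattern, text):
    """Return the longString token found at the start of text, or None."""
    match = pattern.match(text)
    return match.group(0) if match else None


sample = '"""ABC"""DEF"""'
print(first_token(GREEDY, sample))
print(first_token(LAZY, sample))
print(first_token(REVISED, sample))
```

In this sketch the greedy scanner consumes the whole sample while the lazy one stops after """ABC""", so the current rule leaves the choice to the implementation. The revised shape admits only the shorter token, while still accepting single or doubled quotes inside the string body.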

> In the end the current definition of ucharacters is in disagreement
> with both section 3.3 (String Escapes) and with tests from the test
> suite, since it does not allow \t, \n and \r escapes. It should
> definitely be changed to:
> 
> (41) ucharacter ::= ( echaracter - #x3E ) | '\>'

Which seems sane to me too. 

Anyways, these are some initial thoughts about the current Turtle syntax, I am happy to put these up on the wiki somewhere if need be. 

Regards, 

Mischa 

Below is the original thread : 


> My name is Mischa, I worked with Steve Harris at Garlik, and am on the W3C's
> current RDF WG. I was forwarded a number of emails which you have
> constructed where you critique the Turtle Syntax.

Oh, finally :-)

> So, as a high-level, do you mind if I forward your emails to the working
> group's mailing list, and do you mind if I add my 2 cents, similar to what I
> have done to your emails. If this is OK with you, I will cc you into the
> mails if that is OK?

Sure, please do.

> I have some comments, they will come inline, please feel free to let me know
> if you think I am wrong in the way I have responded to your points.

Sure, my replies follow your comments.

> On 2 Mar 2011, at 11:22, Steve Harris wrote:
[...]
> Hello,
> 
> While implementing my own Turtle parser (see [1]), I took note of the
> 
> following errors in the Turtle grammar as defined in [2].
> 
> According to [3], the lexical representation of decimal values should
> 
> always contain a period as decimal indicator, hence rule (18) should
> 
> be:
> 
> (18) decimal ::= ('-' | '+')? ( [0-9]+ '.' [0-9]* | '.' ([0-9])+ )
> 
> So, I don't agree with this.
> You seem to think that "9" is not a decimal number, and you refer to this
> link below:
> http://www.w3.org/TR/xmlschema-2/#decimal
> The link above suggests that "9" is a decimal for it states in section
> 3.2.3.1 :
> "Leading and trailing zeroes are optional. If the fractional part is zero,
> the period and following zero(es) can be omitted."
> Yves brought up a similar issue in this thread on the RDF WG mailing list.
> 
> http://www.w3.org/mid/20110228120029.GG12598@dvbstreamer.national.core.bbc.co.uk

Mmmm... okay, hence some type inference is needed. If that is ok to
you, it's ok to me as well, but I have to warn you that it might be a
big problem in generic Turtle parsers (and indeed, I remember having
had this kind of issue with librdf some time ago).

> Rules 2, 4 and 5 need to be changed as follows to allow white space usage:
> 
> (2) statement ::= directive ws* '.' | triples ws* '.' | ws+
> 
> This makes perfect sense to allow for ws* between directives (prefix's and
> base's) and triples and their "."s.
> I was a bit concerned that the spacing was covered in another rule in the
> grammar, but it didn't seem to be from my POV.

It is not. I discovered those issues while writing my own parser,
hence there is nothing "redundant" in these "new rules" for sure.

> (4) prefixID ::= '@prefix' ws+ prefixName? ':' ws* uriref
> 
> (5) base ::= '@base' ws+ uriref
> 
> Am not sure I agree with these why would you want spaces in between the
> prefixName and the URIRef ? (Note that URIRef's should change to IRIs in
> this next standardisation effort).
> But again I think this is up for debate, and I think it is worth posing to
> the community.

I added this just because I noticed that librdf allowed that. No other reason.

> Rules 6, 7 and 8 should also be changed to avoid white space-related
> 
> ambiguities (such as subject being a qname and there is no space
> 
> between subject and predicate):
> 
> (6) triples ::= subject ws+ predicateObjectList
> 
> (7) predicateObjectList ::= verb ws+ objectList (';' verb ws+
> 
> objectList)* (';')?
> 
> (8) object ::= object (ws+ object)*
> 
> These all look good to me too. They are all just allowing whitespaces where
> I think libraptor would allow them anyways , right ? Again, I think these
> could be posed to the working group to see what the consensus view is on the
> matter of whitespaces.

Yes, exactly.

> In order to be consistent with the current librdf implementation,
> 
> rules 14 and 15 could (should?) also allow white spaces too without
> 
> introducing ambiguities:
> 
> (14) literal ::= quotedString (ws* '@' ws* language )? |
> 
> datatypeString | integer | double | decimal | boolean
> 
> (15) datatypeString ::= quotedString ws* '^^' ws* resource
> 
> So, I am not sure I agree with these. I like the fact that the @ or the ^^
> need to come right after the quoted string. This matches up to libraptor
> parser anyways. Again, up for debate I guess.

Yes, same thing. Raptor allowed that, so I put them in.

> Then, the current definition of lcharacters allows them to be double
> 
> quotes ("), which makes the end of a longString ambiguous (think
> 
> """ABC"""DEF"""). This can be solved by changing rules 37 and 43 as
> 
> follows:
> 
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+ )*
> 
> (43) lcharacter ::= scharacter | #x9 | #xA | #xD
> 
> I see what you are doing here, I wonder if rule 37 should look more like :
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+
> )*  #x22 #x22 #x22
> Anyways, SPARQL has its own definition of triple quoting, and I think the
> best plan is to unite Turtle with SPARQL on triple quoting, to allow for
> maximum consistency.

Yes, sorry, that was corrected in a later email. However, this one is
needed IMO, since it would create scanning/parsing ambiguities and/or
difficulties.

> In the end the current definition of ucharacters is in disagreement
> 
> with both section 3.3 (String Escapes) and with tests from the test
> 
> suite, since it does not allow \t, \n and \r escapes. It should
> 
> definitely be changed to:
> 
> (41) ucharacter ::= ( echaracter - #x3E ) | '\>'
> 
> I need to look at this a little further.

Ok.

> So Stefano, are you happy for me to put your concern across to the working
> group? I will aim to forward your mail in its entirety, and I will add my
> comments to your mails in an obvious manner, what do you think to this?
> Cheers,
> Mischa

Sure, it's fine to me. The reason why I sent the mails in the first
place was to have those "bugs" fixed.

However, I just wanted to add that Unicode code points U+D800 to
U+DFFF should not be allowed (those are surrogates reserved for UTF-16
encoding), and byte order marks should be also discussed IMO.
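
As a rough illustration of the surrogate point, a parser could reject such code points when decoding \u and \U escapes. This is only a Python sketch under stated assumptions: the function names is_surrogate and decode_hex_escape are hypothetical, and the input is assumed to be the hex digits that follow the escape prefix.

```python
def is_surrogate(code_point):
    """True for code points in the UTF-16 surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def decode_hex_escape(hex_digits):
    """Decode the hex digits of a \\uXXXX or \\UXXXXXXXX escape.

    Raises ValueError for surrogates and for values beyond U+10FFFF.
    """
    code_point = int(hex_digits, 16)
    if code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range: " + hex_digits)
    if is_surrogate(code_point):
        raise ValueError("surrogate code point not allowed: " + hex_digits)
    return chr(code_point)


print(decode_hex_escape("0041"))
```

With this sketch, decode_hex_escape("D800") raises ValueError instead of producing an unpaired surrogate.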

Thank you and please, let me know.

All the best,

Stefano



___________________________________
Mischa Tuffield PhD
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 1-3 Halford Road, Richmond, TW10 6AW
+44(0)845 652 2824  http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Wednesday, 9 March 2011 15:51:25 UTC