- From: Mischa Tuffield <mischa.tuffield@garlik.com>
- Date: Wed, 9 Mar 2011 15:50:40 +0000
- To: RDF Working Group WG <public-rdf-wg@w3.org>
- Cc: Stefano D'Angelo <zanga.mail@gmail.com>
- Message-Id: <CDD6F29B-BC0E-4AE7-B994-8017FC38440C@garlik.com>
Hello,

I am attaching an email I was forwarded by Stefano D'Angelo (cc'd)
highlighting some concerns he had with the Turtle syntax definition. I
will summarise his and my findings here, and have posted the email
thread below for provenance purposes.

Relevant points:

> According to [3], the lexical representation of decimal values should
> always contain a period as decimal indicator, hence rule (18) should
> be:
> (18) decimal ::= ('-' | '+')? ( [0-9]+ '.' [0-9]* | '.' ([0-9])+ )

I am not sure I agree with this. "9" (in my reading) is a decimal
number, as [1], Section 3.2.3.1, states:

"Leading and trailing zeroes are optional. If the fractional part is
zero, the period and following zero(es) can be omitted."

Yves brought up a similar issue in this thread on the RDF WG mailing
list [2].

[1] http://www.w3.org/TR/xmlschema-2/#decimal
[2] http://www.w3.org/mid/20110228120029.GG12598@dvbstreamer.national.core.bbc.co.uk

> Rules 2, 4 and 5 need to be changed as follows to allow white space usage:
> (2) statement ::= directive ws* '.' | triples ws* '.' | ws+

This makes perfect sense: it allows for ws* between directives
(prefixes and bases) and triples and their "."s. I was a bit concerned
that the spacing was already covered by another rule in the grammar,
but I didn't spot one if that is the case.

> (4) prefixID ::= '@prefix' ws+ prefixName? ':' ws* uriref
> (5) base ::= '@base' ws+ uriref

I am not sure it is necessary to allow whitespace in rules 4 and 5,
but apparently librdf allows this.

> Rules 6, 7 and 8 should also be changed to avoid white space-related
> ambiguities (such as subject being a qname and there is no space
> between subject and predicate):
> (6) triples ::= subject ws+ predicateObjectList
> (7) predicateObjectList ::= verb ws+ objectList (';' verb ws+
>     objectList)* (';')?
> (8) object ::= object (ws+ object)*

These changes all look good to me. They just allow whitespace in
places where parsers commonly accept it.
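
To make the rule (18) disagreement above concrete, here is a rough
sketch in Python that compares the proposed rule with the xsd:decimal
lexical form described in [1]. The two regular expressions are my own
transcriptions (assumptions, not text from either spec), and the names
PROPOSED_DECIMAL, XSD_DECIMAL and matches are made up for illustration.

```python
import re

# Rule (18) as proposed in the thread: the period is mandatory.
# (Transcribed by hand from the EBNF; treat as an approximation.)
PROPOSED_DECIMAL = re.compile(r"[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+)")

# One reading of the xsd:decimal lexical form: the period and any
# fractional digits may be omitted when the fractional part is zero.
XSD_DECIMAL = re.compile(r"[-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")


def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["9", "9.0", ".5", "-1.25"]:
        print(sample,
              "proposed:", matches(PROPOSED_DECIMAL, sample),
              "xsd:", matches(XSD_DECIMAL, sample))
```

Under these assumptions "9" is accepted by the xsd:decimal pattern but
rejected by the proposed rule, so a parser following the proposal
would have to treat "9" as an integer only.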

> In order to be consistent with the current librdf implementation,
> rules 14 and 15 could (should?) also allow white spaces too without
> introducing ambiguities:
> (14) literal ::= quotedString (ws* '@' ws* language )? |
>      datatypeString | integer | double | decimal | boolean
> (15) datatypeString ::= quotedString ws* '^^' ws* resource

I am not sure how I feel about these changes. I think I would like to
see both @ and ^^ come directly after a literal in Turtle, but again
apparently raptor allows whitespace in these places.

> Then, the current definition of lcharacters allows them to be double
> quotes ("), which makes the end of a longString ambiguous (think
> """ABC"""DEF"""). This can be solved by changing rules 37 and 43 as
> follows:
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+ )* #x22 #x22 #x22
> (43) lcharacter ::= scharacter | #x9 | #xA | #xD

I see what you are doing here. I wonder if rule 37 should look more
like:

(37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+)* #x22 #x22 #x22

Anyway, SPARQL has its own definition of triple quoting, and I think
the best plan is to unite Turtle with SPARQL here to allow for maximum
consistency.

> In the end the current definition of ucharacters is in disagreement
> with both section 3.3 (String Escapes) and with tests from the test
> suite, since it does not allow \t, \n and \r escapes. It should
> definitely be changed to:
> (41) ucharacter ::= ( echaracter - #x3E ) | '\>'

This seems sane to me too.

Anyway, these are some initial thoughts about the current Turtle
syntax. I am happy to put these up on the wiki somewhere if need be.

Regards,

Mischa

Below is the original thread:

> My name is Mischa, I worked with Steve Harris at Garlik, and am on the W3C's
> current RDF WG. I was forwarded a number of emails which you have
> constructed where you critique the Turtle Syntax.

Oh, finally :-)

> So, as a high-level, do you mind if I forward your emails to the working
> group's mailing list, and do you mind if I add my 2 cents, similar to what I
> have done to your emails. If this is OK with you, I will cc you into the
> mails if that is OK?

Sure, please do.

> I have some comments, they will come inline, please feel free to let me know
> if you think I am wrong in the way I have responded to your points.

Sure, my replies follow your comments.

> On 2 Mar 2011, at 11:22, Steve Harris wrote:
[...]
> > Hello,
> > While implementing my own Turtle parser (see [1]), I took note of the
> > following errors in the Turtle grammar as defined in [2].
> > According to [3], the lexical representation of decimal values should
> > always contain a period as decimal indicator, hence rule (18) should
> > be:
> > (18) decimal ::= ('-' | '+')? ( [0-9]+ '.' [0-9]* | '.' ([0-9])+ )
>
> So, I don't agree with this.
> You seem to think that "9" is not a decimal number, and you refer to this
> link below:
> http://www.w3.org/TR/xmlschema-2/#decimal
> The link above suggests that "9" is a decimal for it states in section
> 3.2.3.1 :
> "Leading and trailing zeroes are optional. If the fractional part is zero,
> the period and following zero(es) can be omitted."
> Yves brought up a similar issue in this thread on the RDF WG mailing list.
>
> http://www.w3.org/mid/20110228120029.GG12598@dvbstreamer.national.core.bbc.co.uk

Mmmm... okay, hence some type inference is needed. If that is OK with
you, it's OK with me as well, but I have to warn you that it might be a
big problem in generic Turtle parsers (and indeed, I remember having
had this kind of issue with librdf some time ago).

> > Rules 2, 4 and 5 need to be changed as follows to allow white space usage:
> > (2) statement ::= directive ws* '.' | triples ws* '.' | ws+
>
> This makes perfect sense to allow for ws* between directives (prefix's and
> base's) and triples and their "."s.
> I was a bit concerned that the spacing was covered in another rule in the
> grammar, but it didn't seem to be from my POV.

It is not. I discovered those issues while writing my own parser, hence
there is nothing "redundant" in these "new rules" for sure.

> > (4) prefixID ::= '@prefix' ws+ prefixName? ':' ws* uriref
> > (5) base ::= '@base' ws+ uriref
>
> I am not sure I agree with these: why would you want spaces in between the
> prefixName and the URIRef? (Note that URIRefs should change to IRIs in
> this next standardisation effort.)
> But again I think this is up for debate, and I think it is worth posing to
> the community.

I added this just because I noticed that librdf allowed that. No other
reason.

> > Rules 6, 7 and 8 should also be changed to avoid white space-related
> > ambiguities (such as subject being a qname and there is no space
> > between subject and predicate):
> > (6) triples ::= subject ws+ predicateObjectList
> > (7) predicateObjectList ::= verb ws+ objectList (';' verb ws+
> > objectList)* (';')?
> > (8) object ::= object (ws+ object)*
>
> These all look good to me too. They are all just allowing whitespaces where
> I think libraptor would allow them anyway, right? Again, I think these
> could be posed to the working group to see what the consensus view is on the
> matter of whitespaces.

Yes, exactly.

> > In order to be consistent with the current librdf implementation,
> > rules 14 and 15 could (should?) also allow white spaces too without
> > introducing ambiguities:
> > (14) literal ::= quotedString (ws* '@' ws* language )? |
> > datatypeString | integer | double | decimal | boolean
> > (15) datatypeString ::= quotedString ws* '^^' ws* resource
>
> So, I am not sure I agree with these. I like the fact that the @ or the ^^
> need to come right after the quoted string. This matches up to libraptor
> parser anyway. Again, up for debate I guess.

Yes, same thing. Raptor allowed that, so I put them in.

> > Then, the current definition of lcharacters allows them to be double
> > quotes ("), which makes the end of a longString ambiguous (think
> > """ABC"""DEF"""). This can be solved by changing rules 37 and 43 as
> > follows:
> > (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+ )*
> > (43) lcharacter ::= scharacter | #x9 | #xA | #xD
>
> I see what you are doing here, I wonder if rule 37 should look more like :
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+
> )* #x22 #x22 #x22
> Anyway, SPARQL has its own definition of triple quoting, and I think the
> best plan is to unite Turtle with SPARQL here to allow for maximum
> consistency.

Yes, sorry, that was corrected in a later email. However, this change
is needed IMO, since otherwise there would be scanning/parsing
ambiguities and/or difficulties.

> > In the end the current definition of ucharacters is in disagreement
> > with both section 3.3 (String Escapes) and with tests from the test
> > suite, since it does not allow \t, \n and \r escapes. It should
> > definitely be changed to:
> > (41) ucharacter ::= ( echaracter - #x3E ) | '\>'
>
> I need to look at this a little further.

Ok.

> So Stefano, are you happy for me to put your concern across to the working
> group? I will aim to forward your mail in its entirety, and I will add my
> comments to your mails in an obvious manner, what do you think to this?
> Cheers,
> Mischa

Sure, it's fine with me. The reason why I sent the mails in the first
place was to have those "bugs" fixed.

However, I just wanted to add that Unicode code points U+D800 to U+DFFF
should not be allowed (those are surrogates reserved for UTF-16
encoding), and byte order marks should also be discussed IMO.

Thank you and please, let me know.

All the best,

Stefano

___________________________________
Mischa Tuffield PhD
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 1-3 Halford Road, Richmond, TW10 6AW
+44(0)845 652 2824 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD

Received on Wednesday, 9 March 2011 15:51:25 UTC