- From: Mischa Tuffield <mischa.tuffield@garlik.com>
- Date: Wed, 9 Mar 2011 15:50:40 +0000
- To: RDF Working Group WG <public-rdf-wg@w3.org>
- Cc: Stefano D'Angelo <zanga.mail@gmail.com>
- Message-Id: <CDD6F29B-BC0E-4AE7-B994-8017FC38440C@garlik.com>
Hello,
I am attaching an email I was forwarded by Stefano D'Angelo (cc'd) highlighting some concerns he had with the Turtle syntax definition. I will summarise his / my findings here, but have the email thread posted below for provenance purposes.
Relevant points:
> According to [3], the lexical representation of decimal values should
> always contain a period as decimal indicator, hence rule (18) should
> be:
>
> (18) decimal ::= ('-' | '+')? ( [0-9]+ '.' [0-9]* | '.' ([0-9])+ )
I am not sure I agree with this: "9" (in my reading) is a decimal number, as per [1].
Section 3.2.3.1 states:
"Leading and trailing zeroes are optional. If the fractional part is zero,
the period and following zero(es) can be omitted."
Yves brought up a similar issue in this thread on the RDF WG mailing list [2].
[1] http://www.w3.org/TR/xmlschema-2/#decimal
[2] http://www.w3.org/mid/20110228120029.GG12598@dvbstreamer.national.core.bbc.co.uk
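The difference between the two readings can be sketched with a pair of regular expressions. This is only an illustration in Python: the names PROPOSED_RULE_18 and XSD_DECIMAL are hypothetical, and the second pattern is my reading of the lexical form described in [1], not text taken from either specification.

```python
import re

# Rule (18) as proposed above: a period is mandatory.
PROPOSED_RULE_18 = re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)")

# Assumed reading of the XSD decimal lexical form: the period (and any
# following digits) may be omitted when the fractional part is zero.
XSD_DECIMAL = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")


def matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


for sample in ["9", "9.", "9.0", ".5", "-1.25"]:
    print(sample, matches(PROPOSED_RULE_18, sample), matches(XSD_DECIMAL, sample))
```

Only "9" separates the two patterns: the proposed rule rejects it, while the reading based on section 3.2.3.1 accepts it.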
> Rules 2, 4 and 5 need to be changed as follows to allow white space usage:
>
> (2) statement ::= directive ws* '.' | triples ws* '.' | ws+
This makes perfect sense: it allows ws* between directives (prefixes and
bases) and triples and their terminating "."s.
I was a bit concerned that this spacing might already be covered by another
rule in the grammar, but I could not spot one.
> (4) prefixID ::= '@prefix' ws+ prefixName? ':' ws* uriref
>
> (5) base ::= '@base' ws+ uriref
I am not sure it is necessary to allow whitespace in rules 4 and 5, but apparently librdf allows this.
> Rules 6, 7 and 8 should also be changed to avoid white space-related
> ambiguities (such as subject being a qname and there is no space
> between subject and predicate):
>
> (6) triples ::= subject ws+ predicateObjectList
>
> (7) predicateObjectList ::= verb ws+ objectList (';' verb ws+ objectList)* (';')?
>
> (8) object ::= object (ws+ object)*
These changes all look good to me. They just allow whitespace in places where parsers commonly accept it.
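The ambiguity mentioned for a qname subject with no space before the predicate can be sketched as follows. This is a rough Python illustration only: QNAME is a deliberately simplified stand-in (ASCII letters only) for the grammar's qname rule, and possible_splits is a hypothetical helper, not part of any parser.

```python
import re

# Simplified stand-in for qname ::= prefixName? ':' name?
# Assumption: ASCII letters only; the real rule allows far more characters.
QNAME = re.compile(r"[A-Za-z]*:[A-Za-z]*")


def possible_splits(text):
    """Return every way to cut text into two adjacent qnames."""
    return [
        (text[:i], text[i:])
        for i in range(1, len(text))
        if QNAME.fullmatch(text[:i]) and QNAME.fullmatch(text[i:])
    ]


print(possible_splits("a:bc:d"))
# [('a:', 'bc:d'), ('a:b', 'c:d'), ('a:bc', ':d')]
```

Without mandatory whitespace, the input "a:bc:d" has three valid subject/predicate readings under this simplified rule; requiring ws+ between them, as in the proposed rule (6), leaves exactly one.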
> In order to be consistent with the current librdf implementation,
> rules 14 and 15 could (should?) also allow white spaces too without
> introducing ambiguities:
>
> (14) literal ::= quotedString (ws* '@' ws* language )? | datatypeString | integer | double | decimal | boolean
>
> (15) datatypeString ::= quotedString ws* '^^' ws* resource
I am not sure how I feel about these changes. I think I would like to see both @ and ^^ come directly after a literal in Turtle, but again, apparently raptor allows whitespace in these places.
> Then, the current definition of lcharacters allows them to be double
> quotes ("), which makes the end of a longString ambiguous (think
> """ABC"""DEF"""). This can be solved by changing rules 37 and 43 as
> follows:
>
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+ )* #x22 #x22 #x22
>
> (43) lcharacter ::= scharacter | #x9 | #xA | #xD
I see what you are doing here, I wonder if rule 37 should look more like :
(37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+)* #x22 #x22 #x22
Anyway, SPARQL has its own definition of triple quoting, and I think the best plan is to unite Turtle with SPARQL triples to allow for maximum consistency.
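The effect of the revised rule 37 on the """ABC"""DEF""" example can be sketched with a regular expression. This is a minimal Python sketch under a stated assumption: LCHAR simplifies lcharacter to "any character except a double quote or backslash, or a backslash followed by one character", and the names LCHAR, LONG_STRING and scan_long_string are hypothetical.

```python
import re

# Assumption: simplified lcharacter that can never be a bare double quote.
LCHAR = r'(?:[^"\\]|\\.)'

# Rule (37) as revised: one or two quotes inside the string must be
# followed by at least one lcharacter, so three quotes always end it.
LONG_STRING = re.compile(
    r'"""' + LCHAR + r'*(?:""?' + LCHAR + r'+)*"""',
    re.DOTALL,
)


def scan_long_string(text):
    """Return the long string token at the start of text, or None."""
    found = LONG_STRING.match(text)
    return found.group(0) if found else None


print(scan_long_string('"""ABC"""DEF"""'))
print(scan_long_string('"""a "quoted" word"""'))
```

Under this sketch the first call stops at the first closing triple quote and yields """ABC""", while embedded single or double quotes, as in the second call, stay inside the token.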
> In the end the current definition of ucharacters is in disagreement
> with both section 3.3 (String Escapes) and with tests from the test
> suite, since it does not allow \t, \n and \r escapes. It should
> definitely be changed to:
>
> (41) ucharacter ::= ( echaracter - #x3E ) | '\>'
Which seems sane to me too.
Anyway, these are some initial thoughts about the current Turtle syntax; I am happy to put these up on the wiki somewhere if need be.
Regards,
Mischa
Below is the original thread :
> My name is Mischa, I worked with Steve Harris at Garlik, and am on the W3C's
> current RDF WG. I was forwarded a number of emails which you have
> constructed where you critique the Turtle Syntax.
Oh, finally :-)
> So, as a high-level, do you mind if I forward your emails to the working
> group's mailing list, and do you mind if I add my 2 cents, similar to what I
> have done to your emails. If this is OK with you, I will cc you into the
> mails if that is OK?
Sure, please do.
> I have some comments, they will come inline, please feel free to let me know
> if you think I am wrong in the way I have responded to your points.
Sure, my replies follow your comments.
> On 2 Mar 2011, at 11:22, Steve Harris wrote:
[...]
> Hello,
>
> While implementing my own Turtle parser (see [1]), I took note of the
>
> following errors in the Turtle grammar as defined in [2].
>
> According to [3], the lexical representation of decimal values should
>
> always contain a period as decimal indicator, hence rule (18) should
>
> be:
>
> (18) decimal ::= ('-' | '+')? ( [0-9]+ '.' [0-9]* | '.' ([0-9])+ )
>
> So, I don't agree with this.
> You seem to think that "9" is not a decimal number, and you refer to this
> link below:
> http://www.w3.org/TR/xmlschema-2/#decimal
> The link above suggests that "9" is a decimal for it states in section
> 3.2.3.1 :
> "Leading and trailing zeroes are optional. If the fractional part is zero,
> the period and following zero(es) can be omitted."
> Yves brought up a similar issue in this thread on the RDF WG mailing list.
>
> http://www.w3.org/mid/20110228120029.GG12598@dvbstreamer.national.core.bbc.co.uk
Mmmm... okay, hence some type inference is needed. If that is ok with
you, it's ok with me as well, but I have to warn you that it might be a
big problem in generic Turtle parsers (and indeed, I remember having
had this kind of issue with librdf some time ago).
> Rules 2, 4 and 5 need to be changed as follows to allow white space usage:
>
> (2) statement ::= directive ws* '.' | triples ws* '.' | ws+
>
> This makes perfect sense to allow for ws* between directives (prefix's and
> base's) and triples and their "."s.
> I was a bit concerned that the spacing was covered in another rule in the
> grammar, but it didn't seem to be from my POV.
It is not. I discovered those issues while writing my own parser,
hence there is nothing "redundant" in these "new rules" for sure.
> (4) prefixID ::= '@prefix' ws+ prefixName? ':' ws* uriref
>
> (5) base ::= '@base' ws+ uriref
>
> I am not sure I agree with these. Why would you want spaces in between the
> prefixName and the URIRef? (Note that URIRefs should change to IRIs in
> this next standardisation effort.)
> But again I think this is up for debate, and I think it is worth posing to
> the community.
I added this just because I noticed that librdf allowed that. No other reason.
> Rules 6, 7 and 8 should also be changed to avoid white space-related
>
> ambiguities (such as subject being a qname and there is no space
>
> between subject and predicate):
>
> (6) triples ::= subject ws+ predicateObjectList
>
> (7) predicateObjectList ::= verb ws+ objectList (';' verb ws+
>
> objectList)* (';')?
>
> (8) object ::= object (ws+ object)*
>
> These all look good to me too. They are all just allowing whitespaces where
> I think libraptor would allow them anyways , right ? Again, I think these
> could be posed to the working group to see what the consensus view is on the
> matter of whitespaces.
Yes, exactly.
> In order to be consistent with the current librdf implementation,
>
> rules 14 and 15 could (should?) also allow white spaces too without
>
> introducing ambiguities:
>
> (14) literal ::= quotedString (ws* '@' ws* language )? |
>
> datatypeString | integer | double | decimal | boolean
>
> (15) datatypeString ::= quotedString ws* '^^' ws* resource
>
> So, I am not sure I agree with these. I like the fact that the @ or the ^^
> need to come right after the quoted string. This matches up to libraptor
> parser anyways. Again, up for debate I guess.
Yes, same thing. Raptor allowed that, so I put them in.
> Then, the current definition of lcharacters allows them to be double
>
> quotes ("), which makes the end of a longString ambiguous (think
>
> """ABC"""DEF"""). This can be solved by changing rules 37 and 43 as
>
> follows:
>
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+ )*
>
> (43) lcharacter ::= scharacter | #x9 | #xA | #xD
>
> I see what you are doing here, I wonder if rule 37 should look more like :
> (37) longString ::= #x22 #x22 #x22 lcharacter* ( #x22 #x22? lcharacter+
> )* #x22 #x22 #x22
> Anyway, SPARQL has its own definition of triple quoting, and I think the
> best plan is to unite Turtle with SPARQL triples to allow for maximum
> consistency.
Yes, sorry, that was corrected in a later email. However, this one is
needed IMO, since it would create scanning/parsing ambiguities and/or
difficulties.
> In the end the current definition of ucharacters is in disagreement
>
> with both section 3.3 (String Escapes) and with tests from the test
>
> suite, since it does not allow \t, \n and \r escapes. It should
>
> definitely be changed to:
>
> (41) ucharacter ::= ( echaracter - #x3E ) | '\>'
>
> I need to look at this a little further.
Ok.
> So Stefano, are you happy for me to put your concerns across to the working
> group? I will aim to forward your mail in its entirety, and I will add my
> comments to your mails in an obvious manner. What do you think of this?
> Cheers,
> Mischa
Sure, it's fine with me. The reason why I sent the mails in the first
place was to have those "bugs" fixed.
However, I just wanted to add that Unicode code points U+D800 to
U+DFFF should not be allowed (those are surrogates reserved for UTF-16
encoding), and byte order marks should also be discussed IMO.
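A check for the surrogate range mentioned above could look like the following. This is only a Python sketch; is_surrogate and first_surrogate_index are hypothetical helper names, and nothing here is taken from an existing Turtle parser.

```python
def is_surrogate(code_point):
    """Return True for code points in the UTF-16 surrogate range."""
    return 0xD800 <= code_point <= 0xDFFF


def first_surrogate_index(text):
    """Return the index of the first surrogate in text, or None."""
    for index, char in enumerate(text):
        if is_surrogate(ord(char)):
            return index
    return None


print(first_surrogate_index("abc"))
print(first_surrogate_index("a\ud800b"))
```

A parser following this suggestion would reject any input for which first_surrogate_index returns an index rather than None.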
Thank you and please, let me know.
All the best,
Stefano
___________________________________
Mischa Tuffield PhD
Email: mischa.tuffield@garlik.com
Homepage - http://mmt.me.uk/
Garlik Limited, 1-3 Halford Road, Richmond, TW10 6AW
+44(0)845 652 2824 http://www.garlik.com/
Registered in England and Wales 535 7233 VAT # 849 0517 11
Registered office: Thames House, Portsmouth Road, Esher, Surrey, KT10 9AD
Received on Wednesday, 9 March 2011 15:51:25 UTC