- From: Hugh Glaser <hugh@glasers.org>
- Date: Mon, 2 Dec 2013 11:09:55 +0000
- To: "<ross.horne@gmail.com>" <ross.horne@gmail.com>
- Cc: public-lod community <public-lod@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>
On 2 Dec 2013, at 06:24, Ross Horne <ross.horne@gmail.com> wrote:

> Andy is right (as usual!). With the proposed bnode encoding, the graph becomes fatter each time the same triple is loaded. But how much fatter was the question.
>
> RDF 1.1 has just fixed the mess caused by blurring the roles of the lexer and the parser, as summarised by David recently: http://lists.w3.org/Archives/Public/public-lod/2013Nov/0093.html

Ah yes, I forgot that everything is rosy now with 1.1 - sorry.

> Please don't get back into mixing up the lexer and the parser. The lexical spaces of the basic datatypes are disjoint, so in any language we can just write:
> - 999 instead of "999"^^xsd:integer
> - 9.99 instead of "9.99"^^xsd:decimal
> - "WWV" instead of "WWV"^^xsd:string
> - 2013-06-6T11:00:00+01:00 instead of "2013-06-6T11:00:00+01:00"^^xsd:dateTime
>
> As part of a compiler [1], a lexer gobbles up characters, e.g. 999, and turns the characters into a token. A token consists of a string, called an attribute value, plus a token name, e.g. "999"^^xsd:integer. Only a relatively small handful of people writing compilers for languages should have to care about how tokens are represented, not end users of languages.

Well personally I prefer the first version I used for my course on this when it came out in 1977, the Dragon Book - "Principles of Compiler Design", before Sethi polluted it with all that type-checking stuff :-)

Actually, it wasn’t about blurring the lexer and parser - the graph semantics were different.
It was closer to having two representations of zero in the machine (as some machines used to have), and having to write code to ensure that you coped with both of them.

Of course your examples do raise the issue of multiple representations for the same thing if the user is not careful.
23.4, 23.5, 23.0, 23.2, 23, 23.1, 023.0, 023 all of which are different RDF terms.
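The distinction between distinct RDF terms and equal values can be illustrated with a minimal Python sketch. The patterns and the `lex` and `value` helpers are hypothetical and simplified; they are not taken from any RDF library, and they assume that a term is just a (lexical form, datatype) pair compared character by character.

```python
import re
from decimal import Decimal

# Hypothetical, simplified patterns for two of the disjoint lexical spaces.
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")

def lex(text):
    """Turn bare characters into a (lexical form, datatype) pair."""
    if INTEGER.fullmatch(text):
        return (text, "xsd:integer")
    if DECIMAL.fullmatch(text):
        return (text, "xsd:decimal")
    # Assumption: anything else is treated as a plain string.
    return (text, "xsd:string")

def value(term):
    """Map a term to the value it denotes (numeric datatypes only)."""
    lexical, datatype = term
    if datatype in ("xsd:integer", "xsd:decimal"):
        return Decimal(lexical)
    return lexical

forms = ["23", "23.0", "023", "023.0"]
terms = {lex(form) for form in forms}      # compared as pairs of strings
values = {value(term) for term in terms}   # compared as numbers

print(len(terms), len(values))
```

Under these assumptions the four spellings stay four separate terms while denoting a single value, so any code that compares terms rather than values has to cope with all of them.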
Would a lexer/parser make 23.00 and 23.000 different RDF terms? I find myself thinking I should know, but don’t - my guess is it should.
(RDF 1.1 doesn’t seem to give guidance on this.)

And I find myself getting strangely interested in your dateTime example.
I think most lexers will reject it?
Or friendly ones will treat it as the correct lexical form:
2013-06-06T11:00:00+01:00
(You need to pad the day)

So maybe we need to get a bit more explicit about the RDF term for dateTime (unless I have missed it)?
That the RDF term is always in UTC? - This is what the xsd standard says.
That the RDF term always has a fractional second part? - Good question.
That the RDF term always has a timezone? - Better question.
(See http://www.w3.org/TR/xmlschema-2/#dateTime )
Or are we happy with many different representations of a given dateTime?
(Of course xsd:dateTime does get into problems with year zero, but let's not worry about that :-) )

But I guess my friendly RDF parser gnomes (all hail!) already have stories for all this.

Best
Hugh

> For language tags, a little simple conventional datatype subtyping (as opposed to rdfs:subClassOf), could help the programmer further [2]. e.g. a programmer that writes regex("WWV2013"@en, "WWV") clearly meant regex("WWV2013", "WWV") and shouldn't have to care about the distinction, unless I am mistaken.
>
> Regards,
>
> Ross
>
> [1] Ullman, Aho, Lam and Sethi. Compilers: principles, techniques and tools. 1986
> [2] Local Type Checking for Linked Data Consumers. http://dx.doi.org/10.4204/EPTCS.123.4
>

-- 
Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
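The dateTime questions in the message can be made concrete with a small Python sketch. It assumes a deliberately simplified pattern for the xsd:dateTime lexical space (four-digit years, no fractional seconds, no "Z" shorthand); the names `XSD_DATETIME`, `is_lexical_datetime` and `to_utc` are hypothetical, and moving values to UTC is shown as one possible normalisation, not as something RDF 1.1 requires.

```python
import re
from datetime import datetime, timezone

# Simplified pattern (assumption): padded fields, optional +hh:mm offset.
XSD_DATETIME = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}([+-]\d{2}:\d{2})?"
)

def is_lexical_datetime(text):
    """Check the text against the simplified lexical pattern."""
    return XSD_DATETIME.fullmatch(text) is not None

def to_utc(text):
    """Parse a lexical form and shift it to UTC when it has an offset."""
    if not is_lexical_datetime(text):
        raise ValueError("not in the dateTime lexical space: " + text)
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # No timezone given, so there is nothing to shift.
        return parsed
    return parsed.astimezone(timezone.utc)

print(is_lexical_datetime("2013-06-6T11:00:00+01:00"))   # unpadded day
print(is_lexical_datetime("2013-06-06T11:00:00+01:00"))  # padded day
print(to_utc("2013-06-06T11:00:00+01:00").isoformat())
```

A strict lexer along these lines rejects the unpadded day, while the padded form and its UTC equivalent denote the same instant even though their lexical forms differ.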
Received on Monday, 2 December 2013 11:10:26 UTC