- From: Graham Klyne <gk@ninebynine.org>
- Date: Wed, 14 Jul 2004 09:57:59 +0100
- To: RDF comments <www-rdf-comments@w3.org>
I've just completed my first cut of implementing an RDF/XML parser in Haskell from the following reference:

[1] http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/

I believe this may be one of the first all-new RDF/XML parser implementations to be based on the RDFcore WG's final syntax specification, so I am hoping the following comments and feedback based on my experience may be of use to any future RDF review activity.

Generally, I found the specification was admirably clear, complete and accurate as a basis for implementation. I have tested my code using data based on Dave Beckett's Raptor [2] test suite, which I think incorporates all the applicable RDFcore test cases. With very few exceptions, my code passes all of the positive parser test cases. There are a number of negative test cases (invalid RDF) that my code currently fails to detect as such.

[2] http://www.redland.opensource.ac.uk/raptor/

What follows are comments mainly on details of presentation: I experienced only two difficulties with the specification, which have been reported in separate messages. The greatest challenge to correct implementation that I faced, and which my code only partially implements per specification, is the creation of XML literal values. Everything else seemed to just work as it should by following the specification.

More specific comments follow.

...

General, especially section 6.1.2:

I'm trying to make my implementation follow the specification as closely as possible. But I am implementing in a functional language (Haskell), and I am finding the idea of events having updateable fields to be unhelpful. Generally, it is much harder to determine algorithmic equivalence when a mutable value is being used than when pure immutable values are used.
Specifically, in my implementation, I am separating the 'li-counter' and 'subject' accessors (which are determined by the parsing process) from the other values of the Element event (which can be determined from the incoming XML infoset). The 'subject' accessor information is passed by parameterizing the production rules as needed, and having the nodeElement production rule return the resulting subject node value. I handle the li-counter as part of the parser state, though for expository purposes I might be inclined to describe it as a local counter within the nodeElement production.

Generally, I have found that the parser needs only two state variables:
- a counter for bnode allocation
- a counter for rdf:li attribute allocation.

Everything else is handled as a stateless transformation from input to generated triples.

...

Section 6.1.2:

The subject accessor of the Element event is described as an Identifier event, but it's not clear from the text what an identifier event is. The hyperlink indicates a "URI Reference event" (but this is invisible when working from printed copy).

...

Section 6.2:

This section does not appear to be entirely consistent, or in complete agreement with the grammar itself. It states here that the infoset is transformed into a "sequence of events", which is subsequently matched by the given grammar (section 6.3). Yet, examination of the grammar production for 'doc' shows that it matches a *single* root event, having children being a list of exactly one <rdf:RDF> element. Thus, the result of applying the transformation in section 6.2, and the grammar in section 7, seem to describe not a sequence but a shallow tree consisting of a root element whose children comprise a sequence of events.
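
The shape described in the Section 6.2 comment above can be made concrete with a small sketch. All type and function names here (Attribute, Event, Root, XmlNode, flatten, toRoot) are hypothetical and chosen only for illustration; they come neither from the specification nor from the implementation discussed in this message, and the XmlNode type is a drastically simplified stand-in for the XML infoset.

```haskell
-- Hypothetical, simplified types: not taken from the specification.
data Attribute = Attribute String String   -- attribute URI, string value
  deriving (Eq, Show)

-- Events in the flattened sequence. Attributes are not events in the
-- sequence; they stay attached to their Element event.
data Event
  = Element String [Attribute]   -- element URI and its attributes
  | EndElement
  | Text String
  deriving (Eq, Show)

-- A single root value whose children are the flat event sequence.
newtype Root = Root [Event]
  deriving (Eq, Show)

-- Simplified stand-in for the incoming XML infoset.
data XmlNode
  = XmlElem String [Attribute] [XmlNode]
  | XmlText String

-- Flatten one node into element / text / end-element events,
-- in document order.
flatten :: XmlNode -> [Event]
flatten (XmlText s)             = [Text s]
flatten (XmlElem uri attrs kids) =
  Element uri attrs : concatMap flatten kids ++ [EndElement]

-- The document element becomes the content of the single root.
toRoot :: XmlNode -> Root
toRoot = Root . flatten
```

The result has a fixed depth (root, content events, attributes) rather than being a purely linear token stream.
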
I think it would be clearer if the transformation in section 6.2 simply returned the event sequence corresponding to the children of the document root element, and if grammar production 7.2.8 (doc) were then dropped, using production 7.2.9 (RDF) in its place in section 7.2.1.

...

Section 6.2, again:

The description "... to turn the tree of events into a sequence in document order" is potentially confusing on another count: the result of transforming an element event is a sequence of element, end element and text events. Attribute events are not part of this sequence. When I originally read this, I understood the intent to be to turn the tree into a linear structure that was subject to a standard parsing approach. On examining the detail, it seems to be an unconventional hybrid approach, creating a tree with a fixed number of levels (root, content, attributes).

I find myself thinking that the "parsing" (generation of triples) could have been described in terms of a traversal of the event tree, without applying the "flattening" stage described in section 6.2.

...

Section 7, 7.2.2 to 7.2.7:

Although these are presented as being RDF grammar productions, they are not grammar productions in the sense that term is normally used. Rather, they are separate predicates that must be satisfied by facets of the tokens as they are matched.

Conventional grammar specification relates to a single sequence of tokens, defining what constitutes a valid sequence. The use of additional "semantic" constraints is not unusual, and this is how I see that the productions 7.2.2 to 7.2.7 are being used. I do find it potentially misleading that they are presented as productions of the main RDF grammar; I suppose they might be regarded as separate mini-grammars that are applied to components of tokens, but I think they would better be presented as separate URI predicates.

I also think similar comments might be applied to the attribute predicates 7.2.22 to 7.2.32 and 7.2.34.

...
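
To illustrate the comment on productions 7.2.2 to 7.2.7 above, here is a minimal sketch of those productions written as plain predicates over URI strings. The function names are hypothetical. The term lists are transcribed from the corresponding productions of the Recommendation [1] and should be checked against it before being relied upon.

```haskell
-- Namespace URI for the RDF vocabulary.
rdfNS :: String
rdfNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

-- Helper (hypothetical name): expand a local name in the RDF namespace.
rdf :: String -> String
rdf = (rdfNS ++)

-- 7.2.2 coreSyntaxTerms, 7.2.3 syntaxTerms, 7.2.4 oldTerms
coreSyntaxTerms, syntaxTerms, oldTerms :: [String]
coreSyntaxTerms =
  map rdf ["RDF", "ID", "about", "parseType", "resource", "nodeID", "datatype"]
syntaxTerms = coreSyntaxTerms ++ map rdf ["Description", "li"]
oldTerms    = map rdf ["aboutEach", "aboutEachPrefix", "bagID"]

-- 7.2.5: any URI except core syntax terms, rdf:li and old terms.
isNodeElementURI :: String -> Bool
isNodeElementURI u =
  u `notElem` (coreSyntaxTerms ++ [rdf "li"] ++ oldTerms)

-- 7.2.6: any URI except core syntax terms, rdf:Description and old terms.
isPropertyElementURI :: String -> Bool
isPropertyElementURI u =
  u `notElem` (coreSyntaxTerms ++ [rdf "Description"] ++ oldTerms)

-- 7.2.7: any URI except core syntax terms, rdf:Description, rdf:li
-- and old terms.
isPropertyAttributeURI :: String -> Bool
isPropertyAttributeURI u =
  u `notElem` (coreSyntaxTerms ++ [rdf "Description", rdf "li"] ++ oldTerms)
```

Written this way, each predicate is applied to the URI facet of an already-matched event, and the grammar proper only has to describe the sequence of events.
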
Section 6.1, Events:

It seems that two quite distinct concepts have been grouped together under the heading of Events: those that are determined entirely by the incoming XML source, and which relate to values in the XML infoset (Root, Element, End Element, Attribute, Text), and derivative values that relate to values in the resulting RDF graph (URI Reference, Blank Node, Plain Literal, Typed Literal).

I think it would be more helpful to maintain a separation of concepts more in keeping with implementations that create internal data structures, by treating Events and Nodes as distinct values, with the parsing process describing the mapping from the event stream to nodes and then to the node-triples that are in the resulting graph.

Certainly, in my code, it would make more sense if I used my own target "RDFLabel" type rather than Event for the various node values. (I could of course do this, but that would compromise my goal of having the implementation closely follow the specification. In the future, I may well do this.)

...

Section 7.2.14:

This description refers to a property element ('e') which is not matched by this production, but by each of the productions named by this production. I cannot implement this as described here, because the parser must match the element event before it can process the element name URI.

In my case, I push the processing described into the individual property element productions (using a common function to perform the common processing, of course). I think a similar approach could be taken here, using a grammar action notation to describe the handling of rdf:li property URIs.

...

Sections 7.2.15, 7.2.16:

It seems to be inconsistent that production 7.2.15 allows multiple consecutive whitespace elements, but that 7.2.16 allows only a single character data element for a literal property value. I would expect that ws* in 7.2.15 could be replaced by ws?.

...
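
Returning to the Section 7.2.14 comment and the two state variables mentioned earlier, the following is a minimal sketch of how a two-counter parser state and a common rdf:li helper might look when the state is threaded explicitly. Every name here (ParserState, newBnode, resolvePropertyURI, the "_:b" label format) is an assumption made for illustration and is not taken from the implementation discussed in this message. The sketch also omits saving and restoring the li counter around each node element, which a real parser would need so that numbering restarts for every container.

```haskell
-- Namespace URI for the RDF vocabulary.
rdfNS :: String
rdfNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

-- Hypothetical parser state: just the two counters.
data ParserState = ParserState
  { nextBnode :: Int   -- next blank node number to allocate
  , nextLi    :: Int   -- next rdf:_n number to allocate for rdf:li
  } deriving (Eq, Show)

initialState :: ParserState
initialState = ParserState { nextBnode = 1, nextLi = 1 }

-- Allocate a fresh blank node label and advance the counter.
newBnode :: ParserState -> (String, ParserState)
newBnode st =
  ("_:b" ++ show (nextBnode st), st { nextBnode = nextBnode st + 1 })

-- Common processing shared by the property element productions:
-- rdf:li is rewritten to rdf:_n using the counter; any other
-- property URI passes through with the state unchanged.
resolvePropertyURI :: String -> ParserState -> (String, ParserState)
resolvePropertyURI uri st
  | uri == rdfNS ++ "li" =
      (rdfNS ++ "_" ++ show (nextLi st), st { nextLi = nextLi st + 1 })
  | otherwise = (uri, st)
```

Each property element production would call resolvePropertyURI after matching its element event, which avoids needing the element inside production 7.2.14 itself.
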
Sections 7.2.17, 7.2.27:

Should datatype URIs be evaluated with reference to the current base URI? The current syntax does not appear to call for this. I see little practical value in doing this, but it seems somewhat inconsistent that datatype URIs are treated differently from other URIs. If datatype URIs are not evaluated as relative to the current base URI, I think it might be helpful to state this explicitly.

...

Section 6, 7.2.17:

I've noted in another message difficulties I find in the handling of XML literals, in that it makes it hard to achieve a clean separation between the preprocessing to syntax data model events, and subsequent syntax analysis. I have adopted a partial approach here: I transform from infoset to event tree as described, but when "flattening" the event tree I do not flatten any element that contains an rdf:parseType="Literal" or equivalent attribute. The effect of this is that my XML literal values cannot contain comment or PI information items.

...

That's all.

#g

------------
Graham Klyne
For email: http://www.ninebynine.org/#Contact
Received on Wednesday, 14 July 2004 04:57:41 UTC