Implementation feedback on RDF syntax spec

I've just completed my first cut of implementing an RDF/XML parser in 
Haskell from the following reference:

[1]  http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/

I believe this may be one of the first all-new RDF/XML parser 
implementations to be based on the RDFcore WG's final syntax specification, 
so I am hoping the following comments and feedback based on my experience 
may be of use to any future RDF review activity.

Generally, I found the specification was admirably clear, complete and 
accurate as a basis for implementation.   I have tested my code using data 
based on Dave Beckett's Raptor [2] test suite, which I think incorporates 
all the applicable RDFcore test cases.  With very few exceptions, my code 
passes all of the positive parser test cases.  There are a number of 
negative test cases (invalid RDF) that my code currently fails to detect as 
such.

[2] http://www.redland.opensource.ac.uk/raptor/

What follow are comments mainly on details of presentation:  I experienced 
only two difficulties with the specification, which have been reported in 
separate messages.  The greatest challenge to correct implementation that I 
faced, and one that my code handles only partially per the specification, 
is the creation of XML literal values.  Everything else seemed to just work 
as it should by following the specification.

More specific comments follow.

...

General, especially section 6.1.2:

I'm trying to make my implementation follow the specification as closely 
as possible.  But I am implementing in a functional language (Haskell), 
and I am finding the idea of events having updateable fields to be 
unhelpful.  Generally, it is much harder to determine algorithmic 
equivalence when a mutable value is being used than when pure immutable 
values are used.

Specifically, in my implementation, I am separating the 'li-counter' and 
'subject' accessors (which are determined by the parsing process) from the 
other values of the Element event (which can be determined from the 
incoming XML infoset).  The 'subject' accessor information is passed by 
parameterizing the production rules as needed, and having the nodeElement 
production rule return the resulting subject node value.  I handle the 
li-counter as part of the parser state, though for expository purposes I 
might be inclined to describe it as a local counter within the nodeElement 
production.

Generally, I have found that the parser needs only two state variables:
- a counter for bnode allocation
- a counter for rdf:li property allocation.
Everything else is handled as a stateless transformation from input to 
generated triples.
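
For illustration, the two counters can be threaded through the parser as a 
plain immutable value.  This is a minimal sketch only:  the names 
ParserState, freshBnode and nextLiOrdinal are hypothetical, and it ignores 
how the li counter would be saved and reset for each node element.

```haskell
-- Hypothetical parser state: the only two values that change during parsing.
data ParserState = ParserState
  { nextBnode :: Int   -- next blank node number to allocate
  , nextLi    :: Int   -- next rdf:li ordinal to allocate
  } deriving (Eq, Show)

initialState :: ParserState
initialState = ParserState { nextBnode = 1, nextLi = 1 }

-- Allocate a fresh blank node identifier and return the advanced state.
freshBnode :: ParserState -> (String, ParserState)
freshBnode st = ("_:b" ++ show n, st { nextBnode = n + 1 })
  where n = nextBnode st

-- Allocate the next rdf:li ordinal and return the advanced state.
nextLiOrdinal :: ParserState -> (Int, ParserState)
nextLiOrdinal st = (n, st { nextLi = n + 1 })
  where n = nextLi st
```

Each function returns a new state rather than updating a field in place, 
so the production rules themselves remain pure functions.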

...

Section 6.1.2:

The subject accessor of the Element event is described as an Identifier 
event, but it's not clear from the text what an Identifier event is.  The 
hyperlink indicates a "URI Reference event" (but this is invisible when 
working from printed copy).

...

Section 6.2:

This section does not appear to be entirely consistent, or in complete 
agreement with the grammar itself.  It states here that the infoset is 
transformed into a "sequence of events", which is subsequently matched by 
the given grammar (section 6.3).

Yet, examination of the grammar production for 'doc' shows that it matches 
a *single* root event, whose children are a list of exactly one 
<rdf:RDF> element.

Thus, the transformation in section 6.2 and the grammar in section 7 
together seem to describe not a sequence but a shallow tree consisting of 
a root element whose children comprise a sequence of events.

I think it would be clearer if the transformation in section 6.2 simply 
returned the event sequence corresponding to the children of the document 
root element, and then to drop grammar production 7.2.8 (doc), using 
production 7.2.9 (RDF) in its place in section 7.2.1.

...

Section 6.2, again:

The description "... to turn the tree of events into a sequence in document 
order" is potentially confusing on another count:  the result of 
transforming an element event is a sequence of element, end element and 
text events.  Attribute events are not part of this sequence.  When I 
originally read this, I understood the intent to be to turn the tree into a 
linear structure that was subject to a standard parsing approach.  On 
examining the detail, it seems to be an unconventional hybrid approach, 
creating a tree with a fixed number of levels (root,content,attributes).

I find myself thinking that the "parsing" (generation of triples) could 
have been described in terms of a traversal of the event tree, without 
applying the "flattening" stage described in section 6.2.
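
To make the contrast concrete, here is a minimal sketch of the two 
approaches.  All type and function names are hypothetical, and the event 
model is reduced to elements and text.  Note that the flattened tokens 
still carry their attributes as a nested list, which is the "hybrid" shape 
described above.

```haskell
-- Hypothetical event tree: attributes stay attached to their element.
data Element = Element
  { elemName  :: String
  , elemAttrs :: [(String, String)]
  , elemKids  :: [Content]
  } deriving (Eq, Show)

data Content = Child Element | Text String
  deriving (Eq, Show)

-- Tokens of a flattened sequence in document order.
data Token
  = Start String [(String, String)]  -- element event, attributes still nested
  | End                              -- end element event
  | Chars String                     -- text event
  deriving (Eq, Show)

-- Flatten an element: its start token, its content in order, its end token.
flatten :: Element -> [Token]
flatten (Element name attrs kids) = Start name attrs : concatMap go kids ++ [End]
  where
    go (Child e) = flatten e
    go (Text s)  = [Chars s]

-- The alternative: traverse the tree directly.  As a stand-in for triple
-- generation, this just collects element names in document order.
elementNames :: Element -> [String]
elementNames (Element name _ kids) = name : concat [elementNames e | Child e <- kids]
```

A direct traversal such as elementNames never needs the End token, since 
the tree structure already says where each element finishes.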

...

Section 7, 7.2.2 to 7.2.7

Although these are presented as being RDF grammar productions, they are not 
grammar productions in the sense that term is normally used.  Rather they 
are separate predicates that must be satisfied by facets of the tokens as 
they are matched.

Conventional grammar specification relates to a single sequence of tokens, 
defining what constitutes a valid sequence.  The use of additional 
"semantic" constraints is not unusual, and this is how I see productions 
7.2.2 to 7.2.7 being used.  I do find it potentially misleading that they 
are presented as productions of the main RDF grammar;  I suppose they 
might be regarded as separate mini-grammars that are applied to components 
of tokens, but I think they would better be presented as separate URI 
predicates.

I also think similar comments might be applied to the attribute predicates 
7.2.22 to 7.2.32 and 7.2.34.
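
As a sketch of what "separate URI predicates" might look like, the 
following expresses productions 7.2.2 to 7.2.7 as plain functions on URI 
strings.  The function names are hypothetical, and the term lists are 
transcribed from the specification as I read it, so they should be checked 
against [1] before being relied upon.

```haskell
rdfNS :: String
rdfNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

-- Build a full URI from a local name in the RDF namespace.
rdf :: String -> String
rdf local = rdfNS ++ local

-- 7.2.2, 7.2.3, 7.2.4: fixed sets of reserved terms.
coreSyntaxTerms, syntaxTerms, oldTerms :: [String]
coreSyntaxTerms =
  map rdf ["RDF", "ID", "about", "parseType", "resource", "nodeID", "datatype"]
syntaxTerms = coreSyntaxTerms ++ map rdf ["Description", "li"]
oldTerms    = map rdf ["aboutEach", "aboutEachPrefix", "bagID"]

-- 7.2.5: any URI except core syntax terms, rdf:li and old terms.
isNodeElementURI :: String -> Bool
isNodeElementURI u = u `notElem` (coreSyntaxTerms ++ [rdf "li"] ++ oldTerms)

-- 7.2.6: any URI except core syntax terms, rdf:Description and old terms.
isPropertyElementURI :: String -> Bool
isPropertyElementURI u =
  u `notElem` (coreSyntaxTerms ++ [rdf "Description"] ++ oldTerms)

-- 7.2.7: any URI except all syntax terms and old terms.
isPropertyAttributeURI :: String -> Bool
isPropertyAttributeURI u = u `notElem` (syntaxTerms ++ oldTerms)
```

Written this way, the predicates are applied to the URI component of a 
token after the token itself has been matched by the main grammar.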

...

Section 6.1, Events

It seems that two quite distinct concepts have been grouped together under 
the heading of Events:  those that are determined entirely by the incoming 
XML source, and which relate to values in the XML infoset (Root, Element, 
End Element, Attribute, Text), and derivative values that relate to values 
in the resulting RDF graph (URI Reference, Blank Node, Plain Literal, Typed 
Literal).

I think it would be more helpful to maintain a separation of concepts more 
in keeping with implementations that create internal data structures, by 
treating Events and Nodes as distinct values, with the parsing process 
describing the mapping from the event stream to nodes and then to the 
node-triples that are in the resulting graph.

Certainly, in my code, it would make more sense if I used my own target 
"RDFLabel" type rather than Event for the various node values.  (I could of 
course do this, but that would compromise my goal of having the 
implementation closely follow the specification.  In the future, I may well 
do this.)
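
A minimal sketch of the separation suggested here, assuming a much-reduced 
event model.  "RDFLabel" is the type name mentioned above;  its 
constructors, the Event constructors and the literalProperty function are 
hypothetical and are not taken from the actual code.

```haskell
-- Values determined entirely by the incoming XML infoset.
data Event
  = RootEvent [Event]
  | ElementEvent String [(String, String)] [Event]  -- name URI, attributes, children
  | TextEvent String
  deriving (Eq, Show)

-- Values that appear as nodes in the resulting RDF graph.
data RDFLabel
  = UriRef String
  | Blank String
  | PlainLit String (Maybe String)  -- lexical form, optional language tag
  | TypedLit String String          -- lexical form, datatype URI
  deriving (Eq, Show)

type Triple = (RDFLabel, RDFLabel, RDFLabel)

-- One step of the mapping from events to nodes and triples:  a property
-- element containing only text yields a plain literal triple.
literalProperty :: RDFLabel -> Event -> Maybe Triple
literalProperty subj (ElementEvent name _ [TextEvent s]) =
  Just (subj, UriRef name, PlainLit s Nothing)
literalProperty _ _ = Nothing
```

With this split, the subject is passed in as an RDFLabel parameter rather 
than being stored as a field of the Element event.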

...

Section 7.2.14.

This description refers to a property element ('e') which is not matched by 
this production, but by each of the productions named by this production.

I cannot implement this as described here, because the parser must match 
the element event before it can process the element name URI.  In my case, 
I push the processing described into the individual property element 
productions (using a common function to perform the common processing, of 
course).  I think a similar approach could be taken here, using a grammar 
action notation to describe the handling of rdf:li property URIs.
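
The common processing might be sketched as a single pure function that 
each property element production calls after it has matched its element 
event.  The name propertyUri is hypothetical, and the li counter is passed 
explicitly here purely to keep the example self-contained.

```haskell
rdfNS :: String
rdfNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

-- Map a matched property element URI to the predicate URI to use, given the
-- current li counter.  rdf:li becomes rdf:_n and advances the counter;  any
-- other URI passes through with the counter unchanged.
propertyUri :: Int -> String -> (String, Int)
propertyUri n uri
  | uri == rdfNS ++ "li" = (rdfNS ++ "_" ++ show n, n + 1)
  | otherwise            = (uri, n)
```

For example, applying propertyUri to rdf:li with a counter of 1 yields 
rdf:_1 and a new counter of 2, while any other property URI is returned as 
given.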

...

Sections 7.2.15, 7.2.16:

It seems to be inconsistent that production 7.2.15 allows multiple 
consecutive whitespace elements, but that 7.2.16 allows only a single 
character data element for a literal property value.  I would expect that 
ws* in 7.2.15 could be replaced by ws?.

...

Sections 7.2.17, 7.2.27:

Should datatype URIs be evaluated with reference to the current base 
URI?  The current syntax does not appear to call for this.

I see little practical value in doing this, but it seems somewhat 
inconsistent that datatype URIs are treated differently from other 
URIs.  If datatype URIs are not evaluated as relative to the current base 
URI, I think it might be helpful to state this explicitly.

...

Section 6, 7.2.17

I've noted in another message the difficulties I find with the handling of 
XML literals:  it makes it hard to achieve a clean separation between the 
preprocessing to syntax data model events, and subsequent syntax analysis.

I have adopted a partial approach here:  I transform from infoset to event 
tree as described, but when "flattening" the event tree I do not flatten 
any element that contains an rdf:parseType="Literal" or equivalent 
attribute.  The effect of this is that my XML literal values cannot contain 
comment or PI information items.

...

That's all.

#g


------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact

Received on Wednesday, 14 July 2004 04:57:41 UTC