- From: Graham Klyne <gk@ninebynine.org>
- Date: Wed, 14 Jul 2004 09:57:59 +0100
- To: RDF comments <www-rdf-comments@w3.org>
I've just completed my first cut of implementing an RDF/XML parser in
Haskell from the following reference:
[1] http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/
I believe this may be one of the first all-new RDF/XML parser
implementations to be based on the RDFcore WG's final syntax specification,
so I am hoping the following comments and feedback based on my experience
may be of use to any future RDF review activity.
Generally, I found the specification was admirably clear, complete and
accurate as a basis for implementation. I have tested my code using data
based on Dave Beckett's Raptor [2] test suite, which I think incorporates
all the applicable RDFcore test cases. With very few exceptions, my code
passes all of the positive parser test cases. There are a number of
negative test cases (invalid RDF) that my code currently fails to detect as
such.
[2] http://www.redland.opensource.ac.uk/raptor/
What follow are comments mainly on details of presentation: I experienced
only two difficulties with the specification, which have been reported in
separate messages. The greatest challenge to correct implementation that I
faced, and which my code only partially implements per specification, is
the creation of XML literal values. Everything else seemed to just work as
it should by following the specification.
More specific comments follow.
...
General, especially section 6.1.2:
I'm trying to make my implementation follow the specification as closely as
possible. But I am implementing in a functional language (Haskell),
and I am finding the idea of events having updateable fields to be
unhelpful. Generally, it is much harder to determine algorithmic
equivalence when a mutable value is being used than when pure immutable
values are used.
Specifically, in my implementation, I am separating the 'li-counter' and
'subject' accessors (which are determined by the parsing process) from the
other values of the Element event (which can be determined from the
incoming XML infoset). The 'subject' accessor information is passed by
parameterizing the production rules as needed, and having the nodeElement
production rule return the resulting subject node value. I handle the
li-counter as part of the parser state, though for expository purposes I
might be inclined to describe it as a local counter within the nodeElement
production.
Generally, I have found that the parser needs only two state variables:
- a counter for bnode allocation
- a counter for rdf:li attribute allocation.
Everything else is handled as a stateless transformation from input to
generated triples.
...
Section 6.1.2:
The subject accessor of the Element event is described as an Identifier
event, but it's not clear from the text what is an identifier event. The
hyperlink indicates a "URI Reference event" (but this is invisible when
working from printed copy).
...
Section 6.2:
This section does not appear to be entirely consistent, or in complete
agreement with the grammar itself. It states here that the infoset is
transformed into a "sequence of events", which is subsequently matched by
the given grammar (section 6.3).
Yet, examination of the grammar production for 'doc' shows that it matches
a *single* root event, whose children are a list of exactly one
<rdf:RDF> element.
Thus, the result of applying the transformation in section 6.2, and the
grammar in section 7, seem to describe not a sequence but a shallow tree
consisting of a root element whose children comprise a sequence of events.
I think it would be clearer if the transformation in section 6.2 simply
returned the event sequence corresponding to the children of the document
root element, and then to drop grammar production 7.2.8 (doc), using
production 7.2.9 (RDF) in its place in section 7.2.1.
...
Section 6.2, again:
The description "... to turn the tree of events into a sequence in document
order" is potentially confusing on another count: the result of
transforming an element event is a sequence of element, end element and
text events. Attribute events are not part of this sequence. When I
originally read this, I understood the intent to be to turn the tree into a
linear structure that was subject to a standard parsing approach. On
examining the detail, it seems to be an unconventional hybrid approach,
creating a tree with a fixed number of levels (root,content,attributes).
I find myself thinking that the "parsing" (generation of triples) could
have been described in terms of a traversal of the event tree, without
applying the "flattening" stage described in section 6.2.
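
To make the shape of that hybrid structure concrete, here is a small Haskell
sketch of the flattening step as I read it. The Node and Event types are
simplified assumptions for illustration (element names and attributes are
plain strings), not the data model of the specification or of any real
parser.

```haskell
-- Hypothetical, simplified element tree.
data Node
  = Elem String [(String, String)] [Node]  -- name, attributes, children
  | Text String
  deriving (Eq, Show)

-- Hypothetical event type using the event kinds named in the text.
data Event
  = ElementEvent String [(String, String)]  -- attributes stay attached here
  | EndElementEvent
  | TextEvent String
  deriving (Eq, Show)

-- Flatten a tree into document order: element, its content, end element.
-- Attributes do not become members of the sequence.
flatten :: Node -> [Event]
flatten (Text s) = [TextEvent s]
flatten (Elem name attrs kids) =
  ElementEvent name attrs : concatMap flatten kids ++ [EndElementEvent]
```

The content is linear after this step, but each element event still carries
its attributes as a nested collection, which is the fixed-depth tree
described above.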
...
Section 7, 7.2.2 to 7.2.7:
Although these are presented as being RDF grammar productions, they are not
grammar productions in the sense that term is normally used. Rather they
are separate predicates that must be satisfied by facets of the tokens as they
are matched.
Conventional grammar specification relates to a single sequence of tokens,
defining what constitutes a valid sequence. The use of additional
"semantic" constraints is not unusual, and this is how I see that the
productions 7.2.2 to 7.2.7 are being used. I do find it potentially
misleading that they are presented as productions of the main RDF
grammar; I suppose they might be regarded as separate mini-grammars that
are applied to components of tokens, but I think they would better be
presented as separate URI predicates.
I also think similar comments might be applied to the attribute predicates
7.2.22 to 7.2.32 and 7.2.34.
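
For example, productions 7.2.2 and 7.2.4 to 7.2.7 could be read as ordinary
predicates over URI strings, roughly as in the following Haskell sketch. The
term lists follow the specification; the function names and the
representation of URIs as String are assumptions for the example.

```haskell
rdfNS :: String
rdfNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

-- Build a full URI for a local name in the RDF namespace.
rdf :: String -> String
rdf local = rdfNS ++ local

-- 7.2.2 coreSyntaxTerms and 7.2.4 oldTerms, as plain lists of URIs.
coreSyntaxTerms, oldTerms :: [String]
coreSyntaxTerms =
  map rdf ["RDF", "ID", "about", "parseType", "resource", "nodeID", "datatype"]
oldTerms = map rdf ["aboutEach", "aboutEachPrefix", "bagID"]

-- 7.2.5: anyURI - ( coreSyntaxTerms | rdf:li | oldTerms )
isNodeElementURI :: String -> Bool
isNodeElementURI u =
  u `notElem` (coreSyntaxTerms ++ [rdf "li"] ++ oldTerms)

-- 7.2.6: anyURI - ( coreSyntaxTerms | rdf:Description | oldTerms )
isPropertyElementURI :: String -> Bool
isPropertyElementURI u =
  u `notElem` (coreSyntaxTerms ++ [rdf "Description"] ++ oldTerms)

-- 7.2.7: anyURI - ( coreSyntaxTerms | rdf:Description | rdf:li | oldTerms )
isPropertyAttributeURI :: String -> Bool
isPropertyAttributeURI u =
  u `notElem` (coreSyntaxTerms ++ [rdf "Description", rdf "li"] ++ oldTerms)
```

Written this way, each predicate is a side condition checked against the URI
of a matched element or attribute event, separate from the grammar over the
event sequence.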
...
Section 6.1, Events:
It seems that two quite distinct concepts have been grouped together under
the heading of Events: those that are determined entirely by the incoming
XML source, and which relate to values in the XML infoset (Root, Element,
End Element, Attribute, Text), and derivative values that relate to values
in the resulting RDF graph (URI Reference, Blank Node, Plain Literal, Typed
Literal).
I think it would be more helpful to maintain a separation of concepts more
in keeping with implementations that create internal data structures, by
treating Events and Nodes as distinct values, with the parsing process
describing the mapping from the event stream to nodes and then to the
node-triples that are in the resulting graph.
Certainly, in my code, it would make more sense if I used my own target
"RDFLabel" type rather than Event for the various node values. (I could of
course do this, but that would compromise my goal of having the
implementation closely follow the specification. In the future, I may well
do this.)
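
A minimal Haskell sketch of the separation I have in mind follows. The type
name RDFLabel is the one mentioned above, but its constructors, the XmlEvent
type and the literalTriple function are hypothetical and chosen only for
illustration.

```haskell
-- Values determined entirely by the incoming XML infoset (hypothetical type).
data XmlEvent
  = RootEvent
  | ElementEvent String [(String, String)]  -- element URI, attributes
  | EndElementEvent
  | TextEvent String
  deriving (Eq, Show)

-- Values that appear in the resulting RDF graph (hypothetical constructors).
data RDFLabel
  = URIRef String
  | Blank String
  | PlainLit String (Maybe String)  -- text, optional language tag
  | TypedLit String String          -- text, datatype URI
  deriving (Eq, Show)

type Triple = (RDFLabel, RDFLabel, RDFLabel)

-- One example of a mapping from events to nodes: a text event used as the
-- object of a property becomes a plain literal; other events give no triple.
literalTriple :: RDFLabel -> RDFLabel -> Maybe String -> XmlEvent -> Maybe Triple
literalTriple subj prop lang (TextEvent s) = Just (subj, prop, PlainLit s lang)
literalTriple _ _ _ _ = Nothing
```

With two distinct types, the type checker itself prevents a graph node from
being mistaken for a parser input event.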
...
Section 7.2.14:
This description refers to a property element ('e') which is not matched by
this production, but by each of the productions named by this production.
I cannot implement this as described here, because the parser must match
the element event before it can process the element name URI. In my case,
I push the processing described into the individual property element
productions (using a common function to perform the common processing, of
course). I think a similar approach could be taken here, using a grammar
action notation to describe the handling of rdf:li property URIs.
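
The common function mentioned above might look something like the following
Haskell sketch. The name resolvePropertyURI and the idea of passing the
li-counter as an explicit Int argument are assumptions for the example; the
rule itself (rdf:li becomes rdf:_n and the counter is incremented) is the one
given in section 7.2.14.

```haskell
rdfNS :: String
rdfNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

-- Given the current li-counter and the URI of an already-matched property
-- element, return the property URI to use and the updated li-counter.
resolvePropertyURI :: Int -> String -> (String, Int)
resolvePropertyURI n uri
  | uri == rdfNS ++ "li" = (rdfNS ++ "_" ++ show n, n + 1)
  | otherwise            = (uri, n)
```

Each property element production would call this after matching its element
event, so the element name URI is only inspected once the event is in hand.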
...
Sections 7.2.15, 7.2.16:
It seems to be inconsistent that production 7.2.15 allows multiple
consecutive whitespace elements, but that 7.2.16 allows only a single
character data element for a literal property value. I would expect that
ws* in 7.2.15 could be replaced by ws?.
...
Sections 7.2.17, 7.2.27:
Should datatype URIs be evaluated with reference to the current base
URI? The current syntax does not appear to call for this.
I see little practical value in doing this, but it seems somewhat
inconsistent that datatype URIs are treated differently from other
URIs. If datatype URIs are not evaluated as relative to the current base
URI, I think it might be helpful to state this explicitly.
...
Section 6, 7.2.17:
I've noted in another message the difficulties I find with the handling of
XML literals: it makes it hard to achieve a clean separation between the
preprocessing to syntax data model events and the subsequent syntax analysis.
I have adopted a partial approach here: I transform from infoset to event
tree as described, but when "flattening" the event tree I do not flatten
any element that contains an rdf:parseType="Literal" or equivalent
attribute. The effect of this is that my XML literal values cannot contain
comment or PI information items.
...
That's all.
#g
------------
Graham Klyne
For email:
http://www.ninebynine.org/#Contact
Received on Wednesday, 14 July 2004 04:57:41 UTC