Equivalence for parseType='Literal'?

How should equivalence be computed for the values of properties marked with
parseType='Literal'?  I can think of several options:

1) Literal binary equivalence
2) Character encoding normalized binary equivalence
3) XML DOM equivalence
4) XML Infoset equivalence
5) Canonical XML equivalence

(1) Literal binary equivalence is the easiest for everyone to implement, but
the most likely to yield false failures.  Two XML strings may look identical
to a user, but if one is encoded in UTF-16, and the other in UTF-8, binary
comparison will fail.  This can be mitigated by normalizing to some standard
encoding, like UTF-8, which is what option (2) is about.
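
A minimal Python sketch of the difference between options (1) and (2). The sample string and the helper names `binary_equal` and `normalized_equal` are made up for illustration. It assumes the encoding of each value is known out of band and that neither value carries an encoding declaration, which would itself differ between the two.

```python
# Hypothetical sample value; any well-formed XML fragment would do.
text = '<dad><kid2>caf\u00e9</kid2></dad>'

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")  # Python prepends a byte-order mark here


def binary_equal(a: bytes, b: bytes) -> bool:
    """Option (1): compare the raw bytes exactly as stored."""
    return a == b


def normalized_equal(a: bytes, a_enc: str, b: bytes, b_enc: str) -> bool:
    """Option (2): transcode both values to UTF-8, then compare bytes."""
    return a.decode(a_enc).encode("utf-8") == b.decode(b_enc).encode("utf-8")


print(binary_equal(utf8_bytes, utf16_bytes))
print(normalized_equal(utf8_bytes, "utf-8", utf16_bytes, "utf-16"))
```

The first comparison fails even though both byte strings hold the same characters. The second succeeds once both are brought to a common encoding.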

Both options (1) and (2) are subject to problems when RDF processors don't
preserve the literal value precisely.  If xml:space="preserve" is not
specified, whitespace handling may cause false failures.  "Innocent" XML
rewriting, such as that performed by Apache mod_dav, could cause false
failures.

(3) would define equivalence as two values that generate the same DOM tree.
(4) would define equivalence as two values that generate the same Infoset.
The last time I looked at the DOM and Infoset specs, they were different,
but by now they might be much closer together.  These are pretty good
possibilities, but they put all of the burden on clients -- an RDF processor
can't really help.  There's also a lot more computational overhead than with
a binary comparison, since both values must be parsed first.
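
To give a feel for what tree-based comparison asks of a client, here is a rough Python sketch. It uses the standard library's ElementTree as a stand-in, which is neither the DOM nor the Infoset: its default parser silently drops comments and processing instructions, so a real implementation of (3) or (4) would have to compare more than this. The function name `trees_equal` and the sample values are made up.

```python
import xml.etree.ElementTree as ET


def trees_equal(a: ET.Element, b: ET.Element) -> bool:
    """Recursively compare element names, attributes, text, and children."""
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    # ElementTree reports missing text as None; treat that like "".
    if (a.text or "") != (b.text or "") or (a.tail or "") != (b.tail or ""):
        return False
    if len(a) != len(b):
        return False
    return all(trees_equal(x, y) for x, y in zip(a, b))


# Same content, different attribute order, quoting, and empty-element syntax.
first = ET.fromstring('<dad><kid1 a="1" b="2"/><kid2>foo</kid2></dad>')
second = ET.fromstring("<dad><kid1 b='2' a='1'></kid1><kid2>foo</kid2></dad>")
# Genuinely different content.
third = ET.fromstring('<dad><kid1 a="1" b="2"/><kid2>bar</kid2></dad>')

print(trees_equal(first, second))
print(trees_equal(first, third))
```

Every client that wants to compare two values has to parse both and walk both trees, which is the overhead mentioned above.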

With (5), Canonical XML, the burden could be shifted to the RDF processor,
which could perform "early canonicalization".  That is, whenever a property
is set to a parseType='Literal' value, it could canonicalize the XML first. 
All consumers of RDF should assume that parseType='Literal' values have been
canonicalized, and should just store the value verbatim.  Clients could then
test equivalence with a simple binary comparison.  The problem is that
canonicalization throws away stuff that might be important, like processing
instructions and comments.  If the RDF processor doesn't do early
canonicalization, we're back to the whole burden being distributed to every
client.
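
A sketch of "early canonicalization" in Python, with caveats. It relies on `xml.etree.ElementTree.canonicalize` (Python 3.8 or later), which implements C14N 2.0 rather than whichever Canonical XML draft an RDF processor would actually adopt. The helper name `early_canonicalize` and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same content, as might result from
# "innocent" rewriting by an intermediary.
original = '<dad><!--comment--><kid1 b="2" a="1"/><kid2>foo</kid2></dad>'
rewritten = "<dad><!--comment--><kid1 a='1' b='2'></kid1><kid2>foo</kid2></dad>"


def early_canonicalize(xml_text: str) -> bytes:
    """Canonicalize once at store time; return UTF-8 bytes for storage."""
    # Comments are dropped by default (with_comments=False).
    return ET.canonicalize(xml_text).encode("utf-8")


stored_a = early_canonicalize(original)
stored_b = early_canonicalize(rewritten)

print(original == rewritten)     # raw text comparison
print(stored_a == stored_b)      # binary comparison of canonical forms
print(b"comment" in stored_a)    # did the comment survive?
```

With this library's defaults the two stored values compare equal byte for byte, but the comment no longer appears in either of them, which is the loss of information described above.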

===============================================================

At the moment, I've decided not to support parseType='Literal' at all (my
RDF processor generates no triples for any property marked as such).  Since I
can't make any guarantees about such values, and since I can't even rely on
a round-trip with another agent preserving binary equivalence, I'd rather
just decommit support.

Instead, I offer an almost equivalent alternative.  In place of a
parseType='Literal', I encourage the use of an rdf:value and a qualifier
(a la Dublin Core) which specifies how the text of the value should be
interpreted, one such interpretation being "as XML".  For example, if I have
a property Foo that I want to set to the value:

           <?xml version="1.0" encoding="UTF-8"?>
           <dad><!--comment-->
                <kid1 a="1"/>
                <kid2>foo</kid2>
           </dad>

I use the following RDF representation:

    <rdf:RDF ...>
      <rdf:Description ...>
        <Foo parseType="Resource">
          <rdf:value xml:space="preserve"><![CDATA[
           <?xml version="1.0" encoding="UTF-8"?>
           <dad><!--comment-->
                <kid1 a="1"/>
                <kid2>foo</kid2>
           </dad>]]></rdf:value>
          <someSchema:interpretAs>XML</someSchema:interpretAs>
        </Foo>
      </rdf:Description>
    </rdf:RDF>

I think this does a better job of delivering on the goals of
parseType='Literal' than parseType='Literal' itself does.  Clients can rely
on binary equivalence, perhaps after encoding normalization.  I don't have to
worry about XML agents doing "innocent" rewriting.  No special front-end
handling is required in the parser, unlike with parseType='Literal'.  This
representation actually generates triples in the model which distinguish
between an XML and non-XML literal, whereas parseType='Literal' does not (I
think).
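
As a rough check of the "store verbatim, compare bytes" claim, the Python sketch below wraps a payload in CDATA the way the representation above does and reads it back with an ordinary XML parser. The rdf namespace URI is the standard one. The someSchema URI is a placeholder, and the attributes elided with "..." in the example are simply left out.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SCHEMA_NS = "http://example.org/someSchema#"  # hypothetical namespace URI

# The XML value to carry, held as opaque text.
payload = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<dad><!--comment-->\n"
    '  <kid1 a="1"/>\n'
    "  <kid2>foo</kid2>\n"
    "</dad>"
)

document = (
    f'<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:someSchema="{SCHEMA_NS}">'
    "<rdf:Description>"
    '<Foo parseType="Resource">'
    f'<rdf:value xml:space="preserve"><![CDATA[{payload}]]></rdf:value>'
    "<someSchema:interpretAs>XML</someSchema:interpretAs>"
    "</Foo>"
    "</rdf:Description>"
    "</rdf:RDF>"
)

ns = {"rdf": RDF_NS, "s": SCHEMA_NS}
root = ET.fromstring(document)
foo = root.find("rdf:Description/Foo", ns)
value = foo.find("rdf:value", ns).text
interpret_as = foo.find("s:interpretAs", ns).text

print(value == payload)  # the comment and declaration come back untouched
print(interpret_as)
```

This only works as long as the payload does not itself contain the sequence "]]>", which would end the CDATA section early.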

The obvious drawback is that this is non-standard.

Comments?

Perry

Received on Thursday, 10 February 2000 17:42:59 UTC