- From: Perry A. Caro <caro@Adobe.COM>
- Date: Thu, 10 Feb 2000 14:41:51 -0800
- To: www-rdf-interest@w3.org
How should equivalence be computed for the values of properties marked with parseType='Literal'? I can think of several options:

1) Literal binary equivalence
2) Character encoding normalized binary equivalence
3) XML DOM equivalence
4) XML Infoset equivalence
5) Canonical XML equivalence

(1) Literal binary equivalence is the easiest for everyone to implement, but the most likely to yield false failures. Two XML strings may look identical to a user, but if one is encoded in UTF-16 and the other in UTF-8, binary comparison will fail. This can be mitigated by normalizing to some standard encoding, like UTF-8, which is what option (2) is about.

Both options (1) and (2) are subject to problems when RDF processors don't preserve the literal value precisely. If xml:space="preserve" is not specified, whitespace handling may cause false failures. "Innocent" XML rewriting, such as that performed by Apache mod_dav, could also cause false failures.

(3) would define equivalence as two values that generate the same DOM tree. (4) would define equivalence as two values that generate the same Infoset. The last time I looked at the DOM and Infoset specs, they were different, but by now they might be much closer together. These are pretty good possibilities, but they put all of the burden on clients; an RDF processor can't really help. There's a lot more computational overhead, in any case.

With (5), Canonical XML, the burden could be shifted to the RDF processor, which could perform "early canonicalization". That is, whenever a property is set to a parseType='Literal' value, the processor could canonicalize the XML first. All consumers of RDF should then assume that parseType='Literal' values have been canonicalized, and should just store the value verbatim. Clients could then test equivalence with a simple binary comparison. The problem is that canonicalization throws away stuff that might be important, like processing instructions and comments.
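A minimal Python sketch of how options (1), (2) and (5) above differ in practice. The function names are hypothetical, and `xml.etree.ElementTree.canonicalize` (Python 3.8+, which implements C14N 2.0) is only a stand-in for whichever Canonical XML form an RDF processor would actually adopt; it is not something the options above prescribe.

```python
import xml.etree.ElementTree as ET


def binary_equal(a: bytes, b: bytes) -> bool:
    """Option (1): compare the raw bytes exactly as received."""
    return a == b


def encoding_normalized_equal(a: bytes, enc_a: str, b: bytes, enc_b: str) -> bool:
    """Option (2): re-encode both values as UTF-8, then compare bytes.

    Assumes the caller already knows each value's encoding.
    """
    return a.decode(enc_a).encode("utf-8") == b.decode(enc_b).encode("utf-8")


def canonical_equal(a: str, b: str) -> bool:
    """Option (5): compare canonical forms (C14N 2.0 as a stand-in).

    Comments are dropped by default, so they do not affect the result.
    """
    return ET.canonicalize(a) == ET.canonicalize(b)


# The same XML text in two encodings.
text = '<dad><kid1 a="1"/><kid2>foo</kid2></dad>'
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")

# An "innocently" rewritten copy: different attribute quoting, an expanded
# empty element, and an added comment.
rewritten = "<dad><kid1 a='1'></kid1><!--comment--><kid2>foo</kid2></dad>"

print(binary_equal(as_utf8, as_utf16))
print(encoding_normalized_equal(as_utf8, "utf-8", as_utf16, "utf-16"))
print(text == rewritten)
print(canonical_equal(text, rewritten))
```

With these sample values, the byte comparison fails on the encoding difference alone, while the canonical comparison treats the rewritten attribute quoting, the expanded empty element and the dropped comment as equivalent.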
If the RDF processor doesn't do early canonicalization, we're back to the whole burden being distributed to every client.

===============================================================

At the moment, I've decided not to support parseType='Literal' at all (my RDF processor generates no triples for any property marked as such). Since I can't make any guarantees about such values, and since I can't even rely on a round-trip with another agent preserving binary equivalence, I'd rather just decommit support.

Instead, I offer an almost equivalent alternative. In place of a parseType='Literal', I encourage the use of an rdf:value and a qualifier (a la Dublin Core) which specifies how the text of the value should be interpreted, one such interpretation being "as XML". For example, if I have a property Foo that I want to set to the value:

    <?xml version="1.0" encoding="UTF-8"?>
    <dad><!--comment-->
      <kid1 a="1"/>
      <kid2>foo</kid2>
    </dad>

I use the following RDF representation:

    <rdf:RDF ...>
      <rdf:Description ...>
        <Foo parseType="Resource">
          <rdf:value xml:space="preserve"><![CDATA[
    <?xml version="1.0" encoding="UTF-8"?>
    <dad><!--comment-->
      <kid1 a="1"/>
      <kid2>foo</kid2>
    </dad>]]></rdf:value>
          <someSchema:interpretAs>XML</someSchema:interpretAs>
        </Foo>
      </rdf:Description>
    </rdf:RDF>

I think this does a better job of delivering on the goals of parseType='Literal' than parseType='Literal' itself does:

- Clients can rely on binary equivalence, perhaps after encoding normalization.
- I don't have to worry about XML agents doing "innocent" rewriting.
- No special front-end handling is required in the parser, unlike with parseType='Literal'.
- This representation actually generates triples in the model which distinguish between an XML and a non-XML literal, whereas parseType='Literal' does not (I think).

The obvious drawback is that this is non-standard.

Comments?

Perry
Received on Thursday, 10 February 2000 17:42:59 UTC