- From: pat hayes <phayes@ihmc.us>
- Date: Wed, 17 Sep 2003 20:16:12 -0500
- To: w3c-rdfcore-wg@w3.org
Greetings. Y'all are going to just LOVE me for this, but thinking about the i18n desireables for XML has led me to the observation that one of our old and abandoned designs for handling datatypes would handle this stuff quite smoothly. The key point is that terms denoting datatype values are allowed in the subject position, so attributes like language tags and lexical 'type' can be described as RDF properties. We gave up on this on the grounds largely of triple-bloat, a concern which now seems curiously irrelevant when one contemplates what OWL will look like. Anyway, in the spirit of Brian's comment, >I've tried to be careful not to describe it as a proposal. This is an >alternative design. I'm not proposing it, just describing it. here's the design. Plain literals are just strings, and they denote themselves. There are no typed literals. Datatypes are indicated by class/property names. Datatype values are typically indicated by bnodes, so instead of aaa ppp "sss"^^ddd . we write aaa ppp _:x . _:x ddd "sss" . where the _:x denotes the datatype value. You could use URIs in some cases, eg ex:PIto5places xsd:number "3.14162" . There is a general D-entailment aaa ddd "sss" . |= aaa rdf:type ddd . when sss is a legal lexical form for the datatype ddd; the version of this for XML is an RDF entailment (though see later). This design, unlike our present one, has subject terms denoting datatype values, so lang tags can be considered to be *properties of datatype values*, and the tags themselves can be encoded as simple literals, so we just write an assertion: _:x rdf:langTag "en" . and our current design translates thus: aaa ppp "sss"@ttt . -->> aaa ppp _:x . _:x xsd:string "sss" . _:x rdf:langTag "ttt" . Note that xsd:string is the appropriate datatype for simple literals, providing a way to in effect put a simple literal string in the subject position (encoded as a bnode). In fact, in this design, xsd:string is in effect owl:sameAs applied to literals. ---- This way of handling lang tags allows us to associate lang tags with XML literals without putting the tag into the lexical space of the literal, so allows XML literal to be a normal datatype, just as it is right now (though read on) while also handling one of Martin's requirements. The parsing of parseType="Literal" needs to include the asserting of an appropriate rdf:langTag assertion in the graph, according to the XML rules, but that seems straightforward. This design also allows sub-XML datatypes to automatically inherit language tagging, since they will be members of subClasses of rdf:XMLLiteral and hence of rdf:XMLliteral itself, and hence the members of these classes will still have any properties they had previously. Notice that the property is of the literal *value*, rather than syntactically attached to the literal, so rdf:langTag only makes intuitive sense for self-denoting literals, or at any rate those which denote textual kinds of thing rather than mathematical kinds of thing. However, there is no need to have special rules to 'ignore' lang tags on non-textual datatypes such as numbers: an assertion like _:x xsd:integer "25" . _:x rdf:langTag "en" . is semantically vacuous but harmless, or can be considered harmless as far as RDF is concerned. (A lang-tag-savvy app might complain about things like this.) Also we don't need lang tags as a syntactic attachment to plain literals; the same trick works for plain literals. There isn't any general semantics for rdf:langTag, but for particular cases it can be defined, eg we can define it for simple literals - simple literal *values* can be pairs just as they are right now, and so IEXT(I(rdf:langTag)) is all pairs of the form <<sss, tag>, tag> , and IEXT(I(xsd:string)) is all pairs <<sss, tag>, sss> - and for XML literals. Here's the MT for the datatyping, re-done in a more up-todate style: D is a datatype map, as usual. If <uri, ddd> is in D then: I(uri)=ddd; ddd is in ICEXT(I(rdf:Datatype)); for any string sss, sss is in the lexical space of ddd iff <L2V(ddd)(sss),sss> is in IEXT(ddd); If sss is in the lexical space of ddd then L2V(ddd)(sss) is in ICEXT(ddd) Note that being in the class is necessary but not sufficient for the datatyping rule to apply; this avoids some of the snags we had with this design previously involving subtypes. For example, we can have ex:octal rdfs:subClassOf xsd:integer . _:x ex:octal "10" . and _:x unambiguously denotes eight; in fact _:x owl:sameAs _:y . _:y xsd:integer "8" . The lexical typing only gets invoked by the datatype property; the class membership has to do with the values. Alternative lexical forms give no problem either: _:x xsd:integer "2" . _:x xsd:integer "0002" . BTW, we could now use rdfs:Literal as a generic superproperty of all datatype properties, as well as a superclass of all datatype values, so that _:x rdfs:Literal "10" . would say that _:x was some value which has "10" as a lexical form, but we don't (yet) know which one. Or, we could not do this. ----- This would be a major change and would probably effect several implementations. In order to change our current design to this we would need to: 1. remove typed literals (or, treat them as an abbreviations for the two-triple form, maybe?) 2. remove lang tags from plain literals (or treat these as an abbreviation, similarly) 3. introduce rdf:langTag (or whatever) and add prose discussing the use of lang tags as properties 4. modify the datatype semantics, as above 5. redefine the XML parsing rules for parseType="Literal" 6. rewrite the Lbase translation appropriately I think this would mean changes to every document; it would be a fairly horrendous editing task at this stage. On the other hand, it does have a certain elegance. There is only one kind of literal, and literals are genuinely simple, both syntactically and semantically, and always denote themselves in all contexts (remember non-tidy graphs?); and it uses RDF as a descriptive language rather than extending the syntax in an XML-idiosyncratic way. We abandoned this design, as I recall, for three reasons. First, it seemed too 'indirect' and like triple-bloat. However, in our current design we have to specify the same information, and we can infer the bnode: aaa ppp "10"^^xsd:integer . |= aaa ppp _:x . compare aaa ppp _:x . _:x xsd:integer "10" . an in any case in this post-OWL era, triple-bloat seems to be rampant. I note that it would be harmless to allow the current typed-literal form as an abbreviation for the two-triple form, by the way; or even as an alternative, with inference rules to convert them back and forth. The feeling of being 'indirect' came, as I recall, from a feeling that we *ought* to be able, dammit, to write things like ex:Jill ex:age "10" rather have to go through a bnode: ex:Jill ex:age _:x . _:x xsd:integer "10" . This feeling now seems to me to have been overly naive, however, with the benefit of hindsight. Second, it seemed unintuitive to some folk to have a property and a class with the same name. I never had this trouble myself, and it seems to me to be a good illustration of the usefulness of the intensional semantics that RDF provides: if you've got it, flaunt it. [*see PS] However, the design could be modified by allowing systematic variants for the property or class names, eg using xsd:integer for the property and xsd:Integer for the class. Or we could do without the datatype classes altogether, since aaa rdf:type xsd:integer . (read: aaa is an integer) and aaa xsd:integer _:x . (read: aaa is something denoted by a numeral) convey the exact same information in {xsd:integer}-interpretations. Third, as I recall, there were some issues arising from the long-range datatyping getting too complicated. OK, Im not suggesting re-opening that particular can of worms. (Though I would note that when it does get re-opened in the future, I bet this design will be a lot more tractable than our current design, which will have to be simply shelved.) ---- The other i18n issue involved treating XML literals without markup as being plain text. Assuming that 'plain text' means a character string, I now think we can do that by a bit of semantic sleight of hand as follows. First, observe that any piece of XML can be encoded as a character string, but XML imposes extra equivalence (identity) conditions, such as identifying "<br />" with "<br></br>". So, consider the set of legal XML texts, considered as Unicode strings, and define an equivalence relation on this set by saying that strings with the same XML normal form are equivalent; then say that any such string denotes its equivalence class, and then in a familiar abuse of notation say that singleton classes are identical to their members. Now, any piece of XML text without any markup in it denotes itself, just as a plain literal does. (There may be some whitespace issues which make " " (two spaces) equivalent to " " (one space); if so, this will need to be stated more carefully, eg by applying the normalization only to stuff inside <->.) If we say that this is the value space of rdf:XMLLiteral, rather than the non-text 'structural' sets we have at present, then Martin might be happier. On the other hand, this supports a number of hard-to-state RDF entailments, such as intersubstituting "sss"^^xsd:string and "sss"^^rdf:XMLLiteral under circumstances which can only be recognized by an XML parser, which seems *very* ugly to include in basic RDF, so I would argue that if we do something like this then we treat rdf:XMLLiteral as a genuine datatype so that these entailments are restricted to D-interpretations and are not valid in simple RDF; and it also means that XML *with* markup denotes something very like a character string; in particular, "<"^^rdf:XMLLiteral on this proposal, has got absolutely nothing in common with "<"^^xsd:string. So maybe Martin might not be so happy after all. Anyway, thought I'd just mention it in passing. Pat PS. I thought of an interesting analogy. Literals are a kind of name, and in a simple extensional logic they would have a fixed denotation, eg numerals denote numbers, I("10")=10 (ie, ten) and so on, end of story. But RDF is intensional, and datatypes treat literals like intensional names. Seen in this way, the literal always denotes itself, ie I(literal)=literal; but it has a variable extension, *determined by the datatype context*. In other words, the datatype lexical-to-value map is a kind of extension mapping, like IEXT for properties and ICEXT for classes. Call it ILEXT-d where d is the datatype; then the 'meaning' of a literal string sss in a datatype context defined by d would be ILEXT-d(I(sss)) - compare IEXT(I(p)) or ICEXT(I(a)) where p is a property uri and a is a uri or bnode - which since I(sss) = sss is just ILEXT-d(sss), i.e. L2V(d)(sss). This is exactly what the subject bnode denotes in a datatype triple; in other words, we are using the datatype property name as a kind of explicit extension mapping on literal strings. On this view, then, what a datatype does is to fix the extension mapping for literals, considered as intensional names. The universal superproperty rdfs:Literal works the same way but refuses to supply a context, so letting the extension mapping be anything. -- --------------------------------------------------------------------- IHMC (850)434 8903 or (650)494 3973 home 40 South Alcaniz St. (850)202 4416 office Pensacola (850)202 4440 fax FL 32501 (850)291 0667 cell phayes@ihmc.us http://www.ihmc.us/users/phayes
Received on Wednesday, 17 September 2003 21:16:22 UTC