- From: pat hayes <phayes@ihmc.us>
- Date: Wed, 17 Sep 2003 20:16:12 -0500
- To: w3c-rdfcore-wg@w3.org
Greetings.
Y'all are going to just LOVE me for this, but thinking about the i18n
desirables for XML has led me to the observation that one of our old
and abandoned designs for handling datatypes would handle this stuff
quite smoothly. The key point is that terms denoting datatype values
are allowed in the subject position, so attributes like language tags
and lexical 'type' can be described as RDF properties. We gave up on
this on the grounds largely of triple-bloat, a concern which now
seems curiously irrelevant when one contemplates what OWL will look
like. Anyway, in the spirit of Brian's comment,
>I've tried to be careful not to describe it as a proposal. This is an
>alternative design. I'm not proposing it, just describing it.
here's the design.
Plain literals are just strings, and they denote themselves. There
are no typed literals. Datatypes are indicated by class/property
names. Datatype values are typically indicated by bnodes, so instead
of
aaa ppp "sss"^^ddd .
we write
aaa ppp _:x .
_:x ddd "sss" .
where the _:x denotes the datatype value. You could use URIs in some cases, eg
ex:PIto5places xsd:number "3.14159" .
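To make the rewrite concrete, here is a minimal Python sketch of it. The tuple representation of triples, the ("typed", lexical, datatype) marker and the function name are all assumptions made up for illustration; none of this is an RDF API.

```python
from itertools import count

# Hypothetical representation: a triple is a 3-tuple of strings, and a typed
# literal "sss"^^ddd in object position is the tuple ("typed", "sss", "ddd").
def expand_typed_literals(triples):
    """Rewrite each typed-literal triple into the two-triple bnode form."""
    fresh = count()
    out = []
    for s, p, o in triples:
        if isinstance(o, tuple) and o[0] == "typed":
            _, lexical, datatype = o
            bnode = f"_:b{next(fresh)}"           # fresh bnode for the datatype value
            out.append((s, p, bnode))             # aaa ppp _:x .
            out.append((bnode, datatype, lexical))  # _:x ddd "sss" .
        else:
            out.append((s, p, o))                 # everything else is left alone
    return out

expanded = expand_typed_literals([("aaa", "ppp", ("typed", "sss", "ddd"))])
```

Here expanded holds the two triples of the bnode form, with _:b0 playing the role of _:x.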
There is a general D-entailment
aaa ddd "sss" .
|=
aaa rdf:type ddd .
when sss is a legal lexical form for the datatype ddd; the version of
this for XML is an RDF entailment (though see later).
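As a rough illustration of that entailment, the following Python sketch infers the rdf:type triple whenever the object string is a legal lexical form. The LEXICAL_SPACE table and the function name are assumptions for the example, and the regular expression is only a simplified stand-in for the xsd:integer lexical space.

```python
import re

# Assumed toy table of lexical spaces, keyed by datatype name.
LEXICAL_SPACE = {
    "xsd:integer": re.compile(r"[+-]?[0-9]+"),
}

def type_entailments(triples):
    """Return the rdf:type triples licensed by  aaa ddd "sss" .  triples."""
    inferred = set()
    for s, p, o in triples:
        pattern = LEXICAL_SPACE.get(p)
        # The rule only fires when the property is a known datatype and
        # the object is a legal lexical form for it.
        if pattern is not None and pattern.fullmatch(o):
            inferred.add((s, "rdf:type", p))
    return inferred

inferred = type_entailments([("_:x", "xsd:integer", "25"),
                             ("_:y", "xsd:integer", "abc")])
```

Only _:x gets typed here; "abc" is not a legal lexical form, so nothing follows about _:y.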
This design, unlike our present one, has subject terms denoting
datatype values, so lang tags can be considered to be *properties of
datatype values*, and the tags themselves can be encoded as simple
literals, so we just write an assertion:
_:x rdf:langTag "en" .
and our current design translates thus:
aaa ppp "sss"@ttt .
-->>
aaa ppp _:x .
_:x xsd:string "sss" .
_:x rdf:langTag "ttt" .
Note that xsd:string is the appropriate datatype for simple literals,
providing a way to in effect put a simple literal string in the
subject position (encoded as a bnode). In fact, in this design,
xsd:string is in effect owl:sameAs applied to literals.
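The lang-tag translation above can be sketched the same way. The function name and the default bnode label are hypothetical; rdf:langTag is the property name used in this message, not an existing RDF term.

```python
def expand_lang_literal(subject, prop, lexical, tag, bnode="_:x"):
    """Translate  subject prop "lexical"@tag .  into the three-triple form."""
    return [
        (subject, prop, bnode),         # aaa ppp _:x .
        (bnode, "xsd:string", lexical),  # _:x xsd:string "sss" .
        (bnode, "rdf:langTag", tag),     # _:x rdf:langTag "ttt" .
    ]

triples = expand_lang_literal("aaa", "ppp", "sss", "ttt")
```

A real translator would need a fresh bnode per literal rather than a fixed label; the default is kept only to mirror the example.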
----
This way of handling lang tags allows us to associate lang tags with
XML literals without putting the tag into the lexical space of the
literal, so allows XML literal to be a normal datatype, just as it is
right now (though read on) while also handling one of Martin's
requirements. The parsing of parseType="Literal" needs to include the
asserting of an appropriate rdf:langTag assertion in the graph,
according to the XML rules, but that seems straightforward. This
design also allows sub-XML datatypes to automatically inherit
language tagging, since they will be members of subClasses of
rdf:XMLLiteral and hence of rdf:XMLLiteral itself, and hence the
members of these classes will still have any properties they had
previously. Notice that the property is of the literal *value*,
rather than syntactically attached to the literal, so rdf:langTag
only makes intuitive sense for self-denoting literals, or at any rate
those which denote textual kinds of thing rather than mathematical
kinds of thing. However, there is no need to have special rules to
'ignore' lang tags on non-textual datatypes such as numbers: an
assertion like
_:x xsd:integer "25" .
_:x rdf:langTag "en" .
is semantically vacuous but harmless, or can be considered harmless
as far as RDF is concerned. (A lang-tag-savvy app might complain
about things like this.) Also we don't need lang tags as a syntactic
attachment to plain literals; the same trick works for plain literals.
There isn't any general semantics for rdf:langTag, but for particular
cases it can be defined, eg we can define it for simple literals -
simple literal *values* can be pairs just as they are right now, and
so IEXT(I(rdf:langTag)) is all pairs of the form <<sss, tag>, tag> ,
and IEXT(I(xsd:string)) is all pairs <<sss, tag>, sss> - and for XML
literals.
Here's the MT for the datatyping, re-done in a more up-to-date style:
D is a datatype map, as usual.
If <uri, ddd> is in D then:
I(uri)=ddd;
ddd is in ICEXT(I(rdf:Datatype));
for any string sss, sss is in the lexical space of ddd iff
<L2V(ddd)(sss),sss> is in IEXT(ddd);
If sss is in the lexical space of ddd then
L2V(ddd)(sss) is in ICEXT(ddd)
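One way to read these conditions operationally is to compute the smallest extensions they force over a finite sample of strings. The sketch below does that in Python; the function name, the sample, and the simplified integer lexical space are assumptions. Note that ICEXT(ddd) is only constrained from below, so a real interpretation may put more into the class than this.

```python
import re

INTEGER = re.compile(r"[+-]?[0-9]+")  # simplified lexical space for xsd:integer

def minimal_extensions(in_lexical_space, l2v, strings):
    """Smallest IEXT(ddd) and ICEXT(ddd) forced by the conditions."""
    # <L2V(ddd)(sss), sss> is in IEXT(ddd) exactly when sss is a legal lexical form.
    iext = {(l2v(s), s) for s in strings if in_lexical_space(s)}
    # Every value of a legal lexical form must be in ICEXT(ddd).
    icext = {value for value, _ in iext}
    return iext, icext

sample = ["2", "0002", "abc"]
iext, icext = minimal_extensions(
    lambda s: INTEGER.fullmatch(s) is not None, int, sample)
```

With this sample, both "2" and "0002" pair with the value two, "abc" contributes nothing, and the class contains just that one value.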
Note that being in the class is necessary but not sufficient for the
datatyping rule to apply; this avoids some of the snags we had with
this design previously involving subtypes. For example, we can have
ex:octal rdfs:subClassOf xsd:integer .
_:x ex:octal "10" .
and _:x unambiguously denotes eight; in fact
_:x owl:sameAs _:y .
_:y xsd:integer "8" .
The lexical typing only gets invoked by the datatype property; the
class membership has to do with the values. Alternative lexical forms
give no problem either:
_:x xsd:integer "2" .
_:x xsd:integer "0002" .
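A small Python sketch of the point, assuming L2V maps modelled as ordinary functions (the table and function name are made up for illustration, and ex:octal is the hypothetical subtype from the example above):

```python
# Assumed toy lexical-to-value maps, one per datatype property.
L2V = {
    "xsd:integer": lambda s: int(s, 10),
    "ex:octal": lambda s: int(s, 8),
}

def value_of(datatype, lexical):
    """Value denoted by the subject of  _:x datatype "lexical" ."""
    return L2V[datatype](lexical)

octal_ten = value_of("ex:octal", "10")
decimal_eight = value_of("xsd:integer", "8")
```

The two subjects denote the same value, which is what licenses the owl:sameAs; likewise "2" and "0002" under xsd:integer pick out one value, so asserting both about the same bnode is consistent.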
BTW, we could now use rdfs:Literal as a generic superproperty of all
datatype properties, as well as a superclass of all datatype values,
so that
_:x rdfs:Literal "10" .
would say that _:x was some value which has "10" as a lexical form,
but we don't (yet) know which one. Or, we could not do this.
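If we did do it, the reading would be roughly "some datatype maps this string to _:x". A Python sketch under the assumption that L2V maps are plain functions which raise ValueError on illegal lexical forms (names and the table are hypothetical):

```python
def candidate_values(lexical, l2v_maps):
    """All values that have `lexical` as a lexical form under some datatype."""
    values = set()
    for l2v in l2v_maps.values():
        try:
            values.add(l2v(lexical))
        except ValueError:
            pass  # not a legal lexical form for this datatype
    return values

maps = {
    "xsd:integer": lambda s: int(s, 10),
    "ex:octal": lambda s: int(s, 8),
    "xsd:string": str,
}
candidates = candidate_values("10", maps)
```

For "10" the candidates are ten, eight, and the string itself; the triple says _:x is one of them without saying which.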
-----
This would be a major change and would probably affect several implementations.
In order to change our current design to this we would need to:
1. remove typed literals (or, treat them as abbreviations for the
two-triple form, maybe?)
2. remove lang tags from plain literals (or treat these as an
abbreviation, similarly)
3. introduce rdf:langTag (or whatever) and add prose discussing the
use of lang tags as properties
4. modify the datatype semantics, as above
5. redefine the XML parsing rules for parseType="Literal"
6. rewrite the Lbase translation appropriately
I think this would mean changes to every document; it would be a
fairly horrendous editing task at this stage.
On the other hand, it does have a certain elegance. There is only one
kind of literal, and literals are genuinely simple, both
syntactically and semantically, and always denote themselves in all
contexts (remember non-tidy graphs?); and it uses RDF as a
descriptive language rather than extending the syntax in an
XML-idiosyncratic way.
We abandoned this design, as I recall, for three reasons. First, it
seemed too 'indirect' and like triple-bloat. However, in our current
design we have to specify the same information, and we can infer the
bnode:
aaa ppp "10"^^xsd:integer .
|=
aaa ppp _:x .
compare
aaa ppp _:x .
_:x xsd:integer "10" .
and in any case in this post-OWL era, triple-bloat seems to be
rampant. I note that it would be harmless to allow the current
typed-literal form as an abbreviation for the two-triple form, by the
way; or even as an alternative, with inference rules to convert them
back and forth. The feeling of being 'indirect' came, as I recall,
from a feeling that we *ought* to be able, dammit, to write things
like
ex:Jill ex:age "10"
rather than have to go through a bnode:
ex:Jill ex:age _:x .
_:x xsd:integer "10" .
This feeling now seems to me to have been overly naive, however, with
the benefit of hindsight.
Second, it seemed unintuitive to some folk to have a property and a
class with the same name. I never had this trouble myself, and it
seems to me to be a good illustration of the usefulness of the
intensional semantics that RDF provides: if you've got it, flaunt it.
[*see PS] However, the design could be modified by allowing
systematic variants for the property or class names, eg using
xsd:integer for the property and xsd:Integer for the class. Or we
could do without the datatype classes altogether, since
aaa rdf:type xsd:integer .
(read: aaa is an integer)
and
aaa xsd:integer _:x .
(read: aaa is something denoted by a numeral)
convey the exact same information in {xsd:integer}-interpretations.
Third, as I recall, there were some issues arising from the
long-range datatyping getting too complicated. OK, I'm not suggesting
re-opening that particular can of worms. (Though I would note that
when it does get re-opened in the future, I bet this design will be a
lot more tractable than our current design, which will have to be
simply shelved.)
----
The other i18n issue involved treating XML literals without markup as
being plain text. Assuming that 'plain text' means a character
string, I now think we can do that by a bit of semantic sleight of
hand as follows. First, observe that any piece of XML can be encoded
as a character string, but XML imposes extra equivalence (identity)
conditions, such as identifying "<br />" with "<br></br>". So,
consider the set of legal XML texts, considered as Unicode strings,
and define an equivalence relation on this set by saying that strings
with the same XML normal form are equivalent; then say that any such
string denotes its equivalence class, and then in a familiar abuse of
notation say that singleton classes are identical to their members.
Now, any piece of XML text without any markup in it denotes itself,
just as a plain literal does. (There may be some whitespace issues
which make "  " (two spaces) equivalent to " " (one space); if so,
this will need to be stated more carefully, eg by applying the
normalization only to stuff inside <->.) If we say that this is the
value space of rdf:XMLLiteral, rather than the non-text 'structural'
sets we have at present, then Martin might be happier.
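A rough Python sketch of the equivalence relation, using the standard library's C14N 2.0 canonicalizer (Python 3.8+) as a stand-in for "XML normal form". That choice, the wrapper element w, and the function names are all assumptions for illustration; the design above does not fix a particular normal form.

```python
from xml.etree.ElementTree import canonicalize  # available from Python 3.8

def normal_form(xml_text):
    """Canonical form of an XML fragment, used to pick its equivalence class."""
    # Wrap in a made-up root element so bare text and fragments parse as a document.
    return canonicalize(f"<w>{xml_text}</w>")

def same_value(a, b):
    """True when two XML texts fall in the same equivalence class."""
    return normal_form(a) == normal_form(b)

br_forms_agree = same_value("<br />", "<br></br>")
plain_texts_differ = not same_value("hello", "hello!")
```

The two spellings of the empty br element land in one class, while distinct markup-free strings stay distinct, which is the sense in which such text denotes itself.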
On the other hand, this supports a number of hard-to-state RDF
entailments, such as intersubstituting "sss"^^xsd:string and
"sss"^^rdf:XMLLiteral under circumstances which can only be
recognized by an XML parser, which seems *very* ugly to include in
basic RDF, so I would argue that if we do something like this then we
treat rdf:XMLLiteral as a genuine datatype so that these entailments
are restricted to D-interpretations and are not valid in simple RDF;
and it also means that XML *with* markup denotes something very like
a character string; in particular,
"<"^^rdf:XMLLiteral
on this proposal, has got absolutely nothing in common with
"<"^^xsd:string. So maybe Martin might not be so happy after all.
Anyway, thought I'd just mention it in passing.
Pat
PS. I thought of an interesting analogy. Literals are a kind of
name, and in a simple extensional logic they would have a fixed
denotation, eg numerals denote numbers, I("10")=10 (ie, ten) and so
on, end of story. But RDF is intensional, and datatypes treat
literals like intensional names. Seen in this way, the literal always
denotes itself, ie I(literal)=literal; but it has a variable
extension, *determined by the datatype context*. In other words, the
datatype lexical-to-value map is a kind of extension mapping, like
IEXT for properties and ICEXT for classes. Call it ILEXT-d where d
is the datatype; then the 'meaning' of a literal string sss in a
datatype context defined by d would be ILEXT-d(I(sss)) - compare
IEXT(I(p)) or ICEXT(I(a)) where p is a property uri and a is a uri or
bnode - which since I(sss) = sss is just ILEXT-d(sss), i.e.
L2V(d)(sss). This is exactly what the subject bnode denotes in a
datatype triple; in other words, we are using the datatype property
name as a kind of explicit extension mapping on literal strings. On
this view, then, what a datatype does is to fix the extension mapping
for literals, considered as intensional names. The universal
superproperty rdfs:Literal works the same way but refuses to supply a
context, so letting the extension mapping be anything.
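The analogy can be put in a few lines of Python, for what it is worth. The names I and ILEXT follow the PS; modelling a datatype context by its L2V function is an assumption of the sketch.

```python
def I(literal):
    """Literals denote themselves: I(sss) = sss."""
    return literal

def ILEXT(l2v):
    """Extension mapping fixed by a datatype context with the given L2V map."""
    return lambda denotation: l2v(denotation)

integer_context = ILEXT(lambda s: int(s, 10))  # context for xsd:integer
octal_context = ILEXT(lambda s: int(s, 8))     # context for the hypothetical ex:octal

as_integer = integer_context(I("10"))
as_octal = octal_context(I("10"))
```

The literal "10" has the same denotation in both cases; only the extension changes with the datatype context.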
--
---------------------------------------------------------------------
IHMC (850)434 8903 or (650)494 3973 home
40 South Alcaniz St. (850)202 4416 office
Pensacola (850)202 4440 fax
FL 32501 (850)291 0667 cell
phayes@ihmc.us http://www.ihmc.us/users/phayes
Received on Wednesday, 17 September 2003 21:16:22 UTC