Re: I18N Issue alternative: a passing thought. from pat hayes on 2003-09-18 (w3c-rdfcore-wg@w3.org from September 2003)

From: pat hayes <phayes@ihmc.us>
Date: Wed, 17 Sep 2003 20:16:12 -0500
To: w3c-rdfcore-wg@w3.org
Message-Id: <p06001f17bb8d7e0e8e65@[192.168.1.2]>
Greetings.

Y'all are going to just LOVE me for this, but thinking about the i18n 
desireables for XML has led me to the observation that one of our old 
and abandoned designs for handling datatypes would handle this stuff 
quite smoothly. The key point is that terms denoting datatype values 
are allowed in the subject position, so attributes like language tags 
and lexical 'type' can be described as RDF properties. We gave up on 
this on the grounds largely of triple-bloat, a concern which now 
seems curiously irrelevant when one contemplates what OWL will look 
like.  Anyway, in the spirit of Brian's comment,

>I've tried to be careful not to describe it as a proposal.  This is an
>alternative design.  I'm not proposing it, just describing it.

here's the design.

Plain literals are just strings, and they denote themselves. There 
are no typed literals. Datatypes are indicated by class/property 
names. Datatype values are typically indicated by bnodes, so instead 
of

aaa ppp "sss"^^ddd .

we write

aaa ppp _:x .
_:x ddd "sss" .

where the _:x denotes the datatype value.  You could use URIs in some cases, eg

ex:PIto5places xsd:number "3.14162"  .

There is a general D-entailment

aaa ddd "sss" .
|=
aaa rdf:type ddd .

when sss is a legal lexical form for the datatype ddd; the version of 
this for XML is an RDF entailment (though see later).

This design, unlike our present one, has subject terms denoting 
datatype values, so lang tags can be considered to be *properties of 
datatype values*, and the tags themselves can be encoded as simple 
literals, so we just write an assertion:

_:x rdf:langTag "en" .

and our current design translates thus:

aaa ppp "sss"@ttt .
-->>
aaa ppp _:x .
_:x xsd:string "sss" .
_:x rdf:langTag "ttt" .

Note that xsd:string is the appropriate datatype for simple literals, 
providing a way to in effect put a simple literal string in the 
subject position (encoded as a bnode). In fact, in this design, 
xsd:string is in effect owl:sameAs applied to literals.

----

This way of handling lang tags allows us to associate lang tags with 
XML literals without putting the tag into the lexical space of the 
literal, so allows XML literal to be a normal datatype, just as it is 
right now (though read on) while also handling one of Martin's 
requirements. The parsing of parseType="Literal" needs to include the 
asserting of an appropriate rdf:langTag assertion in the graph, 
according to the XML rules, but that seems straightforward. This 
design also allows sub-XML datatypes to automatically inherit 
language tagging, since they will be members of subClasses of 
rdf:XMLLiteral and hence of rdf:XMLliteral itself, and hence the 
members of these classes will still have any properties they had 
previously. Notice that the property is of the literal *value*, 
rather than syntactically attached to the literal, so rdf:langTag 
only makes intuitive sense for self-denoting literals, or at any rate 
those which denote textual kinds of thing rather than mathematical 
kinds of thing. However, there is no need to have special rules to 
'ignore' lang tags on non-textual datatypes such as numbers: an 
assertion like

_:x xsd:integer "25" .
_:x rdf:langTag "en" .

is semantically vacuous but harmless, or can be considered harmless 
as far as RDF is concerned. (A lang-tag-savvy app might complain 
about things like this.)  Also we don't need lang tags as a syntactic 
attachment to plain literals; the same trick works for plain literals.

There isn't any general semantics for rdf:langTag, but for particular 
cases it can be defined, eg we can define it for simple literals - 
simple literal *values* can be pairs just as they are right now, and 
so IEXT(I(rdf:langTag)) is all pairs of the form <<sss, tag>, tag> , 
and IEXT(I(xsd:string)) is all pairs <<sss, tag>, sss> -  and for XML 
literals.

Here's the MT for the datatyping, re-done in a more up-todate style: 
D is a datatype map, as usual.
If <uri, ddd> is in D then:
I(uri)=ddd;
ddd is in ICEXT(I(rdf:Datatype));
for any string sss,  sss is in the lexical space of ddd iff
<L2V(ddd)(sss),sss> is in IEXT(ddd);
If sss is in the lexical space of ddd then
L2V(ddd)(sss) is in ICEXT(ddd)

Note that being in the class is necessary but not sufficient for the 
datatyping rule to apply; this avoids some of the snags we had with 
this design previously involving subtypes. For example, we can have
ex:octal rdfs:subClassOf xsd:integer .
_:x ex:octal "10" .

and _:x unambiguously denotes eight; in fact

_:x owl:sameAs _:y .
_:y  xsd:integer "8" .

The lexical typing only gets invoked by the datatype property; the 
class membership has to do with the values. Alternative lexical forms 
give no problem either:

_:x xsd:integer "2" .
_:x xsd:integer "0002" .

BTW, we could now use rdfs:Literal as a generic superproperty of all 
datatype properties, as well as a superclass of all datatype values, 
so that

_:x rdfs:Literal "10" .

would say that _:x was some value which has "10" as a lexical form, 
but we don't (yet) know which one. Or, we could not do this.

-----

This would be a major change and would probably effect several implementations.

In order to change our current design to this we would need to:
1. remove typed literals (or, treat them as an abbreviations for the 
two-triple form, maybe?)
2. remove lang tags from plain literals (or treat these as an 
abbreviation, similarly)
3. introduce rdf:langTag (or whatever) and add prose discussing the 
use of lang tags as properties
4. modify the datatype semantics, as above
5. redefine the XML parsing rules for parseType="Literal"
6. rewrite the Lbase translation appropriately

I think this would mean changes to every document; it would be a 
fairly horrendous editing task at this stage.

On the other hand, it does have a certain elegance. There is only one 
kind of literal, and literals are genuinely simple, both 
syntactically and semantically, and always denote themselves in all 
contexts (remember non-tidy graphs?); and it uses RDF as a 
descriptive language rather than extending the syntax in an 
XML-idiosyncratic way.

We abandoned this design, as I recall, for three reasons. First, it 
seemed too 'indirect' and like triple-bloat. However, in our current 
design we have to specify the same information, and we can infer the 
bnode:

aaa ppp "10"^^xsd:integer .
|=
aaa ppp _:x .

compare

aaa ppp _:x .
_:x xsd:integer "10" .

an in any case in this post-OWL era, triple-bloat seems to be 
rampant. I note that it would be harmless to allow the current 
typed-literal form as an abbreviation for the two-triple form, by the 
way; or even as an alternative, with inference rules to convert them 
back and forth. The feeling of being 'indirect' came, as I recall, 
from a feeling that we *ought* to be able, dammit, to write things 
like
ex:Jill ex:age "10"
rather have to go through a bnode:
ex:Jill ex:age _:x .
_:x xsd:integer "10" .
This feeling now seems to me to have been overly naive, however, with 
the benefit of hindsight.

Second, it seemed unintuitive to some folk to have a property and a 
class with the same name. I never had this trouble myself, and it 
seems to me to be a good illustration of the usefulness of the 
intensional semantics that RDF provides: if you've got it, flaunt it. 
[*see PS] However, the design could be modified by allowing 
systematic variants for the property or class names, eg using 
xsd:integer for the property and xsd:Integer for the class.  Or we 
could do without the datatype classes altogether, since

aaa rdf:type xsd:integer .
  (read: aaa is an integer)

and

aaa xsd:integer _:x .
(read: aaa is something denoted by a numeral)

convey the exact same information in {xsd:integer}-interpretations.

Third, as I recall, there were some issues arising from the 
long-range datatyping getting too complicated. OK, Im not suggesting 
re-opening that particular can of worms. (Though I would note that 
when it does get re-opened in the future, I bet this design will be a 
lot more tractable than our current design, which will have to be 
simply shelved.)

----

The other i18n issue involved treating XML literals without markup as 
being  plain text. Assuming that 'plain text' means a character 
string, I now think we can do that by a bit of semantic sleight of 
hand as follows. First, observe that any piece of XML can be encoded 
as a character string, but XML imposes extra equivalence (identity) 
conditions, such as identifying "<br />" with "<br></br>". So, 
consider the set of legal XML texts, considered as Unicode strings, 
and define an equivalence relation on this set by saying that strings 
with the same XML normal form are equivalent; then say that any such 
string denotes its equivalence class, and then in a familiar abuse of 
notation say that singleton classes are identical to their members. 
Now, any piece of XML text without any markup in it denotes itself, 
just as a plain literal does. (There may be some whitespace issues 
which make "  " (two spaces) equivalent to " " (one space); if so, 
this will need to be stated more carefully, eg by applying the 
normalization only to stuff inside <->.) If we say that this is the 
value space of rdf:XMLLiteral, rather than the non-text 'structural' 
sets we have at present, then Martin might be happier.

On the other hand, this supports a number of hard-to-state RDF 
entailments, such as intersubstituting "sss"^^xsd:string and 
"sss"^^rdf:XMLLiteral  under circumstances which can only be 
recognized by an XML parser, which seems *very* ugly to include in 
basic RDF, so I would argue that if we do something like this then we 
treat rdf:XMLLiteral as a genuine datatype so that these entailments 
are restricted to D-interpretations and are not valid in simple RDF; 
and it also means that XML *with* markup denotes something very like 
a character string; in particular,
"&lt;"^^rdf:XMLLiteral
on this proposal, has got absolutely nothing in common with
"<"^^xsd:string.  So maybe Martin might not be so happy after all.

Anyway, thought I'd just mention it in passing.

Pat

PS.  I thought of an interesting analogy. Literals are a kind of 
name, and in a simple extensional logic they would have a fixed 
denotation, eg numerals denote numbers, I("10")=10 (ie, ten) and so 
on, end of story.  But RDF is intensional, and datatypes treat 
literals like intensional names. Seen in this way, the literal always 
denotes itself, ie I(literal)=literal; but it has a variable 
extension, *determined by the datatype context*. In other words, the 
datatype lexical-to-value map is a kind of extension mapping, like 
IEXT for properties and ICEXT for classes.  Call it ILEXT-d where d 
is the datatype; then the 'meaning' of a literal string sss in a 
datatype context defined by d would be ILEXT-d(I(sss)) - compare 
IEXT(I(p)) or ICEXT(I(a)) where p is a property uri and a is a uri or 
bnode - which since I(sss) = sss is just ILEXT-d(sss), i.e. 
L2V(d)(sss).  This is exactly what the subject bnode denotes in a 
datatype triple; in other words, we are using the datatype property 
name as a kind of explicit extension mapping on literal strings. On 
this view, then, what a datatype does is to fix the extension mapping 
for literals, considered as intensional names.  The universal 
superproperty rdfs:Literal works the same way but refuses to supply a 
context, so letting the extension mapping be anything.


-- 
---------------------------------------------------------------------
IHMC	(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32501			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes
Received on Wednesday, 17 September 2003 21:16:22 UTC