Re: I18N Issue alternative: a passing thought. from Graham Klyne on 2003-09-18 (w3c-rdfcore-wg@w3.org from September 2003)

From: Graham Klyne <gk@ninebynine.org>
Date: Thu, 18 Sep 2003 12:21:34 +0100
To: pat hayes <phayes@ihmc.us>, w3c-rdfcore-wg@w3.org
Message-Id: <5.1.0.14.2.20030918115040.02fdbf30@127.0.0.1>

Continuing in the spirit of airing alternative designs, not proposing them...

I think Pat's approach is elegant and quite effective, and is in 
substantial concurrence with earlier thoughts expressed by DanC [1] and 
myself [2].  The main difference that I see is the proposal to represent 
language tags in the graph rather than as part of a literal.

[1] http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Oct/0031.html

[2] http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Nov/0635.html

I'm wondering if the suggestion to translate

>aaa ppp "sss"@ttt .
>-->>
>aaa ppp _:x .
>_:x xsd:string "sss" .
>_:x rdf:langTag "ttt" .

might be problematic in its use of xsd:string, in that this would mean that:

   aaa ppp "sss"@ttt .
entails
   aaa ppp "sss" .

for which there is no corresponding entailment in the current 
design.  Maybe a simple way to avoid this is to apply the "i-default" tag 
(per RFC2277 - http://www.ietf.org/rfc/rfc2277.txt); e.g. so that

aaa ppp "sss" .
-->>
aaa ppp _:x .
_:x xsd:string "sss" .
_:x rdf:langTag "i-default" .

Thus blocking the above entailment.  Hmmm, i-default is not a good choice 
because it suggests a human readable language, but I think a variation on 
this could work.
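
To make the concern concrete, here is a minimal Python sketch of the two translations and a brute-force simple-entailment check. Everything in it is an assumption for illustration: the helper names `translate` and `simply_entails`, the encoding of triples as tuples of strings, and the use of "i-default" as the marker tag are not part of any RDF specification or of Pat's design.

```python
from itertools import product

def translate(s, p, lex, tag=None, default_tag=None, bnode="_:x"):
    """Translate `s p "lex"@tag` into the bnode form discussed above.

    If the literal has no tag and `default_tag` is given, the default is
    asserted instead (the hypothetical "marker tag" variation).
    """
    triples = {(s, p, bnode), (bnode, "xsd:string", lex)}
    effective = tag if tag is not None else default_tag
    if effective is not None:
        triples.add((bnode, "rdf:langTag", effective))
    return triples

def simply_entails(g, h):
    """True if some mapping of h's bnodes to terms of g makes h a subset of g."""
    bnodes = sorted({t for triple in h for t in triple if t.startswith("_:")})
    terms = sorted({t for triple in g for t in triple})
    for choice in product(terms, repeat=len(bnodes)):
        mapping = dict(zip(bnodes, choice))
        if all(tuple(mapping.get(t, t) for t in triple) in g for triple in h):
            return True
    return False

# Literals are written with their quotes so they cannot be mistaken for bnodes.
tagged = translate("aaa", "ppp", '"sss"', tag='"ttt"')
plain = translate("aaa", "ppp", '"sss"', bnode="_:y")
plain_marked = translate("aaa", "ppp", '"sss"', default_tag='"i-default"', bnode="_:y")

print(simply_entails(tagged, plain))         # untagged form follows from tagged form
print(simply_entails(tagged, plain_marked))  # marker tag blocks the entailment
```

With the plain translation, the untagged graph is just an instance of a subgraph of the tagged one, which is where the unwanted entailment comes from; adding any distinguished tag to the untagged case removes that subgraph relationship.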

..

I'm not sure that I fully concur with Pat's proposed handling of 
parseType=Literal, in that I don't see that, in terms of graph formation, 
there needs to be any different treatment from ordinary plain literals ... 
that is, parseType=Literal makes sense as a purely syntactic directive for 
processing of RDF/XML content to plain literal form.  I don't think this is 
inconsistent with Pat's proposal, I just don't see why the 
parseType=Literal case needs to be drawn out specially in this way.  One of 
the things I least like about the current design is the way that syntactic 
processing is not kept distinct from datatype semantics.  Pat's proposal 
discusses treatment of rdf:XMLLiteral as a pure datatype, which seems 
sensible to me.

Concerning:
>_:x rdfs:Literal "10" .
>
>would say that _:x was some value which has "10" as a lexical form, but we 
>don't (yet) know which one. Or, we could not do this.

Would this be a reasonable interpretation for rdf:value, consistent with 
existing usage?

#g
--

At 20:16 17/09/03 -0500, pat hayes wrote:

>Greetings.
>
>Y'all are going to just LOVE me for this, but thinking about the i18n 
>desirables for XML has led me to the observation that one of our old and 
>abandoned designs for handling datatypes would handle this stuff quite 
>smoothly. The key point is that terms denoting datatype values are allowed 
>in the subject position, so attributes like language tags and lexical 
>'type' can be described as RDF properties. We gave up on this on the 
>grounds largely of triple-bloat, a concern which now seems curiously 
>irrelevant when one contemplates what OWL will look like.  Anyway, in the 
>spirit of Brian's comment,
>
>>I've tried to be careful not to describe it as a proposal.  This is an
>>alternative design.  I'm not proposing it, just describing it.
>
>here's the design.
>
>Plain literals are just strings, and they denote themselves. There are no 
>typed literals. Datatypes are indicated by class/property names. Datatype 
>values are typically indicated by bnodes, so instead of
>
>aaa ppp "sss"^^ddd .
>
>we write
>
>aaa ppp _:x .
>_:x ddd "sss" .
>
>where the _:x denotes the datatype value.  You could use URIs in some 
>cases, eg
>
>ex:PIto5places xsd:number "3.14159"  .
>
>There is a general D-entailment
>
>aaa ddd "sss" .
>|=
>aaa rdf:type ddd .
>
>when sss is a legal lexical form for the datatype ddd; the version of this 
>for XML is an RDF entailment (though see later).
>
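
The D-entailment quoted above can be sketched as a single forward rule. This is only an illustration under stated assumptions: the `LEXICAL_SPACE` table and the function `d_entail_types` are hypothetical names, triples are tuples of plain Python strings, and the regular expressions are simplified stand-ins for the real XSD lexical spaces.

```python
import re

# Hypothetical datatype map: datatype name -> pattern for its lexical space.
LEXICAL_SPACE = {
    "xsd:integer": re.compile(r"[+-]?[0-9]+"),
    "xsd:string": re.compile(r".*", re.DOTALL),
}

def d_entail_types(graph):
    """Add `s rdf:type ddd` for every triple `s ddd "sss"` whose object
    is a legal lexical form for the datatype ddd."""
    inferred = set()
    for s, p, o in graph:
        pattern = LEXICAL_SPACE.get(p)
        if pattern is not None and pattern.fullmatch(o):
            inferred.add((s, "rdf:type", p))
    return graph | inferred

graph = {
    ("_:x", "xsd:integer", "10"),   # legal lexical form
    ("_:y", "xsd:integer", "ten"),  # not in the lexical space, so no typing
}
closed = d_entail_types(graph)
print(("_:x", "rdf:type", "xsd:integer") in closed)
print(("_:y", "rdf:type", "xsd:integer") in closed)
```

The rule fires only on the datatype property, never on class membership alone, which matches the later remark that being in the class is necessary but not sufficient.
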
>This design, unlike our present one, has subject terms denoting datatype 
>values, so lang tags can be considered to be *properties of datatype 
>values*, and the tags themselves can be encoded as simple literals, so we 
>just write an assertion:
>
>_:x rdf:langTag "en" .
>
>and our current design translates thus:
>
>aaa ppp "sss"@ttt .
>-->>
>aaa ppp _:x .
>_:x xsd:string "sss" .
>_:x rdf:langTag "ttt" .
>
>Note that xsd:string is the appropriate datatype for simple literals, 
>providing a way to in effect put a simple literal string in the subject 
>position (encoded as a bnode). In fact, in this design, xsd:string is in 
>effect owl:sameAs applied to literals.
>
>----
>
>This way of handling lang tags allows us to associate lang tags with XML 
>literals without putting the tag into the lexical space of the literal, so 
>allows XML literal to be a normal datatype, just as it is right now 
>(though read on) while also handling one of Martin's requirements. The 
>parsing of parseType="Literal" needs to include the asserting of an 
>appropriate rdf:langTag assertion in the graph, according to the XML 
>rules, but that seems straightforward. This design also allows sub-XML 
>datatypes to automatically inherit language tagging, since they will be 
>members of subClasses of rdf:XMLLiteral and hence of rdf:XMLLiteral 
>itself, and hence the members of these classes will still have any 
>properties they had previously. Notice that the property is of the literal 
>*value*, rather than syntactically attached to the literal, so rdf:langTag 
>only makes intuitive sense for self-denoting literals, or at any rate 
>those which denote textual kinds of thing rather than mathematical kinds 
>of thing. However, there is no need to have special rules to 'ignore' lang 
>tags on non-textual datatypes such as numbers: an assertion like
>
>_:x xsd:integer "25" .
>_:x rdf:langTag "en" .
>
>is semantically vacuous but harmless, or can be considered harmless as far 
>as RDF is concerned. (A lang-tag-savvy app might complain about things 
>like this.)  Also we don't need lang tags as a syntactic attachment to 
>plain literals; the same trick works for plain literals.
>
>There isn't any general semantics for rdf:langTag, but for particular 
>cases it can be defined, eg we can define it for simple literals - simple 
>literal *values* can be pairs just as they are right now, and so 
>IEXT(I(rdf:langTag)) is all pairs of the form <<sss, tag>, tag> , and 
>IEXT(I(xsd:string)) is all pairs <<sss, tag>, sss> -  and for XML literals.
>
>Here's the MT for the datatyping, re-done in a more up-to-date style: D is 
>a datatype map, as usual.
>If <uri, ddd> is in D then:
>I(uri)=ddd;
>ddd is in ICEXT(I(rdf:Datatype));
>for any string sss,  sss is in the lexical space of ddd iff
><L2V(ddd)(sss),sss> is in IEXT(ddd);
>If sss is in the lexical space of ddd then
>L2V(ddd)(sss) is in ICEXT(ddd)
>
>Note that being in the class is necessary but not sufficient for the 
>datatyping rule to apply; this avoids some of the snags we had with this 
>design previously involving subtypes. For example, we can have
>ex:octal rdfs:subClassOf xsd:integer .
>_:x ex:octal "10" .
>
>and _:x unambiguously denotes eight; in fact
>
>_:x owl:sameAs _:y .
>_:y  xsd:integer "8" .
>
>The lexical typing only gets invoked by the datatype property; the class 
>membership has to do with the values. Alternative lexical forms give no 
>problem either:
>
>_:x xsd:integer "2" .
>_:x xsd:integer "0002" .
>
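
The octal and alternative-lexical-form examples above can be checked with a small lexical-to-value table. This is a sketch, not the model theory itself: the `L2V` dictionary and the `value` helper are hypothetical, ex:octal is the illustrative datatype from the message, and Python's `int` is more lenient than the XSD lexical space (it tolerates surrounding whitespace, for instance).

```python
# Hypothetical lexical-to-value maps, one per datatype property.
L2V = {
    "xsd:integer": lambda lexical: int(lexical, 10),
    "ex:octal": lambda lexical: int(lexical, 8),
}

def value(datatype, lexical):
    """Return the value that the subject of `_:x datatype "lexical"` must denote."""
    return L2V[datatype](lexical)

# _:x ex:octal "10" and _:y xsd:integer "8" pin down the same value,
# so _:x owl:sameAs _:y is consistent with both triples.
print(value("ex:octal", "10") == value("xsd:integer", "8"))

# Two lexical forms on the same subject are harmless when they map to one value.
print(value("xsd:integer", "2") == value("xsd:integer", "0002"))
```

The subclass relation between ex:octal and xsd:integer is about the values (eight is an integer); which lexical map applies is decided only by the property used in the triple.
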
>BTW, we could now use rdfs:Literal as a generic superproperty of all 
>datatype properties, as well as a superclass of all datatype values, so that
>
>_:x rdfs:Literal "10" .
>
>would say that _:x was some value which has "10" as a lexical form, but we 
>don't (yet) know which one. Or, we could not do this.
>
>-----
>
>This would be a major change and would probably affect several 
>implementations.
>
>In order to change our current design to this we would need to:
>1. remove typed literals (or, treat them as abbreviations for the 
>two-triple form, maybe?)
>2. remove lang tags from plain literals (or treat these as an 
>abbreviation, similarly)
>3. introduce rdf:langTag (or whatever) and add prose discussing the use of 
>lang tags as properties
>4. modify the datatype semantics, as above
>5. redefine the XML parsing rules for parseType="Literal"
>6. rewrite the Lbase translation appropriately
>
>I think this would mean changes to every document; it would be a fairly 
>horrendous editing task at this stage.
>
>On the other hand, it does have a certain elegance. There is only one kind 
>of literal, and literals are genuinely simple, both syntactically and 
>semantically, and always denote themselves in all contexts (remember 
>non-tidy graphs?); and it uses RDF as a descriptive language rather than 
>extending the syntax in an XML-idiosyncratic way.
>
>We abandoned this design, as I recall, for three reasons. First, it seemed 
>too 'indirect' and like triple-bloat. However, in our current design we 
>have to specify the same information, and we can infer the bnode:
>
>aaa ppp "10"^^xsd:integer .
>|=
>aaa ppp _:x .
>
>compare
>
>aaa ppp _:x .
>_:x xsd:integer "10" .
>
>and in any case in this post-OWL era, triple-bloat seems to be rampant. I 
>note that it would be harmless to allow the current typed-literal form as 
>an abbreviation for the two-triple form, by the way; or even as an 
>alternative, with inference rules to convert them back and forth. The 
>feeling of being 'indirect' came, as I recall, from a feeling that we 
>*ought* to be able, dammit, to write things like
>ex:Jill ex:age "10"
>rather than have to go through a bnode:
>ex:Jill ex:age _:x .
>_:x xsd:integer "10" .
>This feeling now seems to me to have been overly naive, however, with the 
>benefit of hindsight.
>
>Second, it seemed unintuitive to some folk to have a property and a class 
>with the same name. I never had this trouble myself, and it seems to me to 
>be a good illustration of the usefulness of the intensional semantics that 
>RDF provides: if you've got it, flaunt it. [*see PS] However, the design 
>could be modified by allowing systematic variants for the property or 
>class names, eg using xsd:integer for the property and xsd:Integer for the 
>class.  Or we could do without the datatype classes altogether, since
>
>aaa rdf:type xsd:integer .
>  (read: aaa is an integer)
>
>and
>
>aaa xsd:integer _:x .
>(read: aaa is something denoted by a numeral)
>
>convey the exact same information in {xsd:integer}-interpretations.
>
>Third, as I recall, there were some issues arising from the long-range 
>datatyping getting too complicated. OK, I'm not suggesting re-opening that 
>particular can of worms. (Though I would note that when it does get 
>re-opened in the future, I bet this design will be a lot more tractable 
>than our current design, which will have to be simply shelved.)
>
>----
>
>The other i18n issue involved treating XML literals without markup as 
>being  plain text. Assuming that 'plain text' means a character string, I 
>now think we can do that by a bit of semantic sleight of hand as follows. 
>First, observe that any piece of XML can be encoded as a character string, 
>but XML imposes extra equivalence (identity) conditions, such as 
>identifying "<br />" with "<br></br>". So, consider the set of legal XML 
>texts, considered as Unicode strings, and define an equivalence relation 
>on this set by saying that strings with the same XML normal form are 
>equivalent; then say that any such string denotes its equivalence class, 
>and then in a familiar abuse of notation say that singleton classes are 
>identical to their members. Now, any piece of XML text without any markup 
>in it denotes itself, just as a plain literal does. (There may be some 
>whitespace issues which make "  " (two spaces) equivalent to " " (one 
>space); if so, this will need to be stated more carefully, eg by applying 
>the normalization only to stuff inside <->.) If we say that this is the 
>value space of rdf:XMLLiteral, rather than the non-text 'structural' sets 
>we have at present, then Martin might be happier.
>
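
The equivalence-class idea above can be approximated in Python. This sketch makes several assumptions: it uses Canonical XML 2.0 (via the standard library's `xml.etree.ElementTree.canonicalize`, available from Python 3.8) as a stand-in for "XML normal form", it wraps each fragment in a hypothetical `<r>` element so that text without markup still parses, and the helper names `xml_normal_form` and `same_value` are invented for illustration.

```python
from xml.etree.ElementTree import canonicalize

def xml_normal_form(fragment):
    """Canonicalize an XML fragment, wrapped in a throwaway root element."""
    return canonicalize(f"<r>{fragment}</r>")

def same_value(a, b):
    """Two strings are equivalent if they share a normal form."""
    return xml_normal_form(a) == xml_normal_form(b)

# Different spellings of the empty element fall into one equivalence class.
print(same_value("<br />", "<br></br>"))

# Markup-free text is only equivalent to itself, so its class is a singleton.
print(same_value("plain text", "plain text"))

# With this particular normal form, whitespace in text is significant.
print(same_value("a  b", "a b"))
```

Under this choice of normal form the two-spaces worry does not arise, since text content is left untouched; a different normalization could behave differently.
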
>On the other hand, this supports a number of hard-to-state RDF 
>entailments, such as intersubstituting "sss"^^xsd:string and 
>"sss"^^rdf:XMLLiteral  under circumstances which can only be recognized by 
>an XML parser, which seems *very* ugly to include in basic RDF, so I would 
>argue that if we do something like this then we treat rdf:XMLLiteral as a 
>genuine datatype so that these entailments are restricted to 
>D-interpretations and are not valid in simple RDF; and it also means that 
>XML *with* markup denotes something very like a character string; in 
>particular,
>"&lt;"^^rdf:XMLLiteral
>on this proposal, has got absolutely nothing in common with
>"<"^^xsd:string.  So maybe Martin might not be so happy after all.
>
>Anyway, thought I'd just mention it in passing.
>
>Pat
>
>PS.  I thought of an interesting analogy. Literals are a kind of name, and 
>in a simple extensional logic they would have a fixed denotation, eg 
>numerals denote numbers, I("10")=10 (ie, ten) and so on, end of 
>story.  But RDF is intensional, and datatypes treat literals like 
>intensional names. Seen in this way, the literal always denotes itself, ie 
>I(literal)=literal; but it has a variable extension, *determined by the 
>datatype context*. In other words, the datatype lexical-to-value map is a 
>kind of extension mapping, like IEXT for properties and ICEXT for 
>classes.  Call it ILEXT-d where d is the datatype; then the 'meaning' of a 
>literal string sss in a datatype context defined by d would be 
>ILEXT-d(I(sss)) - compare IEXT(I(p)) or ICEXT(I(a)) where p is a property 
>uri and a is a uri or bnode - which since I(sss) = sss is just 
>ILEXT-d(sss), i.e. L2V(d)(sss).  This is exactly what the subject bnode 
>denotes in a datatype triple; in other words, we are using the datatype 
>property name as a kind of explicit extension mapping on literal strings. 
>On this view, then, what a datatype does is to fix the extension mapping 
>for literals, considered as intensional names.  The universal 
>superproperty rdfs:Literal works the same way but refuses to supply a 
>context, so letting the extension mapping be anything.
>
>
>--
>---------------------------------------------------------------------
>IHMC    (850)434 8903 or (650)494 3973   home
>40 South Alcaniz St.    (850)202 4416   office
>Pensacola                       (850)202 4440   fax
>FL 32501                        (850)291 0667    cell
>phayes@ihmc.us       http://www.ihmc.us/users/phayes

------------
Graham Klyne
GK@NineByNine.org
Received on Thursday, 18 September 2003 08:06:27 UTC