Re: I18N Issue alternative: a passing thought. from pat hayes on 2003-09-18 (w3c-rdfcore-wg@w3.org from September 2003)

From: pat hayes <phayes@ihmc.us>
Date: Thu, 18 Sep 2003 12:37:02 -0500
To: Graham Klyne <gk@ninebynine.org>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <p06001f06bb8f9b2fb187@[10.0.100.9]>
>Continuing in the spirit of airing alternative designs, not proposing them...
>
>I think Pat's approach is elegant and quite effective, and is in 
>substantial concurrence with earlier thoughts expressed by DanC [1] 
>and myself [2].  The main difference that I see is the proposal to 
>represent language tags in the graph rather than as part of a 
>literal.
>
>[1] http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Oct/0031.html
>
>[2] http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Nov/0635.html
>
>I'm wondering if the suggestion to translate
>
>>aaa ppp "sss"@ttt .
>>-->>
>>aaa ppp _:x .
>>_:x xsd:string "sss" .
>>_:x rdf:langTag "ttt" .
>
>might be problematic in its use of xsd:string, in that this would mean that:
>
>   aaa ppp "sss"@ttt .
>entails
>   aaa ppp "sss" .
>
>for which there is no corresponding entailment in the current design.

Ah, indeed I had not noticed that.  I think this will happen with or 
without xsd:string, actually: it has to do with the fact that the tag 
is now a property, so can be omitted in the description, so a 
description of a simple literal without a tag is indistinguishable 
from an incomplete description of a simple literal with an unknown 
tag. That is ugly and may be fatal.
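
A minimal sketch of that worry, in Python, assuming a graph is just a set of (subject, property, object) tuples and that an untagged literal translates the same way as a tagged one minus the tag triple. The helper name `translate` and the fixed bnode label are made up for illustration. The untagged translation comes out as a subgraph of the tagged one, and a graph entails its subgraphs.

```python
def translate(subj, prop, text, tag=None, bnode="_:x"):
    """Translate  subj prop "text"@tag .  into the bnode form.

    Hypothetical helper: the literal is replaced by a bnode, which is
    then described by an xsd:string triple and, optionally, an
    rdf:langTag triple.
    """
    triples = {
        (subj, prop, bnode),
        (bnode, "xsd:string", text),
    }
    if tag is not None:
        triples.add((bnode, "rdf:langTag", tag))
    return triples


tagged = translate("aaa", "ppp", "sss", tag="ttt")
untagged = translate("aaa", "ppp", "sss")

# Subgraph test: every triple of the untagged form also occurs in the
# tagged form, so asserting the tagged form asserts the untagged one too.
untagged_is_subgraph = untagged <= tagged
print(untagged_is_subgraph)
```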

>  Maybe a simple way to avoid this is to apply the "i-default" tag 
>(per RFC2277 - http://www.ietf.org/rfc/rfc2277.txt); e.g. so that
>
>aaa ppp "sss" .
>-->>
>aaa ppp _:x .
>_:x xsd:string "sss" .
>_:x rdf:langTag "i-default" .
>
>Thus blocking the above entailment.  Hmmm, i-default is not a good 
>choice because it suggests a human readable language, but I think a 
>variation on this could work.

Technically, yes, but I think it makes the whole thing unworkable. If 
the tag assertion is compulsory then the tags will break conventional 
datatyping and we would be better off with the current design.
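
The "technically" part can be sketched as follows, again in Python with graphs as sets of tuples. This assumes the tag triple is always emitted and uses "i-default" only as a placeholder for whatever no-language tag would actually be chosen. The names `translate_mandatory` and `NO_LANGUAGE` are made up for illustration.

```python
NO_LANGUAGE = "i-default"  # placeholder; a variation on this may be preferable


def translate_mandatory(subj, prop, text, tag=None, bnode="_:x"):
    """Bnode translation in which the rdf:langTag triple is always present."""
    return {
        (subj, prop, bnode),
        (bnode, "xsd:string", text),
        (bnode, "rdf:langTag", NO_LANGUAGE if tag is None else tag),
    }


tagged = translate_mandatory("aaa", "ppp", "sss", tag="ttt")
untagged = translate_mandatory("aaa", "ppp", "sss")

# The untagged form now carries a triple that the tagged form lacks,
# so it is no longer a subgraph of the tagged form.
blocked = not (untagged <= tagged)
extra = untagged - tagged
print(blocked, extra)
```

The subgraph argument for the unwanted entailment no longer goes through, but only because every literal description now has to carry a tag triple.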

Pat

>..
>
>I'm not sure that I fully concur with Pat's proposed handling of 
>parseType=Literal, in that I don't see that, in terms of graph 
>formation, there needs to be any different treatment from ordinary 
>plain literals ... that is, parseType=Literal makes sense as a 
>purely syntactic directive for processing of RDF/XML content to 
>plain literal form.  I don't think this is inconsistent with Pat's 
>proposal, I just don't see why the parseType=Literal case needs to 
>be drawn out specially in this way.  One of the things I least like 
>about the current design is the way that syntactic processing is not 
>kept distinct from datatype semantics.  Pat's proposal discusses 
>treatment of rdf:XMLLiteral as a pure datatype, which seems sensible 
>to me.
>
>Concerning:
>>_:x rdfs:Literal "10" .
>>
>>would say that _:x was some value which has "10" as a lexical form, 
>>but we don't (yet) know which one. Or, we could not do this.
>
>Would this be a reasonable interpretation for rdf:value, consistent 
>with existing usage?
>
>#g
>--
>
>At 20:16 17/09/03 -0500, pat hayes wrote:
>
>>Greetings.
>>
>>Y'all are going to just LOVE me for this, but thinking about the 
>>i18n desiderata for XML has led me to the observation that one of 
>>our old and abandoned designs for handling datatypes would handle 
>>this stuff quite smoothly. The key point is that terms denoting 
>>datatype values are allowed in the subject position, so attributes 
>>like language tags and lexical 'type' can be described as RDF 
>>properties. We gave up on this on the grounds largely of 
>>triple-bloat, a concern which now seems curiously irrelevant when 
>>one contemplates what OWL will look like.  Anyway, in the spirit of 
>>Brian's comment,
>>
>>>I've tried to be careful not to describe it as a proposal.  This is an
>>>alternative design.  I'm not proposing it, just describing it.
>>
>>here's the design.
>>
>>Plain literals are just strings, and they denote themselves. There 
>>are no typed literals. Datatypes are indicated by class/property 
>>names. Datatype values are typically indicated by bnodes, so 
>>instead of
>>
>>aaa ppp "sss"^^ddd .
>>
>>we write
>>
>>aaa ppp _:x .
>>_:x ddd "sss" .
>>
>>where the _:x denotes the datatype value.  You could use URIs in 
>>some cases, eg
>>
>>ex:PIto5places xsd:decimal "3.14159"  .
>>
>>There is a general D-entailment
>>
>>aaa ddd "sss" .
>>|=
>>aaa rdf:type ddd .
>>
>>when sss is a legal lexical form for the datatype ddd; the version 
>>of this for XML is an RDF entailment (though see later).
>>
>>This design, unlike our present one, has subject terms denoting 
>>datatype values, so lang tags can be considered to be *properties 
>>of datatype values*, and the tags themselves can be encoded as 
>>simple literals, so we just write an assertion:
>>
>>_:x rdf:langTag "en" .
>>
>>and our current design translates thus:
>>
>>aaa ppp "sss"@ttt .
>>-->>
>>aaa ppp _:x .
>>_:x xsd:string "sss" .
>>_:x rdf:langTag "ttt" .
>>
>>Note that xsd:string is the appropriate datatype for simple 
>>literals, providing a way to in effect put a simple literal string 
>>in the subject position (encoded as a bnode). In fact, in this 
>>design, xsd:string is in effect owl:sameAs applied to literals.
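
The two-triple form and the D-entailment rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: graphs are sets of tuples, plain literals are wrapped in a made-up `Lit` type so they can be told apart from names, and `LEXICAL_SPACE` is a hypothetical two-entry datatype map with simplified lexical checks.

```python
import re
from typing import NamedTuple


class Lit(NamedTuple):
    """A plain literal: just a string, denoting itself."""
    text: str


# Hypothetical datatype map: datatype name -> test for legal lexical forms.
LEXICAL_SPACE = {
    "xsd:integer": lambda s: re.fullmatch(r"[+-]?[0-9]+", s) is not None,
    "xsd:string": lambda s: True,
}


def expand_typed_literal(subj, prop, text, datatype, bnode="_:x"):
    """Rewrite  subj prop "text"^^datatype .  as the two-triple form."""
    return {(subj, prop, bnode), (bnode, datatype, Lit(text))}


def type_entailments(graph):
    """Apply the rule  aaa ddd "sss" . |= aaa rdf:type ddd .  once."""
    inferred = set()
    for s, p, o in graph:
        is_legal = LEXICAL_SPACE.get(p)
        if isinstance(o, Lit) and is_legal is not None and is_legal(o.text):
            inferred.add((s, "rdf:type", p))
    return inferred


good = expand_typed_literal("ex:Jill", "ex:age", "10", "xsd:integer")
print(type_entailments(good))

# An ill-formed lexical form licenses no rdf:type conclusion.
bad = expand_typed_literal("ex:Jill", "ex:age", "ten", "xsd:integer")
print(type_entailments(bad))
```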
>>
>>----
>>
>>This way of handling lang tags allows us to associate lang tags 
>>with XML literals without putting the tag into the lexical space of 
>>the literal, so allows XML literal to be a normal datatype, just as 
>>it is right now (though read on) while also handling one of 
>>Martin's requirements. The parsing of parseType="Literal" needs to 
>>include the asserting of an appropriate rdf:langTag assertion in 
>>the graph, according to the XML rules, but that seems 
>>straightforward. This design also allows sub-XML datatypes to 
>>automatically inherit language tagging, since they will be members 
>>of subClasses of rdf:XMLLiteral and hence of rdf:XMLLiteral itself, 
>>and hence the members of these classes will still have any 
>>properties they had previously. Notice that the property is of the 
>>literal *value*, rather than syntactically attached to the literal, 
>>so rdf:langTag only makes intuitive sense for self-denoting 
>>literals, or at any rate those which denote textual kinds of thing 
>>rather than mathematical kinds of thing. However, there is no need 
>>to have special rules to 'ignore' lang tags on non-textual 
>>datatypes such as numbers: an assertion like
>>
>>_:x xsd:integer "25" .
>>_:x rdf:langTag "en" .
>>
>>is semantically vacuous but harmless, or can be considered harmless 
>>as far as RDF is concerned. (A lang-tag-savvy app might complain 
>>about things like this.)  Also we don't need lang tags as a 
>>syntactic attachment to plain literals; the same trick works for 
>>plain literals.
>>
>>There isn't any general semantics for rdf:langTag, but for 
>>particular cases it can be defined, eg we can define it for simple 
>>literals - simple literal *values* can be pairs just as they are 
>>right now, and so IEXT(I(rdf:langTag)) is all pairs of the form 
>><<sss, tag>, tag> , and IEXT(I(xsd:string)) is all pairs <<sss, 
>>tag>, sss> -  and for XML literals.
>>
>>Here's the MT for the datatyping, re-done in a more up-to-date 
>>style: D is a datatype map, as usual.
>>If <uri, ddd> is in D then:
>>I(uri)=ddd;
>>ddd is in ICEXT(I(rdf:Datatype));
>>for any string sss,  sss is in the lexical space of ddd iff
>><L2V(ddd)(sss),sss> is in IEXT(ddd);
>>If sss is in the lexical space of ddd then
>>L2V(ddd)(sss) is in ICEXT(ddd)
>>
>>Note that being in the class is necessary but not sufficient for 
>>the datatyping rule to apply; this avoids some of the snags we had 
>>with this design previously involving subtypes. For example, we can 
>>have
>>ex:octal rdfs:subClassOf xsd:integer .
>>_:x ex:octal "10" .
>>
>>and _:x unambiguously denotes eight; in fact
>>
>>_:x owl:sameAs _:y .
>>_:y  xsd:integer "8" .
>>
>>The lexical typing only gets invoked by the datatype property; the 
>>class membership has to do with the values. Alternative lexical 
>>forms give no problem either:
>>
>>_:x xsd:integer "2" .
>>_:x xsd:integer "0002" .
>>
>>BTW, we could now use rdfs:Literal as a generic superproperty of 
>>all datatype properties, as well as a superclass of all datatype 
>>values, so that
>>
>>_:x rdfs:Literal "10" .
>>
>>would say that _:x was some value which has "10" as a lexical form, 
>>but we don't (yet) know which one. Or, we could not do this.
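
The octal and alternative-lexical-form examples above can be checked with a small Python sketch. It assumes a hypothetical datatype map `L2V` from datatype names to lexical-to-value functions, with ex:octal read as base-8 numerals. The function name `denotation` is made up for illustration.

```python
# Hypothetical datatype map: datatype name -> lexical-to-value function.
# ex:octal is the example datatype from the text; its values are integers.
L2V = {
    "xsd:integer": lambda s: int(s, 10),
    "ex:octal": lambda s: int(s, 8),
}


def denotation(datatype, lexical_form):
    """Value denoted by the subject of  _:x datatype "lexical_form" ."""
    return L2V[datatype](lexical_form)


# _:x ex:octal "10" .  and  _:y xsd:integer "8" .  pick out the same value.
same_value = denotation("ex:octal", "10") == denotation("xsd:integer", "8")

# Alternative lexical forms of one value can both describe the same node.
same_node = denotation("xsd:integer", "2") == denotation("xsd:integer", "0002")

print(same_value, same_node)
```

The lexical form is interpreted only through the datatype property, which is why "10" can denote eight under ex:octal while the value is still a member of the xsd:integer class.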
>>
>>-----
>>
>>This would be a major change and would probably affect several 
>>implementations.
>>
>>In order to change our current design to this we would need to:
>>1. remove typed literals (or, treat them as abbreviations for 
>>the two-triple form, maybe?)
>>2. remove lang tags from plain literals (or treat these as an 
>>abbreviation, similarly)
>>3. introduce rdf:langTag (or whatever) and add prose discussing the 
>>use of lang tags as properties
>>4. modify the datatype semantics, as above
>>5. redefine the XML parsing rules for parseType="Literal"
>>6. rewrite the Lbase translation appropriately
>>
>>I think this would mean changes to every document; it would be a 
>>fairly horrendous editing task at this stage.
>>
>>On the other hand, it does have a certain elegance. There is only 
>>one kind of literal, and literals are genuinely simple, both 
>>syntactically and semantically, and always denote themselves in all 
>>contexts (remember non-tidy graphs?); and it uses RDF as a 
>>descriptive language rather than extending the syntax in an 
>>XML-idiosyncratic way.
>>
>>We abandoned this design, as I recall, for three reasons. First, it 
>>seemed too 'indirect' and like triple-bloat. However, in our 
>>current design we have to specify the same information, and we can 
>>infer the bnode:
>>
>>aaa ppp "10"^^xsd:integer .
>>|=
>>aaa ppp _:x .
>>
>>compare
>>
>>aaa ppp _:x .
>>_:x xsd:integer "10" .
>>
>>and in any case in this post-OWL era, triple-bloat seems to be 
>>rampant. I note that it would be harmless to allow the current 
>>typed-literal form as an abbreviation for the two-triple form, by 
>>the way; or even as an alternative, with inference rules to convert 
>>them back and forth. The feeling of being 'indirect' came, as I 
>>recall, from a feeling that we *ought* to be able, dammit, to write 
>>things like
>>ex:Jill ex:age "10"
>>rather than have to go through a bnode:
>>ex:Jill ex:age _:x .
>>_:x xsd:integer "10" .
>>This feeling now seems to me to have been overly naive, however, 
>>with the benefit of hindsight.
>>
>>Second, it seemed unintuitive to some folk to have a property and a 
>>class with the same name. I never had this trouble myself, and it 
>>seems to me to be a good illustration of the usefulness of the 
>>intensional semantics that RDF provides: if you've got it, flaunt 
>>it. [*see PS] However, the design could be modified by allowing 
>>systematic variants for the property or class names, eg using 
>>xsd:integer for the property and xsd:Integer for the class.  Or we 
>>could do without the datatype classes altogether, since
>>
>>aaa rdf:type xsd:integer .
>>  (read: aaa is an integer)
>>
>>and
>>
>>aaa xsd:integer _:x .
>>(read: aaa is something denoted by a numeral)
>>
>>convey the exact same information in {xsd:integer}-interpretations.
>>
>>Third, as I recall, there were some issues arising from the 
>>long-range datatyping getting too complicated. OK, I'm not 
>>suggesting re-opening that particular can of worms. (Though I would 
>>note that when it does get re-opened in the future, I bet this 
>>design will be a lot more tractable than our current design, which 
>>will have to be simply shelved.)
>>
>>----
>>
>>The other i18n issue involved treating XML literals without markup 
>>as being  plain text. Assuming that 'plain text' means a character 
>>string, I now think we can do that by a bit of semantic sleight of 
>>hand as follows. First, observe that any piece of XML can be 
>>encoded as a character string, but XML imposes extra equivalence 
>>(identity) conditions, such as identifying "<br />" with 
>>"<br></br>". So, consider the set of legal XML texts, considered as 
>>Unicode strings, and define an equivalence relation on this set by 
>>saying that strings with the same XML normal form are equivalent; 
>>then say that any such string denotes its equivalence class, and 
>>then in a familiar abuse of notation say that singleton classes are 
>>identical to their members. Now, any piece of XML text without any 
>>markup in it denotes itself, just as a plain literal does. (There 
>>may be some whitespace issues which make "  " (two spaces) 
>>equivalent to " " (one space); if so, this will need to be stated 
>>more carefully, eg by applying the normalization only to stuff 
>>inside <->.) If we say that this is the value space of 
>>rdf:XMLLiteral, rather than the non-text 'structural' sets we have 
>>at present, then Martin might be happier.
>>
>>On the other hand, this supports a number of hard-to-state RDF 
>>entailments, such as intersubstituting "sss"^^xsd:string and 
>>"sss"^^rdf:XMLLiteral  under circumstances which can only be 
>>recognized by an XML parser, which seems *very* ugly to include in 
>>basic RDF, so I would argue that if we do something like this then 
>>we treat rdf:XMLLiteral as a genuine datatype so that these 
>>entailments are restricted to D-interpretations and are not valid 
>>in simple RDF; and it also means that XML *with* markup denotes 
>>something very like a character string; in particular,
>>"&lt;"^^rdf:XMLLiteral
>>on this proposal, has got absolutely nothing in common with
>>"<"^^xsd:string.  So maybe Martin might not be so happy after all.
>>
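
The equivalence-class construction above can be approximated in Python. This is a sketch only: it assumes that C14N 2.0 canonicalization (from the standard library, Python 3.8 or later) is an acceptable stand-in for "XML normal form", and it represents each equivalence class by its normal form. The function names and the throwaway wrapper element are made up for illustration.

```python
from xml.etree.ElementTree import canonicalize  # Python 3.8+


def xml_normal_form(text):
    """Stand-in for 'XML normal form', using C14N 2.0 as an assumption.

    The text is wrapped in a throwaway <r> element so that fragments
    without a single root element (including plain text) can be parsed.
    """
    wrapped = canonicalize("<r>" + text + "</r>")
    return wrapped[len("<r>"):-len("</r>")]


def same_xml_value(a, b):
    """Two XML texts denote the same value iff their normal forms agree."""
    return xml_normal_form(a) == xml_normal_form(b)


# The two spellings of an empty element fall into one equivalence class.
print(same_xml_value("<br />", "<br></br>"))

# Text with no markup is its own normal form, like a plain literal.
print(xml_normal_form("hello") == "hello")

# The XML text "&lt;" is not the one-character string "<".
print(xml_normal_form("&lt;") == "<")
```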
>>Anyway, thought I'd just mention it in passing.
>>
>>Pat
>>
>>PS.  I thought of an interesting analogy. Literals are a kind of 
>>name, and in a simple extensional logic they would have a fixed 
>>denotation, eg numerals denote numbers, I("10")=10 (ie, ten) and so 
>>on, end of story.  But RDF is intensional, and datatypes treat 
>>literals like intensional names. Seen in this way, the literal 
>>always denotes itself, ie I(literal)=literal; but it has a variable 
>>extension, *determined by the datatype context*. In other words, 
>>the datatype lexical-to-value map is a kind of extension mapping, 
>>like IEXT for properties and ICEXT for classes.  Call it ILEXT-d 
>>where d is the datatype; then the 'meaning' of a literal string sss 
>>in a datatype context defined by d would be ILEXT-d(I(sss)) - 
>>compare IEXT(I(p)) or ICEXT(I(a)) where p is a property uri and a 
>>is a uri or bnode - which since I(sss) = sss is just ILEXT-d(sss), 
>>i.e. L2V(d)(sss).  This is exactly what the subject bnode denotes 
>>in a datatype triple; in other words, we are using the datatype 
>>property name as a kind of explicit extension mapping on literal 
>>strings. On this view, then, what a datatype does is to fix the 
>>extension mapping for literals, considered as intensional names. 
>>The universal superproperty rdfs:Literal works the same way but 
>>refuses to supply a context, so letting the extension mapping be 
>>anything.
>>
>>
>>--
>>---------------------------------------------------------------------
>>IHMC    (850)434 8903 or (650)494 3973   home
>>40 South Alcaniz St.    (850)202 4416   office
>>Pensacola                       (850)202 4440   fax
>>FL 32501                        (850)291 0667    cell
>>phayes@ihmc.us       http://www.ihmc.us/users/phayes
>
>------------
>Graham Klyne
>GK@NineByNine.org


Received on Thursday, 18 September 2003 13:37:05 UTC