Re: Rethinking ISSUE-12 with lang datatypes from Pat Hayes on 2011-05-27 (public-rdf-wg@w3.org from May 2011)

From: Pat Hayes <phayes@ihmc.us>
Date: Fri, 27 May 2011 12:32:30 -0500
To: Ivan Herman <ivan@w3.org>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <8BD5FF48-0774-486A-8DE7-6E59B3728428@ihmc.us>
On May 27, 2011, at 4:49 AM, Ivan Herman wrote:

> 
> On May 27, 2011, at 11:23 , Andy Seaborne wrote:
> 
>> 
>> 
>> On 25/05/11 17:50, Antoine Zimmermann wrote:
>>> All,
>>> 
>>> 
>>> [disclaimer: I am not vehemently in favour of that proposal, just expressing my thoughts aloud.]
>> 
>> In the same spirit: just thinking aloud.
> 
> Ditto
> 
>> 
>> One of the limitations of datatypes is that the lexical space is 1D: the set of sequences of characters.  If we generalise datatypes for RDF to a "representation space" which can be multi-dimensional, we can formulate and relate language tagged datatypes quite simply.
>> 
>> Restricting the representation space to 1D space of strings, we get back to lexical space and compatibility with XSD etc.
>> 
>> rdf:String is a datatype where the rep space is
>>   (unicode strings) union (unicode strings, validLangTags)
>> The value space is <string> union <string,validLangTags>
>> 
>> rdf:LangTaggedString is a derived datatype of rdf:String, restricting the representation space to (unicode strings, validLangTags).
>> 
>> rdf:lang{langTag} is a derived datatype of rdf:LangTaggedString, restricting the representation space to (unicode strings, {langTag})
> 
> But, I believe, the alternative idea was slightly different. If we remove rdf:LangTaggedString from the equation altogether, and we keep only the rdf:lang-{langtag} as a series of datatypes, then the representation space is simply unicode strings plus a specific datatype. I.e., just like we have
> 
> "1"^^xsd:integer
> "1"^^xsd:double
> 
> that are (afaik) distinct and disjoint, we would have
> 
> "a"^^rdf:lang-en
> "a"^^xsd:string
> 
> different. 

And similarly 

"a"^^rdf:lang-en
"a"^^rdf:lang-en-uk

Right?

> 
> "a" is a shortcut for "a"^^xsd:string
> "a"@en is a shortcut for "a"^^rdf:lang-en
> 
> There is a question whether we would define rdf:lang-en as a subtype (subclass) of xsd:string; and it seems to be safer not to do that. 

It would definitely be wrong to do that. But we could have all the rdf:lang-xx be subclasses of rdf:LangTagString; that would be harmless (and might be useful). Just don't call it a datatype. 

> 
> SPARQL str() 
> 
> returns the unicode string and drops the datatype for all combinations.

Hmm. Does that work for other datatypes? Does str() extract the string "123" from "123"^^xsd:integer? If not, why not? That is, why is this case different from "abc"^^rdf:lang-en? After all, xsd:integer and rdf:lang-en are both just datatypes.
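
To make the uniform reading concrete, here is a minimal sketch in Python. It assumes a literal is nothing more than a lexical form plus a datatype IRI; the names TypedLiteral and str_of are made up for illustration, and the rdf:lang-en IRI is the hypothetical one proposed in this thread, not an existing datatype. As far as I know, SPARQL defines str() on a literal as returning its lexical form, so under that reading both cases behave the same way:

```python
from dataclasses import dataclass

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"


@dataclass(frozen=True)
class TypedLiteral:
    """Illustrative model: one lexical string plus one datatype IRI."""
    lexical: str
    datatype: str


def str_of(lit: TypedLiteral) -> str:
    """Model of str(): return the lexical form and ignore the datatype."""
    return lit.lexical


# An ordinary typed literal.
one_two_three = TypedLiteral("123", XSD + "integer")

# Hypothetical tag-as-datatype literal; rdf:lang-en is only a proposal.
abc_en = TypedLiteral("abc", RDF + "lang-en")

print(str_of(one_two_three))
print(str_of(abc_en))
```

If the tag really is just a datatype, nothing in str_of needs to know about language tags at all, which is the point of the question.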

Pat

PS. This tag-as-datatype idea does work, but it raises hairs on the back of my neck, and I have been struggling to say why. It just seems wrong to say that a language tag is a DATAtype. And it seems like overkill.

The key issue with lang tagged literals is that they are the only literal form in RDF that has two strings (as well as an implicit type). All of the complications that we get embroiled in at this point are ways of trying to get these two strings back into being one. rdf:PlainLiteral smooshed them together into one string. Now we are proposing to bury one of them inside a URI to get rid of it. I would vastly prefer that we simply accepted that some literals have more than one string, and adapted our notion of literal typing to accommodate that fact, rather than trying to disguise it or pretend it's not true, and so become obliged to swallow some clearly artificial notion (such as a language tag being a kind of datatype) just to preserve what is in any case a purely arbitrary model of literal typing. 

Peter has expressed a worry that changing this will interfere with the heart, or maybe the foundations, of RDF, but this worry is really nothing more than a vague rumbling sound. Suppose we had said originally that the L2V mapping applied to the lexical form of the literal, rather than to a string embedded in this lexical form. Nothing would have been significantly different in the RDF specs: with a slightly adapted L2V mapping, no entailments would have been altered, and no algorithms would have needed to change. But this pseudo-problem, and all the twisting and turning we and others have gone through and are still going through, would simply not have arisen. We can still do this, and it really would be more like having a haircut than like major abdominal surgery. 
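
A minimal sketch of what "L2V applied to the whole lexical form" could look like, in Python. Everything here is an assumption made for illustration: the Literal type, the function names, and the choice to take the value of a lang-tagged literal to be the pair (string, lowercased tag). It is a sketch of the idea, not a proposal for spec text.

```python
from typing import NamedTuple, Optional, Tuple


class Literal(NamedTuple):
    """Illustrative lexical form: one string, optionally a second one."""
    string: str
    tag: Optional[str] = None  # the second string; None for ordinary literals


def l2v_integer(lit: Literal) -> int:
    """L2V for an integer-like type: defined only on one-string forms."""
    if lit.tag is not None:
        raise ValueError("integer literals carry no language tag")
    return int(lit.string)


def l2v_lang_string(lit: Literal) -> Tuple[str, str]:
    """L2V for lang-tagged strings: defined only on two-string forms.

    Assumption: the value is the pair (string, tag), with the tag
    lowercased so that tags compare case-insensitively.
    """
    if lit.tag is None:
        raise ValueError("a language-tagged literal needs a tag")
    return (lit.string, lit.tag.lower())


print(l2v_integer(Literal("123")))
print(l2v_lang_string(Literal("foo", "EN")))
```

Each mapping takes the whole lexical form as its argument, so neither string has to be smooshed into the other or hidden inside a URI.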

A question for the rdf:lang-en proposal. In order to determine the language tag of a lang-tagged literal, it is necessary to parse the inside of a URI. Is this likely to be a problem? It feels like a problem to me.
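
To show what that parsing would involve, here is a minimal sketch in Python. The prefix below is hypothetical (it is simply the rdf: namespace followed by "lang-", as in the proposal), and lang_tag_from_datatype is a name invented for this example. The sketch does no validation of the tag itself.

```python
# Hypothetical IRI prefix for the proposed rdf:lang-{langtag} datatypes.
RDF_LANG_PREFIX = "http://www.w3.org/1999/02/22-rdf-syntax-ns#lang-"


def lang_tag_from_datatype(datatype_iri):
    """Return the language tag buried in a datatype IRI, or None.

    Only strips the assumed prefix; it does not check that the
    remainder is a well-formed language tag.
    """
    if not datatype_iri.startswith(RDF_LANG_PREFIX):
        return None
    tag = datatype_iri[len(RDF_LANG_PREFIX):]
    return tag or None


print(lang_tag_from_datatype(RDF_LANG_PREFIX + "en"))
print(lang_tag_from_datatype("http://www.w3.org/2001/XMLSchema#string"))
```

Every consumer that cares about the language would need some equivalent of this string surgery on the datatype IRI, rather than reading the tag directly off the literal.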


> 
> Ivan
> 
>> 
>> "foo"@en is special syntax ("foo", "en").
>> (cf. 123 for "123"^^xsd:integer)
>> 
>> SPARQL str() is defined to return the first element of a tuple.
>> 
>> Then rdf:PlainLiteral is a datatype with a 1D lexical space, encoded using "@" as a separator.
>> 
>> (Does it say anywhere in RDF that derived datatypes must be subclasses?)
>> 
>> 	Andy
>> 
> 
> 
> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> PGP Key: http://www.ivan-herman.net/pgpkey.html
> FOAF: http://www.ivan-herman.net/foaf.rdf

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes
Received on Friday, 27 May 2011 17:33:01 UTC