Re: ISSUE-12 On languages and datatypes

On Jun 8, 2011, at 5:25 PM, William Waites wrote:

> Sorry for diverting the discussion a bit away from the
> LanguageTypedString proposal today, I was a bit feeling the time
> pressure of having a lot to discuss as the call was quickly drawing to
> an end and me having to rush out the door on the school run, so
> perhaps it was a bit out of order and I was not as coherent as I could
> have been.
> What I write below is mostly motivated by an intuitive suspicion that
> language tags are a mis-design in RDF. This intuition is informed by
> many years as a programmer/engineer in a dozen or so different
> programming languages - not one of which has any sort of triadic
> fundamental construct like the RDF literal. Anomalousness on its own
> doesn't suggest mis-design - innovations are by definition anomalous
> at first. But I think the only real argument for this design is that
> it is already out there in the wild. And no, I'm not arguing against
> a mechanism for saying that a string represents an utterance in some
> particular language (natural or otherwise), far from it.
> Our starting point is that we have this construct of a literal which
> itself is a 3-tuple (value, language, datatype) where either or both
> of the language and datatype can be nil or absent and they may not
> both be present.

I now think that this way of describing it, while now traditional, is misguided. The fact is, we have two basically different situations. One of them is a simple character string, a very useful general-purpose data structure. The other case is a piece of natural language text, encoded as a <string, tag> pair, but a basically different kind of entity. It is not a general-purpose data structure, and it is not used to represent or encode anything other than itself. It simply exists as raw data, a kind of conceptual endpoint. We were misled by the apparent triviality of the language tag into thinking of this as two cases of a single construct, but this was a mistake. They are basically different kinds of thing, both structurally and conceptually. People who use language tagging seriously are thinking about these as texts, not as strings.

> Then it was pointed out that it makes sense that,
> where we only have the value, the datatype can naturally be supposed
> to be xsd:string. So far, I agree.


> But next, if there is a language
> is it suddenly no longer an xsd:string? That doesn't seem to make 
> sense.

I once thought so, but now I disagree. To borrow a term from philosophy, we have to look at the identity conditions. "chat" in French is a **different word** than "chat" in English. Same string, different word. Ergo, the words are not the same as the strings. And indeed, once you look at it carefully, they aren't strings, exactly because they are *in a language*. They aren't just strings of characters, they are language texts. Formally, a pair of a string and a language is not the same kind of thing as a simple string. "Le chat est sur le table" and "fhk frus fns noeptr k" are just two strings, nothing to particularly choose one over the other, but "Le chat est sur le table"@fr and "fhk frus fns noeptr k"@fr are very different. Something that understands the tag might well treat the second one as an error.
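
To make the identity conditions concrete, here is a minimal Python sketch. It is only an illustration, not part of any RDF specification: the TaggedText type and its field names are invented for this example. It shows that two plain strings with the same characters are identical, while two <string, tag> pairs with the same characters and different tags are not, and neither pair is identical to the bare string.

```python
from typing import NamedTuple


class TaggedText(NamedTuple):
    """Hypothetical model of a language-tagged text as a <string, tag> pair."""
    string: str
    tag: str


# Plain strings: identity is decided by the characters alone.
plain_a = "chat"
plain_b = "chat"
same_plain = plain_a == plain_b

# Tagged texts: identity is decided by the whole pair.
chat_fr = TaggedText("chat", "fr")
chat_en = TaggedText("chat", "en")
same_tagged = chat_fr == chat_en

# A tagged text is not the same thing as its bare string.
differs_from_plain = chat_fr != plain_a
```

Under this model the French and English words stay distinct even though their character strings coincide.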

> But we can't just suppose xsd:string because that breaks the 
> rule and furthermore aren't strings with languages just a different
> kind of string?

No. See above. 

> I think there is consensus in the group up to this 
> point.

Apparently not :-)

> Now the proposal is to make a special datatype for all language-tagged
> strings and kind of jerry-rig the semantics so that we only ever have
> (value, datatype) but in the case of language tags we actually put
> ((text, language), datatype). As far as I know we don't have any other
> construct with this shape, where the value space is two-dimensional
> like this.

Well, the L2V mapping applies to the text in the actual literal other than the datatype URI itself. These literals are unique in that the sum total of text in them is divided, for essentially historical reasons, into two pieces, a string and a tag. But this really is not a matter of any importance, and nothing turns upon it. Your use of "two-dimensional" to describe this is misleading: there are no dimensions involved here. Fact is, in this case just as in the untagged case, the L2V mapping is utterly trivial: it is the identity mapping on the abstract syntax.

But in any case, as Richard suggested on the telecon call, we don't even have to describe this situation in terms of datatypes. We can just specify (as we do at present) that the tagged literal both is, and denotes, the pair of <string, tag>, and specify that the class extension of rdf:LTS is the set of all such pairs. I would like to add an informative note that users who want to consider this a (trivial built-in) datatype are free to do so, even though it does not meet the strict definition of a datatype.

> This removes the inconsistency by letting every literal
> have a datatype and we get to abandon the "no datatypes with
> languages" rule with minimal collateral damage. But the datatype
> doesn't carry much meaning and we can't extend the language part with
> any of the RDF machinery.

Has anyone ever expressed a desire to do this? 

> Why are languages special? They're important, sure. But I struggle to
> think of any other context in which they're given this level of
> special treatment. The closest is XML but even there they're not
> really that special, just another attribute (that happens to be
> inherited in child nodes but inheritance of this type is a well
> understood and not very controversial idea).
> Datatypes are special though, fundamental even, and we have plenty of
> theoretical and practical tools for reasoning about them and working
> with them.

I guess I don't see your point here. Languages are special to those who deal with large amounts of text in various languages. They are, for example, rather special in Europe. Datatypes are very useful, also. But you seem to be arguing for a kind of first/second class distinction in some kind of importance (?) which doesn't seem to be a useful way to think about RDF. RDF features are there for the convenience of RDF users, surely. 

> Languages are complicated, it has been pointed out. Certainly
> true. RDF is a good language for modelling and reasoning about
> complicated things. We lose out by moving languages - as used in RDF -
> outside of the things that we can use RDF to describe.

Well, if we had users clamoring for using RDF to describe languages, you might have a point. But I don't detect this clamor; and in any case, if someone wanted to try writing an OWL ontology of languages, relating them to the standard language tags (using plain literals to encode the tag components, say) then there is nothing to stop them doing this. By having tagged literals, we are not prohibiting more sophisticated approaches to language description. 

> So I propose - and I'm not the first to propose it - that we treat
> language-tagged strings as derived types of xsd:string but not limit
> ourselves to a single placeholder type. In other words,
>  rdflang:en rdfs:subClassOf xsd:string;
>  	     rdfs:label "en".

That does not work. Consider the following entailment sequence. (It uses literal subjects; to avoid them, use bnode subjects and add lots of owl:sameAs statements to connect them to the literals. You get the same bad result.)

"chat"^^rdflang:en  rdf:type rdflang:en .
"chat"^^rdflang:en  rdf:type xsd:string .
"chat"^^rdflang:fr  rdf:type rdflang:fr .
"chat"^^rdflang:fr  rdf:type xsd:string .
"chat"^^rdflang:en owl:sameAs "chat"^^rdflang:fr .

The last is by the identity conditions for xsd:string. 
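
The collapse can be sketched in Python. This is a toy model under stated assumptions, not an implementation of RDF semantics: both function names are invented here, and the first one assumes that a datatype declared as a subclass of xsd:string maps each lexical form to the string itself, as xsd:string does.

```python
def value_as_string_subtype(lexical: str, datatype: str) -> str:
    """Assumed L2V mapping for a hypothetical rdflang:* datatype that is a
    subclass of xsd:string: the value is just the string, so the datatype
    plays no part in the value's identity."""
    return lexical


def value_as_pair(lexical: str, tag: str) -> tuple[str, str]:
    """The tagged literal denotes the <string, tag> pair itself."""
    return (lexical, tag)


# Subclass-of-string reading: the English and French literals denote one value.
collapsed = (value_as_string_subtype("chat", "rdflang:en")
             == value_as_string_subtype("chat", "rdflang:fr"))

# Pair reading: the two literals denote different values.
distinct = value_as_pair("chat", "en") != value_as_pair("chat", "fr")
```

In the first reading the two literals are forced to be the same thing, which is the owl:sameAs conclusion; in the second they stay apart.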

>  rdflang:en-GB rdfs:subClassOf rdflang:en;
>  		rdfs:label "en-GB".

The problem with this is that all rules like this have exceptions. The "locale" is sometimes a dialect, sometimes (as with Chinese) a regional way of writing. This is one reason why language tagging is so complicated and messy compared to class reasoning.

> as a starting point. Where more detailed relationships can be written
> down, then they can be written down.
> This also gets us to a consistent place where every literal is of the
> form (value, datatype) but means that we can also reason about the
> datatypes.

There is not a great deal one can say about datatypes, other than to use them to denote the class of their values. And we can do that with the current proposal. 

It might be worth having a property rdf:langtag which 'extracts' the language tag of a tagged-pair value. Then one could define in OWL useful restriction classes, such as the class of all <string, tag> pairs with the tag en-GB:

BritSpeak rdf:type owl:Restriction .
BritSpeak owl:onProperty rdf:langtag .
BritSpeak owl:hasValue "en-GB" .
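
As a rough illustration of what such a restriction class would pick out, here is a Python sketch. The function names are invented for the example, values are modelled as plain (string, tag) tuples, and the tag comparison is an exact string match, which mirrors owl:hasValue on a plain literal but ignores the case-insensitivity of real language tags.

```python
def langtag(value: tuple[str, str]) -> str:
    """Hypothetical counterpart of rdf:langtag: extract the tag of a pair."""
    return value[1]


def in_britspeak(value: object) -> bool:
    """Membership test for the restriction class: pairs whose tag is 'en-GB'."""
    return (
        isinstance(value, tuple)
        and len(value) == 2
        and langtag(value) == "en-GB"
    )


values = [("colour", "en-GB"), ("color", "en-US"), ("chat", "fr"), "colour"]

# Keep only the members of the class; the bare string is not a pair at all.
britspeak_members = [v for v in values if in_britspeak(v)]
```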

> It also means that languages are extensible outside of the
> ISO/IANA registry process in a way that's possible to make
> self-describing:
>  ex:python rdfs:subClassOf xsd:string;
>  	    rdfs:label "python";
> 	    rdfs:comment """A string containing a fragment of
> 	    		    code in the Python programming
> 			    language""".
> (yes, one *could* use x-python for this but then how do I find out
> what x-python means?).
> It is pretty simple in the serialisers and parsers to treat @en or
> xml:lang="en" as syntactic sugar, a placeholder for datatypes with a
> well-known prefix.

This sounds harder than several other 'pretty simple' ideas that were rejected as way too complicated. I don't like anything that requires going inside a URI to extract pieces of it that carry significant meaning. (Didn't Tim B-L have a rant about the evils of this somewhere?)

> For query languages like SPARQL, it's pretty easy
> to rewrite queries in a pre-processing step.
> Does this proposal have greater cost in terms of the amount of work
> implementers will have to do? Yes. But this cost can be minimised by
> strongly suggesting to keep the current surface form in serialisations
> and queries - in almost all cases data processed by an existing system
> would continue to be processed in the same way it is now, so there is
> a certain level of backwards compatibility built in.

I do not follow how this compatibility would work. If there are no language tags, what happens to all the RDF already out there with language-tagged literals? How do we keep the current surface form, if language tags are prohibited? If this is done by tinkering with the abstract syntax form, I would note that the rdf:PlainLiteral mapping of "string"@tag to "string@tag" achieves exactly the same degree of elegance in making everything fit within the currently accepted datatype model, with a far smaller cost in compatibility and implementation effort.
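
For reference, the rdf:PlainLiteral style of mapping can be sketched in a few lines of Python. The function names are invented for this example. The sketch assumes the convention that the lexical form is the string, then "@", then the tag (empty when there is no tag), and it relies on language tags never containing "@", so the last "@" is always the separator.

```python
def to_plain_literal_lexical(string: str, tag: str = "") -> str:
    """Fold a <string, tag> pair into a single 'string@tag' lexical form."""
    return f"{string}@{tag}"


def from_plain_literal_lexical(lexical: str) -> tuple[str, str]:
    """Split a 'string@tag' lexical form back into (string, tag).

    The split is on the LAST '@', so strings that themselves contain '@'
    survive the round trip.
    """
    string, separator, tag = lexical.rpartition("@")
    if not separator:
        raise ValueError("lexical form must contain '@'")
    return string, tag


tagged = to_plain_literal_lexical("chat", "fr")
untagged = to_plain_literal_lexical("user@example.org")
round_trip = from_plain_literal_lexical(untagged)
```

The surface syntax "chat"@fr can stay exactly as it is; only the abstract form changes.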

> It would,
> however, simplify future implementations by making the language more
> consistent, removing an odd special case for one particular dimension
> of data, and would make it immediately possible to model and reason
> about this dimension of data.

But this 'dimension' of data is only relevant to those who wish to use it. And they, apparently, are quite happy with language tagging in its present form, to the point of being in a state of almost armed rebellion when we suggested anything other than this back in 2003. People who do not care about languages will not use language tags, and for them this entire matter is effectively invisible. I really do not see any convincing case here for such a major redesign of a much-used feature, to satisfy only a (to me) rather unconvincing intuition about abstract elegance.



IHMC, 40 South Alcaniz St., Pensacola, FL 32502
(850)434 8903 or (650)494 3973
(850)202 4416 office
(850)202 4440 fax
(850)291 0667 mobile

Received on Thursday, 9 June 2011 02:05:34 UTC