ISSUE-12 On languages and datatypes from William Waites on 2011-06-08 (public-rdf-wg@w3.org from June 2011)

From: William Waites <ww@styx.org>
Date: Thu, 9 Jun 2011 00:25:28 +0200
To: RDF WG <public-rdf-wg@w3.org>
Message-ID: <20110608222528.GK42832@styx.org>
Sorry for diverting the discussion away from the
LanguageTypedString proposal today. I was feeling the time pressure:
there was a lot to discuss, the call was quickly drawing to an end and
I had to rush out the door on the school run. So perhaps it was a bit
out of order and I was not as coherent as I could have been.

What I write below is mostly motivated by an intuitive suspicion that
language tags are a mis-design in RDF. This intuition is informed by
many years as a programmer/engineer in a dozen or so different
programming languages - not one of which has any sort of triadic
fundamental construct like the RDF literal. Anomalousness on its own
doesn't suggest mis-design - innovations are by definition anomalous
at first. But I think the only real argument for this design is that
it is already out there in the wild. And no, I'm not arguing against
a mechanism for saying that a string represents an utterance in some
particular language (natural or otherwise), far from it.

Our starting point is that we have this construct of a literal which
is itself a 3-tuple (value, language, datatype), where either or both
of the language and datatype can be nil or absent but they may not
both be present. Then it was pointed out that it makes sense that,
where we only have the value, the datatype can naturally be supposed
to be xsd:string. So far, I agree. But next, if there is a language,
is it suddenly no longer an xsd:string? That doesn't seem to make
sense. But we can't just suppose xsd:string because that breaks the
rule, and furthermore aren't strings with languages just a different
kind of string? I think there is consensus in the group up to this
point.

Now the proposal is to make a special datatype for all language-tagged
strings and kind of jerry-rig the semantics so that we only ever have
(value, datatype) but in the case of language tags we actually put
((text, language), datatype). As far as I know we don't have any other
construct with this shape, where the value space is two-dimensional
like this. This removes the inconsistency by letting every literal
have a datatype and we get to abandon the "no datatypes with
languages" rule with minimal collateral damage. But the datatype
doesn't carry much meaning and we can't extend the language part with
any of the RDF machinery.
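
To make the two shapes concrete, here is a minimal Python sketch. The
names Literal, to_pair and the datatype name ex:LanguageTaggedString
are placeholders invented for illustration; they are not taken from
the proposal text or from any RDF library.

```python
from typing import NamedTuple, Optional

class Literal(NamedTuple):
    """The current 3-tuple: language and datatype may not both be set."""
    value: str
    language: Optional[str] = None
    datatype: Optional[str] = None

# Placeholder name for the single special datatype (an assumption).
LANG_STRING = "ex:LanguageTaggedString"

def to_pair(lit):
    """Map a 3-tuple literal to the (value, datatype) shape."""
    if lit.language is not None and lit.datatype is not None:
        raise ValueError("a literal may not have both a language and a datatype")
    if lit.language is not None:
        # The value itself becomes a (text, language) pair.
        return ((lit.value, lit.language), LANG_STRING)
    # A plain literal is supposed to be an xsd:string.
    return (lit.value, lit.datatype or "xsd:string")
```

Note that the language ends up inside the value, so nothing outside
to_pair can say anything about it.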

Why are languages special? They're important, sure. But I struggle to
think of any other context in which they're given this level of
special treatment. The closest is XML but even there they're not
really that special, just another attribute (that happens to be
inherited in child nodes but inheritance of this type is a well
understood and not very controversial idea).

Datatypes are special though, fundamental even, and we have plenty of
theoretical and practical tools for reasoning about them and working
with them.

Languages are complicated, it has been pointed out. Certainly
true. RDF is a good language for modelling and reasoning about
complicated things. We lose out by moving languages - as used in RDF -
outside of the things that we can use RDF to describe.

So I propose - and I'm not the first to propose it - that we treat
language-tagged strings as derived types of xsd:string but not limit
ourselves to a single placeholder type. In other words,

  rdflang:en rdfs:subClassOf xsd:string;
             rdfs:label "en".

  rdflang:en-GB rdfs:subClassOf rdflang:en;
                rdfs:label "en-GB".

as a starting point. Where more detailed relationships can be written
down, then they can be written down.
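
As a rough illustration of the kind of reasoning this allows, here is
a minimal Python sketch. The dictionary simply mirrors the two
rdfs:subClassOf statements above, and the function name is_subtype_of
is made up for the example; a real system would use its own RDFS
machinery.

```python
# Hypothetical in-memory copy of the rdfs:subClassOf statements above.
SUBCLASS_OF = {
    "rdflang:en": "xsd:string",
    "rdflang:en-GB": "rdflang:en",
}

def is_subtype_of(datatype, ancestor):
    """Follow rdfs:subClassOf links upward from datatype, looking for ancestor."""
    seen = set()
    current = datatype
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)  # guard against accidental cycles
        current = SUBCLASS_OF.get(current)
    return False
```

With this, is_subtype_of("rdflang:en-GB", "xsd:string") holds, so
anything that accepts an xsd:string can accept a British English
string without knowing about languages at all.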

This also gets us to a consistent place where every literal is of the
form (value, datatype) but means that we can also reason about the
datatypes. It also means that languages are extensible outside of the
ISO/IANA registry process in a way that's possible to make
self-describing:

  ex:python rdfs:subClassOf xsd:string;
            rdfs:label "python";
            rdfs:comment """A string containing a fragment of
                            code in the Python programming
                            language""".

(yes, one *could* use x-python for this but then how do I find out
what x-python means?).

It is pretty simple in the serialisers and parsers to treat @en or
xml:lang="en" as syntactic sugar, a placeholder for datatypes with a
well-known prefix. For query languages like SPARQL, it's pretty easy
to rewrite queries in a pre-processing step.
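
A minimal sketch of what that sugar could look like, in Python. It
assumes "rdflang:" as the well-known prefix; the function names
desugar, resugar and rewrite_query are invented for illustration. The
query rewrite only handles double-quoted literals followed by @tag; it
does not touch lang() or langMatches(), and it does no escaping or
case normalisation.

```python
import re

# Assumption: "rdflang:" stands in for whatever well-known prefix is chosen.
LANG_PREFIX = "rdflang:"

def desugar(value, lang=None, datatype=None):
    """Parser side: turn the surface form into a (value, datatype) pair."""
    if lang is not None and datatype is not None:
        raise ValueError("surface syntax allows a language or a datatype, not both")
    if lang is not None:
        return (value, LANG_PREFIX + lang)
    return (value, datatype or "xsd:string")

def resugar(value, datatype):
    """Serialiser side: emit the familiar surface form again."""
    if datatype.startswith(LANG_PREFIX):
        return '"%s"@%s' % (value, datatype[len(LANG_PREFIX):])
    if datatype == "xsd:string":
        return '"%s"' % value
    return '"%s"^^%s' % (value, datatype)

# A double-quoted string (with backslash escapes) followed by @tag.
_TAGGED = re.compile(r'("(?:[^"\\]|\\.)*")@([A-Za-z]+(?:-[A-Za-z0-9]+)*)')

def rewrite_query(query):
    """Pre-processing step: replace "..."@tag with "..."^^rdflang:tag."""
    return _TAGGED.sub(
        lambda m: m.group(1) + "^^" + LANG_PREFIX + m.group(2), query)
```

Existing data and queries keep their current surface form; only the
internal representation changes.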

Does this proposal have greater cost in terms of the amount of work
implementers will have to do? Yes. But this cost can be minimised by
strongly recommending that the current surface form be kept in
serialisations and queries. In almost all cases data processed by an
existing system would continue to be processed in the same way it is
now, so there is a certain level of backwards compatibility built in.
It would, however, simplify future implementations by making the
language more consistent and removing an odd special case for one
particular dimension of data, and would make it immediately possible
to model and reason about this dimension of data.
Received on Wednesday, 8 June 2011 22:26:00 UTC