- From: William Waites <ww@styx.org>
- Date: Thu, 9 Jun 2011 00:25:28 +0200
- To: RDF WG <public-rdf-wg@w3.org>
Sorry for diverting the discussion a bit away from the LanguageTypedString proposal today. I was feeling the time pressure of having a lot to discuss as the call was quickly drawing to an end, and I had to rush out the door on the school run, so perhaps it was a bit out of order and I was not as coherent as I could have been.

What I write below is mostly motivated by an intuitive suspicion that language tags are a mis-design in RDF. This intuition is informed by many years as a programmer/engineer in a dozen or so different programming languages - not one of which has any sort of triadic fundamental construct like the RDF literal. Anomalousness on its own doesn't suggest mis-design - innovations are by definition anomalous at first. But I think the only real argument for this design is that it is already out there in the wild.

And no, I'm not arguing against a mechanism for saying that a string represents an utterance in some particular language (natural or otherwise), far from it.

Our starting point is that we have this construct of a literal which itself is a 3-tuple (value, language, datatype) where either or both of the language and datatype can be nil or absent and they may not both be present. Then it was pointed out that it makes sense that, where we only have the value, the datatype can naturally be supposed to be xsd:string. So far, I agree.

But next, if there is a language, is it suddenly no longer an xsd:string? That doesn't seem to make sense. But we can't just suppose xsd:string, because that breaks the rule that language and datatype may not both be present, and furthermore aren't strings with languages just a different kind of string? I think there is consensus in the group up to this point.

Now the proposal is to make a special datatype for all language-tagged strings and kind of jerry-rig the semantics so that we only ever have (value, datatype) but in the case of language tags we actually put ((text, language), datatype).
As far as I know we don't have any other construct with this shape, where the value space is two-dimensional like this. This removes the inconsistency by letting every literal have a datatype, and we get to abandon the "no datatypes with languages" rule with minimal collateral damage. But the datatype doesn't carry much meaning and we can't extend the language part with any of the RDF machinery.

Why are languages special? They're important, sure. But I struggle to think of any other context in which they're given this level of special treatment. The closest is XML, but even there they're not really that special, just another attribute (that happens to be inherited in child nodes, but inheritance of this type is a well understood and not very controversial idea). Datatypes are special though, fundamental even, and we have plenty of theoretical and practical tools for reasoning about them and working with them.

Languages are complicated, it has been pointed out. Certainly true. RDF is a good language for modelling and reasoning about complicated things. We lose out by moving languages - as used in RDF - outside of the things that we can use RDF to describe.

So I propose - and I'm not the first to propose it - that we treat language-tagged strings as derived types of xsd:string but not limit ourselves to a single placeholder type. In other words,

    rdflang:en rdfs:subClassOf xsd:string;
        rdfs:label "en".

    rdflang:en-GB rdfs:subClassOf rdflang:en;
        rdfs:label "en-GB".

as a starting point. Where more detailed relationships can be written down, then they can be written down. This also gets us to a consistent place where every literal is of the form (value, datatype) but means that we can also reason about the datatypes.
It also means that languages are extensible outside of the ISO/IANA registry process in a way that's possible to make self-describing:

    ex:python rdfs:subClassOf xsd:string;
        rdfs:label "python";
        rdfs:comment """A string containing a fragment of code in the Python programming language""".

(yes, one *could* use x-python for this, but then how do I find out what x-python means?)

It is pretty simple in the serialisers and parsers to treat @en or xml:lang="en" as syntactic sugar, a placeholder for datatypes with a well-known prefix. For query languages like SPARQL, it's pretty easy to rewrite queries in a pre-processing step.

Does this proposal have greater cost in terms of the amount of work implementers will have to do? Yes. But this cost can be minimised by strongly recommending that the current surface form be kept in serialisations and queries - in almost all cases data processed by an existing system would continue to be processed in the same way it is now, so there is a certain level of backwards compatibility built in.

It would, however, simplify future implementations by making the language more consistent, removing an odd special case for one particular dimension of data, and would make it immediately possible to model and reason about this dimension of data.
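
To make the "syntactic sugar" idea concrete, here is a minimal sketch in Python of how a parser might map the existing (value, language, datatype) form onto (value, datatype), and how the subclass statements could then be used to match en-GB data against a request for en. The namespace IRI bound to rdflang:, the SUBCLASS_OF table, and the function names desugar and is_subtype are all assumptions made for illustration; none of them come from an existing library or specification, and the sketch does not normalise the case of language tags.

```python
# Hypothetical namespace: the proposal uses the prefix rdflang: without binding it to an IRI.
RDFLANG = "http://example.org/rdflang/"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"

# Explicit subclass statements, mirroring the rdfs:subClassOf triples in the proposal.
SUBCLASS_OF = {
    RDFLANG + "en-GB": RDFLANG + "en",
    RDFLANG + "en": XSD_STRING,
}


def desugar(value, lang=None, datatype=None):
    """Turn a (value, language, datatype) literal into a (value, datatype) pair."""
    if lang is not None and datatype is not None:
        raise ValueError("a literal may not carry both a language and a datatype")
    if lang is not None:
        # "..."@en is read as shorthand for "..."^^rdflang:en
        return (value, RDFLANG + lang)
    # A plain literal is supposed to be an xsd:string.
    return (value, datatype or XSD_STRING)


def is_subtype(datatype, ancestor):
    """Walk the subclass chain upward from datatype, looking for ancestor."""
    seen = set()
    current = datatype
    while current is not None and current not in seen:
        if current == ancestor:
            return True
        seen.add(current)
        current = SUBCLASS_OF.get(current)
    return False


literal = desugar("colour", lang="en-GB")
print(literal[1])
print(is_subtype(literal[1], RDFLANG + "en"))
```

A query pre-processor could apply the same desugar step to language filters, so that existing queries keep their surface form while matching happens on datatypes.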
Received on Wednesday, 8 June 2011 22:26:00 UTC