Re: dct:language range WAS: ISSUE-2 (olyerickson): dct:language should be added to DCAT [Best Practices for Publishing Linked Data] from Richard Cyganiak on 2011-12-19 (public-gld-wg@w3.org from December 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 19 Dec 2011 18:23:10 +0000
To: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
Cc: "Maali, Fadi" <fadi.maali@deri.org>, Government Linked Data Working Group WG <public-gld-wg@w3.org>
Message-Id: <BDBF7DC7-20CE-43A8-B3E5-F4B21E268AF5@cyganiak.de>

Hi Stasinos,

On 15 Dec 2011, at 02:03, Stasinos Konstantopoulos wrote:
>> Mapping existing data into controlled vocabularies always comes with a cost. And I would think that often the data consumers are in a better position to do that mapping than the data publishers, in terms of skills, quality and economic incentives.
>> 
>> That being said, every effort should be made to *recommend* standard controlled vocabularies, and highlight their use as best practice.
> 
> I fully agree that the step to the First Star is the most important
> step to make, so one should never discourage data publishers. At the
> same time, it should be possible and encouraged to provide more
> structured data. Lumping everything together into the same data
> property seems to me like it's discouraging structure where it would
> have been attainable.

Lumping things together into a single property can be called “unstructured” or it can be called “efficient”. The question is whether the lumping together makes the data hard to work with or not. As long as it's still possible for a data consumer to be certain that a controlled vocabulary was used, and as long as the resulting data is still easy to query, there isn't really any downside to using only a single property.

>> Someone who has "13th c. English" in their metadata very likely cares about these fine distinctions, and is likely to be offended by the suggestion that they should dumb down their data to fit into some impoverished ISO scheme…
> 
> I agree and I never even implied that "13th c. English" is not useful
> to whomever made the effort to annotate at such detail. But I do not
> think anybody would be offended by the proposition that "13th c.
> English" is related to the controlled-vocab entry for "English". For
> some applications the latter is enough and they gain the benefit of
> the controlled vocabulary in exchange for giving up the finer-grain
> description; those applications that require the finer grain
> description will have to know how to handle the free text.
> 
> In other words, I find it a good thing that a specification allows a
> publisher to, if they so choose, provide both a link to the
> closest-fitting entry in the controlled vocabulary and a fuller
> free-text description (not to be construed as equivalent to the
> former); or either of the two.

Yes, that makes sense.

> That would suggest one of two solutions:
> 1. defining two properties, one ranging over language URIs and one
> ranging over text literals, or

That would be possible – dc:language and dcat:languageISO for example.

> 2. defining a single property ranging over resources (not literals);
> such resources can be either (a) language URIs or (b) unnamed
> resources with an rdfs:label (or similar) property ranging over
> arbitrary text and (optionally) some subsumption or relatedness
> property ranging over language URIs.

I don't like this style of modelling because it leads to data with lots of little blank nodes that most often only have a single literal attached to them. This is prone to data production errors and it's awkward to write queries against this kind of data. Basically, everyone loses except the conceptual modeller who likes the cleanliness of the model… (The CIDOC CRM is an example of this modelling style. It's quite a pain to work with!)
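
To make the query awkwardness concrete, here is a minimal sketch in Python that models triples as plain tuples rather than using an RDF library. The example.org URI, the `_:b1` blank node label, and the helper names are all made up for illustration; only the shape of the data matters.

```python
# Triples as (subject, predicate, object) tuples; all identifiers are hypothetical.
DATASET = "http://example.org/dataset/1"
LANG = "dcterms:language"
LABEL = "rdfs:label"

# Blank-node shape: the language is an unnamed resource carrying a label.
blank_node_style = [
    (DATASET, LANG, "_:b1"),
    ("_:b1", LABEL, "13th c. English"),
]

# Literal shape: the language value hangs directly off the dataset.
literal_style = [
    (DATASET, LANG, "13th c. English"),
]


def objects(triples, subject, predicate):
    """Return all objects found for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def language_via_blank_node(triples, dataset):
    """Two hops: dataset -> blank node -> label."""
    labels = []
    for node in objects(triples, dataset, LANG):
        labels.extend(objects(triples, node, LABEL))
    return labels


def language_via_literal(triples, dataset):
    """One hop: dataset -> literal."""
    return objects(triples, dataset, LANG)
```

Both lookups return the same label for well-formed data, but the blank-node shape needs an extra hop, and a blank node published without its label silently yields an empty result.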

Also, neither (1) nor (2) solves the problem that we don't have standard URIs for representing ISO languages. As we saw earlier, if every publisher needs to invent their own set of ISO-derived URIs to identify languages, then nothing is achieved.

So my preferred design would be (3):

3. Use dc:language or dcterms:language with literals. In the case where the literals are ISO 639 language codes, use typed literals with the xsd:language datatype.
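
As a rough sketch of what (3) could look like on the publisher side, the Python below emits one N-Triples line per language value. The function name, the `is_code` flag, and the example.org URI are assumptions for illustration. The regular expression is the lexical pattern of xsd:language from XML Schema Part 2; it only checks the form of the value, and a word such as "English" also matches it, so the caller has to state whether the value is meant as a code.

```python
import re

# Lexical pattern of xsd:language as given in XML Schema Part 2.
XSD_LANGUAGE_PATTERN = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")

DCTERMS_LANGUAGE = "http://purl.org/dc/terms/language"
XSD_LANGUAGE_IRI = "http://www.w3.org/2001/XMLSchema#language"


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def language_triple(dataset_iri, value, is_code=False):
    """Build one N-Triples line stating the language of a dataset.

    Hypothetical helper: values flagged as codes become xsd:language
    typed literals, everything else stays a plain literal.
    """
    if is_code:
        if not XSD_LANGUAGE_PATTERN.fullmatch(value):
            raise ValueError(f"not a valid xsd:language value: {value!r}")
        literal = f'"{value}"^^<{XSD_LANGUAGE_IRI}>'
    else:
        literal = f'"{escape(value)}"'
    return f"<{dataset_iri}> <{DCTERMS_LANGUAGE}> {literal} ."


if __name__ == "__main__":
    print(language_triple("http://example.org/dataset/1", "en", is_code=True))
    print(language_triple("http://example.org/dataset/1", "13th c. English"))
```

A consumer can then tell the two cases apart by the datatype alone, without needing a second property.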

>> As always, it's best to survey some actual catalogs and see how they represent language, otherwise we can go in circles in this kind of discussion forever.
> 
> That is useful, but it is also useful, IMHO, to pave the way for more
> structure even if the current situation is relatively unstructured.

To quote again what I said above:

>> That being said, every effort should be made to *recommend* standard controlled vocabularies, and highlight their use as best practice.

I think we are in agreement here.

Best,
Richard

Received on Monday, 19 December 2011 18:23:49 UTC