Re: dct:language range WAS: ISSUE-2 (olyerickson): dct:language should be added to DCAT [Best Practices for Publishing Linked Data] from Richard Cyganiak on 2011-12-12 (public-gld-wg@w3.org from December 2011)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Mon, 12 Dec 2011 12:22:38 +0000
To: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
Cc: "Maali, Fadi" <fadi.maali@deri.org>, Government Linked Data Working Group WG <public-gld-wg@w3.org>
Message-Id: <CE85D9F2-8865-4CF0-8A6F-A6F370C13155@cyganiak.de>

Stasinos,

On 9 Dec 2011, at 22:28, Stasinos Konstantopoulos wrote:
> It's hard to imagine anybody having data that won't fit ISO 639.
> Besides listing pretty much every documented language there is
> (including extinct and made-up languages like Klingon) it also lists
> useful clusters ("macrolanguages"), such as "Arabic" (ara), that allow
> one to underspecify when a more detailed description is not available
> ("ara" subsumes 30 varieties of Arabic, all with their own
> three-letter code). It also includes three letter codes for
> "undetermined" (und), "multiple and cannot list all" (mul), and "no
> linguistic content, not applicable" (zxx).

The question is not whether the data fits ISO 639. The question is whether the data is already tagged with ISO 639. If it isn't, then someone has to do the tagging – that is, map “English” to “en”, “Irish” to “ga”, “Both English and Irish” to “mul” and so forth. That's not a difficult task, but it has a real, nonzero cost, and we have to be aware that requiring ISO 639 makes adopting dcat significantly more expensive for data publishers who do not yet have ISO 639-compatible annotations.

In situations like this, such data publishers are likely to either a) not provide the language information at all, b) provide it in whatever form they already have, in violation of the standard, or c) not adopt the standard at all because it is seen as too complex an undertaking. These concerns apply whenever a standard exchange format demands the use of a controlled vocabulary.

Mapping existing data into controlled vocabularies always comes with a cost. And I would think that often the data consumers are in a better position to do that mapping than the data publishers, in terms of skills, quality and economic incentives.
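
To make the mapping task concrete, here is a minimal sketch of what a consumer-side mapping might look like. The table contents, the function name to_iso639 and the choice to fall back to "und" for unknown labels are all assumptions made for illustration; a real catalog would need a much larger table and probably human review.

```python
# Hypothetical lookup table from free-text labels to ISO 639 codes.
# The three entries are the examples from this thread, not a real dataset.
LABEL_TO_CODE = {
    "english": "en",
    "irish": "ga",
    "both english and irish": "mul",
}


def to_iso639(label):
    """Return (code, original_label) for a free-text language label.

    Unknown labels map to "und" (undetermined). The original label is
    always returned alongside the code so that no detail is thrown away.
    """
    # Normalise case and collapse runs of whitespace before the lookup.
    key = " ".join(label.lower().split())
    return LABEL_TO_CODE.get(key, "und"), label
```

Keeping the publisher's original label next to the derived code means the consumer can redo or refine the mapping later without going back to the source.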

That being said, every effort should be made to *recommend* standard controlled vocabularies, and highlight their use as best practice.

> If you are thinking of entries such as "15th c. English" and such, I
> agree that that cannot be easily captured in its most general and
> unrestricted form. But it would still be interesting, LOD-wise, to
> have the "English" bit as structured data, possibly qualified in
> free-text as "15th c. English". So we still need to decide on a
> controlled vocabulary that includes a representation for "English"
> even if it does not include one for "15th c. English".

I see this as a job for consumers of the data, not for publishers of the data. Someone who has "15th c. English" in their metadata very likely cares about these fine distinctions, and is likely to be offended by the suggestion that they should dumb down their data to fit into some impoverished ISO scheme…

As always, it's best to survey some actual catalogs and see how they represent language, otherwise we can go in circles in this kind of discussion forever.

Best,
Richard

Received on Monday, 12 December 2011 12:23:17 UTC