- From: Stasinos Konstantopoulos <konstant@iit.demokritos.gr>
- Date: Thu, 15 Dec 2011 04:03:30 +0200
- To: Richard Cyganiak <richard@cyganiak.de>
- Cc: "Maali, Fadi" <fadi.maali@deri.org>, Government Linked Data Working Group WG <public-gld-wg@w3.org>
Hi again.

On 12 December 2011 14:22, Richard Cyganiak <richard@cyganiak.de> wrote:
>
> On 9 Dec 2011, at 22:28, Stasinos Konstantopoulos wrote:
>> It's hard to imagine anybody having data that won't fit ISO 639.
>> Besides listing pretty much every documented language there is
>> (including extinct and made-up languages like Klingon), it also
>> lists useful clusters ("macrolanguages"), such as "Arabic" (ara),
>> that allow one to underspecify when a more detailed description is
>> not available ("ara" subsumes 30 varieties of Arabic, all with their
>> own three-letter code). It also includes three-letter codes for
>> "undetermined" (und), "multiple and cannot list all" (mul), and "no
>> linguistic content, not applicable" (zxx).
>
> The question is not if the data fits ISO 639. The question is whether
> the data is already tagged with ISO 639. If it isn't, then someone
> has to do the tagging; that is, map "English" to "en", "Irish" to
> "ga", "Both English and Irish" to "mul", and so forth. That's not a
> difficult task, but it has a significant cost, and we have to be
> aware that requiring ISO 639 makes adopting dcat significantly more
> expensive for data publishers who do not yet have ISO 639 compatible
> annotations.
>
> In situations like this, such data publishers are likely to either
> a) not provide the language information at all, b) provide it in
> whatever form they already have, in violation of the standard, or
> c) not adopt the standard at all because it is seen as too complex
> an undertaking. These concerns apply whenever the use of a controlled
> vocabulary is demanded in a standard exchange format.
>
> Mapping existing data into controlled vocabularies always comes with
> a cost. And I would think that often the data consumers are in a
> better position to do that mapping than the data publishers, in terms
> of skills, quality and economic incentives.
>
> That being said, every effort should be made to *recommend* standard
> controlled vocabularies, and highlight their use as best practice.

I fully agree that the step to the First Star is the most important one
to make, so one should never discourage data publishers. At the same
time, it should be possible, and encouraged, to provide more structured
data. Lumping everything together into the same data property seems to
me to discourage structure where it would have been attainable.

>> If you are thinking of entries such as "15th c. English", I agree
>> that that cannot be easily captured in its most general and
>> unrestricted form. But it would still be interesting, LOD-wise, to
>> have the "English" bit as structured data, possibly qualified in
>> free text as "15th c. English". So we still need to decide on a
>> controlled vocabulary that includes a representation for "English"
>> even if it does not include one for "15th c. English".
>
> I see this as a job for consumers of the data, not for publishers of
> the data. Someone who has "15th c. English" in their metadata very
> likely cares about these fine distinctions, and is likely to be
> offended by the suggestion that they should dumb down their data to
> fit into some impoverished ISO scheme…

I agree, and I never even implied that "15th c. English" is not useful
to whomever made the effort to annotate at such detail. But I do not
think anybody would be offended by the proposition that "15th c.
English" is related to the controlled-vocabulary entry for "English".
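Just to make the cost concrete, the tagging step Richard describes is
essentially a table lookup. A minimal sketch in Python, using
three-letter ISO 639-3 codes to match the und/mul/zxx examples above;
the table entries and the fallback to "und" are my own illustration,
not a proposal:

    # Map free-text language names onto ISO 639-3 codes. The entries
    # and the fallback to "und" (undetermined) are illustrative only.
    NAME_TO_ISO639 = {
        "English": "eng",
        "Irish": "gle",
        "Both English and Irish": "mul",
    }

    def tag_language(free_text):
        # Fall back to "und" rather than guessing; such entries can be
        # refined later by whoever knows the data.
        return NAME_TO_ISO639.get(free_text.strip(), "und")

    print(tag_language("Irish"))  # -> gle

The cost Richard points to is of course in compiling the table for a
real catalog, not in the code itself.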
For some applications the controlled-vocabulary entry alone is enough,
and they gain the benefit of the controlled vocabulary in exchange for
giving up the finer-grained description; those applications that
require the finer-grained description will have to know how to handle
the free text. In other words, I find it a good thing if a
specification allows a publisher to, if they so choose, provide both a
link to the closest-fitting entry in the controlled vocabulary and a
fuller free-text description (not to be construed as equivalent to the
former), or either of the two.

That would suggest one of two solutions:

1. defining two properties, one ranging over language URIs and one
   ranging over text literals, or

2. defining a single property ranging over resources (not literals);
   such resources can be either (a) language URIs or (b) unnamed
   resources with an rdfs:label (or similar) property ranging over
   arbitrary text and (optionally) some subsumption or relatedness
   property ranging over language URIs.

I find (1) easier to explain but (2) more conceptually accurate.
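To make the two options concrete, here is a minimal sketch using
rdflib; the ex: property names are hypothetical placeholders, Lexvo
ISO 639-3 URIs stand in for whatever language-URI scheme is chosen,
and skos:closeMatch stands in for the "relatedness property" of (b):

    from rdflib import BNode, Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDFS, SKOS

    EX = Namespace("http://example.org/vocab#")         # hypothetical
    LEXVO = Namespace("http://lexvo.org/id/iso639-3/")  # one possible scheme

    g = Graph()
    ds1 = URIRef("http://example.org/dataset/1")
    ds2 = URIRef("http://example.org/dataset/2")
    ds3 = URIRef("http://example.org/dataset/3")

    # Solution (1): two properties, one ranging over language URIs,
    # one over free-text literals.
    g.add((ds1, EX.languageCode, LEXVO.eng))
    g.add((ds1, EX.languageNote, Literal("15th c. English")))

    # Solution (2): a single property ranging over resources.
    # The object is either a language URI directly ...
    g.add((ds2, EX.language, LEXVO.eng))

    # ... or an unnamed resource with a free-text label, optionally
    # related to the closest-fitting controlled-vocabulary entry.
    lang = BNode()
    g.add((ds3, EX.language, lang))
    g.add((lang, RDFS.label, Literal("15th c. English")))
    g.add((lang, SKOS.closeMatch, LEXVO.eng))

    print(g.serialize(format="turtle"))

Under (2), a consumer that only understands language URIs can follow
the skos:closeMatch link and ignore the label, while a finer-grained
consumer still gets the full free-text description.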
> As always, it's best to survey some actual catalogs and see how they
> represent language, otherwise we can go in circles in this kind of
> discussion forever.

That is useful, but it is also useful, IMHO, to pave the way for more
structure even if the current situation is relatively unstructured.

Best,
Stasinos

Received on Thursday, 15 December 2011 02:04:16 UTC