
RE: ACTION-7 "Check with w3c groups if there are other approches to represent languages as uris"

From: Penny Labropoulou <penny@ilsp.gr>
Date: Thu, 31 Jul 2014 15:04:26 +0300
To: "'Felix Sasaki'" <fsasaki@w3.org>
Cc: "'Dave Lewis'" <dave.lewis@cs.tcd.ie>, <public-ld4lt@w3.org>
Message-ID: <03d301cfacb7$9318c360$b94a4a20$@ilsp.gr>
Thanks, Felix, for the explanations.

Indeed they make the picture clearer, although it's disappointing that
there's no common approach by the ontologies. 

Just to clarify things, I meant that I could not find such examples in the
lingvoj ontology; otherwise, I agree, there are plenty of examples around.




From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Thursday, July 31, 2014 1:52 PM
To: Penny Labropoulou
Cc: Dave Lewis; public-ld4lt@w3.org
Subject: Re: ACTION-7 "Check with w3c groups if there are other approches to
represent languages as uris"


Hi Penny,


On 17.07.2014 at 13:34, Penny Labropoulou <penny@ilsp.gr> wrote:

Dave and Felix,


I think I'm getting even more confused than before as regards language
standardization codes and ontologies!


First of all, a clarification: the ms:linguisticInformation in the
lexical/conceptual resource is meant for other things and not for the
language of the contents of a resource (e.g. what types of linguistic
information are contained, e.g. lemmas, stems, inflectional information
etc.). The ms vocab, in fact, includes the following elements for language:

- metadataLanguageName and metadataLanguageId - for the language of
the metadata of a resource (similar to catalog_language of the dcat

- languageName and languageId - for the language of the contents of
a resource (e.g. a Greek/English lexicon or a Spanish corpus etc.)

- documentLanguageName and documentLanguageId - for the language of
an external publication/document/... that is somehow linked to this resource
(an article describing it, a manual etc.)

- tagsetLanguageName and tagsetLanguageId - for the language of
tagsets used for the annotation of a corpus


Going to the sources of my confusion, in the dcat vocabulary, there are two

- the catalog_language
(http://www.w3.org/TR/vocab-dcat/#Property:catalog_language) that Dave refers
to, which I agree with Dave only refers to the language of the

- the dataset language
(http://www.w3.org/TR/vocab-dcat/#Property:dataset_language), which is to be
used for the language of the dataset. I thought this was meant for the
language of the contents of the language resource (e.g. a lexicon of Greek
words which is described in a certain catalogue in English) and would
correspond to ms:languageName and ms:languageId. However, the usage note
says "This overrides the value of the catalog language in case of conflict",
which doesn't make any sense if they refer to two different things.


As regards the various codes, at META-SHARE we wanted to use (but never
implemented) BCP 47, which overrides RFC 4646
(https://tools.ietf.org/html/bcp47).



Just to clarify the IETF terminology: BCP 47 is always the name of the
standard for language identification on the Web; that name is stable.
However, there are various *versions* of the standard, the individual RFCs,
issued e.g. to align with developments in ISO TC 37.

So one should say: BCP 47, as currently represented by RFC 5646 (for
language tags) and RFC 4647 (for matching of language tags). 

The predecessor of RFC 5646 was RFC 4646. So before RFC 5646, RFC 4646 was
the current RFC for BCP 47. 
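
To illustrate the split between the two RFCs mentioned above, here is a minimal Python sketch of the "basic filtering" scheme described in RFC 4647: a language range matches a tag if the two are equal, or if the range is a prefix of the tag followed by "-", compared case-insensitively. The function name basic_filter and the sample tags are assumptions made for this illustration; they do not come from the thread.

```python
def basic_filter(language_range: str, tag: str) -> bool:
    """Sketch of RFC 4647 basic filtering (hypothetical helper name).

    The range "*" matches any tag; otherwise the range must equal the tag
    or be a prefix of it that ends at a "-" boundary (case-insensitive).
    """
    if language_range == "*":
        return True
    rng = language_range.lower()
    candidate = tag.lower()
    return candidate == rng or candidate.startswith(rng + "-")


# Illustrative tags only; "eng" is included to show it is NOT matched by "en".
tags = ["en", "en-US", "en-Latn-GB", "el", "eng"]
matches = [t for t in tags if basic_filter("en", t)]
print(matches)
```

Note that the prefix must end at a subtag boundary, which is why a range of "en" does not match the three-letter code "eng".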



In this document, there's a note about using the "shortest ISO 639 code", and
the examples consist mainly of two-letter codes (ISO 639-1), with three-letter
codes (ISO 639-3) only where a language has no two-letter code -
maybe this explains the dcat Range note?

On the other hand, the lingvoj ontology includes a list of languages
(http://lingvoj.org/languages/all.html) which, as they say: "This page is
providing the complete list of ISO 639 languages, and their tags as defined
by BCP 47" (https://tools.ietf.org/html/bcp47). However, all the languages
on this page appear as ISO 639-3 codes and I have not been able to find
examples such as "en-US" (English as spoken in the United States).
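
The "shortest ISO 639 code" rule mentioned above can be sketched in Python as follows. This is only an illustration under stated assumptions: the mapping table is a tiny hand-written excerpt, not the full ISO 639 registry, and the name shortest_code is hypothetical.

```python
# Tiny illustrative excerpt of three-letter to two-letter code mappings.
# A real application would load the complete ISO 639 tables instead.
ISO_639_3_TO_1 = {"eng": "en", "ell": "el", "spa": "es"}


def shortest_code(iso639_3: str) -> str:
    """Return the two-letter code when one exists in the table,
    otherwise fall back to the three-letter code (hypothetical helper)."""
    code = iso639_3.lower()
    return ISO_639_3_TO_1.get(code, code)


print(shortest_code("ell"))  # Modern Greek has a two-letter code
print(shortest_code("grc"))  # Ancient Greek has none, so "grc" is kept
```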



Not sure whether by "I have not found examples ..." you mean in lingvoj or in
general. In general such examples are widely deployed. See e.g. the guidance
for Microsoft application developers


This is really just one example - you find the same in many parts of the IT /
web ecosystem. See another one from Java


(search for en-us on the page)


I have also not been able to find something in the other ontologies that
brings together in one tag/URI combinations of language+script+country+...,
as in BCP 47. Maybe I'm missing something?



I think the issue is: in the whole internet, web and even general IT
infrastructure such a combination is common (see above), but in ontologies
there is (a) no common approach, but rather many ontologies for the same
purpose, and (b) no alignment with what general IT / web technology does.
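
As an illustration of the language+script+country combination discussed above, here is a simplified Python sketch that splits a tag into its main subtags. It is deliberately incomplete: it assumes a two- or three-letter primary language subtag and ignores extended language subtags, grandfathered tags and private-use tags, so it is not a full RFC 5646 parser. The names TAG_RE and parse_tag are hypothetical.

```python
import re

# Simplified pattern: language, optional script, optional region, then
# any remaining subtags (variants, extensions) collected unparsed.
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<rest>(?:-[A-Za-z0-9]{1,8})*)$"
)


def parse_tag(tag: str) -> dict:
    """Split a language tag into language, script, region and the rest,
    applying the conventional casing (en, Latn, US)."""
    m = TAG_RE.match(tag)
    if m is None:
        raise ValueError(f"not handled by this simplified parser: {tag!r}")
    script = m.group("script")
    region = m.group("region")
    return {
        "language": m.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "rest": [s for s in m.group("rest").split("-") if s],
    }


print(parse_tag("en-us"))
print(parse_tag("zh-hant-tw"))
```

The point of the sketch is that a single tag carries several independent pieces of information, which is what a per-language URI scheme based only on ISO 639-3 codes cannot express.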







From: Felix Sasaki [mailto:fsasaki@w3.org] 
Sent: Thursday, July 17, 2014 1:29 PM
To: Dave Lewis
Cc: public-ld4lt@w3.org
Subject: Re: ACTION-7 "Check with w3c groups if there are other approches to
represent languages as uris"


Hi Dave,


On 17.07.2014 at 11:37, Dave Lewis <dave.lewis@cs.tcd.ie> wrote:

Hi Felix,
Thanks for this; I'll include it in the agenda for today.

One point:


defines the language used in the meta-data, and for that purpose is probably

However, the others seem more relevant to specifying the language of the
LanguageResource that is the subject of the meta-data.

For this I'd tend to agree that we need some way of allowing different
schemes to be used by applications that need them, e.g. lexical resources or
resources focused on language preservation.

But where more specialised language code requirements are not in place, then
we still should specify the best practice, e.g. dct:LinguisticSystem as
specified in dcat for catalogue_language, in order to promote
interoperability in codes as far as possible.



That is what I am not sure about. The dcat specification itself is
ambiguous. If you click on the link of "dct:language", it brings you to


and that defines languages as an RFC 4646 value, which includes ISO 639-3
and much more. But if you follow the links 1 and 2 of

dct:LinguisticSystem (http://purl.org/dc/terms/LinguisticSystem)
Resources defined by the Library of Congress
(1: http://id.loc.gov/vocabulary/iso639-1.html,
2: http://id.loc.gov/vocabulary/iso639-2.html) SHOULD be used.

you are led to the ISO 639-1 and ISO 639-2 codes. So it is a bit difficult to
understand what it actually means: use dct:LinguisticSystem as specified in





The current ms vocab already supports this specialisation, for example by
having ms:linguisticInformation for the ms:LexicalConceptualResource
subclass, which seems reasonable.


On 04/07/2014 13:06, Felix Sasaki wrote:

I did this and was pointed to the fact that this proposal was rejected for
both RDF 1.0 and RDF 1.1; for the latter, see this thread
which at least Jose Labra and probably Jorge are already aware of, see

So now we have at least four different approaches for the same purpose

http://lingvoj.org/
http://www.lexvo.org/
http://glottolog.org/

I am wondering what best practice to derive from this - one suggestion was
to use owl:sameAs between these in appropriate situations. Thoughts?
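
The owl:sameAs suggestion could look roughly like the following Python sketch, which emits N-Triples linking every pair of URIs that are meant to identify the same language. The helper name same_as_triples is hypothetical, and the three example URIs for English are assumed patterns used only for illustration; they have not been verified against the services listed above and should be checked before use.

```python
from itertools import combinations

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triples(uris):
    """Yield one N-Triples line per unordered pair of URIs.

    owl:sameAs is symmetric, so one statement per pair is enough for a
    reasoner to derive the reverse direction.
    """
    for a, b in combinations(uris, 2):
        yield f"<{a}> <{OWL_SAME_AS}> <{b}> ."


# ASSUMPTION: illustrative URI patterns for English, not verified.
english = [
    "http://www.lingvoj.org/lang/en",
    "http://lexvo.org/id/iso639-3/eng",
    "http://glottolog.org/resource/languoid/id/stan1293",
]

for line in same_as_triples(english):
    print(line)
```

Whether owl:sameAs is appropriate depends on whether the schemes really identify the same thing; where they only approximately agree, a weaker link may be a better fit.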

- Felix

Received on Thursday, 31 July 2014 12:05:00 UTC
