Re: ACTION-7 "Check with w3c groups if there are other approches to represent languages as uris"

Hi Penny,

On 17.07.2014 at 13:34, Penny Labropoulou <penny@ilsp.gr> wrote:

> Dave and Felix,
>  
> I think I'm getting even more confused than before as regards language standardization codes and ontologies!
>  
> First of all, a clarification: the ms:linguisticInformation in the lexical/conceptual resource is meant for other things and not for the language of the contents of a resource (e.g. what types of linguistic information are contained, e.g. lemmas, stems, inflectional information etc.). The ms vocab, in fact, includes the following elements for language:
> -          metadataLanguageName and metadataLanguageId – for the language of the metadata of a resource (similar to catalog_language of the dcat vocabulary)
> -          languageName and languageId – for the language of the contents of a resource (e.g. a Greek/English lexicon or a Spanish corpus etc.)
> -          documentLanguageName and documentLanguageId – for the language of an external publication/document/… that is somehow linked to this resource (an article describing it, a manual etc.)
> -          tagsetLanguageName and tagsetLanguageId – for the language of tagsets used for the annotation of a corpus
>  
> Going to the sources of my confusion, in the dcat vocabulary, there are two entries:
> -          the catalog_language (http://www.w3.org/TR/vocab-dcat/#Property:catalog_language) that Dave refers to; I agree with Dave that this only refers to the language of the metadata
> -          the dataset language (http://www.w3.org/TR/vocab-dcat/#Property:dataset_language) which is to be used for the language of the dataset; I thought this was meant for the language of the contents of the language resource (e.g. a lexicon of Greek words which is described in a certain catalogue in English) and would correspond to the ms:languageName and ms:languageId – however the usage note says "This overrides the value of the catalog language in case of conflict." which doesn't make any sense if they refer to two different things…
>  
> As regards the various codes, at META-SHARE we wanted to use (but never implemented) the BCP 47, which overrides the RFC4646 (https://tools.ietf.org/html/bcp47). 


Just to clarify the IETF terminology: BCP 47 is the stable name of the standard for language identification on the Web. The RFCs behind that name are its successive *versions*, which are revised over time, e.g. to align with developments in ISO TC 37.
So one should say: BCP 47, as currently represented by RFC 5646 (for language tags) and RFC 4647 (for matching of language tags).
The predecessor of RFC 5646 was RFC 4646, so until RFC 5646 was published, RFC 4646 was the current language tag RFC for BCP 47.



> In this document, there's a note for using the "shortest ISO 639 code" and the examples consist of mainly two-letter codes (ISO 639-1) and three-letter codes (ISO 639-3) only when there's no two-letter code for each language – maybe this explains the dcat Range note???
> On the other hand, the lingvoj ontology includes a list of languages (http://lingvoj.org/languages/all.html) which, as they say: "This page is providing the complete list of ISO 639 languages, and their tags as defined by BCP 47". However, all the languages at this page appear as ISO 639-3 codes and I have not been able to find examples such as "en-US" (English as spoken in United States).
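
On the "shortest ISO 639 code" note: a minimal sketch of that rule in Python, assuming a tiny hand-written mapping from three-letter to two-letter codes. The table and the function name `shortest_code` are made up for illustration; a real implementation would load the full mapping from the ISO 639 tables or the IANA Language Subtag Registry.

```python
# Hypothetical, hand-written excerpt of a 3-letter -> 2-letter mapping.
# A real table would be loaded from the ISO 639 code lists or the
# IANA Language Subtag Registry rather than typed in by hand.
ISO_639_3_TO_1 = {
    "ell": "el",  # Modern Greek
    "eng": "en",  # English
    "spa": "es",  # Spanish
}


def shortest_code(iso_639_3: str) -> str:
    """Return the two-letter code if one exists, else the three-letter code."""
    code = iso_639_3.lower()
    return ISO_639_3_TO_1.get(code, code)


if __name__ == "__main__":
    # "grc" (Ancient Greek) has no two-letter code, so it stays as it is.
    for code in ("ell", "eng", "grc"):
        print(code, "->", shortest_code(code))
```

This would explain why a list can mix two-letter and three-letter codes without being inconsistent.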


I am not sure whether "I have not found examples …" refers to lingvoj or to examples in general. In general, such tags are widely deployed. See e.g. the guidance for Microsoft application developers:
http://msdn.microsoft.com/en-us/library/cc233978.aspx
This is just one example; you find the same in many parts of the IT / web ecosystem. Here is another one from Java:
http://docs.oracle.com/javase/tutorial/i18n/locale/create.html
(search for en-us on the page)
 

>  I have also not been able to find something in the other ontologies that brings together in one tag/URI combinations of language+script+country+…, as in BCP47. Maybe I'm missing something?


I think the issue is this: across the internet, the web and general IT infrastructure such combinations are common (see above), but for ontologies there is a) no common approach, just many ontologies serving the same purpose, and b) no alignment with what general IT / web technology does.
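
To make the language+script+country combination concrete, here is a rough Python sketch that splits a tag into those three parts. It is deliberately simplified and only handles the shape language[-script][-region]; the full RFC 5646 grammar also allows extended language subtags, variants, extensions, private use and grandfathered tags, none of which are covered here. The names `LangTag` and `parse_tag` are made up for this example.

```python
import re
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# Simplified pattern: language[-script][-region].
# This is an assumption for illustration, not the full RFC 5646 grammar.
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def parse_tag(tag: str) -> LangTag:
    """Split a simple language tag into language, script and region subtags."""
    match = _TAG.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a simple language-script-region tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    # Apply the conventional casing: language lower, script Title, region UPPER.
    return LangTag(
        match.group("language").lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )


if __name__ == "__main__":
    for tag in ("en-US", "zh-Hant-TW", "sr-Latn", "el"):
        print(tag, "->", parse_tag(tag))
```

The point is only that one short string carries all three pieces of information, which is what the ontologies listed in this thread do not offer in a single URI.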

Best,

Felix

>  
> Best,
> Penny
>  
>  
> From: Felix Sasaki [mailto:fsasaki@w3.org] 
> Sent: Thursday, July 17, 2014 1:29 PM
> To: Dave Lewis
> Cc: public-ld4lt@w3.org
> Subject: Re: ACTION-7 "Check with w3c groups if there are other approches to represent languages as uris"
>  
> Hi Dave,
>  
> On 17.07.2014 at 11:37, Dave Lewis <dave.lewis@cs.tcd.ie> wrote:
> 
> 
> Hi Felix,
> Thanks for this, I'll include it in the agenda for today.
> 
> One point:
> 
> http://www.w3.org/TR/vocab-dcat/#Property:catalog_language
> 
> defines the language used in the meta-data, and for that purpose is probably sufficient.
> 
> However, the others seem more relevant to specifying the language of the LanguageResource that is the subject of the meta-data.
> 
> For this I'd tend to agree that we need some way of allowing different schemes to be used by applications that need them, e.g. lexical resources or resources focused on language preservation.
> 
> But where more specialised language code requirements are not in place, we should still specify the best practice, e.g. dct:LinguisticSystem as specified in dcat for catalog_language, in order to promote interoperability in codes as far as possible.
>  
>  
> That is what I am not sure about. The dcat specification itself is ambiguous. If you click on the link of "dct:language", it brings you to
> http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#language
> and that defines languages as an RFC 4646 value, which includes ISO 639-3 and much more. But if you follow the links 1 and 2 of
> dct:LinguisticSystem 
> Resources defined by the Library of Congress (1, 2) SHOULD be used.
> you are led to the ISO 639-1 and ISO 639-2 codes. So it is a bit difficult to understand what "use dct:LinguisticSystem as specified in dcat" actually means.
>  
> Cheers,
>  
> Felix
> 
> 
> 
> The current ms vocab already supports this specialisation, for example having ms:linguisticInformation information for the ms:LexicalConceptualResource subclass, which seems reasonable.
> 
> cheers,
> Dave
> 
> On 04/07/2014 13:06, Felix Sasaki wrote:
> 
> I did this and was told that this proposal was rejected for both RDF 1.0 and RDF 1.1; for the latter, see this thread
> http://lists.w3.org/Archives/Public/public-rdf-wg/2012Oct/0001.html
> which at least Jose Labra and probably Jorge are already aware of, see
> http://www.weso.es/MLODPatterns/Linguistic_metadata.html
> 
> 
> So now we have at least four different approaches for the same purpose; see these websites:
> 
> http://www.w3.org/TR/vocab-dcat/#Property:catalog_language
> http://lingvoj.org/
> http://www.lexvo.org/
> http://glottolog.org/
> 
> I am wondering what best practice to derive from this - one suggestion was to use owl:sameAs between these in appropriate situations. Thoughts?
> 
> - Felix

Received on Thursday, 31 July 2014 10:52:52 UTC