- From: Penny Labropoulou <penny@ilsp.gr>
- Date: Thu, 17 Jul 2014 14:34:55 +0300
- To: "'Felix Sasaki'" <fsasaki@w3.org>, "'Dave Lewis'" <dave.lewis@cs.tcd.ie>
- Cc: <public-ld4lt@w3.org>
- Message-ID: <014c01cfa1b3$218f6fa0$64ae4ee0$@ilsp.gr>
Dave and Felix,

I think I'm getting even more confused than before as regards language standardization codes and ontologies!

First of all, a clarification: the ms:linguisticInformation in the lexical/conceptual resource is not meant for the language of the contents of a resource; it is meant for other things (e.g. what types of linguistic information are contained: lemmas, stems, inflectional information etc.).

The ms vocab, in fact, includes the following elements for language:
- metadataLanguageName and metadataLanguageId - for the language of the metadata of a resource (similar to catalog_language of the dcat vocabulary)
- languageName and languageId - for the language of the contents of a resource (e.g. a Greek/English lexicon or a Spanish corpus etc.)
- documentLanguageName and documentLanguageId - for the language of an external publication/document/... that is somehow linked to this resource (an article describing it, a manual etc.)
- tagsetLanguageName and tagsetLanguageId - for the language of tagsets used for the annotation of a corpus

Going to the sources of my confusion: in the dcat vocabulary, there are two entries:
- the catalog_language (http://www.w3.org/TR/vocab-dcat/#Property:catalog_language) that Dave refers to; I agree with Dave that this only refers to the language of the metadata
- the dataset language (http://www.w3.org/TR/vocab-dcat/#Property:dataset_language), which is to be used for the language of the dataset; I thought this was meant for the language of the contents of the language resource (e.g. a lexicon of Greek words which is described in a certain catalogue in English) and would correspond to ms:languageName and ms:languageId. However, the usage note says "This overrides the value of the catalog language in case of conflict.", which doesn't make any sense if they refer to two different things.
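To make the distinction between the two dcat entries concrete, here is a minimal sketch of the reading described above: an English-language catalogue describing a Greek lexicon. It assumes that dct:language is the property attached to both the catalogue and the dataset, and that the Library of Congress ISO 639-1 URIs follow the pattern shown; the example.org URIs and the helper name are hypothetical.

```python
# Sketch only: emits N-Triples-style lines as plain strings (no RDF library).
DCT_LANGUAGE = "http://purl.org/dc/terms/language"
# Assumed URI pattern for the Library of Congress ISO 639-1 vocabulary.
LOC_ISO639_1 = "http://id.loc.gov/vocabulary/iso639-1/"


def language_triple(subject_uri: str, iso639_1_code: str) -> str:
    """Return one triple stating the language of the given subject."""
    return f"<{subject_uri}> <{DCT_LANGUAGE}> <{LOC_ISO639_1}{iso639_1_code}> ."


# Hypothetical resources: a catalogue whose metadata is in English,
# and a dataset (a lexicon) whose contents are Greek.
catalog = "http://example.org/catalog"
dataset = "http://example.org/dataset/greek-lexicon"

triples = [
    language_triple(catalog, "en"),  # language of the metadata
    language_triple(dataset, "el"),  # language of the contents
]

if __name__ == "__main__":
    for triple in triples:
        print(triple)
```

Under this reading the two statements describe different things, so there is nothing for one to "override" in the other.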
As regards the various codes, at META-SHARE we wanted to use (but never implemented) BCP 47, which supersedes RFC 4646 (https://tools.ietf.org/html/bcp47). In this document, there's a note about using the "shortest ISO 639 code", and the examples consist mainly of two-letter codes (ISO 639-1), with three-letter codes (ISO 639-3) only when there's no two-letter code for a language - maybe this explains the dcat Range note???

On the other hand, the lingvoj ontology includes a list of languages (http://lingvoj.org/languages/all.html) which, as they say: "This page is providing the complete list of ISO 639 languages, and their tags as defined by BCP 47". However, all the languages on this page appear as ISO 639-3 codes, and I have not been able to find examples such as "en-US" (English as spoken in the United States). I have also not been able to find anything in the other ontologies that brings together in one tag/URI combinations of language+script+country+..., as in BCP 47. Maybe I'm missing something?

Best,
Penny

From: Felix Sasaki [mailto:fsasaki@w3.org]
Sent: Thursday, July 17, 2014 1:29 PM
To: Dave Lewis
Cc: public-ld4lt@w3.org
Subject: Re: ACTION-7 "Check with w3c groups if there are other approches to represent languages as uris"

Hi Dave,

On 17.07.2014 at 11:37, Dave Lewis <dave.lewis@cs.tcd.ie> wrote:

Hi Felix,

Thanks for this, I'll include it in the agenda for today.

One point: http://www.w3.org/TR/vocab-dcat/#Property:catalog_language defines the language used in the meta-data, and for that purpose is probably sufficient. However, the others seem more relevant to specifying the language of the LanguageResource that is the subject of the meta-data. For this I'd tend to agree that there should be some way of allowing different schemes to be used for applications that need them, e.g. lexical resources or resources focused on language preservation.
But where more specialised language code requirements are not in place, we should still specify the best practice, e.g. dct:LinguisticSystem as specified in dcat for catalog_language, in order to promote interoperability in codes as far as possible.

That is what I am not sure about. The dcat specification itself is ambiguous. If you click on the link of "dct:language", it brings you to http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=terms#language, which defines languages as an RFC 4646 value, which includes ISO 639-3 and much more. But if you follow links 1 and 2 in

dct:LinguisticSystem <http://purl.org/dc/terms/LinguisticSystem> Resources defined by the Library of Congress (1 <http://id.loc.gov/vocabulary/iso639-1.html>, 2 <http://id.loc.gov/vocabulary/iso639-2.html>) SHOULD be used.

you are led to the ISO 639-1 and ISO 639-2 codes. So it is a bit difficult to understand what "use dct:LinguisticSystem as specified in dcat" actually means.

Cheers,
Felix

The current ms vocab already supports this specialisation, for example having ms:linguisticInformation for the ms:LexicalConceptualResource subclass, which seems reasonable.

cheers,
Dave

On 04/07/2014 13:06, Felix Sasaki wrote:

I did this and was told that this proposal was rejected both for RDF 1.0 and RDF 1.1; for the latter, see this thread

http://lists.w3.org/Archives/Public/public-rdf-wg/2012Oct/0001.html

which at least Jose Labra and probably Jorge are already aware of, see

http://www.weso.es/MLODPatterns/Linguistic_metadata.html

So now we have at least four different approaches (websites) for the same purpose:

http://www.w3.org/TR/vocab-dcat/#Property:catalog_language
http://lingvoj.org/
http://www.lexvo.org/
http://glottolog.org/

I am wondering what best practice to derive from this - one suggestion was to use owl:sameAs between these in appropriate situations.

Thoughts?

- Felix
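As a rough illustration of both the owl:sameAs suggestion and the point raised earlier in the thread about BCP 47 tags such as "en-US", the sketch below splits a tag into language, script and region subtags and then links two language URIs. It is a simplification, not a full BCP 47 parser: extlang, variant, extension and private-use subtags are ignored. The URI patterns for the Library of Congress and Lexvo vocabularies and the small ISO 639-1 to ISO 639-3 table are assumptions made for this example, and whether owl:sameAs is the appropriate relation between these resources is exactly the open question.

```python
import re

# Simplified pattern: language[-script][-region]. Real BCP 47 tags may carry
# further subtags (extlang, variant, extension, private use), ignored here.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

# Toy lookup from ISO 639-1 to ISO 639-3 for languages mentioned in the thread.
ISO639_1_TO_3 = {"en": "eng", "el": "ell", "es": "spa"}

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
LOC_ISO639_1 = "http://id.loc.gov/vocabulary/iso639-1/"  # assumed URI pattern
LEXVO_ISO639_3 = "http://lexvo.org/id/iso639-3/"         # assumed URI pattern


def split_tag(tag: str) -> dict:
    """Split a simple language tag into language, script and region subtags."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language[-script][-region] tag: {tag!r}")
    parts = match.groupdict()
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


def same_as_triple(tag: str) -> str:
    """Link the two language URIs for a tag's primary language subtag.

    Script and region are dropped: neither URI scheme used here encodes them.
    Raises KeyError for languages missing from the toy lookup table.
    """
    two_letter = split_tag(tag)["language"]
    three_letter = ISO639_1_TO_3[two_letter]
    return (
        f"<{LOC_ISO639_1}{two_letter}> <{OWL_SAME_AS}> "
        f"<{LEXVO_ISO639_3}{three_letter}> ."
    )


if __name__ == "__main__":
    print(split_tag("en-US"))
    print(same_as_triple("en-US"))
```

Note that "en-US" and "en" produce the same triple here: the region is lost as soon as the tag is mapped to a per-language URI, which is the gap described above for combinations of language, script and country.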
Received on Thursday, 17 July 2014 11:35:42 UTC