- From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
- Date: Mon, 06 Apr 2020 15:56:58 +0200
- To: public-ontolex@w3.org, "Felix Sasaki" <felix@sasakiatcf.com>
- Cc: "christian.chiarcos@gmail.com" <christian.chiarcos@gmail.com>
- Message-ID: <op.0inno8ko89jat0@kitaba>
Hi Felix,

On .04.2020, at 07:25, Felix Sasaki <felix@sasakiatcf.com> wrote:

> I am still involved in W3C, in the internationalization activity. Here
> recently a question came up on BCP 47, the IETF standard for language
> tags including the related sub tag registry, and RDF approaches to
> represent information about language.
>
> In RDF, of course you can use BCP 47 language tags for literals, but
> there are valuable resources like Lexvo that identify languages via
> URIs. Often these resources are based on ISO standards and have no
> direct relation to BCP 47. This leads also to fragmentation, for example
> since BCP 47 includes sub tags that are not part of a given ISO
> standard for languages or regions.
>
> In this context, I have a few questions:
>
> 1) Do you know of any best practices & use cases for using URIs (from
> Lexvo or other sources) in an RDF context? By "using" I mean using the
> URIs to identify the language of a (sub part of an) RDF graph.

This is frequently the case when working with under-resourced or historical languages. The granularity of BCP47 and/or ISO639 is simply insufficient, and the categories are too imprecise, for many applications in linguistics. For low-resource languages and fine-grained classification of language varieties, Glottolog is relatively widely used. ISO639-6, which could have been applied here, has been withdrawn. For historical languages, nothing comparable exists (and the diachronic dimension in Glottolog is unsatisfactory).

In multilingual datasets, people may decide to go for URI-based encoding throughout for the sake of consistency (many languages have no language tags).

Another reason for using URIs is that these URIs are [or at least, can be] defined and verified, whereas some ISO639 labels can be interpreted differently. For example, what is the difference between gmh and de? Traditionally, Middle High German extended from the 11th to the 15th c.; nowadays many people prefer the 11th to the 14th c. So, without any more detailed definition than the one provided by ISO639/BCP47, people will disagree on whether something is gmh or de.

When using URIs, most people will probably prefer ISO639-3 because it is more established than Glottolog and linguistically more fine-grained than ISO639-1 and ISO639-2 (and it doesn't need to follow the complex composition and selection rules of BCP47). The Library of Congress provides ISO639-1 and ISO639-2 URIs, but SIL (for ISO639-3) does not (AFAIK). This is why we normally go for Lexvo, although it's a bit behind ISO639-3.

> 2) Are there any recommendations like: "here use URIs, here use BCP 47"?
> From what I found, the main use case of URIs is to express information
> *about* languages as first-class objects, but not to attach language
> information to other parts of an RDF graph - see 1) above.
>
> 3) In addition to language, there is other type of metadata needed in an
> i18n context, e.g. metadata about directionality of strings. Do you know
> about best practices for representing such metadata in RDF?
>
> 4) In an "identify language via URIs" approach, how would one identify
> the entries of the BCP 47 sub tag registry that do not have a URI?

Provide URIs via the registry.

> 5) Is there an authority for language related URIs?

ISO 639-1: Library of Congress
ISO 639-2: Library of Congress
ISO 639-3: SIL (not URIs, though)

Beyond that, everything is a matter of discussion and perspective, but Glottolog is a good starting point.

NB: There was a discussion about revising language tags in an RDF context (https://github.com/w3c/EasierRDF/issues/22). My personal preference would be to permit URIs *as* language tags and to interpret every language tag as a URI in a specific namespace (unless another namespace is declared). This may be a little radical, but we could keep the current (Turtle/SPARQL) notation in this way.
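
To make the last idea a bit more concrete, here is a minimal Python sketch of how a language tag could be read as a URI in a default namespace. This is only an illustration, not part of any specification: the function name `language_tag_to_uri` is made up, and choosing the Lexvo ISO639-3 namespace as the default is purely an assumption for the example (the proposal does not say which namespace that should be).

```python
from typing import Optional

# Assumption for illustration only: the proposal does not fix a default
# namespace. Lexvo's ISO 639-3 identifiers are used here as a stand-in.
DEFAULT_LANGUAGE_NAMESPACE = "http://lexvo.org/id/iso639-3/"


def language_tag_to_uri(tag: str, namespace: Optional[str] = None) -> str:
    """Interpret a language tag as a URI (hypothetical helper).

    - A tag that already looks like a URI is returned unchanged.
    - Otherwise the tag is lowercased (language tags are case-insensitive)
      and appended to the declared namespace, or to the default one.
    """
    if "://" in tag:
        return tag
    base = namespace if namespace is not None else DEFAULT_LANGUAGE_NAMESPACE
    return base + tag.lower()


# No namespace declared: the default namespace applies.
print(language_tag_to_uri("gmh"))

# Another namespace declared, here the Library of Congress ISO 639-2 one.
print(language_tag_to_uri("gmh", "http://id.loc.gov/vocabulary/iso639-2/"))
```

With such a reading, the notation `"..."@gmh` could stay as it is, while the tag itself becomes resolvable and can be given a precise definition.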

Best,
Christian

--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany
office: Robert-Mayer-Str. 11-15, #107
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28334
Received on Monday, 6 April 2020 13:57:16 UTC