
Re: ISO 639 URIs

From: Gilles Sérasset <Gilles.Serasset@univ-grenoble-alpes.fr>
Date: Wed, 8 Jul 2020 11:46:35 +0200
Cc: open-linguistics <open-linguistics@googlegroups.com>, Linked Data for Language Technology Community Group <public-ld4lt@w3.org>, "public-ontolex@w3.org" <public-ontolex@w3.org>
Message-Id: <DAC3130A-C965-4846-8FAA-5F81F36DEF93@univ-grenoble-alpes.fr>
To: Christian Chiarcos <christian.chiarcos@web.de>
Hi Christian, hi all,

Wouldn’t it be nice if the lexvo.org domain were managed by a group of people from the LLOD community, providing linked data on languages that aggregates all the datasets you mentioned, along with all the “sameAs” relations?

I am thinking of a semi-automatic process (à la DBnary) that would refresh its data from CSVs and other already available linked datasets every month or so, and thus provide an always up-to-date registry.
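
As a minimal sketch of what one step of such a process could look like, the Python below turns rows of a language-code CSV into N-Triples with "sameAs" links. The column names (iso639_3, iso639_2, iso639_1, name) and the http://example.org/language/ namespace are assumptions made up for illustration; only the LoC and SIL URI patterns come from this thread. It does not deal with name clashes or with codes that differ between standards.

```python
import csv
import io

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical namespace for the aggregated registry (an assumption).
BASE = "http://example.org/language/"

# URI patterns quoted in the thread, keyed by hypothetical CSV column names.
TARGETS = {
    "iso639_1": "http://id.loc.gov/vocabulary/iso639-1/{}",
    "iso639_2": "http://id.loc.gov/vocabulary/iso639-2/{}",
    "iso639_3": "https://iso639-3.sil.org/code/{}",
}


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(csv_text):
    """Return one N-Triples line per label and per sameAs link."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        code = (row.get("iso639_3") or "").strip()
        if not code:
            continue  # this sketch mints URIs from the three-letter code only
        subject = f"<{BASE}{code}>"
        name = (row.get("name") or "").strip()
        if name:
            lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(name)}"@en .')
        for column, pattern in TARGETS.items():
            value = (row.get(column) or "").strip()
            if value:
                lines.append(f"{subject} <{OWL_SAME_AS}> <{pattern.format(value)}> .")
    return lines
```

Running this monthly over each source CSV and writing the lines to static files would be enough for a first dump; merging several sources and resolving conflicts between them is left out here.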

Moreover, the LOC linked data is quite poor compared to what lexvo had (for instance, the name “variants” for the English language are only available in English, French and German).

This solution would involve a dedicated team of maintainers (in the long run) and a rather small infrastructure to provide the data (which could simply be served from static files + content negotiation). It assumes that the generation of URIs and accompanying data can be done entirely automatically (which may not be the case if there are name clashes among them). It also assumes that the licences of the different datasets allow for it (which I am unsure about in the case of SIL…).
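
To illustrate the "static files + content negotiation" idea, here is a rough Python sketch that picks a pre-generated file from the Accept header of a request. The file extensions, the set of offered media types and the function names are all assumptions for illustration; a real deployment would more likely rely on the web server's own negotiation support.

```python
# Hypothetical mapping from offered media types to static file extensions.
EXTENSIONS = {
    "text/turtle": ".ttl",
    "application/rdf+xml": ".rdf",
    "application/n-triples": ".nt",
    "text/html": ".html",
}
DEFAULT_TYPE = "text/html"


def parse_accept(header):
    """Return (media_type, q) pairs, highest q first, header order on ties."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        items.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(items, key=lambda item: -item[1])


def choose_file(code, accept_header):
    """Pick the static file name and media type to serve for a language code."""
    for media_type, q in parse_accept(accept_header or ""):
        if q <= 0:
            continue
        if media_type in EXTENSIONS:
            return code + EXTENSIONS[media_type], media_type
        if media_type == "*/*":
            break  # the client accepts anything: fall back to the default
    return code + EXTENSIONS[DEFAULT_TYPE], DEFAULT_TYPE
```

For example, choose_file("eng", "text/turtle") selects eng.ttl, while a browser-style header falls back to the HTML page. Partial wildcards such as "text/*" are deliberately ignored in this sketch.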

I also think that such an alternative dataset may be necessary for other people who need more information attached to the languages they deal with (e.g. date annotations for historical languages, geographical (space/time) annotations for all languages, etc.).



> On 7 Jul 2020, at 18:40, Christian Chiarcos <christian.chiarcos@web.de> wrote:
> Dear all,
> for almost a decade, the Linguistic Linked Open Data community has largely relied on http://www.lexvo.org/ for providing LOD-compliant language identifier URIs, esp. with respect to ISO 639-3. Unfortunately, this got out of sync with the official standard over the years (and when I tried to confirm this impression by checking one of the more recently created language tags, csp [Southern Ping Chinese], I found that lexvo was down).
> However, even if this is fixed, the synchronization issue will arise again, and as ISO 639 keeps developing (at a slow pace), I was wondering whether we should not consider a general shift from lexvo URIs to those provided by the official registration authorities.
> For ISO 639-1 and ISO 639-2, this is the Library of Congress, and they provide
> - a human-readable view: http://id.loc.gov/vocabulary/iso639-2/eng.html, resp. http://id.loc.gov/vocabulary/iso639-1/en.html (this is actually machine-readable, too: XHTML+RDFa!),
> - a machine-readable view (e.g., http://id.loc.gov/vocabulary/iso639-1/en.nt, http://id.loc.gov/vocabulary/iso639-2/eng.nt), and
> - content negotiation (http://id.loc.gov/vocabulary/iso639-2/eng, http://id.loc.gov/vocabulary/iso639-1/en, working at least for application/rdf+xml)
> The problem here is ISO 639-3. The registration authority is SIL and they do provide resolvable URIs, e.g., http://iso639-3.sil.org/code/eng. However, this is plain XHTML only, nothing machine-readable (in particular not the mapping to the other ISO 639 standards). On the positive side, their URIs seem to be stable, and also to preserve deprecated/retired codes (https://iso639-3.sil.org/code/dud).
> I'm wondering what people think. Basically, I see four alternatives to Lexvo URIs:
> - Work with current SIL URIs, even though these do not provide Linked Data.
> - Approach SIL to provide an RDF dump (if not anything more advanced) in addition to the HTML and TSV editions they currently provide.
> - Approach IANA about an RDF edition of the BCP47 subtag registry (https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)? This contains a curated subset of ISO language tags and is supposed to be used in RDF anyway. [This has been suggested before: https://www.w3.org/wiki/Languages_as_RDF_Resources]
> - Approach the Datahub team to provide an RDF view on their CSV collection of language codes (https://datahub.io/core/language-codes, harvested from LoC and the IANA subtag registry, but regularly updated)
> What would be your preferences? Any other ideas? In any case, if we're going to reach out to SIL, IANA or Datahub, we should be able to demonstrate that this is a request from a broader community, because it would come with some effort for them.
> Best,
> Christian
> NB: Apologies for sending this to multiple mailing lists, but I think we should work towards a broader consensus for language resources in general here.
Received on Wednesday, 8 July 2020 09:46:58 UTC
