ISO 639 URIs from Christian Chiarcos on 2020-07-07 (public-ld4lt@w3.org from July 2020)

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Tue, 07 Jul 2020 18:40:46 +0200
To: open-linguistics <open-linguistics@googlegroups.com>
Cc: "Linked Data for Language Technology Community Group" <public-ld4lt@w3.org>, "public-ontolex@w3.org" <public-ontolex@w3.org>
Message-ID: <op.0nd8mcm1br5td5@kitaba>

Dear all,

for almost a decade, the Linguistic Linked Open Data community has largely  
relied on http://www.lexvo.org/ for providing LOD-compliant language  
identifier URIs, esp. with respect to ISO 639-3. Unfortunately, this has  
drifted out of sync with the official standard over the years (and when I tried to  
confirm this impression by checking one of the more recently created  
language tags, csp [Southern Ping Chinese], I found that lexvo was down).

However, even if this is fixed, the synchronization issue will arise  
again, and as ISO 639 keeps developing (at a slow pace), I was wondering  
whether we should not consider a general shift from lexvo URIs to those  
provided by the official registration authorities.

For ISO 639-1 and ISO 639-2, this is the Library of Congress, and they  
provide
- a human-readable view (http://id.loc.gov/vocabulary/iso639-2/eng.html,  
resp. http://id.loc.gov/vocabulary/iso639-1/en.html -- this is actually  
machine-readable, too: XHTML+RDFa!),
- a machine-readable view (e.g.,  
http://id.loc.gov/vocabulary/iso639-1/en.nt,  
http://id.loc.gov/vocabulary/iso639-2/eng.nt), and
- content negotiation (http://id.loc.gov/vocabulary/iso639-2/eng,  
http://id.loc.gov/vocabulary/iso639-1/en, working at least for  
application/rdf+xml)
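
To illustrate the content negotiation option, here is a minimal Python sketch using only the standard library. The helper names (loc_uri, rdf_request, fetch_rdf) are hypothetical, and the only media type assumed to work is application/rdf+xml, as noted above. Nothing is fetched until fetch_rdf is called.

```python
from urllib.request import Request, urlopen

# URI prefix taken from the id.loc.gov examples in this message.
LOC_BASE = "http://id.loc.gov/vocabulary"


def loc_uri(code: str) -> str:
    """Build the id.loc.gov URI for a 2-letter (ISO 639-1) or 3-letter (ISO 639-2) code."""
    code = code.strip().lower()
    if len(code) == 2:
        part = "iso639-1"
    elif len(code) == 3:
        part = "iso639-2"
    else:
        raise ValueError(f"expected a 2- or 3-letter code, got {code!r}")
    return f"{LOC_BASE}/{part}/{code}"


def rdf_request(code: str, media_type: str = "application/rdf+xml") -> Request:
    """Prepare (but do not send) a request that asks for RDF via the Accept header."""
    return Request(loc_uri(code), headers={"Accept": media_type})


def fetch_rdf(code: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body (needs network access)."""
    with urlopen(rdf_request(code), timeout=timeout) as response:
        return response.read()
```

For example, rdf_request("eng") targets http://id.loc.gov/vocabulary/iso639-2/eng with Accept: application/rdf+xml; whether other RDF serializations are served this way is something I have not checked.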

The problem here is ISO 639-3. The registration authority is SIL and they  
provide resolvable URIs, indeed, e.g., http://iso639-3.sil.org/code/eng.  
However, this is plain XHTML only, nothing machine-readable (in particular  
not the mapping to the other ISO 639 standards). On the positive side,  
their URIs seem to be stable, and also to preserve deprecated/retired  
codes (https://iso639-3.sil.org/code/dud).

I'm wondering what people think. Basically, I see four alternatives to  
Lexvo URIs:
- Work with current SIL URIs, even though these do not provide Linked Data.
- Approach SIL to provide an RDF dump (if not anything more advanced) in  
addition to the HTML and TSV editions they currently provide.
- Approach IANA about an RDF edition of the BCP47 subtag registry  
(https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).  
This contains a curated subset of ISO language tags and is supposed to be  
used in RDF anyway. [This has been suggested before:  
https://www.w3.org/wiki/Languages_as_RDF_Resources]
- Approach the Datahub team to provide an RDF view on their CSV collection  
of language codes (https://datahub.io/core/language-codes, harvested from  
LoC and the IANA subtag registry, but regularly updated)
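
To give an idea of what an RDF view of the IANA subtag registry could look like, here is a rough Python sketch that reads records in the registry's text format (records separated by "%%" lines, "Field: value" pairs, indented continuation lines) and emits N-Triples. Everything about the RDF side is an assumption for illustration: the http://example.org/bcp47/ namespace is a placeholder, and the choice of skos:notation and rdfs:label is just one plausible modelling, not anything IANA provides. The sample records are abbreviated and typed in by hand, not fetched.

```python
from typing import Dict, Iterator, List

BASE = "http://example.org/bcp47/"  # placeholder namespace, NOT an official one
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_NOTATION = "http://www.w3.org/2004/02/skos/core#notation"

# Abbreviated, hand-typed sample in the registry's record format.
SAMPLE = """%%
Type: language
Subtag: en
Description: English
Suppress-Script: Latn
%%
Type: language
Subtag: csp
Description: Southern Ping Chinese
Description: Southern Pinghua
Macrolanguage: zh
"""


def parse_records(text: str) -> Iterator[Dict[str, List[str]]]:
    """Yield one dict per record; field values are lists since fields may repeat."""
    record: Dict[str, List[str]] = {}
    last = None
    for line in text.splitlines():
        if line.strip() == "%%":  # record separator
            if record:
                yield record
            record, last = {}, None
        elif line[:1] in (" ", "\t") and last is not None:
            # Continuation line: append to the most recent field value.
            record[last][-1] += " " + line.strip()
        elif ":" in line:
            name, value = line.split(":", 1)
            last = name.strip()
            record.setdefault(last, []).append(value.strip())
    if record:
        yield record


def escape(literal: str) -> str:
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(record: Dict[str, List[str]]) -> List[str]:
    """Turn a 'language' record into N-Triples lines; other record types are skipped."""
    if record.get("Type") != ["language"] or "Subtag" not in record:
        return []
    subtag = record["Subtag"][0]
    subject = f"<{BASE}{subtag}>"
    lines = [f'{subject} <{SKOS_NOTATION}> "{escape(subtag)}" .']
    for description in record.get("Description", []):
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape(description)}" .')
    return lines


if __name__ == "__main__":
    for rec in parse_records(SAMPLE):
        print("\n".join(to_ntriples(rec)))
```

A real edition would also need to cover the other record types (script, region, variant, ...) and fields such as Macrolanguage or Deprecated, which this sketch ignores.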

What would be your preferences? Any other ideas? In any case, if we're  
going to reach out to SIL, IANA or Datahub, we should be able to  
demonstrate that this is a request from a broader community, because it  
would come with some effort for them.

Best,
Christian

NB: Apologies for sending this to multiple mailing lists, but I think we  
should work towards a broader consensus for language resources in general  
here.

Received on Tuesday, 7 July 2020 16:41:14 UTC