Re: URIs for languages from Bernard Vatant on 2012-02-17 (public-lod@w3.org from February 2012)

From: Bernard Vatant <bernard.vatant@mondeca.com>
Date: Fri, 17 Feb 2012 12:10:18 +0100
To: "M. Scott Marshall" <mscottmarshall@gmail.com>
Cc: Gerard de Melo <gdemelo@mpi-inf.mpg.de>, Lars Marius Garshol <lars.garshol@bouvet.no>, Barry Norton <barry.norton@ontotext.com>, public-lod@w3.org
Message-ID: <CAK4ZFVEJGKCet004C-mYeheFsiEOaPKjFbK5Kvj53U-163KNLQ@mail.gmail.com>
Hi all

I wanted to answer Gerard yesterday, but some parts of the answer have
already been addressed by Lars Marius, whom I'm happy to read here, and
even happier to agree with, given that in the past we sometimes agreed to
disagree on those tricky issues of identification and URIs (... and on the
no less tricky subject of the quality of beer, although on that one I have
to bow in respect to his authority :)

I think there are two aspects which should be kept distinct for references
as important as languages: the stability of identifiers, and the quality
of the descriptions available for those URIs. A third one is what is
identified ...

For the first point, I guess if LoC is not able to ensure stable URIs inside
its own domain, who will? And from both a social (trust) point of view and a
technical one, I prefer to have URIs in the id.loc.gov namespace rather than
in some more or less opaque purl one. For example, all the fundamental W3C
specs at the basis of the whole RDF ecosystem are in the w3.org domain, and
the W3C has a policy of URI stability which IMO can be adopted by LoC.

Now for data quality. With all due respect to the amazing work of lexvo.org,
I think Gerard's argument about ISO 639-3 being "better" than ISO 639-2 is
off topic here. This is to be discussed inside ISO 639 committees :)
The point is that we have 639-1 and 639-2 and 639-3 and now 639-5, and it's
a mess, OK, but that's the legacy our systems have to cope with. What do we
need in linked data land? At a minimum, an exact mirror of those codes in
the form of stable URIs, as close as possible to the source authority for
those codes, and built in such a way that both the publishing authority and
the match with the ISO normative source are absolutely unambiguous.
Seems to me that http://id.loc.gov/vocabulary/iso639-2/grc provides exactly
this.
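
To make the "exact mirror" idea concrete, here is a minimal sketch in
Python. It assumes only the URI pattern visible in the example above; the
function name and the validation rule are illustrative assumptions, not
part of any service published by LoC.

```python
import re

# Namespace taken from the URI quoted in this message. Everything else here
# (function name, validation rule) is an illustrative assumption.
LOC_ISO639_2_BASE = "http://id.loc.gov/vocabulary/iso639-2/"


def loc_language_uri(code: str) -> str:
    """Build the id.loc.gov URI that mirrors an ISO 639-2 code."""
    normalized = code.strip().lower()
    # ISO 639-2 codes are exactly three ASCII letters. This checks the shape
    # only; it does not check that the code is actually assigned.
    if not re.fullmatch(r"[a-z]{3}", normalized):
        raise ValueError(f"not a three-letter ISO 639-2 code: {code!r}")
    return LOC_ISO639_2_BASE + normalized


print(loc_language_uri("grc"))
```

Because the code is the only variable part of the URI, anyone holding the
ISO code can derive the identifier without a lookup, which is what makes
the match with the normative source unambiguous.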

Of course one can ask why LoC does not (yet) also publish URIs for 639-3,
but hopefully it's in the pipeline, as well as ISO 3166 countries, as Lars
Marius points out (those were also in the original OASIS Published Subjects
publication ...). But id.loc.gov does have 639-5 entries.

That other data sets will provide better or more complete information about
things identified by those URIs is not a problem. I think it's OK if a
reference URI provides just the minimal description needed for
disambiguation and context, as a basis for maximal re-use. To take a
completely different example, what is the most reused URI in the LOD cloud,
beyond the URIs of the standards themselves (RDF, RDFS, OWL)? Certainly
http://xmlns.com/foaf/0.1/Person. What does FOAF itself provide about this
class? Not much. But the fact that millions of triples use it makes it a
reference, and looking at that usage, both at vocabulary and data level,
can help to figure out what a foaf:Person can be. For example go to
http://labs.mondeca.com/endpoint/lov_aggregator and run the proposed
default query ...

The referent "in the real world" of
http://id.loc.gov/vocabulary/iso639-2/grc is as fuzzy as the referent of
http://xmlns.com/foaf/0.1/Person. It's indeed a conceptualization of a
language, which has been defined by the ISO 639-2 standard according to
criteria most people won't argue about, and some will disagree with for
good reasons. And that's why we have 639-3. Like any classification of
languages, this one defines arbitrary limits in a continuum. What a
language boundary is in the real world is and will always be an open
question. But information systems simply rely on codes provided by an
authority to which they defer the tricky task of deciding about it.

So, when you say your publication is written in French, yes, you refer to a
certain concept of French when using a URI based on an ISO code, and I've
no problem with that at all. When you use xml:lang="fr", what it refers to
exactly in the real complex world of languages I can't say, but all systems
using it consider it's the same, and by BCP 47 it's French as defined by
ISO 639-1.
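
As a rough illustration of how a system might go from an xml:lang value to
code-based URIs, here is a small Python sketch. The lookup table is a
hand-written, hypothetical excerpt (a real system would load the full ISO
639 tables), the function names are made up for this example, and
private-use or grandfathered tags are deliberately not handled. The two
codes for French are the ones quoted from the Lexvo.org FAQ further down in
this thread.

```python
from typing import List

# Namespace taken from the id.loc.gov URIs quoted in this thread.
LOC_ISO639_2_BASE = "http://id.loc.gov/vocabulary/iso639-2/"

# Hypothetical hand-written excerpt: ISO 639-1 code -> ISO 639-2 code(s).
# French has two ISO 639-2 codes ("fra" and "fre"), hence two URIs.
ISO639_1_TO_2 = {
    "fr": ("fra", "fre"),
    "en": ("eng",),
}


def primary_language_subtag(tag: str) -> str:
    """Return the first subtag of a language tag, lowercased."""
    # Subtags are separated by "-" and compared case-insensitively.
    return tag.split("-", 1)[0].lower()


def loc_uris_for_lang_tag(tag: str) -> List[str]:
    """Map an xml:lang value to the matching id.loc.gov ISO 639-2 URIs."""
    codes = ISO639_1_TO_2.get(primary_language_subtag(tag), ())
    return [LOC_ISO639_2_BASE + code for code in codes]


for uri in loc_uris_for_lang_tag("fr-CA"):
    print(uri)
```

Whichever URI is chosen, the system defers to the code authority for what
"French" means; the sketch only makes that deferral explicit.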

Best regards

Bernard

2012/2/17 M. Scott Marshall <mscottmarshall@gmail.com>

> Hi Bernard, Gerard, (and now Lars),
>
> Thanks for the pointers. It seems like we are better off pointing
> directly to lexvo if we want URIs that will
>
> 1) enable us to precisely and unambiguously refer to any official
> language (including, for example, Cantonese)
>
> 2) provide the name of the language in many languages (potentially
> useful for search indexes and labels in applications).
>
> However, there is a URI longevity issue whenever PURLs are not used
> (see full explanation of issues at http://sharedname.org ). Providing
> a neutral namespace that can be redirected when domain names change is
> the most effective way to create a persistent URI that won't contain
> historical artifacts when the 'name brand'-based domain name changes
> (as has been repeatedly demonstrated by history). So, ideally, an
> organization with long-term governance (not project bound) would
> maintain a namespace such as http://sharedname.org/lang/ that could be
> redirected from lexvo to future-lexvo domains/URLs.
>
> [Lars - your message came in just as I was about to press <send>. I'm
> confused by your reply. What about the problems with LOC lang ids that
> Gerard pointed out? Is that what you meant by "If only they could do
> ISO 3166 countries as well..."?]
>
> Best,
> Scott
>
> On Thu, Feb 16, 2012 at 8:21 PM, Gerard de Melo <gdemelo@mpi-inf.mpg.de>
> wrote:
> > Hi Bernard,
> >
> >
> > I think now we should forget about URIs published by pionneer projects
> such
> > as OASIS TC, lingvoj.org and lexvo.org, and stick to URIs published by
> > genuine authority Library of Congress which is as close to the primary
> > source as can be. So if you want to use a URI for Ancient Greek as
> defined
> > by ISO 639-2, please use http://id.loc.gov/vocabulary/iso639-2/grc.
> >
> > BTW Lars Marius, hello, what do you think? URIs at id.loc.gov are really
> > what we were dreaming to achieve in 2001, right?
> >
> >
> > Now of course I may be a bit biased here, but I do not believe that the
> > id.loc.gov service solves
> > all of the problems. This is from the Lexvo.org FAQ [1]:
> >
> > The advantage of using those URIs is that they are maintained by the
> Library
> > of Congress. However, there are also several issues to consider. First of
> > all, ISO 639-2 is orders of magnitude smaller than ISO 639-3 and for
> example
> > lacks an adequate code for Cantonese, which is spoken by over 60 million
> > speakers.
> > More importantly, the LOC's URIs do not describe languages per se but
> rather
> > describe code-mediated conceptualizations of languages. This implies, for
> > instance, that the French language (<http://lexvo.org/id/iso639-3/fra>)
> has
> > two different counterparts at the LOC,
> > <http://id.loc.gov/vocabulary/iso639-2/fra> and
> > <http://id.loc.gov/vocabulary/iso639-2/fre>, which each have slightly
> > different properties.
> > Finally, connecting your data to Lexvo.org's information is likely to be
> > more useful in practical applications. It offers information about the
> > languages themselves, e.g. where they are spoken, while the LOC mostly
> > provides information about the codes, e.g. when the codes were created
> and
> > updated and what kind of code they are.
> > In practice, you can also use both codes simultaneously in your data.
> > However, you need to be very careful to make sure that you are asserting
> > that a publication is written in French rather than in some concept of
> > French created on January, 1, 1970 in the United States.
> >
> >
> > Best,
> > Gerard
> >
> > [1] http://www.lexvo.org/linkeddata/faq.html
> >
> > --
> > Gerard de Melo [demelo@icsi.berkeley.edu]
> > http://www.icsi.berkeley.edu/~demelo/
>



-- 
Bernard Vatant
Vocabularies & Data Engineering
Tel :  + 33 (0)9 71 48 84 59
Skype : bernard.vatant
Linked Open Vocabularies <http://labs.mondeca.com/dataset/lov>

--------------------------------------------------------
Mondeca
3 cité Nollez 75018 Paris, France
www.mondeca.com
Follow us on Twitter : @mondecanews <http://twitter.com/#%21/mondecanews>
Received on Friday, 17 February 2012 11:11:11 UTC