Re: Consolidating LOD vocabularies for linguistic annotations from Christian Chiarcos on 2020-10-28 (public-ld4lt@w3.org from October 2020)

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Wed, 28 Oct 2020 10:51:40 +0100
To: "Linked Data for Language Technology Community Group" <public-ld4lt@w3.org>
Message-ID: <op.0s6y0esxbr5td5@kitaba>
Dear all,

after an extended summer break, it is time to resume the LD4LT annotation
telcos. I created a Doodle at
https://doodle.com/poll/2bvb78z42tpsa5fm. A new Doodle is necessary
because the original time slot, Thu 10-11 CE(S)T, risks clashing
with Nexus Linguarum telcos (see last point below).

Major developments in and after the July telco:

- After we spent much of the last two telcos discussing the relation
between W3C and ISO, and between W3C specifications and ISO drafts, it
became clear that any public discussion of drafts or other internal  
documentation of ISO specifications is discouraged by ISO and its national  
partner organizations. Moreover, it does not seem to be possible to enter  
a formal relationship between W3C CGs and ISO (for legal reasons, not for  
scientific ones) to arrange an official exchange of ideas. In other words,  
the extent to which any public discussion on the development of community  
conventions for linguistic annotations on the web can include information  
from/about ISO standards is limited to publicly available information  
(basically, scientific publications) that describe the respective  
standards or their underlying concepts. Regardless of whether these
publications are fully identical to the eventual ISO standard, drawing on
them is necessary to benefit from the discussions and expertise that have
gone into these specifications, as we clearly do not want to re-invent the
wheel, but to
contribute to a broadly applicable and inclusive Linked-Data-based  
ecosystem for language technology and language sciences on the web. One  
current problem of the ISO standards is that they do not organically  
translate into Linked-Data-compliant specifications, and this seems
unlikely to improve. An alternative would be to move the entire
discussion to ISO, but I would strongly prefer an open and transparent  
discussion process without any formal entry barriers to interested  
contributors. A W3C CG provides that; ISO doesn't.

- As for ISO-related papers, these may or may not reflect the current  
state of the standard or its published form. It is still safe to collect  
open access (!) versions of relevant scientific papers published on the  
topics under  
https://github.com/ld4lt/linguistic-annotation/tree/master/doc/iso.  
Previously, I had created a private repository with the intent to collect
proprietary publications and share them in accordance with the exceptions  
to (German) copyright law for the sake of scientific research/education,  
but it seems that sharing full publications is no longer compliant with  
the latest revision of German copyright law. If we want to have such a  
repository, somebody from a country with a more liberal copyright policy  
should create and maintain that repository. A candidate would be the US,  
where this would basically be fair use.

- As for any W3C CG, the mid-term goal of our discussions is to provide a  
community report, which could be, for example, (1) a survey or (2) a  
specification that brings together NIF, Web Annotation, *published* ISO  
standards, etc. In my personal opinion, we should do *both*: a survey on  
their respective features (and we -- mostly Milan Dojchinovski and myself  
-- have begun with that, see  
https://github.com/ld4lt/linguistic-annotation/blob/master/survey/required-features.md),  
and then work towards a vocabulary. This vocabulary could then be input  
for subsequent formal standardization, either through W3C, ISO or both.  
So, there is a possible relation to ISO, and having some ties with ISO
remains relevant, but unless there is a way to share ISO-internal
information in public (and as far as I can see, there isn't, at least not
at the community level [at the level of individual cooperation, that's
different]), this will have to be largely unidirectional, with ISO taking
potential input from us. The only way I can see direct input from ISO is  
if people involved in ISO standardization point us to their most relevant  
publications on the topics.

- As many of you know, the COST Action "Nexus Linguarum. European network
for Web-centred linguistic data science" (CA 18209,
https://nexuslinguarum.eu/) is a European network of experts on
linguistic linked data and related topics. Since its establishment in
October 2019, it has largely focused on internal consolidation and the  
formulation of specific tasks and use cases. While that process is still
going on, much progress was demonstrated in the plenary meeting held over
the last two days. One of the tasks centers on modelling linguistic data,
with a sub-topic on linguistic annotations, which formally took up work in
September 2020. As many LD4LT members are also active in Nexus, I would
suggest collaborating with this Nexus task on the creation of the survey
of features of existing (community) standards of linguistic annotation.

Best regards,
Christian
Received on Wednesday, 28 October 2020 09:52:14 UTC