Re: One lexical entry with multiple POSes from Christian Chiarcos on 2023-07-04 (public-ontolex@w3.org from July 2023)

From: Christian Chiarcos <christian.chiarcos@gmail.com>
Date: Tue, 4 Jul 2023 11:28:28 +0200
To: Fahad Khan <fahad.khan@ilc.cnr.it>
Cc: public-ontolex <public-ontolex@w3.org>
Message-ID: <CAC1YGdjsTHw6sXQc6BwJTwNm0s7c3XXqBg7sKAebD3JzXeHeoA@mail.gmail.com>

Dear Fahad,

On Mon, 3 July 2023 at 18:51, Fahad Khan <fahad.khan@ilc.cnr.it> wrote:

> ... to get rid of the one POS per lexical entry constraint ... But since
> there is some reluctance to update the guidelines except to correct minor
> typos, this is probably not going to happen (and also if it did then that
> would remove one of the big motivations for developing lexicog in the first
> place).
>

Exactly ;)


> However IMO there is an ambiguity as to whether lexical entries are
> supposed to have *exactly* one POS or *at most* one POS. This is
> especially the case since as we discussed in a previous OntoLex call,
> affixes are also classed as lexical entries in the model and these usually
> aren't associated with POSs. So a third potential solution to your
> modelling dilemma would be indeed to assume that a lexical entry can have
> zero or one POS values, and not to associate any POSs with your lexical
> entry using lexinfo:partOfSpeech, but rather to use some other property to
> specify that the categories noun and adjective are relevant to your lexical
> entry (this solution has the benefit that you can continue using lexical
> entry with its associated axioms).
>

This is actually a feasible solution which doesn't even need lexicog and
requires no rewording at all. Just pushing the POS information into the
forms (which we can, as we do for other morphosyntactic properties from
LexInfo) indeed solves the problem. Having a lexical entry without a POS is
even valid under a strict "exactly one" requirement, because in RDF
semantics, we always operate under the open world assumption. Then,
"exactly one" just entails that there is one such property, but it doesn't
give us the exact value (and could be automatically expanded into a blank
node). And it is not even semantically incorrect, because a
language-specific POS category that just covers multiple LexInfo POSes
could be created (I'm not saying that it should be, though). This is
actually common practice in NLP, think of the "IN" tag in the Penn Treebank
which stands for prepositions or subordinating conjunctions (keep in mind
that, in English, practically every preposition can be used as a
complementizer).
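
A minimal sketch of the two points above, in plain Python with triples as
tuples (no RDF library). It is an illustration under stated assumptions, not
part of any OntoLex or LexInfo specification: the entry and form names
(:malade, :malade_form_adj, :malade_form_noun) are hypothetical, and the
known_distinct parameter stands in for whatever statements a reasoner has
that two POS values denote different things.

```python
POS = "lexinfo:partOfSpeech"

# Hypothetical data: the entry itself carries no POS; each form carries one.
triples = {
    (":malade", "rdf:type", "ontolex:LexicalEntry"),
    (":malade", "ontolex:lexicalForm", ":malade_form_adj"),
    (":malade", "ontolex:lexicalForm", ":malade_form_noun"),
    (":malade_form_adj", "ontolex:writtenRep", "malade@fr"),
    (":malade_form_adj", POS, "lexinfo:adjective"),
    (":malade_form_noun", "ontolex:writtenRep", "malade@fr"),
    (":malade_form_noun", POS, "lexinfo:noun"),
}


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def form_level_pos(graph, entry):
    """POS values stated on the forms of an entry, not on the entry itself."""
    return {
        pos
        for form in objects(graph, entry, "ontolex:lexicalForm")
        for pos in objects(graph, form, POS)
    }


def exactly_one_status(graph, entry, known_distinct):
    """Open-world reading of an 'exactly one POS' restriction on an entry.

    known_distinct is a set of frozenset pairs of values known to differ
    (an assumption of this sketch). A missing value is not a violation:
    the restriction only says that some value exists. Two stated values
    clash only if they are known to be different.
    """
    values = objects(graph, entry, POS)
    if not values:
        return "satisfiable, value unknown"
    clash = any(
        frozenset((a, b)) in known_distinct
        for a in values
        for b in values
        if a != b
    )
    return "inconsistent" if clash else "satisfiable"
```

Calling form_level_pos(triples, ":malade") collects the adjective and noun
categories from the forms, while exactly_one_status(triples, ":malade", set())
reports that the entry-level restriction is still satisfiable, because nothing
is asserted at the entry level. The restriction only becomes a problem once two
entry-level values are stated and are known to be distinct.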

This is also consistent with the requirements for lexical resources that
just come without POSes. It would not be feasible to require that every
legacy resource first needs to be POS-tagged before it can be converted.
Think of a 16th c. word list from an extinct South American language that
we don't know too much about. But even for modern languages, the expertise
might just not be available to the person in charge of conversion. For
historical languages, it might even be unclear what POS distinctions apply.
We can create a glossary for Etruscan (or the Phaistos disk, if you will),
but there is just no consensus for many (or, in the Phaistos case, any)
words wrt. their POSes.

I still feel somewhat uncomfortable with (ab)using ontolex:LexicalEntry
for this particular case, because the difference of nouns and adjectives in
French is rather well understood ;)


> PS. Given the capabilities of ChatGPT I wouldn't be so sure the task you
> refer to couldn't be automated.
>

Well, I mean in a controlled, reliable, systematic fashion ;) With training
data or fine-tuning we can get there, to some extent, of course, but any
ML-based solution will have a certain level of noise, so by automated
methods I mean only those based on the analysis of layout conventions alone
(this also has noise, because lexicographers aren't 100% consistent either,
but it's explainable and systematic noise rather than unpredictable
hallucination).

Best,
Christian

Received on Tuesday, 4 July 2023 09:28:45 UTC