Re: Entry with Multiple Part-of-Speech Values from Khadija Ait ElFqih on 2025-11-04 (public-ontolex@w3.org from November 2025)

From: Khadija Ait ElFqih <aitelfqih.khadija@gmail.com>
Date: Tue, 4 Nov 2025 13:11:18 +0100
To: Ilan Kernerman <ilan@lexicala.com>
Cc: Ana Salgado <anacastrosalgado@gmail.com>, "Passarotti Marco Carlo (marco.passarotti)" <marco.passarotti@unicatt.it>, Fahad Khan <anasfkhan81@gmail.com>, "John P. McCrae" <john.mccrae@insight-centre.org>, public-ontolex <public-ontolex@w3.org>
Message-ID: <CAAbyG7ULhBVb9V-+9KSaog0pqM85xCK9trBS-YEWF-QmdZPHNQ@mail.gmail.com>
Dear all,

From the perspective of Arabic lexical resources, *multiple
parts of speech (POS) per headword* are not an exception but a
regular phenomenon in Arabic lexicography. A single written form often
serves several grammatical roles, for instance:
– *نحو (nahw)* meaning *direction, way* (noun) and *toward, towards*
(preposition).
– *خير (khayr)* meaning *goodness, virtue* (noun) and *better, best*
(comparative adjective).

The current OntoLex model, which relies on *lexicog:Entry* linked to
multiple *ontolex:LexicalEntry* elements, can technically represent such
cases. However, in practice, this approach is *too complex* for languages
and lexicons where multi-POS phenomena are common. It requires creating and
aligning several structural and lexical components (Entry, LexicalEntry,
Form, Sense) simply to capture different POS values, which makes both data
maintenance and SPARQL querying unnecessarily heavy.

Therefore, I tend to agree that the definition of Entry is too narrow and
tied to a lexicographic structure, and that we might consider loosening
this constraint or providing a simpler representation that can handle
multiple POS directly within one entry, without needing so many nested
components.

This is particularly important for morphologically rich languages like
Arabic, where:
– Traditional dictionaries (e.g. *Lisan Al Arab*, *al-Muʿjam al-Wajīz*)
routinely group nouns, verbs, and particles under the same root
or lemma;
– The boundary between POS is sometimes fluid (e.g. *خير*, which can
function both as a noun and as a comparative adjective). Enforcing one POS
per entry risks losing meaningful semantic or historical nuances.

At the same time, we still need a degree of interoperability across
resources. A practical solution could be the development of lightweight
application guidelines or profiles for languages like Arabic, specifying:
– when POS distinctions can be merged or should be split;
– and how Arabic POS categories can map to LexInfo or Universal
Dependencies, avoiding uncontrolled proliferation of POS labels.
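
As a rough illustration of the mapping point, a profile could declare its categories once and relate them to LexInfo with SKOS mapping properties. The sketch below assumes a hypothetical `arpos:` namespace and uses the traditional three-way division into ism, fiʿl and ḥarf. The individual mapping choices are illustrative assumptions, not an agreed proposal.

```turtle
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix arpos:   <http://example.org/arabic-pos#> .   # hypothetical namespace

# ism is broader than lexinfo:noun, so it points at narrower LexInfo values
arpos:ism a skos:Concept ;
    skos:prefLabel "اسم"@ar ;
    skos:narrowMatch lexinfo:noun , lexinfo:adjective .

# fiʿl lines up closely with the LexInfo verb category
arpos:fil a skos:Concept ;
    skos:prefLabel "فعل"@ar ;
    skos:closeMatch lexinfo:verb .

# ḥarf groups several function-word categories
arpos:harf a skos:Concept ;
    skos:prefLabel "حرف"@ar ;
    skos:narrowMatch lexinfo:preposition , lexinfo:conjunction .
```

Keeping the mappings in one shared file would let individual lexicons reuse the same small set of labels instead of coining new ones.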

Below is a simple example of how such a case could currently be modeled for
*نحو (nahw)*:

*Example (RDF/Turtle):*

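# Prefix declarations added so the snippet parses; the base namespace is a placeholder.
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexicog: <http://www.w3.org/ns/lemon/lexicog#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix :        <http://example.org/lexicon/> .

# lexicog:describes links the dictionary entry to the lexical entries it covers
:entry-nahw a lexicog:Entry ;
    rdfs:label "نحو"@ar ;
    lexicog:describes :nahw-noun , :nahw-preposition .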

:nahw-noun a ontolex:LexicalEntry ;
    ontolex:canonicalForm :form-nahw ;
    lexinfo:partOfSpeech lexinfo:noun ;
    ontolex:sense :sense-direction .

:nahw-preposition a ontolex:LexicalEntry ;
    ontolex:canonicalForm :form-nahw ;
    lexinfo:partOfSpeech lexinfo:preposition ;
    ontolex:sense :sense-toward .

:form-nahw a ontolex:Form ;
    ontolex:writtenRep "نحو"@ar .

*Example explanations:*
– *نحو (nahw)* as a *noun* → *direction, way*
– *نحو (nahw)* as a *preposition* → *toward, towards*
– *خير (khayr)* as a *noun* → *goodness, virtue*
– *خير (khayr)* as a *comparative adjective* → *better, best*

The model, in principle, can represent such distinctions, but in practice
it would benefit from a simpler or more flexible interpretation of the Entry
class, especially for documentation and retro-digitization purposes, where
descriptive accuracy is as important as computational consistency.
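
For comparison, one possible reading of the "simpler representation" discussed above is a single lexical entry that carries both POS values directly. This is only a sketch of what a loosened constraint might permit, not something the current model endorses. The `:` namespace is a placeholder, and the resource names mirror the example above.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix :        <http://example.org/lexicon/> .   # placeholder namespace

# One entry, two POS values, two senses (assumed flat encoding)
:nahw a ontolex:LexicalEntry ;
    ontolex:canonicalForm :form-nahw ;
    lexinfo:partOfSpeech lexinfo:noun , lexinfo:preposition ;
    ontolex:sense :sense-direction , :sense-toward .

:form-nahw a ontolex:Form ;
    ontolex:writtenRep "نحو"@ar .
```

This uses two resources instead of four, but it no longer states which sense belongs to which POS. That is the trade-off a language-specific profile would need to address.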

Best regards,

k.,

On Tue, Nov 4, 2025 at 12:25 PM Ilan Kernerman <ilan@lexicala.com> wrote:

> Hi all,
>
>
>
> I would argue in favor of “having a single part of speech per entry”.
> Besides categorizing language components in more detail (for various
> language technology purposes), it is needed for cross-lingual purposes, as
> L2 might have different equivalents for different L1 POS.
>
>
>
> If there is no nice and easy solution that satisfies both current (and
> near-future) resources and retrodigitization, and one of them must suffer,
> IMHO our priority should be the former.
>
>
>
> Thanks,
>
> Ilan
>
>
>
>
>
> *From: *Ana Salgado <anacastrosalgado@gmail.com>
> *Date: *Tuesday, 4 November 2025 at 13:17
> *To: *Passarotti Marco Carlo (marco.passarotti) <
> marco.passarotti@unicatt.it>
> *Cc: *Fahad Khan <anasfkhan81@gmail.com>, John P. McCrae <
> john.mccrae@insight-centre.org>, public-ontolex <public-ontolex@w3.org>
> *Subject: *Re: Entry with Multiple Part-of-Speech Values
>
> Hello! I agree as well. In the Dictionary of the Lisbon Academy of
> Sciences, the answer would be positive, but when we look at microstructures
> such as those in the Dictionary of the Real Academia Española, the
> constraints become evident: https://dle.rae.es/capital?m=form
>
> Have a nice day,
>
> Ana
>
>
>
> Passarotti Marco Carlo (marco.passarotti) <marco.passarotti@unicatt.it>
> wrote (Tuesday, 4/11/2025 at 11:07):
>
> Hi all,
>
>
>
> I support the proposal of getting rid of the constraint of having a single
> PoS per entry.
>
> Very often, dictionaries do not split a lexicographic entry into separate
> components for each PoS. They just report that a certain word is
> “adv., prep.”. In LiLa we had several issues while linking retrodigitized
> dictionaries that follow such conventions for PoS.
>
>
>
> Best,
>
>
>
> Marco
>
>
>
>
>
> Prof. Marco C. Passarotti
> Computational Linguistics
> Index Thomisticus Treebank https://itreebank.marginalia.it/
> ERC Grantee, P.I. LiLa https://lila-erc.eu/ (Grant Agreement No. 769994)
> CIRCSE Research Centre
> https://centridiricerca.unicatt.it/circse_index.html
>
>
>
>
> Università Cattolica del Sacro Cuore
> Largo Gemelli, 1
> 20123 Milan, Italy
> marco.passarotti@unicatt.it
> tel. +39-02-72342380
>
>
>
> On 4 Nov 2025, at 11:53, Fahad Khan <anasfkhan81@gmail.com>
> wrote:
>
>
>
> Dear John,
> IMHO the definition of Entry is too narrow (it is tied to a lexicographic
> source) and entails quite a complex encoding with the existence and
> alignment of different structural components and lexical components just to
> capture, e.g., the case of part of speech values associated with different
> senses (think of all the overhead in the case of a lexicon where this is
> common and the difficulty of writing SPARQL queries). The question isn't
> just one of providing a solution but a good one. For instance, I think
> David's solution of language specific categories might make
> interoperability between different resources more difficult and lead to a
> profusion of PoS categories.
> From what I understand the necessity of having a single part of speech per
> entry was a necessity for certain NLP tasks, but nowadays the creation of
> lexicons for language documentation/retrodigitisation is a much more
> frequent use case in LLOD. I think it makes sense to get rid of it.
> Cheers,
> Fahad
>
>
>
> On Mon, 3 Nov 2025 at 17:16, John P. McCrae <
> john.mccrae@insight-centre.org> wrote:
>
> Hi all,
>
>
>
> As part of the OntoLex core model changes we are looking into the issues
> of multiple part-of-speech values here:
>
>
>
> https://github.com/ontolex/ontolex/issues/47
>
>
>
> In particular, this problem already appears to be solved by the use of the
> `Entry` class from `lexicog` or as David Lindemann suggests by using more
> general or language-specific categories.
>
>
>
> I was wondering whether anyone has use cases that are not
> solved by this modelling, or any other comments.
>
>
>
> Regards,
>
> John
>
>
>
> PS. I will copy/summarize replies to this email to GitHub. You may also
> post directly to GitHub.
>
>
>
>

Received on Tuesday, 4 November 2025 12:13:17 UTC