Re: Entry with Multiple Part-of-Speech Values from Max Ionov on 2025-11-04 (public-ontolex@w3.org from November 2025)

From: Max Ionov <max.ionov@gmail.com>
Date: Tue, 4 Nov 2025 14:22:23 +0100
To: public-ontolex@w3.org
Message-ID: <5e613352-ab0c-4d7d-8ccc-ec635bc01fa0@gmail.com>
Dear all,

As always, this issue sparks a lively discussion.

I see three separate issues among the responses, and I think the 
discussion can be more productive if they are separated. Here are the 
issues I think are conflated (and my subjective solutions):

 1.

    (resource level) The resource joins separate meanings with separate
    word senses under one umbrella entry (RAE example from Ana). This
    problem is what |lexicog:Entry| was meant to solve: grouping
    together entries that in a language correspond to separate concepts
    and also potentially have different parts of speech. *In my mind,
    these should be different lexical entries, grouped within one
    lexicographic entry*. A strong argument for this would be having
    other dictionaries separate them, i.e. showing that this is a
    resource-specific decision.

 2.

    The part of speech in a language is not a perfect match to a part of
    speech in another language (e.g. David’s example
    <https://github.com/ontolex/ontolex/issues/47#issuecomment-3481604881>
    from Basque). This happens quite often with nominals (e.g. nouns,
    adjectives and adverbs) where it is difficult to draw a boundary
    between them, so the same lexical entry can be both classified as an
    adjective and an adverb. This could be handled by duplicating the
    entries, but I strongly believe that this is kind of a Procrustean
    bed and does not reflect linguistic reality. However, *the solution
    to this, in my mind, is to create a resource- or language-specific
    composite PoS* (like David suggested), which still works with the
    restriction.

 3.

    The entry, while being one linguistic unit, has separate parts of
    speech within its /inflectional/ paradigm (Khadija Arabic example).
    It can be argued that this can be split into different
    sub-paradigms, one per part of speech, but (a) it is not always
    simple, and (b) it does not allow one to follow a lexicographic
    and/or morphological tradition. I would argue that using |lexicog:Entry|
    here is not only too complex, but does not reflect linguistic and
    lexicographic reality. And for these cases, *I think we should be
    able to provide more than one part of speech per entry, or, better,
    not provide a part of speech for the entry, but for its forms*. I
    think connecting a PoS to a form in these cases goes some way
    towards addressing Ilan’s concern about losing detailed information
    about language components.
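
To make the three options concrete, here is a rough Turtle sketch. Everything in the `:` namespace (the entry names, the composite category `:adjectiveAdverb`, the placeholder entry `:some-entry`) is hypothetical and only for illustration. Attaching `lexinfo:partOfSpeech` to a form in case 3 is the proposal under discussion, not established practice.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexicog: <http://www.w3.org/ns/lemon/lexicog#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :        <http://example.org/lexicon#> .

# Case 1: one lexicographic entry grouping two lexical entries,
# each with exactly one part of speech (names are hypothetical).
:capital-entry a lexicog:Entry ;
    lexicog:describes :capital-adj , :capital-noun .

:capital-adj a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech lexinfo:adjective .

:capital-noun a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech lexinfo:noun .

# Case 2: a resource- or language-specific composite category.
# The entry still carries a single PoS value, so the restriction holds.
:adjectiveAdverb a lexinfo:PartOfSpeech ;
    rdfs:label "adjective/adverb"@en .

:some-entry a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech :adjectiveAdverb .

# Case 3: no PoS on the entry; each form carries its own PoS instead
# (assumption: lexinfo:partOfSpeech may be used on ontolex:Form).
:nahw a ontolex:LexicalEntry ;
    ontolex:canonicalForm :nahw-form-noun ;
    ontolex:otherForm :nahw-form-prep .

:nahw-form-noun a ontolex:Form ;
    ontolex:writtenRep "نحو"@ar ;
    lexinfo:partOfSpeech lexinfo:noun .

:nahw-form-prep a ontolex:Form ;
    ontolex:writtenRep "نحو"@ar ;
    lexinfo:partOfSpeech lexinfo:preposition .
```

In cases 1 and 2 every lexical entry keeps exactly one PoS value, so only case 3 would actually require relaxing or reinterpreting the restriction.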

As for Marco’s example, I am not sure if it fits any of the three 
cases, but I feel like this is a case of an underspecified PoS, which is 
somewhat similar to the second issue.

Best,

Max

On 4/11/25 13:11, Khadija Ait ElFqih wrote:

> Dear all,
>
> From the perspective of Arabic lexical resources, the issue of 
> /multiple parts of speech (POS) per headword/ is not an exception but 
> rather a regular phenomenon in Arabic lexicography. A single written 
> form often serves several grammatical roles, for instance:
> – *نحو (nahw)* meaning /direction, way/ (noun) and /toward, towards/ 
> (preposition).
> – *خير (khayr)* meaning /goodness, virtue/ (noun) and /better, best/ 
> (comparative adjective).
>
> The current OntoLex model, which relies on *|lexicog:Entry|* linked to 
> multiple *|ontolex:LexicalEntry|* elements, can technically represent 
> such cases. However, in practice, this approach is *too complex* for 
> languages and lexicons where multi-POS phenomena are common. It 
> requires creating and aligning several structural and lexical 
> components (Entry, LexicalEntry, Form, Sense) simply to capture 
> different POS values, which makes both data maintenance and SPARQL 
> querying unnecessarily heavy.
>
> Therefore, I tend to agree that the definition of |Entry| is too 
> narrow and tied to a lexicographic structure, and that we might 
> consider loosening this constraint or providing a simpler 
> representation that can handle multiple POS directly within one entry, 
> without needing so many nested components.
>
> This is particularly important for morphologically rich languages like 
> Arabic, where:
> – Traditional dictionaries (e.g. /Lisan Al Arab/, /al-Muʿjam al-Wajīz/, 
> etc.) routinely group nouns, verbs, and particles under the same 
> root or lemma;
> – The boundary between POS is sometimes fluid (e.g. *خير*, which can 
> function both as a noun and as a comparative adjective). Enforcing one 
> POS per entry risks losing meaningful semantic or historical nuances.
>
> At the same time, we still need a degree of interoperability across 
> resources. A practical solution could be the development of 
> lightweight application guidelines or profiles for languages like 
> Arabic, specifying:
> – when POS distinctions can be merged or should be split;
> – and how Arabic POS categories can map to LexInfo or Universal 
> Dependencies, avoiding uncontrolled proliferation of POS labels.
>
> Below is a simple example of how such a case could currently be 
> modeled for *نحو (nahw)*:
>
> *Example (RDF/Turtle):*
>
>     :entry-nahw a lexicog:Entry ;
>         rdfs:label "نحو"@ar ;
>         lexicog:contains :nahw-noun , :nahw-preposition .
>
>     :nahw-noun a ontolex:LexicalEntry ;
>         ontolex:canonicalForm :form-nahw ;
>         lexinfo:partOfSpeech lexinfo:noun ;
>         ontolex:sense :sense-direction .
>
>     :nahw-preposition a ontolex:LexicalEntry ;
>         ontolex:canonicalForm :form-nahw ;
>         lexinfo:partOfSpeech lexinfo:preposition ;
>         ontolex:sense :sense-toward .
>
>     :form-nahw a ontolex:Form ;
>         ontolex:writtenRep "نحو"@ar .
>
> *Example explanations:*
> – /نحو (nahw)/ as a *noun* → /direction, way/
> – /نحو (nahw)/ as a *preposition* → /toward, towards/
> – /خير (khayr)/ as a *noun* → /goodness, virtue/
> – /خير (khayr)/ as a *comparative adjective* → /better, best/
>
> The model, in principle, can represent such distinctions, but in 
> practice it would benefit from a simpler or more flexible 
> interpretation of the |Entry| class, especially for documentation and 
> retro-digitization purposes, where descriptive accuracy is as 
> important as computational consistency.
>
> Best regards,
>
> k.,
>
>
> On Tue, Nov 4, 2025 at 12:25 PM Ilan Kernerman <ilan@lexicala.com> wrote:
>
>     Hi all,
>
>     I would argue in favor “of having a single part of speech per
>     entry”. Besides categorizing language components in more detail
>     (for various language technology purposes), it is needed for
>     cross-lingual purposes, as L2 might have different equivalents for
>     different L1 pos.
>
>     If there is no nice and easy solution that satisfies both current
>     (and near-future) resources and retrodigitization, and one of them
>     must suffer, IMHO our priority should be the former.
>
>     Thanks,
>
>     Ilan
>
>     *From: *Ana Salgado <anacastrosalgado@gmail.com>
>     *Date: *Tuesday, 4 November 2025 at 13:17
>     *To: *Passarotti Marco Carlo (marco.passarotti)
>     <marco.passarotti@unicatt.it>
>     *Cc: *Fahad Khan <anasfkhan81@gmail.com>, John P. McCrae
>     <john.mccrae@insight-centre.org>, public-ontolex
>     <public-ontolex@w3.org>
>     *Subject: *Re: Entry with Multiple Part-of-Speech Values
>
>     Hello! I agree as well. In the Dictionary of the Lisbon Academy of
>     Sciences, the answer would be positive, but when we look at
>     microstructures such as those in the Dictionary of the Real
>     Academia Española, the constraints become evident:
>     https://dle.rae.es/capital?m=form
>
>     Have a nice day,
>
>     Ana
>
>     Passarotti Marco Carlo (marco.passarotti)
>     <marco.passarotti@unicatt.it> escreveu (terça, 4/11/2025 à(s) 11:07):
>
>         Hi all,
>
>         I support the proposal of getting rid of the constraint of
>         having a single PoS per entry.
>
>         Very often, dictionaries do not distinguish different
>         components of a lexicographic entry per single PoS. They just
>         report that a certain word is “adv., prep.”. In LiLa we had
>         several issues while linking retrodigitized dictionaries that
>         follow such habits as for PoS.
>
>         Best,
>
>         Marco
>
>         Prof. Marco C. Passarotti
>         Computational Linguistics
>         Index Thomisticus Treebank https://itreebank.marginalia.it/
>         ERC Grantee, P.I. LiLa https://lila-erc.eu/ (Grant Agreement
>         No. 769994)
>         CIRCSE Research Centre
>         https://centridiricerca.unicatt.it/circse_index.html
>
>
>         Università Cattolica del Sacro Cuore
>         Largo Gemelli, 1
>         20123 Milan, Italy
>         marco.passarotti@unicatt.it
>         tel. +39-02-72342380
>
>
>
>             Il giorno 4 nov 2025, alle ore 11:53, Fahad Khan
>             <anasfkhan81@gmail.com> ha scritto:
>
>             Dear John,
>             IMHO the definition of Entry is too narrow (it is tied to
>             a lexicographic source) and entails quite a complex
>             encoding with the existence and alignment of different
>             structural components and lexical components just to
>             capture, e.g., the case of part of speech values
>             associated with different senses (think of all the
>             overhead in the case of a lexicon where this is common and
>             the difficulty of writing SPARQL queries). The question
>             isn't just one of providing a solution but a good one. For
>             instance, I think David's solution of language specific
>             categories might make interoperability between different
>             resources more difficult and lead to a profusion of PoS
>             categories.
>             From what I understand the necessity of having a single
>             part of speech per entry was a necessity for certain NLP
>             tasks, but nowadays the creation of lexicons for language
>             documentation/retrodigitisation is a much more frequent use
>             case in LLOD. I think it makes sense to get rid of it.
>             Cheers,
>             Fahad
>
>             Il giorno lun 3 nov 2025 alle ore 17:16 John P. McCrae
>             <john.mccrae@insight-centre.org> ha scritto:
>
>                 Hi all,
>
>                 As part of the OntoLex core model changes we are
>                 looking into the issues of multiple part-of-speech
>                 values here:
>
>                 https://github.com/ontolex/ontolex/issues/47
>
>                 In particular, this problem already appears to be
>                 solved by the use of the `Entry` class from `lexicog`
>                 or as David Lindemann suggests by using more
>                 general or language-specific categories.
>
>                 I was wondering if there are any use cases that anyone
>                 has that are not solved by this modelling, or other
>                 comments.
>
>                 Regards,
>
>                 John
>
>                 PS. I will copy/summarize replies to this email to
>                 GitHub. You may also post directly to GitHub.
>
>
>
Received on Tuesday, 4 November 2025 13:22:31 UTC