- From: Max Ionov <max.ionov@gmail.com>
- Date: Tue, 4 Nov 2025 14:22:23 +0100
- To: public-ontolex@w3.org
- Message-ID: <5e613352-ab0c-4d7d-8ccc-ec635bc01fa0@gmail.com>
Dear all,
As always, this issue sparks a lively discussion.
I see three separate issues among the responses, and I think the
discussion can be more productive if they are separated. Here are the
issues I think are conflated (and my subjective solutions):
1. (resource level) The resource joins separate meanings with separate
   word senses under one umbrella entry (RAE example from Ana). This
   problem is what |lexicog:Entry| was meant to solve: grouping
   together entries that in a language correspond to separate concepts
   and also potentially have different parts of speech. *In my mind,
   these should be different lexical entries, grouped within one
   lexicographic entry*. A strong argument for this would be that
   other dictionaries separate them, which would show that this is a
   resource-specific decision.
2. A part of speech in one language does not perfectly match a part of
   speech in another language (e.g. David’s example
   <https://github.com/ontolex/ontolex/issues/47#issuecomment-3481604881>
   from Basque). This happens quite often with nominals (e.g. nouns,
   adjectives and adverbs), where it is difficult to draw a boundary
   between them, so the same lexical entry can be classified as both
   an adjective and an adverb. This could be handled by duplicating
   the entries, but I strongly believe that this is something of a
   Procrustean bed and does not reflect linguistic reality. However,
   *the solution to this, in my mind, is to create a resource- or
   language-specific composite PoS* (like David suggested), which
   still works with the restriction.
3. The entry, while being one linguistic unit, has separate parts of
   speech within its /inflectional/ paradigm (Khadija’s Arabic
   example). It can be argued that this can be split into different
   sub-paradigms, one per part of speech, but (a) it is not always
   simple, and (b) it does not allow one to follow a lexicographic
   and/or morphological tradition. I would argue that using
   |lexicog:Entry| here is not only too complex but also does not
   reflect linguistic and lexicographic reality. For these cases, *I
   think we should be able to provide more than one part of speech per
   entry, or, better, to provide a part of speech not for the entry
   but for its forms*. I think connecting a PoS to a form in these
   cases somewhat addresses Ilan’s concern about losing detailed
   information about language components.
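To make solutions 2 and 3 more concrete, here is a rough Turtle sketch, not a worked-out proposal. The `:` namespace, `:adjectiveAdverb`, `:some-entry` and the `:khayr-*` names are made up for illustration. The use of `rdfs:seeAlso` to point back to LexInfo is only one possible choice, and whether `lexinfo:partOfSpeech` may be attached to an `ontolex:Form` is precisely the question under discussion.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :        <http://example.org/lexicon#> .   # hypothetical namespace

# --- Solution 2: a resource- or language-specific composite PoS ---
# :adjectiveAdverb is a hypothetical category defined by the resource itself.
:adjectiveAdverb a lexinfo:PartOfSpeech ;
    rdfs:label "adjective/adverb (resource-specific)"@en ;
    # Assumption: pointers to the standard categories, kept for interoperability.
    rdfs:seeAlso lexinfo:adjective , lexinfo:adverb .

# The entry still carries exactly one PoS value, so the restriction holds.
:some-entry a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech :adjectiveAdverb .

# --- Solution 3: PoS stated on the forms, not on the entry ---
# One lexical entry, no entry-level PoS; each form carries its own.
:khayr a ontolex:LexicalEntry ;
    ontolex:canonicalForm :khayr-form-noun ;
    ontolex:otherForm :khayr-form-adjective .

:khayr-form-noun a ontolex:Form ;
    ontolex:writtenRep "خير"@ar ;
    lexinfo:partOfSpeech lexinfo:noun .

:khayr-form-adjective a ontolex:Form ;
    ontolex:writtenRep "خير"@ar ;
    lexinfo:partOfSpeech lexinfo:adjective .
```

In the second half of the sketch a query for PoS has to go through the forms instead of the entry, which is the price for keeping the entry as a single linguistic unit.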
As for Marco’s example, I am not sure if it fits any of the three
cases, but I feel like this is a case of an underspecified PoS, which is
somewhat similar to the second issue.
Best,
Max
On 4/11/25 13:11, Khadija Ait ElFqih wrote:
> Dear all,
>
> From the perspective of Arabic lexical resources, the issue of
> /multiple parts of speech (POS) per headword/ is not an exception but
> rather a regular phenomenon in Arabic lexicography. A single written
> form often serves several grammatical roles, for instance:
> – *نحو (nahw)* meaning /direction, way/ (noun) and /toward, towards/
> (preposition).
> – *خير (khayr)* meaning /goodness, virtue/ (noun) and /better, best/
> (comparative adjective).
>
> The current OntoLex model, which relies on *|lexicog:Entry|* linked to
> multiple *|ontolex:LexicalEntry|* elements, can technically represent
> such cases. However, in practice, this approach is *too complex* for
> languages and lexicons where multi-POS phenomena are common. It
> requires creating and aligning several structural and lexical
> components (Entry, LexicalEntry, Form, Sense) simply to capture
> different POS values, which makes both data maintenance and SPARQL
> querying unnecessarily heavy.
>
> Therefore, I tend to agree that the definition of |Entry| is too
> narrow and tied to a lexicographic structure, and that we might
> consider loosening this constraint or providing a simpler
> representation that can handle multiple POS directly within one entry,
> without needing so many nested components.
>
> This is particularly important for morphologically rich languages like
> Arabic, where:
> – Traditional dictionaries (e.g. /Lisan Al Arab/, /al-Muʿjam al-Wajīz/,
> etc.) routinely group nouns, verbs, and particles under the same
> root or lemma;
> – The boundary between POS is sometimes fluid (e.g. *خير*, which can
> function both as a noun and as a comparative adjective). Enforcing one
> POS per entry risks losing meaningful semantic or historical nuances.
>
> At the same time, we still need a degree of interoperability across
> resources. A practical solution could be the development of
> lightweight application guidelines or profiles for languages like
> Arabic, specifying:
> – when POS distinctions can be merged or should be split;
> – and how Arabic POS categories can map to LexInfo or Universal
> Dependencies, avoiding uncontrolled proliferation of POS labels.
>
> Below is a simple example of how such a case could currently be
> modeled for *نحو (nahw)*:
>
> *Example (RDF/Turtle):*
>
>     :entry-nahw a lexicog:Entry ;
>         rdfs:label "نحو"@ar ;
>         lexicog:contains :nahw-noun , :nahw-preposition .
>
>     :nahw-noun a ontolex:LexicalEntry ;
>         ontolex:canonicalForm :form-nahw ;
>         lexinfo:partOfSpeech lexinfo:noun ;
>         ontolex:sense :sense-direction .
>
>     :nahw-preposition a ontolex:LexicalEntry ;
>         ontolex:canonicalForm :form-nahw ;
>         lexinfo:partOfSpeech lexinfo:preposition ;
>         ontolex:sense :sense-toward .
>
>     :form-nahw a ontolex:Form ;
>         ontolex:writtenRep "نحو"@ar .
>
> *Example explanations:*
> – /نحو (nahw)/ as a *noun* → /direction, way/
> – /نحو (nahw)/ as a *preposition* → /toward, towards/
> – /خير (khayr)/ as a *noun* → /goodness, virtue/
> – /خير (khayr)/ as a *comparative adjective* → /better, best/
>
> The model, in principle, can represent such distinctions, but in
> practice it would benefit from a simpler or more flexible
> interpretation of the |Entry| class, especially for documentation and
> retro-digitization purposes, where descriptive accuracy is as
> important as computational consistency.
>
> Best regards,
>
> k.,
>
>
> On Tue, Nov 4, 2025 at 12:25 PM Ilan Kernerman <ilan@lexicala.com> wrote:
>
> Hi all,
>
> I would argue in favor “of having a single part of speech per
> entry”. Besides categorizing language components in more detail
> (for various language technology purposes), it is needed for
> cross-lingual purposes, as L2 might have different equivalents for
> different L1 PoS.
>
> If there is no nice and easy solution that satisfies both current
> (and near-future) resources and retrodigitization, and one of them
> must suffer, IMHO our priority should be the former.
>
> Thanks,
>
> Ilan
>
> *From: *Ana Salgado <anacastrosalgado@gmail.com>
> *Date: *Tuesday, 4 November 2025 at 13:17
> *To: *Passarotti Marco Carlo (marco.passarotti)
> <marco.passarotti@unicatt.it>
> *Cc: *Fahad Khan <anasfkhan81@gmail.com>, John P. McCrae
> <john.mccrae@insight-centre.org>, public-ontolex
> <public-ontolex@w3.org>
> *Subject: *Re: Entry with Multiple Part-of-Speech Values
>
> Hello! I agree as well. In the Dictionary of the Lisbon Academy of
> Sciences, the answer would be positive, but when we look at
> microstructures such as those in the Dictionary of the Real
> Academia Española, the constraints become evident:
> https://dle.rae.es/capital?m=form
>
> Have a nice day,
>
> Ana
>
> Passarotti Marco Carlo (marco.passarotti)
> <marco.passarotti@unicatt.it> wrote (Tuesday, 4/11/2025 at 11:07):
>
> Hi all,
>
> I support the proposal of getting rid of the constraint of
> having a single PoS per entry.
>
> Very often, dictionaries do not distinguish different
> components of a lexicographic entry per single PoS. They just
> report that a certain word is “adv., prep.”. In LiLa we had
> several issues while linking retrodigitized dictionaries that
> follow such conventions for PoS.
>
> Best,
>
> Marco
>
> Prof. Marco C. Passarotti
> Computational Linguistics
> Index Thomisticus Treebank https://itreebank.marginalia.it/
> ERC Grantee, P.I. LiLa https://lila-erc.eu/ (Grant Agreement
> No. 769994)
> CIRCSE Research Centre
> https://centridiricerca.unicatt.it/circse_index.html
>
>
> Università Cattolica del Sacro Cuore
> Largo Gemelli, 1
> 20123 Milan, Italy
> marco.passarotti@unicatt.it
> tel. +39-02-72342380
>
>
>
> On 4 Nov 2025, at 11:53, Fahad Khan
> <anasfkhan81@gmail.com> wrote:
>
> Dear John,
> IMHO the definition of Entry is too narrow (it is tied to
> a lexicographic source) and entails quite a complex
> encoding with the existence and alignment of different
> structural components and lexical components just to
> capture, e.g., the case of part of speech values
> associated with different senses (think of all the
> overhead in the case of a lexicon where this is common and
> the difficulty of writing SPARQL queries). The question
> isn't just one of providing a solution but a good one. For
> instance, I think David's solution of language specific
> categories might make interoperability between different
> resources more difficult and lead to a profusion of PoS
> categories.
> From what I understand, having a single part of speech per
> entry was a necessity for certain NLP tasks, but nowadays
> the creation of lexicons for language
> documentation/retrodigitisation is a much more frequent use
> case in LLOD. I think it makes sense to get rid of it.
> Cheers,
> Fahad
>
> On Mon, 3 Nov 2025 at 17:16, John P. McCrae
> <john.mccrae@insight-centre.org> wrote:
>
> Hi all,
>
> As part of the OntoLex core model changes we are
> looking into the issues of multiple part-of-speech
> values here:
>
> https://github.com/ontolex/ontolex/issues/47
>
> In particular, this problem already appears to be
> solved by the use of the `Entry` class from `lexicog`
> or as David Lindemann suggests by using more
> general or language-specific categories.
>
> I was wondering if there are any use cases that anyone
> has that are not solved by this modelling, or any other
> comments.
>
> Regards,
>
> John
>
> PS. I will copy/summarize replies to this email to
> GitHub. You may also post directly to GitHub.
>
>
>
Received on Tuesday, 4 November 2025 13:22:31 UTC