Re: One lexical entry with multiple POSes from Christian Chiarcos on 2023-07-04 (public-ontolex@w3.org from July 2023)

From: Christian Chiarcos <christian.chiarcos@gmail.com>
Date: Tue, 4 Jul 2023 11:54:30 +0200
To: Ilan Kernerman <ilan@lexicala.com>
Cc: Fahad Khan <fahad.khan@ilc.cnr.it>, public-ontolex <public-ontolex@w3.org>
Message-ID: <CAC1YGdjH-wJs+tL-2HcX5OhTvNz1LbDxqMy5y9LogoG+DuPeyQ@mail.gmail.com>

Dear Ilan,

> I support keeping the one-pos-per-entry principle – which IMHO makes
> Ontolex/Lexicog more thorough, consistent, open and useful, despite such
> constraints – and seeking solutions to specific clashes, like what you
> suggest.
>

Whether we like that or not, there's no real alternative right now ;)


> ... bringing us back to asking what is the ultimate purpose of Ontolex –
> to provide automated 1-to-1 replications for (often) imperfect dictionaries
> or try to design the utmost up-to-date semantic representation of lexical
> data for actual use today?
>

I think it should cover both, because the ultimate goal may be the latter,
but, for the foreseeable future, the bulk of data is more in line with the
former.

> The macro- and micro-structure of good old dictionaries has also been
> determined by real constraints, such as their specific media and space
> limits, resulting in entries with unsystematic structures (like this,
> otherwise beautiful, one). If the media is the message, it will be
> necessary to adapt.
>
>
>
> Moreover, I doubt multi-pos-per-entry would enable “more efficient
> modelling”, and how capable existing tools are for “representing such data
> without requiring human re-interpretation”, unless the goal were only to
> mirror the original entry rather than also broaden its scope.
>

By "more efficient", I mean less redundancy, less verbosity and more
compact documentation (fewer standards/modules to look into). As for the
capability of tools, I am thinking of template-based approaches that rely
on the analysis of layout, abbreviations and sequential order alone (like
XSLT, regular expressions, CFGs etc.). As for "only to mirror", I still think we
need to permit the option to let data providers do exactly that, because
that's where most data will be coming from and because it keeps the entry
barrier low (which is high enough already).


> It usually requires extra time and manual work to deal with such cases,
> but maybe it’s good that not everything can be automated, yet ;)?
> And perhaps some amount of duplication of senses is unavoidable and,
> actually, the advantages exceed the drawbacks?
>

I would agree. But we can actually have both mirrored and enriched
resources without any risks to data quality if we provide the means to
distinguish them clearly. This is why, for this particular case, I would
prefer the lexicog solution over the underspecified lexical entry option
that Fahad was proposing, because that would allow us to spot
"semistructured" entries (which this one clearly is) immediately. In an
additional step, then, "proper" lexical entries can be inferred, if POS
information is provided and different pieces of information are assigned
appropriately. (The situation is very different for languages where
expertise on POS tagging is hard to provide or not provided by default,
because then, a lexical entry cannot be expected to have a POS, in the
first place.)
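
To illustrate the "additional step" mentioned above, here is a minimal sketch in plain Python. It is not OntoLex or lexicog RDF: the dictionary structure, the sample headword and glosses, and the function name infer_lexical_entries are all assumptions made up for illustration. It only shows the idea of keeping the mirrored entry as-is and deriving one lexical entry per POS from it, while leaving pieces without POS information unassigned.

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for a mirrored "semistructured" entry:
# one headword whose components carry different (or no) POS labels.
entry = {
    "lemma": "light",
    "components": [
        {"pos": "noun", "gloss": "illumination"},
        {"pos": "adjective", "gloss": "not heavy"},
        {"pos": "noun", "gloss": "lamp"},
        {"pos": None, "gloss": "(usage note)"},
    ],
}


def infer_lexical_entries(entry):
    """Derive one lexical entry per POS; collect components lacking a POS."""
    by_pos = defaultdict(list)
    unassigned = []
    for component in entry["components"]:
        if component["pos"] is None:
            # No POS given: cannot be assigned to a "proper" lexical entry.
            unassigned.append(component["gloss"])
        else:
            by_pos[component["pos"]].append(component["gloss"])
    inferred = [
        {"lemma": entry["lemma"], "pos": pos, "senses": glosses}
        for pos, glosses in by_pos.items()
    ]
    return inferred, unassigned


inferred, unassigned = infer_lexical_entries(entry)
for lexical_entry in inferred:
    print(lexical_entry)
print("unassigned:", unassigned)
```

The original entry is left untouched, so the mirrored and the enriched view can coexist and remain distinguishable.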


> One way or the other, I would consider guidelines as a means, not an end,
> which should be open for reconsideration if they can be improved.
>

Of course, but I'm actually more on the conservative side of things here
and think we should limit ourselves to minor rewording, clarifying comments
and obvious typos. Any reconsideration comes either with the requirement to
update the existing data or with a risk to compatibility. This could be
done with major version updates, but not along the way.

Thanks a lot,
Christian


Received on Tuesday, 4 July 2023 09:54:47 UTC