- From: Jorge Gracia del Río <jogracia@unizar.es>
- Date: Tue, 4 Jul 2023 12:19:00 +0200
- To: Christian Chiarcos <christian.chiarcos@gmail.com>
- Cc: Fahad Khan <fahad.khan@ilc.cnr.it>, public-ontolex <public-ontolex@w3.org>
- Message-ID: <CAMe8T+tqW_KsZ8FDT1-C6vL2iocUtNgq4vuYEibVOFGg179HJQ@mail.gmail.com>
Dear Christian, all,

Thanks for starting this interesting discussion. From my side, I fully support Ilan's view on this. Trying to adapt the model to the restrictions and needs of every single dictionary is not feasible. One of the beauties of lemon is that it was built as a reference model for lexical data automatically processable on the Web, not bound to the restrictions of the medium and the source format.

For lexicog we tried to cover a good number of common lexicographic issues and patterns that Ontolex does not address, but we were conscious that some minor usages and some ill-defined or underspecified dictionaries would not fit and would need some extra pre-processing. I think this might be the case with the dictionary in your example.

Of course I am in favour of adapting the model and working on its evolution, but we need to be cautious and not re-interpret the model to fit every possible legacy dictionary (e.g., by moving POS from the Lexical Entry to the Form), due to the risk of hampering interoperability across the plethora of existing and future lemon-based lexical data.

In your particular case, I'd go for the lexicog solution, with one lexical entry per POS and duplicated lexical senses if needed (which actually won't be duplicates, since they will connect different things), as in the sketch below. Alternatively, curate the source data to remove the existing imprecisions.
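A minimal Turtle sketch of this option, assuming a hypothetical French headword "moderne" that the source dictionary lists as both adjective and noun (all IRIs are illustrative, not from any actual resource):

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexicog: <http://www.w3.org/ns/lemon/lexicog#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix :        <http://example.org/lexicon#> .

# One lexical entry per POS; the written form is repeated, but each
# entry keeps exactly one lexinfo:partOfSpeech.
:moderne_adj a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech lexinfo:adjective ;
    ontolex:canonicalForm [ a ontolex:Form ; ontolex:writtenRep "moderne"@fr ] ;
    ontolex:sense :moderne_adj_sense .

:moderne_n a ontolex:LexicalEntry ;
    lexinfo:partOfSpeech lexinfo:noun ;
    ontolex:canonicalForm [ a ontolex:Form ; ontolex:writtenRep "moderne"@fr ] ;
    ontolex:sense :moderne_n_sense .

# The senses are not real duplicates: each one connects a different
# entry to its own meaning.
:moderne_adj_sense a ontolex:LexicalSense .
:moderne_n_sense   a ontolex:LexicalSense .

# A single lexicog entry records that the source dictionary presents
# both readings under one headword:
:dict_moderne a lexicog:Entry ;
    lexicog:describes :moderne_adj , :moderne_n .
```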
Best,
Jorge

On Tue, 4 Jul 2023 at 11:29, Christian Chiarcos <christian.chiarcos@gmail.com> wrote:

> Dear Fahad,
>
> On Mon, 3 Jul 2023 at 18:51, Fahad Khan <fahad.khan@ilc.cnr.it> wrote:
>
>> ... to get rid of the one POS per lexical entry constraint ... But
>> since there is some reluctance to update the guidelines except to
>> correct minor typos, this is probably not going to happen (and even if
>> it did, that would remove one of the big motivations for developing
>> lexicog in the first place).
>
> Exactly ;)
>
>> However, IMO there is an ambiguity as to whether lexical entries are
>> supposed to have *exactly* one POS or *at most* one POS. This is
>> especially the case since, as we discussed in a previous OntoLex call,
>> affixes are also classed as lexical entries in the model, and these
>> usually aren't associated with POSs. So a third potential solution to
>> your modelling dilemma would indeed be to assume that a lexical entry
>> can have zero or one POS values: do not associate any POS with your
>> lexical entry using lexinfo:partOfSpeech, but rather use some other
>> property to specify that the categories noun and adjective are relevant
>> to your lexical entry (this solution has the benefit that you can
>> continue using lexical entry with its associated axioms).
>
> This is actually a feasible solution which doesn't even need lexicog and
> requires no rewording at all. Just pushing the POS information into the
> forms (which we can, as we do for other morphosyntactic properties from
> LexInfo) indeed solves the problem (sketched at the end of this message).
> Having a lexical entry without a POS is even valid under a strict
> "exactly one" requirement, because in RDF semantics we always operate
> under the open-world assumption. "Exactly one" then just entails that
> there is one such property; it doesn't give us the exact value (and it
> could be automatically expanded into a blank node). And it is not even
> semantically incorrect, because a language-specific POS category that
> covers multiple LexInfo POSes could be created (I'm not saying that it
> should be, though). This is actually common practice in NLP; think of
> the "IN" tag in the Penn Treebank, which stands for prepositions or
> subordinating conjunctions (keep in mind that, in English, practically
> every preposition can be used as a complementizer).
>
> This is also consistent with the requirements for lexical resources that
> simply come without POSes. It would not be feasible to require that
> every legacy resource first be POS-tagged before it can be converted.
> Think of a 16th c. word list from an extinct South American language
> that we don't know much about. But even for modern languages, the
> expertise might just not be available to the person in charge of the
> conversion. For historical languages, it might even be unclear which POS
> distinctions apply. We can create a glossary for Etruscan (or the
> Phaistos disk, if you will), but there is just no consensus for many
> (or, in the Phaistos case, any) words with regard to their POSes.
>
> I still feel somewhat uncomfortable with (ab)using ontolex:LexicalEntry
> for this particular case, though, because the difference between nouns
> and adjectives in French is rather well understood ;)
>
>> PS. Given the capabilities of ChatGPT, I wouldn't be so sure the task
>> you refer to couldn't be automated.
>
> Well, I mean in a controlled, reliable, systematic fashion ;) With
> training data or fine-tuning we can get there to some extent, of course,
> but any ML-based solution will have a certain level of noise, so by
> automated methods I mean only those based on the analysis of layout
> conventions alone (this also has noise, because lexicographers aren't
> 100% consistent either, but it is explainable and systematic noise
> rather than unpredictable hallucination).
>
> Best,
> Christian
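For comparison, a minimal sketch of the forms-based alternative from the quoted reply, under the same illustrative assumptions. Note that attaching lexinfo:partOfSpeech to a Form is exactly the reinterpretation under debate here, and the :relevantCategory property in the final comment is purely hypothetical:

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/3.0/lexinfo#> .
@prefix :        <http://example.org/lexicon#> .

# A single entry with no POS of its own. Under the open-world
# assumption this remains consistent even with an "exactly one"
# reading: the value is merely unstated, not absent.
:moderne a ontolex:LexicalEntry ;
    ontolex:canonicalForm :moderne_form .

# The category information moves to the form, alongside other
# morphosyntactic properties:
:moderne_form a ontolex:Form ;
    ontolex:writtenRep "moderne"@fr ;
    lexinfo:partOfSpeech lexinfo:adjective , lexinfo:noun .

# Fahad's variant keeps the entry POS-free and uses some other
# property (hypothetical here) for the relevant categories:
# :moderne :relevantCategory lexinfo:adjective , lexinfo:noun .
```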
Received on Tuesday, 4 July 2023 10:19:19 UTC